Human Action Recognition in Videos using Convolution Long Short-Term Memory Network with Spatio-Temporal Networks

Ashok Sarabu, Ajit Kumar Santra


Two-stream convolutional networks play an essential role as powerful feature extractors for human action recognition in videos. Recent studies have shown the importance of two-stream Convolutional Neural Networks (CNNs) for recognizing human actions, and Recurrent Neural Networks (RNNs) combined with CNNs have achieved the best performance in video activity recognition. Encouraged by these results, we present a two-stream network that combines two CNNs with a Convolution Long Short-Term Memory (CLSTM) network. First, we extract spatio-temporal features using two CNNs initialized from pre-trained ImageNet models. Second, the outputs of the two CNNs are fused and fed to the CLSTM to obtain the overall classification score. We also explore the performance of various fusion functions for combining the two CNNs and the effect of fusing feature maps at different layers, and we identify the best fusion function together with the best fusion layer. To avoid overfitting, we adopt data augmentation techniques. Our proposed model demonstrates a substantial improvement over current two-stream methods on the benchmark datasets, achieving 70.4% on HMDB-51 and 95.4% on UCF-101 using pre-trained ImageNet models.
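The pipeline described above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the authors' exact architecture: the two small convolutions stand in for the pre-trained ImageNet CNNs (spatial stream on RGB frames, temporal stream on stacked optical flow), sum fusion stands in for whichever fusion function performed best, and the shapes and class count are arbitrary.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A single ConvLSTM cell: LSTM gates computed with convolutions
    so the spatial layout of the feature maps is preserved."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class TwoStreamCLSTM(nn.Module):
    """Two CNN streams, per-frame fusion, ConvLSTM aggregation over time."""
    def __init__(self, hid_ch=8, n_classes=5):
        super().__init__()
        # Stand-ins for the pre-trained CNN feature extractors.
        self.spatial = nn.Conv2d(3, hid_ch, 3, padding=1)   # RGB frames
        self.temporal = nn.Conv2d(2, hid_ch, 3, padding=1)  # optical flow
        self.clstm = ConvLSTMCell(hid_ch, hid_ch)
        self.head = nn.Linear(hid_ch, n_classes)

    def forward(self, rgb, flow):
        # rgb: (B, T, 3, H, W), flow: (B, T, 2, H, W)
        B, T, _, H, W = rgb.shape
        h = rgb.new_zeros(B, self.clstm.hid_ch, H, W)
        c = torch.zeros_like(h)
        for t in range(T):
            # Sum fusion of the two streams' feature maps, one of the
            # fusion functions a two-stream network might compare.
            fused = self.spatial(rgb[:, t]) + self.temporal(flow[:, t])
            h, c = self.clstm(fused, (h, c))
        # Global-average-pool the final hidden state, then classify.
        return self.head(h.mean(dim=(2, 3)))

model = TwoStreamCLSTM()
scores = model(torch.randn(2, 4, 3, 16, 16), torch.randn(2, 4, 2, 16, 16))
print(scores.shape)  # torch.Size([2, 5]): one score vector per clip
```

In a real setting the two streams would be deep pre-trained backbones truncated at the chosen fusion layer, and the classification score would be averaged over sampled clips rather than computed from a single short sequence.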


DOI: 10.28991/esj-2021-01254

Full Text: PDF


Convolution LSTM; Action Recognition; Human Activity; Two-Stream Networks.






Copyright (c) 2021 Ashok Sarabu, Ajit Kumar Santra