Human Action Recognition in Videos using Convolution Long Short-Term Memory Network with Spatio-Temporal Networks

Two-stream convolutional networks play an essential role as powerful feature extractors for human action recognition in videos. Recent studies have shown the importance of two-stream Convolutional Neural Networks (CNNs) for recognizing human actions, and Recurrent Neural Networks (RNNs) combined with CNNs have achieved among the best performance in video activity recognition. Encouraged by these results, we present a two-stream network with two CNNs and a Convolutional Long Short-Term Memory (CLSTM). First, we extract Spatio-temporal features with two CNNs initialized from pre-trained ImageNet models. Second, the outputs of the two CNNs are fused and fed as input to the CLSTM to obtain the overall classification score. We also explore the performance of various fusion functions that combine the two CNNs and the effect of fusing feature maps at different layers, and we identify the best fusion function together with the best layer. To avoid overfitting, we adopt data augmentation techniques. Our proposed model demonstrates a substantial improvement over current two-stream methods on the benchmark datasets, reaching 70.4% on HMDB-51 and 95.4% on UCF-101 using pre-trained ImageNet models.


1-Introduction
Human Action Recognition (HAR) in videos has received tremendous attention in the pattern recognition and computer vision research community because of its broad spectrum of applications, such as video monitoring, video retrieval, human-computer interaction, and medical applications. Compared to still-image recognition, video action recognition is difficult because videos contain temporal correlations between frames; this temporal data is additional information that must be analyzed to identify the action in a video. At the same time, the task demands more computation, because each video contains hundreds of frames. In recent times, deeper CNNs have shown a steep performance increase in video activity recognition. Driven by this rapid growth in the performance of deep CNN models, the computer vision research community has expanded the application of CNNs to human action recognition [1,2].
Progress on action classification in videos has been comparatively slow compared to action classification in still images, for two reasons. First, existing video activity recognition datasets are small in size and diversity compared to still-image recognition datasets; models trained on such small datasets overfit and do not yield a generalized solution for action classification, and it is hard and challenging to build larger video datasets and train deep networks on them. Second, the video classification task requires more complex data analysis because it involves additional information, namely temporal data. Recently, many researchers have addressed these complex action classification problems in videos. Karpathy et al. [1] analyzed several CNN models for video activity classification on the SPORTS-1M dataset and reported their performance.
Simonyan et al. [3] first developed a two-stream CNN architecture, which used two CNNs for recognizing human activities in videos. Even though the authors did not achieve great performance compared to hand-crafted solutions, they laid a path along which further research showed a considerable performance increase. In the two-stream methodology, feature maps extracted from the RGB image contain spatial information, and those extracted from optical flow images provide an additional cue, namely temporal information. The final prediction is calculated by fusing the results of the two CNNs. Many researchers have since explored two-stream network architectures and demonstrated good performance. However, these architectures do not fully exploit Spatio-temporal dynamics. To solve this, researchers extended the architecture by combining a CNN with a Recurrent Neural Network [3][4][5]. In recent two-stream architectures with CNN and RNN models, the CNN output vectors are fed as input to the RNN. Because the RNN consumes the CNN output, the three-dimensional feature maps are converted into one-dimensional feature vectors [6]. This decreases the number of parameters compared to previous work, but it also diminishes the spatial information. Xingjian et al. [7] extended the Long Short-Term Memory (LSTM) to three dimensions and proposed the CLSTM, which showed better performance. We further extend this method with a different architecture: we train a two-stream model end-to-end, fuse its outputs, and feed them as input to a CLSTM.
We propose a two-stream architecture for action recognition that combines CNNs and an RNN, as shown in Figure 1. First, we train and fine-tune the spatial-stream and temporal-stream networks, with an RGB image and optical flow frames as inputs, starting from a pre-trained ImageNet model. Second, we fuse the two streams' outputs, each of dimension 7*7*2048. Finally, to learn long-term temporal dependencies, the resulting output features are given as input to the CLSTM. The rest of the article is organized as follows: Section 2 reviews related work, Section 3 describes the technical approach, and Section 4 presents implementation details and a comparison with state-of-the-art results.

2-Related Works
Hand-crafted methods for activity recognition in videos, such as space-time interest points [8], 3D-SIFT [9], and dense trajectories [10], have shown good performance, but their solutions are not generalized. Deep learning techniques for video activity recognition demand more computational cost than hand-crafted techniques, but their solutions generalize better and have been shown to improve accuracy. Moreover, with decreasing hardware cost, training deep networks has become easier and produces better results. With these generalized solutions, deep CNNs have achieved tremendous progress in the task of action recognition and are increasingly replacing hand-crafted methods. Deep CNNs also perform well in motion recognition tasks using RGB and optical flow frames extracted from videos; optical flow frames act as one of the hidden cues for attaining high-performance accuracy.
Two-stream architectures utilize two convolutional models: one CNN to extract spatial data and another to extract temporal data from videos. Feichtenhofer et al. [11] introduced a new two-stream CNN model and investigated various methods to fuse the two streams' results; they showed that fusing the outputs of the networks spatially at the final convolutional layers increases accuracy. Simonyan et al. [3] introduced a two-stream architecture for video activity classification, where RGB images are given as input to the spatial-stream network and optical flow frames to the temporal-stream network; final action classification scores are calculated by combining the outputs of the two networks. Wang et al. [12] introduced the Trajectory-pooled Deep-convolutional Descriptor (TDD), in which deep CNN features are unified with trajectory features; this method shows a significant improvement in performance by combining deep network features and shallow local features. Wang et al. [5] presented Temporal Segment Networks and showed an increase in accuracy by training on the complete video with a long-range temporal model. Feichtenhofer et al. [13] proposed ST-ResNet, in which the spatial and temporal network models are trained jointly, improving the correlation across the two streams. Ji et al. [14] presented an end-to-end learning architecture that performs pixel-level activity classification and segmentation; with this, the authors addressed video activity recognition with a two-stream CNN architecture that aggregates temporal information. Carreira et al. [15] proposed I3D networks, which achieve the best performance and accuracy using three-dimensional convolutional neural networks pre-trained on Kinetics and ImageNet. Tran et al. [2] proposed a new three-dimensional CNN called C3D, which works with three-dimensional convolutional kernels.
Combining a CNN with an RNN is another approach to action recognition in videos. An RNN is capable of encoding the present state and retrieving temporal information. Among the different RNNs, the LSTM performs better at capturing long-range temporal information and distinguishing videos with intra-class variation [16,17]. Ma et al. [4] adopted the temporal-segment method [5] to extract long-range temporal dependencies, feeding the output to LSTM layers, and achieved a performance improvement. Xingjian et al. [7] introduced the convolutional LSTM, replacing the fully connected gates with convolutional gates to extract spatiotemporal information effectively, and applied it to radar images with better results. Later, many researchers [18][19][20] applied the convolutional LSTM and demonstrated it as a good choice. Based on this literature survey, the convolutional LSTM is used as one of the models in our proposed approach. Figure 2 shows the basic framework of our proposed model. The proposed methodology consists of a pre-trained CNN model, data pre-processing for the temporal-stream network, and the CLSTM.

3-1-Two-Stream CNN
A video is a sequence of frames, which contains both spatial information and motion information. The human visual system processes perceived information with spatial and temporal streams, which are two separate systems in terms of receiving and processing data. The spatial stream processes only static information; that is, static image appearance and objects are identified. The temporal stream identifies the motion of objects using information across frames; that is, only the objects' motion is calculated. Simonyan et al. [3] presented the two-stream CNN to process spatial and temporal data with two convolutional models, one for the spatial stream and another for the temporal stream. In the spatial stream, single RGB images are fed as input to the spatial-stream convolutional network to recognize still objects. To recognize motion across the sequence of frames, optical flow frames are given as input to the temporal network. The optical flow input is a stack of horizontal and vertical flow frames, and the total number of stacked frames is set to 2L = 20. The spatial and temporal streams are trained independently. The fused output of the two streams is fed as input to the convolutional long short-term memory (CLSTM), and the output of the CLSTM is passed to a soft-max layer to obtain the final classification score.
We now present human activity recognition in videos using the CLSTM with Spatio-temporal networks.
In the proposed model, the spatial and temporal networks are trained with two CNNs followed by a CLSTM, as shown in Figure 2. The purpose of this design is, first, to retain the spatial information by using an RNN at the end of the two streams and, second, to show that an RNN (the CLSTM) performs better in combination with the original two-stream architecture. Through many experiments, we observed that the accuracy of the proposed architecture is superior to the existing two-stream model.
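As an illustration only, the following sketch shows how such a pipeline could be assembled in PyTorch with two ResNet-50 backbones, sum fusion of the final 7*7*2048 feature maps, and a ConvLSTM head. The module and variable names (TwoStreamCLSTM, clstm_hidden, and the ConvLSTMCell sketched in Section 3-4) are our own assumptions rather than the released implementation.

```python
# Minimal sketch of the proposed two-stream + ConvLSTM pipeline (assumed names,
# not the authors' released code). Requires the ConvLSTMCell sketched in Sec. 3-4.
import torch
import torch.nn as nn
import torchvision.models as models


class TwoStreamCLSTM(nn.Module):
    def __init__(self, num_classes, flow_channels=20, clstm_hidden=300):
        super().__init__()
        # Spatial stream: ImageNet pre-trained ResNet-50 on single RGB frames
        # (recent torchvision weights API).
        spatial = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.spatial = nn.Sequential(*list(spatial.children())[:-2])  # keep 7x7x2048 maps

        # Temporal stream: same backbone, first conv adapted to 20 stacked flow frames.
        temporal = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        temporal.conv1 = nn.Conv2d(flow_channels, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        self.temporal = nn.Sequential(*list(temporal.children())[:-2])

        # ConvLSTM over the fused feature-map sequence, then GAP and a classifier.
        self.clstm = ConvLSTMCell(in_channels=2048, hidden_channels=clstm_hidden)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(clstm_hidden, num_classes)

    def forward(self, rgb_seq, flow_seq):
        # rgb_seq:  (batch, time, 3, 224, 224); flow_seq: (batch, time, 20, 224, 224)
        b, t = rgb_seq.shape[:2]
        h = c = None
        for i in range(t):
            fused = self.spatial(rgb_seq[:, i]) + self.temporal(flow_seq[:, i])  # sum fusion
            h, c = self.clstm(fused, h, c)
        logits = self.fc(self.pool(h).flatten(1))
        return logits  # softmax / cross-entropy applied by the loss
```

In this sketch the fused feature map of each sampled time step is pushed through the ConvLSTM cell, and only the final hidden state is pooled and classified.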

3-2-Residual Networks
Residual networks are adopted to train the spatial and temporal stream networks in our proposed model. Since deeper ResNets extract more discriminative features from frames, we use deep ResNets instead of shallow ones. However, as the number of layers increases, network degradation increases. To mitigate this problem, He et al. [21] introduced deeper ResNets that, rather than fitting the desired mapping H(x) directly, learn the residual mapping F(x) := H(x) − x. A residual unit is defined as x_{l+1} = f(x_l + F(x_l, W_l)), where x_l and x_{l+1} are the input and output of the l-th network layer, f is the ReLU function [22], and F(x_l; W_l) is the non-linear residual mapping parameterized by the CNN filter weights W_l. The residual block works as a shortcut link that connects one layer to any other layer in the network, breaking the traditional pattern of connecting each layer only to the preceding one. This mitigates the gradient vanishing/explosion problem by bypassing some layers' loss and transferring it directly to connected shallow layers, without increasing the number of parameters or the computational complexity. In residual networks, batch normalization (BN) [23] is performed after every convolution operation and before every activation layer; this speeds up convergence and addresses the internal covariate shift problem [21]. Finally, instead of fully connected layers, global average pooling is combined with the SoftMax layer, which effectively reduces the number of parameters. In addition, the bottleneck structure reduces the computational cost while preserving the overall performance of the network model.
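For concreteness, a minimal bottleneck residual block in the style described above could look as follows; the channel sizes and the projection shortcut are illustrative assumptions, not the exact torchvision implementation used for training.

```python
# Minimal sketch of a ResNet bottleneck block (conv -> BN -> ReLU with an identity
# shortcut), as described in Sec. 3-2; layer sizes are assumed for illustration.
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        # 1x1 reduce, 3x3 spatial, 1x1 expand: the "bottleneck" that limits complexity.
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut only when the shape changes, otherwise identity.
        self.shortcut = (nn.Identity() if stride == 1 and in_channels == out_channels
                         else nn.Sequential(
                             nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                             nn.BatchNorm2d(out_channels)))

    def forward(self, x):
        # x_{l+1} = f(x_l + F(x_l, W_l)): residual mapping plus the shortcut, then ReLU.
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))
```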

3-3-Data Pre-processing for Temporal Stream
Optical flow frames are the input to the temporal stream. The data must be pre-processed to obtain optical flow frames from the RGB images so that the temporal-stream input is trainable on pre-trained networks. Two optical flow frames are generated for every processed RGB frame pair, one for the vertical and one for the horizontal flow component. There are two popular methods for computing optical flow: Brox [24] and TV-L1 (Total Variation-L1) [25]. Ma et al. [4] demonstrated that TV-L1 performs better than Brox. We follow the same setup as in [4,15,26,27] and stack 2L = 20 optical flow frames, consisting of ten vertical and ten horizontal flow frames. Along with this, we perform horizontal and vertical transformations and re-scaling so that the final frame values lie in [0, 255], which allows the use of pre-trained networks.
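A possible pre-processing routine is sketched below. It assumes opencv-contrib-python for the TV-L1 implementation, and the clipping bound of ±20 used when mapping flow values to [0, 255] is an illustrative choice rather than a value stated in the paper.

```python
# Sketch: TV-L1 flow between consecutive frames, each horizontal/vertical component
# rescaled to [0, 255], and L = 10 flow pairs stacked into 20 channels.
import cv2
import numpy as np


def flow_to_uint8(component, bound=20.0):
    # Linearly map flow values from [-bound, bound] to [0, 255] so pre-trained
    # image networks can consume them like ordinary image channels.
    component = np.clip(component, -bound, bound)
    return np.round((component + bound) * 255.0 / (2 * bound)).astype(np.uint8)


def stacked_tvl1_flow(gray_frames, L=10):
    # gray_frames: list of (H, W) uint8 grayscale frames, length >= L + 1.
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    channels = []
    for prev, nxt in zip(gray_frames[:L], gray_frames[1:L + 1]):
        flow = tvl1.calc(prev, nxt, None)              # (H, W, 2): horizontal, vertical
        channels.append(flow_to_uint8(flow[..., 0]))   # horizontal component
        channels.append(flow_to_uint8(flow[..., 1]))   # vertical component
    return np.stack(channels, axis=0)                  # (2L, H, W) = (20, H, W)
```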

3-4-Convolutional LSTM
The convolutional LSTM is a variant of the vanilla LSTM in which the fully connected gates are replaced with convolutional gates: instead of the matrix multiplications of fully connected gates, a convolution is performed at every gate. The convolutional gates of the CLSTM are:

i_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci ∘ C_{t-1} + b_i)
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf ∘ C_{t-1} + b_f)
C_t = f_t ∘ C_{t-1} + i_t ∘ tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)
o_t = σ(W_xo * X_t + W_ho * H_{t-1} + W_co ∘ C_t + b_o)
H_t = o_t ∘ tanh(C_t)

In the above equations, i, f, and o are the CLSTM input, forget, and output gates, C and H are the cell and hidden states, σ is the sigmoid function, the W are convolutional kernels, * denotes the convolution operation, and ∘ denotes the Hadamard (element-wise) product. All the inputs, gates, cells, and hidden states are three-dimensional tensors. With these operations, Spatio-temporal relations are maintained throughout the network. We set the input and output convolutional kernels to 5*5 and 3*3, the hidden states are zero-padded, and all outputs have dimension 7*7*300. After the features are processed by the convolutional LSTM, global average pooling is applied, and the final classification is decided by the SoftMax layer. Inspired by [7], we explore the results of both single-layer and two-layer convolutional LSTMs.
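The gate equations above can be realized compactly, for example as in the following PyTorch sketch of a single ConvLSTM cell; the peephole terms of [7] are omitted for brevity, and the kernel size and hidden width are assumed defaults.

```python
# Minimal ConvLSTM cell sketch: the fully connected gates of a vanilla LSTM are
# replaced with convolutions, so hidden and cell states keep their spatial layout.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution produces all four gate pre-activations (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, h=None, c=None):
        # x: (batch, in_channels, H, W); h, c: (batch, hidden_channels, H, W) or None.
        if h is None:
            b, _, height, width = x.shape
            h = x.new_zeros(b, self.hidden_channels, height, width)
            c = x.new_zeros(b, self.hidden_channels, height, width)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # cell state update (Hadamard products)
        h = o * torch.tanh(c)           # new hidden state, still a 3-D feature map
        return h, c
```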

3-5-Fusing Techniques
In Feichtenhofer et al. (2016) [11], the authors demonstrated how to combine the features extracted by the two-stream convolutional architecture and reported the accuracy of different fusion methods (sum, max, concatenation, and conv fusion). At time t, we fuse the feature maps x_t^a and x_t^b of the two CNNs into y_t = f(x_t^a, x_t^b), where x_t^a, x_t^b ∈ ℝ^{H×W×C} and y_t ∈ ℝ^{H'×W'×C'}; here H and W denote the height and width, and C denotes the number of channels of the feature maps.

3-5-1-Max Function
y^max = f^max(x^a, x^b) computes the element-wise maximum of the two CNN feature maps at each spatial location (j, k) and channel c: y^max_{j,k,c} = max(x^a_{j,k,c}, x^b_{j,k,c}).

3-5-2-Concatenation Function
y^cat = f^cat(x^a, x^b) stacks the two feature maps at the same spatial locations (j, k) across the feature channels, giving y^cat ∈ ℝ^{H×W×2C}.

3-5-3-Conv. Function
y^conv = y^cat * f + b first stacks the two feature maps at the same spatial locations (j, k) across the feature channels, as in the concatenation function above, and then convolves the stacked data with a bank of filters f ∈ ℝ^{1×1×2C×C} and biases b ∈ ℝ^C, where C is the number of output channels and 1×1×2C is the dimension of each filter.

3-5-4-Sum Function
y^sum = f^sum(x^a, x^b) computes the element-wise sum of the two CNN feature maps at each spatial location (j, k) and feature channel c: y^sum_{j,k,c} = x^a_{j,k,c} + x^b_{j,k,c}.
Among all the fusion methods evaluated in Feichtenhofer et al. (2016) [11], conv fusion shows the best performance and max fusion the worst. In our proposed method, we adopt sum fusion to aggregate the results of the two streams, since it requires fewer computations than the other fusion functions while giving results equivalent to conv fusion. Therefore, in the proposed model, we use sum fusion and compare the fusion results at different CNN layers, as tabulated in Table 1.
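For reference, the four fusion functions compared above can be expressed as in the sketch below, applied to two feature maps of shape (batch, C, H, W); the 1*1 conv-fusion layer is an assumed instantiation of the filter f and bias b from Section 3-5-3.

```python
# Sketch of the sum, max, concatenation, and conv fusion functions for two
# feature maps xa, xb of identical shape (batch, C, H, W).
import torch
import torch.nn as nn


def sum_fusion(xa, xb):
    # y_{j,k,c} = x^a_{j,k,c} + x^b_{j,k,c}
    return xa + xb


def max_fusion(xa, xb):
    # y_{j,k,c} = max(x^a_{j,k,c}, x^b_{j,k,c})
    return torch.maximum(xa, xb)


def cat_fusion(xa, xb):
    # Stack the two maps along the channel axis: (batch, 2C, H, W).
    return torch.cat([xa, xb], dim=1)


class ConvFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 filter bank f with bias b that maps 2C channels back to C channels.
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, xa, xb):
        return self.project(cat_fusion(xa, xb))
```

Sum and max fusion add no parameters, while conv fusion adds a learnable 1*1 projection, which is one reason sum fusion is the cheaper choice adopted here.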

4-1-Datasets and Implementation Details
We conducted experiments and evaluated our proposed methodology on two large-scale video activity recognition datasets, HMDB-51 [28] and UCF-101 [29]. The HMDB-51 activity recognition dataset consists of 51 activity classes and 6,849 video clips in total, with each action category containing a minimum of 101 clips; it is a collection of video clips from sources such as YouTube and Google videos. UCF-101 consists of 101 activity classes and 13,320 videos in total; on average, each video contains 100 to 300 frames and lasts 3 to 10 seconds. Experiments are performed on these two datasets using the standard evaluation scheme with three training/testing splits, and accuracy is reported over the three splits of both datasets. The findings of our proposed method are compared with state-of-the-art methods.
We adopt several data augmentation methods to avoid CNN overfitting caused by the small size of the datasets. In the proposed architecture, we first apply random cropping with a size of 256*256, then randomly scale the cropped image to 0.75 of its size, and finally resize the resulting image to 224*224.
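One possible mapping of this augmentation pipeline onto torchvision, shown only as a sketch, is given below; the probability with which the 0.75 scaling is applied is not stated in the paper and is assumed here.

```python
# Sketch of the augmentation described above: random 256x256 crop, random rescaling
# by 0.75, then resizing to the 224x224 network input. Input is a PIL image.
import random
from torchvision import transforms
from torchvision.transforms import functional as TF


def augment(img):
    # Random 256x256 crop from the input frame.
    img = transforms.RandomCrop(256, pad_if_needed=True)(img)
    # Random scaling of the crop to 0.75 of its size (applied with probability 0.5
    # here; the paper does not state the probability).
    if random.random() < 0.5:
        img = TF.resize(img, [192, 192])   # 256 * 0.75
    # Resize to the 224x224 network input resolution.
    return TF.resize(img, [224, 224])
```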
The ADAM optimizer is adopted to train the network weights, which are initialized from an ImageNet pre-trained model [22]. For both the spatial and temporal stream CNNs, the weight decay is set to 10^-4, the batch size to 256, and the momentum to 0.9.
Initially, the learning rates of the spatial and temporal stream networks are set to 0.5*10^-7 and 0.5*10^-4, respectively. The learning rate of the spatial stream network is reduced by a factor of 10 every 15,000 iterations, and its training stops at 36,000 iterations. Similarly, the temporal stream network's learning rate is reduced by a factor of 10 at 20,000 and 32,000 iterations, and its training halts at 40,000 iterations. For the CLSTM, the initial learning rate is set to 0.5*10^-6, and we randomly shuffle the data at every iteration for all 60 epochs. The TV-L1 optical flow algorithm [25] is employed to generate optical flow frames from the videos. We use data parallelization across multiple GPUs on the PyTorch platform to accelerate training, and the associated code is posted on GitHub (https://github.com/ashoksarabu/SpatioTemporal_CLSTM).
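A sketch of how this training configuration could be set up in PyTorch is given below; the function name and the per-iteration stepping of the schedulers are assumptions for illustration, with spatial_net, temporal_net, and clstm standing in for the three trained modules.

```python
# Sketch of the optimizer and learning-rate schedule described above: Adam with
# weight decay 1e-4, beta1 (momentum) 0.9, per-stream learning rates, and step
# decay by a factor of 10. Schedulers are assumed to be stepped once per iteration.
import torch


def build_optimizers(spatial_net, temporal_net, clstm):
    opt_spatial = torch.optim.Adam(spatial_net.parameters(), lr=0.5e-7,
                                   betas=(0.9, 0.999), weight_decay=1e-4)
    opt_temporal = torch.optim.Adam(temporal_net.parameters(), lr=0.5e-4,
                                    betas=(0.9, 0.999), weight_decay=1e-4)
    opt_clstm = torch.optim.Adam(clstm.parameters(), lr=0.5e-6, weight_decay=1e-4)
    # Spatial stream: decay every 15,000 iterations; temporal: at 20,000 and 32,000.
    sch_spatial = torch.optim.lr_scheduler.StepLR(opt_spatial, step_size=15000, gamma=0.1)
    sch_temporal = torch.optim.lr_scheduler.MultiStepLR(opt_temporal,
                                                        milestones=[20000, 32000], gamma=0.1)
    return (opt_spatial, sch_spatial), (opt_temporal, sch_temporal), opt_clstm
```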

4-2-Analysis of Feature Maps Fusion of Two CNNs
The fusion of the feature maps of the two CNNs is an important step that increases the model's final performance. We use the methods described in Feichtenhofer et al. (2016) [11] (max, concatenation, conv, and sum fusion) to fuse the two feature maps, and we use the CLSTM to learn the fused features in place of fully connected layers. The feature map after every convolution layer retains the spatial structure of the frame. We identify the layers at which fusing the two CNNs best maintains the video's spatial structure, using the methods described in Section 3-5, and the results are tabulated in Table 1.
From the results in Table 1, we conclude that fusing at the last convolution layer gives the best accuracy compared to fusing at an earlier convolution layer, even though both layers have feature maps of the same dimensions; similarly, the further a convolution layer is from the fully connected layer, the less its fusion contributes to accuracy. The conv and sum fusion methods yield nearly indistinguishable performance, and sum fusion has lower computational complexity than conv fusion.

4-3-Testing
We use the same evaluation protocol as the original two-stream CNN [3]. We sample a fixed number of input frames at equal intervals for both the spatial and temporal streams; this number is set to 25 (for both RGB images and optical flow stacks). To evaluate the CNNs, for every sampled frame we crop the four corners and the center and also apply horizontal flips. Weighted averaging is used to fuse the outputs of the spatial and temporal stream CNNs, with weights of 1 and 1.5, respectively, because there is a small performance gap between the two streams compared to the original two-stream CNN architecture [3].
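The evaluation protocol can be summarized in code roughly as follows; the helper names (tencrop, video_score) and their signatures are hypothetical, and the networks are assumed to output class logits.

```python
# Sketch of the testing protocol: 25 evenly spaced frames per video, 10 crops per
# frame (4 corners + centre, plus horizontal flips), and weighted averaging of the
# spatial and temporal predictions (weights 1 and 1.5).
import torch
import torchvision.transforms.functional as TF


def tencrop(frame, size=224):
    # Returns a (10, C, size, size) tensor: 4 corners + centre and their flips.
    crops = TF.ten_crop(frame, size)
    return torch.stack([c if torch.is_tensor(c) else TF.to_tensor(c) for c in crops])


@torch.no_grad()
def video_score(spatial_net, temporal_net, rgb_frames, flow_stacks,
                w_spatial=1.0, w_temporal=1.5, num_samples=25):
    # Sample 25 frames / flow stacks at equal intervals over the video.
    idx = torch.linspace(0, len(rgb_frames) - 1, num_samples).long()
    s_scores, t_scores = [], []
    for i in idx:
        s_scores.append(spatial_net(tencrop(rgb_frames[i])).softmax(-1).mean(0))
        t_scores.append(temporal_net(tencrop(flow_stacks[i])).softmax(-1).mean(0))
    # Weighted average of the spatial and temporal predictions.
    return (w_spatial * torch.stack(s_scores).mean(0)
            + w_temporal * torch.stack(t_scores).mean(0))
```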

4-4-Exploration Study and Comparison with State-Of-The-Art
The presented two-stream architecture is trained end-to-end on PyTorch using convolutional models pre-trained on ImageNet [22]. As described in Section 4-1, the combination of the two-stream model with the CLSTM shows a significant performance improvement. I3D [15] achieved a significant improvement over the original two-stream convolutional network by using a model pre-trained on Kinetics; using only pre-trained ImageNet models, we achieve 70.8% on HMDB-51 and 95.4% on UCF-101 compared to the latest deep learning architectures. Moreover, our proposed model achieves better performance than the other two-stream models: we outperform TSN [5] by 2.3% on HMDB-51 and Spatio-Temporal ResNet [13] by 2.0% on UCF-101. The proposed two-stream CNN with an RNN (CLSTM) preserves the integrity of the spatial information and its correlation with the temporal information, and this spatial correlation is maintained throughout the training process. The overall comparison with state-of-the-art results is shown in Table 2.

5-Conclusion
Two-stream human action recognition for video generally uses two convolutional neural networks, one for the spatial stream and another for the temporal stream. Combining a convolutional neural network with a recurrent neural network has proven effective for video action classification. However, these methods flatten the features into one-dimensional vectors, which damages the spatio-temporal structure. The proposed architecture adds a convolutional long short-term memory to the original two-stream CNN to overcome this problem and shows a significant performance improvement; the addition of the CLSTM preserves the spatial information. We explored various fusion functions for combining the two CNNs and the appropriate layer at which to integrate the spatial and temporal features. There are some limitations to this work: the solution may not work well for long videos, and there is still room for improvement; for example, techniques such as temporal segment networks can be implemented to preserve long-range temporal dependencies in lengthy videos. In the future, we will try to implement the two-stream network with TSN and use more recent pre-training data such as Kinetics to further improve accuracy.

6-1-Data Availability Statement
The original contributions presented in the study, including code and datasets, are provided via the GitHub link in the article; further enquiries can be directed to the corresponding author (available online: https://github.com/ashoksarabu/SpatioTemporal_CLSTM).

6-2-Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

6-3-Conflicts of Interest
The author declares that there is no conflict of interest regarding the publication of this manuscript. In addition, ethical issues, including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, and redundancy, have been completely observed by the authors.