SlowFast-TCN: A Deep Learning Approach for Visual Speech Recognition

Nicole Yah Yie Ha, Lee-Yeng Ong, Meng-Chew Leow

Abstract


Visual Speech Recognition (VSR), commonly referred to as automated lip-reading, is an emerging technology that interprets speech by visually analyzing lip movements. A key challenge in VSR, known as the homopheme problem, arises when visually distinct words produce similar lip movements. Visemes are the basic visual units of speech produced by lip movements and positions. Furthermore, visemes typically have shorter durations than words, so there is less temporal information for distinguishing between viseme classes, leading to increased visual ambiguity during classification. To address this challenge, viseme classification must not only extract spatial features from lip images but also handle visemes of varying durations and their temporal features. Therefore, this study proposes a new deep learning approach, SlowFast-TCN. A SlowFast network is used as the frontend architecture to extract spatio-temporal features through its slow and fast pathways. A Temporal Convolutional Network (TCN) is used as the backend architecture to learn from the frontend features and perform the classification. A comparative ablation analysis dissecting each component of the proposed SlowFast-TCN is performed to evaluate the impact of each component. This study utilizes a benchmark English-language dataset, Lip Reading in the Wild (LRW). Two subsets of the LRW dataset, comprising homopheme words and unique words, represent the homophemic and non-homophemic datasets, respectively. The proposed approach is evaluated under varying lighting conditions to assess its performance in real-world scenarios; it was found that illumination can significantly affect the visual data. Key performance metrics, such as accuracy and loss, are used to evaluate the effectiveness of the proposed approach. The proposed approach outperforms traditional baseline models in accuracy while maintaining competitive execution time. Its dual-pathway architecture effectively captures both long-term dependencies and short-term motions, leading to better performance on both the homophemic and non-homophemic datasets. However, it is less robust under non-ideal lighting, indicating the need for further enhancements to handle diverse lighting scenarios.
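
The dual-pathway idea in the abstract can be made concrete in code. The snippet below is a minimal PyTorch sketch of a SlowFast frontend feeding a TCN backend; it illustrates the concept only and is not the authors' implementation. The channel widths, the slow-path frame stride `alpha`, the concatenation-based fusion, and the 88x88 grayscale mouth crops over 29 frames (a common LRW preprocessing choice) are all illustrative assumptions.

```python
# Minimal SlowFast-TCN sketch. Layer sizes, strides, and the fusion
# scheme are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class Pathway(nn.Module):
    """3D-conv stem over a clip; pools space, keeps the time axis."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(3, 5, 5),
                      stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep T, pool H and W
        )

    def forward(self, x):                  # x: (B, C, T, H, W)
        return self.net(x).flatten(2)      # -> (B, out_ch, T)


class TemporalBlock(nn.Module):
    """One TCN block: dilated 1D convolution with a residual connection."""
    def __init__(self, ch: int, dilation: int):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm1d(ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                  # length-preserving
        return self.relu(x + self.bn(self.conv(x)))


class SlowFastTCN(nn.Module):
    def __init__(self, num_classes: int, alpha: int = 4):
        super().__init__()
        self.alpha = alpha                  # slow-path frame stride (assumed)
        self.slow = Pathway(1, 64)          # fewer frames, more channels
        self.fast = Pathway(1, 8)           # all frames, fewer channels
        self.tcn = nn.Sequential(*[TemporalBlock(72, d) for d in (1, 2, 4)])
        self.head = nn.Linear(72, num_classes)

    def forward(self, x):                   # x: (B, 1, T, H, W) grayscale lips
        fast = self.fast(x)                           # (B, 8, T)
        slow = self.slow(x[:, :, ::self.alpha])       # (B, 64, T // alpha)
        # Align the slow features to the fast time axis, then fuse by concat.
        slow = nn.functional.interpolate(slow, size=fast.shape[-1])
        feats = self.tcn(torch.cat([slow, fast], dim=1))
        return self.head(feats.mean(dim=-1))          # average over time -> logits


# e.g. a batch of 29-frame 88x88 grayscale mouth crops, 500 LRW word classes
logits = SlowFastTCN(num_classes=500)(torch.randn(2, 1, 29, 88, 88))
print(logits.shape)  # torch.Size([2, 500])
```

In this sketch, the slow pathway sees every alpha-th frame with many channels (long-term context), the fast pathway sees every frame with few channels (short-term motion), and the dilated temporal convolutions of the TCN aggregate the fused feature sequence before the final word classifier.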


DOI: 10.28991/ESJ-2024-08-06-024



Keywords


Visual Speech Recognition; Temporal Convolutional Network; Lip Reading in the Wild; SlowFast Network; Homophemes.


