SlowFast-TCN: A Deep Learning Approach for Visual Speech Recognition
DOI: 10.28991/ESJ-2024-08-06-024