Accent Classification Across Continents: A Deep Learning Approach
This study presents a deep learning approach to accent classification across continents, aiming to enhance speech recognition systems by identifying accents from Asia, Europe, North America, Africa, and Oceania. A Convolutional Neural Network (CNN) was trained on the Mozilla Common Voice dataset using features extracted from the audio: Mel-Frequency Cepstral Coefficients (MFCCs), Delta, Delta-Delta, Chroma, and spectral features. The architecture combines multiple convolutional and dense layers with dropout and batch normalization layers to avoid overfitting during training. The model achieved 82% accuracy on the validation data. Asian and European accents were classified more accurately because their datasets were larger, whereas African and Oceanian accents were misclassified more often due to limited representation and greater linguistic diversity. In contrast to past research, which focused only on country-based accent classification, this work introduces a feature-based deep learning approach to continent-based accent classification. Recognizing this accent variation, in turn, helps improve speech recognition systems and makes applications such as voice assistants and language learning tools more inclusive of diverse linguistic patterns. Future work will concentrate on extending the dataset to all seven continents while improving classification accuracy through better feature engineering and model tuning.
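The study's extraction code is not published here, but the Delta and Delta-Delta features it names are conventionally computed with a regression over neighboring frames of the MFCC matrix (the same scheme used by common toolkits such as librosa's `delta`). The sketch below is an illustration under that assumption, not the authors' implementation; the function name `delta_features` and the toy MFCC matrix are hypothetical.

```python
import numpy as np

def delta_features(features, n=2):
    """Compute delta (dynamic) features with the standard regression
    formula over a +/-n frame window, replicating edge frames.
    `features` has shape (num_frames, num_coeffs), e.g. an MFCC matrix."""
    padded = np.pad(features, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, n + 1))  # = 10 for n=2
    out = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        # Weighted differences of frames ahead of and behind frame t.
        out[t] = sum(
            i * (padded[t + n + i] - padded[t + n - i])
            for i in range(1, n + 1)
        ) / denom
    return out

# Toy "MFCC" matrix: 5 frames x 3 coefficients, a linear ramp.
mfcc = np.arange(15, dtype=float).reshape(5, 3)
d1 = delta_features(mfcc)   # first-order dynamics (Delta)
d2 = delta_features(d1)     # second-order dynamics (Delta-Delta)
stacked = np.hstack([mfcc, d1, d2])  # 5 x 9 combined feature matrix
```

Stacking the static coefficients with their first- and second-order dynamics, as above, is the usual way such features are combined before being fed to a classifier.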
- This work (including HTML and PDF Files) is licensed under a Creative Commons Attribution 4.0 International License.