Multilingual Question Answering for Malaysia History with Transformer-based Language Model

Qi Zhi Lim, Chin Poo Lee, Kian Ming Lim, Jing Xiang Ng, Eric Khang Heng Ooi, Nicole Kai Ning Loh

Abstract


In natural language processing (NLP), a Question Answering System (QAS) is a system or model designed to understand user queries posed in natural language and respond to them. Recent advances in QAS reflect a paradigm shift in methods from traditional machine learning and deep learning approaches toward transformer-based language models. While significant progress has been made, the use of these models for historical QAS and the development of QAS for the Malay language remain largely unexplored. This research aims to bridge these gaps by developing a multilingual QAS for the history of Malaysia using a transformer-based language model. The system development process encompasses several stages: data collection, knowledge representation, data loading and pre-processing, document indexing and storage, and the establishment of a querying pipeline with a retriever and a reader. A dataset of 100 articles, including web blogs, related to the history of Malaysia was constructed to serve as the knowledge base for the proposed QAS. A significant aspect of this research is the use of an English translation of the dataset rather than the raw Malay dataset, a decision made to leverage well-established retriever and reader models trained on English data. Moreover, an evaluation dataset comprising 100 question-answer pairs was created to evaluate model performance. A comparative analysis of six transformer-based language models, namely DeBERTaV3, BERT, ALBERT, ELECTRA, MiniLM, and RoBERTa, was conducted through a series of experiments to determine the best reader model for the proposed QAS. The experimental results reveal that the proposed QAS achieved the best performance when employing RoBERTa as the reader model. Finally, the proposed QAS was deployed on Discord and equipped with multilingual support through language detection and translation modules, enabling it to handle queries in both Malay and English.
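The querying pipeline described in the abstract maps naturally onto the Haystack framework cited in the references. The following is a minimal sketch, not the authors' exact implementation: it assumes Haystack 1.x with an in-memory BM25 document store, and uses the public deepset/roberta-base-squad2 checkpoint (a RoBERTa reader fine-tuned on SQuAD 2.0) as a stand-in for the paper's RoBERTa reader; the langdetect and deep-translator libraries are likewise assumptions for the language detection and translation modules, as the abstract does not name the tools used.

```python
# Sketch of an extractive QA pipeline with Malay/English support.
# Assumed stack: Haystack 1.x, langdetect, deep-translator; the exact
# models and libraries used in the paper are not specified here.
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from langdetect import detect
from deep_translator import GoogleTranslator

# Index the translated (English) knowledge base of history articles.
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents([
    {"content": "Malaya gained independence from Britain on 31 August 1957."},
    # ... the remaining articles from the 100-article knowledge base
])

retriever = BM25Retriever(document_store=document_store)
# RoBERTa reader; deepset/roberta-base-squad2 is an assumed checkpoint.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
pipeline = ExtractiveQAPipeline(reader, retriever)

def answer(query: str) -> str:
    """Detect the query language, translate Malay to English, run the
    retriever-reader pipeline, and translate the answer back if needed."""
    lang = detect(query)  # e.g. "ms" for Malay, "en" for English
    if lang == "ms":
        query = GoogleTranslator(source="ms", target="en").translate(query)
    result = pipeline.run(
        query=query,
        params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 1}},
    )
    best = result["answers"][0].answer if result["answers"] else "No answer found."
    if lang == "ms":
        best = GoogleTranslator(source="en", target="ms").translate(best)
    return best

print(answer("Bilakah Malaya mencapai kemerdekaan?"))
```

In a Discord deployment, a function like `answer` would simply be called from the bot's message handler, so the language handling stays independent of the underlying English-only retriever and reader.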

 

DOI: 10.28991/ESJ-2024-08-02-019



Keywords


Question Answering; Historical Knowledge; Natural Language Processing; DeBERTaV3; BERT; ALBERT; ELECTRA; MiniLM; RoBERTa.

References


Chee-Huay, C., & Kee-Jiar, Y. (2016). Why Students Fail in History: A Minor Case Study in Malaysia and Solutions from Cognitive Psychology Perspective. Mediterranean Journal of Social Sciences. doi:10.5901/mjss.2016.v7n1p517.

Woods, W., Kaplan, R. M., & Nash-Webber, B. L. (1972). The Lunar Science Natural Language Information System: Final Report. BBN Report No. 11501, Contract No. NAS9-1115, NASA Manned Spacecraft Center, Houston, Texas, United States.

Androutsopoulos, I., Ritchie, G. D., & Thanisch, P. (1996). A Framework for Natural Language Interfaces to Temporal Databases. Proceedings of the 20th Australasian Computer Science Conference, 5–7 February, 1997, Sydney, Australia.

Ojokoh, B., & Adebisi, E. (2019). A review of question answering systems. Journal of Web Engineering, 17(8), 717–758. doi:10.13052/jwe1540-9589.1785.

Zheng, Z. (2002). AnswerBus question answering system. Proceedings of the Second International Conference on Human Language Technology Research, 399-404. doi:10.3115/1289189.1289238.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. 31st Conference on Neural Information Processing Systems (NIPS 2017), 4-9 December, 2017, Long Beach, United States.

Ou, Y.-Y., Chuang, S.-W., Wang, W.-C., & Wang, J.-F. (2022). Automatic Multimedia-based Question-Answer Pairs Generation in Computer Assisted Healthy Education System. 2022 10th International Conference on Orange Technology (ICOT), Shanghai, China. doi:10.1109/icot56925.2022.10008119.

Zhang, J. (2022). Application Research of Similarity Algorithm in the Design of English Intelligent Question Answering System. 2022 IEEE 2nd International Conference on Mobile Networks and Wireless Communications (ICMNWC), Karnataka, India. doi:10.1109/icmnwc56175.2022.10031708.

Das, B., & Nirmala, S. J. (2022). Improving Healthcare Question Answering System by Identifying Suitable Answers. 2022 IEEE 2nd Mysore Sub Section International Conference (MysuruCon), Mysuru, India. doi:10.1109/mysurucon55714.2022.9972435.

Gupta, S. (2023). Top K Relevant Passage Retrieval for Biomedical Question Answering. arXiv preprint arXiv:2308.04028. doi:10.48550/arXiv.2308.04028.

Pudasaini, S., & Shakya, S. (2023). Question Answering on Biomedical Research Papers using Transfer Learning on BERT-Base Models. 2023 7th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Kirtipur, Nepal. doi:10.1109/i-smac58438.2023.10290240.

Alzubi, J. A., Jain, R., Singh, A., Parwekar, P., & Gupta, M. (2023). COBERT: COVID-19 Question Answering System Using BERT. Arabian Journal for Science and Engineering, 48(8), 11003–11013. doi:10.1007/s13369-021-05810-5.

Acharya, S., Sornalakshmi, K., Paul, B., & Singh, A. (2022). Question Answering System using NLP and BERT. 3rd International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India. doi:10.1109/icosec54921.2022.9952050.

Yin, J. (2022). Research on Question Answering System Based on BERT Model. 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), Changchun, China. doi:10.1109/cvidliccea56201.2022.9824408.

Yang, J., Yang, X., Li, R., Luo, M., Jiang, S., Zhang, Y., & Wang, D. (2023). BERT and hierarchical cross attention-based question answering over bridge inspection knowledge graph. Expert Systems with Applications, 233, 120896. doi:10.1016/j.eswa.2023.120896.

Tian, D., Li, M., Ren, Q., Zhang, X., Han, S., & Shen, Y. (2023). Intelligent question answering method for construction safety hazard knowledge based on deep semantic mining. Automation in Construction, 145, 104670. doi:10.1016/j.autcon.2022.104670.

Liu, S., & Huang, X. (2019). A Chinese Question Answering System based on GPT. 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China. doi:10.1109/icsess47205.2019.9040807.

Noraset, T., Lowphansirikul, L., & Tuarob, S. (2021). WabiQA: A Wikipedia-Based Thai Question-Answering System. Information Processing & Management, 58(1), 102431. doi:10.1016/j.ipm.2020.102431.

Ainon, R. N., Salim, S. S., & Noor, N. E. M. (1989). A question-answering system in Bahasa Malaysia. Fourth IEEE Region 10 International Conference TENCON, Bombay, India. doi:10.1109/tencon.1989.176892.

Puteh, N., Husin, M. Z., Tahir, H. M., & Hussain, A. (2019). Building a question classification model for a Malay question answering system. International Journal of Innovative Technology and Exploring Engineering, 8(5s), 184–190.

Lim, H. T., Huspi, S. H., & Ibrahim, R. (2021). A Conceptual Framework for Malay-English Mixed-language Question Answering System. 2021 International Congress of Advanced Technology and Engineering (ICOTEN), Taiz, Yemen. doi:10.1109/icoten52080.2021.9493503.

Pietsch, M., Möller, T., Kostic, B., Risch, J., Pippi, M., Jobanputra, M., Zanzottera, S., Cerza, S., Blagojevic, V., Stadelmann, T., Soni, T., & Lee, S. (2019). Haystack: the end-to-end NLP framework for pragmatic builders. Available online: https://github.com/deepset-ai/haystack (accessed on March 2024).


Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2020). Big bird: Transformers for longer sequences. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 6-12 December, 2020, Vancouver, Canada.

Rajpurkar, P., Jia, R., & Liang, P. (2018). Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822. doi:10.18653/v1/p18-2124.

He, P., Gao, J., & Chen, W. (2021). DeBERTaV3: Improving DeBERTa Using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv preprint arXiv:2111.09543. doi:10.48550/arXiv.2111.09543.

He, P., Liu, X., Gao, J., & Chen, W. (2020). DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. arXiv preprint arXiv:2006.03654. doi:10.48550/arXiv.2006.03654.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. doi:10.48550/arXiv.1810.04805.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942. doi:10.48550/arXiv.1909.11942.

Clark, K., Luong, M. T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-Training Text Encoders as Discriminators Rather than Generators. arXiv preprint arXiv:2003.10555. doi:10.48550/arXiv.2003.10555.

Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), 6-12 December, 2020, Vancouver, Canada.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692. doi:10.48550/arXiv.1907.11692.




Copyright (c) 2024 Chin Poo Lee