Fine-Tuning Whisper Model for Mandar Speech Recognition: Approach and Performance Evaluation
DOI:
https://doi.org/10.37385/jaets.v7i1.7170

Keywords:
Whisper, Fine-Tuning, Mandar Language, Word Error Rate (WER), Automatic Speech Recognition

Abstract
This research focuses on the development of speech recognition technology for the Mandar language, a regional language in Indonesia with limited digital resources. The main challenge lies in the lack of local datasets and the minimal representation of the Mandar language in existing multilingual speech recognition models. This study aims to enhance the performance of Automatic Speech Recognition (ASR) systems by fine-tuning the Whisper model using a Mandar-specific dataset. The dataset consists of 1,000 audio recordings with various dialects and recording qualities, which underwent preprocessing steps such as segmentation, normalization, and data augmentation. Fine-tuning was conducted using supervised learning methods with hyperparameter optimization, resulting in a reduction of Word Error Rate (WER) from 73.7% in the pretrained model to 37.4% after fine-tuning, and an increase in accuracy from 26.3% to 62.6%. The optimized model was also compared with other ASR models, such as DeepSpeech and wav2vec 2.0, demonstrating superior performance in terms of accuracy and time efficiency. Further analysis revealed that recording quality and dialect variations significantly impacted model performance, with high-quality recordings and standard dialects yielding the best results. The model was implemented as a web application prototype, enabling efficient and near real-time transcription of Mandar speech. This research not only contributes to the development of ASR technology for low-resource languages but also opens new opportunities for preserving and utilizing the Mandar language through digital technology. For future improvements, larger datasets, more advanced augmentation techniques, and the exploration of additional language model integration are recommended.
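The evaluation metric reported above, Word Error Rate, is the word-level Levenshtein distance between the reference transcript and the model output, divided by the reference length; the paper's accuracy figures then follow as 1 − WER (e.g. 1 − 0.374 = 62.6%). As a minimal sketch of that computation (the `wer` helper and example strings below are illustrative, not taken from the paper's dataset):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / len(ref)

# One substitution plus one deletion against a 4-word reference: WER = 2/4
print(wer("a b c d", "a x c"))  # 0.5
```

In practice a library such as jiwer is typically used for this, but the arithmetic is the same.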