Fine-Tuning Whisper Model for Mandar Speech Recognition: Approach and Performance Evaluation

Authors

  • Jafar Jafar, Universitas Negeri Makassar, Indonesia
  • Mar Athul Wazithah Tb, Universitas Negeri Makassar
  • Firman Aziz, Universitas Pancasakti Makassar
  • Rosary Iriany, Universitas Pancasakti
  • Norma Nasir, Universitas Negeri Makassar

DOI:

https://doi.org/10.37385/jaets.v7i1.7170

Keywords:

Whisper, Fine-Tuning, Mandar Language, Word Error Rate (WER), Automatic Speech Recognition

Abstract

This research focuses on the development of speech recognition technology for the Mandar language, a regional language in Indonesia with limited digital resources. The main challenge lies in the lack of local datasets and the minimal representation of the Mandar language in existing multilingual speech recognition models. This study aims to enhance the performance of Automatic Speech Recognition (ASR) systems by fine-tuning the Whisper model using a Mandar-specific dataset. The dataset consists of 1,000 audio recordings with various dialects and recording qualities, which underwent preprocessing steps such as segmentation, normalization, and data augmentation. Fine-tuning was conducted using supervised learning methods with hyperparameter optimization, resulting in a reduction of Word Error Rate (WER) from 73.7% in the pretrained model to 37.4% after fine-tuning, and an increase in accuracy from 26.3% to 62.6%. The optimized model was also compared with other ASR models, such as DeepSpeech and wav2vec 2.0, demonstrating superior performance in terms of accuracy and time efficiency. Further analysis revealed that recording quality and dialect variations significantly impacted model performance, with high-quality recordings and standard dialects yielding the best results. The model was implemented as a web application prototype, enabling efficient and near real-time transcription of Mandar speech. This research not only contributes to the development of ASR technology for low-resource languages but also opens new opportunities for preserving and utilizing the Mandar language through digital technology. For future improvements, larger datasets, more advanced augmentation techniques, and the exploration of additional language model integration are recommended.
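The headline metric in the abstract, Word Error Rate (WER), is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's hypothesis, divided by the number of reference words. The following is an illustrative sketch of that computation, not the authors' code; the function and variable names are our own:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # match or substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Under this definition the accuracy figures reported above are simply 100% minus WER, consistent with the 73.7%/26.3% and 37.4%/62.6% pairs in the abstract.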


References

Adnew, S., & Liang, P. P. (2024). Semantically Corrected Amharic Automatic Speech Recognition. https://arxiv.org/pdf/2404.13362

Ali, A., & Renals, S. (2018). Word error rate estimation for speech recognition: E-wer. ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, 2, 20-24. https://doi.org/10.18653/v1/p18-2004

Arisaputra, P., Handoyo, A. T., & Zahra, A. (2024). XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese. ICIC Express Letters, Part B: Applications, 15(6), 551-559. https://doi.org/10.24507/icicelb.15.06.551

Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Advances in Neural Information Processing Systems, 33, 12449-12460.

Bawitlung, A., Dash, S. K., & Pattanayak, R. M. (2025). Mizo Automatic Speech Recognition: Leveraging Wav2vec 2.0 and XLS-R for Enhanced Accuracy in Low-Resource Language Processing. ACM Transactions on Asian and Low-Resource Language Information Processing, 24(7). https://doi.org/10.1145/3746063

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610-623. https://doi.org/10.1145/3442188.3445922

Bhat, C., & Strik, H. (2025). Two-stage data augmentation for improved ASR performance for dysarthric speech. Computers in Biology and Medicine, 189, 109954. https://doi.org/10.1016/j.compbiomed.2025.109954

Brodkin, D. (2022). Two steps to high absolutive syntax: Austronesian voice and agent focus in Mandar. Journal of East Asian Linguistics, 31(4), 465-516. https://doi.org/10.1007/s10831-022-09248-0

Cahyawijaya, S., Lovenia, H., Aji, A. F., Winata, G. I., Wilie, B., Koto, F., Mahendra, R., Wibisono, C., Romadhony, A., Vincentio, K., Santoso, J., Moeljadi, D., Wirawan, C., Hudi, F., Wicaksono, M. S., Parmonangan, I. H., Alfina, I., Putra, I. F., Rahmadani, S., ... Purwarianti, A. (2023). NusaCrowd: Open Source Initiative for Indonesian NLP Resources. ACL, 13745-13818. https://doi.org/10.18653/v1/2023.findings-acl.868

Das, R., & Singh, T. D. (2023). Multimodal Sentiment Analysis: A Survey of Methods, Trends, and Challenges. ACM Computing Surveys, 55(13). https://doi.org/10.1145/3586075

Dey, S., Sahidullah, M., & Saha, G. (2022). An Overview of Indian Spoken Language Recognition from Machine Learning Perspective. ACM TALLIP, 21(6), 1-45. https://doi.org/10.1145/3523179

Ferdiansyah, D., & Aditya, C. S. K. (2024). Implementasi Automatic Speech Recognition Bacaan Al-Quran Menggunakan Metode Wav2Vec 2.0 dan OpenAI-Whisper. Jurnal Teknik Elektro dan Komputer TRIAC, 11(1), 11-16. https://doi.org/10.21107/triac.v11i1.24332

Gillioz, A., Casas, J., et al. (2020). Overview of Transformer-based Models for NLP Tasks. https://doi.org/10.15439/2020F20

Gurunath Shivakumar, P., & Georgiou, P. (2020). Transfer learning from adult to children for speech recognition: Evaluation, analysis and recommendations. Computer Speech & Language, 63, 101077. https://doi.org/10.1016/j.csl.2020.101077

Hasrullah. (2018). Ribuan Mandar: Preservasi Bahasa dan Kebudayaan Mandar Sulawesi Barat. https://www.researchgate.net/publication/331630308

Imam, S. H., Belay, T. D., Husse, K. Y., Ahmad, I. S., Abdulmumin, I., Umar, H. A., Bello, M. Y., Nakatumba-Nabende, J., Yimam, S. M., & Muhammad, S. H. (2025). Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review. https://arxiv.org/pdf/2510.01145

Kastner, K., Wang, G., Elias, I., Saeki, T., Mengibar, P. M., Beaufays, F., Rosenberg, A., & Ramabhadran, B. (2025). Speech Re-Painting for Robust ASR. ICASSP. https://doi.org/10.1109/ICASSP49660.2025.10888357

Kuhn, K., Kersken, V., Reuter, B., Egger, N., & Zimmermann, G. (2023). Measuring the accuracy of automatic speech recognition solutions. ACM, 16(4). https://doi.org/10.1145/3636513

Lei, C., Singh, S., Hou, F., & Wang, R. (2024). Mix-fine-tune: An Alternate Fine-tuning Strategy for Domain Adaptation and Generalization of Low-resource ASR. MMAsia 2024. https://doi.org/10.1145/3696409.3700259

Liu, Y., Yang, X., & Qu, D. (2024). Exploration of Whisper fine-tuning strategies for low-resource ASR. Eurasip Journal on Audio, Speech, and Music Processing, 2024(1), 1-11. https://doi.org/10.1186/s13636-024-00349-3

Machado, F., Rahali, A., & Akhloufi, M. A. (2023). End-to-end transformer-based models in textual-based NLP. https://doi.org/10.3390/ai4010004

Nakatumba-Nabende, J., Kagumire, S., Kantono, C., & Nabende, P. (2025). A Systematic Literature Review on Bias Evaluation and Mitigation in ASR Models for Low-Resource African Languages. ACM Computing Surveys. https://doi.org/10.1145/3769089

Nalli, S., Haria, S., Hill, M. D., Swift, M. M., Volos, H., & Keeton, K. (2017). An Analysis of Persistent Memory Use with WHISPER. ACM SIGPLAN Notices, 52(4), 135-148. https://doi.org/10.1145/3093336.3037730

Niu, T., Chen, Y., Qu, D., & Hu, H. (2025). Enhancing Far-Field Speech Recognition with Mixer: A Novel Data Augmentation Approach. Applied Sciences, 15(7), 4073. https://doi.org/10.3390/app15074073

Nugroho, K., Noersasongko, E., Purwanto, Muljono, & Setiadi, D. R. I. M. (2022). Enhanced Indonesian Ethnic Speaker Recognition using Data Augmentation Deep Neural Network. Journal of King Saud University - Computer and Information Sciences, 34(7), 4375-4384. https://doi.org/10.1016/j.jksuci.2021.04.002

Nurfadhilah, E., Yuyun, Santosa, A., Latief, A. D., Nurul Afra, D. I., Gusnawaty, Pammuda, Kaharuddin, M. N., Rosvita, I., Nurfaedah, & Hazriani. (2024). Comparative Analysis of Part of Speech Tagging Methods for the Bugis Language. IC3INA 2024, 48-53. https://doi.org/10.1109/IC3INA64086.2024.10732773

Papala, G., Ransing, A., et al. (2023). Sentiment Analysis and Speaker Diarization in Hindi and Marathi Using Finetuned Whisper. SCPE, 24(4), 835-846. https://doi.org/10.12694/scpe.v24i4.2248

Peng, J., Stafylakis, T., Gu, R., Plchot, O., Mosner, L., Burget, L., & Cernocky, J. (2023). Parameter-Efficient Transfer Learning Using Adapters. ICASSP. https://doi.org/10.1109/ICASSP49357.2023.10094795

Radford, A., Kim, J. W., Xu, T., Brockman, G., Mcleavey, C., & Sutskever, I. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. https://proceedings.mlr.press/v202/radford23a.html

Rolland, T., Abad, A., Cucchiarini, C., & Strik, H. (2022). Multilingual Transfer Learning for Children ASR. https://aclanthology.org/2022.lrec-1.795/

Romero, M., Gomez-Canaval, S., & Torre, I. G. (2024). Automatic Speech Recognition Advancements for Indigenous Languages. Applied Sciences, 14(15), 6497. https://doi.org/10.3390/app14156497

Roos, Q. (2022). Fine-tuning Pre-trained Language Models for CEFR-level Text Generation. https://www.diva-portal.org/smash/record.jsf?pid=diva2:1708538

Rusdiah, R., Rasjid, N., Irianti, A., Adiheri, A., & Reski, R. (2023). Digitalisasi Cerita Rakyat Mandar. Jurnal Pengabdian Masyarakat Universitas Lamappapoleonro, 1(2), 66-70.

Sekiguchi, K., Bando, Y., et al. (2019). Semi-supervised multichannel speech enhancement with a deep speech prior. https://ieeexplore.ieee.org/document/8861142

Tawaqal, B., & Suyanto, S. (2021). Recognizing Five Major Dialects in Indonesia Based on MFCC and DRNN. Journal of Physics: Conference Series, 1844, 012003. https://doi.org/10.1088/1742-6596/1844/1/012003

Suyanto, S., Arifianto, A., Sirwan, A., & Rizaendra, A. P. (2020). End-to-End Speech Recognition for Low-Resourced Indonesian Language. ICoICT 2020. https://doi.org/10.1109/ICOICT49345.2020.9166346

Xiao, A., Zheng, W., Keren, G., Le, D., Zhang, F., Fuegen, C., Kalinli, O., Saraf, Y., & Mohamed, A. (2022). Scaling ASR Improves Zero and Few Shot Learning. Interspeech 2022, 5135-5139. https://doi.org/10.21437/Interspeech.2022-11023

Zevallos, R., Cordova, J., & Camacho, L. (2020). Automatic Speech Recognition of Quechua Language Using HMM Toolkit. CCIS, 1070, 61-68. https://doi.org/10.1007/978-3-030-46140-9_6

Zhao, J., & Zhang, W. Q. (2022). Improving ASR Performance for Low-Resource Languages with Self-Supervised Models. IEEE JSTSP, 16(6), 1227-1241. https://doi.org/10.1109/JSTSP.2022.3184480

Published

2025-12-29

How to Cite

Jafar, J., Tb, M. A. W., Aziz, F., Iriany, R., & Nasir, N. (2025). Fine-Tuning Whisper Model for Mandar Speech Recognition: Approach and Performance Evaluation. Journal of Applied Engineering and Technological Science (JAETS), 7(1), 261–273. https://doi.org/10.37385/jaets.v7i1.7170