Twitter Data Analysis and Text Normalization in Collecting Standard Word


  • Arif Ridho Lubis Politeknik Negeri Medan
  • Mahyuddin K M Nasution Universitas Sumatera Utara



NLP, Word, Formal, Analysis, Twitter


Twitter is one of the most important data sources in social data analysis. However, text on Twitter is often unstructured, which makes it difficult to collect standard words. In this research, we therefore analyze Twitter data and normalize the text to produce standard words that can be used in social data analysis. The purpose of this research is to improve the quality of standard-word collection from Twitter and to support more accurate and valid social data analysis. The method combines natural language processing techniques, namely classification algorithms and text normalization. The result is a set of 11,430 words for social data analysis, of which 4,075 are structural (formal) words and 7,355 are informal words. The informal words are corrected against trusted sources to create a corpus of formal and informal words obtained from social media tweet data from @fullSenyum. This research makes two contributions: the method developed can improve the quality of social data collected from Twitter by ensuring that the words used are standard and accurate, and the text normalization method can serve as a reference for normalizing text in other social data, thus facilitating better-quality social data collection and analysis. This research can help researchers and practitioners understand natural language processing techniques and their application in social data analysis, and it is expected to make social data collection more effective and efficient.
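The workflow summarized in the abstract (collect words from tweets, separate formal from informal words, then correct the informal ones) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it swaps the classification algorithm for a plain lexicon lookup, assumes the tweets are in Indonesian, and uses tiny hypothetical stand-ins (`FORMAL_LEXICON`, `INFORMAL_TO_FORMAL`) for the standard-word list and the corrections from trusted sources.

```python
import re

# Hypothetical resources: tiny stand-ins for a standard-word lexicon and
# for informal-to-formal corrections taken from a trusted source.
FORMAL_LEXICON = {"saya", "tidak", "sudah", "makan", "yang"}
INFORMAL_TO_FORMAL = {"gak": "tidak", "udah": "sudah", "yg": "yang", "sy": "saya"}

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(tweet):
    """Lowercase a tweet, drop URLs, mentions and hashtags, return word tokens."""
    text = re.sub(r"https?://\S+|[@#]\w+", " ", tweet.lower())
    return TOKEN_RE.findall(text)


def build_corpus(tweets):
    """Split the distinct words of all tweets into formal and informal sets."""
    formal, informal = set(), set()
    for tweet in tweets:
        for token in tokenize(tweet):
            # Lexicon lookup stands in for the classification step.
            (formal if token in FORMAL_LEXICON else informal).add(token)
    return formal, informal


def normalize(tweet):
    """Replace known informal tokens with their formal counterparts."""
    return " ".join(INFORMAL_TO_FORMAL.get(t, t) for t in tokenize(tweet))


if __name__ == "__main__":
    tweets = ["Sy gak makan @fullSenyum", "udah makan yg itu http://t.co/x"]
    formal, informal = build_corpus(tweets)
    print(len(formal), "formal words;", len(informal), "informal words")
    for tweet in tweets:
        print(normalize(tweet))
```

Words that have no entry in the correction table pass through unchanged, so in a sketch like this the informal set doubles as a worklist of words that still need a trusted correction.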











How to Cite

Lubis, A. R., & Nasution, M. K. M. (2023). Twitter Data Analysis and Text Normalization in Collecting Standard Word. Journal of Applied Engineering and Technological Science (JAETS), 4(2), 855–863.