TWITTER DATA ANALYSIS AND TEXT NORMALIZATION IN COLLECTING STANDARD WORD

This study discusses the Twitter data analysis and text normalization in standard word collection. Twitter is one of the most important data sources in social data analysis. However, the text contained on Twitter is often unstructured, resulting in difficulties in collecting standard words. Therefore, in this research, we analyze Twitter data and normalize text to produce standard words that can be used in social data analysis. The purpose of this research is to improve the quality of data collection on standard words on social media from Twitter and facilitate the analysis of social data that is more accurate and valid. The method used is natural language processing techniques using classification algorithms and text normalization techniques. The result of this study is a set of standard words that can be used for social data analysis with a total of 11430 words, then 4075 words with structural or formal words and 7355 informal words. Informal words are corrected by trusted sources to create a corpus of formal and informal words obtained from social media tweet data @fullSenyum. The contribution to this research is that the method developed can improve the quality of social data collection from Twitter by ensuring the words used are standard and accurate and the text normalization method used in this study can be used as a reference for text normalization in other social data, thus facilitating collection. and better-quality social data analysis. This research can assist researchers or practitioners in understanding natural language processing techniques and their application in social data analysis. This research is expected to assist in collecting social data more effectively and efficiently.


Introduction
The development of social media has now become a trend in terms of communication between on line users (Anandhaun et al., 2018;Zeng et al., 2010), social media is an online media with a gathering of users who can easily communicate, share and participate one another (Schreck & Keim, 2012;Middleton et al., 2013;A.R. Lubis et al., 2019). Data in social media represents research and manageable challenges in natural language processing (Young et al., 2018;Liang & Dai, 2013). Twitter is a widely popular social media platform around the world, where users can post short messages known as "tweets" . Due to the brevity of tweets and the fact that they sometimes do not follow proper grammar or spelling rules, text normalization is required to collect correct and consistent words(A.R. Lubis et al., 2019;Arif Ridho Lubis et al., 2020). In this context, text normalization refers to the process of converting non-standard text to standard text by correcting spelling, grammar, and other errors (Neto et al., 2020;Dirkson et al., 2019). The results of text normalization can help improve the accuracy of collecting correct and consistent words for further analysis. The data obtained from social media was still unstructured which still needed to be improved (H. Zheng et al., 2020;X. Zheng et al., 2015). Several studies had been carried out on social media data in many languages such as Indian (Tanna et al., 2020;Roshini et al., 2019;Kumar et al., 2021) Chinese (Xuanyuan et al., 2021;Liu & Chen, 2019). The conducted research focused on improving the technique of the preprocessing process and the completion of nonstandard and unstructured words due to the use of words and phrases in communication in Indonesian-language social media Alhaj et al., 2022), Preprocessing is a stage in the natural language preprocessing method intended for documents in the form of text (Pano & Kashef, 2020;Villavicencio et al., 2021). The goal was to prepare data or text obtained from unstructured social media into good data and could be easily processed for further processing (Jimenez-Marquez et al., 2019;Shu et al., 2017;Iskandar & Marjuki, 2022). In the preprocessing technique there were several processes such as parsing, case folding, tokenizing, stemming, filtering/stop words, normalization Baccouche et al., 2020;Sarimole & Fadillah, 2022). In the text normalization stage, it was very important to be able to help parse Indonesian language that could understand lexical meaning well, performance in processing structural and unstructured words could be improved if the preprocessing stages were carried out properly, especially normalization for unstructured words (Izonin et al., 2022;Basan et al., 2022).
Research by (Göker & Can, 2018) In this study, two approaches were carried out for the Turkish text i.e. the contextual normalization approach and the sequence-to-sequence normalization approach using a neural encoder model. Other studies (Nguyen et al., 2017) carried out normalization which could extract information related to data accurately. Based on the conclusions and objectives of previous research, there are still weaknesses in normalizing text, so this study will apply text normalization which will use Indonesian language data originating from social media Twitter, which has unstructured and non-standard words, normalization techniques used to carry out text normalization process so that data can be processed and analyzed further. This normalization stage is adjusted to the data that has been obtained from social media (Jose & Raj, 2014;bin Sazali & Idris, 2022). The nature of the data obtained from social media is that there are users who generally communicate and express expressions (Sebastian & Nugraha, 2019), for example they say "kota in sunggh sjuk setiap hari", in the tweet snippets obtained by naked eye you can see several errors like "in" which should "ini", "sunggh" that should be "sungguh", and "sejk" harus "sejuk". If the words are processed and analyzed, such as sentiment, and classification, the results obtained are not accurate, so text processing analysis is needed. The purpose of this study is to use a dictionary-based normalization approach so that it can have superior differences from previous research. This study divides the process into 3 parts, namely text normalization, statistical words or lexicon, and non-standard words that appear in tweet data. In related research, standard Indonesian is specifically discussed. This study also provides slang and formal words to understand the characteristics of the tweet data.

Literature Review
Many researchers had applied the model of deep learning to many cases such as (Maghfur et al., 2021) Text-to-Speech (TTS) is widely used for both academic/non-commercial and industrial/commercial purposes. However, in some cases, text normalization is added to improve TTS performance. In this study, a rule-based approach is proposed to create a normalized Indonesian text dataset that has raw text and spoken form to improve Indonesian TTS performance. This approach shows good performance for text normalization for Indonesian TTS with a Word Error Rate (WER) of 0.0805. Another study conducted by (Khan & Lee, 2021) concluded that In this research, it is proposed to develop an application called Textual Variations Handler (TVH), which is a generic application that works in a variation-independent manner to handle various types of noise in textual data originating from various social media (SM) applications to improve text analysis. The aim of this research is to introduce a hybrid normalization technique that is effective in ensuring that information obtained from noisy text data can be utilized in the desired form. This study integrates the TVH application with a deeplearning state-of-the-art (SOTA) based text analysis method to improve its performance in analyzing noisy SM text data. The simulation results show that the proposed scheme is promising in terms of precision, recall, accuracy, and F1 scores in the analysis of informal texts on social media. This research (Sebastian & Nugraha, 2019), this research normalized the Indonesian language with data having several words from data consisted of unstructured sentences and nonstandard words. The normalization method was carried out to analyze data for the next process.
This research (Rahman et al., 2019) analyzed the data of Indonesian tweets consisting of unstructured text in order to complete the word processing process and clean up tweet data from unstructured text. In processing text by using case folding, filtering stages, tokenizing. Then the normalization process was carried out so that words having excess letters, abbreviation, slang and all documents were converted into standard words and if there were words without meaning, they were deleted. Other research conducted by (Javaloy & García-Mateos, 2020) concluded that this study proposes a new method for the encoder in the encoder-decoder architecture called Causal Feature Extractor (CFE) for text normalization as the first step in a text-to-speech system ( TTS). CFE is compared to other encoding methods and shows better results in terms of accuracy, number of parameters, convergence time, and the use of the attention matrix based on the attention mechanism. The proposed method is general in nature and can be applied to various input types such as text, audio and images. This research (Sebastian & Nugraha, 2019) collected data in conducting research in the field of text mining. Data collection consists of many languages which were then processed to obtain normalized data in the word processing. Abbreviated words were a problem in text mining which resulted in the system not being able to process the text optimally due to differences in the meaning of the abbreviated words. The purpose of this study was to develop and obtain a set of Indonesian abbreviations. This research applied Crowdsourcing method in developing the dataset. Based on previous research that has been described there are deficiencies that can be resolved in this study which are listed in Table 1 below: The solution offered was the use of statistical matching translation to carry out the process of normalizing Indonesian text by utilizing translation at the phrase and character level in text data. In this case, using an external corpus for the data to be used in the normalization model. Indonesian language has a colloquial structure that could be found in the tweet data. In some cases, the everyday language words contained in the tweet data were still difficult to understand.
The following examples of categories and samples could be seen in Table 2. On the linguistic side, you could also use the lexicon to search and observe slang in social media. In general, a comparison of the appearance of slang words from tweet data on social media was then carried out.

Research Methods
The quantitative research method will assist in systematically and objectively collecting data from Twitter. The obtained data will be processed using data analysis techniques such as descriptive statistics and text classification. Additionally, this research will also employ text normalization methods to correct spelling and grammar errors in the text obtained from Twitter. Text normalization methods can include the removal of punctuation marks, the use of wordbreaking algorithms, and the combination of separated words. This study will use a sample of data from the Twitter account @fullsenyum, which will be taken through a random sampling process. The data will then be processed using data analysis techniques and text normalization methods. Figure 1 shows the research flowchart. To obtain tweet data from the @fullsenyum community on the Twitter social media platform, this study used crawling techniques. The collected tweet data covers the period from 2018 to 2021, and a total of 11,430 tweets were successfully gathered. Table 3 shows the results of the crawling process.

b) Preprocessing
The preprocessing stage was needed in this research because the tweet data obtained from Twitter social media had unstructured data so that preprocessing was carried out to remove punctuation marks, emoticons, and make changes to uppercase letters to lowercase letters and eliminate words considered not important. The following was the result of preprocessing data.

c) Method
In this study, the text normalization method was used, which is the process of converting non-standardized text into standardized text, so that it is easier for computers to understand and process. There are several techniques commonly used in text normalization, such as removing non-alphanumeric characters, replacing slang words with standard words, adjusting the use of capital and small letters, and handling abbreviations and abbreviations. More sophisticated methods of text normalization use natural language processing (NLP) technologies and machine learning to recognize and correct errors in text, such as misspellings or incorrect use of words in certain contexts

Results and Discussions
This study built a lexicon by processing the words contained in the tweet data, totaling 11,430 data obtained from social media Twitter on the @fullsenyum account which had a public figure account. The tweet data obtained were preprocessed with several steps to improve the sentence structure of the tweet data. After the preprocessing stage, the daily Indonesian language lexicon produced contained 4075 formal words and 7355 informal words. Most of them were informal words in Indonesian. Each record had 2 columns: • Non-formal words: words with non-formal meaning • Formal: a formal word suited the Indonesian dictionary Table 4 presented basic information about total words, total formal words and total informal words. There were 7355 informal words then there were 4075 formal words. Table 5 showed ten examples of informal words and then changes were made to informal words so that the quality of the data was better. Among 4075 unique formal words, 1,159 (67%) words occurred only once. Furthermore, Figure 1 showed that the distribution of words in the data to the occurrence of formal words applied Zipf law. From the analyzed tweet data, this study found that all the words contained in the @fullsenyum community tweet data had bad structures so after preprocessing, 10221 words were obtained in the Indonesian Dictionary. This study comprehended only the types of AOA words contained in the data, this study presented 10 examples in table 6. Most of the informal words that were not listed in the Indonesian Dictionary had word excess or lack of words. Thus, it was difficult to represent the word and difficult to process From this study could see the distribution of words in the data shown in Figure 2, formal words tend to be more structured and had longer words than non-formal words with the average number of characters per word being 7 for the first and 6 for the last. At this stage analyzed the frequency of words contained in the tweet data that had been preprocessed. Frequency was done to see how many repetitions of words in the data. At this stage it focused on making changes manually to words with poor structures including excess words, word deficiencies and errors in the preprocessing process. Table 7 showed examples of words with bad structure, which were then corrected manually so that they could be detected by the Indonesian dictionary. Based on Table 7, the results of the 11430 Tweet data were then preprocessed which resulted in data with a total of 10,221 then separated standard and non-standard sentences according to the data of the Indonesian language dictionary obtained by the author. The obtained results were 4075 standard words and non-standard words. The standard number was 4778, after a lexical analysis had been carried out that informal words had poor word structures. Thus, to get words with good structures, the author conducted a manual word identification according to the sources that had been given. Words with poor structure contained excessive words, repeated words and no space words.

Conclusion
In this study, the researcher presented a lexicon of formal words contained in the Twitter social media data with the normalized @fullsenyum account. The resulting corpus data was useful for natural language preprocessing processes or tasks in Indonesian. In this case, the Indonesian language corpus data were added to the data obtained from Twitter. the data were obtained available on GitHub under the MIT license. In this study there were several techniques in making formal Indonesian word corpus data as a dictionary for the stages in text normalization, and as a data collection as a model of natural language preprocessing but this corpus data could be used in developing science in the field of natural language preprocessing in detail. This research should be developed more widely to utilize tweet data with many characteristics of the sentences contained in the tweet data. Many studies had been carried out in the field of natural language preprocessing. This research was expected to improve the performance of social media analysis, especially Twitter in Indonesia This research can contribute to the development of a more standardized and structured Indonesian lexicon. In this context, the text normalization method used in this research can be the basis for collecting standard words that are often used in Indonesian based on Twitter data. Another implication is that this research can assist in the development of a more effective and accurate text mining system for analyzing social media data, especially in the Indonesian context.