PRODUCT CODIFICATION ACCURACY WITH COSINE SIMILARITY AND TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) WEIGHTING

In the SiPaGa application, the codification search is still inaccurate, so OPD often choose the wrong goods codes. Cosine Similarity and Term Frequency-Inverse Document Frequency (TF-IDF) weighting can improve the accuracy of this search. Cosine Similarity measures how similar a search keyword is to the description of a goods code, and TF-IDF assigns a weight to each word (term) according to its relationship to a document. The purpose of this research is to improve the accuracy of the search for goods codification. The study processed 14,417 goods codification records sourced from the database of the Goods and Price Planning Information System (SiPaGa) application. Search keywords were weighted using TF-IDF and compared with the codification records using Cosine Similarity. This research produces the Cosine Similarity calculation and TF-IDF weighting, which are expected to be applied to the SiPaGa application so that its search is more accurate than before and OPD can choose the intended goods code.


Introduction
The SiPaGa application is an information system for planning the prices of goods and services, managed by the Procurement Administration Bureau and BMD Management. The application has a codification search feature used to complete the Regional Property Needs Plan (RKBMD) and the Standard Price for Goods and Services (SHBJ). The codification of goods is an important part of the SiPaGa application and aims to facilitate the management and administration of regional property. Some goods codes share the same description but differ in account, group, type, and object, so the wrong codification is often chosen. To reduce such grouping errors, a search technique is needed that measures how well each codification matches the one being sought. At present, a search with the keyword "bahan bangunan" returns the goods codes for that keyword, but if the keyword is changed to "bangunan bahan", no codification appears. The search should return results for both "bahan bangunan" and "bangunan bahan" because they contain the same words. For this reason, text mining can be applied to analyze the goods codification data, with the existing codification data serving as the source of new knowledge. The Cosine Similarity algorithm with Term Frequency-Inverse Document Frequency (TF-IDF) weighting can be used to classify the codification of goods according to the search words.
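The word-order problem described above can be illustrated with a minimal Python sketch. It is not the SiPaGa implementation; the function name `bag_of_words` and the sample description are assumptions made for illustration. It shows that a bag-of-words representation treats "bahan bangunan" and "bangunan bahan" as the same query, whereas a plain substring match does not.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it on whitespace, and count each word."""
    return Counter(text.lower().split())


# Hypothetical item description, used only for illustration.
description = "bahan bangunan"

query_a = bag_of_words("bahan bangunan")
query_b = bag_of_words("bangunan bahan")

# Both queries contain the same words, so their bags are identical
# even though the word order differs.
same_bag = query_a == query_b

# A plain substring search depends on word order and misses the swapped query.
substring_hit = "bangunan bahan" in description

print(same_bag)       # True
print(substring_hit)  # False
```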
The codification of goods is based on Regulation of the Ministry of Home Affairs of the Republic of Indonesia Number 108 of 2016 concerning the classification and coding of regional property. Under this regulation, the local government codifies regional property using the account code, group code, type code, object code, object detail code, object sub-detail code, and object sub-sub-detail code. The codification has 7 levels:
1. Level 1 shows the account code
2. Level 2 shows the group code
3. Level 3 shows the type code
4. Level 4 shows the object code
5. Level 5 shows the object detail code
6. Level 6 shows the object sub-detail code
7. Level 7 shows the object sub-sub-detail code
There are 14,417 goods codification lines available. If an item's codification is not available among the sub-details of the object, a new codification can be added by decision of the regional head, so the number of goods codifications can exceed the 14,417 lines previously available.
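The seven-level structure can be represented as a simple record. The Python sketch below is illustrative only: the source does not give the textual format of a goods code, so the dot-separated layout, the sample code "1.3.2.05.01.01.001", the field names, and the function `parse_goods_code` are all assumptions.

```python
from typing import NamedTuple


class GoodsCode(NamedTuple):
    """One goods code split into its seven levels (field names are assumed)."""
    account_code: str                 # level 1
    group_code: str                   # level 2
    type_code: str                    # level 3
    object_code: str                  # level 4
    object_detail_code: str           # level 5
    object_sub_detail_code: str       # level 6
    object_sub_sub_detail_code: str   # level 7


def parse_goods_code(code: str) -> GoodsCode:
    """Split a dot-separated code (assumed format) into exactly seven levels."""
    parts = code.split(".")
    if len(parts) != 7:
        raise ValueError(f"expected 7 levels, got {len(parts)}: {code!r}")
    return GoodsCode(*parts)


# Hypothetical code value, used only to show the shape of the record.
example = parse_goods_code("1.3.2.05.01.01.001")
print(example.account_code, example.object_sub_sub_detail_code)
```

Keeping the levels as separate fields makes it straightforward to show a user which account, group, type, and object a search result belongs to when two results share the same description.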

Literature Review
Knowledge Discovery in Databases (KDD) is the process of searching for and identifying patterns in data; the resulting patterns make the data more useful and understandable. KDD has several phases, one of which is selection, where data is changed from unstructured to structured. Text mining uses Term Frequency-Inverse Document Frequency (TF-IDF) weighting and cosine similarity. The cosine similarity method is widely used in data mining and machine learning, most commonly in high-dimensional spaces. For example, in information retrieval and text mining, cosine similarity provides a useful measure of how similar two documents are (Luo et al., 2018). Hafeez & Patil (2017) used ontology and TF-IDF to improve searches in web browsers. Cosine Similarity has also been studied for assessing text similarity, resulting in an application that can detect similar texts (Rozeva & Zerkova, 2017).
In the field of education, TF-IDF weighting and the Vector Space Model have been applied to select examiner lecturers (Siregar et al., 2017). The resulting system recommends examiners by topic, based on how well they match the title and abstract. Sejati et al. (2019) applied the cosine similarity method to bibliographies to find works with similar themes and so detect plagiarism in scientific works. Yasni et al. (2018) used Cosine Similarity matching to determine supervisors who fit the final project submitted by students. Naf'an et al. (2019) used Cosine Similarity to check document similarity; the method produces a similarity value against each comparison document, with a reported high value of 50% and a low value of 40%. Kharismadita & Rahutomo (2017) used TF-IDF and Cosine Similarity to measure the similarity of journals by comparing their entire content, including the title, abstract, and body.
In social media, Deviyanto & Wahyudi (2018) analyzed the sentiment of Twitter users about the DKI regional elections using K-Nearest Neighbor and obtained an accuracy of 67.2%, a precision of 56.94%, and a recall of 78.24%. In the religious field, particularly Islam, TF-IDF and Cosine Similarity have been applied to find the sharah of hadith (Amrizal, 2018); the resulting system finds the hadith sharah with high accuracy, recall, and precision. Kim & Gil (2019) grouped papers by combining TF-IDF and LDA to calculate the importance of each paper and then clustering papers with similar subjects using the K-Means algorithm to obtain correct classification results. Nkisi-Orji et al. (2019) proposed a classifier-based approach to ontology alignment based on a hybrid of string-based features and semantic similarity, in which word embeddings are used to produce features for classification alongside newly introduced features. Zhu et al. (2019) proposed a TF-IDF-based algorithm to detect currently discussed topics and obtain the latest information from existing words, with experimental search accuracy reaching 78.36%. Experiments on the IMDB dataset show that accuracy improves when using Cosine Similarity instead of the dot product, and that combining the features with Naïve Bayes weighted bag of n-grams achieves a new state-of-the-art accuracy of 97.42% (Thongtan & Phienthrakul, 2019).

Research Methods
Text mining is the process of extracting information from unstructured or semi-structured data sources, such as Word documents, PDFs, and text citations, whereas data mining works on structured data (Siregar et al., 2017). Text mining can also be broadly defined as a process of extracting information in which a user interacts with a set of documents using analysis tools that are components of data mining, one of which is categorization (Amrizal, 2018). Text mining produces root words from the data sources. Each root word can appear in more than one document, and the number of occurrences of each word is useful for measuring how important the word is in a document (Nurdiansyah et al., 2019). Data mining is part of the Knowledge Discovery from Data (KDD) process (Putra, Randi Rian, 2018).

Text Preprocessing
Text preprocessing is the stage that prepares text as data to be processed (Amrizal, 2018). This stage also selects and removes words that have no meaning (Sejati et al., 2019). The stages of text preprocessing are as follows:

a. Case Folding
The case folding process converts all characters in the text to lowercase.

b. Tokenizing
This process is carried out to separate the text into individual features (tokens) which will be processed by the system.
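The case folding and tokenizing steps can be sketched in Python as follows. This is a minimal illustration rather than the system's actual code; the function names and the rule that tokens are runs of letters and digits are assumptions.

```python
import re


def case_fold(text: str) -> str:
    """Convert every character in the text to lowercase."""
    return text.lower()


def tokenize(text: str) -> list[str]:
    """Split case-folded text into tokens.

    Runs of lowercase letters and digits are kept as tokens; whitespace and
    punctuation act as separators (an assumed tokenization rule).
    """
    return re.findall(r"[a-z0-9]+", text)


tokens = tokenize(case_fold("Masker Gas, Bahan Bangunan"))
print(tokens)  # ['masker', 'gas', 'bahan', 'bangunan']
```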

Text Transformation
The text transformation stage consists of stopword removal (filtering) and stemming.

a. Stopword Removal
This stage keeps the words considered important from the tokenization results and discards the words considered unimportant in the text mining process.

b. Stemming
Stemming transforms a word into its root word by removing all affixes.
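A simplified Python sketch of the two text transformation steps is given below. The stopword list, root-word list, and affix lists are small hypothetical samples, not the resources used in the study, and the stemmer is a toy that only accepts a stripped form when it appears in the root-word list. A production system would use a complete Indonesian stopword list and stemmer.

```python
# Hypothetical sample resources, for illustration only.
STOPWORDS = {"dan", "yang", "untuk", "di", "dari"}
ROOT_WORDS = {"bangun", "alat", "bahan", "masker", "gas"}
PREFIXES = ("per", "pe", "ber", "me", "di")
SUFFIXES = ("kan", "an", "i")


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Discard tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def stem(token: str) -> str:
    """Strip one optional suffix and one optional prefix.

    A stripped form is accepted only if it is a known root word;
    otherwise the token is returned unchanged.
    """
    if token in ROOT_WORDS:
        return token
    for suffix in ("",) + SUFFIXES:
        if suffix and not token.endswith(suffix):
            continue
        base = token[: len(token) - len(suffix)] if suffix else token
        for prefix in ("",) + PREFIXES:
            if prefix and not base.startswith(prefix):
                continue
            candidate = base[len(prefix):]
            if candidate in ROOT_WORDS:
                return candidate
    return token


tokens = ["peralatan", "dan", "bahan", "bangunan"]
stems = [stem(token) for token in remove_stopwords(tokens)]
print(stems)  # ['alat', 'bahan', 'bangun']
```

Checking candidates against a root-word list prevents over-stemming: "bahan" ends in "an" but is kept intact because it is itself a root word.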

TF-IDF
TF-IDF is a way to give weight to the relationship of a word (term) to a document. This method combines two concepts for weight calculation, namely the frequency of occurrence of a word in a particular document and the inverse frequency of documents containing that word. The frequency with which the word appears in a given document shows how important it is in that document, while the inverse document frequency lowers the weight of words that appear in many documents. In Arroyo-Fernández et al. (2019), TF-IDF weights are calculated locally from the post-editing dataset using word frequency as TF, and stop words are included in the model.
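The weighting can be sketched in Python as follows. The source does not state which TF-IDF variant is used, so this sketch assumes raw term counts for TF and idf(t) = log10(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document.

    Assumed variant: weight = tf * log10(N / df), with tf as the raw count.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    idf = {term: math.log10(n_docs / count) for term, count in df.items()}

    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append({term: freq * idf[term] for term, freq in tf.items()})
    return weights


# Hypothetical tokenized item descriptions.
docs = [
    ["masker", "gas"],
    ["masker", "kain"],
    ["tabung", "gas"],
    ["bahan", "bangunan"],
]
weights = tf_idf(docs)
print(weights[1])
```

In this sample, "masker" occurs in two of the four documents and "kain" in only one, so "kain" receives the larger weight in the second document. Under this variant a term that occurs in every document gets a weight of zero.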

Results and Discussions
Four records are used as example input data, as shown in Table 1. All the data obtained are then processed and analyzed for the problems that occur, in order to produce useful information for overcoming those problems and to propose an improvement. The data are processed using text mining with Cosine Similarity and TF-IDF weighting, as shown in Table 2.
The text mining system was built using the CodeIgniter framework. The interface of the system, which applies Cosine Similarity and TF-IDF weighting, is shown in Figure 3. To test the system, 300 previously prepared sample records are uploaded by clicking the Choose file button and selecting the file with the .xls extension; this page is shown in Figure 4. After the data is uploaded successfully, it appears as shown in Figure 5. The test is then carried out using the keyword "Masker Gas"; the result is shown in Figure 6.
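The keyword search can be sketched end to end in Python. This is an illustrative sketch, not the CodeIgniter system described above: the item descriptions are hypothetical, preprocessing is reduced to lowercasing and whitespace splitting, and the TF-IDF variant (raw counts with idf = log10(N / df)) is an assumption. Items are scored by the cosine of the angle between the query vector and each item vector, cos(q, d) = (q · d) / (|q| |d|), and only items with a score above zero are returned, sorted from most to least similar.

```python
import math
from collections import Counter


def idf_table(documents: list[list[str]]) -> dict[str, float]:
    """Inverse document frequency per term, assumed as log10(N / df)."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return {term: math.log10(len(documents) / n) for term, n in df.items()}


def weigh(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """TF-IDF vector of a token list; terms unknown to the corpus are ignored."""
    return {t: f * idf[t] for t, f in Counter(tokens).items() if t in idf}


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, descriptions: list[str]) -> list[tuple[float, str]]:
    """Return (score, description) pairs with score > 0, highest score first."""
    documents = [d.lower().split() for d in descriptions]
    idf = idf_table(documents)
    query_vector = weigh(query.lower().split(), idf)
    scored = [
        (cosine_similarity(query_vector, weigh(doc, idf)), desc)
        for doc, desc in zip(documents, descriptions)
    ]
    hits = [pair for pair in scored if pair[0] > 0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Hypothetical item descriptions, not taken from the SiPaGa database.
ITEMS = ["Masker Gas", "Masker Kain", "Tabung Gas", "Bahan Bangunan"]

results = search("Masker Gas", ITEMS)
for score, description in results:
    print(f"{score:.3f}  {description}")
```

Because the vectors ignore word order, the query "Gas Masker" produces the same ranking as "Masker Gas", and "Bahan Bangunan" is excluded because it shares no term with the query.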

Conclusion
Based on the results obtained, applying TF-IDF weighting and Cosine Similarity has succeeded in increasing the accuracy of the search for goods codification. After a search keyword is entered, the goods codifications with a Cosine Similarity value greater than zero (0) are returned as the search results, and the codification with the highest value is the one most similar to the keyword.