IMPROVING TEXT SUMMARIZATION QUALITY BY COMBINING T5-BASED MODELS AND CONVOLUTIONAL SEQ2SEQ MODELS

Natural language processing includes several subfields that are closely related to information retrieval, among them automatic text summarization. This research uses the T5 and convolutional Seq2Seq models, accessed through the Hugging Face platform, to improve the quality of text summaries. It identifies features that influence summary results, such as the use of uppercase and lowercase letters, which affects how document content is understood. To optimize the model, the research tunes layer dimensions, learning rate, and batch size, and applies dropout to avoid overfitting. Model performance is evaluated with the ROUGE metric. The results are promising: ROUGE-1 reaches an average of 0.8 on the four documents tested, ROUGE-2 an average of 0.83, and ROUGE-L an average of 0.8, all of which the study treats as optimal performance.


Introduction
In natural language processing, several subfields are closely related to information retrieval, among them automatic text summarization (Dhivyaa et al., 2022) (Farahani et al., 2021). Text summarization produces a digest of the information in a document, built from its subjects, objects, and predicates (Reddy & Guha, 2023) (Lubis, Prayudani, et al., 2022) (Lubis & Nasution, 2023). Its purpose is to condense the information in longer documents into more concise text that is easy to understand (Lalitha et al., 2023) (Alambo et al., 2022). Text summarization has developed significantly with the availability of transformer-based models such as T5, which is well suited to producing good summaries in natural language processing (Batra et al., 2021). Alongside these developments, producing good-quality summaries remains challenging (Lee et al., 2021): summaries can be uninformative or difficult to understand, so many researchers have worked on improving their quality. Chaves (Chaves et al., 2022) extracted summaries of biological documents, Abdel-Salam (Abdel-Salam & Rafea, 2022) summarized text using the BERT model, and Abdulateef (Abdulateef et al., 2020) used the Word2vec technique to summarize Arabic documents. Previous studies have used only a single model for text summarization, such as the T5 model or the convolutional Seq2Seq model (Vogel-Fernandez et al., 2022).
This research extracts sentences from the original document and paraphrases the most relevant ones, using vocabulary that conforms to the Big Indonesian Dictionary. The paraphrasing models used are T5 and convolutional Seq2Seq, each of which has its own advantages for summarization. The T5 model is strong at language generation, so it produces text that is structurally easier to understand (Song et al., 2022) (Chouikhi & Alsuhaibani, 2022), while the convolutional Seq2Seq model can handle unstructured data and perform feature extraction on document data (Di Egidio et al., 2023). Combining the models is therefore expected to give better results, drawing on the potential features in the very large word representation from T5 and on the handling of unstructured text by the convolutional Seq2Seq model. The purpose of combining these summary models is to assess the quality of the text produced and to improve the abbreviations and accuracy of words according to the Big Indonesian Dictionary. The study is evaluated with ROUGE, a set of measures often used with the convolutional Seq2Seq model. This research contributes to improving the quality of text summaries through a combination of T5-based and convolutional Seq2Seq models, and its results can be used in applications that include a text summary feature.

Literature Review
In this study, we combine two models used for summarization: the transformer-based T5 model and the convolutional Seq2Seq model. Among related studies, Fendji (Fendji et al., 2021) summarized French Wikipedia documents by applying the T5 model. The T5 model has several advantages in generating text, so many researchers have used it for summarization, including Chouikhi (Chouikhi & Alsuhaibani, 2022), Jung (Jung et al., 2022), and La Quatra (La Quatra & Cagliero, 2022). Many studies of abstractive summarization face challenges such as unstructured text, which must be transformed with text preprocessing techniques; Widyassari (Widyassari et al., 2022) and Christian (Christian et al., 2016) used text preprocessing as part of the document summarization process. Another model that can be used for summarizing text is the convolutional Seq2Seq model. Many researchers have used it, including Liang (Liang et al., 2020) and Shi (Shi et al., 2021), who produced structured text summary models. Little work has been done on combining text summary models, but many implementations use word representation features such as GloVe and Word2vec, as in the work of Lubis (Lubis, Nasution, et al., 2022), Zhou (Xi et al., 2020), Zhang (Cheng et al., 2020), and Wu (Xu et al., 2022). The dataset used to implement the model must consist of structured sentences that can be understood by machines (El-Kassas et al., 2021) (Allahyari et al., 2017) (Zhang et al., 2018) (Lubis, Nasution, et al., 2022).
In this research, a structured data collection is used to improve the quality of text summaries, because structured data are of better quality and more consistent than unstructured data. Structured data have usually undergone a planned process of preparation and organization, which allows the model to understand them better (Lubis, Prayudani, et al., 2022). Unstructured data require text processing to improve the quality of the text before it can be used as a dataset for text summary research. The T5 model has an architecture that can be used in developing text summaries (Ranganathan & Abuka, 2022) (Ramesh et al., 2022), while the convolutional Seq2Seq model offers parameter tuning that can improve the quality of summaries (Fendji et al., 2021). Many related studies have implemented text summary, text classification, and clustering models, as shown in Figure 1. Figure 1 shows that text summarization has been developed in combination with classification, clustering, and transfer learning models, as in the work of Gupta (Gupta et al., 2022) and Zolotareva (Zolotareva et al., 2020).

Research Methods
This study proposes an architecture for text summarization in which the T5 model is combined with the convolutional Seq2Seq model to produce more effective summaries. The model is applied to Indonesian-language documents. The research architecture that combines T5 with the convolutional Seq2Seq model is shown in Figure 2. The dataset comes from abstracts in final assignment collections and scientific work publications at the Medan State Polytechnic. It contains 1000 articles, each with an abstract column and a summarization column. After data collection, the data are cleaned using text preprocessing: case folding, tokenizing, stopword removal, and stemming. The data are then divided into training data (80% of the total) and testing data (20%).
As Figure 2 shows, in the proposed architecture documents are processed by the T5 encoder, which produces a large-volume text representation. This representation is passed to the convolutional Seq2Seq decoder, which forms the tokens used to train the combined T5 and convolutional Seq2Seq model. Training combines values for the learning rate, layers and dimensions, and batch size, together with dropout, a technique for avoiding overfitting of the summary model. The study iterates over combinations of all these parameters to produce an optimal summary model. After training, the combined T5 and convolutional Seq2Seq summary model is tested to assess its performance, using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). The study uses the ROUGE variants ROUGE-N and ROUGE-L.
The T5 model follows this process sequence:
a) The T5 model has a basic structure of encoder and decoder layers: the encoder decomposes the input text into an internal representation, and the decoder produces the output.
b) The T5 model follows a text-to-text concept, in which every language task is handled by converting input text into output text.
c) The T5 model can then be fine-tuned to improve the model that has already been trained.
The Seq2Seq model, meanwhile, addresses the data sequence problem with the following process:
a) The Seq2Seq model has an encoder and a decoder: the encoder converts the input into a vector representation, and the decoder produces the text output.
b) The Seq2Seq model applies an attention mechanism so that the summary model focuses on the important parts of the text, which improves the quality of the summary.
c) Model training is carried out using a supervised learning approach.
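The four preprocessing settings named in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stopword list and the suffix list are hypothetical and deliberately tiny, and the stemmer is a toy suffix stripper, whereas real Indonesian stemming also handles prefixes and normally relies on a dictionary.

```python
import re

# Hypothetical, very small stopword list used only for illustration.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "pada", "ini", "itu"}

# Hypothetical suffixes for a toy stemmer (assumption, not from the paper).
SUFFIXES = ("kan", "nya", "an", "i")


def case_fold(text):
    """Case folding: convert all letters to lowercase."""
    return text.lower()


def tokenize(text):
    """Tokenizing: split already lowercased text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text)


def remove_stopwords(tokens):
    """Stopword removal: drop tokens found in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Toy stemming: strip the first matching suffix, keeping at least 4 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply case folding, tokenizing, stopword removal, and stemming in order."""
    tokens = tokenize(case_fold(text))
    return [stem(t) for t in remove_stopwords(tokens)]


print(preprocess("Penelitian ini menggunakan Model T5 dan Seq2Seq"))
# ['peneliti', 'mengguna', 'model', 't5', 'seq2seq']
```

Note that case folding discards the uppercase and lowercase distinction before the model sees the text, so its effect on the summaries is worth checking separately.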

Results and Discussions
This research uses the Python programming language, whose libraries offer many features for text summarization. The study uses Hugging Face, an open machine learning platform that provides models trained on large amounts of data. The T5 and convolutional Seq2Seq models are optimized with Hugging Face so that they can be trained and adapted to the needs of the task. In this study, Indonesian-language documents are summarized by filtering the features of the model.

a) Dataset
The size of the dataset is an important factor in producing good summaries of Indonesian-language documents, so this research uses an Indonesian-language dataset whose corpus contains many features. The dataset is derived from abstracts in final assignment collections and scientific work publications at the Medan State Polytechnic. The data in Table 1 contain 1000 complete articles with abstract columns and summary columns.
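The 80%/20% division into training and testing data described in the Research Methods section can be sketched as follows. This is a minimal illustration under stated assumptions: the records are placeholders rather than the actual abstracts, the function name `split_dataset` and the random seed are hypothetical, and the paper does not say whether the data were shuffled before splitting.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for the 1000 abstract/summary pairs.
records = [
    {"abstract": f"abstract {i}", "summary": f"summary {i}"}
    for i in range(1000)
]

train_data, test_data = split_dataset(records)
print(len(train_data), len(test_data))  # 800 200
```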

b) ROUGE Evaluation
This study uses ROUGE to assess how well the model summarizes text. ROUGE compares the generated summaries with the reference summaries contained in the data. The ROUGE evaluation scenarios are shown in Table 2. The evaluation with ROUGE produces the data shown in Table 3. From Table 3, the score is calculated by counting the number of unigrams in the reference text, the number of unigrams in the summary text, and the number of matching unigrams. The results in Table 3 are those of summarizing Indonesian-language documents with the T5 and convolutional Seq2Seq models, and they are assessed using the ROUGE metrics. Table 4 shows ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE-1 is based on unigrams and is considered optimal when its value is greater than 0.4. ROUGE-2 is based on bigrams and is considered optimal above 0.2, because it measures only bigram similarity. ROUGE-L, which is based on the longest common subsequence, is considered optimal above 0.4. On the four documents tested, this study obtains an average ROUGE-1 of 0.8, an average ROUGE-2 of 0.83, and an average ROUGE-L of 0.8, all of which are optimal values for the summary model. The following is the calculation from ROUGE-1.
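As a rough illustration of how ROUGE-1, ROUGE-2, and ROUGE-L are obtained from n-gram counts, the sketch below computes precision, recall, and F1 in plain Python. It rests on several assumptions: tokens are split on whitespace, the example sentences are invented, and all three values are returned because the paper does not state whether it reports recall or an F-measure. It is not the evaluation code used in the study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(overlap, candidate_total, reference_total):
    """Precision, recall, and F1 from an overlap count and two totals."""
    precision = overlap / candidate_total if candidate_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(candidate, reference, n):
    """ROUGE-N: clipped n-gram overlap between candidate and reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((cand & ref).values())  # min count per shared n-gram
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: based on the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    return prf(lcs_length(cand, ref), len(cand), len(ref))


# Invented example sentences (not taken from the dataset).
reference = "model t5 menghasilkan ringkasan teks yang baik"
candidate = "model t5 menghasilkan ringkasan yang baik"

print("ROUGE-1:", rouge_n(candidate, reference, 1))
print("ROUGE-2:", rouge_n(candidate, reference, 2))
print("ROUGE-L:", rouge_l(candidate, reference))
```

In this example the candidate has 6 unigrams, the reference has 7, and 6 match, so ROUGE-1 precision is 6/6 and recall is 6/7.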

c) BLEU Evaluation (Bilingual Evaluation Understudy)
This study also uses the BLEU (Bilingual Evaluation Understudy) evaluation, which measures the difference between the generated summary and the reference summary in the dataset, in order to see whether the resulting model works optimally. The BLEU evaluation results are as follows. Based on Figure 3 and Figure 4, the BLEU adequacy evaluation produces a value of 0.5, and the BLEU fluency evaluation also produces 0.5. These results indicate good performance, given that BLEU scores range from 0 to 1.
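For orientation, a sentence-level BLEU score can be sketched in plain Python as the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. This is an assumption-laden illustration: it uses a single reference, whitespace tokens, no smoothing, and a maximum n-gram order of 2 (the usual default is 4), and the example sentences are invented. It does not reproduce the adequacy and fluency scores reported above, whose computation the text does not describe.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU with one reference and no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        clipped = sum((cand_counts & ref_counts).values())  # clipped matches
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Invented example sentences (not taken from the dataset).
reference = "model t5 menghasilkan ringkasan teks yang baik"
candidate = "model t5 menghasilkan ringkasan yang baik"

score = bleu(candidate, reference)
print(round(score, 4))
```

The score always lies between 0 and 1, with 1 meaning the candidate is identical to the reference.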

Discussion
The results obtained from the T5 and convolutional Seq2Seq models on Hugging Face reveal features that can affect text summaries, such as uppercase and lowercase letters, which change how the text of a document is understood. The ROUGE evaluation shows good agreement with the reference summaries in the dataset, with an average value of 0.83 for ROUGE-1, 0.85 for ROUGE-2, and 0.8275 for ROUGE-L. The BLEU evaluation produces a value of 0.5. This indicates that the combined T5 and convolutional Seq2Seq model achieves a better ROUGE value than the stand-alone models. The results indicate that the combination model can outperform other text summary models, although a comparison step is still required when summarizing Indonesian-language documents.

Conclusion
The aim of this study is to analyze the combined use of the T5 model and the convolutional Seq2Seq model in a text summarization task. Model performance is demonstrated with ROUGE and BLEU values, using the ROUGE variants ROUGE-N and ROUGE-L. The study obtains a ROUGE-N value of 0.8 and a ROUGE-L value of 0.8, which means that the combined T5 and convolutional Seq2Seq model summarizes text optimally. The research focuses on implementing the model by training on Indonesian-language data drawn from abstracts in final assignment collections and scientific work publications at the Medan State Polytechnic, followed by the encoder and decoder process and a combination of parameters such as the number of layer dimensions, the batch size, and the learning rate. This research is expected to be extended to multilingual use and to the use of machine translation for changing the language of the resulting summaries. However, the research has a weakness concerning the uppercase and lowercase letters contained in the dataset, so further research could increase the amount of data and analyze feature utilization.

Fig. 1. Development of Text Summaries in the Field of Natural Language Processing