STANDARDSCALER'S POTENTIAL IN ENHANCING BREAST CANCER ACCURACY USING MACHINE LEARNING

Breast cancer is a major cause of death. Many studies have reported that machine learning techniques are effective in diagnosing breast cancer, and these algorithms have also been used to estimate a person's likelihood of surviving the disease. In this study, we employed machine learning algorithms to predict breast cancer, with the aim of increasing prediction accuracy. A dataset of 569 breast cancer records was obtained from Kaggle. The algorithms used are K-Nearest Neighbor (KNN), Random Forest (RF), Gradient Boosting (GB), Gaussian Naive Bayes (GNB), Support Vector Machine (SVM), and Logistic Regression (LR). Before the algorithms were trained and tested, StandardScaler was applied to transform the training and test data to improve algorithm performance. With this step, the models achieved high accuracy. The best results were obtained with Logistic Regression, with an accuracy of 99%. Precision was 99% for benign and 100% for malignant cases, recall was 100% for benign and 98% for malignant cases, and the F1-score was 99% for both classes. It is hoped that this research can help medical practitioners determine the next step in dealing with breast cancer.


Introduction
Breast cancer has been the cause of numerous fatalities. According to the WHO, approximately 1.5 million women worldwide are affected by breast cancer every year. Breast carcinoma, one of the best-known malignancies, was first described in Egypt around 1600 BC (Monirujjaman Khan et al., 2022). In Indonesia, breast cancer is the breast disease that most often results in death among women (Widiana & Irawan, 2020). Of Indonesia's 396,914 new cancer cases, 65,858 (16.6%) were breast cancer (Fadilah et al., 2022), and more than 22,000 deaths were reported during the same period. Yet nearly 43% of cancer-related deaths are preventable when patients regularly practice early detection and minimize cancer-causing risk factors. Breast cancer can be found through tumors, which are classified as either malignant or benign (Mekha & Teeyasuksaet, 2019). Doctors must use active diagnostic strategies to find aggressive malignancies, but even for experts it is quite difficult to detect cancer (Hughes et al., 2022). Therefore, automated cancer detection techniques are required. Studies have often used Machine Learning (ML) techniques to predict a person's likelihood of surviving cancer, and these algorithms appear to be more effective at detecting carcinoma (Nozomi et al., 2022). Usually, accurate diagnosis depends on the experience and knowledge of the doctor (Chakraborty et al., 2019). These skills are developed over years of confirming diagnoses and observing adverse effects in many individuals. Even so, reliability cannot be guaranteed. Meanwhile, processing technology has advanced (Klein et al., 2021).
Large amounts of data can now be collected and stored with relative ease, for example in specialized electronic databases of patient data (Cios & William Moore, 2002). Health professionals would not be able to interpret such large databases without the help of computers, especially when performing substantial data analysis (van der Niet & Bleakley, 2021). Misclassifying severe tumors can also keep some patients from receiving the required care (Tazin et al., 2021). The precise diagnosis and classification of breast cancer into benign and malignant categories is therefore a much-debated scientific issue. In recent decades, ML approaches have been widely used to recognize breast cancer and to infer new insights from data patterns, and their use to categorize and model breast cancer is well known (Amrane et al., 2018). These methods identify hidden patterns and regularities in different data sets. Many strategies exist for identifying patterns, paradigms, and relationships in data sets and for developing hypotheses about these relationships that can be applied to previously unseen data. Because AI is very successful at prediction and categorization, particularly in breast cancer clinical analysis, its use in the clinical field is growing rapidly (L.K. Singh et al., 2023). It is also widely used in biomedical research. After lung cancer, breast cancer is the second most common cause of cancer death in women (Faramarzi et al., 2021). As a result, it is critical to detect breast cancer at an early stage. By separating relevant facts from the information that suggests a disease, one can build expectations about the disease. This study carefully examines ML techniques to improve the accuracy of breast cancer prediction.
Researchers have developed a technique to identify malignant breast growths using a machine learning classifier (Omondiagbe et al., 2019). A machine learning model was created to differentiate between benign and malignant breast tumors using the Wisconsin Diagnostic data set (Sengar et al., 2020). Numerous studies have compared scalable approaches with conventional ML classification processes to assess the merits and prospects of ML. The results show that ML strategies, which are the product of developing and improving AI techniques as well as of the growing volume and complexity of information, offer the highest reliability (Abdulhay et al., 2018). In another study, an ensemble technique combined several models so that the expected accuracy of each classifier could be compared across different classes. This method combined SVM, NB, and J48 through a voting classifier to achieve an accuracy of 97.13%, which is higher than any individual classifier (Kumar et al., 2017). Several studies that classify breast cancer using ML models have shown good results. However, how StandardScaler can increase the accuracy of machine learning algorithms in predicting breast cancer has received less attention. Therefore, continued study of breast cancer prediction is needed so that it can help medical personnel take further action and provide appropriate treatment.

Literature Review
This section reviews earlier research on data classification for breast cancer. Some of these publications are devoted to classification schemes. The findings of earlier research are explained first.
Atban et al. (2023) used a publicly available benchmark dataset, BreakHis, for the experimental investigation of their method. Their results show that the recommended strategy, which uses a Support Vector Machine (SVM) with Gaussian and radial basis function (RBF) kernels, achieves an F-score of 97.75% for features derived from ResNet18-EO. Botlagunta et al. (2023) found that removing outliers from blood profile data significantly improves the accuracy of machine learning models. With an AUC of 0.87, the Decision Tree (DT) classifier demonstrated 83% accuracy. They then used Flask to deploy the DT classifier in a web application for reliable diagnosis of metastatic breast cancer (MBC) patients. Overall, they concluded that ML models built on blood profile data could help doctors select MBC patients who require intensive treatment to improve overall survival rates.
Egwom et al. (2022) describe a classification model for breast cancer using ML. Linear discriminant analysis (LDA) is used for feature extraction and SVM for classification. The study achieved 99.2% accuracy, 98.0% recall, and 98.0% precision on the WBCD data set, compared with 79.5% accuracy, 76.0% recall, and 59.0% precision on the WPBC data set. SVM classifiers handle classification problems better when LDA is used and missing values are imputed with the median. Bayrak et al. (2019) used two widely used machine learning algorithms to classify the Wisconsin Breast Cancer (Original) dataset. Accuracy, precision, recall, and ROC area were used to compare the classification performance of these techniques. The Support Vector Machine approach gave the best results, with the highest accuracy.
Yadavendra & Chand (2020) used various ML methods to categorize breast cancer tumors and assessed the effectiveness of several classifiers. For the classification of breast cancer tumors, the Xception technique performed better than the alternatives in terms of precision, recall, and F1 score. Assiri et al. (2020) used several ML classifiers: ensemble classification with voting mechanisms, simple LR learning, SVM learning with stochastic gradient descent optimization, and multilayer perceptron networks. Compared with state-of-the-art algorithms on the WBCD, the hard voting mechanism (majority-based voting) performed better, with 99.42%. Ara et al. (2021) divided tumors into benign and malignant categories using machine learning. To select the most accurate approach, the accuracy of each method was calculated and compared. The investigation found that SVM and RF, with 96.5% accuracy, performed better than the other classifiers. These classifiers can be used to develop automated diagnostic tools for the early diagnosis of breast cancer. Bayrak et al. (2019) classified the Wisconsin Breast Cancer Dataset (WBCD) using two popular ML methods. The classification performance of these approaches was compared using accuracy, precision, recall, and ROC area. Performance was best with the SVM method, which offered the highest accuracy.
Wu & Hicks (2021) evaluated four classification methods for training a model to characterize two types of breast cancer. Compared with the other ML algorithms evaluated, the support vector machine was able to classify breast cancer as triple negative or non-triple negative and had a more favorable classification threshold than the other three algorithms. Jabbar (2021) used ensemble learning to address the problem of categorizing breast cancer data. According to the experimental results, the new strategy outperforms existing methodologies and reaches an accuracy of 97% when classifying breast cancer data.
Zhang et al. (2022) sought to streamline the identification of normal cells versus breast cancer cells and the prediction of breast cancer subtypes by combining Raman spectroscopy with ML approaches. Raman spectra were obtained from cultured breast cancer cell lines. Principal component analysis (PCA)-discriminant function analysis (DFA) and PCA-SVM were among the machine learning techniques used to process the data. These two algorithms achieved an accuracy of more than 97% in distinguishing between breast cancer and healthy cells, and more than 92% in classifying breast cancer subtypes. Laghmati et al. (2020) trained and then tested machine learning techniques using the WBCD. Features loaded from the data set were passed through feature selection with Neighborhood Components Analysis (NCA), which reduces the number of features and the model complexity. The best predictive specificity was 98.86% for the binary SVM model, and the predictive sensitivity reached up to one for the KNN and Adaboost models. The highest prediction accuracy was 99.12% for the KNN model.

Research Methods
Breast cancer can be classified by collecting data and applying ML algorithms. Figure 1 shows the general steps of using machine learning to classify breast cancer. The research steps consisted of collecting the breast cancer patient dataset, preprocessing the data, splitting the data, transforming the data with StandardScaler, training the machine learning models, measuring performance, and predicting breast cancer using a confusion matrix. Each step has its own task in achieving the desired goal. In this research, we added a StandardScaler step to increase the measured accuracy.

Data Collection
Data collection is the practice of gathering information or data from diverse sources for analysis, investigation, or decision-making (Yang et al., 2022). Gathering data is the first step in understanding a given phenomenon or problem. The data used in this research are freely available from the Kaggle site.

Preprocessing Data
Before being used by machine learning algorithms, data must be processed. This procedure can involve eliminating unnecessary values, standardizing data, and selecting the features most relevant to classification (Maharana et al., 2022). Missing values in the data set must be identified, followed by a decision on how to treat them. Alternatives include deleting rows or columns that contain missing values [Shahidul Islam Khan] or using more complex methods such as interpolation (Wang et al., 2020). If the data set contains categorical variables, it is important to encode them as numerical values for analysis or modeling (Budholiya et al., 2022). Depending on the type of data and the algorithm used, this can be achieved with methods such as label encoding.
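The two ways of treating missing values described above can be sketched in a few lines of plain Python. This is only an illustration: the records and field names are hypothetical, and mean imputation is used as a simple stand-in for the more elaborate interpolation methods mentioned in the text.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"radius": 14.2, "texture": 20.1},
    {"radius": None, "texture": 18.4},
    {"radius": 12.0, "texture": 17.5},
]

# Option 1: drop every row that contains a missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Option 2: keep the row and fill the gap with the column mean
# (a simple stand-in for more elaborate schemes such as interpolation).
known = [r["radius"] for r in rows if r["radius"] is not None]
mean_radius = sum(known) / len(known)
imputed_rows = [
    {**r, "radius": mean_radius if r["radius"] is None else r["radius"]}
    for r in rows
]
```

Dropping rows is the safer choice when few values are missing; imputation keeps every record at the cost of introducing estimated values.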

Data Splitting
The data must be separated into two sets: training data and test data (Medin & Smith, 1981). The model is trained on the training data, and its effectiveness is evaluated on the test data (Bao et al., 2019). A technique called random splitting creates random subsets from the data set (Mamdouh Farghaly et al., 2023). For example, the dataset can be divided into 80% for training and 20% for testing. Random splitting is helpful if the data set is large enough and accurately represents the population.
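A random 80/20 split can be sketched with the Python standard library as follows. The function name, the seed, and the stand-in records are assumptions made for illustration only.

```python
import random

def random_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and cut it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))          # stand-in for 100 patient records
train, test = random_split(records)
```

Fixing the seed makes the split reproducible, which matters when results from different models are to be compared on the same test set.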

StandardScaler
StandardScaler is one of the most widely used techniques for data preprocessing and normalization in machine learning. StandardScaler transforms each numerical feature (column) in the data set so that it has a mean of zero and a standard deviation of one (G et al., 2022). Its advantage is that it keeps the numerical features in the data set on a consistent scale, which helps machine learning techniques that are sensitive to the scale of the data (de Amorim et al., 2023).
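The transformation itself is the z-score, z = (x - mean) / standard deviation, applied to each column. The following plain-Python sketch shows the computation for a single column; the function name and the feature values are hypothetical.

```python
import math

def standardize(column):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = sum(column) / len(column)
    # Population standard deviation (divide by n), which is also what
    # scikit-learn's StandardScaler uses.
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

areas = [400.0, 600.0, 800.0, 1200.0]   # hypothetical feature on a large scale
z = standardize(areas)
```

After the transformation, a feature measured in the hundreds and a feature measured in fractions contribute on the same scale to distance-based or gradient-based algorithms.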

Machine Learning Models
To solve the problem at hand, we apply several well-known ML methods: KNN, SVM, RF, GB, LR, and GNB.

K-Nearest Neighbor (KNN)
KNN is an ML technique used for classification and regression problems (Ertuğrul & Tağluk, 2017). It is an instance-based, nearest-neighbor learning algorithm. Its basic principle is to find the K nearest neighbors of a new data point in the feature space. KNN assumes that data with similar features will have similar labels. As a result, KNN considers the labels of the nearest neighbors when making a prediction for new data and chooses the majority label as the prediction (Z. Zhang et al., 2018).
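The majority-vote principle can be illustrated with a minimal sketch, assuming Euclidean distance and a small hypothetical two-dimensional data set; it is not the implementation used in this study.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points: 0 = benign, 1 = malignant.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
labels = [0, 0, 0, 1, 1, 1]
```

Because the prediction depends directly on distances, KNN is one of the algorithms for which feature scaling matters most.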

Support Vector Machine (SVM)
SVM is a frequently used ML technique for classification and regression problems (Ali et al., 2021). SVM constructs a hyperplane (dividing plane) in the feature space that maximizes the distance between samples belonging to different classes. The basic idea of SVM is to find the hyperplane that separates the two classes by the largest margin. The margin is the distance between the hyperplane and the nearest sample of each class. The hyperplane that SVM seeks is known as the maximum-margin hyperplane (Rizwan et al., 2021).
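To make the notion of margin concrete, the sketch below computes the margin of one given hyperplane with respect to a few labelled points. It does not train an SVM; the weights, bias, and samples are hypothetical values chosen for illustration.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical separating hyperplane and labelled points (+1 / -1).
w, b = (1.0, 1.0), -4.0
samples = [((1.0, 1.0), -1), ((1.5, 1.0), -1), ((3.0, 3.0), 1), ((3.0, 2.5), 1)]

# A sample is on the correct side when label * signed distance is positive;
# the margin of this hyperplane is the smallest such value.
margin = min(label * signed_distance(w, b, x) for x, label in samples)
```

An SVM solver searches over all candidate hyperplanes for the one that makes this smallest distance as large as possible.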

Random Forest (RF)
RF is an ensemble learning strategy that can be used for classification and regression. A "forest" is created by combining several separate decision trees (Vos et al., 2017). Each tree in RF is built using a random subset of the training data and a random subset of the feature set (Svetnik et al., 2003).
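The two sources of randomness described above can be sketched as follows. The sizes are hypothetical, and the sketch draws one feature subset per tree as the text describes; common implementations instead re-draw the feature subset at every split within a tree.

```python
import random

def draw_tree_inputs(n_rows, n_features, n_sub_features, rng):
    """Pick the rows (with replacement) and features (without) for one tree."""
    row_ids = [rng.randrange(n_rows) for _ in range(n_rows)]       # bootstrap sample
    feature_ids = sorted(rng.sample(range(n_features), n_sub_features))
    return row_ids, feature_ids

rng = random.Random(0)
# Hypothetical sizes: 426 training rows, 30 features, 5 features per tree.
forest_inputs = [draw_tree_inputs(426, 30, 5, rng) for _ in range(10)]
```

Because every tree sees a different sample of rows and features, the trees make partly independent errors, and averaging or voting over them reduces variance.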

Gradient Boosting (GB)
Gradient Boosting is an ensemble learning approach that combines several weak or simple prediction models to create a strong predictive model. It is often used for regression and classification problems. Gradient Boosting builds predictive models sequentially, with each subsequent model concentrating on correcting the errors of the previous one. This is done by following the gradient (derivative) of the loss function, which measures the difference between the model's prediction and the actual value in the training data. In each iteration a new model is added to the ensemble, chosen by fitting the gradient of the loss function at the current predictions. Each subsequent model strives to correct the shortcomings that the previous models did not address, so that the ensemble as a whole becomes better at predicting the right results (Licheng Zhang & Zhan, 2017).
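The sequential error-correcting idea can be shown with a small sketch. It assumes a regression setting with squared loss (where the negative gradient equals the residual) and one-split decision stumps as weak learners; the data, function names, and hyperparameters are hypothetical, and the classification variant used in this study differs in its loss function.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boost stumps under squared loss; returns the model and per-round training MSE."""
    base = sum(ys) / len(ys)
    stumps, history = [], []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    model = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    return model, history

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
model, mse_history = gradient_boost(xs, ys)
```

The training error recorded in `mse_history` does not increase from one round to the next, which reflects each new stump correcting part of what the ensemble still gets wrong.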

Logistic Regression (LR)
Logistic regression is a machine learning technique used for classification problems. Although the word "regression" is in its name, LR is used to estimate the probability of a binary outcome (e.g., class "1" or "0") from input variables or features. Logistic regression uses the logistic (sigmoid) function to represent the relationship between the input data (real numbers) and the binary output variable. The sigmoid function converts the output into a number between 0 and 1, which represents the probability of a positive outcome. A value close to 1 indicates a higher probability of a positive outcome, while a value close to 0 indicates a higher probability of a negative outcome (Tu, 1996).
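The mapping from a weighted sum of features to a probability can be sketched as follows. The coefficients and feature values are hypothetical; in practice they are learned from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical coefficients for two standardized features.
weights, bias = [2.0, -1.0], 0.5
p = predict_proba(weights, bias, [1.5, 0.2])
label = 1 if p >= 0.5 else 0
```

A threshold of 0.5 is the usual default for turning the probability into a class label, although a different threshold can be chosen when false negatives and false positives carry different costs.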

Gaussian Naive Bayes (GNB)
Gaussian Naive Bayes (GNB) is a classification algorithm in the Naive Bayes family. It is often used in machine learning for classification based on Bayes' theorem (G. Singh et al., 2019). Gaussian NB rests on the premise that the features used for classification are normally distributed (Gaussian). From the feature likelihoods observed in the training data, the algorithm computes the probability of each class (Shiri Harzevili & Alizadeh, 2018).
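A minimal sketch of this idea is given below: each class is summarized by a prior and by a per-feature mean and variance, and a new sample is assigned to the class with the highest posterior. The function names and the toy data are hypothetical, and the small variance floor is an implementation convenience.

```python
import math
from collections import defaultdict

def fit_gnb(X, y):
    """Estimate the prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        means = [sum(col) / len(col) for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9   # small floor avoids division by zero
            for col, m in zip(zip(*rows), means)
        ]
        params[label] = (len(rows) / len(X), means, variances)
    return params

def predict_gnb(params, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(prior, means, variances):
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(params, key=lambda label: log_posterior(*params[label]))

# Hypothetical two-feature samples: 0 = benign, 1 = malignant.
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 4.0), (3.2, 3.9), (2.9, 4.2)]
y = [0, 0, 0, 1, 1, 1]
params = fit_gnb(X, y)
```

Working with log probabilities avoids numerical underflow when many feature likelihoods are multiplied together.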

Performance Measurement
Various evaluation metrics are widely used to measure model performance in ML, particularly for classification. The main performance metrics are as follows:

Accuracy
Accuracy is the simplest and most popular metric for measuring how well a model classifies. It is obtained by dividing the number of correct predictions by the total number of predictions. However, when the classes are imbalanced or one class is relatively rare, accuracy is not necessarily the most informative metric. Accuracy is calculated with Equation (1).

Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP and TN are the numbers of true positives and true negatives, respectively. FP and FN are the numbers of false positives and false negatives, respectively.

Precision and Recall
Precision and recall measure performance in detecting the positive class. Precision measures the accuracy of the model's positive predictions, whereas recall assesses how well the model finds every instance of the actual positive class. Precision is the number of correct positive predictions divided by the total number of positive predictions, while recall is the number of correct positive predictions divided by the total number of actual positive instances. Precision is calculated with Equation (2), and recall with Equation (3).

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

F1-Score
The F1 score combines precision and recall into a single number. It is the harmonic mean of the two, which gives a balanced average that stays low when either value is low. The F1 score is determined by multiplying precision and recall, doubling the product, and dividing the result by the sum of precision and recall. The F1 score is calculated with Equation (4).

F1 = (2 * Precision * Recall) / (Precision + Recall)
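The three metrics can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below uses hypothetical counts that are not taken from the results of this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, not taken from the study's results.
precision, recall, f1 = precision_recall_f1(tp=50, fp=0, fn=3)
```

With no false positives the precision is perfect, but the three missed positives lower the recall and, through the harmonic mean, the F1 score as well.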

Results and Discussions
Having carried out the steps of the study, which aims to classify breast tumors as benign or malignant, we describe the results of the methods here.

Data Collection
The dataset describes breast lumps that underwent biopsy to classify them as malignant (cancerous) or benign (not cancerous). Features are computationally extracted from digital images of fine needle aspiration biopsy slides and describe the size, shape, and regularity of the cell nuclei. The mean, standard error, and worst values of each of the 10 nuclear parameters are reported, for a total of 30 features, as presented in Table 1 (Breast Cancer Wisconsin Diagnostic Dataset | Kaggle, n.d.). These 30 features are used to predict whether a case is benign or malignant. A total of 569 records are used for training and testing. The 10 nuclear parameters are: radius (the average distance from the center to points on the perimeter), texture (standard deviation of grayscale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter^2 / area - 1), concavity (severity of the concave portions of the contour), concave points (number of concave portions of the contour), symmetry, and fractal dimension ("coastline approximation" - 1). The target feature Y is a two-level factor that indicates whether a mass is malignant ("M") or benign ("B").

Preprocessing Data
At this stage, unnecessary data are cleaned. The Unnamed variable is not needed and is omitted so that it does not interfere with the classification process. The target variable Y is also renamed so that the target of this dataset is clearly visible. Most importantly, the target data, which are still categorical, are converted into numerical values so that the ML classification process can run well. Figure 2 shows the dataset after preprocessing. The target variable Y has been renamed to BreastCancer, and its values have been converted into numerical data: "B" (benign) becomes 0 and "M" (malignant) becomes 1. After this, the dataset can be processed further at the data splitting stage. Before that, Figure 3 shows the distribution of the target variable. Benign cases account for 62.7% of the records, more than the malignant cases at 37.3%.
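These preprocessing steps can be sketched in plain Python. The rows below are hypothetical, the column names "Y" and "Unnamed" follow the text and may differ from those in the actual file, and the script used in this study may be organized differently.

```python
# Hypothetical raw rows as they might come out of the CSV file; the column
# names "Y" and "Unnamed" follow the text and are assumptions.
raw_rows = [
    {"radius_mean": 17.99, "Y": "M", "Unnamed": None},
    {"radius_mean": 13.54, "Y": "B", "Unnamed": None},
    {"radius_mean": 20.57, "Y": "M", "Unnamed": None},
]

TARGET_CODES = {"B": 0, "M": 1}   # benign -> 0, malignant -> 1

def preprocess(row):
    """Drop the unused column, rename the target, and encode it numerically."""
    cleaned = {k: v for k, v in row.items() if k not in ("Unnamed", "Y")}
    cleaned["BreastCancer"] = TARGET_CODES[row["Y"]]
    return cleaned

rows = [preprocess(r) for r in raw_rows]
malignant_share = sum(r["BreastCancer"] for r in rows) / len(rows)
```

Encoding malignant as 1 makes the mean of the target column equal to the share of malignant cases, which is a quick way to check the class distribution.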

Data Splitting
In the data splitting stage, the training data are separated from the test data. The training data comprise 75% of the dataset and the test data comprise 25%. The classification can then be carried out by the ML models in the next stage.

StandardScaler
The scikit-learn (sklearn) library in Python contains the StandardScaler implementation. Figure 4 shows the script we used to apply StandardScaler after the data were split. StandardScaler is used to transform both the training data and the test data, so that the ML algorithms achieve better performance.
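A minimal sketch of this step with scikit-learn is shown below; it is not the script from Figure 4. It assumes the copy of the Wisconsin Diagnostic data bundled with scikit-learn rather than the Kaggle file, and the random seed, stratification, and iteration limit are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# scikit-learn ships a copy of the Wisconsin Diagnostic data (569 rows, 30 features).
# Note: in this copy 0 = malignant and 1 = benign, the reverse of the encoding in the text.
X, y = load_breast_cancer(return_X_y=True)

# 75% training / 25% test; random_state and stratify are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
test_accuracy = model.score(X_test_scaled, y_test)
```

Fitting the scaler on the training data and only applying it to the test data keeps information about the test set from leaking into the model. The exact accuracy depends on the random split.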

Performance Measurement
Accuracy, precision, recall, and F1-score were measured for each ML method. Table 2 presents the performance measurement results of the ML algorithms. The KNN algorithm shows a precision of 0.96 for benign and 0.98 for malignant, a recall of 0.99 for benign and 0.93 for malignant, and an F1-score of 0.97 for benign and 0.95 for malignant; its accuracy is 0.97. The SVM algorithm shows a precision of 0.98 for benign and 1.00 for malignant, a recall of 1.00 for benign and 0.96 for malignant, and an F1-score of 0.99 for benign and 0.98 for malignant; its accuracy is 0.99. The RF algorithm shows a precision of 0.97 for benign and 1.00 for malignant, a recall of 1.00 for benign and 0.95 for malignant, and an F1-score of 0.98 for benign and 0.97 for malignant; its accuracy is 0.98. The GB algorithm shows a precision of 0.96 for benign and 1.00 for malignant, a recall of 1.00 for benign and 0.93 for malignant, and an F1-score of 0.98 for benign and 0.98 for malignant; its accuracy is 0.98. The LR algorithm shows a precision of 0.99 for benign and 1.00 for malignant, a recall of 1.00 for benign and 0.98 for malignant, and an F1-score of 0.99 for benign and 0.99 for malignant; its accuracy is 0.99. The GNB algorithm shows a precision of 0.96 for benign and 1.00 for malignant, a recall of 1.00 for benign and 0.93 for malignant, and an F1-score of 0.98 for benign and 0.96 for malignant; its accuracy is 0.97.

Breast Cancer Prediction
A confusion matrix is used to show the prediction results of each algorithm in this study. Figure 5 shows the confusion matrices of the ML algorithms. The KNN algorithm made 138 correct predictions and 5 incorrect ones. The SVM algorithm made 141 correct predictions and 2 incorrect ones. The RF algorithm made 140 correct predictions and 3 incorrect ones. The GB algorithm made 139 correct predictions and 4 incorrect ones. The LR algorithm made 142 correct predictions and 1 incorrect one. The GNB algorithm made 139 correct predictions and 4 incorrect ones. Table 3 gives an accuracy-based comparison of the proposed approach with several prior works. The accuracy values in Table 3 show that the proposed method improves on previous research. In descending order, the highest accuracy is 99.3%, obtained by the method proposed in this research, followed by 99.2% in Egwom et al. (2022) and 99.1% in Khandezamin et al. (2020).
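As a check on the arithmetic, the accuracy of each model can be recomputed from the correct and incorrect prediction counts reported above, using the accuracy formula of Equation (1). The variable names below are illustrative.

```python
# (correct, incorrect) prediction counts as reported in the text for Figure 5.
counts = {
    "KNN": (138, 5),
    "SVM": (141, 2),
    "RF": (140, 3),
    "GB": (139, 4),
    "LR": (142, 1),
    "GNB": (139, 4),
}

# Accuracy = correct predictions / all predictions (Equation 1).
accuracy = {name: correct / (correct + wrong) for name, (correct, wrong) in counts.items()}
test_set_sizes = {correct + wrong for correct, wrong in counts.values()}
best_model = max(accuracy, key=accuracy.get)
```

Every pair of counts sums to 143 predictions, which is consistent with a 25% test split of 569 records, and the LR counts give 142/143, or about 99.3%.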

Conclusion
The breast cancer diagnosis dataset that we tested yielded excellent classification results. Accuracy, precision, recall, and F1-score are all very satisfactory, showing how reliable the classifiers are. The six algorithms achieved an average accuracy above 96%. The use of StandardScaler (SS) in this study has an impact on the performance and predictions of the ML algorithms. The recall for benign diagnoses and the precision for malignant diagnoses reach almost 100% on average. The highest accuracy in this study, 99.3%, was obtained with the LR algorithm. This shows that testing breast cancer diagnosis datasets with ML produces excellent performance and prediction, with the help of SS to transform the training and test data. This testing can therefore help medical practitioners in the follow-up of patients with breast cancer.

Table 1 - Breast Cancer Dataset

Table 3 - Accuracy-Based Comparison of the Suggested Approach with a Few Prior Works