
Cross-validation is a statistical method for evaluating machine learning models. It is widely used in applied machine learning to compare and select models, because it is easy to understand and implement and its estimate of a model's skill generally has lower bias than that of simpler methods such as a single train/test split. This article will help you understand the concept of k-fold cross-validation and how you can evaluate a machine learning model using this technique.

K-Fold Cross-Validation

In k-fold cross-validation, the data set is split into k parts of roughly equal size, called folds. Each fold serves as the test set exactly once. Let's understand the concept with the help of 5-fold cross-validation, where k = 5. In this scenario, the method splits the data set into five folds. In the first iteration, the first fold is used to test the model and the remaining four folds are used to train it. In the second iteration, the second fold is the test set and the other folds are used for training. The same process repeats until each of the five folds has been used as the test set.
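The splitting described above can be sketched with scikit-learn's KFold on a toy array. The ten-element array and the variable names are assumptions made for illustration; they are not part of the housing example used later.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, so each of the five folds holds exactly two of them.
data = np.arange(10)

# Without shuffling, the folds are consecutive blocks of the data.
kfold = KFold(n_splits=5)

folds = []
for iteration, (train_index, test_index) in enumerate(kfold.split(data), start=1):
    folds.append((train_index, test_index))
    print(f"Iteration {iteration}: test={data[test_index]}, train={data[train_index]}")
```

Every sample appears in a test set exactly once, and in a training set in the other four iterations.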

K-Fold Cross-Validation in Machine Learning

Machine learning models predict either discrete values (classification) or continuous values (regression). For those predictions to be useful, the model must neither underfit nor overfit the data. Underfitting and overfitting are two major concepts in machine learning: they describe how well a trained model predicts data it has not seen before. Both are influenced by the model's hyperparameter values, and cross-validation is a way to check the performance and behavior of the algorithm for a given choice of those values.

Underfitting in Machine Learning

A model generates accurate predictions on new data when it fits the underlying patterns of the training data set well. An underfit model does not. If the training process is inadequate, or the algorithm is too simple for the data, the model fails to capture important patterns in the data set and cannot produce adequate predictions.

Stopping the training process too early is one cause of underfitting. It indicates that the model needs more training time to learn from the data. An underfit model performs poorly on new data, and typically on the training data as well, so its results are of little use.

Overfitting in Machine Learning

Overfitting is the opposite of underfitting. Instead of learning only the underlying pattern, the model learns the training data too closely, including its noise. As a result, the model fails to generalize to new data. Noise here means irrelevant or random variation in the training data that distorts the predictions the model makes when it encounters new data.
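A minimal sketch of both conditions, under assumptions that are not part of this article's housing example: the data is a synthetic noisy sine curve, and the models are polynomials of degree 1, 4, and 10 fitted with NumPy. Comparing the error on the training points with the error on held-out points is the basic signal cross-validation relies on.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic data (an assumption for illustration): a sine curve plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Alternate points between the training set and the test set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

train_mse, test_mse = {}, {}
for degree in (1, 4, 10):
    # Least-squares polynomial fit on the training points only.
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse[degree] = float(np.mean((poly(x_test) - y_test) ** 2))
    print(f"degree={degree}: train MSE={train_mse[degree]:.4f}, test MSE={test_mse[degree]:.4f}")
```

The straight line (degree 1) is too simple for a sine curve, so its error is high on both sets, which is the underfitting pattern. Training error can only go down as the degree grows, so a low training error paired with a noticeably higher test error is the usual sign of overfitting.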

Evaluating an ML model using K-Fold Cross-Validation

Below, we will evaluate a simple regression model using the k-fold cross-validation technique. We will perform 10-fold cross-validation in this example.

Importing libraries

The first step is to import all the libraries that you require to perform this cross-validation technique on a simple machine learning model.

import numpy as np
import pandas
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

These libraries will help to perform different tasks:

NumPy – helps you perform scientific computation.

Pandas – helps you manipulate data structures easily.

Sklearn (scikit-learn) – is a machine learning library that you can use with Python.

Reading the data set

Now, you will read the data set that you will use. We will use pandas to load it into a data frame.

dataset = pandas.read_csv('housing.csv')

Pre-processing

Identify the output variables and the features of our dataset.

X = dataset.iloc[:, 0:13]
y = dataset.iloc[:, 13]

According to the above code, the columns at index 0 through 12 are the features, and the column at index 13 is the dependent variable, that is, the output of the model. Now we can apply a preprocessing technique: MinMax scaling, which normalizes the data set.
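Positional column selection with iloc is easy to get wrong, so here is a small sketch on a made-up data frame. The frame and its column names (col_0 to col_13) are hypothetical stand-ins, since the actual columns of housing.csv are not shown in this article.

```python
import numpy as np
import pandas

# Hypothetical stand-in for the housing data: 5 rows, 14 numbered columns.
toy = pandas.DataFrame(
    np.arange(70).reshape(5, 14),
    columns=[f"col_{i}" for i in range(14)],
)

features = toy.iloc[:, 0:13]        # slice: columns 0 through 12 (the end is exclusive)
two_columns = toy.iloc[:, [0, 12]]  # list: only column 0 and column 12
target = toy.iloc[:, 13]            # single integer: one column, returned as a Series

print(features.shape, two_columns.shape, target.shape)
```

A slice selects a contiguous range of columns, while a list selects only the positions it names, so the two forms return very different feature sets.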

scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

With the help of this technique, you can re-scale the data to a specific range. In this example, the range is 0 to 1. Scaling puts all the features on a comparable scale, so that features with large numeric values do not dominate the final prediction.
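To make the re-scaling concrete, the sketch below applies the min-max formula (x - min) / (max - min) by hand to each column and compares the result with MinMaxScaler. The three-row array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [5.0, 500.0]])

# Manual min-max scaling, column by column: (x - min) / (max - min)
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, both columns run from 0 to 1 even though the raw values differed by two orders of magnitude.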

K-Fold CV

Now, we will start the validation process with the following code:

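scores = []
best_svr = SVR(kernel='rbf')
# random_state only matters when shuffle=True; recent scikit-learn versions
# raise an error if it is set while shuffle=False, so it is left out here.
cv = KFold(n_splits=10, shuffle=False)
for train_index, test_index in cv.split(X):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)

    # Split this iteration's data into training and test portions.
    # X is a NumPy array after scaling; y is still a pandas Series.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Train on nine folds and record the R^2 score on the held-out fold.
    best_svr.fit(X_train, y_train)
    scores.append(best_svr.score(X_test, y_test))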
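The per-fold scores are usually summarized by their mean and spread. The sketch below shows one way to do that, and also checks the manual loop against scikit-learn's cross_val_score helper. Because housing.csv is not included here, it assumes synthetic data from make_regression with 13 features; the names X_demo, y_demo, manual_scores, and helper_scores are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (an assumption): 200 rows, 13 features.
X_demo, y_demo = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_demo = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

cv = KFold(n_splits=10)
model = SVR(kernel="rbf")

# Manual loop: fit on nine folds, score (R^2) on the held-out fold.
manual_scores = []
for train_index, test_index in cv.split(X_demo):
    model.fit(X_demo[train_index], y_demo[train_index])
    manual_scores.append(model.score(X_demo[test_index], y_demo[test_index]))

# The same evaluation through the built-in helper.
helper_scores = cross_val_score(SVR(kernel="rbf"), X_demo, y_demo, cv=cv)

print("Mean R^2:", np.mean(manual_scores))
print("Std of R^2:", np.std(manual_scores))
print("Matches cross_val_score:", np.allclose(manual_scores, helper_scores))
```

Reporting the mean together with the standard deviation shows both the typical score and how much it varies from fold to fold.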

Conclusion

K-fold cross-validation does not improve the model by itself; it gives a more reliable estimate of how the model will perform on unseen data. Because every sample is used for testing exactly once, the model's score depends much less on how we happened to choose the test and training sets. The method divides the data set into k subsets and repeats the holdout method k times, using a different subset as the test set each time.