The classification process helps with the categorization of the dataset into different classes. A machine learning model enables you to:
- Frame the problem,
- Collect the data,
- Add the variables,
- Train the model,
- Measure the performance,
- Improve the model with the help of cost function.
But how can we measure the performance of a model? By comparing the predicted and actual model? However, this will not solve the classification problem. A confusion matrix can help you to analyze the data and solve the problem. Let’s understand how this technique helps the machine learning model.
The confusion matric technique helps with performance measurement for machine learning classification. With this type of model, you can distinguish and classify the model with the known true values on the set of test data. The term confusion matrix is straightforward yet confusing. This article will simplify the concept so you can easily understand and create a confusion matrix on your own.
Calculating the Confusion Matrix
Follow these simple steps to calculate the confusion matrix for data mining:
Estimate the outcome values of the dataset.
Test the dataset with the help of expected output.
Predict the rows in your test dataset.
Calculate the expected outcomes and predictions. You need to consider the:
- Total correct predictions of the class
- Total incorrect predictions of the class
After performing these steps, you need to organize the numbers in the below methods:
- Link each row of the matrix with predicted class
- Correspond each column of the matrix with the actual class
- Enter the correct and incorrect classification of the model into the table
- Include the total of the correct predictions into the predicted column. Also, add the class value in the expected row.
- Include the total of the incorrect predictions into the expected row and class value in the predicted column.
Understanding the Outcome in a Confusion Matrix
The actual and predicted values are the same. The predicted value of the model is positive, along with an actual positive value.
The actual and predicted values are the same. The predicted value of the model is negative, along with an actual negative value.
False Positive (Type 1 Error)
The actual and predicted values are not the same. The predicted value of the model is positive and falsely predicted. However, the actual value is negative. You can refer to this as a Type 1 error.
False Negative (Type 2 Error)
The actual and predicted values are not the same. The predicted value of the model is negative and falsely predicted. However, the actual value is positive. You can refer to this error as a Type 2 error.
Importance of Confusion Matrix
Before answering the question, we should understand the hypothetical classification problem. Suppose you are predicting the number of people infected with the virus before showing symptoms. This way, you can easily isolate them and ensure a healthy population. We can choose two variables to define the target population: Infected and non-infected.
Now you might think, why use a confusion matrix when the variables are too straightforward. Well, this technique helps with the accuracy of the classification. The data in this example is the imbalanced dataset. Let’s suppose we have 947 negative data points and three positive data points. Now we will calculate the accuracy with this formula:
With the help of the following table, you can check the accuracy:
The total output values will be:
TP = 30, TN = 930, FP = 30, FN = 10
So you can calculate the accuracy of the model as:
96% accuracy for a model is incredible. But you can only generate the wrong idea from the outcome. According to this model, you can predict the infected people 96% of the time. However, the calculation predicts that 96% of the population will not become infected. However, sick people are still spreading the virus.
Does this model look like a perfect solution for the problem, or should we measure the positive cases and isolate them to stop the spread of the virus. Therefore we use a confusion matrix to solve these types of problems. Here are some benefits of the confusion matrix:
- The matrix helps with the classification of the model while making the predictions
- This technique signifies the type and the insight of the errors so you can understand the case easily
- You can overcome the restriction with the accurate classification of the data
- The columns of the confusion matrix will represent the instances of the predicted class
- Each row will indicate the instances of the actual class
- The confusion matrix will highlight the errors that the classifier
Confusion Matrix in Python
Now, as you know the concept of the confusion matrix, you can practice using the following code in Python with the help of the Scikit-learn library.
|# confusion matrix in sklearn|
|# actual values|
|actual = [1,0,0,1,0,0,1,0,0,1]|
|# predicted values|
|predicted = [1,0,0,1,0,0,0,1,0,0]|
|# confusion matrix|
|matrix =confusion_matrix(actual,predicted, labels=[1,0])|
|print(‘Confusion matrix : \n’,matrix)|
|# outcome values order in sklearn|
|tp, fn, fp, tn=confusion_matrix(actual,predicted,labels=[1,0]).reshape(-1)|
|print(‘Outcome values : \n’, tp, fn, fp, tn)|
|# classification report for precision, recall f1-score and accuracy|
|print(‘Classification report : \n’,matrix)|
The confusion matrix helps with the restriction in the accuracy of the classification method. Also, it highlights important details about different classes. Furthermore, it analyzes the variables and the data so you can compare actual data with the prediction.