This article discusses how the SMOTE module helps increase the number of underrepresented cases in the dataset used to train a machine learning model. SMOTE is a widely used method that lets you increase the number of rare cases by generating new ones instead of duplicating existing ones.

When you have an imbalanced dataset, you can connect the dataset to the SMOTE module before training the model. There may be numerous reasons for an imbalanced dataset. Maybe the target category is rare in the population, or the data is difficult to collect. SMOTE helps you compensate for the under-represented class. The output of the module contains the original samples as well as additional ones. These new samples are synthetic minority samples. You need to define how many synthetic samples to generate before applying the technique.

What is Imbalanced Data?

When the classes in a dataset are not represented equally, you can refer to it as imbalanced data. Imbalance arises in classification tasks and causes various problems in the output of the model. For instance, suppose you have 100 cases in a binary classification problem. Class-1 includes 80 labeled instances, and the remaining 20 labeled instances are in Class-2. This is a simple example of an imbalanced dataset. The ratio of Class-1 to Class-2 instances is 4:1.
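As a quick check of the arithmetic in this example, here is a minimal base R sketch. The `labels` vector is hypothetical data built to match the 80/20 split described above.

```r
# Hypothetical labels matching the example: 80 cases in Class-1, 20 in Class-2
labels <- factor(c(rep("Class-1", 80), rep("Class-2", 20)))

class_counts <- table(labels)                             # number of cases per class
imbalance_ratio <- max(class_counts) / min(class_counts)  # majority / minority

print(class_counts)
print(imbalance_ratio)  # 4, i.e. a 4:1 ratio
```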

The class imbalance problem is very common, whether you talk about real-world datasets or Kaggle competitions. Most real-world classification problems include some level of class imbalance. This usually happens when few data instances are available for one of the classes. Therefore, it is essential to choose the correct evaluation metric for the model. If you train a model on an imbalanced dataset and judge it with an unsuitable metric such as plain accuracy, the reported outcome can be misleading. And if you then use this model to solve a real-life problem, the result will be of little value.
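To see why the choice of metric matters, consider this small base R sketch. It assumes a hypothetical 80/20 dataset and a trivial "model" that always predicts the majority class. The variable names are illustrative only.

```r
# Hypothetical ground truth: 80 majority cases (0) and 20 minority cases (1)
actual <- c(rep(0, 80), rep(1, 20))

# A trivial "model" that always predicts the majority class
predicted <- rep(0, length(actual))

accuracy <- mean(predicted == actual)
# Recall for the minority class: share of true minority cases that were found
minority_recall <- sum(predicted == 1 & actual == 1) / sum(actual == 1)

print(accuracy)         # 0.8
print(minority_recall)  # 0
```

The accuracy looks respectable, yet the model never detects a single minority case. A metric that looks at the minority class, such as recall, exposes the problem.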

Class imbalance is unavoidable in some situations. A good example is a dataset of fraudulent and non-fraudulent transactions. You will find far fewer fraudulent transactions than non-fraudulent ones. This is where the problems arise.

What is SMOTE?

SMOTE (Synthetic Minority Over-sampling Technique) is a technique that you can use for oversampling data. It creates new synthetic examples instead of oversampling with replacement. To oversample the minority class, SMOTE takes each minority class sample and introduces synthetic examples along the line segments that join it to any or all of its k nearest minority class neighbors. The neighbors are chosen at random from the k nearest neighbors. The number chosen depends on the amount of over-sampling that is needed.
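The core step, placing a synthetic example on the line segment between a minority sample and one of its neighbors, can be sketched in base R as follows. The two points and the names `x_i`, `x_nn`, and `gap` are assumptions made for illustration. This is not code from any SMOTE library.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Two hypothetical minority-class samples with two features each
x_i  <- c(1.0, 2.0)   # the sample being oversampled
x_nn <- c(3.0, 6.0)   # one of its k nearest minority neighbors

gap <- runif(1)                     # random position along the segment, between 0 and 1
x_new <- x_i + gap * (x_nn - x_i)   # synthetic sample on the line between the two points

print(x_new)
```

Because `gap` lies between 0 and 1, every feature of `x_new` falls between the corresponding features of the two original samples.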

The primary function of SMOTE is to construct new minority class samples, and the algorithm for making them is simple. As you may know, oversampling by repeating instances can cause overfitting, because the decision boundary gets even tighter around the repeated points. You can solve the problem by generating similar samples rather than repeating existing ones. SMOTE generates newly constructed samples whose feature values differ from those of the original samples. Therefore, the decision boundary becomes softer. This helps the algorithm to estimate a more accurate hypothesis. Below you will find some benefits of SMOTE:

  • No information is lost, because no existing samples are discarded.
  • The technique is simple, and you can easily interpret and implement it in the model.
  • It reduces overfitting, because it generates new synthetic instances instead of replicating existing ones.

How to Solve the Class Imbalance Problem with SMOTE?

SMOTE synthesizes new minority instances that are similar to the real minority instances. Imagine a line between two existing minority instances. SMOTE draws such lines and creates new synthetic minority instances on them.

library(smotefamily)

dat_plot <- SMOTE(dat[,2:4],            # feature values
                  as.numeric(dat[,6]),  # class labels
                  K = 6, dup_size = 0)  # function parameters

Once the synthesizing process is complete, the output will be less imbalanced than the original data. The new instances that SMOTE adds help level out the classes.

The Function Parameters of SMOTE

dup_size and K are the two tuning parameters of SMOTE(). To understand them, you need to learn the working mechanism of SMOTE(). The function works from the perspective of existing minority instances and generates new ones randomly. It creates each new instance at some distance between an existing instance and one of its neighboring instances. What remains to be explained is how SMOTE() chooses the neighbors for each minority instance it works from. This is what K controls:

  • At K = 1, the function considers only the closest neighbor.
  • At K = 2, the function considers the closest and the second-closest neighbors.
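The effect of K on neighbor selection can be illustrated with a small base R sketch. The helper `nearest_neighbors()` and the `minority` matrix are hypothetical and are not part of smotefamily. The sketch assumes Euclidean distance.

```r
# Hypothetical minority-class samples, one row per instance
minority <- rbind(c(0, 0), c(1, 0), c(0, 3), c(5, 5))

# Indices of the K nearest minority neighbors of row i (hypothetical helper)
nearest_neighbors <- function(x, i, K) {
  d <- as.matrix(dist(x))[i, ]   # Euclidean distances from row i to every row
  d[i] <- Inf                    # an instance is not its own neighbor
  unname(order(d)[seq_len(K)])   # rows with the K smallest distances
}

print(nearest_neighbors(minority, 1, K = 1))  # 2
print(nearest_neighbors(minority, 1, K = 2))  # 2 3
```

For the first instance, K = 1 yields only row 2 as a candidate, while K = 2 also admits row 3.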

Normally, SMOTE() loops through the original minority instances. In each iteration, which handles a single instance, the function creates one new instance between the original instance and one of its neighbors. The dup_size parameter indicates how many times SMOTE() loops through the original minority instances. For instance, in a dataset with four minority instances, dup_size = 1 synthesizes only four new data points, dup_size = 2 synthesizes eight, and so on.
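The looping behavior can be sketched from scratch in base R. This is a simplified illustration under stated assumptions, not the smotefamily implementation. The function `smote_sketch()` and its arguments are hypothetical names, the distance is Euclidean, and the sketch does not reproduce any special handling of dup_size = 0.

```r
# A from-scratch sketch of the loop; smote_sketch() is a hypothetical name
smote_sketch <- function(minority, K = 2, dup_size = 1) {
  n <- nrow(minority)
  d <- as.matrix(dist(minority))     # pairwise Euclidean distances
  diag(d) <- Inf                     # exclude each instance from its own neighbors
  synthetic <- matrix(NA_real_, nrow = n * dup_size, ncol = ncol(minority))
  row <- 1
  for (pass in seq_len(dup_size)) {  # one pass over the minority class per dup_size
    for (i in seq_len(n)) {          # each iteration handles a single instance
      neighbors <- order(d[i, ])[seq_len(K)]
      j <- neighbors[sample.int(K, 1)]  # pick one of the K nearest neighbors at random
      gap <- runif(1)                   # random position between instance and neighbor
      synthetic[row, ] <- minority[i, ] + gap * (minority[j, ] - minority[i, ])
      row <- row + 1
    }
  }
  synthetic
}

set.seed(1)
minority <- rbind(c(0, 0), c(1, 0), c(0, 3), c(5, 5))  # four hypothetical instances
print(nrow(smote_sketch(minority, K = 2, dup_size = 1)))  # 4
print(nrow(smote_sketch(minority, K = 2, dup_size = 2)))  # 8
```

With four minority instances, each pass adds four synthetic points, so the number of new points grows in proportion to dup_size.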

Conclusion

When creating a predictive model in machine learning, you might encounter imbalanced datasets. These datasets can affect the outcome of the model. You can solve this problem by oversampling the minority class. Instead of duplicating the data, use the SMOTE technique to create synthetic data for oversampling. Below you will find SMOTE and some of its variations and related techniques:

  • Borderline-SMOTE
  • SMOTE-NC
  • SMOTE
  • ADASYN
  • SVM-SMOTE