Categorical encoding is a technique for converting categorical data into numbers. Categorical data are variables whose values are labels rather than numerical values. Many machine learning algorithms cannot process categorical variables directly, so you must encode them into a suitable form during preprocessing. To fit and evaluate such a model, all input and output variables must be numeric. Once they are, the model can extract information from the data and generate the desired output. Categorical variables also differ in the number of possible values they can take.
Most categorical variables are nominal. These variables categorize and label attributes: each variable takes several values, and each value represents a separate category. For instance, color is a variable with values such as blue, green, and yellow. Similarly, pet is a variable, and cat and dog are values representing different categories. In another example, place is a variable whose values are first, second, and third. Categories may or may not have a natural relationship with each other. The colors have no inherent order, but the values of place do. Variables whose categories have a natural order are called ordinal variables.
Convert Label Data into Numeric
There are two steps to convert label or categorical data into numerical data:
- Integer Encoding
In this first step, you assign an integer value to each category. For instance, blue is 1, green is 2, and yellow is 3. This type of encoding is easy to reverse, and it is also known as label encoding. For some variables, this step is enough. Integers have a natural order, and a machine learning algorithm may pick up on that order and use it. For an ordinal variable such as place, that is what you want: the categories are already ordered, and the integers preserve that order. So, label encoding is enough.
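As a minimal sketch of integer encoding in plain Python, the mapping and its reversal might look like the following. The data is made up, and the names order, to_int, and to_label are hypothetical, chosen only for illustration.

```python
# Ordinal categories listed in their natural order (illustrative data).
order = ["first", "second", "third"]

# Assign each category an integer, starting at 1 as in the text.
to_int = {label: i for i, label in enumerate(order, start=1)}
# Build the reverse mapping so the encoding can be undone.
to_label = {i: label for label, i in to_int.items()}

places = ["second", "first", "third", "first"]
encoded = [to_int[p] for p in places]
decoded = [to_label[i] for i in encoded]

print(encoded)
print(decoded)
```

Because the integers follow the order of the list, comparisons such as "first comes before third" carry over to the encoded values.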
- One-Hot Encoding
Nominal variables, unlike ordinal variables, have no natural order among their categories, so integer encoding is not enough. If you integer-encode such data anyway, the model may treat the arbitrary order of the integers as meaningful, which can lead to poor performance or unexpected results. For these variables, you need to take the encoding one step further.
This is where one-hot encoding comes in: it is applied to the integer representation. In this step, the integer-encoded variable is removed and a new binary variable is added for every unique integer value. For instance, the color variable has 3 categories, so it needs 3 binary variables. In each row, you place the value “1” in the binary variable for that row's color and “0” in the others.
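To make the idea concrete, here is a small plain-Python sketch of one-hot encoding the color variable. The helper one_hot and the sample data are hypothetical and serve only as an illustration; library encoders are shown later in the walkthrough.

```python
colors = ["blue", "green", "yellow", "green"]

# Fix a column order for the categories (sorted here for reproducibility).
categories = sorted(set(colors))

def one_hot(value, categories):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

encoded = [one_hot(c, categories) for c in colors]

for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, so no category is numerically larger or smaller than another.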
How to Code One Hot Encoding
You can understand the process through the following example. To keep the task simple and quick, use libraries: pre-written code that spares you from implementing common operations yourself. Start by importing three common libraries into the project: sklearn (scikit-learn), NumPy, and pandas.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
import pandas as pd
With the tools in place, you can start to encode. To follow along, use this made-up dataset. Use the pandas library to load it: call pd.read_csv with the file name to open the file.
dataset = pd.read_csv('made_up_thing.csv')
So far, the steps have been straightforward; the next one is a bit trickier. When a spreadsheet is the input, you usually need only some of its columns. To keep things simple, take all the columns except the last one. To do so, use .iloc, a pandas indexer that selects rows and columns by position.
X = dataset.iloc[:, :-1].values
In the code above, .iloc[:, :-1] selects every row and every column except the last, and the .values attribute returns the selected data as a NumPy array. In other words, the first part picks the section of the table, and the second part extracts its values.
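The loading and selection steps can be tried without a file on disk. The sketch below assumes an inline CSV string with made-up columns (color, size, price) in place of the real dataset, whose contents are not shown here.

```python
import io
import pandas as pd

# Inline CSV text standing in for a file on disk (made-up data).
csv_text = """color,size,price
blue,small,10
green,large,25
yellow,medium,18
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# All rows, every column except the last one, as a NumPy array.
X = dataset.iloc[:, :-1].values
# All rows, only the last column.
y = dataset.iloc[:, -1].values

print(X.shape)
print(y)
```

Here X holds the two categorical columns and y holds the numeric last column, which mirrors the split used in the walkthrough.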
In this example, you use a label encoder together with a one-hot encoder. The one-hot encoder in this walkthrough is applied to integer codes, and the column holds category labels, so you first convert the labels into integers. Therefore, the LabelEncoder comes before the one-hot encoder. Set up the LabelEncoder with the following code:
le = LabelEncoder()
To convert the categories into numbers, call the encoder's .fit_transform method. It learns the distinct categories in the column and replaces each one with an integer code; the one-hot step that follows turns those codes into binary columns. Use the following code to encode the first column:
X[:, 0] = le.fit_transform(X[:, 0])
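To see what the label encoder produces, here is a small sketch on a made-up color column. Note that LabelEncoder numbers the categories from 0 in sorted order, so the codes differ from the "blue is 1" numbering used earlier for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

colors = np.array(["green", "blue", "yellow", "blue"])

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the categories in sorted order; a code is an index into it.
print(le.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels.
restored = le.inverse_transform(codes)
print(restored)
```

The inverse_transform call is what makes this encoding easy to reverse.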
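Now add the one-hot encoder to complete the encoding. The categorical_features argument that older tutorials pass to OneHotEncoder was removed in scikit-learn 0.22; in current versions, ColumnTransformer selects the columns to encode:
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('ohe', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)
The list [0] specifies which columns to one-hot encode; since we are encoding the first column, we pass [0]. Setting remainder='passthrough' keeps the other columns unchanged, and sparse_threshold=0 makes the result a dense array. The fit_transform method then replaces the selected column with binary columns, one per category, placed before the remaining columns.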
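For a quick look at the output of one-hot encoding, the following sketch runs OneHotEncoder on a tiny in-memory column. It assumes scikit-learn 0.20 or later, where the encoder accepts string categories directly; the data is made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column; the encoder expects a 2-D array (rows x columns).
colors = np.array([["blue"], ["green"], ["yellow"], ["blue"]])

ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = ohe.fit_transform(colors).toarray()

# categories_ lists the categories found in each column, in column order.
print(ohe.categories_)
print(encoded)
```

Each output column corresponds to one entry of categories_, so the first and last rows (both blue) have their 1 in the first column.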
The conversion is done. Keep in mind that [0] refers to the first column. To one-hot encode more columns, add their column numbers to the list. To label-encode more than one column, use a for loop as follows:
le = LabelEncoder()
# label-encode the first 10 columns
for i in range(10):
    X[:, i] = le.fit_transform(X[:, i])
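One caveat of reusing a single LabelEncoder in a loop is that it only remembers the last column it was fitted on. If you need to reverse the encoding later, a possible variation is to keep one encoder per column. The sketch below uses made-up data, and the encoders dictionary is a hypothetical name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up data: two categorical columns stored in an object array.
X = np.array(
    [["blue", "cat"], ["green", "dog"], ["blue", "dog"]],
    dtype=object,
)

# One fitted encoder per column, keyed by column index.
encoders = {}
for i in range(X.shape[1]):
    encoders[i] = LabelEncoder()
    X[:, i] = encoders[i].fit_transform(X[:, i])

print(X)
# Recover the original labels of column 1 from its own encoder.
print(encoders[1].inverse_transform(X[:, 1].astype(int)))
```

Keeping the encoders around also lets you apply the same mapping to new data with transform instead of refitting.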
Conclusion
To sum up, categorical encoding is a technique that converts categorical data into numeric form. Because many machine learning models cannot process categorical data, you must convert it into integer or binary form, depending on the data and the algorithm. There are two steps to convert categorical or label data into binary data. Label encoding is the first: it transforms the labels into integers, and for ordinal data it is enough on its own. For non-ordinal data, follow it with one-hot encoding, which is an effective way to convert your data into binary form.