Which algorithm takes the crown: Light GBM vs XGBOOST?
1. What's Light GBM?
Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Since it is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. So when growing on the same leaf in Light GBM, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, and hence results in much better accuracy that is rarely achieved by existing boosting algorithms. It is also surprisingly fast, hence the word 'Light'.
Below is a diagrammatic representation by the makers of Light GBM to explain the difference clearly.
2. Advantages of Light GBM
Faster training speed and higher efficiency: Light GBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up the training procedure.
Lower memory usage: Replacing continuous values with discrete bins results in lower memory usage.
Better accuracy than any other boosting algorithm: It produces much more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy. However, this can sometimes lead to overfitting, which can be avoided by setting the max_depth parameter (see the sketch after this list).
Compatibility with large datasets: It is capable of performing equally well with large datasets, with a significant reduction in training time as compared to XGBoost.
Parallel learning supported.
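To make the leaf-wise growth and the max_depth cap concrete, here is a minimal sketch of training a LightGBM classifier on synthetic data; the parameter values are illustrative assumptions, not tuned recommendations.
#Minimal sketch: leaf-wise growth controlled by num_leaves, capped by max_depth (illustrative values)
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

train_set = lgb.Dataset(x_train, label=y_train)
params = {'objective':'binary',
          'num_leaves':31,   #leaf-wise growth keeps adding the leaf with the best split until this count is reached
          'max_depth':7,     #caps how deep any branch can grow, which helps against overfitting
          'max_bin':255}     #continuous features are bucketed into at most 255 histogram bins
booster = lgb.train(params, train_set, num_boost_round=50)
print(booster.predict(x_test)[:5])   #predicted probabilities for the first 5 test rows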
3. Installing Light GBM
For Windows
Using Visual Studio (Or MSBuild)
-Install Git for Windows, CMake and MS Build (MSBuild is not needed if you have already installed Visual Studio).
-Run the following commands:
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake -DCMAKE_GENERATOR_PLATFORM=x64 ..
cmake --build . --target ALL_BUILD --config Release
The .exe and .dll files will be in the LightGBM/Release folder.
Using MinGW64
-Install git for windows, cmake and MinGW64.
-Run the following commands:
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake -G "MinGW Makefiles" ..
mingw32-make.exe -j4
The .exe and .dll files will be in the LightGBM/ folder.
For Linux
Light GBM uses CMake to build. Run the following:
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake ..
make -j4
For OSX
LightGBM depends on OpenMP for compilation, which isn't supported by Apple Clang. Please use gcc/g++ instead.
-Run the following:
brew install cmake
brew install gcc --without-multilib
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake ..
make -j4
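If you only need the Python package (as used in the code later in this article) rather than the command-line binaries, it can usually also be installed directly from PyPI, which is the quickest route on supported platforms:
pip install lightgbm
pip install xgboost   #also needed for the comparison in section 5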
Now, before we dive headfirst into building our first Light GBM model, let us look at some of the parameters of Light GBM (an illustrative example follows the list) to get an understanding of its underlying procedures.
4. Important Parameters of Light GBM
task : default value = train ; options = train , prediction ; Specifies the task we wish to perform, which is either train or prediction.
application: default=regression, type=enum, options:
regression : perform regression task
binary : Binary classification
multiclass: Multiclass Classification
lambdarank : lambdarank application
data: type=string; training data; LightGBM will train on this data
num_iterations: number of boosting iterations to be performed ; default=100; type=int
num_leaves : number of leaves in one tree ; default = 31 ; type =int
device : default=cpu ; options = gpu, cpu. Device on which we want to train our model. Choose GPU for faster training.
max_depth: Specifies the max depth to which the tree will grow. This parameter is used to deal with overfitting.
min_data_in_leaf: Minimum number of data points in one leaf.
feature_fraction: default=1 ; specifies the fraction of features to be taken for each iteration
bagging_fraction: default=1 ; specifies the fraction of data to be used for each iteration and is generally used to speed up training and avoid overfitting.
min_gain_to_split: default=.1 ; min gain to perform splitting
max_bin : max number of bins to bucket the feature values.
min_data_in_bin : min number of data points in one bin
num_threads: default=OpenMP_default, type=int ;Number of threads for Light GBM.
label : type=string ; specify the label column
categorical_feature : type=string ; specify the categorical features we want to use for training our model
num_class: default=1 ; type=int ; used only for multi-class classification
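To see how these options fit together, here is an illustrative parameter dictionary for a binary classification task; the values are placeholder assumptions, not tuned recommendations.
#Illustrative parameter dictionary for a binary classification task (placeholder values)
params = {'task':'train',
          'application':'binary',        #binary classification
          'num_iterations':100,          #number of boosting iterations
          'num_leaves':31,
          'max_depth':7,
          'min_data_in_leaf':20,
          'feature_fraction':0.8,        #use 80% of the features per iteration
          'bagging_fraction':0.8,        #use 80% of the data per iteration
          'max_bin':255,
          'num_threads':4,
          'device':'cpu'}
#This dictionary would then be passed to lgb.train() together with a lgb.Dataset, as in section 5.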
Also, go through this article explaining parameter tuning in XGBoost in detail.
5. LightGBM vs XGBoost
So now let's compare LightGBM with XGBoost by applying both algorithms to a dataset and then comparing the performance.
Here we are using a dataset that contains information about individuals from various countries. Our target is to predict whether a person makes over 50k annually on the basis of the other information available. The dataset consists of 32561 observations and 14 features describing individuals.
Here is the link to the dataset: http://archive.ics.uci.edu/ml/datasets/Adult.
Go through the dataset to get a proper intuition about the predictor variables so that you can understand the code below properly.
#importing standard libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
#import lightgbm and xgboost
import lightgbm as lgb
import xgboost as xgb
#loading our training dataset 'adult.csv' with name 'data' using pandas
data=pd.read_csv('adult.csv',header=None)
#Assigning names to the columns
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income']
#glimpse of the dataset
data.head()
# Label Encoding our target variable
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
l=LabelEncoder()
l.fit(data.Income)
l.classes_
data.Income=Series(l.transform(data.Income)) #label encoding our target variable
data.Income.value_counts()
#One Hot Encoding of the Categorical features
one_hot_workclass=pd.get_dummies(data.workclass)
one_hot_education=pd.get_dummies(data.education)
one_hot_marital_Status=pd.get_dummies(data.marital_Status)
one_hot_occupation=pd.get_dummies(data.occupation)
one_hot_relationship=pd.get_dummies(data.relationship)
one_hot_race=pd.get_dummies(data.race)
one_hot_sex=pd.get_dummies(data.sex)
one_hot_native_country=pd.get_dummies(data.native_country)
#removing categorical features
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True)
#Merging one hot encoded features with our dataset ‘data’
data=pd.concat([data,one_hot_workclass,one_hot_education,one_hot_marital_Status,one_hot_occupation,one_hot_relationship,one_hot_race,one_hot_sex,one_hot_native_country],axis=1)
#removing duplicate columns
_, i = np.unique(data.columns, return_index=True)
data=data.iloc[:, i]
#Here our target variable is ‘Income’ with values as 1 or 0.
#Separating our data into features dataset x and our target dataset y
x=data.drop('Income',axis=1)
y=data.Income
#Imputing missing values in our target variable
y.fillna(y.mode()[0],inplace=True)
#Now splitting our dataset into test and train
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)
#Applying xgboost
#The data is stored in a DMatrix object
#label is used to define our outcome variable
dtrain=xgb.DMatrix(x_train,label=y_train)
dtest=xgb.DMatrix(x_test)
#setting parameters for xgboost
parameters={'max_depth':7, 'eta':1, 'silent':1,'objective':'binary:logistic','eval_metric':'auc','learning_rate':.05}
#training our model
num_round=50
from datetime import datetime
start = datetime.now()
xg=xgb.train(parameters,dtrain,num_round)
stop = datetime.now()
#Execution time of the model
execution_time_xgb = stop-start
execution_time_xgb
#datetime.timedelta( , , ) representation => (days , seconds , microseconds)
#now predicting our model on test set
ypred=xg.predict(dtest)
ypred
#Converting probabilities into 1 or 0
for i in range(0,len(ypred)):
    if ypred[i]>=.5:       # setting threshold to .5
        ypred[i]=1
    else:
        ypred[i]=0
#calculating accuracy of our model
from sklearn.metrics import accuracy_score
accuracy_xgb = accuracy_score(y_test,ypred)
accuracy_xgb
# Light GBM
train_data=lgb.Dataset(x_train,label=y_train)
#setting parameters for lightgbm
param = {'num_leaves':150, 'objective':'binary','max_depth':7,'learning_rate':.05,'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']
#Here we have set max_depth in xgb and LightGBM to 7 to have a fair comparison between the two.
#training our model using light gbm
num_round=50
start=datetime.now()
lgbm=lgb.train(param,train_data,num_round)
stop=datetime.now()
#Execution time of the model
execution_time_lgbm = stop-start
execution_time_lgbm
#predicting on test set
ypred2=lgbm.predict(x_test)
ypred2[0:5] # showing first 5 predictions
#converting probabilities into 0 or 1
for i in range(0,len(ypred2)):
    if ypred2[i]>=.5:      # setting threshold to .5
        ypred2[i]=1
    else:
        ypred2[i]=0
#calculating accuracy
accuracy_lgbm = accuracy_score(ypred2,y_test)
accuracy_lgbm
y_test.value_counts()
from sklearn.metrics import roc_auc_score
#calculating roc_auc_score for xgboost
auc_xgb = roc_auc_score(y_test,ypred)
auc_xgb
#calculating roc_auc_score for light gbm.
auc_lgbm = roc_auc_score(y_test,ypred2)
auc_lgbm
comparison_dict = {'accuracy score':(accuracy_lgbm,accuracy_xgb),'auc score':(auc_lgbm,auc_xgb),'execution time':(execution_time_lgbm,execution_time_xgb)}
#Creating a dataframe ‘comparison_df’ for comparing the performance of Lightgbm and xgb.
comparison_df = DataFrame(comparison_dict)
comparison_df.index= ['LightGBM','xgboost']
comparison_df
Performance comparison
There is only a marginal increase in accuracy and AUC score by applying Light GBM over XGBoost, but there is a significant difference in the execution time for the training procedure. Light GBM is almost 7 times faster than XGBoost and is a much better approach when dealing with large datasets.
This turns out to be a huge advantage when you are working on large datasets in time-limited competitions.
6. Tuning Parameters of Light GBM
Light GBM uses leaf-wise splitting over depth-wise splitting, which enables it to converge much faster but can also lead to overfitting. So here is a quick guide to tuning the parameters in Light GBM, with an illustrative sketch after the lists.
For best fit
num_leaves : This parameter is used to set the number of leaves to be formed in a tree. Theoretically, the relation between num_leaves and max_depth is num_leaves = 2^(max_depth). However, this is not a good estimate in the case of Light GBM, since splitting takes place leaf-wise rather than depth-wise. Hence, num_leaves must be set smaller than 2^(max_depth), otherwise it may lead to overfitting. Light GBM does not have a direct relation between num_leaves and max_depth, and hence the two must not be linked with each other.
min_data_in_leaf : It is also one of the important parameters for handling overfitting. Setting its value too small may cause overfitting, so it must be set accordingly. Its value should be in the hundreds to thousands for large datasets.
max_depth: It specifies the maximum depth or level up to which the tree can grow.
For faster speed
bagging_fraction : Used to perform bagging (training each iteration on a fraction of the data) for faster results.
feature_fraction : Sets the fraction of features to be used at each iteration.
max_bin : A smaller value of max_bin can save a lot of time, as it buckets the feature values into fewer discrete bins, which is computationally cheaper.
For better accuracy
Use bigger training data
num_leaves : Setting it to a high value produces deeper trees with increased accuracy, but can cause overfitting. Hence, a very high value is not preferred.
max_bin : Setting it to high values has a similar effect as increasing the value of num_leaves and also slows down the training procedure.
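Pulling these guidelines together, here is an illustrative sketch of how a base configuration might be shifted toward speed or toward accuracy; the numbers are placeholder assumptions, not tuned values.
#Base configuration (placeholder values)
base_params = {'objective':'binary', 'max_depth':7, 'num_leaves':70, 'learning_rate':.05}

#For faster speed: sample rows and features each iteration and use fewer histogram bins
speed_params = dict(base_params, bagging_fraction=0.7, bagging_freq=1, feature_fraction=0.7, max_bin=63)

#For better accuracy: more leaves and bins (plus more training data), at the cost of time and overfitting risk
accuracy_params = dict(base_params, num_leaves=120, max_bin=255, min_data_in_leaf=100)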