Which algorithm takes the crown: Light GBM vs XGBOOST?

1. What is Light GBM?

Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Since it is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise. So when growing on the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm, which results in much better accuracy that can rarely be achieved by any of the existing boosting algorithms. Also, it is surprisingly fast, hence the word ‘Light’.

Below is a diagrammatic representation by the makers of Light GBM that explains the difference clearly.
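To make the difference concrete, here is a toy sketch in Python. It is not how Light GBM is implemented: the node numbering, the GAINS table and both helper functions are made up for illustration. Each node gets a fixed, hypothetical loss reduction, and we compare which nodes are split under each strategy when only three splits are allowed.

```python
import heapq

# Hypothetical loss reduction ("gain") obtained by splitting each node.
# Nodes are numbered like a binary heap: children of n are 2n+1 and 2n+2.
GAINS = {0: 10.0, 1: 6.0, 2: 1.0, 3: 4.0, 4: 3.0, 5: 0.5, 6: 0.2}


def level_wise(gains, n_splits):
    """Split nodes in breadth-first order, one level after another."""
    order = sorted(gains)[:n_splits]
    return order, sum(gains[node] for node in order)


def leaf_wise(gains, n_splits):
    """Always split the current leaf that offers the largest gain."""
    heap = [(-gains[0], 0)]  # max-heap via negated gains, starting at the root
    order, total = [], 0.0
    while heap and len(order) < n_splits:
        neg_gain, node = heapq.heappop(heap)
        order.append(node)
        total -= neg_gain
        # The two children of the split node become new candidate leaves.
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains:
                heapq.heappush(heap, (-gains[child], child))
    return order, total


print("level-wise:", level_wise(GAINS, 3))
print("leaf-wise: ", leaf_wise(GAINS, 3))
```

With the same budget of three splits, the leaf-wise strategy goes deeper on the promising branch and ends up with a larger total loss reduction in this toy setup, which is also why it can overfit more easily.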

2. Advantages of Light GBM

Faster training speed and higher efficiency: Light GBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up the training procedure.

Lower memory usage: Replaces continuous values with discrete bins, which results in lower memory usage.

Better accuracy than any other boosting algorithm: It produces much more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy. However, it can sometimes lead to overfitting, which can be avoided by setting the max_depth parameter.

Compatibility with large datasets: It is capable of performing equally well with large datasets, with a significant reduction in training time as compared to XGBOOST.

Parallel learning supported.
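The histogram idea behind the first two advantages can be sketched in a few lines of Python. This is only an illustration under simple assumptions: it uses equal-width bins, and the helper names make_bin_edges and to_bins are made up here. Light GBM's own binning logic is more involved.

```python
from bisect import bisect_right


def make_bin_edges(values, max_bin):
    """Return the interior edges of `max_bin` equal-width bins (illustrative choice)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / max_bin
    return [lo + i * width for i in range(1, max_bin)]


def to_bins(values, edges):
    """Map each continuous value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


feature = [0.1, 0.4, 2.5, 3.3, 7.9, 8.0]  # made-up continuous feature values
edges = make_bin_edges(feature, max_bin=4)
binned = to_bins(feature, edges)
print(binned)  # small integer bin indices instead of floats
```

Once the values are small integers, a split search only has to look at a handful of bin boundaries instead of every distinct feature value, and the integers take less memory than floats.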

3. Installing Light GBM

For Windows

Using Visual Studio (Or MSBuild)

-Install git for windows, cmake and MS Build (MSBuild is not needed if you have already installed Visual Studio).

-Run the following commands:

git clone --recursive https://github.com/Microsoft/LightGBM

cd LightGBM

mkdir build

cd build

cmake -DCMAKE_GENERATOR_PLATFORM=x64 ..

cmake --build . --target ALL_BUILD --config Release

The exe and dll will be in the LightGBM/Release folder.

Using MinGW64

-Install git for windows, cmake and MinGW64.

-Run the following commands:

git clone --recursive https://github.com/Microsoft/LightGBM

cd LightGBM

mkdir build

cd build

cmake -G "MinGW Makefiles" ..

mingw32-make.exe -j4

The exe and dll will be in the LightGBM/ folder.

For Linux

Light GBM uses cmake to build. Run the following:

git clone --recursive https://github.com/Microsoft/LightGBM

cd LightGBM

mkdir build

cd build

cmake ..

make -j4

For OSX

LightGBM depends on OpenMP for compiling, which isn’t supported by Apple Clang. Please use gcc/g++ instead.

-Run the following:

brew install cmake

brew install gcc --without-multilib

git clone --recursive https://github.com/Microsoft/LightGBM

cd LightGBM

mkdir build

cd build

cmake ..

make -j4
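The steps above build the LightGBM binaries. The code later in this article imports the lightgbm and xgboost Python packages, which we are assuming are installed separately in your Python environment. A quick way to check, using only the standard library (has_module is a made-up helper name):

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


for package in ("lightgbm", "xgboost"):
    status = "available" if has_module(package) else "missing"
    print(f"{package}: {status}")
```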

Now, before we dive head first into building our first Light GBM model, let us look at some of the parameters of Light GBM to get an understanding of its underlying procedures.

4. Important Parameters of Light GBM

task : default value = train ; options = train , prediction ; Specifies the task we wish to perform which is either train or prediction.

application: default=regression, type=enum, options:

regression : perform regression task

binary : Binary classification

multiclass: Multiclass Classification

lambdarank : lambdarank application

data: type=string; training data , LightGBM will train from this data

num_iterations: number of boosting iterations to be performed ; default=100; type=int

num_leaves : number of leaves in one tree ; default = 31 ; type =int

device : default= cpu ; options = gpu,cpu. Device on which we want to train our model. Choose GPU for faster training.

max_depth: Specify the max depth to which a tree will grow. This parameter is used to deal with overfitting.

min_data_in_leaf: Min number of data in one leaf.

feature_fraction: default=1 ; specifies the fraction of features to be taken for each iteration

bagging_fraction: default=1 ; specifies the fraction of data to be used for each iteration and is generally used to speed up the training and avoid overfitting.

min_gain_to_split: default=0 ; min gain required to perform a split

max_bin : max number of bins to bucket the feature values.

min_data_in_bin : min number of data in one bin

num_threads: default=OpenMP_default, type=int ;Number of threads for Light GBM.

label : type=string ; specify the label column

categorical_feature : type=string ; specify the categorical features we want to use for training our model

num_class: default=1 ; type=int ; used just for multi-class classification
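To see how these parameters fit together, here is a small sketch of a parameter set for a binary classification run. The parameter names come from the list above; the values are illustrative only, not recommendations. The one exception is bagging_freq, which is not in the list above: LightGBM only performs bagging when it is set to a non-zero value. The helper to_config_lines is a made-up name; it just renders the dictionary as key = value lines, which is the layout LightGBM's command-line config files use.

```python
# Illustrative values only; parameter names follow the list above.
params = {
    "task": "train",
    "application": "binary",
    "num_iterations": 100,
    "num_leaves": 31,
    "max_depth": 7,
    "min_data_in_leaf": 20,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,  # not in the list above; bagging is inactive while this is 0
    "max_bin": 255,
    "device": "cpu",
}


def to_config_lines(params):
    """Render parameters as `key = value` lines (hypothetical helper)."""
    return [f"{key} = {value}" for key, value in params.items()]


print("\n".join(to_config_lines(params)))
```

The same dictionary can also be passed directly to lgb.train, as the comparison code in the next section does with its own param dictionary.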

Also, go through this article explaining parameter tuning in XGBOOST in detail.

5. LightGBM vs XGBoost

So now let’s compare LightGBM with XGBoost by applying both algorithms to a dataset and then comparing their performance.

Here we are using a dataset that contains information about individuals from various countries. Our target is to predict whether a person makes more than 50k annually on the basis of the other information available. The dataset consists of 32561 observations and 14 features describing individuals.

Here is the link to the dataset: http://archive.ics.uci.edu/ml/datasets/Adult.

Go through the dataset to get a proper intuition about the predictor variables so that you can understand the code below properly.

#importing standard libraries

import numpy as np 

import pandas as pd 

from pandas import Series, DataFrame 

#import lightgbm and xgboost 

import lightgbm as lgb 

import xgboost as xgb 

#loading our training dataset 'adult.csv' with name 'data' using pandas

data=pd.read_csv('adult.csv',header=None)

#Assigning names to the columns

data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income']

#glimpse of the dataset 

data.head() 

# Label Encoding our target variable 

from sklearn.preprocessing import LabelEncoder,OneHotEncoder

l=LabelEncoder() 

l.fit(data.Income) 

l.classes_ 

data.Income=Series(l.transform(data.Income))  #label encoding our target variable 

data.Income.value_counts() 

#One Hot Encoding of the Categorical features 

one_hot_workclass=pd.get_dummies(data.workclass) 

one_hot_education=pd.get_dummies(data.education) 

one_hot_marital_Status=pd.get_dummies(data.marital_Status) 

one_hot_occupation=pd.get_dummies(data.occupation)

one_hot_relationship=pd.get_dummies(data.relationship) 

one_hot_race=pd.get_dummies(data.race) 

one_hot_sex=pd.get_dummies(data.sex) 

one_hot_native_country=pd.get_dummies(data.native_country) 

#removing categorical features 

data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True)

#Merging one hot encoded features with our dataset ‘data’ 

data=pd.concat([data,one_hot_workclass,one_hot_education,one_hot_marital_Status,one_hot_occupation,one_hot_relationship,one_hot_race,one_hot_sex,one_hot_native_country],axis=1) 

#removing duplicate columns

_, i = np.unique(data.columns, return_index=True)

data=data.iloc[:, i] 

#Here our target variable is ‘Income’ with values as 1 or 0.  

#Separating our data into features dataset x and our target dataset y 

x=data.drop('Income',axis=1)

y=data.Income 

#Imputing missing values in our target variable (reassigning instead of modifying in place)

y=y.fillna(y.mode()[0])

#Now splitting our dataset into test and train 

from sklearn.model_selection import train_test_split 

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)

#Applying xgboost

#The data is stored in a DMatrix object 

#label is used to define our outcome variable

dtrain=xgb.DMatrix(x_train,label=y_train)

dtest=xgb.DMatrix(x_test)

#setting parameters for xgboost

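#'eta' is an alias of 'learning_rate' in xgboost, so only one of them is set; 'verbosity':0 replaces the deprecated 'silent':1

parameters={'max_depth':7, 'verbosity':0, 'objective':'binary:logistic', 'eval_metric':'auc', 'learning_rate':.05}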

#training our model 

num_round=50

from datetime import datetime 

start = datetime.now() 

xg=xgb.train(parameters,dtrain,num_round) 

stop = datetime.now()

#Execution time of the model 

execution_time_xgb = stop-start 

execution_time_xgb

#datetime.timedelta( , , ) representation => (days , seconds , microseconds) 

#now predicting our model on test set 

ypred=xg.predict(dtest) 

ypred

#Converting probabilities into 1 or 0 with a threshold of .5 (works for any test set size)

ypred=(ypred>=.5).astype(int)

#calculating accuracy of our model 

from sklearn.metrics import accuracy_score 

accuracy_xgb = accuracy_score(y_test,ypred) 

accuracy_xgb

# Light GBM

train_data=lgb.Dataset(x_train,label=y_train)

#setting parameters for lightgbm

param = {'num_leaves':150, 'objective':'binary','max_depth':7,'learning_rate':.05,'max_bin':200}

param['metric'] = ['auc', 'binary_logloss']

#Here we have set max_depth in xgb and LightGBM to 7 to have a fair comparison between the two.

#training our model using light gbm

num_round=50

start=datetime.now()

lgbm=lgb.train(param,train_data,num_round)

stop=datetime.now()

#Execution time of the model

execution_time_lgbm = stop-start

execution_time_lgbm

#predicting on test set

ypred2=lgbm.predict(x_test)

ypred2[0:5]  # showing first 5 predictions

#converting probabilities into 0 or 1 with a threshold of .5 (works for any test set size)

ypred2=(ypred2>=.5).astype(int)

#calculating accuracy (true labels first, then predictions)

accuracy_lgbm = accuracy_score(y_test,ypred2)

accuracy_lgbm

y_test.value_counts()

from sklearn.metrics import roc_auc_score

#calculating roc_auc_score for xgboost

#(note: ypred was thresholded to 0/1 above, so this AUC is computed on hard labels, not probabilities)

auc_xgb = roc_auc_score(y_test,ypred)

auc_xgb

#calculating roc_auc_score for light gbm (ypred2 is likewise 0/1 at this point)

auc_lgbm = roc_auc_score(y_test,ypred2)

auc_lgbm

comparison_dict = {'accuracy score':(accuracy_lgbm,accuracy_xgb),'auc score':(auc_lgbm,auc_xgb),'execution time':(execution_time_lgbm,execution_time_xgb)}

#Creating a dataframe ‘comparison_df’ for comparing the performance of Lightgbm and xgb. 

comparison_df = DataFrame(comparison_dict) 

comparison_df.index= ['LightGBM','xgboost']

comparison_df
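One detail in the code above is worth a closer look: the predicted probabilities are converted to 0/1 before roc_auc_score is called. AUC measures how well the scores rank positives above negatives, so it is usually computed on the raw probabilities, and thresholding first can change the value. The sketch below shows this with a tiny made-up example in plain Python. The helpers to_labels and auc are written here for illustration and are not part of scikit-learn.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn probabilities into hard 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.45, 0.8]  # made-up model scores
labels = to_labels(probs)

print("AUC on probabilities:", auc(y_true, probs))
print("AUC on hard labels:  ", auc(y_true, labels))
```

In this toy case the probabilities rank every positive above every negative, but the thresholded labels lose that information for the example scored 0.45. If you want AUC on probabilities in the comparison above, keep a copy of the output of predict before thresholding.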

Performance comparison

https://cdn.analyticsvidhya.com/wp-content/uploads/2017/06/11200955/result.png

There has been only a small increase in accuracy and auc score by applying Light GBM over XGBOOST, but there is a significant difference in the execution time for the training procedure. Light GBM is almost 7 times faster than XGBOOST and is a much better approach when dealing with large datasets.

This turns out to be a huge advantage when you are working on large datasets in limited-time competitions.
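If you want to repeat the timing comparison yourself, the start/stop pattern used in the code above can be wrapped in a small helper. This is a sketch using only the standard library. The names timed and speedup are made up, and the two sum calls are stand-in workloads, not real training runs.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(slow_seconds, fast_seconds):
    """How many times faster the second run was than the first."""
    return slow_seconds / fast_seconds


# Stand-in workloads; replace them with the two training calls.
_, t_slow = timed(sum, range(2_000_000))
_, t_fast = timed(sum, range(200_000))
print(f"speedup: {speedup(t_slow, t_fast):.1f}x")
```

With the objects from the comparison code, the calls would look like timed(xgb.train, parameters, dtrain, num_round) and timed(lgb.train, param, train_data, num_round). A single run is noisy, so repeating each measurement a few times gives a more reliable ratio.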

6. Tuning Parameters of Light GBM

Light GBM uses leaf-wise splitting over depth-wise splitting, which enables it to converge much faster but can also lead to overfitting. So here is a quick guide to tuning the parameters in Light GBM.

For best fit

num_leaves : This parameter is used to set the number of leaves to be formed in a tree. Theoretically, the relation between num_leaves and max_depth is num_leaves = 2^(max_depth). However, this is not a good estimate in the case of Light GBM, since splitting takes place leaf-wise rather than depth-wise. Hence num_leaves must be set smaller than 2^(max_depth), otherwise it may lead to overfitting. Light GBM does not have a direct relation between num_leaves and max_depth, and hence the two must not be linked with each other.

min_data_in_leaf : It is also one of the important parameters in dealing with overfitting. Setting its value too small may cause overfitting, so it must be set accordingly. Its value should be in the hundreds to thousands for large datasets.

max_depth: It specifies the maximum depth or level up to which a tree can grow.
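The num_leaves rule of thumb above is easy to check in code. The sketch below is a plain Python helper under the assumptions stated in the comments. The function names are made up, and treating a non-positive max_depth as "no limit" follows LightGBM's convention for that parameter.

```python
def max_leaves_for_depth(max_depth):
    """A binary tree of depth max_depth has at most 2^max_depth leaves."""
    return 2 ** max_depth


def num_leaves_ok(num_leaves, max_depth):
    """Rule of thumb from the text: keep num_leaves below 2^max_depth."""
    if max_depth <= 0:  # LightGBM treats max_depth <= 0 as "no depth limit"
        return True
    return num_leaves < max_leaves_for_depth(max_depth)


print(max_leaves_for_depth(7))
print(num_leaves_ok(150, 7))  # the setting used in the comparison code
print(num_leaves_ok(100, 7))
```

Note that the comparison code earlier uses num_leaves=150 together with max_depth=7, which is above the 2^7 = 128 bound, so a smaller num_leaves would be more in line with this guidance.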

For faster speed

bagging_fraction : Is used to perform bagging for faster results

feature_fraction : Set the fraction of the features to be used at each iteration

max_bin : A smaller value of max_bin can save a lot of time, since bucketing the feature values into fewer discrete bins is computationally inexpensive.
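As a sketch of how the speed-oriented settings above might be applied, here is one way to layer them over a base parameter dictionary. All values are illustrative, not tuned. BASE_PARAMS, SPEED_OVERRIDES and with_overrides are names made up for this example, and bagging_freq is included because LightGBM only performs bagging when it is non-zero.

```python
# Illustrative starting point; not tuned for any dataset.
BASE_PARAMS = {"objective": "binary", "num_leaves": 31, "max_bin": 255}

# Speed-oriented settings from the list above (values are assumptions).
SPEED_OVERRIDES = {
    "bagging_fraction": 0.8,  # train each iteration on 80% of the rows
    "bagging_freq": 1,        # needed for bagging_fraction to take effect
    "feature_fraction": 0.8,  # use 80% of the features per iteration
    "max_bin": 63,            # fewer bins means cheaper histograms
}


def with_overrides(base, overrides):
    """Return a new dict with overrides applied, leaving base untouched."""
    merged = dict(base)
    merged.update(overrides)
    return merged


fast_params = with_overrides(BASE_PARAMS, SPEED_OVERRIDES)
print(fast_params)
```

Keeping the base dictionary untouched makes it easy to train once with each parameter set and compare both speed and accuracy before settling on the faster settings.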

For better accuracy

Use bigger training data

num_leaves : Setting it to a high value produces deeper trees with increased accuracy, but can lead to overfitting. Hence a very high value is not preferred.

max_bin : Setting it to a high value has a similar effect to increasing num_leaves, and it also slows down our training procedure.