How to implement Random Forests in R
January 9, 2018
By Perceptive Analytics
[This article was first published on R-posts.com, and kindly contributed to R-bloggers.]
Imagine you were to buy a car. Would you just go to a store and buy the first one you see? No, right? You usually consult a few people around you, take their opinions, add your own research, and then make the final decision. Let’s take a simpler scenario: whenever you go to a movie, don’t you ask your friends for reviews about it (unless, of course, it stars one of your favorite actors)?
Have you ever wondered why we ask multiple people for their opinions or reviews before going to a movie, buying a car, or planning a holiday? It’s because one person’s review may be biased by her preferences; when we ask multiple people, we try to remove the bias that a single person may introduce. One person may strongly dislike a specific destination because of her experience at that location, while ten other people may strongly prefer the same destination because they have had a wonderful experience there. From this, we can infer that the one person was an exceptional case and her experience a one-off.
Another example, which I am sure all of us have encountered, is interviewing at a company or college. We often have to go through multiple rounds of interviews. Even though the questions asked in the various rounds are similar, if not the same, companies still go for it. The reason is that they want views from multiple recruitment leaders. If multiple leaders are zeroing in on a candidate, then the likelihood of her turning out to be a good hire is high.
In the world of analytics and data science, this is called ‘ensembling’: a type of supervised learning technique where multiple models are trained on a training dataset and their individual outputs are combined by some rule to derive the final output.
Let’s break down the above definition and look at it step by step.
When we say multiple models are trained on a dataset, this can mean the same model with different hyperparameters, or entirely different models, trained on the training dataset. The training observations may differ slightly between samples; however, the overall population remains the same.
“Outputs are combined by some rule” – there can be multiple rules by which outputs are combined. The most common are the average (for numerical output) and the vote (for categorical output). When different models give us numerical outputs, we can simply take the average of all the outputs and use that average as the result. In the case of categorical output, we can use a vote: the output occurring the maximum number of times is the final output. There are other, more complex methods of deriving the output, but they are beyond the scope of this article.
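The two combination rules can be sketched in a few lines of R; the three model predictions below are made up purely for illustration:

```r
# Hypothetical predictions from three models (illustrative values only)

# Numerical output: take the average of the individual outputs
num_preds <- c(model_a = 10.2, model_b = 9.8, model_c = 10.6)
ensemble_num <- mean(num_preds)                      # 10.2

# Categorical output: majority vote - the most frequent label wins
cat_preds <- c("acc", "unacc", "acc")
ensemble_cat <- names(which.max(table(cat_preds)))   # "acc"
```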
Random Forest is one such very powerful ensemble machine learning algorithm, which works by creating multiple decision trees and then combining their outputs. A decision tree is a classification model that works on the concept of information gain at every node. For the data points that reach a node, the tree evaluates candidate splits, checks the information gain of each, and splits where information gain is maximum. It follows this process recursively until all the nodes are exhausted or there is no further information gain. Decision trees are very simple, easy-to-understand models; however, they have low predictive power. In fact, they are called weak learners.
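To make the information-gain idea concrete, here is a minimal sketch of the calculation a tree performs when evaluating a split. `entropy` and `info_gain` are hypothetical helpers written for this article, not functions from any package:

```r
# Entropy of a set of class labels, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the values of `feature`:
# parent entropy minus the size-weighted entropy of the child nodes
info_gain <- function(labels, feature) {
  children <- split(labels, feature)
  weighted <- sum(sapply(children, function(s)
    length(s) / length(labels) * entropy(s)))
  entropy(labels) - weighted
}

# A split that separates the classes perfectly recovers the full
# parent entropy (1 bit here)
labels  <- c("unacc", "unacc", "acc", "acc")
feature <- c("low", "low", "high", "high")
info_gain(labels, feature)   # 1
```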
Random Forest builds on these same weak learners. It combines the outputs of multiple decision trees and then produces its own final output. Random Forest works on the same principle as decision trees; however, it does not use all the data points and variables in each of the trees. It randomly samples data points and variables for each tree that it creates and then combines the outputs at the end. This removes the bias that a single decision tree might introduce into the system and improves the predictive power significantly. We will see this in the next section, when we take a sample dataset and compare the accuracy of Random Forest and Decision Tree.
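What “randomly samples data points and variables” means can be sketched as follows. The row count here is a placeholder chosen for illustration; the actual sampling happens inside the randomForest package:

```r
set.seed(1)
n_rows <- 100
vars <- c("BuyingPrice", "Maintenance", "NumDoors",
          "NumPersons", "BootSpace", "Safety")

# Each tree is grown on a bootstrap sample of the rows...
row_sample <- sample(n_rows, n_rows, replace = TRUE)

# ...and at each split considers only a random subset of the variables
col_sample <- sample(vars, 2)   # e.g. mtry = 2

# Sampling with replacement repeats some rows and leaves others out;
# the left-out ("out-of-bag") rows are what the OOB error is based on
length(unique(row_sample)) < n_rows   # TRUE
```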
Now, let’s take a small case study, try to implement multiple Random Forest models with different hyperparameters, and compare one of the Random Forest models with a Decision Tree model. (I am sure you will agree with me on this: even without implementing the models, we can say intuitively that Random Forest will give us better results than a Decision Tree.) The dataset is taken from the UCI website and can be found on this link. The data contains 7 variables – six explanatory (BuyingPrice, Maintenance, NumDoors, NumPersons, BootSpace, Safety) and one response variable (Condition). The variables are self-explanatory and refer to attributes of cars; the response variable is ‘Car Acceptability’. All the variables are categorical in nature, with 3-4 factor levels each.
Let’s start the R code implementation and predict the car acceptability based on explanatory variables.
# Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/car/
install.packages("randomForest")
library(randomForest)
# Load the dataset and explore
data1 <- read.csv(file.choose(), header = TRUE)
head(data1)
str(data1)
summary(data1)
> head(data1)
BuyingPrice Maintenance NumDoors NumPersons BootSpace Safety Condition
1 vhigh vhigh 2 2 small low unacc
2 vhigh vhigh 2 2 small med unacc
3 vhigh vhigh 2 2 small high unacc
4 vhigh vhigh 2 2 med low unacc
5 vhigh vhigh 2 2 med med unacc
6 vhigh vhigh 2 2 med high unacc
> str(data1)
'data.frame': 1728 obs. of 7 variables:
 $ BuyingPrice: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Maintenance: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ NumDoors   : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
 $ NumPersons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
 $ BootSpace  : Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
 $ Safety     : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
 $ Condition  : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
> summary(data1)
BuyingPrice Maintenance NumDoors NumPersons BootSpace Safety Condition
high :432 high :432 2 :432 2 :576 big :576 high:576 acc : 384
low :432 low :432 3 :432 4 :576 med :576 low :576 good : 69
med :432 med :432 4 :432 more:576 small:576 med :576 unacc:1210
vhigh:432 vhigh:432 5more:432 vgood: 65
Now, we will split the dataset into train and validation sets in a 70:30 ratio. We could also create a separate test dataset, but for the time being we will just keep the train and validation sets.
# Split into Train and Validation sets
# Training Set : Validation Set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(data1), 0.7*nrow(data1), replace = FALSE)
TrainSet <- data1[train,]
ValidSet <- data1[-train,]
summary(TrainSet)
summary(ValidSet)
> summary(TrainSet)
BuyingPrice Maintenance NumDoors NumPersons BootSpace Safety Condition
high :313 high :287 2 :305 2 :406 big :416 high:396 acc :264
low :292 low :317 3 :300 4 :399 med :383 low :412 good : 52
med :305 med :303 4 :295 more:404 small:410 med :401 unacc:856
vhigh:299 vhigh:302 5more:309 vgood: 37
> summary(ValidSet)
BuyingPrice Maintenance NumDoors NumPersons BootSpace Safety Condition
high :119 high :145 2 :127 2 :170 big :160 high:180 acc :120
low :140 low :115 3 :132 4 :177 med :193 low :164 good : 17
med :127 med :129 4 :137 more:172 small:166 med :175 unacc:354
vhigh:133 vhigh:130 5more:123 vgood: 28
Now, we will create a Random Forest model with default parameters and then fine-tune the model by changing ‘mtry’. We can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry). According to the randomForest package documentation:
ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
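For this dataset, with p = 6 explanatory variables, the classification default is easy to verify (a quick sketch; in practice the package takes the floor of sqrt(p)):

```r
p <- 6           # explanatory variables in the car dataset
floor(sqrt(p))   # 2 -> the default mtry used for classification here
```

This matches the “No. of variables tried at each split: 2” reported for the default model.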
# Create a Random Forest model with default parameters
model1 <- randomForest(Condition ~ ., data = TrainSet, importance = TRUE)
model1
> model1
Call:
randomForest(formula = Condition ~ ., data = TrainSet, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 3.64%
Confusion matrix:
acc good unacc vgood class.error
acc 253 7 4 0 0.04166667
good 3 44 1 4 0.15384615
unacc 18 1 837 0 0.02219626
vgood 6 0 0 31 0.16216216
By default, the number of trees is 500 and the number of variables tried at each split is 2 in this case. The OOB error rate is 3.64%.
# Fine tuning parameters of Random Forest model
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2
> model2
Call:
randomForest(formula = Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6
OOB estimate of error rate: 2.32%
Confusion matrix:
acc good unacc vgood class.error
acc 254 4 6 0 0.03787879
good 3 47 1 1 0.09615385
unacc 10 1 845 0 0.01285047
vgood 1 1 0 35 0.05405405
When we increased mtry from 2 to 6, the OOB error rate dropped from 3.64% to 2.32%. We will now predict on the train dataset first and then on the validation dataset.
# Predicting on train set
predTrain <- predict(model2, TrainSet, type = "class")
# Checking classification accuracy
table(predTrain, TrainSet$Condition)
> table(predTrain, TrainSet$Condition)
predTrain acc good unacc vgood
acc 264 0 0 0
good 0 52 0 0
unacc 0 0 856 0
vgood 0 0 0 37
# Predicting on Validation set
predValid <- predict(model2, ValidSet, type = "class")
# Checking classification accuracy
mean(predValid == ValidSet$Condition)
table(predValid, ValidSet$Condition)
> mean(predValid == ValidSet$Condition)
[1] 0.9884393
> table(predValid,ValidSet$Condition)
predValid acc good unacc vgood
acc 117 0 2 0
good 1 16 0 0
unacc 1 0 352 0
vgood 1 1 0 28
In the case of prediction on the train dataset, there is zero misclassification; however, on the validation dataset, 6 data points are misclassified and the accuracy is 98.84%. We can also use the importance() function to check the important variables. The functions below show the drop in mean accuracy for each of the variables.
# To check important variables
importance(model2)
varImpPlot(model2)
> importance(model2)
acc good unacc vgood MeanDecreaseAccuracy MeanDecreaseGini
BuyingPrice 143.90534 80.38431 101.06518 66.75835 188.10368 71.15110
Maintenance 130.61956 77.28036 98.23423 43.18839 171.86195 90.08217
NumDoors 32.20910 16.14126 34.46697 19.06670 49.35935 32.45190
NumPersons 142.90425 51.76713 178.96850 49.06676 214.55381 125.13812
BootSpace 85.36372 60.34130 74.32042 50.24880 132.20780 72.22591
Safety 179.91767 93.56347 207.03434 90.73874 275.92450 149.74474
> varImpPlot(model2)
[Figure: varImpPlot(model2) – variable importance plot]
Now, we will use a ‘for’ loop and check different values of mtry.
# Using a for loop to identify the right mtry for the model
a <- c()
for (i in 3:8) {
  model3 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = i, importance = TRUE)
  predValid <- predict(model3, ValidSet, type = "class")
  a[i-2] <- mean(predValid == ValidSet$Condition)
}
a
plot(3:8, a)
> a
[1] 0.9749518 0.9884393 0.9845857 0.9884393 0.9884393 0.9903661
>
> plot(3:8,a)
[Figure: plot(3:8, a) – validation accuracy for each value of mtry]
From the above graph, we can see that the accuracy decreased when mtry was increased from 4 to 5 and then increased again when mtry was changed from 5 to 6. The maximum accuracy is at mtry equal to 8.
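Rather than reading the best value off the plot, the same answer can be pulled directly out of the accuracy vector (values copied from the output above):

```r
# Validation accuracies for mtry = 3..8, as printed above
a <- c(0.9749518, 0.9884393, 0.9845857, 0.9884393, 0.9884393, 0.9903661)
best_mtry <- (3:8)[which.max(a)]
best_mtry   # 8
```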
Now we have seen the implementation of Random Forest and understood the importance of the model’s variables. Let’s compare this model with a decision tree and see how decision trees fare in comparison to random forest.
# Compare with Decision Tree
install.packages("rpart")
install.packages("caret")
install.packages("e1071")
library(rpart)
library(caret)
library(e1071)
# We will compare model1 of Random Forest with the Decision Tree model
model_dt <- train(Condition ~ ., data = TrainSet, method = "rpart")
model_dt_1 <- predict(model_dt, newdata = TrainSet)
table(model_dt_1, TrainSet$Condition)
mean(model_dt_1 == TrainSet$Condition)
> table(model_dt_1, TrainSet$Condition)
model_dt_1 acc good unacc vgood
acc 241 52 132 37
good 0 0 0 0
unacc 23 0 724 0
vgood 0 0 0 0
>
> mean(model_dt_1 == TrainSet$Condition)
[1] 0.7981803
On the training dataset, the accuracy is around 79.8% and there is a lot of misclassification. Now, let’s look at the validation dataset.
# Running on Validation Set
model_dt_vs = predict(model_dt, newdata = ValidSet)
table(model_dt_vs, ValidSet$Condition)
mean(model_dt_vs == ValidSet$Condition)
> table(model_dt_vs, ValidSet$Condition)
model_dt_vs acc good unacc vgood
acc 107 17 58 28
good 0 0 0 0
unacc 13 0 296 0
vgood 0 0 0 0
>
> mean(model_dt_vs == ValidSet$Condition)
[1] 0.7764933
On the validation dataset, the accuracy has decreased further, to 77.6% – far below the 98.84% achieved by the Random Forest model.