Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.

In this post you’ll discover the Bagging ensemble algorithm and the Random Forest algorithm for predictive modeling. After reading this post you’ll know about:

The bootstrap method for estimating statistical quantities from samples.

The Bootstrap Aggregation algorithm for creating multiple different models from one training dataset.

The Random Forest algorithm that makes a small tweak to Bagging and results in a very powerful classifier.

This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems.

Bootstrap Method

Before we get to Bagging, let’s take a quick look at an important foundation technique called the bootstrap.

The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.

Let’s assume we have a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample.

We can calculate the mean directly from the sample as:

mean(x) = 1/100 * sum(x)

We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure:

Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).

Calculate the mean of each sub-sample.

Calculate the average of all of our collected means and use that as our estimated mean for the data.

For example, let’s say we used 3 resamples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these, we could take the estimated mean of the data to be 3.367.
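
The bootstrap steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `bootstrap_mean`, the seed values, and the synthetic data are assumptions made for the example, and each resample is assumed to be the same size as the original sample.

```python
import random
import statistics

def bootstrap_mean(sample, n_resamples=1000, seed=0):
    """Estimate the mean of `sample` by averaging the means of bootstrap resamples."""
    rng = random.Random(seed)
    resample_means = []
    for _ in range(n_resamples):
        # Draw len(sample) values with replacement, so the same value can repeat.
        resample = rng.choices(sample, k=len(sample))
        resample_means.append(statistics.mean(resample))
    # The estimate is the average of all collected resample means.
    return statistics.mean(resample_means)

# Hypothetical data: 100 values drawn from a normal distribution.
data_rng = random.Random(42)
x = [data_rng.gauss(50.0, 10.0) for _ in range(100)]

print("sample mean:   ", statistics.mean(x))
print("bootstrap mean:", bootstrap_mean(x))

# The worked example from the text: the average of three resample means.
print(round(statistics.mean([2.3, 4.5, 3.3]), 3))
```

Swapping `statistics.mean` for another statistic inside the loop gives the same procedure for that quantity.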

This process can be used to estimate other quantities like the variance and even quantities used in machine learning algorithms, like learned coefficients.

Bootstrap Aggregation (Bagging)

Bootstrap Aggregation (or Bagging for short) is a simple and very powerful ensemble method.

An ensemble method is a technique that combines the predictions from multiple machine learning algorithms to make more accurate predictions than any individual model.

Bootstrap Aggregation is a general procedure that can be used to reduce the variance of algorithms that have high variance. Decision trees, like classification and regression trees (CART), are one such high-variance algorithm.

Decision trees are sensitive to the specific data on which they’re trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn the predictions can be quite different.

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

Let’s assume we have a sample dataset of 1000 instances (x) and that we are using the CART algorithm. Bagging of the CART algorithm would work as follows.

Create many (e.g. 100) random sub-samples of our dataset with replacement.

Train a CART model on each sample.

Given a new dataset, calculate the average prediction from each model.

For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue.
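
The bagging procedure and the majority vote can be sketched as follows. To keep the example short and dependency-free, a one-split "stump" on a single numeric feature stands in for a full CART tree; that substitution, the function names (`fit_stump`, `bagging_fit`, `bagging_predict`, `majority_vote`), and the toy dataset are all assumptions made for illustration.

```python
import random
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent class among the models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def fit_stump(rows):
    """Fit a one-split tree on (x, label) rows. Hypothetical stand-in for CART."""
    best = None
    for threshold, _ in rows:
        left = [label for x, label in rows if x <= threshold]
        right = [label for x, label in rows if x > threshold]
        left_label = majority_vote(left)  # left always contains the threshold row
        right_label = majority_vote(right) if right else left_label
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def bagging_fit(rows, n_models, rng):
    """Train one model per bootstrap sample of the training rows."""
    models = []
    for _ in range(n_models):
        sample = rng.choices(rows, k=len(rows))  # sampling with replacement
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the class predictions of all models by majority vote."""
    return majority_vote([model(x) for model in models])

# The example from the text: three votes for blue, two for red.
print(majority_vote(["blue", "blue", "red", "blue", "red"]))

# Toy dataset (an assumption): values below 50 are red, the rest are blue.
rng = random.Random(0)
rows = [(float(i), "red" if i < 50 else "blue") for i in range(100)]
models = bagging_fit(rows, n_models=25, rng=rng)
print(bagging_predict(models, 10.0), bagging_predict(models, 90.0))
```

For a regression problem, the vote would be replaced by the average of the models' numeric predictions.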

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf-node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.

The only parameter when bagging decision trees is the number of samples and hence the number of trees to include. This can be chosen by increasing the number of trees on run after run until the accuracy stops showing improvement (e.g. on a cross validation test harness). Very large numbers of models may take a long time to prepare, but won’t overfit the training data.

Just like the decision trees themselves, Bagging can be used for classification and regression problems.

Random Forest

Random Forests are an improvement over bagged decision trees.

A problem with decision trees like CART is that they’re greedy. They choose which variable to split on using a greedy algorithm that minimizes error. As such, even with Bagging, the decision trees can have a lot of structural similarities and in turn have high correlation in their predictions.

Combining predictions from multiple models in ensembles works better if the predictions from the sub-models are uncorrelated or at best weakly correlated.

Random forest changes the way that the sub-trees are learned so that the resulting predictions from all of the subtrees have less correlation.

It is a simple tweak. In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the best split-point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features to search.
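
As a rough sketch of that tweak, the snippet below draws a fresh random subset of feature indices for each split point. The helper name `candidate_features` and the values of `p` and `m` are assumptions for illustration; a real tree learner would then search only these features for the best split.

```python
import random

def candidate_features(n_features, m, rng):
    """Return m distinct feature indices to search at one split point."""
    return rng.sample(range(n_features), m)

rng = random.Random(0)
p = 25  # number of input variables (assumed for the example)
m = 5   # number of features searched per split (assumed for the example)

# Each split point gets its own random subset of features.
for split in range(3):
    print(sorted(candidate_features(p, m, rng)))
```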

The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm. You can try different values and tune it using cross validation.

For classification a good default is: m = sqrt(p)

For regression a good default is: m = p/3

Where m is the number of randomly selected features that can be searched at a split point and p is the number of input variables. For example, if a dataset had 25 input variables for a classification problem, then:

m = sqrt(25)

m = 5
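
The two defaults can be written as a small helper. The function name `default_m` is hypothetical, and rounding down to a whole number of features (with a minimum of 1) is an assumption, since the text does not say how to handle values of p that do not divide evenly.

```python
import math

def default_m(p, task):
    """Suggested number of features to search per split for p input variables."""
    if task == "classification":
        return max(1, math.isqrt(p))  # floor of sqrt(p)
    if task == "regression":
        return max(1, p // 3)         # floor of p/3
    raise ValueError("task must be 'classification' or 'regression'")

print(default_m(25, "classification"))  # 5, matching the example above
print(default_m(25, "regression"))      # 8
```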

Estimated Performance

For each bootstrap sample taken from the training data, there’ll be samples left behind that weren’t included. These samples are called Out-Of-Bag samples or OOB.

The performance of each model on its left-out samples, when averaged, can provide an estimated accuracy of the bagged models. This estimated performance is often called the OOB estimate of performance.

These performance measures are a reliable estimate of test error and correlate well with cross validation estimates.
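
To make the idea of out-of-bag samples concrete, the sketch below draws one bootstrap sample of row indices and lists the rows that were never selected. The function name `bootstrap_with_oob` and the dataset size are assumptions. When the bootstrap sample is the same size as the training data, roughly 37% of the rows (about 1/e) end up out-of-bag for any given model.

```python
import random

def bootstrap_with_oob(n_rows, rng):
    """Draw one bootstrap sample of row indices and return it with its OOB rows."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))    # never selected
    return in_bag, out_of_bag

rng = random.Random(0)
n_rows = 1000  # assumed dataset size
in_bag, out_of_bag = bootstrap_with_oob(n_rows, rng)

# Fraction of rows this model never saw during training.
print(len(out_of_bag) / n_rows)
```

Each model would be scored only on its own `out_of_bag` rows, and those scores averaged across models to form the OOB estimate.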

Variable Importance

As the Bagged decision trees are constructed, we can calculate how much the error function drops for a variable at each split point.

In regression problems this may be the drop in sum squared error and in classification this might be the Gini score.

These drops in error can be averaged across all decision trees and output to provide an estimate of the importance of each input variable. The greater the drop when the variable was chosen, the greater the importance.
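
A minimal sketch of this calculation for classification is shown below, using the drop in Gini impurity at a split as the error reduction. The function names, the per-tree list of `(feature, drop)` records, and the feature names `x1` and `x2` are hypothetical, and dividing each variable's total drop by the number of trees is one plausible reading of "averaged across all decision trees".

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_drop(parent, left, right):
    """Drop in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

def average_importance(trees):
    """Average each variable's total drop over all trees.

    `trees` is a list with one entry per tree; each entry is a list of
    (feature_name, drop) records, one per split in that tree.
    """
    totals = defaultdict(float)
    for splits in trees:
        for feature, drop in splits:
            totals[feature] += drop
    return {feature: total / len(trees) for feature, total in totals.items()}

# A perfect split of a balanced two-class node removes all impurity.
print(gini_drop(["a", "a", "b", "b"], ["a", "a"], ["b", "b"]))

# Hypothetical split records from two trees.
trees = [[("x1", 0.4)], [("x1", 0.2), ("x2", 0.1)]]
print(average_importance(trees))
```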

These outputs can help identify subsets of input variables that may be most or least relevant to the problem and suggest possible feature selection experiments you could perform where some features are removed from the dataset.