Random Forest
Introduction
Random Forest is a widely-used machine
learning algorithm developed by Leo Breiman and Adele
Cutler, which combines the output of multiple decision trees to reach a single
result. Its ease of use and flexibility have fueled its adoption, as it handles
both classification and regression problems.
What is a Random Forest Algorithm?
The Random Forest algorithm's widespread popularity stems from its user-friendly nature and adaptability, enabling it to effectively tackle both classification and regression problems. The algorithm's strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning.
One of the most important features of the Random Forest algorithm is that it can handle data sets containing continuous variables (as in regression) as well as categorical variables (as in classification). It performs well on both classification and regression tasks. In this tutorial, we will understand the working of random forests and implement a random forest on a classification task.
Real-Life Analogy of Random Forest
Let’s dive into a real-life analogy to
understand this concept further. A student named X wants to choose a course
after his 10+2, and he is confused about the choice of course based on his
skill set. So he consults various people like his cousins, teachers,
parents, degree students, and working people. He asks them varied questions
such as why he should choose a particular course, the job opportunities it offers, the course fee, and so on. Finally, after consulting various people about the course, he decides to take the course suggested by most of them.
Working of Random Forest Algorithm
Before understanding the workings of the
random forest algorithm in machine learning, we must look into the ensemble
learning technique. Ensemble simply means combining
multiple models. Thus a collection of models is used to make predictions rather
than an individual model.
The ensemble uses two types of methods:
Bagging
It creates different training subsets from the sample training data with replacement, and the final output is based on majority voting. Random Forest is an example.
Boosting
It combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy. AdaBoost and XGBoost are examples.
As mentioned earlier, Random forest
works on the Bagging principle. Now let’s dive in and understand bagging
in detail.
Bagging
Bagging, also known as Bootstrap
Aggregation, serves as the ensemble technique in the Random Forest algorithm.
Here are the steps involved in Bagging:
- Selection of Subset: Bagging starts by choosing a random sample, or subset, from the entire dataset.
- Bootstrap Sampling: Each model is then created from these samples, called Bootstrap Samples, which are taken from the original data with replacement. This process is known as row sampling.
- Bootstrapping: The step of row sampling with replacement is referred to as bootstrapping.
- Independent Model Training: Each model is trained independently on its corresponding Bootstrap Sample. This training process generates results for each model.
- Majority Voting: The final output is determined by combining the results of all models through majority voting. The most commonly predicted outcome among the models is selected.
- Aggregation: This step, which involves combining all the results and generating the final output based on majority voting, is known as aggregation.
Now let’s look at an example by breaking it down with the help of the following figure. Here, bootstrap samples are taken from the actual data (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) with replacement, which means each sample is unlikely to contain only unique data. The models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples are trained independently. Each model generates results as shown. The Happy emoji has a majority when compared to the Sad emoji, so based on majority voting the final output is a Happy emoji.
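To make this procedure concrete, here is a minimal bagging sketch written from scratch in Python. The dataset (Iris), the number of models, and the variable names are illustrative assumptions for this example, not part of the original tutorial.

```python
# A minimal bagging sketch: bootstrap rows with replacement, train one
# decision tree per bootstrap sample, and combine the predictions by
# majority voting.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = load_iris(return_X_y=True)

n_models = 3  # e.g. Model 01, Model 02, Model 03
models = []
for _ in range(n_models):
    # Row sampling with replacement (bootstrapping).
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    models.append(tree)

# Each model votes; the most common prediction per row is the final output.
votes = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("Training accuracy of the bagged ensemble:", (majority == y).mean())
```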
Boosting is one of the techniques that
use the concept of ensemble learning. A boosting algorithm combines multiple
simple models (also known as weak learners or base estimators) to generate the
final output. It is done by building a model by using weak models in series.
There are several boosting algorithms;
AdaBoost was the first really successful boosting algorithm that was developed
for the purpose of binary classification. AdaBoost is an abbreviation for
Adaptive Boosting and is a prevalent boosting technique that combines multiple
“weak classifiers” into a single “strong classifier.” There are other boosting techniques as well.
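As a brief illustration of boosting, the sketch below fits scikit-learn's AdaBoostClassifier on a toy dataset; the dataset choice and parameter values are assumptions made for this example.

```python
# Illustrative AdaBoost sketch: many shallow "weak" trees are combined
# sequentially into one "strong" classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each weak learner is a decision stump (a depth-1 tree) by default.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("AdaBoost test accuracy:", clf.score(X_test, y_test))
```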
Steps Involved in Random Forest Algorithm
- Step 1: In the Random Forest model, a subset of data points and a subset of features are selected for constructing each decision tree. Simply put, n random records and m features are taken from a data set having k records.
- Step 2: Individual decision trees are constructed for each sample.
- Step 3: Each decision tree generates an output.
- Step 4: The final output is based on majority voting for classification or averaging for regression.
For example, consider the fruit basket as the data, as shown in the figure below. Now, n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure. The final output is based on majority voting. In the figure below, you can see that the majority of the decision trees give the output as an apple rather than a banana, so the final output is an apple.
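Putting these steps together, here is a minimal sketch of training a random forest on a classification task with scikit-learn; the dataset, split size, and parameter values are illustrative choices for this example.

```python
# Minimal random forest classification sketch: many decision trees, each
# built on a bootstrap sample and a random subset of features, vote on
# the final class.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
print("Class predicted for the first test sample:", forest.predict(X_test[:1]))
```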
Important Features of Random Forest
- Diversity: Not all attributes/variables/features are considered while making an individual tree; each tree is different.
- Immune to the curse of dimensionality: Since each tree does not consider all the features, the feature space is reduced.
- Parallelization: Each tree is created independently out of different data and attributes. This means we can fully use the CPU to build random forests.
- Train-Test split: In a random forest, we don’t have to segregate the data into train and test sets, because roughly one-third of the data is never seen by a given decision tree (the out-of-bag samples); a short simulation of this is shown after this list.
- Stability: Stability arises because the result is based on majority voting/ averaging.
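The "roughly one-third unseen" claim can be checked with a short simulation; the sample size used here is an arbitrary illustration.

```python
# Quick check of the out-of-bag fraction: when we draw n rows with
# replacement from n rows, roughly 1/e (about 36.8%) of the original
# rows are never drawn, i.e. are "unseen" by that tree.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bootstrap = rng.integers(0, n, size=n)  # row sampling with replacement
unseen_fraction = 1 - len(np.unique(bootstrap)) / n
print(f"Fraction of rows not seen by this bootstrap sample: {unseen_fraction:.3f}")
```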
Difference Between Decision Tree and Random Forest
Random
forest is a collection of decision trees; still, there are a lot of differences
in their behavior.
| Decision Trees | Random Forest |
| --- | --- |
| 1. Decision trees normally suffer from the problem of overfitting if they are allowed to grow without any control. | 1. Random forests are created from subsets of data, and the final output is based on averaging or majority ranking; hence the problem of overfitting is taken care of. |
| 2. A single decision tree is faster in computation. | 2. It is comparatively slower. |
| 3. When a data set with features is taken as input by a decision tree, it formulates a single set of rules to make predictions. | 3. A random forest randomly selects observations, builds decision trees, and takes the average result. It does not rely on any single set of rules. |
Thus, random forests are more successful than single decision trees only if the individual trees are diverse and reasonably accurate.
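As a rough illustration of the table above, the sketch below compares a single unpruned decision tree with a random forest on the same train/test split; the dataset and parameter values are assumptions for this example, and the exact numbers will vary.

```python
# Compare a single (unpruned) decision tree with a random forest on the
# same train/test split; the forest usually generalizes better.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)

print("Single decision tree test accuracy:", tree.score(X_test, y_test))
print("Random forest test accuracy:", forest.score(X_test, y_test))
```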
Important Hyperparameters in Random Forest
Hyperparameters
are used in random forests to either enhance the performance and predictive
power of models or to make the model faster.
Hyperparameters to Increase the Predictive Power
- n_estimators: The number of trees the algorithm builds before averaging the predictions.
- max_features: The maximum number of features the random forest considers when splitting a node.
- min_samples_leaf: The minimum number of samples required to be at a leaf node.
- criterion: How to measure the quality of a split in each tree (Gini impurity, entropy, or log loss).
- max_leaf_nodes: The maximum number of leaf nodes in each tree.
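The sketch below shows how these names map onto scikit-learn's RandomForestClassifier; the specific values are illustrative examples, not tuned recommendations.

```python
# Illustrative settings for the "predictive power" hyperparameters of
# scikit-learn's RandomForestClassifier (values are examples, not tuned).
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=300,      # number of trees built before averaging predictions
    max_features="sqrt",   # maximum features considered when splitting a node
    min_samples_leaf=2,    # minimum samples required at a leaf node
    criterion="gini",      # split quality measure (gini, entropy, or log_loss)
    max_leaf_nodes=50,     # cap on the number of leaf nodes per tree
)
```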
Hyperparameters to Increase the Speed
- n_jobs: Tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor; if the value is -1, there is no limit.
- random_state: Controls the randomness of the sampling. The model will always produce the same results if it has a fixed value of random_state and is given the same hyperparameters and training data.
- oob_score: OOB stands for out of bag. It is a random forest cross-validation method in which roughly one-third of the samples are not used to train a given tree and are instead used to evaluate its performance. These samples are called out-of-bag samples.
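The sketch below illustrates these speed-related settings and how the out-of-bag score is read after fitting; the dataset is an arbitrary toy example.

```python
# Illustrative settings for the "speed" hyperparameters, plus reading the
# out-of-bag score after fitting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,
    n_jobs=-1,          # use all available processors
    random_state=42,    # reproducible sampling
    oob_score=True,     # evaluate each tree on its out-of-bag samples
)
forest.fit(X, y)
print("Out-of-bag score:", forest.oob_score_)
```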
Random Forest Algorithm Use Cases
This algorithm is widely used in e-commerce, banking, medicine, the stock market, and more. For example, in the banking industry it can be used to predict which customers are likely to default on a loan.
Applications of Random Forest
Some of the applications of Random
Forest Algorithm are listed below:
1. Banking: It predicts a loan applicant’s solvency. This helps lending institutions make a sound decision on whether or not to grant the customer a loan. Random forests are also used to detect fraudsters.
2. Health Care: Health professionals use random forest systems to diagnose patients. Patients are diagnosed by assessing their previous medical history, and past medical records are reviewed to establish the proper dosage for the patients.
3. Stock Market: Financial analysts use it to identify potential markets for stocks. It also enables them to identify the behaviour of stocks.
4. E-Commerce: Through this system, e-commerce vendors can predict customers’ preferences based on past consumption behaviour.
Advantages and Disadvantages of Random Forest Algorithm
Advantages
- It can be used in
classification and regression problems.
- It solves the
problem of overfitting as output is based on majority voting or averaging.
- It performs well
even if the data contains null/missing values.
- Each decision tree
created is independent of the other; thus, it shows the property of parallelization.
- It is highly
stable as the average answers given by a large number of trees are taken.
- It maintains diversity because not all attributes are considered while making each decision tree (though this is not true in all cases).
- It is immune to
the curse of dimensionality. Since each tree does not consider all the
attributes, feature space is reduced.
Disadvantages
- Random forest is highly complex compared to a decision tree, where decisions can be made by following the path of the tree.
- Training time is longer than for other models due to its complexity: whenever it has to make a prediction, each decision tree has to generate an output for the given input data.
When to Avoid Using Random Forests?
Random Forests Algorithms are not ideal in the
following situations:
- Extrapolation: Random Forest regression is not ideal for extrapolating data. Unlike linear regression, it cannot use existing observations to estimate values beyond the observed range (see the sketch after this list).
- Sparse Data: Random Forest does not produce good results when the data is sparse. In this case, the subset of features and the bootstrapped sample produce an invariant feature space, leading to unproductive splits, which affect the outcome.
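As a quick demonstration of the extrapolation point, the sketch below fits a random forest and a linear regression to a simple linear trend and asks both to predict outside the training range; the data is synthetic and purely illustrative.

```python
# Random forest regression cannot extrapolate beyond the training range:
# its predictions are averages of training targets, so they flatten out,
# while linear regression continues the trend.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

X_new = np.array([[15.0], [20.0]])  # well outside the 0-10 training range
print("Random forest predictions:", forest.predict(X_new))        # flatten near ~30
print("Linear regression predictions:", linear.predict(X_new))    # continue the trend
```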