25 Jan
You must have heard of Random Forest, Random Forest in R or Random Forest in Python! This article is curated to give you a great insight into how to implement Random Forest in R.
We will walk through a Random Forest in R example to understand the concept even better.
When we buy an expensive item such as a car or a home, or invest in the share market, we prefer to take advice from several people. We rarely walk into a shop and purchase an item at random. Instead, we collect suggestions from people we know and choose the best option after weighing the positives and negatives each of them reports. The reason is that one person's review can be biased by his interests and past experiences; by asking several people we try to mitigate the bias of any individual. One person may have a strong aversion to a product because of a bad experience with it, while several others may strongly favor the same product because their experience was very positive.
This concept is called ‘ensembling’ in analytics. Ensembling is a technique in which many models are trained on a training dataset and their outputs are combined by some rule to get the final output.
Decision trees have one serious drawback: they are prone to overfitting. If a decision tree is grown very deep, it learns all possible relationships in the training data, including noise. Overfitting can be mitigated with a technique called pruning, which reduces the size of a decision tree by removing the parts that contribute little to correct classification. Even with pruning, the result is often not up to the mark. The primary reason is that the algorithm makes a locally optimal choice at each split, without regard to whether that choice is best for the fully grown tree. A bad split at an early stage can therefore produce a poor model, and post-hoc pruning cannot compensate for it.
Decision trees are very popular because the way they make decisions reflects how humans make decisions: they check the options at each stage of the tree and select the best one. The analogy also suggests how decision trees can be improved.
One TV game show offers contestants an option (“Audience poll”) to ask the audience to vote on a question when they are clueless. The reasoning is that the answer given by the majority of independent people is more likely to be correct.
Based on the above comparison with human thinking, it seems reasonable to build many decision trees, each using random subsets of:
- the observations (a bootstrap sample of the training data for each tree)
- the predictors (a random subset of the predictors considered at each split)
The final prediction is obtained by combining the predictions of all trees: the majority vote (the mode) for classification problems and the average for regression problems. This is how the random forest algorithm works.
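The aggregation step can be sketched in a few lines of base R. This is only an illustration: the tree predictions below are made-up values, and `majority_vote` is a hypothetical helper, not a function from any package.

```r
# Hypothetical predictions of five trees (columns) for three observations (rows).
tree_votes <- rbind(
  c("Yes", "Yes", "No",  "Yes", "Yes"),
  c("No",  "No",  "No",  "Yes", "No"),
  c("No",  "Yes", "Yes", "No",  "Yes")
)

# Classification: return the most frequent label (the mode) among the votes.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}
forest_class <- apply(tree_votes, 1, majority_vote)

# Regression: average the numeric predictions of the trees for each observation.
tree_values <- rbind(
  c(10, 12, 11, 13, 14),
  c(20, 19, 21, 20, 20)
)
forest_value <- rowMeans(tree_values)

print(forest_class)
print(forest_value)
```

With an odd number of trees and two classes there are no ties; with ties, `which.max()` simply picks the first label in alphabetical order, which is a simplification.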
These two strategies help to reduce overfitting: the response is averaged over trees built from different samples of the dataset, and a small set of strong predictors is less likely to dominate the splits. But everything has a price. Here, model interpretability is reduced and computational cost increases.
Without going into many mathematical details of the algorithm, let’s understand how the above points are implemented in the algorithm.
The main feature of this algorithm is that each tree is built on a different dataset. This is achieved by a statistical method called bootstrap aggregating (bagging).
Imagine a dataset of size N. From this dataset we create a sample of size n (n <= N) by selecting n data points randomly with replacement. “Randomly” signifies that every data point in the dataset has an equal probability for selection and “with replacement” means that a particular data point can appear more than once in the subset.
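A minimal base R sketch of drawing one bootstrap sample is shown below. The toy data frame, the seed and the variable names are assumptions made for illustration only.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

N <- 10                                        # size of the full dataset
dataset <- data.frame(id = 1:N, x = rnorm(N))  # toy data, not the Titanic data

n <- N                                         # bootstrap sample size (n <= N)
# Draw n row indices uniformly at random, with replacement.
boot_idx <- sample(seq_len(N), size = n, replace = TRUE)
boot_sample <- dataset[boot_idx, ]

# Indices drawn more than once show a count above 1.
print(table(boot_idx))
```

Each tree in the forest would get its own `boot_sample`, drawn independently in the same way.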
Since each bootstrap sample is created by sampling with replacement, some data points are not selected for a given sample. On average, when n = N, each sample contains about two-thirds of the distinct data points, and the remaining one-third (the out-of-bag points) is not used to train that tree. Those held-out points give us a way to estimate the model's error without a separate validation set.
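The "about one-third left out" figure can be checked with a small simulation. This sketch assumes n = N and uses arbitrary values for the dataset size and the number of repetitions.

```r
set.seed(1)    # arbitrary seed

N <- 1000      # dataset size (assumed for the simulation)
n_trees <- 200 # number of bootstrap samples to draw

# For each bootstrap sample, compute the share of points that were never drawn.
oob_fraction <- replicate(n_trees, {
  in_bag <- unique(sample(seq_len(N), size = N, replace = TRUE))
  1 - length(in_bag) / N
})

# Probability that a given point is missed in all N draws.
theoretical <- (1 - 1 / N)^N

print(mean(oob_fraction))
print(theoretical)
```

Both numbers should land near exp(-1), roughly 0.37, which is where the one-third rule of thumb comes from.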
Bagging reduces overfitting to a certain extent but does not eliminate it completely. The reason is that a few strong input predictors influence the tree splits and overshadow the weak predictors. These strong predictors dominate the early splits of every decision tree and so shape the structure and size of all trees in the forest. The trees then become correlated, because the same predictors drive the splits, and they tend to give the same classification result.
The random forest has a solution to this: at each split, it considers only a random subset of the predictors, so the candidate variables differ from split to split. Strong predictors can no longer overshadow the other fields, and we get a more diverse forest.
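The per-split sampling of predictors can be illustrated as follows. The predictor names are taken from the Titanic variables that appear in the importance output of this article; the choice of `floor(sqrt(p))` candidates matches the randomForest default for classification, and everything else is an assumption for the sketch.

```r
set.seed(7)  # arbitrary seed

predictors <- c("pclass", "sex", "age", "sibsp", "parch", "fare", "embarked")

# Number of candidate predictors per split (default rule for classification).
mtry <- floor(sqrt(length(predictors)))

# Simulate three consecutive splits: each one sees its own random subset.
candidates_per_split <- replicate(3, sample(predictors, mtry), simplify = FALSE)
print(candidates_per_split)
```

Because a strong predictor such as `sex` is absent from many of these subsets, other variables get a chance to define splits.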
We will proceed as follows to train the Random Forest:
Before you begin exploring the parameters, you need to install two libraries.
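The two libraries are not named at this point. The code in this article calls caret functions, and caret's method = "rf" relies on the randomForest package, so the sketch below assumes those two. `missing_packages` is a hypothetical helper that only reports what is not installed; it does not install anything.

```r
# Assumed package list; adjust it if your setup needs other packages.
required <- c("caret", "randomForest")

# Return the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Install with: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
}
```

Once the packages are installed, `library(caret)` makes train(), trainControl() and confusionMatrix() available.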
The trainControl() function controls the k-fold cross-validation. You can first run the model with the default parameters and see the accuracy score.
The basic syntax is:
train(formula, df, method = "rf", metric= "Accuracy", trControl = trainControl(), tuneGrid = NULL)
Arguments:
- ‘formula’: Define the formula of the algorithm
- ‘method’: Define which model to train
- ‘metric’ = "Accuracy": Define how to select the optimal model
- ‘trControl = trainControl()’: Define the control parameters
- ‘tuneGrid = NULL’: A data frame with all the combinations of tuning parameters to try; NULL lets caret choose them
You will use the caret library to evaluate your model. The library has one function called train() that can train and evaluate almost all machine learning algorithms. Put differently, you can use the same function to train other algorithms.
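library(caret)
# 10-fold cross-validation with a grid search (see the code explanation below)
trControl <- trainControl(method = "cv", number = 10, search = "grid")
set.seed(1234)
# Run the model; data_train is the Titanic training set
rf_default <- train(survived ~ .,
                    data = data_train,
                    method = "rf",
                    metric = "Accuracy",
                    trControl = trControl)
# Print the results
print(rf_default)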
Code Explanation
trainControl(method = "cv", number = 10, search = "grid"): Evaluate the model with 10-fold cross-validation and a grid search
train(...): Train a random forest model.
Output:
The algorithm uses 500 trees and tested three different values of mtry: 2, 6 and 10. The final value used for the model was mtry = 2, with an accuracy of 0.78. Let's try to get a higher score.
Step 2) Finding best mtry
Let’s test the model with values of mtry from 1 to 10
set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1: 10))
rf_mtry <- train(survived~.,
data = data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
ntree = 300)
print(rf_mtry)
Code Explanation: tuneGrid <- expand.grid(.mtry = c(1:10)): Construct a one-column data frame with the values of mtry from 1 to 10
The final value used for the model was mtry = 4.
Output:
## Random Forest
## 836 samples
## 7 predictor
## 2 classes: 'No', 'Yes'
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 753, 752, 753, 752, 752, 752, ...
## Resampling results across tuning parameters:
## mtry Accuracy Kappa
## 1 0.7572576 0.4647368
## 2 0.7979346 0.5662364
## 3 0.8075158 0.5884815
## 4 0.8110729 0.5970664
## 5 0.8074727 0.5900030
## 6 0.8099111 0.5949342
## 7 0.8050918 0.5866415
## 8 0.8050918 0.5855399
## 9 0.8050631 0.5855035
## 10 0.7978916 0.5707336
##Final model was built using mtry = 4.
The best value of mtry is stored in:
rf_mtry$bestTune$mtry
You can store it and use it when you need to tune the other parameters.
max(rf_mtry$results$Accuracy)
Output:
## [1] 0.8110729
best_mtry <- rf_mtry$bestTune$mtry
best_mtry
Output:
## [1] 4
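Next, loop over different values of maxnodes. The code below creates a list to store the models, fixes mtry at its best value, trains one model for each maxnodes value from 5 to 15, stores each model under its maxnodes value, and summarizes the results with resamples().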
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(5: 15)) {
set.seed(1234)
rf_maxnode <- train(survived~.,
data = data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
maxnodes = maxnodes,
ntree = 300)
current_iteration <- toString(maxnodes)
store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)
Output:
## Call:
## summary.resamples(object = results_mtry)
## Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
## Number of resamples: 10
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5 0.6785714 0.7529762 0.7903758 0.7799771 0.8168388 0.8433735 0
## 6 0.6904762 0.7648810 0.7784710 0.7811962 0.8125000 0.8313253 0
## 7 0.6904762 0.7619048 0.7738095 0.7788009 0.8102410 0.8333333 0
## 8 0.6904762 0.7627295 0.7844234 0.7847820 0.8184524 0.8433735 0
## 9 0.7261905 0.7747418 0.8083764 0.7955250 0.8258749 0.8333333 0
## 10 0.6904762 0.7837780 0.7904475 0.7895869 0.8214286 0.8433735 0
## 11 0.7023810 0.7791523 0.8024240 0.7943775 0.8184524 0.8433735 0
## 12 0.7380952 0.7910929 0.8144005 0.8051205 0.8288511 0.8452381 0
## 13 0.7142857 0.8005952 0.8192771 0.8075158 0.8403614 0.8452381 0
## 14 0.7380952 0.7941050 0.8203528 0.8098967 0.8403614 0.8452381 0
## 15 0.7142857 0.8000215 0.8203528 0.8075301 0.8378873 0.8554217 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5 0.3297872 0.4640436 0.5459706 0.5270773 0.6068751 0.6717371 0
## 6 0.3576471 0.4981484 0.5248805 0.5366310 0.6031287 0.6480921 0
## 7 0.3576471 0.4927448 0.5192771 0.5297159 0.5996437 0.6508314 0
## 8 0.3576471 0.4848320 0.5408159 0.5427127 0.6200253 0.6717371 0
## 9 0.4236277 0.5074421 0.5859472 0.5601687 0.6228626 0.6480921 0
## 10 0.3576471 0.5255698 0.5527057 0.5497490 0.6204819 0.6717371 0
## 11 0.3794326 0.5235007 0.5783191 0.5600467 0.6126720 0.6717371 0
## 12 0.4460432 0.5480930 0.5999072 0.5808134 0.6296780 0.6717371 0
## 13 0.4014252 0.5725752 0.6087279 0.5875305 0.6576219 0.6678832 0
## 14 0.4460432 0.5585005 0.6117973 0.5911995 0.6590982 0.6717371 0
## 15 0.4014252 0.5689401 0.6117973 0.5867010 0.6507194 0.6955990 0
The largest values of maxnodes give the highest accuracy: the mean peaks at 14 (0.8099) and the single best fold occurs at 15 (0.8554). You can try higher values to see if you can get a higher score.
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(20: 30)) {
set.seed(1234)
rf_maxnode <- train(survived~.,
data = data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
maxnodes = maxnodes,
ntree = 300)
key <- toString(maxnodes)
store_maxnode[[key]] <- rf_maxnode
}
results_node <- resamples(store_maxnode)
summary(results_node)
Output:
##
## Call:
## summary.resamples(object = results_node)
##
## Models: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 20 0.7142857 0.7821644 0.8144005 0.8075301 0.8447719 0.8571429 0
## 21 0.7142857 0.8000215 0.8144005 0.8075014 0.8403614 0.8571429 0
## 22 0.7023810 0.7941050 0.8263769 0.8099254 0.8328313 0.8690476 0
## 23 0.7023810 0.7941050 0.8263769 0.8111302 0.8447719 0.8571429 0
## 24 0.7142857 0.7946429 0.8313253 0.8135112 0.8417599 0.8690476 0
## 25 0.7142857 0.7916667 0.8313253 0.8099398 0.8408635 0.8690476 0
## 26 0.7142857 0.7941050 0.8203528 0.8123207 0.8528758 0.8571429 0
## 27 0.7023810 0.8060456 0.8313253 0.8135112 0.8333333 0.8690476 0
## 28 0.7261905 0.7941050 0.8203528 0.8111015 0.8328313 0.8690476 0
## 29 0.7142857 0.7910929 0.8313253 0.8087063 0.8333333 0.8571429 0
## 30 0.6785714 0.7910929 0.8263769 0.8063253 0.8403614 0.8690476 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 20 0.3956835 0.5316120 0.5961830 0.5854366 0.6661120 0.6955990 0
## 21 0.3956835 0.5699332 0.5960343 0.5853247 0.6590982 0.6919315 0
## 22 0.3735084 0.5560661 0.6221836 0.5914492 0.6422128 0.7189781 0
## 23 0.3735084 0.5594228 0.6228827 0.5939786 0.6657372 0.6955990 0
## 24 0.3956835 0.5600352 0.6337821 0.5992188 0.6604703 0.7189781 0
## 25 0.3956835 0.5530760 0.6354875 0.5912239 0.6554912 0.7189781 0
## 26 0.3956835 0.5589331 0.6136074 0.5969142 0.6822128 0.6955990 0
## 27 0.3735084 0.5852459 0.6368425 0.5998148 0.6426088 0.7189781 0
## 28 0.4290780 0.5589331 0.6154905 0.5946859 0.6356141 0.7189781 0
## 29 0.4070588 0.5534173 0.6337821 0.5901173 0.6423101 0.6919315 0
## 30 0.3297872 0.5534173 0.6202632 0.5843432 0.6590982 0.7189781 0
We can see that the mean accuracy is highest for maxnodes = 24 (0.8135, tied with 27), so maxnodes = 24 is used below.
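Reading the best row off a long table is error-prone, so the comparison can also be done in code. The sketch below copies the mean accuracies from the output above into a named vector; with caret loaded, the same column should be available as `summary(results_node)$statistics$Accuracy[, "Mean"]`, which is stated here as an assumption rather than verified output.

```r
# Mean cross-validated accuracy per maxnodes value, copied from the summary above.
mean_accuracy <- c(
  "20" = 0.8075301, "21" = 0.8075014, "22" = 0.8099254, "23" = 0.8111302,
  "24" = 0.8135112, "25" = 0.8099398, "26" = 0.8123207, "27" = 0.8135112,
  "28" = 0.8111015, "29" = 0.8087063, "30" = 0.8063253
)

# Keep every maxnodes value that reaches the maximum, so ties stay visible.
best <- names(mean_accuracy)[mean_accuracy == max(mean_accuracy)]
print(best)
```

When several values tie, picking the smaller one gives a simpler model at the same mean accuracy.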
After tuning mtry and maxnodes, let's tune the number of trees. The method for tuning ntree is the same as for maxnodes.
store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
set.seed(5678)
rf_maxtrees <- train(survived~.,
data = data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
maxnodes = 24,
ntree = ntree)
key <- toString(ntree)
store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree)
Output:
##
## Call:
## summary.resamples(object = results_tree)
##
## Models: 250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 250 0.7380952 0.7976190 0.8083764 0.8087010 0.8292683 0.8674699 0
## 300 0.7500000 0.7886905 0.8024240 0.8027199 0.8203397 0.8452381 0
## 350 0.7500000 0.7886905 0.8024240 0.8027056 0.8277623 0.8452381 0
## 400 0.7500000 0.7886905 0.8083764 0.8051009 0.8292683 0.8452381 0
## 450 0.7500000 0.7886905 0.8024240 0.8039104 0.8292683 0.8452381 0
## 500 0.7619048 0.7886905 0.8024240 0.8062914 0.8292683 0.8571429 0
## 550 0.7619048 0.7886905 0.8083764 0.8099062 0.8323171 0.8571429 0
## 600 0.7619048 0.7886905 0.8083764 0.8099205 0.8323171 0.8674699 0
## 800 0.7619048 0.7976190 0.8083764 0.8110820 0.8292683 0.8674699 0
## 1000 0.7619048 0.7976190 0.8121510 0.8086723 0.8303571 0.8452381 0
## 2000 0.7619048 0.7886905 0.8121510 0.8086723 0.8333333 0.8452381 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 250 0.4061697 0.5667400 0.5836013 0.5856103 0.6335363 0.7196807 0
## 300 0.4302326 0.5449376 0.5780349 0.5723307 0.6130767 0.6710843 0
## 350 0.4302326 0.5449376 0.5780349 0.5723185 0.6291592 0.6710843 0
## 400 0.4302326 0.5482030 0.5836013 0.5774782 0.6335363 0.6710843 0
## 450 0.4302326 0.5449376 0.5780349 0.5750587 0.6335363 0.6710843 0
## 500 0.4601542 0.5449376 0.5780349 0.5804340 0.6335363 0.6949153 0
## 550 0.4601542 0.5482030 0.5857118 0.5884507 0.6396872 0.6949153 0
## 600 0.4601542 0.5482030 0.5857118 0.5884374 0.6396872 0.7196807 0
## 800 0.4601542 0.5667400 0.5836013 0.5910088 0.6335363 0.7196807 0
## 1000 0.4601542 0.5667400 0.5961590 0.5857446 0.6343666 0.6678832 0
## 2000 0.4601542 0.5482030 0.5961590 0.5862151 0.6440678 0.6656337 0
We have tuned all the important parameters. The mean accuracy is highest for ntree = 800 (0.8111), so we can now train the random forest with mtry = 4, maxnodes = 24 and ntree = 800:
fit_rf <- train(survived~.,
data_train,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = trControl,
importance = TRUE,
nodesize = 14,
ntree = 800,
maxnodes = 24)
Step 5) Model Evaluation: caret library in R has a function to make predictions.
predict(model, newdata= df)
Arguments:
- `model`: Define the model evaluated before.
- `newdata`: Define the dataset to make predictions on
prediction <-predict(fit_rf, data_test)
You can use the prediction to compute the confusion matrix and see the accuracy score
confusionMatrix(prediction, data_test$survived)
Output:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 110 32
## Yes 11 56
##
## Accuracy : 0.7943
## 95% CI : (0.733, 0.8469)
## No Information Rate : 0.5789
## P-Value [Acc > NIR] : 3.959e-11
##
## Kappa : 0.5638
## Mcnemar's Test P-Value : 0.002289
##
## Sensitivity : 0.9091
## Specificity : 0.6364
## Pos Pred Value : 0.7746
## Neg Pred Value : 0.8358
## Prevalence : 0.5789
## Detection Rate : 0.5263
## Detection Prevalence : 0.6794
## Balanced Accuracy : 0.7727
##
## 'Positive' Class : No
##
We get an accuracy of 0.7943 (about 79.4 percent), which is slightly higher than the 0.78 of the default model.
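The headline statistics can be recomputed by hand from the four counts in the confusion matrix, which helps to see what each number means. This base R sketch uses only the counts printed above; 'No' is the positive class, as in the output.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(110, 11, 32, 56), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

# Share of all test observations that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual "No" cases predicted as "No".
sensitivity <- cm["No", "No"] / sum(cm[, "No"])

# Specificity: share of actual "Yes" cases predicted as "Yes".
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

The gap between sensitivity and specificity shows that the model identifies the 'No' class more reliably than the 'Yes' class.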
Step 6) Visualize Result
Now let’s find the feature importance with the function varImp(). In the variable importance plot, the most relevant features appear to be sex and age. The more important features tend to appear near the root of the trees, while less important features often appear close to the leaves.
varImpPlot(fit_rf$finalModel)  # varImpPlot() expects the randomForest object stored inside the caret model
varImp(fit_rf)
## rf variable importance
##
## Importance
## sexmale 100.000
## age 28.014
## pclassMiddle 27.016
## fare 21.557
## pclassUpper 16.324
## sibsp 11.246
## parch 5.522
## embarkedC 4.908
## embarkedQ 1.420
## embarkedS 0.000
We will use the Titanic dataset for our case study in the Random forest model. You can directly import a dataset from the internet.
The random forest has some parameters that can be changed to improve the generalization of the prediction. You will use the function randomForest() to train the model. We need to install the randomForest package to use this function.
A random forest model can be built using all predictors and the target variable as the categorical outcome. Random forest was attempted with the train function from the caret package and also with the randomForest function from the randomForest package.
Tuning the parameters of a model is cumbersome work. There can be many permutations and combinations for a set of hyperparameters, and trying all of them can consume a lot of time and memory. A better approach is to let the algorithm decide the best set of parameters. There are two common methods for tuning:
- Grid search
- Random search
Random search does not evaluate all the combinations of hyperparameters. Instead, it randomly selects a combination at every iteration. The advantage is a lower computational cost, lower memory cost and less time required.
In this tutorial, we describe both methods but train the model using a grid search. Grid search is simple: the model is trained for all combinations we give in the parameters list.
For example, if the number of trees is 10, 20 or 30 and mtry (the number of candidate predictors drawn at each split) is 1, 2, 3, 4 or 5, then a total of 3 × 5 = 15 models will be created.
The drawback of the grid search is the large amount of time and number of experiments it requires. To overcome this issue we can use random search.
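The difference between the two methods can be sketched with base R alone. The grid below uses the example values from the text; the number of random candidates (5) and the seed are arbitrary choices for illustration.

```r
# Grid search: every combination of the candidate values is evaluated.
grid <- expand.grid(ntree = c(10, 20, 30), mtry = 1:5)
n_grid <- nrow(grid)  # 3 values of ntree times 5 values of mtry

# Random search: evaluate only a random subset of those combinations.
set.seed(123)  # arbitrary seed
random_candidates <- grid[sample(nrow(grid), 5), ]

print(n_grid)
print(random_candidates)
```

Note that caret's "rf" method only tunes mtry through tuneGrid, which is why this article loops over maxnodes and ntree manually instead of putting them into the grid.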
So now, whenever anyone talks about random forest in R, random forest in Python or just random forest, you will have the basic idea. Implementing random forest in Python is similar to how it was implemented in R.
Machine learning algorithms such as random forests and neural networks are known for better accuracy and high performance, but the problem is that they behave like a black box: it is hard to see how they arrive at a particular prediction. So, interpreting the results is a big challenge. It is fine not to know the internal statistical details of the algorithm, but knowing how to tune a random forest is of utmost importance. Tuning the random forest algorithm is still relatively easy compared to other algorithms.
In spite of being a black box, random forest is a highly popular ensembling technique that often delivers better accuracy. It has even been called a panacea among machine learning algorithms. It is said that if you are unsure which algorithm to use for classification, you can use a random forest with your eyes closed. Go to Janbask Training to get a better understanding of Random Forest.