Random Forest Algorithm in Machine Learning

On this page, we will learn: What is the Random Forest algorithm?, Assumptions for Random Forest, Why use Random Forest?, How does the Random Forest algorithm work?, Applications of Random Forest, Advantages of Random Forest, Disadvantages of Random Forest, and a Python Implementation of the Random Forest Algorithm.


What is Random Forest Algorithm?

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems. It is based on ensemble learning, which is a method of combining multiple classifiers to solve a complex problem and improve the model's performance.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on a single decision tree, the random forest collects the predictions from each tree and predicts the final output based on the majority vote of those predictions.
A greater number of trees in the forest generally leads to higher accuracy and reduces the risk of overfitting.
The below diagram explains the working of the Random Forest algorithm:

[Diagram: working of the Random Forest algorithm]

Note: You need to be familiar with the Decision Tree Algorithm in order to properly understand the Random Forest Algorithm.

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, some decision trees may predict the correct output while others may not. But when all of the trees are combined, they predict the correct result. Therefore, two assumptions for a better Random Forest classifier are as follows:

  • There should be some actual values in the feature variables of the dataset, so that the classifier can predict accurate results rather than guesses.
  • The predictions from each tree must have very low correlations with one another (a brief sketch of how feature subsampling encourages this follows below).
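
In practice, low correlation between trees is encouraged by training each tree on a random bootstrap sample and by considering only a random subset of features at each split. As a rough illustration (the max_features values below are standard scikit-learn options; this snippet is not part of the original tutorial):

 
   # Feature subsampling keeps the trees decorrelated: with max_features = 'sqrt',
   # each split considers only sqrt(n_features) candidate features, so the trees
   # see different "views" of the data and make more independent errors.
   from sklearn.ensemble import RandomForestClassifier
   
   decorrelated = RandomForestClassifier(n_estimators = 100, max_features = 'sqrt')
   correlated = RandomForestClassifier(n_estimators = 100, max_features = None)  # all features at every split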

Why use Random Forest?

The following are some reasons why we should use the Random Forest algorithm:

  • When compared to other algorithms, it takes less time to train.
  • It predicts output with good accuracy, and it runs quickly even with a huge dataset.
  • When a considerable amount of the data is missing, it can still maintain accuracy.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.
The working process can be explained in the below steps and diagram:
Step 1: Select K random data points from the training set.
Step 2: Build a decision tree associated with the selected data points (subset).
Step 3: Choose the number N of decision trees you want to build.
Step 4: Repeat Steps 1 and 2 until you have N trees.
Step 5: For a new data point, find the prediction of each decision tree, and assign the new data point to the category that wins the majority of votes.
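Before the fruit example below, here is a minimal sketch of these five steps in Python. It is a simplified illustration only, not the internals of any library: the helper names build_forest and predict_forest are hypothetical, it assumes non-negative integer class labels, and real implementations also randomize the features considered at each split.

 
   # A minimal sketch of Steps 1-5: build N trees on random subsets, then vote.
   import numpy as nm
   from sklearn.tree import DecisionTreeClassifier
   
   def build_forest(x_train, y_train, n_trees = 10):
       forest = []
       for _ in range(n_trees):                                    # Steps 3-4: repeat N times
           idx = nm.random.randint(0, len(x_train), len(x_train))  # Step 1: K random points (with replacement)
           tree = DecisionTreeClassifier()
           tree.fit(x_train[idx], y_train[idx])                    # Step 2: a tree on the chosen subset
           forest.append(tree)
       return forest
   
   def predict_forest(forest, x_new):
       # Step 5: each tree votes; the majority class wins (labels must be 0, 1, 2, ...)
       votes = nm.array([tree.predict(x_new) for tree in forest])
       return nm.apply_along_axis(lambda col: nm.bincount(col.astype(int)).argmax(), 0, votes)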
The following example will help you understand how the algorithm works:

Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the Random Forest classifier, and each decision tree is trained on a subset of it. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, the Random Forest classifier predicts the final decision based on the majority of results. Consider the following illustration:

[Diagram: Random Forest classifier predicting a fruit class by majority vote]

Applications of Random Forest

Random forest is most commonly utilized in the following four sectors:

  • Banking: This algorithm is mostly used in the banking industry to identify loan risk.
  • Medicine: This method can be used to identify disease trends as well as disease risks.
  • Land Use: Using this technique, we can find places with comparable land use.
  • Marketing: This algorithm can be used to identify marketing trends.

Advantages of Random Forest

  • Random Forest is capable of performing both classification and regression tasks.
  • It can handle large datasets with high dimensionality.
  • It enhances the accuracy of the model and reduces the overfitting problem.

Disadvantages of Random Forest

  • Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.

Python Implementation of Random Forest Algorithm

Now we will implement the Random Forest algorithm in Python. For this, we will use the same dataset "user_data.csv" that we used in previous classification models. By using the same dataset, we can compare the Random Forest classifier with other classification models such as the Decision Tree classifier, KNN, SVM, and Logistic Regression.

Implementation Steps are given below:

  • Data Pre-processing step
  • Fitting the Random forest algorithm to the Training set
  • Predicting the test result
  • Test accuracy of the result (Creation of Confusion matrix)
  • Visualizing the test set result.

1. Data Pre-Processing Step:

Below is the code for the pre-processing step:

 
   #importing libraries  
   import numpy as nm  
   import matplotlib.pyplot as mtp  
   import pandas as pd  
      
   #importing datasets  
   data_set = pd.read_csv('user_data.csv')  
      
   #Extracting Independent and dependent Variable  
   x = data_set.iloc[:, [2,3]].values  
   y = data_set.iloc[:, 4].values  
      
   # Splitting the dataset into training and test set.  
   from sklearn.model_selection import train_test_split  
   x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)  
      
   #feature Scaling  
   from sklearn.preprocessing import StandardScaler    
   st_x = StandardScaler()    
   x_train = st_x.fit_transform(x_train)    
   x_test = st_x.transform(x_test)          

In the above code, we have pre-processed the data: we loaded the dataset, extracted the independent and dependent variables, split the data into training and test sets, and applied feature scaling. The loaded dataset is given as:

[Image: the loaded user_data.csv dataset]

2. Fitting the Random Forest algorithm to the training set:

Now we will fit the Random Forest algorithm to the training set. To fit it, we will use the RandomForestClassifier class from the sklearn.ensemble library. The code is as follows:

 
   #Fitting Random Forest classifier to the training set  
   from sklearn.ensemble import RandomForestClassifier  
   classifier = RandomForestClassifier(n_estimators = 10, criterion = "entropy")  
   classifier.fit(x_train, y_train)       

The classifier object in the given code has the following parameters:

  • n_estimators= the required number of trees in the Random Forest. The default value was 10 in older versions of scikit-learn (it is 100 in current versions). We can choose any number, but we need to take care of the overfitting issue.
  • criterion= the function used to measure the quality of a split. Here we have used "entropy" for information gain.

Output:

 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10,
            n_jobs=None, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
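
Note that the exact output depends on the scikit-learn version; the repr above comes from an older release. Since the number of trees is the main parameter, one optional way to examine its effect (not part of the original tutorial; the values of n below are arbitrary examples) is cross-validation on the training set:

 
   # Optional: check how the number of trees affects cross-validated accuracy.
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import cross_val_score
   
   for n in (1, 10, 100):
       model = RandomForestClassifier(n_estimators = n, criterion = "entropy")
       scores = cross_val_score(model, x_train, y_train, cv = 5)
       print(n, "trees -> mean CV accuracy:", scores.mean().round(3))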

3. Predicting the Test Set result

Since our model is fitted to the training set, we can now predict the test result. For prediction, we will create a new prediction vector, y_pred. The code for it is as follows:

 
   #Predicting the test set result  
   y_pred= classifier.predict(x_test)       

Output:

The prediction vector is given as:

[Image: the prediction vector y_pred]

4. Creating the Confusion Matrix

Now we will create the confusion matrix to determine which predictions are correct and which are incorrect. The code for it is as follows:

 
   #Creating the Confusion matrix  
   from sklearn.metrics import confusion_matrix  
   cm = confusion_matrix(y_test, y_pred)  

Output:

[Image: the confusion matrix output]

As we can see in the above matrix, there are 4+4= 8 incorrect predictions and 64+28= 92 correct predictions.
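
The overall accuracy can also be computed directly instead of being read off the matrix. This small check is not part of the original tutorial code; accuracy_score is a standard sklearn.metrics function:

 
   # Optional check: fraction of correct predictions (92 out of 100 here).
   from sklearn.metrics import accuracy_score
   print(accuracy_score(y_test, y_pred))   # expected: 0.92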

5. Visualizing the training Set result

Here we will visualize the training set result. To visualize it, we will plot a graph for the Random Forest classifier. The classifier will predict yes or no for users who have purchased or not purchased the SUV car, as we did in Logistic Regression. The code for it is as follows:

 
   #Visualizing the training set result  
   from matplotlib.colors import ListedColormap  
   x_set, y_set = x_train, y_train  
   #build a grid of points covering the feature space  
   x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),  
                        nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))  
   #color each grid point by the class the classifier predicts for it  
   mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),  
                alpha = 0.75, cmap = ListedColormap(('purple', 'green')))  
   mtp.xlim(x1.min(), x1.max())  
   mtp.ylim(x2.min(), x2.max())  
   #plot each training point, colored by its actual class  
   for i, j in enumerate(nm.unique(y_set)):  
       mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],  
                   c = ListedColormap(('purple', 'green'))(i), label = j)  
   mtp.title('Random Forest Algorithm (Training set)')  
   mtp.xlabel('Age')  
   mtp.ylabel('Estimated Salary')  
   mtp.legend()  
   mtp.show()  

Output:

[Image: Random Forest classification regions for the training set]

The image above shows the visualization result for the Random Forest classifier on the training set. It closely resembles the Decision Tree classifier's result. Each data point corresponds to a user from the user data, and the purple and green zones are the prediction regions: users who did not purchase the SUV car are classified in the purple region, while users who did purchase it are classified in the green region.
So, in the Random Forest classifier, we took 10 trees that predicted Yes or No for the Purchased variable, and the classifier took the majority of the predictions to produce the final result.
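
To see this majority voting at work, the fitted trees can be inspected through the classifier's estimators_ attribute (a standard scikit-learn attribute). This is a small optional peek, not part of the original tutorial, and the exact votes printed will vary from run to run:

 
   # Inspect the 10 individual trees' votes for the first test point.
   sample = x_test[:1]
   votes = [int(tree.predict(sample)[0]) for tree in classifier.estimators_]
   print(votes)                         # e.g. a mix of 0s and 1s
   print(classifier.predict(sample))    # the final majority decision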

6. Visualizing the test set result

Now we will visualize the test set result. Below is the code for it:

 
   #Visualizing the test set result  
   from matplotlib.colors import ListedColormap  
   x_set, y_set = x_test, y_test  
   x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),  
                        nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))  
   mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),  
                alpha = 0.75, cmap = ListedColormap(('purple', 'green')))  
   mtp.xlim(x1.min(), x1.max())  
   mtp.ylim(x2.min(), x2.max())  
   for i, j in enumerate(nm.unique(y_set)):  
       mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],  
                   c = ListedColormap(('purple', 'green'))(i), label = j)  
   mtp.title('Random Forest Algorithm (Test set)')  
   mtp.xlabel('Age')  
   mtp.ylabel('Estimated Salary')  
   mtp.legend()  
   mtp.show()  

Output:

[Image: Random Forest classification regions for the test set]

The image above shows the visualization result for the test set. We can check that there is a minimal number of incorrect predictions (8) and no overfitting issue. Changing the number of trees in the classifier will yield different results.