Multiple Linear Regression in Machine Learning
In this page, we will cover: what Multiple Linear Regression (MLR) is, some key points about MLR, the MLR equation, the assumptions for MLR, the implementation of an MLR model using Python, and the applications of MLR.
What is Multiple Linear Regression in Machine Learning?
We learned about Simple Linear Regression in the previous
topic, where a single Independent/Predictor(X) variable is
utilized to model the response variable (Y). However, there
are some circumstances where more than one predictor variable
affects the response variable; in these cases, the Multiple
Linear Regression technique is applied.
Multiple Linear Regression extends Simple Linear Regression by
predicting the response variable using more than one predictor
variable. It can be defined as follows:
“Multiple Linear Regression is one of the important
regression algorithms which models the linear relationship
between a single dependent continuous variable and more than
one independent variable.”
Example:
Prediction of CO2 emission based on engine size and number of
cylinders in a car.
Some key points about MLR:
 The dependent or target variable (Y) must be continuous/real for MLR, while the predictor or independent variables may be continuous or categorical.
 Each feature variable should have a linear relationship with the dependent variable.
 MLR fits a regression line (a hyperplane) through a multidimensional space of data points.
MLR equation:
The target variable (Y) in Multiple Linear Regression is a linear
combination of multiple predictor variables x1, x2, x3, ..., xn.
Since it is an extension of Simple Linear Regression, the same form
carries over to the multiple linear regression equation, which becomes:
Y = b_{0} + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{3} + ... + b_{n}x_{n} ............... (a)
Where,
Y= Output/Response variable
b_{0}, b_{1}, b_{2}, b_{3}, ..., b_{n} = Coefficients of the model.
x_{1}, x_{2}, x_{3}, ..., x_{n} = Various independent/feature variables.
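Equation (a) can be evaluated directly in code. The sketch below uses made-up coefficients and one made-up observation purely to illustrate the arithmetic; none of these numbers come from a fitted model:

```python
import numpy as np

# Hypothetical coefficients (b0 = intercept, b = weights for x1..x3)
b0 = 50000.0
b = np.array([0.8, 0.05, 0.03])
# One hypothetical observation (x1, x2, x3)
x = np.array([165349.2, 136897.8, 471784.1])

# Y = b0 + b1*x1 + b2*x2 + ... + bn*xn
y = b0 + np.dot(b, x)
print(y)
```

The dot product computes the sum of the b_i * x_i terms in one step, which is how the equation generalizes to any number of predictors.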
Assumptions for Multiple Linear Regression:
 The Target and Predictor variables should have a linear relationship.
 The residuals from the regression must be normally distributed.
 MLR assumes the data have little or no multicollinearity (correlation between the independent variables).
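A quick way to check the multicollinearity assumption is to inspect the pairwise correlations between the predictors. The sketch below uses synthetic stand-in data (the column names mirror the dataset used later, but the values are randomly generated):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three continuous predictors
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    'R&D Spend': rng.normal(70000, 40000, n),
    'Administration': rng.normal(120000, 25000, n),
    'Marketing Spend': rng.normal(200000, 100000, n),
})

# Pairwise correlations between predictors;
# off-diagonal values near +1 or -1 signal multicollinearity
corr = df.corr()
print(corr.round(2))
```

If two predictors are highly correlated, dropping one of them (or combining them) is a common remedy before fitting the model.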
Implementation of Multiple Linear Regression model using Python:
To implement MLR using Python, consider the following problem:
Problem Description:
We have 50 startup companies in our database. R&D Spend,
Administration Spend, Marketing Spend, State, and Profit for a
financial year are all included in this dataset. Our goal is
to develop a model that can quickly assess which company has
the highest profit margin and which element has the most
impact on a company's profit margin.
Since profit is the value we need to predict, it is the dependent
variable, and the other four variables are the independent variables.
The following are the key phases in implementing the MLR model:
 Data Preprocessing Steps
 Fitting the MLR model to the training set
 Predicting the result of the test set
Step 1: Data Preprocessing:
Data preprocessing is the initial phase, which we've already covered in this course. The steps in this procedure are as follows:

Importing libraries: We'll start by importing the
library that will assist us in creating the model. The code
for it is as follows:
#importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

Importing dataset: We'll now import the 50_CompList
dataset, which contains all of the variables. The code for
it is as follows:
#importing datasets
data_set = pd.read_csv('50_CompList.csv')
Output: We will get the dataset as:
We can see that there are five variables in the above output,
four of which are continuous and one of which is categorical.
 Extracting dependent and independent Variables:
#Extracting Independent and dependent Variable
x = data_set.iloc[:, :-1].values
y = data_set.iloc[:, 4].values
Output:
Out[5]:
array([[165349.2, 136897.8, 471784.1, 'New York'],
[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
[101913.08, 110594.11, 229160.95, 'Florida'],
[100671.96, 91790.61, 249744.55, 'California'],
[93863.75, 127320.38, 249839.44, 'Florida'],
[91992.39, 135495.07, 252664.93, 'California'],
[119943.24, 156547.42, 256512.92, 'Florida'],
[114523.61, 122616.84, 261776.23, 'New York'],
[78013.11, 121597.55, 264346.06, 'California'],
[94657.16, 145077.58, 282574.31, 'New York'],
[91749.16, 114175.79, 294919.57, 'Florida'],
[86419.7, 153514.11, 0.0, 'New York'],
[76253.86, 113867.3, 298664.47, 'California'],
[78389.47, 153773.43, 299737.29, 'New York'],
[73994.56, 122782.75, 303319.26, 'Florida'],
[67532.53, 105751.03, 304768.73, 'Florida'],
[77044.01, 99281.34, 140574.81, 'New York'],
[64664.71, 139553.16, 137962.62, 'California'],
[75328.87, 144135.98, 134050.07, 'Florida'],
[72107.6, 127864.55, 353183.81, 'New York'],
[66051.52, 182645.56, 118148.2, 'Florida'],
[65605.48, 153032.06, 107138.38, 'New York'],
[61994.48, 115641.28, 91131.24, 'Florida'],
[61136.38, 152701.92, 88218.23, 'New York'],
[63408.86, 129219.61, 46085.25, 'California'],
[55493.95, 103057.49, 214634.81, 'Florida'],
[46426.07, 157693.92, 210797.67, 'California'],
[46014.02, 85047.44, 205517.64, 'New York'],
[28663.76, 127056.21, 201126.82, 'Florida'],
[44069.95, 51283.14, 197029.42, 'California'],
[20229.59, 65947.93, 185265.1, 'New York'],
[38558.51, 82982.09, 174999.3, 'California'],
[28754.33, 118546.05, 172795.67, 'California'],
[27892.92, 84710.77, 164470.71, 'Florida'],
[23640.93, 96189.63, 148001.11, 'California'],
[15505.73, 127382.3, 35534.17, 'New York'],
[22177.74, 154806.14, 28334.72, 'California'],
[1000.23, 124153.04, 1903.93, 'New York'],
[1315.46, 115816.21, 297114.46, 'Florida'],
[0.0, 135426.92, 0.0, 'California'],
[542.05, 51743.15, 0.0, 'New York'],
[0.0, 116983.8, 45173.06, 'California']], dtype=object)
As can be seen in the result above, the last column contains a
categorical variable that is not suitable for fitting the model
directly. As a result, we must encode this variable.
Encoding Dummy Variables:
We will encode the one categorical variable (State) because it
cannot be fed to the model directly. Simply mapping the categories
to integers with LabelEncoder is insufficient, because it would
impose a relational order that does not exist, which may result in
an incorrect model. To solve this problem, we employ OneHotEncoder,
which generates dummy variables. (Older scikit-learn versions used
OneHotEncoder(categorical_features = [3]); that parameter has been
removed, and ColumnTransformer is used instead.) The following is
the code for it:
#Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder = 'passthrough')
x = ct.fit_transform(x)
Because the other variables are continuous, we only encode the one
categorical independent variable, State.
Output:
The state column has been turned into dummy variables, as
shown in the above output (0 and 1).
Each dummy variable column corresponds to a single State in
this example.
By comparing it to the original dataset, we can be sure. The
first column corresponds to the state of California,
the second column to the state of Florida, and the
third column to the state of New York.
[ Note: We should not use all of the dummy variables at the
same time; the number of dummy variables used should be one
fewer than the number of categories, otherwise a dummy
variable trap will result. ]

To avoid the dummy variable trap, we drop the first dummy
column with a single line of code:
#avoiding the dummy variable trap:
x = x[:, 1:]
If we did not delete the first dummy variable, the model would
become multicollinear.
The first column has been eliminated, as seen in the output
image above.
We'll now divide the dataset into two groups: training and
testing. The following is the code for this:
#Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split( x, y,test_size = 0.2, random_state = 0 )
The above code will split our dataset into a training set and a
test set. The output can be viewed using the Variable Explorer
option in the Spyder IDE, which displays the test and training
sets as shown below:
Training set:
Note: We will not implement feature scaling in MLR because the library will take care of it and we won't have to do it manually.
Step 2: Fitting the MLR model to the training set:
Now that our dataset has been properly prepared for training,
we will fit our regression model to the training set. It will
be identical to the Simple Linear Regression model that we
used previously. This will be coded as follows:
#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
Output:
Out[9]: LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=None, normalize=False)
Using the training dataset, we've effectively trained our
model. We'll use the test dataset to evaluate the model's
performance in the next phase.
Step 3: Predicting the test set results:
The final step in our model's development is to evaluate its
performance. We'll do it by forecasting the outcome of the
test set. A y_pred vector will be created for prediction. The
code for it is as follows:
#Predicting the Test set result;
y_pred = regressor.predict(x_test)
Output:
The above output shows the predicted results alongside the test
set. By comparing these two values index by index, we can
evaluate the model's performance. The first index, for example,
has a predicted profit of 103015 dollars and an actual/test
profit of 103282 dollars. The difference is only 267 dollars,
which is a good prediction.
We may also look at the scores for the training and test
datasets. The code for it is as follows:
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))
Output: The score is:
Train Score: 0.9501847627493607
Test Score: 0.9347068473282446
According to the above scores, our model achieves an R² of about
0.95 on the training dataset and about 0.93 on the test dataset,
which indicates a good fit on both.
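Since part of the stated goal was to see which factor has the most impact on profit, the fitted model's coef_ and intercept_ attributes can be inspected: the feature with the largest coefficient magnitude has the strongest per-unit influence. A minimal sketch on synthetic data with a known relationship (not the startup dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship: y = 3*x1 + 0.5*x2 + 10
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 10.0

reg = LinearRegression().fit(X, y)

# coef_ holds one weight per feature; here the first feature
# dominates, matching the relationship the data was built from
print(reg.coef_)       # approximately [3.0, 0.5]
print(reg.intercept_)  # approximately 10.0
```

Note that raw coefficients are only directly comparable when the features are on similar scales; otherwise the features should be standardized first.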
Note: In the next article, we'll look at how we may use
the Backward Elimination procedure to improve the model's
performance.
Applications of Multiple Linear Regression:
Multiple Linear Regression has primarily two applications:
 Effectiveness of an independent variable on prediction: estimating how strongly each predictor influences the response variable.
 Predicting the impact of changes: estimating how the response variable changes when one or more predictor variables change.