Simple Linear Regression in Machine Learning

This page covers what Simple Linear Regression is, the Simple Linear Regression model, data pre-processing, fitting the model to the training set, predicting the test set results, and visualizing the training and test set results.


What is Simple Linear Regression in Machine Learning?

Simple Linear Regression is a type of regression that models the relationship between a dependent variable and a single independent variable. It is "simple" because it uses only one independent variable, and "linear" because the relationship it describes is a sloped straight line.

The key requirement of Simple Linear Regression is that the dependent variable must be a continuous (real) value. The independent variable, on the other hand, can be measured on either continuous or categorical values.

The major goals of the simple linear regression algorithm are:

  • Model the relationship between the two variables, such as income and expenditure, or experience and salary.
  • Predict new observations, such as weather forecasts based on temperature, or company revenue based on annual investment.

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y = a0 + a1x + ε

Where,

a0 = the intercept of the regression line (the value of y when x = 0)

a1 = the slope of the regression line, which tells whether the line is increasing or decreasing

ε = the error term (for a good model it will be negligible)
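
To make the roles of a0 and a1 concrete, the sketch below estimates them with the standard least-squares formulas: a1 is the sum of (x - mean(x)) * (y - mean(y)) divided by the sum of (x - mean(x))^2, and a0 is mean(y) - a1 * mean(x). The data values are made up for illustration and are not taken from the tutorial's dataset.

```python
import numpy as np

# Hypothetical data (not from Salary_Data.csv): y is exactly 2 + 3x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 8.0, 11.0, 14.0, 17.0])

# Least-squares estimates of the slope (a1) and the intercept (a0).
x_mean, y_mean = x.mean(), y.mean()
a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a0 = y_mean - a1 * x_mean

# The error term is the part of y that the line does not explain.
errors = y - (a0 + a1 * x)

print(a0, a1)
```

Because this toy data lies exactly on a line, the estimates recover a0 = 2 and a1 = 3 and every error is zero. Real data will leave non-zero errors.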

Implementation of Simple Linear Regression Algorithm using Python
Problem Statement example for Simple Linear Regression:

  • We want to find out whether there is a relationship between the two variables, salary and years of experience.
  • We want to find the best-fit line for the dataset.
  • We want to see how the dependent variable changes as the independent variable changes.

In this section, we'll build a Simple Linear Regression model to determine which line best represents the relationship between these two variables.

To create the Simple Linear Regression model in Python, follow the steps below:

Step-1: Data Pre-processing

Data pre-processing is the first stage in building the Simple Linear Regression model. We covered it earlier in this tutorial, but a few adjustments are needed here, as outlined in the steps below:

  • To begin, we'll import three key libraries that will aid us in loading the dataset, plotting graphs, and building the Simple Linear Regression model.

    
        import numpy as np
        import matplotlib.pyplot as mtp
        import pandas as pd
    
    
  • Next, we will load the dataset into our code:

    
        data_set = pd.read_csv('Salary_Data.csv')  
    
    

After executing the above line of code (Ctrl+Enter), we can view the dataset in the Spyder IDE by clicking on the variable explorer option.

[Image: the dataset shown in the variable explorer]

The result above shows the dataset, which has two variables: salary and experience.

Note: In Spyder IDE, the folder containing the code file must be set as the working directory, and the dataset (CSV file) must be in the same folder.

  • The dependent and independent variables must then be extracted from the given dataset. Years of experience is the independent variable, and salary is the dependent variable. The following is the code for it:

    
        x = data_set.iloc[:, :-1].values  
        y = data_set.iloc[:, 1].values   
    
    

In the preceding code, we used :-1 for the x variable because we want to drop the last column of the dataset, which leaves only the years-of-experience column. For the y variable we used 1 because we want the second column, and indexing starts at zero.

If we run the above piece of code, we will obtain the following results for the X and Y variables:

[Images: contents of the x and y variables]

The X (independent) variable and Y (dependent) variable have been retrieved from the given dataset, as shown in the above output image.
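
The effect of the two iloc expressions can be checked on a small in-memory CSV. This is only a sketch: the column names and values below are assumptions standing in for Salary_Data.csv, which is not reproduced here.

```python
import io
import pandas as pd

# Hypothetical stand-in for Salary_Data.csv; column names and values are assumed.
csv_text = """YearsExperience,Salary
1.1,39000
2.0,43500
3.2,60000
4.5,61000
"""
data_set = pd.read_csv(io.StringIO(csv_text))

x = data_set.iloc[:, :-1].values  # every column except the last -> 2-D array
y = data_set.iloc[:, 1].values    # second column only -> 1-D array

print(x.shape)  # (4, 1)
print(y.shape)  # (4,)
```

Note that x keeps its column axis. scikit-learn estimators expect the feature array to be two-dimensional, even when there is only one feature.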

  • After that, we'll divide both variables into test and training sets. Because we have 30 observations, we'll use 20 for the training set and 10 for the test set. We've divided our dataset into two parts so that we can train our model with one and then test it with the other. The following is the code for this:

    #Splitting the dataset into training and test set.  
    from sklearn.model_selection import train_test_split  
    x_train, x_test, y_train, y_test= train_test_split(x, y, test_size = 1/3, random_state = 0)  

Executing the train_test_split code gives us the x_test, x_train, y_test, and y_train datasets. Consider the images below:

Test Dataset:

[Image: test dataset]

Training Dataset:

[Image: training dataset]
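
As a quick check of the split proportions, the sketch below applies the same train_test_split call to a synthetic 30-row dataset. The values are assumptions, not the real salary data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30-row dataset (values are assumed).
x = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0
)

# With 30 rows and test_size = 1/3, 20 rows go to training and 10 to testing.
print(len(x_train), len(x_test))
```

Fixing random_state makes the shuffle reproducible, so the same rows land in each set on every run.
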
  • We will not apply Feature Scaling for simple linear regression. It is unnecessary here because, with a single feature, an ordinary least-squares fit gives the same predictions whether or not the feature is scaled. Now that our data is ready, we'll begin building a Simple Linear Regression model to solve the problem.
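
The point about feature scaling can be checked with a short experiment. The sketch below fits the same model on a raw and on a standardized feature and compares the predictions. The data is synthetic, and StandardScaler is used only for this comparison; it is not part of the tutorial's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed values): a noisy linear trend.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(30, 1))
y = 25000 + 9000 * x.ravel() + rng.normal(0, 3000, size=30)

# Fit once on the raw feature and once on the standardized feature.
raw_model = LinearRegression().fit(x, y)
scaler = StandardScaler().fit(x)
scaled_model = LinearRegression().fit(scaler.transform(x), y)

raw_pred = raw_model.predict(x)
scaled_pred = scaled_model.predict(scaler.transform(x))

# The coefficients differ, but the predictions agree up to floating-point error.
print(np.allclose(raw_pred, scaled_pred))
```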

Step-2: Fitting the Simple Linear Regression to the Training Set:

Fitting our model to the training dataset is the next step. To do so, we'll import the LinearRegression class from scikit-learn's linear_model module. After importing the class, we'll create a regressor object. The following is the code for this:


    #Fitting the Simple Linear Regression model to the training dataset  
    from sklearn.linear_model import LinearRegression  
    regressor = LinearRegression()  
    regressor.fit(x_train, y_train)  

In the code above, we used the fit() method to fit our Simple Linear Regression object to the training set. We passed x_train and y_train to fit(), which are the training data for the independent and dependent variables, respectively. Fitting the regressor to the training set lets the model learn the relationship between the predictor and the target variable. We will obtain the following output after executing the above lines of code.

Output:


    Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
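
Once fit() has run, the learned line can be read back from the regressor: intercept_ corresponds to a0 and coef_ holds a1. The sketch below uses noise-free synthetic data (an assumption, not the salary dataset) so that the recovered values are easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (assumed): salary = 30000 + 9000 * years, no noise.
x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = 30000 + 9000 * x_train.ravel()

regressor = LinearRegression()
regressor.fit(x_train, y_train)

a0 = regressor.intercept_  # estimated intercept
a1 = regressor.coef_[0]    # estimated slope for the single feature

print(a0, a1)
```

On this data the fit recovers an intercept of about 30000 and a slope of about 9000, up to floating-point rounding.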

Step-3: Prediction of test set result:

Salary is the dependent variable, and years of experience is the independent variable. Our model is now ready to predict the results for new observations. In this step, we'll give the model the test dataset (new observations) to see whether it can predict the correct output.

We'll create two prediction vectors, y_pred and x_pred, which will contain the predictions for the test set and the training set, respectively.


    #Prediction of Test and Training set result  
    y_pred = regressor.predict(x_test)  
    x_pred = regressor.predict(x_train)

When the above lines of code are run, two variables named y_pred and x_pred will appear in the variable explorer, containing the salary predictions for the test set and the training set, respectively.

Output:

You can inspect the variables with the IDE's variable explorer and compare the y_pred and y_test values. Comparing these numbers shows how well our model is performing.
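
The comparison between y_pred and y_test can also be summarized numerically. The sketch below is one possible way to do it, using mean_absolute_error and r2_score from sklearn.metrics on synthetic data; the data and the choice of metrics are assumptions, not part of the original tutorial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary data (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(30, 1))
y = 25000 + 9000 * x.ravel() + rng.normal(0, 3000, size=30)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0
)
regressor = LinearRegression().fit(x_train, y_train)
y_pred = regressor.predict(x_test)

mae = mean_absolute_error(y_test, y_pred)  # average absolute error, in salary units
r2 = r2_score(y_test, y_pred)              # 1.0 would be a perfect fit

print(f"MAE: {mae:.2f}, R^2: {r2:.3f}")
```

A small MAE relative to the salary range and an R^2 close to 1 both indicate that the predictions track the test values closely.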

Step-4: Visualizing the Training set results:

In this step, we will visualize the training set results. We'll do this with the scatter() function from the pyplot module, which we already imported (as mtp) during the pre-processing stage. The scatter() function plots the observations as a scatter plot.

Employees' years of experience will be plotted on the x-axis, and their salaries on the y-axis. We'll pass the actual values of the training set to the function: the years of experience (x_train), the salaries (y_train), and the color of the observations. We use green for the observations, but you can use any color you want.

Next we need to plot the regression line, so we'll use pyplot's plot() function. We'll pass the training set's years of experience (x_train), the predicted salaries for the training set (x_pred), and the line's color to this function.

Then we give the plot a title. We'll use pyplot's title() function and pass it the text "Salary vs Experience (Training Dataset)".

After that, we'll use the xlabel() and ylabel() functions to label the x- and y-axes, respectively. Finally, we'll call show() to display all of the above as a graph. The code is as follows:


    mtp.scatter(x_train, y_train, color = "green")   
    mtp.plot(x_train, x_pred, color = "red")    
    mtp.title("Salary vs Experience (Training Dataset)")  
    mtp.xlabel("Years of Experience")  
    mtp.ylabel("Salary(In Rupees)")  
    mtp.show()   


Output

By executing the above lines of code, we will get the plot below as output.

[Image: Salary vs Experience (Training Dataset)]

In the above plot, the green dots are the actual observations, and the red regression line shows the predicted values. The regression line shows that there is a correlation between the dependent and independent variables.

How well the line fits can be judged by calculating the difference between the actual and predicted values. As the plot shows, most of the observations lie close to the regression line, so our model fits the training set well.
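
The difference between actual and predicted values mentioned above is usually called the residual. As a rough sketch on synthetic training data (assumed values, not the salary dataset), it can be computed like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (assumed values).
rng = np.random.default_rng(1)
x_train = rng.uniform(1, 10, size=(20, 1))
y_train = 25000 + 9000 * x_train.ravel() + rng.normal(0, 3000, size=20)

regressor = LinearRegression().fit(x_train, y_train)
x_pred = regressor.predict(x_train)  # training-set predictions, named as in the tutorial

residuals = y_train - x_pred             # actual minus predicted
rmse = np.sqrt(np.mean(residuals ** 2))  # typical size of a residual

# With an intercept, least-squares residuals sum to (numerically) zero.
print(abs(residuals.sum()) < 1e-6, round(rmse, 2))
```

The smaller the root mean squared error is relative to the salary values, the closer the observations sit to the regression line.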

Step-5: Visualizing the Test set results:

In the previous step, we visualized our model's performance on the training set. We'll now repeat the process for the test set. The code is the same as before, except that the scatter plot uses x_test and y_test instead of x_train and y_train. The regression line is still drawn from x_train and x_pred, because the fitted line is the same in both plots.

To distinguish the two plots, we change the color of the observations, but this is optional.


    #visualizing the Test set results  
    mtp.scatter(x_test, y_test, color = "blue")   
    mtp.plot(x_train, x_pred, color = "red")    
    mtp.title("Salary vs Experience (Test Dataset)")  
    mtp.xlabel("Years of Experience")  
    mtp.ylabel("Salary(In Rupees)")  
    mtp.show()    


Output

By executing the above lines of code, we will get the following output:

[Image: Salary vs Experience (Test Dataset)]

In the above plot, the blue dots are the observations, and the red regression line shows the predictions. Most of the observations lie close to the regression line, which indicates that our Simple Linear Regression model is a good fit and can make reasonably accurate predictions.
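
With the model fitted, predicting the salary for a new observation is a single predict() call. The sketch below uses noise-free synthetic data and a hypothetical input of 6 years of experience; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data (assumed): salary = 30000 + 9000 * years exactly.
x_train = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])
y_train = 30000 + 9000 * x_train.ravel()

regressor = LinearRegression().fit(x_train, y_train)

# predict() expects a 2-D array of shape (n_samples, n_features),
# so a single value of 6 years is passed as [[6.0]].
new_experience = np.array([[6.0]])
predicted_salary = regressor.predict(new_experience)

print(predicted_salary[0])
```

For this toy data the prediction is about 84000, that is 30000 + 9000 * 6, up to floating-point rounding.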