Multiple Linear Regression (MLR) is a statistical technique that models the linear relationship between a single dependent variable (y) and two or more independent variables (x₁, x₂, ..., xₙ). Unlike simple linear regression, which focuses on a single factor, MLR lets us analyze the influence of multiple factors on the dependent variable.
Mathematically, it's expressed as:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Here, y is the dependent variable; x₁, x₂, ..., xₙ are the independent variables; β₀ is the intercept; β₁, β₂, ..., βₙ are the coefficients (weights) that show how each independent variable impacts y; and ε is the error term, which accounts for the variance in y not explained by the independent variables.
In multiple linear regression, the coefficients are estimated by minimizing the difference between the actual and predicted values of the dependent variable. The standard estimation method is Ordinary Least Squares (OLS), which minimizes the sum of squared residuals (the differences between observed and predicted values).
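To make OLS concrete, here is a minimal sketch (using NumPy and synthetic data, so the exact numbers are illustrative only) that solves the least-squares problem directly; np.linalg.lstsq finds the β vector minimizing the sum of squared residuals:

import numpy as np

# Synthetic data: 100 samples, 2 independent variables (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the first coefficient plays the role of β₀
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize the squared residuals
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # approximately [3.0, 1.5, -2.0]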
Linear Regression (LR) involves only one independent variable, while Multiple Linear Regression (MLR) involves two or more independent variables. This distinction defines the complexity and utility of each model in real-world applications.
Linear Regression is simpler, easier to interpret, and typically requires fewer data points. MLR is more complex as it requires handling multiple independent variables and can capture relationships with more factors, leading to potentially more accurate predictions.
Linear Regression is easier to interpret because it focuses on the relationship between one independent variable and the dependent variable. Multiple Linear Regression requires interpretation of multiple coefficients, which can be more challenging.
Overfitting is more common in MLR because the extra variables let the model fit the training data very closely while performing poorly on unseen data. Regularization techniques, such as ridge or lasso, can help prevent overfitting in MLR.
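As an illustration of one such technique, the sketch below (a minimal example with synthetic data; alpha=1.0 is an arbitrary choice, not a recommendation) fits ridge regression, which discourages overfitting by penalizing large coefficients:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# alpha controls the penalty strength: larger alpha shrinks coefficients more
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print(ridge.coef_)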
Multicollinearity, which occurs when independent variables are highly correlated with each other, is a concern in MLR but not in LR since LR only deals with one independent variable.
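A common way to detect multicollinearity is the variance inflation factor (VIF); the minimal sketch below uses statsmodels on deliberately collinear synthetic data (the rule of thumb that a VIF above roughly 10 signals trouble is a convention, not a hard rule):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data where x2 is almost a copy of x1 (deliberately collinear)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
x3 = rng.normal(size=200)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# VIF for each column; x1 and x2 should show very large values
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))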
The mathematical representation of LR and MLR differs in terms of the number of independent variables. While LR uses a single variable, MLR involves multiple variables to explain the outcome.
LR can be easily visualized on a 2D plot as a straight line, whereas MLR involves multidimensional space, making it harder to visualize directly.
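For instance, a simple LR fit on one feature can be drawn directly (a minimal matplotlib sketch with synthetic data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic 1-D data: y depends on a single feature x (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 5.0 + rng.normal(scale=2.0, size=50)

model = LinearRegression().fit(x.reshape(-1, 1), y)

# Scatter the data and overlay the fitted straight line
grid = np.linspace(0, 10, 100).reshape(-1, 1)
plt.scatter(x, y, label='data')
plt.plot(grid, model.predict(grid), color='red', label='fitted line')
plt.legend()
plt.show()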
LR is used when the outcome depends on one factor, while MLR is applied when the outcome is influenced by multiple factors.
Here’s a comparison of LR vs MLR in a table format:
| Aspect | Linear Regression (LR) | Multiple Linear Regression (MLR) |
|---|---|---|
| Number of Variables | One independent variable | Two or more independent variables |
| Complexity | Simpler, easier to interpret | More complex, harder to interpret |
| Overfitting | Less prone to overfitting | More prone to overfitting |
| Multicollinearity | Not a concern | Can be a concern |
| Equation Form | y = β₀ + β₁x + ε | y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε |
| Visualization | 2D graph, straight line | Multidimensional space, harder to visualize |
| Application | Simple relationships | Complex relationships with multiple factors |
Let's implement Multiple Linear Regression in Python using the 50 Startups Dataset. The goal is to predict a startup's profit from variables like R&D Spend, Administration, Marketing Spend, and State.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Explanation:
We import NumPy and pandas for numerical work and data handling, train_test_split to divide the data into training and test sets, LinearRegression for the model itself, and mean_squared_error and r2_score to evaluate its predictions.
dataset = pd.read_csv('datasets/50_Startups.csv') # Replace with your dataset path
X = dataset.iloc[:, :-1].values # Independent variables (everything except the last column)
y = dataset.iloc[:, -1].values # Dependent variable (last column - profit)
Explanation:
We load the dataset with pandas and split it into X, the independent variables (every column except the last: R&D Spend, Administration, Marketing Spend, and State), and y, the dependent variable (the last column, Profit).
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
X = X[:, 1:] # Avoid dummy variable trap
Explanation:
State is categorical, so we one-hot encode it: the ColumnTransformer applies OneHotEncoder to column index 3 (State) and, with remainder='passthrough', leaves the numeric columns unchanged. Dropping the first dummy column (X[:, 1:]) avoids the dummy variable trap, where one dummy is a perfect linear combination of the others.
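Note that slicing off the first dummy column by hand is the classic fix; in recent scikit-learn versions the encoder can drop a category itself. The sketch below (an equivalent alternative to the snippet above, assuming scikit-learn 0.21 or later where drop='first' is available) would be used in its place:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# drop='first' removes one dummy column per category, avoiding the trap
ct = ColumnTransformer([('State', OneHotEncoder(drop='first'), [3])],
                       remainder='passthrough')
X = ct.fit_transform(X)  # no manual X[:, 1:] slicing needed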
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Explanation:
We hold out 20% of the rows as a test set (test_size=0.2) and set random_state=0 so the split is reproducible.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Explanation:
We create a LinearRegression model and fit it to the training data; this estimates the intercept and coefficients using ordinary least squares.
y_pred = regressor.predict(X_test)
Explanation:
The trained model predicts profits for the held-out test set, which it has never seen during training.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
Explanation:
Mean Squared Error is the average squared difference between actual and predicted profits (lower is better), while R-squared is the proportion of variance in the target explained by the model (closer to 1 is better).
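Beyond these summary metrics, it is often useful to inspect the fitted parameters themselves, which correspond directly to the β values in the equation above (a short sketch continuing from the code above):

# Intercept (β₀) and one coefficient (βᵢ) per column of X_train
print('Intercept:', regressor.intercept_)
print('Coefficients:', regressor.coef_)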
Multiple Linear Regression is a powerful method for analyzing the relationship between multiple variables and a target variable. With proper preprocessing, splitting, and evaluation, you can use this technique to build effective predictive models. Understanding metrics like Mean Squared Error and R-squared helps you evaluate the model's performance and reliability.