Multiple Linear Regression in Python using Statsmodels and Sklearn
Regression models are a widely used statistical technique for predicting an outcome based on observed data.
Linear regression describes how a dependent variable (the outcome) changes relative to one or more independent variables (features, predictors).
When there is one independent variable and one dependent, it is called simple linear regression (SLR).
When there is more than one independent variable and one dependent, it is called multiple linear regression (MLR).
A simple linear regression equation looks like:
y = a + bx
where:
x — the independent (explanatory) variable,
y — the dependent (response) variable,
a — intercept,
b — slope of the line (coefficient).
And the multiple linear regression formula looks like:
y = a + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn
The dependent variable is continuous by its nature, while the independent variables can be continuous or categorical.
Before building the model we need to make sure that our data meets the multiple regression assumptions (a quick residual check is sketched right after this list):
Linearity — the relationship between the outcome and each predictor is linear, so a straight best-fit line is a reasonable description of the data points;
Normality — the residuals (error terms) follow a normal distribution (bell shape);
Homoscedasticity — the variance of the error term is constant across all values of the features.
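One way to eyeball the normality and homoscedasticity assumptions is to look at the residuals of a fitted model. Below is a minimal sketch, assuming the fitted statsmodels results object `linreg` that we build later in this article:
import matplotlib.pyplot as plt
import statsmodels.api as sm
residuals = linreg.resid
fitted = linreg.fittedvalues
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Q-Q plot: points close to the line suggest roughly normal residuals
sm.qqplot(residuals, line='45', fit=True, ax=axes[0])
axes[0].set_title('Q-Q plot of residuals')
# Residuals vs fitted: an even, patternless band suggests homoscedasticity
axes[1].scatter(fitted, residuals, alpha=0.3)
axes[1].axhline(0, color='red', linewidth=1)
axes[1].set_xlabel('Fitted values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residuals vs fitted values')
plt.show()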
For our model we will use Ordinary Least Squares (OLS) regression. There are also other variants, such as WLS (Weighted Least Squares) and GLS (Generalized Least Squares).
We’ll use a modified King County house sale prices dataset.
Using Python in a Jupyter notebook, let's import the needed libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
Load the data (in my case the .csv file is in the same folder as the current .ipynb file):
df = pd.read_csv("kc_house_data.csv")
And explore the initial data:
display(df.head())
df.info()
Initial data visualization, cleaning and normalization (if needed) were omitted from this article in order to keep it concise.
The dependent (outcome) variable will be the `price` column, and the independent variables will be `zipcode`, `grade` and `sqft_living`.
Significant features can be selected using, for example, stepwise selection or recursive feature elimination (RFE); a brief RFE sketch follows.
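As a quick illustration of RFE (not the selection that was actually run for this article), scikit-learn can rank candidate features using a linear model as the estimator. The numeric columns below are assumed to exist in the King County dataset:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
candidate_cols = ['sqft_living', 'bedrooms', 'bathrooms', 'floors']  # assumed columns
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(df[candidate_cols], df['price'])
# ranking_ is 1 for selected features, larger numbers for dropped ones
print(dict(zip(candidate_cols, rfe.ranking_)))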
Set variables with our predictors names and dependent variable name:
cat_cols = ['zipcode','grade']
cont_cols = ['sqft_living']
outcome = 'price'
Categorical data should be encoded using a one-hot encoding scheme:
df_ohe = pd.get_dummies(df[cat_cols], columns=cat_cols, drop_first=True)
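The drop_first=True argument drops one category per column, so the encoded columns are not perfectly collinear (the so-called dummy variable trap). A tiny toy illustration with made-up values, not rows from the dataset:
demo = pd.DataFrame({'grade': [7, 8, 9]})
pd.get_dummies(demo, columns=['grade'], drop_first=True)
# -> only columns grade_8 and grade_9 remain; grade 7 becomes the implicit baseline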
Create dependent and independent variables:
X = pd.concat([df[cont_cols], df_ohe], axis=1)
y = df[outcome]
Linear regression using Statsmodels
const_X = sm.add_constant(X)
model = sm.OLS(y, const_X)
linreg = model.fit()
Get regression summary:
linreg.summary()
We got an R-squared of 0.809, which is pretty good.
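Besides reading the full summary table, individual statistics can be pulled straight from the fitted results object; for example:
print(linreg.rsquared)        # R-squared
print(linreg.rsquared_adj)    # adjusted R-squared
print(linreg.params.head())   # intercept and first coefficients
print(linreg.pvalues.head())  # p-values of the coefficients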
Linear regression using scikit-learn
Split the data into train and test subsets and fit the model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr = LinearRegression()
lr.fit(X_train, y_train)
print(f'R-Squared : {lr.score(X_test, y_test)}')
# R-Squared : 0.8156541661758959
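The fitted scikit-learn model also exposes its learned parameters, which we can match back to the column names:
print(lr.intercept_)
coef_table = pd.Series(lr.coef_, index=X_train.columns).sort_values()
print(coef_table.head())   # most negative coefficients
print(coef_table.tail())   # most positive coefficients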
Run the model on the test data and visualize prediction accuracy:
y_hat = lr.predict(X_test)
plt.figure(figsize=(16, 8))
sns.kdeplot(y_hat, label=f'Predicted {outcome}')
sns.kdeplot(y_test, label=f'Actual {outcome}')
plt.title(f'Actual vs Predicted {outcome}')
plt.xlabel(outcome)
plt.ylabel('Density')
plt.legend()
plt.show()
Conclusion
We built a basic multiple linear regression model and got a relatively good R-squared value.
Just code please:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = pd.read_csv("kc_house_data.csv")
# display(df.head())
# df.info()
# Set variables with our predictors names and dependent variable name
cat_cols = ['zipcode', 'grade']
cont_cols = ['sqft_living']
outcome = 'price'
df_ohe = pd.get_dummies(df[cat_cols], columns=cat_cols, drop_first=True)
X = pd.concat([df[cont_cols], df_ohe], axis=1)
y = df[outcome]
# Linear regression using Statsmodels
const_X = sm.add_constant(X)
model = sm.OLS(y, const_X)
linreg = model.fit()
# Get regression summary
# print(linreg.summary())
# Linear regression using scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
lr = LinearRegression()
lr.fit(X_train, y_train)
# print(f'R-Squared : {lr.score(X_test, y_test)}')
# Run model on test data and visualize prediction accuracy
y_hat = lr.predict(X_test)
plt.figure(figsize=(16, 8))
sns.kdeplot(y_hat, label=f'Predicted {outcome}')
sns.kdeplot(y_test, label=f'Actual {outcome}')
plt.title(f'Actual vs Predicted {outcome}')
plt.xlabel(outcome)
plt.ylabel('Density')
plt.legend()
plt.show()