# Multiple Linear Regression in Python using Statsmodels and Sklearn

Regression models are widely used as statistical technique for prediction the outcome based on observed data.

Linear regressions allows describe how dependent variable (outcome) changes relatively to independent variable(s) (feature, predictor).

When there is one independent variable and one dependent, it is called simple linear regression (SLR).

When there is more than one independent variable and one dependent, it is called multiple linear regression (MLR).

A simple linear regression equation looks like:

`y = a + bx`

where:

x — the independent (explanatory) variable,

y — the dependent (responce) variable,

a — intercept,

b — slope of the line (coefficient).

And multiple linear regression formula can looks like:

`y = a + b1*x1 + b2*x2 + b3*x3 + + + bn*xn`

Dependent variable is continuous by its nature and independent variable can be continuous or categorical.

Before building model we need to make sure that our data meets multiple regression assumptions:

**Linearity **— the best fit (straight) line goes through data points;** Normality** — the data follows normal distribution (bell shape);**Homoscedasticity** — when error term in relation of predictor and outcome is the same across all values of feature.

For our model we will use Ordinary Least Squares (OLS) regression. Also there is WLS (Weighted Least Squares), GLS (Generalized Least Squares), etc.

We’ll use modified King county house sale prices dataset.

Using python and Jupyter notebook let’s import needed libraries:

`import numpy as np`

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import statsmodels.api as sm

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

Load data (in my case from the same page as current .ipynb file):

`df = pd.read_csv("kc_house_data.csv")`

And explore initial data:

`display(df.head())`

display(df.info())

Initial data visualization, cleaning and normalization (if needed) was omitted from this article in order to make it more concise.

The dependent (outcome) variable will be `price` column, and independent — ` zipcode`, `grade` and `sqft_living`.

Significant features we can select using for example stepwise selection or recursive feature elimination (RFE).

Set variables with our predictors names and dependent variable name:

`cat_cols = ['zipcode','grade']`

cont_cols = ['sqft_living']

outcome = 'price'

Categorical data should be encoded using one-hot encoding scheme:

`df_ohe = pd.get_dummies(df[cat_cols], columns=cat_cols, drop_first=True)`

Create dependent and independent variables:

`X = pd.concat([df[cont_cols], df_ohe], axis=1)`

y = df[outcome]

**Linear regression using Statsmodels**

const_X = sm.add_constant(X)model = sm.OLS(y, const_X)

linreg = model.fit()

Get regression summary:

`linreg.summary()`

We got R-squared equals to 0.809 what is pretty good

**Linear regression using scikit-learn**

Split data into train and test subsets and fit the line:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)lr = LinearRegression()

lr.fit(X_train, y_train)print(f'R-Squared : {lr.score(X_test, y_test)}')

# R-Squared : 0.8156541661758959

Run model on test data and visualize prediction accuracy:

y_hat = lr.predict(X_test)plt.figure(figsize=(16,8))

sns.distplot(y_hat, hist = False, label = f'Predicted {outcome}')

sns.distplot(y_test, hist = False, label = f'Actual {outcome}')plt.title(f'Actual vs Predicted {outcome}')

plt.xlabel(outcome)

plt.ylabel('Density')

plt.show()

**Conclusion**

We built basic multiple linear regression model and get relatively good R-squared value

**Just code please:**

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import statsmodels.api as sm

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_splitdf = pd.read_csv("kc_house_data.csv")#display(df.head())

#display(df.info())# Set variables with our predictors names and dependent variable name

cat_cols = ['zipcode','grade']

cont_cols = ['sqft_living']

outcome = 'price'df_ohe = pd.get_dummies(df[cat_cols], columns=cat_cols, drop_first=True)X = pd.concat([df[cont_cols], df_ohe], axis=1)

y = df[outcome]# Linear regression using Statsmodels

const_X = sm.add_constant(X)model = sm.OLS(y, const_X)

linreg = model.fit()# Get regression summary

#print(linreg.summary())# Linear regression using scikit-learn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)lr = LinearRegression()

lr.fit(X_train, y_train)#print(f'R-Squared : {lr.score(X_test, y_test)}')# Run model on test data and visualize prediction accuracy

y_hat = lr.predict(X_test)plt.figure(figsize=(16,8))

sns.distplot(y_hat, hist = False, label = f'Predicted {outcome}')

sns.distplot(y_test, hist = False, label = f'Actual {outcome}')plt.title(f'Actual vs Predicted {outcome}')

plt.xlabel(outcome)

plt.ylabel('Density')

plt.show()