# Multiple Linear Regression in Python using Statsmodels and Sklearn

Regression models are widely used as statistical technique for prediction the outcome based on observed data.

Linear regressions allows describe how dependent variable (outcome) changes relatively to independent variable(s) (feature, predictor).

When there is one independent variable and one dependent, it is called simple linear regression (SLR).
When there is more than one independent variable and one dependent, it is called multiple linear regression (MLR).

A simple linear regression equation looks like:

`y = a + bx`

where:
x — the independent (explanatory) variable,
y — the dependent (responce) variable,
a — intercept,
b — slope of the line (coefficient).

And multiple linear regression formula can looks like:

`y = a + b1*x1 + b2*x2 + b3*x3 + + + bn*xn`

Dependent variable is continuous by its nature and independent variable can be continuous or categorical.

Before building model we need to make sure that our data meets multiple regression assumptions:

Linearity — the best fit (straight) line goes through data points;
Normality — the data follows normal distribution (bell shape);
Homoscedasticity — when error term in relation of predictor and outcome is the same across all values of feature.

For our model we will use Ordinary Least Squares (OLS) regression. Also there is WLS (Weighted Least Squares), GLS (Generalized Least Squares), etc.

We’ll use modified King county house sale prices dataset.

Using python and Jupyter notebook let’s import needed libraries:

`import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsimport statsmodels.api as smfrom sklearn.linear_model import LinearRegressionfrom sklearn.model_selection import train_test_split`

Load data (in my case from the same page as current .ipynb file):

`df = pd.read_csv("kc_house_data.csv")`

And explore initial data:

`display(df.head())display(df.info())`

Initial data visualization, cleaning and normalization (if needed) was omitted from this article in order to make it more concise.

The dependent (outcome) variable will be `price` column, and independent — ` zipcode`, `grade` and `sqft_living`.
Significant features we can select using for example stepwise selection or recursive feature elimination (RFE).

Set variables with our predictors names and dependent variable name:

`cat_cols = ['zipcode','grade']cont_cols = ['sqft_living']outcome = 'price'`

Categorical data should be encoded using one-hot encoding scheme:

`df_ohe = pd.get_dummies(df[cat_cols], columns=cat_cols, drop_first=True)`

Create dependent and independent variables:

`X = pd.concat([df[cont_cols], df_ohe], axis=1)y = df[outcome]`

Linear regression using Statsmodels

`const_X = sm.add_constant(X)model = sm.OLS(y, const_X)linreg = model.fit()`

Get regression summary:

`linreg.summary()`

We got R-squared equals to 0.809 what is pretty good

Linear regression using scikit-learn

Split data into train and test subsets and fit the line:

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)lr = LinearRegression()lr.fit(X_train, y_train)print(f'R-Squared : {lr.score(X_test, y_test)}')# R-Squared : 0.8156541661758959`

Run model on test data and visualize prediction accuracy:

`y_hat = lr.predict(X_test)plt.figure(figsize=(16,8))sns.distplot(y_hat, hist = False, label = f'Predicted {outcome}')sns.distplot(y_test, hist = False, label = f'Actual {outcome}')plt.title(f'Actual vs Predicted {outcome}')plt.xlabel(outcome)plt.ylabel('Density')plt.show()`

Conclusion

We built basic multiple linear regression model and get relatively good R-squared value

`import numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsimport statsmodels.api as smfrom sklearn.linear_model import LinearRegressionfrom sklearn.model_selection import train_test_splitdf = pd.read_csv("kc_house_data.csv")#display(df.head())#display(df.info())# Set variables with our predictors names and dependent variable namecat_cols = ['zipcode','grade']cont_cols = ['sqft_living']outcome = 'price'df_ohe = pd.get_dummies(df[cat_cols], columns=cat_cols, drop_first=True)X = pd.concat([df[cont_cols], df_ohe], axis=1)y = df[outcome]# Linear regression using Statsmodelsconst_X = sm.add_constant(X)model = sm.OLS(y, const_X)linreg = model.fit()# Get regression summary#print(linreg.summary())# Linear regression using scikit-learnX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)lr = LinearRegression()lr.fit(X_train, y_train)#print(f'R-Squared : {lr.score(X_test, y_test)}')# Run model on test data and visualize prediction accuracyy_hat = lr.predict(X_test)plt.figure(figsize=(16,8))sns.distplot(y_hat, hist = False, label = f'Predicted {outcome}')sns.distplot(y_test, hist = False, label = f'Actual {outcome}')plt.title(f'Actual vs Predicted {outcome}')plt.xlabel(outcome)plt.ylabel('Density')plt.show()`