Ordinary linear regression and its assumptions, weighted linear regression, Lasso regression, Ridge regression and Elastic Net regression
After so many articles about regression on Medium.com, I think it's best to abridge the definition and explanation of regression to avoid repetition. With one predictor, regression fits a line to predict the dependent variable. With two predictors, it fits a plane or a curved surface (when interaction terms are included).
Similarly, the concept extends to datasets with many dimensions.
All machine learning algorithms involve minimizing or maximizing a function, called the objective function. In other words, every machine learning problem or algorithm is an optimisation problem with an objective function and constraints.
In linear regression, the objective function is the mean squared error; in classification and deep learning, the log loss function is commonly used.
Regression equation:
$y = b\,x + b_0$
Here, $y$ is the predicted variable, $b$ is the coefficient (slope) and $b_0$ is the intercept (constant).
The mean squared error is given by
$E = \frac{1}{N} \sum (Y - y)^2$, where $Y$ is the actual value.
Expanding the term by substituting the regression equation, we get
$E = \frac{1}{N} \sum \big(Y - (b\,x + b_0)\big)^2$
Now, to find the $b$ and $b_0$ at which the error is minimum, we differentiate $E$ with respect to $b$ and $b_0$ and equate the derivatives to zero.
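Carrying out that differentiation (a short sketch of the standard derivation; $\bar{x}$ and $\bar{Y}$ denote the sample means):
$\frac{\partial E}{\partial b_0} = -\frac{2}{N}\sum (Y - b\,x - b_0) = 0 \;\Rightarrow\; b_0 = \bar{Y} - b\,\bar{x}$
$\frac{\partial E}{\partial b} = -\frac{2}{N}\sum x\,(Y - b\,x - b_0) = 0 \;\Rightarrow\; b = \frac{\operatorname{cov}(x, Y)}{\operatorname{var}(x)}$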
We get a regression line with slope $b$ that passes through the point $(\bar{x}, \bar{Y})$.
But what exactly is $b$? $b = \operatorname{cov}(x, Y) / \operatorname{var}(x)$.
Compare that with the correlation coefficient, $\operatorname{corr}(x, Y) = \operatorname{cov}(x, Y) / (\operatorname{std}(x)\,\operatorname{std}(Y))$.
Hence $b = \operatorname{corr}(x, Y)\,\operatorname{std}(Y) / \operatorname{std}(x)$: the slope of the line is just the correlation between the independent and dependent variables, rescaled by their standard deviations. Sounds logical, doesn't it?
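A quick numerical check of this identity (a minimal sketch on synthetic data; the variable names are illustrative):
import numpy as np
rng = np.random.RandomState(0)
x_demo = rng.normal(size=200)
Y_demo = 2.5 * x_demo + 1.0 + rng.normal(scale=0.5, size=200)
slope = np.cov(x_demo, Y_demo, ddof=1)[0, 1] / np.var(x_demo, ddof=1)      # b = cov(x, Y) / var(x)
via_corr = np.corrcoef(x_demo, Y_demo)[0, 1] * Y_demo.std(ddof=1) / x_demo.std(ddof=1)
print(slope, via_corr)   # both equal the OLS slope (close to 2.5 here)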
The same definition and derivation can be extended to multiple linear regression.
Assumptions of linear regression (L.I.N.E.):
Linearity: the relationship between predictors and the response should be linear; non-linear data should be modeled with polynomial regression. Additionally, linear regression is highly susceptible to outliers.
Independence (no multicollinearity): multicollinearity is correlation between the independent variables. Methods to detect and avoid this problem are detailed below.
Normally distributed independent variables: independent variables should be normally distributed; if they are not, a log transformation or feature engineering can help.
Equal variance (homoscedasticity): heteroscedasticity can be dealt with by taking a log transformation of features or by using the weighted least squares method.
Assumptions for residuals:
No autocorrelation among the residuals.
Why use shrinkage/regularization factors?
Before we answer that question, we need to discuss the legendary bias vs. variance trade-off in data science. Different people define it differently; I'll try to explain it in two lines. Bias is the difference between the expected (or average) prediction of many models built on different samples of the same population and the correct value (truth) we are trying to predict. Bias error decreases as model complexity increases. Variance error arises from sensitivity to small fluctuations in the training set; high variance can cause an algorithm to model the random noise in the training data rather than the intended outputs. Total error is the sum of these two errors plus the irreducible error. As more and more features are added to a model, bias decreases and variance increases, so it all comes down to trading complexity against variance: decreasing complexity increases bias error but decreases model variance. According to the Gauss-Markov theorem, OLS has the lowest variance (and hence the lowest mean squared error) among linear unbiased estimators, i.e. those with $E(b_i) = \beta_i$. Therefore regularization is used to introduce a little bias and decrease model variance in regression.
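To make the trade-off concrete, here is a minimal simulation sketch on assumed synthetic data (the true function is sin(x) with Gaussian noise; a degree-1 polynomial is the simple, high-bias model and a degree-9 polynomial the complex, high-variance one):
import numpy as np
rng = np.random.RandomState(0)
x_grid = np.linspace(0, 2 * np.pi, 50)
truth = np.sin(x_grid)

def poly_predictions(degree, n_models=200, n_points=30, noise=0.3):
    preds = np.empty((n_models, x_grid.size))
    for m in range(n_models):
        x = rng.uniform(0, 2 * np.pi, n_points)          # fresh training sample
        y_sim = np.sin(x) + rng.normal(0, noise, n_points)
        coefs = np.polyfit(x, y_sim, degree)              # fit a polynomial of the given degree
        preds[m] = np.polyval(coefs, x_grid)              # predict on a fixed grid
    return preds

for degree in (1, 9):
    p = poly_predictions(degree)
    bias2 = np.mean((p.mean(axis=0) - truth) ** 2)        # squared bias of the average prediction
    variance = np.mean(p.var(axis=0))                     # spread of predictions across samples
    print("degree %d: bias^2 = %.3f, variance = %.3f" % (degree, bias2, variance))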
(I am going to avoid the mathematical representations here.) There are three forms of regularization:
L2 - Ridge: a penalty proportional to the squared coefficients is added to the loss, shrinking the regression coefficients and thus introducing bias into the model. Ridge is good for grouped selection of correlated features.
L1 - Lasso: a penalty proportional to the absolute values of the coefficients is added to the loss, shrinking the coefficients and introducing bias. Lasso can drive some coefficients exactly to zero, so it also helps with feature selection. Regularization reduces the variance of the model and improves its prediction accuracy.
Elastic Net: a weighted combination of the L1 and L2 penalties, used later in this notebook.
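For readers who do want the formulas, a brief sketch of the penalized objectives in the notation from above ($\lambda$ is the penalty weight, $\alpha$ the L1/L2 mixing ratio; exact scaling conventions vary between libraries):
Ridge: $\min_b \sum (Y - y)^2 + \lambda \sum_j b_j^2$
Lasso: $\min_b \sum (Y - y)^2 + \lambda \sum_j |b_j|$
Elastic Net: $\min_b \sum (Y - y)^2 + \lambda \big(\alpha \sum_j |b_j| + (1 - \alpha) \sum_j b_j^2\big)$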
As Richard Berk suggests in the book "Regression Analysis: A Constructive Critique", simple linear regression works well as a descriptive tool, whereas regularized regression techniques can be used as predictive tools for more accurate prediction.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import sklearn
import warnings
import os
warnings.filterwarnings('ignore')
%config IPCompleter.greedy=True
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
os.getcwd()
df_raw = pd.read_csv('SwissRailways.csv')
df_raw.head(10)
Description of attributes
Swiss Railways, N = 49, T(i) = 3 to 13
Variables in the file are
ID = Company ID from 1 to 50 (49 companies, 604 observations); ID 21 has zero observations.
RAILROAD = Company ID reduced to remove gap at ID=21, values 1 to 49.
YEAR = Year (1985 to 1997).
NI = Number of year observations for firm, repeated in each observation.
STOPS = Number of stations on the network.
NETWORK = Length of railway network (m).
LABOREXP = Labor expenses in 1000 CHF.
STAFF = Number of employees.
ELECEXP = Electricity expenses in 1000 CHF.
KWH = Total consumed electricity (in 1000 kWh).
TOTCOST = Total cost (in 1000 CHF).
NARROW_T = Dummy for the networks with narrow track (1 m wide).
RACK = Dummy for networks with rack rail (crémaillère) in at least some part (used to maintain slow movement of the train on steep slopes).
TUNNEL = Dummy for networks that have tunnels with an average length of more than 300 meters.
VIRAGE = Dummy for the networks whose minimum radius of curvature is 100 meters or less.
CT = Total costs adjusted for inflation (1000 CHF).
Q1 = Total output in train kilometers.
Q2 = Total passenger output in passenger kilometers.
Q3 = Total goods output in ton kilometers.
PL = Labor price adjusted for inflation (in CHF per person per year).
PK = Capital price using the total number of seats as a proxy for capital stock (CHF per seat).
PE = Price of electricity (CHF per kWh).
LABOR = Quantity of labor.
ELEC = Quantity of energy.
CAPITAL = Quantity of Capital
Capital costs = TOTCOST- (LABOREXP + ELECEXP)
Inflation adjustment is done with respect to 1997 prices.
Logs of costs and prices (lnCT, lnPK, lnPL) are normalized by PE.
LNCT, LNPK, LNPL, LNQ1, LNQ2, LNQ3, LNSTOP, LNCAP, LNNET = logs of variables
Let's simplify the dataset by removing the logs of costs and prices (lnCT, lnPK, lnPL) that are normalized by PE. Also, since labor expenses and electricity expenses are directly related to total cost, the target variable, we remove those features too.
df_raw.columns.values
df=df_raw[['STOPS', 'NETWORK', 'STAFF',
'KWH', 'TOTCOST', 'NARROW_T', 'RACK', 'TUNNEL',
'Q1', 'Q2', 'Q3', 'VIRAGE', 'LABOR', 'ELEC',
'CAPITAL', 'LNQ1', 'LNQ2', 'LNQ3', 'LNNET', 'LNPE',
'LNSTOP', 'LNCAP']]
print(df.columns.values)
df.head(20)
# Let's study the distributions of some variables
sns.distplot(df.TOTCOST )
sns.distplot(df.LABOR)
sns.distplot(df.NETWORK)
sns.distplot(df.ELEC)
sns.distplot(df.Q2)
##### pair plots for correlation analysis
#sns.color_palette("Set1")
sns.pairplot(df[['STOPS', 'NETWORK', 'STAFF',
'KWH', 'TOTCOST',
'Q1', 'Q2', 'Q3', 'LABOR', 'ELEC',
'CAPITAL']])
df.isnull().sum()
#sns.light_palette("navy", reverse=True)
#cmap = sns.cubehelix_palette(light=1, as_cmap=True)
sns.reset_orig()
g = sns.PairGrid(df[['STOPS', 'NETWORK', 'STAFF',
'KWH', 'TOTCOST',
'Q1', 'Q2', 'Q3', 'LABOR', 'ELEC',
'CAPITAL']])
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6);
Since the main intention behind this notebook is to try regression with different cost functions and variable selection methods, let's begin with the modeling without further ado. As discussed in the clustering notebook, the data should be normalized before modelling.
# Subset the predictor attributes for modelling (everything except the target TOTCOST).
df_temp = df.loc[:, df.columns != 'TOTCOST']
df_temp.head()
from sklearn import preprocessing
# Standardize the predictors (zero mean, unit variance); returns a NumPy array.
df_scaled = preprocessing.scale(df_temp)
df_scaled
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from scipy import stats
X=df.loc[:, df.columns != 'TOTCOST']
y=df[['TOTCOST']]
X2 = sm.add_constant(X) #beta0 or intercept to the model
est = sm.OLS(y, X2)
model = est.fit()
print(model.summary())
Now that we have constructed a full model, it's time to build a parsimonious one. To build the simplest model we need to remove the insignificant features (those with p-values greater than 0.05) and rebuild the model.
Also, according to the fundamental assumptions of regression, every predictor variable should be independent of the others. Correlation among the features causes multicollinearity. Why is multicollinearity such a big issue? Glad you asked. A regression coefficient is, by definition, the increase in the response variable when its predictor is increased by one unit while keeping all other variables constant. Keeping all other variables constant is the important part: when the features are correlated, we cannot hold the others constant, which breaks down the interpretation of the regression equation
$y = b_1 x_1 + b_2 x_2 + b_3 x_3 + b_0$
There are several methods to find correlated features; correlation plots and variance inflation factors are a few of them.
But feature selection can also be done by other methods, such as:
Forward selection: start from the null model and move towards the full model by adding one feature at a time, evaluating each model's accuracy statistics (R², AIC and BIC); a minimal sketch follows this list.
Backward elimination: start from the full model and eliminate individual features until the best model is obtained.
Stepwise selection: do both of the above simultaneously, using partial F-tests.
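A minimal forward-selection sketch using AIC with statsmodels (illustrative only, not the exact procedure used below; it assumes the candidate features are the columns of X and that lower AIC is better):
import numpy as np
import statsmodels.api as sm

def forward_select_aic(X_cand, y_target):
    remaining = list(X_cand.columns)
    selected = []
    best_aic = np.inf
    improved = True
    while improved and remaining:
        improved = False
        aics = []
        for col in remaining:
            exog = sm.add_constant(X_cand[selected + [col]])
            aics.append((sm.OLS(y_target, exog).fit().aic, col))
        aic, col = min(aics)            # candidate giving the lowest AIC this round
        if aic < best_aic:              # keep it only if the AIC actually improves
            best_aic, improved = aic, True
            selected.append(col)
            remaining.remove(col)
    return selected, best_aic

# Example usage with the X and y defined above:
# print(forward_select_aic(X, y.TOTCOST))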
By cross-referencing the variance inflation factors with the correlation coefficients, we can identify features that are highly correlated and avoid multicollinearity problems.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF for each predictor; note we index X.columns (not df.columns) so the names line up.
for j in range(X.shape[1]):
    print(X.columns.values[j], variance_inflation_factor(X.values, j))
#cols=[0,1,3,4]
df.corr()
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
model = LinearRegression()
rfe = RFE(model, n_features_to_select=13)  # keep the 13 top-ranked features
fit = rfe.fit(X, y)
y_pred=fit.predict(X)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))
print("Mean squared error: %.2f"
% mean_squared_error(y, y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, y_pred))
Feature selection didn't give an optimal set of features.
We see that recursive feature elimination doesn't always yield an optimal feature set, and in particular it isn't helpful for rejecting the features that cause multicollinearity. For these reasons, we use the VIFs and filter out the features with high values, cross-referencing against the correlation matrix created above. In this particular case, we find that 'VIRAGE', 'STOPS', 'ELEC' and 'KWH' have high VIFs. Also, all the insignificant features should be removed to get a parsimonious model.
X.corr()
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from scipy import stats
X=df.loc[:, df.columns != 'TOTCOST']
y=df[['TOTCOST']]
X['LNKWH']=np.log(X.KWH)
X['LNLABOR']=np.log(X.LABOR)
X['LNSTAFF']=np.log(X.STAFF)
X2 = sm.add_constant(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL']]) #beta0 or intercept to the model
est = sm.OLS(y, X2)
model = est.fit()
print(model.summary())
sns.distplot(X.KWH)
X.columns.values
regr = LinearRegression()
# Train the model using the training sets
regr.fit(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values, y.values)
# Make predictions using the testing set
y_pred = regr.predict(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values)
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error: %.2f"
% mean_squared_error(y, y_pred))
print("Root mean squared error: %.2f"
% sqrt(mean_squared_error(y, y_pred)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, y_pred))
residuals=(y-y_pred)
#residuals
ax=sns.kdeplot(residuals.TOTCOST,shade=True,color="g",legend=False)
ax.set(xlabel='Residuals', ylabel='Density')
plt.show()
ax=sns.residplot(y_pred.flatten(),residuals.TOTCOST.values, lowess=True, color="g")
ax.set(xlabel='Fitted Values', ylabel='Residuals')
plt.show()
from statsmodels.graphics.gofplots import qqplot
qqplot(preprocessing.scale(residuals.TOTCOST), line='s')
plt.show()
There's high heteroscedasticity, partly because of outliers. One way to reduce this is to remove the outliers or model them separately.
The fitted values vs. residuals plot clearly shows heteroskedasticity, which means the variance of the residuals grows as the fitted values (and predictors) grow. Let's confirm this with the Breusch-Pagan test, a test for conditional heteroskedasticity.
from statsmodels.compat import lzip
name = ['Lagrange multiplier statistic', 'p-value',
        'f-value', 'f p-value']
# het_breuschpagan expects the explanatory variables to include a constant term
exog_bp = sm.add_constant(X[['LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values)
test = sm.stats.diagnostic.het_breuschpagan(residuals.TOTCOST.values, exog_bp)
lzip(name, test)
Null hypothesis: the error terms are homoscedastic.
Alternative hypothesis: the error terms are heteroscedastic.
Since the p-value is almost 0, we reject the null hypothesis (we would fail to reject the null only if the p-value were above 0.05).
Therefore the residuals are heteroskedastic.
Hence it is essential to transform the data (or reweight the observations) to reduce the heteroskedasticity.
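While we are running residual diagnostics, the no-autocorrelation assumption from the list above can also be checked with the Durbin-Watson statistic (a minimal sketch; values near 2 indicate little autocorrelation, though with panel data like this the ordering of observations matters):
from statsmodels.stats.stattools import durbin_watson
# Durbin-Watson statistic of the OLS residuals computed above
print('Durbin-Watson:', durbin_watson(residuals.TOTCOST.values))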
import math
trans_y=y.TOTCOST.apply(math.log)
from scipy import stats
import numpy as np
regr.fit(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values, trans_y)
# Make predictions using the testing set
y_pred = regr.predict(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values)
ax=sns.residplot(np.exp(y_pred),((np.exp(y_pred))-y.TOTCOST), lowess=True, color="g")
ax.set(xlabel='Fitted Values', ylabel='Residuals')
plt.show()
Heteroscedasticity is reduced to some extent.
print("Mean squared error: %.2f"
% mean_squared_error(y,np.exp(y_pred)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, np.exp(y_pred)))
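As mentioned in the assumptions section, weighted least squares is another way to handle heteroskedasticity. A minimal sketch with statsmodels WLS, assuming weights inversely proportional to the squared OLS fitted values (one common choice, not the only one):
import statsmodels.api as sm
feats = ['LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']
X_wls = sm.add_constant(X[feats])
ols_fit = sm.OLS(y.TOTCOST, X_wls).fit()
weights = 1.0 / (ols_fit.fittedvalues ** 2)   # down-weight observations with large fitted values
wls_fit = sm.WLS(y.TOTCOST, X_wls, weights=weights).fit()
print(wls_fit.summary())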
# Cross-validation using repeated random train/test splits (ShuffleSplit)
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
regr = LinearRegression()
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
scores = cross_val_score(regr, X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values, y, scoring='neg_mean_squared_error', cv=cv)
print("regression model's mean squared errors of cross validation",-scores.mean())
from sklearn.linear_model import Ridge
#Ridge regression model
ridgeReg = Ridge(alpha=0.005, normalize=True)
ridgeReg.fit( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']],y)
y_ridge = ridgeReg.predict( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']])
print("Mean squared error: %.2f"
% mean_squared_error(y,(y_ridge)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, (y_ridge)))
n=0
alpha=[1,.7,.5,.3,.1,.07,.05,.01,.007,.005,.001,.0005]
scores=[0]*len(alpha)
for i in alpha:
    ridgeReg = Ridge(alpha=i, normalize=True)
    cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
    score = cross_val_score(ridgeReg, X[['LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values, y, scoring='neg_mean_squared_error', cv=cv)
    scores[n]=score.mean()
    print("mean of ridge regression model's mean squared errors of cross validation for alpha",i,"is",-score.mean())
    n+=1
from sklearn.linear_model import Lasso
#Lasso regression model
lassoReg = Lasso(alpha=0.1, normalize=True)
lassoReg.fit( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']],y)
y_lasso = lassoReg.predict( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']])
print("Mean squared error: %.2f"
% mean_squared_error(y,(y_lasso)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, (y_lasso)))
n=0
alpha=[10,5,4,3,1,.7,.5,.3,.1,.07,.05,.01,.007,.005,.001,.0005]
scores=[0]*len(alpha)   # size the list from alpha to avoid an index mismatch
for i in alpha:
    lassoReg = Lasso(alpha=i, normalize=True)
    cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
    score = cross_val_score(lassoReg, X[['LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']], y, scoring='neg_mean_squared_error', cv=cv)
    scores[n]=score.mean()
    print("mean of Lasso regression model's mean squared errors of cross validation for alpha",i,"is",-score.mean())
    n+=1
from sklearn.linear_model import ElasticNet
#Elastic Net regression model
elasReg = ElasticNet(alpha=0.01,l1_ratio=0.5, normalize=True)
elasReg.fit( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']],y)
y_elas = elasReg.predict( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']])
print("Mean squared error: %.2f"
% mean_squared_error(y,(y_elas)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, (y_elas)))
n=0
alpha=[10,5,4,3,1,.7,.5,.3,.1,.07,.05,.01,.007,.005,.001,.0005,0.0001,0.00007]
scores=[0]*len(alpha)   # size the list from alpha to avoid an index mismatch
for i in alpha:
    elasReg = ElasticNet(alpha=i, normalize=True, l1_ratio=0.9)
    cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
    score = cross_val_score(elasReg, X[['LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']], y, scoring='neg_mean_squared_error', cv=cv)
    scores[n]=score.mean()
    print("mean of Elastic Net regression model's mean squared errors of cross validation for alpha",i,"is",-score.mean())
    n+=1
Comparing the cross-validated mean squared errors of Lasso, Ridge and plain regression, we find that Lasso and Ridge achieve lower mean squared error than simple linear regression.