Ordinary linear regression and its assumptions, weighted linear regression, Lasso regression, Ridge regression and Elastic Net Regression

After so many articles about regression on Medium.com, I think its best to abridge the definition and explanation of regression to avoid repetition. Regression is a line equation, when one predictor is used to predict the dependent variable. For two predictor variable, it is plane or a curved surface (when interaction terms are included).

Similarly, the concept can be expanded to dataset of mutiple dimensions.

All machine learning algorithms involve minimizing or maximizing a function or equation, called objective function. In other words, every machine learning problem or algorithm is an optimisation problem with a objective function and constraints.
In regression, the objective function is mean squared error and in deep learning, we use log loss function.

regression equation:

y=bx+b0

Here, y is predicted variable, b is coefficient and b0 is intercept (constant)

sum of mean squared error is given by
E=1/N Σ (Y - y)2 Here Y is actual value

expanding the term by substituting the regression equation we get

E=1/N Σ (Y - bx+b0)2
Now to find the b0 and b at which the Error is minimum, we differentiate with respect to b and b0 and equate it to zero. ()

we get a regression line with slope b that passes through a point (mean of x, mean of Y).

But what exactly is b, b = cov(x,Y)/var(x)

Compare that with correlation coefficient corr(x,Y)=cov(x,Y)/(std(x)*std(y))

Hence b is similar to the correlation between x and Y. this can be reworded as, slope of the line is nothing but the correlation between the independent and dependent variable. sounds logical, isn't it.

The same definition and derivation can be extended to multiple linear regression.

Assumptions of the Linear regression:

(L.I.N.E)

Linearity: Non linear data should be modeled with polynomial regression. Additionally, linear regression is highly susceptible to outliers.

Independent (No Multicollinearity): Multi-collinearity is the correlation between independent variables. methods to find and avoid this problems are detailed below.

Normally distributed Independent variables: Independent variables should be normally distributed, and if they are not, taking log transformation or feature engineering would help.

Equal variance (Homoscedasticity): Heteroscedarcity can be dealt by taking log transformation of features or by using weighted least square method.

Assumptions for residuals:

No auto correlation among the residuals

Why use shrinkage/regularization factors?

Before we answer that question, we need to discuss about the lengendary bias vs variance trade-off phenomenon in data science. Different people have different definitions for this topic. I'll try to explain it in two lines. Bias is the difference between the expected (or average) prediction of multiple models created on different samples of the same population and the correct value (Truth) which we are trying to predict. Bias error decreases with increase in complexity of the model. And, variance error arises from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs. Total error is the sum of these two errors along with irreducible error. As more and more features are added to model, the bias decreases and model variance increases. Finally, it all comes down to complexity vs variance of the model. Decreasing complexity increases bias error but decreases model variance. According to Gauss Markov theorem, OLS has lowest variance or mean squared error among unbiased models (), given that (E(bi)=Bi). Therefore regularization is used to introduce bias and decrease model variance in regression.

(I am going to avoid the mathematical representations here) There Three forms of regularization:

L2 - Ridge A penalty term multiplied with squared coefficient is used to reduce or shrink the coefficients of regression, thus, introducing bias into the model. ridge is good for grouped selection.

L1 - Lasso A penalty term multiplied with absolute value of coefficients is used to reduce or shrink the coefficients of regression, thus, introducing bias into the model. Lasso also helps in feature selection. Regularization helps in reducing the variance of the model and improves the prediction accuracy of the model.

As suggested by Richard Berk In the book "Regression analysis: a constructive critique", simple linear regression works as descriptive tool, whereas regularized regression techniques can be used as predictive tool for more accurate prediction.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import sklearn
import warnings
import os
warnings.filterwarnings('ignore')
%config IPCompleter.greedy=True
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
In [3]:
os.getcwd()
Out[3]:
'c:\\users\\annep\\documents\\notebooks'
In [4]:
df_raw = pd.read_csv('SwissRailways.csv')
df_raw.head(10)
Out[4]:
id YEAR NI STOPS NETWORK LABOREXP STAFF ELECEXP KWH TOTCOST NARROW_T RACK TUNNEL T Q1 Q2 Q3 CT PL PE PK VIRAGE LABOR ELEC CAPITAL LNCT LNQ1 LNQ2 LNQ3 LNNET LNPL LNPE LNPK LNSTOP LNCAP RAILROAD
0 1 90 7 10 13107 1061 16.0 163 1631 2536 0 0 0 5 159000 2401000 179385 3255.456 85125.16 0.128291 3709.715 0 41.83754 6.427445 51.73502 10.141540 11.97666 14.69140 12.09729 2.573147 13.40533 -2.053454 10.272160 2.302585 2.650892 1
1 1 91 7 10 13107 1148 17.0 170 1428 2634 0 0 0 6 161000 2578000 196815 3162.065 81067.72 0.142914 3479.806 0 43.58390 6.454062 49.96203 10.004490 11.98916 14.76252 12.19002 2.573147 13.24855 -1.945510 10.100240 2.302585 2.650892 1
2 1 92 7 10 7351 1145 17.0 196 2000 2846 0 0 0 7 158000 2573000 209298 3190.583 75507.78 0.109865 3716.343 0 40.23190 6.886859 52.88124 10.276460 11.97035 14.76058 12.25151 1.994836 13.44049 -2.208499 10.428990 2.302585 2.650892 1
3 1 93 7 10 13107 1148 17.0 190 1305 2763 0 0 0 8 156000 2512000 121183 2964.592 72456.45 0.156217 3367.775 0 41.54904 6.876584 51.57438 9.851007 11.95761 14.73659 11.70506 2.573147 13.04725 -1.856512 9.978519 2.302585 2.650892 1
4 1 94 7 10 13107 1138 16.0 203 1236 2874 0 0 0 9 149000 2431000 124408 2978.239 73704.66 0.170196 3499.121 0 39.59638 7.063326 53.34029 9.769890 11.91170 14.70381 11.73132 2.573147 12.97862 -1.770802 9.931069 2.302585 2.650892 1
5 1 95 7 10 13107 1038 15.0 199 1226 2620 0 0 0 10 145000 2322000 56566 2689.938 71047.23 0.166649 3127.573 0 39.61832 7.595420 52.78626 9.689137 11.88449 14.65794 10.94316 2.573147 12.96296 -1.791863 9.839876 2.302585 2.650892 1
6 1 96 7 10 13107 1018 15.0 160 1163 2899 0 0 0 11 144000 2168000 115340 2922.379 68413.98 0.138685 3821.320 0 35.11556 5.519145 59.36530 9.955706 11.87757 14.58932 11.65564 2.573147 13.10888 -1.975552 10.223900 2.302585 2.650892 1
7 2 85 13 15 27024 2240 41.0 82 861 4006 1 0 0 0 245000 5291000 35196 5961.310 81300.81 0.141723 2303.265 1 55.91613 2.046930 42.03695 10.646920 12.40901 15.48152 10.46869 3.296725 13.25979 -1.953878 9.695961 2.708050 3.251079 2
8 2 86 13 15 27024 2424 42.0 152 1597 4575 1 0 0 1 301000 6379000 37433 6480.170 81748.28 0.134814 2980.468 1 52.98361 3.322404 43.69399 10.780360 12.61487 15.66852 10.53031 3.296725 13.31526 -2.003862 10.003700 2.708050 3.251079 2
9 2 87 13 15 27024 2558 46.0 180 1899 4875 1 0 0 2 336000 7522000 27382 6827.731 77883.33 0.132754 3150.523 1 52.47179 3.692308 43.83590 10.848000 12.72487 15.83334 10.21764 3.296725 13.28222 -2.019253 10.074580 2.708050 3.251079 2

Description of attributes
Swiss Railways, N = 49, T(i) = 3 to 13

Variables in the file are

ID = Company ID from 1 to 50 (49 companies, 604 obs) ID = 21 zero count.
RAILROAD = Company ID reduced to remove gap at ID=21, values 1 to 49.
YEAR = Year (1985 to 1997).
NI = Number of year observations for firm, repeated in each observation.
STOPS = Number of stations on the network.
NETWORK = Length of railway network (m).
LABOREXP = Labor expenses in 1000 CHF.
STAFF = Number of employees.
ELECEXP = Electricity expenses in 1000 CHF.
KWH = Total consumed electricity (in 1000 kWh).
TOTCOST = Total cost (in 1000 CHF).
NARROW_T = Dummy for the networks with narrow track (1 m wide).
RACK = Dummy for the networks with RACK RAIL (cremaillere) in at least some part;
(used to maintain a slow movement of the train on high slopes).
TUNNEL = Dummy for networks that have tunnels with an average length of more than 300 meters.
VIRAGE = Dummy for the networks whose minimum radius of curvature is 100 meters or less.
CT = Total costs adjusted for inflation (1000 CHF).
Q1 = Total output in train kilometers.
Q2 = Total passenger output in passenger kilometers.
Q3 = Total goods output in ton kilometers.
PL = Labor price adjusted for inflation (in CHF per person per year).
PK = Capital price using the total number of seats as a proxy for capital stock (CHF per seat).
PE = Price of electricity (CHF per kWh).
LABOR = Quantity of labor.
ELEC = Quantity of energy.
CAPITAL = Quantity of Capital Capital costs = TOTCOST- (LABOREXP + ELECEXP)
Inflation adjustment is done with respect to 1997 prices.
Logs of costs and prices (lnCT, lnPK, lnPL) are normalized by PE.
LNCT, LNPK, LNPL, LNQ1, LNQ2, LNQ3, LNSTOP, LNCAP, LNNET = logs of variables

lets make the dataset simpler by removing Logs of costs and prices (lnCT, lnPK, lnPL) that are normalized by PE. Also, since labor expenses and electricity expenses are directly related to total costs, the prediction variable, we remove these features too.

In [5]:
df_raw.columns.values
Out[5]:
array(['id', 'YEAR', 'NI', 'STOPS', 'NETWORK', 'LABOREXP', 'STAFF',
       'ELECEXP', 'KWH', 'TOTCOST', 'NARROW_T', 'RACK', 'TUNNEL', 'T',
       'Q1', 'Q2', 'Q3', 'CT', 'PL', 'PE', 'PK', 'VIRAGE', 'LABOR', 'ELEC',
       'CAPITAL', 'LNCT', 'LNQ1', 'LNQ2', 'LNQ3', 'LNNET', 'LNPL', 'LNPE',
       'LNPK', 'LNSTOP', 'LNCAP', 'RAILROAD '], dtype=object)
In [6]:
df=df_raw[['STOPS', 'NETWORK', 'STAFF',
       'KWH', 'TOTCOST', 'NARROW_T', 'RACK', 'TUNNEL',
       'Q1', 'Q2', 'Q3', 'VIRAGE', 'LABOR', 'ELEC',
       'CAPITAL', 'LNQ1', 'LNQ2', 'LNQ3', 'LNNET', 'LNPE',
        'LNSTOP', 'LNCAP']]
In [7]:
print(df.columns.values)
df.head(20)
['STOPS' 'NETWORK' 'STAFF' 'KWH' 'TOTCOST' 'NARROW_T' 'RACK' 'TUNNEL' 'Q1'
 'Q2' 'Q3' 'VIRAGE' 'LABOR' 'ELEC' 'CAPITAL' 'LNQ1' 'LNQ2' 'LNQ3' 'LNNET'
 'LNPE' 'LNSTOP' 'LNCAP']
Out[7]:
STOPS NETWORK STAFF KWH TOTCOST NARROW_T RACK TUNNEL Q1 Q2 Q3 VIRAGE LABOR ELEC CAPITAL LNQ1 LNQ2 LNQ3 LNNET LNPE LNSTOP LNCAP
0 10 13107 16.0 1631 2536 0 0 0 159000 2401000 179385 0 41.83754 6.427445 51.73502 11.97666 14.69140 12.097290 2.573147 -2.053454 2.302585 2.650892
1 10 13107 17.0 1428 2634 0 0 0 161000 2578000 196815 0 43.58390 6.454062 49.96203 11.98916 14.76252 12.190020 2.573147 -1.945510 2.302585 2.650892
2 10 7351 17.0 2000 2846 0 0 0 158000 2573000 209298 0 40.23190 6.886859 52.88124 11.97035 14.76058 12.251510 1.994836 -2.208499 2.302585 2.650892
3 10 13107 17.0 1305 2763 0 0 0 156000 2512000 121183 0 41.54904 6.876584 51.57438 11.95761 14.73659 11.705060 2.573147 -1.856512 2.302585 2.650892
4 10 13107 16.0 1236 2874 0 0 0 149000 2431000 124408 0 39.59638 7.063326 53.34029 11.91170 14.70381 11.731320 2.573147 -1.770802 2.302585 2.650892
5 10 13107 15.0 1226 2620 0 0 0 145000 2322000 56566 0 39.61832 7.595420 52.78626 11.88449 14.65794 10.943160 2.573147 -1.791863 2.302585 2.650892
6 10 13107 15.0 1163 2899 0 0 0 144000 2168000 115340 0 35.11556 5.519145 59.36530 11.87757 14.58932 11.655640 2.573147 -1.975552 2.302585 2.650892
7 15 27024 41.0 861 4006 1 0 0 245000 5291000 35196 1 55.91613 2.046930 42.03695 12.40901 15.48152 10.468690 3.296725 -1.953878 2.708050 3.251079
8 15 27024 42.0 1597 4575 1 0 0 301000 6379000 37433 1 52.98361 3.322404 43.69399 12.61487 15.66852 10.530310 3.296725 -2.003862 2.708050 3.251079
9 15 27024 46.0 1899 4875 1 0 0 336000 7522000 27382 1 52.47179 3.692308 43.83590 12.72487 15.83334 10.217640 3.296725 -2.019253 2.708050 3.251079
10 16 27024 45.0 1913 5132 1 0 0 341000 7575000 27273 1 52.02650 3.702260 44.27124 12.73964 15.84036 10.213650 3.296725 -1.991950 2.772589 3.251079
11 16 27024 47.0 1883 5450 1 0 0 332000 7868000 28009 1 53.70642 3.449541 42.84404 12.71289 15.87831 10.240280 3.296725 -2.012489 2.772589 3.251079
12 16 27024 47.0 2105 5872 1 0 0 373000 8438000 17733 1 56.72684 3.593324 39.67984 12.82933 15.94826 9.783182 3.296725 -2.050468 2.772589 3.251079
13 16 27024 49.0 2209 6820 1 0 0 411000 9064000 13610 1 56.14370 3.475073 40.38123 12.92635 16.01982 9.518560 3.296725 -2.049513 2.772589 3.251079
14 17 27024 51.0 2357 7229 1 0 0 433000 9130000 11849 1 57.98866 3.513626 38.49772 12.97849 16.02708 9.379999 3.296725 -2.113522 2.833213 3.251079
15 17 27024 50.0 2279 7115 1 0 0 419000 10600000 12556 1 60.42165 3.682361 35.89599 12.94563 16.17249 9.437954 3.296725 -2.092725 2.833213 3.251079
16 17 27024 49.0 2337 6911 1 0 0 431000 10100000 12090 1 61.06208 4.123860 34.81406 12.97386 16.12775 9.400134 3.296725 -2.068507 2.833213 3.251079
17 17 27024 48.0 2086 7433 1 0 0 428000 9875000 9598 1 54.83654 3.430647 41.73281 12.96688 16.10552 9.169310 3.296725 -2.075396 2.833213 3.251079
18 17 27024 46.0 2347 7284 1 0 0 428000 9817000 9875 1 58.49808 4.146074 37.35585 12.96688 16.09963 9.197762 3.296725 -2.042434 2.833213 3.251079
19 17 27024 49.0 2343 7157 1 0 0 429000 10100000 8622 1 59.95529 4.191700 35.85301 12.96921 16.12914 9.062073 3.296725 -2.055405 2.833213 3.251079
In [8]:
#lets study variables
sns.distplot(df.TOTCOST )
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1480dbcd550>
In [9]:
sns.distplot(df.LABOR)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1480e09a128>
In [10]:
sns.distplot(df.NETWORK)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1480f13f6d8>
In [11]:
sns.distplot(df.ELEC)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1480f2789e8>
In [12]:
sns.distplot(df.Q2)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1480f34bac8>
In [13]:
##### pair plots for correlation analysis
#sns.color_palette("Set1")
sns.pairplot(df[['STOPS', 'NETWORK', 'STAFF',
       'KWH', 'TOTCOST', 
       'Q1', 'Q2', 'Q3', 'LABOR', 'ELEC',
       'CAPITAL']])
Out[13]:
<seaborn.axisgrid.PairGrid at 0x1480f444a90>
In [14]:
df.isnull().sum()
Out[14]:
STOPS       0
NETWORK     0
STAFF       0
KWH         0
TOTCOST     0
NARROW_T    0
RACK        0
TUNNEL      0
Q1          0
Q2          0
Q3          0
VIRAGE      0
LABOR       0
ELEC        0
CAPITAL     0
LNQ1        0
LNQ2        0
LNQ3        0
LNNET       0
LNPE        0
LNSTOP      0
LNCAP       0
dtype: int64
In [15]:
#sns.light_palette("navy", reverse=True)
#cmap = sns.cubehelix_palette(light=1, as_cmap=True)
sns.reset_orig()
g = sns.PairGrid(df[['STOPS', 'NETWORK', 'STAFF',
       'KWH', 'TOTCOST', 
       'Q1', 'Q2', 'Q3', 'LABOR', 'ELEC',
       'CAPITAL']])
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6);

Since our main intension behind this notebook is to try regression with different cost functions and variable selection methods, lets begin with the modeling without any further ado. As discussed in the clustering notebook, the data should be normalized before modelling

In [16]:
#subsetting the attributes for modelling.
df.temp=df.loc[:, df.columns != 'TOTCOST']
df.temp.head()
Out[16]:
STOPS NETWORK STAFF KWH NARROW_T RACK TUNNEL Q1 Q2 Q3 VIRAGE LABOR ELEC CAPITAL LNQ1 LNQ2 LNQ3 LNNET LNPE LNSTOP LNCAP
0 10 13107 16.0 1631 0 0 0 159000 2401000 179385 0 41.83754 6.427445 51.73502 11.97666 14.69140 12.09729 2.573147 -2.053454 2.302585 2.650892
1 10 13107 17.0 1428 0 0 0 161000 2578000 196815 0 43.58390 6.454062 49.96203 11.98916 14.76252 12.19002 2.573147 -1.945510 2.302585 2.650892
2 10 7351 17.0 2000 0 0 0 158000 2573000 209298 0 40.23190 6.886859 52.88124 11.97035 14.76058 12.25151 1.994836 -2.208499 2.302585 2.650892
3 10 13107 17.0 1305 0 0 0 156000 2512000 121183 0 41.54904 6.876584 51.57438 11.95761 14.73659 11.70506 2.573147 -1.856512 2.302585 2.650892
4 10 13107 16.0 1236 0 0 0 149000 2431000 124408 0 39.59638 7.063326 53.34029 11.91170 14.70381 11.73132 2.573147 -1.770802 2.302585 2.650892
In [17]:
from sklearn import preprocessing
df.scaled = preprocessing.scale(df.temp)
In [18]:
df.scaled
Out[18]:
array([[-0.5629073 , -0.46297081, -0.46206069, ..., -1.26772948,
        -0.72052812, -1.48360239],
       [-0.5629073 , -0.46297081, -0.45905378, ..., -0.56130556,
        -0.72052812, -1.48360239],
       [-0.5629073 , -0.56499434, -0.45905378, ..., -2.28239909,
        -0.72052812, -1.48360239],
       ..., 
       [ 5.47735469,  5.9868801 ,  4.10843935, ..., -2.08378444,
         3.10123292, -0.12927144],
       [ 5.47735469,  5.9868801 ,  3.95208014, ..., -1.32083693,
         3.10123292, -0.12927144],
       [ 5.36852114,  5.9868801 ,  3.84082454, ..., -2.62852351,
         3.07568304, -0.12927144]])
In [19]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from scipy import stats

X=df.loc[:, df.columns != 'TOTCOST']
y=df[['TOTCOST']]
X2 = sm.add_constant(X) #beta0 or intercept to the model
est = sm.OLS(y, X2)
model = est.fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                TOTCOST   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     1750.
Date:                Wed, 03 Apr 2019   Prob (F-statistic):               0.00
Time:                        21:54:39   Log-Likelihood:                -6047.8
No. Observations:                 604   AIC:                         1.214e+04
Df Residuals:                     583   BIC:                         1.223e+04
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -21.7121      4.422     -4.910      0.000     -30.398     -13.026
STOPS        269.8442    137.355      1.965      0.050       0.072     539.616
NETWORK        0.1123      0.036      3.091      0.002       0.041       0.184
STAFF        -16.6760      7.224     -2.308      0.021     -30.865      -2.487
KWH            1.8598      0.193      9.622      0.000       1.480       2.239
NARROW_T   -3087.5008   1424.067     -2.168      0.031   -5884.427    -290.575
RACK        2965.2821    724.182      4.095      0.000    1542.958    4387.606
TUNNEL     -2297.8698    953.082     -2.411      0.016   -4169.763    -425.977
Q1            -0.0027      0.002     -1.201      0.230      -0.007       0.002
Q2             0.0001   3.09e-05      4.405      0.000    7.55e-05       0.000
Q3             0.0002   1.98e-05      7.856      0.000       0.000       0.000
VIRAGE      3377.3691   1424.574      2.371      0.018     579.446    6175.292
LABOR       -134.3962    134.503     -0.999      0.318    -398.566     129.773
ELEC       -1938.3287    272.185     -7.121      0.000   -2472.911   -1403.746
CAPITAL      -98.4774    130.140     -0.757      0.450    -354.077     157.123
LNQ1        -497.7727   2060.705     -0.242      0.809   -4545.082    3549.536
LNQ2        3478.3767    998.965      3.482      0.001    1516.369    5440.385
LNQ3        -128.7748    178.577     -0.721      0.471    -479.507     221.957
LNNET       -222.3410   1212.392     -0.183      0.855   -2603.528    2158.846
LNPE        4219.4687   1764.663      2.391      0.017     753.598    7685.339
LNSTOP     -5543.6414   2412.607     -2.298      0.022   -1.03e+04    -805.182
LNCAP      -2707.2628   1172.768     -2.308      0.021   -5010.627    -403.898
==============================================================================
Omnibus:                      142.699   Durbin-Watson:                   0.412
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2423.922
Skew:                           0.542   Prob(JB):                         0.00
Kurtosis:                      12.754   Cond. No.                     1.81e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.02e-12. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Now that we have constructed a full model, its time for building the parsimonious model. To build the simplest model we need to remove the insignificant features, in this case Land and labor, that have p value greater than 0.05 and rebuild the model.

Also according to the fundamental assumptions of regression, every predictor variable should be independent of other variables. The correlation among the features can cause multi collinearity. why is this multi-collienarity such big issue? glad you asked. The basic definition of regression cofficients is that the increase of response variable when the predictor variable is increased by a unit while keeping all other variables constant. Yes, keeping all other variables constant is the important part here. When the features are correlated, we cannot keep other variables constant. Hence, it breakdowns the entire regression equation.

y=b1(X)+b2(X)+b3(X)+b0

There are several methods to find correlated features; Correlation plots and variance inflation methods are few of them.

But, feature selection can also be done by other methods such as
forward selection: here the model starts from null model to full model by selecting one feature at a time and by evaluating each models accuracy statistics (R2, AIC and BIC)
Backward elimination: eliminates individual feature from full model until the best model is obtained.
stepwise slection: In this method we do both the bove methods simultaneously by doing partial F-test</strong></i>

comparing the variation inflation factor with correlation coefficients, we can identify features that are highly correlated and avoid multicollinearity problems

In [21]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
for j in range(X.shape[1]):
    print (df.columns.values[j],variance_inflation_factor(X.values, j))
STOPS 127.440136513
NETWORK 84.0999520981
STAFF 115.460180091
KWH 181.246228083
TOTCOST 8.86797367744
NARROW_T 1.88639315417
RACK 2.78206760096
TUNNEL 116.668971283
Q1 57.9987118074
Q2 21.8546367338
Q3 8.2387287145
VIRAGE 1027.23285999
LABOR 27.1979168373
ELEC 673.606590656
CAPITAL 86.1323886389
LNQ1 35.7080413282
LNQ2 4.68194573423
LNQ3 24.0928637212
LNNET 1.45435136415
LNPE 49.5498154984
LNSTOP 2.96043864134
In [20]:
#cols=[0,1,3,4]
df.corr()
Out[20]:
STOPS NETWORK STAFF KWH TOTCOST NARROW_T RACK TUNNEL Q1 Q2 Q3 VIRAGE LABOR ELEC CAPITAL LNQ1 LNQ2 LNQ3 LNNET LNPE LNSTOP LNCAP
STOPS 1.000000 0.937483 0.676245 0.587233 0.652653 0.221042 -0.138500 0.402401 0.833750 0.765818 0.191431 0.190760 0.220767 -0.024349 -0.220353 0.716570 0.649731 0.483496 0.774594 -0.222202 0.867703 0.011668
NETWORK 0.937483 1.000000 0.790394 0.708852 0.762797 0.075749 -0.091713 0.419698 0.871335 0.818496 0.311179 0.044869 0.236263 0.017797 -0.242405 0.669803 0.621301 0.554145 0.776215 -0.212689 0.720221 -0.077349
STAFF 0.676245 0.790394 1.000000 0.981292 0.981150 -0.096577 -0.085073 0.494285 0.917086 0.921017 0.783490 -0.121627 0.298010 0.078934 -0.314238 0.674856 0.661000 0.613199 0.656203 -0.153619 0.550021 -0.071705
KWH 0.587233 0.708852 0.981292 1.000000 0.981919 -0.174182 -0.108910 0.510177 0.883055 0.893069 0.855668 -0.199182 0.247461 0.178892 -0.277956 0.664089 0.650076 0.618470 0.619940 -0.126006 0.506215 -0.095261
TOTCOST 0.652653 0.762797 0.981150 0.981919 1.000000 -0.106021 -0.090182 0.512486 0.905470 0.918659 0.812322 -0.124588 0.252610 0.095354 -0.270637 0.679804 0.669411 0.611614 0.650564 -0.145257 0.544439 -0.068818
NARROW_T 0.221042 0.075749 -0.096577 -0.174182 -0.106021 1.000000 0.274255 -0.110358 -0.023946 -0.032331 -0.218845 0.910120 0.072376 -0.464770 -0.003653 0.111019 0.108543 -0.420119 0.129899 -0.198490 0.314526 0.503465
RACK -0.138500 -0.091713 -0.085073 -0.108910 -0.090182 0.274255 1.000000 -0.007995 -0.175809 -0.139335 -0.103634 0.235738 0.205715 -0.116043 -0.191312 -0.217385 -0.121740 -0.342599 -0.087045 -0.004416 -0.158986 0.200514
TUNNEL 0.402401 0.419698 0.494285 0.510177 0.512486 -0.110358 -0.007995 1.000000 0.535252 0.617501 0.324352 -0.044377 -0.051990 0.275888 0.011328 0.480588 0.566684 0.352088 0.400839 0.042044 0.388774 0.125645
Q1 0.833750 0.871335 0.917086 0.883055 0.905470 -0.023946 -0.175809 0.535252 1.000000 0.965545 0.566255 -0.048813 0.229915 0.125315 -0.252108 0.821516 0.779904 0.623884 0.743461 -0.188072 0.714909 -0.063535
Q2 0.765818 0.818496 0.921017 0.893069 0.918659 -0.032331 -0.139335 0.617501 0.965545 1.000000 0.608059 -0.048558 0.234328 0.094527 -0.251962 0.744904 0.756053 0.582984 0.670537 -0.175236 0.625587 0.041099
Q3 0.191431 0.311179 0.783490 0.855668 0.812322 -0.218845 -0.103634 0.324352 0.566255 0.608059 1.000000 -0.242653 0.176697 0.122147 -0.197632 0.385020 0.374809 0.452011 0.322399 -0.029540 0.214631 -0.105037
VIRAGE 0.190760 0.044869 -0.121627 -0.199182 -0.124588 0.910120 0.235738 -0.044377 -0.048813 -0.048558 -0.242653 1.000000 -0.000686 -0.501909 0.076059 0.088650 0.090084 -0.399277 0.086584 -0.196262 0.261437 0.512594
LABOR 0.220767 0.236263 0.298010 0.247461 0.252610 0.072376 0.205715 -0.051990 0.229915 0.234328 0.176697 -0.000686 1.000000 -0.171082 -0.988997 0.166294 0.212082 0.142532 0.243992 -0.231094 0.192714 0.287272
ELEC -0.024349 0.017797 0.078934 0.178892 0.095354 -0.464770 -0.116043 0.275888 0.125315 0.094527 0.122147 -0.501909 -0.171082 1.000000 0.023443 0.238740 0.250492 0.294408 0.168856 0.246700 0.086106 -0.248853
CAPITAL -0.220353 -0.242405 -0.314238 -0.277956 -0.270637 -0.003653 -0.191312 0.011328 -0.252108 -0.251962 -0.197632 0.076059 -0.988997 0.023443 1.000000 -0.204583 -0.252808 -0.188831 -0.272929 0.197445 -0.208473 -0.254125
LNQ1 0.716570 0.669803 0.674856 0.664089 0.679804 0.111019 -0.217385 0.480588 0.821516 0.744904 0.385020 0.088650 0.166294 0.238740 -0.204583 1.000000 0.953608 0.642355 0.879468 -0.098619 0.840233 0.005468
LNQ2 0.649731 0.621301 0.661000 0.650076 0.669411 0.108543 -0.121740 0.566684 0.779904 0.756053 0.374809 0.090084 0.212082 0.250492 -0.252808 0.953608 1.000000 0.600872 0.825550 -0.114533 0.761829 0.152731
LNQ3 0.483496 0.554145 0.613199 0.618470 0.611614 -0.420119 -0.342599 0.352088 0.623884 0.582984 0.452011 -0.399277 0.142532 0.294408 -0.188831 0.642355 0.600872 1.000000 0.657063 0.109996 0.457153 -0.287333
LNNET 0.774594 0.776215 0.656203 0.619940 0.650564 0.129899 -0.087045 0.400839 0.743461 0.670537 0.322399 0.086584 0.243992 0.168856 -0.272929 0.879468 0.825550 0.657063 1.000000 -0.055171 0.851932 -0.043606
LNPE -0.222202 -0.212689 -0.153619 -0.126006 -0.145257 -0.198490 -0.004416 0.042044 -0.188072 -0.175236 -0.029540 -0.196262 -0.231094 0.246700 0.197445 -0.098619 -0.114533 0.109996 -0.055171 1.000000 -0.144047 -0.110212
LNSTOP 0.867703 0.720221 0.550021 0.506215 0.544439 0.314526 -0.158986 0.388774 0.714909 0.625587 0.214631 0.261437 0.192714 0.086106 -0.208473 0.840233 0.761829 0.457153 0.851932 -0.144047 1.000000 0.093964
LNCAP 0.011668 -0.077349 -0.071705 -0.095261 -0.068818 0.503465 0.200514 0.125645 -0.063535 0.041099 -0.105037 0.512594 0.287272 -0.248853 -0.254125 0.005468 0.152731 -0.287333 -0.043606 -0.110212 0.093964 1.000000
In [118]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
model = LinearRegression()
rfe = RFE(model,13)               
fit = rfe.fit(X, y)
y_pred=fit.predict(X)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))
Num Features: 13
Selected Features: [False False False False  True False  True False False False  True  True
  True  True  True  True False False False  True  True  True  True  True]
Feature Ranking: [ 5  9  4  8  1  7  1 10 12 11  1  1  1  1  1  1  6  3  2  1  1  1  1  1]
In [119]:
print("Mean squared error: %.2f"
      % mean_squared_error(y, y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, y_pred))
Mean squared error: 490612376.54
Variance score: 0.72

Feature selection didnt give a optimal set of features.

We see that recursive feature elimination doesn't always yield optimal feature set, especially not helpful in rejecting the features that cause multi-collinearity. For these reasons, we apply VIF and filter out the features that have VIF factor by coss-referencing with correlation matrix created. In this particular case, we find that 'VIRAGE', 'STOPS', 'ELEC' and 'KWH' have high VIF factor. Also, all the insignificant features should be removed to get parsimonious model

In [93]:
X.corr()
Out[93]:
STOPS NETWORK STAFF KWH NARROW_T RACK TUNNEL Q1 Q2 Q3 VIRAGE LABOR ELEC CAPITAL LNQ1 LNQ2 LNQ3 LNNET LNPE LNSTOP LNCAP LNKWH LNLABOR
STOPS 1.000000 0.937483 0.676245 0.587233 0.221042 -0.138500 0.402401 0.833750 0.765818 0.191431 0.190760 0.220767 -0.024349 -0.220353 0.716570 0.649731 0.483496 0.774594 -0.222202 0.867703 0.011668 0.645946 0.215464
NETWORK 0.937483 1.000000 0.790394 0.708852 0.075749 -0.091713 0.419698 0.871335 0.818496 0.311179 0.044869 0.236263 0.017797 -0.242405 0.669803 0.621301 0.554145 0.776215 -0.212689 0.720221 -0.077349 0.668158 0.230529
STAFF 0.676245 0.790394 1.000000 0.981292 -0.096577 -0.085073 0.494285 0.917086 0.921017 0.783490 -0.121627 0.298010 0.078934 -0.314238 0.674856 0.661000 0.613199 0.656203 -0.153619 0.550021 -0.071705 0.735178 0.275997
KWH 0.587233 0.708852 0.981292 1.000000 -0.174182 -0.108910 0.510177 0.883055 0.893069 0.855668 -0.199182 0.247461 0.178892 -0.277956 0.664089 0.650076 0.618470 0.619940 -0.126006 0.506215 -0.095261 0.741588 0.229972
NARROW_T 0.221042 0.075749 -0.096577 -0.174182 1.000000 0.274255 -0.110358 -0.023946 -0.032331 -0.218845 0.910120 0.072376 -0.464770 -0.003653 0.111019 0.108543 -0.420119 0.129899 -0.198490 0.314526 0.503465 -0.059722 0.060680
RACK -0.138500 -0.091713 -0.085073 -0.108910 0.274255 1.000000 -0.007995 -0.175809 -0.139335 -0.103634 0.235738 0.205715 -0.116043 -0.191312 -0.217385 -0.121740 -0.342599 -0.087045 -0.004416 -0.158986 0.200514 -0.038721 0.208971
TUNNEL 0.402401 0.419698 0.494285 0.510177 -0.110358 -0.007995 1.000000 0.535252 0.617501 0.324352 -0.044377 -0.051990 0.275888 0.011328 0.480588 0.566684 0.352088 0.400839 0.042044 0.388774 0.125645 0.546056 -0.053777
Q1 0.833750 0.871335 0.917086 0.883055 -0.023946 -0.175809 0.535252 1.000000 0.965545 0.566255 -0.048813 0.229915 0.125315 -0.252108 0.821516 0.779904 0.623884 0.743461 -0.188072 0.714909 -0.063535 0.804606 0.216167
Q2 0.765818 0.818496 0.921017 0.893069 -0.032331 -0.139335 0.617501 0.965545 1.000000 0.608059 -0.048558 0.234328 0.094527 -0.251962 0.744904 0.756053 0.582984 0.670537 -0.175236 0.625587 0.041099 0.760654 0.220269
Q3 0.191431 0.311179 0.783490 0.855668 -0.218845 -0.103634 0.324352 0.566255 0.608059 1.000000 -0.242653 0.176697 0.122147 -0.197632 0.385020 0.374809 0.452011 0.322399 -0.029540 0.214631 -0.105037 0.466487 0.158260
VIRAGE 0.190760 0.044869 -0.121627 -0.199182 0.910120 0.235738 -0.044377 -0.048813 -0.048558 -0.242653 1.000000 -0.000686 -0.501909 0.076059 0.088650 0.090084 -0.399277 0.086584 -0.196262 0.261437 0.512594 -0.077286 -0.006375
LABOR 0.220767 0.236263 0.298010 0.247461 0.072376 0.205715 -0.051990 0.229915 0.234328 0.176697 -0.000686 1.000000 -0.171082 -0.988997 0.166294 0.212082 0.142532 0.243992 -0.231094 0.192714 0.287272 0.217572 0.988299
ELEC -0.024349 0.017797 0.078934 0.178892 -0.464770 -0.116043 0.275888 0.125315 0.094527 0.122147 -0.501909 -0.171082 1.000000 0.023443 0.238740 0.250492 0.294408 0.168856 0.246700 0.086106 -0.248853 0.419484 -0.150283
CAPITAL -0.220353 -0.242405 -0.314238 -0.277956 -0.003653 -0.191312 0.011328 -0.252108 -0.251962 -0.197632 0.076059 -0.988997 0.023443 1.000000 -0.204583 -0.252808 -0.188831 -0.272929 0.197445 -0.208473 -0.254125 -0.283753 -0.980247
LNQ1 0.716570 0.669803 0.674856 0.664089 0.111019 -0.217385 0.480588 0.821516 0.744904 0.385020 0.088650 0.166294 0.238740 -0.204583 1.000000 0.953608 0.642355 0.879468 -0.098619 0.840233 0.005468 0.917466 0.159526
LNQ2 0.649731 0.621301 0.661000 0.650076 0.108543 -0.121740 0.566684 0.779904 0.756053 0.374809 0.090084 0.212082 0.250492 -0.252808 0.953608 1.000000 0.600872 0.825550 -0.114533 0.761829 0.152731 0.932390 0.204090
LNQ3 0.483496 0.554145 0.613199 0.618470 -0.420119 -0.342599 0.352088 0.623884 0.582984 0.452011 -0.399277 0.142532 0.294408 -0.188831 0.642355 0.600872 1.000000 0.657063 0.109996 0.457153 -0.287333 0.675537 0.137904
LNNET 0.774594 0.776215 0.656203 0.619940 0.129899 -0.087045 0.400839 0.743461 0.670537 0.322399 0.086584 0.243992 0.168856 -0.272929 0.879468 0.825550 0.657063 1.000000 -0.055171 0.851932 -0.043606 0.852342 0.250791
LNPE -0.222202 -0.212689 -0.153619 -0.126006 -0.198490 -0.004416 0.042044 -0.188072 -0.175236 -0.029540 -0.196262 -0.231094 0.246700 0.197445 -0.098619 -0.114533 0.109996 -0.055171 1.000000 -0.144047 -0.110212 -0.096647 -0.235935
LNSTOP 0.867703 0.720221 0.550021 0.506215 0.314526 -0.158986 0.388774 0.714909 0.625587 0.214631 0.261437 0.192714 0.086106 -0.208473 0.840233 0.761829 0.457153 0.851932 -0.144047 1.000000 0.093964 0.730809 0.194081
LNCAP 0.011668 -0.077349 -0.071705 -0.095261 0.503465 0.200514 0.125645 -0.063535 0.041099 -0.105037 0.512594 0.287272 -0.248853 -0.254125 0.005468 0.152731 -0.287333 -0.043606 -0.110212 0.093964 1.000000 -0.029449 0.295729
LNKWH 0.645946 0.668158 0.735178 0.741588 -0.059722 -0.038721 0.546056 0.804606 0.760654 0.466487 -0.077286 0.217572 0.419484 -0.283753 0.917466 0.932390 0.675537 0.852342 -0.096647 0.730809 -0.029449 1.000000 0.214477
LNLABOR 0.215464 0.230529 0.275997 0.229972 0.060680 0.208971 -0.053777 0.216167 0.220269 0.158260 -0.006375 0.988299 -0.150283 -0.980247 0.159526 0.204090 0.137904 0.250791 -0.235935 0.194081 0.295729 0.214477 1.000000
In [114]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from scipy import stats

X=df.loc[:, df.columns != 'TOTCOST']
y=df[['TOTCOST']]
X['LNKWH']=np.log(X.KWH)
X['LNLABOR']=np.log(X.LABOR)
X['LNSTAFF']=np.log(X.STAFF)
X2 = sm.add_constant(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL']]) #beta0 or intercept to the model
est = sm.OLS(y, X2)
model = est.fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                TOTCOST   R-squared:                       0.728
Model:                            OLS   Adj. R-squared:                  0.724
Method:                 Least Squares   F-statistic:                     199.0
Date:                Wed, 03 Apr 2019   Prob (F-statistic):          1.09e-162
Time:                        22:39:33   Log-Likelihood:                -6896.3
No. Observations:                 604   AIC:                         1.381e+04
Df Residuals:                     595   BIC:                         1.385e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        1.77e+04   2.16e+04      0.818      0.414   -2.48e+04    6.02e+04
LNKWH       3.897e+04   4529.019      8.605      0.000    3.01e+04    4.79e+04
LNQ2       -1.878e+04   2156.844     -8.709      0.000    -2.3e+04   -1.45e+04
LNQ3        3112.0238    478.238      6.507      0.000    2172.783    4051.264
STOPS        629.9722     82.994      7.591      0.000     466.975     792.969
LNSTAFF     8106.1383   3721.291      2.178      0.030     797.676    1.54e+04
LNNET      -1.764e+04   2522.807     -6.993      0.000   -2.26e+04   -1.27e+04
ELEC       -8891.6243   1252.454     -7.099      0.000   -1.14e+04   -6431.856
TUNNEL      1.969e+04   2936.610      6.705      0.000    1.39e+04    2.55e+04
==============================================================================
Omnibus:                      401.617   Durbin-Watson:                   0.235
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5954.250
Skew:                           2.739   Prob(JB):                         0.00
Kurtosis:                      17.373   Cond. No.                         823.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [33]:
sns.distplot(X.KWH)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x14821777588>
In [24]:
X.columns.values
Out[24]:
array(['STOPS', 'NETWORK', 'STAFF', 'KWH', 'NARROW_T', 'RACK', 'TUNNEL',
       'Q1', 'Q2', 'Q3', 'VIRAGE', 'LABOR', 'ELEC', 'CAPITAL', 'LNQ1',
       'LNQ2', 'LNQ3', 'LNNET', 'LNPE', 'LNSTOP', 'LNCAP'], dtype=object)
In [123]:
regr = LinearRegression()

# Train the model using the training sets
regr.fit(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values, y.values)

# Make predictions using the testing set
y_pred = regr.predict(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values)
In [124]:
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error: %.2f"
      % mean_squared_error(y, y_pred))
print("Root mean squared error: %.2f"
      % sqrt(mean_squared_error(y, y_pred)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, y_pred))
Mean squared error: 38106449.16
Root mean squared error: 6173.04
Variance score: 0.98
In [112]:
residuals=(y-y_pred)
#residuals
Out[112]:
TOTCOST
0 8642.628229
1 14713.468931
2 -4785.084996
3 23131.884959
4 26814.409419
5 33722.484531
6 14089.781447
7 17958.066546
8 8916.559831
9 9087.074513
10 8838.770827
11 7803.208003
12 7896.610487
13 7743.618172
14 5581.341016
15 10990.812449
16 13174.065173
17 12428.285829
18 14191.883676
19 15001.505907
20 -2706.361328
21 -1845.737073
22 771.011074
23 1533.656284
24 3171.826606
... ...
579 5776.949219
580 3829.302964
581 3718.979889
582 2831.112492
583 5826.401578
584 4863.488437
585 6470.024378
586 5887.457925
587 14584.535539
588 21948.526936
589 22235.092505
590 30021.173760
591 -9814.147500
592 -10548.479986
593 -10062.786050
594 -5584.659203
595 5024.335713
596 24610.584171
597 41102.762672
598 52491.320880
599 54910.146740
600 55318.807911
601 48855.484514
602 50067.596876
603 47469.493397

604 rows × 1 columns

In [125]:
ax=sns.kdeplot(residuals.TOTCOST,shade=True,color="g",legend=False)
ax.set(xlabel='Residuals', ylabel='Density')
plt.show()
In [126]:
ax=sns.residplot(y_pred.flatten(),residuals.TOTCOST.values, lowess=True, color="g")
ax.set(xlabel='Fitted Values', ylabel='Residuals')
plt.show()
In [127]:
from statsmodels.graphics.gofplots import qqplot
qqplot(preprocessing.scale(residuals.TOTCOST), line='s')
plt.show()

Theres a high heteroscedarcity because of outliers. The only way to reduce this is to remove the outliers or model them seperately

The fitted values vs residuals plot clearly shows heteroskedasticity, which in other words means increase of variance of predicted values with increase in the values of predictors. Lets confirm our assumption with Breush-Pagan Test, the test thats done to find conditional heteroskedasticity

In [129]:
from statsmodels.compat import lzip

name = ['Lagrange multiplier statistic', 'p-value',
        'f-value', 'f p-value']
test=sm.stats.diagnostic.het_breuschpagan(residuals.TOTCOST.values ,X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values)
lzip(name, test)
Out[129]:
[('Lagrange multiplier statistic', 424.72777958369284),
 ('p-value', 9.5669769244734404e-87),
 ('f-value', 156.62898224179548),
 ('f p-value', 1.3537616000591543e-150)]

Null Hypothesis: Error terms are homoscedastic
Alternative Hypothesis: Error terms are heteroscedastic.

But since p value is almost 0 we can reject the null hypothesis. (we consider the null hypothesis when the p values is significantly greater than 0.05) Hherefore the residuals are heteroskedastic

Hence it is essential to transform the data to remove heteroskedasticity.

In [130]:
import math
trans_y=y.TOTCOST.apply(math.log)
In [131]:
#X['FEED/COWS']=X["FEED"]/X["COWS"]
In [134]:
from scipy import stats
import numpy as np
regr.fit(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values, trans_y)

# Make predictions using the testing set
y_pred = regr.predict(X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values)
In [135]:
ax=sns.residplot(np.exp(y_pred),((np.exp(y_pred))-y.TOTCOST), lowess=True, color="g")
ax.set(xlabel='Fitted Values', ylabel='Residuals')
plt.show()

Heteroscedarcity is reduced to some extent

In [136]:
print("Mean squared error: %.2f"
      % mean_squared_error(y,np.exp(y_pred)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, np.exp(y_pred)))
Mean squared error: 43152197.97
Variance score: 0.98
In [138]:
#  K-fold cross validation
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

regr = LinearRegression()
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
scores = cross_val_score(regr, X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values, y, scoring='neg_mean_squared_error', cv=cv)
print("regression model's mean squared errors of cross validation",-scores.mean())
regression model's mean squared errors of cross validation 36155603.4081
In [143]:
from sklearn.linear_model import Ridge

#Ridge regression model
ridgeReg = Ridge(alpha=0.005, normalize=True)
ridgeReg.fit( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']],y)
y_ridge = ridgeReg.predict( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']])
In [144]:
print("Mean squared error: %.2f"
      % mean_squared_error(y,(y_ridge)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, (y_ridge)))
Mean squared error: 38226728.02
Variance score: 0.98
In [145]:
n=0
scores=[0]*12
alpha=[1,.7,.5,.3,.1,.07,.05,.01,.007,.005,.001,.0005]
for i in alpha:
    ridgeReg = Ridge(alpha=i, normalize=True)
    cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
    score = cross_val_score(ridgeReg, X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']].values, y, scoring='neg_mean_squared_error', cv=cv)
    scores[n]=score.mean()    
    print("mean of ridge regression model's mean squared errors of cross validation for alpha",i,"is",-score.mean())
    n+=1
mean of ridge regression model's mean squared errors of cross validation for alpha 1 is 286416914.072
mean of ridge regression model's mean squared errors of cross validation for alpha 0.7 is 227308857.491
mean of ridge regression model's mean squared errors of cross validation for alpha 0.5 is 178377302.417
mean of ridge regression model's mean squared errors of cross validation for alpha 0.3 is 119178513.708
mean of ridge regression model's mean squared errors of cross validation for alpha 0.1 is 54325638.7728
mean of ridge regression model's mean squared errors of cross validation for alpha 0.07 is 46295285.557
mean of ridge regression model's mean squared errors of cross validation for alpha 0.05 is 41803381.9534
mean of ridge regression model's mean squared errors of cross validation for alpha 0.01 is 36298325.6909
mean of ridge regression model's mean squared errors of cross validation for alpha 0.007 is 36172623.3459
mean of ridge regression model's mean squared errors of cross validation for alpha 0.005 is 36123704.8444
mean of ridge regression model's mean squared errors of cross validation for alpha 0.001 is 36128221.351
mean of ridge regression model's mean squared errors of cross validation for alpha 0.0005 is 36140374.2196
In [148]:
from sklearn.linear_model import Lasso

#Lasso regression model
lassoReg = Lasso(alpha=0.1, normalize=True)
lassoReg.fit( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']],y)
y_lasso = lassoReg.predict( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']])
In [149]:
print("Mean squared error: %.2f"
      % mean_squared_error(y,(y_lasso)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, (y_lasso)))
Mean squared error: 38107566.24
Variance score: 0.98
In [150]:
n=0
scores=[0]*14
alpha=[10,5,4,3,1,.7,.5,.3,.1,.07,.05,.01,.007,.005,.001,.0005]
for i in alpha:
    lassoReg = Lasso(alpha=i, normalize=True)
    cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
    score = cross_val_score(lassoReg,  X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']], y, scoring='neg_mean_squared_error', cv=cv)
    scores[n]=score.mean()
    print("mean of Lasso regression model's mean squared errors of cross validation for alpha",i,"is",-score.mean())
mean of Lasso regression model's mean squared errors of cross validation for alpha 10 is 36206268.8699
mean of Lasso regression model's mean squared errors of cross validation for alpha 5 is 36184120.1548
mean of Lasso regression model's mean squared errors of cross validation for alpha 4 is 36180176.775
mean of Lasso regression model's mean squared errors of cross validation for alpha 3 is 36190093.3429
mean of Lasso regression model's mean squared errors of cross validation for alpha 1 is 36175759.3379
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.7 is 36152486.0021
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.5 is 36144847.427
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.3 is 36143985.6039
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.1 is 36149789.1527
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.07 is 36151233.374
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.05 is 36151922.9361
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.01 is 36154785.9287
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.007 is 36155029.7204
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.005 is 36155192.5921
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.001 is 36155520.6166
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.0005 is 36155562.0554
In [151]:
from sklearn.linear_model import ElasticNet

#Lasso regression model
elasReg = ElasticNet(alpha=0.01,l1_ratio=0.5, normalize=True)
elasReg.fit( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']],y)
y_elas = elasReg.predict( X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']])
In [152]:
print("Mean squared error: %.2f"
      % mean_squared_error(y,(y_elas)))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y, (y_elas)))
Mean squared error: 583385575.12
Variance score: 0.67
In [153]:
n=0
scores=[0]*14
alpha=[10,5,4,3,1,.7,.5,.3,.1,.07,.05,.01,.007,.005,.001,.0005,0.0001,0.00007]
for i in alpha:
    elasReg = ElasticNet(alpha=i, normalize=True,l1_ratio=0.9)
    cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
    score = cross_val_score(elasReg,  X[[ 'LNKWH','LNQ2','LNQ3','STOPS','LNSTAFF','LNNET','ELEC','TUNNEL','KWH']], y, scoring='neg_mean_squared_error', cv=cv)
    scores[n]=score.mean()
    print("mean of Lasso regression model's mean squared errors of cross validation for alpha",i,"is",-score.mean())
mean of Lasso regression model's mean squared errors of cross validation for alpha 10 is 1656474666.53
mean of Lasso regression model's mean squared errors of cross validation for alpha 5 is 1626337277.7
mean of Lasso regression model's mean squared errors of cross validation for alpha 4 is 1611679291.78
mean of Lasso regression model's mean squared errors of cross validation for alpha 3 is 1587834628.25
mean of Lasso regression model's mean squared errors of cross validation for alpha 1 is 1420293209.0
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.7 is 1330765631.58
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.5 is 1228650581.76
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.3 is 1046555169.88
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.1 is 642327072.034
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.07 is 536372438.729
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.05 is 451156103.454
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.01 is 165442987.971
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.007 is 124651961.324
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.005 is 95029239.0366
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.001 is 40881290.6039
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.0005 is 37372572.6236
mean of Lasso regression model's mean squared errors of cross validation for alpha 0.0001 is 36116715.3957
mean of Lasso regression model's mean squared errors of cross validation for alpha 7e-05 is 36107064.9865

Comparing the mean squared errors of Lasso, Ridge and regular regression, we find that lasso and ridge have lower mean squared error than simple linear regression