r - Linear regression fails in Python with large values in the independent variables -


I'm trying to rewrite a forecasting model (originally in Stata) in Python (with pandas.stats.api.ols), and I ran into an issue with the linear regression: the coefficients and intercept computed by pandas do not match Stata's.

My investigation suggests that the root cause might be that the values of the independent variables are too big. I base that suspicion on the findings below.

1) I created a simple DataFrame in Python and ran a linear regression on it:

from pandas.stats.api import ols
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40, 50, 21],
                   "b": [20, 30, 10, 40, 50, 98],
                   "c": [32, 234, 23, 23, 31, 21],
                   "d": [12, 28, 12, 98, 51, 87],
                   "e": [1, 8, 12, 9, 12, 91]})
ols(y=df['a'], x=df[['b', 'c', 'd', 'e']])

The regression summary is:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <b> + <c> + <d> + <e> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   5

R-squared:         0.4627
Adj R-squared:    -1.6865

Rmse:             23.9493

F-stat (4, 1):     0.2153, p-value:     0.9026

Degrees of Freedom: model 4, resid 1

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             b     0.3212     1.1176       0.29     0.8218    -1.8693     2.5117
             c    -0.0488     0.1361      -0.36     0.7806    -0.3155     0.2178
             d     0.1512     0.4893       0.31     0.8092    -0.8077     1.1101
             e    -0.4508     0.8268      -0.55     0.6822    -2.0713     1.1697
     intercept    20.9222    23.6280       0.89     0.5386   -25.3887    67.2331
---------------------------------End of Summary---------------------------------

I saved the DataFrame to a Stata .dta file and ran the regression in Stata as:

use "/tmp/lr.dta", clear
reg a b c d e

The result is the same:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =    0.22
       Model |  493.929019     4  123.482255           Prob > F      =  0.9026
    Residual |  573.570981     1  573.570981           R-squared     =  0.4627
-------------+------------------------------           Adj R-squared = -1.6865
       Total |      1067.5     5       213.5           Root MSE      =  23.949

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           b |   .3211939   1.117591     0.29   0.822    -13.87914    14.52153
           c |  -.0488429   .1360552    -0.36   0.781    -1.777589    1.679903
           d |   .1512067   .4892539     0.31   0.809    -6.065353    6.367766
           e |  -.4508122   .8267897    -0.55   0.682    -10.95617    10.05455
       _cons |    20.9222   23.62799     0.89   0.539    -279.2998    321.1442
------------------------------------------------------------------------------

I tried the same regression in R and got the same result.

2) However, if I increase the magnitude of the independent variables in Python:

df = pd.DataFrame({"a": [10.0, 20.0, 30.0, 40.0, 50.0, 21.0]})
df['b'] = pow(df['a'], 30)
df['c'] = pow(df['a'], 5)
df['d'] = pow(df['a'], 15)
df['e'] = pow(df['a'], 25)

I've made sure the columns are using float64 here:

>>> df.dtypes
a    float64
b    float64
c    float64
d    float64
e    float64
dtype: object

The result I got is:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <b> + <c> + <d> + <e> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   2

R-squared:        -0.7223
Adj R-squared:    -1.1528

Rmse:             21.4390

F-stat (4, 4):    -1.6775, p-value:     1.0000

Degrees of Freedom: model 1, resid 4

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             b    -0.0000     0.0000      -0.00     0.9973    -0.0000     0.0000
             c     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             d     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             e     0.0000     0.0000       0.00     0.9975    -0.0000     0.0000
     intercept     0.0000    21.7485       0.00     1.0000   -42.6271    42.6271
---------------------------------End of Summary---------------------------------

But in Stata, I got a different result:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =  237.35
       Model |  1066.37679     4  266.594196           Prob > F      =  0.0486
    Residual |   1.1232144     1   1.1232144           R-squared     =  0.9989
-------------+------------------------------           Adj R-squared =  0.9947
       Total |      1067.5     5       213.5           Root MSE      =  1.0598

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           b |  -1.45e-45   2.32e-46    -6.24   0.101    -4.40e-45    1.50e-45
           c |   2.94e-06   3.67e-07     8.01   0.079    -1.72e-06    7.61e-06
           d |  -3.86e-21   6.11e-22    -6.31   0.100    -1.16e-20    3.90e-21
           e |   4.92e-37   7.88e-38     6.24   0.101    -5.09e-37    1.49e-36
       _cons |   9.881129    1.07512     9.19   0.069    -3.779564    23.54182
------------------------------------------------------------------------------

And the result in R aligns with Stata:

lm(formula = a ~ b + c + d + e, data = stata)

Residuals:
         1          2          3          4          5          6
-1.757e-01  8.211e-01  1.287e-03 -1.269e-06  1.289e-09 -6.467e-01

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.881e+00  1.075e+00   9.191    0.069 .
b           -1.449e-45  2.322e-46  -6.238    0.101
c            2.945e-06  3.674e-07   8.015    0.079 .
d           -3.855e-21  6.106e-22  -6.313    0.100
e            4.919e-37  7.879e-38   6.243    0.101
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared:  0.9989,  Adjusted R-squared:  0.9947
F-statistic: 237.3 on 4 and 1 DF,  p-value: 0.04864

Hence, it appears to me that pandas is the one having an issue here. Please advise?

I think this is because of a relative precision issue in floating-point arithmetic (not specific to Python; other programming languages such as C++ behave the same way). np.finfo(float).eps gives 2.2204460492503131e-16: anything less than eps * max_value_of_your_data is effectively treated as 0 in primitive operations (+ - * /). For example, 1e117 + 1e100 == 1e117 returns True, because 1e100 / 1e117 = 1e-17 < eps. Now look at your data.
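A quick sketch you can run to see this absorption effect for yourself:

```python
import numpy as np

# Machine epsilon for float64: the relative gap between 1.0
# and the next representable double.
eps = np.finfo(float).eps
print(eps)  # 2.220446049250313e-16

# A term ~17 orders of magnitude smaller than the other operand
# is absorbed entirely, since 1e100 / 1e117 = 1e-17 < eps.
print(1e117 + 1e100 == 1e117)  # True

# A term only ~15 orders of magnitude smaller survives,
# since 1e-15 > eps, so the sum differs from 1e117.
print(1e117 + 1e102 == 1e117)  # False
```

The same absorption happens inside any least-squares solver that mixes columns of wildly different magnitudes.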

# data
# =======================
print(df)

    a           b          c           d           e
0  10  1.0000e+30     100000  1.0000e+15  1.0000e+25
1  20  1.0737e+39    3200000  3.2768e+19  3.3554e+32
2  30  2.0589e+44   24300000  1.4349e+22  8.4729e+36
3  40  1.1529e+48  102400000  1.0737e+24  1.1259e+40
4  50  9.3132e+50  312500000  3.0518e+25  2.9802e+42
5  21  4.6407e+39    4084101  6.8122e+19  1.1363e+33

When relative precision is taken into account:

# ===================================================
import numpy as np

np.finfo(float).eps  # 2.2204460492503131e-16

df[df < df.max().max() * np.finfo(float).eps] = 0
df

   a           b  c  d           e
0  0  0.0000e+00  0  0  0.0000e+00
1  0  1.0737e+39  0  0  0.0000e+00
2  0  2.0589e+44  0  0  8.4729e+36
3  0  1.1529e+48  0  0  1.1259e+40
4  0  9.3132e+50  0  0  2.9802e+42
5  0  4.6407e+39  0  0  0.0000e+00

So, relative to the largest values in the data, there is no variation at all left in y (the a column), and that's why statsmodels returns 0 coefficients. As a reminder, it's good practice to normalize your data before running a regression.

