Linear regression fails in Python with large values in the independent variables
I'm trying to rewrite a forecasting model (originally in Stata) in Python (with pandas.stats.api.ols), and I ran into an issue with the linear regression: the coefficients and intercept computed by pandas do not match Stata's.
My investigation suggests the root cause might be that the values of the independent variables are too big. I base this suspicion on the findings below:
1) I created a simple DataFrame in Python and ran a linear regression on it:
from pandas.stats.api import ols
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40, 50, 21],
                   "b": [20, 30, 10, 40, 50, 98],
                   "c": [32, 234, 23, 23, 31, 21],
                   "d": [12, 28, 12, 98, 51, 87],
                   "e": [1, 8, 12, 9, 12, 91]})
ols(y=df['a'], x=df[['b', 'c', 'd', 'e']])
The summary of the regression is:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <b> + <c> + <d> + <e> + <intercept>
Number of Observations:         6
Number of Degrees of Freedom:   5
R-squared:         0.4627
Adj R-squared:    -1.6865
Rmse:             23.9493
F-stat (4, 1):     0.2153, p-value:     0.9026
Degrees of Freedom: model 4, resid 1
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             b     0.3212     1.1176       0.29     0.8218    -1.8693     2.5117
             c    -0.0488     0.1361      -0.36     0.7806    -0.3155     0.2178
             d     0.1512     0.4893       0.31     0.8092    -0.8077     1.1101
             e    -0.4508     0.8268      -0.55     0.6822    -2.0713     1.1697
     intercept    20.9222    23.6280       0.89     0.5386   -25.3887    67.2331
---------------------------------End of Summary---------------------------------
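For cross-checking, the same regression can also be run through statsmodels (pandas' own ols was later deprecated in favor of statsmodels). A minimal sketch, assuming statsmodels is installed:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"a": [10, 20, 30, 40, 50, 21],
                   "b": [20, 30, 10, 40, 50, 98],
                   "c": [32, 234, 23, 23, 31, 21],
                   "d": [12, 28, 12, 98, 51, 87],
                   "e": [1, 8, 12, 9, 12, 91]})

X = sm.add_constant(df[['b', 'c', 'd', 'e']])  # adds the intercept column
print(sm.OLS(df['a'], X).fit().summary())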
I saved the DataFrame to a Stata .dta file and ran the same linear regression in Stata:

use "/tmp/lr.dta", clear
reg a b c d e

The result is the same:
      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =    0.22
       Model |  493.929019     4  123.482255           Prob > F      =  0.9026
    Residual |  573.570981     1  573.570981           R-squared     =  0.4627
-------------+------------------------------           Adj R-squared = -1.6865
       Total |      1067.5     5       213.5           Root MSE      =  23.949

------------------------------------------------------------------------------
           a |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           b |   .3211939   1.117591     0.29   0.822    -13.87914    14.52153
           c |  -.0488429   .1360552    -0.36   0.781    -1.777589    1.679903
           d |   .1512067   .4892539     0.31   0.809    -6.065353    6.367766
           e |  -.4508122   .8267897    -0.55   0.682    -10.95617    10.05455
       _cons |    20.9222   23.62799     0.89   0.539    -279.2998    321.1442
------------------------------------------------------------------------------
I also tried it in R and got the same result.
2) However, if I increase the values of the independent variables in Python:
df = pd.DataFrame({"a": [10.0, 20.0, 30.0, 40.0, 50.0, 21.0]})
df['b'] = pow(df['a'], 30)
df['c'] = pow(df['a'], 5)
df['d'] = pow(df['a'], 15)
df['e'] = pow(df['a'], 25)
I've made sure all the columns are float64 here:

df.dtypes
a    float64
b    float64
c    float64
d    float64
e    float64
dtype: object
The result I got is:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <b> + <c> + <d> + <e> + <intercept>
Number of Observations:         6
Number of Degrees of Freedom:   2
R-squared:        -0.7223
Adj R-squared:    -1.1528
Rmse:             21.4390
F-stat (4, 4):    -1.6775, p-value:     1.0000
Degrees of Freedom: model 1, resid 4
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             b    -0.0000     0.0000      -0.00     0.9973    -0.0000     0.0000
             c     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             d     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             e     0.0000     0.0000       0.00     0.9975    -0.0000     0.0000
     intercept     0.0000    21.7485       0.00     1.0000   -42.6271    42.6271
---------------------------------End of Summary---------------------------------
But in Stata, I got a different result:
      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =  237.35
       Model |  1066.37679     4  266.594196           Prob > F      =  0.0486
    Residual |   1.1232144     1   1.1232144           R-squared     =  0.9989
-------------+------------------------------           Adj R-squared =  0.9947
       Total |      1067.5     5       213.5           Root MSE      =  1.0598

------------------------------------------------------------------------------
           a |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           b |  -1.45e-45   2.32e-46    -6.24   0.101    -4.40e-45    1.50e-45
           c |   2.94e-06   3.67e-07     8.01   0.079    -1.72e-06    7.61e-06
           d |  -3.86e-21   6.11e-22    -6.31   0.100    -1.16e-20    3.90e-21
           e |   4.92e-37   7.88e-38     6.24   0.101    -5.09e-37    1.49e-36
       _cons |   9.881129    1.07512     9.19   0.069    -3.779564    23.54182
------------------------------------------------------------------------------
And the result in R aligns with Stata:

lm(formula = a ~ b + c + d + e, data = stata)
Residuals:
         1          2          3          4          5          6
-1.757e-01  8.211e-01  1.287e-03 -1.269e-06  1.289e-09 -6.467e-01

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.881e+00  1.075e+00   9.191    0.069 .
b           -1.449e-45  2.322e-46  -6.238    0.101
c            2.945e-06  3.674e-07   8.015    0.079 .
d           -3.855e-21  6.106e-22  -6.313    0.100
e            4.919e-37  7.879e-38   6.243    0.101
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared: 0.9989,  Adjusted R-squared: 0.9947
F-statistic: 237.3 on 4 and 1 DF,  p-value: 0.04864
Hence, it appears to me that pandas is having an issue here. Can anyone please advise?
I think this is because of a relative machine-precision issue, and not just in Python; other programming languages (C++, for example) behave the same way. The machine epsilon

np.finfo(float).eps

gives 2.2204460492503131e-16, and anything less than eps * max_value_of_your_data is effectively treated as 0 in the primitive operations + - * /. For example, 1e117 + 1e100 == 1e117 returns True, because 1e100 / 1e117 = 1e-17 < eps, so adding 1e100 does not change 1e117 at all.
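A quick check of that absorption effect (the 1e102 comparison is an extra example of my own, chosen so that the ratio lies above eps):

import numpy as np

eps = np.finfo(float).eps      # about 2.220446049250313e-16
print(1e117 + 1e100 == 1e117)  # True:  1e100 / 1e117 = 1e-17 < eps
print(1e117 + 1e102 == 1e117)  # False: 1e102 / 1e117 = 1e-15 > eps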
Now look at your data.
# data
# =======================
print(df)
    a           b          c           d           e
0  10  1.0000e+30     100000  1.0000e+15  1.0000e+25
1  20  1.0737e+39    3200000  3.2768e+19  3.3554e+32
2  30  2.0589e+44   24300000  1.4349e+22  8.4729e+36
3  40  1.1529e+48  102400000  1.0737e+24  1.1259e+40
4  50  9.3132e+50  312500000  3.0518e+25  2.9802e+42
5  21  4.6407e+39    4084101  6.8122e+19  1.1363e+33
When relative precision is taken into account:
# ===================================================
import numpy as np
np.finfo(float).eps  # 2.2204460492503131e-16

# zero out everything below eps times the largest value in the data
df[df < df.max().max() * np.finfo(float).eps] = 0
df
   a           b  c  d           e
0  0  0.0000e+00  0  0  0.0000e+00
1  0  1.0737e+39  0  0  0.0000e+00
2  0  2.0589e+44  0  0  8.4729e+36
3  0  1.1529e+48  0  0  1.1259e+40
4  0  9.3132e+50  0  0  2.9802e+42
5  0  4.6407e+39  0  0  0.0000e+00
Here df.max().max() is about 9.3e+50, so the cutoff is roughly 2.1e+35, and relative to that there is no variation at all in y (the column a). That's why statsmodels returns 0 coefficients. As a reminder, it's good practice to normalize your data first before running a regression.
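To illustrate that advice, here is a minimal sketch of one possible normalization, using statsmodels (an assumption on my part; dividing each column by its maximum is just one simple scaling choice, and standardizing to zero mean and unit variance would work too):

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"a": [10.0, 20.0, 30.0, 40.0, 50.0, 21.0]})
for col, p in [("b", 30), ("c", 5), ("d", 15), ("e", 25)]:
    df[col] = df["a"] ** p

# Rescale each regressor by its own maximum so every column lies in (0, 1].
# Within-column variation survives, since division loses almost no precision.
X = df[["b", "c", "d", "e"]] / df[["b", "c", "d", "e"]].max()
X = sm.add_constant(X)

print(sm.OLS(df["a"], X).fit().params)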