It's a useful and common practice to append predicted values and residuals from running a regression onto a dataframe as distinct columns. I'm new to pandas, and I'm having trouble performing this very simple operation. I know I'm missing something obvious. There was a very similar question asked about a year-and-a-half ago, but it wasn't really answered.
The dataframe currently looks something like this:
y x1 x2
880.37 3.17 23
716.20 4.76 26
974.79 4.17 73
322.80 8.70 72
1054.25 11.45 16
All I want is to return a dataframe that has the predicted value and residual from the regression y = x1 + x2 for each observation:
y x1 x2 y_hat res
880.37 3.17 23 840.27 40.10
716.20 4.76 26 752.60 -36.40
974.79 4.17 73 877.49 97.30
322.80 8.70 72 348.50 -25.70
1054.25 11.45 16 815.15 239.10
I've tried resolving this using statsmodels and pandas and haven't been able to solve it. Thanks in advance!
Here is a variation on Alexander's answer using the OLS model from statsmodels instead of the pandas ols model. We can use either the formula or the array/DataFrame interface to the models.
fittedvalues and resid are pandas Series with the correct index.
predict does not return a pandas Series.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
'x2': [23, 26, 73, 72, 16],
'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
index=np.arange(10, 20, 2))
result = smf.ols('y ~ x1 + x2', df).fit()
df['yhat'] = result.fittedvalues
df['resid'] = result.resid
result2 = sm.OLS(df['y'], sm.add_constant(df[['x1', 'x2']])).fit()
df['yhat2'] = result2.fittedvalues
df['resid2'] = result2.resid
# predict doesn't return pandas series and no index is available
df['predicted'] = result.predict(df)
print(df)
x1 x2 y yhat resid yhat2 resid2 \
10 3.17 23 880.37 923.949309 -43.579309 923.949309 -43.579309
12 4.76 26 716.20 890.732201 -174.532201 890.732201 -174.532201
14 4.17 73 974.79 656.155079 318.634921 656.155079 318.634921
16 8.70 72 322.80 610.510952 -287.710952 610.510952 -287.710952
18 11.45 16 1054.25 867.062458 187.187542 867.062458 187.187542
predicted
10 923.949309
12 890.732201
14 656.155079
16 610.510952
18 867.062458
As a preview, there is an extended prediction method in the model results in statsmodels master (0.7), but the API is not yet settled:
>>> print(result.get_prediction().summary_frame())
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \
10 923.949309 268.931939 -233.171432 2081.070051 -991.466820
12 890.732201 211.945165 -21.194241 1802.658643 -887.328646
14 656.155079 269.136102 -501.844105 1814.154263 -1259.791854
16 610.510952 282.182030 -603.620329 1824.642233 -1339.874985
18 867.062458 329.017262 -548.584564 2282.709481 -1214.750941
obs_ci_upper
10 2839.365439
12 2668.793048
14 2572.102012
16 2560.896890
18 2948.875858
This should be self-explanatory.
import pandas as pd
df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
'x2': [23, 26, 73, 72, 16],
'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})
model = pd.ols(y=df.y, x=df.loc[:, ['x1', 'x2']])
df['y_hat'] = model.y_fitted
df['res'] = model.resid
>>> df
x1 x2 y y_hat res
0 3.17 23 880.37 923.949309 -43.579309
1 4.76 26 716.20 890.732201 -174.532201
2 4.17 73 974.79 656.155079 318.634921
3 8.70 72 322.80 610.510952 -287.710952
4 11.45 16 1054.25 867.062458 187.187542
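Note that pd.ols was removed from later pandas releases, so the snippet above only runs on old versions. As a rough sketch of the same calculation without it, the fit can be done with NumPy's least-squares solver. The explicit intercept column and the names X and beta are my own choices for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})

# Design matrix with an explicit intercept column (assumed, to mirror y ~ x1 + x2)
X = np.column_stack([np.ones(len(df)), df[['x1', 'x2']].to_numpy(dtype=float)])
y = df['y'].to_numpy(dtype=float)

# Ordinary least squares via NumPy; beta holds [intercept, coef_x1, coef_x2]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Append fitted values and residuals as new columns
df['y_hat'] = X @ beta
df['res'] = df['y'] - df['y_hat']
print(df)
```

The fitted values should agree with the y_hat column shown above, up to rounding.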
It's polite to write your question so that contributors can easily run your code.
import pandas as pd
y_col = [880.37, 716.20, 974.79, 322.80, 1054.25]
x1_col = [3.17, 4.76, 4.17, 8.70, 11.45]
x2_col = [23, 26, 73, 72, 16]
df = pd.DataFrame()
df['y'] = y_col
df['x1'] = x1_col
df['x2'] = x2_col
Then calling df.head() yields:
y x1 x2
0 880.37 3.17 23
1 716.20 4.76 26
2 974.79 4.17 73
3 322.80 8.70 72
4 1054.25 11.45 16
Now for your question: it's fairly straightforward to add columns with calculated values, though my results don't agree with your sample data:
df['y_hat'] = df['x1'] + df['x2']
df['res'] = df['y'] - df['y_hat']
For me, these yield:
y x1 x2 y_hat res
0 880.37 3.17 23 26.17 854.20
1 716.20 4.76 26 30.76 685.44
2 974.79 4.17 73 77.17 897.62
3 322.80 8.70 72 80.70 242.10
4 1054.25 11.45 16 27.45 1026.80
Hope this helps!
How do I preprocess this data containing a single feature with different scales? This will then be used for supervised machine learning classification.
Data
import pandas as pd
import numpy as np
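# Seed by calling the function; assigning `np.random.seed = 4` overwrites it instead.
# Note that default_rng() below draws from its own, unseeded generator.
np.random.seed(4)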
df_eur_jpy = pd.DataFrame({"value": np.random.default_rng().uniform(0.07, 3.85, 50)})
df_usd_cad = pd.DataFrame({"value": np.random.default_rng().uniform(0.0004, 0.02401, 50)})
df_usd_cad["ticker"] = "usd_cad"
df_eur_jpy["ticker"] = "eur_jpy"
df = pd.concat([df_eur_jpy,df_usd_cad],axis=0)
df.head(1)
value ticker
0 0.161666 eur_jpy
We can see the different tickers contain data with a different scale when looking at the max/min of this groupby:
df.groupby("ticker")["value"].agg(['min', 'max'])
min max
ticker
eur_jpy 0.079184 3.837519
usd_cad 0.000405 0.022673
I have many tickers in my real data and would like to combine all of them in one feature (pandas column) for use with a scikit-learn estimator for supervised machine learning classification.
If I understand correctly (IIUC), you can use the min-max scaling formula:
x_scaled = (x - x_min) / (x_max - x_min)
You can apply this formula to your dataframe with sklearn.preprocessing.MinMaxScaler, as below:
from sklearn.preprocessing import MinMaxScaler
df2 = df.pivot(columns='ticker', values='value')
# ticker eur_jpy usd_cad
# 0 3.204568 0.021455
# 1 1.144708 0.013810
# ...
# 48 1.906116 0.002058
# 49 1.136424 0.022451
df2[['min_max_scl_eur_jpy', 'min_max_scl_usd_cad']] = MinMaxScaler().fit_transform(df2[['eur_jpy', 'usd_cad']])
print(df2)
Output:
ticker eur_jpy usd_cad min_max_scl_eur_jpy min_max_scl_usd_cad
0 3.204568 0.021455 0.827982 0.896585
1 1.144708 0.013810 0.264398 0.567681
2 2.998154 0.004580 0.771507 0.170540
3 1.916517 0.003275 0.475567 0.114361
4 0.955089 0.009206 0.212517 0.369558
5 3.036463 0.019500 0.781988 0.812471
6 1.240505 0.006575 0.290608 0.256373
7 1.224260 0.020711 0.286163 0.864584
8 3.343022 0.020564 0.865864 0.858280
9 2.710383 0.023359 0.692771 0.978531
10 1.218328 0.008440 0.284540 0.336588
11 2.005472 0.022898 0.499906 0.958704
12 2.056680 0.016429 0.513916 0.680351
13 1.010388 0.005553 0.227647 0.212368
14 3.272408 0.000620 0.846543 0.000149
15 2.354457 0.018608 0.595389 0.774092
16 3.297936 0.017484 0.853528 0.725720
17 2.415297 0.009618 0.612035 0.387285
18 0.439263 0.000617 0.071386 0.000000
19 3.335262 0.005988 0.863740 0.231088
20 2.767412 0.013357 0.708375 0.548171
21 0.830678 0.013824 0.178478 0.568255
22 1.056041 0.007806 0.240138 0.309336
23 1.497400 0.023858 0.360896 1.000000
24 0.629698 0.014088 0.123489 0.579604
25 3.758559 0.020663 0.979556 0.862509
26 0.964214 0.010302 0.215014 0.416719
27 3.680324 0.023647 0.958150 0.990918
28 3.169445 0.017329 0.818372 0.719059
29 1.898905 0.017892 0.470749 0.743299
30 3.322663 0.020508 0.860293 0.855869
31 2.735855 0.010578 0.699741 0.428591
32 2.264645 0.017853 0.570816 0.741636
33 2.613166 0.021359 0.666173 0.892456
34 1.976168 0.001568 0.491888 0.040928
35 3.076169 0.013663 0.792852 0.561335
36 3.330470 0.013048 0.862429 0.534891
37 3.600527 0.012340 0.936318 0.504426
38 0.653994 0.008665 0.130137 0.346288
39 0.587896 0.013134 0.112052 0.538567
40 0.178353 0.011326 0.000000 0.460781
41 3.727127 0.016738 0.970956 0.693658
42 1.719622 0.010939 0.421696 0.444123
43 0.460177 0.021131 0.077108 0.882665
44 3.124722 0.010328 0.806136 0.417826
45 1.011988 0.007631 0.228085 0.301799
46 3.833281 0.003896 1.000000 0.141076
47 3.289872 0.017223 0.851322 0.714495
48 1.906116 0.002058 0.472721 0.062020
49 1.136424 0.022451 0.262131 0.939465
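If pivoting to one column per ticker is inconvenient (for example with many tickers of different lengths), a similar per-ticker scaling can be sketched in long format with groupby and transform. This is an alternative to the MinMaxScaler call above, not what that answer does; the column name value_scaled, the helper min_max, and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)  # seeded so the sketch is reproducible
df = pd.concat([
    pd.DataFrame({"value": rng.uniform(0.07, 3.85, 50), "ticker": "eur_jpy"}),
    pd.DataFrame({"value": rng.uniform(0.0004, 0.02401, 50), "ticker": "usd_cad"}),
], ignore_index=True)

def min_max(s):
    # (x - min) / (max - min), computed within a single ticker
    return (s - s.min()) / (s.max() - s.min())

# transform returns a result aligned with the original rows
df["value_scaled"] = df.groupby("ticker")["value"].transform(min_max)
print(df.groupby("ticker")["value_scaled"].agg(["min", "max"]))
```

Each ticker then spans 0 to 1 within a single column. For a real classifier, the min and max would normally be taken from the training split only.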
There is a question like this already, but I want some modifications and have tried a few methods without much luck.
I have data and want to add the R squared of a regression by group as a separate column in the pandas dataframe. The caveat is that I only want to run the regression on values that do not have extreme residuals within each group (i.e., within 1 standard deviation, or between -1 and 1 z-score).
Here is the SAMPLE data frame:
df = pd.DataFrame({'gp': [1,1,1,1,1,2,2,2,2,2],
'x1': [3.17, 4.76, 4.17, 8.70, 11.45, 3.17, 4.76, 4.17, 8.70, 1.45],
'x2': [23, 26, 73, 72, 16, 26, 73, 72, 16, 25],
'y': [880.37, 716.20, 974.79, 322.80, 1054.25, 980.37, 816.20, 1074.79, 522.80, 1254.25]},
index=np.arange(10, 30, 2))
The answer on another post, which works for me to get the residuals within each group, was this:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
regmodel = 'y ~ x1 + x2'
def groupreg(g):
g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
return g
df = df.groupby('gp').apply(groupreg)
print(df)
This is great because I now have a residual column which gives the residual of the linear regression within each group.
However, I now want to add another column with the R squared of the regression within each group, computed only from the points whose residual is within +1/-1 z-score within that group. So the goal is an R squared that strips out extreme outliers from the regression (this should improve on the normal R squared that uses all the data). Any help would be appreciated.
Edit**
FYI to add just a normal R squared the function would be this:
def groupreg(g):
g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
g['rsquared'] = sm.ols(formula=regmodel, data=g).fit().rsquared
return g
EDIT 2 **
Here is my code:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
df = pd.DataFrame({'gp': [1,1,1,1,1,2,2,2,2,2],
'x1': [3.17, 4.76, 4.17, 8.70, 11.45, 3.17, 4.76, 4.17, 8.70, 1.45],
'x2': [23, 26, 73, 72, 16, 26, 73, 72, 16, 25],
'y': [880.37, 716.20, 974.79, 322.80, 1054.25, 980.37, 816.20, 1074.79, 522.80, 1254.25]},
index=np.arange(10, 30, 2))
regmodel = 'y ~ x1 + x2'
def groupreg(g):
g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
return g
df = df.groupby('gp').apply(groupreg)
print(df)
df['z_score'] = df.groupby('gp')['residual'].apply(lambda x: (x - x.mean())/x.std())
Output:
gp x1 x2 y residual z_score
10 1 3.17 23 880.37 -43.579309 -0.173726
12 1 4.76 26 716.20 -174.532201 -0.695759
14 1 4.17 73 974.79 318.634921 1.270214
16 1 8.70 72 322.80 -287.710952 -1.146938
18 1 11.45 16 1054.25 187.187542 0.746209
20 2 3.17 26 980.37 -67.245089 -0.822329
22 2 4.76 73 816.20 -96.883281 -1.184770
24 2 4.17 72 1074.79 104.400010 1.276691
26 2 8.70 16 522.80 21.017543 0.257020
28 2 1.45 25 1254.25 38.710817 0.473388
So here I would like another column with the R squared per group, without using the points that have a z-score greater than 1 or less than -1 (e.g., it would not use index 14, 16, 22 and 24 in the per-group R squared calculation).
First, use your full definition of groupreg that assigns both the residual and rsquared columns:
def groupreg(g):
g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
g['rsquared'] = sm.ols(formula=regmodel, data=g).fit().rsquared
return g
Then, at the very end of your current code (after creating the z_score column), try the following to delete the rsquared entries in rows where -1 < z_score < 1:
df.loc[df['z_score'].abs() < 1, 'rsquared'] = np.nan
Output:
gp x1 x2 y residual rsquared z_score
10 1 3.17 23 880.37 -43.579309 NaN -0.173726
12 1 4.76 26 716.20 -174.532201 NaN -0.695759
14 1 4.17 73 974.79 318.634921 0.250573 1.270214
16 1 8.70 72 322.80 -287.710952 0.250573 -1.146938
18 1 11.45 16 1054.25 187.187542 NaN 0.746209
20 2 3.17 26 980.37 -67.245089 NaN -0.822329
22 2 4.76 73 816.20 -96.883281 0.912987 -1.184770
24 2 4.17 72 1074.79 104.400010 0.912987 1.276691
26 2 8.70 16 522.80 21.017543 NaN 0.257020
28 2 1.45 25 1254.25 38.710817 NaN 0.473388
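The snippet above blanks out the full-sample R squared on some rows rather than refitting. If the aim is an R squared refitted only on the rows with |z_score| < 1, one possible sketch is below. It uses NumPy least squares instead of statsmodels, and the helper ols_fit and the column name rsquared_trimmed are hypothetical names of my own. Be aware that with only five rows per group, trimming can leave as many rows as parameters, in which case the fit is exact and R squared is trivially 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gp': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'x1': [3.17, 4.76, 4.17, 8.70, 11.45, 3.17, 4.76, 4.17, 8.70, 1.45],
                   'x2': [23, 26, 73, 72, 16, 26, 73, 72, 16, 25],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25,
                         980.37, 816.20, 1074.79, 522.80, 1254.25]},
                  index=np.arange(10, 30, 2))

def ols_fit(frame, ycol, xcols):
    """Hypothetical helper: OLS with intercept, returns (residuals, R squared)."""
    X = np.column_stack([np.ones(len(frame)), frame[xcols].to_numpy(dtype=float)])
    y = frame[ycol].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_tot = ((y - y.mean()) ** 2).sum()
    return resid, 1.0 - (resid ** 2).sum() / ss_tot

parts = []
for _, g in df.groupby('gp'):
    g = g.copy()
    resid, _ = ols_fit(g, 'y', ['x1', 'x2'])
    g['residual'] = resid
    # z-score of the residuals within the group (sample std, as pandas .std() uses)
    g['z_score'] = (resid - resid.mean()) / resid.std(ddof=1)
    # Refit on the rows whose residual z-score lies strictly between -1 and 1
    kept = g[g['z_score'].abs() < 1]
    _, r2 = ols_fit(kept, 'y', ['x1', 'x2'])
    g['rsquared_trimmed'] = r2
    parts.append(g)

out = pd.concat(parts)
print(out)
```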
I'm working under Python 2.5 (I'm restricted to that version due to an external API) and would like to get the same results as the code below, which I wrote under Python 2.7:
import pandas as pd
df = pd.DataFrame({"lineId":[1,2,3,4], "idCaseMin": [10, 23, 40, 8], "min": [-110, -205, -80, -150], "idCaseMax": [5, 27, 15, 11], "max": [120, 150, 110, 90]})
df = df.set_index("lineId")
df["idMax"] = df["idCaseMax"].where(df["max"]>abs(df["min"]),df["idCaseMin"])
The DataFrame results in:
>>> df
idCaseMax idCaseMin max min idMax
lineId
1 5 10 120 -110 5
2 27 23 150 -205 23
3 15 40 110 -80 15
4 11 8 90 -150 8
The idMax column holds the id of whichever of the max and min columns has the greater absolute value.
I can't use the where function because it's not available in pandas 0.9.0 (the latest version available for Python 2.5) with numpy 1.7.1.
So, what options do I have to get the same results for the idMax column without using the pandas where function?
IIUC you can use numpy.where():
In [120]: df['idMax'] = \
np.where(df["max"]<=abs(df["min"]),
df["idCaseMin"],
df["idCaseMax"])
In [121]: df
Out[121]:
idCaseMax idCaseMin max min idMax
lineId
1 5 10 120 -110 5
2 27 23 150 -205 23
3 15 40 110 -80 15
4 11 8 90 -150 8
I'll try and provide an optimised solution for 0.9. IIUC ix should work here.
m = df["max"] > df["min"].abs()
i = df.ix[m, 'idCaseMax']
j = df.ix[~m, 'idCaseMin']
df['idMax'] = i.append(j)
df
idCaseMax idCaseMin max min idMax
lineId
1 5 10 120 -110 5
2 27 23 150 -205 23
3 15 40 110 -80 15
4 11 8 90 -150 8
Your pandas should have this...
df['idMax']=(df["max"]>abs(df["min"]))* df["idCaseMax"]+(df["max"]<=abs(df["min"]))* df["idCaseMin"]
df
Out[1388]:
idCaseMax idCaseMin max min idMax
lineId
1 5 10 120 -110 5
2 27 23 150 -205 23
3 15 40 110 -80 15
4 11 8 90 -150 8
We can use the apply function, as in the code below, to get the same results:
df["idMax"] = df.apply(lambda row: row["idCaseMax"] if row["max"]>abs(row["min"]) else row["idCaseMin"], axis = 1)
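If neither where nor apply is convenient, another option is a plain list comprehension over the four columns, which leans on as little pandas API as possible. This is only a sketch: I have not run it on pandas 0.9.0, and the loop variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({"lineId": [1, 2, 3, 4],
                   "idCaseMin": [10, 23, 40, 8], "min": [-110, -205, -80, -150],
                   "idCaseMax": [5, 27, 15, 11], "max": [120, 150, 110, 90]})
df = df.set_index("lineId")

# Walk the four columns in lockstep and keep the id whose value
# has the larger absolute magnitude (ties go to idCaseMin, as in the question)
df["idMax"] = [
    id_max if mx > abs(mn) else id_min
    for id_max, mx, id_min, mn in zip(df["idCaseMax"], df["max"],
                                      df["idCaseMin"], df["min"])
]
print(df["idMax"].tolist())
```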
I am trying to figure out how to combine daily dates into months, summing the data for each day that falls within a given month.
Note: I have a huge list of daily dates, but I put a small sample here to simplify the example.
File name: (test.xlsx)
For example, (sheet1) contains the following, shown as a dataframe:
DATE 51 52 53 54 55 56
0 20110706 28.52 27.52 26.52 25.52 24.52 23.52
1 20110707 28.97 27.97 26.97 25.97 24.97 23.97
2 20110708 28.52 27.52 26.52 25.52 24.52 23.52
3 20110709 28.97 27.97 26.97 25.97 24.97 23.97
4 20110710 30.5 29.5 28.5 27.5 26.5 25.5
5 20110711 32.93 31.93 30.93 29.93 28.93 27.93
6 20110712 35.54 34.54 33.54 32.54 31.54 30.54
7 20110713 33.02 32.02 31.02 30.02 29.02 28.02
8 20110730 35.99 34.99 33.99 32.99 31.99 30.99
9 20110731 30.5 29.5 28.5 27.5 26.5 25.5
10 20110801 32.48 31.48 30.48 29.48 28.48 27.48
11 20110802 31.04 30.04 29.04 28.04 27.04 26.04
12 20110803 32.03 31.03 30.03 29.03 28.03 27.03
13 20110804 34.01 33.01 32.01 31.01 30.01 29.01
14 20110805 27.44 26.44 25.44 24.44 23.44 22.44
15 20110806 32.48 31.48 30.48 29.48 28.48 27.48
What I would like is to edit ("test.xlsx",'sheet1') to result in what is below:
DATE 51 52 53 54 55 56
0 201107 313.46 303.46 293.46 283.46 273.46 263.46
1 201108 189.48 183.48 177.48 171.48 165.48 159.48
How would I go about implementing this?
Here is my code thus far:
import pandas as pd
from pandas import ExcelWriter
df = pd.read_excel('thecddhddtestquecdd.xlsx')
def sep_yearmonths(x):
x['month'] = str(x['DATE'])[:-2]
return x
df = df.apply(sep_yearmonths,axis=1)
df.groupby('month').sum()
writer = ExcelWriter('thecddhddtestquecddMERGE.xlsx')
df.to_excel(writer,'Sheet1',index=False)
writer.save()
This will work if 'DATE' is a column of strings and not your index.
Example dataframe - shortened for clarity:
df = pd.DataFrame({'DATE': {0: '20110706', 1:'20110707', 2: '20110801'},
52: {0: 28.52, 1: 28.97, 2: 28.52},
55: { 0: 24.52, 1: 24.97, 2:24.52 }
})
Which yields:
52 55 DATE
0 28.52 24.52 20110706
1 28.97 24.97 20110707
2 28.52 24.52 20110801
Apply the following function over the dataframe to generate a new column:
def sep_yearmonths(x):
x['month'] = x['DATE'][:-2]
return x
Like this:
df = df.apply(sep_yearmonths,axis=1)
Over which you can then groupby and sum:
df.groupby('month').sum()
Resulting in the following:
52 55
month
201107 57.49 49.49
201108 28.52 24.52
If 'DATE' is your index, simply call reset_index first. If it's not a column of string values, then you need to convert it to strings beforehand.
Finally, you can rename your 'month' column to 'DATE'. I suppose you could just overwrite the 'DATE' column in place, but I chose to do things explicitly. You can do that like so:
df['DATE'] = df['DATE'].apply(lambda x: x[:-2])
Then 'groupby' 'DATE' instead of month.
Use resample
import pandas as pd
myTable=pd.read_excel('test.xlsx')
myTable['DATE']=pd.to_datetime(myTable['DATE'], format="%Y%m%d")
myTable=myTable.set_index('DATE')
myTable.resample("M").sum()
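resample labels each row with a month-end timestamp rather than the YYYYMM value asked for, and newer pandas releases prefer the alias "ME" over "M". As a sketch that sidesteps both, the rows can be grouped on a YYYYMM string derived from the parsed date. The small frame below stands in for the Excel sheet, and the name monthly is my own.

```python
import pandas as pd

# Stand-in for pd.read_excel('test.xlsx'); DATE is stored as an integer like 20110706
df = pd.DataFrame({'DATE': [20110706, 20110707, 20110801],
                   52: [28.52, 28.97, 28.52],
                   55: [24.52, 24.97, 24.52]})

# Parse the dates, then build the YYYYMM key to group on
dates = pd.to_datetime(df['DATE'].astype(str), format='%Y%m%d')
month_key = dates.dt.strftime('%Y%m')

# Sum every value column within each month and restore DATE as a regular column
monthly = df.drop(columns='DATE').groupby(month_key).sum()
monthly.index.name = 'DATE'
monthly = monthly.reset_index()
print(monthly)
```

The result can then be written back with to_excel as in the question.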
I have a list like this
HEIGHT DATE TIME ANGL FC COL ROW
3.76 20120127 18 27 52 291.9 1 399.0 311.0
5.46 20120127 18 38 43 293.5 1 462.0 343.0
6.31 20120127 18 43 18 292.8 1 311.0 288.0
8.49 20120127 18 54 05 290.7 1 330.0 293.0
11.08 20120127 19 06 05 293.1 1 350.0 305.0
13.47 20120127 19 18 05 296.1 1 367.0 319.0
16.09 20120127 19 30 06 297.8 1 386.0 333.0
18.47 20120127 19 42 05 299.0 1 403.0 346.0
21.73 20120127 19 54 06 300.4 1 426.0 364.0
23.40 20120127 20 06 05 301.8 1 436.0 376.0
28.33 20120127 20 18 05 302.7 1 471.0 402.0
I want to combine the DATE and TIME columns into a single time variable and then plot it against any of the other columns.
I tried to use datetime, but I don't get anything:
import datetime as dt
data=loadtxt('CME27.txt', skiprows=1)
col=data[:,7]
row=data[:,8]
h=data[:,2]
m=data[:,3]
s=data[:,4]
t=dt.time(h,m,s)
I got an error!
I'd like to plot:
plot(t,col)
Thanks
I don't think you can plot datetime.time objects directly using matplotlib. You can plot datetime.datetime objects, however. Given the NumPy array data, you'll have to use a Python loop to parse the floats into datetime.datetime objects.
You could do that like this:
import numpy as np
import datetime as DT
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
data = np.loadtxt('CME27.txt', skiprows=1)
col, row = data[:, 7:9].T
dates = []
for date, h, m, s in data[:,1:5]:
dates.append(
DT.datetime.strptime('{date:.0f} {h:.0f} {m:.0f} {s:.0f}'.format(**locals()),
'%Y%m%d %H %M %S'))
fig, ax = plt.subplots()
ax.plot(dates, col)
plt.xticks(rotation=25)
xfmt = mdates.DateFormatter('%H:%M:%S')
ax.xaxis.set_major_formatter(xfmt)
plt.show()
If you install pandas, then the above could be simplified to
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_table('CME27.txt', sep=r'\s+', skiprows=1, header=None,
parse_dates={'date':[1,2,3,4]})
df.columns = 'date height angle fc col row'.split()
df.plot('date', 'col')
plt.show()
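Passing a dict to parse_dates may not be accepted by recent pandas versions, so here is a hedged sketch that builds the timestamp explicitly instead. The two inline rows stand in for CME27.txt, and the column names (h, m, s, timestamp and so on) are assumptions of mine rather than names from the file.

```python
import io
import pandas as pd

# Two rows copied from the question, standing in for the real file
raw = """HEIGHT DATE TIME ANGL FC COL ROW
3.76 20120127 18 27 52 291.9 1 399.0 311.0
5.46 20120127 18 38 43 293.5 1 462.0 343.0
"""

names = ['height', 'date', 'h', 'm', 's', 'angle', 'fc', 'col', 'row']
df = pd.read_csv(io.StringIO(raw), sep=r'\s+', skiprows=1, header=None,
                 names=names,
                 # read the date and time parts as text so leading zeros survive
                 dtype={'date': str, 'h': str, 'm': str, 's': str})

# Join the four text columns and parse them in one pass
df['timestamp'] = pd.to_datetime(
    df['date'] + ' ' + df['h'] + ' ' + df['m'] + ' ' + df['s'],
    format='%Y%m%d %H %M %S')
print(df[['timestamp', 'col']])

# With matplotlib installed, the plot would then be:
# df.plot(x='timestamp', y='col')
```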