Appending predicted residuals and R-squared to a pandas DataFrame, by group, in Python

There is already a similar question, but I want some modifications and have tried a few methods without much luck.
I have data and want to add the R-squared of a regression by group as a separate column in the pandas DataFrame. The caveat is that I only want to run the regression on values which do not have extreme residuals within each group (i.e., within 1 standard deviation, or a z-score between -1 and 1).
Here is the SAMPLE data frame:
df = pd.DataFrame({'gp': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'x1': [3.17, 4.76, 4.17, 8.70, 11.45, 3.17, 4.76, 4.17, 8.70, 1.45],
                   'x2': [23, 26, 73, 72, 16, 26, 73, 72, 16, 25],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25,
                         980.37, 816.20, 1074.79, 522.80, 1254.25]},
                  index=np.arange(10, 30, 2))
The answer on another post works for me to get the residuals within each group. This was the solution:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

regmodel = 'y ~ x1 + x2'

def groupreg(g):
    g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
    return g

df = df.groupby('gp').apply(groupreg)
print(df)
This is great because I now have a residual column giving the residual of the linear regression within each group.
However, I now want to add another column with the R-squared of the regression within each group, computed only for the points whose residual z-score is between -1 and +1 within the group. So the goal is to add an R-squared that strips out extreme outliers from the regression (this should improve on the R-squared computed from all the data). Any help would be appreciated.
Edit:
FYI, to add just a normal R-squared, the function would be this:
def groupreg(g):
    fit = sm.ols(formula=regmodel, data=g).fit()  # fit once, reuse for both columns
    g['residual'] = fit.resid
    g['rsquared'] = fit.rsquared
    return g
Edit 2:
Here is my code:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

df = pd.DataFrame({'gp': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'x1': [3.17, 4.76, 4.17, 8.70, 11.45, 3.17, 4.76, 4.17, 8.70, 1.45],
                   'x2': [23, 26, 73, 72, 16, 26, 73, 72, 16, 25],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25,
                         980.37, 816.20, 1074.79, 522.80, 1254.25]},
                  index=np.arange(10, 30, 2))

regmodel = 'y ~ x1 + x2'

def groupreg(g):
    g['residual'] = sm.ols(formula=regmodel, data=g).fit().resid
    return g

df = df.groupby('gp').apply(groupreg)
print(df)
df['z_score'] = df.groupby('gp')['residual'].apply(lambda x: (x - x.mean()) / x.std())
Output:
    gp     x1  x2        y    residual   z_score
10   1   3.17  23   880.37  -43.579309 -0.173726
12   1   4.76  26   716.20 -174.532201 -0.695759
14   1   4.17  73   974.79  318.634921  1.270214
16   1   8.70  72   322.80 -287.710952 -1.146938
18   1  11.45  16  1054.25  187.187542  0.746209
20   2   3.17  26   980.37  -67.245089 -0.822329
22   2   4.76  73   816.20  -96.883281 -1.184770
24   2   4.17  72  1074.79  104.400010  1.276691
26   2   8.70  16   522.80   21.017543  0.257020
28   2   1.45  25  1254.25   38.710817  0.473388
So here I would like another column with an R-squared per group, computed without the points whose z-score is greater than 1 or less than -1 (e.g., indices 14, 16, 22 and 24 would not be used in the group-by R-squared calculation).

First, use your full definition of groupreg that assigns both resid and rsquared columns:
def groupreg(g):
    fit = sm.ols(formula=regmodel, data=g).fit()
    g['residual'] = fit.resid
    g['rsquared'] = fit.rsquared
    return g
Then, at the very end of your current code (after creating the z_score column), try the following to delete the rsquared entries in rows where -1 < z_score < 1:
df.loc[df['z_score'].abs() < 1, 'rsquared'] = np.nan
Output:
    gp     x1  x2        y    residual  rsquared   z_score
10   1   3.17  23   880.37  -43.579309       NaN -0.173726
12   1   4.76  26   716.20 -174.532201       NaN -0.695759
14   1   4.17  73   974.79  318.634921  0.250573  1.270214
16   1   8.70  72   322.80 -287.710952  0.250573 -1.146938
18   1  11.45  16  1054.25  187.187542       NaN  0.746209
20   2   3.17  26   980.37  -67.245089       NaN -0.822329
22   2   4.76  73   816.20  -96.883281  0.912987 -1.184770
24   2   4.17  72  1074.79  104.400010  0.912987  1.276691
26   2   8.70  16   522.80   21.017543       NaN  0.257020
28   2   1.45  25  1254.25   38.710817       NaN  0.473388
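For the original goal (an R-squared computed with the outliers excluded), one possible approach is to refit each group using only the rows whose residual z-score lies within [-1, 1] and broadcast that group's R-squared back to every row. This is a minimal sketch, not part of the answer above, and filtered_rsquared is a hypothetical helper name:
def filtered_rsquared(g):
    # Keep only the rows whose residual z-score is within [-1, 1],
    # then refit and return that group's R-squared.
    kept = g[g['z_score'].abs() <= 1]
    return sm.ols(formula=regmodel, data=kept).fit().rsquared

# Broadcast each group's filtered R-squared to all rows of the group.
df['rsquared_filtered'] = df['gp'].map(df.groupby('gp').apply(filtered_rsquared))
One caveat: on this small sample, dropping two points leaves each group with three observations for a model with three parameters, so the refit is saturated and its R-squared is trivially 1.0; the approach only becomes meaningful with more data per group.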

Related

Filtering one column based on values in two other columns

I've got an upper boundary and a lower boundary based on a predicted value and I want to filter out the data that do not fall between the upper and lower boundaries.
My data frame looks like this:
weight KG  Upper Boundary  Lower Boundary
     23.2              30              20
     55.2              40              30
     44.2              50              40
     47.8              50              40
     38.7              30              20
and I'd like it to look like this:
weight KG  Upper Boundary  Lower Boundary
     23.2              30              20
     44.2              50              40
     47.8              50              40
I have tried this but it does not filter properly.
df2 = df1[(df1['weight_KG'] <= df1["UpperBoundary"]) & (df1['weight_KG'] >= df1["LowerBoundary"])]
Your code works just fine (note it was missing a closing parenthesis before the final bracket). If it doesn't do the job, it might be a version- or platform-related issue.
My environment is the following:
MacBook M1 chip, Ventura
Python 3.9.14
Pandas 1.5.2
The code is the following:
import pandas as pd

# Build the DataFrame ("data" avoids shadowing the built-in dict)
weight_KG = [23.2, 55.2, 44.2, 47.8, 38.7, 0]
UpperBoundary = [30, 40, 50, 50, 30, 20]
LowerBoundary = [20, 30, 40, 40, 20, 10]
data = {
    "weight_KG": weight_KG,
    "UpperBoundary": UpperBoundary,
    "LowerBoundary": LowerBoundary,
}
df1 = pd.DataFrame(data)
df2 = df1[
    (df1["weight_KG"] <= df1["UpperBoundary"])
    & (df1["weight_KG"] >= df1["LowerBoundary"])
]
print(df2)
print(pd.__version__)
Output is the following:
   weight_KG  UpperBoundary  LowerBoundary
0       23.2             30             20
2       44.2             50             40
3       47.8             50             40
1.5.2
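As a side note (my addition, not part of the answer above), Series.between expresses the same filter more compactly and is equivalent to the pair of comparisons:
# Equivalent to (weight_KG <= UpperBoundary) & (weight_KG >= LowerBoundary)
df2 = df1[df1["weight_KG"].between(df1["LowerBoundary"], df1["UpperBoundary"])]
By default, between is inclusive on both ends, matching the <= and >= above.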

Inverse Score in Python

I have a dataframe as follows,
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
Expected output:
value  score
   54  scaled value
   74  scaled value
   71  scaled value
   78  50.00
   12  600.00
I want to assign a score between 50 and 600 to each value, but the lowest value must get the highest score. Do you have an idea?
Not sure what you want to achieve; maybe you could provide the exact expected output for this input.
But if I understand correctly, you could try:
import pandas as pd

df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
vmin = df['value'].min()
vmax = df['value'].max()
step = 550 / (vmax - vmin)  # 550 = 600 - 50, the width of the target range
df['score'] = 600 - (df['value'] - vmin) * step
print(df)
This will output:
   value       score
0     54  250.000000
1     74   83.333333
2     71  108.333333
3     78   50.000000
4     12  600.000000
This is my idea, but I think your scores have a scale that is missing from your question.
import numpy as np

dfmin = df['value'].min()
dfmax = df['value'].max()
dfrange = dfmax - dfmin
score_value = (600 - 50) / dfrange
df.loc[:, 'score'] = np.where(df['value'] == dfmin, 600,
                              np.where(df['value'] == dfmax, 50,
                                       600 - (df['value'] - dfmin) * (1 / score_value)))
df
that produces:
   value   score
0     54  594.96
1     74  592.56
2     71  592.92
3     78   50.00
4     12  600.00
This does not match your output because of the missing scale.
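For what it's worth, both answers implement the same idea: a linear map sending the minimum value to the top score and the maximum to the bottom score. A small generic helper makes the formula explicit (my own sketch; inverse_scale is a hypothetical name):
def inverse_scale(s, lo=50, hi=600):
    # Linear map: s.min() -> hi, s.max() -> lo, values in between scaled linearly.
    return hi - (s - s.min()) * (hi - lo) / (s.max() - s.min())

df['score'] = inverse_scale(df['value'])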

How to conditionally select column based on other columns under pandas DataFrame without using where function?

I'm working under Python 2.5 (I'm restricted to that version due to an external API) and would like to get the same results as the code below, which I wrote under Python 2.7:
import pandas as pd

df = pd.DataFrame({"lineId": [1, 2, 3, 4],
                   "idCaseMin": [10, 23, 40, 8],
                   "min": [-110, -205, -80, -150],
                   "idCaseMax": [5, 27, 15, 11],
                   "max": [120, 150, 110, 90]})
df = df.set_index("lineId")
df["idMax"] = df["idCaseMax"].where(df["max"] > abs(df["min"]), df["idCaseMin"])
The DataFrame results in:
>>> df
        idCaseMax  idCaseMin  max  min  idMax
lineId
1               5         10  120 -110      5
2              27         23  150 -205     23
3              15         40  110  -80     15
4              11          8   90 -150      8
The idMax column is defined based on the id whose value is the greatest in absolute terms between the max and min columns.
I can't use the where function, as it's not available under pandas 0.9.0 (the latest version available for Python 2.5) with numpy 1.7.1.
So, which options do I have to get the same results for the idMax column without using the pandas where function?
IIUC you can use numpy.where():
In [120]: df['idMax'] = np.where(df["max"] <= abs(df["min"]),
                                 df["idCaseMin"],
                                 df["idCaseMax"])
In [121]: df
Out[121]:
        idCaseMax  idCaseMin  max  min  idMax
lineId
1               5         10  120 -110      5
2              27         23  150 -205     23
3              15         40  110  -80     15
4              11          8   90 -150      8
I'll try and provide an optimised solution for 0.9. IIUC ix should work here.
m = df["max"] > df["min"].abs()
i = df.ix[m, 'idCaseMax']
j = df.ix[~m, 'idCaseMin']
df['idMax'] = i.append(j)
df
        idCaseMax  idCaseMin  max  min  idMax
lineId
1               5         10  120 -110      5
2              27         23  150 -205     23
3              15         40  110  -80     15
4              11          8   90 -150      8
Your pandas should have this...
df['idMax'] = ((df["max"] > abs(df["min"])) * df["idCaseMax"]
               + (df["max"] <= abs(df["min"])) * df["idCaseMin"])
df
Out[1388]:
        idCaseMax  idCaseMin  max  min  idMax
lineId
1               5         10  120 -110      5
2              27         23  150 -205     23
3              15         40  110  -80     15
4              11          8   90 -150      8
We can use the apply function, as in the code below, to achieve the same results:
df["idMax"] = df.apply(lambda row: row["idCaseMax"] if row["max"]>abs(row["min"]) else row["idCaseMin"], axis = 1)

Pandas DataFrame: Complex linear interpolation

I have a dataframe with 4 sections
Section 1: Product details
Section 2: 6 Potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
>>> table2
   Lower_Bound Product Type  State_1_Value  State_2_Value  State_3_Value  \
0         -1.0            A             10             20             30
1          1.0            B             11             21             31
2          0.5            C             12             22             32
3          5.0            D             13             23             33

   State_4_Value  State_5_Value  State_6_Value  Upper_Bound  sim_1  sim_2
0             40             50             60        1.000   0.00    1.0
1             41             51             61        2.000   0.00    1.5
2             42             52             62        0.625   0.61    0.7
3             43             53             63       15.000   7.00    9.0
I will run through a couple examples of this calculation to make it clear what my question is.
Product A - sim_2
The input here is 1.0. This is equal to the upper bound for this product, therefore the simulation value is equivalent to the state_6 value: 60.
Product B - sim_2
The input here is 1.5. The LB-to-UB range is (1, 2), therefore the 6 states are {1, 1.2, 1.4, 1.6, 1.8, 2}. 1.5 is exactly in the middle of state_3 (value 31) and state_4 (value 41), therefore the simulation value is 36.
Product C - sim_1
The input here is 0.61. The LB-to-UB range is (0.5, 0.625), therefore the 6 states are {0.5, 0.525, 0.55, 0.575, 0.6, 0.625}. 0.61 is between states 5 and 6. Specifically, the bucket it falls under is 5*(0.61-0.5)/(0.625-0.5) + 1 = 5.4 (it is multiplied by 5 as that is the number of intervals; you can calculate it other ways and get the same result). Then to calculate the value we use that bucket in a weighting of the values for state 5 and state 6: (62-52)*(5.4-5) + 52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1, therefore we need to extrapolate the value. We use the same formula as above; we just use the values of state 1 and state 2 to extrapolate. The bucket would be 5*(0-1)/(2-1) + 1 = -4. The two values used are 11 and 21, so the value is (21-11)*(-4-1) + 11 = -39.
I've also simplified the problem to make the solution easier to visualize; my final code needs to run on 500 values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation, although I'm not committed to them specifically:
Bucket = N * (sim_value - LB) / (UB - LB) + 1
where N is the number of intervals.
Then nLower is the state value directly below the bucket, and nHigher is the state value directly above it. If the bucket is outside the LB/UB range, force nLower and nHigher to be either the first two or the last two values.
Final_value = (nHigher - nLower) * (Bucket - bucket_number_of_nLower) + nLower
To summarize, my question is how I can generate the final results based on the combination of input data provided. The most challenging part to me is how to make the connection from the Bucket number to the nLower and nHigher values.
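As a quick sanity check of these formulas, here is the Product C / sim_1 example worked through in plain Python (my own verification, using the numbers from the question):
LB, UB, N = 0.5, 0.625, 5        # Product C bounds, 5 intervals
sim_value = 0.61
bucket = N * (sim_value - LB) / (UB - LB) + 1   # ~5.4
n_lower, n_higher = 52, 62                      # State_5_Value, State_6_Value
final = (n_higher - n_lower) * (bucket - 5) + n_lower
print(bucket, final)                            # ~5.4, ~56.0 as in the example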
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so still interested in better answers or improvements.
Edit: Ran this code on the full dataset, 141 rows, 500 intervals, 10,000 simulations, and it took slightly over 1.5 hours. So not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    table2['Bucket%s' % i] = (5 * (table2['sim_%s' % i] - table2['Lower_Bound'])
                              / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1)
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = ((table2['nHigher'] - table2['nLower'])
                                    * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower'])
Output:
>>> table2
   Lower_Bound Product Type  State_1_Value  State_2_Value  State_3_Value  \
0         -1.0            A             10             20             30
1          1.0            B             11             21             31
2          0.5            C             12             22             32
3          5.0            D             13             23             33

   State_4_Value  State_5_Value  State_6_Value  Upper_Bound  sim_1  sim_2  \
0             40             50             60        1.000   0.00    1.0
1             41             51             61        2.000   0.00    1.5
2             42             52             62        0.625   0.61    0.7
3             43             53             63       15.000   7.00    9.0

   Bucket1  lv  hv  nLower  nHigher  Final_value_1  Bucket2  Final_value_2
0      3.5   5   6      50       60           35.0      6.0           60.0
1     -4.0   3   4      31       41          -39.0      3.5           36.0
2      5.4   5   6      52       62           56.0      9.0           92.0
3      2.0   3   4      33       43           23.0      3.0           33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
# Fractional state position of each simulation, from 1 to 6 inside the bounds
buckets = (df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0)
           .div(df['Upper_Bound'].sub(df['Lower_Bound']), axis=0) * 5 + 1)
low = buckets.astype(int).clip(lower=1, upper=5)
high = (buckets.astype(int) + 1).clip(lower=2, upper=6)
state_values = df.filter(regex="State|Type").values
low_value = pd.DataFrame(state_values[np.arange(low.shape[0])[:, None], low])
high_value = pd.DataFrame(state_values[np.arange(high.shape[0])[:, None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
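A different take (my own sketch, not from the linked answer): because np.interp clamps outside the sample range, per-row interpolation needs a manual extrapolation branch for the end segments. interp_row and state_cols are names introduced here, and the 6 equally spaced states match the question's setup:
import numpy as np

state_cols = ['State_%d_Value' % i for i in range(1, 7)]

def interp_row(row, sim_col):
    # x-grid: 6 equally spaced points between the row's bounds
    xs = np.linspace(row['Lower_Bound'], row['Upper_Bound'], 6)
    ys = row[state_cols].astype(float).values
    x = row[sim_col]
    if x < xs[0]:    # extrapolate below using the first segment's slope
        return ys[0] + (ys[1] - ys[0]) / (xs[1] - xs[0]) * (x - xs[0])
    if x > xs[-1]:   # extrapolate above using the last segment's slope
        return ys[-1] + (ys[-1] - ys[-2]) / (xs[-1] - xs[-2]) * (x - xs[-1])
    return np.interp(x, xs, ys)  # interior points: plain linear interpolation

for i in (1, 2):
    df['Final_value_%d' % i] = df.apply(interp_row, axis=1, sim_col='sim_%d' % i)
This is a row-wise apply, so it will not beat the fully vectorized version above on 10,000 simulations, but it is easier to verify against the worked examples (for instance, it reproduces -39 for Product B / sim_1).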

Appending predicted values and residuals to pandas dataframe

It's a useful and common practice to append predicted values and residuals from running a regression onto a dataframe as distinct columns. I'm new to pandas, and I'm having trouble performing this very simple operation. I know I'm missing something obvious. There was a very similar question asked about a year-and-a-half ago, but it wasn't really answered.
The dataframe currently looks something like this:
      y     x1  x2
 880.37   3.17  23
 716.20   4.76  26
 974.79   4.17  73
 322.80   8.70  72
1054.25  11.45  16
And all I want is to return a dataframe that has the predicted value and residual from y = x1 + x2 for each observation:
      y     x1  x2   y_hat     res
 880.37   3.17  23  840.27   40.10
 716.20   4.76  26  752.60  -36.40
 974.79   4.17  73  877.49   97.30
 322.80   8.70  72  348.50  -25.70
1054.25  11.45  16  815.15  239.10
I've tried resolving this using statsmodels and pandas and haven't been able to solve it. Thanks in advance!
Here is a variation on Alexander's answer using the OLS model from statsmodels instead of the pandas ols model. We can use either the formula or the array/DataFrame interface to the models.
fittedvalues and resid are pandas Series with the correct index.
predict does not return a pandas Series.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                  index=np.arange(10, 20, 2))

result = smf.ols('y ~ x1 + x2', df).fit()
df['yhat'] = result.fittedvalues
df['resid'] = result.resid

result2 = sm.OLS(df['y'], sm.add_constant(df[['x1', 'x2']])).fit()
df['yhat2'] = result2.fittedvalues
df['resid2'] = result2.resid

# predict doesn't return a pandas Series and no index is available
df['predicted'] = result.predict(df)
print(df)
       x1  x2        y        yhat       resid       yhat2      resid2  \
10   3.17  23   880.37  923.949309  -43.579309  923.949309  -43.579309
12   4.76  26   716.20  890.732201 -174.532201  890.732201 -174.532201
14   4.17  73   974.79  656.155079  318.634921  656.155079  318.634921
16   8.70  72   322.80  610.510952 -287.710952  610.510952 -287.710952
18  11.45  16  1054.25  867.062458  187.187542  867.062458  187.187542

     predicted
10  923.949309
12  890.732201
14  656.155079
16  610.510952
18  867.062458
As a preview, there is an extended prediction method on the model results in statsmodels master (0.7), but the API is not yet settled:
>>> print(result.get_prediction().summary_frame())
          mean     mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  \
10  923.949309  268.931939    -233.171432    2081.070051   -991.466820
12  890.732201  211.945165     -21.194241    1802.658643   -887.328646
14  656.155079  269.136102    -501.844105    1814.154263  -1259.791854
16  610.510952  282.182030    -603.620329    1824.642233  -1339.874985
18  867.062458  329.017262    -548.584564    2282.709481  -1214.750941

    obs_ci_upper
10   2839.365439
12   2668.793048
14   2572.102012
16   2560.896890
18   2948.875858
This should be self-explanatory.
import pandas as pd

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})
model = pd.ols(y=df.y, x=df.loc[:, ['x1', 'x2']])
df['y_hat'] = model.y_fitted
df['res'] = model.resid
>>> df
      x1  x2        y       y_hat         res
0   3.17  23   880.37  923.949309  -43.579309
1   4.76  26   716.20  890.732201 -174.532201
2   4.17  73   974.79  656.155079  318.634921
3   8.70  72   322.80  610.510952 -287.710952
4  11.45  16  1054.25  867.062458  187.187542
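A caveat from today's perspective: pd.ols was deprecated and then removed from pandas (in 0.20), so the snippet above only runs on old versions. A rough modern equivalent of the same idea with scikit-learn might look like this sketch (my addition, not the original author's code):
from sklearn.linear_model import LinearRegression

X = df[['x1', 'x2']]
model = LinearRegression().fit(X, df['y'])  # ordinary least squares with intercept
df['y_hat'] = model.predict(X)
df['res'] = df['y'] - df['y_hat']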
So, it's polite to form your questions such that it's easy for contributors to run your code.
import pandas as pd
y_col = [880.37, 716.20, 974.79, 322.80, 1054.25]
x1_col = [3.17, 4.76, 4.17, 8.70, 11.45]
x2_col = [23, 26, 73, 72, 16]
df = pd.DataFrame()
df['y'] = y_col
df['x1'] = x1_col
df['x2'] = x2_col
Then calling df.head() yields:
         y     x1  x2
0   880.37   3.17  23
1   716.20   4.76  26
2   974.79   4.17  73
3   322.80   8.70  72
4  1054.25  11.45  16
Now for your question, it's fairly straightforward to add columns with calculated values, though I don't agree with your sample data:
df['y_hat'] = df['x1'] + df['x2']
df['res'] = df['y'] - df['y_hat']
For me, these yield:
         y     x1  x2  y_hat      res
0   880.37   3.17  23  26.17   854.20
1   716.20   4.76  26  30.76   685.44
2   974.79   4.17  73  77.17   897.62
3   322.80   8.70  72  80.70   242.10
4  1054.25  11.45  16  27.45  1026.80
Hope this helps!
