Calculate %-deviation with values from a pandas DataFrame - python

I am fairly new to python and I have the following dataframe:

   setting_id  subject_id  seconds  result_id  owner_id  average  duration_id
0           7           1        0     1680.5       2.0   24.000          1.0
1           7           1     3600     1690.5       2.0   46.000          2.0
2           7           1    10800     1700.5       2.0  101.000          4.0
3           7           2        0     1682.5       2.0   12.500          1.0
4           7           2     3600     1692.5       2.0   33.500          2.0
5           7           2    10800     1702.5       2.0   86.500          4.0
6           7           3        0     1684.5       2.0    8.500          1.0
7           7           3     3600     1694.5       2.0   15.000          2.0
8           7           3    10800     1704.5       2.0   34.000          4.0
What I need to do is calculate the deviation (%) of each average with a "seconds" value not equal to 0 from the average with a seconds value of zero, where the subject_id and setting_id are the same.
For example, setting_id == 7 & subject_id == 1 would be:
(result / baseline) * 100
------> for 3600 seconds: (46 / 24) * 100 = 192%
------> for 10800 seconds: (101 / 24) * 100 = 421%
.... baseline = the average with a seconds value of 0
.... result = an average with a seconds value other than 0
The resulting df should look like this:

   setting_id  subject_id  seconds  owner_id  average  deviation  duration_id
0           7           1        0         2       24          0            1
1           7           1     3600         2       46        192            2
2           7           1    10800         2      101        421            4
I want to use these calculations to plot a regression graph (with seaborn) of the deviations from baseline.
I have played around with this df for 2 days now and tried different for loops, but I just can't figure out the correct way.

You can use:
# identify rows with 0
m = df['seconds'].eq(0)

# compute the sum of rows with 0
s = (df['average'].where(m)
       .groupby([df['setting_id'], df['subject_id']])
       .sum()
     )

# compute the deviation per group
deviation = (
    df[['setting_id', 'subject_id']]
      .merge(s, left_on=['setting_id', 'subject_id'], right_index=True, how='left')
      ['average']
      .rdiv(df['average']).mul(100)
      .round().astype(int)  # optional
      .mask(m, 0)
)

df['deviation'] = deviation
# or
# out = df.assign(deviation=deviation)
Output:
   setting_id  subject_id  seconds  result_id  owner_id  average  duration_id  deviation
0           7           1        0     1680.5       2.0     24.0          1.0          0
1           7           1     3600     1690.5       2.0     46.0          2.0        192
2           7           1    10800     1700.5       2.0    101.0          4.0        421
3           7           2        0     1682.5       2.0     12.5          1.0          0
4           7           2     3600     1692.5       2.0     33.5          2.0        268
5           7           2    10800     1702.5       2.0     86.5          4.0        692
6           7           3        0     1684.5       2.0      8.5          1.0          0
7           7           3     3600     1694.5       2.0     15.0          2.0        176
8           7           3    10800     1704.5       2.0     34.0          4.0        400
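The merge can also be avoided by broadcasting each group's baseline back onto its rows with groupby(...).transform. A self-contained sketch using the averages from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    'setting_id': [7] * 9,
    'subject_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'seconds': [0, 3600, 10800] * 3,
    'average': [24.0, 46.0, 101.0, 12.5, 33.5, 86.5, 8.5, 15.0, 34.0],
})

m = df['seconds'].eq(0)
# keep only the seconds == 0 averages, then broadcast each group's
# single baseline value to every row of that group
baseline = (df['average'].where(m)
              .groupby([df['setting_id'], df['subject_id']])
              .transform('sum'))
df['deviation'] = (df['average'].div(baseline).mul(100)
                     .round().astype(int)
                     .mask(m, 0))
print(df)
```

This assumes every (setting_id, subject_id) group contains exactly one seconds == 0 row, which is what makes the `sum` over the masked column equivalent to picking the baseline.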

Related

How to calculate cumulative sum until a threshold and reset it after the threshold is reached considering groups in pandas dataframe in python?

I have a dataframe like this:
import pandas as pd
import numpy as np
data={'trip':[1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3],
'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471770, 1235471772, 1235471776, 1235471779, 1235471780, 1235471789,1235471792,1235471793,1235471829,1235471833,1235471835,1235471838,1235471844,1235471847,1235471848,1235471852,1235471855,1235471859,1235471900,1235471904,1235471911,1235471913]}
df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df
What I am looking for is to start from the first row (consider it as an origin) in the "TimeDistance" column and do a cumulative sum over its values; whenever this summation reaches 10, restart the cumsum and continue this procedure until the end of the trip (as you can see, this dataframe has 3 trips in the "trip" column).
I want all the cumulative sums in a new column, let's say a "cumu" column.
Another important issue is that after reaching the threshold, the next row in the "cumu" column must be zero, and the summation restarts from this new origin again.
I hope I've understood your question right. You can use a generator with .send():
def my_accumulate(maxval):
    val = 0
    yield
    while True:
        if val < maxval:
            val += yield val
        else:
            yield val
            val = 0

def fn(x):
    a = my_accumulate(10)
    next(a)
    x["cumu"] = [a.send(v) for v in x["TimeDistance"]]
    return x

df = df.groupby("trip").apply(fn)
print(df)
Prints:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0.0
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471770 3.0 9.0
5 1 1235471772 2.0 11.0
6 1 1235471776 4.0 0.0
7 1 1235471779 3.0 3.0
8 1 1235471780 1.0 4.0
9 1 1235471789 9.0 13.0
10 1 1235471792 3.0 0.0
11 1 1235471793 1.0 1.0
12 2 1235471829 NaN 0.0
13 2 1235471833 4.0 4.0
14 2 1235471835 2.0 6.0
15 2 1235471838 3.0 9.0
16 2 1235471844 6.0 15.0
17 2 1235471847 3.0 0.0
18 2 1235471848 1.0 1.0
19 2 1235471852 4.0 5.0
20 2 1235471855 3.0 8.0
21 2 1235471859 4.0 12.0
22 3 1235471900 NaN 0.0
23 3 1235471904 4.0 4.0
24 3 1235471911 7.0 11.0
25 3 1235471913 2.0 0.0
Another solution:
df = df.groupby("trip").apply(
    lambda x: x.assign(
        cumu=(
            val := 0,
            *(
                val := val + v if val < 10 else (val := 0)
                for v in x["TimeDistance"][1:]
            ),
        )
    ),
)
print(df)
Andrej's answer is better, as mine is probably not as efficient, and it depends on the df being ordered by trip and on TimeDistance being NaN in the first row of each trip.
cumulative_sum = 0
df['cumu'] = 0
for i in range(len(df)):
    if np.isnan(df.loc[i, 'TimeDistance']) or cumulative_sum >= 10:
        cumulative_sum = 0
        df.loc[i, 'cumu'] = 0
    else:
        cumulative_sum += df.loc[i, 'TimeDistance']
        df.loc[i, 'cumu'] = cumulative_sum
print(df) outputs:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0
1 1 1235471763 2.0 2
2 1 1235471765 2.0 4
3 1 1235471767 2.0 6
4 1 1235471770 3.0 9
5 1 1235471772 2.0 11
6 1 1235471776 4.0 0
7 1 1235471779 3.0 3
8 1 1235471780 1.0 4
9 1 1235471789 9.0 13
10 1 1235471792 3.0 0
11 1 1235471793 1.0 1
12 2 1235471829 NaN 0
13 2 1235471833 4.0 4
14 2 1235471835 2.0 6
15 2 1235471838 3.0 9
16 2 1235471844 6.0 15
17 2 1235471847 3.0 0
18 2 1235471848 1.0 1
19 2 1235471852 4.0 5
20 2 1235471855 3.0 8
21 2 1235471859 4.0 12
22 3 1235471900 NaN 0
23 3 1235471904 4.0 4
24 3 1235471911 7.0 11
25 3 1235471913 2.0 0
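Because diff yields NaN on the first row of every trip, and a NaN forces a reset anyway, the whole TimeDistance column can also be folded in one pass with itertools.accumulate, with no explicit groupby. A sketch on a short made-up series (not the question's data):

```python
from itertools import accumulate

import pandas as pd

# hypothetical TimeDistance column: NaN marks the start of each trip
td = pd.Series([float('nan'), 2, 2, 2, 3, 2, 4, 3,
                float('nan'), 4, 2, 3, 6, 3])

def step(acc, v):
    # a trip start (NaN) or a previous total >= 10 restarts the count at 0
    if pd.isna(v) or acc >= 10:
        return 0
    return acc + v

# initial=0 seeds the fold; drop that seed from the result
cumu = list(accumulate(td, step, initial=0))[1:]
```

As with the loop answer, the row that resets gets 0 rather than its own TimeDistance, matching the expected output above.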

pandas add new row based on sum/difference of other rows

df I have:
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
df I want: add a new row to the measure column for each id, called spend, calculated by subtracting the savings row from the income row, for each of the periods t1, t2, t3, for each id:
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
1 spend 9 13 9
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
2 spend 1 14 68
Trying:

df.loc[df['measure'] == 'spend'] = (
    df.loc[df['measure'] == 'income']
    - df.loc[df['measure'] == 'savings']
)

This fails because I am not incorporating groupby for the desired outcome.
Here is one way using groupby diff:

df1 = df[df.measure.isin(['savings', 'income'])].copy()
s = (df1.groupby('id', sort=False)[['t1', 't2', 't3']].diff()
        .dropna()
        .assign(id=df.id.unique(), measure='spend'))
df = pd.concat([df, s]).sort_values('id', kind='stable')
df
Out[276]:
   id  measure    t1    t2    t3
0   1  savings   1.0   2.0   5.0
1   1   income  10.0  15.0  14.0
2   1     misc   5.0   5.0   5.0
1   1    spend   9.0  13.0   9.0
3   2  savings   3.0   6.0  12.0
4   2   income   4.0  20.0  80.0
5   2     misc   1.0   1.0   1.0
4   2    spend   1.0  14.0  68.0
Update - if spend should instead be income minus all the other measures:

df1 = df.copy()
df1.loc[df.measure.ne('income'), 't1':] *= -1
s = (df1.groupby('id', sort=False)[['t1', 't2', 't3']].sum()
        .assign(id=df.id.unique(), measure='spend'))
df = pd.concat([df, s]).sort_values('id', kind='stable')
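An alternative that avoids diff entirely: index by (id, measure) and subtract the income and savings cross-sections, which align on id. A sketch using the question's data (kind='stable' keeps the original row order within each id):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'measure': ['savings', 'income', 'misc'] * 2,
    't1': [1, 10, 5, 3, 4, 1],
    't2': [2, 15, 5, 6, 20, 1],
    't3': [5, 14, 5, 12, 80, 1],
})

wide = df.set_index(['id', 'measure'])
# income minus savings, aligned on id
spend = (wide.xs('income', level='measure')
         - wide.xs('savings', level='measure'))
spend = spend.assign(measure='spend').reset_index()

out = pd.concat([df, spend]).sort_values('id', kind='stable')
print(out)
```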

How to add a mean column for the groupby movieId?

I have a df like below
   userId  movieId  rating
0       1       31     2.0
1       2       10     4.0
2       2       17     5.0
3       2       39     5.0
4       2       47     4.0
5       3       31     3.0
6       3       10     2.0
I need to add two columns: one is the mean for each movie, the other is diff, the difference between the rating and the mean.
Please note that movieId can be repeated because different users may rate the same movie. Here rows 0 and 5 are for movieId 31, and rows 1 and 6 are for movieId 10.
   userId  movieId  rating  mean  diff
0       1       31     2.0   2.5  -0.5
1       2       10     4.0     3     1
2       2       17     5.0     5     0
3       2       39     5.0     5     0
4       2       47     4.0     4     0
5       3       31     3.0   2.5   0.5
6       3       10     2.0     3    -1
Here is some of my code, which calculates the mean but collapses the dataframe to one row per movie:

df = df.groupby('movieId')['rating'].agg(['count', 'mean']).reset_index()
You can use transform to keep the same number of rows when calculating the mean with groupby. Calculating the difference is then straightforward:
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']
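Putting those two lines together on the question's data gives a runnable end-to-end version:

```python
import pandas as pd

df = pd.DataFrame({'userId':  [1, 2, 2, 2, 2, 3, 3],
                   'movieId': [31, 10, 17, 39, 47, 31, 10],
                   'rating':  [2.0, 4.0, 5.0, 5.0, 4.0, 3.0, 2.0]})

# per-movie mean, broadcast back to every row
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']
print(df)
```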

Is there a retain function in python? I want to transfer SAS code to python

I want to transfer SAS code to python, and cannot find a retain function in python.
The data is like :
type_id amount
1 100
1 200
1 400
2 0
1 200
1 300
2 0
1 150
What I want is: when type_id = 2, the amount should equal the negative of the previous row's amount. So the data will look like this after the retain step:
type_id amount
1 100
1 200
1 400
2 -400
1 200
1 300
2 -300
1 150
The SAS code is:
data B;
    set A;
    retain tempvar 0;
    if type_id = 2
        then amount = tempvar;
        else tempvar = -amount;
    drop tempvar;
run;
Does anyone have any idea about how to do this in python?
Thanks!
IIUC
df
type_id amount
0 1 100
1 1 200
2 1 400
3 2 0
4 1 200
5 1 300
6 2 0
7 1 150
def retain(df):
    df['ret'] = df['amount'].shift()
    df.loc[df['type_id'] == 2, 'amount'] = -df.loc[df['type_id'] == 2, 'ret']
    df.drop('ret', axis=1, inplace=True)
    return df

retain(df)
type_id amount
0 1 100.0
1 1 200.0
2 1 400.0
3 2 -400.0
4 1 200.0
5 1 300.0
6 2 -300.0
7 1 150.0
Alternatively:

def retain(df):
    mask = df['type_id'] == 2
    df.loc[mask, 'amount'] = -df['amount'].shift()[mask]
    return df

retain(df)
type_id amount
0 1 100.0
1 1 200.0
2 1 400.0
3 2 -400.0
4 1 200.0
5 1 300.0
6 2 -300.0
7 1 150.0
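The same logic fits in a single vectorized line with Series.mask, which replaces the type_id == 2 rows with the negated previous amount. A sketch on the question's data (like the answers above, this assumes type_id == 2 rows never follow one another; the shift introduces a float dtype):

```python
import pandas as pd

df = pd.DataFrame({'type_id': [1, 1, 1, 2, 1, 1, 2, 1],
                   'amount':  [100, 200, 400, 0, 200, 300, 0, 150]})

# where type_id == 2, take the negative of the previous row's amount
df['amount'] = df['amount'].mask(df['type_id'].eq(2), -df['amount'].shift())
print(df)
```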
# another idea to try
class retain_memo(object):
    def __init__(self):
        self.value = None

    def set_get(self, u):
        self.value = u
        return u

def retain(series):
    memo = retain_memo()
    return pd.Series([memo.set_get(u) if pd.notnull(u) else memo.value
                      for u in series])

test_series = pd.Series([1, 1, 2, None, None, 4, 4, None, None, None, None])
sas_style_retained = retain(test_series)
pd.DataFrame({'original': test_series, 'retained': sas_style_retained})
# original retained
# 0 1.0 1.0
# 1 1.0 1.0
# 2 2.0 2.0
# 3 NaN 2.0
# 4 NaN 2.0
# 5 4.0 4.0
# 6 4.0 4.0
# 7 NaN 4.0
# 8 NaN 4.0
# 9 NaN 4.0
# 10 NaN 4.0
# use something like df.groupby(<id>)[var].transform(retain)
# for grouped data

Implement a counter which resets in a python pandas data frame

Hi, I would like to implement a counter which counts the number of successive zero observations in a dataframe (across multiple columns), but resets whenever a non-zero observation is found. I have used a for loop, but it is incredibly slow; I am sure there must be far more efficient ways. This is my code:
Here is a snapshot of df
df.head()
ACL ACT ADH ADR AFE AFH AFT
2013-02-05 NaN NaN NaN NaN NaN NaN NaN
2013-02-12 -0.136861 -0.020406 0.046150 0.000000 -0.005321 NaN 0.058195
2013-02-19 -0.006632 0.041665 0.007365 0.012738 0.040930 NaN -0.037818
2013-02-26 -0.023848 -0.023999 -0.030677 -0.003144 0.050604 NaN -0.047604
2013-03-05 0.009771 -0.024589 -0.021073 -0.039432 0.047315 NaN 0.068727
I first initialise an empty dataframe with the same index and columns as df above:

df1 = pd.DataFrame(index=df.index, columns=df.columns)
df1 = df1.fillna(0)
Then I create my function which iterates over the rows, but this only deals with one column at a time:

def zero_obs(x=df, y=df1):
    for i in range(len(x)):
        if x[i] == 0:
            y[i] = y[i-1] + 1
        else:
            y[i] = 0
    return y

for col in df.columns:
    df1[col] = zero_obs(x=df[col], y=df1[col])
Really appreciate any help!!
The output I expect is as follows:
df1.tail()
BRN AXL TTO AGL ACL
2017-01-03 3 125 0 0 0
2017-01-10 0 126 0 0 0
2017-01-17 1 127 0 0 0
2017-01-24 0 128 0 0 0
2017-01-31 0 129 1 0 0
setup
Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.zeros((10, 2), dtype=int),
    columns=list('AB')
)
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1
print(df)
A B
0 1 0
1 0 0
2 0 0
3 0 0
4 1 0
5 0 0
6 0 1
7 0 0
8 1 0
9 0 0
Option 1
pandas apply
def zero_obs(x):
    """`x` is assumed to be a `pd.Series`"""
    csum = x.eq(0).cumsum()
    cpos = csum.where(x.ne(0)).ffill().fillna(0)
    return csum.sub(cpos)

print(df.apply(zero_obs))
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
Option 2
don't use apply
This function works just as well on df
zero_obs(df)
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
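The same counts can also be obtained by grouping on the cumulative number of non-zeros seen so far, so that every zero-run falls into its own group and the cumsum restarts there. A sketch on the setup dataframe above (this variant keeps integer dtype):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((10, 2), dtype=int), columns=list('AB'))
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1

def zero_run_length(s):
    # each non-zero value starts a new group, so the cumsum restarts there
    groups = s.ne(0).cumsum()
    return s.eq(0).groupby(groups).cumsum()

out = df.apply(zero_run_length)
print(out)
```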
