df have
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
df want- add a new row to the measure for each id, called spend, calculated by subtracting measure=income - measure=savings, for each of the periods t1,t2,t3, for each id
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
1 spend 9 13 9
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
2 spend 1 14 68
Trying:
df.loc[df['Measure'] == 'spend'] =
df.loc[df['Measure'] == 'income']-
(df.loc[df['Measure'] == 'savings'])
Failing because I am not incorporating groupby for desired outcome
Here is one way using groupby diff
df1=df[df.measure.isin(['savings','spend'])].copy()
s=df1.groupby('id',sort=False).diff().dropna().assign(id=df.id.unique(),measure='spend')
df=df.append(s,sort=True).sort_values('id')
df
Out[276]:
id measure t1 t2 t3
0 1 savings 1.0 2.0 5.0
1 1 income 10.0 15.0 14.0
1 1 spend 9.0 13.0 9.0
2 2 savings 3.0 6.0 12.0
3 2 income 4.0 20.0 80.0
3 2 spend 1.0 14.0 68.0
Update
df1=df.copy()
df1.loc[df.measure.ne('income'),'t1':]*=-1
s=df1.groupby('id',sort=False).sum().assign(id=df.id.unique(),measure='spend')
df=df.append(s,sort=True).sort_values('id')
Related
I am fairly new to python and I have the following dataframe
setting_id subject_id seconds result_id owner_id average duration_id
0 7 1 0 1680.5 2.0 24.000 1.0
1 7 1 3600 1690.5 2.0 46.000 2.0
2 7 1 10800 1700.5 2.0 101.000 4.0
3 7 2 0 1682.5 2.0 12.500 1.0
4 7 2 3600 1692.5 2.0 33.500 2.0
5 7 2 10800 1702.5 2.0 86.500 4.0
6 7 3 0 1684.5 2.0 8.500 1.0
7 7 3 3600 1694.5 2.0 15.000 2.0
8 7 3 10800 1704.5 2.0 34.000 4.0
What I need to do is Calculate the deviation (%) from averages with a "seconds"-value not equal to 0 from those averages with a seconds value of zero, where the subject_id and Setting_id are the same
i.e. setting_id ==7 & subject_id ==1 would be:
(result/baseline)*100
------> for 3600 seconds: (46/24)*100 = +192%
------> for 10800 seconds: (101/24)*100 = +421%
.... baseline = average-result with a seconds value of 0
.... result = average-result with a seconds value other than 0
The resulting df should look like this
setting_id subject_id seconds owner_id average deviation duration_id
0 7 1 0 2 24 0 1
1 7 1 3600 2 46 192 2
2 7 1 10800 2 101 421 4
I want to use these calculations then to plot a regression graph (with seaborn) of deviations from baseline
I have played around with this df for 2 days now and tried different forloops but I just canĀ“t figure out the correct way.
You can use:
# identify rows with 0
m = df['seconds'].eq(0)
# compute the sum of rows with 0
s = (df['average'].where(m)
.groupby([df['setting_id'], df['subject_id']])
.sum()
)
# compute the deviation per group
deviation = (
df[['setting_id', 'subject_id']]
.merge(s, left_on=['setting_id', 'subject_id'], right_index=True, how='left')
['average']
.rdiv(df['average']).mul(100)
.round().astype(int) # optional
.mask(m, 0)
)
df['deviation'] = deviation
# or
# out = df.assign(deviation=deviation)
Output:
setting_id subject_id seconds result_id owner_id average duration_id deviation
0 7 1 0 1680.5 2.0 24.0 1.0 0
1 7 1 3600 1690.5 2.0 46.0 2.0 192
2 7 1 10800 1700.5 2.0 101.0 4.0 421
3 7 2 0 1682.5 2.0 12.5 1.0 0
4 7 2 3600 1692.5 2.0 33.5 2.0 268
5 7 2 10800 1702.5 2.0 86.5 4.0 692
6 7 3 0 1684.5 2.0 8.5 1.0 0
7 7 3 3600 1694.5 2.0 15.0 2.0 176
8 7 3 10800 1704.5 2.0 34.0 4.0 400
I have a pd.DataFrame df with one column, say:
A = [1,2,3,4,5,6,7,8,2,4]
df = pd.DataFrame(A,columns = ['A'])
For each row, I want to take previous 2 values, current value and next 2 value (a window= 5) and get the sum and store it in new column. Desire output,
A A_sum
1 6
2 10
3 15
4 20
5 25
6 30
7 28
8 27
2 21
4 14
I have tried,
df['A_sum'] = df['A'].rolling(2).sum()
Tried with shift, but all doing either forward or backward, I'm looking for a combination of both.
Use rolling by 5, add parameter center=True and min_periods=1 to Series.rolling:
df['A_sum'] = df['A'].rolling(5, center=True, min_periods=1).sum()
print (df)
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
If you are allowed to use numpy, then you might use numpy.convolve to get desired output
import numpy as np
import pandas as pd
A = [1,2,3,4,5,6,7,8,2,4]
B = np.convolve(A,[1,1,1,1,1], 'same')
df = pd.DataFrame({"A":A,"A_sum":B})
print(df)
output
A A_sum
0 1 6
1 2 10
2 3 15
3 4 20
4 5 25
5 6 30
6 7 28
7 8 27
8 2 21
9 4 14
You can use shift for this (straightforward if not elegant):
df["A_sum"] = df.A + df.A.shift(-2).fillna(0) + df.A.shift(-1).fillna(0) + df.A.shift(1).fillna(0)
output:
A A_sum
0 1 6.0
1 2 10.0
2 3 14.0
3 4 18.0
4 5 22.0
5 6 26.0
6 7 23.0
7 8 21.0
8 2 14.0
9 4 6.0
I have a pandas dataframe containing the following information:
For each Timestamp, there are a number of Trays (between 1-4) out of 8 available Trays. (So there is a maximum number of 4 Trays per Timestamp.)
Each Tray consists of 4 positions.
A dataframe could look like this:
df =
timestamp t_idx position error type SNR
0 16229767 5 2 1 T1 123
1 16229767 5 1 0 T1 123
3 16229767 5 3 0 T1 123
4 16229767 5 4 0 T1 123
5 16229767 3 3 1 T9 38
6 16229767 3 1 0 T9 38
7 16229767 3 4 0 T9 38
8 29767162 7 1 0 T4 991
9 29767162 7 4 1 T4 991
If we look at the timestamp "16229767", there where 2 trays in use: Tray 3 and Tray 5.
Each position for Tray 5 was detected.
However, Tray 3 has missing data, as position 2 is missing.
I would like to fix that and add this line programmatically:
10 16229767 3 2 1 T9 38
11 29767162 7 2 1 T4 991
12 29767162 7 3 1 T4 991
I am not sure how to handle the missing values correctly. My naive approach right now is:
timestamps = df['timestamp'].unique()
for ts in timestamps:
tray_ids = df.loc[df['timestamp'] == timestamps ]["Tray ID"].unique()
for t_id in tray_ids:
# For timestamp and tray id: Each position (1 to 4) should exist once!
# df.loc[(df['timestamp'] == ts) & (df['Tray ID'] == t_id)]
# if not, append the position on the tray and set error to 1
How can I find the missing positions now and add the rows to my dataframe?
===
Edit:
I was simplifying my example, but missed a relevant information:
There are also other columns and the new generated rows should have the same content per tray. I made it clearer by adding to more columns.
Also, there was a question about the error: For each row that had to be added, the error should be automatically 1 (there is no logic behind).
We can start by converting position to the categorical type, use a groupby to fill all the missing values and set the corresponding error values to 1.
We also have to fill the type and SNR column with the correct values like so :
>>> df['position'] = pd.Categorical(df['position'], categories=df['position'].unique())
>>> df_grouped = df.groupby(['timestamp', 't_idx', 'position'], as_index=False).first()
>>> df_grouped['error'] = df_grouped['error'].fillna(1)
>>> df_grouped.sort_values('type', inplace=True)
>>> df_grouped['type'] = df_grouped.groupby(['timestamp','t_idx'])['type'].ffill().bfill()
>>> df_grouped.sort_values('SNR', inplace=True)
>>> df_grouped['SNR'] = df_grouped.groupby(['timestamp','t_idx'])['SNR'].ffill().bfill()
>>> df_grouped = df_grouped.reset_index(drop=True)
timestamp t_idx position error type SNR
0 16229767 3 1 0.0 T9 38.0
1 16229767 3 3 1.0 T9 38.0
2 16229767 3 4 0.0 T9 38.0
3 16229767 5 2 1.0 T1 123.0
4 16229767 5 1 0.0 T1 123.0
5 16229767 5 3 0.0 T1 123.0
6 16229767 5 4 0.0 T1 123.0
7 29767162 7 1 0.0 T4 991.0
8 29767162 7 4 1.0 T4 991.0
9 16229767 3 2 1.0 T9 38.0
10 16229767 7 2 1.0 T4 991.0
11 16229767 7 1 1.0 T4 991.0
12 16229767 7 3 1.0 T4 991.0
13 16229767 7 4 1.0 T4 991.0
14 29767162 3 2 1.0 T4 991.0
15 29767162 3 1 1.0 T4 991.0
16 29767162 3 3 1.0 T4 991.0
17 29767162 3 4 1.0 T4 991.0
18 29767162 5 2 1.0 T4 991.0
19 29767162 5 1 1.0 T4 991.0
20 29767162 5 3 1.0 T4 991.0
21 29767162 5 4 1.0 T4 991.0
22 29767162 7 2 1.0 T4 991.0
23 29767162 7 3 1.0 T4 991.0
And then, we filter on the value from the original DataFrame to get the expected result :
>>> df_grouped[
... pd.Series(
... list(zip(df_grouped['timestamp'].values, df_grouped['t_idx'].values))
... ).isin(list(zip(df['timestamp'].values, df['t_idx'].values)))
... ].sort_values(by=['timestamp', 't_idx']).reset_index(drop=True)
timestamp t_idx position error type SNR
0 16229767 3 1 0.0 T9 38.0
1 16229767 3 3 1.0 T9 38.0
2 16229767 3 4 0.0 T9 38.0
3 16229767 3 2 1.0 T9 38.0
4 16229767 5 2 1.0 T1 123.0
5 16229767 5 1 0.0 T1 123.0
6 16229767 5 3 0.0 T1 123.0
7 16229767 5 4 0.0 T1 123.0
8 29767162 7 1 0.0 T4 991.0
9 29767162 7 4 1.0 T4 991.0
10 29767162 7 2 1.0 T4 991.0
11 29767162 7 3 1.0 T4 991.0
pyjanitor has a complete function that exposes explicitly missing values (pyjanitor is a collection of convenient Pandas functions);
In the challenge above, only the explicitly missing values within the data needs to be exposed:
# pip install pyjanitor
import pandas as pd
import janitor
(df.complete(['timestamp', 't_idx', 'type', 'SNR'], 'position')
.fillna({"error":1}, downcast='infer')
.filter(df.columns)
)
timestamp t_idx position error type SNR
0 16229767 5 2 1 T1 123
1 16229767 5 1 0 T1 123
2 16229767 5 3 0 T1 123
3 16229767 5 4 0 T1 123
4 16229767 3 2 1 T9 38
5 16229767 3 1 0 T9 38
6 16229767 3 3 1 T9 38
7 16229767 3 4 0 T9 38
8 29767162 7 2 1 T4 991
9 29767162 7 1 0 T4 991
10 29767162 7 3 1 T4 991
11 29767162 7 4 1 T4 991
In the code above, just the the combination of ['timestamp', 't_idx', 'type', 'SNR'] and position is required to generate the missing values, limiting the output to only the explicit missing values within the dataframe; if all combinations of missing values were required, then the brackets would be dropped, and you'd probably get a much larger dataframe.
You can create a new dataframe with the timestamp with fixed range of position. Then you merge them together and you will end up with NaN values on errors columns for given missing position. Then you fill the NaN to 1.
Sample code:
unique_id = df.timestamp.unique().tolist()
df_tmp = pd.DataFrame({'timestamp':unique_id,'position':range(4)})
df = pd.merge(df_tmp, df, on=["timestamp", "position"], how="left")
df.error.fillna(1)
You can try this code:
def foo(df):
set_ = set(range(1,5))
if df.position.unique().size < 4:
diff_ = set_.difference(df.position.unique())
add_df = df.iloc[:len(diff_),:].copy()
add_df.loc[:, 'position'] = list(diff_)
# I did not understand by what rule the values in the error column are set. You can install it as you need
result_df = pd.concat([df, add_df], ignore_index=True)
return result_df
else:
return df
group = df.groupby(['timestamp', 't_idx'])
group.apply(foo)
timestamp t_idx position error
0 16229767 3 3 1
1 16229767 3 1 0
2 16229767 3 4 0
3 16229767 3 2 1
4 16229767 5 2 1
5 16229767 5 1 0
6 16229767 5 3 0
7 16229767 5 4 0
8 29767162 7 1 0
9 29767162 7 4 1
10 29767162 7 2 0
11 29767162 7 3 1
I have a couple of pandas dataframes.
DF A:
id
date
avg
count
1
27/06/2021
10
5
1
28/06/2021
12
4
DF B:
id
date
avg
count
1
27/06/2021
8
5
1
28/06/2021
6
6
1
29/06/2021
11
10
2
27/06/2021
3
10
2
28/06/2021
3
10
Basically, these are simplifications of intermediate tables aggregated from various Big Data sources. How can I merge these data frames so that the average for a id+date is correct (i.e. it is (avg1 * count1 + avg2 * count2)/(count1 + count2))
The expected DF for the above two should be like this:
id
date
avg
count
1
27/06/2021
9
10
1
28/06/2021
8.4
10
1
29/06/2021
11
10
2
27/06/2021
3
10
2
28/06/2021
3
10
Thanks.
Another way:
out=df1.merge(df2,on=['id','date'],suffixes=('_1','_2'),how='left'))
Now do calculations:
out['avg']=out.eval("(avg_1*count_1+avg_2*count_2)/(count_1+count_2)")
out['count']=out.eval("count_1+count_2")
out=out.drop(out.filter(like='_').columns,1)
Finally:
df2.update(out)
You may can do
s = pd.concat([df1,df2])
cnt = s.groupby(['id','date'])['count'].sum()
amount = (s['avg']*s['count']).groupby([s['id'],s['date']]).sum()/cnt
amount.name='avg'
out = pd.concat([cnt,amount],axis=1).reset_index()
out
Out[34]:
id date count avg
0 1 27/06/2021 10 9.0
1 1 28/06/2021 10 8.4
2 1 29/06/2021 10 11.0
3 2 27/06/2021 10 3.0
4 2 28/06/2021 10 3.0
I have a df like below
userId movieId rating
0 1 31 2.0
1 2 10 4.0
2 2 17 5.0
3 2 39 5.0
4 2 47 4.0
5 3 31 3.0
6 3 10 2.0
I need to add two column, one is mean for each movie, the other is diff which is the difference between rating and mean.
Please note that movieId can be repeated because different users may rate the same movie. Here row 0 and 5 is for movieId 31, row 1 and 6 is for movieId 10
userId movieId rating mean diff
0 1 31 2.0 2.5 -0.5
1 2 10 4.0 3 1
2 2 17 5.0 5 0
3 2 39 5.0 5 0
4 2 47 4.0 4 0
5 3 31 3.0 2.5 0.5
6 3 10 2.0 3 -1
here is some of my code which calculates the mean
df = df.groupby('movieId')['rating'].agg(['count','mean']).reset_index()
You can use transform to keep the same number of rows when calculating mean with groupby. Calculating the difference is straightforward from that:
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']