I have two pandas DataFrames.
DF A:
id        date  avg  count
 1  27/06/2021   10      5
 1  28/06/2021   12      4
DF B:
id        date  avg  count
 1  27/06/2021    8      5
 1  28/06/2021    6      6
 1  29/06/2021   11     10
 2  27/06/2021    3     10
 2  28/06/2021    3     10
Basically, these are simplifications of intermediate tables aggregated from various Big Data sources. How can I merge these data frames so that the average for an id+date pair is correct, i.e. (avg1 * count1 + avg2 * count2) / (count1 + count2)?
The expected DF for the above two should be like this:
id        date  avg  count
 1  27/06/2021    9     10
 1  28/06/2021  8.4     10
 1  29/06/2021   11     10
 2  27/06/2021    3     10
 2  28/06/2021    3     10
Thanks.
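For reference, the two frames can be constructed from the tables above (a minimal sketch; the answers below refer to DF A as df1 and DF B as df2):
import pandas as pd

# DF A and DF B as shown in the question
df1 = pd.DataFrame({'id': [1, 1],
                    'date': ['27/06/2021', '28/06/2021'],
                    'avg': [10, 12],
                    'count': [5, 4]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                    'date': ['27/06/2021', '28/06/2021', '29/06/2021',
                             '27/06/2021', '28/06/2021'],
                    'avg': [8, 6, 11, 3, 3],
                    'count': [5, 6, 10, 10, 10]})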
Another way:
out = df1.merge(df2, on=['id','date'], suffixes=('_1','_2'), how='left')
Now do calculations:
out['avg'] = out.eval("(avg_1*count_1 + avg_2*count_2) / (count_1 + count_2)")
out['count'] = out.eval("count_1 + count_2")
out = out.drop(columns=out.filter(like='_').columns)
Finally:
df2.update(out)
You can also do:
s = pd.concat([df1,df2])
cnt = s.groupby(['id','date'])['count'].sum()
amount = (s['avg']*s['count']).groupby([s['id'],s['date']]).sum()/cnt
amount.name='avg'
out = pd.concat([cnt,amount],axis=1).reset_index()
out
Out[34]:
id date count avg
0 1 27/06/2021 10 9.0
1 1 28/06/2021 10 8.4
2 1 29/06/2021 10 11.0
3 2 27/06/2021 10 3.0
4 2 28/06/2021 10 3.0
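The same concat-and-regroup idea can also be written with np.average, which takes the weights directly (a sketch of an equivalent formulation, not the answer's original code):
import numpy as np

s = pd.concat([df1, df2])
out = (s.groupby(['id', 'date'])
        .apply(lambda g: pd.Series({'avg': np.average(g['avg'], weights=g['count']),
                                    'count': g['count'].sum()}))
        .reset_index())
# note: 'count' comes back as float because it shares a Series with 'avg'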
My end goal is to sum all minutes, but only from initial to final in the periods column. This needs to be grouped by id.
I have thousands of ids, and not all of them have the same number of rows between initial and final.
Periods are sorted in a "journey" fashion: each record represents a period of time for its id.
Pseudocode:
Iterate rows and sum all values in column "min",
but only for rows from periods == initial through periods == final.
Example with 2 ids
id     periods  min
 1    period_x   10
 1     initial    2
 1    progress    3
 1  progress_1    4
 1       final    5
 2    period_y   10
 2    period_z    2
 2     initial    3
 2  progress_1   20
 2       final    3
Desired output:
id     periods  min  sum
 1    period_x   10   14
 1     initial    2   14
 1    progress    3   14
 1  progress_1    4   14
 1       final    5   14
 2    period_y   10   26
 2    period_z    2   26
 2     initial    3   26
 2  progress_1   20   26
 2       final    3   26
So far I've tried:
L = ['initial', 'final']
df['sum'] = df['min'].where(df['periods'].isin(L)).groupby(df['id']).transform('sum')
But this doesn't count what is in between initial and final.
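For reproducibility, the sample frame can be built from the table above (a minimal sketch):
import pandas as pd

# sample data transcribed from the question
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'periods': ['period_x', 'initial', 'progress', 'progress_1', 'final',
                'period_y', 'period_z', 'initial', 'progress_1', 'final'],
    'min': [10, 2, 3, 4, 5, 10, 2, 3, 20, 3],
})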
Create groups using cumsum, take the sum of group 1, and broadcast that sum to the entire column. "Group 1" is anything per id that falls between initial and final (inclusive):
import numpy as np
df['grp'] = df['periods'].isin(['initial','final'])
df['grp'] = np.where(df['periods'] == 'final', 1, df.groupby('id')['grp'].cumsum())
df['sum'] = np.where(df['grp'].eq(1), df.groupby(['id', 'grp'])['min'].transform('sum'), np.nan)
df['sum'] = df.groupby('id')['sum'].transform('max')
df
Out[1]:
id periods min grp sum
0 1 period_x 10 0 14.0
1 1 initial 2 1 14.0
2 1 progress 3 1 14.0
3 1 progress_1 4 1 14.0
4 1 final 5 1 14.0
5 2 period_y 10 0 26.0
6 2 period_z 2 0 26.0
7 2 initial 3 1 26.0
8 2 progress_1 20 1 26.0
9 2 final 3 1 26.0
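If the helper column is not wanted in the final frame, it can be dropped once sum is filled in:
# optional cleanup: grp only served to delimit the initial..final block
df = df.drop(columns='grp')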
Following is my input data frame, after computing the avg column:
a b c d avg
0 1 4 7 8 5
1 3 4 5 6 4.5
2 6 8 2 9 6.25
3 2 9 5 6 5.5
Output required after adding the criteria column:
a b c d avg avg_criteria
0 1 4 7 8 5 7.5 (<=5)
1 3 4 5 6 4.5 5.5 (<=4.5)
2 6 8 2 9 6.25 8.5 (<=6.25)
3 2 9 5 6 5.5 7.5 (<=5.5)
This is the code I have tried:
# read file
df_input_data = pd.DataFrame(pd.read_excel(file_path, header=2).dropna(axis=1, how='all'))
# adding a column after calculating the average
df_avg = df_input_data.assign(Avg=df_input_data.mean(axis=1, skipna=True))
# criteria
criteria = df_input_data.iloc[:, :] >= df_avg.iloc[1][-1]
# creating the output data frame
df_output = df_input_data.assign(Avg_criteria=criteria)
I am unable to solve this issue; I have tried and googled many times.
From what I understand, you can try df.mask/df.where after comparing with the mean, and then calculate the mean:
m = df.drop(columns="avg")
m.where(m.ge(df['avg'], axis=0)).mean(axis=1)
0 7.5
1 5.5
2 8.5
3 7.5
dtype: float64
print(df.assign(Avg_criteria=m.where(m.ge(df['avg'],axis=0)).mean(1)))
a b c d avg Avg_criteria
0 1 4 7 8 5.00 7.5
1 3 4 5 6 4.50 5.5
2 6 8 2 9 6.25 8.5
3 2 9 5 6 5.50 7.5
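The df.mask counterpart mentioned above is equivalent; mask discards values where the condition holds, so the comparison flips (a sketch reusing the same m as above):
# same result with mask: drop values below the row's avg instead of keeping values >= it
df.assign(Avg_criteria=m.mask(m.lt(df['avg'], axis=0)).mean(axis=1))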
df I have:
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
df I want: add a new row to measure for each id, called spend, computed as measure=income minus measure=savings for each of the periods t1, t2, t3:
id measure t1 t2 t3
1 savings 1 2 5
1 income 10 15 14
1 misc 5 5 5
1 spend 9 13 9
2 savings 3 6 12
2 income 4 20 80
2 misc 1 1 1
2 spend 1 14 68
Trying:
df.loc[df['measure'] == 'spend'] = (df.loc[df['measure'] == 'income']
                                    - df.loc[df['measure'] == 'savings'])
Failing because I am not incorporating groupby for the desired outcome.
Here is one way, using groupby + diff:
df1 = df[df.measure.isin(['savings','income'])].copy()
s = (df1.groupby('id', sort=False)[['t1','t2','t3']].diff()
        .dropna()
        .assign(id=df.id.unique(), measure='spend'))
df = pd.concat([df, s], sort=True).sort_values('id')
df
Out[276]:
   id measure    t1    t2    t3
0   1 savings   1.0   2.0   5.0
1   1  income  10.0  15.0  14.0
2   1    misc   5.0   5.0   5.0
1   1   spend   9.0  13.0   9.0
3   2 savings   3.0   6.0  12.0
4   2  income   4.0  20.0  80.0
5   2    misc   1.0   1.0   1.0
4   2   spend   1.0  14.0  68.0
Update (if spend should instead be income minus all other measures, not just savings):
df1 = df.copy()
df1.loc[df.measure.ne('income'), 't1':] *= -1
s = df1.groupby('id', sort=False).sum().assign(id=df.id.unique(), measure='spend')
df = pd.concat([df, s], sort=True).sort_values('id')
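The income-minus-savings subtraction can also be expressed through index alignment rather than diff (a sketch; assumes one row per id+measure pair):
# align income and savings rows on id, subtract, and append the result as 'spend'
wide = df.set_index(['id', 'measure'])
spend = (wide.xs('income', level='measure')
         - wide.xs('savings', level='measure')).assign(measure='spend').reset_index()
df_out = pd.concat([df, spend], ignore_index=True).sort_values('id')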
I have a dataframe that looks like:
value 1 value 2
1 10
4 1
5 8
6 10
10 12
I want to go down each entry of value 1, average it with the previous entry, and then create a new column beside value 2 holding that average.
The output needs to look like:
value 1 value 2 avg
1 10 nan
4 1 2.5
5 8 4.5
6 10 5.5
10 12 8.0
How would I go about doing this?
shift
You can add a series to a shifted version of itself and halve the result:
df['avg'] = (df['value1'] + df['value1'].shift()) / 2
print(df)
value1 value2 avg
0 1 10 NaN
1 4 1 2.5
2 5 8 4.5
3 6 10 5.5
4 10 12 8.0
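Equivalently, a rolling window of size 2 gives the mean of each row and the previous one (assuming the column is named value1, as in the output above):
df['avg'] = df['value1'].rolling(2).mean()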
My data is like this:
ARTICLE Day Row
a 2 10
a 3 10
a 4 10
a 5 10
a 6 10
a 7 10
a 8 10
a 9 10
a 10 10
a 11 10
b 3 1
I want to generate a new column, called Date. First, I group the data by ARTICLE. Then, for every article group, if Row is 1, the value in Date is the same as the one in Day. Otherwise, move all the values in Day one step upward and set the last value to 100. So the new data should look like this:
ARTICLE Day Row Date
a 2 10 3
a 3 10 4
a 4 10 5
a 5 10 6
a 6 10 7
a 7 10 8
a 8 10 9
a 9 10 10
a 10 10 11
a 11 10 100
b 3 1 3
I assume this can be done with groupby and transform, using a function to generate Date. So my code is:
def myFUN_PostDate1(NRow, Date):
    if (NRow.unique() == 1):
        return Date
    else:
        Date1 = Date[1:Date.shape[0]]
        Date1[Date1.shape[0] + 1] = 19800312
        return Date1
a = pd.DataFrame({'ARTICLE': ['a','a','a','a','a','a','a','a','a','a','b'],
'Day': [2,3,4,5,6,7,8,9,10,11,3],
'Row':[10,10,10,10,10,10,10,10,10,10,1]})
a.loc[:,'Date'] = a.groupby(['ARTICLE']).transform(lambda x: myFUN_PostDate1(x.loc[:,'Row'],x.loc[:,'Day']))
But I have the error information:
pandas.core.indexing.IndexingError: ('Too many indexers', 'occurred at index Day')
I also tried groupby + np.where, but got the same error.
IIUC:
In [14]: df['Date'] = (df.groupby('ARTICLE')['Day']
.apply(lambda x: x.shift(-1).fillna(100) if len(x) > 1 else x))
In [15]: df
Out[15]:
ARTICLE Day Row Date
0 a 2 10 3.0
1 a 3 10 4.0
2 a 4 10 5.0
3 a 5 10 6.0
4 a 6 10 7.0
5 a 7 10 8.0
6 a 8 10 9.0
7 a 9 10 10.0
8 a 10 10 11.0
9 a 11 10 100.0
10 b 3 1 3.0
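Note that the stated rule keys on Row being 1, while the lambda above keys on group length; if a group could have Row == 1 with more than one row, a variant that checks Row directly might look like this (a sketch, assuming Row is constant within each ARTICLE group):
# shift Day up within each group, pad the tail with 100,
# then fall back to Day wherever Row == 1
shifted = df.groupby('ARTICLE')['Day'].shift(-1).fillna(100)
df['Date'] = shifted.where(df['Row'].ne(1), df['Day'])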