pandas groupby rolling mean/median that drops missing values - python

How can I get a groupby rolling mean/median in pandas that drops missing values? That is, the output should drop missing values before calculating the mean/median instead of giving me NaN whenever a missing value is present.
import numpy as np
import pandas as pd

t = pd.DataFrame(data={'date': [0,0,0,0,1,1,1,1,2,2,2,2],
                       'i0': [0,1,2,3,0,1,2,3,0,1,2,3],
                       'i1': ['A']*12,
                       'x': [10.,20.,30.,np.nan,np.nan,21.,np.nan,41.,np.nan,np.nan,32.,42.]})
t.set_index(['date','i0','i1'], inplace=True)
t.sort_index(inplace=True)
print(t)
print(t.groupby('date').apply(lambda x: x.rolling(window=2).mean()))
gives
              x
date i0 i1
0    0  A   10.0
     1  A   20.0
     2  A   30.0
     3  A    NaN
1    0  A    NaN
     1  A   21.0
     2  A    NaN
     3  A   41.0
2    0  A    NaN
     1  A    NaN
     2  A   32.0
     3  A   42.0
              x
date i0 i1
0    0  A    NaN
     1  A   15.0
     2  A   25.0
     3  A    NaN
1    0  A    NaN
     1  A    NaN
     2  A    NaN
     3  A    NaN
2    0  A    NaN
     1  A    NaN
     2  A    NaN
     3  A   37.0
while I want the following for this example:
              x
date i0 i1
0    0  A   10.0
     1  A   15.0
     2  A   25.0
     3  A   30.0
1    0  A    NaN
     1  A   21.0
     2  A   21.0
     3  A   41.0
2    0  A    NaN
     1  A    NaN
     2  A   32.0
     3  A   37.0
What I tried:
t.groupby('date').apply(lambda x: x.rolling(window=2).dropna().median())
and
t.groupby('date').apply(lambda x: x.rolling(window=2).median(dropna=True))
(both raise exceptions, but maybe something along those lines exists)
Thank you for your help!

You're looking for min_periods. Note that you don't need apply; call GroupBy.rolling directly:
t.groupby('date', group_keys=False).rolling(window=2, min_periods=1).mean()
              x
date i0 i1
0    0  A   10.0
     1  A   15.0
     2  A   25.0
     3  A   30.0
1    0  A    NaN
     1  A   21.0
     2  A   21.0
     3  A   41.0
2    0  A    NaN
     1  A    NaN
     2  A   32.0
     3  A   37.0
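min_periods=1 makes rolling emit a result as soon as the window holds at least one non-NaN value, and the rolling aggregations themselves skip NaNs, which together gives the "drop missing values before aggregating" behaviour. A minimal sketch of the median variant on the same data (median honours min_periods the same way mean does):
print(t.groupby('date', group_keys=False).rolling(window=2, min_periods=1).median())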

Related

Joining or merging multiple columns within one dataframe and keeping all data

I have this dataframe:
df = pd.DataFrame({'Position1': [1,2,3], 'Count1': [55,35,45],
                   'Position2': [4,2,7], 'Count2': [15,35,75],
                   'Position3': [3,5,6], 'Count3': [45,95,105]})
print(df)
   Position1  Count1  Position2  Count2  Position3  Count3
0          1      55          4      15          3      45
1          2      35          2      35          5      95
2          3      45          7      75          6     105
I want to join the Position columns into one column named "Positions" while sorting the data in the Counts columns like so:
   Positions  Count1  Count2  Count3
0          1      55     NaN     NaN
1          2      35      35     NaN
2          3      45     NaN      45
3          4     NaN      15     NaN
4          5     NaN     NaN      95
5          6     NaN     NaN     105
6          7     NaN      75     NaN
I've tried melting the dataframe, combining and merging columns but I am a bit stuck.
Note that the NaN values can easily be replaced using df.fillna to get a dataframe like so:
df = df.fillna(0)
   Positions  Count1  Count2  Count3
0          1      55       0       0
1          2      35      35       0
2          3      45       0      45
3          4       0      15       0
4          5       0       0      95
5          6       0       0     105
6          7       0      75       0
Here is a way to do what you've asked:
df = (df[['Position1', 'Count1']].rename(columns={'Position1': 'Positions'})
      .join(df[['Position2', 'Count2']].set_index('Position2'), on='Positions', how='outer')
      .join(df[['Position3', 'Count3']].set_index('Position3'), on='Positions', how='outer')
      .sort_values(by=['Positions'])
      .reset_index(drop=True))
Output:
   Positions  Count1  Count2  Count3
0          1    55.0     NaN     NaN
1          2    35.0    35.0     NaN
2          3    45.0     NaN    45.0
3          4     NaN    15.0     NaN
4          5     NaN     NaN    95.0
5          6     NaN     NaN   105.0
6          7     NaN    75.0     NaN
Explanation:
Use join first on Position1/Count1 and Position2/Count2 (with Position1 renamed to Positions), then join that result with Position3/Count3.
Sort by Positions and use reset_index to create a new integer range index (ascending with no gaps).
Does this achieve what you are after?
import pandas as pd
df = pd.DataFrame({'Position1': [1,2,3], 'Count1': [55,35,45],
                   'Position2': [4,2,7], 'Count2': [15,35,75],
                   'Position3': [3,5,6], 'Count3': [45,95,105]})
df1, df2, df3 = df.iloc[:, :2], df.iloc[:, 2:4], df.iloc[:, 4:6]
df1.columns, df2.columns, df3.columns = ['Positions', 'Count1'], ['Positions', 'Count2'], ['Positions', 'Count3']
df1.merge(df2, on='Positions', how='outer').merge(df3, on='Positions', how='outer').sort_values('Positions')
Output: the same table as shown above (add reset_index(drop=True) if you want a clean RangeIndex).
wide_to_long unpivots the DataFrame from wide to long format, and that is what's used here.
Column names are also renamed along the way:
df['id'] = df.index
df2 = pd.wide_to_long(df, stubnames=['Position', 'Count'], i='id', j='pos').reset_index()
# pivot with a list-valued index requires pandas >= 1.1
df2 = (df2.pivot(index=['id', 'Position'], columns='pos', values='Count')
          .reset_index()
          .fillna(0)
          .add_prefix('count_'))
df2.rename(columns={'count_id': 'id', 'count_Position': 'Position'}, inplace=True)
df2
RESULT:
pos  id  Position  count_1  count_2  count_3
0     0         1     55.0      0.0      0.0
1     0         3      0.0      0.0     45.0
2     0         4      0.0     15.0      0.0
3     1         2     35.0     35.0      0.0
4     1         5      0.0      0.0     95.0
5     2         3     45.0      0.0      0.0
6     2         6      0.0      0.0    105.0
7     2         7      0.0     75.0      0.0
One option is to flip to long form with pivot_longer before flipping back to wide form with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index=None,
     names_to=('.value', 'num'),
     names_pattern=r"(.+)(\d+)")
 .pivot_wider(index='Position', names_from='num')
)
   Position  Count_1  Count_2  Count_3
0         1     55.0      NaN      NaN
1         2     35.0     35.0      NaN
2         3     45.0      NaN     45.0
3         4      NaN     15.0      NaN
4         5      NaN      NaN     95.0
5         6      NaN      NaN    105.0
6         7      NaN     75.0      NaN
In the pivot_longer step, the .value placeholder determines which part of the column names remains as the column header - in this case Position and Count.
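If you prefer to stay in stock pandas rather than pyjanitor, a rough equivalent of the same reshape is wide_to_long followed by pivot (a sketch, assuming the df from the question; the long/out names are just illustrative):
import pandas as pd

# unpivot Position1..3 / Count1..3 into long form, then spread Count back out by number
long = pd.wide_to_long(df.reset_index(), stubnames=['Position', 'Count'], i='index', j='num')
out = (long.reset_index()
           .pivot(index='Position', columns='num', values='Count')
           .add_prefix('Count_')
           .reset_index()
           .rename_axis(None, axis=1))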

Getting corresponding values in a groupby

I have a dataset similar to this
Serial   A    B
1        12
1        31
1
1        12
1        31   203
1        10
1        2
2        32   100
2        32   242
2        3
3        2
3        23   100
3
3        23
I group the dataframe by Serial and find the maximum of the A column per group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, then retain only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial   A    B     A_MAX  B_corresponding
1        12         31     203
1        31
1
1        12
1        31   203
1        10
1        2
2        32   100   32     100
2        32   242
2        3
3        2          23     100
3        23   100
3
3        23
Now for the B_corresponding column, I would like to get the B value that corresponds to A_MAX. I thought of locating the A_MAX values in A, but there can be several rows with the same max A per group. As an additional condition: in Serial 2, for example, I would prefer the smaller of the B values belonging to the two 32s.
The idea is to use DataFrame.sort_values so the maximal A values come first per group, then remove rows with missing B via DataFrame.dropna and keep the first row per Serial with DataFrame.drop_duplicates; that first remaining row holds a B belonging to the maximal A. Create a Series with DataFrame.set_index and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print(df)
    Serial     A      B  A_MAX  B_corresponding
0        1  12.0    NaN   31.0            203.0
1        1  31.0    NaN    NaN              NaN
2        1   NaN    NaN    NaN              NaN
3        1  12.0    NaN    NaN              NaN
4        1  31.0  203.0    NaN              NaN
5        1  10.0    NaN    NaN              NaN
6        1   2.0    NaN    NaN              NaN
7        2  32.0  100.0   32.0            100.0
8        2  32.0  242.0    NaN              NaN
9        2   3.0    NaN    NaN              NaN
10       3   2.0    NaN   23.0            100.0
11       3  23.0  100.0    NaN              NaN
12       3   NaN    NaN    NaN              NaN
13       3  23.0    NaN    NaN              NaN
Converting the missing values to empty strings is possible, but you get mixed values - numeric and strings - in one column, so further processing can be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print(df)
    Serial     A      B A_MAX B_corresponding
0        1  12.0    NaN    31             203
1        1  31.0    NaN
2        1   NaN    NaN
3        1  12.0    NaN
4        1  31.0  203.0
5        1  10.0    NaN
6        1   2.0    NaN
7        2  32.0  100.0    32             100
8        2  32.0  242.0
9        2   3.0    NaN
10       3   2.0    NaN    23             100
11       3  23.0  100.0
12       3   NaN    NaN
13       3  23.0    NaN
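As a quick illustration of that pitfall (a hypothetical follow-up step, not from the original answer): once the column mixes floats and empty strings, plain numeric operations can blow up:
df['A_MAX'].max()  # can raise TypeError: '>' not supported between 'str' and 'float'
pd.to_numeric(df['A_MAX'], errors='coerce') converts the column back to floats if you need to keep computing.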
You could also use dictionaries to achieve the same result if you are not inclined to use only pandas.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()
agg_df = []  # a list of row tuples, not a dict, since we append below
for serial, a in serial_to_a_mapping.items():
    agg_df.append((serial, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(agg_df, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
   Serial  A_max  B_corresponding
0       1   31.0            203.0
1       2   32.0            100.0
2       3   23.0            100.0
If you want, you could join this back to the original dataframe and mask the duplicates per Serial:
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft

Find cumulative sums of each grouping in a row and then set the grouping equal to the maximum sum

If I have a pandas data frame of ones like this:
NaN   1   1  1  1  NaN   1    1   1  NaN  1
NaN  NaN  1  1  1   1   NaN  NaN  1  NaN  1
NaN  NaN  1  1  1   1    1    1   1   1   1
How do I do a cumulative sum within each row, but then set each group of consecutive ones to the maximum value of that group's cumulative sum, so that I get a data frame like this:
NaN   4   4  4  4  NaN   3    3   3  NaN  1
NaN  NaN  4  4  4   4   NaN  NaN  1  NaN  1
NaN  NaN  9  9  9   9    9    9   9   9   9
First we call stack on the isnull mask, then create the sub-group ids with cumsum, count the consecutive 1s with transform, and as a last step unstack to convert the data back:
s = df.isnull().stack()
s = s.groupby(level=0).cumsum()[~s]
s = (s.groupby([s.index.get_level_values(0), s])
      .transform('count')
      .unstack()
      .reindex_like(df))
    1    2    3    4    5    6    7    8    9    10   11
0  NaN  4.0  4.0  4.0  4.0  NaN  3.0  3.0  3.0  NaN  1.0
1  NaN  NaN  4.0  4.0  4.0  4.0  NaN  NaN  1.0  NaN  1.0
2  NaN  NaN  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0
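To see what the grouping key looks like, here is a hand trace of the intermediate s for the first row, derived from the code above:
# row 0 values:  NaN  1  1  1  1  NaN  1  1  1  NaN  1
# isnull mask:    T   F  F  F  F   T   F  F  F   T   F
# cumsum:         1   1  1  1  1   2   2  2  2   3   3   <- each NaN opens a new block id
# after [~s]:         1  1  1  1       2  2  2        3  <- only the non-null cells survive
# transform('count') then yields 4 for block 1, 3 for block 2 and 1 for block 3.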
Many more steps than @YOBEN_S's answer, but we can make use of melt and groupby.
We use cumcount to create a conditional helper column to group with:
from io import StringIO

import numpy as np  # needed below for np.nan
import pandas as pd

d = """NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1"""
df = pd.read_csv(StringIO(d), header=None, sep=r"\s+")
s = df.reset_index().melt(id_vars="index")
s.loc[s["value"].isnull(), "counter"] = s.groupby(
    [s["index"], s["value"].isnull()]
).cumcount()
s["counter"] = s.groupby(["index"])["counter"].ffill()
s["val"] = s.groupby(["index", "counter"])["value"].cumsum()
s["val"] = s.groupby(["counter", "index"])["val"].transform("max")
s.loc[s["value"].isnull(), "val"] = np.nan
df2 = (
    s.groupby(["index", "variable"])["val"]
    .first()
    .unstack()
    .rename_axis(None, axis=1)
    .rename_axis(None)
)
print(df2)
    0    1    2    3    4    5    6    7    8    9    10
0  NaN  4.0  4.0  4.0  4.0  NaN  3.0  3.0  3.0  NaN  1.0
1  NaN  NaN  4.0  4.0  4.0  4.0  NaN  NaN  1.0  NaN  1.0
2  NaN  NaN  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0

Join dataframes by key - repeated data as new columns

I'm facing the following situation. I have two dataframes, say df1 and df2, and I need to join them by a key (ID_ed, ID). The second dataframe may have more than one occurrence of the key. What I need is to join the two dataframes and add the repeated occurrences of the key as new columns (as shown in the next image).
I tried merge = df2.join(df1, lsuffix='_ZID', rsuffix='_IID', how="left") and concat operations, but no luck so far. It seems that only the last occurrence is preserved (as if the data were being overwritten).
Any help in this is really appreciated, and thanks in advance.
Another approach is to create a serial counter for the ID_ed column, then set_index and unstack before calling pivot_table; the pivot_table aggregation used is 'first'. This approach is fairly similar to this SO answer.
Generate the data
import pandas as pd
import numpy as np
a = [['ID_ed','color'], [1,5], [2,8], [3,7]]
b = [['ID','code'], [1,1], [1,5],
     [2,np.nan], [2,20], [2,74],
     [3,10], [3,98], [3,85],
     [3,21], [3,45]]
df1 = pd.DataFrame(a[1:], columns=a[0])
df2 = pd.DataFrame(b[1:], columns=b[0])
print(df1)
   ID_ed  color
0      1      5
1      2      8
2      3      7
print(df2)
   ID  code
0   1   1.0
1   1   5.0
2   2   NaN
3   2  20.0
4   2  74.0
5   3  10.0
6   3  98.0
7   3  85.0
8   3  21.0
9   3  45.0
First the merge and unstack
# Merge and add a serial counter column
df = df1.merge(df2, how='inner', left_on='ID_ed', right_on='ID')
df['counter'] = df.groupby('ID_ed').cumcount()+1
print(df)
   ID_ed  color  ID  code  counter
0      1      5   1   1.0        1
1      1      5   1   5.0        2
2      2      8   2   NaN        1
3      2      8   2  20.0        2
4      2      8   2  74.0        3
5      3      7   3  10.0        1
6      3      7   3  98.0        2
7      3      7   3  85.0        3
8      3      7   3  21.0        4
9      3      7   3  45.0        5
# Set index and unstack; assign the result so the print shows the reshaped frame
dfu = (df.set_index(['ID_ed','color','counter'])
         .unstack()
         .swaplevel(1, 0, axis=1)
         .sort_index(level=0, axis=1)
         .add_prefix('counter_'))
print(dfu)
      counter_1               counter_2               counter_3               counter_4               counter_5
      counter_ID counter_code counter_ID counter_code counter_ID counter_code counter_ID counter_code counter_ID counter_code
ID_ed color
1     5        1.0        1.0        1.0        5.0        NaN        NaN        NaN        NaN        NaN        NaN
2     8        2.0        NaN        2.0       20.0        2.0       74.0        NaN        NaN        NaN        NaN
3     7        3.0       10.0        3.0       98.0        3.0       85.0        3.0       21.0        3.0       45.0
Next generate the pivot table
# Pivot table with 'first' aggregation
dfp = pd.pivot_table(df, index=['ID_ed','color'],
                     columns=['counter'],
                     values=['ID', 'code'],
                     aggfunc='first')
print(dfp)
             ID                      code
counter       1    2    3    4    5     1     2     3     4     5
ID_ed color
1     5     1.0  1.0  NaN  NaN  NaN   1.0   5.0   NaN   NaN   NaN
2     8     2.0  2.0  2.0  NaN  NaN   NaN  20.0  74.0   NaN   NaN
3     7     3.0  3.0  3.0  3.0  3.0  10.0  98.0  85.0  21.0  45.0
Finally rename the columns and slice by partial column name
# Rename columns
level_1_names = list(dfp.columns.get_level_values(1))
level_0_names = list(dfp.columns.get_level_values(0))
new_cnames = [b+'_'+str(f) for f, b in zip(level_1_names, level_0_names)]
dfp.columns = new_cnames
# Slice by new column names
print(dfp.loc[:, dfp.columns.str.contains('code')].reset_index(drop=False))
   ID_ed  color  code_1  code_2  code_3  code_4  code_5
0      1      5     1.0     5.0     NaN     NaN     NaN
1      2      8     NaN    20.0    74.0     NaN     NaN
2      3      7    10.0    98.0    85.0    21.0    45.0
I'd use cumcount and pivot_table:
In [11]: df1
Out[11]:
   ID  color
0   1      5
1   2      8
2   3      7

In [12]: df2
Out[12]:
   ID  code
0   1   1.0
1   1   5.0
2   2   NaN
3   2  20.0
4   2  74.0

In [13]: res = df1.merge(df2)  # merges on the shared column name 'ID'

In [14]: res
Out[14]:
   ID  color  code
0   1      5   1.0
1   1      5   5.0
2   2      8   NaN
3   2      8  20.0
4   2      8  74.0

In [15]: res['count'] = res.groupby('ID').cumcount()

In [16]: res.pivot_table('code', ['ID', 'color'], 'count')
Out[16]:
count     0     1     2
ID color
1  5    1.0   5.0   NaN
2  8    NaN  20.0  74.0
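If you want named columns like in the previous answer, a small follow-up works (a sketch using add_prefix on the integer column labels):

In [17]: res.pivot_table('code', ['ID', 'color'], 'count').add_prefix('code_').reset_index()
Out[17]:
   ID  color  code_0  code_1  code_2
0   1      5     1.0     5.0     NaN
1   2      8     NaN    20.0    74.0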

Fill null values with values from a column in another dataset

I have 2 datasets like this:
df1
   category  cost
0         1  33.0
1         1  33.0
2         2  18.0
3         1   NaN
4         3   8.0
5         2   NaN
df2.head(2)
   cost
3  33.0
5  55.0
df2 contains a single cost column whose values sit at the same indexes where df1 is null.
I would like to get this result:
df1
   category  cost
0         1  33.0
1         1  33.0
2         2  18.0
3         1  33.0
4         3   8.0
5         2  55.0
So: fill the cost column in df1 with the values from df2 at the same indexes.
fillna
Pandas aligns by index naturally:
df1['cost'] = df1['cost'].fillna(df2['cost'])
print(df1)
   category  cost
0         1  33.0
1         1  33.0
2         2  18.0
3         1  33.0
4         3   8.0
5         2  55.0
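A related option (an aside, not from the original answer): when df2 really only holds replacement rows, df1['cost'].update(df2['cost']) does the same thing in place. Note that update overwrites even non-NaN values wherever df2 has data, so fillna is the safer choice for this exact task.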
