Getting corresponding values in a groupby - python

I have a dataset similar to this
Serial   A    B
     1  12
     1  31
     1
     1  12
     1  31  203
     1  10
     1   2
     2  32  100
     2  32  242
     2   3
     3   2
     3  23  100
     3
     3  23
I group the dataframe by Serial and find the maximum of column A per group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, and then keep only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial   A    B  A_MAX  B_corresponding
     1  12          31              203
     1  31
     1
     1  12
     1  31  203
     1  10
     1   2
     2  32  100     32              100
     2  32  242
     2   3
     3   2          23              100
     3  23  100
     3
     3  23
Now, for the B_corresponding column, I would like to get the B value that corresponds to A_MAX. I thought of locating the A_MAX values in A, but the same max A value can appear more than once per group. As an additional condition, for example in Serial 2, I would prefer to get the smallest of the B values between the two rows where A is 32.
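For reference, the sample data can be constructed like this (a sketch inferred from the tables above, reading the blanks as NaN):
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Serial': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'A': [12, 31, np.nan, 12, 31, 10, 2, 32, 32, 3, 2, 23, np.nan, 23],
    'B': [np.nan, np.nan, np.nan, np.nan, 203, np.nan, np.nan,
          100, 242, np.nan, np.nan, 100, np.nan, np.nan]})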

The idea is to use DataFrame.sort_values to put the maximal A values first per group, then remove rows with missing B values with DataFrame.dropna, keep the first row per Serial with DataFrame.drop_duplicates, create a Series with DataFrame.set_index, and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
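For illustration (not part of the original answer), the helper Series s built above maps each Serial to the B value of its row with the largest A, which is exactly what B_corresponding picks up:
print (s)
Serial
1    203.0
2    100.0
3    100.0
Name: B, dtype: float64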
Converting missing values to empty strings is possible, but it creates mixed values - numeric and strings - in the same column, so subsequent processing is likely to be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN

You could also use dictionaries to achieve the same if you are not so inclined to only use pandas.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
series_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()
agg_df = []
for series, a in series_to_a_mapping.items():
    agg_df.append((series, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(agg_df, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
   Serial  A_max  B_corresponding
0       1   31.0            203.0
1       2   32.0            100.0
2       3   23.0            100.0
If you want, you could join this to original dataframe and mask duplicates.
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft

Related

Pandas: start a new group on every non-NA value

I am looking for a method to create an array of numbers to label groups, based on the value of the 'number' column, if that is possible.
With this abbreviated example DF:
import pandas as pd
from numpy import nan

number = [nan,nan,1,nan,nan,nan,2,nan,nan,3,nan,nan,nan,nan,nan,4,nan,nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the int in column 'number' - so there would effectively be groups labelled 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. if you had zeros: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
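If integer group labels are preferred, the result can be assigned back and cast - a small follow-up sketch, assuming the same df as above:
df['group'] = df['number'].ffill().fillna(0).astype(int)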

Joining or merging multiple columns within one dataframe and keeping all data

I have this dataframe:
df = pd.DataFrame({'Position1':[1,2,3], 'Count1':[55,35,45],\
'Position2':[4,2,7], 'Count2':[15,35,75],\
'Position3':[3,5,6], 'Count3':[45,95,105]})
print(df)
Position1 Count1 Position2 Count2 Position3 Count3
0 1 55 4 15 3 45
1 2 35 2 35 5 95
2 3 45 7 75 6 105
I want to join the Position columns into one column named "Positions" while sorting the data in the Counts columns like so:
   Positions  Count1  Count2  Count3
0          1      55     NaN     NaN
1          2      35      35     NaN
2          3      45     NaN      45
3          4     NaN      15     NaN
4          5     NaN     NaN      95
5          6     NaN     NaN     105
6          7     NaN      75     NaN
I've tried melting the dataframe, combining and merging columns but I am a bit stuck.
Note that the NaN types can easily be replaced by using df.fillna to get a dataframe like so:
df = df.fillna(0)
Positions Count1 Count2 Count3
0 1 55 0 0
1 2 35 35 0
2 3 45 0 45
3 4 0 15 0
4 5 0 0 95
5 6 0 0 105
6 7 0 75 0
Here is a way to do what you've asked:
df = df[['Position1', 'Count1']].rename(columns={'Position1':'Positions'}).join(
    df[['Position2', 'Count2']].set_index('Position2'), on='Positions', how='outer').join(
    df[['Position3', 'Count3']].set_index('Position3'), on='Positions', how='outer').sort_values(
    by=['Positions']).reset_index(drop=True)
Output:
Positions Count1 Count2 Count3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
Explanation:
Use join first on Position1, Count1 and Position2, Count2 (with Position1 renamed as Positions) then on that join result and Position3, Count3.
Sort by Positions and use reset_index to create a new integer range index (ascending with no gaps).
Does this achieve what you are after?
import pandas as pd
df = pd.DataFrame({'Position1':[1,2,3], 'Count1':[55,35,45],\
'Position2':[4,2,7], 'Count2':[15,35,75],\
'Position3':[3,5,6], 'Count3':[45,95,105]})
df1, df2, df3 = df.iloc[:,:2], df.iloc[:, 2:4], df.iloc[:, 4:6]
df1.columns, df2.columns, df3.columns = ['Positions', 'Count1'], ['Positions', 'Count2'], ['Positions', 'Count3']
df1.merge(df2, on='Positions', how='outer').merge(df3, on='Positions', how='outer').sort_values('Positions')
wide_to_long unpivots the DF from wide to long, and that is what's used here. Column names are also renamed with this edit:
df['id'] = df.index
df2=pd.wide_to_long(df, stubnames=['Position','Count'], i='id', j='pos').reset_index()
df2=df2.pivot(index=['id','Position'], columns='pos', values='Count').reset_index().fillna(0).add_prefix('count_')
df2.rename(columns={'count_id': 'id', 'count_Position' :'Position'}, inplace=True)
df2
RESULT:
pos  id  Position     1     2      3
0     0         1  55.0   0.0    0.0
1     0         3   0.0   0.0   45.0
2     0         4   0.0  15.0    0.0
3     1         2  35.0  35.0    0.0
4     1         5   0.0   0.0   95.0
5     2         3  45.0   0.0    0.0
6     2         6   0.0   0.0  105.0
7     2         7   0.0  75.0    0.0
PS: I'm unable to format the output; I'd appreciate it if someone could guide me here. Thanks!
One option is to flip to long form with pivot_longer before flipping back to wide form with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index = None,
     names_to = ('.value', 'num'),
     names_pattern = r"(.+)(\d+)")
 .pivot_wider(index = 'Position', names_from = 'num')
)
Position Count_1 Count_2 Count_3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
In the pivot_longer section, the .value determines which part of the column names remains as column headers - in this case it is Position and Count.

Forward fill non na values with last observation carried forwards in Python

Suppose I had a column in a dataframe like :
colname
Na
Na
Na
1
2
3
4
Na
Na
Na
Na
2
8
5
44
Na
Na
Does anyone know of a function to forward fill the non-NA values with the first value in each non-NA run? To produce:
colname
Na
Na
Na
1
1
1
1
Na
Na
Na
Na
2
2
2
2
Na
Na
Use GroupBy.transform with GroupBy.first: build the group key by testing for missing values with Series.isna and taking the cumulative sum with Series.cumsum, then keep only the duplicated positions within each group with Series.where and Series.duplicated so the original NaN rows stay NaN:
s = df['colname'].isna().cumsum()
df['colname'] = df.groupby(s)['colname'].transform('first').where(s.duplicated())
print (df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
Or filter only the non-missing values with the inverted mask m and process only those groups:
m = df['colname'].isna()
df.loc[~m, 'colname'] = df[~m].groupby(m.cumsum())['colname'].transform('first')
print (df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
Solution without groupby:
m = df['colname'].isna()                            # True on the missing rows
m1 = m.cumsum().shift().bfill()                     # run id that changes on the first value after each NaN block
m2 = ~m1.duplicated() & m.duplicated(keep=False)    # True on the first row of each non-NaN run (extra NaN rows selected are harmless)
df['colname'] = df['colname'].where(m2).ffill().mask(m)   # keep the run starts, spread them forward, then restore the NaNs
print (df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
You could try groupby and cumsum with shift and transform('first'):
>>> df.groupby(df['colname'].isna().ne(df['colname'].isna().shift()).cumsum()).transform('first')
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
>>>
Or try something like:
>>> x = df.groupby(df['colname'].isna().cumsum()).transform('first')
>>> x.loc[~x.duplicated()] = np.nan
>>> x
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
>>>

Pandas assign series to a column based on index

I have 2 dataframes:
DF1:
Count
0 98.0
1 176.0
2 260.5
3 389.0
I have to assign these values to a column in another dataframe, on every 3rd row starting from the 3rd row.
The Output of DF2 should look like this:
Count
0
1
2 98.0
3
4
5 176.0
6
7
8 260.5
9
10
11 389.0
I am doing
DF2.loc[2::3,'Count'] = DF1['Count']
But, I am not getting the expected results.
Use .values
Otherwise, pandas tries to align the index values from DF1 and that messes you up.
DF2.loc[2::3, 'Count'] = DF1['Count'].values
DF2
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
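An equivalent idea - a sketch, not from the original answer - is to relabel DF1's values with the target positions first, so that index alignment then works in your favour:
# give DF1's values the labels 2, 5, 8, 11 before assigning, so alignment matches
DF2.loc[2::3, 'Count'] = DF1['Count'].set_axis(DF2.index[2::3])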
Or, building a new frame from DF1 directly:
# place DF1's rows at positions 2, 5, 8, 11 and fill the remaining positions with NaN
DF1.set_index(DF1.index * 3 + 2).reindex(range(len(DF1) * 3))
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0

concat result of groupby pandas

I am asking this question in order to learn a new method.
I have a dataframe like below,
ID Value
0 1 10
1 1 12
2 1 14
3 1 16
4 1 18
5 2 32
6 2 12
7 2 -8
8 2 -28
9 2 -48
10 2 -68
11 3 12
12 3 1
13 3 43
I want to convert this into:
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
One way to solve this:
print(pd.concat([df[df['ID']==1].reset_index(drop=True),
                 df[df['ID']==2].reset_index(drop=True),
                 df[df['ID']==3].reset_index(drop=True)], axis=1))
But I'm wondering whether I can do the same concat operation on each groupby result instead of filtering by each value.
Any better/new approaches are more appreciated.
Thanks in advance.
Yup, very possible and quite simple with pd.concat, in fact.
df = pd.concat({k : g.reset_index(drop=True) for k, g in df.groupby('ID')}, axis=1)
df.columns = df.columns.droplevel(0)
Or, a minor variation on Dark's (now deleted) answer (which does not give you the opportunity to specify column suffixes automatically) -
pd.concat([g.reset_index(drop=True) for _, g in df.groupby('ID')], axis=1)
df
ID Value ID Value ID Value
0 1.0 10.0 2 32 3.0 12.0
1 1.0 12.0 2 12 3.0 1.0
2 1.0 14.0 2 -8 3.0 43.0
3 1.0 16.0 2 -28 NaN NaN
4 1.0 18.0 2 -48 NaN NaN
5 NaN NaN 2 -68 NaN NaN
Those column names are terrible, though. Rather than dropping the first level, you should consider concatenating them to form a pre/suf-fix for the second level. That should be a good exercise for you with df.columns.map.
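For instance, a minimal sketch of that exercise, assuming df still has the two-level columns produced by the pd.concat call above (before droplevel):
# combine both levels into flat names such as 'ID_1', 'Value_1', 'ID_2', ...
df.columns = df.columns.map(lambda c: f'{c[1]}_{c[0]}')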
