Pandas Groupby Max of Multiple Columns - python

Given this data frame:
import pandas as pd
df = pd.DataFrame({'group': ['a','a','b','c','c'],
                   'strings': ['ab',' ',' ','12',' '],
                   'floats': [7.0,8.0,9.0,10.0,11.0]})
  group strings  floats
0     a      ab     7.0
1     a             8.0
2     b             9.0
3     c      12    10.0
4     c            11.0
I want to group by "group" and get the max value of strings and floats.
Desired result:
      strings  floats
group
a          ab     8.0
b                 9.0
c          12    11.0
I know I can just do this (in recent pandas the multi-column selection needs double brackets):
df.groupby(['group'], sort=False)[['strings','floats']].max()
But in reality, I have many columns so I want to refer to all columns (save for "group") in one go.
I wish I could just do this:
df.groupby(['group'], sort=False)[x for x in df.columns if x != 'group'].max()
But, alas, "invalid syntax".

If you need the max of all columns other than the grouping key, it is possible to aggregate the whole frame without selecting columns at all:
df = df.groupby('group', sort=False).max()
print (df)
      strings  floats
group
a          ab     8.0
b                 9.0
c          12    11.0
Your second solution works if you add another pair of brackets [], so the list comprehension becomes a list passed to the indexer:
df = df.groupby(['group'], sort=False)[[x for x in df.columns if x != 'group']].max()
print (df)
      strings  floats
group
a          ab     8.0
b                 9.0
c          12    11.0
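If you do want an explicit selection of every column except the key (say, to leave out some other columns too), a slightly terser alternative to the list comprehension, not from the original answers, is Index.difference:
df.groupby('group', sort=False)[df.columns.difference(['group'])].max()
Note that difference returns the columns sorted alphabetically, so the output column order may change.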

Related

combine rows with identical index

How do I combine values from two rows that have an identical index and no overlap in their values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
# input
     0    1    2
a  1.0  2.0  3.0
b  4.0  NaN  NaN
b  NaN  5.0  6.0
Desired output
     0    1    2
a  1.0  2.0  3.0
b  4.0  5.0  6.0
Use stack(), which drops all NaNs, then unstack():
df.stack().unstack()
If you want the first non-missing value per index label, a simpler solution is GroupBy.first:
df1 = df.groupby(level=0).first()
If summing per label gives the same output for your sample data, use sum grouped by the index level (df.sum(level=0) is deprecated in recent pandas):
df1 = df.groupby(level=0).sum()
If there are multiple non-missing values per group, you need to specify the expected output; obviously it is more complicated.
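For example (a constructed case, not from the original question), if both 'b' rows carry a value in column 0, first and sum no longer agree, so the right aggregation depends on the expected output:
df2 = pd.DataFrame([[1,2,3],[4,None,None],[7,5,6]], index=['a','b','b'])
print (df2.groupby(level=0).first())   # keeps 4 for ('b', 0)
print (df2.groupby(level=0).sum())     # sums 4 + 7 = 11 for ('b', 0)
Here df2.stack().unstack() would even raise an error, because the stacked index contains duplicate entries.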

Inserting non-number rows in MultiIndex dataframe

I have a pandas data-frame with multiple features, where I would like to insert rows of NaNs corresponding to only the first feature. (The example tables in the original post were images and are not reproduced here.) As I will be dealing with large datasets, speed is important.
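Inferring from the answer's output below, the input data presumably looks like this (my reconstruction, not the original post's):
import pandas as pd

df = pd.DataFrame({'feat1': ['A']*3 + ['B']*3 + ['C']*3,
                   'feat2': ['x','y','z']*3,
                   'var': range(9)})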
For a general solution that also works with more columns: build a new DataFrame with DataFrame.drop_duplicates, select the feature columns and overwrite feat2, so that after concat all the other columns are filled with missing values. Finally, restore the correct order with DataFrame.sort_values:
df1 = df.drop_duplicates('feat1')[['feat1','feat2']].assign(feat2='-')
df2 = (pd.concat([df1, df], sort=False, ignore_index=True)
         .sort_values('feat1'))
print (df2)
   feat1 feat2  var
0      A     -  NaN
3      A     x  0.0
4      A     y  1.0
5      A     z  2.0
1      B     -  NaN
6      B     x  3.0
7      B     y  4.0
8      B     z  5.0
2      C     -  NaN
9      C     x  6.0
10     C     y  7.0
11     C     z  8.0
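One caveat (my note, not part of the original answer): sort_values defaults to a non-stable quicksort, so the '-' rows are not strictly guaranteed to stay in front of their group. Passing kind='stable' makes the order deterministic:
df2 = (pd.concat([df1, df], sort=False, ignore_index=True)
         .sort_values('feat1', kind='stable'))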

Random number from column

The goal is to fill the nan values in a column with a random number chosen from that same column.
I can do this one column at a time, but when iterating through all the columns in the data frame I get a variety of errors. When I use random.choice I get letters (column names) rather than column values.
df1 = df_nan
df2 = df_nan.dropna()
for i in range(5):
    for j in range(len(df1)):
        if np.isnan(df1.iloc[j, i]):
            # bug: this samples from the column *name* (a string), not the column values
            df1.iloc[j, i] = np.random.choice(df2.columns[i])
df1
Any suggestions on how to move forward?
You can do:
# sample data
df = pd.DataFrame({'a': [1, 2, None, 18, 20, None],
                   'b': [22, 33, 44, None, 100, 32]})
# fill missing values with a random value drawn from that column
# (one draw per column, so every NaN in a column gets the same value)
for col in df.columns:
    df[col] = df[col].fillna(df[col].dropna().sample().values[0])
      a      b
0   1.0   22.0
1   2.0   33.0
2  20.0   44.0
3  18.0  100.0
4  20.0  100.0
5  20.0   32.0
You can use pd.DataFrame.apply with np.random.choice:
df = df.apply(lambda s: s.fillna(np.random.choice(s.dropna())))
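Both snippets above draw a single replacement per column, so every NaN in a column receives the same value. If each NaN should instead get an independent draw, one possible variant (my sketch, not from the original answers) fills from a Series of random picks aligned on the index:
df = df.apply(lambda s: s.fillna(pd.Series(np.random.choice(s.dropna(), size=len(s)),
                                           index=s.index)))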

pandas: groupby multiple columns, concatenating one column while adding another

If I had the following df:
   amount name role desc
0     1.0    a    x    f
1     2.0    a    y    g
2     3.0    b    y    h
3     4.0    b    y    j
4     5.0    c    x    k
5     6.0    c    x    l
6     6.0    c    y    p
I want to group by the name and role columns, add up the amount, and also concatenate the desc values with a ,:
   amount name role desc
0     1.0    a    x    f
1     2.0    a    y    g
2     7.0    b    y  h,j
4    11.0    c    x  k,l
6     6.0    c    y    p
What would be the correct way of approaching this?
Side question: say if the df was being read from a .csv and it had other unrelated columns, how do I do this calculation and then write to a new .csv along with the other columns (same schema as the one read)?
This may not be an exact dupe, but there are a lot of related questions about groupby agg:
df.groupby(['name', 'role'], as_index=False)\
  .agg({'amount': 'sum', 'desc': lambda x: ','.join(x)})
  name role  amount desc
0    a    x     1.0    f
1    a    y     2.0    g
2    b    y     7.0  h,j
3    c    x    11.0  k,l
4    c    y     6.0    p
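A modern alternative (my addition, assuming pandas >= 0.25) is named aggregation, which also lets you control the output column names:
df.groupby(['name', 'role'], as_index=False).agg(amount=('amount', 'sum'),
                                                 desc=('desc', ','.join))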
Edit: If there are other columns in the dataframe, you can aggregate them using 'first' or 'last', or, if their values are identical within each group, include them in the grouping.
Option1:
df.groupby(['name', 'role'], as_index=False).agg({'amount':'sum', 'desc':lambda x: ','.join(x), 'other1':'first', 'other2':'first'})
Option 2:
df.groupby(['name', 'role', 'other1', 'other2'], as_index=False).agg({'amount':'sum', 'desc':lambda x: ','.join(x)})
Extending @Vaishali's answer. To handle the remaining columns without having to specify each one, you can create a dictionary and pass it as the argument to the agg(regate) function.
agg_funcs = {}                       # avoid shadowing the built-in dict
for col in df.columns:
    if col in ('key1', 'key2'):      # grouping keys must not appear in the dict
        continue
    if col == 'column_you_wish_to_merge':
        agg_funcs[col] = ' '.join
    else:
        agg_funcs[col] = 'first'     # or any other group aggregation operation
df.groupby(['key1', 'key2'], as_index=False).agg(agg_funcs)
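For the side question about a .csv round-trip, a minimal sketch combining the ideas above (input.csv and output.csv are hypothetical file names):
import pandas as pd

df = pd.read_csv('input.csv')        # hypothetical input file
keys = ['name', 'role']

agg_funcs = {'amount': 'sum', 'desc': lambda x: ','.join(x)}
for col in df.columns:               # aggregate any remaining columns with 'first'
    if col not in keys and col not in agg_funcs:
        agg_funcs[col] = 'first'

df.groupby(keys, as_index=False).agg(agg_funcs).to_csv('output.csv', index=False)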

python- flagging a second set of items in a series

I have a dataframe column which contains a list of numbers from a .csv. These numbers range from 1-1400, may or may not be repeated, and a NaN value can appear pretty much anywhere at random.
Two examples would be
a=[1,4,NaN,5,6,7,...1398,1400,1,2,3,NaN,8,9,...,1398,NaN]
b=[1,NaN,2,3,4,NaN,7,10,...,1398,1399,1400]
I would like to create another column that finds the first 1-1400 run and records a '1' at the same index, and if a second set of 1-1400 exists, marks it with a '2' in the new column.
I can think of some roundabout ways using temporary placeholders and some other kinds of checks, but I was wondering if there is a 1-3 liner to do this operation.
Edit1: I would prefer there to be a single column returned
a1=[1,1,NaN,1,1,1,...1,1,2,2,2,NaN,2,2,...,2,NaN]
b1=[1,NaN,1,1,1,NaN,1,1,...,1,1,1]
You can use groupby() and cumcount() to count numbers in each column:
# create new columns for counting
df['a1'] = np.nan
df['b1'] = np.nan
# take groupby for each value in column `a` and `b` and count each value
df.a1 = df.groupby('a').cumcount() + 1
df.b1 = df.groupby('b').cumcount() + 1
# set np.nan as it is
df.loc[df.a.isnull(), 'a1'] = np.nan
df.loc[df.b.isnull(), 'b1'] = np.nan
EDIT (after receiving a comment of 'does not work'):
df['a2'] = df.ffill().a.diff()
df['a1'] = df.loc[df.a2 < 0].groupby('a').cumcount() + 1
df['a1'] = df['a1'].bfill().shift(-1)
df.loc[df.a1.isnull(), 'a1'] = df.a1.max() + 1
df.drop('a2', axis=1, inplace=True)
df.loc[df.a.isnull(), 'a1'] = np.nan
you can use diff to check when the difference between two consecutive values is negative, meaning the start of a new range. Let's create a dataframe:
import pandas as pd
import numpy as np
# create a dataframe with two columns; my range only goes up to 12, but it works the same for 1400
df = pd.DataFrame({'a':[1,4,np.nan,5,10,12,2,3,4,np.nan,8,12],'b':range(1,13)})
df.loc[[4,8],'b'] = np.nan
Because you have 'NaN', you need ffill to fill each NaN with the previous value, and you want the opposite (using ~) of the rows where the diff is greater than or equal to 0. (This is not quite the same as 'less than 0': the first row of the dataframe has a NaN diff, which a plain less-than test would miss.) For column 'a', for example:
print (df.loc[~(df.a.ffill().diff()>=0),'a'])
0 1.0
6 2.0
Name: a, dtype: float64
you get the two rows where a "new" range starts. To use this property to create 'a1', you can do:
# put 1 in the rows with a new range start
df.loc[~(df.a.ffill().diff()>=0),'a1'] = 1
# create a mask to select notnull row in a:
mask_a = df.a.notnull()
# use cumsum and ffill on column a1 with the mask_a
df.loc[mask_a,'a1'] = df.loc[mask_a,'a1'].cumsum().ffill()
Finally, for several column, you can do:
list_col = ['a','b']
for col in list_col:
    df.loc[~(df[col].ffill().diff() >= 0), col + '1'] = 1
    mask = df[col].notnull()
    df.loc[mask, col + '1'] = df.loc[mask, col + '1'].cumsum().ffill()
and with my input, you get:
       a     b   a1   b1
0    1.0   1.0  1.0  1.0
1    4.0   2.0  1.0  1.0
2    NaN   3.0  NaN  1.0
3    5.0   4.0  1.0  1.0
4   10.0   NaN  1.0  NaN
5   12.0   6.0  1.0  1.0
6    1.0   7.0  2.0  1.0
7    3.0   8.0  2.0  1.0
8    4.0   NaN  2.0  NaN
9    NaN  10.0  NaN  1.0
10   8.0  11.0  2.0  1.0
11  12.0  12.0  2.0  1.0
EDIT: you can even do it in one line for each column, same result:
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
df['b1'] = df[df.b.notnull()].b.diff().fillna(-1).lt(0).cumsum()
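To run the same one-liner over many columns at once, a small generalization of the snippet above (my sketch, same dataframe assumed):
for col in ['a', 'b']:
    non_null = df[col].dropna()
    # a new range starts wherever the value decreases; cumsum numbers the ranges
    df[col + '1'] = non_null.diff().fillna(-1).lt(0).cumsum()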
