If I had the following df:
amount name role desc
0 1.0 a x f
1 2.0 a y g
2 3.0 b y h
3 4.0 b y j
4 5.0 c x k
5 6.0 c x l
6 6.0 c y p
I want to group by the name and role columns, sum the amount, and concatenate the desc values with a comma:
amount name role desc
0 1.0 a x f
1 2.0 a y g
2 7.0 b y h,j
4 11.0 c x k,l
6 6.0 c y p
What would be the correct way of approaching this?
Side question: say if the df was being read from a .csv and it had other unrelated columns, how do I do this calculation and then write to a new .csv along with the other columns (same schema as the one read)?
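For the side question, a minimal sketch (the file names and the extra column other are made up for illustration): read the CSV, aggregate the relevant columns, carry the unrelated columns along with 'first', and write the result back with the same column order as the input.

```python
import pandas as pd

# Stand-in for the CSV on disk; 'other' represents an unrelated column.
df = pd.DataFrame({
    'amount': [1.0, 2.0, 3.0, 4.0],
    'name':   ['a', 'a', 'b', 'b'],
    'role':   ['x', 'y', 'y', 'y'],
    'desc':   ['f', 'g', 'h', 'j'],
    'other':  [10, 20, 30, 40],
})
df.to_csv('input.csv', index=False)            # pretend this file already existed

df = pd.read_csv('input.csv')
out = df.groupby(['name', 'role'], as_index=False).agg(
    {'amount': 'sum', 'desc': ','.join, 'other': 'first'})
out = out[df.columns]                          # restore the original column order
out.to_csv('output.csv', index=False)          # same schema as the input
```

Any aggregation can be substituted for 'first' depending on what the unrelated columns should mean after grouping.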
df.groupby(['name', 'role'], as_index=False) \
  .agg({'amount': 'sum', 'desc': lambda x: ','.join(x)})
name role amount desc
0 a x 1.0 f
1 a y 2.0 g
2 b y 7.0 h,j
3 c x 11.0 k,l
4 c y 6.0 p
Edit: If there are other columns in the dataframe, you can aggregate them using 'first' or 'last', or, if their values within each group are identical, include them in the grouping.
Option1:
df.groupby(['name', 'role'], as_index=False).agg({'amount':'sum', 'desc':lambda x: ','.join(x), 'other1':'first', 'other2':'first'})
Option 2:
df.groupby(['name', 'role', 'other1', 'other2'], as_index=False).agg({'amount':'sum', 'desc':lambda x: ','.join(x)})
Extending @Vaishali's answer: to handle the remaining columns without having to specify each one, you can build a dictionary and pass it as the argument to the agg(regate) function.
agg_funcs = {}                               # avoid shadowing the built-in dict
for col in df.columns:
    if col in ('key1', 'key2'):              # grouping keys are not aggregated
        continue
    if col == 'column_you_wish_to_merge':
        agg_funcs[col] = ' '.join
    else:
        agg_funcs[col] = 'first'             # or any other group aggregation
df.groupby(['key1', 'key2'], as_index=False).agg(agg_funcs)
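A quick usage sketch of this pattern on data resembling the first question's (the extra column and the comma separator here are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({'amount': [3.0, 4.0], 'name': ['b', 'b'],
                   'role': ['y', 'y'], 'desc': ['h', 'j'],
                   'extra': [1, 1]})

agg_funcs = {}
for col in df.columns:
    if col in ('name', 'role'):        # grouping keys are not aggregated
        continue
    if col == 'desc':
        agg_funcs[col] = ','.join      # merge this column's strings
    else:
        agg_funcs[col] = 'first'       # keep the first value of the rest

res = df.groupby(['name', 'role'], as_index=False).agg(agg_funcs)
```

The dictionary preserves the column order of the dataframe, so the result keeps a predictable layout.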
Given this data frame:
import pandas as pd
df = pd.DataFrame({'group':['a','a','b','c','c'],'strings':['ab',' ',' ','12',' '],'floats':[7.0,8.0,9.0,10.0,11.0]})
group strings floats
0 a ab 7.0
1 a 8.0
2 b 9.0
3 c 12 10.0
4 c 11.0
I want to group by "group" and get the max value of strings and floats.
Desired result:
strings floats
group
a ab 8.0
b 9.0
c 12 11.0
I know I can just do this:
df.groupby(['group'], sort=False)[['strings', 'floats']].max()
But in reality, I have many columns so I want to refer to all columns (save for "group") in one go.
I wish I could just do this:
df.groupby(['group'], sort=False)[x for x in df.columns if x != 'group'].max()
But, alas, "invalid syntax".
If you need the max of all columns except the grouping column, it is possible to use:
df = df.groupby('group', sort=False).max()
print (df)
strings floats
group
a ab 8.0
b 9.0
c 12 11.0
Your second solution works if you add another pair of brackets []:
df = df.groupby(['group'], sort=False)[[x for x in df.columns if x != 'group']].max()
print (df)
strings floats
group
a ab 8.0
b 9.0
c 12 11.0
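As an alternative sketch that avoids the list comprehension entirely, Index.difference selects every column except the grouping key (note that it returns the columns in sorted order):

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'c', 'c'],
                   'strings': ['ab', ' ', ' ', '12', ' '],
                   'floats': [7.0, 8.0, 9.0, 10.0, 11.0]})

# All columns except 'group', without spelling them out.
cols = df.columns.difference(['group'])
res = df.groupby('group', sort=False)[cols].max()
```

This scales naturally when the dataframe has many columns.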
I want to compute a column on a pandas dataframe, but some rows contain missing values. For those rows, I want to use a different calculation. Let's say:
If column B contains a value, then subtract A from B
If column B does not contain a value, then subtract A from C
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,None,1],'c':[2,2,2,2]})
df['calc'] = df['b']-df['a']
results in:
print(df)
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 NaN
3 4 1.0 2 -3.0
Approach 1: fill the NaN rows using .where:
df['calc'].where(df['b'].isnull()) = df['c']-df['a']
which results in SyntaxError: cannot assign to function call.
Approach 2: fill the NaN rows using .iterrows():
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c'] - row['a']
        print(i)
    else:
        i = row['b'] - row['a']
        print(i)
This executes without errors and the calculation is correct; these i values are printed to the console:
0.0
-1.0
-1.0
-3.0
but the values are not written into df['calc']; the dataframe remains as is:
print(df['calc'])
0 0.0
1 -1.0
2 NaN
3 -3.0
What is the correct way of overwriting the NaN values?
Finally, I stumbled over .fillna:
df['calc'] = df['calc'].fillna( df['c']-df['a'] )
gets the job done! Can anyone explain what is wrong with the above two approaches?
Approach 2:
You are assigning the result to i, but that only rebinds the local variable; it won't modify your original dataframe. Write the value back with .loc:
for index, row in df.iterrows():
    if pd.isnull(row['b']):
        i = row['c'] - row['a']
    else:
        i = row['b'] - row['a']
    print(i)
    df.loc[index, 'calc'] = i  # <------------- here
Also, avoid iterrows() where possible; it is slow compared to vectorized operations.
Approach 1:
Pandas' where() method checks a condition and keeps the original values where the condition is True; by default, the rows not satisfying the condition are filled with NaN (or with the second argument, if one is given).
You might try:
df['calc'] = df['calc'].where(df['b'].isnull(), df['c']-df['a'])
but this keeps calc only where b is null and replaces every other row, which is the opposite of what you want. Invert the condition:
df['calc'] = df['calc'].where(~df['b'].isnull(), df['c']-df['a'])
OR, with numpy (after import numpy as np):
df['calc'] = np.where(df['b'].isnull(), df['c']-df['a'], df['calc'])
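For completeness, Series.mask is the mirror image of where: it replaces values where the condition is True, which matches the intent here directly. A sketch on the question's dataframe:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 1, None, 1], 'c': [2, 2, 2, 2]})
df['calc'] = df['b'] - df['a']

# mask() replaces values where the condition holds -- here, where b is NaN.
df['calc'] = df['calc'].mask(df['b'].isnull(), df['c'] - df['a'])
```

This avoids the double negative of where(~cond, ...).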
Instead of computing b - a and then c - a separately, you can first fill the NaN values in column b with the values from column c, and then subtract column a:
df['calc'] = df['b'].fillna(df['c']) - df['a']
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 -1.0
3 4 1.0 2 -3.0
I have a table as follows:
ID SCORE
A NaN
A NaN
B 1
B 2
C 5
I want the following output:
ID SUM_SCORE SIZE_SCORE
A NaN 2
B 3 2
C 5 1
Since I want to preserve NaN's, I need to use sum(min_count=1). So I have the following thus far:
grp = df.groupby('ID')
sum_score = grp['SCORE'].sum(min_count=1).reset_index()
size_score = grp['SCORE'].size().reset_index()
result = pd.merge(sum_score, size_score, on=['ID'])
This feels really inelegant. Is there a better way to get the result I'm looking for?
s = df.groupby('ID').SCORE.agg([('sum_score', lambda x: x.sum(min_count=1)),
                                ('size_score', 'size')]).reset_index()
ID sum_score size_score
0 A NaN 2
1 B 3.0 2
2 C 5.0 1
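On pandas 0.25 and later, the same result can also be written with named aggregation, which avoids both the list-of-tuples syntax and the merge. A sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'C'],
                   'SCORE': [None, None, 1, 2, 5]})

# min_count=1 makes sum() return NaN for an all-NaN group instead of 0.
res = df.groupby('ID', as_index=False).agg(
    SUM_SCORE=('SCORE', lambda x: x.sum(min_count=1)),
    SIZE_SCORE=('SCORE', 'size'),
)
```

The keyword names become the output column names directly, so no renaming step is needed.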
You can aggregate as follows, though note that a plain "sum" returns 0.0 for an all-NaN group and "count" excludes NaNs, so to preserve NaN exactly as in the desired output you still need min_count=1 and "size":
df_agg = df.groupby("ID", as_index=False)["SCORE"].agg(["sum", "count"])
# rename your columns
df_agg.columns = ["ID", "SUM_SCORE", "SIZE_SCORE"]
I have a pandas data-frame with multiple features, where I would like to insert rows of nans corresponding to only the first feature. In other words, I would like to transform something like this:
into this:
As I will be dealing with large datasets, the speed is important.
For a general solution that also works with more columns: build a new DataFrame with DataFrame.drop_duplicates, select only the feature columns and overwrite feat2, so that after concat every other column is filled with missing values. Finally, restore the correct order with DataFrame.sort_values:
df1 = df.drop_duplicates('feat1')[['feat1', 'feat2']].assign(feat2='-')
df2 = (pd.concat([df1, df], sort=False, ignore_index=True)
         .sort_values('feat1'))
print (df2)
feat1 feat2 var
0 A - NaN
3 A x 0.0
4 A y 1.0
5 A z 2.0
1 B - NaN
6 B x 3.0
7 B y 4.0
8 B z 5.0
2 C - NaN
9 C x 6.0
10 C y 7.0
11 C z 8.0
As per Categorical Data - Operations, by default groupby will show “unused” categories:
In [118]: cats = pd.Categorical(["a","b","b","b","c","c","c"], categories=["a","b","c","d"])
In [119]: df = pd.DataFrame({"cats":cats,"values":[1,2,2,2,3,4,5]})
In [120]: df.groupby("cats").mean()
Out[120]:
values
cats
a 1.0
b 2.0
c 4.0
d NaN
How to obtain the result with the “unused” categories dropped? e.g.
values
cats
a 1.0
b 2.0
c 4.0
Since version 0.23 you can specify observed=True in the groupby call to achieve the desired behavior.
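A short sketch of observed=True on the question's data:

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "b", "c", "c", "c"],
                      categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=True drops categories that do not occur in the data,
# so the unused category "d" produces no row.
res = df.groupby("cats", observed=True).mean()
```

Unlike the astype(str) workaround below, this keeps the categorical dtype on the index.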
Option 1
remove_unused_categories
df.groupby(df['cats'].cat.remove_unused_categories()).mean()
values
cats
a 1
b 2
c 4
You can also make the assignment first, and then groupby -
df.assign(cats=df['cats'].cat.remove_unused_categories()).groupby('cats').mean()
Or,
df['cats'] = df['cats'].cat.remove_unused_categories()
df.groupby('cats').mean()
values
cats
a 1
b 2
c 4
Option 2
astype to str conversion -
df.groupby(df['cats'].astype(str)).mean()
values
cats
a 1
b 2
c 4
Just chain with dropna. Like so:
df.groupby("cats").mean().dropna()
values
cats
a 1.0
b 2.0
c 4.0
If you want to remove unused categories from all categorical columns, you can (note that the inplace argument of remove_unused_categories was deprecated and later removed, so assign the result back instead):
def remove_unused_categories(df: pd.DataFrame):
    for c in df.columns:
        if isinstance(df[c].dtype, pd.CategoricalDtype):
            df[c] = df[c].cat.remove_unused_categories()
Then before calling groupby, call:
remove_unused_categories(df_with_empty_cat)