Can I preserve the non-numerical columns (keeping the first value that appears in each group) when doing pandas.DataFrame.groupby().sum()?
For example, I have a DataFrame like this:
df = pd.DataFrame({'A' : ['aa1', 'aa2', 'aa1', 'aa2'],
                   'B' : ['bb1', 'bbb1', 'bb2', 'bbb2'],
                   'C' : ['cc1', 'ccc2', 'ccc3', 'ccc4'],
                   'D' : [1, 2, 3, 4],
                   'E' : [1, 2, 3, 4]})
>>> df
A B C D E
0 aa1 bb1 cc1 1 1
1 aa2 bbb1 ccc2 2 2
2 aa1 bb2 ccc3 3 3
3 aa2 bbb2 ccc4 4 4
>>> df.groupby(["A"]).sum()
D E
A
aa1 4 4
aa2 6 6
Following is the result I want to obtain:
B C D E
A
aa1 bb1 cc1 4 4
aa2 bbb1 ccc2 6 6
Notice that the values of columns B and C are the first associated B and C values for each group key.
Just use 'first':
df.groupby(["A"]).agg({'B': 'first',
'C': 'first',
'D': sum,
'E': sum})
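On pandas 0.25 or newer, the same thing can also be written with named aggregation (a sketch using the frame above; the keyword names become the output column names):
df.groupby("A").agg(B=('B', 'first'),
                    C=('C', 'first'),
                    D=('D', 'sum'),
                    E=('E', 'sum'))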
For each key in the groupby-sum dataframe, look up the key in the original dataframe and put the associated value of column B into a new column.
#groupby and sum over columns D and E
df_1 = df.groupby(['A']).sum()
Find the first values in column B associated with the groupby keys:
col_b = []
#iterate through keys and find the first value in df['B'] with that key in column A
for i in df_1.index:
    col_b.append(df['B'][df['A'] == i].iloc[0])
#insert list of values into new dataframe
df_1.insert(0, 'B', col_b)
>>> df_1
B D E
A
aa1 bb1 4 4
aa2 bbb1 6 6
Grouping only on column 'A' (note: this answer uses a different example frame, with string columns A and B and random numeric columns C and D) gives:
df.groupby(['A']).sum()
C D
A
bar 1.26 0.88
foo 0.92 -4.19
Grouping on columns 'A' and 'B' gives:
df.groupby(['A','B']).sum()
C D
A B
bar one 1.38 -0.73
three 0.26 0.80
two -0.38 0.81
foo one 1.96 -2.72
three -0.42 -0.18
two -0.62 -1.29
If you want only the rows where column 'B' has 'one', you can do:
d = df.groupby(['A','B'], as_index=False).sum()
d[d.B=='one'].set_index('A')
B C D
A
bar one 1.38 -0.73
foo one 1.96 -2.72
I'm not sure I understand but is this what you want to do?
Note: I increased the output precision just to get the same numbers shown in the post.
d = df.groupby('A').sum()
d['B'] = 'one'
d.sort_index(axis=1)
B C D
A
bar one 1.259069 0.876959
foo one 0.921510 -4.193397
If you want to put the first sorted value of column 'B' instead, you can use:
d['B'] = df.B.sort_values().iloc[0]
So here I replaced 'one', 'two', 'three' with 'a', 'b', 'c' to see if this is what you are trying to do, and used the insert() method as suggested in another post:
df
A B C D
0 foo a 0.638362 -0.931817
1 bar a 1.380706 -0.733307
2 foo b -0.324514 0.203515
3 bar c 0.258534 0.803298
4 foo b -0.299485 -1.495979
5 bar b -0.380171 0.806968
6 foo a 1.324810 -1.792996
7 foo c -0.417663 -0.176120
d = df.groupby('A').sum()
d.insert(0, 'B', df.B.sort_values().iloc[0])
d
B C D
A
bar a 1.259069 0.876959
foo a 0.921510 -4.193397
All these answers seem pretty wordy; the Pandas doc [1] also doesn't seem clear on this point despite giving an SQL example on the first page:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
which keeps Column1 and Column2 in the output, while Pandas does not. As the OP points out, the non-numeric columns are simply dropped. My solution is not that pretty either, but perhaps more control is Pandas' design goal:
agg_d = {c: 'sum' if pd.api.types.is_numeric_dtype(df[c]) else 'first'
         for c in df.columns if c != 'A'}
df = df.groupby('A').agg(agg_d)
This keeps all the non-numeric columns, as SQL does. It is basically the same as Phillip's answer, but without needing to explicitly enumerate the columns.
NOTES
[1] https://pandas.pydata.org/pandas-docs/stable/groupby.html
Related
I have a pandas dataframe and a dictionary whose values are lists. The values in the lists are unique across all the keys. I want to add a new column to my dataframe that holds, for each row, the dictionary key whose list contains that row's value. E.g. suppose I have a dataframe like this:
import pandas as pd
df = {'a':1, 'b':2, 'c':2, 'd':4, 'e':7}
df = pd.DataFrame.from_dict(df, orient='index', columns = ['col2'])
df = df.reset_index().rename(columns={'index':'col1'})
df
col1 col2
0 a 1
1 b 2
2 c 2
3 d 4
4 e 7
Now I also have a dictionary like this:
my_dict = {'x':['a', 'c'], 'y':['b'], 'z':['d', 'e']}
I want the output like this
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
Presently I am doing this by reversing the dictionary first, i.e. like this
my_dict_rev = {value:key for key in my_dict for value in my_dict[key]}
df['col3']= df['col1'].map(my_dict_rev)
df
But I am sure that there must be some direct method.
I know this is an old question but here are two other ways to do the same job. First convert my_dict to a Series object, then explode it. Then reverse the mapping and use map:
tmp = pd.Series(my_dict).explode()
df['col3'] = df['col1'].map(pd.Series(tmp.index, tmp))
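For clarity, this is roughly what the intermediate objects look like (an added illustration, not part of the original answer):
import pandas as pd

my_dict = {'x': ['a', 'c'], 'y': ['b'], 'z': ['d', 'e']}
tmp = pd.Series(my_dict).explode()      # index: x, x, y, z, z -- values: a, c, b, d, e
rev = pd.Series(tmp.index, index=tmp)   # index: a, c, b, d, e -- values: x, x, y, z, z
print(rev.to_dict())                    # {'a': 'x', 'c': 'x', 'b': 'y', 'd': 'z', 'e': 'z'}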
Another option (it starts similarly to the one above) uses merge instead of map:
df = df.merge(pd.Series(my_dict, name='col1').explode().rename_axis('col3').reset_index())
Output:
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
I have a dataframe with 1000 rows and 10 columns.
I want to add 20 columns, each containing one single value (what I call a default value).
Therefore, my final df would have 1000 rows and 30 columns.
I know that I can do it 20 times by doing:
df['column 11'] = 'default value'
df['column 12'] = 'default value 2'
But I would like to do it in a cleaner way.
I have a dict of the form {'column label': 'default value'}.
How can I do so?
I've tried pd.insert or pd.concatenate but couldn't find my way through.
thanks
regards,
Eric
One way to do so:
df_len = len(df)
new_df = pd.DataFrame({col: [val] * df_len for col,val in your_dict.items()})
df = pd.concat((df,new_df), axis=1)
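One caveat worth noting (an assumption about a common pitfall, not something stated in the answer): pd.concat aligns on the index, so if df does not have a default RangeIndex, build the new frame on df's own index, for example:
import pandas as pd

df = pd.DataFrame({'x': range(3)}, index=['r1', 'r2', 'r3'])   # non-default index
your_dict = {'column 11': 'default value', 'column 12': 'default value 2'}

# Building the new columns on df's index keeps the rows aligned during concat.
new_df = pd.DataFrame(your_dict, index=df.index)
df = pd.concat((df, new_df), axis=1)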
Generally, if the dictionary keys used as new column names may contain spaces, use the DataFrame constructor with DataFrame.join:
df = pd.DataFrame({'a':range(5)})
print (df)
a
0 0
1 1
2 2
3 3
4 4
d = {'A 11' : 's', 'A 12':'c'}
df = df.join(pd.DataFrame(d, index=df.index))
print (df)
a A 11 A 12
0 0 s c
1 1 s c
2 2 s c
3 3 s c
4 4 s c
If there are no spaces and no leading numbers in the column names (they need to be valid identifiers), it is possible to use DataFrame.assign:
d = {'A11' : 's', 'A12':'c'}
df = df.assign(**d)
print (df)
a A11 A12
0 0 s c
1 1 s c
2 2 s c
3 3 s c
4 4 s c
Another solution is to loop over the dictionary and assign:
for k, v in d.items():
    df[k] = v
I have the following Pandas dataframe:
A B C
A A Test1
A A Test2
A A XYZ
A B BA
A B AB
B A AA
I want to group this dataset twice: first by A and B to concatenate the groups within C, and afterwards only on A to get the groups defined solely by column A. The intermediate result looks like this:
A A Test1,Test2,XYZ
A B AB, BA
B A AA
And the final result should be:
A A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B B,A:(AA)
Concatenating itself works; however, the sorting does not seem to work.
Can anyone help me with this problem?
Kind regards.
Using groupby + join
s1=df.groupby(['A','B']).C.apply(','.join)
s1
Out[421]:
A B
A A Test1,Test2,XYZ
B BA,AB
B A AA
Name: C, dtype: object
s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]:
A
A {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B {('B', 'A'): 'AA'}
dtype: object
First sort_values by all 3 columns, then groupby with join, then join the A and B columns into a new column, and finally groupby to build a dictionary per group:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if the DataFrame has only these 3 columns, you can sort by all of them:
#df1 = df.sort_values(list(df.columns)).groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A Test1,Test2,XYZ A,A
1 A B AB,BA A,B
2 B A AA B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1 B {'B,A': 'AA'}
If you need tuples, change only the first part of the code:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A (Test1, Test2, XYZ) A,A
1 A B (AB, BA) A,B
2 B A (AA,) B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1 B {'B,A': ('AA',)}
I have a dataframe for which I have to calculate a series of metrics grouped by certain columns in the dataframe. I'd like to do this using a loop, but I cannot seem to figure out how (if there is a correct way).
So, what I'm trying to do is basically (semi-pseudo code, this does not run for obvious reasons):
df = pd.DataFrame({'ID': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
'Score': range(7)})
group = df.groupby('ID')
for stat in ['mean', 'min', 'max']:
    group.stat()
I can get this to work if I use numpy and getattr. I.E.:
for stat in ['mean', 'min', 'max']:
    df.groupby('ID').apply(getattr(np, stat))
The problem with this is that it is significantly slower than using the built-in .mean(), etc. that pandas provides (at least for the size of dataframe that I'm working with).
Is there a more appropriate way to accomplish this?
UPDATE:
In [116]: stats = df.groupby('ID', as_index=False).agg(['mean','min','max'])
In [117]: stats
Out[117]:
Score
mean min max
ID
A 2.666667 0 6
B 2.500000 1 4
C 4.000000 3 5
In [118]: stats.columns = ['{0[1]}_{0[0]}'.format(tup) for tup in stats.columns]
In [119]: stats
Out[119]:
mean_Score min_Score max_Score
ID
A 2.666667 0 6
B 2.500000 1 4
C 4.000000 3 5
In [120]: stats.reset_index()
Out[120]:
ID mean_Score min_Score max_Score
0 A 2.666667 0 6
1 B 2.500000 1 4
2 C 4.000000 3 5
old answer:
In [51]: df.groupby('ID').agg(['mean','min','max'])
Out[51]:
Score
mean min max
ID
A 2.666667 0 6
B 2.500000 1 4
C 4.000000 3 5
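As a side note, agg also accepts the function names as strings, so the loop from the question can stay a loop without the numpy/getattr detour (a minimal sketch, assuming the same df with an 'ID' group key):
results = {}
for stat in ['mean', 'min', 'max']:
    # String names dispatch to the optimized built-in groupby aggregations.
    results[stat] = df.groupby('ID').agg(stat)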
Here is a custom grouping function that takes a dataframe, a list of columns on which you'd like to group, a list of columns you'd like to aggregate, and a list of functions to apply to those columns:
import re
import numpy as np
import pandas as pd
# Sample data.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
df['labels'] = ['a'] * 3 + ['b'] * 2
>>> df
A B C labels
0 1.764052 0.400157 0.978738 a
1 2.240893 1.867558 -0.977278 a
2 0.950088 -0.151357 -0.103219 a
3 0.410599 0.144044 1.454274 b
4 0.761038 0.121675 0.443863 b
# Custom function.
def group_agg(df, groupby, columns=None, funcs=None):
    if not funcs:
        funcs = [sum]
    if not columns:
        columns = df.columns
    gb = df.groupby(groupby)
    dfs = []
    # Extract a readable name from each function's repr, e.g. 'sum', 'mean'.
    func_names = [re.findall(r'>?function (\w*)', str(foo))[0] for foo in funcs]
    for col in columns:
        # Build output names like 'A_sum', 'A_mean' and map them to the functions.
        col_names = (col + "_" + name for name in func_names)
        names_func_dict = {col_name: foo for col_name, foo in zip(col_names, funcs)}
        dfs.append(gb[col].agg(names_func_dict))
    return pd.concat(dfs, axis=1)
# Example result.
>>> group_agg(df, groupby=['labels'], funcs=[sum, np.mean], columns=['A', 'B'])
A_sum A_mean B_mean B_sum
labels
a 4.955034 1.651678 0.705453 2.116358
b 1.171636 0.585818 0.132859 0.265719
There is a regex statement to get the function names.
>>> [str(foo) for foo in funcs]
['<built-in function sum>', '<function mean at 0x108f86ed8>']
>>> [re.findall(r'>?function (\w*)', str(foo))[0] for foo in funcs]
['sum', 'mean']
These names are then joined to the column name, and a dictionary comprehension maps each name to its function.
For column A, for example, this is the contents of names_func_dict:
{'A_mean': <function numpy.core.fromnumeric.mean>,
'A_sum': <function sum>}
This dictionary is then passed to the gb[col].agg() call.
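A note for newer pandas: passing a {new_name: func} dict to gb[col].agg() (the renaming form used above) was deprecated and later removed, so on pandas 0.25+ an equivalent can be sketched with named aggregation (same df and labels column as above):
out = df.groupby('labels').agg(A_sum=('A', 'sum'),
                               A_mean=('A', 'mean'),
                               B_sum=('B', 'sum'),
                               B_mean=('B', 'mean'))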
I have a dataframe that has two columns, user_id and item_bought.
Here user_id is the index of the dataframe. I want to group by both user_id and item_bought and get the item-wise count for each user.
How do I do that?
From version 0.20.1 it is simpler:
Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
'B': np.arange(8)}, index=index)
print (df)
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
print (df.groupby(['second', 'A']).sum())
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
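Applied to this question's setup (user_id as the index and item_bought as a column; the small frame below is just an assumed illustration), a minimal sketch on pandas 0.20.1+:
import pandas as pd

df = pd.DataFrame({'item_bought': ['x', 'x', 'y', 'y']},
                  index=pd.Index(['b', 'b', 'b', 'c'], name='user_id'))

# 'user_id' resolves to the index level name, 'item_bought' to the column.
print(df.groupby(['user_id', 'item_bought']).size())
# user_id  item_bought
# b        x              2
#          y              1
# c        y              1
# dtype: int64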
this should work:
>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df['ind1'] = list('AAABCC')
>>> df['ind2'] = range(6)
>>> df.set_index(['ind1','ind2'], inplace=True)
>>> df
col1 col2
ind1 ind2
A 0 3 2
1 2 0
2 2 3
B 3 2 4
C 4 3 1
5 0 0
>>> df.groupby([df.index.get_level_values(0),'col1']).count()
col2
ind1 col1
A 2 2
3 1
B 2 1
C 0 1
3 1
I had the same problem using one of the columns from a multiindex. With a multiindex, you cannot use df.index.levels[0] since it contains only the distinct values from that particular index level and will most likely have a different length than the whole dataframe.
Check http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html: get_level_values "Return vector of label values for requested level, equal to the length of the index".
import pandas as pd
import numpy as np
In [11]:
df = pd.DataFrame()
In [12]:
df['user_id'] = ['b','b','b','c']
In [13]:
df['item_bought'] = ['x','x','y','y']
In [14]:
df['ct'] = 1
In [15]:
df
Out[15]:
user_id item_bought ct
0 b x 1
1 b x 1
2 b y 1
3 c y 1
In [16]:
pd.pivot_table(df,values='ct',index=['user_id','item_bought'],aggfunc=np.sum)
Out[16]:
user_id item_bought
b x 2
y 1
c y 1
I had the same problem - imported a bunch of data and I wanted to groupby a field that was the index. I didn't have a multi-index or any of that jazz and nor do you.
I figured the problem is that the field I want is the index, so at first I just reset the index - but this gives me a useless index field that I don't want. So now I do the following (two levels of grouping):
grouped = df.reset_index().groupby(by=['Field1','Field2'])
then I can use 'grouped' in a bunch of ways for different reports
grouped[['Field3','Field4']].agg([np.mean, np.std])
(which was what I wanted, giving me Field3 and Field4 means and standard deviations, grouped by Field1 (the index) and Field2).
For you, if you just want to do the count of items per user, in one simple line using groupby, the code could be
df.reset_index().groupby(by=['user_id']).count()
If you want to do more things then you can (like me) create 'grouped' and then use that. As a beginner, I find it easier to follow that way.
Please note that reset_index is not in place, so it will not mess up your original dataframe.