I have the following df. I'd like to group by customer and then count and sum at the same time; I also wish to add conditional grouping.
Is there any way to achieve this?
customer product score
A a 1
A b 2
A c 3
B a 4
B a 5
B b 6
My desired result is the following:
customer count sum count(product=a) sum(product=a)
A 3 6 1 1
B 3 15 2 9
My work so far is like this:
grouped = df.groupby('customer')
grouped.agg({"product": "count", "score": "sum"})
Thanks
Let us try crosstab:
s = pd.crosstab(df['customer'], df['product'], df['score'], margins=True, aggfunc=['sum', 'count']).drop('All')
Out[76]:
sum count
product a b c All a b c All
customer
A 1.0 2.0 3.0 6 1.0 1.0 1.0 3
B 9.0 6.0 NaN 15 2.0 1.0 NaN 3
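To match the desired columns exactly, one could select them from the crosstab result (a sketch, assuming the s shown above; the 'All' margin holds the unconditional totals):
out = pd.DataFrame({'count': s[('count', 'All')],
                    'sum': s[('sum', 'All')],
                    'count(product=a)': s[('count', 'a')],
                    'sum(product=a)': s[('sum', 'a')]})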
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'product': ['a', 'b', 'c', 'a', 'a', 'b'],
                   'score': [1, 2, 3, 4, 5, 6]})
df = df[df['product'] == 'a']
grouped = df.groupby('customer')
grouped = grouped.agg({"product": "count", "score": "sum"}).reset_index()
print(grouped)
Output:
customer product score
0 A 1 1
1 B 2 9
Then merge this dataframe with the unfiltered grouped dataframe, as sketched below.
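A minimal sketch of that merge, assuming the unfiltered frame is rebuilt under a separate name (the snippet above overwrote df with the filtered rows):
df_all = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'],
                       'product': ['a', 'b', 'c', 'a', 'a', 'b'],
                       'score': [1, 2, 3, 4, 5, 6]})
total = df_all.groupby('customer').agg({"product": "count", "score": "sum"}).reset_index()
result = total.merge(grouped, on='customer', suffixes=('', '(product=a)'))
# result columns: customer, product, score, product(product=a), score(product=a)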
I have this data frame:
data = {'name': ['a', 'a', 'b', 'c', 'd', 'b', 'b', 'a', 'c'],
        'number': [32, 25, 9, 43, 8, 5, 11, 21, 0]}
and I want to get the minimum number for each name, where the data in the number column for that name is not 0.
For my example, I want this result:
data = {'col1': ['a', 'b', 'c', 'd'],
'col2': [21, 5, 43, 8]
}
I don't want repeated names.
IIUC, you can try:
df = df.mask(df.number.eq(0)).dropna().groupby('name', as_index=False).min()
Output:
name number
0 a 21.0
1 b 5.0
2 c 43.0
3 d 8.0
Try with sort_values + drop_duplicates:
out = df.loc[df.number != 0].sort_values('number').drop_duplicates('name')
Out[24]:
name number
5 b 5
4 d 8
7 a 21
3 c 43
Try:
df = df.query('number != 0')
df.loc[df.groupby('name')['number'].idxmin().tolist()]
Output:
name number
7 a 21
5 b 5
3 c 43
4 d 8
Use replace with groupby (np is numpy, imported as import numpy as np):
df.replace({"number": {0: np.nan}}).groupby("name", as_index=False)['number'].min()
name number
0 a 21.0
1 b 5.0
2 c 43.0
3 d 8.0
Cast it back to int if you want, using astype.
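A minimal sketch, assuming the result above is assigned to out and every name has at least one non-zero number (so no NaN remains after the min):
out = df.replace({"number": {0: np.nan}}).groupby("name", as_index=False)['number'].min()
out['number'] = out['number'].astype(int)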
I'm trying to apply an expanding function to a pandas dataframe by group, but first filtering out all zeroes as well as the last value of each group. The code below does exactly what I need, but is a bit too slow:
df.update(df.loc[~df.index.isin(df.groupby('group')['value'].tail(1).index) &
                 (df['value'] != 0)]
            .iloc[::-1].groupby('group')['value']
            .expanding().min().reset_index(level=0, drop=True))
I found a faster way to do this using the code below:
df.update(df.iloc[::-1].groupby('group')['value']
            .expanding().min().reset_index(level=0, drop=True),
          filter_func=lambda x: (x != 0) & (x[-1] == False))
However, with the dataset I am currently working on, I receive a warning ("C:...\anaconda3\lib\site-packages\ipykernel_launcher.py:22: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.").
Strangely enough, I don't get an error using small dummy datasets such as this:
data = {'group':['A', 'A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'value':[3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)
df
group value
0 A 3
1 A 0
2 A 8
3 A 7
4 A 0
5 B -1
6 B 0
7 B 9
8 B -2
9 B 0
10 B 0
11 C 2
12 C 0
13 C 5
14 C 0
15 C 1
Grateful if someone can help me understand why this error is coming up and how to avoid it.
I believe your first code can be improved with DataFrame.duplicated for better performance; the second code is not working for me:
m = df.duplicated('group', keep='last') & (df['value'] != 0)
s = df[m].iloc[::-1].groupby('group')['value'].expanding().min().reset_index(level=0, drop=True)
df.update(s)
# alternative, not sure if faster
# df['value'] = s.reindex(df.index, fill_value=0)
print(df)
group value
0 A 3.0
1 A 0.0
2 A 7.0
3 A 7.0
4 A 0.0
5 B -2.0
6 B 0.0
7 B -2.0
8 B -2.0
9 B 0.0
10 B 0.0
11 C 2.0
12 C 0.0
13 C 5.0
14 C 0.0
15 C 1.0
I want to group by id, apply a function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.
Example
dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})
def my_func(data):
data['diff'] = (data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0]))
return data
dat.groupby('id').apply(my_func)
Output
> print(dat)
id x diff
0 a 4 0
1 a 8 4
2 a 12 4
3 b 25 0
4 b 30 5
5 b 50 20
Is there a more efficient way to do this?
You can use groupby.diff() for this and afterwards fill the NaN with zero, like the following:
dat['diff'] = dat.groupby('id').x.diff().fillna(0)
print(dat)
id x diff
0 a 4 0.0
1 a 8 4.0
2 a 12 4.0
3 b 25 0.0
4 b 30 5.0
5 b 50 20.0
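If you prefer integer dtype like the expected output in the question, one option (a sketch) is to cast after filling:
dat['diff'] = dat.groupby('id').x.diff().fillna(0).astype(int)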
Given this trivial dataset
df = pd.DataFrame({'one': ['a', 'a', 'a', 'b', 'b', 'b'],
'two': ['c', 'c', 'c', 'c', 'd', 'd'],
'three': [1, 2, 3, 4, 5, 6]})
grouping on one / two and applying .max() returns me a Series indexed on the groupby vars, as expected...
df.groupby(['one', 'two'])['three'].max()
output:
one two
a c 3
b c 4
d 6
Name: three, dtype: int64
...in my case I want to shift() my records, by group. But for some reason, when I apply .shift() to the groupby object, my results don't include the groupby variables:
df.groupby(['one', 'two'])['three'].shift()
output:
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
Name: three, dtype: float64
Is there a way to preserve those groupby variables in the results, as either columns or a multi-indexed Series (as in .max())? Thanks!
It is the difference between max and shift: max aggregates values (it returns an aggregated Series) while shift does not, returning a Series of the same size.
So it is possible to append the output as a new column:
df['shifted'] = df.groupby(['one', 'two'])['three'].shift()
Theoretically it is possible to use agg, but it returns an error in pandas 0.20.3:
df1 = df.groupby(['one', 'two'])['three'].agg(['max', lambda x: x.shift()])
print (df1)
ValueError: Function does not reduce
One possible solution is transform, if you need max together with shift:
g = df.groupby(['one', 'two'])['three']
df['max'] = g.transform('max')
df['shifted'] = g.shift()
print (df)
one three two max shifted
0 a 1 c 3 NaN
1 a 2 c 3 1.0
2 a 3 c 3 2.0
3 b 4 c 4 NaN
4 b 5 d 6 NaN
5 b 6 d 6 5.0
As Jez explained, shift returns a Series that keeps the same length as the dataframe; if you use it as an aggregation like max(), you will get the error
Function does not reduce
df.assign(shifted=df.groupby(['one', 'two'])['three'].shift()).set_index(['one','two'])
Out[57]:
three shifted
one two
a c 1 NaN
c 2 1.0
c 3 2.0
b c 4 NaN
d 5 NaN
d 6 5.0
Using max as the key, slice the shifted values at the row where value equals the group max:
df.groupby(['one', 'two'])['three'].apply(lambda x : x.shift()[x==x.max()])
Out[58]:
one two
a c 2 2.0
b c 3 NaN
d 5 5.0
Name: three, dtype: float64
Given the code below.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'clients': pd.Series(['A', 'A', 'A', 'B', 'B']),
'odd1': pd.Series([1, 1, 2, 1, 2]),
'odd2': pd.Series([6, 7, 8, 9, 10])})
grpd = df.groupby(['clients', 'odd1']).agg({
'odd2': [np.sum, np.average]
}).reset_index('clients').reset_index('odd1')
>> grpd
odd1 clients odd2
sum average
0 1 A 13 6.5
1 2 A 8 8.0
2 1 B 9 9.0
3 2 B 10 10.0
I would like to create a pivot table as below:
        | odd1    | odd1    | ...... | odd1    |
--------|---------|---------|--------|---------|
clients | average | average | ...... | average |
The desired output is:
clients | 1 2
--------|------------------
A | 6.5 8.0
B | 9.0 10.0
This would work if we had a column which is not multilevel:
grpd.pivot(index='clients', columns='odd1', values='odd2')
Not sure I understand how multilevel cols work.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'clients': pd.Series(['A', 'A', 'A', 'B', 'B']),
'odd1': pd.Series([1, 1, 2, 1, 2]),
'odd2': pd.Series([6, 7, 8, 9, 10])})
aggd = df.groupby(['clients', 'odd1']).agg({
'odd2': [np.sum, np.average]})
print(aggd.unstack(['odd1']).loc[:, ('odd2','average')])
yields
odd1 1 2
clients
A 6.5 8
B 9.0 10
Explanation: One of the intermediate steps in grpd is
aggd = df.groupby(['clients', 'odd1']).agg({
'odd2': [np.sum, np.average]})
which looks like this:
In [52]: aggd
Out[52]:
odd2
sum average
clients odd1
A 1 13 6.5
2 8 8.0
B 1 9 9.0
2 10 10.0
Visual comparison between aggd and the desired result
odd1 1 2
clients
A 6.5 8
B 9.0 10
shows that the odd1 index needs to become a column index. That operation, moving index labels to column labels, is the job done by the unstack method. So it is natural to unstack aggd:
In [53]: aggd.unstack(['odd1'])
Out[53]:
odd2
sum average
odd1 1 2 1 2
clients
A 13 8 6.5 8
B 9 10 9.0 10
Now it is easy to see we just want to select the average columns. That can be done with loc:
In [54]: aggd.unstack(['odd1']).loc[:, ('odd2','average')]
Out[54]:
odd1 1 2
clients
A 6.5 8
B 9.0 10
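As an aside, a similar table can be produced directly from df with pivot_table, skipping the intermediate agg (a sketch; 'mean' plays the role of np.average here):
df.pivot_table(index='clients', columns='odd1', values='odd2', aggfunc='mean')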