Given the code below.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'clients': pd.Series(['A', 'A', 'A', 'B', 'B']),
'odd1': pd.Series([1, 1, 2, 1, 2]),
'odd2': pd.Series([6, 7, 8, 9, 10])})
grpd = df.groupby(['clients', 'odd1']).agg({
'odd2': [np.sum, np.average]
}).reset_index('clients').reset_index('odd1')
>> grpd
  odd1 clients odd2
               sum average
0    1       A  13     6.5
1    2       A   8     8.0
2    1       B   9     9.0
3    2       B  10    10.0
I would like to create a pivot table as below:
        | odd1    | odd1    | ... | odd1    |
--------|---------|---------|-----|---------|
clients | average | average | ... | average |
The desired output is:
clients | 1 2
--------|------------------
A | 6.5 8.0
B | 9.0 10.0
This would work if the columns were not multilevel:
grpd.pivot(index='clients', columns='odd1', values='odd2')
I am not sure I understand how multilevel columns work.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'clients': pd.Series(['A', 'A', 'A', 'B', 'B']),
'odd1': pd.Series([1, 1, 2, 1, 2]),
'odd2': pd.Series([6, 7, 8, 9, 10])})
aggd = df.groupby(['clients', 'odd1']).agg({
'odd2': [np.sum, np.average]})
print(aggd.unstack(['odd1']).loc[:, ('odd2','average')])
yields
odd1 1 2
clients
A 6.5 8
B 9.0 10
Explanation: One of the intermediate steps in grpd is
aggd = df.groupby(['clients', 'odd1']).agg({
'odd2': [np.sum, np.average]})
which looks like this:
In [52]: aggd
Out[52]:
              odd2
               sum average
clients odd1
A       1       13     6.5
        2        8     8.0
B       1        9     9.0
        2       10    10.0
Visual comparison between aggd and the desired result
odd1 1 2
clients
A 6.5 8
B 9.0 10
shows that the odd1 index needs to become a column index. That operation -- the moving of index labels to column labels -- is the job done by the unstack method. So it is natural to unstack aggd:
In [53]: aggd.unstack(['odd1'])
Out[53]:
        odd2
         sum     average
odd1       1   2       1   2
clients
A         13   8     6.5   8
B          9  10     9.0  10
Now it is easy to see we just want to select the average columns. That can be done with loc:
In [54]: aggd.unstack(['odd1']).loc[:, ('odd2','average')]
Out[54]:
odd1 1 2
clients
A 6.5 8
B 9.0 10
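As an alternative sketch (not the method used in the answer above), the same table can be produced without building the multilevel columns at all, by asking only for the mean. This assumes the sum column is not needed afterwards; `pivoted` and `unstacked` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'clients': ['A', 'A', 'A', 'B', 'B'],
    'odd1': [1, 1, 2, 1, 2],
    'odd2': [6, 7, 8, 9, 10],
})

# Aggregate and reshape in one step: rows are clients, columns are odd1 values.
pivoted = df.pivot_table(index='clients', columns='odd1',
                         values='odd2', aggfunc='mean')

# The same result via groupby + unstack, computing only the mean.
unstacked = df.groupby(['clients', 'odd1'])['odd2'].mean().unstack('odd1')

print(pivoted)
```

Both variants return single-level columns, so no `loc` selection on a tuple is required.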
I need to get the expanding mean grouped by name.
I already have this code:
data = {
'id': [1, 2, 3, 4, 5, 6, 7, 8],
'name': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'number': [1, 3, 5, 7, 9, 11, 13, 15]
}
df = pd.DataFrame(data)
df['mean_number'] = df.groupby('name')['number'].apply(
lambda s: s.expanding().mean().shift()
)
P.S. I use .shift() so that the mean does not include the current row.
This results in:
id name number mean_number
0 1 A 1 NaN
1 2 B 3 NaN
2 3 A 5 1.0
3 4 B 7 3.0
4 5 A 9 3.0
5 6 B 11 5.0
6 7 A 13 5.0
7 8 B 15 7.0
This works, but I only need the last result of each group.
id name number mean_number
6 7 A 13 5.0
7 8 B 15 7.0
I would like to know if it is possible to compute the mean for only these last rows, because on a very large dataset it takes a long time to compute the value for every row and then keep only the last ones.
If you only need the last mean of each group, you can take the sum and count per group and compute the values from those:
groups = df.groupby('name')['number'].agg(s='sum', c='count')
groups
s c
name
A 28 4
B 36 4
Then you can use .tail() to get the last row for each group:
tail = df.groupby('name').tail(1).set_index("name")
tail
id number
name
A 7 13
B 8 15
Calculate the mean by removing the last row's value from the sum and one from the count:
(groups.s - tail.number) / (groups.c - 1)
name
A 5.0
B 7.0
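The steps above can be condensed into one aggregation. This is a minimal sketch that assumes the rows are already in the order that defines "last" and that `number` has no missing values; `stats`, `fast` and `slow` are illustrative names. It also checks the shortcut against the expanding/shift version from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'number': [1, 3, 5, 7, 9, 11, 13, 15],
})

# Per-group sum, count and last value in a single pass.
stats = df.groupby('name')['number'].agg(['sum', 'count', 'last'])

# Mean of every row except the last one in each group.
fast = (stats['sum'] - stats['last']) / (stats['count'] - 1)

# Reference: expanding mean shifted by one row, keeping the last row per group.
expanding = df.groupby('name')['number'].transform(
    lambda s: s.expanding().mean().shift()
)
slow = expanding.groupby(df['name']).last()

print(fast)
```

The shortcut touches each row once and never materialises the per-row expanding means, which is where the saving on a large dataset would come from.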
I have the following df. I'd like to group by customer and then compute count and sum at the same time.
I would also like to add conditional aggregates.
Is there any way to achieve this?
customer product score
A a 1
A b 2
A c 3
B a 4
B a 5
B b 6
My desired result is the following:
customer count sum count(product =a) sum(product=a)
A 3 6 1 1
B 3 15 2 9
My attempt so far looks like this:
grouped=df.groupby('customer')
grouped.agg({"product":"count","score":"sum"})
Thanks
Let us try crosstab
s = pd.crosstab(df['customer'],df['product'], df['score'],margins=True, aggfunc=['sum','count']).drop('All')
Out[76]:
          sum                count
product     a    b    c All      a    b    c All
customer
A         1.0  2.0  3.0   6    1.0  1.0  1.0   3
B         9.0  6.0  NaN  15    2.0  1.0  NaN   3
import pandas as pd
df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'], 'product': ['a', 'b', 'c', 'a', 'a', 'b'], 'score':[1, 2, 3, 4, 5, 6]})
df = df[df['product']=='a']
grouped=df.groupby('customer')
grouped = grouped.agg({"product":"count","score":"sum"}).reset_index()
print(grouped)
Output:
customer product score
0 A 1 1
1 B 2 9
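Then merge this DataFrame with the grouped result of the unfiltered DataFrame. Note that the code above overwrites df with the filtered rows, so compute the unfiltered aggregation first or keep the original under another name.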
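A minimal sketch of that merge step, assuming the conditional columns should be 0 for customers with no matching rows. The names `overall`, `only_a` and the `_product_a` suffix are illustrative, and the count is taken on score rather than product, which is equivalent here because there are no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['A', 'A', 'A', 'B', 'B', 'B'],
    'product': ['a', 'b', 'c', 'a', 'a', 'b'],
    'score': [1, 2, 3, 4, 5, 6],
})

# Unconditional count and sum per customer.
overall = df.groupby('customer')['score'].agg(['count', 'sum'])

# The same aggregation restricted to product 'a', without overwriting df.
only_a = (
    df[df['product'] == 'a']
    .groupby('customer')['score']
    .agg(['count', 'sum'])
    .add_suffix('_product_a')
)

# Left join on the customer index; customers without product 'a' get 0.
result = overall.join(only_a).fillna(0).astype(int)
print(result)
```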
I'm trying to apply an expanding function to a pandas dataframe by group, but first filtering out all zeroes as well as the last value of each group. The code below does exactly what I need, but is a bit too slow:
df.update(df.loc[~df.index.isin(df.groupby('group')['value'].tail(1).index)&
(df['value']!= 0)].iloc[::-1].groupby('group')[
'value'].expanding().min().reset_index(level=0, drop=True))
I found a faster way doing this using below code:
df.update(df.iloc[::-1].groupby('group')[
'value'].expanding().min().reset_index(level=0, drop=True),
filter_func = lambda x: (x!=0)&(x[-1]==False))
However, with the dataset I am currently working on, I receive a warning ("C:...\anaconda3\lib\site-packages\ipykernel_launcher.py:22: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.").
Strangely enough, I don't get an error using small dummy datasets such as this:
data = {'group':['A', 'A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'value':[3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)
df
group value
0 A 3
1 A 0
2 A 8
3 A 7
4 A 0
5 B -1
6 B 0
7 B 9
8 B -2
9 B 0
10 B 0
11 C 2
12 C 0
13 C 5
14 C 0
15 C 1
Grateful if someone can help me understand why this error is coming up and how to avoid it.
I believe your first snippet can be made faster with DataFrame.duplicated. The second snippet does not work for me:
m = df.duplicated('group', keep='last') & (df['value']!= 0)
s = df[m].iloc[::-1].groupby('group')['value'].expanding().min().reset_index(level=0,drop=True)
df.update(s)
#alternative, not sure if faster
#df['value'] = s.reindex(df.index, fill_value=0)
print (df)
group value
0 A 3.0
1 A 0.0
2 A 7.0
3 A 7.0
4 A 0.0
5 B -2.0
6 B 0.0
7 B -2.0
8 B -2.0
9 B 0.0
10 B 0.0
11 C 2.0
12 C 0.0
13 C 5.0
14 C 0.0
15 C 1.0
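To illustrate why `duplicated` can stand in for the `tail(1)` lookup, here is a small sketch on a shortened, made-up dataset. With `keep='last'`, every row except the last occurrence of each group is flagged, which is the same mask the question builds through `groupby(...).tail(1)`.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'value': [3, 0, 8, -1, 0, 2],
})

# True for every row that has a later row in the same group,
# i.e. for everything except the last row of each group.
not_last = df.duplicated('group', keep='last')

# The formulation from the question, for comparison.
last_idx = df.groupby('group')['value'].tail(1).index
not_last_slow = ~df.index.isin(last_idx)

print(not_last.tolist())
```

The `duplicated` version avoids a groupby and an index membership test, which is a plausible reason for it being faster, though the actual gain depends on the data.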
I want to group by id, apply a function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.
Example
dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})
def my_func(data):
data['diff'] = (data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0]))
return data
dat.groupby('id').apply(my_func)
Output
> print(dat)
id x diff
0 a 4 0
1 a 8 4
2 a 12 4
3 b 25 0
4 b 30 5
5 b 50 20
Is there a more efficient way to do this?
You can use groupby(...).diff() for this and then fill the resulting NaN values with zero, as follows:
dat['diff'] = dat.groupby('id').x.diff().fillna(0)
print(dat)
id x diff
0 a 4 0.0
1 a 8 4.0
2 a 12 4.0
3 b 25 0.0
4 b 30 5.0
5 b 50 20.0
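The output above is float because `diff()` puts NaN in the first row of each group. If the integer column from the question is wanted, one option (a sketch, assuming `x` holds integers with no missing values) is to cast back after filling:

```python
import pandas as pd

dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [4, 8, 12, 25, 30, 50]})

# diff() yields NaN for the first row of each group, forcing a float dtype.
# Filling with 0 and casting restores integers.
dat['diff'] = dat.groupby('id')['x'].diff().fillna(0).astype(int)
print(dat)
```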
I have the following pd.DataFrame:
import pandas as pd
df = pd.DataFrame({'name': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'x1': [1, 2, 3, 4, 1, 2, 3, 4],
'x2': [4, 3, 2, 1, 4, 3, 2, 1]
})
> df
name x1 x2
0 a 1 4
1 a 2 3
2 a 3 2
3 a 4 1
4 b 1 4
5 b 2 3
6 b 3 2
7 b 4 1
I would like to calculate a rolling mean of x1 and x2 with a window size of 2 and min_periods of 1. The mean should be grouped by name, and the input to the mean function should be shifted by one row; that is, the result in the row with index 2 should be calculated from rows (0, 1). So for x1 the rolling mean in row 2 should be (1+2)/2 = 1.5.
In Pandas version <= 0.18 I would do this:
> df.groupby('name').apply(lambda x: pd.rolling_mean(x.shift(1), window=2, min_periods=1))
x1 x2
0 NaN NaN
1 1.0 4.0
2 1.5 3.5
3 2.5 2.5
4 NaN NaN
5 1.0 4.0
6 1.5 3.5
7 2.5 2.5
This is exactly what I want: rows 0 and 4 have no preceding data within their name group, so the result there should be np.nan.
In Pandas 0.19 and later, rolling_mean and similar functions raise:
FutureWarning: pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with
DataFrame.rolling(min_periods=1,center=False,window=2).mean()
So in Pandas version >= 0.19 this is the best approach I could come up with:
df_shifted = df.groupby('name').apply(lambda x: x.shift(1))
> df_shifted.groupby('name').rolling(window=2, min_periods=1).mean()
name x1 x2
name
a 1 a 1.0 4.0
2 a 1.5 3.5
3 a 2.5 2.5
b 5 b 1.0 4.0
6 b 1.5 3.5
7 b 2.5 2.5
But this removes the nan-rows which I would like to keep for array-dimension reasons and returns a DataFrame with MultiIndex.
Is there a nice one-line-kind-of-way of solving this while keeping the nan-rows and returning a DataFrame with a flat index?
EDIT
The method should handle NaNs the way the 0.18 approach does. So if x1 = [np.nan, 2, 3, 4, 1, 2, 3, 4], the rolling mean at index 1 should be np.nan, but the rolling mean at index 2 should be 2.0, since (np.nan + 2)/1 -> 2.0; that is, the number of non-NaN values in the window is at least min_periods.
To avoid the deprecation warning, starting with version 0.19.1 you can rewrite the call as follows: first shift each group by one row, then compute the rolling mean.
df.groupby('name').apply(lambda x: x.shift().rolling(window=2, min_periods=1).mean())
# DataFrame.rolling(*args, **kwargs).mean()
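On recent pandas versions, `apply` on the whole group also passes the string `name` column through `rolling`, which may fail or change the index. A hedged variant that selects the numeric columns and uses `transform` is sketched below; it is an adaptation, not the code from the answer above, and `result` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'x1': [1, 2, 3, 4, 1, 2, 3, 4],
                   'x2': [4, 3, 2, 1, 4, 3, 2, 1]})

# Shift within each group, then take the rolling mean of the shifted values.
# transform returns a frame aligned with df, so the index stays flat and the
# leading NaN row of each group is kept.
result = df.groupby('name')[['x1', 'x2']].transform(
    lambda s: s.shift().rolling(window=2, min_periods=1).mean()
)
print(result)
```

Because the shift happens inside each group, the first row of every group has an empty window and stays NaN, which matches the 0.18 output shown in the question.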