I'm trying to apply an expanding function to a pandas dataframe by group, but first filtering out all zeroes as well as the last value of each group. The code below does exactly what I need, but is a bit too slow:
df.update(df.loc[~df.index.isin(df.groupby('group')['value'].tail(1).index) &
                 (df['value'] != 0)]
          .iloc[::-1]
          .groupby('group')['value']
          .expanding().min()
          .reset_index(level=0, drop=True))
I found a faster way to do this using the code below:
df.update(df.iloc[::-1]
          .groupby('group')['value']
          .expanding().min()
          .reset_index(level=0, drop=True),
          filter_func=lambda x: (x != 0) & (x[-1] == False))
However, with the dataset I am currently working on, I receive a warning ("C:...\anaconda3\lib\site-packages\ipykernel_launcher.py:22: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.").
Strangely enough, I don't get this warning with small dummy datasets such as this one:
data = {'group':['A', 'A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'value':[3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)
df
group value
0 A 3
1 A 0
2 A 8
3 A 7
4 A 0
5 B -1
6 B 0
7 B 9
8 B -2
9 B 0
10 B 0
11 C 2
12 C 0
13 C 5
14 C 0
15 C 1
I would be grateful if someone could help me understand why this warning comes up and how to avoid it.
I believe your first code can be improved with DataFrame.duplicated for better performance; the second code is not working for me:
m = df.duplicated('group', keep='last') & (df['value'] != 0)
s = (df[m].iloc[::-1].groupby('group')['value']
     .expanding().min().reset_index(level=0, drop=True))
df.update(s)
#alternative, not sure if faster
#df['value'] = s.reindex(df.index, fill_value=0)
print (df)
group value
0 A 3.0
1 A 0.0
2 A 7.0
3 A 7.0
4 A 0.0
5 B -2.0
6 B 0.0
7 B -2.0
8 B -2.0
9 B 0.0
10 B 0.0
11 C 2.0
12 C 0.0
13 C 5.0
14 C 0.0
15 C 1.0
I need to get an expanding mean grouped by name.
I already have this code:
data = {
'id': [1, 2, 3, 4, 5, 6, 7, 8],
'name': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'number': [1, 3, 5, 7, 9, 11, 13, 15]
}
df = pd.DataFrame(data)
df['mean_number'] = df.groupby('name')['number'].apply(
lambda s: s.expanding().mean().shift()
)
PS: I use .shift() so that the mean does not include the current row.
Resulting in this:
id name number mean_number
0 1 A 1 NaN
1 2 B 3 NaN
2 3 A 5 1.0
3 4 B 7 3.0
4 5 A 9 3.0
5 6 B 11 5.0
6 7 A 13 5.0
7 8 B 15 7.0
This works, but I only need the last result of each group.
id name number mean_number
6 7 A 13 5.0
7 8 B 15 7.0
I would like to know if it is possible to compute the mean for only these last rows, because on a very large dataset it takes a long time to compute the values for all rows and then filter only the last ones.
If you only need the last mean number of each group, you can just take the sum and count per group and calculate the values like this:
groups = df.groupby('name').agg(name=("name", "first"), s=("number", "sum"), c=("number", "count")).set_index("name")
groups
s c
name
A 28 4
B 36 4
Then you can use .tail() to get the last row of each group:
tail = df.groupby('name').tail(1).set_index("name")
tail
id number
name
A 7 13
B 8 15
Calculate the mean like this:
(groups.s - tail.number) / (groups.c - 1)
name
A 5.0
B 7.0
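Putting those pieces together, here is a minimal self-contained sketch (assuming the same 'name'/'number' columns; it recreates the frame from the question) that computes the shifted expanding mean only for the last row of each group:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'number': [1, 3, 5, 7, 9, 11, 13, 15]
})

# One aggregation per group: total, row count and the last value of 'number'
agg = df.groupby('name')['number'].agg(['sum', 'count', 'last'])

# The shifted expanding mean at the last row equals the mean of all earlier
# rows, i.e. (sum - last value) / (count - 1); a group of size 1 gives NaN,
# consistent with the shifted mean
agg['mean_number'] = (agg['sum'] - agg['last']) / (agg['count'] - 1)
print(agg['mean_number'])
# name
# A    5.0
# B    7.0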
I have the following df. I'd like to group by customer and then count and sum at the same time; I also wish to add conditional grouping. Is there any way to achieve this?
customer product score
A a 1
A b 2
A c 3
B a 4
B a 5
B b 6
My desired result is the following:
customer count sum count(product=a) sum(product=a)
A 3 6 1 1
B 3 15 2 9
My work so far is like this:
grouped = df.groupby('customer')
grouped.agg({"product": "count", "score": "sum"})
Thanks
Let us try crosstab:
s = pd.crosstab(df['customer'], df['product'], df['score'],
                margins=True, aggfunc=['sum', 'count']).drop('All')
Out[76]:
sum count
product a b c All a b c All
customer
A 1.0 2.0 3.0 6 1.0 1.0 1.0 3
B 9.0 6.0 NaN 15 2.0 1.0 NaN 3
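If you need exactly the four columns from the desired output, one way (a sketch, assuming the MultiIndex columns produced by the crosstab above) is to select and rename them:
out = s[[('count', 'All'), ('sum', 'All'), ('count', 'a'), ('sum', 'a')]]
out.columns = ['count', 'sum', 'count(product=a)', 'sum(product=a)']
print(out)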
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'product': ['a', 'b', 'c', 'a', 'a', 'b'],
                   'score': [1, 2, 3, 4, 5, 6]})
df = df[df['product'] == 'a']
grouped = df.groupby('customer')
grouped = grouped.agg({"product": "count", "score": "sum"}).reset_index()
print(grouped)
Output:
customer product score
0 A 1 1
1 B 2 9
Then merge this dataframe with the unfiltered grouped dataframe.
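For completeness, a sketch of that merge (recreating the unfiltered frame first, since df was overwritten by the filter above; the column names count_a and sum_a are just illustrative):
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'product': ['a', 'b', 'c', 'a', 'a', 'b'],
                   'score': [1, 2, 3, 4, 5, 6]})

# Unfiltered aggregation: count and sum over all products
overall = df.groupby('customer').agg(count=('product', 'count'),
                                     sum=('score', 'sum')).reset_index()

# Filtered aggregation: count and sum restricted to product == 'a'
filtered = (df[df['product'] == 'a']
            .groupby('customer')
            .agg(count_a=('product', 'count'), sum_a=('score', 'sum'))
            .reset_index())

# A left merge keeps customers that have no 'a' rows (count_a/sum_a become NaN)
result = overall.merge(filtered, on='customer', how='left')
print(result)
#   customer  count  sum  count_a  sum_a
# 0        A      3    6        1      1
# 1        B      3   15        2      9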
I am trying to set all values that are <= 0, by group, to the maximum value in that group, but only after the last positive value. That is, all values <=0 in the group that come before the last positive value must be ignored. Example:
data = {'group':['A', 'A', 'A', 'A', 'A', 'B', 'B',
'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'value':[3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)
df
group value
0 A 3
1 A 0
2 A 8
3 A 7
4 A 0
5 B -1
6 B 0
7 B 9
8 B -2
9 B 0
10 B 0
11 C 2
12 C 0
13 C 5
14 C 0
15 C 1
and the result must be:
group value
0 A 3
1 A 0
2 A 8
3 A 7
4 A 8
5 B -1
6 B 0
7 B 9
8 B 9
9 B 9
10 B 9
11 C 2
12 C 0
13 C 5
14 C 0
15 C 1
Thanks for any advice.
Start by adding a column to identify the rows with negative value (more precisely <= 0):
df['neg'] = (df['value'] <= 0)
Then, for each group, find the sequence of last few entries that have 'neg' set to True and that are contiguous. In order to do that, reverse the order of the DataFrame (with .iloc[::-1]) and then use .cumprod() on the 'neg' column. cumprod() will treat True as 1 and False as 0, so the cumulative product will be 1 as long as you're seeing all True's and will become and stay 0 as soon as you see the first False. Since we reversed the order, we're going backwards from the end, so we're finding the sequence of True's at the end.
df['upd'] = df.iloc[::-1].groupby('group')['neg'].cumprod().astype(bool)
Now that we know which entries to update, we just need to know what to update them to, which is the max of the group. We can use transform('max') on a groupby to get that value and then all that's left is to do the actual update of 'value' where 'upd' is set:
df.loc[df['upd'], 'value'] = df.groupby('group')['value'].transform('max')
We can finish by dropping the two auxiliary columns we used in the process:
df = df.drop(['neg', 'upd'], axis=1)
The result I got matches your expected result.
UPDATE: Or do the whole operation in a single (long!) line, without adding any auxiliary columns to the original DataFrame:
df.loc[
df.assign(
neg=(df['value'] <= 0)
).iloc[::-1].groupby(
'group'
)['neg'].cumprod().astype(bool),
'value'
] = df.groupby(
'group'
)['value'].transform('max')
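If you need this on more than one frame, a small wrapper (a sketch; the function name and keyword defaults are hypothetical) keeps the same logic reusable without mutating the input:
def fill_trailing_nonpositives(df, group_col='group', value_col='value'):
    # Same idea as above: mark the trailing run of values <= 0 at the end of
    # each group by reversing the frame and taking a grouped cumulative product
    out = df.copy()
    rev = out.iloc[::-1]
    upd = ((rev[value_col] <= 0).astype(int)
           .groupby(rev[group_col]).cumprod().astype(bool))
    # Align the mask back to the original row order before assigning
    upd = upd.reindex(out.index)
    out.loc[upd, value_col] = out.groupby(group_col)[value_col].transform('max')
    return out

result = fill_trailing_nonpositives(df)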
You can do it this way.
import numpy as np

df.loc[(df.assign(m=df['value'].lt(0)).groupby(['group'], sort=False)['m'].transform('any')) &
       (df.index >= df.groupby('group')['value'].transform('idxmin')), 'value'] = np.nan
df['value'] = df.groupby('group').ffill()
df
Output
group value
0 A 3.0
1 A 0.0
2 A 8.0
3 A 7.0
4 A 0.0
5 B -1.0
6 B 0.0
7 B 9.0
8 B 9.0
9 B 9.0
10 B 9.0
11 C 2.0
12 C 0.0
13 C 5.0
14 C 0.0
15 C 1.0
I want to group by id, apply a function to the data, and create a new column with the results. It seems there must be a faster/more efficient way to do this than to pass the data to the function, make the changes, and return the data. Here is an example.
Example
dat = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'x': [4, 8, 12, 25, 30, 50]})
def my_func(data):
    data['diff'] = data['x'] - data['x'].shift(1, fill_value=data['x'].iat[0])
    return data
dat.groupby('id').apply(my_func)
Output
> print(dat)
id x diff
0 a 4 0
1 a 8 4
2 a 12 4
3 b 25 0
4 b 30 5
5 b 50 20
Is there a more efficient way to do this?
You can use groupby.diff() for this and then fill the NaN values with zero, like the following:
dat['diff'] = dat.groupby('id').x.diff().fillna(0)
print(dat)
id x diff
0 a 4 0.0
1 a 8 4.0
2 a 12 4.0
3 b 25 0.0
4 b 30 5.0
5 b 50 20.0
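The fillna makes the column float; if you want the integer dtype shown in the question's expected output, a small variation (a sketch, assuming the differences are whole numbers) is:
dat['diff'] = dat.groupby('id').x.diff().fillna(0).astype(int)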
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
import numpy as np

df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
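To make that difference concrete, here is a small sketch with a hypothetical frame whose index contains a duplicated label; the label-based .loc repeat then over-selects, while the position-based version still repeats each row by its own n:
import numpy as np
import pandas as pd

# Hypothetical frame where both rows share the index label 0
dup = pd.DataFrame({'id': ['A', 'B'], 'n': [1, 2], 'v': [10, 13]}, index=[0, 0])

# Label-based: the repeated label 0 matches BOTH rows on every lookup,
# so we get 3 lookups x 2 matches = 6 rows instead of the intended 3
print(len(dup.loc[dup.index.repeat(dup['n'])]))                    # 6

# Position-based: repeats row 0 once and row 1 twice, as intended
print(len(dup.iloc[np.repeat(np.arange(len(dup)), dup['n'])]))     # 3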
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
f.id, f.n, f.v,
'A', 1, 10,
'B', 2, 13,
'C', 3, 8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the index.
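Putting it together, the full chain would look like this (a sketch, using the same frame as above):
out = (df.reindex(df.index.repeat(df.n))
         .drop('n', axis=1)
         .reset_index(drop=True))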