Pandas - transform with custom aggregation

Having the following data frame, of user activity across 2 days:
   user  score
0     A     10
1     A      0
2     B      5
I would like to calculate the average user score for that time and transform the result to all the rows:
import pandas as pd
df = pd.DataFrame({'user': ['A', 'A', 'B'],
                   'score': [10, 0, 5]})
df["avg"] = df.groupby(['user']).transform("sum")["score"]
df.head()
This gives me the sum for each user:
  user  score  avg
0    A     10   10
1    A      0   10
2    B      5    5
And now I would like to divide each score by the number of days (2) to get:
  user  score  avg
0    A     10    5
1    A      0    5
2    B      5  2.5
Can this be done on the same line where I calculated the sum?

You can divide the output Series by 2:
df = pd.DataFrame({'user': ['A', 'A', 'B'],
                   'score': [10, 0, 5]})
df["avg"] = df.groupby(['user']).transform("sum")["score"] / 2
print(df)
  user  score  avg
0    A     10  5.0
1    A      0  5.0
2    B      5  2.5
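If you'd rather keep everything in the groupby call itself, a lambda-based transform does the sum and the division in one step. This is a sketch; `n_days` is a name introduced here for the day count, not something from the question:

```python
import pandas as pd

df = pd.DataFrame({'user': ['A', 'A', 'B'],
                   'score': [10, 0, 5]})

n_days = 2  # number of days covered by the data
# transform broadcasts the per-user result back to every row
df["avg"] = df.groupby('user')['score'].transform(lambda s: s.sum() / n_days)
print(df)
```

The string form `transform("sum")` is generally faster on large frames; the lambda form just keeps the arithmetic in one expression.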

Here you can do something like that:
df["avg"] = df.groupby(['user']).transform("sum")["score"]/2
In [54]: df.head()
Out[54]:
user score avg
0 A 10 5.0
1 A 0 5.0
2 B 5 2.5

Related

Change numeric column based on category after group by

I have a df like this below:
dff = pd.DataFrame({'id':[1,1,2,2], 'categ':['A','B','A','B'],'cost':[20,5, 30,10] })
dff
   id categ  cost
0   1     A    20
1   1     B     5
2   2     A    30
3   2     B    10
What I want is to make a new df where I group by id, and the cost of category B gains 20% of the cost of category A, while at the same time category A loses that amount. I would like my desired output to be like this:
   id categ  cost
0   1     A    16
1   1     B     9
2   2     A    24
3   2     B    16
I have done this below, but it only reduces the cost of A by 20%. Any idea how to do what I want?
dff['cost'] = np.where(dff['categ'] == 'A', dff['cost'] * 0.8, dff['cost'])
Do a pivot, then modify and stack back:
s = dff.pivot(index='id', columns='categ', values='cost')
s['B'] = s['B'] + s['A'] * 0.2
s['A'] *= 0.8
s = s.stack().reset_index(name='cost')
s
Out[446]:
id categ cost
0 1 A 16.0
1 1 B 9.0
2 2 A 24.0
3 2 B 16.0
You can transform to broadcast the 'A' value to every row in the group and take 20% of it. Then, using map, you can subtract it for 'A' and add it for 'B':
s = dff['cost'].where(dff.categ.eq('A')).groupby(dff['id']).transform('first') * 0.2
dff['cost'] = dff['cost'] + s * dff['categ'].map({'A': -1, 'B': 1})
id categ cost
0 1 A 16.0
1 1 B 9.0
2 2 A 24.0
3 2 B 16.0
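If speed is not critical, the same transfer can also be spelled out with a plain groupby-apply, which keeps the intent explicit. This is a sketch; `shift_20pct` is a name of my choosing, not from the answers above:

```python
import pandas as pd

dff = pd.DataFrame({'id': [1, 1, 2, 2],
                    'categ': ['A', 'B', 'A', 'B'],
                    'cost': [20, 5, 30, 10]})

def shift_20pct(g):
    # 20% of this group's A cost moves from the A row to the B row
    a = g.loc[g['categ'].eq('A'), 'cost'].iloc[0]
    delta = 0.2 * a * g['categ'].map({'A': -1, 'B': 1})
    return g.assign(cost=g['cost'] + delta)

out = dff.groupby('id', group_keys=False).apply(shift_20pct)
print(out)
```

The vectorized `where`/`map` answer will be faster on large frames; the apply version is easier to extend if the transfer rule gets more complicated.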

Classifying according to number of consecutive values with pandas

I have a dataframe column with 1s and 0s like this:
df['working'] =
1
1
0
0
0
1
1
0
0
1
which represents when a machine is working (1) or stopped (0). I need to classify these stops depending on their length, i.e. if there are n or fewer consecutive 0s, change all of them to short-stop (2); if there are more than n, to long-stop (3). The expected result should look like this when applied to the example with n=2:
df[['working', 'result']]=
1 1
1 1
0 3
0 3
0 3
1 1
1 1
0 2
0 2
1 1
Of course, this is just an example; my df has more than 1M rows.
I tried looping through it, but it's really slow, and I also tried using this. But I couldn't adapt it to my problem.
Can anyone help? Thanks so much in advance.
Series.map with Series.value_counts should improve performance:
import numpy as np

n = 2
# mask of stopped (0) rows
m = df['working'].eq(0)
# the cumulative sum is constant within each run of 0s, so it serves as a group id
s = df['working'].cumsum()[m]
# map each 0 row to the size of its group
out = s.map(s.value_counts())
# set new values by mask
df['result'] = 1
df.loc[m, 'result'] = np.where(out > n, 3, 2)
print(df)
working result
0 1 1
1 1 1
2 0 3
3 0 3
4 0 3
5 1 1
6 1 1
7 0 2
8 0 2
9 1 1
Here's one approach:
# Counter for each group where there is a change
m = df.working.ne(df.working.shift()).cumsum()
# mask where working is 0
eq0 = df.working.eq(0)
# Get a count of consecutive 0s
count = df[eq0].groupby(m[eq0]).transform('count')
# replace 0s accordingly
df.loc[eq0, 'result'] = np.where(count > 2, 3, 2).ravel()
# fill the remaining values with 1
df['result'] = df.result.fillna(1)
print(df)
working result
0 1 1.0
1 1 1.0
2 0 3.0
3 0 3.0
4 0 3.0
5 1 1.0
6 1 1.0
7 0 2.0
8 0 2.0
9 1 1.0
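Both answers rest on the same run-labeling idiom: compare the column with a shifted copy and take the cumulative sum, so each run of equal values gets its own id. A sketch wrapping that in a reusable function (the function name is mine):

```python
import numpy as np
import pandas as pd

def classify_stops(working: pd.Series, n: int) -> pd.Series:
    """Return 1 for working rows, 2 for stops of length <= n, 3 for longer stops."""
    # a new run starts wherever the value changes
    run_id = working.ne(working.shift()).cumsum()
    # broadcast each run's length back to its rows
    run_len = working.groupby(run_id).transform('size')
    result = pd.Series(1, index=working.index)
    is_stop = working.eq(0)
    result[is_stop] = np.where(run_len[is_stop] > n, 3, 2)
    return result

s = pd.Series([1, 1, 0, 0, 0, 1, 1, 0, 0, 1])
print(classify_stops(s, n=2).tolist())  # [1, 1, 3, 3, 3, 1, 1, 2, 2, 1]
```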

Pandas - scoring column

I have data about product sales (1 column per product) at the customer level (1 row per customer).
I'm assessing which customers are more likely to be interested in a specific product. I have a list of the 10 most correlated products. (and I have this for multiple products, so I'm trying to build a scalable approach).
I'm trying to score all customers based on how many of those 10 products they buy.
Let's say my list is:
prod_x_corr_prod
How can I create a scoring column (say prox_x_propensity) which goes through the 10 relevant columns, for every row, and for each column with a value > 0 adds 1?
For instance, if customer Y bought 3 of the products correlated with product X, he would have a score of 3 in the "prox_x_score" column.
EDIT: thanks to all of you for the feedback.
For customer 5 I would get a 2, while for 1, 2, 3 I would get 1. For 4, 0.
You can do:
df['prox_x_score'] = (df[prod_x_corr_prod] > 0).sum(axis=1)
Example with dummy data:
import numpy as np
import pandas as pd
prod_x_corr_prod = ["prod{}".format(i) for i in range(1, 11)]
df = pd.DataFrame({col:np.random.choice([0,1], size=5) for col in prod_x_corr_prod})
df['prox_x_score'] = (df[prod_x_corr_prod] > 0).sum(axis=1)
print(df)
Output:
prod1 prod10 prod2 prod3 prod4 prod5 prod6 prod7 prod8 prod9 \
0 1 1 1 0 0 1 1 1 1 0
1 1 1 1 0 1 0 0 1 1 0
2 1 1 1 1 0 1 0 0 1 0
3 0 0 0 0 0 0 1 0 1 0
4 0 0 0 0 0 0 0 1 1 0
prox_x_score
0 7
1 6
2 6
3 2
4 2
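Since the question mentions scaling this to multiple products, the same one-liner can be driven from a dict mapping each product to its list of correlated columns. The dict contents and column names below are hypothetical:

```python
import numpy as np
import pandas as pd

# hypothetical correlated-product lists, one entry per product to score
corr_prods = {
    'prod_x': ['prod1', 'prod2', 'prod3'],
    'prod_y': ['prod2', 'prod4'],
}

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(5, 4)),
                  columns=['prod1', 'prod2', 'prod3', 'prod4'])

# one score column per product: how many of its correlated products were bought
for prod, cols in corr_prods.items():
    df[f'{prod}_score'] = df[cols].gt(0).sum(axis=1)
print(df)
```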

Pandas groupby treat nonconsecutive as different variables?

I want to treat non-consecutive ids as different variables during groupby, so that I can return the first value of stamp and the sum of increment as a new dataframe. Here is sample input and output.
import pandas as pd
import numpy as np
df = pd.DataFrame([np.array(['a','a','a','b','c','b','b','a','a','a']),
                   np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']
df_result = pd.DataFrame([np.array(['a','b','c','b','a']),
                          np.array([1,4,5,6,8]), np.array([3,1,1,2,3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
id stamp increment
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 1
4 c 5 1
5 b 6 1
6 b 7 1
7 a 8 1
8 a 9 1
9 a 10 1
In [3]: df_result
Out[3]:
id stamp increment_sum
0 a 1 3
1 b 4 1
2 c 5 1
3 b 6 2
4 a 8 3
I can accomplish this via
def get_result(d):
    total = d.increment.sum()
    stamp = d.stamp.min()
    name = d.id.max()
    return name, stamp, total
#idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
result = zip(*df.groupby([df.key]).apply(get_result))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution
Not the best in terms of optimal code, but it solves the problem:
> df_group = df.groupby('id')
We can't use id alone for the groupby, so add another column to group within id, based on whether the run is continuous or not:
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
id stamp increment group_diff
0 a 1 1 0
1 a 2 1 0
2 a 3 1 0
3 b 4 1 0
4 c 5 1 0
5 b 6 1 2
6 b 7 1 2
7 a 8 1 5
8 a 9 1 5
9 a 10 1 5
Now we can use the new column group_diff for secondary grouping. A sort is added at the end, as suggested in the comments, to get the exact expected order:
> df.groupby(['id','group_diff']).agg({'increment': 'sum', 'stamp': 'first'}).reset_index()[['id', 'stamp', 'increment']].sort_values('stamp')
id stamp increment
0 a 1 3
2 b 4 1
4 c 5 1
3 b 6 2
1 a 8 3
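A more direct version of the same idea uses the shift/cumsum run label with named aggregation (available in pandas >= 0.25), avoiding the apply and the intermediate zip entirely; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'id': list('aaabcbbaaa'),
                   'stamp': range(1, 11),
                   'increment': [1] * 10})

# consecutive rows with the same id share one run label
run = df['id'].ne(df['id'].shift()).cumsum()
df_result = (df.groupby(run)
               .agg(id=('id', 'first'),
                    stamp=('stamp', 'first'),
                    increment_sum=('increment', 'sum'))
               .reset_index(drop=True))
print(df_result)
```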

Pandas to calculate rolling aggregate rate

I'm trying to calculate a rolling aggregate rate for a time series.
The way to think about the data is that it is the results of a bunch of multigame series against a different teams. We don't know who wins the series until the last game. I'm trying to calculate the win rate as it evolves against each of the opposing teams.
series_id date opposing_team won_series
1 1/1/2000 a 0
1 1/3/2000 a 0
1 1/5/2000 a 1
2 1/4/2000 a 0
2 1/7/2000 a 0
2 1/9/2000 a 0
3 1/6/2000 b 0
Becomes:
series_id date opposing_team won_series percent_win_against_team
1 1/1/2000 a 0 NA
1 1/3/2000 a 0 NA
1 1/5/2000 a 1 100
2 1/4/2000 a 0 NA
2 1/7/2000 a 0 100
2 1/9/2000 a 0 50
3 1/6/2000 b 0 0
I still don't feel like I understand the rule for how you decide when a series is over. Is 3 over? Why is it NA? I would have thought 1/3. Still, here is a way to keep track of the number of completed series and a win rate.
Define 26472215table.csv:
series_id,date,opposing_team,won_series
1,1/1/2000,a,0
1,1/3/2000,a,0
1,1/5/2000,a,1
2,1/4/2000,a,0
2,1/7/2000,a,0
2,1/9/2000,a,0
3,1/6/2000,b,0
Code:
import pandas as pd
import numpy as np
df = pd.read_csv('26472215table.csv')
grp2 = df.groupby(['series_id'])
sr = grp2['date'].max()
sr.name = 'LastGame'
df2 = df.join( sr, on=['series_id'], how='left')
df2.sort_values('date')
df2['series_comp'] = df2['date'] == df2['LastGame']
df2['running_sr_cnt'] = df2.groupby(['opposing_team'])['series_comp'].cumsum()
df2['running_win_cnt'] = df2.groupby(['opposing_team'])['won_series'].cumsum()
winrate = lambda x: x['running_win_cnt'] / x['running_sr_cnt'] if x['running_sr_cnt'] > 0 else None
df2['winrate'] = df2[['running_sr_cnt','running_win_cnt']].apply(winrate, axis=1)
Results df2[['date', 'winrate']]:
date winrate
0 1/1/2000 NaN
1 1/3/2000 NaN
2 1/5/2000 1.0
3 1/4/2000 1.0
4 1/7/2000 1.0
5 1/9/2000 0.5
6 1/6/2000 0.0
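With current pandas, the same bookkeeping can be written without the join, using `sort_values` (the old `sort` method is gone) and making sure the sorted frame is actually reassigned before the cumulative counts run; a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'series_id':     [1, 1, 1, 2, 2, 2, 3],
    'date':          pd.to_datetime(['1/1/2000', '1/3/2000', '1/5/2000',
                                     '1/4/2000', '1/7/2000', '1/9/2000',
                                     '1/6/2000']),
    'opposing_team': ['a', 'a', 'a', 'a', 'a', 'a', 'b'],
    'won_series':    [0, 0, 1, 0, 0, 0, 0],
})

# a series counts as complete on its last game by date
df['series_comp'] = df['date'].eq(
    df.groupby('series_id')['date'].transform('max')).astype(int)

df = df.sort_values('date')  # cumulative counts must run in date order
done = df.groupby('opposing_team')['series_comp'].cumsum()
wins = df.groupby('opposing_team')['won_series'].cumsum()
df['winrate'] = (wins / done).where(done > 0)
print(df[['date', 'opposing_team', 'winrate']])
```

On this data the result matches the desired output in the question: the first three games by date have no completed series yet (NA), then the rates against team a evolve 1.0 → 0.5, and team b sits at 0.0.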
