How to create cumulative bins in a dataframe? - python

I have a df which looks like this:
date | user_id | purchase_probability | sales
2020-01-01 | 1 | 0.19 | 10
2020-01-20 | 1 | 0.04 | 0
2020-01-01 | 3 | 0.31 | 5
2020-01-10 | 2 | 0.05 | 18
How can I best create a new dataframe with cumulative buckets in 10% increments, such as:
probability_bin | total_users | total_sales
0-10% | 2 | 18+0=18
0-20% | 2 | 18+0+10=28
0-30% | 2 | 28
0-40% | 3 | 10+0+5+18=33
0-50% | 3 | 33
0-60% | same for all rows below
0-70%
0-80%
0-90%
0-100%
I tried using a custom function and also pandas cut and qcut, but I'm not sure how to get to that cumulative output.
Any ideas are appreciated.

Use cut to create normal bins, then aggregate and cumsum:
import numpy as np
import pandas as pd

bins = np.arange(0, 101, 10)
labels = [f'0-{int(i)}%' for i in bins[1:]]
group = pd.cut(df['purchase_probability'], bins=bins/100,
               labels=labels, include_lowest=True)
out = (df.groupby(group, observed=False)
         .agg(total_users=('user_id', 'count'), total_sales=('sales', 'sum'))
         .cumsum())
Note that 'count' counts rows, so a user whose rows fall into several bins contributes to each of them; to count distinct users you would need to deduplicate first.
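If total_users should count each distinct user only once across the cumulative buckets (per the expected output, user 1 appears in both the 0-10% and 0-20% rows but the user total stays at 2), one sketch is to sort by probability and count only each user's first appearance. The names `ordered`, `is_new_user`, and `out` are my own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-20', '2020-01-01', '2020-01-10'],
    'user_id': [1, 1, 3, 2],
    'purchase_probability': [0.19, 0.04, 0.31, 0.05],
    'sales': [10, 0, 5, 18],
})

bins = np.arange(0, 101, 10)
labels = [f'0-{i}%' for i in bins[1:]]

# sort so a user's "first appearance" is at their lowest probability,
# then flag only that first row for counting
ordered = df.sort_values('purchase_probability')
ordered['is_new_user'] = ~ordered['user_id'].duplicated()

group = pd.cut(ordered['purchase_probability'], bins=bins / 100,
               labels=labels, include_lowest=True)

# per-bin sums, then a cumulative sum down the ordered bins
out = (ordered.groupby(group, observed=False)
       .agg(total_users=('is_new_user', 'sum'), total_sales=('sales', 'sum'))
       .cumsum())
```

On the sample data this reproduces the expected table: 2 users / 18 sales at 0-10%, 2 / 28 at 0-20%, and 3 / 33 from 0-40% onward.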


How can I get percentile of column in dataframe considering only previous values? (Python)

I have a dataframe with a numeric column and I would like to calculate the percentile of the value in each row of that column, considering only the rows up to and including the current one. Here is an example:
+-------+
| col_1 |
+-------+
| 5 |
+-------+
| 4 |
+-------+
| 10 |
+-------+
| 1 |
+-------+
| 7 |
+-------+
I would like to obtain a dataframe like this:
+-------+------------+
| col_1 | percentile |
+-------+------------+
| 5 | 100 |
+-------+------------+
| 4 | 50 |
+-------+------------+
| 10 | 100 |
+-------+------------+
| 1 | 25 |
+-------+------------+
| 7 | 80 |
+-------+------------+
How can I calculate it?
Try as follows.
Use df.expanding with min_periods=1 to allow expanding window calculations.
For each window, we apply Expanding.rank with pct=True (and we multiply by 100).
We can assign the result directly to the new column percentile:
import pandas as pd
data = {'col_1':[5,4,10,1,7]}
df = pd.DataFrame(data)
df['percentile'] = df['col_1'].expanding(min_periods=1).rank(pct=True).mul(100)
print(df)
col_1 percentile
0 5 100.0
1 4 50.0
2 10 100.0
3 1 25.0
4 7 80.0
Update: Expanding.rank was added to pandas in version 1.4.0. The same result can also be derived from separate rank and count aggregations:
temp = df['col_1'].expanding(min_periods=1).agg(['rank','count'])
df['percentile'] = (temp['rank']/temp['count']).mul(100)
print(df)
col_1 percentile
0 5 100.0
1 4 50.0
2 10 100.0
3 1 25.0
4 7 80.0
Or, for earlier pandas versions, as a one-liner using Expanding.apply:
df['percentile'] = df['col_1'].expanding(min_periods=1)\
    .apply(lambda x: (x.rank()/x.count()).to_numpy()[-1]*100)
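The rank/count formula implements the inclusive ("weak") percentile: the share of values so far that are less than or equal to the current one (the two definitions coincide here because the sample has no ties). A plain-Python cross-check, with names of my own choosing:

```python
vals = [5, 4, 10, 1, 7]
percentiles = []
for i, v in enumerate(vals):
    window = vals[:i + 1]  # all values up to and including the current row
    percentiles.append(100 * sum(w <= v for w in window) / len(window))

print(percentiles)  # [100.0, 50.0, 100.0, 25.0, 80.0]
```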

How do I transpose and aggregate this dataframe in right order?

I am trying to find an efficient way to create a dataframe that lists all distinct game values as columns, in rank order, and aggregates the game play hours by user_id. This is my example df:
user_id | game | game_hours | rank_order
1 | Fortnight | 1.5 | 1
1 | COD | 0.5 | 2
1 | Horizon | 1.7 | 3
1 | ... | ... | n
2 | Fifa2021 | 1.9 | 1
2 | A Way Out | 0.2 | 2
2 | ... | ... | n
...
Step 1: How do I get from this to the following df format (matching the rank order correctly, since it reflects the time sequence)?
user_id | game_1 | game_2 | game_3 | game_n ...| game_hours
1 | Fortnight | COD | Horizon| | 3.7
2 | Fifa2021 | A Way Out | | | 2.1
...
Use DataFrame.pivot with DataFrame.add_prefix, and add the new column with DataFrame.assign using a sum aggregation:
df = (df.pivot(index='user_id', columns='rank_order', values='game')
        .add_prefix('game_')
        .assign(game_hours=df.groupby('user_id')['game_hours'].sum())
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
user_id game_1 game_2 game_3 game_hours
0 1 Fortnight COD Horizon 3.7
1 2 Fifa2021 A Way Out NaN 2.1
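One caveat: DataFrame.pivot raises a ValueError if any (user_id, rank_order) pair repeats. If duplicates are possible in your real data, a sketch using pivot_table with aggfunc='first' (my assumption about which duplicate to keep) avoids that:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'game': ['Fortnight', 'COD', 'Horizon', 'Fifa2021', 'A Way Out'],
    'game_hours': [1.5, 0.5, 1.7, 1.9, 0.2],
    'rank_order': [1, 2, 3, 1, 2],
})

# pivot_table tolerates duplicate index/column pairs; 'first' keeps
# the first occurrence of each (user_id, rank_order) cell
wide = (df.pivot_table(index='user_id', columns='rank_order',
                       values='game', aggfunc='first')
          .add_prefix('game_')
          .assign(game_hours=df.groupby('user_id')['game_hours'].sum())
          .reset_index()
          .rename_axis(None, axis=1))
```

On duplicate-free data this produces the same result as the pivot approach above.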

Converting groupby pandas df of absolute numbers to percentage of row totals

I have some data in my dataframe df that shows the 2 categories a user belongs to, and I want to see the number of users for each category pair expressed as a percentage of the row total.
Original dataframe df:
+------+------+--------+
| cat1 | cat2 | user |
+------+------+--------+
| A | X | 687568 |
| A | Y | 68575 |
| B | Y | 56478 |
| A | X | 6587 |
| A | Y | 45678 |
| B | X | 5678 |
| B | X | 967 |
| A | X | 345 |
+------+------+--------+
I convert this to a groupby df using:
df2 = (df.groupby(['cat1', 'cat2'])
         .agg({'user': 'nunique'})
         .reset_index()
         .pivot(index='cat1', columns='cat2', values='user'))
to get the pairwise calculation of the number of users per combination of categories (numbers here are made up):
+------+----+----+
| cat2 | X | Y |
+------+----+----+
| cat1 | | |
+------+----+----+
| A | 5 | 5 |
| B | 10 | 40 |
+------+----+----+
And I would like to convert the numbers to percent totals of the rows (Cat1), e.g. for the first row, 5/(5+5) = 0.5 and so on to give:
+------+-----+-----+
| cat2 | X | Y |
+------+-----+-----+
| cat1 | | |
| A | 0.5 | 0.5 |
| B | 0.2 | 0.8 |
+------+-----+-----+
Would I have to create a new column in my grouped df that contains the row-wise sum, and then iterate through each value in a row and divide it by that total?
You can simplify your expression:
piv = df.pivot_table(values='user', index='cat1', columns='cat2', aggfunc='nunique')
pct = piv.div(piv.sum(axis=1), axis=0)
Output:
>>> piv
cat2 X Y
cat1
A 3 2
B 2 1
>>> pct
cat2 X Y
cat1
A 0.600000 0.400000
B 0.666667 0.333333
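An equivalent one-step alternative is pd.crosstab, whose normalize='index' option divides each row by its total after aggregation. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'cat1': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'A'],
    'cat2': ['X', 'Y', 'Y', 'X', 'Y', 'X', 'X', 'X'],
    'user': [687568, 68575, 56478, 6587, 45678, 5678, 967, 345],
})

# count distinct users per (cat1, cat2) pair, then normalize each row
pct = pd.crosstab(df['cat1'], df['cat2'], values=df['user'],
                  aggfunc='nunique', normalize='index')
```

This yields the same row-percentage table as piv.div(piv.sum(axis=1), axis=0).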

How to calculate average percentage change using groupby

I want to create a dataframe that calculates the average percentage change over a time period.
Target dataframe would look like this:
| | City | Ave_Growth |
|---|------|------------|
| 0 | A | 0.0 |
| 1 | B | -0.5 |
| 2 | C | 0.5 |
While simplified, the real data would be cities with average changes over the past 7 days.
Original dataset, df_bycity, looks like this:
| | City | Date | Case_Num |
|---|------|------------|----------|
| 0 | A | 2020-01-01 | 1 |
| 1 | A | 2020-01-02 | 1 |
| 2 | A | 2020-01-03 | 1 |
| 3 | B | 2020-01-01 | 3 |
| 4 | C | 2020-01-03 | 3 |
While simplified, this represents real data. Some cities have fewer cases, some have more, and in some cities there will be days with no reported cases. But I would like to calculate the average change over the last seven days from today. I've simplified here.
I tried the following code but I'm not getting the results I want:
df_bycity.groupby(['City','Date']).pct_change()
Case_Num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Obviously I'm using either pct_change or groupby incorrectly. I'm just learning this.
Can anyone point me in the right direction? Thanks.
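A sketch of one possible approach: grouping by ['City', 'Date'] puts every row in its own one-row group, which is why every pct_change came back NaN; the change has to be computed within each city down the date axis. The column name `growth` is my own, and cities with a single observation come out as NaN rather than the -0.5/0.5 values in the target (those aren't derivable from the sample):

```python
import pandas as pd

df_bycity = pd.DataFrame({
    'City': ['A', 'A', 'A', 'B', 'C'],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03',
                            '2020-01-01', '2020-01-03']),
    'Case_Num': [1, 1, 1, 3, 3],
})

# compute day-over-day change within each city, in date order
df_bycity = df_bycity.sort_values(['City', 'Date'])
df_bycity['growth'] = df_bycity.groupby('City')['Case_Num'].pct_change()

# then average the changes per city
ave_growth = (df_bycity.groupby('City')['growth'].mean()
              .reset_index(name='Ave_Growth'))
```

To restrict the average to the last seven days, you could first filter, e.g. `df_bycity[df_bycity['Date'] >= df_bycity['Date'].max() - pd.Timedelta(days=7)]`.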

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ] I would like to add prev_month_mean_sale and prev_month_id_sale columns
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
also I would like to add prev_year_mean_sale and prev_year_id_sale
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use merge() on the dataframe:
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result will have more columns than you need. You should drop() some of them and/or rename() some others.
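The same merge idea extends to prev_month_mean_sale. A sketch, assuming the mean is over all products' sales in the previous month (as in the (5+4+2)/3 example); year boundaries, e.g. January looking back to the previous December, are left out for brevity, and `monthly_mean` is my own name:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'year': [2018] * 8,
    'month': [1, 1, 1, 2, 2, 3, 3, 3],
    'product_id': [123, 234, 345, 123, 345, 123, 234, 567],
    'sales': [5, 4, 2, 3, 2, 4, 6, 7],
})

# mean sales per (year, month)
monthly_mean = (df.groupby(['year', 'month'], as_index=False)['sales'].mean()
                  .rename(columns={'sales': 'prev_month_mean_sale'}))
# shift forward: the mean of month m applies to rows of month m + 1
monthly_mean['month'] += 1

result = df.merge(monthly_mean, on=['year', 'month'], how='left')
```

Rows in the first month get NaN, matching the expected output; month 2 rows get (5+4+2)/3 and month 3 rows get (3+2)/2 = 2.5.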
