Incremental spend in 6 weeks for two groups using pandas - python

I have Excel data with the following information:
df.head()
User_id Group Week Spend Purchases Group
170309867 Test 2014-10-13 794.66 2 Test-NonRed
57954586 Test 2014-10-13 55.99 1 Test-Red
118068583 Test 2014-10-13 40.87 1 Test-NonRed
516478393 Test 2014-10-13 17.5 2 Test-NonRed
457873235 Test 2014-10-13 20.44 1 Test-Red
From the above information, I need to calculate the incremental spend in the six weeks for the total Test group (test-Red and test-NonRed) vs. control. I need it in absolute ($) and % terms.
I have tried this with pandas:
df2 = df.groupby(by=['Group', 'Week']).sum().abs().groupby(level=[0]).cumsum()
And I got the following result:
df2.head()
Then I calculated the sum for each group:
df2.groupby(by=['group2']).sum()
df2.head()
I would like to have the incremental spend as an absolute value, which I tried with abs(), and I also need it as an absolute percentage. Any help would be much appreciated.
The expected result is the incremental spend over the six weeks for the total Test group (Test-Red and Test-NonRed) vs. Control, first in absolute spend and then as a percentage. Something like this:
Group incremental_spend incremental_%
Control 11450175 #%
test-NonRed 50288158 #%
test-Red 12043938 #%
So my real questions are:
1. Is the above approach the right way to calculate the incremental spend per Group over the 6 weeks in column Week, based on Spend?
2. Also, I need all my results as absolute counts and absolute percentages.

I think there are several problems here which make your question difficult to understand.
Vocabulary
What you describe as "Incremental spend" is just the sum.
What you compute in two steps is the sum of the cumulative sum, .cumsum().sum(), which is not the same thing (a small example below illustrates this).
Also I am not sure whether you need abs, which gives the absolute value (abs(-1) gives 1) and will thus only have an effect if there are negative values in your data.
Unfortunately the sample dataset is not large enough to draw a conclusion.
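To illustrate the .cumsum().sum() point on a toy series (a minimal sketch, not your data):
import pandas as pd

s = pd.Series([10, 20, 30])
print(s.sum())           # 60  -> the plain total
print(s.cumsum().sum())  # 10 + 30 + 60 = 100 -> sum of the running total, a different quantity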
Dataset
Your dataset has two columns with the identical name Group, which is error-prone.
Missing information
You want to get the final values (sums) as a ratio (%), but you do not indicate what the reference value for this ratio is.
Is it the sum of Spend for the Control group?
Potential solution
>>> df # Sample dataframe with one entry as 'Control' group
Out[]:
User_id Group Week Spend Purchases Group.1
0 170309867 Test 2014-10-13 794.66 2 Test-NonRed
1 57954586 Test 2014-10-13 55.99 1 Test-Red
2 118068583 Test 2014-10-13 40.87 1 Test-NonRed
3 516478393 Test 2014-10-13 17.50 2 Control
4 457873235 Test 2014-10-13 20.44 1 Test-Red
df2 = pd.DataFrame(df.groupby('Group.1').Spend.sum()) # Get 'Spend' sum for each group
>>> df2
Out[]:
Spend
Group.1
Control 17.50
Test-NonRed 835.53
Test-Red 76.43
control_spend_total = df2.loc['Control'].values # Get total spend for 'Control' group
>>> control_spend_total
Out[]: array([ 17.5])
df2['Spend_%'] = df2.Spend / control_spend_total * 100 # Add 'Spend_%' column
>>> df2
Out[]:
Spend Spend_%
Group.1
Control 17.50 100.000000
Test-NonRed 835.53 4774.457143
Test-Red 76.43 436.742857
Does it look like what you want?

Related

Pyspark efficiently create patterns within each window

I want to create a base dataframe from the existing one, which does not contain everything I want. For example, I have a dataframe collecting the number of candies each person (tracked by "id") bought each year-month (but in this case each person didn't buy candies every month):
id  year_month  num_of_candies_bought
1   2022-01     5
1   2022-03     10
1   2022-04     2
What I want is to track them against a fixed set of year-months I'm interested in, like this (for the first 5 months of this year):
id  year_month  num_of_candies_bought
1   2022-01     5
1   2022-02     0
1   2022-03     10
1   2022-04     2
1   2022-05     0
I think one way to do this is to use "crossjoin", but it turns out that this takes a long time to process. Is there any way to do this without a join? In my work the first dataframe is very, very large (a million rows, for instance) while the second is fixed (like in this case, only 5 rows) and much, much smaller. If a crossjoin is really needed, is it possible to improve performance drastically?
P.S. I want this done separately for each person (so I need something like Window.partitionBy).
I'd simply add a 0 (zero) line for each id and each year_month.
Let's assume df is your dataframe.
from pyspark.sql import functions as F
# generate a list of all year_month you need
year_month = ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]
df_id = (
    df.select("id")
    .distinct()
    .withColumn("num_of_candies_bought", F.lit(0))
    .withColumn("year_month", F.explode(F.array(*map(F.lit, year_month))))
)
df = (
    df.unionByName(df_id)
    .groupBy("id", "year_month")
    .agg(F.sum("num_of_candies_bought").alias("num_of_candies_bought"))
)
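For reference, a hedged end-to-end sketch with the sample data from the question (the SparkSession setup and the result variable name are assumptions, not part of the original snippet):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample input from the question
df = spark.createDataFrame(
    [(1, "2022-01", 5), (1, "2022-03", 10), (1, "2022-04", 2)],
    ["id", "year_month", "num_of_candies_bought"],
)

year_month = ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]
df_id = (
    df.select("id")
    .distinct()
    .withColumn("num_of_candies_bought", F.lit(0))
    .withColumn("year_month", F.explode(F.array(*map(F.lit, year_month))))
)
result = (
    df.unionByName(df_id)
    .groupBy("id", "year_month")
    .agg(F.sum("num_of_candies_bought").alias("num_of_candies_bought"))
)
result.orderBy("id", "year_month").show()
# Per the question's desired output:
# 2022-01 -> 5, 2022-02 -> 0, 2022-03 -> 10, 2022-04 -> 2, 2022-05 -> 0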

Replace outliers with groupby average in multi-index dataframe

I have the following multi-index data frame, where ID and Year are part of the multi-index. Some numbers for the variable ROA are unreasonable, so I want to replace every ROA value that is larger than the 99th percentile of ROA in the overall data frame with the average of its company (and the same for everything smaller than the 1st percentile).
ID Year ROA
1 2016 1.5
1 2017 0.8
1 2018 NaN
2 2016 0.7
2 2017 0.8
2 2018 0.4
In a different thread I found the following approach (Replace values based on multiple conditions with groupby mean in Pandas):
import numpy as np

mask = ((df['ROA'] > df['ROA'].quantile(0.99)) | (df['ROA'] < df['ROA'].quantile(0.01)))
df['ROA'] = np.where(~mask, df['ROA'], df.groupby('ID')['ROA'].transform('mean'))
However, this does not work for me. The maximum and minimum values of my data frame do not change. Does someone have an idea why this could be?
EDIT:
Alternatively, I thought of this approach:
df_outliers = df[(df['ROA'] < df['ROA'].quantile(0.01)) | (df['ROA'] > df['ROA'].quantile(0.99))]
for i in df_outliers.index:
    df.loc[(df.index.get_level_values('ID') == float(i[0])) &
           (df.index.get_level_values('Year') == float(i[1])), 'ROA'] = \
        float(df.query('ID == {} and Year != {}'.format(i[0], i[1])).ROA.mean())
However, here I run into the problem that some companies appear several times in df_outliers.index because their ROA is an outlier in several years. This defeats the purpose of the approach, because as it is currently written it only excludes one year from the calculation of the mean, not several.

Find number of clusters in time series frequency table

So I want to be able to know how many clusters are in a time series frequency table.
The input would be a date index with a frequency count, the kind of output you would get when using .resample('D').sum().
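For concreteness, such a table could be produced along these lines (a minimal sketch with made-up data, not my actual input):
import pandas as pd

# Hypothetical event log: one row per event, indexed by timestamp
events = pd.DataFrame(
    {"count": 1},
    index=pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-27"]),
)

# Daily frequency table; days with no events get a count of 0
daily = events["count"].resample("D").sum()
print(daily)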
Input Example:
Index       Count
01-01-2022  3
02-01-2022  4
03-01-2022  2
04-01-2022  2
05-01-2022  2
...         ...
27-01-2022  5
28-01-2022  4
29-01-2022  2
30-01-2022  3
31-01-2022  2
Assume the dates not shown (... in the table) are all frequency 0.
So essentially there are two clusters in the month of January 2022: the first cluster is at the beginning of the month and the second is at the end.
Cluster 1 is between date range 01-01-2022 and 05-01-2022.
Cluster 2 is between date range 27-01-2022 and 31-01-2022.
Do you know which clustering algorithm would allow me to get the number of clusters with this type of data?
Or is a clustering algorithm even necessary?
Thank you for your help

Pandas Error: Index contains duplicate entries, cannot reshape

My question may seem like a duplicate, as I found different questions with the same error, such as:
Pandas: grouping a column on a value and creating new column headings
Python/Pandas - ValueError: Index contains duplicate entries, cannot reshape
Pandas pivot produces "ValueError: Index contains duplicate entries, cannot reshape"
I tried all the solutions presented in those posts, but none worked. I believe the error may be caused by my dataset format, which has strings instead of numbers and possibly duplicate entries. Here is an example of my dataset:
protocol_no  activity  description
1586212      walk      twice a day
1586212      drive     5 km
1586212      drive     At least 30 min
1586212      sleep     NaN
1586212      eat       1500 calories
2547852      walk      NaN
2547852      drive     NaN
2547852      eat       3200 calories
2547852      eat       Avoid pasta
2547852      sleep     At least 10 hours
The output I'm trying to achieve is:
protocol_no  walk         drive  sleep              eat
1586212      twice a day  5km    NaN                1500 calories
2547852      NaN          NaN    At least 10 hours  3200 calories
I tried using pivot and pivot_table with code like this:
df.pivot(index="protocol_no", columns="activity", values="description")
But I'm still getting this error:
ValueError: Index contains duplicate entries, cannot reshape
I have no idea what is going wrong, so any help would be appreciated!
EDIT:
I noticed my data contains duplicate entries, as stated by the error and by users @DYZ and @SeaBean. So I've edited the dataset example and provided the correct answer for my dataset as well. Hope it helps someone.
Try using .pivot_table() with aggfunc='first' (or something similar) if you get a duplicate index error when using .pivot():
df.pivot_table(index="protocol_no", columns="activity", values="description", aggfunc='first')
This is a common situation when the column you set as index has duplicated values. Using aggfunc='first' (or sometimes aggfunc='sum', depending on the situation) will most probably solve the problem.
Result:
activity drive eat sleep walk
protocol_no
1586212 5 km 1500 calories NaN twice a day
2547852 NaN 3200 calories At least 10 hours NaN
Edit
Based on your latest edit with duplicate entries, you can just modify the solution above by changing the aggfunc, as follows:
df.pivot_table(index="protocol_no", columns="activity", values="description", aggfunc=lambda x: ' '.join(x.dropna()))
Here, we change the aggfunc from 'first' to lambda x: ' '.join(x.dropna()). It achieves the same result as your desired output without adding multiple lines of code.
Result:
activity drive eat sleep walk
protocol_no
1586212 5 km At least 30 min 1500 calories twice a day
2547852 3200 calories Avoid pasta At least 10 hours
Although the SeaBean answer worked on my data, I took a look at my data and noticed it really contained duplicated entries (as in the example in my question, which I edited later). To deal with this, the best solution is to join those duplicate entries.
1- Before the join, I needed to remove the NaNs from my dataset; otherwise it would raise another error:
df["description"].fillna("", inplace=True)
2- Then I executed the groupby function, joining the duplicate entries:
df = df.groupby(["protocol_no", "activity"], as_index=False).agg({"description": " ".join})
3- Last but not least, I executed the pivot as I had intended in my question:
df.pivot(index="protocol_no", columns="activity", values="description")
4- Voilà, the result:
protocol_no  drive                 eat                         sleep              walk
1586212      5 km At least 30 min  1500 calories                                  twice a day
2547852                            3200 calories Avoid pasta   At least 10 hours
5- The info of my dataset using df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1586212 to 2547852
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 drive 2 non-null object
1 eat 2 non-null object
2 sleep 2 non-null object
3 walk 2 non-null object
dtypes: object(4)
memory usage: 80.0+ bytes
Hope it helps someone, and many thanks to SeaBean and DYZ for their insights. :)

How to use Pandas rolling average without guaranteed number of observations

I am looking at annualized baseball statistics and would like to calculate a rolling mean looking back at the previous 3 years' worth of performance in regard to number of Hits. However, I want to account for the fact that while my dataset reaches back more than 3 years, one single player may have only been in the league for 1-2 years and will not have 3 years' worth of observations off of which I can calculate the rolling mean. For example:
In[6]: df = pd.DataFrame({'PLAYER_ID': ['A', 'A', 'A', 'B', 'B'],
'HITS': [45, 55, 50, 20, 24]})
In[9]: df
Out[9]:
PLAYER_ID HITS
0 A 45
1 A 55
2 A 50
3 B 20
4 B 24
How would I use a groupby and aggregation/transform (or some other process) to calculate the rolling mean for each player over a maximum of 3 years of historic totals, and just use the maximum available historic observations for a player with less than 3 years of historic performance data?
Pretty sure my answer lies within the Pandas package but would be interested in any solution.
Thanks!
pd.DataFrame.rolling handles this problem for you automatically. Using your example data, df.groupby('PLAYER_ID').rolling(1).mean() will give you:
HITS PLAYER_ID
PLAYER_ID
A 0 45.0 A
1 55.0 A
2 50.0 A
B 3 20.0 B
4 24.0 B
For your example case I'm using a window size of just 1, which means that we're treating each individual observation as its own mean. This isn't particularly interesting. With more data you can use a larger window size: for example, if your data is weekly, rolling(5) would give you an approximately monthly window size (or rolling(31) if your data is daily, and so on).
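If you specifically want a 3-observation window that still produces a value for players with fewer than 3 seasons, rolling also accepts a min_periods argument. A minimal sketch with the example data (not shown in the output above):
import pandas as pd

df = pd.DataFrame({'PLAYER_ID': ['A', 'A', 'A', 'B', 'B'],
                   'HITS': [45, 55, 50, 20, 24]})

# 3-observation window, falling back to whatever history is available
rolling_hits = (df.groupby('PLAYER_ID')['HITS']
                  .rolling(3, min_periods=1)
                  .mean())
print(rolling_hits)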
Two issues to be aware of when using this methodology:
If your data isn't sampled on a regular basis (e.g. if it skips a week or a month at a time), your rolling average won't be aligned in time. For this reason if your data isn't already regularly sampled you'll usually want to resample it.
If your data contains NaN values, those will be propagated: every window containing that NaN will also be NaN. You'll have to impute those values somehow to keep that from happening.
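A hedged sketch of both issues on made-up weekly data (the dates and values are illustrative only):
import pandas as pd
import numpy as np

# Irregular series: one week is missing entirely, one value is NaN
idx = pd.to_datetime(["2021-01-04", "2021-01-11", "2021-01-25", "2021-02-01"])
hits = pd.Series([45.0, np.nan, 50.0, 20.0], index=idx)

# Issue 1: re-align onto a regular weekly grid (the missing week becomes NaN)
weekly = hits.resample("W-MON").mean()

# Issue 2: impute the NaNs so they don't propagate through the rolling windows
weekly = weekly.interpolate()

print(weekly.rolling(3, min_periods=1).mean())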
