Aggregating on 5 minute windows in pyspark - python

I have the following dataframe df:
User | Datetime | amount | length
A | 2016-01-01 12:01 | 10 | 20
A | 2016-01-01 12:03 | 6 | 10
A | 2016-01-01 12:05 | 1 | 3
A | 2016-01-01 12:06 | 3 | 5
B | 2016-01-01 12:01 | 10 | 20
B | 2016-01-01 12:02 | 8 | 20
And I want to use pyspark to efficiently aggregate over a 5 minute time window and do some calculations - for example, compute the average amount & length for every user in every 5 minute window - so the df will look like this:
User | Datetime | amount | length
A | 2016-01-01 12:00 | 8 | 15
B | 2016-01-01 12:00 | 9 | 20
A | 2016-01-01 12:05 | 2 | 4
How can I achieve this in the most efficient way?
In pandas I used:
df.groupby(['cs_username', pd.TimeGrouper('5Min')]).apply(...)

Unfortunately, in pyspark this won't look as cool as it does in pandas ;-)
You can try casting date to timestamp and using modulo, for example:
import pyspark.sql.functions as F

seconds = 300  # 5-minute windows
# Truncate each timestamp down to the start of its 300-second window.
seconds_window = F.from_unixtime(F.unix_timestamp('Datetime') - F.unix_timestamp('Datetime') % seconds)
df = df.withColumn('5_minutes_window', seconds_window)
Then you can simply group by new column and perform requested aggregations.
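For the sample data above, that last step might look like this (a minimal sketch; the avg aggregation and column names are taken from the question, not from the original answer):

result = (df
    .groupBy('User', '5_minutes_window')
    .agg(F.avg('amount').alias('amount'),
         F.avg('length').alias('length')))
result.show()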

Related

How to create cumulative bins in dataframe?

I have a df which looks like this:
date | user_id | purchase_probability | sales
2020-01-01 | 1 | 0.19 | 10
2020-01-20 | 1 | 0.04 | 0
2020-01-01 | 3 | 0.31 | 5
2020-01-10 | 2 | 0.05 | 18
How can I best create a new dataframe that creates cumulative buckets in 10% increments such as:
probability_bin | total_users | total_sales
0-10%  | 2 | 18+0=18
0-20%  | 2 | 18+0+10=28
0-30%  | 2 | 28
0-40%  | 3 | 10+0+5+18=33
0-50%  | 3 | 33
0-60%  | 3 | 33
0-70%  | 3 | 33
0-80%  | 3 | 33
0-90%  | 3 | 33
0-100% | 3 | 33
I tried using a custom function and also pandas cut and qcut, but I'm not sure how to get to that cumulative output.
Any ideas are appreciated.
Use cut to create normal bins, then aggregate and cumsum:
import numpy as np
import pandas as pd

bins = np.arange(0, 101, 10)
labels = [f'0-{int(i)}%' for i in bins[1:]]
# Bin each probability into its 10%-wide bucket, labelled with the cumulative range.
group = pd.cut(df['purchase_probability'], bins=bins/100, labels=labels)
(df.groupby(group)
   .agg(total_users=('user_id', 'count'), total_sales=('sales', 'sum'))
   .cumsum()
)
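Note that count tallies rows, while the expected output counts distinct users (user 1 appears twice but is counted once). A minimal sketch of one way to match that, reusing the bins and labels above (this variant is an assumption, not part of the original answer):

# Count distinct users and total sales at each cumulative threshold directly.
result = pd.DataFrame({
    'total_users': [df.loc[df['purchase_probability'] <= t, 'user_id'].nunique()
                    for t in bins[1:] / 100],
    'total_sales': [df.loc[df['purchase_probability'] <= t, 'sales'].sum()
                    for t in bins[1:] / 100],
}, index=labels)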

add values on a negative pandas df based on condition date

I have a dataframe which contains the credit of a user; each row is how much credit they have on a given day.
A user loses 1 credit per day.
I need a way to code that, if a user has accumulated credit in the past, it will fill all the days when the credit was 0.
An example of refilling past credits:
import pandas as pd

data = pd.DataFrame({'credit_refill': [0, 0, 0, 2, 0, 0, 1],
                     'date': pd.date_range('01-01-2021', '01-07-2021')})
data['credit_after_consuming'] = data.credit_refill - 1
Looks like:
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 0 | 2021-01-01 00:00:00 | -1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 2 | 2021-01-04 00:00:00 | 1 |
| 4 | 0 | 2021-01-05 00:00:00 | -1 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
| 6 | 1 | 2021-01-07 00:00:00 | 0 |
The logic, as you can see, is that for the first three days the user would have credit -1, until the 4th of January, when the user gains 2 days of credit: one used that same day and the other consumed on the 5th.
In total there would be 3 days (the first ones) without credits.
If at the start of the week a user picks up 7 credits, the whole week is covered.
Another case would be
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 2 | 2021-01-01 00:00:00 | 1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 0 | 2021-01-04 00:00:00 | -1 |
| 4 | 1 | 2021-01-05 00:00:00 | 0 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
In this case the participant would run out of credits on the 3rd and 4th days, because they have 2 credits on the 1st: one consumed that same day and the other on the 2nd.
Then on the 5th they would refill and consume it the same day, running out of credits again on the 6th.
I feel it's like some variation of cumsum, but I can't manage to get the expected results.
I could sum all the days and fill the 0's with the accumulated credits, but I have to take into account that I can only refill with credits accumulated in the past.
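No answer is recorded here, but since credits can only be spent after they are picked up, one plausible sketch is a plain loop that carries the running balance forward (an assumption, not from the original thread; column names follow the example above):

balance = 0
covered = []
for refill in data['credit_refill']:
    balance += refill          # credits picked up this day
    has_credit = balance > 0   # is this day covered?
    if has_credit:
        balance -= 1           # one credit is consumed per covered day
    covered.append(has_credit)
data['covered'] = covered
# For the first example this marks days 4, 5 and 7 as covered,
# matching the logic described above.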

How to resample pandas timeseries df into new rows representing equal cumsum of some measurement?

Is it possible to resample timeseries data by "bins" of cumulative sum of some column? I mean if my raw df is:
+------+------------+-------+----------------+
| time | value | bool | someothervalue |
+------+------------+-------+----------------+
| 00:01| 3 | True | 5 |
| 00:03| 1 | True | 3 |
| 00:04| 2 | False | 6 |
| 00:20| 2 | True | 7 |
| 00:27| 4 | True | 4 |
| 00:28| 1 | False | 6 |
| 00:29| 1 | True | 7 |
| 00:30| 2 | True | 3 |
+------+------------+-------+----------------+
I would like to resample it "by value" so that every resampled row represents an aggregated value of 4:
+-------+-------+----------+-----------+--------------+-------------+
| start | end   | valuesum | truecount | somevaluesum | sampledrows |
+-------+-------+----------+-----------+--------------+-------------+
| 00:01 | 00:03 | 4        | 2         | 8            | 2           |
| 00:04 | 00:20 | 4        | 1         | 13           | 2           |
| 00:27 | 00:27 | 4        | 1         | 4            | 1           |
| 00:28 | 00:30 | 4        | 2         | 16           | 3           |
+-------+-------+----------+-----------+--------------+-------------+
My current solution is a "traditional" df.itertuples() loop, but it is very slow; my target dataset is hundreds of millions of rows and I have to resample it for many different intervals. I'm looking for an efficient solution similar to df.resample().ohlc(), but based on "value" intervals, not time intervals.
Edit: my example is oversimplified; my real data is float, so computing the modulo is harder, and I also need the first (open) and last (close) values in the resampled data. I promise not to oversimplify my problems in the future; this is my first SO question.
Let us try getting the modulo, then cumsum to create the group key; then we can do the groupby + agg:
# Rows where the running total of `value` hits a multiple of 4 close a group;
# reversing before the cumsum gives every row in a group the same key.
s = df.groupby(((df.value.cumsum() % 4) == 0).iloc[::-1].cumsum()).agg(
        st=('time', 'min'),
        end=('time', 'max'),
        truecount=('bool', 'sum'),
        somevaluesum=('someothervalue', 'sum'),
        sampledrows=('someothervalue', 'count'))
          st    end  truecount  somevaluesum  sampledrows
value
1      00:28  00:30        2.0            16            3
2      00:27  00:27        1.0             4            1
3      00:04  00:20        1.0            13            2
4      00:01  00:03        2.0             8            2
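For the float data and the open/close columns mentioned in the edit, one variant is to build the key by integer-dividing the running total that precedes each row, so no exact modulo match is needed (a sketch, an assumption rather than part of the original answer; with floats each bucket closes once at least 4 units have accumulated):

# Bucket id = how many whole "4 units" had accumulated before this row.
key = (df['value'].cumsum() - df['value']) // 4
res = df.groupby(key).agg(
        start=('time', 'min'),
        end=('time', 'max'),
        valuesum=('value', 'sum'),
        truecount=('bool', 'sum'),
        someopen=('someothervalue', 'first'),
        someclose=('someothervalue', 'last'),
        sampledrows=('someothervalue', 'count'))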

How to calculate average percentage change using groupby

I want to create a dataframe that calculates the average percentage change over a time period.
Target dataframe would look like this:
| | City | Ave_Growth |
|---|------|------------|
| 0 | A | 0.0 |
| 1 | B | -0.5 |
| 2 | C | 0.5 |
While simplified, the real data would be cities with average changes over the past 7 days.
Original dataset, df_bycity, looks like this:
| | City | Date | Case_Num |
|---|------|------------|----------|
| 0 | A | 2020-01-01 | 1 |
| 1 | A | 2020-01-02 | 1 |
| 2 | A | 2020-01-03 | 1 |
| 3 | B | 2020-01-01 | 3 |
| 4 | C | 2020-01-03 | 3 |
While simplified, this represents the real data. Some cities have fewer cases, some have more; in some cities there will be days with no reported cases. But I would like to calculate the average change over the last seven days from today.
I tried the following code but I'm not getting the results I want:
df_bycity.groupby(['City','Date']).pct_change()
Case_Num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Obviously I'm using either pct_change or groupby incorrectly; I'm just learning this.
Can anyone point me in the right direction? Thanks.
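No answer is recorded here, but one plausible sketch (an assumption, not from the original thread): grouping by both City and Date makes every group a single row, which is why pct_change returns all NaN. Group by City only, so pct_change runs along each city's dates, then average per city:

df_bycity = df_bycity.sort_values(['City', 'Date'])
# Percentage change within each city, in date order.
df_bycity['pct'] = df_bycity.groupby('City')['Case_Num'].pct_change()
ave_growth = (df_bycity.groupby('City')['pct']
              .mean()
              .fillna(0)  # a city with a single row has no measurable change
              .reset_index(name='Ave_Growth'))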

Difference of sum of consecutive years pandas

Suppose I have this pandas DataFrame df
Date | Year | Value
2017-01-01 | 2017 | 20
2017-01-12 | 2017 | 40
2018-01-12 | 2018 | 150
2019-10-10 | 2019 | 300
I want to calculate the difference between the total sum of Value per year between consecutive years. To get the total sum of Value per year I can do
df['YearlyValue'] = df.groupby('Year')['Value'].transform('sum')
which gives me
Date | Year | Value | YearlyValue
2017-01-01 | 2017 | 20 | 60
2017-01-12 | 2017 | 40 | 60
2018-01-12 | 2018 | 150 | 150
2019-10-10 | 2019 | 300 | 300
but how can I get a new column 'Increment' that holds the difference between the YearlyValue of consecutive years?
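No answer is recorded here, but a minimal sketch of one way to do it (an assumption, not from the original thread): diff the per-year sums, then map the result back onto the rows by year:

yearly = df.groupby('Year')['Value'].sum()  # 2017: 60, 2018: 150, 2019: 300
increment = yearly.diff()                   # 2017: NaN, 2018: 90, 2019: 150
df['Increment'] = df['Year'].map(increment)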
