add values on a negative pandas df based on condition date - python

I have a dataframe which contains a user's credit; each row is how much credit was added on a given day.
A user loses 1 credit per day.
I need a way to code that credit accumulated in the past fills the days on which the refill was 0.
An example of refilling past credits:
import pandas as pd
from datetime import datetime
data = pd.DataFrame({'credit_refill': [0, 0, 0, 2, 0, 0, 1],
                     'date': pd.date_range('2021-01-01', '2021-01-07')})
data['credit_after_consuming'] = data.credit_refill - 1
Looks like:
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 0 | 2021-01-01 00:00:00 | -1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 2 | 2021-01-04 00:00:00 | 1 |
| 4 | 0 | 2021-01-05 00:00:00 | -1 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
| 6 | 1 | 2021-01-07 00:00:00 | 0 |
The logic: as you can see, for the first three days the user would have credit -1, until the 4th of January, when the user gains 2 days of credit, one used that same day and the other consumed on the 5th.
In total there would be 3 days (the first ones) without credits.
If at the start of the week a user picks up 7 credits, the whole week is covered.
Another case would be
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 2 | 2021-01-01 00:00:00 | 1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 0 | 2021-01-04 00:00:00 | -1 |
| 4 | 1 | 2021-01-05 00:00:00 | 0 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
In this case the participant would run out of credits on the 3rd and 4th days, because they have 2 credits on the 1st: one is consumed that same day and the other on the 2nd.
Then on the 5th they would refill and consume on the same day, running out of credits again on the 6th.
I feel it's some variation of cumsum, but I can't manage to get the expected results.
I could sum over all days and fill the 0s with the accumulated total, but I have to take into account that I can only refill with credits accumulated in the past.
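One way to express this carry-forward rule (a minimal sketch against the frame built above, assuming I read the logic correctly: surplus credit from earlier days carries over, but days without credit do not build up debt):

balance = 0
balances = []
for refill in data['credit_refill']:
    balance = max(balance, 0) + refill  # carry over any surplus, but not a shortage
    balance -= 1                        # one credit is consumed each day
    balances.append(balance)
data['balance_after_consuming'] = balances

For the first example this yields [-1, -1, -1, 1, 0, -1, 0], and for the second [1, 0, -1, -1, 0, -1]; a negative value marks a day on which the user ran out of credit, which matches the behaviour described above.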

Related

How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occurred per user per day. Users may have 0 events per day and (since the table is an aggregate from a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week so that each user has one entry per week (including 0 if applicable).
Here is an example of my input:
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the docs of reindex, I should be able to use it with level=1 as a kwarg directly, without having to do another groupby. When I do this, though, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every level of my desired index occurs in my current index (and vice versa).
You need to reindex by both levels of the MultiIndex:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])

result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print(result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0
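As an aside on the unstack route mentioned in the question: it can be made to work by reindexing the columns after unstacking, because fill_value then also covers weeks that are missing from the data entirely. A hedged sketch, reusing df and desired_index from above:

result_alt = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")])["event_count"].sum()
    .unstack(fill_value=0)                         # weeks become columns
    .reindex(columns=desired_index, fill_value=0)  # add weeks with no data at all
    .rename_axis(columns="date")                   # keep a column name for reset_index
    .stack()
    .reset_index(name="event_count")
)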

How to resample pandas timeseries df into new rows representing equal cumsum of some measurement?

Is it possible to resample timeseries data by "bins" of cumulative sum of some column? I mean if my raw df is:
+-------+-------+-------+----------------+
| time  | value | bool  | someothervalue |
+-------+-------+-------+----------------+
| 00:01 | 3     | True  | 5              |
| 00:03 | 1     | True  | 3              |
| 00:04 | 2     | False | 6              |
| 00:20 | 2     | True  | 7              |
| 00:27 | 4     | True  | 4              |
| 00:28 | 1     | False | 6              |
| 00:29 | 1     | True  | 7              |
| 00:30 | 2     | True  | 3              |
+-------+-------+-------+----------------+
I would like to resample it "by value" so that every resampled row represents an aggregated value of 4:
+-------+-------+----------+-----------+--------------+-------------+
| start | end   | valuesum | truecount | somevaluesum | sampledrows |
+-------+-------+----------+-----------+--------------+-------------+
| 00:01 | 00:03 | 4        | 2         | 8            | 2           |
| 00:04 | 00:20 | 4        | 1         | 13           | 2           |
| 00:27 | 00:27 | 4        | 1         | 4            | 1           |
| 00:28 | 00:30 | 4        | 2         | 16           | 3           |
+-------+-------+----------+-----------+--------------+-------------+
My current solution is a "traditional" df.itertuples() loop, but it is very slow: my target dataset is hundreds of millions of rows, and I have to resample it for many different intervals. I'm looking for an efficient solution similar to df.resample().ohlc(), but based on "value" intervals rather than time intervals.
Edit: my example is oversimplified; my real data is float, so counting modulo is harder, and I also need first (open) and last (close) values in the resampled data. I promise not to overly simplify my problems in the future, this is my first SO question.
Let us try getting the mod, then cumsum to create the key, and then we can do the groupby + agg:
s = df.groupby(((df.value.cumsum() % 4) == 0).iloc[::-1].cumsum()).agg(
    st=('time', 'min'),
    end=('time', 'max'),
    truecount=('bool', 'sum'),
    somevaluesum=('someothervalue', 'sum'),
    sampledrows=('someothervalue', 'count'))
st end truecount somevaluesum sampledrows
value
1 00:28 00:30 2.0 16 3
2 00:27 00:27 1.0 4 1
3 00:04 00:20 1.0 13 2
4 00:01 00:03 2.0 8 2
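Since the edit mentions float values (where an exact modulo check breaks down) and open/close columns, here is a hedged variant of the same idea: label each row with the ceiling of the running total divided by the target, so a bin closes on the row where the cumulative value reaches or passes the next multiple of the target, and add 'first'/'last' aggregations. This is only a sketch and assumes the rows are already sorted by time:

import numpy as np

target = 4
bins = np.ceil(df['value'].cumsum() / target)  # label grows by 1 whenever the running total passes a multiple of target

out = df.groupby(bins).agg(
    start=('time', 'min'),
    end=('time', 'max'),
    valuesum=('value', 'sum'),
    truecount=('bool', 'sum'),
    open=('someothervalue', 'first'),    # first (open) value of the bin
    close=('someothervalue', 'last'),    # last (close) value of the bin
    somevaluesum=('someothervalue', 'sum'),
    sampledrows=('someothervalue', 'count'))

On the integer sample above this reproduces the four groups of the desired output (00:01-00:03, 00:04-00:20, 00:27, 00:28-00:30).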

comparing values of 2 columns from same pandas dataframe & returning value of 3rd column based on comparison

I'm trying to compare values between 2 columns in the same pandas dataframe, and wherever a match is found I want to return the value from that row but from a 3rd column.
Basically, if the following is dataframe df:
| date | date_new | category | value |
| --------- | ---------- | -------- | ------ |
|2016-05-11 | 2018-05-15 | day | 1000.0 |
|2020-03-28 | 2018-05-11 | night | 2220.1 |
|2018-05-15 | 2020-03-28 | day | 142.8 |
|2018-05-11 | 2019-01-29 | night | 1832.9 |
I want to add a new column, say value_new, which is obtained by looking up every date value of date_new in the date column and then checking whether both rows have the same category value.
Steps of the transformation:
1. For each value in date_new, look for a match in date.
2. If a match is found, check whether the values in the category column also match.
3. If both matches in the above steps are fulfilled, pick the corresponding value from the value column of the matching row; otherwise leave it blank.
So, I would want the final dataframe to look something like this.
| date | date_new | category | value | value_new |
| --------- | ---------- | -------- | ------ | --------- |
|2016-05-11 | 2018-05-15 | day | 1000.0 | 142.8 |
|2020-03-28 | 2018-05-11 | night | 2220.1 | 1832.9 |
|2018-05-15 | 2020-03-28 | day | 142.8 | None |
|2018-05-11 | 2016-05-11 | day | 1832.9 | 1000.0 |
Use DataFrame.merge with a left join and assign the new column:
df['value_new'] = df.merge(df,
                           left_on=['date_new', 'category'],
                           right_on=['date', 'category'],
                           how='left')['value_y']
print (df)
date date_new category value value_new
0 2016-05-11 2018-05-15 day 1000.0 142.8
1 2020-03-28 2018-05-11 night 2220.1 NaN
2 2018-05-15 2020-03-28 day 142.8 NaN
3 2018-05-11 2016-05-11 day 1832.9 1000.0

How to calculate average percentage change using groupby

I want to create a dataframe that calculates the average percentage change over a time period.
Target dataframe would look like this:
| | City | Ave_Growth |
|---|------|------------|
| 0 | A | 0.0 |
| 1 | B | -0.5 |
| 2 | C | 0.5 |
While simplified, the real data would be cities with average changes over the past 7 days.
Original dataset, df_bycity, looks like this:
| | City | Date | Case_Num |
|---|------|------------|----------|
| 0 | A | 2020-01-01 | 1 |
| 1 | A | 2020-01-02 | 1 |
| 2 | A | 2020-01-03 | 1 |
| 3 | B | 2020-01-01 | 3 |
| 4 | C | 2020-01-03 | 3 |
While simplified, this represents the real data: some cities have fewer cases, some have more, and in some cities there will be days with no reported cases. I would like to calculate the average change over the last seven days from today.
I tried the following code but I'm not getting the results I want:
df_bycity.groupby(['City','Date']).pct_change()
Case_Num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Obviously I'm using either pct_change or groupby incorrectly. I'm just learning this.
Can anyone point me in the right direction? Thanks.
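One possible direction (a minimal sketch, assuming df_bycity as above): grouping by ['City', 'Date'] makes every group a single row, so pct_change has no previous row to compare against and returns NaN everywhere. Grouping by City only, with rows sorted by Date, lets pct_change run day over day within each city, after which the changes can be averaged; to restrict the calculation to the last seven days, filter on Date first.

df_bycity = df_bycity.sort_values(['City', 'Date'])
# optionally keep only the last seven days, e.g.:
# df_bycity = df_bycity[df_bycity['Date'] >= df_bycity['Date'].max() - pd.Timedelta(days=7)]
df_bycity['pct_change'] = df_bycity.groupby('City')['Case_Num'].pct_change()
ave_growth = df_bycity.groupby('City')['pct_change'].mean().reset_index(name='Ave_Growth')

Note that a city with a single reported day has nothing to compare against, so its average comes out as NaN rather than 0.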

Aggregating on 5 minute windows in pyspark

I have the following dataframe df:
User | Datetime         | amount | length
A    | 2016-01-01 12:01 | 10     | 20
A    | 2016-01-01 12:03 | 6      | 10
A    | 2016-01-01 12:05 | 1      | 3
A    | 2016-01-01 12:06 | 3      | 5
B    | 2016-01-01 12:01 | 10     | 20
B    | 2016-01-01 12:02 | 8      | 20
And I want to use pyspark to efficiently aggregate over a 5-minute time window and do some calculations - for example, calculate the average amount & length for every user in every 5-minute time window - so the df will look like this:
User | Datetime         | amount | length
A    | 2016-01-01 12:00 | 8      | 15
B    | 2016-01-01 12:00 | 2      | 4
A    | 2016-01-01 12:05 | 9      | 20
How can I achieve this in the most efficient way?
In pandas I used:
df.groupby(['cs_username', pd.TimeGrouper('5Min')]).apply(...)
Unfortunately, in pyspark this won't look as cool as in pandas ;-)
You can try casting date to timestamp and using modulo, for example:
import pyspark.sql.functions as F
seconds = 300
seconds_window = F.from_unixtime(F.unix_timestamp('date') - F.unix_timestamp('date') % seconds)
dataframe.withColumn('5_minutes_window', seconds_window)
Then you can simply group by new column and perform requested aggregations.
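For instance, a hedged sketch of that last step (the column names 'User', 'Datetime', 'amount' and 'length' are taken from the question, and 'Datetime' is assumed to be a timestamp or a string that unix_timestamp can parse):

import pyspark.sql.functions as F

seconds = 300  # 5-minute buckets
seconds_window = F.from_unixtime(F.unix_timestamp('Datetime') - F.unix_timestamp('Datetime') % seconds)

result = (
    df.withColumn('5_minutes_window', seconds_window)
      .groupBy('User', '5_minutes_window')
      .agg(F.avg('amount').alias('avg_amount'),
           F.avg('length').alias('avg_length'))
)

Newer Spark versions also provide F.window('Datetime', '5 minutes'), which performs the same bucketing directly and returns it as a struct column.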
