Splitting data into batches conditional on cumulative sum - python

I'm trying to batch some data by start_date and end_date, conditional on the cumulative sum of num_books being <= 500000.
Say I have a simple data frame with two columns:
index  Date        num_books
0      2021-01-01  200000
1      2021-01-02  240000
2      2021-01-03  55000
3      2021-01-04  400000
4      2021-01-05  80000
5      2021-01-06  100000
I need to take a cumulative sum of the values in num_books while it stays <= 500000, and record the start date, end date and the cumsum value. This is an example of what I'm trying to achieve:
start_date  end_date    cumsum_books
2021-01-01  2021-01-03  495000
2021-01-04  2021-01-05  480000
2021-01-06  2021-01-06  100000
Is there an efficient way/function to achieve this? Thank you!

Here's one way:
import pandas as pd
from io import StringIO as sio

d = sio("""
index Date num_books
0 2021-01-01 200000
1 2021-01-02 240000
2 2021-01-03 55000
3 2021-01-04 400000
4 2021-01-05 80000
5 2021-01-06 100000
""")
df = pd.read_csv(d, sep='\s+')

batch_size = 5 * 10**5
# Integer-divide the running total by the batch size: a new batch id starts
# each time the cumulative sum crosses a multiple of 500_000.
df['batch_num'] = df['num_books'].cumsum() // batch_size
result = df.groupby('batch_num').agg(start_date=('Date', 'min'),
                                     end_date=('Date', 'max'),
                                     cumsum_books=('num_books', 'sum'))
print(result)
#           start_date    end_date  cumsum_books
# batch_num
# 0         2021-01-01  2021-01-03        495000
# 1         2021-01-04  2021-01-05        480000
# 2         2021-01-06  2021-01-06        100000
Note that a batch in the result can still exceed 500_000 (for instance when a single row's num_books is larger than the batch size), but it's trivial to drop/filter such rows out.
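For example, a minimal sketch of that filter, reusing batch_size from above:
result = result[result['cumsum_books'] <= batch_size]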

Related

Count number of occurrences in past 14 days of certain value

I have a pandas dataframe with a date column and an id column. For each line, I would like to return the number of occurrences of that line's id in the 14 days up to and including that line's date. For the data below, that means I would like to return "1, 2, 1, 2, 3, 4, 1". How can I do this? Performance is important since the dataframe has a length of around 200,000 rows. Thanks!
date         id
2021-01-01   1
2021-01-04   1
2021-01-05   2
2021-01-06   2
2021-01-07   1
2021-01-08   1
2021-01-28   1
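For reference, a minimal sketch that reconstructs this example frame (column names taken from the question):
import pandas as pd

df = pd.DataFrame({'date': ['2021-01-01', '2021-01-04', '2021-01-05', '2021-01-06',
                            '2021-01-07', '2021-01-08', '2021-01-28'],
                   'id': [1, 1, 2, 2, 1, 1, 1]})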
Assuming the input is sorted by date, you can use a GroupBy.rolling approach:
# only required if date is not datetime type
df['date'] = pd.to_datetime(df['date'])

(df.assign(count=1)
   .set_index('date')
   .groupby('id')
   .rolling('14d')['count'].sum()
   .sort_index(level='date').reset_index()  # optional if order is not important
)
output:
id date count
0 1 2021-01-01 1.0
1 1 2021-01-04 2.0
2 2 2021-01-05 1.0
3 2 2021-01-06 2.0
4 1 2021-01-07 3.0
5 1 2021-01-08 4.0
6 1 2021-01-28 1.0
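If you need the counts as a column of the original frame, one option (a sketch, assuming each (id, date) pair occurs at most once) is to merge the result back:
counts = (df.assign(count=1)
            .set_index('date')
            .groupby('id')
            .rolling('14d')['count'].sum()
            .reset_index())
df = df.merge(counts, on=['id', 'date'], how='left')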
I am not sure whether this is the best idea or not, but the code below is what I have come up with:
from datetime import timedelta

df["date"] = pd.to_datetime(df["date"])
newColumn = []
for index, row in df.iterrows():
    endDate = row["date"]
    startDate = endDate - timedelta(days=14)
    row_id = row["id"]
    summation = df[(df["date"] >= startDate) & (df["date"] <= endDate)
                   & (df["id"] == row_id)]["id"].count()
    newColumn.append(summation)
df["check_column"] = newColumn
df
Output
                 date  id  check_column
0 2021-01-01 00:00:00   1             1
1 2021-01-04 00:00:00   1             2
2 2021-01-05 00:00:00   2             1
3 2021-01-06 00:00:00   2             2
4 2021-01-07 00:00:00   1             3
5 2021-01-08 00:00:00   1             4
6 2021-01-28 00:00:00   1             1
Explanation
In this approach, I used iterrows to loop over the dataframe's rows, and timedelta to subtract 14 days from each row's date.

Cumulative sum that updates between two date ranges

I have data that looks like this: (assume start and end are date times)
id  start  end
1   01-01  01-02
1   01-03  01-05
1   01-04  01-07
1   01-06  NaT
1   01-07  NaT
I want to get a data frame that would include all dates, that has a 'cumulative sum' that only counts for the range they are in.
dates  count
01-01  1
01-02  0
01-03  1
01-04  2
01-05  1
01-06  2
01-07  3
One idea I thought of was simply using cumcount on the start dates, and doing a 'reverse cumcount' decreasing the counts using the end dates, but I am having trouble wrapping my head around doing this in pandas and I'm wondering whether there's a more elegant solution.
Here are two options. First, consider this data with only one id; note that your start and end columns must be datetime.
d = {'id': [1, 1, 1, 1, 1],
     'start': [pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-03'),
               pd.Timestamp('2021-01-04'), pd.Timestamp('2021-01-06'),
               pd.Timestamp('2021-01-07')],
     'end': [pd.Timestamp('2021-01-02'), pd.Timestamp('2021-01-05'),
             pd.Timestamp('2021-01-07'), pd.NaT, pd.NaT]}
df = pd.DataFrame(d)
To get your result, you can subtract the get_dummies of end from the get_dummies of start, then sum (in case several intervals start and/or end on the same date), take a cumsum along the dates, and reindex to get all the dates between the min and max available. Wrapped in a function:
def dates_cc(df_):
    return (
        pd.get_dummies(df_['start'], dtype=int)                     # +1 on each start date
        .sub(pd.get_dummies(df_['end'], dtype=int), fill_value=0)   # -1 on each end date
        .sum()                                                      # net change per date
        .cumsum()                                                   # running count of open intervals
        .to_frame(name='count')
        .reindex(pd.date_range(df_['start'].min(), df_['end'].max()),
                 method='ffill')                                    # fill in the missing dates
        .rename_axis('dates')
    )
Now you can apply this function to your dataframe
res = dates_cc(df).reset_index()
print(res)
# dates count
# 0 2021-01-01 1.0
# 1 2021-01-02 0.0
# 2 2021-01-03 1.0
# 3 2021-01-04 2.0
# 4 2021-01-05 1.0
# 5 2021-01-06 2.0
# 6 2021-01-07 2.0
Now if you have several id, like
df1 = df.assign(id=[1,1,2,2,2])
print(df1)
# id start end
# 0 1 2021-01-01 2021-01-02
# 1 1 2021-01-03 2021-01-05
# 2 2 2021-01-04 2021-01-07
# 3 2 2021-01-06 NaT
# 4 2 2021-01-07 NaT
then you can use the above function like
res1 = df1.groupby('id').apply(dates_cc).reset_index()
print(res1)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 2 2021-01-04 1.0
# 6 2 2021-01-05 1.0
# 7 2 2021-01-06 2.0
# 8 2 2021-01-07 2.0
That said, a more straightforward possibility is crosstab, which creates a row per id; the rest is about the same manipulation.
res2 = (
    pd.crosstab(index=df1['id'], columns=df1['start'])
    .sub(pd.crosstab(index=df1['id'], columns=df1['end']), fill_value=0)
    .reindex(columns=pd.date_range(df1['start'].min(), df1['end'].max()), fill_value=0)
    .rename_axis(columns='dates')
    .cumsum(axis=1)
    .stack()
    .reset_index(name='count')
)
print(res2)
# id dates count
# 0 1 2021-01-01 1.0
# 1 1 2021-01-02 0.0
# 2 1 2021-01-03 1.0
# 3 1 2021-01-04 1.0
# 4 1 2021-01-05 0.0
# 5 1 2021-01-06 0.0
# 6 1 2021-01-07 0.0
# 7 2 2021-01-01 0.0
# 8 2 2021-01-02 0.0
# 9 2 2021-01-03 0.0
# 10 2 2021-01-04 1.0
# 11 2 2021-01-05 1.0
# 12 2 2021-01-06 2.0
# 13 2 2021-01-07 2.0
The main difference between the two options is that this one creates extra dates for each id: for example, 2021-01-01 appears only under id=1, but with this version you also get that date for id=2, whereas the groupby version does not include it.
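If those extra dates are unwanted, one way to trim each id back to its own date span is a merge against per-id bounds (a sketch; bounds is a helper introduced here, and an id whose end dates are all NaT would need special handling):
bounds = df1.groupby('id').agg(lo=('start', 'min'), hi=('end', 'max')).reset_index()
res2 = res2.merge(bounds, on='id')
res2 = res2[res2['dates'].between(res2['lo'], res2['hi'])].drop(columns=['lo', 'hi'])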

Aggregate by week of daily column

Hello guys, I want to aggregate (sum) revenue by week. Note that the date column is daily. Does anyone know how to do this?
Thank you kindly,
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02',
                            '2021-01-03', '2021-01-04', '2021-01-05',
                            '2021-01-06', '2021-01-07', '2021-01-08',
                            '2021-01-09'],
                   'revenue': [5, 3, 2,
                               10, 12, 2,
                               1, 0, 6]})
df
date revenue
0 2021-01-01 5
1 2021-01-02 3
2 2021-01-03 2
3 2021-01-04 10
4 2021-01-05 12
5 2021-01-06 2
6 2021-01-07 1
7 2021-01-08 0
8 2021-01-09 6
Expected output:
2021-01-04    31
Use DataFrame.resample with an aggregate sum, but it is necessary to change the default closed side to left, together with the label parameter:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W-Mon', on='date', closed='left', label='left').sum()
print (df1)
revenue
date
2020-12-28 10
2021-01-04 31
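An equivalent formulation with groupby, in case you prefer it (a sketch; pd.Grouper accepts the same freq/closed/label options when grouping by a frequency):
df.groupby(pd.Grouper(key='date', freq='W-Mon', closed='left', label='left'))['revenue'].sum()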

Finding longest consecutive increase in Pandas

I have a dataframe:
Date Price
2021-01-01 29344.67
2021-01-02 32072.08
2021-01-03 33048.03
2021-01-04 32084.61
2021-01-05 34105.46
2021-01-06 36910.18
2021-01-07 39505.51
2021-01-08 40809.93
2021-01-09 40397.52
2021-01-10 38505.49
Date object
Price float64
dtype: object
And my goal is to find the longest consecutive period of growth.
It should return:
Longest consecutive period was from 2021-01-04 to 2021-01-08 with increase of $8725.32
and honestly I have no idea where to start with it. These are my first steps in pandas and I don't know which tools I should use to get this information.
Could anyone help me / point me in the right direction?
Detect your increasing sequences by taking the cumsum of a "decreasing" mask, so that each run of increases shares one label:
df['is_increasing'] = df['Price'].diff().lt(0).cumsum()
You would get:
Date Price is_increasing
0 2021-01-01 29344.67 0
1 2021-01-02 32072.08 0
2 2021-01-03 33048.03 0
3 2021-01-04 32084.61 1
4 2021-01-05 34105.46 1
5 2021-01-06 36910.18 1
6 2021-01-07 39505.51 1
7 2021-01-08 40809.93 1
8 2021-01-09 40397.52 2
9 2021-01-10 38505.49 3
Now, you can detect your longest sequence with
sizes=df.groupby('is_increasing')['Price'].transform('size')
df[sizes == sizes.max()]
And you get:
Date Price is_increasing
3 2021-01-04 32084.61 1
4 2021-01-05 34105.46 1
5 2021-01-06 36910.18 1
6 2021-01-07 39505.51 1
7 2021-01-08 40809.93 1
Something like what Quang did to split the groups, then pick the largest group:
s = df.Price.diff().lt(0).cumsum()
out = df.loc[s == s.value_counts().sort_values().index[-1]]
print(out)
Date Price
3 2021-01-04 32084.61
4 2021-01-05 34105.46
5 2021-01-06 36910.18
6 2021-01-07 39505.51
7 2021-01-08 40809.93
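To produce the exact sentence requested in the question, a minimal sketch using out from above (assuming Date holds strings as shown):
increase = out['Price'].iloc[-1] - out['Price'].iloc[0]
print(f"Longest consecutive period was from {out['Date'].iloc[0]} "
      f"to {out['Date'].iloc[-1]} with increase of ${increase:.2f}")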

Pandas column redefinition (extension) does not work

I have two data frames (let's say A and B) indexed by dates.
I define a column in B as following
B["column1"] = A.shift(1)
Later, when I add additional data to A and I want to update B, it doesn't work.
B["column1] = A.shift(1) still produces the same data before I added additional data to A.
How can I solve this issue?
Perform a df.reindex() before your assignment statement, as follows:
B = B.reindex(A.index)
Then, you can get your desired result with your code:
B["column1"] = A.shift(1)
Caution: If your dataframe B has other columns with values built with date indices other than the indices of dataframe A, reindexing in this way may cause loss of data in other columns of dataframe B. To overcome this, you can reindex B to the combined index of A and B with a union() function as follows:
B = B.reindex(A.index.union(B.index))
Demo Run
A_index = pd.date_range(start='2021/1/1', periods=8)
A = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], index=A_index)
print(A)
2021-01-01 10
2021-01-02 20
2021-01-03 30
2021-01-04 40
2021-01-05 50
2021-01-06 60
2021-01-07 70
2021-01-08 80
Freq: D, dtype: int64
B = pd.DataFrame()
B["column1"] = A.shift(1)
print(B)
column1
2021-01-01 NaN
2021-01-02 10.0
2021-01-03 20.0
2021-01-04 30.0
2021-01-05 40.0
2021-01-06 50.0
2021-01-07 60.0
2021-01-08 70.0
# Add data to A (Series.append was removed in pandas 2.0; use pd.concat instead)
A = pd.concat([A, pd.Series([100, 110, 120], index=pd.date_range(start='2021/1/21', periods=3))])
print(A)
2021-01-01 10
2021-01-02 20
2021-01-03 30
2021-01-04 40
2021-01-05 50
2021-01-06 60
2021-01-07 70
2021-01-08 80
2021-01-21 100 <= New data
2021-01-22 110 <= New data
2021-01-23 120 <= New data
dtype: int64
#Run new code
B = B.reindex(A.index)
#Run existing code
B["column1"] = A.shift(1)
print(B)
column1
2021-01-01 NaN
2021-01-02 10.0
2021-01-03 20.0
2021-01-04 30.0
2021-01-05 40.0
2021-01-06 50.0
2021-01-07 60.0
2021-01-08 70.0
2021-01-21 80.0 <= New data
2021-01-22 100.0 <= New data
2021-01-23 110.0 <= New data
Note that DataFrame.at will not help here: .at reads or writes a single scalar value by (row, column) label, so B.at["column1"] = A.shift(1) cannot assign a whole column. If B holds nothing but this column, the simplest fix is to rebuild it from A:
B = A.shift(1).to_frame(name="column1")
reference: DataFrame.at
