Pandas: how to sum the monthly amount for each id - python

I have data like this.
id date amount
2 2018-04-03 10
2 2018-04-22 20
3 2018-01-21 20
4 2018-03-13 10
4 2018-04-19 30
I want to sum the amount for each month and each id, so the result will be like this. The months are not the same for each id.
id date amount
2 2018-04 30
3 2018-01 20
4 2018-03 10
4 2018-04 30
I tested this code:
df['amount'].groupby(df['date'].dt.to_period('M')).sum()
The result is:
pos_dt
2018-04 60
2018-01 20
2018-03 10
It is not grouped separately by id. How can I fix it?

You need to group both by the id and the month, so you can calculate this with:
df.groupby(['id', df['date'].dt.to_period('M')]).sum()
For example:
>>> df.groupby(['id', df['date'].dt.to_period('M')]).sum()
              amount
id date
2  2018-04       30
3  2018-01       20
4  2018-03       10
   2018-04       30
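If you want the result back as a flat DataFrame with id, date and amount columns, like the desired output, a reset_index at the end should do it. A small sketch of the same groupby, selecting just the amount column:
out = (df.groupby(['id', df['date'].dt.to_period('M')])['amount']
         .sum()
         .reset_index())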

Related

Grouping of a dataframe monthly after calculating the highest daily values

I've got a dataframe with two columns: one is a datetime column consisting of dates, and the other consists of quantities. It looks something like this:
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe. It should consist of two columns: one is Month/Year and the other is Till Highest. I basically want to calculate the highest quantity value up to and including that month, grouped by month/year. An example of exactly what I want is:
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case the dataset is vast, and I have readings for almost every day of each month and year in the specified timeline. Here I've made a dummy dataset to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # period and max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and get cumulated max quantity
 .assign(**{
     'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
     'Till highest': lambda d: d['Till highest'].cummax()
 })
 # drop the groupby index
 .reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
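For comparison, a more compact pandas sketch of the same idea (assuming df has the Date and Quantity columns shown above): take each month's maximum, then a running maximum across months.
monthly = (pd.to_datetime(df['Date'])
             .dt.to_period('M')
             .rename('Month/Year')
             .to_frame()
             .assign(Quantity=df['Quantity'])
             .groupby('Month/Year')['Quantity'].max()
             .cummax()                      # running maximum across months
             .reset_index(name='Till Highest'))
monthly['Month/Year'] = monthly['Month/Year'].dt.strftime('%b/%Y')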
In R you can use cummax:
df = data.frame(Date = c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11","2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10","2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05","2020-06-16","2020-06-22"),
                Quantity = c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14))
data.frame(`Month/Year` = unique(format(as.Date(df$Date), "%b/%Y")),
           `Till Highest` = cummax(tapply(df$Quantity, sub("-..$", "", df$Date), max)),
           check.names = F, row.names = NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111

Splitting DataFrame rows into multiple based on a condition

I have a Dataframe df1 that has a bunch of columns like so:
   val_1  val_2      start       end  val_3  val_4
0     10     70   1/1/2020  3/4/2020     10     20
1     20     80   1/1/2020  3/6/2021     30     40
2     30     90   1/1/2020  6/4/2021     50     60
3     40    100  12/5/2020  7/4/2021     70     80
4     89    300   4/5/2020  6/8/2022     40     10
I need to iterate over the rows and split the cross-year periods into continuous same-year ones. The remaining values in each row need to stay the same and keep their data types, like so:
   val_1  val_2      start         end  val_3  val_4
0     10     70   1/1/2020    3/4/2020     10     20
1     20     80   1/1/2020  12/31/2020     30     40
2     20     80   1/1/2021    3/6/2021     30     40
3     30     90   1/1/2020  12/31/2020     50     60
4     30     90   1/1/2021    6/4/2021     50     60
5     40    100   7/5/2021  11/17/2021     70     80
6     89    300   4/5/2020  12/31/2020     40     10
7     89    300   1/1/2021  12/31/2021     40     10
8     89    300   1/1/2021    6/8/2022     40     10
Is there a fast and efficient way to do this? I tried iterating over the rows, but I'm having trouble with the indices and with appending rows after a given index. Also, people have said it's probably not a good idea to edit the thing you're iterating over, so I was wondering whether there is a better way. Any suggestions will be appreciated. Thank you!
EDIT
If a row spans more than one year, it should break into 3 or more rows accordingly. I've edited the tables to accurately reflect this. Thank you!
Here's a different approach. Note that I've already converted start and end to datetimes, and I didn't bother sorting the resultant DataFrame because I didn't want to assume a specific ordering for your use-case.
import pandas as pd

def jump_to_new_year(df: pd.DataFrame) -> pd.DataFrame:
    df["start"] = df["start"].map(lambda t: pd.Timestamp(t.year + 1, 1, 1))
    return df

def fill_to_year_end(df: pd.DataFrame) -> pd.DataFrame:
    df["end"] = df["start"].map(lambda t: pd.Timestamp(t.year, 12, 31))
    return df

def roll_over(df: pd.DataFrame) -> pd.DataFrame:
    mask = df.start.dt.year != df.end.dt.year
    if all(~mask):
        return df
    start_df = fill_to_year_end(df[mask].copy())
    end_df = roll_over(jump_to_new_year(df[mask].copy()))
    return pd.concat([df[~mask], start_df, end_df]).reset_index(drop=True)
This is a recursive function. It first checks if any start-end date pairs have mismatched years. If not, we simply return the DataFrame. If so, we fill to the end of the year in the start_df DataFrame. Then we jump to the new year and fill that to the end date in the end_df DataFrame. Then we recurse on end_df, which will be a smaller subset of the original input.
Warning: this solution assumes that all start dates occur on or before the end date's year. If you start in 2020 and end in 2019, you will recurse infinitely and blow the stack.
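If your start and end columns are still plain strings like 1/1/2020, a conversion along these lines is needed before calling roll_over (a sketch; it assumes month/day/year formatting):
df["start"] = pd.to_datetime(df["start"], format="%m/%d/%Y")
df["end"] = pd.to_datetime(df["end"], format="%m/%d/%Y")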
Demo:
>>> df
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2021-07-04 70 80
4 89 300 2020-04-05 2022-06-08 40 10
>>> roll_over(df)
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
2 30 90 2020-01-01 2020-12-31 50 60
3 40 100 2020-12-05 2020-12-31 70 80
4 89 300 2020-04-05 2020-12-31 40 10
5 20 80 2021-01-01 2021-03-06 30 40
6 30 90 2021-01-01 2021-06-04 50 60
7 40 100 2021-01-01 2021-07-04 70 80
8 89 300 2021-01-01 2021-12-31 40 10
9 89 300 2022-01-01 2022-06-08 40 10
# An example of reordering the DataFrame
>>> roll_over(df).sort_values(by=["val_1", "start"])
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
5 20 80 2021-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2020-12-31 50 60
6 30 90 2021-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2020-12-31 70 80
7 40 100 2021-01-01 2021-07-04 70 80
4 89 300 2020-04-05 2020-12-31 40 10
8 89 300 2021-01-01 2021-12-31 40 10
9 89 300 2022-01-01 2022-06-08 40 10
Find the year ends within each start/end range with date_range, then explode:
df['end'] = [[y] + pd.date_range(x, y)[pd.date_range(x, y).is_year_end].strftime('%m/%d/%y').tolist()
             for x, y in zip(df['start'], df['end'])]
df = df.explode('end')
df
Out[29]:
val_1 val_2 start end val_3 val_4
0 10 70 1/1/2020 3/4/2020 10 20
1 20 80 1/1/2020 3/6/2021 30 40
1 20 80 1/1/2020 12/31/20 30 40
2 30 90 1/1/2020 6/4/2021 50 60
2 30 90 1/1/2020 12/31/20 50 60
3 40 100 12/5/2020 7/4/2021 70 80
3 40 100 12/5/2020 12/31/20 70 80
Update
df.end = pd.to_datetime(df.end)
df.start = pd.to_datetime(df.start)
df['Newstart'] = [list(set([x] + pd.date_range(x, y)[pd.date_range(x, y).is_year_start].tolist()))
                  for x, y in zip(df['start'], df['end'])]
df['Newend'] = [[y] + pd.date_range(x, y)[pd.date_range(x, y).is_year_end].tolist()
                for x, y in zip(df['start'], df['end'])]
out = df.explode(['Newend', 'Newstart'])
val_1 val_2 start end val_3 val_4 Newstart Newend
0 10 70 2020-01-01 2020-03-04 10 20 2020-01-01 2020-03-04
1 20 80 2020-01-01 2021-03-06 30 40 2021-01-01 2021-03-06
1 20 80 2020-01-01 2021-03-06 30 40 2020-01-01 2020-12-31
2 30 90 2020-01-01 2021-06-04 50 60 2021-01-01 2021-06-04
2 30 90 2020-01-01 2021-06-04 50 60 2020-01-01 2020-12-31
3 40 100 2020-12-05 2021-07-04 70 80 2021-01-01 2021-07-04
3 40 100 2020-12-05 2021-07-04 70 80 2020-12-05 2020-12-31
4 89 300 2020-04-05 2022-06-08 40 10 2020-04-05 2022-06-08
4 89 300 2020-04-05 2022-06-08 40 10 2022-01-01 2020-12-31
4 89 300 2020-04-05 2022-06-08 40 10 2021-01-01 2021-12-31

calculate the difference between pandas rows in pairs

I have a dataframe of orders as below, where the column 'Value' represents cash in/out and the 'Date' column reflects when the transaction occurred.
Each transaction is grouped, so that the 'Qty' out is always followed by the 'Qty' in, as reflected by the sign in the 'Qty' column:
Date Qty Price Value
0 2014-11-18 58 495.775716 -2875499
1 2014-11-24 -58 484.280147 2808824
2 2014-11-26 63 474.138699 -2987073
3 2014-12-31 -63 507.931247 3199966
4 2015-01-05 59 495.923771 -2925950
5 2015-02-05 -59 456.224370 2691723
How can I create two columns, 'n_days' and 'price_diff', holding the difference in days between the two dates of each transaction pair and the net 'Value' of the pair?
I have tried:
df['price_diff'] = df['Value'].rolling(2).apply(lambda x: x[0] + x[1])
but I'm receiving a KeyError for the first observation (0).
Many thanks
Why don't you just use sum:
df['price_diff'] = df['Value'].rolling(2).sum()
Although, judging from the name, it looks like you actually want:
df['price_diff'] = df['Price'].diff()
And, for the two columns:
df[['Date_diff','Price_diff']] = df[['Date','Price']].diff()
Output:
Date Qty Price Value Date_diff Price_diff
0 2014-11-18 58 495.775716 -2875499 NaT NaN
1 2014-11-24 -58 484.280147 2808824 6 days -11.495569
2 2014-11-26 63 474.138699 -2987073 2 days -10.141448
3 2014-12-31 -63 507.931247 3199966 35 days 33.792548
4 2015-01-05 59 495.923771 -2925950 5 days -12.007476
5 2015-02-05 -59 456.224370 2691723 31 days -39.699401
Updated: per the comment, you can try:
df['Val_sum'] = df['Value'].rolling(2).sum()[1::2]
Output:
Date Qty Price Value Val_sum
0 2014-11-18 58 495.775716 -2875499 NaN
1 2014-11-24 -58 484.280147 2808824 -66675.0
2 2014-11-26 63 474.138699 -2987073 NaN
3 2014-12-31 -63 507.931247 3199966 212893.0
4 2015-01-05 59 495.923771 -2925950 NaN
5 2015-02-05 -59 456.224370 2691723 -234227.0
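If you also want the n_days column from the question, one hedged sketch is to group the rows in consecutive pairs (this assumes each out/in pair really does occupy two consecutive rows on a default integer index, and that Date is already a datetime column):
pairs = df.groupby(df.index // 2).agg(
    n_days=('Date', lambda s: (s.iloc[-1] - s.iloc[0]).days),  # days between the two dates of the pair
    price_diff=('Value', 'sum'),                                # net cash flow of the pair
)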

Converting object into time and grouping/summarizing time (H/M/S) into 24 hours

I subsetted a big dataframe, slicing out only one column, Start Time, of type object.
test = taxi_2020['Start Time']
Got a column
0 00:15:00
1 00:15:00
2 00:15:00
3 00:15:00
4 00:15:00
...
4137289 00:00:00
4137290 00:00:00
4137291 00:00:00
4137292 00:00:00
4137293 00:00:00
Name: Start Time, Length: 4137294, dtype: object
Then I grouped and summarized it by count (to the best of my knowledge):
test.value_counts().sort_index().reset_index()
and got two columns
index Start Time
0 00:00:00 24005
1 00:15:00 22815
2 00:30:00 20438
3 00:45:00 19012
4 01:00:00 18082
... ... ...
91 22:45:00 32365
92 23:00:00 31815
93 23:15:00 29582
94 23:30:00 26903
95 23:45:00 24599
I'm not sure why this index column appeared, and I failed to rename or convert it.
What would I like to see?
I would like to group the times by hour (24h format is OK). The data appears to be counted every 15 minutes, so basically put each set of 4 consecutive rows together; 00:15:00 can count as hour 0 and 23:00:00 as hour 23.
My ideal output:
Hour Rides
0 34000
1 60000
2 30000
3 40000
Afterwards, I would like to create a simple histogram showing occurrences by hour.
Appreciate any help!
IIUC,
import numpy as np
import pandas as pd

# Create a dummy input dataframe
test = pd.DataFrame({'time': pd.date_range('2020-06-01', '2020-06-01 23:59:00', freq='15T').strftime('%H:%M:%S'),
                     'rides': np.random.randint(15000, 28000, 96)})
Let's create a DatetimeIndex from the strings, resample, aggregate with sum, and convert the DatetimeIndex to hours:
test2 = (test.set_index(pd.to_datetime(test['time'], format='%H:%M:%S'))
             .rename_axis('hour').resample('H').sum())
test2.index = test2.index.hour
test2.reset_index()
Output:
hour rides
0 0 74241
1 1 87329
2 2 76933
3 3 86208
4 4 88002
5 5 82618
6 6 82188
7 7 81203
8 8 78591
9 9 95592
10 10 99778
11 11 85294
12 12 93931
13 13 80490
14 14 84181
15 15 71786
16 16 90962
17 17 96568
18 18 85646
19 19 88324
20 20 83595
21 21 89284
22 22 72061
23 23 74057
Step by step, I found the answer myself.
Using this code, I renamed the columns:
test2 = (test.value_counts().sort_index().reset_index()
             .rename(columns={'index': 'Time', 'Start Time': 'Rides'}))
That gave me a frame with Time and Rides columns.
The remaining question was how to summarize by hour.
After applying
test2['hour'] = pd.to_datetime(test2['Time'], format='%H:%M:%S').dt.hour
test2
I came closer
Finally, I grouped by the hour value:
test3 = test2.groupby('hour', as_index=False).agg({"Rides": "sum"})
print(test3)
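For reference, the same steps can be condensed into a single chain (a hedged sketch; it assumes the original Start Time strings are in HH:MM:SS format and that pandas is imported as pd):
rides_per_hour = (taxi_2020['Start Time']
                  .value_counts()
                  .rename_axis('Time')
                  .reset_index(name='Rides')
                  .assign(hour=lambda d: pd.to_datetime(d['Time'], format='%H:%M:%S').dt.hour)
                  .groupby('hour', as_index=False)['Rides'].sum())
rides_per_hour.plot.bar(x='hour', y='Rides')  # one way to draw the per-hour histogram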

python pandas: trying to vectorize a function using date_range

Here is my dataframe:
import pandas as pd

df = pd.DataFrame({
    'KEY': ['A', 'B', 'C', 'A', 'A', 'B'],
    'START_DATE': ['2018-01-05', '2018-01-04', '2018-01-01', '2018-01-23', '2018-02-01', '2018-03-11'],
    'STOP_DATE': ['2018-01-22', '2018-03-10', '2018-01-31', '2018-02-15', '2018-04-01', '2018-07-21'],
    'AMOUNT': [5, 3, 11, 14, 7, 9],
})
df.START_DATE = pd.to_datetime(df.START_DATE, format='%Y-%m-%d')
df.STOP_DATE = pd.to_datetime(df.STOP_DATE, format='%Y-%m-%d')
df
>>> AMOUNT KEY START_DATE STOP_DATE
0 5 A 2018-01-05 2018-01-22
1 3 B 2018-01-04 2018-03-10
2 11 C 2018-01-01 2018-01-31
3 14 A 2018-01-23 2018-02-15
4 7 A 2018-02-01 2018-04-01
5 9 B 2018-03-11 2018-07-21
I am trying to get the AMOUNT per month and per KEY, treating the AMOUNT as linearly distributed (by day) between START_DATE and STOP_DATE. The desired output is shown below. I would also like to keep track of the number of charged days in each month. For example, KEY = A has overlapping periods in February, so the number of charged days can be > 28.
DAYS AMOUNT
A 2018_01 27 10.250000
2018_02 43 12.016667
2018_03 31 3.616667
2018_04 1 0.116667
B 2018_01 28 1.272727
2018_02 28 1.272727
2018_03 31 1.875598
2018_04 30 2.030075
2018_05 31 2.097744
2018_06 30 2.030075
2018_07 21 1.421053
C 2018_01 31 11.000000
2018_02 0 0.000000
I came up with the solution detailed below, but it is highly inefficient and takes an unaffordable amount of time to run on a dataset with ~100 million rows. I am looking for an improved version but could not manage to vectorize the pd.date_range part of it. Not sure if numba @jit could help here? Added a tag just in case.
from pandas.tseries.offsets import MonthEnd

# Prepare the final dataframe (filled with zeros)
bounds = df.groupby('KEY').agg({'START_DATE': min, 'STOP_DATE': max}).reset_index()
multiindex = []
for row in bounds.itertuples():
    dates = pd.date_range(start=row.START_DATE, end=row.STOP_DATE + MonthEnd(),
                          freq='M').strftime('%Y_%m')
    multiindex.extend([(row.KEY, date) for date in dates])
index = pd.MultiIndex.from_tuples(multiindex)
final = pd.DataFrame(0, index=index, columns=['DAYS', 'AMOUNT'])

# Run the actual iteration over rows
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
for row in df.itertuples():
    data = pd.Series(index=pd.date_range(start=row.START_DATE, end=row.STOP_DATE))
    data = data.resample('MS').size().rename('DAYS').to_frame()
    data['AMOUNT'] = data.DAYS / row.TOTAL_DAYS * row.AMOUNT
    data.index = data.index.strftime('%Y_%m')
    # Add data to the final dataframe
    final.loc[(row.KEY, data.index.tolist()), 'DAYS'] += data.DAYS.values
    final.loc[(row.KEY, data.index.tolist()), 'AMOUNT'] += data.AMOUNT.values
I eventually came up with this solution (heavily inspired by @jezrael's answer on this post). It is probably not the most memory-efficient solution, but that is not a major concern for me; execution time was the problem!
from pandas.tseries.offsets import MonthBegin
df['ID'] = range(len(df))
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
df
>>> AMOUNT KEY START_DATE STOP_DATE ID TOTAL_DAYS
0 5 A 2018-01-05 2018-01-22 0 18
1 3 B 2018-01-04 2018-03-10 1 66
2 11 C 2018-01-01 2018-01-31 2 31
3 14 A 2018-01-23 2018-02-15 3 24
4 7 A 2018-02-01 2018-04-01 4 60
5 9 B 2018-03-11 2018-07-21 5 133
final = (df[['ID', 'START_DATE', 'STOP_DATE']].set_index('ID').stack()
           .reset_index(level=-1, drop=True)
           .rename('DATE_AFTER')
           .to_frame())
final = final.groupby('ID').apply(
    lambda x: x.set_index('DATE_AFTER').resample('M').asfreq()).reset_index()
final = final.merge(df[['ID', 'KEY', 'AMOUNT', 'TOTAL_DAYS']], how='left', on=['ID'])
final['PERIOD'] = final.DATE_AFTER.dt.to_period('M')
final['DATE_BEFORE'] = final.DATE_AFTER - MonthBegin()
At this point final looks like this:
final
>>> ID DATE_AFTER KEY AMOUNT TOTAL_DAYS PERIOD DATE_BEFORE
0 0 2018-01-31 A 5 18 2018-01 2018-01-01
1 1 2018-01-31 B 3 66 2018-01 2018-01-01
2 1 2018-02-28 B 3 66 2018-02 2018-02-01
3 1 2018-03-31 B 3 66 2018-03 2018-03-01
4 2 2018-01-31 C 11 31 2018-01 2018-01-01
5 3 2018-01-31 A 14 24 2018-01 2018-01-01
6 3 2018-02-28 A 14 24 2018-02 2018-02-01
7 4 2018-02-28 A 7 60 2018-02 2018-02-01
8 4 2018-03-31 A 7 60 2018-03 2018-03-01
9 4 2018-04-30 A 7 60 2018-04 2018-04-01
10 5 2018-03-31 B 9 133 2018-03 2018-03-01
11 5 2018-04-30 B 9 133 2018-04 2018-04-01
12 5 2018-05-31 B 9 133 2018-05 2018-05-01
13 5 2018-06-30 B 9 133 2018-06 2018-06-01
14 5 2018-07-31 B 9 133 2018-07 2018-07-01
We then merge back the initial df twice (start and end of month):
final = pd.merge(
    final,
    df[['ID', 'STOP_DATE']].assign(PERIOD=df.STOP_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])
final = pd.merge(
    final,
    df[['ID', 'START_DATE']].assign(PERIOD=df.START_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])
final['STOP_DATE'] = final.STOP_DATE.combine_first(final.DATE_AFTER)
final['START_DATE'] = final.START_DATE.combine_first(final.DATE_BEFORE)
final['DAYS'] = (final.STOP_DATE - final.START_DATE).dt.days + 1
final = final.drop(columns=['ID', 'DATE_AFTER', 'DATE_BEFORE'])
final.AMOUNT *= final.DAYS / final.TOTAL_DAYS
final = final.groupby(['KEY', 'PERIOD']).agg({'AMOUNT': sum, 'DAYS': sum})
With the expected result:
AMOUNT DAYS
KEY PERIOD
A 2018-01 10.250000 27
2018-02 12.016667 43
2018-03 3.616667 31
2018-04 0.116667 1
B 2018-01 1.272727 28
2018-02 1.272727 28
2018-03 1.875598 31
2018-04 2.030075 30
2018-05 2.097744 31
2018-06 2.030075 30
2018-07 1.421053 21
C 2018-01 11.000000 31
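For what it's worth, a simpler (but more memory-hungry) sketch of the same computation is to expand every row into one row per day and then aggregate by KEY and month; whether it beats the merge approach on ~100 million rows would need profiling:
# expand each row into one row per covered day
daily = df.assign(N_DAYS=(df.STOP_DATE - df.START_DATE).dt.days + 1)
daily = daily.loc[daily.index.repeat(daily.N_DAYS)]
daily = daily.assign(
    DATE=daily.START_DATE + pd.to_timedelta(daily.groupby(level=0).cumcount(), unit='D'),
    AMOUNT=daily.AMOUNT / daily.N_DAYS,      # per-day share of the amount
)
out = (daily.groupby(['KEY', daily.DATE.dt.to_period('M').rename('PERIOD')])
            .agg(AMOUNT=('AMOUNT', 'sum'), DAYS=('DATE', 'size')))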
