pandas get a sum column for next 7 days - python

I want to get the sum of values for next 7 days of a column
my dataframe :
date value
0 2021-04-29 1
1 2021-05-03 2
2 2021-05-06 1
3 2021-05-15 1
4 2021-05-17 2
5 2021-05-18 1
6 2021-05-21 2
7 2021-05-22 5
8 2021-05-24 4
i tried to make a new column that contains date 7 days from current date
df['temp'] = df['date'] + timedelta(days=7)
then calculate value between date range :
df['next_7days'] = df[(df.date > df.date) & (df.date <= df.temp)].value.sum()
But this gives me answer as all 0.
intended result:
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
The method iam using currently is quite tedious, are their any better methods to get the intended result.

With a list comprehension:
tomorrow_dates = df.date + pd.Timedelta("1 day")
next_week_dates = df.date + pd.Timedelta("7 days")
df["next_7days"] = [df.value[df.date.between(tomorrow, next_week)].sum()
for tomorrow, next_week in zip(tomorrow_dates, next_week_dates)]
where we first define tomorrow and next week's dates and store them. Then zip them together and use between of pd.Series to get a boolean series if the date is indeed between the desired range. Then using boolean indexing to get the actual values and sum them. Do this for each date pair.
to get
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0

Related

pandas get delta from corresponding date in a seperate list of dates

I have a dataframe:
df a b
7 2019-05-01 00:00:01
6 2019-05-02 00:15:01
1 2019-05-06 00:10:01
3 2019-05-09 01:00:01
8 2019-05-09 04:20:01
9 2019-05-12 01:10:01
4 2019-05-16 03:30:01
And
l = [datetime.datetime(2019,05,02), datetime.datetime(2019,05,10), datetime.datetime(2019,05,22) ]
I want to add a column with the following:
for each row, find the last date from l that is before it, and add number of days between them.
If none of the date is smaller - add the delta from the smallest one.
So the new column will be:
df a b. delta date
7 2019-05-01 00:00:01 -1 datetime.datetime(2019,05,02)
6 2019-05-02 00:15:01 0 datetime.datetime(2019,05,02)
1 2019-05-06 00:10:01 4 datetime.datetime(2019,05,02)
3 2019-05-09 01:00:01 7 datetime.datetime(2019,05,02)
8 2019-05-09 04:20:01 7 datetime.datetime(2019,05,02)
9 2019-05-12 01:10:01 2 datetime.datetime(2019,05,10)
4 2019-05-16 03:30:01 6 datetime.datetime(2019,05,10)
How can I do it?
Using merge_asof to align df['b'] and the list (as Series), then computing the difference:
# ensure datetime
df['b'] = pd.to_datetime(df['b'])
# craft Series for merging (could be combined with line below)
s = pd.Series(l, name='l')
# merge and fillna with minimum date
ref = pd.merge_asof(df['b'], s, left_on='b', right_on='l')['l'].fillna(s.min())
# compute the delta as days
df['delta'] =(df['b']-ref).dt.days
output:
a b delta
0 7 2019-05-01 00:00:01 -1
1 6 2019-05-02 00:15:01 0
2 1 2019-05-06 00:10:01 4
3 3 2019-05-09 01:00:01 7
4 8 2019-05-09 04:20:01 7
5 9 2019-05-12 01:10:01 2
6 4 2019-05-16 03:30:01 6
Here's a one line solution if you your b column has datetime object. Otherwise convert it to datetime object.
df['delta'] = df.apply(lambda x: sorted([x.b - i for i in l], key= lambda y: y.seconds)[0].days, axis=1)
Explanation : To each row you apply a function that :
Compute the deltatime between your row's datetime and every datetime present in l, then store it in a list
Sort this list by the numbers of seconds of each deltatime
Get the first value (with the smallest deltatime) and return its days
this code is seperate this dataset on
weekday Friday
year 2014
day 01
hour 00
minute 03
rides['weekday'] = rides.timestamp.dt.strftime("%A")
rides['year'] = rides.timestamp.dt.strftime("%Y")
rides['day'] = rides.timestamp.dt.strftime("%d")
rides['hour'] = rides.timestamp.dt.strftime("%H")
rides["minute"] = rides.timestamp.dt.strftime("%M")

Conduct the calculation only when the date value is valid

I have a data frame dft:
Date Total Value
02/01/2022 2
03/01/2022 6
N/A 4
03/11/2022 4
03/15/2022 4
05/01/2022 4
For each date in the data frame, I want to calculate the how many days from today and I want to add these calculated values in a new column called Days.
I have tried the following code:
newdft = []
for item in dft:
temp = item.copy()
timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
temp["Days"] = timediff.days
newdft.append(temp)
But the third date value is N/A, which caused an error. What should I add to my code so that I only conduct the calculation only when the date value is valid?
I would convert the whole Date column to be a date time object, using pd.to_datetime(), with the errors set to coerce, to replace the 'N/A' string to NaT (Not a Timestamp) with the below:
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
So the column will now look like this:
0 2022-02-01
1 2022-03-01
2 NaT
3 2022-03-11
4 2022-03-15
5 2022-05-01
Name: Date, dtype: datetime64[ns]
You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:
dft['Days'] = datetime.now() - dft['Date']
This will make dft look like below:
Date Total Value Days
0 2022-02-01 2 148 days 15:49:03.406935
1 2022-03-01 6 120 days 15:49:03.406935
2 NaT 4 NaT
3 2022-03-11 4 110 days 15:49:03.406935
4 2022-03-15 4 106 days 15:49:03.406935
5 2022-05-01 4 59 days 15:49:03.406935
If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:
df['Days'] = (datetime.now() - df['Date']).dt.days
Which will give you:
Date Total Value Days
0 2022-02-01 2 148.0
1 2022-03-01 6 120.0
2 NaT 4 NaN
3 2022-03-11 4 110.0
4 2022-03-15 4 106.0
5 2022-05-01 4 59.0
In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.
df['Days'] = (
df['Date']
[lambda d: d != 'N/A']
.apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
)
Result:
Date Total Value Days
0 02/01/2022 2 148.0
1 03/01/2022 6 120.0
2 N/A 4 NaN
3 03/11/2022 4 110.0
4 03/15/2022 4 106.0
5 05/01/2022 4 59.0
And for what it's worth, another option is date.today() instead of datetime.now():
.apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())
And the result is a timedelta instead of float:
Date Total Value Days
0 02/01/2022 2 148 days
1 03/01/2022 6 120 days
2 N/A 4 NaT
3 03/11/2022 4 110 days
4 03/15/2022 4 106 days
5 05/01/2022 4 59 days
See also: How do I select rows from a DataFrame based on column values?
Following up on the excellent answer by Emi OB I would suggest using DataFrame.mask() to update the dataframe without type coercion.
import datetime
import pandas as pd
dft = pd.DataFrame({'Date': [
'02/01/2022',
'03/01/2022',
None,
'03/11/2022',
'03/15/2022',
'05/01/2022'],
'Total Value': [2,6,4,4,4,4]})
dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'].mask(dft['Date'].notna(),
(dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
axis=0, inplace=True)
dft.drop(columns=['today'], inplace=True)
This would result in integer values in the Days column:
Date Total Value Days
0 02/01/2022 2 148
1 03/01/2022 6 120
2 None 4 None
3 03/11/2022 4 110
4 03/15/2022 4 106
5 05/01/2022 4 59

Python pandas: set consecutive index (starting at 0) for every group in groupby

I have a dataframe sorted and grouped by serial number, then date:
df = df.sort_values(by=["serial_num" , "date"])
df = df.groupby(df['serial_num'])['date'].some_function()
Dataframe
serial_num date
0 1 2001-01-01
1 1 2001-02-01
2 1 2001-03-01
3 1 2001-04-01
4 3 2003-05-01
5 3 2003-06-01
6 3 2003-07-01
7 7 2005-07-01
8 7 2005-08-01
9 7 2005-09-01
10 7 2005-10-01
11 7 2005-11-01
12 7 2005-12-01
Each unique serial_num group will be a line on a line graph.
The way it is graphed now, each line is a time series that starts at a different point -- because the first date is different for every serial_num group.
I need the x-axis of the graph to be time instead of date. All lines on the graph will start at the same point - the origin of the x-axis of my graph.
I think the easiest way to do this would be to add a consecutive index that starts at 0 for each group, like this:
serial_num date new_index
0 1 2001-01-01 0
1 1 2001-02-01 1
2 1 2001-03-01 2
3 1 2001-04-01 3
4 3 2003-05-01 0
5 3 2003-06-01 1
6 3 2003-07-01 2
7 7 2005-07-01 0
8 7 2005-08-01 1
9 7 2005-09-01 2
10 7 2005-10-01 3
11 7 2005-11-01 4
12 7 2005-12-01 5
Then, I think I will be able to graph (in Plotly) with all lines starting at the same point (the 0 index will be the first data point for each serial_num.
NOTE: each serial_num group has a different number of data points.
I'm unsure how to index with groupby this way. Please help! Or if you know another method that will accomplish the same goal, please share. Thanks!
Use cumcount:
df["new_index"] = df.groupby("serial_num").cumcount()
print(df)
==>
serial_num date new_index
0 1 2001-01-01 0
1 1 2001-02-01 1
2 1 2001-03-01 2
3 1 2001-04-01 3
4 3 2003-05-01 0
5 3 2003-06-01 1
6 3 2003-07-01 2
7 7 2005-07-01 0
8 7 2005-08-01 1
9 7 2005-09-01 2
10 7 2005-10-01 3
11 7 2005-11-01 4
12 7 2005-12-01 5

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed from the last time this product was. This there a way I can do this with pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data:
expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.
You can do this as follows:
# define a function that applies rolling window calculationg
# taking the difference between the last value and the current
# value
def calc_mrp(ser):
# in case you want the relative change, just
# divide by x[1] or x[0] in the lambda function
return ser.rolling(window=2).apply(lambda x: x[1]-x[0])
# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change']=df.groupby('product_id')['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows, for which there is not previous order with the same product_id.

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
#anky_91
In this image an extract of first 10 rows is presented. First column [int32]: year; Second column[int32]: month; Third column[int32]: day; Fourth column[object]: hour; Fifth column[object]: minute. The length of objects is 2.
Another solution:
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: '0'.join(map(str,x))))],axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you have added as image (i have skipped the last 3 columns due to save time)
>>df.month=df.month.map("{:02}".format)
>>df.day = df.day.map("{:02}".format)
>>pd.concat([df.A,pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(),name='Date').map(lambda x: ''.join(map(str,x))))],axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can use rename to columns, so possible use pandas.to_datetime with columns year, month, day, hour, minute:
df = pd.DataFrame({
'A':list('abcdef'),
'year':[2002,2002,2002,2002,2002,2002],
'month':[7,8,9,4,2,3],
'day':[1,3,5,7,1,5],
'hour_scheduled_departure':[5,3,6,9,2,4],
'minute_scheduled_departure':[7,8,9,4,2,3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
#if necessary remove columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3

Categories