Group by id and calculate variation in sales based on the date - python

My DataFrame looks like this:
id  date        value
1   2021-07-16  100
2   2021-09-15  20
1   2021-04-10  50
1   2021-08-27  30
2   2021-07-22  15
2   2021-07-22  25
1   2021-06-30  40
3   2021-10-11  150
2   2021-08-03  15
1   2021-07-02  90
I want to group by id and return the difference between the total values over two 90-day periods.
Specifically, I want the sum of values over the last 90 days counted from today, and over the last 90 days counted from 30 days ago.
For example, considering today is 2021-10-13, I would like to get:
the sum of all values per id between 2021-10-13 and 2021-07-15
the sum of all values per id between 2021-09-13 and 2021-06-15
And finally, subtract them to get the variation.
I've already managed to calculate it, by creating separate temporary dataframes containing only the dates in those 90-day periods, grouping by id, and then merging these temp dataframes into a final one.
But I guess there should be an easier or simpler way to do it. I'd appreciate any help!
Btw, sorry if the explanation was a little messy.

If I understood correctly, you need something like this:
import pandas as pd
import datetime

## Calculate the dates we are going to need.
today = datetime.datetime.now()
# Date 120 days ago
hundredTwentyDaysAgo = today - datetime.timedelta(days=120)
# Date 90 days ago
ninetyDaysAgo = today - datetime.timedelta(days=90)
# Date 30 days ago
thirtyDaysAgo = today - datetime.timedelta(days=30)

## Initialize an example df.
df = pd.DataFrame({"id": [1, 2, 1, 1, 2, 2, 1, 3, 2, 1],
                   "date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27",
                            "2021-07-22", "2021-07-22", "2021-06-30", "2021-10-11",
                            "2021-08-03", "2021-07-02"],
                   "value": [100, 20, 50, 30, 15, 25, 40, 150, 15, 90]})

## Cast the date column to date objects.
df['date'] = pd.to_datetime(df['date']).dt.date

grouped = df.groupby('id')
# Sum of the last 90 days, per id
ninetySum = grouped.apply(lambda x: x[x['date'] >= ninetyDaysAgo.date()]['value'].sum())
# Sum of the 90 days ending 30 days ago, per id
hundredTwentySum = grouped.apply(lambda x: x[(x['date'] >= hundredTwentyDaysAgo.date())
                                             & (x['date'] <= thirtyDaysAgo.date())]['value'].sum())
The output is
ninetySum - hundredTwentySum
id
1 -130
2 20
3 150
dtype: int64
You can double check to make sure these are the numbers you wanted by printing the ninetySum and hundredTwentySum variables.
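For what it's worth, here is a shorter sketch of the same calculation, reusing df and the three cutoff dates from the snippet above: it avoids the two apply calls by zeroing out the out-of-window values with Series.where and doing a single groupby-sum.

# Zero out values outside each window, then one groupby over id.
recent = df['value'].where(df['date'] >= ninetyDaysAgo.date(), 0)
earlier = df['value'].where((df['date'] >= hundredTwentyDaysAgo.date())
                            & (df['date'] <= thirtyDaysAgo.date()), 0)
variation = (recent - earlier).groupby(df['id']).sum()
print(variation)  # should match ninetySum - hundredTwentySum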

Related

Pandas find last row for each hour / minute in high frequency dataframe

Assume a dataframe as follows. I'm looking to add a column to the df dataframe that takes the price for the current row and subtracts the price at the last index 5 minutes prior to the current hour/minute. I've attempted to reference a minute_df, read the current hour/minute, and pull the close price from the minute_df, but have not got a working solution. The df index is datetime64.
For example, at 06:27:12 it should take this row's price minus the close price at the last index from the 06:22 minute, as this is 5 minutes prior to 06:27. For each index within the minute 06:27 it should reference this close price for the calculation, until it turns to 06:28, when it should subtract from the last index at 06:23.
df
TimeStamp Price Q hour min
2022-10-05 05:30:11.344618-05:00 8636 1 5 30
2022-10-05 05:30:12.647597-05:00 8637 1 5 30
2022-10-05 05:30:20.080559-05:00 8637 1 5 30
2022-10-05 05:30:21.267389-05:00 8637 2 5 30
2022-10-05 05:30:21.267952-05:00 8636 1 5 30
minute_df
TimeStamp open high low close
2022-10-05 05:30:00-05:00 8636 8645 8635 8645
2022-10-05 05:31:00-05:00 8645 8647 8637 8638
2022-10-05 05:32:00-05:00 8639 8650 8639 8649
2022-10-05 05:33:00-05:00 8648 8652 8648 8649
Expected output is a column within the df dataframe containing the current price minus the closing price at the last index 5 minutes prior to the current minute, with NaN values until there are sufficient rows to look back this many periods.
df['price_change']
Not sure if I understand correctly, but here's my try.
If TimeStamp is a column
# Remove the seconds and microseconds
floor_ts = df.TimeStamp.dt.floor("min")
# Get the timestamp from 5 minutes earlier
last_index_5_ts = floor_ts - pd.Timedelta(5, unit="min")
# Create dict from minute_df TimeStamp to close price
ts_to_close_dict = dict(zip(minute_df.TimeStamp, minute_df.close))
close_price_v = last_index_5_ts.map(ts_to_close_dict)
df["price_change"] = df.Price - close_price_v
df
Same code but if TimeStamp is an index
floor_ts = df.index.floor("min")
last_index_5_ts = floor_ts - pd.Timedelta(5, unit="min")
ts_to_close_dict = dict(zip(minute_df.index, minute_df.close))
close_price_v = last_index_5_ts.map(ts_to_close_dict)
df["price_change"] = df.Price - close_price_v
df
A few notes:
I'm not sure what you mean about handling NaN values, but if you need to forward fill / backward fill them you can use fillna
Some of the pandas functions (like floor) above might be missing in older pandas versions
EDIT:
I didn't notice the df already has hour and minute columns. You may use them for calculating floor_ts (though I'm not sure whether it's easier/faster)
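As a sanity check, here is a minimal self-contained toy run of the index-based variant; the timestamps and prices below are invented for illustration:

import pandas as pd

# Toy data: three trades and two one-minute bars.
idx = pd.to_datetime(["2022-10-05 05:36:11", "2022-10-05 05:36:50",
                      "2022-10-05 05:37:02"])
df = pd.DataFrame({"Price": [8650, 8652, 8649]}, index=idx)
minute_df = pd.DataFrame(
    {"close": [8638, 8649]},
    index=pd.to_datetime(["2022-10-05 05:31:00", "2022-10-05 05:32:00"]))

floor_ts = df.index.floor("min")
last_index_5_ts = floor_ts - pd.Timedelta(5, unit="min")
ts_to_close_dict = dict(zip(minute_df.index, minute_df.close))
close_price_v = last_index_5_ts.map(ts_to_close_dict)
df["price_change"] = df.Price - close_price_v
print(df)  # the 05:36 rows use the 05:31 close, the 05:37 row uses the 05:32 close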

How to calculate the days difference between all the dates of a dataframe column and a single date in Python

I would like to calculate the difference in days between all the dates in the "last_review" column and
2018-08-01, and I want the output to be exact days; for example, if the observation is 2018-07-31, the output should be 2. And do this for every observation of the dataframe column. The output should have shape 48894 × 1.
You can do it like so:
df['last_review'] = pd.to_datetime(df['last_review'])
df['num_days'] = pd.to_datetime("2019-08-01") - df['last_review']
Output:
last_review num_days
0 2018-10-19 286 days
1 2019-05-21 72 days
2 2011-03-28 3048 days
You can use:
from datetime import datetime

sub_date = datetime(2018, 8, 1)
df['last_review'] = pd.to_datetime(df['last_review'])
df['diff'] = (sub_date - df['last_review']).dt.days
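For completeness, here is a minimal runnable sketch combining the two snippets; the reference date follows the first snippet's 2019-08-01 so the numbers match the output shown above:

import pandas as pd
from datetime import datetime

# The three dates come from the sample output above.
df = pd.DataFrame({"last_review": ["2018-10-19", "2019-05-21", "2011-03-28"]})
sub_date = datetime(2019, 8, 1)   # reference date matching the 286/72/3048 output
df['last_review'] = pd.to_datetime(df['last_review'])
df['diff'] = (sub_date - df['last_review']).dt.days
print(df)
#   last_review  diff
# 0  2018-10-19   286
# 1  2019-05-21    72
# 2  2011-03-28  3048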

Creating year week based on date with different start date

I have a df
date
2021-03-12
2021-03-17
...
2022-05-21
2022-08-17
I am trying to add a column year_week, but my year-week numbering starts at 2021-06-28, the Monday of the week containing the first day of July.
I tried:
from datetime import datetime, timedelta

df['date'] = pd.to_datetime(df['date'])
df['year_week'] = (df['date'] - timedelta(days=datetime(2021, 6, 24).timetuple().tm_yday)
                   ).dt.isocalendar().week
I played around with the timedelta days values so that 2021-06-28 has a value of 1.
But then I got problems with earlier dates, and with dates exceeding my start date + 1 year:
2021-03-12 has a value of 38
2022-08-17 has a value of 8
So it looks like the valid period is only from 2021-06-28 to 2021-06-28 + 1 year.
date year_week
2021-03-12 38 # LY38
2021-03-17 39 # LY39
2021-06-28 1 # correct
...
2022-05-21 47 # correct
2022-08-17 8 # NY8
Is there a way to get around this? As I am aggregating the data by year week, I get incorrect results because of the past & upcoming dates. I would want negative values (or a label like LY38, denoting week 38 of the last year) for the days before 2021-06-28, and accordingly year weeks of 52+ (or NY8, denoting the 8th week of the next year) for dates past the first year.
Here is a way (I added two dates more than a year away). You need the isocalendar of the difference between the date column and the day-of-year of your specific date. Then you can select the different scenarios depending on the year of your specific date; use np.select for the different result formats.
import numpy as np
import pandas as pd

# dummy dataframe
df = pd.DataFrame(
    {'date': ['2020-03-12', '2021-03-12', '2021-03-17', '2021-06-28',
              '2022-05-21', '2022-08-17', '2023-08-17']
     }
)
# define the start date
d = pd.to_datetime('2021-6-24')
# remove the number of days of the year from each date
s = (pd.to_datetime(df['date']) - pd.Timedelta(days=d.day_of_year)
     ).dt.isocalendar()
# get the difference in years
m = (s['year'].astype('int32') - d.year)
# all conditions for the result depending on the year difference
conds = [m.eq(0), m.eq(-1), m.eq(1), m.lt(-1), m.gt(1)]
choices = ['', 'LY', 'NY', (m + 1).astype(str) + 'LY', '+' + (m - 1).astype(str) + 'NY']
# create the column
df['res'] = np.select(conds, choices) + s['week'].astype(str)
print(df)
date res
0 2020-03-12 -1LY38
1 2021-03-12 LY38
2 2021-03-17 LY39
3 2021-06-28 1
4 2022-05-21 47
5 2022-08-17 NY8
6 2023-08-17 +1NY8
I think pandas period_range can be of some help:
pd.Series(pd.period_range("6/28/2017", freq="W", periods=n))  # n = the number of weeks you want
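If all you need is a week counter that doesn't wrap at year boundaries, another hedged option (which does not match ISO week labels) is to count whole 7-day blocks from the custom start date; earlier dates then come out negative, as asked:

import pandas as pd

# Count 7-day blocks from the custom start; dates before it go negative
# instead of wrapping around to week 38.
start = pd.Timestamp('2021-06-28')
dates = pd.to_datetime(pd.Series(['2021-03-12', '2021-06-28', '2022-08-17']))
year_week = (dates - start).dt.days // 7 + 1   # week 1 starts on 2021-06-28
print(year_week.tolist())                      # [-15, 1, 60]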

How to remove specific day timestamps from a big dataframe

I have a big dataframe consisting of 600 days' worth of data. Each day has 100 timestamps. I have a separate list of 30 days from which I want to remove data. How do I remove the data of these 30 days from the dataframe?
I tried a for loop, but it did not work. I know there is a simple method, but I don't know how to implement it.
df  # the main dataframe, with many columns and rows; the index is a timestamp

# The date part of the timestamp is sliced into a new column. Instead of the
# index, I want to use this column for comparing with the bad list.
df['dates'] = df.index.strftime('%Y-%m-%d')

bad_list  # a list of bad dates

for i in range(0, len(df)):
    for j in range(0, len(bad_list)):
        if str(df['dates'][i]) == bad_list[j]:
            df.drop(df[i].index, inplace=True)
You can do the following:
df['dates'] = df.index.strftime('%Y-%m-%d')
# bad_list entries must be strings in the same '%Y-%m-%d' format;
# e.g. if Jan 1, 2000 is a bad date, it should appear in the list as '2000-01-01'.
newdf = df[~df['dates'].isin(bad_list)]
# the ~ is used to denote "not in" the list
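A tiny hypothetical demo of that filter (timestamps and bad dates invented):

import pandas as pd

idx = pd.to_datetime(['2000-01-01 09:30', '2000-01-01 10:30', '2000-01-02 09:30'])
df = pd.DataFrame({'x': [1, 2, 3]}, index=idx)
bad_list = ['2000-01-01']                      # strings in '%Y-%m-%d' format
df['dates'] = df.index.strftime('%Y-%m-%d')
newdf = df[~df['dates'].isin(bad_list)]
print(newdf)                                   # only the 2000-01-02 row remains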
You can perform a simple comparison:
>>> from time import time
>>> import numpy as np, pandas as pd
>>> dates = pd.Series(pd.to_datetime(np.random.randint(int(time()) - 60 * 60 * 24 * 5, int(time()), 12), unit='s'))
>>> dates
0 2019-03-19 05:25:32
1 2019-03-20 00:58:29
2 2019-03-19 01:03:36
3 2019-03-22 11:45:24
4 2019-03-19 08:14:29
5 2019-03-21 10:17:13
6 2019-03-18 09:09:15
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
9 2019-03-23 06:19:35
10 2019-03-23 05:42:34
11 2019-03-21 11:37:46
>>> start_date = pd.to_datetime('2019-03-20')
>>> end_date = pd.to_datetime('2019-03-22')
>>> dates[(dates > start_date) & (dates < end_date)]
1 2019-03-20 00:58:29
5 2019-03-21 10:17:13
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
11 2019-03-21 11:37:46
If your source Series is not in datetime format, then you will need to use pd.to_datetime to convert it.

How to find the median month between two dates?

I need to find the median month value between two dates in a DataFrame. I am simplifying the case by showing four examples.
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame([["1/31/2016", "3/1/2016"],
                   ["6/15/2016", "7/14/2016"],
                   ["7/14/2016", "8/15/2016"],
                   ["8/7/2016", "9/6/2016"]], columns=['FromDate', 'ToDate'])
df['Month'] = df.ToDate.dt.month - df.FromDate.dt.month
I am trying to append a column but I am not getting the desired result.
I need to see these values: [2,6,7,8].
You can calculate the average date explicitly by adding half the timedelta between 2 dates to the earlier date. Then just extract the month:
# convert to datetime if necessary
df[df.columns] = df[df.columns].apply(pd.to_datetime)
# calculate mean date, then extract month
df['Month'] = (df['FromDate'] + (df['ToDate'] - df['FromDate']) / 2).dt.month
print(df)
FromDate ToDate Month
0 2016-01-31 2016-03-01 2
1 2016-06-15 2016-07-14 6
2 2016-07-14 2016-08-15 7
3 2016-08-07 2016-09-06 8
You need to convert the string to datetime before using dt.month.
This line calculates the average month number:
df['Month'] = (pd.to_datetime(df['ToDate']).dt.month +
               pd.to_datetime(df['FromDate']).dt.month) // 2
print(df)
FromDate ToDate Month
0 1/31/2016 3/1/2016 2
1 6/15/2016 7/14/2016 6
2 7/14/2016 8/15/2016 7
3 8/7/2016 9/6/2016 8
This only works with both dates in the same year.
jpp's solution is fine but will in some cases give the wrong answer:
with ["1/1/2016", "3/1/2016"] one would expect 2, because February is between January and March, but jpp's gives 1, corresponding to January.
