I have 2 datasets to work with:
ID Date Amount
1 2020-01-02 1000
1 2020-01-09 200
1 2020-01-08 400
A second dataset gives the most frequent day of the week and the most frequent week of the month for each ID (there are multiple such IDs):
ID Pref_Day_Of_Week_A Pref_Week_Of_Month_A
1 3 2
For ID 1, Thursday (3, counting Monday as 0) was the most frequent day of the week and the 2nd week of the month was the most frequent week of the month.
I want to find the sum of all amounts that fall on the most frequent day of the week and in the most frequent week of the month, for every ID (hence the groupby):
ID Amount_On_Pref_Day Amount_Pref_Week
1 1200 600
I would really appreciate it if anyone could help me calculate this dataframe using pandas. For reference, I have used this function to find the week of month for a given date:
#https://stackoverflow.com/a/64192858/2901002
def weekinmonth(dates):
    """Get week number in a month.

    Parameters:
        dates (pd.Series): Series of dates.
    Returns:
        pd.Series: Week number in a month.
    """
    firstday_in_month = dates - pd.to_timedelta(dates.dt.day - 1, unit='d')
    return (dates.dt.day-1 + firstday_in_month.dt.weekday) // 7 + 1
The idea is to keep only the rows that match the preferred day of week and week of month, aggregate them with sum, and finally join the two results together with concat:
import pandas as pd

#https://stackoverflow.com/a/64192858/2901002
def weekinmonth(dates):
    """Get week number in a month.

    Parameters:
        dates (pd.Series): Series of dates.
    Returns:
        pd.Series: Week number in a month.
    """
    firstday_in_month = dates - pd.to_timedelta(dates.dt.day - 1, unit='d')
    return (dates.dt.day-1 + firstday_in_month.dt.weekday) // 7 + 1

# convert to datetimes, then derive day of week (Monday=0) and week of month
df['Date'] = pd.to_datetime(df['Date'])
df['dayofweek'] = df['Date'].dt.dayofweek
df['week'] = weekinmonth(df['Date'])

# most frequent value per group (the first one if there are ties)
f = lambda x: x.mode().iat[0]
df1 = (df.groupby('ID', as_index=False)
         .agg(Pref_Day_Of_Week_A=('dayofweek', f),
              Pref_Week_Of_Month_A=('week', f)))

# keep only the rows matching the preferred day / week, then sum per ID
s1 = df1.rename(columns={'Pref_Day_Of_Week_A':'dayofweek'}).merge(df).groupby('ID')['Amount'].sum()
s2 = df1.rename(columns={'Pref_Week_Of_Month_A':'week'}).merge(df).groupby('ID')['Amount'].sum()
df2 = pd.concat([s1, s2], axis=1, keys=('Amount_On_Pref_Day','Amount_Pref_Week'))
print(df2)
Amount_On_Pref_Day Amount_Pref_Week
ID
1 1200 600
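If the preference table from the question already exists, it can be merged in directly instead of being recomputed. The following is only a sketch under assumptions: the name `prefs` is made up here, and `Pref_Day_Of_Week_A` is assumed to use the pandas numbering (Monday=0, so 3 is Thursday), as in the question's example.

```python
import pandas as pd

def weekinmonth(dates):
    # same helper as above: week number within the month
    firstday_in_month = dates - pd.to_timedelta(dates.dt.day - 1, unit="d")
    return (dates.dt.day - 1 + firstday_in_month.dt.weekday) // 7 + 1

df = pd.DataFrame({
    "ID": [1, 1, 1],
    "Date": pd.to_datetime(["2020-01-02", "2020-01-09", "2020-01-08"]),
    "Amount": [1000, 200, 400],
})
# hypothetical name for the second dataset from the question
prefs = pd.DataFrame({"ID": [1], "Pref_Day_Of_Week_A": [3], "Pref_Week_Of_Month_A": [2]})

# attach each ID's preferences to every transaction row
merged = df.merge(prefs, on="ID", how="left")
on_day = merged["Date"].dt.dayofweek.eq(merged["Pref_Day_Of_Week_A"])
in_week = weekinmonth(merged["Date"]).eq(merged["Pref_Week_Of_Month_A"])

# zero out non-matching rows, then sum per ID
out = pd.DataFrame({
    "Amount_On_Pref_Day": merged["Amount"].where(on_day, 0).groupby(merged["ID"]).sum(),
    "Amount_Pref_Week": merged["Amount"].where(in_week, 0).groupby(merged["ID"]).sum(),
})
print(out)
```

Using `where(..., 0)` rather than filtering keeps IDs that have no matching rows in the result, with a sum of 0.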
Related
How can I extract the nearest last month-to-date (Last MTD) data if the same day of the previous month had no sale? Please refer to the sample provided for more detail.
Original data:
There may be no sale on the same day of last month, so I need to find the nearest day in last month relative to today.
Currently I use pd.merge to get the Last MTD data, but if the same day of last month had no sale of the product, it shows zero.
Example 1:
02/10/2022 VS 02/09/2022
02/10/2022 has a Clothes sale, but 02/09/2022 does not. I expect the Last MTD column to display the MTD data from last month.
Current result:
Expected output:
Code:
df["pdate"] = df.Date.apply(lambda x: (x - pd.DateOffset(months=1)))
df2 = df.copy()
final_df = pd.merge(left = df,right = df2, how="left", left_on=['pdate','Product'], right_on=['Date', 'Product'])
######## For understanding (can ignore)
###############
Example 2:
03/10/2022 VS 03/09/2022
03/10/2022 has a Dining room sale, but 03/09/2022 does not. I expect the Last MTD column to display the MTD data from last month.
Current result:
Expected result:
You can merge using a shifted period:
df['period'] = pd.to_datetime(df['date'], dayfirst=False).dt.to_period('M')
df['prev_period'] = df.groupby('product')['period'].shift()
out = (df.merge(df[['product', 'period', 'MTD']],
                how='left', suffixes=[None, '_previous'],
                left_on=['product', 'prev_period'],
                right_on=['product', 'period'])
         [['date', 'product', 'MTD', 'MTD_previous']]
      )
Example :
date product MTD MTD_previous
0 01/09/2022 A 1 NaN
1 01/09/2022 B 2 NaN
2 02/09/2022 A 3 1.0
3 02/10/2022 B 4 2.0
Used input:
df = pd.DataFrame({'date': ['01/09/2022', '01/09/2022', '02/09/2022', '02/10/2022'],
                   'product': ['A', 'B', 'A', 'B'],
                   'MTD': [1, 2, 3, 4]
                   })
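A different technique that may fit the "nearest earlier day" requirement is pd.merge_asof, which matches each row to the latest earlier key instead of requiring an exact match. This is only a sketch with made-up sample data (the original data was not posted as text); the column names `prev_date` and `Last_MTD` are assumptions.

```python
import pandas as pd

# hypothetical sample: no Clothes sale on 2022-09-02, but one on 2022-09-01
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2022-09-01", "2022-09-03", "2022-10-02"]),
    "Product": ["Clothes", "Clothes", "Clothes"],
    "MTD": [10, 25, 7],
})

# the same day one month earlier; merge_asof needs both keys sorted
current = sales.assign(pdate=sales["Date"] - pd.DateOffset(months=1)).sort_values("pdate")
previous = (sales.rename(columns={"Date": "prev_date", "MTD": "Last_MTD"})
                 .sort_values("prev_date"))

# for each pdate, take the latest prev_date <= pdate within the same product
out = pd.merge_asof(current, previous, left_on="pdate", right_on="prev_date",
                    by="Product", direction="backward")
print(out[["Date", "Product", "MTD", "prev_date", "Last_MTD"]])
```

Note that a backward match can reach further back than one month if there is no sale in the previous month at all, so an extra check that `prev_date` falls in the same month as `pdate` may be needed.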
My DataFrame looks like this:
id date value
1 2021-07-16 100
2 2021-09-15 20
1 2021-04-10 50
1 2021-08-27 30
2 2021-07-22 15
2 2021-07-22 25
1 2021-06-30 40
3 2021-10-11 150
2 2021-08-03 15
1 2021-07-02 90
I want to group by id and return the difference in total value between two 90-day periods.
Specifically, I want the values for the last 90 days counted back from today, and for the 90 days counted back from 30 days ago.
For example, considering today is 2021-10-13, I would like to get:
the sum of all values per id between 2021-10-13 and 2021-07-15
the sum of all values per id between 2021-09-13 and 2021-06-15
And finally, subtract them to get the variation.
I've already managed to calculate it by creating separate temporary dataframes containing only the dates in those 90-day periods, grouping by id, and then merging these temporary dataframes into a final one.
But I guess there should be an easier or simpler way to do it. I'd appreciate any help!
Btw, sorry if the explanation was a little messy.
If I understood correctly, you need something like this:
import pandas as pd
import datetime
## Calculation of the dates that we are gonna need.
today = datetime.datetime.now()
delta = datetime.timedelta(days = 120)
# Date of the 120 days ago
hundredTwentyDaysAgo = today - delta
delta = datetime.timedelta(days = 90)
# Date of the 90 days ago
ninetyDaysAgo = today - delta
delta = datetime.timedelta(days = 30)
# Date of the 30 days ago
thirtyDaysAgo = today - delta
## Initializing an example df.
df = pd.DataFrame({"id":[1,2,1,1,2,2,1,3,2,1],
                   "date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22", "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
                   "value": [100,20,50,30,15,25,40,150,15,90]})
## Casting date column
df['date'] = pd.to_datetime(df['date']).dt.date
grouped = df.groupby('id')
# Sum of last 90 days per id
ninetySum = grouped.apply(lambda x: x[x['date'] >= ninetyDaysAgo.date()]['value'].sum())
# Sum of last 90 days, starting from 30 days ago per id
hundredTwentySum = grouped.apply(lambda x: x[(x['date'] >= hundredTwentyDaysAgo.date()) & (x['date'] <= thirtyDaysAgo.date())]['value'].sum())
The output, when run on 2021-10-13 as in the question's example, is
ninetySum - hundredTwentySum
id
1 -130
2 20
3 150
dtype: int64
You can double check to make sure these are the numbers you wanted by printing ninetySum and hundredTwentySum variables.
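The same result can be obtained without groupby.apply by masking the rows first and then grouping. This is a sketch, not the answer's method: `today` is pinned to the question's example date so that the numbers are reproducible, and the helper name `window_sum` is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 1, 1, 2, 2, 1, 3, 2, 1],
    "date": ["2021-07-16", "2021-09-15", "2021-04-10", "2021-08-27", "2021-07-22",
             "2021-07-22", "2021-06-30", "2021-10-11", "2021-08-03", "2021-07-02"],
    "value": [100, 20, 50, 30, 15, 25, 40, 150, 15, 90],
})
df["date"] = pd.to_datetime(df["date"])

# assumption: fixed "today" taken from the question's example
today = pd.Timestamp("2021-10-13")

def window_sum(frame, end, days=90):
    """Sum of value per id for dates in [end - days, end], both ends inclusive."""
    start = end - pd.Timedelta(days=days)
    mask = frame["date"].between(start, end)
    return frame.loc[mask].groupby("id")["value"].sum()

recent = window_sum(df, today)                            # 2021-07-15 .. 2021-10-13
earlier = window_sum(df, today - pd.Timedelta(days=30))   # 2021-06-15 .. 2021-09-13

# ids missing from one window count as 0 there
variation = recent.sub(earlier, fill_value=0).astype(int)
print(variation)
```

Passing `fill_value=0` matters for id 3, which has no rows in the earlier window and would otherwise come out as NaN.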
I have a df
date
2021-03-12
2021-03-17
...
2022-05-21
2022-08-17
I am trying to add a column year_week, but my year week starts at 2021-06-28, which is the first day of July.
I tried:
df['date'] = pd.to_datetime(df['date'])
df['year_week'] = (df['date'] - timedelta(days=datetime(2021, 6, 24).timetuple()
                                          .tm_yday)).dt.isocalendar().week
I played around with the timedelta days values so that the 2021-06-28 has a value of 1.
But then I ran into problems with dates before my start date and with dates more than one year after it:
2021-03-12 has a value of 38
2022-08-17 has a value of 8
So it looks like the valid period is the one year starting from 2021-06-28.
date year_week
2021-03-12 38 # LY38
2021-03-17 39 # LY39
2021-06-28 1 # correct
...
2022-05-21 47 # correct
2022-08-17 8 # NY8
Is there a way to get around this? Because I am aggregating the data by year week, I get incorrect results from the past and upcoming dates. I would like either negative values for days before 2021-06-28 or a label such as LY38 denoting week 38 of the last year; likewise, year weeks of 52+ or a label such as NY8 denoting the 8th week of the next year.
Here is one way; I added two more dates that are more than a year away from the start. You need the isocalendar of the difference between the date column and the day of year of your specific date. Then you can distinguish the scenarios by the year difference relative to your specific date, and use np.select to build the different result formats.
import numpy as np
import pandas as pd

# dummy dataframe
df = pd.DataFrame(
    {'date': ['2020-03-12', '2021-03-12', '2021-03-17', '2021-06-28',
              '2022-05-21', '2022-08-17', '2023-08-17']
     }
)
# define start date
d = pd.to_datetime('2021-6-24')
# subtract the start date's day of year from each date
s = (pd.to_datetime(df['date']) - pd.Timedelta(days=d.day_of_year)
     ).dt.isocalendar()
# get the difference in years
m = (s['year'].astype('int32') - d.year)
# all result formats, depending on the year difference
conds = [m.eq(0), m.eq(-1), m.eq(1), m.lt(-1), m.gt(1)]
choices = ['', 'LY','NY',(m+1).astype(str)+'LY', '+'+(m-1).astype(str)+'NY']
# create the column
df['res'] = np.select(conds, choices) + s['week'].astype(str)
print(df)
date res
0 2020-03-12 -1LY38
1 2021-03-12 LY38
2 2021-03-17 LY39
3 2021-06-28 1
4 2022-05-21 47
5 2022-08-17 NY8
6 2023-08-17 +1NY8
I think pandas period_range can be of some help:
pd.Series(pd.period_range("6/28/2017", freq="W", periods=n_weeks))  # n_weeks = number of weeks you want
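Since the question also accepts negative or 52+ week numbers, a purely numeric week index relative to the start date may be enough for aggregation. This is a sketch of that alternative, not the approach from either answer; the names `start` and `week_index` are made up.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(
    ["2021-03-12", "2021-03-17", "2021-06-28", "2022-05-21", "2022-08-17"]
))
start = pd.Timestamp("2021-06-28")  # first day of week 1, from the question

# whole weeks elapsed since the start date; floor division keeps earlier dates <= 0
week_index = (dates - start).dt.days // 7 + 1
print(pd.DataFrame({"date": dates, "week_index": week_index}))
```

With this numbering the week right before the start date is week 0, earlier weeks are negative, and weeks after the first year simply continue past 52, so every date gets a unique, sortable group key.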
I have a Pandas dataframe, which looks like below
I want to create a new column, which tells the exact date from the information from all the above columns. The code should look something like this:
df['Date'] = pd.to_datetime(df['Month']+df['WeekOfMonth']+df['DayOfWeek']+df['Year'])
I was able to find a workaround for your case. You will need to define the dictionaries for the months and the days of the week.
month = {"Jan":"01", "Feb":"02", "March":"03", "Apr": "04", "May":"05", "Jun":"06", "Jul":"07", "Aug":"08", "Sep":"09", "Oct":"10", "Nov":"11", "Dec":"12"}
week = {"Monday":1,"Tuesday":2,"Wednesday":3,"Thursday":4,"Friday":5,"Saturday":6,"Sunday":7}
With these dictionaries, the transformation that I used with a custom dataframe was:
rows = [["Dec",5,"Wednesday", "1995"],
["Jan",3,"Wednesday","2013"]]
df = pd.DataFrame(rows, columns=["Month","Week","Weekday","Year"])
df['Date'] = (df["Year"] + "-" + df["Month"].map(month) + "-" + (df["Week"].apply(lambda x: (x - 1)*7) + df["Weekday"].map(week).apply(int) ).apply(str)).astype('datetime64[ns]')
However, you have to be careful. With some of the data you posted as an example, the computed day exceeds the valid range for the month. For example, for
row = ["Oct",5,"Friday","2018"]
the computed date is 2018-10-33. I recommend adding some logic to filter your data in order to avoid this kind of problem.
Let's approach it in 3 steps as follows:
Get the date of month start Month_Start from Year and Month
Calculate the date offsets DateOffset relative to Month_Start from WeekOfMonth and DayOfWeek
Get the actual date Date from Month_Start and DateOffset
Here's the code:
import time

df['Month_Start'] = pd.to_datetime(df['Year'].astype(str) + df['Month'] + '01', format="%Y%b%d")
# weekday number (Monday=0) parsed from the day name, relative to the weekday of the month start
df['DateOffset'] = (df['WeekOfMonth'] - 1) * 7 + df['DayOfWeek'].map(lambda x: time.strptime(x, '%A').tm_wday) - df['Month_Start'].dt.dayofweek
df['Date'] = df['Month_Start'] + pd.to_timedelta(df['DateOffset'], unit='D')
Output:
Month WeekOfMonth DayOfWeek Year Month_Start DateOffset Date
0 Dec 5 Wednesday 1995 1995-12-01 26 1995-12-27
1 Jan 3 Wednesday 2013 2013-01-01 15 2013-01-16
2 Oct 5 Friday 2018 2018-10-01 32 2018-11-02
3 Jun 2 Saturday 1980 1980-06-01 6 1980-06-07
4 Jan 5 Monday 1976 1976-01-01 25 1976-01-26
The Date column now contains the dates derived from the information from other columns.
You can remove the working interim columns, if you like, as follows:
df = df.drop(['Month_Start', 'DateOffset'], axis=1)
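As the third output row shows (Oct, week 5, Friday gives 2018-11-02), a combination can land outside the stated month. A minimal sketch of flagging such rows, assuming the same column names as above; the `In_Month` column and the `weekday_number` dictionary are made up for illustration, and `%b` assumes English month abbreviations.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Dec", "Jan", "Oct"],
    "WeekOfMonth": [5, 3, 5],
    "DayOfWeek": ["Wednesday", "Wednesday", "Friday"],
    "Year": [1995, 2013, 2018],
})
# Monday=0 numbering, matching pandas' dt.dayofweek
weekday_number = {"Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
                  "Friday": 4, "Saturday": 5, "Sunday": 6}

month_start = pd.to_datetime(df["Year"].astype(str) + df["Month"] + "01", format="%Y%b%d")
offset = ((df["WeekOfMonth"] - 1) * 7
          + df["DayOfWeek"].map(weekday_number)
          - month_start.dt.dayofweek)
df["Date"] = month_start + pd.to_timedelta(offset, unit="D")

# True only when the derived date stays inside the stated month
df["In_Month"] = df["Date"].dt.month.eq(month_start.dt.month)
print(df)
```

Rows where `In_Month` is False can then be dropped or reviewed, depending on how the source data defines its week of month.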
data.CSV
ID Activity Month Activity Date
0 04/2019 04-01-2019
1 05/2019 05-13-2019
2 05/2019 05-25-2019
3 06/2019 06-10-2019
4 06/2019 06-19-2019
5 07/2019 07-15-2019
6 07/2019 07-18-2019
7 07/2019 07-29-2019
8 08/2019 06-03-2019
9 08/2019 06-15-2019
10 08/2019 06-20-2019
MY PLAN
Read csv:
df = pd.read_csv('data.CSV')
Convert to datetime:
df['Activity Date'] = pd.to_datetime(df['Activity Date'], format='%m-%d-%Y')  # sample dates are month-day-year
Groupby the Activity Month column:
grouped = df.groupby(['Activity Month'])['Activity Date'].count()
print(grouped)
Activity Month
04/2019 15532
05/2019 13924
06/2019 12822
07/2019 14067
08/2019 10939
Name: Activity Date, dtype: int64
While the date is grouped, perform business day calculation:
This is the part where I'm not sure what to do; I'm lost.
CODE I USED TO CALCULATE BUSINESS DAYS
import calendar
import datetime
x = datetime.date(2019, 4, 1)
cal = calendar.Calendar()
working_days = len([x for x in cal.itermonthdays2(x.year, x.month) if x[0] !=0 and x[1] < 5])
print ("Total business days for month (" + str(x.month) + ") is " + str(working_days) + " days")
OUTPUT THAT I WANTED
Total business days for month (4) is 22 days
Total business days for month (5) is 23 days
Total business days for month (6) is 20 days
Total business days for month (7) is 23 days
Total business days for month (8) is 22 days
I'm not entirely clear on the problem statement here, but if you want to calculate the number of business days for each Activity Month, you can wrap your calculation in a function and apply it over your Activity Month column (apply is basically a for loop over each value of the column).
import calendar

grouped = df.groupby(['Activity Month'])['Activity Date'].count().reset_index()

def get_business_days(month_str):
    # month_str looks like '04/2019'
    month, year = (int(part) for part in month_str.split('/'))
    cal = calendar.Calendar()
    # itermonthdays2 yields (day_of_month, weekday); day 0 is padding, weekday < 5 is Mon-Fri
    working_days = len([d for d in cal.itermonthdays2(year, month) if d[0] != 0 and d[1] < 5])
    return ("Total business days for month (" + str(month) + ") is " + str(working_days) + " days")

grouped['Activity Month'].apply(get_business_days)
The output is a Series that has your text output.
0 Total business days for month (4) is 22 days
1 Total business days for month (5) is 23 days
2 Total business days for month (6) is 20 days
3 Total business days for month (7) is 23 days
4 Total business days for month (8) is 22 days
But it's a bad idea to store the same repeated text in every cell. It would be preferable to simply return working_days instead of embedding it in a string.
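A sketch of what returning plain numbers could look like, using numpy's busday_count instead of the calendar loop. This is a swapped-in technique, not the code above, and it counts Monday to Friday only, with no holiday calendar; the names `starts`, `ends`, and `business_days` are made up.

```python
import numpy as np
import pandas as pd

months = pd.Series(["04/2019", "05/2019", "06/2019", "07/2019", "08/2019"],
                   name="Activity Month")

# first day of each month and first day of the following month
starts = pd.to_datetime(months, format="%m/%Y")
ends = starts + pd.offsets.MonthBegin(1)

# busday_count counts weekdays in the half-open interval [start, end)
counts = np.busday_count(starts.values.astype("datetime64[D]"),
                         ends.values.astype("datetime64[D]"))
business_days = pd.Series(counts, index=months, name="business_days")
print(business_days)
```

Keeping the result numeric means it can be merged back onto the grouped counts or used in further arithmetic, and the sentence can still be formatted at print time.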