One query I often run in SQL is to join a table back to itself and summarize each row based on records with the same ID, either backward or forward in time.
For example, assume table1 has columns 'ID', 'Date', 'Var1'.
In SQL I could sum var1 for the past 3 months for each record like this:
Select a.ID, a.Date, sum(b.Var1) as sum_var1
from table1 a
left outer join table1 b
on a.ID = b.ID
and months_between(a.date, b.date) > 0
and months_between(a.date, b.date) < 3
group by a.ID, a.Date
Is there any way to do this in Pandas?
It seems you need GroupBy + rolling. Implementing the logic in precisely the same way it is written in SQL is likely to be expensive as it will involve repeated loops. Let's take an example dataframe:
Date ID Var1
0 2015-01-01 1 0
1 2015-02-01 1 1
2 2015-03-01 1 2
3 2015-04-01 1 3
4 2015-05-01 1 4
5 2015-01-01 2 5
6 2015-02-01 2 6
7 2015-03-01 2 7
8 2015-04-01 2 8
9 2015-05-01 2 9
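(The frame above can be reconstructed as follows; this is an assumed sketch, since only the printed output is shown:)
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01',
                                           '2015-04-01', '2015-05-01'] * 2),
                   'ID': [1] * 5 + [2] * 5,
                   'Var1': range(10)})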
You can add a column which, by group, looks back and sums a variable over a fixed period. First define a function utilizing pd.Series.rolling:
def lookbacker(x):
"""Sum over past 70 days"""
return x.rolling('70D').sum().astype(int)
Then apply it on a GroupBy object and extract values for assignment:
df['Lookback_Sum'] = df.set_index('Date').groupby('ID')['Var1'].apply(lookbacker).values
print(df)
Date ID Var1 Lookback_Sum
0 2015-01-01 1 0 0
1 2015-02-01 1 1 1
2 2015-03-01 1 2 3
3 2015-04-01 1 3 6
4 2015-05-01 1 4 9
5 2015-01-01 2 5 5
6 2015-02-01 2 6 11
7 2015-03-01 2 7 18
8 2015-04-01 2 8 21
9 2015-05-01 2 9 24
It appears pd.Series.rolling does not work with months, e.g. using '2M' (2 months) instead of '70D' (70 days) gives ValueError: <2 * MonthEnds> is a non-fixed frequency. This makes sense since a "month" is ambiguous given months have different numbers of days.
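A minimal sketch reproducing that error with the frame above (the exact message can vary by pandas version):
# Raises ValueError: <2 * MonthEnds> is a non-fixed frequency
df.set_index('Date').groupby('ID')['Var1'].apply(lambda s: s.rolling('2M').sum())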
Another point worth mentioning: you can use GroupBy + rolling directly, and possibly more efficiently, by bypassing apply, but this requires your index to be monotonic, for example via sort_index:
df['Lookback_Sum'] = df.set_index('Date').sort_index()\
.groupby('ID')['Var1'].rolling('70D').sum()\
.astype(int).values
I don't think pandas.DataFrame.rolling() supports rolling-window aggregation by some number of months; currently, you must specify a fixed number of days or another fixed-length time period.
But as @jpp mentioned, you can use Python loops to perform rolling aggregation over a window size specified in calendar months, where the number of days in each window will vary depending on what part of the calendar you're rolling over.
The following approach builds on this SO answer as well as @jpp's:
# Build some example data:
# 3 unique IDs, each with 365 samples, one sample per day throughout 2015
df = pd.DataFrame({'Date': pd.date_range('2015-01-01', '2015-12-31', freq='D'),
'Var1': list(range(365))})
df = pd.concat([df] * 3)
df['ID'] = [1]*365 + [2]*365 + [3]*365
df.head()
Date Var1 ID
0 2015-01-01 0 1
1 2015-01-02 1 1
2 2015-01-03 2 1
3 2015-01-04 3 1
4 2015-01-05 4 1
# Define a lookback function that mimics rolling aggregation,
# but uses DateOffset() slicing, rather than a window of fixed size.
# Use .count() here as a sanity check; you will need .sum()
def lookbacker(ser):
return pd.Series([ser.loc[d - pd.offsets.DateOffset(months=3):d].count()
for d in ser.index])
# By default, groupby.agg output is sorted by key. So make sure to
# sort df by (ID, Date) before inserting the flattened groupby result
# into a new column
df.sort_values(['ID', 'Date'], inplace=True)
df.set_index('Date', inplace=True)
df['window_size'] = df.groupby('ID')['Var1'].apply(lookbacker).values
# Manually check the resulting window sizes
df.head()
Var1 ID window_size
Date
2015-01-01 0 1 1
2015-01-02 1 1 2
2015-01-03 2 1 3
2015-01-04 3 1 4
2015-01-05 4 1 5
df.tail()
Var1 ID window_size
Date
2015-12-27 360 3 92
2015-12-28 361 3 92
2015-12-29 362 3 92
2015-12-30 363 3 92
2015-12-31 364 3 93
df[df.ID == 1].loc['2015-05-25':'2015-06-05']
Var1 ID window_size
Date
2015-05-25 144 1 90
2015-05-26 145 1 90
2015-05-27 146 1 90
2015-05-28 147 1 90
2015-05-29 148 1 91
2015-05-30 149 1 92
2015-05-31 150 1 93
2015-06-01 151 1 93
2015-06-02 152 1 93
2015-06-03 153 1 93
2015-06-04 154 1 93
2015-06-05 155 1 93
The last column gives the lookback window size in days, looking back from that date, including both the start and end dates.
Looking "3 months" before 2016-05-31 would land you at 2015-02-31, but February has only 28 days in 2015. As you can see in the sequence 90, 91, 92, 93 in the above sanity check, This DateOffset approach maps the last four days in May to the last day in February:
pd.to_datetime('2015-05-31') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-30') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-29') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-28') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
I don't know if this matches SQL's behaviour, but in any case, you'll want to test this and decide if this makes sense in your case.
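To get the actual lookback sum (rather than the sanity-check count), a sketch that swaps .count() for .sum() under the same setup (df already sorted by ID and Date and indexed by Date; the name lookback_sum is just illustrative):
def lookback_sum(ser):
    # sum Var1 over the window [d - 3 months, d], using the same DateOffset slicing
    return pd.Series([ser.loc[d - pd.offsets.DateOffset(months=3):d].sum()
                      for d in ser.index])

df['lookback_sum'] = df.groupby('ID')['Var1'].apply(lookback_sum).values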
You could use a lambda with apply to achieve it:
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
We also need an equivalent of SQL's months_between.
The complete example:
from datetime import datetime
import datetime as dt
import pandas as pd
def months_between(date1, date2):
    # same day of month: an exact number of months apart
    if date1.day == date2.day:
        return (date1.year - date2.year) * 12 + date1.month - date2.month
    # if both are the last day of their month, also treat as whole months
    if date1.month != (date1 + dt.timedelta(days=1)).month:
        if date2.month != (date2 + dt.timedelta(days=1)).month:
            return (date1.year - date2.year) * 12 + date1.month - date2.month
    # otherwise approximate the fraction of a month
    return (date1 - date2).days / 31
def findSum(cRow):
table1['month_diff'] = table1['Date'].apply(months_between, date2=cRow['Date'])
filtered_table = table1[(table1["month_diff"] < 0) & (table1["month_diff"] > -3) & (table1['ID'] == cRow['ID'])]
if filtered_table.empty:
return 0
return filtered_table['Var1'].sum()
table1 = pd.DataFrame(columns = ['ID', 'Date', 'Var1'])
table1.loc[len(table1)] = [1, datetime.strptime('2015-01-01','%Y-%m-%d'), 0]
table1.loc[len(table1)] = [1, datetime.strptime('2015-02-01','%Y-%m-%d'), 1]
table1.loc[len(table1)] = [1, datetime.strptime('2015-03-01','%Y-%m-%d'), 2]
table1.loc[len(table1)] = [1, datetime.strptime('2015-04-01','%Y-%m-%d'), 3]
table1.loc[len(table1)] = [1, datetime.strptime('2015-05-01','%Y-%m-%d'), 4]
table1.loc[len(table1)] = [2, datetime.strptime('2015-01-01','%Y-%m-%d'), 5]
table1.loc[len(table1)] = [2, datetime.strptime('2015-02-01','%Y-%m-%d'), 6]
table1.loc[len(table1)] = [2, datetime.strptime('2015-03-01','%Y-%m-%d'), 7]
table1.loc[len(table1)] = [2, datetime.strptime('2015-04-01','%Y-%m-%d'), 8]
table1.loc[len(table1)] = [2, datetime.strptime('2015-05-01','%Y-%m-%d'), 9]
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
table1.drop(columns=['month_diff'], inplace=True)
print(table1)
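As an aside, the row-by-row .loc appends above can be collapsed into a single constructor call (a sketch producing the same data):
table1 = pd.DataFrame({'ID': [1] * 5 + [2] * 5,
                       'Date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01',
                                               '2015-04-01', '2015-05-01'] * 2),
                       'Var1': range(10)})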
Related
I have time series data and non-continuous log data with timestamps. I want to merge the logs into the time series data and create new columns from the log values.
Let the time series data be:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=pd.date_range(freq='5T', start='2020-10-10', periods=12*24*5))
df['col'] = np.random.randint(1, 101, size=df.shape[0])  # random_integers is deprecated; randint(1, 101) is the inclusive equivalent
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq='5T', start='2020-10-10', periods=12*24*5))
df2['col'] = np.random.randint(1, 51, size=df2.shape[0])
df2['uid'] = 2
df3=pd.concat([df, df2]).reset_index()
df3= df3.rename(columns={'index': 'timestamp'})
timestamp col uid
0 2020-10-10 00:00:00 96 1
1 2020-10-10 00:05:00 47 1
2 2020-10-10 00:10:00 78 1
3 2020-10-10 00:15:00 27 1
...
Let the log data be:
import datetime as dt
df_log=pd.DataFrame(np.array([[100, 1, 3], [40, 2, 6], [50, 1, 5], [60, 2, 9], [20, 1, 2], [30, 2, 5]]),
columns=['duration', 'uid', 'factor'])
df_log['timestamp'] = pd.Series([dt.datetime(2020,10,10, 15,21), dt.datetime(2020,10,10, 16,27),
dt.datetime(2020,10,11, 21,25), dt.datetime(2020,10,11, 10,12),
dt.datetime(2020,10,13, 20,56), dt.datetime(2020,10,13, 13,15)])
duration uid factor timestamp
0 100 1 3 2020-10-10 15:21:00
1 40 2 6 2020-10-10 16:27:00
...
I want to merge these two (df_merged), and create new column in the time series data as such (respective to the uid):
df_merged['new'] = df_merged['duration'] * df_merged['factor']
and forward-fill df_merged['new'] with this value until the next log for each uid, then do the same operation on the next log and sum, over a moving 2-day window.
Can anybody show me a direction for this problem?
Expected Output:
timestamp col uid duration factor new
0 2020-10-10 15:20:00 96 1 100 3 300
1 2020-10-10 15:25:00 47 1 100 3 300
2 2020-10-10 15:30:00 78 1 100 3 300
...
2020-10-11 21:25:00 .. 1 60 9 540+300
2020-10-11 21:30:00 .. 1 60 9 540+300
...
2020-10-13 20:55:00 .. 1 20 2 40+540
2020-10-13 21:00:00 .. 1 20 2 40+540
..
2020-10-13 21:25:00 .. 1 20 2 40
As I understand it, it's simpler to calculate the new column on df_log before merging. You'd just use rolling to calculate the window for each uid group:
df_log["new"] = df_log["duration"] * df_log["factor"]
# 2 day rolling window summing `new`
df_log = df_log.groupby("uid").rolling("2d", on="timestamp")["new"].sum().to_frame()
Then merging is straightforward:
# prepare for merge
df_log = df_log.sort_values(by="timestamp")
df3 = df3.sort_values(by="timestamp")
df_merged = (
pd.merge_asof(df3, df_log, on="timestamp", by=["uid"])
.dropna()
.reset_index(drop=True)
)
This solution does deviate slightly from your expected output. The first included row from the continuous series (df3) would be at timestamp 2020-10-10 15:25:00 instead of 2020-10-10 15:20:00 since the merge method would look for the last timestamp in df_log before the timestamp in df3.
Alternatively, if you require the first row in the output to have timestamp 2020-10-10 15:20:00, you can use direction="forward" in pd.merge_asof. That would make each row match the first row in df_log with a timestamp after the one in df3, so you'd need to remove the extra rows in the beginning for each uid.
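A sketch of that forward-looking variant, under the same df3/df_log assumptions (trimming the extra leading rows per uid is left out):
df_merged_fwd = (
    pd.merge_asof(df3, df_log, on="timestamp", by=["uid"], direction="forward")
    .dropna()
    .reset_index(drop=True)
)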
I want to convert all rows of my DataFrame that contain hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 means minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe:
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta:
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total in minutes (or seconds, hours, ...), cast the Timedelta to the corresponding frequency, here minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
Bringing all the operations together:
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
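Note that in recent pandas versions astype('timedelta64[m]') may no longer return a numeric total; dividing the total seconds is a portable alternative (a small sketch with the same d1):
d1['total_minutes'] = d1['tm'].dt.total_seconds() / 60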
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
Just to give an alternative approach using roughly the same elements as above, you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, zero-padding hours and minutes to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column with proper values to a date time column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to get timedeltas) and dividing by a one-minute timedelta:
zerotime= pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0
I have a csv file containing 2 columns: id, val
where id is the day number within the year (365 in total).
Is it possible to convert the number id to dates in format '%d-%m-%Y'?
In fact I want to add all the days of year 2015 e.g. 01-01-2015 etc.
How can I do this with pandas in Python?
Following is a part of the file and the desired output:
"id" "val"
1 49
2 48
3 46
4 45
"date" "val"
01-01-2015 49
02-01-2015 48
03-01-2015 46
04-01-2015 45
Use pd.tseries.offsets.Day:
df['date'] = pd.Timestamp('2015-01-01') \
+ df['id'].sub(1).apply(pd.tseries.offsets.Day)
Alternative, proposed by #HenryEcker:
df['date'] = pd.Timestamp('2015-01-01') \
- pd.Timedelta(days=1) \
+ df['id'].apply(pd.tseries.offsets.Day)
>>> df['id'].sub(1).apply(pd.tseries.offsets.Day)
0 <0 * Days>
1 <Day>
2 <2 * Days>
3 <3 * Days>
Name: id, dtype: object
>>> df
id val date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04
You can convert id to datetime and format the output with strftime:
df['Date'] = pd.to_datetime(df['id'].astype(str)+"-2015", format='%j-%Y').dt.strftime('%d-%m-%Y')
Result:
   id  val        Date
0   1   49  01-01-2015
1   2   48  02-01-2015
2   3   46  03-01-2015
3   4   45  04-01-2015
df.columns = ['date', 'val']
for i, contents in enumerate(df['date']):
    info = str(contents)
    if contents < 10:
        info = '0' + info  # zero-pad single-digit day numbers
    df.loc[i, 'date'] = info + "-01-2015"
This iterates through your column and converts it to the desired date format (note that, as written, it only covers day numbers that fall in January).
Or like this:
df['Date'] = pd.Timestamp('2014-12-31') + df['id'].apply(lambda x: pd.Timedelta(days=x))
Output:
id val Date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04
You can use pd.to_timedelta() on the id column to turn its values into offsets to add to the base date, as follows:
df['date'] = pd.Timestamp('2015-01-01') + pd.to_timedelta(df['id'] -1, unit='day')
Result:
print(df)
id val date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04
If you want the date in dd-mm-YYYY format, you can combine it with .dt.strftime(), as follows:
df['date2'] = (pd.Timestamp('2015-01-01') + pd.to_timedelta(df['id'] -1, unit='day')).dt.strftime('%d-%m-%Y')
Result:
print(df)
id val date date2
0 1 49 2015-01-01 01-01-2015
1 2 48 2015-01-02 02-01-2015
2 3 46 2015-01-03 03-01-2015
3 4 45 2015-01-04 04-01-2015
I'm not sure about the year, since the day count alone doesn't say which year to choose, but you can convert it into month and day.
Rename your csv column id to Date, then:
df['Date'] = pd.to_datetime(df['Date'], format='%j').dt.strftime('%m-%d')
This changes it into a month-day string; you can then add the year manually.
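For instance, if the year is known to be 2015, a sketch that bakes it in directly (assuming id holds the day-of-year numbers):
df['Date'] = pd.to_datetime('2015' + df['id'].astype(str).str.zfill(3),
                            format='%Y%j').dt.strftime('%d-%m-%Y')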
Maybe I just could not find it... anyhow, with pandas 0.19.2 there is the following problem:
I have some timed events belonging to groups, which can be generated by:
from numpy.random import randint, seed
import pandas as pd
seed(42) # reproducibility
samp_N = 1000
# create times within 3 hours, and 15 random groups
df = pd.DataFrame({'time': randint(0,3*60*60, samp_N),
'group': randint(0, 15, samp_N)})
# make a resample-able index from the seconds time values
df.set_index(pd.TimedeltaIndex(df.time, 's'), inplace=True)
which looks like:
group time
02:01:10 10 7270
00:14:20 13 860
01:29:50 9 5390
01:26:31 13 5191
...
When I try to resample the events, I get something undesirable
df.resample('5T').count()
group time
00:00:04 28 28
00:05:04 18 18
00:10:04 32 32
...
Unfortunately, the resampling periods start at an arbitrary offset (the first value in the data).
It is even more annoying if I group this (as ultimately required)
df.groupby('group').resample('5T').count()
then I get a new offset for each group.
What I want is sampling windows that start exactly on the 5-minute boundaries:
00:00:00 5 ...
00:05:00 17 ...
00:10:00 11 ...
...
There was a suggestion in https://stackoverflow.com/a/23966229:
df.groupby(pd.TimeGrouper('5Min')).count()
but it does not work either, as it also ruins the grouping required above.
Thanks for any hints!
Unfortunately I didn't come up with a nice solution, but rather a workaround: I added a dummy row with time value zero and then grouped by time and group:
df = pd.Series({'time':0,'group':-1}).to_frame().T.set_index(pd.TimedeltaIndex([0], 's')).append(df)
df = df.groupby([pd.Grouper(freq='5Min'), 'group']).count().reset_index('group')
df = df.loc[df['group']!=-1]
df.head()
group time
0 days 0 2
0 days 1 4
0 days 2 3
0 days 3 1
0 days 4 2
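DataFrame.append has since been removed (pandas 2.0), so on newer versions the first line above could be written with pd.concat instead (same assumptions):
dummy = pd.Series({'time': 0, 'group': -1}).to_frame().T.set_index(pd.TimedeltaIndex([0], 's'))
df = pd.concat([dummy, df])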
I am not sure this is the result you want:
result = df.groupby(['group', pd.Grouper(freq='5Min')]).count().reset_index(level=0)
result.head()
>>> group time
00:05:00 0 2
00:10:00 0 1
00:15:00 0 3
00:20:00 0 2
00:30:00 0 1
result.sort_index().head()
>>> group time
0 days 10 1
0 days 14 3
0 days 2 1
0 days 13 1
0 days 4 3
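If you would rather have one row per 5-minute bin and one column per group, the grouped result can also be unstacked (a sketch building on the answer above):
counts = (df.groupby(['group', pd.Grouper(freq='5Min')])
            .count()['time']
            .unstack('group', fill_value=0))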
I'm trying to get the week of the month; some months might have four weeks, some might have five.
For each date I would like to know which week it belongs to. I'm mostly interested in the last week of the month.
data = pd.DataFrame(pd.date_range(' 1/ 1/ 2000', periods = 100, freq ='D'))
0 2000-01-01
1 2000-01-02
2 2000-01-03
3 2000-01-04
4 2000-01-05
5 2000-01-06
6 2000-01-07
See this answer and decide which week of month you want.
There's nothing built-in, so you'll need to calculate it with apply. For example, an easy 'how many 7-day periods have passed' measure:
data['wom'] = data[0].apply(lambda d: (d.day-1) // 7 + 1)
For a more complicated measure (based on the calendar), use the function from that answer:
import datetime
import calendar
def week_of_month(tgtdate):
    # Timestamp.to_datetime() no longer exists; to_pydatetime() is the current equivalent
    tgtdate = tgtdate.to_pydatetime()
    days_this_month = calendar.mdays[tgtdate.month]
    # find the first day of the month that starts a new week (the first Monday)
    for i in range(1, days_this_month):
        d = datetime.datetime(tgtdate.year, tgtdate.month, i)
        if d.day - d.weekday() > 0:
            startdate = d
            break
    # now we can use the modulo-7 approach
    return (tgtdate - startdate).days // 7 + 1
data['calendar_wom'] = data[0].apply(week_of_month)
I've used the code below when dealing with dataframes that have a datetime index.
import pandas as pd
import math
def add_week_of_month(df):
df['week_in_month'] = pd.to_numeric(df.index.day/7)
df['week_in_month'] = df['week_in_month'].apply(lambda x: math.ceil(x))
return df
If you run this example:
df = test = pd.DataFrame({'count':['a','b','c','d','e']},
index = ['2018-01-01', '2018-01-08','2018-01-31','2018-02-01','2018-02-28'])
df.index = pd.to_datetime(df.index)
you should get the following dataframe
count week_in_month
2018-01-01 a 1
2018-01-08 b 2
2018-01-31 c 5
2018-02-01 d 1
2018-02-28 e 4
TL;DR
import pandas as pd

def weekinmonth(dates):
    """Get week number in a month.

    Parameters:
        dates (pd.Series): Series of dates.

    Returns:
        pd.Series: Week number in a month.
    """
    firstday_in_month = dates - pd.to_timedelta(dates.dt.day - 1, unit='d')
    return (dates.dt.day - 1 + firstday_in_month.dt.weekday) // 7 + 1
df = pd.DataFrame(pd.date_range(' 1/ 1/ 2000', periods = 100, freq ='D'), columns=['Date'])
weekinmonth(df['Date'])
0 1
1 1
2 2
3 2
4 2
..
95 2
96 2
97 2
98 2
99 2
Name: Date, Length: 100, dtype: int64
Explanation
First, calculate the first day of the month (from this answer: How to floor a date to the first date of that month?):
df = pd.DataFrame(pd.date_range(' 1/ 1/ 2000', periods = 100, freq ='D'), columns=['Date'])
df['MonthFirstDay'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day - 1, unit='d')
df
Date MonthFirstDay
0 2000-01-01 2000-01-01
1 2000-01-02 2000-01-01
2 2000-01-03 2000-01-01
3 2000-01-04 2000-01-01
4 2000-01-05 2000-01-01
.. ... ...
95 2000-04-05 2000-04-01
96 2000-04-06 2000-04-01
97 2000-04-07 2000-04-01
98 2000-04-08 2000-04-01
99 2000-04-09 2000-04-01
[100 rows x 2 columns]
Obtain weekday from first day:
df['FirstWeekday'] = df['MonthFirstDay'].dt.weekday
df
Date MonthFirstDay FirstWeekday
0 2000-01-01 2000-01-01 5
1 2000-01-02 2000-01-01 5
2 2000-01-03 2000-01-01 5
3 2000-01-04 2000-01-01 5
4 2000-01-05 2000-01-01 5
.. ... ... ...
95 2000-04-05 2000-04-01 5
96 2000-04-06 2000-04-01 5
97 2000-04-07 2000-04-01 5
98 2000-04-08 2000-04-01 5
99 2000-04-09 2000-04-01 5
[100 rows x 3 columns]
Now I can use integer division on the day offsets to obtain the week number within the month:
Get the day of the month with df['Date'].dt.day and shift it to start at 0 for the division: df['Date'].dt.day - 1.
Add the weekday of the month's first day, + df['FirstWeekday'], to account for which weekday the month starts on.
Integer-divide by 7 (days per week) and add 1 so week numbers start at 1: // 7 + 1.
The whole calculation:
df['WeekInMonth'] = (df['Date'].dt.day-1 + df['FirstWeekday']) // 7 + 1
df
Date MonthFirstDay FirstWeekday WeekInMonth
0 2000-01-01 2000-01-01 5 1
1 2000-01-02 2000-01-01 5 1
2 2000-01-03 2000-01-01 5 2
3 2000-01-04 2000-01-01 5 2
4 2000-01-05 2000-01-01 5 2
.. ... ... ... ...
95 2000-04-05 2000-04-01 5 2
96 2000-04-06 2000-04-01 5 2
97 2000-04-07 2000-04-01 5 2
98 2000-04-08 2000-04-01 5 2
99 2000-04-09 2000-04-01 5 2
[100 rows x 4 columns]
This seems to do the trick for me:
import numpy as np
import pandas as pd
df_dates = pd.DataFrame({'date': pd.bdate_range(df['date'].min(), df['date'].max())})
df_dates_tues = df_dates[df_dates['date'].dt.weekday == 2].copy()  # weekday 2 is Wednesday (Monday is 0)
df_dates_tues['week'] = np.mod(df_dates_tues['date'].dt.strftime('%W').astype(int), 4)
You can get it by subtracting the week of the first day of the month from the current week, but extra logic is needed to handle the first and last weeks of the year:
def get_week(s):
prev_week = (s - pd.to_timedelta(7, unit='d')).dt.week
return (
s.dt.week
.where((s.dt.month != 1) | (s.dt.week < 50), 0)
.where((s.dt.month != 12) | (s.dt.week > 1), prev_week + 1)
)
def get_week_of_month(s):
first_day_of_month = s - pd.to_timedelta(s.dt.day - 1, unit='d')
first_week_of_month = get_week(first_day_of_month)
current_week = get_week(s)
return current_week - first_week_of_month
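A usage sketch with the question's date range (note that .dt.week, used above, is deprecated in newer pandas in favour of .dt.isocalendar().week; also, the result here is 0-based, so the week containing the 1st comes out as 0):
df = pd.DataFrame({'Date': pd.date_range('2000-01-01', periods=100, freq='D')})
df['week_of_month'] = get_week_of_month(df['Date'])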
My logic to get the week of the month depends on the week of the year:
First, calculate the week of the year in the data frame.
If the month is 1, return the week of the year; otherwise find the max week of the year within the previous month.
If the current week of the year equals that max week of the previous month, return the difference plus 1.
Otherwise, return the difference between the current week of the year and the max week of the previous month.
Hopefully this avoids the limitations of some of the approaches above; the function below implements it. temp here is the data frame for which the week of the year is calculated using dt.weekofyear.
def weekofmonth(dt1):
if dt1.month == 1:
return (dt1.weekofyear)
else:
pmth = dt1.month - 1
year = dt1.year
pmmaxweek = temp[(temp['timestamp_utc'].dt.month == pmth) & (temp['timestamp_utc'].dt.year == year)]['timestamp_utc'].dt.weekofyear.max()
if dt1.weekofyear == pmmaxweek:
return (dt1.weekofyear - pmmaxweek + 1)
else:
return (dt1.weekofyear - pmmaxweek)
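A hypothetical usage sketch, assuming temp is a DataFrame with a datetime column timestamp_utc as described (dt1.weekofyear is deprecated in newer pandas):
temp['week_of_month'] = temp['timestamp_utc'].apply(weekofmonth)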
import pandas as pd
import math

def week_of_month(dt: pd.Timestamp):
    # days 1-7 -> week 1, 8-14 -> week 2, ..., 29-31 -> week 5
    return math.ceil(dt.day / 7)

df["what_you_need"] = df["dt_col_name"].apply(week_of_month)
This gives you weeks 1 to 5; any day after the 28th falls into a 5th week.