add specific dates in a file using python - python

I have a csv file containing 2 columns: id, val,
where id is the number of the day (365 in total).
Is it possible to convert the id number to dates in the format '%d-%m-%Y'?
In fact, I want to add all the days of the year 2015, e.g. 01-01-2015 etc.
How can I do this with pandas in Python?
The following is part of the file, followed by the desired output:
"id" "val"
1 49
2 48
3 46
4 45
"date" "val"
01-01-2015 49
02-01-2015 48
03-01-2015 46
04-01-2015 45

Use pd.tseries.offsets.Day:
df['date'] = pd.Timestamp('2015-01-01') \
+ df['id'].sub(1).apply(pd.tseries.offsets.Day)
Alternatively, as proposed by @HenryEcker:
df['date'] = pd.Timestamp('2015-01-01') \
- pd.Timedelta(days=1) \
+ df['id'].apply(pd.tseries.offsets.Day)
>>> df['id'].sub(1).apply(pd.tseries.offsets.Day)
0 <0 * Days>
1 <Day>
2 <2 * Days>
3 <3 * Days>
Name: id, dtype: object
>>> df
id val date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04

You can convert id to datetime and format the output with strftime:
df['Date'] = pd.to_datetime(df['id'].astype(str)+"-2015", format='%j-%Y').dt.strftime('%d-%m-%Y')
Result:
id val Date
0 1 49 01-01-2015
1 2 48 02-01-2015
2 3 46 03-01-2015
3 4 45 04-01-2015

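df.columns = ['date', 'val']
dates = []
for contents in df['date']:
    info = str(contents)
    # zero-pad single-digit day numbers
    if contents < 10:
        info = "0" + info
    dates.append(info + "-01-2015")
df['date'] = dates
This iterates through your column and converts it to date formatting. Note that it only builds the string by hand and assumes January, so it is correct only for ids 1 to 31.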

Or like this:
df['Date'] = pd.Timestamp('2014-12-31') + df['id'].apply(lambda x: pd.Timedelta(days=x))
Output:
id val Date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04

You can use pd.to_timedelta() on id column to turn its values into date offsets for adding to the base date, as follows:
df['date'] = pd.Timestamp('2015-01-01') + pd.to_timedelta(df['id'] -1, unit='day')
Result:
print(df)
id val date
0 1 49 2015-01-01
1 2 48 2015-01-02
2 3 46 2015-01-03
3 4 45 2015-01-04
If you want the date in dd-mm-YYYY format, you can combine this with .dt.strftime(), as follows:
df['date2'] = (pd.Timestamp('2015-01-01') + pd.to_timedelta(df['id'] -1, unit='day')).dt.strftime('%d-%m-%Y')
Result:
print(df)
id val date date2
0 1 49 2015-01-01 01-01-2015
1 2 48 2015-01-02 02-01-2015
2 3 46 2015-01-03 03-01-2015
3 4 45 2015-01-04 04-01-2015

I'm not sure about the year, as the day count doesn't say which year to choose, but you can convert it into months and days.
Rename your csv column called id to Date. Then:
df['Date'] = pd.to_datetime(df['Date'], format='%j').dt.strftime('%m-%d')
This converts it into month-day strings. You can then add the year manually.
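On the question of which year to choose: the year matters because a leap year shifts every day number after February. A minimal sketch of this, using the same offset-from-1-January technique as the answers above; the helper name day_of_year_to_date is hypothetical and not part of pandas:

```python
import pandas as pd

def day_of_year_to_date(ids, year):
    """Map 1-based day numbers to 'dd-mm-YYYY' strings for the given year."""
    base = pd.Timestamp(year=year, month=1, day=1)
    # day 1 is the base date itself, so the offset is (id - 1) days
    return (base + pd.to_timedelta(ids - 1, unit='D')).dt.strftime('%d-%m-%Y')

ids = pd.Series([1, 59, 60, 365])
print(day_of_year_to_date(ids, 2015).tolist())
# ['01-01-2015', '28-02-2015', '01-03-2015', '31-12-2015']
print(day_of_year_to_date(ids, 2016).tolist())
# ['01-01-2016', '28-02-2016', '29-02-2016', '30-12-2016']
```

Day 60 is 1 March in 2015 but 29 February in 2016, so the year has to be fixed before the conversion rather than appended afterwards.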

Related

Convert string hours to minute pd.eval

I want to convert all rows of my DataFrame that contains hours and minutes into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe :
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total minutes, seconds or hours, change the frequency to minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
Bringing all the operations together :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
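Whether astype('timedelta64[m]') works depends on the pandas version; as far as I know, recent releases only accept s, ms, us and ns resolutions there. A minimal sketch of a version-independent way to get total minutes, assuming the same sample strings as above:

```python
import pandas as pd

d1 = pd.DataFrame({'time': ['8h30', '14h07', '08h30', '8h0']})

# append the missing minutes unit, then parse to Timedelta
tm = pd.to_timedelta(d1['time'] + 'm')

# total_seconds() is available on the .dt accessor of a timedelta Series
d1['total_minutes'] = tm.dt.total_seconds() / 60

print(d1['total_minutes'].tolist())
# [510.0, 847.0, 510.0, 480.0]
```

Dividing by pd.Timedelta(minutes=1) gives the same numbers and is the form used in the last answer of this thread.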
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
Just to give an alternative approach with kind of the same elements as above you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h"
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, but zeropad time hours and minutes in order to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column with proper values to a date time column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to make timedeltas) and dividing by a one-minute timedelta:
zerotime= pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0
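As for why the original pd.eval attempt raises SyntaxError: the likely cause is that the replaced strings are not valid Python expressions, because Python 3 does not allow leading zeros in decimal integer literals. A small sketch that checks this with the standard ast module (this only illustrates the literal rule; it does not call pd.eval itself, and the helper name parses is hypothetical):

```python
import ast

def parses(expr):
    """Return True if expr is a syntactically valid Python expression."""
    try:
        ast.parse(expr, mode='eval')
        return True
    except SyntaxError:
        return False

for raw in ['8h30', '08h30', '14h07', '8h0']:
    expr = raw.replace('h', '*60+')
    print(raw, '->', expr, 'valid' if parses(expr) else 'SyntaxError')
# 8h30 -> 8*60+30 valid
# 08h30 -> 08*60+30 SyntaxError
# 14h07 -> 14*60+07 SyntaxError
# 8h0 -> 8*60+0 valid
```

This is why the split-and-cast approaches above are more robust: int('08') is fine even though the literal 08 is not.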

extract number of days in month from date column, return days to date in current month

I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified
so far i have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y,m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
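For a datetime column, the same number is also available directly through the .dt accessor, without apply or a Period conversion. A short sketch with made-up sample dates:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-01-31', '2020-02-29', '2021-02-28'])})

# vectorized: number of days in the month each date falls in
df['daysinmonths'] = df['date'].dt.days_in_month

print(df['daysinmonths'].tolist())
# [31, 29, 28]
```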
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
    df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
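The year/month masks above compare year and month separately, which may misclassify rows once the data spans more than one year. A hedged sketch of the elapsed-days column that compares whole months as periods instead; the function name days_elapsed is hypothetical, and now is passed in explicitly so the result is reproducible:

```python
import pandas as pd

def days_elapsed(dates, now):
    """Days in month for past months, day of month for the current month, 0 for future months."""
    months = dates.dt.to_period('M')
    current = now.to_period('M')
    days = dates.dt.days_in_month
    return days.mask(months == current, now.day).mask(months > current, 0)

dates = pd.Series(pd.to_datetime(['2019-12-31', '2020-07-31', '2020-08-31',
                                  '2020-09-30', '2021-01-31']))
now = pd.Timestamp('2020-08-27')

print(days_elapsed(dates, now).tolist())
# [31, 31, 27, 0, 0]
```

With the separate year and month comparison, December 2019 would not be recognised as a past month when now is in August 2020; the period comparison treats it as one.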

How to replace timestamp across the columns using pandas

df = pd.DataFrame({
'subject_id':[1,1,2,2],
'time_1':['2173/04/11 12:35:00','2173/04/12 12:50:00','2173/04/11 12:59:00','2173/04/12 13:14:00'],
'time_2':['2173/04/12 16:35:00','2173/04/13 18:50:00','2173/04/13 22:59:00','2173/04/21 17:14:00'],
'val' :[5,5,40,40],
'iid' :[12,12,12,12]
})
df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = pd.to_datetime(df['time_2'])
df['day'] = df['time_1'].dt.day
Currently my dataframe looks like as shown below
I would like to replace the timestamp in time_1 column to 00:00:00 and time_2 column to 23:59:00
This is what I tried but it doesn't work
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.datetime.strftime(x, "%H:%M:%S") == "00:00:00") #approach 1
df.groupby(df['subject_id'])['time_1'].apply(lambda x: pd.pd.Timestamp(hour = '00', second = '00')) #approach 2
I expect my output dataframe to be like as shown below
In pandas, if all datetimes in the same column have 00:00:00 times, the time part is not displayed.
Use Series.dt.floor or Series.dt.normalize to remove the times, and for the second column add a DateOffset:
df['time_1'] = pd.to_datetime(df['time_1']).dt.floor('d')
#alternative
#df['time_1'] = pd.to_datetime(df['time_1']).dt.normalize()
df['time_2']=pd.to_datetime(df['time_2']).dt.floor('d') + pd.DateOffset(hours=23, minutes=59)
df['day'] = df['time_1'].dt.day
print (df)
subject_id time_1 time_2 val iid day
0 1 2173-04-11 2173-04-12 23:59:00 5 12 11
1 1 2173-04-12 2173-04-13 23:59:00 5 12 12
2 2 2173-04-11 2173-04-13 23:59:00 40 12 11
3 2 2173-04-12 2173-04-21 23:59:00 40 12 12
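The same idea can be wrapped in a small helper that sets an arbitrary time of day. This is only a sketch; the name set_time_of_day is hypothetical, and it uses pd.Timedelta instead of DateOffset, which should behave the same for fixed hours and minutes:

```python
import pandas as pd

def set_time_of_day(series, hours=0, minutes=0):
    """Drop the existing time part and replace it with hours:minutes."""
    return series.dt.normalize() + pd.Timedelta(hours=hours, minutes=minutes)

times = pd.Series(pd.to_datetime(['2173/04/12 16:35:00', '2173/04/13 18:50:00'],
                                 format='%Y/%m/%d %H:%M:%S'))

start = set_time_of_day(times)                      # 00:00:00
end = set_time_of_day(times, hours=23, minutes=59)  # 23:59:00

print(end.dt.strftime('%Y-%m-%d %H:%M:%S').tolist())
# ['2173-04-12 23:59:00', '2173-04-13 23:59:00']
```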

Pandas : SQL SelfJoin With Date Criteria

One query I often do in SQL within a relational database is to join a table back to itself and summarize each row based on records for the same id either backwards or forward in time.
For example, assume table1 has columns 'ID', 'Date', 'Var1'
In SQL I could sum var1 for the past 3 months for each record like this:
Select a.ID, a.Date, sum(b.Var1) as sum_var1
from table1 a
left outer join table1 b
on a.ID = b.ID
and months_between(a.date,b.date) <0
and months_between(a.date,b.date) > -3
Is there any way to do this in Pandas?
It seems you need GroupBy + rolling. Implementing the logic in precisely the same way it is written in SQL is likely to be expensive as it will involve repeated loops. Let's take an example dataframe:
Date ID Var1
0 2015-01-01 1 0
1 2015-02-01 1 1
2 2015-03-01 1 2
3 2015-04-01 1 3
4 2015-05-01 1 4
5 2015-01-01 2 5
6 2015-02-01 2 6
7 2015-03-01 2 7
8 2015-04-01 2 8
9 2015-05-01 2 9
You can add a column which, by group, looks back and sums a variable over a fixed period. First define a function utilizing pd.Series.rolling:
def lookbacker(x):
    """Sum over past 70 days"""
    return x.rolling('70D').sum().astype(int)
Then apply it on a GroupBy object and extract values for assignment:
df['Lookback_Sum'] = df.set_index('Date').groupby('ID')['Var1'].apply(lookbacker).values
print(df)
Date ID Var1 Lookback_Sum
0 2015-01-01 1 0 0
1 2015-02-01 1 1 1
2 2015-03-01 1 2 3
3 2015-04-01 1 3 6
4 2015-05-01 1 4 9
5 2015-01-01 2 5 5
6 2015-02-01 2 6 11
7 2015-03-01 2 7 18
8 2015-04-01 2 8 21
9 2015-05-01 2 9 24
It appears pd.Series.rolling does not work with months, e.g. using '2M' (2 months) instead of '70D' (70 days) gives ValueError: <2 * MonthEnds> is a non-fixed frequency. This makes sense since a "month" is ambiguous given months have different numbers of days.
Another point worth mentioning is that you can use GroupBy + rolling directly, and possibly more efficiently, by bypassing apply, but this requires ensuring your index is monotonic. For example, via sort_index:
df['Lookback_Sum'] = df.set_index('Date').sort_index()\
.groupby('ID')['Var1'].rolling('70D').sum()\
.astype(int).values
I don't think pandas.DataFrame.rolling() supports rolling-window aggregation by some number of months; currently, you must specify a fixed number of days, or other fixed-length time period.
But as @jpp mentioned, you can use python loops to perform rolling aggregation over a window size specified in calendar months, where the number of days in each window will vary, depending on what part of the calendar you're rolling over.
The following approach builds on this SO answer as well as @jpp's:
# Build some example data:
# 3 unique IDs, each with 365 samples, one sample per day throughout 2015
df = pd.DataFrame({'Date': pd.date_range('2015-01-01', '2015-12-31', freq='D'),
'Var1': list(range(365))})
df = pd.concat([df] * 3)
df['ID'] = [1]*365 + [2]*365 + [3]*365
df.head()
Date Var1 ID
0 2015-01-01 0 1
1 2015-01-02 1 1
2 2015-01-03 2 1
3 2015-01-04 3 1
4 2015-01-05 4 1
# Define a lookback function that mimics rolling aggregation,
# but uses DateOffset() slicing, rather than a window of fixed size.
# Use .count() here as a sanity check; you will need .sum()
def lookbacker(ser):
    return pd.Series([ser.loc[d - pd.offsets.DateOffset(months=3):d].count()
                      for d in ser.index])
# By default, groupby.agg output is sorted by key. So make sure to
# sort df by (ID, Date) before inserting the flattened groupby result
# into a new column
df.sort_values(['ID', 'Date'], inplace=True)
df.set_index('Date', inplace=True)
df['window_size'] = df.groupby('ID')['Var1'].apply(lookbacker).values
# Manually check the resulting window sizes
df.head()
Var1 ID window_size
Date
2015-01-01 0 1 1
2015-01-02 1 1 2
2015-01-03 2 1 3
2015-01-04 3 1 4
2015-01-05 4 1 5
df.tail()
Var1 ID window_size
Date
2015-12-27 360 3 92
2015-12-28 361 3 92
2015-12-29 362 3 92
2015-12-30 363 3 92
2015-12-31 364 3 93
df[df.ID == 1].loc['2015-05-25':'2015-06-05']
Var1 ID window_size
Date
2015-05-25 144 1 90
2015-05-26 145 1 90
2015-05-27 146 1 90
2015-05-28 147 1 90
2015-05-29 148 1 91
2015-05-30 149 1 92
2015-05-31 150 1 93
2015-06-01 151 1 93
2015-06-02 152 1 93
2015-06-03 153 1 93
2015-06-04 154 1 93
2015-06-05 155 1 93
The last column gives the lookback window size in days, looking back from that date, including both the start and end dates.
Looking "3 months" before 2015-05-31 would land you at 2015-02-31, but February has only 28 days in 2015. As you can see in the sequence 90, 91, 92, 93 in the above sanity check, this DateOffset approach maps the last four days in May to the last day in February:
pd.to_datetime('2015-05-31') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-30') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-29') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-28') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
I don't know if this matches SQL's behaviour, but in any case, you'll want to test this and decide if this makes sense in your case.
You could use a lambda to achieve it:
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
We also need to write an equivalent of months_between.
The complete example is:
from datetime import datetime
import datetime as dt
import pandas as pd
def months_between(date1, date2):
    if date1.day == date2.day:
        return (date1.year - date2.year) * 12 + date1.month - date2.month
    # if both are last days
    if date1.month != (date1 + dt.timedelta(days=1)).month:
        if date2.month != (date2 + dt.timedelta(days=1)).month:
            return date1.month - date2.month
    return (date1 - date2).days / 31
def findSum(cRow):
    table1['month_diff'] = table1['Date'].apply(months_between, date2=cRow['Date'])
    filtered_table = table1[(table1["month_diff"] < 0) & (table1["month_diff"] > -3) & (table1['ID'] == cRow['ID'])]
    if filtered_table.empty:
        return 0
    return filtered_table['Var1'].sum()
table1 = pd.DataFrame(columns = ['ID', 'Date', 'Var1'])
table1.loc[len(table1)] = [1, datetime.strptime('2015-01-01','%Y-%m-%d'), 0]
table1.loc[len(table1)] = [1, datetime.strptime('2015-02-01','%Y-%m-%d'), 1]
table1.loc[len(table1)] = [1, datetime.strptime('2015-03-01','%Y-%m-%d'), 2]
table1.loc[len(table1)] = [1, datetime.strptime('2015-04-01','%Y-%m-%d'), 3]
table1.loc[len(table1)] = [1, datetime.strptime('2015-05-01','%Y-%m-%d'), 4]
table1.loc[len(table1)] = [2, datetime.strptime('2015-01-01','%Y-%m-%d'), 5]
table1.loc[len(table1)] = [2, datetime.strptime('2015-02-01','%Y-%m-%d'), 6]
table1.loc[len(table1)] = [2, datetime.strptime('2015-03-01','%Y-%m-%d'), 7]
table1.loc[len(table1)] = [2, datetime.strptime('2015-04-01','%Y-%m-%d'), 8]
table1.loc[len(table1)] = [2, datetime.strptime('2015-05-01','%Y-%m-%d'), 9]
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
table1.drop(columns=['month_diff'], inplace=True)
print(table1)
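A more literal translation of the SQL self-join is also possible with merge. The sketch below is only an approximation under stated assumptions: it replaces months_between with a DateOffset(months=3) comparison (which may differ from the database function for dates that are not on the same day of the month), and it pairs every row with every other row of the same ID, so memory grows quadratically per ID:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1] * 5 + [2] * 5,
    'Date': pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01',
                            '2015-04-01', '2015-05-01'] * 2),
    'Var1': list(range(10)),
})

# Self-join on ID: every row (a) is paired with every row (b) of the same ID
pairs = df.merge(df, on='ID', suffixes=('', '_b'))

# Same direction as the SQL: b is strictly later than a,
# and less than 3 calendar months later
later = pairs['Date_b'] > pairs['Date']
within = pairs['Date_b'] < pairs['Date'] + pd.DateOffset(months=3)

sums = (pairs[later & within]
        .groupby(['ID', 'Date'])['Var1_b'].sum()
        .rename('sum_var1')
        .reset_index())

# Left join back so rows without any match are kept (as NaN),
# like the SQL left outer join
result = df.merge(sums, on=['ID', 'Date'], how='left')
print(result)
```

For the sample data this gives 3, 5, 7, 4 and NaN for ID 1, and 13, 15, 17, 9 and NaN for ID 2.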

Week of a month pandas

I'm trying to get the week of the month; some months might have four weeks, some might have five.
For each date I would like to know which week it belongs to. I'm mostly interested in the last week of the month.
data = pd.DataFrame(pd.date_range(' 1/ 1/ 2000', periods = 100, freq ='D'))
0 2000-01-01
1 2000-01-02
2 2000-01-03
3 2000-01-04
4 2000-01-05
5 2000-01-06
6 2000-01-07
See this answer and decide which week of month you want.
There's nothing built-in, so you'll need to calculate it with apply. For example, for an easy 'how many 7 day periods have passed' measure.
data['wom'] = data[0].apply(lambda d: (d.day-1) // 7 + 1)
For a more complicated measure (based on the calendar), use the function from that answer.
import datetime
import calendar
def week_of_month(tgtdate):
    # Timestamp.to_datetime() no longer exists in pandas; to_pydatetime() is the current name
    tgtdate = tgtdate.to_pydatetime()
    days_this_month = calendar.mdays[tgtdate.month]
    for i in range(1, days_this_month):
        d = datetime.datetime(tgtdate.year, tgtdate.month, i)
        if d.day - d.weekday() > 0:
            startdate = d
            break
    # now we can use the modulo 7 approach
    return (tgtdate - startdate).days // 7 + 1
data['calendar_wom'] = data[0].apply(week_of_month)
I've used the code below when dealing with dataframes that have a datetime index.
import pandas as pd
import math
def add_week_of_month(df):
    df['week_in_month'] = pd.to_numeric(df.index.day/7)
    df['week_in_month'] = df['week_in_month'].apply(lambda x: math.ceil(x))
    return df
If you run this example:
df = test = pd.DataFrame({'count':['a','b','c','d','e']},
                         index = ['2018-01-01', '2018-01-08','2018-01-31','2018-02-01','2018-02-28'])
df.index = pd.to_datetime(df.index)
df = add_week_of_month(df)
you should get the following dataframe
count week_in_month
2018-01-01 a 1
2018-01-08 b 2
2018-01-31 c 5
2018-02-01 d 1
2018-02-28 e 4
TL;DR
import pandas as pd
def weekinmonth(dates):
    """Get week number in a month.

    Parameters:
        dates (pd.Series): Series of dates.
    Returns:
        pd.Series: Week number in a month.
    """
    firstday_in_month = dates - pd.to_timedelta(dates.dt.day - 1, unit='d')
    return (dates.dt.day-1 + firstday_in_month.dt.weekday) // 7 + 1
df = pd.DataFrame(pd.date_range(' 1/ 1/ 2000', periods = 100, freq ='D'), columns=['Date'])
weekinmonth(df['Date'])
0 1
1 1
2 2
3 2
4 2
..
95 2
96 2
97 2
98 2
99 2
Name: Date, Length: 100, dtype: int64
Explanation
At first, calculate first day in month (from this answer: How floor a date to the first date of that month?):
df = pd.DataFrame(pd.date_range(' 1/ 1/ 2000', periods = 100, freq ='D'), columns=['Date'])
df['MonthFirstDay'] = df['Date'] - pd.to_timedelta(df['Date'].dt.day - 1, unit='d')
df
Date MonthFirstDay
0 2000-01-01 2000-01-01
1 2000-01-02 2000-01-01
2 2000-01-03 2000-01-01
3 2000-01-04 2000-01-01
4 2000-01-05 2000-01-01
.. ... ...
95 2000-04-05 2000-04-01
96 2000-04-06 2000-04-01
97 2000-04-07 2000-04-01
98 2000-04-08 2000-04-01
99 2000-04-09 2000-04-01
[100 rows x 2 columns]
Obtain weekday from first day:
df['FirstWeekday'] = df['MonthFirstDay'].dt.weekday
df
Date MonthFirstDay FirstWeekday
0 2000-01-01 2000-01-01 5
1 2000-01-02 2000-01-01 5
2 2000-01-03 2000-01-01 5
3 2000-01-04 2000-01-01 5
4 2000-01-05 2000-01-01 5
.. ... ... ...
95 2000-04-05 2000-04-01 5
96 2000-04-06 2000-04-01 5
97 2000-04-07 2000-04-01 5
98 2000-04-08 2000-04-01 5
99 2000-04-09 2000-04-01 5
[100 rows x 3 columns]
Now I can use integer division by the number of days in a week to obtain the week number in a month:
Get the day of the month with df['Date'].dt.day and subtract 1 so that it starts from 0: df['Date'].dt.day-1.
Add the weekday number of the first day of the month to account for where in the week the month starts: + df['FirstWeekday'].
Integer-divide by the 7 days in a week and add 1 so that week numbers in the month start from 1: // 7 + 1.
Whole calculation:
df['WeekInMonth'] = (df['Date'].dt.day-1 + df['FirstWeekday']) // 7 + 1
df
Date MonthFirstDay FirstWeekday WeekInMonth
0 2000-01-01 2000-01-01 5 1
1 2000-01-02 2000-01-01 5 1
2 2000-01-03 2000-01-01 5 2
3 2000-01-04 2000-01-01 5 2
4 2000-01-05 2000-01-01 5 2
.. ... ... ... ...
95 2000-04-05 2000-04-01 5 2
96 2000-04-06 2000-04-01 5 2
97 2000-04-07 2000-04-01 5 2
98 2000-04-08 2000-04-01 5 2
99 2000-04-09 2000-04-01 5 2
[100 rows x 4 columns]
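One way to sanity-check this formula is to compare it against the standard library's calendar.monthcalendar, which lays out each month in Monday-first week rows. A sketch, assuming the default first weekday (Monday) has not been changed via calendar.setfirstweekday; the helper name calendar_week is hypothetical:

```python
import calendar
import pandas as pd

def weekinmonth(dates):
    # same formula as above: shift by the weekday of the first day of the month
    first = dates - pd.to_timedelta(dates.dt.day - 1, unit='D')
    return (dates.dt.day - 1 + first.dt.weekday) // 7 + 1

def calendar_week(ts):
    # 1-based index of the calendar row that contains this day of the month
    weeks = calendar.monthcalendar(ts.year, ts.month)
    return next(i for i, week in enumerate(weeks, start=1) if ts.day in week)

dates = pd.Series(pd.date_range('2000-01-01', '2000-12-31', freq='D'))
expected = dates.apply(calendar_week)

# True if both methods agree for every day of the year 2000
print((weekinmonth(dates) == expected).all())
```

Under this definition a month can span up to six calendar weeks, unlike the "7-day periods since the 1st" measure, which never exceeds five.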
This seems to do the trick for me
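import numpy as np
import pandas as pd
df_dates = pd.DataFrame({'date':pd.bdate_range(df['date'].min(),df['date'].max())})
# note: dt.weekday == 2 selects Wednesdays (Monday is 0); use 1 to select Tuesdays
df_dates_tues = df_dates[df_dates['date'].dt.weekday==2].copy()
df_dates_tues['week']=np.mod(df_dates_tues['date'].dt.strftime('%W').astype(int),4)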
You can get it by subtracting the week of the first day of the month from the current week, but extra logic is needed to handle the first and last weeks of the year:
def get_week(s):
    # Series.dt.week was removed in pandas 2.0; dt.isocalendar().week is the replacement
    week = s.dt.isocalendar().week.astype(int)
    prev_week = (s - pd.to_timedelta(7, unit='d')).dt.isocalendar().week.astype(int)
    return (
        week
        .where((s.dt.month != 1) | (week < 50), 0)
        .where((s.dt.month != 12) | (week > 1), prev_week + 1)
    )
def get_week_of_month(s):
    first_day_of_month = s - pd.to_timedelta(s.dt.day - 1, unit='d')
    first_week_of_month = get_week(first_day_of_month)
    current_week = get_week(s)
    return current_week - first_week_of_month
My logic for getting the week of the month depends on the week of the year.
First, calculate the week of the year in a data frame.
If the month is 1, return the week of the year. Otherwise, get the maximum week of the year of the previous month.
If that maximum equals the current week of the year, return the difference between the current week of the year and the previous month's maximum week, plus 1.
Otherwise, return the difference between the current week of the year and the previous month's maximum week.
I hope this addresses the limitations of the approaches above; the function below implements it. Here temp is the data frame for which the week of the year is calculated.
def weekofmonth(dt1):
    if dt1.month == 1:
        return dt1.weekofyear
    else:
        pmth = dt1.month - 1
        year = dt1.year
        # Series.dt.weekofyear was removed in pandas 2.0; dt.isocalendar().week replaces it
        prev_month = temp[(temp['timestamp_utc'].dt.month == pmth) & (temp['timestamp_utc'].dt.year == year)]
        pmmaxweek = prev_month['timestamp_utc'].dt.isocalendar().week.max()
        if dt1.weekofyear == pmmaxweek:
            return dt1.weekofyear - pmmaxweek + 1
        else:
            return dt1.weekofyear - pmmaxweek
import pandas as pd
def week_of_month(dt: pd.Timestamp):
    # days 1-7 -> week 1, days 8-14 -> week 2, ..., days 29-31 -> week 5
    return (dt.day - 1) // 7 + 1
df["what_you_need"] = df["dt_col_name"].apply(week_of_month)
This gives you a week number from 1 to 5; days after the 28th count as the 5th week.
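Since the question is mostly about the last week of the month, one possible reading is "the final seven days of the month", which can be flagged without computing a week number at all. This definition is an assumption, not something the question states; a minimal sketch with made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2018-02-21', '2018-02-22', '2018-02-28',
                                  '2018-01-24', '2018-01-25']))

# number of days remaining after this date in its month
days_left = dates.dt.days_in_month - dates.dt.day

# a date is in the final seven days if fewer than 7 days remain after it
in_last_week = days_left < 7

print(in_last_week.tolist())
# [False, True, True, False, True]
```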
