Pandas Series.dt.week vs. pd.Period.strftime - what's the difference? - python

When operating on a pandas series of dates, isolating the week number can be performed in two separate ways that produce different results.
Using the .dt.week accessor on a numpy.datetime64 value or a pd.Period within a series produces different results than using pd.Period.strftime on the same objects. The documentation for pd.Period.strftime states that all days before the first occurrence of the week's start day at the beginning of the year are counted as week 0. This follows standard Python strftime behavior.
The .dt.week accessor appears to start at 1 and restart after 52 weeks, making the final two days of 2018 week 1 of 2019. The documentation for pd.Series.dt.week only states that it returns the week ordinal of the year. Is this the ISO week number?
Why is there this discrepancy between the two methods? Which one should be used, and why? And how can I elegantly get the ISO week number from a single Python datetime (or pd.Period or pd.Timestamp) object, as opposed to a series?
from datetime import timedelta
import pandas as pd

df2 = pd.DataFrame({"Date_string": ["2018-12-27", "2018-12-28", "2018-12-29",
                                    "2018-12-30", "2018-12-31", "2019-01-01",
                                    "2019-01-02", "2019-01-03", "2019-01-04",
                                    "2019-01-05", "2019-01-06", "2019-01-07"]})
df2["Date_datestamp"] = pd.to_datetime(df2["Date_string"], format='%Y-%m-%d')
df2["Date_period"] = df2['Date_datestamp'].dt.to_period("D")
df2["Week1"] = df2['Date_period'].apply(lambda x: (x + timedelta(days=1)).week)
df2["Week2"] = df2['Date_period'].apply(lambda x: x.strftime("%U"))
df2
returns
Date_string Date_datestamp Date_period Week1 Week2
0 2018-12-27 2018-12-27 2018-12-27 52 51
1 2018-12-28 2018-12-28 2018-12-28 52 51
2 2018-12-29 2018-12-29 2018-12-29 52 51
3 2018-12-30 2018-12-30 2018-12-30 1 52
4 2018-12-31 2018-12-31 2018-12-31 1 52
5 2019-01-01 2019-01-01 2019-01-01 1 00
6 2019-01-02 2019-01-02 2019-01-02 1 00
7 2019-01-03 2019-01-03 2019-01-03 1 00
8 2019-01-04 2019-01-04 2019-01-04 1 00
9 2019-01-05 2019-01-05 2019-01-05 1 00
10 2019-01-06 2019-01-06 2019-01-06 2 01
11 2019-01-07 2019-01-07 2019-01-07 2 01

This is because there were actually 53 weeks in 2018. I would recommend using a year-week combination, something like:
df2['Year-Week'] = df2['Date_period'].apply(lambda x: x.strftime('%Y-%U'))
Edited:
To see the number of weeks, you can try
df2["Week2"] = df2['Date_period'].apply(lambda x: x.strftime("%W"))
This shows 2018-12-31 as week 53.
%U - gets the Week Number, using Sunday as First day of Week
%W - gets the Week Number, using Monday as First day of Week
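For the single-object part of the question (getting the ISO week from one datetime-like value), a minimal sketch using the standard isocalendar() method, which pd.Timestamp inherits from datetime:

```python
from datetime import date
import pandas as pd

# isocalendar() returns (ISO year, ISO week, ISO weekday)
print(date(2018, 12, 31).isocalendar()[1])          # 1 -> week 1 of ISO year 2019
print(pd.Timestamp("2018-12-31").isocalendar()[1])  # Timestamp subclasses datetime
print(pd.Period("2018-12-31", freq="D").week)       # Period.week is also ISO-based
```

Note that 2018-12-31 is a Monday, so all three report it as ISO week 1 of 2019 — consistent with the .dt.week behavior observed in the question.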

Related

filtering index in a dataframe

I currently have a dataframe with the index "2018-01-02" to "2020-12-31".
I need to write a program that takes in this dataframe and outputs a new dataframe that contains the first date available for each month.
What is the best way to do this?
Assume that the source DataFrame is:
Amount
Date
2018-01-02 10
2018-01-03 11
2018-01-04 12
2018-02-03 13
2018-02-04 14
2018-02-05 15
2018-03-07 16
2018-03-09 17
2018-04-10 18
2018-04-12 19
(its index is of DatetimeIndex type, not string).
If you want only the first date in each month, you can run:
result = df.groupby(pd.Grouper(freq='MS')).apply(lambda grp: grp.index.min())
The result is a Series containing:
Date
2018-01-01 2018-01-02
2018-02-01 2018-02-03
2018-03-01 2018-03-07
2018-04-01 2018-04-10
Freq: MS, dtype: datetime64[ns]
The left column is the index - starting date of each month.
The right column is the value found - the first date in each month from
the source DataFrame.
But if you want full first rows from each month, you can run:
result = df.groupby(pd.Grouper(freq='MS')).head(1)
This time the result is:
Amount
Date
2018-01-02 10
2018-02-03 13
2018-03-07 16
2018-04-10 18
Note that df.groupby(pd.Grouper(freq='MS')).first() is a wrong
choice here, since its key (index) is the first calendar day of each month,
not the first date actually present in that month (try it on your own).
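Both variants above can be checked end-to-end; this sketch rebuilds the sample DataFrame from the answer:

```python
import pandas as pd

# rebuild the sample DataFrame from the answer
df = pd.DataFrame(
    {"Amount": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]},
    index=pd.DatetimeIndex(["2018-01-02", "2018-01-03", "2018-01-04",
                            "2018-02-03", "2018-02-04", "2018-02-05",
                            "2018-03-07", "2018-03-09",
                            "2018-04-10", "2018-04-12"], name="Date"),
)

# first date present in each month (a Series indexed by month start)
first_dates = df.groupby(pd.Grouper(freq="MS")).apply(lambda grp: grp.index.min())

# full first row of each month, keeping the original index
first_rows = df.groupby(pd.Grouper(freq="MS")).head(1)
print(first_rows)
```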

Convert 3 columns from dataframe to date

I have a dataframe like this:
I want to convert the 'start_year', 'start_month', 'start_day' columns to one date
and the columns 'end_year', 'end_month', 'end_day' to another date.
Is there a way to do that?
Thank you.
Given a dataframe like this:
year month day
0 2019.0 12.0 29.0
1 2020.0 9.0 15.0
2 2018.0 3.0 1.0
You can convert them to date strings using a type cast and str.zfill:
df.apply(lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}', axis=1)
OUTPUT:
0 2019-12-29
1 2020-09-15
2 2018-03-01
dtype: object
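As a self-contained sketch, the same apply/zfill approach with the three sample rows above:

```python
import pandas as pd

df = pd.DataFrame({"year": [2019.0, 2020.0, 2018.0],
                   "month": [12.0, 9.0, 3.0],
                   "day": [29.0, 15.0, 1.0]})

# cast the floats to int, then zero-pad month and day to two digits
dates = df.apply(
    lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}',
    axis=1)
print(dates)
```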
Here's an approach:
simulate some data (as your data was posted as an image)
use apply against each row, building the dates with datetime.datetime()
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        "start_year": np.random.choice(range(2018, 2022), 10),
        "start_month": np.random.choice(range(1, 13), 10),
        "start_day": np.random.choice(range(1, 28), 10),
        "end_year": np.random.choice(range(2018, 2022), 10),
        "end_month": np.random.choice(range(1, 13), 10),
        "end_day": np.random.choice(range(1, 28), 10),
    }
)
# Series.append was removed in pandas 2.0, so use pd.concat instead
df = df.apply(
    lambda r: pd.concat(
        [r, pd.Series({f"{startend}_date": dt.datetime(*(int(r[f"{startend}_{part}"])
                                                         for part in ["year", "month", "day"]))
                       for startend in ["start", "end"]})]),
    axis=1)
df
   start_year  start_month  start_day  end_year  end_month  end_day           start_date             end_date
0        2018            9          6      2020          1        3  2018-09-06 00:00:00  2020-01-03 00:00:00
1        2018           11          6      2020          7        2  2018-11-06 00:00:00  2020-07-02 00:00:00
2        2021            8         13      2020         11        2  2021-08-13 00:00:00  2020-11-02 00:00:00
3        2021            3         15      2021          3        6  2021-03-15 00:00:00  2021-03-06 00:00:00
4        2019            4         13      2021         11        5  2019-04-13 00:00:00  2021-11-05 00:00:00
5        2021            2          5      2018          8       17  2021-02-05 00:00:00  2018-08-17 00:00:00
6        2020            4         19      2020          9       18  2020-04-19 00:00:00  2020-09-18 00:00:00
7        2020            3         27      2020         10       20  2020-03-27 00:00:00  2020-10-20 00:00:00
8        2019           12         23      2018          5       11  2019-12-23 00:00:00  2018-05-11 00:00:00
9        2021            7         18      2018          5       10  2021-07-18 00:00:00  2018-05-10 00:00:00
An interesting feature of the pandas to_datetime function is that instead of
a sequence of strings you can pass it a whole DataFrame.
In this case the requirement is that such a DataFrame must have columns
named year, month and day. They can also be of float type, like your source
DataFrame sample.
So a quite elegant solution is to:
take a part of the source DataFrame (3 columns with the respective year,
month and day),
rename its columns to year, month and day,
use it as the argument to to_datetime,
save the result as a new column.
To do it, start from defining a lambda function, to be used as the rename
function below:
colNames = lambda x: x.split('_')[1]
Then just call:
df['Start'] = pd.to_datetime(df.loc[:, 'start_year' : 'start_day']
.rename(columns=colNames))
df['End'] = pd.to_datetime(df.loc[:, 'end_year' : 'end_day']
.rename(columns=colNames))
For a sample of your source DataFrame, the result is:
start_year start_month start_day evidence_method_dating end_year end_month end_day Start End
0 2019.0 12.0 9.0 Historical Observations 2019.0 12.0 9.0 2019-12-09 2019-12-09
1 2019.0 2.0 18.0 Historical Observations 2019.0 7.0 28.0 2019-02-18 2019-07-28
2 2018.0 7.0 3.0 Seismicity 2019.0 8.0 20.0 2018-07-03 2019-08-20
Maybe the next part should be to remove columns with parts of both "start"
and "end" dates. Your choice.
Edit
To avoid saving the lambda (anonymous) function under a variable, define
this function as a regular (named) function:
def colNames(x):
    return x.split('_')[1]
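Putting the rename-plus-to_datetime idea together as one runnable sketch (the sample values here are made up, laid out in the question's start_/end_ column scheme):

```python
import pandas as pd

df = pd.DataFrame({
    "start_year": [2019.0, 2018.0], "start_month": [12.0, 7.0], "start_day": [9.0, 3.0],
    "end_year": [2019.0, 2019.0], "end_month": [12.0, 8.0], "end_day": [9.0, 20.0],
})

def colNames(x):
    return x.split('_')[1]   # 'start_year' -> 'year', etc.

# to_datetime accepts a DataFrame with year/month/day columns, even as floats
df['Start'] = pd.to_datetime(df.loc[:, 'start_year':'start_day'].rename(columns=colNames))
df['End'] = pd.to_datetime(df.loc[:, 'end_year':'end_day'].rename(columns=colNames))
print(df[['Start', 'End']])
```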

pandas-resample-to-specific-weekday-in-month (Monday before 3rd Friday)

I have a pandas series s, and I would like to extract the Monday before the third Friday of each month.
With the help of the answer in the following link, I can resample to the third Friday, but I am still not sure how to get the Monday just before it.
pandas resample to specific weekday in month
from pandas.tseries.offsets import WeekOfMonth
s.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()
Any help is welcome
Many thanks
For each source date, compute your "wanted" date in 3 steps:
Shift back to the first day of the current month.
Shift forward to Friday in third week.
Shift back 4 days (from Friday to Monday).
For a Series containing dates, the code to do it is:
s.dt.to_period('M').dt.to_timestamp() + pd.offsets.WeekOfMonth(week=2, weekday=4)\
- pd.Timedelta('4D')
To test this code I created the source Series as:
s = (pd.date_range('2020-01-01', '2020-12-31', freq='MS') + pd.Timedelta('1D')).to_series()
It contains the second day of each month, both as the index and value.
When you run the above code, you will get:
2020-01-02 2020-01-13
2020-02-02 2020-02-17
2020-03-02 2020-03-16
2020-04-02 2020-04-13
2020-05-02 2020-05-11
2020-06-02 2020-06-15
2020-07-02 2020-07-13
2020-08-02 2020-08-17
2020-09-02 2020-09-14
2020-10-02 2020-10-12
2020-11-02 2020-11-16
2020-12-02 2020-12-14
dtype: datetime64[ns]
The left column contains the original index (source date) and the right
column - the "wanted" date.
Note that the third-Monday formula (as proposed in one of the comments) is wrong.
E.g. the third Monday in January 2020 is 2020-01-20, whereas the correct date is 2020-01-13.
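The three steps can be checked on a single Timestamp; 2020-01-02 is used here as an example:

```python
import pandas as pd

d = pd.Timestamp("2020-01-02")
wanted = (d.to_period('M').to_timestamp()              # step 1: first day of the month
          + pd.offsets.WeekOfMonth(week=2, weekday=4)  # step 2: third Friday (week is 0-based)
          - pd.Timedelta('4D'))                        # step 3: back 4 days to Monday
print(wanted)
```

The third Friday of January 2020 is 2020-01-17, so the result is Monday 2020-01-13, matching the first row of the table above.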
Edit
If you have a DataFrame, something like:
Date Amount
0 2020-01-02 10
1 2020-01-12 10
2 2020-01-13 2
3 2020-01-20 2
4 2020-02-16 2
5 2020-02-17 12
6 2020-03-15 12
7 2020-03-16 3
8 2020-03-31 3
and you want something like resample but each "period" should start
on a Monday before the third Friday in each month, and e.g. compute
a sum for each period, you can:
Define the following function:
def dateShift(d):
    d += pd.Timedelta(4, 'D')
    d = pd.offsets.WeekOfMonth(week=2, weekday=4).rollback(d)
    return d - pd.Timedelta(4, 'D')
i.e.:
Add 4 days (e.g. move 2020-01-13 (Monday) to 2020-01-17 (Friday)).
Roll back (in the above case the date is already on the offset, so it is not moved).
Subtract 4 days.
Run:
df.groupby(df.Date.apply(dateShift))['Amount'].sum()
The result is:
Amount
Date
2019-12-16 20
2020-01-13 6
2020-02-17 24
2020-03-16 6
E.g. the two values of 10 for 2020-01-02 and 2020-01-12 are assigned
to the period starting on 2019-12-16 (the "wanted" date for December 2019).
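The whole Edit section can be run as one self-contained sketch, with the sample data copied from above (Amount is selected explicitly so the datetime Date column is not summed):

```python
import pandas as pd

def dateShift(d):
    # move to the Friday of the same Mon-Fri span, roll back to the
    # nearest third Friday, then return the Monday 4 days before it
    d += pd.Timedelta(4, 'D')
    d = pd.offsets.WeekOfMonth(week=2, weekday=4).rollback(d)
    return d - pd.Timedelta(4, 'D')

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-12", "2020-01-13",
                            "2020-01-20", "2020-02-16", "2020-02-17",
                            "2020-03-15", "2020-03-16", "2020-03-31"]),
    "Amount": [10, 10, 2, 2, 2, 12, 12, 3, 3],
})

result = df.groupby(df.Date.apply(dateShift))["Amount"].sum()
print(result)
```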

Match datetime YYYY-MM-DD object in pandas dataframe

I have a pandas DataFrame of the form:
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 2014-01-01 00:00:00
I am interested in only the year, month and day in the birth column of the dataframe. I tried to leverage on the Python datetime from pandas but it resulted into an error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1054-02-07 00:00:00
The birth column is an object dtype.
My guess would be that it is an incorrect date. I would not like to pass the parameter errors="coerce" into the to_datetime method, because each item is important and I need just the YYYY-MM-DD.
I tried to leverage on the regex from pandas:
df["birth"].str.find("(\d{4})-(\d{2})-(\d{2})")
But this is returning NANs. How can I resolve this?
Thanks
Because it is not possible to convert all values to datetimes, you can split on the first whitespace and then select the first part:
df['birth'] = df['birth'].str.split().str[0]
And then, if necessary, convert to periods, which can represent out-of-bounds dates.
print (df)
id amount birth
0 4 78.0 1980-02-02 00:00:00
1 5 24.0 1989-03-03 00:00:00
2 6 49.5 2014-01-01 00:00:00
3 7 34.0 2014-01-01 00:00:00
4 8 49.5 0-01-01 00:00:00
def to_per(x):
    splitted = x.split('-')
    return pd.Period(year=int(splitted[0]),
                     month=int(splitted[1]),
                     day=int(splitted[2]), freq='D')

df['birth'] = df['birth'].str.split().str[0].apply(to_per)
print (df)
id amount birth
0 4 78.0 1980-02-02
1 5 24.0 1989-03-03
2 6 49.5 2014-01-01
3 7 34.0 2014-01-01
4 8 49.5 0000-01-01
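Periods can represent dates far outside the nanosecond Timestamp range, which is what makes this workaround possible; a quick sketch using the date from the OutOfBoundsDatetime error above:

```python
import pandas as pd

# 1054-02-07 overflows a nanosecond Timestamp but fits fine in a daily Period
p = pd.Period(year=1054, month=2, day=7, freq='D')
print(p)
```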

How to use groupby on day and month in pandas?

I have a timeseries data for a full year for every minute.
timestamp day hour min somedata
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 x
2010-01-01 00:02:00 1 0 2 x
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 x
... ...
2010-12-31 23:55:00 365 23 55
2010-12-31 23:56:00 365 23 56
2010-12-31 23:57:00 365 23 57
2010-12-31 23:58:00 365 23 58
2010-12-31 23:59:00 365 23 59
I want to group the data by day, i.e. the 2010-01-01 data should be one group, 2010-01-02 another, up to 2010-12-31.
I used daily_groupby = dataframe.groupby(pd.to_datetime(dataframe.index.day, unit='D', origin=pd.Timestamp('2009-12-31'))). This creates groups based on the day of the month, so day 01 of Jan, Feb, ..., Dec all end up in one group. But I also want to group by month so that January, February, etc. do not get mixed up.
I am a beginner in pandas.
If timestamp is the index, use DatetimeIndex.date:
df.groupby(pd.to_datetime(df.index).date)
Otherwise use Series.dt.date:
df.groupby(pd.to_datetime(df['timestamp']).dt.date)
If you don't want to group by year, use:
time_index = pd.to_datetime(df.index)
df.groupby([time_index.month,time_index.day])
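A minimal sketch of the per-date grouping, using a tiny made-up frame in place of the full year of minute data:

```python
import pandas as pd

idx = pd.to_datetime(["2010-01-01 00:00", "2010-01-01 00:01",
                      "2010-01-02 00:00", "2010-12-31 23:59"])
df = pd.DataFrame({"somedata": [1, 2, 3, 4]}, index=idx)

# one group per calendar date, so months are never mixed
daily = df.groupby(df.index.date).sum()
print(daily)
```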
