Parse Year Week columns to Date - python

I have a data frame with columns Year and Week that I am trying to parse to date into a new column called Date.
import datetime
df['Date']=datetime.datetime.fromisocalendar(df['Year'], df['Week'], 1)
But this generates the following error: 'cannot convert the series to <class 'int'>'.
My desired outcome is to give the Sunday Date of each week.
For example:
Year: 2022
Week: 01
Expected Date: 2022-01-02
I know there are similar posts to this already, and I have tried to manipulate, but I was unsuccessful.
Thanks for the help!

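You can build a year + week + weekday string and parse it with pd.to_datetime. In the format string, %U is the week number with Sunday as the first day of the week, and %w is the weekday, where 0 means Sunday: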
Yw = df['Year'].astype(str) + df['Week'].astype(str) + '0'
df['Date'] = pd.to_datetime(Yw, format='%Y%U%w')
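A self-contained sketch of the same idea follows. The sample frame is made up for illustration, and zero-padding the week with str.zfill(2) is an added assumption so that every string has the same fixed width before parsing.

```python
import pandas as pd

# Hypothetical sample data; Week is stored as an integer here.
df = pd.DataFrame({"Year": [2022, 2022], "Week": [1, 2]})

# Build YYYY + WW + weekday. The trailing "0" is the %w code for Sunday.
yw = df["Year"].astype(str) + df["Week"].astype(str).str.zfill(2) + "0"

# %U counts weeks starting on Sunday; days before the first Sunday are week 0.
df["Date"] = pd.to_datetime(yw, format="%Y%U%w")
print(df)
```

If the weeks follow ISO numbering instead (Monday-based), the format codes would need to change, so it is worth checking one known week against a calendar.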

Related

How to deal with inconsistent date series in Python?

Inconsistent date formats
As shown in the photo above, the check-in and check-out date formats are inconsistent. Whenever I try to convert the entire series to datetime using df['Check-in date'] = pd.to_datetime(df['Check-in date'], errors='coerce') and
df['Check-out date'] = pd.to_datetime(df['Check-out date'], errors='coerce'), the days and months get mixed up. I don't really know what to do now. I also tried splitting the days, months, and years and rearranging them, but still had no luck.
My goal here is to get the total nights stayed by each guest, but because of the inconsistency I end up with negative totals.
I'd appreciate any help here. Thanks!
You can try several formats with strptime and return a datetime object as soon as one of them matches.
from datetime import datetime
import pandas as pd
def try_different_formats(value):
    only_date_format = "%d/%m/%Y"
    date_and_time_format = "%Y-%m-%d %H:%M:%S"
    try:
        return datetime.strptime(value, only_date_format)
    except ValueError:
        pass
    try:
        return datetime.strptime(value, date_and_time_format)
    except ValueError:
        return pd.NaT
In your example:
df = pd.DataFrame({'Check-in date': ['19/02/2022','2022-02-12 00:00:00']})
Check-in date
0 19/02/2022
1 2022-02-12 00:00:00
The apply method runs this function on every value of the Check-in date
column. The result is a column of datetime values.
df['Check-in date'].apply(try_different_formats)
0 2022-02-19
1 2022-02-12
Name: Check-in date, dtype: datetime64[ns]
For a more pandas-specific solution, you can check out this answer.
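One vectorized variant, sketched here under the assumption that only the two formats from the example occur, is to parse the whole column once per format with errors='coerce' and then combine the results. The sample values, including the deliberately invalid one, are illustrative.

```python
import pandas as pd

# Illustrative values: two valid formats plus one string that matches neither.
s = pd.Series(["19/02/2022", "2022-02-12 00:00:00", "not a date"])

# Parse once per format; values that do not match the format become NaT.
day_first = pd.to_datetime(s, format="%d/%m/%Y", errors="coerce")
iso_style = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Keep the first successful parse for each row.
parsed = day_first.fillna(iso_style)
print(parsed)
```

Because each pass uses an explicit format, days and months cannot be swapped silently; anything left as NaT is a value that matched neither pattern.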

Extracting month from date in dd/mm/yyyy format in pandas

I can't get the month and day from a date in the correct format.
I'm using both pd.DatetimeIndex(df['date1']).month
and pd.to_datetime(parity['date1']).dt.month, but both still read the day as the month; only when the value is larger than 12 is it treated as the day.
Thank you in advance
Specify the format of the dates (adjust the separator to match your data):
df['date1'] = pd.to_datetime(df['date1'], format='%d.%m.%Y').dt.month
Or set parameter dayfirst=True:
df['date1'] = pd.to_datetime(df['date1'], dayfirst=True).dt.month
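As a small check, assuming the strings use slashes as in the question title (dd/mm/yyyy), the explicit format reads an ambiguous value such as 01/02/2022 as 1 February. The sample values are made up.

```python
import pandas as pd

# Both sample dates are in February; the first one is ambiguous without a format.
s = pd.Series(["01/02/2022", "13/02/2022"])

months = pd.to_datetime(s, format="%d/%m/%Y").dt.month
print(months.tolist())
```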

Converting Julian to calendar date using pandas

I am trying to convert Julian codes to calendar dates in pandas using :
pd.to_datetime(43390, unit = 'D', origin = 'Julian')
This is giving me ValueError: origin Julian cannot be converted to a Timestamp
You need to set origin = 'julian'
pd.to_datetime(43390, unit = 'D', origin = 'julian')
but this number (43390) throws
OutOfBoundsDatetime: 43390 is Out of Bounds for origin='julian'
because the bounds are from 2333836 to 2547339
(Timestamp('1677-09-21 12:00:00') to Timestamp('2262-04-11 12:00:00'))
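For comparison, here is a sketch with a value inside those bounds. The number 2459581 is chosen for illustration; it is the Julian Day Number that corresponds to noon on 1 January 2022.

```python
import pandas as pd

# Julian Day Numbers count days from noon, so an integer value lands at 12:00.
ts = pd.to_datetime(2459581, unit="D", origin="julian")
print(ts)
```

A value such as 43390 is far too small to be a Julian Day Number in this sense, which suggests it comes from a different day-count system.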
Method 1 - using Julian for origin didn't work
Method 2 - using excel start date to calculate other dates. All other date values will be referenced from excel default start date.
Finally this worked for me.
pd.to_datetime(43390, unit = 'D', origin=pd.Timestamp("30-12-1899"))
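A sketch of the same conversion applied to a whole column, assuming the numbers are Excel serial dates. The origin is written in ISO form here to avoid any day/month ambiguity, and the column name serial is hypothetical.

```python
import pandas as pd

# Excel's day 0 for modern dates, written unambiguously as year-month-day.
excel_origin = pd.Timestamp("1899-12-30")

# Hypothetical column of Excel serial numbers.
df = pd.DataFrame({"serial": [43101, 43390]})
df["date"] = pd.to_datetime(df["serial"], unit="D", origin=excel_origin)
print(df)
```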
The code below works only for 6-digit Julian values. It handles both leap and non-leap years.
Here a Julian date has the form "CYYDDD", where C is the century, YY is the year, and DDD is the day of the year, which is then converted into a month and day.
import pandas as pd
from datetime import datetime, timedelta
jul_date = '120075'
add_days = int(jul_date[3:6])
cal_date = pd.to_datetime(datetime.strptime(str(19+int(jul_date[0:1]))+jul_date[1:3]+'-01-01','%Y-%m-%d'))-timedelta(1)+pd.DateOffset(days= add_days)
print(cal_date.strftime('%Y-%m-%d'))
output: 2020-03-15
without timedelta(1): 2020-03-16
Here the datetime.strptime function is used to parse the string into a date.
%Y represents the 4-digit year (for example, 1980).
%m and %d represent the month and the day as digits.
strftime('%Y-%m-%d') is used to drop the time component from the output.
timedelta(1): this subtracts one day from the date because the year was concatenated with '-01-01'. Day 001 is January 1 itself, so without the subtraction the result would be one day too late.
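An alternative sketch of the same CYYDDD conversion uses the %j (day of year) directive, which removes the need for the one-day correction. The function name cyyddd_to_date is hypothetical, and the century rule (C = 0 for 19xx, C = 1 for 20xx) is assumed from the example above.

```python
from datetime import date, datetime


def cyyddd_to_date(code: str) -> date:
    """Convert a 6-digit CYYDDD string to a calendar date (hypothetical helper)."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected 6 digits in CYYDDD form, got {code!r}")
    # Assumed century rule: C=0 means 19xx, C=1 means 20xx, and so on.
    year = 1900 + 100 * int(code[0]) + int(code[1:3])
    # %j parses the day of the year (001-366) and handles leap years itself.
    return datetime.strptime(f"{year}{code[3:]}", "%Y%j").date()


print(cyyddd_to_date("120075"))
```

strptime raises ValueError for an impossible day, such as 366 in a non-leap year, so invalid codes are not converted silently.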

Converting inconsistently formatted string dates to datetime in pandas

I have a pandas dataframe in which the date information is a string with the month and year:
date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
Note that the month is usually written as the 3 digit abbreviation, but is sometimes written as the full month for June and July.
I would like to convert this into a datetime format which assumes each date is on the first of the month:
date = [06-01-2017, 07-01-2017, 08-01-2018, 11-01-2019]
Edit to provide more information:
Two main issues I wasn't sure how to handle:
Month is not in a consistent format. Tried to solve this using by just taking a subset of the first three characters of the string.
Year is last two digits only, was struggling to specify that it is 2020 without it getting very messy
I have tried a dozen different things that didn't work, most recent attempt is below:
df['date'] = pd.to_datetime(dict(year = df['Record Month'].astype(str).str[-2:], month = df['Record Month'].astype(str).str[0:3], day=1))
This raises the error: Unable to parse string "JUN" at position 0
If you are not sure of the many spellings that can show up then a dictionary mapping would not work. Perhaps your best chance is to split and slice so you normalize into year and month columns and then build the date.
If date is a list as in your example.
date = [d.split() for d in date]
df = pd.DataFrame([[m[:3].lower(), '20' + y] for m, y in date],
                  columns=['month', 'year'])
# or, without the separate split step:
# df = pd.DataFrame([[s.split()[0][:3].lower(), '20' + s.split()[1]] for s in date],
#                   columns=['month', 'year'])
Then pass a mapper to series.replace as in
df.month = df.month.replace({'jan': 1, 'feb': 2 ...})
Then parse the dates from its components
# first cap the date to the first day of the month
df['day'] = 1
df = pd.to_datetime(df)
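Put together, the split-and-map approach might look like the sketch below. It assumes, as in the question, that every year belongs to the 2000s and that the first three letters identify the month; the names month_numbers, parts, and result are illustrative.

```python
import pandas as pd

date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]

# Map three-letter month abbreviations to month numbers.
abbreviations = ["jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec"]
month_numbers = {name: i for i, name in enumerate(abbreviations, start=1)}

# Split each string into its month and year parts.
parts = pd.DataFrame([d.split() for d in date], columns=["month", "year"])

# Normalize "JULY" -> "jul" before mapping; assume all years are 20xx.
parts["month"] = parts["month"].str[:3].str.lower().map(month_numbers)
parts["year"] = 2000 + parts["year"].astype(int)
parts["day"] = 1  # pin every date to the first of the month

result = pd.to_datetime(parts[["year", "month", "day"]])
print(result)
```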
You were close with using pandas.to_datetime(). Instead of using a dictionary though, you could just reformat the date strings to a more standard format. If you convert each date string into MMMYY format (pretty similar to what you were doing) you can pass the strftime format "%b%y" to to_datetime() and it will convert the strings into dates.
import pandas as pd
date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
df = pd.DataFrame(date, columns=["Record Month"])
df['date'] = pd.to_datetime(df["Record Month"].str[:3] + df["Record Month"].str[-2:], format='%b%y')
print(df)
This produces the following result:
Record Month date
0 JUN 17 2017-06-01
1 JULY 17 2017-07-01
2 AUG 18 2018-08-01
3 NOV 19 2019-11-01

How to avoid time being generated after subtracting timedelta

I have a dataframe which look like this as below
Year Birthday OnsetDate
5 2018/1/1
5 2018/2/2
Now I subtract the Day column from the OnsetDate column:
df['Birthday'] = df['OnsetDate'] - pd.to_timedelta(df['Day'], unit='Y')
but the resulting Birthday column includes a time component, as shown below:
Birthday
2013/12/31 18:54:00
2013/1/30 18:54:00
The output above is just dummy data; my point is that the time component makes the dates inaccurate after the operation. How can I avoid generating the time so that I get accurate dates?
Second question: I merged the above dataframe into another dataframe.
new.update(df)
and the 'new' dataframe Birthday column became like this
Birthday
1164394440000000000
1165949640000000000
So what actually caused this, and what is the solution?
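For the first question: pd.to_timedelta with unit='Y' does not subtract whole calendar years. It uses an average year length, so one year comes out as 365 days 05:49:12, which is where the time component comes from. You can see this by printing it (newer pandas versions may reject unit='Y' as ambiguous):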
print(pd.to_timedelta(1, unit='Y'))
365 days 05:49:12
If you want to avoid the time being generated, you can use DateOffset.
from pandas.tseries.offsets import DateOffset
df['Year'] = df['Year'].apply(lambda x: DateOffset(years=x))
df['Birthday'] = df['OnsetDate'] - df['Year']
Year OnsetDate Birthday
0 <DateOffset: years=5> 2018-01-01 2013-01-01
1 <DateOffset: years=5> 2018-02-02 2013-02-02
The second problem is caused by the column's dtype; you can use pd.to_datetime to convert it back.
new['Birthday'] = pd.to_datetime(new['Birthday'])
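Both fixes can be tried on a small frame, sketched below with sample data modelled on the question. It builds each offset inline instead of storing DateOffset objects in the Year column, which is a variation on the answer above; the name restored is illustrative.

```python
import pandas as pd

# Sample data modelled on the question.
df = pd.DataFrame({
    "Year": [5, 5],
    "OnsetDate": pd.to_datetime(["2018/1/1", "2018/2/2"]),
})

# Subtract whole calendar years row by row; no time component is introduced.
df["Birthday"] = [
    onset - pd.DateOffset(years=int(years))
    for onset, years in zip(df["OnsetDate"], df["Year"])
]
print(df)

# Large integers like those in the question are nanoseconds since 1970-01-01;
# pd.to_datetime turns them back into timestamps.
restored = pd.to_datetime(pd.Series([1164394440000000000, 1165949640000000000]))
print(restored)
```

The Year column stays numeric this way, which keeps it usable for other calculations.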
