Converting inconsistently formatted string dates to datetime in pandas - python

I have a pandas dataframe in which the date information is a string with the month and year:
date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
Note that the month is usually written as a three-letter abbreviation, but June and July are sometimes written out in full.
I would like to convert this into a datetime format which assumes each date is on the first of the month:
date = [06-01-2017, 07-01-2017, 08-01-2018, 11-01-2019]
Edit to provide more information:
Two main issues I wasn't sure how to handle:
The month is not in a consistent format. I tried to solve this by taking just the first three characters of the string.
The year is given as its last two digits only, and I was struggling to specify that it belongs to the 2000s without it getting very messy.
I have tried a dozen different things that didn't work, most recent attempt is below:
df['date'] = pd.to_datetime(dict(year = df['Record Month'].astype(str).str[-2:], month = df['Record Month'].astype(str).str[0:3], day=1))
This fails with the error "Unable to parse string "JUN" at position 0".

If you are not sure of the many spellings that can show up then a dictionary mapping would not work. Perhaps your best chance is to split and slice so you normalize into year and month columns and then build the date.
If date is a list as in your example:
date = [d.split() for d in date]
df = pd.DataFrame([[m[:3].lower(), '20' + y] for m, y in date],
# df = pd.DataFrame([[s.split()[0][:3].lower(), '20' + s.split()[1]] for s in date],
                  columns=['month', 'year'])
Then pass a mapper to series.replace as in
df.month = df.month.replace({'jan': 1, 'feb': 2 ...})
Then parse the dates from its components
# first cap the date to the first day of the month
df['day'] = 1
df = pd.to_datetime(df)
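A minimal end-to-end sketch of this split-and-map approach, assuming every year is in the 2000s and that month names are English. The name month_map and the use of calendar.month_abbr to build the mapping are illustrative additions, not part of the answer above; calendar.month_abbr follows the current locale, so this relies on an English (or default C) locale.

```python
import calendar
import pandas as pd

date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]

# Build {'jan': 1, ..., 'dec': 12}; calendar.month_abbr[0] is an empty string, so skip it.
month_map = {name.lower(): num for num, name in enumerate(calendar.month_abbr) if name}

parts = [s.split() for s in date]
df = pd.DataFrame(
    {
        "year": [2000 + int(y) for _, y in parts],              # assumption: all years are 20xx
        "month": [month_map[m[:3].lower()] for m, _ in parts],  # 'JULY' -> 'jul' -> 7
        "day": 1,                                                # first day of the month
    }
)

# to_datetime assembles dates from columns named year, month and day.
result = pd.to_datetime(df)
print(result)
```

An unexpected spelling raises a KeyError here, which makes bad input visible instead of silently producing a wrong date.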

You were close with using pandas.to_datetime(). Instead of using a dictionary though, you could just reformat the date strings to a more standard format. If you convert each date string into MMMYY format (pretty similar to what you were doing) you can pass the strftime format "%b%y" to to_datetime() and it will convert the strings into dates.
import pandas as pd
date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
df = pd.DataFrame(date, columns=["Record Month"])
df['date'] = pd.to_datetime(df["Record Month"].str[:3] + df["Record Month"].str[-2:], format='%b%y')
print(df)
Produces the following result:
Record Month date
0 JUN 17 2017-06-01
1 JULY 17 2017-07-01
2 AUG 18 2018-08-01
3 NOV 19 2019-11-01

Related

Parse Year Week columns to Date

I have a data frame with columns Year and Week that I am trying to parse to date into a new column called Date.
import datetime
df['Date']=datetime.datetime.fromisocalendar(df['Year'], df['Week'], 1)
But this generates the following error: 'cannot convert the series to <class 'int'>'.
My desired outcome is to give the Sunday Date of each week.
For example:
Year: 2022
Week: 01
Expected Date: 2022-01-02
I know there are similar posts to this already, and I have tried to manipulate, but I was unsuccessful.
Thanks for the help!
You can do
Yw = df['Year'].astype(str) + df['Week'].astype(str) + '0'
df['Date'] = pd.to_datetime(Yw, format='%Y%U%w')
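A quick check of the %U/%w approach on made-up rows. This sketch zero-pads the week and adds separators so that single-digit weeks cannot be misread; the padding and the column values are additions for illustration. %U counts weeks starting on Sunday, with the days before the first Sunday of the year falling in week 0, and %w uses 0 for Sunday.

```python
import pandas as pd

# Made-up sample data standing in for the real Year and Week columns.
df = pd.DataFrame({"Year": [2022, 2022], "Week": [1, 10]})

# Build strings such as '2022-01-0': year, zero-padded week, weekday 0 (Sunday).
yw = df["Year"].astype(str) + "-" + df["Week"].astype(str).str.zfill(2) + "-0"
df["Date"] = pd.to_datetime(yw, format="%Y-%U-%w")
print(df)
```

If the weeks in the data follow ISO 8601 numbering instead, the results will differ, because ISO weeks start on Monday.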

Pandas dt accessor returns wrong day and month

My CSV data looks like this -
Date Time
1/12/2019 12:04AM
1/12/2019 12:09AM
1/12/2019 12:14AM
and so on
And I am trying to read this file using pandas in the following way -
import pandas as pd
import numpy as np
data = pd.read_csv('D 2019.csv',parse_dates=[['Date','Time']])
print(data['Date_Time'].dt.month)
When I try to access the year through the dt accessor the year prints out fine as 2019.
But when I try to print the day or the month it is completely incorrect. In the case of month it starts off as 1 and ends up as 12 when the right value should be 12 all the time.
With the day it starts off as 12 and ends up at 31 when it should start at 1 and end in 31. The file has total of 8867 entries. Where am I going wrong ?
The default format is MM/DD, while yours is DD/MM.
The simplest solution is to set the dayfirst parameter of read_csv:
dayfirst : DD/MM format dates, international and European format (default False)
data = pd.read_csv('D 2019.csv', parse_dates=[['Date', 'Time']], dayfirst=True)
>>> data['Date_Time'].dt.month
# 0 12
# 1 12
# 2 12
# Name: Date_Time, dtype: int64
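Alternatively, read the file without date parsing, join the two columns, and pass an explicit format to pd.to_datetime. Note that %p (AM/PM) only takes effect together with the 12-hour directive %I, not %H:
df = pd.read_csv('D 2019.csv')
df["Date_Time"] = pd.to_datetime(df["Date"] + " " + df["Time"], format='%d/%m/%Y %I:%M%p')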
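A small self-contained check of explicit-format parsing for day-first dates with a 12-hour clock. The two rows are made up to stand in for the CSV, and the variable name stamp is illustrative. The point to notice is the pairing of %I with %p: with %H the AM/PM marker would not change the hour.

```python
import pandas as pd

# Made-up rows in the same layout as the CSV: day/month/year and a 12-hour time.
df = pd.DataFrame({"Date": ["1/12/2019", "1/12/2019"], "Time": ["12:04AM", "1:09PM"]})

# %d/%m/%Y reads the day first; %I is the 12-hour clock that %p (AM/PM) applies to.
stamp = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d/%m/%Y %I:%M%p")

print(stamp.dt.month.tolist())  # both rows fall in December
print(stamp.dt.hour.tolist())   # 12:04AM is hour 0, 1:09PM is hour 13
```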
You need to check the data types of your dataframe and convert the "Date" column to datetime, telling pandas that the day comes first:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
After that you can access the day, month, or year using:
dt.day
dt.month
dt.year
Note: Make sure you know which format the dates are in (D/M/Y or M/D/Y), since pandas reads ambiguous dates as M/D/Y unless told otherwise.
Full Code
import pandas as pd
import numpy as np
data = pd.read_csv('D 2019.csv')
data["Date"] = pd.to_datetime(data["Date"], dayfirst=True)
print(data["Date"].dt.day)
print(data["Date"].dt.month)
print(data["Date"].dt.year)

Not all dates are captured when filtering by dates. Python Pandas

I am filtering a dataframe by dates to produce two separate versions:
Data from only today's date
Data from the last two years
However, when I try to filter on the date, it seems to miss dates that are within the last two years.
date_format = '%m-%d-%Y' # desired date format
today = dt.now().strftime(date_format) # today's date. Will always result in today's date
today = dt.strptime(today, date_format).date() # converting 'today' into a datetime object
today = today.strftime(date_format)
two_years = today - relativedelta(years=2) # date is today's date minus two years.
two_years = two_years.strftime(date_format)
# normalizing the format of the date column to the desired format
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce').dt.strftime(date_format)
df_today = df_data[df_data['date'] == today]
df_two_year = df_data[df_data['date'] >= two_years]
Which results in:
all dates ['07-17-2020' '07-15-2020' '08-01-2019' '03-25-2015']
today df ['07-17-2020']
two year df ['07-17-2020' '08-01-2019']
The 07-15-2020 date is missing from the two year, even though 08-01-2019 is captured.
You don't need to convert anything to string; simply work with the datetime dtype. Example:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['07-17-2020','07-15-2020','08-01-2019','03-25-2015'])})
today = pd.Timestamp('now')
print(df[df['date'].dt.date == today.date()])
# date
# 0 2020-07-17
print(df[(df['date'].dt.year >= today.year-1) & (df['date'].dt.date != today.date())])
# date
# 1 2020-07-15
# 2 2019-08-01
What you get from the comparison operations (adjust them as needed...) are boolean masks - you can use them nicely to filter the df.
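Your datatype conversions are the problem here. Once the dates are strings in '%m-%d-%Y' format they are compared character by character, so '07-15-2020' sorts before '07-17-2018' and is filtered out. Keep the values as dates instead. You could do this: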
today = dt.now() # today's date. Will always result in today's date
two_years = today - relativedelta(years=2) # date is today's date minus two years.
Printing two_years at this point gives '2018-07-17 18:40:42.704395'. You can then convert it to a date-only value.
two_years = two_years.strftime(date_format)
two_years = dt.strptime(two_years, date_format).date()
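A minimal sketch of the same filter done entirely on datetime values, using the four dates from the question. "Today" is pinned to 2020-07-17 so the result is reproducible; in real use it would be something like pd.Timestamp.now().normalize(). Using pd.DateOffset for the two-year step is a choice made here, in place of relativedelta.

```python
import pandas as pd

df_data = pd.DataFrame({"date": ["07-17-2020", "07-15-2020", "08-01-2019", "03-25-2015"]})

# Parse once and keep the column as datetime64; do not format it back to strings.
df_data["date"] = pd.to_datetime(df_data["date"], format="%m-%d-%Y", errors="coerce")

today = pd.Timestamp("2020-07-17")           # assumption: fixed date for the example
two_years = today - pd.DateOffset(years=2)   # 2018-07-17

df_today = df_data[df_data["date"].dt.normalize() == today]
df_two_year = df_data[df_data["date"] >= two_years]

# For contrast: the string form compares lexicographically, which is why
# '07-15-2020' was dropped in the question.
print("07-15-2020" >= "07-17-2018")
print(df_two_year)
```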

Split and Format Date Range

I am working on parsing a date range from an email in Zapier. Here is what comes in: Dec 4 - Jan 4, 2020. From this I need to separate the start and end dates into something like 12/04/2019 and 01/04/2020. Some ranges start in the prior year, as in the example above, and some fall within a single year, for example Mar 4 - Mar 22, 2020. It seems the code to use in Zapier is Python. I have looked at examples for pandas:
import pandas as pd
date_series = pd.date_range(start='Mar 4' -, end='Mar 7, 2020')
print(date)
But keep getting errors.
Any suggestions would be much appreciated thanks
This is one way to do it:
import pandas as pd
def parse_email_range(date_string):
    dates = date_string.split(' - ')
    month_1 = pd.to_datetime(dates[0], format='%b %d').month
    month_2 = pd.to_datetime(dates[1]).month
    day_1 = pd.to_datetime(dates[0], format='%b %d').day
    day_2 = pd.to_datetime(dates[1]).day
    year_2 = pd.to_datetime(dates[1]).year
    # the start shares the end year unless that would put it on or after the end date
    year_1 = year_2 if (month_1 < month_2) or (month_1 == month_2 and day_1 < day_2) else year_2 - 1
    return '{}-{}-{}'.format(year_1, month_1, day_1), '{}-{}-{}'.format(year_2, month_2, day_2)
parse_email_range('Dec 4 - Jan 4, 2020')
## ('2019-12-4', '2020-1-4')
Split the two dates and record them into a single variable:
raw_dates = 'Dec 4 - Jan 4, 2020'.split(" - ")
The dateutil package can parse most date strings:
from dateutil.parser import parse
Parse and separate start and end date from the raw dates:
start_date, end_date = (parse(date) for date in raw_dates)
strftime is the method that could be used to format dates.
Store desired format in a variable (please note I have used day first format):
date_format = '%d/%m/%Y'
Convert the end date into the desired format:
print(end_date.strftime(date_format))
'04/01/2020'
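Convert the start date. When a string has no year, parse fills in the current year, so in this example the start date has to be moved back one year (a range within a single year would not need this step).
dateutil's relativedelta function will help us to subtract one year from the start date: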
from dateutil.relativedelta import relativedelta
adjusted_start_date = start_date - relativedelta(years=1)
print(adjusted_start_date.strftime(date_format))
'04/12/2019'
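A sketch of the year-rollover logic that uses only the standard library's datetime.strptime rather than pandas or dateutil, which may be convenient if extra packages are not available. It assumes the input always looks like "Mon D - Mon D, YYYY" with English month abbreviations; the function name split_date_range and its out_format parameter are hypothetical.

```python
from datetime import datetime


def split_date_range(text, out_format="%m/%d/%Y"):
    """Split 'Dec 4 - Jan 4, 2020' into formatted start and end dates."""
    start_raw, end_raw = (part.strip() for part in text.split(" - "))
    end = datetime.strptime(end_raw, "%b %d, %Y")

    # Borrow the end date's year for the start date first.
    start = datetime.strptime(f"{start_raw} {end.year}", "%b %d %Y")
    # If that puts the start after the end, the range began in the previous year.
    if start > end:
        start = datetime.strptime(f"{start_raw} {end.year - 1}", "%b %d %Y")

    return start.strftime(out_format), end.strftime(out_format)


print(split_date_range("Dec 4 - Jan 4, 2020"))
print(split_date_range("Mar 4 - Mar 22, 2020"))
```

A start date of Feb 29 is not handled specially and raises a ValueError when the borrowed year is not a leap year.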

ValueError when converting String to datetime

I have a dataframe as follows, and I am trying to reduce the dataframe to only contain rows for which the Date is greater than a variable curve_enddate. The df['Date'] is in datetime and hence I'm trying to convert curve_enddate[i][0] which gives a string of the form 2015-06-24 to datetime but am getting the error ValueError: time data '2015-06-24' does not match format '%Y-%b-%d'.
Date Maturity Yield_pct Currency
0 2015-06-24 0.25 na CAD
1 2015-06-25 0.25 0.0948511020 CAD
The line where I get the Error:
df = df[df['Date'] > time.strptime(curve_enddate[i][0], '%Y-%b-%d')]
Thank You
You are using the wrong date format: %b matches abbreviated month names (Jan, Feb, etc.). Use %m for numeric months.
Code -
df = df[df['Date'] > time.strptime(curve_enddate[i][0], '%Y-%m-%d')]
You cannot compare a time.struct_time, which is what time.strptime returns, to a Timestamp, so you need to change that as well as switching to '%Y-%m-%d', where %m is the month as a decimal number. You can use pd.to_datetime to create the object to compare, passing the format as a keyword argument:
df = df[df['Date'] > pd.to_datetime(curve_enddate[i][0], format='%Y-%m-%d')]
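A small runnable check of the Timestamp comparison, using the two rows from the question. The plain string curve_enddate stands in for curve_enddate[i][0], and the names cutoff and filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-06-24", "2015-06-25"]),
        "Maturity": [0.25, 0.25],
        "Currency": ["CAD", "CAD"],
    }
)

curve_enddate = "2015-06-24"  # stand-in for curve_enddate[i][0]

# format must be passed by keyword; the second positional parameter of
# pd.to_datetime is errors, not format.
cutoff = pd.to_datetime(curve_enddate, format="%Y-%m-%d")

# A datetime64 column compares directly against a Timestamp.
filtered = df[df["Date"] > cutoff]
print(filtered)
```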
