Get year from unknown date format using python

I am querying a server for specific data and need to extract the year from the date field it returns. However, the format of the date field varies, for example:
2009
2009-10-8
2009-10
2017-10-22
2017-10
The obvious approach would be to split the date into a list and take the maximum, but there is a problem:
year = max(d.split('-'))
For some reason this gives wrong results: 22 comes out as the maximum rather than 2017. Also, if future calls to the server return the date as "2019/10/20", this will break as well.

The problem is that, while 2017 > 22, '2017' < '22' because it's a string comparison. You could do this to resolve that:
year = max(map(int, d.split('-')))
But instead, if you don't mind being frowned upon by the Long Now Foundation, consider using a regular expression to extract any 4-digit number:
import re
match = re.search(r'\b\d{4}\b', d)
if match:
    year = int(match.group(0))
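Putting both points together, here is a minimal sketch using only the standard library. The helper name extract_year is hypothetical, and it assumes the year is the only standalone 4-digit number in the field, whatever the separator:

```python
import re

def extract_year(date_field):
    """Return the first standalone 4-digit number in date_field, or None."""
    match = re.search(r'\b\d{4}\b', date_field)
    return int(match.group(0)) if match else None

# Sample values from the question, plus the slash-separated variant
samples = ['2009', '2009-10-8', '2017-10-22', '2019/10/20']
years = [extract_year(s) for s in samples]
print(years)

# The original bug: string comparison picks '22' over '2017'
print(max('2017-10-22'.split('-')))
```

Returning None for fields without a 4-digit number leaves it to the caller to decide how to handle unparseable values.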

I would use the python-dateutil library to easily extract the year from a date string:
from dateutil.parser import parse
dates = ['2009', '2009-10-8', '2009-10']
for date in dates:
    print(parse(date).year)
Output:
2009
2009
2009

Related

Why does pandas interpret Aug-30 as 1930-08, but not 2030-08? [duplicate]

I'm coming across something that is almost certainly a stupid mistake on my part, but I can't seem to figure out what's going on.
Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.
E.g.:
dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']
pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
'2061-01-09', '2055-02-08'],
dtype='datetime64[ns]', freq=None)
The last two entries, which get returned as 2061 and 2055 for the years, are wrong. But this works fine for the 15-Jun-70 entry. What's going on here?

That seems to be the behavior of Python's datetime library. I ran a test, and the cutoff falls between 68 and 69:
>>> import datetime
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)
>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)
Two digits year ambiguity
So any %y year below 69 is assigned to the 2000s, and 69 and above are assigned to the 1900s.
The two digits of %y can only run from 00 to 99, which becomes ambiguous once the data crosses a century boundary.
If there is no overlap, you can process the data manually and annotate the century (which kills the ambiguity)
I suggest you process your data manually and specify the century. For example, you can decide that any year between 17 and 68 in your data belongs to 1917-1968 (instead of 2017-2068).
If there is overlap, you can't resolve it with the year information alone, unless e.g. the data is ordered and you have a reference
If you have data from both 2016 and 1916 and both were logged as '16', that's ambiguous, and there isn't sufficient information to parse it unless the data is ordered by date, in which case you can use heuristics to switch the century as you parse.
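The no-overlap case can be sketched with the standard library alone. The function name parse_with_cutoff and the latest_year parameter are hypothetical, and the sketch assumes an English locale for %b and that no genuine date in the data is later than latest_year:

```python
from datetime import datetime

def parse_with_cutoff(text, fmt='%d-%b-%y', latest_year=2016):
    """Parse a two-digit-year date, moving anything after latest_year back a century."""
    parsed = datetime.strptime(text, fmt)
    if parsed.year > latest_year:
        # strptime chose 20xx, but the data cannot contain dates this late
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed.date()

dates = ['26-Sep-05', '15-Jun-70', '9-Jan-61', '8-Feb-55']
fixed = [parse_with_cutoff(d) for d in dates]
for d in fixed:
    print(d)
```

With latest_year=2016, the years 17 to 68 land in 1917-1968, matching the rule suggested above.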

From the docs:
Year 2000 (Y2K) issues: Python depends on the platform’s C library,
which generally doesn’t have year 2000 issues, since all dates and
times are represented internally as seconds since the epoch. Function
strptime() can parse 2-digit years when given %y format code. When
2-digit years are parsed, they are converted according to the POSIX
and ISO C standards: values 69–99 are mapped to 1969–1999, and values
0–68 are mapped to 2000–2068.
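The mapping quoted above can be written out as a tiny function. This is only an illustration of the rule, not the library's implementation, and the name posix_two_digit_year is made up for the example:

```python
def posix_two_digit_year(yy):
    """Map a two-digit year to four digits using the 69/68 pivot quoted above."""
    if not 0 <= yy <= 99:
        raise ValueError('expected a two-digit year between 0 and 99')
    # 69-99 -> 1969-1999, 0-68 -> 2000-2068
    return 1900 + yy if yy >= 69 else 2000 + yy

for yy in (0, 55, 68, 69, 99):
    print(yy, posix_two_digit_year(yy))
```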

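For anyone looking for a quick and dirty code snippet to fix these cases, this worked for me:
import pandas as pd
from datetime import timedelta
col = 'date'
df[col] = pd.to_datetime(df[col])
# Compare against a Timestamp; comparing a datetime64 column with datetime.date can fail in recent pandas versions
future = df[col] > pd.Timestamp(year=2050, month=1, day=1)
# Shift the affected rows back by one century (36525 days)
df.loc[future, col] -= timedelta(days=365.25*100)
You may need to move the threshold date closer to the present, depending on the earliest dates in your data.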

You can write a simple function to correct the wrongly parsed years, as shown below:
import datetime
def fix_date(x):
    # Years past the cutoff were assigned to the wrong century; adjust the cutoff to suit your data
    if x.year > 1989:
        year = x.year - 100
    else:
        year = x.year
    return datetime.date(year, x.month, x.day)
df['date_column'] = df['date_column'].apply(fix_date)

Another quick solution to the problem:
import pandas as pd
import numpy as np
dates = pd.DataFrame(['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55'])
for i in dates:
    tempyear = pd.to_numeric(dates[i].str[-2:])
    # Two-digit years 44-99 are assigned to the 1900s, everything else to the 2000s
    dates["temp_year"] = np.where((tempyear >= 44) & (tempyear <= 99), tempyear + 1900, tempyear + 2000).astype(str)
    # Everything before the two-digit year, e.g. '26-Sep-'
    dates["temp_month"] = dates[i].str[:-2]
    dates["temp_flyr"] = dates["temp_month"] + dates["temp_year"]
    dates["pddt"] = pd.to_datetime(dates.temp_flyr.str.upper(), format='%d-%b-%Y', yearfirst=False)
    tempdrops = ["temp_year", "temp_month", "temp_flyr", i]
    dates.drop(tempdrops, axis=1, inplace=True)
The output is as follows. Here I have converted the column from object to pandas datetime format using pd.to_datetime:
pddt
0 2005-09-26
1 2005-09-26
2 1970-06-15
3 1994-12-05
4 1961-01-09
5 1955-02-08
As mentioned in some other answers this works best if there is no overlap between the dates of the two centuries.

If you run into the same problem with a pandas DataFrame, compare each date against the current date (or against a particular cutoff year) and apply a lambda function similar to the ones below:
import datetime as dt
import pandas as pd
# pd.DateOffset(years=100) shifts by exactly one century; timedelta(days=365*100) would fall about 25 days short
df["column"] = df["column"].apply(lambda x: x - pd.DateOffset(years=100) if x > dt.datetime.now() else x)
or
df["column"] = df["column"].apply(lambda x: x - pd.DateOffset(years=100) if x.year > 2022 else x)

Pandas: Sorting by week number and year string

I had a list of dates that I turned into week numbers and years using:
dfweek['weeknum'] = df['Date'].dt.strftime('%U_%Y')
This outputs, for example, 34_2019, where 34 is the 34th week of 2019.
How would I go about sorting the data by this string in chronological order? Currently the order comes out as:
00_2018
00_2019
01_2018
01_2019
I tried converting back to datetime by:
dfweek['weeknum1'] = pd.to_datetime(dfweek['weeknum'], format = '%W_%Y')
This kept returning the error: ValueError: Cannot use '%W' or '%U' without day and year
Tried adding a day in the form of %w just to see what happens
dfweek['weeknum'] = df['Date'].dt.strftime('%U_%Y_%w')
dfweek['weeknum1'] = pd.to_datetime(dfweek['weeknum'], format = '%W_%Y_%w')
but it just spits back the original date without the week number
My desired output would be
00_2018
01_2018
02_2018
...
51_2019
52_2019

You can use the following for the sorting:
dfweek = dfweek.assign(weeknum1= df['Date'].dt.strftime('%Y_%U')).sort_values('weeknum1')
Here we create a temporary column weeknum1 in year-first format, e.g. '2018_00', and sort on it. Because the year comes first and the week number is zero-padded, sorting the strings orders the rows by year and then by week number, as required.
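If only the week_year strings are available and the original dates are not, the same ordering can be obtained with a sort key. This is a minimal sketch in plain Python; week_key is a hypothetical helper, and it assumes every label has the form 'WW_YYYY':

```python
def week_key(label):
    """Turn 'WW_YYYY' into (year, week) so that sorting is chronological."""
    week, year = label.split('_')
    return int(year), int(week)

labels = ['00_2018', '00_2019', '01_2018', '01_2019']
ordered = sorted(labels, key=week_key)
print(ordered)
```

In pandas, the same key could be applied to the column before sorting, for example by building a temporary column from it.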

Date change halfway through csv from YYYY-MM-DD to DD/MM/YY and after switch datetime no longer works

I have a csv of daily temperature data with 3 columns: dates, daily maximum temperatures, and daily minimum temperatures. I attached it here so you can see what I mean.
I am trying to break this data set into smaller datasets of 30 year periods. For the first few years of Old.csv the dates are entered in YYYY-MM-DD but then switch to DD/MM/YY in 1900. After this date format switches my code to split the years no longer works. Here is what I'm using:
df2 = pd.read_csv("Old.csv")
test = df2[
(pd.to_datetime(df2['Date']) >
pd.to_datetime('1897-01-01')) &
(pd.to_datetime(df2['Date']) <
pd.to_datetime('1899-12-31'))
]
and it works... BUT when I switch to 1900 and beyond it stops. So this one doesn't work:
test = df2[
(pd.to_datetime(df2['Date']) >
pd.to_datetime('1900-01-01')) &
(pd.to_datetime(df2['Date']) <
pd.to_datetime('1905-12-31'))
]
The above code gives me an empty dataset, despite working for dates before 1900. I'm assuming this is some sort of formatting issue, but I thought that using pd.to_datetime would fix that. I also tried this:
df2['Date']=pd.to_datetime(df2['Date'])
to reformat the entire column before running the code above, but it still didn't work. The other interesting thing is that I have a separate csv with dates consistently entered as MM/DD/YY, and that one works with the code above. Could it be an issue with the turn of the century? Does anyone know how to fix this?

You're dealing with date/time data in different formats. For this you could use a more flexible parser, for instance dateutil.parser.
Example:
>>> from dateutil.parser import parse
>>> df
Date
0 1897-01-01
1 1899-12-31
2 01/01/00
>>> df.Date.apply(parse)
0 1897-01-01 00:00:00
1 1899-12-31 00:00:00
2 2000-01-01
Name: Date, dtype: datetime64[ns]
and use your function on the parsed data.
As remarked in the comment above, it's still not clear whether year "00" refers to year 1900 or 2000, but maybe you can infer that from the context of the csv file.
To change all years in the 'DD/MM/YY' format to 1900 dates you could define your own parse function
>>> def my_parse(d):
...     if d[-3] == '/':
...         d = d[:-3] + '/19' + d[-2:]
...     return parse(d)
>>> df.Date.apply(my_parse)
0 1897-01-01
1 1899-12-31
2 1900-01-01

Python is reading 00 as 2000 instead of 1900, so I tried this to make 00 read as 1900:
df2.Date.dt.year.replace(2000, 1900, inplace=True)
But Python returned an error saying that dates are not directly editable. So I converted them to strings and edited them that way:
df2['Date'] = df2['Date'].str.replace(r'00', '1900')
This works, but now I need a way to loop through 1896-1968 without having to type that line out for every year.
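One way to avoid a per-year replacement is to parse each value according to its own format. The sketch below uses only the standard library; parse_mixed and its century parameter are hypothetical, and it assumes every slash-separated date is DD/MM/YY and belongs to the 1900s:

```python
from datetime import datetime

def parse_mixed(text, century=1900):
    """Parse either 'YYYY-MM-DD' or 'DD/MM/YY', placing two-digit years in `century`."""
    text = text.strip()
    if '/' in text:
        day, month, yy = text.split('/')
        # Assumption: all two-digit years belong to the same century
        return datetime(century + int(yy), int(month), int(day))
    return datetime.strptime(text, '%Y-%m-%d')

rows = ['1897-01-01', '1899-12-31', '01/01/00', '31/12/68']
parsed = [parse_mixed(r) for r in rows]
for p in parsed:
    print(p.date())
```

A function like this could be passed to df2['Date'].apply before filtering by date range.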

Selecting specific dates from dataframe

I have a dataset with the column 'Date', which has dates in several formats, including:
2018.05.07
01-Jun-2018
Reported 01 Jun 2018
Jun 2018
2018
before 1970
1941-1945
Ca. 1960
There are also invalid dates, such as:
190Feb-2010
I am trying to find the entries that contain an exact date (day, month, and year) and convert them to datetime. I also need to exclude entries with "Reported" in the field. Is there any way to filter such data without first enumerating all the possible date formats?

Use the dateutil library.
Use an if statement to check whether any part of the date (day, month, year) is missing; if so, skip it.
Use fuzzy=True if you want to extract dates from strings such as "Reported 01 Jun 2018".
import datetime
import dateutil.parser
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formated_date = []
for date in dates:
    try:
        # Missing components are filled in from `default`, so parse twice with different defaults;
        # the results match only when day, month and year are all present in the string
        first = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2015, 1, 1))
        second = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2016, 2, 2))
        if first == second:
            formated_date.append(first)
    except (ValueError, OverflowError):
        continue
Another solution: a brute-force method that checks each date against every format. Keep adding formats to cover more date styles. Note that this method is slow.
import datetime
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formats = ["%Y%m%d","%Y.%m.%d","%Y-%m-%d","%Y/%m/%d","%Y%a%d","%Y.%a.%d","%Y-%a-%d","%Y%A%d","%Y.%A.%d","%Y-%A-%d",
           "%d-%m-%Y","%d.%m.%Y","%d%m%Y","%d/%m/%Y","%d-%b-%Y","%d%b%Y","%d.%b.%Y","%d/%b/%Y"]
formated_date = []
for date in dates:
    for fmt in formats:
        try:
            dt = datetime.datetime.strptime(date, fmt)
        except ValueError:
            # This format does not match; try the next one
            continue
        formated_date.append(dt)
        # Stop at the first matching format so each date is added only once
        break

In [1]: string_with_dates = """entries are due by January 4th, 2017 at 8:00pm created 01/15/2005 by ACME Inc. and associates."""
In [2]: import datefinder
In [3]: matches = datefinder.find_dates(string_with_dates)
In [4]: for match in matches:
   ...:     print(match)
2017-01-04 20:00:00
2005-01-15 00:00:00
I hope this helps you find dates in a string that contains them.

Best way to get date strings with Python

What's the best way to get datestrings from a website using Python?
The datestrings can be, for example, in the forms of:
April 1st, 2011
April 2nd, 2011
April 23, 2011
4/2/2011
04/23/2011
Would this have to be a ton of regex? What's the most elegant solution?

Consider this lib: http://code.google.com/p/parsedatetime/
From its examples Wiki page, here are a couple of formats it can handle that look relevant to your question:
result = p.parseDateText("March 5th, 1980")
result = p.parseDate("4/4/80")
EDIT: now I notice it's actually a duplicate of this SO question where the same library was recommended!

month = r'(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]{0,6}'
regex_strings = [r'%s(\.| )\d{1,2},? \d{2,4}' % month,  # Month.Day, Year
                 r'\d{1,2} %s,? \d{4}' % month,         # Day Month Year(4)
                 r'%s \d{1,2}\w{2},? \d{4}' % month,    # Mon Day(th), Year
                 r'\d{1,2} %s' % month,                 # Day Month
                 r'\d{1,2}\.\d{1,2}\.\d{4}',            # Month.Day.Year
                 r'\d{1,2}/\d{1,2}/\d{2,4}',            # Month/Day/Year{2,4}
                 ]
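As a rough illustration of how patterns like these might be applied, here is a reduced sketch that covers only the formats listed in the question. The names MONTH, PATTERNS, DATE_RE and find_date_strings are hypothetical, and the two patterns are simplified variants written for this example, not the full list above:

```python
import re

MONTH = r'(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]{0,6}'
PATTERNS = [
    rf'{MONTH}[. ]\d{{1,2}}(?:st|nd|rd|th)?,? \d{{2,4}}',  # April 1st, 2011 / April 23, 2011
    r'\d{1,2}/\d{1,2}/\d{2,4}',                            # 4/2/2011 / 04/23/2011
]
# Combine the alternatives into a single case-insensitive pattern
DATE_RE = re.compile('|'.join(PATTERNS), re.IGNORECASE)

def find_date_strings(text):
    """Return every substring of text that looks like one of the known date formats."""
    return [m.group(0) for m in DATE_RE.finditer(text)]

page_text = 'Posted April 1st, 2011 and updated 04/23/2011.'
found = find_date_strings(page_text)
print(found)
```

The matched substrings can then be handed to a parser such as parsedatetime or dateutil to obtain actual date objects.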
