Selecting specific dates from dataframe - python

I have a dataset with the column 'Date', which has dates in several formats, including:
2018.05.07
01-Jun-2018
Reported 01 Jun 2018
Jun 2018
2018
before 1970
1941-1945
Ca. 1960
There are also invalid dates, such as:
190Feb-2010
I am trying to find the entries that contain an exact date (day, month, and year) and convert them to datetime. I also need to exclude entries with "Reported" in the field. Is there any way to filter such data without first enumerating all the possible date formats?

Use the dateutil library.
Parse each string twice with two different default dates: if any part of the date (day, month, or year) is missing, the two results differ and the entry can be skipped.
Pass fuzzy=True if you want to extract dates from strings such as "Reported 01 Jun 2018".
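import datetime
import dateutil.parser
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formated_date = []
for date in dates:
    try:
        # A complete date ignores the default, so both parses agree;
        # a missing day, month, or year is filled from the default and they differ.
        first = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2015, 1, 1))
        second = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2016, 2, 2))
    except (ValueError, OverflowError):
        continue
    if first == second:
        formated_date.append(first)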
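The same check can be wrapped in a small helper that also drops the "Reported" entries the question asks to exclude. This is only a sketch: the name parse_exact_date is made up for illustration, and it assumes python-dateutil is installed.

```python
import datetime
from dateutil import parser

# Two defaults that differ in year, month and day
_DEFAULT_A = datetime.datetime(2015, 1, 1)
_DEFAULT_B = datetime.datetime(2016, 2, 2)

def parse_exact_date(text):
    """Return a datetime if `text` holds a full day/month/year, else None."""
    # Explicitly skip entries marked as "Reported" (requirement from the question)
    if "reported" in text.lower():
        return None
    try:
        first = parser.parse(text, default=_DEFAULT_A)
        second = parser.parse(text, default=_DEFAULT_B)
    except (ValueError, OverflowError):
        return None  # not parseable as a date at all
    # If any component came from the default, the two results disagree
    return first if first == second else None

for raw in ["01-Jun-2018", "Reported 01 Jun 2018", "Jun 2018", "2018", "before 1970"]:
    print(repr(raw), "->", parse_exact_date(raw))
```

On a dataframe, something like df['Date'].map(parse_exact_date) should apply it to the whole column, leaving None where no exact date was found.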
Another solution is a brute-force method that checks each date against every format. Keep adding formats until it covers all the date formats in your data. Note that this method is slow.
import datetime
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
# %b is the abbreviated month name and %B the full one (%a and %A are weekday names)
formats = ["%Y%m%d","%Y.%m.%d","%Y-%m-%d","%Y/%m/%d","%Y%b%d","%Y.%b.%d","%Y-%b-%d","%Y%B%d","%Y.%B.%d","%Y-%B-%d",
           "%d-%m-%Y","%d.%m.%Y","%d%m%Y","%d/%m/%Y","%d-%b-%Y","%d%b%Y","%d.%b.%Y","%d/%b/%Y"]
formated_date = []
for date in dates:
    for fmt in formats:
        try:
            dt = datetime.datetime.strptime(date, fmt)
        except ValueError:
            continue
        formated_date.append(dt)
        break  # stop at the first format that matches
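A compact variant of the brute-force idea, using only the standard library, might look like the sketch below. The format list is an assumed subset chosen for the sample data, and parse_full_date is a hypothetical helper name. Because strptime has to match the whole string, entries such as "Reported 01 Jun 2018" or "Jun 2018" are rejected without extra checks.

```python
import datetime

# Assumed subset of formats; each one contains day, month and year
FULL_DATE_FORMATS = ["%Y.%m.%d", "%Y-%m-%d", "%d-%b-%Y", "%d %b %Y"]

def parse_full_date(text, formats=FULL_DATE_FORMATS):
    """Return the first successful strptime result, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
    return None

dates = ["2018.05.07", "01-Jun-2018", "Reported 01 Jun 2018", "Jun 2018", "2018", "190Feb-2010"]
parsed = {d: parse_full_date(d) for d in dates}
# Keep only the entries that matched a full-date format
exact = {d: dt for d, dt in parsed.items() if dt is not None}
print(exact)
```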

In [1]: string_with_dates = """entries are due by January 4th, 2017 at 8:00pm created 01/15/2005 by ACME Inc. and associates."""
In [2]: import datefinder
In [3]: matches = datefinder.find_dates(string_with_dates)
In [4]: for match in matches:
   ...:     print(match)
2017-01-04 20:00:00
2005-01-15 00:00:00
Hope this helps you find dates inside strings that contain them.

Related

Converting inconsistently formatted string dates to datetime in pandas

I have a pandas dataframe in which the date information is a string with the month and year:
date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
Note that the month is usually written as the 3-letter abbreviation, but is sometimes written out in full for June and July.
I would like to convert this into a datetime format which assumes each date is on the first of the month:
date = [06-01-2017, 07-01-2017, 08-01-2018, 11-01-2019]
Edit to provide more information:
Two main issues I wasn't sure how to handle:
Month is not in a consistent format. I tried to solve this by just taking the first three characters of the string.
Year is given as its last two digits only; I was struggling to specify that it is 20xx without it getting very messy.
I have tried a dozen different things that didn't work; my most recent attempt is below:
df['date'] = pd.to_datetime(dict(year = df['Record Month'].astype(str).str[-2:], month = df['Record Month'].astype(str).str[0:3], day=1))
This raises the error: Unable to parse string "JUN" at position 0
If you are not sure of the many spellings that can show up then a dictionary mapping would not work. Perhaps your best chance is to split and slice so you normalize into year and month columns and then build the date.
If date is a list as in your example:
date = [d.split() for d in date]
df = pd.DataFrame([[m[:3].lower(), '20' + y] for m, y in date],
# df = pd.DataFrame([[s.split()[0][:3].lower(), '20' + s.split()[1]] for s in date],
                  columns=['month', 'year'])
Then pass a mapper to series.replace as in
df.month = df.month.replace({'jan': 1, 'feb': 2 ...})
Then parse the dates from its components
# first cap the date to the first day of the month
df['day'] = 1
df = pd.to_datetime(df)
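Put together, the split, map, and assemble steps could look like the sketch below. It assumes every year is in the 2000s and builds the month mapping from calendar.month_abbr (English names under the default locale) instead of typing it out; the variable names are illustrative.

```python
import calendar
import pandas as pd

date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]

# 'jan' -> 1, ..., 'dec' -> 12 (month_abbr[0] is an empty string, so skip it)
month_numbers = {name.lower(): num for num, name in enumerate(calendar.month_abbr) if name}

parts = [d.split() for d in date]
df = pd.DataFrame(
    {
        "year": [2000 + int(y) for _, y in parts],                  # assumes 20xx
        "month": [month_numbers[m[:3].lower()] for m, _ in parts],  # first 3 letters
        "day": 1,                                                   # first of the month
    }
)
# to_datetime assembles dates from the year/month/day columns
result = pd.to_datetime(df)
print(result.dt.strftime("%m-%d-%Y").tolist())
```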
You were close with using pandas.to_datetime(). Instead of using a dictionary though, you could just reformat the date strings to a more standard format. If you convert each date string into MMMYY format (pretty similar to what you were doing) you can pass the strftime format "%b%y" to to_datetime() and it will convert the strings into dates.
import pandas as pd
date = ["JUN 17", "JULY 17", "AUG 18", "NOV 19"]
df = pd.DataFrame(date, columns=["Record Month"])
df['date'] = pd.to_datetime(df["Record Month"].str[:3] + df["Record Month"].str[-2:], format='%b%y')
print(df)
Produces the following result:
  Record Month       date
0       JUN 17 2017-06-01
1      JULY 17 2017-07-01
2       AUG 18 2018-08-01
3       NOV 19 2019-11-01

Get year from unknown date format using python

So I am querying a server for specific data, and I need to extract the year, from the date field returned back, however the date field varies for example:
2009
2009-10-8
2009-10
2017-10-22
2017-10
The obvious approach would be to split the date into an array and take the max (but there is a problem):
year = max(d.split('-'))
For some reason this gives false positives, as 22 seems to be the max versus 2017. Also, if future calls to the server return the date as "2019/10/20", this will cause issues as well.
The problem is that, while 2017 > 22, '2017' < '22' because it's a string comparison. You could do this to resolve that:
year = max(map(int, d.split('-')))
But instead, if you don't mind being frowned upon by the Long Now Foundation, consider using a regular expression to extract any 4-digit number:
import re
match = re.search(r'\b\d{4}\b', d)
if match:
    year = int(match.group(0))
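As a quick check of both points, the sketch below first shows the string comparison picking '22', then wraps the regular expression in a helper (extract_year is a hypothetical name) that also copes with the "2019/10/20" style mentioned in the question.

```python
import re

def extract_year(text):
    """Return the first standalone 4-digit number in `text` as an int, or None."""
    match = re.search(r"\b\d{4}\b", text)
    return int(match.group(0)) if match else None

# As text, '22' sorts after '2017', which is why max() on strings misleads
print(max("2017-10-22".split("-")))

for d in ["2009", "2009-10-8", "2017-10-22", "2019/10/20"]:
    print(d, "->", extract_year(d))
```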
I would use the python-dateutil library to easily extract the year from a date string:
from dateutil.parser import parse
dates = ['2009', '2009-10-8', '2009-10']
for date in dates:
    print(parse(date).year)
Output:
2009
2009
2009

Split and Format Date Range

I am working on parsing a date range from an email in Zapier. Here is what comes in: Dec 4 - Jan 4, 2020. From this I need to separate the start and end dates into something like 12/04/2019 and 01/04/2020, accounting for the fact that some ranges start in the prior year, as in the example above, and some fall within the same year, for example Mar 4 - Mar 22, 2020. It seems the code to use in Zapier is Python. I have looked at examples for pandas:
import pandas as pd
date_series = pd.date_range(start='Mar 4' -, end='Mar 7, 2020')
print(date)
But I keep getting errors.
Any suggestions would be much appreciated, thanks.
This is one way to do it:
import pandas as pd
def parse_email_range(date_string):
    dates = date_string.split(' - ')
    month_1 = pd.to_datetime(dates[0], format='%b %d').month
    month_2 = pd.to_datetime(dates[1]).month
    day_1 = pd.to_datetime(dates[0], format='%b %d').day
    day_2 = pd.to_datetime(dates[1]).day
    year_2 = pd.to_datetime(dates[1]).year
    # The start date falls in the end date's year only if it comes earlier in the calendar
    year_1 = year_2 if (month_1 < month_2) or (month_1 == month_2 and day_1 < day_2) else year_2 - 1
    return '{}-{}-{}'.format(year_1, month_1, day_1), '{}-{}-{}'.format(year_2, month_2, day_2)
parse_email_range('Dec 4 - Jan 4, 2020')
## ('2019-12-4', '2020-1-4')
Split the two dates and record them into a single variable:
raw_dates = 'Dec 4 - Jan 4, 2020'.split(" - ")
dateutil package is capable of parsing most dates:
from dateutil.parser import parse
Parse and separate start and end date from the raw dates:
start_date, end_date = (parse(date) for date in raw_dates)
strftime is the method that could be used to format dates.
Store desired format in a variable (please note I have used day first format):
date_format = '%d/%m/%Y'
Convert the end date into the desired format:
print(end_date.strftime(date_format))
'04/01/2020'
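Convert the start date. Because parse fills in a missing year with the current year, the start date has to be adjusted by hand.
dateutil's relativedelta function will help us subtract one year from the start date (this assumes the code runs in the end date's year and the range starts in the prior year):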
from dateutil.relativedelta import relativedelta
adjusted_start_date = start_date - relativedelta(years=1)
print(adjusted_start_date.strftime(date_format))
'04/12/2019'
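To avoid depending on the current year, the start date can borrow its year from the end date and step back one year only when it would otherwise land after the end date. The sketch below uses only the standard library; split_date_range is a hypothetical name, and it assumes the input always looks like "Mon D - Mon D, YYYY".

```python
from datetime import datetime

def split_date_range(text, out_format="%m/%d/%Y"):
    """Split 'Dec 4 - Jan 4, 2020' into formatted start and end dates."""
    start_raw, end_raw = (part.strip() for part in text.split(" - "))
    end = datetime.strptime(end_raw, "%b %d, %Y")
    # Assume the start date shares the end date's year ...
    start = datetime.strptime(f"{start_raw} {end.year}", "%b %d %Y")
    # ... unless that would put it after the end date.
    if start > end:
        start = start.replace(year=end.year - 1)
    return start.strftime(out_format), end.strftime(out_format)

print(split_date_range("Dec 4 - Jan 4, 2020"))   # range that starts in the prior year
print(split_date_range("Mar 4 - Mar 22, 2020"))  # range within one year
```

Passing out_format="%d/%m/%Y" would give the day-first layout used in the answer above.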

Convert weekday name string into datetime

I have the following date (stored as object dtype): Tue 31 Jan in a pandas Series,
and I am trying to change it into: 31/01/2019
How can I achieve this? I understand more or less that pandas.to_datetime can easily convert a date string when it is clearer (like 6/1/1930 22:00), but not in my case, when there is a weekday name.
Thank you for your help.
Concat the year and call pd.to_datetime with a custom format:
s = pd.Series(['Tue 31 Jan', 'Mon 20 Feb',])
pd.to_datetime(s + ' 2019', format='%a %d %b %Y')
0 2019-01-31
1 2019-02-20
dtype: datetime64[ns]
This is fine as long as all your dates follow this format. If that is not the case, this cannot be solved reliably.
More information on datetime formats at strftime.org.
Another option is using the 3rd party dateutil library:
import dateutil.parser
s.apply(dateutil.parser.parse)
0 2018-01-31
1 2018-02-20
dtype: datetime64[ns]
This can be installed from PyPI (the package name is python-dateutil).
Another, slower but more flexible option is to use the third-party datefinder library to sniff dates from strings containing arbitrary text (if this is what you need):
import datefinder
s.apply(lambda x: next(datefinder.find_dates(x)))
0 2018-01-31
1 2018-02-20
dtype: datetime64[ns]
You can install it from PyPI.
Convert to a datetime object
If you wanted to use the datetime module, you could get the year by doing the following:
import datetime as dt
d = dt.datetime.strptime('Tue 31 Jan', '%a %d %b').replace(year=dt.datetime.now().year)
This takes the date in your format and replaces the default year 1900 with the current year.
This is similar to the other answers, but uses the built-in replace method instead of concatenating strings.
Output
To get the desired output from your new datetime object, you could perform the following:
>>> d.strftime('%d/%m/%Y')
'31/01/2018'
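If the year should not depend on when the code runs, it can be passed in explicitly. The sketch below (weekday_string_to_dmy is a hypothetical helper) appends the year before parsing. Note that strptime reads the weekday name but does not verify it against the calendar, so a mismatched weekday goes unnoticed.

```python
from datetime import datetime

def weekday_string_to_dmy(text, year):
    """Parse e.g. 'Tue 31 Jan' with an explicit year and return 'dd/mm/yyyy'."""
    parsed = datetime.strptime(f"{text} {year}", "%a %d %b %Y")
    return parsed.strftime("%d/%m/%Y")

print(weekday_string_to_dmy("Tue 31 Jan", 2019))

# The weekday in the input is not validated: 31 Jan 2019 was a Thursday,
# yet parsing 'Tue 31 Jan 2019' succeeds and yields that Thursday.
mismatch = datetime.strptime("Tue 31 Jan 2019", "%a %d %b %Y")
print(mismatch.strftime("%a %d/%m/%Y"))
```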
Here are two alternative ways to achieve the same result.
Method 1: Using datetime module
from datetime import datetime
datetime_object = datetime.strptime('Tue 31 Jan', '%a %d %b')
print(datetime_object) # outputs 1900-01-31 00:00:00
If you had supplied the year, as in Tue 31 Jan 2018, then this code would work:
from datetime import datetime
datetime_object = datetime.strptime('Tue 31 Jan 2018', '%a %d %b %Y')
print(datetime_object) # outputs 2018-01-31 00:00:00
To print the resulting date in a format like 31/01/2019, you can use:
print(datetime_object.strftime("%d/%m/%Y")) # outputs 31/01/2018
Here are all the possible formatting options available with datetime object.
Method 2: Using dateutil.parser
This method automatically fills in the missing year with the current year.
from dateutil import parser
string = "Tue 31 Jan"
date = parser.parse(string)
print(date) # outputs 2018-01-31 00:00:00

pandas read_csv parse foreign dates

I am trying to use read_csv on a .csv file that contains a date column. The problem is that the date column is in a foreign language (Romanian), with entries like:
'26 septembrie 2017'
'13 iulie 2017'
etc. How can I parse this nicely into a pandas dataframe which has a US date format?
You can pass a converter for that column:
df = pd.read_csv(myfile, converters={'date_column': foreign_date_converter})
But first you have to define the converter to do what you want. This approach uses locale manipulation:
import datetime
import locale
def foreign_date_converter(text):
    # Temporarily switches the locale to "ro_RO" to parse Romanian dates properly
    # (not thread-safe; the "ro_RO" locale must be installed on the system)
    loc = locale.getlocale(locale.LC_TIME)
    locale.setlocale(locale.LC_TIME, 'ro_RO')
    try:
        # %B matches the full month name, e.g. "septembrie"
        return datetime.datetime.strptime(text, '%d %B %Y').date()
    finally:
        locale.setlocale(locale.LC_TIME, loc)  # restores locale
Use the dateparser module:
import dateparser
df = pd.read_csv('yourfile.csv', parse_dates=['date'], date_parser=dateparser.parse)
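Enter your date column name in the parse_dates parameter; I'm just assuming it is date. (In pandas 2.0 and later the date_parser argument is deprecated; an alternative is to read the column as text and apply dateparser.parse to it afterwards.)
You may get output like this: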
date
0 2017-09-26
1 2017-07-13
If you want to change the format, use strftime:
df['date'] = df['date'].dt.strftime('%d %B %Y')
output:
date
0 26 September 2017
1 13 July 2017
The easiest solution would be to simply call str.replace(old, new) 12 times.
It is not pretty, but you only have to build the function once:
def translater(date_string_with_exactly_one_date):
    date_str = date_string_with_exactly_one_date
    date_str = date_str.replace("iulie", "july")
    date_str = date_str.replace("septembrie", "september")
    # do this 10 more times with the right translation
    return date_str
Now you just have to call it for every entry. After that you can handle it like a US date string. This is not very efficient, but it gets the job done and you do not have to look for special libraries.
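A lookup table is a variation on the same idea that skips the intermediate English string and maps the Romanian month name straight to its number. The sketch below assumes every entry has the shape "day month-name year"; RO_MONTHS and parse_romanian_date are illustrative names.

```python
from datetime import date

# Romanian month names mapped to month numbers
RO_MONTHS = {
    "ianuarie": 1, "februarie": 2, "martie": 3, "aprilie": 4,
    "mai": 5, "iunie": 6, "iulie": 7, "august": 8,
    "septembrie": 9, "octombrie": 10, "noiembrie": 11, "decembrie": 12,
}

def parse_romanian_date(text):
    """Parse 'day month-name year', e.g. '26 septembrie 2017', into a date."""
    day, month_name, year = text.split()
    return date(int(year), RO_MONTHS[month_name.lower()], int(day))

for raw in ["26 septembrie 2017", "13 iulie 2017"]:
    # US-style month/day/year output
    print(parse_romanian_date(raw).strftime("%m/%d/%Y"))
```

A function like this could also be passed to read_csv through the converters argument, in the same way as the converter shown earlier.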
