I have some text, taken from different websites, that I want to extract dates from. As one can imagine, the dates vary substantially in how they are formatted, and look something like:
Posted: 10/01/2014
Published on August 1st 2014
Last modified on 5th of July 2014
Posted by Dave on 10-01-14
What I want to know is whether anyone knows of a Python library [or API] which would help with this (other than e.g. regex, which will be my fallback). I could probably remove the "posted on" parts relatively easily, but getting the other stuff consistent does not look easy.
My solution using dateutil
Following Lukas's suggestion, I used the dateutil package (it seemed far more flexible than Arrow) with its fuzzy option (fuzzy=True), which basically ignores anything that is not part of a date.
Caution on Fuzzy parsing using dateutil
The main thing to note, as discussed in the thread "Trouble in parsing date using dateutil", is that if the parser cannot find a day, month or year, it fills in a default value (the current day, unless specified), and as far as I can tell no flag is reported to indicate that it took the default.
This would result in "random text" returning today's date of 2015-4-16 which could have caused problems.
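The silent fill-in is easy to reproduce. A minimal sketch, assuming the python-dateutil package is installed; the sample text and the default date are made up for illustration:

```python
from datetime import datetime
from dateutil.parser import parse

# The text names a month and a year, but no day.
text = "Published in August 2014"

# Hypothetical default: the day (15) is borrowed from it without any warning.
result = parse(text, fuzzy=True, default=datetime(2000, 1, 15))
print(result)  # 2014-08-15 00:00:00
```

Nothing in the returned datetime says that the 15 came from the default rather than from the text.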
Solution
Since I really want to know when it fails, rather than have the date filled in with a default value, I ended up parsing twice with two different defaults and checking whether a field took the default in both runs. If it did not, I assumed that field had been parsed correctly.
from datetime import datetime
from dateutil.parser import parse

def extract_date(text):
    # Parse twice with different defaults: a field that follows the
    # default in both runs was not present in the text.
    date = {}
    date_1 = parse(text, fuzzy=True, default=datetime(2001, 1, 1))
    date_2 = parse(text, fuzzy=True, default=datetime(2002, 2, 2))
    if date_1.day == 1 and date_2.day == 2:
        date["day"] = "XX"
    else:
        date["day"] = date_1.day
    if date_1.month == 1 and date_2.month == 2:
        date["month"] = "XX"
    else:
        date["month"] = date_1.month
    if date_1.year == 2001 and date_2.year == 2002:
        date["year"] = "XXXX"
    else:
        date["year"] = date_1.year
    return date

print(extract_date("Posted: by dave August 1st"))
Obviously this is a bit of a botch (so if anyone has a more elegant solution, please share), but it correctly parsed the four examples I had above [where it assumed US format for the date 10/01/2014 rather than UK format], and returned XX appropriately when data was missing.
You could use Arrow library:
import arrow
arrow.get('2013-05-05 12:30:45', ['MM/DD/YYYY', 'YYYY-MM-DD HH:mm:ss'])
Two arguments, first a str to parse and second a list of formats to try.
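The same try-each-format idea can be sketched with only the standard library. This is an illustration, not Arrow's implementation: the helper name parse_with_formats and the format list are made up, and the formats assume US month-first order.

```python
from datetime import datetime

def parse_with_formats(value, formats):
    """Return the first successful strptime parse, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None

# Hypothetical format list covering two of the question's examples.
FORMATS = ["%m/%d/%Y", "%m-%d-%y"]

print(parse_with_formats("10/01/2014", FORMATS))   # 2014-10-01 00:00:00
print(parse_with_formats("10-01-14", FORMATS))     # 2014-10-01 00:00:00
print(parse_with_formats("random text", FORMATS))  # None
```

Unlike fuzzy parsing, this returns None instead of a default when nothing matches, but the surrounding words ("Posted:", "Published on") have to be stripped first.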
Related
Within a dataset, there are several different datetime strings.
For example:
2020-11-16T06:00:00Z
2020-11-16T06:00:00+01:00
2020-11-16T06:00:00+01:00Z
2020-11-16T06:00:00+02:00
2020-11-16T06:00:00.000Z
I thought about replacing everything after the seconds, but that gives me errors when, for example, +01:00 isn't there in the first place.
Do you have any clue how to handle this?
It would be absolutely enough if I could get:
%Y-%m-%dT%H:%M
(The basics of strptime and strftime are known...)
I've racked my brain all night over this problem.
I hope one of you has a solution...
Thank you in advance!
The third-party dateutil package (installed as python-dateutil; it is not part of the standard library) deals with this problem:
import dateutil.parser
examples = [
'2020-11-16T06:00:00Z',
'2020-11-16T06:00:00+01:00',
'2020-11-16T06:00:00+01:00Z',
'2020-11-16T06:00:00+02:00',
'2020-11-16T06:00:00.000Z'
]
for e in examples:
    try:
        print(dateutil.parser.parse(e))
    except ValueError:
        print(f'Invalid datetime: {e}')
Result:
2020-11-16 06:00:00+00:00
2020-11-16 06:00:00+01:00
Invalid datetime: 2020-11-16T06:00:00+01:00Z
2020-11-16 06:00:00+02:00
2020-11-16 06:00:00+00:00
@Z4-tier also has a solution for your examples (be careful with just leaving off the end of the string, though), but dateutil will also deal with more exotic stuff:
print(dateutil.parser.parse('15:45 16 Nov 2020'))
Result:
2020-11-16 15:45:00
Also note this:
print(dateutil.parser.parse(e).tzinfo)
If you add that, you'll see that dateutil includes the information about the time zone in the result, which would be lost if you only parse the first part of the strings.
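If the offset matters, the five example strings can also be normalised to UTC with the standard library alone. A rough sketch under two stated assumptions: a string without an offset is treated as UTC, and in the odd "+01:00Z" case the numeric offset wins. The name to_utc and the pattern are made up for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# Date-time, optional fractional seconds, optional +HH:MM offset, optional Z.
ISO_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?([+-]\d{2}:\d{2})?Z?$"
)

def to_utc(raw):
    match = ISO_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"Unrecognised timestamp: {raw!r}")
    local = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    offset = match.group(2)
    if offset is None:
        tz = timezone.utc  # assumption: no explicit offset means UTC
    else:
        sign = -1 if offset[0] == "-" else 1
        delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6]))
        tz = timezone(sign * delta)
    # Attach the offset, then convert so every result is comparable.
    return local.replace(tzinfo=tz).astimezone(timezone.utc)

print(to_utc("2020-11-16T06:00:00Z"))        # 2020-11-16 06:00:00+00:00
print(to_utc("2020-11-16T06:00:00+01:00Z"))  # 2020-11-16 05:00:00+00:00
```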
How about this:
import datetime
dates = ['2020-11-16T06:00:00Z',
'2020-11-16T06:00:00+01:00',
'2020-11-16T06:00:00+01:00Z',
'2020-11-16T06:00:00+02:00',
'2020-11-16T06:00:00.000Z']
for d in dates:
    # Keep only 'YYYY-MM-DDTHH:MM:SS' and parse that part.
    print(datetime.datetime.fromisoformat(d[0:19]))
Since each date has the same format up to the offset and timezone, just strip that part off of the string and cast it to a datetime.datetime.
The dataset is generated by a scraper, which scrapes news pages.
Therefore the datetime is scraped as string in the first place, so that the conversion has to take place for several different occurrences, before a strptime can be executed.
I found a solution for my problem, which was influenced by all of the approaches of you guys.
date = '2020-11-16T06:00:00+01:00'
splitat = 19
date = date[:splitat]
date
This resulted in the standardized format, which I needed:
'2020-11-16T06:00:00'
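To go one step further and get the %Y-%m-%dT%H:%M form asked for in the question, the sliced prefix can be validated with strptime and re-formatted. A small sketch; the function name to_minute_string is made up, and note that it throws the offset away, so 06:00+01:00 and 06:00+02:00 come out identical.

```python
from datetime import datetime

def to_minute_string(raw):
    # strptime raises ValueError if the first 19 characters are not
    # a valid 'YYYY-MM-DDTHH:MM:SS' timestamp, so bad rows are not passed on.
    parsed = datetime.strptime(raw[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y-%m-%dT%H:%M")

print(to_minute_string("2020-11-16T06:00:00+01:00"))  # 2020-11-16T06:00
print(to_minute_string("2020-11-16T06:00:00.000Z"))   # 2020-11-16T06:00
```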
I need to return the date format from a string. Currently I am using a parser to parse the string as a date, then replacing the year with yyyy or yy, and similarly for the other date items. Is there some function I could use that would return mm-dd-yyyy when I send 12-05-2018?
Technically, it is an impossible question. If you send in 12-05-2018, there is no way for me to know whether you are sending in a mm-dd-yyyy (Dec 5, 2018) or dd-mm-yyyy (May 12, 2018).
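The ambiguity can be made visible by listing every format a string is consistent with. A standard-library sketch; the helper name matching_formats and the three candidate formats are illustrative choices:

```python
from datetime import datetime

def matching_formats(value, formats=("%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d")):
    """Return every candidate format that strptime accepts for value."""
    matches = []
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue  # value is not a valid date in this format
        matches.append(fmt)
    return matches

print(matching_formats("12-05-2018"))  # ['%m-%d-%Y', '%d-%m-%Y']
print(matching_formats("31-12-2018"))  # ['%d-%m-%Y']
```

Only when the list has exactly one entry does the string itself settle the format.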
One approach might be to do a regex replacement of anything which matches your expected date pattern, e.g.
import re
date = "Here is a date: 12-05-2018 and here is another one: 10-31-2010"
date_masked = re.sub(r'\b\d{2}-\d{2}-\d{4}\b', 'mm-dd-yyyy', date)
print(date)
print(date_masked)
Here is a date: 12-05-2018 and here is another one: 10-31-2010
Here is a date: mm-dd-yyyy and here is another one: mm-dd-yyyy
Of course, the above script makes no effort to check whether the dates are actually valid. If you require that, you may use one of the date libraries available in Python.
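If validity does matter, one option is to pass a callback to re.sub and mask only the matches that strptime accepts. A sketch that assumes month-first dates, as in the example above; the name mask_valid_dates is made up:

```python
import re
from datetime import datetime

def mask_valid_dates(text):
    def _mask(match):
        try:
            # Assumption: dates are month-first (mm-dd-yyyy).
            datetime.strptime(match.group(0), "%m-%d-%Y")
        except ValueError:
            return match.group(0)  # not a real date, leave it untouched
        return "mm-dd-yyyy"
    return re.sub(r"\b\d{2}-\d{2}-\d{4}\b", _mask, text)

print(mask_valid_dates("ok: 12-05-2018, bad: 13-45-2018"))
# ok: mm-dd-yyyy, bad: 13-45-2018
```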
I don't really understand what you plan to do with the format. There are two reasons I can think of why you might want it. (1) You want at some future point to convert a normalized datetime back into the original string. If that is what you want, you would be better off just storing the normalized datetime and the original string. Or (2) you want to draw (dodgy) conclusions about the person sending the data, because different nationalities will tend to use different formats. But, whatever you want it for, you can do it this way:
from dateutil import parser

def get_date_format(date_input):
    date = parser.parse(date_input)
    for date_format in ("%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d"):
        # You can extend the list above to include formats with %y in addition to %Y, etc.
        if date.strftime(date_format) == date_input:
            return date_format
>>> date_input = "12-05-2018"
>>> get_date_format(date_input)
'%m-%d-%Y'
You mention in a comment you are prepared to make assumptions about ambiguous dates like 12-05-2018 (could be May or December) and 05-12-18 (could be 2018 or 2005). You can pass those assumptions to dateutil.parser.parse. It accepts boolean keyword parameters dayfirst and yearfirst which it will use in ambiguous cases.
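For illustration, this is how those two flags change the result for the ambiguous strings mentioned above (a short sketch, assuming python-dateutil is installed):

```python
from dateutil.parser import parse

# Default behaviour: month first, year last.
print(parse("12-05-2018"))                 # 2018-12-05 00:00:00
# dayfirst=True reads the same string as 12 May.
print(parse("12-05-2018", dayfirst=True))  # 2018-05-12 00:00:00

# Two-digit year: by default the last number is the year...
print(parse("05-12-18"))                   # 2018-05-12 00:00:00
# ...while yearfirst=True takes the first number as the year.
print(parse("05-12-18", yearfirst=True))   # 2005-12-18 00:00:00
```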
Take a look at the datetime library. There you will find the function strptime(), which is exactly what you are looking for.
Here is the documentation: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
I have an input which may contain a date, or time, or both. In case the date is provided I can use that date but in case the date is not provided (i.e. only the time is provided) then I have to use a different date provided from someplace else.
But once I parse the input using Python's dateutil, today's date is added to the parsed value. For example:
from dateutil.parser import parse
print(parse('03:15:08'))
print(parse('04-04-2019 03:15:08'))
the above code gives the following output:
2019-04-04 03:15:08
2019-04-04 03:15:08
As you can see the information (that the date was provided or not) is lost and can't be differentiated.
Also some kind of length manipulation may not work because only the date may be provided.
How to differentiate between the 2 inputs?
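One possible way to tell the two inputs apart is to parse twice with two different default dates: if the date part of the result changes with the default, the text did not contain a full date. A sketch only, assuming python-dateutil is installed; the helper name parse_with_fallback and the two default dates are arbitrary choices.

```python
from datetime import date, datetime
from dateutil.parser import parse

# Two arbitrary defaults that differ in year, month and day.
DEFAULT_A = datetime(2001, 1, 1)
DEFAULT_B = datetime(2002, 2, 2)

def parse_with_fallback(text, fallback_date):
    first = parse(text, default=DEFAULT_A)
    second = parse(text, default=DEFAULT_B)
    if first.date() == second.date():
        return first  # the full date came from the text itself
    # Some date field followed the default, so use the external date instead.
    return datetime.combine(fallback_date, first.time())

print(parse_with_fallback("03:15:08", date(2020, 1, 31)))
# 2020-01-31 03:15:08
print(parse_with_fallback("04-04-2019 03:15:08", date(2020, 1, 31)))
# 2019-04-04 03:15:08
```

A partially specified date such as "April 2019" is also treated as missing here; whether that is the right call depends on the data.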
Thank you.
I have a large sales database the first column of which is the purchase date. The problem is some of these dates are entered in DD.MM.YY format, some in YY.MM.DD and some in YYYY/MM/DD. I want to make them all to same format. What is the cleanest way I can do this?
Note 1: I'm thinking of doing a series of ifs but that would be a lot of conditions so I'm wondering if there is a cleaner shortcut.
Note 2: An additional complication is that the dates are in the Jalaali calendar and not Gregorian. I have a function that will convert them to Gregorian, but I need to pass the correct year, month and day arguments to it; this is why I want to bring them all to a single format. But additionally, this means that if you offer some "Gregorian-only" solutions, like dateutil.parser, they might not work.
Immediately after posting this I found/thought of a solution myself, but instead of deleting the question I decided to post the answer in case someone else comes across a similar problem.
tl;dr - I just added a century option to dateutil.parser. I didn't know how to at first, but I found this.
Here's my end code:
from khayyam import JalaliDate
from dateutil.parser import parse, parserinfo

class MyParserInfo(parserinfo):
    def convertyear(self, year, *args, **kwargs):
        # Two-digit years belong to the 1300s of the Jalali calendar.
        if year < 100:
            year += 1300
        return year

if __name__ == '__main__':
    dt = parse("9.12.96", MyParserInfo()).date()
    a = JalaliDate(dt.year, dt.month, dt.day).todate()
    print(dt)
    print(a)
#1396-09-12
#2017-12-03
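An alternative that avoids a Gregorian parser altogether is to split the string and decide where the year is. This is a rough sketch under one loud assumption: two-digit years are greater than 31 (true for 1332 to 1399), so they cannot be confused with a day. The function name split_jalali is made up; the 1300 offset mirrors convertyear above.

```python
import re

def split_jalali(raw):
    """Return (year, month, day) from DD.MM.YY, YY.MM.DD or YYYY/MM/DD text."""
    parts = re.split(r"[./]", raw.strip())
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"Unrecognised date: {raw!r}")
    first, middle, last = (int(p) for p in parts)
    if len(parts[0]) == 4 or first > 31:
        year, day = first, last   # year comes first
    elif last > 31:
        year, day = last, first   # year comes last
    else:
        raise ValueError(f"Ambiguous date: {raw!r}")
    if year < 100:
        year += 1300              # same century rule as convertyear above
    return year, middle, day

print(split_jalali("12.09.96"))    # (1396, 9, 12)
print(split_jalali("96.09.12"))    # (1396, 9, 12)
print(split_jalali("1396/09/12"))  # (1396, 9, 12)
```

The resulting tuple can be passed straight to whatever Jalali-to-Gregorian function is already in use.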
I am currently writing a program which uses the Companies House API to return a JSON file containing information about a certain company.
I am able to retrieve the data easily using the following commands:
r = requests.get('https://api.companieshouse.gov.uk/company/COMPANY-NO/filing-history', auth=('API-KEY', ''))
data = r.json()
With that information I can do an awful lot; however, I've run into a problem which I was hoping you guys could possibly help me with. What I aim to do is go through every nested entry in the JSON file and check whether the values of certain keys match certain criteria; if the values of 2 keys match, then other code is executed.
One of the keys is the date of an entry, and I would like to ignore results that are older than a certain date, I have attempted to do this with the following:
date_threshold = datetime.date.today() - datetime.timedelta(days=30)
for each in data["items"]:
    date = ['date']
    type = ['type']
    if date < date_threshold and type is "RM01":
        print("wwwwww")
In case it isn't clear, what I'm attempting to do (albeit very badly) is assign each of the entries to a variable, which then gets tested against certain criteria.
However, this doesn't work; Python raises a type error:
TypeError: unorderable types: list() < datetime.date()
Which makes me think the date is being stored as a string, and so I can't compare it to the datetime value set earlier, but when I check the API documentation (https://developer.companieshouse.gov.uk/api/docs/company/company_number/filing-history/filingHistoryItem-resource.html), it says clearly that the 'date' entry is returned as a date type.
What am I doing wrong? It's very clear that I'm extremely new to Python given what I presume is the atrocity of my code, but in my head it seems to make at least a little sense. In case none of this is clear, I basically want to go through all the entries in the JSON file, and if the date and type match a certain description, then other code can be executed (in this case I have just used random text).
Any help is greatly appreciated! Let me know if you need anything cleared up.
:)
EDIT
After tweaking my code to the below:
for each in data["items"]:
    date = each['date']
    type = each['type']
    if date is '2016-09-15' and type is "RM01":
        print("wwwwww")
The code executes without any errors, but the words aren't printed, even though I know there is an entry in the json file with that exact date, and that exact type, any thoughts?
SOLUTION:
Thanks to everyone for helping me out. I had made a couple of very basic errors; the code that works as expected is below:
for each in data["items"]:
    date = each['date']
    typevariable = each['type']
    if date == '2016-09-15' and typevariable == "RM01":
        print("wwwwww")
This prints the word "wwwwww" 3 times, which is correct seeing as there are 3 entries in the JSON that fulfil those criteria.
You need to first convert your date variable to a datetime type using datetime.strptime().
You are comparing date, which is a list (['date']), with date_threshold, which is a datetime.date.
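Putting both answers together, the threshold check from the question might look like the sketch below. It assumes the API returns dates as "YYYY-MM-DD" strings, which the '2016-09-15' comparison in the question suggests; the sample list and the name recent_items are made up, and the filter keeps entries on or after the threshold, following the stated aim of ignoring older results.

```python
import datetime

def recent_items(items, wanted_type, threshold):
    matches = []
    for item in items:
        # strptime turns the "YYYY-MM-DD" string into a real date object.
        item_date = datetime.datetime.strptime(item["date"], "%Y-%m-%d").date()
        # Use == to compare values; `is` tests object identity.
        if item_date >= threshold and item["type"] == wanted_type:
            matches.append(item)
    return matches

# Made-up sample shaped like data["items"] in the question.
sample = [
    {"date": "2016-09-15", "type": "RM01"},
    {"date": "2016-07-01", "type": "RM01"},
    {"date": "2016-09-20", "type": "AA"},
]
# A fixed reference date keeps the example reproducible.
threshold = datetime.date(2016, 9, 30) - datetime.timedelta(days=30)
print(recent_items(sample, "RM01", threshold))
# [{'date': '2016-09-15', 'type': 'RM01'}]
```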