Convert date and time from scraped text to datetime format - python

I'm making a news aggregator using Python and Scrapy and cannot find an answer for exactly what I'm trying to do.
I am scraping a line of text from an article, a publish time, like so:
item['published'] = hxs.select('//div[@class="date"]/text()').extract()
This is what I'm getting back (there is no ISO date on the site, as there are some of the others I'm scraping for this project):
Last Updated: Tuesday, March 11, 2014
I need to put these dates and times into one common format, so that I can convert other sources' publish times to it as well and later order the items chronologically via that key in the JSON feed.
So with a date in that format, how can I convert it to a usable form? I'd like in the end to have all the ISO dates and those written-out text formats converted to something like this:
Published: 2:15 p.m., March 15, 2014.

I think you want to use dateutil.parser.parse. Here's the documentation. It handles a variety of formats. On debian-style OSes, it's available in the package python-dateutil.
If this answer doesn't fully answer your question, please comment and I'll try to update it appropriately.

Edit: jrennie's solution above is way cleaner than mine.
This works. I use strptime in order to get a solution. Note, since there is no hh:mm data in the original string, I can't output any hh:mm data like you did in your example.
Step by step solution:
>>> import time
>>> t = "Last Updated: Tuesday, March 11, 2014"
>>> t = t.rsplit(' ',4)[1:5] # Get a list of the relevant date fields
['Tuesday,', 'March', '11,', '2014']
>>> t = ' '.join(t) # Turn t into a string so we can use strptime
'Tuesday, March 11, 2014'
>>> t = time.strptime(t, "%A, %B %d, %Y") # Use strptime
time.struct_time(tm_year=2014, tm_mon=3, tm_mday=11, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=1, tm_yday=70, tm_isdst=-1)
One liner:
import time
t = "Last Updated: Tuesday, March 11, 2014"
time.strptime(' '.join(t.rsplit(' ',4)[1:5]), "%A, %B %d, %Y")
This results in a struct_time. You may end up wanting to convert these to datetimes, depending on how you wish to manipulate them.
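If a datetime is the goal, a minimal sketch along the same lines is to call datetime.strptime directly. The parse_published helper and the PREFIX constant are names made up for this illustration, and the code assumes the scraped text always carries the "Last Updated: " label shown in the question and uses English day and month names.

```python
from datetime import datetime

PREFIX = "Last Updated: "  # assumed fixed label, taken from the sample string


def parse_published(raw):
    """Return a datetime for text such as 'Last Updated: Tuesday, March 11, 2014'."""
    text = raw.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    # %A and %B match English day and month names in the default C locale.
    return datetime.strptime(text, "%A, %B %d, %Y")


published = parse_published("Last Updated: Tuesday, March 11, 2014")
iso_key = published.isoformat()  # sortable string, usable as a key in the JSON feed
```

ISO 8601 strings sort chronologically as plain text, which is why isoformat() is a convenient choice for the ordering key.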

Today a good way to do that is to use the dateparser project from the scrapy team: https://github.com/scrapinghub/dateparser

Related

How to convert a string of date and time in words to numbers?

I have the following string: January 1, 2020 / 12:04 AM / a month ago. How do I convert this into 1/1/2020 0:04:00? The code should ignore the a month ago. How do I do this?
You could use dateutil. It is a third-party extension of the datetime module. You can add it with
python -m pip install python-dateutil
from dateutil.parser import parse
data = 'January 1, 2020 / 12:04 AM / a month ago'
resp = parse(data, fuzzy_with_tokens=True)
print(resp[0]) # the first index is datetime object
The parser is relatively powerful. Here is documentation to Parser.
dateutil is one among many libraries that can help you solve your problem. A good summary of tools such as maya and arrow can be found on Stackabuse, thanks to @WasabiMonster.
You can also do it with strptime; below is a sample I was using in my code:
import datetime
datetime.datetime.strptime(
    "April 18, 2019, 09:09:09", "%B %d, %Y, %H:%M:%S"
)
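Applied to the string from the question, a strptime-only sketch might look like the following. The parse_headline_stamp helper is a hypothetical name, and the code assumes the three parts are always separated by slashes, with the date first and the time second.

```python
from datetime import datetime


def parse_headline_stamp(raw):
    """Parse 'January 1, 2020 / 12:04 AM / a month ago', ignoring the trailing part."""
    date_part, time_part = [piece.strip() for piece in raw.split("/")[:2]]
    # %I together with %p handles the 12-hour clock, so 12:04 AM becomes 00:04.
    return datetime.strptime(f"{date_part} {time_part}", "%B %d, %Y %I:%M %p")


dt = parse_headline_stamp("January 1, 2020 / 12:04 AM / a month ago")
# Build the unpadded form by hand; strftime flags for dropping zeros vary by platform.
formatted = f"{dt.month}/{dt.day}/{dt.year} {dt.hour}:{dt:%M:%S}"
```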

Extract timestamp from a given string using python

I tried multiple packages to extract timestamp from a given string, but no package gives correct results. I did use dateutils, datefinder, parsedatetime, etc. for this task. They extract some datetimes which are in certain formats but not all formats, sometimes they extract some unwanted numbers also as timestamps.
Is there any python package which extracts datetime from a given string.
Assume, I have 2 strings like these:
scala> val xorder= new order(1,"2016-02-22 00:00:00.00",100,"COMPLETED")
and
Fri, 10 Jun 2011 11:04:17 +0200 (CEST)
and want to extract only the datetime. Is there any function which extracts both formats of datetime from the strings above? In other cases the formats may be different; it should still pick out the datetime strings.
You can use the datetime function strptime() as follows:
from datetime import datetime
dt = datetime.strptime("21/11/06 16:30", "%d/%m/%y %H:%M")
You can create your own formatting and use the function as well.
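One way to combine your own formats with extraction from a longer string is to pair each strptime format with a regular expression that locates the candidate text. This is only a sketch covering the two sample strings from the question; the PATTERNS list and the extract_datetimes function are illustrative names, and every further layout needs its own entry.

```python
import re
from datetime import datetime

# (regex, strptime format) pairs; extend this list for other layouts.
PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,6}"),
     "%Y-%m-%d %H:%M:%S.%f"),
    (re.compile(r"\d{1,2} [A-Za-z]{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4}"),
     "%d %b %Y %H:%M:%S %z"),
]


def extract_datetimes(text):
    """Return every datetime found in text using the known patterns."""
    found = []
    for regex, fmt in PATTERNS:
        for match in regex.finditer(text):
            try:
                found.append(datetime.strptime(match.group(0), fmt))
            except ValueError:
                continue  # looked like a date but was not a valid one
    return found


first = extract_datetimes('scala> val xorder= new order(1,"2016-02-22 00:00:00.00",100,"COMPLETED")')
second = extract_datetimes("Fri, 10 Jun 2011 11:04:17 +0200 (CEST)")
```

The regex keeps unrelated numbers from being picked up as timestamps, while strptime rejects matches that are not real dates.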
I created a small Python package, datetime_extractor, to pull out timestamps from given strings. It can extract many datetime formats. Hope it will be useful.
pip install datetime-extractor
from datetime_extractor import DateTimeExtractor
import re
samplestring1 = 'scala> val xorder= new order(1,"2016-02-22 00:00:00.00",100,"COMPLETED")'
DateTimeExtractor(samplestring1)
Out: ['2016-02-22 00:00:00.00']
samplestring2 = 'Fri, 10 Jun 2011 11:04:17 +0200 (CEST)'
DateTimeExtractor(samplestring2)
Out: ['10 Jun 2011 11:04:17']
@Allan & @Manmeet Singh, let me know your comments.

Python: date from plaintext (foreign language) weekday

I am looking to retrieve the next possible date for a weekday contained in a string. Complexity being that this weekday will be in foreign language (sv_SE).
In bash I can solve this using `dateround`:
startdate=$(dateround --from-locale=sv_SE -z CET today $startday)
Highly appreciate your guidance on how to solve this in Python.
Thank you very much!
Dateparser has support for quite a few languages. You could parse the weekday to a datetime object then determine the next possible date available.
-- Edit --
from dateparser import parse
parse('Onsdag').isoweekday() # 3
Now that you have the iso weekday, you can find the next possible date. You can refer to this to see how.
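The last step can be sketched with the standard library alone. The SV_WEEKDAYS table and the next_date_for function are hypothetical names used only here, and the table stands in for the dateparser call when all inputs are plain Swedish weekday names. The sketch assumes "next" means strictly after the reference day.

```python
from datetime import date, timedelta

# Hypothetical lookup table for Swedish weekday names (ISO numbering, Monday = 1).
SV_WEEKDAYS = {
    "måndag": 1, "tisdag": 2, "onsdag": 3, "torsdag": 4,
    "fredag": 5, "lördag": 6, "söndag": 7,
}


def next_date_for(weekday_name, today=None):
    """Return the next date that falls on the named weekday."""
    today = today or date.today()
    target = SV_WEEKDAYS[weekday_name.strip().lower()]
    days_ahead = (target - today.isoweekday()) % 7
    # Assumption: "next" means strictly after today, so a gap of 0 becomes 7.
    return today + timedelta(days=days_ahead or 7)


wednesday = next_date_for("Onsdag", today=date(2017, 10, 31))  # a Tuesday
```

If the reference day itself should count as a valid result, drop the `or 7` fallback.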
It seems locale aliases are platform-specific and case-sensitive. I'm on Windows, so the locale will be sv_SE.
You can use Babel for date/time conversion; it is much more comprehensive than the native locale module.
Babel is an integrated collection of utilities that assist in internationalizing and localizing Python applications, with an emphasis on web-based applications.
Which can be installed as:
pip install Babel
Once installed, we can use the format_date, format_datetime, and format_time utilities to format a date or time in the conventions of another language.
You can use these utilities to format date/time data in English and Swedish.
>>> import datetime
>>> from babel.dates import format_date, format_datetime, format_time
# Here we get the current date and time in a datetime object
>>> now = datetime.datetime.now()
>>> now
datetime.datetime(2017, 10, 31, 9, 46, 32, 650000)
# We format the datetime object in English using babel
>>> format_date(now, locale='en')
u'Oct 31, 2017'
# We format the datetime object in Swedish using babel
>>> format_date(now, locale='sv_SE')
u'31 okt. 2017'
>>>

Identify that a string could be a datetime object

If I knew the format in which a string represents date-time information, then I can easily use datetime.datetime.strptime(s, fmt). However, without knowing the format of the string beforehand, would it be possible to determine whether a given string contains something that could be parsed as a datetime object with the right format string?
Obviously, generating every possible format string to do an exhaustive search is not a feasible idea. I also don't really want to write one function with many format strings hardcoded into it.
Does anyone have any thoughts on how this can be accomplished (perhaps some sort of regex?)?
What about fuzzyparsers:
Sample inputs:
jan 12, 2003
jan 5
2004-3-5
+34 -- 34 days in the future (relative to today's date)
-4 -- 4 days in the past (relative to today's date)
Example usage:
>>> from fuzzyparsers import parse_date
>>> parse_date('jun 17 2010') # my youngest son's birthday
datetime.date(2010, 6, 17)
Install with:
$ pip install fuzzyparsers
You can use parser from dateutil
Example usage:
from dateutil import parser
dt = parser.parse("Aug 28 1999 12:00AM")

Python datetime strptime() and strftime(): how to preserve the timezone information

See the following code:
import datetime
import pytz
fmt = '%Y-%m-%d %H:%M:%S %Z'
d = datetime.datetime.now(pytz.timezone("America/New_York"))
d_string = d.strftime(fmt)
d2 = datetime.datetime.strptime(d_string, fmt)
print d_string
print d2.strftime(fmt)
the output is
2013-02-07 17:42:31 EST
2013-02-07 17:42:31
The timezone information simply got lost in the translation.
If I switch '%Z' to '%z', I get
ValueError: 'z' is a bad directive in format '%Y-%m-%d %H:%M:%S %z'
I know I can use python-dateutil, but I just found it bizarre that I can't achieve this simple feature in datetime and have to introduce another dependency.
Part of the problem here is that the strings usually used to represent timezones are not actually unique. "EST" only means "America/New_York" to people in North America. This is a limitation in the C time API, and the Python solution is… to add full tz features in some future version any day now, if anyone is willing to write the PEP.
You can format and parse a timezone as an offset, but that loses daylight savings/summer time information (e.g., you can't distinguish "America/Phoenix" from "America/Los_Angeles" in the summer). You can format a timezone as a 3-letter abbreviation, but you can't parse it back from that.
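As a rough illustration of the offset round trip, here is a sketch that assumes Python 3, where strptime accepts %z (the ValueError in the question comes from Python 2). It uses a fixed-offset timezone from the standard library instead of pytz, so no daylight saving rules are attached to the result.

```python
from datetime import datetime, timedelta, timezone

fmt = "%Y-%m-%d %H:%M:%S %z"
eastern_standard = timezone(timedelta(hours=-5))  # fixed offset, no DST rules
d = datetime(2013, 2, 7, 17, 42, 31, tzinfo=eastern_standard)

d_string = d.strftime(fmt)             # the offset is written as -0500
d2 = datetime.strptime(d_string, fmt)  # parsed back into an aware datetime
```

The parsed value keeps the UTC offset, but nothing in it says which named zone produced that offset.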
If you want something that's fuzzy and ambiguous but usually what you want, you need a third-party library like dateutil.
If you want something that's actually unambiguous, just append the actual tz name to the local datetime string yourself, and split it back off on the other end:
d = datetime.datetime.now(pytz.timezone("America/New_York"))
dtz_string = d.strftime(fmt) + ' ' + "America/New_York"
d_string, tz_string = dtz_string.rsplit(' ', 1)
d2 = datetime.datetime.strptime(d_string, fmt)
tz2 = pytz.timezone(tz_string)
print dtz_string
print d2.strftime(fmt) + ' ' + tz_string
Or… halfway between those two, you're already using the pytz library, which can parse (according to some arbitrary but well-defined disambiguation rules) formats like "EST". So, if you really want to, you can leave the %Z in on the formatting side, then pull it off and parse it with pytz.timezone() before passing the rest to strptime.
Unfortunately, strptime() can only handle the timezone configured by your OS, and then only as a time offset, really. From the documentation:
Support for the %Z directive is based on the values contained in tzname and whether daylight is true. Because of this, it is platform-specific except for recognizing UTC and GMT which are always known (and are considered to be non-daylight savings timezones).
strftime() doesn't officially support %z.
You are stuck with python-dateutil to support timezone parsing, I am afraid.
Here is my answer in Python 2.7
Print current time with timezone
from datetime import datetime
import tzlocal # pip install tzlocal
print datetime.now(tzlocal.get_localzone()).strftime("%Y-%m-%d %H:%M:%S %z")
Print current time with specific timezone
from datetime import datetime
import pytz # pip install pytz
print datetime.now(pytz.timezone('Asia/Taipei')).strftime("%Y-%m-%d %H:%M:%S %z")
It will print something like
2017-08-10 20:46:24 +0800
Try this:
import pytz
import datetime
fmt = '%Y-%m-%d %H:%M:%S %Z'
d = datetime.datetime.now(pytz.timezone("America/New_York"))
d_string = d.strftime(fmt)
d2 = pytz.timezone('America/New_York').localize(d.strptime(d_string,fmt), is_dst=None)
print(d_string)
print(d2.strftime(fmt))
