Removing time stamp when converting date format with dateparser in scrapy - python

I am using dateparser in scrapy to convert the date format.
Original date format: Apr 16, 2019
After using dateparser: 2019-04-16 00:00:00
This is what I wanted to achieve. However, I would still like to remove the time from the date format, so in the end, I only have 2019-04-16. Unfortunately, I am not able to realize this.
This is my line of code:
import dateparser
...
def parse_site(self, response):
    def get_with_xpath(query):
        return response.xpath(query).get(default='').strip()

    yield {
        'date': dateparser.parse(get_with_xpath('//meta[@name="date"]/@content'))
    }
As I said, it works, but I would like to remove the time stamp. Any ideas?

dateparser.parse returns a datetime object representing the parsed date if it succeeds. You can use the strftime() method to format it without the time, as shown below:
dateparser.parse('Apr 16, 2019').strftime("%Y-%m-%d")

This library returns parsed values as datetime objects, so afterwards you can apply any datetime method to the result. For example, .date() drops the time part:
>>> import dateparser
>>> dateparser.parse("Apr 16, 2019")
datetime.datetime(2019, 4, 16, 0, 0)
>>> dateparser.parse("Apr 16, 2019").date()
datetime.date(2019, 4, 16)
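The two answers can be combined into a small helper; the following is only a sketch. The name `to_date_string` is hypothetical, and the `None` branch is there because `dateparser.parse` returns `None` when it cannot parse its input. To keep the sketch runnable without third-party packages, a hand-built `datetime` stands in for the value `dateparser.parse('Apr 16, 2019')` would return.

```python
from datetime import datetime
from typing import Optional


def to_date_string(parsed: Optional[datetime]) -> Optional[str]:
    """Return 'YYYY-MM-DD' for a datetime, or None if parsing failed upstream."""
    if parsed is None:
        return None
    # .date() drops the time part; isoformat() renders it as YYYY-MM-DD.
    return parsed.date().isoformat()


# Stand-in for the result of dateparser.parse('Apr 16, 2019')
parsed = datetime(2019, 4, 16, 0, 0)
print(to_date_string(parsed))  # 2019-04-16
print(to_date_string(None))    # None
```

In the spider, the yielded value would then be `to_date_string(dateparser.parse(...))`, which avoids an AttributeError on pages where the meta tag is missing.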

Related

Adding days to a ISO 8601 format date in Python

I need to add 3 hours to a date in ISO 8601 format in Python, for example "2022-09-21T22:31:59Z", because of a time zone difference. From the timestamp the API returns I only need the year, month and day, but with the +3 hour difference the date can roll over to the next day, as it does in this example. How can I handle this? I think the format is ISO 8601, but please correct me if I am wrong.
Example API response:
"createdDateTime": "2022-09-21T22:31:59Z"
What I need:
"createdDateTime": "2022-09-21T22:31:59Z" to "2022-09-22T01:31:59Z"
You can parse the string with strptime and add a timedelta:
from datetime import datetime, timedelta
parsed_date = datetime.strptime("2022-09-21T22:31:59Z", "%Y-%m-%dT%H:%M:%SZ")
updated_date = parsed_date + timedelta(hours=3)
print(updated_date)
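The snippet above prints the shifted value in the default `str()` form. As a rough sketch of getting back to the two shapes the question asks for (the full string and the date only), the same format string can be reused with `strftime`. The names `ISO_Z`, `shifted`, `as_string` and `date_only` are made up for this example.

```python
from datetime import datetime, timedelta

# The trailing "Z" is matched and written as a literal character here,
# so the parsed value is a naive datetime with no time zone attached.
ISO_Z = "%Y-%m-%dT%H:%M:%SZ"

parsed = datetime.strptime("2022-09-21T22:31:59Z", ISO_Z)
shifted = parsed + timedelta(hours=3)

as_string = shifted.strftime(ISO_Z)
date_only = shifted.date().isoformat()

print(as_string)  # 2022-09-22T01:31:59Z
print(date_only)  # 2022-09-22
```

Note that the shifted string still ends in "Z" even though it no longer describes UTC; that matches the output requested in the question but can mislead other consumers of the value.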
If you have a proper JSON string, you can parse it with json, extract the string value, parse that with datetime.fromisoformat into a datetime value, and then get the date from it:
import json
from datetime import datetime, timedelta, timezone
data = json.loads('{"createdDateTime": "2022-09-21T22:31:59Z"}')
isostr = data['createdDateTime'].replace('Z', '+00:00')
fulldate = datetime.fromisoformat(isostr)
fulldate.date()
-----
datetime.date(2022, 9, 21)
The replacement is necessary because fromisoformat doesn't accept a trailing Z before Python 3.11.
Adding 3 hours to fulldate returns 1 AM on the next day:
fulldate + timedelta(hours=3)
------
datetime.datetime(2022, 9, 22, 1, 31, 59, tzinfo=datetime.timezone.utc)
fulldate is in UTC. It can be converted to another timezone offset using astimezone
fulldate.astimezone(tz=timezone(timedelta(hours=3)))
---
datetime.datetime(2022, 9, 22, 1, 31, 59, tzinfo=datetime.timezone(datetime.timedelta(seconds=10800)))
Or in a more readable form:
fulldate.astimezone(tz=timezone(timedelta(hours=3))).isoformat()
---------------------------
'2022-09-22T01:31:59+03:00'
This is 1 AM on the next day, but with a +03:00 offset. It is still the same instant as 22:31:59 UTC.
It's also possible to just replace the UTC offset with another one, without changing the time values, using replace:
fulldate.replace(tzinfo=timezone(timedelta(hours=3))).isoformat()
----------------------------
'2022-09-21T22:31:59+03:00'
That's the original wall-clock time with a different offset. It is no longer the same instant as 2022-09-21T22:31:59Z; it is three hours earlier.
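The difference between astimezone and replace can be checked directly by comparing the results with the original value. This is a minimal sketch; the variable names `utc_time`, `plus_three`, `converted` and `relabelled` are chosen for the example only.

```python
from datetime import datetime, timedelta, timezone

utc_time = datetime(2022, 9, 21, 22, 31, 59, tzinfo=timezone.utc)
plus_three = timezone(timedelta(hours=3))

# astimezone keeps the instant and changes the wall-clock reading.
converted = utc_time.astimezone(plus_three)
# replace keeps the wall-clock reading and changes the instant.
relabelled = utc_time.replace(tzinfo=plus_three)

print(converted == utc_time)   # True
print(relabelled == utc_time)  # False
print(utc_time - relabelled)   # 3:00:00
print(converted.date())        # 2022-09-22
```

For the question's goal (the calendar date as seen three hours ahead of UTC), `converted.date()` is the value of interest.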

pdf.getDocumentInfo date format

I am using PyPDF2's function for extracting document info. The results look something like this, but I am unable to interpret the creation date format. What do the last few digits represent?
pdf.documentInfo
[Output]: {'/Creator': 'Rave (http://www.nevrona.com/rave)',
'/Producer': 'Nevrona Designs',
'/CreationDate': 'D:20060301072826' }
At one point I also saw this:
'/CreationDate': "D:20170920114835+02'00'"
How can I read or convert it into a normal, readable date-time format?
You can clean up the string and parse it like this:
from datetime import datetime
CreationDate = "D:20170920114835+02'00'"
dt = datetime.strptime(CreationDate.replace("'", ""), "D:%Y%m%d%H%M%S%z")
# UTC offset is set correctly:
print(dt)
# 2017-09-20 11:48:35+02:00
print(repr(dt))
# datetime.datetime(2017, 9, 20, 11, 48, 35, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))
...which I think is more straightforward than what the answer to this related question shows.
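Both strings from the question follow the pattern D:YYYYMMDDHHmmSS, optionally followed by a UTC offset written as +HH'mm'. The sketch below handles just those two shapes; `parse_pdf_date` is a hypothetical helper name, and other variants a PDF may contain (for example a shortened date) are deliberately not covered.

```python
from datetime import datetime


def parse_pdf_date(value: str) -> datetime:
    """Parse 'D:YYYYMMDDHHmmSS' with an optional +HH'mm' or -HH'mm' offset."""
    cleaned = value.replace("'", "")
    if cleaned.startswith("D:"):
        cleaned = cleaned[2:]
    if len(cleaned) > 14:
        # 14 digits plus an offset such as +0200: returns an aware datetime.
        return datetime.strptime(cleaned, "%Y%m%d%H%M%S%z")
    # No offset present: returns a naive datetime.
    return datetime.strptime(cleaned, "%Y%m%d%H%M%S")


print(parse_pdf_date("D:20060301072826"))         # 2006-03-01 07:28:26
print(parse_pdf_date("D:20170920114835+02'00'"))  # 2017-09-20 11:48:35+02:00
```

Returning a naive datetime when no offset is given is a design choice for this sketch: the string itself does not say which time zone applies, so none is assumed.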

normalizing JSON datestrings to UTC python

I have an important test that says "Calculate users that logged in during the month of April normalized to the UTC timezone."
Items look as such:
[ {u'email': u' ybartoletti@littel.biz',
   u'login_date': u'2014-05-08T22:30:57-04:00'},
  {u'email': u'woodie.crooks@kozey.com',
   u'login_date': u'2014-04-25T13:27:48-08:00'},
]
It seems to me that an item like 2014-04-13T17:12:20-04:00 means "April 13th, 2014, at 5:12:20 pm, 4 hours behind UTC". Should I then use strptime to convert it to a datetime (see "Converting JSON date string to python datetime") and adjust it by a timedelta of however many hours a regex grabs from the end of the string? I ask because some items have a + at the end instead of a -, like 2014-05-07T00:30:06+07:00
Thank you
It is probably best to use the dateutil and pytz packages for this purpose. dateutil.parser.parse turns the string into an aware datetime object, which pytz can then convert to the UTC timezone:
>>> s = '2014-05-08T22:30:57-04:00'
>>> import dateutil.parser
>>> import pytz
>>> pytz.UTC.normalize(dateutil.parser.parse(s))
datetime.datetime(2014, 5, 9, 2, 30, 57, tzinfo=<UTC>)
You can use arrow to easily parse a date with a time zone.
>>> import arrow
>>> a = arrow.get('2014-05-08T22:30:57-04:00').to('utc')
>>> a
<Arrow [2014-05-09T02:30:57+00:00]>
Get a datetime object or timestamp:
>>> a.datetime
datetime.datetime(2014, 5, 9, 2, 30, 57, tzinfo=tzutc())
>>> a.naive
datetime.datetime(2014, 5, 9, 2, 30, 57)
>>> a.timestamp  # a property in arrow < 1.0; call a.timestamp() in arrow 1.0 and later
1399602657
The following solution should be faster and avoids importing external libraries. The downside is that it will only work if the date strings are all guaranteed to have the specified format. If that's not the case, then I would prefer Simeon's solution, which lets dateutil.parser.parse() take care of any inconsistencies.
import datetime as dt

def parse_date(datestr):
    # Returns a naive datetime expressed in UTC.
    local = dt.datetime.strptime(datestr[:19], '%Y-%m-%dT%H:%M:%S')
    diff = dt.timedelta(hours=int(datestr[20:22]), minutes=int(datestr[23:]))
    if datestr[19] == '-':
        # Negative offset: local time is behind UTC, so add the offset back.
        return local + diff
    return local - diff
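On Python 3.7 or later, the standard library can also read the offset itself, which covers the original task of finding April logins in UTC. The following is a sketch only: `to_utc` and `logged_in_during` are hypothetical names, and the last two records are invented edge cases (not from the question) that fall in April locally but outside April in UTC.

```python
from datetime import datetime, timezone

logins = [
    {"email": "ybartoletti@littel.biz", "login_date": "2014-05-08T22:30:57-04:00"},
    {"email": "woodie.crooks@kozey.com", "login_date": "2014-04-25T13:27:48-08:00"},
    # Hypothetical edge cases, not from the question:
    {"email": "late@example.com", "login_date": "2014-04-30T22:00:00-04:00"},
    {"email": "early@example.com", "login_date": "2014-04-01T03:00:00+07:00"},
]


def to_utc(datestr: str) -> datetime:
    """Parse an ISO 8601 string with an offset and convert it to UTC."""
    return datetime.fromisoformat(datestr).astimezone(timezone.utc)


def logged_in_during(items, year: int, month: int) -> set:
    """Return the emails whose login falls in the given UTC year and month."""
    emails = set()
    for item in items:
        utc = to_utc(item["login_date"])
        if (utc.year, utc.month) == (year, month):
            emails.add(item["email"])
    return emails


print(sorted(logged_in_during(logins, 2014, 4)))  # ['woodie.crooks@kozey.com']
```

The two invented records show why the conversion has to happen before the month check: 22:00 on April 30 at -04:00 is already May 1 in UTC, and 03:00 on April 1 at +07:00 is still March 31 in UTC.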

Python: convert complicated date and time string to timestamp

I want to know how to convert this date format
"Thu 21st Aug '14, 4:58am"
to a timestamp with Python?
Another format that I need to convert:
"Yesterday, 7:22am"
I tried parse util without success...
If you haven't done so already, have a look at the parse function in dateutil.parser for parsing strings representing dates...
>>> from dateutil.parser import parse
>>> dt = parse("Thu 21st Aug '14, 4:58am")
>>> dt
datetime.datetime(2014, 8, 21, 4, 58)
...and then to convert a datetime object to a timestamp, you can do the following:
>>> import time
>>> import datetime
>>> time.mktime(dt.timetuple())
1408593480.0
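One caveat about the conversion above: time.mktime interprets the time tuple as local time, so the result depends on the time zone of the machine running it (the value shown corresponds to a machine one hour ahead of UTC). If the parsed value is meant to be UTC, a sketch along these lines gives a machine-independent result; treating the value as UTC is an assumption, since the original string carries no zone information.

```python
import calendar
from datetime import datetime, timezone

dt = datetime(2014, 8, 21, 4, 58)  # naive value, as returned by parse() above

# Attach UTC explicitly, then ask for the POSIX timestamp.
ts_utc = dt.replace(tzinfo=timezone.utc).timestamp()
print(ts_utc)  # 1408597080.0

# calendar.timegm is the UTC counterpart of time.mktime and returns an int.
ts_int = calendar.timegm(dt.timetuple())
print(ts_int)  # 1408597080
```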
As a side remark, parse is a useful function which can recognise a huge range of different date formats. However, it's sometimes too helpful and sees dates where perhaps a date is not intended:
>>> parse("14, m 23")
datetime.datetime(2014, 8, 23, 0, 14)
If you also want to parse expressions such as "Yesterday, 7:22am", you could do one of two things:
Replace "yesterday", "yester-day", "yday" and other variations with "25/08/2014" (or another appropriate date) and then use parse on the new string.
Use another library to parse the string. parsedatetime is one option...
Here's parsedatetime in action on your example:
>>> import parsedatetime.parsedatetime as pdt
>>> p = pdt.Calendar()
>>> d = p.parse("Yesterday, 7:22am")
>>> d
((2014, 8, 25, 7, 22, 0, 0, 237, -1), 3)
To turn this date representation d into a datetime object, you can unpack the first six fields of the time tuple (year through second) like so:
>>> dt = datetime.datetime(*d[0][:6])
>>> dt
datetime.datetime(2014, 8, 25, 7, 22)
Now dt can be easily converted to a timestamp in the way described above.
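The first option mentioned above (substituting a concrete date for "yesterday") can also be sketched with the standard library alone. `parse_yesterday` is a hypothetical helper that only understands strings shaped like the one in the question, and the reference time `now` is passed in explicitly so the result is reproducible.

```python
from datetime import datetime, timedelta


def parse_yesterday(text: str, now: datetime) -> datetime:
    """Parse strings shaped like 'Yesterday, 7:22am' relative to `now`."""
    label, _, clock = text.partition(",")
    if label.strip().lower() != "yesterday":
        raise ValueError(f"unsupported relative date: {text!r}")
    day = (now - timedelta(days=1)).date()
    # %I together with %p reads a 12-hour clock with an am/pm marker.
    time_of_day = datetime.strptime(clock.strip(), "%I:%M%p").time()
    return datetime.combine(day, time_of_day)


now = datetime(2014, 8, 26, 12, 0)  # assumed "current" time for the example
print(parse_yesterday("Yesterday, 7:22am", now))  # 2014-08-25 07:22:00
```

In real use `now` would be `datetime.now()`; the am/pm matching relies on an English locale.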
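You can use strptime with an explicit format. Note that the literal "st" in the format only matches days such as 1st, 21st and 31st, and that %I (12-hour clock) is needed for %p to take effect:
import datetime
a = "Thu 21st Aug '14, 4:58am"
datetime.datetime.strptime(a, "%a %dst %b '%y, %I:%M%p")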
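A fixed "st" in the format string only covers days like the 1st or 21st. One way around that, sketched here under the assumption that every input otherwise looks like the example in the question, is to strip the ordinal suffix with a regular expression before calling strptime. `ORDINAL` and `parse_forum_date` are names made up for this sketch.

```python
import re
from datetime import datetime

# Matches st/nd/rd/th only when it directly follows a digit.
ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_forum_date(text: str) -> datetime:
    """Parse strings shaped like "Thu 21st Aug '14, 4:58am"."""
    cleaned = ORDINAL.sub("", text)  # "Thu 21st Aug ..." becomes "Thu 21 Aug ..."
    return datetime.strptime(cleaned, "%a %d %b '%y, %I:%M%p")


print(parse_forum_date("Thu 21st Aug '14, 4:58am"))  # 2014-08-21 04:58:00
print(parse_forum_date("Fri 22nd Aug '14, 4:58pm"))  # 2014-08-22 16:58:00
```

The result can then be converted to a timestamp in the same way as any other datetime object.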

Convert UTC time to python datetime

I have numerous UTC time stamps in the following format:
2012-04-30T23:08:56+00:00
I want to convert them to python datetime objects but am having trouble.
My code:
for time in data:
    pythondata[i]=datetime.strptime(time,"%y-%m-%dT%H:%M:%S+00:00")
I get the following error:
ValueError: time data '2012-03-01T00:05:55+00:00' does not match format '%y-%m-%dT%H:%M:%S+00:00'
It looks like I have the proper format, so why doesn't this work?
Change the year marker in your time format string to %Y. %y matches a two-digit year, while %Y matches a four-digit one:
from datetime import datetime
time = '2012-03-01T00:05:55+00:00'
datetime.strptime(time, "%Y-%m-%dT%H:%M:%S+00:00")
# => datetime.datetime(2012, 3, 1, 0, 5, 55)
See strftime() and strptime() behavior.
I highly recommend the python-dateutil library. It converts many different datetime formats from raw strings into datetime objects, with or without a timezone set:
>>> from dateutil.parser import parse
>>> parse('2012-04-30T23:08:56+00:00')
datetime.datetime(2012, 4, 30, 23, 8, 56, tzinfo=tzutc())
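For this particular format, Python 3.7 and later can also do the conversion without extra packages. The sketch below keeps the UTC offset instead of hard-coding "+00:00" into the format string, so the result is an aware datetime; the variable names are chosen for the example.

```python
from datetime import datetime

stamp = "2012-04-30T23:08:56+00:00"

# fromisoformat reads the offset directly (Python 3.7+).
parsed = datetime.fromisoformat(stamp)
print(repr(parsed))
# datetime.datetime(2012, 4, 30, 23, 8, 56, tzinfo=datetime.timezone.utc)

# strptime can read the offset with %z; a colon inside the offset
# is accepted from Python 3.7 on.
same = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")
print(parsed == same)  # True
```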
