python extracting date - python

I need to extract a date from a JPEG image.
I have already extracted the text from the image as a string and used a regex to find the date.
Text from JPEG
Cont:7225811153;
BillNo4896TableNoR306
07-Jun-201921:18:40
Code used
Importing regular expression & Date time
import re as r
from datetime import datetime
regex to identify the date in the above string
id = r.search(r'\d{2}-\w{3}-\d{4}',text)
print(id)
Output
<re.Match object; span=(89, 100), match='07-Jun-2019'>
However, after running the code above, I tried the following to convert the match to a date:
Code
Extracting the date
date = datetime.strptime(id.group(),'%d-%B-%Y').date()
Output
ValueError: time data '07-Jun-2019' does not match format '%d-%B-%Y'
Where am I going wrong, or is there a better way to do this?
Any help would be really appreciated.

Use %b instead of %B: %b matches the abbreviated month name (Jun), while %B expects the full name (June). Also make sure you only try to convert the match if one was found:
import re as r
from datetime import datetime
text = 'Cont:7225811153; BillNo4896TableNoR306 07-Jun-201921:18:40'
id = r.search(r'\d{2}-\w{3}-\d{4}',text)
if id:  # <-- Check if a match occurred
    print(datetime.strptime(id.group(),'%d-%b-%Y').date())
# => 2019-06-07
See more details on the datetime.strptime format strings.

You almost had it. Just replace %B (full month name) with %b (abbreviated month name).
>>> datetime.strptime(id.group(),'%d-%b-%Y').date()
datetime.date(2019, 6, 7)
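Since the extracted text fuses the date and the time (07-Jun-201921:18:40), one possible extension is to capture both parts with two regex groups and parse them together. This is only a sketch: the helper name extract_timestamp and the constant STAMP_RE are made up for illustration, and %b assumes an English locale.

```python
import re
from datetime import datetime

# Date (dd-Mon-yyyy) immediately followed by a time (HH:MM:SS), as in the OCR text.
STAMP_RE = re.compile(r'(\d{2}-[A-Za-z]{3}-\d{4})(\d{2}:\d{2}:\d{2})')

def extract_timestamp(text):
    """Return the first fused date/time found in text as a datetime, or None."""
    match = STAMP_RE.search(text)
    if match is None:
        return None
    date_part, time_part = match.groups()
    try:
        # %b is the abbreviated month name (locale-dependent; English assumed here).
        return datetime.strptime(f'{date_part} {time_part}', '%d-%b-%Y %H:%M:%S')
    except ValueError:
        # The regex matched, but the text is not a real date (e.g. a misread month).
        return None

text = 'Cont:7225811153; BillNo4896TableNoR306 07-Jun-201921:18:40'
print(extract_timestamp(text))
```

Returning None both for "no match" and for "matched but not a valid date" keeps the caller's check as simple as the `if id:` test above.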

Related

How to identify multiple dates within a python string?

For a given string, I want to identify the dates in it.
import datefinder
string = str("/plot 23/01/2023 24/02/2021 /cmd")
matches = list(datefinder.find_dates(string))
if matches:
    print(matches)
else:
    print("no date in string")
Output:
no date in string
However, there are clearly dates in the string. Ultimately I want to find the oldest date and store it in a variable Date1, and find the newest date and store it in a variable Date2.
I believe that if a string contains multiple dates, datefinder is unable to parse it. In your case, splitting the string using string.split() and applying the find_dates method should do the job.
You've only given 1 example, but based on that example, you can use regex.
import re
from datetime import datetime
string = "/plot 23/01/2023 24/02/2021 /cmd"
dates = [datetime.strptime(d, "%d/%m/%Y") for d in re.findall(r"\d{2}/\d{2}/\d{4}", string)]
print(f"earliest: {min(dates)}, latest: {max(dates)}")
Output
earliest: 2021-02-24 00:00:00, latest: 2023-01-23 00:00:00
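To get the Date1 and Date2 variables the question asks for, and to avoid min() failing on an empty list when the string holds no dates, the regex approach could be wrapped as follows. This is a sketch: the function name oldest_and_newest is hypothetical, and it assumes every date is written as dd/mm/yyyy.

```python
import re
from datetime import datetime

def oldest_and_newest(text):
    """Return (oldest, newest) for dd/mm/yyyy dates in text, or None if there are none."""
    dates = []
    for candidate in re.findall(r'\b\d{2}/\d{2}/\d{4}\b', text):
        try:
            dates.append(datetime.strptime(candidate, '%d/%m/%Y'))
        except ValueError:
            # Looks like a date but is not one (e.g. 99/99/2021); skip it.
            continue
    if not dates:
        return None
    return min(dates), max(dates)

result = oldest_and_newest("/plot 23/01/2023 24/02/2021 /cmd")
if result is not None:
    Date1, Date2 = result
    print(Date1.date(), Date2.date())
else:
    print("no date in string")
```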

replace the date section of a string in python

Suppose I have the string 'Tpsawd_20220320_default_economic_v5_0.xls'.
I want to replace the date part (20220320) with a date variable (i.e. if I define date = 20220410, it will replace 20220320 with this date). How should I do this with a built-in Python package? Please note that the location of the date in the string can vary: it might be 'Tpsawd_default_economic_v5_0_20220320.xls' or 'Tpsawd_default_economic_20220320_v5_0.xls'.
Yes, this can be done fairly easily with regex:
import re
s = 'Tpsawd_20220320_default_economic_v5_0.xls'
date = '20220410'
s = re.sub(r'\d{8}', date, s)
print(s)
Output:
Tpsawd_20220410_default_economic_v5_0.xls
This replaces every non-overlapping run of 8 consecutive digits in the string with the given replacement, in this case date. Pass count=1 to re.sub if only the first occurrence should be replaced.
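If the file name might contain longer digit runs, or more than one 8-digit block, a slightly stricter variant could look like the sketch below. The names DATE_RE and replace_date are made up for illustration; the lookarounds and count=1 are standard re features.

```python
import re

# Exactly eight digits that are not part of a longer run of digits.
DATE_RE = re.compile(r'(?<!\d)\d{8}(?!\d)')

def replace_date(filename, new_date):
    """Replace the first standalone 8-digit block in filename with new_date."""
    # A function replacement avoids backslash processing of new_date;
    # count=1 limits the change to the first match.
    return DATE_RE.sub(lambda _match: new_date, filename, count=1)

for name in ('Tpsawd_20220320_default_economic_v5_0.xls',
             'Tpsawd_default_economic_v5_0_20220320.xls',
             'Tpsawd_default_economic_20220320_v5_0.xls'):
    print(replace_date(name, '20220410'))
```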

Can't produce a certain format of date when I use a customized date instead of now

I'm trying to format a date to a customized one. When I use datetime.datetime.now(), I get the right format of date I'm after. However, my intention is to get the same format when I use 1980-01-22 instead of now.
import datetime
date_string = "1980-01-22"
item = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z")
print(item)
Output I get:
2021-05-04T09:52:04.010Z
How can I get the same format of date when I use a customized date, as in 1980-01-22 instead of now?
MrFuppes' suggestion in the comments is the shortest way to accomplish your date conversion and formatting use case.
Another way is to use the Python module dateutil. This module has a lot of flexibility and I use it all the time.
Using dateutil.parser.parse:
from dateutil.parser import parse
# ISO FORMAT
ISO_FORMAT_MICROS = "%Y-%m-%dT%H:%M:%S.%f%z"
# note the format of these strings
date_strings = ["1980-01-22",
"01-22-1980",
"January 22, 1980",
"1980 January 22"]
for date_string in date_strings:
    dt = parse(date_string).strftime(ISO_FORMAT_MICROS)
    # drop the last 3 microsecond digits to keep milliseconds, then add the Zulu (UTC) designator
    iso_formatted_date = f'{dt[:-3]}Z'
    print(iso_formatted_date)
# output
1980-01-22T00:00:00.000Z
1980-01-22T00:00:00.000Z
1980-01-22T00:00:00.000Z
1980-01-22T00:00:00.000Z
Using dateutil.parser.isoparse:
from dateutil.parser import isoparse
dt = isoparse("1980-01-22").isoformat(timespec="milliseconds")
iso_formatted_date = f'{dt}Z'
print(iso_formatted_date)
# output
1980-01-22T00:00:00.000Z
Is this what you're trying to achieve?
import datetime
date_string = "1980-01-22"
datetime.datetime.strptime(date_string, "%Y-%m-%d").isoformat(timespec="milliseconds")
Output
'1980-01-22T00:00:00.000'
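Note that appending Z to a naive datetime silently claims the value is in UTC. A standard-library sketch that makes this explicit, and reuses the same isoformat/replace chain as the question, might look like this. The function name to_utc_iso_millis is hypothetical, and treating the date as midnight UTC is an assumption.

```python
from datetime import datetime, timezone

def to_utc_iso_millis(date_string):
    """Format a YYYY-MM-DD string as midnight UTC with millisecond precision."""
    # Assumption: the date is meant as midnight UTC, so attach the UTC tzinfo.
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    # Same formatting chain the question uses with datetime.now().
    return dt.isoformat(timespec="milliseconds").replace("+00:00", "Z")

print(to_utc_iso_millis("1980-01-22"))
```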

Finding date represented in various formats in a string

The code below prints only 2-Nov-2018. How do I modify it so that both date formats are picked up?
import re
string = "some text contains 2-Nov-2018 and 3-11-2018"
date = re.findall('\d{1,2}[/-]\D{1,8}[/-]\d{2,4}', string)
print(date)
I think the simplest thing would be to write multiple patterns.
(Assuming you are just looking for these two patterns -- obviously gets more complicated to do yourself if you are looking for every possible date format)
import re
date_string = "some text contains 2-Nov-2018 and 3-11-2018"
formats = [r'\d{1,2}[/-]\D{1,8}[/-]\d{2,4}', # List of patterns
r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}']
dates = re.findall('|'.join(formats), date_string) # Join with | operator
dates
# ['2-Nov-2018', '3-11-2018']
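To standardize the dates after this, you could try something like pandas.to_datetime. Note that by default it reads the ambiguous 3-11-2018 month-first (11 March 2018); pass dayfirst=True if the day comes first: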
import pandas as pd
dates = ['2-Nov-2018', '3-11-2018']
std_dates = [pd.to_datetime(d) for d in dates]
std_dates
# [Timestamp('2018-11-02 00:00:00'), Timestamp('2018-03-11 00:00:00')]
As was mentioned in some comments, there may be libraries already built to do all of this for you. So if you are looking for a more general approach, I would take a look at those libraries.
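If the set of formats is known in advance, the ambiguity can be avoided without pandas by trying explicit strptime formats in order. This is a sketch under stated assumptions: the names KNOWN_FORMATS and parse_date are made up, both formats are assumed to be day-first, and %b assumes an English locale.

```python
from datetime import datetime

# Assumed formats, tried in order: 2-Nov-2018 and 3-11-2018 (day first).
KNOWN_FORMATS = ('%d-%b-%Y', '%d-%m-%Y')

def parse_date(value):
    """Parse value with the first matching format; raise ValueError if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f'unrecognised date: {value!r}')

dates = ['2-Nov-2018', '3-11-2018']
std_dates = [parse_date(d) for d in dates]
print(std_dates)
```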

strip date with -07:00 timezone format python

I have a variable 'd' that contains dates in this format:
2015-08-03T09:00:00-07:00
2015-08-03T10:00:00-07:00
2015-08-03T11:00:00-07:00
2015-08-03T12:00:00-07:00
2015-08-03T13:00:00-07:00
2015-08-03T14:00:00-07:00
etc.
I need to strip these dates, but I'm having trouble because of the timezone. If I use d = dt.datetime.strptime(d[:19],'%Y-%m-%dT%H:%M:%S'), only the first 19 characters will appear and the rest of the dates are ignored. If I try d = dt.datetime.strptime(d[:-6],'%Y-%m-%dT%H:%M:%S'), Python doesn't chop off the timezone and I still get the error ValueError: unconverted data remains: -07:00. I don't think I can use the dateutil parser because I've only seen it used for one date instead of a whole list like mine. What can I do? Thanks!
Since you have a list just iterate over and use dateutil.parser:
d = ["2015-08-03T09:00:00-07:00","2015-08-03T10:00:00-07:00","2015-08-03T11:00:00-07:00","2015-08-03T12:00:00-07:00",
"2015-08-03T13:00:00-07:00","2015-08-03T14:00:00-07:00"]
from dateutil import parser
for dte in d:
    print(parser.parse(dte))
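If for some reason you actually want to ignore the timezone you can use rsplit with datetime.strptime (this relies on the offset being negative, as in -07:00, so that the last hyphen is the one that starts the offset):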
from datetime import datetime
for dte in d:
    print(datetime.strptime(dte.rsplit("-",1)[0],"%Y-%m-%dT%H:%M:%S"))
If you had a single string delimited by commas then just use d.split(",")
You can use strftime to format the string in any format you want if you actually want a string:
for dte in d:
    print(datetime.strptime(dte.rsplit("-",1)[0],"%Y-%m-%dT%H:%M:%S").strftime("%Y-%m-%d %H:%M:%S"))
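On Python 3.7 or later, the standard library can also read this exact format, including positive offsets that the rsplit approach would not handle. A minimal sketch, using a shortened version of the list from the question:

```python
from datetime import datetime

d = ["2015-08-03T09:00:00-07:00", "2015-08-03T14:00:00-07:00"]

# fromisoformat keeps the UTC offset as tzinfo on each datetime.
aware = [datetime.fromisoformat(s) for s in d]

# Drop the offset but keep the wall-clock time, if the timezone is not wanted.
naive = [dt.replace(tzinfo=None) for dt in aware]

print(aware[0])
print(naive[0])
```

Keeping the aware values around is usually the safer choice, since dropping tzinfo discards the information needed to compare times across zones.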
