Is there a way to get back the directives that dateutil used to parse a date?
from dateutil import parser
dstr = '2017/10/01 16:44'
dtime = parser.parse(dstr)
What I would like is the ability to get '%Y/%m/%d %H:%M' back somehow.
No, the parser in dateutil has no support for extracting a format. The parser uses a mix of tokenizing and heuristics to try to figure out what the various numbers and words in the input could mean, and no 'format' is built up during this process.
Your best bet is to search the input string for the fields from the resulting datetime object and produce a format from that.
For your specific example, that is a reasonable option, because all the resulting values are unique. If your inputs do not have unique values, you'll have to include heuristics where you use multiple examples to increase the certainty of a correct match.
In your specific example, you can find unique positions for all the datetime components presented as strings, starting with '2017', '10', etc. However, for other examples you'll have to search for different variants of string representations of those components, like a 2-digit year format, or month, day, hour or minute components not using zero-padding, and you need to account for a 12-hour clock representation.
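A minimal sketch of that search-and-replace idea, using only the standard library. The helper name guess_format and the directive list are my own choices for illustration, and it only handles zero-padded numeric fields that occur exactly once in the input; anything ambiguous is simply left in place:

```python
from datetime import datetime

def guess_format(text, dt, directives=("%Y", "%m", "%d", "%H", "%M", "%S")):
    """Replace each datetime field that occurs exactly once in text with its directive.

    Hypothetical helper: the 4-digit year is tried first so that its digits
    are not mistaken for a shorter field later on.
    """
    fmt = text
    for directive in directives:
        value = dt.strftime(directive)   # e.g. '%m' -> '10'
        if fmt.count(value) == 1:        # only replace unambiguous fields
            fmt = fmt.replace(value, directive)
    return fmt

dstr = "2017/10/01 16:44"
dtime = datetime(2017, 10, 1, 16, 44)    # stand-in for parser.parse(dstr)
fmt = guess_format(dstr, dtime)
```

A quick sanity check is to feed the guessed format back into datetime.strptime together with the original string and compare the result with the parsed datetime.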
I haven't directly tried this, but I strongly suspect that this is a problem very suitable for the Aho–Corasick algorithm, which lets you find positions of matching known strings (the dictionary, here your various datetime components formatted as strings, plus potential delimiter characters) in an input string. Once you have those positions, and you have resolved the ambiguities, you can construct a format string from those. You can probably narrow down the number of possible component formats by looking for tell-tale strings like pm or weekdays or month names.
There are ready-made Python implementations, like the pyahocorasick package. With that library I was able to make a pretty good approximation in a few steps:
>>> from dateutil import parser
>>> import ahocorasick
>>> A = ahocorasick.Automaton()
>>> dstr = '2017/10/01 16:44'
>>> dtime = parser.parse(dstr)
>>> formats = 'dmyYHIpMS'
>>> for f in formats:
...     _ = A.add_word(dtime.strftime(f'%{f}'), (False, f))
...
>>> for p in ':/ ':
...     _ = A.add_word(p, (True, p))
...
>>> A.make_automaton()
>>> for end_index, (punctuation, char) in A.iter(dstr):
...     print(end_index, char if punctuation else f'%{char}')
...
2 %d
3 %Y
3 %y
4 /
6 %m
7 /
9 %d
10
12 %H
13 :
15 %M
You could include priorities, and only output a specific formatter when punctuation is reached; that'll resolve the %d / %Y / %y clash at the start.
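One simple way to resolve such clashes, sketched here with plain Python and no automaton: prefer the longest match and drop anything that overlaps a match already chosen. This is a swapped-in "longest match wins" rule rather than the punctuation-triggered priorities described above, and build_format plus the (end_index, matched_text, output) tuple layout are assumptions made for illustration:

```python
def build_format(matches):
    """Build a format string from (end_index, matched_text, output) tuples.

    Hypothetical helper: longer matches win, and shorter matches that
    overlap an already chosen span are discarded.
    """
    chosen = []
    covered = set()
    # sorted() is stable, so matches of equal length keep their input order
    for end, text, out in sorted(matches, key=lambda m: -len(m[1])):
        span = range(end - len(text) + 1, end + 1)
        if covered.isdisjoint(span):
            covered.update(span)
            chosen.append((span.start, out))
    return "".join(out for _, out in sorted(chosen))

# The matches printed above for '2017/10/01 16:44', with the matched text added
matches = [
    (2, "01", "%d"), (3, "2017", "%Y"), (3, "17", "%y"), (4, "/", "/"),
    (6, "10", "%m"), (7, "/", "/"), (9, "01", "%d"), (10, " ", " "),
    (12, "16", "%H"), (13, ":", ":"), (15, "44", "%M"),
]
fmt = build_format(matches)
```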
I'm attempting to read data from multiple text files and move the data into a two-dimensional array. The data needs to remain in a specific order.
Could regex assist with this?
If you have any insight on how to improve this section of the code please let me know.
The datetime module provides (almost) everything date-related:
from datetime import datetime
date = "Sat 30-Mar-1996 7:40 PM"
fmt = "%a %d-%b-%Y %I:%M %p"
a = datetime.strptime(date, fmt)
print(a.year)  # prints 1996
You can parse the date-time string very easily by splitting its components and using iterable unpacking, e.g.,
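def parse_date(d):
    # d looks like "Sat 30-Mar-1996 7:40 PM"
    day_of_week, date, hhmm, ampm = d.split()
    day_of_month, month, year = date.split('-')
    hour, minute = hhmm.split(':')
    # convert the 12-hour clock to 24-hour: 12 AM -> 0, 12 PM -> 12
    hour24 = int(hour) % 12 + (12 if ampm == 'PM' else 0)
    return (year, month, day_of_month,
            str(hour24), minute,
            day_of_week)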
and later, in the body of the loop
year, month, dom, h, minute, dow = parse_date(fields[-1].strip())
or, if you are interested only in year
year, *_ = parse_date(fields[-1].strip())
You're probably looking for regular expressions (regex for short), which are a very powerful way to analyze and extract data from strings. For an intro to them, I'd check out this site or the Python docs, but in your case I think something like '\| ([a-zA-Z]*) ([0-9]*)-([a-zA-Z]*)-([0-9]*) ([0-9:]* [a-zA-Z]*) \|' would work. This assumes the | characters are literal field delimiters in your file; they must be escaped as \|, because a bare | means alternation in a regex. A more specific description of the time format is necessary for a 100% correct regex.
To use regex in Python, you want the re library. First, create the pattern matcher with matcher = re.compile(your_regex_string_here). Then, find the match with result = matcher.match(file_contents). (You could also just do result = re.match(regex_string, file_contents).) Note that match only matches at the start of the string; use search instead to find the pattern anywhere in it. Whatever your regex, anything surrounded by parentheses is known as a "capturing group", which can be extracted from the result with result.group(); result.group(0) will return the full match, and result.group(n) will return the contents of the nth capturing group, that is, the nth set of parentheses. In the above example, result.group(4) would return the year, though you could get any of the day of the week, day, month, year, and time by using groups 1-5.
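Putting those steps together, here is a short sketch. The sample line is made up (the real file layout isn't shown in the question), and it assumes the date field sits between literal | delimiters:

```python
import re

# Hypothetical sample line; adjust the pattern to the real file layout
line = "some other fields | Sat 30-Mar-1996 7:40 PM |"

# The | delimiters are escaped because a bare | means alternation
matcher = re.compile(
    r"\| ([a-zA-Z]+) ([0-9]+)-([a-zA-Z]+)-([0-9]+) ([0-9:]+ [a-zA-Z]+) \|"
)

# search() scans the whole string; match() would only try the very start
result = matcher.search(line)
groups = result.groups()        # all five capturing groups as a tuple
year = int(result.group(4))     # fourth set of parentheses is the year
```

If the pattern is not found, search returns None, so real code should check the result before calling group.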
The datetime module, as mentioned in another answer, is also a great tool.
This question may be a duplicate, but I didn't find an exact solution for it. I have this type of string that includes a date and time.
"check_in": "10/25/2019 14:30"
I need to extract the hour from it, but the string is not always in a valid format. I tried these patterns so far, but the result includes the ":" character.
\d+?(:)
(\d+:)
(\d+)*:
Regular expressions aren't always the best way to deal with strings representing dates, especially if you can't rely on the input format to be consistent. Use a specialized parser instead:
>>> from dateutil import parser
>>> parser.parse("10/25/2019 14:30").hour
14
>>> parser.parse("10/25/2019 2:30 PM").hour
14
>>> parser.parse("2019-10-25T143000").hour
14
The module dateutil isn't in the standard library but is well worth the trouble of downloading.
\d+(?=:)
You don't need to match the :, you only need to check that it is there. So use a positive lookahead, (?=:).
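A quick sketch of the lookahead in use. The sample text is the string from the question; the colon after "check_in" is not preceded by digits, so it is skipped:

```python
import re

text = '"check_in": "10/25/2019 14:30"'

# \d+ consumes the digits; (?=:) only checks that a colon follows,
# without making it part of the match
match = re.search(r"\d+(?=:)", text)
hour = match.group() if match else None
```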
First, this is what is wrong with your regexes:
\d+?(:) - finds the number and colon (14:) and puts the colon into a group
(\d+:) - finds the number and colon (14:) and puts all of it into a group
(\d+)*: - finds the (optional, because of *) number and the colon (14:) and puts the number into a group
So, the last one could work:
>>> import re
>>> match = re.search(r'(\d+)*:', "10/25/2019 14:30")
>>> match.group(0) # whole result
'14:'
>>> match.group(1) # just the number
'14'
But then again, it would give a wrong result (instead of failing) on something like "time: 14:30", making the error difficult to debug later. What you want is a stricter search, e.g. matching the whole string and labelling all groups:
>>> regex = r'(?P<month>\d\d)/(?P<day>\d\d)/(?P<year>\d{4}) (?P<hour>\d\d):(?P<minute>\d\d)'
>>> re.search(regex, "10/25/2019 14:30").group('hour')
'14'
Another, easier and even safer way is to use strptime:
>>> import datetime
>>> datetime.datetime.strptime("10/25/2019 14:30", "%m/%d/%Y %H:%M")
datetime.datetime(2019, 10, 25, 14, 30)
Now you have the complete datetime object and you can extract the .hour if you want.
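Since the question says the input is not always in the same format, one option is to try a short list of formats in turn. This is only a sketch: the function name extract_hour and the second format are assumptions, so list whatever formats your data actually uses:

```python
from datetime import datetime

def extract_hour(text, formats=("%m/%d/%Y %H:%M", "%m/%d/%Y %I:%M %p")):
    """Return the hour from the first format that parses text, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).hour
        except ValueError:
            continue   # this format did not fit, try the next one
    return None

hour_24 = extract_hour("10/25/2019 14:30")
hour_12 = extract_hour("10/25/2019 2:30 PM")
bad = extract_hour("not a date")
```

Returning None for unparseable input keeps the failure explicit, instead of silently producing a wrong hour.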
I have a timestamp that looks like this: 2015-11-06T14:20:14.011+01:00. I would like to parse it into a datetime.
My idea is to use %Y-%m-%dT%H:%M:%S.%f%z as the format for it.
But the problem is the colon in the timezone offset. How can I remove the colon from the timezone, or is there a better way than %z?
You have an ISO 8601 datetime string. Don't bother parsing it or fiddling with it by hand (see: XY Problem). Use the iso8601 library for Python.
import iso8601
parsed = iso8601.parse_date("2015-11-06T14:20:14.011+01:00")
If you want to remove the timezone information from it, use the replace method.
tz_stripped = parsed.replace(tzinfo=None)
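If a third-party package is not an option, the standard library can handle this particular string on its own. This is a sketch that assumes Python 3.7 or later: datetime.fromisoformat was added in that release, and from that release on the %z directive of strptime also accepts a colon in the UTC offset:

```python
from datetime import datetime, timedelta

original = "2015-11-06T14:20:14.011+01:00"

# Option 1 (Python 3.7+): parse the ISO 8601 string directly
parsed = datetime.fromisoformat(original)

# Option 2 (Python 3.7+): the format from the question, colon and all
same = datetime.strptime(original, "%Y-%m-%dT%H:%M:%S.%f%z")

# Dropping the timezone works the same way as with iso8601
tz_stripped = parsed.replace(tzinfo=None)
```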
import re
original = '2015-11-06T14:20:14.011+01:00'
replaced = re.sub(r'([+-]\d+):(\d+)$', r'\1\2', original)
# replaced == '2015-11-06T14:20:14.011+0100'
This will replace the colon only when it is preceded by a plus or minus, and surrounded by digits until the end of the string.
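To finish that approach, the colon-free string can then be handed to strptime with the format from the question. A short sketch:

```python
import re
from datetime import datetime, timedelta

original = "2015-11-06T14:20:14.011+01:00"

# Strip the colon from the trailing UTC offset only
replaced = re.sub(r"([+-]\d+):(\d+)$", r"\1\2", original)

# %z understands offsets written as +HHMM
parsed = datetime.strptime(replaced, "%Y-%m-%dT%H:%M:%S.%f%z")
```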
I think the best way to do this is with dateutil:
https://labix.org/python-dateutil
from dateutil.parser import parse
original = '2015-11-06T14:20:14.011+01:00'
print("Original Date {}".format(original))
new_date = parse(original)
print(new_date)
print(type(new_date))
# print(new_date.hour)
# print(new_date.minute)
# print(new_date.second)
print("New Date 1 {}".format(new_date.strftime('%Y-%m-%d %H:%M:%S')))
print("New Date 2 {}".format(new_date.strftime('%Y-%m-%dT%H:%M:%S.%f%z')))
Output:
Original Date 2015-11-06T14:20:14.011+01:00
2015-11-06 14:20:14.011000+01:00
<class 'datetime.datetime'>
New Date 1 2015-11-06 14:20:14
New Date 2 2015-11-06T14:20:14.011000+0100
I have a series of strings that I am trying to parse into dates. They are of the form (001 is the Julian day)
code_36763.letters_81m_2013_001_0000.dat
Only the numbers that are not part of the date change, so with wildcards this would be
code_?????.letters_??m_%Y_%j_%H%M.dat
My first thought was to try datetime.datetime.strptime, but I get an error saying ValueError: time data does not match format, which means that strptime does not understand wildcards. My second thought was to use dateutil.parser, but when I do
from dateutil.parser import parse
f='code_36763.letters_81m_2013_001_0000.dat'
parse(f, fuzzy=True)
I get the error
TypeError: 'NoneType' object is not iterable
which probably means that those other numbers are getting in the way.
Is there a way to solve this without manually cutting the other numbers? I ask this because the code I have to write should be general enough that the other numbers can be in different positions along the string.
Something like this could work by using re.sub to reformat the file name into something that strptime could parse.
>>> import re
>>> import datetime
>>> filenames = ["code_36763.letters_81m_2013_001_0000.dat", "code_36763.letters_81m_2013_240_1700.dat"]
>>> for n in filenames:
...     parsed = re.sub(r"code_\d+\.letters_\d{2}m_(\d{4})_(\d{3})_(\d{2})(\d{2})\.dat", r"\1-\2-\3:\4", n)
...     print(datetime.datetime.strptime(parsed, "%Y-%j-%H:%M"))
...
2013-01-01 00:00:00
2013-08-28 17:00:00
I would use a regular expression:
>>> import re
>>> re.match(
r"code_\d{5}.letters_\d{2}m_(?P<year>\d{4})_(?P<day>\d{3})_(?P<hour>\d{2})(?P<minute>\d{2}).dat",
"code_36763.letters_81m_2013_001_0000.dat"
).groupdict()
{'year': '2013', 'day': '001', 'minute': '00', 'hour': '00'}
You can then convert the numbers to integers and pass them on accordingly. See e.g. Convert julian day into date for help with that step.
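That conversion step could look like the sketch below. The function name filename_to_datetime is hypothetical, and the pattern assumes the date always sits right before the .dat extension, so the other numbers earlier in the name are free to vary:

```python
import re
from datetime import datetime, timedelta

# Anchored at the end of the name; everything before the year is ignored
PATTERN = re.compile(
    r"_(?P<year>\d{4})_(?P<day>\d{3})_(?P<hour>\d{2})(?P<minute>\d{2})\.dat$"
)

def filename_to_datetime(name):
    match = PATTERN.search(name)
    if match is None:
        raise ValueError("no date found in {!r}".format(name))
    parts = {key: int(value) for key, value in match.groupdict().items()}
    # Day 1 of the year is January 1st, so add (day - 1) days to it
    start = datetime(parts["year"], 1, 1, parts["hour"], parts["minute"])
    return start + timedelta(days=parts["day"] - 1)

first = filename_to_datetime("code_36763.letters_81m_2013_001_0000.dat")
later = filename_to_datetime("code_36763.letters_81m_2013_240_1700.dat")
```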
The string as you have it appears to be fairly fixed format. If this is the case, then the following approach might suffice which simply slices off the beginning so that it is suitable for strptime:
import datetime
filename = "code_36763.letters_81m_2013_001_0000.dat"
print(datetime.datetime.strptime(filename[-19:-4], "m_%Y_%j_%H%M"))
Giving you the output:
2013-01-01 00:00:00
I am trying to do string manipulation based on a format. str.replace(old, new) allows replacing a specific string. Is it possible to find and replace by format? For example,
I want to find all datetime-like values in a long string and replace them with another format.
Assuming % is a wildcard for a digit and the datetime looks like %%/%%/%%T%%:%%
str.replace(%%/%%/%%T%%:%%, 'dummy value')
EDIT:
Sorry, I should have been clearer. It seems I can use re.sub for that, but how do I substitute each match with a converted date value? In this case, e.g.
YY/MM/DDTHH:MM to (YY/MM/DD HH:MM)+8 hours
The easiest way to do this is probably to combine regular expressions with datetime parsing: the repl parameter of re.sub can be a function that takes a match and returns the string to replace it with, and inside that function you can use datetime's strptime and strftime:
>>> from datetime import datetime
>>> import re
>>> def replacer(match):
...     return datetime.strptime(
...         match.group(),  # matched text
...         '%y/%m/%dT%H:%M',  # source format in datetime syntax
...     ).strftime('%d %B %Y at %H.%M')  # destination format in datetime syntax
...
>>> re.sub(
...     r'\d{2}/\d{2}/\d{2}T\d{2}:\d{2}',  # source format in regex syntax
...     replacer,  # function to process match
...     'The date and time was 12/12/12T12:12 exactly.',  # string to process
... )
'The date and time was 12 December 2012 at 12.12 exactly.'
The only downside of this is that you need to define the source format in both datetime and re syntax, which isn't very DRY; if they don't match, you'll get nowhere.
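One way around that duplication is to derive the regex from the datetime format, so the source format is written only once. The sketch below also adds the 8 hours asked for in the question's edit. The names format_to_regex and shift_dates are hypothetical, and the directive table only covers the zero-padded numeric directives used here; any other directive raises a KeyError:

```python
import re
from datetime import datetime, timedelta

# Regex fragments for a few zero-padded strptime directives (not complete)
DIRECTIVE_PATTERNS = {
    "%y": r"\d{2}", "%Y": r"\d{4}", "%m": r"\d{2}",
    "%d": r"\d{2}", "%H": r"\d{2}", "%M": r"\d{2}",
}

def format_to_regex(fmt):
    # Splitting on a capturing group keeps the directives: they land at the
    # odd indexes, and the literal text in between at the even indexes
    pieces = re.split(r"(%[A-Za-z])", fmt)
    return "".join(
        DIRECTIVE_PATTERNS[piece] if i % 2 else re.escape(piece)
        for i, piece in enumerate(pieces)
    )

def shift_dates(text, src_fmt, dst_fmt, hours):
    def replacer(match):
        moment = datetime.strptime(match.group(), src_fmt)
        return (moment + timedelta(hours=hours)).strftime(dst_fmt)
    return re.sub(format_to_regex(src_fmt), replacer, text)

result = shift_dates(
    "The date and time was 12/12/12T12:12 exactly.",
    "%y/%m/%dT%H:%M",   # source format, written once
    "%y/%m/%d %H:%M",   # destination format
    8,                  # hours to add
)
```

Because the arithmetic happens on a real datetime, a shift that crosses midnight or a year boundary rolls the date over correctly.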