Extract the hour from a string with an unclear format - Python

This question may be a duplicate, but I didn't find an exact solution for it. I have this type of string, which includes a date and a time.
"check_in": "10/25/2019 14:30"
I need to extract the hour from it, but the string is not always in a valid format. I have tried these patterns so far, but they include the ":" character.
\d+?(:)
(\d+:)
(\d+)*:

Regular expressions aren't always the best way to deal with strings representing dates, especially if you can't rely on the input format to be consistent. Use a specialized parser instead:
>>> from dateutil import parser
>>> parser.parse("10/25/2019 14:30").hour
14
>>> parser.parse("10/25/2019 2:30 PM").hour
14
>>> parser.parse("2019-10-25T143000").hour
14
The module dateutil isn't in the standard library but is well worth the trouble of downloading.

\d+(?=:)
You don't need to match the :, you only need to check that it is there. So use a positive lookahead, (?=:).
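As a minimal sketch of how that lookahead pattern behaves with Python's re module (the sample string is the one from the question; the variable names are only illustrative):

```python
import re

# \d+ consumes the digits; (?=:) only asserts that a colon follows
# without making it part of the match.
match = re.search(r"\d+(?=:)", "10/25/2019 14:30")

# Guard against inputs that contain no "digits followed by a colon".
hour = match.group() if match else None
print(hour)
```

Note that the result is a string; wrap it in int() if a number is needed.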

First, this is what is wrong with your regexes:
\d+?(:) - finds a number and a colon (14:) and puts the colon into a group
(\d+:) - finds a number and a colon (14:) and puts all of it into a group
(\d+)*: - finds a number (optional, because of *) and a colon (14:) and puts the number into a group
So, the last one could work:
>>> import re
>>> match = re.search(r'(\d+)*:', "10/25/2019 14:30")
>>> match.group(0) # whole result
'14:'
>>> match.group(1) # just the number
'14'
But then again, it would give a wrong result (instead of failing) on something like "time: 14:30", making the error difficult to debug later. What you want is a stricter search, e.g. matching the whole string and naming all groups:
>>> regex = r'(?P<month>\d\d)/(?P<day>\d\d)/(?P<year>\d{4}) (?P<hour>\d\d):(?P<minute>\d\d)'
>>> re.search(regex, "10/25/2019 14:30").group('hour')
'14'
Another, easier and even safer way is to use strptime:
>>> import datetime
>>> datetime.datetime.strptime("10/25/2019 14:30", "%m/%d/%Y %H:%M")
datetime.datetime(2019, 10, 25, 14, 30)
Now you have the complete datetime object and you can extract the .hour if you want.
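Since the question says the format is not always the same, one possible way to stay with strptime is to try a short list of known formats in turn. This is only a sketch: the list of candidate formats and the function name extract_hour are assumptions for illustration, and the %p directive depends on the locale.

```python
from datetime import datetime

# Hypothetical list of formats the input is expected to use; extend as needed.
CANDIDATE_FORMATS = (
    "%m/%d/%Y %H:%M",      # 10/25/2019 14:30
    "%m/%d/%Y %I:%M %p",   # 10/25/2019 2:30 PM
    "%Y-%m-%dT%H%M%S",     # 2019-10-25T143000
)

def extract_hour(text):
    """Return the hour as an int, or None if no candidate format fits."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).hour
        except ValueError:
            # strptime raises ValueError when the text does not fit the format.
            continue
    return None

print(extract_hour("10/25/2019 14:30"))
print(extract_hour("not a date"))
```

Returning None for unparseable input keeps bad data visible instead of silently producing a wrong hour.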

Related

Return directives used by dateutil.parser

Is there a way to get back the directives that dateutil used to parse a date?
from dateutil import parser
dstr = '2017/10/01 16:44'
dtime = parser.parse(dstr)
What I would like is the ability to get '%Y/%m/%d %H:%M' back somehow.
No, the parser in dateutil has no support for extracting a format. The parser uses a mix of tokenizing and heuristics to try to figure out what the various numbers and words in the input could mean, and no 'format' is built up during this process.
Your best bet is to search the input string for the fields from the resulting datetime object and produce a format from that.
For your specific example, that is a reasonable option, because all the resulting values are unique. If your inputs do not have unique values, you'll have to include heuristics where you use multiple examples to increase the certainty of a correct match.
For example, for your specific example, you can find unique positions for all the datetime components presented as strings, starting with '2017', '10', etc. However, for other examples you'll have to search for different variants of string representations of those components, like a 2-digit year format, or month, day, hour or minute components not using zero-padding, and you need to account for a 12-hour clock representation.
I haven't directly tried this, but I strongly suspect that this is a problem very suitable for the Aho–Corasick algorithm, which lets you find positions of matching known strings (the dictionary, here your various datetime components formatted as strings, plus potential delimiter characters) in an input string. Once you have those positions, and you have resolved the ambiguities, you can construct a format string from those. You can probably narrow down the number of possible component formats by looking for tell-tale strings like pm or weekdays or month names.
There are ready-made Python implementations, like the pyahocorasick package. With that library I was able to make a pretty good approximation in a few steps:
>>> from dateutil import parser
>>> import ahocorasick
>>> A = ahocorasick.Automaton()
>>> dstr = '2017/10/01 16:44'
>>> dtime = parser.parse(dstr)
>>> formats = 'dmyYHIpMS'
>>> for f in formats:
...     _ = A.add_word(dtime.strftime(f'%{f}'), (False, f))
...
>>> for p in ':/ ':
...     _ = A.add_word(p, (True, p))
...
>>> A.make_automaton()
>>> for end_index, (punctuation, char) in A.iter(dstr):
...     print(end_index, char if punctuation else f'%{char}')
...
2 %d
3 %Y
3 %y
4 /
6 %m
7 /
9 %d
10
12 %H
13 :
15 %M
You could include priorities, and only output a specific formatter when punctuation is reached; that'll resolve the %d / %Y / %y clash at the start.
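The idea of only deciding on a directive at punctuation boundaries, with a priority order, can also be sketched without Aho-Corasick. The version below swaps in a simpler technique: it splits the input into alphanumeric tokens and punctuation with a regular expression, then picks the first unused directive whose strftime output equals the token. The names guess_format and DIRECTIVES and the priority order are assumptions for illustration. It builds the datetime directly to stay self-contained (in practice it would come from parser.parse), and it does not handle non-padded values or ambiguous inputs.

```python
import re
from datetime import datetime

# Directives to try, in priority order (assumption: earlier entries win ties).
DIRECTIVES = ["%Y", "%m", "%d", "%H", "%M", "%S", "%y", "%I", "%p"]

def guess_format(text, dt):
    """Rebuild a strptime format by matching tokens of text against fields of dt."""
    parts = []
    used = set()
    # Alternate between alphanumeric tokens and the punctuation between them.
    for token in re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text):
        if not (token.isascii() and token.isalnum()):
            # Punctuation is copied literally; a literal % must be doubled.
            parts.append(token.replace("%", "%%"))
            continue
        for directive in DIRECTIVES:
            if directive not in used and dt.strftime(directive) == token:
                parts.append(directive)
                used.add(directive)
                break
        else:
            raise ValueError(f"no directive matches {token!r}")
    return "".join(parts)

dstr = "2017/10/01 16:44"
dtime = datetime(2017, 10, 1, 16, 44)  # stand-in for parser.parse(dstr)
fmt = guess_format(dstr, dtime)
print(fmt)
```

A quick sanity check for any guessed format is to confirm that datetime.strptime(dstr, fmt) gives back the same datetime.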

Python Regular Expression Extracting 'name= ....'

I'm using a Python script to read data from our corporate instance of JIRA. There is a value that is returned as a string and I need to figure out how to extract one bit of info from it. What I need is the 'name= ....' and I just need the numbers from that result.
<class 'list'>: ['com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]']
I just need the 2016.2.4 portion of it. This number will not always be the same either.
Any thoughts as how to do this with RE? I'm new to regular expressions and would appreciate any help.
A simple regular expression can do the trick: name=([0-9.]+).
The primary part of the regex is ([0-9.]+) which will search for any digit (0-9) or period (.) in succession (+).
Now, to use this:
import re
pattern = re.compile('name=([0-9.]+)')
string = '''<class 'list'>: ['com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]']'''
matches = pattern.search(string)
# Only assign the value if a match is found
name_value = '' if not matches else matches.group(1)
Use a capturing group to extract the version name:
>>> import re
>>> s = 'com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]'
>>> re.search(r"name=([0-9.]+)", s).group(1)
'2016.2.4'
where ([0-9.]+) is a capturing group (the parentheses define it) matching one or more digits or dots.
A non-regex option would involve some splitting by ,, = and -:
>>> l = [item.split("=") for item in s.split(",")]
>>> next(value[1] for value in l if value[0] == "name").split(" - ")[0]
'2016.2.4'
This, of course, needs testing and error handling.
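A rough sketch of what that error handling could look like for the split-based approach follows. The function name extract_version_by_split is hypothetical, and the code assumes that field values never contain a comma, which may not hold for arbitrary sprint names.

```python
def extract_version_by_split(sprint_repr):
    """Return the part of the name field before ' - ', or None if there is no name."""
    fields = {}
    for item in sprint_repr.split(","):
        # partition never raises; sep is empty when the item has no "=".
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    name = fields.get("name")
    if name is None:
        return None
    return name.split(" - ")[0]

s = 'com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]'
version = extract_version_by_split(s)
missing = extract_version_by_split("Sprint#0[id=1,state=OPEN]")
print(version, missing)
```

Using partition and dict.get avoids the IndexError and StopIteration that the one-liner would raise on malformed input.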

Extract date string from (more) complex string (possibly a regex match)

I have a string template that looks like 'my_index-{year}'.
I do something like string_template.format(year=year) where year is some string. Result of this is some string that looks like my_index-2011.
Now, to my question: I have a string like my_index-2011 and my template 'my_index-{year}'. What might be a slick way to extract the {year} portion?
[Note: I know of the existence of parse library]
There is this module called parse which provides an opposite to format() functionality:
Parse strings using a specification based on the Python format() syntax.
>>> from parse import parse
>>> s = "my_index-2011"
>>> f = "my_index-{year}"
>>> parse(f, s)['year']
'2011'
An alternative option, since you are extracting a year, would be to use the dateutil parser in fuzzy mode:
>>> from dateutil.parser import parse
>>> parse("my_index-2011", fuzzy=True).year
2011
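If adding a dependency is not an option, the same idea can be approximated with the standard library by turning the template into a regular expression with one named group per field. This is only a sketch: template_to_regex is a hypothetical helper, it assumes simple {name} placeholders without format specs, and each field matches any non-empty text.

```python
import re

def template_to_regex(template):
    """Turn 'my_index-{year}' into a compiled regex with a named group per field."""
    pattern = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", template):
        # Escape the literal text before the placeholder, then add a named group.
        pattern += re.escape(template[pos:m.start()])
        pattern += f"(?P<{m.group(1)}>.+?)"
        pos = m.end()
    pattern += re.escape(template[pos:])
    return re.compile(pattern)

regex = template_to_regex("my_index-{year}")
match = regex.fullmatch("my_index-2011")
year = match.group("year") if match else None
print(year)
```

Using fullmatch means a string that does not follow the template gives None rather than a partial match.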
Use the split() string function to split the string into two parts around the dash, then grab just the second part.
mystring = "my_index-2011"
year = mystring.split("-")[1]
I assume "year" is 4 digits and you have multiple indexes:
import re
# Sample input and the index names to look for
data = 'adsf-1234 fsfdr lkjdfaif ln ewr-1234 adsferggs sfdgrsfgadsf-3456'
idx = ['ewr', 'adsf']
res = ''
# One pattern per index: the index name, a dash, then four digits
patterns = ['%s-[0-9]{4}' % index for index in idx]
for index, pattern in zip(idx, patterns):
    # Strip the "index-" prefix from each match, keeping only the year
    res += ' '.join(re.findall(pattern, data)).replace(index + '-', '') + ' '
print(res)
Output:
1234 1234 3456
Yes, a regex would be helpful here.
In [1]: import re
In [2]: s = 'my_string-2014'
In [3]: print( re.search(r'\d{4}', s).group(0) )
2014
Edit: I should have mentioned your regex can be more sophisticated. You can haul out a subcomponent of a more specific string, for example:
In [4]: print( re.search(r'my_string-(\d{4})$', s).group(1) )
2014
Given the problem you presented, I think any "find the year" formula should be expressible in terms of a regular expression.
You are going to want to use the string method split to split on "-", and then catch the last element as your year:
year = "any_index-2016".split("-")[-1]
Because you take the last element (using -1 as the index), your index name can contain hyphens and you will still extract the year correctly.

How to replace string with certain format in python

I am trying to do string manipulation based on a format. str.replace(old, new) allows replacing a specific substring. Is it possible to find and replace by format? For example,
I want to find every datetime-like value in a long string and replace it with another format,
assuming % is a wildcard for a digit and the datetime is %%/%%/%%T%%:%%
str.replace(%%/%%/%%T%%:%%, 'dummy value')
EDIT:
Sorry, I should have been clearer. It seems I can use re.sub, but how do I substitute each match with a converted date value? In this case, e.g.
YY/MM/DDTHH:MM to (YY/MM/DD HH:MM)+8 hours
The easiest way to do this is probably using a combination of regular expression syntax, applying re.sub and using the fact that the repl parameter can be a function that takes a match and returns a string to replace it, and datetime's syntax for strptime and strftime:
>>> from datetime import datetime
>>> import re
>>> def replacer(match):
...     return datetime.strptime(
...         match.group(),  # matched text
...         '%y/%m/%dT%H:%M',  # source format in datetime syntax
...     ).strftime('%d %B %Y at %H.%M')  # destination format in datetime syntax
...
>>> re.sub(
...     r'\d{2}/\d{2}/\d{2}T\d{2}:\d{2}',  # source format in regex syntax
...     replacer,  # function to process match
...     'The date and time was 12/12/12T12:12 exactly.',  # string to process
... )
'The date and time was 12 December 2012 at 12.12 exactly.'
The only downside of this is that you need to define the source format in both datetime and re syntax, which isn't very DRY; if they don't match, you'll get nowhere.
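For the conversion asked about in the edit (YY/MM/DDTHH:MM to YY/MM/DD HH:MM plus 8 hours), the same replacement-function technique can be combined with timedelta. This is a sketch under the assumption that the two-digit year comes first, as written in the question; the constant and function names and the sample sentence are made up for illustration.

```python
import re
from datetime import datetime, timedelta

SOURCE_REGEX = r"\d{2}/\d{2}/\d{2}T\d{2}:\d{2}"  # source format in regex syntax
SOURCE_FORMAT = "%y/%m/%dT%H:%M"                  # source format in datetime syntax
TARGET_FORMAT = "%y/%m/%d %H:%M"                  # destination format

def shift_by_eight_hours(match):
    """Parse the matched text, add 8 hours, and render it in the target format."""
    parsed = datetime.strptime(match.group(), SOURCE_FORMAT)
    return (parsed + timedelta(hours=8)).strftime(TARGET_FORMAT)

text = "Started at 12/12/31T20:30 sharp."
result = re.sub(SOURCE_REGEX, shift_by_eight_hours, text)
print(result)
```

Doing the arithmetic on a datetime object means rollovers into the next day, month, or year are handled automatically. A match that is not a real date (for example month 13) makes strptime raise ValueError, so wrap the call if such text can occur.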

Number Trouble with Regex in Python

I'm trying to filter a date retrieved from a .csv file, but no combination I try seems to work. The date comes in as "2011-10-01 19:25:01" or "year-month-date hour:min:sec".
I want just the year, month and day, but I can't seem to get rid of the time in the string:
date = bug[2] # Column in which the date is located
date = date.replace('\"','') #getting rid of the quotations
mdate = date.replace(':','')
re.split('$[\d]+',mdate) # trying to get rid of the trailing set of number (from the time)
Thanks in advance for the advice.
If your source is a string, you're probably better off using strptime:
import datetime
string = "2011-10-01 19:25:01"
dt = datetime.datetime.strptime(string, "%Y-%m-%d %H:%M:%S")
After that, use
dt.year
dt.month
dt.day
to access the data you want.
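If the goal is the date as a single string rather than separate fields, the parsed object can be reduced to its date part. A small sketch, assuming the value still carries the surrounding quotation marks from the CSV as in the question:

```python
from datetime import datetime

raw = '"2011-10-01 19:25:01"'

# Remove the surrounding quotes, then parse the full timestamp.
dt = datetime.strptime(raw.strip('"'), "%Y-%m-%d %H:%M:%S")

# .date() drops the time; isoformat() renders it as YYYY-MM-DD.
date_only = dt.date().isoformat()
print(date_only)
```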
Use datetime to parse your input as a datetime object, then output it in whatever format you like: http://docs.python.org/library/datetime.html
I think you're confusing the circumflex for start of line and dollar for end of line. Try ^[\d-]+.
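A short sketch of how that suggested pattern could be applied, assuming the quotation marks are stripped first (re.match already anchors at the start of the string, so the ^ is optional there):

```python
import re

raw = '"2011-10-01 19:25:01"'
cleaned = raw.strip('"')

# [\d-]+ takes digits and dashes from the start and stops at the space.
match = re.match(r"[\d-]+", cleaned)
date_part = match.group() if match else None
print(date_part)
```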
If the format is always "YYYY-MM-DD HH:mm:ss", then try this:
date = date[1:11]
In a prompt:
>>> date = '"2012-01-12 15:13:20"'
>>> date[1:11]
'2012-01-12'
>>>
No need for regex
>>> date = '"2011-10-01 19:25:01"'
>>> date.strip('"').split()[0]
'2011-10-01'
One problem with your code is that in your last regular expression, $ matches the end of the string, so that regular expression will never match anything. You could do this much more simply by splitting by spaces and only taking the first result. After removing the quotation marks, the line
date.split()
will return ["2011-10-01", "19:25:01"], so the first element of that list is what you need.
