python regex for time format, is this a good method?

I saw this in a textbook to match the time format with a regex:
import re

t = '19:05:30'
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
# hours limited to 0X, 1X, 2[0-3], or a single digit X
# minutes limited to 0X, 1X, 2X, 3X, 4X, 5X, or a single digit X
# same for seconds
print(m.groups())
# >>> ('19', '05', '30')
Obviously it works, but it seems quite redundant. Can I use this instead?
m = re.match(r'^(2[0-3]|[0-1][0-9]|[0-9])\:([0-5][0-9])\:([0-5][0-9])$', t)
print(m.groups())
# >>> ('19', '05', '30')
I am quite new to regex, and I am not sure I can easily write something better than a textbook, but I can't find anything wrong with it.
Thanks,
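One thing worth noting before accepting the shorter pattern: the two are not quite equivalent. The textbook version also accepts single-digit minutes and seconds (e.g. '9:5:3'), while the shortened version requires two digits there. A quick comparison (illustrative test values, not an exhaustive proof; the unnecessary backslashes before ':' are dropped here):

```python
import re

long_p = re.compile(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9]):(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9]):(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$')
short_p = re.compile(r'^(2[0-3]|[0-1][0-9]|[0-9]):([0-5][0-9]):([0-5][0-9])$')

# Both patterns agree on these inputs, valid and invalid:
for t in ['19:05:30', '23:59:59', '24:00:00', '19:65:30']:
    assert bool(long_p.match(t)) == bool(short_p.match(t))

# But they disagree on single-digit minutes/seconds:
print(bool(long_p.match('9:5:3')), bool(short_p.match('9:5:3')))  # True False
```

Whether that difference matters depends on what inputs you expect; if minutes and seconds are always zero-padded, the short version is fine.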

Related

How to avoid hard coding in python for replacing words to dates?

import datetime
import re

text = "Going out with Sahi to Ikebukuro next week around 4PM or 16:30"
dt_now = datetime.datetime.now()
print('Date and time now:', dt_now.strftime('%Y/%m/%d %H:%M:%S'))
text = re.sub(r'(today)', f'{dt_now.month}/{dt_now.day}', text)
text = re.sub(r'(tomorrow)', f'{dt_now.month}/{dt_now.day + 1}', text)
text = re.sub(r'(the day after tomorrow)', f'{dt_now.month}/{dt_now.day + 2}', text)
text = re.sub(r'(in 2 days)', f'{dt_now.month}/{dt_now.day + 2}', text)
text = re.sub(r'(in 3 days)', f'{dt_now.month}/{dt_now.day + 3}', text)
text = re.sub(r'(yesterday)', f'{dt_now.month}/{dt_now.day - 1}', text)
text = re.sub(r'(next week)', f'{dt_now.month}/{dt_now.day + 7}', text)
text = re.sub(r'(in a month)', f'{dt_now.month + 1}/{dt_now.day}', text)
print(text)
In the code above, I have tried to convert any date-like words directly to absolute dates, and hence hard-coded the solution. However, is there a way I can soft-code it?
The timedelta object from the datetime standard library allows you to express relative times.
You could map each verbal expression to a separate timedelta object, though of course, you still can't resolve expressions like "Thursday the week after next" without knowing today's date.
from datetime import timedelta
import datetime
import re

reltimes = dict()
for expr, kwargs in (
        ('today', {'days': 0}),
        ('tomorrow', {'days': 1}),
        ('the day after tomorrow', {'days': 2}),
        ('in 2 days', {'days': 2}),
        ('in 3 days', {'days': 3}),  # XXX extend with a regex r"in \d+ days"?
        ('yesterday', {'days': -1}),
        ('next week', {'days': 7}),
        ('in a month', {'days': 30}),  # XXX hardcodes a 30-day month
        ('in an hour', {'hours': 1}),  # extra example
        ):
    reltimes[expr] = timedelta(**kwargs)

def datesub(reldate):
    absdate = datetime.datetime.now() + reldate
    return "%i/%i" % (absdate.month, absdate.day)

# Substitute the longer expressions first, so "tomorrow" doesn't clobber
# "the day after tomorrow".
for repl in sorted(reltimes.keys(), key=len, reverse=True):
    text = re.sub(repl, lambda x: datesub(reltimes[repl]), text)
Unfortunately, timedelta doesn't easily let you express "this day in another month" so this code would need some elaboration if you are not happy with hardcoding 30-day months. See e.g. How do I calculate the date six months from the current date using the datetime Python module?
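If the hardcoded 30-day month bothers you, a small standard-library helper can do real month stepping. This is only a sketch; the clamp-to-month-end policy (e.g. Jan 31 + 1 month becomes Feb 28) is my assumption, not the only reasonable choice:

```python
import calendar
import datetime

def add_months(d, months):
    # Step to the target month, clamping the day to that month's length
    # (e.g. Jan 31 + 1 month -> Feb 28 or 29).
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return d.replace(year=year, month=month, day=day)

print(add_months(datetime.date(2023, 1, 31), 1))  # 2023-02-28
```

The `dateutil` package's `relativedelta` does the same thing more flexibly if a third-party dependency is acceptable.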
As an aside, you have to make sure you substitute the longer strings before the shorter ones. Your code had the bug that "tomorrow" would be substituted before "the day after tomorrow", so the input to the latter would be "the day after 12/2" because "tomorrow" was already replaced.
Three recommendations from my side:
A simple improvement to limit hard-coding and simplify your code is to use Python's regular expressions library to extract the numbers:
>>> import re
>>> string_to_parse = 'in 2 days'
>>> re.findall(r'\d+', string_to_parse)
['2']
>>> [int(x) for x in re.findall(r'\d+', string_to_parse)]
[2]
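Combining that with the timedelta idea, a single generic rule could replace all the hardcoded "in N days" entries. The function name and the month/day output format here are just illustrative:

```python
import datetime
import re

def sub_in_n_days(text, now=None):
    # Replace every occurrence of "in N days" with an absolute month/day.
    now = now or datetime.datetime.now()

    def repl(m):
        target = now + datetime.timedelta(days=int(m.group(1)))
        return "%d/%d" % (target.month, target.day)

    return re.sub(r'in (\d+) days', repl, text)

print(sub_in_n_days("back in 10 days", now=datetime.datetime(2023, 3, 1)))  # back 3/11
```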
Leverage existing work, like: https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
If you want a highly advanced solution, I recommend a deep dive into the Natural Language Toolkit (NLTK); here is a nice explanation and example of how to tokenize text and extract relationships: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

How to perform pattern matching between two strings?

I want to have a good pattern matching code which can exactly match between both strings.
x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"
Approach 1:
import string

tmp_body = " ".join("".join([" " if ch in string.punctuation else ch.lower() for ch in y]).split())
tmp_body_1 = " ".join("".join([" " if ch in string.punctuation else ch.lower() for ch in x]).split())
if tmp_body in tmp_body_1:
    print "true"
In my problem, x will always be the base string and y will change.
Approach 2:
Fuzzy logic, but I was not getting good results with it.
Approach 3:
Using regex, which I don't know.
I am still figuring out ways to solve it with regex. So far I have worked out these steps:
removal of special characters from both the base and incoming strings;
matching the GB and the color;
splitting the GB from the number for better matching.
How about the following approach: split each string into words, lowercase each word, and store them in a set. x must then be a subset of y. For your example it will fail, as 16 does not match 64:
x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"
set_x = set([item.lower() for item in re.findall("([a-zA-Z0-9]+)", x)])
set_y = set([item.lower() for item in re.findall("([a-zA-Z0-9]+)", y)])
print set_x
print set_y
print set_x.issubset(set_y)
Giving the following results:
set(['apple', '16', 'gb', '6', 'silver', 'iphone'])
set(['apple', 'mobile', 'phone', '64', 'gb', '6', 'gsm', 'silver', 'iphone'])
False
If 64 is changed to 16 then you get:
set(['apple', '16', 'gb', '6', 'silver', 'iphone'])
set(['apple', '16', 'mobile', 'phone', 'gb', '6', 'gsm', 'silver', 'iphone'])
True
Looks like you're trying to do longest common substring here of two unknown strings.
Find common substring between two strings
Regex only works when you have a known pattern to your strings. You could use LCS to derive a pattern that you could use to test additional strings, but I don't think that's what you want.
If you are wanting to extract the capacity, model, and other information from these strings, you may want to use multiple patterns to find each piece of information. Some information may not be available. Your regular expressions will need to flex in order to handle a wider input (hard for me to assume all variations given a sample size of 2).
capacity = re.search(r'(\d+)\s*GB', useragent)
model = re.search(r'Apple iPhone ([A-Za-z0-9]+)', useragent)
These patterns won't make much sense to you unless you read the Python re module documentation. Basically, for capacity, I'm searching for 1 or more digits followed by 0 or more whitespace followed by GB. If I find a match, the result is a match object and I can get the capacity with match.group(). Similar story for finding iPhone version, though my pattern doesn't work for "6 Plus".
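Putting those two patterns together on the sample string (keeping the `useragent` variable name from the snippet above):

```python
import re

useragent = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"

# One or more digits, optional whitespace, then the literal "GB".
capacity = re.search(r'(\d+)\s*GB', useragent)
# The token right after "Apple iPhone" -- note this grabs only "6",
# not "6 Plus".
model = re.search(r'Apple iPhone ([A-Za-z0-9]+)', useragent)

print(capacity.group(1))  # 64
print(model.group(1))     # 6
```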
Since you have no control over the generation of these strings, if this is a script that you plan on using three years from now, expect to be a slave to it, updating the regular expression patterns as new string formats appear. Hopefully this is a one-off number-crunching exercise that can be scrapped as soon as you've answered your question.

Python: help composing regex pattern

I'm just learning python and having a problem figuring out how to create the regex pattern for the following string
"...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."
I'm trying to extract the data between the begin: and :end for n iterations without getting duplicate data. I've attached my current attempt.
for m in re.finditer('.begin:(.*),(.*):(.*):(.*:.*):end.', list_to_string(j), re.DOTALL):
    print m.group(1)
    print m.group(2)
    print m.group(3)
    print m.group(4)
the output is:
begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33
13
2
2006-11-31 T 11:46
and I want it to be:
32
12
1
2005-10-30 T 10:45
33
13
2
2006-11-31 T 11:46
Thank you for any help.
.* is greedy, matching across your intended :end boundary. Replace all .*s with lazy .*?.
>>> s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""
>>> re.findall("begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46'),
('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
With a modified pattern, forcing single quotes to be present at the start/end of the match:
>>> re.findall("'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
You need to make the variable-sized parts of your pattern "non-greedy". That is, make them match the smallest possible string rather than the longest possible (which is the default).
Try the pattern '.begin:(.*?),(.*?):(.*?):(.*?:.*?):end.'.
Another option to Blckknght and Tim Pietzcker's is
re.findall("begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)
Instead of choosing non-greedy extensions, you use [^X] to mean "any character but X" for some X.
The advantage is that it's more rigid: there's no way to get the delimiter in the result, so
'begin:33,13:134:2:2006-11-31 T 11:46:end'
would not match, whereas it would for Blckknght and Tim Pietzcker's. For this reason, it's also probably faster on edge cases. This is probably unimportant in real-world circumstances.
The disadvantage is that it's more rigid, of course.
I suggest choosing whichever one makes more intuitive sense, since both methods work.
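To make the rigidity difference concrete, here is the edge case mentioned above, with an extra `134` field smuggled in between the delimiters:

```python
import re

s = "begin:33,13:134:2:2006-11-31 T 11:46:end"

# Lazy quantifiers: still matches, silently absorbing the extra field.
lazy = re.findall(r"begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
# Negated character classes: no match, the extra field is rejected.
strict = re.findall(r"begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)

print(lazy)
print(strict)  # []
```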

Python: Regex outputs 12_34 - I need 1234

So I have input coming in as follows: 12_34 5_6_8_2 4_____3 1234
and the output I need from it is: 1234, 5682, 43, 1234
I'm currently working with r'[0-9]+[0-9_]*'.replace('_',''), which, as far as I can tell, successfully rejects any input which is not a combination of numeric digits and under-scores, where the underscore cannot be the first character.
However, replacing the _ with the empty string causes 12_34 to come out as 12 and 34.
Is there a better method than 'replace' for this? Or could I adapt my regex to deal with this problem?
EDIT: I was responding to questions in the comments below and realised this might be better specified up here.
So, the broad aim is to take a long input string, a small example being:
"12_34 + 'Iamastring#' I_am_an_Ident"
and return:
('NUMBER', 1234), ('PLUS', '+'), ('STRING', 'Iamastring#'), ('IDENT', 'I_am_an_Ident')
I didn't want to go through all that because I've got it all working as specified, except for number.
The solution code looks something like:
tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE',
          'IDENT', 'STRING', 'NUMBER')
t_PLUS = "+"
t_MINUS = '-'
and so on, down to:
t_NUMBER = ###code goes here
I'm not sure how to put multi-line processes into t_NUMBER
I'm not sure what you mean or why you need a regex, but maybe this helps:
In [1]: ins = '12_34 5_6_8_2 4_____3 1234'
In [2]: for x in ins.split(): print x.replace('_', '')
1234
5682
43
1234
EDIT in response to the edited question:
I'm still not quite sure what you're doing with tokens there, but I'd do something like this (at least it makes sense to me):
input_str = "12_34 + 'Iamastring#' I_am_an_Ident"
tokens = ('NUMBER', 'SIGN', 'STRING', 'IDENT')
data = dict(zip(tokens, input_str.split()))
This would give you
{'IDENT': 'I_am_an_Ident',
'NUMBER': '12_34',
'SIGN': '+',
'STRING': "'Iamastring#'"}
Then you could do
data['NUMBER'] = int(data['NUMBER'].replace('_', ''))
and anything else you like.
P.S. Sorry if it doesn't help, but I really don't see the point of having tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'IDENT', 'STRING', 'NUMBER'), etc.
>>> a = '12_34 5_6_8_2 4___3 1234'
>>> a.replace('_', '').replace(' ', ', ')
'1234, 5682, 43, 1234'
The phrasing of your question is a little bit unclear. If you don't care about input validation, the following should work:
input = '12_34 5_6_8_2 4_____3 1234'
re.sub(r'\s+', ', ', input.replace('_', ''))
If you need to actually strip out all characters which are not either digits or whitespace and add commas between the numbers, then:
re.sub(r'\s+', ', ', re.sub(r'[^\d\s]', '', input))
...should accomplish the task. Of course, it would probably be more efficient to write a function that only has to walk through the string once rather than using multiple re.sub() calls.
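That single-pass idea could look something like this (a sketch; the function name and the drop-everything-else policy are my assumptions):

```python
def digits_and_commas(s):
    # Walk the string once: keep digits, turn runs of whitespace into
    # ", " separators, and silently drop every other character.
    out = []
    prev_space = False
    for ch in s:
        if ch.isdigit():
            if prev_space and out:
                out.append(', ')
            out.append(ch)
            prev_space = False
        elif ch.isspace():
            prev_space = True
        # any other character (e.g. '_') is dropped without
        # breaking the current number
    return ''.join(out)

print(digits_and_commas('12_34 5_6_8_2 4_____3 1234'))  # 1234, 5682, 43, 1234
```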
You seem to be doing something like:
>>> data = '12_34 5_6_8_2 4_____3 1234'
>>> pattern = '[0-9]+[0-9_]*'
>>> re.findall(pattern, data)
['12_34', '5_6_8_2', '4_____3', '1234']
re.findall(pattern.replace('_', ''), data)
['12', '34', '5', '6', '8', '2', '4', '3', '1234']
The issue is that pattern.replace isn't a signal to re to remove the _s from the matches; it simply changes your regex to '[0-9]+[0-9]*'. What you want is to call replace on the results rather than on the pattern, e.g.:
>>> [match.replace('_', '') for match in re.findall(pattern, data)]
['1234', '5682', '43', '1234']
Also note that your regex can be simplified slightly; I will leave out the details of how since this is homework.
Well, if you really have to use re and only re, you could do this:
import re

def replacement(match):
    separator_dict = {
        '_': '',
        ' ': ',',
    }
    for sep, repl in separator_dict.items():
        if all(char == sep for char in match.group(2)):
            return match.group(1) + repl + match.group(3)

def rec_sub(s):
    """
    Recursive so it works with any number of numbers separated by underscores.
    """
    new_s = re.sub(r'(\d+)([_ ]+)(\d+)', replacement, s)
    if new_s == s:
        return new_s
    else:
        return rec_sub(new_s)
But that epitomizes the concept of overkill.

regex - how to recognise a pattern until a second one is found

I have a file, named a particular way. Let's say it's:
tv_show.s01e01.episode_name.avi
it's the standard way a video file of a tv show's episode is named on the net. The pattern is quite the same all over the web, so I want to extract some information from a file named this way. Basically I want to get:
the show's title;
the season number s01;
the episode number e01;
the extension.
I'm using a Python 3 script to do so. This test file is pretty simple, because all I have to do is this:
import re

def acquire_info(f="tv_show.s01e01.episode_name.avi"):
    tvshow_title = title_p.match(f).group()
    numbers = numbers_p.search(f).group()
    season_number = numbers.split("e")[0].split("s")[1]
    ep_number = numbers.split("e")[1]
    return [tvshow_title, season_number, ep_number]

if __name__ == '__main__':
    # re.I stands for the option "ignorecase"
    title_p = re.compile("^[a-z]+", re.I)
    numbers_p = re.compile(r"s\d{1,2}e\d{1,2}", re.I)
    print(acquire_info())
and the output is as expected ['tv_show', '01', '01']. But what if my file name is like this other one? some.other.tv.show.s04e05.episode_name.avi.
How can I build a regex that gets all the text BEFORE the "s\d{1,2}e\d{1,2}" pattern is found?
P.S. I know I didn't put the code to get the extension in the example, but that's not my problem here, so it does not matter.
Try this:
show_p = re.compile(r"(.*)\.s(\d*)e(\d*)")
show_p.match(x).groups()
where x is your string.
Edit (I forgot to include the extension; here is the revision):
show_p = re.compile(r"^(.*)\.s(\d*)e(\d*).*?([^\.]*)$")
show_p.match(x).groups()
And here is the test result:
>>> show_p=re.compile("(.*)\.s(\d*)e(\d*).*?([^\.]*)$")
>>> x="tv_show.s01e01.episode_name.avi"
>>> show_p.match(x).groups()
('tv_show', '01', '01', 'avi')
>>> x="tv_show.s2e1.episode_name.avi"
>>> show_p.match(x).groups()
('tv_show', '2', '1', 'avi')
>>> x='some.other.tv.show.s04e05.episode_name.avi'
>>> show_p.match(x).groups()
('some.other.tv.show', '04', '05', 'avi')
Here is one option: use capturing groups to extract all of the info you want in one step:
>>> show_p = re.compile(r'(.*?)\.s(\d{1,2})e(\d{1,2})')
>>> show_p.match('some.other.tv.show.s04e05.episode_name.avi').groups()
('some.other.tv.show', '04', '05')
Python does support named captures, though with the `(?P<name>...)` syntax rather than `(?<name>...)`, so something general like this should work:
^(?P<Title>.+)\.s(?P<Season>\d{1,2})e(?P<Episode>\d{1,2})\..*?(?P<Extension>[^.]+)$
If you prefer, just use normal numbered groups.
A problem could occur if the title itself has a .s2e1. part that masks the real season/episode part; that would require more logic. The regex above assumes that the title/season/episode/extension all exist and that the s/e marker is the farthest one to the right.
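For reference, here is that idea run in Python's named-group syntax on the harder filename (the lowercase group names are arbitrary):

```python
import re

show_p = re.compile(r'^(?P<title>.+)\.s(?P<season>\d{1,2})e(?P<episode>\d{1,2})\..*?(?P<ext>[^.]+)$')
m = show_p.match('some.other.tv.show.s04e05.episode_name.avi')
print(m.groupdict())
# {'title': 'some.other.tv.show', 'season': '04', 'episode': '05', 'ext': 'avi'}
```

groupdict() gives you all four pieces by name in one call, which reads better than numbered groups once the pattern grows.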
