I'm just learning Python and having trouble figuring out how to write the regex pattern for the following string:
"...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."
I'm trying to extract the data between the begin: and :end for n iterations without getting duplicate data. I've attached my current attempt.
for m in re.finditer('.begin:(.*),(.*):(.*):(.*:.*):end.', list_to_string(j), re.DOTALL):
    print m.group(1)
    print m.group(2)
    print m.group(3)
    print m.group(4)
the output is:
begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33
13
2
2006-11-31 T 11:46
and I want it to be:
32
12
1
2005-10-30 T 10:45
33
13
2
2006-11-31 T 11:46
Thank you for any help.
.* is greedy, matching across your intended :end boundary. Replace all .*s with lazy .*?.
>>> s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""
>>> re.findall("begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46'),
('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
With a modified pattern, forcing single quotes to be present at the start/end of the match:
>>> re.findall("'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
You need to make the variable-sized parts of your pattern "non-greedy". That is, make them match the smallest possible string rather than the longest possible (which is the default).
Try the pattern '.begin:(.*?),(.*?):(.*?):(.*?:.*?):end.'.
An alternative to Blckknght's and Tim Pietzcker's answers is
re.findall("begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)
Instead of using non-greedy quantifiers, you use [^X] to mean "any character but X" for some delimiter X.
The advantage is that it's more rigid: there's no way to get the delimiter in the result, so
'begin:33,13:134:2:2006-11-31 T 11:46:end'
would not match, whereas it would with Blckknght's and Tim Pietzcker's patterns. For this reason, it's also probably faster on edge cases. This is probably unimportant in real-world circumstances.
The disadvantage is that it's more rigid, of course.
I suggest choosing whichever one makes more intuitive sense, because both methods work.
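To make the difference concrete, here is a quick check against the malformed record above: the lazy pattern still matches (absorbing the extra field into the last group), while the character-class version rejects the record entirely.

```python
import re

s = 'begin:33,13:134:2:2006-11-31 T 11:46:end'  # record with an extra field

lazy = re.findall(r"begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
strict = re.findall(r"begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)

print(lazy)    # [('33', '13', '134', '2:2006-11-31 T 11:46')]
print(strict)  # []
```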
Related
I saw this in a textbook, matching a time format with a regex:
t = '19:05:30'
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
# hours limited to 0X, 1X, 2[0-3], or a single digit
# minutes limited to 0X-5X or a single digit
# seconds: same as minutes
print(m.groups())
# >>> ('19', '05', '30')
Obviously it works, but it seems quite redundant. Can I use:
m = re.match(r'^(2[0-3]|[0-1][0-9]|[0-9])\:([0-5][0-9])\:([0-5][0-9])$', t)
print(m.groups())
# >>> ('19', '05', '30')
I am quite new to regex, and I am not sure I can easily write something better than a textbook, but I can't find anything wrong with it.
Thanks,
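One way to sanity-check the shortened version (a quick harness I wrote, not from the textbook) is to run both patterns over a few candidate strings. Note one real difference: the textbook pattern's trailing [0-9] alternatives also accept single-digit minutes and seconds, which [0-5][0-9] does not.

```python
import re

textbook = re.compile(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])'
                      r'\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])'
                      r'\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$')
short = re.compile(r'^(2[0-3]|[0-1][0-9]|[0-9])\:([0-5][0-9])\:([0-5][0-9])$')

for t in ['19:05:30', '23:59:59', '9:5:3', '24:00:00']:
    print(t, bool(textbook.match(t)), bool(short.match(t)))
# 19:05:30 True True
# 23:59:59 True True
# 9:5:3 True False   <- textbook allows single-digit minutes/seconds
# 24:00:00 False False
```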
I want pattern-matching code that can exactly match the two strings below.
x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"
Approach 1:
tmp_body = " ".join("".join([" " if ch in string.punctuation else ch.lower() for ch in y]).split())
tmp_body_1 = " ".join("".join([" " if ch in string.punctuation else ch.lower() for ch in x]).split())
if tmp_body in tmp_body_1:
    print "true"
In my problem x will always be the base string and y will change.
Approach 2:
Fuzzy matching, but I was not getting good results with it.
Approach 3:
Using regex, which I don't know well.
I am still figuring out ways to solve it with regex.
Removal of special characters from both base and incoming string
Matches the GB and Color
Splitting the GB from the number for good matching
These things I have figured out.
How about the following approach: split each string into words, lowercase each word, and store them in a set. x must then be a subset of y. So for your example it will fail, as 16 does not match 64:
x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"
set_x = set([item.lower() for item in re.findall("([a-zA-Z0-9]+)", x)])
set_y = set([item.lower() for item in re.findall("([a-zA-Z0-9]+)", y)])
print set_x
print set_y
print set_x.issubset(set_y)
Giving the following results:
set(['apple', '16', 'gb', '6', 'silver', 'iphone'])
set(['apple', 'mobile', 'phone', '64', 'gb', '6', 'gsm', 'silver', 'iphone'])
False
If 64 is changed to 16 then you get:
set(['apple', '16', 'gb', '6', 'silver', 'iphone'])
set(['apple', '16', 'mobile', 'phone', 'gb', '6', 'gsm', 'silver', 'iphone'])
True
Looks like you're trying to do longest common substring here on two unknown strings.
Find common substring between two strings
Regex only works when you have a known pattern to your strings. You could use LCS to derive a pattern that you could use to test additional strings, but I don't think that's what you want.
If you are wanting to extract the capacity, model, and other information from these strings, you may want to use multiple patterns to find each piece of information. Some information may not be available. Your regular expressions will need to flex in order to handle a wider input (hard for me to assume all variations given a sample size of 2).
capacity = re.search(r'(\d+)\s*GB', useragent)
model = re.search(r'Apple iPhone ([A-Za-z0-9]+)', useragent)
These patterns won't make much sense to you unless you read the Python re module documentation. Basically, for capacity, I'm searching for 1 or more digits followed by 0 or more whitespace followed by GB. If I find a match, the result is a match object and I can get the capacity with match.group(). Similar story for finding iPhone version, though my pattern doesn't work for "6 Plus".
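For instance, applied to the second string from the question (useragent above was just my name for whatever input string you're scanning):

```python
import re

y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"

capacity = re.search(r'(\d+)\s*GB', y)
model = re.search(r'Apple iPhone ([A-Za-z0-9]+)', y)

# Guard against a missing match before calling .group()
print(capacity.group(1) if capacity else None)  # 64
print(model.group(1) if model else None)        # 6
```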
Since you have no control over the generation of these strings, if this is a script that you plan on using 3 years from now, expect to be a slave to it, updating the regular expression patterns as new string formats become available. Hopefully this is a one-off number crunching exercise that can be scrapped as soon as you answered your question.
I'm trying to match the output given by a modem when asked about the network info. It looks like this:
Network survey started...
For BCCH-Carrier:
arfcn: 15,bsic: 4,dBm: -68
For non BCCH-Carrier:
arfcn: 10,dBm: -72
arfcn: 6,dBm: -78
arfcn: 11,dBm: -81
arfcn: 14,dBm: -83
arfcn: 16,dBm: -83
So I have two kinds of lines to match, the BCCH and non-BCCH ones. The following code is almost working:
match = re.findall('(?:arfcn: (\d*),dBm: (-\d*))|(?:arfcn: (\d*),bsic: (\d*),dBm: (-\d*))', data)
But it seems that groups from BOTH alternatives appear in every match, with the unmatched fields left blank:
>>> match
[('', '', '15', '4', '-68'), ('10', '-72', '', '', ''), ('6', '-78', '', '', ''), ('11', '-81', '', '', ''), ('14', '-83', '', '', ''), ('16', '-83', '', '', '')]
Can anyone help? Why this behaviour? I've tried changing the order of the expressions, with no luck.
Thanks!
That is how capturing groups work. Since you have five of them, there will always be five parts returned.
Based on your data, I think you could simplify your regex by making the bsic part optional. That way each row would return three parts, the middle one being empty for non BCCH-Carriers.
match = re.findall('arfcn: (\d*)(?:,bsic: (\d*))?,dBm: (-\d*)', data)
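Run against the sample output from the question, that gives three-tuples with an empty middle element for the non-BCCH rows:

```python
import re

data = """For BCCH-Carrier:
arfcn: 15,bsic: 4,dBm: -68
For non BCCH-Carrier:
arfcn: 10,dBm: -72
arfcn: 6,dBm: -78"""

# bsic is wrapped in an optional non-capturing group
match = re.findall(r'arfcn: (\d*)(?:,bsic: (\d*))?,dBm: (-\d*)', data)
print(match)
# [('15', '4', '-68'), ('10', '', '-72'), ('6', '', '-78')]
```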
You have an expression with 5 groups.
The fact that you have 2 of those in one alternative and the other 3 in a mutually exclusive alternative doesn't change that fact. Either 2 or 3 of the groups are going to be empty, depending on which line you matched.
If you have to match either line with one expression, there is no way around this. You can use named groups (and return a dictionary of matched groups) to make this a little easier to manage, but you will always end up with empty groups.
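A sketch of the named-group variant (the group names are my own choice; unmatched optional groups come back as None in groupdict()):

```python
import re

data = """arfcn: 15,bsic: 4,dBm: -68
arfcn: 10,dBm: -72"""

pattern = re.compile(
    r'arfcn: (?P<arfcn>\d+)(?:,bsic: (?P<bsic>\d+))?,dBm: (?P<dbm>-\d+)')

# One dictionary per matched line; bsic is None for non-BCCH rows
rows = [m.groupdict() for m in pattern.finditer(data)]
print(rows)
# [{'arfcn': '15', 'bsic': '4', 'dbm': '-68'},
#  {'arfcn': '10', 'bsic': None, 'dbm': '-72'}]
```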
So I have input coming in as follows: 12_34 5_6_8_2 4_____3 1234
and the output I need from it is: 1234, 5682, 43, 1234
I'm currently working with r'[0-9]+[0-9_]*'.replace('_',''), which, as far as I can tell, successfully rejects any input which is not a combination of numeric digits and under-scores, where the underscore cannot be the first character.
However, replacing the _ with the empty string causes 12_34 to come out as 12 and 34.
Is there a better method than 'replace' for this? Or could I adapt my regex to deal with this problem?
EDIT: I was responding to questions in the comments below and realised it might be better to specify this up here.
So, the broad aim is to take a long input string, for example:
"12_34 + 'Iamastring#' I_am_an_Ident"
and return:
('NUMBER', 1234), ('PLUS', '+'), ('STRING', 'Iamastring#'), ('IDENT', 'I_am_an_Ident')
I didn't want to go through all that because I've got it all working as specified, except for number.
The solution code looks something like:
tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE',
'IDENT', 'STRING', 'NUMBER')
t_PLUS = r'\+'
t_MINUS = '-'
and so on, down to:
t_NUMBER = ###code goes here
I'm not sure how to put multi-line processes into t_NUMBER
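The t_NAME definitions above look like the PLY lexing library; assuming that's what this is, a token can be defined as a function whose docstring is the regex, and the post-processing then goes in the function body rather than in the pattern (a sketch, not tested against your full grammar):

```python
def t_NUMBER(t):
    r'[0-9]+[0-9_]*'
    # PLY calls this with the matched token; strip the underscores
    # from the matched text and convert to an int here, after matching.
    t.value = int(t.value.replace('_', ''))
    return t
```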
I'm not sure what you mean or why you need regex, but maybe this helps:
In [1]: ins = '12_34 5_6_8_2 4_____3 1234'
In [2]: for x in ins.split(): print x.replace('_', '')
1234
5682
43
1234
EDIT in response to the edited question:
I'm still not quite sure what you're doing with the tokens there, but I'd do something like this (at least it makes sense to me):
input_str = "12_34 + 'Iamastring#' I_am_an_Ident"
tokens = ('NUMBER', 'SIGN', 'STRING', 'IDENT')
data = dict(zip(tokens, input_str.split()))
This would give you
{'IDENT': 'I_am_an_Ident',
'NUMBER': '12_34',
'SIGN': '+',
'STRING': "'Iamastring#'"}
Then you could do
data['NUMBER'] = int(data['NUMBER'].replace('_', ''))
and anything else you like.
P.S. Sorry if it doesn't help, but I really don't see the point of having tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'IDENT', 'STRING', 'NUMBER'), etc.
>>> a='12_34 5_6_8_2 4___3 1234'
>>> a.replace('_','').replace(' ',', ')
'1234, 5682, 43, 1234'
>>>
The phrasing of your question is a little unclear. If you don't care about input validation, the following should work:
input = '12_34 5_6_8_2 4_____3 1234'
re.sub('\s+', ', ', input.replace('_', ''))
If you need to actually strip out all characters which are not either digits or whitespace and add commas between the numbers, then:
re.sub('\s+', ', ', re.sub('[^\d\s]', '', input))
...should accomplish the task. Of course, it would probably be more efficient to write a function that only has to walk through the string once rather than using multiple re.sub() calls.
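A rough sketch of such a single-pass function (my own, assuming underscores simply join digit groups and whitespace separates numbers):

```python
def digits_to_csv(s):
    # Walk the string once: collect digits, and emit a single
    # ', ' separator whenever whitespace was seen between digit runs.
    out = []
    pending_sep = False
    for ch in s:
        if ch.isdigit():
            if pending_sep and out:
                out.append(', ')
            pending_sep = False
            out.append(ch)
        elif ch.isspace():
            pending_sep = True
    return ''.join(out)

print(digits_to_csv('12_34 5_6_8_2 4_____3 1234'))  # 1234, 5682, 43, 1234
```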
You seem to be doing something like:
>>> data = '12_34 5_6_8_2 4_____3 1234'
>>> pattern = '[0-9]+[0-9_]*'
>>> re.findall(pattern, data)
['12_34', '5_6_8_2', '4_____3', '1234']
>>> re.findall(pattern.replace('_', ''), data)
['12', '34', '5', '6', '8', '2', '4', '3', '1234']
The issue is that pattern.replace isn't a signal to re to remove the _s from the matches, it changes your regex to: '[0-9]+[0-9]*'. What you want to do is to do replace on the results, rather than the pattern - eg,
>>> [match.replace('_', '') for match in re.findall(pattern, data)]
['1234', '5682', '43', '1234']
Also note that your regex can be simplified slightly; I will leave out the details of how since this is homework.
Well, if you really have to use re and only re, you could do this:
import re
def replacement(match):
    separator_dict = {
        '_': '',
        ' ': ',',
    }
    for sep, repl in separator_dict.items():
        if all(char == sep for char in match.group(2)):
            return match.group(1) + repl + match.group(3)

def rec_sub(s):
    """
    Recursive so it works with any number of numbers separated by underscores.
    """
    new_s = re.sub(r'(\d+)([_ ]+)(\d+)', replacement, s)
    if new_s == s:
        return new_s
    else:
        return rec_sub(new_s)
But that epitomizes the concept of overkill.
I have a file, named a particular way. Let's say it's:
tv_show.s01e01.episode_name.avi
it's the standard way a TV show episode's video file is named on the net. The pattern is much the same all over the web, so I want to extract some information from a file named this way. Basically I want to get:
the show's title;
the season number s01;
the episode number e01;
the extension.
I'm using a Python 3 script to do so. This test file is pretty simple, because all I have to do is this:
import re
def acquire_info(f="tv_show.s01e01.episode_name.avi"):
    tvshow_title = title_p.match(f).group()
    numbers = numbers_p.search(f).group()
    season_number = numbers.split("e")[0].split("s")[1]
    ep_number = numbers.split("e")[1]
    return [tvshow_title, season_number, ep_number]

if __name__ == '__main__':
    # re.I stands for the option "ignorecase"
    title_p = re.compile("^[a-z]+", re.I)
    numbers_p = re.compile("s\d{1,2}e\d{1,2}", re.I)
    print(acquire_info())
and the output is as expected ['tv_show', '01', '01']. But what if my file name is like this other one? some.other.tv.show.s04e05.episode_name.avi.
How can I build a regex that gets all the text BEFORE the "s\d{1,2}e\d{1,2}" pattern is found?
P.S. I didn't put in the example the code to get the extension, I know, but that's not my problem so it does not matter.
try this
show_p=re.compile("(.*)\.s(\d*)e(\d*)")
show_p.match(x).groups()
where x is your string
Edit (I forgot to include the extension; here is the revision):
show_p=re.compile("^(.*)\.s(\d*)e(\d*).*?([^\.]*)$")
show_p.match(x).groups()
And here is the test result:
>>> show_p=re.compile("(.*)\.s(\d*)e(\d*).*?([^\.]*)$")
>>> x="tv_show.s01e01.episode_name.avi"
>>> show_p.match(x).groups()
('tv_show', '01', '01', 'avi')
>>> x="tv_show.s2e1.episode_name.avi"
>>> show_p.match(x).groups()
('tv_show', '2', '1', 'avi')
>>> x='some.other.tv.show.s04e05.episode_name.avi'
>>> show_p.match(x).groups()
('some.other.tv.show', '04', '05', 'avi')
>>>
Here is one option, use capturing groups to extract all of the info you want in one step:
>>> show_p = re.compile(r'(.*?)\.s(\d{1,2})e(\d{1,2})')
>>> show_p.match('some.other.tv.show.s04e05.episode_name.avi').groups()
('some.other.tv.show', '04', '05')
I'm not a Python expert, but if it can do named captures, something general like this might work (Python spells named groups (?P<name>...)):
^(?P<Title>.+)\.s(?P<Season>\d{1,2})e(?P<Episode>\d{1,2})\..*?(?P<Extension>[^.]+)$
If no named groups, just use normal groups.
A problem could occur if the title has a .s2e1. part that masks the real season/episode part. That would require more logic. The regex above assumes that the title/season/episode/extension exist, and that the real s/e pair is the rightmost one.
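A runnable sketch of this idea using Python's (?P<name>...) syntax (pattern lightly adapted, same structure):

```python
import re

pattern = re.compile(
    r'^(?P<title>.+)\.s(?P<season>\d{1,2})e(?P<episode>\d{1,2})'
    r'\..*\.(?P<extension>[^.]+)$')

# groupdict() returns all named captures at once
m = pattern.match('some.other.tv.show.s04e05.episode_name.avi')
print(m.groupdict())
# {'title': 'some.other.tv.show', 'season': '04',
#  'episode': '05', 'extension': 'avi'}
```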