Python regexp multiple expressions with grouping

I'm trying to match the output a modem gives when asked for network info. It looks like this:
Network survey started...
For BCCH-Carrier:
arfcn: 15,bsic: 4,dBm: -68
For non BCCH-Carrier:
arfcn: 10,dBm: -72
arfcn: 6,dBm: -78
arfcn: 11,dBm: -81
arfcn: 14,dBm: -83
arfcn: 16,dBm: -83
So I have two types of expressions to match, BCCH and non-BCCH. The following code almost works:
match = re.findall(r'(?:arfcn: (\d*),dBm: (-\d*))|(?:arfcn: (\d*),bsic: (\d*),dBm: (-\d*))', data)
But it seems that the groups of both alternatives are returned for every match, with the fields that were not found left blank:
>>> match
[('', '', '15', '4', '-68'), ('10', '-72', '', '', ''), ('6', '-78', '', '', ''), ('11', '-81', '', '', ''), ('14', '-83', '', '', ''), ('16', '-83', '', '', '')]
Can anyone help? Why does it behave this way? I've tried changing the order of the expressions, with no luck.
Thanks!

That is how capturing groups work. Since you have five of them, there will always be five parts returned.
Based on your data, I think you could simplify your regex by making the bsic part optional. That way each row would return three parts, the middle one being empty for non BCCH-Carriers.
match = re.findall(r'arfcn: (\d*)(?:,bsic: (\d*))?,dBm: (-\d*)', data)

You have an expression with 5 groups.
The fact that you have 2 of those in one optional part and the other 3 in a mutually exclusive other part of your expression doesn't change that fact. Either 2 or 3 of the groups are going to be empty, depending on what line you matched.
If you have to match either line with one expression, there is no way around this. You can use named groups (and return a dictionary of matched groups) to make this a little easier to manage, but you will always end up with empty groups.
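A minimal sketch of the named-group idea, combined with the optional bsic part from the answer above. The group names (arfcn, bsic, dbm) and the shortened sample data are assumptions made for illustration. With finditer and groupdict(), a group that did not take part in the match comes back as None rather than as an empty string:

```python
import re

# Shortened copy of the modem output from the question
data = """Network survey started...
For BCCH-Carrier:
arfcn: 15,bsic: 4,dBm: -68
For non BCCH-Carrier:
arfcn: 10,dBm: -72
arfcn: 6,dBm: -78
"""

# One pattern for both line types: the bsic part is optional.
# The group names are illustrative, not anything the modem defines.
pattern = re.compile(
    r"arfcn: (?P<arfcn>\d+)(?:,bsic: (?P<bsic>\d+))?,dBm: (?P<dbm>-?\d+)"
)

# groupdict() maps each group name to its text, or None if it did not match
cells = [m.groupdict() for m in pattern.finditer(data)]

for cell in cells:
    print(cell)
```

Checking `cell['bsic'] is None` then tells a BCCH line from a non-BCCH line.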

Related

In Python regular expressions, why can't (h)* and (h)+ yield the same result?

I am learning the re module in Python. I have found something that doesn't make sense to me, and I don't know why. Here is a small example:
import re
x = re.compile(r'(ha)*')
c = x.search('the man know how to hahahaha')
print(c.group())  # prints an empty string, no error, but I expected "hahahaha"
The same happens if I use re.compile(r'(ha)?'):
x = re.compile(r'(ha)?')
c = x.search('the man know how to hahahaha')
print(c.group())  # prints an empty string, no error, but I expected "ha"
But if I use re.compile(r'(ha)+'):
x = re.compile(r'(ha)+')
c = x.search('the man know how to hahahaha')
print(c.group())  # prints "hahahaha", just as expected
Why is this? Aren't re.compile(r'(ha)*') and re.compile(r'(ha)+') the same in this case?
The patterns r'h+' and r'h*' are not identical, which is why they do not give the same result: + means one or more repetitions of your pattern, * means zero or more.
re.search returns "nothing" because it only returns the first match. The first match for * is zero occurrences of your '(ha)' pattern at the very first character of your string:
import re
x=re.compile(r'(ha)*')
c=x.findall('the man know how to hahahaha') # get _all_ matches
print(c)
Output:
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'ha', '']
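# one empty match at each character position before the run of "ha",
# then 'ha' (the last repetition captured while matching "hahahaha"),
# then one more empty match at the end of the string
The * and ? quantifiers both allow zero repetitions.
Documentation: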
Pattern.search(string[, pos[, endpos]])
Scan through string looking for the first location where this regular expression produces a match, ...
(source: https://docs.python.org/3/library/re.html#re.Pattern.search)
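The difference is easy to see by printing the span of each match. This is a small sketch using the string from the question; the variable names star and plus are just for illustration:

```python
import re

text = 'the man know how to hahahaha'

# (ha)* may match zero repetitions, so it succeeds immediately at index 0
star = re.search(r'(ha)*', text)
# (ha)+ needs at least one "ha", so the search moves on until it finds one
plus = re.search(r'(ha)+', text)

print(repr(star.group()), star.span())  # '' (0, 0)
print(repr(plus.group()), plus.span())  # 'hahahaha' (20, 28)
```

Both calls return a match object; the first is simply an empty match at the start of the string.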

Match a word, followed by two optional groups in any order

I'm writing a sort of parser for a little library.
My string is in the following format:
text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"
To be clear, this is a list of people's names separated by commas, each optionally followed by two separators (| and !). After "|" comes the weight, a number with 0-2 decimals; after "!" comes an integer that represents the age. The separators and their values can appear in either order, as you can see for John and Don.
I need to extract with a regex (I know I could do it in many other ways) all the names with a length between 2 and 4, along with the two separators and their values, if they are present.
This is my expected result:
[('John', '|85.56', '!26'), ('Don', '|78.00' ,'!18'), ('Dean', '', '')]
I'm trying with this code:
import re
text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"
pattern = re.compile(r'(\b\w{2,4}\b)(\!\d+)?(\|\d+(?:\.\d{1,2})?)?')
search_result = pattern.findall(text)
print(search_result)
But this is the actual result:
[('John', '', '|85.56'), ('26', '', ''), ('Don', '!18', '|78.0'), ('Dean', '', '')]
The following regex seems to be giving what you want:
re.findall(r'(\b[a-z]{2,4}\b)(?:(!\d+)|(\|\d+(?:\.\d{,2})?))*', text, re.I)
#[('John', '!26', '|85.56'), ('Don', '!18', '|78.0'), ('Dean', '', '')]
If you do not want those names, you can easily filter them out.
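Note that this pattern returns the groups as (name, age, weight), while the expected result in the question lists weight before age. A small sketch of reordering the tuples after matching, assuming that order is what is wanted (the variable name people is illustrative):

```python
import re

text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"
pattern = r'(\b[a-z]{2,4}\b)(?:(!\d+)|(\|\d+(?:\.\d{,2})?))*'

# findall yields (name, age, weight); swap to (name, weight, age)
people = [(name, weight, age)
          for name, age, weight in re.findall(pattern, text, re.I)]

print(people)
# [('John', '|85.56', '!26'), ('Don', '|78.0', '!18'), ('Dean', '', '')]
```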
Pyparsing is good at composing complex expressions from simpler ones, and includes many builtins for optional, unordered, and comma-delimited values. See the comments in the code below:
import pyparsing as pp
real = pp.pyparsing_common.real
integer = pp.pyparsing_common.integer
name = pp.Word(pp.alphas, min=2, max=4)
# a valid person entry starts with a name followed by an optional !integer for age
# and an optional |real for weight; the '&' operator allows these to occur in either
# order, but at most only one of each will be allowed
expr = pp.Group(name("name")
                + (pp.Optional(pp.Suppress('!') + integer("age"), default='')
                   & pp.Optional(pp.Suppress('|') + real("weight"), default='')))
# other entries that we don't care about
other = pp.Word(pp.alphas, min=5)
# an expression for the complete input line - delimitedList defaults to using
# commas as delimiters; and we don't really care about the other entries, just
# suppress them from the results; whitespace is also skipped implicitly, but that
# is not an issue in your given sample text
input_expr = pp.delimitedList(expr | pp.Suppress(other))
# try it against your test data
text = "Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean"
input_expr.runTests(text)
Prints:
Louis,Edward,John|85.56!26,Billy,Don!18|78.0,Dean
[['John', 85.56, 26], ['Don', 18, 78.0], ['Dean', '', '']]
[0]:
['John', 85.56, 26]
- age: 26
- name: 'John'
- weight: 85.56
[1]:
['Don', 18, 78.0]
- age: 18
- name: 'Don'
- weight: 78.0
[2]:
['Dean', '', '']
- name: 'Dean'
In this case, using the pre-defined real and integer expressions not only parses the values, but also does the conversion to int and float. The named parameters can be accessed like object attributes:
for person in input_expr.parseString(text):
    print("({!r}, {}, {})".format(person.name, person.age, person.weight))
Gives:
('John', 26, 85.56)
('Don', 18, 78.0)
('Dean', , )

Python Regex not returning phone numbers

Given the following code:
import re
file_object = open("all-OANC.txt", "r")
file_text = file_object.read()
pattern = "(\+?1-)?(\()?[0-9]{3}(\))?(-|.)[0-9]{3}(-|.)[0-9]{4}"
for match in re.findall(pattern, file_text):
    print(match)
I get output that stretches like this:
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
I'm trying to find phone numbers, and I am one hundred percent sure there are numbers in the file. When I search for numbers in an online applet for example, with the same expression, I get matches.
Here is a snippet where the expression is found outside of python:
"Slate on Paper," our
specially formatted print-out version of Slate, is e-mailed to readers
Friday around midday. It also can be downloaded from our
site. Those services are free. An actual paper edition of "Slate on Paper"
can be mailed to you (call 800-555-4995), but that costs money and can take a
few days to arrive."
I want output that at least recognizes the presence of a number
It's your capture groups that are being displayed. Display the whole match:
import re
text = '''"Slate on Paper," our specially formatted print-out version of Slate, is e-mailed to readers Friday around midday. It also can be downloaded from our site. Those services are free. An actual paper edition of "Slate on Paper" can be mailed to you (call 800-555-4995), but that costs money and can take a few days to arrive."'''
pattern = r"(\+?1-)?(\()?[0-9]{3}(\))?(-|.)[0-9]{3}(-|.)[0-9]{4}"
for match in re.finditer(pattern, text):
    print(match.group())  # group() with no argument is the whole match
Output:
800-555-4995
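If you would rather keep re.findall, another option is to make the groups non-capturing, since findall returns whole matches when the pattern has no capturing groups. The sketch below is a variation on the pattern above, not the original: it uses (?:...) groups and, as an assumption about the intent, replaces (-|.) with the character class [-.], because an unescaped dot matches any character.

```python
import re

text = "can be mailed to you (call 800-555-4995), but that costs money"

# Assumed variation of the original pattern:
# non-capturing groups, and [-.] instead of (-|.) as the separator
pattern = r"(?:\+?1-)?\(?[0-9]{3}\)?[-.][0-9]{3}[-.][0-9]{4}"

# With no capturing groups, findall returns the full matched strings
numbers = re.findall(pattern, text)
print(numbers)  # ['800-555-4995']
```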

Python: help composing regex pattern

I'm just learning python and having a problem figuring out how to create the regex pattern for the following string
"...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."
I'm trying to extract the data between the begin: and :end for n iterations without getting duplicate data. I've attached my current attempt.
for m in re.finditer('.begin:(.*),(.*):(.*):(.*:.*):end.', list_to_string(j), re.DOTALL):
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
    print(m.group(4))
the output is:
begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33
13
2
2006-11-31 T 11:46
and I want it to be:
32
12
1
2005-10-30 T 10:45
33
13
2
2006-11-31 T 11:46
Thank you for any help.
.* is greedy, matching across your intended :end boundary. Replace all .*s with lazy .*?.
>>> s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""
>>> re.findall("begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46'),
('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
With a modified pattern, forcing single quotes to be present at the start/end of the match:
>>> re.findall("'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
You need to make the variable-sized parts of your pattern "non-greedy". That is, make them match the smallest possible string rather than the longest possible (which is the default).
Try the pattern '.begin:(.*?),(.*?):(.*?):(.*?:.*?):end.'.
Another option, besides Blckknght's and Tim Pietzcker's, is
re.findall("begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)
Instead of choosing non-greedy extensions, you use [^X] to mean "any character but X" for some X.
The advantage is that it's more rigid: there's no way to get the delimiter in the result, so
'begin:33,13:134:2:2006-11-31 T 11:46:end'
would not match, whereas it would with Blckknght's and Tim Pietzcker's patterns. For the same reason it's probably also faster on edge cases, though this is unlikely to matter in real-world circumstances.
The disadvantage is that it's more rigid, of course.
I suggest choosing whichever one makes more intuitive sense, because both methods work.
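A short sketch comparing the two styles on a well-formed record and on the record with an extra field mentioned above. The variable names (lazy, strict, good, extra_field) are made up for the example:

```python
import re

# Lazy quantifiers versus negated character classes
lazy = re.compile(r"begin:(.*?),(.*?):(.*?):(.*?:.*?):end")
strict = re.compile(r"begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end")

good = "begin:32,12:1:2005-10-30 T 10:45:end"
extra_field = "begin:33,13:134:2:2006-11-31 T 11:46:end"

# Both patterns agree on the well-formed record
print(lazy.findall(good))
print(strict.findall(good))

# On the malformed record the lazy pattern still matches, and the extra
# field leaks into the last group; the strict pattern finds nothing
print(lazy.findall(extra_field))
print(strict.findall(extra_field))
```

Which behaviour is preferable depends on whether malformed records should be skipped or captured as they are.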

Python: Regex outputs 12_34 - I need 1234

So I have input coming in as follows: 12_34 5_6_8_2 4_____3 1234
and the output I need from it is: 1234, 5682, 43, 1234
I'm currently working with r'[0-9]+[0-9_]*'.replace('_',''), which, as far as I can tell, successfully rejects any input which is not a combination of numeric digits and under-scores, where the underscore cannot be the first character.
However, replacing the _ with the empty string causes 12_34 to come out as 12 and 34.
Is there a better method than 'replace' for this? Or could I adapt my regex to deal with this problem?
EDIT: Was responding to questions in comments below, I realised it might be better specified up here.
So, the broad aim is to take a long input string, for example:
"12_34 + 'Iamastring#' I_am_an_Ident"
and return:
('NUMBER', 1234), ('PLUS', '+'), ('STRING', 'Iamastring#'), ('IDENT', 'I_am_an_Ident')
I didn't want to go through all that because I've got it all working as specified, except for number.
The solution code looks something like:
tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE',
'IDENT', 'STRING', 'NUMBER')
t_PLUS = "+"
t_MINUS = '-'
and so on, down to:
t_NUMBER = ###code goes here
I'm not sure how to put multi-line processes into t_NUMBER
I'm not sure what you mean or why you need a regex, but maybe this helps:
In [1]: ins = '12_34 5_6_8_2 4_____3 1234'
In [2]: for x in ins.split(): print(x.replace('_', ''))
1234
5682
43
1234
EDIT in response to the edited question:
I'm still not quite sure what you're doing with tokens there, but I'd do something like this (at least it makes sense to me):
input_str = "12_34 + 'Iamastring#' I_am_an_Ident"
tokens = ('NUMBER', 'SIGN', 'STRING', 'IDENT')
data = dict(zip(tokens, input_str.split()))
This would give you
{'IDENT': 'I_am_an_Ident',
'NUMBER': '12_34',
'SIGN': '+',
'STRING': "'Iamastring#'"}
Then you could do
data['NUMBER'] = int(data['NUMBER'].replace('_', ''))
and anything else you like.
P.S. Sorry if it doesn't help, but I really don't see the point of having tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'IDENT', 'STRING', 'NUMBER'), etc.
>>> a = '12_34 5_6_8_2 4___3 1234'
>>> a.replace('_', '').replace(' ', ', ')
'1234, 5682, 43, 1234'
The phrasing of your question is a little bit unclear. If you don't care about input validation, the following should work:
input = '12_34 5_6_8_2 4_____3 1234'
re.sub(r'\s+', ', ', input.replace('_', ''))
If you need to actually strip out all characters which are not either digits or whitespace and add commas between the numbers, then:
re.sub(r'\s+', ', ', re.sub(r'[^\d\s]', '', input))
...should accomplish the task. Of course, it would probably be more efficient to write a function that only has to walk through the string once rather than using multiple re.sub() calls.
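One way to sketch the single-pass idea is a single re.sub call with a replacement function that decides per match what to emit. The function name join_numbers is hypothetical, and like the first snippet above it does no input validation:

```python
import re

def join_numbers(s):
    """Drop underscores and turn whitespace runs into ', ' in one re.sub pass."""
    def repl(match):
        # A run of underscores vanishes; a run of whitespace becomes ', '
        return '' if match.group().startswith('_') else ', '
    return re.sub(r'_+|\s+', repl, s)

print(join_numbers('12_34 5_6_8_2 4_____3 1234'))  # 1234, 5682, 43, 1234
```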
You seem to be doing something like:
>>> data = '12_34 5_6_8_2 4_____3 1234'
>>> pattern = '[0-9]+[0-9_]*'
>>> re.findall(pattern, data)
['12_34', '5_6_8_2', '4_____3', '1234']
>>> re.findall(pattern.replace('_', ''), data)
['12', '34', '5', '6', '8', '2', '4', '3', '1234']
The issue is that pattern.replace isn't a signal to re to remove the _s from the matches; it changes your regex to '[0-9]+[0-9]*'. What you want is to call replace on the results rather than on the pattern, e.g.:
>>> [match.replace('_', '') for match in re.findall(pattern, data)]
['1234', '5682', '43', '1234']
Also note that your regex can be simplified slightly; I will leave out the details of how since this is homework.
Well, if you really have to use re and only re, you could do this:
import re
def replacement(match):
    # Map each separator character to the text that should replace a run of it
    separator_dict = {
        '_': '',
        ' ': ',',
    }
    for sep, repl in separator_dict.items():
        if all(char == sep for char in match.group(2)):
            return match.group(1) + repl + match.group(3)
def rec_sub(s):
    """
    Recursive so it works with any number of numbers separated by underscores.
    """
    new_s = re.sub(r'(\d+)([_ ]+)(\d+)', replacement, s)
    if new_s == s:
        return new_s
    else:
        return rec_sub(new_s)
But that epitomizes the concept of overkill.
