regular expression: may or may not contain a string - python

I want to match a floating number that might be in the form of 0.1234567 or 1.23e-5
Here is my python code:
import re
def main():
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
for svs_elem in m2:
print svs_elem
main()
It prints blank... Based on my test, the problem was in (e-\d+)? part.

See emphasis:
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
You have a group, so it’s returned instead of the entire match, but it doesn’t match in any of your cases. Make it non-capturing with (?:e-\d+):
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')

Use a non-capturing group. The matches are succeeding, but the output is the contents of the optional groups that don't actually match.
See the output when your input includes something like e-6:
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['', '', 'e-6']
With a non-capturing group ((?:...)):
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['1:0.00003', '3:0.123456', '8:-0.12345e-6']
Here's are some simpler examples to demonstrate how capturing groups work and how they influence the output of findall. First, no groups:
>>> re.findall("a[bc]", "ab")
["ab"]
Here, the string "ab" matched the regex, so we print everything the regex matched.
>>> re.findall("a([bc])", "ab")
["b"]
This time, we put the [bc] inside a capturing group, so even though the entire string is still matched by the regex, findall only includes the part inside the capturing group in its output.
>>> re.findall("a(?:[bc])", "ab")
["ab"]
Now, by converting the capturing group to a non-capturing group, findall again uses the match of the entire regex in its output.
>>> re.findall("a([bc])?", "a")
['']
>>> re.findall("a(?:[bc])?", "a")
['a']
In both of these final case, the regular expression as a whole matches, so the return value is a non-empty list. In the first one, the capturing group itself doesn't match any text, though, so the empty string is part of the output. In the second, we don't have a capturing group, so the match of the entire regex is used for the output.

Related

Apart from returning string and iterator in re.findall() and re.finditer() in python do their working also differ?

Wrote the following code so that i get all variable length patterns matching str_key.
line = "ABCDABCDABCDXXXABCDXXABCDABCDABCD"
str_key = "ABCD"
regex = rf"({str_key})+"
find_all_found = re.findall(regex,line)
print(find_all_found)
find_iter_found = re.finditer(regex, line)
for i in find_iter_found:
print(i.group())
Output i got:
['ABCD', 'ABCD', 'ABCD']
ABCDABCDABCD
ABCD
ABCDABCDABCD
The intended output is last three lines printed by finditer(). I was expecting both functions to give me same output(list or callable does not matter). why it differs in findall() as far i understood from other posts already on stackoverflow, these two functions differ only in their return types and not in matching patterns. Do they work differently, if not what have i done wrong?
You want to access groups rather than group.
>>> find_iter_found = re.finditer(regex, line)
>>> for i in find_iter_found:
... print(i.groups()[0])
The difference between the two methods is explained here.
The behaviour of the two functions is pretty much the same as far as the matching process is concerned as per:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping
matches for the RE pattern in string. The string is scanned
left-to-right, and matches are returned in the order found. Empty
matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
For re.findall change your regex
regex = rf"({str_key})+"
into
regex = rf"((?:{str_key})+)".
The quantifier + have to inside the capture group.

Extract an alphanumeric string between two special characters

I'm trying to match strings in the lines of a file and write the matches minus the first one and the last one
import os, re
infile=open("~/infile", "r")
out=open("~/out", "w")
pattern=re.compile("=[A-Z0-9]*>")
for line in infile:
out.write( pattern.search(line)[1:-1] + '\n' )
Problem is that it says that Match is not subscriptable, when I try to add .group() it says that Nonegroup has no attritube group, groups() returns that .write needs a tuple etc
Any idea how to get .search to return a string ?
The re.search function returns a Match object.
If the match fails, the re.search function will return None. To extract the matching text, use the Match.group method.
>>> match = re.search("a.", "abc")
>>> if match is not None:
... print(match.group(0))
'ab'
>>> print(re.search("a.", "a"))
None
That said, it's probably a better idea to use groups to find the required section of the match:
>>> match = re.search("=([A-Z0-9]*)>", "=abc>") # Notice brackets
>>> match.group(0)
'=abc>'
>>> match.group(1)
'abc'
This regex can then be used with findall as #WiktorStribiżew suggests.
You seem to need only the part of strings between = and >. In this case, it is much easier to use a capturing group around the alphanumeric pattern and use it with re.findall that will never return None, but just an empty list upon no match, or a list of captured texts if found. Also, I doubt you need empty matches, so use + instead of *:
pattern=re.compile(r"=([A-Z0-9]+)>")
^ ^
and then
"\n".join(pattern.findall(line))

Regex disjunction in Python findall does not match full substrings

In several online testers, the regex a(b|c)z matches both 'abz' and 'acz' in the string 'abz acz', but Python's re.findall() only matches 'b' and 'c'.
What am I missing?
In[42]: re.findall(r'a(b|c)z', 'abz acz')
Out[42]: ['b', 'c']
With findall, the captured groups are returned:
As stated in the documentation ...
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
You can simply use a character class here instead.
>>> re.findall(r'a[bc]z', 'abz acz')
['abz', 'acz']
thats because you are using capturing parens
re.findall(r'a(?:b|c)z', 'abz acz')
?: will cause it to be non-capturing parentheses

How to print regex match results in python 3?

I was in IDLE, and decided to use regex to sort out a string. But when I typed in what the online tutorial told me to, all it would do was print:
<_sre.SRE_Match object at 0x00000000031D7E68>
Full program:
import re
reg = re.compile("[a-z]+8?")
str = "ccc8"
print(reg.match(str))
result:
<_sre.SRE_Match object at 0x00000000031D7ED0>
Could anybody tell me how to actually print the result?
You need to include .group() after to the match function so that it would print the matched string otherwise it shows only whether a match happened or not. To print the chars which are captured by the capturing groups, you need to pass the corresponding group index to the .group() function.
>>> import re
>>> reg = re.compile("[a-z]+8?")
>>> str = "ccc8"
>>> print(reg.match(str).group())
ccc8
Regex with capturing group.
>>> reg = re.compile("([a-z]+)8?")
>>> print(reg.match(str).group(1))
ccc
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
If you need to get the whole match value, you should use
m = reg.match(r"[a-z]+8?", text)
if m: # Always check if a match occurred to avoid NoneType issues
print(m.group()) # Print the match string
If you need to extract a part of the regex match, you need to use capturing groups in your regular expression. Enclose those patterns with a pair of unescaped parentheses.
To only print captured group results, use Match.groups:
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
So, to get ccc and 8 and display only those, you may use
import re
reg = re.compile("([a-z]+)(8?)")
s = "ccc8"
m = reg.match(s)
if m:
print(m.groups()) # => ('ccc', '8')
See the Python demo

Regular Expression in python

When the parenthesis were used in the below program output is
['www.google.com'].
import re
teststring = "href=\"www.google.com\""
m=re.findall('href="(.*?)"',teststring)
print m;
If parenthesis is removed in findall function output is ['href="www.google.com"'].
import re
teststring = "href=\"www.google.com\""
m=re.findall('href=".*?"',teststring)
print m;
Would be helpful if someone explained how it works.
The re.findall() documentation is quite clear on the difference:
Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So .findall() returns a list containing one of three types of values, depending on the number of groups in the pattern:
0 capturing groups in the pattern (no (...) parenthesis): the whole matched string ('href="www.google.com"' in your second example).
1 capturing group in the pattern: return the captured group ('www.google.com' in your first example).
more than 1 capturing group in the pattern: return a tuple of all matched groups.
Use non-capturing groups ((?:...)) if you don't want that behaviour, or add groups if you want more information. For example, adding a group around the href= part would result in a list of tuples with two elements each:
>>> re.findall('(href=)"(.*?)"', teststring)
[('href=', 'www.google.com')]

Categories