How to print substring using RegEx in Python? - python

This is two texts:
1) 'provider:sipoutilp1.ym.ms'
2) 'provider:sipoutqtm.ym.ms'
I would like to print ilp when reaches to the fist line and qtm when reaches to the second line.
This is my solution but it is not working.
RE_PROVIDER = re.compile(r'(?P<provider>\((ilp+|qtm+)')
or in the line below,
182938,DOMINICAN REPUBLIC-MOBILE
to DOMINICAN REPUBLIC , can I use the same approach re.compile?
Thank you for any help.

Your regex is not correct because you have a open parenthesis before your keywords, since there is no such character in your lines.
As a more general way you can get capture the alphabetical character after sipout or provider:sipout.
>>> s1 = 'provider:sipoutilp1.ym.ms'
>>> s2 = 'provider:sipoutqtm.ym.ms'
>>> RE_PROVIDER = re.compile(r'(?P<provider>(?<=sipout)(ilp|qtm))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
(?<=sipout) is a positive look-behind which will makes the regex engine match the patter which is precede with sipout.
After edit:
If you want to match multiple strings with different structure, you have to use a optional preceding patterns for matching your keywords, and due to this point that you cannot use unfixed length patterns within look-behind you cannot use it for this aim. So instead you can use a capture group trick.
You can define the optional preceding patterns within a none capture group and your keyword within a capture group then after match get the second matched gorup (group(1), group(0) is the whole of your match).
>>> RE_PROVIDER = re.compile(r'(?:sipout|\d+,)(?P<provider>(ilp|qtm|[A-Z\s]+))')
>>> RE_PROVIDER.search(s1).groupdict()
{'provider': 'ilp'}
>>> RE_PROVIDER.search(s2).groupdict()
{'provider': 'qtm'}
>>> s3 = "182938,DOMINICAN REPUBLIC-MOBILE"
>>> RE_PROVIDER.search(s3).groupdict()
{'provider': 'DOMINICAN REPUBLIC'}
Note that gorupdict doesn't works in this case because it will returns

Related

Which pattern was matched among those that I passed through a regular expression?

I am using regexp with pyhton and the library re. The regular expression I am passing contains many possible variations of a string, such as:
myRExp = ("aaaaa|bbbbb|ccccc|ddddd")
This is what I am doing to match the full regular expression
# read a file with two columns
df = pd.read_csv('a_file.csv')
# get second column and create a unique regular expression
myRExp = "|".join(df[df.columns[1]])
# now test if line contains myRExp
if re.match(myRExp, line):
# get the actual matching pattern and do something with it
What I need to do is to know which substring from myRExp was actually matching the line, i.e. which one between "aaaaa", "bbbbb", "ccccc" or "ddddd", matched?
EDIT:
Let's go with the example. This is my regular expression:
>>> linE = 'zzzzbbdbbxxx'
>>> myRExp = "(aa[a|b]a)|(bb[c|d]bb)|(ccc[d|c]c)"
by re.match() I can now match it and get this output (note that I am using search to make my point here):
# do we have a match? (yes)
>>> matched = re.search(myRExp, linE)
# show groups: I partially care
>>> matched.groups(0)
(0, 'bbdbb', 0)
At this point, what I need is the index of the regular expression that matched: the match was (bb[c|d]bb), then the output should be 2, i.e. the index of that regular expression group in myRExp:
index of matched.groups(0) in myRExp
Is there any way of obtaining the index?
Grab the "match object" returned by the regex call, and you can examine it:
m = re.match(myRExp, line)
if m:
print("Matched", m.group(0))
This will show you the part of your string that matched, which in this case is the simplest way to get what you are after.
If your regex contains groups and you want to know exactly which of the groups matched, use m.groups() instead:
>>> probe = "(orange)|(or)|(or.*)"
>>> m = re.match(probe, 'order')
>>> m.groups()
(None, 'or', None)
There should only be one value that is not None, so you can take its index and look up the regex in your list of regex substrings. Here's one way to find the index with a one-liner:
>>> match_index = list(map(bool, m.groups())).index(True)
I would suggest, that you can use this website
There you can tinker and adapt your regular expressions and get visual feedback what is matched, when providing test strings. Also the syntax is documented for the rare case you forget some commands ;)

python: re.search doesn't start at beginning of string?

I'm working on a Flask API, which takes the following regex as an endpoint:
([0-9]*)((OK)|(BACK)|(X))*
That means I'm expecting a series of numbers, and the OK, BACK, X keywords multiple times in succession after the numbers.
I want to split this regex and do different stuff depending which capture groups were present.
My approach was the following:
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]*)", str(endp), re.I)
if match:
n = match.groups()
logging.info('nums: ' + str(n[0]))
match = re.search(r"((OK)|(BACK)|(X))*", str(endp), re.I)
if match:
s1 = match.groups()
for i in s1:
logging.info('str: ' + str(i[0]))
Using the /12OK endpoint, getting the numbers works just fine, but for some reason capturing the rest of the keywords are unsuccessful. I tried reducing the second capture group to only
match = re.search(r"(OK)*", str(endp), re.I)
I constantly find the following in s1 (using the reduced regex):
(None,)
originally (with the rest of the keywords):
(None, None, None, None)
Which I suppose means the regex pattern does not match anything in my endp string (why does it have 4 Nones? 1 for each keyword, but what the 4th is there for?). I validated my endpoint (the regex against the same string too) with a regex validator, it seems fine to me. I understand that re.match is supposed to get matches from the beginning, therefore I used the re.search method, as the documentation points out it's supposed to match anywhere in the string.
What am I missing here? Please advise, I'm a beginner in the python world.
Indeed it is a bit surprising that searching with * returns `None:
>>> re.search("(OK|BACK|X)*", u'/12OK').groups()
(None,)
But it's "correct", since * matches zero or more, and any pattern matches zero times in any string, that's why you see None. Searching with + somewhat solves it:
>>> re.search("(OK|BACK|X)+", u'/12OK').groups()
('OK',)
But now, searching with this pattern in /12OKOK still only finds one match because + means one or more, and it matched one time at the first OK. To find all occurrences you need to use re.findall:
>>> re.findall("(OK|BACK|X)", u'/12OKOK')
['OK', 'OK']
With those findings, your code would look as follows: (note that you don't need to write i[0] since i is already a string, unless you want to log only the first char of the string):
import re
endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]+)", str(endp))
if match:
n = match.groups()
logging.info('nums: ' + str(n))
match = re.findall(r"(OK|BACK|X)", str(endp), re.I)
for i in match:
logging.info('str: ' + str(i))
If you want to match at least ONE of the groups, use + instead of *.
>>> endp = '/12OK'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> if match:
... s1 = match.groups()
... for i in s1:
... print s1
...
('OK', 'OK', None, None)
>>> endp = '/12X'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> match.groups()
('X', None, None, 'X')
Notice that you have 4 matching groups in your expression, one for each pair of parentheses. The first match is the outer parenthesis and the second one is the first of the nested groups. In the second example, you still get the first match for the outer parenthesis and then the last one is the third of the nested ones.
"((OK)|(BACK)|(X))*" will search for OK or BACK or X, 0 or more times. Note that the * means 0 or more, not more than 0. The above expression should have a + at the end not * as + means 1 or more.
I think you're having two different issues, and their intersection is causing more confusion than either of them would cause on their own.
The first issue is that you're using repeated groups. Python's re library is not able to capture multiple matches when a group is repeated. Matching with a pattern like (X)+ against 'XXXX' will only capture a single 'X' in the first group even though the whole string will be matched. The regex library (which is not part of the standard library) can do multiple captures, though I'm not sure of the exact commands required.
The second issue is using the * repetition operator in your pattern. The pattern you show at the top of the question will match on an empty string. Obviously, none of the gropus will capture anything in that situation (which may be why you're seeing a lot of None entries in your results). You probably need to modify your pattern so that it requires some minimal amount of valid text to count as a match. Using + instead of * might be one solution, but it's not clear to me exactly what you want to match against so I can't suggest a specific pattern.

Regex: skip the first match of a character in group?

From this string
s = 'stringalading-0.26.0-1'
I'd like to extract the part 0.26.0-1. I can think of various ways to achieve this, using split or a regular expression using a pattern like this
pattern = r'\d+\.\d+\.\d+\-\d+'
I also tried to use a group of characters, like so:
pattern = r'[.\-\d]+'
This gives me:
In [30]: re.findall(pattern, s)
Out[30]: ['-0.26.0-1']
So I wondered: is it possible to skip the first occurrence of a character in a group, in this case the first occurrence of -?
is it possible to to skip the first occurrence of a character in a group, in this case the first occurrence of -?
NO, because when matching, the regex engine processes the string from left to right, and once the matching pattern is found, the matched chunk of text is written to the match buffer. Thus, either write a regex that only matches what you need, or post-process the found result by stripping unwanted characters from the left.
I think you do not need a regex here. You can split the string with - and pass the maxsplit argument set to 1, then just access the second item:
s = 'stringalading-0.26.0-1'
print(s.split("-", 1)[1]) # => '0.26.0-1'
See the Python demo
Also, your first regex works well:
import re
s = 'stringalading-0.26.0-1'
pat = r'\d+\.\d+\.\d+-\d+'
print(re.findall(pat, s)) # => ['0.26.0-1']
Do:
-(.*)
and get captured group 1.
Example:
In [9]: s = 'stringalading-0.26.0-1'
In [10]: re.search(r'-(.*)', s).group(1)
Out[10]: '0.26.0-1'

How to print regex match results in python 3?

I was in IDLE, and decided to use regex to sort out a string. But when I typed in what the online tutorial told me to, all it would do was print:
<_sre.SRE_Match object at 0x00000000031D7E68>
Full program:
import re
reg = re.compile("[a-z]+8?")
str = "ccc8"
print(reg.match(str))
result:
<_sre.SRE_Match object at 0x00000000031D7ED0>
Could anybody tell me how to actually print the result?
You need to include .group() after to the match function so that it would print the matched string otherwise it shows only whether a match happened or not. To print the chars which are captured by the capturing groups, you need to pass the corresponding group index to the .group() function.
>>> import re
>>> reg = re.compile("[a-z]+8?")
>>> str = "ccc8"
>>> print(reg.match(str).group())
ccc8
Regex with capturing group.
>>> reg = re.compile("([a-z]+)8?")
>>> print(reg.match(str).group(1))
ccc
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
If you need to get the whole match value, you should use
m = reg.match(r"[a-z]+8?", text)
if m: # Always check if a match occurred to avoid NoneType issues
print(m.group()) # Print the match string
If you need to extract a part of the regex match, you need to use capturing groups in your regular expression. Enclose those patterns with a pair of unescaped parentheses.
To only print captured group results, use Match.groups:
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
So, to get ccc and 8 and display only those, you may use
import re
reg = re.compile("([a-z]+)(8?)")
s = "ccc8"
m = reg.match(s)
if m:
print(m.groups()) # => ('ccc', '8')
See the Python demo

Regular expression to return all characters between two special characters

How would I go about using regx to return all characters between two brackets.
Here is an example:
foobar['infoNeededHere']ddd
needs to return infoNeededHere
I found a regex to do it between curly brackets but all attempts at making it work with square brackets have failed. Here is that regex: (?<={)[^}]*(?=}) and here is my attempt to hack it
(?<=[)[^}]*(?=])
Final Solution:
import re
str = "foobar['InfoNeeded'],"
match = re.match(r"^.*\['(.*)'\].*$",str)
print match.group(1)
If you're new to REG(gular) EX(pressions) you learn about them at Python Docs. Or, if you want a gentler introduction, you can check out the HOWTO. They use Perl-style syntax.
Regex
The expression that you need is .*?\[(.*)\].*. The group that you want will be \1.
- .*?: . matches any character but a newline. * is a meta-character and means Repeat this 0 or more times. ? makes the * non-greedy, i.e., . will match up as few chars as possible before hitting a '['.
- \[: \ escapes special meta-characters, which in this case, is [. If we didn't do that, [ would do something very weird instead.
- (.*): Parenthesis 'groups' whatever is inside it and you can later retrieve the groups by their numeric IDs or names (if they're given one).
- \].*: You should know enough by now to know what this means.
Implementation
First, import the re module -- it's not a built-in -- to where-ever you want to use the expression.
Then, use re.search(regex_pattern, string_to_be_tested) to search for the pattern in the string to be tested. This will return a MatchObject which you can store to a temporary variable. You should then call it's group() method and pass 1 as an argument (to see the 'Group 1' we captured using parenthesis earlier). I should now look like:
>>> import re
>>> pat = r'.*?\[(.*)].*' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd"
>>> match = re.search(pat, s)
>>> match.group(1)
"'infoNeededHere'"
An Alternative
You can also use findall() to find all the non-overlapping matches by modifying the regex to (?>=\[).+?(?=\]).
- (?<=\[): (?<=) is called a look-behind assertion and checks for an expression preceding the actual match.
- .+?: + is just like * except that it matches one or more repititions. It is made non-greedy by ?.
- (?=\]): (?=) is a look-ahead assertion and checks for an expression following the match w/o capturing it.
Your code should now look like:
>>> import re
>>> pat = r'(?<=\[).+?(?=\])' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd[andHere] [andOverHereToo[]"
>>> re.findall(pat, s)
["'infoNeededHere'", 'andHere', 'andOverHereToo[']
Note: Always use raw Python strings by adding an 'r' before the string (E.g.: r'blah blah blah').
10x for reading! I wrote this answer when there were no accepted ones yet, but by the time I finished it, 2 ore came up and one got accepted. :( x<
^.*\['(.*)'\].*$ will match a line and capture what you want in a group.
You have to escape the [ and ] with \
The documentation at the rubular.com proof link will explain how the expression is formed.
If there's only one of these [.....] tokens per line, then you don't need to use regular expressions at all:
In [7]: mystring = "Bacon, [eggs], and spam"
In [8]: mystring[ mystring.find("[")+1 : mystring.find("]") ]
Out[8]: 'eggs'
If there's more than one of these per line, then you'll need to modify Jarrod's regex ^.*\['(.*)'\].*$ to match multiple times per line, and to be non greedy. (Use the .*? quantifier instead of the .* quantifier.)
In [15]: mystring = "[Bacon], [eggs], and [spam]."
In [16]: re.findall(r"\[(.*?)\]",mystring)
Out[16]: ['Bacon', 'eggs', 'spam']

Categories