Why does the order of expressions matter in re.match? - python

I'm writing a function that takes a string like "three()" or "{1 + 2}" and splits it into a list of tokens (e.g. "three()" becomes ["three", "(", ")"]). I'm using re.match to help separate the string.
def lex(s):
    # scan input string and return a list of its tokens
    seq = []
    patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/|)(\t|\n|\r| )*")
    m = re.match(patterns,s)
    while m != None:
        if s == '':
            break
        seq.append(m.group(2))
        s = s[len(m.group(0)):]
        m = re.match(patterns,s)
    return seq
This works if the string is just "three". But if the string contains "()" or any other symbol, it never leaves the while loop.
But a funny thing happens: when I move ([a-z])* to the end of the alternation in the pattern string, it works. Why is that happening?
works: patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
Does not work: patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")

This one is a bit tricky, but the problem is with this part: ([a-z])*. It matches any string of lowercase letters of length 0 (zero) or more, which includes the empty string.
If you put this sequence at the end, like here:
patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
The regex engine will try the other matches first, and if it finds a match, stop there. Only if none of the others match, does it try ([a-z])* and since * is 'greedy', it will match all of three, then proceed to match ( and finally ).
The documentation explains how the full expression is tested (thanks to @kaya3).
However, if you put that sequence at the start, like here:
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
It will now try to match it first. It's still greedy, so three still gets matched. But then on the next try, it will try to match ([a-z])* to the remaining '()' - and it matches, since that string starts with zero letters.
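To see the zero-length match directly, here is a small check using the two patterns from the question. The variable names are only illustrative:

```python
import re

# The two patterns from the question; only the position of ([a-z])* differs.
letters_first = r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*"
letters_last = r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*"

rest = "()"  # what is left of "three()" once the first token is consumed

# Letters first: ([a-z])* succeeds with zero letters, so nothing is consumed.
empty = re.match(letters_first, rest).group(0)
# Letters last: \( is tried before ([a-z])*, so one character is consumed.
paren = re.match(letters_last, rest).group(0)

print(repr(empty), repr(paren))
```

Since the loop in the question cuts len(m.group(0)) characters off the string, an empty match leaves the string unchanged and the loop repeats forever.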
It keeps matching it like that, and gets stuck in the loop. You can fix it by changing the * for a + which will only match if there is 1 or more matches:
patterns = (r"^(\t|\n|\r| )*(([a-z])+|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
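As a rough sketch of a loop that cannot get stuck, the version below uses + for the letters and raises an error when nothing matches. The TOKEN name, the position-based loop and the ValueError are my own choices, not part of the original code:

```python
import re

# Hypothetical rewrite of the question's lex(); [a-z]+ cannot match an empty string.
TOKEN = re.compile(r"[ \t\n\r]*([a-z]+|[0-9]|\(|\)|\*|/)[ \t\n\r]*")

def lex(s):
    """Return the list of tokens in s, or raise ValueError on unknown input."""
    tokens = []
    pos = 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if m is None:
            # No alternative matched, so stop instead of looping forever.
            raise ValueError("unexpected input at position %d: %r" % (pos, s[pos:]))
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(lex("three()"))
print(lex("a * (b / 2)"))
```

Note that a string such as "{1 + 2}" raises ValueError here, because neither braces nor + are among the alternatives; they would have to be added to the pattern.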

Related

how to prevent regex matching substring of words?

I have a regex in Python and I want to prevent it from matching substrings. I want to add '#' at the beginning of words that consist of 4 to 15 alphanumeric or _ characters. But it also matches substrings of larger words. I have this method:
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'([a-zA-Z0-9_]{4,15})', r'#\1', str(sent))
    return sents
And the example is :
mylist = list()
mylist.append("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9")
add_atsign(mylist)
And the answer is :
['#ali_s #ali_t #ali_u #aabs:/t.co/#kMMALke2l9']
As you can see, it puts '#' at the beginning of 'aabs' and 'kMMALke2l9', which is wrong.
I tried to edit the code as below:
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))', r'#\1', str(sent))
    return sents
But the result will become like this :
['#ali_s ali_t# ali_u aabs:/t.co/kMMALke2l9']
As you can see, it makes the wrong replacements.
The correct result I expect is:
"#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9"
Could anyone help?
Thanks
This is a pretty interesting question. If I understand correctly, the issue is that you want to divide the string by spaces, and then do the replacement only if the entire word matches, and not catch a substring.
I think the best way to do this is to first split by spaces, and then add assertions to your regex that catch only an entire string:
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^([a-zA-Z0-9_]{4,15})$', r'#\1', w)
                                 for w in string.split()))
    return new_list
mylist = ["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]
add_atsign(mylist)
>
['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
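That is, we split, then replace only if the entire word matches, then rejoin.
By the way, your regex can be simplified to r'^(\w{4,15})$' (note that in Python 3, \w also matches non-ASCII letters and digits unless you pass re.ASCII):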
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^(\w{4,15})$', r'#\1', w)
                                 for w in string.split()))
    return new_list
You can separate words by spaces by adding (?<=\s) to the start and \s to the end of your first expression.
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|(?<=\s))[a-zA-Z0-9_]{4,15}\s)', r'#\1', str(sent))
    return sents
The result will be like this:
['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
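One caveat: because that pattern requires a whitespace character after the word, a matching word at the very end of the string is skipped. A possible variant, sketched here with lookarounds on both sides, avoids consuming any whitespace at all. The WORD name is illustrative, and this version returns a new list rather than modifying the input:

```python
import re

# (?<!\S) and (?!\S) assert "no non-space character here" without consuming anything,
# so they also succeed at the start and at the end of the string.
WORD = re.compile(r"(?<!\S)([a-zA-Z0-9_]{4,15})(?!\S)")

def add_atsign(sents):
    # Prefix '#' only to whole whitespace-delimited words of 4 to 15 characters.
    return [WORD.sub(r"#\1", str(sent)) for sent in sents]

print(add_atsign(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
print(add_atsign(["ali_s ali_t"]))
```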
I am not sure what you are trying to accomplish, but the reason it puts the # in the wrong places is that once you added \s or ^ to the regex, the whitespace became part of the match, so the # is put before the whitespace.
You could try to split it into two cases:
check at the beginning of the string and put the # at the first position, and
check after every whitespace and put the # at the second position.
I'm aware it's not optimal, but maybe I can help if you clarify in a bit more detail what the regex is supposed to match and what it shouldn't.

Change string for defined pattern (Python)

Learning Python, I came across a demanding beginner's exercise.
Let's say you have a string constituted by "blocks" of characters separated by ';'. An example would be:
cdk;2(c)3(i)s;c
And you have to return a new string based on old one but in accordance to a certain pattern (which is also a string), for example:
c?*
This pattern means that each block must start with a 'c', the '?' character must be replaced by some other letter, and finally '*' by an arbitrary number of letters.
So when the pattern is applied you return something like:
cdk;cciiis
Another example:
string: 2(a)bxaxb;ab
pattern: a?*b
result: aabxaxb
My very crude attempt resulted in this:
def switch(string,pattern):
    d = []
    for v in range(0,string):
        r = float("inf")
        for m in range (0,pattern):
            if pattern[m] == string[v]:
                d.append(pattern[m])
            elif string[m]==';':
                d.append(pattern[m])
            elif (pattern[m]=='?' & Character.isLetter(string.charAt(v))):
                d.append(pattern[m])
    return d
Tips?
To split a string you can use split() function.
For pattern detection in strings you can use regular expressions (regex) with the re library.
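A minimal sketch of how the two tips could fit together is shown below. It rests on my reading of the examples, not on a stated rule: "n(x)" is assumed to mean x repeated n times, and blocks that do not fit the pattern are assumed to be dropped. The helper names expand and to_regex are made up for illustration:

```python
import re

def expand(block):
    # Assumption: "n(x)" stands for x repeated n times, e.g. "2(c)" -> "cc".
    return re.sub(r"(\d+)\(([^()]*)\)", lambda m: m.group(2) * int(m.group(1)), block)

def to_regex(pattern):
    # '?' -> exactly one letter, '*' -> any number of letters, the rest is literal.
    parts = {"?": "[a-zA-Z]", "*": "[a-zA-Z]*"}
    return re.compile("".join(parts.get(ch, re.escape(ch)) for ch in pattern))

def switch(string, pattern):
    regex = to_regex(pattern)
    blocks = [expand(block) for block in string.split(";")]
    # Keep only the expanded blocks that match the whole pattern.
    return ";".join(b for b in blocks if regex.fullmatch(b))

print(switch("cdk;2(c)3(i)s;c", "c?*"))
print(switch("2(a)bxaxb;ab", "a?*b"))
```

With these assumptions the two examples from the question come out as cdk;cciiis and aabxaxb.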

How to write regular expression to find combination of characters, but each can only appear once in python

I would like to find whether "xy" is in a string. "xy" is optional, and each of its characters can appear only once. For example:
import re
def findpat(texts, pat):
    for t in texts:
        if re.search(pat, t):
            print(re.search(pat, t).group())
        else:
            print(None)
pat = re.compile(r'[xy]*?b')
text = ['xyb', 'xb', 'yb', 'yxb','b', 'xyxb']
findpat(text, pat)
# it prints
# xyb
# xb
# yb
# yxb
# b
# xyxb
For the last one, my desired output is "yxb".
How should I modify my regex? Many thanks
You may use the following approach: match and capture the two groups, ([xy]*)(b). Then, once a match is found, check if the length of the value in Group 1 is the same as the number of unique chars in this value. If not, remove the chars from the start of the group value until you get a string with the length of the number of unique chars.
Something like:
import re
def findpat(texts, pat):
    for t in texts:
        m = re.search(pat, t)  # Find a match
        if m:
            tmp = set(m.group(1))  # Get the unique chars
            if len(tmp) == len(m.group(1)):  # If Group 1 length is the same
                print(m.group())  # Report a whole match value
            else:
                res = m.group(1)
                while len(tmp) < len(res):  # While the length of the string is not
                    res = res[1:]  # equal to the number of unique chars, truncate from the left
                print("{}{}".format(res, m.group(2)))  # Print the result
        else:
            print(None)  # Else, no match
pat = re.compile(r'([xy]*)(b)')
text = ['xyb', 'xb', 'yb', 'yxb','b', 'xyxb']
findpat(text, pat)
# => [xyb, xb, yb, yxb, b, yxb]
See the Python demo
You can use this pattern
r'(x?y?|yx)b'
To break down, the interesting part x?y?|yx will match:
empty string
only x
only y
xy
and on the alternative branch, yx
As a piece of advice: when you aren't very comfortable with regex and the number of scenarios is small, you could simply brute-force the pattern. It's ugly, but it makes clear what your cases are:
r'b|xb|yb|xyb|yxb'
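As a quick check, both patterns can be run against the test strings from the question. The variable names are only illustrative:

```python
import re

texts = ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb']

compact = re.compile(r'(x?y?|yx)b')
spelled_out = re.compile(r'b|xb|yb|xyb|yxb')

# Collect the first match of each pattern in every test string.
compact_hits = [compact.search(t).group() for t in texts]
spelled_out_hits = [spelled_out.search(t).group() for t in texts]

print(compact_hits)
print(spelled_out_hits)
```

Both lists end in 'yxb' for the input 'xyxb', because re.search moves on to the next start position when no alternative fits at position 0.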
Part 2.
For a generic solution, that will do the same, but for any number of characters instead of just {x, y}, the following regex style can be used:
r'(?=[^x]*x?[^x]*b)(?=[^y]*y?[^y]*b)(?=[^z]*z?[^z]*b)[xyz]*b'
I'll explain it a bit:
By using lookaheads you advance the regex cursor and for each position, you just "look ahead" and see if what follows respects a certain condition. By using this technique, you may combine several conditions into a single regex.
For a cursor position, we test each character from our set to appear at most once from the position, until we match our target b character. We do this with this pattern [^x]*x?[^x]*, which means match not-x if there are any, match at most one x, then match any number of not x
Once the test conditions are met, we start advancing the cursor and matching all the characters from our needed set, until we find a b. At this point we are guaranteed that we won't match any duplicates, because we performed our lookahead tests.
Note: I strongly suspect that this has poor performance, because it does backtracking. You should only use it for small test strings.
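The lookahead style above follows a regular shape, so it can be generated. The helper below is a hypothetical sketch (the name at_most_once is mine) and assumes the characters are plain letters with no special meaning inside a character class:

```python
import re

def at_most_once(chars, end):
    # One lookahead per character: "c occurs at most once before `end`".
    # Assumes `chars` has no regex-special characters such as ^, ] or \.
    checks = "".join("(?=[^{0}]*{0}?[^{0}]*{1})".format(c, end) for c in chars)
    return re.compile("{0}[{1}]*{2}".format(checks, chars, end))

pat = at_most_once("xy", "b")
print(pat.pattern)
print([pat.search(t).group() for t in ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb']])
```

Calling at_most_once("xyz", "b") produces exactly the three-character pattern shown above.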
Test it.
Well, the regexp that literally passes your test cases is:
pat = re.compile(r'(x|y|xy|yx)?b$')
where the "$" anchors the string at the end and thereby ensures it's the last match found.
However it's a little more tricky to use the regexp mechanism(s) to ensure that only one matching character from the set is used ...
From Wiktor Stribiżew's comment & demo, I got my answer.
pat = re.compile(r'([xy]?)(?:(?!\1)[xy])?b')
Thank you all!
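For reference, this is how that last pattern behaves on the test strings. The extra input 'xxb' is my own addition to show a repeated character:

```python
import re

# Group 1 takes an optional x or y; the second character is allowed only if
# it differs from group 1, which the negative lookahead (?!\1) enforces.
pat = re.compile(r'([xy]?)(?:(?!\1)[xy])?b')

texts = ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb', 'xxb']
hits = [pat.search(t).group() for t in texts]
print(hits)
```

When group 1 is empty, (?!\1) can never succeed, so the optional second character is simply skipped and a bare 'b' still matches.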

Regular Expression Testing

So I have been working on this project to help myself understand regular expressions. There are 6 lines of input. The first line contains 10 character strings. The last 5 lines each contain a valid regular expression.
For the output, for each regular expression, print all the character strings from line 1 that match it; if none match, print none. # is used to denote an empty string. I have gotten everything working except the empty string part. An example first line of input would be
1)#,aac,acc,abc,ac,abbc,abbbc,abbbbc,aabc,accb
and I would like the second line of input to be
2)b*
The output I'm trying to get is #
and so far it outputs nothing. Here is my code:
import re
inp = input("Search String:").upper().split(',')
for runs in range(50):
    temp = []
    query = input("Search Query:").replace("?", "[A-Z_0-9]+?+$").upper()
    for item in inp:
        search = re.match(query, item)
        if search:
            if search.group() not in temp:
                temp.append(search.group())
    if len(temp) > 0:
        print(" ".join(temp))
    else:
        print("NONE")
b matches only the literal character 'b', so your search string will only match a sequence of zero or more b's, such as
b
or
bbbb
or
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb (and so on)
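Your pattern will not consume any other characters. Because zero repetitions are allowed, though, re.match also succeeds with an empty match at the start of any string, which is why your program prints an empty line instead of NONE.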
I don't know why you are using a specific letter, but I assume you intended an escape sequence, like "\b*", although that only matches transitions between types of characters, so it won't match # in this context. If you use \W*, it will match # (not sure whether it will match the other stuff you want).
If you haven't already, check out the following resources on regular expressions, including all of the escape characters and metacharacters:
Wikipedia
Python.org 2.7
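One possible way to get # as output, sketched under the assumption that a string should count only when the whole string matches, is to translate # to the empty string before matching and to report the original item. The function name matching_items is hypothetical, and the upper-casing and ? replacement from the question are left out:

```python
import re

def matching_items(items, pattern):
    # Return the items that the pattern matches completely; "#" stands for "".
    regex = re.compile(pattern)
    hits = []
    for item in items:
        text = "" if item == "#" else item
        # fullmatch requires the pattern to cover the entire string.
        if regex.fullmatch(text) and item not in hits:
            hits.append(item)
    return hits

items = "#,aac,acc,abc,ac,abbc,abbbc,abbbbc,aabc,accb".split(",")
print(" ".join(matching_items(items, "b*")) or "NONE")
print(" ".join(matching_items(items, "ab*c")) or "NONE")
```

With b* only the empty string matches in full, so the first line printed is #.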

How to make python regex which matches multiple patterns to same index

Is it possible to get all overlapping matches, which starts from the same index, but are from different matching group?
e.g. when I look for pattern "(A)|(AB)" from "ABC" regex should return following matches:
(0,"A") and (0,"AB")
For one possibility see the answer of Evpok. The second interpretation of your question can be that you want to match all patterns at the same time from the same position. You can use a lookahead expression in this case. E.g. the regular expression
(?=(A))(?=(AB))
will give you the desired result (i.e. all places where both patterns match together with the groups).
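A short illustration of the lookahead form with re.finditer follows. The sample string "AABAABA" is borrowed from further down in this answer:

```python
import re

# Both lookaheads are zero-width, so each match is empty; the captured
# text is available through the groups.
both = re.compile(r'(?=(A))(?=(AB))')

found = [(m.start(), m.groups()) for m in both.finditer("AABAABA")]
print(found)
print([(m.start(), m.groups()) for m in both.finditer("ABC")])
```

Only positions where both patterns match are reported, so a lone A that is not followed by B does not appear in the result.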
Update: With the additional clarification this can still be done with a single regex. You just have to make both groups above optional, i.e.
(?=(A))?(?=(AB))?(?:(?:A)|(?:AB))
Nevertheless I wouldn't suggest to do so. You can much more easily look for each pattern separately and later join the results.
string = "AABAABA"
result = [(g.start(), g.group()) for g in re.compile('A').finditer(string)]
result += [(g.start(), g.group()) for g in re.compile('AB').finditer(string)]
I got this from somewhere, though I can't recall where or from whom:
def myfindall(regex, seq):
    resultlist = []
    pos = 0
    while True:
        result = regex.search(seq, pos)
        if result is None:
            break
        resultlist.append(seq[result.start():result.end()])
        pos = result.start() + 1
    return resultlist
It returns a list of all (even overlapping) matches, with the limit of no more than one match per start index.
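To make the difference from re.findall and the one-match-per-index limit concrete, here is a self-contained variant of the same loop. The name find_overlapping is mine, and it also records start positions:

```python
import re

def find_overlapping(regex, seq):
    # Same idea as myfindall: restart the search one character after the
    # start of the previous match instead of after its end.
    results = []
    pos = 0
    while True:
        m = regex.search(seq, pos)
        if m is None:
            break
        results.append((m.start(), m.group()))
        pos = m.start() + 1
    return results

aba = re.compile("ABA")
print(aba.findall("ABABA"))
print(find_overlapping(aba, "ABABA"))
print(find_overlapping(re.compile("A|AB"), "ABC"))
```

The last call returns only the A at index 0: an alternation yields a single match per start position, so the AB from the question's example is not reported this way.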
