How to make a Python regex which matches multiple patterns at the same index - python

Is it possible to get all overlapping matches which start from the same index but come from different matching groups?
e.g. when I look for the pattern "(A)|(AB)" in "ABC", the regex should return the following matches:
(0,"A") and (0,"AB")

For one possibility, see Evpok's answer. A second interpretation of your question is that you want to match all patterns at the same time from the same position. You can use a lookahead expression in this case. E.g. the regular expression
(?=(A))(?=(AB))
will give you the desired result (i.e. all places where both patterns match together with the groups).
Update: With the additional clarification this can still be done with a single regex. You just have to make both groups above optional, i.e.
(?=(A))?(?=(AB))?(?:(?:A)|(?:AB))
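For example, a quick check of this combined pattern (the test string 'ABCAB' is just an illustration, not from the question):
import re

for m in re.finditer(r'(?=(A))?(?=(AB))?(?:(?:A)|(?:AB))', 'ABCAB'):
    print(m.start(), m.group(1), m.group(2))
# 0 A AB
# 3 A AB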
Nevertheless, I wouldn't suggest doing so. You can much more easily look for each pattern separately and join the results later.
string = "AABAABA"
result = [(g.start(), g.group()) for g in re.compile('A').finditer(string)]
result += [(g.start(), g.group()) for g in re.compile('AB').finditer(string)]

I got this somewhere, though I can't recall where or from whom:
def myfindall(regex, seq):
    # repeatedly search, restarting one character past the start of the
    # previous match, so overlapping matches are found as well
    resultlist = []
    pos = 0
    while True:
        result = regex.search(seq, pos)
        if result is None:
            break
        resultlist.append(seq[result.start():result.end()])
        pos = result.start() + 1
    return resultlist
It returns a list of all (even overlapping) matches, with the limitation that it reports no more than one match starting at each index.
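For example (the pattern 'AA' and the string 'AAAA' are just an illustration): re.findall only reports the non-overlapping matches, while myfindall also reports the one starting at index 1.
import re

print(re.findall('AA', 'AAAA'))             # ['AA', 'AA']
print(myfindall(re.compile('AA'), 'AAAA'))  # ['AA', 'AA', 'AA']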


Why does the order of expressions matter in re.match?

I'm making a function that will take a string like "three()" or something like "{1 + 2}" and turn it into a list of tokens (e.g. "three()" = ["three", "(", ")"]). I'm using re.match to help separate the string.
import re

def lex(s):
    # scan input string and return a list of its tokens
    seq = []
    patterns = r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/|)(\t|\n|\r| )*"
    m = re.match(patterns, s)
    while m is not None:
        if s == '':
            break
        seq.append(m.group(2))
        s = s[len(m.group(0)):]
        m = re.match(patterns, s)
    return seq
This works if the string is just "three", but if the string contains "()" or any symbol, it stays in the while loop forever.
A funny thing happens, though: when I move ([a-z])* within the pattern string, it works. Why is that happening?
Works: patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
Does not work: patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
This one is a bit tricky, but the problem is with this part: ([a-z])*. This matches any string of lowercase letters of length 0 (zero) or more.
If you put this sequence at the end, like here:
patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
The regex engine tries the other alternatives first and, if one of them matches, stops there. Only if none of the others match does it try ([a-z])*, and since * is 'greedy', it will match all of three; on the next iterations of the loop it then matches ( and finally ).
Read an explanation of how the full expression is tested in the documentation (thanks to @kaya3).
However, if you put that sequence at the start, like here:
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
It will now try that alternative first. It's still greedy, so three still gets matched. But on the next iteration, it will try to match ([a-z])* against the remaining '()' - and it matches, since that string starts with zero letters. The match consumes zero characters, so s never shrinks and the code gets stuck in the loop. You can fix it by changing the * to a +, which only matches if there are 1 or more letters:
patterns = (r"^(\t|\n|\r| )*(([a-z])+|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")

How to write a regular expression to find a combination of characters, where each can only appear once, in Python

I would like to find whether a combination of "x" and "y" occurs in a string before a "b". The combination is optional, and each character can appear only once. For example:
import re

def findpat(texts, pat):
    for t in texts:
        if re.search(pat, t):
            print(re.search(pat, t).group())
        else:
            print(None)

pat = re.compile(r'[xy]*?b')
text = ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb']
findpat(text, pat)
# it prints
# xyb
# xb
# yb
# yxb
# b
# xyxb
For the last one, my desired output is "yxb".
How should I modify my regex? Many thanks
You may use the following approach: match and capture the two groups, ([xy]*)(b). Then, once a match is found, check if the length of the value in Group 1 is the same as the number of unique chars in this value. If not, remove the chars from the start of the group value until you get a string with the length of the number of unique chars.
Something like:
import re

def findpat(texts, pat):
    for t in texts:
        m = re.search(pat, t)                   # Find a match
        if m:
            tmp = set(m.group(1))               # Get the unique chars
            if len(tmp) == len(m.group(1)):     # If Group 1 is already duplicate-free
                print(m.group())                # Report the whole match value
            else:
                res = m.group(1)
                while len(tmp) < len(res):      # While the string is longer than the
                    res = res[1:]               # number of unique chars, truncate from the left
                print("{}{}".format(res, m.group(2)))  # Print the result
        else:
            print(None)                         # Else, no match

pat = re.compile(r'([xy]*)(b)')
text = ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb']
findpat(text, pat)
# => xyb, xb, yb, yxb, b, yxb
See the Python demo
You can use this pattern
r'(x?y?|yx)b'
To break it down: the interesting part, x?y?|yx, will match:
empty string
only x
only y
xy
and on the alternative branch, yx
A piece of advice: when you aren't very comfortable with regex and the number of scenarios is small, you can simply brute-force the pattern. It's ugly, but it makes clear what your cases are:
r'b|xb|yb|xyb|yxb'
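A quick sanity check of both patterns against the test list from the question:
import re

tests = ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb']
for pat in (r'(x?y?|yx)b', r'b|xb|yb|xyb|yxb'):
    print([re.search(pat, t).group() for t in tests])
# both print ['xyb', 'xb', 'yb', 'yxb', 'b', 'yxb']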
Part 2.
For a generic solution that does the same, but for any number of characters instead of just {x, y}, the following regex style can be used:
r'(?=[^x]*x?[^x]*b)(?=[^y]*y?[^y]*b)(?=[^z]*z?[^z]*b)[xyz]*b'
I'll explain it a bit:
By using lookaheads you advance the regex cursor and for each position, you just "look ahead" and see if what follows respects a certain condition. By using this technique, you may combine several conditions into a single regex.
For a given cursor position, we test that each character from our set appears at most once from that position up to our target b character. We do this with the pattern [^x]*x?[^x]*, which means: match any number of not-x, then at most one x, then any number of not-x again.
Once the test conditions are met, we start advancing the cursor and matching all the characters from our needed set, until we find a b. At this point we are guaranteed that we won't match any duplicates, because we performed our lookahead tests.
Note: I strongly suspect that this has poor performance, because it does backtracking. You should only use it for small test strings.
Test it.
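A quick check of the generic pattern in Python (the test strings below are made-up examples, not from the question):
import re

pat = re.compile(r'(?=[^x]*x?[^x]*b)(?=[^y]*y?[^y]*b)(?=[^z]*z?[^z]*b)[xyz]*b')
for t in ['xyzb', 'xyxb', 'zzxb', 'b']:
    m = pat.search(t)
    print(t, '->', m.group() if m else None)
# xyzb -> xyzb
# xyxb -> yxb
# zzxb -> zxb
# b -> b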
Well, the regexp that literally passes your test cases is:
pat = re.compile(r'(x|y|xy|yx)?b$')
where the "$" anchors the string at the end and thereby ensures it's the last match found.
However it's a little more tricky to use the regexp mechanism(s) to ensure that only one matching character from the set is used ...
From Wiktor Stribiżew's comment & demo, I got my answer.
pat = re.compile(r'([xy]?)(?:(?!\1)[xy])?b')
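A quick check of this pattern against the original test list:
import re

pat = re.compile(r'([xy]?)(?:(?!\1)[xy])?b')
for t in ['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb']:
    print(pat.search(t).group())
# xyb, xb, yb, yxb, b, yxb -- the last one is now "yxb", as desired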
Thank you all!

Finding patterns in HEX data with regex but getting duplicates

I have a Python regex script that goes over hex data and finds patterns. The regex looks like this:
r"(.{6,}?)\1{2,}"
All it does is look for hex strings at least 6 characters long that repeat at least twice. My issue is that it also finds substrings inside larger strings it has already found. For example:
if the data were "a00b00a00b00a00b00a00b00a00b00a00b00", it would find 2 instances of "a00b00a00b00a00b00" and 6 instances of "a00b00". How could I go about keeping only the longest patterns found, and avoid even looking for shorter patterns, without more hardcoded parameters?
#!/usr/bin/python
import fnmatch

pattern_string = "abcdefabcdef"

def print_pattern(pattern, num):
    n = num
    # split the string into chunks of length n (6 in this case)
    new_pat = [pattern[i:i+n] for i in range(0, len(pattern), n)]
    # hit counter for matches
    match = 0
    # stores the value of the most recent chunk
    new_match = ""
    # loop through the chunks to see if the first one repeats
    for new in new_pat:
        new_match = new
        print(new)
        # if it matches the first chunk, keep adding to match
        if fnmatch.fnmatch(new, new_pat[0]):
            match += 1
    if match:
        print("Count: %d\nPattern: %s" % (match, new_match))
    # return the match
    return new_match

print_pattern(pattern_string, 6)
regex is better but this was funner to write
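For comparison, a minimal regex-based sketch of one option (it reuses the question's pattern, only with a greedy group instead of the lazy one; the test string is just the repeated example from the question): the greedy .{6,} captures the longest repeating unit it can, and because re.finditer consumes each run in full, the shorter repeats inside an already-matched run are not reported again.
import re

data = "a00b00a00b00a00b00a00b00a00b00a00b00"
for m in re.finditer(r"(.{6,})\1+", data):
    unit = m.group(1)
    count = len(m.group(0)) // len(unit)
    print("unit=%s count=%d span=%s" % (unit, count, m.span()))
# unit=a00b00a00b00a00b00 count=2 span=(0, 36)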

Finding the recurring pattern

Let's say I have a number with a recurring pattern, i.e. there exists a string of digits that repeats itself in order to make the number in question. For example, such a number might be 1234123412341234, created by repeating the digits 1234.
What I would like to do, is find the pattern that repeats itself to create the number. Therefore, given 1234123412341234, I would like to compute 1234 (and maybe 4, to indicate that 1234 is repeated 4 times to create 1234123412341234)
I know that I could do this:
def findPattern(num):
    num = str(num)
    for i in range(1, len(num)):
        patt = num[:i]
        # skip candidate lengths that don't divide the string evenly
        if len(num) % len(patt):
            continue
        if patt * (len(num) // len(patt)) == num:
            return patt, len(num) // len(patt)
    return None
However, this seems a little too hacky. I figured I could use itertools.cycle to compare two cycles for equality, which doesn't really pan out:
In [25]: c1 = itertools.cycle(list(range(4)))
In [26]: c2 = itertools.cycle(list(range(4)))
In [27]: c1==c2
Out[27]: False
Is there a better way to compute this? (I'd be open to a regex, but I have no idea how to apply it there, which is why I didn't include it in my attempts)
EDIT:
I don't necessarily know that the number has a repeating pattern, so I have to return None if there isn't one.
Right now, I'm only concerned with detecting numbers/strings that are made up entirely of a repeating pattern. However, later on, I'll likely also be interested in finding patterns that start after a few characters:
magic_function(78961234123412341234)
would return 1234 as the pattern, 4 as the number of times it is repeated, and 4 as the first index in the input where the pattern first presents itself
(.+?)\1+
Try this. Grab the capture. See demo.
import re

p = re.compile(r'(.+?)\1+')
test_str = "1234123412341234"
print(re.findall(p, test_str))  # ['1234']
Add anchors (and the MULTILINE flag if you are testing several lines at once) if you want the regex to fail on 12341234123123, which should return None.
^(.+?)\1+$
See demo.
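For the edited requirements (pattern, repeat count, and the index where the pattern first appears), a rough sketch built on the unanchored pattern from this answer (the name magic_function comes from the question; this implementation is just one possibility):
import re

def magic_function(num):
    m = re.search(r'(.+?)\1+', str(num))
    if m is None:
        return None                      # no repeating pattern at all
    unit = m.group(1)
    count = len(m.group(0)) // len(unit)
    return unit, count, m.start()

print(magic_function(1234123412341234))      # ('1234', 4, 0)
print(magic_function(78961234123412341234))  # ('1234', 4, 4)
print(magic_function(123456))                # None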
One way to find a recurring pattern and number of times repeated is to use this pattern:
(.+?)(?=\1+$|$)
with the g (global) option (in Python, re.findall plays that role).
It will return the repeated pattern, and the number of results tells you how many times it is repeated.
A non-repeating string (the failure case) will return only 1 result.
A repeated pattern will return 2 or more results (the number of times it is repeated).
Demo
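A quick check of this in Python, using the strings from this thread:
import re

print(re.findall(r'(.+?)(?=\1+$|$)', '1234123412341234'))  # ['1234', '1234', '1234', '1234']
print(re.findall(r'(.+?)(?=\1+$|$)', '12341234123123'))    # ['12341234123123'] -> only 1 result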

SequenceMatcher for multiple inputs, not just two?

I'm wondering about the best way to approach this particular problem, and whether any libraries exist for it (Python preferably, but I can be flexible if need be).
I have a file with a string on each line. I would like to find the longest common patterns and their locations in each line. I know that I can use SequenceMatcher to compare lines one and two, one and three, and so on, and then correlate the results, but is there something that already does this?
Ideally these matches would appear anywhere on each line, but for starters I can be fine with them existing at the same offset in each line and go from there. Something like a compression library that has a good API to access its string table might be ideal, but I have not found anything so far that fits that description.
For instance with these lines:
\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b
\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed
\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e
I would want to see that 0-1, and 10-12 match in all lines at the same position and line1[4,5] matches line2[5,6] matches line3[7,8].
Thanks,
If all you want is to find common substrings that are at the same offset in each line, all you need is something like this:
matches = []   # will contain (startpos, endpos, matchstring) tuples
zipped_strings = list(zip(s1, s2, s3))
startpos = -1
for i in range(len(zipped_strings)):
    c1, c2, c3 = zipped_strings[i]
    # if you're not inside a match,
    # look for matching characters and save the match start position
    if startpos == -1 and c1 == c2 == c3:
        startpos = i
    # if you are inside a match,
    # look for non-matching characters, save the match to matches, reset startpos
    elif startpos > -1 and not c1 == c2 == c3:
        matches.append((startpos, i, s1[startpos:i]))
        startpos = -1
# if you're still inside a match when you run out of string, save that match too!
if startpos > -1:
    endpos = len(zipped_strings)
    matches.append((startpos, endpos, s1[startpos:endpos]))
To find the longest common pattern regardless of location, SequenceMatcher does sound like the best idea, but instead of comparing string1 to string2 and then string1 to string3 and trying to merge the results, just get all common substrings of string1 and string2 (with get_matching_blocks), and then compare each of those results to string3 to get the matches shared by all three strings.
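A rough sketch of that idea (the function name and variables are placeholders, not from the question): collect the matching blocks of s1 and s2, then run each common piece against s3 and keep whatever survives in all three.
from difflib import SequenceMatcher

def common_to_all(s1, s2, s3):
    pieces = []
    for block in SequenceMatcher(None, s1, s2).get_matching_blocks():
        piece = s1[block.a:block.a + block.size]
        if not piece:
            continue  # the final sentinel block has size 0
        for sub in SequenceMatcher(None, piece, s3).get_matching_blocks():
            if sub.size:
                pieces.append(piece[sub.a:sub.a + sub.size])
    return pieces

print(common_to_all("abcdef", "xxabcyy", "zabcz"))  # ['abc']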
