Regex subsequence matching - python

I'm using python but code in any language will do as well for this question.
Suppose I have 2 strings.
sequence ='abcd'
string = 'axyzbdclkd'
In the above example sequence is a subsequence of string
How can I check if sequence is a subsequence of string using regex? Also check the examples here for difference in subsequence and subarray and what I mean by subsequence.
The only think I could think of is this but it's far from what I want.
import re
c = re.compile('abcd')
c.match('axyzbdclkd')

Just allow arbitrary strings in between:
c = re.compile('.*a.*b.*c.*d.*')
# .* any character, zero or more times

You can, for an arbitrary sequence construct a regex like:
import re
sequence = 'abcd'
rgx = re.compile('.*'.join(re.escape(x) for x in sequence))
which will - for 'abcd' result in a regex 'a.*b.*c.*d'. You can then use re.find(..):
the_string = 'axyzbdclkd'
if rgx.search(the_string):
# ... the sequence is a subsequence.
pass
By using re.escape(..) you know for sure that for instance '.' in the original sequence will be translated to '\.' and thus not match any character.

I don't think the solution is as simple as #schwobaseggl claims. Let me show you another sequence from your database: ab1b2cd. By using the abcd subsequence for pattern matching you can get 2 results: ab(1b2)cd and a(b1)b(2)cd. So for testing purposes the proposed ^.*a.*b.*c.*d.*$ is ok(ish), but for parsing the ^a(.*)b(.*)cd$ will always be greedy. To get the second result you'll need to make it lazy: ^a(.*?)b(.*)cd$. So if you need this for parsing, then you should know how many variables are expected and to optimize the regex pattern you need to parse a few example strings and put the gaps with capturing groups only to the positions you really need them. An advanced version of this would inject the pattern of the actual variable instead of .*, so for example ^ab(\d\w\d)cd$ or ^a(\w\d)b(\d)cd$ in the second case.

Related

How to find whether a string pattern exists in a python string?

Lets say I have an arrays of strings.I want to find all the string which contain the following
substring , charachter digit digit digit charachter (CDDDC will be the pattern). For instance the format would be as following:
H554L
K007K
Is there any fast string expression matching to find such occurrences ?
Things like this are the field of "regex". Regex is made for pattern matching. It of itself is a broad ttopic too much to explain here (check regexbuddy or another site).
python has a regex compiler build in, under the re (as well as regex module). A simple solution would hence be:
word for word in somelist if re.search(r"[a-zA-Z]\d{3}[a-zA-Z]", word)
Which iterates over somelist, and selects anything that matches (completely) a character in one of the two "ranges", followed by 3 digits, followed by a character in the range.
A noted as in the comments: re.search will match (find) any item which has a "part" of that item matching the "pattern". So it will match a123b as well as abc b123cd. If you wish to make sure that the full "word" in the array matches the substring use re.fullmatch instead.
Fullmatch will match a123b but not abc b123cd and not ab123cd
Try this example with this regex:
regex: (?i)[A-Z]\d\d\d[A-Z]
import re
xx = ['aeeea','5eeae','H554L','juan','K007K']
for i in xx:
r1 = re.findall(r"(?i)[A-Z]\d\d\d[A-Z]", i)
print (', '.join(r1)
)
Run the example online

Regex for search specific substring [duplicate]

This question already has answers here:
How to find overlapping matches with a regexp?
(4 answers)
Closed 4 years ago.
I tried this code:
re.findall(r"d.*?c", "dcc")
to search for substrings with first letter d and last letter c.
But I get output ['dc']
The correct output should be ['dc', 'dcc'].
What did i do wrong?
What you're looking for isn't possible using any built-in regexp functions that I know of. re.findall() only returns non-overlapping matches. After it matches dc, it looks for another match starting after that. Since the rest of the string is just c, and that doesn't match, it's done, so it just returns ["dc"].
When you use a quantifier like *, you have a choice of making it greedy, or non-greedy -- either it finds the longest or shortest match of the regexp. To do what you want, you need a way of telling it to look for successively longer matches until it can't find anything. There's no simple way to do this. You can use a quantifier with a specific count, but you'd have to loop it in your code:
d.{0}c
d.{1}c
d.{2}c
d.{3}c
...
If you have a regexp with multiple quantified sub-patterns, you'd have to try all combinations of lengths.
Your two problems are that .* is greedy while .*? is minimal, and that re.findall() only returns non-overlapping matches. Here's a possible solution:
def findall_inner(expr, text):
explore = list(re.findall(expr, text))
matches = set()
while explore:
word = explore.pop()
if len(word) >= 2 and word not in matches:
explore.extend(re.findall(expr, word[1:])) # try more removing first letter
explore.extend(re.findall(expr, word[:-1])) # try more removing last letter
matches.add(word)
return list(matches)
found = findall_inner(r"d.*c", "dcc")
print(found)
This is a little bit of overkill, using findall instead of search and using >= 2 instead of > 2, as in this case there can only be one non-overlapping match of d.*c and one-character strings cannot match the pattern. But there is some flexibility in it depending on what other kinds of patterns you might want.
Try this regex:
^d.*c$
Essentially, you are looking for the start of the string to be d and the end of the string to be c.
This is a very important point to understand: a regex engine always returns the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine starts at the first character of the string. It tries all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, does the engine continue with the second character in the text. So when it find ['dc'] then engine pass 'dc' and continues with second 'c'. So it is impossible to match with ['dcc'].

How does regex {m,n}? work in Python?

From the Python documentation of the re module:
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
I'm confused about how this works. How is this any different from {m}? I do not see how there could ever be a case where the pattern could match more than m repetitions. If there are m+1 repetitions in a row, then there are also m. What am I missing?
Whereas, it is true that a regex solely containing a{3,5}? and one with the pattern: a{3} will match the same thing (i.e. re.match(r'a{3,5}?', 'aaaaa').group(0) and re.match(r'a{3}', 'aaaaa').group(0)
will both return 'aaa'), the differences between the patterns becomes clear when you look at patterns containing these two elements. Say your pattern is a{3,5}?b, then aaab, aaaab, and aaaaab will be matched. If you just used a{3}b then only aaab would get matched. aaaab and aaaaab would not.
Look to Shashank's answer for examples that flush out this difference a little more, or test your own. I've found that this site is a good resource to use to test out python regular expressions.
I think the way to see the difference between the two is through the following examples:
>>> re.findall(r'ab{3,5}?', 'abbbbb')
['abbb']
>>> re.findall(r'ab{3}', 'abbbbb')
['abbb']
Those two runs give the same results as expected, but let's see some differences.
Difference 1: A range quantifier on a subpattern lets you match a large range of patterns containing that subpattern. This lets you find matches where there normally wouldn't be any if you used an exact quantifier:
>>> re.findall(r'ab{3,5}?c', 'abbbbbc')
['abbbbbc']
>>> re.findall(r'ab{3}c', 'abbbbbc')
[]
Difference 2: Greedy doesn't necessarily mean "match the shortest subpattern possible". It's actually a bit more like "match the shortest subpattern possible starting from the leftmost unmatched index that can possibly start off a match":
>>> re.findall(r'b{3,5}?c', 'bbbbbc')
['bbbbbc']
>>> re.findall(r'b{3}c', 'bbbbbc')
['bbbc']
The way I think of regex is as a construct that scans the string from left to right with two iterators that point to indices in the string. The first iterator marks the beginning of the next possible pattern. The second iterator goes through the suffix of the substring starting from the first iterator and tries to complete the pattern. The first iterator only advances when the construct determines that the regex pattern cannot possibly match a string starting from that index. Thus, defining a range for your quantifier will make it so that the first iterator will keep matching sub-patterns beyond the minimum value specified even if the quantifier is non-greedy.
A non-greedy regex will stop its second iterator as soon as the pattern can stop, but a greedy regex will "save" the position of a matched pattern and keep searching for a longer one. If a longer pattern is found, then it uses that one instead, if it's not found, then it uses the shorter one that it saved in memory earlier.
That's why you see the possibly surprising result with 'b{3,5}?c' and 'bbbbbc'. Although the regex is greedy, it will still never advance its first iterator until the pattern match fails, and that's why the substring with 5 'b' characters is matched by the non-greedy regex even though its not the shortest pattern matchable.
SwankSwashbucklers's answer describes the greedy version. The ? makes it non-greedy, which means it will try to match as few items as possible, which means that
`re.match('a{3,5}?b', 'aaaab').group(0)` # returns `'aaaab'`
but
`re.match('a{3,5}?', 'aaaa').group(0)` # returns `'aaa'`
let say we have a string to be searched is:
str ="aaaaa"
Now we have patter = a{3,5}
The string which it matches are :{aaa,aaaa,aaaaa}
But here we have string as "aaaaa" since we have only one option.
Now lets say we have pattern = a{3,5}?
in this case it matches only "aaa" not "aaaaa".
Thus it takes the minimum items as possible,being non greedy.
please try using online regular Expression at :https://pythex.org/
It will be great help and we check immediately what it matches and what it does not

What is the reason behind the advice that the substrings in regex should be ordered based on length?

longest first
>>> p = re.compile('supermanutd|supermanu|superman|superm|super')
shortest first
>>> p = re.compile('super|superm|superman|supermanu|supermanutd')
Why is the longest first regex preferred?
Alternatives in Regexes are tested in order you provide, so if first branch matches, then Rx doesn't check other branches. This doesn't matter if you only need to test for match, but if you want to extract text based on match, then it matters.
You only need to sort by length when your shorter strings are substrings of longer ones. For example when you have text:
supermanutd
supermanu
superman
superm
then with your first Rx you'll get:
>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']
but with second Rx:
>>> regex.findall(string)
[u'super', u'super', u'super', u'super', u'super']
Test your regexes with http://www.pythonregex.com/
As #MBO says, alternatives are tested in the order they are written, and once one of them matches, the RE engine goes on to what comes after.
This behaviour is common to Perl-like RE engines, and ultimately goes back to the 1985 Bell Labs design of the RE library for Edition 8 Unix.
Note that POSIX 2 (from 1991) has another definition, insisting on the leftmost longest match for the whole RE and subject to that, for each subexpression in turn (in lexical order). In POSIX 2, order of alternatives does not matter.
However, the difference in behaviour is often: irrelevant (if you're just testing), masked by backtracking (if the shorter match causes the rest of the RE to fail), or compensated by the rest of the RE matching the part that the longer match 'should have' -- so most people aren't aware of it.
I'd guess it's because they're matched in that order, and it's faster to match shorter substrings. As an extreme example, a match against a single letter | a huge string will perform much better if the single letter (which is probably going to be responsible for the majority of matches anyway) is tested against first.
But in practice you should measure, not guess. If you need to have a performant regexp, test variations against representative test data.
The advice to which you refer is contingent on the regex engine attempting to match the components of the alternation in strictly left-to-right order, as is documented for the Python re module.
Sorting substrings in descending length order is just a special case of a wider problem when you are trying to extract a series of tokens. The general principle is that you put the more specialised sub-regexes first. For example, you are writing the lexical analysis for a formula parser. You have a "float constant" subregex and an "int constant" subregex. Your first attempt at the float subregex is likely to also match int constants. If so, you have two choices: (1) write a more complicated float subregex that doesn't match int constants (2) put your int subregex first.

most efficient way to go about identifying sub-strings in a string in python?

i need to search a fairly lengthy string for CPV (common procurement vocab) codes.
at the moment i'm doing this with a simple for loop and str.find()
the problem is, if the CPV code has been listed in a slightly different format, this algorithm won't find it.
what's the most efficient way of searching for all the different iterations of the code within the string? Is it simply a case of reformatting each of the up to 10,000 CPV codes and using str.find() for each instance?
An example of different formatting could be as follows
30124120-1
301241201
30124120 - 1
30124120 1
30124120.1
etc.
Thanks :)
Try a regular expression:
>>> cpv = re.compile(r'([0-9]+[-\. ]?[0-9])')
>>> print cpv.findall('foo 30124120-1 bar 21966823.1 baz')
['30124120-1', '21966823.1']
(Modify until it matches the CPVs in your data closely.)
Try using any of the functions in re (regular expressions for Python). See the docs for more info.
You can craft a regular expression to accept a number of different formats for these codes, and then use re.findall or something similar to extract the information. I'm not certain what a CPV is so I don't have a regular expression for it (though maybe you could see if Google has any?)
cpv = re.compile(r'(\d{8})(?:[ -.\t/\\]*)(\d{1}\b)')
for m in re.finditer(cpv, ex):
cpval,chk = m.groups()
print("{0}-{1}".format(cpval,chk))
applied to your sample data returns
30124120-1
30124120-1
30124120-1
30124120-1
30124120-1
The regular expression can be read as
(\d{8}) # eight digits
(?: # followed by a sequence which does not get returned
[ -.\t/\\]* # consisting of 0 or more
) # spaces, hyphens, periods, tabs, forward- or backslashes
(\d{1}\b) # followed by one digit, ending at a word boundary
# (ie whitespace or the end of the string)
Hope that helps!

Categories