Regex for search specific substring [duplicate] - python

This question already has answers here:
How to find overlapping matches with a regexp?
(4 answers)
Closed 4 years ago.
I tried this code:
re.findall(r"d.*?c", "dcc")
to search for substrings with first letter d and last letter c.
But I get output ['dc']
The correct output should be ['dc', 'dcc'].
What did i do wrong?

What you're looking for isn't possible using any built-in regexp functions that I know of. re.findall() only returns non-overlapping matches. After it matches dc, it looks for another match starting after that. Since the rest of the string is just c, and that doesn't match, it's done, so it just returns ["dc"].
When you use a quantifier like *, you have a choice of making it greedy, or non-greedy -- either it finds the longest or shortest match of the regexp. To do what you want, you need a way of telling it to look for successively longer matches until it can't find anything. There's no simple way to do this. You can use a quantifier with a specific count, but you'd have to loop it in your code:
d.{0}c
d.{1}c
d.{2}c
d.{3}c
...
If you have a regexp with multiple quantified sub-patterns, you'd have to try all combinations of lengths.

Your two problems are that .* is greedy while .*? is minimal, and that re.findall() only returns non-overlapping matches. Here's a possible solution:
def findall_inner(expr, text):
explore = list(re.findall(expr, text))
matches = set()
while explore:
word = explore.pop()
if len(word) >= 2 and word not in matches:
explore.extend(re.findall(expr, word[1:])) # try more removing first letter
explore.extend(re.findall(expr, word[:-1])) # try more removing last letter
matches.add(word)
return list(matches)
found = findall_inner(r"d.*c", "dcc")
print(found)
This is a little bit of overkill, using findall instead of search and using >= 2 instead of > 2, as in this case there can only be one non-overlapping match of d.*c and one-character strings cannot match the pattern. But there is some flexibility in it depending on what other kinds of patterns you might want.

Try this regex:
^d.*c$
Essentially, you are looking for the start of the string to be d and the end of the string to be c.

This is a very important point to understand: a regex engine always returns the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine starts at the first character of the string. It tries all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, does the engine continue with the second character in the text. So when it find ['dc'] then engine pass 'dc' and continues with second 'c'. So it is impossible to match with ['dcc'].

Related

How to find whether a string pattern exists in a python string?

Lets say I have an arrays of strings.I want to find all the string which contain the following
substring , charachter digit digit digit charachter (CDDDC will be the pattern). For instance the format would be as following:
H554L
K007K
Is there any fast string expression matching to find such occurrences ?
Things like this are the field of "regex". Regex is made for pattern matching. It of itself is a broad ttopic too much to explain here (check regexbuddy or another site).
python has a regex compiler build in, under the re (as well as regex module). A simple solution would hence be:
word for word in somelist if re.search(r"[a-zA-Z]\d{3}[a-zA-Z]", word)
Which iterates over somelist, and selects anything that matches (completely) a character in one of the two "ranges", followed by 3 digits, followed by a character in the range.
A noted as in the comments: re.search will match (find) any item which has a "part" of that item matching the "pattern". So it will match a123b as well as abc b123cd. If you wish to make sure that the full "word" in the array matches the substring use re.fullmatch instead.
Fullmatch will match a123b but not abc b123cd and not ab123cd
Try this example with this regex:
regex: (?i)[A-Z]\d\d\d[A-Z]
import re
xx = ['aeeea','5eeae','H554L','juan','K007K']
for i in xx:
r1 = re.findall(r"(?i)[A-Z]\d\d\d[A-Z]", i)
print (', '.join(r1)
)
Run the example online

Find all strings starting and ending with given substring in a string using regex in Python [duplicate]

This question already has an answer here:
Regex including overlapping matches with same start
(1 answer)
Closed 3 years ago.
I have given a string
ATGCCAGGCTAGCTTATTTAA
and I have to find out all substrings in string which starts with ATG and end with either of TAA, TAG, TGA.
Here is what I am doing:
seq="ATGCCAGGCTAGCTTATTTAA"
pattern = re.compile(r"(ATG[ACGT]*(TAG|TAA|TGA))")
for match in re.finditer(pattern, seq):
coding = match.group(1)
print(coding)
This code is giving me output:
ATGCCAGGCTAGCTTATTTAA
But actual output should be :
ATGCCAGGCTAGCTTATTTAA, ATGCCAGGCTAG
what I should change in my code?
tl;dr: can't use regex for this
The problem isn't greedy/non-greedy.
The problem isn't overlapping matches either: there's a solution for that (How to find overlapping matches with a regexp?)
The real problem with OP's question is, REGEX isn't designed for matches with the same start. Regex performs a linear search and stops at the first match. That's one of the reasons why it's fast. However, this prevents REGEX from supporting multiple overlapping matches starting at the same character.
See
Regex including overlapping matches with same start
for more info.
Regex isn't the be-all-end-all of pattern matching. It's in the name: Regular expressions are all about single-interpretation symbol sequences, and DNA tends not to fit that paradigm.
In r"(ATG[ACGT]*(TAG|TAA|TGA))", the * operator is "greedy". Use the non-greedy modifier, like r"(ATG[ACGT]*?(TAG|TAA|TGA))", to tell the regexp to take the shortest matching string, not the longest.

Regex subsequence matching

I'm using python but code in any language will do as well for this question.
Suppose I have 2 strings.
sequence ='abcd'
string = 'axyzbdclkd'
In the above example sequence is a subsequence of string
How can I check if sequence is a subsequence of string using regex? Also check the examples here for difference in subsequence and subarray and what I mean by subsequence.
The only think I could think of is this but it's far from what I want.
import re
c = re.compile('abcd')
c.match('axyzbdclkd')
Just allow arbitrary strings in between:
c = re.compile('.*a.*b.*c.*d.*')
# .* any character, zero or more times
You can, for an arbitrary sequence construct a regex like:
import re
sequence = 'abcd'
rgx = re.compile('.*'.join(re.escape(x) for x in sequence))
which will - for 'abcd' result in a regex 'a.*b.*c.*d'. You can then use re.find(..):
the_string = 'axyzbdclkd'
if rgx.search(the_string):
# ... the sequence is a subsequence.
pass
By using re.escape(..) you know for sure that for instance '.' in the original sequence will be translated to '\.' and thus not match any character.
I don't think the solution is as simple as #schwobaseggl claims. Let me show you another sequence from your database: ab1b2cd. By using the abcd subsequence for pattern matching you can get 2 results: ab(1b2)cd and a(b1)b(2)cd. So for testing purposes the proposed ^.*a.*b.*c.*d.*$ is ok(ish), but for parsing the ^a(.*)b(.*)cd$ will always be greedy. To get the second result you'll need to make it lazy: ^a(.*?)b(.*)cd$. So if you need this for parsing, then you should know how many variables are expected and to optimize the regex pattern you need to parse a few example strings and put the gaps with capturing groups only to the positions you really need them. An advanced version of this would inject the pattern of the actual variable instead of .*, so for example ^ab(\d\w\d)cd$ or ^a(\w\d)b(\d)cd$ in the second case.

How does regex {m,n}? work in Python?

From the Python documentation of the re module:
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
I'm confused about how this works. How is this any different from {m}? I do not see how there could ever be a case where the pattern could match more than m repetitions. If there are m+1 repetitions in a row, then there are also m. What am I missing?
Whereas, it is true that a regex solely containing a{3,5}? and one with the pattern: a{3} will match the same thing (i.e. re.match(r'a{3,5}?', 'aaaaa').group(0) and re.match(r'a{3}', 'aaaaa').group(0)
will both return 'aaa'), the differences between the patterns becomes clear when you look at patterns containing these two elements. Say your pattern is a{3,5}?b, then aaab, aaaab, and aaaaab will be matched. If you just used a{3}b then only aaab would get matched. aaaab and aaaaab would not.
Look to Shashank's answer for examples that flush out this difference a little more, or test your own. I've found that this site is a good resource to use to test out python regular expressions.
I think the way to see the difference between the two is through the following examples:
>>> re.findall(r'ab{3,5}?', 'abbbbb')
['abbb']
>>> re.findall(r'ab{3}', 'abbbbb')
['abbb']
Those two runs give the same results as expected, but let's see some differences.
Difference 1: A range quantifier on a subpattern lets you match a large range of patterns containing that subpattern. This lets you find matches where there normally wouldn't be any if you used an exact quantifier:
>>> re.findall(r'ab{3,5}?c', 'abbbbbc')
['abbbbbc']
>>> re.findall(r'ab{3}c', 'abbbbbc')
[]
Difference 2: Greedy doesn't necessarily mean "match the shortest subpattern possible". It's actually a bit more like "match the shortest subpattern possible starting from the leftmost unmatched index that can possibly start off a match":
>>> re.findall(r'b{3,5}?c', 'bbbbbc')
['bbbbbc']
>>> re.findall(r'b{3}c', 'bbbbbc')
['bbbc']
The way I think of regex is as a construct that scans the string from left to right with two iterators that point to indices in the string. The first iterator marks the beginning of the next possible pattern. The second iterator goes through the suffix of the substring starting from the first iterator and tries to complete the pattern. The first iterator only advances when the construct determines that the regex pattern cannot possibly match a string starting from that index. Thus, defining a range for your quantifier will make it so that the first iterator will keep matching sub-patterns beyond the minimum value specified even if the quantifier is non-greedy.
A non-greedy regex will stop its second iterator as soon as the pattern can stop, but a greedy regex will "save" the position of a matched pattern and keep searching for a longer one. If a longer pattern is found, then it uses that one instead, if it's not found, then it uses the shorter one that it saved in memory earlier.
That's why you see the possibly surprising result with 'b{3,5}?c' and 'bbbbbc'. Although the regex is greedy, it will still never advance its first iterator until the pattern match fails, and that's why the substring with 5 'b' characters is matched by the non-greedy regex even though its not the shortest pattern matchable.
SwankSwashbucklers's answer describes the greedy version. The ? makes it non-greedy, which means it will try to match as few items as possible, which means that
`re.match('a{3,5}?b', 'aaaab').group(0)` # returns `'aaaab'`
but
`re.match('a{3,5}?', 'aaaa').group(0)` # returns `'aaa'`
let say we have a string to be searched is:
str ="aaaaa"
Now we have patter = a{3,5}
The string which it matches are :{aaa,aaaa,aaaaa}
But here we have string as "aaaaa" since we have only one option.
Now lets say we have pattern = a{3,5}?
in this case it matches only "aaa" not "aaaaa".
Thus it takes the minimum items as possible,being non greedy.
please try using online regular Expression at :https://pythex.org/
It will be great help and we check immediately what it matches and what it does not

The most elegant way to find n words in String with the particular word

There is a big string and I need to find all substrings containing exactly N words (if it is possible).
For example:
big_string = "The most elegant way to find n words in String with the particular word"
N = 2
find_sub(big_string, 'find', N=2) # => ['way to find n words']
I've tried to solve it with regular expressions, but it happened to be more complex then I expect at first. Is there an elegant solution around I've just overlook?
Upd
By word we mean everything separated by \b
N parameter indicates how many words on each side of the 'find' should be
For your specific example (if we use the "word" definition of regular expressions, i.e. anything containing letters, digits and underscores) the regex would look like this:
r'(?:\w+\W+){2}find(?:\W+\w+){2}'
\w matches one of said word characters. \W matches any other character. I think it's obvious where in the pattern your parameters go. You can use the pattern with re.search or re.findall.
The issue is if there are less than the desired amount of words around your query (i.e. if it's too close to one end of the string). But you should be able to get away with:
r'(?:\w+\W+){0,2}find(?:\W+\w+){0,2}'
thanks to greediness of repetition. Note that in any case, if you want multiple results, matches can never overlap. So if you use the first pattern, you will only get the first match, if two occurrences of find are to close to each other, whereas in the second, you won't get n words before the second find (the ones that were already consumed will be missing). In particular, if two occurrences of find are closer together than n so that the second find will already be part of the first match, then you can't get the second match at all.
If you want to treat a word as anything that is not a white-space character, the approach looks similar:
r'(?:\S+\s+){0,2}find(?:\s+\S+){0,2}'
For anything else you will have to come up with the character classes yourself, I guess.

Categories