Ambiguous substring with mismatches - python

I'm trying to use regular expressions to find a substring in a string of DNA. This substring has ambiguous bases, like ATCGR, where R could be A or G. Also, the script must allow x number of mismatches. So this is my code:
import regex
s = 'ACTGCTGAGTCGT'
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)
So, with one mismatch I would expect 3 substrings AC**TGC**TGAGTCGT and ACTGC**TGA**GTCGT and ACTGCTGAGT**CGT**. The expected result should be like this:
['TGC', 'TGA', 'AGT', 'CGT']
But the output is
['TGC', 'TGA']
Even using re.findall, the code doesn't recognize the last substring.
On the other hand, if the code is set to allow 2 mismatches with {e<=2}, the output is
['TGC', 'TGA']
Is there another way to get all the substrings?

If I understand correctly, you are looking for all three-letter substrings that match the pattern T[AG]T with at most one error. Since you never mention two-letter results, I think the only kind of error you want is a character substitution.
To obtain the expected result, change {e<=1} to {s<=1} (or {s<2}): e counts any error (insertion, deletion or substitution), while s counts substitutions only. You also have to apply the constraint to the whole pattern by enclosing it in a group (capturing or non-capturing, as you prefer); otherwise {s<=1} is attached only to the last letter:
regex.findall(r'(T[AG]T){s<=1}', s, overlapped=True)
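For comparison, the substitution-only search can also be sketched without the third-party regex module, as a plain sliding window. This is not the method from the answer above, just a minimal illustration. The helper name find_with_mismatches and the IUPAC lookup table are assumptions introduced here, and the pattern is written in IUPAC codes (TRT stands for T[AG]T).

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

def find_with_mismatches(pattern, seq, max_mismatches):
    """Return every window of seq that differs from pattern in at most
    max_mismatches positions (substitutions only, overlaps allowed)."""
    hits = []
    width = len(pattern)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Count positions where the base is not allowed by the pattern code.
        mismatches = sum(base not in IUPAC[code]
                         for code, base in zip(pattern, window))
        if mismatches <= max_mismatches:
            hits.append(window)
    return hits

print(find_with_mismatches('TRT', 'ACTGCTGAGTCGT', 1))
# ['TGC', 'TGA', 'AGT', 'CGT']
```

Because every window is checked independently, overlapping hits are reported without any special flag.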

Related

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
It always begins with the letter C, in either lowercase or uppercase, which is then followed by a number of hexadecimal characters (meaning it can contain the letters A to F and numbers from 1 to 9, with no zeros included).
After those hexadecimal characters comes a letter P, also either in lowercase or uppercase.
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P.
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters one or more times
(?=[Pp] - Positive Lookahead for case insensitive p or P
([1-9a-fA-F]+$) - Capture group for one or more hexadecimal characters following the p or P
Your main problem is that you're using a lookbehind (?<=[pP]) for something that lies ahead. At that point the previous character is either the leading C or one of the characters just matched by [a-fA-F1-9]*, so it can never be a p and the assertion always fails. You need a lookahead (?=...).
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching. Put what you want to capture inside parentheses, which also means you can remove all lookarounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
Unless you specifically need \A, prefer ^. It is easier to read, and in multiline mode ^ also matches at the start of every line, which is what many situations and tools expect, while \A only ever matches at the very start of the input. I've used ^.
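As a rough Python sketch of the capturing approach: the function name extract_hex_parts is made up for illustration, and re.fullmatch is my own choice here so that trailing characters after the second hex part are rejected too (the pattern above has no $ anchor). With fullmatch the leading ^ is not needed.

```python
import re

# Group 1: hex characters between C and P; group 2: hex characters after P.
HEX_PARTS = re.compile(r'c([a-f1-9]*)p([a-f1-9]+)', flags=re.IGNORECASE)

def extract_hex_parts(text):
    """Return the two hex parts joined together, or None if text is invalid."""
    match = HEX_PARTS.fullmatch(text)
    if match is None:
        return None
    return match.group(1) + match.group(2)

samples = ['c45AFP2', 'CAPF', 'c56Bp26', 'CA6C22pAAA',
           'BCA6C22pAAA', 'c56Bp', 'c45AF0P2']
for sample in samples:
    print(sample, '->', extract_hex_parts(sample))
```

The four valid strings from the question give 45AF2, AF, 56B26 and A6C22AAA, and the three invalid ones give None.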

why isn't the re.group function giving me the expected output

import re
v = "aeiou"
c = "qwrtypsdfghjklzxcvbnm"
m = re.finditer(r"(?<=[%s])([%s]{2,})[%s]" % (c, v, c), input(), flags=re.I)
for i in m:
    print(i.group())
The above code is an attempt to solve the hackerrank question using re.finditer but for the input
rabcdeefgyYhFjkIoomnpOeorteeeeetmy
my output is
eef
Ioom
Oeor
eeeeet
instead of
ee
Ioo
Oeo
eeeee
I would like to know the reason why
It is because findall() and finditer() are returning different things.
In the re doc, for findall():
If one or more groups are present in the pattern, return a list of groups
for finditer():
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.
In your case, when you use findall() with a group, the whole match is ignored and it just returns the list of vowels captured by the group. But finditer() yields the whole match object, and group() with no argument returns the entire match, including the ending consonant.
There are two ways to get the result you want:
Keep the current pattern and use i.group(1) to get the match in group 1 instead of the whole match.
Use a lookahead assertion for the ending consonant, like (?=[%s]), so that the matched string contains only the vowels.
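Both options can be sketched side by side as follows. The sample input is the one from the question; the variable names are mine, and the results are collected into lists instead of being printed one by one.

```python
import re

VOWELS = "aeiou"
CONSONANTS = "qwrtypsdfghjklzxcvbnm"
text = "rabcdeefgyYhFjkIoomnpOeorteeeeetmy"

# Option 1: keep the consuming pattern, but read group 1 instead of group 0.
consuming = r"(?<=[%s])([%s]{2,})[%s]" % (CONSONANTS, VOWELS, CONSONANTS)
via_group = [m.group(1) for m in re.finditer(consuming, text, flags=re.I)]

# Option 2: turn the trailing consonant into a lookahead so it is not matched.
lookahead = r"(?<=[%s])[%s]{2,}(?=[%s])" % (CONSONANTS, VOWELS, CONSONANTS)
via_lookahead = [m.group() for m in re.finditer(lookahead, text, flags=re.I)]

print(via_group)      # ['ee', 'Ioo', 'Oeo', 'eeeee']
print(via_lookahead)  # ['ee', 'Ioo', 'Oeo', 'eeeee']
```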

How to ignore words that have only 1/2 letters in a list of strings

I have 2 csv files, dictionary.csv and news.csv, and I match the words contained in dictionary.csv against news.csv. But I keep getting matches even when I shouldn't. I'm not sure if it's because my code matches on every letter or term. Can someone help?
Below are my codes:
import numpy as np
import pandas as pd

news = pd.read_csv("news.csv")
capitalizednews = news['STORY'].str.title()  # capitalize the first letter of each word in the news csv
dictionary = pd.read_csv("dictionary.csv")
# capitalize the first letter of each word in the dictionary and remove commas, parentheses, hyphens and digits
capitalizeddict = dictionary['Lists'].str.title().str.replace(r'[,()\-\d]', '', regex=True)
splitterm = capitalizeddict.str.split(r'\s+', expand=True).stack().unique().tolist()
pattern = '|'.join(splitterm)  # join all of the terms in dictionary.csv
news["contain term"] = np.where(capitalizednews.str.contains(pattern, regex=True, case=False), 1, 0)
I keep getting 1 in every row of my 'contain term' column.
I have a feeling this is because some of my terms became one- or two-letter words after the split (like P, Aa), so I would like to ignore these terms.
Ignoring any possible problems with the code that reads the files, this is a general function for doing what you outlined, assuming the words are strings held in any iterable (note that it returns a list):
def weed_out_short_words(wordlist):
    wordlist2 = []
    for word in wordlist:
        # keep only words longer than two characters
        if len(word) > 2:
            wordlist2.append(word)
    return wordlist2
This does not remove entries in a table where a specific column has words with two letters or less. It also doesn't deal with strings composed of multiple words, like "Hello world".
Furthermore, your code seems to jumble up what any given thing represents. Since I don't actually know what the columns of your .csv are, I can't help any further at the moment.
Removing only one- and two-letter words will still cause problems. "cat" is a three-letter word, yet it will still be recognised inside "catastrophe". There are many more cases where this could fail.
That's why you need to check whole words, not only substrings.
Since you're using regex with "or" (|), you can also use regex word borders r'\b':
pattern = r'\b' + r'\b|\b'.join(splitterm) + r'\b'
This uses the whole r'\b|\b' as the separator and adds r'\b' to the beginning of the first word and to the end of the last word.
Using raw strings (r'...') here because it's regex and we're using regex special character \b, not using an escaping sequence.
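A small standalone sketch of the whole-word idea, using only the standard re module. The term list is hypothetical, and wrapping the alternatives in a non-capturing group and escaping each term with re.escape are my own additions (escaping protects against terms that contain regex metacharacters such as parentheses).

```python
import re

terms = ["Cat", "Dog"]  # hypothetical dictionary terms

# \b(?:Cat|Dog)\b is equivalent to \bCat\b|\bDog\b, just shorter.
pattern = r'\b(?:' + '|'.join(re.escape(term) for term in terms) + r')\b'
matcher = re.compile(pattern, flags=re.IGNORECASE)

print(bool(matcher.search("A catastrophe happened")))  # False
print(bool(matcher.search("The cat sat on the mat")))  # True
```

The same pattern string should be usable with str.contains(pattern, regex=True, case=False) as in the question, though that is untested here.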

Regular expression to find longest substring which occurs twice (and is disjoint from its twin)

There are many questions which ask to find the longest repeating substring:
Longest substring that occurs at least twice: C++ question
Find longest repeating substring in JavaScript using regular expressions
Find longest repeating strings?
Regex to match the longest repeating substring
But these don't exactly match my requirements, which are:
The substrings may not overlap (they are disjoint).
The substrings are allowed to be non-adjacent.
Any characters are allowed.
I would like to match the longest pattern like this.
So far I have this:
>>> m = re.match(".*(?P<grp>.+).*(?P=grp).*", "dhiblhip")
>>> m.group('grp')
'i'
I think this is matching the last substring which repeats itself, 'i', but that's certainly not the longest one. I'd expect the following output for the following input:
'123abc' -> ''
'hh' -> 'h'
'hihi' -> 'hi'
'dhiblhip' -> 'hi'
'phiblhip' -> 'hi' (note how I do not return 'p' since it is not as long as 'hi' even though it is a repeating disjoint substring.)
'racecaracecar' -> 'raceca' (note how I can't recycle the middle r.) In this case, 'acecar' is just as acceptable.
I am using Python's re and would like to continue to do so, but answers in another language are not unwelcome.
Credit to #HamZa for the actual regex: (.+)(?=.*\1). This basically finds a capturing group with at least one character, and then does a non-capturing forward lookahead to make sure it repeats (that way there isn't trouble with python not finding overlapping matches).
While it is not possible to find the largest with regex alone, it is pretty simple to write:
import re
matches = re.findall(r'(.+)(?=.*\1)', yourstring)
largest = max(matches, key=len) if matches else ''
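One caveat with the snippet above, as far as I can tell: findall returns non-overlapping matches, so a short repeat found early can consume the start of a longer one (for 'ababc_babc' the longest match it reports is 'abc' rather than 'babc'). A possible variation, offered as a sketch rather than as the original answer's method, wraps the whole pattern in a lookahead so that every start position is tried. The function name and the re.DOTALL flag (so that . also matches newlines) are my own additions.

```python
import re

def longest_repeated_disjoint(text):
    """Longest substring that occurs twice without overlapping itself."""
    # The outer lookahead consumes nothing, so findall tries every position.
    # Inside it, \1 can only match after group 1 ends, so the two copies
    # never overlap.
    candidates = re.findall(r'(?=(.+).*\1)', text, flags=re.DOTALL)
    return max(candidates, key=len, default='')

for sample in ['123abc', 'hh', 'hihi', 'dhiblhip', 'phiblhip', 'ababc_babc']:
    print(repr(sample), '->', repr(longest_repeated_disjoint(sample)))
```

This does more backtracking than the original pattern, so it is only reasonable for short inputs.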

Python regexp: get all group's sequence

I have a regex like '^(a|ab|1|2)+$' and want to get the full sequence of blocks it matched.
For example, for re.search(reg, 'ab1') I want to get ('ab', '1').
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know in advance how many blocks will be matched by (pattern)+.
Is this possible, and if yes - how?
Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you get back just a list of strings, one per match. If the pattern had more than one group, you would get back a list of tuples, each containing one string per group.
This should work for your second example:
import re
pattern = '(7325189|7325|9087|087|18)'
text = '7325189087'
res = re.compile(pattern).findall(text)
print(pattern, text, res)
I removed the ^ and $ anchors from the pattern because, if findall has to find more than one substring, it must be able to search anywhere in text. I also removed the + so that it matches single occurrences of the options in the pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
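The behaviour described above is easy to check, and one way to keep both the validation and the individual blocks is to do them in two steps. This is only a sketch: the helper name split_tokens is invented here, and it relies on listing ab before a so that findall prefers the longer alternative.

```python
import re

# A repeated group only remembers its final iteration.
last_only = re.search(r'^(a|ab|1|2)+$', 'ab1').group(1)
print(last_only)  # 1

TOKEN = r'ab|a|1|2'  # longer alternative first so 'ab' wins over 'a'

def split_tokens(text):
    """Return the list of blocks, or None if text is not made of blocks only."""
    # Step 1: validate the whole string. Step 2: list the individual blocks.
    if re.fullmatch(r'(?:%s)+' % TOKEN, text) is None:
        return None
    return re.findall(TOKEN, text)

print(split_tokens('ab1'))  # ['ab', '1']
print(split_tokens('abx'))  # None
```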
I don't think you need regexes for this problem;
you need some kind of recursive graph search function.
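A recursive search along those lines might look like the following sketch. The function name segmentations is invented here, and the token lists are taken from the two patterns discussed above. Unlike findall, it returns every way of splitting the string into tokens, not just one.

```python
def segmentations(text, tokens):
    """Return every way to write text as a sequence of the given tokens."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for token in tokens:
        if text.startswith(token):
            # Use this token, then split whatever is left recursively.
            for rest in segmentations(text[len(token):], tokens):
                results.append([token] + rest)
    return results

print(segmentations('ab1', ['a', 'ab', '1', '2']))
# [['ab', '1']]
print(segmentations('7325189087', ['7325189', '7325', '9087', '087', '18']))
# [['7325189', '087'], ['7325', '18', '9087']]
```

An empty result list means the string cannot be built from the tokens at all. For long inputs the plain recursion can repeat work, so memoising on the remaining text would be a natural next step.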
