How to use Boolean OR inside a regex - Python

I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.
An re.findall of
"ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
should give me:
['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
I have tried all of the following without success:
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)
What am I missing here?

This is not super-easy, because a) you want overlapping matches, and b) you want greedy and non-greedy and everything in between.
As long as the strings are fairly short, you can check every substring:
import re
s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
p = re.compile(r'ATG.*TA[GA]$')
for start in range(len(s) - 5):  # a match is at least 6 characters long
    for end in range(start + 6, len(s) + 1):
        if p.match(s, pos=start, endpos=end):
            print(s[start:end])
This prints:
ATGTCAGGTAA
ATGTCAGGTAAGCTTAG
ATGTCAGGTAAGCTTAGGGCTTTAG
Since you appear to work with DNA sequences or something like that, make sure to check out Biopython, too.
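As a taste of what Biopython buys you, here is a minimal sketch (assuming Biopython is installed; the 27-base slice just keeps the length a multiple of three for this illustration):

from Bio.Seq import Seq

seq = Seq("ATGTCAGGTAAGCTTAGGGCTTTAGGATT")
# Seq slices like a plain string; translate() decodes codons in frame 0,
# and to_stop=True stops at the first in-frame stop codon (none occurs here).
print(seq[:27].translate(to_stop=True))  # MSGKLRALG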

I like the accepted answer just fine :-) That is, I'm adding this for info, not looking for points.
If you have heavy need for this, trying a match on O(N^2) pairs of indices may soon become unbearably slow. One improvement is to use the .search() method to "leap" directly to the only starting indices that can possibly pay off. So the following does that.
It also uses the .fullmatch() method so that you don't have to artificially change the "natural" regexp (e.g., in your example, no need to add a trailing $ to the regexp - and, indeed, in the following code doing so would no longer work as intended). Note that .fullmatch() was added in Python 3.4, so this code also requires Python 3!
Finally, this intends to generalize the re module's finditer() function/method. While you don't need match objects (you just want strings), they're far more generally applicable, and returning a generator is often friendlier than returning a list too.
So, no, this doesn't do exactly what you want, but does things from which you can get what you want, in Python 3, faster:
def finditer_overlap(regexp, string):
    start = 0
    n = len(string)
    while start <= n:
        # don't know whether regexp will find shortest or
        # longest match, but _will_ find leftmost match
        m = regexp.search(string, start)
        if m is None:
            return
        start = m.start()
        for finish in range(start, n + 1):
            m = regexp.fullmatch(string, start, finish)
            if m is not None:
                yield m
        start += 1
Then, e.g.,
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
pat = re.compile("ATG.*(TAA|TAG)")
for match in finditer_overlap(pat, string2):
    print(match.group())
prints what you wanted in your example. The other ways you tried to write a regexp should also work. In this example it's faster because the second time around the outer loop start is 1, and regexp.search(string, 1) fails to find another match, so the generator exits at once (so skips checking O(N^2) other index pairs).

Why doesn't replace() change all occurrences?

I have the following code:
dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
print(dna.count("GAGA"))
dna = dna.replace("GAGA", "AGAG")
print(dna.count("GAGA"))
Replace does not replace all occurrences. Could somebody help me understand why this happens?
It does replace all occurrences. But that can create new occurrences (look at your replacement string!).
I'd say, logically, all is fine.
You could repeat the replace while dna.count("GAGA") > 0, but that does not sound like what you should be doing. (I bet you really just want one round of replacement, to simulate something specific happening. Not a genetics expert at all, though.)
It did make all replacements (that's what .replace() does in Python unless specified otherwise), but some of these replacements inadvertently introduced new instances of GAGA. Take the beginning of your string:
TGCGAGAA
There's GAGA at indices 3-6. If you replace that with AGAG, you get
TGCAGAGA
So the last G from that AGAG, together with the subsequent A that was already there before, forms a new GAGA.
Replacements do not occur "until exhausted"; they occur where a substring matches in your original string.
Consider the following from your string:
>>> a = "TGCGAGAA"
>>> a.replace("GAGA", "AGAG")
'TGCAGAGA'
>>>
The replacement does not happen again, since the original string did not match GAGA in that location.
If you want to do the replacement until no match is found, you can wrap it in a loop:
>>> while a.count("GAGA") > 0: # you probably don't want to use count here if the string is long because of performance considerations
...     a = a.replace("GAGA", "AGAG")
...
>>> a
'TGCAAGAG'
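If you do want to run the replacement to a fixed point, a variant (a sketch) that skips the extra count() scan is to loop until the string stops changing:

dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
while True:
    replaced = dna.replace("GAGA", "AGAG")
    if replaced == dna:  # nothing left to replace: fixed point reached
        break
    dna = replaced

Each pass rewrites GAGA to the lexicographically smaller AGAG without changing the length, so the loop is guaranteed to terminate.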

Optimal way to count string tags using Python

So I have this list:
tokens = ['<greeting>', 'Hello World!', '</greeting>']
The task is to count the number of strings that contain XML tags. What I have so far (and it works) is this:
tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0
for i in range(len(tokens)):
    if tokens[i].find('>') > 1:
        print(tokens[i])
        count += 1
        print(count)
    else:
        count += 0
What puzzles me is that I'm inclined to use the following line for the if statement:
if tokens[i].find('>') == True:
but it won't work.
What's the optimal way of writing this loop, in your opinion?
Many thanks!
Alex.
One issue I see with your approach is that it might capture false positives (e.g. "gree>ting"), so checking only for a closing bracket is not enough.
If your definition of "contains a tag" simply means the string contains a < followed by some characters and then a >, you could use a regular expression (keep this in mind in case you were thinking of something more complex).
This, combined with the compact list-comprehension method proposed by @aws_apprentice in the comments, gives us:
import re
regex = "<.+>"
count = sum([1 if re.search(regex, t) else 0 for t in tokens])
print(count) #done!
Explanation:
This one-liner is a list comprehension, which builds a list of ones and zeros: for each string t in tokens, it contributes 1 if the string contains a tag and 0 otherwise. re.search checks whether the string (or a substring of it) matches the given regex.
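As a small aside (not part of the original answer): bool is a subclass of int in Python, so the 1/0 dance and the square brackets can both be dropped, turning the list comprehension into a generator expression:

import re

tokens = ['<greeting>', 'Hello World!', '</greeting>']
# True sums as 1 and False as 0, so this counts matching strings directly.
count = sum(bool(re.search(r"<.+>", t)) for t in tokens)
print(count)  # 2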
The following approach checks for the opening < at the start of the string and also checks for > at the end of the string.
In [4]: tokens = ['<greeting>', 'Hello World!', '</greeting>']
In [5]: sum([1 if i.startswith('<') and i.endswith('>') else 0 for i in tokens])
Out[5]: 2
Anis R.'s answer should work fine, but this is a non-regex alternative (and not as elegant; in fact I would call it clumsy).
This code just looks at the beginning and end of each list element for angle brackets. I'm a novice to the extreme, but I think range(len(tokens)) is redundant and can be simplified like this as well.
tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0
for i in tokens:
    if i[0].find('<') == 0 and i[-1].find('>') != -1:
        print(i)
        count += 1
print(count)
str.find() returns an index position, not a boolean, as others have noted, so your if statement must reflect that. A .find() with no result returns -1. As you can see, checking for an index of 0 works for the opening bracket, as long as your data follows the scheme in your example list. The second if component is negative (using !=), since it checks the last character of the list item. I don't think you could use a positive if statement there since, again, .find() returns an index position and your data presumably has variable lengths. I'm sure you could complicate that check to be positive by adding more code, but the shortcut seems satisfactory in your case. The only time it wouldn't work is if your list components can look like '<greeting> Hello'.
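To make that concrete, a quick check of what .find() actually returns:

# .find() returns an index, never a boolean: 0 for a hit at the very
# start of the string, -1 when the substring is absent.
print('<greeting>'.find('<'))    # 0
print('Hello World!'.find('>'))  # -1
print('<greeting>'.find('>'))    # 9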
Happy to be corrected by others, that's why I'm here.

How do I find index in string without using find() in Python

I'm currently using the find function and found a slight problem.
theres gonna be a fire here
If I have a sentence with the words "theres" and "here" and I use find() to get the index of "here", I instead match the "here" inside "theres".
I thought find() would behave like
if thisword in thatword:
in that it would find the word, not a substring within a string.
Is there another function that may work similarly? I'm using find() quite heavily and would like to know of alternatives before I clog the code with string.split(), then iterating until I find the exact match with an index counter on the side.
MainLine = str('theres gonna be a fire here')
WordtoFind = str('here')
#String_Len = MainLine.find(WordtoFind)
split_line = MainLine.split()
indexCounter = 0
for i in range(0, len(split_line)):
    indexCounter += (len(split_line[i]) + 1)
    if WordtoFind in split_line[i]:
        #String_Len = MainLine.find(split_line[i])
        String_Len = indexCounter
        break
The best route would be regular expressions. To find a "word", just make sure that the characters before and after it are not alphanumeric. This uses no splits, has no exposed loops, and even works when you run into a weird sentence like "There is a fire,here". A find_word_start function might look like this:
import re

def find_word_start(word, string):
    pattern = "(?<![a-zA-Z0-9])" + word + "(?![a-zA-Z0-9])"
    result = re.search(pattern, string)
    return result.start()

>>> find_word_start("here", "There is a fire,here")
16
The regex uses a trick called lookarounds, which make sure that the characters before and after the word are not letters or digits: https://www.regular-expressions.info/lookaround.html. The term [a-zA-Z0-9] is a character class that matches a single character from the sets a-z, A-Z, and 0-9. Look up the Python re module to find out more about regular expressions.
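To see the difference against plain str.find() (a quick check, reusing find_word_start from above):

sentence = 'theres gonna be a fire here'
# str.find matches a bare substring, so it reports the "here" hiding
# inside "theres" at index 1 ...
print(sentence.find('here'))              # 1
# ... while the lookaround version skips it and finds the whole word.
print(find_word_start('here', sentence))  # 23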

finding needle in haystack, what is a better solution?

so given "needle" and "there is a needle in this but not thisneedle haystack"
I wrote
def find_needle(n, h):
    count = 0
    words = h.split(" ")
    for word in words:
        if word == n:
            count += 1
    return count
This is O(n), but I'm wondering if there is a better approach, maybe without using split at all?
How would you write tests for this to check that it handles all edge cases?
I don't think it's possible to get below O(n) with this (because you need to iterate through the string at least once). But you can do some optimizations.
I assume you want to match "whole words"; for example, looking up foo should match like this:
foo and foo, or foobar and not foo.
^^^     ^^^                    ^^^
So splitting just on spaces wouldn't do the job, because:
>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
#                  ^                                    ^
This is where the re module comes in handy; it allows you to build more sophisticated conditions. For example, \b inside the regexp means:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
So r'\bfoo\b' will match only the whole word foo. Also, don't forget to use re.escape():
>>> re.escape('foo.bar+')
'foo\\.bar\\+'
>>> r'\b{}\b'.format(re.escape('foo.bar+'))
'\\bfoo\\.bar\\+\\b'
All you have to do now is use re.finditer() to scan the string. From the documentation:
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
I assume that matches are generated on the fly, so they never have to be in memory all at once (which may come in handy with large strings with many matched items). And in the end, just count them:
>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
This does not address the complexity issue but simplifies the code:
def find_needle(n, h):
    return h.split().count(n)
You can use Counter
from collections import Counter

def find_needle(n, h):
    return Counter(h.split())[n]
i.e.:
n = "portugal"
h = 'lobito programmer from portugal hello fromportugal portugal'
print(find_needle(n, h))
Output:
2
Actually, when you say O(n) you are forgetting that after matching the first letter, you have to match the remaining ones as well (match n from needle against the sentence, then e, then the next e...). You are essentially trying to replicate the functionality of grep, so you can look at the grep algorithm. You can do well by building a finite state machine. There are many links that can help you; for one, you could start from How does grep run so fast?
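As an illustration of leaning on the fast C-level search instead of split() (a sketch, not grep itself): a single pass with str.find() plus boundary checks counts whole words without building a word list:

def count_word(needle, haystack):
    # Repeatedly jump to the next raw hit with str.find (implemented in C),
    # then accept it only if both neighbours are non-alphanumeric.
    count, i = 0, 0
    while True:
        i = haystack.find(needle, i)
        if i == -1:
            return count
        before = haystack[i - 1] if i > 0 else ' '
        end = i + len(needle)
        after = haystack[end] if end < len(haystack) else ' '
        if not before.isalnum() and not after.isalnum():
            count += 1
        i += 1  # advance one char so overlapping raw hits are still seen

print(count_word('needle', 'there is a needle in this but not thisneedle haystack'))  # 1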
This is still going to be O(n), but it uses the power of the re module and Python's generator expressions.
import re
def find_needle(n, h):
    g = re.finditer(r'\b%s\b' % n, h)  # use regex word boundaries
    return sum(1 for _ in g)  # return the length of the iterator
Should use far less memory than .split for a relatively large 'haystack'.
Note that this is not exactly the same as the code in the OP because it will not only find 'needle' but also 'needle,' and 'needle.' It will not find 'needles' though.
If you are concerned with the time it takes (as distinct from time complexity), multiprocess it. Basically, make n smaller. Here is an example to run it in 2 processes.
from multiprocessing import Process

def find(word, string):
    # note: a Process target's return value is discarded, so print the
    # partial count (or push it onto a Queue) to get it back to the parent
    print(string.count(word))

def search_for_words(word, string):
    full_length = len(string)
    part1 = string[:full_length // 2]
    proc1 = Process(target=find, args=(word, part1))
    proc1.start()
    part2 = string[full_length // 2:]
    proc2 = Process(target=find, args=(word, part2))
    proc2.start()
    proc1.join()
    proc2.join()
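Since a Process target's return value never reaches the parent, a multiprocessing.Pool (a sketch along the same lines, not the original code) is the simpler way to actually collect the partial counts:

from multiprocessing import Pool

def count_in_part(args):
    word, part = args
    return part.count(word)

def parallel_count(word, string, workers=2):
    # Naively chop the string into chunks; a real version would overlap
    # the boundaries so a word cut in half is not missed.
    step = -(-len(string) // workers)  # ceiling division
    parts = [string[i:i + step] for i in range(0, len(string), step)]
    with Pool(workers) as pool:
        return sum(pool.map(count_in_part, [(word, p) for p in parts]))

if __name__ == '__main__':  # guard required by multiprocessing on some platforms
    print(parallel_count('needle', 'needle in a haystack with one more needle'))  # 2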
If it's O(n) you are worried about, then I'm not sure there is much you can do, unless it is possible to get the string into another data structure, like a set or something. (But putting it into that set is also O(n); you can save time if you are already iterating over the string somewhere else, and build the structure then. Write once, read many.)
In order to guarantee finding a needle in a haystack, you need to examine each piece of hay until you find the needle. This is O(n) no matter what, a tight lower bound.
def find_needle(haystack):
    # here the haystack is assumed to be a list of words
    for item in haystack:
        if item == 'needle':
            return 'found the needle at position ' + str(haystack.index(item))
Here's mine:
def find_needle(haystack, needle):
    return haystack.count(needle)
Here, we simply use the built-in count method to count the number of needles in the haystack.

Regex to match 'lol' to 'lolllll' and 'omg' to 'omggg', etc

Hey there, I love regular expressions, but I'm just not good at them at all.
I have a list of some 400 shortened words such as lol, omg, lmao...etc. Whenever someone types one of these shortened words, it is replaced with its English counterpart ([laughter], or something to that effect). Anyway, people are annoying and type these short-hand words with the last letter(s) repeated x number of times.
examples:
omg -> omgggg, lol -> lollll, haha -> hahahaha, lol -> lololol
I was wondering if anyone could hand me the regex (in Python, preferably) to deal with this?
Thanks all.
(It's a Twitter-related project for topic identification if anyone's curious. If someone tweets "Let's go shoot some hoops", how do you know the tweet is about basketball, etc)
FIRST APPROACH -
Well, using regular expression(s), you could do like so:
import re
re.sub('g+', 'g', 'omgggg')
re.sub('l+', 'l', 'lollll')
etc.
Let me point out that using regular expressions is a very fragile and basic approach to this problem. Users can easily send you strings that break these regular expressions. What I am trying to say is that this approach requires a lot of maintenance, in terms of observing the patterns of mistakes users make and then creating case-specific regular expressions for them.
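For instance, the naive collapse has no idea which repeated letters are legitimate (a quick check):

import re
# 'g+' collapses every run of g's, including ones that belong in the word
print(re.sub('g+', 'g', 'omgggg'))  # 'omg'
print(re.sub('g+', 'g', 'eggs'))    # 'egs' - oops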
SECOND APPROACH -
Instead, have you considered using the difflib module? It's a module with helpers for computing deltas between objects. Of particular importance here is SequenceMatcher. To paraphrase the official documentation:
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. SequenceMatcher tries to compute a "human-friendly diff" between two sequences. The fundamental notion is the longest contiguous and junk-free matching subsequence.
import difflib as dl

x = dl.SequenceMatcher(lambda x: x == ' ', "omg", "omgggg")
y = dl.SequenceMatcher(lambda x: x == ' ', "omgggg", "omg")
avg = (x.ratio() + y.ratio()) / 2.0
if avg >= 0.6:
    print('Match!')
else:
    print('Sorry!')
According to the documentation, any ratio() over 0.6 is a close match. You might need to tweak the ratio for your data's needs. If you need stricter matching, I found that any value over 0.8 serves well.
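For a feel for where that threshold sits: the documentation defines ratio() as 2*M/T, where M is the number of matched elements and T the total length of both sequences, so the pair above scores 2*3/9, just over 0.6 (a quick check):

import difflib as dl
# M = 3 (the shared "omg"), T = 3 + 6 = 9, so ratio() = 2*3/9
print(dl.SequenceMatcher(lambda c: c == ' ', 'omg', 'omgggg').ratio())  # 0.666...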
How about
\b(?=lol)\S*(\S+)(?<=\blol)\1*\b
(replace lol with omg, haha etc.)
This will match lol, lololol, lollll, lollollol etc. but fail lolo, lollllo, lolly and so on.
The rules:
Match the word lol completely.
Then allow any repetition of one or more characters at the end of the word (i. e. l, ol or lol)
So \b(?=zomg)\S*(\S+)(?<=\bzomg)\1*\b will match zomg, zomggg, zomgmgmg, zomgomgomg etc.
In Python, with comments:
result = re.sub(
r"""(?ix)\b # assert position at a word boundary
(?=lol) # assert that "lol" can be matched here
\S* # match any number of characters except whitespace
(\S+) # match at least one character (to be repeated later)
(?<=\blol) # until we have reached exactly the position after the 1st "lol"
\1* # then repeat the preceding character(s) any number of times
\b # and ensure that we end up at another word boundary""",
"lol", subject)
This will also match the "unadorned" version (i. e. lol without any repetition). If you don't want this, use \1+ instead of \1*.
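A quick check of those claims (a sketch using re.fullmatch, so the whole word must match the pattern):

import re

pattern = r'\b(?=lol)\S*(\S+)(?<=\blol)\1*\b'
for word in ['lol', 'lollll', 'lololol', 'lolo', 'lolly']:
    print(word, bool(re.fullmatch(pattern, word)))
# lol True, lollll True, lololol True, lolo False, lolly False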
