Test multiple substrings against a string - python

If I have a list of strings:
matches = [ 'string1', 'anotherstring', 'astringystring' ]
And I have another string that I want to test:
teststring = 'thestring1'
And I want to test each string, and if any match, do something. I have:
match = 0
for matchstring in matches:
    if matchstring in teststring:
        match = 1
if not match:
    continue
This is in a loop, so we just go around again if we don't get a match (I can reverse this logic of course and do something if it matches), but the code looks clumsy and not pythonic, if easy to follow.
I am thinking there is a better way to do this, but I don't grok python as well as I would like. Is there a better approach?
Note the "duplicate" is the opposite question (though the answer approach is the same).

You could use any here.
Code:
if any(matchstring in teststring for matchstring in matches):
    print("Matched")
Notes:
any exits as soon as it sees a match.
The for matchstring in matches part iterates over each string in matches.
The matchstring in teststring part checks whether the current string occurs in the test string.
So any stops at the first True (a match) that the expression produces.
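To make the short-circuiting concrete, here is a small sketch. The helper name contains_any and the second test string are made up for illustration; the list is the one from the question.

```python
def contains_any(text, substrings):
    """Return True if at least one of the substrings occurs in text."""
    # The generator expression is evaluated lazily, so any() stops
    # at the first substring that is found.
    return any(sub in text for sub in substrings)


matches = ['string1', 'anotherstring', 'astringystring']
print(contains_any('thestring1', matches))    # True
print(contains_any('nothing here', matches))  # False
```

Inside the original loop this would read: if not contains_any(teststring, matches): continue.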

If you want to know what the first match was you can use next:
match = next((match for match in matches if match in teststring), None)
You have to pass None as the second parameter if you don't want it to raise an exception when nothing matches. It will use the value as the default, so match will be None if nothing is found.
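A minimal sketch of that behaviour, assuming the list from the question; the wrapper name first_match is hypothetical:

```python
matches = ['string1', 'anotherstring', 'astringystring']


def first_match(text, substrings):
    # next() pulls only the first item from the generator; the second
    # argument is returned when the generator yields nothing.
    return next((sub for sub in substrings if sub in text), None)


print(first_match('thestring1', matches))  # string1
print(first_match('unrelated', matches))   # None
```

Without the default, the second call would raise StopIteration instead of returning None.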

How about you try this:
len([ x for x in b if ((a in x) or (x in a)) ]) > 0
I've updated the answer to check the substring both ways. You can pick and choose or modify as you see fit but I think the basics should be pretty clear.
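The snippet does not say what a and b are; the sketch below assumes a is the string being tested and b is the list of candidates. It expresses the same two-way check with any, which avoids building the intermediate list. The function name overlaps_any is made up.

```python
def overlaps_any(a, b):
    # True if a contains some element of b, or some element of b contains a.
    return any(a in x or x in a for x in b)


b = ['string1', 'anotherstring', 'astringystring']
print(overlaps_any('thestring1', b))  # True: 'string1' is inside 'thestring1'
print(overlaps_any('stringy', b))     # True: 'stringy' is inside 'astringystring'
print(overlaps_any('xyz', b))         # False
```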

Related

Why doesn't replace() change all occurrences?

I have the following code:
dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
print(dna.count("GAGA"))
dna = dna.replace("GAGA", "AGAG")
print(dna.count("GAGA"))
Replace does not replace all occurrences. Could somebody help me understand why this happens?
It replaces all occurrences. That might lead to new occurrences (look at your replacement string!).
I'd say, logically, all is fine.
You could repeat this replace while dna.count("GAGA") > 0, but that does not sound like what you should be doing. (I bet you really just want to do one round of replacement to simulate something specific happening. Not a genetics expert at all though.)
It did make all replacements (that's what .replace() does in Python unless specified otherwise), but some of these replacements inadvertently introduced new instances of GAGA. Take the beginning of your string:
TGCGAGAA
There's GAGA at indices 3-6. If you replace that with AGAG, you get
TGCAGAGA
So the last G from that AGAG, together with the subsequent A that was already there before, forms a new GAGA.
Replacements do not occur "until exhausted"; they occur wherever the substring is matched in your original string.
Consider the following from your string:
>>> a = "TGCGAGAA"
>>> a.replace("GAGA", "AGAG")
'TGCAGAGA'
>>>
The replacement does not happen again, since the original string did not match GAGA in that location.
If you want to do the replacement until no match is found, you can wrap it in a loop:
>>> while a.count("GAGA") > 0: # you probably don't want to use count here if the string is long because of performance considerations
...     a = a.replace("GAGA", "AGAG")
...
>>> a
'TGCAAGAG'
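If you do want the repeated replacement, one way to package it is sketched below. The function name and the max_rounds guard are assumptions added here: some old/new pairs never settle (for example replacing "a" with "aa"), so the loop is capped instead of running forever.

```python
def replace_until_stable(text, old, new, max_rounds=1000):
    """Apply str.replace repeatedly until `old` no longer occurs in text."""
    for _ in range(max_rounds):
        if old not in text:  # cheaper than count() when only presence matters
            return text
        text = text.replace(old, new)
    raise RuntimeError("replacement did not settle within max_rounds")


print(replace_until_stable("TGCGAGAA", "GAGA", "AGAG"))  # TGCAAGAG
```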

optimal way to count string tags using python

so I have this list:
tokens = ['<greeting>', 'Hello World!', '</greeting>']
the task is to count the number of strings that have XML tags. what I have so far (that works) is this:
tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0
for i in range(len(tokens)):
    if tokens[i].find('>') >1:
        print(tokens[i])
        count += 1
        print(count)
    else:
        count += 0
what puzzles me is that I'm inclined in using the following line for the if statement
if tokens[i].find('>') == True:
but it won't work.
what's the optimal way of writing this loop, in your opinion?
many thanks!
alex.
One issue I see with your approach is that it might capture false positives (e.g. "gree>ting"), so checking only for a closing bracket is not enough.
If your definition of "contains a tag" simply means checking whether the string contains a < followed by some characters, then another >, you could use a regular expression (keeping this in mind in case you were thinking about something more complex).
This, combined with the compact list comprehension proposed by @aws_apprentice in the comments, gives us:
import re
regex = "<.+>"
count = sum([1 if re.search(regex, t) else 0 for t in tokens])
print(count) #done!
Explanation:
The one-liner uses a list comprehension, which builds a list of ones and zeros: for each string t in tokens, it appends 1 if the string contains a tag and 0 otherwise. re.search checks whether the string (or a substring of it) matches the given regex.
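A variation on the same idea, offered as a sketch: since True counts as 1 and False as 0, the booleans can be summed directly from a generator expression without building a list. The stricter pattern <[^<>]+> is an assumption about what counts as a tag, not part of the original answer, and the extra token is added only to show the false positive being rejected.

```python
import re

# Assumed definition of a tag: '<', one or more non-bracket characters, '>'.
TAG = re.compile(r"<[^<>]+>")


def count_tagged(tokens):
    # bool(...) turns the match object (or None) into True/False,
    # and sum() treats those as 1 and 0.
    return sum(bool(TAG.search(t)) for t in tokens)


tokens = ['<greeting>', 'Hello World!', '</greeting>', 'gree>ting']
print(count_tagged(tokens))  # 2
```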
The following approach checks for the opening < at the start of the string and also checks for > at the end of the string.
In [4]: tokens = ['<greeting>', 'Hello World!', '</greeting>']
In [5]: sum([1 if i.startswith('<') and i.endswith('>') else 0 for i in tokens])
Out[5]: 2
Anis R.'s answer should work fine, but this is a non-regex alternative (and not as elegant; in fact I would call this clumsy).
This code just looks at the beginning and end of each list element for angle brackets. I'm a novice to the extreme, but I think range(len(tokens)) is redundant and can be simplified like this as well.
tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0
for i in tokens:
    if i[0].find('<') == 0 and i[-1].find('>') != -1:
        print(i)
        count += 1
        print(count)
str.find() returns an index position, not a boolean, as others have noted, so your if statement must reflect that. A .find() with no result returns -1. For the opening angle bracket, checking for an index of 0 works as long as your data follows the scheme in your example list. For the closing bracket I used a negative check (!= -1) on the last character of the list item. Because i[-1] is a single character, .find() there can only return 0 or -1, so == 0 would work just as well. The check assumes every list item is non-empty, and it will not count items that look like '<greeting> Hello'.
Happy to be corrected by others, that's why I'm here.
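To illustrate why the == True comparison from the question fails, here is a short sketch using the tokens from the question; the variable names are made up. It also shows the startswith/endswith check written without .find().

```python
token = '<greeting>'

index = token.find('>')
print(index)              # 9, the position of '>'
print(index == True)      # False, because True equals 1 and 9 != 1
print('Hello'.find('>'))  # -1, which is still truthy in an if statement
print('>' in token)       # True; use `in` when only presence matters

tokens = ['<greeting>', 'Hello World!', '</greeting>']
# startswith/endswith return booleans and are safe on empty strings.
count = sum(t.startswith('<') and t.endswith('>') for t in tokens)
print(count)              # 2
```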

New to regex -- unexpected results in for loop

I'm not sure if this is a problem in my understanding of regex modules, or a silly mistake I'm making in my for loop.
I have a list of numbers that look like this:
4; 94
3; 92
1; 53
etc.
I made a regex pattern to match just the last two digits of the string:
'^.*\s([0-9]+)$'
This works when I take each element of the list 1 at a time.
However when I try and make a for loop
for i in xData:
    if re.findall(r'^.*\s([0-9]+)$', i):
        print(i)
The output is simply the entire string instead of just the last two digits.
I'm sure I'm missing something very simple here but if someone could point me in the right direction that would be great. Thanks.
You are printing the whole string, i. If you wanted to print the output of re.findall(), then store the result and print that result:
for i in xData:
    results = re.findall(r'^.*\s([0-9]+)$', i)
    if results:
        print(results)
I don't think that re.findall() is the right method here, since your lines contain just the one set of digits. Use re.search() to get a match object, and if the match object is not None, take the first group data:
for i in xData:
    match = re.search(r'^.*\s([0-9]+)$', i)
    if match:
        print(match.group(1))
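The same idea as a self-contained sketch that collects the numbers instead of printing them. The sample data mirrors the question plus one non-matching line; the function name is hypothetical. Because search() scans the string, the leading ^.* is dropped here, which is a simplification rather than the pattern from the question.

```python
import re

# Whitespace followed by digits at the very end of the line.
TRAILING_NUMBER = re.compile(r'\s([0-9]+)$')


def trailing_numbers(lines):
    found = []
    for line in lines:
        match = TRAILING_NUMBER.search(line)
        if match:  # search() returns None when nothing matches
            found.append(match.group(1))
    return found


xData = ['4; 94', '3; 92', '1; 53', 'no digits']
print(trailing_numbers(xData))  # ['94', '92', '53']
```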
I might be missing something here, but if all you're looking to do is get the last 2 characters, could you use the below?
for i in xData:
    print(i[-2:])

How to use Boolean OR inside a regex

I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.
A re.findall on
"ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
should give me:
['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
I have tried all of the following without success:
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)
What am I missing here?
This is not super-easy, because a) you want overlapping matches, and b) you want greedy and non-greedy and everything in between.
As long as the strings are fairly short, you can check every substring:
import re
s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
p = re.compile(r'ATG.*TA[GA]$')
for start in range(len(s)-5): # a match is at least 6 letters long
    for end in range(start+6, len(s)+1): # +1 so a match may end at the last character
        if p.match(s, pos=start, endpos=end):
            print(s[start:end])
This prints:
ATGTCAGGTAA
ATGTCAGGTAAGCTTAG
ATGTCAGGTAAGCTTAGGGCTTTAG
Since you appear to work with DNA sequences or something like that, make sure to check out Biopython, too.
I like the accepted answer just fine :-) That is, I'm adding this for info, not looking for points.
If you have heavy need for this, trying a match on O(N^2) pairs of indices may soon become unbearably slow. One improvement is to use the .search() method to "leap" directly to the only starting indices that can possibly pay off. So the following does that.
It also uses the .fullmatch() method so that you don't have to artificially change the "natural" regexp (e.g., in your example, no need to add a trailing $ to the regexp - and, indeed, in the following code doing so would no longer work as intended). Note that .fullmatch() was added in Python 3.4, so this code also requires Python 3!
Finally, this intends to generalize the re module's finditer() function/method. While you don't need match objects (you just want strings), they're far more generally applicable, and returning a generator is often friendlier than returning a list too.
So, no, this doesn't do exactly what you want, but does things from which you can get what you want, in Python 3, faster:
def finditer_overlap(regexp, string):
    start = 0
    n = len(string)
    while start <= n:
        # don't know whether regexp will find shortest or
        # longest match, but _will_ find leftmost match
        m = regexp.search(string, start)
        if m is None:
            return
        start = m.start()
        for finish in range(start, n+1):
            m = regexp.fullmatch(string, start, finish)
            if m is not None:
                yield m
        start += 1
Then, e.g.,
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
pat = re.compile("ATG.*(TAA|TAG)")
for match in finditer_overlap(pat, string2):
    print(match.group())
prints what you wanted in your example. The other ways you tried to write a regexp should also work. In this example it's faster because the second time around the outer loop start is 1, and regexp.search(string, 1) fails to find another match, so the generator exits at once (so skips checking O(N^2) other index pairs).
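For fixed start and stop tokens like these, a regex is not strictly required. The following is a different technique from the answers above, sketched with plain str.find; the function name and parameters are made up, and it only handles literal tokens, not general patterns.

```python
def overlapping_spans(s, start_token, stop_tokens):
    """Yield every substring that begins with start_token and ends with a stop token."""
    start = s.find(start_token)
    while start != -1:
        body_from = start + len(start_token)
        ends = set()
        for stop in stop_tokens:
            # Collect the end index of every occurrence of this stop token
            # that lies after the start token.
            pos = s.find(stop, body_from)
            while pos != -1:
                ends.add(pos + len(stop))
                pos = s.find(stop, pos + 1)
        for end in sorted(ends):
            yield s[start:end]
        # Move on to the next occurrence of the start token, if any.
        start = s.find(start_token, start + 1)


string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
print(list(overlapping_spans(string2, "ATG", ("TAA", "TAG"))))
# ['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
```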

Get the actual ending when testing with .endswith(tuple)

I found a nice question where one can search for multiple endings of a string using: endswith(tuple)
Check if string ends with one of the strings from a list
My question is, how can I return which value from the tuple is actually found to be the match? and what if I have multiple matches, how can I choose the best match?
for example:
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ('AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA')
str.endswith(endings) ## returns True because at least one ending matches, but how can I get the one that matches best?
In this case, multiple matches can be found from the tuple, how can I deal with this and return only the best (biggest) match, which in this case should be: 'AAAAAAAAA' which I want to remove at the end (which can be done with a regular expression or so).
I mean one could do this in a for loop, but maybe there is an easier pythonic way?
>>> s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
>>> endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
>>> max([i for i in endings if s.endswith(i)],key=len)
'AAAAAAAAA'
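A sketch that extends this to the no-match case and to removing the ending. The function names and the sample string are made up; the default argument of max() requires Python 3.4 or later.

```python
def longest_ending(s, endings):
    # default=None avoids the ValueError that max() raises
    # when none of the endings match.
    return max((e for e in endings if s.endswith(e)), key=len, default=None)


def strip_longest_ending(s, endings):
    ending = longest_ending(s, endings)
    return s[:-len(ending)] if ending else s


endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
s = 'XYZ' + 'A' * 10
print(longest_ending(s, endings))        # AAAAAAAAA
print(strip_longest_ending(s, endings))  # XYZA
print(longest_ending('XYZ', endings))    # None
```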
import re
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
print(max([i for i in endings if re.findall(i+r"$",str)],key=len))
How about:
len(str) - len(str.rstrip('A'))
str.endswith(tuple) is (currently) implemented as a simple loop over the tuple, repeatedly re-running the match; any similarities between the endings are not taken into account.
In the example case, a regular expression should compile into an automaton that essentially runs in linear time:
regexp = '(' + '|'.join(
    re.escape(ending) for ending in sorted(endings, key=len, reverse=True)
) + ')$'
Edit 1: As pointed out correctly by Martijn Pieters, Python's re does not return the longest overall match, but for alternates only matches the first matching subexpression:
https://docs.python.org/2/library/re.html#module-re:
When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.
(emphasis mine)
Hence, unfortunately the need for sorting by length.
Note that this makes Python's re different from POSIX regular expressions, which match the longest overall match.
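The first-branch-wins behaviour quoted above can be checked with a tiny example, followed by the length-sorted pattern applied to a test string. The sample string and variable names are made up for illustration.

```python
import re

# Alternation tries branches left to right and keeps the first that succeeds.
print(re.match('A|AA', 'AA').group())  # A
print(re.match('AA|A', 'AA').group())  # AA

endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
pattern = re.compile(
    '(' + '|'.join(
        re.escape(ending) for ending in sorted(endings, key=len, reverse=True)
    ) + ')$'
)

found = pattern.search('XYZ' + 'A' * 10)
print(found.group(1))  # AAAAAAAAA
```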
