Get the actual ending when testing with .endswith(tuple) - python

I found a nice question where one can search for multiple endings of a string using: endswith(tuple)
Check if string ends with one of the strings from a list
My question is, how can I return which value from the tuple is actually found to be the match? and what if I have multiple matches, how can I choose the best match?
for example:
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ('AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA')
str.endswith(endings) ## this will return true for all of values inside the tuple, but how can I get which one matches the best
In this case, multiple matches can be found from the tuple, how can I deal with this and return only the best (biggest) match, which in this case should be: 'AAAAAAAAA' which I want to remove at the end (which can be done with a regular expression or so).
I mean one could do this in a for loop, but maybe there is an easier pythonic way?

>>> s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
>>> endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
>>> max([i for i in endings if s.endswith(i)],key=len)
'AAAAAAAAA'

import re
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
print max([i for i in endings if re.findall(i+r"$",str)],key=len)

How about:
len(str) - len(str.rstrip('A'))

str.endswith(tuple) is (currently) implemented as a simple loop over tuple, repeatedly re- running the match, any similarities between the endings are not taken into account.
In the example case, a regular expression should compile into an automaton that essentially runs in linear time:
regexp = '(' + '|'.join(
re.escape(ending) for ending in sorted(endings, key=len, reverse=True
) + ')$'
Edit 1: As pointed out correctly by Martijn Pieters, Python's re does not return the longest overall match, but for alternates only matches the first matching subexpression:
https://docs.python.org/2/library/re.html#module-re:
When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.
(emphasis mine)
Hence, unfortunately the need for sorting by length.
Note that this makes Python's re different from POSIX regular expressions, which match the longest overall match.

Related

Why doesn't replace () change all occurrences?

I have the following code:
dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
print(dna.count("GAGA"))
dna = dna.replace("GAGA", "AGAG")
print(dna.count("GAGA"))
Replace does not replace all occurrences. Could somebody help my in understanding why it happened?
It replaces all occurences. That might lead to new occurences (look at your replacement string!).
I'd say, logically, all is fine.
You could repeat this replace while dna.count("GAGA") > 0 , but: that sounds not like what you should be doing. (I bet you really just want to do one round of replacement to simulate something specific happening. Not a genetics expert at all though.)
It did make all replacements (that's what .replace() does in Python unless specified otherwise), but some of these replacements inadvertently introduced new instances of GAGA. Take the beginning of your string:
TGCGAGAA
There's GAGA at indices 3-6. If you replace that with AGAG, you get
TGCAGAGA
So the last G from that AGAG, together with the subsequent A that was already there before, forms a new GAGA.
Replacements does not occur "until exhausted"; they occur when a substring is matched in your original string.
Consider the following from your string:
>>> a = "TGCGAGAA"
>>> a.replace("GAGA", "AGAG")
'TGCAGAGA'
>>>
The replacement does not happen again, since the original string did not match GAGA in that location.
If you want to do the replacement until no match is found, you can wrap it in a loop:
>>> while a.count("GAGA") > 0: # you probably don't want to use count here if the string is long because of performance considerations
... a = a.replace("GAGA", "AGAG")
...
>>> a
'TGCAAGAG'

How to remove a substrings from a list of strings?

I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
g=""
for j in range(len(i)):
if i[j] == ':':
index=j+1
res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?
Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x), splits the string on the : character, this returns a list containing the prefix (if it exists) and the suffix, and builds a new list with only the suffix (i.e. item [-1] in the list that results from the split. In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']
Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions:
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)

How to use Boolean OR inside a regex

I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.
an re.findall of
"ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
should give me:
['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
I have tried all of the following without success:
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)
What am I missing here?
This is not super-easy, because a) you want overlapping matches, and b) you want greedy and non-greedy and everything inbetween.
As long as the strings are fairly short, you can check every substring:
import re
s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
p = re.compile(r'ATG.*TA[GA]$')
for start in range(len(s)-6): # string is at least 6 letters long
for end in range(start+6, len(s)):
if p.match(s, pos=start, endpos=end):
print(s[start:end])
This prints:
ATGTCAGGTAA
ATGTCAGGTAAGCTTAG
ATGTCAGGTAAGCTTAGGGCTTTAG
Since you appear to work with DNA sequences or something like that, make sure to check out Biopython, too.
I like the accepted answer just fine :-) That is, I'm adding this for info, not looking for points.
If you have heavy need for this, trying a match on O(N^2) pairs of indices may soon become unbearably slow. One improvement is to use the .search() method to "leap" directly to the only starting indices that can possibly pay off. So the following does that.
It also uses the .fullmatch() method so that you don't have to artificially change the "natural" regexp (e.g., in your example, no need to add a trailing $ to the regexp - and, indeed, in the following code doing so would no longer work as intended). Note that .fullmatch() was added in Python 3.4, so this code also requires Python 3!
Finally, this intends to generalize the re module's finditer() function/method. While you don't need match objects (you just want strings), they're far more generally applicable, and returning a generator is often friendlier than returning a list too.
So, no, this doesn't do exactly what you want, but does things from which you can get what you want, in Python 3, faster:
def finditer_overlap(regexp, string):
start = 0
n = len(string)
while start <= n:
# don't know whether regexp will find shortest or
# longest match, but _will_ find leftmost match
m = regexp.search(string, start)
if m is None:
return
start = m.start()
for finish in range(start, n+1):
m = regexp.fullmatch(string, start, finish)
if m is not None:
yield m
start += 1
Then, e.g.,
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
pat = re.compile("ATG.*(TAA|TAG)")
for match in finditer_overlap(pat, string2):
print(match.group())
prints what you wanted in your example. The other ways you tried to write a regexp should also work. In this example it's faster because the second time around the outer loop start is 1, and regexp.search(string, 1) fails to find another match, so the generator exits at once (so skips checking O(N^2) other index pairs).

finding needle in haystack, what is a better solution?

so given "needle" and "there is a needle in this but not thisneedle haystack"
I wrote
def find_needle(n,h):
count = 0
words = h.split(" ")
for word in words:
if word == n:
count += 1
return count
This is O(n) but wondering if there is a better approach? maybe not by using split at all?
How would you write tests for this case to check that it handles all edge cases?
I don't think it's possible to get bellow O(n) with this (because you need to iterate trough the string at least once). You can do some optimizations.
I assume you want to match "whole words", for example looking up foo should match like this:
foo and foo, or foobar and not foo.
^^^ ^^^ ^^^
So splinting just based on space wouldn't do the job, because:
>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
# ^ ^
This is where re module comes in handy, which will allows you to build fascinating conditions. For example \b inside the regexp means:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
So r'\bfoo\b' will match only whole word foo. Also don't forget to use re.escape():
>>> re.escape('foo.bar+')
'foo\\.bar\\+'
>>> r'\b{}\b'.format(re.escape('foo.bar+'))
'\\bfoo\\.bar\\+\\b'
All you have to do now is use re.finditer() to scan the string. Based on documentation:
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
I assume that matches are generated on the fly, so they never have to be in memory at once (which may come in handy with large strings, with many matched items). And in the end just count them:
>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
This does not address the complexity issue but simplifies the code:
def find_needle(n,h):
return h.split().count(n)
You can use Counter
from collections import Counter
def find_needle(n,h):
return Counter(h.split())[n]
i.e.:
n = "portugal"
h = 'lobito programmer from portugal hello fromportugal portugal'
print find_needle(n,h)
Output:
2
DEMO
Actually, when you say O(n) you are forgetting the fact that after matching the first letter, you have to match the remaining ones as well (match n from needle to sentence, then match e, then the next e...) You are essentially trying to replicate the functionality of grep, so you can look at the grep algorithm. You can do well by building a finite state machine. There are many links that can help you, for one you could start from How does grep run so fast?
This is still going to be O(n) but it uses the power of the re module and python's generator expressions.
import re
def find_needle(n,h):
g = re.finditer(r'\b%s\b'%n, h) # use regex word boundaries
return sum(1 for _ in g) # return the length of the iterator
Should use far less memory than .split for a relatively large 'haystack'.
Note that this is not exactly the same as the code in the OP because it will not only find 'needle' but also 'needle,' and 'needle.' It will not find 'needles' though.
If you are concerned with the time it takes (as distinct from time complexity) multiprocess it. Basically make n smaller. Here is an example to run it in 2 processes.
from multiprocessing import Process
def find(word, string):
return string.count(word)
def search_for_words(word, string):
full_length = len(string)
part1 = string[:full_length/2]
proc1 = Process(target=find, args=(word, part1,))
proc1.start()
part2 = string[full_lenght/2:]
proc2 = Process(target=find, args=(word, part2,))
proc2.start()
proc1.join()
proc2.join()
if its O(n) you are worried about - then, i'm not sure there is much you can do, unless it is possible to get the string in another data structure. like a set or something. (but putting it in that set is also O(n), you can save on time if you are already iterating over the string somewhere else, and then make this structure then. write once, read many.
In order to guarantee finding a needle in a haystack, you need to examine each piece of hay until you find the needle. This is O(n) no matter what, a tight lower bound.
def find_needle(haystack):
for item in haystack:
if item == 'needle':
haystack.append(item)
return 'found the needle at position ' + str(haystack.index(item))
here's my one.
def find_needle(haystack, needle):
return haystack.count(needele)
here, we simply use the built-in count method to count the number of needles in a haystack.

String replacement with dictionary, complications with punctuation

I'm trying to write a function process(s,d) to replace abbreviations in a string with their full meaning by using a dictionary. where s is the string input and d is the dictionary. For example:
>>>d = {'ASAP':'as soon as possible'}
>>>s = "I will do this ASAP. Regards, X"
>>>process(s,d)
>>>"I will do this as soon as possible. Regards, X"
I have tried using the split function to separate the string and compare each part with the dictionary.
def process(s):
return ''.join(d[ch] if ch in d else ch for ch in s)
However, it returns me the same exact string. I have a suspicion that the code doesn't work because of the full stop behind ASAP in the original string. If so, how do I ignore the punctuation and get ASAP to be replaced?
Here is a way to do it with a single regex:
In [24]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}
In [25]: s = 'I will do this ASAP, AFAIK. Regards, X'
In [26]: re.sub(r'\b' + '|'.join(d.keys()) + r'\b', lambda m: d[m.group(0)], s)
Out[26]: 'I will do this as soon as possible, as far as I know. Regards, X'
Unlike versions based on str.replace(), this observes word boundaries and therefore won't replace abbreviations that happen to appear in the middle of other words (e.g. "etc" in "fetch").
Also, unlike most (all?) other solutions presented thus far, it iterates over the input string just once, regardless of how many search terms there are in the dictionary.
You can do something like this:
def process(s,d):
for key in d:
s = s.replace(key,d[key])
return s
Here is a working solution: use re.split(), and split by word boundaries (preserving the interstitial characters):
''.join( d.get( word, word ) for word in re.split( '(\W+)', s ) )
One significant difference that this code has from Vaughn's or Sheena's answer is that this code takes advantage of the O(1) lookup time of the dictionary, while their solutions look at every key in the dictionary. This means that when s is short and d is very large, their code will take significantly longer to run. Furthermore, parts of words will still be replaced in their solutions: if d = { "lol": "laugh out loud" } and s="lollipop" their solutions will incorrectly produce "laugh out loudlipop".
use regular expressions:
re.sub(pattern,replacement,s)
In your application:
ret = s
for key in d:
ret = re.sub(r'\b'+key+r'\b',d[key],ret)
return ret
\b matches the beginning or end of a word. Thanks Paul for the comment
Instead of splitting by spaces, use:
split("\W")
It will split by anything that's not a character that would be part of a word.
python 3.2
[s.replace(i,v) for i,v in d.items()]
This is string replacement as well (+1 to #VaughnCato). This uses the reduce function to iterate through your dictionary, replacing any instances of the keys in the string with the values. s in this case is the accumulator, which is reduced (i.e. fed to the replace function) on every iteration, maintaining all past replacements (also, per #PaulMcGuire's point above, this replaces keys starting with the longest and ending with the shortest).
In [1]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}
In [2]: s = 'I will do this ASAP, AFAIK. Regards, X'
In [3]: reduce(lambda x, y: x.replace(y, d[y]), sorted(d, key=lambda i: len(i), reverse=True), s)
Out[3]: 'I will do this as soon as possible, as far as I know. Regards, X'
As for why your function didn't return what you expected - when you iterate through s, you are actually iterating through the characters of the string - not the words. Your version could be tweaked by iterating over s.split() (which would be a list of the words), but you then run into an issue where the punctuation is causing words to not match your dictionary. You can get it to match by importing string and stripping out string.punctuation from each word, but that will remove the punctuation from the final string (so regex would be likely be the best option if replacement doesn't work).

Categories