Why doesn't replace () change all occurrences? - python

I have the following code:
dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
print(dna.count("GAGA"))
dna = dna.replace("GAGA", "AGAG")
print(dna.count("GAGA"))
Replace does not replace all occurrences. Could somebody help my in understanding why it happened?

It replaces all occurences. That might lead to new occurences (look at your replacement string!).
I'd say, logically, all is fine.
You could repeat this replace while dna.count("GAGA") > 0 , but: that sounds not like what you should be doing. (I bet you really just want to do one round of replacement to simulate something specific happening. Not a genetics expert at all though.)

It did make all replacements (that's what .replace() does in Python unless specified otherwise), but some of these replacements inadvertently introduced new instances of GAGA. Take the beginning of your string:
TGCGAGAA
There's GAGA at indices 3-6. If you replace that with AGAG, you get
TGCAGAGA
So the last G from that AGAG, together with the subsequent A that was already there before, forms a new GAGA.

Replacements does not occur "until exhausted"; they occur when a substring is matched in your original string.
Consider the following from your string:
>>> a = "TGCGAGAA"
>>> a.replace("GAGA", "AGAG")
'TGCAGAGA'
>>>
The replacement does not happen again, since the original string did not match GAGA in that location.
If you want to do the replacement until no match is found, you can wrap it in a loop:
>>> while a.count("GAGA") > 0: # you probably don't want to use count here if the string is long because of performance considerations
... a = a.replace("GAGA", "AGAG")
...
>>> a
'TGCAAGAG'

Related

How to remove a substrings from a list of strings?

I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
g=""
for j in range(len(i)):
if i[j] == ':':
index=j+1
res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?
Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x), splits the string on the : character, this returns a list containing the prefix (if it exists) and the suffix, and builds a new list with only the suffix (i.e. item [-1] in the list that results from the split. In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']
Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions:
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)

How to catch a string with re.sub but also let part of the outcome be caught again?

I need to use Python's re.sub to catch each 'H' in a string together with its preceding letter if any and its following letter if any, in order to feed the group into re.sub's replacing function as we're gonna do magic to potentially any of these three letters.
With this example: "eHetaHḷ", I'm going to have two groups: 'eHe' and 'aHḷ', feed them into the replacing function and out shoot 'ē' and 'āl'. I can do that.
Now with this one: "eHeHḷ", I need to have two groups: 'eHe' and... 'ēHḷ', that is to say I need the outcome of the first group to be caught in the next one.
I can use only one re.sub turn and I can't catch more that one letter behind and more than one ahead. There are a considerable number of possible outcomes so I can't easily if/else my way out of this.
Have you got any brilliant idea? Am I misunderstanding Regex' way of operating completely?
With re.sub you can't recursively replace on top of previous replacements.
What you can do, however, is conditionally match shorter or larger groups based on the appearance of the character H.
This pattern r'\wH((?=\wH)\wH.|.)' will match eHe in eHexxx, or eHeH! in eHeH!x.
It works by using a positive lookahead (?=\wH), which matches only if the current position is followed by (word character + H).
So ((?=\wH)\wH.|.) matches (\wH.) if (\wH) exists, else (.).
To enable differentiated replacement rules based on the H count you could create a substitution function:
def create_substitution(x: re.match):
match = x.group(0)
if match.count('H') == 1:
return 'custom handling of match xH.'
elif match.count('H') > 1:
return 'custom handling of match xHxH.'
which you can use in your substitution:
result = re.sub(r'\wH((?=\wH)\wH.|.)', create_substitution, "eHeHḷ")

How to use Boolean OR inside a regex

I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.
an re.findall of
"ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
should give me:
['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
I have tried all of the following without success:
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)
What am I missing here?
This is not super-easy, because a) you want overlapping matches, and b) you want greedy and non-greedy and everything inbetween.
As long as the strings are fairly short, you can check every substring:
import re
s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
p = re.compile(r'ATG.*TA[GA]$')
for start in range(len(s)-6): # string is at least 6 letters long
for end in range(start+6, len(s)):
if p.match(s, pos=start, endpos=end):
print(s[start:end])
This prints:
ATGTCAGGTAA
ATGTCAGGTAAGCTTAG
ATGTCAGGTAAGCTTAGGGCTTTAG
Since you appear to work with DNA sequences or something like that, make sure to check out Biopython, too.
I like the accepted answer just fine :-) That is, I'm adding this for info, not looking for points.
If you have heavy need for this, trying a match on O(N^2) pairs of indices may soon become unbearably slow. One improvement is to use the .search() method to "leap" directly to the only starting indices that can possibly pay off. So the following does that.
It also uses the .fullmatch() method so that you don't have to artificially change the "natural" regexp (e.g., in your example, no need to add a trailing $ to the regexp - and, indeed, in the following code doing so would no longer work as intended). Note that .fullmatch() was added in Python 3.4, so this code also requires Python 3!
Finally, this intends to generalize the re module's finditer() function/method. While you don't need match objects (you just want strings), they're far more generally applicable, and returning a generator is often friendlier than returning a list too.
So, no, this doesn't do exactly what you want, but does things from which you can get what you want, in Python 3, faster:
def finditer_overlap(regexp, string):
start = 0
n = len(string)
while start <= n:
# don't know whether regexp will find shortest or
# longest match, but _will_ find leftmost match
m = regexp.search(string, start)
if m is None:
return
start = m.start()
for finish in range(start, n+1):
m = regexp.fullmatch(string, start, finish)
if m is not None:
yield m
start += 1
Then, e.g.,
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
pat = re.compile("ATG.*(TAA|TAG)")
for match in finditer_overlap(pat, string2):
print(match.group())
prints what you wanted in your example. The other ways you tried to write a regexp should also work. In this example it's faster because the second time around the outer loop start is 1, and regexp.search(string, 1) fails to find another match, so the generator exits at once (so skips checking O(N^2) other index pairs).

Finding various string repeats in python in next 10 characters

So I'm working on a problem where I have to find various string repeats after encountering an initial string, say we take ACTGAC so the data file has sequences that look like:
AAACTGACACCATCGATCAGAACCTGA
So in that string once we find ACTGAC then I need to analyze the next 10 characters for the string repeats which go by some rules. I have the rules coded but can anyone show me how once I find the string that I need, I can make a substring for the next ten characters to analyze. I know that str.partition function can do that once I find the string, and then the [1:10] can get the next ten characters.
Thanks!
You almost have it already (but note that indexes start counting from zero in Python).
The partition method will split a string into head, separator, tail, based on the first occurence of separator.
So you just need to take a slice of the first ten characters of the tail:
>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'
Python allows you to leave out the start-index in slices (in defaults to zero - the start of the string), and also the end-index (it defaults to the length of the string).
Note that you could also do the whole operation in one line, like this:
>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
So, based on marcog's answer in Find all occurrences of a substring in Python , I propose:
>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']

Sequence of vowels count

This is not a homework question, it is an exam preparation question.
I should define a function syllables(word) that counts the number of syllables in
A word in the following way:
• a maximal sequence of vowels is a syllable;
• a final e in a word is not a syllable (or the vowel sequence it is a part
Of).
I do not have to deal with any special cases, such as a final e in a
One-syllable word (e.g., ’be’ or ’bee’).
>>> syllables(’honour’)
2
>>> syllables(’decode’)
2
>>> syllables(’oiseau’)
2
Should I use regular expression here or just list comprehension ?
I find regular expressions natural for this question. (I think a non-regex answer would take more coding. I use two string methods, 'lower' and 'endswith' to make the answer more clear.)
import re
def syllables(word):
word = word.lower()
if word.endswith('e'):
word = word[:-1]
count = len(re.findall('[aeiou]+', word))
return count
for word in ('honour', 'decode', 'decodes', 'oiseau', 'pie'):
print word, syllables(word)
Which prints:
honour 2
decode 2
decodes 3
oiseau 2
pie 1
Note that 'decodes' has one more syllable than 'decode' (which is strange, but fits your definition).
Question. How does this help you? Isn't the point of the study question that you work through it yourself? You may get more benefit in the future by posting a failed attempt in your question, so you can learn exactly where you are lacking.
Use regexps - most languages will let you count the number of matches of a regexp in a string.
Then special-case the terminal-e by checking the right-most match group.
I don't think regex is the right solution here.
It seems pretty straightforward to write this treating each string as a list.
Some pointers:
[abc] matches a, b or c.
A + after a regex token allows the token to match once or more
$ matches the end of the string.
(?<=x) matches the current position only if the previous character is an x.
(?!x) matches the current position only if the next character is not an x.
EDIT:
I just saw your comment that since this is not homework, actual code is requested.
Well, then:
[aeiou]+(?!(?<=e)$)
If you don't want to count final vowel sequences that end in e at all (like the u in tongue or the o in toe), then use
[aeiou]+(?=[^aeiou])|[aeiou]*[aiou]$
I'm sure you'll be able to figure out how it works if you read the explanation above.
Here's an answer without regular expressions. My real answer (also posted) uses regular expressions. Untested code:
def syllables(word):
word = word.lower()
if word.endswith('e'):
word = word[:-1]
vowels = 'aeiou'
in_vowel_group = False
vowel_groups = 0
for letter in word:
if letter in vowels:
if not in_vowel_group:
in_vowel_group = True
vowel_groups += 1
else:
in_vowel_group = False
return vowel_groups
Both ways work. You said yourself that it was for exam preparation. Use whichever is going to be on the exam. If they're both on the exam, use which you need more practice for. Just remember:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. ~Jamie Zawinski
So in my opinion, don't use regex unless you need the practice.
Regular expressions would be way too complex, and a list comprehension probably wouldn't be robust enough. You will probably be able to solve this easily using a grammar lexer like PyParsing. Give it a shot!
Use a regex that matches a,e,i,o, or u, convert the string to a list, then iterate through the list... 1 for first true, 1 for next false, 2 for next true, 2 for next false, etc.
To handle the case where the last letter is 'e' following a consonant (as in ate), just check the last two letters of the word before you start. If they match that pattern truncate the final e and process as normal.
This pattern works for your definition:
(?!e$)([aeiouy]+)
Just count how many times it occurs.

Categories