Pythonic way to select characters after specific character sequence in string

I have a list of strings like
lst = ['foo000bar111', 'foo000bar1112', 'foo000bar1113']
and I want to extract the last numbers from each string to get
nums = ['111', '1112', '1113']
I have other numbers earlier in the string that I don't care about (000 in this example). There aren't spaces, so I can't lst.split() and I believe doing something like that without spacing is difficult. The numbers are of different lengths, so I can't just do str[-3:]. For what it's worth, the characters before the numbers I care about are the same in each string, and the numbers are at the end of the string.
I'm looking for a way to say 'ok, read until you find bar and then tell me what's the rest of the string.' The best I've come up with is [str[(str.index('bar')+3):] for str in lst], which works, but I doubt that's the most pythonic way to do it.

Your method is accurate. You can also try using re
>>> import re
>>> lst = ['foo000bar111', 'foo000bar1112', 'foo000bar1113']
>>> [re.search(r'(\d+$)',i).group() for i in lst]
['111', '1112', '1113']
You can also try rindex
>>> [i[i.rindex('r')+1:] for i in lst]
['111', '1112', '1113']

Your solution is not bad at all, but you could improve it in a couple of ways:
Use rindex() instead of index; if bar should happen to occur twice (or more) in a string, you want to find the last instance.
Or you can use rsplit():
[ s.rsplit("bar", 1)[1] for s in lst ]
Edit: @Bas beat me to the second solution by a few seconds! :-)

Your own solution works well enough, but I think the main problem with it is that you have to hard-code the length of the search string you are using. This could be solved using a temporary variable like this:
tag = 'bar'
[s[(s.index(tag)+len(tag)):] for s in lst]
One alternative way using rsplit:
[x.rsplit('bar', 1)[1] for x in lst]
This always splits on the last occurrence of bar, even if it occurs more than once.
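
The suggestions above can be pulled together in a small sketch. The helper name text_after is hypothetical, and returning an empty string when the marker is missing is an assumption (index() and rindex() would raise ValueError instead). str.rpartition splits on the last occurrence, so the marker's length never has to be hard-coded:

```python
def text_after(s, tag):
    """Return whatever follows the last occurrence of tag in s.

    Hypothetical helper: returns '' when tag does not occur at all
    (an assumption; index()/rindex() would raise ValueError instead).
    """
    head, sep, tail = s.rpartition(tag)
    # rpartition returns ('', '', s) when tag is absent, so check sep.
    return tail if sep else ''


lst = ['foo000bar111', 'foo000bar1112', 'foo000bar1113']
nums = [text_after(s, 'bar') for s in lst]
print(nums)
```

Checking sep rather than tail keeps the "marker missing" case distinct from "marker present but nothing after it" if you later want to treat them differently.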

Related

Why doesn't replace() change all occurrences?

I have the following code:
dna = "TGCGAGAAGGGGCGATCATGGAGATCTACTATCCTCTCGGGGTATGGTGGGGTTGAGA"
print(dna.count("GAGA"))
dna = dna.replace("GAGA", "AGAG")
print(dna.count("GAGA"))
Replace does not replace all occurrences. Could somebody help me understand why this happens?

It replaces all occurrences. That might lead to new occurrences (look at your replacement string!).
I'd say, logically, all is fine.
You could repeat this replace while dna.count("GAGA") > 0, but that doesn't sound like what you should be doing. (I bet you really just want to do one round of replacement to simulate something specific happening. Not a genetics expert at all though.)

It did make all replacements (that's what .replace() does in Python unless specified otherwise), but some of these replacements inadvertently introduced new instances of GAGA. Take the beginning of your string:
TGCGAGAA
There's GAGA at indices 3-6. If you replace that with AGAG, you get
TGCAGAGA
So the last G from that AGAG, together with the subsequent A that was already there before, forms a new GAGA.

Replacements do not occur "until exhausted"; they occur where a substring is matched in your original string.
Consider the following from your string:
>>> a = "TGCGAGAA"
>>> a.replace("GAGA", "AGAG")
'TGCAGAGA'
>>>
The replacement does not happen again, since the original string did not match GAGA in that location.
If you want to do the replacement until no match is found, you can wrap it in a loop:
>>> while a.count("GAGA") > 0: # you probably don't want to use count here if the string is long because of performance considerations
...     a = a.replace("GAGA", "AGAG")
...
>>> a
'TGCAAGAG'
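
As a rough sketch of the loop above wrapped in a function: the name replace_until_stable and the max_rounds guard are both hypothetical additions. The guard is there because some pairs (for example replacing 'A' with 'AA') would never settle. The `in` operator stops at the first hit, so it avoids counting every occurrence on each pass:

```python
def replace_until_stable(s, old, new, max_rounds=1000):
    """Apply s.replace(old, new) repeatedly until old no longer occurs.

    max_rounds is a hypothetical safety limit: some pairs (for example
    old='A', new='AA') would otherwise loop forever.
    """
    rounds = 0
    while old in s:  # 'in' stops at the first match, unlike count()
        if rounds >= max_rounds:
            raise RuntimeError("replacement did not settle within max_rounds")
        s = s.replace(old, new)
        rounds += 1
    return s


a = "TGCGAGAA"
one_pass = a.replace("GAGA", "AGAG")               # single left-to-right pass
settled = replace_until_stable(a, "GAGA", "AGAG")  # repeat until no GAGA is left
print(one_pass, settled)
```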

How to remove a substring from a list of strings?

I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
    g=""
    for j in range(len(i)):
        if i[j] == ':':
            index=j+1
    res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?

Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea if you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x) and splits it on the : character. The split returns a list containing the prefix (if it exists) and the suffix, and the comprehension builds a new list with only the suffix (i.e. item [-1] in the list that results from the split). In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']
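
One caveat worth illustrating: when a string contains more than one colon, [-1] after a plain split keeps only the text after the last colon. The sample "pp:12:30" below is made up for illustration. Passing a maxsplit of 1 keeps everything after the first colon instead:

```python
samples = ["pp:actual_string", "perfect_string", "pp:12:30"]

# split(':')[-1] keeps only the text after the LAST colon.
after_last = [x.split(':')[-1] for x in samples]

# split(':', 1)[-1] splits at most once, keeping everything after the FIRST colon.
after_first = [x.split(':', 1)[-1] for x in samples]

print(after_last)
print(after_first)
```

Which of the two is right depends on whether a colon can legitimately appear in the part you want to keep.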

Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
    s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions:
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)
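
For comparison, here is a runnable sketch of two of these options applied to a whole list. The function names and sample data are hypothetical, and the pattern ^[^:]*: is a swapped-in variant (not one of the expressions above) that removes any prefix up to and including the first colon:

```python
import re


def strip_known_prefix(s, prefix='pp:'):
    """Remove prefix only when the string actually starts with it."""
    if s.startswith(prefix):
        return s[len(prefix):]
    return s


def strip_any_prefix(s):
    """Remove everything up to and including the first colon, if there is one."""
    return re.sub(r'^[^:]*:', '', s)


items = ["pp:gaurav", "pp17:there", "plain"]
known = [strip_known_prefix(s) for s in items]
any_prefix = [strip_any_prefix(s) for s in items]
print(known)
print(any_prefix)
```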

Is there a reverse \n?

I am making a dictionary application using argparse in Python 3. I'm using difflib to find the closest matches to a given word. The result is a list, though, and its entries have newline characters at the end, like:
['hello\n', 'hallo\n', 'hell\n']
And when I put a word in, it gives an output like this:
hellllok could be spelled as hello
hellos
hillock
Question:
I'm wondering if there is a reverse or inverse \n so I can counteract these \n's.
Any help is appreciated.

There's no "reverse newline" in the standard character set but, even if there was, you would have to apply it to each string in turn.
And, if you can do that, you can equally modify the strings to remove the newline. In other words, create a new list using the current one, with newlines removed. That would be something like:
>>> oldlist = ['hello\n', 'hallo\n', 'hell\n']
>>> oldlist
['hello\n', 'hallo\n', 'hell\n']
>>> newlist = [s.replace('\n','') for s in oldlist]
>>> newlist
['hello', 'hallo', 'hell']
That will remove all newlines from each of the strings. If you want to ensure you only replace a single newline at the end of the strings, you can instead use:
newlist = [re.sub('\n$','',s) for s in oldlist]
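
A regex-free alternative, offered as a sketch rather than the only way to do it: str.rstrip('\n') removes every trailing newline, while the small helper drop_one_newline (a hypothetical name) removes at most one:

```python
oldlist = ['hello\n', 'hallo\n', 'hell\n']

# rstrip('\n') removes ALL trailing newline characters from each string.
stripped = [s.rstrip('\n') for s in oldlist]


def drop_one_newline(s):
    """Remove a single trailing newline, if present (hypothetical helper)."""
    return s[:-1] if s.endswith('\n') else s


one_removed = [drop_one_newline(s) for s in oldlist]
print(stripped)
print(one_removed)
```

For this list both approaches give the same result; they differ only for strings that end in several newlines.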

Get the actual ending when testing with .endswith(tuple)

I found a nice question where one can search for multiple endings of a string using: endswith(tuple)
Check if string ends with one of the strings from a list
My question is, how can I return which value from the tuple is actually found to be the match? and what if I have multiple matches, how can I choose the best match?
for example:
str= "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ('AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA')
str.endswith(endings) ## this will return true for all of values inside the tuple, but how can I get which one matches the best
In this case, multiple matches can be found from the tuple, how can I deal with this and return only the best (biggest) match, which in this case should be: 'AAAAAAAAA' which I want to remove at the end (which can be done with a regular expression or so).
I mean one could do this in a for loop, but maybe there is an easier pythonic way?

>>> s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
>>> endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
>>> max([i for i in endings if s.endswith(i)],key=len)
'AAAAAAAAA'
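
To also remove the ending that was found, something along these lines may help. Both function names are hypothetical, and returning None when nothing matches is an assumption (max() on an empty list would otherwise raise ValueError):

```python
def longest_ending(s, endings):
    """Return the longest item of endings that s ends with, or None."""
    matches = [e for e in endings if s.endswith(e)]
    # default=None avoids a ValueError when nothing matches.
    return max(matches, key=len, default=None)


def strip_longest_ending(s, endings):
    """Return s with its longest matching ending removed (unchanged if none)."""
    best = longest_ending(s, endings)
    return s[:-len(best)] if best else s


s = "ERTYHGFYUUHGFREDFY" + "A" * 10
endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
print(longest_ending(s, endings))
print(strip_longest_ending(s, endings))
```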

import re
s = "ERTYHGFYUUHGFREDFYAAAAAAAAAA"
endings = ['AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA', 'AAAAAAAAA']
# re.escape guards against endings that contain regex metacharacters
print(max([i for i in endings if re.findall(re.escape(i) + r"$", s)], key=len))

How about:
len(str) - len(str.rstrip('A'))

str.endswith(tuple) is (currently) implemented as a simple loop over tuple, repeatedly re-running the match; any similarities between the endings are not taken into account.
In the example case, a regular expression should compile into an automaton that essentially runs in linear time:
regexp = '(' + '|'.join(
    re.escape(ending) for ending in sorted(endings, key=len, reverse=True)
) + ')$'
Edit 1: As pointed out correctly by Martijn Pieters, Python's re does not return the longest overall match, but for alternates only matches the first matching subexpression:
https://docs.python.org/2/library/re.html#module-re:
When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match.
(emphasis mine)
Hence, unfortunately the need for sorting by length.
Note that this makes Python's re different from POSIX regular expressions, which match the longest overall match.
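
A small sketch of the first-branch-wins behaviour quoted above, followed by the length-sorted pattern applied to the example string. The variable names are illustrative only:

```python
import re

# Alternation is tried left to right; the first branch that matches wins.
first_branch = re.match(r'A|AA', 'AA').group()
longer_first = re.match(r'AA|A', 'AA').group()
print(first_branch, longer_first)

s = "ERTYHGFYUUHGFREDFY" + "A" * 10
endings = tuple('A' * n for n in range(5, 10))  # 'AAAAA' up to 'AAAAAAAAA'

# Longest alternatives first, each escaped, anchored at the end of the string.
regexp = '(' + '|'.join(
    re.escape(ending) for ending in sorted(endings, key=len, reverse=True)
) + ')$'

m = re.search(regexp, s)
best = m.group(1) if m else None
print(best)
```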

extract first three numbers from a string

I have strings like
"ABCD_ABCD_6.2.15_3.2"
"ABCD_ABCD_12.22.15_4.323"
"ABCD_ABCD_2.33.15_3.223"
I want to extract following from above
"6.2.15"
"12.22.15"
"2.33.15"
I tried using the indices of the numbers, but I can't use them since they vary. The only constant here is the length of the characters at the beginning of each string.

Another way would be this regex:
_(\d+.*?)_
import re
m = re.search('_(\\d+.*?)_', 'ABCD_ABCD_6.2.15_3.2')
m.group(1)

There are a ton of ways to do this. Try:
>>> "ABCD_ABCD_6.2.15_3.2".split("_")[2]
'6.2.15'
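
Both ideas can be applied to all three strings at once. The stricter pattern below is a swapped-in variant, assuming the wanted field is always three dot-separated numbers between underscores:

```python
import re

names = [
    "ABCD_ABCD_6.2.15_3.2",
    "ABCD_ABCD_12.22.15_4.323",
    "ABCD_ABCD_2.33.15_3.223",
]

# Option 1: the wanted field is always the third underscore-separated part.
by_split = [n.split('_')[2] for n in names]

# Option 2: look for digits.digits.digits enclosed by underscores.
pattern = re.compile(r'_(\d+(?:\.\d+){2})_')
by_regex = [pattern.search(n).group(1) for n in names]

print(by_split)
print(by_regex)
```

The split version relies on the number of underscores in the prefix staying fixed; the regex version relies on the shape of the number instead.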
