Find last occurrence of multiple characters in a string in Python - python

I would like to find the last occurrence of a number of characters in a string.
str.rfind() will give the index of the last occurrence of a single character in a string, but I need the index of the last occurrence of any of a number of characters. For example if I had a string:
test_string = '([2+2])-[3+4])'
I would want a function that returns the index of the last occurrence of (, [, or {, similar to
test_string.rfind('(', '[', '{')
Which would ideally return 8. What is the best way to do this?
max(test_string.rfind('('), test_string.rfind('['), test_string.rfind('{'))
seems clunky and not Pythonic.

You can use a generator expression to do this in a Pythonic way.
max(test_string.rfind(i) for i in "([{")
This iterates over the characters you want to check (here the string "([{"), calls rfind() on each, and returns the maximum of the resulting indices. If none of the characters occurs, every rfind() returns -1, so the result is -1.

This is pretty concise, and will do the trick.
max(map(test_string.rfind, '([{'))

You can iterate over reversed(test_string) and stop at the first match. Subtracting the reversed index i from len(test_string) - 1 converts it to an index counted from the start, and at worst this makes a single pass over the string:
test_string = '([2+2])-[3+4])'
st = {"[", "(", "{"}
print(next((len(test_string) - 1 - i
for i, s in enumerate(reversed(test_string)) if s in st),-1))
8
If there is no match, you get -1 as the default value. This is a lot more efficient when you have a large number of characters to search for than doing an O(n) rfind for every one of them and then taking the max of the results.
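As a rough check of the reversed-scan idea, the sketch below wraps both approaches in functions and confirms they agree on the example. The names last_index_scan and last_index_rfind are made up for illustration. Note that the scan only handles single characters, whereas the rfind version also accepts longer substrings.

```python
def last_index_scan(text, chars):
    """Scan from the end; return the index of the last character found in chars, or -1."""
    wanted = set(chars)  # set membership keeps each per-character check cheap
    for offset, ch in enumerate(reversed(text)):
        if ch in wanted:
            return len(text) - 1 - offset  # convert reversed offset to a normal index
    return -1


def last_index_rfind(text, chars):
    """One rfind pass per candidate; -1 if none occur (chars must be non-empty)."""
    return max(text.rfind(c) for c in chars)


test_string = '([2+2])-[3+4])'
print(last_index_scan(test_string, '([{'))    # 8
print(last_index_rfind(test_string, '([{'))   # 8
print(last_index_scan('no brackets', '([{'))  # -1
```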

>>> def last_of_many(string, findees):
...     return max(string.rfind(s) for s in findees)
...
>>> test_string = '([2+2])-[3+4])'
>>> last_of_many(test_string, '([{')
8
>>> last_of_many(test_string, ['+4', '+2'])
10
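One edge case worth noting: max() raises ValueError when given an empty iterable, so last_of_many fails if findees is empty. A minimal variant, assuming -1 is the desired "not found" result in that case too, passes default=-1 (a keyword available in Python 3.4 and later):

```python
def last_of_many(string, findees):
    # default=-1 covers an empty findees collection, mirroring rfind's "not found" value
    return max((string.rfind(s) for s in findees), default=-1)


test_string = '([2+2])-[3+4])'
print(last_of_many(test_string, '([{'))         # 8
print(last_of_many(test_string, ['+4', '+2']))  # 10
print(last_of_many(test_string, []))            # -1
```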

Related

What is the easiest way of finding the characters of a string after the last occurrence of a given character in Python?

I am trying to find the easiest way of returning the substring consisting of the characters of a string after the last occurrence of a given character in Python.
Example:
s = 'foo-bar-123-7-foo2'
I am interested in the characters after the last occurrence of '-'.
So the output would be 'foo2'
I could use str.find(sub, start, end) to find the position of the first '-', store its position, then repeat the search starting from that position until there are no more, and finally return the characters after the last position. But is there a nicer way?
Simply with str.rfind function (returns the highest index in the string where substring is found):
s = 'foo-bar-123-7-foo2'
res = s[s.rfind('-') + 1:]
print(res) # foo2
s = 'foo-bar-123-7-foo2'
print(s.rsplit('-', 1)[1])
Use rfind.
s.rfind('-') will return 13, which is the index of the last occurrence of '-',
and then
s[s.rfind('-')+1:] will return foo2
Here is a regex option:
import re
s = 'foo-bar-123-7-foo2'
output = re.sub(r'^.*-', '', s)
print(output)
This prints: foo2
The strategy here is to match all content up to and including the final dash character, then replace that with an empty string, leaving only the final portion.
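The approaches above behave differently when the separator does not occur at all, which may matter for input you do not control. A small sketch comparing them (the sample string no_dash is just an illustration):

```python
import re

s = 'foo-bar-123-7-foo2'
no_dash = 'foo2'

# rfind returns -1 when nothing is found, so the slice starts at 0 and keeps the whole string
print(s[s.rfind('-') + 1:])              # foo2
print(no_dash[no_dash.rfind('-') + 1:])  # foo2

# rsplit yields a single-element list without a separator, so index [1] would raise
# IndexError; index [-1] works in both cases
print(s.rsplit('-', 1)[-1])        # foo2
print(no_dash.rsplit('-', 1)[-1])  # foo2

# rpartition always returns a 3-tuple; the last item is the tail
print(s.rpartition('-')[2])        # foo2
print(no_dash.rpartition('-')[2])  # foo2

# the regex substitution simply makes no replacement
print(re.sub(r'^.*-', '', no_dash))  # foo2
```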

Regex inside findall vs regex inside count

This is a follow up question to How to count characters in a string? and to Find out how many times a regex matches in a string in Python
I want to count all alphabet characters in the string:
'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
The str.count() method allows for counting a specific letter. How would one do that for counting any letter in the entire alphabet in a string, using the count method?
I am trying to use a regex inside the count method, but it returns 0 instead of 83. The code I am using is:
import re
spam_data['text'][0].count((r'[a-zA-Z]'))
When I use:
len(re.findall((r'[a-zA-Z]'), spam_data['text'][0])) it returns a length of 83.
Why does count return a 0 here?
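Because spam_data['text'] is a pandas Series, use its .str.count accessor, which treats the pattern as a regex, instead of calling Python's built-in str.count on a single element. Note that \w also matches digits and underscores, so use '[a-zA-Z]' if you want letters only.
spam_data['text'].str.count(r'\w')
0    83
Name: text, dtype: int64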
To access the first value use:
spam_data['text'].str.count(r'\w')[0]
83
How would one do that for counting any letter in the entire alphabet in a string, using the count method?
>>> wrd = 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
>>> count = sum([''.join({_ for _ in wrd if _.isalpha()}).count(w) for w in wrd])
>>> count
83
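Explanation: the set comprehension collects the unique letters in wrd. For each character w in wrd, .count(w) on that string of unique letters is 1 if w is a letter and 0 otherwise, so the sum is the number of letters.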
similar to:
count = []
set_w = set()
for w in wrd:
    if w.isalpha():
        set_w.add(w)
for w in set_w:
    count.append(wrd.count(w))
print(sum(count))
In this one:
spam_data['text'][0].count((r'[a-zA-Z]'))
str.count takes a literal substring, not a regex, which is why it returns 0.
Use your second example.
Short answer: you did not use a regex, but a raw string literal, and thus counted occurrences of the string '[a-zA-Z]'.
A string of the form r'..' is not a regex; it is a raw string literal. If you write r'\n', you get a string with two characters, a backslash and an n, not a newline. Raw strings are useful in the context of regexes because regexes use a lot of backslash escapes as well.
For example:
>>> r'\n'
'\\n'
>>> type(r'\n')
<class 'str'>
But here you thus count the number of times the string '[a-zA-Z]' occurs, and unless your spam_data['text'][0] literally contains a square bracket [ followed by a, etc., the count will be zero. Or as specified in the documentation of str.count [Python-doc]:
string.count(s, sub[, start[, end]])
Return the number of (non-overlapping) occurrences of substring sub in string s[start:end]. Defaults for start and end and interpretation of negative values are the same as for slices.
In case the string is rather large, and you do not want to construct a list of matches, you can count the number of elements with:
sum(1 for _ in re.finditer('[a-zA-Z]', 'mystring'))
It is however typically faster to simply use re.findall(..) and then calculate the number of elements.
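To make the difference concrete, here is a small sketch on a shortened sample string (an assumption for brevity, not the original data). str.count looks for the literal text, while re.findall and a plain isalpha() sum count letters.

```python
import re

text = 'Go until jurong point, crazy..'

# str.count treats its argument as literal text, so this searches for
# the 8-character string "[a-zA-Z]" itself
print(text.count(r'[a-zA-Z]'))          # 0
print('x[a-zA-Z]y'.count(r'[a-zA-Z]'))  # 1

# re.findall interprets the pattern as a regex character class
print(len(re.findall(r'[a-zA-Z]', text)))  # 23

# no regex needed; note that isalpha() also accepts non-ASCII letters
print(sum(ch.isalpha() for ch in text))  # 23
```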

Confusion with string split method in python

Consider the following example
a = 'Apple'
b = a.split(',')
print(b)
Output is ['Apple'].
I do not understand why it returns a list even when there is no ',' character in 'Apple'.
There might be cases where we call split expecting more than one element in the list, but because the separator is not present in the string there is only one. Wouldn't it be better if this mistake were caught by the split method itself?
The behaviour of a.split(',') when no commas are present in a is perfectly consistent with the way it behaves when there are a positive number of commas in a.
a.split(',') says to split string a into a list of substrings that are delimited by ',' in a; the delimiter is not preserved in the substrings.
If 1 comma is found you get 2 substrings in the list, if 2 commas are found you get 3 substrings in the list, and in general, if n commas are found you get n+1 substrings in the list. So if 0 commas are found you get 1 substring in the list.
If you want 0 substrings in the list, then you'll need to supply a string with -1 commas in it. Good luck with that. :)
The docstring of that method says:
Return a list of the words in the string S, using sep as the delimiter string.
The delimiter is used to separate multiple parts of the string; having only one part is not an error.
That's the way the split() method works. If you do not want that behaviour, you can implement your own my_split() function as follows:
def my_split(s, d=' '):
    return s.split(d) if d in s else s
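Note that my_split returns a plain string rather than a list when the delimiter is missing, so callers have to handle two return types. If the goal is to catch a missing separator as an error, one alternative sketch (the name split_strict is hypothetical, not part of the standard library) is to raise instead:

```python
def split_strict(s, sep=','):
    """Split s on sep, but raise ValueError if sep does not occur in s."""
    if sep not in s:
        raise ValueError(f"separator {sep!r} not found in {s!r}")
    return s.split(sep)


print(split_strict('Apple,Pear'))  # ['Apple', 'Pear']

try:
    split_strict('Apple')
except ValueError as exc:
    print('error:', exc)
```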

Dot notation string manipulation

Is there a way to manipulate a string in Python using the following ways?
For any string that is stored in dot notation, for example:
s = "classes.students.grades"
Is there a way to change the string to the following:
"classes.students"
Basically, remove the last period and everything after it. So "restaurants.spanish.food.salty" would become "restaurants.spanish.food".
Additionally, is there any way to identify what comes after the last period? The reason I want to do this is that I want to use isdigit().
So, if it was classes.students.grades.0, could I grab the 0 somehow, so I could use an if statement with isdigit and say: if the part of the string after the last period (so 0 in this case) is a digit, remove it, otherwise leave it.
You can use split and join together:
s = "classes.students.grades"
print('.'.join(s.split('.')[:-1]))
Splitting the string on . gives you a list of strings; joining the list elements with . turns them back into a single string.
[:-1] picks all the elements from the list except the last one.
To check what comes after the last .:
s.split('.')[-1]
Another way is to use rsplit. It works the same way as split, but if you provide the maxsplit parameter it splits the string starting from the end:
rest, last = s.rsplit('.', 1)
print(rest) # classes.students
print(last) # grades
You can also use re.sub to substitute the part after the last . with an empty string:
re.sub(r'\.[^.]+$', '', s)
As for the last part of your question, wrapping words in [], I would recommend using format and a generator expression:
''.join("[{}]".format(e) for e in s.split('.'))
It'll give you the desired output:
[classes][students][grades]
The best way to do this is using the rsplit method and pass in the maxsplit argument.
>>> s = "classes.students.grades"
>>> before, after = s.rsplit('.', maxsplit=1) # in Python 2, pass it positionally: rsplit('.', 1)
>>> before
'classes.students'
>>> after
'grades'
You can also use the rfind() method with normal slice operation.
To get everything before last .:
>>> s = "classes.students.grades"
>>> last_index = s.rfind('.')
>>> s[:last_index]
'classes.students'
Then everything after last .
>>> s[last_index + 1:]
'grades'
If '.' is in s, s.rpartition('.') finds the last dot in s
and returns (before_last_dot, dot, after_last_dot):
s = "classes.students.grades"
s.rpartition('.')[0]
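A brief sketch of what rpartition returns, including the case without a dot, where the first element is an empty string rather than the original string:

```python
s = "classes.students.grades"
before, dot, after = s.rpartition('.')
print(before)  # classes.students
print(after)   # grades

# Without a dot, the whole string ends up in the last slot
print("classes".rpartition('.'))  # ('', '', 'classes')
```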
If your goal is to get rid of a final component that's just a single digit, start and end with re.sub():
s = re.sub(r"\.\d$", "", s)
This will do the job, and leave other strings alone. No need to mess with anything else.
If you do want to know about the general case (separate out the last component, no matter what it is), then use rsplit to split your string once:
>>> "hel.lo.there".rsplit(".", 1)
['hel.lo', 'there']
If there's no dot in the string you'll just get one element in your list, the entire string.
You can do it very simply with rsplit (str.rsplit([sep[, maxsplit]])), which returns a list by splitting the string along the given separator.
You can also specify how many splits should be performed:
>>> s = "res.spa.f.sal.786423"
>>> s.rsplit('.',1)
['res.spa.f.sal', '786423']
So the final function that you describe is:
def dimimak_cool_function(s):
    if '.' not in s:
        return s
    start, end = s.rsplit('.', 1)
    return start if end.isdigit() else s
>>> dimimak_cool_function("res.spa.f.sal.786423")
'res.spa.f.sal'
>>> dimimak_cool_function("res.spa.f.sal")
'res.spa.f.sal'

Finding various string repeats in python in next 10 characters

So I'm working on a problem where I have to find various string repeats after encountering an initial string, say we take ACTGAC so the data file has sequences that look like:
AAACTGACACCATCGATCAGAACCTGA
So in that string once we find ACTGAC then I need to analyze the next 10 characters for the string repeats which go by some rules. I have the rules coded but can anyone show me how once I find the string that I need, I can make a substring for the next ten characters to analyze. I know that str.partition function can do that once I find the string, and then the [1:10] can get the next ten characters.
Thanks!
You almost have it already (but note that indexes start counting from zero in Python).
The partition method will split a string into head, separator, tail, based on the first occurrence of separator.
So you just need to take a slice of the first ten characters of the tail:
>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'
Python allows you to leave out the start-index in slices (it defaults to zero - the start of the string), and also the end-index (it defaults to the length of the string).
Note that you could also do the whole operation in one line, like this:
>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
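When the marker is absent, partition returns the whole string as the head and empty strings for the separator and tail, so the slice is simply empty. A small sketch wrapping this in a helper (the name chars_after and the choice to return None when nothing is found are assumptions for illustration):

```python
def chars_after(data, marker, n=10):
    """Return up to n characters following the first occurrence of marker, or None if absent."""
    head, sep, tail = data.partition(marker)
    if not sep:  # an empty separator slot means the marker was not found
        return None
    return tail[:n]


data = 'AAACTGACACCATCGATCAGAACCTGA'
print(chars_after(data, 'ACTGAC'))  # ACCATCGATC
print(chars_after(data, 'GGGGGG'))  # None
```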
So, based on marcog's answer in Find all occurrences of a substring in Python , I propose:
>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']
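The same result can be obtained without a regex by calling str.find repeatedly; advancing by one position after each hit keeps overlapping matches, like the lookahead pattern does. A sketch (the function name windows_after is made up here):

```python
def windows_after(data, sep, n=10):
    """Collect up to n characters after every (possibly overlapping) occurrence of sep."""
    results = []
    start = data.find(sep)
    while start != -1:
        begin = start + len(sep)
        results.append(data[begin:begin + n])
        start = data.find(sep, start + 1)  # step by 1 so overlapping hits are not skipped
    return results


data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
print(windows_after(data, 'ACTGAC'))  # ['ACCATCGATC', 'TGACAAA', 'AAA']
```

If you stay with the regex version and the separator might contain regex metacharacters, wrapping it in re.escape() is a sensible precaution.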
