Remove substrings of variable length from string - python

I have a list of strings where all of the strings roughly follow the format 'foo\tbar\tfoo\n' in that there are three segments of variable length that are separated by two tabs (\t) and with a newline indicator at the end (\n).
I want to remove everything except for the text before the first \t, so that it would return 'foo'. Given that the first segment is of variable length, I'm not sure how I can do that.

Use str.split():
>>> string = 'foo\tbar\tfoo\n'
>>> string.split('\t', 1)[0]
'foo'
This splits the string by the first occurrence of the '\t' tab character, which returns a list with two elements. The [0] selects the first element in the list, which is the part of the string before the first '\t' occurrence.
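Since the question mentions a list of strings, the same expression can be applied to each element with a list comprehension. This is a minimal sketch; the sample list and the names lines and first_fields are made up for illustration.

```python
# Hypothetical sample data in the 'foo\tbar\tfoo\n' layout from the question.
lines = ['foo\tbar\tfoo\n', 'alpha\tbeta\tgamma\n', 'x\ty\tz\n']

# Split each line once on the first tab and keep the part before it.
first_fields = [line.split('\t', 1)[0] for line in lines]
print(first_fields)  # ['foo', 'alpha', 'x']
```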

Just search for the first \t character, and get everything before it. Slicing makes this easy.
newstr = oldstr[:oldstr.find("\t")]

Try with:
t = 'foo\tbar\tfoo\n'
t[:t.index("\t")]
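One difference between these approaches only shows up when a string contains no tab at all: str.find returns -1, so the slice silently drops the last character, while str.index raises ValueError. The sketch below uses str.partition as one possible way to handle that case; first_field is a hypothetical helper name, not something from the answers above.

```python
def first_field(line):
    """Return the text before the first tab, or the whole line if there is no tab."""
    head, _sep, _tail = line.partition('\t')
    return head

no_tab = 'foo'
# find() returns -1 here, so the slice becomes no_tab[:-1].
print(no_tab[:no_tab.find('\t')])        # fo
print(first_field(no_tab))               # foo
print(first_field('foo\tbar\tfoo\n'))    # foo
```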

Related

What is the easiest way of finding the characters of a string after the last occurrence of a given character in Python?

I am trying to find the easiest way of returning the substring consisting of the characters of a string after the last occurrence of a given character in Python.
Example:
s = 'foo-bar-123-7-foo2'
I am interested in the characters after the last occurrence of '-'.
So the output would be 'foo2'
I could call str.find(sub, start, end) to find the position of the first '-', store its position, then repeat the search from there until there are no more occurrences, and finally return the characters after the last position. But is there a nicer way?
Simply with str.rfind function (returns the highest index in the string where substring is found):
s = 'foo-bar-123-7-foo2'
res = s[s.rfind('-') + 1:]
print(res) # foo2
s = 'foo-bar-123-7-foo2'
print(s.rsplit('-', 1)[1])
use rfind
so, s.rfind('-') will return you 13, which is the last occurrence of '-'.
and further doing
s[s.rfind('-')+1:] will return foo2
Here is a regex option:
import re
s = 'foo-bar-123-7-foo2'
output = re.sub(r'^.*-', '', s)
print(output)
This prints: foo2
The strategy here is to match all content up to and including the final dash character, then replace that with an empty string, leaving only the final portion.
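The answers above behave differently when the separator is missing, which may matter for real input. As a rough sketch (tail_after_last is a hypothetical name), the rfind slice still works because -1 + 1 is 0, and indexing the rsplit result with [-1] avoids the IndexError that [1] would raise on a one-element list.

```python
def tail_after_last(s, sep):
    """Return the part of s after the last sep, or s itself if sep is absent."""
    return s.rsplit(sep, 1)[-1]

print(tail_after_last('foo-bar-123-7-foo2', '-'))  # foo2
print(tail_after_last('nodash', '-'))              # nodash

# The rfind slice gives the same results: rfind returns -1 when '-' is absent,
# so the slice starts at index 0 and keeps the whole string.
s = 'nodash'
print(s[s.rfind('-') + 1:])                        # nodash
```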

Confusion with string split method in python

Consider the following example
a= 'Apple'
b = a.split(',')
print(b)
Output is ['Apple'].
I don't understand why it returns a list even when there is no ',' character in 'Apple'.
There may be cases where we call split expecting more than one element in the list, but because the separator is not present in the string, there is only one element. Wouldn't it be better if this mistake were caught by the split method itself?
The behaviour of a.split(',') when no commas are present in a is perfectly consistent with the way it behaves when there are a positive number of commas in a.
a.split(',') says to split string a into a list of substrings that are delimited by ',' in a; the delimiter is not preserved in the substrings.
If 1 comma is found you get 2 substrings in the list, if 2 commas are found you get 3 substrings in the list, and in general, if n commas are found you get n+1 substrings in the list. So if 0 commas are found you get 1 substring in the list.
If you want 0 substrings in the list, then you'll need to supply a string with -1 commas in it. Good luck with that. :)
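The n commas, n+1 substrings rule can be checked directly. This is a small sketch with made-up sample strings; note that even the empty string follows the rule and yields a one-element list.

```python
# Each string with n commas should split into n + 1 parts.
for text in ['Apple', 'Apple,Pear', 'Apple,Pear,Plum', '']:
    parts = text.split(',')
    assert len(parts) == text.count(',') + 1
    print(repr(text), '->', parts)
```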
The docstring of that method says:
Return a list of the words in the string S, using sep as the delimiter string.
The delimiter is used to separate multiple parts of the string; having only one part is not an error.
That's the way the split() method works. If you do not want that behaviour, you can implement your own my_split() function as follows (note that it returns the original string, not a list, when the delimiter is absent):
def my_split(s, d=' '):
    return s.split(d) if d in s else s
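If the goal is to have the missing separator reported as a mistake, as the question suggests, one possible alternative is a wrapper that raises an exception instead of changing the return type. This is only a sketch; strict_split is a hypothetical helper, not part of the standard library.

```python
def strict_split(s, sep):
    """Split s on sep, but raise ValueError if sep does not occur in s."""
    if sep not in s:
        raise ValueError('separator %r not found in %r' % (sep, s))
    return s.split(sep)

print(strict_split('Apple,Pear', ','))  # ['Apple', 'Pear']

try:
    strict_split('Apple', ',')
except ValueError as exc:
    print('caught:', exc)
```

Because it always returns a list when it returns at all, callers do not need to check whether they got a string or a list back.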

Dot notation string manipulation

Is there a way to manipulate a string in Python using the following ways?
For any string that is stored in dot notation, for example:
s = "classes.students.grades"
Is there a way to change the string to the following:
"classes.students"
Basically, remove the last period and everything after it. So "restaurants.spanish.food.salty" would become "restaurants.spanish.food".
Additionally, is there any way to identify what comes after the last period? The reason I want to do this is that I want to use isdigit().
So, if it was classes.students.grades.0, could I grab the 0 somehow, so that I could use an if statement with isdigit and say: if the part of the string after the last period (so 0 in this case) is a digit, remove it; otherwise, leave it.
You can use split and join together:
s = "classes.students.grades"
print('.'.join(s.split('.')[:-1]))
You are splitting the string on '.', which gives you a list of strings; after that you join the list elements back into a string, separating them with '.'.
[:-1] picks all the elements from the list except the last one.
To check what comes after the last .:
s.split('.')[-1]
Another way is to use rsplit. It works the same way as split, but if you provide the maxsplit parameter it splits the string starting from the end:
rest, last = s.rsplit('.', 1)
# rest is 'classes.students'
# last is 'grades'
You can also use re.sub to replace the last . and everything after it with an empty string:
re.sub(r'\.[^.]+$', '', s)
As for the last part of your question, wrapping the words in [], I would recommend using format with a generator expression:
''.join("[{}]".format(e) for e in s.split('.'))
It'll give you the desired output:
[classes][students][grades]
The best way to do this is using the rsplit method and pass in the maxsplit argument.
>>> s = "classes.students.grades"
>>> before, after = s.rsplit('.', maxsplit=1) # in Python 2.x, pass it positionally: s.rsplit('.', 1)
>>> before
'classes.students'
>>> after
'grades'
You can also use the rfind() method with normal slice operation.
To get everything before last .:
>>> s = "classes.students.grades"
>>> last_index = s.rfind('.')
>>> s[:last_index]
'classes.students'
Then everything after last .
>>> s[last_index + 1:]
'grades'
If '.' is in s, s.rpartition('.') finds the last dot in s
and returns the tuple (before_last_dot, dot, after_last_dot):
s = "classes.students.grades"
s.rpartition('.')[0]
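When s contains no dot, rpartition returns ('', '', s), so taking element [0] would give an empty string. A minimal sketch that guards against this by checking the separator element (strip_last_component is a hypothetical name):

```python
def strip_last_component(s):
    """Drop the last dot-separated component; return s unchanged if it has no dot."""
    before, dot, _after = s.rpartition('.')
    # dot is '' when no '.' was found, in which case before is '' as well.
    return before if dot else s

print(strip_last_component('classes.students.grades'))  # classes.students
print(strip_last_component('classes'))                  # classes
```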
If your goal is to get rid of a final component that's just a single digit, start and end with re.sub():
s = re.sub(r"\.\d$", "", s)
This will do the job, and leave other strings alone. No need to mess with anything else.
If you do want to know about the general case (separate out the last component, no matter what it is), then use rsplit to split your string once:
>>> "hel.lo.there".rsplit(".", 1)
['hel.lo', 'there']
If there's no dot in the string you'll just get one element in your list, the entire string.
You can do it very simply with rsplit (str.rsplit([sep[, maxsplit]])), which returns a list by splitting the string along the given separator.
You can also specify how many splits should be performed:
>>> s = "res.spa.f.sal.786423"
>>> s.rsplit('.',1)
['res.spa.f.sal', '786423']
So the final function that you describe is:
def dimimak_cool_function(s):
    if '.' not in s: return s
    start, end = s.rsplit('.', 1)
    return start if end.isdigit() else s
>>> dimimak_cool_function("res.spa.f.sal.786423")
'res.spa.f.sal'
>>> dimimak_cool_function("res.spa.f.sal")
'res.spa.f.sal'

Remove items in a sequence from a string Python

Okay so I'm trying to make a function that will take a string and a sequence of items (in the form of either a list, a tuple or a string) and remove all items from that list from the string.
So far my attempt looks like this:
def eliminate(s, bad_characters):
    for item in bad_characters:
        s = s.strip(item)
    return s
However, for some reason when I try this or variations of this, it only returns either the original string or a version with only the first item in bad_characters removed.
>>> eliminate("foobar",["o","b"])
'foobar'
Is there a way to remove all items in bad_characters from the given string?
The reason your solution doesn't work is that str.strip() only removes characters from the outsides of the string, i.e. characters on the leftmost or rightmost end of the string. So, in the case of 'foobar', str.strip() with a single character argument would only work if you wanted to remove the character 'f' or 'r'.
You could eliminate more of the inner characters with strip, but you would need to include one of the outer characters as well.
>>> 'foobar'.strip('of')
'bar'
>>> 'foobar'.strip('o')
'foobar'
Here's how to do it by string-joining a generator expression:
def eliminate(s, bad_characters):
    bc = set(bad_characters)
    return ''.join(c for c in s if c not in bc)
Try replacing the bad characters with empty strings.
def eliminate(s, bad_characters):
    for item in bad_characters:
        s = s.replace(item, '')
    return s
strip() doesn't work because it only removes characters from the beginning and end of the original string.
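The set-based version and the replace-based version agree when every item is a single character, but they may differ when an item is longer than one character: the generator compares individual characters against the set, while str.replace removes whole substrings. A small sketch of the difference, with the two functions renamed eliminate_chars and eliminate_substrings here purely to tell them apart:

```python
def eliminate_chars(s, bad_characters):
    # Character-by-character filter: only single-character items can ever match.
    bad = set(bad_characters)
    return ''.join(c for c in s if c not in bad)

def eliminate_substrings(s, bad_items):
    # Substring removal: each item is removed wherever it occurs.
    for item in bad_items:
        s = s.replace(item, '')
    return s

print(eliminate_chars('foobar', ['o', 'b']))       # far
print(eliminate_substrings('foobar', ['o', 'b']))  # far
print(eliminate_chars('foobar', ['oo']))           # foobar
print(eliminate_substrings('foobar', ['oo']))      # fbar
```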
strip is not the right choice for this task, as it only removes characters from the leading and trailing ends of the string. Instead you can use the str.translate method (Python 2 syntax shown):
>>> s,l="foobar",["o","b"]
>>> s.translate(None,''.join(l))
'far'
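In Python 3, str.translate takes a single translation table, so the two-argument call above raises a TypeError. A sketch of the equivalent under that assumption builds the table with str.maketrans, whose third argument lists the characters to delete:

```python
s = 'foobar'
bad_characters = ['o', 'b']

# The third argument to str.maketrans names the characters to delete.
table = str.maketrans('', '', ''.join(bad_characters))
result = s.translate(table)
print(result)  # far
```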
Try this; it uses recursion, so it may be slow:
def eliminate(s, seq):
    # Note: seq.pop() empties the list that the caller passed in.
    if seq:
        return eliminate(s.replace(seq.pop(), ""), seq)
    return s
>>> eliminate("foobar",["o","b"])
'far'

Removing empty strings from a list in python

I need to split a string. I am using this:
import re

def ParseStringFile(string):
    p = re.compile(r'\W+')
    result = p.split(string)
    return result
But I have an error: my result has two empty strings (''), one before 'Лев'. How do I get rid of them?
As nhahtdh pointed out, the empty string is expected since there's a \n at the start and end of the string, but if they bother you, you can filter them very quickly and efficiently.
>>> list(filter(None, ['', 'text', 'more text', '']))
['text', 'more text']
You could strip the leading and trailing newlines from the string before splitting it:
p.split(string.strip('\n'))
Alternatively, split the string and then remove the first and last element:
result = p.split(string)[1:-1]
The [1:-1] slice takes a copy of the result containing all indexes starting at 1 (i.e. dropping the first element) and ending at -2 (i.e. the second-to-last element); the second index is exclusive.
A longer and less elegant alternative would be to modify the list in-place:
result = p.split(string)
del result[-1] # remove last element
del result[0] # remove first element
Note that in these two solutions the first and last element must be the empty string. If sometimes the input doesn't contain these empty strings at the beginning or end, then they will misbehave. However they are also the fastest solutions.
If you want to remove all empty strings in the result, even if they happen inside the list of results you can use a list-comprehension:
[word for word in p.split(string) if word]
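To see where the empty strings come from, here is a small self-contained sketch. The original input is not shown in the question, so the sample text below is made up, assuming only that it starts and ends with a newline. It also shows re.findall with \w+ as one possible alternative that collects the words directly instead of splitting on the gaps between them.

```python
import re

# Hypothetical input: non-word characters at both ends, as described above.
text = '\nLev Tolstoy, War and Peace\n'
pattern = re.compile(r'\W+')

raw = pattern.split(text)
print(raw)    # ['', 'Lev', 'Tolstoy', 'War', 'and', 'Peace', '']

# Dropping the empty strings wherever they occur.
words = [word for word in raw if word]
print(words)  # ['Lev', 'Tolstoy', 'War', 'and', 'Peace']

# Matching the words themselves never produces empty strings.
print(re.findall(r'\w+', text))  # ['Lev', 'Tolstoy', 'War', 'and', 'Peace']
```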
