Python - an extremely odd behavior of function lstrip [duplicate] - python

This question already has answers here:
Python string.strip stripping too many characters [duplicate]
(3 answers)
Closed 6 years ago.
I have encountered a very odd behavior of built-in function lstrip.
I will explain with a few examples:
print 'BT_NAME_PREFIX=MUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=NUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=PUV'.lstrip('BT_NAME_PREFIX=') # UV
print 'BT_NAME_PREFIX=SUV'.lstrip('BT_NAME_PREFIX=') # SUV
print 'BT_NAME_PREFIX=mUV'.lstrip('BT_NAME_PREFIX=') # mUV
As you can see, the function trims one additional character sometimes.
I tried to model the problem, and noticed that it persisted if I:
Changed BT_NAME_PREFIX to BT_NAME_PREFIY
Changed BT_NAME_PREFIX to BT_NAME_PREFIZ
Changed BT_NAME_PREFIX to BT_NAME_PREF
Further attempts have made it even more weird:
print 'BT_NAME=MUV'.lstrip('BT_NAME=') # UV
print 'BT_NAME=NUV'.lstrip('BT_NAME=') # UV
print 'BT_NAME=PUV'.lstrip('BT_NAME=') # PUV - different than before!!!
print 'BT_NAME=SUV'.lstrip('BT_NAME=') # SUV
print 'BT_NAME=mUV'.lstrip('BT_NAME=') # mUV
Could someone please explain what on earth is going on here?
I know I might as well just use array-slicing, but I would still like to understand this.
Thanks

You're misunderstanding how lstrip works. It treats the characters you pass in as a bag and it strips characters that are in the bag until it finds a character that isn't in the bag.
Consider:
'abc'.lstrip('ba') # 'c'
It is not removing a substring from the start of the string. To do that, you need something like:
if s.startswith(prefix):
    s = s[len(prefix):]
e.g.:
>>> s = 'foobar'
>>> prefix = 'foo'
>>> if s.startswith(prefix):
...     s = s[len(prefix):]
...
>>> s
'bar'
Or, I suppose you could use a regular expression:
>>> s = 'foobar'
>>> import re
>>> re.sub('^foo', '', s)
'bar'
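On Python 3.9 and later, str.removeprefix does the startswith-and-slice step in one call. The helper below is a minimal sketch; the name remove_prefix is made up for illustration, and the fallback branch is only there for older interpreters.

```python
def remove_prefix(s, prefix):
    """Drop prefix from the start of s once, if present; otherwise return s unchanged."""
    if hasattr(s, "removeprefix"):       # str.removeprefix exists on Python 3.9+
        return s.removeprefix(prefix)
    if prefix and s.startswith(prefix):  # fallback: the startswith/slice idiom
        return s[len(prefix):]
    return s


print(remove_prefix("BT_NAME_PREFIX=MUV", "BT_NAME_PREFIX="))  # MUV
print(remove_prefix("www.wired.com", "www."))                  # wired.com
print(remove_prefix("foobar", "baz"))                          # foobar
```

Unlike lstrip, this removes the prefix at most once and only as a whole substring.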

The argument given to lstrip is a set of characters to remove from the left of a string, one character at a time. The phrase as a whole is not considered, only the individual characters.
S.lstrip([chars]) -> string or unicode
Return a copy of the string S with leading whitespace removed. If
chars is given and not None, remove characters in chars instead. If
chars is unicode, S will be converted to unicode before stripping
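To make the character-by-character behavior concrete, here is a rough Python 3 re-implementation of what lstrip does with its argument. It is only an illustration (lstrip_sketch is a hypothetical name, and this is not CPython's actual code), but it reproduces the results from the question.

```python
def lstrip_sketch(s, chars):
    """Illustrative equivalent of s.lstrip(chars) for a non-None chars argument."""
    bag = set(chars)      # order and repetition in chars do not matter
    i = 0
    while i < len(s) and s[i] in bag:
        i += 1            # skip every leading character that is in the bag
    return s[i:]          # stop at the first character outside the bag


for word in ("MUV", "PUV", "SUV", "mUV"):
    s = "BT_NAME_PREFIX=" + word
    # 'M' and 'P' occur in the argument, so they are stripped as well;
    # 'S' and 'm' do not, so stripping stops there.
    print(word, "->", lstrip_sketch(s, "BT_NAME_PREFIX="))
```

With the shorter argument 'BT_NAME=' the letter P is no longer in the bag, which is why 'PUV' survives in the second set of examples.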
You could solve this in a flexible way using regular expressions (the re module):
>>> import re
>>> re.sub('^BT_NAME_PREFIX=', '', 'BT_NAME_PREFIX=MUV')
'MUV'
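One caveat with the regex approach: if the prefix contains regex metacharacters (a dot, for instance), it should be escaped first. The sketch below assumes the prefix is meant literally; strip_literal_prefix is a made-up helper name.

```python
import re


def strip_literal_prefix(s, prefix):
    # re.escape makes metacharacters such as '.' match literally;
    # \A anchors the pattern at the very start of the string.
    return re.sub(r"\A" + re.escape(prefix), "", s)


print(strip_literal_prefix("BT_NAME_PREFIX=MUV", "BT_NAME_PREFIX="))  # MUV
print(strip_literal_prefix("www.wired.com", "www."))                  # wired.com

# Without escaping, '.' matches any character, so too much can be removed:
print(re.sub("^www.", "", "wwwXwired.com"))          # wired.com
print(strip_literal_prefix("wwwXwired.com", "www."))  # wwwXwired.com
```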

Related

How to remove everything before certain character in Python

I'm new to python and struggle with a certain task:
I have a String that could have anything in it, but it always "ends" the same.
It can be just a Filename, a complete path, or just a random string, ending with a Version Number.
Example:
C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1
string-anotherstring-15.1R7-S8.1
string-anotherstring.andanother-15.1R7-S8.1
What always is the same (looking from the end) is that if you reach the second dot and go 2 characters in front of it, you always match the part that I'm interested in.
Cutting everything after a certain string was "easy," and I solved it myself - that's why the string ends with the version now :)
Is there a way to tell python, "look for the second dot from behind the string and go 2 in front of it and delete everything in front of that" so that I get the Version as a string?
Happy for any pointers in the right direction.
Thanks
If you want the version number, can you use the hyphen (-) to split the string? Or do you need to depend on the dots only?
Please see below use of rsplit and join which can help you.
>>> a = 'string-anotherstring.andanother-15.1R7-S8.1'
>>> a.rsplit('-')
['string', 'anotherstring.andanother', '15.1R7', 'S8.1']
>>> a.rsplit('-')[-2:] #Get everything from second last to the end
['15.1R7', 'S8.1']
>>> '-'.join(a.rsplit('-')[-2:]) #Get everything from second last to the end, and join them with a hyphen
'15.1R7-S8.1'
>>>
For using dots, use the same way
>>> a
'string-anotherstring.andanother-15.1R7-S8.1'
>>> data = a.rsplit('.')
>>> [data[-3][-2:]]
['15']
>>> [data[-3][-2:]] + data[-2:]
['15', '1R7-S8', '1']
>>> '.'.join([data[-3][-2:]] + data[-2:])
'15.1R7-S8.1'
>>>
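The hyphen-based variant can be tightened slightly by passing a maxsplit to rsplit, so that only the last two hyphens are used as split points. This is a sketch that assumes the version part always contains exactly one hyphen, as in the examples; version_by_hyphen is a hypothetical name.

```python
def version_by_hyphen(text):
    # rsplit with maxsplit=2 splits at most twice, starting from the right,
    # so hyphens earlier in the string are left alone.
    parts = text.rsplit("-", 2)
    return "-".join(parts[-2:])


samples = [
    r"C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1",
    "string-anotherstring-15.1R7-S8.1",
    "string-anotherstring.andanother-15.1R7-S8.1",
]
for sample in samples:
    print(version_by_hyphen(sample))  # 15.1R7-S8.1 each time
```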
You can build a regex from the end mark of a line using the anchor $.
Using your own description, use the regex:
(\d\d\.[^.]*)\.[^.]*$
If you want the last characters from the end included, just move the capturing parenthesis:
(\d\d\.[^.]*\.[^.]*)$
Explanation:
(\d\d\.[^.]*\.[^.]*)$
 ^ ^                     # two digits
     ^                   # a literal '.'
       ^                 # anything OTHER THAN a '.'
            ^            # a literal '.'
              ^          # anything OTHER THAN a '.'
                    ^    # end of line
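Applied from Python, the second pattern can be used with re.search. The following is a minimal sketch; extract_version and VERSION_RE are names chosen here for illustration, and the pattern assumes the version always starts with two digits, as in the question's examples.

```python
import re

# Two digits, a dot, a dot-free run, another dot, a final dot-free run, end of string.
VERSION_RE = re.compile(r"(\d\d\.[^.]*\.[^.]*)$")


def extract_version(text):
    match = VERSION_RE.search(text)
    return match.group(1) if match else None  # None when nothing matches


samples = [
    r"C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1",
    "string-anotherstring-15.1R7-S8.1",
    "string-anotherstring.andanother-15.1R7-S8.1",
]
for sample in samples:
    print(extract_version(sample))  # 15.1R7-S8.1 each time
```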
Assuming I understand this correctly, there are two ways to do this that come to mind:
Including both, since I might not understand this correctly, and for completeness reasons. I think the split/parts solution is cleaner, particularly when the 'certain character' is a dot.
>>> import re
>>> msg = r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1'
>>> re.search(r'.*(..\..*)', msg).group(1)
'S8.1'
>>> parts = msg.split('.')
>>> ".".join((parts[-2][-2:], parts[-1]))
'S8.1'
For your example, you can split the string on the separator '-' and then join the last two elements with '-' again. Like so:
txt = "string-anotherstring-15.1R7-S8.1"
x = txt.split("-")
y = "-".join(x[-2:])
print(y) # outputs 15.1R7-S8.1

lstrip is removing a character I wouldn't expect it to

The following code:
s = "www.wired.com"
print s
s = s.lstrip('www.')
print s
outputs:
www.wired.com
ired.com
Note the missing w on the second line. I'm not sure I understand the behavior. I would expect:
www.wired.com
wired.com
EDIT:
Following the first two answers, I now understand the behavior. My question is now: how do I strip the leading www. without touching the rest?
The argument to string.lstrip is a list of characters:
>>> help(string.lstrip)
Help on function lstrip in module string:
lstrip(s, chars=None)
lstrip(s [,chars]) -> string
Return a copy of the string s with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
>>>
It removes ALL occurrences of those leading characters.
print s.lstrip('w.') # does the same!
[EDIT]:
If you wanted to strip the initial www., but only if the string starts with it, you could use a regular expression or something like:
s = s[4:] if s.startswith('www.') else s
According to the documentation:
The chars argument is a string specifying the set of characters to be removed... The chars argument is not a prefix; rather, all combinations of its values are stripped
You would achieve the same result by just saying:
'www.wired.com'.lstrip('w.')
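A quick way to convince yourself that only the set of characters matters is to try several arguments that contain the same characters. This is a small Python 3 check, not part of the original answer.

```python
s = "www.wired.com"

# All of these contain exactly the characters 'w' and '.',
# in different orders and with different repetitions.
variants = ["www.", "w.", ".w", "w.w.w"]
results = {chars: s.lstrip(chars) for chars in variants}

for chars, stripped in results.items():
    print(repr(chars), "->", stripped)  # ired.com every time
```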
If you wanted something more general, I would do something like this:
i = s.find('www.')  # index of the first 'www.', wherever it occurs
if i >= 0:
    s = s[:i] + s[i+4:]
To remove the leading www.
>>> import re
>>> s = "www.wired.com"
>>> re.sub(r'^www\.', '', s)
'wired.com'

Strip all matching characters from string

Given any of the following strings:
'test'
'test='
'test=='
'test==='
I'd like to run a function on it that will remove any/all '=' characters from the end. Now, I could write something like this in two seconds (in fact, here is one), and I can imagine a dozen alternative approaches:
def cleanup():
    p = passwd()
    while True:
        new_p = p.rstrip('=')
        if len(new_p) == len(p):
            return new_p
        p = new_p
But I was wondering if anything like that already exists as part of the Python Standard Library?
str.rstrip() already removes all matching characters:
>>> 'test===='.rstrip('=')
'test'
There is no need to loop.
All you need is str.rstrip:
>>> 'test'.rstrip('=')
'test'
>>> 'test='.rstrip('=')
'test'
>>> 'test=='.rstrip('=')
'test'
>>> 'test==='.rstrip('=')
'test'
>>>
From the docs:
str.rstrip([chars])
Return a copy of the string with trailing characters removed.
It should be noted however that str.rstrip only removes characters from the right end of the string. You need to use str.lstrip to remove characters from the left end and str.strip to remove characters from both ends.
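The difference between the three methods is easiest to see on a string that has the character on both ends. A short Python 3 illustration:

```python
s = "==test=="

right = s.rstrip("=")  # trailing '=' removed
left = s.lstrip("=")   # leading '=' removed
both = s.strip("=")    # '=' removed from both ends

print(right)  # ==test
print(left)   # test==
print(both)   # test
```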

String may only contain A, U, G or C [duplicate]

This question already has answers here:
Test if string ONLY contains given characters [duplicate]
(7 answers)
Closed 8 years ago.
Forgive the simplistic question, but I've read through the SO questions and the Python documentation and still haven't been able to figure this out.
How can I create a Python regex to test whether a string contains ANY but ONLY the A, U, G and C characters? The string can contain either one or all of those characters, but if it contains any other characters, I'd like the regex to fail.
I tried:
>>> re.match(r"[AUGC]", "AUGGAC")
<_sre.SRE_Match object at 0x104ca1850>
But adding an X on to the end of the string still works, which is not what I expected:
>>> re.match(r"[AUGC]", "AUGGACX")
<_sre.SRE_Match object at 0x104ca1850>
Thanks in advance.
You need the regex to consume the whole string (or fail, if it can't). re.match implicitly anchors the pattern at the start of the string, so you only need to add an anchor at the end:
re.match(r"[AUGC]+$", string_to_check)
Also note the +, which repeatedly matches your character set (since, again, the point is to consume the whole string)
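One subtlety: $ also matches just before a trailing newline, so a string such as "AUGGAC\n" still passes the check above. On Python 3.4 and later, re.fullmatch requires the whole string to match and avoids this. A minimal sketch, with only_augc as a hypothetical function name:

```python
import re


def only_augc(seq):
    # fullmatch succeeds only if the entire string matches the pattern,
    # so no explicit ^ or $ anchors are needed.
    return re.fullmatch(r"[AUGC]+", seq) is not None


print(only_augc("AUGGAC"))    # True
print(only_augc("AUGGACX"))   # False
print(only_augc("AUGGAC\n"))  # False
print(re.match(r"[AUGC]+$", "AUGGAC\n") is not None)  # True: $ tolerates the newline
```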
If those are the only characters allowed in the string, you can do the following:
>>> r = re.compile(r'^[AUGC]+$')
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
>>>
Then, if you want your regex to match the empty string as well, you can do:
>>> r = re.compile(r'^[AUGC]*$')
>>> r.match("")
<_sre.SRE_Match object at 0x10ee16718>
>>> r.match("AUGGAC")
<_sre.SRE_Match object at 0x10ee166b0>
>>> r.match("AUGGACX")
Use ^[AUCG]*$; this will match against the entire string.
Or, if there has to be at least one letter, ^[AUCG]+$ — ^ and $ stand for beginning of string and end of string respectively; * and + stand for zero or more and one or more respectively.
This is purely about regular expressions and not specific to Python really.
You are actually really close. What you have only tests whether the first character is A, U, G or C.
What you want is to match a string made up of one or more letters that are all A, U, G or C. You can accomplish this by adding the plus quantifier to your regular expression.
re.match(r"^[AUGC]+$", "AUGGAC")
Additionally, adding $ at the end marks the end of the string. You can optionally use ^ at the front to match the beginning of the string, although re.match already starts matching there.
Just check to see if there is anything other than "AUGC" in there:
if re.search('[^AUGC]', string_to_check):
    pass  # fail
You can add a check to make sure the string is not empty in the same statement:
if not string_to_check or re.search('[^AUGC]', string_to_check):
    pass  # fail
No real need to use a regex:
>>> good = 'AUGGCUA'
>>> bad = 'AUGHACUA'
>>> all([c in 'AUGC' for c in good])
True
>>> all([c in 'AUGC' for c in bad])
False
I know you're asking about regular expressions, but I thought it was worth mentioning set. To establish whether your string only contains A, U, G or C, you could do this:
>>> input = "AUCGCUAGCGAU"
>>> s = set("AUGC")
>>> set(input) <= s
True
>>> bad = "ASNMSA"
>>> set(bad) <= s
False
edit: thanks to @roippi for spotting my mistake: <= should be used, not ==.
Instead of using <=, the method issubset can be used:
>>> set("AUGAUG").issubset(s)
True
If all characters in the string input are in the set s, then issubset will return True.
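Note that the empty string is a subset of anything, so a bare subset test accepts "". If empty input should be rejected, the check can be wrapped as below. This is a sketch in Python 3; ALLOWED and only_allowed are names invented for the example.

```python
ALLOWED = frozenset("AUGC")


def only_allowed(seq, allowed=ALLOWED):
    # bool(seq) rejects the empty string; issuperset accepts any iterable,
    # so the string does not have to be converted to a set first.
    return bool(seq) and allowed.issuperset(seq)


print(only_allowed("AUCGCUAGCGAU"))  # True
print(only_allowed("ASNMSA"))        # False
print(only_allowed(""))              # False
```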
From: https://docs.python.org/2/library/re.html
Characters that are not within a range can be matched by complementing the set.
If the first character of the set is '^', all the characters that are not in the set will be matched.
For example, [^5] will match any character except '5', and [^^] will match any character except '^'.
^ has no special meaning if it’s not the first character in the set.
So you could do [^AUGC] and if it matches that then reject it, else keep it.

Remove last 3 characters of a string

I'm trying to remove the last 3 characters from a string in Python. I don't know what these characters are, so I can't use rstrip. I also need to remove any white space and convert to upper-case.
An example would be:
foo = "Bs12 3ab"
foo.replace(" ", "").rstrip(foo[-3:]).upper()
This works and gives me "BS12", which is what I want. However, if the fourth- and third-from-last characters are the same, I lose both; e.g. if foo = "BS11 1AA" I just get "BS".
Examples of foo could be:
BS1 1AB
bs11ab
BS111ab
The string could be 6 or 7 characters and I need to drop the last 3 (assuming no white space).
Removing any and all whitespace:
foo = ''.join(foo.split())
Removing last three characters:
foo = foo[:-3]
Converting to capital letters:
foo = foo.upper()
All of that code in one line:
foo = ''.join(foo.split())[:-3].upper()
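Wrapped in a function and run against the inputs from the question, the one-liner looks like this. It is a Python 3 sketch; drop_last_three is a made-up name.

```python
def drop_last_three(foo):
    # split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines), so joining the pieces removes all of it.
    compact = "".join(foo.split())
    return compact[:-3].upper()


for raw in ("Bs12 3ab", "BS11 1AA", "BS1 1AB", "bs11ab", "BS111ab"):
    print(repr(raw), "->", drop_last_three(raw))
```

Because the slice works on positions rather than on character values, repeated characters such as the "11" in "BS11 1AA" are not affected.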
It doesn't work as you expect because strip is character based. You need to do this instead:
foo = foo.replace(' ', '')[:-3].upper()
>>> foo = "Bs12 3ab"
>>> foo[:-3]
'Bs12 '
>>> foo[:-3].strip()
'Bs12'
>>> foo[:-3].strip().replace(" ","")
'Bs12'
>>> foo[:-3].strip().replace(" ","").upper()
'BS12'
You might have misunderstood rstrip slightly: it strips not a string but any character in the string you specify.
Like this:
>>> text = "xxxxcbaabc"
>>> text.rstrip("abc")
'xxxx'
So instead, just use
text = text[:-3]
(after replacing whitespace with nothing)
>>> foo = 'BS1 1AB'
>>> foo.replace(" ", "").rstrip()[:-3].upper()
'BS1'
I try to avoid regular expressions, but this appears to work:
string = re.sub(r"\s", "", string)[:-3].upper()
split
slice
concatenate
This is a good exercise for beginners and it's easy to do.
A more advanced approach is to wrap the slice in a function (pseudocode, where [slice] stands for whatever slice you need):
def trim(s):
    return s[slice]
For this question you just want to remove the last three characters, so you can write:
def trim(s):
    return s[:-3]
I think you are worrying too much about what those three characters are, and that is what tripped you up. You just want to remove the last three, whatever they are.
If you want to remove characters only in specific cases, you can add an if test:
def trim(s):
    if [conditions]:  # for some cases, I recommend using isinstance()
        return s[slice]
    return s
What's wrong with this?
foo.replace(" ", "")[:-3].upper()
Aren't you performing the operations in the wrong order? Your requirement seems to be foo[:-3].replace(" ", "").upper()
It somewhat depends on your definition of whitespace. I would generally take whitespace to mean spaces, tabs, line breaks and carriage returns. If that is your definition, use a regex with \s to remove all whitespace characters:
import re
def myCleaner(foo):
    print 'dirty: ', foo
    foo = re.sub(r'\s', '', foo)
    foo = foo[:-3]
    foo = foo.upper()
    print 'clean:', foo
    print
myCleaner("BS1 1AB")
myCleaner("bs11ab")
myCleaner("BS111ab")
