Python: strip() method not removing whitespace from text

I have a problem where what looks like whitespace preceding a string isn't removed using the strip method. This is the script:
text = '"X-DSPAM-Confidence: 0.8475";'
startpos = text.find(":")
endpos = text.find('\";', startpos)
extracted_text = text[startpos+1:endpos]
extracted_text.strip()
print("Substring:",extracted_text)
This returns:
Substring:  0.8475
Assuming that strip() was used correctly, any advice on debugging to identify what is actually printed to screen that appears to be whitespace but isn't?

You have to re-assign the variable:
extracted_text=extracted_text.strip()
Alternatively:
print("Substring:",extracted_text.strip())

str.strip does not operate in place; it returns the stripped string, because Python strings are immutable.
To isolate the number without the trailing characters, you can combine str.strip and str.split, take the second value, and remove the trailing characters using str.replace:
>>> text.strip().split()[1].replace('";', '')
'0.8475'
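A minimal sketch of both points, assuming the same text as in the question: repr() makes otherwise invisible leading or trailing characters visible, and the result of strip() has to be captured in a variable. The name stripped is only illustrative.

```python
text = '"X-DSPAM-Confidence: 0.8475";'
startpos = text.find(":")
endpos = text.find('";', startpos)
extracted_text = text[startpos + 1:endpos]

# repr() shows the surrounding quotes, so the leading space becomes visible.
print(repr(extracted_text))   # ' 0.8475'

# strip() returns a new string; the original is left unchanged.
stripped = extracted_text.strip()
print(repr(stripped))         # '0.8475'
print(repr(extracted_text))   # ' 0.8475'
```

Printing with repr() is also a quick way to spot characters such as tabs or non-breaking spaces that look like ordinary spaces on screen.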

Related

How is lstrip() method removing chars from left? [duplicate]

My understanding is that the lstrip(arg) removes characters from the left based on the value of arg.
I am executing the following code:
'htp://www.abc.com'.lstrip('/')
Output:
'htp://www.abc.com'
My understanding is that all the characters should be stripped from left until / is reached.
In other words, the output should be:
'www.abc.com'
I am also not sure why running the following code is generating below output:
'htp://www.abc.com'.lstrip('/:pth')
Output:
'www.abc.com'
Calling the help function shows the following:
Help on built-in function lstrip:
lstrip(chars=None, /) method of builtins.str instance
Return a copy of the string with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
This means that leading whitespace is removed by default. If the chars argument is given, leading characters are removed for as long as each one is among the specified characters. For example, if you pass 'abc', the string must start with 'a', 'b' or 'c'; otherwise the function changes nothing.
The string need not begin with 'abc' as a whole.
>>> print(' the left strip'.lstrip()) # strips off the whitespace
the left strip
>>> print('ththe left strip'.lstrip('th')) # strips off the given characters as the string starts with those
e left strip
>>> print('ththe left strip'.lstrip('left')) # removes 't' as 'left' contains 't'
hthe left strip
>>> print('ththe left strip'.lstrip('zeb')) # doesn't change anything as none of the given characters matches the beginning of the string
ththe left strip
>>> print('ththe left strip'.lstrip('h')) # doesn't change anything as the string begins with 't', not 'h'
ththe left strip
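The behaviour shown above can be sketched as a small loop. This is a hypothetical re-implementation for illustration, not how CPython implements lstrip; the function name lstrip_chars is an assumption.

```python
def lstrip_chars(s, chars):
    """Hypothetical re-implementation of s.lstrip(chars), for illustration only."""
    allowed = set(chars)  # order and repetition within chars do not matter
    i = 0
    # Skip leading characters for as long as each one is in the allowed set.
    while i < len(s) and s[i] in allowed:
        i += 1
    return s[i:]

print(lstrip_chars('ththe left strip', 'left'))    # hthe left strip
print(lstrip_chars('htp://www.abc.com', '/'))      # htp://www.abc.com
print(lstrip_chars('htp://www.abc.com', '/:pth'))  # www.abc.com
```

The loop stops at the first character that is not in the set, which is why passing '/' alone leaves the URL untouched: the string starts with 'h'.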
If you want everything to the right of a given substring, try split:
url = 'htp://www.abc.com'
print(url.split('//')[1])
Output:
www.abc.com
lstrip only returns a copy of the string with leading characters stripped; it does not remove characters from the middle of the string.
I think you want this:
a = 'htp://www.abc.com'
a = a[a.find('//')+2:]
From the Python docs:
str.lstrip([chars])
Return a copy of the string with leading characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix; rather, all combinations of its values are stripped.
Read the last sentence of that description and your doubt will be resolved.
According to the Python documentation, str.lstrip only removes leading characters that appear in its argument, or leading whitespace if no argument is provided.
You can try using str.rfind like this:
>>> url = "https://www.google.com"
>>> url[url.rfind('/')+1:]
'www.google.com'
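Another option, sketched here with the URL from the question, is str.partition, which splits at the first occurrence of a separator and does not raise an error when the separator is missing. The variable names are illustrative.

```python
url = 'htp://www.abc.com'

# partition returns a 3-tuple: (text before, separator, text after).
scheme, sep, rest = url.partition('//')
print(rest)  # www.abc.com

# If the separator is absent, the whole string ends up in the first element.
print('www.abc.com'.partition('//'))  # ('www.abc.com', '', '')
```

Checking whether sep is empty tells you if the separator was found, which split('//')[1] cannot do without risking an IndexError.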

Python - Remove word only from within a sentence

I am trying to remove a specific word from within a sentence, which is 'you'. The code is as listed below:
out1.text_condition = out1.text_condition.replace('you','')
This works, however, it also removes it from within a word that contains it, so when 'your' appears, it removes the 'you' from within it, leaving 'r' standing. Can anyone help me figure out what I can do to just remove the word, not the letters from within another string?
Thanks!
In order to replace whole words and not substrings, you should use a regular expression (regex).
Here is how to replace a whole word with the module re:
import re
def replace_whole_word_from_string(word, string, replacement=""):
    # \b on each side restricts the match to the whole word
    regular_expression = rf"\b{word}\b"
    return re.sub(regular_expression, replacement, string)
string = "you you ,you your"
result = replace_whole_word_from_string("you", string)
print(result)
Output:
, your
Explanation:
The two \b are what we call "word boundaries". The advantage over str.replace is that it will take into account the punctuation too.
In order to create the regular expression, here we use Literal String Interpolation (also called "f-strings", https://www.python.org/dev/peps/pep-0498/).
To create an "f-string", we add the prefix f.
We also use the prefix r, in order to create a "raw string". We use a raw string in order to avoid escaping the backslash in \b.
Without the prefix r, we would have written regular_expression = f"\\b{word}\\b".
If you had used string.replace(' you ', ' '), you would have received this (wrong) output:
you ,you your
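A slightly more defensive variant, as a sketch: re.escape protects against regex metacharacters in the word, and a second substitution collapses the spaces left behind. The function name remove_whole_word and the sample sentence are assumptions made for illustration.

```python
import re

def remove_whole_word(word, text):
    # re.escape guards against regex metacharacters in word.
    pattern = rf"\b{re.escape(word)}\b"
    without_word = re.sub(pattern, "", text)
    # Collapse the runs of whitespace left behind and trim both ends.
    return re.sub(r"\s+", " ", without_word).strip()

print(remove_whole_word("you", "Are you sure this is your book?"))
# Are sure this is your book?
```

The match is case-sensitive as written; passing flags=re.IGNORECASE to the first re.sub call would also remove "You".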
A very simple solution is to replace the word with spaces around it with one space:
out1.text_condition = out1.text_condition.replace(' you ', ' ')
But note that it wouldn't remove, for example, "you." (at the end of a sentence) or "you,".
Easiest way is probably just to assume there are spaces before and after the word, and to replace the padded word with a single space so that the neighbouring words stay separated:
out1.text_condition = out1.text_condition.replace(' you ',' ')

Why does sentence.strip() remove certain characters but not others from the end of this string?

Trying to figure out how strip() works when reading characters in a string.
This:
sentence = "All the single ladies"
sentence = sentence.strip("All the si")
print(sentence)
returns this:
ngle lad
I get why 'All the si' is removed from the start of the string. But how does Python decide to remove the 'ies' from the end of the string? If the 'e' is being removed from the 'ies', why isn't it being removed from 'the' too? What are the rules for string stripping behavior?
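.strip() accepts a set of characters to remove, not a substring. The characters i, e and s are all present in the string you passed ("All the si"), so they are stripped from the end. The character d (now the last character of the result) is not, so stripping stops there. Stripping only works inward from each end and stops at the first character that is not in the set, which is why the 'e' in 'the' would never be reached once an unlisted character is in the way.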
See more in the docs.
To remove the substring you would use:
sentence.replace("All the si", "")
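Note that replace removes the substring wherever it occurs, not only at the start. A minimal sketch of removing it only as an exact prefix follows; remove_prefix is a hypothetical helper (Python 3.9 and later offer str.removeprefix for the same purpose).

```python
sentence = "All the single ladies"

# strip() treats its argument as a set of characters.
print(sentence.strip("All the si"))  # ngle lad

def remove_prefix(s, prefix):
    # Hypothetical helper: drop prefix only if s starts with it exactly.
    return s[len(prefix):] if s.startswith(prefix) else s

print(remove_prefix(sentence, "All the si"))  # ngle ladies
```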

How to use text strip() function?

I can strip numerics but not alpha characters:
>>> text
'132abcd13232111'
>>> text.strip('123')
'abcd'
Why is the following not working?
>>> text.strip('abcd')
'132abcd13232111'
The reason is simple and stated in the documentation of strip:
str.strip([chars])
Return a copy of the string with the leading and trailing characters removed.
The chars argument is a string specifying the set of characters to be removed.
None of the characters a, b, c or d appears at the start or the end of the string '132abcd13232111', so nothing is stripped.
Just to add a few examples to Jim's answer, according to .strip() docs:
Return a copy of the string with the leading and trailing characters removed.
The chars argument is a string specifying the set of characters to be removed.
If omitted or None, the chars argument defaults to removing whitespace.
The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
So it doesn't matter whether the characters are digits or not. The main reason your second snippet didn't work as you expected is that "abcd" is located in the middle of the string.
Example1:
s = '132abcd13232111'
print(s.strip('123'))
print(s.strip('abcd'))
Output:
abcd
132abcd13232111
Example2:
t = 'abcd12312313abcd'
print(t.strip('123'))
print(t.strip('abcd'))
Output:
abcd12312313abcd
12312313
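If the goal is to remove characters from the middle of the string as well, strip is the wrong tool. A sketch of two alternatives, using the string from the question (the variable name table is illustrative):

```python
text = '132abcd13232111'

# Delete every occurrence of the characters a, b, c and d, wherever they are.
table = str.maketrans('', '', 'abcd')
print(text.translate(table))     # 13213232111

# Remove the exact substring 'abcd' instead.
print(text.replace('abcd', ''))  # 13213232111
```

The two calls give the same result here only because the letters appear together as one run; translate works on individual characters, replace on the whole substring.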

Remove non-letter characters from beginning and end of a string

I need to remove all non-letter characters from the beginning and from the end of a word, but keep them if they appear between two letters.
For example:
'123foo456' --> 'foo'
'2foo1c#BAR' --> 'foo1c#BAR'
I tried using re.sub(), but I couldn't write the regex.
like this?
re.sub('^[^a-zA-Z]*|[^a-zA-Z]*$','',s)
s is the input string.
You could use str.strip for this:
In [1]: import string
In [4]: '123foo456'.strip(string.digits)
Out[4]: 'foo'
In [5]: '2foo1c#BAR'.strip(string.digits)
Out[5]: 'foo1c#BAR'
As Matt points out in the comments (thanks, Matt), this removes digits only. To remove any non-letter character,
define what you mean by a non-letter (this uses Python 2's string.maketrans and string.letters, which are not available in Python 3):
In [22]: allchars = string.maketrans('', '')
In [23]: nonletter = allchars.translate(allchars, string.letters)
and then strip:
In [18]: '2foo1c#BAR'.strip(nonletter)
Out[18]: 'foo1c#BAR'
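In Python 3 a rough equivalent can be sketched with the constants that the string module still provides. It covers only ASCII digits, punctuation and whitespace, which is an assumption about what "non-letter" means here.

```python
import string

# ASCII characters that are not letters: digits, punctuation and whitespace.
nonletter = string.digits + string.punctuation + string.whitespace

print('2foo1c#BAR'.strip(nonletter))  # foo1c#BAR
print('123foo456'.strip(nonletter))   # foo
```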
With your two examples, I was able to create a regex using Python's non-greedy syntax as described here. I broke up the input into three parts: non-letters, exclusively letters, then non-letters until the end. Here's a test run:
1:[123] 2:[foo] 3:[456]
1:[2] 2:[foo1c#BAR] 3:[]
Here's the regular expression:
^([^A-Za-z]*)(.*?)([^A-Za-z]*)$
mo.group(2) is what you want, where mo is the match object.
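To be Unicode compatible:
^\PL+|\PL+$
\PL stands for "not a letter". Note that Python's built-in re module does not support \p or \P escapes, so this pattern needs a regex engine with Unicode property support (for example the third-party regex package, where the class is written \P{L}).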
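A Unicode-aware alternative that needs no regex at all can be sketched with str.isalpha, which is true for letters in any script. The function name strip_nonletters is hypothetical.

```python
def strip_nonletters(s):
    """Hypothetical helper: trim non-letter characters from both ends of s."""
    start, end = 0, len(s)
    # Move inward from the left until a letter is found.
    while start < end and not s[start].isalpha():
        start += 1
    # Move inward from the right until a letter is found.
    while end > start and not s[end - 1].isalpha():
        end -= 1
    return s[start:end]

print(strip_nonletters('123foo456'))   # foo
print(strip_nonletters('2foo1c#BAR'))  # foo1c#BAR
print(strip_nonletters('42été!'))      # été
```

A string that contains no letters at all comes back empty.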
Try this:
re.sub(r'^[^a-zA-Z]*(.*?)[^a-zA-Z]*$', r'\1', string)
The round brackets capture everything between the non-letter runs at the beginning and end of the string. The ? makes sure that the . does not capture any non-letter characters at the end, too. The replacement then simply substitutes the captured group; it must be a raw string (r'\1'), because a plain '\1' is read as the control character \x01 rather than a group reference.
result = re.sub('(.*?)([a-z].*[a-z])(.*)', '\\2', '23WERT#3T67', flags=re.IGNORECASE)
