Python re.sub strip leading/trailing whitespace within quotes

I want to use re.sub to remove leading and trailing whitespace from single-quoted strings embedded in a larger string. If I have, say,
textin = " foo ' bar nox ': glop ,' frox ' "
I want to produce
desired = " foo 'bar nox': glop ,'frox' "
Removing the leading whitespace is relatively straightforward.
>>> lstripped = re.sub(r"'\s*([^']*')", r"'\1", textin)
>>> lstripped
" foo 'bar nox ': glop ,'frox ' "
The problem is removing the trailing whitespace. I tried, for example,
>>> rstripped = re.sub(r"('[^']*)(\s*')", r"\1'", lstripped)
>>> rstripped
" foo 'bar nox ': glop ,'frox ' "
but that fails because the [^']* matches the trailing whitespace.
I thought about using lookbehind patterns, but the re docs say they can only contain fixed-length patterns.
I'm sure this is a previously solved problem but I'm stumped.
Thanks!
EDIT: The solution needs to handle strings containing a single non-whitespace character and empty strings, i.e. ' p ' --> 'p' and ' ' --> ''.

[^\']* is greedy, so it also swallows the trailing spaces and/or tabs. Use the non-greedy form instead: [^\']*?
In [66]: re.sub(r'\'\s*([^\']*?)\s*\'','\'\\1\'', textin)
Out[66]: " foo 'bar nox': glop ,'frox' "
Less escaped version:
re.sub(r"'\s*([^']*?)\s*'", r"'\1'", textin)
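As a quick check, the single pattern above can be run against the edge cases named in the question's EDIT. This is only a sketch; the helper name strip_in_quotes and the compiled-pattern constant are made up for illustration.

```python
import re

# Hypothetical helper; the pattern itself is the one from the answer above.
QUOTED = re.compile(r"'\s*([^']*?)\s*'")

def strip_in_quotes(text):
    """Strip whitespace just inside each pair of single quotes."""
    return QUOTED.sub(r"'\1'", text)

textin = " foo ' bar nox ': glop ,' frox ' "
print(strip_in_quotes(textin))   # the desired string from the question
print(strip_in_quotes("' p '"))  # single non-whitespace character
print(strip_in_quotes("' '"))    # whitespace-only content
```

Note that the pattern pairs apostrophes from left to right, so a stray apostrophe (as in "don't") would shift the pairing for the rest of the string.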

The way to catch the trailing whitespace is to make the preceding * non-greedy: instead of r"('[^']*)(\s*')" use r"('[^']*?)(\s*')".
You can also catch both sides with a single regex:
stripped = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", textin)
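A different technique, not taken from the answers here, is to match each quoted chunk with a plain greedy pattern and let str.strip() do the trimming in a replacement function. The function name strip_quoted is hypothetical.

```python
import re

def strip_quoted(text):
    """Trim the content of every '...' pair using str.strip()."""
    # The lambda receives each match and rebuilds the quoted chunk.
    return re.sub(r"'([^']*)'", lambda m: "'" + m.group(1).strip() + "'", text)

textin = " foo ' bar nox ': glop ,' frox ' "
print(strip_quoted(textin))
```

This avoids reasoning about greedy versus lazy quantifiers at all, at the cost of one Python function call per match.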

This seems to work:
'(\s*)(.*?)(\s*)'
' # an apostrophe
(\s*) # 0 or more white-space characters (leading white-space)
(.*?) # 0 or more any character, lazily matched (keep)
(\s*) # 0 or more white-space characters (trailing white-space)
' # an apostrophe
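The annotated pattern above can be written almost verbatim with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. A minimal sketch, assuming the replacement keeps only the second group:

```python
import re

QUOTED = re.compile(
    r"""
    '        # an apostrophe
    (\s*)    # leading whitespace (dropped)
    (.*?)    # any characters, lazily matched (kept)
    (\s*)    # trailing whitespace (dropped)
    '        # an apostrophe
    """,
    re.VERBOSE,
)

textin = " foo ' bar nox ': glop ,' frox ' "
result = QUOTED.sub(r"'\2'", textin)  # keep only group 2
print(result)
```

Because . does not match a newline by default, quoted content spanning several lines would need re.DOTALL as well.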

Related

How do I ignore the spaces in a string inputted by the user? [duplicate]

I want to eliminate all the whitespace from a string, on both ends, and in between words.
I have this Python code:
def my_handle(self):
    sentence = ' hello apple '
    sentence.strip()
But that only eliminates the whitespace on both sides of the string. How do I remove all whitespace?

If you want to remove leading and ending spaces, use str.strip():
>>> " hello apple ".strip()
'hello apple'
If you want to remove all space characters, use str.replace() (NB this only removes the “normal” ASCII space character ' ' U+0020 but not any other whitespace):
>>> " hello apple ".replace(" ", "")
'helloapple'
If you want to remove duplicated spaces, use str.split() followed by str.join():
>>> " ".join(" hello apple ".split())
'hello apple'

To remove only spaces use str.replace:
sentence = sentence.replace(' ', '')
To remove all whitespace characters (space, tab, newline, and so on) you can use split then join:
sentence = ''.join(sentence.split())
or a regular expression:
import re
pattern = re.compile(r'\s+')
sentence = re.sub(pattern, '', sentence)
If you want to only remove whitespace from the beginning and end you can use strip:
sentence = sentence.strip()
You can also use lstrip to remove whitespace only from the beginning of the string, and rstrip to remove whitespace from the end of the string.
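The differences between these options show up once the string contains a tab or a newline. A small side-by-side sketch (the sample string and variable names are chosen for illustration):

```python
import re

sentence = " hello\tapple \n"

only_spaces_removed = sentence.replace(" ", "")   # tab and newline survive
all_removed_split = "".join(sentence.split())     # every whitespace run removed
all_removed_regex = re.sub(r"\s+", "", sentence)  # same result via re
ends_only = sentence.strip()                      # interior tab survives

print(repr(only_spaces_removed))
print(repr(all_removed_split))
print(repr(all_removed_regex))
print(repr(ends_only))
```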

An alternative is to use regular expressions and match these strange white-space characters too. Here are some examples:
Remove ALL spaces in a string, even between words:
import re
sentence = re.sub(r"\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the BEGINNING of a string:
import re
sentence = re.sub(r"^\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the END of a string:
import re
sentence = re.sub(r"\s+$", "", sentence, flags=re.UNICODE)
Remove spaces both in the BEGINNING and in the END of a string:
import re
sentence = re.sub(r"^\s+|\s+$", "", sentence, flags=re.UNICODE)
Remove ONLY DUPLICATE spaces:
import re
sentence = " ".join(re.split(r"\s+", sentence, flags=re.UNICODE))
(All examples work in both Python 2 and Python 3)
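In Python 3, str patterns are Unicode-aware by default, so flags=re.UNICODE is redundant there; re.ASCII restricts \s to the ASCII whitespace characters. The sketch below illustrates both, and also shows that the split/join variant keeps a single leading and trailing space when the input starts or ends with whitespace. The sample strings are made up.

```python
import re

# NBSP + "hello" + EM SPACE + space + "apple" + tab
text = "\u00a0hello\u2003 apple\t"

unicode_aware = re.sub(r"\s+", "", text)               # default for str in Python 3
ascii_only = re.sub(r"\s+", "", text, flags=re.ASCII)  # only [ \t\n\r\f\v] match

# re.split yields empty strings at the edges, so one space remains on each side.
collapsed = " ".join(re.split(r"\s+", " a  b "))

print(repr(unicode_aware))
print(repr(ascii_only))
print(repr(collapsed))
```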

"Whitespace" includes space, tabs, and CRLF. So an elegant and one-liner string function we can use is str.translate:
Python 3
' hello apple '.translate(str.maketrans('', '', ' \n\t\r'))
OR if you want to be thorough:
import string
' hello apple'.translate(str.maketrans('', '', string.whitespace))
Python 2
' hello apple'.translate(None, ' \n\t\r')
OR if you want to be thorough:
import string
' hello apple'.translate(None, string.whitespace)
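For Python 3, the translation table can be built once and reused. Note that string.whitespace covers only the ASCII whitespace characters, so a non-breaking space passes through untouched. A small sketch (the constant name is hypothetical):

```python
import string

# The third argument of str.maketrans lists the characters to delete.
DELETE_ASCII_WS = str.maketrans("", "", string.whitespace)

cleaned = " hello \n\tapple\r\n".translate(DELETE_ASCII_WS)
with_nbsp = "hello\u00a0apple".translate(DELETE_ASCII_WS)  # NBSP is not ASCII

print(repr(cleaned))
print(repr(with_nbsp))
```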

For removing whitespace from beginning and end, use strip.
>>> " foo bar ".strip()
'foo bar'

' hello \n\tapple'.translate({ord(c):None for c in ' \n\t\r'})
MaK already pointed out the "translate" method above. This variation, which passes a dict that maps code points to None, works with Python 3.

In addition, strip has some variations:
Remove spaces in the BEGINNING and END of a string:
sentence = sentence.strip()
Remove spaces in the BEGINNING of a string:
sentence = sentence.lstrip()
Remove spaces in the END of a string:
sentence = sentence.rstrip()
All three string methods, strip, lstrip, and rstrip, accept an optional argument giving the set of characters to strip; the default is all whitespace. This can be helpful when you are working with something particular. For example, you could remove only spaces but not newlines:
" 1. Step 1\n".strip(" ")
Or you could remove extra commas when reading in a string list:
"1,2,3,".strip(",")

Be careful:
strip does a rstrip and lstrip (removes leading and trailing spaces, tabs, returns and form feeds, but it does not remove them in the middle of the string).
If you only replace spaces and tabs you can end up with hidden CRLFs that appear to match what you are looking for, but are not the same.
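A short sketch of the strip behaviours discussed above: only the ends are trimmed, the optional argument is treated as a set of characters rather than a prefix or suffix, and replacing only spaces and tabs can leave a hidden CRLF behind. All sample strings are illustrative.

```python
ends_trimmed = "  a  b\t\n".strip()                 # interior spaces are kept
spaces_only = " 1. Step 1\n".strip(" ")             # newline is kept
commas = "1,2,3,".strip(",")                        # trailing comma removed
char_set = "www.example.com".strip("cmowz.")        # argument is a character set

# Replacing only spaces and tabs leaves the CRLF in place.
hidden = "value \r\n".replace(" ", "").replace("\t", "")

print(repr(ends_trimmed), repr(spaces_only), repr(commas), repr(char_set))
print(hidden == "value", repr(hidden))
```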

eliminate all the whitespace from a string, on both ends, and in between words.
>>> import re
>>> re.sub(r"\s+",  # one or more whitespace characters
...        '',      # replace with empty string (i.e. remove)
...        ''' hello
... apple
... ''')
'helloapple'
https://en.wikipedia.org/wiki/Whitespace_character
Python docs:
https://docs.python.org/library/stdtypes.html#textseq
https://docs.python.org/library/stdtypes.html#str.replace
https://docs.python.org/library/string.html#string.replace
https://docs.python.org/library/re.html#re.sub
https://docs.python.org/library/re.html#regular-expression-syntax

I use split() to drop all whitespace and join() to concatenate the pieces.
sentence = ''.join(' hello apple '.split())
print(sentence) # prints: helloapple
I prefer this approach because it is a single expression (not a statement).
It is easy to use, and it can be used without binding the result to a variable.
print(''.join(' hello apple '.split())) # no need to bind to a variable

import re
sentence = '  hello  apple'
re.sub(' ', '', sentence)   # 'helloapple' (remove all spaces)
re.sub(' +', ' ', sentence) # ' hello apple' (collapse runs of spaces into one)

In the following script we import the regular expression module which we use to substitute one space or more with a single space. This ensures that the inner extra spaces are removed. Then we use strip() function to remove leading and trailing spaces.
# Import regular expression module
import re
# Initialize string
a = " foo bar "
# First replace any number of spaces with a single space
a = re.sub(' +', ' ', a)
# Then strip any leading and trailing spaces.
a = a.strip()
# Show results
print(a)
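The pattern ' +' above matches only the space character, so tabs and newlines inside the string are left alone. The sketch below contrasts it with \s+ and with the split/join idiom on a made-up sample that contains a tab.

```python
import re

a = "  foo \t  bar  "

spaces_collapsed = re.sub(" +", " ", a).strip()  # the tab stays in the middle
all_ws_collapsed = re.sub(r"\s+", " ", a).strip()
split_join = " ".join(a.split())                 # same result without re

print(repr(spaces_collapsed))
print(repr(all_ws_collapsed))
print(repr(split_join))
```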

I found that this works the best for me:
test_string = ' test a s test '
string_list = [s.strip() for s in str(test_string).split()]
final_string = ' '.join(string_list)
# final_string: 'test a s test'
It removes any whitespaces, tabs, etc.

Try this. Instead of using re, I think using split with strip is much better:
def my_handle(self):
    sentence = ' hello apple '
    ' '.join(x.strip() for x in sentence.split())
    # hello apple
    ''.join(x.strip() for x in sentence.split())
    # helloapple

How to remove selected characters from a string?

I have been trying to learn how I can remove special characters on random given strings. A random given string could be something like:
uh\n haha - yes 'nope' \t tuben\xa01337
and I have used both regex and string.translate to try what could work out for me:
import re
random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
print(re.sub(r"/[' \n \t\r]|(\xa0)/g", '', random_string))
print("-------")
print(random_string.translate(str.maketrans({c: "" for c in "\n \xa0\t\r"})))
The output of that returns:
uh
haha - yes 'nope' tuben 1337
-------
uhhaha-yes'nope'tuben1337
The problem is that neither result is what I want. The output should be:
uh haha - yes nope tuben 1337
I wonder how I could be able to do that?
\n\t\xa0 or any similar should be replaced as one whitespace
' and " should be replaced with no whitespace, just remove the ' and "
double whitespaces or more should be replaced with only one whitespace total. Meaning that if there are two or more whitespaces in a text they should be replaced with one.
Any special characters should be removed as well

You can use
import re
random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
random_string = re.sub(r"\s+", " ", random_string).strip().replace('"', '').replace("'", '')
print(random_string)
Notes:
re.sub(r"\s+", " ", random_string) - shrinks any chunks of one or more whitespace chars into a single regular space char
.strip() - removes leading/trailing whitespace
.replace('"', '').replace("'", '') - removes " and ' chars.
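The same steps can be wrapped in a small function. In this sketch the quotes are deleted first and the whitespace is collapsed afterwards, so that a quote standing between two spaces does not leave a double space behind. The names clean_text and QUOTES are hypothetical, and the "remove any special characters" requirement from the question is not covered.

```python
import re

# Translation table that deletes single and double quotes.
QUOTES = str.maketrans("", "", "'\"")

def clean_text(text):
    """Drop quotes, then squeeze every whitespace run into one space."""
    text = text.translate(QUOTES)
    return re.sub(r"\s+", " ", text).strip()

random_string = "uh\n haha - yes 'nope' \t tuben\xa01337"
print(clean_text(random_string))
```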

/[' \n \t\r]|(\xa0)/g
This is syntax that is used by tools like sed or Vim, not Python's re module.
The equivalent would be
print(re.sub(r"[' \n \t\r]|(\xa0)", '', random_string))
which prints
uhhaha-yesnopetuben1337
which is not far off, but you also removed all spaces.
If you don't remove the spaces,
print(re.sub(r"['\n\t\r]|(\xa0)", '', random_string))
you get
uh haha - yes nope  tuben1337
which has too many spaces.
A solution is to use the inverse regular expression (which matches runs of characters you want to keep) with re.findall to get a list of words, which you can then re-join:
result = re.findall(r"[^' \n\t\r\xa0]+", random_string)
print(' '.join(result))
which prints
uh haha - yes nope tuben 1337

These two substitutions will do the trick (note that the hyphen is removed as well):
>>> print(re.sub(r"\s+", ' ', re.sub(r"[^\w\s]", '', random_string)))
uh haha yes nope tuben 1337
The outer re.sub matches each run of whitespace (including \n, \t and \xa0) and replaces it with a single space.
The inner re.sub removes every character that is neither a word character nor whitespace. The leading / and trailing /g from your attempt are dropped, because Python's re module would treat them as literal characters.

Split leading whitespace from rest of string

I'm not sure exactly how to convey what I'm trying to do, but I'm trying to create a function that splits off part of my string (the leading whitespace) so that I can edit the string in different parts of my script, then add the whitespace back once the string has been altered.
So let's say I have the string:
"    That's four spaces"
I want to split it so I end up with:
"    " and "That's four spaces"

You can use re.match:
>>> import re
>>> re.match(r'(\s*)(.*)', "    That's four spaces").groups()
('    ', "That's four spaces")
>>>
(\s*) captures zero or more whitespace characters at the start of the string and (.*) gets everything else.
Remember though that strings are immutable in Python. Technically, you cannot edit their contents; you can only create new string objects.
For a non-Regex solution, you could try something like this:
>>> mystr = "    That's four spaces"
>>> n = next(i for i, c in enumerate(mystr) if c != ' ') # Count spaces at start
>>> (' ' * n, mystr[n:])
('    ', "That's four spaces")
>>>
The main tools here are next, enumerate, and a generator expression. This solution is probably faster than the Regex one, but I personally think that the first is more elegant.

Why don't you try matching instead of splitting?
>>> import re
>>> s = "    That's four spaces"
>>> re.findall(r'^\s+|.+', s)
['    ', "That's four spaces"]
Explanation:
^\s+ Matches one or more whitespace characters at the start of the string.
| OR
.+ Matches all the remaining characters.

One solution is to lstrip the string, then figure out how many characters you've removed. You can then 'modify' the string as desired and finish by adding the whitespace back to your string. I don't think this would work properly with tab characters, but for spaces only it seems to get the job done:
my_string = "    That's four spaces"
no_left_whitespace = my_string.lstrip()
modified_string = no_left_whitespace + '!'
index = my_string.index(no_left_whitespace)
final_string = (' ' * index) + modified_string
print(final_string) #     That's four spaces!
And a simple test to ensure that we've done it right, which passes:
assert final_string == my_string + '!'
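The tab concern mentioned above goes away if the original prefix is sliced off and reused, instead of being rebuilt from a space count. A minimal sketch; the function name split_indent is made up.

```python
def split_indent(line):
    """Return (leading_whitespace, rest) without altering either part."""
    body = line.lstrip()
    indent = line[: len(line) - len(body)]  # the exact original prefix
    return indent, body

indent, body = split_indent("\t  That's mixed indentation")
rebuilt = indent + body + "!"  # edit the body, then reattach the prefix
print(repr(indent), repr(body), repr(rebuilt))
```

A whitespace-only line yields the whole line as the prefix and an empty body.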

One thing you can do is make a list out of the string:
x = "    That's four spaces"
y = list(x)
z = "".join(y[0:4]) # if the count varies, loop from the start to detect the leading spaces
k = "".join(y[4:])
s = []
s.append(z)
s.append(k)
print(s)
This is a non-regex solution that does not require any imports.

Python regex '\s' vs '\\s'

I have two simple expressions, \s and \\s. Both expressions match in This is Sparta!!.
>>> re.findall('\\s',"This is Sparta")
[' ', ' ']
>>> re.findall('\s',"This is Sparta")
[' ', ' ']
I am confused here. \ is used to escape special characters and \s represents whitespace, so how do both of them work here?

Don't confuse Python-level string escaping with regex-level escaping. Since \s is not a recognized escape sequence at the Python level, the interpreter reads a string like "\s" as the two characters "\" and "s". Replace "s" with "n" (for example), and it reads it as the single newline character.
>>> '\s' == '\\s'
True
>>> '\n' == '\\n'
False

\ only escapes the following character if the escaped character is valid
>>> len('\s')
2
>>> len('\n')
1
compare with
>>> len('\\s')
2
>>> len('\\n')
2
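Raw string literals sidestep the question entirely, because the backslash is then never processed at the Python level. This also matters in practice: since Python 3.6 an unrecognized escape such as '\s' in a non-raw literal triggers a DeprecationWarning, and from Python 3.12 a SyntaxWarning. A short sketch:

```python
import re

raw = r"\s"       # backslash + "s", no Python-level processing
escaped = "\\s"   # the same two characters, written with an escaped backslash

same_pattern = raw == escaped
raw_n_length = len(r"\n")   # two characters: backslash and "n"
real_n_length = len("\n")   # one character: the newline itself

spaces = re.findall(r"\s", "This is Sparta")
# The regex engine understands \n on its own, so both forms find the newline.
newlines_raw = re.findall(r"\n", "a\nb")
newlines_plain = re.findall("\n", "a\nb")

print(same_pattern, raw_n_length, real_n_length, spaces)
```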

Add string between tabs and text

I simply want to add a string after the (0 or more) tabs at the beginning of a string.
i.e.
a = '\t\t\tHere is the next part of string. More garbage.'
(insert Added String here.)
to
b = '\t\t\t Added String here. Here is the next part of string. More garbage.'
What is the easiest/simplest way to go about it?

Simple:
re.sub(r'^(\t*)', r'\1 Added String here. ', inputtext)
The ^ caret matches the start of the string, \t a tab character, of which there should be zero or more (*). The parentheses capture the matched tabs for use in the replacement string, where \1 inserts them again in front of the string you are adding.
Demo:
>>> import re
>>> a = '\t\t\tHere is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', a)
'\t\t\t Added String here. Here is the next part of string. More garbage.'
>>> re.sub(r'^(\t*)', r'\1 Added String here. ', 'No leading tabs.')
' Added String here. No leading tabs.'
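One caveat with a replacement string is that re.sub processes backslash escapes in it, so inserted text containing a backslash would be altered or rejected. A sketch of a variant that uses a replacement function instead, so the added text is taken literally; the function name insert_after_tabs is hypothetical.

```python
import re

def insert_after_tabs(text, added):
    """Insert `added` right after the leading tabs of `text`."""
    # The return value of a replacement function is used as-is,
    # so backslashes in `added` are not interpreted.
    return re.sub(r"^(\t*)", lambda m: m.group(1) + added, text, count=1)

a = '\t\t\tHere is the next part of string. More garbage.'
print(repr(insert_after_tabs(a, ' Added String here. ')))
print(repr(insert_after_tabs('\tpath: ', 'C:\\temp ')))
```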

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
