How to strip all whitespace from string - python

How do I strip all the spaces in a python string? For example, I want a string like strip my spaces to be turned into stripmyspaces, but I cannot seem to accomplish that with strip():
>>> 'strip my spaces'.strip()
'strip my spaces'

Taking advantage of str.split's behavior with no sep parameter:
>>> s = " \t foo \n bar "
>>> "".join(s.split())
'foobar'
If you just want to remove spaces instead of all whitespace:
>>> s.replace(" ", "")
'\tfoo\nbar'
Premature optimization
Even though efficiency isn't the primary goal—writing clear code is—here are some initial timings:
$ python -m timeit '"".join(" \t foo \n bar ".split())'
1000000 loops, best of 3: 1.38 usec per loop
$ python -m timeit -s 'import re' 're.sub(r"\s+", "", " \t foo \n bar ")'
100000 loops, best of 3: 15.6 usec per loop
Note the regex is cached, so it's not as slow as you'd imagine. Compiling it beforehand helps some, but would only matter in practice if you call this many times:
$ python -m timeit -s 'import re; e = re.compile(r"\s+")' 'e.sub("", " \t foo \n bar ")'
100000 loops, best of 3: 7.76 usec per loop
Even though re.sub is 11.3x slower, remember your bottlenecks are assuredly elsewhere. Most programs would not notice the difference between any of these 3 choices.

For Python 3:
>>> import re
>>> re.sub(r'\s+', '', 'strip my \n\t\r ASCII and \u00A0 \u2003 Unicode spaces')
'stripmyASCIIandUnicodespaces'
>>> # Or, depending on the situation:
>>> re.sub(r'(\s|\u180B|\u200B|\u200C|\u200D|\u2060|\uFEFF)+', '', \
... '\uFEFF\t\t\t strip all \u000A kinds of \u200B whitespace \n')
'stripallkindsofwhitespace'
...handles any whitespace characters that you're not thinking of - and believe us, there are plenty.
\s on its own always covers the ASCII whitespace:
(regular) space
tab
new line (\n)
carriage return (\r)
form feed
vertical tab
Additionally:
for Python 2 with re.UNICODE enabled,
for Python 3 without any extra actions,
...\s also covers the Unicode whitespace characters, for example:
non-breaking space,
em space,
ideographic space,
...etc. See the full list here, under "Unicode characters with White_Space property".
However \s DOES NOT cover characters not classified as whitespace, which are de facto whitespace, such as among others:
zero-width joiner,
Mongolian vowel separator,
zero-width non-breaking space (a.k.a. byte order mark),
...etc. See the full list here, under "Related Unicode characters without White_Space property".
So these 6 characters are covered by the list in the second regex, \u180B|\u200B|\u200C|\u200D|\u2060|\uFEFF.
Sources:
https://docs.python.org/2/library/re.html
https://docs.python.org/3/library/re.html
https://en.wikipedia.org/wiki/Unicode_character_property

Alternatively,
"strip my spaces".translate( None, string.whitespace )
And here is Python3 version:
"strip my spaces".translate(str.maketrans('', '', string.whitespace))

Remove the Starting Spaces in Python
string1 = " This is Test String to strip leading space"
print(string1)
print(string1.lstrip())
Remove the Trailing or End Spaces in Python
string2 = "This is Test String to strip trailing space "
print(string2)
print(string2.rstrip())
Remove the whiteSpaces from Beginning and end of the string in Python
string3 = " This is Test String to strip leading and trailing space "
print(string3)
print(string3.strip())
Remove all the spaces in python
string4 = " This is Test String to test all the spaces "
print(string4)
print(string4.replace(" ", ""))

The simplest is to use replace:
"foo bar\t".replace(" ", "").replace("\t", "")
Alternatively, use a regular expression:
import re
re.sub(r"\s", "", "foo bar\t")

As mentioned by Roger Pate following code worked for me:
s = " \t foo \n bar "
"".join(s.split())
'foobar'
I am using Jupyter Notebook to run following code:
i=0
ProductList=[]
while i < len(new_list):
temp='' # new_list[i]=temp=' Plain Utthapam '
#temp=new_list[i].strip() #if we want o/p as: 'Plain Utthapam'
temp="".join(new_list[i].split()) #o/p: 'PlainUtthapam'
temp=temp.upper() #o/p:'PLAINUTTHAPAM'
ProductList.append(temp)
i=i+2

Try a regex with re.sub. You can search for all whitespace and replace with an empty string.
\s in your pattern will match whitespace characters - and not just a space (tabs, newlines, etc). You can read more about it in the manual.

import re
re.sub(' ','','strip my spaces')

The standard techniques to filter a list apply, although they are not as efficient as the split/join or translate methods.
We need a set of whitespaces:
>>> import string
>>> ws = set(string.whitespace)
The filter builtin:
>>> "".join(filter(lambda c: c not in ws, "strip my spaces"))
'stripmyspaces'
A list comprehension (yes, use the brackets: see benchmark below):
>>> import string
>>> "".join([c for c in "strip my spaces" if c not in ws])
'stripmyspaces'
A fold:
>>> import functools
>>> "".join(functools.reduce(lambda acc, c: acc if c in ws else acc+c, "strip my spaces"))
'stripmyspaces'
Benchmark:
>>> from timeit import timeit
>>> timeit('"".join("strip my spaces".split())')
0.17734256500003198
>>> timeit('"strip my spaces".translate(ws_dict)', 'import string; ws_dict = {ord(ws):None for ws in string.whitespace}')
0.457635745999994
>>> timeit('re.sub(r"\s+", "", "strip my spaces")', 'import re')
1.017787621000025
>>> SETUP = 'import string, operator, functools, itertools; ws = set(string.whitespace)'
>>> timeit('"".join([c for c in "strip my spaces" if c not in ws])', SETUP)
0.6484303600000203
>>> timeit('"".join(c for c in "strip my spaces" if c not in ws)', SETUP)
0.950212219999969
>>> timeit('"".join(filter(lambda c: c not in ws, "strip my spaces"))', SETUP)
1.3164566040000523
>>> timeit('"".join(functools.reduce(lambda acc, c: acc if c in ws else acc+c, "strip my spaces"))', SETUP)
1.6947649049999995

Parce your string to separate words
Strip white spaces on both sides
Join them with single space in the end
Final line of code:
' '.join(word.strip() for word in message_text.split()

If optimal performance is not a requirement and you just want something dead simple, you can define a basic function to test each character using the string class's built in "isspace" method:
def remove_space(input_string):
no_white_space = ''
for c in input_string:
if not c.isspace():
no_white_space += c
return no_white_space
Building the no_white_space string this way will not have ideal performance, but the solution is easy to understand.
>>> remove_space('strip my spaces')
'stripmyspaces'
If you don't want to define a function, you can convert this into something vaguely similar with list comprehension. Borrowing from the top answer's join solution:
>>> "".join([c for c in "strip my spaces" if not c.isspace()])
'stripmyspaces'

TL/DR
This solution was tested using Python 3.6
To strip all spaces from a string in Python3 you can use the following function:
def remove_spaces(in_string: str):
return in_string.translate(str.maketrans({' ': ''})
To remove any whitespace characters (' \t\n\r\x0b\x0c') you can use the following function:
import string
def remove_whitespace(in_string: str):
return in_string.translate(str.maketrans(dict.fromkeys(string.whitespace)))
Explanation
Python's str.translate method is a built-in class method of str, it takes a table and returns a copy of the string with each character mapped through the passed translation table. Full documentation for str.translate
To create the translation table str.maketrans is used. This method is another built-in class method of str. Here we use it with only one parameter, in this case a dictionary, where the keys are the characters to be replaced mapped to values with the characters replacement value. It returns a translation table for use with str.translate. Full documentation for str.maketrans
The string module in python contains some common string operations and constants. string.whitespace is a constant which returns a string containing all ASCII characters that are considered whitespace. This includes the characters space, tab, linefeed, return, formfeed, and vertical tab. Full documentation for string.whitespace
In the second function dict.fromkeys is used to create a dictionary where the keys are the characters in the string returned by string.whitespace each with value None. Full documentation for dict.fromkeys

Here's another way using plain old list comprehension:
''.join([c for c in aString if c not in [' ','\t','\n']])
Example:
>>> aStr = 'aaa\nbbb\t\t\tccc '
>>> print(aString)
aaa
bbb ccc
>>> ''.join([c for c in aString if c not in [' ','\t','\n']])
'aaabbbccc'

This got asked in an interview. So if you have to give a solution just by using strip method. Here's an approach -
s='string with spaces'
res=''.join((i.strip(' ') for i in s))
print(res)

Related

Python print list [duplicate]

How can I remove the last character of a string if it is a newline?
"abc\n" --> "abc"
Try the method rstrip() (see doc Python 2 and Python 3)
>>> 'test string\n'.rstrip()
'test string'
Python's rstrip() method strips all kinds of trailing whitespace by default, not just one newline as Perl does with chomp.
>>> 'test string \n \r\n\n\r \n\n'.rstrip()
'test string'
To strip only newlines:
>>> 'test string \n \r\n\n\r \n\n'.rstrip('\n')
'test string \n \r\n\n\r '
In addition to rstrip(), there are also the methods strip() and lstrip(). Here is an example with the three of them:
>>> s = " \n\r\n \n abc def \n\r\n \n "
>>> s.strip()
'abc def'
>>> s.lstrip()
'abc def \n\r\n \n '
>>> s.rstrip()
' \n\r\n \n abc def'
And I would say the "pythonic" way to get lines without trailing newline characters is splitlines().
>>> text = "line 1\nline 2\r\nline 3\nline 4"
>>> text.splitlines()
['line 1', 'line 2', 'line 3', 'line 4']
The canonical way to strip end-of-line (EOL) characters is to use the string rstrip() method removing any trailing \r or \n. Here are examples for Mac, Windows, and Unix EOL characters.
>>> 'Mac EOL\r'.rstrip('\r\n')
'Mac EOL'
>>> 'Windows EOL\r\n'.rstrip('\r\n')
'Windows EOL'
>>> 'Unix EOL\n'.rstrip('\r\n')
'Unix EOL'
Using '\r\n' as the parameter to rstrip means that it will strip out any trailing combination of '\r' or '\n'. That's why it works in all three cases above.
This nuance matters in rare cases. For example, I once had to process a text file which contained an HL7 message. The HL7 standard requires a trailing '\r' as its EOL character. The Windows machine on which I was using this message had appended its own '\r\n' EOL character. Therefore, the end of each line looked like '\r\r\n'. Using rstrip('\r\n') would have taken off the entire '\r\r\n' which is not what I wanted. In that case, I simply sliced off the last two characters instead.
Note that unlike Perl's chomp function, this will strip all specified characters at the end of the string, not just one:
>>> "Hello\n\n\n".rstrip("\n")
"Hello"
Note that rstrip doesn't act exactly like Perl's chomp() because it doesn't modify the string. That is, in Perl:
$x="a\n";
chomp $x
results in $x being "a".
but in Python:
x="a\n"
x.rstrip()
will mean that the value of x is still "a\n". Even x=x.rstrip() doesn't always give the same result, as it strips all whitespace from the end of the string, not just one newline at most.
I might use something like this:
import os
s = s.rstrip(os.linesep)
I think the problem with rstrip("\n") is that you'll probably want to make sure the line separator is portable. (some antiquated systems are rumored to use "\r\n"). The other gotcha is that rstrip will strip out repeated whitespace. Hopefully os.linesep will contain the right characters. the above works for me.
You may use line = line.rstrip('\n'). This will strip all newlines from the end of the string, not just one.
s = s.rstrip()
will remove all newlines at the end of the string s. The assignment is needed because rstrip returns a new string instead of modifying the original string.
"line 1\nline 2\r\n...".replace('\n', '').replace('\r', '')
>>> 'line 1line 2...'
or you could always get geekier with regexps
This would replicate exactly perl's chomp (minus behavior on arrays) for "\n" line terminator:
def chomp(x):
if x.endswith("\r\n"): return x[:-2]
if x.endswith("\n") or x.endswith("\r"): return x[:-1]
return x
(Note: it does not modify string 'in place'; it does not strip extra trailing whitespace; takes \r\n in account)
you can use strip:
line = line.strip()
demo:
>>> "\n\n hello world \n\n".strip()
'hello world'
rstrip doesn't do the same thing as chomp, on so many levels. Read http://perldoc.perl.org/functions/chomp.html and see that chomp is very complex indeed.
However, my main point is that chomp removes at most 1 line ending, whereas rstrip will remove as many as it can.
Here you can see rstrip removing all the newlines:
>>> 'foo\n\n'.rstrip(os.linesep)
'foo'
A much closer approximation of typical Perl chomp usage can be accomplished with re.sub, like this:
>>> re.sub(os.linesep + r'\Z','','foo\n\n')
'foo\n'
Careful with "foo".rstrip(os.linesep): That will only chomp the newline characters for the platform where your Python is being executed. Imagine you're chimping the lines of a Windows file under Linux, for instance:
$ python
Python 2.7.1 (r271:86832, Mar 18 2011, 09:09:48)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> sys.platform
'linux2'
>>> "foo\r\n".rstrip(os.linesep)
'foo\r'
>>>
Use "foo".rstrip("\r\n") instead, as Mike says above.
An example in Python's documentation simply uses line.strip().
Perl's chomp function removes one linebreak sequence from the end of a string only if it's actually there.
Here is how I plan to do that in Python, if process is conceptually the function that I need in order to do something useful to each line from this file:
import os
sep_pos = -len(os.linesep)
with open("file.txt") as f:
for line in f:
if line[sep_pos:] == os.linesep:
line = line[:sep_pos]
process(line)
I don't program in Python, but I came across an FAQ at python.org advocating S.rstrip("\r\n") for python 2.2 or later.
import re
r_unwanted = re.compile("[\n\t\r]")
r_unwanted.sub("", your_text)
If your question is to clean up all the line breaks in a multiple line str object (oldstr), you can split it into a list according to the delimiter '\n' and then join this list into a new str(newstr).
newstr = "".join(oldstr.split('\n'))
I find it convenient to have be able to get the chomped lines via in iterator, parallel to the way you can get the un-chomped lines from a file object. You can do so with the following code:
def chomped_lines(it):
return map(operator.methodcaller('rstrip', '\r\n'), it)
Sample usage:
with open("file.txt") as infile:
for line in chomped_lines(infile):
process(line)
I'm bubbling up my regular expression based answer from one I posted earlier in the comments of another answer. I think using re is a clearer more explicit solution to this problem than str.rstrip.
>>> import re
If you want to remove one or more trailing newline chars:
>>> re.sub(r'[\n\r]+$', '', '\nx\r\n')
'\nx'
If you want to remove newline chars everywhere (not just trailing):
>>> re.sub(r'[\n\r]+', '', '\nx\r\n')
'x'
If you want to remove only 1-2 trailing newline chars (i.e., \r, \n, \r\n, \n\r, \r\r, \n\n)
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r\n')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n')
'\nx'
I have a feeling what most people really want here, is to remove just one occurrence of a trailing newline character, either \r\n or \n and nothing more.
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n\n', count=1)
'\nx\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n\r\n', count=1)
'\nx\r\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n', count=1)
'\nx'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n', count=1)
'\nx'
(The ?: is to create a non-capturing group.)
(By the way this is not what '...'.rstrip('\n', '').rstrip('\r', '') does which may not be clear to others stumbling upon this thread. str.rstrip strips as many of the trailing characters as possible, so a string like foo\n\n\n would result in a false positive of foo whereas you may have wanted to preserve the other newlines after stripping a single trailing one.)
workaround solution for special case:
if the newline character is the last character (as is the case with most file inputs), then for any element in the collection you can index as follows:
foobar= foobar[:-1]
to slice out your newline character.
It looks like there is not a perfect analog for perl's chomp. In particular, rstrip cannot handle multi-character newline delimiters like \r\n. However, splitlines does as pointed out here.
Following my answer on a different question, you can combine join and splitlines to remove/replace all newlines from a string s:
''.join(s.splitlines())
The following removes exactly one trailing newline (as chomp would, I believe). Passing True as the keepends argument to splitlines retain the delimiters. Then, splitlines is called again to remove the delimiters on just the last "line":
def chomp(s):
if len(s):
lines = s.splitlines(True)
last = lines.pop()
return ''.join(lines + last.splitlines())
else:
return ''
s = '''Hello World \t\n\r\tHi There'''
# import the module string
import string
# use the method translate to convert
s.translate({ord(c): None for c in string.whitespace}
>>'HelloWorldHiThere'
With regex
s = ''' Hello World
\t\n\r\tHi '''
print(re.sub(r"\s+", "", s), sep='') # \s matches all white spaces
>HelloWorldHi
Replace \n,\t,\r
s.replace('\n', '').replace('\t','').replace('\r','')
>' Hello World Hi '
With regex
s = '''Hello World \t\n\r\tHi There'''
regex = re.compile(r'[\n\r\t]')
regex.sub("", s)
>'Hello World Hi There'
with Join
s = '''Hello World \t\n\r\tHi There'''
' '.join(s.split())
>'Hello World Hi There'
>>> ' spacious '.rstrip()
' spacious'
>>> "AABAA".rstrip("A")
'AAB'
>>> "ABBA".rstrip("AB") # both AB and BA are stripped
''
>>> "ABCABBA".rstrip("AB")
'ABC'
Just use :
line = line.rstrip("\n")
or
line = line.strip("\n")
You don't need any of this complicated stuff
There are three types of line endings that we normally encounter: \n, \r and \r\n. A rather simple regular expression in re.sub, namely r"\r?\n?$", is able to catch them all.
(And we gotta catch 'em all, am I right?)
import re
re.sub(r"\r?\n?$", "", the_text, 1)
With the last argument, we limit the number of occurences replaced to one, mimicking chomp to some extent. Example:
import re
text_1 = "hellothere\n\n\n"
text_2 = "hellothere\n\n\r"
text_3 = "hellothere\n\n\r\n"
a = re.sub(r"\r?\n?$", "", text_1, 1)
b = re.sub(r"\r?\n?$", "", text_2, 1)
c = re.sub(r"\r?\n?$", "", text_3, 1)
... where a == b == c is True.
If you are concerned about speed (say you have a looong list of strings) and you know the nature of the newline char, string slicing is actually faster than rstrip. A little test to illustrate this:
import time
loops = 50000000
def method1(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string[:-1]
t1 = time.time()
print('Method 1: ' + str(t1 - t0))
def method2(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string.rstrip()
t1 = time.time()
print('Method 2: ' + str(t1 - t0))
method1()
method2()
Output:
Method 1: 3.92700004578
Method 2: 6.73000001907
This will work both for windows and linux (bit expensive with re sub if you are looking for only re solution)
import re
if re.search("(\\r|)\\n$", line):
line = re.sub("(\\r|)\\n$", "", line)
A catch all:
line = line.rstrip('\r|\n')

Avoid newline in paths (Python) [duplicate]

How can I remove the last character of a string if it is a newline?
"abc\n" --> "abc"
Try the method rstrip() (see doc Python 2 and Python 3)
>>> 'test string\n'.rstrip()
'test string'
Python's rstrip() method strips all kinds of trailing whitespace by default, not just one newline as Perl does with chomp.
>>> 'test string \n \r\n\n\r \n\n'.rstrip()
'test string'
To strip only newlines:
>>> 'test string \n \r\n\n\r \n\n'.rstrip('\n')
'test string \n \r\n\n\r '
In addition to rstrip(), there are also the methods strip() and lstrip(). Here is an example with the three of them:
>>> s = " \n\r\n \n abc def \n\r\n \n "
>>> s.strip()
'abc def'
>>> s.lstrip()
'abc def \n\r\n \n '
>>> s.rstrip()
' \n\r\n \n abc def'
And I would say the "pythonic" way to get lines without trailing newline characters is splitlines().
>>> text = "line 1\nline 2\r\nline 3\nline 4"
>>> text.splitlines()
['line 1', 'line 2', 'line 3', 'line 4']
The canonical way to strip end-of-line (EOL) characters is to use the string rstrip() method removing any trailing \r or \n. Here are examples for Mac, Windows, and Unix EOL characters.
>>> 'Mac EOL\r'.rstrip('\r\n')
'Mac EOL'
>>> 'Windows EOL\r\n'.rstrip('\r\n')
'Windows EOL'
>>> 'Unix EOL\n'.rstrip('\r\n')
'Unix EOL'
Using '\r\n' as the parameter to rstrip means that it will strip out any trailing combination of '\r' or '\n'. That's why it works in all three cases above.
This nuance matters in rare cases. For example, I once had to process a text file which contained an HL7 message. The HL7 standard requires a trailing '\r' as its EOL character. The Windows machine on which I was using this message had appended its own '\r\n' EOL character. Therefore, the end of each line looked like '\r\r\n'. Using rstrip('\r\n') would have taken off the entire '\r\r\n' which is not what I wanted. In that case, I simply sliced off the last two characters instead.
Note that unlike Perl's chomp function, this will strip all specified characters at the end of the string, not just one:
>>> "Hello\n\n\n".rstrip("\n")
"Hello"
Note that rstrip doesn't act exactly like Perl's chomp() because it doesn't modify the string. That is, in Perl:
$x="a\n";
chomp $x
results in $x being "a".
but in Python:
x="a\n"
x.rstrip()
will mean that the value of x is still "a\n". Even x=x.rstrip() doesn't always give the same result, as it strips all whitespace from the end of the string, not just one newline at most.
I might use something like this:
import os
s = s.rstrip(os.linesep)
I think the problem with rstrip("\n") is that you'll probably want to make sure the line separator is portable. (some antiquated systems are rumored to use "\r\n"). The other gotcha is that rstrip will strip out repeated whitespace. Hopefully os.linesep will contain the right characters. the above works for me.
You may use line = line.rstrip('\n'). This will strip all newlines from the end of the string, not just one.
s = s.rstrip()
will remove all newlines at the end of the string s. The assignment is needed because rstrip returns a new string instead of modifying the original string.
"line 1\nline 2\r\n...".replace('\n', '').replace('\r', '')
>>> 'line 1line 2...'
or you could always get geekier with regexps
This would replicate exactly perl's chomp (minus behavior on arrays) for "\n" line terminator:
def chomp(x):
if x.endswith("\r\n"): return x[:-2]
if x.endswith("\n") or x.endswith("\r"): return x[:-1]
return x
(Note: it does not modify string 'in place'; it does not strip extra trailing whitespace; takes \r\n in account)
you can use strip:
line = line.strip()
demo:
>>> "\n\n hello world \n\n".strip()
'hello world'
rstrip doesn't do the same thing as chomp, on so many levels. Read http://perldoc.perl.org/functions/chomp.html and see that chomp is very complex indeed.
However, my main point is that chomp removes at most 1 line ending, whereas rstrip will remove as many as it can.
Here you can see rstrip removing all the newlines:
>>> 'foo\n\n'.rstrip(os.linesep)
'foo'
A much closer approximation of typical Perl chomp usage can be accomplished with re.sub, like this:
>>> re.sub(os.linesep + r'\Z','','foo\n\n')
'foo\n'
Careful with "foo".rstrip(os.linesep): That will only chomp the newline characters for the platform where your Python is being executed. Imagine you're chimping the lines of a Windows file under Linux, for instance:
$ python
Python 2.7.1 (r271:86832, Mar 18 2011, 09:09:48)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> sys.platform
'linux2'
>>> "foo\r\n".rstrip(os.linesep)
'foo\r'
>>>
Use "foo".rstrip("\r\n") instead, as Mike says above.
An example in Python's documentation simply uses line.strip().
Perl's chomp function removes one linebreak sequence from the end of a string only if it's actually there.
Here is how I plan to do that in Python, if process is conceptually the function that I need in order to do something useful to each line from this file:
import os
sep_pos = -len(os.linesep)
with open("file.txt") as f:
for line in f:
if line[sep_pos:] == os.linesep:
line = line[:sep_pos]
process(line)
I don't program in Python, but I came across an FAQ at python.org advocating S.rstrip("\r\n") for python 2.2 or later.
import re
r_unwanted = re.compile("[\n\t\r]")
r_unwanted.sub("", your_text)
If your question is to clean up all the line breaks in a multiple line str object (oldstr), you can split it into a list according to the delimiter '\n' and then join this list into a new str(newstr).
newstr = "".join(oldstr.split('\n'))
I find it convenient to have be able to get the chomped lines via in iterator, parallel to the way you can get the un-chomped lines from a file object. You can do so with the following code:
def chomped_lines(it):
return map(operator.methodcaller('rstrip', '\r\n'), it)
Sample usage:
with open("file.txt") as infile:
for line in chomped_lines(infile):
process(line)
I'm bubbling up my regular expression based answer from one I posted earlier in the comments of another answer. I think using re is a clearer more explicit solution to this problem than str.rstrip.
>>> import re
If you want to remove one or more trailing newline chars:
>>> re.sub(r'[\n\r]+$', '', '\nx\r\n')
'\nx'
If you want to remove newline chars everywhere (not just trailing):
>>> re.sub(r'[\n\r]+', '', '\nx\r\n')
'x'
If you want to remove only 1-2 trailing newline chars (i.e., \r, \n, \r\n, \n\r, \r\r, \n\n)
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r\n')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n')
'\nx'
I have a feeling what most people really want here, is to remove just one occurrence of a trailing newline character, either \r\n or \n and nothing more.
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n\n', count=1)
'\nx\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n\r\n', count=1)
'\nx\r\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n', count=1)
'\nx'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n', count=1)
'\nx'
(The ?: is to create a non-capturing group.)
(By the way this is not what '...'.rstrip('\n', '').rstrip('\r', '') does which may not be clear to others stumbling upon this thread. str.rstrip strips as many of the trailing characters as possible, so a string like foo\n\n\n would result in a false positive of foo whereas you may have wanted to preserve the other newlines after stripping a single trailing one.)
workaround solution for special case:
if the newline character is the last character (as is the case with most file inputs), then for any element in the collection you can index as follows:
foobar= foobar[:-1]
to slice out your newline character.
It looks like there is not a perfect analog for perl's chomp. In particular, rstrip cannot handle multi-character newline delimiters like \r\n. However, splitlines does as pointed out here.
Following my answer on a different question, you can combine join and splitlines to remove/replace all newlines from a string s:
''.join(s.splitlines())
The following removes exactly one trailing newline (as chomp would, I believe). Passing True as the keepends argument to splitlines retain the delimiters. Then, splitlines is called again to remove the delimiters on just the last "line":
def chomp(s):
if len(s):
lines = s.splitlines(True)
last = lines.pop()
return ''.join(lines + last.splitlines())
else:
return ''
s = '''Hello World \t\n\r\tHi There'''
# import the module string
import string
# use the method translate to convert
s.translate({ord(c): None for c in string.whitespace}
>>'HelloWorldHiThere'
With regex
s = ''' Hello World
\t\n\r\tHi '''
print(re.sub(r"\s+", "", s), sep='') # \s matches all white spaces
>HelloWorldHi
Replace \n,\t,\r
s.replace('\n', '').replace('\t','').replace('\r','')
>' Hello World Hi '
With regex
s = '''Hello World \t\n\r\tHi There'''
regex = re.compile(r'[\n\r\t]')
regex.sub("", s)
>'Hello World Hi There'
with Join
s = '''Hello World \t\n\r\tHi There'''
' '.join(s.split())
>'Hello World Hi There'
>>> ' spacious '.rstrip()
' spacious'
>>> "AABAA".rstrip("A")
'AAB'
>>> "ABBA".rstrip("AB") # both AB and BA are stripped
''
>>> "ABCABBA".rstrip("AB")
'ABC'
Just use :
line = line.rstrip("\n")
or
line = line.strip("\n")
You don't need any of this complicated stuff
There are three types of line endings that we normally encounter: \n, \r and \r\n. A rather simple regular expression in re.sub, namely r"\r?\n?$", is able to catch them all.
(And we gotta catch 'em all, am I right?)
import re
re.sub(r"\r?\n?$", "", the_text, 1)
With the last argument, we limit the number of occurences replaced to one, mimicking chomp to some extent. Example:
import re
text_1 = "hellothere\n\n\n"
text_2 = "hellothere\n\n\r"
text_3 = "hellothere\n\n\r\n"
a = re.sub(r"\r?\n?$", "", text_1, 1)
b = re.sub(r"\r?\n?$", "", text_2, 1)
c = re.sub(r"\r?\n?$", "", text_3, 1)
... where a == b == c is True.
If you are concerned about speed (say you have a looong list of strings) and you know the nature of the newline char, string slicing is actually faster than rstrip. A little test to illustrate this:
import time
loops = 50000000
def method1(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string[:-1]
t1 = time.time()
print('Method 1: ' + str(t1 - t0))
def method2(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string.rstrip()
t1 = time.time()
print('Method 2: ' + str(t1 - t0))
method1()
method2()
Output:
Method 1: 3.92700004578
Method 2: 6.73000001907
This will work both for windows and linux (bit expensive with re sub if you are looking for only re solution)
import re
if re.search("(\\r|)\\n$", line):
line = re.sub("(\\r|)\\n$", "", line)
A catch all:
line = line.rstrip('\r|\n')

Replacing a \t with something in a list

>>>user_sentence = "hello \t how are you?"
>>>import re
>>>user_sentenceSplit = re.findall(r"([\s]|[\w']+|[.,!?;])",user_sentence)
>>>print user_sentenceSplit
I get ['hello', '\t', 'how', 'are', 'you', '?']
I don't know how to create any code that will replace the '\t' with 'tab'.
I do not believe that replacing \t in the original string will ever work, you have two issues:
Your code also outputs spaces as tokens, but you do not want to have them
The \t in between letters will become a part of a word token.
So, you need to replace [\s] with [^\S ] pattern that matches any whitespace but a regular space (add more excluded whitespace symbols if necessary into the negated character class) and you need to iterate through all the tokens and check if a token is equal to a tab, and then replace it with tab value. So, the best is to use re.finditer and push the found values into a list variable, see sample code below:
import re
user_sentence = "hello \t how are you?"
user_sentenceSplit = []
for x in re.finditer(r"[^\S ]|[\w']+|[.,!?;]",user_sentence):
if x.group() == "\t": # if it is a tab, replace the value
user_sentenceSplit.append("tab")
else: # else, push the match value
user_sentenceSplit.append(x.group())
print(user_sentenceSplit)
See the Python demo
I think str.replace would do the job.
user_sentence.replace('\t', 'tab')
Do this before splitting the string.
It is behavior of Python's compiler. You should not be worrying about it. Pyhton's Compiler store tab as \t. You need not to do anything on it as it will treat it as tab while performing any action over it. For example:
>>> my_string = 'Yes Hello So?' # <- String with tab
>>> my_string
'Yes\tHello\tSo?' # <- Stored tab as '\t'
>>> print my_string
Yes Hello So? # While printing, again tab
However you exact requirement is not clear to me. In case you want to replace the value of \t with tab string, you may do:
>>> my_string = my_string.replace('\t', 'tab')
>>> my_string
'YestabHellotabSo?'
where my_string is holding the value I mentioned in previous example.

Removing control characters from a string in python

I currently have the following code
def removeControlCharacters(line):
i = 0
for c in line:
if (c < chr(32)):
line = line[:i - 1] + line[i+1:]
i += 1
return line
This is just does not work if there are more than one character to be deleted.
There are hundreds of control characters in unicode. If you are sanitizing data from the web or some other source that might contain non-ascii characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".
This snippet removes all control characters from a string.
import unicodedata
def remove_control_characters(s):
return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")
Examples of unicode categories:
>>> from unicodedata import category
>>> category('\r') # carriage return --> Cc : control character
'Cc'
>>> category('\0') # null character ---> Cc : control character
'Cc'
>>> category('\t') # tab --------------> Cc : control character
'Cc'
>>> category(' ') # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A') # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',') # comma -----------> Po : punctuation
'Po'
>>>
You could use str.translate with the appropriate map, for example like this:
>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
Anyone interested in a regex character class that matches any Unicode control character may use [\x00-\x1f\x7f-\x9f].
You may test it like this:
>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True
So to remove the control characters using re just use the following:
>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
This is the easiest, most complete, and most robust way I am aware of. It does require an external dependency, however. I consider it to be worth it for most projects.
pip install regex
import regex as rx
def remove_control_characters(str):
return rx.sub(r'\p{C}', '', 'my-string')
\p{C} is the unicode character property for control characters, so you can leave it up to the unicode consortium which ones of the millions of unicode characters available should be considered control. There are also other extremely useful character properties I frequently use, for example \p{Z} for any kind of whitespace.
Your implementation is wrong because the value of i is incorrect. However that's not the only problem: it also repeatedly uses slow string operations, meaning that it runs in O(n2) instead of O(n). Try this instead:
return ''.join(c for c in line if ord(c) >= 32)
And for Python 2, with the builtin translate:
import string
all_bytes = string.maketrans('', '') # String of 256 characters with (byte) value 0 to 255
line.translate(all_bytes, all_bytes[:32]) # All bytes < 32 are deleted (the second argument lists the bytes to delete)
You modify the line during iterating over it. Something like ''.join([x for x in line if ord(x) >= 32])
filter(string.printable[:-5].__contains__,line)
I've tried all the above and it didn't help. In my case, I had to remove Unicode 'LRM' chars:
Finally I found this solution that did the job:
df["AMOUNT"] = df["AMOUNT"].str.encode("ascii", "ignore")
df["AMOUNT"] = df["AMOUNT"].str.decode('UTF-8')
Reference here.

How can I remove a trailing newline?

How can I remove the last character of a string if it is a newline?
"abc\n" --> "abc"
Try the method rstrip() (see doc Python 2 and Python 3)
>>> 'test string\n'.rstrip()
'test string'
Python's rstrip() method strips all kinds of trailing whitespace by default, not just one newline as Perl does with chomp.
>>> 'test string \n \r\n\n\r \n\n'.rstrip()
'test string'
To strip only newlines:
>>> 'test string \n \r\n\n\r \n\n'.rstrip('\n')
'test string \n \r\n\n\r '
In addition to rstrip(), there are also the methods strip() and lstrip(). Here is an example with the three of them:
>>> s = " \n\r\n \n abc def \n\r\n \n "
>>> s.strip()
'abc def'
>>> s.lstrip()
'abc def \n\r\n \n '
>>> s.rstrip()
' \n\r\n \n abc def'
And I would say the "pythonic" way to get lines without trailing newline characters is splitlines().
>>> text = "line 1\nline 2\r\nline 3\nline 4"
>>> text.splitlines()
['line 1', 'line 2', 'line 3', 'line 4']
The canonical way to strip end-of-line (EOL) characters is to use the string rstrip() method removing any trailing \r or \n. Here are examples for Mac, Windows, and Unix EOL characters.
>>> 'Mac EOL\r'.rstrip('\r\n')
'Mac EOL'
>>> 'Windows EOL\r\n'.rstrip('\r\n')
'Windows EOL'
>>> 'Unix EOL\n'.rstrip('\r\n')
'Unix EOL'
Using '\r\n' as the parameter to rstrip means that it will strip out any trailing combination of '\r' or '\n'. That's why it works in all three cases above.
This nuance matters in rare cases. For example, I once had to process a text file which contained an HL7 message. The HL7 standard requires a trailing '\r' as its EOL character. The Windows machine on which I was using this message had appended its own '\r\n' EOL character. Therefore, the end of each line looked like '\r\r\n'. Using rstrip('\r\n') would have taken off the entire '\r\r\n' which is not what I wanted. In that case, I simply sliced off the last two characters instead.
Note that unlike Perl's chomp function, this will strip all specified characters at the end of the string, not just one:
>>> "Hello\n\n\n".rstrip("\n")
"Hello"
Note that rstrip doesn't act exactly like Perl's chomp() because it doesn't modify the string. That is, in Perl:
$x="a\n";
chomp $x
results in $x being "a".
but in Python:
x="a\n"
x.rstrip()
will mean that the value of x is still "a\n". Even x=x.rstrip() doesn't always give the same result, as it strips all whitespace from the end of the string, not just one newline at most.
I might use something like this:
import os
s = s.rstrip(os.linesep)
I think the problem with rstrip("\n") is that you'll probably want to make sure the line separator is portable. (some antiquated systems are rumored to use "\r\n"). The other gotcha is that rstrip will strip out repeated whitespace. Hopefully os.linesep will contain the right characters. the above works for me.
You may use line = line.rstrip('\n'). This will strip all newlines from the end of the string, not just one.
s = s.rstrip()
will remove all newlines at the end of the string s. The assignment is needed because rstrip returns a new string instead of modifying the original string.
"line 1\nline 2\r\n...".replace('\n', '').replace('\r', '')
>>> 'line 1line 2...'
or you could always get geekier with regexps
This would replicate exactly perl's chomp (minus behavior on arrays) for "\n" line terminator:
def chomp(x):
if x.endswith("\r\n"): return x[:-2]
if x.endswith("\n") or x.endswith("\r"): return x[:-1]
return x
(Note: it does not modify string 'in place'; it does not strip extra trailing whitespace; takes \r\n in account)
you can use strip:
line = line.strip()
demo:
>>> "\n\n hello world \n\n".strip()
'hello world'
rstrip doesn't do the same thing as chomp, on so many levels. Read http://perldoc.perl.org/functions/chomp.html and see that chomp is very complex indeed.
However, my main point is that chomp removes at most 1 line ending, whereas rstrip will remove as many as it can.
Here you can see rstrip removing all the newlines:
>>> 'foo\n\n'.rstrip(os.linesep)
'foo'
A much closer approximation of typical Perl chomp usage can be accomplished with re.sub, like this:
>>> re.sub(os.linesep + r'\Z','','foo\n\n')
'foo\n'
Careful with "foo".rstrip(os.linesep): That will only chomp the newline characters for the platform where your Python is being executed. Imagine you're chimping the lines of a Windows file under Linux, for instance:
$ python
Python 2.7.1 (r271:86832, Mar 18 2011, 09:09:48)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> sys.platform
'linux2'
>>> "foo\r\n".rstrip(os.linesep)
'foo\r'
>>>
Use "foo".rstrip("\r\n") instead, as Mike says above.
An example in Python's documentation simply uses line.strip().
Perl's chomp function removes one linebreak sequence from the end of a string only if it's actually there.
Here is how I plan to do that in Python, if process is conceptually the function that I need in order to do something useful to each line from this file:
import os
sep_pos = -len(os.linesep)
with open("file.txt") as f:
for line in f:
if line[sep_pos:] == os.linesep:
line = line[:sep_pos]
process(line)
I don't program in Python, but I came across an FAQ at python.org advocating S.rstrip("\r\n") for python 2.2 or later.
import re
r_unwanted = re.compile("[\n\t\r]")
r_unwanted.sub("", your_text)
If your question is to clean up all the line breaks in a multiple line str object (oldstr), you can split it into a list according to the delimiter '\n' and then join this list into a new str(newstr).
newstr = "".join(oldstr.split('\n'))
I find it convenient to have be able to get the chomped lines via in iterator, parallel to the way you can get the un-chomped lines from a file object. You can do so with the following code:
def chomped_lines(it):
return map(operator.methodcaller('rstrip', '\r\n'), it)
Sample usage:
with open("file.txt") as infile:
for line in chomped_lines(infile):
process(line)
I'm bubbling up my regular expression based answer from one I posted earlier in the comments of another answer. I think using re is a clearer more explicit solution to this problem than str.rstrip.
>>> import re
If you want to remove one or more trailing newline chars:
>>> re.sub(r'[\n\r]+$', '', '\nx\r\n')
'\nx'
If you want to remove newline chars everywhere (not just trailing):
>>> re.sub(r'[\n\r]+', '', '\nx\r\n')
'x'
If you want to remove only 1-2 trailing newline chars (i.e., \r, \n, \r\n, \n\r, \r\r, \n\n)
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r\n')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n')
'\nx'
I have a feeling what most people really want here, is to remove just one occurrence of a trailing newline character, either \r\n or \n and nothing more.
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n\n', count=1)
'\nx\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n\r\n', count=1)
'\nx\r\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n', count=1)
'\nx'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n', count=1)
'\nx'
(The ?: is to create a non-capturing group.)
(By the way this is not what '...'.rstrip('\n', '').rstrip('\r', '') does which may not be clear to others stumbling upon this thread. str.rstrip strips as many of the trailing characters as possible, so a string like foo\n\n\n would result in a false positive of foo whereas you may have wanted to preserve the other newlines after stripping a single trailing one.)
workaround solution for special case:
if the newline character is the last character (as is the case with most file inputs), then for any element in the collection you can index as follows:
foobar= foobar[:-1]
to slice out your newline character.
It looks like there is not a perfect analog for perl's chomp. In particular, rstrip cannot handle multi-character newline delimiters like \r\n. However, splitlines does as pointed out here.
Following my answer on a different question, you can combine join and splitlines to remove/replace all newlines from a string s:
''.join(s.splitlines())
The following removes exactly one trailing newline (as chomp would, I believe). Passing True as the keepends argument to splitlines retain the delimiters. Then, splitlines is called again to remove the delimiters on just the last "line":
def chomp(s):
if len(s):
lines = s.splitlines(True)
last = lines.pop()
return ''.join(lines + last.splitlines())
else:
return ''
s = '''Hello World \t\n\r\tHi There'''
# import the module string
import string
# use the method translate to convert
s.translate({ord(c): None for c in string.whitespace}
>>'HelloWorldHiThere'
With regex
s = ''' Hello World
\t\n\r\tHi '''
print(re.sub(r"\s+", "", s), sep='') # \s matches all white spaces
>HelloWorldHi
Replace \n,\t,\r
s.replace('\n', '').replace('\t','').replace('\r','')
>' Hello World Hi '
With regex
s = '''Hello World \t\n\r\tHi There'''
regex = re.compile(r'[\n\r\t]')
regex.sub("", s)
>'Hello World Hi There'
with Join
s = '''Hello World \t\n\r\tHi There'''
' '.join(s.split())
>'Hello World Hi There'
>>> ' spacious '.rstrip()
' spacious'
>>> "AABAA".rstrip("A")
'AAB'
>>> "ABBA".rstrip("AB") # both AB and BA are stripped
''
>>> "ABCABBA".rstrip("AB")
'ABC'
Just use :
line = line.rstrip("\n")
or
line = line.strip("\n")
You don't need any of this complicated stuff
There are three types of line endings that we normally encounter: \n, \r and \r\n. A rather simple regular expression in re.sub, namely r"\r?\n?$", is able to catch them all.
(And we gotta catch 'em all, am I right?)
import re
re.sub(r"\r?\n?$", "", the_text, 1)
With the last argument, we limit the number of occurences replaced to one, mimicking chomp to some extent. Example:
import re
text_1 = "hellothere\n\n\n"
text_2 = "hellothere\n\n\r"
text_3 = "hellothere\n\n\r\n"
a = re.sub(r"\r?\n?$", "", text_1, 1)
b = re.sub(r"\r?\n?$", "", text_2, 1)
c = re.sub(r"\r?\n?$", "", text_3, 1)
... where a == b == c is True.
If you are concerned about speed (say you have a looong list of strings) and you know the nature of the newline char, string slicing is actually faster than rstrip. A little test to illustrate this:
import time
loops = 50000000
def method1(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string[:-1]
t1 = time.time()
print('Method 1: ' + str(t1 - t0))
def method2(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string.rstrip()
t1 = time.time()
print('Method 2: ' + str(t1 - t0))
method1()
method2()
Output:
Method 1: 3.92700004578
Method 2: 6.73000001907
This will work both for windows and linux (bit expensive with re sub if you are looking for only re solution)
import re
if re.search("(\\r|)\\n$", line):
line = re.sub("(\\r|)\\n$", "", line)
A catch all:
line = line.rstrip('\r|\n')

Categories