How can I remove the last character of a string if it is a newline?
"abc\n" --> "abc"
Try the method rstrip() (see doc Python 2 and Python 3)
>>> 'test string\n'.rstrip()
'test string'
Python's rstrip() method strips all kinds of trailing whitespace by default, not just one newline as Perl does with chomp.
>>> 'test string \n \r\n\n\r \n\n'.rstrip()
'test string'
To strip only newlines:
>>> 'test string \n \r\n\n\r \n\n'.rstrip('\n')
'test string \n \r\n\n\r '
In addition to rstrip(), there are also the methods strip() and lstrip(). Here is an example with the three of them:
>>> s = " \n\r\n \n abc def \n\r\n \n "
>>> s.strip()
'abc def'
>>> s.lstrip()
'abc def \n\r\n \n '
>>> s.rstrip()
' \n\r\n \n abc def'
And I would say the "pythonic" way to get lines without trailing newline characters is splitlines().
>>> text = "line 1\nline 2\r\nline 3\nline 4"
>>> text.splitlines()
['line 1', 'line 2', 'line 3', 'line 4']
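One caveat, sketched here with a made-up sample string: splitlines() recognises more line boundaries than just \n and \r\n (form feed \x0c is one of them), so it can split in places where split('\n') would not.

```python
# Hypothetical sample text with a form feed in the middle of the second line.
text = "line 1\nline 2\x0cstill line 2?\r\nline 3"

by_splitlines = text.splitlines()   # splits on \n, \r\n and \x0c
by_newline = text.split("\n")       # splits on \n only, leaves \r behind

print(by_splitlines)
print(by_newline)
```

If the data may contain such characters and only real newlines should count, splitting on '\n' and stripping '\r' by hand is the more predictable route.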
The canonical way to strip end-of-line (EOL) characters is to use the string rstrip() method, removing any trailing \r or \n. Here are examples for classic Mac OS, Windows, and Unix EOL characters.
>>> 'Mac EOL\r'.rstrip('\r\n')
'Mac EOL'
>>> 'Windows EOL\r\n'.rstrip('\r\n')
'Windows EOL'
>>> 'Unix EOL\n'.rstrip('\r\n')
'Unix EOL'
Using '\r\n' as the parameter to rstrip means that it will strip out any trailing combination of '\r' or '\n'. That's why it works in all three cases above.
This nuance matters in rare cases. For example, I once had to process a text file which contained an HL7 message. The HL7 standard requires a trailing '\r' as its EOL character. The Windows machine on which I was using this message had appended its own '\r\n' EOL character. Therefore, the end of each line looked like '\r\r\n'. Using rstrip('\r\n') would have taken off the entire '\r\r\n' which is not what I wanted. In that case, I simply sliced off the last two characters instead.
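A minimal sketch of that situation, using a made-up stand-in line rather than a real HL7 message. The helper name strip_one_crlf is hypothetical; it checks for the ending before slicing, which is a little safer than slicing unconditionally.

```python
def strip_one_crlf(line):
    # Remove a single trailing CR LF pair and leave any earlier CR intact.
    if line.endswith("\r\n"):
        return line[:-2]
    return line

raw = "segment\r\r\n"          # made-up line: payload, its own CR, then Windows CR LF

greedy = raw.rstrip("\r\n")    # strips all three trailing characters
exact = strip_one_crlf(raw)    # strips only the appended CR LF

print(repr(greedy), repr(exact))
```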
Note that unlike Perl's chomp function, this will strip all specified characters at the end of the string, not just one:
>>> "Hello\n\n\n".rstrip("\n")
'Hello'
Note that rstrip doesn't act exactly like Perl's chomp() because it doesn't modify the string. That is, in Perl:
$x="a\n";
chomp $x
results in $x being "a".
but in Python:
x="a\n"
x.rstrip()
will mean that the value of x is still "a\n". Even x=x.rstrip() doesn't always give the same result, as it strips all whitespace from the end of the string, not just one newline at most.
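Both differences can be seen side by side in a short sketch. The chomp helper here is only an illustration of "remove at most one line ending", not Perl's full behaviour.

```python
def chomp(s):
    """Return s without one trailing line ending, if it has one."""
    # Check the two-character ending first so it is removed as a unit.
    for ending in ("\r\n", "\n", "\r"):
        if s.endswith(ending):
            return s[:-len(ending)]
    return s

x = "a \n\n"
x.rstrip()             # the result is thrown away; x is unchanged
after_bare_call = x
stripped = x.rstrip()  # drops the space and both newlines
chomped = chomp(x)     # drops only the final newline

print(repr(after_bare_call), repr(stripped), repr(chomped))
```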
I might use something like this:
import os
s = s.rstrip(os.linesep)
I think the problem with rstrip("\n") is that you'll probably want to make sure the line separator is portable (some antiquated systems are rumored to use "\r\n"). The other gotcha is that rstrip will strip out repeated whitespace. Hopefully os.linesep will contain the right characters. The above works for me.
You may use line = line.rstrip('\n'). This will strip all newlines from the end of the string, not just one.
s = s.rstrip()
will remove all trailing whitespace, including newlines, from the end of the string s. The assignment is needed because rstrip returns a new string instead of modifying the original string.
>>> "line 1\nline 2\r\n...".replace('\n', '').replace('\r', '')
'line 1line 2...'
or you could always get geekier with regexps
This would replicate exactly perl's chomp (minus behavior on arrays) for "\n" line terminator:
def chomp(x):
    if x.endswith("\r\n"): return x[:-2]
    if x.endswith("\n") or x.endswith("\r"): return x[:-1]
    return x
(Note: it does not modify string 'in place'; it does not strip extra trailing whitespace; takes \r\n in account)
you can use strip:
line = line.strip()
demo:
>>> "\n\n hello world \n\n".strip()
'hello world'
rstrip doesn't do the same thing as chomp, on so many levels. Read http://perldoc.perl.org/functions/chomp.html and see that chomp is very complex indeed.
However, my main point is that chomp removes at most 1 line ending, whereas rstrip will remove as many as it can.
Here you can see rstrip removing all the newlines:
>>> import os, re
>>> 'foo\n\n'.rstrip(os.linesep)
'foo'
A much closer approximation of typical Perl chomp usage can be accomplished with re.sub, like this:
>>> re.sub(os.linesep + r'\Z','','foo\n\n')
'foo\n'
Careful with "foo".rstrip(os.linesep): That will only chomp the newline characters for the platform where your Python is being executed. Imagine you're chomping the lines of a Windows file under Linux, for instance:
$ python
Python 2.7.1 (r271:86832, Mar 18 2011, 09:09:48)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> sys.platform
'linux2'
>>> "foo\r\n".rstrip(os.linesep)
'foo\r'
>>>
Use "foo".rstrip("\r\n") instead, as Mike says above.
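For what it's worth, on Python 3 the issue often does not come up when reading files, because text mode translates line endings by default. A small sketch, assuming Python 3 and a throwaway temporary file:

```python
import os
import tempfile

# Write a small file with Windows line endings, byte for byte.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"foo\r\nbar\r\n")
    path = tmp.name

try:
    # Text mode with the default newline=None translates "\r\n" to "\n".
    with open(path) as f:
        translated = f.readlines()
    # newline="" turns translation off, so the "\r" is still there.
    with open(path, newline="") as f:
        untranslated = f.readlines()
finally:
    os.remove(path)

# rstrip("\r\n") cleans both variants, whatever the platform.
cleaned = [line.rstrip("\r\n") for line in untranslated]
print(translated, untranslated, cleaned)
```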
An example in Python's documentation simply uses line.strip().
Perl's chomp function removes one linebreak sequence from the end of a string only if it's actually there.
Here is how I plan to do that in Python, if process is conceptually the function that I need in order to do something useful to each line from this file:
import os
sep_pos = -len(os.linesep)
with open("file.txt") as f:
    for line in f:
        if line[sep_pos:] == os.linesep:
            line = line[:sep_pos]
        process(line)
I don't program in Python, but I came across an FAQ at python.org advocating S.rstrip("\r\n") for python 2.2 or later.
import re
r_unwanted = re.compile("[\n\t\r]")
r_unwanted.sub("", your_text)
If your question is to clean up all the line breaks in a multiple line str object (oldstr), you can split it into a list according to the delimiter '\n' and then join this list into a new str(newstr).
newstr = "".join(oldstr.split('\n'))
I find it convenient to be able to get the chomped lines via an iterator, parallel to the way you can get the un-chomped lines from a file object. You can do so with the following code:
import operator

def chomped_lines(it):
    return map(operator.methodcaller('rstrip', '\r\n'), it)
Sample usage:
with open("file.txt") as infile:
    for line in chomped_lines(infile):
        process(line)
I'm bubbling up my regular expression based answer from one I posted earlier in the comments of another answer. I think using re is a clearer more explicit solution to this problem than str.rstrip.
>>> import re
If you want to remove one or more trailing newline chars:
>>> re.sub(r'[\n\r]+$', '', '\nx\r\n')
'\nx'
If you want to remove newline chars everywhere (not just trailing):
>>> re.sub(r'[\n\r]+', '', '\nx\r\n')
'x'
If you want to remove only 1-2 trailing newline chars (i.e., \r, \n, \r\n, \n\r, \r\r, \n\n)
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r\n')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n')
'\nx'
I have a feeling what most people really want here, is to remove just one occurrence of a trailing newline character, either \r\n or \n and nothing more.
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n\n', count=1)
'\nx\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n\r\n', count=1)
'\nx\r\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n', count=1)
'\nx'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n', count=1)
'\nx'
(The ?: is to create a non-capturing group.)
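A side note on why count=1 matters in these examples: $ matches both at the very end of the string and just before a final newline, so without a count the substitution can fire twice. One possible alternative, sketched here, is to anchor with \Z, which matches only at the very end. The helper name strip_one_eol is made up for illustration.

```python
import re

# \Z matches only at the very end of the string, so at most one
# line ending can ever be removed and no count argument is needed.
ONE_EOL_AT_END = re.compile(r"(?:\r\n|\n)\Z")

def strip_one_eol(s):
    return ONE_EOL_AT_END.sub("", s)

# With $ and no count, both trailing newlines are removed.
with_dollar = re.sub(r"(?:\r\n|\n)$", "", "\nx\n\n")

print(repr(with_dollar))
print(repr(strip_one_eol("\nx\n\n")))
print(repr(strip_one_eol("\nx\r\n\r\n")))
```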
(By the way, this is not what '...'.rstrip('\n').rstrip('\r') does, which may not be clear to others stumbling upon this thread. str.rstrip strips as many of the trailing characters as possible, so a string like foo\n\n\n would be reduced to foo, whereas you may have wanted to preserve the other newlines after stripping a single trailing one.)
workaround solution for special case:
if the newline character is the last character (as is the case with most file inputs), then for any element in the collection you can index as follows:
foobar= foobar[:-1]
to slice out your newline character.
It looks like there is not a perfect analog for Perl's chomp. In particular, rstrip treats its argument as a set of characters, so it cannot remove exactly one multi-character newline delimiter like \r\n. However, splitlines does, as pointed out here.
Following my answer on a different question, you can combine join and splitlines to remove/replace all newlines from a string s:
''.join(s.splitlines())
The following removes exactly one trailing newline (as chomp would, I believe). Passing True as the keepends argument to splitlines retains the delimiters. Then, splitlines is called again to remove the delimiters on just the last "line":
def chomp(s):
    if len(s):
        lines = s.splitlines(True)
        last = lines.pop()
        return ''.join(lines + last.splitlines())
    else:
        return ''
s = '''Hello World \t\n\r\tHi There'''
# import the module string
import string
# use the method translate to delete every whitespace character
s.translate({ord(c): None for c in string.whitespace})
>'HelloWorldHiThere'
With regex
import re
s = ''' Hello World
\t\n\r\tHi '''
print(re.sub(r"\s+", "", s), sep='') # \s matches all white spaces
>HelloWorldHi
Replace \n,\t,\r
s.replace('\n', '').replace('\t','').replace('\r','')
>' Hello World Hi '
With regex
s = '''Hello World \t\n\r\tHi There'''
regex = re.compile(r'[\n\r\t]')
regex.sub("", s)
>'Hello World Hi There'
with Join
s = '''Hello World \t\n\r\tHi There'''
' '.join(s.split())
>'Hello World Hi There'
>>> ' spacious '.rstrip()
' spacious'
>>> "AABAA".rstrip("A")
'AAB'
>>> "ABBA".rstrip("AB") # both AB and BA are stripped
''
>>> "ABCABBA".rstrip("AB")
'ABC'
Just use :
line = line.rstrip("\n")
or
line = line.strip("\n")
You don't need any of this complicated stuff
There are three types of line endings that we normally encounter: \n, \r and \r\n. A rather simple regular expression in re.sub, namely r"\r?\n?$", is able to catch them all.
(And we gotta catch 'em all, am I right?)
import re
re.sub(r"\r?\n?$", "", the_text, count=1)
With the count argument, we limit the number of occurrences replaced to one, mimicking chomp to some extent. Example:
import re
text_1 = "hellothere\n\n\n"
text_2 = "hellothere\n\n\r"
text_3 = "hellothere\n\n\r\n"
a = re.sub(r"\r?\n?$", "", text_1, count=1)
b = re.sub(r"\r?\n?$", "", text_2, count=1)
c = re.sub(r"\r?\n?$", "", text_3, count=1)
... where a == b == c is True.
If you are concerned about speed (say you have a looong list of strings) and you know the nature of the newline char, string slicing is actually faster than rstrip. A little test to illustrate this:
import time
loops = 50000000
def method1(loops=loops):
    test_string = 'num\n'
    t0 = time.time()
    for num in range(loops):  # use xrange on Python 2
        out_string = test_string[:-1]
    t1 = time.time()
    print('Method 1: ' + str(t1 - t0))
def method2(loops=loops):
    test_string = 'num\n'
    t0 = time.time()
    for num in range(loops):  # use xrange on Python 2
        out_string = test_string.rstrip()
    t1 = time.time()
    print('Method 2: ' + str(t1 - t0))
method1()
method2()
Output:
Method 1: 3.92700004578
Method 2: 6.73000001907
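The same comparison can be written more compactly with the standard timeit module. This is only a sketch: the loop count is reduced to keep the run short, and the numbers will vary by machine and Python version, so no timings are shown.

```python
import timeit

test_string = "num\n"
loops = 100_000  # far fewer iterations than the test above, to keep it quick

# timeit.timeit calls the function `number` times and returns total seconds.
slice_time = timeit.timeit(lambda: test_string[:-1], number=loops)
rstrip_time = timeit.timeit(lambda: test_string.rstrip(), number=loops)

print("slice:  {:.4f} s".format(slice_time))
print("rstrip: {:.4f} s".format(rstrip_time))
```

Keep in mind that slicing blindly removes the last character, so it is only equivalent when the string is known to end in exactly one newline.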
This works for both Windows and Linux line endings (re.sub is a bit expensive, if you are specifically after a regex-only solution):
import re
if re.search("(\\r|)\\n$", line):
    line = re.sub("(\\r|)\\n$", "", line)
A catch all:
line = line.rstrip('\r\n')
So I need to compare 2 strings :
str1 = 'this is my string/ndone'
str2 = 'this is my string done'
So I replace the new line from str1 with ' ':
new_str = str1.replace('\n', ' ')
And when I print the two strings, they look identical:
'this is my string done'
But when compared using the == operator they are not equal, so I converted the two strings into byte arrays to see why:
arr1 = bytearray(str1 , 'utf-8')
print(arr1)
arr2 = bytearray(str2 , 'utf-8')
print(arr2)
And this is the output:
str1 = bytearray(b'this is\xc2\xa0my string done')
str2 = bytearray(b'this is my string done')
So what is this \xc2\xa0 ?
'\xc2\xa0' is the UTF-8 encoding of the Unicode character 'NO-BREAK SPACE' (U+00A0).
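A minimal sketch of one way to deal with it, assuming Python 3 and that a plain space is what you want in its place. The value of str1 is reconstructed from the bytes printed above.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE

str1 = "this is" + NBSP + "my string done"  # reconstructed from the byte dump
str2 = "this is my string done"

encoded = str1.encode("utf-8")        # contains the bytes \xc2\xa0
normalized = str1.replace(NBSP, " ")  # swap the no-break space for a plain one

print(encoded)
print(str1 == str2, normalized == str2)
```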
Use the Python unidecode library:
from unidecode import unidecode
text = "this is\xc2\xa0my string done"
print(unidecode(text))
Output:
this isA my string done
== does work for comparing two strings:
str1 = 'this is my string\ndone'
str2 = 'this is my string done'
str1 = str1.replace("\n"," ")
print(str1)
if (str1 == str2):
    print("y")
else:
    print("n")
and output is
this is my string done
y
As stated elsewhere your string had a "/n" not "\n" in it.
Assuming though that what you wanted to do was normalise all whitespace characters, this is a very handy trick I use all the time:
string = ' '.join(string.split())
Update: Okay this is why:
If you don't specify what string.split() should use as a separator then, per the docs:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
So it splits on whitespace, and treats multiple whitespace characters as a single separator. I don't know exactly which characters are defined as "whitespace", but it certainly includes all the usual suspects. Then when you rejoin the list into a string with ' '.join(), you know for sure that all whitespace is now the same.
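As a quick illustration (Python 3, with a made-up messy string), the trick handles tabs, newlines, carriage returns and even the no-break space, since split() with no argument splits on the characters for which str.isspace() is true. The function name is hypothetical.

```python
def normalize_whitespace(text):
    # split() with no argument drops leading/trailing whitespace and
    # treats every run of whitespace as one separator.
    return " ".join(text.split())

messy = "  this is\tmy\u00a0string\r\n\ndone  "
clean = normalize_whitespace(messy)

print(repr(clean))
```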
I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, the operation has not been applied and the whole string is printed.
I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
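Another regex-free option is str.partition, sketched below. Unlike index(), it does not raise when the marker is missing; returning the text unchanged in that case is an assumption of this sketch, and text_after is a made-up helper name.

```python
def text_after(marker, text):
    # partition() returns (before, marker, after); the middle element
    # is empty when the marker does not occur.
    before, found, after = text.partition(marker)
    return after if found else text

string = """ something
other things
etc. */ extra text.
"""

result = text_after("*/", string).strip()
print(result)
```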
The problem with your first regex is that . does not match newlines as you noticed. With your second one, you were closer but forgot the * that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in python regex matches everything except newlines. For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub(r"[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.
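To see the newline behaviour of . directly, here is a small sketch comparing the same pattern with and without re.DOTALL on the string from the question:

```python
import re

text = " something\nother things\netc. */ extra text.\n"

# Without DOTALL, "." stops at the first newline, so "^.*\*/" never
# reaches the "*/" on the third line and nothing is replaced.
without_dotall = re.sub(r"^.*\*/", "", text)

# With DOTALL, "." also matches newlines and the prefix is removed.
with_dotall = re.sub(r"^.*\*/", "", text, flags=re.DOTALL)

print(repr(without_dotall))
print(repr(with_dotall))
```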
to keep text until that symbol you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.
string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
gives output as
' extra text. \n'
I'm trying to remove the following punctuation in Python. I need to use the replace method to remove these punctuation characters and replace them with whitespace: ,.:;'"-?!/
here is my code:
text_punct_removed = raw_text.replace(".", "")
text_punct_removed = raw_text.replace("!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
It only removes the last one I try to replace, so I tried combining them:
text_punct_removed = raw_text.replace(".", "" , "!", "")
print("\ntext with punctuation characters removed:\n", text_punct_removed)
but I get an error message. How do I remove multiple punctuation characters? Also, there will be an issue if I put the " in quotes like this """, which will make a comment. Is there a way around that? Thanks.
If you don't need to explicitly use replace:
exclude = set(",.:;'\"-?!/")
text = "".join([(ch if ch not in exclude else " ") for ch in text])
Here's a naive but working solution:
for sp in '.,"':
    raw_text = raw_text.replace(sp, '')
If you need to replace all punctuations with space, you can use the built-in punctuation list to replace the string:
Python 3
import string
import re
my_string = "(I hope...this works!)"
translator = re.compile('[%s]' % re.escape(string.punctuation))
my_string = translator.sub(' ', my_string)
print(my_string)
# Result:
# I hope this works
After, if you want to remove double spaces inside string, you can make:
my_string = re.sub(' +',' ', my_string).strip()
print(my_string)
# Result:
# I hope this works
This works in Python3.5.3:
from string import punctuation
raw_text_with_punctuations = "text, with: punctuation; characters? all over ,.:;'\"-?!/"
print(raw_text_with_punctuations)
for char in punctuation:
    raw_text_with_punctuations = raw_text_with_punctuations.replace(char, '')
print(raw_text_with_punctuations)
Either remove one character at a time:
raw_text.replace(".", "").replace("!", "")
Or, better, use regular expressions (re.sub()):
re.sub(r"\.|!", "", raw_text)
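If the goal is to turn each of the listed characters into a space in one pass, str.translate with a table from str.maketrans is another option on Python 3. A sketch with a made-up sample sentence; note that the double quote is simply escaped with a backslash inside the string literal, which also answers the """ concern.

```python
# The characters from the question; \" is an escaped double quote.
PUNCTUATION = ",.:;'\"-?!/"

# Map every punctuation character to a single space.
TO_SPACES = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))

def punctuation_to_spaces(text):
    return text.translate(TO_SPACES)

sample = 'Wait, what?! "Yes" - no/maybe.'   # hypothetical input
result = punctuation_to_spaces(sample)

print(repr(result))
```

The output keeps the same length as the input; wrapping the result in " ".join(result.split()) would collapse the doubled spaces if those are unwanted.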
Basically, I'm asking the user to input a string of text into the console, but the string is very long and includes many line breaks. How would I take the user's string and delete all line breaks to make it a single line of text? My method for acquiring the string is very simple.
string = raw_input("Please enter string: ")
Is there a different way I should be grabbing the string from the user? I'm running Python 2.7.4 on a Mac.
P.S. Clearly I'm a noob, so even if a solution isn't the most efficient, the one that uses the most simple syntax would be appreciated.
How do you enter line breaks with raw_input? But, once you have a string with some characters in it you want to get rid of, just replace them.
>>> mystr = raw_input('please enter string: ')
please enter string: hello world, how do i enter line breaks?
>>> # pressing enter didn't work...
...
>>> mystr
'hello world, how do i enter line breaks?'
>>> mystr.replace(' ', '')
'helloworld,howdoienterlinebreaks?'
>>>
In the example above, I replaced all spaces. The string '\n' represents newlines. And \r represents carriage returns (if you're on windows, you might be getting these and a second replace will handle them for you!).
basically:
# you probably want to use a space ' ' to replace `\n`
mystring = mystring.replace('\n', ' ').replace('\r', '')
Note also, that it is a bad idea to call your variable string, as this shadows the module string. Another name I'd avoid but would love to use sometimes: file. For the same reason.
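On the question of getting multi-line text in at all: raw_input (input on Python 3) stops at the first Enter, so one possible approach is to read from sys.stdin until end-of-file (Ctrl-D in a Mac terminal). A sketch, where the function name is made up and an in-memory stream stands in for the console so it runs without a terminal:

```python
import io
import sys

def read_as_single_line(stream=sys.stdin):
    # read() consumes everything up to end-of-file; splitlines() then
    # handles "\n", "\r\n" and "\r" alike.
    return " ".join(stream.read().splitlines())

# Stand-in for pasted console input.
fake_input = io.StringIO("first line\nsecond line\r\nthird line\n")
single = read_as_single_line(fake_input)

print(single)
```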
You can try using string replace:
string = string.replace('\r', '').replace('\n', '')
You can split the string with no separator arg, which will treat consecutive whitespace as a single separator (including newlines and tabs). Then join using a space:
In : " ".join("\n\nsome text \r\n with multiple whitespace".split())
Out: 'some text with multiple whitespace'
https://docs.python.org/2/library/stdtypes.html#str.split
The canonical answer, in Python, would be:
s = ''.join(s.splitlines())
It splits the string into lines (letting Python do it according to its own best practices). Then you merge them. Two possibilities here:
replace each newline with a space (' '.join())
or join without a space (''.join())
updated based on Xbello comment:
string = my_string.rstrip('\r\n')
Another option is regex:
>>> import re
>>> re.sub("\n|\r", "", "Foo\n\rbar\n\rbaz\n\r")
'Foobarbaz'
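If your string contains the literal two-character sequences \n and \r (a backslash followed by a letter) rather than real newline characters, use the raw strings r'\n' and r'\r' with replace instead of '\n' and '\r':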
mystring = mystring.replace(r'\n', ' ').replace(r'\r', '')
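The difference between the two spellings is easy to check. In this sketch, '\n' is one newline character, while r'\n' is a backslash followed by the letter n, so each replace only finds its own kind:

```python
real_newline = "a\nb"      # three characters: a, newline, b
literal_escape = r"a\nb"   # four characters: a, backslash, n, b

fixed_real = real_newline.replace("\n", " ")
fixed_literal = literal_escape.replace(r"\n", " ")

# The raw pattern does not match a real newline, so nothing changes here.
untouched = real_newline.replace(r"\n", " ")

print(len(real_newline), len(literal_escape))
print(repr(fixed_real), repr(fixed_literal), repr(untouched))
```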
A method that takes into consideration:
additional whitespace characters at the beginning/end of the string
additional whitespace characters at the beginning/end of every line
various end-of-line characters
It takes a multi-line string which may be messy, e.g.
test_str = '\nhej ho \n aaa\r\n a\n '
and produces a clean one-line string:
>>> ' '.join([line.strip() for line in test_str.strip().splitlines()])
'hej ho aaa a'
UPDATE:
To fix multiple new-line character producing redundant spaces:
' '.join([line.strip() for line in test_str.strip().splitlines() if line.strip()])
This works for the following too
test_str = '\nhej ho \n aaa\r\n\n\n\n\n a\n '
Regular expressions are a fast way to do this:
import re
s='''some kind of
string with a bunch\r of
extra spaces in it'''
re.sub(r'\s(?=\s)','',re.sub(r'\s',' ',s))
result:
'some kind of string with a bunch of extra spaces in it'
The problem with rstrip() is that it only touches the end of the string, so it does not work in all cases (I have run into a few myself). Instead you can use
text = text.replace("\n"," ")
This will replace every newline '\n' with a space.
You really don't need to remove ALL the line-break signs: LF, CR, CRLF.
# Pythonic:
r'\n', r'\r', r'\r\n'
Some texts must keep their breaks, but you probably need to join broken lines to keep individual sentences together.
It is natural for a line break to follow a period, semicolon, or colon, but not a comma.
My code takes these conditions into account. It works well with text copied from PDFs.
Enjoy!
import re

def unbreak_pdf_text(raw_text):
    """ the newline careful sign removal tool
    Args:
        raw_text (str): string containing unwanted newline signs: \\n or \\r or \\r\\n
            e.g. imported from OCR or copied from a pdf document.
    Returns:
        str: the text with line breaks after a comma, space, or word character replaced by a space.
    """
    # Try \r\n first so a Windows line ending is replaced as one unit.
    pat = re.compile(r"[, \w]\r\n|[, \w]\n|[, \w]\r")
    processed_text = raw_text
    for i in pat.finditer(raw_text):
        processed_text = processed_text.replace(i.group(), i.group()[0] + " ")
    return processed_text
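The same idea can also be sketched as a single substitution with a lookbehind, which avoids the loop. This is a variant under the same assumptions (a break is "soft" when it follows a comma, a space, or a word character); the names and the sample text are made up.

```python
import re

# A line break preceded by a comma, a space, or a word character is
# treated as a soft break; \r\n is tried first so it goes as one unit.
SOFT_BREAK = re.compile(r"(?<=[,\w ])(?:\r\n|\r|\n)")

def join_soft_breaks(raw_text):
    return SOFT_BREAK.sub(" ", raw_text)

sample = "The quick brown fox,\njumps over\r\nthe lazy dog.\nNext sentence."
joined = join_soft_breaks(sample)

print(joined)
```

Breaks after a period, semicolon, or colon are left alone, so sentence and paragraph boundaries survive.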