python regex: unexpected character after line continuation character - python

import re
def preprocessor(text):
text = re.sub('<[^>]*>', '', text)
emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
text = re.sub('[\W]+', ' ', text.lower()) + \'.join(emoticons).replace('-', '')
return text
I am getting an error in line 5 saying "unexpected character after line continuation character". Please, somebody help me out.

Your join() statement is not properly formed. Because you have a bare \, the interpreter thinks you're invoking the line continuation character, e.g.:
print("a \
b")
# a b
If you want to join on \, use:
print("\\".join(['a','b']))
# a\b
In your case:
'\\'.join(emoticons)

Related

Python print list [duplicate]

How can I remove the last character of a string if it is a newline?
"abc\n" --> "abc"
Try the method rstrip() (see doc Python 2 and Python 3)
>>> 'test string\n'.rstrip()
'test string'
Python's rstrip() method strips all kinds of trailing whitespace by default, not just one newline as Perl does with chomp.
>>> 'test string \n \r\n\n\r \n\n'.rstrip()
'test string'
To strip only newlines:
>>> 'test string \n \r\n\n\r \n\n'.rstrip('\n')
'test string \n \r\n\n\r '
In addition to rstrip(), there are also the methods strip() and lstrip(). Here is an example with the three of them:
>>> s = " \n\r\n \n abc def \n\r\n \n "
>>> s.strip()
'abc def'
>>> s.lstrip()
'abc def \n\r\n \n '
>>> s.rstrip()
' \n\r\n \n abc def'
And I would say the "pythonic" way to get lines without trailing newline characters is splitlines().
>>> text = "line 1\nline 2\r\nline 3\nline 4"
>>> text.splitlines()
['line 1', 'line 2', 'line 3', 'line 4']
The canonical way to strip end-of-line (EOL) characters is to use the string rstrip() method removing any trailing \r or \n. Here are examples for Mac, Windows, and Unix EOL characters.
>>> 'Mac EOL\r'.rstrip('\r\n')
'Mac EOL'
>>> 'Windows EOL\r\n'.rstrip('\r\n')
'Windows EOL'
>>> 'Unix EOL\n'.rstrip('\r\n')
'Unix EOL'
Using '\r\n' as the parameter to rstrip means that it will strip out any trailing combination of '\r' or '\n'. That's why it works in all three cases above.
This nuance matters in rare cases. For example, I once had to process a text file which contained an HL7 message. The HL7 standard requires a trailing '\r' as its EOL character. The Windows machine on which I was using this message had appended its own '\r\n' EOL character. Therefore, the end of each line looked like '\r\r\n'. Using rstrip('\r\n') would have taken off the entire '\r\r\n' which is not what I wanted. In that case, I simply sliced off the last two characters instead.
Note that unlike Perl's chomp function, this will strip all specified characters at the end of the string, not just one:
>>> "Hello\n\n\n".rstrip("\n")
"Hello"
Note that rstrip doesn't act exactly like Perl's chomp() because it doesn't modify the string. That is, in Perl:
$x="a\n";
chomp $x
results in $x being "a".
but in Python:
x="a\n"
x.rstrip()
will mean that the value of x is still "a\n". Even x=x.rstrip() doesn't always give the same result, as it strips all whitespace from the end of the string, not just one newline at most.
I might use something like this:
import os
s = s.rstrip(os.linesep)
I think the problem with rstrip("\n") is that you'll probably want to make sure the line separator is portable. (some antiquated systems are rumored to use "\r\n"). The other gotcha is that rstrip will strip out repeated whitespace. Hopefully os.linesep will contain the right characters. the above works for me.
You may use line = line.rstrip('\n'). This will strip all newlines from the end of the string, not just one.
s = s.rstrip()
will remove all newlines at the end of the string s. The assignment is needed because rstrip returns a new string instead of modifying the original string.
"line 1\nline 2\r\n...".replace('\n', '').replace('\r', '')
>>> 'line 1line 2...'
or you could always get geekier with regexps
This would replicate exactly perl's chomp (minus behavior on arrays) for "\n" line terminator:
def chomp(x):
if x.endswith("\r\n"): return x[:-2]
if x.endswith("\n") or x.endswith("\r"): return x[:-1]
return x
(Note: it does not modify string 'in place'; it does not strip extra trailing whitespace; takes \r\n in account)
you can use strip:
line = line.strip()
demo:
>>> "\n\n hello world \n\n".strip()
'hello world'
rstrip doesn't do the same thing as chomp, on so many levels. Read http://perldoc.perl.org/functions/chomp.html and see that chomp is very complex indeed.
However, my main point is that chomp removes at most 1 line ending, whereas rstrip will remove as many as it can.
Here you can see rstrip removing all the newlines:
>>> 'foo\n\n'.rstrip(os.linesep)
'foo'
A much closer approximation of typical Perl chomp usage can be accomplished with re.sub, like this:
>>> re.sub(os.linesep + r'\Z','','foo\n\n')
'foo\n'
Careful with "foo".rstrip(os.linesep): That will only chomp the newline characters for the platform where your Python is being executed. Imagine you're chimping the lines of a Windows file under Linux, for instance:
$ python
Python 2.7.1 (r271:86832, Mar 18 2011, 09:09:48)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> sys.platform
'linux2'
>>> "foo\r\n".rstrip(os.linesep)
'foo\r'
>>>
Use "foo".rstrip("\r\n") instead, as Mike says above.
An example in Python's documentation simply uses line.strip().
Perl's chomp function removes one linebreak sequence from the end of a string only if it's actually there.
Here is how I plan to do that in Python, if process is conceptually the function that I need in order to do something useful to each line from this file:
import os
sep_pos = -len(os.linesep)
with open("file.txt") as f:
for line in f:
if line[sep_pos:] == os.linesep:
line = line[:sep_pos]
process(line)
I don't program in Python, but I came across an FAQ at python.org advocating S.rstrip("\r\n") for python 2.2 or later.
import re
r_unwanted = re.compile("[\n\t\r]")
r_unwanted.sub("", your_text)
If your question is to clean up all the line breaks in a multiple line str object (oldstr), you can split it into a list according to the delimiter '\n' and then join this list into a new str(newstr).
newstr = "".join(oldstr.split('\n'))
I find it convenient to have be able to get the chomped lines via in iterator, parallel to the way you can get the un-chomped lines from a file object. You can do so with the following code:
def chomped_lines(it):
return map(operator.methodcaller('rstrip', '\r\n'), it)
Sample usage:
with open("file.txt") as infile:
for line in chomped_lines(infile):
process(line)
I'm bubbling up my regular expression based answer from one I posted earlier in the comments of another answer. I think using re is a clearer more explicit solution to this problem than str.rstrip.
>>> import re
If you want to remove one or more trailing newline chars:
>>> re.sub(r'[\n\r]+$', '', '\nx\r\n')
'\nx'
If you want to remove newline chars everywhere (not just trailing):
>>> re.sub(r'[\n\r]+', '', '\nx\r\n')
'x'
If you want to remove only 1-2 trailing newline chars (i.e., \r, \n, \r\n, \n\r, \r\r, \n\n)
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r\n')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n')
'\nx'
I have a feeling what most people really want here, is to remove just one occurrence of a trailing newline character, either \r\n or \n and nothing more.
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n\n', count=1)
'\nx\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n\r\n', count=1)
'\nx\r\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n', count=1)
'\nx'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n', count=1)
'\nx'
(The ?: is to create a non-capturing group.)
(By the way this is not what '...'.rstrip('\n', '').rstrip('\r', '') does which may not be clear to others stumbling upon this thread. str.rstrip strips as many of the trailing characters as possible, so a string like foo\n\n\n would result in a false positive of foo whereas you may have wanted to preserve the other newlines after stripping a single trailing one.)
workaround solution for special case:
if the newline character is the last character (as is the case with most file inputs), then for any element in the collection you can index as follows:
foobar= foobar[:-1]
to slice out your newline character.
It looks like there is not a perfect analog for perl's chomp. In particular, rstrip cannot handle multi-character newline delimiters like \r\n. However, splitlines does as pointed out here.
Following my answer on a different question, you can combine join and splitlines to remove/replace all newlines from a string s:
''.join(s.splitlines())
The following removes exactly one trailing newline (as chomp would, I believe). Passing True as the keepends argument to splitlines retain the delimiters. Then, splitlines is called again to remove the delimiters on just the last "line":
def chomp(s):
if len(s):
lines = s.splitlines(True)
last = lines.pop()
return ''.join(lines + last.splitlines())
else:
return ''
s = '''Hello World \t\n\r\tHi There'''
# import the module string
import string
# use the method translate to convert
s.translate({ord(c): None for c in string.whitespace}
>>'HelloWorldHiThere'
With regex
s = ''' Hello World
\t\n\r\tHi '''
print(re.sub(r"\s+", "", s), sep='') # \s matches all white spaces
>HelloWorldHi
Replace \n,\t,\r
s.replace('\n', '').replace('\t','').replace('\r','')
>' Hello World Hi '
With regex
s = '''Hello World \t\n\r\tHi There'''
regex = re.compile(r'[\n\r\t]')
regex.sub("", s)
>'Hello World Hi There'
with Join
s = '''Hello World \t\n\r\tHi There'''
' '.join(s.split())
>'Hello World Hi There'
>>> ' spacious '.rstrip()
' spacious'
>>> "AABAA".rstrip("A")
'AAB'
>>> "ABBA".rstrip("AB") # both AB and BA are stripped
''
>>> "ABCABBA".rstrip("AB")
'ABC'
Just use :
line = line.rstrip("\n")
or
line = line.strip("\n")
You don't need any of this complicated stuff
There are three types of line endings that we normally encounter: \n, \r and \r\n. A rather simple regular expression in re.sub, namely r"\r?\n?$", is able to catch them all.
(And we gotta catch 'em all, am I right?)
import re
re.sub(r"\r?\n?$", "", the_text, 1)
With the last argument, we limit the number of occurences replaced to one, mimicking chomp to some extent. Example:
import re
text_1 = "hellothere\n\n\n"
text_2 = "hellothere\n\n\r"
text_3 = "hellothere\n\n\r\n"
a = re.sub(r"\r?\n?$", "", text_1, 1)
b = re.sub(r"\r?\n?$", "", text_2, 1)
c = re.sub(r"\r?\n?$", "", text_3, 1)
... where a == b == c is True.
If you are concerned about speed (say you have a looong list of strings) and you know the nature of the newline char, string slicing is actually faster than rstrip. A little test to illustrate this:
import time
loops = 50000000
def method1(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string[:-1]
t1 = time.time()
print('Method 1: ' + str(t1 - t0))
def method2(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string.rstrip()
t1 = time.time()
print('Method 2: ' + str(t1 - t0))
method1()
method2()
Output:
Method 1: 3.92700004578
Method 2: 6.73000001907
This will work both for windows and linux (bit expensive with re sub if you are looking for only re solution)
import re
if re.search("(\\r|)\\n$", line):
line = re.sub("(\\r|)\\n$", "", line)
A catch all:
line = line.rstrip('\r|\n')

Avoid newline in paths (Python) [duplicate]

How can I remove the last character of a string if it is a newline?
"abc\n" --> "abc"
Try the method rstrip() (see doc Python 2 and Python 3)
>>> 'test string\n'.rstrip()
'test string'
Python's rstrip() method strips all kinds of trailing whitespace by default, not just one newline as Perl does with chomp.
>>> 'test string \n \r\n\n\r \n\n'.rstrip()
'test string'
To strip only newlines:
>>> 'test string \n \r\n\n\r \n\n'.rstrip('\n')
'test string \n \r\n\n\r '
In addition to rstrip(), there are also the methods strip() and lstrip(). Here is an example with the three of them:
>>> s = " \n\r\n \n abc def \n\r\n \n "
>>> s.strip()
'abc def'
>>> s.lstrip()
'abc def \n\r\n \n '
>>> s.rstrip()
' \n\r\n \n abc def'
And I would say the "pythonic" way to get lines without trailing newline characters is splitlines().
>>> text = "line 1\nline 2\r\nline 3\nline 4"
>>> text.splitlines()
['line 1', 'line 2', 'line 3', 'line 4']
The canonical way to strip end-of-line (EOL) characters is to use the string rstrip() method removing any trailing \r or \n. Here are examples for Mac, Windows, and Unix EOL characters.
>>> 'Mac EOL\r'.rstrip('\r\n')
'Mac EOL'
>>> 'Windows EOL\r\n'.rstrip('\r\n')
'Windows EOL'
>>> 'Unix EOL\n'.rstrip('\r\n')
'Unix EOL'
Using '\r\n' as the parameter to rstrip means that it will strip out any trailing combination of '\r' or '\n'. That's why it works in all three cases above.
This nuance matters in rare cases. For example, I once had to process a text file which contained an HL7 message. The HL7 standard requires a trailing '\r' as its EOL character. The Windows machine on which I was using this message had appended its own '\r\n' EOL character. Therefore, the end of each line looked like '\r\r\n'. Using rstrip('\r\n') would have taken off the entire '\r\r\n' which is not what I wanted. In that case, I simply sliced off the last two characters instead.
Note that unlike Perl's chomp function, this will strip all specified characters at the end of the string, not just one:
>>> "Hello\n\n\n".rstrip("\n")
"Hello"
Note that rstrip doesn't act exactly like Perl's chomp() because it doesn't modify the string. That is, in Perl:
$x="a\n";
chomp $x
results in $x being "a".
but in Python:
x="a\n"
x.rstrip()
will mean that the value of x is still "a\n". Even x=x.rstrip() doesn't always give the same result, as it strips all whitespace from the end of the string, not just one newline at most.
I might use something like this:
import os
s = s.rstrip(os.linesep)
I think the problem with rstrip("\n") is that you'll probably want to make sure the line separator is portable. (some antiquated systems are rumored to use "\r\n"). The other gotcha is that rstrip will strip out repeated whitespace. Hopefully os.linesep will contain the right characters. the above works for me.
You may use line = line.rstrip('\n'). This will strip all newlines from the end of the string, not just one.
s = s.rstrip()
will remove all newlines at the end of the string s. The assignment is needed because rstrip returns a new string instead of modifying the original string.
"line 1\nline 2\r\n...".replace('\n', '').replace('\r', '')
>>> 'line 1line 2...'
or you could always get geekier with regexps
This would replicate exactly perl's chomp (minus behavior on arrays) for "\n" line terminator:
def chomp(x):
if x.endswith("\r\n"): return x[:-2]
if x.endswith("\n") or x.endswith("\r"): return x[:-1]
return x
(Note: it does not modify string 'in place'; it does not strip extra trailing whitespace; takes \r\n in account)
you can use strip:
line = line.strip()
demo:
>>> "\n\n hello world \n\n".strip()
'hello world'
rstrip doesn't do the same thing as chomp, on so many levels. Read http://perldoc.perl.org/functions/chomp.html and see that chomp is very complex indeed.
However, my main point is that chomp removes at most 1 line ending, whereas rstrip will remove as many as it can.
Here you can see rstrip removing all the newlines:
>>> 'foo\n\n'.rstrip(os.linesep)
'foo'
A much closer approximation of typical Perl chomp usage can be accomplished with re.sub, like this:
>>> re.sub(os.linesep + r'\Z','','foo\n\n')
'foo\n'
Careful with "foo".rstrip(os.linesep): That will only chomp the newline characters for the platform where your Python is being executed. Imagine you're chimping the lines of a Windows file under Linux, for instance:
$ python
Python 2.7.1 (r271:86832, Mar 18 2011, 09:09:48)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> sys.platform
'linux2'
>>> "foo\r\n".rstrip(os.linesep)
'foo\r'
>>>
Use "foo".rstrip("\r\n") instead, as Mike says above.
An example in Python's documentation simply uses line.strip().
Perl's chomp function removes one linebreak sequence from the end of a string only if it's actually there.
Here is how I plan to do that in Python, if process is conceptually the function that I need in order to do something useful to each line from this file:
import os
sep_pos = -len(os.linesep)
with open("file.txt") as f:
for line in f:
if line[sep_pos:] == os.linesep:
line = line[:sep_pos]
process(line)
I don't program in Python, but I came across an FAQ at python.org advocating S.rstrip("\r\n") for python 2.2 or later.
import re
r_unwanted = re.compile("[\n\t\r]")
r_unwanted.sub("", your_text)
If your question is to clean up all the line breaks in a multiple line str object (oldstr), you can split it into a list according to the delimiter '\n' and then join this list into a new str(newstr).
newstr = "".join(oldstr.split('\n'))
I find it convenient to have be able to get the chomped lines via in iterator, parallel to the way you can get the un-chomped lines from a file object. You can do so with the following code:
def chomped_lines(it):
return map(operator.methodcaller('rstrip', '\r\n'), it)
Sample usage:
with open("file.txt") as infile:
for line in chomped_lines(infile):
process(line)
I'm bubbling up my regular expression based answer from one I posted earlier in the comments of another answer. I think using re is a clearer more explicit solution to this problem than str.rstrip.
>>> import re
If you want to remove one or more trailing newline chars:
>>> re.sub(r'[\n\r]+$', '', '\nx\r\n')
'\nx'
If you want to remove newline chars everywhere (not just trailing):
>>> re.sub(r'[\n\r]+', '', '\nx\r\n')
'x'
If you want to remove only 1-2 trailing newline chars (i.e., \r, \n, \r\n, \n\r, \r\r, \n\n)
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r\n')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n')
'\nx'
I have a feeling what most people really want here, is to remove just one occurrence of a trailing newline character, either \r\n or \n and nothing more.
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n\n', count=1)
'\nx\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n\r\n', count=1)
'\nx\r\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n', count=1)
'\nx'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n', count=1)
'\nx'
(The ?: is to create a non-capturing group.)
(By the way this is not what '...'.rstrip('\n', '').rstrip('\r', '') does which may not be clear to others stumbling upon this thread. str.rstrip strips as many of the trailing characters as possible, so a string like foo\n\n\n would result in a false positive of foo whereas you may have wanted to preserve the other newlines after stripping a single trailing one.)
workaround solution for special case:
if the newline character is the last character (as is the case with most file inputs), then for any element in the collection you can index as follows:
foobar= foobar[:-1]
to slice out your newline character.
It looks like there is not a perfect analog for perl's chomp. In particular, rstrip cannot handle multi-character newline delimiters like \r\n. However, splitlines does as pointed out here.
Following my answer on a different question, you can combine join and splitlines to remove/replace all newlines from a string s:
''.join(s.splitlines())
The following removes exactly one trailing newline (as chomp would, I believe). Passing True as the keepends argument to splitlines retain the delimiters. Then, splitlines is called again to remove the delimiters on just the last "line":
def chomp(s):
if len(s):
lines = s.splitlines(True)
last = lines.pop()
return ''.join(lines + last.splitlines())
else:
return ''
s = '''Hello World \t\n\r\tHi There'''
# import the module string
import string
# use the method translate to convert
s.translate({ord(c): None for c in string.whitespace}
>>'HelloWorldHiThere'
With regex
s = ''' Hello World
\t\n\r\tHi '''
print(re.sub(r"\s+", "", s), sep='') # \s matches all white spaces
>HelloWorldHi
Replace \n,\t,\r
s.replace('\n', '').replace('\t','').replace('\r','')
>' Hello World Hi '
With regex
s = '''Hello World \t\n\r\tHi There'''
regex = re.compile(r'[\n\r\t]')
regex.sub("", s)
>'Hello World Hi There'
with Join
s = '''Hello World \t\n\r\tHi There'''
' '.join(s.split())
>'Hello World Hi There'
>>> ' spacious '.rstrip()
' spacious'
>>> "AABAA".rstrip("A")
'AAB'
>>> "ABBA".rstrip("AB") # both AB and BA are stripped
''
>>> "ABCABBA".rstrip("AB")
'ABC'
Just use :
line = line.rstrip("\n")
or
line = line.strip("\n")
You don't need any of this complicated stuff
There are three types of line endings that we normally encounter: \n, \r and \r\n. A rather simple regular expression in re.sub, namely r"\r?\n?$", is able to catch them all.
(And we gotta catch 'em all, am I right?)
import re
re.sub(r"\r?\n?$", "", the_text, 1)
With the last argument, we limit the number of occurences replaced to one, mimicking chomp to some extent. Example:
import re
text_1 = "hellothere\n\n\n"
text_2 = "hellothere\n\n\r"
text_3 = "hellothere\n\n\r\n"
a = re.sub(r"\r?\n?$", "", text_1, 1)
b = re.sub(r"\r?\n?$", "", text_2, 1)
c = re.sub(r"\r?\n?$", "", text_3, 1)
... where a == b == c is True.
If you are concerned about speed (say you have a looong list of strings) and you know the nature of the newline char, string slicing is actually faster than rstrip. A little test to illustrate this:
import time
loops = 50000000
def method1(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string[:-1]
t1 = time.time()
print('Method 1: ' + str(t1 - t0))
def method2(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string.rstrip()
t1 = time.time()
print('Method 2: ' + str(t1 - t0))
method1()
method2()
Output:
Method 1: 3.92700004578
Method 2: 6.73000001907
This will work both for windows and linux (bit expensive with re sub if you are looking for only re solution)
import re
if re.search("(\\r|)\\n$", line):
line = re.sub("(\\r|)\\n$", "", line)
A catch all:
line = line.rstrip('\r|\n')

Error in python string

when I type this code in python it shows this error
>>> 'Ahmed' + \t 'Ashraf '
SyntaxError: unexpected character after line continuation character
what does this error mean ??
The \t escape sequence has to be inside a string literal, so the correct syntax is:
'Ahmed' + '\t' + 'Ashraf '
Outside a string literal, \ is used to indicate that you're continuing a statement on the next line, so it's called the line continuation character. It should only be followed by a newline, e.g.
>>> var1 = 'Ahmed' + \
'Ashraf'
If it's followed by some other character, such as t in your example, you get an error.
You can also do it with the format string.
>>> '%s\t%s' %('Ahmed','Ashraf')
'Ahmed\tAshraf'
>>>

Unicode issues when using NLTK

I have a text scraped from internet (I think it was a Spanish text encoded in "latin-1" and decoded to unicode when scraped). The text is something like this:
730\u20ac.\r\n\nropa nueva 2012 ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac,
After that I do some replacements on the text to normalize some words (i.e. replace the € symbol (\u20ac) for "euros" using regex (r'\u20ac', r' euros')).
Here my problem seems to start... If I do not encode each string to "UTF-8" before applying the regex, the regex wont find any occurrences (despite a lot of occurrences do exist)...
Anyways, after encoding it to UTF-8, the regex (r'\u20ac', r' euros') works.
After that I tokenize and tag all the strings. When I try to use the regexparser I then get the
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
My question is, if I have already encoded it to UTF-8, how come I have a problem now? And what would be your suggestion to try to avoid it?
Is there a way to do the encoding process once and for all, like below? If so what should I do for the second part (encode/ decode it anyway)?
Get text -> encode/ decode it anyway... -> Work on the text without any issue
Thanks in advance for any help!! I am new to programming and it is killing me...
Code detail:
regex function
replacement_patterns = [(ur' \\u20ac', ur' euros'),(ur' \xe2\x82\xac', r' euros'),(ur' \b[eE]?[uU]?[rR]\b', r' euros'), (ur' \b([0-9]+)[eE][uU]?[rR]?[oO]?[sS]?\b',ur' \1 euros')]
class RegexpReplacer(object):
def __init__(self, patterns=replacement_patterns):
self.patterns = [(re.compile(regex, re.IGNORECASE), repl) for (regex, repl) in patterns]
def replace(self, text):
s = text
for (pattern, repl) in self.patterns:
(s, count) = re.subn(pattern, repl, s)
return s
You seem to be misunderstanding the meaning of r'\u20ac'
The r indicates a raw string. Not a unicode string, a standard one. So using a unicode escape in a pattern only gets you a literal backslash:
>>> p = re.compile(r'\u20ac')
>>> p.pattern
'\\u20ac'
>>> print p.pattern
\u20ac
If you want to use raw strings and unicode escapes, you'll have to use raw unicode strings, indicated by ur instead of just r:
>>> p = re.compile(ur'\u20ac')
>>> p.pattern
u'\u20ac'
>>> print p.pattern
€
Did you use the decode & encode functions correctly?
from nltk import ne_chunk,pos_tag
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer
text = "€"
text = text.decode('utf-8')
sentences = PunktTokenizer.tokenize(text)
tokens = [TreeBankTokenizer.tokenize(sentence) for sentence in sentences]
tagged = [pos_tag(token) for token in tokens]
When needed, try to use:
print your_string.encode("utf-8")
I have no problems currently. The only issue is that $50, says:
word: $ meaning: dollar word: 50 meaning: numeral, cardinal
This is correct.
And €50, says:
word: €50 meaning: -NONE-
This is INcorrect.
With a space between the € sign and the number, it says:
word: € meaning: noun, common, singular or mass word: 50 meaning:
numeral, cardinal
Which is more correct.

How can I remove a trailing newline?

How can I remove the last character of a string if it is a newline?
"abc\n" --> "abc"
Try the method rstrip() (see doc Python 2 and Python 3)
>>> 'test string\n'.rstrip()
'test string'
Python's rstrip() method strips all kinds of trailing whitespace by default, not just one newline as Perl does with chomp.
>>> 'test string \n \r\n\n\r \n\n'.rstrip()
'test string'
To strip only newlines:
>>> 'test string \n \r\n\n\r \n\n'.rstrip('\n')
'test string \n \r\n\n\r '
In addition to rstrip(), there are also the methods strip() and lstrip(). Here is an example with the three of them:
>>> s = " \n\r\n \n abc def \n\r\n \n "
>>> s.strip()
'abc def'
>>> s.lstrip()
'abc def \n\r\n \n '
>>> s.rstrip()
' \n\r\n \n abc def'
And I would say the "pythonic" way to get lines without trailing newline characters is splitlines().
>>> text = "line 1\nline 2\r\nline 3\nline 4"
>>> text.splitlines()
['line 1', 'line 2', 'line 3', 'line 4']
The canonical way to strip end-of-line (EOL) characters is to use the string rstrip() method removing any trailing \r or \n. Here are examples for Mac, Windows, and Unix EOL characters.
>>> 'Mac EOL\r'.rstrip('\r\n')
'Mac EOL'
>>> 'Windows EOL\r\n'.rstrip('\r\n')
'Windows EOL'
>>> 'Unix EOL\n'.rstrip('\r\n')
'Unix EOL'
Using '\r\n' as the parameter to rstrip means that it will strip out any trailing combination of '\r' or '\n'. That's why it works in all three cases above.
This nuance matters in rare cases. For example, I once had to process a text file which contained an HL7 message. The HL7 standard requires a trailing '\r' as its EOL character. The Windows machine on which I was using this message had appended its own '\r\n' EOL character. Therefore, the end of each line looked like '\r\r\n'. Using rstrip('\r\n') would have taken off the entire '\r\r\n' which is not what I wanted. In that case, I simply sliced off the last two characters instead.
Note that unlike Perl's chomp function, this will strip all specified characters at the end of the string, not just one:
>>> "Hello\n\n\n".rstrip("\n")
"Hello"
Note that rstrip doesn't act exactly like Perl's chomp() because it doesn't modify the string. That is, in Perl:
$x="a\n";
chomp $x
results in $x being "a".
but in Python:
x="a\n"
x.rstrip()
will mean that the value of x is still "a\n". Even x=x.rstrip() doesn't always give the same result, as it strips all whitespace from the end of the string, not just one newline at most.
I might use something like this:
import os
s = s.rstrip(os.linesep)
I think the problem with rstrip("\n") is that you'll probably want to make sure the line separator is portable. (some antiquated systems are rumored to use "\r\n"). The other gotcha is that rstrip will strip out repeated whitespace. Hopefully os.linesep will contain the right characters. the above works for me.
You may use line = line.rstrip('\n'). This will strip all newlines from the end of the string, not just one.
s = s.rstrip()
will remove all newlines at the end of the string s. The assignment is needed because rstrip returns a new string instead of modifying the original string.
"line 1\nline 2\r\n...".replace('\n', '').replace('\r', '')
>>> 'line 1line 2...'
or you could always get geekier with regexps
This would replicate exactly perl's chomp (minus behavior on arrays) for "\n" line terminator:
def chomp(x):
if x.endswith("\r\n"): return x[:-2]
if x.endswith("\n") or x.endswith("\r"): return x[:-1]
return x
(Note: it does not modify string 'in place'; it does not strip extra trailing whitespace; takes \r\n in account)
you can use strip:
line = line.strip()
demo:
>>> "\n\n hello world \n\n".strip()
'hello world'
rstrip doesn't do the same thing as chomp, on so many levels. Read http://perldoc.perl.org/functions/chomp.html and see that chomp is very complex indeed.
However, my main point is that chomp removes at most 1 line ending, whereas rstrip will remove as many as it can.
Here you can see rstrip removing all the newlines:
>>> 'foo\n\n'.rstrip(os.linesep)
'foo'
A much closer approximation of typical Perl chomp usage can be accomplished with re.sub, like this:
>>> re.sub(os.linesep + r'\Z','','foo\n\n')
'foo\n'
Careful with "foo".rstrip(os.linesep): That will only chomp the newline characters for the platform where your Python is being executed. Imagine you're chimping the lines of a Windows file under Linux, for instance:
$ python
Python 2.7.1 (r271:86832, Mar 18 2011, 09:09:48)
[GCC 4.5.0 20100604 [gcc-4_5-branch revision 160292]] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, sys
>>> sys.platform
'linux2'
>>> "foo\r\n".rstrip(os.linesep)
'foo\r'
>>>
Use "foo".rstrip("\r\n") instead, as Mike says above.
An example in Python's documentation simply uses line.strip().
Perl's chomp function removes one linebreak sequence from the end of a string only if it's actually there.
Here is how I plan to do that in Python, if process is conceptually the function that I need in order to do something useful to each line from this file:
import os
sep_pos = -len(os.linesep)
with open("file.txt") as f:
for line in f:
if line[sep_pos:] == os.linesep:
line = line[:sep_pos]
process(line)
I don't program in Python, but I came across an FAQ at python.org advocating S.rstrip("\r\n") for python 2.2 or later.
import re
r_unwanted = re.compile("[\n\t\r]")
r_unwanted.sub("", your_text)
If your question is to clean up all the line breaks in a multiple line str object (oldstr), you can split it into a list according to the delimiter '\n' and then join this list into a new str(newstr).
newstr = "".join(oldstr.split('\n'))
I find it convenient to have be able to get the chomped lines via in iterator, parallel to the way you can get the un-chomped lines from a file object. You can do so with the following code:
def chomped_lines(it):
return map(operator.methodcaller('rstrip', '\r\n'), it)
Sample usage:
with open("file.txt") as infile:
for line in chomped_lines(infile):
process(line)
I'm bubbling up my regular expression based answer from one I posted earlier in the comments of another answer. I think using re is a clearer more explicit solution to this problem than str.rstrip.
>>> import re
If you want to remove one or more trailing newline chars:
>>> re.sub(r'[\n\r]+$', '', '\nx\r\n')
'\nx'
If you want to remove newline chars everywhere (not just trailing):
>>> re.sub(r'[\n\r]+', '', '\nx\r\n')
'x'
If you want to remove only 1-2 trailing newline chars (i.e., \r, \n, \r\n, \n\r, \r\r, \n\n)
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r\n')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n\r')
'\nx\r'
>>> re.sub(r'[\n\r]{1,2}$', '', '\nx\r\n')
'\nx'
I have a feeling what most people really want here, is to remove just one occurrence of a trailing newline character, either \r\n or \n and nothing more.
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n\n', count=1)
'\nx\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n\r\n', count=1)
'\nx\r\n'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\r\n', count=1)
'\nx'
>>> re.sub(r'(?:\r\n|\n)$', '', '\nx\n', count=1)
'\nx'
(The ?: is to create a non-capturing group.)
(By the way this is not what '...'.rstrip('\n', '').rstrip('\r', '') does which may not be clear to others stumbling upon this thread. str.rstrip strips as many of the trailing characters as possible, so a string like foo\n\n\n would result in a false positive of foo whereas you may have wanted to preserve the other newlines after stripping a single trailing one.)
workaround solution for special case:
if the newline character is the last character (as is the case with most file inputs), then for any element in the collection you can index as follows:
foobar= foobar[:-1]
to slice out your newline character.
It looks like there is not a perfect analog for perl's chomp. In particular, rstrip cannot handle multi-character newline delimiters like \r\n. However, splitlines does as pointed out here.
Following my answer on a different question, you can combine join and splitlines to remove/replace all newlines from a string s:
''.join(s.splitlines())
The following removes exactly one trailing newline (as chomp would, I believe). Passing True as the keepends argument to splitlines retain the delimiters. Then, splitlines is called again to remove the delimiters on just the last "line":
def chomp(s):
if len(s):
lines = s.splitlines(True)
last = lines.pop()
return ''.join(lines + last.splitlines())
else:
return ''
s = '''Hello World \t\n\r\tHi There'''
# import the module string
import string
# use the method translate to convert
s.translate({ord(c): None for c in string.whitespace}
>>'HelloWorldHiThere'
With regex
s = ''' Hello World
\t\n\r\tHi '''
print(re.sub(r"\s+", "", s), sep='') # \s matches all white spaces
>HelloWorldHi
Replace \n,\t,\r
s.replace('\n', '').replace('\t','').replace('\r','')
>' Hello World Hi '
With regex
s = '''Hello World \t\n\r\tHi There'''
regex = re.compile(r'[\n\r\t]')
regex.sub("", s)
>'Hello World Hi There'
with Join
s = '''Hello World \t\n\r\tHi There'''
' '.join(s.split())
>'Hello World Hi There'
>>> ' spacious '.rstrip()
' spacious'
>>> "AABAA".rstrip("A")
'AAB'
>>> "ABBA".rstrip("AB") # both AB and BA are stripped
''
>>> "ABCABBA".rstrip("AB")
'ABC'
Just use :
line = line.rstrip("\n")
or
line = line.strip("\n")
You don't need any of this complicated stuff
There are three types of line endings that we normally encounter: \n, \r and \r\n. A rather simple regular expression in re.sub, namely r"\r?\n?$", is able to catch them all.
(And we gotta catch 'em all, am I right?)
import re
re.sub(r"\r?\n?$", "", the_text, 1)
With the last argument, we limit the number of occurences replaced to one, mimicking chomp to some extent. Example:
import re
text_1 = "hellothere\n\n\n"
text_2 = "hellothere\n\n\r"
text_3 = "hellothere\n\n\r\n"
a = re.sub(r"\r?\n?$", "", text_1, 1)
b = re.sub(r"\r?\n?$", "", text_2, 1)
c = re.sub(r"\r?\n?$", "", text_3, 1)
... where a == b == c is True.
If you are concerned about speed (say you have a looong list of strings) and you know the nature of the newline char, string slicing is actually faster than rstrip. A little test to illustrate this:
import time
loops = 50000000
def method1(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string[:-1]
t1 = time.time()
print('Method 1: ' + str(t1 - t0))
def method2(loops=loops):
test_string = 'num\n'
t0 = time.time()
for num in xrange(loops):
out_sting = test_string.rstrip()
t1 = time.time()
print('Method 2: ' + str(t1 - t0))
method1()
method2()
Output:
Method 1: 3.92700004578
Method 2: 6.73000001907
This will work both for windows and linux (bit expensive with re sub if you are looking for only re solution)
import re
if re.search("(\\r|)\\n$", line):
line = re.sub("(\\r|)\\n$", "", line)
A catch all:
line = line.rstrip('\r|\n')

Categories