python UnicodeEncodeError > How can I simply remove troubling unicode characters? - python

Heres what I did..
>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>>
>>> soup.find('div')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>>
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>
How can I simply remove troubling unicode characters from html ?
Or is there any cleaner solution ?

Try this way:
soup = BeautifulSoup (html.decode('utf-8', 'ignore'))

The error you see is due to repr(soup)tries to mix Unicode and bytestrings. Mixing Unicode and bytestrings frequently leads to errors.
Compare:
>>> u'1' + '©'
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
And:
>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'
Here's an example for classes:
>>> class A:
... def __repr__(self):
... return u'copyright ©'.encode('utf-8')
...
>>> A()
copyright ©
>>> class B:
... def __repr__(self):
... return u'copyright ©'
...
>>> B()
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordi
nal not in range(128) #' workaround highlighting bug
>>> class C:
... def __repr__(self):
... return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordi
nal not in range(128)
Similar thing happens with BeautifulSoup:
>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordin
al not in range(128)
To workaround it:
>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'

First of all, "troubling" unicode characters could be letters in some language but assuming you won't have to worry about non-english characters then you can use a python lib to convert unicode to ansi. Check out the answer to this question:
How do I convert a file's format from Unicode to ASCII using Python?
The accepted answer there seems like a good solution (that I didn't know about beforehand).

I had the same problem, spent hours on it. Notice the error occurs whenever the interpreter has to display content, this is because the interpreter is trying to convert to ascii, causing problems. Take a look at the top answer here:
UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2

Related

Resolving ascii codec can't decode byte in position ordinal not in range

I've seen all of the other posts and done quite a bit of research but I am still scratching my head.
Here is the problem:
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=u'My Mate\u2019s'
>>> b='\xe2\x80\x99s BBQ'
>>> print a
My Mate’s
>>> print b
’s BBQ
So, the variables are finely printed themselves, but printing a concatenation:
>>> print a+b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
gives a decode error. So, I try to decode the string:
>>> print a.decode('utf-8')+b.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
The error changes into an encode error. So, I try a couple of ways to inform the encoding:
>>> print a.decode('utf-8').encode('utf-8')+b.decode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>> print a.decode('ascii','ignore')+b.decode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>> print a.decode('utf-8').encode('ascii','ignore') +b.decode('utf-8').encode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>>
The error persists no matter what I try.
I suppose the problem might be very simple. I'd appreciate someone helping with an explanation of what's going on, and how to resolve this.
I have python 2.7 on ubuntu.
b is encoded as UTF-8 so you have to .decode it to Unicode.
print a + b.decode('utf-8')
Tested in Python 2.7.6 on Ubuntu.
If you want both in UTF-8 you can do:
print a.encode('utf-8') + b
I'll explain why each one of your attempts doesn't work:
a + b # the default decoding is ascii which cannot decode UTF-8
a.decode('utf-8')+b.decode('utf-8') # you don't need to decode Unicode
Again you don't need to decode Unicode.
a.decode('utf-8').encode('utf-8')+b.decode('utf-8').encode('utf-8')
You keep trying to decode Unicode. What you should do instead is to encode it, or to decode b.
a.decode('ascii','ignore')+b.decode('ascii','ignore')
And finally you again try to decode Unicode. The point to be made here is that UTF-8 is an encoding. You decode from UTF-8 to Unicode.
a.decode('utf-8').encode('ascii','ignore') +b.decode('utf-8').encode('ascii','ignore')

Unable to print some unicode characters in Python 2.7

I need to print some unicode characters in Python 2.7.
Why does the following code gave me an error?
print(u'\u2601')
Output
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2601' in position 0: ordinal not in range(128)

Syntax Error in ArcGIS Field Calculator

I got a tiny script in ArcGIS which creates a hyperlink.
My code:
def Befahrung(value1, value2):
if value1 is '':
return ''
else:
return "G:\\Example\\" + str(value1) + "\\File_" + str(value2) + ".pdf"
The error (only when !Bezeichnun! contains a special character):
ERROR 000539: Error running expression: Befahrung(u" ",u"1155Mönch1")
Traceback (most recent call last):
File "<expression>", line 1 in <module>
File "<string>", line 5 in Befahrung
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 5: ordinal not in range(128)
!Bezeichnun! and !Auftrag! are both strings. It works very well until !Bezeichnun! contains a special character. I can't change the characters, I need to save them.
What do I have to change?
In Befahrung, you convert a string (Unicode in this case) to ASCII:
str(value1);
str(value2);
cannot work if value1 or value2 contain non-ASCII characters. You want to use
unicode(value1)
or better, use string formatting:
return u"G:\\Example\\{}\\File_{}.pdf".format(value1, value2)
(works in Python 2.7 and above)
I recommend reading the Python Unicode HOWTO. The error can be distilled to
>>> str(u"1155Mönch1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 5: ordinal not in range(128)
If you know what character encoding you need (e.g., UTF-8), you can encode it like
value1.encode('utf-8')

Why html2text module throws UnicodeDecodeError?

I have problem with html2text module...shows me UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte
0xbe in position 6: ordinal not in range(128)
Example :
#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib
h = html2text.HTML2Text()
h.ignore_links = True
html = urllib.urlopen( "http://google.com" ).read()
print h.handle( html )
...also have tried h.handle( unicode( html, "utf-8" ) with no success. Any help.
EDIT :
Traceback (most recent call last):
File "test.py", line 12, in <module>
print h.handle(html)
File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
return self.optwrap(self.close())
File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)
The issue is easily reproducable when not decoding, but works just fine when you decode your source correctly. You also get the error if you reuse the parser!
You can try this out with a known good Unicode source, such as http://www.ltg.ed.ac.uk/~richard/unicode-sample.html.
If you don't decode the response to unicode, the library fails:
>>> h = html2text.HTML2Text()
>>> h.handle(html)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Now, if you reuse the HTML2Text object, its state is not cleared up, it still holds the incorrect data, so even passing in Unicode will now fail:
>>> h.handle(html.decode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 240, in handle
return self.optwrap(self.close())
File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/html2text.py", line 252, in close
self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
You need to use a new object and it'll work just fine:
>>> h = html2text.HTML2Text()
>>> result = h.handle(html.decode('utf8'))
>>> len(result)
12750
>>> type(result)
<type 'unicode'>

Beautiful Soup Unicode encode error

I am trying the following code with a particular HTML file
from BeautifulSoup import BeautifulSoup
import re
import codecs
import sys
f = open('test1.html')
html = f.read()
soup = BeautifulSoup(html)
body = soup.body.contents
para = soup.findAll('p')
print str(para).encode('utf-8')
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128)
How do I debug this?
I do not get any error when I remove the call to print function.
The str(para) builtin is trying to use the default (ascii) encoding for the unicode in para.
This is done before the encode() call:
>>> s=u'123\u2019'
>>> str(s)
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
>>> s.encode("utf-8")
'123\xe2\x80\x99'
>>>
Try encoding para directly, maybe by applying encode("utf-8") to each list element.

Categories