I need to print some unicode characters in Python 2.7.
Why does the following code gave me an error?
print(u'\u2601')
Output
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2601' in position 0: ordinal not in range(128)
Related
I'm trying to write html from webpage to file, but I have problem with decode characters:
import urllib.request
response = urllib.request.urlopen("https://www.google.com")
charset = response.info().get_content_charset()
print(response.read().decode(charset))
Last line causes error:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in
position 6079: ordinal not in range(128)
response.info().get_content_charset() returns iso-8859-2, but if i check content of response without decoding (print(resposne.read())) there is "utf-8" encoding as html metatag. If i use "utf-8" in decode function there is also similar problem:
Traceback (most recent call last):
File "script.py", line 7, in <module>
print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position
6111: invalid start byte
What's going on?
You can ignore invalid characters using
response.read().decode("utf-8", 'ignore')
Instead of ignore there are other options, e.g. replace
https://www.tutorialspoint.com/python/string_encode.htm
https://docs.python.org/3/howto/unicode.html#the-string-type
(There is also str.encode(encoding='UTF-8',errors='strict') for strings.)
I've seen all of the other posts and done quite a bit of research but I am still scratching my head.
Here is the problem:
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=u'My Mate\u2019s'
>>> b='\xe2\x80\x99s BBQ'
>>> print a
My Mate’s
>>> print b
’s BBQ
So, the variables are finely printed themselves, but printing a concatenation:
>>> print a+b
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
gives a decode error. So, I try to decode the string:
>>> print a.decode('utf-8')+b.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
The error changes into an encode error. So, I try a couple of ways to inform the encoding:
>>> print a.decode('utf-8').encode('utf-8')+b.decode('utf-8').encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>> print a.decode('ascii','ignore')+b.decode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>> print a.decode('utf-8').encode('ascii','ignore') +b.decode('utf-8').encode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 7: ordinal not in range(128)
>>>
The error persists no matter what I try.
I suppose the problem might be very simple. I'd appreciate someone helping with an explanation of what's going on, and how to resolve this.
I have python 2.7 on ubuntu.
b is encoded as UTF-8 so you have to .decode it to Unicode.
print a + b.decode('utf-8')
Tested in Python 2.7.6 on Ubuntu.
If you want both in UTF-8 you can do:
print a.encode('utf-8') + b
I'll explain why each one of your attempts doesn't work:
a + b # the default decoding is ascii which cannot decode UTF-8
a.decode('utf-8')+b.decode('utf-8') # you don't need to decode Unicode
Again you don't need to decode Unicode.
a.decode('utf-8').encode('utf-8')+b.decode('utf-8').encode('utf-8')
You keep trying to decode Unicode. What you should do instead is to encode it, or to decode b.
a.decode('ascii','ignore')+b.decode('ascii','ignore')
And finally you again try to decode Unicode. The point to be made here is that UTF-8 is an encoding. You decode from UTF-8 to Unicode.
a.decode('utf-8').encode('ascii','ignore') +b.decode('utf-8').encode('ascii','ignore')
--- update ---
I think this console log nails the issue, however it's still not clear how to fix it:
>>> workbook = openpyxl.load_workbook('data.xlsx')
>>> worksheet = workbook.active
>>> worksheet['A2'].value
u'\u041c\u0435\u0448\u043e\u043a \u0434\u0435\u043d\u0435\u0433'
>>> print worksheet['A2'].value
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
--- end update ---
I'm trying to print the values of some .xlsx cells using openpyxl:
import openpyxl
workbook = openpyxl.load_workbook(filename='puzzles.xlsx')
worksheet = workbook.active
for row in worksheet.iter_rows('A2:K5'):
print row[0].value
Which results in the following error:
Traceback (most recent call last):
File "xls_import.py", line 8, in <module>
print row[0].value
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
As far as I know, XLSX is encoded as UTF-8, however:
print row[0].value.decode('utf-8')
does not help either:
Traceback (most recent call last):
File "xls_import.py", line 8, in <module>
print row[0].value.decode('utf-8')
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
Any suggestions?
I'm running Python 2.7 and openpyxl 2.2.5.
openpyxl returns unicode strings (XML itself is encoded in UTF-8) so you don't need to decode them (decoding goes from an encoding to unicode) but encode them in encoding of your choice.
I got a tiny script in ArcGIS which creates a hyperlink.
My code:
def Befahrung(value1, value2):
if value1 is '':
return ''
else:
return "G:\\Example\\" + str(value1) + "\\File_" + str(value2) + ".pdf"
The error (only when !Bezeichnun! contains a special character):
ERROR 000539: Error running expression: Befahrung(u" ",u"1155Mönch1")
Traceback (most recent call last):
File "<expression>", line 1 in <module>
File "<string>", line 5 in Befahrung
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 5: ordinal not in range(128)
!Bezeichnun! and !Auftrag! are both strings. It works very well until !Bezeichnun! contains a special character. I can't change the characters, I need to save them.
What do I have to change?
In Befahrung, you convert a string (Unicode in this case) to ASCII:
str(value1);
str(value2);
cannot work if value1 or value2 contain non-ASCII characters. You want to use
unicode(value1)
or better, use string formatting:
return u"G:\\Example\\{}\\File_{}.pdf".format(value1, value2)
(works in Python 2.7 and above)
I recommend reading the Python Unicode HOWTO. The error can be distilled to
>>> str(u"1155Mönch1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 5: ordinal not in range(128)
If you know what character encoding you need (e.g., UTF-8), you can encode it like
value1.encode('utf-8')
Heres what I did..
>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>>
>>> soup.find('div')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>>
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>
How can I simply remove troubling unicode characters from html ?
Or is there any cleaner solution ?
Try this way:
soup = BeautifulSoup (html.decode('utf-8', 'ignore'))
The error you see is due to repr(soup)tries to mix Unicode and bytestrings. Mixing Unicode and bytestrings frequently leads to errors.
Compare:
>>> u'1' + '©'
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
And:
>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'
Here's an example for classes:
>>> class A:
... def __repr__(self):
... return u'copyright ©'.encode('utf-8')
...
>>> A()
copyright ©
>>> class B:
... def __repr__(self):
... return u'copyright ©'
...
>>> B()
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordi
nal not in range(128) #' workaround highlighting bug
>>> class C:
... def __repr__(self):
... return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordi
nal not in range(128)
Similar thing happens with BeautifulSoup:
>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordin
al not in range(128)
To workaround it:
>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
First of all, "troubling" unicode characters could be letters in some language but assuming you won't have to worry about non-english characters then you can use a python lib to convert unicode to ansi. Check out the answer to this question:
How do I convert a file's format from Unicode to ASCII using Python?
The accepted answer there seems like a good solution (that I didn't know about beforehand).
I had the same problem, spent hours on it. Notice the error occurs whenever the interpreter has to display content, this is because the interpreter is trying to convert to ascii, causing problems. Take a look at the top answer here:
UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2