UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' - python

I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
for item in items:
#item = [x.encode('latin-1') for x in item]
cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.

A very easy way around all the "'ascii' codec can't encode character…" issues with csvwriter is to instead use unicodecsv, a drop-in replacement for csvwriter.
Install unicodecsv with pip and then you can use it in the exact same way, eg:
import unicodecsv
file = open('users.csv', 'w')
w = unicodecsv.writer(file)
for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
w.writerow(user)

For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What do did you expect?
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.

Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must be first turned into a string of bytes (and that's where the default codec ansi comes up and fails). In your text then you say "if I rewrite ot to x.encode"... which seems to imply that you do know x is Unicode.
So what it IS you're doing -- and what it is you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).

x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Look closely: You got a Unicode***Encode***Error calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.
for item in items:
item = [x.encode('latin-1') for x in item]
cleancsv.writerow(item)
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.

xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£
# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'
# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>

Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error:"'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my german words. But changing into: ...xl_data.find(cell.value)... gives no error. So, I suppose using strings as arguments in certain commands with xldr has specific encoding problems.

Related

"ascii" codec can't encode characters in position 0-2: ordinal not in range(128)

I am using python 2.7 and used Chinese characters in my code, so...
# coding = utf-8
and the problem is part of my code, as follows:
def fileoutput():
global percent_shown
date = str(datetime.datetime.now()).decode('utf-8')
with open("result.txt","a") as datafile:
datafile.write(date+" "+str(percent_shown.get()))
percent_shown is a string that includes Chinese characters
When I run it, I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
How to fix it? Thanks
As per PEP 263, the coding declaration must match the regular expression r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)" so you need to get rid of the space between "coding" and the equal sign:
# coding=utf-8
This declaration tells python that the .py file itself is utf-8 encoded, but doesn't change the rest of the program. This is useful if you are writing unicode literals but you still need to cast them to unicde properly to make sure things work.
Since you haven't shown us what you are trying to print, I found some Chinese characters to demonstrate. I have no idea what they mean... so appollogies for anyone I insult!
foo = u"学而设" # Good! you've got a unicode string
bar = "学而设" # Bad! you've got a utf-8 encoded string that python
# thinks is ascii
I think you can fix your program with a few tweaks. First, don't try to decode datetime.now(). Its just ascii. It didn't change its return type just because you declared the source file encoding. Second, use the codecs module to open the file with the encoding you wnat (I'm assuming its utf-8). Now, since you are working with unicode strings you can write them directly to the file.
import codecs
def fileoutput():
date = unicode(datetime.datetime.now())
with codecs.open("result.txt","a", encoding="utf-8") as datafile:
datafile.write(date+" "+percent_shown.get())
You can't have whitespace before the = in your coding comment. Try:
# coding=utf-8
See the regular expression in: https://www.python.org/dev/peps/pep-0263/

How to decode chars in Python respectively?

I have tried this problem
# -*- coding: utf-8 -*-
s = "Ñ ÑÑÑаÑ! Ð½ÐµÑ Ñил"
e = s.encode('ascii')
print e
but it gives me this error.
Traceback (most recent call last):
File "C:/Users/username/Desktop/unicode.py", line 3, in <module>
e = s.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
How do I get the text to be readable? I have been trying for hours! Not sure how to fix this. Any help would be greatly appreciated!
You have a whole slew of problems here.
First, you've stuck Unicode characters into a str literal instead of a unicode literal. That's almost always a bad idea.
Second, you've called encode on a str. But encode is for converting unicode to str.* In order to do that, Python has to first decode your str to a unicode so that it can call encode on it. And if you force Python to decode for you without telling it which codec to use, it will use sys.getdefaultencoding(), which is almost never what you want. (In particular, it's not going to be UTF-8 just because your source encoding is.)
You can fix those first two problems just by adding one letter:
s = u"Ñ ÑÑÑаÑ! Ð½ÐµÑ Ñил"
But it's still not going to work. Why? Because you're asking it to encode non-ASCII characters into the ASCII character set. Which is impossible. So it's going to call the error handler. Since you didn't specify an error handler, you get the default, called strict. As the name implies, strict raises an exception when you ask it do something impossible.
There are other error handlers—see the str.encode docs for a full list. I'm not sure what output you were expecting, but you can get backslash-escaped text, or text with all the non-ASCII characters replaced by ?s, or a few other possibilities. For example:
e = s.encode('ascii', 'replace')
Of course if you didn't actually want ASCII, but rather UTF-8, then everything is easy: just tell Python you want UTF-8 instead of ASCII:
e = s.encode('utf-8')
* There are a few special codecs, like hex and gzip, that convert str to str, unicode to unicode, or str to unicode, but ascii isn't one of them.

Reading Unicode Files - Python3.2

I'm trying to read some files using Python3.2, the some of the files may contain unicode while others do not.
When I try:
file = open(item_path + item, encoding="utf-8")
for line in file:
print (repr(line))
I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)
I am following the documentation here: http://docs.python.org/release/3.0.1/howto/unicode.html
Why would Python be trying to encode to ascii at any point in this code?
The problem is that repr(line) in Python 3 returns also the Unicode string. It does not convert the above 128 characters to the ASCII escape sequences.
Use ascii(line) instead if you want to see the escape sequences.
Actually, the repr(line) is expected to return the string that if placed in a source code would produce the object with the same value. This way, the Python 3 behaviour is just fine as there is no need for ASCII escape sequences in the source files to express a string with more than ASCII characters. It is quite natural to use UTF-8 or some other Unicode encoding these day. The truth is that Python 2 produced the escape sequences for such characters.
What's your output encoding? If you remove the call to print(), does it start working?
I suspect you've got a non-UTF-8 locale, so Python is trying to encode repr(line) as ASCII as part of printing it.
To resolve the issue, you must either encode the string and print the byte array, or set your default encoding to something that can handle your strings (UTF-8 being the obvious choice).

Trouble with encoding and urllib

I'm loading web-page using urllib. Ther eis russian symbols, but page encoding is 'utf-8'
1
pageData = unicode(requestHandler.read()).decode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 262: ordinal not in range(128)
2
pageData = requestHandler.read()
soupHandler = BeautifulSoup(pageData)
print soupHandler.findAll(...)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 340-345: ordinal not in range(128)
In your first snippet, the call unicode(requestHandler.read()) tells Python to convert the bytestring returned by read into unicode: since no code is specified for the conversion, ascii gets tried (and fails). It never gets to the point where you're going to call .decode (which would make no sense to call on that unicode object anyway).
Either use unicode(requestHandler.read(), 'utf-8'), or requestHandler.read().decode('utf-8'): either of these should produce a correct unicode object if the encoding is indeed utf-8 (the presence of that D0 byte suggests it may not be, but it's impossible to guess from being shown a single non-ascii character out of context).
printing Unicode data is a different issue and requires a well configured and cooperative terminal emulator -- one that lets Python set sys.stdout.encoding on startup. For example, on a Mac, using Apple's Terminal.App:
>>> sys.stdout.encoding
'UTF-8'
so the printing of Unicode objects works fine here:
>>> print u'\xabutf8\xbb'
«utf8»
as does the printing of utf8-encoded byte strings:
>>> print u'\xabutf8\xbb'.encode('utf8')
«utf8»
but on other machines only the latter will work (using the terminal emulator's own encoding, which you need to discover on your own because the terminal emulator isn't telling Python;-).
If requestHandler.read() delivers a UTF-8 encoded stream, then
pageData = requestHandler.read().decode('utf-8')
will decode this into a Unicode string (at which point, as Dietrich Epp noted correctly), the unicode() call is not necessary anymore.
If it throws an exception, then the input is obviously not UTF-8-encoded.

Python unicode problem

I'm receiving some data from a ZODB (Zope Object Database). I receive a mybrains object. Then I do:
o = mybrains.getObject()
and I receive a "Person" object in my project. Then, I can do
b = o.name
and doing print b on my class I get:
José Carlos
and print b.name.__class__
<type 'unicode'>
I have a lot of "Person" objects. They are added to a list.
names = [o.nome, o1.nome, o2.nome]
Then, I trying to create a text file with this data.
delimiter = ';'
all = delimiter.join(names) + '\n'
No problem. Now, when I do a print all I have:
José Carlos;Jonas;Natália
Juan;John
But when I try to create a file of it:
f = open("/tmp/test.txt", "w")
f.write(all)
I get an error like this (the positions aren't exaclty the same, since I change the names)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 84: ordinal not in range(128)
If I can print already with the "correct" form to display it, why I can't write a file with it? Which encode/decode method should I use to write a file with this data?
I'm using Python 2.4.5 (can't upgrade it)
UnicodeEncodeError: 'ascii' codec
write is trying to encode the string using the ascii codec (which doesn't have a way of encoding accented characters like é or à.
Instead use
import codecs
with codecs.open("/tmp/test.txt",'w',encoding='utf-8') as f:
f.write(all.decode('utf-8'))
or choose some other codec (like cp1252) which can encode the characters in your string.
PS. all.decode('utf-8') was used above because f.write expects a unicode string. Better than using all.decode('utf-8') would be to convert all your strings to unicode early, work in unicode, and encode to a specific encoding like 'utf-8' late -- only when you have to.
PPS. It looks like names might already be a list of unicode strings. In that case, define delimiter to be a unicode string too: delimiter = u';', so all will be a unicode string. Then
with codecs.open("/tmp/test.txt",'w',encoding='utf-8') as f:
f.write(all)
should work (unless there is some issue with Python 2.4 that I'm not aware of.)
If 'utf-8' does not work, remember to try other encodings that contain the characters you need, and that your computer knows about. On Windows, that might mean 'cp1252'.
You told Python to print all, but since all has no fixed computer representation, Python first had to convert all to some printable form. Since you didn't tell Python how to do the conversion, it assumed you wanted ASCII. Unfortunately, ASCII can only handle values from 0 to 127, and all contains values out of that range, hence you see an error.
To fix this use:
all = "José Carlos;Jonas;Natália Juan;John"
import codecs
f = codecs.open("/tmp/test.txt", "w", "utf-8")
f.write(all.decode("utf-8"))
f.close()

Categories