Python unicode problem

Python unicode problem - python

I'm receiving some data from a ZODB (Zope Object Database). I receive a mybrains object. Then I do:
o = mybrains.getObject()
and I receive a "Person" object in my project. Then, I can do
b = o.name
and doing print b on my class I get:
José Carlos
and print b.name.__class__
<type 'unicode'>
I have a lot of "Person" objects. They are added to a list.
names = [o.nome, o1.nome, o2.nome]
Then, I trying to create a text file with this data.
delimiter = ';'
all = delimiter.join(names) + '\n'
No problem. Now, when I do a print all I have:
José Carlos;Jonas;Natália
Juan;John
But when I try to create a file of it:
f = open("/tmp/test.txt", "w")
f.write(all)
I get an error like this (the positions aren't exaclty the same, since I change the names)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 84: ordinal not in range(128)
If I can print already with the "correct" form to display it, why I can't write a file with it? Which encode/decode method should I use to write a file with this data?
I'm using Python 2.4.5 (can't upgrade it)

UnicodeEncodeError: 'ascii' codec
write is trying to encode the string using the ascii codec (which doesn't have a way of encoding accented characters like é or à.
Instead use
import codecs
with codecs.open("/tmp/test.txt",'w',encoding='utf-8') as f:
f.write(all.decode('utf-8'))
or choose some other codec (like cp1252) which can encode the characters in your string.
PS. all.decode('utf-8') was used above because f.write expects a unicode string. Better than using all.decode('utf-8') would be to convert all your strings to unicode early, work in unicode, and encode to a specific encoding like 'utf-8' late -- only when you have to.
PPS. It looks like names might already be a list of unicode strings. In that case, define delimiter to be a unicode string too: delimiter = u';', so all will be a unicode string. Then
with codecs.open("/tmp/test.txt",'w',encoding='utf-8') as f:
f.write(all)
should work (unless there is some issue with Python 2.4 that I'm not aware of.)
If 'utf-8' does not work, remember to try other encodings that contain the characters you need, and that your computer knows about. On Windows, that might mean 'cp1252'.

You told Python to print all, but since all has no fixed computer representation, Python first had to convert all to some printable form. Since you didn't tell Python how to do the conversion, it assumed you wanted ASCII. Unfortunately, ASCII can only handle values from 0 to 127, and all contains values out of that range, hence you see an error.
To fix this use:
all = "José Carlos;Jonas;Natália Juan;John"
import codecs
f = codecs.open("/tmp/test.txt", "w", "utf-8")
f.write(all.decode("utf-8"))
f.close()

Related

"ascii" codec can't encode characters in position 0-2: ordinal not in range(128)

I am using python 2.7 and used Chinese characters in my code, so...
# coding = utf-8
and the problem is part of my code, as follows:
def fileoutput():
global percent_shown
date = str(datetime.datetime.now()).decode('utf-8')
with open("result.txt","a") as datafile:
datafile.write(date+" "+str(percent_shown.get()))
percent_shown is a string that includes Chinese characters
When I run it, I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
How to fix it? Thanks

As per PEP 263, the coding declaration must match the regular expression r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)" so you need to get rid of the space between "coding" and the equal sign:
# coding=utf-8
This declaration tells python that the .py file itself is utf-8 encoded, but doesn't change the rest of the program. This is useful if you are writing unicode literals but you still need to cast them to unicde properly to make sure things work.
Since you haven't shown us what you are trying to print, I found some Chinese characters to demonstrate. I have no idea what they mean... so appollogies for anyone I insult!
foo = u"学而设" # Good! you've got a unicode string
bar = "学而设" # Bad! you've got a utf-8 encoded string that python
# thinks is ascii
I think you can fix your program with a few tweaks. First, don't try to decode datetime.now(). Its just ascii. It didn't change its return type just because you declared the source file encoding. Second, use the codecs module to open the file with the encoding you wnat (I'm assuming its utf-8). Now, since you are working with unicode strings you can write them directly to the file.
import codecs
def fileoutput():
date = unicode(datetime.datetime.now())
with codecs.open("result.txt","a", encoding="utf-8") as datafile:
datafile.write(date+" "+percent_shown.get())

You can't have whitespace before the = in your coding comment. Try:
# coding=utf-8
See the regular expression in: https://www.python.org/dev/peps/pep-0263/

Python decoding issue with Chinese characters

I'm using Python 3.5, and I'm trying to take a block of byte text that may or may not contain special Chinese characters and output it to a file. It works for entries that do not contain Chinese characters, but breaks when they do. The Chinese characters are always a person's name, and are always in addition to the English spelling of their name. The text is JSON formatted and needs to be decoded before I can load it. The decoding seems to go fine and doesn't give me any errors. When I try and write the decoded text to a file it gives me the following error message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-18: character maps to undefined
Here is an example of the raw data that I get before I do anything to it:
b' "isBulkRecipient": "false",\r\n "name": "Name in, English \xef'
b'\xab\x62\xb6\xe2\x15\x8a\x8b\x8a\xee\xab\x89\xcf\xbc\x8a",\r\n
Here is the code that I am using:
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
recipientName = recipientData['signers'][0]['name']
pprint(recipientName)
with open('envelope recipient list.csv', 'a', newline='') as fp:
a = csv.writer(fp, delimiter=',')
csvData = [[recipientName]]
a.writerows(csvData)
The recipientContent is obtained from an API call. I do not need to have the Chinese characters in the output file. Any advice will be greatly appreciated!
Update:
I've been doing some manual workarounds for each entry that breaks, and came other entries that didn't contain Chinese special characters, but had them from other languages, and the broke the program as well. The special characters are only in the name field. So a name could be something like "Ałex" where it is a mixture of normal and special characters. Before i decode the string that contains this information i am able to print it out to the screen and it looks like this: b'name": "A\xc5ex",\r\n
But after i decode it into utf-8 it will give me an error if i try to output it. The error message is: UnicodeEncodeError: 'charmap' codec can't encode character 'u0142' in position 2- character maps to -undefined-
I looked up what \u0142 was and it is the ł special character.

The error you're getting is when you're writing to the file.
In Python 3.x, when you open() in text mode (the default) without specifying an encoding=, Python will use an encoding most suitable to your locale or language settings.
If you're on Windows, this will use the charmap codec to map to your language encoding.
Although you could just write bytes straight to a file, you're doing the right thing by decoding it first. As others have said, you should really decode using the encoding specified by the web server. You could also use Python Requests module, which does this for you. (You example doesn't decode as UTF-8, so I assume your example isn't correct)
To solve your immediate error, simply pass an encoding to open(), which supports the characters you have in your data. Unicode in UTF-8 encoding is the obvious choice. Therefore, you should change your code to read:
with open('envelope recipient list.csv', 'a', encoding='utf-8', newline='') as fp:

Warning: shotgun solution ahead
Assuming you just want to get rid of all foreign character in all your file ( that is they are not important for your future processing of all other fields), you can simply ignore all non ascii characters
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
by
recipientData = json.loads(recipientContent.decode('ascii', 'ignore'))
like this you remove all non ascii characters before future processing.
I called it shotgun solution because it might not work correctly under certain circumstances:
Obviously if non ascii characters are needed to keep for future use
If b'\' or b" characters appears for example from part of an utf-16 character.

Add this line to your code :
from __future__ import unicode_literals

'ascii' codec can't encode character u'\xe9'

I already tried all previous answers and solution.
I am trying to use this value, which gave me encoding related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes error but changes the original content
original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' which converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' while using encode
what is correct way to deal with this scenario?
Edit
Error comes when I feed these links in
req = urllib2.Request()

The second version of your string is the correct utf-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode string internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user inputs subsystem).
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.

Unicode strings in python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using utf8 encoding is considered a best practice among multiple dev groups all over the world.
To encode use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
Also, if you're more interested in Unicode and UTF-8 work, check out Unicode HOWTO and

In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked ok in a UTF-8 editor or browser. What you saw after the encode was an ASCII safe representation of UTF-8.
For example, your trouble chars were é and í.
é = 00E9 Unicode = C3A9 UTF-8
í = 00ED Unicode = C3AD UTF-8
In short, your .encode() method is correct and should be used for writing to files or to a browser.

Python String encoding - file name

str(file.key) = '1011/101011/file_name'
newFileName = str(file.key)
But, when I run the code i get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
x-y: ordinal not in range(128)
I need to do some parsing on the file name and then download it from s3 server.
How do I get just 'file_name'?

You've posted far to little context to give a decent answer, but I'll try anyway.
The filename you are trying to create seems to contain non-ascii characters, which cannot be automatically be converted into a standard str in python 2.x.
If you replace str with unicode you can avoid the need for conversion alltogether. If some other part of your code requires you to use an str, you could try to encode it like this: newFileName = unicode(file.key).encode('ascii', 'ignore'). Note that unconvertable characters will be omitted in my example.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
for item in items:
#item = [x.encode('latin-1') for x in item]
cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.

A very easy way around all the "'ascii' codec can't encode character…" issues with csvwriter is to instead use unicodecsv, a drop-in replacement for csvwriter.
Install unicodecsv with pip and then you can use it in the exact same way, eg:
import unicodecsv
file = open('users.csv', 'w')
w = unicodecsv.writer(file)
for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
w.writerow(user)

For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What do did you expect?
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.

Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must be first turned into a string of bytes (and that's where the default codec ansi comes up and fails). In your text then you say "if I rewrite ot to x.encode"... which seems to imply that you do know x is Unicode.
So what it IS you're doing -- and what it is you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).

x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Look closely: You got a Unicode***Encode***Error calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.
for item in items:
item = [x.encode('latin-1') for x in item]
cleancsv.writerow(item)
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.

xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£
# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'
# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>

Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error:"'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my german words. But changing into: ...xl_data.find(cell.value)... gives no error. So, I suppose using strings as arguments in certain commands with xldr has specific encoding problems.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.