Suppose I had a string with some Unicode characters inside it, and we needed to do operations on it. What would be the best way to do so?
s = u"blah ascii_word etc شاهد word1 word 2" # Delimited by spaces
words = s.split(u' ')
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in
position 91: ordinal not in range(128)
Any clues?
Also, if I wanted to write this string to a text file and read it back later, what would be the procedure?
When you declare a string the way you do, Python assumes it is in your default system encoding. You have to add a u before the string to make it unicode, and add an encoding declaration at the top of your file. If you do this, you won't get any errors:
# -*- coding: utf-8 -*-
s = u"blah ascii_word etc شاهد word1 word 2"
words = s.split(u' ')
print words
# no error even though my default system encoding is ascii
I've checked this now and you don't even need the u; adding the encoding declaration is enough to fix the problem.
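As for the second part of the question, writing the string to a text file and reading it back: the codecs module will encode and decode for you. A minimal sketch, assuming UTF-8 as the file encoding:
import codecs

with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(s)        # s is unicode; encoded to UTF-8 bytes on the way out

with codecs.open('out.txt', 'r', encoding='utf-8') as f:
    s2 = f.read()     # decoded back to a unicode string on the way in

print s2 == s         # True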
If you want to do things with unicode strings in the terminal, you have to check your system encoding and change it if necessary:
>>> import sys
>>> sys.getdefaultencoding()
'ascii' #I have ascii
You can then manipulate this using sys.setdefaultencoding(), but this is a tricky issue which depends on your operating system.
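For completeness, the usual workaround looks like the sketch below, but it is widely discouraged because it just papers over missing decode()/encode() calls:
import sys
reload(sys)                      # site.py deletes setdefaultencoding at startup,
                                 # so it has to be brought back with reload()
sys.setdefaultencoding('utf-8')  # discouraged: hides bugs rather than fixing them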
Related
I am using Python 2.7 and use Chinese characters in my code, so...
# coding = utf-8
and the problem is part of my code, as follows:
def fileoutput():
    global percent_shown
    date = str(datetime.datetime.now()).decode('utf-8')
    with open("result.txt", "a") as datafile:
        datafile.write(date + " " + str(percent_shown.get()))
percent_shown is a string that includes Chinese characters
When I run it, I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
How to fix it? Thanks
As per PEP 263, the coding declaration must match the regular expression r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)" so you need to get rid of the space between "coding" and the equal sign:
# coding=utf-8
This declaration tells Python that the .py file itself is UTF-8 encoded, but it doesn't change the rest of the program. This is useful if you are writing unicode literals, but you still need to convert your byte strings to unicode properly to make sure things work.
Since you haven't shown us what you are trying to print, I found some Chinese characters to demonstrate. I have no idea what they mean... so apologies to anyone I insult!
foo = u"学而设" # Good! you've got a unicode string
bar = "学而设" # Bad! you've got a utf-8 encoded string that python
# thinks is ascii
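Assuming the source file really is saved as UTF-8, you can check the relationship between the two in an interactive session:
>>> bar.decode('utf-8') == foo   # decoding the byte string gives the unicode string
True
>>> foo.encode('utf-8') == bar   # encoding goes the other way
True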
I think you can fix your program with a few tweaks. First, don't try to decode datetime.now(). It's just ASCII; it didn't change its return type just because you declared the source file encoding. Second, use the codecs module to open the file with the encoding you want (I'm assuming it's UTF-8). Now, since you are working with unicode strings, you can write them directly to the file.
import codecs

def fileoutput():
    date = unicode(datetime.datetime.now())
    with codecs.open("result.txt", "a", encoding="utf-8") as datafile:
        datafile.write(date + " " + percent_shown.get())
You can't have whitespace before the = in your coding comment. Try:
# coding=utf-8
See the regular expression in: https://www.python.org/dev/peps/pep-0263/
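A quick way to convince yourself, using the standard re module and the pattern quoted above:
import re
coding_re = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")
print coding_re.match("# coding = utf-8")         # None: the space breaks the match
print coding_re.match("# coding=utf-8").group(1)  # utf-8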
I am currently learning Python and I came across the following code:
text = raw_input()
for letter in text:
    x = [alpha_dic[letter]]
    print x
When I enter an umlaut (which is in the dictionary, by the way) it gives me an error like KeyError: '\xfc' (for ü in this case), because the umlauts are stored internally that way! I saw some solutions with unicode encoding or UTF, but either I am not skilled enough to apply them correctly or maybe they simply do not work that way.
You are running into several shortcomings of Python (2.x):
raw_input() gives you raw bytes from the system with no encoding info
The native encoding for Python strings is 'ascii', which cannot represent 'ü'
The encoding of the literals in your script is either ascii, or needs to be declared in a header at the top of the file
So if you have a simple file like this:
x = {'ü': 20, 'ä': 30}
And run it with python you get an error, because the encoding is unknown:
SyntaxError: Non-ASCII character '\xfc' in file foo.py on line 1, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details
This can be fixed, of course, by adding an encoding header to the file and turning the literals into unicode literals.
For example, if the encoding is CP1252 (like a German Windows GUI):
# -*- coding: cp1252 -*-
x = {u'ü': 20, u'ä': 30}
print repr(x)
This prints:
{u'\xfc': 20, u'\xe4': 30}
But if you get the header wrong (e.g. write CP850 instead of CP1252, but keep the same content), it prints:
{u'\xb3': 20, u'\xf5': 30}
Totally different.
So first check that your editor settings match the encoding header in your file, otherwise all non-ascii literals will simply be wrong.
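One way to check what your editor actually wrote (foo.py here stands for whatever your script is called) is to look at the raw bytes:
# dump the raw bytes to see which encoding the editor used:
# CP1252 stores a u-umlaut as '\xfc', UTF-8 as '\xc3\xbc'
with open('foo.py', 'rb') as f:
    print repr(f.read())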
Next step is fixing raw_input(). It does what it says it does, providing you raw input from the console: just bytes. But an 'ü' can be represented by a lot of different bytes: 0xfc in ISO-8859-1 or CP1252 (but 0x81 in CP850), 0xc3 + 0xbc in UTF-8, 0x00 + 0xfc or 0xfc + 0x00 in UTF-16, and so on.
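You can see this directly by encoding the same character yourself:
>>> u'\xfc'.encode('cp1252')   # the same single byte in ISO-8859-1
'\xfc'
>>> u'\xfc'.encode('utf-8')
'\xc3\xbc'
>>> u'\xfc'.encode('utf-16-le'), u'\xfc'.encode('utf-16-be')
('\xfc\x00', '\x00\xfc')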
So your code has two issues with that:
for letter in text:
If text happens to be a byte string in a multibyte encoding (e.g. UTF-8, UTF-16, some others), one byte is not equal to one letter, so iterating like that over the string will not do what you expect. For a very simplified view of "letter" you can do that kind of iteration with Python unicode strings (if properly normalized). So you need to make sure text is a unicode string first.
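For example, simulating UTF-8 console input with an explicit encode:
>>> text = u'\xfc'.encode('utf-8')  # pretend this came from raw_input()
>>> [c for c in text]               # byte-wise iteration: two 'letters' for one umlaut
['\xc3', '\xbc']
>>> [c for c in text.decode('utf-8')]
[u'\xfc']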
How do you convert from a byte string to unicode? A byte string offers the decode() method, which takes an encoding. A good first guess for that encoding is sys.stdin.encoding or locale.getpreferredencoding(True).
Putting things together:
import sys, locale

alpha_dic = {u'\xfc': u'small umlaut u'}
text = raw_input()
# turn text into unicode
utext = text.decode(sys.stdin.encoding or locale.getpreferredencoding(True))
# iterate over the unicode string, not really letters...
for letter in utext:
    x = [alpha_dic[letter]]
    print x
I got this to work borrowing from this answer:
# -*- coding: utf-8 -*-
import sys, locale

alpha_dict = {u"ü": "umlaut"}
text = raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))
for letter in text:
    x = [alpha_dict[unicode(letter)]]
    print x
>>> ü
>>> ['umlaut']
Python 2 and unicode are not for the faint of heart...
I have already tried all the previous answers and solutions.
I am trying to use this value, which gives me an encoding-related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes the error but changes the original content:
The original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', which encode converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno'.
What is the correct way to deal with this scenario?
Edit
The error comes when I feed these links into
req = urllib2.Request()
The second version of your string is the correct UTF-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use unicode strings internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (the storage subsystem and the user input subsystem).
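A minimal sketch of that rule, assuming the user input arrives as UTF-8 bytes:
stored = u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'  # unicode internally
raw = 'Jos\xc3\xa9'            # bytes from, say, a UTF-8 terminal
needle = raw.decode('utf-8')   # decode once, at the system boundary
print needle in stored         # True: both sides are unicode now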
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.
Unicode strings in python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using utf8 encoding is considered a best practice among multiple dev groups all over the world.
To encode, use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
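Putting it together with the question's edit; the safe=':/' argument (an addition here, not from the question) keeps the scheme separator and slashes unescaped:
import urllib2
from urllib2 import quote

url = u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'
safe_url = quote(url.encode('utf-8'), safe=':/')  # percent-encodes the UTF-8 bytes
req = urllib2.Request(safe_url)                   # the URL is pure ASCII now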
Also, if you're more interested in how Unicode and UTF-8 work, check out the Unicode HOWTO.
In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII-safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked OK in a UTF-8 editor or browser. What you saw after the encode was an ASCII-safe representation of the UTF-8 bytes.
For example, your trouble chars were é and í.
é = U+00E9 in Unicode = C3 A9 in UTF-8
í = U+00ED in Unicode = C3 AD in UTF-8
In short, your .encode() method is correct and should be used for writing to files or to a browser.
I am trying to parse this document with Python and BeautifulSoup:
http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=rage_against_the_machine
The seventh item down has this Text tag:
Rage Against the Machine's 1994–1995 Tour
When I try to print out the text "Rage Against the Machine's 1994–1995 Tour", python is giving me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 31: ordinal not in range(128)
I can resolve it by simply replacing u'\u2013' with '-' like so:
itemText = itemText.replace(u'\u2013', '-')
However what about every character that I have not coded for? I do not want to ignore them nor do I want to list out every possible find and replace.
Surely a library must exist that tries its very best to detect the encoding from a list of commonly known encodings (however likely it is to get it wrong).
someText = getTextWithUnknownEncoding(someLocation);
bestAsciiAttemptText = someLibrary.tryYourBestToConvertToAscii(someText)
Thank you
Decoding it as UTF-8 should work:
itemText = itemText.decode('utf-8')
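If the best-effort conversion to ASCII that the question asks for is really needed, a common approach is to decompose the characters first; a sketch with the standard unicodedata module (not a specific library recommendation):
import unicodedata

def best_ascii_attempt(text):
    # split accented characters into base letter + combining accent,
    # then drop everything that has no ASCII equivalent
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

print best_ascii_attempt(u'Jos\xe9 \u2013 tour')  # 'Jose  tour': the accent is
                                                  # stripped, the dash is dropped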
Normally, you should try to preserve characters as unicode or utf-8. Avoid converting characters to your local codepage, as this results in loss of information.
However, if you must, here are a few things you can do. Let's use your example character:
>>> s = u'\u2013'
If you want to print the string e.g. for debugging, you can use repr:
>>> print(repr(s))
u'\u2013'
In an interactive session, you can just type the variable name to achieve the same result:
>>> s
u'\u2013'
If you really want to convert the text to your local codepage, and it is OK that characters outside this codepage are converted to '?', you can use this:
>>> s.encode('latin-1', 'replace')
'?'
If '?' is not good enough, you can use translate to convert selected characters into an equivalent character as in this answer.
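unicode.translate takes a mapping from code points to replacement characters (or None to delete them), so a small hand-built table covers the usual typographic offenders; extend it as needed:
table = {0x2013: u'-',   # EN DASH -> hyphen
         0x2022: u'*'}   # BULLET  -> asterisk
print u'1994\u20131995 \u2022 Tour'.translate(table).encode('ascii')
# 1994-1995 * Tour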
You may need to explicitly declare your encoding.
On the first line of your file (or after the hashbang, if there is one), add the following line:
# -*- coding: utf-8 -*-
This 'magic comment' forces Python to expect UTF-8 characters and should decode them successfully.
More details: http://www.python.org/dev/peps/pep-0263/
I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)

File "clean_up_barnet.py", line 104, in <module>
    cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.
A very easy way around all the "'ascii' codec can't encode character..." issues with csv.writer is to instead use unicodecsv, a drop-in replacement for the standard csv module.
Install unicodecsv with pip and then you can use it in exactly the same way, e.g.:
import unicodecsv

f = open('users.csv', 'w')
w = unicodecsv.writer(f)
for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
    w.writerow(user)
For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What do did you expect?
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.
Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must first be turned into a string of bytes (and that's where the default codec ascii comes up and fails). In your text you then say "if I rewrite it to x.encode"... which seems to imply that you do know x is Unicode.
So what IS it you're doing -- and what is it you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Look closely: You got a Unicode***Encode***Error calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.
for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
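A minimal sketch of that last option, reusing the names from the question:
for item in items:
    # UTF-8 can represent every character, including U+2022 BULLET
    cleancsv.writerow([x.encode('utf-8') for x in item])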
xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£
# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'
# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>
Working with xlrd, I had a line ...xl_data.find(str(cell_value))... which gave the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All the suggestions in the forums were useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. The reason is that calling str() on a unicode cell value implicitly encodes it with the ascii codec, which fails for any non-ASCII character; passing the unicode value through unchanged avoids the encode entirely.
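A minimal reproduction of why str() blows up there, with the hypothetical cell text u'gro\xdf' ('groß') standing in for any German word:
>>> cell_value = u'gro\xdf'   # xlrd hands back cell text as unicode
>>> str(cell_value)           # implicitly does cell_value.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)
>>> xl_data.find(cell_value)  # xl_data as in the question: pass the unicode through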