Python unicode accent a (à) hex - python

I have a string from bs4 that is
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
\u00c3\u00a0should be accent a (à) I have gotten it to show up in the console partly correct as
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
with
str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))
but it's decoding c3 and a0 separately, so I get a tilde A instead of an accent a. I know that c3 a0 is the hex utf-8 for accent a. I have no idea what's going on and I got to here using Google and the combinatory approach to the answers I got. This entire character encoding thing seems like a big mess to me.
The way it is supposed to be is
311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
EDIT:
Andrey's method worked when printing it out, but trying to use urlopen with the string I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128)
After using unquote(str,":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).

Transform the string back into bytes using .encode('latin-1'), then decode the unicode-escapes \u, transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally, decode "properly" as 'utf-8':
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')
gives:
'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
It works for the same reason as explained in this answer.

Assuming Python 2:
This is a byte string with Unicode escapes. The Unicode escapes were incorrectly generated for some UTF-8-encoded data:
>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it is a Unicode string but now appears mis-decoded since the code points resemble UTF-8 bytes. It turns output the latin1 (also iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:
>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it can be decoded correctly as UTF-8:
>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'
It is a Unicode string, so Python displays its repr() value, which shows code points above U+007F as escape codes. print it to see the actual value assuming your terminal is correctly configured with an encoding that supports the characters printed:
>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.

Related

How to allow encode('utf-8') twice without getting error in python?

I have a legacy code segment that always encode('utf-8') for me when I pass in an unicode string (directly from database), is there a way to change unicode string to other format to allow it to be encoded to 'utf-8' again without getting an error, since I am not allowed to change the legacy code segment.
I've tried decoding it first but it returns this error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
If I leave the unicode string as is it returns
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)
If I change the legacy code to not encode('utf-8') it works, but this is not a viable option
Edit:
Here is the code snippet
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
if __name__ == "__main__":
# 1
a = u'贸易'
# 2
a = a.decode('utf-8')
# 3
a.encode('utf-8')
For some reason if I skip #2 I don't get the error that I mentioned above, I double check the type for the string, it seems like both is unicode, and both is the same character, but the code I am working on does not allow me to encode or decode to utf-8 , while the same character in some snippet allows me to do that.
Consider the following cases:
If you want a unicode string, and you already have a unicode string, you need do nothing.
If you want a bytestring, and you already have a bytestring, you need do nothing.
If you have a unicode string and want a bytestring, you encode it.
If you have a bytestring and want a unicode string, you decode it.
In none of these cases is it appropriate to encode or decode more than once.
In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it's a unicode instance already, you have to decode it first from whatever encoding it's in to a unicode string, after which you can pass it into your legacy interface.
At no point does it make sense for anything to be double-encoded -- encoding takes a string and transforms it to a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the str for both plain-ASCII strings and byte sequences.
>>> u'é'.encode('utf-8') # unicode string
'\xc3\xa9' # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9' # unicode string
>>> u'\xe9' == u'é'
True

'ascii' codec can't encode character u'\xe9'

I already tried all previous answers and solution.
I am trying to use this value, which gave me encoding related error.
ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']
So I tried,
d = [x.decode('utf-8') for x in ar]
which gives:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)
I tried out
d = [x.encode('utf-8') for x in ar]
which removes error but changes the original content
original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' which converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' while using encode
what is correct way to deal with this scenario?
Edit
Error comes when I feed these links in
req = urllib2.Request()
The second version of your string is the correct utf-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode string internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user inputs subsystem).
Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.
Unicode strings in python are "raw" unicode, so make sure to .encode() and .decode() them as appropriate. Using utf8 encoding is considered a best practice among multiple dev groups all over the world.
To encode use the quote function from the urllib2 library:
from urllib2 import quote
escaped_string = quote(unicode_string.encode('utf-8'))
To decode, use unquote:
from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
Also, if you're more interested in Unicode and UTF-8 work, check out Unicode HOWTO and
In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno
Your .encode("UTF-8") is correct and would have looked ok in a UTF-8 editor or browser. What you saw after the encode was an ASCII safe representation of UTF-8.
For example, your trouble chars were é and í.
é = 00E9 Unicode = C3A9 UTF-8
í = 00ED Unicode = C3AD UTF-8
In short, your .encode() method is correct and should be used for writing to files or to a browser.

exceptions.UnicodeDecodeError - 'ascii' codec can't decode byte

I keep getting this error:
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
args = ('ascii', '\xe2\x9d\xb6 Senn =)', 0, 1, 'ordinal not in range(128)')
encoding = 'ascii'
end = 1
message = ''
object = '\xe2\x9d\xb6 Senn =)'
reason = 'ordinal not in range(128)'
start = 0
Using this code:
steamFriend = data['response']['players'][i]
n = steamUser(steamFriend['personaname'].encode("utf-8"), steamFriend['steamid'], steamFriend['avatarfull'], steamFriend['profileurl'], steamFriend['personastate'], False)
Some things to note here:
steamFriend is a JSON object
I get this error only sometimes, because the steamFriend['personaname'] contains some weird symbols (for example ❶), and I don't know how to parse this correctly so I don't get errors.
Any help is greatly appreciated.
Also, \xe2\x9d\xb6 Senn =) is supposed to represent ❶ Senn =), if that helps.
Without seeing the full code it is hard to tell, but it seems that steamUser expects ascii input. If that is the problem, you can solve it by:
streamFriend['personaname'].encode("ascii", errors="ignore")
or
streamFriend['personaname'].encode("ascii", errors="replace")
Obviously you will lose unicode characters in the process.
If the quoted error is occurring on the n=... line, the implication is that steamFriend['personaname'] is a byte string, not a Unicode string.
Consequently when you ask to .encode it, Python has to decode the string to Unicode in order to be able to encode it back to bytes. An implicit decoding happens using the default encoding, which is ASCII, so because the byte string does not contain only ASCII you get a failure.
Are you sure you didn't mean to do:
steamFriend['personaname'].decode("utf-8")
decoding the byte string '\xe2\x9d\xb6 Senn =)' using UTF-8 would give you the Unicode string u'\u2776 Senn =)', where U+2776=❶ so that would seem more like what you are after.
(Normally, however, JSON strings are explicitly Unicode, so it's not clear where you would have got the byte string from. How are you loading the JSON content?)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
for item in items:
#item = [x.encode('latin-1') for x in item]
cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.
A very easy way around all the "'ascii' codec can't encode character…" issues with csvwriter is to instead use unicodecsv, a drop-in replacement for csvwriter.
Install unicodecsv with pip and then you can use it in the exact same way, eg:
import unicodecsv
file = open('users.csv', 'w')
w = unicodecsv.writer(file)
for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
w.writerow(user)
For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What do did you expect?
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.
Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must be first turned into a string of bytes (and that's where the default codec ansi comes up and fails). In your text then you say "if I rewrite ot to x.encode"... which seems to imply that you do know x is Unicode.
So what it IS you're doing -- and what it is you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Look closely: You got a Unicode***Encode***Error calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.
for item in items:
item = [x.encode('latin-1') for x in item]
cleancsv.writerow(item)
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£
# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'
# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>
Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error:"'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my german words. But changing into: ...xl_data.find(cell.value)... gives no error. So, I suppose using strings as arguments in certain commands with xldr has specific encoding problems.

Trouble with encoding and urllib

I'm loading web-page using urllib. Ther eis russian symbols, but page encoding is 'utf-8'
1
pageData = unicode(requestHandler.read()).decode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 262: ordinal not in range(128)
2
pageData = requestHandler.read()
soupHandler = BeautifulSoup(pageData)
print soupHandler.findAll(...)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 340-345: ordinal not in range(128)
In your first snippet, the call unicode(requestHandler.read()) tells Python to convert the bytestring returned by read into unicode: since no code is specified for the conversion, ascii gets tried (and fails). It never gets to the point where you're going to call .decode (which would make no sense to call on that unicode object anyway).
Either use unicode(requestHandler.read(), 'utf-8'), or requestHandler.read().decode('utf-8'): either of these should produce a correct unicode object if the encoding is indeed utf-8 (the presence of that D0 byte suggests it may not be, but it's impossible to guess from being shown a single non-ascii character out of context).
printing Unicode data is a different issue and requires a well configured and cooperative terminal emulator -- one that lets Python set sys.stdout.encoding on startup. For example, on a Mac, using Apple's Terminal.App:
>>> sys.stdout.encoding
'UTF-8'
so the printing of Unicode objects works fine here:
>>> print u'\xabutf8\xbb'
«utf8»
as does the printing of utf8-encoded byte strings:
>>> print u'\xabutf8\xbb'.encode('utf8')
«utf8»
but on other machines only the latter will work (using the terminal emulator's own encoding, which you need to discover on your own because the terminal emulator isn't telling Python;-).
If requestHandler.read() delivers a UTF-8 encoded stream, then
pageData = requestHandler.read().decode('utf-8')
will decode this into a Unicode string (at which point, as Dietrich Epp noted correctly), the unicode() call is not necessary anymore.
If it throws an exception, then the input is obviously not UTF-8-encoded.

Categories