Trouble with encoding and urllib - python

I'm loading a web page using urllib. The page contains Russian characters, and its encoding is 'utf-8'.
Attempt 1:
pageData = unicode(requestHandler.read()).decode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 262: ordinal not in range(128)
Attempt 2:
pageData = requestHandler.read()
soupHandler = BeautifulSoup(pageData)
print soupHandler.findAll(...)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 340-345: ordinal not in range(128)

In your first snippet, the call unicode(requestHandler.read()) tells Python to convert the byte string returned by read into unicode: since no codec is specified, ascii gets tried (and fails). It never gets to the point of calling .decode (which would make no sense on that unicode object anyway).
Either use unicode(requestHandler.read(), 'utf-8') or requestHandler.read().decode('utf-8'): either of these should produce a correct unicode object if the encoding is indeed utf-8 (the presence of that 0xD0 byte suggests it may not be, but it's impossible to guess from a single non-ascii byte shown out of context).
Printing Unicode data is a different issue, and requires a well-configured and cooperative terminal emulator -- one that lets Python set sys.stdout.encoding on startup. For example, on a Mac, using Apple's Terminal.app:
>>> sys.stdout.encoding
'UTF-8'
so the printing of Unicode objects works fine here:
>>> print u'\xabutf8\xbb'
«utf8»
as does the printing of utf8-encoded byte strings:
>>> print u'\xabutf8\xbb'.encode('utf8')
«utf8»
but on other machines only the latter will work (using the terminal emulator's own encoding, which you need to discover on your own because the terminal emulator isn't telling Python;-).
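A tiny Python 2 sketch of that distinction (assuming nothing about your terminal; utf-8 in the fallback is a guess you must replace with your terminal's actual codec):
import sys

text = u'\xabutf8\xbb'
if sys.stdout.encoding:
    print text                    # Python encodes using sys.stdout.encoding
else:
    print text.encode('utf-8')    # you must supply the terminal's codec yourself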

If requestHandler.read() delivers a UTF-8 encoded stream, then
pageData = requestHandler.read().decode('utf-8')
will decode this into a Unicode string (at which point, as Dietrich Epp correctly noted, the unicode() call is no longer necessary).
If it throws an exception, then the input is obviously not UTF-8-encoded.
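Putting both answers together, a minimal Python 2 sketch of the corrected flow (the URL is a placeholder):
import urllib

requestHandler = urllib.urlopen('http://example.com/page')  # placeholder URL
raw = requestHandler.read()        # byte string (str)
pageData = raw.decode('utf-8')     # unicode object; raises UnicodeDecodeError
                                   # if the page is not really UTF-8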

Python unicode accent a (à) hex

I have a string from bs4 that is
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
\u00c3\u00a0 should be an accented a (à). I have gotten it to show up in the console partly correct as
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
with
str2 = u'%s' % s
print(str2.encode('utf-8').decode('unicode-escape'))
but it's decoding c3 and a0 separately, so I get a tilde-A (Ã) instead of an accented a. I know that c3 a0 is the UTF-8 hex for à. I have no idea what's going on; I got this far using Google and combining the answers I found. This whole character-encoding business seems like a big mess to me.
The way it is supposed to be is
311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
EDIT:
Andrey's method worked when printing it out, but when I try to use urlopen with the string I get UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 60: ordinal not in range(128).
After using unquote(str, ":/") it gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128).
Transform the string back into bytes using .encode('latin-1'), then decode the unicode-escapes \u, transform everything into bytes again using the "wrong" 'latin-1' encoding, and finally, decode "properly" as 'utf-8':
s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')
gives:
'vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html'
It works for the same reason as explained in this answer.
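Regarding the edit: urlopen only accepts ASCII URLs, so one approach (a sketch, assuming Python 3 and a placeholder base URL) is to percent-encode the repaired path before requesting it:
from urllib.parse import quote
from urllib.request import urlopen

s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
fixed = s.encode('latin-1').decode('raw_unicode_escape').encode('latin-1').decode('utf-8')
url = 'http://example.com/' + quote(fixed)  # quote() turns à into %C3%A0 and keeps '/'
response = urlopen(url)                     # placeholder host: substitute the real site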
Assuming Python 2:
This is a byte string with Unicode escapes. The Unicode escapes were incorrectly generated for some UTF-8-encoded data:
>>> s = "vinili-disponibili/311-canzoniere-del-lazio-lassa-st\u00c3\u00a0-la-me-creatura.html"
>>> s.decode('unicode-escape')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it is a Unicode string, but it appears mis-decoded, since the code points resemble UTF-8 bytes. It turns out the latin1 (also iso-8859-1) codec maps the first 256 code points directly to bytes 0-255, so use this trick to convert back to a byte string:
>>> s.decode('unicode-escape').encode('latin1')
'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xc3\xa0-la-me-creatura.html'
Now it can be decoded correctly as UTF-8:
>>> s.decode('unicode-escape').encode('latin1').decode('utf8')
u'vinili-disponibili/311-canzoniere-del-lazio-lassa-st\xe0-la-me-creatura.html'
It is a Unicode string, so Python displays its repr() value, which shows code points above U+007F as escape codes. print it to see the actual value, assuming your terminal is correctly configured with an encoding that supports the characters being printed:
>>> print(s.decode('unicode-escape').encode('latin1').decode('utf8'))
vinili-disponibili/311-canzoniere-del-lazio-lassa-stà-la-me-creatura.html
Ideally, fix the problem that generated this string incorrectly in the first place instead of working around the mess.

How to allow encode('utf-8') twice without getting error in python?

I have a legacy code segment that always calls encode('utf-8') on whatever string I pass in (which comes straight from the database). Is there a way to change a unicode string into some other form that allows it to be encoded to 'utf-8' again without raising an error? I am not allowed to change the legacy code segment.
I've tried decoding it first, but that returns this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
If I leave the unicode string as is, it returns:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 986: ordinal not in range(128)
If I change the legacy code not to encode('utf-8') it works, but that is not a viable option.
Edit:
Here is the code snippet
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
if __name__ == "__main__":
    # 1
    a = u'贸易'
    # 2
    a = a.decode('utf-8')
    # 3
    a.encode('utf-8')
For some reason, if I skip #2 I don't get the error mentioned above. I double-checked the type of the string: both are unicode, and both are the same characters, yet the string I am working with won't let me encode or decode to utf-8, while the same characters in this snippet will.
Consider the following cases:
If you want a unicode string, and you already have a unicode string, you need do nothing.
If you want a bytestring, and you already have a bytestring, you need do nothing.
If you have a unicode string and want a bytestring, you encode it.
If you have a bytestring and want a unicode string, you decode it.
In none of these cases is it appropriate to encode or decode more than once.
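A short Python 2 sketch of that rule, using a hypothetical to_unicode() helper in front of the legacy call:
# -*- coding: utf-8 -*-

def legacy(s):                        # stand-in for the untouchable legacy code
    return s.encode('utf-8')

def to_unicode(s, encoding='utf-8'):  # hypothetical helper
    if isinstance(s, unicode):        # already unicode: nothing to do
        return s
    return s.decode(encoding)         # byte string: decode exactly once

legacy(to_unicode(u'贸易'))   # unicode passes straight through
legacy(to_unicode('贸易'))    # utf-8 byte string gets decoded first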
In order for encode('utf-8') to make sense, the string must be a unicode string (or contain all-ASCII characters...). So, unless it's a unicode instance already, you have to decode it first from whatever encoding it's in to a unicode string, after which you can pass it into your legacy interface.
At no point does it make sense for anything to be double-encoded -- encoding takes a string and transforms it into a series of bytes; decoding takes a series of bytes and transforms them back into a string. The confusion only arises because Python 2 uses the str type for both plain-ASCII strings and byte sequences.
>>> u'é'.encode('utf-8') # unicode string
'\xc3\xa9' # bytes, not unicode string
>>> '\xc3\xa9'.decode('utf-8')
u'\xe9' # unicode string
>>> u'\xe9' == u'é'
True

Python encoding error with some Unicode characters

I am having some problems with encoding some unicode characters.
This is the code I am using:
test = raw_input("Test: ")
print test.encode("utf-8")
It works with normal ASCII characters, and even with some "strange" Unicode characters like ☃.
But with characters like ß ä ö ü § it fails with this error:
Traceback (most recent call last):
  File "C:\###\Test.py", line 5, in <module>
    print test.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)
Note that I am using a PC where German is the default language (so these characters are standard there).
raw_input() returns a byte string. You don't need to encode that byte string, it is already encoded.
What happens instead then is that Python will first decode to get a unicode value to encode; you asked Python to encode so it'll damn well try to get you something that can be encoded. It is the decoding that fails here. Implicit decoding uses ASCII, which is why you got a UnicodeDecodeError exception (note the Decode in the name) for that codec.
If you wanted to produce a unicode object you'd have to explicitly decode. Use the codec Python has detected for stdin:
import sys
test = raw_input("Test: ")
print test.decode(sys.stdin.encoding)
You don't need to do that here, because you are printing: you're writing right back to the same terminal, which uses the same codec for input and output. Writing back the byte string exactly as you received it is then fine. Decoding to unicode is fine too, as printing will auto-encode to sys.stdout.encoding.
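If you do need UTF-8 bytes somewhere other than the terminal (a file, say), here is a hedged Python 2 sketch of the full decode-then-encode round trip:
import sys

test = raw_input("Test: ")                 # byte string in the console's codec
text = test.decode(sys.stdin.encoding)     # unicode for internal use
with open('out.txt', 'wb') as f:           # out.txt is just an example target
    f.write(text.encode('utf-8'))          # choose the encoding at the boundary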

Weird unicode issue

I've got the following problem. If I run my app in Eclipse it works OK, but when I run it in a standalone debugger I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0144' in position 7: ordinal not in range(128)
How can I fix it?
My code fragment:
x = x.replace("Ł", "L")
Guess, based on (insufficient) information provided:
You are running Python 2.x.
[Guess] x is a str object.
[Guess] Eclipse sets the default encoding to UTF-8.
The "standard debugger" sets the default encoding to ascii.
Result: splat.
Solution (standard operating procedure for working with Unicode):
On input, convert all str objects to unicode.
Work in Unicode.
On output, encode all unicode objects using whatever encoding the consumer of the output is expecting.
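A minimal Python 2 sketch of that procedure, assuming x arrives as UTF-8 bytes (the sample value is made up):
# -*- coding: utf-8 -*-
x = 'Łódź'                     # byte string; utf-8 thanks to the coding line
x = x.decode('utf-8')          # input boundary: str -> unicode
x = x.replace(u'Ł', u'L')      # work in unicode throughout
print x.encode('utf-8')        # output boundary: encode for the consumer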
Important update: Actually, if x was a UTF-8-encoded str object, you should have got a message like UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 7: etc etc.
Note that your actual error message says UnicodeEncodeError: 'ascii' codec can't encode character u'\u0144' in position 7: etc etc. This indicates that whatever it is complaining about is (a) a unicode object and (b) at least 8 characters long. However, you are saying in effect that x is not a unicode object (otherwise x.decode('utf8') would fail), and the other two args of replace are only 1 character long. Consequently we have an impossibility.
To help resolve this:
print type(x), repr(x) # for Python 2.x
Lstroke = "Ł"
print type(Lstroke), repr(Lstroke)
y = x.replace(Lstroke, 'L')
and edit your question to show the actual code that you ran plus the full error message and the traceback.
By the way: u'\u0144' is LATIN SMALL LETTER N WITH ACUTE; does that info help at all?
Try adding # -*- coding: utf-8 -*- at the top of the file to make the Python interpreter aware of the file's encoding, UTF-8 in my example. You can also do this by saving the file with a BOM. I'm not sure how Eclipse hints at the encoding, but maybe it uses sys.setdefaultencoding() somehow?
You can read more details in the Python manual about source code encoding.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
    cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.
A very easy way around all the "'ascii' codec can't encode character…" issues with csv.writer is to instead use unicodecsv, a drop-in replacement.
Install unicodecsv with pip and then you can use it in exactly the same way, e.g.:
import unicodecsv

f = open('users.csv', 'wb')   # binary mode; also avoids shadowing the file builtin
w = unicodecsv.writer(f)
for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
    w.writerow(user)
For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What do did you expect?
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.
Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so to "decode" it, it must first be turned into a string of bytes (and that's where the default codec ascii comes up and fails). In your text you then say "if I rewrite it to x.encode"... which seems to imply that you do know x is Unicode.
So what it IS you're doing -- and what it is you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string and decode on a unicode object, because it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Look closely: you got a UnicodeEncodeError calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.
for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
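A short sketch of that last option with the stdlib csv module, assuming rows of unicode cells like the ones in the question (the sample data is made up):
import csv

items = [[u'\xa3 price', u'\u2022 bullet point']]   # made-up sample rows

with open('out.csv', 'wb') as f:
    writer = csv.writer(f)
    for item in items:
        writer.writerow([x.encode('utf-8') for x in item])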
xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£
# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'
# an error, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
>>>
Working with xlrd, I had a line ...xl_data.find(str(cell_value))... which gave the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All the suggestions in the forums were useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose that passing str() of a cell value to certain xlrd calls has specific encoding problems.
