Encoding unicode with python - python

I’m trying to understand how encoding unicode works in Python 2.7. So far it has been easy to find the solution, but I haven’t found any clear explanation of what is going on here. Here is an example.
The introduction
We have a unicode variable we received, called filter_type
filter_type = u'some_välüe'
We put this into a dict and pass this into the python library urllib.urlencode.
Like so:
urllib.urlencode({"param": ..., "filter_type": filter_type})
The issue.
Inside urllib.urlencode, it loops over the data given to it and wraps each key and value in the str() builtin function to get a string representation before encoding it into a URL.
We get an error similar to the following:
{UnicodeEncodeError}'ascii' codec can't encode character u'\xf1' in position 42: ordinal not in range(128).
You get this same error by doing str(u'some_välüe').
After some research and digging, it looks like wrapping a unicode value in str() tries to encode the value with the default encoding that is set (my assumption):
>>> import sys
>>> sys.getdefaultencoding()
ascii
The solution.
So we can fix this by encoding these unicode strings with utf-8.
filter_type = u'some_välüe'.encode('utf-8')
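A minimal sketch of the failure and the fix, written for Python 3 (an assumption: the question targets 2.7, where the function lives at urllib.urlencode rather than urllib.parse.urlencode). The explicit .encode('ascii') stands in for the implicit ASCII encoding that Python 2's str() attempts on a unicode object.

```python
from urllib.parse import urlencode  # Python 3 location of urlencode

filter_type = 'some_v\xe4l\xfce'  # the same text as u'some_välüe'

# Stand-in for Python 2's str(unicode_obj): encode with the ASCII default.
try:
    filter_type.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 yields bytes that urlencode can percent-escape.
encoded = filter_type.encode('utf-8')
query = urlencode({'filter_type': encoded})

print(ascii_failed)  # True
print(query)         # filter_type=some_v%C3%A4l%C3%BCe
```

In Python 2, calling str() on an object that is already a str returns it unchanged, so once the value is a UTF-8 byte string no second encoding step takes place.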
The question.
But here is the question. Earlier I mentioned that urllib.urlencode wraps keys and values in the str() function.
These values are already encoded now, so:
What does str() do in this case?
Does the representation of a unicode object change when it’s encoded to utf-8?
If it does, why did str() try to encode the unicode object to ascii (the default) in the first place?

Related

Bytes to string conversion in Python doesn't seem to work as expected

Why in Python 3 would the following code
print(str(b"Hello"))
output b'Hello' instead of just Hello, as happens with regular text strings? Creating a str object from the most closely related binary string type seems like it should be easy and explicit, yet the result is counter-intuitive.
In Python 3, bytes.__str__ is not defined, so bytes.__repr__ is used instead, when you use str() on the object. Note that print() also calls str() on objects passed in, so the call is entirely redundant here.
If you are expecting text, decode explicitly instead:
print(b'Hello'.decode('ascii'))
The str() type can handle bytes objects explicitly, but only if (again) you provide an explicit codec to decode the bytes with first:
print(str(b'Hello', 'ascii'))
The documentation is very explicit about this behaviour:
If neither encoding nor errors is given, str(object) returns object.__str__(), which is the “informal” or nicely printable string representation of object. For string objects, this is the string itself. If object does not have a __str__() method, then str() falls back to returning repr(object).
If at least one of encoding or errors is given, object should be a bytes-like object (e.g. bytes or bytearray). In this case, if object is a bytes (or bytearray) object, then str(bytes, encoding, errors) is equivalent to bytes.decode(encoding, errors).
and
Passing a bytes object to str() without the encoding or errors arguments falls under the first case of returning the informal string representation.
Emphasis mine.
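The three calls discussed above can be compared side by side. This is a small Python 3 sketch; the variable names are only illustrative.

```python
data = b"Hello"

as_repr = str(data)              # no codec given: the repr-style text "b'Hello'"
decoded = data.decode('ascii')   # explicit decode: 'Hello'
via_str = str(data, 'ascii')     # equivalent to data.decode('ascii')

print(as_repr)   # b'Hello'
print(decoded)   # Hello
print(via_str)   # Hello
```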
Why do you want this to "work"? A bytes object is a bytes object, and its string representation in Python 3 takes that form. To convert its contents to a proper text string (a str in Python 3, which would be a "unicode" object in Python 2) you have to decode it.
And for that you need to know the encoding.
Try the following instead:
print(b"Hello".decode("latin-1"))
Note the assumed "latin-1" text codec, which maps every byte value, including those outside the ASCII range (128-255), directly to the Unicode code point with the same number. It is close to, though not identical to, cp1252, the codec Windows uses by default for western-European languages.
The "utf-8" codec can represent a much larger range of characters and is the preferred encoding for international text, but if your byte string is not valid UTF-8 you may get a UnicodeDecodeError in the process.
Please read http://www.joelonsoftware.com/articles/Unicode.html to properly understand what text is about.
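To make the difference between the two codecs concrete, here is a short Python 3 sketch. The sample bytes are hypothetical data chosen for illustration.

```python
raw = b'caf\xe9'  # hypothetical sample: 'café' encoded as latin-1

# latin-1 maps every byte to a code point, so decoding never fails.
as_latin1 = raw.decode('latin-1')

# The same bytes are not valid UTF-8, so a strict decode raises.
try:
    raw.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Text that was encoded as UTF-8 decodes cleanly as UTF-8.
round_trip = 'caf\xe9'.encode('utf-8').decode('utf-8')
```

The point is that latin-1 never raises, which also means it cannot tell you when you guessed the wrong encoding.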
Beforehand, sorry for my English...
Hey, I had this problem some weeks ago. It works as the people above said.
Here is a tip if the exceptions of the decoding process do not matter. In this case you can use:
bytesText.decode(textEncoding, 'ignore')
Ex:
>>> b'text \xab text'.decode('utf-8', 'ignore')  # Using UTF-8 is nice as you might know!
'text  text'  # As you can see, the « (\xab) symbol was ignored :D
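As a rough comparison of error handlers, this Python 3 sketch decodes the same bytes three ways. 'replace' is another standard handler and is shown here as an alternative to 'ignore'.

```python
raw = b'text \xab text'  # a lone \xab byte is not valid UTF-8

ignored = raw.decode('utf-8', 'ignore')     # silently drops the bad byte
replaced = raw.decode('utf-8', 'replace')   # substitutes U+FFFD instead
as_latin1 = raw.decode('latin-1')           # in latin-1, \xab is the « character
```

With 'ignore' the data loss is invisible, while 'replace' leaves a marker where something went wrong.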

Encoding of a string in Python

I have a string S="Test" in Python. I want to encode the string into CP1256, ISO-8859-1, ISO-8859-2, ISO-8859-6, ISO-8859-15 and Windows-1252 formats. How can I encode the string into these formats?
I don't know why Slava Bacherikov deleted his answer, but it was the right answer, so I'll repeat it with more detail.
str.encode is exactly what you want:
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.
If you follow that link to Standard Encodings, you'll see a nice table that shows you the names to use for each of these (you can use either the main codec name, or any of the aliases).
So:
encoded_bytes = [S.encode(codec) for codec in
                 ('cp1256', 'iso-8859-1', 'iso-8859-2', 'iso-8859-6',
                  'iso-8859-15', 'windows-1252')]
While you could use codecs.encode as the other answers suggest, there's really no good reason to do so, and one good reason not to: str.encode enforces the fact that you're calling it on a str object, and using a codec that translates str to bytes; you'll get an exception if you accidentally use it on an already-encoded bytes or a list or something.
All of the above is assuming you're using Python 3. If you're using Python 2, a str is already encoded. So, if you can start with a unicode object, like u"Test" instead of "Test", do that; if not, you will want to decode it first. Unfortunately, Python 2 won't enforce that; if you call str.encode it will actually decode it with sys.getdefaultencoding, which will usually be ASCII, which will lead to silly errors.
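A small Python 3 sketch of what these codecs do in practice. The euro sign is an added example (not from the question), used to show that the codecs differ once you leave ASCII.

```python
S = "Test"
wanted = ('cp1256', 'iso-8859-1', 'iso-8859-2', 'iso-8859-6',
          'iso-8859-15', 'windows-1252')

# Pure ASCII text encodes to the same bytes in all six codecs.
encoded = {name: S.encode(name) for name in wanted}

# Outside ASCII the codecs disagree, or fail outright.
euro = '\u20ac'
euro_15 = euro.encode('iso-8859-15')      # b'\xa4'
euro_1252 = euro.encode('windows-1252')   # b'\x80'
try:
    euro.encode('iso-8859-1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True                  # no euro sign in iso-8859-1

# In Python 3, bytes has no encode method, so double-encoding fails loudly.
bytes_can_encode = hasattr(b'Test', 'encode')
```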
That's what the codecs module is for:
codecs.encode(S,'CP1256')
Just use the codecs module
import codecs
codecs.encode("hello", "iso-8859-6")
If you want to first check if python is aware of a certain encoding format just use
format_name = "iso-8859-6"
try:
    codecs.lookup(format_name)
except LookupError:
    print("Encoding {} can't be found".format(format_name))

Python & fql: getting "Dami\u00e1n" instead of "Damián"

I created a file containing a dictionary with data written in Spanish (i.e. Damián, etc.):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some fql fetching operations, i.e.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n".
I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1'):
dictionaryX.append({
    'name': unicodeVar.encode(encoding='latin-1'),
    ...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte)
To sum up, I've tried several possibilities, but have less than a clue. I'm lost. Please, I need help. Thanks!
You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.
JSON is inherently a Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read in this data, you will get the correct decoded string back. The reading code produces unicode objects (no matter whether you use 2.x or 3.x) when it reads strings out of the JSON.
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False
This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
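The two output styles can be compared directly. This is a Python 3 sketch; the dictionary is a stand-in for the asker's data.

```python
import json

data = {'name': 'Dami\u00e1n'}  # stand-in for dictionaryX

escaped = json.dumps(data)                       # ASCII-only, uses \u00e1
literal = json.dumps(data, ensure_ascii=False)   # keeps the á character

# Both forms parse back to the same text.
same_after_load = json.loads(escaped) == json.loads(literal) == data
```

Either form is valid JSON, so the \u00e1 in the file is not data corruption.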
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
encode(encoding='latin-1')
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
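A short Python 3 sketch of why the latin-1 attempt fails. The strict UTF-8 decode below mirrors what the 2.x JSON encoder does by default with str values, and the last step shows the 3.x refusal mentioned above.

```python
import json

name = 'Dami\u00e1n'
latin1_bytes = name.encode('latin-1')   # b'Dami\xe1n'

# Treating latin-1 bytes as UTF-8 fails on the 0xe1 byte.
try:
    latin1_bytes.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# In Python 3 the JSON encoder rejects bytes outright.
try:
    json.dumps({'name': latin1_bytes})
    bytes_rejected = False
except TypeError:
    bytes_rejected = True
```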
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file! (Instead of giving you a string and expecting you to write it yourself.) Which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
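A minimal sketch of writing the file with json.dump, assuming Python 3 (where open() takes an encoding argument). The file name and temporary directory are hypothetical.

```python
import json
import os
import tempfile

data = {'name': 'Dami\u00e1n'}
path = os.path.join(tempfile.mkdtemp(), 'names.json')  # hypothetical location

# Let the file object do the UTF-8 encoding; json.dump writes text to it.
with open(path, 'w', encoding='utf-8') as fh:
    json.dump(data, fh, indent=4, ensure_ascii=False)

# Read it back as UTF-8 to check what actually landed on disk.
with open(path, 'r', encoding='utf-8') as fh:
    text = fh.read()

loaded = json.loads(text)
```

With ensure_ascii=False the file contains the á character itself; leaving it at the default would write \u00e1 instead, and both load back identically.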
Use codecs.open() instead of open() to open fileNameX with a specific encoding, for example encoding='utf-8'.
Also consider json.dump().
Since the string has a \u inside it that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should print it in the proper encoding for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián

Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)

I want to send Chinese characters to be translated by an online service, and have the resulting English string returned. I'm using simple JSON and urllib for this.
And yes, I am declaring.
# -*- coding: utf-8 -*-
on top of my code.
Now everything works fine if I feed urllib a string type object, even if that object contains what would be Unicode information. My function is called translate.
For example:
stringtest1 = '無與倫比的美麗'
print translate(stringtest1)
results in the proper translation and doing
type(stringtest1)
confirms this to be a string object.
But if I do
stringtest1 = u'無與倫比的美麗'
and try to use my translation function I get this error:
File "C:\Python27\lib\urllib.py", line 1275, in urlencode
v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-8: ordinal not in range(128)
After researching a bit, it seems this is a common problem:
Problem: neither urllib2.quote nor urllib.quote encode the unicode strings arguments
urllib.quote throws exception on Unicode URL
Now, if I type in a script
stringtest1 = '無與倫比的美麗'
stringtest2 = u'無與倫比的美麗'
print 'stringtest1',stringtest1
print 'stringtest2',stringtest2
execution of it returns:
stringtest1 無與倫比的美麗
stringtest2 無與倫比的美麗
But just typing the variables in the console:
>>> stringtest1
'\xe7\x84\xa1\xe8\x88\x87\xe5\x80\xab\xe6\xaf\x94\xe7\x9a\x84\xe7\xbe\x8e\xe9\xba\x97'
>>> stringtest2
u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'
gets me that.
My problem is that I don't control how the information to be translated comes to my function. And it seems I have to bring it in the Unicode form, which is not accepted by the function.
So, how do I convert one thing into the other?
I've read Stack Overflow question Convert Unicode to a string in Python (containing extra symbols).
But this is not what I'm after. Urllib accepts the string object but not the Unicode object, both containing the same information
Well, at least in the eyes of the web application I'm sending the unchanged information to; I'm not sure if they are still equivalent things in Python.
When you get a unicode object and want to return a UTF-8 encoded byte string from it, use theobject.encode('utf8').
It seems strange that you don't know whether the incoming object is a str or unicode -- surely you do control the call sites to that function, too?! But if that is indeed the case, for whatever weird reason, you may need something like:
def ensureutf8(s):
    if isinstance(s, unicode):
        s = s.encode('utf8')
    return s
which only encodes conditionally, that is, if it receives a unicode object, not if the object it receives is already a byte string. It returns a byte string in either case.
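For readers on Python 3, the same idea might look like the sketch below. The name ensure_utf8 is made up here, and quote_plus from urllib.parse replaces the Python 2 urllib call seen in the traceback.

```python
from urllib.parse import quote_plus


def ensure_utf8(value):
    """Return UTF-8 bytes whether given text (str) or bytes already."""
    if isinstance(value, str):
        return value.encode('utf-8')
    return value


text = '\u7121\u8207'           # first two characters of the sample string
already = text.encode('utf-8')  # the byte-string form of the same text

from_text = ensure_utf8(text)
from_bytes = ensure_utf8(already)   # passed through untouched
quoted = quote_plus(from_text)      # percent-escaped, safe for a URL
```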
BTW, part of your confusion seems to be due to the fact that you don't know that just entering an expression at the interpreter prompt will show you its repr, which is not the same effect you get with print;-).
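The repr-versus-print difference can be reproduced in Python 3 with a sketch like this. Here ascii() is used to get an escaped form similar to what the Python 2 prompt shows for a unicode object.

```python
text = '\u7121\u8207'        # first two characters of the sample string
raw = text.encode('utf-8')

shown_for_bytes = repr(raw)  # what the prompt echoes for the byte string
escaped_text = ascii(text)   # escaped view of the text, like Python 2's repr
printed = str(text)          # what print() would write: the characters themselves
```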

Python unicode character in __str__

I'm trying to print cards using their suit unicode character and their values. I tried doing the following:
def __str__(self):
    return u'\u2660'.encode('utf-8')
as suggested in another thread, but I keep getting errors saying UnicodeEncodeError: ascii, ♠, 0, 1, ordinal not in range(128). What can I do to get those suit characters to show up when I print a list of cards?
Where does that UnicodeEncodeError occur exactly? I can think about two possible issues here:
The UnicodeEncodeError occurs in your __unicode__ method.
Your __unicode__ method returns a byte string instead of a unicode object and that byte string contains non-ASCII characters.
Do you have a __unicode__ method in your class?
I tried this on the Python console according to the actual data from your comment:
>>> u'\u2660'.encode('utf-8')
'\xe2\x99\xa0'
>>> print '\xe2\x99\xa0'
♠
It seems to work. Could you please try to print the same on your console? Maybe your console encoding is the problem.
Depending on how you have encoded those "suit symbols" into a byte string, you'll need to make the unicode string back for it by mentioning the appropriate codec (for example, thebytestr.decode('latin-1') if latin-1 is how you encoded it!), before making the utf-8 encoding of that unicode string. Just unicode(something) uses the default encoding, which is ASCII and therefore totally ignorant of any "suit symbols"!-)
As I said back then (3 months ago), I'd go for implementing __unicode__ instead of __str__, but that's just a minor issue of simplicity. The core point is, rather: if your byte string includes anything outside of the limited ASCII encoding, you must know what encoding your byte string uses, and decode it back into Unicode by explicitly using that codec!
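As a rough illustration for Python 3, where __str__ returns text and there is no __unicode__, a card class could be sketched as follows. The Card class, its fields and the SUITS table are hypothetical, since the asker's class is not shown.

```python
class Card:
    """Hypothetical playing card; the class and its fields are illustrative."""

    SUITS = {'spades': '\u2660', 'hearts': '\u2665',
             'diamonds': '\u2666', 'clubs': '\u2663'}

    def __init__(self, value, suit):
        self.value = value
        self.suit = suit

    def __str__(self):
        # Return text, not encoded bytes; the output stream does the encoding.
        return '{}{}'.format(self.value, self.SUITS[self.suit])

    # A list shows its items with repr(), so reuse the same text there.
    __repr__ = __str__


hand = [Card('A', 'spades'), Card('10', 'hearts')]
joined = ', '.join(str(card) for card in hand)
as_list = repr(hand)
```

Calling print(joined) or print(hand) then shows the suit symbols, provided the console encoding can represent them.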
I ran the same code and got
>>> u'\u2660'.encode('utf-8')
'\xe2\x99\xa0'
>>> print ('\xe2\x99\xa0')
â™ 
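That garbled output is consistent with a console that interprets the UTF-8 bytes as cp1252. This is an assumption about the poster's setup, not something stated in the thread. A Python 3 sketch reproduces it:

```python
utf8_bytes = '\u2660'.encode('utf-8')    # b'\xe2\x99\xa0'

# Reading those three bytes as cp1252 gives â, ™ and a no-break space.
misread = utf8_bytes.decode('cp1252')

# Reversing the wrong step recovers the original character.
repaired = misread.encode('cp1252').decode('utf-8')
```

So the bytes are correct in both runs; only the console's interpretation of them differs.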
