What is the use case of encode/decode?
My understanding was that encode converts a string into a byte string so that non-ASCII data can be passed around the program, and that decode converts this byte string back into a string.
But the following example shows non-ASCII characters being printed successfully even when they are not encoded/decoded. Example:
val1="À È Ì Ò Ù Ỳ Ǹ Ẁ"
val2 = val1
print('val1 is: ',val2)
encoded_val1=val1.encode()
print('encoded_val1 is: ',encoded_val1)
decoded_encoded_val1=encoded_val1.decode()
print('decoded_encoded_val1 is: ',decoded_encoded_val1)
Output:
So what is the use case of encode and decode in python?
Your environment may support those characters, and your terminal (or whatever you use to view output) may be able to display them. Some terminals, command lines, or text editors may not. Apart from display issues, here are some actual reasons and examples:
1- When you transfer data over the internet or a network (e.g. with a socket), information is transferred as raw bytes. Non-ASCII characters cannot be represented by a single ASCII byte, so they need an agreed-upon representation (UTF-16, or UTF-8 using more than one byte). This is the most common reason I have encountered.
2- Some text editors only support UTF-8. For example, you need to represent your Ẁ character in UTF-8 in order to work with them. The reason is that text handling was long built around ASCII characters, which take just one byte each. When systems needed to handle non-ASCII characters, people converted them to UTF-8. Someone with more in-depth knowledge of text editors may be able to explain this point better.
3- You may have text containing Unicode characters, such as Chinese or Russian letters, and for some reason need to store it on your remote Linux server. But your server does not support letters from those languages. You need to convert your text to a well-defined format (UTF-8 or UTF-16) and store it on your server so you can recover it later.
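To make point 1 concrete, here is a minimal Python 3 sketch of what "raw bytes" means. The sample text and the variable names are only illustrative assumptions; the point is that the same characters become different byte sequences depending on the encoding, so sender and receiver have to agree on one.

```python
# Illustrative only: the sample text is written with escapes so that the
# encoding of this source file does not matter.
text = "\u00c0 \u1e80"  # "À Ẁ": three characters, counting the space

utf8_bytes = text.encode("utf-8")       # the kind of value a socket would send
utf16_bytes = text.encode("utf-16-le")  # same text, different byte layout

print(len(text), len(utf8_bytes), len(utf16_bytes))
print(utf8_bytes)
print(utf16_bytes)

# The receiving side only has bytes and must decode with the same encoding.
received = utf8_bytes.decode("utf-8")
print(received == text)
```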
Here is a little explanation of UTF-8 format. There are also other articles about the topic if you are interested.
Use utf-8 encoding because it's universal.
Set your code editor to utf-8 encoding and put this at the top of all your python files: # coding: utf8
When you get an input (file, string...), it may use a different encoding, so you have to find out which encoding it uses and decode it accordingly. For example, in an HTML file the encoding is declared in a meta tag.
If you change something in the HTML file and want to save it or send it over the network, you have to encode it again in the encoding it had before.
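As a rough Python 3 sketch of that decode, modify, re-encode cycle: the input bytes and the declared encoding below are made up for illustration, standing in for a file whose meta tag says ISO-8859-1.

```python
# Hypothetical input: bytes that arrived from a file declared as ISO-8859-1.
declared_encoding = "iso-8859-1"
raw_input_bytes = b"caf\xe9"

# Decode at the boundary, using the encoding the input declares.
text = raw_input_bytes.decode(declared_encoding)

# Work with text (not bytes) inside the program.
text = text.upper()

# Re-encode in the original encoding before saving or sending.
raw_output_bytes = text.encode(declared_encoding)
print(raw_output_bytes)
```

Decoding the same bytes as UTF-8 would fail here, because a lone 0xe9 byte is not valid UTF-8; that is why the declared encoding has to be looked up first.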
Always use unicode for your string in python. (Automatic for python 3 but for python2.7 use the prefix u like u'Hi')
$ python2.7
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type('this is a string') # bytes => encoded
<type 'str'>
>>> type(u'this is a string') # unicode => decoded
<type 'unicode'>
$ python3
Python 3.2.3 (default, Oct 19 2012, 20:10:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type("this is a string") # unicode => decoded
<class 'str'>
>>> type(b"this is a string") # bytes => encoded
<class 'bytes'>
1 Use UTF8. Now. All over.
2 In your code, specify the file encoding and declare your strings as "unicode".
3 At the entrance, know the encoding of your data, and decode with decode().
4 At the output, encode with encode() in the encoding expected by the system that will receive the data, or in UTF8 if you cannot know it.
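In Python 3, rules 3 and 4 are often handled by passing encoding= to open(), which decodes on read and encodes on write. The following is a small sketch under that assumption; the temporary file and the sample text are only for illustration.

```python
import os
import tempfile

text = "\u4f60\u597d"  # sample text, written with escapes

# Create a throwaway file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Output boundary: text is encoded to UTF-8 as it is written.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # What is actually stored on disk is bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Input boundary: bytes are decoded back to text as they are read.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(raw)
print(round_tripped == text)
```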
I'm really confused. I tried to encode but the error said can't decode....
>>> "你好".encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
I know how to avoid the error with "u" prefix on the string. I'm just wondering why the error is "can't decode" when encode was called. What is Python doing under the hood?
"你好".encode('utf-8')
encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don't have the u). So python has to convert the string to a unicode object first. So it does the equivalent of
"你好".decode().encode('utf-8')
But the decode fails because the string isn't valid ascii. That's why you get a complaint about not being able to decode.
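Python 3 no longer performs this implicit step, but the two-step equivalent can be spelled out by hand to see where the error comes from. This is only a sketch that imitates what Python 2 did internally, assuming the default codec is ascii.

```python
# The UTF-8 bytes that a Python 2 str literal would have held.
raw = "\u4f60\u597d".encode("utf-8")

message = None
try:
    # Step 1 (implicit in Python 2): decode with the default ascii codec.
    # Step 2: encode the result as UTF-8. Step 1 fails first.
    raw.decode("ascii").encode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```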
Always encode from unicode to bytes.
In this direction, you get to choose the encoding.
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好
The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.
>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好
This point can't be stressed enough. If you want to avoid playing unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:
A unicode object is decoded already, you never want to call decode on it.
A bytestring object is encoded already, you never want to call encode on it.
Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).
These implicit conversions are why you can get UnicodeDecodeError when you've called encode. It's because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there's an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion chooses a default 'ascii' decoder†, giving you the decoding error inside an encoder.
In fact, in Python 3 the methods str.decode and bytes.encode don't even exist. Their removal was a [controversial] attempt to avoid this common confusion.
† ...or whatever coding sys.getdefaultencoding() mentions; usually this is 'ascii'
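A quick way to confirm the Python 3 behaviour described above is to check which methods each type has. A minimal sketch:

```python
# In Python 3, text (str) can only be encoded and bytes can only be decoded.
print(hasattr("", "encode"), hasattr("", "decode"))
print(hasattr(b"", "decode"), hasattr(b"", "encode"))

# Calling the missing method is an AttributeError, not a confusing
# UnicodeDecodeError raised from inside encode.
try:
    b"abc".encode("utf-8")
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__
print(error_name)
```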
You can try this (Python 2 only):
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
Alternatively, add the following line at the top of your .py file:
# -*- coding: utf-8 -*-
If you're using Python < 3, you'll need to tell the interpreter that your string literal is Unicode by prefixing it with a u:
Python 2.7.2 (default, Jan 14 2012, 23:14:09)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "你好".encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
Further reading: Unicode HOWTO.
You use u"你好".encode('utf8') to encode a unicode string.
But if you want to represent "你好", you should decode it. Just like:
"你好".decode("utf8")
You will get what you want. Maybe you should learn more about encode & decode.
In case you're dealing with Unicode, sometimes instead of encode('utf-8'), you can also try to ignore the special characters, e.g.
"你好".encode('ascii','ignore')
or as something.decode('unicode_escape').encode('ascii','ignore') as suggested here.
Not particularly useful in this example, but can work better in other scenarios when it's not possible to convert some special characters.
Alternatively you can consider replacing particular character using replace().
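For comparison, here is a short Python 3 sketch of a few error handlers that encode accepts as its second argument. The sample text is an assumption chosen to mix one Latin-1 character with two Chinese ones.

```python
text = "caf\u00e9 \u4f60\u597d"  # illustrative sample: "café 你好"

dropped = text.encode("ascii", "ignore")            # silently drops characters
replaced = text.encode("ascii", "replace")          # one '?' per character
escaped = text.encode("ascii", "backslashreplace")  # keeps the information

print(dropped)
print(replaced)
print(escaped)
```

Note that 'ignore' and 'replace' lose data, so the original text cannot be recovered from their output.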
If you are starting the python interpreter from a shell on Linux or similar systems (BSD, not sure about Mac), you should also check the default encoding for the shell.
Call locale charmap from the shell (not the python interpreter) and you should see
[user#host dir] $ locale charmap
UTF-8
[user#host dir] $
If this is not the case, and you see something else, e.g.
[user#host dir] $ locale charmap
ANSI_X3.4-1968
[user#host dir] $
Python will (at least in some cases such as in mine) inherit the shell's encoding and will not be able to print (some? all?) unicode characters. Python's own default encoding that you see and control via sys.getdefaultencoding() and sys.setdefaultencoding() is in this case ignored.
If you find that you have this problem, you can fix that by
[user#host dir] $ export LC_CTYPE="en_US.UTF-8"
[user#host dir] $ locale charmap
UTF-8
[user#host dir] $
(Or choose whichever locale you want instead of en_US, as long as it is installed on your system.) You can also edit /etc/locale.conf (or whichever file governs the locale definition on your system) to correct this.
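From inside Python you can also check which encodings were actually picked up. The snippet below is a small Python 3 sketch; what it prints depends entirely on your shell and locale settings, so no particular output is implied.

```python
import locale
import sys

# Encoding Python uses when printing to the terminal (may be None if stdout
# has been replaced by an object without an encoding).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding derived from the locale, without changing the current locale.
preferred_encoding = locale.getpreferredencoding(False)

# Default for implicit str/bytes conversions; in Python 3 this is 'utf-8'.
default_encoding = sys.getdefaultencoding()

print(stdout_encoding, preferred_encoding, default_encoding)
```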
I am adding UTF-8 data to a database in Django.
As the data goes into the database, everything looks fine - the characters (for example): “Hello” are UTF-8 encoded.
My MySQL database is UTF-8 encoded. When I examine the data from the DB by doing a select, my example string looks like this: ?Hello?. I assume this is showing the characters as UTF-8 encoded.
When I select the data from the database in the terminal or for export as a web-service, however - my string looks like this: \u201cHello World\u201d.
Does anyone know how I can display my characters correctly?
Do I need to perform some additional UTF-8 encoding somewhere?
Thanks,
Nick.
u'\u201cHello World\u201d'
Is the correct Python representation of the Unicode text “Hello World”. The smartquote characters are being displayed using a \uXXXX hex escape rather than verbatim because there are often problems with writing Unicode characters to the terminal, particularly on Windows. (It looks like MySQL tried to write them to the terminal but failed, resulting in the ? placeholders.)
On a terminal that does manage to correctly input and output Unicode characters, you can confirm that they're the same thing:
Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) [GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u201cHello World\u201d'==u'“Hello World”'
True
just as for byte strings, \x sequences are just the same as characters:
>>> '\x61'=='a'
True
Now if you've got \u or \x sequences escaping Python and making their way into an exported file, then you've done something wrong with the export. Perhaps you used repr() somewhere by mistake.
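The difference between a string and its escaped representation can be sketched in Python 3 as follows. Here ascii() plays the role that repr() played for unicode objects in Python 2, which is an adaptation for illustration rather than the original poster's code.

```python
text = "\u201cHello World\u201d"

# The escape is only a way of writing the character in source code.
print(text == "“Hello World”")
print(len(text))  # the escapes count as one character each

# An escaped *representation* is a different, longer string. Writing this
# to an export is the kind of mistake described above.
escaped = ascii(text)
print(escaped)
print(len(escaped))
```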
I'm having a problem with
UnicodeEncodeError('ascii', u'Phase \u2013 II', 6, 7, 'ordinal not in range(128)')
Basically, what I am doing here is reading a value from an Excel sheet, and the sheet contains an address in this format:
Phase- II
So I wanted to know how to change
somestring = u'Phase \u2013 II'
to str.
Thanks
Excel mostly uses cp1252, so try this:
>>> somestring.encode('cp1252', 'replace')
'Phase \x96 II'
>>> print somestring.encode('cp1252', 'replace')
Phase – II
That doesn't give you an ascii string (since your unicode string contains non-ascii characters it cannot), but it does give you a byte string that Excel will interpret correctly if for example you write it into a csv file.
If you just want to print it for display, you'll need to know the output encoding of whatever you use to display the text. I copied the example from IDLE, which, at least on my system, displays cp1252, but if you print it in a command prompt you may have another encoding in effect. Use the DOS chcp command to select an appropriate encoding if required, as the default encoding may not support that character:
C:\>chcp
Active code page: 850
C:\>\python26\python
Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> somestring = u'Phase \u2013 II'
>>> print somestring.encode('cp850', 'replace')
Phase ? II
>>>
Using the 'replace' argument to encode means that any characters that cannot be represented in the target encoding will be replaced by question marks.
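The same comparison can be reproduced in Python 3, where the results are bytes objects. A minimal sketch using the string from the question:

```python
somestring = "Phase \u2013 II"

# cp1252 has a slot for the en dash (byte 0x96), so nothing is lost.
as_cp1252 = somestring.encode("cp1252", "replace")

# cp850 has no en dash, so 'replace' substitutes a question mark.
as_cp850 = somestring.encode("cp850", "replace")

print(as_cp1252)
print(as_cp850)

# Without 'replace', the unsupported character raises an error instead.
try:
    somestring.encode("cp850")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```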
If I try to paste a unicode character such as the middle dot:
·
in my python interpreter it does nothing. I'm using Terminal.app on Mac OS X and when I'm simply in bash I have no trouble:
:~$ ·
But in the interpreter:
:~$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
^^ I get nothing, it just ignores that I just pasted the character. If I use the escape \xNN\xNN representation of the middle dot '\xc2\xb7', and try to convert to unicode, trying to show the dot causes the interpreter to throw an error:
>>> unicode('\xc2\xb7')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
I have setup 'utf-8' as my default encoding in sitecustomize.py so:
>>> sys.getdefaultencoding()
'utf-8'
What gives? It's not the Terminal. It's not Python, what am I doing wrong?!
This is not the same as the other question on this topic, as that individual is able to paste unicode into his Terminal.
unicode('\xc2\xb7') means to decode the byte string in question with the default codec, which is ascii -- and that of course fails (trying to set a different default encoding has never worked well, and in particular doesn't apply to "pasted literals" -- that would require a different setting anyway). You could use instead u'\xc2\xb7', and see:
>>> print(u'\xc2\xb7')
Â·
since those are two unicode characters of course. While:
>>> print(u'\uc2b7')
슷
gives you a single unicode character (a Korean Hangul syllable, as it happens). BTW, neither of these is the "middle dot" you were looking for. Maybe you mean
>>> print('\xc2\xb7'.decode('utf8'))
·
which is the middle dot. BTW, for me (python 2.6.4 from python.org on a Mac Terminal.app):
>>> print('슷')
슷
which kind of surprised me (I expected an error...!-).
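The three readings of those two bytes can be compared side by side in Python 3 with unicodedata. This is a sketch for illustration; decoding as latin-1 is used here as a stand-in for what the u'\xc2\xb7' literal means (one character per byte value).

```python
import unicodedata

raw = b"\xc2\xb7"

as_utf8 = raw.decode("utf-8")      # one character: the middle dot
as_latin1 = raw.decode("latin-1")  # two characters, same as u'\xc2\xb7'
single = "\uc2b7"                  # one unrelated character, U+C2B7

print([unicodedata.name(ch) for ch in as_utf8])
print([unicodedata.name(ch) for ch in as_latin1])
print(unicodedata.name(single))
```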
I was doing some work today, and came across an issue where something "looked funny". I had been interpreting some string data as utf-8, and checking the encoded form. The data was coming from ldap (Specifically, Active Directory) via python-ldap. No surprises there.
So I came upon the byte sequence '\xe3\x80\xb0' a few times, which, when decoded as utf-8, is Unicode code point U+3030 (WAVY DASH). I need the string data in utf-16, so naturally I converted it via .encode('utf-16'). Unfortunately, it seems python doesn't like this character:
D:\> python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode("utf-8")
'\xe3\x80\xb0'
>>> u"\u3030".encode("utf-16-le")
'00'
>>> u"\u3030".encode("utf-16-be")
'00'
>>> '\xe3\x80\xb0'.decode('utf-8')
u'\u3030'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
'\xff\xfe00'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
u'00'
It seems IronPython isn't a fan either:
D:\ipy
IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode('utf-8')
u'\xe3\x80\xb0'
>>> u"\u3030".encode('utf-16-le')
'00'
If somebody could tell me what, exactly, is going on here, it'd be much appreciated.
This seems to be the correct behaviour. The character u'\u3030' when encoded in UTF-16 is the same as the encoding of '00' in UTF-8. It looks strange, but it's correct.
The '\xff\xfe' you can see is just a Byte Order Mark.
Are you sure you want a wavy dash, and not some other character? If you were hoping for a different character then it might be because it had already been misencoded before entering your application.
But it decodes okay:
>>> u"\u3030".encode("utf-16-le")
'00'
>>> '00'.decode("utf-16-le")
u'\u3030'
It's that the UTF-16 encoding of that character happens to coincide with the ASCII code for '0'. You could also represent it with '\x30\x30':
>>> '00' == '\x30\x30'
True
You are being confused by two things here (threw me off too):
utf-16 and utf-32 encodings use a BOM unless you specify which byte order to use, via utf-16-be and such. This is the \xff\xfe in the second last line.
'00' is two of the characters digit zero. It is not a null character. That'd print differently anyway:
>>> '\0\0'
'\x00\x00'
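Both points can be checked in Python 3 with a few lines. This is only a sketch; the BOM that plain 'utf-16' emits depends on the machine's byte order, so the code accepts either one.

```python
import codecs

wavy = "\u3030"

little = wavy.encode("utf-16-le")  # no BOM: just the two bytes 0x30 0x30
big = wavy.encode("utf-16-be")     # same bytes, since both halves are 0x30
with_bom = wavy.encode("utf-16")   # BOM first, then the two bytes

print(little, big, with_bom)

# 0x30 is the ASCII digit zero, not the null byte 0x00.
print(little == b"\x30\x30", little == b"\x00\x00")

# Decoding with the matching codec gives the original character back.
print(little.decode("utf-16-le") == wavy)
```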
There is a basic error in your sample code above. Remember, you encode Unicode to an encoded string, and you decode from an encoded string back to Unicode. So, you do:
'\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
which translates to the following steps:
'\xe3\x80\xb0' # (some string)
.decode('utf-8') # decode above text as UTF-8 encoded text, giving u'\u3030'
.encode('utf-16-le') # encode u'\u3030' as UTF-16-LE, i.e. '00'
.decode('utf-8') # OOPS! decode using the wrong encoding here!
u'\u3030' is indeed encoded as '00' (ascii zero twice) in UTF-16LE but you somehow think that this is a null byte ('\0') or something.
Remember, you can't get back the same character if you encode with one encoding and decode with another:
>>> import unicodedata as ud
>>> c= unichr(193)
>>> ud.name(c)
'LATIN CAPITAL LETTER A WITH ACUTE'
>>> ud.name(c.encode("cp1252").decode("cp1253"))
'GREEK CAPITAL LETTER ALPHA'
In this code, I encoded to Windows-1252 and decoded from Windows-1253. In your code, you encoded to UTF-16LE and decoded from UTF-8.
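A Python 3 version of the same experiment, wrapped in a small helper so the matching and mismatching cases sit next to each other. The helper name transcode is made up for this sketch.

```python
import unicodedata


def transcode(text, encode_with, decode_with):
    """Encode text with one codec and decode the bytes with another."""
    return text.encode(encode_with).decode(decode_with)


c = chr(193)  # U+00C1

matched = transcode(c, "cp1252", "cp1252")     # same codec both ways
mismatched = transcode(c, "cp1252", "cp1253")  # different codec on the way back

print(unicodedata.name(matched))
print(unicodedata.name(mismatched))

# The case from the question: UTF-16-LE bytes read back as UTF-8.
print(transcode("\u3030", "utf-16-le", "utf-8"))
```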