I have tried to understand encode and decode in Python on my own, but nothing is really clear to me.
str.encode([encoding,[errors]])
str.decode([encoding,[errors]])
First, I don't understand the need for the "encoding" parameter in these two functions.
What is the output of each function, and what is its encoding? What is the "encoding" parameter used for in each function? I don't really understand the definition of "byte string".
I also have an important question: is there some way to convert from one encoding to another?
I have read some text on ASN.1 about "octet strings", so I wondered whether that is the same thing as a "byte string".
Thanks for your help.
It's a little more complex in Python 2 (compared to Python 3), since it conflates the concepts of 'string' and 'bytestring' quite a bit, but see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Essentially, what you need to understand is that 'string' and 'character' are abstract concepts that can't be directly represented by a computer. A bytestring is a raw stream of bytes straight from disk (or that can be written straight to disk). encode goes from abstract to concrete (you give it, preferably, a unicode string, and it gives you back a byte string); decode goes the opposite way.
The encoding is the rule that says 'a' should be represented by the byte 0x61 and 'α' by the two-byte sequence 0xce 0xb1.
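A minimal round-trip in a Python 2 session illustrating that rule (the byte values are standard UTF-8):
>>> u'\u03b1'.encode('utf-8')    # abstract character -> concrete bytes
'\xce\xb1'
>>> '\xce\xb1'.decode('utf-8')   # concrete bytes -> abstract character
u'\u03b1'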
My presentation from PyCon, Pragmatic Unicode, or, How Do I Stop The Pain covers all of these details.
Briefly, Unicode strings are sequences of integers called code points, and bytestrings are sequences of bytes. An encoding is a way to represent Unicode code points as a series of bytes. So unicode_string.encode(enc) will return the byte string of the Unicode string encoded with "enc", and byte_string.decode(enc) will return the Unicode string created by decoding the byte string with "enc".
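For instance, in Python 2 (a small sketch; the string u'caf\xe9' is just an example):
>>> [ord(c) for c in u'caf\xe9']   # the code points of a Unicode string
[99, 97, 102, 233]
>>> u'caf\xe9'.encode('utf-8')     # the same string as UTF-8 bytes
'caf\xc3\xa9'
>>> 'caf\xc3\xa9'.decode('utf-8') == u'caf\xe9'
True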
Python 2.x has two types of strings:
str = "byte strings" = a sequence of octets. These are used for both "legacy" character encodings (such as windows-1252 or IBM437) and for raw binary data (such as struct.pack output).
unicode = "Unicode strings" = a sequence of UTF-16 or UTF-32 depending on how Python is built.
This model was changed for Python 3.x:
2.x unicode became 3.x str (and the u prefix was dropped from the literals).
A bytes type was introduced for representing binary data.
A character encoding is a mapping between Unicode strings and byte strings. To convert a Unicode string to a byte string, use the encode method:
>>> u'\u20AC'.encode('UTF-8')
'\xe2\x82\xac'
To convert the other way, use the decode method:
>>> '\xE2\x82\xAC'.decode('UTF-8')
u'\u20ac'
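To convert between two byte encodings (the conversion asked about in the question), decode with the source codec and re-encode with the target one; a sketch assuming the input bytes are latin-1:
>>> '\xe9'.decode('latin-1').encode('utf-8')   # latin-1 'é' -> UTF-8 bytes
'\xc3\xa9'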
Yes, a byte string is an octet string. Encoding and decoding happen when inputting/outputting text (from/to the console, files, the network, ...). Your console may use UTF-8 internally, your web server may serve latin-1, and certain file formats need strange encodings like BibTeX's accent syntax: fran\c{c}aise. You need to convert from/to them on input/output.
The {en|de}code methods do this. They are often called behind the scenes (for example, print "hello world" encodes the string to whatever your terminal uses).
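For explicit encoding at the file boundary, the codecs module does the conversion for you; a Python 2 sketch (the file name is made up):
>>> import codecs
>>> with codecs.open('out.txt', 'w', encoding='latin-1') as f:
...     f.write(u'fran\xe7aise')   # encoded to latin-1 on write
...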
Related
Encoding in JS means converting a string with special characters into an escaped, usable string. For example, encodeURIComponent converts spaces to %20 etc. so the result can be used in URIs.
So encoding there means converting to a particular format.
In Python 2.7, I have a string: 奥多比. To convert it into UTF-8 format, however, I need to use the decode() function.
Like: "奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'
I want to understand how the meaning of encode and decode changes from language to language. It seems to me that I should be doing "奥多比".encode("utf-8") instead.
What am I missing here?
You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax), with one of the standard Unicode encodings, UTF-8.
You are not creating UTF-8, you created a Unicode text object, by decoding from a UTF-8 byte stream.
The byte string literal "奥多比" is a sequence of binary data: bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
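In other words (a Python 2 sketch; the byte values are what a UTF-8 editor actually saves for 奥多比):
>>> s = '\xe5\xa5\xa5\xe5\xa4\x9a\xe6\xaf\x94'   # the str literal "奥多比" under the hood
>>> u = s.decode('utf-8')                        # bytes -> Unicode text
>>> u
u'\u5965\u591a\u6bd4'
>>> u.encode('utf-8') == s                       # encode goes the other way
True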
I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:
Ned Batchelder's Pragmatic Unicode
The Python Unicode HOWTO
Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
In Python 2, it's of type str, i.e. a sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply put, the codec specifies how bytes should be converted to a sequence of Unicode code points. Look into the Unicode HOWTO for a more in-depth article on this.
I see that the Python manual mentions .encode() and .decode() string methods. Playing around on the Python CLI I see that I can create unicode strings u'hello' with a different datatype than a 'regular' string 'hello' and can convert / cast with str(). But the real problems start when using characters above ASCII 127 u'שלום' and I am having a hard time determining empirically exactly what is happening.
Stack Overflow is overflowing with examples of confusion regarding Python's unicode and string-encoding/decoding handling.
What exactly happens (how are the bytes changed, and how is the datatype changed) when encoding and decoding strings with the encode() and decode() methods, especially when characters that cannot be represented in 7 bits (i.e. outside ASCII) are included in the string? Is it true, as it seems, that a Python variable with datatype <type 'str'> can be both encoded and decoded? If it is encoded, I understand that means the string is represented in UTF-8, ISO-8859-1, or some other encoding; is this correct? If it is decoded, what does that mean? Are decoded strings unicode? If so, then why don't they have the datatype <type 'unicode'>?
In the interest of those who will read this later, I think that both Python 2 and Python 3 should be addressed. Thank you!
This is only the case in Python 2. The existence of a decode method on Python 2's strings is a wart, which has been changed in Python 3 (where the equivalent, bytes, has only decode).
You can't 'encode' an already-encoded string. What happens when you do call encode on a str is that Python implicitly calls decode on it using the default encoding, which is usually ASCII. This is almost always not what you want. You should always call decode to convert a str to unicode before converting it to a different encoding.
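A sketch of that pitfall in a Python 2 session (the bytes are UTF-8 for 'é'):
>>> '\xc3\xa9'.encode('utf-8')                     # implicit ASCII decode happens first, and fails
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> '\xc3\xa9'.decode('utf-8').encode('latin-1')   # decode explicitly, then re-encode
'\xe9'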
(And decoded strings are unicode, and they do have the datatype <type 'unicode'>, so I'm not sure what you mean by that question.)
In Python 3 of course strings are unicode by default. You can only encode them to bytes - which, as I mention above, can only be decoded.
I have this string:
sig=45C482D2486105B02211ED4A0E3163A9F7095E81.4DDB3B3A13C77FE508DCFB7C6CC68957096A406C\u0026type=video%2F3gpp%3B+codecs%3D%22mp4v.20.3%2C+mp4a.40.2%22\u0026quality=small\u0026itag=17\u0026url=http%3A%2F%2Fr6---sn-cx5h-itql.c.youtube.com%2Fvideoplayback%3Fsource%3Dyoutube%26mt%3D1367776467%26expire%3D1367797699%26itag%3D17%26factor%3D1.25%26upn%3DpkX9erXUHx4%26cp%3DU0hVTFdUVV9OU0NONV9PTllHOnhGdTVLUThqUWJW%26key%3Dyt1%26id%3Dab9b0e2f311eaf00%26mv%3Dm%26newshard%3Dyes%26ms%3Dau%26ip%3D49.205.30.138%26sparams%3Dalgorithm%252Cburst%252Ccp%252Cfactor%252Cid%252Cip%252Cipbits%252Citag%252Csource%252Cupn%252Cexpire%26burst%3D40%26algorithm%3Dthrottle-factor%26ipbits%3D8%26fexp%3D917000%252C919366%252C916626%252C902533%252C932000%252C932004%252C906383%252C904479%252C901208%252C925714%252C929119%252C931202%252C900821%252C900823%252C912518%252C911416%252C930807%252C919373%252C906836%252C926403%252C900824%252C912711%252C929606%252C910075%26sver%3D3\u0026fallback_host=tc.v19.cache2.c.youtube.com
As you can see, it contains both forms:
%xx. For example, %3D, %2F etc.
\uxxxx. For example, \u0026
I need to convert them to their unicode character representations. I'm using Python 3.3.1, and urllib.parse.unquote(s) converts only the %xx escapes to their unicode character representations. It doesn't, however, convert the \uxxxx escapes. For example, \u0026 should convert into &.
How can I convert both of them?
Two options:
Choose to interpret it as JSON; that format uses the same escape codes. The input does need to have quotes around it to be seen as a string (a sketch follows at the end of this answer).
Encode to latin 1 (to preserve bytes), then decode with the unicode_escape codec:
>>> urllib.parse.unquote(sig).encode('latin1').decode('unicode_escape')
'45C482D2486105B02211ED4A0E3163A9F7095E81.4DDB3B3A13C77FE508DCFB7C6CC68957096A406C&type=video/3gpp;+codecs="mp4v.20.3,+mp4a.40.2"&quality=small&itag=17&url=http://r6---sn-cx5h-itql.c.youtube.com/videoplayback?source=youtube&mt=1367776467&expire=1367797699&itag=17&factor=1.25&upn=pkX9erXUHx4&cp=U0hVTFdUVV9OU0NONV9PTllHOnhGdTVLUThqUWJW&key=yt1&id=ab9b0e2f311eaf00&mv=m&newshard=yes&ms=au&ip=49.205.30.138&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&burst=40&algorithm=throttle-factor&ipbits=8&fexp=917000%2C919366%2C916626%2C902533%2C932000%2C932004%2C906383%2C904479%2C901208%2C925714%2C929119%2C931202%2C900821%2C900823%2C912518%2C911416%2C930807%2C919373%2C906836%2C926403%2C900824%2C912711%2C929606%2C910075&sver=3&fallback_host=tc.v19.cache2.c.youtube.com'
This interprets \u escape codes just as Python would when reading string literals in Python source code.
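For the JSON route mentioned first, a minimal sketch (it assumes sig contains no literal double quotes or backslash sequences that are invalid in JSON):
>>> import json
>>> from urllib.parse import unquote
>>> text = json.loads('"' + sig + '"')   # JSON decodes the \uXXXX escapes
>>> result = unquote(text)               # unquote then decodes the %xx escapes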
If I'm guessing right, this is more or less a URL. The '%xx' form encodes a single byte outside the allowed character set. The '\uxxxx' form encodes a Unicode code point. I believe it is normal for URLs to encode Unicode characters as UTF-8 and then to encode the bytes outside the allowed charset as '%xx' (which affects every byte of a multibyte UTF-8 sequence). This makes it surprising that there are '%xx'-encoded bytes already, because translating the Unicode code points first makes the conversion irreversible.
Make sure you have tests and that you can verify the actual results, because this conversion seems unsafe. At least I don't fully understand the requirements here.
I know this looks embarrassingly easy, and I guess the problem is that I just don't have a clear understanding of all this bytes-str-unicode (and encoding-decoding, speaking frankly) stuff yet.
I've been trying to get my working code to run on Python 3. The part I'm stuck with is when I parse an XML with lxml and decode a base64 string that is in that XML.
The code now works in the following manner:
I retrieve the binary data with an XPath query '.../binary/text()'. This produces a one-element list containing an lxml.etree._ElementUnicodeResult object. Then, with Python 2, I was able to do:
decoded = source.decode('base64')
and finally
output = numpy.frombuffer(decoded)
However, on Python 3 I get an error message saying
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'decode'
This is not so surprising, because lxml.etree._ElementUnicodeResult is a subclass of str.
Another way would be to get a real str with the same data in it with
binary = tree.xpath('//binary')[0]
binary_string = binary.text
That would be essentially the same. So what do I do to decode it from base64? I've looked at the base64 module, but it takes a bytes object as an argument, and I can't think of a way to present a str as bytes, because if I try to construct a bytes object, Python will try to encode the string, which I don't need.
Googling further, I came across the binascii module (which is invoked indirectly from base64 anyway, if I'm not mistaken), but calling binascii.b2a_base64() on my string produces
TypeError: 'str' does not support the buffer interface
P.S. I've even found an answered question on how to decode a hex string in Python 3, but this is done with a dedicated method bytes.fromhex() so I don't see how it would be helpful.
Could someone please tell me what I'm missing? I'm afraid most of the post is irrelevant and only aggravates my shame, but at least you guys know what I tried.
OK, I think I'm going to summarize my current understanding of things (feel free to correct me). Hopefully it will help someone else out there as confused as I've been.
The credit totally goes to thebjorn and delnan, of course.
So, starting with the most common things:
there's Unicode, and it's a global standard that assigns codes (or code points) to all the exotic characters you can imagine. Those codes are just integer numbers. As of Unicode 6.1 there are 109,975 graphic characters, says Wikipedia.
Then there are encodings that define how to designate Unicode characters with byte codes. One byte isn't enough to designate an arbitrary Unicode char. Although, if you only take a small subset of them (English alphabet, digits, punctuation, some control characters), you can do with one byte per character (or even 7 bits; see ASCII).
To pass a Unicode string anywhere, one needs to encode it in bytes, then it can be decoded on the other end.
In Python 2, str is actually bytes, and unicode is Unicode, but Python 2 will do implicit encoding/decoding for you when needed. It will try to use ASCII encoding.
In Python 3, str is always a Unicode string, and bytes is a new data type for actual bytes. No implicit conversion is ever done by Python 3; you always need to do it yourself and specify the encoding. That means that your program won't work until you understand what's going on, which totally happened to me.
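A quick Python 3 illustration of those last two points:
>>> 'caf\u00e9'.encode('utf-8')     # str -> bytes, encoding is explicit
b'caf\xc3\xa9'
>>> b'caf\xc3\xa9'.decode('utf-8')  # bytes -> str, again explicit
'café'
>>> 'café' + b'cafe'                # no implicit conversion, ever
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "bytes") to str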
Now, that being more or less clear, let's move on to base64 encoding, which is also an encoding of sorts, but has a slightly different meaning.
Suppose you have some binary data (i.e. bytes) that may mean anything (in my case it's a bunch of floats). Now you want to represent this binary array with a string. That's what base64 encoding means: you have your bytes represented as an ASCII string.
Base64 means 6 bits, so in a base64-encoded string a single character stands for 6 bits of your data. That is why base64-encoded strings have a length that is a multiple of 4: four characters carry 24 bits, i.e. exactly three bytes, and shorter final groups are padded out with '='.
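A quick check of that arithmetic with the base64 module (Python 3):
>>> import base64
>>> base64.b64encode(b'\x00\x01\x02')   # 3 bytes = 24 bits = exactly 4 characters
b'AAEC'
>>> base64.b64encode(b'\x00')           # a lone byte gets padded with '='
b'AA=='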
Finally, to decode from base64 you need an ASCII string; a Unicode string won't do, since there can only be characters from the base64 alphabet. The base64 module does the job in Python. The base64.b64decode() function takes a byte string as its argument. In Python 2 that means str; in Python 3 it means bytes. So if you have a str, such as
>>> s = 'U3RhY2sgT3ZlcmZsb3c='
In Python 2 you could just do
>>> s.decode('base64')
because s is already in ASCII.
In Python 3, you need to encode it in ASCII first, so you'll have to do:
>>> base64.b64decode(s.encode('ascii'))
And by the way, this will return a bytes object, so it's really up to you how to treat those bytes then. Maybe they're my floats, but maybe you should try to decode them as ASCII :)
In Python 2, however, it will be just a str. Anyway, have a look at struct for the tools to unpack your data from those bytes.
So if you need the code to work on both Python 2 and 3, go with the last one. To make sure you have Unicode in the end (if you are decoding text from base64), you'll have to decode it:
>>> base64.b64decode(s.encode('ascii')).decode('ascii')
On Python 2, encode('ascii') effectively does nothing useful when applied to a str: Python does an implicit conversion to Unicode first (decoding with ASCII), and then does what you asked (encodes it back to ASCII). decode('ascii') will return a unicode object on Python 2.
I don't have Python 3 installed, but it sounds like you need to convert the Unicode returned from lxml to bytes, perhaps by calling .encode('ascii') ?
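Putting the pieces together for the original lxml/numpy problem, a sketch built from the question's own snippets (the file name, exact XPath, and float64 dtype are assumptions):

import base64
import numpy
from lxml import etree

tree = etree.parse('data.xml')                      # hypothetical input file
source = tree.xpath('//binary/text()')[0]           # a str subclass on Python 3
decoded = base64.b64decode(source.encode('ascii'))  # bytes of raw binary data
output = numpy.frombuffer(decoded)                  # e.g. an array of float64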
Given an arbitrary "string" from a library I do not have control over, I want to make sure the "string" is a unicode type and encoded in UTF-8. I would like to know if this is the best way to do this:
import types
input = <some value from a lib I dont have control over>
if isinstance(input, types.StringType):
    input = input.decode("utf-8")
elif isinstance(input, types.UnicodeType):
    input = input.encode("utf-8").decode("utf-8")
In my actual code I wrap this in a try/except and handle the errors but I left that part out.
A Unicode object is not encoded (it is internally, but this should be transparent to you as a Python user). The line input.encode("utf-8").decode("utf-8") does not make much sense: you get back exactly the same sequence of Unicode characters that you had at the beginning.
if isinstance(input, str):
    input = input.decode('utf-8')
is all you need to ensure that str objects (byte strings) are converted into Unicode strings.
Simply:
try:
    input = unicode(input.encode('utf-8'))
except ValueError:
    pass
It's always better to seek forgiveness than to ask permission.
I think you have a misunderstanding of Unicode and encodings. Unicode characters are just numbers. Encodings are the representation of the numbers. Think of Unicode characters as a concept like fifteen, and encodings as 15, 1111, F, XV. You have to know the encoding (decimal, binary, hexadecimal, roman numerals) before you can decode an encoding and "know" the Unicode value.
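The same idea in code (a Python 2 sketch): one code point, several byte representations, and you must know which one you are looking at:
>>> u'\u20ac'.encode('utf-8')        # the Euro sign as three UTF-8 bytes
'\xe2\x82\xac'
>>> u'\u20ac'.encode('iso-8859-15')  # the same character as one Latin-9 byte
'\xa4'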
If you have no control over the input string, it is difficult to convert it to anything. For example, if the input was read from a file you'd have to know the encoding of the text file to decode it meaningfully to Unicode, and then encode it into 'UTF-8' for your C++ library.
Are you sure you want a UTF-8 encoded sequence stored in a Unicode type? Normally, Python stores characters in a types.UnicodeType using UCS-2 or UCS-4, which is sometimes referred to as "wide" characters, and which should be capable of containing characters from all reasonably common scripts.
One wonders what sort of lib this is that sometimes outputs types.StringType and sometimes types.UnicodeType. If I had to take a wild guess, the lib always produces types.StringType but doesn't tell you which encoding it is in. If that is the case, you are actually looking for code that can guess which charset a types.StringType is encoded in.
In most cases this is easy, as you can assume that it is in e.g. latin-1 or UTF-8. If the text can actually be in any odd encoding (e.g. incoming mail without a proper header), you need a lib that guesses the encoding. See http://chardet.feedparser.org/.
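A sketch of that guessing step with chardet (the file name is made up):

import chardet

raw = open('mystery.txt', 'rb').read()  # bytes in an unknown encoding
guess = chardet.detect(raw)             # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
text = raw.decode(guess['encoding'])    # now a proper Unicode string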