I know this looks embarrassingly easy, and I guess the problem is that I just don't have a clear understanding of all this bytes-str-unicode (and encoding-decoding, speaking frankly) stuff yet.
I've been trying to get my working code to run on Python 3. The part I'm stuck with is when I parse an XML with lxml and decode a base64 string that is in that XML.
The code now works in the following manner:
I retrieve the binary data with an XPath query '.../binary/text()'. This produces a one-element list containing a lxml.etree._ElementUnicodeResult object. Then, with python 2, I was able to do:
decoded = source.decode('base64')
and finally
output = numpy.frombuffer(decoded)
However, on python 3 I get an error message saying
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'decode'
This is not so surprising, because lxml.etree._ElementUnicodeResult is a subclass of str.
Another way would be to get a real str with the same data in it with
binary = tree.xpath('//binary')[0]
binary_string = binary.text
That would be essentially the same. So what do I do to decode it from base64? I've looked at the base64 module, but it takes a bytes object as an argument, and I can't think of the way to present str as bytes, because if I try to construct a bytes object, Python will try to encode the string, which I don't need.
Googling further, I came across the binascii module (which is invoked indirectly from base64 anyway, if I'm not mistaken), but calling binascii.b2a_base64() on my string produces
TypeError: 'str' does not support the buffer interface
P.S. I've even found an answered question on how to decode a hex string in Python 3, but this is done with a dedicated method bytes.fromhex() so I don't see how it would be helpful.
Could someone please tell me what I'm missing? I'm afraid most of the post is irrelevant and only aggravates my shame, but at least you guys know what I tried.
OK, I think I'm going to summarize my current understanding of things (feel free to correct me). Hopefully it will help someone else out there as confused as I've been.
The credit totally goes to thebjorn and delnan, of course.
So, starting with the most common things:
there's Unicode, and it's a global standard that assigns codes (or code points) to all the exotic characters you can imagine. Those codes are just integer numbers. As of Unicode 6.1 there are 109,975 graphic characters, says Wikipedia.
Then there are encodings that define how to designate Unicode characters with byte codes. One byte isn't enough to designate an arbitrary Unicode char. Although, if you only take a small subset of them (English alphabet, digits, punctuation, some control characters), you can do with one byte per character (or even 7 bits; see ASCII).
To pass a Unicode string anywhere, one needs to encode it in bytes, then it can be decoded on the other end.
In Python 2, str is actually bytes, and unicode is Unicode, but Python 2 will do implicit encoding/decoding for you when needed. It will try to use ASCII encoding.
In Python 3, str is always a Unicode string, and bytes is a new data type for actual bytes. No implicit conversion is ever done by Python 3, you always need to do it yourself and specify the encoding. That means that your program won't work until you understand what's going on, which totally happened to me.
Now, that being more or less clear, let's move on to base64 encoding, which is also an encoding of sorts, but has a slightly different meaning.
Suppose you have some binary data (i.e. bytes) that may mean anything (in my case it's a bunch of floats). Now you want to represent this binary array with a string. That's what base64 encoding means: you have your bytes represented as an ASCII string.
Base64 means 6 bit, so in a base64-encoded string a single character stands for 6 bits of your data. That is why base64-encoded strings need to have the length that is a multiple of 4: otherwise the number of bytes encoded will be not integer.
Finally, to decode from base64 you need an ASCII string. A Unicode string won't do, there can only be characters from the base64 alphabet. Base64 module does the job in Python. The base64.b64decode() function takes a byte string as the argument. In Python 2 it means: str. In Python 3 it means: bytes. So if you have a str, such as
>>> s = 'U3RhY2sgT3ZlcmZsb3c='
In Python 2 you could just do
>>> s.decode('base64')
because s is already in ASCII.
In Python 3, you need to encode it in ASCII first, so you'll have to do:
>>> base64.b64decode(s.encode('ascii'))
And by the way, this will return a bytes object, so it's really up to you how to treat those bytes then. Maybe it's my floats, but maybe you should try to decode it as ASCII :)
In Python 2 however it will be just a str. Anyway, have a look at struct for the tools to unpack your data from those bytes.
So if you need the code to work on both Python 2 and 3, go with the last one. To make sure you have Unicode in the end (if you are decoding text from base64), you'll have to decode it:
>>> base64.b64decode(s.encode('ascii')).decode('ascii')
On Python 2, encode('ascii') won't effectively do anything because it's applied to str. So it will do an implicit conversion to Unicode first, and then do what you want (convert it back to ASCII). decode('ascii') will return a unicode object on Python 2.
I don't have Python 3 installed, but it sounds like you need to convert the Unicode returned from lxml to bytes, perhaps by calling .encode('ascii') ?
Related
I'm new to Python 3 and it seems that I can't quite completely grasp unicode and character encoding.
I'm working with the output of another tool that returns the content of an html page as a bytes object. Other tools we use need this output to be in bytes type, but, I'd like to convert the bytes output to a string for some parsing and comparison to other strings. For cases that I'm interested in, printing the output bytes object shows only characters and no \x or \u binary. I'm a little confused on how best to do this and why the methods that create the desired output, actually do work.
I've read elsewhere that .decode() should be used in this context and this does work, but I don't understand why I am decoding an object that is already characters. From what I understand, decoding is intended for binary numbers, for example:
>>> b'\x41'.decode('utf-8')
'A'
In my understanding, all I really want to do is tell Python that an object that's been labeled as a bytes type object is actually a str object. Simply using the str() function on the bytes object also accomplishes this goal, but adds the "b" prefix and adds quotations around the string.
Here are the two solutions I'm working with:
>>> str(b'htmltext')
"b'htmltext'"
>>> b'htmltext'.decode('utf-8')
'htmltext'
Essentially, either of these solutions appears to achieve what I'm looking for, but the decode() obviously seems cleaner and, from what I've read, the recommended method. I'm wondering why decode() works, given that, apparently, I'm not converting binary numbers to characters. Furthermore, is there any reason other than the unappealing "b" and quotation marks in the output that str() would not be a valid solution here?
Don't confuse the developer-friendly representation of the bytes object with the data that is contained in it. You have binary data either way.
The developer representation makes it easy for you to see what is contained by showing anything that just happens to be a valid ASCII codepoint as that ASCII character, rather than the \xhh escape code. It's just easier to read text encoded as ASCII that way, and a lot of the world's text happens to be ASCII encoded.
You'll have a harder time when the data is not within the ASCII range however:
>>> 'Åæøéï'.encode('utf8')
b'\xc3\x85\xc3\xa6\xc3\xb8\xc3\xa9\xc3\xaf'
That's a UTF-8 byte sequence encoding text with accents. The above may be a little bit contrived, but most non-English text will include some non-ASCII text. Even English text can contain em-dashes or fancy quotes, and the b'...' bytes version of that is not nearly as readable as the properly decoded text version:
>>> '“Kragerø” is a town in Norway – in the province of Vestfold'.encode('utf8')
b'\xe2\x80\x9cKrager\xc3\xb8\xe2\x80\x9d is a town in Norway \xe2\x80\x93 in the province of Vestfold'
Note that the b'....' output is the result of using the repr() function on a bytes object; that calls the object.__repr__() method, which has the explicit function of producing a developer-friendly string for you. There is no dedicated object.__str__() method on a bytes object, the __repr__ method is called instead, even when you use the str() function. The proper way to convert a bytes value to a string is to decode (using the correct codec for the data).
Of course, when you have binary data that represents something else, like, say, image data, then keep it as bytes. There is no text to decode there.
What's a Python bytestring?
All I can find are topics on how to encode to bytestring or decode to ASCII or UTF-8. I'm trying to understand how it works under the hood. In a normal ASCII string, it's an array or list of characters, and each character represents an ASCII value from 0-255, so that's how you know what character is represented by the number. In Unicode, it's the 8- or 16-byte representation for the character that tells you what character it is.
So what is a bytestring? How does Python know which characters to represent as what? How does it work under the hood? Since you can print or even return these strings and it shows you the string representation, I don't quite get it...
Ok, so my point is definitely getting missed here. I've been told that it's an immutable sequence of bytes without any particular interpretation.
A sequence of bytes.. Okay, let's say one byte:
'a'.encode() returns b'a'.
Simple enough. Why can I read the a?
Say I get the ASCII value for a, by doing this:
printf "%d" "'a"
It returns 97. Okay, good, the integer value for the ASCII character a. If we interpret 97 as ASCII, say in a C char, then we get the letter a. Fair enough. If we convert the byte representation to bits, we get this:
01100001
2^0 + 2^5 + 2^6 = 97. Cool.
So why is 'a'.encode() returning b'a' instead of 01100001??
If it's without a particular interpretation, shouldn't it be returning something like b'01100001'?
It seems like it's interpreting it like ASCII.
Someone mentioned that it's calling __repr__ on the bytestring, so it's displayed in human-readable form. However, even if I do something like:
with open('testbytestring.txt', 'wb') as f:
f.write(b'helloworld')
It will still insert helloworld as a regular string into the file, not as a sequence of bytes... So is a bytestring in ASCII?
It is a common misconception that text is ASCII or UTF-8 or Windows-1252, and therefore bytes are text.
Text is only text, in the way that images are only images. The matter of storing text or images to disk is a matter of encoding that data into a sequence of bytes. There are many ways to encode images into bytes: JPEG, PNG, SVG, and likewise many ways to encode text, ASCII, UTF-8 or Windows-1252.
Once encoding has happened, bytes are just bytes. Bytes are not images anymore; they have forgotten the colors they mean; although an image format decoder can recover that information. Bytes have similarly forgotten the letters they used to be. In fact, bytes don't remember whether they were images or text at all. Only out of band knowledge (filename, media headers, etcetera) can guess what those bytes should mean, and even that can be wrong (in case of data corruption).
so, in Python (Python 3), we have two types for things that might otherwise look similar; For text, we have str, which knows it's text; it knows which letters it's supposed to mean. It doesn't know which bytes that might be, since letters are not bytes. We also have bytestring, which doesn't know if it's text or images or any other kind of data.
The two types are superficially similar, since they are both sequences of things, but the things that they are sequences of is quite different.
Implementationally, str is stored in memory as UCS-? where the ? is implementation defined, it may be UCS-4, UCS-2 or UCS-1, depending on compile time options and which code points are present in the represented string.
"But why"?
Some things that look like text are actually defined in other terms. A really good example of this are the many Internet protocols of the world. For instance, HTTP is a "text" protocol that is in fact defined using the ABNF syntax common in RFCs. These protocols are expressed in terms of octets, not characters, although an informal encoding may also be suggested:
2.3. Terminal Values
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
This distinction is important, because it's not possible to send text over the internet, the only thing you can do is send bytes. saying "text but in 'foo' encoding" makes the format that much more complex, since clients and servers need to now somehow figure out the encoding business on their own, hopefully in the same way, since they must ultimately pass data around as bytes anyway. This is doubly useless since these protocols are seldom about text handling anyway, and is only a convenience for implementers. Neither the server owners nor end users are ever interested in reading the words Transfer-Encoding: chunked, so long as both the server and the browser understand it correctly.
By comparison, when working with text, you don't really care how it's encoded. You can express the "Heävy Mëtal Ümlaüts" any way you like, except "Heδvy Mλtal άmlaόts"
The distinct types thus give you a way to say "this value 'means' text" or "bytes".
Python does not know how to represent a bytestring. That's the point.
When you output a character with value 97 into pretty much any output window, you'll get the character 'a' but that's not part of the implementation; it's just a thing that happens to be locally true. If you want an encoding, you don't use bytestring. If you use bytestring, you don't have an encoding.
Your piece about .txt files shows you have misunderstood what is happening. You see, plain text files too don't have an encoding. They're just a series of bytes. These bytes get translated into letters by the text editor but there is no guarantee at all that someone else opening your file will see the same thing as you if you stray outside the common set of ASCII characters.
As the name implies, a Python 3 bytestring (or simply a str in Python 2.7) is a string of bytes. And, as others have pointed out, it is immutable.
It is distinct from a Python 3 str (or, more descriptively, a unicode in Python 2.7) which is a
string of abstract Unicode characters (a.k.a. UTF-32, though Python 3 adds fancy compression under the hood to reduce the actual memory footprint similar to UTF-8, perhaps even in a more general way).
There are essentially three ways of "interpreting" these bytes. You can look at the numeric value of an element, like this:
>>> ord(b'Hello'[0]) # Python 2.7 str
72
>>> b'Hello'[0] # Python 3 bytestring
72
Or you can tell Python to emit one or more elements to the terminal (or a file, device, socket, etc.) as 8-bit characters, like this:
>>> print b'Hello'[0] # Python 2.7 str
H
>>> import sys
>>> sys.stdout.buffer.write(b'Hello'[0:1]) and None; print() # Python 3 bytestring
H
As Jack hinted at, in this latter case it is your terminal interpreting the character, not Python.
Finally, as you have seen in your own research, you can also get Python to interpret a bytestring. For example, you can construct an abstract unicode object like this in Python 2.7:
>>> u1234 = unicode(b'\xe1\x88\xb4', 'utf-8')
>>> print u1234.encode('utf-8') # if terminal supports UTF-8
ሴ
>>> u1234
u'\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<type 'unicode'>
>>> len(u1234)
1
>>>
Or like this in Python 3:
>>> u1234 = str(b'\xe1\x88\xb4', 'utf-8')
>>> print (u1234) # if terminal supports UTF-8 AND python auto-infers
ሴ
>>> u1234.encode('unicode-escape')
b'\\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<class 'str'>
>>> len(u1234)
1
(and I am sure that the amount of syntax churn between Python 2.7 and Python3 around bystestring, strings, and Unicode had something to do with the continued popularity of Python 2.7. I suppose that when Python 3 was invented they didn't yet realize that everything would become UTF-8 and therefore all the fuss about abstraction was unnecessary).
But the Unicode abstraction does not happen automatically if you don't want it to. The point of a bytestring is that you can directly get at the bytes. Even if your string happens to be a UTF-8 sequence, you can still access bytes in the sequence:
>>> len(b'\xe1\x88\xb4')
3
>>> b'\xe1\x88\xb4'[0]
'\xe1'
And this works in both Python 2.7 and Python 3, with the difference being that in Python 2.7 you have str, while in Python3 you have bytestring.
You can also do other wonderful things with bytestrings, like knowing if they will fit in a reserved space within a file, sending them directly over a socket, calculating the HTTP content-length field correctly, and avoiding Python Bug 8260. In short, use bytestrings when your data is processed and stored in bytes.
Bytes objects are immutable sequences of single bytes. The documentation has a very good explanation of what they are and how to use them.
From Dive into Python:
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
I don't understand what the author means by that.
When I say s = 'hello', how is s encoded internally? Of course it must use some use some encoding. He says all strings are sequences of Unicode characters. But how many bytes is each character? Is this string UTF-8? Why does he say : "There is no such thing as a Python string encoded in UTF-8".
I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.
When I say s = 'hello', how is s encoded internally? Of course it must use some use some encoding.
It depends. Frankly, it doesn't matter. CPython now uses the Flexible String Representation, a wonderful space and time optimisation. But you shouldn't care because it doesn't matter.
He says all strings are sequences of Unicode characters. But how many bytes is each character?
Dunno. It depends. It'll probably be in Latin-1 (1 byte) (when using CPython) in that particular case.
Is this string UTF-8?
No.
Why does he say : "There is no such thing as a Python string encoded in UTF-8".
Because it's a series of Unicode Code points. If you confuse encodings with strings (as other languages often force you to do), you might think that 'Jalape\xc3\xb1o' is 'Jalapeño', because in UTF-8 the byte-sequence '\xc3\xb1o' represents 'ñ'. But it's not, because the string doesn't have an intrinsic encoding, just like the number 100 is the number 100, not 4, whether or not you represent it in binary, decimal or unary.
He says it because people come from languages where they only have bytes that represent strings and they think "but how is this encoded" as if they have to decode it themselves. It'd be like carrying a list of 1s and 0s instead of being able to use numbers, and you have to tell every function what endianness you're using.
I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.
Hopefully it does not any more :).
If this confuses you, I reccomend this question, partially 'cause someone called my answer "superbly comprehensive"¹ but also because Steven D'Aprano has had one of his Python Mailing List excelencies posted there - he and I answered from the list and had our text posted across.
If you're wondering why it's relevant, I'll quote:
So the person you are quoting is causing confusion when he talks about an "encoded string", he should either make it clear he means a string of bytes, or not mention the word string at all.
Isn't that exactly your confusion?
¹ Technically he called another answer "another superbly comprehensive answer", but that implies what I just said ;).
Author compares strings in Python 2 and 3. In Python 2 strings were represented as byte arrays and thus introduced a lot of problems when dealing with non-ASCII characters. Programmer had to always keep track of current encoding of strings in their applications (e.g. encoding of the text on HTML page). There was an attempt to solve it in Python 2.x with introduction of Unicode objects:
s = 'text' # string/byte array object
un = u'text' # unicode object
But many application still used normal, old-style strings.
So, in Python 3 it was decided to separate strings (making them all Unicode) and byte arrays. Thus, in Python 3 we have:
s = 'text' # string/unicode object
b = bytes([0xA2,0x01,0x02,0x03,0x04]) # byte array object
Python uses UCS-2 or UCS-4 encoding internally for unicode strings (at least in Python 2.x).
I see that the Python manual mentions .encode() and .decode() string methods. Playing around on the Python CLI I see that I can create unicode strings u'hello' with a different datatype than a 'regular' string 'hello' and can convert / cast with str(). But the real problems start when using characters above ASCII 127 u'שלום' and I am having a hard time determining empirically exactly what is happening.
Stack Overflow is overflowing with examples of confusion regarding Python's unicode and string-encoding/decoding handling.
What exactly happens (how are the bytes changed, and how is the datatype changed) when encoding and decoding strings with the str() method, especially when characters that cannot be represented in 7 bytes are included in the string? Is it true, as it seems, that a Python variable with datatype <type 'str'> can be both encoded and decoded? If it is encoded, I understand that means that the string is represented by UTF-8, ISO-8859-1, or some other encoding, is this correct? If it is decoded, what does this mean? Are decoded strings unicode? If so, then why don't they have the datatype <type 'unicode'>?
In the interest of those who will read this later, I think that both Python 2 and Python 3 should be addressed. Thank you!
This is only the case in Python 2. The existence of a decode method on Python 2's strings is a wart, which has been changed in Python 3 (where the equivalent, bytes, has only decode).
You can't 'encode' an already-encoded string. What happens when you do call encode on a str is that Python implicitly calls decode on it using the default encoding, which is usually ASCII. This is almost always not what you want. You should always call decode to convert a str to unicode before converting it to a different encoding.
(And decoded strings are unicode, and they do have type <unicode>, so I don't know what you mean by that question.)
In Python 3 of course strings are unicode by default. You can only encode them to bytes - which, as I mention above, can only be decoded.
I tried to understand by myself encode and decode in Python but nothing is really clear for me.
str.encode([encoding,[errors]])
str.decode([encoding,[errors]])
First, I don't understand the need of the "encoding" parameter in these two functions.
What is the output of each function, its encoding? What is the use of the "encoding" parameter in each function? I don't really understand the definition of "bytes string".
I have an important question, is there some way to pass from one encoding to another?
I have read some text on ASN.1 about "octet string", so I wondered whether it was the same as "bytes string".
Thanks for you help.
It's a little more complex in Python 2 (compared to Python 3), since it conflates the concepts of 'string' and 'bytestring' quite a bit, but see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Essentially, what you need to understand is that 'string' and 'character' are abstract concepts that can't be directly represented by a computer. A bytestring is a raw stream of bytes straight from disk (or that can be written straight from disk). encode goes from abstract to concrete (you give it preferably a unicode string, and it gives you back a byte string); decode goes the opposite way.
The encoding is the rule that says 'a' should be represented by the byte 0x61 and 'α' by the two-byte sequence 0xc0\xb1.
My presentation from PyCon, Pragmatic Unicode, or, How Do I Stop The Pain covers all of these details.
Briefly, Unicode strings are sequences of integers called code points, and bytestrings are sequences of bytes. An encoding is a way to represent Unicode code points as a series of bytes. So unicode_string.encode(enc) will return the byte string of the Unicode string encoded with "enc", and byte_string.decode(enc) will return the Unicode string created by decoding the byte string with "enc".
Python 2.x has two types of strings:
str = "byte strings" = a sequence of octets. These are used for both "legacy" character encodings (such as windows-1252 or IBM437) and for raw binary data (such as struct.pack output).
unicode = "Unicode strings" = a sequence of UTF-16 or UTF-32 depending on how Python is built.
This model was changed for Python 3.x:
2.x unicode became 3.x str (and the u prefix was dropped from the literals).
A bytes type was introduced for representing binary data.
A character encoding is a mapping between Unicode strings and byte strings. To convert a Unicode string, to a byte string, use the encode method:
>>> u'\u20AC'.encode('UTF-8')
'\xe2\x82\xac'
To convert the other way, use the decode method:
>>> '\xE2\x82\xAC'.decode('UTF-8')
u'\u20ac'
Yes, a byte string is an octet string. Encoding and decoding happens when inputting / outputting text (from/to the console, files, the network, ...). Your console may use UTF-8 internally, your web server serves latin-1, and certain file formats need strange encodings like Bibtex's accents: fran\c{c}aise. You need to convert from/to them on input/output.
The {en|de}code methods do this. They are often called behind the scenes (for example, print "hello world" encodes the string to whatever your terminal uses).