What's a Python bytestring?
All I can find are topics on how to encode to a bytestring or decode to ASCII or UTF-8. I'm trying to understand how it works under the hood. In a normal ASCII string, it's an array or list of characters, and each character represents a value from 0-255, so that's how you know what character is represented by the number. In Unicode, it's the 8- or 16-bit representation of the character that tells you what character it is.
So what is a bytestring? How does Python know which characters to represent as what? How does it work under the hood? Since you can print or even return these strings and it shows you the string representation, I don't quite get it...
Ok, so my point is definitely getting missed here. I've been told that it's an immutable sequence of bytes without any particular interpretation.
A sequence of bytes... Okay, let's say one byte:
'a'.encode() returns b'a'.
Simple enough. Why can I read the a?
Say I get the ASCII value for a, by doing this:
printf "%d" "'a"
It returns 97. Okay, good, the integer value for the ASCII character a. If we interpret 97 as ASCII, say in a C char, then we get the letter a. Fair enough. If we convert the byte representation to bits, we get this:
01100001
2^0 + 2^5 + 2^6 = 97. Cool.
So why is 'a'.encode() returning b'a' instead of 01100001??
If it's without a particular interpretation, shouldn't it be returning something like b'01100001'?
It seems like it's interpreting it like ASCII.
Someone mentioned that it's calling __repr__ on the bytestring, so it's displayed in human-readable form. However, even if I do something like:
with open('testbytestring.txt', 'wb') as f:
    f.write(b'helloworld')
It will still insert helloworld as a regular string into the file, not as a sequence of bytes... So is a bytestring in ASCII?
It is a common misconception that text is ASCII or UTF-8 or Windows-1252, and therefore bytes are text.
Text is only text, in the way that images are only images. The matter of storing text or images to disk is a matter of encoding that data into a sequence of bytes. There are many ways to encode images into bytes (JPEG, PNG, SVG) and likewise many ways to encode text (ASCII, UTF-8, or Windows-1252).
Once encoding has happened, bytes are just bytes. Bytes are not images anymore; they have forgotten the colors they mean, although an image format decoder can recover that information. Bytes have similarly forgotten the letters they used to be. In fact, bytes don't remember whether they were images or text at all. Only out-of-band knowledge (a filename, media headers, et cetera) can guess what those bytes should mean, and even that can be wrong (in case of data corruption).
So, in Python 3, we have two types for things that might otherwise look similar. For text, we have str, which knows it's text; it knows which letters it's supposed to mean. It doesn't know which bytes that might be, since letters are not bytes. We also have the bytestring, which doesn't know if it's text or images or any other kind of data.
The two types are superficially similar, since they are both sequences of things, but the things they are sequences of are quite different.
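A quick way to see the difference (Python 3; the sample word is arbitrary):
>>> s = "café"              # str: a sequence of Unicode code points
>>> b = s.encode("utf-8")   # bytes: a sequence of small integers
>>> s[3]
'é'
>>> b[3:5]
b'\xc3\xa9'
>>> list(b)
[99, 97, 102, 195, 169]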
Implementation-wise, CPython stores a str in memory using one, two, or four bytes per code point, choosing per string the narrowest width that fits the code points it contains (PEP 393); before Python 3.3, the width was instead a compile-time option (narrow or wide builds).
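You can observe that per-string sizing in CPython; the exact header overhead varies by version, so this sketch only measures the growth per character:
import sys

# Each string uses the narrowest width that fits its widest code point.
for ch in ('a', '\u1234', '\U0001f600'):  # 1-, 2-, and 4-byte code points
    small, big = ch * 10, ch * 1010
    print(repr(ch), (sys.getsizeof(big) - sys.getsizeof(small)) / 1000)
# prints roughly 1.0, 2.0, and 4.0 bytes per character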
"But why"?
Some things that look like text are actually defined in other terms. A really good example of this is the many Internet protocols of the world. For instance, HTTP is a "text" protocol that is in fact defined using the ABNF syntax common in RFCs. These protocols are expressed in terms of octets, not characters, although an informal encoding may also be suggested:
2.3. Terminal Values
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
This distinction is important because it's not possible to send text over the internet; the only thing you can do is send bytes. Saying "text, but in 'foo' encoding" makes the format that much more complex, since clients and servers now need to somehow figure out the encoding business on their own, hopefully in the same way, since they must ultimately pass data around as bytes anyway. This is doubly useless since these protocols are seldom about text handling anyway; it's only a convenience for implementers. Neither the server owners nor the end users are ever interested in reading the words Transfer-Encoding: chunked, so long as both the server and the browser understand them correctly.
By comparison, when working with text, you don't really care how it's encoded. You can express the "Heävy Mëtal Ümlaüts" any way you like, except "Heδvy Mλtal άmlaόts".
The distinct types thus give you a way to say "this value 'means' text" or "this value 'means' bytes".
Python does not know how to represent a bytestring. That's the point.
When you output a character with value 97 into pretty much any output window, you'll get the character 'a' but that's not part of the implementation; it's just a thing that happens to be locally true. If you want an encoding, you don't use bytestring. If you use bytestring, you don't have an encoding.
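To see that b'a' really is just the byte 97, and that the a is only how its repr renders it, ask for the numeric and binary forms yourself (Python 3):
>>> b = 'a'.encode()
>>> b
b'a'
>>> b[0]                  # the underlying integer
97
>>> format(b[0], '08b')   # the bit pattern computed by hand above
'01100001'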
Your point about .txt files shows you have misunderstood what is happening. You see, plain text files don't have an encoding either. They're just a series of bytes. Those bytes get translated into letters by the text editor, but there is no guarantee at all that someone else opening your file will see the same thing as you if you stray outside the common set of ASCII characters.
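A minimal demonstration of that "no guarantee": decode the same bytes under two different assumed encodings and you get two different texts:
>>> data = 'héllo'.encode('utf-8')   # the bytes written to the "plain text" file
>>> data.decode('utf-8')             # an editor assuming UTF-8
'héllo'
>>> data.decode('latin-1')           # an editor assuming Latin-1
'hÃ©llo'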
As the name implies, a Python 3 bytestring (or simply a str in Python 2.7) is a string of bytes. And, as others have pointed out, it is immutable.
It is distinct from a Python 3 str (or, more descriptively, a unicode in Python 2.7), which is a string of abstract Unicode characters. Conceptually that is like UTF-32, one slot per code point, though CPython 3.3+ picks the narrowest fixed width that fits each individual string (PEP 393) to reduce the actual memory footprint.
There are essentially three ways of "interpreting" these bytes. You can look at the numeric value of an element, like this:
>>> ord(b'Hello'[0]) # Python 2.7 str
72
>>> b'Hello'[0] # Python 3 bytestring
72
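So in Python 3, indexing or iterating a bytestring always yields plain integers:
>>> list(b'Hello')
[72, 101, 108, 108, 111]
>>> [hex(x) for x in b'Hi']
['0x48', '0x69']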
Or you can tell Python to emit one or more elements to the terminal (or a file, device, socket, etc.) as 8-bit characters, like this:
>>> print b'Hello'[0] # Python 2.7 str
H
>>> import sys
>>> sys.stdout.buffer.write(b'Hello'[0:1]) and None; print() # Python 3 bytestring
H
As Jack hinted at, in this latter case it is your terminal interpreting the character, not Python.
Finally, as you have seen in your own research, you can also get Python to interpret a bytestring. For example, you can construct an abstract unicode object like this in Python 2.7:
>>> u1234 = unicode(b'\xe1\x88\xb4', 'utf-8')
>>> print u1234.encode('utf-8') # if terminal supports UTF-8
ሴ
>>> u1234
u'\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<type 'unicode'>
>>> len(u1234)
1
>>>
Or like this in Python 3:
>>> u1234 = str(b'\xe1\x88\xb4', 'utf-8')
>>> print(u1234)  # if terminal supports UTF-8 and Python detected it as stdout's encoding
ሴ
>>> u1234.encode('unicode-escape')
b'\\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<class 'str'>
>>> len(u1234)
1
(And I am sure that the amount of syntax churn between Python 2.7 and Python 3 around bytestrings, strings, and Unicode had something to do with the continued popularity of Python 2.7. I suppose that when Python 3 was designed, nobody yet realized that everything would converge on UTF-8, which in hindsight makes all the fuss about abstraction feel unnecessary.)
But the Unicode abstraction does not happen automatically if you don't want it to. The point of a bytestring is that you can directly get at the bytes. Even if your string happens to be a UTF-8 sequence, you can still access bytes in the sequence:
>>> len(b'\xe1\x88\xb4')
3
>>> b'\xe1\x88\xb4'[0] # Python 2.7 str
'\xe1'
>>> b'\xe1\x88\xb4'[0] # Python 3 bytestring
225
Indexing works in both Python 2.7 and Python 3, with the difference being that Python 2.7's str hands you back a one-character string, while Python 3's bytestring hands you back the integer (225 == 0xe1).
You can also do other wonderful things with bytestrings, like knowing if they will fit in a reserved space within a file, sending them directly over a socket, calculating the HTTP Content-Length header correctly, and avoiding Python Bug 8260. In short, use bytestrings when your data is processed and stored in bytes.
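For instance, a correct Content-Length must count encoded bytes, not characters (a sketch, assuming UTF-8 is the wire encoding):
>>> body = 'Heävy Mëtal Ümlaüts'
>>> len(body)                  # characters: the wrong number for Content-Length
19
>>> len(body.encode('utf-8'))  # bytes on the wire: the right one
23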
Bytes objects are immutable sequences of single bytes. The documentation has a very good explanation of what they are and how to use them.
I have been searching for hours now to find a way to fully reverse the result of a str.encode call like this:
"testäch基er".encode("cp1252", "backslashreplace")
The result is
b'test\xe4ch\\u57faer'
Now I want to convert it back with
b'test\xe4ch\\u57faer'.decode("cp1252")
and I get
'testäch\\u57faer'
So how do I get my 基 back? I get nearly there by using decode("unicode-escape") instead (it would work for this example), but that assumes the bytes were encoded with ISO-8859-1, not cp1252, so any bytes between 0x80 and 0x9F would come out wrong.
Well...
>>> b'test\xe4ch\\u57faer'.decode('unicode-escape')
'testäch基er'
But backslashreplace -> unicode-escape is not a consistent round trip. If there are backslashes in the original string, they won't get escaped by backslashreplace, but they will get interpreted by unicode-escape and replaced with unexpected characters.
>>> '☃ \\u2603'.encode('cp1252', 'backslashreplace').decode('unicode-escape')
'☃ ☃'
There is no way to reliably reverse the encoding of a string that has been encoded with an errors fallback such as backslashreplace. That's why it's a fallback: if you could consistently encode and decode with it, it would have been a real encoding.
I was still very new to Python when I asked this question. Now I understand that these fallback mechanisms are just meant for handling unexpected errors, not for saving and restoring data. If you really need a simple and reliable way to encode arbitrary Unicode characters in ASCII, have a look at the quote and unquote functions from the urllib.parse module.
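For example, quote/unquote round-trip losslessly, using UTF-8 percent-escapes by default:
>>> from urllib.parse import quote, unquote
>>> quote('testäch基er')
'test%C3%A4ch%E5%9F%BAer'
>>> unquote('test%C3%A4ch%E5%9F%BAer')
'testäch基er'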
In C, I can do this: printf("%c%c", -108, -109);. But how do I do it in Python?
args[2]='Test%c%cFile'% (-108,-109)
OverflowError: unsigned byte integer is less than minimum
From the comments, it seems you want to put some non-ASCII characters into a string. Hacks involving negative integers are not necessary in Python, which has perfectly good Unicode strings for holding any character you could dream of. Use a code chart (like this one) to find the code points for your characters, then insert them into your string using the \uXXXX escape syntax.
Edit: You seem to be after U+9493 CJK UNIFIED IDEOGRAPH-9493, which can be inserted into a string using
str = u"\u9493"
(Thanks @KeithThompson!)
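In Python 3, str literals are Unicode already, so the u prefix is optional, and chr builds the same character from its code point:
>>> s = "\u9493"
>>> s == u"\u9493" == chr(0x9493)
True
>>> print(s)
钓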
The real answer is that you have been relying on non-standard behavior in your C code (one of the language lawyers can chime in and tell you whether it's truly undefined or not).
You need to look into how to properly support non-ASCII, multibyte character sets in Python rather than just trying to duplicate your incorrect C code.
There are plenty of resources that will help with correct "i18n" of an app.
Is there a comprehensive character replacement module for Python that finds all non-ASCII or non-Unicode characters in a string and replaces them with ASCII or Unicode equivalents? The casual reliance on the "ignore" argument during encoding or decoding is insane, but so is a "?" in every place where an untranslated character used to be.
I'm looking for one module that finds irksome characters and conforms them to whatever standard is requested.
I realize that the amount of extant alphabets and encodings makes this somewhat impossible, but surely someone has taken a stab at it? Even a rudimentary solution would be better than the status quo.
The simplification for data transfer that this would mean is enormous.
I don't think what you want is really possible, but I think there is a decent option.
unicodedata has a normalize function that can gracefully degrade text for you...
import unicodedata

def gracefully_degrade_to_ascii(text):
    # NFKD splits each accented character into a base character plus
    # combining marks; 'ignore' then drops whatever has no ASCII form.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
Assuming the charset you're using is already mapped into Unicode, or at least can be mapped into Unicode, you should be able to degrade the Unicode version of that text down to ASCII with this module (it's part of the standard library, too).
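For instance (Python 3, where encode returns a bytes object):
>>> gracefully_degrade_to_ascii(u'Heävy Mëtal Ümlaüts')
b'Heavy Metal Umlauts'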
Full Docs - http://docs.python.org/library/unicodedata.html
To look at any individual character and guess its encoding would be hard and probably not very accurate. However, you can use chardet to try to detect the encoding of an entire file. Then you can use the string decode() and encode() methods to convert it to UTF-8.
http://pypi.python.org/pypi/chardet
And UTF-8 is backwards compatible with ASCII so that won't be a big deal.
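A minimal sketch of that workflow; chardet is a third-party package and the filenames here are just placeholders:
import chardet  # pip install chardet

with open('unknown.txt', 'rb') as f:      # read raw bytes, no decoding yet
    raw = f.read()

guess = chardet.detect(raw)               # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'])      # bytes -> str using the guessed encoding
with open('unknown.utf8.txt', 'wb') as f:
    f.write(text.encode('utf-8'))         # str -> bytes, now as UTF-8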
Hello, I was wondering if you know of any other way to encode a string to be URL-safe, because urllib.quote is doing it wrong; the output is different than expected.
If I try
urllib.quote('á')
I get
'%C3%A1'
But that's not the correct output; it should be
%E1
As demonstrated by the tool provided on this site.
And this is not me being difficult; the incorrect output of quote is preventing the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
and then try the JavaScript tool I mentioned, I get these strings, respectively:
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how they are almost the same, but the URL produced by quote doesn't work while the other one does.
I tried messing with encode('utf-8') on the string passed to quote, but it does not make a difference.
I tried other Spanish words with accents and the ñ; they are all represented differently.
Is this a Python bug?
Do you know of some module that gets this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
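In Python 3, urllib.parse.quote does exactly this (forward slashes used here for illustration; they are in the default safe set):
>>> from urllib.parse import quote
>>> quote('/images/á/some file.jpg')
'/images/%C3%A1/some%20file.jpg'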
OK, got it; I have to encode to ISO-8859-1, like this:
word = u'á'
word = word.encode('iso-8859-1')
print word
Python 2 source files are interpreted as ASCII by default, so even though your file may be saved in a different encoding, your UTF-8 char is interpreted as two separate chars.
Try putting a comment like the one below as the first or second line of your code to declare the file encoding, and you might need to use u'á' as well:
# coding: utf-8
What about using Unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems someone wrote a pretty large function to convert to ASCII URLs; that's what I need. But I was hoping there was some encoding tool in the std lib for the job.