When, where, and how does Python implicitly apply encodings to strings, or perform implicit transcodings (conversions)?
And what are those "default" (i.e., implied) encodings?
For example, what are the encodings:
of string literals?
s = "Byte string with national characters"
us = u"Unicode string with national characters"
of byte strings when type-converted to and from Unicode?
data = unicode(random_byte_string)
when byte- and Unicode strings are written to/from a file or a terminal?
print(open("The full text of War and Peace.txt").read())
There are multiple parts of Python's functionality involved here: reading the source code and parsing the string literals, transcoding, and printing. Each has its own conventions.
Short answer:
For the purpose of code parsing:
str (Py2) -- not applicable, raw bytes from the file are taken
unicode (Py2)/str (Py3) -- "source encoding", defaults are ascii (Py2) and utf-8 (Py3)
bytes (Py3) -- none, non-ASCII characters are prohibited in the literal
For the purpose of transcoding:
both (Py2) -- sys.getdefaultencoding() (ascii almost always)
there are implicit conversions which often result in a UnicodeDecodeError/UnicodeEncodeError
both (Py3) -- none; conversion between str and bytes must always be done explicitly via encode()/decode() (whose encoding argument defaults to utf-8)
For the purpose of I/O:
unicode (Py2) -- <file>.encoding if set, otherwise sys.getdefaultencoding()
str (Py2) -- not applicable, raw bytes are written
str (Py3) -- <file>.encoding, always set and defaults to locale.getpreferredencoding()
bytes (Py3) -- none, printing produces its repr() instead
First of all, some terminology clarification so that you understand the rest correctly. Decoding is translation from bytes to characters (Unicode or otherwise), and encoding (as a process) is the reverse. See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software to get the distinction.
Now...
Reading the source and parsing string literals
At the start of a source file, you can specify the file's "source encoding" (its exact effect is described later). If not specified, the default is ascii for Python 2 and utf-8 for Python 3. A UTF-8 BOM has the same effect as a utf-8 encoding declaration.
Python 2
Python 2 reads the source as raw bytes. It only uses the "source encoding" to parse a Unicode literal when it sees one. (It's more complicated than that under the hood, but this is the net effect.)
> type t.py
# Encoding: cp1251
s = "абвгд"
us = u"абвгд"
print repr(s), repr(us)
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0430\u0431\u0432\u0433\u0434'
<change encoding declaration in the file to cp866, do not change the contents>
> py -2 t.py
'\xe0\xe1\xe2\xe3\xe4' u'\u0440\u0441\u0442\u0443\u0444'
<transcode the file to utf-8, update declaration or replace with BOM>
> py -2 t.py
'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4' u'\u0430\u0431\u0432\u0433\u0434'
So, regular strings will contain the exact bytes that are in the file. And Unicode strings will contain the result of decoding the file's bytes with the "source encoding".
If the decoding fails, you will get a SyntaxError. The same happens if there is a non-ASCII character in the file when no encoding is specified. Finally, if the unicode_literals future import is used, any regular string literals (in that file only) are treated as Unicode literals when parsing, with everything that entails.
Python 3
Python 3 decodes the entire source file with the "source encoding" into a sequence of Unicode characters. Any parsing is done after that. (In particular, this makes it possible to have Unicode in identifiers.) Since all string literals are now Unicode, no additional transcoding is needed. In byte literals, non-ASCII characters are prohibited (such bytes must be specified with escape sequences), evading the issue altogether.
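A minimal sketch of the byte-literal rule (Python 3):

```python
# In Python 3, non-ASCII characters are not allowed inside bytes literals;
# such bytes must be written as escape sequences instead:
data = b'\xd0\xb0'    # OK: escapes for the UTF-8 bytes of 'а'
# data = b'а'         # SyntaxError: bytes can only contain ASCII literal characters
assert data.decode('utf-8') == '\u0430'
```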
Transcoding
As per the clarification at the start:
str (Py2)/bytes (Py3) -- bytes => can only be decoded (directly, that is; details follow)
unicode (Py2)/str (Py3) -- characters => can only be encoded
Python 2
In both cases, if the encoding is not specified, sys.getdefaultencoding() is used. It is ascii (unless you uncomment a code chunk in site.py, or do some other hacks which are a recipe for disaster). So, for the purpose of transcoding, sys.getdefaultencoding() is the "string's default encoding".
Now, here's a caveat:
decode() and encode() -- with the default encoding -- are done implicitly when converting str<->unicode:
in string formatting (a third of UnicodeDecodeError/UnicodeEncodeError questions on Stack Overflow are about this)
when trying to encode() a str or decode() a unicode (the second third of the Stack Overflow questions)
Python 3
There's no "default encoding" at all: implicit conversion between str and bytes is now prohibited.
bytes can only be decoded and str only encoded; the conversion is always explicit (the encoding argument of encode()/decode() defaults to utf-8)
calling str() on a bytes object produces its repr() instead (which is only useful for debug printing), evading the encoding issue entirely
mixing str and bytes (e.g., concatenating them) raises a TypeError
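A short sketch of these rules (Python 3):

```python
text = 'абвгд'
raw = text.encode('utf-8')   # explicit str -> bytes; 'utf-8' is also the default
try:
    text + raw               # implicit mixing of str and bytes is a TypeError
except TypeError as exc:
    print(exc)               # e.g. can only concatenate str (not "bytes") to str

assert str(raw).startswith("b'")      # str() on bytes yields the repr
assert raw.decode('utf-8') == text    # only an explicit decode gets the text back
```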
Printing
This does not affect a variable's value, only what you see on the screen when it's printed -- and whether you get a UnicodeEncodeError while printing.
Python 2
A unicode is encoded with <file>.encoding if set; otherwise, it's implicitly converted to str as per the above. (The final third of the UnicodeEncodeError SO questions fall into here.)
For standard streams, the stream's encoding is guessed at startup from various environment-specific sources, and can be overridden with the PYTHONIOENCODING environment variable.
str's bytes are sent to the OS stream as-is. What specific glyphs you will see on the screen depends on your terminal's encoding settings (if it's something like UTF-8, you may see nothing at all if you print a byte sequence that is invalid UTF-8).
Python 3
The changes are:
Now files opened with text vs. binary mode natively accept str or bytes, correspondingly, and outright refuse to process the wrong type. Text-mode files always have an encoding set, locale.getpreferredencoding(False) being the default.
print for text streams still implicitly converts everything to str, which in the case of bytes prints its repr() as per the above, evading the encoding issue altogether
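Both behaviors can be seen without a real terminal, using io.StringIO as a stand-in for a text-mode stream (a sketch):

```python
import io

buf = io.StringIO()          # stands in for a text-mode stream
print(b'abc', file=buf)      # print converts bytes via str(), i.e. their repr()...
assert buf.getvalue() == "b'abc'\n"

try:
    buf.write(b'abc')        # ...but writing bytes directly to a text stream is refused
except TypeError:
    print('text streams reject bytes')
```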
As for the internal format used to store strings: you should not care about that encoding. Python decodes characters into an internal representation that is mostly transparent; just picture a str as abstract Unicode text, and bytes as an abstract byte sequence.
The internal representation in CPython 3.3+ varies with the "widest" character in the string (PEP 393): it uses 1, 2, or 4 bytes per code point (Latin-1, UCS-2, or UCS-4, respectively). From Python code, a string simply behaves like an abstract sequence of code points; unless you program in C or use special facilities (such as memory views), you will never see the internal encoding.
Bytes objects are just a view of actual memory; Python interprets each element as an unsigned char. Again, you should usually think about what the sequence represents, not about any internal encoding.
Python 2 stores str as unsigned chars, and unicode as UCS-2 or UCS-4 depending on the build ("narrow" vs. "wide"); on a narrow build, code points above U+FFFF take two code units, whereas in Python 3 every code point is a single character.
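The width-dependent storage can be glimpsed from pure Python via sys.getsizeof; the exact byte counts vary between CPython versions, so this sketch only compares relative sizes:

```python
import sys

# Three strings of equal length (4 characters), but with different widest
# code points, so CPython picks different per-character widths (PEP 393):
ascii_s = 'aaaa'             # U+0061: 1 byte per code point
bmp_s   = '\u0430' * 4       # Cyrillic U+0430: 2 bytes per code point
astral  = '\U0001F600' * 4   # emoji U+1F600: 4 bytes per code point
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral)
```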
Related
So I'd like to encode some values as Unicode in my Python 2.7 script. I'd like to know if I can specify which type of Unicode to use, i.e., UTF-8 vs. UTF-32. Apart from that, are there any limitations as to which encodings are supported in Python 2.7, and how is the default encoding determined?
So, first things first: you should be using Python 3, not Python 2.
The handling of text and Unicode is the major difference between the two versions of the language -- and the real reason incompatible changes had to be made -- and it is much, much more straightforward in Python 3.
This means that to talk about Unicode in Python 2 you have to understand certain things: unicode is used to represent text -- characters, regardless of the underlying representation those characters have.
In Python 2 programs, all text typed in the program itself has to be typed as "u"-prefixed strings, like u"..." or u'...' -- otherwise the strings are considered "byte strings", just like the ones in C code. (Alternatively, one can place from __future__ import unicode_literals in the first or second line of the file, so this is done automatically.)
Otherwise, all data read into the program -- from text files, database connections, inbound HTTP requests -- usually arrives as byte strings in Python 2, and has to be explicitly converted to text strings (that is, "unicode objects" in Python 2 speak) before being processed. This is done by calling the byte string's .decode method, passing as its first parameter the name of the encoding used for those bytes. So, if you have data you read from a UTF-8-encoded file, it can be decoded to text by doing:
data = data.decode("utf-8") # and so on for other encodings.
Also, if you are typing any non-ascii character in the source code of a Python2 file, regardless of it being inside a string (or, inside a comment, for example), you have to declare the file encoding in the first line of the file.
That is done with a Python comment that is treated in a special way by the language parser -- the first (or second) line of code should contain:
# encoding: utf-8
(Of course, you should type the encoding actually used by your editor to store the file. Also, some variants of this marker are allowed, such as writing "coding" instead of "encoding", the ":" being optional, and so on.)
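On Python 3, the tokenize module exposes the same PEP 263 rules the parser uses, which makes the marker easy to experiment with (a sketch):

```python
import io
import tokenize

# detect_encoding() implements the PEP 263 declaration lookup used by the parser.
src = b"# -*- coding: cp1251 -*-\nprint('hi')\n"
enc, header_lines = tokenize.detect_encoding(io.BytesIO(src).readline)
assert enc == 'cp1251'

# With no declaration, Python 3's default applies:
enc_default, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
assert enc_default == 'utf-8'
```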
So: what I've described in the previous five paragraphs happens automatically in Python 3. But if you have followed along so far, you now have a program holding text to be handled. (Note that you did not mention in your question how you are inputting the text you want to encode in different ways.)
So, just as you did explicitly convert the input bytes to in memory unicode strings, now you can use the .encode method to convert the text back to whatever text-encoding you want.
If you have some text that you want to write to a file encoded in UTF-32 little-endian, you do:
with open("myfile.txt", "wb") as file_:
    file_.write(data.encode("utf-32-le"))
(The file is opened in binary mode, since what .encode produces is a byte string.)
The valid text codecs are listed, as per Eran's answer at:
https://docs.python.org/2/library/codecs.html#standard-encodings
Now, if you do some tests with this and succeed, you'd better do two things before proceeding any further:
Switch to Python 3. Python 2 is truly obsolete at this point -- check whether it is already installed on your system by typing "python3" instead of just "python". If it is not, just install it; it can live side by side with Python 2.
Read this article to get a grasp of what really goes on when we talk about Unicode and encodings. (The author, Joel Spolsky, is a co-founder of Stack Overflow itself, and the article is from 2003.)
In Python 2, str is a byte string, and the default codec is ASCII. You can decode byte strings and re-encode them.
supported encodings can be found here: https://docs.python.org/2/library/codecs.html#standard-encodings
Here's an example:
a = "my string" # a is ASCII encoded bytes
b = u"my string" # b is unicode, not encoded
c = a.decode() # c is unicode, not encoded, by default decoding ASCII, you can specify otherwise as an argument
d = c.encode('utf-32') # d is utf-32 encoded bytes
print type(a) # output: <type 'str'>
print type(b) # output: <type 'unicode'>
print type(c) # output: <type 'unicode'>
print type(d) # output: <type 'str'>
Note 1: in Python 3 things are somewhat different.
Note 2: In order to write non-ASCII literals in your script (that is, if you want to write a = "☂" as part of your code, as opposed to having a variable that contains data you got from somewhere), you have to declare the encoding at the top of the file; more info here. And in Python 2, only a small subset of Unicode characters is accepted in literal code (while in memory, of course, you are not limited).
Note 3: Of course, while the unicode type is "not encoded" as far as you are concerned, internally Python does keep it in some encoded form (UCS-2 or UCS-4 in Python 2, depending on the build). But that's an internal detail that shouldn't affect your code, generally speaking.
Encoding in JS means converting a string with special characters to escaped usable string. like : encodeURIComponent would convert spaces to %20 etc to be usable in URIs.
So encoding here means converting to a particular format.
In Python 2.7, I have a string : 奥多比. To convert it into UTF-8 format, however, I need to use decode() function.
Like: "奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'
I want to understand how the meaning of encode and decode changes between languages. To me, essentially, I should be doing "奥多比".encode("utf-8"). What am I missing here?
You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax), with one of the standard Unicode encodings, UTF-8.
You are not creating UTF-8, you created a Unicode text object, by decoding from a UTF-8 byte stream.
The byte string literal "奥多比" is a sequence of binary data: bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:
Ned Batchelder's Pragmatic Unicode
The Python Unicode HOWTO
Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
In Python 2, it's of type str, i.e., a sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply put, a codec specifies how bytes should be converted to a sequence of Unicode code points. Look into the Unicode HOWTO for a more in-depth article on this.
Trying to understand encoding/decoding/unicode business in Python2.7 with vim.
I have a unicode string us to which I assign some unicode string u'é'.
Question 1
How is us represented in memory? Is it a sequence of 32-bit integers holding the \u code points? Or is it kept in memory as a sequence of 8-bit \x values in some default encoding?
Question 2
I see four different ways to set an encoding for the Unicode string us: #1 at the beginning of the test.py file; #2 as an argument of the encode function; #3 as an argument for vim; #4 as the local encoding of the file system. So, what do these four encodings (#1, #2, #3, #4) do?
$ vim test.py
_____________
#encoding: #1
us=u'é'
print us.encode(encoding='#2')
_____________
:set encoding=#3
$ locale | grep LANG
LANG=en_US.#4
LANGUAGE=
In Python 2.x, unicode strings are stored internally as either UCS-2 or UCS-4, depending on the options used when building the interpreter.
Source encoding as far as Python is concerned.
Encoding used to encode us as bytes when the code is executed.
Source encoding as far as vim is concerned. If this doesn't match #1 then expect trouble.
System encoding. Mostly affects filesystem and terminal output operations.
Question 1 - Storage
us = u'é'
This creates a Unicode string containing the character é. In Python 2.2+, Unicode characters are stored as UCS-2 or UCS-4, which use 2- or 4-byte unsigned integers depending on a build-time option.
Python 3.3+ uses a flexible internal representation (PEP 393), which uses 1, 2, or 4 bytes for every character of a given string, depending on the widest character in it.
The storage of Unicode strings now depends on the highest code point in
the string:
pure ASCII and Latin-1 strings (U+0000-U+00FF) use 1 byte per code point;
BMP strings (up to U+FFFF) use 2 bytes per code point;
strings containing other planes (U+10000-U+10FFFF) use 4 bytes per code point.
(The 1-to-4-byte variable-width scheme with 110xxxxx 10xxxxxx continuation bytes is UTF-8; Python produces it when you encode a string, but does not use it for internal storage.)
Question 2 - Encoding
us=u'é'
Declares us to be a Unicode string stored as above, note that in python 3 all strings are by default Unicode so the u can be omitted.
print(us.encode('utf-8'))  # encoding='#2'; note that 'ascii' here would raise UnicodeEncodeError for 'é'
Tells print how to attempt to translate the Unicode string for output, note that if you are using Python 3.3+ and a Unicode capable terminal/console you probably don't need to ever use this.
#set encoding=#3
Tells vim, emacs, and a number of other editors the encoding to use when displaying and/or editing the file. This applies to all text files, not just Python.
$ locale | grep LANG
LANG=en_US.#4
Is an operating-system locale setting that tells it how to display various things -- specifically, which code page to use when displaying characters beyond ASCII.
This doesn't actually answer the question but I'm hoping it gives some more insight into this problem.
Answer to question 1: it shouldn't matter to the programmer how Unicode strings are represented internally in Python.
To question 2:
All the programmer should care about is that the encoding requirements of the data source and sink are known and correctly specified. I would assume that Python can correctly interpret UTF-encoded files by reading the BOM, and maybe even by making educated guesses, but without the BOM it can be ambiguous how to handle bytes with the high bit set. So it's advisable to either make sure the BOM is there, or tell Python the file is UTF-8 encoded if you're not sure.
There's a difference between "Unicode" and "UTF" that seems to be glossed over above: "UTF" specifies the representation in storage (disk, memory, network packet), while "Unicode" is simply the fact that each character has a single value (code point) ranging from 0 to 0x10FFFF. The various flavors of UTF encode that value into the appropriate storage. Working with encoded strings can be annoying, though (as the character width is variable), so when strings are actually held in memory it's often easier to expand them into some format that allows easy manipulation.
If you want a Unicode string in Python pre-3, just type u'<whatever>' and in 3+ type '<whatever>'. You'll get Unicode and you can use \uXXXX and \UXXXXXXXX escapes if it's infeasible to just type the characters in directly. When you want to write the data, specify the encoding. UTF-8 is often the easiest to deal with and seems to be the most commonly used but you may have reason to use a UTF-16 flavor.
The takeaway here is that the encoding is just a way to transform Unicode data so that it can be persisted. The various flavors of UTF are just the encodings, they are not actually Unicode.
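A small sketch of the escape syntax, and of the same text under two UTF flavors (Python 3):

```python
# \uXXXX (4 hex digits) and \UXXXXXXXX (8 hex digits) escapes denote code
# points directly, sidestepping source-file encoding concerns:
s = '\u00e9'                  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
assert len(s) == 1            # one code point, however it's stored internally
assert s.encode('utf-8') == b'\xc3\xa9'       # 2 bytes in UTF-8...
assert s.encode('utf-16-le') == b'\xe9\x00'   # ...2 different bytes in UTF-16-LE
```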
My windows language is Chinese.
To illustrate my point, I use package pathlib.
from pathlib import *
rootdir=Path(r'D:\新建文件夹')
print(rootdir.exists())
Python2.7 I get False
Python3 I get True
Any ideas? Thanks for any advice.
For Python 2.7, you can install pathlib with "pip install pathlib".
In Python 3 strings are Unicode by default. In Python 2, they are byte strings encoded in the source file encoding. Use a Unicode string in Python 2.
Also make sure to declare the source file encoding and make sure the source is saved in that encoding.
#coding:utf8
from pathlib import *
rootdir=Path(ur'D:\新建文件夹')
print(rootdir.exists())
The main difference between Python 2 and Python 3 is the basic types that exist to deal with texts and bytes. On Python 3 we have one text type: str which holds Unicode data and two byte types bytes and bytearray.
On the other hand, Python 2 has two text types: str, which for all intents and purposes is limited to ASCII plus some undefined data above the 7-bit range; unicode, which is equivalent to the Python 3 str type; and one byte type, bytearray, which it borrowed from Python 3.
Python 3 removed all codecs that don't go from bytes to Unicode or vice versa and removed the now useless .encode() method on bytes and .decode() method on strings.
More about this e.g. here.
Use Unicode literals for Windows paths: add from __future__ import unicode_literals at the top.
Explanation
r'D:\新建文件夹' is a bytestring on Python 2. Its specific value depends on the encoding declaration at the top (such as # -*- coding: utf-8 -*-); without that declaration you get an error for a non-ASCII literal in Python 2. r'D:\新建文件夹' is a Unicode string on Python 3, and the default source-code encoding is utf-8 (no encoding declaration is required).
Python uses the Unicode API when working with files on Windows if the input is Unicode, and the "ANSI" API if the input is bytes.
If the source-code encoding differs from the "ANSI" encoding (such as cp1252), the result may differ, because the bytes are passed as-is and the same byte sequence can represent different characters in different encodings. The result may also differ if the filename can't be represented in the "ANSI" encoding at all: a single-byte encoding such as cp1252 can represent only 256 values, while there are around a million possible Unicode characters. Using Unicode strings for filenames on Windows fixes both issues.
I tried to understand by myself encode and decode in Python but nothing is really clear for me.
str.encode([encoding[, errors]])
str.decode([encoding[, errors]])
First, I don't understand the need of the "encoding" parameter in these two functions.
What is the output of each function -- its encoding? What is the use of the "encoding" parameter in each function? I don't really understand the definition of "byte string".
I have an important question, is there some way to pass from one encoding to another?
I have read some text on ASN.1 about "octet string", so I wondered whether it was the same as "bytes string".
Thanks for your help.
It's a little more complex in Python 2 (compared to Python 3), since it conflates the concepts of 'string' and 'bytestring' quite a bit, but see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Essentially, what you need to understand is that 'string' and 'character' are abstract concepts that can't be directly represented by a computer. A bytestring is a raw stream of bytes straight from disk (or that can be written straight from disk). encode goes from abstract to concrete (you give it preferably a unicode string, and it gives you back a byte string); decode goes the opposite way.
The encoding is the rule that says 'a' should be represented by the byte 0x61 and 'α' by the two-byte sequence 0xCE 0xB1 (in UTF-8).
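The errors parameter from the question's signatures controls what happens when the target encoding can't represent a character; a sketch (Python 3 syntax, with arbitrary sample text):

```python
# errors='strict' (the default) raises; 'replace' and 'ignore' degrade gracefully.
s = 'fran\u00e7aise'          # contains ç (U+00E7), not representable in ASCII
assert s.encode('ascii', 'replace') == b'fran?aise'
assert s.encode('ascii', 'ignore') == b'franaise'
try:
    s.encode('ascii')         # default errors='strict'
except UnicodeEncodeError:
    print('strict mode raises on unencodable characters')
```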
My presentation from PyCon, Pragmatic Unicode, or, How Do I Stop The Pain covers all of these details.
Briefly, Unicode strings are sequences of integers called code points, and bytestrings are sequences of bytes. An encoding is a way to represent Unicode code points as a series of bytes. So unicode_string.encode(enc) will return the byte string of the Unicode string encoded with "enc", and byte_string.decode(enc) will return the Unicode string created by decoding the byte string with "enc".
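A sketch of that round trip with two different encodings of the same text (Python 3):

```python
code_points = '\u20ac\u0041'             # '€A' as a sequence of code points
as_utf8  = code_points.encode('utf-8')
as_utf16 = code_points.encode('utf-16-le')
assert as_utf8  == b'\xe2\x82\xac\x41'   # different byte sequences...
assert as_utf16 == b'\xac\x20\x41\x00'   # ...for the same text
# Decoding each with its own codec recovers the identical Unicode string:
assert as_utf8.decode('utf-8') == as_utf16.decode('utf-16-le') == code_points
```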
Python 2.x has two types of strings:
str = "byte strings" = a sequence of octets. These are used for both "legacy" character encodings (such as windows-1252 or IBM437) and for raw binary data (such as struct.pack output).
unicode = "Unicode strings" = a sequence of Unicode code points, stored internally as UCS-2 or UCS-4 depending on how Python is built.
This model was changed for Python 3.x:
2.x unicode became 3.x str (and the u prefix was dropped from the literals).
A bytes type was introduced for representing binary data.
A character encoding is a mapping between Unicode strings and byte strings. To convert a Unicode string to a byte string, use the encode method:
>>> u'\u20AC'.encode('UTF-8')
'\xe2\x82\xac'
To convert the other way, use the decode method:
>>> '\xE2\x82\xAC'.decode('UTF-8')
u'\u20ac'
Yes, a byte string is an octet string. Encoding and decoding happen when inputting/outputting text (from/to the console, files, the network, ...). Your console may use UTF-8 internally, your web server may serve latin-1, and certain file formats need strange encodings, like BibTeX's accents: fran\c{c}aise. You need to convert from/to them on input/output.
The {en|de}code methods do this. They are often called behind the scenes (for example, print "hello world" encodes the string to whatever your terminal uses).