Unicode vs UTF-8 confusion in Python / Django?


I stumbled over this passage in the Django tutorial:
Django models have a default str() method that calls unicode() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.
Now, I'm confused because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states
Unicode is a two-byte encoding which covers all of the world's common writing systems.
which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?

what is a "Unicode string" in Python? Does that mean UCS-2?
Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.
You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
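To make the difference between code points and their stored bytes concrete, here is a small sketch. It assumes Python 3.3 or later, where the narrow/wide build distinction no longer exists and len() always counts code points; the sample string is made up for illustration.

```python
# Seven code points; the last one lies outside the Basic Multilingual Plane.
text = "h\u00e9llo \U0001F600"

code_points = len(text)                     # elements as Python sees them
utf8_len = len(text.encode("utf-8"))        # 1 to 4 bytes per code point
utf16_len = len(text.encode("utf-16-le"))   # 2 bytes, or 4 for a surrogate pair
utf32_len = len(text.encode("utf-32-le"))   # always 4 bytes per code point

print(code_points, utf8_len, utf16_len, utf32_len)
```

The same abstract string has three different byte lengths, which is why "Unicode string" says nothing about a particular encoding.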
plain wrong, or is it?
Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).
There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.

Meanwhile, I did some more thorough research to verify what Python's internal representation is and what its limits are. "The Truth About Unicode In Python" is a very good article which quotes the Python developers directly. Apparently, the internal representation is either UCS-2 or UCS-4, depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.

Python stores Unicode as UTF-16. str() will return the UTF-8 representation of the UTF-16 string.

From Wikipedia on UTF-8:
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.
So, it's anywhere between one and four bytes depending on which character you wish to represent within the realm of Unicode.
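A quick way to check the one-to-four-byte claim, sketched in Python 3 with four sample characters chosen for illustration (they are not from the quoted article):

```python
# One character from each UTF-8 length class: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# ASCII characters keep their single-byte ASCII value in UTF-8.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")

print(byte_lengths, ascii_compatible)
```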
From Wikipedia on Unicode:
In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.
So it's able to represent most (but not all) of the world's writing systems.
I hope this helps :)

so what is a "Unicode string" in
Python?
Python 'knows' that your string is Unicode. Hence, if you run a regex on it, it will know what is a character and what is not, which is really helpful. If you call len() on it, it will also give the correct result. As an example, if you take the length of Hello, you will get 5 (even if it's Unicode). But if you take the length of a foreign word and that string is not a Unicode string, then you will get a larger result, because you are counting bytes rather than characters. Python uses the information from the Unicode Character Database to identify each character in the Unicode string. Hope that helps.
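The points above can be sketched in Python 3, where every str is a Unicode string and the byte-string side is played by bytes. The word used is just an example.

```python
import re
import unicodedata

word = "na\u00efve"  # "naive" written with i-diaeresis

char_count = len(word)                    # counts code points
byte_count = len(word.encode("utf-8"))    # counts bytes of one encoding

# The Unicode Character Database knows what each code point is.
char_name = unicodedata.name(word[2])
char_category = unicodedata.category(word[2])

# Regexes on str treat the accented letter as a word character.
is_word = re.fullmatch(r"\w+", word) is not None

print(char_count, byte_count, char_name, char_category, is_word)
```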

Related

Example of sequence of characters that may _never_ appear in Python code?

This is a rather theoretical question, pertaining to the fundamental general syntax of Python. I am looking for an example of sequence of characters (*1) that would always cause a syntax error when present inside a Python program, regardless of the context (*2). For instance, the sequence a[0) is not a correct example, because the program
s = 'a[0)'
is perfectly valid. What I want is a sequence of characters that, wherever it occurs in the source code, causes a syntax error! (Oh, and of course, all the characters in this sequence have to be characters individually allowed to appear in a Python program).
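The a[0) example can be checked mechanically with the built-in compile(). This is only a sketch in Python 3; is_valid_python is a hypothetical helper name, not something from the question.

```python
def is_valid_python(source):
    """Return True if source compiles as a Python module (hypothetical helper)."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

inside_string = is_valid_python("s = 'a[0)'")  # sequence hidden in a string literal
bare = is_valid_python("a[0)")                 # same sequence as actual code

print(inside_string, bare)
```

A sequence that is "never valid" would have to make such a check fail in every possible surrounding program, which is what makes the question hard.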
(edit: the following blockquoted example is wrong, since newlines may appear in triple-quoted strings. Thanks to ekhumoro for this relevant remark!)
I suspect that the sequence “newline-quote-newline” is forbidden,
because the newline character may not appear in a quoted string: so,
if the first newline character does not cause a syntax error, this
means that the quote character starts a quoted string, and then the
second newline character will cause a syntax error.
It seems to me that a fundamentally buggy sequence could be
(edited some mistakes here: thanks to ekhumoro for noticing!)
␤'[)"[)'''[)"""[)'[)"[)'''[)"""[)
(where ␤ denotes a newline character), because one of the [)'s shall necessarily occur outside a quoted string, and the string cannot occur in a comment because of the initial ␤.
However, I do not know enough about the sharp details of Python syntax to be sure that the above examples are correct: maybe there exists some bizarre context, more subtle than mere quoted strings, where the above sequences of characters would be allowed? Maybe the full details of Python syntax even make it actually impossible to build any buggy sequence such as what I am looking for?…
(edit added for more clarity)
So, actually my question is about whether the specifications allow you to define a new kind of quoted context at some point: is there something in the Python specifications that say that the only possible quoted contexts are '…', "…", '''…''', """…""" and #…␤ (plus possibly a few more which I would not be currently aware of), or may you devise new quoted contexts as you wish? Or maybe you could make your program start with a kind of codec, after which you would write the sequel of the program in an arbitrary language completely different from Python…?
(*1) In a first version of this question, I wrote “bytes” instead of “characters”, because I did not want to be bothered with bizarre Unicode characters; but that made it possible to turn the question into one about encoding issues… So, let us assume that we are working with a fixed encoding, whose set of admissible characters is fixed and well-known (say, ASCII for simplicity).
(*2) FYI, the motivation of my question is to stress the difference between the language of a universal Turing machine (with self-delimited programs) and a general-purpose programming language, in the context of Kolmogorov complexity.
PS: Answers to the same question for other (interpreted) real-life languages are also welcome :-)

Why does Python default to ASCII encoding?

I've run into lots of bugs in Python due to the default ASCII encoding. I always have to remember to switch it to UTF-8.
I wanted to know, is there any reason or benefit to a default ASCII encoding? It seems strictly worse than utf8, and causes annoying bugs. Am I missing something by always switching to utf8?

Because Python 2 Unicode was built (back in 1999-2000) before UTF-8 was ubiquitous. ASCII on the other hand was understood by almost all target platforms using 8-bit codecs.
If you look at the Wikipedia UTF-8 adoption graph, you'll see that UTF-8 didn't really rise to popularity until 2006.
Only with Python 3 was it possible to change this default; there, implicit encoding and decoding are gone, and the default source code encoding has been changed to UTF-8 (the default for printing, file I/O and filesystem names is system dependent, as it is in Python 2).
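What "implicit encoding and decoding is gone" means in practice can be sketched in Python 3 as follows; the byte and text values are arbitrary examples.

```python
import sys

# In Python 3 this always reports UTF-8.
default_encoding = sys.getdefaultencoding()

# Mixing bytes and text no longer triggers a silent ASCII decode.
try:
    b"abc" + "def"
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None

# The conversion has to be spelled out, naming the encoding.
combined = b"abc".decode("ascii") + "def"

print(default_encoding, type(mixing_error).__name__, combined)
```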

How do I universally ignore all unicode errors in python?

Running python2.7 here. I am writing a quick and dirty little script to do some web scraping, and I just want the unicode handler to just ignore all unicode errors.
That is, I am totally fine if it just drops whatever characters it can't convert to ascii anywhere in the program. This is just a throwaway script I just want to get done :-)
Is there some global "ignore" variable I can set?
Thanks!
/YGA

I am totally fine if it just drops whatever characters it can't convert to ascii anywhere in the program
Then you want to explicitly create your Unicode objects from the ascii codec, and specify to ignore errors:
input = unicode(input_bytes, encoding='ascii', errors='ignore')
See the Unicode HOWTO for more on properly handling Unicode.
(And for writing new code, always choose Python 3 or later unless you have an excellent well-formed reason to stay behind.)
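For comparison, a sketch of the same idea in Python 3, where unicode() no longer exists and the error handler is passed to decode() or encode() instead. The sample values are invented for illustration.

```python
# Bytes containing a UTF-8 encoded e-acute, which ASCII cannot represent.
input_bytes = b"caf\xc3\xa9 ok"

# Decoding: bytes that are not valid ASCII are silently dropped.
text = input_bytes.decode("ascii", errors="ignore")

# Encoding: characters with no ASCII equivalent are silently dropped.
ascii_only = "na\u00efve".encode("ascii", errors="ignore")

print(text, ascii_only)
```

Note that this has to be done at each conversion; there is no global switch that applies it everywhere.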

How to deal with strings where encoding is unclear

I know there is quite a lot on the web and on stackoverflow about Python and character encoding, but I haven't really found the answer I'm looking for. So at the risk of creating a duplicate, I'm going to ask anyway.
It's a script that gets a dictionary in which all keys are proper unicode objects. The values are strings with unknown encoding. For the keys it wouldn't matter that much: the keys are all very simple, unlike the values. The values can (and do) come in a large variety of encodings. There are some dictionaries where some values are in ASCII, others in UTF-16BE and yet others in cp1250.
That totally messes up further processing, which currently consists mainly of printing or concatenating (yes, that simple).
The work-around that I came up with, which makes Python print statements work properly is:
import chardet  # third-party encoding detector used below

for key in data.keys():
    # hope they did not choose a funky encoding
    try:
        print key+":"+data[key]  # this triggers a UnicodeDecodeError on many encodings
        current_data = data[key]
    except UnicodeDecodeError:
        # trying to cope with a funky encoding
        current_data = data[key].decode(chardet.detect(data[key])['encoding'])  # doing this on each value, because the dictionary sometimes contains multiple encodings
        print key+":",  # printing without newline was a workaround, because concatenating didn't work
        print current_data.encode('UTF-8')
In Python this works just fine. In Jython 2.7rc1 which I use in the project (not an option to switch), it prints characters which are definitely not the original encoding (funky looking characters). If anyone has an idea how I can make this also work in Jython that'd be great!
Edit (Example):
Sample-Value:
Our latest scenarios explore two possible versions of the future seen through fresh “lenses”.
Creates a string where the right and left double quotes turn to \x8D and \x8E. I don't know what encoding that is. In Python after using the above code it strips them. In Jython it turns them into white squares.
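One alternative to per-value detection is to try a fixed list of candidate encodings in order. The sketch below is written for Python 3 and has not been tried on Jython; decode_best_effort is a hypothetical helper, and the candidate list (UTF-8, then cp1250) is an assumption based on the encodings mentioned in the question.

```python
def decode_best_effort(raw, candidates=("utf-8", "cp1250")):
    """Return raw as text, trying each candidate encoding in order.

    Hypothetical helper: the candidate list is an assumption, not a
    general-purpose detector.
    """
    if isinstance(raw, str):
        return raw  # already text, nothing to decode
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never fails,
    # although the result may not be what the author intended.
    return raw.decode("latin-1")

data = {
    "plain": b"plain ascii",
    "czech_cp1250": "žluťoučký".encode("cp1250"),
    "czech_utf8": "žluťoučký".encode("utf-8"),
}
decoded = {key: decode_best_effort(value) for key, value in data.items()}
print(decoded)
```

Strict UTF-8 goes first because it rejects most byte sequences produced by other encodings. Single-byte code pages accept almost anything, so a wrong guess among them produces wrong characters rather than an error.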

I'm not familiar with Jython, but the following link I found may prove useful: http://python.6.x6.nabble.com/character-encoding-issues-td1766833.html
It says that you should keep all unicode strings in files separate from your source, and read them with codecs.open. This seemed to work for the person who was experiencing a problem similar to yours.
The following link also mentions something about specifying an encoding parameter to the JVM: https://answers.launchpad.net/sikuli/+question/156443
Without seeing any actual error output, this is the extent of the help I can provide.

URLs: Binary Blob, Unicode or Encoded Unicode String?

I wish to store URLs in a database (MySQL in this case) and process them in Python, though the database and programming language are probably not that relevant to my question.
In my setup I receive unicode strings when querying a text field in the database. But is a URL actually text? Is encoding from and decoding to unicode an operation that should be done to a URL? Or is it better to make the column in the database a binary blob?
So, how do you handle this problem?
Clarification:
This question is not about urlencoding non-ASCII characters with the percent notation. It's about the distinction that unicode represents text and byte strings represent a way to encode this text into a sequence of bytes. In Python (prior to 3.0) this distinction is between the unicode and the str types. In MySQL it is TEXT versus BLOB. So the concepts seem to correspond between programming language and database. But what is the best way to handle URLs in this scheme?

The relevant answer is found in RFC 2396, section
2.1 URI and non-ASCII characters
The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a "character"
(as a distinguishable semantic entity) and an "octet" (an 8-bit
byte). There are two mappings, one from URI characters to octets, and
a second from octets to original characters:
URI character sequence->octet sequence->original character sequence
A URI is represented as a sequence of characters, not as a sequence
of octets. That is because URI might be "transported" by means that
are not through a computer network, e.g., printed on paper, read over
the radio, etc.
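The two mappings the RFC describes can be sketched with Python 3's urllib.parse. The path below is an invented example; the point is that the URI itself stays a sequence of ASCII characters, while the octets behind a percent-escape depend on which character encoding was chosen.

```python
from urllib.parse import quote, unquote

path = "/caf\u00e9"  # original characters, including a non-ASCII e-acute

# Characters -> octets (UTF-8 by default) -> URI characters.
encoded = quote(path)

# URI characters -> octets -> original characters.
decoded = unquote(encoded)

# The same original characters through a different octet mapping.
latin1_encoded = quote(path, encoding="latin-1")

print(encoded, decoded, latin1_encoded)
```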

Do note there is also a standard for Unicode Web addresses: IRI (Internationalized Resource Identifiers), defined in RFC 3987.

On the question: "But is a URL actually text?"
It depends on the context. In some languages or libraries (for example Java; I'm not sure about Python), a URL may be represented internally as an object. However, a URL always has a well-defined text representation, so storing the text representation is much more portable than storing the internal representation used by whatever is the current language of choice.
URL syntax and semantics are covered by quite a few standards, recommendations and implementations, but I think the most authoritative source for parsing and constructing correct URLs would be RFC 2396.
On the question about unicode, section 2.1 deals with non-ascii characters.
(Edit: changed rfc-reference to the newest edition, thank you S.Lott)
