I've run into lots of bugs in Python due to the default ASCII encoding. I always have to remember to switch it to utf8
I wanted to know, is there any reason or benefit to a default ASCII encoding? It seems strictly worse than utf8, and causes annoying bugs. Am I missing something by always switching to utf8?
Because Python 2 Unicode was built (back in 1999-2000) before UTF-8 was ubiquitous. ASCII on the other hand was understood by almost all target platforms using 8-bit codecs.
If you look at the Wikipedia UTF-8 adoption graph, you'll see that UTF-8 didn't really rise to popularity until 2006:
Only with Python 3 was it possible to change this default; there implicit encoding and decoding is gone, and the default source code encoding has been changed to UTF-8 (the default for printing, file I/O and filesystem names is system dependent, as it is in Python 2).
Related
Running python2.7 here. I am writing a quick and dirty little script to do some web scraping, and I just want the unicode handler to just ignore all unicode errors.
That is, I am totally fine if it just drops whatever characters it can't convert to ascii anywhere in the program. This is just a throwaway script I just want to get done :-)
Is there some global "ignore" variable I can set?
Thanks!
/YGA
I am totally fine if it just drops whatever characters it can't convert to ascii anywhere in the program
Then you want to explicitly create your Unicode objects from the ascii codec, and specify to ignore errors:
input = unicode(input_bytes, encoding='ascii', errors='ignore')
See the Unicode HOWTO for more on properly handling Unicode.
(And for writing new code, always choose Python 3 or later unless you have an excellent well-formed reason to stay behind.)
I know there is quite a lot on the web and on stackoverflow about Python and character encoding, but I haven't really found the answer I'm looking for. So at the risk of creating a duplicate, I'm going to ask anyway.
It's a script that gets a dictionary, where all keys are properly as unicode. The values are strings with unknown encoding. For the keys it wouldn't matter that much, keys are all very simple very unlike the values. The values can (and do) contain a large variety of encodings. There are some dictionaries, where some values are in ASCII others as UTF-16BE yet others cp1250.
That totally messes up further processing, which currently consists mainly printing or concatenating (yes, that simple).
The work-around that I came up with, which makes Python print statements work properly is:
for key in data.keys():
# hope they did not chose a funky encoding
try:
print key+":"+data[key] # this triggers a UnicodeDecodeError on many encodings
current_data = data[key]
except UnicodeDecodeError:
# trying to cope with a funky encoding
current_data = data[key].decode(chardet.detect(data[key])['encoding']) # doing this on each value, because the dictionary sometimes contains multiple encodings
print key+":", # printing without newline was a workaround, because connecting didn't work
print current_data.encode('UTF-8')
In Python this works just fine. In Jython 2.7rc1 which I use in the project (not an option to switch), it prints characters which are definitely not the original encoding (funky looking characters). If anyone has an idea how I can make this also work in Jython that'd be great!
Edit (Example):
Sample-Value:
Our latest scenarios explore two possible versions of the future seen through fresh “lenses”.
Creates a string where the right and left double quotes turn to \x8D and \x8E. I don't know what encoding that is. In Python after using the above code it strips them. In Jython it turns them into white squares.
I'm not familiar with Jython, but the following link I found may prove useful: http://python.6.x6.nabble.com/character-encoding-issues-td1766833.html
It says that you should keep all unicode strings in separate files to your source, and read them with codecs.open. This seemed to work for the person who was experiencing a problem similar to yours.
The following link also mentions something about specifying an encoding parameter to the JVM: https://answers.launchpad.net/sikuli/+question/156443
Without seeing any actual error output, this is the extent of the help I can provide.
When using unicode strings in source code, there seems to be many ways to skin a cat. The docs and the relevant PEPs have plenty of information about what's possible, but are scant about what is preferred.
For example, the following each seem to give same result:
# coding: utf8
u1 = '\xe2\x82\xac'.decode('utf8')
u2 = u'\u20ac'
u3 = unichr(0x20ac)
u4 = "€".decode('utf8')
u5 = u"€"
If using the __future__ imports, I've found one more option:
# coding: utf8
from __future__ import unicode_literals
u6 = "€"
In python I am used to there being one obvious way to do it, so what is the recommended method of including international content in source files?
This is a python 2 question.
some background...
Methods u1, u2, u3 just seem silly to me, but I have seen enough people writing like this that I assume it is not just personal preference - is there any particular reason why we might want to force only ascii characters in source files, rather than specifying the encoding, or is this just a habit more likely to be found in older code lying around?
There's huge readability improvement in the code to use the actual symbols rather than some escape sequences, and to not do so would seem to be ignoring the strengths of the language rather than taking advantage of hard work by the python devs.
I think the most common way I've used (in Python 2) is:
# coding: utf-8
text = u'résumé'
The text is readable. Compare to text = u'r\u00e9sum\u00e9', where I must look up what character that is. Everything else is less readable.
If you're using Unicode, your variable is most certainly text and not binary data, so there's no point in keeping it in anything other than a unicode object. (Just in case '€' became an option.)
from __future__ import unicode_literals changes the parsing mode of the program; I think you'd need to be more aware of the difference between text & binary data. (Something that, if you ask me, most programmers are not good at.)
In large projects, it might be confusing to have the parsing mode change for just one file, so it's probably better as an all files or no files, so you don't need to refer to the file header. If you're in Python 2, the default is probably off unless you're also targetting Python 3. If you're targetting Python 2.5 or older¹, then it's not an option.
Most editors these days are Unicode-aware. That said, I have seen editors corrupt non-ASCII characters in files, but exceedingly rarely; if the author of such a commit doesn't review his code adequately, code review should catch this. (The diff will be painfully obvious.) It is not worth supporting these people: Unicode is here to stay; track them down and fix their set up. Of note, vim handles Unicode just fine.
¹You should upgrade.
I wish to store URLs in a database (MySQL in this case) and process it in Python. Though the database and programming language are probably not this relevant to my question.
In my setup I receive unicode strings when querying a text field in the database. But is a URL actually text? Is encoding from and decoding to unicode an operation that should be done to a URL? Or is it better to make the column in the database a binary blob?
So, how do you handle this problem?
Clarification:
This question is not about urlencoding non-ASCII characters with the percent notation. It's about the distiction that unicode represents text and byte strings represent a way to encode this text into a sequence of bytes. In Python (prior to 3.0) this distinction is between the unicode and the str types. In MySQL it is TEXT to BLOBS. So the concepts seem to correspond between programming language and database. But what is the best way to handle URLs in this scheme?
The relevant answer is found in RFC 2396, section
2.1 URI and non-ASCII characters
The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a "character"
(as a distinguishable semantic entity) and an "octet" (an 8-bit
byte). There are two mappings, one from URI characters to octets, and
a second from octets to original characters:
URI character sequence->octet sequence->original character sequence
A URI is represented as a sequence of characters, not as a sequence
of octets. That is because URI might be "transported" by means that
are not through a computer network, e.g., printed on paper, read over
the radio, etc.
Do note there is also a standard for Unicode Web addresses, IRI (Internationalized Resource Identifiers). RFC 3987
On the question: "But is a URL actually text?"
It depends on the context, in some languages or libraries (for example java, I'm not sure about python), a URL may be represented internally as an object. However, a URL always has a well defined text representation. So storing the text-representation is much more portable than storing the internal representation used by whatever is the current language of choice.
URL syntax and semantics are covered by quite a few standards, recommendations and implementations, but I think the most authoritative source for parsing and constructing correct URL-s would be RFC 2396.
On the question about unicode, section 2.1 deals with non-ascii characters.
(Edit: changed rfc-reference to the newest edition, thank you S.Lott)
I stumbled over this passage in the Django tutorial:
Django models have a default str() method that calls unicode() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.
Now, I'm confused because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states
Unicode is a two-byte encoding which covers all of the world's common writing systems.
which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?
what is a "Unicode string" in Python? Does that mean UCS-2?
Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.
You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
plain wrong, or is it?
Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).
There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.
Meanwhile, I did a refined research to verify what the internal representation in Python is, and also what its limits are. "The Truth About Unicode In Python" is a very good article which cites directly from the Python developers. Apparently, internal representation is either UCS-2 or UCS-4 depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.
Python stores Unicode as UTF-16. str() will return the UTF-8 representation of the UTF-16 string.
From Wikipedia on UTF-8:
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.
So, it's anywhere between one and four bytes depending on which character you wish to represent within the realm of Unicode.
From Wikipedia on Unicode:
In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.
So it's able to represent most (but not all) of the world's writing systems.
I hope this helps :)
so what is a "Unicode string" in
Python?
Python 'knows' that your string is Unicode. Hence if you do regex on it, it will know which is character and which is not etc, which is really helpful. If you did a strlen it will also give the correct result. As an example if you did string count on Hello, you will get 5 (even if it's Unicode). But if you did a string count of a foreign word and that string was not a Unicode string than you will have much larger result. Pythong uses the information form the Unicode Character Database to identify each character in the Unicode String. Hope that helps.