I wish to store URLs in a database (MySQL in this case) and process them in Python. Though the database and programming language are probably not that relevant to my question.
In my setup I receive unicode strings when querying a text field in the database. But is a URL actually text? Is encoding from and decoding to unicode an operation that should be done to a URL? Or is it better to make the column in the database a binary blob?
So, how do you handle this problem?
Clarification:
This question is not about urlencoding non-ASCII characters with the percent notation. It's about the distinction that unicode represents text while byte strings represent a way to encode this text into a sequence of bytes. In Python (prior to 3.0) this distinction is between the unicode and str types. In MySQL it is TEXT versus BLOB. So the concepts seem to correspond between programming language and database. But what is the best way to handle URLs in this scheme?
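To make the distinction concrete, here is a minimal sketch (Python 2, matching the question's setup; the column semantics in the comments are only illustrative):

# Python 2: a URL handled as text (unicode) inside the program, encoded
# to bytes only at the storage boundary. Column semantics illustrative.
url = u'http://example.com/search?q=caf\xe9'

as_bytes = url.encode('utf-8')        # what a BLOB column would hold
as_text = as_bytes.decode('utf-8')    # what a TEXT column hands back

assert as_text == url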
The relevant answer is found in RFC 2396, section
2.1 URI and non-ASCII characters
The relationship between URI and characters has been a source of
confusion for characters that are not part of US-ASCII. To describe
the relationship, it is useful to distinguish between a "character"
(as a distinguishable semantic entity) and an "octet" (an 8-bit
byte). There are two mappings, one from URI characters to octets, and
a second from octets to original characters:
URI character sequence->octet sequence->original character sequence
A URI is represented as a sequence of characters, not as a sequence
of octets. That is because URI might be "transported" by means that
are not through a computer network, e.g., printed on paper, read over
the radio, etc.
Do note there is also a standard for Unicode web addresses: IRIs (Internationalized Resource Identifiers), defined in RFC 3987.
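For illustration, a hedged Python 2 sketch of the mapping the RFC describes, percent-encoding the non-ASCII characters of an IRI into octets so that the result is a plain ASCII URI (this is a simplification of the real algorithm in RFC 3987, section 3.1):

# -*- coding: utf-8 -*-
import urllib

# An IRI containing a non-ASCII character (Python 2 unicode string).
iri = u'http://example.com/caf\xe9'

# IRI -> octets (UTF-8, as RFC 3987 prescribes) -> percent-encoded URI.
# quote() is told to leave the URL structure characters alone.
uri = urllib.quote(iri.encode('utf-8'), safe="/:?#[]@!$&'()*+,;=~-._")
print uri  # http://example.com/caf%C3%A9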
On the question: "But is a URL actually text?"
It depends on the context: in some languages or libraries (for example Java; I'm not sure about Python), a URL may be represented internally as an object. However, a URL always has a well-defined text representation, so storing the text representation is much more portable than storing the internal representation used by whatever is the current language of choice.
URL syntax and semantics are covered by quite a few standards, recommendations and implementations, but I think the most authoritative source for parsing and constructing correct URLs would be RFC 2396.
On the question about unicode, section 2.1 deals with non-ASCII characters.
(Edit: changed rfc-reference to the newest edition, thank you S.Lott)
This is a rather theoretical question, pertaining to the fundamental general syntax of Python. I am looking for an example of a sequence of characters (*1) that would always cause a syntax error when present inside a Python program, regardless of the context (*2). For instance, the sequence a[0) is not a correct example, because the program
s = 'a[0)'
is perfectly valid. What I want is a sequence of characters that, wherever it occurs in the source code, causes a syntax error! (Oh, and of course, all the characters in this sequence have to be characters individually allowed to appear in a Python program).
(edit: the following blockquoted example is wrong, since newlines may appear in triple-quoted strings. Thanks to ekhumoro for this relevant remark!)
I suspect that the sequence “newline-quote-newline” is forbidden,
because the newline character may not appear in a quoted string: so,
if the first newline character does not cause a syntax error, this
means that the quote character starts a quoted string, and then the
second newline character will cause a syntax error.
It seems to me that a fundamentally buggy sequence could be
(edited some mistakes here: thanks to ekhumoro for noticing!)
'[)"[)'''[)"""[)'[)"[)'''[)"""[)
(where  denotes a newline character), because one of the [)'s shall necessarily occur outside a quoted string, and the string cannot occur in a comment because of the initial .
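One way to probe a candidate empirically is to compile it inside a handful of embedding contexts; a hedged Python 2 sketch, where the contexts tried are only a sample, so surviving this test proves nothing conclusive:

# Try compiling the candidate sequence inside a few sample contexts; if
# any of them compiles, the sequence is not universally illegal.
candidate = "\n'[)\"[)'''[)\"\"\"[)\n'[)\"[)'''[)\"\"\"[)"

contexts = ["%s", "s = '%s'", 's = "%s"', "s = '''%s'''",
            's = """%s"""', "# %s"]

for ctx in contexts:
    src = ctx % candidate
    try:
        compile(src, '<test>', 'exec')
        print 'compiles inside context: %r' % ctx
    except SyntaxError:
        pass  # this context rejects the sequence, as hoped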
However, I do not know enough about the fine details of Python syntax to be sure that the above example is correct: maybe there exists some bizarre context, more subtle than mere quoted strings, where such a sequence of characters would be allowed? Maybe the full details of Python syntax even make it impossible to build any buggy sequence of the kind I am looking for?…
(edit added for more clarity)
So, actually my question is about whether the specifications allow you to define a new kind of quoted context at some point: is there something in the Python specifications that says that the only possible quoted contexts are '…', "…", '''…''', """…""" and #… (plus possibly a few more of which I am currently unaware), or may you devise new quoted contexts as you wish? Or maybe you could make your program start with a kind of codec, after which you would write the rest of the program in an arbitrary language completely different from Python…?
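As it happens, Python 2 really does let a program start with a kind of codec: a PEP 263 source-encoding declaration. The whole file is decoded through the named codec before parsing, so the following complete (if obscure) Python 2 source file should work; Python 3 dropped support for such non-ASCII-compatible source codecs:

# -*- coding: rot13 -*-
# Python 2 decodes the entire file through the declared codec before
# parsing, so the next line is really:  print "Hello world"
cevag "Uryyb jbeyq"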
(*1) In a first version of this question, I wrote "bytes" instead of "characters", because I did not want to be bothered with bizarre Unicode characters; but that made it possible to turn the question into one about encoding issues… So, let us assume that we are working with a fixed encoding, whose set of admissible characters is fixed and well-known (say, ASCII for simplicity).
(*2) FYI, the motivation of my question is to stress the difference between the language of a universal Turing machine (with self-delimited programs) and a general-purpose programming language, in the context of Kolmogorov complexity.
PS: Answers to the same question for other (interpreted) real-life languages are also welcome :-)
I'm trying to understand why, when we were using pandas to_csv(), the number 3189069486778499 was output as "0.\x103189069486778499". This is the only such case within a huge amount of data.
When using to_csv() we already passed encoding='utf8', which normally solves some unicode problems...
So, I'm trying to understand what "\x10" is, so that I may know why this happened...
Since the whole process was running in a luigi pipeline, sometimes luigi generates weird output. I tried the same thing in IPython with the same version of pandas, and everything worked fine...
Because it's the likely answer, even if the details aren't provided in your question:
It's highly likely something in your pipeline is intentionally producing fields with length prefixed text, rather than the raw unstructured text. \x103189069486778499 is a binary byte with the value 16 (0x10), followed by precisely 16 characters. The 0. before it may be from a previous output, or some other part of whatever custom data serialization format it's using.
This design is usually intended to make parsing more efficient; if you use a delimiter character between fields (e.g. a comma, like CSV), you're stuck coming up with ways to escape or quote the delimiter when it occurs in your actual data, and parsers have to scan character by character, statefully, to figure out where a field begins and ends. With length prefixed text, the parser finds a field length and knows exactly how many characters to read to slurp the field, or how many to skip to find the next field, no quoting or escaping required, no matter what the field contains.
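To illustrate the framing, here is a hedged Python 2 sketch; the actual format in your pipeline is unknown, so this particular layout (one length byte, then the payload) is an assumption:

# Hypothetical length-prefixed framing: each field is one length byte
# followed by exactly that many bytes of payload.
def read_fields(buf):
    fields, pos = [], 0
    while pos < len(buf):
        n = ord(buf[pos])            # the length prefix, e.g. \x10 == 16
        fields.append(buf[pos + 1:pos + 1 + n])
        pos += 1 + n
    return fields

print read_fields('\x103189069486778499')  # ['3189069486778499']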
As for what's doing this: You're going to have to check the commands in your pipeline. Your question provides no meaningful way to determine the cause of this problem.
According to the Django documentation:
locale name
A locale name, either a language specification of the form ll or a
combined language and country specification of the form ll_CC.
Examples: it, de_AT, es, pt_BR. The language part is always in
lower case and the country part in upper case. The separator is an
underscore.
language code
Represents the name of a language. Browsers send the names of the
languages they accept in the Accept-Language HTTP header using this
format. Examples: it, de-at, es, pt-br. Language codes are
generally represented in lower-case, but the HTTP Accept-Language
header is case-insensitive. The separator is a dash.
Questions:
When I see it or es in someone's code, how can I tell whether it's a locale name or a language code?
When should we use a locale name, and when should we use a language code?
Locale codes are understood by the setlocale(3) call and configure localization of several well-known formats, such as dates, times and currency as well as the language for error messages. The available locales are platform and system dependent.
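A quick Python sketch of what the locale controls; note that the locale name used here is an assumption and must actually be installed on the system:

import locale
import time

# Locale names are platform-dependent; this one assumes a generated
# pt_BR locale on a typical Linux system (locale.Error otherwise).
locale.setlocale(locale.LC_ALL, 'pt_BR.UTF-8')
print time.strftime('%A %d %B %Y')  # weekday and month names in Portuguese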
A language code is built upon the locale, but is used in network communication. In practice, you should be dealing with language codes at the request and response layer and locale codes within the application, but the distinction is not strict, as they solve the same problem: localization and internationalization.
Case in point: Django sets the language based on the Accept-Language header, which uses the language code format, and then sets the locale accordingly for the application, thus selecting the locale code corresponding to the language code.
Therefore it's safe to say that "language codes are a serialization format of locale codes":
nl_NL.ISO-8859-15 is serialized to Accept-Language: nl-NL + Accept-Charset: iso-8859-15. The important part is to use the correct form under the right circumstances, but the meaning of es is always Spanish, no matter the origin.
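Django itself converts between the two forms; a small sketch using the to_locale() and to_language() helpers from django.utils.translation:

from django.utils.translation import to_locale, to_language

print to_locale('pt-br')    # 'pt_BR'  (language code -> locale name)
print to_language('pt_BR')  # 'pt-br'  (locale name -> language code)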
I have a simple Python program to index emails on an Exchange server, and find that the list names it returns are not all in the same format. It seems that any names with special characters (notably blanks) are double-quoted, and others are not.
(\Marked \HasNoChildren) "/" "Mail/_DE Courses/_cs435-ADL"
(\Marked \HasNoChildren) "/" Mail/_etc
Last time I ran this program, several years ago, it did not have this issue. All other examples I have seen show every name string quoted. Is this something non-standard, and well known? (I just wrote a regex to correct for it.)
If the name contains special characters, the server has to quote. If the name is plain the server may quote or not, its choice. I can easily believe that the version you used three years ago made a different choice than the one deployed today.
In an IMAP request or response, it is possible to represent items as strings (the quoted kind) or atoms (no quotes). Using an atom format (unquoted string) is often sufficient. However, when a space is present, opting for an unquoted string would cause the string to be interpreted as two separate strings (the part before the space, and that after it). Since client and server usually know to expect a predefined number of space-delimited items in a response, doing this would result in a parse error.
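A hedged sketch of a parser that accepts both forms (the pattern is illustrative rather than a full IMAP grammar; real servers can also use literal syntax for exotic names):

import re

# Matches one LIST response line: flags, delimiter, then either a
# quoted mailbox name or a bare atom. Illustrative, not a full parser.
LIST_LINE = re.compile(
    r'\((?P<flags>[^)]*)\) '
    r'"(?P<delim>[^"]*)" '
    r'(?:"(?P<quoted>[^"]*)"|(?P<atom>\S+))')

for line in ['(\\Marked \\HasNoChildren) "/" "Mail/_DE Courses/_cs435-ADL"',
             '(\\Marked \\HasNoChildren) "/" Mail/_etc']:
    m = LIST_LINE.match(line)
    print m.group('quoted') or m.group('atom')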
I stumbled over this passage in the Django tutorial:
Django models have a default __str__() method that calls __unicode__() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.
Now, I'm confused because as far as I know Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states
Unicode is a two-byte encoding which covers all of the world's common writing systems.
which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?
what is a "Unicode string" in Python? Does that mean UCS-2?
Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.
You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
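A small Python 2 sketch showing how to tell which build you are on, and the surrogate effect mentioned above:

import sys

# 0xffff on a narrow (UCS-2/UTF-16) build, 0x10ffff on a wide (UCS-4) build.
print hex(sys.maxunicode)

u = u'\U0001D11E'  # MUSICAL SYMBOL G CLEF, outside the BMP
print len(u)       # 2 on a narrow build (a surrogate pair), 1 on a wide build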
plain wrong, or is it?
Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).
There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.
Meanwhile, I did some further research to verify what the internal representation in Python is, and also what its limits are. "The Truth About Unicode In Python" is a very good article which cites directly from the Python developers. Apparently, the internal representation is either UCS-2 or UCS-4 depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.
Python stores Unicode internally as UTF-16 (on narrow builds; wide builds use UCS-4). str() will return the UTF-8 representation of that internal Unicode string.
From Wikipedia on UTF-8:
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.
So, it's anywhere between one and four bytes depending on which character you wish to represent within the realm of Unicode.
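For instance, a quick Python 2 sketch of the one-to-four-byte range:

# One code point, one to four UTF-8 bytes.
for ch in [u'A', u'\xe9', u'\u20ac', u'\U0001d11e']:
    print repr(ch), len(ch.encode('utf-8'))
# u'A' 1, u'\xe9' 2, u'\u20ac' 3, u'\U0001d11e' 4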
From Wikipedia on Unicode:
In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.
So it's able to represent most (but not all) of the world's writing systems.
I hope this helps :)
so what is a "Unicode string" in Python?
Python 'knows' that your string is Unicode. Hence if you run a regex on it, it will know which items are characters and which are not, which is really helpful. If you take the string's length, it will also give the correct result. As an example, if you count the characters of Hello, you get 5 (even if it's Unicode). But if you count the characters of a word in a non-Latin script, and that string is not a Unicode string, you will get a much larger result. Python uses the information from the Unicode Character Database to identify each character in a Unicode string. Hope that helps.
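A minimal Python 2 sketch of the length difference described above:

# -*- coding: utf-8 -*-
s = 'héllo'            # a byte string: the é is two UTF-8 bytes
u = s.decode('utf-8')  # the same text as a unicode string

print len(s)  # 6 -- counts bytes
print len(u)  # 5 -- counts characters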