Running python2.7 here. I am writing a quick and dirty little script to do some web scraping, and I just want the unicode handler to just ignore all unicode errors.
That is, I am totally fine if it just drops whatever characters it can't convert to ascii anywhere in the program. This is just a throwaway script I just want to get done :-)
Is there some global "ignore" variable I can set?
Thanks!
/YGA
I am totally fine if it just drops whatever characters it can't convert to ascii anywhere in the program
Then you want to explicitly create your Unicode objects from the ascii codec, and specify to ignore errors:
input = unicode(input_bytes, encoding='ascii', errors='ignore')
See the Unicode HOWTO for more on properly handling Unicode.
(And for writing new code, always choose Python 3 or later unless you have an excellent well-formed reason to stay behind.)
Related
This is a rather theoretical question, pertaining to the fundamental general syntax of Python. I am looking for an example of sequence of characters (*1) that would always cause a syntax error when present inside a Python program, regardless of the context (*2). For instance, the sequence a[0) is not a correct example, because the program
s = 'a[0)'
is perfectly valid. What I want is a sequence of characters that, wherever it occurs in the source code, causes a syntax error! (Oh, and of course, all the characters in this sequence have to be characters individually allowed to appear in a Python program).
(edit: the following blockquoted example is wrong, since newlines may appear in triple-quoted strings. Thanks to ekhumoro for this relevant remark!)
I suspect that the sequence “newline-quote-newline” is forbidden,
because the newline character may not appear in a quoted string: so,
if the first newline character does not causes a syntax error, this
means that the quote character starts a quoted string, and then the
second newline character will cause a syntax error.
It seems to me that a fundamentally buggy sequence could be
(edited some mistakes here: thanks to ekhumoro for noticing!)
'[)"[)'''[)"""[)'[)"[)'''[)"""[)
(where  denotes a newline character), because one of the [)'s shall necessarily occur outside a quoted string, and the string cannot occur in a comment because of the initial .
However, I do not know enough about the sharp details of Python syntax to be sure that the above examples are correct: maybe there exists some bizarre context, more subtle than mere quoted strings, where the above sequences of characters would be allowed? Maybe the full details of Python syntax even make it actually impossible to build any buggy sequence such as what I am looking for?…
(edit added for more clarity)
So, actually my question is about whether the specifications allow you to define a new kind of quoted context at some point: is there something in the Python specifications that say that the only possible quoted contexts are '…', "…", '''…''', """…""" and #… (plus possibly a few more which I would not be currently aware of), or may you devise new quoted contexts as you wish? Or maybe you could make your program start with a kind of codec, after which you would write the sequel of the program in an arbitrary language completely different from Python…?
(*1) In a first version of this question, I wrote “bytes” instead of “characters”, because I did not want to be bothered with bizarre Unicode characters; but that made possible to turn the question into encoding issues… So, let us assume that we are working with a fixed encoding, whose set of admissible characters is fixed and well-known (say, ASCII for more simplicity).
(*2) FYI, the motivation of my question is to stress the difference between the language of a universal Turing machine (with self-delimited programs) and a general-purpose programming language, in the context of Kolmogorov complexity.
PS.: Answers to the same question for other (interpreted) real-life languages also welcomed :-)
I'm trying to understand why when we were using pandas to_csv(), a number 3189069486778499 has been output as "0.\x103189069486778499". And this is the only case happened within a huge amount of data.
When using to_csv(), we have already used encoding='utf8', normally that would solve some unicode problems...
So, I'm trying to understand what is "\x10", so that I may know why...
Since the whole process was running in luigi pipeline, sometimes luigi will generate weird output. I tried the same thing in IPython, same version of pandas and everything works fine....
Because it's the likely answer, even if the details aren't provide in your question:
It's highly likely something in your pipeline is intentionally producing fields with length prefixed text, rather than the raw unstructured text. \x103189069486778499 is a binary byte with the value 16 (0x10), followed by precisely 16 characters. The 0. before it may be from a previous output, or some other part of whatever custom data serialization format it's using.
This design is usually intended to make parsing more efficient; if you use a delimiter character between fields (e.g. a comma, like CSV), you're stuck coming up with ways to escape or quote the delimiter when it occurs in your actual data, and parsers have to scan character by character, statefully, to figure out where a field begins and ends. With length prefixed text, the parser finds a field length and knows exactly how many characters to read to slurp the field, or how many to skip to find the next field, no quoting or escaping required, no matter what the field contains.
As for what's doing this: You're going to have to check the commands in your pipeline. Your question provides no meaningful way to determine the cause of this problem.
I know there is quite a lot on the web and on stackoverflow about Python and character encoding, but I haven't really found the answer I'm looking for. So at the risk of creating a duplicate, I'm going to ask anyway.
It's a script that gets a dictionary, where all keys are properly as unicode. The values are strings with unknown encoding. For the keys it wouldn't matter that much, keys are all very simple very unlike the values. The values can (and do) contain a large variety of encodings. There are some dictionaries, where some values are in ASCII others as UTF-16BE yet others cp1250.
That totally messes up further processing, which currently consists mainly printing or concatenating (yes, that simple).
The work-around that I came up with, which makes Python print statements work properly is:
for key in data.keys():
# hope they did not chose a funky encoding
try:
print key+":"+data[key] # this triggers a UnicodeDecodeError on many encodings
current_data = data[key]
except UnicodeDecodeError:
# trying to cope with a funky encoding
current_data = data[key].decode(chardet.detect(data[key])['encoding']) # doing this on each value, because the dictionary sometimes contains multiple encodings
print key+":", # printing without newline was a workaround, because connecting didn't work
print current_data.encode('UTF-8')
In Python this works just fine. In Jython 2.7rc1 which I use in the project (not an option to switch), it prints characters which are definitely not the original encoding (funky looking characters). If anyone has an idea how I can make this also work in Jython that'd be great!
Edit (Example):
Sample-Value:
Our latest scenarios explore two possible versions of the future seen through fresh “lenses”.
Creates a string where the right and left double quotes turn to \x8D and \x8E. I don't know what encoding that is. In Python after using the above code it strips them. In Jython it turns them into white squares.
I'm not familiar with Jython, but the following link I found may prove useful: http://python.6.x6.nabble.com/character-encoding-issues-td1766833.html
It says that you should keep all unicode strings in separate files to your source, and read them with codecs.open. This seemed to work for the person who was experiencing a problem similar to yours.
The following link also mentions something about specifying an encoding parameter to the JVM: https://answers.launchpad.net/sikuli/+question/156443
Without seeing any actual error output, this is the extent of the help I can provide.
When using unicode strings in source code, there seems to be many ways to skin a cat. The docs and the relevant PEPs have plenty of information about what's possible, but are scant about what is preferred.
For example, the following each seem to give same result:
# coding: utf8
u1 = '\xe2\x82\xac'.decode('utf8')
u2 = u'\u20ac'
u3 = unichr(0x20ac)
u4 = "€".decode('utf8')
u5 = u"€"
If using the __future__ imports, I've found one more option:
# coding: utf8
from __future__ import unicode_literals
u6 = "€"
In python I am used to there being one obvious way to do it, so what is the recommended method of including international content in source files?
This is a python 2 question.
some background...
Methods u1, u2, u3 just seem silly to me, but I have seen enough people writing like this that I assume it is not just personal preference - is there any particular reason why we might want to force only ascii characters in source files, rather than specifying the encoding, or is this just a habit more likely to be found in older code lying around?
There's huge readability improvement in the code to use the actual symbols rather than some escape sequences, and to not do so would seem to be ignoring the strengths of the language rather than taking advantage of hard work by the python devs.
I think the most common way I've used (in Python 2) is:
# coding: utf-8
text = u'résumé'
The text is readable. Compare to text = u'r\u00e9sum\u00e9', where I must look up what character that is. Everything else is less readable.
If you're using Unicode, your variable is most certainly text and not binary data, so there's no point in keeping it in anything other than a unicode object. (Just in case '€' became an option.)
from __future__ import unicode_literals changes the parsing mode of the program; I think you'd need to be more aware of the difference between text & binary data. (Something that, if you ask me, most programmers are not good at.)
In large projects, it might be confusing to have the parsing mode change for just one file, so it's probably better as an all files or no files, so you don't need to refer to the file header. If you're in Python 2, the default is probably off unless you're also targetting Python 3. If you're targetting Python 2.5 or older¹, then it's not an option.
Most editors these days are Unicode-aware. That said, I have seen editors corrupt non-ASCII characters in files, but exceedingly rarely; if the author of such a commit doesn't review his code adequately, code review should catch this. (The diff will be painfully obvious.) It is not worth supporting these people: Unicode is here to stay; track them down and fix their set up. Of note, vim handles Unicode just fine.
¹You should upgrade.
I stumbled over this passage in the Django tutorial:
Django models have a default str() method that calls unicode() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.
Now, I'm confused because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states
Unicode is a two-byte encoding which covers all of the world's common writing systems.
which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?
what is a "Unicode string" in Python? Does that mean UCS-2?
Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.
You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
plain wrong, or is it?
Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).
There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.
Meanwhile, I did a refined research to verify what the internal representation in Python is, and also what its limits are. "The Truth About Unicode In Python" is a very good article which cites directly from the Python developers. Apparently, internal representation is either UCS-2 or UCS-4 depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.
Python stores Unicode as UTF-16. str() will return the UTF-8 representation of the UTF-16 string.
From Wikipedia on UTF-8:
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.
So, it's anywhere between one and four bytes depending on which character you wish to represent within the realm of Unicode.
From Wikipedia on Unicode:
In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.
So it's able to represent most (but not all) of the world's writing systems.
I hope this helps :)
so what is a "Unicode string" in
Python?
Python 'knows' that your string is Unicode. Hence if you do regex on it, it will know which is character and which is not etc, which is really helpful. If you did a strlen it will also give the correct result. As an example if you did string count on Hello, you will get 5 (even if it's Unicode). But if you did a string count of a foreign word and that string was not a Unicode string than you will have much larger result. Pythong uses the information form the Unicode Character Database to identify each character in the Unicode String. Hope that helps.