I know there is quite a lot on the web and on stackoverflow about Python and character encoding, but I haven't really found the answer I'm looking for. So at the risk of creating a duplicate, I'm going to ask anyway.
It's a script that gets a dictionary in which all keys are proper unicode. The values are strings with unknown encoding. For the keys it wouldn't matter much; the keys are all very simple, unlike the values. The values can (and do) come in a large variety of encodings. There are dictionaries where some values are in ASCII, others in UTF-16BE, and yet others in cp1250.
That totally messes up further processing, which currently consists mainly of printing or concatenating (yes, that simple).
The workaround I came up with, which makes Python print statements work properly, is:
import chardet

for key in data.keys():
    # hope they did not choose a funky encoding
    try:
        print key + ":" + data[key]  # this triggers a UnicodeDecodeError on many encodings
        current_data = data[key]
    except UnicodeDecodeError:
        # trying to cope with a funky encoding;
        # doing this on each value, because the dictionary sometimes contains multiple encodings
        current_data = data[key].decode(chardet.detect(data[key])['encoding'])
        print key + ":",  # printing without newline was a workaround, because concatenating didn't work
        print current_data.encode('UTF-8')
In Python this works just fine. In Jython 2.7rc1 which I use in the project (not an option to switch), it prints characters which are definitely not the original encoding (funky looking characters). If anyone has an idea how I can make this also work in Jython that'd be great!
Edit (Example):
Sample-Value:
Our latest scenarios explore two possible versions of the future seen through fresh “lenses”.
This creates a string in which the right and left double quotes turn into \x8D and \x8E. I don't know what encoding that is. In Python, the above code strips them. In Jython, it turns them into white squares.
I'm not familiar with Jython, but the following link I found may prove useful: http://python.6.x6.nabble.com/character-encoding-issues-td1766833.html
It says that you should keep all unicode strings in files separate from your source code, and read them with codecs.open. This seemed to work for the person who was experiencing a problem similar to yours.
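As a rough illustration of that suggestion, the sketch below writes some text to a separate file and reads it back with codecs.open and an explicit encoding. The file name, the sample text, and the choice of UTF-8 are assumptions made for the example; whether this also cures the Jython output is not something the sketch can confirm.

```python
import codecs
import os
import tempfile

# Hypothetical file name and sample text, used only for illustration.
sample = u"fresh \u201clenses\u201d"
path = os.path.join(tempfile.mkdtemp(), "strings.txt")

# Write the text with an explicit encoding ...
with codecs.open(path, "w", encoding="utf-8") as handle:
    handle.write(sample)

# ... and read it back the same way, so the result is a unicode string
# and no implicit, platform-dependent decoding takes place.
with codecs.open(path, "r", encoding="utf-8") as handle:
    loaded = handle.read()

print(repr(loaded))
```

The point of the explicit encoding argument is that the bytes on disk are decoded exactly once, at a place you control, instead of wherever the runtime happens to guess.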
The following link also mentions something about specifying an encoding parameter to the JVM: https://answers.launchpad.net/sikuli/+question/156443
Without seeing any actual error output, this is the extent of the help I can provide.
This is a rather theoretical question, pertaining to the fundamental general syntax of Python. I am looking for an example of a sequence of characters (*1) that would always cause a syntax error when present inside a Python program, regardless of the context (*2). For instance, the sequence a[0) is not a correct example, because the program
s = 'a[0)'
is perfectly valid. What I want is a sequence of characters that, wherever it occurs in the source code, causes a syntax error! (Oh, and of course, all the characters in this sequence have to be characters individually allowed to appear in a Python program).
(edit: the following blockquoted example is wrong, since newlines may appear in triple-quoted strings. Thanks to ekhumoro for this relevant remark!)
I suspect that the sequence “newline-quote-newline” is forbidden,
because the newline character may not appear in a quoted string: so,
if the first newline character does not cause a syntax error, this
means that the quote character starts a quoted string, and then the
second newline character will cause a syntax error.
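Both observations above (a[0) is harmless inside a string, and newline-quote-newline is harmless inside a triple-quoted string) can be checked mechanically with the built-in compile(). This is only a sketch: the helper name is_valid_python is made up for the example, and it tests specific contexts rather than proving anything about all contexts.

```python
def is_valid_python(source):
    """Return True if `source` compiles as a Python module."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


bare = "a[0)"                 # the sequence on its own
quoted = "s = 'a[0)'"         # the same sequence inside a string literal
commented = "# a[0)"          # ... or inside a comment
triple = "s = '''\n'\n'''"    # newline-quote-newline inside triple quotes

print(is_valid_python(bare))       # False
print(is_valid_python(quoted))     # True
print(is_valid_python(commented))  # True
print(is_valid_python(triple))     # True
```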
It seems to me that a fundamentally buggy sequence could be
(edited some mistakes here: thanks to ekhumoro for noticing!)
↵'[)"[)'''[)"""[)'[)"[)'''[)"""[)
(where ↵ denotes a newline character), because one of the [)'s shall necessarily occur outside a quoted string, and the sequence cannot occur in a comment because of the initial ↵.
However, I do not know enough about the sharp details of Python syntax to be sure that the above examples are correct: maybe there exists some bizarre context, more subtle than mere quoted strings, where the above sequences of characters would be allowed? Maybe the full details of Python syntax even make it actually impossible to build any buggy sequence such as what I am looking for?…
(edit added for more clarity)
So, actually my question is about whether the specifications allow you to define a new kind of quoted context at some point: is there something in the Python specifications that says that the only possible quoted contexts are '…', "…", '''…''', """…""" and #… (plus possibly a few more which I am not currently aware of), or may you devise new quoted contexts as you wish? Or maybe you could make your program start with a kind of codec, after which you would write the rest of the program in an arbitrary language completely different from Python…?
(*1) In a first version of this question, I wrote “bytes” instead of “characters”, because I did not want to be bothered with bizarre Unicode characters; but that made it possible to turn the question into one about encoding issues… So, let us assume that we are working with a fixed encoding, whose set of admissible characters is fixed and well-known (say, ASCII for simplicity).
(*2) FYI, the motivation of my question is to stress the difference between the language of a universal Turing machine (with self-delimited programs) and a general-purpose programming language, in the context of Kolmogorov complexity.
PS.: Answers to the same question for other (interpreted) real-life languages also welcomed :-)
I'm trying to understand why, when we used pandas to_csv(), the number 3189069486778499 was output as "0.\x103189069486778499". This is the only case where it happened within a huge amount of data.
When calling to_csv(), we already pass encoding='utf8', which normally solves some unicode problems...
So I'm trying to understand what "\x10" is, so that I can figure out why...
The whole process was running in a luigi pipeline, and sometimes luigi generates weird output. I tried the same thing in IPython with the same version of pandas, and everything works fine....
Because it's the likely answer, even if the details aren't provided in your question:
It's highly likely something in your pipeline is intentionally producing fields with length prefixed text, rather than the raw unstructured text. \x103189069486778499 is a binary byte with the value 16 (0x10), followed by precisely 16 characters. The 0. before it may be from a previous output, or some other part of whatever custom data serialization format it's using.
This design is usually intended to make parsing more efficient; if you use a delimiter character between fields (e.g. a comma, like CSV), you're stuck coming up with ways to escape or quote the delimiter when it occurs in your actual data, and parsers have to scan character by character, statefully, to figure out where a field begins and ends. With length prefixed text, the parser finds a field length and knows exactly how many characters to read to slurp the field, or how many to skip to find the next field, no quoting or escaping required, no matter what the field contains.
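To make the idea concrete, here is a minimal sketch of a length-prefixed reader. It assumes a one-byte length in front of each field, which matches the \x10 example but is only a guess about whatever format the pipeline really uses; the function name read_length_prefixed and the second sample field are invented for illustration. It is written for Python 3, where indexing a bytes object yields an integer.

```python
def read_length_prefixed(buffer):
    """Split a byte string into fields, each preceded by a one-byte length."""
    fields = []
    position = 0
    while position < len(buffer):
        length = buffer[position]      # the prefix byte, e.g. 0x10 == 16
        start = position + 1
        end = start + length
        if end > len(buffer):
            raise ValueError("truncated field at offset %d" % position)
        fields.append(buffer[start:end])
        position = end                 # jump straight to the next prefix
    return fields


# The value from the question, followed by a made-up field containing a comma.
raw = b"\x103189069486778499" + b"\x03a,b"
print(read_length_prefixed(raw))   # [b'3189069486778499', b'a,b']
```

Note that the comma in the second field needs no quoting or escaping: the reader never looks inside a field, it only counts bytes.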
As for what's doing this: You're going to have to check the commands in your pipeline. Your question provides no meaningful way to determine the cause of this problem.
Running python2.7 here. I am writing a quick and dirty little script to do some web scraping, and I just want the unicode handler to just ignore all unicode errors.
That is, I am totally fine if it just drops whatever characters it can't convert to ascii anywhere in the program. This is just a throwaway script I just want to get done :-)
Is there some global "ignore" variable I can set?
Thanks!
/YGA
I am totally fine if it just drops whatever characters it can't convert to ascii anywhere in the program
Then you want to explicitly create your Unicode objects from the ascii codec, and specify to ignore errors:
input = unicode(input_bytes, encoding='ascii', errors='ignore')
See the Unicode HOWTO for more on properly handling Unicode.
(And for writing new code, always choose Python 3 or later unless you have an excellent well-formed reason to stay behind.)
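For reference, a small sketch of what the 'ignore' error handler does, written so that it also runs on Python 3 (where unicode() no longer exists and bytes.decode is the equivalent call). The sample bytes are made up for the example.

```python
# UTF-8 bytes for "café €5" (sample data, not from the question).
input_bytes = b"caf\xc3\xa9 \xe2\x82\xac5"

# Decoding as ASCII with errors='ignore' silently drops every non-ASCII byte.
text = input_bytes.decode("ascii", errors="ignore")
print(text)        # caf 5

# The same handler works in the other direction, when encoding text to ASCII.
ascii_bytes = u"caf\u00e9".encode("ascii", errors="ignore")
print(ascii_bytes)
```

This has to be requested at each decode or encode call; it does not act as a program-wide switch.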
When using unicode strings in source code, there seem to be many ways to skin a cat. The docs and the relevant PEPs have plenty of information about what's possible, but are scant about what is preferred.
For example, the following each seem to give the same result:
# coding: utf8
u1 = '\xe2\x82\xac'.decode('utf8')
u2 = u'\u20ac'
u3 = unichr(0x20ac)
u4 = "€".decode('utf8')
u5 = u"€"
If using the __future__ imports, I've found one more option:
# coding: utf8
from __future__ import unicode_literals
u6 = "€"
In python I am used to there being one obvious way to do it, so what is the recommended method of including international content in source files?
This is a python 2 question.
some background...
Methods u1, u2, u3 just seem silly to me, but I have seen enough people writing like this that I assume it is not just personal preference - is there any particular reason why we might want to force only ascii characters in source files, rather than specifying the encoding, or is this just a habit more likely to be found in older code lying around?
There's huge readability improvement in the code to use the actual symbols rather than some escape sequences, and to not do so would seem to be ignoring the strengths of the language rather than taking advantage of hard work by the python devs.
I think the most common way I've used (in Python 2) is:
# coding: utf-8
text = u'résumé'
The text is readable. Compare to text = u'r\u00e9sum\u00e9', where I must look up what character that is. Everything else is less readable.
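A quick check that the readable literal and the escaped one really are the same text; the variable names are only illustrative. The u prefix is accepted by Python 2 and by Python 3.3 or later, so the sketch runs on both, provided the file is saved as UTF-8.

```python
# coding: utf-8
readable = u'résumé'
escaped = u'r\u00e9sum\u00e9'

# Same code points either way; only the source text differs.
print(readable == escaped)   # True

# The euro sign from the question, written three ways.
euro_literal = u'€'
euro_escape = u'\u20ac'
euro_decoded = b'\xe2\x82\xac'.decode('utf-8')
print(euro_literal == euro_escape == euro_decoded)   # True
```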
If you're using Unicode, your variable is most certainly text and not binary data, so there's no point in keeping it in anything other than a unicode object. (Just in case '€' became an option.)
from __future__ import unicode_literals changes the parsing mode of the program; I think you'd need to be more aware of the difference between text & binary data. (Something that, if you ask me, most programmers are not good at.)
In large projects, it might be confusing to have the parsing mode change for just one file, so it's probably better applied to all files or to no files, so that you don't need to refer to the file header. If you're in Python 2, the default is probably off unless you're also targeting Python 3. If you're targeting Python 2.5 or older¹, then it's not an option.
Most editors these days are Unicode-aware. That said, I have seen editors corrupt non-ASCII characters in files, but exceedingly rarely; if the author of such a commit doesn't review his code adequately, code review should catch this. (The diff will be painfully obvious.) It is not worth supporting these people: Unicode is here to stay; track them down and fix their setup. Of note, vim handles Unicode just fine.
¹You should upgrade.
Say you have some metadata for a custom file format that your Python app reads. Something like a CSV with variables that can change as the file is manipulated:
var1,data1
var2,data2
var3,data3
So if the user can manipulate this metadata, do you have to worry about someone crafting a malformed metadata file that allows arbitrary code execution? The only thing I can imagine is if you made the poor choice to make var1 a shell command that you execute with os.system(data1) somewhere in your own code. Also, if this were C you would have to worry about buffers being overflowed, but I don't think you have to worry about that with Python. If you're reading that data in as a string, is it possible to somehow escape the string, e.g. "\n os.system('rm -r /')"? This SQL-like example totally won't work, but is something similar possible?
If you are doing what you say there (plain text, just reading and parsing a simple format), you will be safe. As you indicate, Python is generally safe from the more mundane memory corruption errors that C developers can create if they are not careful. The SQL injection scenario you note is not a concern when simply reading in files in python.
However, if you are concerned about security, which it seems you are (interjection: good for you! A good programmer should be lazy and paranoid), here are some things to consider:
Validate all input. Make sure that each piece of data you read is of the expected size, type, range, etc. Error early, and don't propagate tainted variables elsewhere in your code.
Do you know the expected names of the vars, or at least their format? Make sure to validate that each one is the kind of thing you expect before you use it. If it should be just letters, confirm that with a regex or similar.
Do you know the expected range or format of the data? If you're expecting a number, make sure it's a number before you use it. If it's supposed to be a short string, verify the length; you get the idea.
What if you get characters or bytes you don't expect? What if someone throws unicode at you?
If any of these are paths, make sure you canonicalize and know that the path points to an acceptable location before you read or write.
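The checks listed above could look roughly like the sketch below. The naming rule, the length limit, and the function names parse_metadata and is_inside are all assumptions chosen for illustration; the real limits depend on what your file format is supposed to contain.

```python
import csv
import io
import os
import re

NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,31}")  # assumed naming rule
MAX_VALUE_LENGTH = 256                                     # assumed size limit


def parse_metadata(text):
    """Parse 'name,value' rows and reject anything unexpected early."""
    result = {}
    for row in csv.reader(io.StringIO(text, newline="")):
        if len(row) != 2:
            raise ValueError("expected exactly two fields: %r" % (row,))
        name, value = row
        if not NAME_PATTERN.fullmatch(name):
            raise ValueError("bad variable name: %r" % (name,))
        if len(value) > MAX_VALUE_LENGTH:
            raise ValueError("value too long for %r" % (name,))
        result[name] = value
    return result


def is_inside(base_dir, candidate):
    """Return True if `candidate` resolves to a location under `base_dir`."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, candidate))
    return os.path.commonpath([base, target]) == base


print(parse_metadata("var1,data1\nvar2,data2\n"))
```

Nothing here ever executes the file's content; the worst a malformed file can do is trigger a ValueError, which the caller can report and stop on.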
Some specific things not to do:
os.system(attackerControlledString)
eval(attackerControlledString)
__import__(attackerControlledString)
pickle/unpickle attacker controlled content (here's why)
Also, rather than rolling your own config file format, consider ConfigParser or something like JSON. A well understood format (and libraries) helps you get a leg up on proper validation.
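As a sketch of the JSON route: json.loads only builds plain data (dicts, lists, strings, numbers), so a hostile value stays an inert string. The function name load_metadata and the rule that every value must be a string are assumptions made for the example.

```python
import json


def load_metadata(text):
    """Load metadata from JSON and check its shape before using it."""
    data = json.loads(text)   # raises ValueError on malformed input
    if not isinstance(data, dict):
        raise ValueError("metadata must be a JSON object")
    for name, value in data.items():
        if not isinstance(value, str):
            raise ValueError("value for %r must be a string" % (name,))
    return data


# A value that looks like code is still just text after parsing.
metadata = load_metadata('{"var1": "data1", "var2": "os.system(\'rm -r /\')"}')
print(metadata["var2"])
```

You still validate the values afterwards, but the parsing step itself is handled by a well-tested library instead of hand-written string splitting.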
OWASP would be my normal go-to for providing a "further reading" link, but their Input Validation page needs help. In lieu, this looks like a reasonably pragmatic read: "Secure Programmer: Validating Input". A slightly dated but more python specific one is "Dealing with User Input in Python"
Depends entirely on the way the file is processed, but generally this should be safe. In Python, you have to put in some effort if you want to treat text as code and execute it.