I have a simple Python program to index emails on an Exchange server, and I find that the list names it returns are not all in the same format. It seems that names containing special characters (notably blanks) are double-quoted, and others are not.
(\Marked \HasNoChildren) "/" "Mail/_DE Courses/_cs435-ADL"
(\Marked \HasNoChildren) "/" Mail/_etc
The last time I ran this program, several years ago, it did not have this issue. All the other examples I have seen show every name string quoted. Is this something non-standard, and is it well known? (I just made a regex to correct for this.)
If the name contains special characters, the server has to quote it. If the name is plain, the server may quote it or not; that's its choice. I can easily believe that the version you used three years ago made a different choice than the one deployed today.
In an IMAP request or response, an item can be represented either as a string (the quoted kind) or as an atom (no quotes). An atom (unquoted) is often sufficient; however, when a name contains a space, leaving it unquoted would cause it to be interpreted as two separate items (the part before the space and the part after it). Since client and server usually expect a predefined number of space-delimited items in a response, that would result in a parse error.
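For completeness, here is a minimal client-side sketch that accepts both forms; the regex and the sample lines are only illustrative, not part of imaplib:

import re

# match the flags, the delimiter, and then either a quoted or an unquoted mailbox name
mailbox_re = re.compile(r'\(([^)]*)\) "(.)" (?:"((?:[^"\\]|\\.)*)"|(\S+))$')

for line in [
    r'(\Marked \HasNoChildren) "/" "Mail/_DE Courses/_cs435-ADL"',
    r'(\Marked \HasNoChildren) "/" Mail/_etc',
]:
    flags, delim, quoted, bare = mailbox_re.match(line).groups()
    name = quoted.replace('\\"', '"') if quoted is not None else bare
    print(name)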
This is a rather theoretical question, pertaining to the fundamental general syntax of Python. I am looking for an example of a sequence of characters (*1) that would always cause a syntax error when present inside a Python program, regardless of the context (*2). For instance, the sequence a[0) is not a correct example, because the program
s = 'a[0)'
is perfectly valid. What I want is a sequence of characters that, wherever it occurs in the source code, causes a syntax error! (Oh, and of course, all the characters in this sequence have to be individually allowed to appear in a Python program.)
(edit: the following blockquoted example is wrong, since newlines may appear in triple-quoted strings. Thanks to ekhumoro for this relevant remark!)
I suspect that the sequence “newline-quote-newline” is forbidden, because the newline character may not appear in a quoted string: so, if the first newline character does not cause a syntax error, this means that the quote character starts a quoted string, and then the second newline character will cause a syntax error.
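A quick illustration of why that reasoning fails: a newline may legally appear inside a triple-quoted string, so “newline-quote-newline” can be part of a valid program:

s = """
'
"""
print(repr(s))  # "\n'\n"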
It seems to me that a fundamentally buggy sequence could be
(edited some mistakes here: thanks to ekhumoro for noticing!)
'[)"[)'''[)"""[)'[)"[)'''[)"""[)
(where ⏎ denotes a newline character), because one of the [)'s must necessarily occur outside a quoted string, and the sequence cannot occur in a comment because of the initial ⏎.
However, I do not know enough about the fine details of Python syntax to be sure that the above examples are correct: maybe there exists some bizarre context, more subtle than mere quoted strings, where the above sequences of characters would be allowed? Maybe the full details of Python syntax even make it actually impossible to build any buggy sequence such as the one I am looking for?…
(edit added for more clarity)
So, actually my question is about whether the specifications allow you to define a new kind of quoted context at some point: is there something in the Python specifications that says that the only possible quoted contexts are '…', "…", '''…''', """…""" and #… (plus possibly a few more which I am not currently aware of), or can you devise new quoted contexts as you wish? Or maybe you could make your program start with a kind of codec, after which you would write the rest of the program in an arbitrary language completely different from Python…?
(*1) In a first version of this question, I wrote “bytes” instead of “characters”, because I did not want to be bothered with bizarre Unicode characters; but that made it possible to turn the question into one about encoding issues… So, let us assume that we are working with a fixed encoding, whose set of admissible characters is fixed and well known (say, ASCII for simplicity).
(*2) FYI, the motivation of my question is to stress the difference between the language of a universal Turing machine (with self-delimited programs) and a general-purpose programming language, in the context of Kolmogorov complexity.
PS: Answers to the same question for other (interpreted) real-life languages are also welcome :-)
I'm trying to understand why, when using pandas to_csv(), the number 3189069486778499 was output as "0.\x103189069486778499". This is the only case where this happened within a huge amount of data.
When using to_csv() we already pass encoding='utf8', which normally solves some Unicode problems...
So I'm trying to understand what "\x10" is, so that I can figure out why this happened...
The whole process runs in a luigi pipeline, and sometimes luigi generates weird output. I tried the same thing in IPython with the same version of pandas and everything works fine....
Because it's the likely answer, even if the details aren't provided in your question:
It's highly likely that something in your pipeline is intentionally producing fields with length-prefixed text rather than raw unstructured text. \x103189069486778499 is a binary byte with the value 16 (0x10), followed by precisely 16 characters. The 0. before it may be from a previous output, or some other part of whatever custom data serialization format it's using.
This design is usually intended to make parsing more efficient; if you use a delimiter character between fields (e.g. a comma, like CSV), you're stuck coming up with ways to escape or quote the delimiter when it occurs in your actual data, and parsers have to scan character by character, statefully, to figure out where a field begins and ends. With length-prefixed text, the parser reads a field length and knows exactly how many characters to read to slurp the field, or how many to skip to find the next field, with no quoting or escaping required, no matter what the field contains.
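For illustration, here is a hypothetical sketch (Python 3) of how such a length-prefixed field would be read; the function and buffer are made up for the example, they are not pandas or luigi code:

def read_length_prefixed(buf, pos=0):
    # the first byte is the field length; the next `length` bytes are the field itself
    length = buf[pos]              # b'\x10' -> 16
    start = pos + 1
    return buf[start:start + length], start + length

data = b"\x103189069486778499"
field, next_pos = read_length_prefixed(data)
print(field)     # b'3189069486778499' -- exactly 16 bytes
print(next_pos)  # 17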
As for what's doing this: You're going to have to check the commands in your pipeline. Your question provides no meaningful way to determine the cause of this problem.
I know there is quite a lot on the web and on stackoverflow about Python and character encoding, but I haven't really found the answer I'm looking for. So at the risk of creating a duplicate, I'm going to ask anyway.
It's a script that gets a dictionary where all the keys are proper unicode. The values are strings with unknown encoding. For the keys it wouldn't matter that much; the keys are all very simple, very unlike the values. The values can (and do) come in a large variety of encodings. There are some dictionaries where some values are ASCII, others UTF-16BE, yet others cp1250.
That totally messes up further processing, which currently consists mainly of printing or concatenating (yes, that simple).
The work-around that I came up with, which makes Python print statements work properly is:
import chardet

for key in data.keys():
    # hope they did not choose a funky encoding
    try:
        print key + ":" + data[key]  # this triggers a UnicodeDecodeError on many encodings
        current_data = data[key]
    except UnicodeDecodeError:
        # trying to cope with a funky encoding; doing this on each value, because
        # the dictionary sometimes contains multiple encodings
        current_data = data[key].decode(chardet.detect(data[key])['encoding'])
        print key + ":",  # printing without a newline was a workaround, because concatenating didn't work
        print current_data.encode('UTF-8')
In Python this works just fine. In Jython 2.7rc1, which I use in the project (switching is not an option), it prints characters which are definitely not in the original encoding (funky-looking characters). If anyone has an idea how I can make this also work in Jython, that'd be great!
Edit (Example):
Sample-Value:
Our latest scenarios explore two possible versions of the future seen through fresh “lenses”.
This creates a string where the right and left double quotes turn into \x8D and \x8E. I don't know what encoding that is. In Python, after using the above code, it strips them. In Jython it turns them into white squares.
I'm not familiar with Jython, but the following link I found may prove useful: http://python.6.x6.nabble.com/character-encoding-issues-td1766833.html
It says that you should keep all unicode strings in files separate from your source, and read them with codecs.open. This seemed to work for the person who was experiencing a problem similar to yours.
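A minimal sketch of that approach (the filename and encoding below are assumptions, pick whatever matches your data):

import codecs

# keep the problem text in its own file and let codecs decode it on read
with codecs.open('strings.txt', 'r', encoding='utf-8') as f:
    text = f.read()           # text is a unicode object from here on
print text.encode('utf-8')    # encode only at the output boundary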
The following link also mentions something about specifying an encoding parameter to the JVM: https://answers.launchpad.net/sikuli/+question/156443
Without seeing any actual error output, this is the extent of the help I can provide.
We have a Perl XMLRPC server in our web service that other teams at our institute query via Python scripts using XMLRPC.
Here's the thing: one of the parameters that we return is an ID of, say, an employee, and due to the way employee IDs are set up at our institute, the IDs are in a format like this:
000126721 (basically every ID for every employee is the same length, and starts off with zero or more 0's to enforce this).
On the Perl server side, we force this to be a string: Perl returns a hashref to xmlrpc in a format like this to ensure that the employee ID is forced to be a string:
{employee_id => "$employee_id"}
So in Perl, the above example is returned like this if we query our Perl server with a Perl client:
{employee_id => "000126721"}
But apparently the other teams are telling us that when the Python xmlrpc client grabs the IDs, it drops the leading 0s and turns the employee ID into an integer, so that it receives the above field like this:
{employee_id : 126721}
??? Why on earth would xmlrpc in python force something that is explicitly specified as a string into an integer?
If I stick in, say, a comma in the ID, Python xmlrpc does not do this; but if any string that is returned consists purely of digits, the Python xmlrpc client seems to convert it to an integer (and therefore the leading zeros are lost when they MUST be kept), whereas Perl does not seem to do this on either the client or the server side.
What can we do to solve this issue? We tested this with a different REST web service using JSON and YAML, and none of the other formats seem to have this issue, just XMLRPC. Unfortunately, until our next release we will have to find a way to fix the XMLRPC, so... :(
EDIT: Here's another issue that I forgot to mention -- the employee ID is not the only field that does this. There are other fields that do the same thing but enforce a different number of digits. So ideally, rather than modifying the data post-XMLRPC in Python (that really wouldn't be hard to do, but it risks screwing up the data, and the data is sensitive data that cannot and should not be touched), we really need to find a way to force xmlrpc not to convert.
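One quick way to see where the coercion happens is to feed the standard-library client a canned response (Python 3 shown; the module is called xmlrpclib in Python 2). It preserves an explicit <string> and only produces an int when the payload actually says <int>, which would point at the serialization on the Perl side rather than at the Python client:

import xmlrpc.client

explicit_string = """<?xml version="1.0"?>
<methodResponse><params><param>
  <value><struct>
    <member><name>employee_id</name>
      <value><string>000126721</string></value></member>
  </struct></value>
</param></params></methodResponse>"""

as_int = explicit_string.replace("<string>", "<int>").replace("</string>", "</int>")

print(xmlrpc.client.loads(explicit_string)[0])  # ({'employee_id': '000126721'},)
print(xmlrpc.client.loads(as_int)[0])           # ({'employee_id': 126721},)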
I want to handle geographic names, e.g. /new_york or /new-york, etc.
Since new-york is what Django's slugify produces for New York, maybe I should use the slugified names, even if names with underscores look better, because I may want to automate the URL creation via an algorithm such as Django's slugify. A guess is that ([A-Za-z]+) or simply ([\w-]+) could work, but to be safe I'm asking which regex is the best choice in this case.
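For reference, a quick check of what slugify actually produces (assuming a recent Django is available in the project); it lowercases and joins with hyphens, so a pattern that accepts "-" fits the generated names:

from django.utils.text import slugify

print(slugify("New York"))    # new-york
print(slugify("Sao Paulo"))   # sao-paulo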
I've already got a regex that connects numbers to a handler class:
('/([0-9]*)', ById)  # fetches and displays an entity by id
Now I want another regex to match names, e.g. new_york, so that a request for /new_york gets handled by the appropriate handler. Basically it would be the negation of the regex above, or any combination of letters plus underscores and maybe a dash, since the names are geographical. It seems I could use the following regex, but I believe it works only because of precedence, in that it just takes everything:
('/(.*)', ByName)  # handles for instance /new_york, /sao_paulo, etc. via a custom mapping for my relevant places
Since I have other request handlers and I don't want conflicting regexes, could you recommend how to formulate the regex?
What happens when a URL matches two regexes? Which has higher precedence? Can you also tell me more about how I should learn to write regexes, and about possible implementations for the geographical datastore (as entities or as instance variables), as well as special problems such as geographic locations having different names in different languages, e.g. Germany in German is called Deutschland, so I also want to apply translations, which I can do with gettext / Django .po files.
The first match wins.
Usually your URLs will differ in other parts of the path. For example, you might have
/cities/(?P<city>[^/]+)
/users/(?P<user>[^/]+)
In many cases [^/]+ is a good regex because it will match anything except /, which you would normally avoid because it is used to separate path elements.
I don't think it's a good idea to separate URLs based solely on characters (in your case, letters or digits), but if you want to do that, use [-A-Za-z_]+ (note that the "-" goes at the start of the [], or else it needs a backslash).
Avoid \w because that can also match digits, unless you want to go really crazy and send digit-only names to one handler and letters+digits elsewhere, in which case use:
/(?P<id>\d+)
/(?P<city>[-\w]+)
in that order.
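A small framework-neutral sketch of that first-match-wins ordering; the dispatch function and handler names here are only illustrative, not your framework's actual API:

import re

routes = [
    (r'^/(\d+)$', 'ById'),       # digits only -> numeric id handler
    (r'^/([-\w]+)$', 'ByName'),  # letters, digits, "-" and "_" -> name handler
]

def dispatch(path):
    # try the patterns in order; the first one that matches wins
    for pattern, handler in routes:
        m = re.match(pattern, path)
        if m:
            return handler, m.group(1)
    return None

print(dispatch('/123'))       # ('ById', '123')
print(dispatch('/new-york'))  # ('ByName', 'new-york')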