I have to migrate data to OpenERP through XMLRPC by using TerminatOOOR.
I send a name with value "Rotule right Aurélia".
In Python the name will be encoded as 'Rotule right Aur\xc3\xa9lia '.
But in TerminatOOOR (the XML-RPC client) the data arrives as 'Rotule middle Aur\357\277\275lia' (\357\277\275 is the octal escape for the UTF-8 replacement character U+FFFD).
So on the server side the data is not decoded correctly and I get bad data.
TerminatOOOR is a Ruby plugin for Kettle (a Java product), and I guess it should encode data as UTF-8.
I just don't know why it happens like this.
Any help?
This issue comes from Kettle.
My program uses Kettle to read an Excel file, get the active sheet, and pass the data in that sheet to TerminatOOOR for further handling.
At the phase of reading data from the Excel file, Kettle cannot recognize the encoding, so it hands bad data to TerminatOOOR.
My workaround is to manually export the Excel file to CSV before giving the data to TerminatOOOR. By doing this, I lose Kettle's feature of mapping an Excel column name to a variable name.
First off, whenever you deal with text (and all text is bound to contain some non-US-ASCII character sooner or later), you'll be much happier doing that in Python 3.x instead of in the 2.x series. If Py3 is not an option, try to always use from __future__ import unicode_literals (available in Python 2.6 and 2.7).
Basically, when you send text or any other data over the wire, it will only ever travel as bytes (octets), so it has to be encoded at some point. Try to find out exactly where that encoding takes place in your tool chain; if necessary, use a debugging tool (or deploy print(repr(x)) statements) to look into the relevant variables. The other software you mention is written in Ruby (running inside Kettle, a Java tool), so the encoding could go wrong in either layer. You say that 'it should encode the data by utf-8', but on the other hand, when the receiving end sees the data of an incoming RPC request, that data should already be in UTF-8; it would then have to be decoded to obtain unicode again.
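For illustration, here is a quick Python 2 sketch of that repr() debugging (name is a stand-in for whatever variable your tool chain passes around):

# -*- coding: utf-8 -*-
name = u'Rotule right Aurélia'
print(repr(name))                  # u'Rotule right Aur\xe9lia'
print(repr(name.encode('utf-8')))  # 'Rotule right Aur\xc3\xa9lia'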
Hello there,
Even though I really tried... I'm stuck and somewhat desperate when it comes to Python, Windows, ANSI and character encoding. I need help, seriously... searching the web for the last few hours wasn't any help, it just drives me crazy.
I'm new to Python, so I have almost no clue what's going on. I'm about to learn the language, and my first program, which is almost done, should automatically generate music playlists from a given folder containing MP3s. That works just fine, besides one single problem...
...I can't write umlauts (äöü) to the playlist file.
After I found a solution for "wrong-encoded" data in sys.argv I was able to deal with that. When reading metadata from the MP3s, I'm using some sort of simple character substitution to get rid of all those international special chars, like French accents or that crazy Scandinavian "o" with a slash in it (I don't even know how to type it...). All fine.
But I'd like to write at least the mentioned umlauts to the playlist file; those characters are really common here in Germany. And unlike the metadata, where I don't care about some missing characters or misspelled words, this is relevant - because now I'm writing the paths to the files.
I've tried so many various encoding and decoding methods, I can't list them all here... heck, I'm not even able to tell which settings I tried half an hour ago. I found code online, here, and elsewhere, that seemed to work for some purposes. Not for mine.
I think the tricky part is this: it seems like the problem is the so-called ANSI format of the files I need to write. Correct - I actually need this ANSI stuff. About two hours ago I actually managed to write whatever I'd like to a UTF-8 file. Works like a charm... until I realized that my player (an old version of Winamp) somehow doesn't work with those UTF-8 playlist files. It couldn't resolve the paths, even though they look right in my editor.
If I change the file format back to ANSI, paths containing special chars get corrupted. I'm just guessing, but if Winamp reads these UTF-8 files as ANSI, that would cause the problem I'm experiencing right now.
So...
I DO have to write äöü in a path, or it will not work
It DOES have to be an ANSI-"encoded" file, or it will not work
Things like line.write(str.decode('utf-8')) break the function of the file
A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned metadata and allowed characters in it...)
Oh, and I'm using Python 2.7.3. Third-party module dependencies, you know...
Is there ANYONE who could guide me towards a way out of this encoding hell? Any help is welcome. If I need 500 lines of code for other functions or classes, I'll type them. If there's a module for handling such stuff, let me know! I'd buy it! Anything helpful will be tested.
Thank you for reading, thanks for any comment,
greets!
As mentioned in the comments, your question isn't very specific, so I'll try to give you some hints about character encodings, see if you can apply those to your specific case!
Unicode and Encoding
Here's a small primer about encoding. Basically, there are two ways to represent text in Python:
unicode. You can consider that unicode is the ultimate encoding, you should strive to use it everywhere. In Python 2.x source files, unicode strings look like u'some unicode'.
str. This is encoded text - to be able to read it, you need to know the encoding (or guess it). In Python 2.x, those strings look like 'some str'.
This changed in Python 3 (unicode is now str and str is now bytes).
How does that play out?
Usually, it's pretty straightforward to ensure that your code uses unicode for its execution, and str for I/O:
Everything you receive is encoded, so you do input_string.decode('encoding') to convert it to unicode.
Everything you need to output is unicode but needs to be encoded, so you do output_string.encode('encoding').
The most common encodings are cp1252 on Windows (on US or EU systems) and utf-8 on Linux.
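A minimal sketch of that pattern (Python 2.x; the file names and cp1252 are assumptions, substitute whatever your system actually uses):

with open('input.txt', 'rb') as f:
    text = f.read().decode('cp1252')      # bytes in -> unicode

result = text.upper()                     # work purely in unicode

with open('output.txt', 'wb') as f:
    f.write(result.encode('cp1252'))      # unicode -> bytes out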
Applying this to your case
I DO have to write äöü in a path, or it will not work
Windows natively uses unicode for file paths and names, so you should actually always use unicode for those.
It DOES have to be an ANSI-"encoded" file, or it will not work
When you write to the file, be sure to always run your output through output.encode('cp1252') (or whatever encoding ANSI would be on your system).
Things like line.write(str.decode('utf-8')) break the funktion of the file
By now you probably realized that:
If str is indeed a str instance, Python will decode it to unicode using the utf-8 encoding, but then the file's write() will try to encode it again (likely as ascii) to write it to the file
If str is actually a unicode instance, Python will first encode it (likely as ascii, which will probably crash) in order to then be able to decode it.
Bottom line: you need to know what str contains. If it's unicode, you should encode it. If it's already encoded, don't touch it (or decode it and then re-encode it if the encoding is not the one you want!).
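A quick Python 2 session illustrates the second case (the implicit ascii step is what crashes):

u'äöü'.decode('utf-8') -> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)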
A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned metadata and allowed characters in it...)
Not a surprise, this only tells Python what encoding should be used to read your source file so that non-ascii characters are properly recognized.
Oh, and i'm using Python 2.7.3. Third-Party modules dependencies, you know...
Python 3 is indeed a big update in terms of unicode and encoding, but that doesn't mean Python 2.x can't make it work!
Will that solve your issue?
You can't be sure; it's possible that the problem lies in the player you're using, not in your code.
Once you have output, make sure that it is readable using reference tools (such as Windows Explorer). If it is, but the player still can't open it, you should consider updating to a newer version.
On Windows there is a special encoding available called mbcs; it converts between the current default ANSI code page and Unicode.
For example on a Spanish Language PC:
u'ñ'.encode('mbcs') -> '\xf1'
'\xf1'.decode('mbcs') -> u'ñ'
On Windows, ANSI means the current default (possibly multi-byte) code page: for western European languages that is windows-1252 (close to ISO-8859-1), for eastern European languages windows-1250 (close to ISO-8859-2), and other encodings for other languages as appropriate.
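As a minimal sketch of how this applies to the playlist question above (Python 2.7 on Windows; the folder and file names are made up):

# -*- coding: utf-8 -*-
# Write umlaut-containing paths to an ANSI playlist via the mbcs codec.
import io

tracks = [u'C:\\Musik\\Grüße\\Lied.mp3', u'C:\\Musik\\Höhen\\Täler.mp3']
with io.open('playlist.m3u', 'w', encoding='mbcs') as playlist:
    for path in tracks:
        playlist.write(path + u'\n')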
More info available at:
https://docs.python.org/2.4/lib/standard-encodings.html
See also:
https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding
# -*- coding comments declare the character encoding of the source code (and therefore of byte-string literals like 'abc').
Assuming that by "playlist" you mean M3U files, then based on this specification you may be at the mercy of the MP3 player software you are using. The spec says only that the files contain text, with no mention of character encoding.
I have personally observed that various mp3 encoding software will use different encodings for mp3 metadata. Some use UTF-8, others ISO-8859-1. So you may have to allow encoding to be specified in configuration and leave it at that.
I have an application where users should be able to upload a wide variety of files, but I need to know for each file, if I can safely display its textual representation as plain text.
Using python-magic like
m = Magic(mime=True).from_buffer(cgi.FieldStorage.file.read())
gives me the correct MIME type.
But sometimes the MIME type for scripts is application/*, so simply checking m.startswith('text/') is not enough.
Another site suggested using
m = Magic().from_buffer(cgi.FieldStorage.file.read())
and checking for 'text' in m.
Would the second approach be reliable enough for a collection of arbitrary file uploads or could someone give me another idea?
Thanks a lot.
What is your goal? Do you want the real mime type? Is that important for security reasons? Or is it "nice to have"?
The problem is that the same file can have different mime types. When a script file has a proper #! header, python-magic can determine the script type and tell you. If the header is missing, text/plain might be the best you can get.
This means there is no general "will always work" magic solution (despite the name of the module). You will have to sit down and think about what information you can get, what it means, and how you want to treat it.
The secure solution would be to create a list of mime types that you accept and check them with:
allowed_mime_types = [ ... ]
if m in allowed_mime_types:
    accept_upload()  # hypothetical handler; anything else gets rejected
That means only perfect matches are accepted. It also means that your server will reject valid files which don't have the correct mime type for some reason (missing header, magic failed to recognize the file, you forgot to mention the mime type in your list).
Or to put it another way: Why do you check the mime type of the file if you don't really care?
[EDIT] When you say
I need to know for each file, if I can safely display its textual representation as plain text.
then this isn't as easy as it sounds. First of all, "text" files have no encoding stored in them, so you will need to know the encoding that the user used when they created the file. This isn't a trivial task. There are heuristics to do so but things get hairy when encodings like ISO 8859-1 and 8859-15 are used (the latter has the Euro symbol).
To fix this, you will need to force your users to either save the text files in a specific encoding (UTF-8 is currently the best choice) or you need to supply a form into which users will have to paste the text.
When using a form, the user can see whether the text is encoded correctly (they see it on the screen), they can fix any problems and you can make sure that the browser sends you the text encoded with UTF-8.
If you can't do that, your only choice is to check for any bytes below 0x20 in the input, with the exception of \r, \n and \t. That is a pretty good check for "is this a text document".
But when users use umlauts (say, when you write an application that is used worldwide), this approach will eventually fail unless you can enforce a specific encoding on the user's side (which you probably can't, since you don't trust the user).
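A sketch of that check (the function name is made up; bytearray makes it work the same on Python 2 and 3):

TEXT_CONTROL_BYTES = frozenset([0x09, 0x0A, 0x0D])  # \t, \n, \r

def looks_like_text(data):
    # Reject any byte below 0x20 except tab, newline and carriage return.
    return all(b >= 0x20 or b in TEXT_CONTROL_BYTES for b in bytearray(data))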
[EDIT2] Since you need this to check actual source code: If you want to make sure the source code is "safe", then parse it. Most languages allow to parse the code without actually executing it. That would give you some real information (because the parsers know what to look for) and you wouldn't need to make wild guesses :-)
After playing around a bit, I discovered that I can probably use the Magic(mime_encoding=True) results!
I ran a simple script on my Dropbox folder and grouped the results both by encoding and by extension to check for irregularities.
But it does seem pretty usable by looking for 'binary' in the detected encoding, along the lines of the sketch below.
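A rough sketch of that check with python-magic (the function name is made up):

import magic

def is_displayable_text(buf):
    encoding = magic.Magic(mime_encoding=True).from_buffer(buf)
    return 'binary' not in encoding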
I think I will hang on to that, but thank you all.
I am converting a Python 2 program to Python 3 and I'm not sure about the approach to take.
The program reads in either a single email from STDIN, or file(s) are specified containing emails. The program then parses the emails and does some processing on them.
So we need to work with the raw data of the email input, to store it on disk and do an MD5 hash on it. We also need to work with the text of the email input in order to run it through the Python email parser and extract fields, etc.
With Python 3 it is unclear to me how we should be reading in the data. I believe we need the raw binary data in order to do an md5 on it, and also to be able to write it to disk. I understand we also need it in text form to be able to parse it with the email library. Python 3 has made significant changes to the IO handling and text handling and I can't see the "correct" approach to read the email raw data and also use the same data in text form.
Can anyone offer general guidance on this?
The general guidance is convert everything to unicode ASAP and keep it that way until the last possible minute.
Remember that str is the old unicode and bytes is the old str.
See http://docs.python.org/dev/howto/unicode.html for a start.
With Python 3 it is unclear to me how we should be reading in the data.
Specify the encoding when you open the file and it will automatically give you unicode. If you're reading from stdin, you'll get unicode. You can read from sys.stdin.buffer to get binary data.
I believe we need the raw binary data in order to do an md5 on it
Yes, you do. Encode it when you need to hash it.
and also to be able to write it to disk.
You specify the encoding when you open the file you're writing it to, and the file object encodes it for you.
I understand we also need it in text form to be able to parse it with the email library.
Yep, but since it'll get decoded when you open the file, that's what you'll have.
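Putting those pieces together, a minimal Python 3 sketch (assuming the input is UTF-8 encoded; adjust for your mail source):

import hashlib
import sys
from email import message_from_string

raw = sys.stdin.buffer.read()    # raw bytes: hash them, write them to disk
digest = hashlib.md5(raw).hexdigest()
text = raw.decode('utf-8')       # text form for the email parser
msg = message_from_string(text)
print(digest, msg.get('Subject'))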
That said, this question is really too open ended for Stack Overflow. When you have a specific problem / question, come back and we'll help.
To preface I'm very new to python (about 7 days) but I'm an experienced software eng undergrad.
I would like to send data between machines running python scripts. The idea I had (in order to simplify things) was to concatenate the data (strings & ints) into a string and do the parsing client-side.
The UDP packets send beautifully with simple strings but when I try to send useful data python always complains about the data I send; specifically python won't let me concatenate tuples.
In order to parse the data on the client I need to separate the data with a dash character: '-'.
nodeList is of type dictionary where the key is a string and value is a double.
randKey = random.choice( nodeList.keys() )
data = str(randKey) +'-'+ str(nodeList[randKey])
mySocket.sendto ( data , address )
The code above produces the following error:
TypeError: coercing to Unicode: need string or buffer, tuple found
I don't understand why it thinks it is a tuple I am trying to concatenate...
So my question is how can I correct this to keep Python happy, or can someone suggest I better way of sending the data?
Thank you in advance.
I highly suggest using Google Protocol Buffers, as implemented in Python by the protobuf package, as it will handle the serialization on both ends of the line. It has Python bindings that allow you to easily use it with your existing Python program.
Using your example code you would create a .proto file like so:
message SomeCoolMessage {
required string key = 1;
required double value = 2;
}
Then after generating, you can use it like so:
# Assuming protoc compiled the .proto above into a module named
# some_cool_message_pb2 (the name depends on your .proto file name):
from some_cool_message_pb2 import SomeCoolMessage

randKey = random.choice(nodeList.keys())
data = SomeCoolMessage()
data.key = randKey
data.value = nodeList[randKey]
mySocket.sendto(data.SerializeToString(), address)
I'd probably use the json module to serialize the data.
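For example (reusing the names from the question; the sample data and address are made up):

import json
import random
import socket

nodeList = {'node-a': 1.5, 'node-b': 2.25}    # sample data
address = ('127.0.0.1', 9999)                 # sample address
mySocket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

randKey = random.choice(list(nodeList.keys()))
payload = json.dumps({'key': randKey, 'value': nodeList[randKey]})
mySocket.sendto(payload.encode('utf-8'), address)
# Receiving side: json.loads(packet.decode('utf-8'))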
You need to serialize the data. Pickle does this for you out of the box, and you can ask pickle for an ASCII representation of the data vs. binary data (see the docs); or you could use json (it also serializes the data for you). Both are in the standard library. But really there are a hundred thousand different libraries that handle ALL the work for you in getting data from one machine to another. I'd suggest using a library.
Depending on speed, etc., there are different trade-offs for the various libraries. In the standard library you get HTTP, and that's about it (well, and raw sockets). But there are others.
If super-fast speed is more important than other things, ZeroMQ or Google's protocol buffers might be valid options.
For me, I use rpyc usually, it lets me be totally lazy, and just call over to the other process across the network. It's fast enough usually.
Be aware that UDP has no guarantee that the data will ever show up on the other side, or that it will show up IN ORDER. For your application you may not care, I don't know, but I just thought I'd bring it up.
This is mainly just a "check my understanding" type of question. Here's my understanding of CLOBs and BLOBs as they work in Oracle:
CLOBs are for text like XML, JSON, etc. You should not assume anything about the encoding the database will store it in (at least from an application), as it will be converted to whatever encoding the database was configured to use.
BLOBs are for binary data. You can be reasonably assured that they will be stored how you send them and that you will get them back with exactly the same data as they were sent as.
So in other words, say I have some binary data (in this case a pickled python object). I need to be assured that when I send it, it will be stored exactly how I sent it and that when I get it back it will be exactly the same. A BLOB is what I want, correct?
Is it really feasible to use a CLOB for this? Or will character encoding cause enough problems that it's not worth it?
CLOB is encoding and collation sensitive, BLOB is not.
When you write into a CLOB using, say, CL8WIN1251, you write a 0xC0 (which is Cyrillic letter А).
When you read the data back using AL16UTF16, you get back 0x0410, which is the UTF-16 representation of this letter.
If you were reading from a BLOB, you would get the same 0xC0 back.
Your understanding is correct. Since you mention Python, think of the Python 3 distinction between strings and bytes: CLOBs and BLOBs are quite analogous, with the extra issue that the encoding of CLOBs is not under your app's control.
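For instance, a hedged sketch with the cx_Oracle driver (the connect string, table and column names are made up):

import pickle
import cx_Oracle

conn = cx_Oracle.connect('user/password@dsn')
cur = conn.cursor()
payload = pickle.dumps({'answer': 42}, protocol=2)
cur.setinputsizes(None, cx_Oracle.BLOB)  # bind the second parameter as a BLOB
cur.execute("INSERT INTO pickles (id, data) VALUES (:1, :2)", (1, payload))
conn.commit()                            # the bytes come back exactly as stored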