Help me understand the difference between CLOBs and BLOBs in Oracle - python

This is mainly just a "check my understanding" type of question. Here's my understanding of CLOBs and BLOBs as they work in Oracle:
CLOBs are for text like XML, JSON, etc. You should not assume what encoding the database will store it as (at least in an application) as it will be converted to whatever encoding the database was configured to use.
BLOBs are for binary data. You can be reasonably assured that they will be stored how you send them and that you will get them back with exactly the same data as they were sent as.
So in other words, say I have some binary data (in this case a pickled python object). I need to be assured that when I send it, it will be stored exactly how I sent it and that when I get it back it will be exactly the same. A BLOB is what I want, correct?
Is it really feasible to use a CLOB for this? Or will character encoding cause enough problems that it's not worth it?

CLOB is encoding and collation sensitive, BLOB is not.
When you write into a CLOB using, say, CL8MSWIN1251, you write a 0xC0 (which is Cyrillic letter А).
When you read data back using AL16UTF16, you get back 0x0410, which is the UTF-16 representation of this letter.
If you were reading from a BLOB, you would get the same 0xC0 back.
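The same effect can be reproduced in plain Python, without a database. This is only a sketch: the codec names cp1251 and utf-16-be are Python's, used here as stand-ins for the Oracle character sets named above.

```python
# The Cyrillic letter "А" (U+0410), used as the sample character.
letter = "\u0410"

as_cp1251 = letter.encode("cp1251")      # Python's name for Windows-1251
as_utf16 = letter.encode("utf-16-be")    # big-endian UTF-16, no byte-order mark

# Text semantics (CLOB-like): the character survives, the bytes change.
converted = as_cp1251.decode("cp1251").encode("utf-16-be")

# Binary semantics (BLOB-like): the bytes come back untouched.
blob_round_trip = bytes(as_cp1251)
```

Here as_cp1251 is the single byte 0xC0, while converted is the two bytes 0x04 0x10: same letter, different bytes.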

Your understanding is correct. Since you mention Python, think of the Python 3 distinction between strings and bytes: CLOBs and BLOBs are quite analogous, with the extra issue that the encoding of CLOBs is not under your app's control.
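To make the strings-versus-bytes analogy concrete for the pickle case, here is a small sketch that needs no database. The object and variable names are made up for illustration, and the Base64 step is just one possible workaround if only a text column were available, not something the question requires.

```python
import base64
import pickle

obj = {"name": "example", "values": [1, 2, 3]}
payload = pickle.dumps(obj, protocol=2)  # binary pickle, first byte is 0x80

# Treating the payload as text fails: 0x80 is not a valid UTF-8 start byte.
try:
    payload.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False

# Kept as bytes (BLOB semantics), the object round-trips exactly.
restored = pickle.loads(payload)

# If only a text column were available, Base64 keeps the payload pure ASCII.
as_text = base64.b64encode(payload).decode("ascii")
restored_from_text = pickle.loads(base64.b64decode(as_text))
```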

Related

Can't Read Encoded Text in Visual FoxPro DBF FIles

I recently acquired a ton of data stored in Visual FoxPro 9.0 databases. The text I need is in Cyrillic (Russian), but of the 1000 .dbf files (complete with .fpt and .cdx files), only 4 or 5 return readable text. The rest (usually in the form of memos) returns something like this:
??9Y?u?
yL??x??itZ?????zv?|7?g?̚?繠X6?~u?ꢴe}
?aL1? Ş6U?|wL(Wz???8???7?#R?
.FAc?TY?H???#f U???K???F&?w3A??hEڅԦX?MiOK?,?AZ&GtT??u??r:?q???%,NCGo0??H?5d??]?????O{??
z|??\??pq?ݑ?,??om???K*???lb?5?D?J+z!??
?G>j=???N ?H?jѺAs`c?HK\i
??9a*q??
For the life of me, I can't figure out how this is encoded. I have tried all kinds of online decoders, opened up the .dbfs in many database programs, and used Python to open and manipulate them. All of them return similar messiness to the above, but never readable Russian.
Note: I know that these databases are not corrupt, because they came accompanied by enterprise software that can open, query and read them successfully. However, that software will not export the data, so I am left working directly with the .dbfs.
Happy to share an example .dbf if it would help get to the bottom of this.
If it is a FoxPro database, I would expect the Russian text to be stored in some pre-Unicode encoding, as was common for most Eastern European languages in the past.
For example: Windows-1251 or ISO 8859-5.
'?' characters don't convey much. Try looking at the contents of the memo fields as hex, and see whether what you're seeing looks anything like text in any encodings. (Apologies if you've tried this using Python already). Of course if it is actually encrypted you may be out of luck unless you can find out the key and method.
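A rough sketch of that hex-and-guess approach in Python follows. The helper name inspect_bytes and the candidate list are assumptions for illustration, and the sample bytes are generated here rather than taken from a real memo field.

```python
# Candidate pre-Unicode Cyrillic code pages, using Python's codec names.
CANDIDATES = ["cp1251", "cp866", "iso8859_5", "koi8_r"]

def inspect_bytes(raw, encodings=CANDIDATES):
    """Return a hex view of raw plus its decoding under each candidate."""
    hex_view = " ".join(f"{b:02x}" for b in raw)
    decoded = {enc: raw.decode(enc, errors="replace") for enc in encodings}
    return hex_view, decoded

# Stand-in for bytes read from a memo field; deliberately encoded as CP866 here.
sample = "Привет".encode("cp866")
hex_view, guesses = inspect_bytes(sample)
```

Printing hex_view and each entry of guesses side by side shows which code page, if any, yields readable Russian. If none does and the hex looks uniformly random, encryption or compression becomes the more likely explanation.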
There are two possibilities:
the encoding has not been correctly stored in the dbf file
the dbf file has been encrypted
If it's been encrypted I can't help you. If it's a matter of finding the correct encoding, my dbf package may be of use. Feel free to send me a sample dbf file if you get stuck.

How to reliably tell the uploaded file type (text or binary)?

I have an application where users should be able to upload a wide variety of files, but I need to know for each file, if I can safely display its textual representation as plain text.
Using python-magic like
m = Magic(mime=True).from_buffer(cgi.FieldStorage.file.read())
gives me the correct MIME type.
But sometimes, the MIME type for scripts is application/*, so simply looking for m.startswith('text/') is not enough.
Another site suggested using
m = Magic().from_buffer(cgi.FieldStorage.file.read())
and checking for 'text' in m.
Would the second approach be reliable enough for a collection of arbitrary file uploads or could someone give me another idea?
Thanks a lot.
What is your goal? Do you want the real mime type? Is that important for security reasons? Or is it "nice to have"?
The problem is that the same file can have different mime types. When a script file has a proper #! header, python-magic can determine the script type and tell you. If the header is missing, text/plain might be the best you can get.
This means there is no general "will always work" magic solution (despite the name of the module). You will have to sit down and think what information you can get, what it means and how you want to treat it.
The secure solution would be to create a list of mime types that you accept and check them with:
allowed_mime_types = [ ... ]
if m in allowed_mime_types:
That means only perfect matches are accepted. It also means that your server will reject valid files which don't have the correct mime type for some reason (missing header, magic failed to recognize the file, you forgot to mention the mime type in your list).
Or to put it another way: Why do you check the mime type of the file if you don't really care?
[EDIT] When you say
I need to know for each file, if I can safely display its textual representation as plain text.
then this isn't as easy as it sounds. First of all, "text" files have no encoding stored in them, so you will need to know the encoding that the user used when they created the file. This isn't a trivial task. There are heuristics to do so but things get hairy when encodings like ISO 8859-1 and 8859-15 are used (the latter has the Euro symbol).
To fix this, you will need to force your users to either save the text files in a specific encoding (UTF-8 is currently the best choice) or you need to supply a form into which users will have to paste the text.
When using a form, the user can see whether the text is encoded correctly (they see it on the screen), they can fix any problems and you can make sure that the browser sends you the text encoded with UTF-8.
If you can't do that, your only choice is to check for any bytes below 0x20 in the input with the exception of \r, \n and \t. That is a pretty good check for "is this a text document".
But when users use umlauts (like when you write an application that is being used world wide), this approach will eventually fail unless you can enforce a specific encoding on the user's side (which you probably can't since you don't trust the user).
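The control-byte check described above could look roughly like this. The function name looks_like_text is made up for illustration, and the check is only the heuristic from the text, not a guarantee.

```python
# Control bytes that legitimately appear in text files: tab, LF, CR.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}

def looks_like_text(data: bytes) -> bool:
    """Return True if data has no bytes below 0x20 other than tab, LF and CR."""
    return not any(b < 0x20 and b not in ALLOWED_CONTROL for b in data)
```

UTF-8 text with umlauts passes, because its non-ASCII bytes are all 0x80 or above. The same text saved as UTF-16 fails, because that encoding contains 0x00 bytes, which is one instance of the encoding caveat mentioned above.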
[EDIT2] Since you need this to check actual source code: If you want to make sure the source code is "safe", then parse it. Most languages allow you to parse the code without actually executing it. That would give you some real information (because the parsers know what to look for) and you wouldn't need to make wild guesses :-)
After playing around a bit, I discovered that I can probably use the Magic(mime_encoding=True) results!
I ran a simple script on my Dropbox folder and grouped the results both by encoding and by extension to check for irregularities.
But it does seem pretty usable by looking for 'binary' in encoding.
I think I will hang on to that, but thank you all.

How do I access both binary and text data for email processing with Python 3?

I am converting a Python 2 program to Python 3 and I'm not sure about the approach to take.
The program reads in either a single email from STDIN, or file(s) are specified containing emails. The program then parses the emails and does some processing on them.
So we need to work with the raw data of the email input, to store it on disk and do an MD5 hash on it. We also need to work with the text of the email input in order to run it through the Python email parser and extract fields etc.
With Python 3 it is unclear to me how we should be reading in the data. I believe we need the raw binary data in order to do an md5 on it, and also to be able to write it to disk. I understand we also need it in text form to be able to parse it with the email library. Python 3 has made significant changes to the IO handling and text handling and I can't see the "correct" approach to read the email raw data and also use the same data in text form.
Can anyone offer general guidance on this?
The general guidance is to convert everything to unicode ASAP and keep it that way until the last possible minute.
Remember that str is the old unicode and bytes is the old str.
See http://docs.python.org/dev/howto/unicode.html for a start.
With Python 3 it is unclear to me how we should be reading in the data.
Specify the encoding when you open the file and it will automatically give you unicode. If you're reading from stdin, you'll get unicode. You can read from stdin.buffer to get binary data.
I believe we need the raw binary data in order to do an md5 on it
Yes, you do. Encode it when you need to hash it.
and also to be able to write it to disk.
You specify the encoding when you open the file you're writing it to, and the file object encodes it for you.
I understand we also need it in text form to be able to parse it with the email library.
Yep, but since it'll get decoded when you open the file, that's what you'll have.
That said, this question is really too open ended for Stack Overflow. When you have a specific problem / question, come back and we'll help.
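For the specific combination asked about (hash the raw input, store it, and parse it), one alternative sketch keeps the bytes as the single source of truth and lets the email package parse them directly. This differs from decoding first and re-encoding later; it is offered as an option under the assumption that the hash must match the bytes written to disk. The function name and the sample message are made up for illustration.

```python
import hashlib
from email import message_from_bytes, policy

def process_raw_email(raw: bytes):
    """Hash the raw bytes and parse the same bytes into a message object."""
    digest = hashlib.md5(raw).hexdigest()
    message = message_from_bytes(raw, policy=policy.default)
    return digest, message

# In a real script the bytes might come from sys.stdin.buffer.read()
# or open(path, "rb").read(); a literal stands in for that here.
raw = b"From: a@example.com\r\nSubject: Hi\r\n\r\nBody text\r\n"
digest, message = process_raw_email(raw)
print(digest, message["Subject"])
```

Writing raw to a file opened with mode "wb" then stores exactly the bytes that were hashed.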

Handling unicode data in XMLRPC

I have to migrate data to OpenERP through XMLRPC by using TerminatOOOR.
I send a name with value "Rotule right Aurélia".
In Python the name will be encoded with value: 'Rotule right Aur\xc3\xa9lia '
But in TerminatOOOR (xmlrpc client) the data is encoded with value 'Rotule middle Aur\357\277\275lia'
So in the server side, the data value is not decoded correctly and I get bad data.
TerminatOOOR is a Ruby plugin for Kettle (a Java product) and I guess it should encode data as UTF-8.
I just don't know why it happens like this.
Any help?
This issue comes from Kettle.
My program uses Kettle to open an Excel file, get the active sheet and transfer the data in that sheet to TerminatOOOR for further handling.
When reading data from the Excel file, Kettle cannot recognize the encoding, so it passes bad data to TerminatOOOR.
My workaround is to manually export the Excel file to CSV before giving the data to TerminatOOOR. By doing this, I don't use Kettle's feature for mapping Excel column names to variable names.
first off, whenever you deal with text (and all text is bound to contain some non-US-ASCII character sooner or later), you'll be much happier doing that in Python 3.x instead of in the 2.x series. if Py3 is not an option, try to always use from __future__ import unicode_literals (available in Python 2.6 and 2.7).
basically, when you send text or any other data over the wire, that will only happen in the form of bytes (octets of bits), so it will have to be encoded at some point. try to find out exactly where that encoding takes place in your tool chain; if necessary, use a debugging tool (or deploy print( repr( x ) ) statements) to look into relevant variables. the other software you mention is presumably written in PHP, a language which is known to have issues with unicode. you say that 'it should encode the data by utf-8', but on the other hand, when the receiving end sees the data of an incoming RPC request, that data should already be in utf-8. it would have to be decoded to obtain unicode again.
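As a small illustration of looking at the bytes, here is a Python 3 sketch of what the two values quoted in the question mean. The last step shows just one way such bytes can arise; it is an assumption for illustration, not a diagnosis of this particular tool chain.

```python
name = "Rotule right Aurélia"

# What a correct UTF-8 encoding of the name looks like on the wire.
utf8_bytes = name.encode("utf-8")

# The octal escapes \357\277\275 from the question are the bytes EF BF BD,
# which is the UTF-8 encoding of U+FFFD, the replacement character.
suspicious = bytes([0o357, 0o277, 0o275])
replacement = suspicious.decode("utf-8")

# One way to produce that pattern: Latin-1 bytes decoded as UTF-8,
# with invalid sequences replaced instead of raising an error.
mangled = "Aurélia".encode("latin-1").decode("utf-8", errors="replace")
mangled_bytes = mangled.encode("utf-8")
```

Seeing U+FFFD means the original character was already lost before the data was encoded for the request, so the place to look is whatever step decoded the input.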

Which AES library to use in Ruby/Python?

I need to be able to send encrypted data between a Ruby client and a Python server (and vice versa) and have been having trouble with the ruby-aes gem/library. The library is very easy to use but we've been having trouble passing data between it and the pyCrypto AES library for Python. These libraries seem to be fine when they're the only one being used, but they don't seem to play well across language boundaries. Any ideas?
Edit: We're doing the communication over SOAP and have also tried converting the binary data to base64 to no avail. Also, it's more that the encryption/decryption is almost but not exactly the same between the two (e.g., the lengths differ by one or there are extra garbage characters on the end of the decrypted string)
(e.g., the lengths differ by one or there are extra garbage characters on the end of the decrypted string)
I missed that bit. There's nothing wrong with your encryption/decryption. It sounds like a padding problem. AES always encrypts data in blocks of 128 bits. If the length of your data isn't a multiple of 128 bits, the data should be padded before encryption and the padding needs to be removed/ignored after decryption.
Turns out what happened was that ruby-aes automatically pads data to fill up 16 chars and sticks a null character on the end of the final string as a delimiter. PyCrypto requires you to do multiples of 16 chars so that was how we figured out what ruby-aes was doing.
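For reference, a padding step that both sides can agree on might be sketched as follows. This uses PKCS#7, a widely used scheme, and is not what ruby-aes does by default according to the description above; the function names are made up for illustration.

```python
BLOCK_SIZE = 16  # AES block size in bytes (128 bits)

def pkcs7_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of block_size."""
    pad_len = block_size - (len(data) % block_size)
    return data + bytes([pad_len]) * pad_len

def pkcs7_unpad(padded: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Strip PKCS#7 padding, raising ValueError if it is malformed."""
    if not padded or len(padded) % block_size != 0:
        raise ValueError("input is not a whole number of blocks")
    pad_len = padded[-1]
    if not 1 <= pad_len <= block_size:
        raise ValueError("invalid padding length")
    if padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid padding bytes")
    return padded[:-pad_len]
```

Input that is already a whole number of blocks gains one full block of padding, so the receiver can always remove it unambiguously.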
It's hard to even guess at what's happening without more information ...
If I were you, I'd check that in your Python and Ruby programs:
The keys are the same (obviously). Dump them as hex and compare each byte.
The initialization vectors are the same. This is the parameter IV in AES.new() in pyCrypto. Dump them as hex too.
The modes are the same. The parameter mode in AES.new() in pyCrypto.
There are defaults for IV and mode in pyCrypto, but don't trust that they are the same as in the Ruby implementation. Use one of the simpler modes, like CBC. I've found that different libraries have different interpretations of how the more complex modes, such as CTR, work.
Wikipedia has a great article about how block cipher modes work.
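The hex comparison from the checklist above can be done with a few lines on the Python side. The helper name describe_params and the sample key and IV are invented for illustration; the idea is to print the same three values from the Ruby program and compare them by eye or with a diff.

```python
def describe_params(key: bytes, iv: bytes, mode: str) -> dict:
    """Return the cipher parameters in a form that is easy to compare across languages."""
    return {"key": key.hex(), "iv": iv.hex(), "mode": mode}

# Made-up sample values; a real key and IV would come from your configuration.
params = describe_params(b"0123456789abcdef", bytes(16), "CBC")
for label, value in params.items():
    print(f"{label}: {value}")
```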
Kind of depends on how you are transferring the encrypted data. It is possible that you are writing a file in one language and then trying to read it in from the other. Python (especially on Windows) requires that you specify binary mode for binary files. So in Python, assuming you want to decrypt there, you should open the file like this:
f = open('/path/to/file', 'rb')
The "b" indicates binary. And if you are writing the encrypted data to file from Python:
f = open('/path/to/file', 'wb')
f.write(encrypted_data)
Basically what Hugh said above: check the IV's, key sizes and the chaining modes to make sure everything is identical.
Test both sides independently: encode some information and check that Ruby and Python encoded it identically. You're assuming that the problem has to do with encryption, but it may just be something as simple as sending the encrypted data with puts, which throws random newlines into the data. Once you're sure they encrypt the data correctly, check that you receive exactly what you think you sent. Keep going step by step until you find the stage that corrupts the data.
Also, I'd suggest using the openssl library that's included in ruby's standard library instead of using an external gem.
