Among all the encodings available here http://docs.python.org/library/codecs.html
which one is the one I should use for decoding binary data into unicode without it becoming corrupted when I encode it back to string?
I've used raw_unicode_data and it doesn't work.
Example: I upload picture in a POST (but not as file attachment). Django converts POST data to unicode using utf-8. However when converting back from unicode to string (again using utf-8), data becomes corrupted. I used raw_unicode_data and the same happened (though only a few bytes this time). Which encoding should I use so that the decode and encode steps don't corrupt the data.
If you want to post binary data use the base64 encoding.
http://docs.python.org/library/base64.html
"Binary data" is not text, therefore converting it to a unicode is meaningless. If there is text embedded in the binary data then extract it first and decode using the encoding given in the specification for the data format.
As others have already stated, your question isn't particularly clear. If you are wanting to funnel binary data through a text channel (such as POST), then base64 is the right format to use with appropriate data transformation operations in the client and the server (binary data -> base64 text -> pass over text channel -> base64 text -> binary data).
Alternatively, if you are wanting to tolerate improperly encoded text (e.g. as Python 3 tries to do for some interfaces such as file paths and environment variables), then Python 3.1 and later offer the surrogatescape error handler, which will convert invalid values into a format that isn't valid readable text, but allows the original binary data to be faithfully recreated when encoding back to bytes.
Related
I am trying to do the first CryptoPals challenge where I have to implement the base64 algorithm. I am using Python.
It has the following advice:
Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing.
So say I have a string and I have to convert this into bits. The issue is that I have seen people using UTF-8 or ASCII before converting the string to bytes/bits. (See here)
Should I always proceed with UTF-8? Or there could be issues down the road if my script parses something encoded in ASCII?
I had a problem writing a UTF-8 supported character (\ufffd) to a text file. I was wondering what the most inclusive character set Python 3.x supports for writing string data to files.
I was able to overcome the problem by
valEncoded = (origVal.encode(encoding='ASCII', errors='replace')).decode()
which basically filtered out non-ASCII characters from origVal. But I figure Python file I/O should support more than ASCII, which is pretty conservative. So I am looking for what is the most inclusive character set supported.
Any of the UTF encodings should work:
UTF-8 is typically the most compact (particularly if the text is largely ASCII compatible), and portable between systems with different endianness without requiring a BOM (byte order mark). It's the most common encoding used on non-Windows systems that support the full Unicode range (and the most common encoding used for serving data over a network, e.g. HTML, XML, JSON)
UTF-16 is commonly used by Windows (the system "wide character" APIs use it as the native encoding)
UTF-32 is uncommon, and wastes a lot of disk space, with the only real benefit being a fixed ratio between characters and bytes (you can divide file size by four after subtracting the BOM and you get the number of characters)
In general, I'd recommend going with UTF-8 unless you know it will be consumed by another tool that expects UTF-16 or the like.
For input text files, I know that .seek and .tell both operate with bytes, usually - that is, .seek seeks a certain number of bytes in relation to a point specified by its given arguments, and .tell returns the number of bytes since the beginning of the file.
My question is: does this work the same way when using other encodings like utf-8? I know utf-8, for example, requires several bytes for some characters.
It would seem that if those methods still deal with bytes when parsing utf-8 files, then unexpected behavior could result (for instance, the cursor could end up inside of a character's multi-byte encoding, or a multi-byte character could register as several characters).
If so, are there other methods to do the same tasks? Especially for when parsing a file requires information about the cursor's position in terms of characters.
On the other hand, if you specify the encoding in the open() function ...
infile = open(filename, encoding='utf-8')
Does the behavior of .seek and .tell change?
Assuming you're using io.open() (not the same as the builtin open()), then using text mode gets you an instance of a io.TextIO, so this should anwser your question:
Text I/O over a binary storage (such as a file) is significantly
slower than binary I/O over the same storage, because it implies
conversions from unicode to binary data using a character codec. This
can become noticeable if you handle huge amounts of text data (for
example very large log files). Also, TextIOWrapper.tell() and
TextIOWrapper.seek() are both quite slow due to the reconstruction
algorithm used.
NOTE: You should also be aware, that this still doesn't guarantee that seek() will skip over characters, but rather unicode codepoints (a single character can be composed out of more then one codepoint, for example ą can be written as u'\u0105' or u'a\u0328' - both will print the same character).
Source: http://docs.python.org/library/io.html#id1
Some experimentation with utf-8 encodings (repeated seeking and printing of .read(1) methods in a file with lots of multi-byte characters) revealed that yes, .seek() and .read() do behave differently in utf-8 files... they don't deal with single bytes, but single characters. This consisted of several simple re-writings of code, reading and seeking in different patterns.
Thanks to #satuon for your help.
My text editor allows me to code in several different character formats Ansi, UTF-8, UTF-8(No BOM), UTF-16LE, and UTF-16BE.
What is the difference between them?
What is commonly regarded as the best format (I'm using Python if that makes a diffrence)?
"Ansi" is a misnomer and usually refers to some 8-bit encoding that's the default on the current platform (on "western" Windows installations that's usually Windows-1252). It only supports a small set of characters (256 different characters at most).
UTF-8 is a variable-length, ASCII-compatible encoding capable of storing any and all Unicode characters. It's a pretty good choice for western text that should support all Unicode characters and a very viable choice in the general case.
"UTF-8 (no BOM)" is the name Windows gives to using UTF-8 without writing a Byte Order Marker. Since a BOM is not needed for UTF-8, it shouldn't be used and this would be the correct choice (pretty much everyone else calls this version simply "UTF-8"!).
UTF-16LE and UTF-16BE are the Little Endian and Big Endian versions of the UTF-16 encoding. As UTF-8, UTF-16 is capable of representing any Unicode character, however it is not ASCII-compatible.
Generally speaking UTF-8 is a great overall choice and has wide compatibility (just make sure not to write the BOM, because that's what most other software expects).
UTF-16 could take less space if the majority of your text is composed of non-ASCII characters (i.e. doesn't use the basic latin alphabet).
"Ansi" should only be used when you have a specific need to interact with a legacy application that doesn't support Unicode.
An important thing about any encoding is that they are meta-data that need to be communicated in addition to the data. This means that you must know the encoding of some byte stream to interpret it as a text correctly. So you should either use formats that document the actual encoding used (XML is a prime example here) or standardize on a single encoding in a given context and use only that.
For example, if you start a software project, then you can specify that all your source code is in a given encoding (again: I suggest UTF-8) and stick with that.
For Python files specifically, there's a way to specify the encoding of your source files.
Here. Note that "ANSI" is usually CP1252.
You'll probably get greatest utility with UTF-8 No BOM. Forget that ANSI and ASCII exist, they are deprecated dinosaurs.
This is mainly just a "check my understanding" type of question. Here's my understanding of CLOBs and BLOBs as they work in Oracle:
CLOBs are for text like XML, JSON, etc. You should not assume what encoding the database will store it as (at least in an application) as it will be converted to whatever encoding the database was configured to use.
BLOBs are for binary data. You can be reasonably assured that they will be stored how you send them and that you will get them back with exactly the same data as they were sent as.
So in other words, say I have some binary data (in this case a pickled python object). I need to be assured that when I send it, it will be stored exactly how I sent it and that when I get it back it will be exactly the same. A BLOB is what I want, correct?
Is it really feasible to use a CLOB for this? Or will character encoding cause enough problems that it's not worth it?
CLOB is encoding and collation sensitive, BLOB is not.
When you write into a CLOB using, say, CL8WIN1251, you write a 0xC0 (which is Cyrillic letter А).
When you read data back using AL16UTF16, you get back 0x0410, which is a UTF16 represenation of this letter.
If you were reading from a BLOB, you would get same 0xC0 back.
Your understanding is correct. Since you mention Python, think of the Python 3 distinction between strings and bytes: CLOBs and BLOBs are quite analogous, with the extra issue that the encoding of CLOBs is not under your app's control.