How does file reading work in utf-8 encoding?

For input text files, I know that .seek and .tell both operate with bytes, usually - that is, .seek seeks a certain number of bytes in relation to a point specified by its given arguments, and .tell returns the number of bytes since the beginning of the file.
My question is: does this work the same way when using other encodings like utf-8? I know utf-8, for example, requires several bytes for some characters.
It would seem that if those methods still deal with bytes when parsing utf-8 files, then unexpected behavior could result (for instance, the cursor could end up inside of a character's multi-byte encoding, or a multi-byte character could register as several characters).
If so, are there other methods to do the same tasks? Especially for when parsing a file requires information about the cursor's position in terms of characters.
On the other hand, if you specify the encoding in the open() function ...
infile = open(filename, encoding='utf-8')
Does the behavior of .seek and .tell change?

Assuming you're using io.open() (in Python 3 this is the same function as the builtin open(); in Python 2 it is not), text mode gets you an instance of io.TextIOWrapper, so this passage from the documentation should answer your question:
Text I/O over a binary storage (such as a file) is significantly
slower than binary I/O over the same storage, because it implies
conversions from unicode to binary data using a character codec. This
can become noticeable if you handle huge amounts of text data (for
example very large log files). Also, TextIOWrapper.tell() and
TextIOWrapper.seek() are both quite slow due to the reconstruction
algorithm used.
NOTE: You should also be aware that this still doesn't guarantee that seek() will skip over whole characters, only Unicode code points (a single character can be composed of more than one code point, for example ą can be written as u'\u0105' or u'a\u0328' - both will print the same character).
Source: http://docs.python.org/library/io.html#id1
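
To make the tell()/seek() behaviour concrete, here is a minimal sketch, assuming Python 3, where the builtin open() in text mode returns a TextIOWrapper. The file contents and the helper names demo_text_seek and same_after_normalizing are made up for illustration. In text mode, read(1) returns one code point regardless of how many bytes it takes on disk, and the value returned by tell() is best treated as an opaque marker that you only hand back to seek().

```python
import os
import tempfile
import unicodedata


def demo_text_seek():
    """Write a small UTF-8 file, then read, tell and seek in text mode."""
    # "\u0105" is a-with-ogonek; each non-ASCII letter here takes 2 bytes in UTF-8.
    text = "\u0105bc\n\u017c\u00f3\u0142w\n"
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="" turns off newline translation so byte counts match on every OS.
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        with open(path, encoding="utf-8", newline="") as f:
            first = f.read(1)   # one code point, although it is two bytes on disk
            marker = f.tell()   # opaque position; do not do arithmetic on it
            rest = f.read()
            f.seek(marker)      # seek only to 0 or to a value tell() returned
            rest_again = f.read()
        return first, rest, rest_again, len(text), os.path.getsize(path)
    finally:
        os.remove(path)


def same_after_normalizing(a, b):
    """Compare two strings after composing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

If you need positions counted in characters, one option is to read the text into a str and index that, normalizing first when combining sequences such as 'a\u0328' matter.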

Some experimentation with utf-8 encodings (repeated seeking and printing of .read(1) methods in a file with lots of multi-byte characters) revealed that yes, .seek() and .read() do behave differently in utf-8 files... they don't deal with single bytes, but single characters. This consisted of several simple re-writings of code, reading and seeking in different patterns.
Thanks to #satuon for your help.

Related

Python unicode errors reading files generated by other apps

I'm getting decoding exception errors when reading export files from multiple applications. Have been running into this for a month, as I learn far more about unicode than I ever wanted to know. Some fundamentals are still missing. I understand utf, I understand codepages, I understand how they tend to be used in practice (a single codepage per document e.g., though I can't imagine that's still true today--see the back page of a health statement with 15 languages.)
Is it true that utf-8 can and does encode every possible unicode char? How then is it possible for one application to write a utf-8 file and another to not be able to read it?
When utf is used, codepages are NOT used, is that correct? As I think it through, the codepage is an older style and is made obsolete by utf. I'm sure there are some exceptions.
utf could also be looked at as a data compression scheme, less than an encoding one.
But there I'm stuck, as in practice, I have 6 different applications made in different countries, which can create export files, 3 in utf-8, 3 in cp1252, yet python 3.7 cannot read them without error:
'charmap' codec can't decode byte 0x9d in position 1555855: character maps to <undefined>
'charmap' codec can't decode byte 0x81 in position 4179683: character maps to <undefined>
I use Edit Pro to examine the files, which successfully reads the files. It points to a line that contains an extra pair of special double quotes:
"Metro Exodus review: “Not only the best Metro yet, it's one of the best shooters in years” | GamesRadar+"
Removing that ” allows python to continue reading in the file, to the next error.
Python reports it as char x9d, but a really old editor (Codewright, I believe) reports it as x94. I verified on the internet that it is an x94 and x93 pair, so it must be true. ;-)
It is very troublesome that I don't know for sure what the actual bytes are, as there are so many layers of translation, interpretation, format for display, etc.
So the visual studio debug report of x9d is a misdirect. What's going on with the python library that it would report this?
How is this possible? I can find no info about how chars in one codepage can be invalid under utf (if that's the problem). What would I search under?
It should not be this hard. I have 30 years experience in programming c++, sql, you name it, learning new libraries, languages is just breakfast.
I also do not understand why the information to handle this is so hard to find. Surely numerous other programmers doing data conversions, import/exports between applications have run into this for decades.
The files I'm importing are csv files from 6 apps, and json files from another. The 6 apps export in utf-8 and cp1252 (as reported by Edit Pro) and the other app exports json in utf-8, though I could also choose csv.
The 6 apps run on an iPhone and export files I'm attempting to read on windows 10. I'm running python 3.7.8, though this problem has persisted since 3.6.3.
Thanks in advance
Dan

The error 'charmap' codec can't decode byte... shows that you are not using utf-8 to read the file. That's the source of your struggles on this one. Unless the file starts with a BOM (byte order mark), you kinda have to know how the file was encoded to decode it correctly.
utf-8 encodes all unicode characters and python should be able to read them all. Displaying is another matter. You need font files for the unicode characters to do that part. You were reading in "charmap", not "utf-8" and that's why you had the error.
"when utf is used" ... there are several UTF encodings: utf-8, utf-16-be (big endian), utf-16-le (little endian), utf-16 (which writes a BOM and, when reading, uses the BOM to choose the byte order), the utf-32 variants (I've never seen these in the wild), and variants that include the BOM (byte order mark), an optional marker at the start of the file that identifies the UTF encoding.
But yes, UTF encodings are meant to replace the older codepage encodings.
No, it's not compression. The encoded stream could be larger than the bytes needed to hold the string in memory. This is especially true of utf-8, less true with utf-16 (that's why Microsoft went with utf-16). But utf-8, as a superset of ASCII that does not have byte order issues like utf-16, has many other advantages (that's why all the sane people chose it). I can't think of a case where a UTF encoding would ever take fewer bytes than the count of its characters.
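
The x9d versus x94 confusion in the question can be reproduced in a few lines. This is a minimal sketch, assuming the problem character is the right double quotation mark (U+201D) shown in the sample line; the helper name explain_charmap_error is made up for illustration. In UTF-8 that character is the three bytes e2 80 9d, while in cp1252 it is the single byte 94. Byte 9d has no mapping in Python's cp1252 codec, so decoding UTF-8 data as cp1252 fails on it.

```python
def explain_charmap_error():
    """Show why a UTF-8 right double quote breaks a cp1252 read."""
    right_quote = "\u201d"
    utf8_bytes = right_quote.encode("utf-8")      # three bytes, ending in 0x9d
    cp1252_bytes = right_quote.encode("cp1252")   # one byte, 0x94
    try:
        # This mimics opening a UTF-8 file without encoding="utf-8" on a
        # system whose default encoding is cp1252.
        utf8_bytes.decode("cp1252")
        message = None
    except UnicodeDecodeError as exc:
        message = str(exc)
    return utf8_bytes, cp1252_bytes, message
```

Passing the encoding explicitly, for example open(path, encoding="utf-8") for the UTF-8 exports and open(path, encoding="cp1252") for the others, avoids relying on the platform default.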

Which character encoding Python 3.x supports for file I/O?

I had a problem writing a UTF-8 supported character (\ufffd) to a text file. I was wondering what the most inclusive character set Python 3.x supports for writing string data to files.
I was able to overcome the problem by
valEncoded = (origVal.encode(encoding='ASCII', errors='replace')).decode()
which basically filtered out non-ASCII characters from origVal. But I figure Python file I/O should support more than ASCII, which is pretty conservative. So I am looking for what is the most inclusive character set supported.

Any of the UTF encodings should work:
UTF-8 is typically the most compact (particularly if the text is largely ASCII compatible), and portable between systems with different endianness without requiring a BOM (byte order mark). It's the most common encoding used on non-Windows systems that support the full Unicode range (and the most common encoding used for serving data over a network, e.g. HTML, XML, JSON)
UTF-16 is commonly used by Windows (the system "wide character" APIs use it as the native encoding)
UTF-32 is uncommon, and wastes a lot of disk space, with the only real benefit being a fixed ratio between characters and bytes (you can divide file size by four after subtracting the BOM and you get the number of characters)
In general, I'd recommend going with UTF-8 unless you know it will be consumed by another tool that expects UTF-16 or the like.
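
As a rough illustration of the points above, here is a minimal sketch that writes the same string with different UTF encodings and reads it back. The helper name write_and_read_back and the sample string are made up for illustration; the only assumption is Python 3's builtin open() with an explicit encoding argument.

```python
import os
import tempfile


def write_and_read_back(value, encoding="utf-8"):
    """Write value to a temporary file in the given encoding and read it back.

    Returns the text that was read and the file size in bytes.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(value)
        with open(path, encoding=encoding) as f:
            return f.read(), os.path.getsize(path)
    finally:
        os.remove(path)
```

For a string such as "caf\u00e9 \ufffd", all three of "utf-8", "utf-16" and "utf-32" return the original text unchanged; only the file size differs.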

Binary Data To Unicode

Among all the encodings available here http://docs.python.org/library/codecs.html
which one is the one I should use for decoding binary data into unicode without it becoming corrupted when I encode it back to string?
I've used raw_unicode_data and it doesn't work.
Example: I upload picture in a POST (but not as file attachment). Django converts POST data to unicode using utf-8. However when converting back from unicode to string (again using utf-8), data becomes corrupted. I used raw_unicode_data and the same happened (though only a few bytes this time). Which encoding should I use so that the decode and encode steps don't corrupt the data.

If you want to post binary data use the base64 encoding.
http://docs.python.org/library/base64.html

"Binary data" is not text, therefore converting it to unicode is meaningless. If there is text embedded in the binary data then extract it first and decode using the encoding given in the specification for the data format.

As others have already stated, your question isn't particularly clear. If you are wanting to funnel binary data through a text channel (such as POST), then base64 is the right format to use with appropriate data transformation operations in the client and the server (binary data -> base64 text -> pass over text channel -> base64 text -> binary data).
Alternatively, if you are wanting to tolerate improperly encoded text (e.g. as Python 3 tries to do for some interfaces such as file paths and environment variables), then Python 3.1 and later offer the surrogateescape error handler, which will convert invalid values into a format that isn't valid readable text, but allows the original binary data to be faithfully recreated when encoding back to bytes.
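
Both approaches can be sketched in a few lines, assuming Python 3. The function names below are made up for illustration; base64.b64encode, base64.b64decode and the "surrogateescape" error handler are standard library features.

```python
import base64


def to_text_channel(payload: bytes) -> str:
    """Turn arbitrary bytes into ASCII-only text that survives a text channel."""
    return base64.b64encode(payload).decode("ascii")


def from_text_channel(text: str) -> bytes:
    """Recover the original bytes from base64 text."""
    return base64.b64decode(text)


def tolerant_roundtrip(raw: bytes) -> bytes:
    """Decode possibly invalid UTF-8 and encode it back without losing bytes."""
    # Undecodable bytes become lone surrogates in the str, which the same
    # handler turns back into the original bytes when encoding.
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")
```

The base64 route is the one to use for real binary data such as an image; surrogateescape is meant for data that is mostly text but may contain stray invalid bytes.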

How to handle undecodable filenames in Python?

I'd really like to have my Python application deal exclusively with Unicode strings internally. This has been going well for me lately, but I've run into an issue with handling paths. The POSIX API for filesystems isn't Unicode, so it's possible (and actually somewhat common) for files to have "undecodable" names: filenames that aren't encoded in the filesystem's stated encoding.
In Python, this manifests as a mixture of unicode and str objects being returned from os.listdir().
>>> os.listdir(u'/path/to/foo')
[u'bar', 'b\xe1z']
In that example, the character '\xe1' is encoded in Latin-1 or somesuch, even when the (hypothetical) filesystem reports sys.getfilesystemencoding() == 'UTF-8' (in UTF-8, that character would be the two bytes '\xc3\xa1'). For this reason, you'll get UnicodeErrors all over the place if you try to use, for example, os.path.join() with Unicode paths, because the filename can't be decoded.
The Python Unicode HOWTO offers this advice about unicode pathnames:
Note that in most occasions, the Unicode APIs should be used. The bytes APIs should only be used on systems where undecodable file names can be present, i.e. Unix systems.
Because I mainly care about Unix systems, does this mean I should restructure my program to deal only with bytestrings for paths? (If so, how can I maintain Windows compatibility?) Or are there other, better ways of dealing with undecodable filenames? Are they rare enough "in the wild" that I should just ask users to rename their damn files?
(If it is best to just deal with bytestrings internally, I have a followup question: How do I store bytestrings in SQLite for one column while keeping the rest of the data as friendly Unicode strings?)

Python does have a solution to the problem, if you're willing to switch to Python 3.1 or later:
PEP 383 - Non-decodable Bytes in System Character Interfaces.
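
A minimal sketch of what that looks like in practice, assuming Python 3.2 or later (where os.fsdecode and os.fsencode exist) on a POSIX system; the helper name roundtrip_filename is made up for illustration. On such systems the filesystem codec uses the surrogateescape handler from PEP 383, so an undecodable name still becomes a str and can be turned back into the exact original bytes.

```python
import os


def roundtrip_filename(raw_name: bytes):
    """Decode a raw filename to str and encode it back to bytes.

    Assumes a POSIX system, where the filesystem codec uses the
    surrogateescape error handler.
    """
    as_text = os.fsdecode(raw_name)   # undecodable bytes become lone surrogates
    as_bytes = os.fsencode(as_text)   # ...and are restored exactly here
    return as_text, as_bytes
```

With this in place, os.listdir() called with a str path returns only str names, and the program can keep using str internally while still being able to open every file.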

If you need to store bytestrings in a DB that is geared for UNICODE then it is probably easier to record the bytestrings encoded in hex. That way, the hex-encoded string is safe to store as a unicode string in the db.
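
The hex idea might look like the sketch below, assuming Python 3.5 or later (for bytes.hex()) and the standard sqlite3 module. The table name, column names and the helper store_and_load are made up for illustration.

```python
import sqlite3


def store_and_load(raw_name: bytes) -> bytes:
    """Store a bytestring as hex text in SQLite and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE files (label TEXT, name_hex TEXT)")
        # The hex form is plain ASCII, so it is safe in a TEXT column.
        conn.execute(
            "INSERT INTO files (label, name_hex) VALUES (?, ?)",
            ("example", raw_name.hex()),
        )
        (stored,) = conn.execute("SELECT name_hex FROM files").fetchone()
        return bytes.fromhex(stored)
    finally:
        conn.close()
```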
As for the UNIX pathname issue, my understanding is that there is no particular encoding enforced for filenames so it is entirely possible to have Latin-1, KOI-8-R, CP1252 and others on various files. This means that each component in a pathname could have a separate encoding.
I would be tempted to try and guess the encoding of filenames using something like the chardet module. Of course, there are no guarantees so you still have to handle exceptions, but you would have fewer undecodable names. Some software replaces undecodable characters with ?, which is non-reversible. I would rather see them replaced with \xdd or \xdddd because it can be manually reversed if necessary. In some applications it may be possible to present the string to a user so that they can key in unicode characters to replace the undecodable ones.
If you do go down this route, you may end up extending chardet to handle this job. It would be nice to supplement it with a utility that scans a filesystem finding undecodeable names and produces a list that could be edited, then fed back, to fix all the names with unicode equivalents.

Character Encoding

My text editor allows me to code in several different character formats Ansi, UTF-8, UTF-8(No BOM), UTF-16LE, and UTF-16BE.
What is the difference between them?
What is commonly regarded as the best format (I'm using Python if that makes a difference)?

"Ansi" is a misnomer and usually refers to some 8-bit encoding that's the default on the current platform (on "western" Windows installations that's usually Windows-1252). It only supports a small set of characters (256 different characters at most).
UTF-8 is a variable-length, ASCII-compatible encoding capable of storing any and all Unicode characters. It's a pretty good choice for western text that should support all Unicode characters and a very viable choice in the general case.
"UTF-8 (no BOM)" is the name Windows gives to using UTF-8 without writing a byte order mark. Since a BOM is not needed for UTF-8, it shouldn't be used and this would be the correct choice (pretty much everyone else calls this version simply "UTF-8"!).
UTF-16LE and UTF-16BE are the Little Endian and Big Endian versions of the UTF-16 encoding. Like UTF-8, UTF-16 is capable of representing any Unicode character; however, it is not ASCII-compatible.
Generally speaking UTF-8 is a great overall choice and has wide compatibility (just make sure not to write the BOM, because that's what most other software expects).
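UTF-16 could take less space if the majority of your text is composed of characters that need three bytes in UTF-8 but only two in UTF-16 (most CJK text, for example). For text in the basic Latin alphabet, UTF-8 is the smaller of the two.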
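
The size trade-off is easy to check. Here is a minimal sketch, assuming Python 3; the helper name encoded_sizes is made up for illustration, and the "-le" codec variants are used so that no BOM is counted.

```python
def encoded_sizes(text):
    """Return the number of bytes text takes in each UTF encoding (no BOM)."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }
```

ASCII text comes out smallest in UTF-8, Cyrillic text takes the same space in UTF-8 and UTF-16, and CJK text is smaller in UTF-16.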
"Ansi" should only be used when you have a specific need to interact with a legacy application that doesn't support Unicode.
An important thing about any encoding is that it is metadata that needs to be communicated in addition to the data. This means that you must know the encoding of a byte stream to interpret it as text correctly. So you should either use formats that document the actual encoding used (XML is a prime example here) or standardize on a single encoding in a given context and use only that.
For example, if you start a software project, then you can specify that all your source code is in a given encoding (again: I suggest UTF-8) and stick with that.
For Python files specifically, there's a way to specify the encoding of your source files.
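
The mechanism is the encoding declaration from PEP 263: a specially formatted comment on the first or second line of the file. A minimal sketch follows; the constant GREETING and the helper utf8_length are made up for illustration. In Python 3 the default source encoding is already UTF-8, so the declaration mainly matters for Python 2 or for files saved in some other encoding.

```python
# -*- coding: utf-8 -*-
# The comment above must be on the first or second line of the file; it tells
# the interpreter how the bytes of this source file are encoded.

GREETING = "na\u00efve caf\u00e9"  # escapes keep this example safe in any editor


def utf8_length(text):
    """Return how many bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))
```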

Note that "ANSI" is usually CP1252.

You'll probably get greatest utility with UTF-8 No BOM. Forget that ANSI and ASCII exist, they are deprecated dinosaurs.
