How to handle undecodable filenames in Python? - python

I'd really like to have my Python application deal exclusively with Unicode strings internally. This has been going well for me lately, but I've run into an issue with handling paths. The POSIX API for filesystems isn't Unicode, so it's possible (and actually somewhat common) for files to have "undecodable" names: filenames that aren't encoded in the filesystem's stated encoding.
In Python, this manifests as a mixture of unicode and str objects being returned from os.listdir().
>>> os.listdir(u'/path/to/foo')
[u'bar', 'b\xe1z']
In that example, the character '\xe1' is encoded in Latin-1 or somesuch, even when the (hypothetical) filesystem reports sys.getfilesystemencoding() == 'UTF-8' (in UTF-8, that character would be the two bytes '\xc3\xa1'). For this reason, you'll get UnicodeErrors all over the place if you try to use, for example, os.path.join() with Unicode paths, because the filename can't be decoded.
The Python Unicode HOWTO offers this advice about unicode pathnames:
Note that in most occasions, the Unicode APIs should be used. The bytes APIs should only be used on systems where undecodable file names can be present, i.e. Unix systems.
Because I mainly care about Unix systems, does this mean I should restructure my program to deal only with bytestrings for paths? (If so, how can I maintain Windows compatibility?) Or are there other, better ways of dealing with undecodable filenames? Are they rare enough "in the wild" that I should just ask users to rename their damn files?
(If it is best to just deal with bytestrings internally, I have a followup question: How do I store bytestrings in SQLite for one column while keeping the rest of the data as friendly Unicode strings?)

Python does have a solution to the problem, if you're willing to switch to Python 3.1 or later:
PEP 383 - Non-decodable Bytes in System Character Interfaces.
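A rough sketch of what that looks like in practice (assuming Python 3.2+ for os.fsencode(), and that /path/to/foo exists): undecodable bytes come back as lone surrogates instead of raising, and can be converted back to the exact on-disk bytes.
import os
for name in os.listdir('/path/to/foo'):
    # Always str; b'b\xe1z' on disk shows up as 'b\udce1z' (PEP 383 surrogateescape)
    raw = os.fsencode(name)   # round-trips back to the original bytes, e.g. b'b\xe1z'
    print(repr(name), repr(raw))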

If you need to store bytestrings in a DB that is geared for Unicode, then it is probably easier to record the bytestrings encoded in hex. That way, the hex-encoded string is safe to store as a unicode string in the db.
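A minimal sketch of the hex idea (Python 3 syntax; the sample filename is made up):
import binascii
raw_name = b'b\xe1z'                                   # a bytestring filename from the bytes form of os.listdir()
hex_name = binascii.hexlify(raw_name).decode('ascii')  # '62e17a': plain ASCII, safe in a Unicode column
assert binascii.unhexlify(hex_name) == raw_name        # fully reversible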
As for the UNIX pathname issue, my understanding is that there is no particular encoding enforced for filenames so it is entirely possible to have Latin-1, KOI-8-R, CP1252 and others on various files. This means that each component in a pathname could have a separate encoding.
I would be tempted to try and guess the encoding of filenames using something like the chardet module. Of course, there are no guarantees, so you still have to handle exceptions, but you would have fewer undecodable names. Some software replaces undecodable characters with ?, which is non-reversible. I would rather see them replaced with \xdd or \xdddd escapes, because those can be manually reversed if necessary. In some applications it may be possible to present the string to a user so that they can key in unicode characters to replace the undecodable ones.
If you do go down this route, you may end up extending chardet to handle this job. It would be nice to supplement it with a utility that scans a filesystem for undecodable names and produces a list that could be edited and then fed back to fix all the names with unicode equivalents.
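If you go the guessing route, a sketch of what that might look like (assuming the chardet package is installed; the sample bytes are made up):
import chardet
raw = b'b\xe1z'
guess = chardet.detect(raw)    # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.7, ...}
try:
    name = raw.decode(guess['encoding'] or 'latin-1')
except UnicodeDecodeError:
    name = raw.decode('latin-1')   # latin-1 never fails, though the result may be wrong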

Related

Python unicode errors reading files generated by other apps

I'm getting decoding exception errors when reading export files from multiple applications. Have been running into this for a month, as I learn far more about Unicode than I ever wanted to know. Some fundamentals are still missing. I understand UTF, I understand codepages, I understand how they tend to be used in practice (e.g. a single codepage per document, though I can't imagine that's still true today; see the back page of a health statement with 15 languages).
Is it true that utf-8 can and does encode every possible unicode char? How then is it possible for one application to write a utf-8 file and another to not be able to read it?
When UTF is used, codepages are NOT used, is that correct? As I think it through, the codepage is an older style that has been made obsolete by UTF. I'm sure there are some exceptions.
UTF could also be looked at as a data compression scheme more than an encoding one.
But there I'm stuck. In practice, I have 6 different applications made in different countries which can create export files, 3 in utf-8 and 3 in cp1252, yet Python 3.7 cannot read them without error:
'charmap' codec can't decode byte 0x9d in position 1555855: character maps to <undefined>
'charmap' codec can't decode byte 0x81 in position 4179683: character maps to <undefined>
I use Edit Pro to examine the files, and it reads them successfully. It points to a line that contains an extra pair of special double quotes:
"Metro Exodus review: “Not only the best Metro yet, it's one of the best shooters in years” | GamesRadar+"
Removing that ” allows Python to continue reading the file, up to the next error.
Python reports it as char x9d, but a really old editor (Codewright, I believe) reports it as x94. Verified on the internet that it is an x93 and x94 pair, so it must be true. ;-)
It is very troublesome that I don't know for sure what the actual bytes are, as there are so many layers of translation, interpretation, format for display, etc.
So the visual studio debug report of x9d is a misdirect. What's going on with the python library that it would report this?
How is this possible? I can find no info about how chars in one codepage can be invalid under utf (if that's the problem). What would I search under?
It should not be this hard. I have 30 years of experience programming C++, SQL, you name it; learning new libraries and languages is just breakfast.
I also do not understand why the information to handle this is so hard to find. Surely numerous other programmers doing data conversions, import/exports between applications have run into this for decades.
The files I'm importing are csv files from 6 apps, and json files from another. The 6 apps export in utf-8 and cp1252 (as reported by Edit Pro), and the other app exports json in utf-8, though I could also choose csv.
The 6 apps run on an iPhone and export files I'm attempting to read on windows 10. I'm running python 3.7.8, though this problem has persisted since 3.6.3.
Thanks in advance
Dan
The error 'charmap' codec can't decode byte... shows that you are not using utf-8 to read the file. That's the source of your struggles on this one. Unless the file starts with a BOM (byte order mark), you kinda have to know how the file was encoded to decode it correctly.
utf-8 encodes all unicode characters and python should be able to read them all. Displaying is another matter. You need font files for the unicode characters to do that part. You were reading in "charmap", not "utf-8", and that's why you had the error. (The 0x9d your traceback complains about is the last byte of the UTF-8 sequence E2 80 9D for the ” character; 0x9d and 0x81 happen to be byte values that cp1252, the "charmap" codec here, leaves undefined.)
"when utf is used" ... there are several UTF encodings. utf-8, utf-16-be (big endian), utf-16-le (little endian), utf-16 (synonym for utf-16-le), utf-32 variants (I've never seen this in the wild) and variants that include the BOM (byte order mark) which is an optional set of characters at the start of the file describing utf encoding type.
But yes, UTF encodings are meant to replace the older codepage encodings.
No, it's not compression. The encoded stream could be larger than the bytes needed to hold the string in memory. This is especially true of utf-8, less true with utf-16 (that's why Microsoft went with utf-16). But utf-8, as a superset of ASCII that does not have utf-16's byte order issues, has many other advantages (that's why all the sane people chose it). I can't think of a case where a UTF encoding would ever be smaller than the count of its characters.
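To make the first point concrete, a small sketch of reading such a file with an explicit encoding ('export.csv' is a placeholder name; use whatever encoding the file was actually written in):
# Python 3: be explicit, otherwise Windows falls back to the ANSI code page ("charmap")
with open('export.csv', encoding='utf-8') as f:
    data = f.read()
# If the encoding is only a guess, errors='replace' keeps bad bytes from killing the read
with open('export.csv', encoding='cp1252', errors='replace') as f:
    data = f.read()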

Which character encoding Python 3.x supports for file I/O?

I had a problem writing a UTF-8 supported character (\ufffd) to a text file. I was wondering what the most inclusive character set Python 3.x supports for writing string data to files.
I was able to overcome the problem by
valEncoded = (origVal.encode(encoding='ASCII', errors='replace')).decode()
which basically replaced the non-ASCII characters in origVal with '?'. But I figure Python file I/O should support more than ASCII, which is pretty conservative. So I am looking for the most inclusive character set supported.
Any of the UTF encodings should work:
UTF-8 is typically the most compact (particularly if the text is largely ASCII compatible), and portable between systems with different endianness without requiring a BOM (byte order mark). It's the most common encoding used on non-Windows systems that support the full Unicode range (and the most common encoding used for serving data over a network, e.g. HTML, XML, JSON)
UTF-16 is commonly used by Windows (the system "wide character" APIs use it as the native encoding)
UTF-32 is uncommon, and wastes a lot of disk space, with the only real benefit being a fixed ratio between characters and bytes (you can divide file size by four after subtracting the BOM and you get the number of characters)
In general, I'd recommend going with UTF-8 unless you know it will be consumed by another tool that expects UTF-16 or the like.
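For example (a small Python 3 sketch; the file name is arbitrary), UTF-8 round-trips the \ufffd character that the ASCII codec rejected:
text = 'replacement char: \ufffd, accented: á'
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(text)
with open('out.txt', encoding='utf-8') as f:
    assert f.read() == text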

Python, Windows, Ansi - encoding, again

Hello there,
Even though I really tried... I'm stuck and somewhat desperate when it comes to Python, Windows, ANSI and character encoding. I need help, seriously... searching the web for the last few hours wasn't any help, it just drives me crazy.
I'm new to Python, so I have almost no clue what's going on. I'm about to learn the language, so my first program, which is almost done, should automatically generate music playlists from a given folder containing mp3s. That works just fine, besides one single problem...
...I can't write Umlaute (äöü) to the playlist file.
After I found a solution for "wrong-encoded" data in sys.argv I was able to deal with that. When reading metadata from the MP3s, I'm using some sort of simple character substitution to get rid of all those international special chars, like French accents or this crazy Scandinavian "o" with a slash in it (I don't even know how to type it...). All fine.
But I'd like to write at least the mentioned Umlaute to the playlist file; those characters are really common here in Germany. And unlike the metadata, where I don't care about some missing characters or misspelled words, this is relevant, because now I'm writing the paths to the files.
I've tried so many various encoding and decoding methods, I can't list them all here... heck, I'm not even able to tell which settings I tried half an hour ago. I found code online, here, and elsewhere, that seemed to work for some purposes. Not for mine.
I think the tricky part is this: it seems like the problem is the so-called ANSI format of the files I need to write. Correct, I actually need this ANSI stuff. About two hours ago I actually managed to write whatever I'd like to a UTF-8 file. Works like a charm... until I realized that my player (Winamp, old version) somehow doesn't work with those UTF-8 playlist files. It couldn't resolve the path, even though it looks right in my editor.
If I change the file format back to ANSI, paths containing special chars get corrupted. I'm just guessing, but if Winamp reads these UTF-8 files as ANSI, that would cause the problem I'm experiencing right now.
So...
I DO have to write äöü in a path, or it will not work
It DOES have to be an ANSI-"encoded" file, or it will not work
Things like line.write(str.decode('utf-8')) break the function of the file
A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned Metadata and allowed characters in it...)
Oh, and I'm using Python 2.7.3. Third-party module dependencies, you know...
Is there ANYONE who could guide me towards a way out of this encoding hell? Any help is welcome. If I need 500 lines of code for other functions or classes, I'll type them. If there's a module for handling such stuff, let me know! I'd buy it! Anything helpful will be tested.
Thank you for reading, thanks for any comment,
greets!
As mentioned in the comments, your question isn't very specific, so I'll try to give you some hints about character encodings, see if you can apply those to your specific case!
Unicode and Encoding
Here's a small primer about encoding. Basically, there are two ways to represent text in Python:
unicode. You can consider unicode the universal, decoded representation of text; you should strive to use it everywhere internally. In Python 2.x source files, unicode strings look like u'some unicode'.
str. This is encoded text - to be able to read it, you need to know the encoding (or guess it). In Python 2.x, those strings look like 'some str'.
This changed in Python 3 (unicode is now str and str is now bytes).
How does that play out?
Usually, it's pretty straightforward to ensure that your code uses unicode for its execution, and uses str for I/O:
Everything you receive is encoded, so you do input_string.decode('encoding') to convert it to unicode.
Everything you need to output is unicode but needs to be encoded, so you do output_string.encode('encoding').
The most common encodings are cp1252 on Windows (on US or EU systems), and utf-8 on Linux.
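In code, the boundary pattern looks roughly like this (Python 2, with made-up sample data):
raw_in = 'caf\xc3\xa9'              # bytes as they arrive, here UTF-8 encoded
text = raw_in.decode('utf-8')       # u'caf\xe9': unicode for all internal work
raw_out = text.encode('cp1252')     # encode only at the output boundary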
Applying this to your case
I DO have to write äöü in a path, or it will not work
Windows natively uses unicode for file paths and names, so you should actually always use unicode for those.
It DOES have to be an ANSI-"encoded" file, or it will not work
When you write to the file, be sure to always run your output through output.encode('cp1252') (or whatever encoding ANSI would be on your system).
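A minimal Python 2 sketch of writing such a playlist, assuming cp1252 is your ANSI code page (the paths are made up):
# -*- coding: utf-8 -*-
import io
tracks = [u'C:\\Musik\\Grönemeyer\\Männer.mp3']   # keep paths as unicode internally
with io.open('playlist.m3u', 'w', encoding='cp1252') as f:
    for path in tracks:
        f.write(path + u'\n')                     # io.open() encodes to cp1252 on write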
Things like line.write(str.decode('utf-8')) break the funktion of the file
By now you probably realized that:
If str is indeed a str instance, Python will first decode it to unicode using the utf-8 encoding, but then try to encode it again (likely as ascii) to write it to the file.
If str is actually a unicode instance, Python will first encode it (likely as ascii, which will probably crash) in order to then decode it with utf-8.
Bottom line is, you need to know what str holds. If it's unicode, you should encode it. If it's already encoded, don't touch it (or decode it and then encode it if the encoding is not the one you want!).
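A tiny Python 2 sketch of those two traps (the sample string and file name are made up):
u = u'Gr\xf6nemeyer'
try:
    u.decode('utf-8')               # already unicode: Python implicitly *encodes* with ascii first
except UnicodeEncodeError:
    pass
with open('out.m3u', 'w') as f:
    try:
        f.write(u)                  # plain Python 2 file objects expect bytes: implicit ascii encode fails
    except UnicodeEncodeError:
        f.write(u.encode('cp1252')) # encode explicitly instead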
A magical comment at the beginning of the script like # -*- coding: iso-8859-1 -*- does nothing here (though it is helpful when it comes to the mentioned Metadata and allowed characters in it...)
Not a surprise, this only tells Python what encoding should be used to read your source file so that non-ascii characters are properly recognized.
Oh, and i'm using Python 2.7.3. Third-Party modules dependencies, you know...
Python 3 probably is a big update in terms of unicode and encoding, but that doesn't mean Python 2.x can't make it work!
Will that solve your issue?
You can't be sure, it's possible that the problem lies in the player you're using, not in your code.
Once you output it, you should make sure that your script's output is readable using reference tools (such as Windows Explorer). If it is, but the player still can't open it, you should consider updating to a newer version.
On Windows there is a special encoding available called mbcs; it converts between the current default ANSI codepage and Unicode.
For example on a Spanish Language PC:
u'ñ'.encode('mbcs') -> '\xf1'
'\xf1'.decode('mbcs') -> u'ñ'
On Windows, ANSI means the current default code page. For western European languages that is Windows-1252 (close to ISO-8859-1), for eastern European languages Windows-1250 (close to ISO-8859-2), and other code pages for other languages as appropriate.
More info available at:
https://docs.python.org/2.4/lib/standard-encodings.html
See also:
https://docs.python.org/2/library/sys.html#sys.getfilesystemencoding
# -*- coding comments declare the character encoding of the source code (and therefore of byte-string literals like 'abc').
Assuming that by "playlist" you mean m3u files, then based on this specification you may be at the mercy of the mp3 player software you are using. This spec says only that the files contain text, no mention of what character encoding.
I have personally observed that various mp3 encoding software will use different encodings for mp3 metadata. Some use UTF-8, others ISO-8859-1. So you may have to allow encoding to be specified in configuration and leave it at that.

How does file reading work in utf-8 encoding?

For input text files, I know that .seek and .tell both operate with bytes, usually - that is, .seek seeks a certain number of bytes in relation to a point specified by its given arguments, and .tell returns the number of bytes since the beginning of the file.
My question is: does this work the same way when using other encodings like utf-8? I know utf-8, for example, requires several bytes for some characters.
It would seem that if those methods still deal with bytes when parsing utf-8 files, then unexpected behavior could result (for instance, the cursor could end up inside of a character's multi-byte encoding, or a multi-byte character could register as several characters).
If so, are there other methods to do the same tasks? Especially for when parsing a file requires information about the cursor's position in terms of characters.
On the other hand, if you specify the encoding in the open() function ...
infile = open(filename, encoding='utf-8')
Does the behavior of .seek and .tell change?
Assuming you're using io.open() (not the same as the builtin open()), then using text mode gets you an instance of io.TextIOWrapper, so this should answer your question:
Text I/O over a binary storage (such as a file) is significantly slower than binary I/O over the same storage, because it implies conversions from unicode to binary data using a character codec. This can become noticeable if you handle huge amounts of text data (for example very large log files). Also, TextIOWrapper.tell() and TextIOWrapper.seek() are both quite slow due to the reconstruction algorithm used.
NOTE: You should also be aware that this still doesn't guarantee that seek() will skip over characters, but rather over unicode codepoints (a single character can be composed of more than one codepoint; for example ą can be written as u'\u0105' or u'a\u0328', and both will print the same character).
Source: http://docs.python.org/library/io.html#id1
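In other words (a small sketch using io.open(), assuming 'sample.txt' is an existing UTF-8 file): in text mode, tell() hands you an opaque position that is only meaningful when passed back to seek(), while read() deals in characters.
import io
with io.open('sample.txt', encoding='utf-8') as f:
    first = f.read(1)    # one character, even if it is several bytes on disk
    pos = f.tell()       # opaque cookie, not a character count
    chunk = f.read(3)
    f.seek(pos)          # only values obtained from tell() are safe here
    assert f.read(3) == chunk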
Some experimentation with UTF-8 files (repeatedly seeking and printing the result of .read(1) in a file with lots of multi-byte characters) revealed that yes, .seek() and .read() do behave differently when the file is opened with a UTF-8 encoding: they don't deal in single bytes, but in single characters. This consisted of several simple rewritings of the code, reading and seeking in different patterns.
Thanks to #satuon for your help.

Character Encoding

My text editor allows me to code in several different character formats: Ansi, UTF-8, UTF-8 (No BOM), UTF-16LE, and UTF-16BE.
What is the difference between them?
What is commonly regarded as the best format (I'm using Python, if that makes a difference)?
"Ansi" is a misnomer and usually refers to some 8-bit encoding that's the default on the current platform (on "western" Windows installations that's usually Windows-1252). It only supports a small set of characters (256 different characters at most).
UTF-8 is a variable-length, ASCII-compatible encoding capable of storing any and all Unicode characters. It's a pretty good choice for western text that should support all Unicode characters and a very viable choice in the general case.
"UTF-8 (no BOM)" is the name Windows gives to using UTF-8 without writing a Byte Order Marker. Since a BOM is not needed for UTF-8, it shouldn't be used and this would be the correct choice (pretty much everyone else calls this version simply "UTF-8"!).
UTF-16LE and UTF-16BE are the Little Endian and Big Endian versions of the UTF-16 encoding. Like UTF-8, UTF-16 is capable of representing any Unicode character; however, it is not ASCII-compatible.
Generally speaking UTF-8 is a great overall choice and has wide compatibility (just make sure not to write the BOM, because that's what most other software expects).
UTF-16 could take less space if the majority of your text is composed of non-ASCII characters (i.e. doesn't use the basic Latin alphabet).
"Ansi" should only be used when you have a specific need to interact with a legacy application that doesn't support Unicode.
An important thing about any encoding is that they are meta-data that need to be communicated in addition to the data. This means that you must know the encoding of some byte stream to interpret it as a text correctly. So you should either use formats that document the actual encoding used (XML is a prime example here) or standardize on a single encoding in a given context and use only that.
For example, if you start a software project, then you can specify that all your source code is in a given encoding (again: I suggest UTF-8) and stick with that.
For Python files specifically, there's a way to specify the encoding of your source files.
Note that "ANSI" is usually CP1252.
You'll probably get the greatest utility with UTF-8 (No BOM). Forget that ANSI and ASCII exist; they are deprecated dinosaurs.
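If you want to see the BOM question concretely, a short Python 3 sketch (file names are arbitrary): the 'utf-8' codec writes no BOM, while 'utf-8-sig' prepends one.
text = 'naïve café'
with open('no_bom.txt', 'w', encoding='utf-8') as f:       # plain UTF-8, no BOM
    f.write(text)
with open('with_bom.txt', 'w', encoding='utf-8-sig') as f: # UTF-8 with a BOM (EF BB BF)
    f.write(text)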
