Python not able to open file with non-english characters in path

I have a file with the following path : D:/bar/クレイジー・ヒッツ！/foo.abc
I am parsing the path from an XML file and storing it in a variable called path in the form of file://localhost/D:/bar/クレイジー・ヒッツ！/foo.abc
Then, the following operations are being done:
path=path.strip()
path=path[17:] #to remove the file://localhost/ part
path=urllib.url2pathname(path)
path=urllib.unquote(path)
The error is:
IOError: [Errno 2] No such file or directory: 'D:\\bar\\\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81\\foo.abc'
I am using Python 2.7 on Windows 7

The path in your error is:
'\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81'
I think this is the UTF8 encoded version of your filename.
I've created a folder of the same name on Windows7 and placed a file called 'abc.txt' in it:
>>> a = '\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81'
>>> os.listdir('.')
['?????\xb7???!']
>>> os.listdir(u'.') # Pass unicode to have unicode returned to you
[u'\u30af\u30ec\u30a4\u30b8\u30fc\u30fb\u30d2\u30c3\u30c4\uff01']
>>>
>>> a.decode('utf8') # UTF8 decoding your string matches the listdir output
u'\u30af\u30ec\u30a4\u30b8\u30fc\u30fb\u30d2\u30c3\u30c4\uff01'
>>> os.listdir(a.decode('utf8'))
[u'abc.txt']
So it seems that Duncan's suggestion of path.decode('utf8') does the trick.
Update
I can't test this for you, but I suggest that you try checking whether the path contains non-ascii before doing the .decode('utf8'). This is a bit hacky...
ASCII_TRANS = '_'*32 + ''.join([chr(x) for x in range(32,126)]) + '_'*130
path=path.strip()
path=path[17:] #to remove the file://localhost/ part
path=urllib.unquote(path)
if path.translate(ASCII_TRANS) != path: # Contains non-ascii
    path = path.decode('utf8')
path=urllib.url2pathname(path)
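
For readers on Python 3, the same steps can be sketched without slicing off a fixed number of characters. This is a minimal illustration, not the asker's code: the helper name file_url_to_path is hypothetical, and it assumes the URL stored in the XML is percent-encoded UTF-8.

```python
from urllib.parse import unquote, urlparse


def file_url_to_path(url):
    """Turn a file:// URL into a text path (hypothetical helper)."""
    parsed = urlparse(url.strip())
    # unquote() decodes %XX escapes as UTF-8 by default and returns text.
    path = unquote(parsed.path)
    # A Windows drive path arrives as "/D:/..."; drop the leading slash.
    if len(path) > 2 and path[0] == "/" and path[2] == ":":
        path = path[1:]
    return path


# Assumed input: the folder name from the question, percent-encoded as UTF-8.
url = (
    "file://localhost/D:/bar/"
    "%E3%82%AF%E3%83%AC%E3%82%A4%E3%82%B8%E3%83%BC%E3%83%BB"
    "%E3%83%92%E3%83%83%E3%83%84%EF%BC%81/foo.abc"
)
print(file_url_to_path(url))
```

Because the result is already a text string, it can be passed to open() directly, with no separate decode step.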

Provide the filename as a unicode string to the open call.
How do you produce the filename?
if provided as a constant by you
Add a line near the beginning of your script:
# -*- coding: utf8 -*-
Then, in a UTF-8 capable editor, set path to the unicode filename:
path = u"D:/bar/クレイジー・ヒッツ！/foo.abc"
read from a list of directory contents
Retrieve the contents of the directory using a unicode dirspec:
dir_files= os.listdir(u'.')
read from a text file
Open the file that contains the filenames using codecs.open to read unicode data from it. You need to specify the encoding of that file yourself (you are the one who knows which “default Windows charset” for non-Unicode applications is in use on your computer).
in any case
Do a:
path= path.decode("utf8")
before opening the file; substitute the correct encoding if not "utf8".

Here's some interesting stuff from the documentation:
sys.getfilesystemencoding()
Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used. The result value depends on the operating system: On Mac OS X, the encoding is 'utf-8'. On Unix, the encoding is the user’s preference according to the result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET) failed. On Windows NT+, file names are Unicode natively, so no conversion is performed. getfilesystemencoding() still returns 'mbcs', as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names. On Windows 9x, the encoding is 'mbcs'.
New in version 2.3.
If I understand this correctly, you should pass the file name as unicode:
f = open(unicode(path, encoding))
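
In Python 3 there is no unicode() builtin; the equivalent step is bytes.decode(). The sketch below uses the folder-name bytes from the error message in the question. The helper name to_text and the default of UTF-8 are assumptions made for illustration.

```python
# Bytes of the folder name exactly as they appear in the IOError message.
raw = (
    b"\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc"
    b"\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81"
)


def to_text(path, encoding="utf-8"):
    """Return a text path; decode only when given bytes (hypothetical helper)."""
    if isinstance(path, bytes):
        return path.decode(encoding)
    return path


folder = to_text(raw)
print(ascii(folder))  # shows the code points as \uXXXX escapes
```

Substitute the real encoding of your data if it is not UTF-8; decoding with the wrong codec either raises or silently produces a different name.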

Related

File name encoding in Python 2.7

I want to read files with special file names in Python (2.7). But whatever I try, it always fails to open them.
The filenames are
F\xA8\xB9hrerschein
and
Gro\xDFhandel
I know, the encoding was done with one of several codepages. I could try to find out which one and try to convert it and all the mumbo jumbo, but I don't want that.
Can't I somehow tell python to open that file without having to go through all that encoding stuff? I mean opening the file by its raw name in bytes?

After all, I fixed it with
reload(sys)
sys.setdefaultencoding('utf-8')
and setting the environment variable
LANG="C.UTF-8"
Thanks for the hints.

One way is to use os.listdir(). See the following example.
Add some data to a file with non-ascii character 0xdf in the name:
$ echo abcd > `printf "A\xdfA"`
Check that the file contains a non-ascii character:
$ ls A*
A?A
Start Python, read the directory and open the first file (which is the one with the non-ascii character):
$ python
>>> import os
>>> d = os.listdir('.')
>>> d
['A\xdfA']
>>> f = open(d[0])
>>> f.readline()
'abcd\n'
>>>

If you have source code like
with open('Großhandel') as input:
    #stuff
You should look at Source Code Encodings and write
#!python2
# -*- coding: utf-8 -*-
with open('Großhandel') as input:
    …
It is worth mentioning that the authors of PEP-263 are Marc-André Lemburg and Martin von Löwis, which I suppose makes the push toward defined source encodings back in 2002 slightly more understandable.

Under Linux, filenames can be encoded in any character encoding. When opening a file, you must use the exact name encoded to match.
I.e. If the filename is Großhandel.txt encoded using UTF-8, it must be encoded as Gro\xc3\x9fhandel.txt.
If you pass a Unicode string to open(), the user's locale is used to encode the filename, which may match the filename.
Under OS X, UTF-8 encoding is enforced. Under Windows, the character encoding is abstracted by the i/o drivers. A Unicode object passed to open() should always be used for these Operating Systems, where it'll be converted appropriately.
If you're reading filenames from the filesystem, it would be useful to get decoded Unicode filenames to pass straight to open() - Well, you can pass Unicode strings to os.listdir().
E.g.
Locale: LANG=en_GB.UTF-8
A directory with the following files, with their filenames encoded to UTF-8:
test.txt
€.txt
When running Python 2.7 using a string:
>>> os.listdir(".")
['\xe2\x82\xac.txt', 'test.txt']
Using a Unicode path:
>>> os.listdir(u".")
[u'\u20ac.txt', u'test.txt']
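
The same "type in, type out" rule survives in Python 3, with bytes and str taking the place of str and unicode. A minimal sketch, assuming a filesystem that accepts the € character in names; it works in a temporary directory so it does not touch real files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one ASCII name and one non-ASCII name, as in the example above.
    for name in ("test.txt", "\u20ac.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("x")

    text_names = sorted(os.listdir(tmp))               # str path in, str names out
    byte_names = sorted(os.listdir(os.fsencode(tmp)))  # bytes path in, bytes names out

print(text_names)
print(byte_names)
```

os.fsencode() converts a text name to the byte form the operating system uses, so the two lists describe the same files.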

Switching to Python 3 causing UnicodeDecodeError

I've just added Python3 interpreter to Sublime, and the following code stopped working:
for directory in directoryList:
    fileList = os.listdir(directory)
    for filename in fileList:
        filename = os.path.join(directory, filename)
        currentFile = open(filename, 'rt')
        for line in currentFile: ##Here comes the exception.
            currentLine = line.split(' ')
            for word in currentLine:
                if word.lower() not in bigBagOfWords:
                    bigBagOfWords.append(word.lower())
        currentFile.close()
I get a following exception:
  File "/Users/Kuba/Desktop/DictionaryCreator.py", line 11, in <module>
    for line in currentFile:
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 305: ordinal not in range(128)
I found this rather strange, because as far as I know Python3 is supposed to support utf-8 everywhere. What's more, the exact same code works with no problems on Python2.7. I've read about adding the environment variable PYTHONIOENCODING, and I tried it, but to no avail (however, it appears it is not that easy to add an environment variable in OS X Mavericks, so maybe I did something wrong when adding the variable? I modified /etc/launchd.conf)

Python 3 decodes text files when reading, encodes when writing. The default encoding is taken from locale.getpreferredencoding(False), which evidently for your setup returns 'ASCII'. See the open() function documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Instead of relying on a system setting, you should open your text files using an explicit codec:
currentFile = open(filename, 'rt', encoding='latin1')
where you set the encoding parameter to match the file you are reading.
Python 3 supports UTF-8 as the default for source code.
The same applies to writing to a writeable text file; data written will be encoded, and if you rely on the system encoding you are liable to get UnicodeEncodeError exceptions unless you explicitly set a suitable codec. What codec to use when writing depends on what text you are writing and what you plan to do with the file afterward.
You may want to read up on Python 3 and Unicode in the Unicode HOWTO, which explains both about source code encoding and reading and writing Unicode data.
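
A minimal sketch of naming the codec on both the writing and the reading side, so the result does not depend on the locale. The file name sample.txt and the sample text are made up for illustration.

```python
import os
import tempfile

text = "Gro\u00dfhandel \u20ac"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Name the codec explicitly instead of relying on the locale default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()
    # Binary mode shows the bytes that actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(round_trip == text, raw)
```

Reading the same file with encoding='ascii' would raise UnicodeDecodeError, which is the failure described in the question.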

"as far as I know Python3 is supposed to support utf-8 everywhere ..."
Not true. I have python 3.6 and my default encoding is NOT utf-8.
To change it to utf-8 in my code I use:
import locale
def getpreferredencoding(do_setlocale = True):
    return "utf-8"
locale.getpreferredencoding = getpreferredencoding
as explained in
Changing the “locale preferred encoding” in Python 3 in Windows

In general, I found 3 ways to fix Unicode related Errors in Python3:
Use the encoding explicitly like currentFile = open(filename, 'rt',encoding='utf-8')
Since bytes carry no encoding, convert the string data to bytes before writing it to the file, like data = 'string'.encode('utf-8')
Especially in a Linux environment, check $LANG. This issue usually arises when LANG=C, which makes the default encoding 'ascii' instead of 'utf-8'. One can change it to another appropriate value like LANG='en_IN'

UnicodeDecodeError when performing os.walk

I am getting the error:
'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)
when trying to do os.walk. The error occurs because some of the files in a directory have the 0x8b (non-utf8) character in them. The files come from a Windows system (hence the utf-16 filenames), but I have copied the files over to a Linux system and am using python 2.7 (running in Linux) to traverse the directories.
I have tried passing a unicode start path to os.walk, and all the files & dirs it generates are unicode names until it comes to a non-utf8 name, and then for some reason, it doesn't convert those names to unicode and then the code chokes on the utf-16 names. Is there any way to solve the problem short of manually finding and changing all the offensive names?
If there is not a solution in python2.7, can a script be written in python3 to traverse the file tree and fix the bad filenames by converting them to utf-8 (by removing the non-utf8 chars)? N.B. there are many non-utf8 chars in the names besides 0x8b, so it would need to work in a general fashion.
UPDATE: The fact that 0x8b is still only a byte char (just not valid ascii) makes it even more puzzling. I have verified that there is a problem converting such a string to unicode, but that a unicode version can be created directly. To wit:
>>> test = 'a string \x8b with non-ascii'
>>> test
'a string \x8b with non-ascii'
>>> unicode(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 9: ordinal not in range(128)
>>>
>>> test2 = u'a string \x8b with non-ascii'
>>> test2
u'a string \x8b with non-ascii'
Here's a traceback of the error I am getting:
80. for root, dirs, files in os.walk(unicode(startpath)):
File "/usr/lib/python2.7/os.py" in walk
294. for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
294. for x in walk(new_path, topdown, onerror, followlinks):
File "/usr/lib/python2.7/os.py" in walk
284. if isdir(join(top, name)):
File "/usr/lib/python2.7/posixpath.py" in join
71. path += '/' + b
Exception Type: UnicodeDecodeError at /admin/casebuilder/company/883/
Exception Value: 'ascii' codec can't decode byte 0x8b in position 14: ordinal not in range(128)
The root of the problem occurs in the list of files returned from listdir (on line 276 of os.walk):
names = listdir(top)
The names with chars >= 128 are returned as non-unicode strings.

Right I just spent some time sorting through this error, and wordier answers here aren't getting at the underlying issue:
The problem is, if you pass a unicode string into os.walk(), then os.listdir() returns unicode for every name it can decode but leaves undecodable names as byte strings. When os.walk joins the unicode directory path with one of those byte-string names, Python 2 implicitly decodes the bytes as ASCII (hence the 'ascii' decode error), and the first non-ASCII byte throws the exception.
The solution is to force the starting path you pass to os.walk to be a regular string - i.e. os.walk(str(somepath)). This means os.listdir returns regular byte-like strings and everything works the way it should.
You can reproduce this problem (and show its solution works) trivially like:
Go into bash in some directory and run touch $(echo -e "\x8b\x8bThis is a bad filename") which will make some test files.
Now run the following Python code (iPython Qt is handy for this) in the same directory:
l = []
for root,dir,filenames in os.walk(unicode('.')):
    l.extend([ os.path.join(root, f) for f in filenames ])
print l
And you'll get a UnicodeDecodeError.
Now try running:
l = []
for root,dir,filenames in os.walk('.'):
    l.extend([ os.path.join(root, f) for f in filenames ])
print l
No error and you get a print out!
Thus the safe way in Python 2.x is to make sure you only pass byte strings (str) to os.walk(). You should not pass unicode, or anything that might be unicode, to it, because os.walk will then choke when an implicit ascii conversion fails.

This problem stems from two fundamental problems. The first is the fact that the Python 2.x default encoding is 'ascii', while the default Linux encoding is 'utf8'. You can verify these encodings via:
sys.getdefaultencoding() #python
sys.getfilesystemencoding() #OS
When the os module functions that return directory contents, namely os.walk & os.listdir, produce a list containing both ascii-only filenames and non-ascii filenames, the ascii-encoded filenames are converted automatically to unicode. The others are not. Therefore, the result is a list containing a mix of unicode and str objects. It is the str objects that can cause problems down the line. Since they are not ascii, python has no way of knowing what encoding to use, and therefore they can't be decoded automatically into unicode.
Therefore, when performing common operations such as os.path.join(dir, file), where dir is unicode and file is an encoded str, the call will fail if file is not ascii-encoded (the default). The solution is to check each filename as soon as it is retrieved and decode the str (encoded) objects to unicode using the appropriate encoding.
That's the first problem and its solution. The second is a bit trickier. Since the files originally came from a Windows system, their filenames probably use an encoding called windows-1252. An easy means of checking is to call:
filename.decode('windows-1252')
If a valid unicode version results you probably have the correct encoding. You can further verify by calling print on the unicode version as well and see the correct filename rendered.
One last wrinkle. In a Linux system with files of Windows origin, it is possible or even probable to have a mix of windows-1252 and utf8 encodings. There are two means of dealing with this mixture. The first and preferable is to run:
$ convmv -f windows-1252 -t utf8 -r DIRECTORY --notest
where DIRECTORY is the one containing the files needing conversion. This command will convert any windows-1252 encoded filenames to utf8. It does a smart conversion, in that if a filename is already utf8 (or ascii), it will do nothing.
The alternative (if one cannot do this conversion for some reason) is to do something similar on the fly in python. To wit:
def decodeName(name):
    if type(name) == str: # leave unicode ones alone
        try:
            name = name.decode('utf8')
        except UnicodeDecodeError: # not valid utf8, fall back to windows-1252
            name = name.decode('windows-1252')
    return name
The function tries a utf8 decoding first. If it fails, then it falls back to the windows-1252 version. Use this function after a os call returning a list of files:
for root, dirs, files in os.walk(path):
    files = [decodeName(f) for f in files]
    # do something with the unicode filenames now
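
A Python 3 rendering of the same fallback idea, offered as a sketch rather than as the answer's own code. The name decode_name and the final surrogateescape fallback are additions for illustration: Python's windows-1252 codec rejects a few undefined bytes such as 0x81, so a last resort is needed to avoid raising.

```python
def decode_name(name, encodings=("utf-8", "windows-1252")):
    """Decode a bytes filename, trying each encoding in order (hypothetical helper)."""
    if isinstance(name, str):  # already text: leave it alone
        return name
    for encoding in encodings:
        try:
            return name.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep the bytes recoverable instead of raising.
    return name.decode("utf-8", "surrogateescape")


print(ascii(decode_name(b"Gro\xc3\x9fhandel")))  # valid UTF-8
print(ascii(decode_name(b"Gro\xdfhandel")))      # falls back to windows-1252
```

Trying UTF-8 first matters: almost any byte sequence decodes under windows-1252, so putting it first would mask correctly encoded UTF-8 names.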
I personally found the entire subject of unicode and encoding very confusing, until I read this wonderful and simple tutorial:
http://farmdev.com/talks/unicode/
I highly recommend it for anyone struggling with unicode issues.

I can reproduce the os.listdir() behavior: os.listdir(unicode_name) returns undecodable entries as bytes on Python 2.7:
>>> import os
>>> os.listdir(u'.')
[u'abc', '<--\x8b-->']
Notice: the second name is a bytestring despite listdir()'s argument being a Unicode string.
A big question remains however - how can this be solved without resorting to this hack?
Python 3 handles bytes in filenames that cannot be decoded using the filesystem's character encoding via the surrogateescape error handler (os.fsencode/os.fsdecode). See PEP-383: Non-decodable Bytes in System Character Interfaces:
>>> os.listdir(u'.')
['abc', '<--\udc8b-->']
Notice: both strings are Unicode (Python 3), and the surrogateescape error handler was used for the second name. To get the original bytes back:
>>> os.fsencode('<--\udc8b-->')
b'<--\x8b-->'
In Python 2, use Unicode strings for filenames on Windows (Unicode API), OS X (utf-8 is enforced) and use bytestrings on Linux and other systems.
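
The surrogateescape round trip can also be tried directly on a bytes object in Python 3, without touching the filesystem. This sketch names UTF-8 explicitly as an assumption; os.fsdecode and os.fsencode use the filesystem encoding instead, which may differ on your system.

```python
raw = b"<--\x8b-->"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")
print(restored)
```

Note that such a string cannot be encoded with the default strict handler, so printing it or writing it to a text file may raise UnicodeEncodeError.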

\x8b is not a valid utf-8 byte sequence on its own. os.walk expects the filenames to be in utf-8 when it is given a unicode start path. If you want to access invalid filenames, you have to pass os.walk a non-unicode startpath; this way the os module will not do the utf8 decoding. You would have to do it yourself and decide what to do with the filenames that contain incorrect characters.
I.e.:
for root, dirs, files in os.walk(startpath.encode('utf8')):

After examination of the source of the error, something happens within the C-code routine listdir which returns non-unicode filenames when they are not standard ascii. The only fix therefore is to do a forced decode of the directory list within os.walk, which requires a replacement of os.walk. This replacement function works:
def asciisafewalk(top, topdown=True, onerror=None, followlinks=False):
    """
    duplicate of os.walk, except we do a forced decode after listdir
    """
    islink, join, isdir = os.path.islink, os.path.join, os.path.isdir
    try:
        # Note that listdir and error are globals in this module due
        # to earlier import-*.
        names = os.listdir(top)
        # force non-ascii text out
        names = [name.decode('utf8','ignore') for name in names]
    except os.error, err:
        if onerror is not None:
            onerror(err)
        return
    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)
    if topdown:
        yield top, dirs, nondirs
    for name in dirs:
        new_path = join(top, name)
        if followlinks or not islink(new_path):
            for x in asciisafewalk(new_path, topdown, onerror, followlinks):
                yield x
    if not topdown:
        yield top, dirs, nondirs
By adding the line:
names = [name.decode('utf8','ignore') for name in names]
all the names are proper ascii & unicode, and everything works correctly.
A big question remains however - how can this be solved without resorting to this hack?

I got this problem when using os.walk on some directories with Chinese (unicode) names. I implemented the walk function myself as follows, which worked fine with unicode dir/file names.
import os
ft = []
def walk(dir, cur):
    fl = os.listdir(dir)
    for f in fl:
        full_path = os.path.join(dir,f)
        if os.path.isdir(full_path):
            walk(full_path, cur)
        else:
            path, filename = full_path.rsplit('/',1)
            ft.append((path, filename, os.path.getsize(full_path)))

Python 2.7: Setting I/O Encoding, ‚Äô?

Attempting to write a line to a text file in Python 2.7, and have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap') # Note the strangely shaped apostrophe
However, in output.txt, I get Smith‚Äôs BaseBall Cap, instead. Not sure how to correct this encoding problem? Any protips with this sort of issue?

You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
In Mac OS Roman, those three bytes display as ‚Äô.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
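
The byte-level explanation above can be checked in a few lines of Python 3. This sketch uses Python's built-in mac_roman codec to imitate a viewer that misreads the file.

```python
apostrophe = "\u2019"                     # RIGHT SINGLE QUOTATION MARK
utf8_bytes = apostrophe.encode("utf-8")   # the three bytes written to the file
misread = utf8_bytes.decode("mac_roman")  # the same bytes viewed as Mac OS Roman

print(utf8_bytes)
print(ascii(misread))
```

The misread string is three characters long, one per byte, which is why a single apostrophe turns into three symbols in the viewer.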

There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: Use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string with u'Smith’s BaseBall Cap'.

open file with a unicode filename?

I don't seem to be able to open a file which has a unicode filename. Lets say I do:
for i in os.listdir():
    open(i, 'r')
When I try to search for some solution, I always get pages about how to read and write a unicode string to a file, not how to open a file with file() or open() which has a unicode name.

Simply pass open() a unicode string for the file name:
In Python 2.x:
>>> open(u'someUnicodeFilenameλ')
<open file u'someUnicodeFilename\u03bb', mode 'r' at 0x7f1b97e70780>
In Python 3.x, all strings are Unicode, so there is literally nothing to it.
As always, note that the best way to open a file is always using the with statement in conjunction with open().
Edit: With regards to os.listdir() the advice again varies, under Python 2.x, you have to be careful:
os.listdir(), which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem’s encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames.
Source
So in short, if you want Unicode out, put Unicode in:
>>> os.listdir(".")
['someUnicodeFilename\xce\xbb', 'old', 'Dropbox', 'gdrb']
>>> os.listdir(u".")
[u'someUnicodeFilename\u03bb', u'old', u'Dropbox', u'gdrb']
Note that the file will still open either way - it won't be represented well within Python as it'll be an 8-bit string, but it'll still work.
>>> open('someUnicodeFilename\xce\xbb')
<open file 'someUnicodeFilenameλ', mode 'r' at 0x7f1b97e70660>
Under 3.x, as always, it's always Unicode.
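
A short Python 3 sketch of the whole round trip, using the with statement as recommended above. It runs in a temporary directory, reuses the file name from the transcript, and assumes a filesystem that accepts the λ character.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "someUnicodeFilename\u03bb")
    with open(path, "w", encoding="utf-8") as f:
        f.write("abcd\n")
    # The with statement closes the file even if reading raises.
    with open(path, encoding="utf-8") as f:
        first_line = f.readline()
    names = os.listdir(tmp)  # str path in, so str names out

print(names, repr(first_line))
```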

You can try this:
import os
import sys
for filename in os.listdir(u"/your-direcory-path/"):
    open(filename.encode(sys.getfilesystemencoding()), "r")
