handling Unicode filenames in Python 3.4 on Windows

I'm trying to find a reliable way to scan files on Windows in Python, while allowing for the possibility that there may be various Unicode code points in the filenames. I've seen several proposed solutions to this problem, but none of them work for all of the actual issues that I've encountered in scanning filenames created by real-world software and users.
The code sample below is an attempt to isolate and demonstrate the core issue. It creates three files in a subfolder with the sorts of variations I've encountered, and then attempts to scan through that folder and display each filename followed by the file's contents. It crashes on the attempt to read the third test file, with OSError [Errno 22] Invalid argument.
import os
# create files in .\temp that demonstrate various issues encountered in the wild
tempfolder = os.getcwd() + '\\temp'
if not os.path.exists(tempfolder):
    os.makedirs(tempfolder)
print('file contents', file=open('temp/simple.txt','w'))
print('file contents', file=open('temp/with a ® symbol.txt','w'))
print('file contents', file=open('temp/with these chars ΣΑΠΦΩ.txt','w'))
# goal is to scan the files in a manner that allows for printing
# the filename as well as opening/reading the file ...
for root,dirs,files in os.walk(tempfolder.encode('UTF-8')):
    for filename in files:
        fullname = os.path.join(tempfolder.encode('UTF-8'), filename)
        print(fullname)
        print(open(fullname,'r').read())
As it says in the code, I just want to be able to display the filenames and open/read the files. Regarding display of the filename, I don't care whether the Unicode characters are rendered correctly for the special cases. I just want to print the filename in a manner that uniquely identifies which file is being processed, and doesn't throw an error for these unusual sorts of filenames.
If you comment out the final line of code, the approach shown here will display all three filenames with no errors. But it won't open the file with miscellaneous Unicode in the name.
Is there a single approach that will reliably display/open all three of these filename variations in Python? I'm hoping there is, and my limited grasp of Unicode subtleties is preventing me from seeing it.

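The following works fine, provided you save the file in the declared encoding and use an IDE or terminal encoding that supports the characters being displayed. The key change to the scanning loop is that it passes the path to os.walk as a str rather than as UTF-8-encoded bytes, so the filenames come back as str as well. Note that the source encoding does not have to be UTF-8; the declaration at the top of the file only states the encoding of the source file.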
#coding:utf8
import os
# create files in .\temp that demonstrate various issues encountered in the wild
tempfolder = os.path.join(os.getcwd(),'temp')
if not os.path.exists(tempfolder):
    os.makedirs(tempfolder)
print('file contents', file=open('temp/simple.txt','w'))
print('file contents', file=open('temp/with a ® symbol.txt','w'))
print('file contents', file=open('temp/with these chars ΣΑΠΦΩ.txt','w'))
# goal is to scan the files in a manner that allows for printing
# the filename as well as opening/reading the file ...
for root,dirs,files in os.walk(tempfolder):
    for filename in files:
        # join with root so that files in subfolders resolve too
        fullname = os.path.join(root, filename)
        print(fullname)
        print(open(fullname,'r').read())
Output:
c:\temp\simple.txt
file contents
c:\temp\with a ® symbol.txt
file contents
c:\temp\with these chars ΣΑΠΦΩ.txt
file contents
If you use a terminal that does not support encoding the characters used in the filename, you will get UnicodeEncodeError. Change:
print(fullname)
to:
print(ascii(fullname))
and you will see that the filename was read correctly, but just couldn't print one or more symbols in the terminal encoding:
'C:\\temp\\simple.txt'
file contents
'C:\\temp\\with a \xae symbol.txt'
file contents
'C:\\temp\\with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt'
file contents
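The approach above can be condensed into a small helper. This is only a sketch: the name scan_files and the use of a temporary folder are illustrative assumptions, not part of the original code. It passes str paths throughout, joins each name with the root that os.walk reports, and returns the name through ascii() so that printing it cannot fail in a limited terminal encoding.

```python
import os
import tempfile

def scan_files(folder):
    """Yield (printable_name, contents) for every file below folder."""
    # folder is a str, so os.walk also yields str names
    for root, dirs, files in os.walk(folder):
        for filename in files:
            fullname = os.path.join(root, filename)
            with open(fullname, 'r', encoding='utf-8') as f:
                # ascii() escapes non-ASCII characters, so print() cannot
                # hit UnicodeEncodeError on the returned name
                yield ascii(fullname), f.read()

if __name__ == '__main__':
    names = ['simple.txt',
             'with a \u00ae symbol.txt',
             'with these chars \u03a3\u0391\u03a0\u03a6\u03a9.txt']
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
                f.write('file contents\n')
        for printable, contents in scan_files(tmp):
            print(printable)
            print(contents, end='')
```

The explicit encoding='utf-8' applies to the file contents only; it has no bearing on how the filenames are handled.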

Related

Python: How do I interact with unicode filenames on Windows? (Python 2.7)

My problem:
Start with US Windows 10 install
Create a Japanese filename in Windows explorer
Open the Python shell, and os.listdir('.')
The listed filename is full of question marks.
os.path.exists() unsurprisingly reports file not found.
NTFS stores the filename as Unicode. I'm sure if I used the win32api CreateFile() series of functions I will get my Unicode filename back, however those APIs are too cumbersome (and not portable). I'd prefer that I get utf-8 encoded filenames, or the Unicode bytes from the FS directory structure, but in default mode this doesn't seem to happen.
I have tried playing around with setlocale() but I haven't stumbled upon the correct arguments to make my program work. I do not want to (and cannot) install additional code pages onto the Windows machine. This needs to work with a stock install of Windows.
Please note this has nothing to do with the console. A repr() shows that the ? chars that end up in the filename listed by os.listdir('.') are real question marks and not some display artifact. I assume they have been added by the API that listdir() uses under the hood.

You may be getting ?s while displaying that filename in the console using os.listdir() but you can access that filename without any problems as internally everything is stored in binary. If you are trying to copy the filename and paste it directly in python, it will be interpreted as mere question marks...
If you want to open that file and perform any operations, then, have a look at this...
files = os.listdir(".")
# Possible output:
# ["a.txt", "file.py", ..., "??.html"]
filename = files[-1] # The last file in this case
f = open(filename, 'r')
# Sample file operation
lines = f.readlines()
print(lines)
f.close()
EDIT:
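In Python 2, you need to pass the current path as a Unicode string: os.listdir(u'.'), where . means the current directory. The call then returns the filenames as Unicode strings, rather than as byte strings in which characters that the system code page cannot represent have been replaced by question marks.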
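In Python 3 the same rule holds, except that ordinary string literals are already Unicode: os.listdir returns str names for a str path and bytes names for a bytes path. A minimal sketch of that behaviour follows (the helper name name_types is made up for illustration):

```python
import os
import tempfile

def name_types(path):
    """Return the set of types that os.listdir yields for this path."""
    return {type(name) for name in os.listdir(path)}

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, 'a.txt'), 'w').close()
        print(name_types(tmp))               # str path -> str names
        print(name_types(os.fsencode(tmp)))  # bytes path -> bytes names
```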

How to use os.walk to only list text files

A related question addressed hidden filetypes. I am struggling with a similar problem: I need to process only files that contain text, in folders that hold many different filetypes (pictures, text, music). I am using os.walk, which lists EVERYTHING, including files without an extension, such as Icon files. I am using Linux and would be satisfied to filter for only txt files. One way is to check the filename extension, and another post explains nicely how that is done.
But this still leaves mislabeled files or files without an extension. There are hex values that uniquely identify filetypes, known as magic numbers or file signatures. Unfortunately, magic numbers do not exist for text files.
One strategy that I have come up with is to parse the first bunch of characters and make sure they are words by doing a dictionary lookup (I am only dealing with English texts), then only proceed with the full text processing if that is true. This approach seems rather heavy and expensive (doing a bunch of dictionary lookups for each file). Another approach is simply to look for the word 'the', which is unlikely to be frequent in a data file but is commonly found in text files. But false negatives would cause me to lose text files for processing. I tried asking Google for the longest text without the word 'the' but had no luck with that.
I do not know if this is the appropriate forum for this kind of question; it's almost a question of AI rather than computer science/coding. It's not as difficult as gibberish detection. The texts may not be semantically or syntactically correct: they might just be words, like the inventory of a stockroom, but they might also be prose and poetry. I just do not want to process files that could be byte code, source code, or collections of alphanumeric characters that are not English words.

You can use Python's mimetypes library to check whether a file is a plaintext file.
import os
import mimetypes
for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        if mimetypes.guess_type(filename)[0] == 'text/plain':
            print(os.path.join(dirpath, filename))
UPDATE: Since the mimetypes library uses file extension to determine the type of file, it is not very reliable, especially since you mentioned that some files are mislabeled or without extensions.
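A quick check, offered as a sketch with made-up filenames, shows that guess_type looks only at the name and never opens the file, so a name without an extension yields no type at all:

```python
import mimetypes

# guess_type inspects only the name; none of these files has to exist
for name in ('notes.txt', 'photo.jpg', 'README'):
    print(name, mimetypes.guess_type(name)[0])
```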
For those cases you can use the magic library (which is not in the standard library unfortunately).
import os
import magic
mime = magic.Magic(mime=True)
for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        fullpath = os.path.join(dirpath, filename)
        if mime.from_file(fullpath) == 'text/plain':
            print(fullpath)
UPDATE 2: The above solution wouldn't catch files you would otherwise consider "plaintext" (e.g. XML files, source files, etc). The following solution should work in those cases:
import os
import magic
for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        fullpath = os.path.join(dirpath, filename)
        if 'text' in magic.from_file(fullpath):
            print(fullpath)
Let me know if any of these works for you.

A pretty good heuristic is to look for null bytes at the beginning of the file. Text files don't typically have them, and binary files usually have lots of them. The code below checks that the first 1K bytes contain no nulls. You can of course adjust how much of the file to read:
#!python3
import os
def textfiles(root):
    for path,dirs,files in os.walk(root):
        for file in files:
            fullname = os.path.join(path,file)
            with open(fullname,'rb') as f:
                data = f.read(1024)
            if 0 not in data:
                yield fullname
for file in textfiles('.'):
    print(file)
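One possible refinement, which is not part of the answer above, is to also require that the sample decodes as UTF-8. The sketch below assumes the text files are UTF-8 (or plain ASCII), and the helper name looks_like_text is made up. It uses an incremental decoder so that a multi-byte character cut off at the end of the sample is not reported as an error.

```python
import codecs

def looks_like_text(data):
    """Heuristic: no null bytes, and the sample is valid UTF-8 so far."""
    if 0 in data:
        return False
    decoder = codecs.getincrementaldecoder('utf-8')()
    try:
        # final=False tolerates a truncated sequence at the end of the sample
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

It would be called on the same first 1K bytes, for example looks_like_text(f.read(1024)) on a file opened in 'rb' mode.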

ExpatError: not well-formed (invalid token) when using SimpleXMLRPCServer caused by diacritic characters

It took me a long while to pinpoint the specific cause of this bug. I am writing a simple XML-RPC server that allows directory listing and possibly other read-only operations. I already wrote a simple method that lists all folders and files and represents them as a dictionary:
def list_dir(self, dirname):
    """Returns list of files and directories as a dictionary, where key is name and values is either 'file' or 'dir'"""
    dirname = os.path.abspath(os.path.join(self.server.cwd,dirname))
    #Check that the path doesn't lead above
    if dirname.find(self.server.cwd)==-1:
        raise SecurityError("There was an attempt to browse in %s which is above the root working directory %s."%(dirname, self.server.cwd))
    check_for_valid_directory(dirname)
    #Looping through directory
    files = [i for i in os.listdir(dirname)]
    #Create associative array
    result = {}
    #Iterate through files
    for f in files:
        fullpath = os.path.join(dirname, f)
        #Appending directories
        if os.path.isdir(fullpath):
            result[f] = "dir"
        else:
            result[f] = "file"
    print "Sending data", result
    return result
Now when the directory contains a file (or rather a folder) named Nová složka, the client receives an error instead of the desired list. When I removed the problematic filename, I received the data with no errors. I don't think the Python library has this right: either the argument conversion should be complete, including any unicode stuff, or not present at all.
But anyway, how should I encode the data that the Python library can't handle?

You have to make sure the filenames and paths are unicode objects and that all filenames use the correct encoding. The last part may be a bit tricky, as POSIX filenames are byte strings and there is no requirement that all filenames on a partition be encoded with the same encoding. In that case there is not much you can do other than decoding the names yourself and handling errors somehow, or returning the filenames as binary data instead of (unicode) strings.
The filename-related functions in os and os.path return unicode strings if they get unicode strings as arguments. So if you make sure that dirname is of type unicode instead of str, then os.listdir() will return unicode strings, which can then be transmitted via XML-RPC.
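In Python 3, where every str is Unicode and os.listdir returns str for a str path, the standard xmlrpc.client marshaller accepts such names directly. The round trip below is a sketch of that behaviour only; it does not reproduce the Python 2 server from the question, and the sample dictionary is made up.

```python
import xmlrpc.client

# a directory listing like the one list_dir builds, with a non-ASCII name
result = {'Nová složka': 'dir', 'readme.txt': 'file'}

# marshal the value as an XML-RPC response, then parse it back
payload = xmlrpc.client.dumps((result,), methodresponse=True)
params, method = xmlrpc.client.loads(payload)

print(params[0] == result)
```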

taking data from files which are in folder

How do I get the data from multiple txt files that are placed in a specific folder? I started with the code below but could not fix it. It gives an error like 'No such file or directory: '.idea''.
(Let's say I have an A folder and in that, there are x.txt, y.txt, z.txt and so on. I am trying to get and print the information from all the files x,y,z)
def find_get(folder):
    for file in os.listdir(folder):
        f = open(file, 'r')
        for data in open(file, 'r'):
            print data
find_get('filex')
Thanks.

If you just want to print each line:
import glob
import os
def find_get(path):
    # glob already returns each match with the path prefix included
    for f in glob.glob(os.path.join(path,"*.txt")):
        with open(f) as data:
            for line in data:
                print(line)
glob will find only your .txt files in the specified path.
Your error comes from not joining the path to the filename. Unless the file is in the directory you run the code from, Python cannot find it without the path. Another issue is that you seem to have a directory .idea, which would also give you an error when you try to open it as a file. This also presumes you actually have permission to read the files in the directory.
If your files were larger I would avoid reading all into memory and/or storing the full content.

First of all make sure you add the folder name to the file name, so you can find the file relative to where the script is executed.
To do so you want to use os.path.join, which, as its name suggests, joins paths. So, using a generator:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield f.read()
# this consumes the generator to a list
files_data = list(find_get('filex'))
See what we got in the list that consumed the generator:
print files_data
It may be more convenient to produce tuples which can be used to construct a dict:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield (relative_file_path, f.read(), )
# this consumes the generator into a dict
files_data = dict(find_get('filex'))
You will now have a mapping from each file's path to its content.
Also, take a look at the answer by @Padraic Cunningham. He brought up the glob module, which is suitable in this case.

The error you're facing is simple: listdir returns filenames, not full pathnames. To turn them into pathnames you can access from your current working directory, you have to join them to the directory path:
for filename in os.listdir(directory):
    pathname = os.path.join(directory, filename)
    with open(pathname) as f:
        # do stuff
        pass
So, in your case, there's a file named .idea in the folder directory, but you're trying to open a file named .idea in the current working directory, and there is no such file.
There are at least four other potential problems with your code that you also need to think about and possibly fix after this one:
You don't handle errors. There are many very common reasons you may not be able to open and read a file--it may be a directory, you may not have read access, it may be exclusively locked, it may have been moved since your listdir, etc. And those aren't logic errors in your code or user errors in specifying the wrong directory, they're part of the normal flow of events, so your code should handle them, not just die. Which means you need a try statement.
You don't do anything with the files but print out every line. Basically, this is like running cat folder/* from the shell. Is that what you want? If not, you have to figure out what you want and write the corresponding code.
You open the same file twice in a row, without closing in between. At best this is wasteful, at worst it will mean your code doesn't run on any system where opens are exclusive by default. (Are there such systems? Unless you know the answer to that is "no", you should assume there are.)
You don't close your files. Sure, the garbage collector will get to them eventually--and if you're using CPython and know how it works, you can even prove the maximum number of open file handles that your code can accumulate is fixed and pretty small. But why rely on that? Just use a with statement, or call close.
However, none of those problems are related to your current error. So, while you have to fix them too, don't expect fixing one of them to make the first problem go away.
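As a rough sketch of the first and last points, the loop can wrap each open in a try statement and a with statement. The function name read_files and the choice to print and skip unreadable entries are assumptions made for illustration; real code may want to log or re-raise instead.

```python
import os
import tempfile

def read_files(directory):
    """Return {pathname: text}, skipping entries that cannot be read."""
    contents = {}
    for filename in os.listdir(directory):
        pathname = os.path.join(directory, filename)
        try:
            with open(pathname) as f:
                contents[pathname] = f.read()
        except (OSError, UnicodeDecodeError) as exc:
            # directories, permission problems, vanished or undecodable files
            print('skipping {}: {}'.format(pathname, exc))
    return contents

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, '.idea'))
        with open(os.path.join(tmp, 'x.txt'), 'w') as f:
            f.write('hello\n')
        print(read_files(tmp))
```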

Full variant:
import os
def find_get(path):
    files = {}
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path,file)):
            with open(os.path.join(path,file), "r") as data:
                files[file] = data.read()
    return files
print(find_get("filex"))
Output:
{'1.txt': 'dsad', '2.txt': 'fsdfs'}
After that you could generate one file from that content, etc.
Key points:
os.listdir returns a list of names without the full path, so you need to join the initial path with each found item to operate on it.
dicts are an ideal fit here :)
os.listdir returns both files and folders, so you need to check whether each list item is really a file.

You should check whether each entry is actually a file and not a folder, since you can't open folders for reading. Also, you can't open the bare filename, since the file is under a folder, so you should build the correct path with os.path.join. See below:
import os
def find_get(folder):
    for file in os.listdir(folder):
        fullpath = os.path.join(folder, file)
        # test the joined path, not the bare name
        if not os.path.isfile(fullpath):
            continue # skip directories
        with open(fullpath, 'r') as f:
            for line in f:
                print line

Handling UTF filenames in Windows

Given the following files:
E:/Media/Foo/info.nfo
E:/Media/Bar/FXG™.nfo
I can "find" them with the following:
import fnmatch
import os
BASE = r'E:/Media/'
for dirpath, _, files in os.walk(BASE):
    for f in fnmatch.filter(files, '*.nfo'):
        nfopath = os.path.join(dirpath, f)
        print(nfopath)
This snippet would then print the above paths.
However, if I make sure that each path created by os.path.join() is indeed a regular file -- for example with something like:
for dirpath, _, files in os.walk(BASE):
    for f in fnmatch.filter(files, '*.nfo'):
        nfopath = os.path.join(dirpath, f)
        print(nfopath)
        assert os.path.isfile(nfopath) # <------
The assertion fails for the second filename, but not for the first.
I checked the folder in explorer, and the script indeed found a regular file and printed the name and path correctly, so I'm not clear on why the assertion failed.
I've tried specifying the BASE string as a unicode string (ur'E:/Media/') as well as explicitly encoding the nfopath inside the isfile() call (assert os.path.isfile(nfopath.encode('utf-8'))).
Neither seemed to work.
Of course, I could keep track of and manually go through and delete the failing files, but I'm interested in how one would handle this correctly.
Thanks in advance.
(Python 2.7, Windows 7)

According to this SO question, Windows stores file names as UTF-16 when using the NTFS filesystem. Retry your encoding step with UTF-16.
