How to list all files with a certain extension recursively? - python

I want to retrieve a list of files (filtered by extension) inside a directory (recursively).
I have the solution below, but I am confident there is a cleaner way to do it. Probably a simple glob expression that I am missing, but any better solution is fine. Better in this scenario means readability (self-documenting), not performance.
I know this example is very simple, but of course it is part of a more complex scenario.
files = glob.glob('documents/*.txt') + glob.glob('documents/**/*.txt')
I'd expect something like
files = glob.glob('playbooks/(**/)?*.yml')
(just an example, that does not work)

To make use of the ** specifier in glob.glob() you need to explicitly set the recursive parameter to True, e.g.:
glob.glob('Documents/**/*.txt', recursive=True)
From the official doc:
If recursive is true, the pattern “**” will match any files and zero or more directories, subdirectories and symbolic links to directories. If the pattern is followed by an os.sep or os.altsep then files will not match.
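If you prefer pathlib, a minimal equivalent sketch (assuming Python 3.4+; Path.rglob prepends **/ to the given pattern for you):
from pathlib import Path

# Recursively collect .txt files under Documents; equivalent to
# glob.glob('Documents/**/*.txt', recursive=True), but yields Path objects.
files = [str(p) for p in Path('Documents').rglob('*.txt')]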

Related

Python search files with multiple extensions

I wish to search a directory, and all contained subdirectories, for files with a substring contained within their name and one of three possible extensions.
Can you please help me edit the following code?
os.chdir(directory)
files = glob.glob("**/*{}*.pro".format(myStr), recursive = True)
I wish to find files with the extensions .pro, .bd3 and .mysql
I'm running Python 3.5
You could create a list of the extensions to find and loop over it:
exten_to_find = ['.pro', '.bd3', '.mysql']
You could then format the pattern once per extension and collect the results:
files = []
for extension_to_find in exten_to_find:
    files.extend(glob.glob("**/*{}*{}".format(myStr, extension_to_find), recursive=True))
Or you could try a generator:
import glob

def get_files_with_extension(my_str, exts):
    for f in glob.iglob("**/*{}*.*".format(my_str), recursive=True):
        if any(f.endswith(ext) for ext in exts):
            yield f
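Usage might look like this (myStr and the extension list are taken from the question):
for f in get_files_with_extension(myStr, ['.pro', '.bd3', '.mysql']):
    print(f)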
Actual-glob syntax has no way to do this. The "enhanced glob" syntaxes of most modern shells can, but I'm pretty sure Python's glob module is only very lightly enhanced.
Under the covers, glob is a pretty simple module, and the docs link to the source. As you can see, it ultimately defers to fnmatch, which is also a pretty simple module that ultimately just builds a regex and defers to that. And of course you can do alternations in a regex.
So, one option is to fork all the code from glob.py and fnmatch.py so you can build a fancier pattern to pass down to re.
But the simplest thing to do is just stop using glob here. It's the wrong tool for the job. Just use os.walk and filter things yourself.
If you understand how to write a regex like r'.*{}.*\.(pro|bd3|mysql)'.format(myStr), use that to filter; if not, just write what you do know how to do; the performance cost will probably be minimal, and you'll be able to extend and maintain it yourself.
import os

files = []
for root, dirnames, filenames in os.walk('.'):
    for file in filenames:
        fname, fext = os.path.splitext(file)
        if fext in {'.pro', '.bd3', '.mysql'} and myStr in fname:
            files.append(os.path.join(root, file))
If it turns out that a set lookup and a string method really are so much slower than the regex that it makes a difference, and you can't write the regex yourself, come back and ask a new question. (I wouldn't count on the one I used above if you can't figure out how to debug it.)
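For reference, a regex-based version of the same walk might look like this (a sketch only; it uses the pattern from above with the question's .bd3 extension, and re.escape guards against special characters in myStr):
import os
import re

pattern = re.compile(r'.*{}.*\.(pro|bd3|mysql)$'.format(re.escape(myStr)))
files = [os.path.join(root, f)
         for root, dirnames, filenames in os.walk('.')
         for f in filenames
         if pattern.match(f)]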
Also, if you're using Python before 3.5, os.walk may actually be inherently slower than iglob. In that case, you'll want to look for scandir (formerly betterwalk) on PyPI, the module that the current os.walk implementation is based on.

Iterate over infinite files in a directory in Python

I'm using Python 3.3.
If I'm manipulating potentially infinite files in a directory (bear with me; just pretend I have a filesystem that supports that), how do I do that without encountering a MemoryError? I only want the string name of one file to be in memory at a time. I don't want them all in an iterable as that would cause a memory error when there are too many.
Will os.walk() work just fine, since it returns a generator? Or, do generators not work like that?
Is this possible?
If you have a system for naming the files that can be figured out computationally, you can do something like this (this iterates over any number of numbered txt files, with only one name in memory at a time; you could convert to another calculable system to get shorter filenames for large numbers):
import os

def infinite_files(path):
    num = 0
    while True:
        name = os.path.join(path, str(num) + ".txt")
        if not os.path.exists(name):
            break
        yield name  # perform operations on the file str(num)+".txt", one at a time
        num += 1
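Because it is a generator, iterating over it keeps only one file name in memory at a time (process here is a hypothetical stand-in for whatever you do per file):
for name in infinite_files("/some/path"):
    process(name)  # hypothetical per-file operation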
[My old inapplicable answer is below]
glob.iglob seems to do exactly what the question asks for. [EDIT: It doesn't. It actually seems less efficient than listdir(), but see my alternative solution above.] From the official documentation:
glob.glob(pathname, *, recursive=False)
Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).
glob.iglob(pathname, *, recursive=False)
Return an iterator which yields the same values as glob() without actually storing them all simultaneously.
iglob returns an "iterator which yields" or, more concisely, a generator.
Since glob.iglob has the same behavior as glob.glob, you can search with wildcard characters:
import glob

for x in glob.iglob("/home/me/Desktop/*.txt"):
    print(x)  # prints all txt files in that directory
I don't see a way for it to differentiate between files and directories without doing it manually. That is certainly possible, however.
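Doing it manually is straightforward, though; a sketch along the same lines as above:
import glob
import os

for x in glob.iglob("/home/me/Desktop/*"):
    if os.path.isfile(x):  # skip directories by checking each match explicitly
        print(x)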

pythonic way to access each file in a list of directories

I have working code that looks like this:
# Wow. Much nesting. So spacebar
if __name__ == '__main__':
    for eachDir in list_of_unrelated_directories:
        for eachFile in os.listdir(eachDir):
            if eachFile.endswith('.json'):
                # do stuff here
I'd like to know if there is a more elegant way of doing that. I would like to not have my code nested three layers deep like that, and if I could get this to a one-liner like
for each file that ends with .json in all these directories:
    # do stuff
That would be even more awesome. I also edited this to point out that the directories are not all in the same folder. Like you might be looking for .json files in your home folder and also your /tmp folder. So I'm not trying to move recursively through a single folder.
The most Pythonic way is (in my opinion) to write a function that yields the files of a certain type and use it. Then your calling code is very clear and concise. Some of the other answers are very concise but incredibly confusing; in your actual code you should value clarity over brevity (although when you can get both that is, of course, preferred).
Here's the function:
import glob
import os

def files_from_directories(directories, filetype):
    """Yield files of filetype from all directories given."""
    for directory in directories:
        for file in glob.glob(os.path.join(directory, '*' + filetype)):
            yield file
Now your calling code really is a one-liner:
# What a good one-liner!!!!
for json_file in files_from_directories(directories, '.json'):
    # do stuff
So now you have a one-liner, and a very clear one. Plus if you want to process any other type of file, you can just reuse the function with a different filetype.
You can use a generator expression to remove the nested loops:
for json_file in (f for dir in list_of_unrelated_dirs
                  for f in os.listdir(dir)
                  if f.endswith('.json')):
    print json_file
If you want to apply a function to them, you could even improve it removing the remaining for loop with map() function:
map(fun,
    (f for dir in list_of_unrelated_dirs
     for f in os.listdir(dir)
     if f.endswith('.json'))
)
Hope this helps!
Surely, the following code is not Pythonic, because it is not the simplest or clearest option and definitely doesn't follow the Zen of Python.
However, it's a one-line approach and it was fun to write ;-):
def manage_file(filepath):
    print('File to manage:', filepath)
EDIT: Based on the accepted answer, I've updated my answer to use glob(); the result is still a bit freaky, but it's less code than my previous approach:
from glob import glob

map(manage_file, [fn for fn in sum((glob('%s/*.json' % eachDir) for eachDir in data_path), [])])
glob() can reduce you to two levels:
from glob import glob
from os.path import join

for d in list_of_unrelated_directories:
    for f in glob(join(d, '*.json')):
        _process_json_file(f)
If your list_of_unrelated_directories is really a list of totally unrelated directories, I don't see how you can avoid the first loop. If they do have something in common (say a common root and some common prefix), then you can use os.walk() to traverse the tree and grab all matching files.
It's not really less nesting, it's just nesting within a comprehension.
This gets all things that end with '.json' and are also confirmed as files (ignores folders that end with '.json').
Standalone Code
import os

unrelated_paths = ['c:/', 't:/']
json_files = (os.path.join(p, o) for p in unrelated_paths
              for o in os.listdir(p)
              if (o.lower().endswith('.json')
                  and os.path.isfile(os.path.join(p, o))))
for json_file in json_files:
    print json_file

How can I tell if a file is a descendant of a given directory?

On the surface, this is pretty simple, and I could implement it myself easily. Just successively call dirname() to go up each level in the file's path and check each one to see if it's the directory we're checking for.
But symlinks throw the whole thing into chaos. Any directory along the path of either the file or directory being checked could be a symlink, and any symlink could have an arbitrary chain of symlinks to other symlinks. At this point my brain melts and I'm not sure what to do. I've tried writing the code to handle these special cases, but it soon gets too complicated and I assume I'm doing it wrong. Is there a reasonably elegant way to do this?
I'm using Python, so any mention of a library that does this would be cool. Otherwise, this is a pretty language-neutral problem.
Use os.path.realpath and os.path.commonprefix:
os.path.commonprefix(['/the/dir/', os.path.realpath(filename)]) == "/the/dir/"
os.path.realpath will expand any symlinks as well as .. in the filename. os.path.commonprefix is a bit fickle: it doesn't really test for paths, just plain string prefixes, so you should make sure your directory ends in a directory separator. If you don't, it will claim /the/dirtwo/filename is also in /the/dir.
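To see why the trailing separator matters, compare (POSIX paths assumed):
>>> import os.path
>>> os.path.commonprefix(['/the/dir', '/the/dirtwo/filename'])
'/the/dir'
>>> os.path.commonprefix(['/the/dir/', '/the/dirtwo/filename'])
'/the/dir'
The first result equals '/the/dir', a false positive; the second no longer equals '/the/dir/', so the unrelated file is correctly rejected.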
Python 3.5 has the useful function os.path.commonpath:
Return the longest common sub-path of each pathname in the sequence paths. Raise ValueError if paths contains both absolute and relative pathnames, or if paths is empty. Unlike commonprefix(), this returns a valid path.
So to check if a file is a descendant of a directory, you could do this:
os.path.commonpath(["/the/dir", os.path.realpath(filename)]) == "/the/dir"
Unlike commonprefix, you don't need to worry about whether the inputs have trailing slashes or not. The return value of commonpath always lacks a trailing slash.
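Wrapped in a helper, it might look like this (is_descendant is a hypothetical name; the directory is assumed to already be an absolute, resolved path):
import os.path

def is_descendant(filename, directory):
    # Resolve symlinks and '..' in the file path, then compare sub-paths.
    return os.path.commonpath([directory, os.path.realpath(filename)]) == directory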
Another way to do this in Python 3 is to use pathlib:
from pathlib import Path
is_descendant = Path("/the/dir") in Path(filename).resolve().parents
See documentation for Path.resolve() and Path.parents.

How would you implement ant-style patternsets in python to select groups of files?

Ant has a nice way to select groups of files, most handily using ** to indicate a directory tree. E.g.
**/CVS/* # All files immediately under a CVS directory.
mydir/mysubdir/** # All files recursively under mysubdir
More examples can be seen here:
http://ant.apache.org/manual/dirtasks.html
How would you implement this in python, so that you could do something like:
files = get_files("**/CVS/*")
for file in files:
    print file
=>
CVS/Repository
mydir/mysubdir/CVS/Entries
mydir/mysubdir/foo/bar/CVS/Entries
Sorry, this is quite a long time after your OP. I have just released a Python package which does exactly this - it's called Formic and it's available at the PyPI Cheeseshop. With Formic, your problem is solved with:
import formic
fileset = formic.FileSet(include="**/CVS/*", default_excludes=False)
for file_name in fileset.qualified_files():
    print file_name
There is one slight complexity: default_excludes. Formic, just like Ant, excludes CVS directories by default (as, for the most part, collecting files from them for a build is dangerous), so with the defaults the answer to this question would produce no files. Setting default_excludes=False disables this behaviour.
As soon as you come across a **, you're going to have to recurse through the whole directory structure, so I think at that point the easiest method is to iterate through the directory with os.walk, construct a path, and then check if it matches the pattern. You can probably convert the pattern to a regex with something like:
import os
import re

def glob_to_regex(pat, dirsep=os.sep):
    dirsep = re.escape(dirsep)
    regex = (re.escape(pat).replace("\\*\\*" + dirsep, ".*")
                           .replace("\\*\\*", ".*")
                           .replace("\\*", "[^%s]*" % dirsep)
                           .replace("\\?", "[^%s]" % dirsep))
    return re.compile(regex + "$")
(Though note that this isn't fully featured: it doesn't support [a-z] style glob patterns, for instance, though this could probably be added.) (The first **/ replacement is to cover cases like **/CVS matching ./CVS, as well as having just ** to match at the tail.)
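For example, with the sketch above (output shown for a POSIX separator; the exact escaping can vary slightly across Python versions):
>>> glob_to_regex("**/CVS/*", dirsep="/").pattern
'.*CVS/[^/]*$'
>>> bool(glob_to_regex("**/CVS/*", dirsep="/").match("mydir/mysubdir/CVS/Entries"))
True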
However, you obviously don't want to recurse through everything below the current dir when you're not processing a ** pattern, so I think you'll need a two-phase approach. I haven't tried implementing the below, and there are probably a few corner cases, but I think it should work (a rough sketch follows the list):
Split the pattern on your directory separator, i.e. pat.split('/') -> ['**', 'CVS', '*'].
Recurse through the directories, and look at the relevant part of the pattern for this level, i.e. n levels deep -> look at pat[n].
If pat[n] == '**', switch to the above strategy:
Reconstruct the pattern with dirsep.join(pat[n:]).
Convert to a regex with glob_to_regex().
Recursively os.walk through the current directory, building up the path relative to the level you started at. If the path matches the regex, yield it.
If pat[n] is not '**' and it is the last element in the pattern, then yield all files/dirs matching glob.glob(os.path.join(curpath, pat[n])).
If pat[n] is not '**' and it is NOT the last element in the pattern, then for each directory, check whether it matches (with glob) pat[n]. If so, recurse down through it, incrementing the depth (so it looks at pat[n+1]).
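A rough, untested sketch of that two-phase approach, reusing glob_to_regex from above (ant_glob is a hypothetical name, and corner cases such as character classes are not handled):
import glob
import os

def ant_glob(pat, root=".", dirsep="/"):
    parts = pat.split(dirsep)

    def walk(curpath, n):
        if parts[n] == '**':
            # Phase two: regex-match everything below curpath.
            regex = glob_to_regex(dirsep.join(parts[n:]), dirsep)
            for dirpath, dirnames, filenames in os.walk(curpath):
                for f in filenames:
                    rel = os.path.relpath(os.path.join(dirpath, f), curpath)
                    if regex.match(rel.replace(os.sep, dirsep)):
                        yield os.path.join(dirpath, f)
        elif n == len(parts) - 1:
            # Last pattern component: a plain glob at this level.
            for f in glob.glob(os.path.join(curpath, parts[n])):
                yield f
        else:
            # Intermediate component: glob the matching directories, recurse.
            for d in glob.glob(os.path.join(curpath, parts[n])):
                if os.path.isdir(d):
                    for f in walk(d, n + 1):
                        yield f

    return walk(root, 0)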
os.walk is your friend. Look at the example in the Python manual
(https://docs.python.org/2/library/os.html#os.walk) and try to build something from that.
To match "**/CVS/*" against a file name you get, you can do something like this:
import fnmatch

def match(pattern, filename):
    if pattern.startswith("**"):
        return fnmatch.fnmatch(filename, pattern[1:])
    else:
        return fnmatch.fnmatch(filename, pattern)
In fnmatch.fnmatch, "*" matches anything (including slashes).
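For example (note one limitation of this sketch: a top-level path like CVS/Repository does not match, because pattern[1:] still requires a leading directory before /CVS/):
>>> match("**/CVS/*", "mydir/mysubdir/CVS/Entries")
True
>>> match("**/CVS/*", "CVS/Repository")
False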
There's an implementation in the 'waf' build system source code.
http://code.google.com/p/waf/source/browse/trunk/waflib/Node.py?r=10755#471
Maybe this should be wrapped up in a library of its own?
Yup. Your best bet is, as has already been suggested, to work with os.walk. Or perhaps write wrappers around the glob and fnmatch modules.
os.walk is your best bet for this. I did the example below with .svn because I had that handy, and it worked great:
import os
import re

for (dirpath, dirnames, filenames) in os.walk("."):
    if re.search(r'\.svn$', dirpath):
        for file in filenames:
            print file
