I want to search a directory, and all its subdirectories, for files whose names contain a given substring and have one of three possible extensions.
Can you help me edit the following code?
os.chdir(directory)
files = glob.glob("**/*{}*.pro".format(myStr), recursive = True)
I want to find files with the extensions .pro, .bd3 and .mysql.
I'm running Python 3.5.
You could create a list of the extensions and loop over it:
exten_to_find = ['.pro', '.bd3', '.mysql']
You could then format the pattern like this on each iteration:
files = glob.glob("**/*{x}*{y}".format(x=myStr, y=extension_to_find), recursive=True)
You could try:
import glob

def get_files_with_extension(my_str, exts):
    for f in glob.iglob("**/*{}*.*".format(my_str), recursive=True):
        if any(f.endswith(ext) for ext in exts):
            yield f
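For example (assuming myStr is defined):

files = list(get_files_with_extension(myStr, ['.pro', '.bd3', '.mysql']))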
Actual-glob syntax has no way to do this. The "enhanced glob" syntaxes of most modern shells can, but I'm pretty sure Python's glob module is only very lightly enhanced.
Under the covers, glob is a pretty simple module, and the docs link to the source. As you can see, it ultimately defers to fnmatch, which is also a pretty simple module that ultimately just builds a regex and defers to that. And of course you can do alternations in a regex.
So, one option is to fork all the code from glob.py and fnmatch.py so you can build a fancier pattern to pass down to re.
But the simplest thing to do is just stop using glob here. It's the wrong tool for the job. Just use os.walk and filter things yourself.
If you understand how to write a regex like r'.*{}.*\.(pro|bd3|mysql)$'.format(myStr), use that to filter; if not, just write what you do know how to do; the performance cost will probably be minimal, and you'll be able to extend and maintain it yourself.
files = []
for root, dirnames, filenames in os.walk('.'):
    for file in filenames:
        fname, fext = os.path.splitext(file)
        # os.path.splitext keeps the leading dot, so compare dotted extensions
        if fext in {'.pro', '.bd3', '.mysql'} and myStr in fname:
            files.append(os.path.join(root, file))
If it turns out that doing a set method and a string method really is so much slower than regex that it makes a difference, and you can't write the regex yourself, come back and ask a new question. (I wouldn't count on the one I used above, if you can't figure out how to debug it.)
Also, if you're using Python before… I think 3.5… os.walk may actually be inherently slower than iglob. In that case, you'll want to look at the scandir module on PyPI (formerly BetterWalk), which the current implementation is based on.
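If you do want to go the regex route, here is a minimal sketch of what I mean (assuming myStr is already defined; re.escape guards against metacharacters in the substring):

import os
import re

pattern = re.compile(r'.*{}.*\.(pro|bd3|mysql)$'.format(re.escape(myStr)))
files = []
for root, dirnames, filenames in os.walk('.'):
    for file in filenames:
        if pattern.match(file):
            files.append(os.path.join(root, file))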
I want to retrieve a list of files (filtered by extension) inside a directory (recursively).
I have the solution below, but I am confident there is a cleaner way to do it. It is probably a simple glob expression I am missing, but any better solution is fine. Better in this scenario means readability (self-documenting), not performance.
I know this example is very simple, but of course it is part of a more complex scenario.
files = glob.glob('documents/*.txt') + glob.glob('documents/**/*.txt')
I'd expect something like
files = glob.glob('playbooks/(**/)?*.yml')
(just an example, that does not work)
To make use of the ** specifier in glob.glob() you need to explicitly set the recursive parameter to True, e.g.:
glob.glob('documents/**/*.txt', recursive=True)
From the official doc:
If recursive is true, the pattern “**” will match any files and zero or more directories, subdirectories and symbolic links to directories. If the pattern is followed by an os.sep or os.altsep then files will not match.
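If you prefer pathlib (Python 3.4+), a roughly equivalent sketch:

from pathlib import Path

# rglob('*.txt') behaves like glob('**/*.txt', recursive=True)
files = [str(p) for p in Path('documents').rglob('*.txt')]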
I have a function that traverses a directory tree searching for files of a designated filetype. It works just fine; the only problem is that it can be quite slow. Can anyone offer more Pythonic suggestions to potentially speed the process up?
def findbyfiletype(filetype, directory):
    """
    findbyfiletype allows the user to search by two parameters, filetype and directory.

    Example:
        If the user wishes to locate all pdf files within a directory, including subdirectories,
        then the function would be called as follows:

            findbyfiletype(".pdf", "D:\\")

        This will return a dictionary of strings where the filename is the key and the file path
        is the value, e.g.

            {'file.pdf': 'c:\\folder\\file.pdf'}

    Note that both parameters, filetype and directory, must be enclosed in string double or single
    quotes, and the directory parameter must use the backslash escape \\ as opposed to \ as Python
    will throw a string literal error otherwise.
    """
    indexlist = []               # holds all files in the given directory including sub folders
    FiletypeFilenameList = []    # holds list of all filenames of defined filetype in indexlist
    FiletypePathList = []        # holds path names to individual files of defined filetype

    for root, dirs, files in os.walk(directory):
        for name in files:
            indexlist.append(os.path.join(root, name))
            if filetype in name[-5:]:
                FiletypeFilenameList.append(name)
    for files in indexlist:
        if filetype in files[-5:]:
            FiletypePathList.append(files)
    FileDictionary = dict(zip(FiletypeFilenameList, FiletypePathList))
    del indexlist, FiletypePathList, FiletypeFilenameList
    return FileDictionary
OK, so this is what I ended up with, using a combination of @Ulrich Eckhardt, @Anton and @Cox:
import os
import scandir

def findbyfiletype(filetype, directory):
    FileDictionary = {}
    for root, dirs, files in scandir.walk(directory):
        for name in files:
            if filetype in name and name.endswith(filetype):
                FileDictionary.update({name: os.path.join(root, name)})
    return FileDictionary
As you can see, it's been refactored, getting rid of the unnecessary lists and creating the dictionary in one step. @Anton, your suggestion of the scandir module helped greatly, reducing the time in one instance by about 97 percent, which nearly blew my mind.
I'm listing @Anton as the accepted answer as it sums up all that I actually achieved with refactoring, but @Ulrich Eckhardt and @Cox both get upvotes as you were both very helpful.
Regards
Instead of os.walk(), you can use the faster scandir module (PEP-471).
Also, a few other tips:
Don't use the arbitrary [-5:]. Use the endswith() string method or use os.path.splitext().
Don't build up two long lists and then make a dict. Build the dict directly.
If escaping backslashes bothers you, use forward slashes like 'c:/folder/file.pdf'. They just work.
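Putting those tips together, a minimal sketch (on Python 3.5+, os.walk already uses scandir under the hood; the function and variable names here are just illustrative):

import os

def find_by_filetype(filetype, directory):
    file_dictionary = {}
    for root, dirs, files in os.walk(directory):
        for name in files:
            # os.path.splitext keeps the leading dot, e.g. '.pdf'
            if os.path.splitext(name)[1] == filetype:
                file_dictionary[name] = os.path.join(root, name)
    return file_dictionary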
walk() can be slow because it tries to cover a lot of things.
I use a simple variant:
def walk(self, path):
    try:
        entries = (os.path.join(path, x) for x in os.listdir(path))
        for x in entries:
            if os.path.isdir(x):
                self.walk(x)
            elif x.endswith((".jpg", ".png", ".jpeg")):   # include the dot so names like 'myjpg' don't match
                self.lf.append(x)
    except PermissionError:
        pass
That's fast, and the OS caches filesystem metadata, so a second invocation is even faster.
PS: the walk function is a member of a class, obviously; that's why "self" is there.
EDIT: on NTFS, don't bother with islink(). Updated with try/except.
But this just ignores dirs where you don't have permission. You have to run the script as admin if you want them listed.
I have working code that looks like this:
# Wow. Much nesting. So spacebar
if __name__ == '__main__':
    for eachDir in list_of_unrelated_directories:
        for eachFile in os.listdir(eachDir):
            if eachFile.endswith('.json'):
                # do stuff here
I'd like to know if there is a more elegant way of doing that. I would like to not have my code nested three layers deep like that, and if I could get this to a one-liner like
for each file that ends with .json in all these directories:
# do stuff
That would be even more awesome. I also edited this to point out that the directories are not all in the same folder. Like you might be looking for .json files in your home folder and also your /tmp folder. So I'm not trying to move recursively through a single folder.
The most Pythonic way is (in my opinion) to write a function that yields the files of a certain type and use it. Then your calling code is very clear and concise. Some of the other answers are very concise but incredibly confusing; in your actual code you should value clarity over brevity (although when you can get both that is, of course, preferred).
Here's the function:
import glob
import os

def files_from_directories(directories, filetype):
    """Yield files of filetype from all directories given."""
    for directory in directories:
        for file in glob.glob(os.path.join(directory, '*' + filetype)):
            yield file
Now your calling code really is a one-liner:
# What a good one-liner!!!!
for json_file in files_from_directories(directories, '.json'):
    # do stuff
So now you have a one-liner, and a very clear one. Plus if you want to process any other type of file, you can just reuse the function with a different filetype.
You can use a generator expression to remove the nested loops:
for json_file in (f for dir in list_of_unrelated_dirs
                    for f in os.listdir(dir)
                    if f.endswith('.json')):
    print json_file
If you want to apply a function to them, you could even remove the remaining for loop by using the map() function:
map(fun,
    (f for dir in list_of_unrelated_dirs
       for f in os.listdir(dir)
       if f.endswith('.json'))
)
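One caveat worth adding (my note, not part of the original answer): in Python 3, map() is lazy, so if fun is called only for its side effects you need to consume the iterator, for example:

list(map(fun,
         (f for dir in list_of_unrelated_dirs
            for f in os.listdir(dir)
            if f.endswith('.json'))))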
Hope this helps!
Surely, the following code is not Pythonic, because it is not the simplest or clearest and it definitely doesn't follow the Zen of Python.
However, it's a one-line approach and it was fun to do it ;-):
def manage_file(filepath):
    print('File to manage:', filepath)
EDIT: Based on the accepted answer, I've updated my answer to use glob(). The result is still a bit freaky code, but it's less code than my previous approach:
map(manage_file, [fn for fn in sum((glob('%s/*.json' % eachDir) for eachDir in data_path), [])])
glob() can reduce you to two levels:
from glob import glob
from os.path import join

for d in list_of_unrelated_directories:
    for f in glob(join(d, '*.json')):
        _process_json_file(f)
If your list_of_unrelated_directories is really a list of totally unrelated directories, I don't see how you can avoid the first loop. If they do have something in common (say a common root and some common prefix), then you can use os.walk() to traverse the tree and grab all matching files.
It's not really less nesting, it's just nesting within a comprehension.
This gets all things that end with '.json' and are also confirmed as files (ignores folders that end with '.json').
Standalone Code
import os
unrelated_paths = ['c:/', 't:/']
json_files = (os.path.join(p, o) for p in unrelated_paths
                                 for o in os.listdir(p)
                                 if (o.lower().endswith('.json')
                                     and os.path.isfile(os.path.join(p, o))))
for json_file in json_files:
    print json_file
I am writing a plug-in for RawTherapee in Python. I need to extract the version number from a file called 'AboutThisBuild.txt' that may exist anywhere in the directory tree. Although RawTherapee knows where it is installed, this data is baked into the binary file.
My plug-in is being designed to collect basic system data when run without any command line parameters, for the purpose of short-circuiting troubleshooting. By having the version number, revision number and changeset (AKA Mercurial), I can sort out why the script may not be working as expected. OK, that is the context.
I have tried a variety of methods, some suggested elsewhere on this site. The main one is using os.walk and fnmatch.
The problem is speed. Searching the entire directory tree is like watching paint dry!
To reduce load I have tried to predict likely hiding places and only traverse these. This is quicker but has the obvious disadvantage of missing some files.
This is what I have at the moment. Tested on Linux but not Windows as yet as I am still researching where the file might be placed.
import fnmatch
import os
import sys

rootPath = ('/usr/share/doc/rawtherapee',
            '~',
            '/media/CoreData/opt/',
            '/opt')
pattern = 'AboutThisBuild.txt'

# Return the first instance of RT found in the paths searched
for CheckPath in rootPath:
    print("\n")
    print(">>>>>>>>>>>>> " + CheckPath)
    print("\n")
    for root, dirs, files in os.walk(CheckPath, True, None, False):
        for filename in fnmatch.filter(files, pattern):
            print(os.path.join(root, filename))
            break
Usually 'AboutThisBuild.txt' is stored in a directory/subdirectory called 'rawtherapee' or has the string somewhere in the directory tree. I had naively thought I could get the 5000-odd directory names and search these for 'rawtherapee', then use os.walk to traverse those directories, but all the modules and functions I have looked at collate all files in the directory (again).
Anyone have a quicker method of searching the entire directory tree or am I stuck with this hybrid option?
I am a beginner in Python, but I think I know the simplest way of finding a file in Windows.
import os

for dirpath, subdirs, filenames in os.walk('The directory you wanna search the file in'):
    if 'name of your file with extension' in filenames:
        print(dirpath)
This code will print out the directory of the file you are searching for in the console. All you have to do is get to the directory.
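A small variation (my addition, not from the original answer) that prints the full path and stops after the first match:

import os

filename = 'name of your file with extension'
for dirpath, subdirs, filenames in os.walk('The directory you wanna search the file in'):
    if filename in filenames:
        print(os.path.join(dirpath, filename))
        break  # stop at the first hit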
The thing about searching is that it doesn't matter too much how you get there (e.g. cheating). Once you have a result, you can verify it is correct relatively quickly.
You may be able to identify candidate locations fairly efficiently by guessing. For example, on Linux, you could first try looking in these locations (obviously not all are directories, but it doesn't do any harm to os.path.isfile('/;l$/AboutThisBuild.txt'))
$ strings /usr/bin/rawtherapee | grep '^/'
/lib/ld-linux.so.2
/H=!
/;l$
/9T$,
/.ba
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
If you have it installed, you can try the locate command
If you still don't find it, move on to the brute force method
Here is a rough equivalent of strings using Python
>>> from string import printable, whitespace
>>> from itertools import groupby
>>> pathchars = set(printable) - set(whitespace)
>>> with open("/usr/bin/rawtherapee") as fp:
... data = fp.read()
...
>>> for k, g in groupby(data, pathchars.__contains__):
... if not k: continue
... g = ''.join(g)
... if len(g) > 3 and g.startswith("/"):
... print g
...
/lib64/ld-linux-x86-64.so.2
/^W0Kq[
/pW$<
/3R8
/)wyX
/WUO
/w=H
/t_1
/.badpixH
/d$(
/\$P
/D$Pv
/D$#
/D$(
/l$#
/d$#v?H
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue.csv
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
It sounds like you need a pure python solution here. If not, other answers will suffice.
In this case, you should traverse the folders using a queue and threads. While some may say threads are never the solution, threads are a great way of speeding things up when you are I/O bound, which you are in this case. Essentially, you'll os.listdir the current dir. If it contains your file, party like it's 1999. If it doesn't, add each subfolder to the work queue.
If you're clever, you can play with depth first vs breadth first traversal to get the best results.
There is a great example I have used quite successfully at work at http://www.tutorialspoint.com/python/python_multithreading.htm. See the section titled Multithreaded Priority Queue. The example could probably be updated to use thread pools, but it's not necessary.
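A minimal Python 3 sketch of that idea (the names and the example file are illustrative, not from the original post): worker threads pull directories off a queue, list them, and either record a hit or enqueue the subdirectories.

import os
import queue
import threading

def find_file(target, start_dir, num_workers=8):
    """Breadth-first, threaded directory search; returns the first hit or None.

    Sketch only: the idle-timeout shutdown is a simplification, not production-grade.
    """
    work = queue.Queue()
    work.put(start_dir)
    found = []                          # list.append is atomic, good enough for a sketch
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            try:
                path = work.get(timeout=0.5)   # assume we are done if the queue stays empty
            except queue.Empty:
                return
            try:
                entries = os.listdir(path)
            except OSError:                    # permission denied, vanished dir, ...
                continue
            if target in entries:
                found.append(os.path.join(path, target))
                stop.set()
                return
            for entry in entries:
                full = os.path.join(path, entry)
                if os.path.isdir(full) and not os.path.islink(full):
                    work.put(full)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found[0] if found else None

# Example call (path is illustrative):
# print(find_file('AboutThisBuild.txt', '/usr'))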
Ant has a nice way to select groups of files, most handily using ** to indicate a directory tree. E.g.
**/CVS/* # All files immediately under a CVS directory.
mydir/mysubdir/** # All files recursively under mysubdir
More examples can be seen here:
http://ant.apache.org/manual/dirtasks.html
How would you implement this in python, so that you could do something like:
files = get_files("**/CVS/*")
for file in files:
    print file
=>
CVS/Repository
mydir/mysubdir/CVS/Entries
mydir/mysubdir/foo/bar/CVS/Entries
Sorry, this is quite a long time after your OP. I have just released a Python package which does exactly this - it's called Formic and it's available at the PyPI Cheeseshop. With Formic, your problem is solved with:
import formic
fileset = formic.FileSet(include="**/CVS/*", default_excludes=False)
for file_name in fileset.qualified_files():
    print file_name
There is one slight complexity: default_excludes. Formic, just like Ant, excludes CVS directories by default (as for the most part collecting files from them for a build is dangerous), so the default answer to the question would result in no files. Setting default_excludes=False disables this behaviour.
As soon as you come across a **, you're going to have to recurse through the whole directory structure, so I think at that point, the easiest method is to iterate through the directory with os.walk, construct a path, and then check if it matches the pattern. You can probably convert to a regex by something like:
import os
import re

def glob_to_regex(pat, dirsep=os.sep):
    dirsep = re.escape(dirsep)
    regex = (re.escape(pat).replace("\\*\\*" + dirsep, ".*")
                           .replace("\\*\\*", ".*")
                           .replace("\\*", "[^%s]*" % dirsep)
                           .replace("\\?", "[^%s]" % dirsep))
    return re.compile(regex + "$")
(Though note that this isn't fully featured - it doesn't support [a-z] style glob patterns for instance, though this could probably be added.) (The first **/ match is to cover cases like **/CVS matching ./CVS, as well as having just ** to match at the tail.)
However, obviously you don't want to recurse through everything below the current dir when not processing a ** pattern, so I think you'll need a two-phase approach. I haven't tried implementing the below, and there are probably a few corner cases, but I think it should work:
Split the pattern on your directory separator, i.e. pat.split('/') -> ['**', 'CVS', '*']
Recurse through the directories, and look at the relevant part of the pattern for this level, i.e. n levels deep -> look at pat[n].
If pat[n] == '**' switch to the above strategy:
Reconstruct the pattern with dirsep.join(pat[n:])
Convert to a regex with glob_to_regex()
Recursively os.walk through the current directory, building up the path relative to the level you started at. If the path matches the regex, yield it.
If pat[n] doesn't match "**", and it is the last element in the pattern, then yield all files/dirs matching glob.glob(os.path.join(curpath, pat[n]))
If pat[n] doesn't match "**", and it is NOT the last element in the pattern, then for each directory, check if it matches (with glob) pat[n]. If so, recurse down through it, incrementing depth (so it will look at pat[n+1])
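For the '**' branch (step 3 above), a rough sketch of how the pieces could fit together (my outline, not tested against every corner case; it reuses glob_to_regex from the snippet above):

import os

def walk_glob(pat, top="."):
    # Convert the remaining '**' pattern to a regex, then walk everything below top,
    # matching each path relative to where we started.
    regex = glob_to_regex(pat, dirsep="/")
    for root, dirs, files in os.walk(top):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), top)
            if regex.match(rel.replace(os.sep, "/")):
                yield rel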
os.walk is your friend. Look at the example in the Python manual
(https://docs.python.org/2/library/os.html#os.walk) and try to build something from that.
To match "**/CVS/*" against a file name you get, you can do something like this:
import fnmatch

def match(pattern, filename):
    if pattern.startswith("**"):
        return fnmatch.fnmatch(filename, pattern[1:])
    else:
        return fnmatch.fnmatch(filename, pattern)
In fnmatch.fnmatch, "*" matches anything (including slashes).
There's an implementation in the 'waf' build system source code.
http://code.google.com/p/waf/source/browse/trunk/waflib/Node.py?r=10755#471
May be this should be wrapped up in a library of its own?
Yup. Your best bet is, as has already been suggested, to work with 'os.walk'. Or, write wrappers around 'glob' and 'fnmatch' modules, perhaps.
os.walk is your best bet for this. I did the example below with .svn because I had that handy, and it worked great:
import os
import re

for (dirpath, dirnames, filenames) in os.walk("."):
    if re.search(r'\.svn$', dirpath):
        for file in filenames:
            print file