I have a piece of code that searches for the executables of game files and returns their directories. I would really like some sort of progress indicator for how far along os.walk is. How would I accomplish such a thing?
I tried doing startpt = root.count(os.sep) and gauging off of that, but that only tells me how deep os.walk is in the directory tree.
import os

def locate(filelist, root=os.curdir): # Find a list of files, return directories.
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in returnMatches(filelist, [k.lower() for k in files]):
            yield path + "\\"
It depends!
If the files and directories are distributed more or less evenly, you could show rough progress by assuming every toplevel directory will take the same amount of time. But if they are not distributed evenly, you cannot find that out cheaply: you either have to know roughly how populated every directory is in advance, or you have to os.walk the entire thing twice (which is only worthwhile if your actual processing takes much longer than the os.walk itself does).
That is: say you have 4 toplevel directories, and each one contains 4 files. If you assume every toplevel dir accounts for 25% of the progress, and each file for another 25% of the progress within that dir, you can show a nice progress indicator. But if the last subdir turns out to contain many more files than the first few, your progress indicator will have hit 75% before you find out about it. You cannot really fix that if the os.walk itself is the bottleneck (not your processing) and it's an arbitrary directory tree (not one where you know in advance roughly how long every subtree will take).
And of course that's assuming the cost here is about the same for every file...
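For illustration, here's a minimal sketch of that even-weight assumption (my own sketch, not from the post; walk_with_estimate is a hypothetical name, and files sitting directly in the root are ignored):

import os

def walk_with_estimate(root):
    # assume every toplevel directory costs about the same amount of work
    top = sorted(d for d in os.listdir(root)
                 if os.path.isdir(os.path.join(root, d)))
    for i, d in enumerate(top):
        for path, dirs, files in os.walk(os.path.join(root, d)):
            # progress jumps in steps of 100/len(top)
            yield path, files, i * 100 // len(top)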
I figured this out.
I used os.listdir to get a list of toplevel directories, and then used the .split function on the path that os.walk returned to get the first-level directory it was currently in.

That left me with a list of toplevel directories, in which I could look up the index of os.walk's current directory; comparing that index against the length of the list gives a % complete. ;)
This doesn't give me smooth progress, because the amount of work done in each directory can vary, but smoothing out the progress indicator is of no concern to me. It could easily be accomplished by extending the path checking deeper into the directory structure.
Here is the final code for getting my progress:
import os

def locateGameDirs(filelist, root=os.curdir): # Find a list of files, return directories.
    toplevel = [folder for folder in os.listdir(root) if os.path.isdir(os.path.join(root, folder))] # List of top-level directories
    fileset = set(filelist)
    for path, dirs, files in os.walk(os.path.abspath(root)):
        curdir = path.split('\\')[1] # The directory os.walk is currently in.
        try: # Thrown in here because there's a nonexistent(?) first entry.
            youarehere = toplevel.index(curdir)
            progress = int((youarehere / float(len(toplevel))) * 100) # float() so Python 2 doesn't floor-divide to 0
        except:
            pass
        for filename in returnMatches(filelist, [k.lower() for k in files]):
            yield filename, path + "\\", progress
And right now for debugging purposes I'm doing this further in the code:
for wow in locateGameDirs(["wow.exe", "firefox.exe", "vlc.exe"], "C:\\"):
    print wow
Is there a nice little way to get rid of that try/except? It seems the first iteration of path gives me nothing...
Just show an indeterminate progress bar (i.e. one that shows a blob bouncing back and forth, or the barber-pole effect). That way users know the program is doing something useful without being misled about the time to complete.
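As a rough console-only illustration of the idea (my own sketch, not from the answer), a spinner can stand in for an indeterminate bar:

import itertools
import os
import sys

spinner = itertools.cycle('|/-\\')  # a poor man's barber pole
for path, dirs, files in os.walk('.'):
    sys.stdout.write('\rScanning... ' + next(spinner))
    sys.stdout.flush()
sys.stdout.write('\rScanning... done\n')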
Do it in two passes: first count how many total files/folders are in the tree, and then during the second pass do actual processing.
You need to know the total number of files to do a meaningful progress indicator.
You can get the number of directories like this (os.walk yields one tuple per directory):

len(list(os.walk(os.path.abspath(root))))

or count the files instead:

sum(len(files) for _, _, files in os.walk(os.path.abspath(root)))
but that is going to take some time and you probably need a progress indicator for that...
To find the number of files really quickly you'd need a filesystem which keeps track of the number of files for you.
Perhaps you can save the total from a previous run and use that as an estimate.
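Putting the two-pass idea together, a minimal sketch (my own illustration; walk_with_progress is a hypothetical name) could look like:

import os

def walk_with_progress(root):
    # pass 1: count the files so a percentage is possible
    total = sum(len(files) for _, _, files in os.walk(root)) or 1
    done = 0
    # pass 2: the real work
    for path, dirs, files in os.walk(root):
        for name in files:
            done += 1
            yield name, path, done * 100 // total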
I suggest you avoid walking the directory tree. Instead, use an index-based app for quickly finding files. You can use the app's command-line interface via subprocess and find the files almost instantaneously.
On Windows, see Everything. On UNIX, check out locate. Not sure about Mac, but I'm sure there's an option there too.
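For example, a sketch of the locate route (assumes the locate binary is installed and its database is current; Everything ships a command-line client, es.exe, that can be driven the same way on Windows):

import subprocess

# 'locate' consults a prebuilt index instead of walking the filesystem;
# it may raise CalledProcessError if nothing matches
out = subprocess.check_output(['locate', 'wow.exe'])
paths = out.decode().splitlines()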
As I said in the comment, the performance bottleneck likely lies outside of the locate function. Your returnMatches is a fairly expensive function. I think you'd be better off replacing it with the following code:
import os

def locate(filelist, root=os.curdir):
    fileset = set(filelist) # if possible, pass the set instead of the list as the first argument
    for path, dirs, files in os.walk(os.path.abspath(root)):
        if any(file.lower() in fileset for file in files):
            yield path + '\\'
This way you reduce the number of wasteful operations, yield once per directory that contains a match (which I think is what you actually intended to do), and you can forget about the progress at the same time. I don't think progress would be an expected feature of the interface anyway.
Thinking out of the box here... what if you did it based on size:

1. Use subprocess to run 'du -sb' and get the total_size of your root directory.
2. As you walk, check the size of each file and decrement it from your total_size (giving you remaining_size).
3. pct_complete = (total_size - remaining_size) / total_size

Thoughts?
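A rough sketch of that idea (my own illustration; assumes a Unix-like system where du is available, and root is a hypothetical path):

import os
import subprocess

root = '/some/root'  # hypothetical path
out = subprocess.check_output(['du', '-sb', root]).decode()
total_size = int(out.split()[0])  # du also counts directory metadata, so this is approximate
remaining_size = total_size
for path, dirs, files in os.walk(root):
    for name in files:
        try:
            remaining_size -= os.path.getsize(os.path.join(path, name))
        except OSError:
            pass  # file vanished or is unreadable
    pct_complete = 100.0 * (total_size - remaining_size) / total_size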
One optimisation you could do: you are converting filelist into a set on every call to returnMatches, even though it never changes. Move the conversion to the start of the locate function and pass the set in on every iteration.
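For illustration, the change might look like this (a sketch; it assumes the asker's returnMatches helper accepts any iterable of lowercase names):

import os

def locate(filelist, root=os.curdir):
    fileset = set(k.lower() for k in filelist)  # build the set once, not per directory
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in returnMatches(fileset, [k.lower() for k in files]):
            yield path + "\\"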
Well, this was fun. Here is another silly way of doing it, but like everything else, it only calculates the right progress for uniform trees.
import os

def calc_progress(progress, root, dirs):
    prog_start, prog_end, prog_slice = 0.0, 1.0, 1.0
    current_progress = 0.0
    parent_path, current_name = os.path.split(root)
    data = progress.get(parent_path)
    if data:
        prog_start, prog_end, subdirs = data
        i = subdirs.index(current_name)
        prog_slice = (prog_end - prog_start) / len(subdirs)
        current_progress = prog_slice * i + prog_start
        if i == (len(subdirs) - 1):
            del progress[parent_path]
    if dirs:
        progress[root] = (current_progress, current_progress + prog_slice, dirs)
    return current_progress

def walk(start_root):
    progress = {}
    print 'Starting with {start_root}'.format(**locals())
    for root, dirs, files in os.walk(start_root):
        print '{0}: {1:%}'.format(root[len(start_root)+1:], calc_progress(progress, root, dirs))
The os.walk documentation has a helpful example:
import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories
Despite the note that os.walk got faster in Python 3.5 by switching to os.scandir, this doesn't mention that it's still a sub-optimal implementation on Windows.

https://www.python.org/dev/peps/pep-0471/ does describe this and gets it almost right. However, it recommends using recursion. When dealing with arbitrary folder structures, this doesn't work so well, as you'll quickly hit Python's recursion limit (you'll only be able to iterate a folder structure up to 1000 folders deep, which isn't necessarily unrealistic if you're starting at the root of the filesystem). The real limit isn't actually 1000; it's 1000 minus your Python call depth when you go to run this function. If you're doing this in response to a web service request through Django, with lots of business-logic layers, it wouldn't be unrealistic to get close to that limit easily.
The following snippet should be optimal on all operating systems & handle any folder structure you throw at it. Memory usage will obviously grow the more folders you encounter but to my knowledge there's nothing you can really do about that as you somehow have to keep track of where you need to go.
import os

def get_tree_size(path):
    """Total size in bytes of all files under path, iteratively (no recursion limit)."""
    total_size = 0
    dirs = [path]  # manual stack instead of recursion
    while dirs:
        next_dir = dirs.pop()
        with os.scandir(next_dir) as it:  # os.scandir() as a context manager needs Python 3.6+
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size
It's possible that using a collections.deque may speed up operations vs the naive usage of a list here, but I suspect it would be hard to write a benchmark showing it, with disk speeds what they are today.
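For reference, here is a sketch of what the deque swap would look like (note that popleft() also changes the traversal to breadth-first; deque.pop() would keep the depth-first order of the list version):

import os
from collections import deque

def get_tree_size_bfs(path):
    total_size = 0
    dirs = deque([path])
    while dirs:
        with os.scandir(dirs.popleft()) as it:  # FIFO: breadth-first
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size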
I have found a few approaches for finding the newest file created by a user in a directory, but I need to determine whether an easier approach exists. Most posts on this topic work in some instances or have major hurdles, so I am hoping to unmuddy the water.

I am having difficulty looking through a growing file system, and bringing in more users adds more potential errors.

I get data from a Superlogics Winview CP 32 for a continuously streaming system. On each use of the system, I have the operator input a unique identifier for the file name, which contains a few of the initial conditions of the system we need to track. I would like to get that file name with no help from the operator/user.
Eventually, the end goal is to pare down a list of files I want to search, filtered based on keys, so my first instinct was to use only matching file types, trim all folders in a pathway into a list, and sort based on max timestamp. I used some pretty common functions from these pages:
import fnmatch
import os

def fileWalkIn(path='.', matches=[], filt='*.csv'): # Useful for walking through a given directory
    """Iterates through all files under the given path using a filter."""
    for root, dirnames, filenames in os.walk(path):
        for filename in fnmatch.filter(filenames, filt):
            matches.append(os.path.join(root, filename))
            yield os.path.join(root, filename)

def getRecentFile(path='.', matches=[], filt='*.dat'):
    rr = max(fileWalkIn(path=path, matches=matches, filt=filt), key=os.path.getmtime)
    return rr
This got me far, but it is rather bulky and slow, which means I cannot run it repeatedly to explore the matching files without carrying around a bulky list of them.
Ideally, I will be able to process the data on the fly, executing and printing live while it writes, so this approach is not usable in that instance.
I borrowed from these pages a new approach by alex-martelli, which does not use a filter, can return files as well as directories, is much slimmer than fileWalkIn, and works more quickly when using the timestamp.
import os

def all_subdirs_of(b='.'): # Useful for walking through a given directory
    # Create a list of files and directories in the parent directory
    results = []
    for d in os.listdir(b):
        bd = os.path.join(b, d)
        if os.path.isfile(bd):
            results.append(bd)
        elif os.path.isdir(bd):
            results.append(bd)
    # return both
    return results

def newest(path='.'):
    rr = max(all_subdirs_of(b=path), key=os.path.getmtime)
    return rr

def getActiveFile(newFile='.'):
    while os.path.exists(newFile):
        newFile = newest(newFile)
        if os.path.isfile(newFile):
            return newFile
        else:
            if newFile:
                continue
            else:
                return newFile
This gets me the active file in a directory much more quickly, but only if no other files have been written since launching my data collection. I can see all kinds of problems here and need some help determining whether I have gone down a rabbit hole and a simpler solution exists, like testing file sizes, or whether a more cohesive solution with fewer potential snags exists.
I found answers for other languages (Java: how-to-get-the-path-of-a-running-jar-file), but would need something in Python. I have explored libraries like watchdog and win32, but both have steep learning curves, and I feel like I am either very close or need to change my paradigm entirely.
dircache might speed up the second approach a bit. It's a wrapper around os.listdir that checks the directory timestamp and only re-reads the directory contents if there's been a change. (Note that the dircache module was removed in Python 3.)

Beyond that, you really need something that listens for file system events. A quick google turned up two pip packages: pyinotify, for Linux only, and watchdog.
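For example, a minimal watchdog sketch (my own illustration; the handler name and watched path are placeholders):

import time
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            print(event.src_path)  # a newly created file

observer = Observer()
observer.schedule(NewFileHandler(), path='.', recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()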
Hope this helps.
I am new to Python and, although I have been reading and enjoying it so far, have ∂ experience, where ∂ → 0.
I have a folder tree, and each folder at the bottom of the tree's branches contains many files. For me, this whole tree is the input.
I would like to perform several steps of analysis (I believe these are irrelevant to this question), the results of which I would like returned in a tree identical to that of the input, called output.
I have two ideas:
1. Read through each folder recursively using os.walk() and perform the analysis on each file, or
2. Use a function such as shutil.copytree() and perform the analysis somewhere along the way. So actually, I do not want to COPY the tree at all, but rather replicate its structure with new files. I thought this might be a kind of 'hack', as I do actually want to use each input file to create the output file; instead of a copy command, I need an analyse command. The rest should remain unchanged, as far as my imagination allows me to understand.
I have little experience with option 1 and zero experience with option 2.
Until now, for smaller trees, I have been hard-coding the paths, which has become too time-consuming at this point.
I have also seen more mundane ways, such as using glob to first find all the files I would like and work on them, but I don't know how this might help find a shortcut in recreating the input tree for my output.
My attempt at option 1 looks like this:
import os

for root, dirs, files in os.walk('/Volumes/Mac OS Drive/Data/input/'):
    # I have no actual need to print these, it just helps me see what is happening
    print root, "\n"
    print dirs, "\n"
    # This is my actual work going on
    [analysis_function(name) for name in files]
However, I fear this is going to be very slow. I would also like to do some kind of filtering of the files too; for example, the .DS_Store files created in Mac directory trees are included in the results of the above. I would attempt to use the fnmatch module to filter for only the files I want.
I have seen in the copytree function that it is possible to ignore files according to a pattern, which would be helpful; however, I do not understand from the documentation where I could hook in my analysis function for each file.
You can use both options: you could provide to shutil.copytree() a custom copy_function that performs the analysis instead of the default shutil.copy2 (it is more of a hack), or you could use os.walk() to have finer control over the process.
You don't need to create the parent directories manually either way: copytree() creates them for you, and os.makedirs(output_dir) can create them if you use os.walk():
#!/usr/bin/env python2
import fnmatch
import itertools
import os

ignore_dir = lambda d: d in ('.git', '.svn', '.hg')
src_dir = '/Volumes/Mac OS Drive/Data/input/'  # source directory
dst_dir = '/path/to/destination/'  # destination directory

for root, dirs, files in os.walk(src_dir):
    for input_file in fnmatch.filter(files, "*.input"):  # for each input file
        output_file = os.path.splitext(input_file)[0] + '.output'
        output_dir = os.path.join(dst_dir, root[len(src_dir):])
        if not os.path.isdir(output_dir):
            os.makedirs(output_dir)  # create destination directories
        analyze(os.path.join(root, input_file),  # perform analysis
                os.path.join(output_dir, output_file))
    # don't visit ignored subtrees
    dirs[:] = itertools.ifilterfalse(ignore_dir, dirs)
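For completeness, here is a sketch of the copytree() variant mentioned at the top (my own illustration, not from the original answer): copy_function is only accepted by shutil.copytree() in Python 3.2+, analyze is the same hypothetical analysis function, and the output files keep the input file names.

import shutil

def analyze_copy(src, dst):
    # copytree() calls this once per file; run the analysis instead of copying
    analyze(src, dst)

# reuses src_dir and dst_dir from the snippet above; dst_dir must not exist yet
shutil.copytree(src_dir, dst_dir, copy_function=analyze_copy)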
I am writing a plug-in for RawTherapee in Python. I need to extract the version number from a file called 'AboutThisBuild.txt' that may exist anywhere in the directory tree. Although RawTherapee knows where it is installed, this data is baked into the binary file.

My plug-in is being designed to collect basic system data when run without any command-line parameters, for the purpose of short-circuiting troubleshooting. By having the version number, revision number, and changeset (AKA Mercurial), I can sort out why the script may not be working as expected. OK, that is the context.
I have tried a variety of methods, some suggested elsewhere on this site. The main one is using os.walk and fnmatch.
The problem is speed. Searching the entire directory tree is like watching paint dry!
To reduce load I have tried to predict likely hiding places and only traverse these. This is quicker but has the obvious disadvantage of missing some files.
This is what I have at the moment. Tested on Linux but not Windows as yet as I am still researching where the file might be placed.
import fnmatch
import os

rootPath = ('/usr/share/doc/rawtherapee',
            '~',  # note: os.walk() will not expand '~'; os.path.expanduser() is needed for that
            '/media/CoreData/opt/',
            '/opt')
pattern = 'AboutThisBuild.txt'

# Return the first instance of RT found in the paths searched
for CheckPath in rootPath:
    print("\n")
    print(">>>>>>>>>>>>> " + CheckPath)
    print("\n")
    for root, dirs, files in os.walk(CheckPath, True, None, False):
        for filename in fnmatch.filter(files, pattern):
            print(os.path.join(root, filename))
            break
Usually 'AboutThisBuild.txt' is stored in a directory/subdirectory called 'rawtherapee', or has that string somewhere in its path. I had naively thought I could get the 5000-odd directory names, search them for 'rawtherapee', and then use os.walk to traverse only those directories, but all the modules and functions I have looked at collate all the files in the directory (again).

Anyone have a quicker method of searching the entire directory tree, or am I stuck with this hybrid option?
I am a beginner in Python, but I think I know the simplest way of finding a file in Windows.
import os

for dirpath, subdirs, filenames in os.walk('The directory you want to search the file in'):
    if 'name of your file with extension' in filenames:
        print(dirpath)
This code will print the directory of the file you are searching for to the console. All you have to do then is navigate to that directory.
The thing about searching is that it doesn't matter too much how you get there (e.g. cheating). Once you have a result, you can verify it is correct relatively quickly.

You may be able to identify candidate locations fairly efficiently by guessing. For example, on Linux, you could first try looking in these locations (obviously not all of them are directories, but it does no harm to call os.path.isfile('/;l$/AboutThisBuild.txt')):
$ strings /usr/bin/rawtherapee | grep '^/'
/lib/ld-linux.so.2
/H=!
/;l$
/9T$,
/.ba
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
If you have it installed, you can try the locate command
If you still don't find it, move on to the brute force method
Here is a rough equivalent of strings using Python
>>> from string import printable, whitespace
>>> from itertools import groupby
>>> pathchars = set(printable) - set(whitespace)
>>> with open("/usr/bin/rawtherapee") as fp:
... data = fp.read()
...
>>> for k, g in groupby(data, pathchars.__contains__):
... if not k: continue
... g = ''.join(g)
... if len(g) > 3 and g.startswith("/"):
... print g
...
/lib64/ld-linux-x86-64.so.2
/^W0Kq[
/pW$<
/3R8
/)wyX
/WUO
/w=H
/t_1
/.badpixH
/d$(
/\$P
/D$Pv
/D$#
/D$(
/l$#
/d$#v?H
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue.csv
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
It sounds like you need a pure Python solution here. If not, other answers will suffice.

In this case, you should traverse the folders using a queue and threads. While some may say threads are never the solution, they are a great way of speeding things up when you are I/O bound, which you are in this case. Essentially, you'll os.listdir the current dir. If it contains your file, party like it's 1999. If it doesn't, add each subfolder to the work queue.

If you're clever, you can play with depth-first vs breadth-first traversal to get the best results.

There is a great example I have used quite successfully at work at http://www.tutorialspoint.com/python/python_multithreading.htm. See the section titled Multithreaded Priority Queue. The example could probably be updated to use thread pools, though it's not necessary.
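A minimal sketch of that queue-plus-threads idea (my own illustration; threaded_find and its termination bookkeeping are hypothetical, and this variant is breadth-first):

import os
import threading

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2

def threaded_find(target, start_dir, num_workers=8):
    """Breadth-first search for a file name using a work queue and threads."""
    work = queue.Queue()
    work.put(start_dir)
    found = []        # list.append is atomic in CPython, so this is safe enough
    pending = [1]     # directories queued or in flight
    lock = threading.Lock()

    def worker():
        while True:
            try:
                d = work.get(timeout=0.1)
            except queue.Empty:
                with lock:
                    if pending[0] == 0:
                        return  # nothing queued and nothing in flight: done
                continue
            try:
                names = os.listdir(d)
            except OSError:
                names = []  # permission denied, vanished, etc.
            for name in names:
                path = os.path.join(d, name)
                if name == target:
                    found.append(path)
                elif os.path.isdir(path):
                    with lock:
                        pending[0] += 1
                    work.put(path)
            with lock:
                pending[0] -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found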
I'm trying to get a homemade path navigation function working - basically I need to go through one folder, and explore every folder within it, running a function within each folder.
I reach a problem when I try to change directories within a for loop. I've got this "findDirectories" function:
import os

def findDirectories(list):
    for files in os.listdir("."):
        print (files)
        list.append(files)
        os.chdir("y")
That last line causes the problems. If I remove it, the function just compiles a list of all the folders in that folder. Unfortunately, this means I have to run it again each time I go down a folder; I can't just run the whole thing once. I've specified the folder "y" because that's a real folder, but the program crashes upon opening even with that. Doing os.chdir("y") outside of the for loop causes no issues at all.

I'm new to Python, but not to programming in general. How can I get this to work, or is there a better way? The final result I need is to run a function on each "*Response.xml" file that exists within this folder, no matter how deeply nested it is.
Well, you don't post the traceback of the actual error, but clearly it doesn't work because you have specified y as a relative path. Thus it may be able to change into y in the first iteration of the loop, but in the second it will be trying to change into a subdirectory of y that is also called y, which you probably do not have.
You want to be doing something like
import os

for dirName, subDirs, fileNames in os.walk(rootPath):
    # it's not clear which files you want; I assume anything that ends with Response.xml?
    for f in fileNames:
        if f.endswith("Response.xml"):
            # this is the path you will want to use
            filePath = os.path.join(dirName, f)
            # now do something with it!
            doSomethingWithFilePath(filePath)
That's untested, but you get the idea...
As Dan said, os.walk would be better. See the example there.