I programmed a scanner that looks for certain files on all hard drives of a system that gets scanned. Some of these systems are pretty old, running Windows 2000 with 256 or 512 MB of RAM but the file system structure is complex as some of them serve as file servers.
I use os.walk() in my script to parse all directories and files.
Unfortunately we noticed that the scanner consumes a lot of RAM after some time of scanning and we figured out that the os.walk function alone uses about 50 MB of RAM after 2h of walk over the file system. This RAM usage increases over the time. We had about 90 MB of RAM after 4 hours of scanning.
Is there a way to avoid this behaviour? We also tried "betterwalk.walk()" and "scandir.walk()". The result was the same.
Do we have to write our own walk function that removes already scanned directory and file objects from memory so that the garbage collector can remove them from time to time?
Thanks
Have you tried the glob module?

import os, glob

def globit(srchDir):
    srchDir = os.path.join(srchDir, "*")
    for file in glob.glob(srchDir):
        print file
        globit(file)  # recurse; globbing a plain file path simply yields nothing

if __name__ == '__main__':
    dir = r'C:\working'
    globit(dir)
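On Python 3.5+ you can also let glob do the recursion lazily with glob.iglob(..., recursive=True), which yields one path at a time instead of building a list. A minimal sketch of that (note that Python 3.5+ will not run on the Windows 2000 boxes mentioned above, so this only helps on the newer systems):

import glob
import os

# iglob returns a generator, so paths are produced lazily as the tree is walked
for path in glob.iglob(os.path.join(r'C:\working', '**', '*'), recursive=True):
    if os.path.isfile(path):
        print(path)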
If you are running in the os.walk loop, del everything that you don't need anymore, and try running gc.collect() at the end of every iteration of os.walk.
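A minimal sketch of that idea (the matching test is just a placeholder for whatever your scanner checks):

import gc
import os

def scan(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # ... check full_path against whatever you are scanning for ...
        # drop the local references and give the garbage collector a nudge
        del filenames
        gc.collect()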
Generators are a better solution, as they do lazy computation. Here is one example implementation.
import os
import fnmatch

# this may or may not be implemented
def list_dir(path):
    for name in os.listdir(path):
        yield os.path.join(path, name)

# modify this to take some pattern as input
def os_walker(top):
    for root, dlist, flist in os.walk(top):
        for name in fnmatch.filter(flist, '*.py'):
            yield os.path.join(root, name)

all_dirs = list_dir("D:\\tuts\\pycharm")

for l in all_dirs:
    for name in os_walker(l):
        print(name)
Thanks to David Beazley
I have some code that is great for doing small numbers of mp4s, but at the 100th one I start to run out of RAM. I know you can sequentially write CSV files, I am just not sure how to do that for mp4s. Here is the code I have:
This solution works:
from moviepy.editor import *
import os
from natsort import natsorted

L = []

for root, dirs, files in os.walk("/path/to/the/files"):
    # files.sort()
    files = natsorted(files)
    for file in files:
        if os.path.splitext(file)[1] == '.mp4':
            filePath = os.path.join(root, file)
            video = VideoFileClip(filePath)
            L.append(video)

final_clip = concatenate_videoclips(L)
final_clip.to_videofile("output.mp4", fps=24, remove_temp=False)
The code above is what I tried, and I expected a smooth result; though it worked perfectly on a test batch, it could not handle the main batch.
You appear to be appending the contents of a large number of video files to a list, yet you report that available RAM is much less than the total size of those files. So don't accumulate the result in memory. Follow one of these approaches:
keep an open file descriptor
with open("combined_video.mp4", "wb") as fout:
    for file in files:
        ...
        video = ...
        fout.write(video)
Or perhaps it is fout.write(video.data) or video.write_segment(fout) -- I don't know about the video I/O library you're using. The point is that the somewhat large video object is re-assigned each time, so it does not grow without bound, unlike your list L.
append to existing file
We can nest in the other order, if that's more convenient.
for file in files:
    with open("combined_video.mp4", "ab") as fout:
        ...
        video = ...
        fout.write(video)
Here we're doing a binary append. Repeated open / close is slightly less efficient, but it has the advantage of letting you do a run with four input files, then python exits, then later you do a run with a pair of new files, and you'll still find the expected half a dozen files in the combined output.
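If a raw byte append does not give you a playable result (MP4 containers generally can't just be concatenated byte-for-byte), a middle ground is to concatenate in small batches with moviepy and close each clip as soon as its batch has been written, so only a handful of clips are ever open at once. A rough sketch, assuming a recent moviepy where clips have a close() method and write_videofile is the current name for the to_videofile call above; the batch size and the part_*.mp4 filenames are purely illustrative:

from moviepy.editor import VideoFileClip, concatenate_videoclips
from natsort import natsorted
import os

BATCH = 20   # hypothetical batch size; tune it to your available RAM

# collect the .mp4 paths first (cheap: just strings, not clips)
paths = []
for root, dirs, files in os.walk("/path/to/the/files"):
    for file in natsorted(files):
        if os.path.splitext(file)[1] == '.mp4':
            paths.append(os.path.join(root, file))

# concatenate in batches, writing each batch to an intermediate file
parts = []
for i in range(0, len(paths), BATCH):
    clips = [VideoFileClip(p) for p in paths[i:i + BATCH]]
    part_name = "part_%03d.mp4" % (i // BATCH)
    concatenate_videoclips(clips).write_videofile(part_name, fps=24)
    for c in clips:
        c.close()   # release the readers so RAM does not keep growing
    parts.append(part_name)

# finally stitch the much smaller number of parts together
clips = [VideoFileClip(p) for p in parts]
concatenate_videoclips(clips).write_videofile("output.mp4", fps=24)
for c in clips:
    c.close()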
os.walk has a helpful example:
import os
from os.path import join, getsize

for root, dirs, files in os.walk('python/Lib/email'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories
Despite the note that os.walk got faster in Python 3.5 by switching to os.scandir, the documentation doesn't mention that it is still a sub-optimal implementation on Windows.
https://www.python.org/dev/peps/pep-0471/ does describe this and gets it almost right; however, it recommends using recursion. When dealing with arbitrary folder structures that doesn't work so well, because you quickly hit Python's recursion limit: you'll only be able to iterate a folder structure up to 1000 folders deep, which isn't necessarily unrealistic if you're starting at the root of a filesystem. The real limit isn't actually 1000 either; it's 1000 minus your Python call depth when you go to run this function. If you're doing this in response to a web service request through Django, with lots of business logic layers, it wouldn't be hard to get close to that limit.
The following snippet should be optimal on all operating systems & handle any folder structure you throw at it. Memory usage will obviously grow the more folders you encounter but to my knowledge there's nothing you can really do about that as you somehow have to keep track of where you need to go.
import os

def get_tree_size(path):
    total_size = 0
    dirs = [path]
    while dirs:
        next_dir = dirs.pop()
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size
It's possible that using a collections.deque would speed up operations versus the naive usage of a list here, but I suspect it would be hard to write a benchmark that shows it, with disk speeds what they are today.
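For reference, the deque variant only changes a couple of lines; a sketch, with popleft() chosen here to make the traversal breadth-first:

import os
from collections import deque

def get_tree_size_deque(path):
    total_size = 0
    dirs = deque([path])
    while dirs:
        # popleft() gives breadth-first order; pop() would keep it depth-first
        next_dir = dirs.popleft()
        with os.scandir(next_dir) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.path)
                else:
                    total_size += entry.stat(follow_symlinks=False).st_size
    return total_size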
I am executing the python code that follows.
I am running it on a folder ("articles") which has a couple hundred subfolders and 240,226 files in all.
I am timing the execution. At first the times were pretty stable but went non-linear after 100,000 files. Now the times (I am timing at 10,000-file intervals) can go non-linear after 30,000 or so (or not).
I have the Task Manager open and can correlate the slow-downs with 99% disk usage by python.exe. I have done gc.collect(), dels, etc., and turned off Windows indexing. I have restarted Windows and emptied the trash (I have a few hundred GB free). Nothing helps; the disk usage seems to be getting more erratic if anything.
Sorry for the long post - Thanks for the help
import os, glob

dirs = []
nxml_files = []

def get_filenames():
    for (dirpath, dirnames, filenames) in os.walk("articles/"):
        dirs.extend(dirnames)
    for dir in dirs:
        path = "articles" + "\\" + dir
        nxml_files.extend(glob.glob(path + "/*.nxml"))
    return nxml_files

def extract_text_from_files(nxml_files):
    for nxml_file in nxml_files:
        fast_parse(nxml_file)

def fast_parse(infile):
    file = open(infile, "r")
    filetext = file.read()
    tag_breaks = filetext.split('><')
    paragraphs = [tag_break.strip('p>').strip('</') for tag_break in tag_breaks if tag_break.startswith('p>')]

def run_files():
    nxml_files = get_filenames()
    extract_text_from_files(nxml_files)

if __name__ == "__main__":
    run_files()
There are some things that could be optimized.
First, if you open files, close them as well; a with open(...) as name: block will do that for you. By the way, in Python 2 file is a bad choice for a variable name, since it is a built-in function's name.
You can remove one disk read by doing string comparisons instead of the glob.
And last but not least: os.walk yields its results incrementally, so don't buffer them into a list; process everything inside one loop. This will save a lot of memory.
That is what I can advise from the code. For more details on what is causing the I/O you should use profiling. See https://docs.python.org/2/library/profile.html for details.
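Putting those suggestions together might look roughly like this (a sketch only: the endswith('.nxml') test stands in for the glob call, and each file is handled as os.walk yields it instead of being collected into a list first):

import os

def fast_parse(infile):
    with open(infile, "r") as f:      # the file is closed automatically
        filetext = f.read()
    tag_breaks = filetext.split('><')
    return [tb.strip('p>').strip('</') for tb in tag_breaks if tb.startswith('p>')]

def run_files():
    for dirpath, dirnames, filenames in os.walk("articles/"):
        for name in filenames:
            # plain string comparison instead of a second pass over the disk with glob
            if name.endswith(".nxml"):
                fast_parse(os.path.join(dirpath, name))

if __name__ == "__main__":
    run_files()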
I am writing a plug-in for RawTherapee in Python. I need to extract the version number from a file called 'AboutThisBuild.txt' that may exist anywhere in the directory tree. Although RawTherapee knows where it is installed, this data is baked into the binary file.
My plug-in is designed to collect basic system data when run without any command line parameters, for the purpose of short-circuiting troubleshooting. By having the version number, revision number and changeset (AKA Mercurial), I can sort out why the script may not be working as expected. OK, that is the context.
I have tried a variety of methods, some suggested elsewhere on this site. The main one is using os.walk and fnmatch.
The problem is speed. Searching the entire directory tree is like watching paint dry!
To reduce load I have tried to predict likely hiding places and only traverse these. This is quicker but has the obvious disadvantage of missing some files.
This is what I have at the moment. Tested on Linux but not Windows as yet as I am still researching where the file might be placed.
import fnmatch
import os
import sys

rootPath = ('/usr/share/doc/rawtherapee',
            '~',
            '/media/CoreData/opt/',
            '/opt')
pattern = 'AboutThisBuild.txt'

# Return the first instance of RT found in the paths searched
for CheckPath in rootPath:
    print("\n")
    print(">>>>>>>>>>>>> " + CheckPath)
    print("\n")
    for root, dirs, files in os.walk(CheckPath, True, None, False):
        for filename in fnmatch.filter(files, pattern):
            print(os.path.join(root, filename))
            break
Usually 'AboutThisBuild.txt' is stored in a directory/subdirectory called 'rawtherapee', or has that string somewhere in the directory tree. I had naively thought I could get the 5000-odd directory names, search these for 'rawtherapee', and then use os.walk to traverse those directories, but all the modules and functions I have looked at collate all files in the directory (again).
Anyone have a quicker method of searching the entire directory tree or am I stuck with this hybrid option?
I am a beginner in Python, but I think I know the simplest way of finding a file in Windows.
import os

for dirpath, subdirs, filenames in os.walk('The directory you wanna search the file in'):
    if 'name of your file with extension' in filenames:
        print(dirpath)
This code will print out the directory of the file you are searching for in the console. All you have to do is get to the directory.
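If you want the full path rather than just the containing directory, and you want to stop at the first hit, a small variation of the same idea (the filename and the starting directory here are just placeholders):

import os

target = 'AboutThisBuild.txt'     # placeholder: the file you are looking for
found = None
for dirpath, subdirs, filenames in os.walk('/'):
    if target in filenames:
        found = os.path.join(dirpath, target)
        break                     # stop walking as soon as the first match appears
print(found)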
The thing about searching is that it doesn't matter too much how you get there (eg cheating). Once you have a result, you can verify it is correct relatively quickly.
You may be able to identify candidate locations fairly efficiently by guessing. For example, on Linux, you could first try looking in these locations (obviously not all are directories, but it doesn't do any harm to os.path.isfile('/;l$/AboutThisBuild.txt'))
$ strings /usr/bin/rawtherapee | grep '^/'
/lib/ld-linux.so.2
/H=!
/;l$
/9T$,
/.ba
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
If you have it installed, you can try the locate command
If you still don't find it, move on to the brute force method
Here is a rough equivalent of strings using Python
>>> from string import printable, whitespace
>>> from itertools import groupby
>>> pathchars = set(printable) - set(whitespace)
>>> with open("/usr/bin/rawtherapee") as fp:
...     data = fp.read()
...
>>> for k, g in groupby(data, pathchars.__contains__):
...     if not k: continue
...     g = ''.join(g)
...     if len(g) > 3 and g.startswith("/"):
...         print g
...
/lib64/ld-linux-x86-64.so.2
/^W0Kq[
/pW$<
/3R8
/)wyX
/WUO
/w=H
/t_1
/.badpixH
/d$(
/\$P
/D$Pv
/D$#
/D$(
/l$#
/d$#v?H
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue.csv
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
It sounds like you need a pure python solution here. If not, other answers will suffice.
In this case, you should traverse the folders using a queue and threads. While some may say Threads are never the solution, Threads are a great way of speeding up when you are I/O bound, which you are in this case. Essentially, you'll os.listdir the current dir. If it contains your file, party like it's 1999. If it doesn't, add each subfolder to the work queue.
If you're clever, you can play with depth first vs breadth first traversal to get the best results.
There is a great example I have used quite successfully at work, at http://www.tutorialspoint.com/python/python_multithreading.htm. See the section titled Multithreaded Priority Queue. The example could probably be updated to use thread pools, but it's not necessary.
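A minimal sketch of that queue-plus-threads traversal (the function name, thread count, and the islink check are mine, not from the tutorial; the daemon workers simply idle once the search is done):

import os
import queue
import threading

def find_file(start_dir, target, num_workers=8):
    """Breadth-first search for `target` using a work queue and worker threads."""
    work = queue.Queue()
    work.put(start_dir)
    found = []

    def worker():
        while True:
            d = work.get()           # block until a directory is available
            try:
                names = os.listdir(d)
            except OSError:
                names = []           # unreadable directory: skip it
            if target in names:
                found.append(os.path.join(d, target))
            for name in names:
                p = os.path.join(d, name)
                if os.path.isdir(p) and not os.path.islink(p):
                    work.put(p)      # breadth-first: new directories go to the back
            work.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    work.join()                      # returns once every queued directory is processed
    return found

# e.g. print(find_file('/usr', 'AboutThisBuild.txt'))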
I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern on each line. These regex are for grep as our current implementation is basically a bash script find\grep combo. The bash script takes 358 seconds on my benchmark directory. I was able to write a python script that did this in 72 seconds but want to improve more. First I will post the base-code and then tweaks I have tried:
import os, sys, Queue, threading, re

fileList = []
rootDir = sys.argv[1]

class Recurser(threading.Thread):

    def __init__(self, queue, dir):
        self.queue = queue
        self.dir = dir
        threading.Thread.__init__(self)

    def run(self):
        self.addToQueue(self.dir)

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def addToQueue(self, rootDir):
        for root, subFolders, files in os.walk(rootDir):
            for file in files:
                self.queue.put(os.path.join(root, file))
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)

class Scanner(threading.Thread):

    def __init__(self, queue, patterns):
        self.queue = queue
        self.patterns = patterns
        threading.Thread.__init__(self)

    def run(self):
        nextFile = self.queue.get()
        while nextFile is not -1:
            #print "Trying " + nextFile
            self.scanFile(nextFile)
            nextFile = self.queue.get()

    #HELPER FUNCTION FOR INTERNAL USE ONLY
    def scanFile(self, file):
        fp = open(file)
        contents = fp.read()
        i = 0
        #for patt in self.patterns:
        if self.patterns.search(contents):
            print "Match " + str(i) + " found in " + file

############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################

fileQueue = Queue.Queue()

#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
    #patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
    giantRE = giantRE + line.rstrip() + '|'
giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)

#start recursing the directories
recurser = Recurser(fileQueue, rootDir)
recurser.start()

print "starting scanner"

#start checking the files
for scanner in xrange(0, 8):
    scanner = Scanner(fileQueue, giantRE)
    scanner.start()
This is obviously debugging\ugly code, so do not mind the million queue.put(-1) calls; I will clean this up later. Some indentations are not showing up properly, particularly in scanFile.
Anyway some things I've noticed. Using 1, 4, and even 8 threads (for scanner in xrange(0,???):) does not make a difference. I still get ~72 seconds regardless. I assume this is due to python's GIL.
As opposed to making a giant regex, I tried placing each line (pattern) as a compiled RE in a list and iterating through this list in my scanFile function. This resulted in longer execution time.
In an effort to avoid python's GIL I tried having each thread fork to grep as in:
#HELPER FUNCTION FOR INTERNAL USE ONLY
def scanFile(self, file):
    s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file), stdout=subprocess.PIPE)
    output = s.communicate()[0]
    if output != '':
        print 'Match found in ' + file
This resulted in longer execution time.
Any suggestions on improving performance?
:::::::::::::EDIT::::::::
I can not post answers to my own questions yet however here are the answers to several points raised:
@David Nehme - Just to let people know, I am aware of the fact that I have a million queue.put(-1)'s.
@Blender - To mark the bottom of the queue. My scanner threads keep dequeuing until they hit -1, which is at the bottom (while nextFile is not -1:). The processor has 8 cores; however, due to the GIL, using 1 thread, 4 threads, or 8 threads does NOT make a difference. Spawning 8 subprocesses resulted in significantly slower code (142 sec vs 72).
@ed - Yes, that, and it's just as slow as the find\grep combo; actually slower, because it indiscriminately greps files that aren't needed.
@Ron - Can't upgrade; this must be universal. Do you think this will get it under 72 seconds? The bash grepper takes 358 seconds. My Python giant-RE method takes 72 seconds w\ 1-8 threads. The popen method w\ 8 threads (8 subprocesses) ran at 142 seconds. So far the giant-RE Python-only method is the clear winner by far.
@intuited
Here's the meat of our current find\grep combo (Not my script). It's pretty simple. There are some additional things in there like ls, but nothing that should result in a 5x slowdown. Even if grep -r is slightly more efficient 5x is a HUGE slowdown.
find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"
The Python code is more efficient; I don't know why, but I tested it experimentally. I prefer to do this in Python. I already achieved a 5x speedup with Python and would like to speed it up more.
:::::::::::::WINNER WINNER WINNER:::::::::::::::::
Looks like we have a winner.
intuited's shell script comes in 2nd place with 34 seconds; however, @steveha's came in first with 24 seconds. Due to the fact that a lot of our boxes do not have Python 2.6, I had to cx_freeze it. I can write a shell script wrapper to wget a tar and unpack it. I do like intuited's for simplicity, however.
Thank you for all your help, guys. I now have an efficient tool for sysadmining.
I think that, rather than using the threading module, you should be using the multiprocessing module for your Python solution. Python threads can run afoul of the GIL; the GIL is not a problem if you simply have multiple Python processes going.
I think that for what you are doing a pool of worker processes is just what you want. By default, the pool will default to one process for each core in your system processor. Just call the .map() method with a list of filenames to check and the function that does the checking.
http://docs.python.org/library/multiprocessing.html
If this is not faster than your threading implementation, then I don't think the GIL is your problem.
EDIT: Okay, I'm adding a working Python program. This uses a pool of worker processes to open each file and search for the pattern in each. When a worker finds a filename that matches, it simply prints it (to standard output) so you can redirect the output of this script into a file and you have your list of files.
EDIT: I think this is a slightly easier to read version, easier to understand.
I timed this, searching through the files in /usr/include on my computer. It completes the search in about half a second. Using find piped through xargs to run as few grep processes as possible, it takes about 0.05 seconds, about a 10x speedup. But I hate the baroque weird language you must use to get find to work properly, and I like the Python version. And perhaps on really big directories the disparity would be smaller, as part of the half-second for Python must have been startup time. And maybe half a second is fast enough for most purposes!
import multiprocessing as mp
import os
import re
import sys
from stat import S_ISREG

# uncomment these if you really want a hard-coded $HOME/patterns file
#home = os.environ.get('HOME')
#patterns_file = os.path.join(home, 'patterns')

target = sys.argv[1]
size_limit = int(sys.argv[2])
assert size_limit >= 0
patterns_file = sys.argv[3]

# build s_pat as string like: (?:foo|bar|baz)
# This will match any of the sub-patterns foo, bar, or baz
# but the '?:' means Python won't bother to build a "match group".
with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))

# pre-compile pattern for speed
pat = re.compile(s_pat)

def walk_files(topdir):
    """yield up full pathname for each file in tree under topdir"""
    for dirpath, dirnames, filenames in os.walk(topdir):
        for fname in filenames:
            pathname = os.path.join(dirpath, fname)
            yield pathname

def files_to_search(topdir):
    """yield up full pathname for only files we want to search"""
    for fname in walk_files(topdir):
        try:
            # if it is a regular file and big enough, we want to search it
            sr = os.stat(fname)
            if S_ISREG(sr.st_mode) and sr.st_size >= size_limit:
                yield fname
        except OSError:
            pass

def worker_search_fn(fname):
    with open(fname, 'rt') as f:
        # read one line at a time from file
        for line in f:
            if re.search(pat, line):
                # found a match! print filename to stdout
                print(fname)
                # stop reading file; just return
                return

mp.Pool().map(worker_search_fn, files_to_search(target))
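Assuming you save this as, say, scan.py (the filename is just an example), you would run it with the target directory, the minimum file size in bytes, and the patterns file as arguments, for example: python scan.py /target/dir 1 /root/patterns > matches.txt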
I'm a bit confused as to how your Python script ended up being faster than your find/grep combo. If you want to use grep in a way somewhat similar to what's suggested by Ron Smith in his answer, you can do something like
find -type f | xargs -d \\n -P 8 -n 100 grep --file=/root/patterns
to launch grep processes which will process 100 files before exiting, keeping up to 8 such processes active at any one time. Having them process 100 files should make the process startup overhead time of each one negligible.
note: The -d \\n option to xargs is a GNU extension which won't work on all POSIX-ish systems. It specifies that the delimiter between filenames is a newline. Although technically filenames can contain newlines, in practice nobody does this and keeps their jobs. For compatibility with non-GNU xargs you need to add the -print0 option to find and use -0 instead of -d \\n with xargs. This will arrange for the null byte \0 (hex 0x00) to be used as the delimiter both by find and xargs.
You could also take the approach of first counting the number of files to be grepped
NUMFILES="$(find -type f | wc -l)";
and then using that number to get an even split among the 8 processes (assuming bash as shell)
find -type f | xargs -d \\n -P 8 -n $(($NUMFILES / 8 + 1)) grep --file=/root/patterns
I think this might work better because the disk I/O of find won't be interfering with the disk I/O of the various greps. I suppose it depends in part on how large the files are, and whether they are stored contiguously — with small files, the disk will be seeking a lot anyway, so it won't matter as much. Note also that, especially if you have a decent amount of RAM, subsequent runs of such a command will be faster because some of the files will be saved in your memory cache.
Of course, you can parameterize the 8 to make it easier to experiment with different numbers of concurrent processes.
As ed. mentions in the comments, it's quite possible that the performance of this approach will still be less impressive than that of a single-process grep -r. I guess it depends on the relative speed of your disk [array], the number of processors in your system, etc.
If you are willing to upgrade to version 3.2 or better, you can take advantage of the concurrent.futures.ProcessPoolExecutor. I think it will improve performance over the popen method you attempted because it will pre-create a pool of processes where your popen method creates a new process every time. You could write your own code to do the same thing for an earlier version if you can't move to 3.2 for some reason.
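A rough sketch of what that could look like with concurrent.futures (the pattern-building and walk mirror the answer above, the paths are placeholders, and the executor is created once so there is no per-file process startup cost):

import concurrent.futures
import os
import re

# build one big alternation from the patterns file, as in the question
with open('/root/patterns') as f:
    pat = re.compile('(?:{})'.format('|'.join(line.strip() for line in f)),
                     re.IGNORECASE)

def walk_files(topdir):
    for dirpath, dirnames, filenames in os.walk(topdir):
        for name in filenames:
            yield os.path.join(dirpath, name)

def scan_one(path):
    try:
        with open(path) as fh:
            if pat.search(fh.read()):
                return path
    except (IOError, OSError):
        pass
    return None

if __name__ == '__main__':
    # the pool of worker processes is created once and reused for every file
    with concurrent.futures.ProcessPoolExecutor() as pool:
        for hit in pool.map(scan_one, walk_files('/some/dir')):
            if hit:
                print(hit)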
Let me also show you how to do this in Ray, which is an open-source framework for writing parallel Python applications. The advantage of this approach is that it is fast, easy to write and extend (say you want to pass a lot of data between the tasks or do some stateful accumulation), and can also be run on a cluster or the cloud without modifications. It's also very efficient at utilizing all cores on a single machine (even for very large machines like 100 cores) and data transfer between tasks.
import os
import ray
import re

ray.init()

patterns_file = os.path.expanduser("~/patterns")
topdir = os.path.expanduser("~/folder")

with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))

regex = re.compile(s_pat)

@ray.remote
def match(pattern, fname):
    results = []
    with open(fname, 'rt') as f:
        for line in f:
            if re.search(pattern, line):
                results.append(fname)
    return results

results = []
for dirpath, dirnames, filenames in os.walk(topdir):
    for fname in filenames:
        pathname = os.path.join(dirpath, fname)
        results.append(match.remote(regex, pathname))

print("matched files", ray.get(results))
More information, including how to run this on a cluster or the cloud, is available in the documentation.