Parallel file matching, Python - python

I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern on each line. These regex are for grep as our current implementation is basically a bash script find\grep combo. The bash script takes 358 seconds on my benchmark directory. I was able to write a python script that did this in 72 seconds but want to improve more. First I will post the base-code and then tweaks I have tried:
import os, sys, Queue, threading, re
fileList = []
rootDir = sys.argv[1]
class Recurser(threading.Thread):
def __init__(self, queue, dir):
self.queue = queue
self.dir = dir
threading.Thread.__init__(self)
def run(self):
self.addToQueue(self.dir)
## HELPER FUNCTION FOR INTERNAL USE ONLY
def addToQueue(self, rootDir):
for root, subFolders, files in os.walk(rootDir):
for file in files:
self.queue.put(os.path.join(root,file))
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
self.queue.put(-1)
class Scanner(threading.Thread):
def __init__(self, queue, patterns):
self.queue = queue
self.patterns = patterns
threading.Thread.__init__(self)
def run(self):
nextFile = self.queue.get()
while nextFile is not -1:
#print "Trying " + nextFile
self.scanFile(nextFile)
nextFile = self.queue.get()
#HELPER FUNCTION FOR INTERNAL UES ONLY
def scanFile(self, file):
fp = open(file)
contents = fp.read()
i=0
#for patt in self.patterns:
if self.patterns.search(contents):
print "Match " + str(i) + " found in " + file
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
fileQueue = Queue.Queue()
#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
#patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
giantRE = giantRE + line.rstrip() + '|'
giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)
#start recursing the directories
recurser = Recurser(fileQueue,rootDir)
recurser.start()
print "starting scanner"
#start checking the files
for scanner in xrange(0,8):
scanner = Scanner(fileQueue, giantRE)
scanner.start()
This is obviously debugging\ugly code, do not mind the million queue.put(-1), I will clean this up later. Some indentations are not showing up properly, paticularly in scanFile.
Anyway some things I've noticed. Using 1, 4, and even 8 threads (for scanner in xrange(0,???):) does not make a difference. I still get ~72 seconds regardless. I assume this is due to python's GIL.
As opposed to making a giant regex I tried placing each line (pattern) as a compilex RE in a list and iterating through this list in my scanfile function. This resulted in longer execution time.
In an effort to avoid python's GIL I tried having each thread fork to grep as in:
#HELPER FUNCTION FOR INTERNAL UES ONLY
def scanFile(self, file):
s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file), stdout = subprocess.PIPE)
output = s.communicate()[0]
if output != '':
print 'Matchfound in ' + file
This resulted in longer execution time.
Any suggestions on improving performance.
:::::::::::::EDIT::::::::
I can not post answers to my own questions yet however here are the answers to several points raised:
#David Nehme - Just to let people know I am aware of the fact that I have a million queue.put(-1)'s
#Blender - To mark the bottom of the queue. My scanner threads keep dequeing until they hit -1 which is at the bottom (while nextFile is not -1:). The processor cores is 8 however due to the GIL using 1 thread, 4 threads, or 8 threads does NOT make a difference. Spawning 8 subprocesses resulted in significantly slower code (142 sec vs 72)
#ed - Yes that and it's just as slow as the find\grep combo, actually slower because it indiscriminately greps file that aren't needed
#Ron - Can't upgrade, this must be universal. Do you think this will speed up > 72 seconds? The bash grepper does 358 seconds. My python giant RE method does 72 seconds w\ 1-8 threads. The popen method w\ 8 thrads (8 subprocesses) ran at 142 seconds. So far the giant RE python only method is a clear winner by far
#intuted
Here's the meat of our current find\grep combo (Not my script). It's pretty simple. There are some additional things in there like ls, but nothing that should result in a 5x slowdown. Even if grep -r is slightly more efficient 5x is a HUGE slowdown.
find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"
The python code is more efficient, I don't know why, but I experimentally tested it. I prefer to do this in python. I already achieved a speedup of 5x with python, I would like to get it sped up more.
:::::::::::::WINNER WINNER WINNER:::::::::::::::::
Looks like we have a winner.
intued's shell script comes in 2nd place with 34 seconds however #steveha's came in first with 24 seconds. Due to the fact that a lot of our boxes do not have python2.6 I had to cx_freeze it. I can write a shell script wrapper to wget a tar and unpack it. I do like intued's for simplicity however.
Thanks you for all your help guys, I now have an efficient tool for sysadmining

I think that, rather than using the threading module, you should be using the multiprocessing module for your Python solution. Python threads can run afoul of the GIL; the GIL is not a problem if you simply have multiple Python processes going.
I think that for what you are doing a pool of worker processes is just what you want. By default, the pool will default to one process for each core in your system processor. Just call the .map() method with a list of filenames to check and the function that does the checking.
http://docs.python.org/library/multiprocessing.html
If this is not faster than your threading implementation, then I don't think the GIL is your problem.
EDIT: Okay, I'm adding a working Python program. This uses a pool of worker processes to open each file and search for the pattern in each. When a worker finds a filename that matches, it simply prints it (to standard output) so you can redirect the output of this script into a file and you have your list of files.
EDIT: I think this is a slightly easier to read version, easier to understand.
I timed this, searching through the files in /usr/include on my computer. It completes the search in about half a second. Using find piped through xargs to run as few grep processes as possible, it takes about 0.05 seconds, about a 10x speedup. But I hate the baroque weird language you must use to get find to work properly, and I like the Python version. And perhaps on really big directories the disparity would be smaller, as part of the half-second for Python must have been startup time. And maybe half a second is fast enough for most purposes!
import multiprocessing as mp
import os
import re
import sys
from stat import S_ISREG
# uncomment these if you really want a hard-coded $HOME/patterns file
#home = os.environ.get('HOME')
#patterns_file = os.path.join(home, 'patterns')
target = sys.argv[1]
size_limit = int(sys.argv[2])
assert size_limit >= 0
patterns_file = sys.argv[3]
# build s_pat as string like: (?:foo|bar|baz)
# This will match any of the sub-patterns foo, bar, or baz
# but the '?:' means Python won't bother to build a "match group".
with open(patterns_file) as f:
s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))
# pre-compile pattern for speed
pat = re.compile(s_pat)
def walk_files(topdir):
"""yield up full pathname for each file in tree under topdir"""
for dirpath, dirnames, filenames in os.walk(topdir):
for fname in filenames:
pathname = os.path.join(dirpath, fname)
yield pathname
def files_to_search(topdir):
"""yield up full pathname for only files we want to search"""
for fname in walk_files(topdir):
try:
# if it is a regular file and big enough, we want to search it
sr = os.stat(fname)
if S_ISREG(sr.st_mode) and sr.st_size >= size_limit:
yield fname
except OSError:
pass
def worker_search_fn(fname):
with open(fname, 'rt') as f:
# read one line at a time from file
for line in f:
if re.search(pat, line):
# found a match! print filename to stdout
print(fname)
# stop reading file; just return
return
mp.Pool().map(worker_search_fn, files_to_search(target))

I'm a bit confused as to how your Python script ended up being faster than your find/grep combo. If you want to use grep in a way somewhat similar to what's suggested by Ron Smith in his answer, you can do something like
find -type f | xargs -d \\n -P 8 -n 100 grep --file=/root/patterns
to launch grep processes which will process 100 files before exiting, keeping up to 8 such processes active at any one time. Having them process 100 files should make the process startup overhead time of each one negligible.
note: The -d \\n option to xargs is a GNU extension which won't work on all POSIX-ish systems. It specifies that the *d*elimiter between filenames is a newline. Although technically filenames can contain newlines, in practice nobody does this and keeps their jobs. For compatibility with non-GNU xargs you need to add the -print0 option to find and use -0 instead of -d \\n with xargs. This will arrange for the null byte \0 (hex 0x00) to be used as the delimiter both by find and xargs.
You could also take the approach of first counting the number of files to be grepped
NUMFILES="$(find -type f | wc -l)";
and then using that number to get an even split among the 8 processes (assuming bash as shell)
find -type f | xargs -d \\n -P 8 -n $(($NUMFILES / 8 + 1)) grep --file=/root/patterns
I think this might work better because the disk I/O of find won't be interfering with the disk I/O of the various greps. I suppose it depends in part on how large the files are, and whether they are stored contiguously — with small files, the disk will be seeking a lot anyway, so it won't matter as much. Note also that, especially if you have a decent amount of RAM, subsequent runs of such a command will be faster because some of the files will be saved in your memory cache.
Of course, you can parameterize the 8 to make it easier to experiment with different numbers of concurrent processes.
As ed. mentions in the comments, it's quite possible that the performance of this approach will still be less impressive than that of a single-process grep -r. I guess it depends on the relative speed of your disk [array], the number of processors in your system, etc.

If you are willing to upgrade to version 3.2 or better, you can take advantage of the concurrent.futures.ProcessPoolExecutor. I think it will improve performance over the popen method you attempted because it will pre-create a pool of processes where your popen method creates a new process every time. You could write your own code to do the same thing for an earlier version if you can't move to 3.2 for some reason.

Let me also show you how to do this in Ray, which is an open-source framework for writing parallel Python applications. The advantage of this approach is that it is fast, easy to write and extend (say you want to pass a lot of data between the tasks or do some stateful accumulation), and can also be run on a cluster or the cloud without modifications. It's also very efficient at utilizing all cores on a single machine (even for very large machines like 100 cores) and data transfer between tasks.
import os
import ray
import re
ray.init()
patterns_file = os.path.expanduser("~/patterns")
topdir = os.path.expanduser("~/folder")
with open(patterns_file) as f:
s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))
regex = re.compile(s_pat)
#ray.remote
def match(pattern, fname):
results = []
with open(fname, 'rt') as f:
for line in f:
if re.search(pattern, line):
results.append(fname)
return results
results = []
for dirpath, dirnames, filenames in os.walk(topdir):
for fname in filenames:
pathname = os.path.join(dirpath, fname)
results.append(match.remote(regex, pathname))
print("matched files", ray.get(results))
More information including how to run this on a cluster or the cloud is available in the documentatation

Related

Concatenating a list of files with subprocess and wildcards in python

I'm trying to concatenate multiple files in a directory to a single file. So far I've been trying to use cat with subprocess with poor results.
My original code was:
source = ['folder01/*', 'folder02/*']
target = ['/output/File1', '/output/File2']
for f1, f2, in zip(source, target):
subprocess.call(['cat', f1, '>>', f2])
I've tried handing it shell=True:
..., f2], shell=True)
And in conjunction with subprocess.Popen instead of call in a number of permutations, but with no joy.
As I've understood from other similar questions, with shell=True the command will need to be provided as a string. How can I go about calling cat on all items in my list whilst executing as a string?
You don't need subprocess here and you must always avoid subprocess when you can (that means: 99.99% of time).
As Joel pointed out in comments, maybe I should take a few minutes and bullet points to explain you why:
Using subprocess (or similar) assume your code will always run on the exact same environment, that means same OS, version, shell, tools installed, etc.. This is really not fitted for a production grade code.
These kind of libraries will prevent you to make "pythonic Python code", you will have to handle errors by parsing string instead of try / except, etc..
Tim Peters wrote the Zen of Python and I encourage you to follow it, at least 3 points are relevant here: "Beautiful is better than ugly.", "Readability counts." and "Simple is better than complex.".
In other words: subprocess will only make your code less robust, force you to handle non-Python issues, force you to perform tricky computing where you could just write clean and powerful Python code.
There are way more good reasons to not use subprocess, but I think you got the point.
Just open files with open, here is a basic example you will need to adapt:
import os
for filename in os.listdir('./'):
with open(filename, 'r') as fileh:
with open('output.txt', 'a') as outputh:
outputh.write(fileh.read())
Implementation example for your specific needs:
import os
sources = ['/tmp/folder01/', '/tmp/folder02/']
targets = ['/tmp/output/File1', '/tmp/output/File2']
# Loop in `sources`
for index, directory in enumerate(sources):
# Match `sources` file with expected `targets` directory
output_file = targets[index]
# Loop in files within `directory`
for filename in os.listdir(directory):
# Compute absolute path of file
filepath = os.path.join(directory, filename)
# Open source file in read mode
with open(filepath, 'r') as fileh:
# Open output file in append mode
with open(output_file, 'a') as outputh:
# Write content into output
outputh.write(fileh.read())
Be careful, I changed your source and target values (/tmp/)

Windows disk usage issues with python

I am executing the python code that follows.
I am running it on a folder ("articles") which has a couple hundred subfolders and 240,226 files in all.
I am timing the execution. At first the times were pretty stable but went non-linear after 100,000 files. Now the times (I am timing at 10,000 file intervals) can go non_linear after 30,000 or so (or not).
I have the Task Manager open and correlate the slow-downs to 99% Disk usage by python.exe. I have done gc-collect(). dels etc., turned off Windows indexing. I have re-started Windows, emptied the trash (I have a few hundred GBs free). Nothing helps, the disk usage seems to be getting more erratic if anything.
Sorry for the long post - Thanks for the help
def get_filenames():
for (dirpath, dirnames, filenames) in os.walk("articles/"):
dirs.extend(dirnames)
for dir in dirs:
path = "articles" + "\\" + dir
nxml_files.extend(glob.glob(path + "/*.nxml"))
return nxml_files
def extract_text_from_files(nxml_files):
for nxml_file in nxml_files:
fast_parse(nxml_file)
def fast_parse(infile):
file = open(infile,"r")
filetext = file.read()
tag_breaks = filetext.split('><')
paragraphs = [tag_break.strip('p>').strip('</') for tag_break in tag_breaks if tag_break.startswith('p>')]
def run_files():
nxml_files = get_filenames()
extract_text_from_files(nxml_files)
if __name__ == "__main__":
run_files()
There are some things that could be optimized.
At first, is you open files, close them as well. A with open(...) as name: block will do that easily. BTW in Python 2 file is a bad choice for a variable name, it is built-in function's name.
You can remove one disc read by doing string comparisons instead of the glob.
And last but not least: os.walk spits out the results cleverly, so don't buffer them into a list, process everything inside one loop. This will save a lot of memory.
That is what I can advise from the code. For more details on what is causing the I/O you should use profiling. See https://docs.python.org/2/library/profile.html for details.

Python os.walk memory issue

I programmed a scanner that looks for certain files on all hard drives of a system that gets scanned. Some of these systems are pretty old, running Windows 2000 with 256 or 512 MB of RAM but the file system structure is complex as some of them serve as file servers.
I use os.walk() in my script to parse all directories and files.
Unfortunately we noticed that the scanner consumes a lot of RAM after some time of scanning and we figured out that the os.walk function alone uses about 50 MB of RAM after 2h of walk over the file system. This RAM usage increases over the time. We had about 90 MB of RAM after 4 hours of scanning.
Is there a way to avoid this behaviour? We also tried "betterwalk.walk()" and "scandir.walk()". The result was the same.
Do we have to write our own walk function that removes already scanned directory and file objects from memory so that the garbage collector can remove them from time to time?
Thanks
have you tried the glob module?
import os, glob
def globit(srchDir):
srchDir = os.path.join(srchDir, "*")
for file in glob.glob(srchDir):
print file
globit(file)
if __name__ == '__main__':
dir = r'C:\working'
globit(dir)
If you are running in the os.walk loop, del() everything that you don't need anymore. And try running gc.collect() at the end of every iteration of os.walk.
Generators are better solutions as they do lazy computations
here is one example of implementation.
import os
import fnmatch
#this may or may not be implemented
def list_dir(path):
for name in os.listdir(path):
yield os.path.join(path, name)
#modify this to take some pattern as input
def os_walker(top):
for root,dlist,flist in os.walk(top):
for name in fnmatch.filter(flist, '*.py'):
yield os.path.join(root, name)
all_dirs = list_dir("D:\\tuts\\pycharm")
for l in all_dirs:
for name in os_walker(l):
print(name)
Thanks to David Beazley

Top 5 Folders Consuming The Most Space?

My hard drive is full.
What is the easiest way to find out the TOP 5 FOLDERS that consume the most disk space?
A python solution would be greatly appreciated. I use Ubuntu Linux.
Not a python solution, but one using the shell is to use du. To list the number of kilobytes in each folder under /var/ then sort by size with the largest one last, run the following in a shell prompt:
du -k --max-depth 1 /var/|sort -n
If you want this under python, use the always super-handy subprocess module:
import subprocess
p = subprocess.Popen(["/usr/bin/du", "-k", "--max-depth", "1"], stdout=subprocess.PIPE)
(output, stderr) = p.communicate()
Split output by newline, then by tab, then sort, and you'll have the results in python.
Not a Python solution and not a code example:
http://www.jgoodies.com/freeware/jdiskreport/index.html
A simple and pure python solution:
import os
def get_folder_size(folder_path):
folder_size = 0
for (path, dirs, files) in os.walk(folder_path):
for file in files:
filename = os.path.join(path, file)
folder_size += os.path.getsize(filename)
return folder_size
def get_file_size(file):
return os.path.getsize(file)
print get_folder_size("/home/magnun")
print get_file_size("/home/magnun/background.png")
This functions returns the size (a long type) in bytes, you may need to convert it into MBytes, GBytes and etc.
You could just check the properties of each folder and look for the total size of the files in it. In Ubuntu, just right click on the folder, click properties and check the contents size. Not a python solution, but much easier (not everything needs a script to find the answer). The biggest folders will likely be the ones at the top of the tree.

Reading updated files on the fly in Python

I'm writing two Python scripts that both parse files. One is a standard unix logfile and the other is a binary file. I'm trying to figure out the best way to monitor these so I can read data as soon as they're updated. Most of the solutions I've found thus far are linux specific, but I need this to work in FreeBSD.
Obviously one approach would be to just run my script every X amount of time, but this seems gross and inefficient. If I want my Python app running continuously in the background monitoring a file and acting on it once it's changed/updated, what's my best bet?
Have you tried KQueue events?
http://docs.python.org/library/select.html#kqueue-objects
kqueue is the FreeBSD / OS version of inotify (file change notification service). I haven't used this, but I think it's what you want.
I once did to make a sort of daemon process for a parser built in Python. I needed to watch a series of files and process them in Python, and it had to be a truly multi-OS solution (Windows & Linux in this case). I wrote a program that watches over a list of files by checking their modification time. The program sleeps for a while and then checks the modification time of the files being watched. If the modification time is newer than the one previously registered, then the file has changed and so stuff has to be done with this file.
Something like this:
import os
import time
path = os.path.dirname(__file__)
print "Looking for files in", path, "..."
# get interesting files
files = [{"file" : f} for f in os.listdir(path) if os.path.isfile(f) and os.path.splitext(f)[1].lower() == ".src"]
for f in files:
f["output"] = os.path.splitext(f["file"])[0] + ".out"
f["modtime"] = os.path.getmtime(f["file"]) - 10
print " watching:", f["file"]
while True:
# sleep for a while
time.sleep(0.5)
# check if anything changed
for f in files:
# is mod time of file is newer than the one registered?
if os.path.getmtime(f["file"]) > f["modtime"]:
# store new time and...
f["modtime"] = os.path.getmtime(f["file"])
print f["file"], "has changed..."
# do your stuff here
It does not look like top notch code, but it works quite well.
There are other SO questions about this, but I don't know if they'll provide a direct answer to your question:
How to implement a pythonic equivalent of tail -F?
How do I watch a file for changes?
How can I "watch" a file for modification / change?
Hope this helps!

Categories