Background:
Python 3.5.1, Windows 7
I have a network drive that holds a large number of files and directories. I'm trying to write a script to parse through all of these as quickly as possible to find all files that match a RegEx, and copy these files to my local PC for review. There are about 3500 directories and subdirectories, and a few million files. I'm trying to make this as generic as possible (i.e., not writing code to this exact file structure) in order to reuse this for other network drives. My code works when run against a small network drive, the issue here seems to be scalability.
I've tried a few things using the multiprocessing library and can't seem to get it to work reliably. My idea was to create a new job to parse through each subdirectory to work as quickly as possible. I have a recursive function that parses through all objects in a directory, then calls itself for any subdirectories, and checks any files it finds against the RegEx.
Question: how can I limit the number of threads/processes without using Pools to achieve my goal?
What I've tried:
If I only use Process jobs, I get the error RuntimeError: can't start new thread after more than a few hundred threads start, and it starts dropping connections. I end up with about half the files found, as half of the directories error out (code for this below).
To limit the total number of threads, I tried to use the Pool methods, but per this question I can't pass Pool objects into the methods they call, which makes the recursive implementation impossible.
To fix that, I tried to call Processes inside the Pool methods, but I get the error daemonic processes are not allowed to have children.
I think that if I can limit the number of concurrent threads, then my solution will work as designed.
Code:
import os
import re
import shutil
from multiprocessing import Process, Manager
CheckLocations = ['network drive location 1', 'network drive location 2']
SaveLocation = 'local PC location'
FileNameRegex = re.compile('RegEx here', flags = re.IGNORECASE)
# Loop through all items in folder, and call itself for subfolders.
def ParseFolderContents(path, DebugFileList):
    FolderList = []
    jobs = []
    TempList = []
    if not os.path.exists(path):
        return
    try:
        for item in os.scandir(path):
            try:
                if item.is_dir():
                    p = Process(target=ParseFolderContents, args=(item.path, DebugFileList))
                    jobs.append(p)
                    p.start()
                elif FileNameRegex.search(item.name) != None:
                    DebugFileList.append((path, item.name))
                else:
                    pass
            except Exception as ex:
                if hasattr(ex, 'message'):
                    print(ex.message)
                else:
                    print(ex)
                # print('Error in file:\t' + item.path)
    except Exception as ex:
        if hasattr(ex, 'message'):
            print(ex.message)
        else:
            print('Error in path:\t' + path)
        print('\tToo many threads to restart directory.')
    for job in jobs:
        job.join()
# Save list of debug files.
def SaveDebugFiles(DebugFileList):
    for file in DebugFileList:
        try:
            shutil.copyfile(file[0] + '\\' + file[1], SaveLocation + file[1])
        except PermissionError:
            continue
if __name__ == '__main__':
    with Manager() as manager:
        # Iterate through all directories to make a list of all desired files.
        DebugFileList = manager.list()
        jobs = []
        for path in CheckLocations:
            p = Process(target=ParseFolderContents, args=(path, DebugFileList))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
        print('\n' + str(len(DebugFileList)) + ' files found.\n')
        if len(DebugFileList) == 0:
            quit()
        # Iterate through all debug files and copy them to the local PC.
        n = 25  # Number of files to grab for each parallel path.
        TempList = [DebugFileList[i:i + n] for i in range(0, len(DebugFileList), n)]  # Split the list into small chunks.
        jobs = []
        for item in TempList:
            p = Process(target=SaveDebugFiles, args=(item, ))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
Don't disdain the usefulness of pools, especially when you want to control the number of processes to create. They also take care of managing your workers (create/start/join/distribute chunks of work) and help you collect potential results.
As you have realized yourself, you create way too many processes, up to a point where you seem to exhaust so many system resources that you cannot create more processes.
Additionally, the creation of new processes in your code is controlled by outside factors, i.e. the number of folders in your file trees, which makes it very difficult to limit the number of processes. Also, creating a new process comes with quite some overhead on the OS and you might even end up wasting that overhead on empty directories. Plus, context switches between processes are quite costly.
With the number of processes you create, given the number of folders you stated, your processes will basically just sit there and idle most of the time while they are waiting for a share of CPU time to actually do some work. There will be a lot of contention for said CPU time, unless you have a supercomputer with thousands of cores at your disposal. And even when a process gets some CPU time to work, it will likely spend quite a bit of that time waiting for I/O.
That being said, you'll probably want to look into using threads for such a task. And you could do some optimization in your code. From your example, I don't see any reason why you would split identifying the files to copy and actually copying them into different tasks. Why not let your workers copy each file they found matching the RE right away?
I'd create a list of files in the directories in question using os.walk (which I consider reasonably fast) from the main thread and then offload that list to a pool of workers that checks these files for matches and copies those right away:
import os
import re
from multiprocessing.pool import ThreadPool

search_dirs = ["dir 1", "dir 2"]
ptn = re.compile(r"your regex")
# your target dir definition

file_list = []
for topdir in search_dirs:
    for root, dirs, files in os.walk(topdir):
        for file in files:
            file_list.append(os.path.join(root, file))

def copier(path):
    if ptn.match(path):
        # do your shutil.copyfile with the try-except right here
        # obviously I did not want to start mindlessly copying around files on my box :)
        return path

with ThreadPool(processes=10) as pool:
    results = pool.map(copier, file_list)

# print all the processed files. For those that did not match, None is returned
print("\n".join([r for r in results if r]))
On a side note: don't concatenate your paths manually (file[0] + "\\" + file[1]), rather use os.path.join for this.
I was unable to get this to work exactly as I desired. os.walk was slow, and every other method I thought of was either a similar speed or crashed due to too many threads.
I ended up using a method similar to what I posted above, but instead of starting the recursion at the top-level directory, it goes down one or two levels until there are several directories. It then starts the recursion at each of these directories in series, which limits the number of threads enough to finish successfully. Execution time is similar to os.walk, so os.walk would probably make for a simpler and more readable implementation.
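A rough sketch of that final approach; the helper name and the fixed depth of two levels are illustrative, not the exact code used:

def CollectStartDirs(path, depth=2):
    # Gather directories up to `depth` levels below the top-level path.
    if depth == 0:
        return [path]
    subdirs = []
    try:
        for item in os.scandir(path):
            if item.is_dir():
                subdirs.extend(CollectStartDirs(item.path, depth - 1))
    except OSError:
        return [path]
    return subdirs or [path]

# Run the recursive parser against each collected directory in series, so only
# the processes spawned beneath one start directory exist at any given time.
for StartDir in CollectStartDirs('network drive location 1'):
    p = Process(target=ParseFolderContents, args=(StartDir, DebugFileList))
    p.start()
    p.join()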
Related
The program monitors a folder, received_dir, and processes the files received in real time. After a file is processed, the original should be deleted to save disk space.
I am trying to use Python multiprocessing and Pool.
I want to check if there is any technical flaw in current approach.
One problem with the current code is that the program has to wait until all 20 files in the queue are processed before starting the next round, so it may be inefficient in certain conditions (e.g., when file sizes vary).
from multiprocessing import Pool
import glob
import gzip
import os
import os.path

Parse_OUT = "/opt/out/"
Receive_Dir = "/opt/receive/"

def parser(infile):
    filename = os.path.basename(infile)  # implied in the original snippet: the base name of infile
    out_dir = date_of(filename)  # date_of() is defined elsewhere in the original program
    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
    fout = gzip.open(out_dir + '/' + filename + 'csv.gz', 'wb')
    with gzip.open(infile) as fin:
        for line in fin:
            data = line.split(',')
            fout.write(data)
    fout.close()
    os.remove(infile)

if __name__ == '__main__':
    pool = Pool(20)
    while True:
        targets = glob.glob(Receive_Dir)[:10]
        pool.map(parser, targets)
    pool.close()
I see several issues:
if not os.path.exists(out_dir): os.mkdir(out_dir) is a race condition: if two workers try to create the same directory at the same time, one will raise an exception. Skip the check and simply call os.makedirs(out_dir, exist_ok=True).
Don't assemble file paths with string addition. Simply do os.path.join(out_dir, filename+'csv.gz'). This is cleaner and has fewer failure states
Instead of spinning in your while True-loop even if no new directories appear, you can use the inotify mechanism on Linux to monitor the directory for changes. That would only wake your process if there is actually anything to do. Check out pyinotify: https://github.com/seb-m/pyinotify
Since you mentioned that you are dissatisfied with the batching: you can use pool.apply_async to start new operations as they become available. Your main loop doesn't do anything with the results, so you can just "fire and forget" (see the sketch after this list).
Incidentally, why are you starting a pool with 20 workers and then you just launch 10 directory operations at once?
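To illustrate the apply_async suggestion, here is a minimal sketch. It still polls with glob rather than using inotify, and it assumes the receive directory contains gzipped files matching *.gz as well as the parser function from the question:

import glob
import os
import time
from multiprocessing import Pool

Receive_Dir = "/opt/receive/"

def safe_parser(infile):
    try:
        parser(infile)  # the worker function from the question
    except Exception as exc:
        print(infile, exc)  # keep one bad file from going unnoticed

if __name__ == '__main__':
    seen = set()
    with Pool(20) as pool:
        while True:
            for infile in glob.glob(os.path.join(Receive_Dir, '*.gz')):
                if infile not in seen:
                    seen.add(infile)
                    # fire and forget: no waiting for a whole batch to finish
                    pool.apply_async(safe_parser, (infile,))
            time.sleep(1)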
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't even remotely correct. For example, the i variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send its result back to the main process, and the main process has to wait for the results and collect them.
This is simpler with Pool: it doesn't require using a queue manually, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so they don't block the CPU for other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW: other modules that offer similar functionality: concurrent.futures, joblib, ray.
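For comparison, a minimal sketch of the same idea with concurrent.futures, reusing the asyncJSONs function and folder variable from the example above; executor.map likewise preserves the input order:

import os
from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    with ProcessPoolExecutor(max_workers=5) as executor:
        # results come back in the same order as all_jsons
        materials = list(executor.map(asyncJSONs, all_jsons))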
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need, and append it to some target file in ndjson/jsonlines format. That is a format where, instead of objects inside a JSON array [{},{}...], you have a separate object on each line:
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
Spawn N workers with a manifest of files to process and the output file to write to. You don't even need multiprocessing; you could use a tool like rush to parallelize.
Each worker parses its data and generates the output dict.
Each worker opens the output file with the append flag, then dumps the data and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as each write is smaller than your kernel's buffer size (usually several MB), your different processes won't stomp on each other's writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines: it's a much better data format for long lists of objects, since you don't have to parse the whole thing into memory.
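A minimal sketch of that worker pattern, with one output file per worker to sidestep interleaving entirely and a trivial merge step at the end; the file names and helpers here are made up for illustration:

import json
import os

def worker(files, worker_id, out_dir):
    # Each worker appends to its own jsonlines file, so writes never conflict.
    out_file = os.path.join(out_dir, 'part-{}.ndjson'.format(worker_id))
    with open(out_file, 'a') as fp:
        for path in files:
            with open(path) as f:
                data = json.load(f)
            # extract whatever fields you actually need from `data` here
            print(json.dumps(data), file=fp, flush=True)

def merge(out_dir, merged_path):
    # Concatenating jsonlines files is a plain append, no parsing required.
    with open(merged_path, 'w') as out:
        for name in sorted(os.listdir(out_dir)):
            if name.endswith('.ndjson'):
                with open(os.path.join(out_dir, name)) as part:
                    out.write(part.read())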
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put its results in the queue, while your main code waits for results to magically appear in the queue.
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
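A minimal sketch of that queue pattern, with a fixed number of worker processes pulling file names from an input queue and pushing parsed results back; the worker count and the plain json.load parsing are illustrative:

import json
import os
from multiprocessing import Process, Queue

def worker(in_queue, out_queue, folder):
    # Pull file names until the 'stop' sentinel, push results into out_queue.
    for filename in iter(in_queue.get, 'stop'):
        try:
            with open(os.path.join(folder, filename)) as f:
                out_queue.put((filename, json.load(f)))
        except Exception:
            out_queue.put((filename, None))

if __name__ == '__main__':
    folder = '/content/drive/My Drive/mrp_workflow/JSONs'
    files = os.listdir(folder)
    in_queue, out_queue = Queue(), Queue()
    procs = [Process(target=worker, args=(in_queue, out_queue, folder))
             for _ in range(5)]
    for p in procs:
        p.start()
    for name in files:
        in_queue.put(name)
    for _ in procs:
        in_queue.put('stop')
    results = [out_queue.get() for _ in files]  # exactly one result per file
    for p in procs:
        p.join()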
I'm trying to use the fbx Python module from Autodesk, but it seems I can't thread any operation. This appears to be due to the GIL not being released. Has anyone run into the same issue, or am I doing something wrong? When I say it doesn't work, I mean the code doesn't release the thread and I'm not able to do anything else while the fbx code is running.
There isn't much code to post; I just want to know whether anyone else has tried this and hit the same problem.
Update:
Here is the example code; please note each fbx file is something like 2 GB.
import os
import fbx
import threading

file_dir = r'../fbxfiles'

def parse_fbx(filepath):
    print '-' * (len(filepath) + 9)
    print 'parsing:', filepath
    manager = fbx.FbxManager.Create()
    importer = fbx.FbxImporter.Create(manager, '')
    status = importer.Initialize(filepath)
    if not status:
        raise IOError()
    scene = fbx.FbxScene.Create(manager, '')
    importer.Import(scene)

    # freeup memory
    rootNode = scene.GetRootNode()

    def traverse(node):
        print node.GetName()
        for i in range(0, node.GetChildCount()):
            child = node.GetChild(i)
            traverse(child)

    # RUN
    traverse(rootNode)

    importer.Destroy()
    manager.Destroy()

files = os.listdir(file_dir)
tt = []
for file_ in files:
    filepath = os.path.join(file_dir, file_)
    t = threading.Thread(target=parse_fbx, args=(filepath,))
    tt.append(t)
    t.start()
One problem I see is with your traverse() function. It's calling itself recursively potentially a huge number of times. Another is having all the threads printing stuff at the same time. Doing that properly requires coordinating access to the shared output device (i.e. the screen). A simple way to do that is by creating and using a global threading.Lock object.
First create a global Lock to prevent threads from printing at same time:
file_dir = '../fbxfiles' # an "r" prefix needed only when path contains backslashes
print_lock = threading.Lock() # add this here
Then make a non-recursive version of traverse() that uses it:
def traverse(rootNode):
    with print_lock:
        print rootNode.GetName()
    for i in range(rootNode.GetChildCount()):
        child = rootNode.GetChild(i)
        with print_lock:
            print child.GetName()
It's not clear to me exactly where the reading of each fbxfile takes place. If it all happens as a result of the importer.Import(scene) call, then that is the only time any other threads will be given a chance to run — unless some I/O is [also] done within the traverse() function.
Since printing is most definitely a form of output, thread switching will also be able to occur when it's done. However, if all the function did was perform computations of some kind, no multi-threading would take place within it during its execution.
Once you get the multithreaded reading working, you may encounter insufficient-memory issues if multiple 2 GB fbx files are being read into memory simultaneously by the various threads.
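If memory does become the limiting factor, one option is to cap how many files are parsed at once with a semaphore around the existing parse_fbx call; a minimal sketch, where the limit of two concurrent parses is arbitrary:

import os
import threading

parse_limit = threading.BoundedSemaphore(2)  # at most two files in memory at once

def parse_fbx_limited(filepath):
    with parse_limit:
        parse_fbx(filepath)  # the function from the question

threads = []
for file_ in os.listdir(file_dir):
    t = threading.Thread(target=parse_fbx_limited,
                         args=(os.path.join(file_dir, file_),))
    threads.append(t)
    t.start()
for t in threads:
    t.join()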
I am trying to parallelize some calculations with the use of the multiprocessing module.
How can I be sure that every process spawned by multiprocessing.Pool.map_async runs in a different (previously created) folder?
The problem is that each process calls a third-party library that writes temp files to disk, and if you run many of those in the same folder, they mess each other up.
Additionally, I can't create a new folder for every function call made by map_async; rather, I would like to create as few folders as possible (i.e., one per process).
The code would be similar to this:
import multiprocessing,os,shutil

processes=16

#starting pool
pool=multiprocessing.Pool(processes)

#The asked dark-magic here?

devshm='/dev/shm/'
#Creating as many folders as necessary
for p in range(16):
    os.mkdir(devshm+str(p)+'/')
    shutil.copy(some_files,p)

def example_function(i):
    print os.getcwd()
    return i*i

result=pool.map_async(example_function,range(1000))
So that at any time, every call of example_function is executed in a different folder.
I know that a solution might be to use subprocess to spawn the different processes, but I would like to stick with multiprocessing (otherwise I would need to pickle some objects, write them to disk, read them back, and unpickle them for every spawned subprocess, rather than passing the object itself through the function call using functools.partial).
PS.
This question is somewhat similar, but that solution doesn't guarantee that every function call takes place in a different folder, which is indeed my goal.
Since you don't specify in your question, I'm assuming you don't need the contents of the directory after your function has finished executing.
The absolute easiest method is to create and destroy the temp directories in your function that uses them. This way the rest of your code doesn't care about environment/directories of the worker processes and Pool fits nicely. I would also use python's built-in functionality for creating temporary directories:
import multiprocessing, os, shutil, tempfile

processes=16

def example_function(i):
    with tempfile.TemporaryDirectory() as path:
        os.chdir(path)
        print(os.getcwd())
        return i*i

if __name__ == '__main__':
    #starting pool
    pool=multiprocessing.Pool(processes)

    result=pool.map(example_function,range(1000))
NOTE: tempfile.TemporaryDirectory was introduced in python 3.2. If you are using an older version of python, you can copy the wrapper class into your code.
If you really need to set up the directories beforehand...
Trying to make this work with Pool is a little hacky. You could pass the name of the directory to use along with the data, but you could only pass an initial amount equal to the number of directories. Then, you would need to use something like imap_unordered to see when a result is done (and its directory is available for reuse).
A much better approach, in my opinion, is not to use Pool at all, but to create individual Process objects and assign each one to a directory. This is generally better if you need to control some part of the Process's environment, whereas Pool is generally better when your problem is data-driven and doesn't care about the processes or their environment.
There are different ways to pass data to and from the Process objects, but the simplest is a queue:
import multiprocessing,os,shutil

processes=16

in_queue = multiprocessing.Queue()
out_queue = multiprocessing.Queue()

def example_function(path, qin, qout):
    os.chdir(path)
    for i in iter(qin.get, 'stop'):
        print(os.getcwd())
        qout.put(i*i)

devshm='/dev/shm/'

# create processes & folders
procs = []
for i in range(processes):
    path = devshm+str(i)+'/'
    os.mkdir(path)
    #shutil.copy(some_files,path)
    procs.append(multiprocessing.Process(target=example_function, args=(path, in_queue, out_queue)))
    procs[-1].start()

# send input
for i in range(1000):
    in_queue.put(i)

# send stop signals
for i in range(processes):
    in_queue.put('stop')

# collect output
results = []
for i in range(1000):
    results.append(out_queue.get())
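After collecting the output, the workers exit once they read their 'stop' sentinel; joining them and removing the per-process folders keeps things tidy. A small follow-up sketch using the procs list and devshm path from above:

# wait for the workers to exit, then remove their working folders
for proc in procs:
    proc.join()
for i in range(processes):
    shutil.rmtree(devshm + str(i) + '/')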
Is it a good practice to call pool.map inside a for loop to minimize memory usage?
For example, in my code, I'm trying to minimize memory usage by only processing one directory at a time:
import multiprocessing
import os

PATH = "/dir/files"

def readMedia(fname):
    """ Do CPU-intensive task
    """
    pass

def init(queue):
    readMedia.queue = queue

def main():
    print("Starting the scanner in root " + PATH)

    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(processes=32, initializer=init, initargs=[queue])

    for dirpath, dirnames, filenames in os.walk(PATH):
        full_path_fnames = map(lambda fn: os.path.join(dirpath, fn),
                               filenames)
        pool.map(readMedia, full_path_fnames)
        result = queue.get()
        print(result)
The above code, when tested, actually eats up all my memory even when the script is terminated.
There are probably a few issues here. First, you're using too many processes in your pool. Because you're doing a CPU intensive task, you're only going to get diminishing returns if you start more than multiprocessing.cpu_count() workers; if you've got 32 workers doing CPU intensive tasks but only 4 CPUs, 28 processes are always going to be sitting around doing no work, but wasting memory.
You're probably still seeing high memory usage after killing the script because one or more of the child processes is still running. Take a look at the process list after you kill the main script and make sure none of the children are left behind.
If you're still seeing memory usage growing too high over time, you could try setting the maxtasksperchild keyword argument when you create the pool, which will restart each child process once it has run the given number of tasks, releasing any memory that may have leaked.
As for memory usage gains by calling map in a for loop, you do get the advantage of not having to store the results of every single call to readMedia in one in-memory list, which definitely saves memory if there is a huge list of files being iterated over.
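Putting those suggestions together, a minimal sketch of how the pool setup from the question might look, reusing its readMedia, init, and PATH names; the worker count follows the CPU count, and maxtasksperchild recycles workers to release any leaked memory:

import multiprocessing
import os

def main():
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count(),
                                initializer=init,
                                initargs=[queue],
                                maxtasksperchild=100)  # recycle each worker after 100 tasks
    for dirpath, dirnames, filenames in os.walk(PATH):
        full_path_fnames = [os.path.join(dirpath, fn) for fn in filenames]
        pool.map(readMedia, full_path_fnames)
    pool.close()
    pool.join()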