I am trying to parallelize some calculations with the use of the multiprocessing module.
How can I be sure that every process spawned by multiprocessing.Pool.map_async runs in a different (previously created) folder?
The problem is that each process calls a third-party library that writes temp files to disk, and if many of them run in the same folder, they clobber each other's files.
Additionally, I can't create a new folder for every function call made by map_async; rather, I would like to create as few folders as possible (i.e., one per process).
The code would be similar to this:
import multiprocessing, os, shutil

processes = 16

# starting pool
pool = multiprocessing.Pool(processes)

# the asked-for dark magic would go here?
devshm = '/dev/shm/'

# creating as many folders as necessary
for p in range(processes):
    path = devshm + str(p) + '/'
    os.mkdir(path)
    shutil.copy(some_files, path)  # some_files is a placeholder

def example_function(i):
    print os.getcwd()
    return i*i

result = pool.map_async(example_function, range(1000))
So that at any time, every call of example_function is executed in a different folder.
I know that a solution might be to use subprocess to spawn the different processes, but I would like to stick to multiprocessing; otherwise I would need to pickle some objects, write them to disk, then read and unpickle them for every spawned subprocess, rather than passing the objects themselves through the function call (using functools.partial).
PS.
This question is somewhat similar, but that solution doesn't guarantee that every function call takes place in a different folder, which is exactly my goal.
Since you don't specify in your question, I'm assuming you don't need the contents of the directory after your function has finished executing.
The absolute easiest approach is to create and destroy the temp directories inside the function that uses them. That way the rest of your code doesn't care about the environment/directories of the worker processes, and Pool fits in nicely. I would also use Python's built-in functionality for creating temporary directories:
import multiprocessing, os, tempfile

processes = 16

def example_function(i):
    # each call gets its own private temp directory
    with tempfile.TemporaryDirectory() as path:
        os.chdir(path)
        print(os.getcwd())
        return i*i

if __name__ == '__main__':
    # starting pool
    pool = multiprocessing.Pool(processes)
    result = pool.map(example_function, range(1000))
NOTE: tempfile.TemporaryDirectory was introduced in python 3.2. If you are using an older version of python, you can copy the wrapper class into your code.
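If you are stuck on an older interpreter, a rough stand-in (a minimal sketch, not the stdlib implementation) can be built from tempfile.mkdtemp and shutil.rmtree:

import contextlib, shutil, tempfile

@contextlib.contextmanager
def temporary_directory():
    # create a unique directory and make sure it is removed afterwards
    path = tempfile.mkdtemp()
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)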
If you really need to set up the directories beforehand...
Trying to make this work with Pool is a little hacky. You could pass the name of the directory to use along with the data, but you could only submit an initial batch equal to the number of directories. Then you would need to use something like imap_unordered to see when a result is done (and its directory is available for reuse).
A much better approach, in my opinion, is not to use Pool at all, but to create individual Process objects and assign each one to a directory. This is generally better when you need to control some part of the Process's environment, whereas Pool is generally better when your problem is data-driven and doesn't care about the processes or their environment.
There are different ways to pass data to/from the Process objects, but the simplest is a queue:
import multiprocessing, os, shutil

processes = 16

def example_function(path, qin, qout):
    # work inside this process's private folder
    os.chdir(path)
    for i in iter(qin.get, 'stop'):
        print(os.getcwd())
        qout.put(i*i)

if __name__ == '__main__':
    in_queue = multiprocessing.Queue()
    out_queue = multiprocessing.Queue()

    devshm = '/dev/shm/'

    # create processes & folders
    procs = []
    for i in range(processes):
        path = devshm + str(i) + '/'
        os.mkdir(path)
        #shutil.copy(some_files, path)
        procs.append(multiprocessing.Process(target=example_function,
                                             args=(path, in_queue, out_queue)))
        procs[-1].start()

    # send input
    for i in range(1000):
        in_queue.put(i)

    # send stop signals
    for i in range(processes):
        in_queue.put('stop')

    # collect output
    results = []
    for i in range(1000):
        results.append(out_queue.get())
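Depending on your needs you may also want to wait for the workers to exit and remove the per-process folders afterwards; a small follow-up sketch, continuing the block above (it assumes the procs list and devshm path defined there):

    # let the workers exit, then clean up the per-process folders
    for p in procs:
        p.join()
    for i in range(processes):
        shutil.rmtree(devshm + str(i) + '/', ignore_errors=True)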
Related
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it isn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code, nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't set a value directly in materials. The function has to use return to send the result back to the main process, and the main process has to wait for the result and get it.
It can be simpler with Pool. You don't need to use a queue manually, and it returns the results in the same order as the data in all_jsons. You can also set how many processes run at the same time, so it won't hog the CPU and block other processes on the system.
But it can't use tqdm.
I couldn't test it, but it could be something like this:
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)

    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)

    for item in materials:
        print(item)
BTW: other modules worth a look for the same job: concurrent.futures, joblib, ray.
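For reference, a minimal sketch of the same idea with concurrent.futures (hedged: it assumes the folder path and the process_dict helper from the question):

import os
import json
from concurrent.futures import ProcessPoolExecutor

folder = '/content/drive/My Drive/mrp_workflow/JSONs'

def parse_json(filename):
    # same parsing as asyncJSONs, just returning the result
    with open(os.path.join(folder, filename)) as f:
        data = json.loads(f.read())
    properties = process_dict(data, {})  # process_dict comes from the question's code
    properties['name'] = filename.split('.')[0]
    return properties

if __name__ == '__main__':
    all_jsons = os.listdir(folder)
    with ProcessPoolExecutor(max_workers=5) as ex:
        materials = list(ex.map(parse_json, all_jsons))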
I'm going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need and append it to a target file in ndjson/jsonlines format. That's a format where, instead of objects being part of a JSON array [{},{}...], you have separate objects on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with the append flag, dumps the data, and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as each write is smaller than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and produce conflicting writes. If they do conflict, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert them to a regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a far better data format for long lists of objects, since you don't have to parse the whole thing into memory.
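For example, a small sketch (assuming hypothetical per-worker files named out-0.jsonl, out-1.jsonl, ...) that joins the parts and converts them to a regular JSON array in Python; jq can do the same from the shell:

import glob
import json

# read every line of every per-worker file and collect the objects
records = []
for part in sorted(glob.glob('out-*.jsonl')):
    with open(part) as fp:
        for line in fp:
            if line.strip():
                records.append(json.loads(line))

with open('combined.json', 'w') as fp:
    json.dump(records, fp)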
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function put its results into the queue, while your main code waits for results to magically appear in the queue.
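A minimal sketch of that pattern, assuming a hypothetical parse_one function that does the per-file work:

import multiprocessing as mp

def worker(filename, queue):
    # do the per-file work and ship the result back to the parent
    queue.put((filename, parse_one(filename)))  # parse_one is hypothetical

if __name__ == '__main__':
    queue = mp.Queue()
    files = ['a.json', 'b.json']  # placeholder file list
    procs = [mp.Process(target=worker, args=(f, queue)) for f in files]
    for p in procs:
        p.start()
    results = dict(queue.get() for _ in procs)  # collect one result per worker
    for p in procs:
        p.join()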
Also, PLEASE read the instructions in the multiprocessing docs about __main__. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
Background:
Python 3.5.1, Windows 7
I have a network drive that holds a large number of files and directories. I'm trying to write a script to parse through all of these as quickly as possible to find all files that match a RegEx, and copy these files to my local PC for review. There are about 3500 directories and subdirectories, and a few million files. I'm trying to make this as generic as possible (i.e., not writing code to this exact file structure) in order to reuse this for other network drives. My code works when run against a small network drive, the issue here seems to be scalability.
I've tried a few things using the multiprocessing library and can't seem to get it to work reliably. My idea was to create a new job to parse through each subdirectory to work as quickly as possible. I have a recursive function that parses through all objects in a directory, then calls itself for any subdirectories, and checks any files it finds against the RegEx.
Question: how can I limit the number of threads/processes without using Pools to achieve my goal?
What I've tried:
If I only use Process jobs, I get the error RuntimeError: can't start new thread after more than a few hundred threads start, and it starts dropping connections. I end up with about half the files found, as half of the directories error out (code for this below).
To limit the total number of threads, I tried to use the Pool methods, but according to this question I can't pass Pool objects to the called methods, which makes the recursive implementation impossible.
To fix that, I tried to call Processes inside the Pool methods, but I get the error daemonic processes are not allowed to have children.
I think that if I can limit the number of concurrent threads, then my solution will work as designed.
Code:
import os
import re
import shutil
from multiprocessing import Process, Manager

CheckLocations = ['network drive location 1', 'network drive location 2']
SaveLocation = 'local PC location'
FileNameRegex = re.compile('RegEx here', flags=re.IGNORECASE)

# Loop through all items in folder, and call itself for subfolders.
def ParseFolderContents(path, DebugFileList):
    FolderList = []
    jobs = []
    TempList = []
    if not os.path.exists(path):
        return
    try:
        for item in os.scandir(path):
            try:
                if item.is_dir():
                    p = Process(target=ParseFolderContents, args=(item.path, DebugFileList))
                    jobs.append(p)
                    p.start()
                elif FileNameRegex.search(item.name) != None:
                    DebugFileList.append((path, item.name))
                else:
                    pass
            except Exception as ex:
                if hasattr(ex, 'message'):
                    print(ex.message)
                else:
                    print(ex)
                # print('Error in file:\t' + item.path)
    except Exception as ex:
        if hasattr(ex, 'message'):
            print(ex.message)
        else:
            print('Error in path:\t' + path)
            pass
    else:
        print('\tToo many threads to restart directory.')
    for job in jobs:
        job.join()

# Save list of debug files.
def SaveDebugFiles(DebugFileList):
    for file in DebugFileList:
        try:
            shutil.copyfile(file[0] + '\\' + file[1], SaveLocation + file[1])
        except PermissionError:
            continue

if __name__ == '__main__':
    with Manager() as manager:
        # Iterate through all directories to make a list of all desired files.
        DebugFileList = manager.list()
        jobs = []
        for path in CheckLocations:
            p = Process(target=ParseFolderContents, args=(path, DebugFileList))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
        print('\n' + str(len(DebugFileList)) + ' files found.\n')
        if len(DebugFileList) == 0:
            quit()

        # Iterate through all debug files and copy them to local PC.
        n = 25  # Number of files to grab for each parallel path.
        TempList = [DebugFileList[i:i + n] for i in range(0, len(DebugFileList), n)]  # Split list into small chunks.
        jobs = []
        for item in TempList:
            p = Process(target=SaveDebugFiles, args=(item,))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
Don't disdain the usefulness of pools, especially when you want to control the number of processes to create. They also take care of managing your workers (create/start/join/distribute chunks of work) and help you collect potential results.
As you have realized yourself, you create way too many processes, up to a point where you seem to exhaust so many system resources that you cannot create more processes.
Additionally, the creation of new processes in your code is controlled by outside factors, i.e. the number of folders in your file trees, which makes it very difficult to limit the number of processes. Also, creating a new process comes with quite some overhead on the OS and you might even end up wasting that overhead on empty directories. Plus, context switches between processes are quite costly.
With the number of processes you create, given the number of folders you stated, your processes will basically just sit there and idle most of the time while they are waiting for a share of CPU time to actually do some work. There will be a lot of contention for said CPU time, unless you have a supercomputer with thousands of cores at your disposal. And even when a process gets some CPU time to work, it will likely spend quite a bit of that time waiting for I/O.
That being said, you'll probably want to look into using threads for such a task. And you could do some optimization in your code. From your example, I don't see any reason why you would split identifying the files to copy and actually copying them into different tasks. Why not let your workers copy each file they found matching the RE right away?
I'd create a list of files in the directories in question using os.walk (which I consider reasonably fast) from the main thread and then offload that list to a pool of workers that checks these files for matches and copies those right away:
import os
import re
from multiprocessing.pool import ThreadPool

search_dirs = ["dir 1", "dir2"]
ptn = re.compile(r"your regex")
# your target dir definition

file_list = []
for topdir in search_dirs:
    for root, dirs, files in os.walk(topdir):
        for file in files:
            file_list.append(os.path.join(root, file))

def copier(path):
    if ptn.match(path):
        # do your shutil.copyfile with the try-except right here
        # obviously I did not want to start mindlessly copying around files on my box :)
        return path

with ThreadPool(processes=10) as pool:
    results = pool.map(copier, file_list)

# print all the processed files. For those that did not match, None is returned
print("\n".join([r for r in results if r]))
On a side note: don't concatenate your paths manually (file[0] + "\\" + file[1]); use os.path.join for this instead.
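For instance, the copy line from the question could become something like this (treating SaveLocation as a directory, as in the question's code):

shutil.copyfile(os.path.join(file[0], file[1]), os.path.join(SaveLocation, file[1]))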
I was unable to get this to work exactly as I desired. os.walk was slow, and every other method I thought of was either a similar speed or crashed due to too many threads.
I ended up using a method similar to the one I posted above, but instead of starting the recursion at the top-level directory, it goes down one or two levels until there are several directories. It then starts the recursion at each of these directories in series, which limits the number of threads enough to finish successfully. Execution time is similar to os.walk, which would probably make for a simpler and more readable implementation.
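A rough sketch of that idea (it reuses ParseFolderContents, FileNameRegex and CheckLocations from the question, and only goes one level down before recursing over each subdirectory in series):

import os
from multiprocessing import Manager

if __name__ == '__main__':
    with Manager() as manager:
        DebugFileList = manager.list()
        for top in CheckLocations:
            for entry in os.scandir(top):
                if entry.is_dir():
                    # one subtree at a time keeps the process count bounded
                    ParseFolderContents(entry.path, DebugFileList)
                elif FileNameRegex.search(entry.name):
                    DebugFileList.append((top, entry.name))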
I want to run several Python scripts at the same time using concurrent.futures.
The serial version of my code goes and looks for a specific Python file in a folder and executes it.
import os
import re
import time
import subprocess
from glob import glob
import concurrent.futures as cf

FileList = []
start_dir = os.getcwd()
pattern = "Read.py"

for dir, _, _ in os.walk(start_dir):
    FileList.extend(glob(os.path.join(dir, pattern)))

i = 0
for file in FileList:
    dir = os.path.dirname(file)
    dirname1 = os.path.basename(dir)
    print(dirname1)
    i = i + 1
    Str = 'python ' + file
    print(Str)
    completed_process = subprocess.run(Str)
For the parallel version of my code:
def Python_callback(future):
    print(future.run_type, future.jid)
    return "One Folder finished executing"

def Python_execute():
    from concurrent.futures import ProcessPoolExecutor as Pool
    args = FileList
    pool = Pool(max_workers=1)
    future = pool.submit(subprocess.call, args, shell=1)
    future.run_type = "run_type"
    future.jid = FileList
    future.add_done_callback(Python_callback)
    print("Python executed")

if __name__ == '__main__':
    import subprocess
    Python_execute()
The issue is that I am not sure how to pass each element of FileList to a separate CPU.
Thanks for your help in advance.
The smallest change is to use submit once for each element, instead of once for the whole list:
futures = []
for file in FileList:
    future = pool.submit(subprocess.call, file, shell=1)
    # future.blah = blah  (attach whatever extra attributes you need here)
    futures.append(future)
The futures list is only necessary if you want to do something with the futures—wait for them to finish, check their return values, etc.
Meanwhile, you're explicitly creating the pool with max_workers=1. Not surprisingly, this means you'll only get 1 worker child process, so it'll end up waiting for one subprocess to finish before grabbing the next one. If you want to actually run them concurrently, remove that max_workers and let it default to one per core (or pass max_workers=8 or some other number that's not 1, if you have a good reason to override the default).
While we're at it, there are a lot of ways to simplify what you're doing:
Do you really need multiprocessing here? If you need to communicate with each subprocess, that can be painful to do in a single thread—but threads, or maybe asyncio, will work just as well as processes here.
More to the point, it doesn't look like you actually do need anything but launch the process and wait for it to finish, and that can be done in simple, synchronous code.
Why are you building a string and using shell=1 instead of just passing a list and not using the shell? Using the shell unnecessarily creates overhead, safety problems, and debugging annoyances.
You really don't need the jid on each future—it's just the list of all of your invocation strings, which can't be useful. What might be more useful is some kind of identifier, or the subprocess return code, or… probably lots of other things, but they're all things that could be done by reading the return value of subprocess.call or a simple wrapper.
You really don't need the callback either. If you just gather all the futures in a list and as_completed it, you can print the results as they show up more simply.
If you do both of the above, you've got nothing left but a pool.submit inside the loop, which means you can replace the entire loop with pool.map (see the sketch after this list).
You rarely need, or want, to mix os.walk and glob. When you actually have a glob pattern, apply fnmatch over the files list from os.walk. But here, you're just looking for a specific filename in each dir, so really, all you need to filter on is file == 'Read.py'.
You're not using the i in your loop. But if you do need it, it's better to do for i, file in enumerate(FileList): than to do for file in FileList: and manually increment an i.
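Putting several of those suggestions together, a hedged sketch (not the asker's exact setup) that finds every Read.py, launches each one via pool.map with an argument list instead of a shell string, and reports the exit codes:

import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_script(path):
    # run one script as a child process; no shell, just an argument list
    return path, subprocess.call(['python', path])

if __name__ == '__main__':
    file_list = []
    for root, dirs, files in os.walk(os.getcwd()):
        for name in files:
            if name == 'Read.py':
                file_list.append(os.path.join(root, name))

    with ProcessPoolExecutor() as pool:
        for path, returncode in pool.map(run_script, file_list):
            print(path, 'finished with code', returncode)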
I have written a method called 'get_names' which accepts as an argument the path of a folder containing several Python scripts (there could be several folders inside it) and returns the names of all the methods inside those scripts.
But due to the vast number of Python scripts in the folder, it takes a huge amount of time to print the names of all the methods. I am planning to create 3-4 processes, each of which will run on one-third/one-fourth of the Python scripts.
How should I write the method so that each process knows which portion of the scripts it has to work on?
names = name_loader.get_names(name_prefix=params.get('name_prefix'))
'name_prefix' could be /users/Aditya/workspace/codes/ where 'codes' contains all the python scripts.
You could do something like this:
import multiprocessing

if __name__ == "__main__":
    calc_pool = multiprocessing.Pool(4)
    path = 'list with your paths'
    methode = calc_pool.map(get_names, path)
You may have to edit your method so that it splits the list of Python files into 4 sublists, in which case each process works on one sublist, and the sublists together cover your complete list. For example:
import multiprocessing

if __name__ == "__main__":
    calc_pool = multiprocessing.Pool(4)
    path = 'list with your paths'
    path = split(path, parts=4)
    data_pack = ((path[0]), (path[1]), (path[2]), (path[3]))
    methode = calc_pool.map(get_names, data_pack)
In this case you have to pack the data, since .map only accepts one argument. The split method turns your list of paths from something like this:
path = ['path_0', 'path_1', 'path_2', 'path_3']
to something like that:
path = [['path_0'], ['path_1'], ['path_2'], ['path_3']]
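The split helper is not defined in the answer; a minimal sketch of what it might look like (the name and signature are only illustrative):

def split(paths, parts=4):
    # distribute the paths round-robin into `parts` sublists
    chunks = [[] for _ in range(parts)]
    for index, path in enumerate(paths):
        chunks[index % parts].append(path)
    return chunks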
Keep in mind that processes created with multiprocessing do not share data, and that you want to send as little data as possible, since passing data to each process is quite slow.
Also, this obviously increases CPU and RAM usage.
The reason I would choose multiprocessing over threads is that multiprocessing lets you actually run tasks in parallel, while threads mostly give you an advantage in I/O-bound tasks.
Edit: Also keep in mind that if __name__ == "__main__": is mandatory on Windows systems for multiprocessing to work.
I'm trying to use the fbx Python module from Autodesk, but it seems I can't thread any operation. This appears to be due to the GIL not being released. Has anyone run into the same issue, or am I doing something wrong? When I say it doesn't work, I mean the code doesn't release the thread and I'm not able to do anything else while the fbx code is running.
There isn't much code to post; I mainly want to know whether anyone else has tried this and hit the same problem.
Update:
Here is the example code; please note that each fbx file is something like 2 GB.
import os
import fbx
import threading

file_dir = r'../fbxfiles'

def parse_fbx(filepath):
    print '-' * (len(filepath) + 9)
    print 'parsing:', filepath
    manager = fbx.FbxManager.Create()
    importer = fbx.FbxImporter.Create(manager, '')
    status = importer.Initialize(filepath)
    if not status:
        raise IOError()
    scene = fbx.FbxScene.Create(manager, '')
    importer.Import(scene)

    # free up memory
    rootNode = scene.GetRootNode()

    def traverse(node):
        print node.GetName()
        for i in range(0, node.GetChildCount()):
            child = node.GetChild(i)
            traverse(child)

    # RUN
    traverse(rootNode)

    importer.Destroy()
    manager.Destroy()

files = os.listdir(file_dir)
tt = []
for file_ in files:
    filepath = os.path.join(file_dir, file_)
    t = threading.Thread(target=parse_fbx, args=(filepath,))
    tt.append(t)
    t.start()
One problem I see is with your traverse() function. It calls itself recursively, potentially a huge number of times. Another is having all the threads print at the same time. Doing that properly requires coordinating access to the shared output device (i.e. the screen). A simple way to do that is by creating and using a global threading.Lock object.
First, create a global Lock to prevent threads from printing at the same time:
file_dir = '../fbxfiles' # an "r" prefix needed only when path contains backslashes
print_lock = threading.Lock() # add this here
Then make a non-recursive version of traverse() that uses it:
def traverse(rootNode):
    with print_lock:
        print rootNode.GetName()
    for i in range(rootNode.GetChildCount()):
        child = rootNode.GetChild(i)
        with print_lock:
            print child.GetName()
It's not clear to me exactly where the reading of each fbx file takes place. If it all happens as a result of the importer.Import(scene) call, then that is the only time any other threads will be given a chance to run, unless some I/O is [also] done within the traverse() function.
Since printing is most definitely a form of output, thread switching will also be able to occur when it's done. However, if all the function did was perform computations of some kind, no multithreading would take place within it during its execution.
Once you get the multithreaded reading working, you may encounter insufficient-memory issues if multiple 2 GB fbx files are being read into memory simultaneously by the different threads.
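One hedged way to keep memory in check (a sketch, not part of the original answer) is to cap how many files are parsed at once, for example with a fixed-size thread pool instead of one thread per file; it reuses parse_fbx and file_dir from the question:

from multiprocessing.pool import ThreadPool

# parse at most two of the large files at a time to bound peak memory use
pool = ThreadPool(processes=2)
pool.map(parse_fbx, [os.path.join(file_dir, f) for f in os.listdir(file_dir)])
pool.close()
pool.join()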