How can I limit the number of concurrent threads in Python?
For example, I have a directory with many files, and I want to process all of them, but only 4 at a time in parallel.
Here is what I have so far:
import glob
import threading
import Queue

def process_file(fname):
    # open file and do something

def process_file_thread(queue, fname):
    queue.put(process_file(fname))

def process_all_files(d):
    files = glob.glob(d + '/*')
    q = Queue.Queue()
    for fname in files:
        t = threading.Thread(target=process_file_thread, args=(q, fname))
        t.start()
    q.join()

def main():
    process_all_files('.')
    # Do something after all files have been processed
How can I modify the code so that only 4 threads are run at a time?
Note that I want to wait for all files to be processed and then continue and work on the processed files.
For example, I have a directory with many files, and I want to process all of them, but only 4 at a time in parallel.
That's exactly what a thread pool does: You create jobs, and the pool runs 4 at a time in parallel. You can make things even simpler by using an executor, where you just hand it functions (or other callables) and it hands you back futures for the results. You can build all of this yourself, but you don't have to.*
The stdlib's concurrent.futures module is the easiest way to do this. (For Python 3.1 and earlier, see the backport.) In fact, one of the main examples is very close to what you want to do. But let's adapt it to your exact use case:
def process_all_files(d):
    files = glob.glob(d + '/*')
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = [executor.submit(process_file, file) for file in files]
        concurrent.futures.wait(fs)
If you wanted process_file to return something, that's almost as easy:
def process_all_files(d):
    files = glob.glob(d + '/*')
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = [executor.submit(process_file, file) for file in files]
        for f in concurrent.futures.as_completed(fs):
            do_something(f.result())
And if you want to handle exceptions too… well, just look at the example; it's just a try/except around the call to result().
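To make that concrete, here is a minimal sketch of the try/except-around-result() pattern; shaky and its ValueError are made-up stand-ins for whatever your process_file might raise:

```python
import concurrent.futures

def shaky(n):
    # stand-in worker: fails on one particular input
    if n == 2:
        raise ValueError("bad input")
    return n * 10

def run_all(items):
    results, errors = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        fs = {executor.submit(shaky, n): n for n in items}
        for f in concurrent.futures.as_completed(fs):
            try:
                results.append(f.result())  # result() re-raises any exception from the worker
            except ValueError as e:
                errors.append((fs[f], str(e)))
    return sorted(results), errors
```

The dict mapping futures back to their inputs is a common trick for reporting which job failed.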
* If you want to build them yourself, it's not that hard. The source to multiprocessing.pool is well written and commented, and not that complicated, and most of the hard stuff isn't relevant to threading; the source to concurrent.futures is even simpler.
I've used this technique a few times, though I think it's a bit ugly:
import threading

def process_something():
    something = list(get_something)

    def worker():
        while something:
            obj = something.pop()
            # do something with obj

    threads = [threading.Thread(target=worker) for i in range(4)]
    [t.start() for t in threads]
    [t.join() for t in threads]
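One reason it's ugly: the `while something: something.pop()` check-then-pop can race between threads. A slightly safer variant of the same fixed-worker idea uses a queue.Queue, whose get_nowait is atomic; this is a sketch, with a doubling step standing in for the real per-item work:

```python
import queue
import threading

def process_items(items, num_workers=4):
    q = queue.Queue()
    for obj in items:
        q.put(obj)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                obj = q.get_nowait()  # atomic; no check-then-pop race
            except queue.Empty:
                return
            with lock:
                results.append(obj * 2)  # stand-in for real work on obj

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```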
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp

jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)

def asyncJSONs(file, index):
    try:
        with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = file.split('.')[0]
        materials[index] = properties
    except:
        print("Error parsing at {}".format(file))

process_list = []
i = 0
for file in tqdm(jsons):
    p = mp.Process(target=asyncJSONs, args=(file, i))
    p.start()
    process_list.append(p)
    i += 1

for process in process_list:
    process.join()
Everything in there relating to multiprocessing was cobbled together from a collection of Google searches and articles, so I wouldn't be surprised if it's not even remotely correct. For example, the i variable is a dirty attempt to keep the results in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers, processes don't share memory, so you can't assign directly into materials. The function has to use return to send its result back to the main process, and the main process has to wait for and collect that result.
It can be simpler with Pool: it doesn't need a manually managed queue, it returns results in the same order as the data in all_jsons, and you can set how many processes run at the same time so it doesn't hog the CPU from other processes on the system.
The one thing it loses is the tqdm progress bar.
I couldn't test it but it can be something like this
import os
import json
from multiprocessing import Pool

# --- functions ---

def asyncJSONs(filename):
    try:
        fullpath = os.path.join(folder, filename)
        with open(fullpath) as f:
            data = json.loads(f.read())
        properties = process_dict(data, {})
        properties['name'] = filename.split('.')[0]
        return properties
    except:
        print("Error parsing at {}".format(filename))

# --- main ---

# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'

if __name__ == '__main__':
    # code only for main process
    all_jsons = os.listdir(folder)
    with Pool(5) as p:
        materials = p.map(asyncJSONs, all_jsons)
    for item in materials:
        print(item)
BTW: other modules that can do the same job: concurrent.futures, joblib, ray.
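For comparison, the same ordered-map pattern with the stdlib's concurrent.futures might look like the sketch below; parse is a trivial stand-in for the question's process_dict logic, and ThreadPoolExecutor is used here to keep the demo self-contained (swap in ProcessPoolExecutor for CPU-bound parsing):

```python
from concurrent.futures import ThreadPoolExecutor

def parse(text):
    # stand-in for the real per-file parsing; just measures the input here
    return len(text)

def parse_all(texts, workers=5):
    # like multiprocessing.Pool.map, Executor.map preserves input order
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(parse, texts))
```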
Going to mention a totally different way of solving this problem: don't bother trying to append all the data to the same list. Extract the data you need and append it to a target file in ndjson/jsonlines format. That's just where, instead of objects inside a JSON array [{},{}...], you have a separate object on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP; you could use a tool like rush to parallelize.
each worker parses its data and generates the output dict
the worker opens the output file with the append flag, dumps the data, and flushes immediately:
with open(out_file, 'a') as fp:
    print(json.dumps(data), file=fp, flush=True)
Flushing ensures that, as long as your data is less than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and interleave writes. If they do get conflicted, you may need to write to a separate output file for each worker and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
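A sketch of the worker's write step, using threads here so the demo is self-contained (the record contents and file name are placeholders; the same append-and-flush works identically from separate processes):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def append_record(record, out_file):
    # one JSON object per line; flush so each line lands as a single write
    with open(out_file, "a") as fp:
        print(json.dumps(record), file=fp, flush=True)

def write_all(records, out_file, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for rec in records:
            ex.submit(append_record, rec, out_file)
    # the with-block waits for all submitted writes to finish

def read_back(out_file):
    # the jsonlines payoff: parse each line independently, never the whole file
    with open(out_file) as fp:
        return [json.loads(line) for line in fp]
```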
You need to understand how multiprocessing works. It starts a brand-new process for EACH task, each with a brand-new Python interpreter, which runs your script all over again. These processes do not share memory in any way; the other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can use a multiprocessing.Queue. Have the function stuff its results into the queue, while your main code waits for stuff to magically appear in the queue.
Also, PLEASE read the instructions in the multiprocessing docs about __main__. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in an
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
I'm reading in several thousand files at once, and for each file I need to perform operations on before yielding rows from each file. To increase performance I thought I could use asyncio to perhaps perform operations on files (and yield rows) whilst waiting for new files to be read in.
However from print statements I can see that all the files are opened and gathered, then each file is iterated over (same as would occur without asyncio).
I feel like I'm missing something quite obvious here which is making my asynchronous attempts, synchronous.
import asyncio

async def open_files(file):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

async def async_generator():
    file_outputs = await asyncio.gather(*[open_files(file) for file in files])
    for file_output in file_outputs:
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

async def main():
    async for yield_value in async_generator():
        pass

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Output:
opening files
opening files
.
.
.
using open file
using open file
EDIT
Using the code supplied by @user4815162342, I noticed that, although it was 3x quicker, the set of rows yielded from the generator was slightly different than without concurrency. I'm unsure as of yet whether some yields were missed from each file or the files were somehow re-ordered. So I made the following changes to the code from user4815162342 and passed a lock into pool.submit().
I should have mentioned when first asking: the ordering of rows in each file, and of the files themselves, is required.
import concurrent.futures
import multiprocessing

def open_files(file, lock):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

def generator():
    m = multiprocessing.Manager()
    lock = m.Lock()
    pool = concurrent.futures.ThreadPoolExecutor()
    file_output_futures = [pool.submit(open_files, file, lock) for file in files]
    for fut in concurrent.futures.as_completed(file_output_futures):
        file_output = fut.result()
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

def main():
    for yield_value in generator():
        pass

if __name__ == '__main__':
    main()
This way my non-concurrent and concurrent approaches yield the same values each time, however I have just lost all the speed gained from using concurrency.
I feel like I'm missing something quite obvious here which is making my asynchronous attempts, synchronous.
There are two issues with your code. The first one is that asyncio.gather() by design waits for all the futures to complete in parallel, and only then returns their results. So the processing you do in the generator is not interspersed with the IO in open_files as was your intention, but only begins after all the calls to open_files have returned. To process async calls as they are done, you should be using something like asyncio.as_completed.
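A small sketch of the as_completed difference, with asyncio.sleep standing in for real IO and the delays chosen so the last-submitted coroutine finishes first:

```python
import asyncio

async def fake_io(i):
    # simulated IO: higher i finishes sooner
    await asyncio.sleep(0.05 * (3 - i))
    return i

async def main():
    order = []
    # as_completed yields each future as soon as its coroutine finishes,
    # rather than waiting for all of them the way gather() does
    for fut in asyncio.as_completed([fake_io(i) for i in range(3)]):
        order.append(await fut)
    return order
```

Here gather() would return [0, 1, 2] in submission order, while as_completed hands back results in completion order.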
The second and more fundamental issue is that, unlike threads which can parallelize synchronous code, asyncio requires everything to be async from the ground up. It's not enough to add async to a function like open_files to make it async. You need to go through the code and replace any blocking calls, such as calls to IO, with equivalent async primitives. For example, connecting to a network port should be done with open_connection, and so on. If your async function doesn't await anything, as appears to be the case with open_files, it will execute exactly like a regular function and you won't get any benefits of asyncio.
Since you use IO on regular files, and operating systems don't expose portable async interface for regular files, you are unlikely to profit from asyncio. There are libraries like aiofiles that use threads under the hood, but they are as likely to make your code slower than to speed it up because their nice-looking async APIs involve a lot of internal thread synchronization. To speed up your code, you can use a classic thread pool, which Python exposes through the concurrent.futures module. For example (untested):
import concurrent.futures

def open_files(file):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

def generator():
    pool = concurrent.futures.ThreadPoolExecutor()
    file_output_futures = [pool.submit(open_files, file) for file in files]
    for fut in file_output_futures:
        file_output = fut.result()
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

def main():
    for yield_value in generator():
        pass

if __name__ == '__main__':
    main()
I want to run several python script at the same time using concurrent.futures.
The serial version of my code goes and looks for a specific Python file in each folder and executes it.
import os
import re
import subprocess
import time
from glob import glob
import concurrent.futures as cf

FileList = []
start_dir = os.getcwd()
pattern = "Read.py"

for dir, _, _ in os.walk(start_dir):
    FileList.extend(glob(os.path.join(dir, pattern)))

i = 0
for file in FileList:
    dir = os.path.dirname(file)
    dirname1 = os.path.basename(dir)
    print(dirname1)
    i = i + 1
    Str = 'python ' + file
    print(Str)
    completed_process = subprocess.run(Str)
For the parallel version of my code:

def Python_callback(future):
    print(future.run_type, future.jid)
    return "One Folder finished executing"

def Python_execute():
    from concurrent.futures import ProcessPoolExecutor as Pool
    args = FileList
    pool = Pool(max_workers=1)
    future = pool.submit(subprocess.call, args, shell=1)
    future.run_type = "run_type"
    future.jid = FileList
    future.add_done_callback(Python_callback)
    print("Python executed")

if __name__ == '__main__':
    import subprocess
    Python_execute()
The issue is that I am not sure how to pass each element of FileList to a separate CPU.
Thanks for your help in advance.
The smallest change is to use submit once for each element, instead of once for the whole list:
futures = []
for file in FileList:
    future = pool.submit(subprocess.call, file, shell=1)
    future.blah blah
    futures.append(future)
The futures list is only necessary if you want to do something with the futures—wait for them to finish, check their return values, etc.
Meanwhile, you're explicitly creating the pool with max_workers=1. Not surprisingly, this means you'll only get 1 worker child process, so it'll end up waiting for one subprocess to finish before grabbing the next one. If you want to actually run them concurrently, remove that max_workers and let it default to one per core (or pass max_workers=8 or some other number that's not 1, if you have a good reason to override the default).
While we're at it, there are a lot of ways to simplify what you're doing:
Do you really need multiprocessing here? If you need to communicate with each subprocess, that can be painful to do in a single thread—but threads, or maybe asyncio, will work just as well as processes here.
More to the point, it doesn't look like you actually do need anything but launch the process and wait for it to finish, and that can be done in simple, synchronous code.
Why are you building a string and using shell=1 instead of just passing a list and not using the shell? Using the shell unnecessarily creates overhead, safety problems, and debugging annoyances.
You really don't need the jid on each future—it's just the list of all of your invocation strings, which can't be useful. What might be more useful is some kind of identifier, or the subprocess return code, or… probably lots of other things, but they're all things that could be done by reading the return value of subprocess.call or a simple wrapper.
You really don't need the callback either. If you just gather all the futures in a list and as_completed it, you can print the results as they show up more simply.
If you do both of the above, you've got nothing left but a pool.submit inside the loop—which means you can replace the entire loop with pool.map.
You rarely need, or want, to mix os.walk and glob. When you actually have a glob pattern, apply fnmatch over the files list from os.walk. But here, you're just looking for a specific filename in each dir, so really, all you need to filter on is file == 'Read.py'.
You're not using the i in your loop. But if you do need it, it's better to do for i, file in enumerate(FileList): than to do for file in FileList: and manually increment an i.
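Putting several of those suggestions together, a trimmed-down version might look like the sketch below. Read.py is the question's filename; sys.executable replaces the bare 'python' string, and passing an argument list avoids the shell entirely:

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def find_scripts(start_dir, name="Read.py"):
    # filter os.walk's own file list; no need to mix in glob
    return [os.path.join(d, f)
            for d, _, files in os.walk(start_dir)
            for f in files
            if f == name]

def run_script(path):
    # argument list, no shell=1: no quoting headaches, no injection risk
    return subprocess.run([sys.executable, path]).returncode

def run_all(start_dir):
    # threads are plenty here; the real work happens in the child processes
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_script, find_scripts(start_dir)))
```

The whole submit loop collapses into pool.map, and the return codes come back in the same order as the script list.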
I'm working on a python 2.7 program that performs these actions in parallel using multiprocessing:
reads a line from file 1 and file 2 at the same time
applies function(line_1, line_2)
writes the function output to a file
I am new to multiprocessing and not extremely experienced with Python in general. Therefore, I read a lot of already-asked questions and tutorials: I feel close to the point, but I am now probably missing something that I can't really spot.
The code is structured like this:
from itertools import izip
from multiprocessing import Queue, Process, Lock
import multiprocessing as mp

nthreads = int(mp.cpu_count())
outq = Queue(nthreads)
l = Lock()

def func(record_1, record_2):
    result = # do stuff
    outq.put(result)

OUT = open("outputfile.txt", "w")
IN1 = open("infile_1.txt", "r")
IN2 = open("infile_2.txt", "r")

processes = []
for record_1, record_2 in izip(IN1, IN2):
    proc = Process(target=func, args=(record_1, record_2))
    processes.append(proc)
    proc.start()

for proc in processes:
    proc.join()

while (not outq.empty()):
    l.acquire()
    item = outq.get()
    OUT.write(item)
    l.release()

OUT.close()
IN1.close()
IN2.close()
To my understanding (so far) of multiprocessing as package, what I'm doing is:
creating a queue for the results of the function that has a size limit compatible with the number of cores of the machine.
filling this queue with the results of func().
reading the queue items until the queue is empty, writing them to the output file.
Now, my problem is that when I run this script it immediately becomes a zombie process. I know that the function works because without the multiprocessing implementation I had the results I wanted.
I'd like to read from the two files and write to output at the same time, to avoid generating a huge list from my input files and then reading it (input files are huge). Do you see anything gross, completely wrong or improvable?
The biggest issue I see is that you should pass the queue object to the process instead of trying to use it as a global in your function.

def func(record_1, record_2, queue):
    result = # do stuff
    queue.put(result)

for record_1, record_2 in izip(IN1, IN2):
    proc = Process(target=func, args=(record_1, record_2, outq))

Also, as currently written, you would still be pulling all that information into memory (i.e. into the queue) and waiting for the reading to finish before writing to the output file. You need to move the proc.join loop to after reading through the queue, and instead of putting all the information into the queue at the end of func, it should be filling the queue with chunks in a loop over time; otherwise it's the same as just reading it all into memory.
You also don't need a lock unless you are using it in the worker function func, and if you do, you will again want to pass it through.
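A sketch of that reshaped flow; the string concatenation stands in for the real "do stuff", and the two key points are the queue passed through args and draining the queue before joining:

```python
from multiprocessing import Process, Queue

def func(record_1, record_2, queue):
    # the queue arrives as an argument instead of a global
    queue.put(record_1 + record_2)  # stand-in for the real work

def run(pairs):
    outq = Queue()
    procs = [Process(target=func, args=(a, b, outq)) for a, b in pairs]
    for p in procs:
        p.start()
    # drain first: a child can't exit while blocked putting into a full queue
    results = [outq.get() for _ in pairs]
    for p in procs:
        p.join()
    return sorted(results)  # arrival order is not guaranteed
```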
If you don't want to read/store a lot in memory, I would write out at the same time as iterating through the input files. Here is a basic example of combining each line of the two files together.
with open("infile_1.txt") as infile1, open("infile_2.txt") as infile2, open("out", "w") as outfile:
    for line1, line2 in zip(infile1, infile2):
        outfile.write(line1 + line2)
I don't want to write too much about all of these; I'm just trying to give you ideas. Let me know if you want more detail about something. Hope it helps!
I am looking to use multiprocessing or threading in my application to do some time-consuming operations in the background. I have looked at many examples, but I still have been unable to achieve what I want. I am trying to load a bunch of images, each of which takes several seconds. I would like the first image to be loaded and then have the others loading in the background and being stored in a list (to use later) while the program is still doing other things (like allowing controls on my GUI to still work). If I have something like the example below, how can I do this? And should I use multiprocessing or threading?
class myClass():
    def __init__(self, arg1, arg2):
        # initializes some parameters

    def aFunction(self):
        # does some things
        # creates multiple processes or threads that each call interestingFunc
        # continues doing things

    def interestingFunc(self):
        # performs operations

m = myClass()
You can use either approach. Have your Process or Thread perform its work and then put the results onto a Queue. Your main thread/process can then, at its leisure, take the results off the queue and do something with them. Here's an example with multiprocessing.
from multiprocessing import Process, Queue

def load_image(img_file, output_q):
    with open(img_file, 'rb') as f:
        img_data = f.read()
    # perform processing on img_data, then queue results
    output_q.put((img_file, img_data))

result_q = Queue()
images = ['/tmp/p1.png', '/tmp/p2.jpg', '/tmp/p3.gif', '/tmp/p4.jpg']
for img in images:
    Process(target=load_image, args=(img, result_q)).start()

for i in range(len(images)):
    img, data = result_q.get()
    # do something with the image data
    print "processing of image file %s complete" % img
This assumes that the order of processing is not significant to your application, i.e. the image data from each file might be loaded onto the queue in any particular order.
Here's the simplest possible way to do multiple things in parallel, it'll help get you started:
source
import multiprocessing

def calc(num):
    return num * 2

pool = multiprocessing.Pool(5)
for output in pool.map(calc, [1, 2, 3]):
    print 'output:', output
output
output: 2
output: 4
output: 6
You could try something like this:
import os
from thread import start_new_thread

def loadPicture(pic):
    pass # do some stuff with the pictures

pictureList = [f for f in os.listdir(r"C:\your\picture\folder")]
for pic in pictureList:
    start_new_thread(loadPicture, (pic,))
This is a quite simple approach; the call returns immediately, and perhaps you'll need to use allocate_lock. If you need more capabilities, you might consider using the threading module instead. Be careful to pass a tuple as the 2nd argument to the thread.
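For reference, the same loop with the higher-level threading module gives you Thread objects you can join, unlike the fire-and-forget start_new_thread. In this sketch the .upper() call stands in for real picture loading:

```python
import threading

def load_picture(pic, results, lock):
    # stand-in work; the lock guards the shared results list
    with lock:
        results.append(pic.upper())

def load_all(pictures):
    results = []
    lock = threading.Lock()
    threads = [threading.Thread(target=load_picture, args=(p, results, lock))
               for p in pictures]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # unlike start_new_thread, these threads can be waited on
    return sorted(results)
```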