Whenever I use my (other) multiprocessing code it works fine, but I do not know how to adapt it to report progress towards completion, for example "Completed 5 / 10 files". Basically, I would like to adapt the code below to allow multiprocessing while keeping such a count.
So I use:
file_paths = r"path to file with paths"
count = 0
pool = Pool(16)
pool.map(process_control, file_paths)
pool.close()
pool.join()
Within process_control, at the end of the function, I have count += 1 and return count.
I guess the equivalent code would be something like:
def process_control(count, file_path):
do stuff
count += 1
print("Process {} / {} completed".format(count, len(file_paths))
return count
file_paths = r"path to file with paths"
count = 0
for path in file_paths:
count = process_control(count, path)
Something like that. I hope my explanation is clear.
Each subprocess has its own copy of count, so all it can do is track the work in that one process; the count won't aggregate across processes. But the parent can do the counting. map waits for all tasks to complete, so that isn't helpful. imap is better - it iterates over results - but it maintains order, so reporting is still delayed. imap_unordered with chunksize=1 is your best option: each task's return value (even if it is None) is handed back to the parent as soon as the task finishes.
import multiprocessing

def process_control(file_path):
    # do stuff
    ...

file_paths = ["path1", ...]

with multiprocessing.Pool(16) as pool:
    count = 0
    for _ in pool.imap_unordered(process_control, file_paths, chunksize=1):
        count += 1
        print("Process {} / {} completed".format(count, len(file_paths)))
A note on chunksize. There are costs to using a pool - each work item needs to be sent to the subprocess and its value returned. This back-and-forth IPC is relatively expensive, so the pool will "chunk" the work items, meaning that it will send many work items to a given subprocess all in one chunk and the process will only return when the entire chunk of data has been processed through the worker function.
This is great when there are many relatively short work items. But suppose that different work items take different amount of time to execute. There will be a tall-pole subprocess still working on its chunk even though the others have finished.
More important for your case, the results aren't posted back to the parent until the chunk completes so you don't get real-time reporting of completion.
Set chunksize to 1 and the subprocess will return results immediately for more accurate accounting.
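To make the difference concrete, here is a small, self-contained sketch (not from the question; slow_task is a made-up stand-in with uneven runtimes) that compares chunksize=1 with a larger chunk. With chunksize=1 the progress lines appear steadily, roughly one per finished task, while with chunksize=16 they arrive in bursts as whole chunks complete:
import multiprocessing
import random
import time

def slow_task(i):
    # hypothetical work item with an uneven runtime
    time.sleep(random.uniform(0.01, 0.3))
    return i

if __name__ == "__main__":
    items = list(range(64))
    for chunksize in (1, 16):
        print("--- chunksize={} ---".format(chunksize))
        start = time.time()
        with multiprocessing.Pool(8) as pool:
            done = 0
            for _ in pool.imap_unordered(slow_task, items, chunksize=chunksize):
                done += 1
                print("{} / {} completed after {:.2f}s".format(done, len(items), time.time() - start))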
For simple cases, the previous answer by @tedelaney is excellent.
For more complicated cases, Value provides shared memory:
from multiprocessing import Value
counter = Value('i', 0)
# increment the value
with counter.get_lock():
counter.value += 1
# get the value. Read lock automatically used
processes_done = counter.value
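The snippet above shows only the counter itself. One way to connect it to the pool example from the first answer - a sketch, assuming the counter is handed to each worker through the pool's initializer, since a synchronized Value cannot simply be passed as a map argument - would be:
from multiprocessing import Pool, Value

counter = None

def init_worker(shared_counter):
    # runs once in every worker process; keep a reference to the shared Value
    global counter
    counter = shared_counter

def process_control(file_path):
    # do stuff with file_path ...
    with counter.get_lock():
        counter.value += 1
        print("Completed {} files".format(counter.value))

if __name__ == "__main__":
    file_paths = ["path1", "path2", "path3"]  # placeholder list
    counter = Value('i', 0)
    with Pool(4, initializer=init_worker, initargs=(counter,)) as pool:
        pool.map(process_control, file_paths)
    print("Total completed:", counter.value)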
I have this task which is sort of I/O bound and CPU bound at the same time.
Basically I am getting a list of queries from a user, Google-searching each of them (via custom-search-api), storing each query's results in a .txt file, and storing all results in a results.txt file.
I was thinking that maybe parallelism might be an advantage here.
My whole task is wrapped in an object which has two member fields that I am supposed to use across all threads/processes (a list and a dictionary).
Therefore, when I use multiprocessing I get weird results (I assume that it is because of my shared resources).
i.e:
class MyObject(object):
_my_list = []
_my_dict = {}
_my_dict contains key:value pairs of "query_name":list().
_my_list is a list of queries to search in google. It is safe to assume that it is not written into.
For each query, I search it on Google, grab the top results, and store them in _my_dict.
I want to do this in parallel. I thought that threading might be good, but it seems to slow the work down.
Here is how I attempted to do it (this is the method which does the entire job per query):
def _do_job(self, query):
""" search the query on google (via http)
save results on a .txt file locally. """
This is the method which is supposed to execute all jobs for all queries in parallel:
def find_articles(self):
p = Pool(processes=len(self._my_list))
p.map_async(self._do_job, self._my_list)
p.close()
p.join()
self._create_final_log()
The above execution does not work; I get corrupted results...
When I use multithreading however, the results are fine, but very slow:
def find_articles(self):
thread_pool = []
for vendor in self._vendors_list:
self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
thread_pool.append(thread)
thread.start()
for thread in thread_pool:
thread.join()
self._create_final_log()
Any help would be appreciated, thanks!
I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, and starting a thread per query spawns too many threads at once and hits a bottleneck). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete the task is to use as much of the network as possible without hitting a bottleneck, which is why the number of threads actively making requests at any one time is capped.
In your case, cycling through a list of queries with a thread pool and a callback function would be a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the correct thread pool size to avoid a bottleneck, but overall I've found this to work well.
import threading
class MultiThread:
def __init__(self, func, list_data, thread_cap=10):
"""
Parameters
----------
func : function
Callback function to multi-thread
        list_data : list
            List of data items to multi-thread over
        thread_cap : int
            Maximum number of threads allowed in the pool
"""
self.func = func
self.thread_cap = thread_cap
self.thread_pool = []
self.current_index = -1
self.total_index = len(list_data) - 1
self.complete = False
        self.list_data = list_data
        self._index_lock = threading.Lock()  # guards current_index across worker threads
def start(self):
for _ in range(self.thread_cap):
thread = threading.Thread(target=self._wrapper)
self.thread_pool += [thread]
thread.start()
    def _wrapper(self):
        while not self.complete:
            # claim the next index under the lock so two threads never take the same item
            with self._index_lock:
                if self.current_index < self.total_index:
                    self.current_index += 1
                    index = self.current_index
                else:
                    self.complete = True
                    return
            self.func(self.list_data[index])
def wait_on_completion(self):
for thread in self.thread_pool:
thread.join()
import requests #, time
_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.sessions.session()
def example_callback_func(query):
global _my_dict
# code to grab data here
r = s.get(base_url+query)
_my_dict[query] = r.text # whatever parsed results
print(r, query)
#start_time = time.time()
_my_list = ["examplequery"+str(n) for n in range(100)]
mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()
# output queries to file
#print("Time:{:2f}".format(time.time()-start_time))
You could also open the file and write whatever you need as you go, or output the data at the end. Obviously, my replica here isn't exactly what you need, but it's a solid boilerplate with a lightweight class I made that will greatly reduce the time the task takes. It uses a thread pool to call a callback function that takes a single parameter (the query).
In my test here, it completed 100 queries in ~2 seconds. I could definitely play with the thread cap and get the timings lower before hitting the bottleneck.
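As a side note, the standard library's concurrent.futures.ThreadPoolExecutor implements the same capped-thread-pool idea; a minimal sketch of the example above using it (same hypothetical query setup, with max_workers playing the role of thread_cap) might look like:
import concurrent.futures
import requests

base_url = "https://www.google.com/search?q="
_my_dict = {}

def example_callback_func(query):
    # code to grab data here
    r = requests.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    return query

_my_list = ["examplequery" + str(n) for n in range(100)]

with concurrent.futures.ThreadPoolExecutor(max_workers=30) as executor:
    # executor.map yields results in input order as the worker threads finish them
    for query in executor.map(example_callback_func, _my_list):
        print("finished", query)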
I have a Python multiprocessing pool doing a very long job that even after a thorough debugging is not robust enough not to fail every 24 hours or so, because it depends on many third-party, non-Python tools with complex interactions. Also, the underlying machine has certain problems that I cannot control. Note that by failing I don't mean the whole program crashing, but some or most of the processes becoming idle because of some errors, and the app itself either hanging or continuing the job just with the processes that haven't failed.
My solution right now is to periodically kill the job, manually, and then just restart from where it was.
Even if it's not ideal, what I want to do now is the following: restart the multiprocessing pool periodically, programmatically, from the Python code itself. I don't really care if this implies killing the pool workers in the middle of their job. What would be the best way to do that?
My code looks like:
with Pool() as p:
for _ in p.imap_unordered(function, data):
save_checkpoint()
log()
What I have in mind would be something like:
start = 0
end = 1000 # magic number
while start + 1 < len(data):
current_data = data[start:end]
with Pool() as p:
for _ in p.imap_unordered(function, current_data):
save_checkpoint()
log()
start += 1
end += 1
Or:
start = 0
end = 1000 # magic number
while start + 1 < len(data):
current_data = data[start:end]
    start_timeout(time=TIMEOUT)  # what would be the best way to do this without breaking multiprocessing?
try:
with Pool() as p:
for _ in p.imap_unordered(function, current_data):
save_checkpoint()
log()
start += 1
end += 1
except Timeout:
pass
Or any suggestion you think would be better. Any help would be much appreciated, thanks!
The problem with your current code is that it iterates the multiprocessed results directly, and that call will block. Fortunately there's an easy solution: use apply_async exactly as suggested in the docs. But because of how you describe the use-case here and the failure, I've adapted it somewhat. Firstly, a mock task:
from multiprocessing import Pool, TimeoutError, cpu_count
from time import sleep
from random import randint
def log():
print("logging is a dangerous activity: wear a hard hat.")
def work(d):
sleep(randint(1, 100) / 100)
print("finished working")
if randint(1, 10) == 1:
print("blocking...")
while True:
sleep(0.1)
return d
This work function will fail with a probability of 0.1, blocking indefinitely. We create the tasks:
data = list(range(100))
nproc = cpu_count()
And then generate futures for all of them:
while data:
print(f"== Processing {len(data)} items. ==")
with Pool(nproc) as p:
        tasks = [p.apply_async(work, (d,)) for d in data]
        failed = []  # collects tasks whose .get() timed out
Then we can try to get the tasks out manually:
for task in tasks:
try:
res = task.get(timeout=1)
data.remove(res)
log()
except TimeoutError:
failed.append(task)
if len(failed) < nproc:
print(
f"{len(failed)} processes are blocked,"
f" but {nproc - len(failed)} remain."
)
else:
break
The controlling timeout here is the timeout to .get. It should be as long as you expect the longest process to take. Note that we detect when the whole pool is tied up and give up.
But since in the scenario you describe some threads are going to take longer than others, we can give 'failed' processes some time to recover. Thus every time a task fails we quickly check if the others have in fact succeeded:
for task in list(failed):  # iterate over a copy so tasks can be removed from failed
    try:
        res = task.get(timeout=0.01)
        data.remove(res)
        failed.remove(task)
        log()
    except TimeoutError:
        continue
Whether this is a good addition in your case depends on whether your tasks really are as flaky as I'm guessing they are.
Exiting the context manager for the pool will terminate the pool, so we don't even need to handle that ourselves. If you have significant variation you might want to increase the pool size (thus increasing the number of tasks which are allowed to stall) or allow tasks a grace period before considering them 'failed'.
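For reference, here is how the fragments above fit together into a single loop (just an assembly of the pieces already shown, reusing the work, log, data and nproc definitions from the top):
while data:
    print(f"== Processing {len(data)} items. ==")
    with Pool(nproc) as p:
        tasks = [p.apply_async(work, (d,)) for d in data]
        failed = []
        for task in tasks:
            try:
                res = task.get(timeout=1)
                data.remove(res)
                log()
            except TimeoutError:
                failed.append(task)
                if len(failed) < nproc:
                    print(
                        f"{len(failed)} processes are blocked,"
                        f" but {nproc - len(failed)} remain."
                    )
                else:
                    break
        # give apparently-stalled tasks a short grace period to recover
        for task in list(failed):
            try:
                res = task.get(timeout=0.01)
                data.remove(res)
                failed.remove(task)
                log()
            except TimeoutError:
                continue
    # leaving the with-block terminates the pool (including blocked workers);
    # the outer loop then retries whatever is still left in data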
I'm trying to create a function that will generate a hash with 9 leading zeroes using the SHA-1 algorithm. The hash is based on some random data and, as in cryptocurrency mining, I just want to keep adding 1 to the counter that is appended to the string being hashed.
To make this faster I used map() from the Pool class to run it on all my cores, but I have an issue if I pass a chunk larger than range(99999999).
import datetime
import hashlib
import multiprocessing
import sys
from multiprocessing import Pool

def computesha(counter):
hash = 'somedata'+'otherdata'+str(counter)
newHash = hashlib.sha1(hash.encode()).hexdigest()
if newHash[:9] == '000000000':
print(str(newHash))
print(str(counter))
return str(newHash), str(counter)
if __name__ == '__main__':
d1 = datetime.datetime.now()
print("Start timestamp" + str(d1))
manager = multiprocessing.Manager()
return_dict = manager.dict()
p = Pool()
p.map(computesha, range(sys.maxsize) )
print(return_dict)
p.close()
p.join()
d2 = datetime.datetime.now()
print("End timestamp " + str(d2))
print("Elapsed time: " + str((d2-d1)))
I want to create something similar to a global counter to feed into the function while it is running in parallel, but if I try range(sys.maxsize) I get a MemoryError (I know why: I don't have enough RAM, and few do), so I want to split the list generated by range() into chunks.
Is this possible or should I try a different approach?
Hi Alin and welcome to stackoverflow.
Firstly, yes, a global counter is possible, e.g. with a multiprocessing.Queue or a multiprocessing.Value which is passed to the workers. However, fetching a new number from the global counter would result in locking (and possibly waiting for) the counter. This can and should be avoided, as you need to make A LOT of counter queries. My proposed solution below avoids the global counter by using several local counters which work together as if they were a single global counter.
Regarding the RAM consumption of your code, I see two problems:
computesha returns a None value most of the time. This goes into the list of results created by map (even though you do not assign the return value of map). This means that the result list is a lot bigger than necessary.
Generally speaking, the RAM of a process is freed after the process finishes. Your processes start A LOT of tasks which all reserve their own memory. A possible solution is the maxtasksperchild option (see the documentation of multiprocessing.pool.Pool). When you set this option to 1000, the pool closes a process after 1000 tasks and creates a new one, which frees the memory; a short sketch follows below.
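For illustration only (this shows the maxtasksperchild option just described, not the solution proposed below; worker is a hypothetical stand-in for computesha):
import multiprocessing

def worker(counter):
    # hypothetical stand-in for computesha
    return counter % 7 == 0

if __name__ == "__main__":
    # with chunksize=1 every item is its own task, and each worker process is
    # replaced after 1000 tasks, releasing whatever memory it has accumulated
    with multiprocessing.Pool(maxtasksperchild=1000) as pool:
        for _ in pool.imap_unordered(worker, range(100000), chunksize=1):
            pass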
However, I'd like to propose a different solution which solves both problems, is very memory-friendly, and (as it seems to me after N<10 tests) runs faster than the solution with the maxtasksperchild option:
#!/usr/bin/env python3
import datetime
import multiprocessing
import hashlib
import sys
def computesha(process_number, number_of_processes, max_counter, results):
counter = process_number # every process starts with a different counter
data = 'somedata' + 'otherdata'
while counter < max_counter: #stop after max_counter jobs have been started
hash = "".join((data,str(counter)))
newHash = hashlib.sha1(hash.encode()).hexdigest()
if newHash[:9] == '000000000':
print(str(newHash))
print(str(counter))
# return the results through a queue
results.put((str(newHash), str(counter)))
counter += number_of_processes # 'jump' to the next chunk
if __name__ == '__main__':
# execute this file with two command line arguments:
number_of_processes = int(sys.argv[1])
max_counter = int(sys.argv[2])
# this queue will be used to collect the results after the jobs finished
results = multiprocessing.Queue()
processes = []
# start a number of processes...
for i in range(number_of_processes):
p = multiprocessing.Process(target=computesha, args=(i,
number_of_processes,
max_counter,
results))
p.start()
processes.append(p)
# ... then wait for all processes to end
for p in processes:
p.join()
# collect results
while not results.empty():
print(results.get())
results.close()
This code spawns the desired number_of_processes, each of which calls the computesha function. If number_of_processes=8, the first process calculates the hash for the counter values [0,8,16,24,...], the second process for [1,9,17,25,...], and so on.
The advantages of this approach: in each iteration of the while loop, the memory for hash and newHash can be reused; loops are cheaper than function calls, and only number_of_processes function calls have to be made; and the uninteresting results are simply forgotten.
A possible disadvantage is that the counters are completely independent, and every process will do exactly 1/number_of_processes of the overall work even if some are faster than others. Eventually, the program is as fast as the slowest process. I didn't measure it, but I guess it is a rather theoretical problem here.
Hope that helps!
I have a python script where at the top of the file I have:
result_queue = Queue.Queue()
key_list = *a large list of small items* #(actually from bucket.list() via boto)
I have learned that Queues are process safe data structures. I have a method:
def enqueue_tasks(keys):
for key in keys:
try:
result = perform_scan.delay(key)
result_queue.put(result)
except:
print "failed"
The perform_scan.delay() function here actually calls a celery worker, but I don't think is relevant (it is an asynchronous process call).
I also have:
def grouper(iterable, n, fillvalue=None):
args = [iter(iterable)] * n
return izip_longest(fillvalue=fillvalue, *args)
Lastly I have a main() function:
def main():
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(enqueue_tasks, group) for group in grouper(key_list, 40)]
concurrent.futures.wait(futures)
print len(result_queue)
The result from the print statement is a 0. Yet if I include a print statement of the size of result_queue in enqueue_tasks, while the program is running, I can see that the size is increasing and things are being added to the queue.
Ideas of what is happening?
It looks like there's a simpler solution to this problem.
You're building a list of futures. The whole point of futures is that they're future results. In particular, whatever each function returns, that's the (eventual) value of the future. So, don't do the whole "push results onto a queue" thing at all, just return them from the task function, and pick them up from the futures.
The simplest way to do this is to break that loop up so that each key is a separate task, with a separate future. I don't know whether that's appropriate for your real code, but if it is:
def do_task(key):
try:
return perform_scan.delay(key)
except:
print "failed"
def main():
executor = concurrent.futures.ProcessPoolExecutor(10)
futures = [executor.submit(do_task, key) for key in key_list]
# If you want to do anything with these results, you probably want
# a loop around concurrent.futures.as_completed or similar here,
# rather than waiting for them all to finish, ignoring the results,
# and printing the number of them.
concurrent.futures.wait(futures)
print len(futures)
Of course that doesn't do the grouping. But do you need it?
The most likely reason for the grouping to be necessary is that the tasks are so tiny that the overhead in scheduling them (and pickling the inputs and outputs) swamps the actual work. If that's true, then you can almost certainly wait until a whole batch is done to return any results. Especially given that you're not even looking at the results until they're all done anyway. (This model of "split into groups, process each group, merge back together" is pretty common in cases like numerical work, where each element may be tiny, or elements may not be independent of each other, but there are groups that are big enough or independent from the rest of the work.)
At any rate, that's almost as simple:
def do_tasks(keys):
results = []
for key in keys:
try:
result = perform_scan.delay(key)
results.append(result)
except:
print "failed"
return results
def main():
executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_tasks, group) for group in grouper(key_list, 40)]
    print sum(len(future.result()) for future in concurrent.futures.as_completed(futures))
Or, if you prefer to first wait and then calculate:
def main():
executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_tasks, group) for group in grouper(key_list, 40)]
concurrent.futures.wait(futures)
print sum(len(future.result()) for future in futures)
But again, I doubt you need even this.
You need to use a multiprocessing.Queue, not a Queue.Queue. Queue.Queue is thread-safe, not process-safe, so the changes you make to it in one process are not reflected in any others.
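A minimal, self-contained sketch of that difference (using multiprocessing.Process directly, since a plain multiprocessing.Queue has to be given to child processes at creation time, e.g. as a Process argument, and cannot be submitted as a task argument to a ProcessPoolExecutor; for the executor case a multiprocessing.Manager().Queue() is the usual substitute):
import multiprocessing

def enqueue_tasks(keys, result_queue):
    for key in keys:
        result_queue.put(key * 2)  # stand-in for perform_scan.delay(key)

if __name__ == "__main__":
    result_queue = multiprocessing.Queue()
    procs = [
        multiprocessing.Process(target=enqueue_tasks, args=(group, result_queue))
        for group in ([1, 2, 3], [4, 5, 6])
    ]
    for p in procs:
        p.start()
    results = [result_queue.get() for _ in range(6)]  # blocks until all puts arrive
    for p in procs:
        p.join()
    print(len(results))  # 6: puts from both child processes are visible to the parent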
I am writing a script that processes some mmaps concurrently with multiprocessing.Process and updates a result list stored in an mmap and locked with a mutex.
My function to write to the result list looks like this
def update_result(result_mmap, new_value, new_value_index, sema):
sema.acquire()
result_mmap.seek(0)
old_result = result_mmap.readline().split("\t")
old_result[new_value_index] = new_value
new_result = "\t".join(map(str, old_result))
result_mmap.resize(len(new_result))
result_mmap.seek(0)
result_mmap.write(new_result)
sema.release()
This works SOMETIMES, but other times, depending on the order of execution of the processes, it seems that the result_mmap isn't resizing properly. I am not sure where to look from here- I know that a race condition exists but I don't know why.
Edit: This is the function that calls update_result:
def apply_function(mmapped_files, function, result_mmap, result_index, sema):
for mf in mmapped_files:
accumulator = int(mf.readline())
while True:
line = mf.readline()
if line is None or line == '':
break
num = int(line)
accumulator = function(num, accumulator)
update_result(result_mmap, result_index, inc, sema)
Maybe I'm wrong, but are you sure that the semaphore really works between the processes (is it a system-level mutex)? Because if it's not, the processes do not share the same memory space. What you might want to consider is using the threading library instead, so that the threads share the same semaphore.
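Following that suggestion, a minimal sketch (not the poster's code; it drops the mmap details) of threads serializing their writes through one shared threading.Semaphore:
import threading

sema = threading.Semaphore(1)  # one writer at a time
results = [0] * 4

def update_result(new_value, new_value_index):
    # the with-statement acquires and releases the semaphore, like the
    # explicit sema.acquire()/sema.release() pair in the question
    with sema:
        results[new_value_index] = new_value

threads = [threading.Thread(target=update_result, args=(i * 10, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # every update applied exactly once, no interleaved writes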