While trying to make my script multi-threaded,
I came across multiprocessing.
Is there a way to make multiprocessing work together with threading? For example:
cpu 1 -> 3 threads(worker A,B,C)
cpu 2 -> 3 threads(worker D,E,F)
...
I'm trying to do it myself, but I keep running into problems.
Is there a way to make those two work together?
You can generate a number of Processes, and then spawn Threads from inside them. Each Process can handle almost anything the standard interpreter thread can handle, so there's nothing stopping you from creating new Threads or even new Processes within each Process. As a minimal example:
import multiprocessing
import threading

def foo():
    print("Thread Executing!")

def bar():
    threads = []
    for _ in range(3):  # each Process creates a number of new Threads
        thread = threading.Thread(target=foo)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    processes = []
    for _ in range(3):
        p = multiprocessing.Process(target=bar)  # create a new Process
        p.start()
        processes.append(p)
    for process in processes:
        process.join()
Communication between threads can be handled within each Process, and communication between the Processes can be handled at the root interpreter level using Queues or Manager objects.
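As a rough sketch of that split, the example below lets the threads inside each process share an ordinary list, while each process reports back to the parent through a multiprocessing.Queue. The names thread_task, process_task and run are made up for illustration, and the "work" is just recording an ID pair.

```python
import multiprocessing
import threading


def thread_task(process_id, thread_id, local_results):
    # Threads in one process share memory, so a plain list is enough here
    # (list.append is thread-safe in CPython).
    local_results.append((process_id, thread_id))


def process_task(process_id, threads_per_process, result_queue):
    local_results = []
    threads = [
        threading.Thread(target=thread_task, args=(process_id, t, local_results))
        for t in range(threads_per_process)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Send this process's combined results back to the parent
    result_queue.put(sorted(local_results))


def run(num_processes=2, threads_per_process=3):
    result_queue = multiprocessing.Queue()
    processes = [
        multiprocessing.Process(
            target=process_task, args=(p, threads_per_process, result_queue)
        )
        for p in range(num_processes)
    ]
    for p in processes:
        p.start()
    # Read one result per process before joining, so no child is left
    # waiting to flush its queue data
    results = [result_queue.get() for _ in processes]
    for p in processes:
        p.join()
    return sorted(results)


if __name__ == "__main__":
    print(run())
```

Reading the results before calling join() matters: a child that has put data on a queue may not exit until that data has been consumed.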
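You can define a function that runs 3 threads, and then spawn your processes with this function as their target. For example:
import multiprocessing
import threading

def threader():
    # runs inside each process: start 3 threads on yourfunc
    for _ in range(3):
        threading.Thread(target=yourfunc).start()

def main():
    # spawn whatever processes you need here to target threader
    for _ in range(2):  # e.g. 2 processes
        multiprocessing.Process(target=threader).start()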
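The same layout can also be written with the standard concurrent.futures module, which handles starting and joining for you. This is only a sketch: square and handle_chunk are hypothetical names, and the work is split into chunks by hand.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square(x):
    # Hypothetical unit of work
    return x * x


def handle_chunk(chunk):
    # Runs inside a worker process and fans the chunk out over 3 threads
    with ThreadPoolExecutor(max_workers=3) as threads:
        return list(threads.map(square, chunk))


def run(chunks, num_processes=2):
    # Each chunk is handed to one worker process
    with ProcessPoolExecutor(max_workers=num_processes) as processes:
        return list(processes.map(handle_chunk, chunks))


if __name__ == "__main__":
    print(run([[1, 2, 3], [4, 5, 6]]))
```

Both executors' map() return results in input order, so the output lines up with the chunks that were passed in.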
On several occasions, I have had a list of tasks that need to be executed via Python. Typically these tasks take a few seconds each, but there are hundreds of thousands of them, and threading significantly improves execution time. Is there a way to dynamically specify the number of threads a Python script should use to work through a stack of tasks?
I have had success running threads when executed in the body of Python code, but I have never been able to run threads correctly when they are within a function (I assume this is because of scoping). Below is my approach to dynamically define a list of threads which should be used to execute several tasks.
The problem is that this approach waits for a single thread to complete before continuing through the for loop.
import threading
import sys
import time

def null_thread():
    """ used to instantiate threads """
    pass

def instantiate_threads(number_of_threads):
    """ returns a list containing the number of threads specified """
    threads_str = []
    threads = []
    index = 0
    while index < number_of_threads:
        exec("threads_str.append(f't{index}')")
        index += 1
    for t in threads_str:
        t = threading.Thread(target = null_thread())
        t.start()
        threads.append(t)
    return threads

def sample_task():
    """ dummy task """
    print("task start")
    time.sleep(10)

def main():
    number_of_threads = int(sys.argv[1])
    threads = instantiate_threads(number_of_threads)
    # a routine that assigns work to the array of threads
    index = 0
    while index < 100:
        task_assigned = False
        while not task_assigned:
            for thread in threads:
                if not thread.is_alive():
                    thread = threading.Thread(target = sample_task())
                    thread.start()
                    # script seems to wait until thread is complete before moving on...
                    print(f'index: {index}')
                    task_assigned = True
                    index += 1
    # wait for threads to finish before terminating
    for thread in threads:
        while thread.is_alive():
            pass

if __name__ == '__main__':
    main()
Solved:
You could switch to concurrent.futures.ThreadPoolExecutor,
where you set the number of threads to spawn using
max_workers=<number of threads>. – user56700
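A minimal sketch of that suggestion follows. It also sidesteps the reason the question's loop appeared to wait: threading.Thread(target=sample_task()) calls sample_task immediately in the calling thread and passes its return value (None) as the target, whereas target=sample_task passes the function itself. The task body, the run_tasks name and the numbers are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def sample_task(task_id):
    """Dummy task; the short sleep stands in for real work."""
    time.sleep(0.01)
    return task_id


def run_tasks(number_of_threads, number_of_tasks):
    # The pool keeps at most number_of_threads tasks running at once and
    # hands the next task to whichever thread becomes free.
    with ThreadPoolExecutor(max_workers=number_of_threads) as executor:
        return list(executor.map(sample_task, range(number_of_tasks)))


if __name__ == "__main__":
    print(run_tasks(number_of_threads=4, number_of_tasks=20))
```

The thread count is an ordinary argument here, so it could just as well be read from sys.argv[1] as in the question.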
I am using the Python multiprocessing library to run a Selenium script. My code is below:
from multiprocessing import Process

#-- start and join multiple threads ---
thread_list = []
total_threads = 10  #-- no of parallel threads
for i in range(total_threads):
    t = Process(target=get_browser_and_start, args=[url, nlp, pixel])
    thread_list.append(t)
    print("starting thread...")
    t.start()
for t in thread_list:
    print("joining existing thread...")
    t.join()
As I understand the join() function, it waits for each process to complete. But I want a process to be assigned a new task as soon as it becomes free.
It can be understood like this:
Say 8 processes started in first instance.
no_of_tasks_to_perform = 100
for i in range(no_of_tasks_to_perform):
processes start(8)
if process no 2 finished executing, start new process
maintain 8 process at any point of time till
"i" is <= no_of_tasks_to_perform
Instead of repeatedly starting new processes, put all your tasks into a multiprocessing.Queue() and start 8 long-running processes. Each process keeps pulling tasks from the queue and working on them until there are no tasks left.
In your case, it's more like this:
from multiprocessing import Queue, Process

def worker(queue):
    # Keep taking tasks until the None sentinel arrives. Checking
    # queue.empty() is unreliable when several processes share the queue.
    for task in iter(queue.get, None):
        url, nlp, pixel = task[:3]  # unpack the task
        # now start to work on your task
        get_browser_and_start(url, nlp, pixel)

def main():
    queue = Queue()
    # Now put tasks into queue
    no_of_tasks_to_perform = 100
    for i in range(no_of_tasks_to_perform):
        queue.put([url, nlp, pixel, ...])
    # Now start all 8 processes
    processes = [Process(target=worker, args=(queue,)) for _ in range(8)]
    for process in processes:
        queue.put(None)  # one sentinel per worker, after all the tasks
        process.start()
    for process in processes:
        process.join()
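For something that can be run as-is, here is a toy version of the long-running-worker pattern described above. It is a sketch under assumptions: do_job is a hypothetical stand-in for get_browser_and_start, tasks are plain integers, and a None sentinel (rather than a queue.empty() check) tells each worker to stop.

```python
from multiprocessing import Process, Queue


def do_job(task):
    # Hypothetical stand-in for get_browser_and_start(url, nlp, pixel)
    return task * 2


def worker(tasks, results):
    # iter(tasks.get, None) keeps calling tasks.get() until it returns None
    for task in iter(tasks.get, None):
        results.put(do_job(task))


def run(num_tasks=20, num_workers=4):
    tasks, results = Queue(), Queue()
    workers = [
        Process(target=worker, args=(tasks, results)) for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for task in range(num_tasks):
        tasks.put(task)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    # Collect every result before joining the workers
    collected = [results.get() for _ in range(num_tasks)]
    for w in workers:
        w.join()
    return sorted(collected)


if __name__ == "__main__":
    print(run())
```

The number of worker processes stays fixed for the whole run; a worker that finishes one task simply picks up the next one from the queue.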
I'm using the Process class to create and manage subprocesses, which may return non-trivial quantities of data. The documentation states that join() is the correct way to wait for a Process to complete (https://docs.python.org/2/library/multiprocessing.html#the-process-class).
However, when using multiprocessing.Queue this can cause a hang after joining the process, as described here: https://bugs.python.org/issue8426 and here https://docs.python.org/2/library/multiprocessing.html#multiprocessing-programming (not a bug).
These docs suggest removing p.join() - but surely this will remove the guarantee that all processes have completed, as Queue.get() only waits for a single item to become available?
How can I wait for completion of all Processes in this case, and ensure I'm collecting output from them all?
A simple example of the hang I'd like to deal with:
from multiprocessing import Process, Queue

class MyClass:
    def __init__(self):
        pass

def example_run(output):
    output.put([MyClass() for i in range(1000)])
    print("Bottom of example_run() - note hangs after this is printed")

if __name__ == '__main__':
    output = Queue()
    processes = [Process(target=example_run, args=(output,)) for x in range(5)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print("Processes completed")
https://bugs.python.org/issue8426
This means that whenever you use a queue you need to make sure that
all items which have been put on the queue will eventually be removed
before the process is joined. Otherwise you cannot be sure that
processes which have put items on the queue will terminate.
In your example, I added a call to output.get() for each process before calling join(), and everything worked fine. We put data in a queue so that it can be used somewhere, so just make sure it is actually taken out.
for p in processes:
    p.start()
for p in processes:
    print(output.get())
for p in processes:
    p.join()
print("Processes completed")
An inelegant solution is to add
output_final = []
for i in range(5):  # we have 5 processes
    output_final.append(output.get())
before attempting to join any of the processes. This simply tries to get the appropriate number of outputs for the number of processes we've started.
It turns out that a much better, more general solution is not to use Process at all; use Pool instead. This way, the hassle of starting worker processes and collecting the results is handled for you:
import multiprocessing

class MyClass:
    def __init__(self):
        pass

def example_run(someArbitraryInput):
    foo = [MyClass() for i in range(10000)]
    return foo

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=5)
    output = pool.map(example_run, range(5))
    # make sure the processes are complete and tidy
    pool.close()
    pool.join()
    print("Processes completed")
I have a simple implementation using Python's multiprocessing module:
if __name__ == '__main__':
    jobs = []
    while True:
        for i in range(40):
            # fetch one by one from redis queue
            #item = item from redis queue
            p = Process(name='worker '+str(i), target=worker, args=(item,))
            # if p is not running, start p
            if not p.is_alive():
                jobs.append(p)
                p.start()
        for j in jobs:
            j.join()
            jobs.remove(j)

def worker(url_data):
    """worker function"""
    print(url_data['link'])
What I expect this code to do:
run in an infinite loop, waiting on the Redis queue.
if the Redis queue is not empty, fetch an item.
create 40 multiprocessing.Process instances, no more, no less.
if a process has finished, start a new one, so that ~40 processes are running at all times.
I read that, to avoid zombie processes, the children should be joined by the parent, which is what I tried to do in the second loop. The problem is that on launch it spawns 40 processes, and workers that finish stay in a zombie state until all currently spawned processes have finished;
then, in the next iteration of "while True", the same pattern repeats.
So my question is:
How can I avoid zombie processes and spawn a new process as soon as 1 of the 40 has finished?
For a task like the one you describe, it is usually better to take a different approach and use Pool.
You can have the main process fetch the data and let the workers deal with it.
Here is an example of Pool from the Python docs:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)  # start 4 worker processes
    result = pool.apply_async(f, [10])  # evaluate "f(10)" asynchronously
    print(result.get(timeout=1))  # prints "100" unless your computer is *very* slow
    print(pool.map(f, range(10)))  # prints "[0, 1, 4,..., 81]"
I also suggest using imap instead of map, since it seems your tasks can be processed asynchronously.
Roughly, your code will be:
from multiprocessing import Pool

def worker(url_data):
    """worker function"""
    print(url_data['link'])

if __name__ == '__main__':
    p = Pool(40)
    while True:
        items = fetch_items()  # placeholder: fetch the next items from the redis queue
        p.imap_unordered(worker, items)  # unordered version is faster
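If the parent also needs the workers' results, imap_unordered yields them as they complete rather than in submission order. The sketch below assumes a worker that returns the link instead of printing it, a small pool, and an in-memory list standing in for the Redis queue; process_batch is a made-up name.

```python
from multiprocessing import Pool


def worker(url_data):
    """Hypothetical worker: returns the link instead of printing it."""
    return url_data['link']


def process_batch(items, num_workers=4):
    with Pool(num_workers) as pool:
        # Results arrive in completion order, so sort them for a stable result
        return sorted(pool.imap_unordered(worker, items))


if __name__ == '__main__':
    batch = [{'link': 'a'}, {'link': 'b'}, {'link': 'c'}]
    print(process_batch(batch))
```

The pool reuses its worker processes across tasks, so there is no per-task process to join and nothing is left behind as a zombie while the pool is running.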
I have a huge list of items, and my program should analyze each one of them. To speed this up I want to use threads, but I want to limit them to 5. So I need a loop with 5 threads where, when one finishes its job, it grabs a new one, until the end of the list.
But I don't have a clue how to do that. Should I use a queue? For now I'm just running 5 threads in the simplest way:
for thread_number in range(5):
    thread = Th(thread_number)
    thread.start()
Thank you!
Separate the idea of worker thread and task -- do not have one worker work on one task, then terminate the thread. Instead, spawn 5 threads, and let them all get tasks from a common queue. Let them each iterate until they receive a sentinel from the queue which tells them to quit.
This is more efficient than continually spawning and terminating threads after they complete just one task.
import logging
import queue  # named Queue in Python 2
import threading

logger = logging.getLogger(__name__)
N = 100
sentinel = object()

def worker(jobs):
    name = threading.current_thread().name
    # keep calling jobs.get() until it returns the sentinel
    for task in iter(jobs.get, sentinel):
        logger.info(task)
    logger.info('Done')

def main():
    logging.basicConfig(level=logging.DEBUG,
                        format='[%(asctime)s %(threadName)s] %(message)s',
                        datefmt='%H:%M:%S')
    jobs = queue.Queue()
    # put tasks in the jobs Queue
    for task in range(N):
        jobs.put(task)
    threads = [threading.Thread(target=worker, args=(jobs,))
               for thread_number in range(5)]
    for t in threads:
        t.start()
        jobs.put(sentinel)  # Send a sentinel to terminate worker
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
It seems that you want a thread pool. If you're using Python 3, you're in luck: there is a ThreadPoolExecutor class.
Otherwise, from this SO question, you can find various solutions, either handcrafted or using hidden modules from the Python library.
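A minimal sketch of the ThreadPoolExecutor route, limited to 5 threads. The analyze function and the sample items are placeholders for whatever the real per-item work is.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyze(item):
    # Hypothetical analysis step
    return item.upper()


def analyze_all(items, max_threads=5):
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        # Submit everything up front; at most max_threads run at a time
        futures = {executor.submit(analyze, item): item for item in items}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == '__main__':
    print(analyze_all(['alpha', 'beta', 'gamma']))
```

as_completed yields each future as soon as it finishes, so results can be handled while the remaining items are still being processed.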