I have a program which starts multiple threads with data from the database, and I store each thread object in a dictionary.
thread_manager.py
threads = {}

............

# controller for /start-threads
def start_threads(request):
    datas = Data.objects.all()
    for data in datas:
        thread = MyThread(data)
        threads[data.id] = thread
        thread.start()
    return HttpResponse("all threads are running")


def get_thread(id, request):
    return threads[id]
At this point the threads dictionary has all the threads in it and I can access a thread object with threads[id]. Now, if I try to get a thread from another endpoint (I'm using Django):
views.py
import thread_manager


def get_thread(request, id):
    thread = thread_manager.get_thread(id, request)
    return HttpResponse("got thread with id {0}".format(id))
The threads dictionary is empty at this point (of course I get a KeyError). If I run this on the local dev server everything works fine; if I run it on the live server, which runs Django under uWSGI, it doesn't work. Is this a problem with uWSGI, or am I doing something wrong? Thanks.
Your server is almost certainly running with more than one process. But threads belong to a single process; you can't access them from a different one.
You don't say what you're doing with these threads, but offline work is almost always better done with a specific system such as Celery.
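For illustration only, a background job per Data row might look roughly like this with Celery; the task and module names here are assumptions, not your code:

# tasks.py -- a rough sketch only; assumes a Celery app is already configured
from celery import shared_task

from myapp.models import Data   # hypothetical app/module name


@shared_task
def process_data(data_id):
    # Do the long-running work for one Data row inside a Celery worker.
    data = Data.objects.get(pk=data_id)
    # ... long-running work with `data` goes here ...


# In the view, enqueue one task per row instead of starting threads:
#   for data in Data.objects.all():
#       process_data.delay(data.id)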
Background:
I have an inventory application that scrapes data from our various IT resources (VMware, storage, backups, etc.). We have a vCenter with over 2000 VMs registered to it. I have code that pulls details for each VM in its own thread to parallelize the collection.
I have them joined to a parent thread so that the different sections complete before it moves on to the next area. I also have it set to time out after 10 minutes so that the collection isn't held up by a single object thread that gets stuck. What I've found, though, is that when I try to pull data for more than about 1000 objects at once, it overloads the vCenter, which kills my connection, and almost all of the child threads die.
I'm quite sure that it's partially related to vCenter versions that are below 7.0 (we're using 6.7 in a lot of places). But we're stuck using the current versions due to older hardware.
What I would like to do is limit the number of threads spun up using semaphores, but also have them joined to the parent thread when they are spun up. All of the ways I've thought of to do this either end up serializing the collection or end up having the join time out after 10 minutes.
Is there a way to pull this off? The part that gets me stuck is joining the thread because it blocks the rest of the operations. Once I stop joining the threads, I can't join any others.
Code sample:
try:
    objects = vsphere_client.vcenter.VM.list()  # try the newer REST API operation
    old_objects = container_view.view           # old pyvmomi objects
    rest_api = True
except UnableToAllocateResource:
    # too many objects for the REST API to return
    # (happens at 1000 objects on vCenter 6.7 and 4000 on 7.0)
    objects = container_view.view
    old_objects = None
    rest_api = False  # fall back to pyvmomi
except OperationNotFound:  # a different error
    objects = container_view.view
    old_objects = None
    rest_api = False  # fall back to pyvmomi

threads = []
for obj in objects:
    thread = RESTVMDetail(vsphere_client, db_vcenter, obj, old_objects, rest_api, db_vms, db_hosts,
                          db_datastores, db_networks, db_vm_disks, db_vm_os_disks, db_vm_nics, db_vm_cdroms,
                          db_vm_floppies, db_vm_scsis, db_regions, db_sites, db_environments, db_platforms,
                          db_applications, db_functions, db_costs, db_vm_snapshots, api_limiter)
    threads.append(thread)

for thread in threads:
    thread.start()

for thread in threads:
    thread.join(600)
I had to switch this to a consumer/producer implementation utilizing a queue. That allowed me to limit the number of collections that would be kicked off simultaneously.
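For reference, a minimal sketch of that kind of consumer/producer setup with a fixed set of worker threads; collect_vm, collect_all and MAX_WORKERS are hypothetical names standing in for the real collection code:

import queue
import threading

MAX_WORKERS = 50                      # cap on simultaneous collections


def collect_vm(obj):
    # Placeholder for the per-VM collection (what RESTVMDetail does for one object).
    pass


def consumer(task_q):
    while True:
        obj = task_q.get()
        if obj is None:               # poison pill: shut this worker down
            task_q.task_done()
            break
        try:
            collect_vm(obj)
        finally:
            task_q.task_done()


def collect_all(objects):
    task_q = queue.Queue()
    workers = [threading.Thread(target=consumer, args=(task_q,))
               for _ in range(MAX_WORKERS)]
    for w in workers:
        w.start()
    for obj in objects:               # producer: enqueue every VM
        task_q.put(obj)
    task_q.join()                     # block until every collection has finished
    for _ in workers:
        task_q.put(None)              # one pill per worker
    for w in workers:
        w.join()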
I'm building a web application for processing ~60,000 (and growing) large files, performing some analysis and returning a "best guess" that needs to be verified by a user. The files will be filtered by category to avoid loading every file, but I'm still left with a scenario where I might have to process 1000+ files at a time.
These are large files that can take 8-9 seconds each to process, and in a 1000+ file situation it is impractical to have a user wait 8 seconds between reviews or 2+ hours while the files are processed beforehand.
To overcome this, I've decided to use multiprocessing to spawn several workers, each of which will pick from a queue of files, process them and insert into an output queue. I have another method that basically polls the output queue for items and then streams them to the client when one becomes available.
We're using gevent with uWSGI and Django in our environment, and I'm aware that creating child processes via multiprocessing in the context of gevent yields an undesired event-loop state in the child: greenlets spawned before forking are duplicated in the child. Therefore, I've decided to use lets to assist in handling the child processes.
This all works beautifully while uninterrupted. However, if a user switches categories while files are still being processed, I close the SSE connection in the browser and open another, causing a new set of child processes to spawn and killing the existing processes (or attempting to). This causes one of two things to happen.
When I yield out the results to the client, I get an IOError from uWSGI because the connection has closed. I wrapped the entire function in a try...finally and kill all the workers before the function exits.
I can either block while the processes are killed or do it in the background. Each method has different consequences. When trying to kill without blocking, the original processes are never killed, the new processes stop yielding, and any request (from any page) to the server hangs until I manually kill all uWSGI processes.
When blocking, uWSGI reports a segmentation fault, the main worker is killed and restarted, killing all child processes - new and old.
An example of the JavaScript used to open/close the connection:
var state = {};

function analyze(){
    // If a connection exists, close it.
    if (state.evtSource) {
        state.evtSource.close();
    }

    // Create a new connection to the server.
    evtSource = state.evtSource = new EventSource('?myarg=myval');

    evtSource.onmessage = function(message){
        // do stuff
    };
}
Server-side example code:
from item import Item

import lets
import multiprocessing
import time

MAX_WORKERS = 10


# Worker is outside of ProcessFiles because ``lets``
# pickles the target.
def worker(item):
    return item.process()


class ProcessFiles(object):

    def __init__(self):
        self.input_queue = multiprocessing.Queue()
        self.output_queue = multiprocessing.Queue()
        self.file_count = 0
        self.pool = lets.ProcessPool(MAX_WORKERS)

    def query_for_results(self):
        # Query db for records of files to process.
        # Return results and set self.file_count equal to
        # the number of records returned.
        pass

    def start(self):
        # Queue up files to process.
        items = []
        for result in self.query_for_results():
            item = Item(result)
            items.append(item)
            self.input_queue.put(item)

        # Spawn up to MAX_WORKERS child processes to analyze
        # all of the items in the input queue. Append each processed
        # file to the output queue via the callback.
        for item in items:
            self.pool.apply_async(worker, args=(item,), callback=self.callback)

        # Poll for items to send to the client.
        return self.get_processed_items()

    def callback(self, processed):
        self.output_queue.put(processed)

    def get_processed_items(self):
        # Wait for the output queue to hold at least 1 item.
        # When an item becomes available, yield it to the client.
        try:
            count = 0
            while count != self.file_count:
                try:
                    item = self.output_queue.get(timeout=1)
                except:
                    # Queue is empty. Wait and retry.
                    time.sleep(1)
                    continue
                count += 1
                yield item
            yield 'end'
        finally:
            # Kill all child processes.
            self.pool.kill(block=True)     # <- Causes segfault.
            # self.pool.kill(block=False)  # <- Silently fails.
This only happens when a user makes a selection, and while processing those files, makes another selection, effectively closing the current connection and creating a new one, creating two different sets of child processes.
Why does blocking cause a segmentation fault? Why is the behavior different when blocking vs not blocking? What can I do to kill all the original processes?
I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server side of the code looks like this (using multiprocessing.dummy):
from multiprocessing.dummy import Pool   # thread-based Pool

worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()    # Close pool
pool.join()     # Wait for all tasks to finish
errors = not all(obj.successful() for obj in result_objs)
# extract results only from successful tasks
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to later inspect each task and extract its result only if the task didn't raise any exception.
I understood that in order to show a progress bar on client side, I need to publish somewhere the number of completed tasks so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I have seen similar questions on SO, like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this lets you raise the resolution of your progress report and report more to the user.
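A minimal sketch of those two adjustments; the message fields and function names are just examples, not a fixed API:

import multiprocessing
import queue
from collections import namedtuple

# A structured progress message instead of a bare string; the fields are just an example.
Progress = namedtuple('Progress', ['task_id', 'step', 'total_steps'])


def do_work(task_id, progress_q):
    # Worker side: report with put_nowait so reporting never blocks the real work.
    total_steps = 3
    for step in range(1, total_steps + 1):
        # ... one chunk of the actual work would go here ...
        try:
            progress_q.put_nowait(Progress(task_id, step, total_steps))
        except queue.Full:
            pass                      # skip the report rather than wait


def drain_progress(progress_q, state):
    # Main process side: pull pending messages and update whatever view_progress reads.
    while True:
        try:
            msg = progress_q.get_nowait()
        except queue.Empty:
            return state
        state[msg.task_id] = (msg.step, msg.total_steps)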
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized into a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep the client session data on the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None


@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')


@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.do_work, pass the lock object and the value to it:
worker.do_work(lock, processes)
In your worker.do_work code, decrease "processes" by one when the work is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards
I am trying to create a class that can run a separate process to do some work that takes a long time, launch a bunch of these from a main module, and then wait for them all to finish. I want to launch the processes once and then keep feeding them things to do rather than creating and destroying processes. For example, maybe I have 10 servers running the dd command and then I want them all to scp a file, etc.
My ultimate goal is to create a class for each system that keeps track of the information for the system it is tied to, like IP address, logs, runtime, etc. But that class must be able to launch a system command and then return execution to the caller while that system command runs, to follow up with the result of the system command later.
My attempt is failing because I cannot send an instance method of a class over the pipe to the subprocess via pickle; instance methods are not picklable. I therefore tried to fix it in various ways but I can't figure it out. How can my code be patched to do this? What good is multiprocessing if you can't send over anything useful?
Is there any good documentation of multiprocessing being used with class instances? The only way I can get the multiprocessing module to work is on simple functions. Every attempt to use it within a class instance has failed. Maybe I should pass events instead? I don't understand how to do that yet.
import multiprocessing
import sys
import re


class ProcessWorker(multiprocessing.Process):
    """
    This class runs as a separate process to execute the worker's commands in parallel.
    Once launched, it remains running, monitoring the task queue, until "None" is sent.
    """

    def __init__(self, task_q, result_q):
        multiprocessing.Process.__init__(self)
        self.task_q = task_q
        self.result_q = result_q
        return

    def run(self):
        """
        Overloaded function provided by multiprocessing.Process. Called upon start() signal.
        """
        proc_name = self.name
        print '%s: Launched' % (proc_name)
        while True:
            next_task_list = self.task_q.get()
            if next_task_list is None:
                # Poison pill means shutdown
                print '%s: Exiting' % (proc_name)
                self.task_q.task_done()
                break
            next_task = next_task_list[0]
            print '%s: %s' % (proc_name, next_task)
            args = next_task_list[1]
            kwargs = next_task_list[2]
            answer = next_task(*args, **kwargs)
            self.task_q.task_done()
            self.result_q.put(answer)
        return
# End of ProcessWorker class


class Worker(object):
    """
    Launches a child process to run commands from derived classes in separate processes,
    which sit and listen for something to do.
    This base class is called by each derived worker.
    """

    def __init__(self, config, index=None):
        self.config = config
        self.index = index

        # Launch the ProcessWorker for anything that has an index value
        if self.index is not None:
            self.task_q = multiprocessing.JoinableQueue()
            self.result_q = multiprocessing.Queue()

            self.process_worker = ProcessWorker(self.task_q, self.result_q)
            self.process_worker.start()
            print "Got here"
            # Process should be running and listening for functions to execute
        return

    def enqueue_process(target):  # No self, since it is a decorator
        """
        Used to place a command target from this class object into the task_q.
        NOTE: Any function decorated with this must use fetch_results() to get the
        target task's result value.
        """
        def wrapper(self, *args, **kwargs):
            self.task_q.put([target, args, kwargs])  # FAIL: target is a class instance method and can't be pickled!
        return wrapper

    def fetch_results(self):
        """
        After all processes have been spawned by multiple modules, this command
        is called on each one to retrieve the results of the call.
        This blocks until the execution of the item in the queue is complete.
        """
        self.task_q.join()          # Wait for it to finish
        return self.result_q.get()  # Return the result

    @enqueue_process
    def run_long_command(self, command):
        print "%s: I am running command '%s'" % (self.name, command)
        # In here, I will launch a subprocess to run a long-running system command
        # p = Popen(command), etc
        # p.wait(), etc
        return

    def close(self):
        self.task_q.put(None)
        self.task_q.join()


if __name__ == '__main__':
    config = ["some value", "something else"]
    index = 7

    workers = []
    for i in range(5):
        worker = Worker(config, index)
        worker.run_long_command("ls /")
        workers.append(worker)

    for worker in workers:
        worker.fetch_results()

    # Do more work... (this would actually be done in a distributor in another class)

    for worker in workers:
        worker.close()
Edit: I tried to move the ProcessWorker class and the creation of the multiprocessing queues outside of the Worker class and then tried to manually pickle the worker instance. Even that doesn't work and I get the error
RuntimeError: Queue objects should only be shared between processes through inheritance
But I am only passing references to those queues into the worker instance?? I am missing something fundamental. Here is the modified code from the main section:
if __name__ == '__main__':
    config = ["some value", "something else"]
    index = 7

    workers = []
    for i in range(1):
        task_q = multiprocessing.JoinableQueue()
        result_q = multiprocessing.Queue()

        process_worker = ProcessWorker(task_q, result_q)
        worker = Worker(config, index, process_worker, task_q, result_q)
        something_to_look_at = pickle.dumps(worker)  # FAIL: Doesn't like queues??
        process_worker.start()
        worker.run_long_command("ls /")
So, the problem was that I was assuming that Python was doing some sort of magic that is somehow different from the way that C++/fork() works. I somehow thought that Python only copied the class, not the whole program into a separate process. I seriously wasted days trying to get this to work because all of the talk about pickle serialization made me think that it actually sent everything over the pipe. I knew that certain things could not be sent over the pipe, but I thought my problem was that I was not packaging things up properly.
This all could have been avoided if the Python docs gave me a 10,000 ft view of what happens when this module is used. Sure, it tells me what the methods of multiprocess module does and gives me some basic examples, but what I want to know is what is the "Theory of Operation" behind the scenes! Here is the kind of information I could have used. Please chime in if my answer is off. It will help me learn.
When you start a process using this module, the whole program is copied into another process. But since it is not the "__main__" process and my code was checking for that, it doesn't fire off yet another process infinitely. It just stops and sits out there waiting for something to do, like a zombie. Everything that was initialized in the parent at the time of calling multiprocessing.Process() is all set up and ready to go. Once you put something in the multiprocessing.Queue or shared memory, or pipe, etc. (however you are communicating), then the separate process receives it and gets to work. It can draw upon all imported modules and setup just as if it were the parent. However, once some internal state variables change in the parent or the separate process, those changes are isolated. Once the process is spawned, it now becomes your job to keep them in sync if necessary, either through a queue, pipe, shared memory, etc.
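A tiny demonstration of that isolation (assuming a fork-style start method, the default on Linux; under spawn the child re-imports the module instead of copying the parent's live state):

import multiprocessing

counter = 0    # module-level state


def child():
    # The child works with the copy it received when it was started;
    # later changes in the parent are invisible here (and vice versa).
    print('child sees counter = %d' % counter)


if __name__ == '__main__':
    counter = 5
    p = multiprocessing.Process(target=child)
    p.start()
    counter = 99                          # not seen by the already-running child
    p.join()
    print('parent sees counter = %d' % counter)   # 99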
I threw out the code and started over, but now I am only putting one extra function out in the ProcessWorker, an "execute" method that runs a command line. Pretty simple. I don't have to worry about launching and then closing a bunch of processes this way, which has caused me all kinds of instability and performance issues in the past in C++. When I switched to launching processes at the beginning and then passing messages to those waiting processes, my performance improved and it was very stable.
BTW, I looked at this link to get help, which threw me off because the example made me think that methods were being transported across the queues: http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html
The second example of the first section used "next_task()" that appeared (to me) to be executing a task received via the queue.
Instead of attempting to send a method itself (which is impractical), try sending a name of a method to execute.
Provided that each worker runs the same code, it's a matter of a simple getattr(self, task_name).
I'd pass tuples (task_name, task_args), where task_args were a dict to be directly fed to the task method:
next_task_name, next_task_args = self.task_q.get()
if next_task_name:
    task = getattr(self, next_task_name)
    answer = task(**next_task_args)
    ...
else:
    # poison pill, shut down
    break
REF: https://stackoverflow.com/a/14179779
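For illustration, a small self-contained sketch of that name-dispatch pattern; the class and method names here are hypothetical, loosely following the question's code:

import multiprocessing


class NamedTaskWorker(multiprocessing.Process):
    # Hypothetical worker that dispatches on method *names* pulled from the queue.

    def __init__(self, task_q, result_q):
        multiprocessing.Process.__init__(self)
        self.task_q = task_q
        self.result_q = result_q

    def run(self):
        while True:
            next_task_name, next_task_args = self.task_q.get()
            if next_task_name is None:            # poison pill
                self.task_q.task_done()
                break
            task = getattr(self, next_task_name)  # look the method up in the child
            self.result_q.put(task(**next_task_args))
            self.task_q.task_done()

    def run_long_command(self, command):
        # The real work (e.g. Popen + wait) would live here, in the child process.
        return 'ran %s' % command


if __name__ == '__main__':
    task_q = multiprocessing.JoinableQueue()
    result_q = multiprocessing.Queue()
    worker = NamedTaskWorker(task_q, result_q)
    worker.start()
    task_q.put(('run_long_command', {'command': 'ls /'}))  # strings and dicts pickle fine
    task_q.join()
    print(result_q.get())
    task_q.put((None, None))
    worker.join()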
The answer from Jan 6 at 6:03 by David Lynch is not factually correct when he says that he was misled by
http://www.doughellmann.com/PyMOTW/multiprocessing/communication.html.
The code and examples provided are correct and work as advertised. next_task() is executing a task received via the queue -- try and understand what the Task.__call__() method is doing.
In my case, what tripped me up was syntax errors in my implementation of run(). It seems that the sub-process will not report this and just fails silently -- leaving things stuck in weird loops! Make sure you have some kind of syntax checker running, e.g. Flymake/Pyflakes in Emacs.
Debugging via multiprocessing.log_to_stderr() helped me narrow down the problem.
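For reference, enabling that logging takes only a couple of lines in the parent process:

import logging
import multiprocessing

logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.DEBUG)   # or logging.INFO for less noise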
I have been given the task of updating a database over the network with SQLAlchemy. I have decided to use Python's threading module. Currently I am using 1 thread, aka the producer thread, to direct other threads to consume work units via a queue.
The producer thread does something like this:
def produce(self, last_id):
    unit = session.query(Request).order_by(Request.id) \
        .filter(Request.item_id == None).yield_per(50)
    self.queue.put(unit, True, Master.THREAD_TIMEOUT)
while the consumer threads do something similar to this:
def consume(self):
    unit = self.queue.get()
    request = unit
    item = Item.get_item_by_url(request)
    request.item = item
    session.add(request)
    session.flush()
and I am using sqlalchemy's scoped session:
session = scoped_session(sessionmaker(autocommit=True, autoflush=True, bind=engine))
However, I am getting the exception,
"sqlalchemy.exc.InvalidRequestError: Object FOO is already attached to session '1234' (this is '5678')"
I understand that this exception comes from the fact that the request object is created in one session (the producer session) while the consumers are using another scoped session because they belong to another thread.
My workaround is to have my producer thread pass request.id into the queue, while the consumer has to call the code below to retrieve the request object.
request = session.query(Request).filter(Request.id == request_id).first()
I do not like this solution because this involves another network call and is obviously not optimal.
Are there ways to avoid wasting the result of the producer's db call?
Is there a way to write the "produce" so that more than 1 id is passed into the queue as a work unit?
Feedback welcomed!
You need to detach your Request instance from the main thread session before you put it into the queue, then attach it to the queue processing thread session when taken from the queue again.
To detach, call .expunge() on the session, passing in the request:
session.expunge(unit)
and then when processing it in a queue thread, re-attach it by merging; set the load flag to False to prevent a round-trip to the database again:
session.merge(request, load=False)
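Put together, and batching several requests per work unit as the question asks, the producer and consumer might look roughly like this (session, Request, Item and Master.THREAD_TIMEOUT as in the question; the batch size is arbitrary):

def produce(self, last_id):
    units = session.query(Request).order_by(Request.id) \
        .filter(Request.item_id == None).yield_per(50)
    batch = []
    for request in units:
        session.expunge(request)            # detach from the producer's session
        batch.append(request)
        if len(batch) == 50:                # several requests per work unit
            self.queue.put(batch, True, Master.THREAD_TIMEOUT)
            batch = []
    if batch:
        self.queue.put(batch, True, Master.THREAD_TIMEOUT)


def consume(self):
    batch = self.queue.get()
    for request in batch:
        request = session.merge(request, load=False)   # attach to this thread's session
        request.item = Item.get_item_by_url(request)
        session.add(request)
    session.flush()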