Edit: I am closing this question.
As it turns out, my goal of having parallel HTTP posts is pointless. After implementing it successfully with aiohttp, I run into deadlocks elsewhere in the pipeline.
I will reformulate this and post a single question in a few days.
Problem
I want to have a class that, during some other computation, holds generated data and can write it to a DB via HTTP (details below) when convenient. It's gotta be a class as it is also used to load/represent/manipulate data.
I have written a naive, nonconcurrent implementation that works:
The class is initialized and then used in a "main loop". Data is added to it in this main loop to a naive "Queue" (a list of HTTP requests). At certain intervals in the main loop, the class calls a function to write those requests and clear the "queue".
As you can expect, this is IO bound. Whenever I need to write the "queue", the main loop halts. Furthermore, since the main computation runs on a GPU, the loop is also not really CPU bound.
Essentially, I want to have a queue, and, say, ten workers running in the background and pushing items to the http connector, waiting for the push to finish and then taking on the next (or just waiting for the next write call, not a big deal). In the meantime, my main loop runs and adds to the queue.
Program example
My naive program looks something like this
class data_storage(...):
def add(...):
def write_queue(self):
if len(self.queue) > 0:
res = self.connector.run(self.queue)
self.queue = []
def main_loop(storage):
# do many things
for batch in dataset: #simplified example
# Do stuff
for some_other_loop:
(...)
storage.add(results)
# For example, call each iteration
storage.write_queue()
if __name__ == "__main__":
storage=data_storage()
main_loop(storage)
...
In detail: the connector class is from the package 'neo4j-connector' to post to my Neo4j database. It essentially does JSON formatting and uses the "requests" api from python.
This works, even without a real queue, since nothing is concurrent.
Now I have to make it work concurrently.
From my research, I have seen that ideally I would want a "producer-consumer" pattern, where both are initialized via asyncio. I have only seen this implemented via functions, not classes, so I don't know how to approach this. With functions, my main loop should be a producer coroutine and my write function becomes the consumer. Both are initiated as tasks on the queue and then gathered, where I'd initialize only one producer but many consumers.
My issue is that the main loop includes parts that are already parallel (e.g. PyTorch). Asyncio is not thread safe, so I don't think I can just wrap everything in async decorators and make a co-routine. This is also precisely why I want the DB logic in a separate class.
I also don't actually want or need the main loop to run "concurrently" on the same thread with the workers. But it's fine if that's the outcome as the workers don't do much on the CPU. But technically speaking, I want multi-threading? I have no idea.
My only other option would be to write into the queue until it is "full", halt the loop and then use multiple threads to dump it to the DB. Still, this would be much slower than doing it while the main loop is running. My gain would be minimal, just concurrency while working through the queue. I'd settle for it if need be.
However, from a stackoverflow post, I came up with this small change
class data_storage(...):
def add(...):
def background(f):
def wrapped(*args, **kwargs):
return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
return wrapped
#background
def write_queue(self):
if len(self.queue) > 0:
res = self.connector.run(self.queue)
self.queue = []
Shockingly this sort of "works" and is blazingly fast. Of course since it's not a real queue, things get overwritten. Furthermore, this overwhelms or deadlocks the HTTP API and in general produces a load of errors.
But since this - in principle - works, I wonder if I could do is the following:
class data_storage(...):
def add(...):
def background(f):
def wrapped(*args, **kwargs):
return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
return wrapped
#background
def post(self, items):
if len(items) > 0:
self.nr_workers.increase()
res = self.connector.run(items)
self.nr_workers.decrease()
def write_queue(self):
if self.nr_workers < 10:
items=self.queue.get(200) # Extract and delete from queue, non-concurrent
self.post(items) # Add "Worker"
for some hypothetical queue and nr_workers objects. Then at the end of the main loop, have a function that blocks progress until number of workers is zero and clears, non-concurrently, the rest of the queue.
This seems like a monumentally bad idea, but I don't know how else to implement this. If this is terrible, I'd like to know before I start doing more work on this. Do you think it would it work?
Otherwise, could you give me any pointers as how to approach this situation correctly?
Some key words, tools or things to research would of course be enough.
Thank you!
Related
Let's say I add 100 push tasks (as group 1) to my tasks-queue. Then I add another 200 tasks (as group 2) to the same queue. How can I understand if all tasks of group 1 are finished?
Looks like QueueStatistics will not help here. tag works only with pull queues.
And I can not have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in datastore like #mgilson said and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decriment function to complement the increment one.
import random
import time
from google.appengine.ext import deferred
def done_work():
logging.info('work done!')
def worker(callback=None):
def fst(f):
def snd(*args, **kwargs):
key = kwargs['shard_key']
del kwargs['shard_key']
retval = f(*args, **kwargs)
decriment(key)
if get_count(key) == 0:
callback()
return retval
return snd
return fst
def func(n):
# do some work
time.sleep(random.randint(1, 10) / 10.0)
logging.info('task #{:d}'.format(n))
def make_some_tasks():
func = worker(callback=done_work)(func)
key = random.randint(0, 1000)
for n in xrange(0, 100):
increment(key)
deferred.defer(func, n, shard_key=key)
Tasks are not guaranteed to run only once, occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this using a counter incremented at task enqueueing and decremented at task completion wouldn't work - it would be decremented twice in such a duplicate execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.
I want to iterate over a list using 2 thread. One from leading and other from trailing, and put the elements in a Queue on each iteration. But before putting the value in Queue I need to check for existence of the value within Queue (its when that one of the threads has putted that value in Queue), So when this happens I need to stop the thread and return list of traversed values for each thread.
This is what I have tried so far :
from Queue import Queue
from threading import Thread, Event
class ThreadWithReturnValue(Thread):
def __init__(self, group=None, target=None, name=None,
args=(), kwargs={}, Verbose=None):
Thread.__init__(self, group, target, name, args, kwargs, Verbose)
self._return = None
def run(self):
if self._Thread__target is not None:
self._return = self._Thread__target(*self._Thread__args,
**self._Thread__kwargs)
def join(self):
Thread.join(self)
return self._return
main_path = Queue()
def is_in_queue(x, q):
with q.mutex:
return x in q.queue
def a(main_path,g,l=[]):
for i in g:
l.append(i)
print 'a'
if is_in_queue(i,main_path):
return l
main_path.put(i)
def b(main_path,g,l=[]):
for i in g:
l.append(i)
print 'b'
if is_in_queue(i,main_path):
return l
main_path.put(i)
g=['a','b','c','d','e','f','g','h','i','j','k','l']
t1 = ThreadWithReturnValue(target=a, args=(main_path,g))
t2 = ThreadWithReturnValue(target=b, args=(main_path,g[::-1]))
t2.start()
t1.start()
# Wait for all produced items to be consumed
print main_path.join()
I used ThreadWithReturnValue that will create a custom thread that returns the value.
And for membership checking I used the following function :
def is_in_queue(x, q):
with q.mutex:
return x in q.queue
Now if I first start the t1 and then the t2 I will get 12 a then one b then it doesn't do any thing and I need to terminate the python manually!
But if I first run the t2 then t1 I will get the following result:
b
b
b
b
ab
ab
b
b
b
b
a
a
So my questions is that why python treads different in this cases? and how can I terminate the threads and make them communicate with each other?
Before we get into bigger problems, you're not using Queue.join right.
The whole point of this function is that a producer who adds a bunch of items to a queue can wait until the consumer or consumers have finished working on all of those items. This works by having the consumer call task_done after they finish working on each item that they pulled off with get. Once there have been as many task_done calls as put calls, the queue is done. You're not doing a get anywhere, much less a task_done, so there's no way the queue can ever be finished. So, that's why you block forever after the two threads finish.
The first problem here is that your threads are doing almost no work outside of the actual synchronization. If the only thing they do is fight over a queue, only one of them is going to be able to run at a time.
Of course that's common in toy problems, but you have to think through your real problem:
If you're doing a lot of I/O work (listening on sockets, waiting for user input, etc.), threads work great.
If you're doing a lot of CPU work (calculating primes), threads don't work in Python because of the GIL, but processes do.
If you're actually primarily dealing with synchronizing separate tasks, neither one is going to work well (and processes will be worse). It may still be simpler to think in terms of threads, but it'll be the slowest way to do things. You may want to look into coroutines; Greg Ewing has a great demonstration of how to use yield from to use coroutines to build things like schedulers or many-actor simulations.
Next, as I alluded to in your previous question, making threads (or processes) work efficiently with shared state requires holding locks for as short a time as possible.
So, if you have to search a whole queue under a lock, that had better be a constant-time search, not a linear-time search. That's why I suggested using something like an OrderedSet recipe rather than a list, like the one inside the stdlib's Queue.Queue. Then this function:
def is_in_queue(x, q):
with q.mutex:
return x in q.queue
… is only blocking the queue for a tiny fraction of a second—just long enough to look up a hash value in a table, instead of long enough to compare every element in the queue against x.
Finally, I tried to explain about race conditions on your other question, but let me try again.
You need a lock around every complete "transaction" in your code, not just around the individual operations.
For example, if you do this:
with queue locked:
see if x is in the queue
if x was not in the queue:
with queue locked:
add x to the queue
… then it's always possible that x was not in the queue when you checked, but in the time between when you unlocked it and relocked it, someone added it. This is exactly why it's possible for both threads to stop early.
To fix this, you need to put a lock around the whole thing:
with queue locked:
if x is not in the queue:
add x to the queue
Of course this goes directly against what I said before about locking the queue for as short a time as possible. Really, that's what makes multithreading hard in a nutshell. It's easy to write safe code that just locks everything for as long as might conceivably be necessary, but then your code ends up only using a single core, while all the other threads are blocked waiting for the lock. And it's easy to write fast code that just locks everything as briefly as possible, but then it's unsafe and you get garbage values or even crashes all over the place. Figuring out what needs to be a transaction, and how to minimize the work inside those transactions, and how to deal with the multiple locks you'll probably need to make that work without deadlocking them… that's not so easy.
A couple of things that I think can be improved:
Due to the GIL, you might want to use the multiprocessing (rather than threading) module. In general, CPython threading will not cause CPU intensive work to speed up. (Depending on what exactly is the context of your question, it's also possible that multiprocessing won't, but threading almost certainly won't.)
A function like your is_inqueue would likely lead to high contention.
The locked time seems linear in the number of items that need to be traversed:
def is_in_queue(x, q):
with q.mutex:
return x in q.queue
So, instead, you could possibly do the following.
Use multiprocessing with a shared dict:
from multiprocessing import Process, Manager
manager = Manager()
d = manager.dict()
# Fn definitions and such
p1 = Process(target=p1, args=(d,))
p2 = Process(target=p2, args=(d,))
within each function, check for the item like this:
def p1(d):
# Stuff
if 'foo' in d:
return
I'm slowly but surely getting the hang of twisted, but I'm not sure how I should be approaching this particular project.
I'm trying to create a class for batch processing of web pages. There are multiple web pages that I would like to process independently, so it makes sense to have a pipeline of sorts for each url. Additionally, I would like to call a one-time preprocessing function before any of the urls are processed, and when all urls have been processed, I would like to call a post-processing function. Importantly, I'd like to be able to subclass this processing class and override certain methods based on the contents I'm trying to process -- not all web-pages will require the same processing steps.
If this were synchronous code, I would probably do this with a context manager. Consider the following example code:
class Pipeline(object):
def __init__(self, urls):
self.urls = urls # iterable
self.continue = False
def __enter__(self):
self.continue = self.preprocess()
def __exit__(self, type, value, traceback):
if self.continue: # if we decided to run the batch pipeline...
self.postprocess()
def preprocess(self):
# does some stuff and returns a bool
def postprocess(self):
# do some stuff
def pipeline(self):
for url in self.urls:
try:
# download url, do some stuff
except:
# recover so that other urls are not interrupted
After which, I would use it as follows:
with Pipeline(list_of_urls) as p:
p.pipeline()
This works well with synchronous network operations, but not with Twisted, since the pipeline function will return before the end of the processing pipeline, thus calling __exit__.
Additionally, I would like each URL's processing to take place completely separately because there may be conditional branching based on the result of my queries. For this reason, using Twisted's DeferredList is not desirable.
In a nutshell, I need the following:
Preprocess must run before all else
Postprocess must run when the following is true:
At least one url began processing (preprocess returns True)
All urls have either completed or thrown an exception
What's the sanest way to set something like this up with Twisted? The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Any advice?
In light of the comments for the original question, I would suggest using DeferredList in combination with maybeDeferred. I propose the latter because of this:
The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Using a maybeDeferred allows you to treat all of your function calls as though they were asynchronous, whether they actually are or not.
Following structure helps me in close condition:
def Processor:
def __init__()
self.runned = 0
self.hasSuccess = True
self.preprocess()
def launch(self, urls):
for url in urls:
dfrd = url.process()
dfrd.addCallback(self.succ)
dfrd.addErrback(self.fail)
self.runned += 1
def fail(self, reason):
self.runned -= 1
if self.runned == 0 and self.hasSuccess:
self.postprocess()
def succ(self, arg):
self.runned -= 1
self.hasSuccess = True
if not self.runned
self.postprocess()
I have a very simple script that monitors a file transfer progress, comparing its actual size with the target then calculating its hash, comparing with the desired hash and firing up a few extra things when everything seems alright.
I've replaced the tool used for the file transfers (wget) with deluged, which has a neat api to integrate with.
Instead of comparing the file progress and compare the hashes, I only need to know now when deluged finished downloading the files. To achieve that, I was able to modify this script to my needs, but I'm stuck trying to wrap my head around twisted framework, that deluged makes use of.
To try getting over it, I grabbed one sample script from twisted deferred documentation, wrapped a class around it and attempted to use the same concept I'm using on this script I mentioned.
Now, I don't know exactly what to do with the reactor object, since it's basically a blocking loop that can't be restarted.
This is my sample code I'm working with:
from twisted.internet import reactor, defer
import time
class DummyDataGetter:
done = False
result = 0
def getDummyData(self, x):
d = defer.Deferred()
# simulate a delayed result by asking the reactor to fire the
# Deferred in 2 seconds time with the result x * 3
reactor.callLater(2, d.callback, x * 3)
return d
def assignResult(self, d):
"""
Data handling function to be added as a callback: handles the
data by printing the result
"""
self.result = d
self.done = True
reactor.stop()
def run(self):
d = self.getDummyData(3)
d.addCallback(self.assignResult)
reactor.run()
getter = DummyDataGetter()
getter.run()
while not getter.done:
time.sleep(0.5)
print getter.result
# then somewhere else I want to get dummy data again
getter = DummyDataGetter()
getter.run() #this throws an exception of type error.ReactorNotRestartable
while not getter.done:
time.sleep(0.5)
print getter.result
My questions are:
Should reactor be fired in another thread to prevent it blocking the code?
If so, how would I add more callbacks to this reactor living in a separate thread? Simply by doing something similar to reactor.callLater(2, d.callback, x * 3), from my main thread?
If not, what is the technique to overcome this problem of not being able to starting/stopping reactor twice or more on the same process?
OK, easiest approach I found to this is to simply have a separate script called using subprocess.Popen, dump the statuses of the torrents and anything else needed into the stdout (serialized using JSON) and pipe that into the calling script.
Way less traumatic than learning twisted, but of course far away from optimal.
I'm writing a threaded program in Python. This program is interrupted very frequently, by user (CRTL+C) interaction, and by other programs sending various signals, all of which should stop thread operation in various ways. The thread does a bunch of units of work (I call them "atoms") in sequence.
Each atom can be stopped quickly and safely, so making the thread itself stop is fairly trivial, but my question is: what is the "right", or canonical way to implement a stoppable thread, given stoppable, pseudo-atomic pieces of work to be done?
Should I poll a stop_at_next_check flag before each atom (example below)? Should I decorate each atom with something that does the flag-checking (basically the same as the example, but hidden in a decorator)? Or should I use some other technique I haven't thought of?
Example (simple stopped-flag checking):
class stoppable(Thread):
stop_at_next_check = False
current_atom = None
def __init__(self):
Thread.__init__(self)
def do_atom(self, atom):
if self.stop_at_next_check:
return False
self.current_atom = atom
self.current_atom.do_work()
return True
def run(self):
#get "work to be done" objects atom1, atom2, etc. from somewhere
if not do_atom(atom1):
return
if not do_atom(atom2):
return
#...etc
def die(self):
self.stop_at_next_check = True
self.current_atom.stop()
Flag checking seems right, but you missed an occasion to simplify it by using a list for atoms. If you put atoms in a list, you can use a single for loop without needing a do_atom() method, and the problem of where to do the check solves itself.
def run(self):
atoms = # get atoms
for atom in atoms:
if self.stop_at_next_check:
break
self.current_atom = atom
atom.do_work()
Create a "thread x should continue processing" flag, and when you're done with the thread, set the flag to false.
Killing a thread directly is considered bad form, because you might get a fractional chunk of work completed.
A tad late but I have created a small library, ants, solving this problem. In your example an atomic unit is represented by an worker
Example
from ants import worker
#worker
def hello():
print(“hello world”)
t = hello.start()
...
t.stop()
In above example hello() will run in a separate thread being called in a while True: loop thus spitting out “hello world” as fast as possible
You can also have triggering events , e.g. in above replace hello.start() with hello.start(lambda: time.sleep(5)) and you will have it trigger every 5:th second
The library is very new and work is ongoing on GitHub https://github.com/fa1k3n/ants.git
Future work includes adding a colony for having several workers working on different parts of same data, also planning on a queen for worker communication and control, like synch