I have separate processes working on redis keys, one of them is very fast at updating, the second can take a bit to happen, but still needs to happen.
I'm looking to implement a real lock in redis, not just a watch that fails the following actions if the key is modified.
Is there a way to check for a watch on a key or something like that?
Related
When I use dictionary.get() function, is it locking the whole dictionary? I am developing a multiprocess and multithreading program. The dictionary is used to act as a state table to keep track of data. I have to impose a size limit to the dictionary, so whenever the limit is being hit, I have to do garbage collection on the table, based on the timestamp. The current implementation will delay adding operation while garbage collection is iterating through the whole table.
I will have 2 or more threads, one just to add data and one just to do garbage collection. Performance is critical in my program to handle streaming data. My program is receiving streaming data, and whenever it receives a message, it has to look for it in the state table, then add the record if it's non-existent in the first place, or copy certain information and then send it along the pipe.
I have thought of using multiprocessing to do the search and adding operation concurrently, but if I used processes, I have to make a copy of state table for each process, in that case, the performance overhead for synchronization is too high. And I also read that multiprocessing.manager.dict() is also locking the access for each CRUD operation. I could not spare the overhead for it so my current approach is using threading.
So my question is while one thread is doing .get(), del dict['key'] operation on the table, will the other insertion thread be blocked from accessing it?
Note: I have read through most SO's python dictionary related posts, but I cannot seem to find the answer. Most people only answer that even though python dictionary operations are atomic, it is safer to use a Lock for insertion/update. I'm handling a huge amount of streaming data so Locking every time is not ideal for me. Please advise if there is a better approach.
If the process of hashing or comparing the keys in your dictionary could invoke arbitrary Python code (basically, if the keys aren't all Python built-in types implemented in C, e.g. str, int, float, etc.), then yes, it would be possible for a race condition to occur in which the GIL is released while a bucket collision is being resolved (during the equality test), and another thread could leap in and cause the object being compared to to disappear from the dict. They try to ensure it doesn't actually crash the interpreter, but it has been a source of errors in the past.
If that's a possibility (or you're on a non-CPython interpreter, where there is no GIL providing basic guarantees like this), then you should really use a lock to coordinate access. On CPython, as long as you're on modern Python 3, the cost will be fairly low; contention on the lock should be fairly low since the GIL ensures only one thread is actually running at once; most of the time your lock should be uncontended (because the contention is on the GIL), so the incremental cost of using it should be fairly small.
A note: You might consider using collections.OrderedDict to simplify the process of limiting the size of your table. With OrderedDict, you can implement the size limit as a strict LRU (least-recently used) system by making additions to the table done as:
with lock:
try:
try:
odict.move_to_end(key) # If key already existed, make sure it's "renewed"
finally:
odict[key] = value # set new value whether or not key already existed
except KeyError:
# move_to_end raising key error means newly added key, so we might
# have grown larger than limit
if len(odict) > maxsize:
odict.popitem(False) # Pops oldest item
and usage done as:
with lock:
# move_to_end optional; if using key means it should live longer, then do it
# if only setting key should refresh it, omit move_to_end
odict.move_to_end(key)
return odict[key]
This does need a lock, but it also reduces the work for garbage collection when it grows too large from "check every key" (O(n) work) to "pop the oldest item off without looking at anything else" (O(1) work).
A lock is used to avoid race conditions so no two threads could make change to the dict at the same time so it is advisible that you use the lock else you might go into a race condition causing program to fail. A mutex lock can be used to deal with 2 threads.
The problem is still on the drawing board so far, so I can go for another better suited approach. The situation is like this:
We create a queue of n-processes, each of which execute independently of the other tasks in the queue itself. They do not share any resources etc. However, we noticed that sometimes (depending on queue parameters) a process k's behaviour might depend on existence of a flag specific to k+1 process. This flag is to be set in a DynamoDB table, and therefore; the execution could fails.
What I am currently searching around for is a method so that I can set some sort of waiters/suspenders in my tasks/workers so that they poll until the flag is set in the DynamoDB table, and meanwhile let the other subprocess take up the CPU.
The setting of this boolean value is done a little early in the processes themselves. The dependent part of the process comes much later.
So, we went ahead with creating n number of processes instead of having a suspender. This would not be the ideal approach, but for the time being; it solves the issue at hand.
I'd still love a better method to achieve the same.
I would like to run jobs, but as they may be long, I would like to know how far they have been processed during their execution. That is, the executor would regularly return its progress, without ending the job it is executing.
I have tried to do this with APScheduler, but it seems the scheduler can only receive event messages like EVENT_JOB_EXECUTED or EVENT_JOB_ERROR.
Is it possible to get information from an executor while it is executing a job?
Thanks in advance!
There is, I think, no particular support for this within APScheduler. This requirement has come up for me many times, and the best solution will depend on exactly what you need. Some possibilities:
Job status dictionary
The simplest solution would be to use a plain python dictionary. Make the key the job's key, and the value whatever status information you require. This solution works best if you only have one copy of each job running concurrently (max_instances=1), of course. If you need some structure to your status information, I'm a fan of namedtuples for this. Then, you either keep the dictionary as an evil global variable or pass it into each job function.
There are some drawbacks, though. The status information will stay in the dictionary forever, unless you delete it. If you delete it at the end of the job, you don't get to read a 'job complete' status, and otherwise you have to make sure that whatever is monitoring the status definitely checks and clears every job. This of course isn't a big deal if you have a reasonable sized set of jobs/keys.
Custom dict
If you need some extra functions, you can do as above, but subclass dict (or UserDict or MutableMapping, depending on what you want).
Memcached
If you've got a memcached server you can use, storing the status reports in memcached works great, since they can expire automatically and they should be globally accessible to your application. One probably-minor drawback is that the status information could be evicted from the memcached server if it runs out of memory, so you can't guarantee that the information will be available.
A more major drawback is that this does require you to have a memcached server available. If you might or might not have one available, you can use dogpile.cache and choose the backend that's appropriate at the time.
Something else
Pieter's comment about using a callback function is worth taking note of. If you know what kind of status information you'll need, but you're not sure how you'll end up storing or using it, passing a wrapper to your jobs will make it easy to use a different backend later.
As always, though, be wary of over-engineering your solution. If all you want is a report that says "20/133 items processed", a simple dictionary is probably enough.
I got a lot scripts running: scrappers, checkers, cleaners, etc. They have some things in common:
they are forever running;
they have no time constrain to finish their job;
they all access the same MYSQL DB, writting and reading.
Accumulating them, it's starting to slow down the website, which runs on the same system, but depends on these scripts.
I can use queues with Kombu to inline all writtings.
But do you know a way to make the same with reading ?
E.G: if one script need to read from the DB, his request is sent to a blocking queue, et it resumes when it got the answer ? This way everybody is making request to one process, and the process is the only one talking to the DB, making one request at the time.
I have no idea how to do this.
Of course, in the end I may have to add more servers to the mix, but before that, is there something I can do at the software level ?
You could use a connection pooler and make the connections from the scripts go through it. It would limit the number of real connections hitting your DB while being transparent to your scripts (their connections would be held in a "wait" state until a real connections is freed).
I don't know what DB you use, but for Postgres I'm using PGBouncer for similiar reasons, see http://pgfoundry.org/projects/pgbouncer/
You say that your dataset is <1GB, the problem is CPU bound.
Now start analyzing what is eating CPU cycles:
Which queries are really slow and executed often. MySQL can log those queries.
What about the slow queries? Can they be accelerated by using an index?
Are there unused indices? Drop them!
Nothing helps? Can you solve it by denormalizing/precomputing stuff?
You could create a function that each process must call in order to talk to the DB. You could re-write the scripts so that they must call that function rather than talk directly to the DB. Within that function, you could have a scope-based lock so that only one process would be talking to the DB at a time.
I was wondering if it would be a good idea to use callLater in Twisted to keep track of auction endings. It would be a callLater on the order of 100,000's of seconds, though does that matter? Seems like it would be very convenient. But then again it seems like a horrible idea if the server crashes.
Keeping a database of when all the auctions are ending seems like the most secure solution, but checking the whole database each second to see if any auction has ended seems very expensive.
If the server crashes, maybe the server can recreate all the callLater's from database entries of auction end times. Are there other potential concerns for such a model?
One of the Divmod projects, Axiom, might be applicable here. Axiom is an object database. One of its unexpected, useful features is a persistent scheduling system.
You schedule events using APIs provided by the database. When the events come due, a callback you specified is called. The events persist across process restarts, since they're represented as database objects. Large numbers of scheduled events are supported, by only doing work to keep track when the next event is going to happen.
The canonical Divmod site went down some time ago (sadly the company is no longer an operating concern), but the code is all available at http://launchpad.net/divmod.org and the documentation is being slowly rehosted at http://divmod.readthedocs.org/.