Graceful caching with Python and Redis

I want to cache data from an external service in a graceful manner: the old data, even though it has expired, keeps being served until the worker successfully fetches new data. The data is not time-critical, but a lack of data (the external service being down) would prevent the service from running, hence the persistent cache.
Currently
I store the fetch timestamp in a separate Redis key
I cache the data indefinitely until the worker fetches a new copy (I do not set an expiration time)
Questions
Is this the correct way of doing graceful caching with Redis?
Can I natively get a key's last-updated timestamp from Redis, so that I do not need to store this information myself?
This is the code:
import calendar
import datetime
import pickle

def set_data(self, data):
    # Store the pickled payload and the fetch timestamp in separate keys
    self.redis.set("bitcoinaverage", pickle.dumps(data))
    self.redis.set("bitcoinaverage_updated", calendar.timegm(datetime.datetime.utcnow().utctimetuple()))

def get_data(self):
    return pickle.loads(self.redis.get("bitcoinaverage"))

def is_up_to_date(self):
    last_updated_ts = self.redis.get("bitcoinaverage_updated")
    if not last_updated_ts:
        return False
    last_updated = datetime.datetime.utcfromtimestamp(int(last_updated_ts))
    return datetime.datetime.utcnow() < last_updated + self.refresh_delay

def tick(self):
    """Run a periodical worker task to see if we need to update."""
    if not self.is_up_to_date():
        self.update()

What you have could work, but it depends on how big you expect your data set to get. If I understand correctly, your current implementation requires you to run a worker that pings every key with tick(). That would be very expensive in terms of network back-and-forth, since you would have to go over the network, query Redis, and send the results back for every single key (possibly two queries per key if you need to update). If it's just for the two keys you mention, that's fine. If it's for more, pipelines are your friend.
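For example, here is a minimal sketch of batching those timestamp checks into a single round trip with a redis-py pipeline (the key names beyond the two above are purely illustrative):

import redis

r = redis.Redis()

# Keys whose "<key>_updated" timestamps we track (illustrative list)
keys = ["bitcoinaverage", "bitstamp", "kraken"]

# One network round trip instead of one GET per key
pipe = r.pipeline()
for key in keys:
    pipe.get(key + "_updated")
timestamps = pipe.execute()

for key, ts in zip(keys, timestamps):
    if ts is None:
        print(key, "has never been fetched")
    else:
        print(key, "last updated at", int(ts))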
If you wanted to be more elegant and robust, you could use notifications when your keys expire. The paradigm here would be: for each value you would set two keys, k and k_updated. k would hold the actual value, and k_updated would just be a dumb key with a TTL set for when you want k to be updated. When k_updated expires, you'll get a notification, and you can have a listener immediately treat this as a request for a new job to update k and set a new k_updated (a sketch follows after the list below). This uses the pub/sub model, and you could have multiple subscribers and use a queue to manage the new jobs if you wanted to be very robust about it. The benefits of this system:
Keys get updated immediately when they expire, with no need to constantly query them to see if they need to be updated.
You can have multiple independent workers subscribed and listening for update tasks and managing new update jobs as they come in, so that if one update worker goes down, you don't lose updates while you bring that box back up.
This latter system could be overkill for your case if you don't have extreme speed or robustness needs, but if you do, or plan to, it's something to consider.
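A rough sketch of that listener with redis-py, assuming expired-event notifications are enabled (notify-keyspace-events must include "Ex"); the key naming mirrors the k / k_updated convention above and the refresh interval is illustrative:

import pickle
import redis

r = redis.Redis()
r.config_set("notify-keyspace-events", "Ex")  # or set this in redis.conf

REFRESH_SECONDS = 300  # illustrative refresh interval

def set_value(key, value):
    r.set(key, pickle.dumps(value))                  # k holds the real data, no TTL
    r.set(key + "_updated", 1, ex=REFRESH_SECONDS)   # k_updated is just a timer

def listen_for_expirations():
    p = r.pubsub()
    p.psubscribe("__keyevent@0__:expired")           # expired events for database 0
    for message in p.listen():
        if message["type"] != "pmessage":
            continue
        expired_key = message["data"].decode()
        if expired_key.endswith("_updated"):
            data_key = expired_key[:-len("_updated")]
            # fetch fresh data here, then call set_value(data_key, new_data)
            print("time to refresh", data_key)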

What you are doing will certainly work.
There's no way to natively get the timestamp at which a key was inserted or updated, so you will have to either store it as you are doing at present or, as an alternative, change what you store for the bitcoinaverage key so that it includes the timestamp (e.g. use some JSON to hold both the timestamp and the pickled data, and store the JSON). However, that makes your get_data() and is_up_to_date() more complex.
What happens in get_data() and is_up_to_date() if you have nothing stored for that key? Have you tested that condition?
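For illustration, a minimal sketch of that alternative, bundling the timestamp and the payload in one value and handling the empty-key case raised above (here the whole envelope is pickled rather than JSON-encoded, since pickled bytes don't embed directly in JSON):

import calendar
import datetime
import pickle

def set_data(self, data):
    envelope = {
        "updated": calendar.timegm(datetime.datetime.utcnow().utctimetuple()),
        "data": data,
    }
    self.redis.set("bitcoinaverage", pickle.dumps(envelope))

def get_data(self):
    raw = self.redis.get("bitcoinaverage")
    if raw is None:
        return None  # nothing cached yet
    return pickle.loads(raw)["data"]

def is_up_to_date(self):
    raw = self.redis.get("bitcoinaverage")
    if raw is None:
        return False
    updated = datetime.datetime.utcfromtimestamp(pickle.loads(raw)["updated"])
    return datetime.datetime.utcnow() < updated + self.refresh_delay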

Related

Mongo TTL and Change Events to return full document as cursor

I am looking to schedule a task for when a document's datetime field hits that time, and I've set that up using a TTL index. The problem is that when the delete event arrives on the change stream cursor, the original document is not returned to the program. I still need the (now deleted) document in my Python code, since it contains other properties that are important for executing the task. Is there some kind of workaround where I can get the document via a change event without deleting it, or get the deleted document without having to do a query?
There is no workaround. The tools you've chosen are not sufficient for the job.
Replace the TTL index with a regular one and replace your ChangeStream listener with a cron job to run a worker every minute.
The worker will get all expired documents, do the job, and delete the documents from the collection either one by one or in batches.
It is a more reliable, flexible and scalable approach compared to the TTL + ChangeStream combination.
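A minimal sketch of such a worker with pymongo, assuming the scheduled time lives in an expireAt field (the database, collection and field names are illustrative):

import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
tasks = client["mydb"]["scheduled_tasks"]

def run_once():
    # A regular (non-TTL) index on expireAt keeps this query cheap
    now = datetime.datetime.utcnow()
    expired = list(tasks.find({"expireAt": {"$lte": now}}))
    for doc in expired:
        do_the_job(doc)  # the full document is still available here
    if expired:
        tasks.delete_many({"_id": {"$in": [d["_id"] for d in expired]}})

def do_the_job(doc):
    print("processing", doc["_id"])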

Efficient way of updating real time location using django-channels/websocket

I am working on a real-time app which needs to update the user's location whenever it changes.
An Android app is used as the frontend; it gets the location using the Google Fused Location API, and in onLocationChanged(loc: Location) I send the latest location over the WebSocket. The location update is then received by a Django Channels consumer, whose job is to store the location in the database asynchronously (I am using the @database_sync_to_async decorator).
The problem is that the server crashes when the Android app tries to send 10-15 location updates per second. What would be an efficient way of updating the location in real time?
Note: Code can be supplied on demand
Ask yourself what kind of resolution you need for that data. Do you really need 10 updates a second? If not, take every nth update, or see if Android will just give you the updates more slowly. Secondly, look for a native async database library: @database_sync_to_async runs in a different thread every time you call it, which kills the performance gains you're getting from the event loop. If you stay in one thread you'll keep the CPU caches fresh, although you won't get to use the ORM. But do you really need a database, or would Redis work? If so, call aioredis directly and it will be a lot faster, since it's in memory and you can use its fast data structures like queues and sets. If you need Redis to be even faster, look at its multithreaded fork, KeyDB.
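As an illustration of both suggestions, here is a rough sketch of a Channels consumer that throttles incoming updates and writes straight to Redis from the event loop, using redis.asyncio (the successor to aioredis); the consumer name, key name and interval are made up for the example:

import json
import time

import redis.asyncio as aioredis
from channels.generic.websocket import AsyncWebsocketConsumer

MIN_INTERVAL = 1.0  # seconds between stored updates per connection (illustrative)

class LocationConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        self.redis = aioredis.Redis()
        self.last_saved = 0.0
        await self.accept()

    async def receive(self, text_data=None, bytes_data=None):
        now = time.monotonic()
        if now - self.last_saved < MIN_INTERVAL:
            return  # drop updates that arrive faster than we need them
        self.last_saved = now
        payload = json.loads(text_data)
        # Single in-memory write on the event loop: no thread hop, no ORM
        await self.redis.hset("user_locations", payload["user_id"], text_data)

    async def disconnect(self, close_code):
        await self.redis.close()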

Get feedback from a scheduled job while it is processed

I would like to run jobs, but as they may be long, I would like to know how far along they are while they execute. That is, the executor would regularly report its progress without ending the job it is executing.
I have tried to do this with APScheduler, but it seems the scheduler can only receive event messages like EVENT_JOB_EXECUTED or EVENT_JOB_ERROR.
Is it possible to get information from an executor while it is executing a job?
Thanks in advance!
There is, I think, no particular support for this within APScheduler. This requirement has come up for me many times, and the best solution will depend on exactly what you need. Some possibilities:
Job status dictionary
The simplest solution would be to use a plain python dictionary. Make the key the job's key, and the value whatever status information you require. This solution works best if you only have one copy of each job running concurrently (max_instances=1), of course. If you need some structure to your status information, I'm a fan of namedtuples for this. Then, you either keep the dictionary as an evil global variable or pass it into each job function.
There are some drawbacks, though. The status information will stay in the dictionary forever unless you delete it. If you delete it at the end of the job, you don't get to read a 'job complete' status, and otherwise you have to make sure that whatever is monitoring the status definitely checks and clears every job. This of course isn't a big deal if you have a reasonably sized set of jobs/keys.
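A minimal sketch of that approach with APScheduler, assuming one instance of each job at a time; the job, its id and the status fields are illustrative:

from collections import namedtuple

from apscheduler.schedulers.background import BackgroundScheduler

JobStatus = namedtuple("JobStatus", ["done", "total"])
status = {}  # job id -> JobStatus, shared between the jobs and whatever monitors them

def process_items(job_id, items):
    for i, item in enumerate(items, start=1):
        # ... do the real work for this item here ...
        status[job_id] = JobStatus(done=i, total=len(items))

scheduler = BackgroundScheduler()
scheduler.add_job(process_items, "interval", minutes=5,
                  id="import", args=["import", list(range(133))],
                  max_instances=1)
scheduler.start()

# Elsewhere, e.g. in a monitoring endpoint:
# s = status.get("import")
# if s:
#     print("%d/%d items processed" % (s.done, s.total))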
Custom dict
If you need some extra functions, you can do as above, but subclass dict (or UserDict or MutableMapping, depending on what you want).
Memcached
If you've got a memcached server you can use, storing the status reports in memcached works great, since they can expire automatically and they should be globally accessible to your application. One probably-minor drawback is that the status information could be evicted from the memcached server if it runs out of memory, so you can't guarantee that the information will be available.
A more major drawback is that this does require you to have a memcached server available. If you might or might not have one available, you can use dogpile.cache and choose the backend that's appropriate at the time.
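For example, a short sketch of storing status reports in memcached with pymemcache (one client library among several; the key scheme and expiry are made up):

import json

from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

def report_status(job_id, done, total):
    # Entries expire on their own after ten minutes, so nothing needs manual cleanup
    mc.set("jobstatus:" + job_id, json.dumps({"done": done, "total": total}), expire=600)

def read_status(job_id):
    raw = mc.get("jobstatus:" + job_id)
    return json.loads(raw) if raw else None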
Something else
Pieter's comment about using a callback function is worth taking note of. If you know what kind of status information you'll need, but you're not sure how you'll end up storing or using it, passing a wrapper to your jobs will make it easy to use a different backend later.
As always, though, be wary of over-engineering your solution. If all you want is a report that says "20/133 items processed", a simple dictionary is probably enough.

Distributed Task Queue Based on Sets as a Data Structure instead of Lists

I'm wondering if there's a way to set up RabbitMQ or Redis to work with Celery so that when I send a task to the queue, it doesn't go into a list of tasks, but rather into a Set of tasks keyed based on the payload of my task, in order to avoid duplicates.
Here's my setup for more context:
Python + Celery. I've tried RabbitMQ as a backend, now I'm using Redis as a backend because I don't need the 100% reliability, easier to use, small memory footprint, etc.
I have roughly 1000 ids that need work done repeatedly. Stage 1 of my data pipeline is triggered by a scheduler and it outputs tasks for stage 2. The tasks contain just the id for which work needs to be done and the actual data is stored in the database. I can run any combination or sequence of stage 1 and stage 2 tasks without harm.
If stage 2 doesn't have enough processing power to deal with the volume of tasks output by stage 1, my task queue grows and grows. This wouldn't have to be the case if the task queue used sets as the underlying data structure instead of lists.
Is there an off-the-shelf solution for switching from lists to sets as distributed task queues? Is Celery capable of this? I recently saw that Redis has just released an alpha version of a queue system, so that's not ready for production use just yet.
Should I architect my pipeline differently?
You can use an external data structure to store and monitor the current state of your celery queue.
Let's take a Redis key-value store as an example. Whenever you push a task into Celery, you mark a key derived from your 'id' field as true in Redis.
Before trying to push a new task with any 'id', you check whether the key for that 'id' is set in Redis; if it is, you skip pushing the task.
To clear the keys at the proper time, you can use Celery's after_return handler, which runs when the task has returned. This handler unsets the key for 'id' in Redis, clearing the lock for the next task push.
This method ensures you only have ONE instance of a task per id in the Celery queue. You can also enhance it to allow up to N tasks per id by using the INCR and DECR commands on the Redis key when the task is pushed and in after_return of the task. A sketch of the one-instance variant follows below.
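Here is a rough sketch of that locking scheme, assuming redis-py and a Celery app named app; the queue key prefix and the process_id task are illustrative:

import redis
from celery import Celery, Task

app = Celery("pipeline", broker="redis://localhost:6379/0")
r = redis.Redis()

class DedupTask(Task):
    def after_return(self, status, retval, task_id, args, kwargs, einfo):
        # Release the lock once the task has finished, whatever the outcome
        r.delete("queued:%s" % args[0])

@app.task(base=DedupTask)
def process_id(item_id):
    ...  # stage 2 work for this id goes here

def enqueue(item_id):
    # SET NX performs the "already queued?" check and the marking in one atomic step
    if r.set("queued:%s" % item_id, 1, nx=True):
        process_id.delay(item_id)
    # else: an identical task is already queued, so skip it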
Can your tasks in stage 2 check whether the work has already been done and, if it has, then not do the work again? That way, even though your task list will grow, the amount of work you need to do won't.
I haven't come across a solution regarding the sets/lists question, and I'd think there are lots of other ways of getting around this issue.
Use a sorted set within Redis for your job queue. It is indeed a set, so if you put the exact same data into it, no new value is added (it absolutely needs to be the exact same data; you can't override the hash function a Redis sorted set uses).
You will need a score to use with the sorted set; you can use a timestamp (the value as a double, using unixtime for instance), which lets you fetch the most recent or the oldest items if you want. ZRANGEBYSCORE is probably the command you will be looking for.
http://redis.io/commands/zrangebyscore
Moreover, if you need additional behaviour, you can wrap everything inside a Lua script for atomic behaviour and a custom eviction strategy if needed. For instance, a "get" script that fetches a job and removes it from the queue atomically, or evicts data if there is too much back pressure, etc.
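A small sketch of a sorted set used as a deduplicating queue with redis-py (the queue name and payloads are illustrative):

import time

import redis

r = redis.Redis()
QUEUE = "stage2_jobs"

def enqueue(item_id):
    # nx=True: only add if the member is not already present, so duplicates are no-ops
    r.zadd(QUEUE, {str(item_id): time.time()}, nx=True)

def pop_oldest(batch=10):
    # Oldest jobs first; remove what we take so another worker doesn't pick them up.
    # For strict atomicity across workers, wrap the read+remove in a Lua script or use ZPOPMIN.
    jobs = r.zrangebyscore(QUEUE, "-inf", "+inf", start=0, num=batch)
    if jobs:
        r.zrem(QUEUE, *jobs)
    return [j.decode() for j in jobs]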

Django queue function calls

I have a small problem with the nature of the data processing and Django.
For starters, I have a webpage with an advanced DHTMLX table. While adding rows to the table, DHTMLX automatically sends POST data to my Django backend, where it is processed, and the resulting XML is sent back to the webpage. All of this works just fine when adding one row at a time. But when adding several rows at once, problems start to occur. I have checked the order of the data sent to the backend, and it is correct (say rows with IDs 1, 2, 3, 4 are sent in that order). The backend processes each request as it arrives, and the requests usually arrive in the same order (despite the randomness of the Internet). But Django fires the same function for each of them immediately, and it is a complex function that takes some time to compute before sending the response. The problem is that every call changes the database, and one of the variables depends on how big the database table we are altering is. When the same table is altered in the wrong order (because threads run at different speeds), the resulting data is rubbish.
Is there any automatic way to queue calls of one specific view function, so that every call goes into the queue and waits for the previous one to complete?
I want to make such a queue for this function only.
It seems like you should build the queue in Django. If the rows need to be processed serially on the backend, then insert the change data into a queue and process the queue like an event handler.
You could build a send queue using DHTMLX's event handlers and the Ajax callback handler, but why? The network is already slow; slowing it down further is the wrong approach.
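As a minimal sketch of the server-side queue, assuming a single Django process: a module-level queue.Queue plus one worker thread serializes the order-sensitive part, while the view acknowledges immediately (all names here are made up):

import queue
import threading

from django.http import HttpResponse

_work_queue = queue.Queue()

def _worker():
    while True:
        row_data = _work_queue.get()
        try:
            process_row(row_data)  # the expensive, order-sensitive part
        finally:
            _work_queue.task_done()

threading.Thread(target=_worker, daemon=True).start()

def add_row(request):
    # The view only enqueues, so rows are processed strictly in arrival order
    _work_queue.put(request.POST.dict())
    return HttpResponse("<ok/>", content_type="text/xml")

def process_row(row_data):
    ...  # update the database here

With multiple worker processes you would need an external queue (Celery, Redis, or a database-backed task table) instead, but the idea is the same.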
