I was thinking about how to make sure that all tasks stored in Redis queues are eventually completed in case of a server shutdown, for example.
My initial thought was to create a job-description instance and save it to the database. Something like:
class JobDescription(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(255))
    queue_name = db.Column(db.String(255))
    is_started = db.Column(db.Boolean, default=False)
    is_finished = db.Column(db.Boolean, default=False)
    ...
And then update the boolean flags when necessary.
Then, on Flask/Django/FastAPI application startup, I would search for jobs that are either not started or not finished.
My question is: are there any best practices for what I described here, or better ways to restore lost jobs than saving job descriptions to the database?
You're on the right track. To restore RQ tasks after Redis has restarted, you have to record them somewhere so that you can determine which jobs need to be re-enqueued.
The JobDescription approach you're using can work fine, with the caveat that as time passes and the underlying table for JobDescription gets larger, it'll take longer to query that table unless you build a secondary index on is_finished.
You might find that using datetimes instead of booleans gives you more options. E.g.
started_at = db.Column(db.DateTime)
finished_at = db.Column(db.DateTime, index=True)
lets you answer several useful questions, such as whether there are interesting patterns in job requests over time, how completion latency varies over time, etc.
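For example, a startup hook for the re-queueing could look roughly like this. This is a minimal sketch assuming Flask-SQLAlchemy, RQ, and the datetime columns above; run_job is a hypothetical placeholder for whatever function actually performs the work:

from redis import Redis
from rq import Queue

def requeue_unfinished_jobs():
    """Re-enqueue every recorded job that never finished."""
    # finished_at IS NULL covers both "never started" and "started but interrupted".
    pending = JobDescription.query.filter(JobDescription.finished_at.is_(None)).all()
    for job in pending:
        queue = Queue(job.queue_name, connection=Redis())
        queue.enqueue(run_job, job.id)  # run_job is a placeholder for the real task function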
This is a rather specific question for advanced users of Celery. Let me explain the use case I have:
Use case
I have to run ~1k-100k tasks that will run a simulation (movie) and return the simulation data as a rather large list of smaller objects (frames), say 10k-100k per frame and 1k frames. So the total amount of data produced will be very large, but assume that I have a database that can handle this. Speed is not a key factor here. Later I need to compute features from each frame, which can be done completely independently.
The frames look like a dict that points to some numpy arrays and other simple data like strings and numbers, and each has a unique identifier (UUID).
Importantly, the final objects of interest are arbitrary joins and splits of these generated lists. As a metaphor, think of the resulting movies being chopped up and recombined into new movies. These final lists (movies) are then basically lists of references to the frames via their UUIDs.
Now, I am considering using Celery to generate these first movies, and since they will end up in the backend DB anyway, I might as well keep these results indefinitely, at least the ones I specify to keep.
My question
Can I configure a backend, preferably a NoSQL DB, in a way that keeps the results and lets me access them later, independently of Celery, using the objects' UUIDs? And if so, does that make sense given the overhead, performance, etc.?
Another possibility would be to not return anything and let the worker store the result in a DB itself. Is that preferred? It seems unnecessary to have a second channel of communication to another DB when Celery can do this already.
I am also interested in comments on using Celery in general for highly independent tasks that run long (>1h) and return large result objects. A failure is not problematic and such a task can just be restarted. The resulting movies are stochastic, so functional approaches can be problematic; even storing the random seed might not guarantee reproducible results, although I do not have side effects. I just might have lots of workers available that are widely distributed. Imagine lots of desktop machines in a closed environment where every worker helps, even if it is slow. Network speed and security are not an issue here. I know that this is not the original use case, but it seemed very easy to use it for these cases. The best analogy I found is projects like Folding@home.
Can I configure a backend, preferably a NoSQL DB, in a way that keeps the results and lets me access them later, independently of Celery, using the objects' UUIDs?
Yes, you can configure Celery to store its results in a NoSQL database such as Redis for access by UUID later. The two settings that control the behavior of interest to you are result_expires and result_backend.
result_backend specifies which database you want to store your results in (e.g., Elasticsearch or Redis), while result_expires specifies how long after a task completes its result remains available.
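A minimal configuration sketch, assuming a Redis result backend (the URL and retention period are just examples):

from datetime import timedelta

# celeryconfig.py
result_backend = 'redis://localhost:6379/1'
# Results are kept this long after completion; setting None disables expiry entirely.
result_expires = timedelta(days=30)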
After the task completes, you can access the result in Python like this:
from celery.result import AsyncResult

result = task_name.delay()
print(result.id)

uuid = result.id
checked_result = AsyncResult(uuid)
# and you can access the result output here however you'd like, e.g.
print(checked_result.result)
And if so, does that make sense given the overhead, performance, etc.?
I think this strategy makes perfect sense. I have used this approach a number of times when generating long-running reports for web users. The initial POST returns the UUID from the Celery task. The web client can then poll the app server via JavaScript, using the UUID, to see whether the task is ready/complete. Once the report is ready, the page can redirect the user to the route that lets them download or view the report by passing in the UUID.
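As a rough illustration, the polling endpoint could look like this (Flask is used here only as an example framework; the route name is made up):

from celery.result import AsyncResult
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/reports/<uuid>/status")  # route name is illustrative
def report_status(uuid):
    result = AsyncResult(uuid)  # looks the task up in the configured result backend
    return jsonify({"state": result.state, "ready": result.ready()})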
In my Django (1.9) project I need to construct a table from an expensive JOIN. I would therefore like to store the table in the DB and only redo the query if the tables involved in the JOIN change. As I need the table as a basis for later JOIN operations I definitely want to store it in my database and not in any cache.
The problem I'm facing is that I'm not sure how to determine whether the data in the tables has changed. Connecting to the post_save and post_delete signals of the respective models seems not to be right, since the models might be updated in bulk via CSV upload and I don't want the expensive query to be fired each time a new row is imported, because the DB table will change right away. My current approach is to check whether the data has changed at a certain time interval, which would be perfectly fine for me. For this purpose I use a new thread, which compares the checksums of the involved tables (see code below). As I'm not really familiar with multithreading, especially on web servers, I do not know whether this is acceptable. My questions therefore:
Is the threading approach acceptable for running this single task?
Would a Distributed Task Queue like Celery be more appropriate?
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
This is my current code:
import threading
import time

from django.apps import apps
from .models import SomeModel

def check_for_table_change():
    app_label = SomeModel._meta.app_label

    def join():
        """Join the tables and save the resulting table to the DB."""
        ...

    def get_involved_models(app_label):
        """Get all the models that are involved in the join."""
        ...

    involved_models = get_involved_models(app_label)
    involved_dbtables = tuple(model._meta.db_table for model in involved_models)
    sql = 'CHECKSUM TABLE %s' % ', '.join(involved_dbtables)
    old_checksums = None
    while True:
        # Get the result of the query as named tuples.
        checksums = from_db(sql, fetch_as='namedtuple')
        if old_checksums is not None:
            # Compare checksums.
            for pair in zip(checksums, old_checksums):
                if pair[0].Checksum != pair[1].Checksum:
                    print('db changed, table is rejoined')
                    join()
                    break
        old_checksums = checksums
        time.sleep(60)

check_tables_thread = threading.Thread(target=check_for_table_change)
check_tables_thread.start()
I'm grateful for any suggestions.
Materialized Views and PostgreSQL
If you were on PostgreSQL, you could use what is known as a materialized view. You can create a view based on your join and it exists almost like a real table. This is very different from normal views, where the query needs to be executed each and every time the view is used. Now the bad news: MySQL does not have materialized views.
If you switched to PostgreSQL, you might even find that materialized views are not needed after all. That's because PostgreSQL can use more than one index per table in a query. Thus the join that seems slow on MySQL at the moment might be made to run faster with better use of indexes on PostgreSQL. Of course this is very dependent on what your structure is like.
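For reference, creating and refreshing a materialized view from Django could look roughly like this (the view name and the SELECT are placeholders for your actual join):

from django.db import connection

def create_join_view():
    with connection.cursor() as cursor:
        # "expensive_join" and the SELECT below are placeholders for your real join.
        cursor.execute("""
            CREATE MATERIALIZED VIEW expensive_join AS
            SELECT a.id, a.name, b.value
            FROM app_table_a a
            JOIN app_table_b b ON b.a_id = a.id
        """)

def refresh_join_view():
    # Re-runs the stored query and replaces the view's contents.
    with connection.cursor() as cursor:
        cursor.execute("REFRESH MATERIALIZED VIEW expensive_join")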
Signals vs Triggers
The problem I'm facing is that I'm not sure how to determine whether the data in the tables has changed. Connecting to the post_save and post_delete signals of the respective models seems not to be right, since the models might be updated in bulk via CSV upload and I don't want the expensive query to be fired each time a new row is imported, because the DB table will change right away.
As you have rightly determined, Django signals aren't the right way. This is the sort of task that is best done at the database level. Since you don't have materialized views, this is a job for triggers. However, there's a lot of hard work involved (whether you use triggers or signals).
Is the threading approach acceptable for running this single task?
Why not use Django as a CLI here? That effectively means a Django script is invoked by cron or executed by some other mechanism, independently of your website.
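A sketch of what that could look like as a Django management command invoked by cron (the file path and helper name are illustrative; the helper would be a one-shot version of the loop in the question, without the while/sleep):

# myapp/management/commands/check_table_change.py
from django.core.management.base import BaseCommand
from myapp.checks import check_for_table_change_once  # hypothetical one-shot version of the loop above

class Command(BaseCommand):
    help = "Compare table checksums and rebuild the join table if anything changed."

    def handle(self, *args, **options):
        check_for_table_change_once()

cron would then run python manage.py check_table_change every minute or so.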
Would a Distributed Task Queue like Celery be more appropriate?
Very much so. Each time the data changes, you can fire off a task that updates the join table.
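For instance, a minimal sketch (join() here stands for the same join-and-save logic as in the question's snippet, pulled out to module level; the CSV import would call .delay() once after the bulk upload finishes):

from celery import shared_task

@shared_task
def rebuild_join_table():
    join()  # the join-and-save logic from the question's snippet

# at the end of the CSV import view/command:
# rebuild_join_table.delay()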
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
Keyword here is 'TRIGGER' :-)
Alternatives
Having said all that, doing a join and physically populating a table is going to be very, very slow if your table grows to even a few thousand rows. This is because you will need an elaborate query to determine which records have changed (unless you use a separate queue for that). You would then need to insert or update the records in the 'join table'. Update/insert is generally slower than retrieve, so as the size of the data grows, this will become progressively worse.
The real solution may be to optimize your queries and/or tables. May I suggest you post a new question with the slow query and also share your table structures?
I'm wondering if there's a way to set up RabbitMQ or Redis to work with Celery so that when I send a task to the queue, it doesn't go into a list of tasks, but rather into a Set of tasks keyed based on the payload of my task, in order to avoid duplicates.
Here's my setup for more context:
Python + Celery. I've tried RabbitMQ as a backend; now I'm using Redis as a backend because I don't need 100% reliability and it's easier to use, has a small memory footprint, etc.
I have roughly 1000 ids that need work done repeatedly. Stage 1 of my data pipeline is triggered by a scheduler and it outputs tasks for stage 2. The tasks contain just the id for which work needs to be done and the actual data is stored in the database. I can run any combination or sequence of stage 1 and stage 2 tasks without harm.
If stage 2 doesn't have enough processing power to deal with the volume of tasks output by stage 1, my task queue grows and grows. This wouldn't have to be the case if the task queue used sets as the underlying data structure instead of lists.
Is there an off-the-shelf solution for switching from lists to sets as distributed task queues? Is Celery capable of this? I recently saw that Redis has just released an alpha version of a queue system, but that's not ready for production use just yet.
Should I architect my pipeline differently?
You can use an external data structure to store and monitor the current state of your celery queue.
Let's take a Redis key-value pair as an example. Whenever you push a task into Celery, you mark a key with your 'id' field as true in Redis.
Before trying to push a new task with any 'id', you check whether the key for that 'id' is set to true in Redis; if it is, you skip pushing the task.
To clear the keys at the proper time, you can use Celery's after_return handler, which runs when the task has returned. This handler unsets the 'id' key in Redis, clearing the lock for the next task push.
This method ensures you only have ONE instance of a task per id in the Celery queue. You can also enhance it to allow only N tasks per id by using the INCR and DECR commands on the Redis key when the task is pushed and in after_return.
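A rough sketch of this pattern with redis-py and Celery (the key names and the task body are illustrative):

import redis
from celery import Celery, Task

app = Celery("pipeline", broker="redis://localhost:6379/0")
r = redis.Redis()

def dedupe_key(item_id):
    return "stage2:pending:%s" % item_id  # key naming is just an example

def enqueue_stage2(item_id):
    # SET with NX succeeds only if the key does not exist yet, i.e. no task is queued for this id.
    if r.set(dedupe_key(item_id), 1, nx=True, ex=3600):
        process_item.delay(item_id)

class Stage2Task(Task):
    def after_return(self, status, retval, task_id, args, kwargs, einfo):
        r.delete(dedupe_key(args[0]))  # release the lock once the task has returned

@app.task(base=Stage2Task)
def process_item(item_id):
    ...  # the actual stage-2 work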
Can your tasks in stage 2 check whether the work has already been done and, if it has, then not do the work again? That way, even though your task list will grow, the amount of work you need to do won't.
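In other words, something along these lines (a sketch assuming a Django-style model with hypothetical Item, processed_at/updated_at fields and a do_stage2_work helper):

from celery import shared_task
from django.utils import timezone

@shared_task
def process_item(item_id):
    item = Item.objects.get(pk=item_id)  # Item is a placeholder model
    if item.processed_at and item.processed_at >= item.updated_at:
        return  # the current data has already been processed; skip the work
    do_stage2_work(item)  # hypothetical helper doing the actual computation
    item.processed_at = timezone.now()
    item.save(update_fields=["processed_at"])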
I haven't come across a solution for the sets/lists question, and I'd think there are lots of other ways of getting around this issue.
Use a sorted set within Redis for your job queue. It is indeed a set, so if you put the exact same data into it, it won't add a new value (it absolutely needs to be the exact same data; you can't override the hashing used by the sorted set in Redis).
You will need a score to use with the sorted set; you can use a timestamp (as a double, unixtime for instance), which also lets you get the most recent / oldest items if you want. ZRANGEBYSCORE is probably the command you are looking for.
http://redis.io/commands/zrangebyscore
Moreover, if you need additional behavior, you can wrap everything inside a Lua script for atomic behavior and a custom eviction strategy if needed, for instance a "get" script that fetches a job and removes it from the queue atomically, or evicts data if there is too much back pressure.
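A small sketch of that queue shape with redis-py (the key name and JSON payload are illustrative; as noted above, the read-then-remove would need a Lua script to be fully atomic):

import json
import time
import redis

r = redis.Redis()

def push_job(job):
    payload = json.dumps(job, sort_keys=True)  # identical jobs serialize identically
    # nx=True: only add if this exact member is not already in the set, so duplicates are dropped.
    r.zadd("jobs", {payload: time.time()}, nx=True)

def pop_oldest_job():
    items = r.zrangebyscore("jobs", "-inf", "+inf", start=0, num=1)  # oldest score first
    if not items:
        return None
    r.zrem("jobs", items[0])
    return json.loads(items[0])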
I want to cache data from an external service in a graceful manner: the old data, though expired, is served until the worker successfully fetches new data. The data is not time-critical, but a lack of data (external service down) would prevent the service from running, hence the persistent cache.
Currently
I store the fetch timestamp in a separate Redis key
I cache the data indefinitely until the worker fetches new data (I do not set an expiration time)
Questions
Is this a correct way of doing graceful caching with Redis?
Can I natively get a key's last-updated timestamp from Redis, so I do not need to store this information myself?
This is the code:
def set_data(self, data):
    self.redis.set("bitcoinaverage", pickle.dumps(data))
    self.redis.set("bitcoinaverage_updated", calendar.timegm(datetime.datetime.utcnow().utctimetuple()))

def get_data(self):
    return pickle.loads(self.redis.get("bitcoinaverage"))

def is_up_to_date(self):
    last_updated = datetime.datetime.utcfromtimestamp(int(self.redis.get("bitcoinaverage_updated")))
    if not last_updated:
        return False
    return datetime.datetime.utcnow() < last_updated + self.refresh_delay

def tick(self):
    """Run as a periodic worker task to see if we need to update."""
    if not self.is_up_to_date():
        self.update()
What you have could work, but it depends on how big you expect your data set to get. If I understand correctly, your current implementation requires you to run a worker pinging every key with tick(). This would be very, very expensive in terms of network back-and-forth if you have many keys, since you'd have to go over the network, query Redis, and send the results back for every single key (possibly two queries if you need to update). If it's just for the two keys you mention, that's fine. If it's for more, pipelines are your friend (see the sketch below).
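For example, batching the reads through a pipeline costs a single round trip (the key names here are just placeholders):

import redis

r = redis.Redis()
pipe = r.pipeline()
for key in ("bitcoinaverage_updated", "someotherfeed_updated"):  # placeholder key names
    pipe.get(key)
timestamps = pipe.execute()  # one network round trip for all GETs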
If you wanted to be more elegant and robust, you could use notifications when your keys expire. The paradigm here would be: for each value, you set two keys, k and k_updated. k holds the actual value, and k_updated is just a dumb key with a TTL set for when you want k to be updated. When k_updated expires, you'll get a notification, and you can have a listener immediately treat this as a request for a new job to update k and set a new k_updated. This uses the pub/sub model, and you could have multiple subscribers and use a queue to manage the new jobs if you wanted to be very robust about it. The benefits of this system:
Keys get updated immediately when they expire, with no need to constantly query them to see whether they need updating.
You can have multiple independent workers subscribed and listening for update tasks, managing new update jobs as they come in, so that if one update worker goes down you don't stop updating while you bring that box back up.
This latter system could be overkill for your case if you don't have extreme speed or robustness needs, but if you do or plan to, it's something to consider.
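If you do go that route, a minimal sketch of the k / k_updated listener with redis-py could look like this (database 0 is assumed, and refresh() is a hypothetical helper that fetches new data, SETs k and re-SETs k_updated with a TTL):

import redis

r = redis.Redis()
r.config_set("notify-keyspace-events", "Ex")  # enable expired-key events (or set this in redis.conf)

pubsub = r.pubsub()
pubsub.psubscribe("__keyevent@0__:expired")

for message in pubsub.listen():
    if message["type"] != "pmessage":
        continue
    expired_key = message["data"].decode()
    if expired_key.endswith("_updated"):         # our k_updated naming convention
        refresh(expired_key[:-len("_updated")])  # hypothetical: update k, reset k_updated's TTL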
What you are doing will certainly work.
There's no way to natively get the timestamp at which a key was inserted/updated, so you will have to either store it as you do at present or, as an alternative, change what you store for the bitcoinaverage key to include the timestamp (e.g. use some JSON to hold both the timestamp and the pickled data, and store the JSON), though that makes your get_data() and is_up_to_date() more complex.
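One possible shape for that combined value, sketched with the pickle base64-encoded so it survives JSON (the key name is kept from the question):

import base64
import calendar
import datetime
import json
import pickle

def set_data(self, data):
    payload = {
        "updated": calendar.timegm(datetime.datetime.utcnow().utctimetuple()),
        "data": base64.b64encode(pickle.dumps(data)).decode("ascii"),
    }
    self.redis.set("bitcoinaverage", json.dumps(payload))

def get_data(self):
    raw = self.redis.get("bitcoinaverage")
    if raw is None:
        return None
    return pickle.loads(base64.b64decode(json.loads(raw)["data"]))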
What happens in get_data() and is_up_to_date() if you have nothing stored for that key? Have you tested that condition?
In a Django Python app, I launch jobs with Celery (a task manager). When each job is launched, it returns an object (let's call it an instance of class X) that lets you check on the job and retrieve the return value or errors thrown.
Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at the same time, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects, because I cannot use a global variable (a dictionary that allows me to look up each X object by a key); this is because Celery uses different processes, not just different threads, so each would modify its own copy of the global table, causing mayhem.
Subsequently, I received the great advice to use memcached to share the memory across the tasks. I got it working and was able to set and get integer and string values between processes.
The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for classes. This is my best guess: perhaps under the hood memcached serializes objects to the shared memory; class X (understandably) cannot be serialized because it points at live data (the status of the job), and so the serialized version may be out of date (i.e. it may point to the wrong place) when it is loaded again.
Attempts to use a SQLite database were similarly fruitless; not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath), so that they update as the jobs finish or fail.
My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought may be bad style (and potentially insecure).
How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?
Thanks a lot for your help, I'm quite lost.
I think django-celery does this work for you. Did you have a look at the tables created by django-celery? E.g. djcelery_taskstate holds all data for a given task, like state, worker_id and so on. For periodic tasks there is a table called djcelery_periodictask.
In a Django view you can access the TaskMeta object:
from djcelery.models import TaskMeta

task = TaskMeta.objects.get(task_id=task_id)
print(task.status)