I need to track data from another website. Since it's spread over 60+ pages, I intend to use a daily cron job to add a task to the queue. This task should then take care of one page and, depending on some checks, put another instance of itself on the queue for the next page.
Now a simple
taskqueue.add(url='/path/to_self', params=control)
in the get method of my webapp.RequestHandler class for this task leads to a
"POST /path/to_self HTTP/1.1" 405 -
Is there a way to get this to work, or is it simply not possible to add tasks to the queue from within tasks?
It's possible to add tasks from within tasks. I'm doing it in my application.
It's very useful when you want to migrate a large set of entities: one task processes a small chunk of entities, then adds itself to the queue in order to process the rest, until the migration is over.
I am not sure what the problem with your code is.
Have you implemented the post(self) method in your RequestHandler class? Task queue invocations default to the POST method.
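For reference, a minimal sketch of what such a handler might look like; the parameter name and stop condition here are illustrative, not from the original question:
from google.appengine.api import taskqueue
from google.appengine.ext import webapp

class ScrapePage(webapp.RequestHandler):
    def post(self):
        # Task queue invocations arrive as POST requests by default,
        # so the work (and the re-enqueue) belongs here, not in get().
        page = int(self.request.get('page', '1'))
        # ... scrape this page here ...
        if page < 60:  # illustrative stop condition
            taskqueue.add(url='/path/to_self', params={'page': page + 1})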
Pardon my ignorance as I am learning how I can use celery for my purposes.
Suppose I have two tasks: create_ticket and add_message_to_ticket. Usually the create_ticket task is created and completed before add_message_to_ticket tasks are created (possibly multiple times).
import random
import time

# `app` is assumed to be the project's configured Celery application.

@app.task
def create_ticket(ticket_id):
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that processes ticket creation
    return f"Successfully processed ticket creation: {ticket_id}"

@app.task
def add_message_to_ticket(ticket_id, who, when, message_contents):
    # TODO add code that checks whether the create_ticket task for ticket_id has already completed
    time.sleep(random.uniform(1.0, 4.0))  # replace with code that handles the added message
    return f"Successfully processed message for ticket {ticket_id} by {who} at {when}"
Now suppose that these tasks are created out of order because the Python server receives the events from an external web service out of order. For example, one add_message_to_ticket.delay(82, "auroranil", 1599039427, "This issue also occurs on Microsoft Edge on Windows 10.") gets called a few seconds before create_ticket.delay(82) gets called. How would I solve the following problems?
How would I fetch the results of the Celery task create_ticket by specifying ticket_id within the task add_message_to_ticket? All I can think of is to maintain a database that stores ticket state and check whether a particular ticket has been created, but I want to know if I can use Celery's result backend somehow.
If I receive an add_message_to_ticket task with a ticket id whose corresponding create_ticket task has not completed, do I reject that task and put it back in the queue?
Do I need to ensure that the tasks are idempotent? I know that is good practice, but is it a requirement for this to work?
Is there a better approach to solving this problem? I am aware of the Celery Canvas workflow with primitives such as chain, but I am not sure how I can ensure that these events are processed in order, or how to put tasks in a pending state while they wait for the tasks they depend on to complete, based on arguments I want Celery to check (in this case, ticket_id).
I am not particularly worried if I receive multiple user messages for a particular ticket with timestamps out of order, as that is not as important as knowing that a ticket has been created before messages are added to it. The point I am making is that I am coding up several tasks where some events crucially depend on others, whereas the ordering of other events does not matter as much for the Python server to function.
Edit:
Partial solutions:
Use task_id to identify Celery tasks, with a formatted string containing argument values which identifies that task. For example, task_id="create_ticket(\"TICKET000001\")"
Retry tasks that do not meet dependency requirements. Blocking while waiting for subtasks to complete is bad, as a subtask may never complete and the wait would hog a process on one of the worker machines (see the sketch after this list).
Store arguments as part of the result of a completed task, so that later tasks can use information that would otherwise not be available to them.
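A rough sketch of combining the deterministic task_id idea with retry-on-missing-dependency, assuming a configured Celery app named app and a result backend; the id format and retry delay are illustrative:
from celery.result import AsyncResult

@app.task(bind=True, max_retries=None, default_retry_delay=5)
def add_message_to_ticket(self, ticket_id, who, when, message_contents):
    # Look up the create_ticket task by its predictable id.
    create_id = f'create_ticket("{ticket_id}")'
    if AsyncResult(create_id, app=app).state != 'SUCCESS':
        # Dependency not satisfied yet: retry later instead of blocking a worker.
        raise self.retry()
    # ... handle the message here ...

# Enqueue with predictable ids so dependent tasks can look them up later:
create_ticket.apply_async(args=(82,), task_id='create_ticket("82")')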
Relevant links:
Where do you set the task_id of a celery task?
Retrieve result from 'task_id' in Celery from unknown task
Find out whether celery task exists
More questions:
How do I ensure that I send a task only once per task_id? For instance, I want the create_ticket task to be applied asynchronously only once. This is an alternative to making all tasks idempotent.
How do I use AsyncResult in add_message_to_ticket to check for status of create_ticket task? Is it possible to specify a chain somehow even though the first task may have already been completed?
How do I fetch all results of tasks given task name derived from the name of the function definition?
Most importantly, should I use Celery results backend to abstract stored data away from dealing with a database? Or should I scratch this idea and just go ahead with designing a database schema instead?
I'm running Django, Celery and RabbitMQ. What I'm trying to achieve is to ensure that tasks related to one user are executed in order (specifically, one at a time; I don't want task concurrency per user):
whenever a new task is added for a user, it should depend on the most recently added task. Additional functionality might include not adding a task to the queue if a task of this type is already queued for this user and has not yet started.
I've done some research and:
I couldn't find a way to link a newly created task with an already queued one in Celery itself; chains seem to only be able to link new tasks.
I think that both functionalities could be implemented with a custom RabbitMQ message handler, though it might be hard to code.
I've also read about celery-tasktree, and this might be the easiest way to ensure execution order, but how do I link a new task with an already applied_async task tree or queue? Is there any way I could implement that additional no-duplicate functionality using this package?
Edit: There is also this "lock" example in the Celery cookbook, and while the concept is fine, I can't see a way to make it work as intended in my case: if I can't acquire the lock for a user, the task would have to be retried, but that means pushing it to the end of the queue.
What would be the best course of action here?
If you configure the Celery workers so that they can only execute one task at a time (see the worker_concurrency setting), then you could enforce the concurrency that you need on a per-user basis. Using a method like
NUMBER_OF_CELERY_WORKERS = 10
def get_task_queue_for_user(user):
return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)
to get the task queue based on the user id, every task will be assigned to the same queue for each user. The workers would need to be configured to only consume tasks from a single task queue.
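For example, the producer side would route the task explicitly and each worker would be pinned to one queue; my_task, payload and proj are placeholders here:
# Producer: send the task to the per-user queue.
my_task.apply_async(args=(payload,), queue=get_task_queue_for_user(user))

# Worker: one process per queue, one task at a time.
#   celery -A proj worker -Q user_queue_9 --concurrency=1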
It would play out like this:
User 49 triggers a task
The task is sent to user_queue_9
When the one and only celery worker that is listening to user_queue_9 is ready to consume a new task, the task is executed
This is a hacky answer though, because
requiring just a single celery worker for each queue is a brittle system -- if the celery worker stops, the whole queue stops
the workers are running inefficiently
The task I'm implementing scrapes some basic info about a URL, such as the title, description and OGP metadata. If User A requests 200 URLs to scrape, and afterwards User B requests 10 URLs, User B may wait much longer than they expect.
What I'm trying to achieve is to rate limit a specific task on a per user basis or, at least, to be fair between users.
The Celery implementation of rate limiting is too broad, since it keys on the task name only.
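For reference, the built-in limiter is declared on the task definition itself, so every user shares the same bucket (scrape_url_info is just a placeholder name):
@app.task(rate_limit='10/m')  # 10 calls per minute for this task name, shared across all users
def scrape_url_info(url):
    ...  # fetch title, description, OGP metadata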
Do you have any suggestion to achieve this kind of fairness?
Related Celery (Django) Rate limiting
Another way would be to rate limit individual users using a lock. Use the user id as the lock name. If the lock is already held, retry after some task-dependent delay.
Basically, do this:
Ensuring a task is only executed one at a time
Lock on the user id and retry, instead of doing nothing, if the lock can't be acquired. Also, it would be better to use Redis instead of the Django cache, but either way will work.
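A hedged sketch of that idea, adapted from the cookbook recipe linked above and assuming the project's Celery app is available as app; the task name, lock timeout and retry delay are illustrative:
from django.core.cache import cache

@app.task(bind=True, max_retries=None)
def scrape_url(self, user_id, url):
    lock_id = 'scrape-lock-for-user-{}'.format(user_id)
    # cache.add is atomic: it only sets the key if it does not already exist.
    if not cache.add(lock_id, 'locked', timeout=60):
        # Another scrape for this user is running; try again shortly.
        raise self.retry(countdown=5)
    try:
        pass  # ... fetch and parse the URL here ...
    finally:
        cache.delete(lock_id)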
One way to work around this could be to ensure that a user does not enqueue more than x tasks, which means counting, for each user, the number of unprocessed tasks enqueued (on the Django side, not trying to do this with Celery).
How about, instead of running all URL scrapes in a single task, making each scrape a separate task and then running them as chains or groups?
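Something like this, reusing the scrape_url task sketched in the previous answer (illustrative only):
from celery import chain, group

# Fan out: all scrapes may run in parallel across workers.
group(scrape_url.s(user.id, url) for url in urls).apply_async()

# Or strictly one after another for this user:
chain(scrape_url.si(user.id, url) for url in urls).apply_async()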
I have a "queue" of about a million entities on google app engine. I have to "pop" items off of the queue by using a query.
There are a bunch of client processes running all over the place that are constantly making requests to the stack. My problem is that when one of the clients requests an item, I want to make sure that I am removing that item from the front of the queue, sending it to that client process, and no other processes.
Currently, I am querying for the item, modifying its properties so that a query to the queue no longer includes that item, then saving the item. Using this method, it is very common for one item to be sent to more than one client process at the same time. I suspect this is because there is a delay between when I make the writes and when they are reflected to other processes.
Perhaps I need to be using transactions in some way, but when I looked into that, there were a couple of "gotchas". What is a good way to approach this problem?
Is there any reason not to implement the "queue" using App Engine's TaskQueue API? If size of the queue is the problem, TaskQueue could contain up to 200 million Tasks for a paid app, so a million entities would be easily handled.
If you want to be able to simulate queries for a certain task in the queue, you could use task tags, and have your client process pull tasks with a certain tag to be processed. Note that pulling tasks is supported through pull queues rather than push queues.
Other than that, if you want to keep your "queue-as-entities" implementation, you could use the Memcache API to signal to the client processes which entities need to be processed. Memcache provides stronger consistency when you need to share data between instances of your app, compared to the eventual consistency of the HRD datastore, with the caveat that data in Memcache could be lost at any point in time.
I see two ways to tackle this; rough sketches of both follow below:
What you are doing is OK, you just need to use transactions. If your processing takes longer than 30s, you can offload it to the task queue, which can be part of a transaction.
You could use pull queues, where you fill up a queue and then client processes pull tasks from it in an atomic fashion (a lease-delete cycle). With pull queues you can be sure that a task is leased only once. A task must also be deleted manually from the queue after it is done, meaning that if your process dies, the task will be put back in the queue after its lease expires.
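A rough sketch of the first option, claiming an entity inside a transaction so only one client can win it (the model, the claimed property and the claim function are hypothetical):
from google.appengine.ext import db

def claim(item_key):
    def txn():
        item = db.get(item_key)
        if item is None or item.claimed:
            return None  # somebody else got it first
        item.claimed = True
        item.put()
        return item
    # run_in_transaction retries on contention, so at most one caller wins the item.
    return db.run_in_transaction(txn)

And a sketch of the pull-queue lease-delete cycle (queue name, lease parameters and the process function are illustrative):
from google.appengine.api import taskqueue

q = taskqueue.Queue('work-items')
tasks = q.lease_tasks(lease_seconds=60, max_tasks=10)
for task in tasks:
    process(task.payload)   # hypothetical processing function
    q.delete_tasks(task)    # delete only after the work has succeeded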
I have my server on Google App Engine
One of my jobs is to match a huge set of records with another.
This takes very long if I have to match 10,000 records with 100.
What's the best way of implementing this?
I'm using the Web2py stack and deployed my application on Google App Engine.
Maybe I'm misunderstanding something, but this sounds like the perfect match for a task queue, and I can't see how multithreading would help; I thought it only meant that you can serve many responses simultaneously, which won't help if your responses take longer than the 30-second limit.
With a task, you can add it, process until the time limit, and then enqueue another task with the remainder of the work if you haven't finished the job by the time limit.
Multithreading your code is not supported on GAE, so you cannot use it explicitly.
GAE itself can be multithreaded, which means that one frontend instance can handle multiple HTTP requests simultaneously.
In your case, the best way to achieve parallel task execution is the Task Queue.
The basic structure for what you're doing is to have the cron job be responsible for dividing the work into smaller units, and executing each unit with the task queue. The payload for each task would be information that identifies the entities in the first set (such as a set of keys). Each task would perform whatever queries are necessary to join the entities in the first set with the entities in the second set, and store intermediate (or perhaps final) results. You can tweak the payload size and task queue rate until it performs the way you desire.
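A hedged sketch of that fan-out step: the cron handler slices the keys of the first set into chunks and enqueues one task per chunk (the handler URL, chunk size and fan_out function are illustrative):
from google.appengine.api import taskqueue

CHUNK_SIZE = 50

def fan_out(first_set_keys):
    for i in range(0, len(first_set_keys), CHUNK_SIZE):
        chunk = first_set_keys[i:i + CHUNK_SIZE]
        # Each task payload identifies the entities in the first set to process.
        taskqueue.add(
            url='/tasks/match_chunk',
            params={'keys': ','.join(str(k) for k in chunk)})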
If the results of each task need to be aggregated, you can have each task record its completion and test for whether all tasks are complete, or just have another job that polls the completion records, to fire off the aggregation. When the MapReduce feature is more widely available, that will be a framework for performing this kind of work.
http://www.youtube.com/watch?v=EIxelKcyCC0
http://code.google.com/p/appengine-mapreduce/