We have a Luigi Task that requests a piece of information from a third-party service. We are limited in the number of requests per minute we can make to that API.
Is there a way to specify, on a per-Task basis, how many tasks of this kind the scheduler may run per unit of time?
We implemented our own rate limiting in the task. Our API limit was low enough that we could saturate it with a single thread, and when we received a rate-limit response we simply backed off and retried.
One thing you can do is declare the API call as a resource. You set how much of the resource is available in the config, and how much of it the task consumes as a property on the task. This then limits you to running n of that task at a time.
in config:
[resources]
api=1
in the Task code:
resources = {"api": 1}
Related
In our Airflow deployment we have multiple DAGs using the same API, which limits us to 1 request per 10 ms. What is the optimal way to control requests to the same API across all DAGs?
Pools are used to limit concurrency for a resource across multiple DAGs.
For example, if you define a pool my_pool with 1 slot:
airflow pools set my_pool 1
And you have a DAG with 2 tasks as:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {}  # your usual default arguments

with DAG(
    'tutorial',
    default_args=default_args,
    schedule_interval="@once",
    start_date=datetime(2021, 6, 1),
) as dag:
    t1 = DummyOperator(
        task_id='t1_task',
        pool='my_pool'
    )
    t2 = DummyOperator(
        task_id='t2_task',
        pool='my_pool'
    )
In this DAG, t1_task and t2_task cannot run in parallel because both are assigned to my_pool, which has only 1 slot. Either task may run first: t1_task before t2_task, or t2_task before t1_task.
It's not straightforward. There is no direct way to limit API calls across multiple DAGs/tasks, because they run in independent processes/nodes.
The best you can do is route all requests through a proxy and apply rate limiting there (an example rate-limiting proxy is Sphinx). The common best practice in many "serious" deployments is that this kind of rate limiting is provided by proxies or "gateways" of some kind.
If you are in a cloud environment, you can use the existing API gateways for that purpose: GCP, AWS.
Approaching it through a proxy/gateway has the added benefit that you can rate-limit requests not only from a single Airflow installation, but also from all the other sources of API calls to the service (which will inevitably appear eventually).
So there is not much Airflow can (or should) do about it if you actually want to protect against exceeding the rate limits in general.
However, you can make your Airflow installation more resilient to those kinds of errors, and/or configure it so that the rate limits are hit less frequently (at least when those calls are mostly generated by Airflow).
You can limit the number of parallel requests to the API by using the Pools that Elad mentioned in the previous answer: all tasks that make the API call should belong to the same pool, and you limit how many of those tasks can run concurrently. This is quite nice because it is managed at the scheduler level, which means that when the pool is full, Airflow simply does not schedule more of those tasks, and the workers you have configured remain free for other tasks (or Kubernetes Pods are not created). This way Airflow on its own does not flood the API with too many requests in too short a time (max number of requests = <pool size> / <average task duration>, assuming all tasks are similar and generate one API call per execution). Airflow can thus be a "nice citizen" towards others using the same API, but this does not protect you from "others" using that API.
To make Airflow more resilient to such errors, you should use or implement retries with exponential back-off in the Hooks that make the API calls you want to rate-limit. Some of the "rate-limit-vulnerable" hooks implemented in the Airflow community providers already use this approach. For example, you can see how the Dataprep hook does it:
https://github.com/apache/airflow/blob/763919d4152ffa13433e2489fec85ed286b7b196/airflow/providers/google/cloud/hooks/dataprep.py#L61
It's easy to implement such retries. As you can see in the example above, this is a simple tenacity retry decorator with a few parameters that make it retry several times with exponential back-off, plus a custom method to decide whether an error is "permanent" or "retriable". This approach provides resilience in many situations: when you exceed the rate yourself, when there is a sudden spike of requests, when others cause the rate limit to be exceeded, and so on. Ideally, if all users of the rate-limited API implement it, the system becomes "self-healing". The drawback, however, is that while such a request waits, the Airflow worker/Pod keeps running and actively sleeping, blocking an execution slot for other tasks.
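For illustration, a rough sketch of such a decorated call using tenacity (the retry parameters, status codes, and function names are assumptions, not the exact Dataprep settings):

import requests
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

def _retryable(exc):
    # Treat 429 and transient 5xx responses as retriable; anything else is permanent.
    return (
        isinstance(exc, requests.exceptions.HTTPError)
        and exc.response.status_code in (429, 500, 502, 503, 504)
    )

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, max=60),  # exponential back-off capped at 60 seconds
    retry=retry_if_exception(_retryable),
    reraise=True,
)
def call_rate_limited_api(session, url):
    response = session.get(url)
    response.raise_for_status()
    return response.json()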
In the future, when Airflow implements deferrable "async" operators (this is work in progress and should arrive in Airflow 2.2 or 2.3), we will be able to get the best of both approaches: operators that retry with exponential back-off while freeing the running workers to perform other tasks.
I am trying to limit the rate of one Celery task. Here is how I am doing it:
from project.celery import app
app.control.rate_limit('task_a', '10/m')
It is working well. However, there is a catch: other tasks that this worker is responsible for are blocked as well.
Let's say 100 instances of task_a have been scheduled. As it is rate-limited, it will take 10 minutes to execute all of them. During this time, task_b is also scheduled, but it will not be executed until task_a is done.
Is it possible to not block task_b?
By the looks of it, this is just how it works. I just didn't get that impression after reading the documentation.
Other options include:
Separate worker and queue only for this task
Adding an eta to task_a so that all of its instances are scheduled to run during the night
What is the best practice in such cases?
Rate limiting should be part of the task declaration to work on a per-task basis. The way you are doing it, via control, is probably why it has this side effect on other tasks:
@task(rate_limit='10/m')
def task_a():
    ...
After more reading
Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
You will probably have to do this with a separate queue.
The easiest (no coding required) way is separating the task into its own queue and running a dedicated worker just for this purpose.
There's no shame in that, it is totally fine to have many Celery queues and workers, each dedicated just for a specific type of work. As an added bonus you may get some more control over the execution, you can easily turn workers ON/OFF to pause certain processes if needed, etc.
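If you do go that route, a minimal sketch of the setup might look like this (the queue name and module paths are assumptions; task_routes is Celery's standard routing setting):

from project.celery import app

# Route the throttled task to its own queue.
app.conf.task_routes = {
    "project.tasks.task_a": {"queue": "rate_limited"},
}

# Run a dedicated worker just for that queue, e.g.:
#   celery -A project worker -Q rate_limited --concurrency=1
# while the rest of the workload stays on the default queue:
#   celery -A project worker -Q celery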
On the other hand, having lots of specialized workers idle most of the time (waiting for a specific job to be queued) is not particularly memory-efficient.
Thus, if you need to rate-limit more tasks and expect the dedicated workers to be idle most of the time, you may consider increasing efficiency by implementing a token bucket. With that, all your workers can be general-purpose and you can scale them naturally as your overall load increases, knowing that the work distribution will no longer be crippled by a single task's rate limit.
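A minimal sketch of such a token bucket, assuming Redis is available as shared storage (the key names, capacity, and refill rate are illustrative): every worker consults the bucket before calling the API, and a task that cannot get a token reschedules itself instead of blocking.

import time

import redis
from celery import shared_task

r = redis.Redis()

BUCKET_KEY = "api_token_bucket"  # hypothetical Redis key prefix
CAPACITY = 10                    # maximum burst size
REFILL_PER_SEC = 10 / 60.0       # 10 tokens per minute

def try_acquire_token():
    """Refill the bucket based on elapsed time, then take one token if available."""
    with r.lock(BUCKET_KEY + ":lock", timeout=5):
        now = time.time()
        tokens = float(r.get(BUCKET_KEY + ":tokens") or CAPACITY)
        last = float(r.get(BUCKET_KEY + ":ts") or now)
        tokens = min(CAPACITY, tokens + (now - last) * REFILL_PER_SEC)
        acquired = tokens >= 1
        if acquired:
            tokens -= 1
        r.mset({BUCKET_KEY + ":tokens": tokens, BUCKET_KEY + ":ts": now})
        return acquired

@shared_task(bind=True, max_retries=None)
def task_a(self):
    if not try_acquire_token():
        # No token available yet: retry shortly instead of blocking the worker.
        raise self.retry(countdown=5)
    # ... make the actual API call here ...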
I'm trying to execute a task in App Engine through the Task Queue, but I still seem to be hitting a 60-second timeout. I'm unsure what I'm doing incorrectly, as I'd think the limit should be 10 minutes, as advertised.
I have a call to urlfetch.fetch() that appears to be the culprit. My call is:
urlfetch.fetch(url, payload=query_data, method=method, deadline=300)
The tail end of my stack trace shows the method that triggers the url fetch call right before the DeadlineExceededError:
File "/base/data/home/apps/s~mips-conversion-scheduler/000-11.371629749593131630/views.py", line 81, in _get_mips_updated_data
policies_changed = InquiryClient().get_changed_policies(company_id, initial=initial).json()
When I look at the task queue information it shows:
Method/URL: POST /tasks/queue-initial-load
Dispatched time (UTC): 2013/11/14 15:18:49
Seconds late: 0.18
Seconds to process task: 59.90
Last http response code: 500
Reason to retry: AppError
My View that processes the task looks like:
class QueueInitialLoad(webapp2.RequestHandler):
    def post(self):
        company = self.request.get("company")
        if company:
            company_id = self.request.get("company")
            queue_policy_load(company_id, queue_name="initialLoad", initial=True)
with the queue_policy_load being the method that triggers the urlfetch call.
Is there something obvious I'm missing that limits me to the 60-second timeout instead of 10 minutes?
This might be a little too general, but here are some thoughts that might help close the loop. There are two kinds of task queues: push queues and pull queues. Push queue tasks execute automatically, and they are only available to your App Engine app. Pull queue tasks, on the other hand, wait to be leased, are available to workers outside the app, and can be batched.
If you want to configure your queue, you can do it in the queue config file: in Java that is queue.xml, and in Python it is queue.yaml (a sample queue.yaml follows the list below). Push queue tasks, specifically, are processed by handlers (URLs) as POST requests. They:
Are executed ASAP
May cause new instances (Frontend or Backend)
Have a task duration limit of 10 minutes
But, they have an unlimited duration if the tasks are run on the backend
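As an illustration, a minimal queue.yaml defining a push queue with its own dispatch rate (the queue name and numbers are examples, not recommendations):

queue:
- name: api-calls
  rate: 10/m                   # dispatch at most 10 tasks per minute
  bucket_size: 1               # no bursting above the steady rate
  max_concurrent_requests: 1   # only one task from this queue runs at a time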
Here is a quick Python code example showing how you can add tasks to a named push queue. Have a look at the Google developers page for Task Queues if you need more information: https://developers.google.com/appengine/docs/python/taskqueue/
Adding Tasks to a Named Push Queue:
from google.appengine.api import taskqueue

queue = taskqueue.Queue("Qname")
task = taskqueue.Task(url='/handler', params=args)  # args: dict of POST parameters for the handler
queue.add(task)
On the other hand, let's say that you wanted to use a pull queue. You could add tasks in Python to a pull queue using the following:
queue = taskqueue.Queue("Qname")
task = taskqueue.Task(payload=load, method='PULL')
queue.add(task)
You can then lease these tasks out using the following approach in Python:
queue = taskqueue.Queue("Qname")
tasks = queue.lease_tasks(lease_seconds, max_tasks)
Remember that, for pull queues, if a task fails, App Engine retries it until it succeeds.
Hope that helps in terms of providing a general perspective!
Task queues have a 10-minute deadline, but a URLFetch call has a 1-minute deadline:
maximum deadline (request handler) 60 seconds
UPDATE: the intended behaviour was a maximum URLFetch deadline of 10 minutes when running in a TaskQueue; see this bug.
As GAE has evolved, this answer pertains to today, when the idea of "backend" instances is deprecated. GAE apps can be configured as services (aka modules) and run with a manual scaling policy, which allows setting longer timeouts. If you run your app with an autoscaling policy, it will cap your urlfetch calls at 60 seconds and your queued tasks at 10 minutes:
https://cloud.google.com/appengine/docs/python/an-overview-of-app-engine
I have my server on Google App Engine.
One of my jobs is to match a huge set of records against another.
This takes a very long time if I have to match 10,000 records against 100.
What's the best way of implementing this?
I'm using the Web2py stack and deployed my application on Google App Engine.
Maybe I'm misunderstanding something, but this sounds like a perfect match for a task queue, and I can't see how multithreading will help; I thought it only meant that you can serve many responses simultaneously, and it won't help if your responses take longer than the 30-second limit.
With a task you can add it, process until the time limit, and then, if you haven't finished the job by that point, enqueue another task with the remainder of the work, as sketched below.
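A rough sketch of that pattern (the handler URL, cursor parameter, batch size, and the load_records/match_record helpers are hypothetical):

from google.appengine.api import taskqueue
import webapp2

BATCH_SIZE = 100  # records handled per task invocation

class MatchRecordsHandler(webapp2.RequestHandler):
    def post(self):
        cursor = int(self.request.get("cursor", 0))
        batch = load_records(cursor, BATCH_SIZE)   # hypothetical query helper
        for record in batch:
            match_record(record)                   # hypothetical matching helper
        if len(batch) == BATCH_SIZE:
            # More work remains: hand the rest to a fresh task before hitting the deadline.
            taskqueue.add(url="/tasks/match-records",
                          params={"cursor": cursor + BATCH_SIZE})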
Multithreading your code is not supported on GAE, so you cannot use it explicitly.
GAE itself can be multithreaded, which means that one frontend instance can handle multiple HTTP requests simultaneously.
In your case, the best way to achieve parallel task execution is the Task Queue.
The basic structure for what you're doing is to have the cron job be responsible for dividing the work into smaller units, and executing each unit with the task queue. The payload for each task would be information that identifies the entities in the first set (such as a set of keys). Each task would perform whatever queries are necessary to join the entities in the first set with the entities in the second set, and store intermediate (or perhaps final) results. You can tweak the payload size and task queue rate until it performs the way you desire.
If the results of each task need to be aggregated, you can have each task record its completion and check whether all tasks are complete, or have another job poll the completion records to fire off the aggregation. When the MapReduce feature is more widely available, it will be a framework for performing this kind of work.
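For illustration, a minimal sketch of the fan-out step described above (the model, handler URL, and chunk size are hypothetical):

from google.appengine.api import taskqueue
from google.appengine.ext import ndb
import webapp2

CHUNK_SIZE = 50  # entities handed to each task

class FirstSetRecord(ndb.Model):
    """Placeholder model standing in for the first record set."""
    name = ndb.StringProperty()

class StartMatchingCron(webapp2.RequestHandler):
    def get(self):
        keys = FirstSetRecord.query().fetch(keys_only=True)
        for i in range(0, len(keys), CHUNK_SIZE):
            chunk = [k.urlsafe() for k in keys[i:i + CHUNK_SIZE]]
            # Each task payload identifies the entities it should join against the second set.
            taskqueue.add(url="/tasks/match-chunk",
                          params={"keys": ",".join(chunk)})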
http://www.youtube.com/watch?v=EIxelKcyCC0
http://code.google.com/p/appengine-mapreduce/
Say I had over 10,000 feeds that I wanted to fetch/parse periodically.
If the period were, say, 1 hour, that would be 24 x 10,000 = 240,000 fetches per day.
The current 10k limit of the Labs Task Queue API would preclude setting up one task per fetch. How then would one do this?
Update, re: fetching n URLs per task: given the 30-second timeout per request, at some point this would hit a ceiling. Is there any way to parallelize it, so each task initiates a bunch of async parallel fetches, each of which takes less than 30 seconds to finish, even though the lot together may take more than that?
Here's the asynchronous urlfetch API:
http://code.google.com/appengine/docs/python/urlfetch/asynchronousrequests.html
Set off a bunch of requests with a reasonable deadline (give yourself some headroom under your timeout, so that if one request times out you still have time to process the others), then wait on each one in turn and process them as they complete.
I haven't used this technique myself in GAE, so you're on your own finding any non-obvious gotchas. Sadly there doesn't seem to be a select() style call in the API to wait for the first of several requests to complete.
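A rough sketch of that approach (the URLs and deadline value are illustrative):

from google.appengine.api import urlfetch

def fetch_all(urls, deadline=20):
    """Start every fetch at once, then collect each result in turn."""
    rpcs = []
    for url in urls:
        rpc = urlfetch.create_rpc(deadline=deadline)  # keep headroom under the request timeout
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append((url, rpc))
    results = {}
    for url, rpc in rpcs:
        try:
            results[url] = rpc.get_result()   # blocks until this particular fetch completes
        except urlfetch.DownloadError:
            results[url] = None               # that fetch failed or timed out; handle as needed
    return results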
2 fetches per task? 3?
Group up the fetches, so instead of queuing 1 fetch you queue up, say, a work unit that does 10 fetches.