Running Job On Airflow Based On Webrequest - python

I wanted to know if airflow tasks can be executed upon getting a request over HTTP. I am not interested in the scheduling part of Airflow. I just want to use it as a substitute for Celery.
So an example operation would be something like this.
User submits a form requesting some report.
Backend receives the request and sends the user a notification that the request has been received.
The backend then schedules a job using Airflow to run immediately.
Airflow then executes a series of tasks associated with a DAG. For example: pull data from Redshift, pull data from MySQL, perform some operations on the two result sets, combine them, upload the results to Amazon S3, and send an email.
From what I have read online, you can run Airflow jobs by executing airflow ... on the command line. I was wondering if there is a Python API which can do the same thing.
Thanks.

The Airflow REST API Plugin would help you out here. Once you have followed the instructions for installing the plugin, you just need to hit the following URL: http://{HOST}:{PORT}/admin/rest_api/api/v1.0/trigger_dag?dag_id={dag_id}&run_id={run_id}&conf={url_encoded_json_parameters}, replacing dag_id with the id of your DAG, either omitting run_id or specifying a unique id, and passing URL-encoded JSON for conf (with any of the parameters you need in the triggered DAG).
Here is an example JavaScript function that uses jQuery to call the Airflow api:
function triggerDag(dagId, dagParameters) {
    var urlEncodedParameters = encodeURIComponent(dagParameters);
    var dagRunUrl = "http://airflow:8080/admin/rest_api/api/v1.0/trigger_dag?dag_id=" + dagId + "&conf=" + urlEncodedParameters;
    $.ajax({
        url: dagRunUrl,
        dataType: "json",
        success: function(msg) {
            console.log('Successfully started the dag');
        },
        error: function(e) {
            console.log('Failed to start the dag');
        }
    });
}

A newer option in Airflow is the experimental, but built-in, API endpoint available in the more recent builds of 1.7 and 1.8. This lets you run a REST service on your Airflow server that listens on a port and accepts CLI-style jobs.
I only have limited experience myself, but I have run test dags with success. Per the docs:
/api/experimental/dags/<DAG_ID>/dag_runs creates a dag_run for a given dag id (POST).
That will schedule an immediate run of whatever dag you want to run. It does still use the scheduler, though, waiting for a heartbeat to see that the dag is running and to pass tasks to the worker. This is exactly the same behavior as the CLI, though, so I still believe it fits your use case.
Documentation on how to configure it is available here: https://airflow.apache.org/api.html
There are some simple example clients in the GitHub repo, too, under airflow/api/clients.
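For example, a minimal Python sketch of hitting the experimental endpoint with the requests library (host, port, DAG id, and the conf payload are placeholders, and the expected shape of conf can vary by Airflow version):

import requests

# Placeholders: adjust host, port, and dag id to your deployment.
resp = requests.post(
    "http://localhost:8080/api/experimental/dags/my_report_dag/dag_runs",
    json={"conf": {"report_id": 123}},  # optional parameters for the triggered run
)
print(resp.status_code, resp.text)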

You should look at Airflow HTTP Sensor for your needs. You can use this to trigger a dag.
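For reference, an HttpSensor does not call Airflow's API itself; it waits for an HTTP condition and then lets the downstream tasks of the DAG run. A minimal sketch, assuming Airflow 1.10-style imports and a hypothetical http_default connection and endpoint:

from datetime import datetime

from airflow import DAG
from airflow.sensors.http_sensor import HttpSensor

dag = DAG(
    dag_id="report_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@once",
)

wait_for_request = HttpSensor(
    task_id="wait_for_request",
    http_conn_id="http_default",          # connection pointing at your backend
    endpoint="report-requests/pending",   # hypothetical endpoint to poll
    response_check=lambda response: response.text == "ready",  # hypothetical check
    poke_interval=30,
    dag=dag,
)
# Downstream tasks (pull data, combine, upload to S3, send email) would be
# chained after wait_for_request.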

Airflow's experimental REST API interface can be used for this purpose.
The following request will trigger a DAG:
curl -X POST \
http://<HOST>:8080/api/experimental/dags/process_data/dag_runs \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/json' \
-d '{"conf":"{\"START_DATE\":\"2018-06-01 03:00:00\", \"STOP_DATE\":\"2018-06-01 23:00:00\"}"}'
The following request retrieves a list of DAG runs for a specific DAG id:
curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X GET http://<HOST>:8080/api/experimental/dags/process_data/dag_runs
For the GET API to work, set the rbac flag to True in airflow.cfg.
The full list of available endpoints can be found in the Airflow API documentation (https://airflow.apache.org/api.html).

UPDATE: stable Airflow REST API released:
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
Almost everything stays the same, except that the API URL changes.
Also, "conf" is now required to be an object, so I added additional wrapping:
def trigger_dag_v2(self, dag_id, run_id=None, conf=None, execution_date=None):
    endpoint = '/api/v1/dags/{}/dagRuns'.format(dag_id)
    url = urljoin(self._api_base_url, endpoint)
    data = self._request(url, method='POST',
                         json={
                             "run_id": run_id,
                             # "conf" must now be an object, hence the extra wrapping
                             "conf": {'conf': json.dumps(conf)},
                             "execution_date": execution_date,
                         })
    return data['message']
OLD ANSWER:
Airflow has a REST API (currently experimental), available here:
https://airflow.apache.org/api.html#endpoints
If you do not want to install plugins as suggested in other answers, here is how you can do it directly with the API:
def trigger_dag(self, dag_id, run_id=None, conf=None, execution_date=None):
    endpoint = '/api/experimental/dags/{}/dag_runs'.format(dag_id)
    url = urljoin(self._api_base_url, endpoint)
    data = self._request(url, method='POST',
                         json={
                             "run_id": run_id,
                             "conf": conf,
                             "execution_date": execution_date,
                         })
    return data['message']
More examples of working with the Airflow API in Python are available here:
https://github.com/apache/airflow/blob/master/airflow/api/client/json_client.py

I found this post while trying to do the same. After further investigation, I switched to Argo Events. It is basically the same, but based on event-driven flows, so it is much more suitable for this use case.
Link:
https://argoproj.github.io/argo

Airflow now has support for a stable REST API. Using the stable REST API, you can trigger a DAG as follows:
curl --location --request POST 'localhost:8080/api/v1/dags/unpublished/dagRuns' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
--data-raw '{
    "dag_run_id": "dag_run_1",
    "conf": {
        "key": "value"
    }
}'
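Roughly the same call from Python with the requests library (same endpoint and credentials as the curl above; YWRtaW46YWRtaW4= is base64 for admin:admin):

import requests

resp = requests.post(
    "http://localhost:8080/api/v1/dags/unpublished/dagRuns",
    auth=("admin", "admin"),  # basic auth, same as the Authorization header above
    json={"dag_run_id": "dag_run_1", "conf": {"key": "value"}},
)
print(resp.status_code, resp.json())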

Related

notebook to execute Databricks job

Is there an API or other way to programmatically run a Databricks job? Ideally, we would like to call a Databricks job from a notebook. The following just gives the currently running job id, but that's not very useful:
dbutils.notebook.entry_point.getDbutils().notebook().getContext().currentRunId().toString()
To run a Databricks job, you can use the Jobs API. I have a Databricks job called for_repro which I ran from a Databricks notebook using the two approaches below.
Using requests library:
You can create an access token by navigating to Settings -> User settings. Under the Access tokens tab, click Generate token.
Use the above generated token along with the following code.
import requests
import json

my_json = {"job_id": <your_job-id>}
auth = {"Authorization": "Bearer <your_access-token>"}

response = requests.post('https://<databricks-instance>/api/2.0/jobs/run-now', json=my_json, headers=auth).json()
print(response)
The <databricks-instance> value from the above code can be extracted from your workspace URL.
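If you want to check on the run you just started, the run-now response should include a run_id which, to my understanding, can be passed to the Jobs API's runs/get endpoint. A rough sketch reusing response and auth from the code above:

run_id = response["run_id"]

status = requests.get(
    'https://<databricks-instance>/api/2.0/jobs/runs/get',
    headers=auth,
    params={"run_id": run_id},
).json()

print(status.get("state"))  # contains life_cycle_state / result_state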
Using %sh magic command script:
You can also use the %sh magic command in a Python notebook cell to run a Databricks job.
%sh
curl --netrc --request POST --header "Authorization: Bearer <access_token>" \
https://<databricks-instance>/api/2.0/jobs/run-now \
--data '{"job_id": <your job id>}'
The following is my job details and run history for reference.
Refer to the Microsoft documentation on the Jobs API for all the other operations it supports.

Python requests.post timing out on localhost

I'm having trouble getting the Python requests module to post to an endpoint on the same server.
The server is running flask, and the route I want to post to is /new_applet.
My code is below. It is inside a different Flask route function; could that be an issue?
url = "http://0.0.0.0:5000/new_applet"
data = {
    "action": "add",
    "plugin": plugin,
    "version": version,
    "component": component,
}
resp = requests.post(url, data=data)
However, it hangs while trying to make the POST request. Debugging shows that the request never reaches the Flask route function.
The request works if I run the following command on the server:
curl --location --request POST '0.0.0.0:5000/new_applet' \
--form 'action=add' \
--form 'plugin=up_spotify' \
--form 'version=0.1' \
--form 'component=inputs'
Why doesn't the Python request work, when the curl request does?
I believe Flask runs a single thread by default.
According to your explanation, if the request is made from within Flask, then it blocks the thread while waiting for a reply (the same thread that is supposed to handle the requests.post).
You could try using multiple threads, as explained here: Can I serve multiple clients using just Flask app.run() as standalone?
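One quick way to test that theory is to enable the development server's threaded mode so the nested request can be handled on another thread (a sketch; in production you would run multiple workers under something like gunicorn instead):

if __name__ == "__main__":
    # threaded=True lets the dev server answer /new_applet while the
    # original request is still blocked inside requests.post.
    app.run(host="0.0.0.0", port=5000, threaded=True)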

Flask session doesn't update consistently with parallel requests

I'm noticing that when requests running in parallel modify Flask's session, only some keys are recorded. This happens both with Flask's default cookie session and with Flask-Session using the Redis backend. The project is not new, but this only became noticeable once many requests were happening at the same time for the same session.
import time

from flask import Flask, session
from flask_session import Session

app = Flask(__name__)
app.secret_key = "example"
app.config["SESSION_TYPE"] = "redis"
Session(app)

@app.route("/set/<value>")
def set_value(value):
    """Simulate long running task."""
    time.sleep(1)
    session[value] = "done"
    return "ok\n"

@app.route("/keys")
def keys():
    return str(session.keys()) + "\n"
The following shell script demonstrates the issue. Notice that all the requests complete, but only one key is present in the final listing, and it's different between test runs.
#!/bin/bash
# set session
curl -c 'cookie' http://localhost:5007/keys
# run parallel
curl -b 'cookie' http://localhost:5007/set/key1 && echo "done1" &
curl -b 'cookie' http://localhost:5007/set/key2 && echo "done2" &
curl -b 'cookie' http://localhost:5007/set/key3 && echo "done3" &
wait
# get result
curl -b 'cookie' http://localhost:5007/keys
$ sh test.sh
dict_keys(['_permanent'])
ok
ok
ok
done3
done1
done2
dict_keys(['_permanent', 'key2'])
$ sh test.sh
dict_keys(['_permanent'])
ok
done3
ok
ok
done2
done1
dict_keys(['_permanent', 'key1'])
Why aren't all the keys present after the requests finish?
Cookie-based sessions are not thread safe. Any given request only sees the session cookie sent with it, and only returns the cookie with that request's modifications. This isn't specific to Flask, it's how HTTP requests work.
You issue three requests in parallel. They all read the initial cookie that only contains the _permanent key, send their requests, and get a response that sets a cookie with their specific key. Each response cookie would have the _permanent key and its own keyN only. Whichever request finishes last writes to the file, overwriting the previous data, so you're left with its cookie only.
In practice this isn't an issue. The session isn't really meant to store data that changes rapidly between requests, that's what a database is for. Things that modify the session, such as logging in, don't happen in parallel to the same session (and are idempotent anyway).
If you're really concerned about this, use a server-side session to store the data in a database. Databases are good at synchronizing writes.
You're already using Flask-Session and Redis, but digging into the Flask-Session implementation reveals why you have this issue: Flask-Session doesn't store each session key separately; it writes a single serialized value containing all the keys. So it suffers the same problem as cookie-based sessions: only what was present during that request is written back to Redis, overwriting whatever happened in parallel.
In this case, it would be better to write your own SessionInterface subclass that stores each key individually. You would override save_session to set all the keys in the session and delete any that aren't present.
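A rough sketch of that idea, storing each session key as its own field in a Redis hash so parallel requests don't overwrite each other's keys (the class names are mine, and cleanup of deleted keys is omitted for brevity):

import json
from uuid import uuid4

import redis
from flask.sessions import SessionInterface, SessionMixin


class HashSession(dict, SessionMixin):
    def __init__(self, sid, initial=None):
        super().__init__(initial or {})
        self.sid = sid


class RedisHashSessionInterface(SessionInterface):
    """Store every session key as its own field in a Redis hash."""

    def __init__(self, client=None, prefix="session:"):
        self.client = client or redis.Redis()
        self.prefix = prefix

    def open_session(self, app, request):
        sid = request.cookies.get(app.config["SESSION_COOKIE_NAME"]) or uuid4().hex
        stored = self.client.hgetall(self.prefix + sid)
        data = {k.decode(): json.loads(v) for k, v in stored.items()}
        return HashSession(sid, data)

    def save_session(self, app, session, response):
        name = self.prefix + session.sid
        # Write each key individually so parallel requests only touch
        # the fields they actually set.
        for key, value in session.items():
            self.client.hset(name, key, json.dumps(value))
        response.set_cookie(
            app.config["SESSION_COOKIE_NAME"], session.sid,
            httponly=True, path=self.get_cookie_path(app),
        )


# app.session_interface = RedisHashSessionInterface()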

Django service on gunicorn POST request is received as GET?

I have a Django REST service running in a virtual environment on a gunicorn server, with the following .wsgi file:
import os, sys
import site
site.addsitedir('/opt/valuation/env/lib/python2.7/site-packages')
sys.stdout = sys.stderr
os.environ['DJANGO_SETTINGS_MODULE'] = 'valuation.valuationcont.valuation.settings'
import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()
When I do a curl POST call, the service works perfectly:
curl -H "Content-Type: application/json" -X POST -d '{...}' -u username:password http://localhost:8000/valuation/predict/
But when I do the same request through the API gateway using axios, the Django service responds with my custom GET response ("GET not supported, try POST").
axios({
    method: 'post',
    url: 'http://localhost:8000/valuation/predict',
    headers: {
        "Content-Type": "application/json",
        "Authorization": "Basic [BASE64 ENCODING]"
    },
    data: {
        ...
    }
}).then(response => {
    console.log(response.data)
}).catch(err => {
    console.log(err.toString())
})
The request is transformed from POST to GET.
This only happens with the django/gunicorn service.
Since I am new to django/gunicorn, I think there is something wrong with the .wsgi file. But then how come the curl call works?
Any help appreciated; I've been struggling with this for a week now.
Edit:
Managed to recreate the same problem on my local machine: axios POST requests made through its generic API are translated into GET.
Using the axios.post(...) method I managed to get 403 and 201 responses, all while Postman works fine.
I have a suspicion that since the POST fails, the axios API has a default fallback to GET, which then doesn't fail, and the service responds normally ("GET not supported", as it should).
The next step to debug this would be to ask: how do I recreate the Postman POST call as closely as possible in JavaScript, since Postman is working and it is evidently axios that is causing the problems?
You're not using the same URL. In the curl snippet you request http://localhost:8000/valuation/predict/ but in the second you request http://localhost:8000/valuation/predict - without the final slash.
Django by default redirects URLs that don't end in a slash to one that does, and a redirect is always a GET.

How to auth into BigQuery on Google Compute Engine?

What's the easiest way to authenticate into Google BigQuery when on a Google Compute Engine instance?
First of all, make sure that your instance has the scope to access BigQuery; you can decide this only at creation time.
In a bash script, get an OAuth token by calling:
ACCESSTOKEN=`curl -s "http://metadata/computeMetadata/v1/instance/service-accounts/default/token" -H "X-Google-Metadata-Request: True" | jq ".access_token" | sed 's/"//g'`
echo "retrieved access token $ACCESSTOKEN"
Now let's say you want a list of the datasets in a project:
CURL_URL="https://www.googleapis.com/bigquery/v2/projects/YOURPROJECTID/datasets"
CURL_OPTIONS="-s --header 'Content-Type: application/json' --header 'Authorization: OAuth $ACCESSTOKEN' --header 'x-goog-project-id:YOURPROJECTID' --header 'x-goog-api-version:1'"
CURL_COMMAND="curl --request GET $CURL_URL $CURL_OPTIONS"
CURL_RESPONSE=`eval $CURL_COMMAND`
The response, in JSON format, can be found in the variable CURL_RESPONSE.
PS: I realize now that this question is tagged as Python, but the same principles apply.
In Python:
AppAssertionCredentials is a Python class that allows a Compute Engine instance to identify itself to Google and other OAuth 2.0 servers, without requiring a flow.
https://developers.google.com/api-client-library/python/
The project id can be read from the metadata server, so it doesn't need to be set as a variable.
https://cloud.google.com/compute/docs/metadata
The following code gets a token using AppAssertionCredentials, the project id from the metadata server, and instantiates a BigqueryClient with this data:
import bigquery_client
import urllib2
from oauth2client import gce

def GetMetadata(path):
    # urllib2.urlopen does not take headers directly, so build a Request.
    request = urllib2.Request(
        'http://metadata/computeMetadata/v1/%s' % path,
        headers={'Metadata-Flavor': 'Google'})
    return urllib2.urlopen(request).read()

credentials = gce.AppAssertionCredentials(
    scope='https://www.googleapis.com/auth/bigquery')

client = bigquery_client.BigqueryClient(
    credentials=credentials,
    api='https://www.googleapis.com',
    api_version='v2',
    project_id=GetMetadata('project/project-id'))
For this to work, you need to give the GCE instance access to the BigQuery API when creating it:
gcloud compute instances create <your_instance_name> --scopes storage-ro bigquery
