notebook to execute Databricks job - python

Is there an API or another way to programmatically run a Databricks job? Ideally, we would like to call a Databricks job from a notebook. The following only returns the currently running job ID, which isn't very useful:
dbutils.notebook.entry_point.getDbutils().notebook().getContext().currentRunId().toString()

To run a Databricks job, you can use the Jobs API. I have a Databricks job called for_repro, which I ran from a Databricks notebook in the two ways shown below.
Using the requests library:
You can create an access token by navigating to Settings -> User settings. Under the Access tokens tab, click Generate token.
Use the generated token along with the following code.
import requests
import json
my_json = {"job_id": <your_job-id>}
auth = {"Authorization": "Bearer <your_access-token>"}
response = requests.post('https://<databricks-instance>/api/2.0/jobs/run-now', json = my_json, headers=auth).json()
print(response)
The <databricks-instance> value from the above code can be extracted from your workspace URL.
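If you also want to block until the triggered run finishes, you can poll the Jobs API runs/get endpoint with the run_id returned by run-now. A minimal sketch, using the same placeholders as above:
import time
import requests

auth = {"Authorization": "Bearer <your_access-token>"}

# Trigger the job and grab the run_id from the run-now response
run = requests.post('https://<databricks-instance>/api/2.0/jobs/run-now',
                    json={"job_id": <your_job-id>}, headers=auth).json()
run_id = run["run_id"]

# Poll runs/get until the run leaves the PENDING/RUNNING states
while True:
    status = requests.get('https://<databricks-instance>/api/2.0/jobs/runs/get',
                          params={"run_id": run_id}, headers=auth).json()
    if status["state"]["life_cycle_state"] not in ("PENDING", "RUNNING"):
        break
    time.sleep(30)

print(status["state"])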
Using the %sh magic command:
You can also use the %sh magic command in a Python notebook cell to run a Databricks job.
%sh
curl --request POST --header "Authorization: Bearer <access_token>" \
https://<databricks-instance>/api/2.0/jobs/run-now \
--data '{"job_id": <your job id>}'
(Screenshots of the job details and run history omitted.)
Refer to this Microsoft documentation for all the other operations that can be performed with the Jobs API.

Related

GCP cloud function - Could not find kubectl on the path

I'm writing this Google Cloud Function (Python):
import os
import subprocess

def create_kubeconfig(request):
    subprocess.check_output('curl https://sdk.cloud.google.com | bash | echo ""', stdin=subprocess.PIPE, shell=True)
    os.system("./google-cloud-sdk/install.sh")
    os.system("gcloud init")
    os.system("curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.17.0/bin/linux/amd64/kubectl")
    os.system("gcloud container clusters get-credentials **cluster name** --zone us-west2-a --project **project name**")
    os.system("gcloud container clusters get-credentials **cluster name** --zone us-west2-a --project **project name**")
    conf = KubeConfig()
    conf.use_context('**cluster name**')
When I run the code, it gives me the error:
'Invalid kube-config file. ' kubernetes.config.config_exception.ConfigException: Invalid kube-config file. No configuration found.
Please help me solve it.
You have to reach the K8s API programmatically. The API is described in the documentation.
It's not simple to do, but here are some pointers for achieving what you want.
First, get the GKE master IP.
Then you can access the cluster easily. Here is an example for reading the deployments:
import google.auth
from google.auth.transport import requests
credentials, project_id = google.auth.default()
session = requests.AuthorizedSession(credentials)
response = session.get('https://34.76.28.194/apis/apps/v1/namespaces/default/deployments', verify=False)
response.raise_for_status()
print(response.json())
For creating one, you can do this
import google.auth
from google.auth.transport import requests
credentials, project_id = google.auth.default()
session = requests.AuthorizedSession(credentials)
with open("deployment.yaml", "r") as f:
data = f.read()
response = session.post('https://34.76.28.194/apis/apps/v1/namespaces/default/deployments', data=data,
headers={'content-type': 'application/yaml'}, verify=False)
response.raise_for_status()
print(response.json())
Depending on the object that you want to create, you have to use the correct file definition and the correct API endpoint. I don't know of a way to apply a whole YAML file with several definitions in a single API call.
Lastly, be sure to grant the correct GKE roles to the Cloud Function's service account.
UPDATE
Another solution is to use Cloud Run. With Cloud Run and its container capability, you can install and call system processes (it's not totally open because your container runs inside a gVisor sandbox, but most common usages are allowed).
The idea is the following: use a gcloud SDK base image, deploy your application on it, then code your app to perform system calls.
Here is a working example in Go.
Dockerfile:
FROM golang:1.13 as builder
# Copy local code to the container image.
WORKDIR /app/
COPY go.mod .
ENV GO111MODULE=on
RUN go mod download
COPY . .
# Perform test for building a clean package
RUN go test -v ./...
RUN CGO_ENABLED=0 GOOS=linux go build -v -o server
# Gcloud capable image
FROM google/cloud-sdk
COPY --from=builder /app/server /server
CMD ["/server"]
Note: the cloud-sdk base image is heavy: ~700 MB.
The content example (only the happy path; I removed the error management and the stderr/stdout feedback to simplify the code):
.......
// Example here: recover the yaml file from a bucket
client, _ := storage.NewClient(ctx)
reader, _ := client.Bucket("my_bucket").Object("deployment.yaml").NewReader(ctx)
content, _ := ioutil.ReadAll(reader)
// You can store the file locally in the /tmp directory. It's an in-memory file system; don't forget to purge it to avoid an out-of-memory crash
ioutil.WriteFile("/tmp/file.yaml", content, 0644)
// Execute external commands
// 1st: recover the kube authentication
exec.Command("gcloud", "container", "clusters", "get-credentials", "cluster-1", "--zone=us-central1-c").Run()
// Then interact with the cluster with the kubectl tool and simply apply your description file
exec.Command("kubectl", "apply", "-f", "/tmp/file.yaml").Run()
.......
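If you would rather write the Cloud Run service in Python, the same happy path could look roughly like this. This is only a sketch, not part of the original answer: it assumes the google/cloud-sdk base image (so gcloud and kubectl are on the PATH), the google-cloud-storage package, and the same bucket, file and cluster names as the Go example.
import subprocess

from google.cloud import storage

def apply_deployment():
    # Recover the YAML file from the bucket into /tmp (in-memory file system)
    storage.Client().bucket("my_bucket").blob("deployment.yaml").download_to_filename("/tmp/file.yaml")

    # 1st: recover the kube authentication
    subprocess.run(
        ["gcloud", "container", "clusters", "get-credentials",
         "cluster-1", "--zone=us-central1-c"],
        check=True)

    # Then apply the description file with kubectl
    subprocess.run(["kubectl", "apply", "-f", "/tmp/file.yaml"], check=True)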
Instead of using gcloud inside the Cloud Function (and attempting to install it on every request, which will significantly increase the runtime of your function), you should use the google-cloud-container client library to make the same API calls directly from Python, for example:
from google.cloud import container_v1
client = container_v1.ClusterManagerClient()
project_id = 'YOUR_PROJECT_ID'
zone = 'YOUR_PROJECT_ZONE'
response = client.list_clusters(project_id, zone)
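If you then need to talk to the cluster itself from the function, one possible follow-up (a sketch, not from the original answer; it assumes the kubernetes and google-auth packages and the same client-library version as the snippet above) is to take the endpoint from the list_clusters response and point the kubernetes Python client at it with a token from the default credentials:
import google.auth
import google.auth.transport.requests
from google.cloud import container_v1
from kubernetes import client as k8s_client

# Get an access token for the function's service account
credentials, _ = google.auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
credentials.refresh(google.auth.transport.requests.Request())

# Look up the cluster endpoint
gke = container_v1.ClusterManagerClient()
cluster = gke.list_clusters('YOUR_PROJECT_ID', 'YOUR_PROJECT_ZONE').clusters[0]

# Configure the kubernetes client against that endpoint
configuration = k8s_client.Configuration()
configuration.host = 'https://' + cluster.endpoint
configuration.verify_ssl = False  # or decode cluster.master_auth.cluster_ca_certificate into a CA file
configuration.api_key = {'authorization': 'Bearer ' + credentials.token}

apps = k8s_client.AppsV1Api(k8s_client.ApiClient(configuration))
print(apps.list_namespaced_deployment('default'))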

I got some problems in my Python env when using a GCP API key

I want to use the Google Translation API, but I have some problems.
My env is Ubuntu 18 (Linux) and Python with the Atom IDE.
I used gcloud to set my configuration and got the auth login and auth login token.
export GOOGLE_APPLICATION_CREDENTIALS=//api_key.json
gcloud init
gcloud auth application-default login
gcloud auth application-default print-access-token
so I could use curl and get some test data:
curl -X POST \
-H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
-H "Content-Type: application/json; charset=utf-8" \
--data "{
  'q': 'Hello world',
  'q': 'My name is Jeff',
  'target': 'de'
}" "https://translation.googleapis.com/language/translate/v2"
{
  "data": {
    "translations": [
      {
        "translatedText": "Hallo Welt",
        "detectedSourceLanguage": "en"
      },
      {
        "translatedText": "Mein Name ist Jeff",
        "detectedSourceLanguage": "en"
      }
    ]
  }
}
When I run the test code in the Atom IDE, my project number is wrong.
It is my past project.
Even when I run the test code from Python in bash, the situation is the same.
I don't know what is wrong; I just guess there is some problem in my Python env.
The raised error:
raise exceptions.from_http_response(response)
google.api_core.exceptions.Forbidden: 403 POST
https://translation.googleapis.com/language/translate/v2: Cloud Translation
API has not been used in project [wrong number] before or it is disabled.
Enable it by visiting
https://console.developers.google.com/apis/api/translate.googleapis.com
/overview?project=[wrong number] then retry. If you enabled this API
recently, wait a few minutes for the action to propagate to our systems and
retry.
This error message is usually thrown when the application is not being authenticated correctly, for example due to missing files, invalid credential paths, or incorrect environment variable assignments, among other causes. Since the client libraries need to pull the credentials data from the environment variable or the client object, you need to ensure you are pointing to the correct authentication files. Keep in mind this issue might not occur when using the curl command because there you were passing the access token directly.
Based on this, I recommend you confirm that you are using the JSON credentials file of your current project, and follow the Obtaining and providing service account credentials manually guide in order to explicitly specify your service account file directly in your code; this way, you can set it permanently and verify that you are passing the service credentials correctly. Additionally, you can take a look at the Using Client Libraries guide, which contains the step-by-step process required to use the Translation API with Python.
Example of passing the path to the service account key in code:
def explicit():
    from google.cloud import storage

    # Explicitly use service account credentials by specifying the private key
    # file.
    storage_client = storage.Client.from_service_account_json('service_account.json')

    # Make an authenticated API request
    buckets = list(storage_client.list_buckets())
    print(buckets)
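The same pattern applied to the Translation client might look like the sketch below; service_account.json stands for your current project's key file, and the translate call mirrors the curl test above.
def explicit_translate():
    from google.cloud import translate_v2

    # Explicitly use service account credentials from the key file of the
    # project where the Cloud Translation API is enabled
    translate_client = translate_v2.Client.from_service_account_json('service_account.json')

    # Make an authenticated API request
    result = translate_client.translate('Hello world', target_language='de')
    print(result['translatedText'])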

Running Job On Airflow Based On Webrequest

I wanted to know if airflow tasks can be executed upon getting a request over HTTP. I am not interested in the scheduling part of Airflow. I just want to use it as a substitute for Celery.
So an example operation would be something like this.
User submits a form requesting for some report.
Backend receives the request and sends the user a notification that the request has been received.
The backend then schedules a job using Airflow to run immediately.
Airflow then executes a series of tasks associated with a DAG. For example, pull data from Redshift first, pull data from MySQL, make some operations on the two result sets, combine them, then upload the results to Amazon S3 and send an email.
From whatever I read online, you can run airflow jobs by executing airflow ... on the command line. I was wondering if there is a Python API which can do the same thing.
Thanks.
The Airflow REST API Plugin would help you out here. Once you have followed the instructions for installing the plugin, you just need to hit the following URL: http://{HOST}:{PORT}/admin/rest_api/api/v1.0/trigger_dag?dag_id={dag_id}&run_id={run_id}&conf={url_encoded_json_parameters}, replacing dag_id with the ID of your DAG, either omitting run_id or specifying a unique ID, and passing URL-encoded JSON for conf (with any of the parameters you need in the triggered DAG).
Here is an example JavaScript function that uses jQuery to call the Airflow API:
function triggerDag(dagId, dagParameters){
    var urlEncodedParameters = encodeURIComponent(dagParameters);
    var dagRunUrl = "http://airflow:8080/admin/rest_api/api/v1.0/trigger_dag?dag_id="+dagId+"&conf="+urlEncodedParameters;
    $.ajax({
        url: dagRunUrl,
        dataType: "json",
        success: function(msg) {
            console.log('Successfully started the dag');
        },
        error: function(e){
            console.log('Failed to start the dag');
        }
    });
}
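Since the question asks for a Python API, the same call can be made with the requests library; a minimal sketch, assuming the same host, port and plugin URL as the JavaScript example:
import requests

def trigger_dag(dag_id, dag_parameters):
    # dag_parameters is a JSON string; requests URL-encodes the query parameters
    url = "http://airflow:8080/admin/rest_api/api/v1.0/trigger_dag"
    response = requests.get(url, params={"dag_id": dag_id, "conf": dag_parameters})
    if response.ok:
        print('Successfully started the dag')
    else:
        print('Failed to start the dag')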
A new option in airflow is the experimental, but built-in, API endpoint in the more recent builds of 1.7 and 1.8. This allows you to run a REST service on your airflow server to listen to a port and accept cli jobs.
I only have limited experience myself, but I have run test dags with success. Per the docs:
/api/experimental/dags/<DAG_ID>/dag_runs creates a dag_run for a given dag id (POST).
That will schedule an immediate run of whatever dag you want to run. It does still use the scheduler, though, waiting for a heartbeat to see that the dag is running and to pass tasks to the worker. This is exactly the same behavior as the CLI, so I still believe it fits your use case.
Documentation on how to configure it is available here: https://airflow.apache.org/api.html
There are some simple example clients in the GitHub repo, too, under airflow/api/clients.
You should look at Airflow HTTP Sensor for your needs. You can use this to trigger a dag.
Airflow's experimental REST API can be used for this purpose.
The following request will trigger a DAG:
curl -X POST \
http://<HOST>:8080/api/experimental/dags/process_data/dag_runs \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/json' \
-d '{"conf":"{\"START_DATE\":\"2018-06-01 03:00:00\", \"STOP_DATE\":\"2018-06-01 23:00:00\"}"}'
The following request retrieves a list of DAG runs for a specific DAG ID:
curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X GET http://<HOST>:8080/api/experimental/dags/process_data/dag_runs
For the GET API to work, set the rbac flag to True in airflow.cfg.
The following are the lists of available APIs: here & there.
UPDATE: stable Airflow REST API released:
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
Almost everything stays the same, except the API URL changes.
Also, "conf" is now required to be an object, so I added additional wrapping:
def trigger_dag_v2(self, dag_id, run_id=None, conf=None, execution_date=None):
    endpoint = '/api/v1/dags/{}/dagRuns'.format(dag_id)
    url = urljoin(self._api_base_url, endpoint)
    data = self._request(url, method='POST',
                         json={
                             "run_id": run_id,
                             "conf": {'conf': json.dumps(conf)},
                             "execution_date": execution_date,
                         })
    return data['message']
OLD ANSWER:
Airflow has a REST API (currently experimental), available here:
https://airflow.apache.org/api.html#endpoints
If you do not want to install plugins as suggested in other answers, here is how you can do it directly with the API:
def trigger_dag(self, dag_id, run_id=None, conf=None, execution_date=None):
    endpoint = '/api/experimental/dags/{}/dag_runs'.format(dag_id)
    url = urljoin(self._api_base_url, endpoint)
    data = self._request(url, method='POST',
                         json={
                             "run_id": run_id,
                             "conf": conf,
                             "execution_date": execution_date,
                         })
    return data['message']
More examples of working with the Airflow API in Python are available here:
https://github.com/apache/airflow/blob/master/airflow/api/client/json_client.py
I found this post while trying to do the same thing; after further investigation, I switched to Argo Events. It is basically the same, but based on event-driven flows, so it is much more suitable for this use case.
Link:
https://argoproj.github.io/argo
Airflow now has support for a stable REST API. Using the stable REST API, you can trigger a DAG as follows:
curl --location --request POST 'localhost:8080/api/v1/dags/unpublished/dagRuns' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic YWRtaW46YWRtaW4=' \
--data-raw '{
"dag_run_id": "dag_run_1",
"conf": {
"key": "value"
}
}'
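The equivalent request with the Python requests library, as a sketch (same localhost endpoint, basic-auth credentials and DAG ID as the curl above; YWRtaW46YWRtaW4= is just admin:admin base64-encoded):
import requests

response = requests.post(
    'http://localhost:8080/api/v1/dags/unpublished/dagRuns',
    auth=('admin', 'admin'),  # same credentials as the Basic Authorization header above
    json={
        'dag_run_id': 'dag_run_1',
        'conf': {'key': 'value'},
    },
)
print(response.json())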

Run a curl tlsv1.2 http get request in python?

I have the following command that I run using curl in linux.
curl --tlsv1.2 --cert ~/aws-iot/certs/certificate.pem.crt --key ~/aws-iot/certs/private.pem.key --cacert ~/aws-iot/certs/root-CA.crt -X GET https://data.iot.us-east-1.amazonaws.com:8443/things/pi_3/shadow
This command returns the JSON text that I want. However, I want to be able to run the above command in Python 3. I do not know what library to use in order to get the same JSON response.
P.S. I replace "data" with my account number in AWS to get JSON
After playing around with it on my own, I was able to successfully do it in Python using the requests library.
import requests

s = requests.Session()
r = s.get('https://data.iot.us-east-1.amazonaws.com:8443/things/pi_3/shadow',
          cert=('/home/pi/aws-iot/certs/certificate.pem.crt', '/home/pi/aws-iot/certs/private.pem.key'),
          verify='/home/pi/aws-iot/certs/root-CA.crt')
print(r.text)

How to auth into BigQuery on Google Compute Engine?

What's the easiest way to authenticate into Google BigQuery when on a Google Compute Engine instance?
First of all, make sure that your instance has the scope to access BigQuery - you can decide this only at creation time.
In a bash script, get an OAuth token by calling:
ACCESSTOKEN=`curl -s "http://metadata/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google" | jq ".access_token" | sed 's/"//g'`
echo "retrieved access token $ACCESSTOKEN"
Now let's say you want a list of the datasets in a project:
CURL_URL="https://www.googleapis.com/bigquery/v2/projects/YOURPROJECTID/datasets"
CURL_OPTIONS="-s --header 'Content-Type: application/json' --header 'Authorization: OAuth $ACCESSTOKEN' --header 'x-goog-project-id:YOURPROJECTID' --header 'x-goog-api-version:1'"
CURL_COMMAND="curl --request GET $CURL_URL $CURL_OPTIONS"
CURL_RESPONSE=`eval $CURL_COMMAND`
The response, in JSON format, can be found in the variable CURL_RESPONSE.
PS: I realize now that this question is tagged as Python, but the same principles apply.
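For completeness, the same two calls can be made from Python with the requests library; a rough sketch, reusing the metadata-server endpoint and the YOURPROJECTID placeholder from the bash script above:
import requests

# Get an access token from the metadata server (same endpoint as the bash snippet)
token = requests.get(
    'http://metadata/computeMetadata/v1/instance/service-accounts/default/token',
    headers={'Metadata-Flavor': 'Google'}).json()['access_token']

# List the datasets in the project
datasets = requests.get(
    'https://www.googleapis.com/bigquery/v2/projects/YOURPROJECTID/datasets',
    headers={'Authorization': 'Bearer ' + token}).json()
print(datasets)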
In Python:
AppAssertionCredentials is a Python class that allows a Compute Engine instance to identify itself to Google and other OAuth 2.0 servers, without requiring a flow.
https://developers.google.com/api-client-library/python/
The project id can be read from the metadata server, so it doesn't need to be set as a variable.
https://cloud.google.com/compute/docs/metadata
The following code gets a token using AppAssertionCredentials, the project id from the metadata server, and instantiates a BigqueryClient with this data:
import urllib2

import bigquery_client
from oauth2client import gce

def GetMetadata(path):
    request = urllib2.Request(
        'http://metadata/computeMetadata/v1/%s' % path,
        headers={'Metadata-Flavor': 'Google'})
    return urllib2.urlopen(request).read()

credentials = gce.AppAssertionCredentials(
    scope='https://www.googleapis.com/auth/bigquery')

client = bigquery_client.BigqueryClient(
    credentials=credentials,
    api='https://www.googleapis.com',
    api_version='v2',
    project_id=GetMetadata('project/project-id'))
For this to work, you need to give the GCE instance access to the BigQuery API when creating it:
gcloud compute instances create <your_instance_name> --scopes storage-ro,bigquery
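These days, an easier option (not covered in the original answers) is the google-cloud-bigquery client library, which picks up the instance's default service account automatically via Application Default Credentials; a minimal sketch, assuming the library is installed and the instance has a BigQuery scope:
from google.cloud import bigquery

# Uses Application Default Credentials, i.e. the instance's service account
client = bigquery.Client()

for dataset in client.list_datasets():
    print(dataset.dataset_id)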
