Creating Airflow DAGs on GCP Composer - python

I just learnt about GCP Composer and am trying to move the DAGs from my local Airflow instance to the cloud, and I had a couple of questions about the transition.
In my local instance I used the HiveOperator to read data from Hive, create tables, and write the results back into Hive. If I had to do this in GCP, how would that work? Would I have to upload my data to a Google Cloud Storage bucket, and does the HiveOperator work in GCP?
I also have a DAG which uses a sensor to check if another DAG is complete. Is that possible on Composer?

Yes, Cloud Composer is just managed Apache Airflow, so you can do that.
Make sure that you use the same version of Airflow that you used locally. Cloud Composer currently supports Airflow 1.9.0 and 1.10.0.
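As a rough illustration, a HiveOperator task in a Composer DAG might look like the sketch below (Airflow 1.10 import path); the connection ID, database, and HQL are placeholders, not taken from the question.

from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

dag = DAG(
    dag_id="hive_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Runs HQL against the Hive connection configured under Admin --> Connections;
# "hive_cli_default" is a placeholder connection ID.
create_table = HiveOperator(
    task_id="create_table",
    hql="CREATE TABLE IF NOT EXISTS my_db.my_table AS SELECT * FROM my_db.source_table",
    hive_cli_conn_id="hive_cli_default",
    dag=dag,
)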

Composer has a connection store. See the menu Admin --> Connections to check which connection types are available.
Sensors are available.
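For example, a minimal sketch of waiting for another DAG with ExternalTaskSensor (Airflow 1.10 import path; the DAG and task IDs are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag = DAG(
    dag_id="downstream_dag",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Waits until the named task in the other DAG has succeeded for the same
# execution date; set external_task_id=None to wait for the whole DAG run.
wait_for_upstream = ExternalTaskSensor(
    task_id="wait_for_upstream",
    external_dag_id="upstream_dag",
    external_task_id="final_task",
    dag=dag,
)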

Related

BigQuery emulator with a local instance of Apache Airflow

I'm working on a project for integrating different data sources into Google BigQuery.
It is a batch approach.
We are using Apache Airflow for orchestration.
Simplified flow is: create raw tables (predefined DDL) -> call python code to do batch insert (via bq python client) -> trigger different SQL for transformations -> end
For testing purposes, we're using the GCP dev project.
But recently I found the BigQuery Emulator.
The Python client example works just fine (see: BigQuery Emulator: Call endpoint from python client).
I'm curious about how to configure the local instance of Airflow to use this emulator.
I didn't find a way to point BigQueryExecuteQueryOperator to use the emulator. We are using this operator to trigger all our SQL.
I tried to set 'gcp_conn_id' but it always fails with "Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started"
As the connection type I tried both HTTP and Google BigQuery, with no difference.
Airflow version: 2.3.4
bigquery-emulator: 0.1.11
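For context, the operator configuration described above looks roughly like this sketch (the connection ID and SQL are placeholders; as noted, this setup still fails with the credentials error):

from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

# "bigquery_emulator" is a hypothetical connection ID defined under
# Admin --> Connections and intended to point at the emulator endpoint.
run_sql = BigQueryExecuteQueryOperator(
    task_id="run_transformation",
    sql="SELECT 1",
    use_legacy_sql=False,
    gcp_conn_id="bigquery_emulator",
)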

How to mock and test Databricks PySpark notebooks locally

How can I mock existing Azure Databricks PySpark code from a project (written by others) and run it locally on a Windows machine/Anaconda to test and practice?
Is it possible to mock the code, or do I need to create a new cluster on Databricks for my own testing purposes?
How can I connect to the storage account, use the Databricks Utilities, etc.? I only have experience with Python & GCP; I just joined a Databricks project and need to run the cells one by one to see the results and modify them if required.
Thanks
You can test/run PySpark code from your IDE by installing PySpark on your local computer.
To use the Databricks Utilities, however, you would in fact need a Databricks instance; they are not available locally. You can try Databricks Community Edition for free, but with some limitations.
Accessing a cloud storage account can be done locally from your computer or from your own Databricks instance. In both cases you will have to set up the endpoint of the storage account using its secrets.
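For instance, a minimal local session for testing, assuming PySpark has been installed with pip or conda (this does not emulate dbutils or the Databricks runtime):

from pyspark.sql import SparkSession

# Local-mode session using all available cores; no cluster required.
spark = SparkSession.builder.master("local[*]").appName("local-test").getOrCreate()

# Small in-memory DataFrame to exercise the code under test.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()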

Passing AWS credentials to Google Cloud Dataflow, Python

I use Google Cloud Dataflow implementation in Python on Google Cloud Platform.
My idea is to use input from AWS S3.
Google Cloud Dataflow (which is based on Apache Beam) supports reading files from S3.
However, I cannot find in the documentation the best way to pass credentials to a job.
I tried adding AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to environment variables within setup.py file.
It works locally, but when I package the Cloud Dataflow job as a template and trigger it on GCP, it sometimes works and sometimes doesn't, raising a "NoCredentialsError" exception and causing the job to fail.
Is there any coherent, best-practice solution to pass AWS credentials to Python Google Cloud Dataflow job on GCP?
The options to configure this have finally been added. They are available in Beam 2.26.0 and later.
The pipeline options are --s3_access_key_id and --s3_secret_access_key.
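A minimal sketch of passing them programmatically (assuming the apache-beam[aws] extras are installed; project, bucket, and key values are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The S3 credentials are regular pipeline options, so they travel with the
# job instead of relying on environment variables set in setup.py.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/temp",
    "--s3_access_key_id=YOUR_ACCESS_KEY_ID",
    "--s3_secret_access_key=YOUR_SECRET_ACCESS_KEY",
])

with beam.Pipeline(options=options) as p:
    lines = p | "ReadFromS3" >> beam.io.ReadFromText("s3://my-bucket/input/*.csv")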
Unfortunately, the Beam 2.25.0 release and earlier don't have a good way of doing this, other than the following:
In this thread a user figured out how to do it in the setup.py file that they provide to Dataflow in their pipeline.

What is a convenient way to deploy and manage execution of a Python SDK Apache Beam pipeline for Google cloud Dataflow

Once an Apache Beam pipeline has been designed and tested in Google's Cloud Dataflow using the Python SDK and DataflowRunner, what is a convenient way to have it in Google Cloud and manage its execution?
Should it be somehow packaged? Uploaded to Google Storage? Turned into a Dataflow template? How can one schedule its execution beyond a developer running it from their development environment?
Update
Preferably without third-party tools or the need for additional management tools/infrastructure beyond Google Cloud, and Dataflow in particular.
Intuitively you'd expect the "Deploying a pipeline" section under the How-to guides of the Dataflow documentation to cover this. But you only find an explanation eight sections further down, in the "Templates overview" section.
According to that section:
Cloud Dataflow templates introduce a new development and execution workflow that differs from traditional job execution workflow. The template workflow separates the development step from the staging and execution steps.
Strictly speaking, you do not deploy your Dataflow pipeline to Google Cloud and execute it there. But if you need to share the execution of a pipeline with non-technical members of your cloud project, or simply want to trigger it without depending on a development environment or third-party tools, then Dataflow templates are what you need.
Once a pipeline has been developed and tested, you can create a Dataflow job template from it.
Please note that:
To create templates with the Cloud Dataflow SDK 2.x for Python, you must have version 2.0.0 or higher.
You will need to execute your pipeline using DataflowRunner with pipeline options that generate a template on Google Cloud Storage rather than running the job.
For more details refer to the creating templates documentation section, and to run a pipeline from a template refer to the executing templates section.
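As an illustrative sketch (project, bucket, and pipeline contents are placeholders), setting --template_location makes DataflowRunner stage a template file on Cloud Storage instead of launching the job:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=us-central1",
    "--staging_location=gs://my-bucket/staging",
    "--temp_location=gs://my-bucket/temp",
    # Writing the template here is what turns this run into template creation.
    "--template_location=gs://my-bucket/templates/my_pipeline_template",
])

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )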
I'd say the most convenient way is to use Airflow. This allows you to author, schedule, and monitor workflows. The Dataflow Operator can start your designed data pipeline. Airflow can be started either on a small VM, or by using Cloud Composer, which is a tool on the Google Cloud Platform.
There are more options to automate your workflow, such as Jenkins, Azkaban, Rundeck, or even running a simple cron job (which I'd discourage you from using). You might want to take a look at these options as well, but Airflow probably fits your needs.
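For instance, a sketch of triggering a staged template from Airflow with the Google provider's templated-job operator (template path, project, and parameters are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

dag = DAG(
    dag_id="run_dataflow_template",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
)

# Launches a Dataflow job from the template staged on Cloud Storage.
start_job = DataflowTemplatedJobStartOperator(
    task_id="start_job",
    template="gs://my-bucket/templates/my_pipeline_template",
    project_id="my-gcp-project",
    location="us-central1",
    parameters={"input": "gs://my-bucket/input/*.txt"},
    dag=dag,
)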

How to store data in GCS while accessing it from GAE and 'GCE' locally

There's a GAE project using GCS to store/retrieve files. These files also need to be read by code that will run on GCE (it needs C++ libraries and therefore cannot run on GAE).
In production, deployed on the actual GAE > GCS < GCE, this setup works fine.
However, testing and developing locally is a different story that I'm trying to figure out.
As recommended, I'm running GAE's dev_appserver with GoogleAppEngineCloudStorageClient to access the (simulated) GCS. Files are put in the local blobstore. Great for testing GAE.
Since there is no GCE SDK to run a VM locally, whenever I refer to the local 'GCE' it's just my local development machine running Linux.
On the local GCE side I'm just using the default boto library (https://developers.google.com/storage/docs/gspythonlibrary) with a Python 2.x runtime to interface with the C++ code and retrieve files from GCS. However, in development, these files are inaccessible from boto because they're stored in the dev_appserver's blobstore.
Is there a way to properly connect the local GAE and GCE to a local GCS?
For now, I gave up on the local GCS part and tried using the real GCS. The GCE part with boto is easy. The GAE part can also use the real GCS instead of the local blobstore by setting an access_token:
cloudstorage.common.set_access_token(access_token)
According to the docs:
access_token: you can get one by run 'gsutil -d ls' and copy the str after 'Bearer'.
That token works for a limited amount of time, so that's not ideal. Is there a way to set a more permanent access_token?
There is a convenient option to access Google Cloud Storage from a development environment: use the client library provided with the Google Cloud SDK. After executing gcloud init locally you get access to your resources.
As shown in the examples for client library authentication:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# Get the application default credentials. When running locally, these are
# available after running `gcloud init`. When running on Compute
# Engine, these are available from the environment.
credentials = GoogleCredentials.get_application_default()

# Construct the service object for interacting with the Cloud Storage API -
# the 'storage' service, at version 'v1'.
# You can browse other available API services and versions here:
# https://developers.google.com/api-client-library/python/apis/
service = discovery.build('storage', 'v1', credentials=credentials)
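For example, the service object can then list objects in a bucket (the bucket name is a placeholder):

# Lists object names in the bucket via the Cloud Storage JSON API.
response = service.objects().list(bucket='my-bucket').execute()
for item in response.get('items', []):
    print(item['name'])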
Google libraries come and go like tourists in a train station. Today (2020), google-cloud-storage should work on GCE and in the GAE Standard Environment with Python 3.
On GAE and GCE it picks up access credentials from the environment, and locally you can provide it with a service account JSON file like this:
GOOGLE_APPLICATION_CREDENTIALS=../sa-b0af54dea5e.json
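A minimal sketch with that client (the bucket and object names are placeholders); with GOOGLE_APPLICATION_CREDENTIALS set locally, or the metadata server on GCE/GAE, no explicit credential handling is needed:

from google.cloud import storage

# The client picks up credentials from the environment automatically.
client = storage.Client()
bucket = client.bucket("my-bucket")
bucket.blob("path/to/file.txt").upload_from_string("hello")

for blob in client.list_blobs("my-bucket"):
    print(blob.name)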
If you're always using "real" remote GCS, the newer gcloud is probably the best library: http://googlecloudplatform.github.io/gcloud-python/
It's really confusing how many storage client libraries there are for Python. Some are for AE only, but they often force (or at least default to) using the local mock Blobstore when running with dev_appserver.py.
Seems like gcloud is always using the real GCS, which is what I want.
It also "magically" fixes authentication when running locally.
It looks like appengine-gcs-client for Python is now only useful for production App Engine and inside dev_appserver.py, and the local examples for it have been removed from the developer docs in favor of Boto :( If you are deciding not to use the local GCS emulation, it's probably best to stick with Boto for both local testing and GCE.
If you still want to use 'google.appengine.ext.cloudstorage' though, access tokens always expire, so you'll need to refresh them manually. Given your setup, honestly the easiest thing to do is just call 'gsutil -d ls' from Python and parse the output to get a new token from your local credentials. You could use the API Client Library to get a token in a more 'correct' fashion, but at that point things would be getting so roundabout you might as well just use Boto.
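A rough sketch of that workaround (Python 2 to match the dev_appserver environment; fragile and for local development only):

import re
import subprocess

import cloudstorage

# gsutil -d prints its requests, including the Authorization: Bearer header.
output = subprocess.check_output(['gsutil', '-d', 'ls'], stderr=subprocess.STDOUT)
match = re.search(r'Bearer\s+(\S+)', output)
if match:
    cloudstorage.common.set_access_token(match.group(1))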
There is a Google Cloud Storage local / development server for this purpose: https://developers.google.com/datastore/docs/tools/devserver
Once you have set it up, create a dataset and start the GCS development server
gcd.sh create [options] <dataset-directory>
gcd.sh start [options] <dataset-directory>
Export the environment variables
export DATASTORE_HOST=http://yourmachine:8080
export DATASTORE_DATASET=<dataset_id>
Then you should be able to use the datastore connection in your code, locally.
