I'm working on a project for integrating different data sources into Google BigQuery.
It is a batch approach.
We are using Apache Airflow for orchestration.
The simplified flow is: create raw tables (predefined DDL) -> call Python code to do a batch insert (via the BigQuery Python client) -> trigger different SQL for transformations -> end
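For illustration, here is a minimal sketch of what such a DAG could look like in Airflow 2.x; the DAG id, SQL file paths, schedule, and the batch-insert callable are placeholders rather than the actual project code:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

def batch_insert():
    # Placeholder for the batch insert done via the BigQuery Python client.
    pass

with DAG(
    dag_id="batch_ingest",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_raw_tables = BigQueryExecuteQueryOperator(
        task_id="create_raw_tables",
        sql="sql/create_raw_tables.sql",  # predefined DDL
        use_legacy_sql=False,
    )
    load_raw = PythonOperator(task_id="batch_insert", python_callable=batch_insert)
    transform = BigQueryExecuteQueryOperator(
        task_id="transform",
        sql="sql/transformations.sql",
        use_legacy_sql=False,
    )
    create_raw_tables >> load_raw >> transform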
For testing purposes, we're using the GCP dev project.
But recently I found the BigQuery Emulator.
The Python client example works just fine: BigQuery Emulator: Call endpoint from python client.
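For context, connecting the plain Python client to the emulator looks roughly like this (a sketch assuming the emulator listens on its default REST port 9050 and was started with a project named test-project):

from google.api_core.client_options import ClientOptions
from google.auth.credentials import AnonymousCredentials
from google.cloud import bigquery

# Point the client at the local emulator instead of the real BigQuery API.
client = bigquery.Client(
    project="test-project",
    client_options=ClientOptions(api_endpoint="http://localhost:9050"),
    credentials=AnonymousCredentials(),
)

for row in client.query("SELECT 1 AS x").result():
    print(row.x)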
I'm curious about how to configure the local instance of Airflow to use this emulator.
I didn't find a way to point BigQueryExecuteQueryOperator to use the emulator. We are using this operator to trigger all our SQL.
I tried to set 'gcp_conn_id' but it always fails with "Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started"
As the connection type I tried both HTTP and Google BigQuery, with no difference.
Airflow version: 2.3.4
bigquery-emulator: 0.1.11
Related
I am trying to access BigQuery using Python. Even after executing "gcloud auth login", I am getting the error below:
google.auth.exceptions.ReauthFailError: Reauthentication failed. Reauthentication challenge could not be answered because you are not in an interactive session.
What can be the issue here?
You can solve this problem by creating a service account and setting up the Cloud SDK to use it.
Example command:
gcloud auth activate-service-account account-name --key-file=/fullpath/service-account.json
Another way is to set an environment variable for the Python script to use when accessing BigQuery.
Example command:
export GOOGLE_APPLICATION_CREDENTIALS=/fullpath/service-account.json
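If you prefer to load the key file explicitly in code instead of relying on the environment variable, a minimal sketch (the key-file path is a placeholder) would be:

from google.cloud import bigquery
from google.oauth2 import service_account

# Build credentials directly from the service-account key file.
credentials = service_account.Credentials.from_service_account_file(
    "/fullpath/service-account.json"
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
print(list(client.query("SELECT 1").result()))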
I'm trying to run a Python script file from the AWS CLI. Does anyone have the syntax for that, please? I've tried a few variations, but without success:
aws ssm send-command --document-name "AWS-RunShellScript" --parameters commands=["/Documents/aws_instances_summary.py"]
I'm not looking to connect to a particular EC2 instance, as the script gathers information about all instances.
aws ssm send-command runs the command on an EC2 instance, not on your local computer.
From your comments, it looks like you are actually trying to determine how to configure the AWS SDK for Python (Boto3) with AWS API credentials, so you can run the script from your local computer and get information about the AWS account.
You would not use the AWS CLI tool at all for this purpose. Instead you would simply run the Python script directly, having configured the appropriate environment variables, or ~/.aws/credentials file, on your local computer with the API credentials. Please see the official documentation for configuring AWS API credentials for Boto3.
A minimal example would look something like this:
export AWS_ACCESS_KEY_ID=your_access_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_access_key
python aws_instances_summary.py
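The script itself would then use Boto3 directly. A hypothetical sketch of what aws_instances_summary.py might contain (the region is a placeholder; the actual script's contents are unknown):

import boto3

# Boto3 picks up the credentials exported above (or ~/.aws/credentials).
ec2 = boto3.client("ec2", region_name="us-east-1")

for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"],
              instance["InstanceType"],
              instance["State"]["Name"])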
I use the Google Cloud Dataflow implementation in Python on Google Cloud Platform.
My idea is to use input from AWS S3.
Google Cloud Dataflow (which is based on Apache Beam) supports reading files from S3.
However, I cannot find in the documentation the best way to pass credentials to a job.
I tried adding AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to environment variables within setup.py file.
This works locally, but when I package the Cloud Dataflow job as a template and trigger it to run on GCP, it sometimes works and sometimes doesn't, raising a "NoCredentialsError" exception and causing the job to fail.
Is there any coherent, best-practice solution to pass AWS credentials to Python Google Cloud Dataflow job on GCP?
These configuration options have finally been added; they are available in Beam 2.26.0 and later.
The pipeline options are --s3_access_key_id and --s3_secret_access_key.
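A minimal sketch of passing them to a pipeline (project, region, and bucket names are placeholders, and it assumes the AWS extra, apache-beam[aws], is installed):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=us-central1",
    "--temp_location=gs://my-gcs-bucket/tmp",
    "--s3_access_key_id=YOUR_ACCESS_KEY_ID",
    "--s3_secret_access_key=YOUR_SECRET_ACCESS_KEY",
])

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromS3" >> beam.io.ReadFromText("s3://my-s3-bucket/input/*.csv")
     | "WriteToGCS" >> beam.io.WriteToText("gs://my-gcs-bucket/output/part"))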
Unfortunately, the Beam 2.25.0 release and earlier don't have a good way of doing this, other than the following:
In this thread a user figured out how to do it in the setup.py file that they provide to Dataflow in their pipeline.
I just learnt about GCP Composer and am trying to move the DAGs from my local Airflow instance to the cloud, and I have a couple of questions about the transition.
In my local instance I used the HiveOperator to read data from Hive, create tables, and write back into Hive. If I had to do this in GCP, how would it be possible? Would I have to upload my data to a Google Cloud Storage bucket, and does the HiveOperator work in GCP?
I have a DAG which uses a sensor to check if another DAG is complete; is that possible on Composer?
Yes, Cloud Composer is just managed Apache Airflow so you can do that.
Make sure that you use the same version of Airflow that you used locally. Cloud Composer supports Airflow 1.9.0 and 1.10.0 currently.
Composer has a connection store; see the Admin -> Connections menu to check which connection types are available.
Sensors are available.
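For the cross-DAG check, the stock ExternalTaskSensor can be used. A rough sketch (DAG ids and the upstream task id are placeholders; the import path assumes Airflow 1.10):

from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG("downstream_dag",
         start_date=datetime(2018, 1, 1),
         schedule_interval="@daily") as dag:
    # Waits until the given task in the other DAG has succeeded for the
    # matching execution date.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",
        external_task_id="final_task",
        poke_interval=60,
    )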
There's a GAE project using the GCS to store/retrieve files. These files also need to be read by code that will run on GCE (needs C++ libraries, so therefore not running on GAE).
In production, deployed on the actual GAE > GCS < GCE, this setup works fine.
However, testing and developing locally is a different story that I'm trying to figure out.
As recommended, I'm running GAE's dev_appserver with GoogleAppEngineCloudStorageClient to access the (simulated) GCS. Files are put in the local blobstore. Great for testing GAE.
Since there is no GCE SDK to run a VM locally, whenever I refer to the local 'GCE', it's just my local development machine running Linux.
On the local GCE side I'm just using the default boto library (https://developers.google.com/storage/docs/gspythonlibrary) with a Python 2.x runtime to interface with the C++ code and retrieve files from GCS. However, in development, these files are inaccessible from boto because they're stored in the dev_appserver's blobstore.
Is there a way to properly connect the local GAE and GCE to a local GCS?
For now, I gave up on the local GCS part and tried using the real GCS. The GCE part with boto is easy. The GAE part can also be made to use the real GCS instead of the local blobstore by setting an access_token:
cloudstorage.common.set_access_token(access_token)
According to the docs:
access_token: you can get one by run 'gsutil -d ls' and copy the
str after 'Bearer'.
That token works for a limited amount of time, so that's not ideal. Is there a way to set a more permanent access_token?
There is a convenient option for accessing Google Cloud Storage from a development environment: use the client library provided with the Google Cloud SDK. After executing gcloud init locally, you get access to your resources.
As shown in the Client library authentication examples:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# Get the application default credentials. When running locally, these are
# available after running `gcloud init`. When running on compute
# engine, these are available from the environment.
credentials = GoogleCredentials.get_application_default()

# Construct the service object for interacting with the Cloud Storage API -
# the 'storage' service, at version 'v1'.
# You can browse other available api services and versions here:
# https://developers.google.com/api-client-library/python/apis/
service = discovery.build('storage', 'v1', credentials=credentials)
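As a usage example with that service object (the bucket name is a placeholder), listing objects could look like:

response = service.objects().list(bucket='my-bucket').execute()
for obj in response.get('items', []):
    print(obj['name'])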
Google libraries come and go like tourists in a train station. Today (2020), google-cloud-storage should work on GCE and in the GAE Standard Environment with Python 3.
On GAE and GCE it picks up access credentials from the environment, and locally you can provide it with a service account JSON file like this:
GOOGLE_APPLICATION_CREDENTIALS=../sa-b0af54dea5e.json
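Usage is then the same locally and on GCE/GAE; a minimal sketch with a placeholder bucket and object name:

from google.cloud import storage

# Locally this uses the key file pointed to by GOOGLE_APPLICATION_CREDENTIALS;
# on GCE/GAE it uses the credentials available in the environment.
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("path/to/file.txt")
blob.download_to_filename("file.txt")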
If you're always using "real" remote GCS, the newer gcloud is probably the best library: http://googlecloudplatform.github.io/gcloud-python/
It's really confusing how many storage client libraries there are for Python. Some are for AE only, but they often force (or at least default to) using the local mock Blobstore when running with dev_appserver.py.
Seems like gcloud is always using the real GCS, which is what I want.
It also "magically" fixes authentication when running locally.
It looks like appengine-gcs-client for Python is now only useful for production App Engine and inside dev_appserver.py, and the local examples for it have been removed from the developer docs in favor of Boto :( If you are deciding not to use the local GCS emulation, it's probably best to stick with Boto for both local testing and GCE.
If you still want to use 'google.appengine.ext.cloudstorage', though, access tokens always expire, so you'll need to refresh them manually. Given your setup, honestly the easiest thing to do is just call 'gsutil -d ls' from Python and parse the output to get a new token from your local credentials. You could use the API Client Library to get a token in a more 'correct' fashion, but at that point things would be getting so roundabout you might as well just be using Boto.
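A rough sketch of that parse-the-gsutil-output idea (the Bearer-token pattern is an assumption based on gsutil's debug output, so treat it as illustrative only):

import re
import subprocess

import cloudstorage

# 'gsutil -d' prints the HTTP request headers, including an
# "Authorization: Bearer <token>" line, in its debug output.
output = subprocess.check_output(['gsutil', '-d', 'ls'], stderr=subprocess.STDOUT)
match = re.search(r"Bearer\s+(\S+)", output.decode('utf-8', 'replace'))
if match:
    cloudstorage.common.set_access_token(match.group(1))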
There is a Google Cloud Storage local / development server for this purpose: https://developers.google.com/datastore/docs/tools/devserver
Once you have set it up, create a dataset and start the GCS development server
gcd.sh create [options] <dataset-directory>
gcd.sh start [options] <dataset-directory>
Export the environment variables
export DATASTORE_HOST=http://yourmachine:8080
export DATASTORE_DATASET=<dataset_id>
Then you should be able to use the datastore connection in your code, locally.