Passing AWS credentials to Google Cloud Dataflow, Python

I use Google Cloud Dataflow implementation in Python on Google Cloud Platform.
My idea is to use input from AWS S3.
Google Cloud Dataflow (which is based on Apache Beam) supports reading files from S3.
However, I cannot find in the documentation the best way to pass credentials to a job.
I tried adding AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables within the setup.py file.
This works locally, but when I package the Cloud Dataflow job as a template and trigger it to run on GCP, it sometimes works and sometimes doesn't, raising a "NoCredentialsError" exception and causing the job to fail.
Is there any coherent, best-practice way to pass AWS credentials to a Python Google Cloud Dataflow job on GCP?

The options to configure this have finally been added. They are available in Beam 2.26.0 and later.
The pipeline options are --s3_access_key_id and --s3_secret_access_key.
Unfortunately, the Beam 2.25.0 release and earlier don't have a good way of doing this, other than the following:
In this thread a user figured out how to do it in the setup.py file that they provide to Dataflow in their pipeline.
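For Beam 2.26.0 and later, here is a minimal sketch of passing these options when constructing the pipeline (assuming the S3 filesystem dependencies, e.g. the apache-beam[aws] extra, are installed; the credentials and S3 path below are placeholders, and in practice you would load the secret from a secret manager rather than hard-coding it):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder credentials; load them from a secret store at launch time instead.
options = PipelineOptions(
    s3_access_key_id='YOUR_AWS_ACCESS_KEY_ID',
    s3_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY',
)
with beam.Pipeline(options=options) as p:
    lines = p | beam.io.ReadFromText('s3://my-bucket/input/*.csv')  # hypothetical S3 path
The same options can also be passed on the command line as --s3_access_key_id and --s3_secret_access_key when running or templating the pipeline.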

Authenticate Google Cloud Storage Python client with gsutil-generated boto file

I'm trying to automate report downloading from Google Play (through Cloud Storage) using the Google Cloud Python client library. From the docs, I found that it's possible to do this using gsutil. I found that this question has been answered here, but I also found that the Client infers credentials from the environment, and I plan to do this on an automation platform with (presumably) no gcloud credentials set.
I've found that you can generate a gsutil boto file and then use it as a credential, but how can I load this into the client library?
This is not exactly a direct answer to your question, but the best way would be to create a service account in GCP, and then use the service account's JSON keyfile to interact with GCS. See this documentation on how to generate said keyfile.
NOTE: You should treat this keyfile as a password, as it will have the access you give it in the step below. So no uploading to public GitHub repos, for example.
You'll also have to give the service account the Storage Object Viewer role, or one with more permissions.
NOTE: Always grant the least access needed, for security reasons.
The code for this is extremely simple. Note that this is extremely similar to the methods mentioned in the link for generating the keyfile, the exception being the way the client is instantiated.
requirements.txt
google-cloud-storage
code
from google.cloud import storage
cred_json_file_path = 'path/to/file/credentials.json'
client = storage.Client.from_service_account_json(cred_json_file_path)
If you want to use the general Google API Python client library, you can do a similar instantiation of a credentials object using the JSON keyfile. For GCS, though, the google-cloud-storage library is very much preferred, as it does some work behind the scenes; the API Python client library is a very generic one that can (theoretically) be used with all Google APIs.
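For completeness, here is a minimal sketch of that generic approach (the keyfile path is a placeholder), using google-auth and google-api-python-client:
from google.oauth2 import service_account
from googleapiclient import discovery

# Placeholder path to the service account keyfile
credentials = service_account.Credentials.from_service_account_file('path/to/file/credentials.json')
service = discovery.build('storage', 'v1', credentials=credentials)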
gsutil will look for a .boto file in the home directory of the user invoking it, i.e. ~/.boto on Linux and macOS, and %HOMEDRIVE%%HOMEPATH% on Windows.
Alternately, you can set the BOTO_CONFIG environment variable to the path of the .boto file you want to use. Here's an example:
BOTO_CONFIG=/path/to/your_generated_boto_file.boto gsutil -m cp files gs://bucket
You can generate a .boto file with a service account by using the "-e" flag with the config command: gsutil config -e.
Also note that if gsutil is installed with the gcloud command, gcloud will share its authentication config with gsutil unless you disable that behavior with this command: gcloud config set pass_credentials_to_gsutil false.
https://cloud.google.com/storage/docs/boto-gsutil

Amazon cdk and boto3 difference

I am new to AWS with Python. I came across boto3 initially; later someone suggested CDK. What is the difference between AWS CDK and boto3?
In simple terms, CDK helps you programmatically create AWS resources (Infrastructure as Code), while boto3 helps you programmatically access AWS services.
Here is a snippet on CDK and Boto3 from the AWS reference links:
CDK:
The AWS Cloud Development Kit (AWS CDK) is an open source software development framework to define your cloud application resources using familiar programming languages. AWS CDK provisions your resources in a safe, repeatable manner through AWS CloudFormation. It also enables you to compose and share your own custom constructs that incorporate your organization's requirements, helping you start new projects faster. (Reference: https://aws.amazon.com/cdk/)
With CDK and Cloudformation, you will get the benefits of repeatable deployment, easy rollback, and drift detection. (Reference: https://aws.amazon.com/cdk/features/)
Boto3:
Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.
(Reference: https://pypi.org/project/boto3/)
Welcome to Stack Overflow and to AWS usage!
boto3 is the Python SDK for AWS. It is useful for enabling your software to leverage other AWS services.
use case example: your code has to put an object in an S3 bucket (in other words, store a file).
aws-cdk is a framework that helps you provision infrastructure in an IaC (Infrastructure as Code) manner.
use case example: describe and provision your application infrastructure (e.g. a lambda function and an S3 bucket).
In many projects you will use both.
You can find an example URL shortener that uses boto3 and aws-cdk here. The URL shortener uses boto3 in order to access a DynamoDB table and aws-cdk in order to provision the whole infrastructure (including the lambda function which uses boto3).
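As a minimal sketch of the first use case above, putting an object in an S3 bucket (the bucket and key names are placeholders):
import boto3

# Credentials are resolved from the environment, ~/.aws/credentials, or an IAM role.
s3 = boto3.client('s3')
s3.put_object(Bucket='my-example-bucket',  # hypothetical bucket name
              Key='hello.txt',
              Body=b'Hello from boto3')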
You're creating an application that needs to use AWS services and resources. Should you use cdk or boto-3?
Consider if your application needs AWS services and resources at build time or run time.
Build time: you need the AWS resources to be available IN ORDER TO build the application.
Run time: you need the AWS resources to be available via API call when your application is up and running.
AWS CDK sets up the infrastructure your application needs in order to run.
The AWS SDK complements your application, providing business logic and making AWS services available through your application.
Another point to add is that AWS CDK manages the state of your deployed resources internally, thus allowing you to keep track of what has been deployed and to specify the desired state of your final deployed resources.
On the other hand, if you're using AWS SDK, you have to store and manage the state of the deployed resources (deployed using AWS SDK) yourself.
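For contrast with the boto3 example above, here is a minimal build-time sketch (assuming CDK v2 for Python; the stack and bucket names are placeholders) that defines an S3 bucket to be provisioned through CloudFormation:
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3

app = cdk.App()
stack = cdk.Stack(app, 'MyExampleStack')  # hypothetical stack name
s3.Bucket(stack, 'MyExampleBucket')       # the bucket is provisioned via CloudFormation
app.synth()                               # emits the CloudFormation template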
I am also new to AWS; here is my understanding of the relevant AWS services and boto3:
AWS Cloud Development Kit (CDK) is a software library, available in different programming languages, to define and provision cloud infrastructure through AWS CloudFormation.
Boto3 is a Python software development kit (SDK) to create, configure, and manage AWS services.
AWS CloudFormation is a low-level service to create a collection of related AWS and third-party resources, and provision and manage them in an orderly and predictable fashion.
AWS Elastic Beanstalk is a high-level service to deploy and run applications in the cloud easily, and sits on top of AWS CloudFormation.

What is a convenient way to deploy and manage execution of a Python SDK Apache Beam pipeline for Google cloud Dataflow

Once an Apache Beam pipeline has been designed and tested in Google Cloud Dataflow using the Python SDK and DataflowRunner, what is a convenient way to have it in Google Cloud and manage its execution?
Should it be packaged somehow? Uploaded to Google Storage? Turned into a Dataflow template? How can one schedule its execution beyond a developer executing it from their development environment?
Update
Preferably without 3rd-party tools or a need for additional management tools/infrastructure beyond Google Cloud, and Dataflow in particular.
Intuitively you’d expect the “deploying a pipeline” section under the How-to guides of the Dataflow documentation to cover that. But you find an explanation of that only eight sections below, in the “templates overview” section.
According to that section:
Cloud Dataflow templates introduce a new development and execution workflow that differs from traditional job execution workflow. The template workflow separates the development step from the staging and execution steps.
In the trivial case, you do not deploy and execute your Dataflow pipeline from Google Cloud. But if you need to share the execution of a pipeline with non-technical members of your cloud project, or simply want to trigger it without depending on a development environment or 3rd-party tools, then Dataflow templates are what you need.
Once a pipeline is developed and tested, you can create a Dataflow job template from it.
Please note that:
To create templates with the Cloud Dataflow SDK 2.x for Python, you must have version 2.0.0 or higher.
You will need to execute your pipeline using DataflowRunner with pipeline options that generate a template on Google Cloud Storage rather than running the job.
For more details, refer to the creating templates documentation section, and to run it from a template, refer to the executing templates section.
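A minimal sketch of templating a pipeline (the project, region, and bucket names are placeholders); the key detail is the template_location option, which makes DataflowRunner stage a template instead of launching a job:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project, region, and bucket names
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    staging_location='gs://my-bucket/staging',
    template_location='gs://my-bucket/templates/my_template',  # stage a template instead of running
)

with beam.Pipeline(options=options) as p:
    (p | beam.Create(['hello', 'world'])
       | beam.Map(lambda word: word.upper()))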
I'd say the most convenient way is to use Airflow. This allows you to author, schedule, and monitor workflows. The Dataflow Operator can start your designed data pipeline. Airflow can be started either on a small VM, or by using Cloud Composer, which is a tool on the Google Cloud Platform.
There are more options to automate your workflow, such as Jenkins, Azkaban, Rundeck, or even running a simple cronjob (which I'd discourage you from using). You might want to take a look at these options as well, but Airflow probably fits your needs.
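For illustration, a minimal sketch of an Airflow DAG that launches a Dataflow template (using DataflowTemplatedJobStartOperator from the apache-airflow-providers-google package; the project, bucket, and template names are placeholders):
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplatedJobStartOperator

# Hypothetical project, bucket, and template names
with DAG('run_dataflow_template', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    start_template_job = DataflowTemplatedJobStartOperator(
        task_id='start_template_job',
        template='gs://my-bucket/templates/my_template',
        project_id='my-project',
        location='us-central1',
        parameters={},  # runtime parameters for the template, if any
    )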

Permissions error with Apache Beam example on Google Dataflow

I'm having trouble submitting an Apache Beam example from a local machine to our cloud platform.
Using gcloud auth list I can see that the correct account is currently active. I can use gsutil and the web client to interact with the file system. I can use the cloud shell to run pipelines through the python REPL.
But when I try and run the python wordcount example I get the following error:
IOError: Could not upload to GCS path gs://my_bucket/tmp: access denied.
Please verify that credentials are valid and that you have write access
to the specified path.
Is there something I am missing with regards to the credentials?
Here are my two cents after spending the whole morning on the issue.
You should make sure that you log in with gcloud on your local machine; however, pay attention to the warning message returned by gcloud auth login:
WARNING: `gcloud auth login` no longer writes application default credentials.
These application default credentials are required for the Python code to authenticate properly.
The solution is rather simple; just use:
gcloud auth application-default login
This will write a credentials file under ~/.config/gcloud/application_default_credentials.json, which is used for authentication in the local development environment.
You'll need to create a GCS bucket and folder for your project, then specify that as the pipeline parameter instead of using the default value.
https://cloud.google.com/storage/docs/creating-buckets
I had the same error; it was solved after creating a bucket:
gsutil mb gs://<bucket-name-from-the-error>/
I have faced the same issue where it throws the IO error. Things that helped me here (in no particular order):
Checking the name of the bucket. This step helped me a lot. Bucket names are global. If you make a mistake in the bucket name while accessing your bucket, then you might be accessing a bucket that you have NOT created and don't have permission to access.
Checking the service account key file that you have filled in:
export GOOGLE_APPLICATION_CREDENTIALS=yourkeyfile.json
Activating the service account for the key file you have plugged in -
gcloud auth activate-service-account --key-file=your-key-file.json
Also, listing out the auth accounts available might help you too.
gcloud auth list
One solution might work for you. It did for me.
In the cloud shell window, click on "Launch code Editor" (the pencil icon). The editor works in Chrome (not sure about Firefox); it did not work in the Brave browser.
Now, browse to your code file (.py or .java) in the launched code editor on GCP, locate the pre-defined PROJECT and BUCKET names, replace them with your own project and bucket names, and save the file.
Now execute the file, and it should work.
Python doesn't use gcloud auth to authenticate but it uses the environment variable GOOGLE_APPLICATION_CREDENTIALS. So before you run the python command to launch the Dataflow job, you will need to set that environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key"
More info on setting up the environment variable: https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable
Then you'll have to make sure that the account you set up has the necessary permissions in your GCP project.
Permissions and service accounts:
User service account or user account: it needs the Dataflow Admin role at the project level and must be able to act as the worker service account (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#worker_service_account).
Worker service account: there will be one worker service account per Dataflow pipeline. This account needs the Dataflow Worker role at the project level plus the necessary permissions for the resources accessed by the Dataflow pipeline (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#worker_service_account).
Example: if the Dataflow pipeline's input is a Pub/Sub topic and its output is a BigQuery table, the worker service account will need read access to the topic as well as write permission to the BQ table.
Dataflow service account: this is the account that gets automatically created when you enable the Dataflow API in a project. It automatically gets the Dataflow Service Agent role at the project level (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account).
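For example, granting the worker role to a service account can be done with a single command (the project and service account names below are placeholders):
gcloud projects add-iam-policy-binding my-project --member="serviceAccount:my-worker-sa@my-project.iam.gserviceaccount.com" --role="roles/dataflow.worker"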

How to store data in GCS while accessing it from GAE and 'GCE' locally

There's a GAE project using the GCS to store/retrieve files. These files also need to be read by code that will run on GCE (needs C++ libraries, so therefore not running on GAE).
In production, deployed on the actual GAE > GCS < GCE, this setup works fine.
However, testing and developing locally is a different story that I'm trying to figure out.
As recommended, I'm running GAE's dev_appserver with GoogleAppEngineCloudStorageClient to access the (simulated) GCS. Files are put in the local blobstore. Great for testing GAE.
Since there is no GCE SDK to run a VM locally, whenever I refer to the local 'GCE', it's just my local development machine running Linux.
On the local GCE side I'm just using the default boto library (https://developers.google.com/storage/docs/gspythonlibrary) with a python 2.x runtime to interface with the C++ code and retrieving files from the GCS. However, in development, these files are inaccessible from boto because they're stored in the dev_appserver's blobstore.
Is there a way to properly connect the local GAE and GCE to a local GCS?
For now, I gave up on the local GCS part and tried using the real GCS. The GCE part with boto is easy. The GAE cloudstorage client can also use the real GCS instead of the local blobstore by setting an access_token:
cloudstorage.common.set_access_token(access_token)
According to the docs:
access_token: you can get one by running 'gsutil -d ls' and copying the string after 'Bearer'.
That token works for a limited amount of time, so that's not ideal. Is there a way to set a more permanent access_token?
There is a convenient option to access Google Cloud Storage from a development environment. You should use the client library provided with the Google Cloud SDK. After executing gcloud init locally, you get access to your resources.
As shown in the examples for client library authentication:
# Get the application default credentials. When running locally, these are
# available after running `gcloud init`. When running on compute
# engine, these are available from the environment.
credentials = GoogleCredentials.get_application_default()
# Construct the service object for interacting with the Cloud Storage API -
# the 'storage' service, at version 'v1'.
# You can browse other available api services and versions here:
# https://developers.google.com/api-client-library/python/apis/
service = discovery.build('storage', 'v1', credentials=credentials)
Google libraries come and go like tourists in a train station. Today (2020) google-cloud-storage should work on GCE and GAE Standard Environment with Python 3.
On GAE and GCE it picks up access credentials from the environment, and locally you can provide it with a service account JSON file like this:
GOOGLE_APPLICATION_CREDENTIALS=../sa-b0af54dea5e.json
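A minimal usage sketch (the bucket and object names are placeholders); the client picks up credentials automatically:
from google.cloud import storage

# Credentials come from GOOGLE_APPLICATION_CREDENTIALS locally,
# or from the runtime environment on GCE/GAE.
client = storage.Client()
bucket = client.bucket('my-bucket')        # hypothetical bucket name
blob = bucket.blob('reports/report.csv')   # hypothetical object name
blob.download_to_filename('/tmp/report.csv')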
If you're always using "real" remote GCS, the newer gcloud is probably the best library: http://googlecloudplatform.github.io/gcloud-python/
It's really confusing how many storage client libraries there are for Python. Some are for AE only, but they often force (or at least default to) using the local mock Blobstore when running with dev_appserver.py.
Seems like gcloud is always using the real GCS, which is what I want.
It also "magically" fixes authentication when running locally.
It looks like appengine-gcs-client for Python is now only useful for production App Engine and inside dev_appserver.py, and the local examples for it have been removed from the developer docs in favor of Boto :( If you are deciding not to use the local GCS emulation, it's probably best to stick with Boto for both local testing and GCE.
If you still want to use 'google.appengine.ext.cloudstorage' though, access tokens always expire, so you'll need to manually refresh them. Given your setup, honestly the easiest thing to do is just call 'gsutil -d ls' from Python and parse the output to get a new token from your local credentials. You could use the API Client Library to get a token in a more 'correct' fashion, but at that point things would be getting so roundabout you might as well just be using Boto.
There is a Google Cloud Storage local / development server for this purpose: https://developers.google.com/datastore/docs/tools/devserver
Once you have set it up, create a dataset and start the GCS development server
gcd.sh create [options] <dataset-directory>
gcd.sh start [options] <dataset-directory>
Export the environment variables
export DATASTORE_HOST=http://yourmachine:8080
export DATASTORE_DATASET=<dataset_id>
Then you should be able to use the datastore connection in your code, locally.
