Permissions error with Apache Beam example on Google Dataflow - python

I'm having trouble submitting an Apache Beam example from a local machine to our cloud platform.
Using gcloud auth list, I can see that the correct account is currently active. I can use gsutil and the web console to interact with the file system. I can use the Cloud Shell to run pipelines through the Python REPL.
But when I try to run the Python wordcount example, I get the following error:
IOError: Could not upload to GCS path gs://my_bucket/tmp: access denied.
Please verify that credentials are valid and that you have write access
to the specified path.
Is there something I am missing with regard to the credentials?

Here are my two cents after spending the whole morning on the issue.
You should make sure that you log in with gcloud on your local machine; however, pay attention to the warning message returned by gcloud auth login:
WARNING: `gcloud auth login` no longer writes application default credentials.
These application default credentials are what the Python code needs to identify your credentials properly.
The solution is rather simple; just use:
gcloud auth application-default login
This writes a credentials file to ~/.config/gcloud/application_default_credentials.json, which is used for authentication in the local development environment.
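To confirm that the Python client libraries actually pick up those credentials, a minimal check like the following should work (a sketch, assuming the google-auth package is installed, which the Beam GCP extras depend on):
import google.auth

# Resolves Application Default Credentials, i.e. the file written by
# `gcloud auth application-default login`.
credentials, project_id = google.auth.default()
print('Authenticated, default project:', project_id)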

You'll need to create a GCS bucket and folder for your project, then pass it as the pipeline's GCS path parameter (e.g. the temp location) instead of using the default value.
https://cloud.google.com/storage/docs/creating-buckets
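For example, with the Beam Python SDK the bucket can be supplied through the pipeline options rather than the defaults (a sketch; the project and bucket names below are placeholders):
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

options = PipelineOptions()
gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = 'my-project-id'                    # placeholder
gcp_options.temp_location = 'gs://my_bucket/tmp'         # a bucket you created and can write to
gcp_options.staging_location = 'gs://my_bucket/staging'  # placeholder
The same values can also be passed on the command line as --project, --temp_location and --staging_location flags when launching the wordcount example.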

Same error, solved after creating a bucket:
gsutil mb gs://<bucket-name-from-the-error>/

I have faced the same issue where it throws the IO error. Things that helped me here (in no particular order):
Checking the name of the bucket. This step helped me a lot. Bucket names are global, so if you make a mistake in the bucket name while accessing your bucket, you might be accessing a bucket that you have NOT created and don't have permission to use.
Checking the service account key file that you have set in the environment:
export GOOGLE_APPLICATION_CREDENTIALS=yourkeyfile.json
Activating the service account for the key file you have plugged in:
gcloud auth activate-service-account --key-file=your-key-file.json
Also, listing the available auth accounts might help:
gcloud auth list
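If you want to rule out a bucket-name mix-up from Python itself, a quick check with the google-cloud-storage client can tell you whether the bucket is visible to your credentials (a minimal sketch; the bucket name is a placeholder):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my_bucket')  # placeholder: the bucket from your pipeline options
print('Bucket exists and is accessible:', bucket.exists())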

One solution that might work for you; it did for me.
In the Cloud Shell window, click on "Launch code editor" (the pencil icon). The editor works in Chrome (not sure about Firefox); it did not work in the Brave browser.
Now browse to your code file (.py or .java) in the launched code editor on GCP, locate the predefined PROJECT and BUCKET names, replace them with your own project and bucket names, and save the file.
Then execute the file and it should work.

Python doesn't use gcloud auth to authenticate; it uses the GOOGLE_APPLICATION_CREDENTIALS environment variable. So before you run the Python command to launch the Dataflow job, you will need to set that environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key"
More info on setting up the environment variable: https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable
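Equivalently, if you launch the pipeline from a Python script, the variable can be set in the launching process before any Google client library reads credentials (the path below is a placeholder):
import os

# Must be set before the pipeline (or any Google client) is constructed.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/key.json'  # placeholder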
Then you'll have to make sure that the account you set up has the necessary permissions in your GCP project.
Permissions and service accounts:
User service account or user account: it needs the Dataflow Admin role at the project level and must be able to act as the worker service account (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#worker_service_account).
Worker service account: there will be one worker service account per Dataflow pipeline. This account needs the Dataflow Worker role at the project level plus the necessary permissions on the resources accessed by the Dataflow pipeline (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#worker_service_account). Example: if the Dataflow pipeline's input is a Pub/Sub topic and its output is a BigQuery table, the worker service account needs read access to the topic as well as write permission on the BQ table.
Dataflow service account: this is the account that gets automatically created when you enable the Dataflow API in a project. It automatically gets the Dataflow Service Agent role at the project level (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account).

Related

Roles Required to write to Cloud Storage (GCP) from python (pandas)

I have a question for the GCP connoisseurs among you.
I have an issue: I can upload to a bucket via the UI and gsutil, but if I try to do it via Python
df.to_csv('gs://BUCKET_NAME/test.csv')
I get a 403 insufficient permission error.
My guess at the moment is that Python does this via an API and requires an extra role. To make things more confusing, I am already project owner of the project containing the bucket, and compared to other team members I did not find any missing permissions for this specific bucket.
I use Python 3.9.1 via pyenv and pandas 1.4.2.
Has anyone had the same issue / knows what role I am missing?
I checked that I can in principle upload both via the UI and gsutil.
I used the same virtual Python environment to read from and write to BigQuery, to check that I can in principle use GCP data in Python; this works.
I have the following roles on the bucket: Storage Admin, Storage Object Admin, Storage Object Creator, Storage Object Viewer.
gsutil and gcloud share credentials.
These credentials are not shared with other code running locally.
The quick but sub-optimal fix is to run:
gcloud auth application-default login
And run the code again.
It will then use your gcloud (gsutil) user credentials configured to run as if you were using a Service Account.
These credentials are stored (on Linux) in ${HOME}/.config/gcloud/application_default_credentials.json.
A better solution is to create a Service Account specifically for your app and grant it the minimal set of IAM permissions that it will need (BigQuery, GCS, ...).
For testing purposes (!) you can download the Service Account key locally.
You can then auth your code using Google's Application Default Credentials (ADC) by (on Linux):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/key.json
python3 your_app.py
When you deploy code that leverages ADC to a Google Cloud compute service (Compute Engine, Cloud Run, ...), it can be deployed unchanged because the credentials for the compute resource will be automatically obtained from the Metadata service.
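If you'd rather not depend on ambient credentials at all, note that pandas hands gs:// paths to gcsfs, so you can point it at the key file explicitly (a sketch; the bucket name and key path are placeholders):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
# storage_options is forwarded to gcsfs; 'token' can be a service account key file.
df.to_csv(
    'gs://BUCKET_NAME/test.csv',
    storage_options={'token': '/path/to/your/key.json'},
)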
You can Google e.g. "Google IAM BigQuery" to find the documentation that lists the roles:
IAM roles for BigQuery
IAM roles for Cloud Storage

GCP secrets manager: how to give cloudbuild the right permission? I don't see it in the list of service accounts

I am using Google Cloud Build to build a Docker image for deployment, but the program running inside needs the Secrets Manager.
I have this working from the command line with a service account, but when I try to follow an example that puts the secret into an environment variable for use by Python, it fails with an error referring to a builder service account:
your build failed to run: generic::invalid_argument: builder service account "99999999@cloudbuild.gserviceaccount.com" does not have secretmanager.versions.access permissions for secret "projects/myproject/secrets/mypassword"
I look in the list of service accounts and don't find this account anywhere, so how do I give it access to this if I have to do it this way?
But actually all I need is for the running program to have access (I have it coded this way already, and it works from the Cloud Shell and from the command line)... how do I hook it up with the right IAM service account to run correctly from within a deployed Docker container?
Thanks much for any guidance on this!
Good description by Ferregina
A one-line solution in your case would be:
gcloud projects add-iam-policy-binding <YOUR_PROJECT_ID> --member='serviceAccount:99999999@cloudbuild.gserviceaccount.com' --role='roles/secretmanager.secretAccessor'
The Cloud Build service account <PROJECT_NUMBER>@cloudbuild.gserviceaccount.com is not in the service account list because it is a Google-managed account and is not created in your project.
Keep in mind that only the service accounts created in your project will appear on that list.
Instead, you will find the account on the IAM main page.
Regarding your second question
how to hook it up with the right IAM service account to run correctly from within a deployed Docker container?
Well, it depends on the resource it will run on (GCE, GKE, Cloud Run, App Engine Flex) and the service account assigned to the chosen product.
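Whichever compute product you pick, once its service account has the Secret Manager Secret Accessor role, the code inside the container only needs to rely on ADC. A minimal sketch with the google-cloud-secret-manager client (project and secret names are placeholders):
from google.cloud import secretmanager

# ADC resolves to the service account attached to the compute resource.
client = secretmanager.SecretManagerServiceClient()
name = 'projects/myproject/secrets/mypassword/versions/latest'  # placeholder
response = client.access_secret_version(request={'name': name})
secret_value = response.payload.data.decode('utf-8')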

Authenticate Google Cloud Storage Python client with gsutil-generated boto file

I'm trying to automate report downloading from Google Play (through Cloud Storage) using the GC Python client library. From the docs, I found that it's possible to do it using gsutil. I found this question has been answered here, but I also found that the Client infers credentials from the environment, and I plan to do this on an automation platform with (assumed) no gcloud credentials set.
I've found that you can generate a gsutil boto file and then use it as a credential, but how can I load this into the client library?
This is not exactly a direct answer to your question, but the best way would be to create a service account in GCP, and then use the service account's JSON keyfile to interact with GCS. See this documentation on how to generate said keyfile.
NOTE: You should treat this keyfile like a password, as it will have whatever access you give it in the step below. So no uploading it to public GitHub repos, for example.
You'll also have to give the service account the Storage Object Viewer role, or one with more permissions.
NOTE: Always grant the least access needed, for security reasons.
The code for this is extremely simple. Note that this is extremely similar to the methods mentioned in the link for generating the keyfile, the exception being the way the client is instantiated.
requirements.txt
google-cloud-storage
code
from google.cloud import storage
cred_json_file_path = 'path/to/file/credentials.json'
client = storage.Client.from_service_account_json(cred_json_file_path)
If you want to use the general Google API Python client library, you can use it to do a similar instantiation of a credentials object using the JSON keyfile, but for GCS the google-cloud-storage library is very much preferred, as it does some magic behind the scenes; the API Python client library is a very generic one that can (theoretically) be used with all Google APIs.
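Once instantiated this way, the client behaves exactly as it would with ambient credentials, e.g. for pulling a report (a sketch; the bucket and object names are placeholders):
bucket = client.bucket('your-play-reports-bucket')        # placeholder
blob = bucket.blob('stats/installs/installs_report.csv')  # placeholder
blob.download_to_filename('installs_report.csv')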
gsutil will look for a .boto file in the home directory of the user invoking it, so ~/.boto, for Linux and macOS, and in %HOMEDRIVE%%HOMEPATH% for Windows.
Alternately, you can set the BOTO_CONFIG environment variable to the path of the .boto file you want to use. Here's an example:
BOTO_CONFIG=/path/to/your_generated_boto_file.boto gsutil -m cp files gs://bucket
You can generate a .boto file with a service account by using the "-e" flag with the config command: gsutil config -e.
Also note that if gsutil is installed with the gcloud command, gcloud will share its authentication config with gsutil unless you disable that behavior with this command: gcloud config set pass_credentials_to_gsutil false.
https://cloud.google.com/storage/docs/boto-gsutil

Cloud Composer + Airflow: Setting up DAGs to trigger on HTTP (or should I use Cloud Functions?)

Ultimately, what I want to do is have a Python script that runs dynamically whenever an HTTP request comes in. It'd be like: App 1 runs and sends out a webhook, and the Python script catches the webhook immediately and does whatever it does.
I saw that you could do this in GCP with Composer and Airflow.
But I'm having several issues following these instructions (https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf):
Running this in Cloud Shell to grant blob signing permissions:
gcloud iam service-accounts add-iam-policy-binding \
your-project-id@appspot.gserviceaccount.com \
--member=serviceAccount:your-project-id@appspot.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator
When I put in my project ID, I get a "Gaia id not found for your-project-id@appspot.gserviceaccount.com" error.
When I run the airflow_uri = environment_data['config']['airflowUri'] bit, I get a key error on 'config'.
Is there a better way to do what I'm trying to do (i.e. run Python scripts dynamically)?
The reason for getting the "Gaia id not found for email <project-id>@appspot.gserviceaccount.com" error is not enabling all the needed APIs in your project. Please follow these steps:
Create or select the Google Cloud Platform project you wish to work with.
Enable the Cloud Composer, Google Cloud Functions, and Cloud Identity and Access Management (IAM) APIs. You can find them under Menu -> Products -> Marketplace by typing the name of the corresponding API.
Grant blob signing permissions to the Cloud Functions service account. In order for GCF to authenticate to Cloud IAP, the proxy that protects the Airflow webserver, you need to grant the Appspot service account the Service Account Token Creator role. Do so by running the following command in Cloud Shell, substituting the name of your project for <your-project-id>:
gcloud iam service-accounts add-iam-policy-binding \
<your-project-id>@appspot.gserviceaccount.com \
--member=serviceAccount:<your-project-id>@appspot.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator
I tested the scenario, first without enabling the APIs, and I got the same error as you. After enabling the APIs, the error disappeared and the IAM policy was updated correctly.
There is already a well-described Codelabs tutorial which shows the workflow of triggering a DAG with Google Cloud Functions.

How can I log in to an arbitrary user in appengine for use with the Drive SDK?

I have an application that needs to log into a singular Drive account and perform operations on the files automatically using a cron job. Initially, I tried to use the domain administrator login to do this, however I am unable to do any testing with the domain administrator as it seems that you cannot use the test server with a domain administrator account, which makes testing my application a bit impossible!
As such, I started looking at storing arbitrary OAuth tokens--especially the refresh token--to log into this account automatically after the initial setup. However, all of the APIs and documentation assume that multiple individual users are logging in manually, and I cannot find functionality in the OAuth APIs that allows or accounts for logging in as anything but the currently logged-in user.
How can I achieve this in a way that I can test my code on a test domain? Can I do it without writing my own oauth library and doing the oauth requests by hand? Or is there a way to get the domain administrator authorization to work on a local test server?
You can load the credentials for a single account into your datastore using the Remote API, which can be enabled in your app.yaml file:
builtins:
- remote_api: on
By executing
remote_api_shell.py -s your_app_id.appspot.com
from the command line you'll have access to a shell which can execute in the environment of your application. Before doing this, make sure you have your application deployed (more on local development below) and make sure the source for google-api-python-client is included by pip-installing it and running enable-app-engine-project /path/to/project to add it to your App Engine project.
Once you are in the remote shell (after executing the remote command above), perform the following:
from oauth2client.appengine import CredentialsModel
from oauth2client.appengine import StorageByKeyName
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.tools import run
KEY_NAME = 'your_choice_here'
CREDENTIALS_PROPERTY_NAME = 'credentials'
SCOPE = 'https://www.googleapis.com/auth/drive'
storage = StorageByKeyName(CredentialsModel, KEY_NAME, CREDENTIALS_PROPERTY_NAME)
flow = OAuth2WebServerFlow(
    client_id=YOUR_CLIENT_ID,
    client_secret=YOUR_CLIENT_SECRET,
    scope=SCOPE)
run(flow, storage)
NOTE: If you have not deployed your application with the google-api-python-client code, this will fail, because your application won't know how to make the same imports you made on your local machine, e.g. from oauth2client.appengine import CredentialsModel.
When run is called, your web browser will open and prompt you to accept OAuth access for the client you've specified with YOUR_CLIENT_ID and YOUR_CLIENT_SECRET. After successfully completing, it will save an instance of CredentialsModel in the datastore of the deployed application your_app_id.appspot.com, storing it under the KEY_NAME you provided.
After doing this, any caller in your application -- including your cron jobs -- can access those credentials by executing
storage = StorageByKeyName(CredentialsModel, KEY_NAME, CREDENTIALS_PROPERTY_NAME)
credentials = storage.get()
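From there, a typical follow-up (a sketch, assuming the same era of google-api-python-client as the rest of this answer) is to build the Drive service with those credentials:
import httplib2
from apiclient.discovery import build

# Authorize an HTTP object with the stored credentials and build the Drive client.
http = credentials.authorize(httplib2.Http())
drive_service = build('drive', 'v2', http=http)
files = drive_service.files().list().execute()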
Local Development:
If you'd like to test this locally, you can run your application locally via
dev_appserver.py --port=PORT /path/to/project
and you can execute the same commands using the remote API shell and pointing it at your local application:
remote_api_shell.py -s localhost:PORT
Once here, you can execute the same code you did in the remote api shell and similarly an instance of CredentialsModel will be stored in the datastore of your local development server.
As above, if you don't have the correct google-api-python-client modules included, this will fail.
EDIT: This used to recommend using the Interactive Console at:
http://localhost:PORT/_ah/admin/interactive
but it was discovered that this doesn't work because socket does not work properly in the App Engine local development sandbox.
This article explains how to interact with Google Drive on behalf of users of your domain by having the Domain Administrator delegate domain-wide authority to a Service Account.
This other article explains how to interact with a Drive owned by your application using a Service Account.
Note that both methods use JWT-based Service Accounts, which currently need a modified version of the google-api-python-client in order to work on App Engine.
Unlike the Google App Engine service account, JWT-based Service Accounts should work with the development server.
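For reference, a domain-wide delegation flow with a JWT-based Service Account of that era looks roughly like this (a sketch using oauth2client; the key file, service account email, and impersonated user are placeholders, and the impersonation keyword has varied between library versions):
import httplib2
from apiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials

with open('privatekey.p12', 'rb') as f:  # placeholder: key file downloaded for the Service Account
    private_key = f.read()

credentials = SignedJwtAssertionCredentials(
    'your-service-account@developer.gserviceaccount.com',  # placeholder
    private_key,
    scope='https://www.googleapis.com/auth/drive',
    sub='user@yourdomain.com')  # placeholder: the domain user to impersonate
drive_service = build('drive', 'v2', http=credentials.authorize(httplib2.Http()))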
