Roles required to write to Cloud Storage (GCP) from Python (pandas)

I have a question for the GCP connoisseurs among you.
I have an issue where I can upload to a bucket via the UI and gsutil, but if I try to do the same via Python
df.to_csv('gs://BUCKET_NAME/test.csv')
I get a 403 insufficient permission error.
My guess at the moment is that Python does this via an API and requires an extra role. To make things more confusing, I am already project owner of the project the bucket belongs to, and compared to other team members I did not find any missing permissions for this specific bucket.
I use Python 3.9.1 via pyenv and pandas 1.4.2.
Has anyone had the same issue or knows what role I am missing?
I checked that I have, in principle, the rights to upload both via the UI and gsutil.
I used the same virtual Python environment to read from and write to BigQuery to check that I can in principle use GCP data in Python; this works.
I have the following roles on the bucket:
Storage Admin, Storage Object Admin, Storage Object Creator, Storage Object Viewer

gsutil and gcloud share credentials.
These credentials are not shared with other code running locally.
The quick-fix but sub-optimal solution is to:
gcloud auth application-default login
And run the code again.
It will then use your gcloud (gsutil) user credentials, configured to behave as if you were using a Service Account.
These credentials are stored (on Linux) in ${HOME}/.config/gcloud/application_default_credentials.json.
A better solution is to create a Service Account specifically for your app and grant it the minimal set of IAM permissions that it will need (BigQuery, GCS, ...).
For testing purposes (!) you can download the Service Account key locally.
You can then auth your code using Google's Application Default Credentials (ADC) by (on Linux):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/key.json
python3 your_app.py
When you deploy code that leverages ADC to a Google Cloud compute service (Compute Engine, Cloud Run, ...), it can be deployed unchanged because the credentials for the compute resource will be automatically obtained from the Metadata service.
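For the pandas call in the question specifically, here is a minimal sketch assuming the Service Account key approach above (pandas 1.2+ forwards storage_options to gcsfs, which must be installed; the bucket name and key path are placeholders):
import pandas as pd

# The 'token' value is passed through to gcsfs; it can be a key-file path,
# or it can be omitted entirely to fall back to Application Default Credentials.
df = pd.DataFrame({'a': [1, 2, 3]})
df.to_csv(
    'gs://BUCKET_NAME/test.csv',
    storage_options={'token': '/path/to/your/key.json'},
)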
You can Google e.g. "Google IAM BigQuery" to find the documentation that lists the roles:
IAM roles for BigQuery
IAM roles for Cloud Storage

Related

Cloud Composer + Airflow: Setting up DAGs to trigger on HTTP (or should I use Cloud Functions?)

Ultimately, what I want to do is have a Python script that runs dynamically whenever an HTTP request is made. It'd be like: App 1 runs and sends out a webhook, and the Python script catches the webhook immediately and does whatever it does.
I saw that you could do this in GCP with Composer and Airflow.
But I'm having several issues following these instructions https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf:
Running this in Cloud Shell to grant blob signing permissions:
gcloud iam service-accounts add-iam-policy-binding \
your-project-id@appspot.gserviceaccount.com \
--member=serviceAccount:your-project-id@appspot.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator
When I put in my project ID, I get a "Gaia id not found for your-project-id@appspot.gserviceaccount.com"
When I run the airflow_uri = environment_data['config']['airflowUri'] bit, I get a key error on 'config'.
Is there a better way to do what I'm trying to do (i.e. run Python scripts dynamically)?
The reason for getting the Gaia id not found for email <project-id>@appspot.gserviceaccount.com error is that not all needed APIs are enabled in your project. Please follow these steps:
Create or select the Google Cloud Platform project you wish to work with.
Enable the Cloud Composer, Cloud Functions and Identity and Access Management (IAM) APIs. You can find them under Menu -> Products -> Marketplace by typing the name of the corresponding API.
Grant blob signing permissions to the Cloud Functions Service Account. In order for GCF to authenticate to Cloud IAP, the proxy that protects the Airflow webserver, you need to grant the Appspot Service Account the Service Account Token Creator role. Do so by running the following command in your Cloud Shell, substituting the name of your project for <your-project-id>:
gcloud iam service-accounts add-iam-policy-binding \
<your-project-id>@appspot.gserviceaccount.com \
--member=serviceAccount:<your-project-id>@appspot.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator
I tested the scenario, first without enabling the APIs, and got the same error as you. After enabling the APIs, the error disappeared and the IAM policy was updated correctly.
There is already a well-described Codelabs tutorial that shows the workflow of triggering a DAG with Google Cloud Functions.
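For the second error (the KeyError on 'config'), here is a hedged sketch of fetching the environment via the Cloud Composer API using the discovery client; the project, location and environment names are placeholders, and the call assumes the Composer API is enabled and Application Default Credentials are available:
from googleapiclient.discovery import build

# Read the Composer environment; 'config' (and within it 'airflowUri')
# is only present when the environment exists and is readable.
composer = build('composer', 'v1')
name = 'projects/your-project-id/locations/us-central1/environments/your-environment'
environment_data = composer.projects().locations().environments().get(name=name).execute()
airflow_uri = environment_data['config']['airflowUri']
print(airflow_uri)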

Best practices to store credentials in your Python script

My setup is: the code lives in a private repository on GitHub and I run it from AWS EC2.
I am unsure where I should store the API and database credentials. My feeling at the moment is that no credentials should be stored in the code; instead, I should use AWS Secrets Manager to access them, but then you also have to connect to AWS. What is your view on it? A disclosure: I am starting with Python, so please be gentle.
Never store your secrets in code. In your case I would recommend AWS Secrets Manager (or secure parameters in AWS Systems Manager Parameter Store) and storing your secrets there.
I would recommend creating an IAM role for your EC2 instance with a policy that allows the role to read the correct secrets from AWS Secrets Manager. Connect the role to an instance profile and the instance profile to the EC2 instance. This happens automatically in the AWS console but not when you're using CloudFormation. An instance profile is essentially a wrapper around a role that allows the role to be attached to an instance.
In this flow your EC2 instance will be allowed to read the secrets from Secrets Manager by using the instance profile and role. Roles are the recommended way to let AWS resources interact with each other because they use temporary credentials and restrict access.
With the above setup you should be able to read the secrets from within your code as explained here. You can use boto3 (the AWS SDK for Python) to talk to Secrets Manager from within the EC2 instance.
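A minimal sketch of that read, assuming a placeholder secret name and region and that the instance profile grants secretsmanager:GetSecretValue:
import json
import boto3

# Credentials come from the EC2 instance profile; nothing is stored in code.
client = boto3.client('secretsmanager', region_name='eu-west-1')
response = client.get_secret_value(SecretId='my-app/database')
secret = json.loads(response['SecretString'])
db_password = secret['password']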

Google OAuth2 Service Account Auth with Domain-wide Delegation Using Default Credentials

Has anyone been able to get domain-wide delegation working with default credentials (i.e. an App Engine default service account, or credentials derived from the GOOGLE_APPLICATION_CREDENTIALS environment variable), specifically with the Drive or Gmail API? We've been able to follow this guide to use default credentials with Admin SDK APIs, but not with user-centric APIs like Gmail/Drive. We really dislike the key management situation we're stuck in by deploying keys with code or loading them into GCS buckets, while knowing that many GCP-centric services don't have this problem (e.g. the google-cloud-firestore or google-cloud-bigquery Python clients).

Permissions error with Apache Beam example on Google Dataflow

I'm having trouble submitting an Apache Beam example from a local machine to our cloud platform.
Using gcloud auth list I can see that the correct account is currently active. I can use gsutil and the web client to interact with the file system. I can use the cloud shell to run pipelines through the python REPL.
But when I try and run the python wordcount example I get the following error:
IOError: Could not upload to GCS path gs://my_bucket/tmp: access denied.
Please verify that credentials are valid and that you have write access
to the specified path.
Is there something I am missing with regards to the credentials?
Here are my two cents after spending the whole morning on the issue.
You should make sure that you log in with gcloud on your local machine; however, pay attention to the warning message returned by gcloud auth login:
WARNING: `gcloud auth login` no longer writes application default credentials.
These application default credentials are required for the Python code to authenticate properly.
The solution is rather simple; just use:
gcloud auth application-default login
This will write a credentials file to ~/.config/gcloud/application_default_credentials.json, which is used for authentication in the local development environment.
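A quick, hedged way to confirm from Python which credentials will actually be picked up:
import google.auth

# Resolves Application Default Credentials the same way the client libraries do.
credentials, project = google.auth.default()
print(project, type(credentials))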
You'll need to create a GCS bucket and folder for your project, then specify that as the pipeline parameter instead of using the default value.
https://cloud.google.com/storage/docs/creating-buckets
I had the same error; it was solved after creating the bucket:
gsutil mb gs://<bucket-name-from-the-error>/
I faced the same issue where it throws the IO error. Things that helped me (in no particular order):
Checking the name of the bucket. This step helped me a lot. Bucket names are global; if you make a mistake in the bucket name while accessing your bucket, you might be accessing a bucket that you have NOT created and don't have permission to use.
Checking the service account key file that you have filled in:
export GOOGLE_APPLICATION_CREDENTIALS=yourkeyfile.json
Activating the service account for the key file you have plugged in -
gcloud auth activate-service-account --key-file=your-key-file.json
Listing the available auth accounts might also help:
gcloud auth list
One solution might work for you. It did for me.
In the Cloud Shell window, click on "Launch code Editor" (the pencil icon). The editor works in Chrome (not sure about Firefox); it did not work in the Brave browser.
Now browse to your code file (.py or .java) in the launched code editor on GCP, locate the pre-defined PROJECT and BUCKET names, replace them with your own project and bucket names, and save the file.
Now execute the file and it should work.
Python doesn't use gcloud auth to authenticate; it uses the GOOGLE_APPLICATION_CREDENTIALS environment variable instead. So before you run the python command to launch the Dataflow job, you will need to set that environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key"
More info on setting up the environment variable: https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable
Then you'll have to make sure that the account you set up has the necessary permissions in your GCP project.
Permissions and service accounts:
User service account or user account: it needs the Dataflow Admin role at the project level and must be able to act as the worker service account (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#worker_service_account).
Worker service account: there will be one worker service account per Dataflow pipeline. This account needs the Dataflow Worker role at the project level plus the necessary permissions to the resources accessed by the Dataflow pipeline (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#worker_service_account). Example: if the Dataflow pipeline's input is a Pub/Sub topic and its output is a BigQuery table, the worker service account needs read access to the topic as well as write permission to the BQ table.
Dataflow service account: this is the account that gets automatically created when you enable the Dataflow API in a project. It automatically gets the Dataflow Service Agent role at the project level (source: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account).
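Putting the credentials and permissions together, here is a minimal sketch of launching a pipeline once GOOGLE_APPLICATION_CREDENTIALS is set; the project, region and bucket names are placeholders:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The temp_location bucket must already exist and be writable by the account in use.
options = PipelineOptions(
    runner='DataflowRunner',
    project='your-project-id',
    region='us-central1',
    temp_location='gs://my_bucket/tmp',
)
with beam.Pipeline(options=options) as p:
    _ = (p
         | beam.Create(['hello', 'world'])
         | beam.io.WriteToText('gs://my_bucket/output'))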

Google App Engine authorization for Google BigQuery - Multiple Projects

I am running multiple projects that use the same source code (Python) on GAE. I am currently trying to include BigQuery functionality in those projects. I have enabled the BigQuery API on all of the projects and successfully imported some data into BigQuery from GCS using the new Developers Console.
I am able to make queries from the GAE app using AppAssertionCredentials from some of the projects but get a 403 "Access Not Configured. The API (BigQuery API) is not enabled for your project. Please use the Google Developers Console to update your configuration." error for others.
tl;dr AppAssertionCredentials with BigQuery fails for some projects, not for others (same source code)
All of the projects have the BigQuery API and billing enabled. I followed all the steps from Google App Engine authorization for Google BigQuery.
The only difference between the projects is how they were created.
Original project:
project_id: project_A
service_account: project_A@appspot.gserviceaccount.com
Second project:
project_id: project_B
service_account: project_B@appspot.gserviceaccount.com
Third project (cloned from project_A):
project_id: project_C
service_account: project_A@appspot.gserviceaccount.com
The third project, project_C, was created using the cloning feature of the old App Engine console, which is why it shares the same service account email. This is the project for which AppAssertionCredentials fails when trying to query BigQuery (although everything works fine when authenticating to GCS with the same credentials).
I have added project_A@appspot.gserviceaccount.com to the project_C permissions list with "Edit" permissions; that didn't help. The service discovery code:
from googleapiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

def get_bigquery_service():
    # Authenticate as the App Engine app's default service account.
    credentials = AppAssertionCredentials('https://www.googleapis.com/auth/bigquery')
    return build('bigquery', 'v2', credentials=credentials)
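For context, a hedged sketch of how that client is then exercised (the get_bigquery_service name comes from the snippet above; the query is illustrative only):
# This call is what returns the 403 on project_C but succeeds on the other projects.
service = get_bigquery_service()
result = service.jobs().query(
    projectId='project_C',
    body={'query': 'SELECT 1'},
).execute()
print(result.get('rows'))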
Is there a workaround for this problem, or maybe anything else I need to check? I would really like to avoid using any authorization method other than AppAssertionCredentials.
