Google Cloud get bucket - works with CLI but not in Python

I was asked to perform an integration with an external Google Storage bucket, and I received a credentials JSON.
After configuring myself with that credentials file, running
gsutil ls gs://bucket_name returned a valid response, and uploading a file into the bucket also worked.
When trying to do it with Python 3, it does not work.
Using google-cloud-storage==1.16.0 (I also tried newer versions), I'm doing:
from google.cloud import storage
from google.oauth2 import service_account

project_id = credentials_dict.get("project_id")
credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = storage.Client(credentials=credentials, project=project_id)
bucket = client.get_bucket(bucket_name)
But on the get_bucket line, I get:
google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/storage/v1/b/BUCKET_NAME?projection=noAcl: USERNAME#PROJECT_ID.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket.
The external partner I'm integrating with says the user is set up correctly, and to prove it they point out that I can perform the action with gsutil.
Can you please assist? Any idea what the problem might be?

The answer was that the creds were indeed wrong, but it did work when I called client.bucket(bucket_name) on the client instead of client.get_bucket(bucket_name). Unlike get_bucket(), bucket() only builds a local reference to the bucket and makes no API call, so it does not require the storage.buckets.get permission.
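As a minimal sketch (reusing the credentials_dict and bucket_name from the question), the unverified bucket handle still supports object-level operations, which only need permissions such as storage.objects.create and storage.objects.get:

from google.cloud import storage
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = storage.Client(credentials=credentials,
                        project=credentials_dict.get("project_id"))

# bucket() only builds a local reference; no buckets.get API call is made
bucket = client.bucket(bucket_name)

# Object-level calls work as long as the service account has object permissions
blob = bucket.blob("example.txt")   # hypothetical object name
blob.upload_from_string("hello")
data = blob.download_as_string()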

Please follow these steps to correctly set up the Cloud Storage client library for Python. In general, the Cloud Storage libraries can use Application Default Credentials or environment variables for authentication.
Note that the recommended method is to set up authentication through the environment variable (e.g. on Linux: export GOOGLE_APPLICATION_CREDENTIALS="/path/to/[service-account-credentials].json") and avoid the use of the service_account.Credentials.from_service_account_info() method altogether:
from google.cloud import storage

storage_client = storage.Client(project='project-id-where-the-bucket-is')
bucket_name = "your-bucket"
bucket = storage_client.get_bucket(bucket_name)
This should simply work because authentication is handled by the client library via the environment variable.
Now, if you are interested in explicitly using the service account, instead of the service_account.Credentials.from_service_account_info() method you can use the from_service_account_json() method directly, in the following way:
from google.cloud import storage

# Explicitly use service account credentials by specifying the private key file.
storage_client = storage.Client.from_service_account_json(
    '/[service-account-credentials].json')
bucket_name = "your-bucket"
bucket = storage_client.get_bucket(bucket_name)
Find all the relevant details on how to provide credentials to your application in the Google Cloud authentication documentation.

tl;dr: don't use client.get_bucket at all.
See https://stackoverflow.com/a/51452170/705745 for a detailed explanation and solution.

Related

Access Google Cloud Storage object in a project that I don't belong to

There is a GCP project that contains a bucket that I have read and write permissions to, but I don't know the name of the project nor am I part of the project. None of the contents of this bucket are public.
I have successfully authenticated my user locally using gcloud auth application-default login.
I can successfully download from this bucket using gsutil cat gs://BUCKET/PATH.
However, if I use the google.cloud.storage Python API, it fails at the point of identifying the project, presumably because I don't have access to the project itself:
from google.cloud import storage
client = storage.Client()
storage.Blob.from_string("gs://BUCKET/PATH", client=client).download_as_text()
The billing account for the owning project is disabled in state closed: ('Request failed with status code', 403, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
I can't use storage.Client.create_anonymous_client() since this is only relevant for public buckets, but I suspect that I could fix this by changing the credentials argument to Client().
Can anyone help me download the file from Google Cloud in this case?
If you have permission, you can find the project number for a given bucket with the buckets.get API call. See the Cloud Storage documentation on getting bucket metadata for how to do it with the various client libraries.
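As a rough sketch of that suggestion (assuming your credentials actually carry storage.buckets.get on the bucket, and using a placeholder bucket name), the owning project number is exposed on the bucket metadata returned by that call:

from google.cloud import storage

client = storage.Client()
# buckets.get API call; requires storage.buckets.get on the bucket
bucket = client.get_bucket("BUCKET")
print(bucket.project_number)   # number of the project that owns the bucket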

Accessing Google BigQuery from AWS SageMaker

When running locally, my Jupyter notebook is able to reference Google BigQuery like so:
%%bigquery some_bq_table
SELECT *
FROM
`some_bq_dataset.some_bq_table`
So that later in my notebook I can reference some_bq_table as a pandas dataframe, as exemplified here: https://cloud.google.com/bigquery/docs/visualize-jupyter
I want to run my notebook on AWS SageMaker to test a few things. To authenticate with BigQuery, it seems the only two ways are using a service account on GCP (or locally) or pointing the SDK to a credentials JSON using an env var (as explained here: https://cloud.google.com/docs/authentication/getting-started).
For example
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"
Is there an easy way to connect to BigQuery from SageMaker? My best idea right now is to download the JSON from somewhere to the SageMaker instance and then set the env var from the Python code.
For example, I would do this:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/user/Downloads/[FILE_NAME].json"
However, this isn't very secure - I don't like the idea of downloading my credentials JSON to a SageMaker instance (it means I would have to upload the credentials to some private S3 bucket and then store them on the SageMaker instance). Not the end of the world, but I'd rather avoid it.
Any ideas?
As you mentioned, GCP currently authenticates using a service account, a credentials JSON, or API tokens. Instead of storing the credentials in an S3 bucket, you can consider using AWS Secrets Manager or AWS Systems Manager Parameter Store to store the GCP credentials and then fetch them in the Jupyter notebook. This way the credentials stay secured, and the credentials file is only created when needed.
This is sample code I previously used to connect to BigQuery from a SageMaker instance.
import os
import json
import boto3
from google.cloud.bigquery import magics
from google.oauth2 import service_account

def get_gcp_credentials_from_ssm(param_name):
    # Read the credentials from the SSM Parameter Store
    ssm = boto3.client('ssm')
    # Get the requested parameter (decrypted)
    response = ssm.get_parameters(Names=[param_name], WithDecryption=True)
    # Store the credentials JSON in a variable
    gcp_credentials = response['Parameters'][0]['Value']
    # Save the credentials temporarily to a file
    credentials_file = '/tmp/.gcp/service_credentials.json'
    os.makedirs(os.path.dirname(credentials_file), exist_ok=True)
    with open(credentials_file, 'w') as outfile:
        json.dump(json.loads(gcp_credentials), outfile)
    # Create google.auth.credentials.Credentials to use for queries
    credentials = service_account.Credentials.from_service_account_file(credentials_file)
    # Remove the temporary file
    if os.path.exists(credentials_file):
        os.remove(credentials_file)
    return credentials

# This sets the context credentials used for queries performed in Jupyter
# with the bigquery cell magic
magics.context.credentials = get_gcp_credentials_from_ssm('my_gcp_credentials')
Please note that the SageMaker execution role needs access to SSM, and of course whatever network route is necessary to reach GCP. I am not sure this is the best way, though; I hope someone has a better one.
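If you prefer the Secrets Manager option mentioned above, a minimal sketch (assuming a secret named my_gcp_credentials holding the service-account JSON as a string; the name is hypothetical) avoids writing the key to disk at all by building the credentials object directly from the parsed JSON:

import json
import boto3
from google.cloud.bigquery import magics
from google.oauth2 import service_account

# Fetch the service-account JSON from AWS Secrets Manager
secrets = boto3.client('secretsmanager')
secret = secrets.get_secret_value(SecretId='my_gcp_credentials')   # hypothetical secret name
info = json.loads(secret['SecretString'])

# Build credentials in memory, without a temporary file
credentials = service_account.Credentials.from_service_account_info(info)
magics.context.credentials = credentials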

Output TFRecord to Google Cloud Storage from Python

I know tf.python_io.TFRecordWriter has a concept of GCS, but it doesn't seem to have permissions to write to it.
If I do the following:
output_path = 'gs://my-bucket-name/{}/{}.tfrecord'.format(object_name, record_name)
writer = tf.python_io.TFRecordWriter(output_path)
# write to writer
writer.close()
then I get 401s saying "Anonymous caller does not have storage.objects.create access to my-bucket-name."
However, on the same machine, if I do gsutil rsync -d -r gs://my-bucket-name bucket-backup, it syncs properly, so I've authenticated correctly with gcloud.
How can I give TFRecordWriter permissions to write to GCS? I'm going to just use Google's GCP python API for now, but I'm sure there's a way to do this using TF alone.
A common strategy to setup credentials on systems is to use Application Default Credentials (ADC). ADC is a strategy to locate Google Cloud Service Account credentials.
If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC will use the filename that the variable points to for service account credentials. This file is a Google Cloud Service Account credentials file in Json format. The previous P12 (PFX) certificates are deprecated.
If the environment variable is not set, ADC uses the default service account for credentials when the application is running on Compute Engine, App Engine, Kubernetes Engine or Cloud Functions.
If the previous two steps fail to find valid credentials, ADC fails and an error occurs.
For this question, ADC could not find credentials, so the TensorFlow write to GCS failed.
The solution is to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the service account Json file.
For Linux:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
For Windows:
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\service-account.json
I wrote an article, Google Cloud Application Default Credentials, that goes into more detail on ADC.
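Once the variable is set (before TensorFlow touches GCS), the original snippet should be able to write straight to the bucket. A minimal sketch, using the question's tf.python_io API and a hypothetical record payload:

import os
import tensorflow as tf

# Point ADC at the service-account key before TensorFlow's GCS filesystem is used
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account.json'

output_path = 'gs://my-bucket-name/some-object/example.tfrecord'
# In TensorFlow 2 this class lives at tf.io.TFRecordWriter
writer = tf.python_io.TFRecordWriter(output_path)
writer.write(b'serialized example bytes')   # normally a tf.train.Example.SerializeToString()
writer.close()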
When you use the gsutil command, you are using the GCP user configured in the Cloud SDK (run gcloud config list to see it).
Most likely your Python script is not authenticated in GCP.
I believe there is a better approach to solving this (sorry, I don't have a lot of knowledge about TensorFlow), but I can see two workarounds to fix it:
First option - Mount Cloud Storage buckets as file systems using Cloud Fuse
Second option - Write locally and move later. In this approach, you can use this code:
# Service account file
JSON_FILE_NAME = '<Service account json file>'

# Imports the Google Cloud client library
from google.cloud import storage

# Instantiates a client using the service account
storage_client = storage.Client.from_service_account_json(JSON_FILE_NAME)

# Example file to upload (using the service account)
source_file_path = 'your file path'
destination_blob_name = 'name of file in gcs'

# The name of the target bucket
bucket_name = '<bucket_name>'

bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_path)

print('File {} uploaded to {}.'.format(
    source_file_path,
    destination_blob_name))
Do note that the export command won't work in a Jupyter notebook.
If you're in a Jupyter notebook, this should work instead:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json'

How can I obtain suitable credentials in a cloud composer environment to make calls to the google sheets API?

I would like to be able to access data on a google sheet when running python code via cloud composer; this is something I know how to do in several ways when running code locally, but moving to the cloud is proving challenging. In particular I wish to authenticate as the composer service account rather than stashing the contents of a client_secret.json file somewhere (be that the source code or some cloud location).
For essentially the same question but accessing Google Cloud Platform services instead, this has been relatively easy (even when running through Composer) thanks to the google-cloud-* libraries. For instance, I have verified that I can push data to BigQuery:
from google.cloud import bigquery

client = bigquery.Client()
client.project = 'test project'
dataset_id = 'test dataset'
table_id = 'test table'
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
table = client.get_table(table_ref)
rows_to_insert = [{'some_column': 'test string'}]
errors = client.insert_rows(table, rows_to_insert)
and the success or failure of this can be managed through sharing (or not) 'test dataset' with the composer service account.
Similarly, getting data from a cloud storage bucket works fine:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test bucket')
name = 'test.txt'
data_blob = bucket.get_blob(name)
data_pre = data_blob.download_as_string()
and once again I have the ability to control access through IAM.
However, for working with Google Sheets it seems I must resort to the Google APIs Python client, and here I run into difficulties. Most documentation on this (which seems to be a moving target!) assumes local code execution, starting with the creation and storage of a client_secret.json file (example 1, example 2), which I understand locally but which doesn't make sense for a shared cloud environment under source control. So, here are a couple of approaches I've tried instead:
Trying to build credentials using discovery and oauth2
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client.contrib import gce

SAMPLE_SPREADSHEET_ID = 'key for test sheet'
SAMPLE_RANGE_NAME = 'test range'

creds = gce.AppAssertionCredentials(scope='https://www.googleapis.com/auth/spreadsheets')
service = build('sheets', 'v4', http=creds.authorize(Http()))
sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
                            range=SAMPLE_RANGE_NAME).execute()
values = result.get('values', [])
Caveat: I know nothing about working with scopes to create credential objects via Http. But this seems closest to working: I get an HTTP403 error of
'Request had insufficient authentication scopes.'
However, I don't know if that means I successfully presented myself as the service account, which was then deemed unsuitable for access (so I need to mess around with permissions some more); or didn't actually get that far (and need to fix this credentials creation process).
Getting a credential object with google.auth and passing to gspread
My (limited) understanding is that oauth2client is being deprecated and google.auth is now the way to go. This yields credentials objects in a similarly simple way to my successful examples above for cloud platform services, that I hoped I could just pass to gspread:
import gspread
from google.auth import compute_engine
credentials = compute_engine.Credentials()
client = gspread.authorize(credentials)
Sadly, gspread doesn't work with these objects, because they don't have the attributes it expects:
AttributeError: 'Credentials' object has no attribute 'access_token'
This is presumably because gspread expects oauth2 credentials, and those produced by google.auth aren't sufficiently compatible. The gspread docs also go down the 'just get a client_secret file' route... but presumably, if I can get the previous (oauth/http-based) approach to work, I could then use gspread for data retrieval. For now, though, a hybrid of these two approaches stumbles in the same way: a permission-denied response due to insufficient authentication scopes.
So, whether using google.auth, oauth2 (assuming that'll stick around for a while) or some other cloud-friendly approach (i.e. not one based on storing the secret key), how can I obtain suitable credentials in a cloud composer environment to make calls to the google sheets API? Bonus marks for a way that is compatible with gspread (and hence gspread_dataframe), but this is not essential. Also happy to hear that this is a PEBCAK error and I just need to configure IAM permissions differently for my current approach to work.
It looks like your Composer environment's oauthScopes config wasn't set up properly. If left unspecified, the default cloud-platform scope doesn't allow you to access the Google Sheets API. You may want to create a new Composer environment with:
oauthScopes = [
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/cloud-platform"
]
Google sheets API reference: https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/create.
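Once the environment has the spreadsheets scope, a minimal sketch of reading a range as the Composer service account (using google.auth rather than the deprecated oauth2client, with the question's placeholder spreadsheet ID and range) could look like this; the sheet still has to be shared with the service account's email address:

import google.auth
from googleapiclient.discovery import build

# Application Default Credentials for the environment's service account,
# requested with the Sheets scope granted to the Composer environment
credentials, project = google.auth.default(
    scopes=['https://www.googleapis.com/auth/spreadsheets'])

service = build('sheets', 'v4', credentials=credentials)
result = service.spreadsheets().values().get(
    spreadsheetId='key for test sheet',   # placeholder, as in the question
    range='test range').execute()
values = result.get('values', [])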

How do I access the security token for the Python SDK boto3

I want to access the AWS Comprehend API from a Python script. I'm not getting any leads on how to remove this error. One thing I know is that I have to get a session security token.
import json
import boto3
from botocore.exceptions import ClientError

try:
    client = boto3.client(service_name='comprehend', region_name='us-east-1',
                          aws_access_key_id='KEY ID',
                          aws_secret_access_key='ACCESS KEY')
    text = "It is raining today in Seattle"
    print('Calling DetectEntities')
    print(json.dumps(client.detect_entities(Text=text, LanguageCode='en'),
                     sort_keys=True, indent=4))
    print('End of DetectEntities\n')
except ClientError as e:
    print(e)
Error : An error occurred (UnrecognizedClientException) when calling the DetectEntities operation: The security token included in the request is invalid.
This error suggests that you have provided invalid credentials.
It is also worth noting that you should never put credentials inside your source code. This can lead to potential security problems if other people obtain access to it.
There are several ways to provide valid credentials to an application that uses an AWS SDK (such as boto3).
If the application is running on an Amazon EC2 instance, assign an IAM Role to the instance. This will automatically provide credentials that can be retrieved by boto3.
If you are running the application on your own computer, store credentials in the .aws/credentials file. The easiest way to create this file is with the aws configure command.
See: Credentials — Boto 3 documentation
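As a minimal sketch of that advice (credentials supplied by an attached IAM role or the ~/.aws/credentials file rather than hard-coded keys), the original call can drop the key arguments entirely:

import json
import boto3

# boto3 resolves credentials automatically (IAM role, env vars, or ~/.aws/credentials)
client = boto3.client('comprehend', region_name='us-east-1')
text = "It is raining today in Seattle"
print(json.dumps(client.detect_entities(Text=text, LanguageCode='en'),
                 sort_keys=True, indent=4))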
Create a profile using aws configure or by updating ~/.aws/config. If you only have one profile to work with (default), you can omit the profile_name parameter from the Session() invocation (see the example below). Then create an AWS service-specific client from the session object. Example:
import boto3

session = boto3.session.Session(profile_name="test")
ec2_client = session.client('ec2')
ec2_client.describe_instances()

ec2_resource = session.resource('ec2')
One useful tool I use daily is this: https://github.com/atward/aws-profile/blob/master/aws-profile
This makes assuming a role so much easier!
After you set up your access key in .aws/credentials and your .aws/config,
you can do something like:
AWS_PROFILE=your-profile aws-profile [python x.py]
The part in [] can be substituted with any command that needs AWS credentials, e.g. terraform plan.
Essentially, this utility simply puts your AWS credentials into OS environment variables. Then, in your boto script, you don't need to worry about setting aws_access_key_id etc.
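For reference, a minimal sketch of that mechanism (the values below are placeholders): boto3 reads these standard environment variables automatically, which is all the wrapper is doing for the wrapped command:

import os
import boto3

# Standard variables in boto3's credential chain; aws-profile exports them
# for the wrapped command (placeholder values shown here)
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'
os.environ['AWS_SESSION_TOKEN'] = '...'      # present when an assumed role is used

client = boto3.client('comprehend', region_name='us-east-1')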
