When running locally, my Jupyter notebook is able to reference Google BigQuery like so:
%%bigquery some_bq_table
SELECT *
FROM
`some_bq_dataset.some_bq_table`
This lets me reference some_bq_table later in my notebook as a pandas DataFrame, as exemplified here: https://cloud.google.com/bigquery/docs/visualize-jupyter
I want to run my notebook on AWS SageMaker to test a few things. To authenticate with BigQuery, it seems the only two options are using a service account on GCP (or locally) or pointing the SDK to a credentials JSON via an environment variable (as explained here: https://cloud.google.com/docs/authentication/getting-started).
For example
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"
Is there an easy way to connect to BigQuery from SageMaker? My best idea right now is to download the JSON from somewhere to the SageMaker instance and then set the env var from the Python code.
For example, I would do this:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/user/Downloads/[FILE_NAME].json"
However, this isn't very secure: I don't like the idea of downloading my credentials JSON to a SageMaker instance (it means I would have to upload the credentials to some private S3 bucket and then store them on the SageMaker instance). Not the end of the world, but I'd rather avoid it.
Any ideas?
As you mentioned, GCP currently authenticates using a service account, a credentials JSON, or API tokens. Instead of storing the credentials in an S3 bucket, you can consider using AWS Secrets Manager or AWS Systems Manager Parameter Store to hold the GCP credentials and fetch them from the Jupyter notebook. That way the credentials stay secured, and the credentials file is created from the parameter store only when needed.
This is sample code I used previously to connect to BigQuery from a SageMaker instance.
import os
import json
import boto3
from google.cloud.bigquery import magics
from google.oauth2 import service_account

def get_gcp_credentials_from_ssm(param_name):
    """Read the GCP service account credentials from SSM Parameter Store."""
    ssm = boto3.client('ssm')
    # Get the requested parameter (decrypted, assuming it is stored as a SecureString)
    response = ssm.get_parameters(Names=[param_name], WithDecryption=True)
    # Store the credentials JSON in a variable
    gcp_credentials = response['Parameters'][0]['Value']
    # Save the credentials temporarily to a file
    credentials_file = '/tmp/.gcp/service_credentials.json'
    os.makedirs(os.path.dirname(credentials_file), exist_ok=True)
    with open(credentials_file, 'w') as outfile:
        json.dump(json.loads(gcp_credentials), outfile)
    # Create google.auth.credentials.Credentials to use for queries
    credentials = service_account.Credentials.from_service_account_file(credentials_file)
    # Remove the temporary file
    if os.path.exists(credentials_file):
        os.remove(credentials_file)
    return credentials

# This sets the context credentials used for queries performed in Jupyter
# via the %%bigquery cell magic
magics.context.credentials = get_gcp_credentials_from_ssm('my_gcp_credentials')
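If you prefer Secrets Manager over Parameter Store, here is a minimal sketch along the same lines (the secret name my_gcp_credentials is hypothetical). Building the credentials with from_service_account_info keeps the key from ever touching disk:

import json
import boto3
from google.cloud.bigquery import magics
from google.oauth2 import service_account

def get_gcp_credentials_from_secrets_manager(secret_name):
    """Fetch the service account JSON from AWS Secrets Manager and build credentials."""
    sm = boto3.client('secretsmanager')
    response = sm.get_secret_value(SecretId=secret_name)
    service_account_info = json.loads(response['SecretString'])
    # Build the credentials object directly from the parsed JSON (no temp file needed)
    return service_account.Credentials.from_service_account_info(service_account_info)

magics.context.credentials = get_gcp_credentials_from_secrets_manager('my_gcp_credentials')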
Please note that the SageMaker execution role needs access to SSM (and, of course, the network connectivity required to reach GCP). I am not sure if this is the best way, though; hope someone has a better one.
I was asked to perform an integration with an external Google Storage bucket, and I received a credentials JSON.
While trying
gsutil ls gs://bucket_name (after configuring myself with the creds JSON), I received a valid response, as did an attempt to upload a file into the bucket.
When trying to do it with Python 3, it does not work.
Using google-cloud-storage==1.16.0 (I also tried newer versions), I'm doing:
from google.cloud import storage
from google.oauth2 import service_account

# credentials_dict is the service account JSON loaded as a dict
project_id = credentials_dict.get("project_id")
credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = storage.Client(credentials=credentials, project=project_id)
bucket = client.get_bucket(bucket_name)
But on the get_bucket line, I get:
google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/storage/v1/b/BUCKET_NAME?projection=noAcl: USERNAME#PROJECT_ID.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket.
The external partner I'm integrating with says the user is set up correctly, and as proof points out that I can perform the action with gsutil.
Can you please assist? Any idea what might be the problem?
The answer was that the creds were indeed wrong, but it did work when I called client.bucket(bucket_name) on the client instead of client.get_bucket(bucket_name).
Please follow these steps in order to correctly set up the Cloud Storage client library for Python. In general, the Cloud Storage libraries can use Application Default Credentials or environment variables for authentication.
Notice that the recommended method is to set up authentication via environment variables (e.g. on Linux, export GOOGLE_APPLICATION_CREDENTIALS="/path/to/[service-account-credentials].json" should work) and avoid the service_account.Credentials.from_service_account_info() method altogether:
from google.cloud import storage

storage_client = storage.Client(project='project-id-where-the-bucket-is')
bucket_name = "your-bucket"
bucket = storage_client.get_bucket(bucket_name)
should simply work because the authentication is handled by the client library via the environment variable.
Now, if you are interested in explicitly using the service account instead of the service_account.Credentials.from_service_account_info() method, you can use the from_service_account_json() method directly in the following way:
from google.cloud import storage

# Explicitly use service account credentials by specifying the private key file.
storage_client = storage.Client.from_service_account_json(
    '/[service-account-credentials].json')
bucket_name = "your-bucket"
bucket = storage_client.get_bucket(bucket_name)
Find all the relevant details as to how to provide credentials to your application here.
tl;dr: don't use client.get_bucket at all.
See for detailed explanation and solution https://stackoverflow.com/a/51452170/705745
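For illustration, a minimal sketch of the bucket-handle approach (reusing credentials_dict from the question; bucket, object and file names are placeholders). client.bucket() just builds a local reference without calling the API, so it does not require the storage.buckets.get permission that client.get_bucket() needs:

from google.cloud import storage
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = storage.Client(credentials=credentials, project=credentials_dict["project_id"])

# bucket() returns a handle without an API call, so no storage.buckets.get needed
bucket = client.bucket("bucket_name")
blob = bucket.blob("path/to/object.txt")
blob.upload_from_filename("local_file.txt")  # needs only storage.objects.create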
I have a series of Python scripts that draw data from Google Sheets using the gspread library, authorized with a JSON service account keyfile through the oauth2client library:
import gspread
import pandas as pd
from oauth2client.service_account import ServiceAccountCredentials

scopes = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name(gcp_config_yaml_path, scopes)
client = gspread.authorize(creds)

cur = config['tables_to_load'][i]
sheet = client.open_by_url(cur['spreadsheet_url']).worksheet(cur['sheet_name'])
df = pd.DataFrame(sheet.get_all_records())
I need to convert this into an Airflow DAG using Google Cloud Composer and I would like to take advantage of the connections functionality of Airflow (https://cloud.google.com/composer/docs/how-to/managing/connections#creating_a_connection_to_another_project).
I've uploaded the JSON keyfile object and created the connection object in the Airflow UI (per option 'i' in step 'd - iv' in '#2 create a new connection'), and I'm able to reference that object in my code using:
client = BaseHook.get_connection('google_cloud_default')
But that's about as far as I can get. Each time I try to access an attribute on the connection, I get an error that it doesn't exist (keyfile_json, keyfile_dict, scopes, keyfile_path, client, spreadsheets, etc.), and I can't find any documentation on which attributes should be available on the object (https://airflow.readthedocs.io/en/latest/_api/airflow/gcp/hooks/base/index.html#airflow.gcp.hooks.base.CloudBaseHook).
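For concreteness, here is one direction that seems plausible, as an untested sketch. It assumes the connection was created as a Google Cloud Platform connection type with the key pasted into the Keyfile JSON field, which Airflow 1.10 stores in the connection extras under the extra__google_cloud_platform__keyfile_dict key:

import json

import gspread
from airflow.hooks.base_hook import BaseHook
from oauth2client.service_account import ServiceAccountCredentials

scopes = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']

conn = BaseHook.get_connection('google_cloud_default')
# The Keyfile JSON pasted into the connection lives in its extras (Airflow 1.10 key name)
keyfile_dict = json.loads(conn.extra_dejson['extra__google_cloud_platform__keyfile_dict'])

creds = ServiceAccountCredentials.from_json_keyfile_dict(keyfile_dict, scopes)
client = gspread.authorize(creds)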
Any insight on methods for authorizing a Google Sheets connection in a GCP Cloud Composer Airflow environment would be a HUGE help!
Thanks so much in advance!
I know tf.python_io.TFRecordWriter has a concept of GCS, but it doesn't seem to have permissions to write to it.
If I do the following:
output_path = 'gs://my-bucket-name/{}/{}.tfrecord'.format(object_name, record_name)
writer = tf.python_io.TFRecordWriter(output_path)
# write to writer
writer.close()
then I get 401s saying "Anonymous caller does not have storage.objects.create access to my-bucket-name."
However, on the same machine, if I run gsutil rsync -d -r gs://my-bucket-name bucket-backup, it syncs properly, so I have authenticated correctly with gcloud.
How can I give TFRecordWriter permissions to write to GCS? I'm going to just use Google's GCP python API for now, but I'm sure there's a way to do this using TF alone.
A common strategy for setting up credentials is Application Default Credentials (ADC), a strategy for locating Google Cloud service account credentials.
If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC uses the file that the variable points to for service account credentials. This file is a Google Cloud service account credentials file in JSON format. The older P12 (PFX) certificates are deprecated.
If the environment variable is not set, the default service account is used for credentials if the application is running on Compute Engine, App Engine, Kubernetes Engine or Cloud Functions.
If the previous two steps fail to find valid credentials, ADC will fail, and an error occurs.
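A quick way to see which credentials ADC resolves to on a given machine is to ask the google-auth library, which implements this lookup (a minimal sketch):

import google.auth

# Walks the ADC search order described above: the GOOGLE_APPLICATION_CREDENTIALS
# file first, then the metadata server; raises DefaultCredentialsError if neither is found
credentials, project = google.auth.default()
print(type(credentials), project)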
For this question, ADC could not find credentials and the TensorFlow write to GCS failed.
The solution is to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the service account Json file.
For Linux:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
For Windows:
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\service-account.json
I wrote an article that goes into more detail on ADC.
Google Cloud Application Default Credentials
When you use the gsutil command, you are using the GCP user configured in the Cloud SDK (run gcloud config list to see it).
Your Python script, however, is plausibly not authenticated in GCP.
I believe there is a better approach to solve this (sorry, I don't have a lot of knowledge about TensorFlow), but I can see two workarounds to fix it:
First option - Mount Cloud Storage buckets as file systems using Cloud Fuse
Second option - Write locally and move later. In this approach, you can use this code:
# Service account file
JSON_FILE_NAME = '<Service account json file>'

# Imports the Google Cloud client library
from google.cloud import storage

# Instantiates a client
storage_client = storage.Client.from_service_account_json(JSON_FILE_NAME)

# Example file (uploaded using the service account)
source_file_path = 'your file path'
destination_blob_name = 'name of file in gcs'

# The name of the target bucket
bucket_name = '<bucket_name>'

bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_path)

print('File {} uploaded to {}.'.format(
    source_file_path,
    destination_blob_name))
Do note that the export command won't work from inside a Jupyter notebook.
If you're in a Jupyter notebook, this should work instead:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json'
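Equivalently, the %env line magic sets the variable for the notebook kernel:
%env GOOGLE_APPLICATION_CREDENTIALS=path/to/json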
We want to run a production Airflow environment but do not know how to deal properly with secrets, in particular Google BigQuery client JSON files.
We tried setting up Kubernetes secrets on the Kubernetes cluster that is created automatically when you create a Google Cloud Composer (Airflow) environment. We currently just put the files on the bucket, but would like a better way.
from os.path import join

from google.cloud import bigquery as bq

def get_bq_client():
    """Returns a BigQuery client."""
    return bq.Client.from_service_account_json(
        join("volumes", "bigquery.json")
    )
We would like some form of proper management for the required secrets. Sadly, using Airflow Variables won't work, because we can't create the client object using the JSON file as text.
One solution that would work is to encrypt the JSON files and put those on the bucket. As long as the decryption key exists on the bucket and nowhere else, you can check the code in, with the encrypted secrets, to source control, and then in the environment check them out and decrypt them using the key from the bucket.
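A minimal sketch of that idea using Fernet from the cryptography package (the bucket name, key path and file paths are hypothetical): the encrypted bigquery.json.enc is checked in alongside the code, while the Fernet key lives only in the bucket:

import json

from cryptography.fernet import Fernet
from google.cloud import bigquery as bq
from google.cloud import storage
from google.oauth2 import service_account

def get_bq_client():
    """Decrypt the checked-in service account file with a key held only in GCS."""
    # Fetch the Fernet key from the bucket, using the environment's default credentials
    key = storage.Client().bucket('my-composer-bucket').blob('keys/fernet.key').download_as_string()
    # Decrypt the encrypted keyfile that is checked into source control
    with open('volumes/bigquery.json.enc', 'rb') as f:
        service_account_info = json.loads(Fernet(key).decrypt(f.read()))
    credentials = service_account.Credentials.from_service_account_info(service_account_info)
    return bq.Client(credentials=credentials, project=service_account_info['project_id'])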
I would like to be able to access data in a Google Sheet when running Python code via Cloud Composer; this is something I know how to do in several ways when running code locally, but moving to the cloud is proving challenging. In particular, I wish to authenticate as the Composer service account rather than stashing the contents of a client_secret.json file somewhere (be that the source code or some cloud location).
For essentially the same question but accessing Google Cloud Platform services instead, this has been relatively easy (even when running through Composer) thanks to the google-cloud-* libraries. For instance, I have verified that I can push data to BigQuery:
from google.cloud import bigquery
client = bigquery.Client()
client.project='test project'
dataset_id = 'test dataset'
table_id = 'test table'
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
table = client.get_table(table_ref)
rows_to_insert = [{'some_column':'test string'}]
errors = client.insert_rows(table,rows_to_insert)
and the success or failure of this can be managed through sharing (or not) 'test dataset' with the composer service account.
Similarly, getting data from a cloud storage bucket works fine:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('test bucket')
name = 'test.txt'
data_blob = bucket.get_blob(name)
data_pre = data_blob.download_as_string()
and once again I have the ability to control access through IAM.
However, for working with Google Sheets it seems I must resort to the Google APIs Python client, and here I run into difficulties. Most documentation on this (which seems to be a moving target!) assumes local code execution, starting with the creation and storage of a client_secret.json file (example 1, example 2), which I understand for local use but which doesn't make sense for a shared cloud environment under source control. So, a couple of approaches I've tried instead:
Trying to build credentials using discovery and oauth2
from googleapiclient.discovery import build
from httplib2 import Http
from oauth2client.contrib import gce

SAMPLE_SPREADSHEET_ID = 'key for test sheet'
SAMPLE_RANGE_NAME = 'test range'

creds = gce.AppAssertionCredentials(scope='https://www.googleapis.com/auth/spreadsheets')
service = build('sheets', 'v4', http=creds.authorize(Http()))

sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
                            range=SAMPLE_RANGE_NAME).execute()
values = result.get('values', [])
Caveat: I know nothing about working with scopes to create credential objects via Http. But this seems closest to working: I get an HTTP403 error of
'Request had insufficient authentication scopes.'
However, I don't know if that means I successfully presented myself as the service account, which was then deemed unsuitable for access (so I need to mess around with permissions some more); or didn't actually get that far (and need to fix this credentials creation process).
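One way to tell the two apart: the instance metadata server lists the access scopes the environment's default service account token actually carries (a small sketch using requests against the standard GCE metadata endpoint):

import requests

# Ask the metadata server which OAuth scopes the default service account token has
resp = requests.get(
    'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes',
    headers={'Metadata-Flavor': 'Google'})
print(resp.text)  # if the spreadsheets scope is missing, the 403 is a scope problem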
Getting a credential object with google.auth and passing it to gspread
My (limited) understanding is that oauth2client is being deprecated and google.auth is now the way to go. This yields credentials objects in a similarly simple way to my successful examples above for cloud platform services, which I hoped I could just pass to gspread:
import gspread
from google.auth import compute_engine
credentials = compute_engine.Credentials()
client = gspread.authorize(credentials)
Sadly, gspread doesn't work with these objects, because they don't have the attributes it expects:
AttributeError: 'Credentials' object has no attribute 'access_token'
This is presumably because gspread expects oauth2 credentials, and those produced by google.auth aren't sufficiently compatible. The gspread docs also go down the 'just get a client_secret file' route... but presumably, if I can get the previous (oauth/http-based) approach to work, I could then use gspread for data retrieval. For now, though, a hybrid of these two approaches stumbles in the same way: a permission denied response due to insufficient authentication scopes.
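For what it's worth, newer gspread releases (3.x and later) appear to accept google-auth credentials directly, so once the scope issue is sorted out something like this untested sketch might work:

import gspread
from google.auth import compute_engine

# Only works if the environment's access scopes include the Sheets scope
credentials = compute_engine.Credentials()
client = gspread.authorize(credentials)
sheet = client.open_by_key('key for test sheet').sheet1
values = sheet.get_all_values()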
So, whether using google.auth, oauth2 (assuming that'll stick around for a while) or some other cloud-friendly approach (i.e. not one based on storing the secret key), how can I obtain suitable credentials in a cloud composer environment to make calls to the google sheets API? Bonus marks for a way that is compatible with gspread (and hence gspread_dataframe), but this is not essential. Also happy to hear that this is a PEBCAK error and I just need to configure IAM permissions differently for my current approach to work.
It looks like your Composer environment's oauthScopes config wasn't set up properly. If left unspecified, the default cloud-platform scope doesn't allow you to access the Google Sheets API. You may want to create a new Composer environment with oauthScopes = ["https://www.googleapis.com/auth/spreadsheets", "https://www.googleapis.com/auth/cloud-platform"].
Google sheets API reference: https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/create.