We want to have a production Airflow environment but do not know how to deal properly with secrets, in particular the Google BigQuery client JSON key files.
We tried setting up Kubernetes secrets on the Kubernetes cluster that is created automatically when you create a Google Cloud Composer (Airflow) environment. For now we just put the files on the bucket, but we would like a better way.
from os.path import join
from google.cloud import bigquery as bq

def get_bq_client():
    """Returns a BigQuery client built from the JSON key file in the mounted volume."""
    return bq.Client.from_service_account_json(
        join("volumes", "bigquery.json")
    )
We would like some form of proper management for the required secrets. Sadly, using Airflow Variables won't work, because we can't create the client object from the JSON key content stored as text.
One solution that would work is to encrypt the JSON files and put the encrypted copies in the bucket, with the decryption key stored in the bucket and nowhere else. That way you can check the code, together with the encrypted secrets, into source control, and in the bucket check out and decrypt them.
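To make the encrypt/decrypt idea concrete, here is a minimal sketch using Fernet from the cryptography package; the file names, the /tmp path, and how the Fernet key is read from the bucket are assumptions, not part of the original setup. The key would be generated once with Fernet.generate_key(), and the plaintext bigquery.json encrypted with Fernet(key).encrypt(...) before committing.

from cryptography.fernet import Fernet
from google.cloud import bigquery as bq

def decrypt_key_file(encrypted_path, decrypted_path, fernet_key):
    """Decrypt the committed, encrypted key file using the Fernet key kept only in the bucket."""
    with open(encrypted_path, "rb") as f:
        plaintext = Fernet(fernet_key).decrypt(f.read())
    with open(decrypted_path, "wb") as f:
        f.write(plaintext)

def get_bq_client(encrypted_path, fernet_key, decrypted_path="/tmp/bigquery.json"):
    """Decrypt at runtime (the path is an assumption), then build the client exactly as before."""
    decrypt_key_file(encrypted_path, decrypted_path, fernet_key)
    return bq.Client.from_service_account_json(decrypted_path)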
Related
So I am trying to orchestrate a workflow in Airflow. One task needs to read from GCP Cloud Storage, which requires me to specify the Google application credentials.
I decided to create a new folder in the DAG folder and put the JSON key there. Then I specified this in the dag.py file:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "dags\support\keys\key.json"
Unfortunately, I am getting the error below:
google.auth.exceptions.DefaultCredentialsError: File dags\support\keys\dummy-surveillance-project-6915f229d012.json was not found
Can anyone help with how I should go about declaring the service account key?
Thank you.
You can create a connection to Google Cloud from the Airflow webserver Admin menu. In this menu you can pass the service account key file path.
For example, the Keyfile Path could be /usr/local/airflow/dags/gcp.json.
Beforehand you need to mount your key file as a volume in your Docker container at that path.
You can also copy the key JSON content directly into the Airflow connection, in the Keyfile JSON field.
You can check the following links:
Airflow-connections
Airflow-with-google-cloud
Airflow-composer-managing-connections
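Following on from the connection setup above, here is a hedged sketch of how a task could then use that connection by its connection id; the bucket, object name and local path are placeholders, and hook parameter names can differ slightly between Google provider versions.

from airflow.providers.google.cloud.hooks.gcs import GCSHook

def download_from_gcs():
    # "google_cloud_default" is the conn id of the connection created in the UI
    hook = GCSHook(gcp_conn_id="google_cloud_default")
    hook.download(
        bucket_name="your-bucket",          # placeholder
        object_name="path/in/bucket.csv",   # placeholder
        filename="/tmp/bucket.csv",         # local path on the worker
    )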
If you are trying to download data from Google Cloud Storage using Airflow, you should use the GCSToLocalFilesystemOperator described here. It is provided as part of the Google provider package for Airflow (if you have it installed), so you don't have to write the code yourself with a Python operator.
Also, if you use this operator you can enter the GCP credentials on the connections screen (where they should be). This is a better approach than putting your credentials in a folder with your DAGs, since that could lead to your credentials being committed into your version control system, which is a security issue.
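For illustration, a minimal sketch of the operator-based approach, assuming the Google provider package is installed; the DAG id, bucket, object and local filename are placeholders, and gcp_conn_id points at the connection described in the previous answer.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_local import GCSToLocalFilesystemOperator

with DAG(
    dag_id="gcs_download_example",         # placeholder
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    download_file = GCSToLocalFilesystemOperator(
        task_id="download_file",
        bucket="your-bucket",              # placeholder
        object_name="path/in/bucket.csv",  # placeholder
        filename="/tmp/bucket.csv",        # where the file lands on the worker
        gcp_conn_id="google_cloud_default",
    )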
I was asked to perform an integration with an external Google Storage bucket, and I received a credentials JSON.
When I run
gsutil ls gs://bucket_name (after configuring myself with the creds JSON), I get a valid response, and uploading a file into the bucket also works.
When trying to do it with Python 3, it does not work.
Using google-cloud-storage==1.16.0 (I also tried the newer versions), I'm doing:
from google.cloud import storage
from google.oauth2 import service_account

# credentials_dict is the parsed service account JSON I received
project_id = credentials_dict.get("project_id")
credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = storage.Client(credentials=credentials, project=project_id)
bucket = client.get_bucket(bucket_name)
But on the get_bucket line, I get:
google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/storage/v1/b/BUCKET_NAME?projection=noAcl: USERNAME#PROJECT_ID.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket.
The external partner I'm integrating with says the user is set up correctly, and as proof they point out that I can perform the action with gsutil.
Can you please assist? Any idea what might be the problem?
The answer was that the creds were indeed wrong, but it did work when I called client.bucket(bucket_name) on the client instead of client.get_bucket(bucket_name).
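To illustrate (a sketch reusing credentials_dict and bucket_name from the question; the object and local file names are placeholders): client.bucket() only builds a local reference and never calls buckets.get, so object-level operations can still succeed without the storage.buckets.get permission.

from google.cloud import storage
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = storage.Client(credentials=credentials, project=credentials_dict["project_id"])

bucket = client.bucket(bucket_name)           # no API call, so no storage.buckets.get needed
blob = bucket.blob("some/object.txt")         # placeholder object name
blob.upload_from_filename("local-file.txt")   # needs only object-level permissions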
Please follow these steps in order to correctly set up the Cloud Storage client library for Python. In general, the Cloud Storage libraries can use Application Default Credentials or environment variables for authentication.
Notice that the recommended method is to set up authentication using the environment variable (i.e. if you are using Linux, export GOOGLE_APPLICATION_CREDENTIALS="/path/to/[service-account-credentials].json" should work) and to avoid using the service_account.Credentials.from_service_account_info() method altogether:
from google.cloud import storage

# Credentials are picked up automatically from GOOGLE_APPLICATION_CREDENTIALS
storage_client = storage.Client(project='project-id-where-the-bucket-is')
bucket_name = "your-bucket"
bucket = storage_client.get_bucket(bucket_name)
should simply work because the authentication is handled by the client library via the environment variable.
Now, if you are interested in explicitly using the service account instead of the service_account.Credentials.from_service_account_info() method, you can use the from_service_account_json() method directly, in the following way:
from google.cloud import storage

# Explicitly use service account credentials by specifying the private key file.
storage_client = storage.Client.from_service_account_json(
    '/[service-account-credentials].json')
bucket_name = "your-bucket"
bucket = storage_client.get_bucket(bucket_name)
Find all the relevant details as to how to provide credentials to your application here.
tl;dr: don't use client.get_bucket at all.
See for detailed explanation and solution https://stackoverflow.com/a/51452170/705745
When running locally, my Jupyter notebook is able to reference Google BigQuery like so:
%%bigquery some_bq_table
SELECT *
FROM
`some_bq_dataset.some_bq_table`
So that later in my notebook I can reference some_bq_table as a pandas dataframe, as exemplified here: https://cloud.google.com/bigquery/docs/visualize-jupyter
I want to run my notebook on AWS SageMaker to test a few things. To authenticate with BigQuery, it seems that the only two ways are using a service account on GCP (or locally) or pointing the SDK to a credentials JSON using an env var (as explained here: https://cloud.google.com/docs/authentication/getting-started).
For example
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"
Is there an easy way to connect to BigQuery from SageMaker? My best idea right now is to download the JSON from somewhere to the SageMaker instance and then set the env var from the Python code.
For example, I would do this:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/user/Downloads/[FILE_NAME].json"
However, this isn't very secure - I don't like the idea of downloading my credentials JSON to a SageMaker instance (it means I would have to upload the credentials to some private S3 bucket and then store them on the SageMaker instance). Not the end of the world, but I'd rather avoid it.
Any ideas?
As you mentioned, GCP currently authenticates using a service account, a credentials JSON, or API tokens. Instead of storing the credentials in an S3 bucket, consider using AWS Secrets Manager or AWS Systems Manager Parameter Store to store the GCP credentials and then fetch them in the Jupyter notebook. This way the credentials stay secured and the credentials file is only created from the Parameter Store when needed.
This is sample code I used previously to connect to BigQuery from a SageMaker instance.
import os
import json
import boto3
from google.cloud.bigquery import magics
from google.oauth2 import service_account

def get_gcp_credentials_from_ssm(param_name):
    # Read credentials from the SSM parameter store
    ssm = boto3.client('ssm')
    # Get the requested parameter
    response = ssm.get_parameters(Names=[param_name], WithDecryption=True)
    # Store the credentials in a variable
    gcp_credentials = response['Parameters'][0]['Value']
    # Save credentials temporarily to a file
    credentials_file = '/tmp/.gcp/service_credentials.json'
    os.makedirs(os.path.dirname(credentials_file), exist_ok=True)
    with open(credentials_file, 'w') as outfile:
        json.dump(json.loads(gcp_credentials), outfile)
    # Create google.auth.credentials.Credentials to use for queries
    credentials = service_account.Credentials.from_service_account_file(credentials_file)
    # Remove the temporary file
    if os.path.exists(credentials_file):
        os.remove(credentials_file)
    return credentials

# This sets the context credentials to use for queries performed in Jupyter
# with the bigquery cell magic
magics.context.credentials = get_gcp_credentials_from_ssm('my_gcp_credentials')
Please note that the SageMaker execution role needs access to SSM and, of course, the necessary network route to connect to GCP. I am not sure if this is the best way though; hope someone has a better one.
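For completeness, here is a hedged sketch, not part of the original answer, of how the key file might be put into the Parameter Store in the first place (for example, run once from a workstation); the parameter name and file path are placeholders, and note that standard SecureString parameters are limited to 4 KB.

import boto3

def store_gcp_credentials_in_ssm(param_name, key_file_path):
    """Upload the service account JSON to SSM as an encrypted SecureString."""
    with open(key_file_path) as f:
        key_json = f.read()
    ssm = boto3.client("ssm")
    ssm.put_parameter(
        Name=param_name,       # e.g. 'my_gcp_credentials' (placeholder)
        Value=key_json,        # the raw service account JSON
        Type="SecureString",   # encrypted at rest with the default KMS key
        Overwrite=True,
    )

store_gcp_credentials_in_ssm("my_gcp_credentials", "/path/to/service-account.json")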
I know tf.python_io.TFRecordWriter has a concept of GCS, but it doesn't seem to have permissions to write to it.
If I do the following:
import tensorflow as tf

output_path = 'gs://my-bucket-name/{}/{}.tfrecord'.format(object_name, record_name)
writer = tf.python_io.TFRecordWriter(output_path)
# write to writer
writer.close()
then I get 401s saying "Anonymous caller does not have storage.objects.create access to my-bucket-name."
However, on the same machine, if I do gsutil rsync -d -r gs://my-bucket-name bucket-backup, it syncs properly, so I am authenticated correctly with gcloud.
How can I give TFRecordWriter permissions to write to GCS? I'm going to just use Google's GCP python API for now, but I'm sure there's a way to do this using TF alone.
A common strategy for setting up credentials on systems is to use Application Default Credentials (ADC). ADC is a strategy for locating Google Cloud service account credentials.
If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC will use the filename that the variable points to for service account credentials. This file is a Google Cloud Service Account credentials file in Json format. The previous P12 (PFX) certificates are deprecated.
If the environment variable is not set, the default service account is used for credentials, provided the application is running on Compute Engine, App Engine, Kubernetes Engine or Cloud Functions.
If the previous two steps fail to find valid credentials, ADC will fail, and an error occurs.
In this question, ADC could not find credentials, so the TensorFlow write to GCS failed.
The solution is to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the service account Json file.
For Linux:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
For Windows:
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\service-account.json
I wrote an article that goes into more detail on ADC.
Google Cloud Application Default Credentials
When you use the gsutil command, you are using the GCP user configured in the Cloud SDK (run gcloud config list to see it).
Your Python script is probably not authenticated in GCP.
I believe there is a better approach to solve this (sorry, I don't have a lot of knowledge about TensorFlow), but I can see two workarounds to fix it:
First option - Mount Cloud Storage buckets as file systems using Cloud Storage FUSE (gcsfuse)
Second option - Write locally and move later. In this approach, you can use this code:
# Service Account file
JSON_FILE_NAME = '<Service account json file>'
# Imports the Google Cloud client library
from google.cloud import storage
# Instantiates a client
storage_client = storage.Client.from_service_account_json(JSON_FILE_NAME)
#Example file (using the service account)
source_file_path = 'your file path'
destination_blob_name = 'name of file in gcs'
# The name for the new bucket
bucket_name = '<bucket_name>'
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_path)
print('File {} uploaded to {}.'.format(
    source_file_path,
    destination_blob_name))
Do note that the export command won't work in a Jupyter notebook.
If you're in a Jupyter notebook, this should work:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json'
The Google Ferris2 framework seems to use the blobstore API exclusively for the Upload component, which makes me question whether it's possible to make images uploaded to Cloud Storage public without writing my own upload method and abandoning the Upload component altogether; that also seems to create compatibility issues when using the Cloud Storage client library (Python).
Backstory / context
Using: Google App Engine, Python, Cloud Storage client library
Requirements
0.5 We require that neither the blob information nor the file itself be stored in the model. We want a public Cloud Storage serving URL on the model, and that is all. This seems to prevent us from using the normal Ferris approach for uploading to Cloud Storage.
Things I already know / road blocks
One of the big roadblocks is that Ferris uses cgi / the blobstore API for field storage on the form. This seems to cause problems because so far it hasn't allowed data to be sent to Cloud Storage through the Google Cloud Storage Python client.
Things we know about the google cloud storage python client and cgi:
To write data to Cloud Storage from our server, Cloud Storage needs to be called with cloudstorage.open("/bucket/object", "w", ...) (a Cloud Storage library method). However, it appears that a cgi.FieldStorage is returned from the POST for the wtforms.fields.FileField() (as shown by a simple "print image" statement) before the data is applied to the model; after it is applied to the model, it is a blobstore instance.
I would like verification on this:
After a lot of research and testing, it seems that because Ferris is limited to the blobstore API for the Uploads component, using the blobstore API and blob keys to handle uploads is basically unavoidable without creating a second upload function just for the Cloud Storage call. Blob instances seem not to be compatible with the Cloud Storage client library, and there seems to be no way to get anything but metadata from blob files (without actually making a call to Cloud Storage to get the original file). However, it appears that this will not require storing extra data on the server. Furthermore, I believe it may be possible to get around the public link issue by setting the entire bucket to have read permissions.
Clarifying Questions:
1. To make uploaded images available to the public via our application (any user, not just an authenticated user), will I have to use the cloudstorage Python client library, or is there a way to do this with the blobstore API?
2. Is there a way to get the original file from a blob key (on save, with the add action method) without actually having to make a call to Cloud Storage first, so that the file can be uploaded using that library?
3. If not, is there a way to grab the file from the cgi.FieldStorage and send it to Cloud Storage with the Python client library? It seems that cgi.FieldStorage.value is just metadata and not the file, and the same goes for cgi.FieldStorage.file.read().
1) You cannot use the GAE GCS client to update an ACL.
2) You can use the GCS JSON API after the blobstore upload to GCS and change the ACL to make it public. You do not have to upload again.
See this example code which inserts an ACL; a sketch of making an object public this way follows after the upload handler code below.
3) Or use cgi.FieldStorage to read the data (< 32 MB) and write it to GCS using the GAE GCS client:
import mimetypes
import webapp2
import cloudstorage as gcs

class UploadHandler(webapp2.RequestHandler):

    def post(self):
        file_data = self.request.get("file", default_value=None)
        filename = self.request.POST["file"].filename
        content_type = mimetypes.guess_type(filename)[0]
        # gcs.open() expects a path of the form '/bucket/object';
        # '/your-bucket' is a placeholder for the real bucket name
        with gcs.open('/your-bucket/' + filename, 'w',
                      content_type=content_type or b'binary/octet-stream',
                      options={b'x-goog-acl': b'public-read'}) as f:
            f.write(file_data)
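To illustrate option 2 above, here is a hedged sketch that flips an already-uploaded object to public-read via the JSON API, using the google-cloud-storage client rather than the GAE cloudstorage client used in the handler; the bucket and object names are placeholders.

from google.cloud import storage

def make_object_public(bucket_name, object_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.make_public()     # adds the allUsers: READER ACL entry via the JSON API
    return blob.public_url

public_url = make_object_public("your-bucket", "uploads/image.png")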
A third method: use a form post upload with a GCS signed url and a policy document to control the upload.
And you can always use a public download handler, which reads files from the blobstore or GCS.
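And a hedged sketch of such a public download handler using the GAE blobstore handlers; the '/gs/your-bucket/' prefix for serving GCS objects via blobstore.create_gs_key is a placeholder assumption.

import webapp2
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class ServeHandler(blobstore_handlers.BlobstoreDownloadHandler):

    def get(self, resource):
        # For a GCS object, build a blobstore-compatible key from '/gs/<bucket>/<object>';
        # for a plain blobstore upload, `resource` could be the blob key itself
        # and be passed straight to send_blob().
        gs_key = blobstore.create_gs_key('/gs/your-bucket/' + resource)
        self.send_blob(gs_key)

app = webapp2.WSGIApplication([('/serve/(.*)', ServeHandler)])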
You can now specify the ACL when uploading a file from App Engine to Cloud Storage. Not sure how long it's been in place, just wanted to share:
import cloudstorage as gcs

filename = '/' + bucket_name + '/Leads_' + newUNID() + '.csv'
write_retry_params = gcs.RetryParams(backoff_factor=1.1)
gcs_file = gcs.open(filename,
                    'w',
                    content_type='text/csv',
                    options={'x-goog-acl': 'public-read'},
                    retry_params=write_retry_params)
docs: https://cloud.google.com/storage/docs/xml-api/reference-headers#standard