I know tf.python_io.TFRecordWriter has a concept of GCS, but it doesn't seem to have permissions to write to it.
If I do the following:
output_path = 'gs://my-bucket-name/{}/{}.tfrecord'.format(object_name, record_name)
writer = tf.python_io.TFRecordWriter(output_path)
# write to writer
writer.close()
then I get 401s saying "Anonymous caller does not have storage.objects.create access to my-bucket-name."
However, on the same machine, if I run gsutil rsync -d -r gs://my-bucket-name bucket-backup, it syncs properly, so I have authenticated correctly using gcloud.
How can I give TFRecordWriter permission to write to GCS? I'm just going to use Google's GCP Python API for now, but I'm sure there's a way to do this using TF alone.
A common strategy for setting up credentials on systems is Application Default Credentials (ADC). ADC is a strategy for locating Google Cloud service account credentials.
If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC uses the file that the variable points to for service account credentials. This file is a Google Cloud service account credentials file in JSON format. The older P12 (PFX) certificates are deprecated.
If the environment variable is not set, the default service account is used for credentials if the application is running on Compute Engine, App Engine, Kubernetes Engine, or Cloud Functions.
If the previous two steps fail to find valid credentials, ADC fails with an error.
For this question, ADC could not find credentials and the TensorFlow write to GCS failed.
The solution is to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the service account JSON file.
For Linux:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
For Windows:
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\service-account.json
I wrote an article that goes into more detail on ADC.
Google Cloud Application Default Credentials
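Once the variable is set, the code from the question should work unchanged. For illustration, a minimal sketch (assuming TensorFlow 1.x and hypothetical bucket, object, and key-file paths):
import os
import tensorflow as tf
# Point ADC at the service account key file (hypothetical path)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account.json'
# Hypothetical output location in GCS
output_path = 'gs://my-bucket-name/my-object/my-record.tfrecord'
with tf.python_io.TFRecordWriter(output_path) as writer:
    # Write a single dummy record
    example = tf.train.Example(features=tf.train.Features(feature={
        'value': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'hello']))
    }))
    writer.write(example.SerializeToString())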
When you use the gsutil command, you are using the GCP user configured in the Cloud SDK (run gcloud config list to see it).
Your Python script is probably not authenticated in GCP.
I believe there is a better approach to solve this (sorry, I don't have a lot of knowledge about TensorFlow), but I can see two workarounds to fix it:
First option - Mount Cloud Storage buckets as file systems using Cloud Fuse
Second option - Write locally and move later. In this approach, you can use this code:
# Service account JSON file
JSON_FILE_NAME = '<Service account json file>'
# Imports the Google Cloud client library
from google.cloud import storage
# Instantiates a client using the service account
storage_client = storage.Client.from_service_account_json(JSON_FILE_NAME)
# Example file (uploaded using the service account)
source_file_path = 'your file path'
destination_blob_name = 'name of file in gcs'
# The name of the target bucket
bucket_name = '<bucket_name>'
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_path)
print('File {} uploaded to {}.'.format(source_file_path, destination_blob_name))
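Combining this with TFRecordWriter, one possible pattern (a sketch with hypothetical local and GCS paths) is to write the records locally first and then upload the file with the snippet above:
import tensorflow as tf
# Write the TFRecord file to local disk first (hypothetical path)
local_path = '/tmp/my-record.tfrecord'
with tf.python_io.TFRecordWriter(local_path) as writer:
    # ... write your serialized tf.train.Example records here ...
    pass
# Then upload it with the storage client from the snippet above, e.g.:
# blob = bucket.blob('my-object/my-record.tfrecord')
# blob.upload_from_filename(local_path)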
Do note that the export command won't work in a Jupyter notebook.
If you're in a Jupyter notebook, this should work:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json'
Related
So I am trying to orchestrate a workflow in Airflow. One task needs to read from GCP Cloud Storage, which requires me to specify the Google Application Credentials.
I decided to create a new folder in the dag folder and put the JSON key there. Then I specified this in the dag.py file:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "dags\support\keys\key.json"
Unfortunately, I am getting the error below:
google.auth.exceptions.DefaultCredentialsError: File dags\support\keys\dummy-surveillance-project-6915f229d012.json was not found
Can anyone help with how I should go about declaring the service account key?
Thank you.
You can create a connection to Google Cloud from the Airflow webserver admin menu. In this menu you can pass the service account key file path.
For example, the Keyfile Path could be /usr/local/airflow/dags/gcp.json.
Beforehand, you need to mount your key file as a volume in your Docker container at that path.
You can also copy the key JSON content directly into the Airflow connection, in the Keyfile JSON field.
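Once the connection is configured, task code can reference it by its connection id. A minimal, hypothetical sketch using GCSHook (assuming the Google provider package is installed, a connection id of google_cloud_default, and a bucket named my-bucket):
from airflow.providers.google.cloud.hooks.gcs import GCSHook
# The hook reads the key file path (or key JSON) stored in the Airflow connection
hook = GCSHook(gcp_conn_id='google_cloud_default')
# List objects in the bucket using the credentials from the connection
objects = hook.list(bucket_name='my-bucket')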
You can check the following links:
Airflow-connections
Airflow-with-google-cloud
Airflow-composer-managing-connections
If you are trying to download data from Google Cloud Storage using Airflow, you should use the GCSToLocalFilesystemOperator operator described here. It is already provided as part of the Airflow Google provider package (if you installed it), so you don't have to write the code yourself using the Python operator.
Also, if you use this operator you can enter the GCP credentials on the connections screen (where they should be). This is a better approach than putting your credentials in a folder with your DAGs, as that could lead to your credentials being committed to your version control system, which could cause security issues.
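For illustration, a minimal, hypothetical task definition using that operator (the bucket, object, and file names, and the connection id, are placeholders):
from airflow.providers.google.cloud.transfers.gcs_to_local import GCSToLocalFilesystemOperator
# Download one object from GCS to the worker's local filesystem,
# authenticating through the Airflow connection rather than a key file in the DAGs folder
download_task = GCSToLocalFilesystemOperator(
    task_id='download_from_gcs',
    bucket='my-bucket',
    object_name='path/in/bucket/file.csv',
    filename='/tmp/file.csv',
    gcp_conn_id='google_cloud_default',
)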
I was trying to save two files to GCP Storage using the following commands in a Jupyter Notebook:
!gsutil cp ./dist/my_custom_code-0.1.tar.gz gs://$BUCKET_NAME/custom_prediction_routine_tutorial/my_custom_code-0.1.tar.gz
!gsutil cp model.h5 preprocessor.pkl gs://$BUCKET_NAME/custom_prediction_routine_tutorial/model/
The bucket has been created properly since I can see it in the bucket list on GCP. Also in Permissions for the bucket, I can see the service account created. Plus, I made sure the environment variable is set by running:
export GOOGLE_APPLICATION_CREDENTIALS="/home/george/Documents/Credentials/prediction-routine-new-b7a445077e61.json"
This can be verified by running this in Python:
import os
print('Credentials from environ: {}'.format(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')))
which shows:
Credentials from environ: /home/george/Documents/Credentials/prediction-routine-new-b7a445077e61.json
And I do have the json file stored at the specified location. However, when I tried to save files using the commands shown at the top, I kept getting this error message:
AccessDeniedException: 403 george***@gmail.com does not have storage.objects.list access to the Google Cloud Storage bucket.
Copying file://model.h5 [Content-Type=application/octet-stream]...
AccessDeniedException: 403 george***@gmail.com does not have storage.objects.create access to the Google Cloud Storage object.
So the question is, how come Google Storage is not using my service account and keeps using my user account?
UPDATE
After activating the service account for the project as pointed out by @Hao Z, GCP is using my service account now. However, I do have the permissions set for this service account...
UPDATE 2
This seems to be a known issue: https://github.com/GoogleCloudPlatform/gsutil/issues/546
Check How to use Service Accounts with gsutil, for uploading to CS + BigQuery
Relevant bit:
Download service account key file, and put it in e.g. /etc/backup-account.json
gcloud auth activate-service-account --key-file /etc/backup-account.json
Or you can do gsutil -i to impersonate a service account. Use 'gsutil help creds' for more info. I guess the env variable is just used by the Python SDK and not by the CLI.
I was able to resolve this with the following steps:
First, using the approach suggested by @Hao Z above, I was able to activate the service account in the Jupyter Notebook using:
!gcloud auth activate-service-account \
prediction-routine-new@prediction-routine-test.iam.gserviceaccount.com \
--key-file=/home/george/Documents/Credentials/prediction-routine-new-b7a445077e61.json \
--project=prediction-routine-test
Second, I changed the bucket name used after realizing that I was using the wrong name - it should be "prediction-routine" instead of "prediction-routine-bucket".
BUCKET_NAME="prediction-routine"
Third, I changed the role from "Storage Object Admin" to "Storage Admin" for the service account's permissions.
I was asked to perform an integration with an external Google Storage bucket, and I received a credentials JSON.
When I run gsutil ls gs://bucket_name (after configuring myself with the creds JSON), I get a valid response, and uploading a file into the bucket works as well.
When trying to do it with Python 3, it does not work.
Using google-cloud-storage==1.16.0 (I also tried newer versions), I'm doing:
from google.cloud import storage
from google.oauth2 import service_account

# credentials_dict is the parsed service account JSON
project_id = credentials_dict.get("project_id")
credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = storage.Client(credentials=credentials, project=project_id)
bucket = client.get_bucket(bucket_name)
But on the get_bucket line, I get:
google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/storage/v1/b/BUCKET_NAME?projection=noAcl: USERNAME#PROJECT_ID.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket.
The external partner I'm integrating with says that the user is set up correctly, and to prove it they show that I can perform the action with gsutil.
Can you please assist? Any idea what might be the problem?
The answer was that the creds were indeed wrong, but it did work when I called client.bucket(bucket_name) on the client instead of client.get_bucket(bucket_name).
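For context, client.bucket() only constructs a Bucket object locally and makes no API call, so it does not need the storage.buckets.get permission that client.get_bucket() (which calls the buckets.get API) requires. A minimal sketch along the lines of the question's code (credentials_dict, bucket_name, and the file names are placeholders):
from google.cloud import storage
from google.oauth2 import service_account
# Build the client exactly as in the question
credentials = service_account.Credentials.from_service_account_info(credentials_dict)
client = storage.Client(credentials=credentials, project=credentials_dict.get("project_id"))
# No API call happens here, so storage.buckets.get is not needed
bucket = client.bucket(bucket_name)
# The permission check happens on the object-level operation instead
blob = bucket.blob('some/object.txt')
blob.upload_from_filename('local_file.txt')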
Please follow these steps in order to correctly set up the Cloud Storage Client Library for Python. In general, the Cloud Storage Libraries can use Application default credentials or environment variables for authentication.
Notice that the recommended method is to set up authentication using environment variables (e.g. on Linux, export GOOGLE_APPLICATION_CREDENTIALS="/path/to/[service-account-credentials].json" should work) and to avoid the use of the service_account.Credentials.from_service_account_info() method altogether:
from google.cloud import storage
storage_client = storage.Client(project='project-id-where-the-bucket-is')
bucket_name = "your-bucket"
bucket = storage_client.get_bucket(bucket_name)
This should simply work, because authentication is handled by the client library via the environment variable.
Now, if you are interested in explicitly using the service account file (instead of the service_account.Credentials.from_service_account_info() method), you can use the from_service_account_json() method directly, in the following way:
from google.cloud import storage
# Explicitly use service account credentials by specifying the private key
# file.
storage_client = storage.Client.from_service_account_json(
    '/[service-account-credentials].json')
bucket_name = "your-bucket"
bucket = storage_client.get_bucket(bucket_name)
Find all the relevant details as to how to provide credentials to your application here.
tl;dr: don't use client.get_bucket at all.
See for detailed explanation and solution https://stackoverflow.com/a/51452170/705745
In my Python server script, which is running on a Google Cloud VM instance, I try to save an image (JPEG) to storage, but it throws the following error:
File "/home/thamindudj_16/server/object_detection/object_detector.py",
line 109, in detect Hand
new_img.save("slicedhand/{}#sliced_image{}.jpeg".format(threadname,
i)) File
"/home/thamindudj_16/.local/lib/python3.5/site-packages/PIL/Image.py",
line 2004, in save
fp = builtins.open(filename, "w+b")
OSError: [Errno 5] Input/output error: 'slicedhand/thread_1#sliced_image0.jpeg'
All the files, including the Python scripts, are in a Google Storage bucket that has been mounted to the VM instance using gcsfuse. The app tries to save the new image in the slicedhand folder.
Python code snippet where the image saving happens:
from PIL import Image
...
...
i = 0
new_img = Image.fromarray(bounding_box_img) ## conversion to an image
new_img.save("slicedhand/{}#sliced_image{}.jpeg".format(threadname, i))
I think maybe the problem is with permission access. The doc says to use --key_file, but what key file should I use and where can I find it? I'm not clear whether this is the problem or something else.
Any help would be appreciated.
I understand that you are using gcsfuse on your Linux VM instance to access Google Cloud Storage.
The key file is a service account credentials key that allows you to run the Cloud SDK or a client library as another service account. You can download the key file from the Cloud Console. However, if you are using a VM instance, you are automatically using the Compute Engine default service account. You can check it using the console command: $ gcloud init.
To configure your credentials properly, please follow the documentation.
The Compute Engine default service account needs to have the Storage access scope set to Full. Access scopes are the mechanism that limits the access level to Cloud APIs. This can be set during machine creation or when the VM instance is stopped.
Please note that access scopes are defined explicitly for the service account that you select for the VM instance.
Cloud Storage object names have naming requirements. It is strongly recommended to avoid using the hash symbol "#" in object names.
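For example, the saving line from the question could be changed to avoid the hash symbol in the object name (a minimal sketch, keeping the rest of the question's code):
# Same save call as in the question, with '#' replaced by '_' in the object name
new_img.save("slicedhand/{}_sliced_image{}.jpeg".format(threadname, i))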
When running locally, my Jupyter notebook is able to reference Google BigQuery like so:
%%bigquery some_bq_table
SELECT *
FROM
`some_bq_dataset.some_bq_table`
So that later in my notebook I can reference some_bq_table as a pandas dataframe, as exemplified here: https://cloud.google.com/bigquery/docs/visualize-jupyter
I want to run my notebook on AWS SageMaker to test a few things. To authenticate with BigQuery, it seems that the only two ways are using a service account on GCP (or locally) or pointing the SDK to a credentials JSON using an env var (as explained here: https://cloud.google.com/docs/authentication/getting-started).
For example
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"
Is there an easy way to connect to BigQuery from SageMaker? My best idea right now is to download the JSON from somewhere to the SageMaker instance and then set the env var from the Python code.
For example, I would do this:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/user/Downloads/[FILE_NAME].json"
However, this isn't very secure - I don't like the idea of downloading my credentials JSON to a SageMaker instance (this means I would have to upload the credentials to some private S3 bucket and then store them on the SageMaker instance). Not the end of the world, but I'd rather avoid this.
Any ideas?
As you mentioned, GCP currently authenticates using service accounts, credentials JSON, and API tokens. Instead of storing the credentials in an S3 bucket, you can consider using AWS Secrets Manager or AWS Systems Manager Parameter Store to store the GCP credentials and fetch them in the Jupyter notebook. This way the credentials stay secured, and the credentials file is created from the Parameter Store only when needed.
This is sample code I used previously to connect to BigQuery from a SageMaker instance.
import os
import json
import boto3
from google.cloud.bigquery import magics
from google.oauth2 import service_account

def get_gcp_credentials_from_ssm(param_name):
    # Read credentials from the SSM parameter store
    ssm = boto3.client('ssm')
    # Get the requested parameter
    response = ssm.get_parameters(Names=[param_name], WithDecryption=True)
    # Store the credentials in a variable
    gcp_credentials = response['Parameters'][0]['Value']
    # Save credentials temporarily to a file
    credentials_file = '/tmp/.gcp/service_credentials.json'
    # Make sure the temporary directory exists before writing
    os.makedirs(os.path.dirname(credentials_file), exist_ok=True)
    with open(credentials_file, 'w') as outfile:
        json.dump(json.loads(gcp_credentials), outfile)
    # Create google.auth.credentials.Credentials to use for queries
    credentials = service_account.Credentials.from_service_account_file(credentials_file)
    # Remove the temporary file
    if os.path.exists(credentials_file):
        os.remove(credentials_file)
    return credentials

# This sets the context credentials used for queries performed in Jupyter
# with the bigquery cell magic
magics.context.credentials = get_gcp_credentials_from_ssm('my_gcp_credentials')
Please note that the SageMaker execution role should have access to SSM and, of course, the other necessary network routes to connect to GCP. I am not sure if this is the best way, though; I hope someone has a better one.
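For completeness, a hypothetical sketch of how the credentials could be stored in Parameter Store in the first place (the key file path is a placeholder; the parameter name matches the one used above):
import boto3

ssm = boto3.client('ssm')
# Store the service account JSON as an encrypted parameter
with open('/path/to/service-account.json') as key_file:
    ssm.put_parameter(
        Name='my_gcp_credentials',
        Value=key_file.read(),
        Type='SecureString',
        Overwrite=True,
    )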