Python: Save data in google cloud datastore emulator


I am trying to test Google Cloud Datastore locally. My test steps are:
Crawl data using Scrapy
Save the data to Cloud Datastore using the local emulator
I followed all the steps required to use the local emulator:
start the local emulator: gcloud beta emulators datastore start
set the local environment variables: gcloud beta emulators datastore env-init
However, in Python, when I use the following code to access Cloud Datastore, it always saves the data directly to Google Cloud instead of to the local emulator:
# Imports the Google Cloud client library
from google.cloud import datastore
# Instantiates a client
datastore_client = datastore.Client()
sample_entry = some_data
# Saves the entity
datastore_client.put(sample_entry)
It seems you cannot tell the library to use the local Datastore emulator, the way the Node.js client allows:
var datastore = gcloud.datastore({
    apiEndpoint: "http://localhost:8380"
});
My question is: how can I make the Google Cloud Datastore Python library use the local emulator instead of the cloud?

You need to run eval $(gcloud beta emulators datastore env-init).
On its own, gcloud beta emulators datastore env-init only prints the commands that set the necessary environment variables; it does not set them in your shell.
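As a rough illustration of what the eval achieves, here is a minimal sketch that does the equivalent from inside Python: it parses `export NAME=value` lines (the shape env-init prints on Linux/macOS) and copies them into an environment mapping. The helper name `apply_env_init` and the sample host and project values are hypothetical, not taken from the question. The google-cloud-datastore client reads DATASTORE_EMULATOR_HOST when it is constructed, so the variables must be set before calling datastore.Client().

```python
import os
import shlex


def apply_env_init(output, environ=None):
    """Copy `export NAME=value` lines into an environment mapping.

    Hypothetical helper: `output` is the text printed by
    `gcloud beta emulators datastore env-init`.
    """
    environ = os.environ if environ is None else environ
    applied = {}
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # ignore anything that is not an export statement
        name, sep, value = line[len("export "):].partition("=")
        if not sep:
            continue  # no '=' means there is nothing to assign
        # shlex.split removes shell quoting around the value, if any
        parts = shlex.split(value)
        applied[name.strip()] = parts[0] if parts else ""
    environ.update(applied)
    return applied


# Assumed sample of what env-init might print; real values will differ.
sample_output = """\
export DATASTORE_EMULATOR_HOST=localhost:8081
export DATASTORE_PROJECT_ID=my-test-project
"""
env = {}
apply_env_init(sample_output, env)
```

In real use you would pass the captured output of the gcloud command and let the helper update os.environ (the default) before creating the client.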

You can try something like
import os
if os.getenv('SERVER_SOFTWARE', '').startswith('Google App Engine/'):
    # Production
    pass
else:
    # Local development server
    pass
You can read more here: https://cloud.google.com/appengine/docs/standard/python/tools/using-local-server

Related

Allow Google Cloud Compute Engine Instance to write file to Google Storage Bucket - Python

In my Python server script, which runs on a Google Cloud VM instance, I try to save an image (JPEG) to storage, but it throws the following error:
  File "/home/thamindudj_16/server/object_detection/object_detector.py", line 109, in detect Hand
    new_img.save("slicedhand/{}#sliced_image{}.jpeg".format(threadname, i))
  File "/home/thamindudj_16/.local/lib/python3.5/site-packages/PIL/Image.py", line 2004, in save
    fp = builtins.open(filename, "w+b")
OSError: [Errno 5] Input/output error: 'slicedhand/thread_1#sliced_image0.jpeg'
All the files, including the Python scripts, are in a Google Cloud Storage bucket that is mounted on the VM instance using gcsfuse. The app tries to save the new image in the slicedhand folder.
The Python code snippet where the image is saved:
from PIL import Image
...
...
i = 0
new_img = Image.fromarray(bounding_box_img) ## conversion to an image
new_img.save("slicedhand/{}#sliced_image{}.jpeg".format(threadname, i))
I think the problem may be with access permissions. The docs say to use --key_file, but which key file should I use, and where can I find it? I'm not sure whether this is the problem or something else.
Any help would be appreciated.

I understand that you are using gcsfuse on your Linux VM instance to access Google Cloud Storage.
The key file is a service account credentials key that lets you initialize the Cloud SDK or a client library as another service account. You can download the key file from the Cloud Console. However, if you are on a VM instance, you are automatically using the Compute Engine default service account. You can check this with the console command: $ gcloud init.
To configure your credentials properly, please follow the documentation.
The Compute Engine default service account needs to have the access scope Storage > Full enabled. Access scopes are the mechanism that limits the access level to Cloud APIs. This can be set during machine creation or while the VM instance is stopped.
Please note that access scopes are defined explicitly for the service account that you select for the VM instance.
Cloud Storage object names have requirements. It is strongly recommended to avoid using the hash symbol "#" in object names.
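For the last point, a minimal sketch of building the file name without "#". The helper `safe_object_name` and the choice of "_" as the replacement are assumptions for illustration, and this alone may not resolve the I/O error if the cause is the access scope.

```python
def safe_object_name(name, replacement="_"):
    """Replace '#' so the path is friendlier as a Cloud Storage object name."""
    return name.replace("#", replacement)


# Same naming pattern as in the question, with sample values filled in.
threadname, i = "thread_1", 0
filename = safe_object_name(
    "slicedhand/{}#sliced_image{}.jpeg".format(threadname, i)
)
```

The result can then be passed to new_img.save(filename) in place of the original string.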

Output TFRecord to Google Cloud Storage from Python

I know tf.python_io.TFRecordWriter has a concept of GCS, but it doesn't seem to have permissions to write to it.
If I do the following:
output_path = 'gs://my-bucket-name/{}/{}.tfrecord'.format(object_name, record_name)
writer = tf.python_io.TFRecordWriter(output_path)
# write to writer
writer.close()
then I get 401s saying "Anonymous caller does not have storage.objects.create access to my-bucket-name."
However, on the same machine, if I run gsutil rsync -d -r gs://my-bucket-name bucket-backup, it syncs properly, so I have authenticated correctly using gcloud.
How can I give TFRecordWriter permissions to write to GCS? I'm going to just use Google's GCP python API for now, but I'm sure there's a way to do this using TF alone.

A common way to set up credentials on a system is to use Application Default Credentials (ADC). ADC is a strategy for locating Google Cloud service account credentials.
If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC uses the file that the variable points to for service account credentials. This file is a Google Cloud service account credentials file in JSON format. The older P12 (PFX) certificates are deprecated.
If the environment variable is not set, the default service account is used for credentials if the application is running on Compute Engine, App Engine, Kubernetes Engine or Cloud Functions.
If the previous two steps fail to find valid credentials, ADC fails and an error occurs.
In this question, ADC could not find credentials, so the TensorFlow writes to GCS failed.
The solution is to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the service account JSON file.
For Linux:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
For Windows:
set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\service-account.json
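Before running the TensorFlow code, it can help to confirm that the variable is actually visible to the Python process. This is a minimal sketch covering only the first ADC step described above; the function name `credentials_file_status` and its return strings are hypothetical, and it does not validate the contents of the key file.

```python
import os


def credentials_file_status(environ=None):
    """Report whether GOOGLE_APPLICATION_CREDENTIALS points at a real file."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"  # ADC would move on to the next lookup step
    if not os.path.isfile(path):
        return "missing file"  # variable is set but the file does not exist
    return "ok"


status = credentials_file_status()
print("GOOGLE_APPLICATION_CREDENTIALS:", status)
```

A result of "unset" inside the script, even though the variable is set in another terminal, usually means the script was started from a shell where the export was never run.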
I wrote an article that goes into more detail on ADC.
Google Cloud Application Default Credentials

When you use the gsutil command, you are using the GCP user configured in the Cloud SDK (run gcloud config list to see it).
Your Python script is probably not authenticated with GCP.
I believe there is a better way to solve this (sorry, I don't know much about TensorFlow), but I can see two workarounds:
First option: mount Cloud Storage buckets as file systems using Cloud Storage FUSE.
Second option: write locally and move the file later. For this approach, you can use this code:
# Service Account file
JSON_FILE_NAME = '<Service account json file>'
# Imports the Google Cloud client library
from google.cloud import storage
# Instantiates a client
storage_client = storage.Client.from_service_account_json(JSON_FILE_NAME)
# Example file (uploaded using the service account)
source_file_path = 'your file path'
destination_blob_name = 'name of file in gcs'
# The name of the existing bucket to upload to
bucket_name = '<bucket_name>'
bucket = storage_client.get_bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_path)
print('File {} uploaded to {}.'.format(
    source_file_path,
    destination_blob_name))

Do note that the export command won't work in a Jupyter notebook.
If you're in a Jupyter notebook, this should work:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json'

Secrets in a google cloud bucket

We want to have a production Airflow environment but do not know how to deal properly with secrets, in particular Google BigQuery client JSON files.
We tried setting up Kubernetes secrets on the Kubernetes cluster that is created automatically when you create a Google Cloud Composer (Airflow) environment. We currently just put the files on the bucket, but would like a better way.
from os.path import join
from google.cloud import bigquery as bq

def get_bq_client():
    """Returns a BigQuery client."""
    return bq.Client.from_service_account_json(
        join("volumes", "bigquery.json")
    )
We would like some form of proper management of the required secrets. Sadly, using Airflow Variables won't work because we can't create the client object using the JSON file as text.

One solution that would work is to encrypt the JSON files and put them on the bucket. As long as the decryption key exists on the bucket and nowhere else, you can check the encrypted secrets into source control along with the code, then check them out and decrypt them in the bucket.
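On the question's point about not being able to build the client from JSON text: google-auth provides service_account.Credentials.from_service_account_info, which accepts the parsed key as a dict, so a file on disk is not strictly required. The following is a sketch under that assumption; the helper names and the REQUIRED_KEYS check are hypothetical, and where the text comes from (an Airflow Variable, a secret store, a decrypted file) is up to you.

```python
import json

# Fields a service account key file normally contains; checked early so a
# truncated or wrong secret fails with a clear message.
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def parse_service_account_text(text):
    """Parse service account JSON held as a string into a dict."""
    info = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in info]
    if missing:
        raise ValueError("missing keys: {}".format(", ".join(missing)))
    return info


def get_bq_client_from_text(text):
    """Build a BigQuery client from JSON text instead of a file path."""
    # Imported here so the parsing helper works without these packages.
    from google.cloud import bigquery
    from google.oauth2 import service_account

    info = parse_service_account_text(text)
    credentials = service_account.Credentials.from_service_account_info(info)
    return bigquery.Client(credentials=credentials, project=info["project_id"])
```

With this shape, the secret never has to be written to the bucket in plain form; only the code that fetches the text needs access to it.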

Connect to local App Engine Datastore with Apache Beam

I am new to Google App Engine and a little confused by the answers related to connecting to a local Datastore.
My ultimate goal is to stream data from a Google Datastore to a BigQuery dataset, similar to https://blog.papercut.com/google-cloud-dataflow-data-migration/. I have a copy of this Datastore locally, accessible when I run a local App Engine, i.e. I can access it through an admin console when I use $[GOOGLE_SDK_PATH]/dev_appserver.py --datastore_path=./datastore.
I would like to know whether it is possible to connect to this datastore using services outside the App Engine instance, with Python google-cloud-datastore or even the Apache Beam ReadFromDatastore method. If not, should I use the Datastore Emulator with the file generated by the App Engine Datastore?
If anyone has an idea of how to proceed, I would be more than grateful.

If it is possible, it would have to be through the Datastore Emulator, which can also serve apps other than App Engine. But it ultimately depends on the implementation of the libraries you intend to use: whether the underlying access methods understand the DATASTORE_EMULATOR_HOST environment variable pointing to a running Datastore emulator and use that instead of the real Datastore. I guess you'll just have to give it a try.
But be aware that the internal format of the local storage directory used by the Datastore Emulator may differ from the one used by the development server, so back up your .datastore directory before trying anything, just in case. From Local data format conversion:
Currently, the local Datastore emulator stores data in sqlite3 while
the Cloud Datastore Emulator stores data as Java objects.
When dev_appserver is launched with legacy sqlite3 data, the data will
be converted to Java objects. The original data is backed up with the
filename {original-data-filename}.sqlitestub.

How is ndb (and cloud datastore) being used in the firebase tic-tac-toe example

In the google app engine firebase tic-tac-toe example here: https://cloud.google.com/solutions/using-firebase-real-time-events-app-engine
ndb is used to create the Game data model. This model is used in the code to store the state of the tic-tac-toe game. I thought ndb was used to store data in Cloud Datastore, but, as far as I can tell, nothing is being stored in the Cloud Datastore of the associated Google Cloud project. I think this is because I am launching the app in 'dev mode' with python dev_appserver.py app.yaml. In this case, is the data being stored in memory instead of actually being written to Cloud Datastore?

You're correct, running the application locally is using a datastore emulation, contained inside dev_appserver.py.
The data is not stored in memory, but on the local disk. So even if the development server restarts it will still find the "datastore" data written in a previous execution.
You can check the data actually saved using the local development server's admin interface at http://localhost:8000/datastore

Dan's answer is correct; your "dev_appserver.py" automatically creates a local datastore.
I would like to add that if you do wish to emulate a real Cloud Datastore environment and be able to generate usable indexes for your production Cloud Datastore, we have an emulator that can do that. I assume that's why you want your dev app to use the real Datastore?
Either way, if you are just testing and need persistent storage for tests (not for production), then either the default dev server local storage or the Cloud Datastore Emulator will suffice.
