Connect to local App Engine Datastore with Apache Beam - python

I am new to Google App Engine and I am a little confused by the answers I have found about connecting to a local Datastore.
My ultimate goal is to stream data from a Google Datastore into a BigQuery dataset, similar to https://blog.papercut.com/google-cloud-dataflow-data-migration/. I have a copy of this Datastore locally, accessible when I run a local App Engine instance, i.e. I can access it through the admin console when I run $[GOOGLE_SDK_PATH]/dev_appserver.py --datastore_path=./datastore.
I would like to know whether it is possible to connect to this datastore from services outside the App Engine instance, using the Python google-cloud-datastore library or even Apache Beam's ReadFromDatastore method. If not, should I use the Datastore Emulator with the file generated by the App Engine Datastore?
If anyone has an idea of how to proceed, I would be more than grateful.

If it is possible at all, it would have to be through the Datastore Emulator, which can also serve apps other than App Engine. But it ultimately depends on the implementation of the libraries you intend to use - whether the underlying access methods understand the DATASTORE_EMULATOR_HOST environment variable pointing to a running Datastore emulator and use it instead of the real Datastore. I guess you'll just have to give it a try.
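For what it's worth, here is a minimal sketch of pointing the Python google-cloud-datastore client at a locally running emulator via that environment variable; the host/port, project id and kind name are placeholders, not values from the question:

# Sketch only: the client picks up DATASTORE_EMULATOR_HOST and talks to the
# emulator without real credentials. Adjust the host/port to what the emulator
# prints on startup; 'my-local-project' and 'Task' are hypothetical.
import os
from google.cloud import datastore

os.environ['DATASTORE_EMULATOR_HOST'] = 'localhost:8081'

client = datastore.Client(project='my-local-project')
query = client.query(kind='Task')
print(list(query.fetch(limit=10)))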
But be aware that the internal format of the local storage directory used by the Datastore Emulator may differ from the one used by the development server, so make a backup of your .datastore dir before trying anything, just in case. From Local data format conversion:
Currently, the local Datastore emulator stores data in sqlite3 while the Cloud Datastore Emulator stores data as Java objects.
When dev_appserver is launched with legacy sqlite3 data, the data will be converted to Java objects. The original data is backed up with the filename {original-data-filename}.sqlitestub.

Related

How is ndb (and cloud datastore) being used in the firebase tic-tac-toe example

In the Google App Engine Firebase tic-tac-toe example here: https://cloud.google.com/solutions/using-firebase-real-time-events-app-engine
ndb is used to create the Game data model. This model is used in the code to store the state of the tic-tac-toe game. I thought ndb was used to store data in Cloud Datastore, but, as far as I can tell, nothing is being stored in the Cloud Datastore of the associated Google Cloud project. I think this is because I am launching the app in 'dev mode' with python dev_appserver.py app.yaml. In this case, is the data being stored in memory instead of actually being written to Cloud Datastore?
You're correct, running the application locally uses a datastore emulation contained inside dev_appserver.py.
The data is not stored in memory, but on the local disk. So even if the development server restarts, it will still find the "datastore" data written in a previous execution.
You can check the data actually saved using the local development server's admin interface at http://localhost:8000/datastore
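As an illustration (the kind and property names below only loosely follow the tic-tac-toe example), any put() issued while running under dev_appserver.py lands in that local datastore file and shows up at the admin URL above:

from google.appengine.ext import ndb

class Game(ndb.Model):
    board = ndb.StringProperty()
    moveX = ndb.BooleanProperty()

# Written to the local datastore file, not to Cloud Datastore,
# when the app runs under dev_appserver.py.
Game(id='example-game', board=' ' * 9, moveX=True).put()
print(Game.get_by_id('example-game').board)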
Dan's answer is correct; your "dev_appserver.py" automatically creates a local datastore.
I would like to add that if you do wish to emulate a real Cloud Datastore environment and be able to generate usable indexes for your production Cloud Datastore, we have an emulator that can do that. I assume that's why you want your dev app to use the real Datastore?
Either way, if you're just doing testing and need persistent storage (not for production), then both the default devserver local storage and the Cloud Datastore Emulator will suffice.

Using GAE instances and GCE VMs together

I am building a news aggregator app, and the backend can be separated (mostly) in two logical parts:
Crawling, information extraction, parsing, clustering, storing...
Serving the user requests
What I would like to do is:
a) create a heavy Google Compute Engine VM instance to do the crawling (since that isn't doable on Google App Engine, because App Engine instance RAM is relatively small)
b) create a Google App Engine group of instances to serve the client requests, which are lightweight and don't require much computational power per request
Is it possible to mix the two, Google App Engine and Google Compute Engine?
Or do I need to create the instance group on my own via GCE?
Another option you should explore is App Engine Flexible (disclaimer: I work at Google on App Engine).
We allow you to build an App Engine application that has multiple modules. Those modules run on GCE virtual machines, which are managed by App Engine. We auto-scale, auto-provision, etc. Under the hood, we're actually provisioning a managed instance group and an autoscaler the same way you would with GCE (just with no work on your part). You can also customize the CPU and memory of the machines your app runs on.
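As a rough, hypothetical illustration (the service name, entrypoint and resource figures are made up, not taken from the docs linked below), a flexible-environment module's app.yaml can request a bigger machine for the crawler:

# Hypothetical app.yaml for a crawler module on App Engine Flexible
service: crawler
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT main:app

resources:
  cpu: 4
  memory_gb: 16

automatic_scaling:
  min_num_instances: 1
  max_num_instances: 5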
That way, both your front end and back end can run in the same project. Check out:
https://cloud.google.com/appengine/docs/flexible/python/
Hope this helps!

How can I upload data into a high-replication GAE app?

Until now, I've been using the bulkloader to upload data into an app, but I've noticed that Google has added a warning that the bulkloader is intended for use with the master/slave datastore:
Warning: This document applies to apps that use the master/slave datastore. If your app uses the High Replication datastore, it is possible to copy data from the app, but Google does not currently support this use case. If you attempt to copy from a High Replication datastore, you'll see a high_replication_warning error in the Admin Console, and the downloaded data might not include recently saved entities.
Is there a recommended way of getting data into and out of an app that uses the HR datastore?

Google AppEngine - How To Perform a Partial Datastore Download

I have a running GAE app that has been collecting data for a while. I am now at the point where I need to run some basic reports on this data and would like to download a subset of the live data to my dev server. Downloading all entities of a kind will simply be too big a data set for the dev server.
Does anyone know of a way to download a subset of entities of a particular kind? Ideally it would be based on entity attributes like date or client ID, etc., but any method would work. I've even tried a regular full download and then arbitrarily killing the process when I thought I had enough data, but it seems the data is locked up in the .sql3 files generated by the bulkloader.
It looks like the default utilities for downloading from / uploading to the GAE datastore (appcfg.py and bulkloader.py) don't support filtering.
It seems reasonable to do one of two things:
write a utility (select + export + save to a local file) and execute it locally, accessing the remote GAE datastore through the remote API shell (see the sketch after this list)
write an admin web handler for select + export + zip - add a new URL to your handlers, upload to GAE, and call it over HTTP
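A minimal sketch of the first option, assuming you are inside remote_api_shell.py (or have configured remote_api_stub yourself); the kind, property names, filter value and file path are all hypothetical:

import csv
from google.appengine.ext import db

class MyKind(db.Model):              # hypothetical kind and properties
    client_id = db.StringProperty()
    created = db.DateTimeProperty()

def export_subset(path, client_id, limit=1000):
    # Filtered query runs against the remote datastore through remote_api.
    query = MyKind.all().filter('client_id =', client_id)
    with open(path, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(['key', 'client_id', 'created'])
        for entity in query.run(limit=limit):
            writer.writerow([str(entity.key()), entity.client_id, entity.created])

export_subset('subset.csv', 'some-client-id')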

In the python Google App Engine, how do I export all the entities of a model to a file in Google Storage for developers?

I have about 900K entities of a model in Python GAE that I would like to export to a CSV file for offline testing. I can use the appcfg.py download_data option, but in this case I don't want to back up to my local machine. I'd like a faster way to create the file in GAE, save it to Google Storage or elsewhere, and download it later from multiple machines.
I'm assuming that I will need to do this in a task since it will likely take more than 30 seconds for the operation to complete.
from google.appengine.ext import db

class MyModel(db.Model):
    foo = db.StringProperty(required=True)
    bar = db.StringProperty(required=True)

def backup_mymodel_to_file():
    # What to do here?
    pass
Your best option will be to use the MapReduce library to export the relevant data to the Blobstore, then upload the completed file to Google Storage.
Note that integration between Google Storage and App Engine is a work in progress.
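If the MapReduce machinery is more than you need, here is a hedged alternative sketch (it relies on the GAE cloudstorage client library, which appeared later, plus deferred tasks; the bucket name and batch size are placeholders):

import csv
import cloudstorage as gcs
from google.appengine.ext import db, deferred

BATCH_SIZE = 500

def backup_mymodel_to_file(batch=0, cursor=None):
    query = MyModel.all()
    if cursor:
        query.with_cursor(cursor)
    rows = query.fetch(BATCH_SIZE)
    if not rows:
        return
    # One GCS object per batch, since GCS objects cannot be appended to;
    # concatenate the pieces afterwards if you need a single file.
    with gcs.open('/my-bucket/mymodel-%05d.csv' % batch, 'w',
                  content_type='text/csv') as f:
        writer = csv.writer(f)
        for entity in rows:
            writer.writerow([entity.foo.encode('utf-8'), entity.bar.encode('utf-8')])
    deferred.defer(backup_mymodel_to_file, batch + 1, query.cursor())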
I know this is old, but I posted an example of using the App Engine Mapper API dumping datastore data into Cloud Storage here:
Google App Engine: Using Big Query on datastore?
