Caching Google API calls for unit tests - python

I've got a Google App Engine project that uses the Google Cloud Language API, and I'm using the Google API Client Library (Python) to make the API calls.
When running my unit tests, I make quite a few calls to the API. This slows down my testing and also incurs costs.
I'd like to cache the calls to the Google API to speed up my tests and avoid the API charges, and I'd rather not roll my own if another solution is available.
I found this Google API page, which suggests doing this:
import httplib2
http = httplib2.Http(cache=".cache")
I've added these lines to my code (there is another option to use GAE memcache, but that won't persist between test runs), and right after them I create my API client:
NLP = discovery.build("language", "v1", developerKey=API_KEY)
The caching isn't working and the above solution seems too simple so I suspect I am missing something.
UPDATE:
I updated my tests so that App Engine is not used (just a regular unit test), and I also figured out that I can pass the http object I created to the Google API client like this:
NLP = discovery.build("language", "v1", http=http, developerKey=API_KEY)
Now the initial discovery call is cached, but the actual API calls are not, e.g., this call is not cached:
result = NLP.documents().annotateText(body=data).execute()

The suggested code, http = httplib2.Http(cache=".cache"), is trying to cache to the local filesystem in a directory called ".cache". On App Engine, you cannot write to the local filesystem, so this does nothing.
Instead, you could try caching to Memcache. The other suggestion on the Python Client docs referenced is to do exactly this:
from google.appengine.api import memcache
http = httplib2.Http(cache=memcache)
Since all App Engine apps get free access to shared memcache this should be better than nothing.
If this fails, you could also try memoization. I've had success memoizing calls to slow or flaky APIs, but it comes at the cost of increased memory usage (so I need bigger instances).
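For illustration, the memoization I'm talking about can be as simple as keying on the request body; a rough sketch (annotate_text_cached is a hypothetical wrapper around the NLP client from the question):
import json

# Module-level cache: results live for the lifetime of the test process.
_memo = {}

def annotate_text_cached(body):
    # Serialize the request body to get a hashable, stable cache key.
    key = json.dumps(body, sort_keys=True)
    if key not in _memo:
        _memo[key] = NLP.documents().annotateText(body=body).execute()
    return _memo[key]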
EDIT: I see from your comment you're having this problem locally. I was originally thinking that memoization would be an alternative, but the need to hack on httplib2 makes that overly complicated. I'm back to thinking about how to convince httplib2 to do the right thing.

If you're trying to make a test run faster by caching an API call result, stop and consider whether you may have taken a wrong turn.
If you can restructure your code such that you can replace the API call with a unittest.mock, your tests will run much, much faster.
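For example, something along these lines (a rough sketch; myapp.nlp and analyze() are hypothetical stand-ins for wherever your code builds and uses the client):
import unittest
from unittest import mock

import myapp.nlp  # hypothetical module that builds the NLP client and calls it

class AnnotateTest(unittest.TestCase):
    @mock.patch("myapp.nlp.NLP")
    def test_analyze_uses_canned_response(self, nlp_mock):
        # The chain mirrors NLP.documents().annotateText(body=...).execute().
        canned = {"language": "en", "sentences": [], "tokens": []}
        nlp_mock.documents.return_value.annotateText.return_value.execute.return_value = canned

        result = myapp.nlp.analyze("some text")  # hypothetical code under test

        self.assertEqual(result["language"], "en")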

I just came across vcrpy which seems to do exactly this. I'll update this answer after I've had a chance to try it out.
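Not tried yet, but from its docs the usage should look roughly like this (the cassette path is arbitrary; NLP and data are from the question):
import vcr

@vcr.use_cassette("fixtures/annotate_text.yaml")
def test_annotate_text():
    # The first run records the real HTTP interaction to the cassette file;
    # later runs replay the recorded response instead of calling the API.
    result = NLP.documents().annotateText(body=data).execute()
    assert "language" in result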

Related

Does python with wsgi (uwsgi) under nginx have some small default cache?

On my small web site I need to make some data widely available, to avoid hitting the database for every request. For example, this could be the list of current users shown at the bottom of every page, or the time of the last ranking update.
The site runs on Python (Flask) on top of nginx + uwsgi (this docker image).
I wonder: do I get some small cache or shared memory for keeping such information "out of the box", or do I need to set up some dedicated cache explicitly? Or is something like this perhaps provided by nginx?
Alternatively, I could still use the database for this, since I think it has its own cache anyway.
Sorry if the question seems naive/silly - I come from the Java world (where things are a bit different, as we serve all requests with one fat instance of the Java application) and have some difficulty grasping what wsgi/uwsgi provides. Thanks in advance!
Firstly, nginx has cache:
https://www.nginx.com/blog/nginx-caching-guide/
But for Flask caching you also have options:
https://pythonhosted.org/Flask-Cache/
http://flask.pocoo.org/docs/1.0/patterns/caching/
Did you have a look at the caching section of the Flask docs?
It literally says:
Flask itself does not provide caching for you, but Werkzeug, one of the libraries it is based on, has some very basic cache support
You create a cache object once and keep it around, similar to how Flask objects are created. If you are using the development server you can create a SimpleCache object, that one is a simple cache that keeps the item stored in the memory of the Python interpreter:
from werkzeug.contrib.cache import SimpleCache
cache = SimpleCache()
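The typical get-or-compute pattern then looks roughly like this (query_database_for_users is a hypothetical stand-in for your DB call):
def get_current_users():
    users = cache.get("current-users")
    if users is None:
        users = query_database_for_users()  # hypothetical expensive DB query
        cache.set("current-users", users, timeout=5 * 60)  # keep for 5 minutes
    return users

Keep in mind that SimpleCache lives inside a single Python process, so with several uwsgi workers each worker ends up with its own copy; for a shared cache you would point Werkzeug's MemcachedCache or RedisCache at an external server instead.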
-- UPDATE --
Or you could solve this on the frontend side by storing the data in the browser's local storage.
If there's nothing in local storage, you call the DB; otherwise you use the information from local storage rather than making a DB call.
Hope this helps.

Is there any way to get a callback after every insert/delete on Google Cloud Datastore?

I would like to sync my Cloud Datastore contents with an index in ElasticSearch. I would like for the ES index to always be up to date with the contents of Datastore.
I noticed that an equivalent mechanism is available in the App Engine Python Standard Environment by implementing a _post_put_hook method on a Datastore Model. This doesn't seem to be possible, however, using the google-cloud-datastore library available in the Flex environment.
Is there any way to receive a callback after every insert? Or will I have to put up a "proxy" API in front of the datastore API which will update my ES index after every insert/delete?
The _post_put_hook() of ndb.Model only works if you have written the entity through NDB to Datastore, and yes, unfortunately the NDB library is only available in the App Engine Python Standard Environment. I don't know of such a feature in Cloud Datastore. If I remember correctly, Firebase Realtime Database and Firestore have triggers for writes, but I guess you are not eager to migrate the database either.
In Datastore you would either need a "proxy" API with the above method as you suggested, or you would need to modify your Datastore client(s) to do this upon any successful write op. The latter comes with a higher risk of failures and stale data in ElasticSearch, especially if the client is outside your control.
I believe that a custom API makes sense if consistent and up-to-date search records are important for your use cases. Datastore and Python/NDB (maybe with Cloud Endpoints) would be a good approach.
I have a similar solution running on GAE Python Standard (although with the builtin Search API instead of ElasticSearch). If you choose this route you should be aware of two potential caveats:
_post_put_hook() is always called, even if the put operation failed. I have added a code sample below. You can find more details in the docs: model hooks, hook methods, check_success().
Exporting the data to ElasticSearch or the Search API will prolong your response time. This might be no issue for background tasks; just call the export feature inside _post_put_hook(). But if a user made the request, this could be a problem. For these cases, you can defer the export operation to a different thread, either by using the deferred.defer() method or by creating a push task. More or less, they are the same. Below, I use defer().
Add a class method for every kind for which you want to export search records. Whenever something goes wrong, or you move apps/datastores, add new search indexes, etc., you can call this method, which will then query all entities of that kind from Datastore batch by batch and export the search records (see the sketch after the example below).
Example with deferred export:
from google.appengine.ext import deferred
from google.appengine.ext import ndb


class CustomModel(ndb.Model):
    def _post_put_hook(self, future):
        try:
            if future.check_success() is None:
                deferred.defer(export_to_search, self.key)
        except:
            pass  # or log the error to Cloud Console with logging.error('blah')


def export_to_search(key=None):
    try:
        if key is not None:
            entity = key.get()
            if entity is not None:
                call_export_api(entity)
    except:
        pass  # or log the error to Cloud Console with logging.error('blah')
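And a rough sketch of the per-kind re-export method mentioned above (the batch size is arbitrary; it chains itself with defer() until all entities are processed):
class CustomModel(ndb.Model):
    # ... properties and _post_put_hook() as above ...

    @classmethod
    def rebuild_search_records(cls, cursor=None, batch_size=100):
        # Walk all entities of this kind in batches and re-export each one.
        entities, next_cursor, more = cls.query().fetch_page(
            batch_size, start_cursor=cursor)
        for entity in entities:
            call_export_api(entity)
        if more and next_cursor:
            deferred.defer(cls.rebuild_search_records,
                           cursor=next_cursor, batch_size=batch_size)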

Python oauth2client async

I am fighting with tornado and the official python oauth2client, gcloud... modules.
These modules accept an alternate http client passed with http=, as long as it has a method called request which can be called by any of these libraries, whenever an http request must be sent to google and/or to renew the access tokens using the refresh tokens.
I have created a simple class which has self.client = AsyncHTTPClient().
Then, in its request method, it returns self.client.fetch(...).
My goal is to be able to yield any of these libraries' calls, so that Tornado will execute them asynchronously.
The thing is that they are highly dependent on what the default client (set to httplib2.Http()) returns: (response, content).
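My adapter looks roughly like this (simplified; the class name is just what I call it):
from tornado.httpclient import AsyncHTTPClient

class TornadoHttp(object):
    """Passed via http=...; request() mimics httplib2.Http.request()."""

    def __init__(self):
        self.client = AsyncHTTPClient()

    def request(self, uri, method="GET", body=None, headers=None, **kwargs):
        # fetch() returns a Future, not the (response, content) tuple
        # that oauth2client/gcloud expect from httplib2.
        return self.client.fetch(uri, method=method, body=body, headers=headers)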
I am really stuck and cannot find a clean way of making this async
If anyone already found a way, please help.
Thank you in advance
These libraries do not support asynchronous operation, and porting them is not always easy.
oauth2client
Depending on what you want to do maybe Tornado's GoogleOAuth2Mixin or tornado-alf will be enough.
gcloud
I am not aware of any Tornado/asyncio implementation of gcloud-python, so you could:
write it yourself. Again, it's not a simple transport change of Connection.http or request; all the surrounding code must be able to use/yield futures/coroutines.
wrap it in a ThreadPoolExecutor (as @Apero mentioned). This is a high-level API, so any nested API calls within that yield will be executed in the same thread (not using the pool). It could work well (see the sketch after this list).
external app (with ProcessPoolExecutor or Popen).
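A rough sketch of the ThreadPoolExecutor approach (fetch_rows_blocking is a hypothetical helper that uses gcloud synchronously):
from concurrent.futures import ThreadPoolExecutor

from tornado import gen, web

executor = ThreadPoolExecutor(max_workers=4)

def fetch_rows_blocking():
    # Hypothetical helper doing the synchronous gcloud / oauth2client work.
    return [{"id": 1}, {"id": 2}]

class RowsHandler(web.RequestHandler):
    @gen.coroutine
    def get(self):
        # The blocking call runs in a worker thread; yielding the returned
        # future keeps the IOLoop free to serve other requests meanwhile.
        rows = yield executor.submit(fetch_rows_blocking)
        self.write({"rows": rows})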
When I had a similar problem with AWS a couple of years ago, I ended up executing a CLI asynchronously for the complex cases (Tornado + subprocess.Popen + some CLI, awscli or boto based) and using plain AsyncHTTPClient for the simple cases (like S3 or basic EC2 operations).

Web application: Hold large object between requests

I'm working on a web application related to genome searching. This application makes use of this suffix tree library through Cython bindings. Objects of this type are large (hundreds of MB up to ~10GB) and take as long to load from disk as it takes to process them in response to a page request. I'm looking for a way to load several of these objects once on server boot and then use them for all page requests.
I have tried using a remote manager / client setup using the multiprocessing module, modeled after this demo, but it fails when the client connects with an error message that says the object is not picklable.
I would suggest writing a small Flask (or even raw WSGI… But it's probably simpler to use Flask, as it will be easier to get up and running quickly) application which loads the genome database then exposes a simple API. Something like this:
from flask import Flask

app = Flask(__name__)
database = load_database()

@app.route('/get_genomes')
def get_genomes():
    return database.all_genomes()

app.run(debug=True)
Or, you know, something a bit more sensible.
Also, if you need to be handling more than one request at a time (I believe that app.run will only handle one at a time), start by threading… And if that's too slow, you can os.fork() after the database is loaded and run multiple request handlers from there (that way they will all share the same database in memory).
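For the threading part, Flask's development server can at least be told to handle requests concurrently, something like this (fine for development, not a production setup):
# Each request gets its own thread; they all share the one in-memory database.
app.run(debug=True, threaded=True)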

Faster App Engine Development Datastore Alternative

Is there a way to use a real database (SQLite, MySQL, or even some non-relational one) as the datastore for development, instead of the memory/file datastore that is provided?
I saw a few projects, such as GAE-SQLite (which did not seem to be working), and one tip about accessing the production datastore using the remote API (still pretty slow for large datasets).
MongoDB works great for that. You will need:
The MongoDB stub: http://github.com/mongodb/mongo-appengine-connector
MongoDB: http://www.mongodb.org/display/DOCS/Downloads
Some code to set it up, like this:
import os

import datastore_mongo_stub
from google.appengine.api import apiproxy_stub_map

os.environ['APPLICATION_ID'] = 'test'
datastore = datastore_mongo_stub.DatastoreMongoStub(
    os.environ['APPLICATION_ID'], 'woot', '', require_indexes=False)
apiproxy_stub_map.apiproxy.RegisterStub('datastore_v3', datastore)
But if you're looking for truly faster development (like I was), the datastore is actually not the issue as much as the single-threaded web server. I tried to replace it with Spawning, but that was a little too hard. You could also try to set up TyphoonAE, which mimics the App Engine stack with open alternatives.
Be aware that if you do any of these, you might lose some of the exact behavior the current tools provide, meaning that if you deploy you could get results you didn't expect. In other words: make sure you know what you're doing :-)
The Google App Engine SDK for Python now bundles support for SQLite. See the official docs for more information.
bdbdatastore is an alternative datastore backend that's considerably better than the one built in to the development server, although the datastore is far from being the only problem with the dev server when it comes to handling large applications.
