Is it faster to query CouchDB via an adapter instead of REST? - python

Let's say I have some data in a CouchDB database. The overall size is about 100K docs.
I have a _design doc which stores a 'get all entities' view.
Assuming the requests are done on a local machine against a local database:
1. via curl: curl -X GET http://127.0.0.1/mydb/_design/myexample/_view/all
2. via Couchdbkit: entities = Entity.view('mydb/all')
Does 1 have to perform any additional calculations compared to 2 (JSON encoding/decoding, HTTP request parsing, etc.) and how can that affect the performance of querying 'all' entities from the database?
I guess that directly querying the database (option 2) should be faster than wrapping request/response into JSON, but I am not sure about that.

Under the API covers, Couchdbkit uses the restkit package, which is a REST library.
In other words, Couchdbkit is a pythonic API to the CouchDB REST API, and will do the same amount of work as using the REST API yourself.
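For illustration, both calls below end up issuing the same HTTP GET and parsing the same JSON; couchdbkit simply builds the request (via restkit) and decodes the response for you. This is just a sketch, assuming CouchDB's default port 5984 and the design document named in the question:

import requests
from couchdbkit import Server

# 1. Raw REST call, equivalent to the curl command above:
resp = requests.get('http://127.0.0.1:5984/mydb/_design/myexample/_view/all')
rows = resp.json()['rows']

# 2. couchdbkit call; internally this builds the same GET with restkit
#    and decodes the same JSON payload:
db = Server('http://127.0.0.1:5984').get_or_create_db('mydb')
entities = db.view('myexample/all')

So any difference you measure comes from Python's HTTP/JSON handling overhead, not from couchdbkit bypassing the REST layer.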

Related

Python, sqlalchemy: how to improve performance of encrypted sqlite database?

I have a simple service application: Python, Tornado web server, SQLite database. The database is encrypted.
The problem is that processing even a very simple HTTP request takes about 300 ms.
From the logs I can see that most of that time is spent processing the very first SQL request, no matter how simple that first request is. Subsequent SQL requests are processed much faster. But then the server starts processing the next HTTP request, and again the first SQL request is very slow.
If I turn off database encryption the problem is gone: the processing time of SQL requests no longer depends on whether the request is the first one, and my server's response time decreases by a factor of 10 to 15.
I don't quite understand what's going on. It looks like SQLAlchemy reads and decrypts the database file each time it starts a new session. Is there any way to work around this problem?
Due to how pysqlite (the sqlite3 module) works, SQLAlchemy defaults to using a NullPool with file-based databases. This explains why your database is decrypted on each request: a NullPool discards connections as they are closed. The reason for this default is that pysqlite disallows using a connection in more than one thread, and without encryption creating new connections is very fast.
Pysqlite does have an undocumented flag, check_same_thread, that can be used to disable the check, but sharing connections between threads should be handled with care, and the SQLAlchemy documentation notes in passing that the NullPool works well with SQLite's file locking.
Depending on your web server you could use a SingletonThreadPool, which means that all connections in a thread are the same connection:
from sqlalchemy import create_engine
from sqlalchemy.pool import SingletonThreadPool

engine = create_engine('sqlite:///my.db',
                       poolclass=SingletonThreadPool)
If you feel adventurous and your web server does not share connections / sessions between threads while in use (for example using a scoped session), then you could try using a different pooling strategy paired with check_same_thread=False:
from sqlalchemy.pool import QueuePool

engine = create_engine('sqlite:///my.db',
                       poolclass=QueuePool,
                       connect_args={'check_same_thread': False})
To encrypt the database, SQLCipher derives the encryption key from the passphrase I provided. This key derivation is resource-intensive by design.
However, it is possible to use a 256-bit raw key instead of a passphrase; in that case SQLCipher does not have to derive the encryption key at all.
Originally my code was:
session.execute('PRAGMA KEY = "MY_PASSPHRASE";')
To use raw key I changed this line to:
session.execute('''PRAGMA KEY = "x'<the key>'";''')
where <the key> is a 64-character hexadecimal string.
The result is a more than 20x speedup on small requests.
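A related sketch, assuming a SQLCipher-capable SQLite driver is in use: the raw key can be applied to every new DB-API connection with a SQLAlchemy connect event, so each connection the pool creates is keyed without passphrase derivation. The environment variable name here is illustrative:

import os

from sqlalchemy import create_engine, event

RAW_KEY = os.environ['DB_RAW_KEY']  # the 64-character hex key from the rekey step

engine = create_engine('sqlite:///my.db')

@event.listens_for(engine, 'connect')
def apply_sqlcipher_key(dbapi_connection, connection_record):
    # A raw key skips SQLCipher's passphrase-based key derivation entirely.
    dbapi_connection.execute('''PRAGMA KEY = "x'%s'";''' % RAW_KEY)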
Just for reference: to convert a database to the new encryption key, the following commands should be executed:
PRAGMA KEY = "MY_PASSPHRASE";
PRAGMA REKEY = "x'<the key>'";
Related question: python, sqlite, sqlcipher: very poor performance processing first request
Some info about sqlcipher commands and difference between keys and raw keys: https://www.zetetic.net/sqlcipher/sqlcipher-api/

Is there anyway to get a callback after every insert/delete on Google Cloud Datastore?

I would like to sync my Cloud Datastore contents with an index in ElasticSearch. I would like for the ES index to always be up to date with the contents of Datastore.
I noticed that an equivalent mechanism is available in the Appengine Python Standard Environment by implementing a _post_put_hook method in a Datastore Model. This doesn't seem to be possible however using the google-cloud-datastore library available for use in the flex environment.
Is there any way to receive a callback after every insert? Or will I have to put up a "proxy" API in front of the datastore API which will update my ES index after every insert/delete?
The _post_put_hook() of ndb.Model only works if you have written the entity through NDB to Datastore, and yes, unfortunately the NDB library is only available in the App Engine Python Standard Environment. I don't know of such a feature in Cloud Datastore. If I remember correctly, Firebase Realtime Database and Firestore have triggers for writes, but I guess you are not eager to migrate the database either.
In Datastore you would either need a "proxy" API in front of it, as you suggested, or you would need to modify your Datastore client(s) to do this upon every successful write operation. The latter comes with a higher risk of failures and stale data in ElasticSearch, especially if the client is outside your control.
I believe a custom API makes sense if consistent and up-to-date search records are important for your use cases. Datastore and Python / NDB (maybe with Cloud Endpoints) would be a good approach.
I have a similar solution running on GAE Python Standard (although with the builtin Search API instead of ElasticSearch). If you choose this route you should be aware of two potential caveats:
1. _post_put_hook() is always called, even if the put operation failed. I have added a code sample below. You can find more details in the docs: model hooks, hook methods, check_success().
2. Exporting the data to ElasticSearch or the Search API will prolong your response time. This might be no issue for background tasks; just call the export feature inside _post_put_hook(). But if a user made the request, this could be a problem. For these cases, you can defer the export operation to a different thread, either by using the deferred.defer() method or by creating a push task. More or less, they are the same. Below, I use defer().
Add a class method for every kind for which you want to export search records. Whenever something goes wrong, or you move apps / datastores, add new search indexes, etc., you can call this method; it will query all entities of that kind from Datastore batch by batch and export the search records (a sketch of such a method follows after the example below).
Example with deferred export:
from google.appengine.ext import deferred
from google.appengine.ext import ndb


class CustomModel(ndb.Model):

    def _post_put_hook(self, future):
        try:
            # The hook also runs for failed puts; check_success() raises then.
            if future.check_success() is None:
                deferred.defer(export_to_search, self.key)
        except Exception:
            pass  # or log the error to Cloud Console with logging.error('blah')


def export_to_search(key=None):
    try:
        if key is not None:
            entity = key.get()
            if entity is not None:
                call_export_api(entity)  # your export routine, e.g. to ElasticSearch
    except Exception:
        pass  # or log the error to Cloud Console with logging.error('blah')
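A sketch of the per-kind batch re-export mentioned above, assuming the same ndb/deferred setup; the method name and batch size are illustrative:

class CustomModel(ndb.Model):
    # ... fields and _post_put_hook as in the example above ...

    @classmethod
    def export_all_to_search(cls, batch_size=100):
        # Walk all entities of this kind with a cursor and defer one
        # export task per key, so the search records can be rebuilt any time.
        cursor = None
        more = True
        while more:
            keys, cursor, more = cls.query().fetch_page(
                batch_size, start_cursor=cursor, keys_only=True)
            for key in keys:
                deferred.defer(export_to_search, key)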

Aggregate multiple APIs request results using python

I'm working on an application that has to pull information from multiple external APIs and, after processing the data, output the result to a client. The client sends a query through a web interface; the server then sends requests to the different API providers, joins their responses, and returns the combined response to the client.
All responses are in JSON.
current approach:
import requests

def get_results(city, country, query, type, position):
    # get list of apis with authentication code for this query
    apis = get_list_of_apis(type, position)
    results = []
    for api in apis:
        result = requests.get(api)
        # parse json
        # combine result in uniform format to display
    return results
The server uses Django to generate the response.
Problems with this approach:
(i) This may generate huge amounts of data even though the client is not interested in all of it.
(ii) Each JSON response has to be parsed according to a different API spec.
How to do this efficiently?
Note: Queries are being done to serve job listings.
Most APIs of this nature allow for some sort of "paging". You should code your requests to only draw a single page from each provider. You can then consolidate the several pages locally into a single stream.
If we assume you have 3 providers, and page size is fixed at 10, you will get 30 responses. Assuming you only show 10 listings to the client, you will have to discard and re-query 20 listings. A better idea might be to locally cache the query results for a short time (say 15 minutes to an hour) so that you don't have to requery the upstream providers each time your user advances a page in the consolidated list.
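A minimal sketch of that local cache using Django's cache framework; the fetch_from_all_providers() helper and the cache key format are hypothetical:

from django.core.cache import cache

def get_consolidated_results(query, page, page_size=10):
    cache_key = 'jobs:%s' % query
    listings = cache.get(cache_key)
    if listings is None:
        # Only hit the upstream providers on a cache miss.
        listings = fetch_from_all_providers(query)
        cache.set(cache_key, listings, timeout=15 * 60)  # keep for 15 minutes
    start = (page - 1) * page_size
    return listings[start:start + page_size]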
As far as the different parsing required for different providers, you will have to handle that internally. Create different classes for each. The list of providers is fixed, and small, so you can code a table of which provider-url gets which class behavior.
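One way to sketch that table: a small registry mapping each provider's URL prefix to a parser class that normalises its JSON. Provider names, URLs and field names below are purely illustrative:

import requests

class ProviderAParser:
    def parse(self, payload):
        # Provider A is assumed to return {"jobs": [{"title": ..., "org": ...}]}
        return [{'title': j['title'], 'company': j['org']}
                for j in payload.get('jobs', [])]

class ProviderBParser:
    def parse(self, payload):
        # Provider B is assumed to return {"results": [{"position": ..., "employer": ...}]}
        return [{'title': j['position'], 'company': j['employer']}
                for j in payload.get('results', [])]

# Fixed, small table of provider URL prefix -> parser behaviour.
PARSERS = {
    'https://api.provider-a.example/': ProviderAParser(),
    'https://api.provider-b.example/': ProviderBParser(),
}

def fetch_provider_page(api_url, page=1):
    response = requests.get(api_url, params={'page': page})
    response.raise_for_status()
    parser = next(p for prefix, p in PARSERS.items()
                  if api_url.startswith(prefix))
    return parser.parse(response.json())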
Shameless plug, but I wrote a post on how I did exactly this with Django REST framework here.
I highly recommend using Django REST framework; it makes everything so much easier.
Basically, the model on your API's end is extremely simple and just contains information on which external API is used and the ID of that API resource. A GenericProvider class then provides an abstract interface to perform CRUD operations on the external source. This GenericProvider uses other providers that you create and determines which provider to use via the provider field on the model. All of the data returned by the GenericProvider is then serialised as usual.
Hope this helps!

Send a GET request with a body

I'm using Elasticsearch, and its RESTful API supports reading bodies in GET requests for search criteria.
I'm currently doing
response = urllib.request.urlopen(url, data).read().decode("utf-8")
If data is present, it issues a POST, otherwise a GET. How can I force a GET despite the fact that I'm including data (which should go in the request body, as with a POST)?
Nb: I'm aware I can use a source property in the Url but the queries we're running are complex and the query definition is quite verbose resulting in extremely long urls (long enough that they can interfere with some older browsers and proxies).
I'm not aware of a nice way to do this using urllib. However, requests makes it trivial (and, in fact, trivial with any arbitrary verb and request content) by using the requests.request* function:
requests.request(method='get', url='http://localhost/test', data='some data')
Constructing a small test web server will show that the data is indeed sent in the body of the request, and that the method perceived by the server is indeed a GET.
*Note that I linked to the requests.api.request code because that's where the actual function definition lives. You should call it as requests.request(...).
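For example, a minimal test server (standard library only, hypothetical port) that prints the method and body it receives:

from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the body, if any, and show which verb the server saw.
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        print('method=%s body=%r' % (self.command, body))
        self.send_response(200)
        self.end_headers()

HTTPServer(('localhost', 8000), EchoHandler).serve_forever()

Pointing requests.request(method='get', url='http://localhost:8000/', data='some data') at it should print method=GET with the data in the body.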

Does app engine automatically cache frequent queries?

I seem to remember reading somewhere that google app engine automatically caches the results of very frequent queries into memory so that they are retrieved faster.
Is this correct?
If so, is there still a charge for datastore reads on these queries?
If you're using Python and the new ndb API, it DOES have automatic caching of entities, so if you fetch entities by key, it would be cached:
http://code.google.com/appengine/docs/python/ndb/cache.html
As the comments say, queries are not cached.
Cached requests don't hit the datastore, so you save on reads there.
If you're using Java, or the other APIs for accessing the datastore, then no, there's no caching.
Edit: fixed my mistake about queries getting cached.
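A minimal ndb sketch of that distinction, assuming the standard environment's ndb library (the model and property names are illustrative):

from google.appengine.ext import ndb

class Article(ndb.Model):
    title = ndb.StringProperty()

key = Article(title='hello').put()

# Gets by key go through ndb's in-context cache (and memcache, if enabled),
# so repeated gets may not touch the datastore at all.
article = key.get()
article_again = key.get()  # typically served from cache

# Query results are not cached by ndb; each fetch goes to the datastore.
articles = Article.query(Article.title == 'hello').fetch()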
I think that app engine does not cache anything for you. While it could be that, internally, it caches some things for a split second, I don't think you should rely on that.
I think you will be charged the normal number of read operations for every entity you read from every query.
No, it doesn't. However, depending on which framework you use to access the datastore, memcache may be used. Are you developing in Java or Python? On the Java side, Objectify will cache gets automatically but not queries. Keep in mind that there is a big difference in terms of performance and cacheability between gets and queries, in both Python and Java.
You are not charged for datastore reads for memcache hits.
