I am interested in adding a spell checker to my app -- I'm planning on using difflib with a custom word list that's about 147 kB (13,025 words).
When testing user queries against this list, would it make more sense to:
load the dictionary into memcache (I guess from the datastore?) and keep it there, or
build an index for the Search API and run the user query against it there?
I guess what I'm asking is which is faster: memcache or a search index?
Thanks.
Memcache is definitely faster.
Another important consideration is cost. Memcache API calls are free, while Search API calls have their own quota and pricing.
By the way, you can store your word list as a static file, since it's small and doesn't change. There is no need to store it in the Datastore.
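For example, a minimal sketch of that approach, assuming the word list is bundled with the app as a hypothetical words.txt with one word per line:

import difflib

# Load the bundled word list once at module import time ('words.txt' is an assumed name).
with open('words.txt') as f:
    WORDS = [line.strip() for line in f if line.strip()]

def suggest(query, n=3):
    # difflib.get_close_matches returns up to n of the closest spellings from WORDS.
    return difflib.get_close_matches(query, WORDS, n=n)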
Memcache is faster; however, you need to consider the following:
It is not reliable: entries can be purged at any moment, so your code needs a fallback for non-cached data.
You can only fetch by key, so as you said you would need to store the whole dictionary in memcache objects.
Each memcache entry can only store 1 MB. If your dictionary is larger you would have to span multiple entries (OK, not relevant in your case).
There are some other alternatives. How often will the dictionary be updated?
Here is one alternate strategy.
You could store it in the filesystem (which requires an app update to change it) or in GCS if you want to update the dictionary outside of app updates. Then load the dictionary into memory in each instance at startup or on first request and cache it at the running-instance level; that way you won't have any round trips to services adding latency. This will also be simpler code-wise (i.e. no fallbacks for when something isn't in memcache, etc.).
Here is an example. In this case the code lives in a module, which is imported as required. I am using a YAML file for additional configuration; it could just as easily json-load a dictionary, or you could define a Python dictionary in the module.
import yaml

_modsettings = {}

def loadSettings(settings='settings.yaml'):
    # Parse the YAML file only on the first call; later calls return the cached dict.
    if not _modsettings:
        try:
            with open(settings, 'r') as f:
                _modsettings.update(yaml.safe_load(f.read()))
        except IOError:
            pass
    return _modsettings
settings = loadSettings()
Then whenever I want the settings dictionary my code just refers to mymodule.settings.
By importing this module during a warmup request you won't get a race condition or have to import/parse the dictionary during a user-facing request. You can put in more error traps as appropriate ;-)
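For instance, a warmup handler could look like this (a sketch only; it assumes webapp2 and a hypothetical mymodule holding the settings code above):

import webapp2
import mymodule  # hypothetical module containing loadSettings() and settings

class WarmupHandler(webapp2.RequestHandler):
    def get(self):
        # Importing mymodule above already parsed the YAML; calling loadSettings()
        # again is a no-op and just makes the intent explicit.
        mymodule.loadSettings()
        self.response.write('warmed up')

app = webapp2.WSGIApplication([('/_ah/warmup', WarmupHandler)])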
I have a Python project with many calls from multiple places and services to a specific Mongo collection - let's call it "cache_collection".
I moved this collection to a different Mongo instance to reduce load from the major db, and plan to remove the collection in the older db.
The thing is this - I want to make sure it won't be possible to access "cache_collection" in the older db, meaning that calling get_collection('cache_collection') will raise an exception, and any attempt to read or write this collection will raise an exception. Since Mongo dynamically creates collections on demand, the desired behavior is not that easy to get.
I've read about Mongo access control and this question, but it's not ideal, because it seems like permissions are additive rather than restrictive, and I don't want to define the "basic" permission set as none and keep maintaining collection permissions for the rest of the project.
Is there a simple solution for this?
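To illustrate the behavior being asked for, a rough application-level sketch (the GuardedDatabase wrapper and the 'major_db' name are made up for illustration, not an existing API):

from pymongo import MongoClient

BLOCKED = {'cache_collection'}  # collections that have moved to the new instance

class GuardedDatabase(object):
    """Wraps a pymongo Database and refuses access to retired collections."""

    def __init__(self, db):
        self._db = db

    def get_collection(self, name, **kwargs):
        if name in BLOCKED:
            raise RuntimeError('%s now lives in the new Mongo instance' % name)
        return self._db.get_collection(name, **kwargs)

    def __getattr__(self, name):
        # Attribute-style access (db.cache_collection) goes through the same check.
        if name in BLOCKED:
            raise RuntimeError('%s now lives in the new Mongo instance' % name)
        return getattr(self._db, name)

old_db = GuardedDatabase(MongoClient()['major_db'])  # 'major_db' is a placeholder name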
I'm developing a web application using Flask. I have two approaches to return pages for a user's request:
Load the requested data from the database, then return it.
Load the whole database into a Python dictionary variable at initialization and return the related page when requested (the whole database is not too big).
I'm curious which approach will have better performance?
Of course it will be faster to get data from a cache that is stored in memory. But you've got to be sure that the amount of data won't get too large, and that you're updating your cache every time you update the database. Depending on your exact goal you may choose a Python dict, a cache (like memcached), or something else, such as tries.
There's also a "middle" way for this. You can store in memory not the whole records from the database, but just the correspondence between the search params in the request and the ids of the records in the database. That way, when a user makes a request, you quickly look up the ids of the records needed and query your database by id, which is pretty fast.
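A rough sketch of that middle way in Flask (the Page model, its fields, and the use of Flask-SQLAlchemy are assumptions, not part of the question):

from flask import Flask, abort
from models import Page  # hypothetical Flask-SQLAlchemy model with id, slug and content

app = Flask(__name__)

# In-memory lookup table: request parameter (slug) -> record id, filled at startup.
slug_to_id = {}

def build_index():
    with app.app_context():
        for page in Page.query.all():
            slug_to_id[page.slug] = page.id

@app.route('/page/<slug>')
def show_page(slug):
    page_id = slug_to_id.get(slug)
    if page_id is None:
        abort(404)
    # Fetching by primary key is fast; only the small id mapping lives in memory.
    return Page.query.get(page_id).content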
Is it possible to parse a google app engine object like so...
objects = db.GqlQuery("SELECT * FROM Database WHERE item = 'random'")
memcache.add('object', objects, 3600)
if object == 'some condition':
    # here, can I do a query on 'objects' without using GqlQuery?
    pass
elif object == 'something else':
    # do a different query than the one above
    pass
The idea is to store an object into memcache and then manipulate that object in different ways. This is to lighten the datastore reads. Thanks in advance!
You can, and everyone finds that they do it. However, there are a bunch of things you need to consider.
At the moment you are trying to store a query object, not the results, in memcache. objects in your code is a query object. Use run(), fetch(), etc. to get some results.
Manipulating the objects and storing them in memcache without writing back will mean you lose data; memcache is not a reliable storage mechanism on App Engine (it is just a cache) and things can be evicted at any time.
If your query is intended to return a single result, get the object by its key; compared with a query it is far more efficient, and not a lot slower than memcache. (ndb will cache gets for you - see the next point.)
It looks like you are starting out with App Engine. If you don't have an existing code base, start out with ndb rather than db; it is, in my opinion, a better library. ndb does a lot of caching for you (when using get()) in memcache and at the request/instance level.
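As a sketch of the first point - caching fetched results rather than the query object - under the assumption of a made-up ndb model called Item and an arbitrary cache key:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Item(ndb.Model):
    name = ndb.StringProperty()

def get_random_items():
    results = memcache.get('random_items')
    if results is None:
        # fetch() returns concrete entities; the query object itself is not the data.
        results = Item.query(Item.name == 'random').fetch(20)
        memcache.add('random_items', results, 3600)
    return results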
I'm creating a Python wrapper for the Vimeo API and this is my first time creating a Python distribution. I have questions about Python caching.
I referred to this existing python-vimeo wrapper for caching the request token. That author implemented it like this:
"""By default, this client will cache API requests for 120 seconds. To
override this setting, pass in a different cache_timeout parameter (in
seconds), or to disable caching, set cache_timeout to 0."""
I'm wondering whether it will create a problem or not. If more than one user is using that feature to connect to Vimeo at exactly the same time, and the information is stored in the server like this:
return self._cache.setdefault(key, processor(headers, content))
doesn't it create a problem (information will be overwritten in the cache)?
If it creates a problem, could you tell me the best solution? I think it would be storing it in a file named after the authenticated username. Am I right?
Thanks!
I'm not sure I understand the issue, but you could create a prefixed key where the prefix of the key is the username. So a naive but possibly good approach is to save to the
username+"_"+key
key instead
There most likely wouldn't be any key collisions.
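Something like this, say (a hypothetical helper, not part of the wrapper's actual API):

def cache_key(username, key):
    # Namespace cached responses per authenticated user so concurrent users
    # don't overwrite each other's entries.
    return username + "_" + key

# e.g. self._cache.setdefault(cache_key(username, key), processor(headers, content))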
I'm a Python & App Engine (and server-side!) newbie, and I'm trying to create a very simple CMS. Each deployment of the application would have one - and only one - Company object, instantiated from something like:
class Company(db.Model):
    name = db.StringProperty()
    profile = db.TextProperty()
    addr = db.TextProperty()
I'm trying to provide the facility to update the company profile and other details.
My first thought was to have a Company entity singleton. But having looked at (although far from totally grasped) this thread I get the impression that it's difficult, and inadvisable, to do this.
So then I thought that perhaps for each deployment of the CMS I could, as a one-off, run a script (triggered by a totally obscure URL) which simply instantiates Company. From then on, I would get this instance with theCompany = Company.all()[0]
Is this advisable?
Then I remembered that someone in that thread suggested simply using a module. So I just created a Company.py file and stuck a few variables in it. I've tried this in the SDK and it seems to work - to my surprise, modified variable values "survived" between requests.
Forgive my ignorance, but I assume these values are only held in memory rather than on disk - unlike Datastore stuff? Is this a robust solution? (And would the module variables be in scope for all invocations of my application's scripts?)
Global variables are "app-cached." This means that each particular instance of your app will remember these variables' values between requests. However, when an instance is shutdown these values will be lost. Thus I do not think you really want to store these values in module-level variables (unless they are constants which do not need to be updated).
I think your original solution will work fine. You could even create the original entity using the remote API tool so that you don't need an obscure page to instantiate the one and only Company object.
You can also make the retrieval of the singleton Company entity a bit faster if you retrieve it by key.
If you will need to retrieve this entity frequently, then you can avoid round-trips to the datastore by using a caching technique. The fastest would be to app-cache the Company entity after you've retrieved it from the datastore. To protect against the entity from becoming too out of date, you can also app-cache the time you last retrieved the entity and if that time is more than N seconds old then you could re-fetch it from the datastore. For more details on this option and how it compares to alternatives, check out Nick Johnson's article Storage options on App Engine.
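A sketch of that app-caching pattern (the N of 60 seconds and the key name are assumptions):

import time

_cached_company = None
_cached_at = 0
MAX_AGE = 60  # seconds before the app-cached entity is considered stale (an assumption)

def get_company():
    global _cached_company, _cached_at
    if _cached_company is None or time.time() - _cached_at > MAX_AGE:
        # Fetch by key (faster than a query); assumes the entity was stored with
        # key_name='c' as in the answer below.
        _cached_company = Company.get_by_key_name('c')
        _cached_at = time.time()
    return _cached_company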
It sounds like you are trying to provide a way for your app to be configurable on a per-application basis.
Why not use the datastore to store your company entity with a key_name? Then you will always know how to fetch the company entity, and you'll be able edit the company without redeploying.
company = Company(key_name='c')
# set stuff on company....
company.put()
# later in code...
company = Company.get_by_key_name('c')
Use memcache to store the details of the company and avoid repeated datastore calls.
In addition to memcache, you can use module variables to cache the values. They are cached, as you have seen, between requests.
I think the approach you read about is the simplest:
Use module variables, initialized to None.
Provide accessors (get/setters) for these variables.
When a variable is accessed, if its value is None, fetch it from the database. Otherwise, just use it.
This way, you'll have app-wide variables provided by the module (which won't be instantiated again and again); they will be shared, and you won't lose them.
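A minimal sketch of those three steps, assuming the Company model from the question and the key_name 'c' used in an earlier answer:

_company = None  # module-level variable, initialized to None

def get_company():
    global _company
    if _company is None:
        # First access on this instance: fetch from the datastore.
        _company = Company.get_by_key_name('c')
    return _company

def set_company(company):
    global _company
    company.put()       # write through to the datastore
    _company = company  # keep the app-cached copy in sync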