Block creation or usage of *specific* collections in Mongo - python

I have a Python project with many calls, from multiple places and services, to a specific Mongo collection - let's call it "cache_collection".
I moved this collection to a different Mongo instance to reduce the load on the main db, and I plan to remove the collection from the older db.
The catch: I want to make sure it won't be possible to access "cache_collection" in the older db, meaning that calling get_collection('cache_collection'), or any attempt to read from or write to this collection, will raise an exception. Since Mongo creates collections on demand, the desired behavior is not that easy to get.
I've read about Mongo access control and this question, but it's not ideal: permissions seem to be additive rather than restrictive, and I don't want to define the "basic" permission set as none and then maintain per-collection permissions for the rest of the project.
Is there a simple solution for this?
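One client-side way to get the described behaviour is to guard pymongo's collection lookups themselves. This is a minimal sketch under some assumptions: it assumes pymongo, the BLOCKED set and error text are placeholders, and it only protects this Python codebase, not other clients of the old instance.

import pymongo.database

BLOCKED = {'cache_collection'}

_orig_get_collection = pymongo.database.Database.get_collection
_orig_getitem = pymongo.database.Database.__getitem__

def _check(name):
    if name in BLOCKED:
        raise RuntimeError('collection %r has moved to the cache instance' % name)

def _guarded_get_collection(self, name, *args, **kwargs):
    _check(name)
    return _orig_get_collection(self, name, *args, **kwargs)

def _guarded_getitem(self, name):
    # attribute access (db.cache_collection) goes through __getattr__,
    # which delegates to __getitem__, so these two patches cover all the
    # usual lookup styles
    _check(name)
    return _orig_getitem(self, name)

pymongo.database.Database.get_collection = _guarded_get_collection
pymongo.database.Database.__getitem__ = _guarded_getitem

If the same process also talks to the new instance, add a check on self.name or self.client.address before raising, so the guard only fires for the old db.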

Related

Mongoengine - Can I get the same document object instance from two subsequent queries?

This is the use case: I have a server that receives instructions from many clients. Each client's instructions are handled by its own Session object, which holds all the information about the state of the session and queries MongoEngine for the data it needs.
Now, suppose session1 queries mongoengine and gets document "A" as a document object.
Later, session2 also queries and gets document "A", as another separate document object.
Now we have two document objects representing document "A", and to keep them consistent I need to call A.update() and A.reload() all the time, which seems unnecessary.
Is there any way I can get a reference to the same document object over the two queries? This way both sessions could make changes to the document object and those changes would be seen by the other sessions, since they would be made to the same python object.
I've thought about making a wrapper for mongoengine that caches the documents that we have as document objects at runtime and ensures there are no multiple objects for the same document at any given time. But my knowledge of mongoengine is too rudimentary to do it at the time.
Any thoughts on this? Is my entire design flawed? Is there any easy solution?
I don't think going in that direction is a good idea. From what I understand you are in a web-application context; you might be able to get something working for threads within a single process, but you won't be able to share instances across different processes (and it gets even worse if you have processes running on different machines).
One way to address this is to use optimistic concurrency validation: you maintain a field like "version-identifier" that gets updated whenever the instance is updated, and whenever you save/update the object, you run a query like "update the object if version-identifier = ..., else fail".
This means that if there are concurrent requests, one of them will succeed (the first one to be flushed) and the other will fail, because the version-identifier it holds is outdated. MongoEngine has no built-in support for that, but more info can be found here: https://github.com/MongoEngine/mongoengine/issues/1563
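For illustration, a minimal sketch of that pattern on top of MongoEngine (the Doc class, field names, and error handling are mine, not a built-in API):

from mongoengine import Document, IntField, StringField

class Doc(Document):
    version = IntField(default=0)   # the "version-identifier"
    body = StringField()

def save_checked(doc):
    # atomic check-and-set: the update only matches if nobody has bumped
    # the version since this copy of the document was loaded
    matched = Doc.objects(id=doc.id, version=doc.version).update_one(
        set__body=doc.body,
        inc__version=1,
    )
    if not matched:
        raise RuntimeError('stale write: document was modified concurrently')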

Django Fallback to model lookup from external API

I'm using Django REST framework to serve up JSON content for a website front end. On the back end, I have two Django models, Player and Match, that each reference multiple instances of the other: a Match contains multiple Players, and a Player contains multiple Matches. This data is originally retrieved from a third-party API.
Matches and Players must be fetched separately from the API, and can only be fetched one at a time. When an object is fetched, its data is converted from the external JSON format into my Django model. At this point, the Match/Player will live forever in Django. The hard part is that I want this external fetching to be seamless. If I query for a player or match and it's in the DB, then just serve what we have there. Otherwise, I want to fetch that object from the external DB.
My question is, does Django provide any convenient way of handling this? Ideally, any query along the lines of Match.objects.get(id=...) will handle this API fallback transparently (I don't mind the fact that this query may take significantly longer in some cases).
Whether a way is "convenient" depends on your expectations ...
You could create a custom QuerySet where you override the get() method to include your fetch-from-API logic. Then you create a custom manager based on that QuerySet, like the docs show here.
Finally add that custom manager to your model.
See also this question from 2011.
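Roughly, that looks like the following sketch; fetch_from_api is a hypothetical stand-in for your third-party API client, and the Match model is stripped down to the manager.

from django.db import models

def fetch_from_api(match_id):
    # hypothetical: call the third-party API and return a dict of model
    # field values, or None if the match does not exist there either
    raise NotImplementedError

class MatchQuerySet(models.QuerySet):
    def get(self, *args, **kwargs):
        try:
            return super().get(*args, **kwargs)
        except self.model.DoesNotExist:
            data = fetch_from_api(kwargs.get('id'))
            if data is None:
                raise
            # persist the fetched object so later queries hit the DB
            return self.create(**data)

class Match(models.Model):
    objects = MatchQuerySet.as_manager()

With that in place, Match.objects.get(id=...) hits the DB first and only falls back to the API (and takes longer) on a miss.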

Temporary views are not supported in CouchDB

I'm building a web app using Django and couchdb 2.0.
The new version of CouchDB doesn't support temporary views. They recommend using a Mango query, but I couldn't find any useful documentation.
What is the best approach or library to use couchdb 2.0 with Django?
Temporary views were indeed abandoned in CouchDB 2.0. With Mango you could emulate them using a hack, but that's just as bad (read: performance-wise). The recommendation is to use persistent views instead. Since only the delta of new or updated documents needs indexing, this will likely need significantly fewer resources.
Unlike in relational DBs, the created view (which is a persisted index by keys) is meant to be queried many times with different parameters (there is no such thing as a query optimizer taking your temp view definition or the like). So if you've built heavily on temporary views, you might consider changing the way you query in the first place. One place to start is thinking about which attribute collapses the result set most quickly to what you're looking for, and building a view for that. Then query that view with keys and post-filter the rest.
The closest thing to a temporary view (when you really, really need one) is to create a design doc (e.g. _design/temp<uuid>) and use it for the one query execution.
Just to add a link (not new - but timeless) on the details: http://guide.couchdb.org/draft/views.html
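For that last-resort case, here is a sketch over plain HTTP with requests; the server URL, database name, map function, and key are all placeholders:

import uuid
import requests

DB = 'http://localhost:5984/mydb'                   # placeholder
ddoc = '%s/_design/temp%s' % (DB, uuid.uuid4().hex)

# create a throwaway design doc with a single map function
requests.put(ddoc, json={
    'views': {'v': {'map': 'function (doc) { emit(doc.type, null); }'}}
}).raise_for_status()

# the one query execution the temporary view was needed for
rows = requests.get('%s/_view/v' % ddoc,
                    params={'key': '"player"'}).json()['rows']

# clean up the design doc (and the index that was built for it)
rev = requests.get(ddoc).json()['_rev']
requests.delete(ddoc, params={'rev': rev})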

Transferring data to REST API without a database django

Basically I have a program which scrapes some data from a website; I need to either print it out to a Django template or expose it through a REST API, without using a database. How do I do this without a database?
Your best bet is to
a.) Perform the scraping in the views themselves, and pass the results to the template in a context dict (a sketch follows below)
or
b.) Write to a file and have your view pull its data from that file.
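A minimal sketch of option a.), with requests and BeautifulSoup standing in for whatever scraper you already have (the URL and selector are placeholders):

import requests
from bs4 import BeautifulSoup
from django.shortcuts import render

def scraped(request):
    html = requests.get('https://example.com/listing').text
    soup = BeautifulSoup(html, 'html.parser')
    items = [el.get_text(strip=True) for el in soup.select('h2')]
    # no model and no database: the scraped data goes straight into the
    # template context for this request
    return render(request, 'scraped.html', {'items': items})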
Django can be run without a database, but it depends on what applications you enable. Some of the default functionality (auth, sites, contenttypes) requires a database. So you'd need to disable those. If you need to use them, you're SOL.
Other functionality (like sessions) usually uses a database, but you can configure it to use a cache or file or something else.
I've taken two approaches in the past:
1) Disable the database completely and disable the applications that require it (see the settings sketch after this list):
DATABASES = {}
2) Use a dummy sqlite database just so things work out of the box with the default apps without too much tweaking, but don't really use it for anything. I find this method faster and good for setting up quick testing/prototyping.
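For approach 1), the relevant settings.py pieces might look like this sketch; which apps you can keep depends on your project:

DATABASES = {}

INSTALLED_APPS = [
    # 'django.contrib.admin',         # requires auth and contenttypes
    # 'django.contrib.auth',          # requires a database
    # 'django.contrib.contenttypes',  # requires a database
    'django.contrib.sessions',
    'django.contrib.staticfiles',
]

# sessions can live in the cache instead of the database
SESSION_ENGINE = 'django.contrib.sessions.backends.cache'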
And to actually get the data from the scraper into your view, you can take a number of approaches. Store the data in a cache, or just write it directly to your context variables, etc.

App engine -- when to use memcache vs search index (and search API)?

I am interested in adding a spell checker to my app -- I'm planning on using difflib with a custom word list that is ~147 kB (13,025 words).
When testing user queries against this list, would it make more sense to:
load the dictionary into memcache (I guess from the datastore?) and keep it in memcache or
build an index for the Search API and run the user query against it there
I guess what I'm asking is which is faster: memcache or a search index?
Thanks.
Memcache is definitely faster.
Another important consideration is cost. Memcache API calls are free, while Search API calls have their own quota and pricing.
By the way, you may store your library as a static file, because it's small and it does not change. There is no need to store it in the Datastore.
Memcache is faster; however, you need to consider the following:
it is not reliable - entities can be purged at any moment, so your code needs a fallback for non-cached data (see the sketch after this list)
you can only fetch by key, so as you said you would need to store the whole dictionary in a memcache object
each memcache entry can only store 1 MB; if your dictionary were larger you would have to span multiple entries (OK, not relevant in your case)
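The fallback point above, as a sketch against the App Engine memcache API (load_word_list and its file path are hypothetical; the source could just as well be GCS):

from google.appengine.api import memcache

def load_word_list():
    # hypothetical: read the ~147 kB dictionary from a static file or GCS
    with open('words.txt') as f:
        return f.read().splitlines()

def get_word_list():
    words = memcache.get('spell-words')
    if words is None:
        # memcache entries can be evicted at any moment, so always be
        # ready to rebuild from the source of truth
        words = load_word_list()
        memcache.set('spell-words', words)
    return words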
There are some other alternatives. How often will the dictionary be updated?
Here is one alternative strategy.
You could store it in the filesystem (requires app updates), or in GCS if you want to update the dictionary outside of app updates. Then each instance can load the dictionary into memory at startup or on first request and cache it at the running-instance level, so you won't have any round trips to services adding latency. This will also be simpler code-wise (i.e. no fallbacks for when something isn't in memcache, etc.).
Here is an example. In this case the code lives in a module, which is imported as required. I am using a YAML file for additional configuration; it could just as easily load a dictionary from JSON, or you could define a Python dictionary in the module.
import yaml

# module-level cache: parsed once per instance, shared by every importer
_modsettings = {}

def loadSettings(settings='settings.yaml'):
    # only hit the filesystem on the first call; later calls reuse the dict
    if not _modsettings:
        try:
            with open(settings, 'r') as f:
                _modsettings.update(yaml.safe_load(f.read()))
        except IOError:
            pass
    return _modsettings

settings = loadSettings()
Then whenever I want the settings dictionary my code just refers to mymodule.settings.
By importing this module during a warmup request you won't get a race condition, or have to import/parse the dictionary during a user-facing request. You can put in more error traps as appropriate ;-)
