django-mongodb-engine Where is GridFSField Model Field - python

I'm using django-mongodb-engine for a project I'm working on, and the documentation advises using the GridFS storage system for storing blobs rather than the filesystem. That is one of the reasons we chose MongoDB in the first place. One issue, though: the documentation is sparse, to say the least. The docs say to use GridFSField as your blob model field. One problem... where is the GridFSField?
class Better(models.Model):
    blob = GridFSField()
iPython/Django shell:
from django_mongodb_engine.storage import GridFSStorage
#... define the class/exec
/usr/local/lib/python2.7/dist-packages/django/core/management/commands/shell.pyc in Better()
      1 class Better(models.Model):
----> 2     blob = GridFSField()
      3
NameError: name 'GridFSField' is not defined
Um, okay Django! Where is it defined then?!

This is not really a specific answer to your question, as that answer is most likely going to be about your setup configuration. But since you seem to be working through documentation examples, and therefore evaluating, I thought it would be worthwhile to provide some points on using GridFS.
The intention of GridFS is not to be a way of storing "blobs" or replacing usage of the "filesystem method" as you (or the ODM documentation) are stating. Yes, it can be used that way, but the sole reason for its existence is to overcome the 16MB limitation MongoDB has on BSON document storage.
There is a common misconception that GridFS is a "feature" of MongoDB, yet it is actually a specification implemented on the driver side for dealing with chunking large document content. There is no magic that occurs on the server side at all. As far as internal operations to MongoDB are concerned, this is just another BSON document with fields and data.
What the driver implementation is doing is breaking the content up into smaller chunks and distributing the content over several documents in a collection. Likewise when reading the content, there are methods provided to follow and fetch the various documents that make up the total content. In a nutshell, reading and writing using the GridFS methods results in multiple calls over the wire to MongoDB.
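To make the chunking described above concrete, here is a minimal pure-Python sketch of what a driver does conceptually. It does not talk to MongoDB at all. The function names are hypothetical; the document shapes (a files document plus chunk documents carrying files_id, n and data) and the 255 KiB default chunk size follow the GridFS specification, but a real driver stores more metadata than shown here.

```python
CHUNK_SIZE = 255 * 1024  # default GridFS chunk size (255 KiB)


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Return (files_doc, chunk_docs) shaped roughly like fs.files / fs.chunks entries.

    Hypothetical helper for illustration; a real driver inserts these
    documents into two collections instead of returning them.
    """
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    files_doc = {"_id": file_id, "length": len(data), "chunkSize": chunk_size}
    return files_doc, chunk_docs


def reassemble(files_doc, chunk_docs):
    """Rebuild the original bytes by following the chunks in order of n."""
    ordered = sorted(
        (c for c in chunk_docs if c["files_id"] == files_doc["_id"]),
        key=lambda c: c["n"],
    )
    data = b"".join(c["data"] for c in ordered)
    if len(data) != files_doc["length"]:
        raise ValueError("chunks do not add up to the recorded length")
    return data
```

A 600 KiB payload ends up as three chunk documents plus one files document, so each read or write touches several documents rather than one.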
With that in mind, if your content is always going to be under 16MB in size, then you are probably better off just storing your encoded binary data within a single document, as updates will be atomic and reads will be faster, with a single read operation per document.
So if you must have documents over 16MB in size, then use GridFS. If not, just encode the content into a normal document field, as that is all GridFS is doing anyway.
For more information, please read the FAQ:
http://docs.mongodb.org/manual/faq/developers/#when-should-i-use-gridfs

You can find GridFSField under django_mongodb_engine.fields
i.e.
from django_mongodb_engine.fields import GridFSField
from django.db import models
class Image(models.Model):
    blob = GridFSField()

Related

Mongoengine change document structure

I'm trying mongo for the first time, and I chose mongoengine.
After defining the Document structure, if I try to change it (adding a field, removing a field, renaming, etc.), read operations still work, but any other operation on previously stored documents fails since they're not compliant anymore with the document structure.
Is there any way to manage this situation? Should I only use Dynamic documents with Dictionaries instead of EmbeddedDocuments?
Using DynamicDocument or setting meta = {'strict': False} on your Document may help in some cases but the only proper solution to this is running a migration script.
I'd recommend doing this using pymongo, but you could also do it from the mongo shell. Every time your model changes in a way that is not compatible, you should run a migration on the existing data so that it fits the new model. Otherwise mongoengine will complain at some point (mongoengine contributor here).
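As a rough sketch of such a migration script: the function below assumes it is handed a pymongo Collection (anything exposing update_many(filter, update) works) and uses the standard $rename, $unset and $set update operators. The field names name, full_name, legacy_flag and is_active are hypothetical and stand in for whatever changed in your model.

```python
def migrate_users(collection):
    """Bring stored documents in line with a changed model definition.

    `collection` is assumed to be a pymongo Collection. All field names
    below are made-up examples, not taken from any real schema.
    """
    # A field was renamed in the model: name -> full_name
    collection.update_many(
        {"name": {"$exists": True}},
        {"$rename": {"name": "full_name"}},
    )
    # A field was removed from the model
    collection.update_many(
        {"legacy_flag": {"$exists": True}},
        {"$unset": {"legacy_flag": ""}},
    )
    # A field was added to the model; backfill a default on old documents
    collection.update_many(
        {"is_active": {"$exists": False}},
        {"$set": {"is_active": True}},
    )
```

In practice you would obtain the collection from pymongo.MongoClient and pass in whichever collection your Document maps to. Each filter only matches documents still in the old shape, so running the script twice is harmless.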

Efficiently retrieve data (all in one batch ideally) with mongengine in Python 3

Let's say I have class User which inherits from the Document class (I am using Mongoengine). Now, I want to retrieve all users signed up after some timestamp. Here is the method I am using:
def get_users(cls, start_timestamp):
    return cls.objects(ts__gte=start_timestamp)
1000 documents are returned in 3 seconds. This is extremely slow. I have done similar queries in SQL in a couple of milliseconds. I am new to MongoDB and NoSQL in general, so I guess I am doing something terribly wrong.
I suspect the retrieval is slow because it is done in several batches. I read somewhere that for PyMongo the batch size is 101, but I do not know if that is the same for Mongoengine.
Can I change the batch size so that I get all documents at once? I will know approximately how much data will be retrieved in total.
Any other suggestions are very welcome.
Thank you!
As you suggest, a query like this should not take 3 seconds. However, the performance of the pymongo driver is unlikely to be the issue. Some things to consider:
Make sure that the ts field is indexed in the user collection.
Mongoengine does some aggressive dereferencing, so if the 1000 returned user documents have one or more ReferenceField fields, each of those results in additional queries. There are ways to avoid this.
Mongoengine provides a direct interface to the pymongo method for the MongoDB aggregation framework; this is by far the most efficient way to query MongoDB.
MongoDB recently released an official Python ODM, pymodm, in part to provide better default performance than mongoengine.

Storing unstructured data with ramses to be searched with Ramses-API?

I would like to give my users the possibility to store unstructured data in JSON-Format, alongside the structured data, via an API generated with Ramses.
Since the data is made available via Elasticsearch, I would like this data to be indexed and searchable, too.
I can't find any mention of this in the docs or by searching.
Would this be possible and how would one do it?
Cheers /Carsten
I put an answer here because I needed to give several docs links and this is a new SO account limited to a couple: https://gitter.im/ramses-tech/ramses?at=56bc0c7a4dfe1fa71ffc0b61
This is Chris's answer, copied from gitter.im:
You can use the dict field type for "unstructured data", as it takes arbitrary json. If the db engine is postgres, it uses jsonfield under the hood, and if the db engine is mongo, it's converted to a bson document as usual. Either way it should index automatically as expected in ES and will be queryable through the Ramses API.
The following ES queries are supported on documents/fields: nefertari-readthedocs-org/en/stable/making_requests.html#query-syntax-for-elasticsearch
See the docs for field types here, start at the high level (ramses) and it should "just work", but you can see what the code is mapped to at each level below down to the db if desired:
ramses: ramses-readthedocs-org/en/stable/fields.html
nefertari (underlying web framework): nefertari-readthedocs-org/en/stable/models.html#wrapper-api
nefertari-sqla (postgres-specific engine): nefertari-sqla-readthedocs-org/en/stable/fields.html
nefertari-mongodb (mongo-specific engine): nefertari-mongodb-readthedocs-org/en/stable/fields.html
Let us know how that works out, sounds like it could be a useful thing. So far we've just used that field type to hold data like user settings that the frontend wants to persist but for which the API isn't concerned.

evernote updating note resources

I'm using the Evernote API for Python to create an app that allows the user to create and update notes, but I'm having trouble understanding how to efficiently update Evernote resources. This mainly occurs when I'm converting from HTML to ENML (Evernote Markup Language), where I'm creating resources from img tags (right now I'm only considering image resources).
My question is this: how can I tell, given HTML, if a note's resources need to be updated? I've considered comparing the image data to all of the current resources' data, but that seems really slow. Right now I just make a new resource for each img tag.
Some helpful resources I've found include the Evernote resources guide and this sample code in the Evernote SDK. Any advice is appreciated.
The best way would be a comparison of the MD5 hash of the file. Evernote notes track resources by their MD5 hash.
To see the MD5 hash of a file attached to an Evernote note, look at the ENML elements labeled "en-media". The form of the tags can be seen below:
<en-media type="mime-type" hash="md5-of-file" />
Here mime-type is the file type and md5-of-file is the MD5 hash of the file. To get the ENML of a note, call getNote (documentation here) and make sure to specify that you want the contents. The ENML contents of the note is the value of the content attribute of the object returned by getNote (a note object).
While hashes can be expensive, MD5 is relatively quick, and it will be quicker to compute the MD5 hash of a file than to wait for the network to download images.
Also, the updateResource method documentation says:
"Submit a set of changes to a resource to the service. This can be
used to update the meta-data about the resource, but cannot be used to
change the binary contents of the resource (including the length and
hash). These cannot be changed directly without creating a new
resource and removing the old one via updateNote."
So the only way to "update" a resource is to remove the old resource from the note and create a new one in its place. You can do this by removing the Resource object from the list contained in the resources attribute of the note in question. To add a new resource, simply add a new Resource object to the same list.
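The MD5 comparison described above could be sketched roughly as follows. This uses only the Python standard library and works on an ENML string you have already fetched; it makes no Evernote API calls. The helper names are hypothetical, and the regular expression assumes the hash attribute is a 32-character hex string written in double quotes, as in the en-media example above.

```python
import hashlib
import re

# Matches the hash attribute of <en-media ... hash="..."> elements.
EN_MEDIA_HASH = re.compile(r'<en-media\b[^>]*\bhash="([0-9a-fA-F]{32})"')


def existing_hashes(enml):
    """Collect the MD5 hashes of all resources referenced in the note's ENML."""
    return {h.lower() for h in EN_MEDIA_HASH.findall(enml)}


def needs_new_resource(image_bytes, enml):
    """Return True if no en-media element in the ENML carries this image's MD5."""
    digest = hashlib.md5(image_bytes).hexdigest()
    return digest not in existing_hashes(enml)
```

Only images for which needs_new_resource returns True would need a new resource; the rest can keep referring to the existing one by hash.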

Python: RE vs. Query

I am building a website using Django, and this website uses blocks which are enabled for a certain page.
Right now I use a TextField containing the paths where a block is enabled. When a page is requested, Django retrieves all blocks from the database and does re.search on the TextField.
However, I was wondering whether, in terms of overhead, it would be better to use a separate DB table for block/paths, where each row contains a single path and a reference to a block.
A separate DB table is definitely the "right" way to do it, because MySQL has to send all the data from your TEXT fields every time you query. As you add more rows and the TEXT fields get bigger, you'll start to notice performance issues and eventually crash the server. Also, you'll be able to use VARCHAR and add a unique index to the paths, making lookups lightning fast.
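A minimal sketch of the separate-table idea, using Python's built-in sqlite3 module instead of Django and MySQL so that it runs on its own. The table names, column names and sample rows are all hypothetical, and the lookup is an exact path match rather than the regular-expression search from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE block (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE block_path (
        block_id INTEGER NOT NULL REFERENCES block(id),
        path     VARCHAR(255) NOT NULL,
        UNIQUE (path, block_id)  -- the unique constraint also creates an index on path
    );
""")
conn.executemany(
    "INSERT INTO block (id, name) VALUES (?, ?)",
    [(1, "sidebar"), (2, "footer")],
)
conn.executemany(
    "INSERT INTO block_path (block_id, path) VALUES (?, ?)",
    [(1, "/news/"), (2, "/news/"), (2, "/about/")],
)


def blocks_for_path(conn, path):
    """Return the names of the blocks enabled for exactly this path."""
    rows = conn.execute(
        "SELECT b.name FROM block b "
        "JOIN block_path bp ON bp.block_id = b.id "
        "WHERE bp.path = ? ORDER BY b.name",
        (path,),
    )
    return [name for (name,) in rows]
```

With this layout the database only returns the blocks for the requested path, instead of sending every block's path text to the application for matching.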
I am not exactly familiar with Django, but if I am understanding the situation correctly, you should use a table.
In fact this is exactly the kind of use that DB software is designed and optimized for.
No worries. It will actually be faster.
By doing the search yourself, you are trying to implement part of the DB logic on your own. Fun, certainly, but not so fast. :)
Here are some nice links on designing a database:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
http://en.wikipedia.org/wiki/Third_normal_form
Hope this helps. Good luck. :-)
