I am working on a kind of initialization routine for a MongoDB using mongoengine.
The documents we deliver to the user are read from several JSON files and written into the database at the start of the application by the above-mentioned init routine.
Some of these documents have unique keys, which would raise a mongoengine.errors.NotUniqueError if a document with a duplicate key is passed to the DB. This is not a problem at all, since I am able to catch those errors using try-except.
However, some other documents are just a bunch of values or parameters, so there is no unique key I can check in order to prevent those from being inserted into the DB twice.
I thought I could read all existing documents from the desired collection like this:
docs = MyCollection.objects()
and check whether the document to be inserted is already available in docs by using:
doc = MyCollection(parameter='foo')
print(doc in docs)
This prints False even if there already is a MyCollection(parameter='foo') document in the DB.
How can I achieve a duplicate detection without using unique keys?
You can check using an if statement:
if not MyCollection.objects(parameter='foo'):
# insert your documents
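If it helps, here is a minimal sketch of how the init routine could use that check; MyCollection, the parameter field, the connection and the JSON file name are all placeholders for your own setup:
import json
from mongoengine import connect, Document, StringField

class MyCollection(Document):
    parameter = StringField()

connect('mydb')  # placeholder database name

with open('parameters.json') as f:  # placeholder file
    for entry in json.load(f):
        # skip entries that already exist with exactly these field values
        if not MyCollection.objects(**entry):
            MyCollection(**entry).save()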
I want to create some kind of collection which cannot be deleted. The reason I want it like that is that when the document is empty, my website can't do the data creation process.
Is it possible to create a collection in Firestore that has an empty document?
I use Python with firebase_admin.
In Firestore, there is no such thing as an "empty collection". Collections simply appear in the console when there is a document present, and disappear when the last document is deleted. If you want to know if a collection is "empty", then you can simply query it and check that it has 0 documents.
Ideally, your code should be robust enough to handle the possibility of a missing document, because Firestore will do nothing to stop a document from being deleted if that's what you do in the console or your code.
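As a rough illustration with firebase_admin (the collection name is a placeholder and default credentials are assumed):
import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()  # assumes default credentials
db = firestore.client()

# fetch at most one document; an empty result means the collection is "empty"
docs = list(db.collection('my_collection').limit(1).stream())
if not docs:
    print('collection has no documents (or does not exist)')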
I am trying to do a bulk insert of documents into a MongoDB collection in Python, using pymongo. This is what the code looks like:
collection_name.insert_many([logs[i] for i in range(len(logs))])
where logs is a list of dictionaries of variable length.
This works fine when there are no issues with any of the logs. However, if any one of the logs has some kind of issue and pymongo refuses to save it (say, the issue is something like the document fails to match the validation schema set for that collection), the entire bulk insert is rolled back and no documents are inserted in the database.
Is there any way I can retry the bulk insert by ignoring only the defective log?
You can ignore those types of errors by specifying ordered=False as an option: collection_name.insert_many(logs, ordered=False). All operations are attempted before an exception is raised, which you can catch.
See https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many
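A short sketch of that, reusing collection_name and logs from the question:
from pymongo.errors import BulkWriteError

try:
    collection_name.insert_many(logs, ordered=False)
except BulkWriteError as exc:
    # all valid logs have been inserted at this point;
    # the error details list which documents were rejected and why
    print(exc.details['writeErrors'])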
I'm using Elasticsearch in Python, and I can't figure out how to get the IDs of the documents deleted by the delete_by_query() method! By default it only returns the number of documents deleted.
There is a parameter called _source that, if set to True, should return the source of the deleted documents. This doesn't happen; nothing changes.
Is there a good way to know which documents were deleted?
The delete by query endpoint only returns a macro summary of what happened during the task, mainly how many documents were deleted and some other details.
If you want to know the IDs of the documents that are going to be deleted, you can do a search (with _source: false) before running the delete by query operation, and you'll get the expected IDs.
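A rough sketch of that (the index name and query are placeholders, and the body-style calls assume a 7.x-style Python client; newer clients accept the query directly):
from elasticsearch import Elasticsearch

es = Elasticsearch()
query = {"match": {"status": "obsolete"}}  # placeholder query

# collect the IDs currently matching the query (no _source needed)
resp = es.search(index="my-index", body={"query": query, "_source": False, "size": 10000})
ids_to_delete = [hit["_id"] for hit in resp["hits"]["hits"]]

# then run the delete with the same query
es.delete_by_query(index="my-index", body={"query": query})
Keep in mind the two calls are not atomic, so the set of deleted documents can differ slightly if the index changes in between.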
In this document it is mentioned that the default read_policy setting is ndb.EVENTUAL_CONSISTENCY.
After I did a bulk delete of entity items from the Datastore, versions of the app I pulled up continued to read the old data, so I've tried to figure out how to change this to STRONG_CONSISTENCY, with no success, including:
entity.query().fetch(read_policy=ndb.STRONG_CONSISTENCY) and
...fetch(options=ndb.ContextOptions(read_policy=ndb.STRONG_CONSISTENCY))
The error I get is
BadArgumentError: read_policy argument invalid ('STRONG_CONSISTENCY')
How does one change this default? More to the point, how can I ensure that NDB will go to the Datastore to load a result rather than relying on an old cached value? (Note that after the bulk delete the datastore browser tells me the entity is gone.)
You cannot change that default; it is also the only option available. From the very doc you referenced (no other options are mentioned):
Description: Set this to ndb.EVENTUAL_CONSISTENCY if, instead of waiting for the Datastore to finish applying changes to all returned results, you wish to get possibly-not-current results faster.
The same is confirmed by inspecting the google.appengine.ext.ndb.context.py file (no STRONG_CONSISTENCY definition in it):
# Constant for read_policy.
EVENTUAL_CONSISTENCY = datastore_rpc.Configuration.EVENTUAL_CONSISTENCY
The EVENTUAL_CONSISTENCY ends up in ndb via the google.appengine.ext.ndb.__init__.py:
from context import *
__all__ += context.__all__
You might be able to avoid the error using a hack like this:
from google.appengine.datastore.datastore_rpc import Configuration
...fetch(options=ndb.ContextOptions(read_policy=Configuration.STRONG_CONSISTENCY))
However, I think that only applies to reading the entities for the keys obtained through the query, not to obtaining the list of keys themselves. That list comes from the index the query uses, which is always eventually consistent; this is the root cause of your deleted entities still appearing in the results (for a while, until the index is updated). From Keys-only Global Query Followed by Lookup by Key:
But it should be noted that a keys-only global query can not exclude the possibility of an index not yet being consistent at the time of the query, which may result in an entity not being retrieved at all. The result of the query could potentially be generated based on filtering out old index values. In summary, a developer may use a keys-only global query followed by lookup by key only when an application requirement allows the index value not yet being consistent at the time of a query.
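For completeness, a sketch of that keys-only-then-lookup pattern (MyEntity is a placeholder model); the key list itself still comes from the eventually consistent index:
from google.appengine.ext import ndb

class MyEntity(ndb.Model):  # placeholder model
    pass

# the key list may still be based on a not-yet-updated index
keys = MyEntity.query().fetch(keys_only=True)

# lookups by key are strongly consistent; already-deleted entities come back as None
entities = [e for e in ndb.get_multi(keys) if e is not None]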
Potentially of interest: Bulk delete datastore entity older than 2 days
I am trying to bulk insert data into MongoDB without overwriting existing data. I want to insert new data into the database if there is no match on the unique id (sourceID). Looking at the documentation for PyMongo I have written some code, but I cannot make it work. Any ideas as to what I am doing wrong?
db.bulk_write(UpdateMany({"sourceID"}, test, upsert=True))
db is the name of my database, sourceID is the unique ID of the documents that I don't want to overwrite in the existing data, and test is the array that I am trying to insert.
Either I don't understand your requirement or you misunderstand the UpdateMany operation. As per the documentation, this operation modifies the existing data (documents matching the query) and inserts new documents only if no documents match the query and upsert=True. Are you sure you don't want to use the insert_many method?
Also, in your example, the first parameter, which should be a filter for the update, is not a valid query; a query has to be in the form {"key": "value"}.
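If the intent is "insert only when no document with that sourceID exists yet", one possible sketch uses one UpdateOne per document with $setOnInsert and upsert=True (database, collection and example data are placeholders):
from pymongo import MongoClient, UpdateOne

collection = MongoClient().mydb.mycollection  # placeholder database/collection
test = [{"sourceID": 1, "value": "a"}, {"sourceID": 2, "value": "b"}]  # example data

ops = [
    UpdateOne(
        {"sourceID": doc["sourceID"]},   # match on the id you don't want duplicated
        {"$setOnInsert": doc},           # only written when no matching document exists
        upsert=True,
    )
    for doc in test
]
result = collection.bulk_write(ops)
print(result.upserted_count, "new documents inserted")
Existing documents with a matching sourceID are left untouched, because the update document only contains $setOnInsert.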