Using the bare pymongo driver to connect Python to MongoDB, why does using an ObjectId instance as the key of an embedded document raise an InvalidDocument error?
I am trying to link documents using ObjectIds, and I can't see why I would want to convert them to strings when the ones created automatically by the driver are ObjectId instances.
item = collection.find_one({'x': 'foo'})
item['otherstuff'] = {pymongo.objectid.ObjectId() : 'data about this link'}
collection.update({'x':'foo'}, item)
bson.errors.InvalidDocument: documents must have only string keys, key was ObjectId('4f0b5d4e764df61c67000000')
In practice, the linked ids refer to documents containing questions, and the values in the dictionary keyed as 'otherstuff' here would hold this document's responses to each particular question.
Is there a reason ObjectIds used like this won't encode into BSON and fail instead? Is it impossible to nest ObjectIds within documents like this to cross-reference them? Have I misunderstood their purpose?
The BSON spec dictates that keys must be strings, so PyMongo is right to reject this as an invalid document (and it would be invalid regardless of the level at which an ObjectId is used as a key, whether at the top level or in an embedded document). This is necessary, among other reasons, so that the query language can be unambiguous. Imagine you had this document (and that it were valid BSON):
{ _id: ...,
  "4f0cbe6d7f40d36b24a5c4d7": true,
  ObjectId("4f0cbe6d7f40d36b24a5c4d7"): false
}
And then you attempted to query with:
db.foo.find({"4f0cbe6d7f40d36b24a5c4d7": false})
Should this return this document? Should that string be auto-boxed into an ObjectId? How would Mongo know when that can be auto-boxed, and how to disambiguate in cases like this document?
A possible alternative solution to your problem is to have an array of embedded documents like:
{ answers: [
    { answer_id: ObjectId("..."), summary: "Good answer to this question" },
    { answer_id: ObjectId("..."), summary: "Bad answer to this question" }
  ]
}
This is valid BSON and also indexes more efficiently. If you add an index on answers, you can search efficiently for exact matches on these subdocuments; if you add an index on answers.answer_id, you can search efficiently by the ObjectId of the answer you're looking for (and so on).
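A minimal pymongo sketch of this pattern (the collection and database names here are placeholders, not from the question):

from bson import ObjectId
from pymongo import MongoClient

questions = MongoClient()['mydb']['questions']

# Store the cross-references as an array of subdocuments with string keys
questions.insert_one({
    'x': 'foo',
    'answers': [
        {'answer_id': ObjectId(), 'summary': 'Good answer to this question'},
        {'answer_id': ObjectId(), 'summary': 'Bad answer to this question'},
    ],
})

# Index the embedded id, then query by it
questions.create_index('answers.answer_id')
some_id = questions.find_one({'x': 'foo'})['answers'][0]['answer_id']
print(questions.find_one({'answers.answer_id': some_id}))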
Okay, so currently I'm trying to upsert something in a local MongoDB using pymongo. (I check to see if the document is in the db and, if it is, update it; otherwise I just insert it.)
I'm using bulk_write to do that, and everything is working OK: the data is inserted/updated.
However, I need the ids of the newly inserted/updated documents, but the "upserted_ids" field in the BulkWriteResult object is empty, even though it states that it inserted 14 documents.
I've added a screenshot of the variable. Is it a bug, or is there something I'm not aware of?
Finally, is there a way of getting the ids of the documents without actually searching for them in the db? (If possible, I would prefer to use bulk_write)
Thank you for your time.
EDIT:
As suggested, I added a part of the code so it's easier to get the general idea:
operations = []
for name in input_list:
    if name not in stored_names:  # completely new entry (both name and package)
        operations.append(InsertOne({"name": name, "package": [package_name]}))

if len(operations) == 0:
    print("## No new permissions to insert")
    return

bulkWriteResult = _db_insert_bulk(collection_name, operations)
and the insert function:
def _db_insert_bulk(collection_name, operations_list):
    return db[collection_name].bulk_write(operations_list)
The upserted_ids field in the pymongo BulkWriteResult only contains the ids of the records that have been inserted as part of an upsert operation, e.g. an UpdateOne or ReplaceOne with the upsert=True parameter set.
As you are performing InsertOne which doesn't have an upsert option, the upserted_ids list will be empty.
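For example, a minimal sketch (database and collection names assumed) showing that the same write expressed as an upsert does populate upserted_ids:

from pymongo import MongoClient, UpdateOne

db = MongoClient()['mydatabase']

# An update with upsert=True reports the ids of documents it inserts,
# as a map of operation index to _id
ops = [UpdateOne({'name': 'foo'}, {'$set': {'package': ['bar']}}, upsert=True)]
result = db.test.bulk_write(ops)
print(result.upserted_ids)  # e.g. {0: ObjectId('...')}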
The lack of an inserted_ids field in pymongo's BulkWriteResult is an omission in the drivers; technically it conforms to the CRUD specification mentioned in D. SM's answer, as the field is annotated with "Drivers may choose to not provide this property."
But ... there is an answer. If you are only doing inserts as part of your bulk write (and not mixed bulk operations), just use insert_many(). It is just as efficient as a bulk write and, crucially, does provide the inserted_ids value in the InsertManyResult object.
from pymongo import MongoClient
db = MongoClient()['mydatabase']
inserts = [{'foo': 'bar'}]
result = db.test.insert_many(inserts, ordered=False)
print(result.inserted_ids)
Prints:
[ObjectId('5fb92cafbe8be8a43bd1bde0')]
This functionality is part of the CRUD specification and should be implemented by compliant drivers, including pymongo. See the pymongo documentation for correct usage.
Example in Ruby:
irb(main):003:0> c.bulk_write([insert_one:{a:1}])
=> #<Mongo::BulkWrite::Result:0x00005579c42d7dd0 #results={"n_inserted"=>1, "n"=>1, "inserted_ids"=>[BSON::ObjectId('5fb7e4b12c97a60f255eb590')]}>
Your output shows that zero documents were upserted, therefore there wouldn't be any ids associated with the upserted documents.
Your code doesn't appear to show any upserts at all, which again means you won't see any upserted ids.
Let's say I have a data model with some optional properties. This could be for example a user object with a "firstname", a "lastname" and an optional "website" property.
In Cloud Firestore only user documents with a known website would have the "website" property set, for all other user documents this property would not exist.
My question now is: how do I query for all user documents without a "website" property?
Documents can contain properties with a null value data type (see data types documentation). This will then allow you to construct a query to limit results where the website property is null.
This is not quite the same as a missing property, but if you use custom objects to write data to Firestore, empty properties will automatically be saved as null rather than not at all. You can also manually/programmatically write a null value to the database.
In Android, I tested this using the following:
FirebaseFirestore.getInstance().collection("test").whereEqualTo("website", null).get();
My database structure contained two documents in the test collection: inuwlZOvZNTHuBakS6GV with website set to null, and 9Hf7uwORiiToOKz6zcsX with a string value. The query returned only the document inuwlZOvZNTHuBakS6GV, because document 9Hf7uwORiiToOKz6zcsX contains a string value in the website property.
I believe you usually develop in Swift, where unfortunately custom objects aren't supported, but you can use NSNull() to write a null value to Firestore. For example (I'm not proficient in Swift, so feel free to correct any issues):
// Writing data
let docData: [String: Any] = [
    "firstname": "Example",
    "lastname": "User",
    "website": NSNull()
]
db.collection("data").document("one").setData(docData) { err in
    if let err = err {
        print("Error writing document: \(err)")
    } else {
        print("Document successfully written!")
    }
}

// Querying for null values
let query = db.collection("test").whereField("website", isEqualTo: NSNull())
The documentation doesn't mention a method to query for values that don't exist, so this seems like the next best approach. If anyone can improve or suggest alternatives, please do.
I am trying to prevent inserting duplicate documents by the following approach:
Get a list of all documents from the desired endpoint which will contain all the documents in JSON-format. This list is called available_docs.
Use a pre_POST_<endpoint> hook in order to handle the request before inserting to the data. I am not using the on_insert hook since I need to do this before validation.
Since we can access the request object, use request.json to get the JSON-formatted payload.
Check if request.json is already contained in available_docs
Insert the new document only if it's not a duplicate; abort otherwise.
Using this approach I got the following snippet:
import flask

def check_duplicate(request):
    if request.json not in available_docs:
        print('Not a duplicate')
    else:
        print('Duplicate')
        flask.abort(422, description='Document is a duplicate and already in database.')
The available_docs list looks like this:
available_docs = [{'foo': ObjectId('565e12c58b724d7884cd02bb'), 'bar': [ObjectId('565e12c58b724d7884cd02b9'), ObjectId('565e12c58b724d7884cd02ba')]}]
The payload request.json looks like this:
{'foo': '565e12c58b724d7884cd02bb', 'bar': ['565e12c58b724d7884cd02b9', '565e12c58b724d7884cd02ba']}
As you can see, the only difference between the document passed to the API and the document already stored in the DB is the datatype of the IDs. Due to that fact, the if-statement in my snippet above evaluates to True and judges the document to be inserted as not being a duplicate, whereas it definitely is one.
Is there a way to check whether a passed document is already in the database? I am not able to use unique fields since only the combination of all document fields needs to be unique. There is a unique identifier (which I left out in this example), but it is not suitable for the desired comparison since it is essentially a timestamp.
I think something like casting the given IDs at the keys foo and bar to ObjectIds would do the trick, but I do not know how to do this since I do not know where to import the ObjectId datatype from.
Your approach would be much slower than setting a unique rule for the field.
Since, from your example, you are going to compare objectids, can't you simply use those as the _id field for the collection? In Mongo (and Eve of course) that field is unique by default. Actually, you typically don't even define it. You would not need to do anything at all, as a POST of a document with an already existing id would fail right away.
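To illustrate at the pymongo level (a sketch outside of Eve; all names here are made up): a second insert with an existing _id fails immediately with DuplicateKeyError:

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

coll = MongoClient()['mydb']['docs']
coll.insert_one({'_id': 'abc', 'payload': 1})
try:
    coll.insert_one({'_id': 'abc', 'payload': 2})  # same _id
except DuplicateKeyError:
    print('duplicate rejected by the unique index on _id')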
If you can't go that way (maybe you need to compare a different ObjectId field and still, for some reason, can't simply set a unique rule for it), I would look at querying the db for the field value rather than getting all the documents from the db and scanning them sequentially in code. Something like db.find({db_field: new_document_field_value}). If that returns a document, the new one is a duplicate. Make sure db_field is indexed (which usually holds true anyway for fields tagged with the unique rule).
EDIT after the comments. A trivial implementation would probably be something like this:
def pre_POST_callback(resource, request):
    # retrieve mongodb collection using eve connection
    docs = app.data.driver.db['docs']
    if docs.find_one({'foo': <value>}):
        flask.abort(422, description='Document is a duplicate and already in database.')

app = Eve()
app.run()
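As for the cast mentioned in the question: the ObjectId type lives in the bson package that ships with pymongo. A sketch of normalising the payload before comparison, reusing the foo/bar layout from the question:

from bson import ObjectId

payload = {'foo': '565e12c58b724d7884cd02bb',
           'bar': ['565e12c58b724d7884cd02b9', '565e12c58b724d7884cd02ba']}

# Cast the string ids so they compare equal to the stored ObjectIds
normalised = {'foo': ObjectId(payload['foo']),
              'bar': [ObjectId(s) for s in payload['bar']]}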
Here's my approach on preventing duplicate records:
def on_insert_subscription(items):
    c_subscription = app.data.driver.db['subscription']
    user = decode_token()
    if user:
        for item in items:
            if c_subscription.find_one({
                    'topic': ObjectId(item['topic']),
                    'client': ObjectId(user['user_id'])}):
                abort(422, description="Client already subscribed to this topic")
            else:
                item['client'] = ObjectId(user['user_id'])
    else:
        abort(401, description='Please provide proper credentials')
What I'm doing here is creating subscriptions for clients. If a client is already subscribed to a topic I throw 422.
Note: the client ID is decoded from the JWT token.
I am using Python and MongoDB. I have a collection which contains 40000 documents. I have a group of coordinates and I need to find which document these coordinates belong to. Now I am doing:
cell_start = citymap.find({"cell_latlng":{"$geoIntersects":{"$geometry":{"type":"Point", "coordinates":orig_coord}}}})
This is a typical GeoJSON query and it works well. Now, I know some documents have a field like this:
{'trips_dest':......}
The value of this field is not important, so I just skip it. The thing is that, instead of searching all 40000 documents, I could search only the documents that have the field 'trips_dest'.
Since only about 40% of the documents have the field 'trips_dest', I think this would increase efficiency. However, I don't know how to modify my code to do that. Any ideas?
You need the $exists query operator. Something like this:
cell_start = citymap.find({"trips_dest": {"$exists": True},
                           "cell_latlng": {"$geoIntersects": {"$geometry": {"type": "Point", "coordinates": orig_coord}}}})
To quote the documentation:
Syntax: { field: { $exists: <boolean> } }
When <boolean> is true, $exists matches the documents that contain the field, including documents where the field value is null
If you need to reject null values as well, use:
"trips_dest": {"$exists": True, "$ne": None}
As a final note, a sparse index might speed up such a query.
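For example (a sketch; index options as in the pymongo docs):

# Only documents that actually contain trips_dest enter a sparse index
citymap.create_index('trips_dest', sparse=True)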
I am using pymongo in the following way:
from pymongo import *
a = {'key1':'value1'}
db1.collection1.insert(a)
print a
This prints
{'_id': ObjectId('53ad61aa06998f07cee687c3'), 'key1': 'value1'}
on the console.
I understand that _id is added to the Mongo document. But why is it added to my Python dictionary too? I did not intend that. I am wondering what the purpose of this is. I could be using this dictionary for other purposes too, and the dictionary gets updated as a side effect of inserting it into the collection. If I have to, say, serialise this dictionary into a JSON object, I will get an
ObjectId('53ad610106998f0772adc6cb') is not JSON serializable
error. Shouldn't the insert function keep the dictionary's value unchanged while inserting the document into the db?
Like many other database systems out there, PyMongo adds the unique identifier necessary to retrieve the data from the database as soon as it's inserted. (What would happen if you inserted two dictionaries with the same content {'key1':'value1'} into the database? How would you distinguish that you want this one and not that one?)
This is explained in the Pymongo docs:
When a document is inserted a special key, "_id", is automatically added if the document doesn’t already contain an "_id" key. The value of "_id" must be unique across the collection.
If you want to change this behavior, you can set the _id key yourself before inserting. In my opinion, this is a bad idea: it can easily lead to collisions, and you lose the juicy information stored in a "real" ObjectId, such as the creation time, which is great for sorting and the like.
>>> a = {'_id': 'hello', 'key1':'value1'}
>>> collection.insert(a)
'hello'
>>> collection.find_one({'_id': 'hello'})
{u'key1': u'value1', u'_id': u'hello'}
Or, if your problem arises when serializing to JSON, you can use the utilities in the bson module:
>>> a = {'key1':'value1'}
>>> collection.insert(a)
ObjectId('53ad6d59867b2d0d15746b34')
>>> from bson import json_util
>>> json_util.dumps(collection.find_one({'_id': ObjectId('53ad6d59867b2d0d15746b34')}))
'{"key1": "value1", "_id": {"$oid": "53ad6d59867b2d0d15746b34"}}'
(you can verify that this is valid JSON on pages like jsonlint.com)
_id acts as the primary key for documents; unlike in SQL databases, it is required in MongoDB.
To make _id serializable, you have two options:
Set _id to a JSON-serializable datatype in your documents before inserting them (e.g. int, str), but keep in mind that it must be unique per document.
Use custom BSON serialization encoder/decoder classes:
import json

from bson.json_util import default as bson_default
from bson.json_util import object_hook as bson_object_hook


class BSONJSONEncoder(json.JSONEncoder):
    def default(self, o):
        return bson_default(o)


class BSONJSONDecoder(json.JSONDecoder):
    def __init__(self, **kwargs):
        json.JSONDecoder.__init__(self, object_hook=bson_object_hook, **kwargs)
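Usage would then look something like this (a sketch, assuming a collection handle named collection):

import json

doc = collection.find_one()
as_json = json.dumps(doc, cls=BSONJSONEncoder)            # ObjectId -> {"$oid": ...}
round_tripped = json.loads(as_json, cls=BSONJSONDecoder)  # {"$oid": ...} -> ObjectId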
As @BorrajaX already answered, I just want to add some more.
_id is a unique identifier; it is generated automatically when a document is inserted into the collection. You can either set your own id or use the one MongoDB creates for you.
The documentation covers this.
For your case, you can simply remove this key with the del keyword: del a["_id"].
or
if you need _id for further operations, you can use dumps from the bson module:
import json
from bson.json_util import dumps as bson_dumps

a["_id"] = json.loads(bson_dumps(a["_id"]))
or
before inserting the document you can add your own custom _id, so you won't need to serialize your dictionary:
a["_id"] = "some_id"
db1.collection1.insert(a)
This behavior can be circumvented by using the copy module. This passes a copy of the dictionary to pymongo, leaving the original intact. Based on the code snippet in your example, one should modify it like so:
import copy
from pymongo import *
a = {'key1':'value1'}
db1.collection1.insert(copy.copy(a))
print a
The docs clearly answer your question:
MongoDB stores documents on disk in the BSON serialization format. BSON is a binary representation of JSON documents, though it contains more data types than JSON.
The value of a field can be any of the BSON data types, including other documents, arrays, and arrays of documents. The following document contains values of varying types:
var mydoc = {
    _id: ObjectId("5099803df3f4948bd2f98391"),
    name: { first: "Alan", last: "Turing" },
    birth: new Date('Jun 23, 1912'),
    death: new Date('Jun 07, 1954'),
    contribs: [ "Turing machine", "Turing test", "Turingery" ],
    views: NumberLong(1250000)
}
See the BSON specification (bsonspec.org) to know more about BSON.