Currently I'm trying to upsert documents in a local MongoDB using pymongo (I check whether the document is already in the db; if it is, I update it, otherwise I insert it).
I'm using bulk_write to do that, and everything works fine: the data is inserted/updated.
However, I need the ids of the newly inserted/updated documents, but the upserted_ids field in the BulkWriteResult object is empty, even though it states that it inserted 14 documents.
I've added a screenshot of the variable. Is this a bug, or is there something I'm not aware of?
Finally, is there a way of getting the ids of the documents without actually searching for them in the db? (If possible, I would prefer to use bulk_write)
Thank you for your time.
EDIT:
As suggested, I added part of the code so it's easier to get the general idea:
for name in input_list:
    if name not in stored_names:  # completely new entry (both name and package)
        operations.append(InsertOne({"name": name, "package": [package_name]}))

if len(operations) == 0:
    print("## No new permissions to insert")
    return

bulkWriteResult = _db_insert_bulk(collection_name, operations)
and the insert function:
def _db_insert_bulk(collection_name, operations_list):
    return db[collection_name].bulk_write(operations_list)
The upserted_ids field in the pymongo BulkWriteResult only contains the ids of the records that have been inserted as part of an upsert operation, e.g. an UpdateOne or ReplaceOne with the upsert=True parameter set.
As you are performing InsertOne operations, which don't have an upsert option, the upserted_ids list will be empty.
The lack of an inserted_ids field in pymongo's BulkWriteResult is an omission in the driver; technically it still conforms to the CRUD specification mentioned in D. SM's answer, as the field is annotated with "Drivers may choose to not provide this property.".
But ... there is an answer. If you are only doing inserts as part of your bulk update (and not mixed bulk operations), just use insert_many(). It is just as efficient as a bulk write and, crucially, does provide the inserted_ids value in the InsertManyResult object.
from pymongo import MongoClient
db = MongoClient()['mydatabase']
inserts = [{'foo': 'bar'}]
result = db.test.insert_many(inserts, ordered=False)
print(result.inserted_ids)
Prints:
[ObjectId('5fb92cafbe8be8a43bd1bde0')]
This functionality is part of the CRUD specification and should be implemented by compliant drivers, including pymongo. Refer to the pymongo documentation for correct usage.
Example in Ruby:
irb(main):003:0> c.bulk_write([insert_one:{a:1}])
=> #<Mongo::BulkWrite::Result:0x00005579c42d7dd0 #results={"n_inserted"=>1, "n"=>1, "inserted_ids"=>[BSON::ObjectId('5fb7e4b12c97a60f255eb590')]}>
Your output shows that zero documents were upserted, therefore there wouldn't be any ids associated with the upserted documents.
Your code doesn't appear to show any upserts at all, which again means you won't see any upserted ids.
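For contrast, here is a minimal sketch (the collection and field names are hypothetical) of a bulk write that actually performs upserts, so its ids do show up in upserted_ids:
from pymongo import MongoClient, UpdateOne

db = MongoClient()['mydatabase']
operations = [
    UpdateOne({'name': 'camera'},                              # filter
              {'$addToSet': {'package': 'com.example.app'}},   # update
              upsert=True)                                     # insert if no match
]
result = db.permissions.bulk_write(operations, ordered=False)
print(result.upserted_ids)  # e.g. {0: ObjectId('...')}
upserted_ids is a dict keyed by the index of the operation in the list, so only the operations that did not match an existing document (and therefore inserted one) appear in it.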
Related
I have a dataset of CrossRef works records stored in a collection called works in MongoDB and I am using a Python application to query this database.
I am trying to find documents based on one author's name. Removing extraneous details, a document might look like this:
{'DOI': 'some-doi',
 'author': [{'given': 'Albert', 'family': 'Einstein', 'affiliation': []},
            {'given': 'R.', 'family': 'Feynman', 'affiliation': []},
            {'given': 'Paul', 'family': 'Dirac', 'affiliation': ['University of Florida']}]
}
It isn't clear to me how to combine the queries to get just Albert Einstein's papers.
I have indexes on author.family and author.given, and I've tried:
cur = works.find({'author.family':'Einstein','author.given':'Albert'})
This returns all of the documents that have some author called 'Albert' and some author called 'Einstein', even when they are two different people. I can filter this manually, but it's obviously less than ideal.
I also tried:
cur = works.find({'author':{'given':'Albert','family':'Einstein','affiliation':[]}})
But this returns nothing (after a very long delay). I've tried this with and without 'affiliation'. There are a few questions on SO about querying nested fields, but none seem to concern the case where we're looking for 2 specific things in 1 nested field.
Your issue is that author is a list.
You can use an aggregate query to unwind this list to objects, and then your query would work:
cur = works.aggregate([{'$unwind': '$author'},
{'$match': {'author.family':'Einstein', 'author.given':'Albert'}}])
Alternatively, use $elemMatch, which matches documents whose array contains at least one element satisfying all of the specified criteria.
cur = works.find({"author": {'$elemMatch': {'family': 'Einstein', 'given': 'Albert'}}})
Also consider using multikey indexes.
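If it helps, here is a minimal sketch of creating such an index with pymongo, assuming works is the collection from the question; because author is an array, MongoDB builds this as a multikey index automatically:
from pymongo import ASCENDING

# compound index on the embedded author fields
works.create_index([('author.family', ASCENDING), ('author.given', ASCENDING)])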
I would like to periodically update the data in elasticsearch.
In the file I send in for the update, there may be data that already exists in elasticsearch (to be updated) and data that is new (to be inserted as new docs).
Since the data in elasticsearch is keyed by auto-created IDs,
I have to search for the ID by a unique column "code" to check whether a doc already exists; if it exists I update it, otherwise I insert it.
I wonder if there is any method that is faster than the code I came up with below.
es = Elasticsearch()

# get the doc ID by searching (exact match) for the code, to check if it exists
res = es.search(index=index_name, doc_type=doc_type, body=body_for_search)
id_dict = dict([('id', doc['_id']) for doc in res['hits']['hits']])

# if the id exists, update the current doc by id,
# else insert with an auto-created id
if id_dict.get('id'):
    es.update(index=index_name, id=id_dict['id'], doc_type=doc_type, body=body)
else:
    es.index(index=index_name, doc_type=doc_type, body=body)
For example, is there a method where elasticsearch searches for the exact match on col["code"] for you, so you can simply "upsert" the data without specifying an id?
Any advice would be much appreciated and thank you for your reading.
PS: if we made id = col["code"] it would be much simpler and faster, but for management reasons we can't do that at the current stage.
As #Archit said, use your own ID to look up documents faster.
Use the upsert API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
Be sure your ID structure respects Lucene good practices:
If you are using your own ID, try to pick an ID that is friendly to
Lucene. Examples include zero-padded sequential IDs, UUID-1, and
nanotime; these IDs have consistent, sequential patterns that compress
well. In contrast, IDs such as UUID-4 are essentially random and offer
poor compression and slow down Lucene.
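A minimal sketch of that upsert approach, assuming the Python elasticsearch client and that the unique "code" value is reused as the document id (the index name and fields here are made up; older client versions also require doc_type):
from elasticsearch import Elasticsearch

es = Elasticsearch()
doc = {'code': 'ABC-123', 'price': 42}

# update the document if it exists, otherwise insert it, in a single call,
# without searching for the id first
es.update(
    index='products',
    id=doc['code'],  # deterministic id derived from the unique "code" field
    body={'doc': doc, 'doc_as_upsert': True},
)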
I am trying to prevent inserting duplicate documents by the following approach:
Get a list of all documents from the desired endpoint, which will contain all the documents in JSON format. This list is called available_docs.
Use a pre_POST_<endpoint> hook in order to handle the request before inserting to the data. I am not using the on_insert hook since I need to do this before validation.
Since we can access the request object, use request.json to get the payload in JSON format.
Check if request.json is already contained in available_docs
Insert the new document only if it's not a duplicate, abort otherwise.
Using this approach I got the following snippet:
def check_duplicate(request):
    if request.json not in available_docs:
        print('Not a duplicate')
    else:
        print('Duplicate')
        flask.abort(422, description='Document is a duplicate and already in database.')
The available_docs list looks like this:
available_docs = [{'foo': ObjectId('565e12c58b724d7884cd02bb'), 'bar': [ObjectId('565e12c58b724d7884cd02b9'), ObjectId('565e12c58b724d7884cd02ba')]}]
The payload request.json looks like this:
{'foo': '565e12c58b724d7884cd02bb', 'bar': ['565e12c58b724d7884cd02b9', '565e12c58b724d7884cd02ba']}
As you can see, the only difference between the document that was passed to the API and the document already stored in the DB is the datatype of the IDs. Because of that, the if-statement in the snippet above evaluates to True and judges the document to be inserted as not being a duplicate, whereas it definitely is one.
Is there a way to check if a passed document is already in the database? I am not able to use unique fields, since only the combination of all document fields needs to be unique. There is a unique identifier (which I left out in this example), but it is not suitable for the desired comparison since it is essentially a time stamp.
I think something like casting the given IDs at the keys foo and bar to ObjectIds would do the trick, but I do not know how to do this, since I do not know where to get the ObjectId datatype from.
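For what it's worth, a minimal sketch of the casting idea mentioned above: ObjectId lives in the bson package that ships with pymongo, so the string ids in the payload can be converted before comparing (field names taken from the example payload):
from bson import ObjectId

payload = dict(request.json)
payload['foo'] = ObjectId(payload['foo'])
payload['bar'] = [ObjectId(v) for v in payload['bar']]
is_duplicate = payload in available_docs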
Your approach would be much slower than setting a unique rule for the field.
Since, from your example, you are going to compare objectids, can't you simply use those as the _id field for the collection? In Mongo (and Eve of course) that field is unique by default. Actually, you typically don't even define it. You would not need to do anything at all, as a POST of a document with an already existing id would fail right away.
If you can't go that way (maybe you need to compare a different objectid field and still, for some reason, you can't simply set a unique rule for it), I would look at querying the db for the field value rather than getting all the documents from the db and then scanning them sequentially in code. Something like db.find({db_field: new_document_field_value}): if that returns any document, the new one is a duplicate. Make sure db_field is indexed (which is usually the case anyway for fields tagged with the unique rule).
EDIT after the comments. A trivial implementation would probably be something like this:
def pre_POST_callback(resource, request):
    # retrieve the mongodb collection using the eve connection
    docs = app.data.driver.db['docs']
    if docs.find_one({'foo': <value>}):
        flask.abort(422, description='Document is a duplicate and already in database.')

app = Eve()
app.run()
Here's my approach to preventing duplicate records:
def on_insert_subscription(items):
    c_subscription = app.data.driver.db['subscription']
    user = decode_token()
    if user:
        for item in items:
            if c_subscription.find_one({
                    'topic': ObjectId(item['topic']),
                    'client': ObjectId(user['user_id'])}):
                abort(422, description="Client already subscribed to this topic")
            else:
                item['client'] = ObjectId(user['user_id'])
    else:
        abort(401, description='Please provide proper credentials')
What I'm doing here is creating subscriptions for clients. If a client is already subscribed to a topic I throw 422.
Note: the client ID is decoded from the JWT token.
I'm working on parsing a file and inserting it into a database, using sqlalchemy core. I had it set up with the orm originally but that doesn't meet the speed requirements for the project.
My database has 2 tables: Objects and Attributes. The Objects table has a primary key of obj_id. The primary key for Attributes is composite: attr_name, attr_class, and obj_id, which is also a foreign key referencing Objects.
After parsing the file, the attributes are stored in a list of dictionaries, like so:
[
    {'obj_id': obj_id, 'attr_name': name, 'attr_class': class, ...},
    {...}
]
The data is inserted by first bulk inserting the objects, then the attributes. The object insert works perfectly. When inserting the attributes, however, I get an IntegrityError saying I tried to insert a duplicate primary key.
Here is my insert code for attributes:
self.engine.execute(
    Attributes.__table__.insert(),
    [{'obj_id': attr['obj_id'],
      'attr_name': attr['attr_name'],
      'attr_class': attr['attr_class'],
      'attr_type': attr['attr_type'],
      'attr_size': attr['attr_size']} for attr in attrList])
While trying to work this error out, I printed the id, name, and class of each attribute in the list to a file to find the duplicate key. Nowhere in the list is there actually an identical primary key, so this leads me to believe it is a problem with the structure of my query.
Can anyone figure this out with the info I've given, or give me somewhere to look for more information? I've already checked the documentation pretty thoroughly and couldn't find anything helpful.
Edit:
I also tried executing each insert statement separately, as suggested by someone on sqlalchemy's google group. The results were the same. The code I used:
insert = Attributes.__table__.insert()
for attr in attrList:
    stmt = insert.values({'obj_id': attr['obj_id'], ...})
    self.engine.execute(stmt)
where ... was the rest of the values.
Edit 2:
The Integrity error is thrown as soon as I try to insert an attribute with the same name/class but a different object id. So for example:
In the format name-class-id:
By iteration 4, I've got:
Attr1-Class1-0
Attr2-Class2-0
Attr3-Class3-0
Attr4-Class4-0
On the next iteration, I try to insert Attr1-Class1-1, which fails.
I found the problem, and it was completely unrelated to the insert code. When storing the data in the list, I was storing the mapped Object itself as obj_id (instead of a plain id value), which sqlalchemy didn't like. Fixing that fixed the insertions.
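A hedged sketch of that fix (the variable names here are illustrative, not the original code): the dictionaries should hold the plain primary-key value rather than the mapped instance:
attrList.append({
    'obj_id': parent_obj.obj_id,   # scalar key value, not the Objects instance
    'attr_name': attr_name,
    'attr_class': attr_class,
    'attr_type': attr_type,
    'attr_size': attr_size,
})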
I am currently playing around with SQLAlchemy a bit, which is really quite neat.
For testing, I created a huge table containing my picture archive, indexed by SHA-1 hashes (to remove duplicates :-)), which was impressively fast...
For fun I did the equivalent of a select * over the resulting SQLite database:
session = Session()
for p in session.query(Picture):
print(p)
I expected to see hashes scrolling by, but instead it just kept scanning the disk. At the same time, memory usage was skyrocketing, reaching 1GB after a few seconds. This seems to come from the identity map feature of SQLAlchemy, which I thought was only keeping weak references.
Can somebody explain this to me? I thought that each Picture p would be collected after the hash is written out!?
Okay, I just found a way to do this myself. Changing the code to
session = Session()
for p in session.query(Picture).yield_per(5):
print(p)
loads only 5 pictures at a time. It seems like the query loads all rows at once by default. However, I don't yet understand the disclaimer on that method. Quote from the SQLAlchemy docs:
WARNING: use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.
In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=False) since those collections will be cleared for a new load when encountered in a subsequent result batch.
So if using yield_per is actually the right way (tm) to scan over copious amounts of SQL data while using the ORM, when is it safe to use it?
Here's what I usually do for this situation:
def page_query(q):
    offset = 0
    while True:
        r = False
        for elem in q.limit(1000).offset(offset):
            r = True
            yield elem
        offset += 1000
        if not r:
            break

for item in page_query(Session.query(Picture)):
    print(item)
This avoids the various buffering that DBAPIs do as well (such as psycopg2 and MySQLdb). It still needs to be used appropriately if your query has explicit JOINs, although eagerly loaded collections are guaranteed to load fully since they are applied to a subquery which has the actual LIMIT/OFFSET supplied.
I have noticed that PostgreSQL takes almost as long to return the last 100 rows of a large result set as it does to return the entire result (minus the actual row-fetching overhead), since OFFSET just does a simple scan of the whole thing.
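If the mapped table has a monotonically increasing primary key, one alternative is keyset pagination, i.e. paging on the key instead of using OFFSET. A hedged sketch, assuming Picture has an integer id column (which would not be the case for a SHA-1-keyed table like the one in the question):
def keyset_query(session, page_size=1000):
    # page by primary key so the database never scans and discards skipped rows
    last_id = 0
    while True:
        batch = (session.query(Picture)
                        .filter(Picture.id > last_id)
                        .order_by(Picture.id)
                        .limit(page_size)
                        .all())
        if not batch:
            break
        for picture in batch:
            yield picture
        last_id = batch[-1].id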
You can defer the picture column so that it is only retrieved on access. You can do it on a query-by-query basis, like this:
session = Session()
for p in session.query(Picture).options(sqlalchemy.orm.defer("picture")):
print(p)
or you can do it in the mapper
mapper(Picture, pictures, properties={
    'picture': deferred(pictures.c.picture)
})
How you do it is in the documentation here
Doing it either way will make sure that the picture is only loaded when you access the attribute.