Delete randomly selected N documents in a collection (MongoDB)

Delete randomly selected N documents in a collection (MongoDB) - python

Could anybody please tell me what's the elegant way you would delete N randomly selected documents in a collection in a MongoDB database (through Python ideally)? I would like to use somewhat concise like this
db.users.remove({ $sample: { size: N } })
But this one doesn't parse and I couldn't find a working alternative anywhere else . Many thanks!

use aggregation to get your sample and store _id values to a list:
list_of_ids=list(db.users.aggregate([{'$sample': {'size': 10 }}, {'$project' : {'_id' : 1}} ]))
use delete_many to drop sample documents
results = db.users.delete_many({'_id: {'$in': list_of_ids}})
(*) make sure to check here for limitations of $sample

Related

Periodically process and update documents in elasticsearch index

I need to come up with a strategy to process and update documents in an elasticsearch index periodically and efficiently. I do not have to look at documents that I processed before.
My setting is that I have a long running process, which continuously inserts documents to an index, say approx. 500 documents per hour (think about the common logging example).
I need to find a solution to update some amount of documents periodically (via cron job, e.g) to run some code on a specific field (text field, eg.) to enhance that document with a number of new fields. I want to do this to offer more fine grained aggregations on the index. In the logging analogy, this could be, e.g., I get the UserAgent-string from a log entry (document), do some parsing on that, and add some new fields back to that document and index it.
So my approach would be:
Get some amount of documents (or even all) that I haven't looked at before. I could query them by combining must_not and exists, for instance.
Run my code on these documents (run the parser, compute some new stuff, whatever).
Update the documents obtained previously (probably most preferably via bulk api).
I know there is the Update by query API. But this does not seem to be right here, since I need to run my own code (which btw depends on external libraries), on my server and not as a painless script, which would not offer that comprehensive tasks I need.
I am accessing elasticsearch via python.
The problem is now that I don't know how to implement the above approach. E.g. what if the amount of document obtained in step 1. is larger than myindex.settings.index.max_result_window?
Any ideas?

I considered #Jay's comment and ended up with this pattern, for the moment:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch.helpers import scan
from my_module.postprocessing import post_process_doc
es = Elasticsearch(...)
es.ping()
def update_docs( docs ):
""""""
for idx,doc in enumerate(docs):
if idx % 10000 == 0:
print( 'next 10k' )
new_field_value = post_process_doc( doc )
doc_update = {
"_index": doc["_index"],
"_id" : doc["_id"],
"_op_type" : "update",
"doc" : { <<the new field>> : new_field_value }
}
yield doc_update
docs = scan( es, query='{ "query" : { "bool": { "must_not": { "exists": { "field": <<the new field>> }} } }}', index=index, scroll="1m", preserve_order=True )
bulk( es, update_docs( docs ) )
Comments:
I learned that elasticsearch keeps a view of the search results when you do a scroll and pass the corresponding ids with the query request. The scan abstraction method will handle that for you. The scroll-parameter in the method above tells elasticsearch how long the view will be open, i.e., how long the view will be consistant.
As stated in my comment the documentation says that they no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging use .. point in time (PIT), but I haven't tried it yet.
In my implementation, I needed to pass preserve_over=True, otherwise an error was thrown.
Remember to update the mapping of the index beforehand, e.g., when you want to add a nested fields as another field in your document.

How to paginate an aggregation pipeline result in pymongo?

I have a web app where I store some data in Mongo, and I need to return a paginated response from a find or an aggregation pipeline. I use Django Rest Framework and its pagination, which in the end just slices the Cursor object. This works seamlessly for Cursors, but aggregation returns a CommandCursor, which does not implement __getitem__().
cursor = collection.find({})
cursor[10:20] # works, no problem
command_cursor = collection.aggregate([{'$match': {}}])
command_cursor[10:20] # throws not subscriptable error
What is the reason behind this? Does anybody have an implementation for CommandCursor.__getitem__()? Is it feasible at all?
I would like to find a way to not fetch all the values when I need just a page. Converting to a list and then slicing it is not feasible for large (100k+ docs) pipeline results. There is a workaround with based on this answer, but this only works for the first few pages, and the performance drops rapidly for pages at the end.

Mongo has certain aggregation pipeline stages to deal with this, like $skip and $limit that you can use like so:
aggregation_results = list(collection.aggregate([{'$match': {}}, {'$skip': 10}, {'$limit': 10}]))
Specifically as you noticed Pymongo's command_cursor does not have implementation for __getitem__ hence the regular iterator syntax does not work as expected. I would personally recommend not to tamper with their code unless you're interested in becoming a contributer to their package.

The MongoDB cursor for find and aggregate functions in a different way since cursor result from aggregation query is a result of precessed data (in most cases) which is not the case for find-cursors as they are static and hence documents can be skipped and limitted to your will.
You can add the paginator limits as $skip and $limit stages in the aggregation pipeline.
For Example:
command_cursor = collection.aggregate([
{
"$match": {
# Match Conditions
}
},
{
"$skip": 10 # No. of documents to skip (Should be `0` for Page - 1)
},
{
"$limit": 10 # No. of documents to be displayed on your webpage
}
])

Locating elements directly inside a dictionary or json object using multiple properties with Python

All i am a bit new to python and more so have background in other languages. My specific question is what is the easiest way to get to location within a dictionary or json document by using multiple properties of each doc entry.
Example doc structure:
[
{"Car" : "Ford", "Color" : "Red", "ID" : 1},
{"Car" : "Ford", "Color" : "Blue", "ID" : 2},
]
Is there an easy way to say search for a Red Ford using something other than having to write a iteration function unique to each doc to locate those records?
print (doc["Ford"]["Red"])
or something similar to how a SQL works in a database like
Select * from doc where Car='Ford' and Color='Red'
Just starting down the python path with multiple document structures and want to make sure i'm not doing something more crudely than it needs to be. I know the iteration will of course work but you have to kinda code one for each document and it just seems like there would be something simpler but not sure.
Thanks!
Tim

You could utilize an array filter based on your current example:
[o for o in doc if o["Car"] == "Ford" and o["Color"] == "Red"]
Alternatively, filter:
list(filter(lambda x: x["Car"] == "Ford" and x["Color"] == "Red", doc))

Mongodb: How to change an element of a nested arrary?

From what I have read it is impossible to update an element in an nested array using the positional operator $ in mongo. The $ only works one level deep. I see it is a requested feature in mongo 2.7.
Updating the whole document one level up is not an option because of write conflicts. I need to just be able to change the 'username' for a particular reward program for instance.
One of the ideas would to be pull, modify, and push the entire 'reward_programs' element but then I would loose the order. Order is important.
Consider this document:
{
"_id:"0,
"firstname":"Tom",
"profiles" : [
{
"profile_name": "tom",
"reward_programs:[
{
'program_name':'American',
'username':'tomdoe',
},
{
'program_name':'Delta',
'username':'tomdoe',
}
]
}
]
}
How would you go about specifically changing the 'username' of 'program_name'=Delta?

After doing more reading it looks like this is unsupported in mongodb at the moment. Positional updates are only supported for one level deep. The feature might be added for mongodb 2.7.
The are a couple of work arounds.
1) Flatten out your database structure. In this case, make 'reward_programs' it's own collection and do your operation on that.
2) Instead of arrays of dicts, use dicts of dicts. That way you can just have an absolute path down to the object you need to modify. This can have drawbacks to query flexibility.
3) Seems hacky to me but you can also walk the list on the nested array find it's position index in the array and do something like this:
users.update({'_id': request._id, 'profiles.profile_name': profile_name}, {'$set': {'profiles.$.reward_programs.{}.username'.format(index): new_username}})
4) Read in the whole document, modify, write back. However, this has possible write conflicts
Setting up your database structure initially is extremely important. It really depends on how you are going to use it.

A simple way to do this:
doc = collection.find_one({'_id': 0})
doc['profiles'][0]["reward_programs"][1]['username'] = 'new user name'
#replace the whole collection
collection.save(doc)

An efficient way to update dictionary in MongoDB?

I have this MongoDb schema:
tags:{
"image_uid":"",
"faces": [
{
"image_uid":"",
"age_real":""
}
]}
witch I update with a dictionary
feedbacks = [{
'face_uid': '02d42dee-3b66-11e2-b12e-e0cb4e12150c',
'age': 23
},
{
'face_uid': '02d42dee-3b66-11e2-b12e-e0cb4e12150d',
'age': 23
}]
in this way:
for feedback in feedbacks:
tags.update(
{'image_uid': image_uid, 'faces.face_uid': feedback['face_uid']},
{"$set": {'faces.$.age_real': feedback['age']}}, w=1
)
There is a more efficient way instead of the for loop?

Currently MongoDB offers no support for updating multiple array elements at once. However, instead of performing several updates in sequence, you might choose to use the Update if Current pattern, or something similar, to update your document locally and then replace it on the DB.
Also, check the original jira, where you can find a couple work-arounds in the comments.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.