Reading large data from mongoDB in batches - Pymongo [duplicate] - python

I know that it is bad practice to use skip to implement pagination, because once your data gets large, skip starts to consume a lot of memory. One way to overcome this is to rely on the natural order of the _id field:
//Page 1
db.users.find().limit(pageSize);
//Find the id of the last document in this page
last_id = ...
//Page 2
users = db.users.find({ '_id': { '$gt': last_id } }).limit(10);
The problem is that I'm new to Mongo and do not know the best way to get this last_id.

The concept you are describing can be called "forward paging". The name fits because, unlike with the .skip() and .limit() modifiers, you cannot "go back" to a previous page or "skip" to a specific page, at least not without a great deal of effort to store "seen" or "discovered" pages. If that type of "links to page" paging is what you want, you are best off sticking with the .skip() and .limit() approach, despite the performance drawbacks.
If it is a viable option to you to only "move forward", then here is the basic concept:
db.junk.find().limit(3)
{ "_id" : ObjectId("54c03f0c2f63310180151877"), "a" : 1, "b" : 1 }
{ "_id" : ObjectId("54c03f0c2f63310180151878"), "a" : 4, "b" : 4 }
{ "_id" : ObjectId("54c03f0c2f63310180151879"), "a" : 10, "b" : 10 }
Of course that's your first page with a limit of 3 items. Consider that now with code iterating the cursor:
var lastSeen = null;
var cursor = db.junk.find().limit(3);
while (cursor.hasNext()) {
    var doc = cursor.next();
    printjson(doc);
    if (!cursor.hasNext())
        lastSeen = doc._id;
}
That iterates the cursor and does something with each document. When the last item in the cursor is reached, lastSeen is set to that document's _id:
ObjectId("54c03f0c2f63310180151879")
In your subsequent iterations you just feed that _id value which you keep ( in session or whatever ) to the query:
var cursor = db.junk.find({ "_id": { "$gt": lastSeen } }).limit(3);
while (cursor.hasNext()) {
    var doc = cursor.next();
    printjson(doc);
    if (!cursor.hasNext())
        lastSeen = doc._id;
}
{ "_id" : ObjectId("54c03f0c2f6331018015187a"), "a" : 1, "b" : 1 }
{ "_id" : ObjectId("54c03f0c2f6331018015187b"), "a" : 6, "b" : 6 }
{ "_id" : ObjectId("54c03f0c2f6331018015187c"), "a" : 7, "b" : 7 }
And the process repeats over and over until no more results can be obtained.
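Since the question is about PyMongo, the same loop can be sketched in Python. This is only a sketch under a few assumptions: `page_filter`, `iter_batches` and `batch_size` are hypothetical names, `collection` is assumed to be a PyMongo `Collection`, and an explicit sort on `_id` is added so that the page order does not depend on how the server happens to return documents.

```python
def page_filter(last_id):
    """Filter for the next page: everything after the last _id already seen."""
    return {} if last_id is None else {"_id": {"$gt": last_id}}


def iter_batches(collection, batch_size=1000):
    """Yield lists of at most batch_size documents in ascending _id order.

    `collection` is assumed to be a pymongo Collection (hypothetical helper).
    """
    last_id = None
    while True:
        cursor = collection.find(page_filter(last_id)).sort("_id", 1).limit(batch_size)
        batch = list(cursor)
        if not batch:
            break  # no more results can be obtained
        yield batch
        last_id = batch[-1]["_id"]  # remember the last _id of this page
```

Usage would look like `for batch in iter_batches(db.junk, 3): ...`, with each `batch` holding one page of documents.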
That's the basic process for a natural order such as _id. For something else it gets a bit more complex. Consider the following:
{ "_id": 4, "rank": 3 }
{ "_id": 8, "rank": 3 }
{ "_id": 1, "rank": 3 }
{ "_id": 3, "rank": 2 }
To split that into two pages sorted by rank, you essentially need to know what you have "already seen" and exclude those results. Looking at the first page:
var lastSeen = null;
var seenIds = [];
var cursor = db.junk.find().sort({ "rank": -1 }).limit(2);
while (cursor.hasNext()) {
    var doc = cursor.next();
    printjson(doc);
    // A new rank value means the ids kept for the previous rank are no longer needed
    if (doc.rank != lastSeen) {
        seenIds = [];
        lastSeen = doc.rank;
    }
    seenIds.push(doc._id);
}
{ "_id": 4, "rank": 3 }
{ "_id": 8, "rank": 3 }
On the next iteration you want documents whose "rank" is less than or equal to the lastSeen score, excluding those already seen. You do this with the $nin operator:
var cursor = db.junk.find(
    { "_id": { "$nin": seenIds }, "rank": { "$lte": lastSeen } }
).sort({ "rank": -1 }).limit(2);
while (cursor.hasNext()) {
    var doc = cursor.next();
    printjson(doc);
    // A new rank value means the ids kept for the previous rank are no longer needed
    if (doc.rank != lastSeen) {
        seenIds = [];
        lastSeen = doc.rank;
    }
    seenIds.push(doc._id);
}
{ "_id": 1, "rank": 3 }
{ "_id": 3, "rank": 2 }
How many "seenIds" you actually hold on to depends on how "granular" your results are, that is, how often the sort value changes. In this case you can check whether the current "rank" score differs from the lastSeen value and, if so, discard the present seenIds content so it does not grow too much.
Those are the basic concepts of "forward paging" for you to practice and learn.
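The bookkeeping for a non-unique sort key can be sketched in Python as two small pure functions, one that builds the filter for the next page and one that updates the paging state after a page has been read. The names `rank_page_filter` and `advance` are hypothetical, and the sketch assumes pages sorted by `rank` in descending order as in the example above.

```python
def rank_page_filter(last_rank, seen_ids):
    """Filter for the next page when sorting by rank in descending order."""
    if last_rank is None:
        return {}  # first page: no restriction yet
    return {"_id": {"$nin": list(seen_ids)}, "rank": {"$lte": last_rank}}


def advance(page, last_rank, seen_ids):
    """Return the (last_rank, seen_ids) state after consuming one page of documents."""
    seen = list(seen_ids)
    for doc in page:
        if doc["rank"] != last_rank:
            # Rank changed: ids stored for the previous rank can be dropped,
            # because the $lte condition already excludes those documents.
            seen = []
            last_rank = doc["rank"]
        seen.append(doc["_id"])
    return last_rank, seen
```

With PyMongo, the filter would be passed to something like `collection.find(rank_page_filter(last_rank, seen_ids)).sort("rank", -1).limit(2)`, and the returned documents fed back into `advance`.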

The simplest way to implement pagination in MongoDB
// Pagination
const page = parseInt(req.query.page, 10) || 1;
const limit = parseInt(req.query.limit, 10) || 25;
const startIndex = (page - 1) * limit;
const endIndex = page * limit;
query = query.skip(startIndex).limit(limit);
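A rough PyMongo counterpart of the snippet above might look like the following. `page_window` and `fetch_page` are hypothetical helper names, `collection` is assumed to be a PyMongo `Collection`, and the sort on `_id` is an addition so that consecutive pages are cut from a stable order. The server still has to walk over the skipped documents, so this stays simple rather than fast.

```python
def page_window(page=1, limit=25):
    """Translate a 1-based page number into (skip, limit) values."""
    page = max(int(page), 1)    # treat page 0 or negative pages as page 1
    limit = max(int(limit), 1)  # always ask for at least one document
    return (page - 1) * limit, limit


def fetch_page(collection, page=1, limit=25):
    """Return one page of documents; `collection` is assumed to be a pymongo Collection."""
    skip, limit = page_window(page, limit)
    return list(collection.find().sort("_id", 1).skip(skip).limit(limit))
```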

Related

How to use the sum of two fields when searching for a document in MongoDB?

I have a collection of accounts and I am trying to find an account in which the targetAmount >= totalAmount + N
{
"_id": {
"$oid": "60d097b761484f6ad65b5305"
},
"targetAmount": 100,
"totalAmount": 0,
"highPriority": false,
"lastTimeUsed": 1624283088
}
Now I just select all accounts, iterate over them and check if the condition is met. But I'm trying to do this all in a query:
amount = 10
tasks = ProviderAccountTaskModel.objects(
__raw__={
'targetAmount': {
'$gte': {'$add': ['totalAmount', amount]}
}
}
).order_by('-highPriority', 'lastTimeUsed')
I have also tried using the $sum, but both options do not work.
Can't it be used when searching, or am I just going the wrong way?
You can use $where. Be aware that it is fairly slow, because it has to execute JavaScript for every document, so combine it with indexed query conditions if you can.
db.getCollection('YourCollectionName').find( { $where: function() { return this.targetAmount >= (this.totalAmount + 10) } })
A more compact way of writing it:
db.getCollection('YourCollectionName').find( { $where: "this.targetAmount >= this.totalAmount + 10" })
You have to use aggregation expressions ($expr) here, since a plain find filter cannot reference other fields of the same document or do arithmetic on them.
Below is the aggregation command you are looking for. Convert it into the equivalent motoengine command.
db.collection.aggregate([
{
"$match": {
"$expr": {
"$gte": [
"$targetAmount",
{
"$sum": [
"$totalAmount",
10
]
},
],
},
},
},
{
"$sort": {
"highPriority": -1,
"lastTimeUsed": 1,
},
},
])
Mongo Playground Sample Execution
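For PyMongo, the same condition can be sketched as a plain filter document. `$expr` requires MongoDB 3.6 or newer, and field paths inside it need the `$` prefix, which the attempt in the question lacks. The helper names below are hypothetical, `collection` is assumed to be a PyMongo `Collection`, and whether an ODM's raw-query hook passes `$expr` through unchanged is an assumption you would need to verify.

```python
def target_reachable_filter(amount):
    """Match documents where targetAmount >= totalAmount + amount."""
    return {
        "$expr": {
            "$gte": ["$targetAmount", {"$add": ["$totalAmount", amount]}]
        }
    }


def find_accounts(collection, amount):
    """Run the filter with the sort order from the question (pymongo Collection assumed)."""
    cursor = collection.find(target_reachable_filter(amount))
    return cursor.sort([("highPriority", -1), ("lastTimeUsed", 1)])
```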

MongoDB better way of searching through collections for key

I'm trying to see if my string is in a specific key in my whole collection.
Example of collection:
_id: "XXX"
hwid: "XX1"
_id: "XXX"
hwid: "XX2"
_id: "XXX"
hwid: "XX3"
I want to search through all of the hwid keys in my collection, see if my string is in one of them, and return True or False. I thought a for loop would be a good way to do this, but it returns False every time.
client = pymongo.MongoClient("connection works")
db = client.monza
collection = db.whitelist
def search_hwid(hwid):
    for x in collection.find({}, {"hwid"}):
        if hwid == x['hwid']:
            whitelisted = True
            return whitelisted
            break
        else:
            whitelisted = False
            return whitelisted
It returns False every time, even when I am 100% sure that the hwid is in my collection.
// Compare the results of find() with different arguments below. For your query,
// pass the hwid condition as the first parameter (the filter) instead of an empty {}
> db.test3.find();
{ "_id" : 1, "hwid" : "XX1" }
{ "_id" : 2, "hwid" : "XX2" }
{ "_id" : 3, "hwid" : "XX3" }
> db.test3.find({},{hwid:"XX2"});
{ "_id" : 1, "hwid" : "XX1" }
{ "_id" : 2, "hwid" : "XX2" }
{ "_id" : 3, "hwid" : "XX3" }
> db.test3.find({hwid:"XX2"});
{ "_id" : 2, "hwid" : "XX2" }
> db.test3.find({hwid:"XX3"});
{ "_id" : 3, "hwid" : "XX3" }
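In PyMongo the same idea can be sketched with `find_one`, which lets the server do the matching. Assuming the `else` in the question's loop belongs to the `if`, that loop returns while looking at the very first document, so only one document is ever compared. `is_whitelisted` is a hypothetical name and `collection` is assumed to be a PyMongo `Collection`.

```python
def is_whitelisted(collection, hwid):
    """Return True if any document has exactly this hwid (pymongo Collection assumed)."""
    # find_one returns None when nothing matches; only _id is projected
    # because the document's contents are not needed for a yes/no answer.
    return collection.find_one({"hwid": hwid}, {"_id": 1}) is not None
```

An index on `hwid` would keep this lookup cheap on a large collection.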

How to find the count of the number of documents in mongodb using pymongo aggregation?

I'm trying to find the max value of a field from a number of documents and want the output to not only reflect the max value of the field but also the total count of documents that the aggregate query will retrieve.
I'm able to retrieve the "wait" field with the max value that I want with the query below, but I am stuck on how to get the count of all the documents that satisfy it (the $match stage).
db = mongo_client[_MONGO_COLLECTION]
cursor = db.aggregate(
[
{"$match": { "owner": { "$exists": False}}},
{
"$project": {
"wait" : {
"$divide": [{"$subtract": [datetime.now(), "$creationDate"]}, 1000],
}
}
},
{
"$sort" : {
"wait": -1
}
}, {"$limit" : 1}
])
for x in cursor:
    print(x)
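The cursor returned by aggregate() has no count() method, so collect the results into a list and take its length. Note that this counts the documents the pipeline returns, which is 1 after the $limit stage:
results = list(cursor)
print(len(results))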
or
you can add a $count stage to the pipeline as below:
{
"$count": "count"  # the name of the count field
}
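To get the maximum wait and the number of matched documents from one query, a `$group` stage with a constant `_id` can replace the `$sort` and `$limit` stages. The following is a sketch rather than a drop-in answer: `max_wait_and_count_pipeline`, `maxWait` and `count` are hypothetical names, and a timezone-aware UTC timestamp is used on the assumption that `creationDate` is stored as a regular BSON date.

```python
from datetime import datetime, timezone


def max_wait_and_count_pipeline(now=None):
    """Pipeline returning the longest wait in seconds and the number of matched documents."""
    now = now or datetime.now(timezone.utc)
    return [
        {"$match": {"owner": {"$exists": False}}},
        # $subtract on two dates yields milliseconds, hence the division by 1000
        {"$project": {"wait": {"$divide": [{"$subtract": [now, "$creationDate"]}, 1000]}}},
        # a single group (_id None) folds all matched documents into one result
        {"$group": {"_id": None, "maxWait": {"$max": "$wait"}, "count": {"$sum": 1}}},
    ]
```

Passing the result to `aggregate()` should yield a single document holding `maxWait` and `count`, or nothing at all when no document matches.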

Pymongo count elements collected out of all documents with key

I want to count all elements which occur in somekey in an MongoDB collection.
The current code looks at all elements in somekey as a whole.
from pymongo import Connection
con = Connection()
db = con.database
collection = db.collection
from bson.code import Code
reducer = Code("""
function(obj, prev){
prev.count++;
}
""")
from bson.son import SON
results = collection.group(key={"somekey":1}, condition={}, initial={"count": 0}, reduce=reducer)
for doc in results:
print doc
However, I want that it counts all elements which occur in any document with somekey.
Here is an anticipated example. The MongoDB has the following documents.
{ "_id" : 1, "somekey" : ["AB", "CD"], "someotherkey" : "X" }
{ "_id" : 2, "somekey" : ["AB", "XY"], "someotherkey" : "Y" }
The result should be a list ordered by count:
count: 2 "AB"
count: 1 "CD"
count: 1 "XY"
The .group() method will not work on elements that are arrays. The closest similar thing would be mapReduce, where you have more control over the emitted keys.
But really the better fit here is the aggregation framework. It is implemented in native code and does not use JavaScript interpreter processing as the other methods do.
You won't be getting an "ordered list" from MongoDB responses, but you get a similar document result:
results = collection.aggregate([
# Unwind the array
{ "$unwind": "$somekey" },
# Group the results and count
{ "$group": {
"_id": "$somekey",
"count": { "$sum": 1 }
}}
])
Gives you something like:
{ "_id": "AB", "count": 2 }
{ "_id": "CD", "count": 1 }
{ "_id": "XY", "count": 1 }
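If the ordering by count matters, a `$sort` stage can be appended to the pipeline. The sketch below builds such a pipeline and, purely for illustration, computes the same counts in plain Python on a list of documents. Both function names are hypothetical, and the secondary sort on `_id` is an addition to make ties deterministic.

```python
from collections import Counter


def element_count_pipeline(field="somekey"):
    """Aggregation pipeline counting each array element of `field`, most frequent first."""
    return [
        {"$unwind": "$" + field},                                # one document per array element
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},  # count per distinct element
        {"$sort": {"count": -1, "_id": 1}},                      # highest count first, ties by value
    ]


def count_elements_locally(docs, field="somekey"):
    """Plain-Python equivalent, useful for checking what the pipeline should return."""
    counts = Counter(value for doc in docs for value in doc.get(field, []))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

With PyMongo this would be run as `collection.aggregate(element_count_pipeline())`.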

Various queries - MongoDB

This is my table:
unicorns = {'name':'George',
'actions':[{'action':'jump', 'time':123123},
{'action':'run', 'time':345345},
...]}
How can I perform the following queries?
Grab the time of all actions of all unicorns where action=='jump' ?
Grab all actions of all unicorns where time is equal?
e.g. {'action':'jump', 'time':123} and {'action':'stomp', 'time':123}
Help would be amazing =)
For the second query, you need to use MapReduce, which can get a bit hairy. This will work:
map = function() {
for (var i = 0, j = this.actions.length; i < j; i++) {
emit(this.actions[i].time, this.actions[i].action);
}
}
reduce = function(key, value_array) {
var array = [];
for (var i = 0, j = value_array.length; i < j; i++) {
if (value_array[i]['actions']) {
array = array.concat(value_array[i]['actions']);
} else {
array.push(value_array[i]);
}
}
return ({ actions: array });
}
res = db.test.mapReduce(map, reduce)
db[res.result].find()
This would return something like this, where the _id keys are your timestamps:
{ "_id" : 123, "value" : { "actions" : [ "jump" ] } }
{ "_id" : 125, "value" : { "actions" : [ "neigh", "canter" ] } }
{ "_id" : 127, "value" : { "actions" : [ "whinny" ] } }
Unfortunately, mongo doesn't currently support returning an array from a reduce function, thus necessitating the silly {actions: [...]} syntax.
Use dot-separated notation:
db.unicorns.find({'actions.action' : 'jump'})
Similarly for times:
db.unicorns.find({'actions.time' : 123})
Edit: if you want to group the results by time, you'll have to use MapReduce.
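On servers that have the aggregation framework, both queries can also be sketched as pipelines. A dot-notation find returns whole unicorn documents rather than only the matching array elements, which is why the sketch unwinds `actions` first. The function names are hypothetical, and the last `$match` stage, which keeps only timestamps shared by more than one action, is one reading of "where time is equal" and can be dropped.

```python
def jump_times_pipeline(action="jump"):
    """Times of all actions with the given name, one result per matching action."""
    return [
        {"$unwind": "$actions"},                  # one document per action
        {"$match": {"actions.action": action}},
        {"$project": {"_id": 0, "name": 1, "time": "$actions.time"}},
    ]


def actions_by_time_pipeline():
    """Group action names by timestamp across all unicorns."""
    return [
        {"$unwind": "$actions"},
        {"$group": {"_id": "$actions.time", "actions": {"$push": "$actions.action"}}},
        # keep only timestamps that have at least two actions (array index 1 exists)
        {"$match": {"actions.1": {"$exists": True}}},
    ]
```

With PyMongo these would be run as `db.unicorns.aggregate(jump_times_pipeline())` and `db.unicorns.aggregate(actions_by_time_pipeline())`.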
