MongoDB query in pymongo with sort feature - python

I am new to MongoDB and I am trying to create a query.
I have a list, for example: mylist = [a,b,c,d,e]
Each document in my dataset has a key holding a similar list: mydatalist = [b,d,g,e]
I want to create a query that returns all documents whose mydatalist contains at least one element from mylist.
What I have done so far:
query = {'mydatalist': {'$in': mylist}}
selector = {'_id': 1, 'name': 1}
mydata = collection.find(query, selector)
That works perfectly. The only thing I cannot do is sort the results by the number of mylist elements each document has in its mydatalist. Is there any way to do this in the query, or do I have to do it manually on the cursor afterwards?
Update with an example:
mylist = [a,b,c,d,e,f,g]
#data from collection
data1[mydatalist] = [a,b,k,l]   # 2 items from mylist
data2[mydatalist] = [b,c,d,e,m] # 4 items from mylist
data3[mydatalist] = [a,u,i]     # 1 item from mylist
So, I want the results to be sorted as data2 -> data1 -> data3

So you want the results sorted by the number of matches to your array selection. That is not a simple thing for a find, but it can be done with the aggregation framework:
db.collection.aggregate([
// Match your selection to minimise the documents processed
{$match: {list: {$in: ['a','b','c','d','e','f','g']}}},
// Projection trick, keep the original document
{$project: {_id: {_id: "$_id", list: "$list" }, list: 1}},
// Unwind the array
{$unwind: "$list"},
// Match only the elements you want
{$match: {list: {$in: ['a','b','c','d','e','f','g']}}},
// Sum up the count of matches
{$group: {_id: "$_id", count: {$sum: 1}}},
// Order by count descending
{$sort: {count: -1 }},
// Clean up the response, however you want
{$project: { _id: "$_id._id", list: "$_id.list", count: 1 }}
])
And there you have your documents in the order you want:
{
"result" : [
{
"_id" : ObjectId("5305bc2dff79d25620079105"),
"count" : 4,
"list" : ["b","c","d","e","m"]
},
{
"_id" : ObjectId("5305bbfbff79d25620079104"),
"count" : 2,
"list" : ["a","b","k","l"]
},
{
"_id" : ObjectId("5305bc41ff79d25620079106"),
"count" : 1,
"list" : ["a","u","i"]
}
],
"ok" : 1
}
Also, it is probably worth mentioning that aggregate in all recent driver versions returns a cursor, just as find does. Currently this is emulated by the driver, but as of MongoDB 2.6 it will be a true server-side cursor. This makes aggregate a very valid "swap-in" replacement for find in your existing calls.
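For reference, here is a minimal pymongo sketch of the same pipeline. The field name list and the match list are taken from the answer above; the connection details and the database/collection names are assumptions to keep the snippet self-contained.

from pymongo import MongoClient

# Assumed connection details; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

mylist = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

pipeline = [
    # Keep only documents with at least one matching element
    {"$match": {"list": {"$in": mylist}}},
    # Stash the original _id and list inside _id, keep the array for unwinding
    {"$project": {"_id": {"_id": "$_id", "list": "$list"}, "list": 1}},
    # One document per array element
    {"$unwind": "$list"},
    # Keep only the elements that appear in mylist
    {"$match": {"list": {"$in": mylist}}},
    # Count the matching elements per original document
    {"$group": {"_id": "$_id", "count": {"$sum": 1}}},
    # Most matches first
    {"$sort": {"count": -1}},
    # Restore the original shape
    {"$project": {"_id": "$_id._id", "list": "$_id.list", "count": 1}},
]

for doc in collection.aggregate(pipeline):
    print(doc["_id"], doc["count"], doc["list"])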

Related

How to extract info from the dictionaries within a list?

I'm new to Python and trying to gather data from a JSON file that consists of a list of dictionaries, as follows. How do I extract the "count" data from this? (Without using a list comprehension.)
{
"stats":[
{
"name":"Ron",
"count":98
},
{
"name":"Sandy",
"count":89
},
{
"name":"Sam",
"count":77
}
]
}
Index the list using the stats key, then iterate through it:
data = {
"stats":[
{
"name":"Ron",
"count":98
},
{
"name":"Sandy",
"count":89
},
{
"name":"Sam",
"count":77
}
]
}
for stat in data['stats']:
    count = stat['count']  # each count value is available here for further processing
Consider the dictionary data stored in a variable source.
source = {
"stats":[
{
"name":"Ron",
"count":98
},
{
"name":"Sandy",
"count":89
},
{
"name":"Sam",
"count":77
}
]
}
Now to access the count field inside of "stats" we use indexing.
For example, to view the count of "Ron" you would write:
print(source['stats'][0]['count'])
This will result in 98
Similarly, for "Sam" it will be
print(source['stats'][2]['count'])
And the result will be 77
In short, we first index the key of the dictionary, then the array position, and then provide the field of the array element whose data you want.
I hope it helped.
Simply append all those values to do calculations:
count_values = []
for dic in data['stats']:
    count_values.append(dic['count'])
# Do anything with count_values
print(count_values)
According to the Zen of Python, "Simple is better than complex."
Thus, list comprehension is actually the best way to extract the information you need and still have it available for further processing (in the form of a list of count values).
d = <your dict-list>
count_data_list = [ x['count'] for x in d['stats'] ]
If not, and your intention is to process the "count" data as it is extracted, I'd suggest a for-loop:
d = <your dict-list>
for x in d['stats']:
    count_data = x['count']
    <process "count_data">
Using a map function will do that in a single line:
>>> list(map(lambda x: x['count'], data['stats']))
[98, 89, 77]

How to use PyMongo find() to search nested array attribute?

Using PyMongo, how would one find/search for the documents where a nested array JSON object matches a given string?
Given the following 2 Product JSON documents in a MongoDB collection..
[{
"_id" : ObjectId("5be1a1b2aa21bb3ceac339b0"),
"id" : "1",
"prod_attr" : [
{
"name" : "Branded X 1 Sneaker"
},
{
"hierarchy" : {
"dept" : "10",
"class" : "101",
"subclass" : "1011"
}
}
]
},
{
"_id" : ObjectId("7be1a1b2aa21bb3ceac339xx"),
"id" : "2",
"prod_attr" : [
{
"name" : "Branded Y 2 Sneaker"
},
{
"hierarchy" : {
"dept" : "10",
"class" : "101",
"subclass" : "2022"
}
}
]
}
]
I would like to
1. return all documents where prod_attr.hierarchy.subclass = "2022"
2. return all documents where prod_attr.name contains "Sneaker"
I appreciate the JSON could be structured differently, unfortunately that is not within my control to change.
1. Return all documents where prod_attr.hierarchy.subclass = "2022"
Based on the Query an Array of Embedded Documents documentation of MongoDB you can use dot notation concatenating the name of the array field (prod_attr), with a dot (.) and the name of the field in the nested document (hierarchy.subclass):
collection.find({"prod_attr.hierarchy.subclass": "2022"})
2. Return all documents where prod_attr.name contains "Sneaker"
As before, you can use the dot notation to query a field of a nested element inside an array.
To perform the "contains" query you have to use the $regex operator:
collection.find({"prod_attr.name": {"$regex": "Sneaker"}})
Another option is to use the MongoDB Aggregation framework:
collection.aggregate([
{"$unwind": "$prod_attr"},
{"$match": {"prod_attr.hierarchy.subclass": "2022"}}
])
The $unwind operator creates a new document for each element of the prod_attr array, so you end up with plain nested documents and no array (check the documentation for details).
The next step is the $match operator, which actually performs the query on the nested objects.
This is a simple example, but playing with the aggregation operators gives you a lot of flexibility.
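For completeness, here is a runnable sketch of the aggregation approach, assuming a local mongod and that the two example documents above live in a products collection (the connection, database and collection names are assumptions):

from pymongo import MongoClient

# Assumed connection details; adjust to your deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["products"]

pipeline = [
    # One output document per element of prod_attr
    {"$unwind": "$prod_attr"},
    # Keep only the elements carrying the wanted subclass
    {"$match": {"prod_attr.hierarchy.subclass": "2022"}},
]

for doc in collection.aggregate(pipeline):
    print(doc["id"], doc["prod_attr"])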

Efficiently query missing integers in a range on a field?

I have a database for a backup service I'm writing to back up Yahoo! Groups. It incrementally retrieves messages, which have contiguous numeric ids stored in a 'message_id' field. So, if the last message on the service is message number 10000, then once the backup is complete, the database should contain 10000 documents, and the sorted 'message_id' values should be equivalent to range(1, 10000+1).
I'd like to write a query yielding the missing message ids. So if I have 9995 documents in the database, and messages 10, 15, 49, 99, and 1043 are missing, it should return [10, 15, 49, 99, 1043].
I've done the following, getting just the ids from the database and computing a set difference in my app code:
def missing_message_ids(self):
    """Return the set of the ids of all missing messages."""
    latest = self.get_latest_message()
    ids = set(range(1, latest['_id'] + 1))
    present_ids = set(doc['_id'] for doc in self.db.messages.find({}, {'_id': 1}))
    return ids - present_ids
This is fine for my purposes, but it seems like it might get too slow for a vast number of messages. This is more for curiosity's sake than a real performance requirement: Is there any more efficient way to do this, perhaps entirely on the database engine?
In the SQL world one could use a CTE (common table expression) for that; in MongoDB we can use aggregation with $lookup as a kind of CTE.
Given this data structure:
{
"_id" : ObjectId("575deea531dcfb59af388e17"),
"mesId" : 4.0
}, {
"_id" : ObjectId("575deea531dcfb59af388e18"),
"mesId" : 6.0
}
With "mesId" : 5.0 missing, we can use this aggregation query, which projects each document's next expected id and joins on it. The limitation here is that when two or more consecutive messages are missing, only the first missing id in each gap is detected, but this could be extended by projecting the next id and doing another $lookup.
var project = {
$project : {
_id : 0,
mesId : 1,
nextId : {
$sum : ["$mesId", 1]
}
}
}
var lookup = {
$lookup : {
from : "claudiu",
localField : "nextId",
foreignField : "mesId",
as : "missing"
}
}
var match = {
$match : {
missing : []
}
}
db.claudiu.aggregate([project, lookup, match])
and output:
{
"mesId" : 4.0,
"nextId" : 5.0,
"missing" : []
}
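Here is the same pipeline expressed in pymongo, as a sketch; the collection name ("claudiu") and field name ("mesId") come from the answer above, while the connection details are assumptions. Note that, as written, the highest existing id will also be reported, since its successor does not exist yet.

from pymongo import MongoClient

# Assumed connection details; "claudiu" and "mesId" come from the answer above.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["claudiu"]

pipeline = [
    # Compute the id each document expects to be followed by
    {"$project": {"_id": 0, "mesId": 1, "nextId": {"$sum": ["$mesId", 1]}}},
    # Self-join: look for a document carrying that next id
    {"$lookup": {
        "from": "claudiu",
        "localField": "nextId",
        "foreignField": "mesId",
        "as": "missing",
    }},
    # Keep only documents whose successor was not found
    {"$match": {"missing": []}},
]

for gap in coll.aggregate(pipeline):
    print("missing id after", gap["mesId"], "->", gap["nextId"])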

Reach to array subdocument in mongodb

How can I reach the last element of the scores array, the one with score value 64.8...?
I tried to use the $pull operator, but I have 200 documents of this form, so I can't rely on the exact value of the last element of each scores array.
{
"_id" : 198,
"name" : "Timothy Harrod",
"scores" : [
{
"type" : "exam",
"score" : 11.9075674046519
},
{
"type" : "quiz",
"score" : 20.51879961777022
},
{
"type" : "homework",
"score" : 55.85952928204192
},
{
"type" : "homework",
"score" : 64.85650354990375
} ]
}
The easiest way would be to put that document into a Python dict if it's in string format:
import json
doc_json = json.loads(doc_string)
Then to access the LAST element of the scores array, just do:
last_element = doc_json['scores'][-1]
The ['scores'] tells us to get the 'scores' array from our doc_json, and the [-1] tells us to access the last element of this 'scores' array. From there, if you wanted the score from the last_element, you would just do:
last_score = last_element['score']
You could have also gotten it straight by doing:
last_score = doc_json['scores'][-1]['score']
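If the goal is actually to remove that last array element on the server (the question mentions $pull), one alternative worth noting, not part of the answer above, is the $pop update operator, which drops the last element of an array without knowing its value. A pymongo sketch, with assumed connection and collection names:

from pymongo import MongoClient

# Assumed connection, database and collection names; adjust to your data.
client = MongoClient("mongodb://localhost:27017")
students = client["school"]["students"]

# $pop with 1 removes the last element of the array (-1 would remove the first);
# applied here to every document in the collection.
result = students.update_many({}, {"$pop": {"scores": 1}})
print(result.modified_count, "documents updated")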

MongoDb: $sort by $in

I am running a mongodb find query with an $in operator:
collection.find({name: {$in: [name1, name2, ...]}})
I would like the results to be sorted in the same order as my name array: [name1, name2, ...]. How do I achieve this?
Note: I am accessing MongoDb through pymongo, but I don't think that's of any importance.
EDIT: as it's impossible to achieve this natively in MongoDb, I ended up using a typical Python solution:
names = [name1, name2, ...]
results = list(collection.find({"name": {"$in": names}}))
results.sort(key=lambda x: names.index(x["name"]))
You can achieve this with aggregation framework starting with upcoming version 3.4 (Nov 2016).
Assuming the order you want is given by the array order = ["David", "Charlie", "Tess"], you do it via this pipeline:
m = { "$match" : { "name" : { "$in" : order } } };
a = { "$addFields" : { "__order" : { "$indexOfArray" : [ order, "$name" ] } } };
s = { "$sort" : { "__order" : 1 } };
db.collection.aggregate( m, a, s );
The "$addFields" stage is new in 3.4 and it allows you to "$project" new fields to existing documents without knowing all the other existing fields. The new "$indexOfArray" expression returns position of particular element in a given array.
The result of this aggregation will be documents that match your condition, in the order specified by the input array order, and the documents will include all original fields plus an additional field called __order.
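In pymongo the same pipeline looks like this (a sketch, assuming MongoDB 3.4+ and the order list from above):

order = ["David", "Charlie", "Tess"]

pipeline = [
    {"$match": {"name": {"$in": order}}},
    # __order holds each document's position in the input list
    {"$addFields": {"__order": {"$indexOfArray": [order, "$name"]}}},
    {"$sort": {"__order": 1}},
]
docs = list(collection.aggregate(pipeline))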
Impossible with a plain find: the $in operator only checks for presence, and the list is treated as a set.
Options:
1. Split it into several queries, one per name1 ... nameN, or filter the result the same way. More names means more queries.
2. Use itertools groupby/ifilter. In that case, add a "sorting precedence" flag to every document, matching name1 to PREC1, name2 to PREC2, ..., then sort by PREC and group by PREC.
If your collection has an index on the "name" field, option 1 is better.
If it does not have the index, or you cannot create it due to a high write/read ratio, option 2 is for you.
Vitaly is correct that it's impossible to do that with find, but it can be achieved with an aggregation pipeline:
db.collection.aggregate([
{ $match: { name: { $in: [name1, name2, /* ... */] } } },
{
$project: {
name: 1,
name1: { $eq: ['name1', '$name'] },
name2: { $eq: ['name2', '$name'] },
},
},
{ $sort: { name1: 1, name2: 1 } },
])
tested on 2.6.5
I hope this will hint other people in the right direction.
