Count items through documents on mongo query - python

The documents in my collection look like this:
{
"_id":ObjectId("57558ee01807ce2f774569cc"),
"description": "Lorem Ipnsun ....",
"results":[
{
"name":"Alica James",
"gender":"male"
},
{
"name":"Alica James",
"gender":"female"
},
{
"name":"Alica James",
"gender":"female"
}
]
},
{
"_id":ObjectId("57558ee01807ce2f774569c6"),
"description": "Lorem Ipnsun ....",
"results":[
{
"name":"Van Ban",
"gender":"unclear"
}
]
},
{
"_id":ObjectId("57558ee01807ce2f774569c7"),
"description": "Lorem Ipnsun ....",
"results":[]
}
As you can see, the results key can be empty or can contain values. Each entry in it has a name field and a gender field that can be male, female, or unclear.
I want to go through every document in my collection and work out the gender distribution for each name.
So for the name "Alica James" I want my query to return:
female_numbers_for_document = 2
male_numbers_for_document = 1
unclear_numbers_for_document = 0
For Van Ban:
female_numbers_for_document = 0
male_numbers_for_document = 0
unclear_numbers_for_document = 1
In Python, I first fetched all the documents in the collection, then iterated over each document in the cursor and declared some variables for each gender. This doesn't work, since it only takes the first value and doesn't go through results. The code looks like this:
def find_gender_distribution(self):
    cursor = self.mongo.db[self.collection_name].find()
    for document in cursor:
        female_numbers_for_document = document.find({"results.gender": "female"}).count()
        male_numbers_for_document = document.find({"results.gender": "male"}).count()
        unclear_numbers_for_document = document.find({"results.gender": "unclear"}).count()
How do I count how many entries inside results have the same gender? Please help.

You are using the wrong method to do this. You need to use the .aggregate() method which gives access to the aggregation pipeline.
unwind1 = {"$unwind": "$results"}
group1 = {
    "$group": {
        "_id": {"name": "$results.name", "gender": "$results.gender"},
        "count": {"$sum": 1}
    }
}
group2 = {
    "$group": {
        "_id": "$_id.name",
        "nmale": {
            "$sum": {"$cond": [
                {"$eq": ["$_id.gender", "male"]},
                "$count",
                0
            ]}
        },
        "nfemale": {
            "$sum": {"$cond": [
                {"$eq": ["$_id.gender", "female"]},
                "$count",
                0
            ]}
        },
        "nunclear": {
            "$sum": {"$cond": [
                # count anything that is neither "male" nor "female"
                {"$and": [
                    {"$ne": ["$_id.gender", "male"]},
                    {"$ne": ["$_id.gender", "female"]}
                ]},
                "$count",
                0
            ]}
        }
    }
}
pipeline = [unwind1, group1, group2]
def find_gender_distribution(self):
    collection = self.mongo.db[self.collection_name]
    cursor = collection.aggregate(pipeline)
    for document in cursor:
        print(document)  # or do something
Printing each document from the cursor yields something like:
{ "_id" : "Alica James", "nmale" : 1, "nfemale" : 2, "nunclear" : 0 }
{ "_id" : "Van Ban", "nmale" : 0, "nfemale" : 0, "nunclear" : 1 }
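For comparison, the same per-name tally can be sketched in plain Python, which may help when checking what the pipeline should return. This is only an illustration on in-memory sample data copied from the question (ObjectIds omitted); the function name gender_distribution is made up for this sketch and is not part of PyMongo.

```python
from collections import Counter, defaultdict

# Sample documents mirroring the question (ObjectIds omitted).
documents = [
    {"results": [
        {"name": "Alica James", "gender": "male"},
        {"name": "Alica James", "gender": "female"},
        {"name": "Alica James", "gender": "female"},
    ]},
    {"results": [{"name": "Van Ban", "gender": "unclear"}]},
    {"results": []},
]


def gender_distribution(docs):
    """Count genders per name across every entry of every results array."""
    counts = defaultdict(Counter)
    for doc in docs:
        # An empty or missing results array simply contributes nothing,
        # much like $unwind dropping documents with an empty array.
        for entry in doc.get("results", []):
            counts[entry["name"]][entry["gender"]] += 1
    return {name: dict(counter) for name, counter in counts.items()}


distribution = gender_distribution(documents)
print(distribution)
```

On large collections the aggregation pipeline is usually preferable, since the counting then happens on the server instead of after transferring every document.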

Select mongo documents whose subdoc array has duplicate field?

With a schema like this
{
"doc1": {
"items": [
{
"item_id": 1
},
{
"item_id": 2
},
{
"item_id": 3
},
]
},
"doc2": {
"items": [
{
"item_id": 1
},
{
"item_id": 2
},
{
"item_id": 1
},
]
}
}
I want to query for documents that contain a duplicate item in their items array field. A duplicate means items with the same item_id field.
So the result for the example above should return doc2 only, because it has two items with the same item_id
Something like this?
qry = {
"items": {
"$size": {
"$ne": {
"items.unique_count" # obviously this doesn't exist, not sure how to do it
}
}
}
}
result = MyDocument.find(qry)
One option, similar to @rickhg12hs's and your suggestions, is:
db.collection.aggregate([
{$match: {
$expr: {
$ne: [
{$size: "$items"},
{$size: {
$reduce: {
input: "$items",
initialValue: [],
in: {$setUnion: ["$$value", ["$$this.item_id"]]}
}
}
}
]
}
}
}
])
See how it works on the playground example
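The idea behind the pipeline is to compare the length of items with the number of distinct item_id values. A minimal plain-Python sketch of that same comparison, using the two sample documents from the question (the helper name has_duplicate_item_id is hypothetical):

```python
def has_duplicate_item_id(doc):
    """Return True when two or more entries in doc["items"] share an item_id."""
    item_ids = [item["item_id"] for item in doc.get("items", [])]
    # A set keeps one copy of each value, like $setUnion in the pipeline.
    return len(item_ids) != len(set(item_ids))


docs = {
    "doc1": {"items": [{"item_id": 1}, {"item_id": 2}, {"item_id": 3}]},
    "doc2": {"items": [{"item_id": 1}, {"item_id": 2}, {"item_id": 1}]},
}

duplicated = [name for name, doc in docs.items() if has_duplicate_item_id(doc)]
print(duplicated)
```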

Removing a list entry in a list in pyMongo

I have a database collection that has objects like this:
{
"_id": ObjectId("something"),
"name_lower": "total",
"name": "Total",
"mounts": [
[
"mount1",
"instance1"
],
[
"mount2",
"instance1"
],
[
"mount1",
"instance2"
],
[
"mount2",
"instance2"
]
]
}
Say I want to remove every mount that has the instance instance2. How would I go about doing that? I have been searching for quite a while.
You can do something like this
[
{
$unwind: "$mounts"
},
{
$match: {
"mounts": {
$ne: "instance2"
}
}
},
{
$group: {
_id: "$_id",
name: {
$first: "$name"
},
mounts: {
$push: "$mounts"
}
}
}
]
Working Mongo playground
This answer is based on @varman's answer but is more Pythonic and efficient.
The first stage should be a $match that filters out documents that don't need to be updated.
Since the mounts key holds a nested array, we have to $unwind it so that individual elements can be removed.
We then apply a $match again to drop the elements that have to be removed.
Finally, we $group by the _id key so that the elements separated by $unwind in the previous stage are grouped back into a single document.
from pymongo import MongoClient
client = MongoClient("<URI-String>")
col = client["<DB-Name>"]["<Collection-Name>"]
count = 0
for doc in col.aggregate([
    {
        # keep only documents in which some mount contains "instance2";
        # a plain $ne here would not look inside the nested arrays
        "$match": {
            "mounts": {"$elemMatch": {"$elemMatch": {"$in": ["instance2"]}}}
        }
    },
    {
        "$unwind": "$mounts"
    },
    {
        # drop the unwound mounts that contain "instance2"
        "$match": {
            "mounts": {"$ne": "instance2"}
        }
    },
    {
        "$group": {
            "_id": "$_id",
            "newMounts": {
                "$push": "$mounts"
            }
        }
    },
]):
    # print(doc)
    col.update_one({
        "_id": doc["_id"]
    }, {
        "$set": {
            "mounts": doc["newMounts"]
        }
    })
    count += 1
    print("\r", count, end="")
print("\n\nDone!!!")
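To make the effect on a single document concrete, here is a small plain-Python sketch of the per-document transformation. It runs on the sample mounts list from the question; the helper name remove_instance is hypothetical.

```python
def remove_instance(mounts, instance):
    """Return the mounts whose inner list does not contain the given instance."""
    return [mount for mount in mounts if instance not in mount]


mounts = [
    ["mount1", "instance1"],
    ["mount2", "instance1"],
    ["mount1", "instance2"],
    ["mount2", "instance2"],
]

remaining = remove_instance(mounts, "instance2")
print(remaining)
```

One thing worth checking in the pipeline approach: a document whose mounts all contain instance2 has no rows left after the second $match, so $group emits nothing for it and it is not updated. The list comprehension above returns an empty list in that case, so computing the new list client-side is one way to cover it.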

How to filter nested array in python pymongo 3.7.2 for mongodb

I am using pymongo version 3.7.2 with python 3.6.8. I have documents in the following format in my database:
{"_id" : 1,
"main_array":[
{"subid":222,
"subarray":[{"name":"hari","status":1},{"name":"henry","status":1}]
},
{"subid":333,
"subarray":[{"name":"james","status":0},{"name":"jason","status":1}]
}]
},
{"_id" : 2,
"main_array":[
{"subid":222,
"subarray":[{"name":"alex","status":1},{"name":"anna","status":1}]
},
{"subid":333,
"subarray":[{"name":"bob","status":0},{"name":"bunny","status":1}]
}]
}
I need to get the objects with subid = 222 from all the documents in the collection. The required result should be as follows:
{"_id" : 1,
"main_array":[
{"subid":222,
"subarray":[{"name":"hari","status":1},{"name":"henry","status":1}]
}]
},
{"_id" : 2,
"main_array":[
{"subid":222,
"subarray":[{"name":"alex","status":1},{"name":"anna","status":1}]
}]
}
I tried the following code:
myclient = pymongo.MongoClient(<mongoclient url>)
mydb = myclient["test"]
mycol = mydb["user"]
subid = 222
_id = 1
x = mycol.find({"_id":_id},{"main_array":{"$elemMatch":{"subid":subid}}})
This gives the required result for one particular document, but I need it for all the documents. I tried the following query:
x = mycol.find({"main_array":{"$elemMatch":{"subid":subid}}})
But this time it returns every document in full. What did I miss?
Used as a query filter, $elemMatch selects the documents in which at least one array element satisfies the condition; it returns those documents whole and does not trim the array.
You should use the aggregation pipeline with $unwind and $match.
Basically, do:
db.collection.aggregate([{
$unwind: "$main_array"
},
{
$match: {
"main_array.subid": 222
}
}])
Note that after $unwind, main_array is a single object rather than an array, but you should be able to work with that.
Output of the above:
[
{
"_id": 1,
"main_array": {
"subarray": [
{
"name": "hari",
"status": 1
},
{
"name": "henry",
"status": 1
}
],
"subid": 222
}
},
{
"_id": 2,
"main_array": {
"subarray": [
{
"name": "alex",
"status": 1
},
{
"name": "anna",
"status": 1
}
],
"subid": 222
}
}
]
Fiddle: https://mongoplayground.net/p/-sg_d2h5wIJ
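If main_array should stay an array, as in the required result, one possible alternative is a $project stage with the $filter operator instead of $unwind. The sketch below only builds that stage as a Python dict (it would be passed to mycol.aggregate([filter_stage]), which is not executed here) and pairs it with a plain-Python equivalent so the expected shape can be checked on the sample data. The names filter_stage and keep_subid are made up for this illustration.

```python
subid = 222

# Aggregation stage that keeps only the matching elements of main_array.
filter_stage = {
    "$project": {
        "main_array": {
            "$filter": {
                "input": "$main_array",
                "as": "item",
                "cond": {"$eq": ["$$item.subid", subid]},
            }
        }
    }
}


def keep_subid(doc, wanted_subid):
    """Plain-Python equivalent of the $filter projection above."""
    return {
        "_id": doc["_id"],
        "main_array": [e for e in doc["main_array"] if e["subid"] == wanted_subid],
    }


sample = [
    {"_id": 1, "main_array": [
        {"subid": 222, "subarray": [{"name": "hari", "status": 1}, {"name": "henry", "status": 1}]},
        {"subid": 333, "subarray": [{"name": "james", "status": 0}, {"name": "jason", "status": 1}]},
    ]},
    {"_id": 2, "main_array": [
        {"subid": 222, "subarray": [{"name": "alex", "status": 1}, {"name": "anna", "status": 1}]},
        {"subid": 333, "subarray": [{"name": "bob", "status": 0}, {"name": "bunny", "status": 1}]},
    ]},
]

filtered = [keep_subid(doc, subid) for doc in sample]
print(filtered)
```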

mongodb query issue for constructing where clause

I am completely new to MongoDB. I have the query below:
jds=jd.aggregate( [
{
"$group": {
"_id": {"house_NAME":"$house_NAME"},
"count": { "$sum": 1 }
}
},
{ "$match": { "count": { "$gt": 0 } } }
] )
which returns the count of each house name present in the collection.
My collection looks something like this:
record_id  house_NAME  status
1          Thomas      Open
2          Panther     Close
3          Thomas      Close
I only want to count the documents whose status is "Open", so I want to add an "and" clause to the query above. I don't know exactly how to do it.
I am stuck on this. Any help will be greatly appreciated!
Thanks in advance!
You can add a $match stage at the start of the pipeline
jds=jd.aggregate([
{ "$match": { "status": "Open" }},
{ "$group": {
"_id": { "house_NAME": "$house_NAME" },
"count": { "$sum": 1 }
}},
{ "$match": { "count": { "$gt": 0 }}}
])
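To see what the filtered count should look like, the $match followed by $group can be mimicked in plain Python on the three sample rows from the question. This is only a sanity-check sketch on in-memory data; the variable names are made up.

```python
from collections import Counter

# The three sample rows from the question.
rows = [
    {"record_id": 1, "house_NAME": "Thomas", "status": "Open"},
    {"record_id": 2, "house_NAME": "Panther", "status": "Close"},
    {"record_id": 3, "house_NAME": "Thomas", "status": "Close"},
]

# Filter first (like $match), then count per house name (like $group with $sum).
open_counts = Counter(row["house_NAME"] for row in rows if row["status"] == "Open")
print(dict(open_counts))
```

Filtering before grouping matters: once $group has run, the status field is no longer present on the grouped documents, so the $match on status has to come first.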

How to Build a Keyword Histogram in MongoDB?

I am using MongoDB 3.4 and PyMongo. I have a set of keywords:
keywords = [ 'bar', 'foo', ..., 'zoo' ]
I also have a collection:
docs = { 'data' : ' ... bar foo ... ',
'data' : ' ... foo ... ',
'data' : ' ... zoo ... ' }
I am looking for a PyMongo aggregation query which is going to give me a dict:
{ 'bar' : 0, 'foo' : 2, ..., 'zoo' : 0 }
There isn't anything language-specific about this, as the only solutions are either the aggregation framework or mapReduce, where the latter is defined in JavaScript functions.
Just setting up some sample data:
db.wordstuff.insertMany([
{ 'data': "foo brick bar" },
{ 'data': "brick foo" },
{ 'data': "bar brick baz" },
{ 'data': "bax" },
{ 'data': "brin brok fu foo" }
])
Aggregation Framework
Then you can run the aggregation statement:
db.wordstuff.aggregate([
{ "$project": {
"_id": 0,
"split": {
"$filter": {
"input": { "$split": [ "$data", " " ] },
"cond": { "$in": [ "$$this", ["bar","foo","baz","blat"] ] }
}
}
}},
{ "$unwind": "$split" },
{ "$group": { "_id": "$split", "count": { "$sum": 1 } }},
{ "$group": {
"_id": null,
"data": { "$push": { "k": "$_id", "v": "$count" } }
}},
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": {
"$map": {
"input": ["bar","foo","baz","blat"],
"as": "d",
"in": {
"$cond": {
"if": { "$ne": [{ "$indexOfArray": ["$data.k","$$d"] },-1] },
"then": {
"$arrayElemAt": [
"$data",
{ "$indexOfArray": ["$data.k","$$d"] }
]
},
"else": { "k": "$$d", "v": 0 }
}
}
}
}
}
}}
])
In reality, all of the real work is done by this point:
db.wordstuff.aggregate([
{ "$project": {
"_id": 0,
"split": {
"$filter": {
"input": { "$split": [ "$data", " " ] },
"cond": { "$in": [ "$$this", ["bar","foo","baz","blat"] ] }
}
}
}},
{ "$unwind": "$split" },
{ "$group": { "_id": "$split", "count": { "$sum": 1 } }},
])
Which gives you output like:
{ "_id" : "baz", "count" : 1.0 }
{ "_id" : "bar", "count" : 2.0 }
{ "_id" : "foo", "count" : 3.0 }
So the real work here is being done by $split, and that is the main dependency for using the aggregation framework: you need at least MongoDB 3.4 in order to do this. The very simple premise is to $split the words out individually as array members, then $filter that content against the input array of words.
That $filter uses $in, which is another addition as of MongoDB 3.4 to match against each listed word. There are other operators that can do this with longer syntax, but we know we already need MongoDB 3.4 so this is the shortest syntax.
All that is really done after that is to $unwind the matched array of words from each document, then $group to obtain those matched words as a distinct list, along with the count of the occurrences.
That really is all there is to it from the main perspective of the database.
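The split, filter, unwind and group steps described above can be sketched in plain Python on the same sample data, which makes it easy to check the expected counts. This is an in-memory illustration only; keyword_counts is a hypothetical helper name.

```python
from collections import Counter

keywords = ["bar", "foo", "baz", "blat"]

# Same sample data as inserted into the wordstuff collection above.
docs = [
    {"data": "foo brick bar"},
    {"data": "brick foo"},
    {"data": "bar brick baz"},
    {"data": "bax"},
    {"data": "brin brok fu foo"},
]


def keyword_counts(documents, wanted_words):
    """Split each data string on spaces and count only the wanted words."""
    wanted = set(wanted_words)
    return dict(Counter(
        word
        for doc in documents
        for word in doc["data"].split(" ")
        if word in wanted
    ))


matched = keyword_counts(docs, keywords)
# Keywords that never occur get a default count of 0.
histogram = {word: matched.get(word, 0) for word in keywords}
print(histogram)
```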
The following parts are actually "optional", since they are easy to reproduce in code and probably look a lot clearer and cleaner that way. They are shown just to demonstrate the newer operators, which require at least MongoDB 3.4.4 for the introduction of $arrayToObject.
Again the basics are that the next $group "rolls up" the matched words from the cursor into an array within a single document. There is also a very specific key naming applied of "k" and "v" for later reasons.
Then you use a $replaceRoot stage, since the content of the document returned is evaluated from an expression. This expression uses $map to iterate over the "input array" of words and matches those to the entries created from the aggregation. This matching is done using $indexOfArray to return the matched index of the compared value.
You use this within $cond, as you either want to transform that value into a matched element using $arrayElemAt, or alternately recognize the index was not a match. This either returns the aggregated entry (obtained from earlier matches) or a "default" value of 0 for the given word.
The final part uses $arrayToObject, which transforms an array of objects with properties "k" and "v" into "key/value" pairs as an object.
So you can ask MongoDB to do it, but the data is actually reduced by the minimal pipeline as shown, so you may as well do it in client code. It's pretty simple, and for JavaScript you just do:
var words = db.wordstuff.aggregate([
{ "$project": {
"_id": 0,
"split": {
"$filter": {
"input": { "$split": [ "$data", " " ] },
"cond": { "$in": [ "$$this", ["bar","foo","baz","blat"] ] }
}
}
}},
{ "$unwind": "$split" },
{ "$group": { "_id": "$split", "count": { "$sum": 1 } }},
]).toArray();
var result = ["bar","foo","baz","blat"].map(
w => ( words.map(wd => wd._id).indexOf(w) !== -1)
? words[words.map(wd => wd._id).indexOf(w)]
: { _id: w, count: 0 }
).reduce((acc,curr) => Object.assign(acc,{ [curr._id]: curr.count }),{})
So if there is anything that's language-specific at all, it would be that part. If you choose to run the aggregation at its basics and process the resulting cursor, then the Python code would be:
keywords = ["bar","foo","baz","blat"]
words = list(db.wordstuff.aggregate([
    { "$project": {
        "_id": 0,
        "split": {
            "$filter": {
                "input": { "$split": [ "$data", " " ] },
                "cond": { "$in": [ "$$this", keywords ] }
            }
        }
    }},
    { "$unwind": "$split" },
    { "$group": { "_id": "$split", "count": { "$sum": 1 } }},
]))
# map each matched word to its count, then default missing keywords to 0
counts = { w['_id']: w['count'] for w in words }
result = { k: counts.get(k, 0) for k in keywords }
And either method pulls out the same result:
{
"bar" : 2.0,
"foo" : 3.0,
"baz" : 1.0,
"blat" : 0.0
}
MapReduce
The alternate case where you don't even have the minimum MongoDB 3.4.0 available is to use mapReduce for the process instead. Again, this needs to be sent to the server as JavaScript, which is generally represented within "strings" in most language implementations ( other than JavaScript itself ):
db.wordstuff.mapReduce(
function() {
this.data.split(' ')
.filter( w => words.indexOf(w) !== -1 )
.forEach( w => emit(null,{ [w]: 1 }) );
},
function(key,values) {
return [].concat.apply([],
values.map(v => Object.keys(v).map(k => ({ k: k, v: v[k] })))
).reduce((acc,curr) => Object.assign(acc,{
[curr.k]: (acc.hasOwnProperty(curr.k))
? acc[curr.k] + curr.v : curr.v
}),{});
},
{
"out": { "inline": 1 },
"scope": { "words": ["bar","foo","baz","blat"] },
"finalize": function(key,value) {
return words.map( w => (value.hasOwnProperty(w))
? { [w]: value[w] } : { [w]: 0 }
).reduce((acc,curr) => Object.assign(acc,curr),{})
}
}
)
And that gives you the same results and really does exactly the same thing. It is just a little slower, because MongoDB needs to evaluate and process the JavaScript as compared to using its own native coded methods with the aggregation framework.
