mongodb python , quick pipeline code check

mongodb python , quick pipeline code check - python

i am a beginner to mongodb and i have the assignment to write pipeline code. MY goal is to find the Region in India has the largest number of cities with longitude between 75 and 80? I hope anybody can help me to point out my misconceptions and/or mistakes, it is a very short code, so i am sure the pros will spot it right away.
Here is my code, i will post how the datastructure looks like under it :
pipeline = [
{"$match" : {"lon": {"$gte":75, "$lte" : 80}},
{'country' : 'India'}},
{ '$unwind' : '$isPartOf'},
{ "$group":
{
"_id": "$name",
"count" :{"$sum":{"cityname":"$name"}} }},
{"$sort": {"count": -1}},
{"$limit": 1}
]
{
"_id" : ObjectId("52fe1d364b5ab856eea75ebc"),
"elevation" : 1855,
"name" : "Kud",
"country" : "India",
"lon" : 75.28,
"lat" : 33.08,
"isPartOf" : [
"Jammu and Kashmir",
"Udhampur district"
],
"timeZone" : [
"Indian Standard Time"
],
"population" : 1140
}

The following pipeline will give you the desired result. The first $match pipeline operator uses standard MongoDB queries to filter the documents (cities) whose longitude is between 75 and 80 and as well as the ones only in India based on the country field. Since each document represents a city, the $unwind operator on the isPartOf deconstructs that array field from the filtered documents to output a document for each element. Each output document replaces the array with an element value. Thus for each input document, outputs n documents where n is the number of array elements and this operation is rather useful in the next $group operator stage since that's where you can calculate the number n through $sum group accumulator operator. The next pipeline stages will then transform your final document structure by introducing new replacement fields Region and NumberOfCities + sorting the documents in descending order and then returning the top 1 document which is your region with the largest number of cities:
pipeline = [
{
"$match": {
"lon": {"$gte": 75, "$lte": 80},
"country": "India"
}
},
{
"$unwind": "$isPartOf"
},
{
"$group": {
"_id": "$isPartOf",
"count": {
"$sum": 1
}
}
},
{
"$project": {
"_id": 0,
"Region": "$_id",
"NumberOfCities": "$count"
}
},
{
"$sort": {"NumberOfCities": -1}
},
{ "$limit": 1 }
]

There are some syntax and logical errors in your pipeline.
{"$match" : {"lon": {"$gte":75, "$lte" : 80}},
{'country' : 'India'}},
The Syntax here is wrong, you should just use comma to seperate key value pairs in `$match.
"_id": "$name",
You are grouping based on city name and not on the region.
{"$sum":{"cityname":"$name"}}
You need to send a numeric values to the $sum operator that result from applying a specified expression. {"cityname":"$name"} will be ignored.
The correct pipeline would be :-
[
{"$match" : {"lon": {"$gte":75,"$lte" : 80},'country' : 'India'}},
{ '$unwind' : '$isPartOf'},
{ "$group":
{
"_id": "$isPartOf",
"count" :{"$sum":1}
}
},
{"$sort": {"count": -1}},
{"$limit": 1}
]
If you want to get all the cities in that region satisfying your condition as well ,you can add "cities": {'$push': '$name'} in the $group stage.

Related

Pymongo find value in subdocuments

I'm using MongoDB 4 and Python 3. I have 3 collections. The first collection got 2 referenced fields on the other collections.
Example :
User {
_id : ObjectId("5b866e8e06a77b30ce272ba6"),
name : "John",
pet : ObjectId("5b9248cc06a77b09a496bad0"),
car : ObjectId("5b214c044ds32f6bad7d2"),
}
Pet {
_id : ObjectId("5b9248cc06a77b09a496bad0"),
name : "Mickey",
}
Car {
_id : ObjectId("5b214c044ds32f6bad7d2"),
model : "Tesla"
}
So one User has one car and one pet. I need to query the User collection and find if there is a User who has a Pet with the name "Mickey" and a Car with the model "Tesla".
I tried this :
db.user.aggregate([{
$project : {"pet.name" : "Mickey", "car.model" : "Tesla" }
}])
But it returns me lot of data while I have just one document with this data. What I'm doing wrong ?

The answer posted by #AnthonyWinzlet has the downside that it needs to churn through all documents in the users collection and perform $lookups which is relatively costly. So depending on the size of your Users collection it may well be faster to do this:
Put an index on users.pet and users.car: db.users.createIndex({pet: 1, car: 1})
Put an index on cars.model: db.cars.createIndex({model: 1})
Put an index on pets.name: db.pets.createIndex({name: 1})
Then you could simply do this:
Get the list of all matching "Tesla" cars: db.cars.find({model: "Tesla"})
Get the list of all matching "Mickey" pets: db.pets.find({name: "Mickey"})
Find the users you are interested in: db.users.find({car: { $in: [<ids from cars query>] }, pet: { $in: [<ids from pets query>] }})
That is pretty easy to read and understand plus all three queries are fully covered by indexes so they can be expected to be as fast as things can get.

You need to use $lookup aggregation here.
Something like this
db.users.aggregate([
{ "$lookup": {
"from": Pet.collection.name,
"let": { "pet": "$pet" },
"pipeline": [
{ "$match": { "$expr": { "$eq": ["$_id", "$$pet"] }, "name" : "Mickey"}}
],
"as": "pet"
}},
{ "$lookup": {
"from": Car.collection.name,
"let": { "car": "$car" },
"pipeline": [
{ "$match": { "$expr": { "$eq": ["$_id", "$$car"] }, "model" : "Tesla"}}
],
"as": "car"
}},
{ "$match": { "pet": { "$ne": [] }, "car": { "$ne": [] } }},
{ "$project": { "name": 1 }}
])

Workaround for preserveNullAndEmptyArrays in MongoDB 2.6

I am using a Python script to query a MongoDB collection. The collection contains embedded documents with varying structures.
I am trying to simply "$unwind" an array contained in several documents. However, the array is not in ALL documents.
That means only the documents that contain the field are returned, the others are ignored. I am using PyMongo 2.6 so I am unable to use preserveNullAndEmptyArrays as mentioned in the documentation because it is new in MongoDB 3.2
Is there a workaround to this? Something along the lines of "if the field path exists, unwind".
The structure of documents and code in question is outlined in detail in this separate but related question I asked earlier.
ISSUE:
I am trying to "$unwind" the value of $hostnames.name. However, since the path doesn't exist in all documents, this results in several ignored documents.
Structure 1 Hostname stored as $hostnames.name
{
"_id" : "192.168.1.1",
"addresses" : {
"ipv4" : "192.168.1.1"
},
"hostnames" : [
{
"type" : "PTR",
"name" : "example.hostname.com"
}
]
}
Structure 2 Hostname stored as $hostname
{
"_id" : "192.168.2.1",
"addresses" : {
"ipv4" : "192.168.2.1"
},
"hostname" : "helloworld.com",
}
Script
cmp = db['computers'].aggregate([
{"$project": {
"u_hostname": {
"$ifNull": [
"$hostnames.name",
{ "$map": {
"input": {"$literal": ["A"]},
"as": "el",
"in": "$hostname"
}}
]
},
"_id": 0,
"u_ipv4": "$addresses.ipv4"
}},
{"$unwind": "$u_hostname"}
])
I am missing all documents that have an empty array for "hostnames".
Here is the structure of the documents that are still missing.
Structure 3
{
"_id" : "192.168.1.1",
"addresses" : { "ipv4" : "192.168.1.1" },
"hostnames" : [], }
}

We can still preserve all the documents where the array field is missing by playing with the $ifNull operator and use a logical $condition processing to assign a value to the newly computed field.
The condition here is $eq which returns True if the field is [None] or False when the condition expression evaluates to false.
cmp = db['computers'].aggregate(
[
{"$project":{
"u_ipv4": "$addresses.ipv4",
"u_hostname": {
"$let": {
"vars": {
"hostnameName": {
"$cond": [
{"$eq": ["$hostnames", []]},
[None],
{"$ifNull": ["$hostnames.name", [None]]}
]
},
"hostname": {"$ifNull": ["$hostname", None]}
},
"in": {
"$cond": [
{"$eq": ["$$hostnameName", [None]]},
{"$map": {
"input": {"$literal": [None]},
"as": "el",
"in": "$$hostname"
}},
"$$hostnameName"
]
}
}
}
}},
{ "$unwind": "$u_hostname" }
]
)

limit results of child/sub document when using find on master document

First time exploring mongoDB and I've bumped into a pickle.
Assuming I have a table/collection called inventory.
This collection in turn have documents that look like:
{
"book" : "Harry Potter",
"users" : {
"Read_it" : {
"John" : <personal number>,
"Elise" : <personal number>
},
"Currently_reading" : { ... }
}
}
Now the dictionary "Read_it" can become quite large and I'm limited to the amount of memory the querying client has so I would like to some how limit the number of returned item and perhaps page it.
This is a function I found in the docs, not sure how to convert this into what I need.
db.inventory.find( { "book": "Harry Potter" }, { item: 1, qty: 500 } )
Skipping the second parameter to find() gives me a result in the form a complete dictionary which works as long as the "Read_it" document/container doesn't grow to big.
One solution would be to pull back the structure so it becomes more flat, but that isn't optimal in terms of other aspects of this project.
Is is possible to work with find() here or are there another function that can do this better?

You seem to asking about projecting only specific elements of a nested structure.
Consider your document example (revised for use):
{
"book" : "Harry Potter",
"users" : {
"Read_it" : {
"John" : 1,
"Elise" : 2
},
"Currently_reading" : {
"Peter": 1
},
"More_information": 5
}
}
Then just issue as follows:
db.collection.find(
{ "book": "Harry Potter" },
{
"book": 1,
"users.Currently_reading": 1,
"users.More_information": 1
}
)
Returns the result with just the fields specified:
{
"_id" : ObjectId("5573b2beb67e246aba2b4b71"),
"book" : "Harry Potter",
"users" : {
"Currently_reading" : {
"Peter" : 1
},
"More_information" : 5
}
}
Not entirely sure, but that might not be supported in all MongoDB versions. Works in 3.X though. If you find it is not supported then do this instead:
db.collection.aggregate([
{ "$match": { "book": "Harry Potter" } },
{ "$project": {
"book": 1,
"users": {
"Currently_Reading": "$users.Currently_reading",
"More_information": "$users.More_information"
}
}}
])
The $project option of the .aggregate() method allows you to manipulate the document returned quite freely. So you don't even need to keep the same structure to return nested results and could change the result further if needed.
I would also strongly suggest using arrays with properties of sub-documents rather than nested dictionaries since that form is much easier to query and filter results than your current structure allows.
Additional to unclear question
As mentioned, it is better to use arrays rather than keys to represent the nested data. So if your intent is to actually just restrict the "Read_it" items to a number of entries then your data is best modelled as such:
{
"book" : "Harry Potter",
"users" : {
"Read_it" : [
{ "username": "John", "id": 1 },
{ "username": "Elise", "id": 2 }
],
"Currently_reading" : [
{ "username": "Peter", "id": 3 }
],
"More_information": 5
}
}
Then you can do a query to limit the number of items in "Read_it" using $slice :
db.collection.find(
{ "book": "Harry Potter" },
{ "users.Read_it": { "$slice": 1 } }
)
Which returns:
{
"_id" : ObjectId("5574118012ae33005f1fca17"),
"book" : "Harry Potter",
"users" : {
"Read_it" : [
{
"username" : "John",
"id" : 1
}
],
"Currently_reading" : [
{
"username" : "Peter",
"id" : 3
}
],
"More_information" : 5
}
}
Alternate options use the projection positional $ operator or even the aggregation framework for multiple matches in the array. But there are already many answers here that show you how to do that.

mongodb, pipeline, multiple $group stages

I wanna simply get average regional city population for all countries in the cities collection. I think my first group stage works in giving me all the different regions with the avg population of that region.
My plan was to go to the next stage now, id it by country and then build the avg of all those values i got in my first group stage. Maybe i have an error in my thinking here or more likely my execution, since i am new to mongo db and the pipeline thing. Below my code i put example data.
pipeline = [
{ '$unwind' : '$isPartOf'},
{
"$group":
{
"_id": "$isPartOf",
"avgpop" : {"$avg":"$population"},
}
},
{
"$group":
{
"_id": "$country",
"avgpopc" : {"$avg":"$avgpop"},
}
}
]
{
"_id" : ObjectId("52fe1d364b5ab856eea75ebc"),
"elevation" : 1855,
"name" : "Kud",
"country" : "India",
"lon" : 75.28,
"lat" : 33.08,
"isPartOf" : [
"Jammu and Kashmir",
"Udhampur district"
],
"timeZone" : [
"Indian Standard Time"
],
"population" : 1140
}

In order to get the average regional city population for all countries in the cities collection you need to first calculate the average city population for each region in a country and then calculate the average of all the regional averages for a country.The _id field in the first $group pipeline stage should be compound keys, that is, documents composed of multiple fields. In the case above the keys in the $group _id would be the isPartOf and country fields. This is where you get the average population of each region per country. The next group pipeline stage then calculates the average of all the country's regional population averages. Thus your final aggregation pipeline should look like this:
pipeline = [
{ "$unwind": "$isPartOf"},
{
"$group": {
"_id": {
"Region": "$isPartOf",
"Country": "$country"
},
"avgPopulation": {"$avg": "$population"},
}
},
{
"$group": {
"_id": "$_id.Country" ,
"avgRegionalPopulation": {"$avg": "$avgPopulation" },
}
}
]

how to aggregate on each item in collection in mongoDB

MongoDB noob here...
when I do db.students.find().pretty() in the shell I get a long list from my collection...like so..
{
"_id" : 19,
"name" : "Gisela Levin",
"scores" : [
{
"type" : "exam",
"score" : 44.51211101958831
},
{
"type" : "quiz",
"score" : 0.6578497966368002
},
{
"type" : "homework",
"score" : 93.36341655949683
},
{
"type" : "homework",
"score" : 49.43132782777443
}
]
}
now I've got about over 100 of these...I need to run the following on each of them...
lowest_hw_score =
db.students.aggregate(
// Initial document match (uses index, if a suitable one is available)
{ $match: {
_id : 0
}},
// Expand the scores array into a stream of documents
{ $unwind: '$scores' },
// Filter to 'homework' scores
{ $match: {
'scores.type': 'homework'
}},
// Sort in descending order
{ $sort: {
'scores.score': 1
}},
{ $limit: 1}
)
So I can run something like this on each result
for item in lowest_hw_score:
print lowest_hw_score
Right now "lowest_score" works on only one item I to run this on all items in the collection...how do I do this?

> db.students.aggregate(
{ $match : { 'scores.type': 'homework' } },
{ $unwind: "$scores" },
{ $match:{"scores.type":"homework"} },
{ $group: {
_id : "$_id",
maxScore : { $max : "$scores.score"},
minScore: { $min:"$scores.score"}
}
});
You don't really need the first $match, but if "scores.type" is indexed, it means it would be used before unwinding the scores. (I don't believe after the $unwind mongo would be able to use the index.)
Result:
{
"result" : [
{
"_id" : 19,
"maxScore" : 93.36341655949683,
"minScore" : 49.43132782777443
}
],
"ok" : 1
}
Edit: tested and updated in mongo shell

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

mongodb python , quick pipeline code check - python

Related

Pymongo find value in subdocuments

Workaround for preserveNullAndEmptyArrays in MongoDB 2.6

limit results of child/sub document when using find on master document

mongodb, pipeline, multiple $group stages

how to aggregate on each item in collection in mongoDB

Categories

Resources