mongodb, pipeline, multiple $group stages - python

I simply want to get the average regional city population for all countries in the cities collection. I think my first group stage works: it gives me all the different regions with the average population of each region.
My plan was for the next stage to group by country and then average all the values I got from the first group stage. Maybe I have an error in my thinking here, or more likely in my execution, since I am new to MongoDB and the aggregation pipeline. Below my code I put example data.
pipeline = [
    {"$unwind": "$isPartOf"},
    {
        "$group": {
            "_id": "$isPartOf",
            "avgpop": {"$avg": "$population"},
        }
    },
    {
        "$group": {
            "_id": "$country",
            "avgpopc": {"$avg": "$avgpop"},
        }
    }
]
{
    "_id" : ObjectId("52fe1d364b5ab856eea75ebc"),
    "elevation" : 1855,
    "name" : "Kud",
    "country" : "India",
    "lon" : 75.28,
    "lat" : 33.08,
    "isPartOf" : [
        "Jammu and Kashmir",
        "Udhampur district"
    ],
    "timeZone" : [
        "Indian Standard Time"
    ],
    "population" : 1140
}

In order to get the average regional city population for all countries in the cities collection, you need to first calculate the average city population for each region in a country, and then calculate the average of all those regional averages per country.
The _id field in the first $group stage should be a compound key, that is, a document composed of multiple fields. In this case the keys in the $group _id would be the isPartOf and country fields; this is where you get the average population of each region per country. The next $group stage then calculates the average of each country's regional population averages. Thus your final aggregation pipeline should look like this:
pipeline = [
    {"$unwind": "$isPartOf"},
    {
        "$group": {
            "_id": {
                "Region": "$isPartOf",
                "Country": "$country"
            },
            "avgPopulation": {"$avg": "$population"},
        }
    },
    {
        "$group": {
            "_id": "$_id.Country",
            "avgRegionalPopulation": {"$avg": "$avgPopulation"},
        }
    }
]
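As a sanity check, here is a pure-Python sketch of what the two $group stages compute, using two made-up city documents (hypothetical data, not from the real collection): first average city populations per (region, country), then average those regional averages per country.

```python
from collections import defaultdict

# Hypothetical sample documents standing in for the cities collection
cities = [
    {"name": "Kud", "country": "India",
     "isPartOf": ["Jammu and Kashmir", "Udhampur district"], "population": 1140},
    {"name": "A", "country": "India",
     "isPartOf": ["Jammu and Kashmir"], "population": 2000},
]

# Stage 1: $unwind + $group by (region, country) -> average city population
totals = defaultdict(lambda: [0, 0])  # (region, country) -> [sum, count]
for city in cities:
    for region in city["isPartOf"]:
        key = (region, city["country"])
        totals[key][0] += city["population"]
        totals[key][1] += 1
regional_avgs = {k: s / n for k, (s, n) in totals.items()}

# Stage 2: $group by country -> average of the regional averages
by_country = defaultdict(list)
for (region, country), avg in regional_avgs.items():
    by_country[country].append(avg)
country_avgs = {c: sum(v) / len(v) for c, v in by_country.items()}
```

With these two documents the "Jammu and Kashmir" average is 1570, "Udhampur district" is 1140, and the per-country average of averages is 1355, which is exactly what the two-stage pipeline produces.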

searching for space separated words in Elasticsearch

my data is,
POST index_name/_doc/1
{
    "created_date" : "2023-02-09T13:21:41.632492",
    "created_by" : "hats",
    "data" : [
        {
            "name" : "west cost",
            "document_id" : "1"
        },
        {
            "name" : "mist cost",
            "document_id" : "2"
        },
        {
            "name" : "cost",
            "document_id" : "3"
        }
    ]
}
I used query_string to search:
GET index_name/_search
{
    "query": {
        "query_string": {
            "default_field": "data.name",
            "query": "*t cost"
        }
    }
}
expected result was:
west cost, mist cost
but the output was:
west cost, mist cost, cost
I have tried many search queries but still couldn't find a solution.
Which search query handles the space? I need to search for values with a similar pattern in the field.

Good approach with MongoDB, comparing two fields to determine if value exists in both?

example:
_id = 001
field 'location' = PARIS FRANCE
field 'country' = FRANCE
_id = 002
field 'location' = TORONTO
field 'country' = CANADA
desired result:
the ability to recognize that for _id 001, "FRANCE" is also in the value of the location field;
whereas _id 002 does not have a value from country that is also in location.
Instead of relying on pandas, I would like to see if there are more efficient options using pymongo, for example.
This is sensitive to case, and possible abbreviations, etc., but here's one way to identify if one string is contained within the other.
Given an example collection like this:
[
    {
        "_id": "001",
        "location": "PARIS FRANCE",
        "country": "FRANCE"
    },
    {
        "_id": "002",
        "location": "TORONTO",
        "country": "CANADA"
    }
]
This will set "isIn" if "country" is contained within "location" or vice-versa.
db.collection.aggregate([
    {
        "$set": {
            "isIn": {
                "$gte": [
                    {
                        "$sum": [
                            // each $indexOfCP returns the match position, or -1 if not found
                            {"$indexOfCP": ["$location", "$country"]},
                            {"$indexOfCP": ["$country", "$location"]}
                        ]
                    },
                    -1
                ]
            }
        }
    }
])
Example output:
[
    {
        "_id": "001",
        "country": "FRANCE",
        "isIn": true,
        "location": "PARIS FRANCE"
    },
    {
        "_id": "002",
        "country": "CANADA",
        "isIn": false,
        "location": "TORONTO"
    }
]
Try it on mongoplayground.net.
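The containment test can also be reproduced in plain Python to sanity-check the logic, since str.find has the same return convention as $indexOfCP (match position, or -1 if not found):

```python
def is_in(location: str, country: str) -> bool:
    # The sum of the two find() results is >= -1 exactly when at least one
    # string contains the other: a hit at position 0 plus a miss (-1) sums
    # to -1, while two misses sum to -2.
    return (location.find(country) + country.find(location)) >= -1
```

For the two example documents this gives is_in("PARIS FRANCE", "FRANCE") == True and is_in("TORONTO", "CANADA") == False, matching the aggregation output above. Like the aggregation, it is case-sensitive.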

Transforming nested JSON to simple dictionary JSON structure

I'm calling an API which returns data in the following format:
{
    "records": [
        {
            "columns": [
                {
                    "fieldNameOrPath": "Name",
                    "value": "Burlington Textiles Weaving Plant Generator"
                },
                {
                    "fieldNameOrPath": "AccountName",
                    "value": "Burlington Textiles Corp of America"
                }
            ]
        },
        {
            "columns": [
                {
                    "fieldNameOrPath": "Name",
                    "value": "Dickenson Mobile Generators"
                },
                {
                    "fieldNameOrPath": "AccountName",
                    "value": "Dickenson plc"
                }
            ]
        }
    ]
}
In order to properly use this data in my following workflow, I need a structure such as:
{
    "records": [
        {
            "Name": "Burlington Textiles Weaving Plant Generator",
            "AccountName": "Burlington Textiles Corp of America"
        },
        {
            "Name": "Dickenson Mobile Generators",
            "AccountName": "Dickenson plc"
        }
    ]
}
So the fieldNameOrPath value needs to become the key and the value value needs to become the value.
Can this transformation be done with a python function?
Those conditions apply:
I don't know how many objects will be inside each columns list element
The key and the value names could be different (so I need to pass fieldNameOrPath as the key for the key and value as the key for the value to the function in order to specify them)
We'll suppose the data from the API is stored in a variable data. To get the data transformed into the format you propose, we can iterate through all the records, and for each record create a dictionary by iterating through its columns, using the fieldNameOrPath values as the keys, and the value values as the dictionary values.
trans_data = {"records": []}
for record in data["records"]:
    trans_record = {}
    for column in record["columns"]:
        trans_record[column["fieldNameOrPath"]] = column["value"]
    trans_data["records"].append(trans_record)
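To also cover your requirement that the key and value field names may differ, the loop can be wrapped in a function that takes those names as parameters. This is a sketch: flatten_records is a hypothetical name, and the outer "records"/"columns" keys are assumed fixed here.

```python
def flatten_records(data, key_field="fieldNameOrPath", value_field="value"):
    # Build one flat dict per record from its list of column objects,
    # using the configurable key/value field names.
    return {
        "records": [
            {col[key_field]: col[value_field] for col in record["columns"]}
            for record in data["records"]
        ]
    }

# Usage with one record from the example payload
result = flatten_records({"records": [{"columns": [
    {"fieldNameOrPath": "Name", "value": "Dickenson Mobile Generators"},
    {"fieldNameOrPath": "AccountName", "value": "Dickenson plc"},
]}]})
```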

Pymongo find value in subdocuments

I'm using MongoDB 4 and Python 3. I have 3 collections; the first collection has two fields referencing the other two collections.
Example :
User {
    _id : ObjectId("5b866e8e06a77b30ce272ba6"),
    name : "John",
    pet : ObjectId("5b9248cc06a77b09a496bad0"),
    car : ObjectId("5b214c044ds32f6bad7d2"),
}
Pet {
    _id : ObjectId("5b9248cc06a77b09a496bad0"),
    name : "Mickey",
}
Car {
    _id : ObjectId("5b214c044ds32f6bad7d2"),
    model : "Tesla"
}
So one User has one car and one pet. I need to query the User collection and find if there is a User who has a Pet with the name "Mickey" and a Car with the model "Tesla".
I tried this :
db.user.aggregate([{
$project : {"pet.name" : "Mickey", "car.model" : "Tesla" }
}])
But it returns a lot of documents, while I have just one document with this data. What am I doing wrong?
The answer posted by @AnthonyWinzlet has the downside that it needs to churn through all documents in the users collection and perform $lookups, which is relatively costly. So depending on the size of your users collection, it may well be faster to do this:
Put an index on users.pet and users.car: db.users.createIndex({pet: 1, car: 1})
Put an index on cars.model: db.cars.createIndex({model: 1})
Put an index on pets.name: db.pets.createIndex({name: 1})
Then you could simply do this:
Get the list of all matching "Tesla" cars: db.cars.find({model: "Tesla"})
Get the list of all matching "Mickey" pets: db.pets.find({name: "Mickey"})
Find the users you are interested in: db.users.find({car: { $in: [<ids from cars query>] }, pet: { $in: [<ids from pets query>] }})
That is pretty easy to read and understand plus all three queries are fully covered by indexes so they can be expected to be as fast as things can get.
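The three steps can be dry-run in plain Python with stand-in documents (made-up ids and names, not real collections) to see how the id lists from the first two queries feed the final $in query:

```python
# Stand-in data for the three collections (hypothetical)
cars = [{"_id": "c1", "model": "Tesla"}, {"_id": "c2", "model": "Ford"}]
pets = [{"_id": "p1", "name": "Mickey"}, {"_id": "p2", "name": "Rex"}]
users = [
    {"_id": "u1", "name": "John", "car": "c1", "pet": "p1"},
    {"_id": "u2", "name": "Ann", "car": "c2", "pet": "p1"},
]

# Steps 1 and 2: collect matching ids (db.cars.find / db.pets.find)
car_ids = {c["_id"] for c in cars if c["model"] == "Tesla"}
pet_ids = {p["_id"] for p in pets if p["name"] == "Mickey"}

# Step 3: db.users.find with the two $in clauses
matches = [u for u in users if u["car"] in car_ids and u["pet"] in pet_ids]
```

With pymongo the same three steps would be three find() calls against the real collections, with the id sets plugged into the $in filters.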
You need to use the $lookup aggregation here. Something like this:
db.users.aggregate([
    { "$lookup": {
        "from": Pet.collection.name,
        "let": { "pet": "$pet" },
        "pipeline": [
            { "$match": { "$expr": { "$eq": ["$_id", "$$pet"] }, "name": "Mickey" }}
        ],
        "as": "pet"
    }},
    { "$lookup": {
        "from": Car.collection.name,
        "let": { "car": "$car" },
        "pipeline": [
            { "$match": { "$expr": { "$eq": ["$_id", "$$car"] }, "model": "Tesla" }}
        ],
        "as": "car"
    }},
    { "$match": { "pet": { "$ne": [] }, "car": { "$ne": [] } }},
    { "$project": { "name": 1 }}
])
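Conceptually, each $lookup filters the joined collection by both the $expr join condition and the literal field match, and the final $match keeps only users where neither lookup array came back empty. A plain-Python dry run with stand-in documents (hypothetical data, not the real collections):

```python
# Stand-in data (hypothetical)
pets = [{"_id": "p1", "name": "Mickey"}]
cars = [{"_id": "c1", "model": "Tesla"}, {"_id": "c2", "model": "Ford"}]
users = [{"name": "John", "pet": "p1", "car": "c1"},
         {"name": "Ann", "pet": "p1", "car": "c2"}]

results = []
for u in users:
    # $lookup: join on _id == local field, plus the name/model filter
    pet = [p for p in pets if p["_id"] == u["pet"] and p["name"] == "Mickey"]
    car = [c for c in cars if c["_id"] == u["car"] and c["model"] == "Tesla"]
    if pet and car:  # $match: both lookup arrays must be non-empty
        results.append({"name": u["name"]})
```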

mongodb python , quick pipeline code check

I am a beginner with MongoDB and I have an assignment to write pipeline code. My goal is to find which region in India has the largest number of cities with longitude between 75 and 80. I hope somebody can help me point out my misconceptions and/or mistakes; it is a very short piece of code, so I am sure the pros will spot it right away.
Here is my code; I will post what the data structure looks like below it:
pipeline = [
{"$match" : {"lon": {"$gte":75, "$lte" : 80}},
{'country' : 'India'}},
{ '$unwind' : '$isPartOf'},
{ "$group":
{
"_id": "$name",
"count" :{"$sum":{"cityname":"$name"}} }},
{"$sort": {"count": -1}},
{"$limit": 1}
]
{
    "_id" : ObjectId("52fe1d364b5ab856eea75ebc"),
    "elevation" : 1855,
    "name" : "Kud",
    "country" : "India",
    "lon" : 75.28,
    "lat" : 33.08,
    "isPartOf" : [
        "Jammu and Kashmir",
        "Udhampur district"
    ],
    "timeZone" : [
        "Indian Standard Time"
    ],
    "population" : 1140
}
The following pipeline will give you the desired result. The first $match stage uses standard MongoDB queries to filter the documents (cities) whose longitude is between 75 and 80, keeping only the ones in India based on the country field. Since each document represents a city, the $unwind operator on isPartOf then deconstructs that array field to output one document per array element, with the array replaced by that single element value; for each input document it outputs n documents, where n is the number of array elements.
That is useful in the next $group stage, where you can compute the count n per region with the $sum accumulator operator. The remaining stages transform the final document structure by introducing the replacement fields Region and NumberOfCities, sort the documents in descending order, and return the top document, which is your region with the largest number of cities:
pipeline = [
    {
        "$match": {
            "lon": {"$gte": 75, "$lte": 80},
            "country": "India"
        }
    },
    {"$unwind": "$isPartOf"},
    {
        "$group": {
            "_id": "$isPartOf",
            "count": {"$sum": 1}
        }
    },
    {
        "$project": {
            "_id": 0,
            "Region": "$_id",
            "NumberOfCities": "$count"
        }
    },
    {"$sort": {"NumberOfCities": -1}},
    {"$limit": 1}
]
There are some syntax and logical errors in your pipeline.
{"$match" : {"lon": {"$gte":75, "$lte" : 80}},
{'country' : 'India'}},
The syntax here is wrong; you should just use a comma to separate key-value pairs in $match.
"_id": "$name",
You are grouping based on city name and not on the region.
{"$sum":{"cityname":"$name"}}
The $sum operator needs numeric values that result from applying a specified expression; {"cityname": "$name"} will simply be ignored.
The correct pipeline would be:
[
    {"$match": {"lon": {"$gte": 75, "$lte": 80}, "country": "India"}},
    {"$unwind": "$isPartOf"},
    {
        "$group": {
            "_id": "$isPartOf",
            "count": {"$sum": 1}
        }
    },
    {"$sort": {"count": -1}},
    {"$limit": 1}
]
If you want to get all the cities in that region satisfying your condition as well, you can add "cities": {"$push": "$name"} to the $group stage.
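In plain Python terms, that $group stage (with the optional $push) counts and collects per region like this, using a few made-up unwound documents (hypothetical data standing in for the $match + $unwind output):

```python
from collections import defaultdict

# Hypothetical output of the $match + $unwind stages: one doc per
# (city, region) pair
unwound = [
    {"name": "Kud", "isPartOf": "Jammu and Kashmir"},
    {"name": "Kud", "isPartOf": "Udhampur district"},
    {"name": "Katra", "isPartOf": "Jammu and Kashmir"},
]

# $group by region: "count": {"$sum": 1} plus "cities": {"$push": "$name"}
groups = defaultdict(lambda: {"count": 0, "cities": []})
for doc in unwound:
    g = groups[doc["isPartOf"]]
    g["count"] += 1
    g["cities"].append(doc["name"])

# $sort by count descending, $limit 1
top_region, top = max(groups.items(), key=lambda kv: kv[1]["count"])
```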
