I have a MongoDB query as follows:
data = db.collection.aggregate([{"$match":{"created_at":{"$gte":start,"$lt":end}}},{"$group":{"_id":"$stage","count":{"$sum":1}}},{"$match":{"count":{"$gt":m}}}])
Which results in the following output:
{u'count': 296, u'_id': u'10.57.72.93'}
{u'count': 230, u'_id': u'111.11.111.111'}
{u'count': 2240, u'_id': u'111.11.11.11'}
I am trying to sort the output by the 'count' column:
data.sort('count', pymongo.DESCENDING)
...but I am getting the following error:
'CommandCursor' object has no attribute 'sort'
Can anyone explain the reason for this error?
Use $sort, as shown in the Aggregation example in the PyMongo docs:
from bson.son import SON

data = db.collection.aggregate([
    {"$match": {"created_at": {"$gte": start, "$lt": end}}},
    {"$group": {"_id": "$stage", "count": {"$sum": 1}}},
    {"$match": {"count": ... }},
    {"$sort": SON([("count", -1)])}  # <---
])
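A side note: SON (an ordered dict) matters when sorting on multiple keys, where the key order defines the sort precedence. With a single sort key, as here, a plain Python dict works just as well:

{"$sort": {"count": -1}}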
An alternative, more general solution: use Python's built-in sorted with a custom key function:
data = db.collection.aggregate(...)
data = sorted(data, key=lambda x: x['count'], reverse=True)
You may want to use $sort. Check the MongoDB docs as well.
Aggregation pipelines have a $sort pipeline stage:
data = db.collection.aggregate([
    {"$match": {"created_at": {"$gte": start, "$lt": end}}},
    {"$group": {"_id": "$stage", "count": {"$sum": 1}}},
    {"$match": {
        "count": {"$gt": 296}  # query trimmed because your question listing is incomplete
    }},
    {"$sort": {"count": -1}}  # Actual sort stage
])
The .sort() method you tried belongs to the regular cursor returned by find(); aggregate() returns a CommandCursor, which has no .sort(), so the ordering has to happen inside the pipeline.
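For contrast, here is a minimal sketch of the cursor-level sort, which only applies to find() queries (reusing the collection and date range from the question):

import pymongo

# .sort() is a method of the Cursor returned by find(); aggregate()
# returns a CommandCursor, which does not have it.
cursor = db.collection.find({"created_at": {"$gte": start, "$lt": end}})
for doc in cursor.sort("created_at", pymongo.DESCENDING):
    print(doc)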
Requirement
My requirement is to have Python code extract some records from a database, format them, and upload the formatted JSON to a sink.
Planned approach
1. Create JSON-like templates for each record. E.g.
json_template_str = '{{
    "type": "section",
    "fields": [
        {{
            "type": "mrkdwn",
            "text": "Today *{total_val}* customers saved {percent_derived}%."
        }}
    ]
}}'
2. Extract records from DB to a dataframe.
3. Loop over the dataframe and replace the {var} variables in bulk using something like .format(**locals())
Question
I haven't worked with dataframes before.
What would be the best way to accomplish Step 3? Currently I am:
3.1 Looping over the dataframe rows one by one: for i, df_row in df.iterrows():
3.2 Assigning
total_val= df_row['total_val']
percent_derived= df_row['percent_derived']
3.3 Formatting the string inside the loop and appending the parsed result to a list: block.append(json.loads(json_template_str.format(**locals())))
I was also trying the DataFrame assign() method, but could not figure out how to use something like a lambda function there to create a new column holding the formatted value I need.
As a novice in pandas, I feel there might be a more efficient way to do this (which may even involve changing the JSON template string - which I can totally do). Will be great to hear thoughts and ideas.
Thanks for your time.
I would not write a JSON string by hand, but rather create a corresponding Python object and then use the json library to convert it into a string. With this in mind, you could try the following:
import copy
import pandas as pd

# some sample data
df = pd.DataFrame({
    'total_val': [100, 200, 300],
    'percent_derived': [12.4, 5.2, 6.5]
})

# template dictionary for a single block
json_template = {
    "type": "section",
    "fields": [
        {"type": "mrkdwn",
         "text": "Today *{total_val:.0f}* customers saved {percent_derived:.1f}%."}
    ]
}

# a function that will insert data from each row
# of the dataframe into a block
def format_data(row):
    json_t = copy.deepcopy(json_template)
    text_t = json_t["fields"][0]["text"]
    json_t["fields"][0]["text"] = text_t.format(
        total_val=row['total_val'], percent_derived=row['percent_derived'])
    return json_t

# create a list of blocks
result = df.agg(format_data, axis=1).tolist()
The resulting list looks as follows, and can be converted into a JSON string if needed:
[{
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *100* customers saved 12.4%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *200* customers saved 5.2%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *300* customers saved 6.5%.'
}]
}]
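If you do need the JSON string, the standard json module handles the conversion of the result list above:

import json

json_str = json.dumps(result, indent=2)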
I was trying to extract the JSON response data from an API and load it into a Snowflake VARIANT column using a Python script.
While loading the data, I noticed that the keys are re-arranged in alphabetical order.
Python/Postman data:
{
  "Data": [
    {
      "CompanyID": 3522,
      "MarketID": 23259,
      "MarketName": "XYZ_Market",
      "LocationID": 17745,
      "LocationName": "XYZ_Location"
    }
  ]
}
Snowflake data:
{
  "Data": [
    {
      "CompanyID": 3522,
      "LocationID": 17745,
      "LocationName": "XYZ_Location",
      "MarketID": 23259,
      "MarketName": "XYZ_Market"
    }
  ]
}
I was using the PARSE_JSON() function to load the data into Snowflake. Is there any way to preserve the order of the keys?
In Python 3.6+, dictionaries maintain their insertion order. However, as noted in the Snowflake docs, JSON objects are unordered. So you may be limited by how your data is stored.
If you need to maintain order, consider an array of arrays, instead.
[
["CompanyID", 3522],
["MarketID", 23259],
["MarketName", "XYZ_Market"],
["LocationID", 17745],
["LocationName", "XYZ_Location"]
]
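If you control the Python side, here is a minimal sketch of producing that array-of-arrays form from a dict (assuming record is one object from your API response):

import json

record = {"CompanyID": 3522, "MarketID": 23259, "MarketName": "XYZ_Market",
          "LocationID": 17745, "LocationName": "XYZ_Location"}
# dict iteration preserves insertion order on Python 3.7+ (and CPython 3.6)
ordered_pairs = [[k, v] for k, v in record.items()]
payload = json.dumps(ordered_pairs)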
totalHotelsInTown = hotels.aggregate([
    {"$group": {"_id": "$Town", "TotalRestaurantInTown": {"$sum": 1}}}
])
NumOfHotelsInTown = {}
for item in totalHotelsInTown:
    NumOfHotelsInTown[item['_id']] = item['TotalRestaurantInTown']

results = hotels.aggregate([
    {"$match": {"cuisine": cuisine}},
    {"$group": {"_id": "$town", "HotelsCount": {"$sum": 1}}},
    {"$project": {"HotelsCount": 1,
                  "Percent": {"$multiply": [{"$divide": ["$HotelsCount", NumOfHotelsInTown["$_id"]]}, 100]}}},
    {"$sort": {"Percent": 1}},
    {"$limit": 1}
])
I want to pass the value of the "_id" field as a key into a Python dictionary, but the interpreter takes "$_id" itself as the key instead of its value, and raises a KeyError because of that. Any help would be much appreciated. Thanks!
The NumOfHotelsInTown dictionary has key-value pairs of place and number of hotels. When I try to retrieve a value from NumOfHotelsInTown, I am supplying the key dynamically as "$_id".
The exact error I am getting is:
{"$group": {"_id": "$borough", "HotelsCount": {"$sum": 1} }}, {"$project": {"HotelsCount":1,"Percent": {"$multiply": [{"$divide": ["$HotelsCount", NumOfHotlesInTown["$_id"]]}, 100]}}}, {"$sort": {"Percent": 1}},
KeyError: '$_id'
I see what you're trying to do, but you can't dynamically run Python code during a MongoDB aggregation.
What you should do instead:
Get the total counts for every borough (which you have already done)
Get the total counts for every borough for a given cuisine (which you have part of)
Use Python, not MongoDB, to compare the two totals and produce the percentages
For example:
group_by_borough = {"$group": {"_id": "$borough", "TotalRestaurantInBorough": {"$sum": 1}}}

count_of_restaurants_by_borough = my_collection.aggregate([group_by_borough])
restaurant_count_by_borough = {doc["_id"]: doc["TotalRestaurantInBorough"] for doc in count_of_restaurants_by_borough}

count_of_cuisines_by_borough = my_collection.aggregate([{"$match": {"cuisine": cuisine}}, group_by_borough])
cuisine_count_by_borough = {doc["_id"]: doc["TotalRestaurantInBorough"] for doc in count_of_cuisines_by_borough}

percentages = {}
for borough, count in restaurant_count_by_borough.items():
    percentages[borough] = cuisine_count_by_borough.get(borough, 0) / float(count) * 100

# And if you wanted it sorted you can use an OrderedDict
from collections import OrderedDict
percentages = OrderedDict(sorted(percentages.items(), key=lambda x: x[1]))
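As a side note, if you are on Python 3.7+ (where plain dicts preserve insertion order), the OrderedDict import can be dropped:

percentages = dict(sorted(percentages.items(), key=lambda x: x[1]))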
I have a collection with documents like this:
{
    "_id" : "1234567890",
    "area" : "Zone 63",
    "last_state" : "Cloudy",
    "recent_indices" : [
        21,
        18,
        33,
        ...
        38,
        41
    ],
    "Report_stats" : [
        {
            "date_hour" : "2017-01-01 01",
            "count" : 31
        },
        {
            "date_hour" : "2017-01-01 02",
            "count" : 20
        },
        ...
        {
            "date_hour" : "2018-08-26 13",
            "count" : 3
        }
    ]
}
which is supposed to be updated based on real-time online reports. Assume each report looks like this:
{
'datetime' : '2018-08-26 13:48:11.677635',
'areas' : 'Zone 3; Zone 45; Zone 63',
'status' : 'Clear',
'index' : '33'
}
Now I have to update the collection in such a way that:
Each time a new 'area' (say Zone 1025) shows up in a report, a new document is added to keep the related data
The new 'index' is appended to the "recent_indices" list, while "last_state" is updated to the report's 'status'
Based on the report's 'datetime' (at hour resolution), either the matching "Report_stats.count" is incremented by 1, or a new "Report_stats" entry (with that hour as 'date_hour' and a 'count' of 1) is inserted
The way to do each of these updates separately is fairly obvious; the problem is: how can I do all of them simultaneously, in a single update/upsert operation?
I tried update_one and find_one_and_update (as well as update and find_and_modify) with PyMongo, but I was not able to solve the problem that way.
So I started to wonder whether there is a simple, single operation to do this, or whether I should approach the problem differently altogether.
Can you please help me do this, or (since a lot of data is being gathered and has to be processed) suggest a low-cost alternative?
Thank you!
I am unsure if I understand your question, but if your problem revolves around upsert, i.e. update the record or add it if it is not there, you can do it by adding one parameter:
update_one({'_id': 1}, {'$set': {}}, upsert=True)
If you want to update multiple fields, you can simply set them all in your update document:
collection.update_one(
    {'name': 'Kanika', 'age': 19},          # filter
    {'$set': {'name': 'Andy', 'age': 30}},  # updated values
    upsert=True
)
Please try looking into https://docs.mongodb.com/manual/reference/method/db.collection.update/ and see if it helps.
Thanks, Kanika
The best solution I have reached so far is this:
if mycollection.find_one({'area': 'zone 45', 'Report_stats.date_hour': '2018-08-26 13'}):
    mycollection.update_one(
        {'area': 'zone 45', 'Report_stats.date_hour': '2018-08-26 13'},
        {
            '$inc': {'Report_stats.$.count': 1},
            '$set': {'last_state': 'Clear'},
            '$push': {'recent_indices': 33}
        }
    )
else:
    mycollection.update_one(
        {'area': 'zone 45'},
        {
            '$set': {'last_state': 'Clear'},
            '$push': {
                'recent_indices': 33,
                'Report_stats': {'date_hour': '2018-08-26 13', 'count': 1}
            }
        },
        upsert=True
    )
However, this still performs two round trips (a find_one followed by an update_one) to update one document for one report, which is not quite satisfactory.
Any better suggestions?
What I figured out from your reply above is: if Report_stats.date_hour already exists in the document, you increment its counter; otherwise you just push a new sub-document.
I believe we can do it using $cond or $switch. Can you please take a look?
https://docs.mongodb.com/manual/reference/operator/aggregation/cond/#exp._S_cond
Meanwhile, I am trying to write the whole query for you, and we'll see if it works.
Thanks, Kanika
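For reference, here is a sketch of how that $cond idea could be expressed as a single pipeline-style update. This requires MongoDB 4.2+ and PyMongo 3.9+, is untested, and hard-codes the sample values 'zone 45', '2018-08-26 13', 'Clear' and 33 from the question:

mycollection.update_one(
    {'area': 'zone 45'},
    [
        {'$set': {
            'last_state': 'Clear',
            # append the new index, creating the array on first upsert
            'recent_indices': {'$concatArrays': [{'$ifNull': ['$recent_indices', []]}, [33]]},
            'Report_stats': {'$let': {
                'vars': {'stats': {'$ifNull': ['$Report_stats', []]}},
                'in': {'$cond': [
                    # is there already an entry for this hour?
                    {'$in': ['2018-08-26 13',
                             {'$map': {'input': '$$stats', 'as': 's', 'in': '$$s.date_hour'}}]},
                    # yes: increment that entry's count
                    {'$map': {'input': '$$stats', 'as': 's', 'in': {'$cond': [
                        {'$eq': ['$$s.date_hour', '2018-08-26 13']},
                        {'date_hour': '$$s.date_hour', 'count': {'$add': ['$$s.count', 1]}},
                        '$$s'
                    ]}}},
                    # no: append a fresh entry with count 1
                    {'$concatArrays': ['$$stats', [{'date_hour': '2018-08-26 13', 'count': 1}]]}
                ]}
            }}
        }}
    ],
    upsert=True
)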
The documents in my collection look like this:
{'_id' : 'Delhi1', 'loc' : [28.34242,77.656565] }
{'_id' : 'Delhi2', 'loc' : [27.34242,78.626523] }
{'_id' : 'Delhi3', 'loc' : [25.34242,77.612345] }
{'_id' : 'Delhi4', 'loc' : [28.34242,77.676565] }
I want to run an aggregation with PyMongo to find the relevant documents based on an input lat/long pair. I have created the index on 'loc'. Here is what I have done so far:
pipeline = [{'$geoNear':{'near': [27.8787, 78.2342],
'distanceField': "distance",
'maxDistance' : 2000 }}]
db['mycollection'].aggregate(pipeline)
But this is not working for me. How do I use this correctly?
Actually, I had created a '2dsphere' index on the collection, and to use $geoNear with a 2dsphere index we need to specify 'spherical': True in the pipeline:
pipeline = [{'$geoNear':{'near': [27.8787, 78.2342],
'distanceField': "distance",
'maxDistance' : 2000,
'spherical' : True }}]
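As a side note, with a 2dsphere index you can also pass 'near' as a GeoJSON point; GeoJSON coordinates are in [longitude, latitude] order, and maxDistance is then interpreted in metres. A sketch, assuming the sample coordinates above are latitude/longitude:

pipeline = [{'$geoNear': {'near': {'type': 'Point', 'coordinates': [78.2342, 27.8787]},
                          'distanceField': 'distance',
                          'maxDistance': 2000,  # metres for GeoJSON input
                          'spherical': True}}]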
It looks like you have a few formatting mistakes (the snippet below is mongo shell syntax): 1) neither the collection nor the operators need parentheses or brackets, and 2) boolean literals are lowercase.
db.mycollection.aggregate([
{
$geoNear: {
near: { coordinates: [ 27.8787 , 78.2342 ] },
distanceField: "distance",
maxDistance: 2000,
spherical: true
}
}
])