Aggregating over map keys using mongo aggregation framework - python

I have a use case where I have documents stored in a mongo collection with one of the columns as map. For example :
{ "_id" : ObjectId("axa"), "date" : "2015-08-05", "key1" : "abc", "aggregates" : { "x" : 12, "y" : 1 } }
{ "_id" : ObjectId("axa1"), "date" : "2015-08-04", "key1" : "abc", "aggregates" : { "x" : 4, "y" : 19 } }
{ "_id" : ObjectId("axa2"), "date" : "2015-08-03", "key1" : "abc", "aggregates" : { "x" : 3, "y" : 13 } }
One thing to note is keys inside aggregates sub document could change. for example instead of x and y , it could be z and k or any combination and any number
Now I am pulling that data over from an API and need to use mongo aggregation framework to aggregate over date range. For instance, for the above example, I want to run query for date 08/03 -08/05 and aggregate x and y (group by x and y ) and the result should be
{ "key1" : "abc", "aggregates" : { "x" : 19, "y" : 33 } }
How can I do it?

First you should update you document because date is string. You can do that using the Bulk() API
from datetime import datetime
import pymongo
conn = pymongo.MongoClient()
db = conn.test
col = db.collection
bulk = col.initialize_ordered_bulk_op()
count = 0
for doc in col.find():
conv_date = datetime.strptime(doc['date'], '%Y-%m-%d')
bulk.find({'_id': doc['_id']}).update_one({'$set': {'date': conv_date}})
count = count + 1
if count % 500 == 0:
# Execute per 500 operations and re-init.
bulk.execute()
bulk = col.initialize_ordered_bulk_op()
# Clean up queues
if count % 500 != 0:
bulk.execute()
Then comes the aggregation part:
You need to filter you documents by date using the $match operator. Next $group your documents by a specified identifierkey1 and apply the accumulator $sum. With $project you can reshape your document.
x = 'x'
y = 'y'
col.aggregate([
{'$match': { 'date': { '$lte': datetime(2015, 8, 5), '$gte': datetime(2015, 8, 3)}}},
{'$group': {'_id': '$key1', 'x': {'$sum': '$aggregates. ' +x}, 'y': {'$sum': '$aggregates.' + y}}},
{'$project': {'key1': '$_id', 'aggregates': {'x': '$x', 'y': '$y'}, '_id': 0}}
])

Related

How to rearrange my output to be a single object and be separated by a comma ? Mongodb/python

Note: i use flask / pymongo
How can i rearrange my data to output all of them in a single object separated by a comma. (see end of post for example).
I have a collection with data similar to this and i need to ouput all the number of times for example here, that sandwich is in the collection like this Sandwiches: 13 :
{
"id":"J6qWt6XIUmIGFHX5rQJA-w",
"categories":[
{
"alias":"sandwiches",
"title":"Sandwiches"
}
]
}
So with this first request :
restos.aggregate([{$unwind:"$categories"},
{$group:{_id:'$categories.title', count:{$sum:1}}},
{$project:{ type:"$_id", count: 1,_id:0}}])
I achieved to get an out put like this :
{ "count" : 3, "type" : "Sandwiches" }
But what i want is the type as a key and the count as a value, like this : { "Sandwiches" : 3 }
I was able to "partially make it works with that command but that's not really the format i want :
.aggregate([{'$unwind': '$categories'},{'$group': {'_id': '$categories.title','count':{'$sum':
1}}},{'$project': {'type': '$_id', 'count': 1, '_id': 0}}, {'$replaceRoot': {'newRoot':
{'$arrayToObject': [[{'k': '$type', 'v': '$count'}]]}}}]))
The output was :
{
"restaurants": [
{
"Polish": 1
},
{
"Salad": 3
},
{
"Convenience Stores": 1
},
{
"British": 2
}]}
But my desired output is something like this that doesn't have the array and the data is contained into only 1 object:
{
"restaurants":{
Sandwiches: 13,
pizza: 15,
...
}
for the list thing i've come to realize that i use flask and when i return my jsonify object i put 'restaurants': list(db.restaurants.aggregate([
but when i remove it i get this error : TypeError: Object of type CommandCursor is not JSON serializable
Any idea on how to do that ? thanks a lot :)
If you can get a data like the following.
{ "count" : 13, "type" : "Sandwiches" }
You can do like this:
data = [{ "count" : 13, "type" : "Sandwiches" }, { "count" : 15, "type" : "Pizza" }]
output = {}
p = {}
for d in data: # read each item in the list
p.update({d['type']: d['count']}) # build a p dict with type key
output.update({'restaurants': p}) # build an output dict with restaurants key
print(output)
# {'restaurants': {'Sandwiches': 13, 'Pizza': 15}}

python how to search a string, count values and group by in json

I have a python program calling an API that receives the result as below:
{
"result": [
{
"company" : "BMW",
"model" : "5"
},
{
"company" : "BMW",
"model" : "5"
},
{
"company" : "BMW",
"model" : "5"
},
{
"company" : "BMW",
"model" : "3"
},
{
"company" : "BMW",
"model" : "7"
},
{
"company" : "AUDI",
"model" : "A3"
},
{
"company" : "AUDI",
"model" : "A7"
},
]
}
Now my task is to identify the number of occurrences of elements from the list in JSON output and group them. The expected output should look like this:
{
"BMW" :
{
"5series" : 3,
"3series" : 1,
"7series" : 1,
},
"AUDI" :
{
"A3" : 1,
"A7" : 1,
},
"MERCEDES":
{
"EClass" : 0,
"SClass" : 0
}
}
I need to find the "company" from list of elements. This will include names that may not be in JSON response sometimes, then the expected output should include that as 0. The "model" names (3,5,7,A3 etc..,) are fixed, so we know that's those are only ones that may or may not be in json api response.
For ex: The List has 3 company names in below code. - companyname = ["BMW,"AUDI","MERCEDES"] . However, sometimes, the JSON API response may not have one or more elements. In this case, "MERCEDES" is missing, but the final output should include "MERCEDES" as well with value as 0.
Here is what i have tried so far:
def modelcount():
companyname= ["BMW","AUDI","MERCEDES"]
url = apiurl
#Send Request
apiresponse = requests.get(url, auth=(user, password), headers=headers, proxies=proxies)
# Decode the JSON response into a dictionary and use the data
data = apiresponse.json()
print(len(data['result']))
3series= 0
5series= 0
7series= 0
A3=0
A7=0
EClass = 0
SClass = 0
modelcountjson = {}
for name in companyname:
for item in data['result']:
models= {}
if item['company'] == name:
if item['model'] == 3:
3series = 3series + 1
elif item['model'] == 5:
5series = 5series + 1
elif item['model'] == 7:
7series = 7series + 1
models['3series'] = 3series
models['5series'] = 5series
models['7series'] = 7series
#I still haven't written AUDI, MERCEDES above. This is where i feel i am writing inefficiently.
modelcountjson[name] = models
return jsonify(modelcountjson)
```
As the number of models grow, I am worried of code getting redundant with many for loops and may cause performance overhead. I am looking for help on achieving the end result in most efficient way.
Thank you so much for your help.
A useful package for working directly with JSON-style dictionaries and lists is toolz (see documentation for more details). This way you can concisely group the data and count occurrences of each model while handling potentially missing data separately:
from toolz import itertoolz
result = {
"result": [
{
"company" : "BMW",
"model" : "5"
},
{
"company" : "BMW",
"model" : "5"
},
{
"company" : "BMW",
"model" : "5"
},
{
"company" : "BMW",
"model" : "3"
},
{
"company" : "BMW",
"model" : "7"
},
{
"company" : "AUDI",
"model" : "A3"
},
{
"company" : "AUDI",
"model" : "A7"
},
]
}
final_output = {}
grouped_result = itertoolz.groupby('company', result['result'])
if 'MERCEDES' not in grouped_result:
final_output['MERCEDES'] = {
'EClass': 0,
'SClass': 0
}
for key, value in grouped_result.items():
models = itertoolz.pluck('model', value)
final_output[key] = itertoolz.frequencies(models)
The output results in:
{'AUDI': {'A3': 1, 'A7': 1}, 'BMW': {'3': 1, '5': 3, '7': 1}, 'MERCEDES': {'EClass': 0, 'SClass': 0}}
You could go for a bit of a separation of code and config:
conf = {
'BMW': {'format': '{}series', 'keys': ['3', '5', '7']},
'AUDI': {'format': '{}', 'keys': ['A3', 'A7']},
'MERCEDES': {'format': '{}Class', 'keys': ['E', 'S']},
}
def modelcount():
# retrieve `data`
# ...
result = {
k: {
v['format'].format(key): 0 for key in v['keys']
} for k, v in conf.items()
}
for car in data['result']:
com = car['company']
mod = car['model']
key = conf[com]['format'].format(mod)
result[com][key] += 1
for com in result:
result[com]['Total'] = sum(result[com].values())
return result
>>> modelcount()
{'BMW': {'3series': 1, '5series': 3, '7series': 1},
'AUDI': {'A3': 1, 'A7': 1},
'MERCEDES': {'EClass': 0, 'SClass': 0}}
This way, for more companies and models, you will only have to touch the conf, not the code. The time complexity of this is O(m+n) with m the total number of distinct models and n the number of cars in the API response.

Error while adding key/value to Python Dict in nested loop

I have a Json Structure as Follows:
{
"_id" : ObjectId("asdasda156121s"),
"Hp" : {
"bermud" : [
{
"abc" : {
"gfh" : 1,
"fgh" : 0.0,
"xyz" : [
{
"kjl" : "0",
"bnv" : 0,
}
],
"xvc" : "bv",
"hgth" : "INnn",
"sdf" : 0,
}
}
},
{
"abc" : {
"gfh" : 1,
"fgh" : 0.0,
"xyz" : [
{
"kjl" : "0",
"bnv" : 0,
}
],
"xvc" : "bv",
"hgth" : "INnn",
"sdf" : 0,
}
}
},
..
I am trying to parse this json and add a new value with key ['cat'] inside the object 'xyz',below is my py code.
data = []
for x in a:
for y in x['Hp'].values():
for z in y:
for k in z['abc']['xyz']:
for m in data:
det = m['response']
// Some processing with det whose output is stored in s
k['cat'] = s
print x
However when x is printed only the last value is being appended onto the whole dictionary, wheras there are different values for s. Its obvious that the 'cat' key is being overwritten everytime the loop rounds,but can't find a way to make it right.What mistake am I making?

logical error in python dictionary traversal

one of my queries in mongoDB through pymongo returns:
{ "_id" : { "origin" : "ABE", "destination" : "DTW", "carrier" : "EV" }, "Ddelay" : -5.333333333333333,
"Adelay" : -12.666666666666666 }
{ "_id" : { "origin" : "ABE", "destination" : "ORD", "carrier" : "EV" }, "Ddelay" : -4, "Adelay" : 14 }
{ "_id" : { "origin" : "ABE", "destination" : "ATL", "carrier" : "EV" }, "Ddelay" : 6, "Adelay" : 14 }
I am traversing the result as below in my python module but I am not getting all the 3 results but only two. I believe I should not use len(results) as I am doing currently. Can you please help me correctly traverse the result as I need to display all three results in the resultant json document on web ui.
Thank you.
code:
pipe = [{ '$match': { 'origin': {"$in" : [origin_ID]}}},
{"$group" :{'_id': { 'origin':"$origin", 'destination': "$dest",'carrier':"$carrier"},
"Ddelay" : {'$avg' :"$dep_delay"},"Adelay" : {'$avg' :"$arr_delay"}}}, {"$limit" : 4}]
results = connect.aggregate(pipeline=pipe)
#pdb.set_trace()
DATETIME_FORMAT = '%Y-%m-%d'
for x in range(len(results)):
origin = (results['result'][x])['_id']['origin']
destination = (results['result'][x])['_id']['destination']
carrier = (results['result'][x])['_id']['carrier']
Adelay = (results['result'][x])['Adelay']
Ddelay = (results['result'][x])['Ddelay']
obj = {'Origin':origin,
'Destination':destination,
'Carrier': carrier,
'Avg Arrival Delay': Adelay,
'Avg Dep Delay': Ddelay}
json_result.append(obj)
return json.dumps(json_result,indent= 2, sort_keys=False,separators=(',',':'))
Pymongo returns result in format:
{u'ok': 1.0, u'result': [...]}
So you should iterate over result:
for x in results['result']:
...
In your code you try to calculate length of dict, not length of result container.

Is there a good way python to compare complex dictionaries with unkonwn depth & level

I have two dictionary objects which are very complex and created by converting large xml files into python dictionaries.
I don't know the depth of the dictionaries and just want to compare and want the following output...
e.g. My dictionaries are like this
d1 = {"great grand father":
{"name":"John",
"grand father":
{"name":"Tom",
"father":
{"name":"Andy",
"Me":
{"name":"Mike",
"son":
{"name":"Tom"}
}
}
}
}
}
d2 is also a similar but could be possible any one of the field is missing or changed as below
d2 = {"great grand father":
{"name":"John",
"grand father":
{"name":"Tom",
"father":
{"name":"Andy",
"Me":
{"name":"Tonny",
"son":
{"name":"Tom"}
}
}
}
}
}
The dictionary comparison should give me results like this -
Expected Key/Val : Me->name/"Mike"
Actual Key/Val : Me->name/"Tonny"
If the key "name" does not exists in "Me" in d2, it should give me following output
Expected Key/Val : Me->name/"Mike"
Actual Key/Val : Me->name/NOT_FOUND
I repeat the dictionary depth could be variable or dynamically generated. The two dictionaries here are given as examples...
All the dictionary comparison questions and their answers which I have seen in SO are related fixed depth Dictionaries.....
You're in luck, I did this as part of a project where I worked.
You need a recursive function something like:
def checkDifferences(dict_a,dict_b,differences=[])
You can first check for keys that don't exist in one or the other.
e.g
Expected Name/Tom Actual None
Then you compare the types of the values i.e check if the value is a dict or a list etc.
If it is then you can recursively call the function using the value as dict_a/b. When calling recursively pass the differences array.
If the type of the value is a list and the list may have dictionaries within it then you need to covert the list to a dict and call the function on the converted dictionary.
I'm sorry I can't help more but I no longer have access to the source code. Hopefully this is enough to get you started.
Here I found a way to compare any two dictionaries -
I have tried with various dictionaries of any depths and worked for me. The code is not so modular but just for the reference -
import pprint
pp = pprint.PrettyPrinter(indent=4)
dict1 = { 'Person' : { 'Male' : {'Boys' : {'Roger' : {'age' : 20},
'Rafa' : {'age' : 25}
}
},
'Female' : { 'Girls' : {'Serena' : {'age' : 23},
'Maria' : {'age' : 15}
}
}
},
'Animal' : { 'Huge' : {'Elephant' : {'color' : 'black' }
}
}
}
'''
dict2 = { 'Person' : { 'Male' : {'Boys' : {'Roger' : {'age' : 20}
}
},
'Female' : { 'Girls' : {'Serena' : {'age' : 23},
'Maria' : {'age' : 1}
}
}
}
}
dict2 = { 'Person' : { 'Male' : {'Boys' : {'Roger' : {'age' : 20},
'Rafa' : {'age' : 2}
}
}
}
}
'''
dict2 = { 'Person' : { 'Male' : {'Boys' : {'Roger' : {'age' : 2}}},
'Female' : 'Serena'}
}
key_list = []
err_list = {}
def comp(exp,act):
for key in exp:
key_list.append(key)
exp_val = exp[key]
try:
act_val = act[key]
is_dict_exp = isinstance(exp_val,__builtins__.dict)
is_dict_act = isinstance(act_val,__builtins__.dict)
if is_dict_exp == is_dict_act == True:
comp(exp_val,act_val)
elif is_dict_exp == is_dict_act == False:
if not exp_val == act_val:
temp = {"Exp" : exp_val,"Act" : act_val}
err_key = "-->".join(key_list)
if err_list.has_key(key):
err_list[err_key].update(temp)
else:
err_list.update({err_key : temp})
else:
temp = {"Exp" : exp_val, "Act" : act_val}
err_key = "-->".join(key_list)
if err_list.has_key(key):
err_list[err_key].update(temp)
else:
err_list.update({err_key : temp})
except KeyError:
temp = {"Exp" : exp_val,"Act" : "NOT_FOUND"}
err_key = "-->".join(key_list)
if err_list.has_key(key):
err_list[err_key].update(temp)
else:
err_list.update({err_key : temp})
key_list.pop()
comp(dict1,dict2)
pp.pprint(err_list)
Here is the output of my code -
{ 'Animal': { 'Act': 'NOT_FOUND',
'Exp': { 'Huge': { 'Elephant': { 'color': 'black'}}}},
'Person-->Female': { 'Act': 'Serena',
'Exp': { 'Girls': { 'Maria': { 'age': 15},
'Serena': { 'age': 23}}}},
'Person-->Male-->Boys-->Rafa': { 'Act': 'NOT_FOUND', 'Exp': { 'age': 25}},
'Person-->Male-->Boys-->Roger-->age': { 'Act': 2, 'Exp': 20}
}
One can also try with other dictionaries given in commented code..
One more thing - The keys are checked in expected dictionary and the matched with an actual. If we pass dictionaries in alternate order the other way matching is also possible...
comp(dict2,dict1)

Categories