Python averaging list of lists of nested dicts

I have a list with this structure:
data = [
    [
        {
            "id": 713,
            "prediction": 4.8,
            "confidence": [
                {"percentile": "75", "lower": 4.8, "upper": 5.7}
            ],
        },
        {
            "id": 714,
            "prediction": 4.93,
            "confidence": [
                {"percentile": "75", "lower": 4.9, "upper": 5.7}
            ],
        },
    ],
    [
        {
            "id": 713,
            "prediction": 5.8,
            "confidence": [
                {"percentile": "75", "lower": 4.2, "upper": 6.7}
            ],
        },
        {
            "id": 714,
            "prediction": 2.93,
            "confidence": [
                {"percentile": "75", "lower": 1.9, "upper": 3.7}
            ],
        },
    ],
]
So here we have a list containing two lists, but it could also contain more. Each inner list consists of predictions, each with an id and confidence intervals stored in another list holding a dict.
What I need is to merge these lists so that I have one dict per id, with the average of the numeric values.
I have tried searching but have not found an answer that matches this nested structure.
The expected output would look like this:
merged_data = [
    {
        "id": 713,
        "prediction": 5.3,
        "confidence": [
            {"percentile": "75", "lower": 4.5, "upper": 6.2}
        ],
    },
    {
        "id": 714,
        "prediction": 3.93,
        "confidence": [
            {"percentile": "75", "lower": 3.4, "upper": 4.7}
        ],
    },
]

def merge_items(items):
    result = {}
    if len(items):
        result['id'] = items[0]['id']
        result['prediction'] = round(sum(item['prediction'] for item in items) / len(items), 2)
        result['confidence'] = [{
            'percentile': items[0]['confidence'][0]['percentile'],
            'lower': round(sum(item['confidence'][0]['lower'] for item in items) / len(items), 2),
            'upper': round(sum(item['confidence'][0]['upper'] for item in items) / len(items), 2),
        }]
    return result

result = []
ids = list({el['id'] for item in data for el in item})
for id_ in ids:
    to_merge = [sub_item for item in data for sub_item in item if sub_item['id'] == id_]
    result.append(merge_items(to_merge))
print(result)

dicc = {}
for e in data:  # `data` is the outer list from the question
    for d in e:
        if d["id"] not in dicc:
            dicc[d["id"]] = {"prediction": [], "lower": [], "upper": []}
        dicc[d["id"]]["prediction"].append(d["prediction"])
        dicc[d["id"]]["lower"].append(d["confidence"][0]["lower"])
        dicc[d["id"]]["upper"].append(d["confidence"][0]["upper"])
for k in dicc:
    dicc[k]["average_prediction"] = sum(dicc[k]["prediction"]) / len(dicc[k]["prediction"])
    dicc[k]["average_lower"] = sum(dicc[k]["lower"]) / len(dicc[k]["lower"])
    dicc[k]["average_upper"] = sum(dicc[k]["upper"]) / len(dicc[k]["upper"])
print(dicc)
{713: {'prediction': [4.8, 5.8], 'lower': [4.8, 4.2], 'upper': [5.7, 6.7], 'average_prediction': 5.3, 'average_lower': 4.5, 'average_upper': 6.2}, 714: {'prediction': [4.936893921359024, 2.936893921359024], 'lower': [4.9, 1.9], 'upper': [5.7, 3.7], 'average_prediction': 3.936893921359024, 'average_lower': 3.4000000000000004, 'average_upper': 4.7}}

You really have three parts to this question.
How do you unpack the lists and group by the ids in preparation for some kind of aggregation? You have lots of options, but a pretty classic one is to make a lookup table and append any new values:
groups = {}
# `data` is the outer list in your nested structure
for d in (d for sub in data for d in sub):
    groups.setdefault(d['id'], []).append(d)
How do you aggregate those dictionaries so that you have an average of all the numeric values? There are lots of approaches with varying numeric stability. I'll start with an easy one that recursively walks a partial result set and a new entry.
Note that this assumes an incredibly consistent object structure (like you have shown). If you sometimes have missing keys, mismatched lengths, or other discrepancies, you'll have to think long and hard about exactly what you want to happen when those structures are merged -- there isn't a one-size-fits-all solution.
def walk(avgs, new, n):
    """
    Most of this algorithm is just walking the object structure.
    We keep any keys, lists, strings, etc. the same and only
    average the numeric elements.
    """
    if isinstance(avgs, dict):
        return {k: walk(avgs[k], new[k], n) for k in avgs}
    if isinstance(avgs, list):
        return [walk(x, y, n) for x, y in zip(avgs, new)]
    if isinstance(avgs, float):
        # This is the only place that averaging actually happens.
        # At the risk of some accumulated error, this directly
        # computes the running mean of the first n+1 items.
        # Note: ints are *not* instances of float, so integer
        # fields like `id` fall through and pass unchanged below.
        return (avgs * n + new) / (n + 1.)
    return avgs

def merge(L):
    if not L:
        # never happens using the above grouping code
        return None
    d = L[0]
    for n, new in enumerate(L[1:], 1):
        d = walk(d, new, n)
    return d

averaged = {k: merge(v) for k, v in groups.items()}
You probably only want certain keys like the prediction to be averaged. You can do the filtering beforehand on the grouped objects or afterward (it's probably more efficient to do it beforehand):
# before
groups = {
    # any transformation you'd like to apply to the dictionaries
    k: [{s: d[s] for s in ('prediction', 'confidence')} for d in L]
    for k, L in groups.items()
}
# after
averaged = {
    # basically the same code, except there's only one object per key
    k: {s: d[s] for s in ('prediction', 'confidence')}
    for k, d in averaged.items()
}
For a note on efficiency, I created a bunch of intermediate lists, but those aren't really necessary. Instead of grouping then aggregating you can absolutely apply a rolling update algorithm and save some memory.
averaged = {}
# `data` is the outer list in your nested structure
for d in (d for sub in data for d in sub):
    key = d['id']
    d = {s: d[s] for s in ('prediction', 'confidence')}  # any desired transforms
    if key not in averaged:
        averaged[key] = (d, 1)
    else:
        agg, n = averaged[key]
        averaged[key] = (walk(agg, d, n), n + 1)
averaged = {k: v[0] for k, v in averaged.items()}
We still don't have the output formatted quite like you want (we have a dictionary, and you want a list where the keys are included in the objects). That's a pretty easy problem to solve though:
def inline_key(d, key):
    # not a pure function, but we're lazy, and the original
    # values are never used
    d['id'] = key
    return d

final_result = [inline_key(d, k) for k, d in averaged.items()]
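Putting the pieces of this answer together on the question's sample data, a minimal end-to-end run (a sketch that restates the helpers above in condensed form) might look like:

```python
data = [
    [{"id": 713, "prediction": 4.8,
      "confidence": [{"percentile": "75", "lower": 4.8, "upper": 5.7}]},
     {"id": 714, "prediction": 4.93,
      "confidence": [{"percentile": "75", "lower": 4.9, "upper": 5.7}]}],
    [{"id": 713, "prediction": 5.8,
      "confidence": [{"percentile": "75", "lower": 4.2, "upper": 6.7}]},
     {"id": 714, "prediction": 2.93,
      "confidence": [{"percentile": "75", "lower": 1.9, "upper": 3.7}]}],
]

def walk(avgs, new, n):
    # recursively average the float leaves; everything else passes through
    if isinstance(avgs, dict):
        return {k: walk(avgs[k], new[k], n) for k in avgs}
    if isinstance(avgs, list):
        return [walk(x, y, n) for x, y in zip(avgs, new)]
    if isinstance(avgs, float):
        return (avgs * n + new) / (n + 1.0)
    return avgs

# group by id
groups = {}
for d in (d for sub in data for d in sub):
    groups.setdefault(d["id"], []).append(d)

# fold each group down to its running average
averaged = {}
for key, items in groups.items():
    acc = items[0]
    for n, new in enumerate(items[1:], 1):
        acc = walk(acc, new, n)
    averaged[key] = acc

# re-inline the id keys into the objects
final_result = [dict(d, id=key) for key, d in averaged.items()]
print(final_result)
```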

Try this:
from copy import deepcopy

data = [
    [
        {
            "id": 713,
            "prediction": 4.8,
            "confidence": [
                {"percentile": "75", "lower": 4.8, "upper": 5.7}
            ],
        },
        {
            "id": 714,
            "prediction": 4.936893921359024,
            "confidence": [
                {"percentile": "75", "lower": 4.9, "upper": 5.7}
            ],
        },
    ],
    [
        {
            "id": 713,
            "prediction": 5.8,
            "confidence": [
                {"percentile": "75", "lower": 4.2, "upper": 6.7}
            ],
        },
        {
            "id": 714,
            "prediction": 2.936893921359024,
            "confidence": [
                {"percentile": "75", "lower": 1.9, "upper": 3.7}
            ],
        },
    ],
]

final_dict_list = []
processed_ids = []
for item in data:
    for dict_ele in item:
        if dict_ele["id"] in processed_ids:
            for final_item in final_dict_list:
                if final_item['id'] == dict_ele["id"]:
                    final_item["prediction"] += dict_ele["prediction"]
                    final_item["confidence"][0]["lower"] += dict_ele["confidence"][0]["lower"]
                    final_item["confidence"][0]["upper"] += dict_ele["confidence"][0]["upper"]
        else:
            final_dict = deepcopy(dict_ele)
            final_dict_list.append(final_dict)
            processed_ids.append(dict_ele["id"])

number_of_items = len(data)
for item in final_dict_list:
    item["prediction"] /= number_of_items
    item["confidence"][0]["lower"] /= number_of_items
    item["confidence"][0]["upper"] /= number_of_items
print(final_dict_list)
OUTPUT:
[
{'confidence': [{'upper': 6.2, 'lower': 4.5, 'percentile': '75'}], 'id': 713, 'prediction': 5.3},
{'confidence': [{'upper': 4.7, 'lower': 3.4000000000000004, 'percentile': '75'}], 'id': 714, 'prediction': 3.936893921359024}]
Just to note, this would have been much easier if the data structure had been designed a little differently.
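To illustrate that point with a hypothetical restructuring (not part of the question): if each batch were a flat dict keyed by id, with the numbers in a tuple, the whole merge would collapse to a single comprehension:

```python
# Hypothetical alternative layout: one dict per batch, keyed by id,
# each value being a (prediction, lower, upper) tuple.
batches = [
    {713: (4.8, 4.8, 5.7), 714: (4.93, 4.9, 5.7)},
    {713: (5.8, 4.2, 6.7), 714: (2.93, 1.9, 3.7)},
]

# zip the per-id tuples across batches and average position-wise
merged = {
    key: tuple(sum(vals) / len(vals) for vals in zip(*(b[key] for b in batches)))
    for key in batches[0]
}
print(merged)
```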

Related

How to filter one JSON based on another JSON in Python?

I'm trying to write a Python class that takes a JSON request and a JSON array, and then filters the array based on the data in the request. Here is the basic folder structure for the project:
my_input/
    request.json
    array.json
my_python/
    class.py
request.json has the data "place: 1" and "trait: 3", and is of the following form:
{
    "metadata": {
        "id": "request"
    },
    "data": {
        "place": [
            "1"
        ],
        "trait": [
            "3"
        ]
    }
}
array.json has 2 sub-arrays, locationArray and measurementArray; in locationArray, we see that place 1 is associated with plots 2 and 3:
{
    "data": {
        "measurementArray": {
            "headers": ["plot", "trait", "value"],
            "data": [
                [1, 3, 2.7],
                [2, 2, 1.8],
                [3, 3, 3.6]
            ]
        },
        "locationArray": {
            "headers": ["place", "plot"],
            "data": [
                [1, 2],
                [3, 4],
                [1, 3]
            ]
        }
    }
}
Then we filter the measurement rows down to plots 2 and 3 and keep those where trait is 3. In this example, that returns just one row from the measurement array, because only one row at plot 2 or 3 has trait 3:
[3,3,3.6]
How could I write a class to parse the request json and then filter the array in the second JSON? Currently, I only have a blank class.
Thanks so much for considering this question!
I'm not sure if it works
class my_input:
    def __init__(self, my_request, my_array):
        self.my_request = my_request
        self.my_array = my_array

    def get_places(self):
        return self.my_request['data']['place']

    def get_traits(self):
        return self.my_request['data']['trait']

    def get_associated_plots(self, places):
        associated_plots = []
        place_st = set(places)
        for current_data in self.my_array['data']['locationArray']['data']:
            if str(current_data[0]) in place_st:
                associated_plots.append(current_data[1])
        return associated_plots

    def get_rows(self, associated_plots, traits):
        traits_st = set(traits)
        associated_plots_st = set(associated_plots)
        rows = []  # create once, outside the loop, so matches accumulate
        for current_data in self.my_array['data']['measurementArray']['data']:
            if current_data[0] in associated_plots_st and str(current_data[1]) in traits_st:
                rows.append(current_data)
        return rows
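For what it's worth, the same filtering can also be sketched without a class; the inline dicts below stand in for the two JSON files (normally you would `json.load` them from `my_input/`):

```python
# Stand-ins for request.json and array.json.
request = {"metadata": {"id": "request"},
           "data": {"place": ["1"], "trait": ["3"]}}
array = {"data": {
    "measurementArray": {"headers": ["plot", "trait", "value"],
                         "data": [[1, 3, 2.7], [2, 2, 1.8], [3, 3, 3.6]]},
    "locationArray": {"headers": ["place", "plot"],
                      "data": [[1, 2], [3, 4], [1, 3]]}}}

def filter_measurements(request, array):
    places = set(request["data"]["place"])
    traits = set(request["data"]["trait"])
    # plots associated with the requested places (note the str() casts:
    # the request stores numbers as strings, the arrays as ints)
    plots = {plot for place, plot in array["data"]["locationArray"]["data"]
             if str(place) in places}
    # keep measurement rows at those plots with a requested trait
    return [row for row in array["data"]["measurementArray"]["data"]
            if row[0] in plots and str(row[1]) in traits]

print(filter_measurements(request, array))
```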

Creating a new dict from a list of dicts

I have a list of dictionaries in the following format
data = [
    {
        "Members": [
            "user11",
            "user12",
            "user13"
        ],
        "Group": "Group1"
    },
    {
        "Members": [
            "user11",
            "user21",
            "user22",
            "user23"
        ],
        "Group": "Group2"
    },
    {
        "Members": [
            "user11",
            "user22",
            "user31",
            "user32",
            "user33"
        ],
        "Group": "Group3"
    }
]
I'd like to return a dictionary where every user is a key and the value is a list of all the groups which they belong to. So for the above example, this dict would be:
newdict = {
    "user11": ["Group1", "Group2", "Group3"],
    "user12": ["Group1"],
    "user13": ["Group1"],
    "user21": ["Group2"],
    "user22": ["Group2", "Group3"],
    "user23": ["Group2"],
    "user31": ["Group3"],
    "user32": ["Group3"],
    "user33": ["Group3"],
}
My initial attempt was using a defaultdict in a nested loop, but this is slow (and also isn't returning what I expected). Here was that attempt:
user_groups = defaultdict(list)
for user in users:
    for item in data:
        if user in item["Members"]:
            user_groups[user].append(item["Group"])
Does anyone have any suggestions for improvement for speed, and also just a generally better way to do this?
Code
new_dict = {}
for d in data:  # each item is a dictionary
    members = d["Members"]
    for m in members:
        # append the corresponding group for each member
        new_dict.setdefault(m, []).append(d["Group"])
print(new_dict)
Out
{'user11': ['Group1', 'Group2', 'Group3'],
'user12': ['Group1'],
'user13': ['Group1'],
'user21': ['Group2'],
'user22': ['Group2', 'Group3'],
'user23': ['Group2'],
'user31': ['Group3'],
'user32': ['Group3'],
'user33': ['Group3']}
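The defaultdict route the asker mentioned also works once the loop order is flipped (iterate groups, not users), which removes the repeated `user in item["Members"]` membership scans; a sketch on the question's data:

```python
from collections import defaultdict

data = [
    {"Members": ["user11", "user12", "user13"], "Group": "Group1"},
    {"Members": ["user11", "user21", "user22", "user23"], "Group": "Group2"},
    {"Members": ["user11", "user22", "user31", "user32", "user33"], "Group": "Group3"},
]

user_groups = defaultdict(list)
for item in data:                   # one pass over the groups
    for member in item["Members"]:  # O(total memberships) overall
        user_groups[member].append(item["Group"])

print(dict(user_groups))
```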

Particular nested dictionary from a Pandas DataFrame for circle packing

I am trying to create a particular nested dictionary from a Pandas DataFrame, in order to then visualize it.
dat = pd.DataFrame({'cat_1': ['marketing', 'marketing', 'marketing', 'communications'],
                    'child_cat': ['marketing', 'social media', 'marketing', 'communications'],
                    'skill': ['digital marketing', 'media marketing', 'research', 'seo'],
                    'value': ['80', '101', '35', '31']})
and I would like to turn this into a dictionary that looks a bit like this:
{
    "name": "general skills",
    "children": [
        {
            "name": "marketing",
            "children": [
                {
                    "name": "marketing",
                    "children": [
                        {"name": "digital marketing", "value": 80},
                        {"name": "research", "value": 35}
                    ]
                },
                {
                    "name": "social media",  // notice that this is a sibling of the parent marketing
                    "children": [
                        {"name": "media marketing", "value": 101}
                    ]
                }
            ]
        },
        {
            "name": "communications",
            "children": [
                {
                    "name": "communications",
                    "children": [
                        {"name": "seo", "value": 31}
                    ]
                }
            ]
        }
    ]
}
So cat_1 is the parent node, child_cat is its child, and skill is the child of child_cat. I am having trouble creating the nested children lists. Any help?
With a lot of inefficiencies I came up with this solution. Probably highly sub-optimal
final = {}
# control dict to make sure each broad category is added only once
contrl_dict = {}
contrl_dict['dummy'] = None
final['name'] = 'variants'
final['children'] = []
# line is the values of each row
for idx, line in enumerate(dat.values):
    # parent categories dict
    broad_dict_1 = {}
    print(line)
    # this takes every value of the row minus the value at the end
    for jdx, col in enumerate(line[:-1]):
        # look into the broad category first
        if jdx == 0:
            # check our control dict - does this category exist? if not, add it
            if col not in contrl_dict:
                contrl_dict[col] = 'added'
                # the parent dict takes the name
                broad_dict_1['name'] = col
                # its children are the child categories, populated below
                broad_dict_1['children'] = []
                # go over the child categories of this parent
                for ydx, child in enumerate(dat[dat.cat_1 == col].child_cat.unique()):
                    # sub-categories dict
                    prov_dict = {}
                    prov_dict['name'] = child
                    # children is again a list
                    prov_dict['children'] = []
                    # now isolate the skills and values of each child category and append them
                    for row in dat[dat.child_cat == child].values:
                        prov_d_3 = {}
                        # in each row, values 2 and 3 are name and value respectively
                        for xdx, direct in enumerate(row):
                            if xdx == 2:
                                prov_d_3['name'] = direct
                            if xdx == 3:
                                prov_d_3['size'] = direct
                        prov_dict['children'].append(prov_d_3)
                    broad_dict_1['children'].append(prov_dict)
                final['children'].append(broad_dict_1)
            # if it already exists in the control dict then move on
            else:
                continue
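A more direct construction can be sketched in plain Python over the row dicts (e.g. `dat.to_dict('records')`; the inline rows below stand in for the DataFrame, with value cast to int):

```python
rows = [
    {"cat_1": "marketing", "child_cat": "marketing", "skill": "digital marketing", "value": 80},
    {"cat_1": "marketing", "child_cat": "social media", "skill": "media marketing", "value": 101},
    {"cat_1": "marketing", "child_cat": "marketing", "skill": "research", "value": 35},
    {"cat_1": "communications", "child_cat": "communications", "skill": "seo", "value": 31},
]

def build_tree(rows):
    root = {"name": "general skills", "children": []}
    # lookup of already-created nodes: cat_1 -> parent node,
    # (cat_1, child_cat) -> child node
    index = {}
    for r in rows:
        parent = index.get(r["cat_1"])
        if parent is None:
            parent = index[r["cat_1"]] = {"name": r["cat_1"], "children": []}
            root["children"].append(parent)
        child = index.get((r["cat_1"], r["child_cat"]))
        if child is None:
            child = index[(r["cat_1"], r["child_cat"])] = {"name": r["child_cat"], "children": []}
            parent["children"].append(child)
        child["children"].append({"name": r["skill"], "value": r["value"]})
    return root

tree = build_tree(rows)
```

Because the index is consulted before creating any node, each category appears once and later rows simply append their leaves to the existing node.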

Is there an efficient way to accumulate values from a nested instance in Python?

I have an instance that looks like this:
{
    "_id": "cgx",
    "capacity": 1000000000,
    "chunks": [
        {
            "prs": [
                {
                    "segs": [
                        {"node_id": "server-0"}
                    ]
                }
            ]
        },
        {
            "prs": [
                {
                    "segs": [
                        {"node_id": "server-2"}
                    ]
                },
                {
                    "segs": [
                        {"node_id": "server-3"}
                    ]
                }
            ]
        }
    ],
    "health": "healthy",
    "status": "ok"
}
each 'chunk' in the chunks array is a Chunk instance, each 'pr' in the prs array is a Pr instance, each 'seg' in the segs array is a Seg instance
I want to traverse the instance and accumulate a set of 'node_id' values from all of the instance. I did it in the following way:
def setDistinctElements(self, result):
    elements = []
    for chunk in getattr(result, 'chunks'):
        for pr in getattr(chunk, 'prs'):
            for seg in getattr(pr, 'segs'):
                elements.append(getattr(seg, 'node_id'))
Is there a more efficient way to do it instead of looping 3 times? Each such instance can have a lot of 'chunks', 'prs' and 'segs' instances nested in it.
I couldn't run your code so made a similar one.
and to run it faster I converted the JSON to a string and manipulated it to get what I need; it's almost 2x faster.
lst = []
for row in a.replace('[', '').replace(' ', '').replace('{', '').replace('\n', '').replace(']', '').replace('}', '').replace('"', '').split(','):
    if "node_id" in row:
        lst.append(row.split(':')[-1])
You can use recursion with a generator to traverse a structure of any depth:
data = {'_id': 'cgx', 'capacity': 1000000000, 'chunks': [{'prs': [{'segs': [{'node_id': 'server-0'}]}]}, {'prs': [{'segs': [{'node_id': 'server-2'}]}, {'segs': [{'node_id': 'server-3'}]}]}], 'health': 'healthy', 'status': 'ok'}
def get_nodes(d):
    for a, b in d.items():
        if a == 'node_id':
            yield b
        elif isinstance(b, dict):
            yield from get_nodes(b)
        elif isinstance(b, list):
            for c in b:
                yield from get_nodes(c)

print(list(get_nodes(data)))
Output:
['server-0', 'server-2', 'server-3']
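If the structure is fixed at three known levels (as in the question), the loops can also be collapsed into a single set comprehension, which deduplicates as it goes; the same asymptotic cost, but with no intermediate list (sketched here on plain dicts rather than the OP's class instances):

```python
data = {'_id': 'cgx', 'chunks': [
    {'prs': [{'segs': [{'node_id': 'server-0'}]}]},
    {'prs': [{'segs': [{'node_id': 'server-2'}]},
             {'segs': [{'node_id': 'server-3'}]}]},
]}

# one expression replacing the three nested loops
elements = {seg['node_id']
            for chunk in data['chunks']
            for pr in chunk['prs']
            for seg in pr['segs']}
print(sorted(elements))
```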

Robust way to sum all values corresponding to a particular object's property?

I have an array as such.
items = [
    {
        "title": "title1",
        "category": "category1",
        "value": 200
    },
    {
        "title": "title2",
        "category": "category2",
        "value": 450
    },
    {
        "title": "title3",
        "category": "category1",
        "value": 100
    }
]
This array consists of many dictionaries with a property category and value.
What is the robust way of getting an array of category objects with their value summed like:
data = [
    {
        "category": "category1",
        "value": 300
    },
    {
        "category": "category2",
        "value": 450
    }
]
I'm looking for the best algorithm or way possible for both the small array as well as the huge array. If there is an existing algorithm please point me to the source.
What I tried?
data = []
for each item in items:
    if data has a dictionary with dictionary.category == item.category:
        data's dictionary.value = data's dictionary.value + item.value
    else:
        data.push({"category": item.category, "value": item.value})
Note: Any programming language is welcome. Please comment before downvoting.
In javascript, you can use reduce to group the array into an object. Use the category as the property. Use Object.values to convert the object into an array.
var items = [
    {
        "title": "title1",
        "category": "category1",
        "value": 200
    },
    {
        "title": "title2",
        "category": "category2",
        "value": 450
    },
    {
        "title": "title3",
        "category": "category1",
        "value": 100
    }
];

var data = Object.values(items.reduce((c, v) => {
    c[v.category] = c[v.category] || {category: v.category, value: 0};
    c[v.category].value += v.value;
    return c;
}, {}));
console.log(data);
What you need is a SQL GROUP BY-like operation. Group-by operations are usually handled with hashing algorithms, and if all your data fits in memory (small to large data structures), you can implement one very quickly.
If your data structure is huge, you will need to use intermediate storage (such as the hard drive or a database).
An easy python approach will be:
data_tmp = {}
for item in items:
    if item['category'] not in data_tmp:
        data_tmp[item['category']] = 0
    data_tmp[item['category']] += item['value']
data = []
for k, v in data_tmp.items():
    data.append({
        'category': k,
        'value': v
    })
# done
If you want more pythonic code you can use a defaultdict:
from collections import defaultdict

data_tmp = defaultdict(int)
for item in items:
    data_tmp[item['category']] += item['value']
data = []
for k, v in data_tmp.items():
    data.append({
        'category': k,
        'value': v
    })
# done
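The final list-building loop in either version can also be written as a single comprehension; a self-contained sketch on the question's items:

```python
from collections import defaultdict

items = [
    {"title": "title1", "category": "category1", "value": 200},
    {"title": "title2", "category": "category2", "value": 450},
    {"title": "title3", "category": "category1", "value": 100},
]

# hash-based group-by: one dict lookup per item
data_tmp = defaultdict(int)
for item in items:
    data_tmp[item["category"]] += item["value"]

# rebuild the requested list-of-dicts shape in one expression
data = [{"category": k, "value": v} for k, v in data_tmp.items()]
print(data)
```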
In Python, Pandas is likely to be a more convenient and efficient way of doing this.
import pandas as pd
df = pd.DataFrame(items)
# select the "value" column so the "title" strings aren't summed too
sums = df.groupby("category", as_index=False)["value"].sum()
data = sums.to_dict("records")
For the final step, it may be more convenient to leave sums as a dataframe and work with it like that instead of converting back to a list of dictionaries.
Using itertools.groupby
from itertools import groupby

d = []
lista = sorted(items, key=lambda x: x['category'])
for k, g in groupby(lista, key=lambda x: x['category']):
    temp = {}
    temp['category'] = k
    temp['value'] = sum(i['value'] for i in g)
    d.append(temp)
print(d)
# [{'category': 'category1', 'value': 300}, {'category': 'category2', 'value': 450}]