Faster data aggregation algorithms for nested objects - python

I am having some trouble learning about new methods of flattening/aggregating nested object data in python. My current implementation is rather slow, and I want to know some approaches to speed up processing. Consider that I have a dataset of donations defined as:
donations = [
{
"amount": 100,
"organization": {
"name": "Org 1",
"total_budget": 8000,
"states": [
{
"name": "Maine",
"code": "ME"
},
{
"name": "Massachusetts",
"code": "MA"
}
]
}
},
{
"amount": 5000,
"organization": {
"name": "Org 2",
"total_budget": 10000,
"states": [
{
"name": "Massachusetts",
"code": "MA"
}
]
}
},
{
"amount": 5000,
"organization": {
"name": "Org 1",
"total_budget": 8000,
"states": [
{
"name": "Maine",
"code": "ME"
},
{
"name": "Massachusetts",
"code": "MA"
}
]
}
}
]
The relationship of these objects is such that a donation is related to a single organization, and an organization can be related to one or more states.
I additionally can get just the organization dataset as:
organizations = [
{
"name": "Org 1",
"total_budget": 8000,
"states": [
{
"name": "Maine",
"code": "ME"
},
{
"name": "Massachusetts",
"code": "MA"
}
]
},
{
"name": "Org 2",
"total_budget": 10000,
"states": [
{
"name": "Massachusetts",
"code": "MA"
}
]
}
]
The output I am looking to achieve is an aggregation, by state, of the total donations and total budget, where the donation amounts and the organization's total budget are evenly distributed among all of the states the organization is associated with. Example for the above dataset:
results = {
"ME": {
"name": "Maine",
"total_donations": 2550
"total_budget": 4000
},
"MA": {
"name": "Massachusetts",
"total_donations": 7550
"total_budget": 14000
}
}
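(To spell out the arithmetic: Org 1 received 100 + 5000 = 5100 in donations and has a budget of 8000, both split evenly between ME and MA, so each state gets 2550 and 4000; Org 2's 5000 donation and 10000 budget go entirely to MA, giving MA 2550 + 5000 = 7550 and 4000 + 10000 = 14000.)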
What I have tried so far is to use for loops to iterate through each donation and organization, and sort them into a defaultdict:
from collections import defaultdict
def get_stats():
    return {"total_donations": 0, "total_budget": 0, "name": ""}

results = defaultdict(get_stats)

for donation in donations:
    for state in donation["organization"]["states"]:
        results[state["code"]]["total_donations"] += donation["amount"] / len(donation["organization"]["states"])

for organization in organizations:
    for state in organization["states"]:
        results[state["code"]]["total_budget"] += organization["total_budget"] / len(organization["states"])
        results[state["code"]]["name"] = state["name"]
I was thinking about using map/reduce here, but I didn't get the sense that those would improve performance. Any advice here would be super appreciated.
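One direction that might help (a minimal sketch, assuming organization names uniquely identify organizations): sum the donations per organization in a single pass, then distribute each organization's donation total and budget across its states once, so every state list is traversed only one time.
from collections import defaultdict

# Sum donations per organization first (assumes "name" is a unique key).
donation_totals = defaultdict(float)
for donation in donations:
    donation_totals[donation["organization"]["name"]] += donation["amount"]

results = defaultdict(lambda: {"total_donations": 0, "total_budget": 0, "name": ""})
for organization in organizations:
    states = organization["states"]
    donation_share = donation_totals[organization["name"]] / len(states)
    budget_share = organization["total_budget"] / len(states)
    for state in states:
        entry = results[state["code"]]
        entry["total_donations"] += donation_share
        entry["total_budget"] += budget_share
        entry["name"] = state["name"]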

Related

Filtering an array on nested values by using values from another array

I have this array of 3 objects.
The parameter that interests me is "id", which is nested inside the "categories" attribute.
list = [
{
"title": "\u00c9glise Saint-Julien",
"distance": 1841,
"excursionDistance": 1575,
"categories": [
{
"id": "300-3200-0030",
"name": "\u00c9glise",
"primary": true
},
{
"id": "300-3000-0025",
"name": "Monument historique"
}
]
},
{
"title": "Sevdec",
"distance": 2250,
"excursionDistance": 301,
"categories": [
{
"id": "700-7600-0322",
"name": "Station de recharge",
"primary": true
}
]
},
{
"title": "SIEGE 27",
"distance": 2651,
"excursionDistance": 1095,
"categories": [
{
"id": "700-7600-0322",
"name": "Station de recharge",
"primary": true
}
]
}
]
Then I have these two arrays that contain ids:
mCat1 = ["300-3000-0000","300-3000-0023","300-3000-0030","300-3000-0025","300-3000-0024","300-3100"] # macro cat1 = tourism
mCat2 = ["400-4300","700-7600-0322"]
I need to filter "list" on "mCat1" in order to extract in a new variable the object(s) that have at least one "id" that matches those in "mCat1".
Then I need to do the same with "mCat2".
In this example the expected result would be:
mCat1Result = [{
"title": "\u00c9glise Saint-Julien",
"distance": 1841,
"excursionDistance": 1575,
"categories": [
{
"id": "300-3200-0030",
"name": "\u00c9glise",
"primary": true
},
{
"id": "300-3000-0025",
"name": "Monument historique"
}
]
}]
mCat2Result = [{
"title": "Sevdec",
"distance": 2250,
"excursionDistance": 301,
"categories": [
{
"id": "700-7600-0322",
"name": "Station de recharge",
"primary": true
}
]
},
{
"title": "SIEGE 27",
"distance": 2651,
"excursionDistance": 1095,
"categories": [
{
"id": "700-7600-0322",
"name": "Station de recharge",
"primary": true
}
]
}]
What would be the most efficient way to do this? I am able to do it using loops, but that is very resource intensive on large datasets.
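One approach that might help (a sketch, assuming list has already been parsed from JSON into Python dicts): turn each id list into a set so membership tests are O(1), then keep any object whose categories contain at least one matching id.
mCat1_ids = set(mCat1)
mCat2_ids = set(mCat2)

def filter_by_ids(items, id_set):
    # Keep objects that have at least one category id in id_set.
    return [item for item in items
            if any(category["id"] in id_set for category in item["categories"])]

mCat1Result = filter_by_ids(list, mCat1_ids)   # "list" is the variable name used above
mCat2Result = filter_by_ids(list, mCat2_ids)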

Is there a way to add curly brackets around a list of dictionaries already existing within a JSON file?

I currently have two JSONs that I want to merge into one single JSON; additionally, I want to add in a slight change.
Firstly, these are the two JSONs in question.
An intents JSON:
[
{
"ID": "G1",
"intent": "password_reset",
"examples": [
{
"text": "I forgot my password"
},
{
"text": "I can't log in"
},
{
"text": "I can't access the site"
},
{
"text": "My log in is failing"
},
{
"text": "I need to reset my password"
}
]
},
{
"ID": "G2",
"intent": "account_closure",
"examples": [
{
"text": "I want to close my account"
},
{
"text": "I want to terminate my account"
}
]
},
{
"ID": "G3",
"intent": "account_creation",
"examples": [
{
"text": "I want to open an account"
},
{
"text": "Create account"
}
]
},
{
"ID": "G4",
"intent": "complaint",
"examples": [
{
"text": "A member of staff was being rude"
},
{
"text": "I have a complaint"
}
]
}
]
and an entities JSON:
[
{
"ID": "K1",
"entity": "account_type",
"values": [
{
"type": "synonyms",
"value": "business",
"synonyms": [
"corporate"
]
},
{
"type": "synonyms",
"value": "personal",
"synonyms": [
"vanguard",
"student"
]
}
]
},
{
"ID": "K2",
"entity": "beverage",
"values": [
{
"type": "synonyms",
"value": "hot",
"synonyms": [
"heated",
"warm"
]
},
{
"type": "synonyms",
"value": "cold",
"synonyms": [
"ice",
"freezing"
]
}
]
}
]
The expected outcome is to create a JSON file that mimics this structure:
{
"intents": [
{
"intent": "password_reset",
"examples": [
{
"text": "I forgot my password"
},
{
"text": "I want to reset my password"
}
],
"description": "Reset a user password"
}
],
"entities": [
{
"entity": "account_type",
"values": [
{
"type": "synonyms",
"value": "business",
"synonyms": [
"company",
"corporate",
"enterprise"
]
},
{
"type": "synonyms",
"value": "personal",
"synonyms": []
}
],
"fuzzy_match": true
}
],
"metadata": {
"api_version": {
"major_version": "v2",
"minor_version": "2018-11-08"
}
},
"dialog_nodes": [
{
"type": "standard",
"title": "anything_else",
"output": {
"generic": [
{
"values": [
{
"text": "I didn't understand. You can try rephrasing."
},
{
"text": "Can you reword your statement? I'm not understanding."
},
{
"text": "I didn't get your meaning."
}
],
"response_type": "text",
"selection_policy": "sequential"
}
]
},
"conditions": "anything_else",
"dialog_node": "Anything else",
"previous_sibling": "node_4_1655399659061",
"disambiguation_opt_out": true
},
{
"type": "event_handler",
"output": {
"generic": [
{
"title": "What type of account do you hold with us?",
"options": [
{
"label": "Personal",
"value": {
"input": {
"text": "personal"
}
}
},
{
"label": "Business",
"value": {
"input": {
"text": "business"
}
}
}
],
"response_type": "option"
}
]
},
"parent": "slot_9_1655398217028",
"event_name": "focus",
"dialog_node": "handler_6_1655398217052",
"previous_sibling": "handler_7_1655398217052"
},
{
"type": "event_handler",
"output": {},
"parent": "slot_9_1655398217028",
"context": {
"account_type": "#account_type"
},
"conditions": "#account_type",
"event_name": "input",
"dialog_node": "handler_7_1655398217052"
},
{
"type": "standard",
"title": "business_account",
"output": {
"generic": [
{
"values": [
{
"text": "We have notified your corporate security team, they will be in touch to reset your password."
}
],
"response_type": "text",
"selection_policy": "sequential"
}
]
},
"parent": "node_3_1655397279884",
"next_step": {
"behavior": "jump_to",
"selector": "body",
"dialog_node": "node_4_1655399659061"
},
"conditions": "#account_type:business",
"dialog_node": "node_1_1655399028379",
"previous_sibling": "node_3_1655399027429"
},
{
"type": "standard",
"title": "intent_collection",
"output": {
"generic": [
{
"values": [
{
"text": "Thank you for confirming that you want to reset your password."
}
],
"response_type": "text",
"selection_policy": "sequential"
}
]
},
"next_step": {
"behavior": "jump_to",
"selector": "body",
"dialog_node": "node_3_1655397279884"
},
"conditions": "#password_reset",
"dialog_node": "node_3_1655396920143",
"previous_sibling": "Welcome"
},
{
"type": "frame",
"title": "account_type_confirmation",
"output": {
"generic": [
{
"values": [
{
"text": "Thank you"
}
],
"response_type": "text",
"selection_policy": "sequential"
}
]
},
"parent": "node_3_1655396920143",
"context": {},
"next_step": {
"behavior": "skip_user_input"
},
"conditions": "#password_reset",
"dialog_node": "node_3_1655397279884"
},
{
"type": "standard",
"title": "personal_account",
"output": {
"generic": [
{
"values": [
{
"text": "We have sent you an email with a password reset link."
}
],
"response_type": "text",
"selection_policy": "sequential"
}
]
},
"parent": "node_3_1655397279884",
"next_step": {
"behavior": "jump_to",
"selector": "body",
"dialog_node": "node_4_1655399659061"
},
"conditions": "#account_type:personal",
"dialog_node": "node_3_1655399027429"
},
{
"type": "standard",
"title": "reset_confirmation",
"output": {
"generic": [
{
"values": [
{
"text": "Do you need assistance with anything else today?"
}
],
"response_type": "text",
"selection_policy": "sequential"
}
]
},
"digress_in": "does_not_return",
"dialog_node": "node_4_1655399659061",
"previous_sibling": "node_3_1655396920143"
},
{
"type": "slot",
"output": {},
"parent": "node_3_1655397279884",
"variable": "$account_type",
"dialog_node": "slot_9_1655398217028",
"previous_sibling": "node_1_1655399028379"
},
{
"type": "standard",
"title": "welcome",
"output": {
"generic": [
{
"values": [
{
"text": "Hello. How can I help you?"
}
],
"response_type": "text",
"selection_policy": "sequential"
}
]
},
"conditions": "welcome",
"dialog_node": "Welcome"
}
],
"counterexamples": [],
"system_settings": {
"off_topic": {
"enabled": true
},
"disambiguation": {
"prompt": "Did you mean:",
"enabled": true,
"randomize": true,
"max_suggestions": 5,
"suggestion_text_policy": "title",
"none_of_the_above_prompt": "None of the above"
},
"human_agent_assist": {
"prompt": "Did you mean:"
},
"intent_classification": {
"training_backend_version": "v2"
},
"spelling_auto_correct": true
},
"learning_opt_out": false,
"name": "Reset Password",
"language": "en",
"description": "Basic Password Reset Request"
}
So what I am missing in my original files is essentially:
"intents":
and for the entities file:
"entities"
at the start of each list of dictionaries.
Additionally, I would need to wrap the whole thing in curly braces to comply with json formatting.
As seen, the final goal is not just appending these two to one another; the file technically continues with some other JSON code that I have yet to write and deal with.
My question now is as follows: by what method can I either add in these keys and the braces to the individual files and then combine them into a single JSON, or, alternatively, by what method can I read in these files and combine them with the changes all in one go?
The new output file closing on a curly brace after the entities list of dicts is an acceptable outcome for me at this time, so that I can continue to make changes and hopefully learn from this how to make such changes myself in the future.
TIA
JSON is just a string format. You can load it into a language structure (in Python, that is lists and dicts), do what you need, then dump it back out. So you don't "add strings" or "add brackets"; you modify the structure.
import json

with open('intents.txt') as f:
    intents = json.load(f)    # load a list
with open('entities.txt') as f:
    entities = json.load(f)   # load a list

# create a dict with both lists under their keys
content = {
    "intents": intents,
    "entities": entities
}

# write to a new file (the name is arbitrary; reusing an input filename would overwrite it)
with open('combined.json', 'w') as f:
    json.dump(content, f, indent=2)
If you're reading all the JSON in as a string, you can just prepend '{"intents":' to the start and append a closing '}' (note the double quotes around the key so the result stays valid JSON).
myJson = "your json string"
myWrappedJson = '{"intents":' + myJson + "}"
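If you go the string route, it may be worth checking that the wrapped result still parses (a quick sanity check, not part of the original answer):
import json
parsed = json.loads(myWrappedJson)  # raises json.JSONDecodeError if the wrapping broke the JSON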

Creating custom JSON from existing JSON using Python

(Python beginner alert) I am trying to create a custom JSON from an existing JSON. The scenario is: I have a source which can send many sets of fields, but I want to cherry-pick some of them and create a subset while maintaining the original JSON structure. Original Sample
{
"Response": {
"rCode": "11111",
"rDesc": "SUCCESS",
"pData": {
"code": "123-abc-456-xyz",
"sData": [
{
"receiptTime": "2014-03-02T00:00:00.000",
"sessionDate": "2014-02-28",
"dID": {
"d": {
"serialNo": "3432423423",
"dType": "11111",
"dTypeDesc": "123123sd"
},
"mode": "xyz"
},
"usage": {
"duration": "661",
"mOn": [
"2014-02-28_20:25:00",
"2014-02-28_22:58:00"
],
"mOff": [
"2014-02-28_21:36:00",
"2014-03-01_03:39:00"
]
},
"set": {
"abx": "1",
"ayx": "1",
"pal": "1"
},
"rEvents": {
"john": "doe",
"lorem": "ipsum"
}
},
{
"receiptTime": "2014-04-02T00:00:00.000",
"sessionDate": "2014-04-28",
"dID": {
"d": {
"serialNo": "123123",
"dType": "11111",
"dTypeDesc": "123123sd"
},
"mode": "xyz"
},
"usage": {
"duration": "123",
"mOn": [
"2014-04-28_20:25:00",
"2014-04-28_22:58:00"
],
"mOff": [
"2014-04-28_21:36:00",
"2014-04-01_03:39:00"
]
},
"set": {
"abx": "4",
"ayx": "3",
"pal": "1"
},
"rEvents": {
"john": "doe",
"lorem": "ipsum"
}
}
]
}
}
}
Here the sData array has got a few tags, out of which I want to keep only 24 and get rid of the rest. I know I could use element.pop(), but I cannot go and delete every new incoming field each time the source publishes one. Below is the expected output -
Expected Output
{
"Response": {
"rCode": "11111",
"rDesc": "SUCCESS",
"pData": {
"code": "123-abc-456-xyz",
"sData": [
{
"receiptTime": "2014-03-02T00:00:00.000",
"sessionDate": "2014-02-28",
"usage": {
"duration": "661",
"mOn": [
"2014-02-28_20:25:00",
"2014-02-28_22:58:00"
],
"mOff": [
"2014-02-28_21:36:00",
"2014-03-01_03:39:00"
]
},
"set": {
"abx": "1",
"ayx": "1",
"pal": "1"
}
},
{
"receiptTime": "2014-04-02T00:00:00.000",
"sessionDate": "2014-04-28",
"usage": {
"duration": "123",
"mOn": [
"2014-04-28_20:25:00",
"2014-04-28_22:58:00"
],
"mOff": [
"2014-04-28_21:36:00",
"2014-04-01_03:39:00"
]
},
"set": {
"abx": "4",
"ayx": "3",
"pal": "1"
}
}
]
}
}
}
I took reference from How can I create a new JSON object form another using Python? but it's not working as expected. Looking forward to inputs/solutions from all of you gurus. Thanks in advance.
Kind of like this:
import json

data = json.load(open("fullset.json"))

def subset(d):
    newd = {}
    # note: 'sessionDate', not 'sessionData', to match the source JSON
    for name in ('receiptTime', 'sessionDate', 'usage', 'set'):
        newd[name] = d[name]
    return newd

data['Response']['pData']['sData'] = [subset(d) for d in data['Response']['pData']['sData']]

json.dump(data, open('newdata.json', 'w'))
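If some entries might not contain every kept field, a slightly more defensive variant (my own assumption, not from the original answer) copies only the keys that are actually present:
def subset(d, keep=('receiptTime', 'sessionDate', 'usage', 'set')):
    # Silently skip any kept field that is missing from this entry.
    return {name: d[name] for name in keep if name in d}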

3 levels json count in python

I am new at Python; I've worked with other languages... I've made this code in Java and it works, but now I must do it in Python. I have a JSON with 3 levels; the first two are resources and usages, and I want to count the names on the third level. I've seen several examples but I can't get it done.
import json
data = {
"startDate": "2019-06-23T16:07:21.205Z",
"endDate": "2019-07-24T16:07:21.205Z",
"status": "Complete",
"usages": [
{
"name": "PureCloud Edge Virtual Usage",
"resources": [
{
"name": "Edge01-VM-GNS-DemoSite01 (1f279086-a6be-4a21-ab7a-2bb1ae703fa0)",
"date": "2019-07-24T09:00:28.034Z"
},
{
"name": "329ad5ae-e3a3-4371-9684-13dcb6542e11",
"date": "2019-07-24T09:00:28.034Z"
},
{
"name": "e5796741-bd63-4b8e-9837-4afb95bb0c09",
"date": "2019-07-24T09:00:28.034Z"
}
]
},
{
"name": "PureCloud for SmartVideo Add-On Concurrent",
"resources": [
{
"name": "jpizarro#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "jaguilera#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "dcortes#gns.com.co",
"date": "2019-07-15T15:06:09.203Z"
}
]
},
{
"name": "PureCloud 3 Concurrent User Usage",
"resources": [
{
"name": "jpizarro#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "jaguilera#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "dcortes#gns.com.co",
"date": "2019-07-15T15:06:09.203Z"
}
]
},
{
"name": "PureCloud Skype for Business WebSDK",
"resources": [
{
"name": "jpizarro#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "jaguilera#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "dcortes#gns.com.co",
"date": "2019-07-15T15:06:09.203Z"
}
]
}
],
"selfUri": "/api/v2/billing/reports/billableusage"
}
cantidadDeLicencias = 0
cantidadDeUsages = len(data['usages'])
for x in range(cantidadDeUsages):
    temporal = data[x]
    cantidadDeResources = len(temporal['resource'])
    for z in range(cantidadDeResources):
        print(x)
What changes do I have to make? Or maybe I should take another approach? Thanks in advance
Update
Code that works
cantidadDeLicencias = 0
for usage in data['usages']:
    cantidadDeLicencias = cantidadDeLicencias + len(usage['resources'])
print(cantidadDeLicencias)
You can do this:
for usage in data['usages']:
    print(len(usage['resources']))
If you want to know the number of names at the resources level, counting duplicated names (e.g. "jaguilera#gns.com.co" appears more than once in your data), then just iterate over the first level (usages) and sum the size of each resources array:
cantidadDeLicencias = 0
for usage in data['usages']:
    cantidadDeLicencias += len(usage['resources'])
print(cantidadDeLicencias)
If you don't want to count duplicates, then use a set and iterate over each resources array:
cantidadDeLicencias_set = set()  # note: {} would create a dict, not a set
for usage in data['usages']:
    for resource in usage['resources']:
        cantidadDeLicencias_set.add(resource['name'])
print(len(cantidadDeLicencias_set))

How can I use python to add unique IDs to JSON children?

I have a json file that contains many children, like this:
{
"tree": {
"name": "Top Level",
"children": [
{
"name": "[('server', 'Cheese')]",
"children": [
{
"name": "[('waiter', 'mcdonalds')]",
"percentage": "100.00%",
"duration": 100,
"children": [
{
"name": "[('server', 'kfc')]",
"percentage": "15.73%",
"duration": 100,
"children": [
{
"name": "[('server', 'wendys')]",
"percentage": "12.64%",
"duration": 100
},
{
"name": "[('boss', 'dennys')]",
"percentage": "10.96%",
"duration": 100
}
]
},
{
"name": "[('cashier', 'chickfila')]",
"percentage": "10.40%",
"duration": 100,
"children": [
{
"name": "[('cashier', 'burger king')]",
"percentage": "11.20%",
"duration": 100
}
]
}
]
}
]
}
]
}
}
I want to add a unique ID to each child that corresponds to the level it is in, so it ends up looking like this, where each ID tells you how many parents the data has and how deep into the JSON you are (for example, 21.2.3.102 would be the 102nd child of the 3rd child of the 2nd child of the 21st parent):
{
"tree": {
"name": "Top Level",
"id": 1
"children": [
{
"name": "[('server', 'Cheese')]",
"id": 1.1
"children": [
{
"name": "[('waiter', 'mcdonalds')]",
"percentage": "100.00%",
"duration": 100,
"id": 1.1.1
"children": [
{
"name": "[('server', 'kfc')]",
"percentage": "15.73%",
"duration": 100,
"id": 1.1.1.1
"children": [
{
"name": "[('server', 'wendys')]",
"percentage": "12.64%",
"duration": 100,
"id":1.1.1.1.1
},
{
"name": "[('boss', 'dennys')]",
"percentage": "10.96%",
"duration": 100,
"id":1.1.1.1.2
}
]
},
{
"name": "[('cashier', 'chickfila')]",
"percentage": "10.40%",
"duration": 100,
"id":1.1.1.2
"children": [
{
"name": "[('cashier', 'burger king')]",
"percentage": "11.20%",
"duration": 100,
"id":1.1.1.2.1
}
]
}
]
}
]
}
]
}
}
Is there a streamlined way to do this for a very long JSON file with many, many children?
Please and thanks!
You can use a recursive walk, where d is your dictionary loaded from the JSON:
def walk(d, level="1"):
    d["id"] = level
    for i, child in enumerate(d.get("children", []), 1):
        walk(child, level + "." + str(i))

walk(d["tree"])
