How to create a tree using BFS in Python?

So I have a flattened tree in JSON like this, as an array of objects:
[{
    aid: "id3",
    data: ["id1", "id2"]
},
{
    aid: "id1",
    data: ["id3", "id2"]
},
{
    aid: "id2",
    nested_data: {aid: "id4", atype: "nested", data: ["id1", "id3"]},
    data: []
}]
I want to assemble that tree, resolving ids into data with recursive loops, into something like this (say we start from "id3"):
{
    "aid": "id3",
    "payload": "1",
    "data": [
        {
            "id1": {
                "aid": "id1",
                "data": [
                    {"id3": null},
                    {"id2": null}
                ]
            }
        },
        {
            "id2": {
                "aid": "id2",
                "nested_data": {
                    "aid": "id4",
                    "atype": "nested",
                    "data": [
                        {"id1": null},
                        {"id3": null}
                    ]
                },
                "data": []
            }
        }
    ]
}
So we would get a breadth-first traversal, resolving a field into "value": {object with that field} on first encounter and into "value": null afterwards.
How can I do such a thing in Python 3?

Apart from all the problems your structure has in terms of syntax (identifiers must be within quotes, etc.), the code below will give you the requested answer.
But you should think carefully about what you are doing, and take the following into account:
Using the relations expressed in the flat structure that you provide will mean endless recursion, since you have items that include other items that in turn include the first ones (like id3 including id1, which in turn includes id3). So you have to define stop criteria, or be sure that such cycles do not occur in your flat structure.
Your initial flat structure is better expressed as a dictionary than as a list of {id, data} pairs. That is why the first thing the code below does is transform it.
Your final, desired structure contains a lot of redundant information. Consider simplifying it.
Finally, you mentioned nothing about the "nested_data" nodes and how they should be treated. I simply assumed that, in case they exist, further expansion is required.
Please consider providing a bit of context in your questions, along with some real data examples (I believe the data provided is not real data, hence the inconsistencies and redundancies), and show your own attempts; that's the only way to learn.
from pprint import pprint

def reformat_flat_info(flat):
    # Index the flat list by "aid" so lookups are O(1)
    reformatted = {}
    for o in flat:
        key = o["aid"]
        del o["aid"]
        reformatted[key] = o
    return reformatted

def expand_data(aid, flat, lvl=0):
    obj = flat[aid]
    if obj is None:
        return {aid: obj}
    obj.update({"aid": aid})
    # Stop criterion: below the second level, emit {aid: None} instead of recursing forever
    if lvl > 1:
        return {aid: None}
    for nid, id_ in enumerate(obj["data"]):
        obj["data"][nid] = expand_data(id_, flat, lvl=lvl + 1)
    if "nested_data" in obj:
        for nid, id_ in enumerate(obj["nested_data"]["data"]):
            obj["nested_data"]["data"][nid] = expand_data(id_, flat, lvl=lvl + 1)
    return {aid: obj}

# Provide the flat information structure
flat_info = [
    {
        "aid": "id3",
        "data": ["id1", "id2"]
    }, {
        "aid": "id1",
        "data": ["id3", "id2"]
    }, {
        "aid": "id2",
        "nested_data": {"aid": "id4", "atype": "nested", "data": ["id1", "id3"]},
        "data": []
    }
]
pprint(flat_info)
print('-' * 80)

# Reformat the flat information structure
new_flat_info = reformat_flat_info(flat=flat_info)
pprint(new_flat_info)
print('-' * 80)

# Generate the result
starting_id = "id3"
result = expand_data(aid=starting_id, flat=new_flat_info)
pprint(result)
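Note that expand_data above expands depth-first with a depth cap, which happens to produce the requested shape. Since the question explicitly asks for breadth-first traversal, here is a minimal iterative BFS sketch of my own with a visited set as the stop criterion; it ignores "nested_data" and the extra "payload" field for brevity:
from collections import deque

def expand_bfs(start, flat):
    # Minimal BFS sketch (an alternative, not the answer's method): expand each
    # id the first time it is reached and resolve it to None on later sightings.
    # Run it on a freshly built new_flat_info, since expand_data mutates it in place.
    seen = {start}
    root = {"aid": start, "data": []}
    queue = deque([(flat[start], root)])
    while queue:
        src, dst = queue.popleft()
        for child_id in src["data"]:
            if child_id in seen:
                dst["data"].append({child_id: None})  # already expanded: resolve to null
            else:
                seen.add(child_id)
                node = {"aid": child_id, "data": []}
                dst["data"].append({child_id: node})
                queue.append((flat[child_id], node))
    return {start: root}

# e.g., on a fresh new_flat_info: pprint(expand_bfs("id3", new_flat_info))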

Related

Save reference to a python dictionary (from a json file) item in a variable

I'm trying to save a reference to a value in a JSON file where item order cannot be guaranteed. So far, what I have for a dataset like this one:
"Values": [
{
"Object": "DFC_Asset_05",
"Properties": [
{
"Property": "WeightKilograms",
"Value Offset": 5
},
{
"Property": "WeightPounds",
"Value Offset": 10
}
]
},
{
"Object": "DFC_Asset_05",
"Properties": [
{
"Property": "Name",
"Value Offset": 25
},
{
"Property": "ShortName",
"Value Offset": 119
}
]
}
]
and retrieving this object:
{
    "Property": "ShortName",
    "Value Offset": 119
}
is a string like this:
reference = "[Object=DFC_Asset_06][Properties][Property=Name]"
which looks nice and understandable in a string, but it's very unclean for finding the referenced value, as I must first parse the reference with a regex and then loop over the data to retrieve the matching item.
Am I doing this wrong? Is there a better way to do this? I looked at the reduce() function, however it seems to be made for dictionaries with a static structure. For example, I could not save the direct keys:
reference = "[1][Properties][1]"
reference_using_reduce = [1, "Properties", 1]
as they might not always be in that order.
You can run "queries" on JSONs without referencing specific indexes using the pyjq module:
import pyjq

query = (
    '.Values[]'                             # on all the items in "Values"
    '|select(."Object" == "DFC_Asset_06")'  # find key "Object" which holds this value
    '|."Properties"[]'                      # and get all the items of "Properties"
    '|select(."Property" == "Name")'        # where the key "Property" holds the value "Name"
)
pyjq.first(query, d)  # d is the parsed JSON from above
Result:
{'Property': 'Name', 'Value Offset': 25}
You can read more about jq in the documentation.
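If you would rather avoid the extra dependency, a plain-Python equivalent of that query is straightforward; the helper name find_property and the variable data (the parsed JSON) are illustrative choices, not names from the question:
def find_property(data, object_name, property_name):
    # Dependency-free sketch of the same lookup as the jq query above
    for value in data["Values"]:
        if value["Object"] != object_name:
            continue
        for prop in value["Properties"]:
            if prop["Property"] == property_name:
                return prop
    return None

print(find_property(data, "DFC_Asset_06", "Name"))
# {'Property': 'Name', 'Value Offset': 25}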

Navigate dict based on its structure

I have Python code that interacts with multiple APIs. All of the APIs return some JSON, but each has a different structure. Let's say I'm looking for people's names in all these JSONs:
json_a = {
    "people": [
        {"name": "John"},
        {"name": "Peter"}
    ]
}
json_b = {
    "humans": {
        "names": ["Adam", "Martin"]
    }
}
As you can see above, the dictionaries from the JSONs have arbitrary structures. I'd like to define something that will serve as a "blueprint" for navigating each JSON, something like this:
all_jsons = {
    "json_a": {
        "url": "http://endpoint",
        "json_structure": "people -> list -> name"
    },
    "json_b": {
        "url": "http://someotherendpoint",
        "json_structure": "humans -> names -> list"
    }
}
So if I'm working with json_a, I just look at all_jsons["json_a"]["json_structure"] and I have the information on how to navigate this exact JSON. What would be the best way to achieve this?
Why not define concrete retrieval functions for each API:
def retrieve_a(data):
    return [d["name"] for d in data["people"]]

def retrieve_b(data):
    return data["humans"]["names"]
and store them for each endpoint:
all_jsons = {
    "json_a": {
        "url": "http://endpoint",
        "retrieve": retrieve_a
    },
    "json_b": {
        "url": "http://someotherendpoint",
        "retrieve": retrieve_b
    }
}
I have found this approach more workable than trying to express code logic through configuration. Then you can easily collect names:
for dct in all_jsons.values():
    data = ...  # requests.get(dct["url"]).json() # or similar
    names = dct["retrieve"](data)
To get a value from a dictionary when the key may not exist, use dict.get(key). To distinguish whether a value is a list, a dict, or something else, type(val) (or isinstance) is useful. Combining the two should solve your problem.
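If you still prefer the blueprint strings over per-API functions, here is a minimal sketch of such a navigator; note that treating the token "list" as "iterate the sequence at this point" is my reading of the question's notation, not something the question defines:
def walk(data, blueprint):
    # Resolve a blueprint such as "people -> list -> name" against nested data.
    # "list" means: fan out over the sequence here; any other token is a dict
    # key, fetched with .get() so missing keys simply yield None.
    steps = [s.strip() for s in blueprint.split("->")]
    current = [data]
    for step in steps:
        if step == "list":
            current = [item for seq in current for item in seq]
        else:
            current = [item.get(step) for item in current if isinstance(item, dict)]
    return current

print(walk(json_a, "people -> list -> name"))   # ['John', 'Peter']
print(walk(json_b, "humans -> names -> list"))  # ['Adam', 'Martin']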

Elegant way of iterating list of dict python

I have a list of dictionaries as below. I need to iterate over it, remove the content of each parameters entry, and set it to an empty dictionary in each sections dictionary.
input = [
    {
        "category": "Configuration",
        "sections": [
            {
                "section_name": "Global",
                "parameters": {
                    "name": "first",
                    "age": "second"
                }
            },
            {
                "section_name": "Operator",
                "parameters": {
                    "adrress": "first",
                    "city": "first"
                }
            }
        ]
    },
    {
        "category": "Module",
        "sections": [
            {
                "section_name": "Global",
                "parameters": {
                    "name": "first",
                    "age": "second"
                }
            }
        ]
    }
]
Expected Output:
[
    {
        "category": "Configuration",
        "sections": [
            {
                "section_name": "Global",
                "parameters": {}
            },
            {
                "section_name": "Operator",
                "parameters": {}
            }
        ]
    },
    {
        "category": "Module",
        "sections": [
            {
                "section_name": "Global",
                "parameters": {}
            }
        ]
    }
]
My current code looks like this:
category_list = []
for categories in input:
    sections_list = []
    category_name_dict = {"category": categories["category"]}
    for sections_dict in categories["sections"]:
        section = {}
        section["section_name"] = sections_dict['section_name']
        section["parameters"] = {}
        sections_list.append(section)
    category_name_dict["sections"] = sections_list
    category_list.append(category_name_dict)
Is there a more elegant and performant way to compute this? Keys such as category, sections, section_name, and parameters are constants.
The easier way is not to rebuild the dictionary without the parameters, but simply to clear them in every section:
for category in input:
    for section in category['sections']:
        section['parameters'] = {}
Elegance is in the eye of the beholder, but rather than creating empty lists and dictionaries and then filling them, why not do it in one go with a list comprehension:
category_list = [
    {
        **category,
        "sections": [
            {
                **section,
                "parameters": {},
            }
            for section in category["sections"]
        ],
    }
    for category in input
]
This is more efficient and (in my opinion) makes it clearer that the intention is to change a single key.
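A quick sketch of the practical difference between the two answers, using the question's input from above: the comprehension builds new dictionaries and leaves the original untouched, while the first answer's loop mutates it in place.
import copy

original = copy.deepcopy(input)  # snapshot for comparison (illustrative only)

# Comprehension: `input` keeps its parameters; category_list has empty ones.
category_list = [
    {**c, "sections": [{**s, "parameters": {}} for s in c["sections"]]}
    for c in input
]
assert input == original

# In-place clearing: `input` itself is modified.
for category in input:
    for section in category["sections"]:
        section["parameters"] = {}
assert input != original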

How to find equal values in different fields in Elasticsearch via Python Query?

I've got values in Elasticsearch (+Kibana) and want to make a graph where certain nodes are connected.
My fields are "prev" and "curr", indicating the "previous" and "current" page that the user visited.
E.g:
prev: Main_Page, curr: Donald_Trump
prev: other-internal, curr: El_Bienamado
...
So what I'm trying to do is search for values where curr equals prev, to be able to connect those pages and visualize them via a NetworkX graph in Kibana.
My problem is that I only started with the query syntax yesterday and don't know if this is even possible.
All in all, my goal is to make a graph where nodes are connected in a chain, e.g.:
Main_Page -> Donald_Trump -> Problems_in_Afrika -> etc.
Meaning that somebody visited those pages in a certain order.
What I've tried for now is:
def getPrevList():
    previous = []
    previousQuery = {
        "size": 0,
        "aggs": {
            "topTerms": {
                "terms": {
                    "field": "prev",
                    "size": 50000
                }
            }
        }
    }
    results = es.search(index="wiki", body=previousQuery)["aggregations"]["topTerms"]["buckets"]
    for bucket in results:
        previous.append({
            "prev": bucket["key"],
            "numDocs": bucket["doc_count"]
        })
    return previous
prevs = getPrevList()
rowNum = 0
totalNumDocs = 0
for prevDetails in prevs:
    rowNum += 1
    totalNumDocs += prevDetails["numDocs"]
    prevId = prevDetails["prev"]
    q = {
        "query": {
            "bool": {
                "must": [
                    {
                        "term": {"prev": prevId}
                    }
                ]
            }
        },
        "controls": {
            "sample_size": 10000,
            "use_significance": True
        },
        "vertices": [
            {
                "field": "curr",
                "size": VERTEX_SIZE,  # VERTEX_SIZE is defined elsewhere in my script
                "min_doc_count": 1
            },
            {
                "field": "prev",
                "size": VERTEX_SIZE,
                "min_doc_count": 1
            }
        ],
        "connections": {
            "query": {
                "match_all": {}
            }
        }
    }
At the end, I'm doing the following:
results = es.transport.perform_request('POST', "/wiki/_xpack/_graph/_explore", body=q)

# Use NetworkX to create a graph of prevs and currs we can analyze
G = nx.Graph()
for node in results["vertices"]:
    G.add_node(nodeId(node), type=node["field"])
for edge in results["connections"]:
    n1 = results["vertices"][int(edge["source"])]
    n2 = results["vertices"][int(edge["target"])]
    G.add_edge(nodeId(n1), nodeId(n2))
I copied it from another example that worked well, but I can see that the "connections" part is important for connecting the vertices.
As far as I understand, I need the query to find the correct "prev" field.
The controls are not significant for now.
And here comes the complex part for me: what do I write in the vertices and connections parts? Is it correct that I defined the vertices as the prev and curr fields?
And in the connections query: for now I defined "match_all", but this is obviously not correct. I need a query where I can "match" documents whose prev equals curr and connect them. But how?
Any hint is appreciated!
Thanks in advance.
EDIT:
Like @Lupanoide suggested, I altered the code and now have two visualizations:
the first one is the first suggested solution and it gives me this graph (part of it) (matplotlib, not Kibana yet):
The second solution looks crazier and is more likely the correct one, but I need to visualize it in Kibana first:
So the new end of my script is now:
gq = json.dumps(q)
workspaceID = "/f44c95c0-223d-11e9-b49e-bb0f8e1e7bae"  # my v6.4.0 workspace
workspaceUrl = "graph#/workspace/" + workspaceID + "?query=" + urllib.quote_plus(gq)
doc = {
    "url": workspaceUrl
}
res = es.index(index=connectionsIndexName, doc_type='task', id=0, body=doc)
My only problem now is that when I use Kibana to open the URL, I don't see the graph. Instead, I get the "new Graph" page.
EDIT2
Okay, I send the query, but of course the query alone is not enough. I need to pass the graph and its connections too, right? Is that possible?
Thank you very much!
EDIT:
For your use case you need to find all the values of the field curr that share the same prev value. In other words, you need to group all the pages that are clicked after a certain page. You can do that with a terms aggregation.
You need to build a query that, on one hand, returns all the values of the prev field with a terms aggregation, and then aggregates over all the curr values generated for each of them:
def getOccurrencyDict():
    body = {
        "size": 0,
        "aggs": {
            "getAllThePrevs": {
                "terms": {
                    "field": "prev",
                    "size": 40000
                },
                "aggs": {
                    "getAllTheCurr": {
                        "terms": {
                            "field": "curr",
                            "size": 40000
                        }
                    }
                }
            }
        }
    }
    result = es.search(index="my_index", doc_type="mydoctype", body=body)
    return result
Then you have to build a data structure that the NetworkX Graph() class accepts. So you should build a dict of lists and then pass that variable to the from_dict_of_lists method:
dict2Graph = dict()
for res in result["aggregations"]["getAllThePrevs"]["buckets"]:
    # create a dict of lists keyed by each prev value
    dict2Graph[res["key"]] = list()
    # append the curr buckets: dicts whose `key` is the curr value and whose
    # `doc_count` is how often that curr occurs after this prev
    dict2Graph[res["key"]].append(res["getAllTheCurr"]["buckets"])
Now you pass it to the NetworkX ingestion method:
G = nx.from_dict_of_lists(dict2Graph)
I have not tested the NetworkX ingestion, so if it doesn't work, it is because we passed a dict of lists of dicts into it rather than a dict of lists, and you should change a little how you build your dict2Graph dict. A possible fix is sketched below.
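For instance, an untested sketch of that fix flattens each curr bucket down to its key so NetworkX really does receive a dict of lists:
dict2Graph = {
    res["key"]: [bucket["key"] for bucket in res["getAllTheCurr"]["buckets"]]
    for res in result["aggregations"]["getAllThePrevs"]["buckets"]
}
G = nx.from_dict_of_lists(dict2Graph)  # note: the doc_count weights are dropped in this form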
If the aggregation-of-aggregations query is too slow, you should use partitions. Please read here how you can achieve partitioned aggregation in Elasticsearch; a rough sketch follows.
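As a rough illustration of what a partitioned version of the aggregation might look like (the num_partitions value of 20 is an arbitrary assumption; tune it to the cardinality of prev):
num_partitions = 20  # arbitrary; pick based on how many distinct prev values you have
for partition in range(num_partitions):
    body = {
        "size": 0,
        "aggs": {
            "getAllThePrevs": {
                "terms": {
                    "field": "prev",
                    "size": 2000,
                    "include": {"partition": partition, "num_partitions": num_partitions}
                },
                "aggs": {
                    "getAllTheCurr": {
                        "terms": {"field": "curr", "size": 40000}
                    }
                }
            }
        }
    }
    result = es.search(index="my_index", body=body)
    # merge result["aggregations"]["getAllThePrevs"]["buckets"] into your structure here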
EDIT:
after reading the NetworkX documentation, you could also do it this way, without creating the intermediate data structure:
from elasticsearch import Elasticsearch
from elasticsearch.client.graph import GraphClient

es = Elasticsearch()
graph_client = GraphClient(es)

def createGraphInKibana(prev):
    q = {
        "query": {
            "bool": {
                "must": [
                    {
                        "term": {"prev": prev}
                    }
                ]
            }
        },
        "controls": {
            "sample_size": 10000,
            "use_significance": True
        },
        "vertices": [
            {
                "field": "curr",
                "size": VERTEX_SIZE,
                "min_doc_count": 1
            },
            {
                "field": "prev",
                "size": VERTEX_SIZE,
                "min_doc_count": 1
            }
        ],
        "connections": {
            "query": {
                "match_all": {}
            }
        }
    }
    graph_client.explore(index="your_index", doc_type="your_doc_type", body=q)

G = nx.Graph()
for prev in result["aggregations"]["getAllThePrevs"]["buckets"]:
    createGraphInKibana(prev['key'])
    for curr in prev["getAllTheCurr"]["buckets"]:
        G.add_edge(prev["key"], curr["key"], weight=curr["doc_count"])

Logic for building converter using python dictionary values

I have this slice of JSON loaded into a Python dictionary (size_dict):
{
    "sizeOptionName": "XS",
    "sizeOptionId": "1528",
    "sortOrderNumber": "7017"
},
{
    "sizeOptionName": "S",
    "sizeOptionId": "1529",
    "sortOrderNumber": "7047"
},
{
    "sizeOptionName": "M",
    "sizeOptionId": "1530",
    "sortOrderNumber": "7095"
}
and I have products with a size id (dictionary_prod):
{
    "catalogItemId": "7627712",
    "catalogItemTypeId": "3",
    "regularPrice": "0.0",
    "sizeDimension1Id": "1528",
    "sizeDimension2Id": "0",
}
I need to produce output like this for each product:
result_dict = {'variant':
    [{"catalogItemId": "7627712", ...some other info...,
      'sizeName': 'XS', 'sizeId': '1528'}]}
so I need to convert the size id and add it to the new result object.
What is the best Pythonic way to do this?
I don't know how to get the right data from size_dict:
if int(dictionary_prod['sizeDimension1Id']) > 0:
    (result_dict['variant']).append('sizeName': size_dict???)
As Tommy mentioned, this is best facilitated by mapping the size ids to their respective dictionaries.
size_dict = \
[
    {
        "sizeOptionName": "XS",
        "sizeOptionId": "1528",
        "sortOrderNumber": "7017"
    },
    {
        "sizeOptionName": "S",
        "sizeOptionId": "1529",
        "sortOrderNumber": "7047"
    },
    {
        "sizeOptionName": "M",
        "sizeOptionId": "1530",
        "sortOrderNumber": "7095"
    }
]

# Map each size id to its size dictionary for O(1) lookups
size_id_map = {size["sizeOptionId"]: size for size in size_dict}

production_dict = \
[
    {
        "catalogItemId": "7627712",
        "catalogItemTypeId": "3",
        "regularPrice": "0.0",
        "sizeDimension1Id": "1528",
        "sizeDimension2Id": "0",
    }
]

def make_variant(idict):
    odict = idict.copy()
    size_id = odict.pop("sizeDimension1Id")
    odict.pop("sizeDimension2Id")
    odict["sizeName"] = size_id_map[size_id]["sizeOptionName"]
    odict["sizeId"] = size_id
    return odict

result_dict = \
{
    "variant": [make_variant(product) for product in production_dict]
}

print(result_dict)
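For reference, running this should print roughly the following (hand-traced from the code above, assuming Python 3.7+ key ordering, so treat it as illustrative):
{'variant': [{'catalogItemId': '7627712', 'catalogItemTypeId': '3', 'regularPrice': '0.0', 'sizeName': 'XS', 'sizeId': '1528'}]}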
Your question is a little confusing, but it looks like you have a list (size_dict) of dictionaries that contain some information, and you want to look up the particular element in the list that contains the sizeOptionName you are interested in, so that you can read off the sizeOptionId.
So first you could organise your size_dict as a dictionary rather than a list, i.e.:
sizeDict = {"XS": {
    "sizeOptionName": "XS",
    "sizeOptionId": "1528",
    "sortOrderNumber": "7017"
}, "S": {
    "sizeOptionName": "S",
    "sizeOptionId": "1529",
    "sortOrderNumber": "7047"
}, ...
You could then read off the sizeOptionId you need by doing:
sizeDict[sizeNameYouAreLookingFor]["sizeOptionId"]
Alternatively, you could keep your current structure and just search the list of dictionaries that is size_dict.
So:
for elem in size_dict:
    if elem["sizeOptionName"] == sizeYouAreLookingFor:
        OptionID = elem["sizeOptionId"]
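As a slightly more idiomatic sketch of the same search (assuming sizeNameYouAreLookingFor holds the name, e.g. "XS"), next() stops at the first match and falls back to None:
option_id = next(
    (elem["sizeOptionId"] for elem in size_dict
     if elem["sizeOptionName"] == sizeNameYouAreLookingFor),
    None,  # default when no size matches
)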
Or perhaps you are asking something else?
