I've seen that there are a fair few questions addressing more or less this issue, but I've not managed to apply them to my specific use-case, and I've been scratching my head and trying different solutions for a couple of days now.
I have a list of dictionaries, with their hierarchical position encoded as a string of index numbers - I want to rearrange the dictionaries into a nested hierarchy using these indices.
Here's some example data:
my_data = [{'id':1, 'text':'one', 'path':'1'},
{'id':2, 'text':'two', 'path':'3.1'},
{'id':3, 'text':'three', 'path':'2.1.1'},
{'id':4, 'text':'four', 'path':'3.2.1'},
{'id':5, 'text':'five', 'path':'2.1.2'},
{'id':6, 'text':'six', 'path':'3.2.2'},
{'id':7, 'text':'seven', 'path':'2'},
{'id':8, 'text':'eight', 'path':'3'},
{'id':9, 'text':'nine', 'path':'3.2'},
{'id':10, 'text':'ten', 'path':'2.1'}]
and here's what I'm trying to achieve:
result = {1:{'id':1, 'text':'one', 'path':'1'},
2:{'id':7, 'text':'seven', 'path':'2', 'children':{
1:{'id':10, 'text':'ten', 'path':'2.1', 'children':{
1:{'id':3, 'text':'three', 'path':'2.1.1'},
2:{'id':5, 'text':'five', 'path':'2.1.2'}
}}}},
3:{'id':8, 'text':'eight', 'path':'3', 'children':{
1:{'id':2, 'text':'two', 'path':'3.1'},
2:{'id':9, 'text':'nine', 'path':'3.2', 'children':{
1:{'id':4, 'text':'four', 'path':'3.2.1'},
2:{'id':6, 'text':'six', 'path':'3.2.2'}
}}}}
}
Since the paths of the individual data dictionaries don't appear in any logical order, I'm using dictionaries throughout rather than lists of dictionaries, as this allows me to create 'empty' spaces in the structure. I don't really want to rely on re-ordering the dictionaries in the initial list.
Here's my code:
#%%
class my_dict(dict):
    def rec_update(self, index, dictObj): # extend the dict class with recursive update function
        """
        Parameters
        ----------
        index : list
            path to dictObj.
        dictObj : dict
            data object.
        Returns: updates the dictionary instance
        -------
        None.
        """
        pos = index[0]
        index.pop(0)
        if len(index) != 0:
            self.update({pos : {'children' : {self.rec_update(index, dictObj)}}})
        else:
            self.update({pos : dictObj})
#%%
dataOut = my_dict() #create empty dictionary to receive result
dataOut.clear()
# dictObj = my_data[0] # for testing
# dictObj = my_data[1]
for dictObj in my_data:
    index = dictObj.get('path').split(".") # create the path list
    dataOut.rec_update(index, dictObj) # place the current data dictionary in the hierarchy
The issue with the code is that the result of the nested function call in the class definition self.rec_update(index, dictObj) isn't ending up as the value of the 'children' key. Is this because I've not understood the scope of self properly?
I've noticed during testing that, if I run the dataOut.rec_update(index, dictObj) call for a single element of my_data, e.g. dictObj = my_data[1], that the index list variable in the console scope is modified, which is unexpected, as I thought the rec_update() function had its own distinct scope.
I think I can see a further bug where the 'children' element will be overwritten, but I'm not at that stage yet.
I'd welcome any explanation that can put me on the right track, please.
Here's a solution that you should be able to adapt to your needs. It's just a stand-alone function that transforms my_data into result:
def make_tree(data):
    ###
    ### Construct path_list and path_dict
    ###
    path_dict = {}
    path_list = []
    for item in data:
        path = item['path']
        path_split = path.split('.')
        assert len(path_split) >= 1
        path_tuple = tuple(map(int, path_split))
        assert path_tuple not in path_dict
        path_dict[path_tuple] = item
        path_list.append(path_tuple)
    ###
    ### Sort path_list. This is sorting the tuples corresponding to
    ### each path value. Among other things, this ensures that the
    ### parent of a path appears before the path.
    ###
    path_list.sort()
    ###
    ### Construct and return the tree
    ###
    new_path_dict = {}
    tree = {}
    for path_tuple in path_list:
        item = path_dict[path_tuple]
        path_leaf = path_tuple[-1]
        new_data = item.copy()
        if len(path_tuple) == 1:
            assert path_leaf not in tree
            tree[path_leaf] = new_data
        else:
            parent_path_tuple = path_tuple[:-1]
            assert parent_path_tuple in new_path_dict
            parent = new_path_dict[parent_path_tuple]
            if 'children' not in parent:
                children = {}
                parent['children'] = children
            else:
                children = parent['children']
            assert path_leaf not in children
            children[path_leaf] = new_data
        new_path_dict[path_tuple] = new_data
    return tree
When called as:
result = make_tree(my_data)
It gives result the value:
{1: {'id': 1, 'text': 'one', 'path': '1'},
2: {'id': 7, 'text': 'seven', 'path': '2', 'children': {
1: {'id': 10, 'text': 'ten', 'path': '2.1', 'children': {
1: {'id': 3, 'text': 'three', 'path': '2.1.1'},
2: {'id': 5, 'text': 'five', 'path': '2.1.2'}}}}},
3: {'id': 8, 'text': 'eight', 'path': '3', 'children': {
1: {'id': 2, 'text': 'two', 'path': '3.1'},
2: {'id': 9, 'text': 'nine', 'path': '3.2', 'children': {
1: {'id': 4, 'text': 'four', 'path': '3.2.1'},
2: {'id': 6, 'text': 'six', 'path': '3.2.2'}}}}}}
Note that dictionaries in Python 3.7 and later maintain the order in which elements were added, so in that sense, the constructed tree is "sorted" at each level by the corresponding path component.
Also note that the original source list, and the dictionaries it contains, are unchanged by this function.
I think I've cracked it! And I've learned a lot in the process. (I'd hope so - I've got at least 24 SO tabs open, 6 doc.python.org tabs, and maybe 20 others - so it's been a group effort!)
Here is a recursive function that creates the required nested data:
class my_dict(dict): # new class inherits dict()
    def rec_update(self, index, dictObj):
        pos = index[0] # store the first index position
        index.pop(0) # remove the first position from the list
        dictTemp = my_dict() # will be used to pass the nested branch to the recursive function - doesn't need defined here
        if len(index) != 0: # ... then we've not arrived at the leaf yet
            if not (pos in self and 'children' in self[pos]): # ... then create a new branch
                self[pos] = {'children': {}} # create template
            dictTemp = my_dict(self[pos]['children']) # cast the dictionary as my_dict so that it has access to the rec_update() function
            self[pos]['children'] = dictTemp.rec_update(index, dictObj) # pass data on to next level, and recurse
        else:
            if (pos in self and 'children' in self[pos]): # ... then update existing branch
                self[pos].update(dictObj) # add in the data alongside pre-existing children key
            else: # populate new branch with data, finally!
                self[pos] = dictObj
        return self
and here is the calling code:
dataOut = my_dict()
for dictObj in my_data:
    index = [int(i) for i in dictObj.get('path').split(".")] # turn path string into list and iterate; convert to integers
    dataOut.rec_update(index, dictObj)
I still don't understand why changes to index inside the function alter index in the calling code - answers welcome :-)
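On the question of why index changes in the calling code: Python passes the list object itself to the function, not a copy, so index.pop(0) inside the method mutates the very list the caller is holding. Scope only governs which names are visible, not which objects they point to. A minimal sketch of the difference (both function names are made up for illustration):

```python
def consume_in_place(index):
    # pop() mutates the list object that the caller passed in
    pos = index.pop(0)
    return pos

def consume_copy(index):
    # unpacking builds a new list for the tail; the caller's list is untouched
    pos, *rest = index
    return pos, rest

path = [3, 2, 1]
consume_in_place(path)
print(path)   # [2, 1]

path = [3, 2, 1]
consume_copy(path)
print(path)   # [3, 2, 1]
```

Passing index[1:] or list(index) into the recursive call would have the same protective effect.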
But I did discover that I couldn't override dict.copy() with a __copy__() function inside my my_dict class definition, hence dictTemp = my_dict(self[pos]['children']) rather than dictTemp = self[pos]['children'].copy().
One final oddity which I've still to address: when I apply it to my production data, I have to run it twice!
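A possible explanation for having to run it twice, judging only from the code above: when a parent such as '2' is stored before its child '2.1', the line self[pos] = {'children': {}} replaces the parent's data, and only a second pass restores it through update(). The sketch below is an order-independent variant using plain dicts and setdefault. The name insert_path and the shortened sample items are my own, not part of the original code.

```python
def insert_path(tree, path, item):
    """Place item in tree at the dotted path, whatever order items arrive in."""
    *parents, leaf = [int(p) for p in path.split('.')]
    level = tree
    for pos in parents:
        # create a placeholder parent if needed, never overwrite an existing one
        level = level.setdefault(pos, {}).setdefault('children', {})
    # merge into any placeholder so an existing 'children' key survives
    level.setdefault(leaf, {}).update(item)

tree = {}
for item in [{'id': 7, 'path': '2'},      # parent arrives first
             {'id': 10, 'path': '2.1'},
             {'id': 3, 'path': '2.1.1'},
             {'id': 8, 'path': '3.1'},    # child arrives first
             {'id': 9, 'path': '3'}]:
    insert_path(tree, item['path'], item)
print(tree)
```

Because update() copies the keys into the tree, the dictionaries in the source list are left unchanged.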
I have an example dictionary of rules, quantifiers, and transformations. Essentially, the entries under each key refer to the ids of entries under another key. I am trying to find all the entries that match and return the matching ids as a dictionary in this format:
dictionary = {'rules':[...], 'quantifiers':[...], 'transformations':[...]}
Here is the sample:
test_dict = {
'rules': [{'id': 123,'logic': '{"$or":[{"$and":[{"baseOperator":null,"operator":"does_not_contain_ignore_case","operand1":"metrics.123","operand2":"metrics.456"}]}]}',},
{'id': 589,
'logic': '{"$or":[{"$and":[{"baseOperator":null,"operator":"does_not_contain_ignore_case","operand1":"metrics.123","operand2":0}, {"baseOperator":null,"operator":"does_not_contain_ignore_case","operand1":"metrics.456","operand2":0}]}]}',},
{'id': 51,
'logic': '{"$or":[{"$and":[{"baseOperator":null,"operator":"does_not_contain_ignore_case","operand1":"metrics.789","operand2":"metrics.1"}]}]}',},],
'quant': [{'id':123,
'transIds': [1, 2, 3],
'qualifiedId': 'metrics.123'},
{'id':456,
'transIds': [1, 6],
'qualifiedId': 'metrics.456'},
{'id':789,
'transIds': [9],
'qualifiedId': 'metrics.789'}],
'trans': [{'id':1,
'rules': [123, 120]},
{'id':6,
'rules':[589, 2]}]
}
Here was my attempt. However, I realised that the trans and rules lists are only filled in when the loop reaches the sections that contain them. Because rules comes first in test_dict, the loop never captures anything from it: the lists it is compared against are still empty at that point.
Essentially, I wanted to go into logic and capture all metric values that belong to the ids in quantifiers
Capture all ids from quantifiers that match the values inside attr
attr = [123, 456]
keys = list(test_dict.keys())
trans = []
rules = []
for iter in range(len(keys)):
    for in_iter in range(len(test_dict[keys[iter]])):
        if test_dict[keys[iter]][in_iter].get('id') in attr:
            if test_dict[keys[iter]][in_iter].get('transIds') is not None:
                for J in test_dict[keys[iter]][in_iter].get('transIds'):
                    trans.append(J)
        if test_dict[keys[iter]][in_iter].get('id') in trans:
            if test_dict[keys[iter]][in_iter].get('rules') is not None:
                for K in test_dict[keys[iter]][in_iter].get('rules'):
                    rules.append(K)
        if test_dict[keys[iter]][in_iter].get('id') in rules:
            if test_dict[keys[iter]][in_iter].get('logic') is not None:
                print(test_dict[keys[iter]][in_iter].get('logic'))
I figured it out thanks to the comments. Instead of running everything inside a single loop, I split the work into separate loops, which solved the list issue. However, this attempt takes far too many lines of code:
import json
attr = [123, 456]
keys = list(test_dict.keys())
trans = []
rules = []
qualified = []
quant_id = set()
for iter in range(len(keys)):
    for in_iter in range(len(test_dict[keys[iter]])):
        if test_dict[keys[iter]][in_iter].get('id') in attr:
            qualified.append(test_dict[keys[iter]][in_iter].get('qualifiedId'))
            if test_dict[keys[iter]][in_iter].get('transIds') is not None:
                for J in test_dict[keys[iter]][in_iter].get('transIds'):
                    trans.append(J)
trans2 = set()
for iter in range(len(keys)):
    for in_iter in range(len(test_dict[keys[iter]])):
        if test_dict[keys[iter]][in_iter].get('id') in trans:
            trans2.add(test_dict[keys[iter]][in_iter].get('id'))
            if test_dict[keys[iter]][in_iter].get('rules') is not None:
                for K in test_dict[keys[iter]][in_iter].get('rules'):
                    rules.append(K)
rules2 = set()
for iter in range(len(keys)):
    for in_iter in range(len(test_dict[keys[iter]])):
        if test_dict[keys[iter]][in_iter].get('id') in rules:
            rules2.add(test_dict[keys[iter]][in_iter].get('id'))
            if test_dict[keys[iter]][in_iter].get('logic') is not None:
                logic = json.loads(test_dict[keys[iter]][in_iter].get('logic'))
                ks_or = list(logic.keys())
                for or_ in range(len(logic)):
                    for unl_or_ in range(len(logic[ks_or[or_]])):
                        and_logic = logic[ks_or[or_]][unl_or_]
                        ks_and = list(logic[ks_or[or_]][unl_or_].keys())
                        for and_ in range(len(and_logic)):
                            for unl_and_ in range(len(and_logic[ks_and[and_]])):
                                if and_logic[ks_and[and_]][unl_and_].get('operand1') in qualified:
                                    quant_id.add(and_logic[ks_and[and_]][unl_and_].get('operand1').split('.')[-1])
                                elif and_logic[ks_and[and_]][unl_and_].get('operand2') in qualified:
                                    quant_id.add(and_logic[ks_and[and_]][unl_and_].get('operand2').split('.')[-1])
                                else:
                                    continue
dictionary = {'rules':rules2, 'transformations': trans2, 'quantifiers': quant_id}
print(dictionary)
Result:
{'rules': {123, 589}, 'transformations': {1, 6}, 'quantifiers': {'456', '123'}}
Updated with set instead of list so only unique values remain.
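Since the length of the code was the remaining complaint, here is one way it might be shortened. This is only a sketch under assumptions: it looks up each id in its own section (quant, then trans, then rules) instead of scanning every section, and it assumes every logic string has the {"$or": [{"$and": [...]}]} layout of the sample. The helpers rule, conditions and find_related are names I made up, and rule exists only to keep the test data short.

```python
import json

def rule(rule_id, *pairs):
    # Hypothetical helper: builds a rule whose logic string follows the
    # sample's {"$or": [{"$and": [...]}]} layout.
    conds = [{"operand1": a, "operand2": b} for a, b in pairs]
    return {"id": rule_id, "logic": json.dumps({"$or": [{"$and": conds}]})}

test_dict = {
    "rules": [rule(123, ("metrics.123", "metrics.456")),
              rule(589, ("metrics.123", 0), ("metrics.456", 0)),
              rule(51, ("metrics.789", "metrics.1"))],
    "quant": [{"id": 123, "transIds": [1, 2, 3], "qualifiedId": "metrics.123"},
              {"id": 456, "transIds": [1, 6], "qualifiedId": "metrics.456"},
              {"id": 789, "transIds": [9], "qualifiedId": "metrics.789"}],
    "trans": [{"id": 1, "rules": [123, 120]}, {"id": 6, "rules": [589, 2]}],
}

def conditions(logic):
    """Yield every condition dict nested under the $or / $and layout."""
    for or_branches in logic.values():
        for and_group in or_branches:
            for conds in and_group.values():
                yield from conds

def find_related(data, attr):
    # Step 1: quantifiers whose id is in attr (assumes they live under 'quant')
    quants = [q for q in data["quant"] if q["id"] in attr]
    qualified = {q["qualifiedId"] for q in quants}
    trans_wanted = {t for q in quants for t in q["transIds"]}
    # Step 2: transformations referenced by those quantifiers
    trans = [t for t in data["trans"] if t["id"] in trans_wanted]
    rules_wanted = {r for t in trans for r in t["rules"]}
    # Step 3: rules referenced by those transformations
    rules = [r for r in data["rules"] if r["id"] in rules_wanted]
    quant_ids = set()
    for r in rules:
        for cond in conditions(json.loads(r["logic"])):
            # mirror the original if/elif: stop at the first matching operand
            for key in ("operand1", "operand2"):
                if cond.get(key) in qualified:
                    quant_ids.add(cond[key].split(".")[-1])
                    break
    return {"rules": {r["id"] for r in rules},
            "transformations": {t["id"] for t in trans},
            "quantifiers": quant_ids}

print(find_related(test_dict, [123, 456]))
```

For the sample data this produces the same three sets as the Result shown above.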
I'm trying to convert a list of asset objects that has a list of attribute objects into an array of dictionaries. I'm trying to denormalise the parent/child relationship into a single dictionary.
For the context of my code below I have an asset object with a short_name and the asset object has a list of attributes with an attribute_value and attribute_name.
My intended result is something like this;
[{'name': 'Test', 'attr': 0.9}, {'name': 'Test2', 'attr': 0.5}]
So far I've written it like this;
a_list = []
for a in self.assets:
    asset_dict = {'name': a.short_name }
    for x in a.attributes:
        asset_dict = asset_dict | { x.attribute_name : x.attribute_value }
    a_list.append(asset_dict)
This works fine, but I'm looking for a neater solution.
I experimented with;
result = [{'name':a.short_name} | {x.attribute_name : x.attribute_value} for x in a.attribute for a in self.assets]
However, I just can't seem to get the syntax correct and not sure if it is possible to do something like this.
EDIT: Inputs on request (excluding the class definition);
self.assets = [Asset(short_name='Test'),Asset(short_name='Test2')]
self.assets[0].attributes = [Attribute(attribute_name='attr',attribute_value=0.9)]
self.assets[1].attributes = [Attribute(attribute_name='attr',attribute_value=0.5)]
This should work:
a_list = [
{'name': a.short_name} |
{x.attribute_name: x.attribute_value for x in a.attributes}
for a in self.assets
]
or
a_list = [
{'name': a.short_name, **{x.attribute_name: x.attribute_value
for x in a.attributes}}
for a in self.assets
]
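To try this outside the original class, here is a self-contained sketch. The Asset and Attribute dataclasses are stand-ins of my own, since the question left out the real class definitions, and assets replaces self.assets. Note that the | operator on dicts needs Python 3.9 or later, whereas the ** form used below also works on older Python 3 versions.

```python
from dataclasses import dataclass, field

@dataclass
class Attribute:               # stand-in for the real attribute class
    attribute_name: str
    attribute_value: float

@dataclass
class Asset:                   # stand-in for the real asset class
    short_name: str
    attributes: list = field(default_factory=list)

assets = [
    Asset('Test', [Attribute('attr', 0.9)]),
    Asset('Test2', [Attribute('attr', 0.5)]),
]

# one dict per asset: the name plus one key per attribute
a_list = [
    {'name': a.short_name,
     **{x.attribute_name: x.attribute_value for x in a.attributes}}
    for a in assets
]
print(a_list)  # [{'name': 'Test', 'attr': 0.9}, {'name': 'Test2', 'attr': 0.5}]
```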
Trying to create a dict that holds name,position and number for each player for each team. But when trying to create the final dictionary players[team_name] =dict(zip(number,name,position)) it throws an error (see below). I can't seem to get it right, any thoughts on what I'm doing wrong here would be highly appreciated. Many thanks,
from bs4 import BeautifulSoup as soup
import requests
from lxml import html
clubs_url = 'https://www.premierleague.com/clubs'
parent_url = clubs_url.rsplit('/', 1)[0]
data = requests.get(clubs_url).text
html = soup(data, 'html.parser')
team_name = []
team_link = []
for ul in html.find_all('ul', {'class': 'block-list-5 block-list-3-m block-list-1-s block-list-1-xs block-list-padding dataContainer'}):
    for a in ul.find_all('a'):
        team_name.append(str(a.h4).split('>', 1)[1].split('<')[0])
        team_link.append(parent_url+a['href'])
team_link = [item.replace('overview', 'squad') for item in team_link]
team = dict(zip(team_name, team_link))
data = {}
players = {}
for team_name, team_link in team.items():
    player_page = requests.get(team_link)
    cont = soup(player_page.content, 'lxml')
    clud_ele = cont.find_all('span', attrs={'class' : 'playerCardInfo'})
    for i in clud_ele:
        v_number = [100 if v == "-" else v.get_text(strip=True) for v in i.select('span.number')]
        v_name = [v.get_text(strip=True) for v in i.select('h4.name')]
        v_position = [v.get_text(strip=True) for v in i.select('span.position')]
        key_number = [key for element in i.select('span.number') for key in element['class']]
        key_name = [key for element in i.select('h4.name') for key in element['class']]
        key_position = [key for element in i.select('span.position') for key in element['class']]
        number = dict(zip(key_number,v_number))
        name = dict(zip(key_name,v_name))
        position = dict(zip(key_position,v_name))
        players[team_name] = dict(zip(number,name,position))
---> 21 players[team_name] = dict(zip(number,name,position))
22
23
ValueError: dictionary update sequence element #0 has length 3; 2 is required
There are many problems in your code. The one causing the error is that you are trying to build a dictionary from a sequence of 3-item tuples, which is not possible: dict() expects each element to be a (key, value) pair. See the dict documentation for details.
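The error can be reproduced without any scraping. The following is a small sketch with hand-written lists (the values are borrowed from the sample output further down the page, and by_number is just one possible alternative shape):

```python
numbers = ['1', '26']
names = ['Bernd Leno', 'Emiliano Martínez']
positions = ['Goalkeeper', 'Goalkeeper']

try:
    dict(zip(numbers, names, positions))   # each element is a 3-tuple
except ValueError as exc:
    error = str(exc)
    print(error)

# One workable shape: key on one field and keep the rest as the value
by_number = {num: (name, pos)
             for num, name, pos in zip(numbers, names, positions)}
print(by_number)
```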
That said, I would suggest to rework the whole nested loop.
First, clud_ele holds a list of player info elements; each one concerns a single player and provides only one position, one name and one number. So there is no need to store this information in lists; you could use simple variables:
for player_info in clud_ele:
    number = player_info.select('span.number')[0].get_text(strip=True)
    if number == '-':
        number = 100
    name = player_info.select('h4.name')[0].get_text(strip=True)
    position = player_info.select('span.position')[0].get_text(strip=True)
Here, the select method returns a list, but since you know that the list contains only one item, it's fine to take that item and call get_text on it. If you want to be sure, you could check that the length of player_info.select('span.number') is actually 1 before continuing.
This way, you get scalar values, which are much easier to manipulate.
Also note that I renamed i to player_info, which is much more explicit.
Then you can easily add your player data to your players dict:
players[team_name].append({'name': name,
                           'position': position,
                           'number': number})
This assumes that you create players[team_name] before the nested loop with players[team_name] = [].
Edit: as stated in @kederrac's answer, using a defaultdict is a smart and convenient way to avoid manually creating each players[team_name] list.
Finally, this will give you:
a dictionary containing values for name, position and number keys for each player
a team list containing the player dictionaries for each team
a players dictionary associating a team list with each team_name
It is the data structure you seem to want, but other structures are possible. Remember to think about your data structure so that it is logical AND easy to manipulate.
You can't instantiate a dict from 3-item tuples. The problem is that you pass 3 variables to zip, zip(number, name, position), and then try to build a dict from the result; dict() needs exactly 2 items at a time, the key and the value.
I've rewritten the last part of your code:
from collections import defaultdict
data = {}
players = defaultdict(list)
for team_name, team_link in team.items():
    player_page = requests.get(team_link)
    cont = soup(player_page.text, 'lxml')
    clud_ele = cont.find_all('span', attrs={'class' : 'playerCardInfo'})
    for i in clud_ele:
        num = i.select('span.number')[0].get_text(strip=True)
        number = 100 if num == '-' else num
        name = i.select('h4.name')[0].get_text(strip=True)
        position = i.select('span.position')[0].get_text(strip=True)
        players[team_name].append({'number': number, 'position': position, 'name': name})
output:
defaultdict(list,
{'Arsenal': [{'number': '1',
'position': 'Goalkeeper',
'name': 'Bernd Leno'},
{'number': '26',
'position': 'Goalkeeper',
'name': 'Emiliano Martínez'},
{'number': '33', 'position': 'Goalkeeper', 'name': 'Matt Macey'},
{'number': '2',
'position': 'Defender',
'name': 'Héctor Bellerín'},
.......................
As an exercise, I wanted to be less reliant on pandas and build a custom merge function on a list of dictionaries. Essentially, this is a left merge, where the original list is preserved and if the key has multiple matches then extra rows are added. However in my case, the extra rows appear to be added but with the exact same information.
Could anyone steer me in the right direction, as to where this code is going wrong?
def merge(self, l2, key):
    #self.data is a list of dictionaries
    #l2 is the second list of dictionaries to merge
    headers = l2[0]
    found = {}
    append_list = []
    for row in self.data:
        for row_b in l2:
            if row_b[key] == row[key] and row[key] not in found:
                found[row[key]] = ""
                for header in headers:
                    row[header] = row_b[header]
            elif row_b[key] == row[key]:
                new_row = row
                for header in headers:
                    new_row[header] = row_b[header]
                append_list.append(new_row)
    self.data.extend(append_list)
Edit: Here is some sample input, and expected output:
self.data = [{'Name':'James', 'Country':'Australia'}, {'Name':'Tom', 'Country':'France'}]
l2 = [{'Country':'France', 'Food':'Frog Legs'}, {'Country':'Australia', 'Food':'Meat Pie'},{'Country':'Australia', 'Food':'Pavlova'}]
I would expect self.data to equal the following after passing through the function, with a parameter of 'Country':
[{'Name':'James', 'Country':'Australia', 'Food':'Meat Pie'}, {'Name':'James', 'Country':'Australia', 'Food':'Pavlova'}, {'Name':'Tom', 'Country':'France', 'Food':'Frog Legs'}]
The function below takes two lists of dictionaries, where the dictionaries are expected to all have keyprop as one of their properties:
from collections import defaultdict
from itertools import product
def left_join(left_table, right_table, keyprop):
    # create a dictionary indexed by `keyprop` on the left
    left = defaultdict(list)
    for row in left_table:
        left[row[keyprop]].append(row)
    # create a dictionary indexed by `keyprop` on the right
    right = defaultdict(list)
    for row in right_table:
        right[row[keyprop]].append(row)
    # now simply iterate through the "left side",
    # grabbing rows from the "right side" if they are available
    result = []
    for key, left_rows in left.items():
        right_rows = right.get(key)
        if right_rows:
            for left_row, right_row in product(left_rows, right_rows):
                result.append({**left_row, **right_row})
        else:
            result.extend(left_rows)
    return result
sample1 = [{'Name':'James', 'Country':'Australia'}, {'Name':'Tom', 'Country':'France'}]
sample2 = [{'Country':'France', 'Food':'Frog Legs'}, {'Country':'Australia', 'Food':'Meat Pie'},{'Country':'Australia', 'Food':'Pavlova'}]
print(left_join(sample1, sample2, 'Country'))
# outputs:
# [{'Name': 'James', 'Country': 'Australia', 'Food': 'Meat Pie'},
# {'Name': 'James', 'Country': 'Australia', 'Food': 'Pavlova'},
# {'Name': 'Tom', 'Country': 'France', 'Food': 'Frog Legs'}]
In a data set where you can assume that rows are unique on the value of keyprop in their respective data sets, the implementation is quite a bit simpler:
def left_join(left_table, right_table, keyprop):
    # create a dictionary indexed by `keyprop` on the left
    left = {row[keyprop]: row for row in left_table}
    # create a dictionary indexed by `keyprop` on the right
    right = {row[keyprop]: row for row in right_table}
    # now simply iterate through the "left side",
    # grabbing rows from the "right side" if they are available
    return [{**leftrow, **right.get(key, {})} for key, leftrow in left.items()]
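As for why the original merge produced extra rows with identical information: new_row = row does not copy the dictionary, it only binds a second name to the same object, so every later assignment through new_row also rewrites the row that is already in the list. A small sketch of the difference, using made-up variable names:

```python
row = {'Name': 'James', 'Country': 'Australia', 'Food': 'Meat Pie'}

alias = row                 # same object, not a copy
alias['Food'] = 'Pavlova'
print(row['Food'])          # Pavlova: the original row changed too

row = {'Name': 'James', 'Country': 'Australia', 'Food': 'Meat Pie'}
new_row = dict(row)         # shallow copy; row.copy() works as well
new_row['Food'] = 'Pavlova'
print(row['Food'], new_row['Food'])   # Meat Pie Pavlova
```

The {**left_row, **right_row} expression in the function above sidesteps the problem because it always builds a new dictionary.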
In Python, as in any other language, it is quite easy to traverse a binary tree in level order (i.e. BFS) using a queue data structure. Given an adjacency list representation in Python and the root of a tree, I can traverse the tree in level order and print the elements level by level. What I cannot do, however, is go from an adjacency list representation to a level dictionary or something similar:
so for example I would like to go from
adjecency_list = {'A': {'B','C'}, 'C':{'D'}, 'B': {'E'}}
to
levels = {0: ['A'], 1: ['B','C'], 2: ['D','E']}
So far I have the following:
from collections import OrderedDict
from queue import Queue
q = Queue()
o = OrderedDict()
root = find_root(adjecency_list) # Separate function, it works fine
height = find_height(root, adjecency_list) # Again works fine
q.put(root)
# Creating a level ordered adjacency list
# using a queue to keep track of pointers
while(not q.empty()):
    current = q.get()
    try:
        if(current in adjecency_list):
            q.put(list(adjecency_list[current])[0])
            # Creating ad_list in level order
            if current in o:
                o[current].append(list(adjecency_list[current])[0])
            else:
                o[current] = [list(adjecency_list[current])[0]]
        if(current in adjecency_list):
            q.put(list(adjecency_list[current])[1])
            # Creating ad_list in level order
            if current in o:
                o[current].append(list(adjecency_list[current])[1])
            else:
                o[current] = [list(adjecency_list[current])[1]]
    except IndexError:
        pass
All it does is place the adjacency list in the correct level order for the tree, and if I printed at the start of the loop it would print a level-order traversal. Nonetheless, it does not solve my problem. I am aware that an adjacency list is not the best representation for a tree, but I am required to use it for the task I am doing.
A recursive way to create the level dictionary from your adjacency list would be -
def level_dict(adj_list,curr_elems,order=0):
    if not curr_elems: # This check ensures that for empty `curr_elems` list we return empty dictionary
        return {}
    d = {}
    new_elems = []
    for elem in curr_elems:
        d.setdefault(order,[]).append(elem)
        new_elems.extend(adj_list.get(elem,[]))
    d.update(level_dict(adj_list,new_elems,order+1))
    return d
The starting input to the function would be the root element in a list, for example ['A'], and the initial level, which would be 0.
In each level, it takes the children of the elements at that level and collects them into a new list, while at the same time building the level dictionary (in d).
Example/Demo -
>>> adjecency_list = {'A': {'B','C'}, 'C':{'D'}, 'B': {'E'}}
>>> def level_dict(adj_list,curr_elems,order=0):
...     if not curr_elems:
...         return {}
...     d = {}
...     new_elems = []
...     for elem in curr_elems:
...         d.setdefault(order,[]).append(elem)
...         new_elems.extend(adj_list.get(elem,[]))
...     d.update(level_dict(adj_list,new_elems,order+1))
...     return d
...
>>> level_dict(adjecency_list,['A'])
{0: ['A'], 1: ['C', 'B'], 2: ['D', 'E']}
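Since the question started from a queue, the same result can also be built iteratively. The sketch below is an alternative to the recursive function, not a replacement for it; levels_bfs is a name I chose, and it sorts each node's children because sets have no defined order (which is also why the output above shows 'C' before 'B').

```python
from collections import deque

def levels_bfs(adj_list, root):
    """Group nodes by depth using a queue of (node, depth) pairs."""
    levels = {}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        levels.setdefault(depth, []).append(node)
        # sorted() gives a repeatable order; sets themselves are unordered
        for child in sorted(adj_list.get(node, ())):
            queue.append((child, depth + 1))
    return levels

adjacency_list = {'A': {'B', 'C'}, 'C': {'D'}, 'B': {'E'}}
print(levels_bfs(adjacency_list, 'A'))
# {0: ['A'], 1: ['B', 'C'], 2: ['E', 'D']}
```

Within a level, nodes appear in the order their parents were visited, so level 2 comes out as ['E', 'D'] here; sort each list afterwards if alphabetical order per level is wanted.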