How to map nested list to flat values - python

I'm trying to parse a spreadsheet with a header that looks something like this:
My problem is the nested keys below "Контрагент" ("Counterparty"). I decided to represent the header like this:
['Дата',
'Номер документа',
'Дебет',
'Кредит',
['Контрагент',
['Наименование', 'ИНН', 'КПП', 'Счет', 'БИК', 'Наименование банка']],
'Назначение платежа',
'Код дебитора',
'Тип документа']
But now I don't really know how to map it to a flat list of values:
['21.05.2021',
'591324565436',
'0.00',
'526345428.99',
'asdasd',
'234525460140679',
'77130100123412341',
'302328105423534200000000280',
'0445252345234974',
'asdfsadfsd',
'sdfghsfgdhfdghdfgh',
'',
'dfghfgdhfdgh']
Given these variables, I want a function that returns the following dict:
{
"Дата": "21.05.2021",
"Номер документа": "591324565436",
"Дебет": "0.00",
"Кредит": "526345428.99",
"Контрагент": {
"Наименование": "asdasd",
"ИНН": "234525460140679",
"КПП": "77130100123412341",
"Счет": "302328105423534200000000280",
"БИК": "0445252345234974",
"Наименование банка": "asdfsadfsd"
},
"Назначение платежа": "sdfghsfgdhfdghdfgh",
"Код дебитора": "",
"Тип документа": "dfghfgdhfdgh"
}
I got this far before realizing it raises an IndexError on the third line (the schema has only 8 top-level entries, but the data has 13 values):
def map_to_schema(schema, data):
    for i, elem in enumerate(data):
        key = schema[i]
        if isinstance(key, list):
            if key[0] not in result:
                result[key[0]] = {}
            result[key[0]] |= {
                key[1][i-len(key)]: elem
            }
        else:
            result[key] = elem
What should I do? Maybe the structure for the schema isn't good enough? I really have no idea...

You could use a dictionary comprehension and an iterator:
headers = ['Дата', 'Номер документа', 'Дебет', 'Кредит',
           ['Контрагент', ['Наименование', 'ИНН', 'КПП', 'Счет', 'БИК', 'Наименование банка']],
           'Назначение платежа', 'Код дебитора', 'Тип документа']
values = ['21.05.2021', '591324565436', '0.00', '526345428.99', 'asdasd', '234525460140679', '77130100123412341',
          '302328105423534200000000280', '0445252345234974', 'asdfsadfsd', 'sdfghsfgdhfdghdfgh', '',
          'dfghfgdhfdgh']
it = iter(values)
out = {k[0] if (islist := isinstance(k, list)) else k:
       {k2: next(it) for k2 in k[1]} if islist else next(it)
       for k in headers}
Output:
{'Дата': '21.05.2021',
'Номер документа': '591324565436',
'Дебет': '0.00',
'Кредит': '526345428.99',
'Контрагент': {'Наименование': 'asdasd',
'ИНН': '234525460140679',
'КПП': '77130100123412341',
'Счет': '302328105423534200000000280',
'БИК': '0445252345234974',
'Наименование банка': 'asdfsadfsd'},
'Назначение платежа': 'sdfghsfgdhfdghdfgh',
'Код дебитора': '',
'Тип документа': 'dfghfgdhfdgh'}

Thanks @mozway for this solution! This is essentially the same algorithm, written as a for loop.
def map_to_schema(schema, s_length, row: list):
    # If len(row) were less than the *true* schema length (the number of leaf keys),
    # next() would raise StopIteration, so pad the row with empty strings first.
    if (delta := s_length - len(row)) > 0:
        row.extend([""] * delta)
    iter_row = iter(row)
    result = {}
    for key in schema:
        if isinstance(key, list):
            result[key[0]] = {}
            for sub_key in key[1]:
                result[key[0]][sub_key] = next(iter_row)
        else:
            result[key] = next(iter_row)
    return result
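Both answers above handle one level of nesting. As a minimal sketch, assuming the same schema convention (a nested group is a two-element list `[name, [sub_keys...]]`), a recursive helper can handle deeper nesting too. The name `map_nested` is hypothetical, and using `next(it, "")` to fill missing trailing values with empty strings is an assumption that replaces the explicit padding step:

```python
def map_nested(schema, values):
    """Map a flat list of values onto a (possibly deeply) nested schema."""
    it = iter(values)

    def build(keys):
        result = {}
        for key in keys:
            if isinstance(key, list):
                # A nested group: [name, [sub_key, ...]]
                name, sub_keys = key
                result[name] = build(sub_keys)
            else:
                # Assumption: missing trailing values become empty strings.
                result[key] = next(it, "")
        return result

    return build(schema)


schema = ['a', ['b', ['c', 'd']], 'e']
print(map_nested(schema, ['1', '2', '3', '4']))
```

Because a single iterator is shared by all recursion levels, values are consumed strictly in schema order, which is the same idea as the `iter(values)` trick above.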

Related

Get all the keys of a nested dict

With xmltodict I managed to get my code from xml in a dict and now I want to create an excel.
In this excel the header of a value is going to be all the parents (keys in the dict).
For example:
dict = {"name":"Pete", "last-name": "Pencil", "adres":{"street": "example1street", "number":"5", "roommate":{"gender":"male"}}}
The value male will have the header: adres/roommate/gender.
Here's a way to organize the data in the way your question asks:
d = {"name":"Pete", "last-name": "Pencil", "adres":{"street": "example1street", "number":"5", "roommate":{"gender":"male"}}}
print(d)
stack = [('', d)]
headerByValue = {}
while stack:
    name, top = stack.pop()
    if isinstance(top, dict):
        stack += (((name + '/' if name else '') + k, v) for k, v in top.items())
    else:
        headerByValue[name] = top
print(headerByValue)
Output:
{'adres/roommate/gender': 'male',
'adres/number': '5',
'adres/street': 'example1street',
'last-name': 'Pencil',
'name': 'Pete'}
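The same traversal can also be written recursively. This is only a sketch of an alternative to the explicit stack; the generator name `iter_paths` is hypothetical. Unlike the stack version, it yields the leaves in the dict's own insertion order:

```python
def iter_paths(d, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested dict."""
    for key, value in d.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            # Descend, carrying the accumulated parent path along.
            yield from iter_paths(value, path)
        else:
            yield path, value


d = {"name": "Pete", "last-name": "Pencil",
     "adres": {"street": "example1street", "number": "5",
               "roommate": {"gender": "male"}}}
print(dict(iter_paths(d)))
```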

Create new dictionary with specific keys from old dictionary

I want to build a new dictionary containing uuid, name, website, and email address for every row of my data that has values for all four of these attributes.
I thought the code below did this for email, name, and website, but I noticed that sometimes name or email isn't printed (because the values are missing). How do I drop those rows? Also, uuid is outside of the nested dictionary; how do I add it to the new dictionary too?
My code and one element of my data are below.
new2 = {}
for i in range (0, len(json_file)):
    try:
        check = json_file[i]['payload']
        new = {k: v for k, v in check.items() if v is not None}
        new2 = {k: new[k] for k in new.keys() & {'name', 'website', 'email'}}
        print(new2)
    except:
        continue
Dictionary sample:
{
    "payload":{
        "existence_full":1,
        "geo_virtual":"[\"56.9459720|-2.1971226|20|within_50m|4\"]",
        "latitude":"56.945972",
        "locality":"Stonehaven",
        "_records_touched":"{\"crawl\":8,\"lssi\":0,\"polygon_centroid\":0,\"geocoder\":0,\"user_submission\":0,\"tdc\":0,\"gov\":0}",
        "address":"The Lodge, Dunottar",
        "email":"dunnottarcastle@btconnect.com",
        "existence_ml":0.5694238217658721,
        "domain_aggregate":"",
        "name":"Dunnottar Castle",
        "search_tags":[
            "Dunnottar Castle Aberdeenshire",
            "Dunotter Castle"
        ],
        "admin_region":"Scotland",
        "existence":1,
        "category_labels":[
            [
                "Landmarks",
                "Buildings and Structures"
            ]
        ],
        "post_town":"Stonehaven",
        "region":"Kincardineshire",
        "review_count":"719",
        "geocode_level":"within_50m",
        "tel":"01569 762173",
        "placerank":65,
        "longitude":"-2.197123",
        "placerank_ml":37.27916073464469,
        "fax":"01330 860325",
        "category_ids_text_search":"",
        "website":"http://www.dunnottarcastle.co.uk",
        "status":"1",
        "geocode_confidence":"20",
        "postcode":"AB39 2TL",
        "category_ids":[
            108
        ],
        "country":"gb",
        "_geocode_quality":"4"
    },
    "uuid":"3867aaf3-12ab-434f-b12b-5d627b3359c3"
}
Try using the dict.get() method, which returns None instead of raising KeyError when a key is missing:
def new_dict(input_dict, keys, fallback='payload'):
    ret = dict()
    for key in keys:
        # Look at the top level first, then inside the nested 'payload' dict.
        val = input_dict.get(key) or input_dict[fallback].get(key)
        if val:
            ret.update({key:val})
    if len(ret) == 4: # or you could do: if set(ret.keys()) == set(keys):
        print(ret)

for dicto in json_file:
    new_dict(dicto, ['name','website','email','uuid'])
Output:
{'name': 'Dunnottar Castle', 'website': 'http://www.dunnottarcastle.co.uk', 'email': 'dunnottarcastle@btconnect.com', 'uuid': '3867aaf3-12ab-434f-b12b-5d627b3359c3'}
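If the filtered records are needed later rather than just printed, a variant that returns the new dictionary may be more convenient. The following is a sketch under assumptions: the function name `pick_complete` and its parameters are hypothetical, and a record counts as incomplete when any wanted key is missing, `None`, or an empty string:

```python
def pick_complete(record, keys=("uuid", "name", "website", "email"), nested="payload"):
    """Return a dict with the wanted keys, or None if any of them has no value."""
    payload = record.get(nested) or {}
    picked = {}
    for key in keys:
        # Top-level keys (such as uuid) win; otherwise fall back to the payload.
        value = record.get(key) or payload.get(key)
        if not value:
            return None  # incomplete record: drop it
        picked[key] = value
    return picked


records = [
    {"payload": {"name": "A", "website": "http://a.example", "email": "a@a.example"}, "uuid": "u1"},
    {"payload": {"name": "B", "website": None, "email": ""}, "uuid": "u2"},
]
complete = [p for p in map(pick_complete, records) if p is not None]
print(complete)
```

The sample records here are made up for illustration; with the real data, `records` would be `json_file`.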

Trouble getting right values against each item

I'm trying to parse the item names and their corresponding values from the snippet below. The dt tags hold the names and the dd tags hold the values. A few dt tags have no corresponding dd, so not every name has a value. For any name without a value, I want to keep the value blank.
These are the elements I would like to scrape data from:
content="""
<div class="movie_middle">
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Size:</dt>
<dd>1.60G</dd>
<dt>Quality:</dt>
<dd>1080p</dd>
<dt>Frame Rate:</dt>
<dd>23.976 fps</dd>
<dt>Language:</dt>
</dl>
</div>
"""
I've tried like below:
from bs4 import BeautifulSoup

soup = BeautifulSoup(content,"lxml")
title = [item.text for item in soup.select(".movie_middle dt")]
result = [item.text for item in soup.select(".movie_middle dd")]
vault = dict(zip(title,result))
print(vault)
It gives me messy results (wrong pairs):
{'Genres:': '1920*1080', 'Resolution:': '1.60G', 'Size:': '1080p', 'Quality:': '23.976 fps'}
My expected result:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p','Frame Rate:':'23.976 fps','Language:':''}
Any help on fixing the issue will be highly appreciated.
You can loop through the elements inside dl. If the current element is dt and the next element is dd, then store the value as the next element, else set the value as empty string.
dl = soup.select('.movie_middle dl')[0]
elems = dl.find_all() # Returns the list of dt and dd
data = {}
for i, el in enumerate(elems):
    if el.name == 'dt':
        key = el.text.replace(':', '')
        # check if the next element is a `dd`
        if i < len(elems) - 1 and elems[i+1].name == 'dd':
            data[key] = elems[i+1].text
        else:
            data[key] = ''
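The pairing rule in that loop does not depend on the parser. As a sketch, assuming the elements have already been reduced to `(tag_name, text)` tuples in document order (the helper name `pair_terms` is hypothetical), the same logic looks like this:

```python
def pair_terms(elems):
    """Pair each ('dt', text) with the ('dd', text) right after it, else ''."""
    data = {}
    for i, (name, text) in enumerate(elems):
        if name != "dt":
            continue
        # Look one element ahead, if there is one.
        nxt = elems[i + 1] if i + 1 < len(elems) else None
        data[text] = nxt[1] if nxt is not None and nxt[0] == "dd" else ""
    return data


elems = [
    ("dt", "Genres:"),
    ("dt", "Resolution:"), ("dd", "1920*1080"),
    ("dt", "Size:"), ("dd", "1.60G"),
    ("dt", "Language:"),
]
print(pair_terms(elems))
```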
You can use BeautifulSoup to parse the dl structure, and then write a function to create the dictionary:
from bs4 import BeautifulSoup as soup
import re
def parse_result(d):
    # d is the list of '<dt>...</dt>' / '<dd>...</dd>' lines inside the <dl>
    while d:
        a, *_d = d
        if _d:
            if re.findall(r'<dt', a) and re.findall(r'<dd', _d[0]):
                yield [a[4:-5], _d[0][4:-5]]
                d = _d[1:]
            else:
                yield [a[4:-5], '']
                d = _d
        else:
            yield [a[4:-5], '']
            d = []
print(dict(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1])))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
For a slightly longer, although cleaner solution, you can create a decorator to strip the HTML tags of the output, thus removing the need for the extra string slicing in the main parse_result function:
def strip_tags(f):
    def wrapper(data):
        return {a[4:-5]:b[4:-5] for a, b in f(data)}
    return wrapper
@strip_tags
def parse_result(d):
    while d:
        a, *_d = d
        if _d:
            if re.findall(r'<dt', a) and re.findall(r'<dd', _d[0]):
                yield [a, _d[0]]
                d = _d[1:]
            else:
                yield [a, '']
                d = _d
        else:
            yield [a, '']
            d = []
print(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1]))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
from collections import defaultdict
test = soup.text.split('\n')
d = defaultdict(list)
for i in range(len(test)):
    if (':' in test[i]) and (':' not in test[i+1]):
        d[test[i]] = test[i+1]
    elif ':' in test[i]:
        d[test[i]] = ''
d
defaultdict(list,
{'Frame Rate:': '23.976 fps',
'Genres:': '',
'Language:': '',
'Quality:': '1080p',
'Resolution:': '1920*1080',
'Size:': '1.60G'})
The logic here is that every key contains a colon. Knowing this, you can write an if/else statement that distinguishes the two cases: a key followed by another key, or a key followed by a value.
Edit:
In case you wanted to clean your keys, below replaces the : in each one:
d1 = { x.replace(':', ''): d[x] for x in d.keys() }
d1
{'Frame Rate': '23.976 fps',
'Genres': '',
'Language': '',
'Quality': '1080p',
'Resolution': '1920*1080',
'Size': '1.60G'}
The problem is that empty elements are not present. Since there is no hierarchy between the <dt> and the <dd>, I'm afraid you'll have to craft the dictionary yourself.
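vault = {}
category = ""
for item in soup.find("dl").findChildren():
    if item.name == "dt":
        if category:
            # The previous <dt> had no <dd>: record it with an empty value.
            vault[category] = ""
        category = item.text
    elif item.name == "dd":
        vault[category] = item.text
        category = ""
if category:
    # A trailing <dt> without a <dd> also gets an empty value.
    vault[category] = ""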
Basically this code iterates over the child elements of the <dl> and fills the vault dictionary with the values.
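For comparison, here is a sketch of the same "remember the pending dt" idea using only the standard library's `html.parser` instead of BeautifulSoup. This is a swapped-in technique rather than what the answers above use, and the class name `DlParser` is hypothetical:

```python
from html.parser import HTMLParser


class DlParser(HTMLParser):
    """Collect <dt>/<dd> pairs; a <dt> without a following <dd> maps to ''."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None   # 'dt' or 'dd' while inside one of those tags
        self._term = None  # text of the most recent <dt> still waiting for a <dd>

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "dt":
            self._term = text
            self.pairs[text] = ""  # default until a <dd> shows up
        elif self._tag == "dd" and self._term is not None:
            self.pairs[self._term] = text
            self._term = None


content = """
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Language:</dt>
</dl>
"""
parser = DlParser()
parser.feed(content)
print(parser.pairs)
```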

Merge duplicate entries in array of dict

I'm struggling with a recursive merge problem.
Let's say I have:
a=[{'name':"bob",
    'age':10,
    'email':"bob@bla",
    'profile':{'id':1, 'role':"admin"}},
   {'name':"bob",
    'age':10,
    'email':"other mail",
    'profile':{'id':2, 'role':"dba"},
    'home':"/home/bob"
   }]
and I need something that recursively merges the entries. If the value for an existing key on the same level is different, the values should be collected into a list.
b = merge(a)
print(b)
{'name':"bob",
 'age':10,
 'email':["bob@bla","other mail"],
 'profile':{'id':[1,2], 'role':["admin", "dba"]},
 'home':"/home/bob"}
I wrote this code:
def merge(items):
    merged = {}
    for item in items:
        for key in item.keys():
            if key in merged.keys():
                if item[key] != merged[key]:
                    if not isinstance(merged[key], list):
                        merged[key] = [merged[key]]
                    if item[key] not in merged[key]:
                        merged[key].append(item[key])
            else:
                merged[key] = item[key]
    return merged
The output is:
{'age': 10,
 'email': ['bob@bla', 'other mail'],
 'home': '/home/bob',
 'name': 'bob',
 'profile': [{'id': 1, 'role': 'admin'}, {'id': 2, 'role': 'dba'}]}
Which is not what I want.
I can't figure out how to deal with recursion.
Thanks :)
As you iterate over each dictionary in the arguments, then each key and value in each dictionary, you want the following rules:
1. If there is nothing against that key in the output, add the new key and value to the output;
2. If there is a value for that key, and it's the same as the new value, do nothing;
3. If there is a value for that key, and it's a list, append the new value to the list;
4. If there is a value for that key, and it's a dictionary, recursively merge the new value with the existing dictionary;
5. If there is a value for that key, and it's neither a list nor a dictionary, make the value in the output a list of the current value and the new value.
In code:
def merge(*dicts):
    """Recursively merge the argument dictionaries."""
    out = {}
    for dct in dicts:
        for key, val in dct.items():
            try:
                out[key].append(val) # 3.
            except AttributeError:
                if out[key] == val:
                    pass # 2.
                elif isinstance(out[key], dict):
                    out[key] = merge(out[key], val) # 4.
                else:
                    out[key] = [out[key], val] # 5.
            except KeyError:
                out[key] = val # 1.
    return out
In use:
>>> import pprint
>>> pprint.pprint(merge(*a))
{'age': 10,
 'email': ['bob@bla', 'other mail'],
 'home': '/home/bob',
 'name': 'bob',
 'profile': {'id': [1, 2], 'role': ['admin', 'dba']}}
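The five rules can also be written with explicit `isinstance` checks instead of exception handling. The sketch below is a variation, not the answer's code: the name `merge_dicts` is hypothetical, and it additionally skips values that are already in the list (as the question's original loop did), which the try/except version does not do:

```python
def merge_dicts(*dicts):
    """Recursively merge dicts, collecting differing values into lists."""
    out = {}
    for dct in dicts:
        for key, val in dct.items():
            if key not in out:
                out[key] = val                           # rule 1
            elif isinstance(out[key], dict) and isinstance(val, dict):
                out[key] = merge_dicts(out[key], val)    # rule 4
            elif isinstance(out[key], list):
                if val not in out[key]:                  # rule 3, minus duplicates
                    out[key].append(val)
            elif out[key] != val:
                out[key] = [out[key], val]               # rule 5
            # otherwise the values are equal: rule 2, nothing to do
    return out


a = [{'name': "bob", 'age': 10, 'email': "bob@bla",
      'profile': {'id': 1, 'role': "admin"}},
     {'name': "bob", 'age': 10, 'email': "other mail",
      'profile': {'id': 2, 'role': "dba"}, 'home': "/home/bob"}]
print(merge_dicts(*a))
```

One caveat of this sketch: if an input value is itself a list, it is stored by reference, so later appends would modify the caller's list.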

requests form-urlencoded data

UPDATE:
I think I'm a little closer now, if I change my data structure to the following and use:
urllib.urlencode(data, doseq=True)
data ={'Properties': [('key', 'KeyLength'), ('value', '512')], 'Category': 'keysets', 'Offset': '0', 'Limit': '10'}
I now get the following which is a lot closer, but still not quite correct:
Category=keysets&Limit=10&Properties=('key', 'KeyLength')&Properties=('value', '512')&Offset=0
I'm re-writing this question because I think I know what the problem is, but still don't quite know how to fix it.
I think the problem relates to the fact that the form data I need to send contains form fields with the same name. i.e. the 'Properties'. This is my data structure:
data = {'Properties': [{'key': 'KeyLength', 'value': '512'}], 'Category': 'keysets', 'Offset': '0', 'Limit': '100'}
and this is how it should appear when received by the web service:
Form Data:
Properties[0][key]:KeyLength
Properties[0][value]:768
Category:
Offset:0
Limit:100
I'm posting using 'requests' this:
req = requests.post('http://server1/ws1/api/data/filters/', data=data)
However it seems to end up like this:
Category=keysets&Limit=100&Properties=('key', 'KeyLength')&Offset=0
Instead of like this:
Properties[0][key]=KeyLength&Properties[0][value]=768&Category=&Offset=0&Limit=100
Can somebody please advise what I'm doing wrong?
You won't be able to get that directly. According to the Python Standard Library manual, urlencode only converts a mapping object or a sequence of two-element tuples.
There is no support for expanding nested list or dict values the way you want, so you will have to preprocess your data. You could try a function like this:
def transform(h, resul=None, kk=None):
    if resul is None:
        resul = {}
    for (k, v) in h.items():
        key = k if kk is None else "%s[%s]" % (kk, k)
        if isinstance(v, list) or isinstance(v, tuple):
            # Each list element is expected to be a dict here.
            for i, v1 in enumerate(v):
                transform(v1, resul, '%s[%d]' % (key, i))
        elif isinstance(v, dict):
            # Recurse into the nested dict; its keys are appended to `key`.
            transform(v, resul, key)
        else:
            resul[key] = v
    return resul
With your original data structure, it gives:
>>> data = {'Properties': [{'key': 'KeyLength', 'value': '512'}], 'Category': 'keysets', 'Offset': '0', 'Limit': '100'}
>>> transform(data)
{'Category': 'keysets', 'Limit': '100', 'Properties[0][key]': 'KeyLength', 'Offset': '0', 'Properties[0][value]': '512'}
The flattened dict can then be passed as data= and will be encoded as you require.
It seems that the answer contributed by Serge does not play nicely with lists of plain values embedded further down in the tree (their elements don't have an items() method).
Here's a variation on the idea that's working for me:
def flattenForPost(h, resul = None, kk=None):
    if resul is None:
        resul = {}
    # Note: only str and bool leaves are kept; other scalar types are skipped.
    if isinstance(h, str) or isinstance(h, bool):
        resul[kk] = h
    elif isinstance(h, list) or isinstance(h, tuple):
        for i, v1 in enumerate(h):
            flattenForPost(v1, resul, '%s[%d]' % (kk, i))
    elif isinstance(h, dict):
        for (k, v) in h.items():
            key = k if kk is None else "%s[%s]" % (kk, k)
            if isinstance(v, dict):
                for i, v1 in v.items():
                    flattenForPost(v1, resul, '%s[%s]' % (key, i))
            else:
                flattenForPost(v, resul, key)
    return resul
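To check what a flattened structure looks like once it is encoded, here is a self-contained sketch for Python 3 using `urllib.parse.urlencode`. The function name `flatten_for_post` is hypothetical; it assumes the top-level value is a dict and, unlike the version above, keeps every scalar leaf (including numbers):

```python
from urllib.parse import urlencode, parse_qsl


def flatten_for_post(value, prefix=None, out=None):
    """Flatten nested dicts/lists into PHP-style bracketed form keys."""
    if out is None:
        out = {}
    if isinstance(value, dict):
        for k, v in value.items():
            flatten_for_post(v, k if prefix is None else f"{prefix}[{k}]", out)
    elif isinstance(value, (list, tuple)):
        for i, v in enumerate(value):
            flatten_for_post(v, f"{prefix}[{i}]", out)
    else:
        out[prefix] = value  # scalar leaf
    return out


data = {'Properties': [{'key': 'KeyLength', 'value': '512'}],
        'Category': 'keysets', 'Offset': '0', 'Limit': '100'}
flat = flatten_for_post(data)
encoded = urlencode(flat)
print(flat)
print(encoded)  # the brackets in the keys are percent-encoded here
```

Passing `flat` as `data=` to `requests.post` should produce an equivalent body, since requests form-encodes a flat dict in the same way.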
