Python - nested dictionary. Where is the bug? - python

I have a CSV file that I've filtered into a list and grouped. Example:
52713
['52713', '', 'Vmax', '', 'Start Value', '', '\n']
['52713', '', 'Vmax', '', 'ECNumber', '1.14.12.17', '\n']
['52713', 'O2', 'Km', 'M', 'Start Value', '3.5E-5', '\n']
['52713', 'O2', 'Km', 'M', 'ECNumber', '1.14.12.17', '\n']
52714
['52714', '', 'Vmax', '', 'Start Value', '', '\n']
['52714', '', 'Vmax', '', 'ECNumber', '1.14.12.17', '\n']
['52714', 'O2', 'Km', 'M', 'Start Value', '1.3E-5', '\n']
['52714', 'O2', 'Km', 'M', 'ECNumber', '1.14.12.17', '\n']
From this, I create a nested dictionary with the structure:
dict = ID number:{Km:n, Kcat:n, ECNumber:n}
...for every ID in the list.
I use the following code to create this dictionary
dict = {}
for key, items in groupby(FilteredTable1[1:], itemgetter(0)):
    #print key
    for subitem in items:
        #print subitem
        dict[subitem[EntryID]] = {}
        dict[subitem[EntryID]]['EC'] = []
        dict[subitem[EntryID]]['Km'] = []
        dict[subitem[EntryID]]['Kcat'] = []
        if 'ECNumber' in subitem:
            dict[subitem[EntryID]]['EC'] = subitem[value]
        if 'Km' in subitem and 'Start Value' in subitem:
            dict[subitem[EntryID]]['Km'] = subitem[value]
            #print subitem
This works for the ECNumber value, but not the Km value. It can print the line, showing that it identifies the Km value as being present, but doesn't put it in the dictionary.
Example output:
{'Km': [], 'EC': '1.14.12.17', 'Kcat': []}
Any ideas?
Ben

The problem is that your inner for loop keeps reinitializing dict[subitem[EntryID]] even though it may already exist. That's fixed in the following by explicitly checking to see if it's already there:
dict = {}
for key, items in groupby(FilteredTable1[1:], itemgetter(0)):
    #print key
    for subitem in items:
        #print ' ', subitem
        if subitem[EntryID] not in dict:
            dict[subitem[EntryID]] = {}
            dict[subitem[EntryID]]['EC'] = []
            dict[subitem[EntryID]]['Km'] = []
            dict[subitem[EntryID]]['Kcat'] = []
        if 'ECNumber' in subitem:
            dict[subitem[EntryID]]['EC'] = subitem[value]
        if 'Km' in subitem and 'Start Value' in subitem:
            dict[subitem[EntryID]]['Km'] = subitem[value]
            #print subitem
However, this code could be made more efficient by using something like the following instead, which avoids recomputing values and repeated dictionary lookups. It also avoids using the name of a built-in type (dict) as a variable name, since shadowing built-ins is poor practice. PEP 8, the Style Guide for Python Code, also suggests using CamelCase only for class names, not for variable names like FilteredTable1, but I didn't change that.
adict = {}
for key, items in groupby(FilteredTable1[1:], itemgetter(0)):
    #print key
    for subitem in items:
        #print ' ', subitem
        entry_id = subitem[EntryID]
        if entry_id not in adict:
            adict[entry_id] = {'EC': [], 'Km': [], 'Kcat': []}
        entry = adict[entry_id]
        if 'ECNumber' in subitem:
            entry['EC'] = subitem[value]
        if 'Km' in subitem and 'Start Value' in subitem:
            entry['Km'] = subitem[value]
            #print subitem
Actually, since you're building a dictionary of dictionaries, it's not clear that there's any advantage to using groupby to do so.

I'm posting this to follow-up and extend on my previous answer.
For starters, you could streamline the code a little further by eliminating the need to check for preexisting entries, simply by making the dictionary being created a collections.defaultdict (a dict subclass) instead of a regular one:
from collections import defaultdict
adict = defaultdict(lambda: {'EC': [], 'Km': [], 'Kcat': []})
for key, items in groupby(FilteredTable1[1:], itemgetter(0)):
    for subitem in items:
        entry = adict[subitem[EntryID]]
        if 'ECNumber' in subitem:
            entry['EC'] = subitem[value]
        if 'Km' in subitem and 'Start Value' in subitem:
            entry['Km'] = subitem[value]
Secondly, as I mentioned in the other answer, I don't think you're gaining anything by using itertools.groupby() to do this, except making the process more complicated than needed. This is because what you're basically doing is making a dictionary of dictionaries whose entries can all be randomly accessed, so there's no benefit in going to the trouble of grouping them first. The code below illustrates this (in conjunction with using a defaultdict as shown above):
adict = defaultdict(lambda: {'EC': [], 'Km': [], 'Kcat': []})
for subitem in FilteredTable1[1:]:
    entry = adict[subitem[EntryID]]
    if 'ECNumber' in subitem:
        entry['EC'] = subitem[value]
    if 'Km' in subitem and 'Start Value' in subitem:
        entry['Km'] = subitem[value]
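To make the defaultdict version concrete, here is a minimal self-contained sketch run on rows shaped like the sample in the question. The column positions ENTRY_ID = 0 and VALUE = 5 are assumptions inferred from the sample rows (the question never shows how EntryID and value are defined), and build_table is a hypothetical helper name.

```python
from collections import defaultdict

# Assumed column positions, inferred from the sample rows in the question.
ENTRY_ID = 0
VALUE = 5

rows = [
    ['52713', '', 'Vmax', '', 'Start Value', '', '\n'],
    ['52713', '', 'Vmax', '', 'ECNumber', '1.14.12.17', '\n'],
    ['52713', 'O2', 'Km', 'M', 'Start Value', '3.5E-5', '\n'],
    ['52713', 'O2', 'Km', 'M', 'ECNumber', '1.14.12.17', '\n'],
    ['52714', '', 'Vmax', '', 'Start Value', '', '\n'],
    ['52714', '', 'Vmax', '', 'ECNumber', '1.14.12.17', '\n'],
    ['52714', 'O2', 'Km', 'M', 'Start Value', '1.3E-5', '\n'],
    ['52714', 'O2', 'Km', 'M', 'ECNumber', '1.14.12.17', '\n'],
]

def build_table(rows):
    # Each new ID automatically starts with empty EC/Km/Kcat slots.
    table = defaultdict(lambda: {'EC': [], 'Km': [], 'Kcat': []})
    for row in rows:
        entry = table[row[ENTRY_ID]]
        if 'ECNumber' in row:
            entry['EC'] = row[VALUE]
        if 'Km' in row and 'Start Value' in row:
            entry['Km'] = row[VALUE]
    return dict(table)

result = build_table(rows)
print(result['52713'])  # {'EC': '1.14.12.17', 'Km': '3.5E-5', 'Kcat': []}
```

Because the entry for an ID is created only on first access, later rows for the same ID update it instead of wiping it out, which is exactly what the original loop got wrong.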

Related

multiple separator in a string python

text="Brand.*/Smart Planet.#/Color.*/Yellow.#/Type.*/Sandwich Maker.#/Power Source.*/Electrical."
I have this kind of string. I am having trouble splitting it into two lists. The output should look approximately like this:
name = ['Brand','Color','Type','Power Source']
value = ['Smart Planet','Yellow','Sandwich Maker','Electrical']
Is there any solution for this?
name = []
value = []
text = text.split('.#/')
for i in text:
    i = i.split('.*/')
    name.append(i[0])
    value.append(i[1])
This is one approach using re.split and list slicing.
Ex:
import re
text="Brand.*/Smart Planet.#/Color.*/Yellow.#/Type.*/Sandwich Maker.#/Power Source.*/Electrical."
data = [i for i in re.split(r"[^A-Za-z\s]+", text) if i]
name = data[::2]
value = data[1::2]
print(name)
print(value)
Output:
['Brand', 'Color', 'Type', 'Power Source']
['Smart Planet', 'Yellow', 'Sandwich Maker', 'Electrical']
You can use regex to split the text, and populate the lists in a loop.
Using regex you protect your code from invalid input.
import re
name, value = [], []
for ele in re.split(r'\.#\/', text):
    k, v = ele.split('.*/')
    name.append(k)
    value.append(v)
>>> print(name, value)
['Brand', 'Color', 'Type', 'Power Source'] ['Smart Planet', 'Yellow', 'Sandwich Maker', 'Electrical.']
text="Brand.*/Smart Planet.#/Color.*/Yellow.#/Type.*/Sandwich Maker.#/Power Source.*/Electrical."
name=[]
value=[]
word=''
for i in range(len(text)):
    temp=i
    if text[i]!='.' and text[i]!='/' and text[i]!='*' and text[i]!='#':
        word=word+''.join(text[i])
    elif temp+1<len(text) and temp+2<=len(text):
        if text[i]=='.' and text[temp+1]=='*' and text[temp+2]=='/':
            name.append(word)
            word=''
        elif text[i]=='.' and text[temp+1]=='#' and text[temp+2]=='/':
            value.append(word)
            word=''
    else:
        value.append(word)
print(name)
print(value)
This will work.
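As a compact alternative to the answers above, the same split can be sketched with plain str.split and zip, assuming every pair contains exactly one '.*/' separator and the string ends with a single trailing period. The helper name split_pairs is hypothetical.

```python
text = ("Brand.*/Smart Planet.#/Color.*/Yellow.#/Type.*/Sandwich Maker"
        ".#/Power Source.*/Electrical.")

def split_pairs(text):
    # Drop the single trailing period, then split into "name.*/value" chunks.
    if text.endswith('.'):
        text = text[:-1]
    pairs = [chunk.split('.*/', 1) for chunk in text.split('.#/')]
    # zip(*pairs) transposes [[name, value], ...] into (names, values).
    names, values = zip(*pairs)
    return list(names), list(values)

name, value = split_pairs(text)
print(name)   # ['Brand', 'Color', 'Type', 'Power Source']
print(value)  # ['Smart Planet', 'Yellow', 'Sandwich Maker', 'Electrical']
```

Unpacking into names and values raises a ValueError if any chunk lacks the '.*/' separator, so malformed input fails loudly rather than producing misaligned lists.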

How to insert data into an array of dictionaries in an order without missing data via regex

This is my code:
I'm trying to use the following code to insert data into an array of dictionaries, but the data is not inserted properly.
Code:
import re
test_list = {'module_serial-1': 'PSUXA12345680', 'module_name-1': 'CH1.FM5', 'module_name-2': 'CH1.FM6', 'module_serial-2': 'PSUXA12345681'}
def parse_subdevice_modules(row):
    modules = []
    module = {}
    for k, v in row.items():
        if v:
            if re.match("module_name", k):
                module['name'] = v
            if re.match("module_serial", k):
                module['serial'] = v
                modules.append(module)
                module = {}
    return modules
print(parse_subdevice_modules(test_list))
Expected output:
[{'name': 'CH1.FM5', 'serial': 'PSUXA12345680'}, {'name': 'CH1.FM6', 'serial': 'PSUXA12345681'}]
Actual output:
[{'serial': 'PSUXA12345680'}, {'name': 'CH1.FM6', 'serial': 'PSUXA12345681'}]
Run it here: https://repl.it/repls/WetSteelblueRange
Please note that the order of the data test_list cannot be altered as it comes via an external API so I used regex. Any ideas would be appreciated.
Your code relies on the wrong assumption that keys are ordered and that the serial will always follow the name. The proper solution here is to use a dict (actually a collections.defaultdict to make things easier) to collect and regroup the values you're interested in based on the module number (the final '-N' in the key). Note that you don't need regexps here: Python strings already provide the necessary operations for this task:
from collections import defaultdict
def parse_subdevice_modules(row):
    modules = defaultdict(dict)
    for k, v in row.items():
        # first get rid of what we're not interested in
        if not v:
            continue
        if not k.startswith("module_"):
            continue
        # retrieve the key number (last char) with
        # negative string indexing:
        key_num = k[-1]
        # retrieve the useful part of the key ("name" or "serial")
        # by splitting the string:
        key_name = k.split("_")[1].split("-")[0]
        # and now we just have to store this in our defaultdict
        modules[key_num][key_name] = v
    # and return only the values.
    # NB: in py2.x you don't need the call to `list`,
    # you can just return `modules.values()` directly
    modules = list(modules.values())
    return modules
test_list = {
    'profile': '', 'chassis_name': '123', 'supplier_order_num': '',
    'device_type': 'mass_storage', 'device_subtype': 'flashblade',
    'module_serial-1': 'PSUXA12345680', 'module_name-1': 'CH1.FM5',
    'module_name-2': 'CH1.FM6', 'rack_total_pos': '',
    'asset_tag': '002000027493', 'module_serial-2': 'PSUXA12345681',
    'purchase_order': '0004530869', 'build': 'Test_Build_for_SNOW',
    'po_line_num': '00190', 'mac_address': '', 'position': '7',
    'model': 'FB-528TB-10X52.8TB', 'manufacturer': 'PureStorage',
    'rack': 'Test_Rack_2', 'serial': 'PMPAM1842147D', 'name': 'FB02'
}
print(parse_subdevice_modules(test_list))
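One caveat about the answer above: k[-1] takes only the last character, so a key such as module_name-12 would be grouped under '2'. A hedged variant, assuming keys always have the form module_<field>-<number>, splits on the last hyphen instead. The function name parse_modules and the sample keys below are illustrative only.

```python
from collections import defaultdict

def parse_modules(row):
    modules = defaultdict(dict)
    for key, value in row.items():
        # Skip empty values and keys unrelated to modules.
        if not value or not key.startswith("module_"):
            continue
        # "module_name-12" -> field "name", number "12"
        field, _, number = key[len("module_"):].rpartition("-")
        modules[number][field] = value
    return list(modules.values())

sample = {
    'module_serial-1': 'PSUXA12345680', 'module_name-1': 'CH1.FM5',
    'module_name-12': 'CH1.FM6', 'module_serial-12': 'PSUXA12345681',
    'rack': 'Test_Rack_2',
}
parsed = parse_modules(sample)
print(parsed)
```

Grouping by the full number rather than the last character keeps modules 2 and 12 apart, at the cost of assuming every module key contains a hyphen.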
You can also do something like this:
test_list = {'module_serial-1': 'PSUXA12345680', 'module_name-1': 'CH1.FM5', 'module_name-2': 'CH1.FM6',
             'module_serial-2': 'PSUXA12345681'}
def parse_subdevice_modules(row):
    modules_list = []
    for key, value in row.items():
        if not value or key.startswith('module_name'):
            continue
        if key.startswith('module_serial'):
            module_name_key = f'module_name-{key.split("-")[-1]}'
            modules_list.append({'serial': value, 'name': row[module_name_key]})
    return modules_list
print(parse_subdevice_modules(test_list))
Output:
[{'serial': 'PSUXA12345680', 'name': 'CH1.FM5'}, {'serial': 'PSUXA12345681', 'name': 'CH1.FM6'}]
You would need to check if module contains 2 elements and append it to modules:
test_list = {'module_serial-1': 'PSUXA12345680', 'module_name-1': 'CH1.FM5', 'module_name-2': 'CH1.FM6', 'module_serial-2': 'PSUXA12345681'}
def parse_subdevice_modules(row):
    modules = []
    module = {}
    for k, v in row.items():
        if v:
            if k.startswith('module_name'):
                module['name'] = v
            elif k.startswith("module_serial"):
                module['serial'] = v
            if len(module) == 2:
                modules.append(module)
                module = {}
    return modules
print(parse_subdevice_modules(test_list))
Returns:
[{'serial': 'PSUXA12345680', 'name': 'CH1.FM5'}, {'name': 'CH1.FM6', 'serial': 'PSUXA12345681'}]

Trouble getting right values against each item

I'm trying to parse the item names and their corresponding values from the snippet below. The dt tags hold the names and the dd tags contain the values. A few dt tags have no corresponding dd, so not every name has a value. What I wish to do is keep the value blank for any name that doesn't have one.
These are the elements I would like to scrape data from:
content="""
<div class="movie_middle">
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Size:</dt>
<dd>1.60G</dd>
<dt>Quality:</dt>
<dd>1080p</dd>
<dt>Frame Rate:</dt>
<dd>23.976 fps</dd>
<dt>Language:</dt>
</dl>
</div>
"""
I've tried like below:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,"lxml")
title = [item.text for item in soup.select(".movie_middle dt")]
result = [item.text for item in soup.select(".movie_middle dd")]
vault = dict(zip(title,result))
print(vault)
It gives me messy results (wrong pairs):
{'Genres:': '1920*1080', 'Resolution:': '1.60G', 'Size:': '1080p', 'Quality:': '23.976 fps'}
My expected result:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p','Frame Rate:':'23.976 fps','Language:':''}
Any help on fixing the issue will be highly appreciated.
You can loop through the elements inside dl. If the current element is dt and the next element is dd, then store the value as the next element, else set the value as empty string.
dl = soup.select('.movie_middle dl')[0]
elems = dl.find_all() # Returns the list of dt and dd
data = {}
for i, el in enumerate(elems):
    if el.name == 'dt':
        key = el.text.replace(':', '')
        # check if the next element is a `dd`
        if i < len(elems) - 1 and elems[i+1].name == 'dd':
            data[key] = elems[i+1].text
        else:
            data[key] = ''
You can use BeautifulSoup to parse the dl structure, and then write a function to create the dictionary:
from bs4 import BeautifulSoup as soup
import re
def parse_result(d):
    while d:
        a, *_d = d
        if _d:
            if re.findall('<dt', a) and re.findall('<dd', _d[0]):
                yield [a[4:-5], _d[0][4:-5]]
                d = _d[1:]
            else:
                yield [a[4:-5], '']
                d = _d
        else:
            yield [a[4:-5], '']
            d = []
print(dict(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1])))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
For a slightly longer, although cleaner solution, you can create a decorator to strip the HTML tags of the output, thus removing the need for the extra string slicing in the main parse_result function:
def strip_tags(f):
    def wrapper(data):
        return {a[4:-5]:b[4:-5] for a, b in f(data)}
    return wrapper
@strip_tags
def parse_result(d):
    while d:
        a, *_d = d
        if _d:
            if re.findall('<dt', a) and re.findall('<dd', _d[0]):
                yield [a, _d[0]]
                d = _d[1:]
            else:
                yield [a, '']
                d = _d
        else:
            yield [a, '']
            d = []
print(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1]))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
from collections import defaultdict
test = soup.text.split('\n')
d = defaultdict(list)
for i in range(len(test)):
    if (':' in test[i]) and (':' not in test[i+1]):
        d[test[i]] = test[i+1]
    elif ':' in test[i]:
        d[test[i]] = ''
d
defaultdict(list,
{'Frame Rate:': '23.976 fps',
'Genres:': '',
'Language:': '',
'Quality:': '1080p',
'Resolution:': '1920*1080',
'Size:': '1.60G'})
The logic here is that you know that every key will have a colon. Knowing this, you can write an if/else statement to capture the two possible combinations: a key followed by another key, or a key followed by a value.
Edit:
In case you wanted to clean your keys, below replaces the : in each one:
d1 = { x.replace(':', ''): d[x] for x in d.keys() }
d1
{'Frame Rate': '23.976 fps',
'Genres': '',
'Language': '',
'Quality': '1080p',
'Resolution': '1920*1080',
'Size': '1.60G'}
The problem is that empty elements are not present. Since there is no hierarchy between the <dt> and the <dd>, I'm afraid you'll have to craft the dictionary yourself.
vault = {}
category = ""
for item in soup.find("dl").findChildren():
    if item.name == "dt":
        if category:
            # the previous <dt> had no <dd>: record it as empty
            vault[category] = ""
        category = item.text
    elif item.name == "dd":
        vault[category] = item.text
        category = ""
# a trailing <dt> with no <dd> also gets an empty value
if category:
    vault[category] = ""
Basically this code iterates over the child elements of the <dl> and fills the vault dictionary with the values.
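The same pairing idea can be sketched without BeautifulSoup. The example below swaps in the standard library's html.parser.HTMLParser (not the approach used in the answers above) so that it runs with no third-party packages. The class name DlParser and the shortened HTML sample are illustrative assumptions. Each dt is first stored with an empty value, which is overwritten only if a dd follows it.

```python
from html.parser import HTMLParser

class DlParser(HTMLParser):
    """Collects <dt>/<dd> pairs; a <dt> with no following <dd> maps to ''."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self.current_tag = None   # 'dt' or 'dd' while inside that element
        self.last_key = None      # most recent <dt> text still awaiting a <dd>

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag in ('dt', 'dd'):
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag == 'dt':
            self.pairs[text] = ''      # default to empty until a <dd> arrives
            self.last_key = text
        elif self.current_tag == 'dd' and self.last_key is not None:
            self.pairs[self.last_key] = text
            self.last_key = None

html = """
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Language:</dt>
</dl>
"""
parser = DlParser()
parser.feed(html)
print(parser.pairs)
```

Defaulting every name to an empty string up front means no special handling is needed for a dt at the very end of the list.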

Dynamically changing key value in dictionary

I am checking each key in a dictionary and, if it contains a space, removing the space.
def query_combination(sentence,mydict):
    for key in mydict.keys():
        if key == 'key':
            pass
        else:
            print 'key is : ',key
            if " " in key:
                temp = key
                key = key.replace(' ',"")
                print 'new key : ',key
                sentence = sentence.replace(temp ,key)
                print 'new sentence : ',sentence
    print mydict
mydict = {'films': {'match': ['Space', 'Movie', 'six', 'two', 'one']}, u'Popeye Doyle': {'score': 100, 'match': [u'People', 'heaven', 'released']}}
sentence ='What films featured the character Popeye Doyle'
combination = query_combination(sentence,mydict)
I could not get the new key into mydict dynamically. Any suggestions would be much appreciated.
If you get a string out of the dictionary, and then change it and make a new string, the dictionary won't know about it; you can add a new entry to the dictionary and remove the old one:
if " " in key:
    newkey = key.replace(' ',"")
    mydict[newkey] = mydict[key]
    del mydict[key]
    print 'new key : ', newkey
You could try this:
def query_combination(sentence,mydict):
    # iterate over a copy of the keys, since the dict is modified in the loop
    for key in list(mydict.keys()):
        if " " in key:
            new_key = key.replace(" ","")
            mydict[new_key] = mydict[key] # create new key
            del mydict[key] # delete old key
            sentence = sentence.replace(key, new_key)
Another solution in one line would be
mydict[key.replace(" ","")] = mydict.pop(key)
key = key.replace(' ',"") does not affect the actual key in the dictionary; it only rebinds the local name key to a new string. You need to add the value to the dictionary under the new key and remove the old key. Here's one way to do it:
def query_combination(sentence, mydict):
    for old_key, new_key in [(key, key.replace(' ', '')) for key in mydict if ' ' in key]:
        mydict[new_key] = mydict.pop(old_key)
        sentence = sentence.replace(old_key, new_key)
Note, however, that you are replacing the key in the string sentence, but sentence is local to function query_combination(), so the outer scope sentence is unaffected by the replacement. I am not sure if that was what you hoped your code would do, but if it was you could simply return the revised sentence from the function, or include it as an item in the dictionary.
Given that sentence is not actually updated by your function, you can simplify the whole function to a mere dictionary comprehension:
>>> mydict = {'films': {'match': ['Space', 'Movie', 'six', 'two', 'one']}, u'Popeye Doyle': {'score': 100, 'match': [u'People', 'heaven', 'released']}}
>>> mydict = {key.replace(' ', '') : value for key, value in mydict.items()}
>>> mydict
{'films': {'match': ['Space', 'Movie', 'six', 'two', 'one']}, u'PopeyeDoyle': {'score': 100, 'match': [u'People', 'heaven', 'released']}}
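Following the suggestion above to return the revised sentence, here is a minimal Python 3 sketch that returns both a new dictionary and the updated sentence instead of mutating its arguments. The name strip_key_spaces is hypothetical, and the sketch assumes no two keys collapse to the same name once spaces are removed.

```python
def strip_key_spaces(sentence, mydict):
    new_dict = {}
    for key, value in mydict.items():
        new_key = key.replace(' ', '')
        new_dict[new_key] = value
        if new_key != key:
            # Keep the sentence in sync with the renamed key.
            sentence = sentence.replace(key, new_key)
    return sentence, new_dict

mydict = {
    'films': {'match': ['Space', 'Movie', 'six', 'two', 'one']},
    'Popeye Doyle': {'score': 100, 'match': ['People', 'heaven', 'released']},
}
sentence = 'What films featured the character Popeye Doyle'
sentence, mydict = strip_key_spaces(sentence, mydict)
print(sentence)      # What films featured the character PopeyeDoyle
print(list(mydict))  # ['films', 'PopeyeDoyle']
```

Building a new dictionary sidesteps the problem of changing a dictionary's size while iterating over it.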

Failing to append to dictionary. Python

I am experiencing a strange faulty behaviour, where a dictionary is only appended once and I can not add more key value pairs to it.
My code reads in a multi-line string and extracts substrings via split(), to be added to a dictionary. I make use of conditional statements. Strangely only the key:value pairs under the first conditional statement are added.
Therefore I can not complete the dictionary.
How can I solve this issue?
Minimal code:
#I hope the '\n' is sufficient or use '\r\n'
example = "Name: Bugs Bunny\nDOB: 01/04/1900\nAddress: 111 Jokes Drive, Hollywood Hills, CA 11111, United States"
def format(data):
    dic = {}
    for line in data.splitlines():
        #print('Line:', line)
        if ':' in line:
            info = line.split(': ', 1)[1].rstrip() #does not work with files
            #print('Info: ', info)
            if ' Name:' in info: #middle name problems! /maiden name
                dic['F_NAME'] = info.split(' ', 1)[0].rstrip()
                dic['L_NAME'] = info.split(' ', 1)[1].rstrip()
            elif 'DOB' in info: #overhang
                dic['DD'] = info.split('/', 2)[0].rstrip()
                dic['MM'] = info.split('/', 2)[1].rstrip()
                dic['YY'] = info.split('/', 2)[2].rstrip()
            elif 'Address' in info:
                dic['STREET'] = info.split(', ', 2)[0].rstrip()
                dic['CITY'] = info.split(', ', 2)[1].rstrip()
                dic['ZIP'] = info.split(', ', 2)[2].rstrip()
    return dic
if __name__ == '__main__':
    x = format(example)
    for v, k in x.iteritems():
        print v, k
Your code doesn't work, at all. You split off the name before the colon and discard it, looking only at the value after the colon, stored in info. That value never contains the names you are looking for; Name, DOB and Address all are part of the line before the :.
Python lets you assign to multiple names at once; make use of this when splitting:
def format(data):
    dic = {}
    for line in data.splitlines():
        if ':' not in line:
            continue
        name, _, value = line.partition(':')
        name = name.strip()
        if name == 'Name':
            dic['F_NAME'], dic['L_NAME'] = value.split(None, 1) # strips whitespace for us
        elif name == 'DOB':
            dic['DD'], dic['MM'], dic['YY'] = (v.strip() for v in value.split('/', 2))
        elif name == 'Address':
            dic['STREET'], dic['CITY'], dic['ZIP'] = (v.strip() for v in value.split(', ', 2))
    return dic
I used str.partition() here rather than limit str.split() to just one split; it is slightly faster that way.
For your sample input this produces:
>>> format(example)
{'CITY': 'Hollywood Hills', 'ZIP': 'CA 11111, United States', 'L_NAME': 'Bunny', 'F_NAME': 'Bugs', 'YY': '1900', 'MM': '04', 'STREET': '111 Jokes Drive', 'DD': '01'}
>>> from pprint import pprint
>>> pprint(format(example))
{'CITY': 'Hollywood Hills',
'DD': '01',
'F_NAME': 'Bugs',
'L_NAME': 'Bunny',
'MM': '04',
'STREET': '111 Jokes Drive',
'YY': '1900',
'ZIP': 'CA 11111, United States'}
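For reference, str.partition always returns a 3-tuple of (head, separator, tail), which is why the three-name unpacking in the answer never fails, even on a line with no colon. A short illustration using one line of the sample input:

```python
line = "Address: 111 Jokes Drive, Hollywood Hills, CA 11111, United States"

# Split at the first colon only; later text is left untouched in `value`.
name, sep, value = line.partition(':')
print(repr(name))     # 'Address'
print(repr(sep))      # ':'
print(value.strip())  # 111 Jokes Drive, Hollywood Hills, CA 11111, United States

# When the separator is absent, the whole string lands in the first slot.
missing = "no colon here".partition(':')
print(missing)        # ('no colon here', '', '')
```

By contrast, unpacking the result of line.split(':', 1) into two names raises a ValueError when the colon is missing.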
