Merging similar dictionaries in a list together - python

New to Python here. I've been pulling my hair out for hours and still can't figure this out.
I have a list of dictionaries:
[ {'FX0XST001.MID5': '195', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum'},
{'FX0XST001.MID13': '4929', 'Name': 'Firmicutes', 'Taxonomy ID': '1239','Type': 'phylum'},
{'FX0XST001.MID6': '826', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum'},
...
{'FX0XST001.MID6': '125', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'},
{'FX0XST001.MID25': '70', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'},
{'FX0XST001.MID40': '40', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'} ]
I want to merge the dictionaries in the list based on their Type, Name, and Taxonomy ID:
[ {'FX0XST001.MID5': '195', 'FX0XST001.MID13': '4929', 'FX0XST001.MID6': '826', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum'},
...
{'FX0XST001.MID6': '125', 'FX0XST001.MID25': '70', 'FX0XST001.MID40': '40', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'} ]
I have the data structured like this because I need to write it to CSV using csv.DictWriter later.
Could anyone kindly point me in the right direction?

You can use the groupby function for this:
http://docs.python.org/library/itertools.html#itertools.groupby
from itertools import groupby

keyfunc = lambda row: (row['Type'], row['Taxonomy ID'], row['Name'])
data = sorted(data, key=keyfunc)  # groupby only groups adjacent rows

# Option 1: merge the matching rows into one item so you end up with what you wanted
result = []
for k, g in groupby(data, keyfunc):
    item = {}
    for row in g:
        item.update(row)
    result.append(item)

# Option 2: add the matched rows as subitems to a parent dictionary, which
# might come in handy if you need to work with just the parts that are different.
# Use one option or the other: each group iterator can only be consumed once.
result = []
for k, g in groupby(data, keyfunc):
    item = {'Type': k[0], 'Taxonomy ID': k[1], 'Name': k[2], 'matches': []}
    for row in g:
        del row['Type']
        del row['Taxonomy ID']
        del row['Name']
        item['matches'].append(row)
    result.append(item)

Make some test data:
list_of_dicts = [
{"Taxonomy ID":1, "Name":"Bob", "Type":"M", "hair":"brown", "eyes":"green"},
{"Taxonomy ID":1, "Name":"Bob", "Type":"M", "height":"6'2''", "weight":200},
{"Taxonomy ID":2, "Name":"Alice", "Type":"F", "hair":"black", "eyes":"hazel"},
{"Taxonomy ID":2, "Name":"Alice", "Type":"F", "height":"5'7''", "weight":145}
]
I think this (below) is a neat trick using reduce that improves upon the other groupby solution.
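import itertools
from functools import reduce  # a builtin on Python 2, must be imported on Python 3

def key_func(elem):
    return (elem["Taxonomy ID"], elem["Name"], elem["Type"])

# groupby only merges adjacent items, so sort by the same key first.
# Note that x.update(y) modifies the first dict of each group in place.
output_list_of_dicts = [
    reduce(lambda x, y: x.update(y) or x, list(val))
    for key, val in itertools.groupby(sorted(list_of_dicts, key=key_func), key_func)
]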
Then print the output:
for elem in output_list_of_dicts:
    print(elem)
This prints:
{'eyes': 'green', 'Name': 'Bob', 'weight': 200, 'Taxonomy ID': 1, 'hair': 'brown', 'height': "6'2''", 'Type': 'M'}
{'eyes': 'hazel', 'Name': 'Alice', 'weight': 145, 'Taxonomy ID': 2, 'hair': 'black', 'height': "5'7''", 'Type': 'F'}
FYI, pandas is usually a better fit than itertools for this sort of aggregation, especially when reading from or writing to .csv or .h5 files.

Perhaps the easiest thing to do would be to create a new dictionary, indexed by a (Type, Name, Taxonomy ID) tuple, and iterate over your list of dictionaries, storing values by (Type, Name, Taxonomy ID). Use a defaultdict to make this easier. For example:
from collections import defaultdict

grouped = defaultdict(dict)
# iterate over items and store:
for entry in list_of_dictionaries:
    grouped[(entry["Type"], entry["Name"], entry["Taxonomy ID"])].update(entry)
# now you have everything stored the way you want in values, and you don't
# need the dict anymore
grouped_entries = list(grouped.values())
This is a bit hackish, especially because you end up overwriting "Type", "Name", and "Taxonomy ID" every time you use update, but since your dict keys are variable, that might be the best you can do. This will get you at least close to what you need.
Even better would be to do this on your initial import and skip intermediate steps (unless you actually need to transform the data beforehand). Plus, if you could get at the only varying field, you could change the update to just: grouped[(type, name, taxonomy_id)][key] = value where key and value are something like: 'FX0XST001.MID5', '195'
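As a rough sketch of that last idea, assuming the raw input arrives as (type, name, taxonomy_id, key, value) tuples. The raw_rows name and the tuple layout are hypothetical and not from the question:

```python
from collections import defaultdict

# Hypothetical raw input: (type, name, taxonomy_id, key, value)
raw_rows = [
    ('phylum', 'Firmicutes', '1239', 'FX0XST001.MID5', '195'),
    ('phylum', 'Firmicutes', '1239', 'FX0XST001.MID13', '4929'),
    ('phylum', 'Acidobacteria', '57723', 'FX0XST001.MID6', '125'),
]

grouped = defaultdict(dict)
for type_, name, taxonomy_id, key, value in raw_rows:
    entry = grouped[(type_, name, taxonomy_id)]
    # Store the identifying fields once per group...
    entry.setdefault('Type', type_)
    entry.setdefault('Name', name)
    entry.setdefault('Taxonomy ID', taxonomy_id)
    # ...and then only the field that actually varies
    entry[key] = value

grouped_entries = list(grouped.values())
```

This avoids building the intermediate one-sample dictionaries at all, at the cost of tying the grouping to the import step.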

from itertools import groupby

data = [ {'FX0XST001.MID5': '195', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type':'phylum'},
         {'FX0XST001.MID13': '4929', 'Name': 'Firmicutes', 'Taxonomy ID': '1239','Type': 'phylum'},
         {'FX0XST001.MID6': '826', 'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum'},
         {'FX0XST001.MID6': '125', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'},
         {'FX0XST001.MID25': '70', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'},
         {'FX0XST001.MID40': '40', 'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum'} ,]

kk = ('Name', 'Taxonomy ID', 'Type')

def key(item):
    return tuple(item[k] for k in kk)

result = []
data = sorted(data, key=key)  # groupby needs rows with equal keys to be adjacent
for k, g in groupby(data, key):
    # flatten every (key, value) pair of the group into a single dict
    result.append(dict((i, j) for d in g for i, j in d.items()))
print(result)
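Since the question mentions writing the merged rows with csv.DictWriter, here is a minimal sketch of that last step, assuming Python 3. The merged rows do not all share the same sample columns, so the field names are taken as the union of all keys. The sample rows are a shortened copy of the question's data, the in-memory buffer stands in for a real file, and leaving missing samples blank (restval='') is an assumption:

```python
import csv
import io

merged = [
    {'Name': 'Firmicutes', 'Taxonomy ID': '1239', 'Type': 'phylum',
     'FX0XST001.MID5': '195', 'FX0XST001.MID13': '4929'},
    {'Name': 'Acidobacteria', 'Taxonomy ID': '57723', 'Type': 'phylum',
     'FX0XST001.MID6': '125'},
]

fixed = ['Type', 'Name', 'Taxonomy ID']
# Every other key is a sample column; collect the union across all rows
samples = sorted({k for row in merged for k in row if k not in fixed})

out = io.StringIO()  # stands in for open('out.csv', 'w', newline='')
# restval fills in columns that a given row does not have
writer = csv.DictWriter(out, fieldnames=fixed + samples, restval='')
writer.writeheader()
writer.writerows(merged)
csv_text = out.getvalue()
print(csv_text)
```

Without the full set of field names, DictWriter raises ValueError for any row containing a key it was not told about, which is why the union is computed first.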

Related

understanding nested python dict comprehension

I am getting along with dict comprehensions and trying to understand how the below 2 dict comprehensions work:
select_vals = ['name', 'pay']
test_dict = {'data': [{'name': 'John', 'city': 'NYC', 'pay': 70000}, {'name': 'Mike', 'city': 'NYC', 'pay': 80000}, {'name': 'Kate', 'city': 'Houston', 'pay': 65000}]}
dict_comp1 = [{key: item[key] for key in select_vals } for item in test_dict['data'] if item['pay'] > 65000 ]
The above line gets me
[{'name': 'John', 'pay': 70000}, {'name': 'Mike', 'pay': 80000}]
dict_comp2 = [{key: item[key]} for key in select_vals for item in test_dict['data'] if item['pay'] > 65000 ]
The above line gets me
[{'name': 'John'}, {'name': 'Mike'}, {'pay': 70000}, {'pay': 80000}]
How do the two outputs differ when written as for loops? When I write it as a for loop:
dict_comp3 = []
for key in select_vals:
    for item in test_dict['data']:
        if item['pay'] > 65000:
            dict_comp3.append({key: item[key]})
print(dict_comp3)
The above gets me the same result as dict_comp2:
[{'name': 'John'}, {'name': 'Mike'}, {'pay': 70000}, {'pay': 80000}]
How do I get the same output as dict_comp1 with a for loop?
The iteration over select_vals should be the inner one, building one dictionary per item:
result = []
for item in test_dict['data']:
    if item['pay'] > 65000:
        aux = {}
        for key in select_vals:
            aux[key] = item[key]
        result.append(aux)
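The difference comes down to which brackets each for clause belongs to. In a comprehension with several for clauses, the leftmost for is the outermost loop, and each set of braces or brackets produces one object per iteration of its own loops. A small sketch of both forms side by side, using the question's data (the names nested, flat, and loop_result are just for illustration):

```python
select_vals = ['name', 'pay']
test_dict = {'data': [{'name': 'John', 'city': 'NYC', 'pay': 70000},
                      {'name': 'Mike', 'city': 'NYC', 'pay': 80000},
                      {'name': 'Kate', 'city': 'Houston', 'pay': 65000}]}

# One dict per item: the braces contain their own loop over select_vals
nested = [{key: item[key] for key in select_vals}
          for item in test_dict['data'] if item['pay'] > 65000]

# One dict per (key, item) pair: both for clauses belong to the list,
# so each dict only ever receives a single key
flat = [{key: item[key]}
        for key in select_vals
        for item in test_dict['data'] if item['pay'] > 65000]

# Loop form of `nested`, keeping the inner dict comprehension
loop_result = []
for item in test_dict['data']:
    if item['pay'] > 65000:
        loop_result.append({key: item[key] for key in select_vals})

print(nested == loop_result)
```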

Build a dictionary with single elements or lists as values

I have a list of dictionaries:
mydict = [
{'name': 'test1', 'value': '1_1'},
{'name': 'test2', 'value': '2_1'},
{'name': 'test1', 'value': '1_2'},
{'name': 'test1', 'value': '1_3'},
{'name': 'test3', 'value': '3_1'},
{'name': 'test4', 'value': '4_1'},
{'name': 'test4', 'value': '4_2'},
]
I would like to use it to create a dictionary where the values are lists or single values, depending on the number of times each name occurs in the list above.
Expected output:
outputdict = {
'test1': ['1_1', '1_2', '1_3'],
'test2': '2_1',
'test3': '3_1',
'test4': ['4_1', '4_2'],
}
I tried to do it the way below, but it always produces a list, even when there is just one value:
outputdict = {}
for item in mydict:
    outputdict.setdefault(item.get('name'), []).append(item.get('value'))
The current output is:
outputdict = {
'test1': ['1_1', '1_2', '1_3'],
'test2': ['2_1'],
'test3': ['3_1'],
'test4': ['4_1', '4_2'],
}
Do what you have already done, and then convert single-element lists afterwards:
outputdict = {
    name: (value if len(value) > 1 else value[0])
    for name, value in outputdict.items()
}
You can use a couple of standard library functions, mainly itertools.groupby:
from itertools import groupby
from operator import itemgetter

mydict = [
    {'name': 'test1', 'value': '1_1'},
    {'name': 'test2', 'value': '2_1'},
    {'name': 'test1', 'value': '1_2'},
    {'name': 'test1', 'value': '1_3'},
    {'name': 'test3', 'value': '3_1'},
    {'name': 'test4', 'value': '4_1'},
    {'name': 'test4', 'value': '4_2'},
]

def keyFunc(x):
    return x['name']

outputdict = {}
# groupby groups consecutive items that share the value returned by keyFunc
# (the name, in our case), so the list must be sorted by that key first;
# otherwise the second run of 'test1' would overwrite the first
for name, groups in groupby(sorted(mydict, key=keyFunc), keyFunc):
    # groups is an iterator over all the items that have the matched name
    values = list(map(itemgetter('value'), groups))
    if len(values) == 1:
        outputdict[name] = values[0]
    else:
        outputdict[name] = values
print(outputdict)
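Because groupby only groups consecutive items, an alternative that does not depend on the input order is to collect the values in a single pass and unwrap the single-element lists at the end. A minimal sketch, where collected is a name chosen for illustration:

```python
from collections import defaultdict

mydict = [
    {'name': 'test1', 'value': '1_1'},
    {'name': 'test2', 'value': '2_1'},
    {'name': 'test1', 'value': '1_2'},
    {'name': 'test1', 'value': '1_3'},
    {'name': 'test3', 'value': '3_1'},
    {'name': 'test4', 'value': '4_1'},
    {'name': 'test4', 'value': '4_2'},
]

# First pass: every name maps to the list of its values, in input order
collected = defaultdict(list)
for item in mydict:
    collected[item['name']].append(item['value'])

# Second pass: unwrap the lists that hold exactly one value
outputdict = {
    name: values[0] if len(values) == 1 else values
    for name, values in collected.items()
}
print(outputdict)
```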

compare two different length lists of dictionaries in python

I want to compare the two lists of dictionaries below. The Name key is common to the dictionaries in both lists.
If a Name appears in both lists, I want to do some further processing with the data.
PerfData = [
{'Name': 'abc', 'Type': 'Ex1', 'Access': 'N1', 'perfStatus':'Latest Perf', 'Comments': '07/12/2017 S/W Version'},
{'Name': 'xyz', 'Type': 'Ex1', 'Access': 'N2', 'perfStatus':'Latest Perf', 'Comments': '11/12/2017 S/W Version upgrade failed'},
{'Name': 'efg', 'Type': 'Cust1', 'Access': 'A1', 'perfStatus':'Old Perf', 'Comments': '11/10/2017 S/W Version upgrade failed, test data is active'}
]
beatData = [
{'Name': 'efg', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.632'},
{'Name': 'abc', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.896'}
]
l = [{'name': 'abc'}, {'name': 'xyz'}]
k = [{'name': 'a'}, {'name': 'abc'}]
[i['name'] for i in l for f in k if i['name'] == f['name']]
Hope the above logic works for you.
The answer provided doesn't assign the result to any variable. If you want to print it, the following works:
result = [i['name'] for i in l for f in k if i['name'] == f['name']]
print(result)
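If you need the matching records themselves rather than just the names, one option is to index one list by Name and look each entry of the other list up in that index. This is only a sketch: it assumes each Name appears at most once in beatData, uses a trimmed copy of the question's data, and beat_by_name and matched are names chosen for illustration:

```python
PerfData = [
    {'Name': 'abc', 'Type': 'Ex1', 'perfStatus': 'Latest Perf'},
    {'Name': 'xyz', 'Type': 'Ex1', 'perfStatus': 'Latest Perf'},
    {'Name': 'efg', 'Type': 'Cust1', 'perfStatus': 'Old Perf'},
]
beatData = [
    {'Name': 'efg', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.632'},
    {'Name': 'abc', 'Status': 'Latest', 'rcvd-timestamp': '1516756202.896'},
]

# Index one list by Name so each lookup is a single dict access
beat_by_name = {row['Name']: row for row in beatData}

# Pair up the records whose Name occurs in both lists
matched = [(perf, beat_by_name[perf['Name']])
           for perf in PerfData
           if perf['Name'] in beat_by_name]

for perf, beat in matched:
    # "do some other stuff" with both records goes here
    print(perf['Name'], perf['perfStatus'], beat['rcvd-timestamp'])
```

The lists can have different lengths, and the work grows with the sum of their sizes rather than the product, unlike the nested loop in the comprehension.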

parse multilevel json to string with condition

I have this nested JSON item that I just want to flatten out to a comma-separated string (e.g. parkinson:5,billy mays:4) so I can store it in a database if needed for future analysis. I wrote the function below but am wondering if there's a more elegant way using a list comprehension (or something else). I found this post but I'm not sure how to adapt it for my needs (Python - parse JSON values by multilevel keys).
Data looks like this:
{'persons':
    [{'name': 'parkinson', 'sentiment': '5'},
     {'name': 'knott david', 'sentiment': 'none'},
     {'name': 'billy mays', 'sentiment': '4'}],
 'organizations':
    [{'name': 'piper jaffray companies', 'sentiment': 'none'},
     {'name': 'marketbeat.com', 'sentiment': 'none'},
     {'name': 'zacks investment research', 'sentiment': 'none'}],
 'locations': []
}
Here's my code:
def parse_entities(data):
    results = ''
    for category in data.keys():
        # for c_id, category in enumerate(data.keys()):
        entity_data = data[category]
        for e_id, entity in enumerate(entity_data):
            if not entity_data[e_id]['sentiment'] == 'none':
                results = results + (data[category][e_id]['name'] + ":" +
                                     data[category][e_id]['sentiment'] + ",")
    return results
Firstly, the most important thing to make your code shorter and nicer to look at is to use your own variables. Be aware that entity_data = data[category] and entity = entity_data[e_id]. So you can write entity['name'] instead of data[category][e_id]['name'].
Secondly, if you want something like
for category in data.keys():
    entity_data = data[category]
you can make it shorter and easier to read by changing it to
for category, entity_data in data.items():
But you don't even need that here, you can just use the data.values() iterator to get the values. When combining these improvements your code looks like this:
def parse_entities(data):
    results = ''
    for entity_data in data.values():
        for entity in entity_data:
            if entity['sentiment'] != 'none':
                results += entity['name'] + ":" + entity['sentiment'] + ","
    return results
(I have also changed results = results + ... to results += ... and if not entity['sentiment'] == 'none' to if entity['sentiment'] != 'none', because it is shorter and doesn't lower the readability)
When you have this it is much easier to make it even shorter and more elegant by using list comprehension:
def parse_entities(data):
    return ",".join([entity['name'] + ":" + entity['sentiment']
                     for entity_data in data.values()
                     for entity in entity_data
                     if not entity['sentiment'] == 'none'])
Maybe something like this will work?
def parse_entities(data):
    results = []
    for category in data.keys():
        results += list(map(lambda x: '{0}:{1}'.format(x['name'], x['sentiment']),
                            filter(lambda i: i['sentiment'] != 'none', data[category])))
    return ','.join(results)

if __name__ == '__main__':
    print(parse_entities(data))
With the output looking like this
parkinson:5,billy mays:4
This might be a way to do it, even though using a 'proper library' may make more sense depending on your actual use case.
data = {
    'persons':
        [{'name': 'parkinson', 'sentiment': '5'},
         {'name': 'knott david', 'sentiment': 'none'},
         {'name': 'billy mays', 'sentiment': '4'}],
    'organizations':
        [{'name': 'piper jaffray companies', 'sentiment': 'none'},
         {'name': 'marketbeat.com', 'sentiment': 'none'},
         {'name': 'zacks investment research', 'sentiment': 'none'}],
    'locations': []
}

import itertools

# eq. = itertools.chain.from_iterable(data.values())
dicts = itertools.chain(*data.values())
pairs = [":".join([d['name'], d['sentiment']])
         for d in dicts if d['sentiment'] != 'none']
result = ",".join(pairs)
print(result)
# parkinson:5,billy mays:4

# short, but less readable version
result = ",".join([":".join([d['name'], d['sentiment']])
                   for d in itertools.chain(*data.values())
                   if d['sentiment'] != 'none'])
This is a problem where we need to perform 3 separate tasks:
1. Filter out unqualified rows of data
2. Flatten the dict of lists into a simple list
3. Transform each dictionary object into a simple tuple, ready for formatting
Here is the code:
def parse_entities(data):
    new_data = [
        (row['name'], row['sentiment'])   # 3. Transform
        for rows in data.values()         # 2. Flatten
        for row in rows                   # 2. Flatten
        if row['sentiment'] != 'none'     # 1. Filter
    ]
    # e.g., new_data = [('parkinson', '5'), ('billy mays', '4')]
    return ','.join('{}:{}'.format(*row) for row in new_data)

#
# test code
#
data = {
    'locations': [],
    'organizations': [
        {'name': 'piper jaffray companies', 'sentiment': 'none'},
        {'name': 'marketbeat.com', 'sentiment': 'none'},
        {'name': 'zacks investment research', 'sentiment': 'none'}
    ],
    'persons': [
        {'name': 'parkinson', 'sentiment': '5'},
        {'name': 'knott david', 'sentiment': 'none'},
        {'name': 'billy mays', 'sentiment': '4'}
    ],
}
print(parse_entities(data))
Output:
parkinson:5,billy mays:4
Here's a generator expression that does it:
data = {'persons': [
            {'name': 'parkinson', 'sentiment': '5'},
            {'name': 'knott david', 'sentiment': 'none'},
            {'name': 'billy mays', 'sentiment': '4'}],
        'organizations': [
            {'name': 'piper jaffray companies', 'sentiment': 'none'},
            {'name': 'marketbeat.com', 'sentiment': '99'},
            {'name': 'zacks investment research', 'sentiment': 'none'}],
        'locations': []
}

# compare strings with !=, not "is not": identity checks on strings are unreliable
results = ','.join(entity['name'] + ':' + entity['sentiment']
                   for category, entity_data in data.items()
                   for entity in entity_data if entity['sentiment'] != 'none')
print(results)  # -> parkinson:5,billy mays:4,marketbeat.com:99
Note: I changed the sample data slightly to make sure it handled data in more than one category the same as your code.

Extract multiple key:value pairs from one dict to a new dict

I have a list of dicts with some data, and I would like to extract certain key:value pairs into a new list of dicts. I know one way to do this would be to use del i['unwantedKey']; however, I would rather not delete any data but instead create a new dict with the needed data.
The column order might change, so I need something to extract the two key:value pairs from the larger dict into a new dict.
Current Data Format
[{'Speciality': 'Math', 'Name': 'Matt', 'Location': 'Miami'},
{'Speciality': 'Science', 'Name': 'Ben', 'Location': 'Las Vegas'},
{'Speciality': 'Language Arts', 'Name': 'Sarah', 'Location': 'Washington DC'},
{'Speciality': 'Spanish', 'Name': 'Tom', 'Location': 'Denver'},
{'Speciality': 'Chemistry', 'Name': 'Jim', 'Location': 'Dallas'}]
Code to delete key:value from dict
import csv

data = []
for line in csv.DictReader(open('data.csv')):
    data.append(line)
for i in data:
    del i['Speciality']
print(data)
Desired Data Format without using del i['Speciality']
[{'Name': 'Matt', 'Location': 'Miami'},
{'Name': 'Ben', 'Location': 'Las Vegas'},
{'Name': 'Sarah', 'Location': 'Washington DC'},
{'Name': 'Tom', 'Location': 'Denver'},
{'Name': 'Jim', 'Location': 'Dallas'}]
If you want to give a positive list of keys to copy over into the new dictionaries:
import csv

with open('data.csv', newline='') as csv_file:  # on Python 2, use open('data.csv', 'rb')
    data = list(csv.DictReader(csv_file))
keys = ['Name', 'Location']
new_data = [dict((k, d[k]) for k in keys) for d in data]
print(new_data)
Suppose we have:
l1 = [{'Location': 'Miami', 'Name': 'Matt', 'Speciality': 'Math'},
{'Location': 'Las Vegas', 'Name': 'Ben', 'Speciality': 'Science'},
{'Location': 'Washington DC', 'Name': 'Sarah', 'Speciality': 'Language Arts'},
{'Location': 'Denver', 'Name': 'Tom', 'Speciality': 'Spanish'},
{'Location': 'Dallas', 'Name': 'Jim', 'Speciality': 'Chemistry'}]
to create a new list of dictionaries that do not contain the keys 'Speciality' we can do,
l2 = []
for oldd in l1:
    newd = {}
    for k, v in oldd.items():
        if k != 'Speciality':
            newd[k] = v
    l2.append(newd)
and now l2 will be your desired output. In general you can exclude an arbitrary list of keys like so
exclude_keys = ['Speciality', 'Name']
l2 = []
for oldd in l1:
    newd = {}
    for k, v in oldd.items():
        if k not in exclude_keys:
            newd[k] = v
    l2.append(newd)
the same can be done with an include_keys variable
include_keys = ['Name', 'Location']
l2 = []
for oldd in l1:
    newd = {}
    for k, v in oldd.items():
        if k in include_keys:
            newd[k] = v
    l2.append(newd)
You can create a new list of dicts limited to the keys you want with one line of code (dict comprehensions require Python 2.7+):
NLoD=[{k:d[k] for k in ('Name', 'Location')} for d in LoD]
Try it:
>>> LoD=[{'Speciality': 'Math', 'Name': 'Matt', 'Location': 'Miami'},
{'Speciality': 'Science', 'Name': 'Ben', 'Location': 'Las Vegas'},
{'Speciality': 'Language Arts', 'Name': 'Sarah', 'Location': 'Washington DC'},
{'Speciality': 'Spanish', 'Name': 'Tom', 'Location': 'Denver'},
{'Speciality': 'Chemistry', 'Name': 'Jim', 'Location': 'Dallas'}]
>>> [{k:d[k] for k in ('Name', 'Location')} for d in LoD]
[{'Name': 'Matt', 'Location': 'Miami'}, {'Name': 'Ben', 'Location': 'Las Vegas'}, {'Name': 'Sarah', 'Location': 'Washington DC'}, {'Name': 'Tom', 'Location': 'Denver'}, {'Name': 'Jim', 'Location': 'Dallas'}]
Since you are using csv, you can limit the columns that you read in the first place to the desired columns so you do not need to delete the undesired data:
import csv

dc = ('Name', 'Location')
with open(fn) as f:  # fn is the path to your CSV file
    reader = csv.DictReader(f)
    LoD = [{k: row[k] for k in dc} for row in reader]
keys_lst = ['Name', 'Location']
# event is a single dictionary from the list
new_data = {key: val for key, val in event.items() if key in keys_lst}
print(new_data)
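The two styles above differ when a wanted key is missing: d[k] raises KeyError, while filtering on items() silently skips it. A small sketch of a helper that makes the tolerant behaviour explicit and leaves the original dicts untouched. The pick name and the second, incomplete row are made up for illustration:

```python
def pick(d, keys):
    """Return a new dict holding only those of `keys` that are present in d."""
    return {k: d[k] for k in keys if k in d}

data = [
    {'Speciality': 'Math', 'Name': 'Matt', 'Location': 'Miami'},
    {'Speciality': 'Science', 'Name': 'Ben'},  # hypothetical row with no Location
]

new_data = [pick(d, ('Name', 'Location')) for d in data]
print(new_data)
```

If a missing column should be treated as an error instead, drop the `if k in d` condition so the KeyError surfaces.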
