Python Dedup/Merge List of Dicts - python

Say that I have a list of dicts:
list = [{'name': 'john', 'age': '28', 'location': 'hawaii', 'gender': 'male'},
        {'name': 'john', 'age': '32', 'location': 'colorado', 'gender': 'male'},
        {'name': 'john', 'age': '32', 'location': 'colorado', 'gender': 'male'},
        {'name': 'parker', 'age': '24', 'location': 'new york', 'gender': 'male'}]
In this list, 'name' can be considered a unique identifier. My goal is not only to dedup this list of identical dicts (i.e. list[1] and list[2]), but also to merge/append differing values for a single 'name' (i.e. list[0] and list[1]/list[2]). In other words, I want to merge all of the 'name'='john' dicts in my example into a single dict, like so:
dedup_list = [{'name': 'john', 'age': '28; 32', 'location': 'hawaii; colorado', 'gender': 'male'},
              {'name': 'parker', 'age': '24', 'location': 'new york', 'gender': 'male'}]
I have tried thus far to create my second list, dedup_list, and to iterate through the first list. If the 'name' key does not already exist in one of dedup_list's dicts, I will append it. It is the merging part where I am stuck.
for dict in list:
    for new_dict in dedup_list:
        if dict['name'] in new_dict:
            # MERGE OTHER DICT FIELDS HERE
        else:
            dedup_list.append(dict)  # This will create duplicate values as it iterates through each row of the dedup_list. I can throw them in a set later to remove?
My list of dicts will never contain more than 100 items, so an O(n^2) solution is definitely acceptable but not necessarily ideal. This dedup_list will eventually be written to a CSV, so if there is a solution involving that, I am all ears.
Thanks!

well, I was about to craft a solution around defaultdict, but #hivert posted the best solution I could have come up with, which is in this answer:
from collections import defaultdict

dicts = [{'a': 1, 'b': 2, 'c': 3},
         {'a': 1, 'd': 2, 'c': 'foo'},
         {'e': 57, 'c': 3}]

super_dict = defaultdict(set)  # uses a set to avoid duplicates
for d in dicts:
    for k, v in d.items():  # d.iteritems() on Python 2
        super_dict[k].add(v)
i.e. I'm voting to close this question as a dupe of that question.
N.B.: you won't be getting values such as '28; 32'; instead you get a set containing {'28', '32'}, which can then be processed into a CSV file as you wish.
N.B.2: to write the CSV file, have a look at the csv.DictWriter class
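Putting those pieces together for the original question, here is one possible sketch (not the only way): it groups rows by 'name', collects each field's values in a plain dict used as an ordered set, joins them with '; ', and writes the result with csv.DictWriter. The 'dedup.csv' filename is just a placeholder.

```python
import csv

rows = [{'name': 'john', 'age': '28', 'location': 'hawaii', 'gender': 'male'},
        {'name': 'john', 'age': '32', 'location': 'colorado', 'gender': 'male'},
        {'name': 'john', 'age': '32', 'location': 'colorado', 'gender': 'male'},
        {'name': 'parker', 'age': '24', 'location': 'new york', 'gender': 'male'}]

merged = {}  # name -> {field: dict-used-as-ordered-set of values}
for row in rows:
    entry = merged.setdefault(row['name'], {})
    for k, v in row.items():
        entry.setdefault(k, {})[v] = None  # dedups while keeping first-seen order

# Collapse each value-set into a '; '-joined string.
dedup_list = [{k: '; '.join(vals) for k, vals in entry.items()}
              for entry in merged.values()]

with open('dedup.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age', 'location', 'gender'])
    writer.writeheader()
    writer.writerows(dedup_list)
```

Using a dict instead of a set for the values keeps the output deterministic ('hawaii; colorado' in first-seen order), which matters if the CSV is diffed or tested; this relies on dicts preserving insertion order (guaranteed from Python 3.7).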

Related

Python: Create sorted list of keys moving one key to the head

Is there a more pythonic way of obtaining a sorted list of dictionary keys with one key moved to the head? So far I have this:
# create a unique list of keys headed by 'event' and followed by a sorted list.
# dfs is a dict of dataframes.
for k in dict.fromkeys(['event'] + sorted(dfs)):
    display(k, dfs[k])  # ideally this should be (k, v)
I suppose you would be able to do
for k, v in list(dfs.items()) + [('event', None)]:
.items() casts a dictionary to a list of tuples (technically a dict_items view, which is why it must be cast to a list explicitly before appending), to which you can append a second list. Iterating through a list of tuples allows automatic unpacking (so you can write k, v in the loop instead of a single tuple variable).
What we really want is an iterable, but that's not possible with sorted, because it must see all the keys before it knows what the first item should be.
Using dict.fromkeys to create a blank dictionary by insertion order was pretty clever, but note that it relies on dicts preserving insertion order, which was an implementation detail in CPython 3.6 and is only guaranteed by the language from Python 3.7. I admit, it took me a while to figure out that line.
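On Python 3.7+ the trick is therefore safe: dict.fromkeys doubles as an order-preserving dedup. A tiny illustration (the 'event'/'a'/'b' keys are made up):

```python
# dict.fromkeys keeps the first occurrence of each key, in insertion order
# (guaranteed from Python 3.7), so it acts as an order-preserving dedup.
keys = dict.fromkeys(['event'] + sorted(['b', 'event', 'a']))
print(list(keys))  # ['event', 'a', 'b']
```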
Since the code you posted is just working with the keys, I suggest you focus on that. Taking up a few more lines for readability is a good thing, especially if we can hide it in a testable function:
def display_by_keys(dfs, priority_items=None):
    if not priority_items:
        priority_items = ['event']
    featured = [k for k in priority_items if k in dfs]  # a list, to keep priority order
    others = {k for k in dfs.keys() if k not in featured}
    for key in featured + sorted(others):
        display(key, dfs[key])
The potential downside is you must sort the keys every time. If you do this much more often than the data store changes, on a large data set, that's a potential concern.
Of course you wouldn't be displaying a really large result, but if it becomes a problem, then you'll want to store them in a collections.OrderedDict (https://stackoverflow.com/a/13062357/1766544) or find a sorteddict module.
from collections import OrderedDict

# sort once
ordered_dfs = OrderedDict.fromkeys(sorted(dfs.keys()))
ordered_dfs.move_to_end('event', last=False)
ordered_dfs.update(dfs)

# display as often as you need
for k, v in ordered_dfs.items():
    print(k, v)
If you display different fields first in different views, that's not a problem. Just sort all the fields normally, and use a function like the one above, without the sort.

Python Dictionaries: making the key:value the key and appending new values

EDIT: My question has been getting a lot of follow up questions because on the surface, it doesn't appear to make any sense. For most people, dictionaries are an illogical way to solve this problem. I agree, and have been frustrated by my constraints (explained in the comments). In my scenario, the original KV pairs are going to be encoded as data to be read by another server using the ObjectID. This, however, must be fed into an encoding function as a dictionary. The order does not matter, but the KV pairs must be given a new unique value. The original KV pairs will end up as a new string key in this new dictionary with the ObjectID as a new unique value.
Keep in mind that I am using Python 2.7.
The Issue
Note that this is a matter of presenting a dictionary (dictA), encoded by the ObjectID values, within the constraints of what I have been given
I have a dictionary, say dictA = {'a':'10', 'b':'20', 'c':'30'}, and I have a list of ObjectIdentifier('n'), where n is a number. What is the best way to create dictB so that dictB is a new dictionary with the key equal to dictA's key:value pair and the value equal to the corresponding ObjectIdentifier('n') in the list.
The new dictB should be:
{"'a':'10'":ObjectIdentifier('n'), "'b':'20'":ObjectIdentifier('n+1'), "'c':'30'":ObjectIdentifier('n+2')}
If that makes any sense.
The problem is that dictionaries aren't ordered. So you say
dictA = {'a':'10', 'b':'20', 'c':'30'}
but as far as python knows it could be
dictA = {'c':'30', 'a':'10', 'b':'20'}
Because dictionaries don't have order.
You could create your dict like this:
result = {pair: ObjectIdentifier(n + pos) for pos, pair in enumerate(dictA.items())}
But there is no way to determine which key will fall in which position, because, as I said, dictionaries don't have order.
If you want alphabetic order, just use sorted()
result = {pair: ObjectIdentifier(n + pos)
          for pos, pair in enumerate(sorted(dictA.items()))}
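If the literal string keys shown in the question ("'a':'10'" and so on) are actually required, a hypothetical sketch along these lines would produce them; ObjectIdentifier is stubbed out here as a plain function, and the starting index n = 5 is made up:

```python
# Stub standing in for the real ObjectIdentifier class (assumption).
def ObjectIdentifier(n):
    return "ObjectIdentifier(%s)" % n

dictA = {'a': '10', 'b': '20', 'c': '30'}
n = 5  # assumed starting index

# Build "'key':'value'" strings as keys, numbering in sorted key order.
dictB = {"'%s':'%s'" % (k, v): ObjectIdentifier(n + pos)
         for pos, (k, v) in enumerate(sorted(dictA.items()))}
# dictB == {"'a':'10'": "ObjectIdentifier(5)",
#           "'b':'20'": "ObjectIdentifier(6)",
#           "'c':'30'": "ObjectIdentifier(7)"}
```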
I don't know why you would want this
def ObjectIdentifier(n):
    print(n)
    return "ObjectIdentifier(" + str(n) + ")"

dictA = {'a': '10', 'b': '20', 'c': '30'}
dictB = {}
for n, key in enumerate(sorted(dictA.keys())):
    dictB[key] = {dictA[key]: ObjectIdentifier(str(n))}
Output:
{'a': {'10': 'ObjectIdentifier(0)'}, 'b': {'20': 'ObjectIdentifier(1)'}, 'c': {'30': 'ObjectIdentifier(2)'}}

Set value for existing key in nested dictionary without iterating through upper levels?

I have a very large nested dictionary of the form and example:
keyDict = {f: {t: {c_1: None, c_2: None, c_3: None, ..., c_n: None}}}
And another dictionary with keys and values:
valDict = {c_1: 13.37, c_2: -42.00, c_3: 0.00, ... c_n: -0.69}
I want to use the valDict to assign the values to the lowest level of the keyDict as fast as possible.
My current implementation is very slow, I think because I iterate through the two upper levels [f][t] of the keyDict. There must be a way to set the values at the lowest level without touching the upper levels, because the value of [c] does not depend on [f][t].
My current SLOW implementation:
for f in keyDict:
    for t in keyDict[f]:
        for c in keyDict[f][t]:
            keyDict[f][t][c] = valDict[c]
Still looking for a solution. [c] only has a few thousand keys, but [f][t] can have millions, so the way I do it, the value assignment happens millions of times, even though the value does NOT depend on f,t but ONLY on c.
To clarify the example per Alexis' request: the c dictionaries don't necessarily all have the same keys, but they DO have the same values for a given key. For example, to keep things simple, let's say there are only 3 possible keys for the c dict (c_1, c_2, c_3). Now one parent dictionary (e.g. f=1,t=1) may have just {c_2}, another parent dictionary (f=1,t=2) may have {c_2, c_3}, and yet another (e.g. f=999,t=999) might have all three {c_1, c_2, c_3}. Some parent dicts may have the same set of c's. What I am trying to do is assign the value to the c dict, which is defined purely by the c key, not t or f.
If the most nested dicts and valDict share exactly the same keys, it would be faster to use dict.update instead of looping over all the keys of the dict:
for dct in keyDict.values():
    for d in dct.values():
        d.update(valDict)
Also, it is more elegant and probably faster to loop over the values of the outer dicts directly instead of iterating over the keys and then accessing each value via the current key.
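A minimal sketch of this update approach on toy data, assuming (as stated above) that the nested dicts and valDict share exactly the same keys; the 'f1', 't1', 't2' names are invented:

```python
# Toy version of keyDict/valDict; 'f1', 't1', 't2' are made-up keys.
keyDict = {'f1': {'t1': {'c_1': None, 'c_2': None, 'c_3': None},
                  't2': {'c_1': None, 'c_2': None, 'c_3': None}}}
valDict = {'c_1': 13.37, 'c_2': -42.00, 'c_3': 0.00}

for dct in keyDict.values():
    for d in dct.values():
        d.update(valDict)  # one bulk copy per leaf dict, no per-key Python loop
# keyDict['f1']['t1'] is now {'c_1': 13.37, 'c_2': -42.0, 'c_3': 0.0}
```

Note that update copies every key of valDict into each leaf dict, so if a leaf originally held only a subset of the c keys, it gains the rest; that is exactly the caveat raised in the answer below.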
So you have millions of "c" dictionaries that you need to keep synchronized. The dictionaries have different sets of keys (presumably for good reason, but I trust you realize that your update code puts the new values in all the dictionaries), but the non-None values must change in lockstep.
You haven't explained what this data structure is for, but judging from your description, you should have a single c dictionary, not millions of them.
After all, you only have one set of valid "c" values; maintaining multiple copies is not only a performance problem, it puts an incredible burden of consistency on your code. But obviously, updating a single dictionary will be hugely faster than updating millions of them.
Of course you also want to know which keys were contained in each dictionary: To do this, your tree of dictionaries should terminate with sets of keys, which you can use to look up values as necessary.
In case my description is not clear, here is how your structure would be transformed:
all_c = dict()
for f in keyDict:
    for t in keyDict[f]:
        all_c.update((k, v) for k, v in keyDict[f][t].items() if v is not None)
        keyDict[f][t] = set(keyDict[f][t].keys())
This code builds a combined dictionary all_c with the non-None values from each of your bottom-level "c" dictionaries, then replaces each of the latter with the set of its keys. If you later need the complete dictionary at keyDict[f][t] (rather than access to particular values), you can reconstruct it like this:
f_t_cdict = dict((k, all_c[k]) for k in keyDict[f][t])
But I'm pretty sure you can do whatever it is you are doing by working with the sets keyDict[f][t], and simply looking up values directly in the combined dictionary all_c.

How to retain an item in python dictionary? [duplicate]

This question already has answers here:
Extract a subset of key-value pairs from dictionary?
(14 answers)
Closed 6 years ago.
I've a dictionary my_dict and a list of tokens my_tok as shown:
my_dict = {'tutor': 3,
           'useful': 1,
           'weather': 1,
           'workshop': 3,
           'thankful': 1,
           'puppy': 1}
my_tok = ['workshop', 'puppy']
Is it possible to retain in my_dict only the keys present in my_tok, rather than popping the rest?
i.e., I need to retain only workshop and puppy.
Thanks in advance!
Just overwrite it like so:
my_dict = {k:v for k, v in my_dict.items() if k in my_tok}
This is a dictionary comprehension that recreates my_dict using only the keys that are present as entries in the my_tok list.
As said in the comments, if the number of elements in the my_tok list is small compared to the number of dictionary keys, this solution is not the most efficient one. In that case it would be much better to iterate through the my_tok list instead, as follows:
my_dict = {k: my_dict.get(k) for k in my_tok}
which is more or less what the other answers propose. The only difference is the use of the .get dictionary method, which allows us not to care whether the key is present in the dictionary or not. If it isn't, the key is assigned the default value (None).
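A quick illustration of that .get behaviour; the extra 'kitten' token is made up to show the missing-key case:

```python
# Tokens missing from the dictionary come back as None instead of
# raising KeyError, because .get defaults to None.
my_dict = {'workshop': 3, 'puppy': 1}
my_tok = ['workshop', 'puppy', 'kitten']  # 'kitten' is a made-up extra token
rebuilt = {k: my_dict.get(k) for k in my_tok}
# rebuilt == {'workshop': 3, 'puppy': 1, 'kitten': None}
```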
Going over the keys from my_tok, and getting the results that are in the original dictionary:
my_dict = {i:my_dict[i] for i in my_tok}
Create a new copy
You can simply overwrite the original dictionary:
new_dic = {key: my_dict[key] for key in my_tok if key in my_dict}
Mind however that this constructs a new dictionary (perhaps you immediately write it back to my_dict), which has implications: other references to the original dictionary will not reflect this change.
Since the number of tokens (my_tok) is limited, it is probably better to iterate over these tokens and do a contains-check on the dictionary (instead of looping over the tuples in the original dictionary).
Update the original dictionary
Given you want the changes to be reflected in your original dictionary, you can, in a second step, .clear() the original dictionary and .update() it accordingly:
new_dic = {key: my_dict[key] for key in my_tok if key in my_dict}
my_dict.clear()
my_dict.update(new_dic)

Adding two asynchronous lists, into a dictionary

I've always found dictionaries to be an odd thing in Python. I know it is just me, I'm sure, but I can't work out how to take two lists and add them to a dict. If both lists were mappable it wouldn't be a problem; something like dictionary = dict(zip(list1, list2)) would suffice. However, during each run list1 will always have one item, while list2 could have multiple items or a single item that I'd like as values.
How could I approach adding the key and potentially multiple values to it?
After some deliberation, Kasramvd's second option seems to work well for this scenario:
dictionary.setdefault(list1[0], []).append(list2)
Based on your comment, all you need is to assign the second list as the value for the only item of the first list.
d = {}
d[list1[0]] = list2
And if you want to preserve the values for duplicate keys, you can use dict.setdefault() to build up a list of lists for each key.
d = {}
d.setdefault(list1[0], []).append(list2)
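A small sketch of how the setdefault pattern accumulates across several runs; the run data here is invented:

```python
# Each "run" supplies a one-item list1 and a list2 of one or more values.
d = {}
runs = [(['a'], [1, 2]),   # hypothetical run data
        (['a'], [3]),
        (['b'], [4])]
for list1, list2 in runs:
    # setdefault inserts an empty list the first time a key appears,
    # then every run's list2 is appended under that key.
    d.setdefault(list1[0], []).append(list2)
# d == {'a': [[1, 2], [3]], 'b': [[4]]}
```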
