Joining large dictionaries by identical keys

I have around 10 huge files that contain python dictionaries like so:
dict1:
{
    'PRO-HIS-MET': {
        'A': ([1,2,3],[4,5,6],[7,8,9]),
        'B': ([5,2],[6],[8,9]),
        'C': ([3],[4],[7,8])},
    'TRP-MET-GLN': {
        'F': ([-5,-4,1123],[-7,-11,2],[-636,-405])}
}
dict2:
{
    'PRO-HIS-MET': {
        'J': ([-657], [7,-20,3], [-8,-85,15])},
    'TRP-MET-GLN': {
        'K': ([1,2,3],[4,50,6],[7,80,9]),
        'L': ([5,20],[60,80],[8,9])}
}
Basically they are all dictionaries of dictionaries. Each file is around 1 GB in size (the above is just an example of the data). Anyway, what I would like to do is join the 10 dictionaries together:
final:
{
    'PRO-HIS-MET': {
        'A': ([1,2,3],[4,5,6],[7,8,9]),
        'B': ([5,2],[6],[8,9]),
        'C': ([3],[4],[7,8]),
        'J': ([-657], [7,-20,3], [-8,-85,15])},
    'TRP-MET-GLN': {
        'F': ([-5,-4,1123],[-7,-11,2],[-636,-405]),
        'K': ([1,2,3],[4,50,6],[7,80,9]),
        'L': ([5,20],[60,80],[8,9])}
}
I have tried the following code on small files and it works fine:
import csv
import collections
d1 = {}
d2 = {}
final = collections.defaultdict(dict)
for key, val in csv.reader(open('filehere.txt')):
    d1[key] = eval(val)
for key, val in csv.reader(open('filehere2.txt')):
    d2[key] = eval(val)
for key in d1:
    final[key].update(d1[key])
for key in d2:
    final[key].update(d2[key])
out = csv.writer(open('out.txt', 'w'))
for k, v in final.items():
    out.writerow([k, v])
However if I try that on my 1 GB files I quickly run out of memory by keeping d1 and d2 as well as the final dictionary in memory.
I have a couple ideas:
Is there a way where I can just load the keys from the segmented dictionaries, compare those, and if the same ones are found in multiple dictionaries just combine the values?
Instead of merging the dictionaries into one huge file (which will probably give me memory headaches in the future), how can I make many separate files that contain all the values for one key after merging data? For example, for the above data I would just have:
pro-his-met.txt:
'PRO-HIS-MET': {
    'A': ([1,2,3],[4,5,6],[7,8,9]),
    'B': ([5,2],[6],[8,9]),
    'C': ([3],[4],[7,8]),
    'J': ([-657], [7,-20,3], [-8,-85,15])}
trp-met-gln.txt:
'TRP-MET-GLN': {
    'F': ([-5,-4,1123],[-7,-11,2],[-636,-405]),
    'K': ([1,2,3],[4,50,6],[7,80,9]),
    'L': ([5,20],[60,80],[8,9])}
I don't have too much programming experience as a biologist (you may have guessed the above data represents a bioinformatics problem) so any help would be much appreciated!

The shelve module is a very easy-to-use database for Python. It's nowhere near as powerful as a real database (for that, see @Voo's answer), but it will do the trick for manipulating large dictionaries.
First, create shelves from your dictionaries:
import csv
import shelve
s = shelve.open('filehere.db', flag='n', protocol=-1, writeback=False)
for key, val in csv.reader(open('filehere.txt')):
    s[key] = eval(val)
s.close()
Now that you've shelved everything neatly, you can operate on the dictionaries efficiently:
import shelve
import itertools
s = shelve.open('final.db', flag='c', protocol=-1, writeback=False)
s1 = shelve.open('file1.db', flag='r')
s2 = shelve.open('file2.db', flag='r')
for key, val in itertools.chain(s1.items(), s2.items()):  # iteritems() on Python 2
    d = s.get(key, {})
    d.update(val)
    s[key] = d # force write
s1.close()
s2.close()
s.close()
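As a rough, self-contained sketch of the same idea for any number of shelves (Python 3), the merge step could be wrapped in a function. The name merge_shelves, the temporary directory, and the tiny sample data are assumptions made only for illustration; they are not part of the original answer.

```python
import os
import shelve
import tempfile


def merge_shelves(source_paths, target_path):
    """Merge dict-of-dict shelves into one shelf, one outer key at a time."""
    with shelve.open(target_path, flag="n") as target:
        for path in source_paths:
            with shelve.open(path, flag="r") as source:
                for key in source:
                    merged = target.get(key, {})
                    merged.update(source[key])
                    target[key] = merged  # reassign so the change is written


# Made-up sample data standing in for the 1 GB files.
samples = [
    {"PRO-HIS-MET": {"A": ([1, 2, 3], [4, 5, 6])}, "TRP-MET-GLN": {"F": ([-5], [2])}},
    {"PRO-HIS-MET": {"J": ([-657], [7])}, "TRP-MET-GLN": {"K": ([1], [4])}},
]
workdir = tempfile.mkdtemp()
paths = []
for number, sample in enumerate(samples):
    path = os.path.join(workdir, "part%d" % number)
    with shelve.open(path, flag="n") as shelf:
        shelf.update(sample)
    paths.append(path)

final_path = os.path.join(workdir, "final")
merge_shelves(paths, final_path)
with shelve.open(final_path, flag="r") as final:
    merged = dict(final)
```

Only one outer key's inner dictionary is held in memory at a time, which is the point of shelving the data first.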

Personally, this sounds like the archetype of a problem databases were invented to solve. Yes, you can solve this yourself by keeping files around and, for performance, mapping them into memory and letting the OS handle the swapping, etc., but this is really complicated and hard to do well.
Why go through all this effort if you can let a DB, into which millions of man-hours have been put, handle it? That will be more efficient and, as an added benefit, much easier to query for information.
I've seen Oracle DBs storing much more than 10 GB of data without any problems, and I'm sure PostgreSQL will handle this just as well. The nice thing is that if you use an ORM you can abstract those nitty-gritty details away and worry about them later if it becomes necessary.
Also, while bioinformatics isn't my speciality, I'm pretty sure there are solutions tailored specifically to bioinformatics around. Maybe one of them would be the perfect fit?

This concept should work.
I would consider making multiple passes over the files, processing a portion of the keys each time and saving that result.
E.g. create a list of the unique first characters of all the keys in one pass, then make one pass per character, writing each to a new output file. If it were simple alphabetic data, the logical choice would be a loop over each letter of the alphabet.
E.g. in the "p" pass you would process 'PRO-HIS-MET'.
Then you would combine all the results from all the files at the end.
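The multi-pass idea can be sketched roughly as follows. It assumes the input files use the two-column CSV layout from the question (outer key, then the inner dictionary as text), and it writes every pass to the same output file, which takes the place of the final combining step. The function names and the sample data are made up for illustration; ast.literal_eval is used here as a safer stand-in for eval.

```python
import ast
import csv
import os
import tempfile


def first_characters(paths):
    """Pass 0: collect the distinct first characters of all outer keys."""
    chars = set()
    for path in paths:
        with open(path, newline="") as handle:
            for key, _ in csv.reader(handle):
                chars.add(key[:1])
    return sorted(chars)


def merge_in_passes(paths, out_path):
    """One pass per first character; only that slice of keys is in memory."""
    with open(out_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for char in first_characters(paths):
            partial = {}
            for path in paths:
                with open(path, newline="") as handle:
                    for key, val in csv.reader(handle):
                        if key[:1] == char:
                            partial.setdefault(key, {}).update(ast.literal_eval(val))
            for key, inner in partial.items():
                writer.writerow([key, inner])


# Tiny demonstration with two made-up input files.
samples = [
    {"PRO-HIS-MET": {"A": ([1, 2, 3], [4, 5, 6])}, "TRP-MET-GLN": {"F": ([-5], [2])}},
    {"PRO-HIS-MET": {"J": ([-657], [7])}, "TRP-MET-GLN": {"K": ([1], [4])}},
]
workdir = tempfile.mkdtemp()
inputs = []
for number, sample in enumerate(samples):
    path = os.path.join(workdir, "part%d.txt" % number)
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(sample.items())
    inputs.append(path)

out_path = os.path.join(workdir, "out.txt")
merge_in_passes(inputs, out_path)
with open(out_path, newline="") as handle:
    merged = {key: ast.literal_eval(val) for key, val in csv.reader(handle)}
```

The trade-off is that every input file is read once per pass, so fewer, larger buckets cost more memory and more, smaller buckets cost more I/O.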
If you were a developer, the database idea in the previous answer would probably be the best approach, provided you can handle that kind of interaction. That idea entails creating a two-level structure where you insert and update the records, then query the result with an SQL statement.
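A minimal sketch of such a two-level structure, using the sqlite3 module from the standard library, might look like this. The table name, column names, and helper functions are assumptions for illustration, and the values are stored as JSON text, so the tuples from the question come back as lists.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead for real data
conn.execute(
    "CREATE TABLE entries ("
    " outer_key TEXT, inner_key TEXT, payload TEXT,"
    " PRIMARY KEY (outer_key, inner_key))"
)


def load(conn, dictionary):
    """Insert every (outer key, inner key, value) triple as one row."""
    rows = (
        (outer, inner, json.dumps(value))
        for outer, inner_dict in dictionary.items()
        for inner, value in inner_dict.items()
    )
    with conn:  # commits on success
        conn.executemany("INSERT OR REPLACE INTO entries VALUES (?, ?, ?)", rows)


def lookup(conn, outer_key):
    """Rebuild the merged inner dictionary for one outer key."""
    cursor = conn.execute(
        "SELECT inner_key, payload FROM entries WHERE outer_key = ?", (outer_key,)
    )
    return {inner: json.loads(payload) for inner, payload in cursor}


# Made-up sample data; in practice each input file would be loaded in turn.
load(conn, {"PRO-HIS-MET": {"A": ([1, 2, 3], [4, 5, 6])}})
load(conn, {"PRO-HIS-MET": {"J": ([-657], [7])}, "TRP-MET-GLN": {"K": ([1], [4])}})
result = lookup(conn, "PRO-HIS-MET")
```

Because rows with the same outer key simply accumulate, the join by identical keys happens in the query rather than in memory.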

More consistent hashing in dictionary with python objects?

So, I saw Hashing a dictionary?, and I was trying to figure out a way to handle python native objects better and produce stable results.
After looking at all the answers + comments this is what I came to and everything seems to work properly, but am I maybe missing something that would make my hashing inconsistent (besides hash algorithm collisions)?
md5(repr(nested_dict).encode()).hexdigest()
tl;dr: it creates a string with the repr and then hashes the string.
Generated my testing nested dict with this:
nested_dict = {}
for i in range(100):
    for j in range(100):
        if not nested_dict.get(i,None):
            nested_dict[i] = {}
        nested_dict[i][j] = ''
I'd imagine repr should be able to support any Python object, since most implement __repr__, but I'm still pretty new to Python programming. One thing I've heard is that using from reprlib import repr instead of the built-in one will truncate large sequences. So that's one potential pitfall, but the built-in repr doesn't seem to do that for the native list and set types.
other notes:
I'm not able to use https://stackoverflow.com/a/5884123, because I'm going to have nested dictionaries.
I used python 3.9.7 when testing this out.
Not able to use https://stackoverflow.com/a/22003440, because at the time of hashing it still has IPv4 address objects as keys. (json.dumps didn't like that too much 😅)

Python dicts are insertion-ordered, and repr respects that. Your hexdigest of {"A":1,"B":2} will differ from that of {"B":2,"A":1}, whereas ==-wise those dicts are the same.
Yours won't work out:
from hashlib import md5
def yourHash(d):
    return md5(repr(d).encode()).hexdigest()
a = {"A":1,"B":2}
b = {"B":2,"A":1}
print(repr(a))
print(repr(b))
print (a==b)
print(yourHash(a) == yourHash(b))
gives
{'A': 1, 'B': 2} # repr a
{'B': 2, 'A': 1} # repr b
True # a == b
False # yourHash(a) == yourHash(b)
I really do not see the "sense" in hashing dicts at all ... and the ones here are not even "nested".
You could try JSON: json.dumps() with sort_keys=True sorts keys down to the last nested level, and you can hash the dump of the whole structure. Still, I don't see the sense, and it will give you plenty of computational overhead:
import json
a = {"A":1,"B":2, "C":{2:1,3:2}}
b = {"B":2,"A":1, "C":{3:2,2:1}}
for di in (a,b):
    print(json.dumps(di,sort_keys=True))
gives
{"A": 1, "B": 2, "C": {"2": 1, "3": 2}} # thoroughly sorted
{"A": 1, "B": 2, "C": {"2": 1, "3": 2}} # recursively...
which is exactly what this answer in Hashing a dictionary? proposes ... why stray from it?
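Since the question mentions IPv4 address objects as keys, which json.dumps rejects, one possible workaround is to normalise the structure before dumping it. The sketch below is only an illustration under stated assumptions: canonical and stable_hash are made-up names, stringifying keys makes 1 and "1" indistinguishable, and SHA-256 is swapped in for the md5 call used in the question.

```python
import hashlib
import ipaddress
import json


def canonical(obj):
    """Convert obj into JSON-friendly data with a deterministic layout."""
    if isinstance(obj, dict):
        # Non-string keys (e.g. IPv4Address) are stringified here.
        return {str(key): canonical(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [canonical(item) for item in obj]
    if isinstance(obj, (set, frozenset)):
        # Sets have no order, so sort their normalised members.
        return sorted((canonical(item) for item in obj), key=repr)
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    return str(obj)  # fallback; assumes str() of the object is stable


def stable_hash(obj):
    """Hash the sorted JSON dump of the normalised structure."""
    text = json.dumps(canonical(obj), sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Same content, different insertion order.
a = {ipaddress.IPv4Address("10.0.0.1"): {"A": 1, "B": 2}, "tags": {"x", "y"}}
b = {"tags": {"y", "x"}, ipaddress.IPv4Address("10.0.0.1"): {"B": 2, "A": 1}}
same = stable_hash(a) == stable_hash(b)
```

Here same is True because both the key order and the set order are normalised before hashing.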

Python: cycling/scanning though fields in an object

I have a JSON file named MyFile.json that contains this structure:
[{u'randomName1': {u'A': 16,u'B': 20,u'C': 71},u'randomName2': {u'A': 12,u'B': 17,u'C': 47}},...]
I can open the file and load it like this:
import json
with open('MyFile.json') as data_file:
    data = json.load(data_file)
And I can access the values in the first element like this:
data[0]["randomName1"]["A"]
data[0]["randomName1"]["B"]
data[0]["randomName1"]["C"]
data[0]["randomName2"]["A"]
data[0]["randomName2"]["B"]
data[0]["randomName2"]["C"]
The A B C keys are always named A B C (and there are always exactly 3 of them), so that's no problem.
The problem is:
1) I don't know how many elements are in the list, and
2) I don't know how many "randomName" keys are in each element, and
3) I don't know the names of the randomName keys.
How do I scan/cycle through the entire file, getting all the elements, and getting all the key names and associated key values for each element?
I don't have the knowledge or desire to write a complicated parsing script of my own. I was expecting that there's a way for the json library to provide this information.
For example (and this is not a perfect analogy I realize) if I am given an array X in AWK, I can scan all the index/name pairs by using
for(index in X){print index, X[index]}
Is there something like this in Python?
---------------- New info below this line -------------
Thank you Padraic and E.Gordon. That goes a long way toward solving the problem.
In an attempt to make my initial post as concise as possible, I simplified my JSON data example too much.
My JSON data actually looks like this:
data=[
{u'X': {u'randomName1': {u'A': 11,u'B': 12,u'C': 13}, u'randomName2': {u'A': 21,u'B': 22,u'C': 23}, ... }, u'Y': 101, u'Z': 102 },
.
.
.
]
The ellipses represent arbitrary repetition, as described in the original post. The X Y Z keys are always named X Y Z (and there are always exactly 3 of them).
Using your posts as a starting point, I've been working on this for a couple of hours, but being new to Python I'm stumped. I cannot figure out how to add the extra loop to work with that data. I would like the output stream to look something like this:
Z,102,Y,101,randomName1,A,11,B,12,C,13,randomName2,A,21,B,22,C,23,...
.
.
.
Thanks for your help.
----------------- 3/23/16 update below --------------
Again, thanks for the help. Here's what I finally came up with. It does what I need:
import json
with open('MyFile.json') as data_file:
    data = json.load(data_file)
for record in data:
    print(record['Z'], record['Y'])
    # record['X'] maps each random name to its inner A/B/C dictionary
    for randomName, values in record['X'].items():
        print(randomName, values['A'], values['B'], values['C'])
...

You can print the items in the dicts:
js = [{u'randomName1': {u'A': 16,u'B': 20,u'C': 71},u'randomName2': {u'A': 12,u'B': 17,u'C': 47}}]
for dct in js:
    for k, v in dct.items():
        print(k, v)
Which gives you the key/inner dict pairings:
randomName1 {'B': 20, 'A': 16, 'C': 71}
randomName2 {'B': 17, 'A': 12, 'C': 47}
If you want the values from the inner dicts you can add another loop
for dct in js:
    for k1, d in dct.items():
        print(k1)
        for k2,v in d.items():
            print(k2,v)
Which will give you:
randomName1
A 16
B 20
C 71
randomName2
A 12
B 17
C 47
If you have arbitrary levels of nesting we will have to do it recursively.
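A minimal sketch of such a recursive traversal is shown below. The generator name walk and the path-tuple convention are assumptions chosen for illustration; it yields every leaf value together with the keys and list indexes that lead to it.

```python
def walk(node, path=()):
    """Yield (path, value) for every leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, path + (index,))
    else:
        yield path, node


js = [{'randomName1': {'A': 16, 'B': 20, 'C': 71},
       'randomName2': {'A': 12, 'B': 17, 'C': 47}}]
leaves = list(walk(js))
for path, value in leaves:
    print(path, value)
```

The same function works unchanged if further levels of nesting are added, since each dict or list simply triggers another recursive call.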

You can use the for element in list construct to loop over all the elements in a list, without having to know its length.
The items() dictionary method (iteritems() in Python 2) provides a convenient way to get the key-value pairs from a dictionary, again without needing to know how many there are or what the keys are called.
For example:
import json
with open('MyFile.json') as data_file:
    data = json.load(data_file)
for element in data:
    for name, values in element.items():  # iteritems() on Python 2
        print("%s has A=%d, B=%d and C=%d" % (name,
                                              values["A"],
                                              values["B"],
                                              values["C"]))
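For the extended structure described in the question's update (records with X, Y and Z keys), the same two loops can be arranged to build one comma-separated line per record. This is only a sketch: flatten_record is a made-up helper name, the sample record copies the values from the question, and Python 3.7+ dict ordering is assumed.

```python
def flatten_record(record):
    """Build 'Z,<z>,Y,<y>,<name>,A,<a>,B,<b>,C,<c>,...' for one record."""
    fields = ["Z", record["Z"], "Y", record["Y"]]
    for name, values in record["X"].items():
        fields.append(name)
        for key in ("A", "B", "C"):
            fields.extend([key, values[key]])
    return ",".join(str(field) for field in fields)


data = [
    {
        "X": {
            "randomName1": {"A": 11, "B": 12, "C": 13},
            "randomName2": {"A": 21, "B": 22, "C": 23},
        },
        "Y": 101,
        "Z": 102,
    }
]
lines = [flatten_record(record) for record in data]
for line in lines:
    print(line)
```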

Shortest path algorithm using dictionaries [Python]

This is my first question, and actually my first time trying this, but I read the rules for questions and I hope my question complies with all of them.
I have a project for my algorithms course: designing a GUI for Dijkstra's shortest path algorithm. I chose to use Python because it is a language that I would like to master. I have been trying for more than a week and I am running into trouble all the way. But anyway, this is good fun :)!
I chose to represent my directed graph as a dictionary in this way :
g = {'A': {'B': 20, 'D': 80, 'G': 90},  # A can direct to B, D and G
     'B': {'F': 10},
     'F': {'C': 10, 'D': 40},
     'C': {'D': 10, 'H': 20, 'F': 50},
     'D': {'G': 20},
     'G': {'A': 20},
     'E': {'G': 30, 'B': 50},
     'H': None}  # H is not directed to anything, but can be accessed through C
So each key is a vertex, and its value holds the linked vertices and their weights. This is an example graph, but I was planning to ask the user to input their own graph details and then examine the shortest path between any two nodes [start -> end]. The problem, however, is that I don't even know how to access the inner dictionary so I can work on the inner parameters. I tried many ways, like these two:
for i in g:
    counter = 0
    print(g[i[counter]]) # One
    print(g.get(i[counter])) # Two
but both give me the same output, shown below. (Note that I can't really access and play with the inner parameters.)
{"B": 20, 'D': 80, 'G' :90}
{'F' : 10}
{'C':10,'D':40}
{'D':10,'H':20,'F':50}
{'G':20}
{'A':20}
{'G':30,'B':50}
None
So my question is, could you please help me with how to access the inner dictionaries so I can start working on the algorithm itself. Thanks a lot in advance and thanks for reading.

This is actually not so hard, and should make complete sense once you see it. Let's take your g. We want to get the weight of the 'B' connection from the 'A' node:
>>> d = g['A']
>>> d
{"B": 20, 'D': 80, 'G' :90}
>>> d['B']
20
>>> g['A']['B']
20
Using g['A'] gets us the value of the key in dictionary g. We can act directly on this value by referring to the 'B' key.

Using a for loop will iterate over the keys of a dictionary, and by using the key, you can fetch the value that is associated to the key. If the value itself is a dictionary, you can use another loop.
for fromNode in g:
    neighbors = g[fromNode]
    for toNode in neighbors:
        distance = neighbors[toNode]
        print("%s -> %s (%d)" % (fromNode, toNode, distance))
Note that for this to work, you should use an empty dictionary {} instead of None when there are no neighbors.
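Once the inner dictionaries are accessible this way, the shortest-path computation itself can be sketched with the standard heapq module. This is one common formulation of Dijkstra's algorithm, not code from the question; the function name dijkstra is an assumption, and a None value is treated like an empty neighbor dictionary so the graph from the question can be used as written.

```python
import heapq


def dijkstra(graph, start):
    """Return the shortest known distance from start to each reachable node."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        # "or {}" lets a node with value None behave like one with no neighbors
        for neighbor, weight in (graph.get(node) or {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


g = {'A': {'B': 20, 'D': 80, 'G': 90},
     'B': {'F': 10},
     'F': {'C': 10, 'D': 40},
     'C': {'D': 10, 'H': 20, 'F': 50},
     'D': {'G': 20},
     'G': {'A': 20},
     'E': {'G': 30, 'B': 50},
     'H': None}
dist = dijkstra(g, 'A')
print(dist['H'])  # shortest A -> H runs A, B, F, C, H = 20 + 10 + 10 + 20 = 60
```

Nodes that cannot be reached from the start, such as 'E' when starting at 'A', are simply absent from the result.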

I guess these give you some ideas:
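for neighbors in g.values():
    if neighbors:  # skip None (a node with no outgoing edges)
        print(neighbors.get("B", ""))
for neighbors in g.values():
    if neighbors:
        print(list(neighbors.keys()))  # or neighbors.values()
for neighbors in g.values():
    if neighbors and "B" in neighbors:
        print(neighbors["B"])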

How to create a dictionary based on variable value in Python

I am trying to create a dictionary where the name comes from a variable.
Here is the situation since maybe there is a better way:
I'm using an API to get attributes of "objects" (Name, Description, X, Y, Z, etc.). I want to store this information in a way that keeps the data by "object".
In order to get this info, the API iterates through all the "objects".
So my proposal was that if the object name is one of the ones I want to "capture", I create a dictionary with that name, like so:
ObjectName = {'Description': VarDescription, 'X': VarX.. etc}
(Where I say "Varetc..." that would be the value of that attribute passed by the API.)
Now, since I know the list of names ahead of time, I CAN use a really long if tree, but I am looking for something easier to code to accomplish this (and extensible without adding too much code).
Here is code I have:
def py_cell_object():
    #object counter - unrelated to question
    addtototal()
    #is this an object I want?
    if aw.aw_string (239)[:5] == "TDT3_":
        #If yes, make a dictionary with the object description as the name of the dictionary.
        vars()[aw.aw_string (239)]={'X': aw.aw_int (232), 'Y': aw.aw_int (233), 'Z': aw.aw_int (234), 'No': aw.aw_int (231)}
        #print back result to test
        for key in aw.aw_string (239):
            print('key=%s, value=%s' % (key, aw.aw_string (239)[key]))
here are the first two lines of code to show what "aw" is
from ctypes import *
aw = CDLL("aw")
to explain what the numbers in the API calls are:
231 AW_OBJECT_NUMBER,
232 AW_OBJECT_X,
233 AW_OBJECT_Y,
234 AW_OBJECT_Z,
239 AW_OBJECT_DESCRIPTION,
231-234 are integers and 239 is a string

I deduce that you are using the Active Worlds SDK. It would save time to mention that in the first place in future questions.
I guess your goal is to create a top-level dictionary, where each key is the object description. Each value is another dictionary, storing many of the attributes of that object.
I took a quick look at the AW SDK documentation on the wiki and I don't see a way to ask the SDK for a list of attribute names, IDs, and types. So you will have to hard-code that information in your program somehow. Unless you need it elsewhere, it's simplest to just hard-code it where you create the dictionary, which is what you are already doing. To print it back out, just print the attribute dictionary's repr. I would probably format your method more like this:
def py_cell_object():
    # object counter - unrelated to question
    addtototal()
    description = aw.aw_string(239)
    if description.startswith("TDT3_"):
        vars()[description] = {
            'DESCRIPTION': description,
            'X': aw.aw_int(232),
            'Y': aw.aw_int(233),
            'Z': aw.aw_int(234),
            'NUMBER': aw.aw_int(231),
            # ... etc. for the remaining attributes
        }
        print(repr(vars()[description]))
Some would argue that you should make named constants for the numbers 232, 233, 234, etc., but I see little reason to do that unless you need them in multiple places, or unless it's easy to generate them automatically from the SDK (for example, by parsing a .h file).
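The top-level dictionary described above can also be kept as an ordinary dict passed in explicitly, which avoids writing into vars(). The sketch below is an illustration only: capture_object, its parameters, and the fake attribute table are hypothetical, and a plain dictionary lookup stands in for aw.aw_int so the example runs without the SDK.

```python
def capture_object(store, description, get_int, prefix="TDT3_"):
    """Store one object's attributes under its description, if it matches."""
    if not description.startswith(prefix):
        return False
    store[description] = {
        "DESCRIPTION": description,
        "X": get_int(232),
        "Y": get_int(233),
        "Z": get_int(234),
        "NUMBER": get_int(231),
    }
    return True


# Stand-in for aw.aw_int, returning made-up values so the sketch runs alone.
fake_attributes = {231: 7, 232: 100, 233: 200, 234: 300}
objects = {}
capture_object(objects, "TDT3_door", fake_attributes.get)
capture_object(objects, "other_thing", fake_attributes.get)
print(repr(objects))
```

Looking up a captured object later is then just objects["TDT3_door"]["X"], with no need to know the names in advance.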

If the variables are defined in the local scope, it's as simple as:
obj_names = {}
while True:
    varname = read_name()
    if not varname: break
    obj_names[varname] = locals()[varname]

This is actual code I am using in my production environment; I hope it helps.
cveDict = {}
# strVul is a Python list holding the vulnerabilities belonging to a report
report = Report.objects.get(pk=report_id)
vul = Vulnerability.objects.filter(report_id=report_id)
strVul = map(str, vul)
# fill up the Python dict, += 1 if cvetype already exists
for cve in strVul:
    i = Cve.objects.get(id=cve)
    if i.vul_cvetype in cveDict.keys():
        cveDict[i.vul_cvetype] += 1
    else:
        cveDict[i.vul_cvetype] = 1
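The counting pattern in that loop is what collections.Counter from the standard library provides. A small sketch, using a made-up list of type names in place of the Django query results:

```python
from collections import Counter

# Made-up values standing in for each i.vul_cvetype from the loop above.
cve_types = ["overflow", "xss", "overflow", "sqli", "overflow"]

cve_counts = Counter(cve_types)
print(cve_counts.most_common())
```

A Counter is a dict subclass, so it can be used wherever cveDict was, and a missing key reads as 0 instead of raising KeyError.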

Parse string with three-level delimitation into dictionary

I've found how to split a delimited string into key:value pairs in a dictionary elsewhere, but I have an incoming string that also includes two parameters that amount to dictionaries themselves: parameters with one or three key:value pairs inside:
clientid=b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0&keyid=987654321&userdata=ip:192.168.10.10,deviceid:1234,optdata:75BCD15&md=AMT-Cam:avatar&playbackmode=st&ver=6&sessionid=&mk=PC&junketid=1342177342&version=6.7.8.9012
Obviously these are dummy parameters to obfuscate proprietary code, here. I'd like to dump all this into a dictionary with the userdata and md keys' values being dictionaries themselves:
requestdict = {'clientid' : 'b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0', 'keyid' : '987654321', 'userdata' : {'ip' : '192.168.10.10', 'deviceid' : '1234', 'optdata' : '75BCD15'}, 'md' : {'Cam' : 'avatar'}, 'playbackmode' : 'st', 'ver' : '6', 'sessionid' : '', 'mk' : 'PC', 'junketid' : '1342177342', 'version' : '6.7.8.9012'}
Can I take the slick two-level delimitation parsing command that I've found:
requestDict = dict(line.split('=') for line in clientRequest.split('&'))
and add a third level to it to handle & preserve the 2nd-level dictionaries? What would the syntax be? If not, I suppose I'll have to split by & and then check & handle splits that contain : but even then I can't figure out the syntax. Can someone help? Thanks!

I basically took Kyle's answer and made it more future-friendly:
def dictelem(input):
    parts = input.split('&')
    listing = [part.split('=') for part in parts]
    result = {}
    for entry in listing:
        head, tail = entry[0], ''.join(entry[1:])
        if ':' in tail:
            entries = tail.split(',')
            result.update({ head : dict(e.split(':') for e in entries) })
        else:
            result.update({head: tail})
    return result

Here's a two-liner that does what I think you want:
dictelem = lambda x: x if ':' not in x[1] else [x[0],dict(y.split(':') for y in x[1].split(','))]
a = dict(dictelem(x.split('=')) for x in input.split('&'))

Can I take the slick two-level delimitation parsing command that I've found:
requestDict = dict(line.split('=') for line in clientRequest.split('&'))
and add a third level to it to handle & preserve the 2nd-level dictionaries?
Of course you can, but (a) you probably don't want to, because nested comprehensions beyond two levels tend to get unreadable, and (b) this super-simple syntax won't work for cases like yours, where only some of the data can be turned into a dict.
For example, what should happen with 'PC'? Do you want to make that into {'PC': None}? Or maybe the set {'PC'}? Or the list ['PC']? Or just leave it alone? You have to decide, and write the logic for that, and trying to write it as an expression will make your decision very hard to read.
So, let's put that logic in a separate function:
def parseCommasAndColons(s):
    bits = [bit.split(':') for bit in s.split(',')]
    try:
        return dict(bits)
    except ValueError:
        return bits
This will return a dict like {'ip': '192.168.10.10', 'deviceid': '1234', 'optdata': '75BCD15'} or {'AMT-Cam': 'avatar'} for cases where each comma-separated component has a colon inside it, but a list of lists like [['1342177342']] for cases where any of them don't.
Even this may be a little too clever; I might make the "is this in dictionary format" check more explicit instead of just trying to convert the list of lists and see what happens.
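A more explicit version of that check might look like the sketch below. It is an illustration rather than the function above: parse_commas_and_colons is a made-up name, and it leaves plain values as the original string instead of returning a list of lists.

```python
def parse_commas_and_colons(s):
    """Return a dict if every comma-separated item has a colon, else s itself."""
    items = s.split(',')
    if all(':' in item for item in items):
        # Split on the first colon only, so values may contain colons.
        return dict(item.split(':', 1) for item in items)
    return s


userdata = parse_commas_and_colons('ip:192.168.10.10,deviceid:1234,optdata:75BCD15')
md = parse_commas_and_colons('AMT-Cam:avatar')
mk = parse_commas_and_colons('PC')
sessionid = parse_commas_and_colons('')
```

With this variant an empty value such as sessionid stays an empty string, because the all() test fails when an item contains no colon.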
Either way, how would you put that back into your original comprehension?
Well, you want to call it on the value in the line.split('='). So let's add a function for that:
def parseCommasAndColonsForValue(keyvalue):
    if len(keyvalue) == 2:
        return keyvalue[0], parseCommasAndColons(keyvalue[1])
    else:
        return keyvalue
requestDict = dict(parseCommasAndColonsForValue(line.split('='))
                   for line in clientRequest.split('&'))
One last thing: Unless you need to run on older versions of Python, you shouldn't often be calling dict on a generator expression. If it can be rewritten as a dictionary comprehension, it will almost certainly be clearer that way, and if it can't be rewritten as a dictionary comprehension, it probably shouldn't be a 1-liner expression in the first place.
Of course breaking expressions up into separate expressions, turning some of them into statements or even functions, and naming them does make your code longer—but that doesn't necessarily mean worse. About half of the Zen of Python (import this) is devoted to explaining why. Or one quote from Guido: "Python is a bad language for code golf, on purpose."
If you really want to know what it would look like, let's break it into two steps:
>>> {k: [bit2.split(':') for bit2 in v.split(',')] for k, v in (bit.split('=') for bit in s.split('&'))}
{'clientid': [['b59694bf-c7c1-4a3a-8cd5-6dad69f4abb0']],
'junketid': [['1342177342']],
'keyid': [['987654321']],
'md': [['AMT-Cam', 'avatar']],
'mk': [['PC']],
'playbackmode': [['st']],
'sessionid': [['']],
'userdata': [['ip', '192.168.10.10'],
['deviceid', '1234'],
['optdata', '75BCD15']],
'ver': [['6']],
'version': [['6.7.8.9012']]}
That illustrates why you can't just add a dict call for the inner level—because most of those things aren't actually dictionaries, because they had no colons. If you changed that, then it would just be this:
{k: dict(bit2.split(':') for bit2 in v.split(',')) for k, v in (bit.split('=') for bit in s.split('&'))}
I don't think that's very readable, and I doubt most Python programmers would. Reading it 6 months from now and trying to figure out what I meant would take a lot more effort than writing it did.
And trying to debug it will not be fun. What happens if you run that on your input, with missing colons? ValueError: dictionary update sequence element #0 has length 1; 2 is required. Which sequence? No idea. You have to break it down step by step to see what doesn't work. That's no fun.
So, hopefully that illustrates why you don't want to do this.
