Creating (seeding) large dictionaries efficiently in Python

I have a long (500K+ rows) two column spreadsheet that looks like this:
Name Code
1234 A
1234 B
1456 C
4556 A
4556 B
4556 C
...
So there is an element (with a Name) that can have a number of Codes. But instead of one row per code, I would like a list of all the codes that occur for each element. What I want is a dictionary like this:
{"1234":["A","B"],"1456":["C"],"4556":["A","B","C"], ...}
What I have tried is this (and I'm not including the file reading syntax).
codelist = {}
for row in rows:
    name, code = row.split()
    if name in codelist.keys():
        codelist[name].append(code)
    else:
        codelist[name] = [code]
This creates the right output but progress becomes incredibly slow. So I've tried priming my dictionary with keys:
allnames = [.... list of all the names ...]
codelist = dict.fromkeys(allnames)
for row in rows:
    name, code = row.split()
    if codelist[name]:
        codelist[name].append(code)
    else:
        codelist[name] = [code]
This is dramatically faster, and my question is why? Doesn't the program each time still have to search all the keys in the dict? Is there another way to speed up the dict search that doesn't include traversing a tree?
Interesting is the error I get when I use the same conditional check as before (if name in codelist.keys():) after priming my dictionary.
Traceback (most recent call last):
  File ....
    codelist[name].append(code)
AttributeError: 'NoneType' object has no attribute 'append'
Now, there is a key but no list to append to. So I use codelist[name] which is <NoneType> as well and appears to work. What does it mean when mydict["primed key"] is <NoneType>?

The first version is slower because, in Python 2, .keys() has to build a list of all the keys in memory first, and the in operator then searches that list linearly. That is an O(N) search for every line of the text file, hence the slowdown.
On the other hand, a plain key in dict test takes O(1) time on average.
dict.fromkeys(allnames)
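The default value assigned by dict.fromkeys is None, so you can't use append on it. That is also why the if codelist[name]: check appears to work: None is falsy, so the else branch runs and replaces it with a list the first time a name is seen.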
>>> d = dict.fromkeys('abc')
>>> d
{'a': None, 'c': None, 'b': None}
A better solution is to use collections.defaultdict here; if that is not an option, use a normal dict with either a simple if-else check or dict.setdefault.
In Python 3, .keys() returns a view object, so no list is built and name in codelist.keys() is an average O(1) hash lookup as well. It is still slightly slower than a plain key in dict test because of the extra method call.
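For the dict.setdefault option mentioned above, here is a minimal sketch. The rows list is made-up sample data standing in for the spreadsheet, since the question does not show how the file is read.

```python
# Hypothetical sample data standing in for the spreadsheet rows.
rows = ["1234 A", "1234 B", "1456 C", "4556 A", "4556 B", "4556 C"]

codelist = {}
for row in rows:
    name, code = row.split()
    # setdefault returns the list already stored under name,
    # or stores a new empty list first and returns that.
    codelist.setdefault(name, []).append(code)

# For comparison: a dict primed with fromkeys holds None, not a list.
primed = dict.fromkeys(["1234", "1456"])

print(codelist)
print(primed["1234"])
```

No priming step is needed here, and every lookup is a plain hash lookup on the dict itself.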

You might want to have a look at the defaultdict container to avoid the checks:
from collections import defaultdict

codelist = defaultdict(list)  # missing keys start out as an empty list, so no priming is needed
for row in rows:
    name, code = row.split()
    codelist[name].append(code)

Related

Comparing items through a tuple in Python

I am given an assignment where I am supposed to define a function that returns the second element of a tuple if the first element of a tuple matches with the argument of a function.
Specifically, let's say that I have a list of student registration numbers that goes by:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
And I have defined a function that is supposed to take in the argument of reg_num, such as "S12345", and return the name of the student in this case, "John". If the number does not match at all, I need to print "Not found" as a message. In essence, I understand that I need to sort through the larger tuple, and compare the first element [0] of each smaller tuple, then return the [1] entry of each smaller tuple. Here's what I have in mind:
def get_student_name(reg_num, particulars):
    for i in records:
        if reg_num == particulars[::1][0]:
            return particulars[i][1]
        else:
            print("Not found")
I know I'm wrong, but I can't tell why. I'm not well acquainted with how to sort through a tuple. Can anyone offer some advice, especially in syntax? Thank you very much!

When you write for i in particulars, in each iteration i is an item of the collection and not an index. As such you cannot do particulars[i] (and there is no need, as you already have the item). In addition, remove the else statement so as not to print for every item that doesn't match the condition:
def get_student_name(reg_num, particulars):
    for i in particulars:
        if reg_num == i[0]:
            return i[1]
    print("Not found")
If you want to iterate using indices instead, you could write the loop like this (though it is less clean):
    for i in range(len(particulars)):
        if reg_num == particulars[i][0]:
            return particulars[i][1]
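Since every item is a (registration number, name) pair, the same loop can also unpack the pair directly. This is only a sketch of that variation; the explicit return None at the end is my addition to make the "no match" result visible.

```python
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))

def get_student_name(reg_num, particulars):
    # Unpack each (registration number, name) pair while iterating.
    for number, name in particulars:
        if number == reg_num:
            return name
    # Reached only when no pair matched.
    print("Not found")
    return None

print(get_student_name("S12345", particulars))
```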

Another approach, offered to help you learn new tricks for manipulating Python data structures:
You can turn your tuple of tuples:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
into a dictionary:
>>> pdict = dict(particulars)
>>> pdict
{'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
You can look up the value by supplying the key:
>>> r = 'S23456'
>>> pdict[r]
'Max'
The function:
def get_student_name(reg, s_data):
    try:
        return dict(s_data)[reg]
    except Exception:  # covers a missing key as well as malformed input
        return "Not Found"
The use of try ... except will catch errors and just return Not Found when reg is not in the tuple in the first place. It will also catch the case where the supplied tuple is not a series of PAIRS and thus cannot be converted the way you expect.
You can read more about exceptions: the basics and the docs to learn how to respond differently to different types of error.
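As a rough sketch of responding differently to different errors, the function above could separate "the data is malformed" from "the key is missing". The "Bad data" message is made up for this illustration.

```python
def get_student_name(reg, s_data):
    try:
        lookup = dict(s_data)
    except (TypeError, ValueError):
        # dict() raises one of these when s_data is not a series of pairs.
        return "Bad data"  # hypothetical message, not from the original answer
    try:
        return lookup[reg]
    except KeyError:
        # The conversion worked, but reg is not one of the keys.
        return "Not Found"

particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
print(get_student_name("S23456", particulars))
print(get_student_name("S00000", particulars))
print(get_student_name("S12345", (("S12345", "John", "extra"),)))
```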

for loops in python
Gilad Green already answered your question with a way to fix your code and a quick explanation on for loops.
Here are five loops that do more or less the same thing; I invite you to try them out.
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
for t in particulars:
    print("{} {}".format(t[0], t[1]))
for i in range(len(particulars)):
    print("{}: {} {}".format(i, particulars[i][0], particulars[i][1]))
for i, t in enumerate(particulars):
    print("{}: {} {}".format(i, t[0], t[1]))
for reg_value, student_name in particulars:
    print("{} {}".format(reg_value, student_name))
for i, (reg_value, student_name) in enumerate(particulars):
    print("{}: {} {}".format(i, reg_value, student_name))
Using dictionaries instead of lists
Most importantly, I would like to add that using an unsorted list to store your student records is not the most efficient way.
If you sort the list and maintain it in sorted order, then you can use binary search to search for reg_num much faster than browsing the list one item at a time. Think of this: when you need to look up a word in a dictionary, do you read all words one by one, starting with "aah", "aback", "abaft", "abandon", etc.? No; first, you open the dictionary somewhere in the middle; you compare the words on that page with your word; then you open it again to another page; compare again; every time you do that, the number of candidate pages roughly halves, and so you can find your word among 300,000 other words in very little time.
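To make the binary search idea concrete, here is a small sketch using the standard bisect module. The names records, reg_nums and find_name are made up for this example, and it assumes the records are kept sorted by registration number.

```python
from bisect import bisect_left

# Records kept sorted by registration number (an assumption of this sketch).
records = sorted([("S34567", "Mary"), ("S12345", "John"), ("S23456", "Max")])
reg_nums = [reg for reg, _ in records]

def find_name(reg_num):
    # bisect_left returns the position where reg_num would be inserted,
    # which is the position of reg_num itself when it is present.
    i = bisect_left(reg_nums, reg_num)
    if i < len(reg_nums) and reg_nums[i] == reg_num:
        return records[i][1]
    return None

print(find_name("S23456"))
```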
Instead of using a sorted list with binary search, you could use another data structure, for instance a binary search tree or a hash table.
But, wait! Python already does that very easily!
There is a data structure in python called a dictionary. See the documentation on dictionaries. This structure is perfectly adapted to most situations where you have keys associated to values. Here the key is the reg_number, and the value is the student name.
You can define a dictionary directly:
particulars = {'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
Or you can convert your list of tuples to a dictionary:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
particulars_as_dict = dict(particulars)
Then you can check whether a reg_number is in the dictionary with the keyword in; you can return the student name using square brackets or with the method get:
>>> particulars = {'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
>>> 'S23456' in particulars
True
>>> 'S98765' in particulars
False
>>>
>>> particulars['S23456']
'Max'
>>> particulars.get('S23456')
'Max'
>>> particulars.get('S23456', 'not found')
'Max'
>>>
>>> particulars['S98765']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'S98765'
>>> print(particulars.get('S98765'))
None
>>> particulars.get('S98765', 'not found')
'not found'
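Putting this together for the original assignment, a possible version of the lookup function based on dict.get could look like the sketch below. Building the dictionary once, outside the function, is my assumption about how it would be used.

```python
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))

# Convert once, then reuse the dictionary for every lookup.
particulars_as_dict = dict(particulars)

def get_student_name(reg_num, lookup):
    # get returns the second argument when reg_num is not a key.
    return lookup.get(reg_num, "Not found")

print(get_student_name("S12345", particulars_as_dict))
print(get_student_name("S98765", particulars_as_dict))
```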

Python Set Comprehension Nested in Dict Comprehension

I have a list of tuples, where each tuple contains a string and a number in the form of:
[(string_1, num_a), (string_2, num_b), ...]
The strings are nonunique, and so are the numbers, e.g. (string_1 , num_m) or (string_9 , num_b) are likely to exist in the list.
I'm attempting to create a dictionary with the string as the key and a set of all numbers occurring with that string as the value:
dict = {string_1: {num_a, num_m}, string_2: {num_b}, ...}
I've done this somewhat successfully with the following dictionary comprehension with nested set comprehension:
#st_id_list = [(string_1, num_a), ...]
#st_dict = {string_1: {num_a, num_m}, ...}
st_dict = {
    st[0]: set(
        st_[1]
        for st_ in st_id_list
        if st_[0] == st[0]
    )
    for st in st_id_list
}
There's only one issue: st_id_list is 18,000 items long. This snippet of code takes less than ten seconds to run for a list of 500 tuples, but over twelve minutes to run for the full 18,000 tuples. I have to think this is because I've nested a set comprehension inside a dict comprehension.
Is there a way to avoid this, or a smarter way to do it?

You have a double loop, so you take O(N**2) time to produce your dictionary. For 500 items, 250,000 steps are taken, and for your 18k items, 324 million steps need to be done.
Here is an O(N) loop instead, so 500 steps for your smaller dataset and 18,000 steps for the larger dataset:
st_dict = {}
for st, id in st_id_list:
    st_dict.setdefault(st, set()).add(id)
This uses the dict.setdefault() method to ensure that for a given key (your string values), there is at least an empty set available if the key is missing, then adds the current id value to that set.
You can do the same with a collections.defaultdict() object:
from collections import defaultdict
st_dict = defaultdict(set)
for st, id in st_id_list:
    st_dict[st].add(id)
The defaultdict() uses the factory passed in to set a default value for missing keys.
The disadvantage of the defaultdict approach is that the object continues to produce default values for missing keys after your loop, which can hide application bugs. Use st_dict.default_factory = None to disable the factory explicitly to prevent that.
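A short sketch of that last point, with made-up sample data: once default_factory is set to None, a missing key raises KeyError again instead of silently creating an empty set.

```python
from collections import defaultdict

# Hypothetical sample data in the shape described in the question.
st_id_list = [("string_1", 1), ("string_2", 2), ("string_1", 3)]

st_dict = defaultdict(set)
for st, num in st_id_list:
    st_dict[st].add(num)

# Disable the factory so later typos in keys are not hidden.
st_dict.default_factory = None

try:
    st_dict["missing"]
    raised = False
except KeyError:
    raised = True

print(dict(st_dict), raised)
```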

Why use two loops when you could do it in one, like this:
list_1=[('string_1', 'num_a'), ('string_2', 'num_b'),('string_1' , 'num_m'),('string_9' , 'num_b')]
string_num={}
for i in list_1:
    if i[0] not in string_num:
        string_num[i[0]]={i[1]}
    else:
        string_num[i[0]].add(i[1])
print(string_num)
output:
{'string_9': {'num_b'}, 'string_1': {'num_a', 'num_m'}, 'string_2': {'num_b'}}

Python - Updating value in one dictionary is updating value in all dictionaries

I have a list of dictionaries called lod. All dictionaries have the same keys but different values. I am trying to update one specific value in the list of values for the same key in all the dictionaries.
I am attempting to do it with the following for loop:
for i in range(len(lod)):
    a=lod[i][key][:]
    a[p]=a[p]+lov[i]
    lod[i][key]=a
What's happening is that each dictionary is getting updated len(lod) times, so lod[0][key][p] is supposed to have lov[0] added to it but instead it is getting lov[0]+lov[1]+... added to it.
What am I doing wrong?
Here is how I declared the list of dicts:
lod = [{} for _ in range(len(dataul))]
for j in range(len(dataul)):
    for i in datakl:
        rrdict[str.split(i,',')[0]]=list(str.split(i,',')[1:len(str.split(i,','))])
    lod[j]=rrdict

The problem is in how you created the list of dictionaries. You probably did something like this:
list_of_dicts = [{}] * 20
That's actually the same dict 20 times. Try doing something like this:
list_of_dicts = [{} for _ in range(20)]
Without seeing how you actually created it, this is only an example solution to an example problem.
To know for sure, print this:
[id(x) for x in list_of_dicts]
If you defined it in the * 20 method, the id is the same for each dict. In the list comprehension method, the id is unique.
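A small demonstration of the difference, using a list of 3 instead of 20 to keep the output short:

```python
shared = [{}] * 3                   # three references to ONE dict
distinct = [{} for _ in range(3)]   # three separate dicts

shared[0]["x"] = 1
distinct[0]["x"] = 1

print(shared)    # [{'x': 1}, {'x': 1}, {'x': 1}]
print(distinct)  # [{'x': 1}, {}, {}]

# Count how many different objects each list really holds.
print(len({id(d) for d in shared}), len({id(d) for d in distinct}))  # 1 3
```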

This is where the trouble starts: lod[j] = rrdict. lod itself is created properly with different dictionaries. Unfortunately, afterwards any references to the original dictionaries in the list get overwritten with a reference to rrdict. So in the end, the list contains only references to one single dictionary. Here is a more Pythonic and readable way to solve your problem:
lod = [{} for _ in range(len(dataul))]
for rrdict in lod:
    for line in datakl:
        splt = line.split(',')
        rrdict[splt[0]] = splt[1:]

You created the list of dictionaries correctly, as the other answer notes.
However, when you fill in the individual dictionaries, you overwrite every entry of the list.
Removing noise from your code snippet:
lod = [{} for _ in range(whatever)]
for j in range(whatever):
    # rrdict = lod[j] # Uncomment this as a possible fix.
    for i in range(whatever):
        rrdict[somekey] = somevalue
    lod[j] = rrdict
The assignment on the last line throws away the empty dict that was in lod[j] and inserts a reference to the object represented by rrdict.
I'm not sure what your code does, but see the commented-out line; it might be the fix you are looking for.
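Here is one way the fix could look as a runnable sketch. The contents of datakl and dataul are invented sample values, since the question does not show them; the point is only that a fresh dict is created on every pass of the outer loop.

```python
# Hypothetical sample data; the real datakl and dataul are not shown in the question.
datakl = ["a,1,2", "b,3,4"]
dataul = [0, 1, 2]  # only its length matters here

lod = []
for _ in range(len(dataul)):
    rrdict = {}  # a fresh dict on every pass, so nothing is shared
    for line in datakl:
        parts = line.split(",")
        rrdict[parts[0]] = parts[1:]
    lod.append(rrdict)

# Changing one dictionary no longer affects the others.
lod[0]["a"][0] = "changed"
print(lod[0]["a"], lod[1]["a"])
```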

Converting dict values into a set while preserving the dict

I have a dict like this:
{100002: 'APPLE', 100004: 'BANANA', 100005: 'CARROT'}
I am trying to make my dict have ints for the keys (as it does now) but have sets for the values (rather than strings as it is now.) My goal is to be able to read from a .csv file with one column for the key (an int which is the item id number) and then columns for things like size, shape, and color. I want to add this information into my dict so that only the information for keys already in dict are added.
My goal dict might look like this:
{100002: set(['APPLE','MEDIUM','ROUND','RED']), 100004: set(['BANANA','MEDIUM','LONG','YELLOW']), 100005: set(['CARROT','MEDIUM','LONG','ORANGE'])}
Starting with my dict of just key + string for item name, I tried code like this to read the extra information in from a .csv file:
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in MyDict.keys():
        MyDict[int(spl_line[0])].update(spl_line[1:])
Unfortunately this errors out saying AttributeError: 'str' object has no attribute 'update'. My attempts to change my dictionary's values into sets so that I can then .update them have yielded things like this: {100002: set(['A','P','L','E']), 100004: set(['B','A','N']), 100005: set(['C','A','R','O','T'])}
I want to convert the values to a set so that the string that is currently the value will be the first string in the set rather than breaking up the string into letters and making a set of those letters.
I also tried making the values a set when I create the dict by zipping two lists together but it didn't seem to make any difference. Something like this
MyDict = dict(zip(listofkeys, set(listofnames)))
still makes the whole listofnames list into a set but it doesn't achieve my goal of making each value in MyDict into a set with the corresponding string from listofnames as the first string in the set.
How can I make the values in MyDict into a set so that I can add additional strings to that set without turning the string that is currently the value in the dict into a set of individual letters?
EDIT:
I currently make MyDict by using one function to generate a list of item ids (which are the keys) and another function which looks up those item ids to generate a list of corresponding item names (using a two column .csv file as the data source) and then I zip them together.
ANSWER:
Using the suggestions here I came up with this solution. I found that the section that has set()).update can easily be changed to list()).append to yield a list rather than a set (so that the order is preserved). I also found it easier to update my .csv data input files by adding the column containing names to FileWithTheData.csv, so that I didn't have to mess with making the dict, converting the values to sets, and then adding in more data. My code for this section now looks like this:
MyDict = {}
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.split(',')
    if int(spl_line[0]) in itemidlist: #note that this is the list I was formerly zipping together with a corresponding list of names to make my dict
        MyDict.setdefault(int(spl_line[0]), list()).append(spl_line[1:])
print(MyDict)

Your error is because originally your MyDict variable maps an integer to a string. When you are trying to update it you are treating the value like a set, when it is a string.
You can use a defaultdict for this:
from collections import defaultdict

combined_dict = defaultdict(set)
# first add all the values from MyDict
for key, value in MyDict.items():
    combined_dict[int(key)].add(value)
# then add the values from the file
infile = open('FileWithTheData.csv', 'r')
for line in infile.readlines():
    spl_line = line.strip().split(',')  # strip the trailing newline from the last field
    combined_dict[int(spl_line[0])].update(spl_line[1:])

Your issue is with how you are initializing MyDict; try changing it to the following:
MyDict = dict(zip(listofkeys, [set([name]) for name in listofnames]))
Here is a quick example of the difference:
>>> listofkeys = [100002, 100004, 100005]
>>> listofnames = ['APPLE', 'BANANA', 'CARROT']
>>> dict(zip(listofkeys, set(listofnames)))
{100002: 'CARROT', 100004: 'APPLE', 100005: 'BANANA'}
>>> dict(zip(listofkeys, [set([name]) for name in listofnames]))
{100002: set(['APPLE']), 100004: set(['BANANA']), 100005: set(['CARROT'])}
set(listofnames) is just going to turn your list into a set, and the only effect that might have is to reorder the values as seen above. You actually want to take each string value in your list, and convert it to a one-element set, which is what the list comprehension does.
After you make this change, your current code should work fine, although you can just do the contains check directly on the dictionary instead of explicitly checking the keys (key in MyDict is the same as key in MyDict.keys()).
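As a sketch of the whole flow under these suggestions: the in-memory file below is a made-up stand-in for FileWithTheData.csv, and csv.reader is used instead of line.split(',') so the trailing newline does not end up in the last field.

```python
import csv
import io

# Hypothetical stand-in for FileWithTheData.csv.
infile = io.StringIO(
    "100002,MEDIUM,ROUND,RED\n"
    "100004,MEDIUM,LONG,YELLOW\n"
    "999999,SMALL,ROUND,GREEN\n"
)

listofkeys = [100002, 100004, 100005]
listofnames = ["APPLE", "BANANA", "CARROT"]

# Each value starts as a one-element set holding the item name.
MyDict = {key: {name} for key, name in zip(listofkeys, listofnames)}

for row in csv.reader(infile):
    key = int(row[0])
    if key in MyDict:  # direct membership test, no .keys() needed
        MyDict[key].update(row[1:])

print(MyDict)
```

Rows whose id is not already a key (999999 here) are skipped, which matches the goal stated in the question.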

Adding a new key to a nested dictionary in python

I need to add a key with a value that increases by one for every item in the nested dictionary. I have been trying to use the dict['key']='value' syntax but can't get it to work for a nested dictionary. I'm sure it's very simple.
My Dictionary:
mydict={'a':{'result':[{'key1':'value1','key2':'value2'},
                       {'key1':'value3','key2':'value4'}]}}
This is the code that will add the key to the main part of the dictionary:
for x in range(len(mydict)):
    number = 1+x
    str(number)
    mydict[u'index']=number
print(mydict)
#out: {u'index':u'1',u'a':{u'result':[...]}}
I want to add the new key and value to the small dictionaries inside the square brackets:
{'a':{'result':[{'key1':'value1',...,'index':'number'}]}}
If I try adding more layers to the last line of the for loop I get a traceback error:
Traceback (most recent call last):
  File "C:\Python27\program.py", line 34, in <module>
    main()
  File "C:\Python27\program.py", line 23, in main
    mydict['a']['result']['index']=number
TypeError: list indices must be integers, not unicode
I've tried various different ways of listing the nested items but no joy. Can anyone help me out here?

The problem is that mydict is not simply a collection of nested dictionaries. It contains a list as well. Breaking up the definition helps clarify the internal structure:
dictlist = [{'key1':'value1','key2':'value2'},
            {'key1':'value3','key2':'value4'}]
resultdict = {'result':dictlist}
mydict = {'a':resultdict}
So to access the innermost values, we have to work from the outside in, one level at a time:
mydict['a']
returns resultdict. Then this:
mydict['a']['result']
returns dictlist. Then this:
mydict['a']['result'][0]
returns the first item in dictlist. Finally, this:
mydict['a']['result'][0]['key1']
returns 'value1'
So now you just have to amend your for loop to iterate correctly over mydict. There are probably better ways, but here's a first approach:
for inner_dict in mydict['a']['result']: # remember that this returns `dictlist`
    for key in inner_dict:
        do_something(inner_dict, key)
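For the specific goal in the question, a minimal sketch could use enumerate to number the inner dictionaries. Starting the count at 1 is an assumption based on the number = 1+x line in the question.

```python
mydict = {'a': {'result': [{'key1': 'value1', 'key2': 'value2'},
                           {'key1': 'value3', 'key2': 'value4'}]}}

# mydict['a']['result'] is a list, so iterate over it and add the
# new key to each inner dictionary in turn.
for number, inner_dict in enumerate(mydict['a']['result'], start=1):
    inner_dict['index'] = number

print(mydict)
```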

I'm not fully sure what you're trying to do, but I think itertools.count would be able to help here.
>>> import itertools
>>> c = itertools.count()
>>> next(c)
0
>>> next(c)
1
>>> next(c)
2
>>> next(c)
3
... and so on.
Using this, you can keep incrementing the value that you want to use in your dicts
Hope this helps
