How to create a dictionary whose values are sets?

I'm working on an exercise that requires me to build two dictionaries, one whose keys are country names, and the values are the GDP. This part works fine.
The second dictionary is where I'm lost, as the keys are supposed to be the letters A-Z and the values are sets of country names. I tried using a for loop (marked with a comment below), and that is where the issue lies.
If the user enters a string with only one letter (like A), the program should print all the countries that begin with that letter. When you run the program, however, it only prints out one country for each letter.
The text file contains 228 lines, e.g.:
1:Qatar:98900
2:Liechtenstein:89400
3:Luxembourg:80600
4:Bermuda:69900
5:Singapore:59700
6:Jersey:57000
etc.
And here's my code.
initials = []
countries=[]
incomes=[]
dictionary={}
dictionary_2={}
keywordFile = open("raw.txt", "r")
for line in keywordFile:
    line = line.upper()
    line = line.strip("\n")
    line = line.split(":")
    initials.append(line[1][0]) # first letter of second element
    countries.append(line[1])
    incomes.append(line[2])
for i in range(0,len(countries)):
    dictionary[countries[i]] = incomes[i]
# this for loop should spit out 248 values (one for each country), where the key is the initial and the value is the country name. However, it only spits out 26 values (one country for each letter in the alphabet)
for i in range(0,len(countries)):
    dictionary_2[initials[i]] = countries[i]
print(dictionary_2)
while True:
    inputS = str(input('Enter an initial or a country name.'))
    if inputS in dictionary:
        value = dictionary.get(inputS, "")
        print("The per capita income of {} is {}.".format((inputS.title()), value ))
    elif inputS in dictionary_2:
        value = dictionary_2.get(inputS)
        print("The countries that begin with the letter {} are: {}.".format(inputS, (value.title())))
    elif inputS.lower() in "quit":
        break
    else:
        print("Does not exit.")
print("End of session.")
I'd appreciate any input leading me in the right direction.

Use defaultdict to make sure each value of your initials dict is a set, and then use the add method. If you just use = you'll overwrite the value stored under that initial each time. defaultdict is an easier way of writing an expression like:
if initial in d:
    d[initial].add(country)
else:
    d[initial] = {country}
See the full working example below. Note that I'm using enumerate instead of range(0,len(countries)), which I'd also recommend:
#!/usr/bin/env python3
from collections import defaultdict
initials, countries, incomes = [],[],[]
dict1 = {}
dict2 = defaultdict(set)
# Inline sample data standing in for the contents of raw.txt
keywordFile = """
1:Qatar:98900
2:Liechtenstein:89400
3:Luxembourg:80600
4:Bermuda:69900
5:Singapore:59700
6:Jersey:57000
""".strip().split("\n")
for line in keywordFile:
    line = line.upper().strip("\n").split(":")
    initials.append(line[1][0])
    countries.append(line[1])
    incomes.append(line[2])
for i,country in enumerate(countries):
    dict1[country] = incomes[i]
    dict2[initials[i]].add(country)
print(dict2["L"])
Result:
{'LUXEMBOURG', 'LIECHTENSTEIN'}
see: https://docs.python.org/3/library/collections.html#collections.defaultdict
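To see that defaultdict(set) is only a shortcut for the explicit membership test, here is a minimal sketch that builds the mapping both ways and checks that the results match. The short list of names and the variable names by_hand and by_default are illustrative assumptions, not part of the original exercise.

```python
from collections import defaultdict

# Illustrative sample data (upper-cased, as in the answer above).
countries = ["QATAR", "LIECHTENSTEIN", "LUXEMBOURG", "BERMUDA"]

# Explicit version: test for the key before adding.
by_hand = {}
for country in countries:
    initial = country[0]
    if initial in by_hand:
        by_hand[initial].add(country)
    else:
        by_hand[initial] = {country}

# defaultdict version: a missing key is created as an empty set on first access.
by_default = defaultdict(set)
for country in countries:
    by_default[country[0]].add(country)

print(by_hand == dict(by_default))
print(sorted(by_default["L"]))
```

Converting with dict(by_default) gives a plain dictionary, which is useful if later lookups of missing initials should raise KeyError instead of silently creating empty sets.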

The values of dictionary2 need to be able to hold several countries. One option is to use a list as each value. In your code, you overwrite the value for a key whenever a new country with the same initial comes along.
Moreover, you can use the setdefault method of the dictionary type. This code:
dictionary2 = {}
for country in countries:
    dictionary2.setdefault(country[0], []).append(country)
should be enough to create the second dictionary elegantly.
setdefault either returns the value for the key (here, the first letter of the country name) if it already exists, or inserts the key into the dictionary with the given default, an empty list [], and returns that default.
Edit
If you want your values to be sets (for faster membership tests), you can use the following lines:
dictionary2 = {}
for country in countries:
    dictionary2.setdefault(country[0], set()).add(country)
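The return behaviour of setdefault can be checked in isolation. This is a small sketch with made-up names (groups, first, second) showing that the second call hands back the set stored by the first call rather than the fresh default.

```python
groups = {}

# First call: "L" is missing, so the empty set is stored and returned.
first = groups.setdefault("L", set())
first.add("Liechtenstein")

# Second call: "L" exists, so the stored set is returned and the new default is discarded.
second = groups.setdefault("L", set())
second.add("Luxembourg")

print(first is second)
print(sorted(groups["L"]))
```

One consequence is that the default argument (here set()) is constructed on every call even when it is not used, which is usually negligible for empty containers.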

Here's a link to a live functioning version of the OP's code online.
The keys in Python dict objects are unique. There can only ever be one 'L' key in a single dict. What happens in your code is that first the key/value pair 'L':'Liechtenstein' is inserted into dictionary_2. However, in a subsequent iteration of the for loop, 'L':'Liechtenstein' is overwritten by 'L':'Luxembourg'. This kind of overwriting is sometimes referred to as "clobbering".
Fix
One way to get the result that you seem to be after would be to rewrite that for loop:
for i in range(0,len(countries)):
    dictionary_2[initials[i]] = dictionary_2.get(initials[i], set()) | {countries[i]}
print(dictionary_2)
Also, you have to rewrite the related elif statement beneath that:
elif inputS in dictionary_2:
    titles = ', '.join([v.title() for v in dictionary_2[inputS]])
    print("The countries that begin with the letter {} are: {}.".format(inputS, titles))
Explanation
Here's a complete explanation of the dictionary_2[initials[i]] = dictionary_2.get(initials[i], set()) | {countries[i]} line above:
dictionary_2.get(initials[i], set())
If initials[i] is a key in dictionary_2, this will return the associated value. If initials[i] is not in the dictionary, it will return the empty set set() instead.
{countries[i]}
This creates a new set with a single member in it, countries[i].
dictionary_2.get(initials[i], set()) | {countries[i]}
The | operator computes the union of two sets and returns it as a new set.
dictionary_2[initials[i]] = ...
The right hand side of the line either creates a new set, or adds to an existing one. This bit of code assigns that newly created/expanded set back to dictionary_2.
Notes
The above code sets the values of dictionary_2 as sets. If you want to use list values, use this version of the for loop instead:
for i in range(0,len(countries)):
    dictionary_2[initials[i]] = dictionary_2.get(initials[i], []) + [countries[i]]
print(dictionary_2)
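Both the | and the + versions build a new set or list on every iteration instead of changing the stored one. A minimal sketch of that difference, using a made-up letters dictionary, compares the union with the in-place add method:

```python
letters = {"L": {"Liechtenstein"}}
original = letters["L"]

# Union builds a new set; the set previously stored under "L" is left unchanged.
letters["L"] = letters.get("L", set()) | {"Luxembourg"}
print(letters["L"] is original)
print(sorted(original))
print(sorted(letters["L"]))

# add() mutates the stored set in place instead of creating a new one.
letters["L"].add("Libya")
print(sorted(letters["L"]))
```

For a few hundred countries the extra copies do not matter; for large inputs the in-place methods (add, append) avoid repeatedly copying the growing collection.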

You're very close to what you're looking for. You could populate both dictionaries while looping over the contents of raw.txt, or read the whole file first and then populate the dictionaries. Either way, dict comprehensions and groupby let you do it in a few one-liners. Here's an example:
country_per_capita_dict = {}
letter_countries_dict = {}
keywordFile = [line.strip() for line in open('raw.txt' ,'r').readlines()]
You now have a list of all lines in the keywordFile as follows:
['1:Qatar:98900', '2:Liechtenstein:89400', '3:Luxembourg:80600', '4:Bermuda:69900', '5:Singapore:59700', '6:Jersey:57000', '7:Libya:1000', '8:Sri Lanka:5000']
As you loop over the items, you can split(':') and use the [1] and [2] index values as required.
You could use dictionary comprehension as follows:
country_per_capita_dict = {entry.split(':')[1] : entry.split(':')[2] for entry in keywordFile}
Which results in:
{'Qatar': '98900', 'Liechtenstein': '89400', 'Luxembourg': '80600', 'Bermuda': '69900', 'Singapore': '59700', 'Jersey': '57000', 'Libya': '1000', 'Sri Lanka': '5000'}
Similarly using groupby from itertools you can obtain:
from itertools import groupby
country_list = list(country_per_capita_dict.keys())  # keys() is a view in Python 3, so copy it into a list
country_list.sort()
letter_countries_dict = {k: list(g) for k,g in groupby(country_list, key=lambda x:x[0]) }
Which results in the required dictionary of initial : [list of countries]
{'B': ['Bermuda'], 'J': ['Jersey'], 'L': ['Libya', 'Liechtenstein', 'Luxembourg'], 'Q': ['Qatar'], 'S': ['Singapore', 'Sri Lanka']}
A complete example is as follows:
from itertools import groupby
country_per_capita_dict = {}
letter_countries_dict = {}
keywordFile = [line.strip() for line in open('raw.txt' ,'r').readlines()]
country_per_capita_dict = {entry.split(':')[1] : entry.split(':')[2] for entry in keywordFile}
country_list = list(country_per_capita_dict.keys())  # keys() is a view in Python 3, so copy it into a list
country_list.sort()
letter_countries_dict = {k: list(g) for k,g in groupby(country_list, key=lambda x:x[0]) }
print (country_per_capita_dict)
print (letter_countries_dict)
Explanation:
The line:
country_per_capita_dict = {entry.split(':')[1] : entry.split(':')[2] for entry in keywordFile}
loops over the following list
['1:Qatar:98900', '2:Liechtenstein:89400', '3:Luxembourg:80600', '4:Bermuda:69900', '5:Singapore:59700', '6:Jersey:57000', '7:Libya:1000', '8:Sri Lanka:5000'] and splits each entry in the list by :
It then takes the value at index [1] and [2] which are the country names and the per capita value and makes them into a dictionary.
country_list = list(country_per_capita_dict.keys())
country_list.sort()
These lines extract the names of all the countries from the dictionary created earlier into a list and sort them alphabetically, which groupby needs in order to work correctly.
letter_countries_dict = {k: list(g) for k,g in groupby(country_list, key=lambda x:x[0]) }
groupby walks the sorted list and puts consecutive names with the same first letter (the key computed by lambda x: x[0]) into one group; list(g) turns each group into a list.
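The sorting step matters because groupby only merges neighbouring items that share a key. The following sketch, with a made-up three-name list and a helper called first_letter, compares unsorted and sorted input:

```python
from itertools import groupby

names = ["Luxembourg", "Bermuda", "Liechtenstein"]

def first_letter(name):
    return name[0]

# Unsorted input: "L" appears in two separate runs, and the dict keeps only the last one.
unsorted_groups = {k: list(g) for k, g in groupby(names, key=first_letter)}

# Sorted input: all names with the same initial are adjacent, so each key is complete.
sorted_groups = {k: list(g) for k, g in groupby(sorted(names), key=first_letter)}

print(unsorted_groups)
print(sorted_groups)
```

With unsorted input no error is raised; the earlier "L" group is silently replaced by the later one, which is why the mistake is easy to miss.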

Related

Function that makes dict from string but swaps keys and values?

I'm trying to make a function that takes in list of strings as an input like the one listed below:
swap_values_dict(['Summons: Bahamut, Shiva, Chocomog',
                  'Enemies: Bahamut, Shiva, Cactaur'])
and creates a dictionary from them using the words after the colons as keys and the words before the colons as values. I need to clarify that, at this point, there are only two strings in the list. I plan to split the strings into sublists and, from there, try and assign them to a dictionary.
The output should look like
{'Bahamut': ['Summons','Enemies'],'Shiva':['Summons','Enemies'],'Chocomog':['Summons'],'Cactaur':['Enemies']}
As you can see, the words after the colon in the original list have become keys while the words before the colon (categories) have become the values. If one of the values appears in both lists, it is assigned two values in the final dictionary. I would like to be able to make similar dictionaries out of many lists of different sizes, not just ones that contain two strings. Could this be done without list comprehension and only for loops and if statements?
What I've Tried So Far
title_list = []
for i in range(len(mobs)):#counts amount of strings in list
    titles = (mobs[i].split(":"))[0] #gets titles from list using split
    title_list.append(titles)
title_list
This code returns ['Summons', 'Enemies'], which isn't the result I want, but I think it could help me write the function. I had planned on separating the keys and values into separate lists and then zipping them together afterwards as a dictionary.

Try:
def swap_values_dict(lst):
    tmp = {}
    for s in lst:
        k, v = map(str.strip, s.split(":"))
        tmp[k] = list(map(str.strip, v.split(",")))
    out = {}
    for k, v in tmp.items():
        for i in v:
            out.setdefault(i, []).append(k)
    return out
print(
    swap_values_dict(
        [
            "Summons: Bahamut, Shiva, Chocomog",
            "Enemies: Bahamut, Shiva, Cactaur",
        ]
    )
)
Prints:
{
"Bahamut": ["Summons", "Enemies"],
"Shiva": ["Summons", "Enemies"],
"Chocomog": ["Summons"],
"Cactaur": ["Enemies"],
}

I'd use a defaultdict. It saves you the trouble of manually checking if a key exists in your dictionary and constructing a new empty list, making for a rather concise function:
from collections import defaultdict
def swap_values_dict(mobs):
    result = defaultdict(list)
    for elem in mobs:
        role, members = elem.split(': ')
        for m in members.split(', '):
            result[m].append(role)
    return result
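Since the question asks whether this can be done with only for loops and if statements, here is a sketch of that variant. The function name swap_values_dict_plain is made up for illustration, and the code assumes each input string contains exactly one colon.

```python
def swap_values_dict_plain(strings):
    result = {}
    for entry in strings:
        # Split "Category: a, b, c" into the category and the comma-separated names.
        category, names = entry.split(":")
        category = category.strip()
        for name in names.split(","):
            name = name.strip()
            # Create the list the first time a name is seen.
            if name not in result:
                result[name] = []
            result[name].append(category)
    return result

mobs = ["Summons: Bahamut, Shiva, Chocomog", "Enemies: Bahamut, Shiva, Cactaur"]
swapped = swap_values_dict_plain(mobs)
print(swapped)
```

Because nothing depends on the list having two entries, the same function handles any number of input strings.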

How to sort a list of tuples based on the 2nd value without hardcoding it

I have a list of tuples.
[('first_title', 'first_content','notes'),('second_title','second_content','Lists'), ('third_title', 'third_content','Books'), ('fourth_title', 'fourth_content','Chores')]
and I want to put each tuple into a list that holds only the tuples sharing the same value at index 2 (counting from 0), without hardcoding either that value or the length of the list. So the result would look like...
notes = [('first_title', 'first_content', 'notes')]
Lists = [('second_title', 'second_content', 'Lists')]
Books = [('third_title', 'third_content', 'Books')]
Chores = [('fourth_title', 'fourth_content', 'Chores')]
so I can't really do...
if x[2] == 'Lists'
because it's hardcoded.
If there was another tuple that had the 2nd element (starting at 0) equal to 'Books' then it would be in the Books list for example.

You want to create a dictionary of lists where the third value in each tuple is used as key.
You can use a defaultdict to create a new list automatically when a key is inserted for the first time:
from collections import defaultdict
result = defaultdict(list)
for item in list_of_tuples:
    key = item[2]
    result[key].append(item)
Now you can use result['notes'], result['Lists'], etc.
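As a quick check of the grouping, the sketch below runs the same loop on a small sample. The tuple named 'fifth_title' is a hypothetical extra entry added only to show two items landing under the same key.

```python
from collections import defaultdict

list_of_tuples = [
    ("first_title", "first_content", "notes"),
    ("second_title", "second_content", "Lists"),
    ("third_title", "third_content", "Books"),
    ("fifth_title", "fifth_content", "Books"),  # hypothetical extra entry
]

result = defaultdict(list)
for item in list_of_tuples:
    # The value at index 2 decides which list the tuple goes into.
    result[item[2]].append(item)

for category, items in result.items():
    print(category, len(items))
```

No category name appears in the code, so new categories in the data need no code changes.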

Seems like you are looking for filter?
This would allow you to reuse some of the code like this (throw in a selector if you want to be even more flexible but not required):
inp = [('first_title', 'first_content','notes'),('second_title','second_content','Lists'), ('third_title', 'third_content','Books'), ('fourth_title', 'fourth_content','Chores')]
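def get_by(category, l, selector=lambda x: x[2]):
    # filter takes the function first and the iterable second; list() materializes the result in Python 3
    return list(filter(lambda x: selector(x) == category, l))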
I can then get the categories:
get_by('Books', inp)
Or I can change the selector and filter on some other criteria:
get_by('first_title', inp, selector=lambda x: x[0])

How can I combine separate dictionary outputs from a function in one dictionary?

For our python project we have to solve multiple questions. We are however stuck at this one:
"Write a function that, given a FASTA file name, returns a dictionary with the sequence IDs as keys, and a tuple as value. The value denotes the minimum and maximum molecular weight for the sequence (sequences can be ambiguous)."
import collections
from Bio import Seq
from itertools import product
def ListMW(file_name):
    seq_records = SeqIO.parse(file_name, 'fasta',alphabet=generic_dna)
    for record in seq_records:
        dictionary = Seq.IUPAC.IUPACData.ambiguous_dna_values
        result = []
        for i in product(*[dictionary[j] for j in record]):
            result.append("".join(i))
        molw = []
        for sequence in result:
            molw.append(SeqUtils.molecular_weight(sequence))
        tuple= (min(molw),max(molw))
        if min(molw)==max(molw):
            dict={record.id:molw}
        else:
            dict={record.id:(min(molw), max(molw))}
        print(dict)
Using this code we manage to get this output:
{'seq_7009': (6236.9764, 6367.049999999999)}
{'seq_418': (3716.3642000000004, 3796.4124000000006)}
{'seq_9143_unamb': [4631.958999999999]}
{'seq_2888': (5219.3359, 5365.4089)}
{'seq_1101': (4287.7417, 4422.8254)}
{'seq_107': (5825.695099999999, 5972.8073)}
{'seq_6946': (5179.3118, 5364.420900000001)}
{'seq_6162': (5531.503199999999, 5645.577399999999)}
{'seq_504': (4556.920899999999, 4631.959)}
{'seq_3535': (3396.1715999999997, 3446.1969999999997)}
{'seq_4077': (4551.9108, 4754.0073)}
{'seq_1626_unamb': [3724.3894999999998]}
As you can see, this is not one dictionary but multiple dictionaries, one under the other. Is there any way we can change our code, or add an extra command, to get it in this format:
{'seq_7009': (6236.9764, 6367.049999999999),
'seq_418': (3716.3642000000004, 3796.4124000000006),
'seq_9143_unamb': (4631.958999999999),
'seq_2888': (5219.3359, 5365.4089),
'seq_1101': (4287.7417, 4422.8254),
'seq_107': (5825.695099999999, 5972.8073),
'seq_6946': (5179.3118, 5364.420900000001),
'seq_6162': (5531.503199999999, 5645.577399999999),
'seq_504': (4556.920899999999, 4631.959),
'seq_3535': (3396.1715999999997, 3446.1969999999997),
'seq_4077': (4551.9108, 4754.0073),
'seq_1626_unamb': (3724.3894999999998)}
Or in some way make clear that it should use the sequence ID as key and the molecular weight as value for one dictionary?

Create a dictionary right before your for loop, then update it inside the loop, such as:
import collections
from Bio import Seq
from itertools import product
def ListMW(file_name):
    seq_records = SeqIO.parse(file_name, 'fasta',alphabet=generic_dna)
    retDict = {}
    for record in seq_records:
        dictionary = Seq.IUPAC.IUPACData.ambiguous_dna_values
        result = []
        for i in product(*[dictionary[j] for j in record]):
            result.append("".join(i))
        molw = []
        for sequence in result:
            molw.append(SeqUtils.molecular_weight(sequence))
        tuple= (min(molw),max(molw))
        if min(molw)==max(molw):
            retDict[record.id] = molw
        else:
            retDict[record.id] = (min(molw), max(molw))
    # instead of printing inside the loop, return the dictionary and print it at the end of your function / script
    # print(dict)
    return retDict
Right now, you're creating a new dict on each iteration of your loop and printing it, so it is normal for your code to print lots and lots of dicts.

You're creating a dictionary with one entry at each iteration.
You want to:
define a dict variable before your loop (better to call it dct to avoid shadowing the built-in type name)
replace the assignment to dict in the loop with an update of that variable
So before the loop:
dct = {}
and in the loop (instead of your if + dict = code), use a conditional expression, with min and max computed only once:
minval = min(molw)
maxval = max(molw)
dct[record.id] = molw if minval == maxval else (minval,maxval)
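The same accumulate-in-one-dict pattern can be tried without Biopython installed. The following is only a sketch under stated assumptions: AMBIGUOUS is a partial ambiguity table and TOY_WEIGHT holds made-up per-base weights, both hypothetical stand-ins for the real Biopython data and for a real molecular-weight function. Unlike the code above, it always stores a (min, max) tuple, even for unambiguous sequences.

```python
from itertools import product

# Toy stand-ins, NOT Biopython data: a partial ambiguity table and made-up weights.
AMBIGUOUS = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT"}
TOY_WEIGHT = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}

def toy_weight(sequence):
    # Hypothetical replacement for a real molecular-weight function.
    return sum(TOY_WEIGHT[base] for base in sequence)

def weight_ranges(records):
    dct = {}  # created once, before the loop
    for seq_id, sequence in records:
        # Expand every ambiguous position into all concrete sequences.
        expansions = ("".join(p) for p in product(*(AMBIGUOUS[b] for b in sequence)))
        weights = [toy_weight(s) for s in expansions]
        dct[seq_id] = (min(weights), max(weights))  # one entry per record
    return dct

ranges = weight_ranges([("seq_1", "ARC"), ("seq_2_unamb", "GT")])
print(ranges)
```

Returning the dictionary and printing it once at the call site gives a single combined dictionary rather than one per record.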

Adding items to sets in a dictionary

I have a list of dictionaries that map different IDs to a central ID. I also have a document in which these different IDs are associated with terms. I have written a function that should produce a dictionary keyed by the central ID. The goFile is the document: its first column holds an ID and its second a GO term. The mappingList is a list of dictionaries in which each ID from the goFile is mapped to a main ID.
My expected output is a dictionary with a main ID as a key and a set with the go terms associated with it as value.
def parseGO(mappingList, goFile):
    # open the file
    file = open(goFile)
    # this will be the dictionary that this function returns
    # entries will have as a key an Ensembl ID
    # and the value will be a set of GO terms
    GOdict = {}
    GOset = set()
    for line in file:
        splitline = line.split(' ')
        GO_term = splitline[1]
        value_ID = splitline[0]
        for dict in mappingList:
            if value_ID in dict:
                ENSB_term = dict[value_ID]
                #my best try
                for dict in mappingList:
                    for key in GOdict.keys():
                        if value_ID in dict and key == dict[value_ID]:
                            GOdict[ENSB_term].add(GO_term)
                GOdict[ENSB_term] = GOset
    return GOdict
My problem is that I now have to add, under the central ID in my GOdict, the terms that the document associates with the different IDs. To avoid duplicates I use a set (GOset). How do I do it? All my attempts end up with all the terms mapped to all the main IDs.
Some sample:
mappingList = [{'1234': 'mainID1', '456': 'mainID2'}, {'789': 'mainID2'}]
goFile:
1234 GOTERM1
1234 GOTERM2
456 GOTERM1
456 GOTERM3
789 GOTERM1
expected output:
GOdict = {'mainID1': set([GOTERM1, GOTERM2]), 'mainID2': set([GOTERM1, GOTERM3])}

First off, you shouldn't use the variable name 'dict', as it shadows the built-in dict class, and will cause you problems at some point.
The following should work for you:
from collections import defaultdict
def parse_go(mapping_list, go_file):
    go_dict = defaultdict(set)
    with open(go_file) as f: # 'with' closes the file automatically
        for line in f:
            # Feel free to change the split behaviour to work better for you.
            (value_id, go_term) = line.split()
            for map_dict in mapping_list:
                if value_id in map_dict:
                    go_dict[map_dict[value_id]].add(go_term)
    return go_dict
The code is fairly straightforward, but here's a breakdown anyway.
We use a default dictionary instead of a normal dictionary so we can eliminate all that if in or setdefault() boilerplate.
For each line in the file, we check if the first item (value_id) is a key in any of the mapping dictionaries, and if so, add the line's second item (go_term) to the set stored under the main ID that value_id maps to.
EDIT: Request for doing this without defaultdict(). Assume that go_dict is just a normal dictionary (go_dict = {}), your for loop would look like:
for map_dict in mapping_list:
    if value_id in map_dict:
        esnb_entry = go_dict.setdefault(map_dict[value_id], set())
        esnb_entry.add(go_term)
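To try the approach on the sample data from the question without creating a file, the sketch below uses a variant that accepts any iterable of lines. The name parse_go_lines and the use of io.StringIO are assumptions made for this illustration; the grouping logic is the same as in parse_go above.

```python
import io
from collections import defaultdict

def parse_go_lines(mapping_list, lines):
    # Same logic as parse_go, but reads from any iterable of lines.
    go_dict = defaultdict(set)
    for line in lines:
        value_id, go_term = line.split()
        for map_dict in mapping_list:
            if value_id in map_dict:
                go_dict[map_dict[value_id]].add(go_term)
    return go_dict

mapping_list = [{"1234": "mainID1", "456": "mainID2"}, {"789": "mainID2"}]
go_file = io.StringIO("1234 GOTERM1\n1234 GOTERM2\n456 GOTERM1\n456 GOTERM3\n789 GOTERM1\n")

result = parse_go_lines(mapping_list, go_file)
print({key: sorted(terms) for key, terms in result.items()})
```

GOTERM1 reaches mainID2 through both 456 and 789, and the set keeps it only once, which matches the expected output in the question.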

Building Nested dictionary in Python reading in line by line from file

The way I go about nested dictionary is this:
dicty = dict()
tmp = dict()
tmp["a"] = 1
tmp["b"] = 2
dicty["A"] = tmp
dicty == {"A" : {"a" : 1, "b" : 2}}
The problem starts when I try to implement this on a big file, reading in line by line.
This is printing the content per line in a list:
['proA', 'macbook', '0.666667']
['proA', 'smart', '0.666667']
['proA', 'ssd', '0.666667']
['FrontPage', 'frontpage', '0.710145']
['FrontPage', 'troubleshooting', '0.971014']
I would like to end up with a nested dictionary (ignore decimals):
{'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'},
'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'}}
As I am reading in line by line, I have to check whether or not the first word is still found in the file (they are all grouped), before I add it as a complete dict to the higher dict.
This is my implementation:
def doubleDict(filename):
    dicty = dict()
    with open(filename, "r") as f:
        row = 0
        tmp = dict()
        oldword = ""
        for line in f:
            values = line.rstrip().split(" ")
            print(values)
            if oldword == values[0]:
                tmp[values[1]] = values[2]
            else:
                if oldword is not "":
                    dicty[oldword] = tmp
                tmp.clear()
                oldword = values[0]
                tmp[values[1]] = values[2]
            row += 1
            if row % 25 == 0:
                print(dicty)
                break #print(row)
    return(dicty)
I would actually like to have this in pandas, but for now I would be happy if this would work as a dict. For some reason after reading in just the first 5 lines, I end up with:
{'proA': {'frontpage': '0.710145', 'troubleshooting': '0.971014'}},
which is clearly incorrect. What is wrong?

Use a collections.defaultdict() object to auto-instantiate nested dictionaries:
from collections import defaultdict
def doubleDict(filename):
    dicty = defaultdict(dict)
    with open(filename, "r") as f:
        # count lines from 1 so the check below fires after 25 lines, not on the first one
        for i, line in enumerate(f, 1):
            outer, inner, value = line.split()
            dicty[outer][inner] = value
            if i % 25 == 0:
                print(dicty)
                break #print(row)
    return(dicty)
I used enumerate() to generate the line count here; much simpler than keeping a separate counter going.
Even without a defaultdict, you can let the outer dictionary keep the reference to the nested dictionary, and retrieve it again by using values[0]; there is no need to keep the temp reference around:
>>> dicty = {}
>>> dicty['A'] = {}
>>> dicty['A']['a'] = 1
>>> dicty['A']['b'] = 2
>>> dicty
{'A': {'a': 1, 'b': 2}}
All the defaultdict then does is keep us from having to test if we already created that nested dictionary. Instead of:
if outer not in dicty:
    dicty[outer] = {}
dicty[outer][inner] = value
we simply omit the if test as defaultdict will create a new dictionary for us if the key was not yet present.
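Here is a minimal sketch of the defaultdict(dict) approach run on the five sample rows from the question. The rows are held in a list instead of a file, and the final conversion to a plain dict is an optional extra step, not part of the answer above.

```python
from collections import defaultdict

rows = [
    "proA macbook 0.666667",
    "proA smart 0.666667",
    "proA ssd 0.666667",
    "FrontPage frontpage 0.710145",
    "FrontPage troubleshooting 0.971014",
]

dicty = defaultdict(dict)
for row in rows:
    outer, inner, value = row.split()
    # The inner dict for a new outer key is created automatically on first access.
    dicty[outer][inner] = value

# Convert to a plain dict if the auto-creating behaviour is no longer wanted.
nested = dict(dicty)
print(nested)
```

Unlike the original loop, this does not depend on rows with the same first word being adjacent in the file.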

While this isn't the ideal way to do things, you're pretty close to making it work.
Your main problem is that you're reusing the same tmp dictionary. After you insert it into dicty under the first key, you then clear it and start filling it with the new values. Replace tmp.clear() with tmp = {} to fix that, so you have a different dictionary for each key, instead of the same one for all keys.
Your second problem is that you're never storing the last tmp value in the dictionary when you reach the end, so add another dicty[oldword] = tmp after the for loop.
Your third problem is that you're checking if oldword is not "":. That may be true even if it's an empty string, because you're comparing identity, not equality. Just change that to if oldword:. (This one, you'll usually get away with, because small strings are usually interned and will usually share identity… but you shouldn't count on that.)
If you fix all three of those, you get this:
{'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'},
'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'}}
I'm not sure how to turn this into the format you claim to want, because that format isn't even a valid dictionary. But hopefully this gets you close.
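The first problem, reusing one tmp dictionary, can be reproduced in isolation. This sketch uses made-up names (outer, outer2, tmp2) and compares tmp.clear() with rebinding tmp to a new dictionary:

```python
outer = {}
tmp = {"macbook": "0.666667"}
outer["proA"] = tmp          # stores a reference to tmp, not a copy

tmp.clear()                  # empties the very dictionary that outer["proA"] refers to
tmp["frontpage"] = "0.710145"
print(outer)

# Rebinding the name to a new dictionary leaves the stored one untouched.
outer2 = {}
tmp2 = {"macbook": "0.666667"}
outer2["proA"] = tmp2
tmp2 = {}
tmp2["frontpage"] = "0.710145"
print(outer2)
```

The first print shows the 'proA' entry holding the FrontPage data, which is the same symptom as the incorrect output reported in the question.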
There are two simpler ways to do it:
Group the values with, e.g., itertools.groupby, then transform each group into a dict and insert it all in one step. This, like your existing code, requires that the input already be batched by values[0].
Use the dictionary as a dictionary. You can look up each key as it comes in and add to the value if found, create a new one if not. A defaultdict or the setdefault method will make this concise, but even if you don't know about those, it's pretty simple to write it out explicitly, and it'll still be less verbose than what you have now.
The second version is already explained very nicely in Martijn Pieters's answer.
The first can be written like this:
import itertools
import operator
def doubleDict(filename):
    with open(filename, "r") as f:
        rows = (line.rstrip().split(" ") for line in f)
        return {k: {values[1]: values[2] for values in g}
                for k, g in itertools.groupby(rows, key=operator.itemgetter(0))}
Of course that doesn't print out the dict so far after every 25 rows, but that's easy to add by turning the comprehension into an explicit loop (and ideally using enumerate instead of keeping an explicit row counter).
