Python - Make a Dictionary with a variable number of keys/values

Okay, the title is a bit confusing, but let me elaborate.
Some methods in Java have a useful thing called varargs that allow for varying amounts of arguments in methods. It looks something like this:
void method(String... args) {
    for (String arg : args) {
        // TODO
    }
}
I am trying to learn Python through a course, and one of the assignments asks me to take a CSV file with a varying number of strings at the top that represent repeating sequences of DNA in a strand. Here's an example:
name,AGATC,AATG,TATC
Alice,2,8,3
However, they also offer different CSV files with differing numbers of DNA sequences to check for, like the example below:
name,AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Jason,15,49,38,5,14,44,14,12
(the numbers indicate how many times each of the above DNA sequences is repeated in the strand. So Jason has 15 AGATC repetitions in his strand)
I want to use a Dictionary variable to store the name and all their repetitions in it. However, since I don't know in advance how many DNA sequences I'll have to check for, the Dictionary has to be programmed with any number of those sequences in mind. Is there a way to use something similar to Java's varargs in a Python Dictionary?
The output format I want is to convert the group of people and their repetitions inside the DNA database into a List that contains a Dictionary that equates to each person. Because the CSV file can contain a variable number of DNA sequences (as shown above), I want to have each person's Dictionary have their name as their first key, then an additional amount of keys for each DNA Sequence in the CSV file. Here's an example that adheres to the snippet of the CSV file above:
{"name": "Jason", "seq1": 15, "seq2": 49, "seq3": 38, "seq4": 5, "seq5": 14, "seq6": 44, "seq7": 14, "seq8": 12}

You can use *args to get a tuple containing all the positional arguments:
def my_seq(*args):
    for arg in args:
        print(arg)

my_seq('a', 'b', 'c', 'd')
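Since the question is about dictionaries specifically, the closer analogue is **kwargs, which collects arbitrary keyword arguments into a dict. A minimal sketch (the function and argument names here are illustrative, not from the original assignment):

```python
def make_person(name, **sequences):
    # **sequences collects any extra keyword arguments into a dict
    person = {"name": name}
    person.update(sequences)
    return person

print(make_person("Jason", AGATC=15, AATG=38))
# -> {'name': 'Jason', 'AGATC': 15, 'AATG': 38}
```

This way the caller can pass as many sequence counts as the CSV happens to contain.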

All Python dictionaries have a variable number of items, since they're mutable, so this is a bit of an XY problem, but to get what you want, you can use a csv.DictReader (as Thierry Lathuille commented).
Let's call your first example example1.csv:
name,AGATC,AATG,TATC
Alice,2,8,3
To read it, you can do something like this:
import csv

with open('example1.csv') as f:
    rows = list(csv.DictReader(f))
print(rows)
# -> [{'name': 'Alice', 'AGATC': '2', 'AATG': '8', 'TATC': '3'}]
The numbers aren't automatically converted to ints, but you could use a dict comprehension:
rows = [
    {k: v if k == 'name' else int(v) for k, v in row.items()}
    for row in rows
]
print(rows)
# -> [{'name': 'Alice', 'AGATC': 2, 'AATG': 8, 'TATC': 3}]
Note that the DNA sequences themselves will probably be more useful as keys than 'seq1', 'seq2', etc. For example if you read in your other CSV as rows2, you can then do set-like operations on the keys:
>>> alice = rows[0]
>>> jason = rows2[0]
>>> len(alice.keys() - jason.keys()) # How many keys are unique to Alice?
0
>>> jason.keys() - alice.keys() # What keys does Jason have that Alice doesn't?
{'TCTAG', 'GATA', 'TCTG', 'TTTTTTCT', 'GAAA'}
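Dict views also support intersection with &amp;, so you can just as easily ask which keys the two rows share. A small self-contained sketch using the data from the two example files:

```python
alice = {"name": "Alice", "AGATC": 2, "AATG": 8, "TATC": 3}
jason = {"name": "Jason", "AGATC": 15, "TTTTTTCT": 49, "AATG": 38, "TCTAG": 5,
         "GATA": 14, "TATC": 44, "GAAA": 14, "TCTG": 12}

# dict views behave like sets: & gives the keys present in both rows
print(sorted(alice.keys() & jason.keys()))
# -> ['AATG', 'AGATC', 'TATC', 'name']
```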
If you want to get really advanced, you can use a Pandas DataFrame. Here's just a short example, cause I'm not very familiar with it myself :)
import pandas as pd
files = 'example2.csv', 'example1.csv' # Note the order
dfs = [pd.read_csv(f, index_col="name") for f in files]
df = pd.concat(dfs, sort=False)
df = df.astype('Int64') # allow ints and NaN in the same column
print(df)
Output:
       AGATC  TTTTTTCT  AATG  TCTAG  GATA  TATC  GAAA  TCTG
name
Jason     15        49    38      5    14    44    14    12
Alice      2       NaN     8    NaN   NaN     3   NaN   NaN


Outputting single strings in Python

I'm in need of some assistance with this problem from a MOOC on Python programming that I'm taking. This is just for self-learning, not for any graded coursework. Could you please provide some guidance? I am stuck. Thanks in advance for your help.
The problem statement is:
One of the confusing things about dictionaries is that they are unordered: the keys have no internal ordering to them. Sometimes though, you want to look through the keys in a
particular order, such as printing them alphabetically if they represent something like artist names.
For example, imagine if a forum tool used for a course exported its data as a dictionary. The keys of the dictionary are students' names, and the values are days of activity. Your goal is to return a list of students in the class in alphabetical order, followed by their days of activity, like this:
Chopra, Deepak: 22
Joyner, David: 14
Winfrey, Oprah: 17
Write a function named alphabetical_keys. alphabetical_keys should take as input a dictionary, and return a single string. The keys of the dictionary will be names and the values will be integers. The output should be a single string made of multiple
lines, following the format above: the name (the key), a colon and space, then the number of days of activity (the value), sorted alphabetically by key.
Remember, you are returning this as a single string: you're going to need to put the \n character after each line.
To convert a dictionary's keys into a list, use this line of code:
keys_as_list = list(the_dict.keys())
From there, you could sort keys_as_list like any normal list.
Add your code here!
def alphabetical_keys(dictionary):
    keys_as_list = list(dictionary.keys())
    return keys_as_list + dictionary[keys]
Below are some lines of code that will test your function. You can change the value of the variable(s) to test your function with different inputs.
If your function works correctly, this will print:
Chopra, Deepak: 22
Joyner, David: 14
Winfrey, Oprah: 17
the_dictionary = {"Joyner, David": 14, "Chopra, Deepak": 22, "Winfrey, Oprah": 17}
print(alphabetical_keys(the_dictionary))
While dictionaries may seem like they should be ordered, it's best not to think about them that way. It's a mapping from one thing to another.
You already have a way to get a list of the names in the dict:
keys_as_list = list(dictionary.keys())
After a quick Google search on "how to sort python list":
# sort list in place
keys_as_list.sort() # reverse=True would give reverse alpha order
Now you just need to loop through the sorted names to access the dictionary values:
return_str = "" # start with an empty string
for name in keys_as_list:
    # add to the string -- get the name and the dict value for that name
    return_str += f"{name}: {dictionary[name]}\n"
All together:
def alphabetical_keys(dictionary):
    keys_as_list = list(dictionary.keys())
    keys_as_list.sort()
    return_str = ""
    for name in keys_as_list:
        return_str += f"{name}: {dictionary[name]}\n"
    return return_str
d = {
    "Chopra, Deepak": 22,
    "Joyner, David": 14,
    "Winfrey, Oprah": 17,
    "Gump, Forrest": 9,
    "Obama, Barack": 19,
}
string = alphabetical_keys(d)
print(string)
Output:
Chopra, Deepak: 22
Gump, Forrest: 9
Joyner, David: 14
Obama, Barack: 19
Winfrey, Oprah: 17
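For what it's worth, once the loop version makes sense, the same function can be written more compactly with sorted() and str.join -- a sketch of an equivalent approach:

```python
def alphabetical_keys(dictionary):
    # sorted() iterates the keys in alphabetical order;
    # join the formatted lines, keeping the trailing newline on each
    return "".join(f"{name}: {dictionary[name]}\n" for name in sorted(dictionary))

d = {"Joyner, David": 14, "Chopra, Deepak": 22, "Winfrey, Oprah": 17}
print(alphabetical_keys(d))
```

This produces the same output as the loop version above.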

create a dictionary from a list with a function

I'm working on a Premier League dataset and I need to create a dictionary where the keys are the teams and the values are their points. I have a list of the teams and a function that takes the results of the matches and transforms them into points for each team. Everything works, but instead of creating one dictionary with all the teams and their scores, it prints 20 separate dictionaries, one per team. What is wrong?
You are creating a new dictionary at each iteration. Instead you should make the dictionary before the loop and then add a new entry at each iteration:
def get_team_points(df, teams):
    team_points = {}
    for team_name in teams:
        num_points = ... # as you have it, but since you posted an image I'm not rewriting it
        team_points[team_name] = num_points
    return team_points
A neater solution is to use a dictionary comprehension:
def get_team_points(df, teams):
    team_points = {team: get_num_points(team, df) for team in teams}
    return team_points
where get_num_points is a function wrapping your num_points = ... line, which, again, I would have typed out if you had posted the code as text :)
Also - please start using better variable names ;) your life will improve if you do. Names like List and Dict are really bad since:
- they're not descriptive
- they shadow the built-in List and Dict classes from the typing module (which you should use)
- they violate PEP 8 naming conventions
and speaking of the typing module, here it is in action:
def get_team_points(df: pd.DataFrame, teams: List[str]) -> Dict[str, int]:
    team_points = {team: get_num_points(team, df) for team in teams}
    return team_points
Now you can use a tool like mypy to catch errors before they occur. If you use an IDE instead of Jupyter, it will highlight errors as you go. Your code also becomes much clearer for other developers (including future you) to understand and use.
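On Python 3.9+ the built-in list and dict can be used directly as generic types (PEP 585), so no typing import is needed. Here's a runnable sketch of the same shape; get_num_points is a hypothetical stand-in, and a plain dict of results replaces the DataFrame to keep the example self-contained:

```python
def get_num_points(team: str, results: dict[str, int]) -> int:
    # hypothetical stand-in for the real points calculation
    return results.get(team, 0)

def get_team_points(results: dict[str, int], teams: list[str]) -> dict[str, int]:
    # dict comprehension: one entry per team, built in a single expression
    return {team: get_num_points(team, results) for team in teams}

print(get_team_points({"Arsenal": 84, "Chelsea": 74}, ["Arsenal", "Chelsea"]))
# -> {'Arsenal': 84, 'Chelsea': 74}
```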
I think perhaps you want this:
def get_team_points(df, teams):
    Dict = {}
    for team_name in List:
        num_points = TeamPoints(...)
        Dict[team_name] = num_points
    print(Dict)
In the TeamsPointDict() method, you are creating a new dictionary for each team in the list.
To insert all of them in one dictionary, declare the dictionary outside the for loop.
You want to take the sum of HP for Home teams, and AP for Away teams and add them together by team. Instead of manually separating, you can use two groupby operations and sum the results.
The return of each groupby will be a Series that we can then add together as pandas aligns on the index (teams in this case). Then with Series.to_dict() we get the entire dictionary at once.
import pandas as pd

df = pd.DataFrame({'HomeTeam': list('AABCDA'), 'AwayTeam': list('CBAAAB'),
                   'HP': [4,5,6,7,8,10], 'AP': [0,0,10,11,4,7]})
  HomeTeam AwayTeam  HP  AP
0        A        C   4   0
1        A        B   5   0
2        B        A   6  10
3        C        A   7  11
4        D        A   8   4
5        A        B  10   7
# Fill value so addition works if a team has exclusively home/away games.
s = df.groupby('HomeTeam')['HP'].sum().add(df.groupby('AwayTeam')['AP'].sum(),
                                           fill_value=0).astype(int)
s.to_dict()
s.to_dict()
{'A': 44, 'B': 13, 'C': 7, 'D': 8}
You should define your dictionary before the loop, then add your values:
dic = {}
for team_name in List:
    dic[team_name] = num_points

Querying from CSV files using python

I have three CSV files: one has a list of all pieces, one has a list of pieces of type M, and the other one of type B. That means the first list contains the two other ones, but without specifying their type.
I want to add a column to the first list that specifies the type of each piece using Python. That means for each piece in the first list, check if it's in list M and add an M in its type column; otherwise add a B.
My idea was to create a list of dictionaries (that I can convert later to CSV using a pre-written Python library), it would look something like this:
l = [{'piece','type'}] # list of dictionaries
for c in allpieces: # this is the list of all pieces
    l[{'piece'}] = c['piece'] # adding the piece number to the list of dictionaries from the list of all pieces
    for m in mlist: # list of pieces of type m
        if c['piece'] == m['piece']: # check if the piece is found in mlist
            l[{'type'}] = 'm' # add an m in its type column
        else:
            l[{'type'}] = 'b' # otherwise add b
This code is obviously not doing anything, and I need help debugging it.
A dictionary maps a key to a value, like {"key": "value"}, whereas elements in a list are accessed by index, so for the first element you do list[0]. To add a new key with a value to a dictionary, you do d["key"] = value. To append to a list, you do list.append(value). So in your case, I assume you want to create a list with dictionaries inside? That could look like this:
allpieces = ["Queen", "Tower", "Rook"]
mlist = ["Pawn", "Queen", "Rook"]
l = []
for c in allpieces:
    if c in mlist:
        l.append({"piece": c, "type": "m"})
    else:
        l.append({"piece": c, "type": "b"})
print(l)
Which creates a list with our dictionaries inside as such:
[{'piece': 'Queen', 'type': 'm'}, {'piece': 'Tower', 'type': 'b'}, {'piece': 'Rook', 'type': 'm'}]
Now if you were to access elements within this list you would do l[0]["piece"] to get "Queen"
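One small refinement: membership tests against a list are linear-time, so if mlist is large it's worth converting it to a set first, where each `in` check is constant-time on average. The same loop can then be sketched as a list comprehension:

```python
allpieces = ["Queen", "Tower", "Rook"]
mset = {"Pawn", "Queen", "Rook"}  # set membership is O(1) on average

# conditional expression picks the type for each piece
l = [{"piece": c, "type": "m" if c in mset else "b"} for c in allpieces]
print(l)
# -> [{'piece': 'Queen', 'type': 'm'}, {'piece': 'Tower', 'type': 'b'}, {'piece': 'Rook', 'type': 'm'}]
```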

Python: cycling/scanning though fields in an object

I have a JSON file named MyFile.json that contains this structure:
[{u'randomName1': {u'A': 16,u'B': 20,u'C': 71},u'randomName2': {u'A': 12,u'B': 17,u'C': 47}},...]
I can open the file and load it like this:
import json
with open('MyFile.json') as data_file:
    data = json.load(data_file)
And I can access the values in the first element like this:
data[0]["randomName1"]["A"]
data[0]["randomName1"]["B"]
data[0]["randomName1"]["C"]
data[0]["randomName2"]["A"]
data[0]["randomName2"]["B"]
data[0]["randomName2"]["C"]
The A B C keys are always named A B C (and there are always exactly 3 of them), so that's no problem.
The problem is:
1) I don't know how many elements are in the list, and
2) I don't know how many "randomName" keys are in each element, and
3) I don't know the names of the randomName keys.
How do I scan/cycle through the entire file, getting all the elements, and getting all the key names and associated key values for each element?
I don't have the knowledge or desire to write a complicated parsing script of my own. I was expecting that there's a way for the json library to provide this information.
For example (and this is not a perfect analogy I realize) if I am given an array X in AWK, I can scan all the index/name pairs by using
for (index in X) { print index, X[index] }
Is there something like this in Python?
---------------- New info below this line -------------
Thank you Padraic and E.Gordon. That goes a long way toward solving the problem.
In an attempt to make my initial post as concise as possible, I simplified my JSON data example too much.
My JSON data actually looks this this:
data = [
    {u'X': {u'randomName1': {u'A': 11, u'B': 12, u'C': 13}, u'randomName2': {u'A': 21, u'B': 22, u'C': 23}, ...}, u'Y': 101, u'Z': 102},
    ...
]
The ellipses represent arbitrary repetition, as described in the original post. The X Y Z keys are always named X Y Z (and there are always exactly 3 of them).
Using your posts as a starting point, I've been working on this for a couple of hours, but being new to Python I'm stumped. I cannot figure out how to add the extra loop to work with that data. I would like the output stream to look something like this:
Z,102,Y,101,randomName1,A,11,B,12,C,13,randomName2,A,21,B,22,C,23,...
.
.
.
Thanks for your help.
----------------- 3/23/16 update below --------------
Again, thanks for the help. Here's what I finally came up with. It does what I need:
import json

with open('MyFile.json') as data_file:
    data = json.load(data_file)

for record in data:
    print record['Z'], record['Y']
    for name, values in record['X'].items():
        print name, values['A'], values['B'], values['C']
You can print the items in the dicts:
js = [{u'randomName1': {u'A': 16, u'B': 20, u'C': 71}, u'randomName2': {u'A': 12, u'B': 17, u'C': 47}}]
for dct in js:
    for k, v in dct.items():
        print(k, v)
Which gives you the key/inner dict pairings:
randomName1 {'B': 20, 'A': 16, 'C': 71}
randomName2 {'B': 17, 'A': 12, 'C': 47}
If you want the values from the inner dicts, you can add another loop:
for dct in js:
    for k1, d in dct.items():
        print(k1)
        for k2, v in d.items():
            print(k2, v)
Which will give you:
randomName1
A 16
B 20
C 71
randomName2
A 12
B 17
C 47
If you have arbitrary levels of nesting, you will have to do it recursively.
You can use the for element in list construct to loop over all the elements in a list, without having to know its length.
The iteritems() dictionary method provides a convenient way to get the key-value pairs from a dictionary, again without needing to know how many there are or what the keys are called.
For example:
import json
with open('MyFile.json') as data_file:
    data = json.load(data_file)
for element in data:
    for name, values in element.iteritems():
        print("%s has A=%d, B=%d and C=%d" % (name,
                                              values["A"],
                                              values["B"],
                                              values["C"]))
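Note that iteritems() exists only in Python 2; on Python 3 the equivalent is items(). The same loop, sketched for Python 3 with the sample data inlined so it runs standalone:

```python
data = [{"randomName1": {"A": 16, "B": 20, "C": 71},
         "randomName2": {"A": 12, "B": 17, "C": 47}}]

lines = []
for element in data:
    for name, values in element.items():  # items() is the Python 3 spelling
        lines.append("%s has A=%d, B=%d and C=%d"
                     % (name, values["A"], values["B"], values["C"]))

print("\n".join(lines))
```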

using Python to import a CSV (lookup table) and add GPS coordinates to another output CSV

So I have already imported one XML-ish file with 3000 elements and parsed them into a CSV for output. But I also need to import a second CSV file with 'keyword','latitude','longitude' as columns and use it to add the GPS coordinates to additional columns on the first file.
Reading the python tutorial, it seems like {dictionary} is what I need, although I've read on here that tuples might be better. I don't know.
But either way - I start with:
floc = open('c:\python\kenya_location_lookup.csv','r')
l = csv.DictReader(floc)
for row in l: print row.keys()
The output looks like:
{'LATITUDE': '-1.311467078', 'LONGITUDE': '36.77352011', 'KEYWORD': 'Kianda'}
{'LATITUDE': '-1.315288401', 'LONGITUDE': '36.77614331', 'KEYWORD': 'Soweto'}
{'LATITUDE': '-1.315446430425027', 'LONGITUDE': '36.78170621395111', 'KEYWORD': 'Gatwekera'}
{'LATITUDE': '-1.3136151425171327', 'LONGITUDE': '36.785863637924194', 'KEYWORD': 'Kisumu Ndogo'}
I'm a newbie (and not a programmer). Question is how do I use the keys to pluck out the corresponding row data and match it against words in the body of the element in the other set?
"Reading the python tutorial, it seems like {dictionary} is what I need, although I've read on here that tuples might be better. I don't know."
They're both fine choices for this task.
"print row.keys() The output look like: {'LATITUDE': '-1.311467078',"
No it doesn't! This is the output from print row, most definitely NOT print row.keys(). Please don't supply disinformation in your questions, it makes them really hard to answer effectively (being a newbie makes no difference: surely you can check that the output you provide actually comes from the code you also provide!).
"I'm a newbie (and not a programmer). Question is how do I use the keys to pluck out the corresponding row data and match it against words in the body of the element in the other set?"
Since you give us absolutely zero information on the structure of "the other set", you make it of course impossible to answer this question. Guessing wildly, if for example the entries in "the other set" are also dicts each with a key of KEYWORD, you want to build an auxiliary dict first, then merge (some of) its entries in the "other set":
l = csv.DictReader(floc)
dloc = dict((d['KEYWORD'], d) for d in l)
for d in otherset:
    d.update(dloc.get(d['KEYWORD'], ()))
This will leave the location missing from the other set when not present in a corresponding keyword entry in the CSV -- if that's a problem you may want to use a "fake location" dictionary as the default for missing entries instead of that () in the last statement I've shown. But, this is all wild speculation anyway, due to the dearth of info in your Q.
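On modern Python the same lookup-and-merge reads a little more naturally with a dict comprehension and a dict default. A self-contained sketch under the same guess about the structure of "the other set" (the CSV text and otherset contents here are illustrative, with coordinates abbreviated):

```python
import csv
import io

# stand-in for the real file, so the sketch runs as-is
csv_text = "KEYWORD,LATITUDE,LONGITUDE\nKianda,-1.311,36.773\nSoweto,-1.315,36.776\n"
reader = csv.DictReader(io.StringIO(csv_text))
dloc = {row["KEYWORD"]: row for row in reader}  # index rows by keyword

otherset = [{"KEYWORD": "Kianda"}, {"KEYWORD": "Unknown"}]
for d in otherset:
    d.update(dloc.get(d["KEYWORD"], {}))  # merge lat/long when the keyword matches

print(otherset)
```

As in the answer above, entries whose keyword is missing from the CSV are simply left without coordinates.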
If you dump the DictReader into a list (data = [row for row in csv.DictReader(file)]), and you have unique keywords for each row, convert that list of dictionaries into a dictionary of dictionaries, using that keyword as the key.
>>> data = [row for row in csv.DictReader(open('C:\\my.csv'),
... ('num','time','time2'))]
>>> len(data) # lots of old data :P
1410
>>> data[1].keys()
['time2', 'num', 'time']
>>> keyeddata = {}
>>> for row in data[2:]: # I have some junk rows
... keyeddata[row['num']] = row
...
>>> keyeddata['32']
{'num': '32', 'time2': '8', 'time': '13269'}
Once you have the keyword pulled out, you can iterate through your other list, grab the keyword from it, and use it as the index for the lat/long list. Pull out the lat/long from that index and add it to the other list.
Thanks, Alex: My code for the other set is working, and the only relevant part is that I have a string that may or may not contain the 'keyword' that is in this dictionary.
Structurally, this is how I organized it:
def main():
    f = open('c:\python\ggce.sms', 'r')
    sensetree = etree.parse(f)
    senses = sensetree.getiterator('SenseMakingItem')
    bodies = sensetree.getiterator('Body')
    stories = []
    for body in bodies:
        fix_body(body)
        storybyte = unicode(body.text)
        storybit = storybyte.encode('ascii','ignore')
        stories.append(storybit)
    rows = [ids, titles, locations, stories]
    out = map(None, *rows)
    print out[120:121]
    write_data(out, 'c:\python\output_test.csv')
(I omitted the code for getting ids, titles, and locations because it works and will not be used to get the real locations from the data within stories.)
Hope this helps.
