I am trying to calculate the distance between two locations using their coordinates. However, I don't know how to access the coordinate values, since they are stored in a dictionary.
I am very new to coding and didn't understand any of the code I found regarding this problem, since it's too advanced for me. I don't really know where to start. My main function creates the dictionary:
def main():
    filename = input("Enter the filename:\n")
    file = open(filename, 'r')
    rows = file.readlines()
    d = {}
    list = []
    for x in rows:
        list.append(x)
    #print(list)
    for elem in list:
        row = elem.split(";")
        d[row[3]] = {row[0], row[1]}  # these are the indexes that the name and latitude & longitude have in the file
{'Location1': {'40.155444793742276', '28.950292890004903'}, 'Location2': ... }
The dictionary is like this, so the key is the name and then the coordinates are the values. Here is the function, which contains barely anything so far:
def calculate_distance(dictionary, location1, location2):
    distance_x = dictionary[location1] - dictionary[location2]
    # Here I don't know how I can get the values from the dictionary,
    # since there are two values, longitude and latitude...
    distance_y = ...
    distance = ...  # Here I will use the pythagorean theorem
    return distance
Basically, I just need to know how to work with the dictionary, since I don't know how to get the values out so I can use them to calculate the distance.
In short: how do I look up a key in a dictionary and get its values so I can use them? Thank you for answering my stupid question. :)
Well, you are just starting out, so it's normal that this is difficult for you.
So let's see: you have a function that outputs a dictionary where the keys are locations and the values are coordinate pairs.
First, let's talk about the data types that you use.
location_map={'Location1': {'40.155444793742276', '28.950292890004903'}, 'Location2': ... }
I think there is an issue with your values: they appear to be sets of strings. This causes two main problems for your goal.
First, set objects do not support indexing, which means that you cannot write location_map['Location1'][0] to get the first coordinate. Trying this would give you a TypeError. Sets are also unordered, so you could not tell which element is which coordinate. Using tuples when creating your map would allow you to index. You can do this by defining the coordinates as (longitude, latitude) instead of {longitude, latitude}.
Second, your coordinates are strings. To perform arithmetic on your data you need a numeric type such as int or, in your case, float. If you are reading the longitude and latitude values as strings, you can convert them with float(longitude) and float(latitude).
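Putting both points together, here is a minimal sketch of what calculate_distance could look like once the values are tuples of floats. The sample data is made up for illustration, and the result is a straight-line distance in degrees (as in the question's Pythagorean approach), not a real-world distance in kilometres.

```python
import math

# Hypothetical sample data: each value is a (latitude, longitude) tuple of floats.
location_map = {
    "Location1": (40.155444793742276, 28.950292890004903),
    "Location2": (41.0, 29.0),
}

def calculate_distance(dictionary, location1, location2):
    # Look up each key, then unpack the tuple into two named numbers.
    lat1, lon1 = dictionary[location1]
    lat2, lon2 = dictionary[location2]
    distance_x = lon1 - lon2
    distance_y = lat1 - lat2
    # Pythagorean theorem on the two differences.
    return math.sqrt(distance_x ** 2 + distance_y ** 2)

print(calculate_distance(location_map, "Location1", "Location2"))
```

When building the dictionary from the file, the same shape could be produced with something like d[row[3]] = (float(row[0]), float(row[1])).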
There are multiple ways to do it; a few are listed below:
# option 1
for i, v in data.items():  # get each key and value from the dict
    for k in v:  # get each element of the value (it's a set)
        print(k)
# option 2
for i, v in data.items():  # get each key and value from the dict
    value_data = list(v)  # convert the set to a list (a set has no guaranteed order)
    print(i, value_data[0], value_data[1])  # use the values from here
I would suggest going through the Python documentation to get more in-depth knowledge.
Hopefully someone familiar with Biopython can help me out. I have a function that takes FASTA files (DNA sequence files) and creates a dictionary with the sequence ID as the key and the molecular weight of the sequence as the value. Since sequences can be ambiguous, I also have a function that spits out all possible real sequences from an ambiguous one. I integrated it into the dictionary-creating function, so that for ambiguous sequences the function spits out the minimum and maximum molecular weights of the possible real sequences represented by the ambiguous one.
from Bio import SeqIO, SeqUtils

def seq_ID_and_weight(file_name):
    with open(file_name) as file:
        ID_weight = {}  # create an empty dictionary
        for sequence in SeqIO.parse(file, 'fasta'):
            weight_min = float('inf')
            weight_max = 0
            # only call the function once and store the result to improve performance
            all_poss_sequences = ambiguous_to_unambiguous(sequence.seq)
            if len(all_poss_sequences) != 1:  # a length of 1 means the sequence is unambiguous
                for possib in all_poss_sequences:
                    weight = SeqUtils.molecular_weight(possib)  # compute each weight only once
                    if weight < weight_min:
                        weight_min = weight
                    if weight > weight_max:  # not elif: the first sequence may be both min and max
                        weight_max = weight
                ID_weight[sequence.id] = [weight_min, weight_max]
            else:
                ID_weight[sequence.id] = [SeqUtils.molecular_weight(sequence.seq)]
    return ID_weight
The function spits out something like this, where the values are either the definitive molecular weight of the sequence (if the seq is unambiguous) or the min and max of the possible molecular weights of the sequence (if seq is ambiguous):
{'seq_7009': [6236.9764, 6367.049999999999], 'seq_418': [3716.3642000000004, 3796.4124000000006], 'seq_9143_unamb': [4631.958999999999]}
However, now I need to use this function to make a new one that does something slightly different. The new function needs to take a FASTA file name and min and max molecular weights as inputs and return a list of sequence IDs for sequences that have a molecular weight within that interval. Basically, the function should return the ID of an ambiguous sequence for which the weight interval overlaps the weight interval that you specify.
My approach to this would be as follows:
1. Initialize a dictionary containing the output of the previous function, like the example I gave above.
2. Iterate over the dictionary, checking if the key has only one value or multiple (a tuple).
   a. If only one value, then check if the value is in the given range, and if so, print that sequence ID. If not, break (do nothing).
   b. If multiple values, then check if either the first or second is in the given range (because if so, there is some overlap). If so, print that sequence ID. If not, break.
How would I actually implement this? This is all I have so far - I've really only created the dictionary:
def find_sequence(file_name, min_weight, max_weight):
    with open(file_name) as file:
        dictionary = {}
        dictionary.update(seq_ID_and_weight(file_name))
        for key in dictionary:
Now I need to check how many values the keys have, but I don't know how to do that. Any ideas?
You just have to traverse the dictionary & check the first 2 values.
This is the approach.
def find_sequence(file_name, min_weight, max_weight):
    li = []  # list to store ids
    with open(file_name) as file:
        dictionary = {}
        dictionary.update(seq_ID_and_weight(file_name))
        for k, v in dictionary.items():  # traverse the dictionary
            for i in range(min(2, len(v))):  # check at most the first two values (only one if len(v) is 1)
                if v[i] > min_weight and v[i] < max_weight:  # if the value is within range, append the sequence ID to the list
                    li.append(k)
                    break
    return li
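One caveat with checking only the two endpoints: if the requested range lies entirely inside a sequence's weight interval, neither endpoint falls within the range and the ID is missed. A possible alternative is a standard interval-overlap test, sketched below. The function name ids_overlapping is hypothetical, it takes the dictionary directly rather than a file name, and it treats the bounds as inclusive (unlike the strict comparisons above).

```python
def ids_overlapping(weights_by_id, min_weight, max_weight):
    """Return IDs whose weight interval overlaps [min_weight, max_weight]."""
    matches = []
    for seq_id, weights in weights_by_id.items():
        # A single weight gives low == high, so both cases share one test.
        low, high = min(weights), max(weights)
        if low <= max_weight and high >= min_weight:
            matches.append(seq_id)
    return matches

# Sample values copied from the example output in the question.
sample = {
    'seq_7009': [6236.9764, 6367.049999999999],
    'seq_418': [3716.3642000000004, 3796.4124000000006],
    'seq_9143_unamb': [4631.958999999999],
}
print(ids_overlapping(sample, 6300, 6310))  # ['seq_7009']
```

Here the range 6300 to 6310 sits inside the interval of seq_7009, so the endpoint-only check would return an empty list for the same input.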
I have this code that I cobbled together based on some posts on here. It takes a FASTA file (file that features DNA sequences) and finds sequences whose molecular weights are within the given weight range. To do this, it uses a dictionary resulting from a previously-built function, seq_ID_and_weight, which (as its name suggests) outputs the ID of sequences in the file and the minimum and maximum values of their molecular weights (sequences can be ambiguous, so there are many possible weights).
The below function does what I need it to do, but I'm not actually sure how.
def find_sequence(file_name, min_weight, max_weight):
    ID_list = []  # Initialize a list to store seq IDs
    with open(file_name) as file:
        dictionary = (seq_ID_and_weight(file_name))
        for k, v in dictionary.items():  # This function lets you traverse the dictionary
            for i in range(min(2, len(v))):
                if v[i] > min_weight and v[i] < max_weight:  # If value is within given range, append the sequence_id to list.
                    ID_list.append(k)
                    break
    return ID_list
I understand up until the "for i in range" line. I know that line is there because I have to deal with keys that have two values as well as keys that have only one. But what does the min function do? And why am I using i as a variable?
Sorry if it's a dumb question, but I am new to Python.
Python's min() returns the smallest item of an iterable, or the smallest of two or more arguments; see the official documentation for details.
i is simply the loop variable: it takes each index from 0 up to (but not including) min(2, len(v)) and is used to index into the dictionary value v (which is a list here). So min(2, len(v)) makes the loop look at one value when v has a single element and at the first two values otherwise.
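To see this concretely, the small sketch below prints which indices the inner loop would visit for lists of different lengths. The helper name indices_to_check is made up for illustration.

```python
def indices_to_check(v):
    # range(n) yields 0, 1, ..., n - 1; min(2, len(v)) caps n at 2.
    return list(range(min(2, len(v))))

print(indices_to_check([4631.958999999999]))               # [0]
print(indices_to_check([6236.9764, 6367.049999999999]))    # [0, 1]
```

Without the min, range(2) on a one-element list would try v[1] and raise an IndexError.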
I am looking for a fast way to update the values in an (ordered) dictionary, which contains tens of millions of values, where the updated values are stored in a list/array.
The program I am writing takes the list of keys from the original dictionary (which are numerical tuples) as a numpy array, and passes them through a function which returns an array of new numbers (one for each key). This array is then multiplied with the corresponding dictionary values (through element-wise array multiplication), and it is this returned 1-D array of values that we wish to use to update the dictionary. The entries in the new array are stored in the order of the corresponding keys, so I could use a loop to go through the dictionary and update the values one by one. But this is too inefficient. Is there a faster way to update the values in this dictionary which doesn't use loops?
An example of a similar problem would be if the keys in a dictionary represent the x and y-coordinates of points in space, and the values represent the forces being applied at that point. If we want to calculate the torque experienced at each point from the origin, we would first need a function like:
def euclid(xy):
    return (xy[0]**2 + xy[1]**2)**0.5
Which, if xy represents the x, y-tuple, would return the Euclidean distance from the origin. We could then multiply this by the corresponding dictionary value to return the torque, like so:
for xy in dict.keys():
    dict[xy] = euclid(xy)*dict[xy]
But this loop is slow, and we could take advantage of array algebra to get the new values in one operation:
# transpose the key array so that xy[0] holds all x values and xy[1] all y values
new_dict_values = euclid(np.array(list(dict.keys())).T)*np.array(list(dict.values()))
And it is here that we wish to find a fast method to update the dictionary, instead of utilising:
i = 0
for key in dict.keys():
    dict[key] = new_dict_values[i]
    i += 1
That last piece of code isn't just slow. I don't think it does what you want it to do:
for key in dict.keys():
    for i in range(len(new_dict_values)):
        dict[key] = new_dict_values[i]
For every key in the dictionary, you are iterating through the entire list of new_dict_values and assigning each one to the value of that key, overwriting the value you assigned in the previous iteration of the loop. This will give you a dictionary where every key has the value of the last element in new_dict_values, which I don't think is what you want.
If you are certain that the order of the keys in the dictionary is the same as the order of the values in new_dict_values, then you can do this:
for key, value in zip(dict.keys(), new_dict_values):
    dict[key] = value
Edit: Also, for future reference, in Python there is no need to iterate through a range of numbers and access the elements of a list by index. This:
for i in range(len(new_dict_values)):
    dict[key] = new_dict_values[i]
is equivalent to this:
for value in new_dict_values:
    dict[key] = value
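If the goal is to avoid writing the loop by hand, one option is to pair the keys with the new values using zip and let the dict constructor do the iteration. The sketch below uses only the standard library and made-up sample data; the names forces and new_values are hypothetical. The iteration then happens inside the interpreter's built-in code rather than in Python bytecode, which may be faster, but that is worth measuring on the real data.

```python
import math

# Hypothetical data: keys are (x, y) points, values are forces.
forces = {(3.0, 4.0): 2.0, (6.0, 8.0): 0.5, (0.0, 1.0): 10.0}

# Stand-in for the array of new values, computed in the dictionary's key order.
new_values = [math.hypot(x, y) * f for (x, y), f in forces.items()]

# Build the updated dictionary by pairing each key with its new value.
# Iterating over a dict yields its keys in insertion order.
forces = dict(zip(forces, new_values))
print(forces)
```

This only gives the right result if new_values is in the same order as the dictionary's keys, as the answer above points out.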
I am writing k-means in Python from scratch, without any outside packages like numpy and scipy, and ran into this issue when trying to assign data points to clusters.
Essentially, for each data point, I find which cluster is closest to that point and then update the dictionary of clusters by adding the data point to the list of points that belong to that cluster (i.e. the value of the dictionary). My issue is that when I try to update one of the keys in the dictionary, it turns all the other dictionary values to None, which is incorrect.
I tried separating out the steps of the process and looking at it line by line, but when I try to update one value, all the other values turn into None.
clusters = dict.fromkeys(k_init, [].copy())
for elem in data:
    minC = (101010101, 9999999)
    for cent in k_init:
        #print(elem, cent)
        if eucliean(elem, cent) < minC[1]:
            minC = (cent, eucliean(elem, cent))
    key = minC[0]
    old = clusters.get(key)
    clusters[key] = old.append(elem)
The problem is on the line
clusters = dict.fromkeys(k_init, [].copy())
When you create a dictionary like this, every key is assigned a reference to the same list. Hence, whenever you append to the list of any key, the element shows up under all the other keys too, because they all share that one list. In addition, list.append returns None, so clusters[key] = old.append(elem) overwrites the value with None; call old.append(elem) on its own instead. To avoid the shared list, do:
clusters = {key: [] for key in k_init}
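The difference can be checked with a small experiment. The centroids below are made-up placeholders; the last two lines also show that list.append returns None, which is why assigning its result back into the dictionary produces None values.

```python
k_init = [(0, 0), (5, 5)]  # hypothetical centroids

# dict.fromkeys reuses the very same list object for every key.
shared = dict.fromkeys(k_init, [])
shared[(0, 0)].append("p1")
print(shared)    # {(0, 0): ['p1'], (5, 5): ['p1']}

# A dict comprehension evaluates [] once per key, giving separate lists.
separate = {key: [] for key in k_init}
separate[(0, 0)].append("p1")
print(separate)  # {(0, 0): ['p1'], (5, 5): []}

# append modifies the list in place and returns None.
result = [].append("p1")
print(result)    # None
```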
Hi, I am having trouble with this one problem:
Given a variable, polygon_sides, that is associated with a dictionary that maps names of polygons to number of sides, create a new dictionary that maps number of sides to polygon names, and associate it with a variable n_polygons.
My current code:
for n_polygons in polygon_sides:
    polygon_sides={n_polygons[]:polygon_sides}
The only errors it gives me are syntax errors.
Do I have to rearrange the whole thing?
It's a one-liner. You need a loop in the expression to get all the elements.
n_polygons = {v:k for k,v in polygon_sides.items()}
You've already used n_polygons as the loop variable in your code. My understanding is that you want to swap the key-value pairs in the dict. Try this instead (on Python 2, iteritems() can be used in place of items()):
n_polygons = dict((v, k) for k, v in polygon_sides.items())
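For a concrete picture, here is the inversion applied to a small made-up polygon_sides dictionary. Note that if two polygon names mapped to the same number of sides, only the last one would survive in the inverted dictionary, since keys must be unique.

```python
# Hypothetical sample input.
polygon_sides = {"triangle": 3, "square": 4, "pentagon": 5}

# Swap each (name, sides) pair into (sides, name).
n_polygons = {sides: name for name, sides in polygon_sides.items()}
print(n_polygons)  # {3: 'triangle', 4: 'square', 5: 'pentagon'}
```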