Could someone explain this code to me - traversing keys in a dictionary? - python

I have this code that I cobbled together based on some posts on here. It takes a FASTA file (file that features DNA sequences) and finds sequences whose molecular weights are within the given weight range. To do this, it uses a dictionary resulting from a previously-built function, seq_ID_and_weight, which (as its name suggests) outputs the ID of sequences in the file and the minimum and maximum values of their molecular weights (sequences can be ambiguous, so there are many possible weights).
The below function does what I need it to do, but I'm not actually sure how.
def find_sequence(file_name, min_weight, max_weight):
    ID_list = []  # Initialize a list to store seq IDs
    with open(file_name) as file:
        dictionary = seq_ID_and_weight(file_name)
        for k, v in dictionary.items():  # This function lets you traverse the dictionary
            for i in range(min(2, len(v))):
                if v[i] > min_weight and v[i] < max_weight:  # If value is within given range, append the sequence_id to list.
                    ID_list.append(k)
                    break
    return ID_list
I understand up until the "for i in range" line. I know that line is there because I have to deal with keys that have two values as well as keys that have only one. But what does the min function do? And why am I using i as a variable?
Sorry if it's a dumb question, but I am new to Python.

Python's min() returns the smallest item of an iterable, or the smallest of two or more arguments. Here is the official documentation. In your code, min(2, len(v)) is 1 when a key has a single weight and 2 when it has two, so range(...) only produces indices that actually exist in v.
i is the loop variable: it takes each index in turn (0, then 1 if there is a second value), and v[i] looks up the weight at that position in the list v.
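To see what min(2, len(v)) does inside the loop, here is a small sketch. The IDs and weights are made up for illustration, and indices_checked is a hypothetical helper, not part of the original code.

```python
# Hypothetical weights: one ambiguous sequence (min and max) and one unambiguous.
weights = {"seq_amb": [6236.9, 6367.0], "seq_unamb": [4631.9]}

def indices_checked(values):
    # min(2, len(values)) caps the loop at two iterations and never exceeds
    # the list length, so values[i] cannot raise an IndexError.
    return list(range(min(2, len(values))))

for seq_id, v in weights.items():
    print(seq_id, indices_checked(v))
# seq_amb [0, 1]
# seq_unamb [0]
```

Without the min(), range(2) on a one-element list would try v[1] and fail.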

Related

function that takes directory of .txt file and returns a dictionary based on set parameters?

I want to make a function that takes the directory of a .txt file as an input and returns a dictionary based on specific parameters. If the .txt file is empty,
then the function will return nothing. When writing this function, I request that no imports, no list comprehension, and only for/while and if statements are used.
This is for the sake of the content I am learning right now, and I would like to be able to learn and interpret the function step-by-step.
An example of a .txt file is below. The amount of lines can vary but every line is formatted such that they appear in the order:
word + a string of 3 numbers connected by commas.
terra,4,5,6
cloud,5,6,7
squall,6,0,8
terra,4,5,8
cloud,6,5,7
First I would like to break down the steps of the function
Each component of the string that is separated by a comma serves a specific purpose:
The second-to-last number in the string will be subtracted from the last number to form a value in the dictionary.
For example, the last two numbers of terra,4,5,6 will be subtracted (6 - 5) to form a value of [1] in the dictionary.
The alphabetical words will form the keys of the dictionary. If there are multiple entries of the same word in a .txt file then a single key will be formed
and it will contain all the values of the duplicate keys.
for example, terra,4,5,6 , terra,4,4,6 , and terra,4,4,7 will output ('terra', 4):[1,2,3] as a key and value respectively.
However, for entries to be grouped under one key, the first numbers on their lines must also be the same. If they are not, the entries will go under separate keys.
For example, terra,4,5,6 and terra,5,4,6 will appear separately from each other in the dictionary as ('terra', 4):[1] and ('terra', 5):[2] respectively.
Example input
If we use the example .txt file mentioned above, the call should look like create_dict("***files/example.txt") and should output the dictionary
{('terra', 4):[1,3],('cloud', 5):[1],('squall', 6):[8],('cloud', 6):[2]}. I will add a link to the .txt file for the sake of recreating this example. (Note that *** is a placeholder for the rest of the directory.)
What I'm Trying:
testfiles = (open("**files/example.txt").read()).split('\n')
int_list = []
alpha_list = []
for values in testfiles:
    ao = values.split(',')  # returns only a portion of the list. why?
for values in ao:
    if values.isnumeric():
        int_list.append(values)  # retrieves a list of ints from the list
for values in ao:
    if values.isalpha():
        alpha_list.append(values)  # retrieves a list of words
{((alpha_list[0]), int(int_list[0])):(int(int_list[2])-(int(int_list[1])))}  # each line will always have 3 number values so I used index
This returns {('squall', 6): 1}, which is mainly just a proof of concept and not a solution to the function. I wanted to see if it was possible to use the numbers and words I found in int_list and alpha_list, via indexes, to generate entries in the dictionary. If so, the same could be applied to the rest of the strings in the .txt file.
Your input is in CSV format.
You really should be using one of these
https://docs.python.org/3/library/csv.html#csv.reader
https://docs.python.org/3/library/csv.html#csv.DictReader
since "odd" characters within a comma-separated field
are non-trivial to handle.
Better to let the library worry about such details.
Using defaultdict(list) is the most natural way,
the most readable way, to implement your dup key requirement.
https://docs.python.org/3/library/collections.html#collections.defaultdict
I know, I know, "no import";
now on to a variant solution.
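def create_dict(path):
    d = {}
    with open(path) as f:
        for line in f:
            if not line.strip():  # skip blank lines, e.g. a trailing newline
                continue
            word, nums = line.split(',', maxsplit=1)
            a, b, c = map(int, nums.split(','))
            delta = c - b  # last number minus second-to-last
            key = (word, a)
            if key not in d:
                d[key] = []
            d[key].append(delta)
    return d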
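For comparison, the library-based approach recommended above (csv.reader plus defaultdict(list)) might look roughly like this. It is only a sketch: create_dict_csv is a hypothetical name, and it takes any iterable of lines (an open file works too) so the example can run without a file on disk.

```python
import csv
from collections import defaultdict

def create_dict_csv(lines):
    """Group (last - second-to-last) deltas by (word, first number)."""
    grouped = defaultdict(list)  # missing keys start out as an empty list
    for row in csv.reader(lines):
        if not row:  # csv.reader yields [] for blank lines
            continue
        word, a, b, c = row[0], int(row[1]), int(row[2]), int(row[3])
        grouped[(word, a)].append(c - b)
    return dict(grouped)

sample = ["terra,4,5,6", "cloud,5,6,7", "squall,6,0,8", "terra,4,5,8", "cloud,6,5,7"]
print(create_dict_csv(sample))
# {('terra', 4): [1, 3], ('cloud', 5): [1], ('squall', 6): [8], ('cloud', 6): [2]}
```

The defaultdict removes the explicit "if key not in d" check, which is the main readability gain over the import-free variant.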

Biopython: Weight Intervals and Dictionaries

Hopefully someone familiar with Biopython can help me out. I have a function that takes FASTA files (DNA sequence files) and creates a dictionary with the sequence ID as the key and the molecular weight of the sequence as the value. Since sequences can be ambiguous, I also have a function that spits out all possible real sequences from the ambiguous one and uses that as input for the dictionary-making function that I just described; I integrated it into the dictionary-creating function so that for ambiguous sequences, the function spits out a minimum and maximum molecular weight values for the possible real sequences represented by the ambiguous one.
from Bio import SeqIO, SeqUtils

def seq_ID_and_weight(file_name):
    with open(file_name) as file:
        ID_weight = {}  # create an empty dictionary
        for sequence in SeqIO.parse(file, 'fasta'):
            weight_min = float('inf')  # any real weight is smaller than this
            weight_max = 0
            # ambiguous_to_unambiguous is my own helper; only call it once and store the result to improve performance
            all_poss_sequences = ambiguous_to_unambiguous(sequence.seq)
            if len(all_poss_sequences) != 1:  # if the length would be 1, it's unambiguous
                for possib in all_poss_sequences:
                    weight = SeqUtils.molecular_weight(possib)  # compute once per candidate
                    if weight < weight_min:
                        weight_min = weight
                    if weight > weight_max:  # separate check, so a new minimum can also be the maximum
                        weight_max = weight
                ID_weight[sequence.id] = [weight_min, weight_max]
            else:
                ID_weight[sequence.id] = [SeqUtils.molecular_weight(sequence.seq)]
    return ID_weight
The function spits out something like this, where the values are either the definitive molecular weight of the sequence (if the seq is unambiguous) or the min and max of the possible molecular weights of the sequence (if seq is ambiguous):
{'seq_7009': [6236.9764, 6367.049999999999], 'seq_418': [3716.3642000000004, 3796.4124000000006], 'seq_9143_unamb': [4631.958999999999]}
However, now I need to use this function to make a new one that does something slightly different. The new function needs to take a FASTA file name and min and max molecular weights as inputs and return a list of sequence IDs for sequences that have a molecular weight within that interval. Basically, the function should return the ID of an ambiguous sequence for which the weight interval overlaps the weight interval that you specify.
My approach to this would be as follows:
1. Initialize a dictionary containing the output of the previous function, like the example I gave above.
2. Iterate over the dictionary, checking if the key has only one value or multiple (a tuple).
   a. If only one value, then check if the value is in the given range, and if so, print that sequence ID. If not, break (do nothing).
   b. If multiple values, then check if either the first or second is in the given range (because if so, there is some overlap). If so, print that sequence ID. If not, break.
How would I actually implement this? This is all I have so far - I've really only created the dictionary:
def find_sequence(file_name, min_weight, max_weight):
    with open(file_name) as file:
        dictionary = {}
        dictionary.update(seq_ID_and_weight(file_name))
        for key in dictionary:
Now I need to check how many values the keys have, but I don't know how to do that. Any ideas?
You just have to traverse the dictionary & check the first 2 values.
This is the approach.
def find_sequence(file_name, min_weight, max_weight):
    li = []  # list to store ids
    dictionary = seq_ID_and_weight(file_name)  # this function opens the file itself
    for k, v in dictionary.items():  # traverse the dictionary
        for i in range(min(2, len(v))):  # checks index 0, and index 1 if v has a second value
            if min_weight < v[i] < max_weight:  # if the value is within range, append the sequence ID to the list
                li.append(k)
                break  # stop after the first match so each ID is added only once
    return li
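One caveat with checking only the two endpoints: a sequence whose weight interval fully contains the requested range (say [3000, 7000] against 4000 to 5000) has neither endpoint inside it and would be missed. A general interval-overlap test avoids that. The sketch below assumes the dictionary shape shown in the question; ids_overlapping is a hypothetical helper name, and it treats the bounds as inclusive, unlike the strict comparison above.

```python
def ids_overlapping(id_weights, min_weight, max_weight):
    """Return IDs whose weight interval overlaps [min_weight, max_weight]."""
    matches = []
    for seq_id, weights in id_weights.items():
        low, high = min(weights), max(weights)  # with one value, low == high
        # Two closed intervals overlap when each starts before the other ends.
        if low <= max_weight and high >= min_weight:
            matches.append(seq_id)
    return matches

example = {
    "seq_7009": [6236.9764, 6367.05],
    "seq_418": [3716.3642, 3796.4124],
    "seq_9143_unamb": [4631.959],
}
print(ids_overlapping(example, 3750, 5000))  # ['seq_418', 'seq_9143_unamb']
print(ids_overlapping(example, 6300, 6310))  # ['seq_7009']
```

In the second call the requested range lies strictly inside the interval of seq_7009, which is the case an endpoint-only check does not catch.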

How to find maximum element from a list and its index?

I have a list of ordered dictionaries. These ordered dictionaries have different sizes, and several can share the same size (for example, 10 dictionaries can have a length of 30 and 20 dictionaries a length of 32). I want to find the maximum number of items any dictionary in the list has. I have tried this, which gets me the correct maximum length:
maximum_len= max(len(dictionary_item) for dictionary_item in item_list)
But how can I find the dictionary fields for which the maximum_len is given? Say that the maximum_len is 30, I want to also have the dictionary with the 30 keys printed. It can be any dictionary with the size 30, not a specific one. I just need the keys of that dictionary.
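Well, you can always use filter:
output_dics = list(filter(lambda x: len(x) == maximum_len, item_list))
Then you have all the dictionaries that satisfy the condition; pick a random one or the first one. (The list() call matters in Python 3, where filter returns a lazy iterator that cannot be indexed.)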
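If any one dictionary of maximal size will do, a shorter alternative to filtering is to let max compare the dictionaries by length directly. This is a sketch with made-up data; largest_dict is a hypothetical helper name.

```python
def largest_dict(dicts):
    """Return the first dictionary with the most keys."""
    return max(dicts, key=len)  # key=len compares dictionaries by their size

item_list = [{"a": 1}, {"a": 1, "b": 2, "c": 3}, {"x": 1, "y": 2}]
biggest = largest_dict(item_list)
print(len(biggest), list(biggest))  # 3 ['a', 'b', 'c']
```

On ties, max returns the first maximal element it encounters, which fits the "any dictionary of that size" requirement.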
I don't know if this is the easiest or most elegant way to do it, but you could write a simple function that returns two values: the maximum length you already calculated, and the dictionary with that length, which you can look up via the list's .index method.
I'm talking about something like this:
def get_max(list_of_dict):
    plot = []
    for dict_index, dictionary in enumerate(list_of_dict):
        plot.append(len(dictionary))
    return max(plot), list_of_dict[plot.index(max(plot))]

maximum_len, max_dict = get_max(test)
I tested it and it works for my case, although I only made myself a test list with 5 dicts of different lengths.
EDIT:
Changed the variable name "dict" to "dictionary" so that it no longer shadows the built-in dict.

Calculating distance between two points using dictionary in python

I am trying to calculate a distance between two locations, using their coordinates. However I don't know how I can access the coordinate values, since they are in a dictionary.
I am very new to coding, and didn't understand any of the code I found regarding this problem, since it's too advanced for me. I don't really know where to start. My main function creates the dictionary: (Edit)
def main():
    filename = input("Enter the filename:\n")
    file = open(filename, 'r')
    rows = file.readlines()
    d = {}
    list = []
    for x in rows:
        list.append(x)
    #print(list)
    for elem in list:
        row = elem.split(";")
        d[row[3]] = {row[0], row[1]}  # these are the indexes that the name and latitude & longitude have in the file
{'Location1': {'40.155444793742276', '28.950292890004903'}, 'Location2': ... }
The dictionary is like this, so the key is the name and then the coordinates are the values. Here is the function, which contains barely anything so far:
def calculate_distance(dictionary, location1, location2):
    distance_x = dictionary[location1] - dictionary[location2]
    # Here I don't know how I can get the values from the dictionary,
    # since there are two values, longitude and latitude...
    distance_y = ...
    distance = ...  # Here I will use the pythagorean theorem
    return distance
Basically I just need to know how to work with the dictionary, since I don't know how I can get the values out so I can use them to calculate the distance.
--> How to search a key from a dictionary and get the values to my use. Thank you for answering my stupid question. :)
Well, you are just starting out, so it's normal that this is difficult for you.
So let's see: you have a function that outputs a dictionary where the keys are locations and the values are coordinate pairs.
First, let's talk about the data types that you use.
location_map={'Location1': {'40.155444793742276', '28.950292890004903'}, 'Location2': ... }
I think there is an issue with your values: they appear to be sets of strings. This causes two main problems for your goal.
First, set objects do not support indexing, so you cannot access location_map['Location1'][0] to get the first coordinate; trying this raises a TypeError. Sets are also unordered, so you could not tell which element is the latitude. Using tuples when creating your map allows you to index: define the coordinates as (latitude, longitude) instead of {latitude, longitude}.
Second, your coordinates are strings. To perform arithmetic with them you need a numeric type, in your case floats. If you read the longitude and latitude values as strings, you can convert them with float(longitude) and float(latitude).
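Putting both points together, a minimal sketch could look like the following. The location names and coordinates are invented for the example, and the result is a flat-plane distance in degrees (the Pythagorean approach from the question), not a real distance on the Earth's surface.

```python
import math

# Hypothetical locations; coordinates are stored as (latitude, longitude)
# tuples of floats instead of sets of strings.
location_map = {
    "Location1": (40.0, 29.0),
    "Location2": (43.0, 25.0),
}

def calculate_distance(dictionary, location1, location2):
    lat1, lon1 = dictionary[location1]  # unpacking works because tuples keep their order
    lat2, lon2 = dictionary[location2]
    # Pythagorean theorem on the coordinate differences (in degrees).
    return math.hypot(lat1 - lat2, lon1 - lon2)

print(calculate_distance(location_map, "Location1", "Location2"))  # 5.0
```

When building the map from the file, the matching change would be to store (float(row[0]), float(row[1])) rather than {row[0], row[1]}.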
There are multiple ways to do it; a few are listed below:
# option 1
for i, v in data.items():  # get each key and value from the dict
    for k in v:  # get each element of the value (it's a set)
        print(k)
# option 2
for i, v in data.items():  # get each key and value from the dict
    value_data = list(v)  # convert the set to a list so it can be indexed (set order is arbitrary)
    print(i, value_data[0], value_data[1])  # use the values from here
I would suggest going through the Python documentation to get more in-depth knowledge.

Returning tuple of unknown length from python UDF and then applying hash in Pig

This is a question that has two parts:
First, I have a python UDF that creates a list of strings of unknown length. The input to the UDF is a map (dict in python) and the number of keys is essentially unknown (it is what I'm trying to obtain).
What I don't know is how to output that in a schema that lets me return it as a list (or some other iterable data structure). This is what I have so far:
@outputSchema("?????") #WHAT SHOULD THE SCHEMA BE!?!?
def test_func(input):
    output = []
    for k, v in input.items():
        output.append(str(k))  # collect each key as a string
    return output
Now, the second part of the question. Once in Pig I want to apply a SHA hash to each element in the "list" for all my users. Some Pig pseudo code:
USERS = LOAD 'something' as (my_map:map[chararray])
UDF_OUT = FOREACH USERS GENERATE my_udfs.test_func(segment_map)
SHA_OUT = FOREACH UDF_OUT GENERATE SHA(UDF_OUT)
The last line is likely wrong as I want to apply the SHA to each element in the list, NOT to the whole list.
To answer your question: since you are returning a Python list whose contents are strings, you will want your decorator to be
@outputSchema('name_of_bag:{(keys:chararray)}')
It can be confusing when specifying this structure because you only need to define what one element in the bag would look like.
That being said, there is a much simpler way to do what you require. There is a function KEYSET() (You can reference this question I answered) that will extract the keys from a Pig Map. So using the data set from that example and adding a few more keys to the first one since you said your map contents are variable in length
maps
----
[a#1,b#2,c#3,d#4,e#5]
[green#sam,eggs#I,ham#am]
Query:
REGISTER /path/to/jar/datafu-1.2.0.jar;
DEFINE SHA datafu.pig.hash.SHA();
A = LOAD 'data' AS (M:[]);
B = FOREACH A GENERATE FLATTEN(KEYSET(M));
hashed = FOREACH B GENERATE $0, SHA($0);
DUMP hashed;
Output:
(d,18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4)
(e,3f79bb7b435b05321651daefd374cdc681dc06faa65e374e38337b88ca046dea)
(b,3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d)
(c,2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6)
(a,ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb)
(ham,eccfe263668d171bd19b7d491c3ef5c43559e6d3acf697ef37596181c6fdf4c)
(eggs,46da674b5b0987431bdb496e4982fadcd400abac99e7a977b43f216a98127721)
(green,ba4788b226aa8dc2e6dc74248bb9f618cfa8c959e0c26c147be48f6839a0b088)
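The digests above appear to be SHA-256: the value printed for the key a matches the well-known SHA-256 of that string. As a cross-check outside Pig, the same hashes can be computed with Python's hashlib. This is only a verification sketch, not part of the Pig job, and sha256_hex is a hypothetical helper name.

```python
import hashlib

def sha256_hex(text):
    """Hex SHA-256 digest of a string, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

for key in ["a", "b"]:
    print(key, sha256_hex(key))
# a ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
# b 3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d
```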
