Biopython: Weight Intervals and Dictionaries - python

Hopefully someone familiar with Biopython can help me out. I have a function that takes FASTA files (DNA sequence files) and creates a dictionary with the sequence ID as the key and the molecular weight of the sequence as the value. Since sequences can be ambiguous, I also have a function that generates all possible real sequences from an ambiguous one. I integrated it into the dictionary-creating function so that, for ambiguous sequences, the function returns the minimum and maximum molecular weights of the possible real sequences represented by the ambiguous one.
from Bio import SeqIO, SeqUtils

def seq_ID_and_weight(file_name):
    ID_weight = {}  # maps sequence ID -> [weight] or [min_weight, max_weight]
    with open(file_name) as file:
        for sequence in SeqIO.parse(file, 'fasta'):
            # ambiguous_to_unambiguous is my own helper (not shown); call it once and reuse the result
            all_poss_sequences = ambiguous_to_unambiguous(sequence.seq)
            if len(all_poss_sequences) != 1:  # if the length is 1, the sequence is unambiguous
                # compute each weight once, then take the extremes (no hard-coded starting values)
                weights = [SeqUtils.molecular_weight(possib) for possib in all_poss_sequences]
                ID_weight[sequence.id] = [min(weights), max(weights)]
            else:
                ID_weight[sequence.id] = [SeqUtils.molecular_weight(sequence.seq)]
    return ID_weight
The function spits out something like this, where the values are either the definitive molecular weight of the sequence (if the seq is unambiguous) or the min and max of the possible molecular weights of the sequence (if seq is ambiguous):
{'seq_7009': [6236.9764, 6367.049999999999], 'seq_418': [3716.3642000000004, 3796.4124000000006], 'seq_9143_unamb': [4631.958999999999]}
However, now I need to use this function to make a new one that does something slightly different. The new function needs to take a FASTA file name and min and max molecular weights as inputs and return a list of sequence IDs for sequences that have a molecular weight within that interval. Basically, the function should return the ID of an ambiguous sequence for which the weight interval overlaps the weight interval that you specify.
My approach to this would be as follows:
1. Initialize a dictionary containing the output of the previous function, like the example I gave above.
2. Iterate over the dictionary, checking if the key has only one value or multiple (a list of two).
   a. If only one value, check if the value is in the given range, and if so, print that sequence ID. If not, do nothing.
   b. If multiple values, check if either the first or second is in the given range (because if so, there is some overlap). If so, print that sequence ID. If not, do nothing.
How would I actually implement this? This is all I have so far - I've really only created the dictionary:
def find_sequence(file_name, min_weight, max_weight):
    with open(file_name) as file:
        dictionary = {}
        dictionary.update(seq_ID_and_weight(file_name))
        for key in dictionary:
Now I need to check how many values the keys have, but I don't know how to do that. Any ideas?

You just have to traverse the dictionary and check the first two values of each entry.
This is the approach.
def find_sequence(file_name, min_weight, max_weight):
    li = []  # list to store ids
    dictionary = seq_ID_and_weight(file_name)  # that function opens the file itself
    for k, v in dictionary.items():  # traverse the dictionary
        for i in range(min(2, len(v))):  # v holds 1 value (unambiguous) or 2 values (min, max)
            if min_weight < v[i] < max_weight:  # if a value is within range, append the sequence ID
                li.append(k)
                break  # stop after the first match so an ID is not added twice
    return li
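One caveat with checking only the two endpoints: if a sequence's [min, max] interval completely contains the requested range, neither endpoint falls inside it, even though the intervals overlap. Here is a minimal sketch of an overlap test, assuming the dictionary has the shape shown in the question. The helper name ids_overlapping is made up for illustration, and treating the bounds as inclusive is an assumption.

```python
def ids_overlapping(id_weight, min_weight, max_weight):
    """Return IDs whose weight interval overlaps [min_weight, max_weight].

    id_weight maps an ID to [weight] or [min_weight, max_weight];
    this helper and its inclusive bounds are illustrative assumptions.
    """
    matches = []
    for seq_id, weights in id_weight.items():
        low, high = weights[0], weights[-1]  # same value twice for a single weight
        # Two closed intervals overlap when each one starts before the other ends.
        if low <= max_weight and high >= min_weight:
            matches.append(seq_id)
    return matches


example = {
    'seq_7009': [6236.9764, 6367.05],
    'seq_418': [3716.3642, 3796.4124],
    'seq_9143_unamb': [4631.959],
}
# The requested range lies strictly inside seq_7009's interval.
print(ids_overlapping(example, 6300, 6320))
```

With the endpoint-only check, seq_7009 would be missed for the range 6300 to 6320, because neither 6236.9764 nor 6367.05 lies inside it.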

Related

function that takes directory of .txt file and returns a dictionary based on set parameters?

I want to make a function that takes the directory of a .txt file as an input and returns a dictionary based on specific parameters. If the .txt file is empty,
then the function will return nothing. When writing this function, I request that no imports, no list comprehension, and only for/while and if statements are used.
This is for the sake of the content I am learning right now, and I would like to be able to learn and interpret the function step-by-step.
An example of a .txt file is below. The amount of lines can vary but every line is formatted such that they appear in the order:
word + a string of 3 numbers connected by commas.
terra,4,5,6
cloud,5,6,7
squall,6,0,8
terra,4,5,8
cloud,6,5,7
First I would like to break down the steps of the function
Each component of the string that is separated by a comma serves a specific purpose:
The second-to-last number in the string will be subtracted from the last number to form a value in the dictionary.
for example, the last two numbers of terra,4,5,6 will be subtracted (6 - 5) to form a value of [1] in the dictionary
The alphabetical words will form the keys of the dictionary. If there are multiple entries of the same word in a .txt file then a single key will be formed
and it will contain all the values of the duplicate keys.
for example, terra,4,5,6 , terra,4,4,6 , and terra,4,4,7 will output ('terra', 4):[1,2,3] as a key and value respectively.
However, in order for an entry to be treated as a duplicate, the first number after the word must also be the same. If it is not, the entries will get separate keys.
For example, terra,4,5,6 and terra,5,4,6 will appear separately from each other in the dictionary as ('terra', 4):[1] and ('terra', 5):[2] respectively.
Example input
If we use the example .txt file mentioned above, the call should look like create_dict("***files/example.txt") and should output the dictionary
{('terra', 4):[1,3],('cloud', 5):[1],('squall', 6):[8],('cloud', 6):[2]}. I will add a link to the .txt file for the sake of recreating this example. (note that *** are placeholders for the rest of the directory)
What I'm Trying:
testfiles = (open("**files/example.txt").read()).split('\n')
int_list = []
alpha_list = []
for values in testfiles:
    ao = values.split(',') #returns only a portion of the list. why?
for values in ao:
    if values.isnumeric():
        int_list.append(values) #retrieves list of ints from list
for values in ao:
    if values.isalpha():
        alpha_list.append(values) #retrieves a list of words
{((alpha_list[0]), int(int_list[0])):(int(int_list[2])-(int(int_list[1])))} #each line will always have 3 number values so I used index
this returns {('squall', 6): 1} which is mainly just a proof of concept and not a solution to the function. I wanted to see if it was possible to use the numbers and words I found in int_list and alpha_list using indexes to generate entries in the dictionary. If possible, the same could be applied to the rest of the strings in the .txt file.
Your input is in CSV format.
You really should be using one of these
https://docs.python.org/3/library/csv.html#csv.reader
https://docs.python.org/3/library/csv.html#csv.DictReader
since "odd" characters within a comma-separated field
are non-trivial to handle.
Better to let the library worry about such details.
Using defaultdict(list) is the most natural way,
the most readable way, to implement your dup key requirement.
https://docs.python.org/3/library/collections.html#collections.defaultdict
I know, I know, "no import";
now on to a variant solution.
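def create_dict(path):
    d = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines, e.g. a trailing newline
                continue
            word, nums = line.split(',', maxsplit=1)
            a, b, c = map(int, nums.split(','))
            delta = c - b  # last number minus second-to-last number
            key = (word, a)
            if key not in d:
                d[key] = []
            d[key].append(delta)
    return d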
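For comparison, here is a rough sketch of the library-based version suggested above, using csv.reader and defaultdict(list). It does use imports, so it does not meet the "no imports" constraint in the question. The function name create_dict_csv is made up for illustration, and the demo writes the question's sample lines to a temporary file rather than assuming a real path.

```python
import csv
import os
import tempfile
from collections import defaultdict


def create_dict_csv(path):
    """Group (last - second-to-last) deltas under (word, first_number) keys."""
    groups = defaultdict(list)
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if not row:  # csv.reader yields an empty list for a blank line
                continue
            word, a, b, c = row
            groups[(word, int(a))].append(int(c) - int(b))
    return dict(groups)  # plain dict, so missing keys raise KeyError again


# Demo: the sample data from the question, written to a temporary file.
sample = "terra,4,5,6\ncloud,5,6,7\nsquall,6,0,8\nterra,4,5,8\ncloud,6,5,7\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
result = create_dict_csv(tmp.name)
os.remove(tmp.name)
print(result)
```

The defaultdict removes the "if key not in d" check, because a missing key starts out as an empty list automatically.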

Could someone explain this code to me - traversing keys in a dictionary?

I have this code that I cobbled together based on some posts on here. It takes a FASTA file (file that features DNA sequences) and finds sequences whose molecular weights are within the given weight range. To do this, it uses a dictionary resulting from a previously-built function, seq_ID_and_weight, which (as its name suggests) outputs the ID of sequences in the file and the minimum and maximum values of their molecular weights (sequences can be ambiguous, so there are many possible weights).
The below function does what I need it to do, but I'm not actually sure how.
def find_sequence(file_name, min_weight, max_weight):
    ID_list=[] # Initialize a list to store seq IDs
    with open (file_name) as file:
        dictionary = (seq_ID_and_weight(file_name))
        for k,v in dictionary.items(): # This function lets you traverse the dictionary
            for i in range(min(2,len(v))):
                if v[i]>min_weight and v[i]<max_weight: # If value is within given range, append the sequence_id to list.
                    ID_list.append(k)
                    break
    return ID_list
I understand up until the "for i in range" line. I know that line is there because I have to deal with keys that have two values as well as keys that have only one. But what does the min function do? And why am I using i as a variable?
Sorry if it's a dumb question, but I am new to Python.
Python's min() returns the smallest item of an iterable, or the smallest of two or more arguments (see the official documentation for min). Here, min(2, len(v)) is 1 when v holds a single weight and 2 when v holds a minimum and a maximum, so the loop never looks past the first two values.
i is the loop index. It takes each value in range(min(2, len(v))), that is 0 alone, or 0 and then 1, and v[i] uses it to index into the dictionary value v (a list in your case).
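To see it concretely, here is a small sketch that shows which indices the loop visits for a one-value list and a two-value list. The helper name indices_checked is made up for illustration, and the weights are taken from the example output in the original question.

```python
def indices_checked(v):
    """Return the indices that `for i in range(min(2, len(v)))` would visit."""
    return list(range(min(2, len(v))))


single = [4631.959]            # unambiguous sequence: one weight
pair = [6236.9764, 6367.05]    # ambiguous sequence: min and max weight

for v in (single, pair):
    idx = indices_checked(v)
    print(len(v), idx, [v[i] for i in idx])
```

For the single-value list only v[0] is read, so v[1] is never touched and no IndexError can occur.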

Calculating distance between two points using dictionary in python

I am trying to calculate a distance between two locations, using their coordinates. However I don't know how I can access the coordinate values, since they are in a dictionary.
I am very new to coding, and didn't understand any of the code I found regarding this problem, since it's too advanced for me. I don't really know where to start. My main function creates the dictionary: (Edit)
def main():
    filename = input("Enter the filename:\n")
    file= open(filename, 'r')
    rows= file.readlines()
    d = {}
    list = []
    for x in rows:
        list.append(x)
    #print(list)
    for elem in list:
        row = elem.split(";")
        d[row[3]] = {row[0], row[1]} #these are the indexes that the name and latitude & longitude have in the file
{'Location1': {'40.155444793742276', '28.950292890004903'}, 'Location2': ... }
The dictionary is like this, so the key is the name and then the coordinates are the values. Here is the function, which contains barely anything so far:
def calculate_distance(dictionary, location1, location2):
    distance_x = dictionary[location1] - dictionary[location2]
    # Here I don't know how I can get the values from the dictionary,
    # since there are two values, longitude and latitude...
    distance_y = ...
    distance = ... # Here I will use the pythagorean theorem
    return distance
Basically I just need to know how to work with the dictionary, since I don't know how I can get the values out so I can use them to calculate the distance.
--> How to search a key from a dictionary and get the values to my use. Thank you for answering my stupid question. :)
Well, you are just starting out, so it's normal that this seems difficult.
So let's see: you have a function that outputs a dictionary where the keys are locations and the values are coordinate pairs.
First, let's talk about the data types that you use.
location_map={'Location1': {'40.155444793742276', '28.950292890004903'}, 'Location2': ... }
I think there is an issue with your values: they are sets of strings. This has two main disadvantages for your goal.
First, set objects do not support indexing and have no guaranteed order, which means you cannot use location_map['Location1'][0] to get the first coordinate. Trying this raises a TypeError. Using tuples when creating your map would allow you to index. You can do this by storing the coordinates as (longitude, latitude) instead of {longitude, latitude}.
Second, your coordinates are strings. To perform arithmetic on your data you need a numeric type, in your case floats. If you read the longitude and latitude values as strings, you can convert them with float(longitude) and float(latitude).
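Putting both points together, here is a minimal sketch of the distance function, assuming each dictionary value is a tuple of two floats. It applies the Pythagorean theorem directly to the coordinate differences, as the question intends, so the result is in coordinate units (degrees) and is not a real-world distance in kilometres. The sample dictionary is made up for illustration.

```python
import math


def calculate_distance(locations, name1, name2):
    """Straight-line (Pythagorean) distance between two named points."""
    x1, y1 = locations[name1]  # tuple unpacking works because order is fixed
    x2, y2 = locations[name2]
    return math.hypot(x2 - x1, y2 - y1)  # sqrt(dx**2 + dy**2)


# Hypothetical data: strings from a file are converted with float() and
# stored as tuples rather than sets.
raw_rows = [("0.0", "0.0", "Origin"), ("3.0", "4.0", "Corner")]
locations = {name: (float(a), float(b)) for a, b, name in raw_rows}

print(calculate_distance(locations, "Origin", "Corner"))
```

In the question's main function, the matching change would be to store (float(row[0]), float(row[1])) instead of {row[0], row[1]}.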
There are multiple ways to do it; a few are listed below:
# option 1
for i, v in data.items():  # i is the key, v is the value
    for k in v:  # iterate over each element of the value (it's a set)
        print(k)
# option 2
for i, v in data.items():  # i is the key, v is the value
    value_data = list(v)  # convert the set to a list so it can be indexed
    print(i, value_data[0], value_data[1])  # note: a set has no guaranteed order
I would suggest going through the Python documentation to get more in-depth knowledge.

Determine location given a dictionary with lists inside of a list

I'm new here and I need some help with some code I've been working on because I've gotten myself lost and am now just confused.
First, I created a dictionary based on some help from this website. A sample of my dictionary looks like this:
length = {'A': [(0,21), (30,41), (70,80)], 'B': [(0,42), (70,80)]..etc}
I have a file that I would like to use to iterate over my dictionary that contains this information:
A 32
B 15
etc
What I want to do is to take the first feature in my file and match it to the key of my dictionary. Once I have it matched, I want to to see which range the number in my file matches to. For example, the first feature in my file would match to A and the second range. That means I would want my output to show the name (A) and display 2 because it matched to the second range.
I've tried my code below:
import csv
with open('Exome_agg_cons_snps_pct_RefSeq_HGMD_reinitialized.txt') as f:
    reader = csv.DictReader(f,delimiter="\t")
    for row in reader:
        snppos = row['snp_rein']
        name = row['isoform']
        snpos = int(snppos)
        if name in exons:
            y = exons[name]
            if y[0] <= snpos <=y[1]:
                print name,snppos
This, however, doesn't give me any output. I'm not sure what is wrong with my code. I am new though. I think I might be missing something. Also, I realize that my code won't do what I want it to do (tell me what range it matched to). I was thinking of using the .index() function but I'm not sure I can use it in the case I have. Any suggestions?
You just need to loop over the spans in a dict value. It's nicer to let the for-loop unpack them directly:
for row in reader:
    snppos = int(row['snp_rein'])  # convert to int so the comparison is numeric
    name = row['isoform']
    if name in exons:
        for low, high in exons[name]:
            if low <= snppos <= high:
                print(name, low, high)
                break # Since exons can't overlap, go to next row immediately
If you need the index in the exon list rather than the span (e.g., index 2 rather than (70, 80)), then add enumerate:
...
        for i, (low, high) in enumerate(exons[name]):
            if low <= snppos <= high:
                print(name, i)
                break
This method works if the list in the dictionary is ordered.
enumerate yields two things at each step: the index and the value.
Since each value here is a (low, high) pair, we iterate over the pairs one at a time.
We use min and max to get the lower and upper bound of each pair,
then compare them with the number from the text file.
If the number lies within the pair, we print the index + 1 and the name.
code:
for sd, i in enumerate(exons[name]):
    if min(i) <= snpos <= max(i):
        print(sd + 1, name)
        break
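As a self-contained illustration of the enumerate approach, here is a small sketch using the sample dictionary and the two sample lines (A 32, B 15) from the question. The function name find_range_number is made up, and start=1 is used so the result is the 1-based range number the question asks for.

```python
def find_range_number(ranges_by_name, name, position):
    """Return the 1-based number of the range containing position, or None."""
    for number, (low, high) in enumerate(ranges_by_name.get(name, []), start=1):
        if low <= position <= high:
            return number
    return None  # unknown name, or position outside every range


length = {'A': [(0, 21), (30, 41), (70, 80)], 'B': [(0, 42), (70, 80)]}

for name, position in [('A', 32), ('B', 15)]:
    print(name, find_range_number(length, name, position))
```

Returning None for a miss keeps "no match" distinguishable from a valid range number.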

Python: skip every nth element in hashcheck, conditional mismatch?

So I currently have a script that generates hashes from the contents of a text file and saves them to a dictionary, and it then goes into a second text file and generates hashes from there and compares them to said dictionary. I'm trying to implement some sort of incomplete matching; for example, I want to program some tolerance: for example, I'd like to make it so that every third element in the hash is unimportant to the matching protocol, so if there is a mismatch, it will continue iterating unimpeded. Is it possible to do this?
Furthermore, and this is a separate case, would it be possible to determine a conditional mismatch? For example, if there is a mismatch, there are several elements that would still qualify as "matching", like if I wanted a vowel at a certain position, but it didn't matter which vowel showed up.
In summary, I'm trying to make it so that my script either goes
check,check,disregard,check,check,disregard,etc.
OR
check,check,conditional mismatch?,check,check,conditional mismatch?,etc.
along the hashes. Is this doable?
EDIT: I suppose it's not really hashchecking, but more of string comparison. Here's the relevant code I'm trying to tweak:
# hash table for finding hits
lookup = defaultdict(list)
# store sequence hashes in hash table
for i in xrange(len(file1) - hashlen + 1):
    key = file1[i:i+hashlen]
    lookup[key].append(i)
# look up hashes in hash table
hits = []
for i in xrange(len(file2) - hashlen + 1):
    key = file2[i:i+hashlen]
    # store hits to hits list
    for hit in lookup.get(key, []):
        hits.append((i, hit))
where hashlen is the length of the hash I want to generate (and thus the buffer so I don't go off the end of the file).
As commented, hashes do not have a guaranteed order in older Python versions. You can consider using an OrderedDict.
But maybe this code helps you.
skip_rate = 3
for index, (key, value) in enumerate(your_hash.items()):
    if (index + 1) % skip_rate != 0:  # check, check, disregard: skip every third item
        do_something(key, value)
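Since the edit in the question says this is really string comparison on fixed-length substrings, another option is to normalize each key before it goes into the lookup table, so that unimportant positions compare equal. Below is a rough sketch of that idea. The names mask_key and find_hits, the mode argument, and the vowel set are all illustrative assumptions, and "every third element" is counted within each substring, not within the whole file.

```python
from collections import defaultdict

VOWELS = set('aeiou')
IGNORED = None           # placeholder for a position that never matters
ANY_VOWEL = ('vowel',)   # placeholder meaning "some vowel was here"


def mask_key(key, mode):
    """Turn a substring into a tuple in which every third position is relaxed."""
    parts = []
    for pos, ch in enumerate(key):
        if (pos + 1) % 3 != 0:
            parts.append(ch)            # check, check, ...
        elif mode == 'ignore':
            parts.append(IGNORED)       # ... disregard
        elif mode == 'vowel' and ch.lower() in VOWELS:
            parts.append(ANY_VOWEL)     # ... any vowel matches any vowel
        else:
            parts.append(ch)            # non-vowels must still match exactly
    return tuple(parts)


def find_hits(text1, text2, hashlen, mode='ignore'):
    """Return (index_in_text2, index_in_text1) pairs whose masked keys agree."""
    lookup = defaultdict(list)
    for i in range(len(text1) - hashlen + 1):
        lookup[mask_key(text1[i:i + hashlen], mode)].append(i)
    hits = []
    for i in range(len(text2) - hashlen + 1):
        for hit in lookup.get(mask_key(text2[i:i + hashlen], mode), []):
            hits.append((i, hit))
    return hits


print(find_hits('abcdef', 'abXdeY', 3))          # third characters differ
print(find_hits('bae', 'bai', 3, mode='vowel'))  # e and i are both vowels
```

Because both texts are masked the same way, the dictionary lookup stays a plain equality test, so no order among dictionary entries is needed.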
