Contig Extension with Python - python

I want to add a function to a program that builds dictionaries of DNA sequences. The function receives a contig (incon = initial contig, a DNA sequence) and extends it to the right: it finds overlapping parts that are keys in the dictionaries and concatenates the corresponding values with the "+" operator.
I'll give a quick example:
GATTTGAAGC as initial contig
ATTTGAAGC:A is one of many entries in the dictionary
I want the function to search for an overlapping part that is a key in the dictionary (I asked about this here yesterday; it worked fine by itself and with specific values, but not within the function with variables), concatenate the value of that key to the initial sequence (extending the contig to the right), and save the new sequence into incon. It should then delete this dictionary entry and repeat until there are no entries left (this part I haven't even tried yet).
First I want the function to search for keys of length 9 with values of length 1 (ATTTGAAGC:A); if there are no overlapping parts, it should search for keys of length 8 with values of length 2 (e.g. ATTTGAAG:TG), and so on.
Additional Info:
The dictionary "suffixDicts" has such entries, with value lengths ranging from 1 (key has length 14) to 10 (key has length 5).
"reads" is where a list of sequences is stored
When I try the steps one after another, some work (like the search) and some don't, but when I tried to build a function out of them, literally nothing happens. The function is supposed to return the smallest possible extension.
def extendContig (incon, reads, suffixDicts):
    incon = reads[0]
    for x in range(1,len(incon)):
        for key in suffixDicts.keys():
            if incon[x:] == key:
                incon = incon+suffixDicts['key']
                print(incon)
            else:
                print("n")
    return()
I'm very new to Python, so there are probably serious mistakes in my code, and I would like them pointed out. I know that I'm in way over my head with this. I now understand most of the existing code, but I still have problems implementing something of my own in it, probably due to incorrect syntax. I know there are programs I could use, but I would like to understand the whole thing behind it.
Edit: As requested, I will add the functions I was given. Some of them were already written; other parts I wrote based on the given code (basically I copied it with some tweaks). Warning: it is quite a lot:
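A minimal sketch of how the extension loop could look, assuming suffixDicts is one flat dictionary that maps an overlap (key) to its extension (value). The names extend_contig and suffix_dict are hypothetical, not part of the existing program. Three things in the function above would keep it from working: suffixDicts['key'] looks up the literal string 'key' instead of the variable key, return() returns an empty tuple rather than incon, and a used entry is never removed, so the search never starts over with the extended contig.

```python
def extend_contig(incon, suffix_dict):
    """Greedily extend incon to the right until no key overlaps its end."""
    remaining = dict(suffix_dict)  # copy, so the caller's dictionary is not modified
    extended = True
    while extended and remaining:
        extended = False
        # Longest keys first: the longest overlap gives the shortest extension.
        for key in sorted(remaining, key=len, reverse=True):
            if incon.endswith(key):
                incon += remaining.pop(key)  # use the entry once, then drop it
                extended = True
                break  # restart the search with the longer contig
    return incon


print(extend_contig("GATTTGAAGC", {"ATTTGAAGC": "A", "TTGAAGCA": "TG"}))
# GATTTGAAGCATG
```

Sorting the keys by length is what makes the function prefer the smallest extension; if the real suffixDicts is a list of dictionaries (one per key length), the same idea would apply by walking that list in order.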
Reading the Fasta file:
Additional Info:
The FASTA file contains a large number of sequences in the form:
"> read 1
TTATGAATATTACGCAATGGACGTCCAAGGTACAGCGTATTTGTACGCTA
"> read 2
AACTGCTATCTTTCTTGTCCACTCGAAAATCCATAACGTAGCCCATAACG
"> read 3
TCAGTTATCCTATATACTGGATCCCGACTTTAATCGGCGTCGGAATTACT
I uploaded the file here: http://s000.tinyupload.com/?file_id=52090273537190816031
Edit: I removed the large blocks of code; they don't seem to be necessary.

Related

subsetting very large files - python methods for optimal performance

I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new (index2) file that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1 except it only contains IDs from read1.
I am trying to implement the methods I've read here. But I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the associated 3 lines following the ID. 2) unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can someone point me in the right direction? Maybe none of those 5 methods is ideal for this. I don't know any information theory; we have plenty of RAM, so I think holding the data in RAM for searching would be the most efficient, but I'm really not sure.
Here is a sample of what the index looks like (IDs start with #M00347):
#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
If the data in index1 fits in memory, the best approach is to do a single scan of this file and store all the data in a dictionary like this:
{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0":["CCTAAGGTTCGG","+","CDDDDFFFFFCB"],
"#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0":["CGCCATGCATCC","+","BBCCBBFFFFFF"],
..... }
The values can also be stored as a single formatted string, if you prefer.
After this, you can do a single scan of read1, and whenever an ID is encountered you can do a simple lookup in the dictionary to retrieve the data you need.
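The two scans described above could be sketched as follows. This assumes every record is exactly 4 lines and that an ID line in read1 matches the corresponding ID line in index1 character for character; read_records and subset_index are hypothetical names chosen for illustration.

```python
def read_records(path):
    """Yield (id_line, [next three lines]) for a file made of 4-line records."""
    with open(path) as handle:
        while True:
            block = [handle.readline().rstrip("\n") for _ in range(4)]
            if not block[0]:  # an empty ID line means end of file
                break
            yield block[0], block[1:]


def subset_index(index1_path, read1_path, index2_path):
    # Single scan of index1: ID -> the three lines that follow it.
    records = dict(read_records(index1_path))
    # Single scan of read1: look up each ID and write the matching index1 record.
    with open(index2_path, "w") as out:
        for read_id, _ in read_records(read1_path):
            if read_id in records:
                out.write("\n".join([read_id] + records[read_id]) + "\n")
```

The output follows the order of read1. If index2 has to follow the order of index1 instead, one option is to load only the IDs of read1 into a set and filter index1 while streaming through it.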

Caesar cipher with multiple values

I'm new to Stack Overflow.
I'm trying to create a Caesar cipher that can take multiple inputs for shifts. I have all my shifts stored in a list. The message the user is attempting to encrypt is also stored in a list like so:
["Hello this is what the message looks like.", "It is stored in a list like so.", "It is pretty neet."]
The shift amount must start over on the next "line". So if the last e in like is shifted by 4 in a shift sequence of
[1,2,3,4,5],
the I in It would be shifted by 1. I tried using a for loop to search for each letter, then using ord() to obtain the character code and adding the shift, but that did not work. I suspect this is because my shift values are held in a list. How would I approach this problem?
***EDIT*** I am reposting this topic because my previous topic was closed as a "duplicate". However, I don't think it is a duplicate. The linked topic involved a shift amount that is always constant. My program requires a shift amount that varies depending on what the user inputs. My program also has a list containing the shift values. I cannot use some operations because it is a list.
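One possible approach, sketched under two assumptions that the question leaves open: only letters consume a shift value (spaces and punctuation pass through unchanged), and the shift sequence wraps around when a line has more letters than there are shifts. The name encrypt_lines is hypothetical.

```python
from itertools import cycle


def encrypt_lines(lines, shifts):
    """Shift each letter by the next value in shifts, restarting on every line."""
    encrypted = []
    for line in lines:
        shift_values = cycle(shifts)  # a fresh cycle restarts the sequence per line
        out = []
        for ch in line:
            if ch.isascii() and ch.isalpha():
                base = ord("A") if ch.isupper() else ord("a")
                out.append(chr((ord(ch) - base + next(shift_values)) % 26 + base))
            else:
                out.append(ch)  # spaces and punctuation pass through unshifted
        encrypted.append("".join(out))
    return encrypted


print(encrypt_lines(["abc", "It"], [1, 2, 3]))  # ['bdf', 'Jv']
```

Holding the shifts in a list is not a problem in itself; the list only needs to be walked in step with the letters, which is what itertools.cycle does here.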

A more efficient way of finding value in dictionary and its position

I have a dictionary which contains (roughly) 6 elements, each of which looks like the following:
What I want to do is find a particular domain (which I pass to a method) and, if it exists, store the keyword and its position within an object. I have tried the following:
def parseGoogleResponse(response, website):
    i = 0
    for item in response['items']:
        if item['formattedUrl'] == website:
            print i
            break
        i += 1
This approach seems a bit tedious, and i also stays the same at i = 10; I'm pretty sure there is a more efficient way. I also have to take into account that if the website is not found the first time, the API is queried for up to 5 pages, each containing 6 search results, so I somehow have to calculate the position if it is on a different page.
Any ideas?
Dictionaries in Python are not ordered. There is no way to find something's position in a dictionary, unlike list type objects.
You can rather easily check for the existence of a value in the dictionary with something like:
if website in response['items'].values():
    # If you enter this section, you know it's in the dictionary
    pass
else:
    # If you end up here, it isn't in the dictionary
    pass
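If response['items'] is in fact a list of result dictionaries, as the loop in the question suggests, the position can be tracked with enumerate. A rough sketch, assuming the responses for all queried pages have been collected into a list in page order; find_position, pages and results_per_page are hypothetical names.

```python
def find_position(pages, website, results_per_page=6):
    """Return the 1-based overall rank of website, or None if it is absent."""
    for page_index, response in enumerate(pages):
        for i, item in enumerate(response.get("items", [])):
            if item.get("formattedUrl") == website:
                # Results on earlier pages count towards the overall position.
                return page_index * results_per_page + i + 1
    return None


pages = [
    {"items": [{"formattedUrl": "site%d.example" % n} for n in range(6)]},
    {"items": [{"formattedUrl": "other.example"}, {"formattedUrl": "target.example"}]},
]
print(find_position(pages, "target.example"))  # 8
```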

Speed/structure optimization for a recursive tree

Edit -> short version:
In Python, unlike in C, if I pass a parameter to a function (say, a dict), the changes made within the function call are reflected outside (as if I had passed a pointer instead of just the value).
I want to avoid this so:
-> I make a copy of my dict and pass the copy to my function
But the values of my dict can themselves be dicts, and this goes on to an indefinite depth
-> the recursive copy takes very long.
Question: what is a pythonic way to go about this?
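To illustrate the behaviour described above: Python passes a reference to the same object, so a function can mutate the caller's dict. The standard library's copy.deepcopy makes a recursive copy, while dict.copy() only copies the top level. This is only a small demonstration (mark_white is a hypothetical helper); it does not claim that deepcopy is fast enough for a large tree.

```python
import copy


def mark_white(tree):
    tree["w"]["guess"] = 1  # mutates whatever object was passed in


original = {"w": {}, "b": 0, "n": {}}
shallow = original.copy()        # copies the top level only
deep = copy.deepcopy(original)   # copies every nested level

mark_white(deep)
print(original)  # {'w': {}, 'b': 0, 'n': {}}
mark_white(shallow)
print(original)  # {'w': {'guess': 1}, 'b': 0, 'n': {}}
```

The second call changes original because the shallow copy still shares the nested "w" dict with it.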
Long version:
I'm coding a master-mind playing robot with a n-digit code in Python.
You try to guess the code and for each try you get an answer in terms of how many white/black/none you have, meaning resp. "good digit good position"/"good digit wrong position"/"wrong digit" (but you don't know to which digit the whites/blacks/none refer)
I analyze the answers and build a tree of possibilities with a dictionary storing white/black/none.
I store a map of the possible positions of the numbers 0-9 within the code (a digit can appear more than once) in a list.
Ex: for a 3-digit game I will have [[x,y1,y2,y3][-1,0,1,4][...][...][][][][][][]] with:
x: the total number of times this digit appears in the code (the default value being n+1, i.e. 4 in the example), with positive meaning "exactly" and negative meaning "at least"
y1,y2,..,yn: the positions within the code: 1 means I know the digit is in this position, 0 means I know it's not, and 4 (or anything else) is the default
In my example: I know that '1' appears at least once in the code (-1), that it is present in position 2, that it is NOT present in position 1, and that position 3 is still hypothetically possible.
While I explore my tree of possibilities, I update this list. Which means that each branch of the tree will have its own copy of the list.
Since I recently discovered that, unlike in C, when I pass my list to a sub-method, any change made to it within the sub will reflect on the list outside, I manually copy my list each time with a small method:
def bak_symb(_s):
    _b = [[z for z in _s[i]] for i in xrange(10)]
    return _b
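A possible alternative for the copy, sketched in Python 3 and assuming the inner lists hold only integers, so that one level of copying is enough. Slicing each row is typically faster than rebuilding it element by element, though it is worth timing on the real data.

```python
def bak_symb(symbols):
    """Copy a list of lists of ints one level deep."""
    return [row[:] for row in symbols]


symbols = [[4, 4, 4, 4] for _ in range(10)]
backup = bak_symb(symbols)
backup[1][0] = -1    # changing the copy...
print(symbols[1])    # [4, 4, 4, 4]  ...leaves the original untouched
```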
Now, I profiled my program and noticed that 90% of the time is spent either in append() (the branches of my tree are nested dictionaries {w:{},b:0,n:{}} to which I append each branch of possibilities that I explore; for each branch, the program has to find an n-digit code) or in my copying function.
So I have three questions.
Is there a way to make this function faster?
Is there something better suited than the structures I chose (a 2-level list for the symbols and a nested dict for the hypotheses)?
Is there a more adequate way of doing this than building this huge tree?
All comments and remarks are welcome.
I'm self-taught and might have missed some obvious pythonic way of doing things.
Last but not least, I tried to find a good compromise between keeping this short and keeping it clear; again, don't hesitate to ask for more details.
Thanks in advance,
Matt

Compare values of specific keys in 2 different dictionaries to see if they are equal

I am trying to see if the values of specific keys in 2 different dictionaries are equal (which they are at certain times). If so, I want to print both dictionary values. If not, I only want to print the dict2 values. It should be noted that the keys are two different strings, but I only want to compare the values of those specific keys.
General idea of code
for key1,val in enumerate(dict1):
    KEY1printed=False
    for key2,val in enumerate(dict2):
        if dict1['key1'][val] == dict2['key2'][val]:
            if KEY1printed == False:
                print dict2['key2'][val], dict1['key1'][val]
                KEY1printed=True
        else:
            print dict2['key2'][val]
I get a "TypeError: list indices must be integers, not str" on the line if dict1['key1'][val] == dict2['key2'][val]. However, earlier in the code I append the values to the lists stored under those keys as integers (see the code example below). I'm not sure what to do about the problem. Any suggestions?
Code example of val append
for line in indocument:
    val=line.split(",")
    dict1['key1'].append(int(val[2]))
    dict2['key2'].append(int(val[0]))
I have been having issues with types, because Microsoft Excel 2010 won't save the cell format when I change it from General to Number. When I open the file in OpenOffice, the cell format is saved, but there are too many rows in my file, so I have to split my document. I don't mind having to do this, but I think it might be messing with my Python code.
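The TypeError most likely comes from enumerate(dict1), which yields (index, key) pairs: val ends up holding the key string 'key1' and is then used as a list index. A minimal sketch of one fix, written in Python 3 and assuming the two lists are meant to be compared position by position (the question does not say so explicitly); compare_columns is a hypothetical name.

```python
def compare_columns(values1, values2):
    """Pair values by position; keep both when equal, else only the second."""
    lines = []
    for v1, v2 in zip(values1, values2):
        if v1 == v2:
            lines.append((v2, v1))
        else:
            lines.append((v2,))
    return lines


dict1 = {"key1": [3, 5, 7]}
dict2 = {"key2": [3, 6, 7]}
for line in compare_columns(dict1["key1"], dict2["key2"]):
    print(*line)
```

Since the values are already converted with int() when they are appended, the spreadsheet's cell format should not matter as long as the exported text contains plain numbers.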
