How to refine a Python script for a bioinformatics query

I am quite new to Python and would be grateful for some assistance. I am comparing the genomes of two closely related organisms (E_C and E_F) and trying to identify some basic insertions and deletions. I have run a FASTA pairwise alignment (glsearch36) using sequences from both organisms.
Below is the section of my Python script where I have been able to identify a 7-nucleotide sequence (heptamer) in one sequence (database) that corresponds to a gap in the other sequence (query). This is an example of what I have:
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG
Assume the gap is at position 9. I am trying to refine the script to select only gaps that are 20 nucleotides or more apart on both sequences, and only where the surrounding nucleotides also match:
ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC
This is the relevant section of my script; the top half (not shown) deals with opening different files. At the end it also prints a dictionary with the count of each sequence.
import re  # imported with the file-handling code in the top half

list_of_positions = []
# Record the position of every gap ("-") in the E_C sequence
for match in re.finditer(r'(?=(%s))' % re.escape("-"), dict_seqs[E_C]):
    list_of_positions.append(match.start())
set_of_positions = set(list_of_positions)

for position in list_of_positions:
    # Collect the 20 positions on either side of the current gap
    list_no_indels = []
    for number in range(position - 20, position):
        list_no_indels.append(number)
    for number in range(position + 1, position + 21):
        list_no_indels.append(number)
    set_no_indels = set(list_no_indels)
    # Skip this gap if another gap falls within 20 nt in either sequence
    # (set_of_positions_EF is built the same way from the E_F sequence)
    if len(set_no_indels.intersection(set_of_positions)) > 0:
        continue
    if len(set_no_indels.intersection(set_of_positions_EF)) > 0:
        continue
    print(position)
    # The six nucleotides of the E_F sequence surrounding the gap
    print(dict_seqs[E_F][position - 3:position + 3])
    key = dict_seqs[E_F][position - 3:position + 3]
    if key in nt_dict:
        nt_dict[key] += 1
    else:
        nt_dict[key] = 1

print(nt_dict)
Essentially, I am editing the results of pairwise alignments to identify the nucleotides opposite the gaps in both the query and database sequences, in order to conduct some basic insertion/deletion analysis.
I was able to solve one of my earlier issues by requiring gaps ("-") to be at least 20 nt apart, in an attempt to reduce noise; this has improved my results. The script above has been edited accordingly.
This is an example of my results; at the end I have a dictionary that counts the occurrences of each sequence.
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)
ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)
However, I am still trying to fix the script so that the nucleotides around the gap must match exactly, as in the following example, where the | marks matching nucleotides on the two sequences:
GGTTACCG-ACCTGTATGTGAACTCAACA # query
||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database
9
ACCGAC
Any help with this would be greatly appreciated!

I think I understand what you are trying to do, but as @alko has said, comments in your code would definitely help a lot.
As for finding an exact match around the gap, you could run a string comparison, something along the lines of:
if (query[position - 3:position] == database[position - 3:position]
        and query[position + 1:position + 3] == database[position + 1:position + 3]):
    # Do something
You will need to replace query and database with the names you have given the strings you want to compare.
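Folding that check into your main loop could look something like this (a rough sketch: flanks_match and its flank parameter are hypothetical names, and the two sequences come from your dict_seqs dictionary):

def flanks_match(query, database, position, flank=3):
    """True if the `flank` nucleotides on each side of the gap at
    `position` are identical in the two aligned sequences."""
    if position < flank or position + 1 + flank > len(query):
        return False  # too close to either end of the alignment
    return (query[position - flank:position] == database[position - flank:position]
            and query[position + 1:position + 1 + flank]
                == database[position + 1:position + 1 + flank])

# In the main loop, only count gaps whose flanking nucleotides match:
# if flanks_match(dict_seqs[E_C], dict_seqs[E_F], position):
#     key = dict_seqs[E_F][position - 3:position + 3]
#     ...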


EntwicklerHeld Transposition Cipher

I'm trying to improve my coding skills on entwicklerheld.de, and right now I'm working on the transposition cipher challenge:
We consider a cipher in which the plaintext is written downward and diagonally in successive columns. The number of rows, or rails, is given. When reaching the lowest rail, we traverse diagonally upwards, and after reaching the top rail the direction changes again, so the letters of the message are written in a zigzag pattern. After every letter has been written, the individual lines are combined to obtain the cipher text.
Given the plain text "coding" and the number of rails 2, the plain text is arranged in a zigzag pattern as described above. The encoded text is obtained by combining the lines one after the other.
Thus, the encrypt() function should return the cipher "cdnoig".
The same procedure is used for entire sentences or texts as for individual words; the only thing to note is that spaces also count as single characters.
Given the plain text "rank the code" and the number of rails 2, your function should return the cipher "rn h oeaktecd".
This should work with other examples with 2 rails as well.
The encryption is very easy with a multidimensional array.
My question
I'm stuck at the decryption part.
My idea is to build an array of 0s and 1s (to mark where a character has to go), then fill each row (line 1, line 2, line 3, ...) with the characters in the order of the cipher text, and finally iterate over the array a third time to read the word off in zigzag order.
It feels very strange to iterate three times over the array, though. Is there a dedicated zigzag algorithm for this?
You could first define a generator that yields, for each index, the index the character has to be taken from during encryption. This generator does not need the plain text itself, just its length, and since it only produces indices it can be used for decryption as well.
It was not clear to me whether the question is only about the case where the number of rails is 2; with a bit of extra logic, this can be made to work for any greater number of rails as well.
Here is how that could look:
# This generator can be used for encryption and decryption:
def permutation(size, numrails):
    period = numrails * 2 - 2
    yield from range(0, size, period)  # top rail
    # Following yield-from statement only needed when number of rails > 2
    yield from (
        index
        for rail in range(1, numrails - 1)
        for pair in zip(range(rail, size, period),
                        range(rail + period - rail*2, size + period, period))
        for index in pair
        if index < size
    )
    yield from range(numrails - 1, size, period)  # bottom rail

def encrypt(plain, numrails):
    n = len(plain)
    return "".join([plain[i] for i in permutation(n, numrails)])

def decrypt(encrypted, numrails):
    n = len(encrypted)
    plain = [None] * n
    for source, target in enumerate(permutation(n, numrails)):
        plain[target] = encrypted[source]
    return "".join(plain)

I am getting a time limit error on the competition platform for my code

I am trying to solve a problem (given below) on an online coding platform with the code below, but the platform reports a time limit error.
QUESTION:
Tahir and Mamta are working on a project at TCS. Tahir, being a problem solver, came up with an interesting problem for his friend Mamta.
The problem consists of a string of length N containing only lower-case letters.
It is followed by Q queries, each containing an integer P (1 <= P <= N) denoting a position within the string.
Mamta's task is to find the letter at that position and determine the number of occurrences of the same letter preceding position P.
Mamta is busy with her office work, so she has asked you to help her.
Constraints
1 <= N <= 500000
S consists of lower-case letters
1 <= Q <= 10000
1 <= P <= N
Input Format
First line contains an integer N, denoting the length of the string.
Second line contains the string S itself, consisting of lower-case letters only ('a' - 'z').
Third line contains an integer Q, denoting the number of queries.
Next Q lines each contain an integer P (1 <= P <= N), for which you need to find the number of occurrences, before position P, of the character at the Pth position.
Output
For each query, print an integer denoting the answer on single line.
Time Limit
1
My code:
n = int(input())
a = input()[:n]
for i in range(int(input())):
    p = int(input())
    print(a[:p-1].count(a[p-1]))
Sample Input/Output:
Example 1
Input
9
abacsddaa
2
9
3
Output
3
1
Explanation
Here Q = 2
For P=9, the character at the 9th position is 'a'. The number of occurrences of 'a' before P=9 is three.
Similarly, for P=3, the 3rd character is 'a'. The number of occurrences of 'a' before P=3 is one.
I'm new to Python so please help me with the error.
Your solution scans the prefix of the string for every query, so each query costs O(N) and the whole program is O(N·Q), which causes the time limit error.
You should use the prefix sum array technique: precompute, for each letter, how many times it occurs before each position, so that every query becomes an O(1) lookup.
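A minimal sketch of that idea, reading the input in the format described above. (This compact variant of the prefix-sum approach stores, for each index, how many times that index's own character has already appeared, so each query is an O(1) lookup.)

import sys

def main():
    data = sys.stdin.read().split()
    n = int(data[0])
    s = data[1]
    q = int(data[2])
    # cnt_before[i] = occurrences of s[i] in s[:i], built in one O(N) pass
    running = [0] * 26
    cnt_before = [0] * n
    for i, ch in enumerate(s):
        c = ord(ch) - ord('a')
        cnt_before[i] = running[c]
        running[c] += 1
    # Each query is now a single list lookup
    out = []
    for k in range(q):
        p = int(data[3 + k])  # P is 1-based
        out.append(str(cnt_before[p - 1]))
    sys.stdout.write("\n".join(out) + "\n")

main()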

Fastest way to sort string to match second string - only adjacent swaps allowed

I want to get the minimum number of letter-swaps needed to convert one string to match a second string. Only adjacent swaps are allowed.
Inputs are: length of strings, string_1, string_2
Some examples:
Length | String 1 | String 2 | Output
-------+----------+----------+-------
3 | ABC | BCA | 2
7 | AABCDDD | DDDBCAA | 16
7 | ZZZAAAA | ZAAZAAZ | 6
Here's my code:
def letters(number, word_1, word_2):
    result = 0
    while word_1 != word_2:
        index_of_letter = word_1.find(word_2[0])
        result += index_of_letter
        word_1 = word_1.replace(word_2[0], '', 1)
        word_2 = word_2[1:]
    return result
It gives the correct results, but the calculation needs to stay under 20 seconds.
Here are two sets of input data (strings 1,000,000 characters long): https://ufile.io/8hp46 and https://ufile.io/athxu.
On my setup the first one executes in around 40 seconds and the second in 4 minutes.
How can I calculate the result in less than 20 seconds?
@KennyOstrom's answer is 90% there. The inversion count is indeed the right angle from which to look at this problem.
The only bit that is missing is that we need a "relative" inversion count: the number of inversions needed to reach not normal sorted order but the other word's order. We therefore need to compute the permutation that stably maps word1 to word2 (or the other way round), and then compute its inversion count. Stability is important here, because there will obviously be lots of non-unique letters.
Here is a numpy implementation that takes only a second or two for the two large examples you posted. I did not test it extensively, but it agrees with @trincot's solution on all test cases. For the two large pairs it finds 1819136406 and 480769230766.
import numpy as np

_, word1, word2 = open("lit10b.in").read().split()

# Pad both words with 'Z' up to a power-of-two length so the merge
# passes below always work on equal halves
word1 = np.frombuffer(word1.encode('utf8')
                      + (((1 << len(word1).bit_length()) - len(word1)) * b'Z'),
                      dtype=np.uint8)
word2 = np.frombuffer(word2.encode('utf8')
                      + (((1 << len(word2).bit_length()) - len(word2)) * b'Z'),
                      dtype=np.uint8)

n = len(word1)

# Stable argsorts give the permutation that maps word1 onto word2
o1 = np.argsort(word1, kind='mergesort')
o2 = np.argsort(word2, kind='mergesort')
o1inv = np.empty_like(o1)
o1inv[o1] = np.arange(n)
order = o2[o1inv]

# Count the inversions of `order` with bottom-up merge passes
sum_ = 0
for i in range(1, len(word1).bit_length()):
    order = np.reshape(order, (-1, 1 << i))
    oo = np.argsort(order, axis=-1, kind='mergesort')
    ioo = np.empty_like(oo)
    ioo[np.arange(order.shape[0])[:, None], oo] = np.arange(1 << i)
    order[...] = order[np.arange(order.shape[0])[:, None], oo]
    hw = 1 << (i - 1)
    sum_ += ioo[:, :hw].sum() - order.shape[0] * (hw - 1) * hw // 2

print(sum_)
Your algorithm runs in O(n²) time:
The find() call takes O(n) time.
The replace() call creates a complete new string, which takes O(n) time.
The outer loop executes O(n) times.
As others have stated, this can be solved by counting inversions using merge sort, but in this answer I try to stay close to your algorithm, keeping the outer loop and result += index_of_letter, but changing the way index_of_letter is calculated.
The improvement can be done as follows:
1. Preprocess the word_1 string and note the first position of each distinct letter in word_1 in a dict keyed by those letters, linking each letter to its next occurrence. It is most efficient to create one list for this, the size of word_1, where each index stores the index of the next occurrence of the same letter; this gives a linked list for each distinct letter. This preprocessing can be done in O(n) time, and with it you can replace the find call with an O(1) lookup. Each time you do the lookup, you remove the matched letter from its linked list, i.e. the head index in the dict moves on to the next occurrence.
2. The previous change yields the absolute index, without accounting for the letter removals your algorithm performs, so on its own it would give wrong results. To solve that, build a binary tree (also preprocessing), where each node represents an index in word_1 and gives the actual number of non-deleted letters preceding a given index (including itself, if not yet deleted). The nodes in the binary tree are never deleted (that might be an idea for a variant solution), but their counts get adjusted to reflect the deletion of a character: at most O(log n) nodes need a decrement per deletion. Apart from that, no string is ever rebuilt as it is with replace. This binary tree can be represented as a list whose entries correspond to the nodes in in-order sequence, each value being the number of non-deleted letters preceding that node (including itself).
The initial binary tree can be pictured as a complete binary tree whose nodes carry counts: the numbers in the nodes reflect the number of nodes at their left side, including themselves. These counts are stored in the numLeft list, and another list, parent, precalculates the index at which each node's parent is located.
The actual code could look like this:
def letters(word_1, word_2):
    size = len(word_1)  # No need to pass size as argument
    # Create a binary tree for word_1, organised as a list
    # in in-order sequence, and with the values equal to the number of
    # non-matched letters in the range up to and including the current index:
    treesize = (1 << size.bit_length()) - 1
    numLeft = [(i >> 1 ^ ((i + 1) >> 1)) + 1 for i in range(0, treesize)]
    # Keep track of parents in this tree (could probably be simpler, I welcome comments).
    parent = [(i & ~((i ^ (i + 1)) + 1)) | (((i ^ (i + 1)) + 1) >> 1)
              for i in range(0, treesize)]
    # Create a linked list for each distinct character
    next = [-1] * size
    head = {}
    for i in range(len(word_1) - 1, -1, -1):  # go backwards
        c = word_1[i]
        # Add index at front of the linked list for this character
        if c in head:
            next[i] = head[c]
        head[c] = i
    # Main loop counting number of swaps needed for each letter
    result = 0
    for i, c in enumerate(word_2):
        # Extract next occurrence of this letter from linked list
        j = head[c]
        head[c] = next[j]
        # Get number of preceding characters with a binary tree lookup
        p = j
        index_of_letter = 0
        while p < treesize:
            if p >= j:  # On or at right?
                numLeft[p] -= 1  # Register that a letter has been removed at left side
            if p <= j:  # On or at left?
                index_of_letter += numLeft[p]  # Add the number of left-side letters
            p = parent[p]  # Walk up the tree
        result += index_of_letter
    return result
This runs in O(n log n), where the log n factor comes from the upward walk in the binary tree.
I tested on thousands of random inputs, and the above code produces the same results as your code in all cases, but it runs a lot faster on the larger inputs.
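For instance, running it on the small cases from the question's table should reproduce the expected outputs:

print(letters("ABC", "BCA"))          # 2
print(letters("AABCDDD", "DDDBCAA"))  # 16
print(letters("ZZZAAAA", "ZAAZAAZ"))  # 6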
I am going by the assumption that you just want to find the number of swaps quickly, without needing to know exactly what to swap.
Google how to count inversions. It is often taught with merge sort, and several of the results are on Stack Overflow, like Merge sort to count split inversions in Python.
The inversion count is the number of adjacent swaps needed to reach a sorted string.
Count the inversions in string 1.
Count the inversions in string 2.
Error edited out here; see the correction in the correct answer. I would normally just delete a wrong answer, but this one is referenced by the correct answer.
It makes sense, and it happens to work for all three of your small test cases, so I'm going to just assume this is the answer you want.
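For reference, a standard merge-sort inversion counter looks roughly like this (a minimal sketch; the week1.count_inversions helper used below is my own and its handling of tied letters may differ):

def count_inversions(seq):
    # Count pairs (i, j) with i < j and seq[i] > seq[j] via merge sort.
    def sort_count(a):
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort_count(a[:mid])
        right, inv_right = sort_count(a[mid:])
        merged = []
        inv = inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # Everything remaining in `left` is greater than right[j],
                # so each remaining left element forms one inversion.
                merged.append(right[j])
                j += 1
                inv += len(left) - i
        merged += left[i:] + right[j:]
        return merged, inv
    return sort_count(list(seq))[1]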
Using some code that I happen to have lying around from retaking some algorithms courses on free online platforms (for fun):
print(week1.count_inversions('ABC'), week1.count_inversions('BCA'))
print(week1.count_inversions('AABCDDD'), week1.count_inversions('DDDBCAA'))
print(week1.count_inversions('ZZZAAAA'), week1.count_inversions('ZAAZAAZ'))
0 2
4 20
21 15
That lines up with the values you gave above: 2, 16, and 6.

How to use a regex to search for contiguous incrementing sequences

I would like to use a regex to speed up my searches for specific records within a large binary image. Regex searches seem to consistently outperform my own search methods, which is why I'm looking into this. I have already implemented the following, which works, but is not very fast.
My binary image is loaded into a NumPy memmap as 32-bit words.
I_FILE = np.memmap(opts.image_file, dtype='uint32', mode='r')
And here is the start of my current search loop (which works):
for i in range(0, FILESIZE - 19):
    if (((I_FILE[i] + 1 == I_FILE[i + 19]) or (I_FILE[i - 19] + 1 == I_FILE[i]))
            and I_FILE[i] < 60):
        ...do stuff...
This seeks out records that are 19 words long and start with a decimal sequence number between 0 and 59. It looks for an incrementing sequence number in the record either before or after the current search location to validate the record.
I've seen a few examples where folks have crafted variables into patterns using re.escape (like this: How to use a variable inside a regular expression?), but I can't figure out how to search for a changing value sequence.
I managed to make it work with a regex, but it was a bit more complicated than I expected. The regex expressions look for two values between 0 and 59, stored as little-endian 32-bit words and separated by 72 bytes (18 words). I used two regex searches to ensure that I wouldn't miss records at the end of a sequence:
# First search uses the lookahead assertion to not consume large amounts of data.
SearchPattern1 = re.compile(b'[\0-\x3B]\0\0\0(?=.{72}[\1-\x3B]\0\0\0)', re.DOTALL)
# Again using the positive lookbehind assertion (?<= ... ) to grab the ending entries.
SearchPattern2 = re.compile(b'(?<=[\0-\x3B]\0\0\0.{72})[\1-\x3B]\0\0\0', re.DOTALL)
Next, perform both searches and combine the results.
HitList1 = [m.start(0) for m in SearchPattern1.finditer(I_FILE)]
HitList2 = [m.start(0) for m in SearchPattern2.finditer(I_FILE)]
AllHitList = list(set(HitList1 + HitList2))
SortedHitList = sorted(AllHitList)
Now I run a search that has the same conditions as my original solution, but it runs on a much smaller set of data!
for i in range(0, len(SortedHitList)):
    TestLoc = SortedHitList[i]
    if (I_FILE[TestLoc] + 1 == I_FILE[TestLoc + 19]) or (I_FILE[TestLoc - 19] + 1 == I_FILE[TestLoc]):
        ... do stuff ...
The result was very successful: the original solution took 58 seconds to run on a 300 MB binary file, while the new regex solution took only 2 seconds!

Calculating Length of Sequences from .PBS File

I am new here and looking for help with a bioinformatics-type task: calculating the total length of all the sequences in a .pbs file.
The file, when opened, displays something like:
The Length is 102
The Length is 1100
The Length is 101
The Length is 111200
The Length is 102
I see that the lengths are given as a list of lines mixing letters and numbers. I need help figuring out what Python code to write to add all the lengths together. The lengths are not all the same.
So far my code is:
f = open('lengthofsequence2.pbs.o8767272', 'r')
lines = f.readlines()
f.close()

def lengthofsequencesinpbsfile(i):
    for x in i:
        if
            return x +=
print(lengthofsequencesinpbsfile(lines))
I am not sure what to do with the for loop. I just want to sum the numbers that come after the phrase "The Length is".
Thank You!
"The Length is " has 14 characters so line[14:] will give you the substring corresponding to the number you are after (starting after the 14th character), you then just have to convert it to int with int(line[14:]) before adding to your total: total += int(line[14:])
You need to parse your input to get the data you want to work with.
a. x.replace('The Length is ', '') - this removes the unwanted text.
b. int(x.replace('The Length is ', '')) - this converts the digit characters to an integer.
Then add to a total: total += int(x.replace('The Length is ', ''))
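Combined with the lines list from the question's code, a minimal sketch of those steps:

total = 0
for x in lines:
    if 'The Length is ' in x:
        total += int(x.replace('The Length is ', ''))
print(total)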
All of this is directly accessible using Google; I looked for Python string functions and type-conversion functions. I've only looked briefly at Python and never programmed with it, but I think these two items should help you do what you want.
