Efficient Way to Count K-mers in O(k*N + k*Q)?

I have a string of lowercase letters (a DNA string) and a list of queried k-mers. I need to find how many times each queried k-mer appears in the string. The catch is that I need to output the counts in the order in which the k-mers are asked. Another catch is that the same k-mer may be asked more than once, so its count may need to be output more than once. I need to accomplish this in O(kN + kQ), where k is the length of a k-mer, N is the length of the DNA string and Q is the number of queried k-mers.
For example, for the following input, where N=7, k=3, Q=5, aaabaab is the DNA string and the next 5 lines are the k-mers of interest:
7 3 5
aaabaab
aaa
aab
aaa
baa
xyz
I would expect to output the following:
aaa 1
aab 2
aaa 1
baa 1
xyz 0
Note that aaa is asked twice!
I have a list of the Q k-mers. I also have a dictionary that maps each k-mer to its count (the dictionary can have fewer than Q entries, because queries may repeat). With a for-loop, I iterate through the DNA string one character at a time while keeping track of the current k-mer, which takes O(N) iterations. In each iteration, I update the current k-mer by dropping its first letter and appending the current character. To output the answer, I iterate over the list of Q k-mers and look up each count in the dictionary.
import sys
l, n , k, q = [int(x) for x in sys.stdin.readline().strip('\n').split(' ')]
dna = ''
for i in range(l):
    dna += sys.stdin.readline().strip('\n')
mykmer =[]
mycount = {}
for i in range(q):
    kmer = sys.stdin.readline().strip('\n')
    mykmer.append(kmer)
    mycount[kmer]=0
current = dna[0:k]
for j in range(k-1,len(dna)):
    if j != k-1:
        current = current[1:]+str(dna[j])
    if current in mykmer:
        mycount[current] += 1
for x in mykmer:
    print(str(x)+' '+str(mycount[x]))
I get correct answers, but I get timed out!

I would improve your main loop to:
for j in range(len(dna) - k + 1):
    current = dna[j:j+k]
    if current in mycount:
        mycount[current] += 1
Slicing once costs less than repeatedly slicing and appending: current = current[1:]+str(dna[j]) creates up to three new strings per step, whereas dna[j:j+k] creates one.
Use the dictionary you already have, rather than the list, for the membership test. A list lookup scans up to Q entries, so this removes a factor of Q.
The range(len(dna) - k + 1) stops at the last index where a full k-mer still fits, so the loop never considers the truncated slices at the end of the string.
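Putting those points together, here is a minimal, self-contained sketch of the dictionary-based approach. The function name count_kmers and its signature are my own for illustration and are not part of the original code; input parsing is left out.

```python
def count_kmers(dna, k, queries):
    """Return [(kmer, count), ...] in query order; repeated queries repeat."""
    # One dictionary entry per distinct query: O(k*Q) to hash the queries.
    counts = {kmer: 0 for kmer in queries}
    # Slide over every window of length k: N - k + 1 windows, O(k) each.
    for j in range(len(dna) - k + 1):
        window = dna[j:j + k]
        if window in counts:          # average O(k) dictionary lookup
            counts[window] += 1
    # Answer in the order the queries were asked, duplicates included.
    return [(kmer, counts[kmer]) for kmer in queries]


dna = "aaabaab"
queries = ["aaa", "aab", "aaa", "baa", "xyz"]
for kmer, count in count_kmers(dna, 3, queries):
    print(kmer, count)
```

Because the output is produced by walking the query list rather than the dictionary, a k-mer that is asked twice is simply printed twice.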

Related

Function that returns the length of the longest run of repetition in a given list

I'm trying to write a function that returns the length of the longest run of repetition in a given list
Here is my code:
def longest_repetition(a):
    longest = 0
    j = 0
    run2 = 0
    while j <= len(a)-1:
        for i in a:
            run = a.count(a[j] == i)
            if run == 1:
                run2 += 1
        if run2 > longest:
            longest = run2
        j += 1
        run2 = 0
    return longest
print(longest_repetition([4,1,2,4,7,9,4]))
print(longest_repetition([5,3,5,6,9,4,4,4,4]))
3
0
The first test call works fine, but the second one is not counting at all and I'm not sure why. Any insight is much appreciated.
Just noticed that the question I was given and the expected results are not consistent. What I'm basically trying to do is find the most repeated element in a list, and the output should be the number of times it is repeated. That said, the output for the second test call should be 4, because the element '4' is repeated four times (the elements are not required to be in one run, as my original question implied).

First of all, let's check whether your code is consistent with your question (a function that returns the length of the longest run of repetition), e.g.:
a = [4,1,2,4,7,9,4]
b = [5,3,5,6,9,4,4,4,4]
(I am assuming you only compare single elements. For example, c = [1,2,3,1,2,3] could be seen as one repetition of the sequence 1,2,3, but I am assuming that is not your goal.)
So:
for a, no value is repeated in adjacent positions, therefore the length equals 0
for b, you have one quadruple repetition of 4, therefore the length equals 4
Start with max_amount_of_repetitions = 0 and current_repetitions_run = 0. To detect a repetition, simply check whether the (n-1)'th and n'th elements have the same value. If so, increment current_repetitions_run; otherwise, reset current_repetitions_run = 0.
The last step is to check whether the current run is the longest so far:
max_amount_of_repetitions = max(max_amount_of_repetitions, current_repetitions_run)
Note that this counts adjacent matches, so four equal values in a row give 3; add 1 to a nonzero result to get the run length of 4.
To make sure both n-1 and n stay within the list range, I'd simply start the iteration from the second element. That way, n-1 is the first element.
for n in range(1,len(a)):
    if a[n-1] == a[n]:
        print("I am sure, you can figure out the rest")
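For reference, here is one possible way to finish that loop. This is a sketch under a slightly different convention from the one described above: it counts elements rather than adjacent matches, so a list with no adjacent duplicates yields 1 (every element is a run of length one) and an empty list yields 0. The name longest_run is hypothetical.

```python
def longest_run(a):
    """Length of the longest run of consecutive equal elements in a."""
    if not a:
        return 0
    longest = current = 1          # the first element is a run of length 1
    for n in range(1, len(a)):
        if a[n - 1] == a[n]:
            current += 1           # the run continues
        else:
            current = 1            # a new run starts at a[n]
        longest = max(longest, current)
    return longest


print(longest_run([4, 1, 2, 4, 7, 9, 4]))        # no adjacent duplicates
print(longest_run([5, 3, 5, 6, 9, 4, 4, 4, 4]))  # four 4s in a row
```

If you prefer the convention where a list without adjacent duplicates gives 0, return 0 whenever the result is 1.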

You can use a hash map to count the frequency of each element and then take the maximum of the frequencies.
Using Counter:
from collections import Counter
def longest_repetition(array):
    return max(Counter(array).values())
Another way, without using Counter:
def longest_repetition(array):
    freq = {}
    for val in array:
        if val not in freq:
            freq[val] = 0
        freq[val] += 1
    values = freq.values()
    return max(values)

Matching the first element of a list with other first elements in the list

I am trying to solve the question given in this video https://www.youtube.com/watch?reload=9&v=XCeDBWI4sa4
My list contains sub-lists, each holding the digits of one number as strings.
Example: I turned my list of strings ['58','12','50','17'] into four sub-lists, [['5','8'],['1','2'],['5','0'],['1','7']], because I want to compare the first digit of each number. If the first digits of two numbers are equal, I increment the variable "pair", which starts at 0 (pair=0).
Since 58 and 50 have the same first digit, they constitute a pair; the same goes for 12 and 17. Also, a pair can only be made if both numbers are at even positions or both are at odd positions. 58 and 50 are at even indices, hence they satisfy the condition. Finally, at most two pairs can be made for the same first digit, so 51, 52, 53 would constitute only 2 pairs instead of three. How do I check this? A simple solution would be appreciated.
list_1=[['5','8'],['1','2'],['5','0'],['1','7']]
test_list= ['58','12','50','17']
for i in range(0,len(test_list)):
    for j in range(1,len(test_list)):
        if (list_1[i][0] == list_1[j][0] and (i,j%2==0 or i,j%2==1)):
            pair =pair+1
print (pair)
That is what I came up with but I am not getting the desired output.

pair = 0
val_list = ['58','12','50','17', '57', '65', '51']
first_digit, visited_item_list = list(), list()
for item in val_list:
    curr = int(item[0])
    first_digit.append(curr)
for item in first_digit:
    if item not in visited_item_list:
        occurences = first_digit.count(item)
        if occurences % 2 == 0:
            pair = pair + occurences // 2
        visited_item_list.append(item)
print(pair)

Use collections.Counter to count the occurrences of each first digit. Sum the counts and subtract the number of unique digits, so that the first occurrence of each digit is not counted.
The even and odd positions are handled separately.
Uncomment #return sum(min(x,2) for x in c) - len(c) if you never want a count above 2 for duplicates of a digit. For example, [51,52,53,54,56,57,58,59,50,...] will then return 2 (1 from the even positions plus 1 from the odd positions), no matter how many more 5X you add, because min(x,2) guarantees that no count exceeds 2.
from collections import Counter
a = ['58','12','50','17','50','18']
def dupes(a):
    c = Counter(a).values() # count instances of each element in a, get the counts
    #return sum(min(x,2) for x in c) - len(c) # cap each count at 2
    return sum(c) - len(c) # sum up all the counts, subtract unique elements (you want the counts starting from 0)
even = dupes(a[x][0] for x in range(0, len(a), 2))
# a[x][0]: first digit of even a elements
# range(0, len(a), 2): range of numbers from 0 to length of a, skip by 2 (evens)
# call dupes(first digits of the even elements)
odd = dupes(a[x][0] for x in range(1, len(a), 2))
# same for odd
print(even+odd)

Here's a fairly simple solution:
import collections
l= [['5','8'],['1','2'],['5','0'],['1','7']]
c = collections.Counter([i[0] for i in l])
# Counter counts the occurrences of items in a list (or other
# collection). After the previous line, c is
# Counter({'5': 2, '1': 2})
sum([c-1 for c in c.values()])
The output, in this case, is 2.
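None of the snippets above combines the same-parity rule with the cap of two pairs per first digit. Below is a rough sketch that does, under my own reading of the question (which may not match the video exactly): two numbers pair up only if their indices have the same parity, n same-parity numbers with the same first digit form n*(n-1)/2 pairs, and the total per first digit is capped at 2. The function name count_pairs and the cap parameter are hypothetical.

```python
from collections import Counter


def count_pairs(numbers, cap=2):
    """Count same-first-digit pairs among same-parity positions (assumed rules)."""
    # Group by (index parity, first digit) and count the members of each group.
    groups = Counter((i % 2, s[0]) for i, s in enumerate(numbers))
    pairs_per_digit = Counter()
    for (parity, digit), n in groups.items():
        pairs_per_digit[digit] += n * (n - 1) // 2   # pairs inside one group
    # Assumed rule: no first digit may contribute more than `cap` pairs.
    return sum(min(p, cap) for p in pairs_per_digit.values())


print(count_pairs(['58', '12', '50', '17']))
```

For the list in the question this gives one pair for the digit 5 (indices 0 and 2) and one for the digit 1 (indices 1 and 3).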

Backward search implementation python

I am working on some string search tasks to learn more efficient ways of searching.
I am trying to count how many times a substring occurs in a given string by using backward search.
For example, given the following strings:
original = 'panamabananas$'
s = 'smnpbnnaaaaa$a'
s1 = '$aaaaaabmnnnps' #sorted version of s
I am trying to find how many times the substring 'ban' occurs. To do so, I was thinking of iterating through both strings with the zip function. In the backward search, I should first look for the last character of ban (n) in s1 and see where it lines up with the next character, a, in s. It matches at indexes 9, 10 and 11, which are the third, fourth and fifth a in s. The next character to look for is b, but only among the matches found so far (that is, where n in s1 lined up with a in s). So we take those a's (third, fourth and fifth) and check whether the third, fourth or fifth a in s1 lines up with a b in s. If so, we have found an occurrence of 'ban'.
It seems complex to me to iterate and save quasi-occurrences, so what I was trying is something like this:
n = 0 #counter of occurrences
for i, j in zip(s1, s):
    if i == 'n' and j == 'a': # this should save the match
        if i[3:6] == 'a' and any(j[3:6] == 'b'):
            n += 1
I think nested if statements may be needed, but I am still a beginner. I am getting 0 occurrences when there is one occurrence of ban in the original.

You can run a loop with find to count the number of (non-overlapping) occurrences of the substring.
s = 'panamabananasbananasba'
ss = 'ban'
count = 0
idx = s.find(ss, 0)
while idx != -1:
    count += 1
    idx += len(ss)
    idx = s.find(ss, idx)
print(count)
If you really want a backward search, reverse the string and the substring and apply the same mechanism.
s = 'panamabananasbananasban'
s = s[::-1]
ss = 'ban'
ss = ss[::-1]
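The strings s and s1 in the question look like the Burrows-Wheeler transform of 'panamabananas$' and its sorted form, so the backward search described there can also be done directly on those two columns. The following is a minimal sketch of that idea, not code from the question or the answer; the function name backward_search_count and the helper tables first_pos and occ are my own, and the occurrence table is built naively for readability rather than speed.

```python
from collections import Counter


def backward_search_count(bwt, pattern):
    """Count occurrences of pattern in the text whose BWT is `bwt`."""
    counts = Counter(bwt)
    # first_pos[c]: index where the block of c starts in the sorted column.
    first_pos, total = {}, 0
    for c in sorted(counts):
        first_pos[c] = total
        total += counts[c]
    # occ[c][i]: how many times c occurs in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    # Half-open range [top, bottom) of rows whose prefix matches so far.
    top, bottom = 0, len(bwt)
    for c in reversed(pattern):        # process the pattern right to left
        if c not in counts:
            return 0
        top = first_pos[c] + occ[c][top]
        bottom = first_pos[c] + occ[c][bottom]
        if top >= bottom:
            return 0                   # no row matches any more
    return bottom - top


print(backward_search_count('smnpbnnaaaaa$a', 'ban'))
```

For 'ban' the range narrows from rows 9..11 (the n block) to rows 3..5 (the third to fifth a) and finally to the single b row, which mirrors the manual walk-through in the question. Unlike the find loop, this also counts overlapping occurrences.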

Python: how to optimize

Suppose I am given a string of length n. For every substring whose first and last characters are the same, I should add 1 to f, and then print the final f.
For example, for "ababaca": f("a")=1, f("aba")=1, f("abaca")=1, but f("ab")=0
n = int(input())
string = list(input())
f = 0
for i in range(n):
    for j in range(n,i,-1):
        temp = string[i:j]
        if temp[0]==temp[-1]:
            f+=1
print(f)
Is there any way I can optimize my code for large strings? I am getting timeouts on many test cases.

You can just count the occurrences of each letter. For example, if the string contains m 'a's, there are m*(m-1)/2 substrings of length at least 2 that start and end with 'a'. Do the same for every letter; the solution is linear.
Add len(string) to the obtained value to account for the single-character substrings and get the final answer.
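As a sketch of that counting argument (the function name count_same_ends is mine, not from the question):

```python
from collections import Counter


def count_same_ends(string):
    """Number of substrings whose first and last characters are equal."""
    total = 0
    for m in Counter(string).values():
        # Choose 2 of the m positions of this letter as the two ends.
        total += m * (m - 1) // 2
    # Every single character is also a substring with equal ends.
    return total + len(string)


print(count_same_ends("ababaca"))
```

For "ababaca" the four a's give 6 substrings, the two b's give 1, and the 7 single characters bring the total to 14.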

Word ranking partial completion [duplicate]

This question already has answers here:
Finding the ranking of a word (permutations) with duplicate letters
(6 answers)
Closed 8 years ago.
I am not sure how to solve this problem within the constraints.
Shortened problem formulation:
"Word" as any sequence of capital letters A-Z (not limited to just "dictionary words").
Consider list of permutations of all characters in a word, sorted lexicographically
Find a position of original word in such a list
Do not generate all possible permutations of a word, since it won't fit in time-memory constraints.
Constraints: word length <= 25 characters; memory limit 1Gb, any answer should fit in 64-bit integer
Original problem formulation:
Consider a "word" as any sequence of capital letters A-Z (not limited to just "dictionary words"). For any word with at least two different letters, there are other words composed of the same letters but in a different order (for instance, STATIONARILY/ANTIROYALIST, which happen to both be dictionary words; for our purposes "AAIILNORSTTY" is also a "word" composed of the same letters as these two). We can then assign a number to every word, based on where it falls in an alphabetically sorted list of all words made up of the same set of letters. One way to do this would be to generate the entire list of words and find the desired one, but this would be slow if the word is long. Write a program which takes a word as a command line argument and prints to standard output its number. Do not use the method above of generating the entire list. Your program should be able to accept any word 25 letters or less in length (possibly with some letters repeated), and should use no more than 1 GB of memory and take no more than 500 milliseconds to run. Any answer we check will fit in a 64-bit integer.
Sample words, with their rank:
ABAB = 2
AAAB = 1
BAAA = 4
QUESTION = 24572
BOOKKEEPER = 10743
examples:
AAAB - 1
AABA - 2
ABAA - 3
BAAA - 4
AABB - 1
ABAB - 2
ABBA - 3
BAAB - 4
BABA - 5
BBAA - 6
I came up with what I think is only a partial solution.
Imagine I have the word JACBZPUC. I sort the word and get ABCCJPUZ, which should be rank 1. I want to find the number of permutations from ABCCJPUZ up to the last word, in alphabetical order, before the first word starting with J.
ex:
for `JACBZPUC`
sorted --> `ABCCJPUZ`
permutations that start with A -> 8!/2!
permutations that start with B -> 8!/2!
permutations that start with C -> 8!/2!
Add the 3 values -> 60480
The other C is disregarded as the permutations would have the same values as the previous C (duplicates)
At this point I have the ranks from ABCCJPUZ to the word right before the word that starts with J
ABCCJPUZ rank 1
...
... 60480 values
...
*HERE*
JABCCPUZ rank 60481 LOCATION A
...
...
...
JACBZPUC rank ??? LOCATION B
I'm not sure how to get the values between Locations A and B:
Here is my code to find the 60480 values
import itertools
def perm(word):
    return len(set(itertools.permutations(word)))
def swap(word, i, j):
    word = list(word)
    word[i], word[j] = word[j], word[i]
    print(word)
    return ''.join(word)
def compute(word):
    if ''.join(sorted(word)) == word:
        return 1
    total = 0
    sortedWord = ''.join(sorted(word))
    beforeFirstCharacterSet = set(sortedWord[:sortedWord.index(word[0])])
    print(beforeFirstCharacterSet)
    for i in beforeFirstCharacterSet:
        total += perm(swap(sortedWord,0,sortedWord.index(i)))
    return total
Here is a solution I found online to solve this problem.
Consider the n-letter word { x1, x2, ... , xn }. My solution is based on the idea that the word number will be the sum of two quantities:
The number of combinations starting with letters lower in the alphabet than x1, and
how far we are into the arrangements that start with x1.
The trick is that the second quantity happens to be the word number of the word { x2, ... , xn }. This suggests a recursive implementation.
Getting the first quantity is a little complicated:
Let uniqLowers = { u1, u2, ... , um } = all the unique letters lower than x1
For each uj, count the number of permutations starting with uj.
Add all those up.
I think I complete step number 1 but not number 2. I am not sure how to complete this part
Here is the Haskell solution...I don't know Haskell =/ and I am trying to write this program in Python
https://github.com/david-crespo/WordNum/blob/master/comb.hs

The idea of finding the number of permutations that begin with a letter before the actual first letter is good. But your calculation:
for `JACBZPUC`
sorted --> `ABCCJPUZ`
permutations that start with A -> 8!/2!
permutations that start with B -> 8!/2!
permutations that start with C -> 8!/2!
Add the 3 values -> 60480
is wrong. There are only 8!/2! = 20160 permutations of JACBZPUC, so the starting position can't be as large as 60480. In your method the first letter is fixed, so you can only permute the seven following letters. So:
permutations that start with A: 7! / 2! == 2520
permutations that start with B: 7! / 2! == 2520
permutations that start with C: 7! / 1! == 5040
-----
10080
You don't divide by 2! to find the permutations beginning with C, because the seven remaining letters are unique; there's only one C left.
Here's a Python implementation:
def fact(n):
    """factorial of n, n!"""
    f = 1
    while n > 1:
        f *= n
        n -= 1
    return f
def rrank(s):
    """Back-end to rank for 0-based rank of a list permutation"""
    # trivial case
    if len(s) < 2: return 0
    order = s[:]
    order.sort()
    denom = 1
    # account for multiple occurrences of letters
    for i, c in enumerate(order):
        n = 1
        while i + n < len(order) and order[i + n] == c:
            n += 1
        denom *= n
    # starting letters alphabetically before current letter
    pos = order.index(s[0])
    # recurse to list without its head; // keeps the result an integer
    return fact(len(s) - 1) * pos // denom + rrank(s[1:])
def rank(s):
    """Determine 1-based rank of string permutation"""
    return rrank(list(s)) + 1
strings = [
    "ABC", "CBA",
    "ABCD", "BADC", "DCBA", "DCAB", "FRED",
    "QUESTION", "BOOKKEEPER", "JACBZPUC",
    "AAAB", "AABA", "ABAA", "BAAA"
]
for s in strings:
    print(s, rank(s))

The second part of the solution you have found is also --I think-- what I was about to suggest:
To go from what you call "Location A" to "Location B", you have to find the position of word ACBZPUC among its possible permutations. Consider that a new question to your algorithm, with a new word that just happens to be one position shorter than the original one.

The words in the alphabetical list between JABCCPUZ, which you know the position of, and JACBZPUC, which you want to find the position of, all start with J. Finding the position of JACBZPUC relative to JABCCPUZ, then, is equivalent to finding the relative positions of those two words with the initial J removed, which is the same as the problem you were trying to solve initially but with a word one character shorter.
Repeat that process enough times and you will be left with a word that contains a single character, C. The position of a word with a single character is known to always be 1, so you can then sum that and all of the previous relative positions for an absolute position.
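The "strip the first letter and repeat" process can also be written as a loop instead of a recursion. This is only a sketch of the same idea under my own naming (word_rank is hypothetical); at each position it counts the permutations of the remaining letters that start with a smaller letter, then removes the current letter and moves on.

```python
from collections import Counter
from math import factorial


def word_rank(word):
    """1-based rank of word among the sorted distinct permutations of its letters."""
    counts = Counter(word)            # letters still available
    remaining = len(word)
    rank = 1
    for ch in word:
        # Product of factorials of the multiplicities of the remaining letters.
        denom = 1
        for m in counts.values():
            denom *= factorial(m)
        # How many remaining letters (with multiplicity) sort before ch.
        smaller = sum(m for c, m in counts.items() if c < ch)
        # Permutations of the remaining letters that start with a smaller letter.
        rank += factorial(remaining - 1) * smaller // denom
        counts[ch] -= 1               # fix ch and continue with the shorter word
        remaining -= 1
    return rank


for w in ["ABAB", "AAAB", "BAAA", "QUESTION", "BOOKKEEPER"]:
    print(w, word_rank(w))
```

The sample ranks from the problem statement (ABAB = 2, AAAB = 1, BAAA = 4, QUESTION = 24572, BOOKKEEPER = 10743) are a convenient check for any implementation.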
