Collapse a list of range tuples into the overlapping ranges - python

I'm looking for the most memory efficient way to solve this problem.
I have a list of tuples representing partial string matches in a sentence:
[(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
The first value of each tuple is the start position for the match, the second value is the length.
The idea is to collapse the list so that only the longest contiguous string match is reported. In this case it would be:
[(0,4), (2,6), (22,6)]
I do not want just the longest range, as in "algorithm to find longest non-overlapping sequences"; I want all the ranges collapsed into the longest ones.
In case you're wondering, I am using a pure Python implementation of Aho-Corasick to match terms in a static dictionary against the given text snippet.
EDIT: Due to the nature of these tuple lists, overlapping but not self-contained ranges should be printed out individually. For example, having the words betaz and zeta in the dictionary, the matches for betazeta are [(0,5),(4,8)]. Since these ranges overlap, but none is contained in the other, the answer should be [(0,5),(4,8)]. I have also modified the input dataset above so that this case is covered.
Thanks!

import operator
lst = [(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort(key=operator.itemgetter(1))
for i in reversed(xrange(len(lst)-1)):
    start, length = lst[i]
    for j in xrange(i+1, len(lst)):
        lstart, llength = lst[j]
        if start >= lstart and start + length <= lstart + llength:
            del lst[i]
            break
print lst
#[(0, 4), (2, 6), (22, 6)]

a = [(0, 2), (1, 2), (0, 4), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
b = [set(xrange(i, i + j)) for i, j in a]
c = b.pop().union(*b)
collapsed = sorted(c)
print collapsed
#Maybe this is useful?:
[0, 1, 2, 3, 22, 23, 24, 25, 26, 27]
#But if you want the requested format, then do this:
d = []
start = collapsed[0]
length = 0
for val in collapsed:
    if start + length < val:
        d.append((start,length))
        start = val
        length = 0
    elif val == collapsed[-1]:
        d.append((start,length + 1))
    length += 1
print d
#Output:
[(0,4), (22,6)]

So, taking you at your word that your main interest is space efficiency, here's one way to do what you want:
lst = [(0, 2), (1, 2), (0, 4), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort()
start, length = lst.pop(0)
i = 0
while i < len(lst):
    x, l = lst[i]
    if start + length < x:
        lst[i] = (start, length)
        i += 1
        start, length = x, l
    else:
        length = max(length, x + l - start)
        lst.pop(i)
lst.append((start, length))
This modifies the list in place, never makes the list longer, uses only a handful of variables to keep state, and needs only one pass through the list after sorting.
A much faster algorithm is possible if you don't need to modify the list in place: popping items from the middle of a list takes time proportional to the list's length, which gets slow when the list is long.
One reasonable optimization would be to record the indices you're going to remove and then rebuild the list in a second pass; that way you rebuild the whole list in one go and avoid the pop overhead. But that would use more memory!
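The non-in-place idea can be sketched as follows. This is a minimal sketch rather than the answer's own code: it assumes Python 3, the name drop_contained is made up for illustration, and it follows the question's EDIT (drop only ranges that are contained in another range, keep partial overlaps) instead of merging overlaps as the in-place loop does. After sorting by start, with the longest range first on ties, a range is contained in an earlier one exactly when its end does not exceed the largest end seen so far, so a single pass is enough.

```python
def drop_contained(ranges):
    """Return the (start, length) ranges not contained in another range.

    Hypothetical helper for illustration; it builds a new list instead of
    modifying the input in place.
    """
    kept = []
    max_end = float("-inf")  # largest end (start + length) seen so far
    # Sort by start ascending, then by length descending, so that any range
    # that could contain the current one has already been visited.
    for start, length in sorted(ranges, key=lambda r: (r[0], -r[1])):
        end = start + length
        if end > max_end:
            kept.append((start, length))
            max_end = end
        # Otherwise an earlier range starts no later and ends no earlier,
        # so this one is contained in it (exact duplicates land here too).
    return kept


matches = [(0, 2), (1, 2), (0, 4), (2, 6), (23, 2), (22, 6),
           (26, 2), (26, 2), (26, 2)]
print(drop_contained(matches))  # [(0, 4), (2, 6), (22, 6)]
```

Because sorted() makes a copy, this uses O(n) extra memory, which is the trade-off for avoiding repeated pops.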

Related

Longest continuous pairs

In a sorted list of pairs, the 1st pair's 1st element should be less than the 2nd pair's 1st element, and likewise for the 2nd elements.
xlist = [(3, 9), (4, 6), (5, 7), (6, 0)] # sorted by first element of pair
ylist = [(j,i) for i,j in (sorted([(y,x) for x,y in xlist]))] # [(6, 0), (4, 6), (5, 7), (3, 9)], sorted by second element of pair
What I want is to find the longest run of pairs that is continuous, i.e. (4, 6), (5, 7)
PS. There can be other continuous runs like that, but is there a way to extract the longest one?
(4, 6), (5, 7) is determined as the longest run because the next pair's 1st element (5) is greater than the current one (4), and the next pair's 2nd element (7) is greater than the current one (6) (basically 5 > 4 and 7 > 6). And let's add another element to that list, say (8, 10); this is added to the output sequence as well, as 8 > 5 and 10 > 7. So the longest pairs become (4, 6), (5, 7), (8, 10)
If you mean the longest common contiguous run of the two lists, here is some code using difflib that does what you want. I don't know the exact implementation of SequenceMatcher, but it seems quite optimized, as it avoids explicit for-loops over the whole lists:
from difflib import SequenceMatcher
xlist = [(3, 9), (4, 6), (5, 7), (6, 0), (7, 8)]
ylist = [(6, 0), (4, 6), (5, 7), (7, 8), (3, 9)]
out = SequenceMatcher(None, xlist, ylist).get_matching_blocks()
max_block = max(out, key=lambda x: x.size)
start, end = max_block.a, max_block.a + max_block.size
out = xlist[start:end]
print(out) # [(4, 6), (5, 7)]
If you mean the longest increasing sequence of the second coordinate in xlist (same as previously but allowing "skips" in sequence), you can go with:
from typing import List, Tuple
xlist = [(3, 9), (4, 6), (5, 7), (6, 0), (7, 8)]
def find_lis_2nd_coord(pairs: List[Tuple]) -> List[Tuple]:
    """Find longest increasing subsequence (LIS).
    LIS is determined along 2nd coordinate of input pairs.
    """
    # lis[i] stores the longest increasing subsequence of sublist
    # `pairs[0…i][1]` that ends with `pairs[i][1]`
    lis = [[] for _ in range(len(pairs))]
    # lis[0] denotes the longest increasing subsequence ending at `pairs[0][1]`
    lis[0].append(pairs[0])
    # Start from the second element in the list
    for i in range(1, len(pairs)):
        # Do for each element in sublist `pairs[0…i-1][1]`
        for j in range(i):
            # Find the longest increasing subsequence that ends with
            # `pairs[j][1]` where it is less than the current element
            # `pairs[i][1]`
            if pairs[j][1] < pairs[i][1] and len(lis[j]) > len(lis[i]):
                lis[i] = lis[j].copy()
        # include `pairs[i]` in `lis[i]`
        lis[i].append(pairs[i])
    return max(lis, key=len)
print(find_lis_2nd_coord(xlist)) # [(4, 6), (5, 7), (7, 8)]
Disclaimer: this version is O(n^2); I did not find a more optimized idea or implementation. At least it works.
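For comparison, the usual O(n log n) approach to the longest increasing subsequence keeps, for each subsequence length, the smallest tail value seen so far, and finds the slot for each new value with a binary search. The following is a minimal sketch under the same assumption as the function above (strictly increasing second coordinates); lis_by_second is an illustrative name, not part of the answer.

```python
from bisect import bisect_left


def lis_by_second(pairs):
    """Longest strictly increasing subsequence by 2nd coordinate, O(n log n)."""
    tail_vals = []  # tail_vals[k]: smallest tail value of a run of length k + 1
    tail_idx = []   # tail_idx[k]: index in `pairs` of that tail
    prev = [-1] * len(pairs)  # predecessor links used to rebuild the answer
    for i, (_, y) in enumerate(pairs):
        k = bisect_left(tail_vals, y)  # first slot whose tail is >= y
        if k == len(tail_vals):
            tail_vals.append(y)
            tail_idx.append(i)
        else:
            tail_vals[k] = y
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    # Walk the predecessor links back from the tail of the longest run.
    out = []
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:
        out.append(pairs[i])
        i = prev[i]
    out.reverse()
    return out


print(lis_by_second([(3, 9), (4, 6), (5, 7), (6, 0), (7, 8)]))
# [(4, 6), (5, 7), (7, 8)]
```

When several subsequences share the maximum length, this sketch may return a different one than the quadratic version.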

Return tuple with biggest increase of second value in a list of tuples

Like the title says, I have a list of tuples: [(3, 20), (9, 21), (18, 19)]. I need to find the tuple that has a positive y-increase with respect to its predecessor. In this case 21-20 = 1, so tuple (9,21) should be returned. 19-21 = -2, so tuple (18,19) shouldn't be returned. The very first tuple in the list should never be returned. I've tried putting all the values in a list and then trying to figure it out, but I'm clueless. It should work for lists of tuples of any length. I hope you guys can help me out, thanks in advance.
You could compare the second element of each tuple with the previous one, while iterating over the list:
data = [(3, 20), (9, 21), (18, 19), (1, 35), (4, 37), (1, 2)]
maxIncrease = [0, 0] # store max increase value and its index
for i in range(1, len(data)):
    lst = data[i - 1]
    cur = data[i]
    diff = cur[1] - lst[1]
    if diff > maxIncrease[0]:
        maxIncrease = [diff, i]
print(
    f'Biggest increase of {maxIncrease[0]} found at index {maxIncrease[1]}: {data[maxIncrease[1]]}'
)
Out:
Biggest increase of 16 found at index 3: (1, 35)
I think something like that can solve your problem:
import numpy as np
data = [(3, 20), (9, 21), (18, 19), (10, 22)]
diff_with_previous = []
for i in range(len(data)):
    if i == 0:
        diff_with_previous.append(-np.inf)
    else:
        diff_with_previous.append(data[i][1] - data[i-1][1])
indices = np.where(np.array(diff_with_previous) > 0)
print([data[i] for i in indices[0]])
[EDIT]
Without numpy:
data = [(3, 20), (9, 21), (18, 19), (10, 22)]
indices = []
for i in range(1, len(data)):
    if (data[i][1] - data[i-1][1]) > 0:
        indices.append(i)
print([data[i] for i in indices])
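Both readings of the question (every tuple with a positive increase, or only the one with the biggest increase) can also be written without index arithmetic by pairing each tuple with its predecessor via zip(data, data[1:]). A minimal sketch; the function names are made up for illustration:

```python
def positive_increases(data):
    """All tuples whose 2nd value is larger than their predecessor's."""
    return [cur for prev, cur in zip(data, data[1:]) if cur[1] > prev[1]]


def biggest_increase(data):
    """The tuple with the largest positive increase, or None if there is none."""
    pairs = list(zip(data, data[1:]))  # (predecessor, current) pairs
    if not pairs:
        return None
    prev, cur = max(pairs, key=lambda p: p[1][1] - p[0][1])
    return cur if cur[1] > prev[1] else None


print(positive_increases([(3, 20), (9, 21), (18, 19), (10, 22)]))
# [(9, 21), (10, 22)]
print(biggest_increase([(3, 20), (9, 21), (18, 19), (1, 35), (4, 37), (1, 2)]))
# (1, 35)
```

Since the zip starts at the second element, the first tuple can never be returned, as the question requires.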

Find the most frequent k-mers with mismatches in a text

I am trying to solve finding the most frequent k-mers with mismatches in a string. The requirements are listed below:
Frequent Words with Mismatches Problem: Find the most frequent k-mers with mismatches in a string.
Input: A string Text as well as integers k and d. (You may assume k ≤ 12 and d ≤ 3.)
Output: All most frequent k-mers with up to d mismatches in Text.
Here is an example:
Sample Input:
ACGTTGCATGTCGCATGATGCATGAGAGCT
4 1
Sample Output:
GATG ATGC ATGT
The simplest and most inefficient way is to list all of the k-mers in the text, calculate the Hamming distance between each pair, and pick out the patterns whose Hamming distance is less than or equal to d. Below is my code:
import collections
kmer = 4
in_genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
in_mistake = 1
out_result = []
mismatch_list = []
def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    else:
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
for i in xrange(len(in_genome)-kmer + 1):
    v = in_genome[i:i + kmer]
    out_result.append(v)
for i in xrange(len(out_result) - 1):
    for j in xrange(i+1, len(out_result)):
        if hamming_distance(str(out_result[i]), str(out_result[j])) <= in_mistake:
            mismatch_list.extend([out_result[i], out_result[j]])
mismatch_count = collections.Counter(mismatch_list)
print [key for key,val in mismatch_count.iteritems() if val == max(mismatch_count.values())]
Instead of the expected results, I got 'CATG'. Does anyone know what is wrong with my code?
It all seems great until your last line of code:
print [key for key,val in mismatch_count.iteritems() if val == max(mismatch_count.values())]
Since CATG scored higher than any other kmer, you'll only ever get that one answer. Take a look at:
>>> print mismatch_count.most_common()
[('CATG', 9), ('ATGA', 6), ('GCAT', 6), ('ATGC', 4), ('TGCA', 4), ('ATGT', 4), ('GATG', 4), ('GTTG', 2), ('TGAG', 2), ('TTGC', 2), ('CGCA', 2), ('TGAT', 1), ('GTCG', 1), ('AGAG', 1), ('ACGT', 1), ('TCGC', 1), ('GAGC', 1), ('GAGA', 1)]
to figure out what it is you really want back from this result.
I believe the fix is to change your second top level 'for' loop to read as follows:
for t_kmer in set(out_result):
    for s_kmer in out_result:
        if hamming_distance(t_kmer, s_kmer) <= in_mistake:
            mismatch_list.append(t_kmer)
This produces a result similar to what you're expecting:
>>> print mismatch_count.most_common()
[('ATGC', 5), ('ATGT', 5), ('GATG', 5), ('CATG', 4), ('ATGA', 4), ('GTTG', 3), ('CGCA', 3), ('GCAT', 3), ('TGAG', 3), ('TTGC', 3), ('TGCA', 3), ('TGAT', 2), ('GTCG', 2), ('AGAG', 2), ('ACGT', 2), ('TCGC', 2), ('GAGA', 2), ('GAGC', 2), ('TGTC', 1), ('CGTT', 1), ('AGCT', 1)]
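Note that the loop above only scores k-mers that actually occur in the text, whereas the problem statement allows the answer to be any k-mer. One common way to cover that case is to generate, for every window of the text, all k-mers within Hamming distance d of it and count those instead. The following is a minimal sketch of that idea, assuming Python 3 and the DNA alphabet ACGT; the function names are made up for illustration and are not from the question or the answer.

```python
from collections import Counter
from itertools import combinations, product


def neighbors(kmer, d, alphabet="ACGT"):
    """All strings within Hamming distance d of kmer."""
    result = {kmer}
    for r in range(1, d + 1):
        # Choose r positions and try every combination of letters there;
        # re-using the original letter just yields a closer neighbour.
        for positions in combinations(range(len(kmer)), r):
            for letters in product(alphabet, repeat=r):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return result


def frequent_words_with_mismatches(text, k, d):
    """Most frequent k-mers with up to d mismatches, sorted alphabetically."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Each window votes once for every k-mer it approximately matches.
        counts.update(neighbors(text[i:i + k], d))
    if not counts:
        return []
    best = max(counts.values())
    return sorted(km for km, c in counts.items() if c == best)


print(frequent_words_with_mismatches("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
# ['ATGC', 'ATGT', 'GATG']
```

With k ≤ 12 and d ≤ 3, each window has at most a few thousand neighbours, so enumerating them stays cheap.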

Generate ordered tuples of infinite sequences

I have two generators genA and genB and each of them generates an infinite, strictly monotonically increasing sequence of integers.
Now I need a generator that generates all tuples (a, b) such that a is produced by genA and b is produced by genB and a < b, ordered by a + b ascending. In case of ambiguity the ordering is of no importance, i.e. if a + b == c + d, it doesn't matter if it generates (a, b) first or (c, d) first.
For instance. If both genA and genB generate the prime numbers, then the new generator should generate:
(2, 3), (2, 5), (3, 5), (2, 7), (3, 7), (5, 7), (2, 11), ...
If genA and genB were finite lists, zipping and then sorting would do the trick.
Apparently, for all tuples of the form (x, b) the following holds: first(genA) <= x <= max(genA,b) <= b, where first(genA) is the first element generated by genA and max(genA,b) is the last element generated by genA which is less than b.
This is how far I have gotten. Any ideas of how to combine two generators in the described manner?
I don't think it is possible to do this without saving all the results from genA. A solution might look something like this:
import heapq
def gen_weird_sequence(genA, genB):
    heap = []
    a0 = next_a = next(genA)
    saved_a = []
    for b in genB:
        while next_a < b:
            saved_a.append(next_a)
            next_a = next(genA)
        # saved_a now contains all a < b
        for a in saved_a:
            heapq.heappush(heap, (a+b, a, b)) #decorate pair with sorting key a+b
        # (minimum sum in the next round) > b + a0, so yield everything smaller
        while heap and heap[0][0] <= b + a0:
            yield heapq.heappop(heap)[1:] # pop smallest and undecorate
Explanation: The main loop simply iterates over all elements in genB; for each b it pulls all elements from genA that are smaller than b and saves them in a list. It then generates all the tuples (a0, b), (a1, b), ..., (a_n, b) and stores them in a min-heap, which is an efficient data structure when you are only interested in extracting the minimum value of a collection. As with sorting, the trick is not to store the bare pairs but to prefix each one with the value you want to sort on (a+b), since comparisons between tuples start with the first item. Finally, it pops off the heap every element whose sum is guaranteed to be smaller than the sum of any pair generated for the next b, and yields them.
Note that both heap and saved_a will increase while you are generating results, I guess proportionally to the square root of the number of elements generated so far.
Quick test with some primes:
In [2]: genA = (a for a in [2,3,5,7,11,13,17,19])
In [3]: genB = (b for b in [2,3,5,7,11,13,17,19])
In [4]: for pair in gen_weird_sequence(genA, genB): print pair
(2, 3)
(2, 5)
(3, 5)
(2, 7)
(3, 7)
(5, 7)
(2, 11)
(3, 11)
(2, 13)
(3, 13)
(5, 11)
(5, 13)
(7, 11)
(2, 17)
(3, 17)
(7, 13)
as expected. Test with infinite generators:
In [11]: from itertools import *
In [12]: list(islice(gen_weird_sequence(count(), count()), 16))
Out[12]: [(0, 1), (0, 2), (0, 3), (1, 2), (0, 4), (1, 3), (0, 5), (1, 4),
(2, 3), (0, 6), (1, 5), (2, 4), (0, 7), (1, 6), (2, 5), (3, 4)]

Accumulate items in a list of tuples

I have a list of tuples that looks like this:
lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
What is the best way to accumulate the sum of the first and second tuple elements? Using the example above, I'm looking for the best way to produce this list:
new_lst = [(0, 0), (2, 3), (6, 6), (11, 7)]
I am looking for a solution in Python 2.6
I would argue the best solution is itertools.accumulate() to accumulate the values, and using zip() to split up your columns and merge them back. This means the generator just handles a single column, and makes the method entirely scalable.
>>> from itertools import accumulate
>>> lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
>>> list(zip(*map(accumulate, zip(*lst))))
[(0, 0), (2, 3), (6, 6), (11, 7)]
We use zip() to take the columns, then apply itertools.accumulate() to each column, then use zip() to merge them back into the original format.
This method will work for any iterable, not just sequences, and should be relatively efficient.
Prior to 3.2, accumulate can be defined as:
def accumulate(iterator):
    total = 0
    for item in iterator:
        total += item
        yield total
(The docs page gives a more generic implementation, but for this use case, we can use this simple implementation).
How about this generator:
def accumulate_tuples(iterable):
    accum_a = accum_b = 0
    for a, b in iterable:
        accum_a += a
        accum_b += b
        yield accum_a, accum_b
If you need a list, just call list(accumulate_tuples(your_list)).
Here's a version that works for arbitrary length tuples:
def accumulate_tuples(iterable):
    it = iter(iterable)
    accum = next(it) # initialize with the first value
    yield accum
    for val in it: # iterate over the rest of the values
        accum = tuple(a+b for a, b in zip(accum, val))
        yield accum
>>> reduce(lambda x,y: (x[0] + y[0], x[1] + y[1]), lst)
(11, 7)
EDIT. I can see your updated question. To get the running list you can do:
>>> [reduce(lambda x,y: (x[0]+y[0], x[1]+y[1]), lst[:i]) for i in range(1,len(lst)+1)]
[(0, 0), (2, 3), (6, 6), (11, 7)]
Not super efficient, but at least it works and does what you want :)
This works for any length of tuples or other iterables.
from collections import defaultdict
def accumulate(lst):
    sums = defaultdict(int)
    for item in lst:
        for index, subitem in enumerate(item):
            sums[index] += subitem
        yield [sums[index] for index in xrange(len(sums))]
print [tuple(x) for x in accumulate([(0, 0), (2, 3), (4, 3), (5, 1)])]
In Python 2.7+ you would use a Counter instead of defaultdict(int).
This is a plain loop that adds each tuple element-wise to the running total; it is not fancy, but it works.
last = lst[0]
new_list = [last]
for t in lst[1:]:
    last = tuple(a + b for a, b in zip(last, t))
    new_list.append(last)
Simple method:
>>> x = [(0, 0), (2, 3), (4, 3), (5, 1)]
>>> [(sum(a for a,b in x[:t] ),sum(b for a,b in x[:t])) for t in range(1,len(x)+1)]
[(0, 0), (2, 3), (6, 6), (11, 7)]
lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
lst2 = [lst[0]]
for idx in range(1, len(lst)):
    newItem = [0,0]
    for idx2 in range(0, idx + 1):
        newItem[0] = newItem[0] + lst[idx2][0]
        newItem[1] = newItem[1] + lst[idx2][1]
    lst2.append(tuple(newItem))
print(lst2)
You can use the following function
>>> def my_accumulate(lst):
...     new_lst = [lst[0]]
...     for x, y in lst[1:]:
...         new_lst.append((new_lst[-1][0]+x, new_lst[-1][1]+y))
...     return new_lst
...
>>> lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
>>> my_accumulate(lst)
[(0, 0), (2, 3), (6, 6), (11, 7)]
Changed my code to a terser version:
lst = [(0, 0), (2, 3), (4, 3), (5, 1)]
def accumulate(the_list):
    the_item = iter(the_list)
    accumulator = next(the_item)
    yield accumulator
    for item in the_item:
        accumulator = tuple(x+y for (x,y) in zip(accumulator, item))
        yield accumulator
new_lst = list(accumulate(lst))
