Find the most frequent k-mers with mismatches in a text - python

I am trying to solve the problem of finding the most frequent k-mers with mismatches in a string. The requirements are listed below:
Frequent Words with Mismatches Problem: Find the most frequent k-mers with mismatches in a string.
Input: A string Text as well as integers k and d. (You may assume k ≤ 12 and d ≤ 3.)
Output: All most frequent k-mers with up to d mismatches in Text.
Here is an example:
Sample Input:
ACGTTGCATGTCGCATGATGCATGAGAGCT
4 1
Sample Output:
GATG ATGC ATGT
The simplest, and least efficient, approach is to list all of the k-mers in the text, compute the Hamming distance between each pair of them, and pick out the patterns whose Hamming distance is less than or equal to d. Below is my code:
import collections
kmer = 4
in_genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
in_mistake = 1
out_result = []
mismatch_list = []
def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    else:
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
for i in xrange(len(in_genome)-kmer + 1):
    v = in_genome[i:i + kmer]
    out_result.append(v)
for i in xrange(len(out_result) - 1):
    for j in xrange(i+1, len(out_result)):
        if hamming_distance(str(out_result[i]), str(out_result[j])) <= in_mistake:
            mismatch_list.extend([out_result[i], out_result[j]])
mismatch_count = collections.Counter(mismatch_list)
print [key for key,val in mismatch_count.iteritems() if val == max(mismatch_count.values())]
Instead of the expected results, I got 'CATG'. Does anyone know what is wrong with my code?

It all seems great until your last line of code:
print [key for key,val in mismatch_count.iteritems() if val == max(mismatch_count.values())]
Since CATG scored higher than any other kmer, you'll only ever get that one answer. Take a look at:
>>> print mismatch_count.most_common()
[('CATG', 9), ('ATGA', 6), ('GCAT', 6), ('ATGC', 4), ('TGCA', 4), ('ATGT', 4), ('GATG', 4), ('GTTG', 2), ('TGAG', 2), ('TTGC', 2), ('CGCA', 2), ('TGAT', 1), ('GTCG', 1), ('AGAG', 1), ('ACGT', 1), ('TCGC', 1), ('GAGC', 1), ('GAGA', 1)]
to figure out what it is you really want back from this result.
I believe the fix is to change your second top level 'for' loop to read as follows:
for t_kmer in set(out_result):
    for s_kmer in out_result:
        if hamming_distance(t_kmer, s_kmer) <= in_mistake:
            mismatch_list.append(t_kmer)
This produces a result similar to what you're expecting:
>>> print mismatch_count.most_common()
[('ATGC', 5), ('ATGT', 5), ('GATG', 5), ('CATG', 4), ('ATGA', 4), ('GTTG', 3), ('CGCA', 3), ('GCAT', 3), ('TGAG', 3), ('TTGC', 3), ('TGCA', 3), ('TGAT', 2), ('GTCG', 2), ('AGAG', 2), ('ACGT', 2), ('TCGC', 2), ('GAGA', 2), ('GAGC', 2), ('TGTC', 1), ('CGTT', 1), ('AGCT', 1)]
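Note that the loop above only scores k-mers that actually occur in the text, whereas the problem statement allows any k-mer as an answer. A minimal Python 3 sketch that scores every candidate by expanding each window into its mismatch neighborhood might look like this (the names `neighbors` and `frequent_words_with_mismatches` are made up for illustration and are not from the question):

```python
from collections import Counter
from itertools import combinations, product

def neighbors(kmer, d, alphabet="ACGT"):
    """Return every string within Hamming distance d of kmer."""
    result = {kmer}
    for num in range(1, d + 1):
        # Choose which positions may change, then try every letter there.
        for positions in combinations(range(len(kmer)), num):
            for letters in product(alphabet, repeat=num):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                result.add("".join(candidate))
    return result

def frequent_words_with_mismatches(text, k, d):
    """Return the k-mers matched (with up to d mismatches) by the most windows."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Each window votes once for every k-mer it approximately matches.
        counts.update(neighbors(text[i:i + k], d))
    best = max(counts.values())
    return sorted(kmer for kmer, count in counts.items() if count == best)

print(frequent_words_with_mismatches("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1))
```

With k ≤ 12 and d ≤ 3 the neighborhoods stay small enough for this to be practical, but it is only one possible approach.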


Get all possible dots that form a continuous path - Python

For example, given points = [(1,2),(3,4),(5,7),(4,7),(6,7)], I need the program to find all combinations such that there is a path between the points (let's say 7 is the destination we want to reach),
so the output would be: [(1,2),(3,4),(5,7)] [(1,2),(3,4),(4,7)] [(1,2),(3,4),(6,7)]
You get the idea?
I'm really stuck on this and I cannot find anything similar on the internet.
Truly, I don't get why you need this, but it is a simple task.
source_list = [(1,2),(3,4),(5,7),(4,7),(6,7)]
final_point = 7 # ??? did not get the logic, btw
def some_magic_shit(lst, fp):
    out = []
    while True:
        way = []
        for k, v in enumerate(lst):
            way.append(v)
            if v[1] >= fp:
                lst.pop(k)
                out.append(way)
                if k >= len(lst):
                    return out
                else:
                    break
print(some_magic_shit(source_list, final_point)) # [[(1, 2), (3, 4), (5, 7)], [(1, 2), (3, 4), (4, 7)], [(1, 2), (3, 4), (6, 7)]]
The code above should be rewritten once the proper logic is known.
As written, it only uses the second (Y) coordinate to decide whether a point is a final point.
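A possibly clearer way to get the same output, without mutating the input list, is sketched below in Python 3. It assumes one particular reading of the question: every point whose second coordinate is below the destination belongs to a shared prefix, and each remaining point ends one path. The function name `paths_to_destination` is made up for illustration.

```python
def paths_to_destination(points, destination):
    """Build one path per endpoint: the shared prefix plus that endpoint."""
    # Points that have not yet reached the destination form the common prefix.
    prefix = [p for p in points if p[1] < destination]
    # Every point that reaches (or passes) the destination ends one path.
    endpoints = [p for p in points if p[1] >= destination]
    return [prefix + [end] for end in endpoints]

points = [(1, 2), (3, 4), (5, 7), (4, 7), (6, 7)]
print(paths_to_destination(points, 7))
```

If the real rule for "a path between points" is different, only the two list comprehensions need to change.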

How do I impose a condition that must be satisfied by *any two* members of a set?

I want to use python to define one set in terms of another, as follows: For some set N that consists of sets, define C as the set such that an element n of N is in C just in case any two elements of n both satisfy some specific condition.
Here is the particular problem I need to solve. Consider the power set N of the set of ordered pairs of elements in x={1,2,3,4,5,6}, and the following subsets of N:
i = {{1,1},{2,2},{3,4},{4,3},{5,6},{6,5}}
j = {{3,3},{4,4},{1,2},{2,1},{5,6},{6,5}}
k = {{5,5},{6,6},{1,2},{2,1},{3,4},{4,3}}
Using python, I want to define a special class of subsets of N: the subsets of N such that any two of their members are both either in i, j, or k.
More explicitly, I want to define the set terms: C = {n in N| for all a, b in n, either a and b are both in i or a and b are both in j or a and b are both in k}.
I'm attaching what I tried to do in Python. But this doesn't give me the right result: the set C I'm defining here is not such that any two of its members are both either in i, j, or k.
Any leads would be much appreciated!
from itertools import chain, combinations
def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
x = [1,2,3,4,5,6]
ordered_pairs = [[j,k] for j in x for k in x if k>=j]
powers = list(powerset(ordered_pairs))
i = [[1,1],[2,2],[3,4],[4,3],[5,6],[6,5]]
j = [[3,3],[4,4],[1,2],[2,1],[5,6],[6,5]]
k = [[5,5],[6,6],[1,2],[2,1],[3,4],[4,3]]
M = [i,j,k]
C = []
for n in powers:
    for a in n:
        for b in n:
            for m in M:
                if a in m:
                    if b in m:
                        if a != b:
                            C.append(n)
    if len(n) == 1:
        C.append(n)
First of all, note that the ordered pairs you list are sets, not pairs. Use tuples, since they're hashable, and you'll be able to easily generate the power set using itertools. With that done, you have an easier time identifying the qualifying subsets.
The code below implements that much of the process. You can accumulate the hits at the HIT line of code. Even better, you can collapse the loop into a nested comprehension using any to iterate over the three zone sets.
test_list = [
    set(((1,1),(2,2))), # trivial hit on i
    set(), # trivial miss
    set(((1, 1), (4, 4), (6, 6))), # one element in each target set
    set(((3, 3), (6, 2), (4, 4), (2, 2))), # two elements in j
]
i = set(((1,1),(2,2),(3,4),(4,3),(5,6),(6,5)))
j = set(((3,3),(4,4),(1,2),(2,1),(5,6),(6,5)))
k = set(((5,5),(6,6),(1,2),(2,1),(3,4),(4,3)))
zone = [i, j, k]
for candidate in test_list:
    for target in zone:
        overlap = candidate.intersection(target)
        if len(overlap) >= 2:
            print("HIT", candidate, target)
            break
    else:
        print("MISS", candidate)
Output:
HIT {(1, 1), (2, 2)} {(5, 6), (4, 3), (2, 2), (3, 4), (1, 1), (6, 5)}
MISS set()
MISS {(4, 4), (1, 1), (6, 6)}
HIT {(6, 2), (3, 3), (4, 4), (2, 2)} {(1, 2), (3, 3), (5, 6), (4, 4), (2, 1), (6, 5)}
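The loop above reports a hit as soon as two elements of a candidate fall in the same zone. If the condition from the question is read literally (every two members of the candidate must share at least one of i, j, k), a comprehension-style check could look like the sketch below. The helper name `pairwise_compatible` is made up, and this reading of the condition is an assumption.

```python
from itertools import combinations

i = {(1, 1), (2, 2), (3, 4), (4, 3), (5, 6), (6, 5)}
j = {(3, 3), (4, 4), (1, 2), (2, 1), (5, 6), (6, 5)}
k = {(5, 5), (6, 6), (1, 2), (2, 1), (3, 4), (4, 3)}
zones = [i, j, k]

def pairwise_compatible(candidate, zones):
    """True if every two members of candidate lie together in some zone."""
    return all(
        any(a in zone and b in zone for zone in zones)
        for a, b in combinations(candidate, 2)
    )

print(pairwise_compatible({(1, 1), (2, 2)}, zones))          # both in i
print(pairwise_compatible({(1, 1), (4, 4), (6, 6)}, zones))  # no shared zone
```

Under this reading, sets with fewer than two members qualify vacuously, and C would be the list of power-set members for which the function returns True.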

Create a random order of (x, y) pairs, without repeating/subsequent x's

Say I have a list of valid X = [1, 2, 3, 4, 5] and a list of valid Y = [1, 2, 3, 4, 5].
I need to generate all combinations of every element in X and every element in Y (in this case, 25) and get those combinations in random order.
This in itself would be simple, but there is an additional requirement: In this random order, there cannot be a repetition of the same x in succession. For example, this is okay:
[1, 3]
[2, 5]
[1, 2]
...
[1, 4]
This is not:
[1, 3]
[1, 2] <== the "1" cannot repeat, because there was already one before
[2, 5]
...
[1, 4]
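The requirement can be written down as a small checker, which is also handy for testing any candidate solution. This is a Python 3 sketch; the function name `is_valid_order` is made up for illustration.

```python
def is_valid_order(pairs, xs, ys):
    """True if pairs holds every (x, y) exactly once with no x repeated back to back."""
    expected = {(x, y) for x in xs for y in ys}
    as_tuples = [tuple(p) for p in pairs]
    # Same length plus same set means every combination appears exactly once.
    covers_all = len(as_tuples) == len(expected) and set(as_tuples) == expected
    # Compare the x of each pair with the x of its successor.
    no_repeat = all(a[0] != b[0] for a, b in zip(as_tuples, as_tuples[1:]))
    return covers_all and no_repeat

print(is_valid_order([(1, 1), (2, 1), (1, 2), (2, 2)], [1, 2], [1, 2]))
print(is_valid_order([(1, 1), (1, 2), (2, 1), (2, 2)], [1, 2], [1, 2]))
```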
Now, the least efficient idea would be to simply reshuffle the full set until there are no more repetitions. My approach is a bit different: repeatedly create a shuffled variant of X, keep a list of all Y * X, and pick a random next pair from those. So far, I've come up with this:
import random
output = []
num_x = 5
num_y = 5
all_ys = list(xrange(1, num_y + 1)) * num_x
while True:
    # end if no more are available
    if len(output) == num_x * num_y:
        break
    xs = list(xrange(1, num_x + 1))
    while len(xs):
        next_x = random.choice(xs)
        next_y = random.choice(all_ys)
        if [next_x, next_y] not in output:
            xs.remove(next_x)
            all_ys.remove(next_y)
            output.append([next_x, next_y])
print(sorted(output))
But I'm sure this can be done even more efficiently or in a more succinct way?
Also, my solution first goes through all X values before continuing with the full set again, which is not perfectly random. I can live with that for my particular application case.
A simple solution to ensure an average O(N*M) complexity:
import random
def pseudorandom(M,N):
    l=[(x+1,y+1) for x in range(N) for y in range(M)]
    random.shuffle(l)
    for i in range(M*N-1):
        for j in range (i+1,M*N): # find a compatible ...
            if l[i][0] != l[j][0]:
                l[i+1],l[j] = l[j],l[i+1]
                break
        else: # or insert otherwise.
            while True:
                l[i],l[i-1] = l[i-1],l[i]
                i-=1
                if l[i][0] != l[i-1][0]: break
    return l
Some tests:
In [354]: print(pseudorandom(5,5))
[(2, 2), (3, 1), (5, 1), (1, 1), (3, 2), (1, 2), (3, 5), (1, 5), (5, 4),\
(1, 3), (5, 2), (3, 4), (5, 3), (4, 5), (5, 5), (1, 4), (2, 5), (4, 4), (2, 4),\
(4, 2), (2, 1), (4, 3), (2, 3), (4, 1), (3, 3)]
In [355]: %timeit pseudorandom(100,100)
10 loops, best of 3: 41.3 ms per loop
Here is my solution. Each tuple is first chosen from among those whose x value differs from that of the previously selected tuple. I noticed, however, that a final trick is needed for the case where only tuples with the forbidden x value are left at the end: those are inserted at random valid positions in the existing output.
import random
num_x = 5
num_y = 5
all_ys = range(1,num_y+1)*num_x
all_xs = sorted(range(1,num_x+1)*num_y)
output = []
last_x = -1
for i in range(0,num_x*num_y):
    #get list of possible tuple to place
    all_ind = range(0,len(all_xs))
    all_ind_ok = [k for k in all_ind if all_xs[k]!=last_x]
    ind = random.choice(all_ind_ok)
    last_x = all_xs[ind]
    output.append([all_xs.pop(ind),all_ys.pop(ind)])
    if(all_xs.count(last_x)==len(all_xs)):#if only last_x tuples,
        break
if len(all_xs)>0: # if there are still tuples they are randomly placed
    nb_to_place = len(all_xs)
    while(len(all_xs)>0):
        place = random.randint(0,len(output)-1)
        # compare x values (first item of each pair), not the whole pair
        if output[place][0]==last_x:
            continue
        if place>0:
            if output[place-1][0]==last_x:
                continue
        output.insert(place,[all_xs.pop(),all_ys.pop()])
print output
Here's a solution using NumPy
import numpy as np
def generate_pairs(xs, ys):
    n = len(xs)
    m = len(ys)
    indices = np.arange(n)
    array = np.tile(ys, (n, 1))
    [np.random.shuffle(array[i]) for i in range(n)]
    counts = np.full_like(xs, m)
    i = -1
    for _ in range(n * m):
        weights = np.array(counts, dtype=float)
        if i != -1:
            weights[i] = 0
        weights /= np.sum(weights)
        i = np.random.choice(indices, p=weights)
        counts[i] -= 1
        pair = xs[i], array[i, counts[i]]
        yield pair
Here's a Jupyter notebook that explains how it works
Inside the loop, we have to copy the weights, add them up, and choose a random index using the weights. These are all linear in n, so the overall complexity to generate all pairs is O(n^2 m).
But the runtime is deterministic and overhead is low. And I'm fairly sure it generates all legal sequences with equal probability.
An interesting question! Here is my solution. It has the following properties:
If there is no valid solution it should detect this and let you know
The iteration is guaranteed to terminate so it should never get stuck in an infinite loop
Any possible solution is reachable with nonzero probability
I do not know the distribution of the output over all possible solutions, but I think it should be uniform because there is no obvious asymmetry inherent in the algorithm. I would be surprised and pleased to be shown otherwise, though!
import random
def random_without_repeats(xs, ys):
    pairs = [[x,y] for x in xs for y in ys]
    output = [[object()], [object()]]
    seen = set()
    while pairs:
        # choose a random pair from the ones left
        indices = list(set(xrange(len(pairs))) - seen)
        try:
            index = random.choice(indices)
        except IndexError:
            raise Exception('No valid solution exists!')
        # the first element of our randomly chosen pair
        x = pairs[index][0]
        # search for a valid place in output where we slot it in
        for i in xrange(len(output) - 1):
            left, right = output[i], output[i+1]
            if x != left[0] and x != right[0]:
                output.insert(i+1, pairs.pop(index))
                seen = set()
                break
        else:
            # make sure we don't randomly choose a bad pair like that again
            seen |= {i for i in indices if pairs[i][0] == x}
    # trim off the sentinels
    output = output[1:-1]
    assert len(output) == len(xs) * len(ys)
    # adjacent pairs must differ in their first element
    assert not any(L[0]==R[0] for L,R in zip(output[:-1], output[1:]))
    return output
nx, ny = 5, 5 # OP example
# nx, ny = 2, 10 # output must alternate in 1st index
# nx, ny = 4, 13 # shuffle 'deck of cards' with no repeating suit
# nx, ny = 1, 5 # should raise 'No valid solution exists!' exception
xs = range(1, nx+1)
ys = range(1, ny+1)
for pair in random_without_repeats(xs, ys):
    print pair
This should do what you want.
rando will never generate the same X twice in a row. However, because duplicate pairs are discarded, the next accepted pair could still share its X with the previously accepted one. This seems unlikely (I never noticed it happen in the 10 or so times I ran the code without the extra check), but the loop below skips such pairs to be safe.
import itertools
import pprint
import random
X = [1,2,3,4,5]
Y = [1,2,3,4,5]
def rando(choice_one, choice_two):
    last_x = random.choice(choice_one)
    while True:
        yield last_x, random.choice(choice_two)
        possible_x = choice_one[:]
        possible_x.remove(last_x)
        last_x = random.choice(possible_x)
all_pairs = set(itertools.product(X, Y))
result = []
r = rando(X, Y)
while set(result) != all_pairs:
    pair = next(r)
    if pair not in result:
        if result and result[-1][0] == pair[0]:
            continue
        result.append(pair)
pprint.pprint(result)
For completeness, I guess I will throw in the super-naive "just keep shuffling till you get one" solution. It's not guaranteed to even terminate, but if it does, it will have a good degree of randomness, and you did say one of the desired qualities was succinctness, and this sure is succinct:
import itertools
import random
x = range(5) # this is a list in Python 2
y = range(5)
all_pairs = list(itertools.product(x, y))
s = list(all_pairs) # make a working copy
while any(s[i][0] == s[i + 1][0] for i in range(len(s) - 1)):
    random.shuffle(s)
print s
As was commented, for small values of x and y (especially y!), this is actually a reasonably quick solution. Your example of 5 for each completes in an average time of "right away". The deck of cards example (4 and 13) can take much longer, because it will usually require hundreds of thousands of shuffles. (And again, is not guaranteed to terminate at all.)
Distribute the x values (5 times each value) evenly across your output:
import random
def random_combo_without_x_repeats(xvals, yvals):
    # produce all valid combinations, but group by `x` and shuffle the `y`s
    grouped = [[x, random.sample(yvals, len(yvals))] for x in xvals]
    last_x = object() # sentinel not equal to anything
    while grouped[0][1]: # still `y`s left
        for _ in range(len(xvals)):
            # shuffle the `x`s, but skip any ordering that would
            # produce consecutive `x`s.
            random.shuffle(grouped)
            if grouped[0][0] != last_x:
                break
        else:
            # we tried to reshuffle N times, but ended up with the same `x` value
            # in the first position each time. This is pretty unlikely, but
            # if this happens we bail out and just reverse the order. That is
            # more than good enough.
            grouped = grouped[::-1]
        # yield a set of (x, y) pairs for each unique x
        # Pick one y (from the pre-shuffled groups) per x
        for x, ys in grouped:
            yield x, ys.pop()
        last_x = x
This shuffles the y values per x first, then gives you a x, y combination for each x. The order in which the xs are yielded is shuffled each iteration, where you test for the restriction.
This is random, but you'll get all numbers between 1 and 5 in the x position before you'll see the same number again:
>>> list(random_combo_without_x_repeats(range(1, 6), range(1, 6)))
[(2, 1), (3, 2), (1, 5), (5, 1), (4, 1),
(2, 4), (3, 1), (4, 3), (5, 5), (1, 4),
(5, 2), (1, 1), (3, 3), (4, 4), (2, 5),
(3, 5), (2, 3), (4, 2), (1, 2), (5, 4),
(2, 2), (3, 4), (1, 3), (4, 5), (5, 3)]
(I manually grouped that into sets of 5). Overall, this makes for a pretty good random shuffling of a fixed input set with your restriction.
It is efficient too; because there is only a 1-in-N chance that you have to re-shuffle the x order, you should only see one reshuffle on average take place during a full run of the algorithm. The whole algorithm therefore stays within O(N*M) bounds, which is pretty much ideal for something that produces N times M elements of output. Because we limit the reshuffling to N times at most before falling back to a simple reverse, we avoid the (extremely unlikely) possibility of endlessly reshuffling.
The only drawback then is that it has to create N copies of the M y values up front.
Here is an evolutionary algorithm approach. It first evolves a list in which the elements of X are each repeated len(Y) times and then it randomly fills in each element of Y len(X) times. The resulting orders seem fairly random:
import random
#the following fitness function measures
#the number of times in which
#consecutive elements in a list
#are equal
def numRepeats(x):
    n = len(x)
    if n < 2: return 0
    repeats = 0
    for i in range(n-1):
        if x[i] == x[i+1]: repeats += 1
    return repeats
def mutate(xs):
    #swaps random pairs of elements
    #returns a new list
    #one of the two indices is chosen so that
    #it is in a repeated pair
    #and swapped element is different
    n = len(xs)
    repeats = [i for i in range(n) if (i > 0 and xs[i] == xs[i-1]) or (i < n-1 and xs[i] == xs[i+1])]
    i = random.choice(repeats)
    j = random.randint(0,n-1)
    while xs[j] == xs[i]: j = random.randint(0,n-1)
    ys = xs[:]
    ys[i], ys[j] = ys[j], ys[i]
    return ys
def evolveShuffle(xs, popSize = 100, numGens = 100):
    #tries to evolve a shuffle of xs so that consecutive
    #elements are different
    #takes the best 10% of each generation and mutates each 9
    #times. Stops when a perfect solution is found
    #popsize assumed to be a multiple of 10
    population = []
    for i in range(popSize):
        deck = xs[:]
        random.shuffle(deck)
        fitness = numRepeats(deck)
        if fitness == 0: return deck
        population.append((fitness,deck))
    for i in range(numGens):
        population.sort(key = (lambda p: p[0]))
        newPop = []
        for i in range(popSize//10):
            fit,deck = population[i]
            newPop.append((fit,deck))
            for j in range(9):
                newDeck = mutate(deck)
                fitness = numRepeats(newDeck)
                if fitness == 0: return newDeck
                newPop.append((fitness,newDeck))
        population = newPop
    #if you get here :
    return [] #no special shuffle found
#the following function takes a list x
#with n distinct elements (n>1) and an integer k
#and returns a random list of length nk
#where consecutive elements are not the same
def specialShuffle(x,k):
    n = len(x)
    if n == 2:
        if random.random() < 0.5:
            a,b = x
        else:
            b,a = x
        return [a,b]*k
    else:
        deck = x*k
        return evolveShuffle(deck)
def randOrder(x,y):
    xs = specialShuffle(x,len(y))
    d = {}
    for i in x:
        ys = y[:]
        random.shuffle(ys)
        d[i] = iter(ys)
    pairs = []
    for i in xs:
        pairs.append((i,next(d[i])))
    return pairs
for example:
>>> randOrder([1,2,3,4,5],[1,2,3,4,5])
[(1, 4), (3, 1), (4, 5), (2, 2), (4, 3), (5, 3), (2, 1), (3, 3), (1, 1), (5, 2), (1, 3), (2, 5), (1, 5), (3, 5), (5, 5), (4, 4), (2, 3), (3, 2), (5, 4), (2, 4), (4, 2), (1, 2), (5, 1), (4, 1), (3, 4)]
As len(X) and len(Y) get larger, this has more difficulty finding a solution (and is designed to return the empty list in that eventuality), in which case the parameters popSize and numGens could be increased. As is, it is able to find 20x20 solutions very rapidly. It takes about a minute when X and Y are of size 100, but even then it is able to find a solution (in the times that I have run it).
Interesting restriction! I probably overthought this, solving a more general problem: shuffling an arbitrary list of sequences such that (if possible) no two adjacent sequences share a first item.
from itertools import product
from random import choice, randrange, shuffle
def combine(*sequences):
    return playlist(product(*sequences))
def playlist(sequence):
    r'''Shuffle a set of sequences, avoiding repeated first elements.
    '''
    result = list(sequence)
    length = len(result)
    if length < 2:
        # No rearrangement is possible.
        return result
    def swap(a, b):
        if a != b:
            result[a], result[b] = result[b], result[a]
    swap(0, randrange(length))
    for n in range(1, length):
        previous = result[n-1][0]
        choices = [x for x in range(n, length) if result[x][0] != previous]
        if not choices:
            # Trapped in a corner: Too many of the same item are left.
            # Backtrack as far as necessary to interleave other items.
            minor = 0
            major = length - n
            while n > 0:
                n -= 1
                if result[n][0] == previous:
                    major += 1
                else:
                    minor += 1
                if minor == major - 1:
                    if n == 0 or result[n-1][0] != previous:
                        break
            else:
                # The requirement can't be fulfilled,
                # because there are too many of a single item.
                shuffle(result)
                break
            # Interleave the majority item with the other items.
            major = [item for item in result[n:] if item[0] == previous]
            minor = [item for item in result[n:] if item[0] != previous]
            shuffle(major)
            shuffle(minor)
            result[n] = major.pop(0)
            n += 1
            while n < length:
                result[n] = minor.pop(0)
                n += 1
                result[n] = major.pop(0)
                n += 1
            break
        swap(n, choice(choices))
    return result
This starts out simple, but when it discovers that it can't find an item with a different first element, it figures out how far back it needs to go to interleave that element with something else. Therefore, the main loop traverses the array at most three times (once backwards), but usually just once. Granted, each iteration of the first forward pass checks each remaining item in the array, and the array itself contains every pair, so the overall run time is O((NM)**2).
For your specific problem:
>>> X = Y = [1, 2, 3, 4, 5]
>>> combine(X, Y)
[(3, 5), (1, 1), (4, 4), (1, 2), (3, 4),
(2, 3), (5, 4), (1, 5), (2, 4), (5, 5),
(4, 1), (2, 2), (1, 4), (4, 2), (5, 2),
(2, 1), (3, 3), (2, 5), (3, 2), (1, 3),
(4, 3), (5, 3), (4, 5), (5, 1), (3, 1)]
By the way, this compares x values by equality, not by position in the X array, which may make a difference if the array can contain duplicates. In fact, duplicate values might trigger the fallback case of shuffling all pairs together if more than half of the X values are the same.
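Whether the fallback can trigger at all can be checked up front. A list can be ordered with no two equal neighbours exactly when its most common first item appears at most ceil(total / 2) times. The Python 3 sketch below encodes that test; the function name `arrangement_possible` is made up for illustration and is not part of the code above.

```python
from collections import Counter

def arrangement_possible(first_items):
    """True if the items can be ordered so that no two neighbours are equal."""
    counts = Counter(first_items)
    if not counts:
        return True
    total = sum(counts.values())
    # The most common item needs a different item between each of its copies.
    return max(counts.values()) <= (total + 1) // 2

xs = [x for x in [1, 2, 3, 4, 5] for _ in range(5)]  # first items of the 5x5 case
print(arrangement_possible(xs))
print(arrangement_possible([1, 1, 1, 2]))
```

For the 5x5 example every x appears 5 times out of 25, so a valid ordering always exists.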

Generate ordered tuples of infinite sequences

I have two generators genA and genB and each of them generates an infinite, strictly monotonically increasing sequence of integers.
Now I need a generator that generates all tuples (a, b) such that a is produced by genA and b is produced by genB and a < b, ordered by a + b ascending. In case of ambiguity the ordering is of no importance, i.e. if a + b == c + d, it doesn't matter if it generates (a, b) first or (c, d) first.
For instance. If both genA and genB generate the prime numbers, then the new generator should generate:
(2, 3), (2, 5), (3, 5), (2, 7), (3, 7), (5, 7), (2, 11), ...
If genA and genB were finite lists, zipping and then sorting would do the trick.
Apparently, for all tuples of the form (x, b) the following holds: first(genA) <= x <= max(genA,b) <= b, where first(genA) is the first element generated by genA and max(genA,b) is the last element generated by genA that is less than b.
This is how far I have gotten. Any ideas of how to combine two generators in the described manner?
I don't think it is possible to do this without saving all the results from genA. A solution might look something like this:
import heapq
def gen_weird_sequence(genA, genB):
    heap = []
    a0 = next_a = next(genA)
    saved_a = []
    for b in genB:
        while next_a < b:
            saved_a.append(next_a)
            next_a = next(genA)
        # saved_a now contains all a < b
        for a in saved_a:
            heapq.heappush(heap, (a+b, a, b)) #decorate pair with sorting key a+b
        # (minimum sum in the next round) > b + a0, so yield everything smaller
        while heap and heap[0][0] <= b + a0:
            yield heapq.heappop(heap)[1:] # pop smallest and undecorate
Explanation: The main loop simply iterates over all elements in genB; for each b it gets all elements from genA that are smaller than b and saves them in a list. It then generates all the tuples (a0, b), (a1, b), ..., (a_n, b) and stores them in a min-heap, which is an efficient data structure when you are only interested in extracting the minimum value of a collection. As with sorting, the trick is to store not the bare pairs but tuples that prepend the value you want to sort on (a+b), since comparisons between tuples start by comparing the first item. Finally, it pops off the heap all the elements whose sum is guaranteed to be smaller than the sum of any pair generated for the next b, and yields them.
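The decorate-and-undecorate trick can be seen in isolation in this small Python 3 sketch. The sample pairs are arbitrary and were picked only for illustration.

```python
import heapq

pairs = [(3, 7), (2, 5), (5, 7), (2, 3)]
heap = []
for a, b in pairs:
    # Decorate: the sum goes first, so the heap orders entries by a + b.
    heapq.heappush(heap, (a + b, a, b))

ordered = []
while heap:
    # Undecorate: drop the leading sum and keep the (a, b) pair.
    ordered.append(heapq.heappop(heap)[1:])

print(ordered)  # [(2, 3), (2, 5), (3, 7), (5, 7)]
```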
Note that both heap and saved_a will increase while you are generating results, I guess proportionally to the square root of the number of elements generated so far.
Quick test with some primes:
In [2]: genA = (a for a in [2,3,5,7,11,13,17,19])
In [3]: genB = (b for b in [2,3,5,7,11,13,17,19])
In [4]: for pair in gen_weird_sequence(genA, genB): print pair
(2, 3)
(2, 5)
(3, 5)
(2, 7)
(3, 7)
(5, 7)
(2, 11)
(3, 11)
(2, 13)
(3, 13)
(5, 11)
(5, 13)
(7, 11)
(2, 17)
(3, 17)
(7, 13)
as expected. Test with infinite generators:
In [11]: from itertools import *
In [12]: list(islice(gen_weird_sequence(count(), count()), 16))
Out[12]: [(0, 1), (0, 2), (0, 3), (1, 2), (0, 4), (1, 3), (0, 5), (1, 4),
(2, 3), (0, 6), (1, 5), (2, 4), (0, 7), (1, 6), (2, 5), (3, 4)]

Collapse a list of range tuples into the overlapping ranges

I'm looking for the most memory efficient way to solve this problem.
I have a list of tuples representing partial string matches in a sentence:
[(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
The first value of each tuple is the start position for the match, the second value is the length.
The idea is to collapse the list so that only the longest continuous string match is reported. In this case it would be:
[(0,4), (2,6), (22,6)]
I do not want just the longest range, like in algorithm to find longest non-overlapping sequences, but I want all the ranges collapsed by the longest.
In case you're wondering, I am using a pure Python implementation of Aho-Corasick for matching terms in a static dictionary to the given text snippet.
EDIT: Due to the nature of these tuple lists, overlapping but not self-contained ranges should be printed out individually. For example, having the words betaz and zeta in the dictionary, the matches for betazeta are [(0,5),(4,8)]. Since these ranges overlap, but none is contained in the other, the answer should be [(0,5),(4,8)]. I have also modified the input dataset above so that this case is covered.
Thanks!
import operator
lst = [(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort(key=operator.itemgetter(1))
for i in reversed(xrange(len(lst)-1)):
    start, length = lst[i]
    for j in xrange(i+1, len(lst)):
        lstart, llength = lst[j]
        if start >= lstart and start + length <= lstart + llength:
            del lst[i]
            break
print lst
#[(0, 4), (2, 6), (22, 6)]
a = [(0, 2), (1, 2), (0, 4), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
b = [set(xrange(i, i + j)) for i, j in a]
c = b.pop().union(*b)
collapsed = sorted(c)
print collapsed
#Maybe this is useful?:
[0, 1, 2, 3, 22, 23, 24, 25, 26, 27]
#But if you want the requested format, then do this:
d = []
start = collapsed[0]
length = 0
for val in collapsed:
    if start + length < val:
        d.append((start,length))
        start = val
        length = 0
    elif val == collapsed[-1]:
        d.append((start,length + 1))
    length += 1
print d
#Output:
[(0,4), (22,6)]
So, taking you at your word that your main interest is space efficiency, here's one way to do what you want:
lst = [(0, 2), (1, 2), (0, 4), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort()
start, length = lst.pop(0)
i = 0
while i < len(lst):
    x, l = lst[i]
    if start + length < x:
        lst[i] = (start, length)
        i += 1
        start, length = x, l
    else:
        length = max(length, x + l - start)
        lst.pop(i)
lst.append((start, length))
This modifies the list in place, never makes the list longer, only uses a small handful of variables to keep state, and only needs one pass through the list.
A much faster algorithm is possible if you don't want to modify the list in place - popping items from the middle of a list can be slow, especially if the list is long.
One reasonable optimization would be to keep a list of which indices you're going to remove, and then come back and rebuild the list in a second pass, that way you could rebuild the whole list in one go and avoid the pop overhead. But that would use more memory!
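If only ranges that are fully contained in another range should be dropped (as the EDIT in the question asks), a sort followed by a single pass is enough. The Python 3 sketch below builds a new list rather than editing in place, so it trades a little memory for simplicity; the function name `drop_contained` is made up for illustration.

```python
def drop_contained(matches):
    """Keep only (start, length) ranges not contained in another range."""
    kept = []
    max_end = float("-inf")
    # Sort by start, longest first, so a containing range is seen before
    # anything it contains.
    for start, length in sorted(matches, key=lambda m: (m[0], -m[1])):
        end = start + length
        if end > max_end:
            # Extends past everything seen so far, so nothing contains it.
            kept.append((start, length))
            max_end = end
    return kept

lst = [(0, 2), (1, 2), (0, 4), (2, 6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
print(drop_contained(lst))  # [(0, 4), (2, 6), (22, 6)]
```

Overlapping ranges where neither contains the other, such as (0, 4) and (2, 6), are both kept, and exact duplicates are reported once.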
