Pythonic way to merge two overlapping lists, preserving order


Alright, so I have two lists, as such:
They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
There will not be additional items in the overlap, for example, this will not happen: [1, 2, 3, 4, 5], [3.5, 4, 5, 6, 7]
The lists are not necessarily ordered nor unique. [9, 1, 1, 8, 7], [8, 6, 7].
I want to merge the lists such that existing order is preserved, and to merge at the last possible valid position, and such that no data is lost. Additionally, the first list might be huge. My current working code is as such:
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
def merge(master, addition):
    n = 1
    while n < len(master):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
        n += 1
    return master + addition
What I would like to know is - is there a more efficient way of doing this? It works, but I'm slightly leery of this, because it can run into large runtimes in my application - I'm merging large lists of strings.
EDIT: I'd expect the merge of [1,3,9,8,3,4,5], [3,4,5,7,8] to be: [1,3,9,8,3,4,5,7,8]. For clarity, the overlapping portion is 3,4,5.
[9, 1, 1, 8, 7], [8, 6, 7] should merge to [9, 1, 1, 8, 7, 8, 6, 7]

You can try the following:
>>> a = [1, 3, 9, 8, 3, 4, 5]
>>> b = [3, 4, 5, 7, 8]
>>> matches = (i for i in xrange(len(b), 0, -1) if b[:i] == a[-i:])
>>> i = next(matches, 0)
>>> a + b[i:]
[1, 3, 9, 8, 3, 4, 5, 7, 8]
The idea is we check the first i elements of b (b[:i]) with the last i elements of a (a[-i:]). We take i in decreasing order, starting from the length of b until 1 (xrange(len(b), 0, -1)) because we want to match as much as possible. We take the first such i by using next and if we don't find it we use the zero value (next(..., 0)). From the moment we found the i, we add to a the elements of b from index i.
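The same idea can be wrapped in a function that also runs on Python 3. This is a minimal sketch: the name merge_longest is an assumption, and capping the search at the shorter list's length is a small addition to the snippet above.

```python
def merge_longest(a, b):
    """Append b to a, dropping the longest prefix of b that already ends a."""
    # An overlap can never be longer than the shorter of the two lists.
    longest = min(len(a), len(b))
    # Try overlap lengths from longest to shortest; stop at the first match.
    matches = (i for i in range(longest, 0, -1) if b[:i] == a[-i:])
    i = next(matches, 0)  # 0 means "no overlap found"
    return a + b[i:]


print(merge_longest([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
```

Each candidate length still costs a slice comparison, so the worst case stays quadratic in the length of the shorter list.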

There are a couple of easy optimizations that are possible.
You don't need to start at master[1], since the longest overlap starts at master[-len(addition)]
If you add a call to list.index you can avoid creating sub-lists and comparing lists for each index:
This approach keeps the code pretty understandable too (and easier to optimize by using cython or pypy):
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
def merge(master, addition):
    first = addition[0]
    n = max(len(master) - len(addition), 1)  # (1)
    while True:
        try:
            n = master.index(first, n)  # (2) n is an index into master
        except ValueError:
            return master + addition
        overlap = len(master) - n  # length of the candidate overlap
        if master[n:] == addition[:overlap]:
            return master + addition[overlap:]
        n += 1

This actually isn't too terribly difficult. After all, essentially all you're doing is checking what substring at the end of A lines up with what substring of B.
def merge(a, b):
    max_offset = len(b)  # can't overlap with greater size than len(b)
    for i in reversed(range(max_offset+1)):
        # checks for equivalence of decreasing sized slices
        if a[-i:] == b[:i]:
            break
    return a + b[i:]
We can test with your test data by doing:
test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
{'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]
all(merge(test['a'], test['b']) == test['result'] for test in test_data)
This runs through every possible combination of slices that could result in an overlap and remembers the result of the overlap if one is found. If nothing is found, it uses the last result of i which will always be 0. Either way, it returns all of a plus everything past b[i] (in the overlap case, that's the non overlapping portion. In the non-overlap case, it's everything)
Note that we can make a couple optimizations in corner cases. For instance, the worst case here is that it runs through the whole list without finding any solution. You could add a quick check at the beginning that might short circuit that worst case
def merge(a, b):
    if a[-1] not in b:
        return a + b
    ...
In fact you could take that solution one step further and probably make your algorithm much faster
def merge(a, b):
    start = 0
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
        except ValueError:  # no (further) occurrence of a[-1] in b
            return a + b
        if a[-idx:] == b[:idx]:
            return a + b[idx:]
        start = idx  # keep searching to the right of this occurrence
However this might not find the longest overlap in cases like:
a = [1,2,3,4,1,2,3,4]
b = [3,4,1,2,3,4,5,6]
# result should be [1,2,3,4,1,2,3,4,5,6], but
# this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]
You could fix that by searching b from the right instead of from the left (lists have no rindex method, so you would have to scan for the last occurrence yourself) to match the longest slice instead of the shortest, but I'm not sure what that does to your speed. It's certainly slower, but it might be inconsequential. You could also collect every valid merge and return the shortest one, which might be a better idea.
def merge(a, b):
    results = [a + b]  # the plain concatenation is always a valid merge
    start = 0
    while True:
        try:
            idx = b.index(a[-1], start) + 1  # next occurrence of a[-1] in b
        except ValueError:  # no further occurrence of a[-1] in b
            break
        if a[-idx:] == b[:idx]:
            results.append(a + b[idx:])
        start = idx
    return min(results, key=len)
Which should work since merging the longest overlap should produce the shortest result in all cases.

One trivial optimization is not iterating over the whole master list. I.e., replace while n < len(master) with for n in range(min(len(addition), len(master))) (and don't increment n in the loop). If there is no match, your current code will iterate over the entire master list, even if the slices being compared aren't even of the same length.
Another concern is that you're taking slices of master and addition in order to compare them, which creates two new lists every time, and isn't really necessary. This solution (inspired by Boyer-Moore) doesn't use slicing:
def merge(master, addition):
    overlap_lens = (i + 1 for i, e in enumerate(addition) if e == master[-1])
    for overlap_len in overlap_lens:
        for i in range(overlap_len):
            if master[-overlap_len + i] != addition[i]:
                break
        else:
            return master + addition[overlap_len:]
    return master + addition
The idea here is to generate all the indices of the last element of master in addition, and add 1 to each. Since a valid overlap must end with the last element of master, only those values are lengths of possible overlaps. Then we can check for each of them if the elements before it also line up.
The function currently assumes that master is longer than addition (you'll probably get an IndexError at master[-overlap_len + i] if it isn't). Add a condition to the overlap_lens generator if you can't guarantee it.
It's also non-greedy, i.e. it looks for the smallest non-empty overlap (merge([1, 2, 2], [2, 2, 3]) will return [1, 2, 2, 2, 3]). I think that's what you meant by "to merge at the last possible valid position". If you want a greedy version, reverse the overlap_lens generator.
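A greedy variant with the length guard mentioned above might look like the sketch below. The name merge_greedy is an assumption, and materializing the candidate lengths in a list (so they can be walked in reverse) is a choice made here, not something the answer prescribes.

```python
def merge_greedy(master, addition):
    """Merge on the longest overlap, comparing element by element (no slices compared)."""
    if not master:
        return list(addition)
    last = master[-1]
    # A valid overlap must end with master's last element, and it cannot be
    # longer than master itself (this is the guard against IndexError).
    candidates = [i + 1 for i, e in enumerate(addition)
                  if e == last and i < len(master)]
    for overlap_len in reversed(candidates):  # longest candidate first
        if all(master[-overlap_len + i] == addition[i]
               for i in range(overlap_len)):
            return master + addition[overlap_len:]
    return master + addition


print(merge_greedy([1, 2, 2], [2, 2, 3]))
```

With the greedy order, merge_greedy([1, 2, 2], [2, 2, 3]) gives [1, 2, 2, 3] instead of the non-greedy [1, 2, 2, 2, 3].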

I don't offer optimizations but another way of looking at the problem. To me, this seems like a particular case of http://en.wikipedia.org/wiki/Longest_common_substring_problem where the substring would always be at the end of the list/string. The following algorithm is the dynamic programming version.
def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return x_longest - longest, x_longest
master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)
master = [9, 1, 1, 8, 7]
addition = [8, 6, 7]
s, e = longest_common_substring(master, addition)
if e - s > 1:
    print(master[:s] + addition)
else:
    print(master + addition)
Output:
[1, 3, 9, 8, 3, 4, 5, 7, 8]
[9, 1, 1, 8, 7, 8, 6, 7]

First of all and for clarity, you can replace your while loop with a for loop:
def merge(master, addition):
    for n in xrange(1, len(master)):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
Then, you don't have to compare all possible slices, but only those for which master's slice starts with the first element of addition:
def merge(master, addition):
    indices = [len(master) - i for i, x in enumerate(master) if x == addition[0]]
    for n in indices:
        if master[-n:] == addition[:n]:
            return master + addition[n:]
    return master + addition
So instead of comparing slices like this:
1234123141234
            3579
           3579
          3579
         3579
        3579
       3579
      3579
     3579
    3579
   3579
  3579
 3579
3579
you are only doing these comparisons:
1234123141234
  |   |    |
  |   |    3579
  |   3579
  3579
How much this will speed up your program depends on the nature of your data: the fewer repeated elements your lists have, the better.
You could also generate a list of indices for addition so its own slices always end with master's last element, further restricting the number of comparisons.

Based on https://stackoverflow.com/a/30056066/541208:
def join_two_lists(a, b):
    index = 0
    for i in xrange(len(b), 0, -1):
        # if everything from start to ith of b is the
        # same from the end of a at ith append the result
        if b[:i] == a[-i:]:
            index = i
            break
    return a + b[index:]

All the above solutions are similar in that they use a for / while loop for the merging task. I first tried the solutions by @JuniorCompressor and @TankorSmash, but they are far too slow for merging two large lists (e.g. lists with millions of elements).
I found that using pandas to concatenate large lists is much more time-efficient:
import pandas as pd, numpy as np
trainCompIdMaps = pd.DataFrame( { "compoundId": np.random.permutation( range(800) )[0:80], "partition": np.repeat( "train", 80).tolist()} )
testCompIdMaps = pd.DataFrame( {"compoundId": np.random.permutation( range(800) )[0:20], "partition": np.repeat( "test", 20).tolist()} )
# row-wise concatenation for two pandas
compoundIdMaps = pd.concat([trainCompIdMaps, testCompIdMaps], axis=0)
mergedCompIds = np.array(compoundIdMaps["compoundId"])
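Note that pd.concat in the snippet above simply appends one frame to the other; it does not look for an overlap. If the overlap search itself is the bottleneck, one option that none of the answers here use is the prefix function from the Knuth-Morris-Pratt string-matching algorithm, which finds the longest prefix of the second list that is also a suffix of the first in linear time. The sketch below is an illustration under that assumption; the name merge_kmp and the object() sentinel are choices made for this example.

```python
def merge_kmp(master, addition):
    """Merge on the longest overlap using the KMP prefix function."""
    if not addition:
        return list(master)
    # Only the last len(addition) items of master can take part in an overlap.
    tail = master[-len(addition):]
    # A unique sentinel keeps any match from running across the boundary.
    seq = list(addition) + [object()] + list(tail)
    # pi[i] = length of the longest proper prefix of seq[:i + 1]
    # that is also a suffix of it.
    pi = [0] * len(seq)
    for i in range(1, len(seq)):
        k = pi[i - 1]
        while k > 0 and seq[i] != seq[k]:
            k = pi[k - 1]
        if seq[i] == seq[k]:
            k += 1
        pi[i] = k
    overlap = pi[-1]  # longest prefix of addition that ends master
    return list(master) + list(addition[overlap:])


print(merge_kmp([1, 3, 9, 8, 3, 4, 5], [3, 4, 5, 7, 8]))
```

The work is proportional to len(addition), no matter how large master is, apart from the final concatenation.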

What you need is a sequence alignment algorithm like Needleman-Wunsch.
Needleman-Wunsch is a global sequence alignment algorithm based on dynamic programming:
I found this nice implementation to merge arbitrary object sequences in python:
https://github.com/ajnisbet/paired
import paired
seq_1 = 'The quick brown fox jumped over the lazy dog'.split(' ')
seq_2 = 'The brown fox leaped over the lazy dog'.split(' ')
alignment = paired.align(seq_1, seq_2)
print(alignment)
# [(0, 0), (1, None), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 7)]
for i_1, i_2 in alignment:
    print((seq_1[i_1] if i_1 is not None else '').ljust(15), end='')
    print(seq_2[i_2] if i_2 is not None else '')
# The            The
# quick
# brown          brown
# fox            fox
# jumped         leaped
# over           over
# the            the
# lazy           lazy
# dog            dog

Related

Find smallest repeated piece of a list

I've got some list with integers like:
l1 = [8,9,8,9,8,9,8]
l2 = [3,4,2,4,3]
My goal is to reduce each list to its smallest repeated piece. So:
output_l1 = [8,9]
output_l2 = [3,4,2,4]
The biggest problem is that the sequence is not always complete. So not
'abcabcabc'
just
'abcabcab'.

def shortest_repeating_sequence(inp):
    for i in range(1, len(inp)):
        if all(inp[j] == inp[j % i] for j in range(i, len(inp))):
            return inp[:i]
    # inp doesn't have a repeating pattern if we got this far
    return inp[:]
This code is O(n^2). The worst case is one element repeated a lot of times followed by something that breaks the pattern at the end, for example [1, 1, 1, 1, 1, 1, 1, 1, 1, 8].
You start with 1, and then iterate over the entire list checking that each inp[i] is equal to inp[i % 1]. Any number % 1 is equal to 0, so you're checking if each item in the input is equal to the first item in the input. If all items are equal to the first element then the repeating pattern is a list with just the first element so we return inp[:1].
If at some point you hit an element that isn't equal to the first element (all() stops as soon as it finds a False), you try with 2. So now you're checking if each element at an even index is equal to the first element (4 % 2 is 0) and if every odd index is equal to the second item (5 % 2 is 1). If you get all the way through this, the pattern is the first two elements so return inp[:2], otherwise try again with 3 and so on.
You could do range(1, len(inp)+1) and then the for loop will handle the case where inp doesn't contain a repeating pattern, but then you have to needlessly iterate over the entire inp at the end. And you'd still have to have return [] at the end to handle inp being the empty list.
I return a copy of the list (inp[:]) instead of the list to have consistent behavior. If I returned the original list with return inp and someone called that function on a list that didn't have a repeating pattern (ie their repeating pattern is the original list) and then did something with the repeating pattern, it would modify their original list as well.
shortest_repeating_sequence([4, 2, 7, 4, 6]) # no pattern
[4, 2, 7, 4, 6]
shortest_repeating_sequence([2, 3, 1, 2, 3]) # pattern doesn't repeat fully
[2, 3, 1]
shortest_repeating_sequence([2, 3, 1, 2]) # pattern doesn't repeat fully
[2, 3, 1]
shortest_repeating_sequence([8, 9, 8, 9, 8, 9, 8])
[8, 9]
shortest_repeating_sequence([1, 1, 1, 1, 1])
[1]
shortest_repeating_sequence([])
[]
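If the quadratic worst case matters, a linear-time alternative (not part of the answer above) is the prefix function from the Knuth-Morris-Pratt algorithm: the smallest period of a sequence of length n is n minus the length of its longest proper prefix that is also a suffix. A minimal sketch, with the name shortest_period as an assumption:

```python
def shortest_period(seq):
    """Smallest prefix whose repetition (possibly cut short) rebuilds seq."""
    seq = list(seq)
    n = len(seq)
    if n == 0:
        return []
    # pi[i] = length of the longest proper prefix of seq[:i + 1]
    # that is also a suffix of it.
    pi = [0] * n
    for i in range(1, n):
        k = pi[i - 1]
        while k > 0 and seq[i] != seq[k]:
            k = pi[k - 1]
        if seq[i] == seq[k]:
            k += 1
        pi[i] = k
    return seq[:n - pi[-1]]


print(shortest_period([8, 9, 8, 9, 8, 9, 8]))
```

Like the function above, it returns a new list, and it returns a copy of the whole input when nothing repeats.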

The following code is a rework of your solution that addresses some issues:
Your solution as posted doesn't handle your own 'abcabcab' example.
Your solution keeps processing even after it's found a valid result, and then filters through both the valid and non-valid results. Instead, once a valid result is found, we process and return it. Additional valid results, and non-valid results, are simply ignored.
@Boris's issue regarding returning the input if there is no repeating pattern.
CODE
def repeated_piece(target):
    target = list(target)
    length = len(target)
    for final in range(1, length):
        result = []
        while len(result) < length:
            for i in target[:final]:
                result.append(i)
        if result[:length] == target:
            return result[:final]
    return target
l1 = [8, 9, 8, 9, 8, 9, 8]
l2 = [3, 4, 2, 4, 3]
l3 = 'abcabcab'
l4 = [1, 2, 3]
print(*repeated_piece(l1), sep='')
print(*repeated_piece(l2), sep='')
print(*repeated_piece(l3), sep='')
print(*repeated_piece(l4), sep='')
OUTPUT
% python3 test.py
89
3424
abc
123
%
You can still use:
print(''.join(map(str, repeated_piece(l1))))
if you're uncomfortable with the simpler Python 3 idiom:
print(*repeated_piece(l1), sep='')

SOLUTION
target = [8,9,8,9,8,9,8]
length = len(target)
results = []
for j in range(1, length):
    # repeat the first j items until the result is at least as long as target
    result = []
    while len(result) < length:
        for i in target[:j]:
            result.append(i)
    results.append(result)
# mark which candidate lengths reproduce the target
final = []
for i in range(0, len(results)):
    if results[i][:length] == target:
        final.append(1)
    else:
        final.append(0)
if 1 in final:
    solution = results[final.index(1)][:final.index(1)+1]
else:
    solution = target
print(solution)
# result: [8, 9]

Simple Solution:
def get_unique_items_list(some_list):
    new_list = []
    for i in range(len(some_list)):
        if not some_list[i] in new_list:
            new_list.append(some_list[i])
    return new_list
l1 = [8,9,8,9,8,9,8]
l2 = [3,4,2,4,3]
print(get_unique_items_list(l1))
print(get_unique_items_list(l2))
#### Output ####
# [8, 9]
# [3, 4, 2]

Sorting a list depending on multiple elements within the list

is it possible to implement a python key for sorting depending on multiple list elements?
For example:
list = [1, 2, 3, 4]
And I want to sort the list depending on the difference between two elements, so that the delta is maximized between them.
Expected result:
list = [1, 4, 2, 3] # delta = 4-1 + 4-2 + 3-2 = 6
Other result would also be possible, but 1 is before 4 in the origin array so 1 should be taken first:
list = [4, 1, 3, 2] # delta = 4-1 + 3-1 + 3-2 = 6
I want to use python sorted like:
sorted(list, key=lambda e1, e2: abs(e1-e2))
Is there any possibility to do it this way? Maybe there is another library which could be used.

Since (as you showed us) there could be multiple different results - it means that this sorting/order is not deterministic and hence you can't apply a key function to it.
That said, it's easy to implement the sorting by yourself:
def my_sort(col):
    res = []
    while col:
        _max = max(col)
        col.remove(_max)
        res.append(_max)
        if col:
            _min = min(col)
            col.remove(_min)
            res.append(_min)
    return res
print(my_sort([1,2,3,4])) # [4, 1, 3, 2]
This solution runs in O(n^2) but it can be improved by sorting col in the beginning and then instead of looking for max and min we can extract the items in the beginning and the end of the list. By doing that we'll reduce the time complexity to O(n log(n))
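The sorted variant described above could look like the following sketch. The name my_sort_sorted and the two-index walk are assumptions; unlike my_sort, it leaves the input list untouched.

```python
def my_sort_sorted(col):
    """Alternate largest, smallest, second largest, ... in O(n log n)."""
    ordered = sorted(col)  # one sort up front instead of repeated max/min scans
    lo, hi = 0, len(ordered) - 1
    res = []
    while lo <= hi:
        res.append(ordered[hi])  # take the largest remaining item
        hi -= 1
        if lo <= hi:
            res.append(ordered[lo])  # then the smallest remaining item
            lo += 1
    return res


print(my_sort_sorted([1, 2, 3, 4]))  # [4, 1, 3, 2]
```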
EDIT
Per your comment below: if the index plays a role then, again, it's not a "real" sorting :) That said, this solution can be adapted to put the element with the smaller index first:
def my_sort(col):
    res = []
    while col:
        _max = max(col)
        max_index = col.index(_max)
        col.remove(_max)
        if col:
            _min = min(col)
            min_index = col.index(_min)  # index after _max has been removed
            col.remove(_min)
            # <= because removing _max shifted later items one place left
            if max_index <= min_index:
                res.extend([_max, _min])
            else:
                res.extend([_min, _max])
            continue
        res.append(_max)
    return res
print(my_sort([1,2,3,4])) # [1, 4, 2, 3]

This solution is quite brute force; however, it is still a possibility:
from itertools import permutations
list = [1, 2, 3, 4]
final_list = ((i, sum(abs(i[b]-i[b+1]) for b in range(len(i)-1))) for i in permutations(list, len(list)))
final_lists = max(final_list, key=lambda x:x[-1])
Output:
((2, 4, 1, 3), 7)
Note that the output is in the form (list, total_sum).

Cycle a list from alternating sides

Given a list
a = [0,1,2,3,4,5,6,7,8,9]
how can I get
b = [0,9,1,8,2,7,3,6,4,5]
That is, produce a new list in which each successive element is alternately taken from the two sides of the original list?

>>> [a[-i//2] if i % 2 else a[i//2] for i in range(len(a))]
[0, 9, 1, 8, 2, 7, 3, 6, 4, 5]
Explanation:
This code picks numbers from the beginning (a[i//2]) and from the end (a[-i//2]) of a, alternatingly (if i%2 else). A total of len(a) numbers are picked, so this produces no ill effects even if len(a) is odd.
[-i//2 for i in range(len(a))] yields 0, -1, -1, -2, -2, -3, -3, -4, -4, -5,
[ i//2 for i in range(len(a))] yields 0, 0, 1, 1, 2, 2, 3, 3, 4, 4,
and i%2 alternates between False and True,
so the indices we extract from a are: 0, -1, 1, -2, 2, -3, 3, -4, 4, -5.
My assessment of pythonicness:
The nice thing about this one-liner is that it's short and shows symmetry (+i//2 and -i//2).
The bad thing, though, is that this symmetry is deceptive:
One might think that -i//2 were the same as i//2 with the sign flipped. But in Python, integer division returns the floor of the result instead of truncating towards zero. So -1//2 == -1.
Also, I find accessing list elements by index less pythonic than iteration.
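To make the floor-division point concrete, here is a small sketch that computes the "from the end" index explicitly instead of relying on -i//2. The function name alternate_ends is an assumption.

```python
# Floor division rounds towards negative infinity, not towards zero.
print(-1 // 2, -(1 // 2))  # -1 0


def alternate_ends(a):
    """Return a[0], a[-1], a[1], a[-2], ... without relying on -i//2."""
    result = []
    for i in range(len(a)):
        half = i // 2
        # Odd steps read from the end: -(half + 1) is -1, -2, -3, ...
        result.append(a[-(half + 1)] if i % 2 else a[half])
    return result


print(alternate_ends([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

This spells out the asymmetry that the one-liner hides, at the cost of a few more lines.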

Cycle between getting items from the forward iterator and the reversed one. Just make sure you stop at len(a) with islice.
from itertools import islice, cycle
iters = cycle((iter(a), reversed(a)))
b = [next(it) for it in islice(iters, len(a))]
>>> b
[0, 9, 1, 8, 2, 7, 3, 6, 4, 5]
This can easily be put into a single line but then it becomes much more difficult to read:
[next(it) for it in islice(cycle((iter(a),reversed(a))),len(a))]
Putting it in one line would also prevent you from using the other half of the iterators if you wanted to:
>>> iters = cycle((iter(a), reversed(a)))
>>> [next(it) for it in islice(iters, len(a))]
[0, 9, 1, 8, 2, 7, 3, 6, 4, 5]
>>> [next(it) for it in islice(iters, len(a))]
[5, 4, 6, 3, 7, 2, 8, 1, 9, 0]

A very nice one-liner in Python 2.7:
results = list(sum(zip(a, reversed(a))[:len(a)/2], ()))
# results == [0, 9, 1, 8, 2, 7, 3, 6, 4, 5]
First you zip the list with its reverse, take half that list, sum the tuples to form one tuple, and then convert to list.
In Python 3, zip returns an iterator, so you have to use islice from itertools:
from itertools import islice
results = list(sum(islice(zip(a, reversed(a)),0,int(len(a)/2)),()))
Edit: It appears this only works perfectly for even-list lengths - odd-list lengths will omit the middle element :( A small correction for int(len(a)/2) to int(len(a)/2) + 1 will give you a duplicate middle value, so be warned.

Use the right toolz.
from toolz import interleave, take
b = list(take(len(a), interleave((a, reversed(a)))))
First, I tried something similar to Raymond Hettinger's solution with itertools (Python 3).
from itertools import chain, islice
interleaved = chain.from_iterable(zip(a, reversed(a)))
b = list(islice(interleaved, len(a)))

If you don’t mind sacrificing the source list, a, you can just pop back and forth:
b = [a.pop(-1 if i % 2 else 0) for i in range(len(a))]
Edit:
b = [a.pop(-bool(i % 2)) for i in range(len(a))]

Not terribly different from some of the other answers, but it avoids a conditional expression for determining the sign of the index.
a = range(10)
b = [a[i // (2*(-1)**(i&1))] for i in a]
i & 1 alternates between 0 and 1. This causes the power (-1)**(i&1) to alternate between 1 and -1. This causes the index divisor to alternate between 2 and -2, which causes the index to alternate from end to end as i increases. The sequence is a[0], a[-1], a[1], a[-2], a[2], a[-3], etc.
(I iterate i over a since in this case each value of a is equal to its index. In general, iterate over range(len(a)).)

The basic principle behind your question is a so-called roundrobin algorithm. The itertools-documentation-page contains a possible implementation of it:
from itertools import cycle, islice
def roundrobin(*iterables):
    """This function is taken from the python documentation!
    roundrobin('ABC', 'D', 'EF') --> A D E B F C
    Recipe credited to George Sakkis"""
    pending = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)  # next instead of __next__ for py2
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))
so all you have to do is split your list into two sublists one starting from the left end and one from the right end:
import math
mid = math.ceil(len(a)/2) # Just so that the next line doesn't need to calculate it twice
list(roundrobin(a[:mid], a[:mid-1:-1]))
# Gives you the desired result: [0, 9, 1, 8, 2, 7, 3, 6, 4, 5]
alternatively you could create a longer list (containing alternating items from sequence going from left to right and the items of the complete sequence going right to left) and only take the relevant elements:
list(roundrobin(a, reversed(a)))[:len(a)]
or using it as explicit generator with next:
rr = roundrobin(a, reversed(a))
[next(rr) for _ in range(len(a))]
or the speedy variant suggested by @Tadhg McDonald-Jensen (thank you!):
list(islice(roundrobin(a,reversed(a)),len(a)))

Not sure, whether this can be written more compactly, but it is efficient as it only uses iterators / generators
a = [0,1,2,3,4,5,6,7,8,9]
iter1 = iter(a)
iter2 = reversed(a)
b = [item for n, item in enumerate(
        next(iter) for _ in a for iter in (iter1, iter2)
    ) if n < len(a)]

For fun, here is an itertools variant:
>>> from itertools import chain, islice, izip  # Python 2; use the built-in zip in Python 3
>>> a = [0,1,2,3,4,5,6,7,8,9]
>>> list(chain.from_iterable(izip(islice(a, len(a)//2), reversed(a))))
[0, 9, 1, 8, 2, 7, 3, 6, 4, 5]
This works where len(a) is even. It would need a special code for odd-lengthened input.
Enjoy!

Not at all elegant, but it is a clumsy one-liner:
a = range(10)
[val for pair in zip(a[:len(a)//2],a[-1:(len(a)//2-1):-1]) for val in pair]
Note that it assumes you are doing this for a list of even length. If that breaks, then this breaks (it drops the middle term). Note that I got some of the idea from here.

Two versions not seen yet:
b = list(sum(zip(a, a[::-1]), ())[:len(a)])
and
import itertools as it
b = [a[j] for j in it.accumulate(i*(-1)**i for i in range(len(a)))]

mylist = [0,1,2,3,4,5,6,7,8,9]
result = []
for i in mylist:
    result += [i, mylist.pop()]
Note:
Beware: as @Tadhg McDonald-Jensen has said (see the comment below),
this destroys half of the original list object.

One way to do this for even-sized lists (inspired by this post):
a = range(10)
b = [val for pair in zip(a[:5], a[5:][::-1]) for val in pair]

I would do something like this
a = [0,1,2,3,4,5,6,7,8,9]
b = []
i = 0
j = len(a) - 1
mid = (i + j) // 2  # integer division, so mid can be compared with i
while i <= j:
    if i == mid and len(a) % 2 == 1:
        # odd length: the middle element is taken once
        b.append(a[i])
        break
    b.extend([a[i], a[j]])
    i = i + 1
    j = j - 1
print(b)

You can partition the list into two parts about the middle, reverse the second half and zip the two partitions, like so:
a = [0,1,2,3,4,5,6,7,8,9]
mid = len(a)//2
l = []
for x, y in zip(a[:mid], a[:mid-1:-1]):
    l.append(x)
    l.append(y)
# if the length is odd
if len(a) % 2 == 1:
    l.append(a[mid])
print(l)
Output:
[0, 9, 1, 8, 2, 7, 3, 6, 4, 5]

Python, converting a list of indices to slices

So I have a list of indices,
[0, 1, 2, 3, 5, 7, 8, 10]
and want to convert it to this,
[[0, 3], [5], [7, 8], [10]]
this will run on a large number of indices.
Also, this technically isn't for slices in python, the tool I am working with is faster when given a range compared to when given the individual ids.
The pattern is based on being in a range, like slices work in python. So in the example, the 1 and 2 are dropped because they are already included in the range of 0 to 3. The 5 would need accessed individually since it is not in a range, etc. This is more helpful when a large number of ids get included in a range such as [0, 5000].

Since you want the code to be fast, I wouldn't try to be too fancy. A straight-forward approach should perform quite well:
a = [0, 1, 2, 3, 5, 7, 8, 10]
it = iter(a)
start = next(it)
slices = []
for i, x in enumerate(it):
    if x - a[i] != 1:
        end = a[i]
        if start == end:
            slices.append([start])
        else:
            slices.append([start, end])
        start = x
if a[-1] == start:
    slices.append([start])
else:
    slices.append([start, a[-1]])
Admittedly, that doesn't look too nice, but I expect the nicer solutions I can think of to perform worse. (I did not do a benchmark.)
Here is a slightly nicer, but slower solution:
from itertools import groupby
a = [0, 1, 2, 3, 5, 7, 8, 10]
slices = []
for key, it in groupby(enumerate(a), lambda x: x[1] - x[0]):
    indices = [y for x, y in it]
    if len(indices) == 1:
        slices.append([indices[0]])
    else:
        slices.append([indices[0], indices[-1]])

import itertools
def runs(seq):
    previous = None
    start = None
    # the trailing None flushes the last run
    for value in itertools.chain(seq, [None]):
        if start is None:
            start = value
        if previous is not None and value != previous + 1:
            if start == previous:
                yield [previous]
            else:
                yield [start, previous]
            start = value
        previous = value

Since performance is an issue, go with the first solution by @SvenMarnach, but here is a fun one-liner split into two lines! :D
>>> from itertools import groupby, count
>>> indices = [0, 1, 2, 3, 5, 7, 8, 10]
>>> [[next(v)] + list(v)[-1:]
...  for k,v in groupby(indices, lambda x,c=count(): x-next(c))]
[[0, 3], [5], [7, 8], [10]]

Below is a simple piece of Python code using numpy:
import numpy
def list_to_slices(inputlist):
    """
    Convert a flat list to a list of slices:
    test = [0,2,3,4,5,6,12,99,100,101,102,13,14,18,19,20,25]
    list_to_slices(test)
    -> [(0, 0), (2, 6), (12, 14), (18, 20), (25, 25), (99, 102)]
    """
    inputlist.sort()  # note: sorts the caller's list in place
    pointers = numpy.where(numpy.diff(inputlist) > 1)[0]
    pointers = zip(numpy.r_[0, pointers+1], numpy.r_[pointers, len(inputlist)-1])
    slices = [(inputlist[i], inputlist[j]) for i, j in pointers]
    return slices

If your input is a sorted sequence, which I assume it is, you can do it in a minimalistic way in three steps by employing the old good zip() function:
x = [0, 1, 2, 3, 5, 7, 8, 10]
# find beginnings and endings of sequential runs,
# N.B. the first beginning and the last ending are not included
begs_ends_iter = zip(
    *[(x1, x0) for x0, x1 in zip(x[:-1], x[1:]) if x1 - x0 > 1]
)
# handling case when there is only one sequential run
begs, ends = tuple(begs_ends_iter) or ((), ())
# add the first beginning and the last ending,
# combine corresponding beginnings and endings,
# and convert isolated elements into the lists of length one
y = [
    [beg] if beg == end else [beg, end]
    for beg, end in zip(tuple(x[:1]) + begs, ends + tuple(x[-1:]))
]
If your input is unsorted, sort it first. If you have a sorted iterable and do not want to convert it to a sequence (e.g., because it is too long), you can use the chain() and pairwise() functions from the itertools package (pairwise() is available since Python 3.10):
from itertools import chain, pairwise
x = [0, 1, 2, 3, 5, 7, 8, 10]
# find beginnings and endings of sequential runs,
# N.B. the last beginning and the first ending are None's
begs, ends = zip(
    *[
        (x1, x0)
        for x0, x1 in pairwise(chain([None], x, [None]))
        if x0 is None or x1 is None or x1 - x0 > 1
    ]
)
# removing the last beginning and the first ending,
# combine corresponding beginnings and endings,
# and convert isolated elements into the lists of length one
y = [
    [beg] if beg == end else [beg, end]
    for beg, end in zip(begs[:-1], ends[1:])
]
These solutions are similar to the one proposed by bougui, but without numpy. That may be more efficient if the data is not already in a numpy array and the sequence is not very large or, at the other extreme, if the iterable is too large to fit into memory.

Getting the indices of the X largest numbers in a list

Please no built-ins besides len() or range(). I'm studying for a final exam.
Here's an example of what I mean.
def find_numbers(x, lst):
lst = [3, 8, 1, 2, 0, 4, 8, 5]
find_numbers(3, lst) # this should return -> (1, 6, 7)
I tried this, but did not finish it; I couldn't figure out the best way of going about it:
def find_K_highest(lst, k):
    newlst = [0] * k
    maxvalue = lst[0]
    for i in range(len(lst)):
        if lst[i] > maxvalue:
            maxvalue = lst[i]
            newlst[0] = i

Take the first 3 (x) numbers from the list. These are the initial candidates for the largest values. In your case: 3, 8, 1. Their indices are (0, 1, 2). Build (value, index) pairs from them: ((3,0), (8,1), (1,2)).
Now sort them by value, largest first: ((8,1), (3,0), (1,2)).
With this initial list, you can traverse the rest of the list recursively. Compare the smallest value (1, _) with the next element in the list (2, 3). If that is larger (it is), sort it into the list ((8,1), (3,0), (2,3)) and throw away the smallest.
In the beginning you have many changes in the top 3, but later on they become rare. Of course you have to keep track of the current position (3, 4, 5, ...) too while traversing.
An insertion sort for the top N elements should be pretty performant.
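The steps above can be sketched as follows, using only len() and range() plus list slicing and concatenation. The name find_numbers_sketch is an assumption, it uses a loop rather than recursion, and it returns a list instead of the tuple shown in the question because tuple() would be another built-in.

```python
def find_numbers_sketch(x, lst):
    """Indices of the x largest values, kept in a small sorted candidate list."""
    top = []  # (value, index) pairs, largest value first, at most x entries
    for i in range(len(lst)):
        value = lst[i]
        # Walk left past every candidate that is strictly smaller than value.
        pos = len(top)
        while pos > 0 and top[pos - 1][0] < value:
            pos -= 1
        if pos < x:
            # Insert at pos, then drop the smallest candidate if there are too many.
            top = top[:pos] + [(value, i)] + top[pos:]
            if len(top) > x:
                top = top[:x]
    # Report the winning indices in their original order.
    chosen = [False] * len(lst)
    for pair in top:
        chosen[pair[1]] = True
    return [i for i in range(len(lst)) if chosen[i]]


print(find_numbers_sketch(3, [3, 8, 1, 2, 0, 4, 8, 5]))  # [1, 6, 7]
```

Ties are resolved in favour of the element that appears first, because an equal value is inserted after the candidates already in the list.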
Here is a similar problem in Scala but without the need to report the indexes.

I don't know whether it is good to post a full solution, but this seems to work:
def find_K_highest(lst, k):
    # escape index error
    if k > len(lst):
        k = len(lst)
    # the output array
    idxs = [None] * k
    # indices still in play (a real list, so that del also works in Python 3)
    to_watch = [i for i in range(len(lst))]
    # do it k times
    for i in range(k):
        # guess that max value is at least at idx '0' of to_watch
        to_del = 0
        idx = to_watch[to_del]
        max_val = lst[idx]
        # search through the list for bigger value and its index
        for jj in range(len(to_watch)):
            j = to_watch[jj]
            val = lst[j]
            # check that it's bigger than the previously found max
            if val > max_val:
                idx = j
                max_val = val
                to_del = jj
        # append it
        idxs[i] = idx
        del to_watch[to_del]
    # return answer
    return idxs
PS I tried to explain every line of code.

Can you use list methods? (e.g. append, sort, index?). If so, this should work (I think...)
def find_numbers(n,lst):
    ll=lst[:]
    ll.sort()
    biggest=ll[-n:]
    idx=[lst.index(i) for i in biggest] #This has the indices already, but we could have trouble if one of the numbers appeared twice
    idx.sort()
    #check for duplicates. Duplicates will always be next to each other since we sorted.
    for i in range(1,len(idx)):
        if(idx[i-1]==idx[i]):
            #found a duplicate: search the rest of the input list; the slice starts at idx[i]+1, so add that offset back
            idx[i]=idx[i]+1+lst[idx[i]+1:].index(lst[idx[i]])
    idx.sort()
    return idx
lst = [3, 8, 1, 2, 0, 4, 8, 5]
print(find_numbers(3, lst))

Dude. You have two ways you can go with this.
First way is to be clever. Psych your teacher out. What she is looking for is recursion. You can write this with NO recursion and NO built in functions or methods:
#!/usr/bin/python
lst = [3, 8, 1, 2, 0, 4, 8, 5]
minval=-2**64
largest=[]
def enum(lst):
    for i in range(len(lst)):
        yield i,lst[i]
for x in range(3):
    m=minval
    m_index=None
    for i,j in enum(lst):
        if j>m:
            m=j
            m_index=i
    if m_index is not None:  # index 0 is a valid result, so don't test truthiness
        largest=largest+[m_index]
        lst[m_index]=minval
print(largest)
This works. It is clever. Take that teacher!!! BUT, you will get a C or lower...
OR -- you can be the teacher's pet. Write it the way she wants. You will need a recursive max of a list. The rest is easy!
def max_of_l(l):
    if len(l) <= 1:
        if not l:
            raise ValueError("Max() arg is an empty sequence")
        else:
            return l[0]
    else:
        m = max_of_l(l[1:])
        return m if m > l[0] else l[0]
print(max_of_l([3, 8, 1, 2, 0, 4, 8, 5]))
