Split sorted list into two lists - python

I'm trying to split a sorted integer list into two lists. The first list would have all ints under n and the second all ints over n. Note that n does not have to be in the original list.
I can easily do this with:
under = []
over = []
for x in sorted_list:
if x < n:
under.append(x)
else
over.append(x)
But it just seems like it should be possible to do this in a more elegant way knowing that the list is sorted. takewhile and dropwhile from itertools sound like the solution but then I would be iterating over the list twice.
Functionally, the best I can do is this:
i = 0
while sorted_list[i] < n:
i += 1
under = sorted_list[:i]
over = sorted_list[i:]
But I'm not even sure if it is actually better than just iterating over the list twice and it is definitely not more elegant.
I guess I'm looking for a way to get the list returned by takewhile and the remaining list, perhaps, in a pair.

The correct solution here is the bisect module. Use bisect.bisect to find the index to the right of n (or the index where it would be inserted if it's missing), then slice around that point:
import bisect # At top of file
split_idx = bisect.bisect(sorted_list, n)
under = sorted_list[:split_idx]
over = sorted_list[split_idx:]
While any solution is going to be O(n) (you do have to copy the elements after all), the comparisons are typically more expensive than simple pointer copies (and associated reference count updates), and bisect reduces the comparison work on a sorted list to O(log n), so this will typically (on larger inputs) beat simply iterating and copying element by element until you find the split point.
Use bisect.bisect_left (which finds the leftmost index of n) instead of bisect.bisect (equivalent to bisect.bisect_right) if you want n to end up in over instead of under.

I would use following approach, where I find the index and use slicing to create under and over:
sorted_list = [1,2,4,5,6,7,8]
n=6
idx = sorted_list.index(n)
under = sorted_list[:idx]
over = sorted_list[idx:]
print(under)
print(over)
Output (same as with your code):
[1, 2, 4, 5]
[6, 7, 8]
Edit: As I understood the question wrong here is an adapted solution to find the nearest index:
import numpy as np
sorted_list = [1,2,4,5,6,7,8]
n=3
idx = np.searchsorted(sorted_list, n)
under = sorted_list[:idx]
over = sorted_list[idx:]
print(under)
print(over)
Output:
[1, 2]
[4, 5, 6, 7, 8]

Related

Find index of minimum value in a Python sublist - min() returns index of minimum value in list

I've been working on implementing common sorting algorithms into Python, and whilst working on selection sort I ran into a problem finding the minimum value of a sublist and swapping it with the first value of the sublist, which from my testing appears to be due to a problem with how I am using min() in my program.
Here is my code:
def selection_sort(li):
for i in range(0, len(li)):
a, b = i, li.index(min(li[i:]))
li[a], li[b] = li[b], li[a]
This works fine for lists that have zero duplicate elements within them:
>>> selection_sort([9,8,7,6,5,4,3,2,1])
[1, 2, 3, 4, 5, 6, 7, 8, 9]
However, it completely fails when there are duplicate elements within the list.
>>> selection_sort([9,8,8,7,6,6,5,5,5,4,2,1,1])
[8, 8, 7, 6, 6, 5, 5, 5, 4, 2, 9, 1, 1]
I tried to solve this problem by examining what min() is doing on line 3 of my code, and found that min() returns the index value of the smallest element inside the sublist as intended, but the index is of the element within the larger list rather than of the sublist, which I hope this experimentation helps to illustrate more clearly:
>>> a = [1,2,1,1,2]
>>> min(a)
1 # expected
>>> a.index(min(a))
0 # also expected
>>> a.index(min(a[1:]))
0 # should be 1?
I'm not sure what is causing this behaviour; it could be possible to copy li[i:] into a temporary variable b and then do b.index(min(b)), but copying li[i:] into b for each loop might require a lot of memory, and selection sort is an in-place algorithm so I am uncertain as to whether this approach is ideal.
You're not quite getting the concept correctly!
li.index(item) will return the first appearance of that item in the list li.
What you should do instead is if you're finding the minimum element in the sublist, search for that element in the sublist as well instead of searching it in the whole list. Also when searching in the sliced list, you will get the index in respect to the sublist. Though you can easily fix that by adding the starting step to the index returned.
A small fix for your problem would be:
def selection_sort(li):
for i in range(0, len(li)):
a, b = i, i + li[i:].index(min(li[i:]))
li[a], li[b] = li[b], li[a]
When you write a.index(min(a[1:])) you are searching for the first occurence of the min of a[1:], but you are searching in the original list. That's why you get 0 as a result.
By the way, the function you are looking for is generally called argmin. It is not contained in pure python, but numpy module has it.
One way you can do it is using list comprehension:
idxs = [i for i, val in enumerate(a) if val == min(a)]
Or even better, write your own code, which is faster asymptotically:
idxs = []
minval = None
for i, val in enumerate(a):
if minval is None or minval > val:
idxs = [i]
minval = val
elif minval == val:
idxs.append(i)

If values exist in list, put them at the back of the list in Python

I have a list with values and a list with some given numbers:
my_list = [1, 3, 5, 6, 8, 10]
my_numbers = [2, 3, 4]
Now I want to know if the values in my_numbers exist in my_list and, if so, put the matching value(s) to the back of my_list. I can do this for example like this:
for number in my_numbers:
if number in my_list:
my_list.remove(number)
my_list.append(number)
Specifics:
I can be certain that there are no duplicates within neither of the lists due to my program setup.
The order of which the matching numbers in my_numbers are put in the back of my_list does not matter.
Question: Can I do this more efficiently performance wise?
One possible solution is this: rebuild the list my_list from two parts:
the items that are not in my_numbers
the items that are in my_numbers
Note that I would suggest to use a set for the lookup. A membership test for a set is O(1) (constant time), whereas such a test for a list is O(n), where n is the length of the list.
This means that the total runtime of the code below is O(max(m,n)), where m, n are the lengths of the lists. Your original solution was more like O(m*n), which is much slower if either of the lists is large.
my_numbers_set = set(my_numbers)
my_list = [x for x in my_list if x not in my_numbers_set] + \
[x for x in my_list if x in my_numbers_set]

Create range without certain numbers

I want to create a range x from 0 ... n, without any of the numbers in the list y. How can I do this?
For example:
n = 10
y = [3, 7, 8]
x = # Do Something
Should give the output:
x = [0, 1, 2, 4, 5, 6, 9]
One naive way would be to concatenate several ranges, each spanning a set of numbers which have been intersected by the numbers in y. However, I'm not sure of what the simplest syntax to do this is in Python.
You can use a list comprehension to filter the range from 0 to n: range(n) generates a list (or, in Python 3, a generator object) from 0 to n - 1 (including both ends):
x = [i for i in range(n) if i not in y]
This filters out all numbers in y from the range.
You can also turn it into a generator (which you could only iterate over once but which would be faster for (very) large n) by replacing [ with ( and ] with ). Further, in Python 2, you can use xrange instead of range to avoid loading the entire range into memory at once. Also, especially if y is a large list, you can turn it into a set first to use O(1) membership checks instead of O(n) on list or tuple objects. Such a version might look like
s = set(y)
x = (i for i in range(n) if i not in s)
hlt's answer is ideal, but I'll quickly suggest another way using set operations.
n = 10
y = [3, 7, 8]
x = set(range(n)) - set(y)
x will be a set object. If you definitely need x to be a list, you can just write x = list(x).
Note that the ordering of a set in Python is not guaranteed to be anything in particular. If order is needed, remember to sort.
Adding on to the above answers, here is my answer using lambda function:
x = filter(lambda x: x not in y,range(n))

Remove duplicates from one Python list, prune other lists based on it

I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.
Say I have three lists, A, B and C.
A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.
I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the corresponding indexes removed from B and C:
A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]
This is easy enough to do with longer code by moving everything to new lists:
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
if A[i] not in new_A:
new_A.append(A[i])
new_B.append(B[i])
new_C.append(C[i])
But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.
Zip the three lists together, uniquify based on the first element, then unzip:
from operator import itemgetter
from more_itertools import unique_everseen
abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)
This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.
Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:
abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = zip(*abc_unique)
Once you get the hang of zip, the "uniquify" is the only hard part, so let me explain it.
Your existing code checks whether each element has been seen by searching for each one in new_A. Since new_A is a list, this means that if you have N elements, M of them unique, on average you're going to be doing M/2 comparisons for each of those N elements. Plug in some big numbers, and NM/2 gets pretty big—e.g., 1 million values, a half of them unique, and you're doing 250 billion comparisons.
To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.
If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:
new_A_set = set()
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
if A[i] not in new_A_set:
new_A_set.add(A[i])
new_A.append(A[i])
new_B.append(B[i])
new_C.append(C[i])
But this can be generalized—and should be, especially if you're planning to expand from 3 lists to a whole lot of them.
The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself, or pip install more-itertools and use someone else's implementation (as I did above).
PadraicCunningham asks:
how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?
If there are N elements, M unique, it's O(N) time and O(M) space.
In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work that's not obviously trivial inside the loop is key in seen and seen.add(key), and since both operations are amortized constant time for set, that means the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions are about 278ms and 297ms (I forget which is which) compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so—but it's hard to imagine a case where you'd need that, but wouldn't benefit from running it in PyPy instead of CPython, or writing it in Cython or C, or numpy-izing it, or getting a faster computer, or parallelizing it.
As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get indices, then doing the same thing mu 無's answer does:
indices = set(zip(*unique_everseen(enumerate(a), key=itemgetter(1))[0])
a = [a[index] for index in indices]
b = [b[index] for index in indices]
c = [c[index] for index in indices]
But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.
If you really need to save space, you can mutate all three lists in-place. But this is a lot more complicated, and a bit slower (although still linear*).
Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think it's doable in 2M space, but it's not too hard in 3M:
indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1))
indices = set(indices)
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]
* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it—in fact, at N=100000, M=10000 it's twice as fast as the immutable version… But if N might be too big, you have to instead replace each duplicate element with a sentinel, then loop over the list in a second pass so you can shift each element only once, which is instead 50% slower than the immutable version.
How about this - basically get a set of all unique elements of A, and then get their indices, and create a new list based on these indices.
new_A = list(set(A))
indices_to_copy = [A.index(element) for element in new_A]
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]
You can write a function for the second statement, for reuse:
def get_new_list(original_list, indices):
return [original_list[idx] for idx in indices]

Inserting and removing into/from sorted list in Python

I have a sorted list of integers, L, and I have a value X that I wish to insert into the list such that L's order is maintained. Similarly, I wish to quickly find and remove the first instance of X.
Questions:
How do I use the bisect module to do the first part, if possible?
Is L.remove(X) going to be the most efficient way to do the second part? Does Python detect that the list has been sorted and automatically use a logarithmic removal process?
Example code attempts:
i = bisect_left(L, y)
L.pop(i) #works
del L[bisect_left(L, i)] #doesn't work if I use this instead of pop
You use the bisect.insort() function:
bisect.insort(L, X)
L.remove(X) will scan the whole list until it finds X. Use del L[bisect.bisect_left(L, X)] instead (provided that X is indeed in L).
Note that removing from the middle of a list is still going to incur a cost as the elements from that position onwards all have to be shifted left one step. A binary tree might be a better solution if that is going to be a performance bottleneck.
You could use Raymond Hettinger's IndexableSkiplist. It performs 3 operations in O(ln n) time:
insert value
remove value
lookup value by rank
import skiplist
import random
random.seed(2013)
N = 10
skip = skiplist.IndexableSkiplist(N)
data = range(N)
random.shuffle(data)
for num in data:
skip.insert(num)
print(list(skip))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for num in data[:N//2]:
skip.remove(num)
print(list(skip))
# [0, 3, 4, 6, 9]

Categories