I have the following Python list: [1, 1, 2, 2, 2, 3, 4, 4, 4]. I want to create a function that calculates the indices at which the value changes. For example, for the above list it would yield [2, 5, 6], since the first 2 occurs at index 2, the first 3 at index 5, and the first 4 at index 6.
Of course, there are numerous ways to do this. I'll be running this method millions of times daily over much longer lists, so I'm looking for the quickest solution possible.
Here's the work I've done so far:
idxs = [i for i in range(len(inputList))]
dict_ = {v: i for v, i in zip(inputList, idxs)}
result = [v + 1 for v in dict_.values()]
However, there's a bug when inputting lists that have values that have been used previously, such as [1, 1, 2, 2, 2, 3, 4, 4, 4, 1, 1].
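For reference, here is what that dict comprehension produces on the problematic input: each later occurrence of a value overwrites the stored index, so 1 ends up mapped to its last index instead of the index where it reappears:
>>> inputList = [1, 1, 2, 2, 2, 3, 4, 4, 4, 1, 1]
>>> idxs = [i for i in range(len(inputList))]
>>> {v: i for v, i in zip(inputList, idxs)}
{1: 10, 2: 4, 3: 5, 4: 8}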
You could enumerate over your values and check whether the [i-1] value is different from the [i] value.
>>> inputList = [1, 1, 2, 2, 2, 3, 4, 4, 4]
>>> [i for i, val in enumerate(inputList) if i > 0 and inputList[i-1] != val]
[2, 5, 6]
and with your second example
>>> inputList = [1, 1, 2, 2, 2, 3, 4, 4, 4, 1, 1]
>>> [i for i, val in enumerate(inputList) if i > 0 and inputList[i-1] != val]
[2, 5, 6, 9]
This runs in O(N) time, which is asymptotically optimal: every element has to be examined at least once.
So a little bit of profiling on some largish data reveals some improvements that can be made just by avoiding the i > 0 check on every iteration of the list comprehension. The numpy version is comparable, though we're probably losing a lot converting to numpy arrays and back:
import random
import timeit
import numpy as np
max_value = 100
min_run = 100
max_run = 1000
num_runs = 1000
test_list = [1, 1, 2, 2, 2, 3, 4, 4, 4]
test_result = [2, 5, 6]
input_list = sum(([random.randint(0, max_value)] * random.randint(min_run, max_run) for _ in range(num_runs)), [])
def method_1(input_list):
    return [i for i, val in enumerate(input_list) if i > 0 and input_list[i-1] != val]

def method_2(input_list):
    return [i for i in range(1, len(input_list)) if input_list[i-1] != input_list[i]]

def method_3(input_list):
    return [i + 1 for i, (a, b) in enumerate(zip(input_list, input_list[1:])) if a != b]

def method_4(input_list):
    input_array = np.array(input_list)
    res, = np.where(input_array[:-1] != input_array[1:])
    res += 1
    return list(res)

def method_5(input_list):
    return [i + 1 for i, val in enumerate(input_list[1:]) if input_list[i] != val]
assert method_1(test_list) == test_result
assert method_2(test_list) == test_result
assert method_3(test_list) == test_result
assert method_4(test_list) == test_result
assert method_5(test_list) == test_result
print(timeit.timeit("method_1(input_list)", globals=globals(), number=10))
print(timeit.timeit("method_2(input_list)", globals=globals(), number=10))
print(timeit.timeit("method_3(input_list)", globals=globals(), number=10))
print(timeit.timeit("method_4(input_list)", globals=globals(), number=10))
print(timeit.timeit("method_5(input_list)", globals=globals(), number=10))
This yields the result:
0.4418060999996669
0.3605320999995456
0.3416827999972156
0.2726910000019416
0.2845658000005642
Here is an answer using only range and len, without using enumerate, otherwise similar to the answer by Cory Kramer:
inputList = [1, 1, 2, 2, 2, 3, 4, 4, 4, 1, 1]
idxs = [i for i in range(1, len(inputList)) if inputList[i-1] != inputList[i]]
print(idxs)
# [2, 5, 6, 9]
The solutions are similar in run time, but the version using only range and len, without enumerate, is somewhat faster:
import timeit
t = timeit.Timer("idxs = [i for i in range(1, len(inputList)) if inputList[i-1] != inputList[i]]",
"import random; random.seed(42); inputList = [random.randrange(4) for i in range(1000000)]")
print('range + len:', t.timeit(100))
t = timeit.Timer("[i for i, val in enumerate(inputList) if i > 0 and inputList[i-1] != val]",
"import random; random.seed(42); inputList = [random.randrange(4) for i in range(1000000)]")
print('enumerate:', t.timeit(100))
# range + len: 15.435243827
# enumerate: 17.243516137
You could enumerate over a zip of the list with itself shifted by one:
L = [1, 1, 2, 2, 2, 3, 4, 4, 4]
C = [i for i, (a, b) in enumerate(zip(L, L[1:]), 1) if a != b]
print(C)
[2, 5, 6]
Suppose I have a numpy array (all elements are unique) that I want to sort in descending order. I need to find out which position each element of the initial array will take in the sorted array.
Example.
In1: [1, 2, 3] # Input
Out1: [2, 1, 0] # Expected output
In2: [1, -2, 2] # Input
Out2: [1, 2, 0] # Expected output
I tried this one:
def find_positions(A):
    A = np.array(A)
    A_sorted = np.sort(A)[::-1]
    return np.argwhere(A[:, None] == A_sorted[None, :])[:, 1]
But it doesn't work when the input array is very large (len > 100000). What did I do wrong, and how can I resolve it?
Approach #1
We could use double argsort -
np.argsort(a)[::-1].argsort() # a is input array/list
Approach #2
We could use one argsort and then array-assignment -
# https://stackoverflow.com/a/41242285/ #Andras Deak
def argsort_unique(idx):
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx
out = argsort_unique(np.argsort(a)[::-1])
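Both approaches can be sanity-checked against the second example from the question:
>>> import numpy as np
>>> a = np.array([1, -2, 2])
>>> np.argsort(a)[::-1].argsort()
array([1, 2, 0])
>>> argsort_unique(np.argsort(a)[::-1])
array([1, 2, 0])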
Take a look at the numpy.argsort(...) function:
Returns the indices that would sort an array.
Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as a that index data along the given axis in sorted order.
Here is the reference from the documentation, and the following is a simple example:
import numpy
arr = numpy.random.rand(100000)
indexes = numpy.argsort(arr)
The indexes array will contain all the indices in the order in which the array arr would be sorted.
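To connect this to the question's descending-order positions, the same double-argsort idea from the earlier answer applies; a quick sketch, negating the array instead of reversing it (equivalent here since all elements are unique):
>>> arr = np.array([1, -2, 2])
>>> np.argsort(np.argsort(-arr))
array([1, 2, 0])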
I face the same problem for plain lists, and would like to avoid using numpy. So I propose a possible solution that should also work for an np.array, and which avoids reversal of the result:
def argsort(A, key=None, reverse=False):
    "Indirect sort of list or array A: return indices of elements in order."
    keyfunc = (lambda i: A[i]) if key is None else (lambda i: key(A[i]))
    return sorted(range(len(A)), key=keyfunc, reverse=reverse)
Example of use:
>>> L = [3, 1, 4, 1, 5, 9, 2, 6]
>>> argsort(L)
[1, 3, 6, 0, 2, 4, 7, 5]
>>> [L[i] for i in _]
[1, 1, 2, 3, 4, 5, 6, 9]
>>> argsort(L, key=lambda x: (x % 2, x))  # even elements first
[6, 2, 7, 1, 3, 0, 4, 5]
>>> [L[i] for i in _]
[2, 4, 6, 1, 1, 3, 5, 9]
>>> argsort(L, key=lambda x: (x % 2, x), reverse=True)
[5, 4, 0, 1, 3, 7, 2, 6]
>>> [L[i] for i in _]
[9, 5, 3, 1, 1, 6, 4, 2]
Feedback would be welcome! (Efficiency compared to previously proposed solutions? Suggestions for improvements?)
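As a quick sanity check of the np.array claim (a sketch):
>>> import numpy as np
>>> argsort(np.array([3, 1, 2]))
[1, 2, 0]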
I have a list containing integers and want to replace them so that the element which previously contained the highest number now contains a 1, the second highest number set to 2, etc etc.
Example:
[5, 6, 34, 1, 9, 3] should yield [4, 3, 1, 6, 2, 5].
I personally only care about the nine highest numbers, but I thought there might be a simple algorithm, or possibly even a Python function, to take care of this task?
Edit: I don't care how duplicates are handled.
A fast way to do this is to first generate a list of tuples of the element and its position:
sort_data = [(x, i) for i, x in enumerate(data)]
next we sort these elements in reverse:
sort_data = sorted(sort_data, reverse=True)
which generates (for your sample input):
>>> sort_data
[(34, 2), (9, 4), (6, 1), (5, 0), (3, 5), (1, 3)]
and next we need to fill in these elements like:
result = [0] * len(data)
for i, (_, idx) in enumerate(sort_data, 1):
    result[idx] = i
Or putting it together:
def obtain_rank(data):
    sort_data = [(x, i) for i, x in enumerate(data)]
    sort_data = sorted(sort_data, reverse=True)
    result = [0] * len(data)
    for i, (_, idx) in enumerate(sort_data, 1):
        result[idx] = i
    return result
This approach works in O(n log n), with n the number of elements in data.
A more compact algorithm (in the sense that no tuples are constructed for the sorting) is:
def obtain_rank(data):
    sort_data = sorted(range(len(data)), key=lambda i: data[i], reverse=True)
    result = [0] * len(data)
    for i, idx in enumerate(sort_data, 1):
        result[idx] = i
    return result
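A quick check of the compact version against the example from the question:
>>> obtain_rank([5, 6, 34, 1, 9, 3])
[4, 3, 1, 6, 2, 5]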
Another option: you can use the rankdata function from scipy, which provides options to handle duplicates:
from scipy.stats import rankdata
lst = [5, 6, 34, 1, 9, 3]
rankdata([-x for x in lst], method='ordinal')
# array([4, 3, 1, 6, 2, 5])
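For instance, with duplicates, method='min' makes tied values share the best (smallest) rank; a quick sketch, where choosing 'min' is just one of the available tie-handling options:
>>> rankdata([-x for x in [5, 5, 34]], method='min')
array([2, 2, 1])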
Assuming you do not have any duplicates, the following list comprehension will do:
lst = [5, 6, 34, 1, 9, 3]
tmp_sorted = sorted(lst, reverse=True) # kudos to @Wondercricket
res = [tmp_sorted.index(x) + 1 for x in lst] # [4, 3, 1, 6, 2, 5]
To understand how it works, you can break it up into pieces like so:
lst = [5, 6, 34, 1, 9, 3]
# let's see what the sorted returns
print(sorted(lst, reverse=True)) # [34, 9, 6, 5, 3, 1]
# biggest to smallest. that is handy.
# Since it returns a list, I can index it. Let's try with 6
print(sorted(lst, reverse=True).index(6)) # 2
# oh, Python is 0-indexed, let's add 1
print(sorted(lst, reverse=True).index(6) + 1) # 3
# that's more like it. now the same for all elements of the original list
for x in lst:
    print(sorted(lst, reverse=True).index(x) + 1) # 4, 3, 1, 6, 2, 5
# too verbose and not a list yet..
res = [sorted(lst, reverse=True).index(x) + 1 for x in lst]
# but now we are sorting in every iteration... let's store the sorted one instead
tmp_sorted = sorted(lst, reverse=True)
res = [tmp_sorted.index(x) + 1 for x in lst]
Using numpy.argsort:
numpy.argsort returns the indices that would sort an array.
>>> xs = [5, 6, 34, 1, 9, 3]
>>> import numpy as np
>>> np.argsort(np.argsort(-np.array(xs))) + 1
array([4, 3, 1, 6, 2, 5])
A short, log-linear solution using pure Python, and no look-up tables.
The idea: store the positions in a list of pairs, then sort the list to reorder the positions.
enum1 = lambda seq: enumerate(seq, start=1) # We want 1-based positions
def replaceWithRank(xs):
    # pos = position in the original list, rank = position in the top-down sorted list.
    vp = sorted([(value, pos) for (pos, value) in enum1(xs)], reverse=True)
    pr = sorted([(pos, rank) for (rank, (_, pos)) in enum1(vp)])
    return [rank for (_, rank) in pr]
assert replaceWithRank([5, 6, 34, 1, 9, 3]) == [4, 3, 1, 6, 2, 5]
A is a point, and P is a list of points.
I want to find which point P[i] is the closest to A, i.e. I want to find P[i_0] with:
i_0 = argmin_i ||A - P[i]||^2
I do it this way:
import numpy as np
# P is a list of 4 points
P = [np.array([-1, 0, 7, 3]), np.array([5, -2, 8, 1]), np.array([0, 2, -3, 4]), np.array([-9, 11, 3, 4])]
A = np.array([1, 2, 3, 4])
distance = 1000000000  # better would be: +infinity
closest = None
for p in P:
    delta = sum((p - A)**2)
    if delta < distance:
        distance = delta
        closest = p
print(closest)  # the closest point to A among all the points in P
It works, but how to do this in a shorter/more Pythonic way?
More generally in Python (and even without using Numpy), how do I find k_0 such that D[k_0] = min D[k], i.e. k_0 = argmin_k D[k]?
A more Pythonic way of implementing the same algorithm you're using is to replace your loop with a call to min with a key function:
closest = min(P, key=lambda p: sum((p - A)**2))
Note that I'm using ** for exponentiation (^ is the binary-xor operator in Python).
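For the more general argmin question, the same min-plus-key idea works for a plain sequence D by minimizing over the indices; a minimal sketch:
>>> D = [7, 3, 9, 3]
>>> min(range(len(D)), key=D.__getitem__)
1
min returns the first index among ties, which matches the usual argmin convention.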
A fully vectorized approach in numpy. Similar to the one by @MikeMüller, but using numpy's broadcasting to avoid lambda functions.
With the example data:
>>> P = [np.array([-1, 0, 7, 3]), np.array([5, -2, 8, 1]), np.array([0, 2, -3, 4]), np.array([-9, 11, 3, 4])]
>>> A = np.array([1, 2, 3, 4])
And making P a 2D numpy array:
>>> P = np.asarray(P)
>>> P
array([[-1, 0, 7, 3],
[ 5, -2, 8, 1],
[ 0, 2, -3, 4],
[-9, 11, 3, 4]])
It can be computed in one line using numpy:
>>> P[np.argmin(np.sum((P - A)**2, axis=1))]
Note that P - A, with P.shape = (N, 4) and A.shape = (4,), will broadcast the subtraction to all the rows of P (each row i becomes P[i] - A).
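For the example data the squared distances work out to [25, 66, 37, 181], so argmin picks row 0 and the expression returns array([-1, 0, 7, 3]), matching the loop-based result from the question.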
For small N (number of rows in P), the pythonic approach is probably faster. For large values of N this should be significantly faster.
A NumPy version as one-liner:
closest = P[np.argmin(np.apply_along_axis(lambda p: np.sum((p - A)**2), 1, P))]
Using the builtin min is the way to go here:
import math
p1 = [1,2]
plst = [[1,3], [10,10], [5,5]]
res = min(plst, key=lambda x: math.sqrt(pow(p1[0]-x[0], 2) + pow(p1[1]-x[1], 2)))
print(res)
[1, 3]
Note that I just used plain python lists.
Is there a built-in method that would help me achieve the following efficiently: given an array, I need a list of arrays, each holding the indices of a different unique value of the array?
If f is the desired function,
b = f(a)
and
u, idxs = np.unique(a, return_inverse=True)
then
b[i] == np.where(idxs == i)[0]
I am aware that pandas.Series.groupby() can do this, but it may not be efficient to create a dict when there are over 10^5 unique integers.
If you have numpy >= 1.9 you can do:
>>> a = np.random.randint(5, size=10)
>>> a
array([0, 2, 4, 4, 2, 4, 4, 3, 2, 1])
>>> unq, unq_inv, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
>>> np.split(np.argsort(unq_inv), np.cumsum(unq_cnt[:-1]))
[array([0]), array([9]), array([1, 4, 8]), array([7]), array([2, 3, 5, 6])]
>>> unq
array([0, 1, 2, 3, 4])
In earlier versions, you can get the counts doing an extra:
>>> unq_cnt = np.bincount(unq_inv)
Also, if you want to make sure that the indices for each value are sorted, I think you will need to use a stable sort, e.g. np.argsort(unq_inv, kind='mergesort')
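Putting that together, the stable-sort variant looks like this (a sketch, again assuming numpy >= 1.9 for return_counts):
>>> unq, unq_inv, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
>>> np.split(np.argsort(unq_inv, kind='mergesort'), np.cumsum(unq_cnt[:-1]))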
Thinking about what you seem to be after, which I think is minimizing calls to an expensive function, I don't think you need to do what you are asking. Say your function were squaring; you could simply do:
>>> unq, unq_inv = np.unique(a, return_inverse=True)
>>> f_unq = unq**2
>>> f_a = f_unq[unq_inv]
>>> a
array([0, 2, 4, 4, 2, 4, 4, 3, 2, 1])
>>> f_a
array([ 0, 4, 16, 16, 4, 16, 16, 9, 4, 1])
def foo(a):
    I = np.arange(a.shape[0])
    d = {}
    while a.shape[0]:
        x = a[0]
        ii = a == x
        d[x] = I[ii]
        a = a[~ii]
        I = I[~ii]
    return d
In [767]: a
Out[767]: array([4, 4, 3, 0, 0, 2, 1, 1, 0, 3])
In [768]: foo(a)
Out[768]:
{0: array([3, 4, 8]),
1: array([6, 7]),
2: array([5]),
3: array([2, 9]),
4: array([0, 1])}
Is this the sort of dictionary that you want?
For small a this works fine.
An equivalent dictionary building function is:
def foo1(a):
    unq = np.unique(a)
    return {i: np.where(a == i)[0] for i in unq}
Off hand I don't see how unq_inv helps with building the dictionary.
foo is about 30% slower than foo1. I was hoping that by shrinking the searched array each time a value was handled I might gain some speed. But it looks like the extra bookkeeping chews up time. And the where time might not be that sensitive to the length of a.
For a2=np.random.randint(5000,size=100000) run times are on the order of 2-3 sec.
But np.random.randint(50000,size=1000000) takes too long to time (for either version).
On further experimentation, a 'dumb' approach using a collections.defaultdict is much faster (20x):
from collections import defaultdict

def food(a):
    d = defaultdict(list)
    for i, j in enumerate(a):
        d[j].append(i)
    return d
The 'too big' (1000000,) array takes only 1.1 sec.
Maybe do something like:
s = np.argsort(a, kind='mergesort')  # stable sort keeps each group's indices in order
starts = np.where(np.diff(a[s]))[0] + 1  # boundaries between runs of equal values
f = np.split(s, starts)  # one index array per unique value
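A quick check on the example array from the earlier answer:
>>> a = np.array([4, 4, 3, 0, 0, 2, 1, 1, 0, 3])
>>> s = np.argsort(a, kind='mergesort')
>>> np.split(s, np.where(np.diff(a[s]))[0] + 1)
[array([3, 4, 8]), array([6, 7]), array([5]), array([2, 9]), array([0, 1])]
which matches the dictionary produced by foo above.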