Find multiple maximum values in a 2d array fast - python

The situation is as follows:
I have a 2D numpy array. Its shape is (1002, 1004). Each element contains a value between 0 and Inf. What I now want to do is determine the 1000 largest values and store the corresponding indices in a list named x and a list named y. This is because I want to plot the maximum values, and the indices correspond to the real-time x and y position of each value.
What I have so far is:
x = numpy.zeros(500)
y = numpy.zeros(500)
for idx in range(500):
    x[idx] = numpy.unravel_index(full.argmax(), full.shape)[0]
    y[idx] = numpy.unravel_index(full.argmax(), full.shape)[1]
    full[full == full.max()] = 0.
print(os.times())
Here full is my 2D numpy array. As can be seen from the for loop, I only determine the first 500 maximum values at the moment. This, however, already takes about 5 s. For the first 1000 maximum values, the user time should be around 0.5 s. I've noticed that a very time-consuming part is setting the previous maximum value to 0 each time. How can I speed things up?
Thank you so much!

If you have numpy 1.8 or later, you can use the argpartition function or method.
Here's a script that calculates x and y:
import numpy as np
# Create an array to work with.
np.random.seed(123)
full = np.random.randint(1, 99, size=(8, 8))
# Get the indices for the largest `num_largest` values.
num_largest = 8
indices = (-full).argpartition(num_largest, axis=None)[:num_largest]
# OR, if you want to avoid the temporary array created by `-full`:
# indices = full.argpartition(full.size - num_largest, axis=None)[-num_largest:]
x, y = np.unravel_index(indices, full.shape)
print("full:")
print(full)
print("x =", x)
print("y =", y)
print("Largest values:", full[x, y])
print("Compare to: ", np.sort(full, axis=None)[-num_largest:])
Output:
full:
[[67 93 18 84 58 87 98 97]
 [48 74 33 47 97 26 84 79]
 [37 97 81 69 50 56 68  3]
 [85 40 67 85 48 62 49  8]
 [93 53 98 86 95 28 35 98]
 [77 41  4 70 65 76 35 59]
 [11 23 78 19 16 28 31 53]
 [71 27 81  7 15 76 55 72]]
x = [0 2 4 4 0 1 4 0]
y = [6 1 7 2 7 4 4 1]
Largest values: [98 97 98 98 97 97 95 93]
Compare to: [93 95 97 97 97 98 98 98]
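Note that argpartition does not return the selected indices in any particular order. If the plot needs the values ranked, one option is to sort just the selected candidates afterwards. The following is a minimal sketch under the assumption that the data looks like the question's array; the random array, the seed and the name top_values are stand-ins, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
full = rng.random((1002, 1004))       # stand-in for the real data
num_largest = 1000

# Unordered flat indices of the num_largest largest values.
flat = np.argpartition(full, full.size - num_largest, axis=None)[-num_largest:]

# Sort only those candidates by value, largest first.
flat = flat[np.argsort(full.ravel()[flat])[::-1]]

x, y = np.unravel_index(flat, full.shape)
top_values = full[x, y]
```

Only the 1000 selected values are sorted here, so the extra cost should stay small compared with sorting the whole array.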

You could loop through the array as @Inspired suggests, but looping through NumPy arrays item-by-item tends to lead to slower code than using NumPy functions, since the NumPy functions are written in C/Fortran while the item-by-item loop runs Python functions.
So, although sorting is O(n log n), it may be quicker than a Python-based one-pass O(n) solution. Below np.unique performs the sort:
import numpy as np
def nlargest_indices(arr, n):
    uniques = np.unique(arr)
    threshold = uniques[-n]
    return np.where(arr >= threshold)
full = np.random.random((1002,1004))
x, y = nlargest_indices(full, 10)
print(full[x, y])
print(x)
# [ 2 7 217 267 299 683 775 825 853]
print(y)
# [645 621 132 242 556 439 621 884 367]
Here is a timeit benchmark comparing nlargest_indices (above) to
def nlargest_indices_orig(full, n):
    full = full.copy()
    x = np.zeros(n)
    y = np.zeros(n)
    for idx in range(n):
        x[idx] = np.unravel_index(full.argmax(), full.shape)[0]
        y[idx] = np.unravel_index(full.argmax(), full.shape)[1]
        full[full == full.max()] = 0.
    return x, y
In [97]: %timeit nlargest_indices_orig(full, 500)
1 loops, best of 3: 5 s per loop
In [98]: %timeit nlargest_indices(full, 500)
10 loops, best of 3: 133 ms per loop
For timeit purposes I needed to copy the array inside nlargest_indices_orig, lest full get mutated by the timing loop.
Benchmarking the copying operation:
def base(full, n):
    full = full.copy()
In [102]: %timeit base(full, 500)
100 loops, best of 3: 4.11 ms per loop
shows this added about 4 ms to the 5 s benchmark for nlargest_indices_orig.
Warning: nlargest_indices and nlargest_indices_orig may return different results if arr contains repeated values.
nlargest_indices finds the n largest values in arr and then returns the x and y indices corresponding to the locations of those values.
nlargest_indices_orig finds the n largest values in arr and then returns one x and y index for each large value. If there is more than one x and y corresponding to the same large value, then some locations where large values occur may be missed.
They also return indices in a different order, but I suppose that does not matter for your purpose of plotting.
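To make the warning about repeated values concrete, here is a small sketch. The 2x3 array is made up for illustration; nlargest_indices is copied from above so the snippet runs on its own.

```python
import numpy as np

def nlargest_indices(arr, n):
    uniques = np.unique(arr)
    threshold = uniques[-n]
    return np.where(arr >= threshold)

# Made-up data in which the largest value, 9, occurs three times.
arr = np.array([[9, 9, 1],
                [7, 9, 3]])

rows, cols = nlargest_indices(arr, 2)
# The two largest distinct values are 9 and 7, so every position holding
# a 9 or a 7 is returned: four positions, not two.
print(arr[rows, cols])
```

So with ties, nlargest_indices can return more than n positions, whereas the loop-based version always returns exactly n.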

If you want to know the indices of the n max/min values in the 2d array, my solution (for the largest) is
indx = divmod((-full).argpartition(num_largest, axis=None)[:num_largest], full.shape[1])
This finds the flat indices of the largest values in the flattened array and then converts each one to a (row, column) pair: the quotient of the division by the number of columns is the row, and the remainder is the column.
Never mind. Benchmarking shows the unravel method is twice as fast, at least for num_largest = 3.
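As a quick sanity check of the divmod idea, the sketch below compares it with np.unravel_index on a deliberately non-square array. The array and the flat indices are made up for illustration. For a row-major (C-ordered) array the divisor has to be the number of columns, full.shape[1]; with a square array the difference would go unnoticed.

```python
import numpy as np

full = np.arange(12).reshape(3, 4)     # non-square on purpose
flat = np.array([5, 11, 0])            # arbitrary flat indices

# Row = flat index // number of columns, column = flat index % number of columns.
rows, cols = divmod(flat, full.shape[1])

# Reference result from NumPy's own conversion.
rows_ref, cols_ref = np.unravel_index(flat, full.shape)
```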

I'm afraid that the most time-consuming part is recalculating the maximum. In fact, you have to calculate the maximum of 1002*1004 numbers 500 times, which gives you roughly 500 million comparisons.
Probably you should write your own algorithm to find the solution in one pass: keep only the 1000 greatest numbers (or their indices) somewhere while scanning your 2D array (without modifying the source array). I think that some sort of binary heap (have a look at heapq) would suit as the storage.
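A minimal sketch of that one-pass idea, using heapq.nlargest from the standard library to keep the n largest (value, flat index) pairs while scanning. The function name nlargest_heap and the test array are my own choices, not something from the question. Since the scan runs in Python rather than in compiled NumPy code, it may well be slower than the NumPy-based answers on an array of this size.

```python
import heapq
import numpy as np

def nlargest_heap(arr, n):
    """Scan arr once, keeping the n largest (value, flat_index) pairs."""
    flat = arr.ravel()
    # heapq.nlargest maintains a bounded heap internally; tuples compare
    # by value first, then by flat index.
    pairs = heapq.nlargest(n, zip(flat.tolist(), range(flat.size)))
    idx = np.array([i for _, i in pairs])
    return np.unravel_index(idx, arr.shape)

rng = np.random.default_rng(0)      # arbitrary seed
full = rng.random((50, 60))         # small stand-in array
x, y = nlargest_heap(full, 10)
```

The source array is left untouched, and the result comes back ordered from largest to smallest.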

Related

Select only values below 50 from array, add 5 then multiply by 2. The other values should remain unchanged

I have a Python array that I got using
array = np.arange(2,201,2).reshape(25,4)
which gave me this:
[[ 2 4 6 8]
[ 18 20 22 24]
[ 34 36 38 40]
[ 50 52 54 56]
[ 66 68 70 72]
[ 82 84 86 88]
[ 98 100 102 104]
[114 116 118 120]
[130 132 134 136]
[146 148 150 152]
[162 164 166 168]
[178 180 182 184]
[194 196 198 200]]
but now I'm instructed to select only the values below 50 from "array", add 5 to these values, and then multiply by 2. The other values should remain unchanged and everything should be saved as "array". This is a school assignment so I don't have the output, but basically the output should be the array in the same 25x4 shape, where the first ~3 rows will be changed (since those are the ones under 50) and the other rows/values will be the same (since they're over 50). I've tried the following code:
for i in array:
    if array < 50:
        print((i+5)*2)
    else:
        print(i)
and I'm getting an error that says:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any help would be greatly appreciated, since I can't find any other articles with similar questions.
There are 2 ways to address this question. A Python one and a numpy one (numpy is not Python...).
Python way:
You have a sequence of sequence containers. You can use a double iteration to test the values one at a time and replace the ones that have to be:
for row in array:                      # iterate over the rows
    for i, val in enumerate(row):      # then the values in the row
        if val < 50:                   # test them
            row[i] = (val + 5) * 2     # and replace
This works as long as the outer iteration gives you direct access to the row container. This is true for both Python containers (lists) and numpy arrays, but it is not guaranteed for every type of container. The super safe way is to keep the indexes and modify array directly:
for i in range(len(array)):
    for j in range(len(array[i])):
        if array[i][j] < 50:
            array[i][j] = (array[i][j] + 5) * 2
Numpy way:
The power of numpy is that it provides high-speed iteration over its arrays. In numpy terminology this is called vectorization. You should first extract the relevant indexes and then change the values in one single vectorized operation:
ix = np.where(array < 50)
array[ix] = (array[ix] + 5) * 2
For large arrays, this second way should be at least an order of magnitude faster than the first one.
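Another vectorized spelling, in case it reads more naturally, is the three-argument form of np.where, which picks between two expressions element by element. This is a sketch of an alternative, not what the assignment necessarily expects:

```python
import numpy as np

array = np.arange(2, 201, 2).reshape(25, 4)

# Where the condition holds take (array + 5) * 2, otherwise keep the value.
array = np.where(array < 50, (array + 5) * 2, array)
```

The result keeps the 25x4 shape, and because the condition is evaluated on the original values, an element that ends up above 50 is not touched a second time.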
For your question, the correct way is the one that matches your current lesson, either Python or numpy...
import numpy as np
array = np.arange(2,201,2).reshape(25,4)
values = [ (element+5)*2 if element < 50 else element for innerList in array for element in innerList ]
print(values)

Splitting up an array in python into sub arrays

I have an array which is 1 -> 160. I want to split this into 10 arrays that are split every sixteen numbers. This is what I have so far:
import numpy as np
amplitude=[]
for i in range (0,160):
    amplitude.append(i+1)
print(amplitude)
#split arrays up into a line for each sample
traceno=10 #number of traces in file
samplesno=16 #number of samples in each trace. This wont change.
amplitude_split=np.zeros((traceno,samplesno), dtype=int)
#fill in the arrays with amplitude/sample numbers
for i in range(len(amplitude)):
    for j in range(traceno):
        for k in range(samplesno):
            amplitude_split[j,k]=amplitude[i]
print(amplitude_split[1,:])
As an output I only get [160 160 160 160 160 160 160 160 160 160 160 160 160 160 160 160]
Where I require something along the lines of:
[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
[17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32]
etc...
You are nesting the loops, so for every number in the first array you overwrite the whole new array with that number, and you end up with the last one, 160, repeated everywhere.
You only need to copy the list into a 1D numpy array, and then reshape it:
amplitude_split=np.array(amplitude, dtype=int).reshape((traceno,samplesno))
Well, if we're using Numpy arrays, we can use Numpy functionality:
amplitude = np.arange(1, 161)
amplitude_split = amplitude.reshape(10, 16)
Otherwise, you've already been linked to how to do it for plain lists, but I'd like to point out that you still don't need a loop to fill amplitude in the first place:
amplitude = list(range(1, 161))
In general, with Python you should be trying hard not to think in terms of starting with an initially blank "storage" area that you then fill in. Just create the data you want directly (by conversions of the sort above, by list comprehensions etc., or if necessary by .append()-ing) rather than overwriting a dummy value.
See grouper in https://docs.python.org/2/library/itertools.html#recipes
from itertools import zip_longest  # named izip_longest in Python 2
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
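Applied to this question, the recipe could be used roughly as follows. This sketch assumes Python 3, where the function is called zip_longest; the variable names follow the question.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # n references to the same iterator, so each output tuple
    # consumes n consecutive items.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

amplitude = list(range(1, 161))
samplesno = 16

# Ten lists of sixteen consecutive numbers each.
amplitude_split = [list(chunk) for chunk in grouper(amplitude, samplesno)]
```

Since 160 is an exact multiple of 16, the fill value is never used here.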

Unexpected results with functools reduce in python

While executing the code below I'm simply getting the first row (split) of the matrix back, not the sum of the elements that I'm expecting. Is my understanding incorrect or did I do something stupid?
My objective is to get the sum of all elements in each row.
import numpy as np
from functools import reduce
matrix = 100*np.random.rand(4,4)
matrix=matrix.astype(int)
print(matrix)
s_matrix = np.vsplit(matrix, 4)
sum_test = reduce((lambda a,b : a+b), list(s_matrix[0]))
print(sum_test)
Output:
[[79 75 33 26]
[49 45 16 19]
[58 33 83 55]
[40 14 2 93]]
[79 75 33 26]
Expected:
[213, 129, 229, 149]
Check the expression you're using: print(list(s_matrix[0])). I think you'll find that it's a double-nested list:
[[79 75 33 26]]
Thus reduce sees a single element (the whole first row) and returns it unchanged; nothing is actually summed.
You can use reduce() for this by continually adding the results to a list in the accumulator:
import numpy as np
from functools import reduce
matrix = 100*np.random.rand(4,4)
matrix=matrix.astype(int)
sum_test = reduce(lambda a,b : a+[sum(b)], list(matrix), [])
print(sum_test)
...but you really shouldn't. One of the major points of Numpy is that you can avoid explicit python loops. Instead you should just use the array's sum function. You can pass it an axis to tell it to sum rows instead of the whole thing:
import numpy as np
matrix = np.random.randint(0, 100, [4,4])
print(matrix)
print(matrix.sum(axis = 1))
Result
[[64 89 97 15]
[12 47 81 31]
[52 81 37 78]
[27 64 79 50]]
[265 171 248 220]
sum_test = reduce((lambda a,b : a+b), list(s_matrix[0]))
The line above is your problem: you are passing only the first row of your matrix to reduce instead of the whole matrix.
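To see this for yourself, here is a small sketch that reuses the matrix printed in the question (hard-coded so the numbers are reproducible) and inspects what reduce actually receives. The names first, single and row_sums are just for illustration.

```python
from functools import reduce
import numpy as np

matrix = np.array([[79, 75, 33, 26],
                   [49, 45, 16, 19],
                   [58, 33, 83, 55],
                   [40, 14, 2, 93]])

s_matrix = np.vsplit(matrix, 4)
first = s_matrix[0]                 # a (1, 4) array: one row, still 2D

# list(first) has exactly one element, so reduce never calls the lambda
# and simply hands that single row back.
single = reduce(lambda a, b: a + b, list(first))

# Reducing each row separately does add up the elements of that row.
row_sums = [reduce(lambda a, b: a + b, row) for row in matrix]
```

Here row_sums holds the values 213, 129, 229, 149, which is the expected output and agrees with matrix.sum(axis=1).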

how to successively increment a 2-dimensional list

Part of this assignment deals with a 1-dimensional list and a 2-dimensional list. The 2-D list has 10 rows, with 4 elements each; the 1-D list has 4 elements.
The assignment calls for copying the gamma list (see code) into the first row of the inStock list. Then each row after the first needs to be successively incremented by 3. By successively I mean multiplying everything in the first row of inStock by three and storing those values in the second row, then taking the values stored in the second row, multiplying those by three and storing those values in the third row of inStock, and so on.
I understand how to copy gamma but I am having trouble figuring out how to increment based off the previous list.
I am having difficulty creating a function that increments inStock successively.
This is what I have done. It increases the elements in gamma by three and stores them into the first row of inStock. But all the while loop does is take the values from the first row of inStock and store them into the other rows, rather than increment them successively.
row = 10
col = 4
gamma = [11, 13, 15, 17]
inStock = [[0] * col] * row
def copyGamma(listG, gamma):
    listG[0] = gamma.copy()
    x = 0
    while x < 9:
        x +=1
        listG[x] = [i * 3 for i in listG[0]]
    return listG
retList = copyGamma(inStock, gamma)
print(retList)
#this is the output of the above code
11 13 15 17 #this is inStock[0]
33 39 45 51 #this is inStock[1]
33 39 45 51 #this is inStock[2]
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
33 39 45 51
#This is the output i am looking for, format does not matter:
11 13 15 17 #This is inStock[0]
33 39 45 51 #This is inStock[1]
99 117 135 153 #This *should* be inStock[2]
297 351 405 459 #and so on
891 1053 1215 1377
2673 3159 3645 4131
8019 9477 10935 12393
24057 28431 32805 37179
72171 85293 98415 111537
216513 255879 295245 334611
You can use a list comprehension and the fact that each row's elements are effectively multiplied by a power of 3:
inStock = [[x * 3**i for x in gamma] for i in range(row)]
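If the assignment wants each row to be derived from the previous one explicitly, a loop-based sketch could look like the following. The key difference from the code in the question is that each new row is built from the last row produced, not from row 0. The names in_stock and rows are my own.

```python
gamma = [11, 13, 15, 17]
rows = 10

# Row 0 is a copy of gamma; every later row is three times the row before it.
in_stock = [gamma.copy()]
for _ in range(rows - 1):
    previous = in_stock[-1]
    in_stock.append([value * 3 for value in previous])

for stock_row in in_stock:
    print(*stock_row)
```

Building the list by appending also avoids the [[0] * col] * row idiom, which creates ten references to the same inner list.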

numpy: take multiple range subsets of the same size

What I'm looking for
# I have an array
x = np.arange(0, 100)
# I have a size n
n = 10
# I have a random set of numbers
indexes = np.random.randint(n, 100, 10)
# What I want is a matrix where every row i is the i-th element of indexes plus the previous n elements
res = np.empty((len(indexes), n), int)
for (i, v) in np.ndenumerate(indexes):
    res[i] = x[v-n:v]
To reformulate: as I wrote in the title, what I am looking for is a way to take multiple subsets (of the same size) of an initial array.
Just to add a detail: this loopy version works; I just want to know if there is a more elegant, numpyish way to achieve this.
The following does what you are asking for. It uses numpy.lib.stride_tricks.as_strided to create a special view on the data which can be indexed in the desired way.
import numpy as np
from numpy.lib import stride_tricks
x = np.arange(100)
k = 10
i = np.random.randint(k, len(x)+1, size=(5,))
xx = stride_tricks.as_strided(x, strides=np.repeat(x.strides, 2), shape=(len(x)-k+1, k))
print(i)
print(xx[i-k])
Sample output:
[ 69 85 100 37 54]
[[59 60 61 62 63 64 65 66 67 68]
[75 76 77 78 79 80 81 82 83 84]
[90 91 92 93 94 95 96 97 98 99]
[27 28 29 30 31 32 33 34 35 36]
[44 45 46 47 48 49 50 51 52 53]]
A bit of explanation. Arrays store not only data but also a small "header" with layout information. Among this are the strides, which tell how to translate linear memory to n dimensions. There is a stride for each dimension, which is just the offset at which the next element along that dimension can be found. So the strides for a 2d array are (row offset, element offset). as_strided lets you manipulate an array's strides directly; by setting the row offset to the same value as the element offset we create a view that looks like
0 1 2 ...
1 2 3 ...
2 3 4
. .
. .
. .
Note that no data are copied at this stage; for example, all the 2s refer to the same memory location in the original array, which is why this solution should be quite efficient.
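For what it's worth, newer NumPy versions ship a helper that builds the same kind of overlapping view without setting strides by hand. The sketch below assumes NumPy 1.20 or later, where numpy.lib.stride_tricks.sliding_window_view is available; the seed and the number of indexes are arbitrary.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(100)
n = 10
rng = np.random.default_rng(0)                    # arbitrary seed
indexes = rng.integers(n, len(x) + 1, size=5)     # each index v satisfies n <= v <= 100

# windows[j] is a read-only view of x[j:j+n]; no data are copied here.
windows = sliding_window_view(x, n)               # shape (91, 10)

# Row i of res is x[v-n:v] for v = indexes[i]; the fancy indexing copies.
res = windows[indexes - n]
```

The view is read-only by default, which guards against the accidental writes that a hand-built as_strided view allows.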
