I have a function that returns the argmax from a large 2d array
getMax = np.argmax(dist, axis=1)
However, I want to get the next-biggest values as well. Is there a way of removing the getMax values from the original array and then performing argmax again?
Use np.argsort(a, axis=-1, kind='quicksort', order=None), but with an appropriate choice of arguments (below).
Here is the documentation. Note: "It returns an array of indices of the same shape as a that index data along the given axis in sorted order."
The default order is smallest to largest, so sort -dist instead (a quick way to get descending order). Caution: negating dist generates a new array, which may matter if dist is huge. See the bottom of this post for an alternative that avoids it.
Here is an example:
x = np.array([[1,2,5,0],[5,7,2,3]])
L = np.argsort(-x, axis=1)
print L
[[2 1 0 3]
[1 0 3 2]]
x
array([[1, 2, 5, 0],
[5, 7, 2, 3]])
So the n'th entry in a row of L gives the location of the n'th largest element of x.
x is unchanged.
L[:,0] will give the same output as np.argmax(x, axis=1)
L[:,0]
array([2, 1])
np.argmax(x,axis=1)
array([2, 1])
and L[:,1] will give the same as a hypothetical argsecondmax(x)
L[:,1]
array([1, 0])
If you don't want to generate a new array, i.e. you don't want to use -x:
L = np.argsort(x, axis=1)
print L
[[3 0 1 2]
[2 3 0 1]]
L[:,-1]
array([2, 1])
L[:,-2]
array([1, 0])
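Applied back to the original question, the second-largest value in each row can then be read off directly. A minimal sketch (assuming dist is the 2-D array from the question):
order = np.argsort(dist, axis=1)                          # ascending per-row order
second_idx = order[:, -2]                                 # column of the 2nd-largest value in each row
second_val = dist[np.arange(dist.shape[0]), second_idx]   # the 2nd-largest values themselves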
If speed is important to you, using argpartition rather than argsort could be useful.
For example, to return the n largest elements from a list:
import numpy as np
n = 10  # number of largest elements to return (example value)
l = np.random.randint(0, 100, int(1e6))
top_n_1 = l[np.argsort(-l)[0:n]]
top_n_2 = l[np.argpartition(l, -n)[-n:]]
The %timeit magic in IPython reports
10 loops, best of 3: 56.9 ms per loop for top_n_1, and 100 loops, best of 3: 8.06 ms per loop for top_n_2.
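argpartition also works along an axis, so the same trick applies per row of a 2-D array. A hedged sketch (assuming dist from the original question, with k the number of top entries wanted per row):
k = 2
topk_idx = np.argpartition(dist, -k, axis=1)[:, -k:]  # indices of the k largest entries in each row
# note: argpartition makes no ordering guarantee among those k; sort them afterwards if needed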
I hope this is useful.
Hi, I'm new to Python and NumPy, and I'd like to ask: what is the most efficient way to split an ndarray into 3 parts, 20%, 60% and 20%?
import numpy as np
row_indices = np.random.permutation(10)
Let's assume the ndarray has 10 items: [7 9 3 1 2 4 5 6 0 8]
The expected results are the ndarray separated into 3 parts like part1, part2 and part3.
part1: [7 9]
part2: [3 1 2 4 5 6]
part3: [0 8]
Here's one way -
# data array
In [85]: a = np.array([7, 9, 3, 1, 2, 4, 5, 6, 0, 8])
# percentages (ratios) array
In [86]: p = np.array([0.2,0.6,0.2]) # must sum to 1
In [87]: np.split(a,(len(a)*p[:-1].cumsum()).astype(int))
Out[87]: [array([7, 9]), array([3, 1, 2, 4, 5, 6]), array([0, 8])]
Alternative to np.split :
np.split could be slower when working with large data, so we could alternatively use a loop there -
split_idx = np.r_[0,(len(a)*p.cumsum()).astype(int)]
out = [a[i:j] for (i,j) in zip(split_idx[:-1],split_idx[1:])]
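Continuing with the same a and p as above, the loop version reproduces the np.split output:
out
# [array([7, 9]), array([3, 1, 2, 4, 5, 6]), array([0, 8])]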
I normally just go for the most obvious solution, although there are much fancier ways to do the same thing. It takes a second to implement and doesn't even require debugging (since it's extremely simple).
part1 = [a[i, ...] for i in range(int(a.shape[0] * 0.2))]
part2 = [a[i, ...] for i in range(int(a.shape[0] * 0.2), int(a.shape[0] * 0.8))]
part3 = [a[i, ...] for i in range(int(a.shape[0] * 0.8), a.shape[0])]
A few things to notice though
This is rounded and therefore you could get something which is only roughly a 20-60-20 split
You get back a list of elements, so you might have to re-numpyfy them with np.asarray()
You can use this method for indexing multiple objects (e.g. labels and inputs) for the same elements
If you get the indices once before the splits (indices = list(range(a.shape[0]))) you can also shuffle them first, taking care of data shuffling at the same time (see the sketch below)
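A minimal sketch of that last point, assuming a second array labels with the same number of rows as a (the labels name is hypothetical here):
indices = np.random.permutation(a.shape[0])          # shuffled row indices
cut1, cut2 = int(a.shape[0] * 0.2), int(a.shape[0] * 0.8)
idx1, idx2, idx3 = indices[:cut1], indices[cut1:cut2], indices[cut2:]
part2_inputs, part2_labels = a[idx2], labels[idx2]   # the same indices select matching rows from both arrays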
I am trying to represent a partition of the numbers 0 to n-1 in Python
I have a numpy array where the ith entry indicates the partition ID of number i. For instance, the numpy array
indicator = array([1, 1, 3, 0, 2, 3, 0, 0])
indicates that numbers 3, 6, and 7 belong to the partition with ID 0. Numbers 0 and 1 belong to partition 1. 4 belongs to partition 2. And 2 and 5 belong to partition 3. Let's call this the indicator representation.
Another way to represent the partition would be a list of lists where the ith list is the partition with ID i. For the array above, this maps to
explicit = [[3, 6, 7], [0, 1], [4], [2, 5]]
Let's call this the explicit representation.
My question is: what is the most efficient way to convert the indicator representation to the explicit representation? The naive way is to iterate through the indicator array and assign each element to its respective slot in the explicit list, but iterating through numpy arrays is inefficient. Is there a more natural numpy construct to do this?
Here's an approach using sorted indices and then splitting those into groups -
def indicator_to_part(indicator):
    sidx = indicator.argsort()  # indicator.argsort(kind='mergesort') keeps order within groups
    sorted_arr = indicator[sidx]
    split_idx = np.nonzero(sorted_arr[1:] != sorted_arr[:-1])[0]
    return np.split(sidx, split_idx+1)
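For the example in the question this gives the expected explicit representation (a quick check; note that with the default quicksort the order within each group is not guaranteed, hence the kind='mergesort' comment above):
indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
indicator_to_part(indicator)
# [array([3, 6, 7]), array([0, 1]), array([4]), array([2, 5])]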
Runtime test -
In [326]: indicator = np.random.randint(0,100,(10000))
In [327]: %timeit from_ind_to_expl(indicator) # @yogabonito's soln
100 loops, best of 3: 5.59 ms per loop
In [328]: %timeit indicator_to_part(indicator)
1000 loops, best of 3: 801 µs per loop
In [330]: indicator = np.random.randint(0,1000,(100000))
In [331]: %timeit from_ind_to_expl(indicator) # @yogabonito's soln
1 loops, best of 3: 494 ms per loop
In [332]: %timeit indicator_to_part(indicator)
100 loops, best of 3: 11.1 ms per loop
Note that the output is a list of arrays. If you need a list of lists instead, a simple way is map(list, indicator_to_part(indicator)). Again, a performant alternative involves a few more steps, like so -
def indicator_to_part_list(indicator):
    sidx = indicator.argsort()  # indicator.argsort(kind='mergesort') keeps order within groups
    sorted_arr = indicator[sidx]
    split_idx = np.nonzero(sorted_arr[1:] != sorted_arr[:-1])[0]
    sidx_list = sidx.tolist()
    start = np.append(0, split_idx+1)
    stop = np.append(split_idx+1, indicator.size+1)
    return [sidx_list[start[i]:stop[i]] for i in range(start.size)]
Here is a solution for translating indicator to explicit using numpy only (no for loops, list comprehensions, itertools, etc.).
I haven't seen your iteration-based approach so I can't compare them but maybe you can tell me if it's fast enough for your needs :)
import numpy as np
indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
explicit = [[3, 6, 7], [0, 1], [4], [2, 5]]
def from_ind_to_expl(indicator):
    groups, group_sizes = np.unique(indicator, return_counts=True)
    group_sizes = np.cumsum(group_sizes)
    ordered = np.where(indicator == groups[:, np.newaxis])
    return np.hsplit(ordered[1], group_sizes[:-1])
from_ind_to_expl(indicator) gives
[array([3, 6, 7]), array([0, 1]), array([4]), array([2, 5])]
I have also compared the times of @Divakar's solution and mine. On my machine @Divakar's solution is 2-3 times faster than mine, so @Divakar definitely gets an upvote from me :)
In the last comparison in @Divakar's post there's no averaging for my solution because there's only one loop - this is slightly unfair :P ;)
np.nditer automatically iterates over the elements of an array row-wise. Is there a way to iterate over the elements of an array column-wise?
x = np.array([[1,3],[2,4]])
for i in np.nditer(x):
print i
# 1
# 3
# 2
# 4
What I want is:
for i in Columnwise Iteration(x):
print i
# 1
# 2
# 3
# 4
Is my best bet just to transpose my array before doing the iteration?
For completeness, you don't necessarily have to transpose the matrix before iterating through the elements. With np.nditer you can specify the order in which to iterate through the matrix. The default is row-major, or C-like, order. You can override this behaviour and choose column-major, or Fortran-like, order, which is what you desire. Simply pass the additional argument order='F' when calling np.nditer:
In [16]: x = np.array([[1,3],[2,4]])
In [17]: for i in np.nditer(x,order='F'):
....: print i
....:
1
2
3
4
You can read more about how to control the order of iteration here: http://docs.scipy.org/doc/numpy-1.10.0/reference/arrays.nditer.html#controlling-iteration-order
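As an aside, nditer can also hand you whole columns at a time if you combine order='F' with the external_loop flag; a small sketch:
for col in np.nditer(x, flags=['external_loop'], order='F'):
    print col   # prints [1 2], then [3 4]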
You could use the shape and slice each column
>>> [x[:, i] for i in range(x.shape[1])]
[array([1, 2]), array([3, 4])]
You could transpose it?
>>> x = np.array([[1,3],[2,4]])
>>> [y for y in x.T]
[array([1, 2]), array([3, 4])]
Or less elegantly:
>>> [np.array([x[j,i] for j in range(x.shape[0])]) for i in range(x.shape[1])]
[array([1, 2]), array([3, 4])]
nditer is not the best iteration tool for this case. It is useful when working toward a compiled (cython) solution, but not in pure Python coding.
Look at some regular iteration strategies:
In [832]: x=np.array([[1,3],[2,4]])
In [833]: x
Out[833]:
array([[1, 3],
[2, 4]])
In [834]: for i in x:print i # print each row
[1 3]
[2 4]
In [835]: for i in x.T:print i # print each column
[1 2]
[3 4]
In [836]: for i in x.ravel():print i # print values in order
1
3
2
4
In [837]: for i in x.T.ravel():print i # print values in column order
1
2
3
4
You comment: "I need to fill values into an array based on the index of each cell in the array".
What do you mean by index?
A crude 2d iteration with indexing:
In [838]: for i in range(2):
.....: for j in range(2):
.....: print (i,j),x[i,j]
(0, 0) 1
(0, 1) 3
(1, 0) 2
(1, 1) 4
ndindex uses nditer to generate similar indexes
In [841]: for i,j in np.ndindex(x.shape):
.....: print (i,j),x[i,j]
.....:
(0, 0) 1
(0, 1) 3
(1, 0) 2
(1, 1) 4
enumerate is a good Python way of getting both values and indexes:
In [847]: for i,v in enumerate(x):print i,v
0 [1 3]
1 [2 4]
Or you can use meshgrid to generate all the indexes, as arrays
In [843]: I,J=np.meshgrid(range(2),range(2))
In [844]: I
Out[844]:
array([[0, 1],
[0, 1]])
In [845]: J
Out[845]:
array([[0, 0],
[1, 1]])
In [846]: x[I,J]
Out[846]:
array([[1, 2],
[3, 4]])
Note that most of these iterative methods just treat your array as a list of lists. They don't take advantage of the array nature, and will be slow compared to operations that work on the whole of x.
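To make that concrete, a trivial sketch contrasting the two styles (both compute the same sum):
# element-by-element, via Python-level iteration
total = 0
for i, j in np.ndindex(x.shape):
    total += x[i, j]
# whole-array, done in compiled code; much faster on large arrays
total = x.sum()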
Using numpy, I want to multiply a matrix x by a column array y, elementwise:
x = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = numpy.array([1, 2, 3])
z = numpy.multiply(x, y)
print z
This gives the output as if y is a row array:
[[ 1 4 9]
[ 4 10 18]
[ 7 16 27]]
However, I want the output as if y is a column array:
[[ 1 2 3]
[ 8 10 12]
[21 24 27]]
So how can I manipulate y to achieve this? If I use:
y = numpy.transpose(y)
then y remains the same shape.
Enclose it in another list to make it 2D:
>>> y2 = numpy.transpose([y])
>>> y2
array([[1],
[2],
[3]])
>>> numpy.multiply(x, y2)
array([[ 1, 2, 3],
[ 8, 10, 12],
[21, 24, 27]])
The reason you can't transpose y is because it's initialized as a 1-D array. Transposing an array only makes sense in two (or more) dimensions.
To get around these mixed-dimension issues, numpy actually provides a set of convenience functions to sanitize your inputs:
y = np.array([1, 2, 3])
y1 = np.atleast_1d(y) # Converts array to 1-D if less than that
y2 = np.atleast_2d(y) # Converts array to 2-D if less than that
y3 = np.atleast_3d(y) # Converts array to 3-D if less than that
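To see what these do to a shape-(3,) input, a quick check of the resulting shapes:
y1.shape   # (3,)
y2.shape   # (1, 3)
y3.shape   # (1, 3, 1)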
I also think np.column_stack falls under this convenience category, as it puts together 1-D and 2-D arrays as columns like you would expect, rather than having to figure out the right series of reshapes and stacks.
y1 = np.array([1, 2, 3])
y2 = np.array([2, 4, 6])
y3 = np.array([[2, 6], [2, 4], [7, 7]])
y = np.column_stack((y1, y2, y3))
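For reference, the stacked result here has shape (3, 4): y1 and y2 each become a single column, and the two columns of y3 are appended after them:
y
# array([[1, 2, 2, 6],
#        [2, 4, 2, 4],
#        [3, 6, 7, 7]])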
I think these functions aren't as well known as they should be, and I find them much easier, more flexible, and safer than manually fiddling with reshape or array dimensions. They also avoid making copies when possible, which can be a small performance speedup.
To answer your question, you should use np.atleast_2d to convert your array to a 2-D array, then transpose it.
y = np.atleast_2d(y).T
The other way to quickly do it without worrying about y is to transpose x then transpose the result back.
z = (x.T * y).T
This can obfuscate the intent of the code, though it is probably faster. If performance is important, that can inform which method you want to use. Some timings on my computer:
%timeit x * np.atleast_2d(y).T
100000 loops, best of 3: 7.98 µs per loop
%timeit (x.T*y).T
100000 loops, best of 3: 3.27 µs per loop
%timeit x * np.transpose([y])
10000 loops, best of 3: 20.2 µs per loop
%timeit x * y.reshape(-1, 1)
100000 loops, best of 3: 3.66 µs per loop
You can use reshape:
y = y.reshape(-1,1)
The y variable has a shape of (3,). If you construct it this way:
y = numpy.array([1, 2, 3], ndmin=2)
...it will have a shape of (1,3), which you can transpose to get the result you want:
y = numpy.array([1, 2, 3], ndmin=2).T
z = numpy.multiply(x, y)
I'm trying to run a custom k-means clustering algorithm and am having trouble getting the document frequency for each column (term) of a 2-D numpy array, grouped by cluster. My current algorithm uses two numpy arrays: a raw dataset that lists the documents by terms [2000L, 9500L], and one that holds the cluster assignment [2000L,]. There are 5 clusters. What I need to do is create an array that lists the document frequency for each cluster - basically, for each cluster, a count of the nonzero entries in each column over the rows assigned to that cluster. The output will be a [5L, 9500L] array (clusters x terms). I'm having trouble finding a way to do the equivalent of a COUNTIF combined with a GROUP BY. Here is some sample data and the output I would like if I ran it with only 2 clusters:
import numpy as np
dataset = np.array([[1,2,0,3,0],[0,2,0,0,3],[4,5,2,3,0],[0,0,2,3,0]])
clusters = np.array([0,1,1,0])
#run code here to get documentFrequency
print documentFrequency
>> [[1,1,1,2,0],[1,2,1,1,1]]
My thought would be to select out the specific rows that match each cluster, because then counting should be easy. For example, if I could split the data into the following arrays:
cluster0 = np.array([[1,2,0,3,0],[0,0,2,3,0]])
cluster1 = np.array([[0,2,0,0,3],[4,5,2,3,0]])
Any direction or pointers would be much appreciated!
I don't think there is any easy way to vectorize your code, but if you have only a few clusters you could do the obvious:
>>> cluster_count = np.max(clusters)+1
>>> doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
>>> for j in xrange(cluster_count):
... doc_freq[j] = np.sum(dataset[clusters == j], axis=0)
...
>>> doc_freq
array([[1, 2, 2, 6, 0],
[4, 7, 2, 3, 3]])
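Note that this sums the term counts within each cluster. If you literally want document frequency - the number of documents in the cluster in which each term is nonzero, matching the expected output in the question - count nonzero entries instead:
>>> for j in xrange(cluster_count):
...     doc_freq[j] = np.sum(dataset[clusters == j] != 0, axis=0)
...
>>> doc_freq
array([[1, 1, 1, 2, 0],
       [1, 2, 1, 1, 1]])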
As @Jaime says, if you have only a few clusters it makes sense to use the usual trick of manually looping over the smallest axis length. Often that gets you most of the benefits of full vectorization with a lot less of the headache that comes with being clever.
That said, when you find yourself wanting groupby, you're often in a domain in which a higher-level tool like pandas comes in very handy:
>>> import pandas as pd
>>> pd.DataFrame(dataset).groupby(clusters).sum()
0 1 2 3 4
0 1 2 2 6 0
1 4 7 2 3 3
And you can easily fall back to an ndarray if needed:
>>> pd.DataFrame(dataset).groupby(clusters).sum().values
array([[1, 2, 2, 6, 0],
[4, 7, 2, 3, 3]])
Depending on how well compiled your BLAS is, writing this as a matrix multiplication could be faster:
cvals = (clusters == np.arange(clusters.max()+1)[:,None]).astype(int)
cvals
array([[1, 0, 0, 1],
[0, 1, 1, 0]])
np.dot(cvals,dataset)
array([[1, 2, 2, 6, 0],
[4, 7, 2, 3, 3]])
Let's create two definitions:
def loop(cvals, dataset):
    cluster_count = np.max(cvals)+1
    doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
    for j in xrange(cluster_count):
        doc_freq[j] = np.sum(dataset[cvals == j], axis=0)
    return doc_freq

def matrix_mult(clusters, dataset):
    cvals = (clusters == np.arange(clusters.max()+1)[:,None]).astype(dataset.dtype)
    return np.dot(cvals, dataset)
Now for some timings:
arr = np.random.random((2000,9500))
cluster = np.random.randint(0,5,(2000))
np.allclose(loop(cluster,arr),matrix_mult(cluster,arr))
True
%timeit loop(cluster,arr)
1 loops, best of 3: 263 ms per loop
%timeit matrix_mult(cluster,arr)
100 loops, best of 3: 14.1 ms per loop
Note this is with a threaded MKL BLAS. Your mileage may vary.