I'm trying to run a custom k-means clustering algorithm and am having trouble getting the document frequency for each column (term) of a 2-D numpy array, grouped by cluster. My current algorithm has two numpy arrays: a raw dataset that lists the documents by terms [2000L, 9500L], and the clustering assignment [2000L,]. There are 5 clusters. What I need to do is create an array that lists the document frequency for each cluster - essentially, for each column, a count of the nonzero entries among the rows assigned to that cluster. The output will be a [5L, 9500L] array (clusters x terms). I'm having trouble finding a way to do the equivalent of a COUNTIF combined with a GROUP BY. Here is some sample data and the output I would like if I ran it with only 2 clusters:
import numpy as np
dataset = np.array([[1,2,0,3,0],[0,2,0,0,3],[4,5,2,3,0],[0,0,2,3,0]])
clusters = np.array([0,1,1,0])
#run code here to get documentFrequency
print(documentFrequency)
>> [[1,1,1,2,0],[1,2,1,1,1]]
My thought was to select out the specific rows that match each cluster, because then counting should be easy. For example, if I could split the data into the following arrays:
cluster0 = np.array([[1,2,0,3,0],[0,0,2,3,0]])
cluster1 = np.array([[0,2,0,0,3],[4,5,2,3,0]])
Any direction or pointers would be much appreciated!
I don't think there is any easy way to vectorize your code, but if you have only a few clusters you could do the obvious:
>>> cluster_count = np.max(clusters)+1
>>> doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
>>> for j in range(cluster_count):
...     doc_freq[j] = np.sum(dataset[clusters == j], axis=0)
...
>>> doc_freq
array([[1, 2, 2, 6, 0],
[4, 7, 2, 3, 3]])
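Note that this sums the term counts per cluster (hence [1, 2, 2, 6, 0] rather than the [1, 1, 1, 2, 0] in your expected output). If what you want is the document frequency proper - the number of documents in each cluster in which a term appears at all - count the nonzero entries instead, a small change to the summand:

>>> for j in range(cluster_count):
...     doc_freq[j] = np.sum(dataset[clusters == j] != 0, axis=0)
...
>>> doc_freq
array([[1, 1, 1, 2, 0],
       [1, 2, 1, 1, 1]])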
As @Jaime says, if you have only a few clusters it makes sense to use the usual trick of manually looping over the smallest axis length. Often that gets you most of the benefits of full vectorization with a lot less of the headache that comes with being clever.
That said, when you find yourself wanting groupby, you're often in a domain in which a higher-level tool like pandas comes in very handy:
>>> import pandas as pd
>>> pd.DataFrame(dataset).groupby(clusters).sum()
0 1 2 3 4
0 1 2 2 6 0
1 4 7 2 3 3
And you can easily fall back to an ndarray if needed:
>>> pd.DataFrame(dataset).groupby(clusters).sum().values
array([[1, 2, 2, 6, 0],
[4, 7, 2, 3, 3]])
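The same adjustment as above applies here: if it's presence counts (document frequency) you're after rather than summed term counts, group the boolean mask instead, e.g.:

>>> pd.DataFrame(dataset != 0).groupby(clusters).sum().values
array([[1, 1, 1, 2, 0],
       [1, 2, 1, 1, 1]])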
Depending on how well compiled your BLAS is, writing this as a matrix multiplication could be faster:
cvals = (clusters == np.arange(clusters.max()+1)[:,None]).astype(int)
cvals
array([[1, 0, 0, 1],
[0, 1, 1, 0]])
np.dot(cvals,dataset)
array([[1, 2, 2, 6, 0],
[4, 7, 2, 3, 3]])
Let's create two definitions:

def loop(clusters, dataset):
    cluster_count = np.max(clusters) + 1
    doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
    for j in range(cluster_count):
        doc_freq[j] = np.sum(dataset[clusters == j], axis=0)
    return doc_freq

def matrix_mult(clusters, dataset):
    cvals = (clusters == np.arange(clusters.max()+1)[:,None]).astype(dataset.dtype)
    return np.dot(cvals, dataset)
Now for some timings:
arr = np.random.random((2000,9500))
cluster = np.random.randint(0,5,(2000))
np.allclose(loop(cluster,arr),matrix_mult(cluster,arr))
True
%timeit loop(cluster,arr)
1 loops, best of 3: 263 ms per loop
%timeit matrix_mult(cluster,arr)
100 loops, best of 3: 14.1 ms per loop
Note this is with a threaded MKL BLAS. Your mileage will vary.
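As a further alternative (my own addition, not benchmarked here), np.add.at can scatter each row into its cluster's accumulator directly, without materializing the one-hot cvals matrix:

def scatter_add(clusters, dataset):
    cluster_count = np.max(clusters) + 1
    doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
    # Unbuffered in-place add: doc_freq[clusters[i]] += dataset[i] for every row i.
    np.add.at(doc_freq, clusters, dataset)
    return doc_freq

This trades the BLAS call for a ufunc scatter, so whether it wins will depend on your build; it does avoid the (clusters x documents) temporary.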
Related
I'm trying to do a Monte Carlo simulation of a contest and allocate points based on standings for each iteration of the sim. I currently have a working solution using Numpy's argwhere; however, for large contest sizes and simulation counts (e.g. 25,000 contestants and 10,000 simulations) the script is extremely slow due to the list comprehensions.
import numpy as np
#sample arrays; the actual sizes are arbitrary, based on the number of contestants (ranks.shape[0]) and simulations (ranks.shape[1]), but len(field) is always equal to the number of contestants and len(payout_array)
#points to be allocated based on finish position
points_array = np.array([100,50,0,0,0])
#standings for the first simulation = [2,4,3,1,0], the second = [4,0,1,2,3]
ranks = np.array([[2, 4, 4, 1],
[4, 0, 0, 4],
[3, 1, 1, 0],
[1, 2, 3, 2],
[0, 3, 2, 3]])
field = np.arange(5)
ind = [np.argwhere(ranks==i) for i in field]
roi = [np.sum(points_array[i[:,0]]) for i in ind]
print(roi)
returns: [100, 100, 100, 0, 300]
This solution works and is very fast for small arrays:
%timeit [np.argwhere(ranks==i) for i in field]
36.2 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
However for large contestant fields and many simulations the script hangs on the list comprehension using argwhere (still running after 20 minutes for a 30k person field and 10k simulations with no memory constraints on my machine). Is there a way to vectorize argwhere or reduce the complexity of the lookup to help speed up finding the indices in ranks for all of the elements in field?
You can use numpy.argsort to vectorize this. It works because each column of ranks is a permutation of the contestants: argsort inverts that permutation, so it gives each contestant's finish position, which can then index directly into points_array.
import numpy as np
#sample arrays; the actual sizes are arbitrary, based on the number of contestants (ranks.shape[0]) and simulations (ranks.shape[1]), but len(field) is always equal to the number of contestants and len(payout_array)
#points to be allocated based on finish position
points_array = np.array([100,50,0,0,0])
#standings for the first simulation = [2,4,3,1,0], the second = [4,0,1,2,3]
ranks = np.array([[2, 4, 4, 1],
[4, 0, 0, 4],
[3, 1, 1, 0],
[1, 2, 3, 2],
[0, 3, 2, 3]])
roi = points_array[np.argsort(ranks, axis=0)].sum(axis=1)
print(roi)
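As a quick sanity check, this matches the argwhere version from the question on the sample data:

# Compare with the original argwhere approach from the question.
ind = [np.argwhere(ranks == i) for i in np.arange(ranks.shape[0])]
roi_slow = [np.sum(points_array[i[:, 0]]) for i in ind]
print(roi_slow)  # [100, 100, 100, 0, 300]
print(roi)       # [100 100 100   0 300]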
Hi, I'm new to Python & Numpy and I'd like to ask: what is the most efficient way to split an ndarray into 3 parts: 20%, 60% and 20%?
import numpy as np
row_indices = np.random.permutation(10)
Let's assume the ndarray has 10 items: [7 9 3 1 2 4 5 6 0 8]
The expected results are the ndarray separated into 3 parts like part1, part2 and part3.
part1: [7 9]
part2: [3 1 2 4 5 6]
part3: [0 8]
Here's one way -
# data array
In [85]: a = np.array([7, 9, 3, 1, 2, 4, 5, 6, 0, 8])
# percentages (ratios) array
In [86]: p = np.array([0.2,0.6,0.2]) # must sum to 1
In [87]: np.split(a,(len(a)*p[:-1].cumsum()).astype(int))
Out[87]: [array([7, 9]), array([3, 1, 2, 4, 5, 6]), array([0, 8])]
Alternative to np.split:
np.split could be slower when working with large data, so we could alternatively use a loop there -
split_idx = np.r_[0,(len(a)*p.cumsum()).astype(int)]
out = [a[i:j] for (i,j) in zip(split_idx[:-1],split_idx[1:])]
I normally just go for the most obvious solution, although there are much fancier ways to do the same. It takes a second to implement and doesn't even require debugging, since it's extremely simple:

part1 = [a[i, ...] for i in range(int(a.shape[0] * 0.2))]
part2 = [a[i, ...] for i in range(int(a.shape[0] * 0.2), int(a.shape[0] * 0.8))]
part3 = [a[i, ...] for i in range(int(a.shape[0] * 0.8), a.shape[0])]
A few things to notice though:
This is rounded, so you could get something which is only roughly a 20-60-20 split.
You get back a list of elements, so you might have to re-numpyfy them with np.asarray().
You can use this method for indexing multiple objects (e.g. labels and inputs) by the same elements.
If you get the indices once before the splits (indices = list(range(a.shape[0]))), you could also shuffle them, taking care of data shuffling at the same time; see the sketch below.
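For those last two points, a minimal sketch (the labels array and split names are my own illustration, not from the question):

import numpy as np

a = np.random.random((10, 3))                 # hypothetical inputs
labels = np.arange(10)                        # hypothetical labels, row-aligned with a
indices = np.random.permutation(a.shape[0])   # shuffle once, reuse for every object

i1, i2 = int(a.shape[0] * 0.2), int(a.shape[0] * 0.8)
part1_x, part2_x, part3_x = a[indices[:i1]], a[indices[i1:i2]], a[indices[i2:]]
part1_y, part2_y, part3_y = labels[indices[:i1]], labels[indices[i1:i2]], labels[indices[i2:]]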
I am trying to represent a partition of the numbers 0 to n-1 in Python.
I have a numpy array where the ith entry indicates the partition ID of number i. For instance, the numpy array
indicator = array([1, 1, 3, 0, 2, 3, 0, 0])
indicates that numbers 3, 6, and 7 belong to the partition with ID 0. Numbers 0 and 1 belong to partition 1. 4 belongs to partition 2. And 2 and 5 belong to partition 3. Let's call this the indicator representation.
Another way to represent the partition would be a list of lists where the ith list is the partition with ID i. For the array above, this maps to
explicit = [[3, 6, 7], [0, 1], [4], [2, 5]]
Let's call this the explicit representation.
My question is what is the most efficient way to convert the indicator representation to the explicit representation? The naive way is to iterate through the indicator array and assign the elements to their respective slot in the explicit array, but iterating through numpy arrays is inefficient. Is there a more natural numpy construct to do this?
Here's an approach using sorted indices and then splitting those into groups -
def indicator_to_part(indicator):
    sidx = indicator.argsort() # indicator.argsort(kind='mergesort') keeps order
    sorted_arr = indicator[sidx]
    split_idx = np.nonzero(sorted_arr[1:] != sorted_arr[:-1])[0]
    return np.split(sidx, split_idx+1)
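Applied to the sample array from the question (within-group order can vary with the default quicksort; use kind='mergesort' as noted in the comment if the original order matters):

>>> indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
>>> indicator_to_part(indicator)
[array([3, 6, 7]), array([0, 1]), array([4]), array([2, 5])]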
Runtime test -
In [326]: indicator = np.random.randint(0,100,(10000))
In [327]: %timeit from_ind_to_expl(indicator) # @yogabonito's soln
100 loops, best of 3: 5.59 ms per loop
In [328]: %timeit indicator_to_part(indicator)
1000 loops, best of 3: 801 µs per loop
In [330]: indicator = np.random.randint(0,1000,(100000))
In [331]: %timeit from_ind_to_expl(indicator) # @yogabonito's soln
1 loops, best of 3: 494 ms per loop
In [332]: %timeit indicator_to_part(indicator)
100 loops, best of 3: 11.1 ms per loop
Note that the output would be a list of arrays. If you have to get a list of lists as output, a simple way would be to use list(map(list, indicator_to_part(indicator))). Again, a performant alternative would involve a few more steps, like so -
def indicator_to_part_list(indicator):
    sidx = indicator.argsort() # indicator.argsort(kind='mergesort') keeps order
    sorted_arr = indicator[sidx]
    split_idx = np.nonzero(sorted_arr[1:] != sorted_arr[:-1])[0]
    sidx_list = sidx.tolist()
    start = np.append(0, split_idx+1)
    stop = np.append(split_idx+1, indicator.size+1)
    return [sidx_list[start[i]:stop[i]] for i in range(start.size)]
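On the sample array this returns plain lists, matching the explicit representation from the question:

>>> indicator_to_part_list(np.array([1, 1, 3, 0, 2, 3, 0, 0]))
[[3, 6, 7], [0, 1], [4], [2, 5]]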
Here is a solution for translating indicator to explicit using numpy only (no for loops, list comprehensions, itertools, etc.)
I haven't seen your iteration-based approach so I can't compare them but maybe you can tell me if it's fast enough for your needs :)
import numpy as np
indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
explicit = [[3, 6, 7], [0, 1], [4], [2, 5]]
def from_ind_to_expl(indicator):
    groups, group_sizes = np.unique(indicator, return_counts=True)
    group_sizes = np.cumsum(group_sizes)
    ordered = np.where(indicator==groups[:, np.newaxis])
    return np.hsplit(ordered[1], group_sizes[:-1])
from_ind_to_expl(indicator) gives
[array([3, 6, 7]), array([0, 1]), array([4]), array([2, 5])]
I have also compared the times of @Divakar's and my solution. On my machine @Divakar's solution is 2-3 times faster than mine. So @Divakar definitely gets an upvote from me :)
In the last comparison in @Divakar's post there's no averaging for my solution because there's only one loop - this is slightly unfair :P ;)
I am trying to write a function that takes a matrix A, then offsets it by one, and does element wise matrix multiplication on the shared area. Perhaps an example will help. Suppose I have the matrix:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
What I'd like returned is:
(1*2) + (4*5) + (7*8) = 78
The following code does it, but inefficiently:
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
Height = A.shape[0]
Width = A.shape[1]
Sum1 = 0
for y in range(0, Height):
    for x in range(0, Width-2):
        Sum1 = Sum1 + A.item(y, x)*A.item(y, x+1)
        print("%d * %d" % (A.item(y, x), A.item(y, x+1)))
print(Sum1)
With output:
1 * 2
4 * 5
7 * 8
78
Here is my attempt to write the code more efficiently with numpy:
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(np.sum(np.multiply(A[:,0:-1], A[:,1:])))
Unfortunately, this time I get 186. I am at a loss as to where I went wrong. I'd love someone to either correct me or offer another way to implement this.
Thank you.
In this 3-column case, you are just multiplying the first 2 columns and taking the sum:
A[:,:2].prod(1).sum()
Out[36]: 78
Same as (A[:,0]*A[:,1]).sum()
Now just how does that generalize to more columns?
In your original loop, you can cut out the row iteration by taking the sum of this list:
[A[:,x]*A[:,x+1] for x in range(0,A.shape[1]-2)]
Out[40]: [array([ 2, 20, 56])]
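Summing those per-row products reproduces the loop's result:

np.sum([A[:,x]*A[:,x+1] for x in range(0, A.shape[1]-2)])
# 78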
Your description talks about multiplying the shared area; what direction are you doing the offset? From the calculation it looks like the offset is negative.
A[:,:-1]
Out[47]:
array([[1, 2],
[4, 5],
[7, 8]])
If that is the offset logic, then I could rewrite my calculation as
A[:,:-1].prod(1).sum()
which should work for many more columns.
===================
Your 2nd try:
In [3]: [A[:,:-1],A[:,1:]]
Out[3]:
[array([[1, 2],
[4, 5],
[7, 8]]),
array([[2, 3],
[5, 6],
[8, 9]])]
In [6]: A[:,:-1]*A[:,1:]
Out[6]:
array([[ 2, 6],
[20, 30],
[56, 72]])
In [7]: _.sum()
Out[7]: 186
In other words, instead of 1*2 you are calculating [1,2]*[2,3]=[2,6]. Nothing wrong with that, if that's what you really intend. The key is being clear about 'offset' and 'overlap'.
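To make the two readings concrete (a sketch with my own variable names):

import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# (a) only the first pair of columns, as the original loop computes:
first_pair = (A[:, 0] * A[:, 1]).sum()    # 78

# (b) every adjacent (overlapping) pair of columns:
all_pairs = (A[:, :-1] * A[:, 1:]).sum()  # 186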
I have a function that returns the argmax from a large 2d array
getMax = np.argmax(dist, axis=1)
However, I want to get the next biggest values. Is there a way of removing the getMax values from the original array and then performing argmax again?
Use the command np.argsort(a, axis=-1, kind='quicksort', order=None), but with appropriate choice of arguments (below).
See the numpy.argsort documentation, which notes that it "returns an array of indices of the same shape as a that index data along the given axis in sorted order."
The default order is smallest to largest, so sort with -dist (for quick coding). Caution: -dist generates a new array, which you may care about if dist is huge. See the bottom of this post for an alternative that avoids it.
Here is an example:
x = np.array([[1,2,5,0],[5,7,2,3]])
L = np.argsort(-x, axis=1)
print(L)
[[2 1 0 3]
[1 0 3 2]]
x
array([[1, 2, 5, 0],
[5, 7, 2, 3]])
So the n'th entry in a row of L gives the locations of the n'th largest element of x.
x is unchanged.
L[:,0] will give the same output as np.argmax(x, axis=1):
L[:,0]
array([2, 1])
np.argmax(x,axis=1)
array([2, 1])
and L[:,1] will give the same as a hypothetical argsecondmax(x)
L[:,1]
array([1, 0])
If you don't want to generate a new array (that is, you don't want to use -x):
L = np.argsort(x, axis=1)
print(L)
[[3 0 1 2]
[2 3 0 1]]
L[:,-1]
array([2, 1])
L[:,-2]
array([1, 0])
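Put back in terms of the question's dist array (a sketch; dist is the OP's 2-D array):

order = np.argsort(dist, axis=1)
second_idx = order[:, -2]                                 # column of the 2nd-largest per row
second_vals = dist[np.arange(dist.shape[0]), second_idx]  # the 2nd-largest values themselves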
If speed is important to you, using argpartition rather than argsort could be useful.
For example, to return the n largest elements from a list:
import numpy as np
l = np.random.randint(0, 100, int(1e6))
n = 10  # e.g. return the 10 largest; any n < l.size works
top_n_1 = l[np.argsort(-l)[0:n]]
top_n_2 = l[np.argpartition(l, -n)[-n:]]
The %timeit magic in IPython reports
10 loops, best of 3: 56.9 ms per loop for top_n_1 and 100 loops, best of 3: 8.06 ms per loop for top_n_2.
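One caveat: argpartition only guarantees that the last n entries are the n largest, not that they are sorted. If order matters, sorting just those n afterwards is a cheap extra step:

top_n_sorted = np.sort(l[np.argpartition(l, -n)[-n:]])[::-1]  # descending order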
I hope this is useful.