Numpy: indicators to partition - python

I am trying to represent a partition of the numbers 0 to n-1 in Python.
I have a numpy array where the ith entry indicates the partition ID of number i. For instance, the numpy array
indicator = array([1, 1, 3, 0, 2, 3, 0, 0])
indicates that numbers 3, 6, and 7 belong to the partition with ID 0. Numbers 0 and 1 belong to partition 1. 4 belongs to partition 2. And 2 and 5 belong to partition 3. Let's call this the indicator representation.
Another way to represent the partition would be a list of lists where the ith list is the partition with ID i. For the array above, this maps to
explicit = [[3, 6, 7], [0, 1], [4], [2, 5]]
Let's call this the explicit representation.
My question is what is the most efficient way to convert the indicator representation to the explicit representation? The naive way is to iterate through the indicator array and assign the elements to their respective slot in the explicit array, but iterating through numpy arrays is inefficient. Is there a more natural numpy construct to do this?
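For reference, here is a minimal sketch of the naive loop I have in mind (assuming the partition IDs run from 0 to k-1 with no gaps):
import numpy as np

def naive_indicator_to_explicit(indicator):
    # one Python-level pass over the array; correct, but slow for large inputs
    explicit = [[] for _ in range(indicator.max() + 1)]
    for i, part_id in enumerate(indicator):
        explicit[part_id].append(i)
    return explicit

naive_indicator_to_explicit(np.array([1, 1, 3, 0, 2, 3, 0, 0]))
# [[3, 6, 7], [0, 1], [4], [2, 5]]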

Here's an approach using sorted indices and then splitting those into groups -
import numpy as np

def indicator_to_part(indicator):
    # indicator.argsort(kind='mergesort') keeps the original order within each group
    sidx = indicator.argsort()
    sorted_arr = indicator[sidx]
    split_idx = np.nonzero(sorted_arr[1:] != sorted_arr[:-1])[0]
    return np.split(sidx, split_idx + 1)
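A quick sanity check on the question's array (a sketch of the expected usage):
indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
indicator_to_part(indicator)
# [array([3, 6, 7]), array([0, 1]), array([4]), array([2, 5])]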
Runtime test -
In [326]: indicator = np.random.randint(0,100,(10000))
In [327]: %timeit from_ind_to_expl(indicator)  # @yogabonito's soln
100 loops, best of 3: 5.59 ms per loop
In [328]: %timeit indicator_to_part(indicator)
1000 loops, best of 3: 801 µs per loop
In [330]: indicator = np.random.randint(0,1000,(100000))
In [331]: %timeit from_ind_to_expl(indicator)  # @yogabonito's soln
1 loops, best of 3: 494 ms per loop
In [332]: %timeit indicator_to_part(indicator)
100 loops, best of 3: 11.1 ms per loop
Note that the output would be a list of arrays. If you have to get a list of lists as output, a simple way would be to use map(list, indicator_to_part(indicator)). Again, a performant alternative would involve a few more steps, like so -
def indicator_to_part_list(indicator):
    # indicator.argsort(kind='mergesort') keeps the original order within each group
    sidx = indicator.argsort()
    sorted_arr = indicator[sidx]
    split_idx = np.nonzero(sorted_arr[1:] != sorted_arr[:-1])[0]
    sidx_list = sidx.tolist()
    start = np.append(0, split_idx + 1)
    stop = np.append(split_idx + 1, indicator.size + 1)
    return [sidx_list[start[i]:stop[i]] for i in range(start.size)]
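A quick check on the example array from the question (a sketch of the expected usage; use kind='mergesort' if the order within each group matters):
indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
indicator_to_part_list(indicator)
# [[3, 6, 7], [0, 1], [4], [2, 5]]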

Here is a solution for translating indicator to explicit using numpy only (no for loops, list comprehensions, itertools, etc.).
I haven't seen your iteration-based approach, so I can't compare them, but maybe you can tell me if it's fast enough for your needs :)
import numpy as np
indicator = np.array([1, 1, 3, 0, 2, 3, 0, 0])
explicit = [[3, 6, 7], [0, 1], [4], [2, 5]]
def from_ind_to_expl(indicator):
    groups, group_sizes = np.unique(indicator, return_counts=True)
    group_sizes = np.cumsum(group_sizes)
    ordered = np.where(indicator == groups[:, np.newaxis])
    return np.hsplit(ordered[1], group_sizes[:-1])
from_ind_to_expl(indicator) gives
[array([3, 6, 7]), array([0, 1]), array([4]), array([2, 5])]
I have also compared the times of @Divakar's and my solution. On my machine @Divakar's solution is 2-3 times faster than mine, so @Divakar definitely gets an upvote from me :)
In the last comparison in @Divakar's post there's no averaging for my solution because there's only one loop - this is slightly unfair :P ;)


flatten list of lists and scalars [duplicate]

This question already has answers here: Flatten an irregular (arbitrarily nested) list of lists (51 answers). Closed 6 months ago.
So for a matrix, we have methods like numpy.ndarray.flatten():
np.array([[1,2,3],[4,5,6],[7,8,9]]).flatten()
gives [1,2,3,4,5,6,7,8,9]
What if I wanted to get from np.array([[1,2,3],[4,5,6],7]) to [1,2,3,4,5,6,7]?
Is there an existing function that performs something like that?
With uneven lists, the array has object dtype (and is 1d, so flatten doesn't change it):
In [96]: arr=np.array([[1,2,3],[4,5,6],7])
In [97]: arr
Out[97]: array([[1, 2, 3], [4, 5, 6], 7], dtype=object)
In [98]: arr.sum()
...
TypeError: can only concatenate list (not "int") to list
The 7 element is giving problems. If I change that to a list:
In [99]: arr=np.array([[1,2,3],[4,5,6],[7]])
In [100]: arr.sum()
Out[100]: [1, 2, 3, 4, 5, 6, 7]
I'm using a trick here. The elements of the array are lists, and for lists, [1,2,3]+[4,5] is concatenation.
The basic point is that an object array is not a 2d array. It is, in many ways, more like a list of lists.
chain
The best list flattener is chain
In [104]: list(itertools.chain(*arr))
Out[104]: [1, 2, 3, 4, 5, 6, 7]
though it too will choke on the integer 7 version.
concatenate and hstack
If the array is a list of lists (not the original mix of lists and scalar) then np.concatenate works. It iterates on the object just as though it were a list.
With the mixed original list, concatenate does not work, but hstack does:
In [178]: arr=np.array([[1,2,3],[4,5,6],7])
In [179]: np.concatenate(arr)
...
ValueError: all the input arrays must have same number of dimensions
In [180]: np.hstack(arr)
Out[180]: array([1, 2, 3, 4, 5, 6, 7])
That's because hstack first iterates through the list and makes sure all elements are atleast_1d. This extra iteration makes it more robust, but at a cost in processing speed.
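Roughly speaking, that robustness comes from a per-element atleast_1d pass before concatenating. A sketch of the idea (not hstack's actual implementation; dtype=object is spelled out because newer NumPy requires it for ragged input):
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], 7], dtype=object)
# promote every element to at least 1-d, then concatenate the pieces
pieces = [np.atleast_1d(e) for e in arr]
np.concatenate(pieces)
# array([1, 2, 3, 4, 5, 6, 7])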
time tests
In [170]: big1=arr.repeat(1000)
In [171]: timeit big1.sum()
10 loops, best of 3: 31.6 ms per loop
In [172]: timeit list(itertools.chain(*big1))
1000 loops, best of 3: 433 µs per loop
In [173]: timeit np.concatenate(big1)
100 loops, best of 3: 5.05 ms per loop
double the size
In [174]: big1=arr.repeat(2000)
In [175]: timeit big1.sum()
10 loops, best of 3: 128 ms per loop
In [176]: timeit list(itertools.chain(*big1))
1000 loops, best of 3: 803 µs per loop
In [177]: timeit np.concatenate(big1)
100 loops, best of 3: 9.93 ms per loop
In [182]: timeit np.hstack(big1) # the extra iteration hurts hstack speed
10 loops, best of 3: 43.1 ms per loop
The sum is quadratic in size, because it effectively does
res = bigarr[0]
for e in bigarr[1:]:
    res = res + e
res grows at every step, and each + builds a new list, so each iteration is more expensive than the last.
chain has the best times.
You can write a custom flatten function using yield:
def flatten(arr):
    for i in arr:
        try:
            yield from flatten(i)
        except TypeError:
            yield i
Usage example:
>>> myarr = np.array([[1,2,3],[4,5,6],7])
>>> newarr = list(flatten(myarr))
>>> newarr
[1, 2, 3, 4, 5, 6, 7]
You can use apply_along_axis here
>>> arr = np.array([[1,2,3],[4,5,6],[7]])
>>> np.apply_along_axis(np.concatenate, 0, arr)
array([1, 2, 3, 4, 5, 6, 7])
As a bonus, this is not quadratic in the number of lists either.

Create a new array with Timesteps and multiple features, e.g. for LSTM

Hi, I am using numpy to create a new array with timesteps and multiple features, for an LSTM.
I have looked at a number of approaches using strides and reshaping but haven't managed to find an efficient solution.
Here is a function that solves a toy problem; however, I have 30,000 samples, each with 100 features.
def make_timesteps(a, timesteps):
    array = []
    for j in np.arange(len(a)):
        unit = []
        for i in range(timesteps):
            # negative shift so row j is followed by rows j+1, j+2, ... (wrapping around)
            unit.append(np.roll(a, -i, axis=0)[j])
        array.append(unit)
    return np.array(array)
inArr = np.array([[1, 2], [3,4], [5,6]])
inArr.shape => (3, 2)
outArr = make_timesteps(inArr, 2)
outArr.shape => (3, 2, 2)
assert(np.array_equal(outArr,
    np.array([[[1, 2], [3, 4]], [[3, 4], [5, 6]], [[5, 6], [1, 2]]])))
=> True
Is there a more efficient way of doing this (there must be!)? Can someone please help?
One trick would be to take the last L-1 rows of the array and stack them onto the start of the array. Then, it's a simple case of using the very efficient NumPy strides. For people wondering about the cost of this trick: as we will see later on through the timing tests, it's as good as nothing.
The trick, extended to a final version that supports both forward and backward striding, would look something like this -
Backward striding :
def strided_axis0_backward(inArr, L=2):
    # INPUTS :
    # inArr : Input array
    # L : Number of rows per subarray (window length along axis=0)
    # Stack the last L-1 rows onto the start. It just helps in keeping a view output.
    a = np.vstack((inArr[-L+1:], inArr))
    # Store shape and strides info
    m, n = a.shape
    s0, s1 = a.strides
    # Length of 3D output array along its axis=0
    nd0 = m - L + 1
    strided = np.lib.stride_tricks.as_strided
    return strided(a[L-1:], shape=(nd0, L, n), strides=(s0, -s0, s1))
Forward striding :
def strided_axis0_forward(inArr, L=2):
    # INPUTS :
    # inArr : Input array
    # L : Number of rows per subarray (window length along axis=0)
    # Stack the first L-1 rows onto the end. It just helps in keeping a view output.
    a = np.vstack((inArr, inArr[:L-1]))
    # Store shape and strides info
    m, n = a.shape
    s0, s1 = a.strides
    # Length of 3D output array along its axis=0
    nd0 = m - L + 1
    strided = np.lib.stride_tricks.as_strided
    return strided(a[:L-1], shape=(nd0, L, n), strides=(s0, s0, s1))
Sample run -
In [42]: inArr
Out[42]:
array([[1, 2],
       [3, 4],
       [5, 6]])
In [43]: strided_axis0_backward(inArr, 2)
Out[43]:
array([[[1, 2],
        [5, 6]],
       [[3, 4],
        [1, 2]],
       [[5, 6],
        [3, 4]]])
In [44]: strided_axis0_forward(inArr, 2)
Out[44]:
array([[[1, 2],
        [3, 4]],
       [[3, 4],
        [5, 6]],
       [[5, 6],
        [1, 2]]])
Runtime test -
In [53]: inArr = np.random.randint(0,9,(1000,10))
In [54]: %timeit make_timesteps(inArr, 2)
...: %timeit strided_axis0_forward(inArr, 2)
...: %timeit strided_axis0_backward(inArr, 2)
...:
10 loops, best of 3: 33.9 ms per loop
100000 loops, best of 3: 12.1 µs per loop
100000 loops, best of 3: 12.2 µs per loop
In [55]: %timeit make_timesteps(inArr, 10)
...: %timeit strided_axis0_forward(inArr, 10)
...: %timeit strided_axis0_backward(inArr, 10)
...:
1 loops, best of 3: 152 ms per loop
100000 loops, best of 3: 12 µs per loop
100000 loops, best of 3: 12.1 µs per loop
In [56]: 152000/12.1 # Speedup figure
Out[56]: 12561.98347107438
The timings of strided_axis0 stay the same even as we increase the length of the subarrays in the output. That just goes to show the massive benefit of strides, and of course the crazy speedups over the original loopy version.
As promised at the start, here are the timings for the stacking cost with np.vstack -
In [417]: inArr = np.random.randint(0,9,(1000,10))
In [418]: L = 10
In [419]: %timeit np.vstack(( inArr[-L+1:], inArr ))
100000 loops, best of 3: 5.41 µs per loop
The timings support the idea that the stacking step is a pretty cheap one.
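As a side note, newer NumPy (1.20+) ships numpy.lib.stride_tricks.sliding_window_view, which builds this kind of windowed view without hand-written strides. A minimal sketch of the forward case under that assumption:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

inArr = np.array([[1, 2], [3, 4], [5, 6]])
L = 2
# wrap the first L-1 rows around to the end, as in the vstack trick above
a = np.vstack((inArr, inArr[:L-1]))
# window along axis=0; the window axis is appended last, so move it into place
out = sliding_window_view(a, L, axis=0).transpose(0, 2, 1)
# out[1] -> array([[3, 4],
#                  [5, 6]])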

vectorized sum of array according to indices of second array [duplicate]

This question already has answers here: Numpy sum elements in array based on its value (2 answers). Closed 6 years ago.
I have an empty array:
empty = np.array([0, 0, 0, 0, 0])
an array of indices corresponding to positions in my array empty
ind = np.array([2, 3, 1, 2, 4, 2, 4, 2, 1, 1, 1, 2])
and an array of values
val = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
I want to add the values in 'val' into 'empty' according to the positions given by 'ind'.
The non-vectorized solution is:
for i, v in zip(ind, val): empty[i] += v
>>> empty
[ 0. 4. 5. 1. 2.]
My actual arrays are multidimensional and loooong, so I've got a NEED FOR SPEED. I really want a vectorized solution, or at least one that is very fast.
Note this does not work:
empty[ind] += val
>>> empty
array([ 0., 1., 1., 1., 1.])
I'd be extra grateful for a solution that works in Python 2.7, 3.5, and 3.6 with no hiccups.
You can make use of np.add.at, which is equivalent to empty[ind] += val except that results are accumulated for elements that are indexed more than once, giving you the summed result for those indices.
>>> np.add.at(empty, ind, val)
>>> empty
array([0, 4, 5, 1, 2])
What you are looking for is e = np.bincount(ind, weights=val, minlength=n), where n is the length of your empty array. That way you don't have to initialize empty. You only need the minlength argument the first time; afterwards you can do e += np.bincount(ind, weights=val).
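A quick check on the question's data (note that passing weights makes bincount return floats):
import numpy as np

ind = np.array([2, 3, 1, 2, 4, 2, 4, 2, 1, 1, 1, 2])
val = np.ones(len(ind))
np.bincount(ind, weights=val, minlength=5)
# array([0., 4., 5., 1., 2.])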
This is at least twice as fast as np.add.at:
%timeit np.bincount(ind, val, minlength=empty.size)
The slowest run took 12.69 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.05 µs per loop
%timeit np.add.at(empty, ind, val)
The slowest run took 2822.05 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.32 µs per loop
As for multi-dimensional indices, you can do:
np.bincount(np.ravel_multi_index(ind, empty.shape), np.ravel(val), minlength=empty.size).reshape(empty.shape)
I'm not sure how to do this with np.add.at to compare speeds.
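For what it's worth, np.add.at does accept a tuple of index arrays, so a rough sketch of the multi-dimensional comparison (with made-up 2-D data) could look like this:
import numpy as np

empty = np.zeros((3, 4))
rows = np.array([0, 1, 0, 2, 1])
cols = np.array([1, 2, 1, 3, 2])
val = np.ones(5)

np.add.at(empty, (rows, cols), val)   # duplicate (row, col) pairs accumulate
flat = np.bincount(np.ravel_multi_index((rows, cols), empty.shape),
                   weights=val, minlength=empty.size).reshape(empty.shape)
assert np.allclose(empty, flat)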
This is basically a histogram, so in the one-dimensional case:
h, b = np.histogram(ind, bins=np.arange(empty.size+1), weights=val)
empty += h
Of course you can leave out the second statement in case empty only has zeros.

Next argmax values in python

I have a function that returns the argmax from a large 2d array
getMax = np.argmax(dist, axis=1)
However, I want to get the next biggest values. Is there a way of removing the getMax values from the original array and then performing argmax again?
Use the command np.argsort(a, axis=-1, kind='quicksort', order=None), but with appropriate choice of arguments (below).
Here is the documentation. Note "It returns an array of indices of the same shape as a that index data along the given axis in sorted order."
The default order is small to large, so sort with -dist (for quick coding). Caution: -dist generates a new array, which you may care about if dist is huge. See the bottom of this post for an alternative that avoids that.
Here is an example:
x = np.array([[1,2,5,0],[5,7,2,3]])
L = np.argsort(-x, axis=1)
print L
[[2 1 0 3]
 [1 0 3 2]]
x
array([[1, 2, 5, 0],
       [5, 7, 2, 3]])
So the n'th entry in a row of L gives the location of the n'th largest element of x.
x is unchanged.
L[:,0] will give the same output as np.argmax(x, axis=1)
L[:,0]
array([2, 1])
np.argmax(x,axis=1)
array([2, 1])
and L[:,1] will give the same as a hypothetical argsecondmax(x)
L[:,1]
array([1, 0])
If you don't want to generate a new array, i.e. you don't want to use -x:
L = np.argsort(x, axis=1)
print L
[[3 0 1 2]
 [2 3 0 1]]
L[:,-1]
array([2, 1])
L[:,-2]
array([1, 0])
If speed is important to you, using argpartition rather than argsort could be useful.
For example, to return the n largest elements from a list:
import numpy as np
n = 10  # for example; how many of the largest elements to return
l = np.random.randint(0, 100, int(1e6))
top_n_1 = l[np.argsort(-l)[:n]]
top_n_2 = l[np.argpartition(l, -n)[-n:]]
The %timeit function in ipython reports
10 loops, best of 3: 56.9 ms per loop for top_n_1 and 100 loops, best of 3: 8.06 ms per loop for top_n_2.
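Tying this back to the 2-D case in the question, a sketch of getting the per-row index of the second-largest value directly with argpartition (assuming dist has at least two columns):
import numpy as np

dist = np.array([[1, 2, 5, 0],
                 [5, 7, 2, 3]])
# the index landing at position -2 of the partitioned order is the second-largest per row
second_best = np.argpartition(dist, -2, axis=1)[:, -2]
second_best
# array([1, 0])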
I hope this is useful.

Numpy array filtering by two criteria

I'm trying to run a custom kmeans clustering algorithm and am having trouble getting the document frequency for each column (term) of a 2-d numpy array by cluster. My current algorithm has two numpy arrays: a raw dataset that lists the documents by terms [2000L, 9500L], and one that is the clustering assignment [2000L,]. There are 5 clusters. What I need to do is create an array that lists the document frequency for each cluster - basically a count in each column where the column number matches a row number in a different array. The output will be a [5L, 9500L] array (clusters x terms). I'm having trouble finding a way to do the equivalent of a countif and group by. Here is some sample data and the output I would like if I ran it with only 2 clusters:
import numpy as np
dataset = np.array([[1,2,0,3,0],[0,2,0,0,3],[4,5,2,3,0],[0,0,2,3,0]])
clusters = np.array([0,1,1,0])
#run code here to get documentFrequency
print documentFrequency
>> [1,1,1,2,0],[1,2,1,1,1]
My thoughts would be to select out the specific rows that match each cluster, because then counting should be easy. For example, if I could split the data into the following arrays:
cluster0 = np.array([[1,2,0,3,0],[0,0,2,3,0]])
cluster1 = np.array([[0,2,0,0,3],[4,5,2,3,0]])
Any direction or pointers would be much appreciated!
I don't think there is any easy way to vectorize your code, but if you have only a few clusters you could do the obvious:
>>> cluster_count = np.max(clusters) + 1
>>> doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
>>> for j in xrange(cluster_count):
...     doc_freq[j] = np.sum(dataset[clusters == j], axis=0)
...
>>> doc_freq
array([[1, 2, 2, 6, 0],
       [4, 7, 2, 3, 3]])
As @Jaime says, if you have only a few clusters it makes sense to use the usual trick of manually looping over the smallest axis length. Often that gets you most of the benefits of full vectorization with a lot less of the headache that comes with being clever.
That said, when you find yourself wanting groupby, you're often in a domain in which a higher-level tool like pandas comes in very handy:
>>> pd.DataFrame(dataset).groupby(clusters).sum()
   0  1  2  3  4
0  1  2  2  6  0
1  4  7  2  3  3
And you can easily fall back to an ndarray if needed:
>>> pd.DataFrame(dataset).groupby(clusters).sum().values
array([[1, 2, 2, 6, 0],
       [4, 7, 2, 3, 3]])
Depending on how well compiled your BLAS is, writing this as a matrix multiplication could be faster:
cvals = (clusters == np.arange(clusters.max()+1)[:,None]).astype(int)
cvals
array([[1, 0, 0, 1],
       [0, 1, 1, 0]])
np.dot(cvals,dataset)
array([[1, 2, 2, 6, 0],
       [4, 7, 2, 3, 3]])
Let's create two definitions:
def loop(cvals, dataset):
    cluster_count = np.max(cvals) + 1
    doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
    for j in xrange(cluster_count):
        doc_freq[j] = np.sum(dataset[cvals == j], axis=0)
    return doc_freq

def matrix_mult(clusters, dataset):
    cvals = (clusters == np.arange(clusters.max()+1)[:, None]).astype(dataset.dtype)
    return np.dot(cvals, dataset)
Now for some timings:
arr = np.random.random((2000,9500))
cluster = np.random.randint(0,5,(2000))
np.allclose(loop(cluster,arr),matrix_mult(cluster,arr))
True
%timeit loop(cluster,arr)
1 loops, best of 3: 263 ms per loop
%timeit matrix_mult(cluster,arr)
100 loops, best of 3: 14.1 ms per loop
Note this is with a threaded MKL BLAS. Your mileage will vary.
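One more note: the question's expected output is a document frequency, i.e. a count of nonzero entries per column, rather than a sum of the term counts. A small sketch reusing the matrix-multiplication idea above, binarizing the data first:
import numpy as np

dataset = np.array([[1, 2, 0, 3, 0],
                    [0, 2, 0, 0, 3],
                    [4, 5, 2, 3, 0],
                    [0, 0, 2, 3, 0]])
clusters = np.array([0, 1, 1, 0])

cvals = (clusters == np.arange(clusters.max() + 1)[:, None]).astype(int)
doc_freq = np.dot(cvals, (dataset > 0).astype(int))
doc_freq
# array([[1, 1, 1, 2, 0],
#        [1, 2, 1, 1, 1]])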
