I have a numpy array A of n 1x3 arrays, where n is the total number of possible 1x3 combinations and each element ranges from 0 to 50. That is,
A = [[0,0,0],[0,0,1]...[0,1,0]...[50,50,50]]
and
len(A) = 50*50*50 = 125000
I have a numpy array B of m 1x3 arrays where m = 10 million, and the arrays can have values belonging to the set described by A.
I want to count up how many of each combination is present in B, that is, how many times [0,0,0] appears in B, how many times [0,0,1] appears...how many times [50,50,50] appears. So far I have the following:
for i in range(len(A)):
    for j in range(len(B)):
        if np.array_equal(A[i], B[j]):
            y[i] += 1
where y keeps track of how many times the ith array occurs. So, y[0] is how many times [0,0,0] appeared in B, y[1] is how many times [0,0,1] appeared, and so on, up to y[124999] for how many times [50,50,50] appeared.
The problem is this takes forever. It has to check 10 million entries, 125000 times. Is there a quicker and more efficient way to do this?
Here is a fast approach. It processes 10 million tuples out of range(50)^3 in a fraction of a second and is roughly 100 times faster than the next best solution (@Primusa's):
It uses the fact that there is a straightforward translation between such tuples and the numbers 0 through 50^3 - 1. (The mapping happens to be the same as the one between the rows of your A and the row numbers.) The functions np.ravel_multi_index and np.unravel_index implement this translation and its inverse.
Once B is translated into numbers, their frequencies can be determined very efficiently using np.bincount. Below I reshape the result to get a 50x50x50 histogram, but that is just a matter of taste and can be left out. (I have taken the liberty of using only the numbers 0 through 49, so that len(A) really is 125000):
>>> B = np.random.randint(0, 50, (10000000, 3))
>>> Br = np.ravel_multi_index(B.T, (50, 50, 50))
>>> result = np.bincount(Br, minlength=125000).reshape(50, 50, 50)
Let's look at a smaller example for demonstration:
>>> B = np.random.randint(0, 3, (10, 3))
>>> Br = np.ravel_multi_index(B.T, (3, 3, 3))
>>> result = np.bincount(Br, minlength=27).reshape(3, 3, 3)
>>>
>>> B
array([[1, 1, 2],
       [2, 1, 2],
       [2, 0, 0],
       [2, 1, 0],
       [2, 0, 2],
       [0, 0, 2],
       [0, 0, 2],
       [0, 2, 2],
       [2, 0, 0],
       [0, 2, 0]])
>>> result
array([[[0, 0, 2],
        [0, 0, 0],
        [1, 0, 1]],

       [[0, 0, 0],
        [0, 0, 1],
        [0, 0, 0]],

       [[2, 0, 1],
        [1, 0, 1],
        [0, 0, 0]]])
To query for example how many times [2,1,0] is in B one would do
>>> result[2,1,0]
1
As mentioned above: To convert between indices into your A and the actual rows of A (which are the indices into my result), np.ravel_multi_index and np.unravel_index can be used. Or you can leave out the last reshape (i.e. use result = np.bincount(Br, minlength=125000)); then the counts are indexed exactly the same as A.
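For example, a minimal sketch of that round trip, using the 50x50x50 shape from above:

>>> flat = np.ravel_multi_index((2, 1, 0), (50, 50, 50))  # row [2,1,0] -> flat index
>>> flat
5050
>>> np.unravel_index(flat, (50, 50, 50))                  # flat index -> row
(2, 1, 0)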
You can use a dict() to speed up this process so that you only have to go through the 10 million entries once.
So the first thing you want to do is change all the sublists in A to hashable objects so you can use them as keys in a dict.
Converting all the sublists to tuples:
A = [tuple(i) for i in A]
Then create a dict() with every value in A as the key and the value being 0.
d = {i:0 for i in A}
Now for each subarray in your numpy array, you just want to convert it to a tuple and increment d[that array] by 1
for subarray in B:
    d[tuple(subarray)] += 1
d is now a dictionary where, for each key, the value is how many times that key occurred in B.
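Putting the pieces together, a minimal runnable sketch (a small random B stands in for the real 10-million-row array):

import numpy as np
from itertools import product

# all 1x3 combinations of 0..49, as tuples so they are hashable
A = [tuple(t) for t in product(range(50), repeat=3)]
d = {t: 0 for t in A}

B = np.random.randint(0, 50, (100000, 3))  # stand-in for the real B
for subarray in B:
    d[tuple(subarray)] += 1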
You can find the unique rows of array B and their counts by calling np.unique over its first axis with return_counts=True. Then, you can use broadcasting to find the indices of B's unique rows in A by calling the ndarray.all and ndarray.any methods on the proper axes. After that, all you need is simple indexing:
In [82]: unique, counts = np.unique(B, axis=0, return_counts=True)
In [83]: indices = np.where((unique == A[:,None,:]).all(axis=2).any(axis=0))[0]
# Get items from A that exist in B
In [84]: unique[indices]
# Get the counts
In [85]: counts[indices]
Example:
In [86]: arr = np.array([[2, 3, 4], [5, 6, 0], [2, 3, 4], [1, 0, 4], [3, 3, 3], [5, 6, 0], [2, 3, 4]])
In [87]: a = np.array([[2, 3, 4], [1, 9, 5], [3, 3, 3]])
In [88]: unique, counts = np.unique(arr, axis=0, return_counts=True)
In [89]: indices = np.where((unique == a[:,None,:]).all(axis=2).any(axis=0))[0]
In [90]: unique[indices]
Out[90]:
array([[2, 3, 4],
       [3, 3, 3]])
In [91]: counts[indices]
Out[91]: array([3, 1])
You can do this:
y=[np.where(np.all(B==arr,axis=1))[0].shape[0] for arr in A]
Here arr iterates over A; np.all checks where arr matches the rows of B, np.where returns the positions of those matches as an array, and shape then gives the length of that array, in other words the desired frequency.
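A small worked example (toy A and B, made up for illustration):

>>> A = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]])
>>> B = np.array([[0, 0, 1], [0, 0, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]])
>>> [np.where(np.all(B == arr, axis=1))[0].shape[0] for arr in A]
[1, 2, 1]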
I have an array of indices like a = [2, 4, 1, 0, 3] and I want to transform it into np.argsort(a) = [3, 2, 0, 4, 1].
The problem is that argsort has O(n*log(n)) timing, but for my case it may be O(n) and I even have code for this:
b = np.zeros(a.size)
for i in range(a.size):
    b[a[i]] = i
The second problem is that cycles are slow in Python and I hope that it's possible to use some NumPy tricks to achieve the goal.
Do you have all the numbers from 0 to len(a)-1?
Then use smart indexing:
a = [2, 4, 1, 0, 3]
b = np.empty(len(a), dtype=int) # or b = np.empty_like(a)
b[a] = np.arange(len(a))
b
output: array([3, 2, 0, 4, 1])
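A quick sanity check against np.argsort (not part of the original answer, just a verification):

>>> a = np.array([2, 4, 1, 0, 3])
>>> b = np.empty(len(a), dtype=int)
>>> b[a] = np.arange(len(a))
>>> np.array_equal(b, np.argsort(a))
True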
I have an n x m x 3 numpy array. This represents a middle-step towards an RGB representation of a complex-function plotter. When the function being plotted takes infinite values or has singularities, parts of the RGB data become NaNs.
I'm looking for an efficient way to replace a row containing a NaN with a row of my choice, perhaps [0, 0, 0] or [1, 1, 1]. In terms of the RGB values, this has the effect of replacing poorly-behaving pixels with white or black pixels. By efficient, I mean some way that takes advantage of numpy's vectorization and speed.
Please note that I am not looking to merely replace the NaN values with 0 (which I know how to do with numpy.where); if a row contains a NaN, I want to replace the whole row. I suspect this can be done nicely in numpy, but I'm not sure how.
Concrete Question
Suppose we are given a 2 x 2 x 3 array arr. If a row contains a 5, I want to replace the row with [0, 0, 0]. Trivial code that does this slowly is as follows.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 3, 5], [2, 4, 6]]])
# so arr is
# array([[[1, 2, 3],
#         [4, 5, 6]],
#
#        [[1, 3, 5],
#         [2, 4, 6]]])
# Trivial and slow version to replace rows containing 5 with [0,0,0]
for i in range(len(arr)):
    for j in range(len(arr[i])):
        if 5 in arr[i][j]:
            arr[i][j] = np.array([0, 0, 0])
# Now arr is
#
# array([[[1, 2, 3],
#         [0, 0, 0]],
#
#        [[0, 0, 0],
#         [2, 4, 6]]])
How can we accomplish this taking advantage of numpy?
A simpler way would be -
arr[np.isin(arr,5).any(-1)] = 0
If it's just a single value that you are looking for, then we could simplify to -
arr[(arr==5).any(-1)] = 0
If you are looking to match against NaN, we need to do the comparison differently and use np.isnan instead -
arr[np.isnan(arr).any(-1)] = 0
If you are looking to assign array values, instead of just 0, the solutions stay the same. Hence it would be -
arr[(arr==5).any(-1)] = new_array
Using np.broadcast_to
arr[np.broadcast_to((arr == 5).any(-1)[..., None], arr.shape)] = 0
array([[[1, 2, 3],
        [0, 0, 0]],

       [[0, 0, 0],
        [2, 4, 6]]])
Just as an FYI, based on your description: if you want to find np.nan values instead of integers like 5, you shouldn't use ==, but rather np.isnan:
arr[np.broadcast_to((np.isnan(arr)).any(-1)[..., None], arr.shape)] = 0
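For instance, a minimal sketch on a small float array with NaNs (made-up data):

>>> arr = np.array([[[1., 2., 3.], [4., np.nan, 6.]],
...                 [[np.nan, 3., 5.], [2., 4., 6.]]])
>>> arr[np.broadcast_to(np.isnan(arr).any(-1)[..., None], arr.shape)] = 0
>>> arr
array([[[1., 2., 3.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [2., 4., 6.]]])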
You can do it using the in1d function, as below:
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 3, 5], [2, 4, 6]]])
arr[np.in1d(arr,5).reshape(arr.shape).any(axis=2)] = [0,0,0]
arr
I have a 2D numpy array, A. I want to apply np.bincount() to each column of A to generate another 2D array B that is composed of the bincounts of each column of the original matrix A.
My problem is that np.bincount() is a function that takes a 1d array-like. It's not an array method like B = A.max(axis=1) for example.
Is there a more pythonic/numpythic way to generate this B array other than a nasty for-loop?
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
for x in range(A.shape[1]):
    B[:,x] = np.bincount(A[:,x])
Using the same philosophy as in this post, here's a vectorized approach -
m = A.shape[1]
n = A.max()+1
A1 = A + (n*np.arange(m))
out = np.bincount(A1.ravel(),minlength=n*m).reshape(m,-1).T
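A quick check on a small, made-up array (each output column holds the value counts of the corresponding column of A):

>>> A = np.array([[0, 1], [2, 1], [0, 0], [2, 1]])
>>> m = A.shape[1]             # 2 columns
>>> n = A.max() + 1            # 3 distinct states
>>> A1 = A + n * np.arange(m)  # shift each column into its own bin range
>>> np.bincount(A1.ravel(), minlength=n*m).reshape(m, -1).T
array([[2, 1],
       [0, 3],
       [2, 0]])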
I would suggest to use np.apply_along_axis, which will allow you to apply a 1D-method (in this case np.bincount) to 1D slices of a higher dimensional array:
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
B = np.apply_along_axis(np.bincount, axis=0, arr=A)
You'll have to be careful, though. This (as well as your suggested for-loop) only works if the output of np.bincount has the right shape. If the maximum state is not present in one or more columns of your array A, the output for those columns will be shorter, and the code will fail with a ValueError.
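One way around that (a sketch, reusing the states and A defined above, and assuming you know the number of states up front) is to pin the output length with minlength:

B = np.apply_along_axis(lambda col: np.bincount(col, minlength=states), axis=0, arr=A)

Now every column produces a length-states count vector, whether or not the maximum state occurs in it.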
This solution using the numpy_indexed package (disclaimer: I am its author) is fully vectorized, thus does not include any python loops behind the scenes. Also, there are no restrictions on the input; not every column needs to contain the same set of unique values.
import numpy_indexed as npi
rowidx, colidx = np.indices(A.shape)
(bin, col), B = npi.count_table(A.flatten(), colidx.flatten())
This gives an alternative (sparse) representation of the same result, which may be much more appropriate if the B array does indeed contain many zeros:
(bin, col), count = npi.count((A.flatten(), colidx.flatten()))
Note that apply_along_axis is just syntactic sugar for a for-loop, and has the same performance characteristics.
Yet another possibility:
import numpy as np
def bincount_columns(x, minlength=None):
    nbins = x.max() + 1
    if minlength is not None:
        nbins = max(nbins, minlength)
    ncols = x.shape[1]
    count = np.zeros((nbins, ncols), dtype=int)
    colidx = np.arange(ncols)[None, :]
    np.add.at(count, (x, colidx), 1)
    return count
For example,
In [110]: x
Out[110]:
array([[4, 2, 2, 3],
       [4, 3, 4, 4],
       [4, 3, 4, 4],
       [0, 2, 4, 0],
       [4, 1, 2, 1],
       [4, 2, 4, 3]])
In [111]: bincount_columns(x)
Out[111]:
array([[1, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 3, 2, 0],
       [0, 2, 0, 2],
       [5, 0, 4, 2]])
In [112]: bincount_columns(x, minlength=7)
Out[112]:
array([[1, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 3, 2, 0],
       [0, 2, 0, 2],
       [5, 0, 4, 2],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
I'd like to get the index of a value for every column in a matrix M. For example:
M = matrix([[0, 1, 0],
            [4, 2, 4],
            [3, 4, 1],
            [1, 3, 2],
            [2, 0, 3]])
In pseudocode, I'd like to do something like this:
for col in M:
    idx = numpy.where(M[col]==0) # Only for columns!
and have idx be 0, 4, 0 for each column.
I have tried to use where, but I don't understand the return value, which is a tuple of matrices.
The tuple of matrices is a collection of items suited for indexing. The output will have the shape of the indexing matrices (or arrays), and each item in the output will be selected from the original array using the first array as the index of the first dimension, the second as the index of the second dimension, and so on. In other words, this:
>>> numpy.where(M == 0)
(matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
>>> row, col = numpy.where(M == 0)
>>> M[row, col]
matrix([[0, 0, 0]])
>>> M[numpy.where(M == 0)] = 1000
>>> M
matrix([[1000,    1, 1000],
        [   4,    2,    4],
        [   3,    4,    1],
        [   1,    3,    2],
        [   2, 1000,    3]])
The sequence may be what's confusing you. It proceeds in flattened order -- so M[0,2] appears second, not third. If you need to reorder them, you could do this:
>>> row[0,col.argsort()]
matrix([[0, 4, 0]])
You also might be better off using arrays instead of matrices. That way you can manipulate the shape of the arrays, which is often useful! Also note ajcr's transpose-based trick, which is probably preferable to using argsort.
Finally, there is also a nonzero method that does the same thing as where in this case. Using the transpose trick now:
>>> (M == 0).T.nonzero()
(matrix([[0, 1, 2]]), matrix([[0, 4, 0]]))
As an alternative to np.where, you could perhaps use np.argwhere to return an array of indexes where the array meets the condition:
>>> np.argwhere(M == 0)
array([[[0, 0]],

       [[0, 2]],

       [[4, 1]]])
This tells you each of the indexes, in the format [row, column], where the condition was met.
If you'd prefer the format of this output array to be grouped by column rather than row, (that is, [column, row]), just use the method on the transpose of the array:
>>> np.argwhere(M.T == 0).squeeze()
array([[0, 0],
       [1, 4],
       [2, 0]])
I also used np.squeeze here to get rid of axis 1, so that we are left with a 2D array. The sequence you want is the second column, i.e. np.argwhere(M.T == 0).squeeze()[:, 1].
The result of where(M == 0) would look something like this:

(matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))

The first matrix tells you the rows where the 0s are, and the second matrix tells you the columns where the 0s are.
In [4]: M
Out[4]:
matrix([[0, 1, 0],
        [4, 2, 4],
        [3, 4, 1],
        [1, 3, 2],
        [2, 0, 3]])
In [5]: np.where(M == 0)
Out[5]: (matrix([[0, 0, 4]]), matrix([[0, 2, 1]]))
In [6]: M[0,0]
Out[6]: 0
In [7]: M[0,2] #0th row 2nd column
Out[7]: 0
In [8]: M[4,1] #4th row 1st column
Out[8]: 0
This isn't anything new on what's been already suggested, but a one-line solution is:
>>> np.where(np.array(M.T)==0)[-1]
array([0, 4, 0])
(I agree that NumPy matrix objects are more trouble than they're worth).
>>> M = np.array([[0, 1, 0],
...               [4, 2, 4],
...               [3, 4, 1],
...               [1, 3, 2],
...               [2, 0, 3]])
>>> [np.where(M[:,i]==0)[0][0] for i in range(M.shape[1])]
[0, 4, 0]
I need to (quickly) rarefy a matrix.
Rarefaction - transform abundance matrices to even sampling depth.
In this example, each row is a sample and the sampling depth is the sum of the row. I want to randomly sample (with replacement) the matrix by min(rowsums(matrix)) samples.
Suppose I have a matrix:
>>> m = [ [0, 9, 0],
...       [0, 3, 3],
...       [0, 4, 4] ]
The rarefaction function goes row by row randomly sampling with replacement min(rowsums(matrix)) times (which is 6 in this case).
>>> rf = rarefaction(m)
>>> rf
[ [0, 6, 0],   # sum = 6
  [0, 3, 3],   # sum = 6
  [0, 3, 3] ]  # sum = 6
The results are random but the row sums are always the same.
>>> rf = rarefaction(m)
>>> rf
[ [0, 6, 0],   # sum = 6
  [0, 2, 4],   # sum = 6
  [0, 4, 2] ]  # sum = 6
PyCogent has a function that does this row by row; however, it is very slow on large matrices.
I have a feeling that there is a function in Numpy that can do this but I'm not sure what it would be called.
import numpy as np
from numpy.random import RandomState
def rarefaction(M, seed=0):
    prng = RandomState(seed)          # reproducible results
    noccur = np.sum(M, axis=1)        # number of occurrences for each sample
    nvar = M.shape[1]                 # number of variables
    depth = np.min(noccur)            # sampling depth
    Mrarefied = np.empty_like(M)
    for i in range(M.shape[0]):       # for each sample
        p = M[i] / float(noccur[i])   # relative frequency / probability
        choice = prng.choice(nvar, depth, p=p)
        Mrarefied[i] = np.bincount(choice, minlength=nvar)
    return Mrarefied
Example:
>>> M = np.array([[0, 9, 0], [0, 3, 3], [0, 4, 4]])
>>> M
array([[0, 9, 0],
       [0, 3, 3],
       [0, 4, 4]])
>>> rarefaction(M)
array([[0, 6, 0],
       [0, 2, 4],
       [0, 3, 3]])
>>> rarefaction(M, seed=1)
array([[0, 6, 0],
       [0, 4, 2],
       [0, 3, 3]])
>>> rarefaction(M, seed=2)
array([[0, 6, 0],
       [0, 3, 3],
       [0, 3, 3]])
I think the question is not entirely clear. I suppose the rarefaction matrix gives you the number of samples you take from each coefficient of your original matrix?
Looking at the code in your link, there might be potential to speed it up: operate on transposed matrices and rewrite the code from your link to work on columns instead of rows, because that would allow your processor to better cache the values it samples, i.e. there would be fewer jumps in memory.
The rest I would do the same way, using numpy (which does not necessarily mean that this is the most efficient way).
If you need it faster, you can try to code the function in C++ and include it in your Python code with scipy.weave. In C++ I would go through every row and build a lookup table of positions that are >0, then generate min(rowsums(matrix)) integers within a range equal to the number of items in the lookup table. I would accumulate how often each position in the lookup table was drawn and then put those numbers back into the right positions in the array. That code should literally be just a few lines.
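For illustration only, here is a minimal Python sketch of that lookup-table idea (rarefy_row is a hypothetical helper, not the proposed C++ code; it expands each position by its count, samples with replacement, and accumulates the draws):

import numpy as np

def rarefy_row(row, depth, rng=np.random):
    # lookup table: one entry per observed item, holding its column position
    pool = np.repeat(np.arange(len(row)), row)
    # draw `depth` items with replacement, then count draws per position
    draw = rng.choice(pool, depth, replace=True)
    return np.bincount(draw, minlength=len(row))

For example, rarefy_row(np.array([0, 9, 0]), 6) always returns array([0, 6, 0]), since every item sits in position 1.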