Quickly rarefy a matrix in NumPy/Python

I need to (quickly) rarefy a matrix.
Rarefaction: transform abundance matrices to an even sampling depth.
In this example, each row is a sample and the sampling depth is the sum of the row. I want to randomly sample (with replacement) each row of the matrix, taking min(rowsums(matrix)) samples.
Suppose I have a matrix:
>>> m = [ [0, 9, 0],
...       [0, 3, 3],
...       [0, 4, 4] ]
The rarefaction function goes row by row, randomly sampling with replacement min(rowsums(matrix)) times (which is 6 in this case).
>>> rf = rarefaction(m)
>>> rf
[ [0, 6, 0],   # sum = 6
  [0, 3, 3],   # sum = 6
  [0, 3, 3] ]  # sum = 6
The results are random but the row sums are always the same.
>>> rf = rarefaction(m)
>>> rf
[ [0, 6, 0],   # sum = 6
  [0, 2, 4],   # sum = 6
  [0, 4, 2] ]  # sum = 6
PyCogent has a function that does this row by row; however, it is very slow on large matrices.
I have a feeling that there is a function in Numpy that can do this but I'm not sure what it would be called.

import numpy as np
from numpy.random import RandomState
def rarefaction(M, seed=0):
    prng = RandomState(seed)         # reproducible results
    noccur = np.sum(M, axis=1)       # number of occurrences for each sample
    nvar = M.shape[1]                # number of variables
    depth = np.min(noccur)           # sampling depth
    Mrarefied = np.empty_like(M)
    for i in range(M.shape[0]):      # for each sample
        p = M[i] / float(noccur[i])  # relative frequency / probability
        choice = prng.choice(nvar, depth, p=p)
        Mrarefied[i] = np.bincount(choice, minlength=nvar)
    return Mrarefied
Example:
>>> M = np.array([[0, 9, 0], [0, 3, 3], [0, 4, 4]])
>>> M
array([[0, 9, 0],
       [0, 3, 3],
       [0, 4, 4]])
>>> rarefaction(M)
array([[0, 6, 0],
       [0, 2, 4],
       [0, 3, 3]])
>>> rarefaction(M, seed=1)
array([[0, 6, 0],
       [0, 4, 2],
       [0, 3, 3]])
>>> rarefaction(M, seed=2)
array([[0, 6, 0],
       [0, 3, 3],
       [0, 3, 3]])
Cheers,
Davide

I think the question is not entirely clear. I suppose the rarefaction matrix gives you the number of samples you take from each coefficient of your original matrix?
Looking at the code in your link, there might be potential to speed it up. Operate on transposed matrices and rewrite the code from your link to work on columns instead of rows, because that would let your processor cache the sampled values better, i.e. there are fewer jumps in memory.
The rest I would do the same way, using numpy (which does not necessarily mean that is the most efficient way).
If you need it faster, you can try coding the function in C++ and including it in your Python with scipy.weave. In C++ I would, for every row, build a lookup table of positions that are > 0, generate min(rowsums(matrix)) random integers in a range equal to the number of items in the lookup table, accumulate how often each position in the lookup table was drawn, and then put those counts back into the right positions in the array. That code should literally be just a few lines.
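For reference, here is a minimal numpy sketch of that lookup-table idea, assuming one table entry per observed individual so that draws are weighted by abundance (this is an illustration added here, not the scipy.weave/C++ code the answer describes):
import numpy as np

def rarefy_row_lookup(row, depth, rng=None):
    """Rarefy a single row via a lookup table of column positions (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # One table entry per observed individual, holding its column index,
    # so columns with higher counts are drawn proportionally more often.
    lookup = np.repeat(np.arange(row.size), row)
    # Draw `depth` entries from the table, with replacement.
    draws = rng.integers(0, lookup.size, size=depth)
    # Accumulate how often each column was drawn.
    return np.bincount(lookup[draws], minlength=row.size)

m = np.array([[0, 9, 0], [0, 3, 3], [0, 4, 4]])
depth = m.sum(axis=1).min()  # 6 for this example
rarefied = np.array([rarefy_row_lookup(row, depth) for row in m])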

Related

Use numpy to mask a row containing only zeros

I have a large array of point cloud data which is generated using the Azure Kinect. All erroneous measurements are assigned the coordinate [0,0,0]. I want to remove all coordinates with the value [0,0,0]. Since my array is rather large (1 million points) and since I need to do this process in real time, speed is of the essence.
In my current approach I try to use numpy to mask out all rows that contain three zeroes ([0,0,0]). However, the np.ma.masked_equal function does not evaluate an entire row, but only evaluates single elements. As a result, rows that contain at least one 0 are already filtered by this approach. I only want rows to be filtered when all values in the row are 0. Find an example of my code below:
my_data = np.array([[1,2,3],[0,0,0],[3,4,5],[2,5,7],[0,0,1]])
my_data = np.ma.masked_equal(my_data, [0,0,0])
my_data = np.ma.compress_rows(my_data)
Output:
array([[1, 2, 3],
       [3, 4, 5],
       [2, 5, 7]])
Desired output:
array([[1, 2, 3],
       [3, 4, 5],
       [2, 5, 7],
       [0, 0, 1]])
Find all data points that are 0 (this doesn't require the np.ma module) and then select all rows that are not entirely zero:
import numpy as np
my_data = np.array([[1, 2, 3], [0, 0, 0], [3, 4, 5], [2, 5, 7], [0, 0, 1]])
my_data[~(my_data == 0).all(axis=1)]
Output:
array([[1, 2, 3],
       [3, 4, 5],
       [2, 5, 7],
       [0, 0, 1]])
Instead of using the np.ma.masked_equal and np.ma.compress_rows functions, you can use the np.all function to check if all values in a row are equal to [0, 0, 0]. This should be faster than your method as it evaluates all values in a row at once.
mask = np.all(my_data == [0, 0, 0], axis=1)
my_data = my_data[~mask]
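As a quick sanity check (added here for illustration), running that mask on the my_data array from the question reproduces the desired output:
import numpy as np

my_data = np.array([[1, 2, 3], [0, 0, 0], [3, 4, 5], [2, 5, 7], [0, 0, 1]])
mask = np.all(my_data == [0, 0, 0], axis=1)  # True only where the whole row is zero
print(my_data[~mask])
# [[1 2 3]
#  [3 4 5]
#  [2 5 7]
#  [0 0 1]]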

Numpy (python) - create a matrix with rows having subsequent values multiplied by the row's number

I want to create an n × n matrix whose rows consist of successive values (0, 1, 2, ...) multiplied by the row number. For example, for n = 4:
[[0, 1, 2, 3], [0, 2, 4, 6], [0, 3, 6, 9], [0, 4, 8, 12]]
For creating such a matrix, I know the following code can be used:
n, n = 3, 3
K = np.empty(shape=(n, n), dtype=int)
i,j = np.ogrid[:n, :n]
L = i+j
print(L)
I don't know how I can make rows having subsequent values multiplied by the row's number.
You can use the outer product of two vectors to create an array like that. Use np.outer(). For example, for n = 4:
import numpy as np
n = 4
row = np.arange(n)
np.outer(row + 1, row)
This produces:
array([[ 0,  1,  2,  3],
       [ 0,  2,  4,  6],
       [ 0,  3,  6,  9],
       [ 0,  4,  8, 12]])
Take a look at row and try different orders of multiplication etc. to see what's going on here. As others pointed out in the comments, you should also review your code: you're creating n twice and not using K (and in general I'd avoid np.empty() as a beginner because it can lead to unexpected behaviour).
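If it helps to see the same result without np.outer(), here is an equivalent broadcasting formulation (added for illustration; it is not part of the original answer):
import numpy as np

n = 4
row = np.arange(n)                  # [0, 1, 2, 3]
result = (row + 1)[:, None] * row   # column of row numbers 1..n times the base row
print(result)
# [[ 0  1  2  3]
#  [ 0  2  4  6]
#  [ 0  3  6  9]
#  [ 0  4  8 12]]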

Replace numpy subarray when element matches a condition

I have an n x m x 3 numpy array. This represents a middle-step towards an RGB representation of a complex-function plotter. When the function being plotted takes infinite values or has singularities, parts of the RGB data become NaNs.
I'm looking for an efficient way to replace a row containing a NaN with a row of my choice, perhaps [0, 0, 0] or [1, 1, 1]. In terms of the RGB values, this has the effect of replacing poorly-behaving pixels with white or black pixels. By efficient, I mean some way that takes advantage of numpy's vectorization and speed.
Please note that I am not looking to merely replace the NaN values with 0 (which I know how to do with numpy.where); if a row contains a NaN, I want to replace the whole row. I suspect this can be done nicely in numpy, but I'm not sure how.
Concrete Question
Suppose we are given a 2 x 2 x 3 array arr. If a row contains a 5, I want to replace the row with [0, 0, 0]. Trivial code that does this slowly is as follows.
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 3, 5], [2, 4, 6]]])
# so arr is
# array([[[1, 2, 3],
#         [4, 5, 6]],
#
#        [[1, 3, 5],
#         [2, 4, 6]]])

# Trivial and slow version to replace rows containing 5 with [0,0,0]
for i in range(len(arr)):
    for j in range(len(arr[i])):
        if 5 in arr[i][j]:
            arr[i][j] = np.array([0, 0, 0])

# Now arr is
#
# array([[[1, 2, 3],
#         [0, 0, 0]],
#
#        [[0, 0, 0],
#         [2, 4, 6]]])
How can we accomplish this taking advantage of numpy?
A simpler way would be -
arr[np.isin(arr,5).any(-1)] = 0
If it's just a single value that you are looking for, then we could simplify to -
arr[(arr==5).any(-1)] = 0
If you are looking to match against NaN, we need to do the comparison differently and use np.isnan instead -
arr[np.isnan(arr).any(-1)] = 0
If you are looking to assign array values, instead of just 0, the solutions stay the same. Hence it would be -
arr[(arr==5).any(-1)] = new_array
Using np.broadcast_to
arr[np.broadcast_to((arr == 5).any(-1)[..., None], arr.shape)] = 0
array([[[1, 2, 3],
        [0, 0, 0]],

       [[0, 0, 0],
        [2, 4, 6]]])
Just as an FYI: based on your description, if you want to find NaNs instead of integers like 5, you shouldn't use == but rather np.isnan:
arr[np.broadcast_to((np.isnan(arr)).any(-1)[..., None], arr.shape)] = 0
You can do it using the np.in1d function, as below:
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 3, 5], [2, 4, 6]]])
arr[np.in1d(arr,5).reshape(arr.shape).any(axis=2)] = [0,0,0]
arr

compare large sets of arrays

I have a numpy array A of n 1x3 arrays where n is the total number of possible combinations of elements in the 1x3 arrays, where each element ranges from 0 to 50. That is,
A = [[0,0,0],[0,0,1]...[0,1,0]...[50,50,50]]
and
len(A) = 50*50*50 = 125000
I have a numpy array B of m 1x3 arrays where m = 10 million, and the arrays can have values belonging to the set described by A.
I want to count up how many of each combination is present in B, that is, how many times [0,0,0] appears in B, how many times [0,0,1] appears...how many times [50,50,50] appears. So far I have the following:
for i in range(len(A)):
    for j in range(len(B)):
        if np.array_equal(A[i], B[j]):
            y[i] += 1
where y keeps track of how many times the ith array occurs. So, y[0] is how many times [0,0,0] appeared in B, y[1] is how many times [0,0,1] appeared... y[124999] is how many times [50,50,50] appeared, etc.
The problem is this takes forever. It has to check 10 million entries, 125000 times. Is there a quicker and more efficient way to do this?
Here is a fast approach. It processes 10 million tuples out of range(50)^3 in a fraction of a second and is roughly 100 times faster than the next best solution (@Primusa's).
It uses the fact that there is a straightforward translation between such tuples and the numbers 0 to 50^3 - 1. (The mapping happens to be the same as the one between the rows of your A and the row numbers.) The functions np.ravel_multi_index and np.unravel_index implement this translation and its inverse.
Once B is translated into numbers, their frequencies can be determined very efficiently using np.bincount. Below I reshape the result to get a 50x50x50 histogram but that is just a matter of taste and can be left out. (I have taken the liberty to only use numbers 0 through 49, so len(A) becomes 125000):
>>> B = np.random.randint(0, 50, (10000000, 3))
>>> Br = np.ravel_multi_index(B.T, (50, 50, 50))
>>> result = np.bincount(Br, minlength=125000).reshape(50, 50, 50)
Let's look at a smaller example for demonstration:
>>> B = np.random.randint(0, 3, (10, 3))
>>> Br = np.ravel_multi_index(B.T, (3, 3, 3))
>>> result = np.bincount(Br, minlength=27).reshape(3, 3, 3)
>>>
>>> B
array([[1, 1, 2],
       [2, 1, 2],
       [2, 0, 0],
       [2, 1, 0],
       [2, 0, 2],
       [0, 0, 2],
       [0, 0, 2],
       [0, 2, 2],
       [2, 0, 0],
       [0, 2, 0]])
>>> result
array([[[0, 0, 2],
        [0, 0, 0],
        [1, 0, 1]],

       [[0, 0, 0],
        [0, 0, 1],
        [0, 0, 0]],

       [[2, 0, 1],
        [1, 0, 1],
        [0, 0, 0]]])
To query for example how many times [2,1,0] is in B one would do
>>> result[2,1,0]
1
As mentioned above: to convert between indices into your A and the actual rows of A (which are the indices into my result), np.ravel_multi_index and np.unravel_index can be used. Or you can leave out the last reshape (i.e. use result = np.bincount(Br, minlength=125000)); then the counts are indexed exactly the same as A.
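As a small illustration of that translation (added here; not part of the original answer):
import numpy as np

shape = (50, 50, 50)
# The row [0, 1, 2] of A corresponds to flat index 0*2500 + 1*50 + 2 = 52 ...
flat = np.ravel_multi_index((0, 1, 2), shape)  # -> 52
# ... and the inverse recovers the tuple.
np.unravel_index(52, shape)                    # -> (0, 1, 2)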
You can use a dict() to speed this up so that you only go through the 10 million entries once.
The first thing you want to do is change all the sublists in A to hashable objects so you can use them as keys in a dict.
Converting all the sublists to tuples:
A = [tuple(i) for i in A]
Then create a dict() with every value in A as the key and the value being 0.
d = {i:0 for i in A}
Now for each subarray in your numpy array, you just want to convert it to a tuple and increment d[that array] by 1
for subarray in B:
    d[tuple(subarray)] += 1
d is now a dictionary where, for each key, the value is how many times that key occurred in B.
You can find the unique rows and their counts from array B by calling np.unique over its first axis with return_counts=True. Then you can use broadcasting to find the indices of B's unique rows in A by calling the ndarray.all and ndarray.any methods on the appropriate axes. Then all you need is simple indexing:
In [82]: unique, counts = np.unique(B, axis=0, return_counts=True)
In [83]: indices = np.where((unique == A[:,None,:]).all(axis=2).any(axis=0))[0]
# Get items from A that exist in B
In [84]: unique[indices]
# Get the counts
In [85]: counts[indices]
Example:
In [86]: arr = np.array([[2 ,3, 4], [5, 6, 0], [2, 3, 4], [1, 0, 4], [3, 3, 3], [5, 6, 0], [2, 3, 4]])
In [87]: a = np.array([[2, 3, 4], [1, 9, 5], [3, 3, 3]])
In [88]: unique, counts = np.unique(arr, axis=0, return_counts=True)
In [89]: indices = np.where((unique == a[:,None,:]).all(axis=2).any(axis=0))[0]
In [90]: unique[indices]
Out[90]:
array([[2, 3, 4],
       [3, 3, 3]])
In [91]: counts[indices]
Out[91]: array([3, 1])
You can do this
y=[np.where(np.all(B==arr,axis=1))[0].shape[0] for arr in A]
arr iterates over A, np.all checks where it matches B, np.where returns the positions of those matches as an array, and shape then gives the length of that array, i.e. the desired frequency.

vectorizing numpy bincount

I have a 2D numpy array, A. I want to apply np.bincount() to each column of the matrix A to generate another 2D array B that is composed of the bincounts of each column of the original matrix A.
My problem is that np.bincount() is a function that takes a 1d array-like. It's not an array method like B = A.max(axis=1) for example.
Is there a more pythonic/numpythic way to generate this B array other than a nasty for-loop?
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
for x in range(A.shape[1]):
    B[:,x] = np.bincount(A[:,x])
Using the same philosophy as in this post, here's a vectorized approach -
m = A.shape[1]
n = A.max()+1
A1 = A + (n*np.arange(m))
out = np.bincount(A1.ravel(),minlength=n*m).reshape(m,-1).T
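A quick sanity check (added for illustration) that this offset trick matches the per-column loop, using the setup from the question:
import numpy as np

np.random.seed(0)
states, rows, cols = 4, 8, 4
A = np.random.randint(0, states, (rows, cols))

# Vectorized: shift each column into its own bin range, then one bincount.
m = A.shape[1]
n = A.max() + 1
out = np.bincount((A + n * np.arange(m)).ravel(), minlength=n * m).reshape(m, -1).T

# Loop version from the question (with minlength so shapes always agree).
B = np.zeros((n, m), dtype=int)
for x in range(m):
    B[:, x] = np.bincount(A[:, x], minlength=n)

assert (out == B).all()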
I would suggest using np.apply_along_axis, which allows you to apply a 1D method (in this case np.bincount) to 1D slices of a higher-dimensional array:
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
B = np.apply_along_axis(np.bincount, axis=0, arr=A)
You'll have to be careful, though. This (as well as your suggested for-loop) only works if the output of np.bincount has the right shape. If the maximum state is not present in one or more columns of your array A, the output for those columns will be shorter, and the code will fail with a ValueError.
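One way to guard against that (an addition here, not part of the original answer) is to pin the output length with bincount's minlength argument:
import numpy as np

states = 4
rows = 8
cols = 4
A = np.random.randint(0, states, (rows, cols))

# Fix the output length so every column yields `states` bins,
# even if the maximum state is missing from that column.
B = np.apply_along_axis(lambda col: np.bincount(col, minlength=states), axis=0, arr=A)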
This solution, using the numpy_indexed package (disclaimer: I am its author), is fully vectorized, so it does not include any Python loops behind the scenes. Also, there are no restrictions on the input; not every column needs to contain the same set of unique values.
import numpy_indexed as npi
rowidx, colidx = np.indices(A.shape)
(bin, col), B = npi.count_table(A.flatten(), colidx.flatten())
This gives an alternative (sparse) representation of the same result, which may be much more appropriate if the B array does indeed contain many zeros:
(bin, col), count = npi.count((A.flatten(), colidx.flatten()))
Note that apply_along_axis is just syntactic sugar for a for-loop, and has the same performance characteristics.
Yet another possibility:
import numpy as np
def bincount_columns(x, minlength=None):
    nbins = x.max() + 1
    if minlength is not None:
        nbins = max(nbins, minlength)
    ncols = x.shape[1]
    count = np.zeros((nbins, ncols), dtype=int)
    colidx = np.arange(ncols)[None, :]
    np.add.at(count, (x, colidx), 1)
    return count
For example,
In [110]: x
Out[110]:
array([[4, 2, 2, 3],
       [4, 3, 4, 4],
       [4, 3, 4, 4],
       [0, 2, 4, 0],
       [4, 1, 2, 1],
       [4, 2, 4, 3]])
In [111]: bincount_columns(x)
Out[111]:
array([[1, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 3, 2, 0],
       [0, 2, 0, 2],
       [5, 0, 4, 2]])
In [112]: bincount_columns(x, minlength=7)
Out[112]:
array([[1, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 3, 2, 0],
       [0, 2, 0, 2],
       [5, 0, 4, 2],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
