apply function to unique values of NumPy array - python

I have a function f which I would like to apply to all elements of an arbitrarily-shaped and -ordered NumPy array x. Because the function evaluation is expensive and x may contain duplicate values, I first reduce x to unique values, a one-dimensional array xu.
xu, ind = np.unique(x, return_inverse=True)
I then create an array for the function values
yu = np.full(len(xu), np.nan)
and fill in this array by applying f elementwise.
I would now like to create an array y of the same shape as x, so that corresponding entries contain the result of the function. My attempt:
y = np.full(x.shape, np.nan)
y[ind] = yu
This fails if x isn't already one-dimensional. (You may guess that I'm used to Matlab, where linear indexing of a multidimensional array works.) What I would need for this is a one-dimensional view on y which I can apply [ind] = to, to assign to the correct elements.
Question 1: Is there such a one-dimensional view on a multidimensional array?
Alternatively, I could create y as one-dimensional, assign values, and then reshape.
y = np.full(x.size, np.nan)
y[ind] = yu
y = np.reshape(y, x.shape)
This seems to work, but I'm unsure whether I have to account for the storage order of x.
Question 2: Does ind returned by np.unique always follow 'C' order, which is default for np.reshape, or does it depend on the internal structure of x?

np.unique operates on a raveled array, so the indices it returns refer to the flattened input. This is documented under the first parameter:
Unless axis is specified, this will be flattened if it is not already 1-D.
Ravelling/flattening happens in C order by default, regardless of the memory layout. Flattening is just raveling that guarantees a copy. Raveling, by contrast, only creates a copy when it has to, for example when your array is not in C order:
>>> x = np.zeros((3, 3), order='F')
>>> x.ravel().base is x
False
>>> y = np.zeros((3, 3))
>>> y.ravel().base is y
True
x.ravel() is equivalent to x.reshape(-1). That means that you can unravel the result with something like flat_y.reshape(original_x_shape):
xu, ind = np.unique(x, return_inverse=True)
yu = np.zeros_like(xu)
for i in range(len(xu)):
    yu[i] = fn(xu[i])
y_flat = yu[ind]
y = y_flat.reshape(x.shape)
Since you are reshaping a contiguous buffer, y and y_flat share the same memory:
>>> y.base is y_flat
True
Fancy indexing, as in the expression y_flat = yu[ind], will always make a copy, since you can't tell if the data is contiguous or not in the general case.
Part of the reason that linear indexing always works in MATLAB is that it guarantees contiguous arrays, always stored in column-major order. NumPy stores a stride (a step size in bytes) for each dimension, so it supports non-contiguous arrays. That allows NumPy to do things like transpose an array, or get simple slices from it, without making a copy of the underlying data.
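To make the stride point concrete, here is a small sketch (the array contents and names are made up for illustration). It shows that a transpose only swaps strides and shares memory with the original, while raveling the transposed array has to copy because its elements are no longer laid out in C order:

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)  # C-contiguous, 8-byte items
t = a.T                                         # transpose: no data is copied

# The transpose is a view: same buffer, strides swapped.
transpose_shares = np.shares_memory(a, t)

# Raveling a C-contiguous array can return a view ...
ravel_of_a_shares = np.shares_memory(a, a.ravel())

# ... but raveling the transposed array in C order needs a copy.
flat_t = t.ravel()
ravel_of_t_shares = np.shares_memory(a, flat_t)

print(a.strides, t.strides)
print(transpose_shares, ravel_of_a_shares, ravel_of_t_shares)
```
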
On a side note, if you want to avoid explicitly calling reshape on y, you can call it on ind instead:
xu, ind = np.unique(x, return_inverse=True)
yu = np.zeros_like(xu)
for i in range(len(xu)):
    yu[i] = fn(xu[i])
y = yu[ind.reshape(x.shape)]
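Putting the pieces together, a minimal self-contained sketch might look like the following. The helper name apply_to_unique and the toy function slow_square are made up for illustration. Reshaping ind to x.shape also covers NumPy versions (2.0 and later, as far as I know) where ind already comes back with the shape of x, since the reshape is then a no-op:

```python
import numpy as np

def apply_to_unique(f, x):
    """Evaluate f once per unique value of x and broadcast back to x.shape."""
    x = np.asarray(x)
    xu, ind = np.unique(x, return_inverse=True)
    # Build the result from a list so the dtype follows f's output, not x's.
    yu = np.array([f(v) for v in xu])
    # ind indexes the C-order flattening of x, whatever x's memory layout is.
    return yu[ind.reshape(x.shape)]

calls = []  # records how often the "expensive" function is evaluated

def slow_square(v):
    calls.append(v)
    return float(v) ** 2

x = np.asfortranarray([[1, 2, 2], [3, 1, 3]])  # Fortran-ordered on purpose
y = apply_to_unique(slow_square, x)
print(y)
print(len(calls))
```

The Fortran-ordered input is there to check that the memory layout of x does not matter for the result.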

Related

Setting numpy array to slice without any in-place operations

How can I do this operation efficiently without any inplace operations?
n_id = np.random.choice(np.arange(2708), size=100)
z = np.random.rand(100, 64)
z_sparse = np.zeros((2708,64))
z_sparse[n_id[:100]] = z
Essentially I want the n_id rows of z_sparse to contain z's rows, but I can't do any inplace operations because my end goal is to use this in a pytorch problem.
One thought would be to create zero rows within z precisely so that the rows of z end up in the positions n_id, but I'm not sure how this would work efficiently.
Essentially row 1 of z should be placed at row n_id[0] of z_sparse, then row 2 of z should be at row n_id[1] of z_sparse, and so on...
Here's the PyTorch error jic you are curious:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
If n_id is a fixed index array, you can get z_sparse as a matrix multiplication:
N, n, m = 2708, 100, 64
row_mat = (n_id[:n] == np.arange(N)[:,None])
# for pytorch tensor
# row_mat = Tensor(n_id[:n] == np.arange(N)[:,None])
z_sparse = row_mat @ z
Since row_mat is a constant array (tensor), your graph should work just fine.
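As a quick sanity check of the idea in plain NumPy, the sketch below compares the matrix product with the in-place assignment from the question. Note one assumption: the indices are drawn without replacement here. If n_id contains duplicates, the product sums the corresponding rows of z, whereas the in-place assignment keeps only the last one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 2708, 100, 64

# Assumption for this sketch: unique target rows (replace=False).
n_id = rng.choice(N, size=n, replace=False)
z = rng.random((n, m))

# row_mat[i, j] is 1 where row j of z belongs in row i of the result.
row_mat = (n_id == np.arange(N)[:, None]).astype(z.dtype)  # shape (N, n)
z_sparse = row_mat @ z                                      # shape (N, m)

# Reference result using the in-place assignment from the question.
expected = np.zeros((N, m))
expected[n_id] = z

print(np.allclose(z_sparse, expected))
```
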

Scipy: Sparse indicator matrix from array(s)

What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:
I = a[:,None]==b
The following is slow and still memory-inefficient during creation:
I = csr((a[:,None]==b),shape=(len(a),len(b)))
The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:
z = np.argwhere((a[:,None]==b))
Any ideas?
One way to do it would be to first identify all different elements that a and b have in common using sets. This should work well if there are not very many different possibilities for the values in a and b. One then would only have to loop over the different values (below in variable values) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))
##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []
##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    # pair every index in a with every index in b that holds this value
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))
##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)
##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )
##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)
The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
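To see how np.repeat and np.tile combine into index pairs, here is a tiny worked example with made-up positions. To pair every position in x with every position in y, x is repeated len(y) times and y is tiled len(x) times, so both index vectors end up with len(x) * len(y) entries:

```python
import numpy as np

x = np.array([0, 4])      # hypothetical positions in a holding some value
y = np.array([1, 2, 5])   # hypothetical positions in b holding the same value

rows = np.repeat(x, len(y))  # each row index once per matching column
cols = np.tile(y, len(x))    # the column indices, once per matching row

pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)
```
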
Old Answer:
I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here is some code that uses two 1D arrays of random integers to first generate the sparse matrix the way that the OP posted and then uses a generator expression to test all elements for equality:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))
##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0) ## --> True
I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...
You could use numpy.isclose with small tolerance:
np.isclose(a,b)
Or pandas.DataFrame.eq:
a.eq(b)
Note that this returns an element-wise array of True/False values.

Translating a Linear Regression from Matlab to Python

I tried to translate a piece of code from Matlab to Python and I'm running into some errors:
Matlab:
function [beta] = linear_regression_train(traindata)
y = traindata(:,1); %output
ind2 = find(y == 2);
ind3 = find(y == 3);
y(ind2) = -1;
y(ind3) = 1;
X = traindata(:,2:257); %X matrix,with size of 1389x256
beta = inv(X'*X)*X'*y;
Python:
def linear_regression_train(traindata):
    y = traindata[:,0] # This is the output
    ind2 = (labels==2).nonzero()
    ind3 = (labels==3).nonzero()
    y[ind2] = -1
    y[ind3] = 1
    X = traindata[ : , 1:256]
    X_T = numpy.transpose(X)
    beta = inv(X_T*X)*X_T*y
    return beta
I am receiving an error: operands could not be broadcast together with shapes (257,0,1389) (1389,0,257) on the line where beta is calculated.
Any help is appreciated!
Thanks!
The problem is that you are working with numpy arrays, not matrices as in MATLAB. Matrices, by default, do matrix mathematical operations. So X*Y does a matrix multiplication of X and Y. With arrays, however, the default is to use element-by-element operations. So X*Y multiplies each corresponding element of X and Y. This is the equivalent of MATLAB's .* operation.
But just like how MATLAB's matrices can do element-by-element operations, Numpy's arrays can do matrix multiplication. So what you need to do is use numpy's matrix multiplication instead of its element-by-element multiplication. For Python 3.5 or higher (which is the version you should be using for this sort of work), that is just the @ operator. So your line becomes:
beta = inv(X_T @ X) @ X_T @ y
Or, better yet, you can use the simpler .T transpose, which is the same as np.transpose but much more concise (you can get rid of the np.transpose line entirely):
beta = inv(X.T @ X) @ X.T @ y
For Python 3.4 or earlier, you will need to use np.dot since those versions of python don't have the @ matrix multiplication operator:
beta = np.dot(np.dot(inv(np.dot(X.T, X)), X.T), y)
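A small sketch on random data (shapes and names made up for illustration) shows the difference between the two kinds of multiplication and checks that the @ and np.dot spellings agree. As a cross-check it also calls np.linalg.lstsq, which is a different technique from the explicit inverse but should give the same coefficients for a well-conditioned X:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
X = rng.random((20, 3))   # 20 samples, 3 features (made-up sizes)
y = rng.random(20)

elementwise = X * X       # element-by-element, like MATLAB's .*
gram = X.T @ X            # true matrix product, shape (3, 3)

beta_at = inv(X.T @ X) @ X.T @ y
beta_dot = np.dot(np.dot(inv(np.dot(X.T, X)), X.T), y)

# Cross-check with a least-squares solver instead of an explicit inverse.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(elementwise.shape, gram.shape)
print(np.allclose(beta_at, beta_dot), np.allclose(beta_at, beta_lstsq))
```
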
Numpy has a matrix object that uses matrix operations by default like the MATLAB matrix. Do not use it! It is slow, poorly-supported, and almost never what you really want. The Python community has standardized around arrays, so use those.
There may also be some issues with the dimensions of traindata. For this to work properly then traindata.ndim should be equal to 3. In order for y and X to be 2D, traindata should be 3D.
This could be an issue if traindata is 2D and you want y to be MATLAB-style "vector" (what MATLAB calls "vectors" aren't really vectors). In numpy, using a single index like traindata[:, 0] reduces the number of dimensions, while taking a slice like traindata[:, :1] doesn't. So to keep y 2D when traindata is 2D, just do a length-1 slice, traindata[:, :1]. This is exactly the same values, but this keeps the same number of dimensions as traindata.
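The difference between a single index and a length-1 slice is easy to check on a small made-up array:

```python
import numpy as np

traindata = np.arange(12.0).reshape(4, 3)  # toy 2D data

col_1d = traindata[:, 0]    # single index: drops a dimension
col_2d = traindata[:, :1]   # length-1 slice: keeps both dimensions

print(col_1d.shape)  # (4,)
print(col_2d.shape)  # (4, 1)
```
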
Notes: Your code can be significantly simplified using logical indexing:
from numpy.linalg import inv
def linear_regression_train(traindata):
    y = traindata[:, 0] # This is the output
    y[y == 2] = -1
    y[y == 3] = 1
    X = traindata[:, 1:257]
    return inv(X.T @ X) @ X.T @ y
Also, your slice is wrong when defining X. Python slicing excludes the last value, so to get a 256 long slice you need to do 1:257, as I did above.
Finally, please keep in mind that modifications to arrays inside functions carry over outside the functions, and indexing does not make a copy. So your changes to y (setting some values to 1 and others to -1), will affect traindata outside of your function. If you want to avoid that, you need to make a copy before you make your changes:
y = traindata[:, 0].copy()
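Here is a short illustration of that side effect on toy data (the values are made up). Without the copy the relabelling leaks into the original array; with the copy it does not:

```python
import numpy as np

# Without a copy: y_view is a view, so the relabelling changes traindata.
traindata = np.array([[2.0, 10.0], [3.0, 20.0]])
y_view = traindata[:, 0]
y_view[y_view == 2] = -1
changed_original = traindata[0, 0]

# With a copy: the original array keeps its labels.
fresh = np.array([[2.0, 10.0], [3.0, 20.0]])
y_copy = fresh[:, 0].copy()
y_copy[y_copy == 2] = -1
untouched = fresh[0, 0]

print(changed_original, untouched)
```
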

Multiple Element Indexing in multi-dimensional array

I have a 3d Numpy array and would like to take the mean over one axis considering certain elements from the other two dimensions.
This is an example code depicting my problem:
import numpy as np
myarray = np.random.random((5,10,30))
yy = [1,2,3,4]
xx = [20,21,22,23,24,25,26,27,28,29]
mymean = [ np.mean(myarray[t,yy,xx]) for t in np.arange(5) ]
However, this results in:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Why does an indexing like e.g. myarray[:,[1,2,3,4],[1,2,3,4]] work, but not my code above?
This is how you fancy-index over more than one dimension:
>>> np.mean(myarray[np.arange(5)[:, None, None], np.array(yy)[:, None], xx],
...         axis=(-1, -2))
array([ 0.49482768, 0.53013301, 0.4485054 , 0.49516017, 0.47034123])
When you use fancy indexing, i.e. a list or array as an index, over more than one dimension, numpy broadcasts those arrays to a common shape, and uses them to index the array. You need to add those extra dimensions of length 1 at the end of the first indexing arrays, for the broadcast to work properly. Here are the rules of the game.
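Spelled out with intermediate names (t_idx, y_idx and block are just labels for this sketch), the broadcast looks like this. The three index arrays have shapes (5, 1, 1), (4, 1) and (10,), which broadcast to (5, 4, 10). The result is checked against a plain loop that uses np.ix_ for each t:

```python
import numpy as np

rng = np.random.default_rng(0)
myarray = rng.random((5, 10, 30))
yy = [1, 2, 3, 4]
xx = list(range(20, 30))

t_idx = np.arange(5)[:, None, None]  # shape (5, 1, 1)
y_idx = np.array(yy)[:, None]        # shape (4, 1)
block = myarray[t_idx, y_idx, xx]    # index arrays broadcast to (5, 4, 10)
mymean = block.mean(axis=(-1, -2))   # one mean per t

# Reference: loop over t and select the yy-by-xx sub-block with np.ix_.
loop_mean = np.array([myarray[t][np.ix_(yy, xx)].mean() for t in range(5)])

print(block.shape)
print(np.allclose(mymean, loop_mean))
```
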
Since you use consecutive elements you can use a slice:
import numpy as np
myarray = np.random.random((5,10,30))
yy = slice(1,5)
xx = slice(20, 30)
mymean = [np.mean(myarray[t, yy, xx]) for t in np.arange(5)]
To answer your question about why it doesn't work: when you use lists/arrays as indices, Numpy uses a different set of indexing semantics than it does if you use slices. You can see the full story in the documentation and, as that page says, it "can be somewhat mind-boggling".
If you want to do it for nonconsecutive elements, you must grok that complex indexing mechanism.
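A small sketch of the two behaviours, on a made-up array: equal-length index lists are paired element by element (so you get a "diagonal" selection, not a block), slices select the full block, and lists of different lengths cannot be paired at all. The exact exception type for the last case has differed between NumPy versions, so the sketch catches both candidates:

```python
import numpy as np

a = np.arange(5 * 10 * 30).reshape(5, 10, 30)

# Equal-length lists are paired: (1, 1), (2, 2), (3, 3), (4, 4).
paired = a[:, [1, 2, 3, 4], [1, 2, 3, 4]]   # shape (5, 4)

# Slices select the whole rectangular block.
sliced = a[:, 1:5, 20:30]                   # shape (5, 4, 10)

# Lists of length 4 and 10 cannot be broadcast against each other.
try:
    a[0, [1, 2, 3, 4], list(range(20, 30))]
    mismatch_failed = False
except (IndexError, ValueError):
    mismatch_failed = True

print(paired.shape, sliced.shape, mismatch_failed)
```
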

How to assign a 1D numpy array to 2D numpy array?

Consider the following simple example:
X = numpy.zeros([10, 4]) # 2D array
x = numpy.arange(0,10) # 1D array
X[:,0] = x # WORKS
X[:,0:1] = x # returns ERROR:
# ValueError: could not broadcast input array from shape (10) into shape (10,1)
X[:,0:1] = (x.reshape(-1, 1)) # WORKS
Can someone explain why numpy has vectors of shape (N,) rather than (N,1) ?
What is the best way to do the casting from 1D array into 2D array?
Why do I need this?
Because I have a code which inserts result x into a 2D array X and the size of x changes from time to time so I have X[:, idx1:idx2] = x which works if x is 2D too but not if x is 1D.
Do you really need to be able to handle both 1D and 2D inputs with the same function? If you know the input is going to be 1D, use
X[:, i] = x
If you know the input is going to be 2D, use
X[:, start:end] = x
If you don't know the input dimensions, I recommend switching between one line or the other with an if, though there might be some indexing trick I'm not aware of that would handle both identically.
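One possible way to avoid the if, offered only as a sketch: reshape x to have as many rows as X and let NumPy infer the column count. The helper name assign_columns is made up, and it assumes x always has X.shape[0] rows' worth of data:

```python
import numpy as np

def assign_columns(X, start, x):
    """Write x into X starting at column `start` (hypothetical helper).

    A 1D x of shape (N,) becomes (N, 1); a 2D x of shape (N, k) is unchanged.
    """
    x2d = np.asarray(x).reshape(X.shape[0], -1)
    X[:, start:start + x2d.shape[1]] = x2d
    return X

X = np.zeros((10, 4))
assign_columns(X, 0, np.arange(10))      # 1D input fills one column
assign_columns(X, 1, np.ones((10, 2)))   # 2D input fills two columns
print(X[:3])
```
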
Your x has shape (N,) rather than shape (N, 1) (or (1, N)) because numpy isn't built for just matrix math. ndarrays are n-dimensional; they support efficient, consistent vectorized operations for any non-negative number of dimensions (including 0). While this may occasionally make matrix operations a bit less concise (especially in the case of dot for matrix multiplication), it produces more generally applicable code for when your data is naturally 1-dimensional or 3-, 4-, or n-dimensional.
I think you have the answer already included in your question. Numpy allows arrays to be of any dimensionality (while afaik Matlab prefers two dimensions where possible), so you need to be careful about this (and always distinguish between (n,) and (n,1)). By giving one number as one of the indices (like 0 in the 3rd row), you reduce the dimensionality by one. By giving a range as one of the indices (like 0:1 in the 4th row), you don't reduce the dimensionality.
Line 3 makes perfect sense for me and I would assign to the 2-D array this way.
Here are two tricks that make the code a little shorter.
X = numpy.zeros([10, 4]) # 2D array
x = numpy.arange(0,10) # 1D array
X.T[:1, :] = x
X[:, 2:3] = x[:, None]
