Subtract the same scalar from all elements in a sparse.csr_matrix - python

Say I have a csr-matrix,cossim, representing the cosine-similarity of some data. Since I need a distance metric I can create that simply by 1-cossim or -(cossim-1)i.e substract the value of each element from 1, elementwise.
The issue is that it is not implemented thus I thought of doing it by creating a matrix with 1 all over, ones, and then calculate distance = ones-cossim (but I cannot figure out on how to create a "ones matrix" with the same shape, since csr_matrix(np.ones(cossim.shape)) needs 600GB of memory)
Can't we do some elementwise substraction with a sparse matrix?
If not what is the best way to do so?

Related

Is it possible to calculate running statistics having neighbours into account in Numpy?

I have a rank-3 matrix A of shape (Nx, Ny, Nt), and I need calculate statistics in the third (temporal) dimension for all grid points (the first two dimensions correspond to horizontal space). For this, you can just make something like np.percentile(A, 0.5, axis=2). So far so good.
Now, my actual problem is more convoluted. I'd like to have neighbours into account to make some sort of calculation over a "running window" of a given size, or even an arbitrary shape. This sounds like a convolution, but I want to calculate something else than just the average. I need percentiles and things like such.
How can I achieve this result efficiently without explicit looping in the spatial dimensions?
I was thinking that a plausible solution would be to "enlarge" the original matrix by including an additional dimension that copies the data from the neighbours in the additional dimension. Something like B of shape (Nx, Ny, Nt, Nn), where Nn is the number of neighbours (let's assume we take the 8 closest neighbours for the sake of simplicity). Then, I could calculate np.percentile(B.reshape(Nx, Ny, Nt*Nn), 0.5, axis=2). The problem with this approach is two-fold:
I don't know how to build this matrix without explicit looping.
I'm concerned with the memory cost of having a Nn redundant copies of the original array. In the actual application, the matrix will be very large and the number of neighbours will also larger than 8.
Overall, I do not think this procedure is any better than explicitly looping and calculating things on each grid point in each iteration.
I'm pretty sure there must be more convenient and standardised way to have neighbours into account within the Numpy matricial approach, but I could not find the recipe online.

Flatten only part of a dataframe shape for Euclidean calculation?

I have a data frame with shape:
(20,30,1024)
I want to find the Euclidean distance between every entry and every other entry in the dataframe (ideally non-redundantly, i.e. don't find the distance of row 1 and 5....and then row 5 and 1 but not there yet). I have this code:
from scipy.spatial.distance import pdist,squareform
distances = pdist(df_test,metric='euclidean')
dist_matrix = squareform(distances)
print(dist_matrix)
The error says:
A 2-dimensional array must be passed.
So I guess I want to convert my matrix from shape (20,30,1024) to (20,30720), and then calculate the pdist/squareform between the rows (i.e. 20 rows of vectors that are 30720 in length).
I know that I can use test_df[0:20].flatten().tolist()
But that completely flattened my matrix, the output shape was (1,614400).
Can someone show me how to convert a shape from (20,30,1024) to (20,3072), or if i'm not going about this the right way?
The ultimate end goal is to calculate Euclidean distance between all non-redundant pairs in a data set, but the data set is big, so I need to do it as efficiently as possible/not duplicating calculations.
The most straightforward way to reshape that I can think of, according to how you described the problem, is:
df_test.values.reshape(20, -1)
By calling .values, you are retrieving your dataframe data as a numpy array. From there, .reshape finishes your job. Since you need a 2D-array, you provide the size of the first dimension (in your case, 20), and by passing -1 Numpy will calculate the size of the second dimension for you (in this case it will multiply the remaining dimension sizes in the original 3D-array)

Moments of a Numpy Array

I have a numpy array, say
rho = np.arange(25).reshape((5,5))
where each value represents the mass at that coordinate. I was a fast method to calculate the first 3 moments (so sum rho(x,y) x^n y^m for m+n<=3 though I may need high values. My actual matrix is much larger. I can do it easily with for loops, so compute the mean, then subtract that in the following summations, but there must be an efficient way with numpy operations that I have not figured out. Is there a very easy way to do this?

What does (n,) mean in the context of numpy and vectors?

I've tried searching StackOverflow, googling, and even using symbolhound to do character searches, but was unable to find an answer. Specifically, I'm confused about Ch. 1 of Nielsen's Neural Networks and Deep Learning, where he says "It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector."
At first I thought (n,) referred to the orientation of the array - so it might refer to a one-column vector as opposed to a vector with only one row. But then I don't see why we need (n,) and (n, 1) both - they seem to say the same thing. I know I'm misunderstanding something but am unsure.
For reference a refers to a vector of activations that will be input to a given layer of a neural network, before being transformed by the weights and biases to produce the output vector of activations for the next layer.
EDIT: This question equivocates between a "one-column vector" (there's no such thing) and a "one-column matrix" (does actually exist). Same for "one-row vector" and "one-row matrix".
A vector is only a list of numbers, or (equivalently) a list of scalar transformations on the basis vectors of a vector space. A vector might look like a matrix when we write it out, if it only has one row (or one column). Confusingly, we will sometimes refer to a "vector of activations" but actually mean "a single-row matrix of activation values transposed so that it is a single-column."
Be aware that in neither case are we discussing a one-dimensional vector, which would be a vector defined by only one number (unless, trivially, n==1, in which case the concept of a "column" or "row" distinction would be meaningless).
In numpy an array can have a number of different dimensions, 0, 1, 2 etc.
The typical 2d array has dimension (n,m) (this is a Python tuple). We tend to describe this as having n rows, m columns. So a (n,1) array has just 1 column, and a (1,m) has 1 row.
But because an array may have just 1 dimension, it is possible to have a shape (n,) (Python notation for a 1 element tuple: see here for more).
For many purposes (n,), (1,n), (n,1) arrays are equivalent (also (1,n,1,1) (4d)). They all have n terms, and can be reshaped to each other.
But sometimes that extra 1 dimension matters. A (1,m) array can multiply a (n,1) array to produce a (n,m) array. A (n,1) array can be indexed like a (n,m), with 2 indices, x[:,0] where as a (n,) only accepts x[0].
MATLAB matrices are always 2d (or higher). So people transfering ideas from MATLAB tend to expect 2 dimensions. There is a np.matrix subclass that supposed to imitate that.
For numpy programmers the distinctions between vector, row vector, column vector, matrix are loose and relatively unimportant. Or the use is derived from the application rather than from numpy itself. I think that's what's happening with this network book - the notation and expectations come from outside of numpy.
See as well this answer for how to interpret the shapes with respect to the data stored in ndarrays. It also provides insight on how to use .reshape: https://stackoverflow.com/a/22074424/3277902
(n,) is a tuple of length 1, whose only element is n. (The syntax isn't (n) because that's just n instead of making a tuple.)
If an array has shape (n,), that means it's a 1-dimensional array with a length of n along its only dimension. It's not a row vector or a column vector; it doesn't have rows or columns. It's just a vector.

Axis elimination

I'm having a trouble understanding the concept of Axis elimination in numpy. Suppose I have the following 2D matrix:
A =
1 2 3
3 4 5
6 7 8
Ok I understand that sum(A, axis=0) will sum each column down and will give a 1D array with 3 elements. I also understand that sum(A, axis=1) will sum each row.
But my trouble is when I read that axis=0 eliminates the 0th axis and axis=1 eliminates the 1th axis. Also sometime people mention "reduce" instead of "eliminate". I'm unable to understand what does that eliminate. For example sum(A, axis=0) will sum each column from top to bottom, but I don't see elimination or reduction here. What's the point? The same also for sum(A,axis=1).
AND how is it for higher dimensions?
p.s. I always confused between matrix dimensions and array dimensions. I wished that people who write the numpy documentation makes this distinction very clear.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduce.html
Reduces a‘s dimension by one, by applying ufunc along one axis.
For example, add.reduce() is equivalent to sum().
In numpy, the base class is ndarray - a multidimensional array (can 0d, 1d, or more)
http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
Matrix is a subclass of array
http://docs.scipy.org/doc/numpy/reference/arrays.classes.html
Matrix objects are always two-dimensional
The history of the numpy Matrix is old, but basically it's meant to resemble the MATLAB matrix object. In the original MATLAB nearly everything was a matrix, which was always 2d. Later they generalized it to allow more dimensions. But it can't have fewer dimensions. MATLAB does have 'vectors', but they are just matrices with one dimension being 1 (row vector versus column vector).
'axis elimination' is not a common term when working with numpy. It could, conceivably, refer to any of several ways that reduce the number of dimensions of an array. Reduction, as in sum(), is one. Indexing is another: a[:,0,:]. Reshaping can also change the number of dimensions. np.squeeze is another.

Categories