How can missing values be specified when calling pdist in scipy? I.e., the function described here:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html
For example, if you have:
pdist(X, "euclidean")
but X might contain missing values like the string "NA", and you want those to be excluded from pairwise comparisons among X's columns. The behavior I'm looking for is to not consider missing values when getting the euclidean distance between any pair of columns in X.
The best way is to fill your X array with np.nan for the points to be excluded. For example, assuming a 2D case with X a (10, 2) array:
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.rand(10, 2)
Let's assume you want to exclude X[7] from the calculation:
X[7] = np.nan
my_dist = pdist(X, "euclidean")
Then, you'll see that my_dist has 'nan' for the pairs that involved calculating distance with the excluded element. You can exclude multiple elements.
A better idea would be to use a numpy masked array, but pdist ignores masked arrays and uses the data anyway. However, once you have the output my_dist, you can convert it to a masked array so that the nans don't get in the way of future array operations:
my_dist = np.ma.array(my_dist, mask = ~np.isfinite(my_dist))
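If, instead, you want each pairwise distance to skip only the missing coordinates rather than dropping the whole point, pdist also accepts a callable metric. A minimal sketch (the nan_euclidean helper below is my own, not part of scipy; if the missing values are the string "NA", convert them to np.nan first so the array stays numeric):
import numpy as np
from scipy.spatial.distance import pdist

def nan_euclidean(u, v):
    # use only the coordinates that are present (non-NaN) in both points
    ok = ~(np.isnan(u) | np.isnan(v))
    return np.sqrt(np.sum((u[ok] - v[ok]) ** 2))

X = np.random.rand(10, 3)
X[7, 0] = np.nan                        # a single missing coordinate, not a whole point
my_dist = pdist(X, metric=nan_euclidean)
Note that a Python-level callable will be much slower than the built-in "euclidean" metric.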
I have a MxN array of values taken from an experiment. Some of these values are invalid and are set to 0 to indicate such. I can construct a mask of valid/invalid values using
mask = (mat1 == 0) & (mat2 == 0)
which produces an MxN array of bool. It should be noted that the masked locations do not neatly follow columns or rows of the matrix - so simply cropping the matrix is not an option.
Now, I want to take the mean along one axis of my array (e.g. end up with a 1xN array) while excluding those invalid values from the mean calculation. Intuitively I thought
np.mean(mat1[mask],axis=1)
should do it, but the mat1[mask] operation produces a 1D array which appears to just be the elements where mask is true - which doesn't help when I only want a mean across one dimension of the array.
Is there a 'python-esque' or numpy way to do this? I suppose I could use the mask to set masked elements to NaN and use np.nanmean - but that still feels kind of clunky. Is there a way to do this 'cleanly'?
I think the best way to do this would be something along the lines of:
masked = np.ma.masked_where((mat1 == 0) & (mat2 == 0), array_to_mask)
Then take the mean with
masked.mean(axis=1)
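For completeness, a minimal self-contained sketch with made-up data where zeros mark invalid entries; axis=0 gives the per-column (1xN-style) result the question describes, while axis=1 above gives per-row means:
import numpy as np

# made-up data: entries that are zero in both arrays are invalid
mat1 = np.array([[1.0, 0.0, 3.0],
                 [4.0, 5.0, 0.0],
                 [0.0, 2.0, 6.0]])
mat2 = np.array([[2.0, 0.0, 1.0],
                 [3.0, 7.0, 5.0],
                 [0.0, 4.0, 8.0]])

invalid = (mat1 == 0) & (mat2 == 0)          # True where the value should be excluded
masked = np.ma.masked_where(invalid, mat1)   # mask those entries of mat1
print(masked.mean(axis=0))                   # per-column means ignoring masked entries, here [2.5 3.5 3.0]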
One similarly clunky but efficient way is to multiply your array by the mask, setting the masked values to zero. Then, of course, you'll have to divide by the number of non-masked values manually - hence the clunkiness. But this works with integer-valued arrays, something that can't be said for the nan approach. It also seems to be the fastest for both small and large arrays (faster than the masked-array solution in another answer, too):
import numpy as np
def nanny(mat, mask):
    mat = mat.astype(float).copy()  # work on a copy; don't mutate the original
    mat[~mask] = np.nan             # set masked-out values to NaN
    return np.nanmean(mat, axis=0)  # column means, ignoring the NaNs

def manual(mat, mask):
    # zero the masked values, then divide by the number of non-masked values per column
    return (mat * mask).sum(axis=0) / mask.sum(axis=0)

# set up dummy data for testing
N, M = 400, 400
mat1 = np.random.randint(0, N, (N, M))
mask = np.random.randint(0, 2, (N, M)).astype(bool)

print(np.array_equal(nanny(mat1, mask), manual(mat1, mask)))  # True
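If you want to check the speed claim on your own sizes, a quick timing sketch with timeit (reusing mat1 and mask from above; the numbers are machine-dependent, so no results are claimed here):
from timeit import timeit

print(timeit(lambda: nanny(mat1, mask), number=10))   # nanmean-based version
print(timeit(lambda: manual(mat1, mask), number=10))  # multiply-and-divide version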
I have an array of 3D points, and I want to find the greatest Z-coordinate in that array. After that I need to find the corresponding X and Y coordinate values for that point. How can I achieve this quickly with numpy?
What I did:
I used argsort to first sort the given array, then used np.max(array) to find the greatest Z-coordinate. I do not know how else to continue. Can numpy.where be useful here?
Thanks!
What you are looking for is numpy's argmax.
A quick example:
import numpy as np

data = np.random.rand(5, 3)
print(data)

ind = np.argmax(data[:, 2])  # index of the row with the largest Z value
print(data[ind, :])          # the corresponding (x, y, z) point
outputs
[[0.92037795 0.59469121 0.02956843]
[0.82881039 0.23272832 0.97275488]
[0.98418468 0.45699429 0.44662552]
[0.62519115 0.16637013 0.40433299]
[0.98272718 0.01467489 0.57442259]]
[0.82881039 0.23272832 0.97275488]
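To the follow-up about numpy.where: it can also recover the row of the maximum Z value, although argmax is more direct. A rough sketch:
import numpy as np

data = np.random.rand(5, 3)
rows, = np.where(data[:, 2] == data[:, 2].max())  # row indices where Z equals its maximum
print(data[rows[0], :])                           # same point as data[np.argmax(data[:, 2])]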
I saw in a tutorial (there was no further explanation) that we can process data to zero mean with x -= np.mean(x, axis=0) and normalize data with x /= np.std(x, axis=0). Can anyone elaborate on these two pieces of code? The only thing I got from the documentation is that np.mean calculates the arithmetic mean along a specific axis, and np.std does the same for the standard deviation.
This is also called zscore.
SciPy has a utility for it:
>>> from scipy import stats
>>> stats.zscore([ 0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
... 0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
array([ 1.1273, -1.247 , -0.0552, 1.0923, 1.1664, -0.8559, 0.5786,
0.6748, -1.1488, -1.3324])
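For reference, a small check using the same numbers, showing that zscore is exactly the subtract-mean-then-divide-by-std operation asked about in the question:
import numpy as np
from scipy import stats

x = np.array([0.7972, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

manual = (x - np.mean(x)) / np.std(x)         # the two steps from the question combined
print(np.allclose(manual, stats.zscore(x)))   # True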
Follow the comments in the code below:
import numpy as np

# create x
x = np.asarray([1, 2, 3, 4], dtype=np.float64)

np.mean(x)       # calculates the mean of the array x
x - np.mean(x)   # equivalent to subtracting the mean of x from each value in x
x -= np.mean(x)  # the -= can be read as x = x - np.mean(x)
np.std(x)        # calculates the standard deviation of the array
x /= np.std(x)   # the /= can be read as x = x / np.std(x)
From the syntax you give, I conclude that your array is multidimensional. So let me first discuss the case where x is just a linear array:
np.mean(x) computes the mean; by broadcasting, x - np.mean(x) subtracts the mean of x from all the entries. x -= np.mean(x, axis=0) is equivalent to x = x - np.mean(x, axis=0). Similarly for x /= np.std(x).
In the case of multidimensional arrays the same thing happens, but instead of computing the mean over the entire array, you compute the mean over the first "axis". Axis is the numpy word for dimension. So if your x is two-dimensional, then np.mean(x, axis=0) = [np.mean(x[:,0]), np.mean(x[:,1]), ...]. Broadcasting again ensures that this is done to all elements.
Note that this only works with the first dimension; otherwise the shapes will not match for broadcasting. If you want to normalize with respect to another axis you need to do something like:
x -= np.expand_dims(np.mean(x, axis=n), n)
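As an illustration of the axis handling, here is a small sketch; keepdims=True is an alternative to np.expand_dims for non-leading axes:
import numpy as np

x = np.random.rand(4, 3)

# standardize each column (axis=0): the (3,) mean/std broadcast against (4, 3) directly
x_cols = (x - x.mean(axis=0)) / x.std(axis=0)

# standardize each row (axis=1): keepdims=True keeps shape (4, 1),
# equivalent to the np.expand_dims approach above
x_rows = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)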
Key here are the augmented assignment operators. They actually perform the operation on the original variable.
a += c is equivalent to a = a + c.
So indeed a (in your case x) has to be defined beforehand.
Each method takes an array/iterable (x) as input and outputs a value (or an array, if a multidimensional array was given), which is then used in the assignment operation.
With axis=0, you apply the mean or std operation over the rows: for each column, you take the values from every row and compute the mean or std.
axis=1 would instead take the values of each column for a given row.
What you do with both operations is: first you remove the mean, so that each column is now centered around 0. Then, when you divide by the std, you reduce the spread of the data around that zero, and most values will now fall roughly in a [-1, +1] interval.
So now, each of your column values is centered around zero and standardized.
There are other scaling techniques, such as removing the minimal or maximal value and dividing by the range of values.
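For example, a minimal sketch of that min-max variant (per column, so values end up in [0, 1]):
import numpy as np

x = np.random.rand(10, 3)

# min-max scaling per column: subtract the minimum, divide by the range
x_scaled = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))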
I want to calculate the standard deviation for values below and above the average of a matrix of n_par parameters and n_sample samples. The fastest way I found so far is:
stdleft = numpy.zeros_like(mean)
for jpar in xrange(mean.shape[1]):
    stdleft[jpar] = p[p[:, jpar] < mean[jpar], jpar].std()
where p is a matrix of shape (n_samples, n_par). Is there a smarter way to do it without the for loop? I have roughly n_par = 200 and n_samples = 1e8, so these few lines take ages to run.
Any idea would be really helpful!
Thank you
As I understand it, you want to calculate the standard deviation of each column where the values are below the mean for that column.
In numpy, it's easiest to use masked arrays for this.
As an example:
import numpy as np
# 10 samples, 3 columns
p = np.random.random((10, 3))
# Calculate the mean of each column
colmeans = p.mean(axis=0)
# Make a boolean array where our condition is True
mask = p < colmeans
# Find the standard deviation of values in each column below the column's mean.
# For masked arrays, the True values will be masked, so we'll invert the array.
stdleft = np.ma.masked_where(~mask, p).std(axis=0)
You can also use pandas for this as #SudeepJuvekar mentioned. The performance should be broadly similar, but pandas should be a bit faster for this particular operation (untested).
Pandas is your friend. Convert your matrix to a pandas DataFrame and index the DataFrame logically. Something like this:
import pandas

mat = pandas.DataFrame(p)
This creates a DataFrame from the original numpy matrix p. Then we compute the column means of the DataFrame.
m = mat.mean()
This creates an n_par-sized Series of the column means of mat. Finally, index the mat DataFrame using the < logical operation and apply std to the result.
stdleft = mat[mat < m].std()
Similarly for stdright. It takes a couple of minutes to compute on my machine.
Here's the doc page for pandas: http://pandas.pydata.org/
Edit: edited following the comment below. You can do almost the same indexing using the original p.
m = p.mean(axis=0)
logical = p < m
logical contains a boolean matrix of the same size as p. This is where pandas comes in handy: you can directly index a pandas DataFrame with a boolean frame of the same size. Doing so in numpy is slightly harder; looping is probably the best way to achieve it:
for i in range(p.shape[1]):   # loop over the n_par columns, not the rows
    stdleft[i] = p[logical[:, i], i].std()
I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
Var[X] = E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you would have to square the csc_matrix element-wise and then take its mean. To get (E[X])^2 you simply square the result of the mean function applied to the original input.
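A minimal sketch of that formula applied column-wise to a csc_matrix (the example matrix X is made up; .multiply performs the element-wise square without densifying):
import numpy as np
from scipy import sparse

X = sparse.random(1000, 50, density=0.1, format="csc")    # made-up example matrix

mean = np.asarray(X.mean(axis=0)).ravel()                 # E[X] per column
mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()  # E[X^2] per column
var = mean_sq - mean ** 2                                 # Var[X] = E[X^2] - (E[X])^2
std = np.sqrt(var)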
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
import numpy as np

# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)   # will hold the per-column variances
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X)
Then the variances are in the attribute var_:
X_var = scaler.var_
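As a follow-up using the same scaler and X: because with_mean=False skips the centering step, transform only divides each column by its standard deviation, so the result stays sparse:
import numpy as np

X_scaled = scaler.transform(X)   # column-wise scaling, sparsity preserved
X_std = np.sqrt(scaler.var_)     # per-column standard deviations, if needed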
The curious thing, though, is that when I densified first using pandas (which is very slow), my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix, then standardize it in the usual way with
X = X.toarray()       # densify
X -= X.mean(axis=0)   # subtract each column's mean
X /= X.std(axis=0)    # divide by each column's std
As #Sebastian has noted in his comments, standardizing destroys the sparsity structure (introduces lots of non-zero elements) in the subtraction step, so there's no use keeping the matrix in a sparse format.