scipy.stats.skew nan_policy parameter

According to the scipy docs for skew, we have:
scipy.stats.skew(a, axis=0, bias=True, nan_policy='propagate')
where,
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
Defines how to handle when input contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, ‘omit’ performs the calculations ignoring nan values. Default is ‘propagate’.
So the default policy for NaN is 'propagate'. But how exactly are the NaNs propagated? I can understand 'omit', since it performs the calculations while ignoring NaN values, and 'raise' as well, but the docs don't explain how the missing values are treated under 'propagate', or how they would show up in a plot.
Also, it would be great if someone explained the bias parameter too.
bias : bool, optional
If False, then the calculations are corrected for statistical bias.

It just does the computations with the data as is. Since NaN propagates through ordinary floating-point arithmetic, any slice containing a NaN yields NaN for its skew:
In [43]: x = np.arange(12, dtype=float).reshape(4, -1)
In [44]: x[2, 1] = np.nan
In [45]: x
Out[45]:
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6., nan,  8.],
       [ 9., 10., 11.]])
In [46]: stats.skew(x, nan_policy='propagate')
Out[46]: array([ 0., nan, 0.])
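For contrast, here is a quick sketch of what the other policies and the bias flag do on the same data (the numbers are approximate; with bias=False, scipy rescales the sample skewness g1 by the adjusted Fisher-Pearson factor sqrt(n*(n-1))/(n-2)):
import numpy as np
from scipy import stats

x = np.arange(12, dtype=float).reshape(4, -1)
x[2, 1] = np.nan

stats.skew(x, nan_policy='omit')     # ~array([0., 0.382, 0.]): column 1 is computed from [1., 4., 10.]
# stats.skew(x, nan_policy='raise')  # would raise an error because of the NaN

y = np.array([1., 2., 3., 10.])
stats.skew(y, bias=True)             # ~1.018: the uncorrected sample skewness g1
stats.skew(y, bias=False)            # ~1.764: g1 * sqrt(4*3)/2, the bias-corrected estimate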


Numpy version of Pandas dropna with subset - trying to remove rows if the last column in my array contains NaN

I'm trying to remove rows from a numpy array if a certain column (i.e. the last column in my array) contains NaN. NaN values in other columns are acceptable, just not the last column.
I know this is possible by converting to a pandas dataframe and using df.dropna(subset=['lastcolumn']). I am wondering if it is possible to do this in numpy since converting to Pandas and using dropna is quite slow.
Using np.isnan() works, but I need to specify which column can't have NaN:
a = np.array([[1,np.nan,3], [4,5,np.nan], [7,8,9]])
print(a)
[[1.0000    nan 3.0000]
 [4.0000 5.0000    nan]
 [7.0000 8.0000 9.0000]]
b = a[~np.isnan(a[:, 2:3]).any(axis=1)]
print(b)
[[1.0000    nan 3.0000]
 [7.0000 8.0000 9.0000]]
Something like this might work:
In [1856]: import numpy as np
In [1857]: a = np.array([[1,2,3], [4,5,np.nan], [7,8,9]])
In [1858]: a
Out[1858]:
array([[  1.,   2.,   3.],
       [  4.,   5.,  nan],
       [  7.,   8.,   9.]])
In [1859]: a[~np.isnan(a).any(axis=1)]
Out[1859]:
array([[1., 2., 3.],
       [7., 8., 9.]])
EDIT
If NaN needs to be removed from a specific column only, restrict the isnan test to that column. Using the array from the question (which has NaN in the second column of the first row), this drops the rows whose second column is NaN:
In [1870]: a[~np.isnan(a[:, 1:2]).any(axis=1)]
Out[1870]:
array([[ 4.,  5., nan],
       [ 7.,  8.,  9.]])
This filters on the second column only; NaNs in the other columns are left alone.
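Wrapping the same idea in a small helper gives a NumPy analogue of df.dropna(subset=...). This is a minimal sketch, and dropna_subset is just an illustrative name, not a NumPy function:
import numpy as np

def dropna_subset(arr, cols):
    # Drop rows that contain NaN in any of the listed columns;
    # NaNs in other columns are kept.
    return arr[~np.isnan(arr[:, cols]).any(axis=1)]

a = np.array([[1, np.nan, 3], [4, 5, np.nan], [7, 8, 9]])
dropna_subset(a, [2])   # drop rows whose last column is NaN
# array([[ 1., nan,  3.],
#        [ 7.,  8.,  9.]])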

Adding Numpy arrays like Counters

Since collections.Counter is so slow, I am pursuing a faster method of summing mapped values in Python 2.7. It seems like a simple concept and I'm kind of disappointed in the built-in Counter method.
Basically, I need to be able to take arrays like this:
array([[ 0.,  2.],
       [ 2.,  2.],
       [ 3.,  1.]])
array([[ 0.,  3.],
       [ 1.,  1.],
       [ 2.,  5.]])
And then "add" them so they look like this:
array([[ 0.,  5.],
       [ 1.,  1.],
       [ 2.,  7.],
       [ 3.,  1.]])
If there isn't a good way to do this quickly and efficiently, I'm open to any other ideas that will allow me to do something similar to this, and I'm open to modules other than Numpy.
Thanks!
Edit: Ready for some speedtests?
Intel win 64bit machine. All of the following values are in seconds; 20000 loops.
collections.Counter results:
2.131000, 2.125000, 2.125000
Divakar's union1d + masking results:
1.641000, 1.633000, 1.625000
Divakar's union1d + indexing results:
0.625000, 0.625000, 0.641000
Histogram results:
1.844000, 1.938000, 1.858000
Pandas results:
16.659000, 16.686000, 16.885000
Conclusions: union1d + indexing wins, the array size is too small for Pandas to be effective, and the histogram approach blew my mind with its simplicity but I'm guessing it takes too much overhead to create. All of the responses I received were very good, though. This is what I used to get the numbers. Thanks again!
Edit: And it should be mentioned that using Counter1.update(Counter2.elements()) is terrible despite doing the same exact thing (65.671000 sec).
Later Edit: I've been thinking about this a lot, and I've come to realize that, with Numpy, it might be more effective to fill each array with zeros so that the first column isn't even needed, since we can just use the index; that would also make it much easier to add multiple arrays together, as well as apply other functions. Additionally, Pandas makes more sense than Numpy since there would be no need to 0-fill, and it would definitely be more effective with large data sets (however, Numpy has the advantage of being compatible on more platforms, like GAE, if that matters at all). Lastly, the answer I checked was definitely the best answer for the exact question I asked, adding the two arrays in the way I showed, but I think what I needed was a change in perspective.
Here's one approach with np.union1d and masking -
def app1(a, b):
    c0 = np.union1d(a[:, 0], b[:, 0])
    out = np.zeros((len(c0), 2))
    out[:, 0] = c0
    mask1 = np.in1d(c0, a[:, 0])
    out[mask1, 1] = a[:, 1]
    mask2 = np.in1d(c0, b[:, 0])
    out[mask2, 1] += b[:, 1]
    return out
Sample run -
In [174]: a
Out[174]:
array([[  0.,   2.],
       [ 12.,   2.],
       [ 23.,   1.]])
In [175]: b
Out[175]:
array([[  0.,   3.],
       [  1.,   1.],
       [ 12.,   5.]])
In [176]: app1(a,b)
Out[176]:
array([[  0.,   5.],
       [  1.,   1.],
       [ 12.,   7.],
       [ 23.,   1.]])
Here's another with np.union1d and indexing -
def app2(a, b):
    n = np.maximum(a[:, 0].max(), b[:, 0].max()) + 1
    c0 = np.union1d(a[:, 0], b[:, 0])
    out0 = np.zeros((int(n), 2))
    out0[a[:, 0].astype(int), 1] = a[:, 1]
    out0[b[:, 0].astype(int), 1] += b[:, 1]
    out = out0[c0.astype(int)]
    out[:, 0] = c0
    return out
For the case where all indices are covered by the first column values in a and b -
def app2_specific(a, b):
    c0 = np.union1d(a[:, 0], b[:, 0])
    n = c0[-1] + 1
    out0 = np.zeros((int(n), 2))
    out0[a[:, 0].astype(int), 1] = a[:, 1]
    out0[b[:, 0].astype(int), 1] += b[:, 1]
    out0[:, 0] = c0
    return out0
Sample run -
In [234]: a
Out[234]:
array([[ 0.,  2.],
       [ 2.,  2.],
       [ 3.,  1.]])
In [235]: b
Out[235]:
array([[ 0.,  3.],
       [ 1.,  1.],
       [ 2.,  5.]])
In [236]: app2_specific(a,b)
Out[236]:
array([[ 0.,  5.],
       [ 1.,  1.],
       [ 2.,  7.],
       [ 3.,  1.]])
If you know the number of fields, use np.bincount (note that bincount needs integer bin indices, so the label column is cast with astype(int)):
c = np.vstack([a, b])
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=numFields)
out = np.vstack([np.arange(numFields), counts]).T
This works if you're getting all your data at once. Make a list of your arrays and vstack them. If you're getting data chunks sequentially, you can use np.add.at to do the same thing:
out = np.zeros((numFields, 2))
out[:, 0] = np.arange(numFields)
np.add.at(out[:, 1], a[:, 0].astype(int), a[:, 1])
np.add.at(out[:, 1], b[:, 0].astype(int), b[:, 1])
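For instance, with the sample arrays from the question and numFields = 4 (assumed here, since the labels run 0..3), a quick check of the bincount route:
import numpy as np

a = np.array([[0., 2.], [2., 2.], [3., 1.]])
b = np.array([[0., 3.], [1., 1.], [2., 5.]])
numFields = 4   # labels are assumed to be the integers 0..numFields-1

c = np.vstack([a, b])
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=numFields)
out = np.vstack([np.arange(numFields), counts]).T
# out:
# [[0. 5.]
#  [1. 1.]
#  [2. 7.]
#  [3. 1.]]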
You can use a basic histogram; this will deal with gaps, too. You can filter out zero-count entries if need be.
import numpy as np
x = np.array([[ 0., 2.],
              [ 2., 2.],
              [ 3., 1.]])
y = np.array([[ 0., 3.],
              [ 1., 1.],
              [ 2., 5.],
              [ 5., 3.]])
c, w = np.vstack((x, y)).T
h, b = np.histogram(c, weights=w,
                    bins=np.arange(c.min(), c.max() + 2))
r = np.vstack((b[:-1], h)).T
print(r)
# [[ 0. 5.]
# [ 1. 1.]
# [ 2. 7.]
# [ 3. 1.]
# [ 4. 0.]
# [ 5. 3.]]
r_nonzero = r[r[:,1]!=0]
Pandas has functions that do exactly what you intend:
import pandas as pd
pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1)
Edit: If you actually need the data back in numpy format, just do
array_res = result.reset_index(name=1).values
This is a quintessential grouping problem, which numpy_indexed (disclaimer: I am its author) was created to solve elegantly and efficiently:
import numpy_indexed as npi
C = np.concatenate([A, B], axis=0)
labels, sums = npi.group_by(C[:, 0]).sum(C[:, 1])
Note: it's cleaner to maintain your label array as a separate int array; floats are finicky when it comes to labeling things, with positive and negative zeros, and printed values not revealing their full binary state. Better to use ints for that.
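If you'd rather avoid the extra dependency, the same grouping can be done in plain NumPy with np.unique plus np.bincount; a minimal sketch, which also copes with non-contiguous labels:
import numpy as np

A = np.array([[0., 2.], [2., 2.], [3., 1.]])
B = np.array([[0., 3.], [1., 1.], [2., 5.]])

C = np.concatenate([A, B], axis=0)
labels, inverse = np.unique(C[:, 0], return_inverse=True)   # unique keys + each row's key position
sums = np.bincount(inverse, weights=C[:, 1])                # sum the weights per unique key
result = np.column_stack([labels, sums])
# array([[0., 5.],
#        [1., 1.],
#        [2., 7.],
#        [3., 1.]])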

How to multiply.outer() in NumPy while assuming 0 * infinity = 0?

I'm trying to use numpy.multiply.outer on multidimensional arrays, and I really need it to assume that any 0 * infinity it sees evaluates to zero. How can I do this efficiently?
>>> import numpy
>>> numpy.multiply.outer([0.], [float('inf')])
Warning (from warnings module):
File "__main__", line 2
RuntimeWarning: invalid value encountered in multiply
array([[ nan]])
Do you need to worry about other sources of nan values? If not, you could always just fix up in a separate step:
import numpy as np
r = np.multiply.outer([0.], [float('inf')])
np.where(np.isnan(r), 0, r)
Up to you if you want to suppress the warnings.
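If the warning itself is the only nuisance, NumPy's errstate context manager can silence the invalid-value warning around the computation; a minimal sketch of that variant:
import numpy as np

with np.errstate(invalid='ignore'):             # suppress the 0 * inf RuntimeWarning
    r = np.multiply.outer([0.], [float('inf')])
r = np.where(np.isnan(r), 0, r)                 # then treat 0 * inf as 0
# array([[0.]])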
One solution could be to avoid np.multiply.outer and instead use element-wise multiplication on matrices that have already been checked for the condition of interest (zero in one array, inf in the other).
import numpy as np
A = np.array([0., 0., 0.4, 2])
B = np.array([float('inf'), 1., 3.4, np.inf])
# Conditions of interest
c1 = (A == 0)
c2 = (B == np.inf)
condition1 = np.multiply.outer(c1, c2)
c3 = (A == np.inf)
c4 = (B == 0)
condition2 = np.multiply.outer(c3, c4)
condition = condition1 | condition2
AA = np.multiply.outer(A, np.ones(B.shape))
BB = np.multiply.outer(np.ones(A.shape), B)
AA[condition] = 0.
BB[condition] = 0.
AA*BB
This may not pass the 'efficiency' request of the poster, however.
Here's how to suppress the warnings (this also comes up in: mean, nanmean and warning: Mean of empty slice):
In [528]: import warnings
In [530]: x = np.array([0,1,2], float)
In [531]: y = np.array([np.inf,3,2], float)
In [532]: np.outer(x,y)
/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py:1093: RuntimeWarning: invalid value encountered in multiply
  return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis,:], out)
Out[532]:
array([[ nan,   0.,   0.],
       [ inf,   3.,   2.],
       [ inf,   6.,   4.]])
In [535]: with warnings.catch_warnings():
     ...:     warnings.simplefilter('ignore', category=RuntimeWarning)
     ...:     z = np.outer(x,y)
     ...:
In [536]: z
Out[536]:
array([[ nan,   0.,   0.],
       [ inf,   3.,   2.],
       [ inf,   6.,   4.]])
Replace the nan with 1:
In [542]: z[np.isnan(z)] = 1
In [543]: z
Out[543]:
array([[  1.,   0.,   0.],
       [ inf,   3.,   2.],
       [ inf,   6.,   4.]])
In [547]: z[np.isinf(z)] = 9999
In [548]: z
Out[548]:
array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  9.99900000e+03,   3.00000000e+00,   2.00000000e+00],
       [  9.99900000e+03,   6.00000000e+00,   4.00000000e+00]])
=================
We could create a mask using the kind of testing that @P-robot demonstrates:
In [570]: np.outer(np.isinf(x), y==0) | np.outer(x==0, np.isinf(y))
Out[570]:
array([[ True, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)
In [571]: mask = np.outer(np.isinf(x), y==0) | np.outer(x==0, np.isinf(y))
In [572]: with warnings.catch_warnings():
     ...:     warnings.simplefilter('ignore', category=RuntimeWarning)
     ...:     z = np.outer(x,y)
     ...:
In [573]: z[mask] = 1
In [574]: z
Out[574]:
array([[  1.,   0.,   0.],
       [ inf,   3.,   2.],
       [ inf,   6.,   4.]])
Or with messier inputs:
In [587]: x = np.array([0,1,2,np.inf], float)
In [588]: y = np.array([np.inf,3,np.nan,0], float)
In [589]: mask = np.outer(np.isinf(x), y==0) | np.outer(x==0, np.isinf(y))
...
In [591]: with warnings.catch_warnings():
     ...:     warnings.simplefilter('ignore', category=RuntimeWarning)
     ...:     z = np.outer(x,y)
     ...:
In [592]: z[mask] = 1
In [593]: z
Out[593]:
array([[  1.,   0.,  nan,   0.],
       [ inf,   3.,  nan,   0.],
       [ inf,   6.,  nan,   0.],
       [ inf,  inf,  nan,   1.]])
While I agree with @ShadowRanger's answer, a cheap hack could be to take advantage of np.nan_to_num, which replaces infs with large finite numbers, which will then get you inf * 0 = 0.
To convert unwanted remaining high finite numbers back to inf (given some other operations beyond your question), you can multiply the high number by anything > 1 and then divide by the same amount (so as not to impact other numbers). E.g.:
In [1]: np.nan_to_num(np.inf)
Out[1]: 1.7976931348623157e+308
In [2]: np.nan_to_num(np.inf)*1.1
RuntimeWarning: overflow encountered in double_scalars
Out[2]: inf
In [3]: np.nan_to_num(np.inf)*1.1/1.1
RuntimeWarning: overflow encountered in double_scalars
Out[3]: inf
Before the flood of downvotes, this is clearly not a best practice, and can potentially have side effects depending on your use case, but just thought I'd throw an alternative out there.
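A related note: newer NumPy versions (1.17+, if I recall correctly) let np.nan_to_num remap values directly via keyword arguments, so the fixup can be done on the result rather than the inputs; a minimal sketch:
import numpy as np

r = np.multiply.outer([0., 1.], [np.inf, 2.])   # the 0 * inf entry comes out as nan
r = np.nan_to_num(r, nan=0.0, posinf=np.inf)    # map nan -> 0, leave inf alone
# array([[ 0.,  0.],
#        [inf,  2.]])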

Using numpy.cov on a vector yields NANs

Good afternoon.
I am faced with a PCA task which simply involves reducing the dimensionality of a vector. I'm not interested in a two-dimensional matrix in this case, but merely a D-dimensional vector which I would like to project along its K principal eigenvectors.
In order to implement PCA, I need to retrieve the covariance matrix of this vector. Let's try to do this on an example vector:
someVec = np.array([[1.0, 1.0, 2.0, -1.0]])
I've defined this vector as a 1 X 4 matrix, i.e a row vector, in order to make it compatible with numpy.cov. Taking the covariance matrix of this vector through numpy.cov will yield a scalar covariance matrix, because numpy.cov makes the assumption that the features are in the rows:
print np.cov(someVec)
1.58333333333
but this is (or rather, should be) merely a difference in dimensionality assumptions, and taking the covariance of the transpose vector should work fine, right? Except that it doesn't:
print np.cov(someVec.T)
/usr/lib/python2.7/site-packages/numpy/lib/function_base.py:2005: RuntimeWarning: invalid value encountered in divide
  return (dot(X, X.T.conj()) / fact).squeeze()
[[ nan  nan  nan  nan]
 [ nan  nan  nan  nan]
 [ nan  nan  nan  nan]
 [ nan  nan  nan  nan]]
I'm not exactly sure what I've done wrong here. Any advice?
Thanks,
Jason
If you want to pass in the transpose, you'll need to set rowvar to zero.
In [10]: np.cov(someVec, rowvar=0)
Out[10]: array(1.5833333333333333)
In [11]: np.cov(someVec.T, rowvar=0)
Out[11]: array(1.5833333333333333)
From the docs:
rowvar : int, optional
If rowvar is non-zero (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
If you want to find a full covariance matrix, you'll need more than one observation. With a single observation and numpy's default estimator, NaN is exactly what you'd expect. If you would like the normalization done by N instead of (N - 1), you can pass 1 for bias:
In [12]: np.cov(someVec.T, bias=1)
Out[12]:
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
Again, from the docs.
bias : int, optional
Default normalization is by (N - 1), where N is the number of observations given (unbiased estimate). If bias is 1, then normalization is by N. These values can be overridden by using the keyword ddof in numpy versions >= 1.5.
You should use the option rowvar=0 in numpy.cov:
In [1]: a = np.array([[1, 2, 3, 4]])
In [2]: np.cov(a)
Out[2]: array(1.6666666666666667)
In [3]: np.cov(a.T)
Out[3]:
array([[ nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan]])
In [4]: np.cov(a.T, rowvar=0)
Out[4]: array(1.6666666666666667)
Not really; shouldn't that be returning a matrix of size 4 x 4? I mean, the vector has 4 "features", so given that I want to measure the variance between the features and store them in appropriate places, I need a covariance matrix.
Since you only have one observation, you can't compute a covariance matrix. Depending on the estimator, the covariances would either be zero or undefined.
If that's not intuitively clear, try answering the following questions:
what is the variance of 1.0?
what is the covariance of 1.0 and 2.0?
In essence, these are the computations that you're asking numpy.cov() to perform.
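To make the distinction concrete, here is a quick sketch showing that with several observations of the 4 variables, np.cov does return a full 4 x 4 matrix (the extra rows of data are made up for illustration):
import numpy as np

# 3 observations (rows) of 4 variables (columns)
data = np.array([[1.0, 1.0, 2.0, -1.0],
                 [2.0, 1.5, 1.0,  0.0],
                 [0.5, 3.0, 2.5, -2.0]])

cov = np.cov(data, rowvar=0)   # columns are the variables now
cov.shape                      # (4, 4): a full covariance matrix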
