I happened onto this when trying to find the means/sums of non-nan elements in rows of a pandas dataframe. It seems that
df.apply(np.mean, axis=1)
works fine.
However, applying np.mean to a numpy array containing nans returns a nan.
Is this all specced out somewhere? I would not want to get burned down the road...
numpy's mean function first checks whether its input has a mean method, as @EdChum explains in this answer.
When you use df.apply, the input passed to the function is a pandas.Series. Since pandas.Series has a mean method, numpy uses that instead of using its own function. And by default, pandas.Series.mean ignores NaN.
You can access the underlying numpy array by the values attribute and pass that to the function:
df.apply(lambda x: np.mean(x.values), axis=1)
this will use numpy's version.
Divakar has correctly suggested using np.nanmean
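A small sketch (my own toy data, not from the question) showing all three variants side by side:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

print(df.apply(np.mean, axis=1))                         # Series.mean is used, NaN is skipped
print(df.apply(lambda x: np.mean(x.values), axis=1))     # plain ndarray, NaN propagates
print(df.apply(lambda x: np.nanmean(x.values), axis=1))  # ndarray, NaN explicitly ignored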
If I may answer the question still standing, the semantics differ because Numpy supports masked arrays, while Pandas does not.
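A quick sketch of that difference with a plain array:
import numpy as np

a = np.array([1.0, 2.0, np.nan])
print(np.mean(a))                      # nan -- a plain ndarray propagates NaN
print(np.ma.masked_invalid(a).mean())  # 1.5 -- a masked array ignores masked entries
print(np.nanmean(a))                   # 1.5 -- explicit NaN-aware mean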
Related
Many functions like in1d and setdiff1d are designed for 1-D arrays. One workaround for applying these methods to N-dimensional arrays is to make numpy treat each row (or higher-dimensional subarray) as a single value.
One approach I found is in Joe Kington's answer to Get intersecting rows across two 2D numpy arrays.
The following code is taken from that answer. The task Joe Kington faced was to detect the rows common to two arrays A and B while trying to use in1d.
import numpy as np
A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])
nrows, ncols = A.shape
dtype = {'names': ['f{}'.format(i) for i in range(ncols)],
         'formats': ncols * [A.dtype]}
C = np.intersect1d(A.view(dtype), B.view(dtype))
# This last bit is optional if you're okay with "C" being a structured array...
C = C.view(A.dtype).reshape(-1, ncols)
I am hoping you can help me with any of the following three questions. First, I do not understand the mechanism behind this method. Can you explain it to me?
Second, are there other ways to make numpy treat a subarray as one object?
One more open question: does Joe's approach have any drawbacks? I mean, might treating rows as values cause problems? Sorry this question is pretty broad.
Let me post what I have learned. The method Joe used is called structured arrays. It allows users to define what is contained in a single cell/element.
Let's take a look at the first example provided in the documentation:
x = np.array([(1, 2., 'Hello'), (2, 3., "World")],
             dtype=[('foo', 'i4'), ('bar', 'f4'), ('baz', 'S10')])
Here we have created a one-dimensional array of length 2. Each element of this array is a structure that contains three items, a 32-bit integer, a 32-bit float, and a string of length 10 or less.
Without passing in dtype, however, we will get a 2 by 3 matrix.
With this method and a properly set dtype, we can make numpy treat each row of a higher-dimensional array as a single element.
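For example (my own sketch, reusing Joe's A), viewing the array with a row-sized dtype makes each row show up as one structured element:
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])
row_dtype = {'names': ['f0', 'f1'], 'formats': 2 * [A.dtype]}

print(A.view(row_dtype))
# [[(1, 4)]
#  [(2, 5)]
#  [(3, 6)]]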
Another trick Joe showed is that we don't actually need to build a new numpy array to achieve this. We can use the view function (see ndarray.view) to change the way numpy interprets the data. There is a Notes section in ndarray.view that I think you should read before using this method. I cannot guarantee that there are no side effects. The paragraph below is from that Notes section and seems to call for caution.
For a.view(some_dtype), if some_dtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the behavior of the view cannot be predicted just from the superficial appearance of a (shown by print(a)). It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
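With that caveat in mind, here is a sketch of how the same trick extends to setdiff1d (one of the functions from my original question), assuming both arrays are C-contiguous and share a dtype:
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])

nrows, ncols = A.shape
row_dtype = {'names': ['f{}'.format(i) for i in range(ncols)],
             'formats': ncols * [A.dtype]}

# rows of A that do not appear in B
D = np.setdiff1d(A.view(row_dtype), B.view(row_dtype))
D = D.view(A.dtype).reshape(-1, ncols)
print(D)  # [[2 5]]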
Other references
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.dtypes.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.dtype.html
I have been trying to apply np.nansum to an xr.Dataset (xarray), but keep running into errors. For a 3D dataset, I try to apply it along axis=2. The syntax is not quite clear and I may have misunderstood the documentation, but I have tried:
ds.apply(np.nansum, axis=2) and ds.apply(lambda x: np.nansum(x, axis=2))
and get the same error:
cannot set variable 'var' with 2-dimensional data without explicit
dimension names. Pass a tuple of (dims, data) instead.
I am guessing this means that it does not know what dimension names to return to the new dataset object? Any ideas how to fix this?
And does anyone know why and when xarray might implement np.nansum()?
Thanks
The problem you're running into here is that nansum returns a numpy ndarray, and not a DataArray, which is what the function passed into apply is supposed to return.
For nansum, you should just use xarray.Dataset.sum, which skips NaNs by default if your data is float.
Jeremy is correct that the built-in sum() method already skips NaN by default. But if you want to supply a custom aggregation function, you can do so with reduce, e.g., ds.reduce(np.nansum, axis=2).
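A minimal sketch of both suggestions; the variable and dimension names here are made up for illustration:
import numpy as np
import xarray as xr

ds = xr.Dataset({'var': (('x', 'y', 'time'), np.random.rand(2, 3, 4))})
ds['var'][0, 0, 0] = np.nan

nan_skipping = ds.sum(dim='time')              # built-in sum skips NaN for float data
via_reduce = ds.reduce(np.nansum, dim='time')  # same idea with a custom aggregation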
I want to initialise an array that will hold some data. I have created a random matrix (using np.empty) and then multiplied it by np.nan. Is there anything wrong with that? Or is there a better practice that I should stick to?
To further explain my situation: I have data I need to store in an array. Say I have 8 rows of data. The number of elements in each row is not equal, so my matrix row length needs to be as long as the longest row. In other rows, some elements will not be filled. I don't want to use zeros since some of my data might actually be zeros.
I realise I could use some value I know my data will never contain, but NaNs are definitely clearer. I am just wondering if that can cause any issues later with processing. I realise I need to use nanmax instead of max and so on.
I have created a random matrix (using np.empty) and then multiplied it by np.nan. Is there anything wrong with that? Or is there a better practice that I should stick to?
You can use np.full, for example:
np.full((100, 100), np.nan)
However depending on your needs you could have a look at numpy.ma for masked arrays or scipy.sparse for sparse matrices. It may or may not be suitable, though. Either way you may need to use different functions from the corresponding module instead of the normal numpy ufuncs.
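For the ragged-rows situation in the question, a small sketch (with made-up data) of NaN padding combined with a masked array:
import numpy as np

rows = [[1.0, 2.0, 0.0], [3.0], [4.0, 5.0]]   # unequal row lengths; the zeros are real data
width = max(len(r) for r in rows)

data = np.full((len(rows), width), np.nan)
for i, r in enumerate(rows):
    data[i, :len(r)] = r

masked = np.ma.masked_invalid(data)           # masks the NaN padding
print(masked.max(axis=1))                     # per-row max, padding ignored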
A way I like to do it, which probably isn't the best but is easy to remember, is adding a 'nans' function to the numpy module this way:
import numpy as np
def nans(n):
    return np.array([np.nan for i in range(n)])
setattr(np, 'nans', nans)
and now you can simply use np.nans as if it were np.zeros:
np.nans(10)
I want to check if two csr_matrix are equal.
If I do:
x.__eq__(y)
I get:
raise ValueError("The truth value of an array with more than one "
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
This, however, works well:
assert (z in x for z in y)
Is there a better way to do it? maybe using some scipy optimized function instead?
Thanks so much
Can we assume they are the same shape?
In [202]: a=sparse.csr_matrix([[0,1],[1,0]])
In [203]: b=sparse.csr_matrix([[0,1],[1,1]])
In [204]: (a!=b).nnz==0
Out[204]: False
This checks the sparsity of the inequality array.
It will give you an efficiency warning if you try a==b (at least the 1st time you use it). That's because it has to test all those zeros. It can't take much advantage of the sparsity.
You need a relatively recent version to use logical operators like this. Were you trying to use x.__eq__(y) in some if expression, or did you get error from just that expression?
In general you probably want to check several parameters first. Same shape, same nnz, same dtype. You need to be careful with floats.
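A rough sketch of that checklist as a helper (the function name is mine, not a scipy API):
from scipy import sparse

def sparse_equal(x, y):
    # cheap structural checks first, then the (x != y).nnz test from above
    if x.shape != y.shape or x.nnz != y.nnz or x.dtype != y.dtype:
        return False
    return (x != y).nnz == 0

a = sparse.csr_matrix([[0, 1], [1, 0]])
b = sparse.csr_matrix([[0, 1], [1, 1]])
print(sparse_equal(a, b))  # False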
For dense arrays np.allclose is a good way of testing equality. And if the sparse arrays aren't too large, that might be good as well
np.allclose(a.A, b.A)
allclose uses all(less_equal(abs(x-y), atol + rtol * abs(y))). You can use a-b, but I suspect that this too will give an efficiency warning.
SciPy and Numpy Hybrid Method
What worked best for my case was (using a generic code example):
bool_answer = np.array_equal(sparse_matrix_1.todense(), sparse_matrix_2.todense())
You might need to pay attention to the equal_nan parameter in np.array_equal.
The following doc references helped me get there:
CSR Sparse Matrix Methods
CSC Sparse Matrix Methods
Numpy array_equal method
SciPy todense method
I am wondering if there is a Python or Pandas function that approximates the Ruby #each_slice method. In this example, the Ruby #each_slice method will take the array or hash and break it into groups of 100.
var.each_slice(100) do |batch|
  # do some work on each batch
end
I am trying to do this same operation on a Pandas dataframe. Is there a Pythonic way to accomplish the same thing?
I have checked out this answer: Python equivalent of Ruby's each_slice(count)
However, it is old and is not Pandas specific. I am checking it out but am wondering if there is a more direct method.
There isn't a built-in method as such, but you can use numpy's array_split; you can pass the dataframe to it along with the number of slices.
In order to get slices of roughly 100 rows you'll have to calculate that number yourself, which is simply the number of rows divided by 100:
import numpy as np
# df.shape returns the dimensions in a tuple, the first dimension is the number of rows
np.array_split(df, df.shape[0] // 100)
This returns a list of dataframes sliced as evenly as possible.
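Putting it together, a small sketch of the batch loop (toy dataframe; this assumes np.array_split still accepts a DataFrame in your numpy/pandas versions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(250)})
n_sections = max(1, df.shape[0] // 100)   # number of ~100-row slices, at least one

for batch in np.array_split(df, n_sections):
    # do some work on each batch
    print(len(batch))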