I have been trying to apply np.nansum to an xr.Dataset (xarray), but keep running into errors. For a 3D dataset, I am trying to apply it along axis=2. The syntax is not quite clear to me and I may have misunderstood the documentation, but I have tried:
`ds.apply(np.nansum, axis=2)` and `ds.apply(lambda x: np.nansum(x, axis=2))`
and get the same error:
cannot set variable 'var' with 2-dimensional data without explicit
dimension names. Pass a tuple of (dims, data) instead.
I am guessing this means that it does not know what dimension names to give to the new dataset object? Any ideas how to fix this?
And does anyone know whether and when xarray might implement np.nansum()?
Thanks
The problem you're running into here is that nansum returns a numpy ndarray, not a DataArray, which is what the function passed into apply is supposed to return.
For nansum, you should just use xarray.Dataset.sum, which skips NaNs by default if your data is float.
Jeremy is correct that the built-in sum() method already skips NaN by default. But if you want to supply a custom aggregation function, you can do so with reduce, e.g., ds.reduce(np.nansum, axis=2).
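To make both suggestions concrete, here is a minimal sketch; the dimension names x, y and time are placeholders for whatever your dataset actually uses:
import numpy as np
import xarray as xr

# a small 3D dataset with placeholder dimension names
ds = xr.Dataset({"var": (("x", "y", "time"), np.random.rand(2, 3, 4))})
ds["var"][0, 0, 0] = np.nan

# the built-in sum skips NaNs by default for float data
summed = ds.sum(dim="time")

# reduce accepts a custom aggregation; naming the dimension avoids
# having to know which axis number it corresponds to
reduced = ds.reduce(np.nansum, dim="time")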
Related
Suppose I want the default data type to be np.uint8, in such a way that when I call:
a = 2
print(type(a))
I get numpy.uint8 as the output.
Is it possible to achieve this?
It's not possible, at least not without significant effort, and it was discouraged when discussed on Numpy's issue tracker as 'unlikely to add', for good reason.
The easiest thing to do is either to use a function that takes the input and casts it to the desired data type, or to check out this post on how to 'overload' your numpy functions to always use e.g. dtype=uint8.
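As an illustration of the first option, a tiny helper (u8 is a made-up name) could look like this:
import numpy as np

# hypothetical helper: cast at creation time instead of trying to
# change numpy's global default
def u8(x):
    return np.uint8(x)

a = u8(2)
print(type(a))  # <class 'numpy.uint8'>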
I just started learning Python. I am trying to change a column's data type from object to float to take out the mean. I have tried changing [] to () and even the "". I don't know whether it makes a difference or not. Please help me figure out what the issue is. Thanks!!
My code:
df["normalized-losses"]=df["normalized-losses"].astype(float)
The error I see is attached as an image.
Use:
df['normalized-losses'] = df['normalized-losses'][~(df['normalized-losses'] == '?' )].astype(float)
Using df.normalized-losses leads to the interpreter evaluating df.normalized, which doesn't exist. The statement you have written executes (df.normalized) - (losses.astype(float)). There also appears to be a question mark in your data which can't be converted to float. The statement above converts to float only those rows which don't contain a question mark and drops the rest. If you don't want to drop those rows, you can replace the question marks with 0 using:
df['normalized-losses'] = df['normalized-losses'].replace('?', 0.0)
df['normalized-losses'] = df['normalized-losses'].astype(float)
Welcome to Stack Overflow, and good luck on your Python journey! An important part of coding is learning how to interpret error messages. In this case, the traceback is quite helpful - it is telling you that you cannot call normalized after df, since a dataframe does not have a method of this name.
Of course you weren't trying to call something called normalized, but rather the normalized-losses column. The way to do this is as you already did once - df["normalized-losses"].
As to your main problem - if even one of your values can't be converted to a float, the column-wide operation will fail. This is very common. You need to first eliminate all of the non-numerical items in the column; one way to find them is with df[~df['normalized-losses'].str.isnumeric()].
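For instance, assuming '?' is the only offending value (as it is in this dataset), one common pattern is to replace it with NaN so that the cast succeeds and mean() skips it:
import numpy as np
import pandas as pd

# a made-up frame standing in for the real data
df = pd.DataFrame({"normalized-losses": ["164", "?", "128"]})

# '?' becomes NaN, after which the cast to float succeeds
df["normalized-losses"] = df["normalized-losses"].replace("?", np.nan).astype(float)
print(df["normalized-losses"].mean())  # 146.0; the NaN is skipped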
The "df.normalized-losses" does not signify anything to Python in this case. You can replace it with df["normalized-losses"]. Usually, if you try
df["normalized-losses"]=df["normalized-losses"].astype(float)
This should work. What this does is take the normalized-losses column from the dataframe, convert it to float, and reassign it to the same column of the dataframe. But sometimes the data might need some processing before you try the above statement.
You can't use - in an attribute or variable name. Perhaps you mean normalized_losses?
I happened onto this when trying to find the means/sums of non-nan elements in rows of a pandas dataframe. It seems that
df.apply(np.mean, axis=1)
works fine.
However, applying np.mean to a numpy array containing nans returns a nan.
Is this all spec'd out somewhere? I would not want to get burned down the road...
numpy's mean function first checks whether its input has a mean method, as #EdChum explains in this answer.
When you use df.apply, the input passed to the function is a pandas.Series. Since pandas.Series has a mean method, numpy uses that instead of using its own function. And by default, pandas.Series.mean ignores NaN.
You can access the underlying numpy array by the values attribute and pass that to the function:
df.apply(lambda x: np.mean(x.values), axis=1)
This will use numpy's version.
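A small demonstration of the difference, with a made-up frame containing a NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 4.0]})

# pandas.Series.mean skips the NaN in the first row
print(df.apply(np.mean, axis=1))                      # [1.0, 3.0]

# the raw numpy function propagates it instead
print(df.apply(lambda x: np.mean(x.values), axis=1))  # [nan, 3.0]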
Divakar has correctly suggested using np.nanmean.
If I may answer the question still standing, the semantics differ because Numpy supports masked arrays, while Pandas does not.
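For completeness, the masked-array route in plain numpy looks like this:
import numpy as np

arr = np.array([1.0, np.nan, 3.0])
masked = np.ma.masked_invalid(arr)  # mask the NaN instead of removing it
print(masked.mean())                # 2.0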
I am using scipy's curve_fit to fit a function to some data, and receive the following error;
Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
which points me to this line in my code;
popt_r, pcov = curve_fit(
    self.rightFunc, np.array(wavelength)[beg:end][edgeIndex+30:],
    np.dstack(transmitted[:,:,c][edgeIndex+30:])[0][0],
    p0=[self.m_right, self.a_right])
rightFunc is defined as follows;
def rightFunc(self, x, m, const):
    return np.exp(-(m*x + const))
As I understand it, the 'O' type refers to a python object, but I can't see what is causing this error.
Complete Error:
Any ideas for what I should investigate to get to the bottom of this?
Just in case it could help someone else: I used numpy.array(wavelength, dtype='float64') to force the conversion of the objects in the list to numpy's float64. Works well for me.
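For example, with a made-up list standing in for wavelength:
import numpy as np

wavelength = [400, None, "420.5"]            # mixed contents give dtype('O')
print(np.array(wavelength).dtype)            # object

arr = np.array(wavelength, dtype='float64')  # convert every element explicitly
print(arr)                                   # [400.    nan  420.5]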
Typically these scipy functions require parameters like:
curve_fit(function, initial_values, (aux_values,), ...)
where the tuple of aux_values is passed through to your function along with the current value of the main variable.
Is the dstack expression this aux_values, or a concatenation of several? It may need to be wrapped in a tuple:
(np.dstack(transmitted[:,:,c][edgeIndex+30:])[0][0],)
We may need to know exactly where this error arises, not just which line of your code does it. We need to know what value is being converted. Where is there an array with dtype object?
Just to clarify: I had the same problem and did not see the right answers in the comments before solving it on my own, so I just repeat them here:
I have resolved the issue. I was passing an array with one element to the p0 list, rather than the element itself. Thank you for your help – Jacobadtr Sep 12 at 17:51
An O dtype often results when constructing an array from a list of sublists that differ in size. If np.array(...) can't make a clean n-d array of numbers, it resorts to making an array of objects. – hpaulj Sep 12 at 17:15
That is, make sure that the tuple of parameters you pass to curve_fit can be properly cast to a numpy array.
From here, apparently numpy struggles with the index type. The proposed solution is:
One thing you can do is use np.intp as dtype whenever things have to do with indexing or are logically related to indexing/array sizes. This is the natural dtype for it and it will normally also be the fastest one.
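A small illustration of that advice, with made-up data:
import numpy as np

idx = np.array([0, 2, 3], dtype=np.intp)  # the natural dtype for index arrays
data = np.arange(5) * 1.5
print(data[idx])                          # [0.  3.  4.5]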
Does this help?
I have been looking at Pandas: run length of NaN holes, and this code fragment from the comments in particular:
Series([len(list(g)) for k, g in groupby(a.isnull()) if k])
As a python newbie, I am very impressed by the conciseness but not sure how to read this. Is it short for something along the lines of
myList = []
for k, g in groupby(a.isnull()):
    if k:
        myList.append(len(list(g)))
Series(myList)
In order to understand what is going on I was trying to play around with it but get an error:
list object is not callable
so not much luck there.
It would be lovely if someone could shed some light on this.
Thanks,
Anne
You've got the translation correct. However, the code you give cannot be run because a is a free variable.
My guess is that you are getting the error because you have assigned a list object to the name list. Don't do that, because list is a global name for the type of a list.
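A minimal reproduction of that mistake:
# shadowing the built-in name triggers the same error
list = [1, 2, 3]
try:
    list("abc")
except TypeError as e:
    print(e)  # 'list' object is not callable
del list      # restore access to the built-in type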
Also, in future please always provide a full stack trace, not just one part of it. Please also provide sufficient code that at least there are no free variables.
If that is all of your code, then you have only a few possibilities:
myList.append is really a list
len is really a list
list is really a list
isnull is really a list
groupby is really a list
Series is really a list
The error exists somewhere behind groupby.
I'm going to go ahead and strike out myList.append (that one is impossible), groupby (unless you are using your own groupby function for some reason) and Series. Unless you are importing Series from somewhere strange, or you are re-assigning the variable, we know Series can't be a list. A similar argument can be made for a.isnull.
So that leaves us with two real possibilities. Either you have re-assigned something somewhere in your script to be a list where it shouldn't be, or the error is behind groupby.
I think you're using the wrong groupby: itertools.groupby takes an array or list as an argument, while groupby in pandas may evaluate the first argument as a function. I especially think this because isnull() returns an array-like object.
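To close the loop, here is a self-contained version of the one-liner, explicit about which groupby is meant and using a made-up series:
import pandas as pd
from itertools import groupby

a = pd.Series([1.0, None, None, 3.0, None, 5.0])

# itertools.groupby clusters consecutive equal values; keeping only the
# True groups yields the lengths of the NaN runs
runs = pd.Series([len(list(g)) for k, g in groupby(a.isnull()) if k])
print(runs.tolist())  # [2, 1]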