Combining DataArrays in an xarray Dataset - python

Is there a nicer way of summing over all the DataArrays in an xarray Dataset than
sum(d for d in ds.data_vars.values())
This works, but seems a bit clunky. Is there an equivalent to summing over pandas DataFrame columns?
Note that the ds.sum() method applies to each of the DataArrays individually, whereas I want to combine the DataArrays.

I assume you want to sum each data variable as well, e.g., sum(d.sum() for d in ds.data_vars.values()). In a future version of xarray (not yet in v0.10) this will be more succinct: you will be able to write sum(d.sum() for d in ds.values()).
Another option is to convert the Dataset into a single DataArray and sum it at once, e.g., ds.to_array().sum(). This will be less efficient if you have data variables with different dimensions.
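For concreteness, a small sketch of the to_array() approach; the dataset here is made up for illustration, and "variable" is the default name of the dimension that to_array() creates:

import numpy as np
import xarray as xr

ds = xr.Dataset({
    "a": ("x", np.arange(3.0)),
    "b": ("x", 10 * np.arange(3.0)),
})

# Element-wise combination of all data variables into a single DataArray
# (same result as sum(d for d in ds.data_vars.values()))
combined = ds.to_array().sum(dim="variable")

# Grand total over all variables and all dimensions
total = ds.to_array().sum()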

Related

How do I append new data to the DataArray along existing dimension in-place?

I have data grouped into a 3-D DataArray named 'da' with dimensions 'time', 'indicators', and 'coins', using Dask as a backend.
I need to select the data for a particular indicator, calculate a new indicator based on it, and append this newly calculated indicator to da along the indicators dimension under a new indicator name (let's call it daily_return). In somewhat simplistic terms of a 2-D analogy, I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D.
So far I've tried apply_ufunc() with both drop=False (in which case I get a scalar indicators coordinate on the resulting DataArray) and drop=True (in which case indicators is dropped entirely), following the corresponding tutorial:
dr_func = lambda today, yesterday: today / yesterday - 1  # Assuming for simplicity that yesterday != 0

today_da = da.sel(indicators="price_daily_close_usd", drop=False)  # or drop=True
yesterday_da = da.shift(time=1).sel(indicators="price_daily_close_usd", drop=False)  # or drop=True

dr = xr.apply_ufunc(
    dr_func,
    today_da,
    yesterday_da,
    dask="allowed",
    dask_gufunc_kwargs={"allow_rechunk": True},
    vectorize=True,
)
Obviously, in the case of drop=True I cannot concat the da and dr DataArrays, since indicators is not present among dr's coordinates.
In the case of drop=False, in turn, I've managed to concat these DataArrays along indicators; however, the resulting indicators coord then contains two identically named coordinate values, namely "price_daily_close_usd":
...while the second of them should be renamed to "daily_return".
I've also tried to extract the needed data from dr through .sel(), but failed due to the absence of an index along the indicators dimension (as far as I understand, it's not possible to set an index in this case, since the dimension is scalar):
dr.sel(indicators="price_daily_close_usd") # Would result in KeyError: "no index found for coordinate 'indicators'"
Moreover, the solution above is not in-place, i.e. it creates a new combined DataArray instance instead of modifying da, while the latter would be highly preferable.
How can I append new data to da along an existing dimension, ideally in-place?
Loading all the data directly into RAM is hardly possible due to its huge volume, which is why Dask is being used.
I'm also not wedded to the DataArray data structure, and it would be no problem to switch to a Dataset if it has more suitable methods for solving my problem.
Xarray does not support appending in-place. Any change to the shape of your array will need to produce a new array.
If you want to work with a single array and know the size of the final array, you could generate an empty array and assign values based on coordinate labels.
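A minimal sketch of that pre-allocation pattern, reusing the question's da (and a dr computed with drop=True), and assuming the array is NumPy-backed and carries coordinate labels on all three dimensions; with a dask backend, label-based assignment may not be supported, so treat this as an illustration of the idea rather than a drop-in solution:

import numpy as np
import xarray as xr

# New coordinate that already includes the extra indicator label
indicators = list(da["indicators"].values) + ["daily_return"]

# Pre-allocate an array with the known final shape
full = xr.DataArray(
    np.empty((da.sizes["time"], len(indicators), da.sizes["coins"]), dtype=da.dtype),
    dims=("time", "indicators", "coins"),
    coords={"time": da["time"], "indicators": indicators, "coins": da["coins"]},
)

# Assign values based on coordinate labels
full.loc[{"indicators": list(da["indicators"].values)}] = da  # existing indicators
full.loc[{"indicators": "daily_return"}] = dr                 # the newly computed one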
I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D.
Xarray's Dataset is a better analog to the pandas DataFrame. The Dataset is a dict-like container storing N-D arrays (DataArrays), just like the DataFrame is a dict-like container storing 1-D arrays (Series).
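Under that analogy, here is a minimal sketch of what this could look like for the question's array; the indicator names follow the question, but the exact calls are an assumption rather than the asker's code:

# One data variable per indicator; dask-backed variables stay lazy through these operations
ds = da.to_dataset(dim="indicators")

# Adding a new indicator is then just a dict-style assignment, no concat needed
close = ds["price_daily_close_usd"]
ds["daily_return"] = close / close.shift(time=1) - 1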

Lookup table in xarray?

So I want to merge 2 datasets: one is a single-band raster dataset that came from rioxarray.open_rasterio(), the other a lookup table with an index dim 'mukey'. The raster's values correspond to the 'mukey' index values in the lookup table. The desired result is a dataset with x and y coords identical to the raster dataset, with variables 'n' and 'K' whose values are populated by merging on the 'mukey'. If you are familiar with ArcGIS, this is the analogous operation.
xr.merge() and assign() don't quite seem to perform this operation, and cheating by converting to pandas or numpy hits memory issues on my 32 GB machine. Does xarray provide any support for this simple use case? Thanks.
import numpy as np
import pandas as pd
import xarray as xr

data = np.abs(np.random.randn(12000000)).astype(np.int32).reshape(1, 4000, 3000)
raster = xr.DataArray(data, dims=['band', 'y', 'x'], coords=[[1], np.arange(4000), np.arange(3000)])
raster = raster.to_dataset(name='mukey')
raster

lookup = pd.DataFrame({'mukey': list(range(10)), 'n': np.random.randn(10), 'K': np.random.randn(10) * 2}).set_index('mukey').to_xarray()
lookup
You're looking for the advanced indexing with DataArrays feature of xarray.
You can provide a DataArray as a keyword argument to DataArray.sel or Dataset.sel - this will reshape the indexed array along the dimensions of the indexing array, based on the values of the indexing array. I think this is exactly what you're looking for in a "lookup table".
In your case:
lookup.sel(mukey=raster.mukey)
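Continuing with the raster and lookup objects defined in the question, a sketch of how the result could be attached back to the raster; the drop_vars/merge steps are an assumption about the desired output rather than part of the original answer:

# 'n' and 'K' are looked up point-wise, so the result has dims ('band', 'y', 'x')
result = lookup.sel(mukey=raster.mukey)

# Optionally drop the 'mukey' coordinate carried over from the indexer,
# then merge the new variables back into the original raster dataset
merged = xr.merge([raster, result.drop_vars('mukey')])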

Efficiently perform cheap calculations on many (1e6-1e10) combinations of rows in a pandas dataframe in python

I need to perform some simple calculations on a large number of combinations of rows or columns for a pandas dataframe. I need to figure out how to do so most efficiently because the number of combinations might go up above a billion.
The basic approach is easy--just performing means, comparison operators, and sums on subselections of a dataframe. But the only way I've figured out involves doing a loop over the combinations, which isn't very pythonic and isn't super efficient. Since efficiency will matter as the number of samples goes up I'm hoping there might be some smarter way to do this.
Right now I am building the list of combinations and then selecting those rows and doing the calculations using built-in pandas tools (see pseudo-code below). One possibility is to parallelize this, which should be pretty easy. However, I wonder if I'm missing a deeper way to do this more efficiently.
A few thoughts, ordered from big to small:
Is there some smart pandas/python or even some smart linear algebra way to do this? I haven't figured such out, but want to check.
Is the best approach to stick with pandas? Or convert to a numpy array and just do everything using numeric indices there, and then convert back to easier-to-understand data-frames?
Is the built-in mean() the best approach, or should I use some kind of apply()?
Is it faster to select rows or columns in any way? The matrix is symmetric so it's easy to grab either.
I'm currently selecting 18 rows because each of the 6 rows actually has three entries with slightly different parameters; I could combine those into individual rows beforehand if selecting 6 rows rather than 18 is faster for some reason.
Here's a rough sketch of what I'm doing:
from itertools import combinations
import pandas as pd

df = from_excel()  # placeholder for loading the data; test case is 30 rows & cols
df = df.set_index('Col1')  # column and row 1 are names, the rest are the actual matrix values
allSets = combinations(df.columns, 6)

temp = []
for s in allSets:
    avg1 = df.loc[list(s)].mean().mean()
    cnt1 = df.loc[list(s)].gt(0).sum().sum()
    temp.append([s, avg1, cnt1])

dfOut = pd.DataFrame(temp, columns=['Set', 'Average', 'Count'])
A few general considerations that should help:
Not that I know of, though the best place to ask is Mathematics or Math Professionals, and it is worth a try. There may be a better way to frame the question if you are doing something very specific with the results - looking for a minimum/maximum, etc.
In general you are right that pandas, as a layer on top of NumPy, is probably not speeding things up. However, most of the heavy lifting is done at the NumPy level, and until you are sure pandas is to blame, keep using it.
mean is better than your own function applied across rows or columns because it uses the C implementation of mean in NumPy under the hood, which is always going to be faster than Python.
Given that pandas organizes data in a columnar fashion (i.e. each column is a contiguous NumPy array), it is better to go row-wise.
It would be great to see an example of data here.
Now, some comments on the code:
use iloc and numeric indices instead of loc - it is way faster
it is unnecessary to turn the tuple into a list here: df.loc[list(s)].gt(0).sum().sum()
just use: df.loc[s].gt(0).sum().sum()
you should rather use a generator instead of a for loop where you append elements to a temporary list (this is awfully slow and unnecessary, because you are creating a pandas DataFrame either way). Also, use tuples instead of lists wherever possible for maximum speed:
def gen_fun():
    allSets = combinations(df.columns, 6)
    for s in allSets:
        avg1 = df.loc[list(s)].mean().mean()
        cnt1 = df.loc[list(s)].gt(0).sum().sum()
        yield (s, avg1, cnt1)

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])
Another thing is that you can preprocess the dataframe to use only the values that are positive, avoiding the gt(0) operation in each loop. In this way you spare both memory and CPU time.
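A minimal sketch of that preprocessing idea, under the assumption (taken from the question's code) that each combination selects whole rows of the symmetric matrix, so both statistics can be rebuilt from per-row aggregates computed once:

from itertools import combinations
import pandas as pd

# Per-row aggregates, computed a single time
row_means = df.mean(axis=1)     # mean of each full row
row_pos = df.gt(0).sum(axis=1)  # number of positive entries in each full row

def gen_fast():
    for s in combinations(df.columns, 6):
        idx = list(s)
        # All rows have equal length, so the mean of the 6-row block
        # equals the mean of the 6 row-means.
        yield (s, row_means.loc[idx].mean(), row_pos.loc[idx].sum())

dfOut = pd.DataFrame(gen_fast(), columns=['Set', 'Average', 'Count'])

This reduces the per-combination work from aggregating a 6-row block to combining six precomputed scalars.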

Dataframe C or Fortran ordering in pandas

I want to create pandas dataframes and be able to manipulate them in optimised code (numba).
Most optimised code will take series or dataframe inputs and store results in preallocated outputs.
from numba import njit

@njit
def calc(u, v, w):
    for i in range(w.shape[0]):
        w[i] = some_f(u[i], v[i])
where some_f is a placeholder for operations that can be complex, with tests and loops, hence the use of numba. Most importantly, I want to avoid any useless copies of data in the process.
For vectorized functions such as the above, I want to use the same code for series and dataframes.
So for series u, v, w, I'll use:
calc(u.values, v.values, w.values)
For dataframes, I thought of reusing the same function with
calc(u.values.reshape(-1), v.values.reshape(-1), w.values.reshape(-1))
This only works if
The array ordering of the dataframe (C or Fortran) is consistent across the three dataframes
The reshape method is passed an argument order='C' or 'F' matching the original dataframe ordering, otherwise a copy is made.
Pandas does not seem to have a consistent policy for dataframe ordering.
For instance, the constructor
df = pandas.DataFrame(index=..., columns=..., data=0.0)
will return a C-ordered array, while
df.copy()
will be Fortran-ordered.
I wanted to know if people here have encountered similar issues and found a consistent way to ensure that dataframes are always in the same order (C or Fortran), without cluttering ordinary pandas code too much.
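For illustration, a small sketch of how the layout could be detected at the call site so that the reshape stays copy-free; this is an assumed workaround, not an established pandas policy:

def flat_view(df):
    # Return a 1-D view of a single-dtype DataFrame's values without copying.
    values = df.values
    order = 'C' if values.flags['C_CONTIGUOUS'] else 'F'
    flat = values.reshape(-1, order=order)  # a view as long as the order matches the block layout
    assert flat.base is not None, "reshape made a copy; the block is not contiguous"
    return flat

# calc(flat_view(u), flat_view(v), flat_view(w))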

Pandas: Series of arrays to series of transposed arrays

Ok, this is an easy one, I hope.
Using Pandas, I have a Series of 100 equal length Numpy arrays each with 30000 elements. I'd like to quickly transpose them into a series of 30000 arrays with 100 elements.
I of course can do it with list comprehensions or pulling the arrays but is there an efficient Pandas way to do it? Thanks!
UPDATE:
As per the request by @Alexander to make this a better example, here is some toy data.
import numpy as np
import pandas

s1 = pandas.Series([np.array(range(10)) for i in range(10)])
And what I want returned in this example is:
s2 = pandas.Series([np.ones(10)*i for i in range(10)])
That is, an element-wise transpose of a Series of arrays into a new Series of arrays. Thanks!
OK, this actually works. Anyone have a more efficient solution?
pandas.Series(np.asarray(s1.tolist()).T.tolist())
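For comparison, an equivalent sketch that stays in NumPy and keeps the elements as arrays rather than lists (an assumed alternative, not part of the original answer):

stacked = np.stack(s1.tolist())      # shape (100, 30000) in the original problem
s2 = pandas.Series(list(stacked.T))  # one 1-D array per original element position

This keeps each element a NumPy array and avoids converting the transposed matrix back into nested Python lists.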
