xarray/numpy how to create large views on arrays? - python
For later processing convenience, I want to be able to create a very large view onto an xarray DataArray. Here's a small example that works:
import numpy as np
import xarray as xr
data = xr.DataArray([list(i+np.arange(10,20)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray(list([i]*20 for i in range(len(data))), dims=["y", "i"])
print(data)
print(indices)
selection = data.isel(t=indices)
print("selection:")
print(selection)
Output:
<xarray.DataArray (t: 5, x: 10)>
array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
[12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
[13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])
Dimensions without coordinates: t, x
<xarray.DataArray (y: 5, i: 20)>
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]])
Dimensions without coordinates: y, i
selection:
<xarray.DataArray (y: 5, i: 20, x: 10)>
array([[[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]],
...
[[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]])
Dimensions without coordinates: y, i, x
So we have a data array and an indexing array, and we use vectorised indexing to select from the first array into a structure that is convenient for later use. So far, great. But it looks like xarray isn't very smart about this: it copies a lot of data rather than creating views back onto the original data. Since many entries of the selection refer to the same rows of data, creating it shouldn't need much RAM.
To demonstrate the problem, here's a scaled up version:
data = xr.DataArray([list(i+np.arange(10,20000)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray(list([i]*50000 for i in range(len(data))), dims=["y", "i"])
selection = data.isel(t=indices)
print("big selection")
print(selection)
Output:
Traceback (most recent call last):
File "xarray_test.py", line 15, in <module>
selection = data.isel(t=indices)
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/dataarray.py", line 1183, in isel
ds = self._to_temp_dataset()._isel_fancy(
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/dataset.py", line 2389, in _isel_fancy
new_var = var.isel(indexers=var_indexers)
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/variable.py", line 1156, in isel
return self[key]
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/variable.py", line 777, in __getitem__
data = as_indexable(self._data)[indexer]
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/indexing.py", line 1159, in __getitem__
return array[key]
File "/data2/users/bfarmer/envs/bfarmer_dev_py38_clone_w_numba_TEST/lib/python3.8/site-packages/xarray/core/nputils.py", line 126, in __getitem__
return np.moveaxis(self._array[key], mixed_positions, vindex_positions)
numpy.core._exceptions.MemoryError: Unable to allocate 37.2 GiB for an array with shape (5, 50000, 19990) and data type int64
This shouldn't take 40 GB of RAM, since it's just lots of views onto the same data. Yes, there is some overhead in the indexing, but we should only need one index per row of 20000 in data; we shouldn't have to copy that row of 20000 into a new array.
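To make the numbers concrete, here is a rough back-of-the-envelope check of the allocation in the error message, compared with the size of the source data. The shape and dtype are taken from the traceback; the variable names are just for illustration.

```python
# Size of the array that vectorised indexing tries to materialise.
shape = (5, 50000, 19990)   # (y, i, x), taken from the error message
itemsize = 8                # bytes per int64 element

n_elements = 1
for n in shape:
    n_elements *= n

copy_bytes = n_elements * itemsize
source_bytes = 5 * 19990 * itemsize   # the original (t, x) data

print(f"full copy: {copy_bytes / 2**30:.1f} GiB")   # full copy: 37.2 GiB
print(f"source:    {source_bytes / 2**20:.2f} MiB")
```

So the copy is 50000 times larger than the data it is built from, which is exactly the repetition factor along the i dimension.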
Is there a way to make xarray do this in a smarter way? Can I more explicitly tell it to use Dask or something, or structure things differently somehow?
Actually I'm also totally happy just doing this with straight up numpy if that's easier, or any other library that can do this efficiently. I just used xarray because I thought it would do something smart with this operation, but I guess not, at least not automatically.
Edit: Ok I just found this question suggesting it is impossible with numpy: Can I get a view of a numpy array at specified indexes? (a view from "fancy indexing"). Not sure if this implies that xarray can't do it either though...
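A small check of that claim in plain NumPy, as a sketch on a toy array: basic indexing (an integer or a slice) returns a view, while indexing with a list or array of integers returns a copy.

```python
import numpy as np

data = np.arange(20).reshape(4, 5)

row_view = data[1]            # basic indexing: a view onto row 1
rows_copy = data[[1, 1, 1]]   # fancy indexing: three independent copies of row 1

print(np.shares_memory(row_view, data))    # True
print(np.shares_memory(rows_copy, data))   # False

# Writing to the original shows up in the view but not in the copy.
data[1, 0] = -1
print(row_view[0], rows_copy[0, 0])        # -1 5
```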
Edit 2: Ok, the documentation for xarray.Dataset.sel (https://docs.xarray.dev/en/stable/generated/xarray.Dataset.sel.html) says this:
Returns obj (Dataset) – A new Dataset with the same contents as this
dataset, except each variable and dimension is indexed by the
appropriate indexers. If indexer DataArrays have coordinates that do
not conflict with this object, then these coordinates will be
attached. In general, each array’s data will be a view of the array’s
data in this dataset, unless vectorized indexing was triggered by
using an array indexer, in which case the data will be a copy.
So I guess xarray does try to be smart about selections in general, but not in the vectorised indexing case I want, which is rather annoying... I guess I have to think of a way to do this without vectorised indexing...
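One way to avoid vectorised indexing, at least for the specific index pattern in this example: every entry in row y of indices equals y, so the selection is just data repeated along a new axis. NumPy can express that as a zero-stride view with np.broadcast_to. This is a minimal sketch that assumes that pattern and does not cover arbitrary index arrays; n_t, n_x and n_i are names introduced here.

```python
import numpy as np

n_t, n_x, n_i = 5, 19990, 50000

# Same values as the scaled-up example: row t is t + arange(10, 20000).
data = np.arange(10, 10 + n_x) + np.arange(n_t)[:, None]   # shape (t, x)

# Insert a length-1 axis and broadcast it to length n_i.
# No data is copied: the new axis has stride 0.
selection = np.broadcast_to(data[:, None, :], (n_t, n_i, n_x))

print(selection.shape)                     # (5, 50000, 19990)
print(selection.strides[1])                # 0
print(np.shares_memory(selection, data))   # True
print(selection.flags.writeable)           # False: broadcast views are read-only
```

The result is read-only, and any operation that needs a contiguous array (or a writable one) will still trigger the full copy. xarray's expand_dims is documented as returning a view as well, so it may offer the same effect while keeping dimension names, but the memory behaviour of that route is not verified here.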
Ok so grabbing just one index at a time we can do it without blowing out the ram:
data = xr.DataArray([list(i+np.arange(10,20000)) for i in range(5)], dims=["t", "x"])
indices = xr.DataArray(list([i]*50000 for i in range(len(data))), dims=["y", "i"])
# Ram blowout
#selection = data.isel(t=indices)
#print("big selection")
#print(selection)
# Can do it one index at a time I guess?
ilen, jlen = indices.shape
out_data = [[data.isel(t=indices[i,j]) for i in range(ilen)] for j in range(jlen)]
print("out_data[0][0]:")
print(out_data[0][0])
print("out_data[0][1]:")
print(out_data[0][1])
Output:
out_data[0][0]:
<xarray.DataArray (x: 19990)>
array([ 10, 11, 12, ..., 19997, 19998, 19999])
Dimensions without coordinates: x
out_data[0][1]:
<xarray.DataArray (x: 19990)>
array([ 11, 12, 13, ..., 19998, 19999, 20000])
Dimensions without coordinates: x
But of course this loses rather a lot of the xarray convenience I was looking for...
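A possible middle ground, sketched in plain NumPy: keep only the (small) index array and resolve rows on demand, instead of building a nested list of 250000 objects. LazyRowSelection is a hypothetical helper written for this post, not part of NumPy or xarray, and it only supports indexing over the (y, i) part.

```python
import numpy as np


class LazyRowSelection:
    """Hypothetical helper: acts like data[indices] without materialising it."""

    def __init__(self, data, indices):
        self.data = data          # shape (t, x)
        self.indices = indices    # integer array, shape (y, i)

    @property
    def shape(self):
        return self.indices.shape + self.data.shape[1:]

    def __getitem__(self, key):
        # key addresses the (y, i) axes only.
        picked = self.indices[key]
        if np.ndim(picked) == 0:
            # A single row: basic indexing, so this is a view onto data.
            return self.data[int(picked)]
        # Several rows: this copies, but only the block that was asked for.
        return self.data[picked]


data = np.arange(10, 20000) + np.arange(5)[:, None]          # shape (5, 19990)
indices = np.repeat(np.arange(5)[:, None], 50000, axis=1)    # shape (5, 50000)

sel = LazyRowSelection(data, indices)
print(sel.shape)                      # (5, 50000, 19990)

row = sel[1, 0]
print(row[:3])                        # [11 12 13]
print(np.shares_memory(row, data))    # True

block = sel[4, :3]
print(block.shape)                    # (3, 19990)
```

Memory use is then the data plus the index array, with copies made only for the blocks actually requested. The price is that the object is not a real array, so array functions have to be applied block by block.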
Related
Python Append items to dictionary in a loop
I am trying to append values to a dictionary inside a loop, but somehow it's only appending one of the values. I recreated the setup using the same numbers I am dynamically getting. The output from "print(vertex_id_from_shell)" is "{0: [4], 1: [12], 2: [20]}". I need to keep the keys, but add the remaining numbers to the values. Thanks.
shells = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 1: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 2: [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]}
uvsID = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 9, 8, 1, 10, 11, 3, 12, 0, 2, 13, 14, 15, 16, 17, 17, 16, 18, 19, 19, 18, 20, 21, 21, 20, 22, 23, 15, 24, 25, 16, 26, 14, 17, 27, 28, 29, 30, 31, 31, 30, 32, 33, 33, 32, 34, 35, 35, 34, 36, 37, 29, 38, 39, 30, 40, 28, 31, 41]
vertsID = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 1, 0, 1, 7, 5, 3, 6, 0, 2, 4, 8, 9, 11, 10, 10, 11, 13, 12, 12, 13, 15, 14, 14, 15, 9, 8, 9, 15, 13, 11, 14, 8, 10, 12, 16, 17, 19, 18, 18, 19, 21, 20, 20, 21, 23, 22, 22, 23, 17, 16, 17, 23, 21, 19, 22, 16, 18, 20]
vertex_id_from_shell = {}
for shell in shells:
    selection_shell = shells.get(shell)
    #print(selection_shell)
    for idx, item in enumerate(selection_shell):
        if item in uvsID:
            uv_index = uvsID.index(item)
            vertex_ids = vertsID[uv_index]
            vertex_id_from_shell[shell] = [ ( vertex_ids ) ]
print(vertex_id_from_shell)
#{0: [4], 1: [12], 2: [20]}
#desired result {0: [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 7, 0, 1, 7, 5, 6, 4], 1: [8, 9, 11, 10, 13, 12, 15, 14, 9, 8, 15, 13, 14, 12], 2: [16, 17, 19, 18, 21, 20, 23, 22, 17, 16, 23, 21, 22, 20]}
You're overwriting vertex_id_from_shell[shell] each time through the loop, not appending to it. Use collections.defaultdict() to automatically create the dictionary elements with an empty list if necessary, then you can append.
from collections import defaultdict
vertex_id_from_shell = defaultdict(list)
for shell, selection_shell in shells.items():
    for item in selection_shell:
        if item in uvsID:
            uv_index = uvsID.index(item)
            vertex_ids = vertsID[uv_index]
            vertex_id_from_shell[shell].append(vertex_ids)
You are setting vertex_id_from_shell[shell] to a new list containing only one item every time. Instead, you should append to it. But first, of course, that list needs to exist, so you should check and create it if it doesn't already exist.
shells = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 1: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], 2: [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]}
uvsID = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 9, 8, 1, 10, 11, 3, 12, 0, 2, 13, 14, 15, 16, 17, 17, 16, 18, 19, 19, 18, 20, 21, 21, 20, 22, 23, 15, 24, 25, 16, 26, 14, 17, 27, 28, 29, 30, 31, 31, 30, 32, 33, 33, 32, 34, 35, 35, 34, 36, 37, 29, 38, 39, 30, 40, 28, 31, 41]
vertsID = [0, 1, 3, 2, 2, 3, 5, 4, 4, 5, 7, 6, 6, 7, 1, 0, 1, 7, 5, 3, 6, 0, 2, 4, 8, 9, 11, 10, 10, 11, 13, 12, 12, 13, 15, 14, 14, 15, 9, 8, 9, 15, 13, 11, 14, 8, 10, 12, 16, 17, 19, 18, 18, 19, 21, 20, 20, 21, 23, 22, 22, 23, 17, 16, 17, 23, 21, 19, 22, 16, 18, 20]
vertex_id_from_shell = {}
for shell in shells:
    selection_shell = shells.get(shell)
    #print(selection_shell)
    for idx, item in enumerate(selection_shell):
        if item in uvsID:
            uv_index = uvsID.index(item)
            vertex_ids = vertsID[uv_index]
            # if the list does not exist, create it
            if shell not in vertex_id_from_shell:
                vertex_id_from_shell[shell] = []
            # append to list
            vertex_id_from_shell[shell].append(vertex_ids)
print(vertex_id_from_shell)
# {0: [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 7, 5, 6, 4],
# 1: [8, 9, 11, 10, 13, 12, 15, 14, 9, 8, 15, 13, 14, 12],
# 2: [16, 17, 19, 18, 21, 20, 23, 22, 17, 16, 23, 21, 22, 20]}
Split 3D numpy array into 3 different arrays
I have a numpy.array of shape (64, 64, 64). I would like to split it into 3 variables, so that x.shape ==> (64), y.shape ==> (64), z.shape ==> (64), as each dimension represents a voxel coordinate (x, y, z). I tried using dsplit() but had no luck. Any suggestions?
What you're looking for is probably transpose + ravel:
>>> X = np.arange(27).reshape((3,3,3))
>>> X
array([[[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8]],
[[ 9, 10, 11], [12, 13, 14], [15, 16, 17]],
[[18, 19, 20], [21, 22, 23], [24, 25, 26]]])
Your x,y,z:
>>> X.transpose((0,1,2)).ravel()
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26])
>>> X.transpose((1,2,0)).ravel()
array([ 0, 9, 18, 1, 10, 19, 2, 11, 20, 3, 12, 21, 4, 13, 22, 5, 14, 23, 6, 15, 24, 7, 16, 25, 8, 17, 26])
>>> X.transpose((2,0,1)).ravel()
array([ 0, 3, 6, 9, 12, 15, 18, 21, 24, 1, 4, 7, 10, 13, 16, 19, 22, 25, 2, 5, 8, 11, 14, 17, 20, 23, 26])
How to add box plot to scatter data in matplotlib
I have the following data, and after plotting the scatter data points I would like to add a boxplot around each set of positions. Here is my code for plotting the scatter plot:
%matplotlib inline
import matplotlib.pyplot as plt
X = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15]
H = [15, 17, 16, 20, 15, 18, 15, 17, 16, 16, 20, 19, 18, 15, 20, 22, 20, 22, 19, 21, 21, 19, 21, 20, 23, 21, 20, 22, 21, 23, 22, 20, 24, 22, 20, 20, 19, 20, 18, 21, 17, 19, 18, 20, 16, 15, 17, 20, 19, 19, 19, 18, 21, 21, 16, 19, 21, 22, 22, 24, 24, 23, 25, 28, 26, 30, 27, 26, 29, 30, 27, 26, 29, 31, 27, 29, 30, 25, 26, 27, 28, 25, 27, 30, 31, 28, 25, 27, 30, 25, 31, 28, 26, 30, 28, 29, 27, 31, 24, 26, 25, 28, 26, 23, 25]
fig, axes = plt.subplots(figsize=(8,5))
axes.scatter(X, H, color='b')
axes.set_xlabel('Pos')
axes.set_ylabel('H, µm')
When I add plt.boxplot, it captures all the data rather than each individual position. I would appreciate answers in either matplotlib or seaborn. Thanks.
A good way would be using pandas:
import pandas as pd
df = pd.DataFrame({'X': X, 'H': H})
ax = df.plot(kind='scatter', x='X', y='H')
df.boxplot(by='X', ax=ax)
plt.show()
Here's a condensed solution that groups your H array by X and plots it using matplotlib:
groups = [[] for _ in range(max(X))]
for x, h in zip(X, H):
    groups[x - 1].append(h)
plt.boxplot(groups)
You can add a grid with plt.grid(True).
Python: change color of nodes based on its numbers
I have a graph and I need to change the color of some nodes. For example, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29] are all the nodes in the graph, and they are blue. How can I change [0, 1, 2, 3, 4, 5, 8, 9, 11, 15, 19, 21, 22, 23, 24, 25, 29] to red?
How to test if dataset follows zipf's law? (using a Plot in R, tikz or even Python)
I have a very simple problem which I can't solve by myself: I have a small dataset of counts and I want to compare their distribution with Zipf's law. My data is a simple table with one row: 31, 29, 28, 27, 27, 27, 27, 26, 25, 24, 23, 23, 22, 22, 22, 21, 21, 20, 20, 20, 19, 19, 19, 18, 18, 18, 18, 17, 17, 17, 17, 16, 15, 15, 15, 15, 14, 14, 13, 13, 13, 13, 12, 12, 12, 12, 11, 11, 11, 10, 10, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 6, 5, 5, 5, 4, 3, 2, 2, 2, 2, 1, 1. Could anyone help me with this? I have tried the whole day…