How to achieve numpy indexing with xarray Dataset - python

I know the x and the y indices of a 2D array (numpy indexing).
According to the xarray documentation, xarray uses orthogonal (Fortran-style) indexing.
So when I pass e.g.
ind_x = [1, 2]
ind_y = [3, 4]
I expect 2 values for the index pairs (1,3) and (2,4), but xarray returns a 2x2 matrix.
How can I achieve numpy-like indexing with xarray?
Note: I want to avoid loading the whole dataset into memory, so using the .values API is not the solution I am looking for.

You can access the underlying numpy array to index it directly:
import xarray as xr
x = xr.tutorial.load_dataset("air_temperature")
ind_x = [1, 2]
ind_y = [3, 4]
print(x.air.data[0, ind_y, ind_x].shape)
# (2,)
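To see the difference between the two indexing styles in isolation, here is a minimal sketch in plain numpy. The array a is made up for illustration; np.ix_ is used here only to reproduce the orthogonal (2x2) result that the question describes.

```python
import numpy as np

# Made-up 2D array: rows play the role of y, columns the role of x.
a = np.arange(30).reshape(5, 6)

ind_x = [1, 2]
ind_y = [3, 4]

# numpy-style (pointwise) indexing: pairs (3, 1) and (4, 2).
pointwise = a[ind_y, ind_x]

# Orthogonal (outer) indexing: every y index combined with every x index.
outer = a[np.ix_(ind_y, ind_x)]

print(pointwise.shape)  # (2,)
print(outer.shape)      # (2, 2)
```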
Edit:
Assuming you have your data in a dask-backed xarray and don't want to load all of it into memory, you need to use vindex on the dask array behind the xarray data object:
import xarray as xr
# simple chunk to convert to dask array
x = xr.tutorial.load_dataset("air_temperature").chunk({"time":1})
extract = x.air.data.vindex[0, ind_y, ind_x]
print(extract.shape)
# (2,)
print(extract.compute())
# array([267.1, 274.1], dtype=float32)
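As an alternative sketch that stays at the xarray level: xarray supports pointwise (vectorized) indexing when the indexers are DataArray objects that share a dimension. The small array below is a stand-in for the tutorial dataset, and the dimension name "points" is an arbitrary choice. Whether the selection stays lazy depends on the backend, so that is worth checking for your data.

```python
import numpy as np
import xarray as xr

# Stand-in for the tutorial data (made-up values), dims as in air_temperature.
air = xr.DataArray(
    np.arange(2 * 5 * 6, dtype="float32").reshape(2, 5, 6),
    dims=("time", "lat", "lon"),
)

ind_x = [1, 2]
ind_y = [3, 4]

# Plain lists are applied orthogonally: the result is a 2x2 block.
outer = air.isel(time=0, lat=ind_y, lon=ind_x)

# Indexers sharing the (arbitrarily named) dimension "points" are applied pairwise.
points = air.isel(
    time=0,
    lat=xr.DataArray(ind_y, dims="points"),
    lon=xr.DataArray(ind_x, dims="points"),
)

print(outer.shape)   # (2, 2)
print(points.shape)  # (2,)
```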

To take speed into account, I ran a test comparing three different methods.
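from pathlib import Path
from typing import List
import numpy as np
import xarray
from netCDF4 import Dataset
def method_1(file_paths: List[Path], indices) -> List[np.ndarray]:
    # netCDF4: read only the requested indices from each file
    data = []
    for file in file_paths:
        d = Dataset(file, 'r')
        data.append(d.variables['hrv'][indices])
        d.close()
    return data
def method_2(file_paths: List[Path], indices) -> List[np.ndarray]:
    # xarray: .values loads the whole variable before indexing
    data = []
    for file in file_paths:
        data.append(xarray.open_dataset(file, engine='h5netcdf').hrv.values[indices])
    return data
def method_3(file_paths: List[Path], indices) -> List[np.ndarray]:
    # xarray + dask: pointwise indexing via vindex, then compute
    data = []
    for file in file_paths:
        data.append(xarray.open_mfdataset([file], engine='h5netcdf').hrv.data.vindex[indices].compute())
    return data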
In [1]: len(file_paths)
Out[1]: 4813
The results:
method_1 (using netcdf4 library): 101.9s
method_2 (using xarray and values API): 591.4s
method_3 (using xarray+dask): 688.7s
I guess that xarray+dask spends too much time in the .compute() step.

Related

Why is buff/cache getting larger and larger while loading a large number of numpy arrays using a for loop?

I'm working on a project where I need to load a large number of numpy arrays saved on the disk using a for loop. The system I'm using is Linux.
The image below shows the memory usage during the process
As you can see, the part of unavailable memory under buff/cache can be even larger than the used memory. What is saved on this part of the memory? How can I reduce it?
The script used for loading the arrays is something like this:
import numpy as np
tmp = []
slice1, slice2 = [], []
for item in hashes:
    # np.load(item) has a shape (50, 96)
    tmp.append(np.load(item))
tmp = np.concatenate(tmp, axis=0)
mask1 = ...  # a mask used for slicing, a third of the entries will be selected
mask2 = ...  # a different mask for slicing, a third of the entries will be selected
slice1 = tmp[mask1]
slice2 = tmp[mask2]
Based on the short code segment shown, you may be converting Numpy ndarray objects into list objects while manipulating them. Try using all Numpy objects and methods. Also try to avoid for loops and use Numpy vectorized operations instead. 56GB is a huge amount of memory. Yikes! :-)
Possible Numpy code:
import numpy as np
tmp = np.array([0, 1, 2, 3, 4], ndmin=2)
tmp = np.zeros((50, 96))
load_object = np.load(filename)
tmp = np.array(load_object)
# slice1 = [mask range 1]
# slice2 = [mask range 2]
# Numpy vertical stack
result = np.vstack((tmp[slice1], tmp[slice2]))
Numpy Vertical Stack method docs:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html
a = np.array([1, 2, 3])
b = np.array([2, 3, 4])
np.vstack((a,b))
(output)
array([[1, 2, 3],
[2, 3, 4]])
Hopefully that will solve your problem. I will do some more testing later, and edit my answer.
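One way to read the "all Numpy objects" suggestion above is to preallocate the final array and fill it block by block, instead of collecting arrays in a list and concatenating. The sketch below is only an illustration under assumptions: the file names, the number of files, and the mask are made up, and the files are written to a temporary directory so the example is self-contained.

```python
import tempfile
from pathlib import Path

import numpy as np

ROWS, COLS = 50, 96  # shape of each saved array, as in the question

with tempfile.TemporaryDirectory() as tmpdir:
    # Create a few made-up .npy files to stand in for the real ones.
    paths = []
    for k in range(4):
        p = Path(tmpdir) / f"part_{k}.npy"
        np.save(p, np.full((ROWS, COLS), k, dtype=np.float64))
        paths.append(p)

    # Preallocate once, then copy each loaded block into its slot.
    stacked = np.empty((len(paths) * ROWS, COLS), dtype=np.float64)
    for k, p in enumerate(paths):
        stacked[k * ROWS:(k + 1) * ROWS] = np.load(p)

# Hypothetical boolean mask selecting about a third of the rows.
mask1 = np.arange(stacked.shape[0]) % 3 == 0
slice1 = stacked[mask1]

print(stacked.shape, slice1.shape)
```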

H5PY/Numpy - Setting the inner shape of a numpy arrays (for h5py)

I am trying to use h5py to store data as a list of tuples of (images, angles). Images are numpy arrays of size (240,320,3) of type uint8 from OpenCV while angles are just a number of type float16.
When using h5py, you need to have a predetermined shape in order to maintain a usable read/write speed. H5py preallocates the entire dataset with arbitrary values, which you can index later and set to whatever you would like.
I would like to know how to set the shape of an inner numpy array when initializing the shape of a dataset for h5py. I believe the same solution would apply for numpy as well.
import h5py
import numpy as np
dset_length = 100
# fake data of same shape
images = np.ones((dset_length,240,320,3), dtype='uint8') * 255
# fake data of same shape
angles = np.ones(dset_length, dtype='float16') * 90
f = h5py.File('dataset.h5', 'a')
dset = f.create_dataset('dset1', shape=(dset_length,2))
for i in range(dset_length):
    # does not work since the shape of dset[0][0] is a number,
    # and can't store an array datatype
    dset[i] = np.array((images[i],angles[i]))
Recreating the problem in numpy looks like this:
import numpy as np
a = np.array([
[np.array([0,0]), 0],
[np.array([0,0]), 0],
[np.array([0,0]), 0]
])
a.shape # (3, 2)
b = np.empty((3,2))
b.shape # (3, 2)
a[0][0] = np.array([1,1])
b[0][0] = np.array([1,1]) # ValueError: setting an array element with a sequence.
The dtype that @Eric creates should work with both numpy and h5py. But I wonder if you really want or need that. An alternative is to have two arrays in numpy, images and angles, one being 4d uint8, the other float. In h5py you could create a group, and store these 2 arrays as datasets.
You could select the values for the i-th image with
images[i,...], angles[i] # or
data[i]['image'], data[i]['angle']
for example:
import h5py
import numpy as np
dt = np.dtype([('angle', np.float16), ('image', np.uint8, (40,20,3))])
data = np.ones((3,), dt)
f = h5py.File('test.h5','w')
g = f.create_group('data')
dataset with the compound dtype:
g.create_dataset('data', (3,), dtype=dt)
g['data'][:] = data
or datasets with the two arrays
g.create_dataset('image', (3,40,20,3), dtype=np.uint8)
g.create_dataset('angle', (3,), dtype=np.float16)
g['image'][:] = data['image']
g['angle'][:] = data['angle']
fetch angle array from either dataset:
g['data']['angle'][:]
g['angle'][:]
In numpy, you can store that data with structured arrays:
dtype = np.dtype([('angle', np.float16), ('image', np.uint8, (240,320,3))])
data = np.empty(10, dtype=dtype)
data[0]['angle'] = ... # etc
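A slightly fuller sketch of the structured-array approach, using the shapes from the question. The fill values (90 and 255) are just the fake data from the question, and indexing by field name first is one of several equivalent ways to write it.

```python
import numpy as np

# Each record holds one float16 angle and one (240, 320, 3) uint8 image.
dtype = np.dtype([("angle", np.float16), ("image", np.uint8, (240, 320, 3))])
data = np.zeros(10, dtype=dtype)

# Fill the first record with the fake values used in the question.
data["angle"][0] = 90
data["image"][0] = 255

# Accessing a field returns a regular array over all records.
print(data["image"].shape)  # (10, 240, 320, 3)
print(data["angle"].shape)  # (10,)
```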

how to save feature matrix as csv file

I have several image features, and after image pre-processing I have plenty of data that I will need to use frequently in future work. To save time, I want to save the image feature data in CSV format. The following image features are the row attributes: Intensity, Skewness, Kurtosis, Std_deviation, Max5, Min5.
Here every image feature is a numpy array of size (34560,1).
How can I make a CSV file that contains all of these image features?
You can use a structured array if you want to attach attribute names to the numpy array, but that makes things a bit more complicated to use. I would rather save numpy arrays of the same type and store the attribute names somewhere else. That is more straightforward and easier to use.
Example: For the sake of simplicity, let's say that you have three col arrays of length 4: a, b, c
a -> array([[1],
[2],
[3],
[4]])
a.shape -> (4,1)
The b and c arrays have the same shape.
For faster access to the data, it is better to turn each one into a row array so that it is stored contiguously on disk and in memory when loaded.
a = a.ravel(); b = b.ravel(); c = c.ravel()
a -> array([1, 2, 3, 4])
a.shape -> (4,)
Then stack them into one large array.
x = np.vstack((a,b,c))
array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
x.shape -> (3,4)
Then save this stacked array to a CSV file.
np.savetxt('out.csv', x, delimiter=',')
This can be done in one line:
np.savetxt('out.csv', np.vstack((a.ravel(), b.ravel(), c.ravel())), delimiter=',')
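Put together as a self-contained sketch, with a round trip through np.loadtxt to check the file. The arrays a, b, c are the made-up example values from above, and the file is written to a temporary directory rather than the working directory.

```python
import os
import tempfile

import numpy as np

# Three column arrays of shape (4, 1), as in the example above.
a = np.array([[1], [2], [3], [4]])
b = a + 4
c = a + 8

# One row per feature, values stored contiguously.
x = np.vstack((a.ravel(), b.ravel(), c.ravel()))

path = os.path.join(tempfile.mkdtemp(), "out.csv")
np.savetxt(path, x, delimiter=",")

# Read the file back to confirm the layout survived.
loaded = np.loadtxt(path, delimiter=",")
print(loaded.shape)  # (3, 4)
```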
For example, if you have your output in a variable "result", you can save it in CSV format using the commands below:
import pandas as pd
result = "......................"  # (your expression)
result1 = pd.DataFrame({'col1':result[:,0],'col2':result[:,1]})
result1.to_csv('Myresult.csv', header = False)
In place of "Myresult.csv" you can give your desired output path. In my case it would be something like "C:\Users\dinesh.n\Desktop\Myresult.csv".
I hope this clears up your doubt. If not, please excuse me and correct me.
Thank you.

How to copy data from memory to numpy array in Python

For example, I have a variable that points to a vector containing many elements in memory. I want to copy the elements of the vector into a numpy array. How can I do this other than copying them one by one? Thanks.
I am assuming that your vector can be represented like this:
from array import array
x = array('l', [1, 3, 10, 5, 6]) # an array using python's built-in array module
Converting it to a numpy array is then:
import numpy as np
y = np.array(x)
If the data is packed in a buffer in native float format:
a = np.frombuffer(buf, dtype=float, count=N).copy()  # frombuffer returns a view; .copy() makes an independent array
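A small sketch contrasting the two approaches above, assuming the data lives in a Python array.array of doubles (the values are made up). np.array copies the data, while np.frombuffer shares memory with the original buffer until .copy() is called.

```python
from array import array

import numpy as np

# Made-up source buffer of native doubles.
x = array("d", [1.0, 3.0, 10.0, 5.0, 6.0])

copied = np.array(x)                        # independent copy
view = np.frombuffer(x, dtype=np.float64)   # shares memory with x

# Changing the source afterwards shows which one is a copy.
x[0] = 99.0

print(copied[0])  # 1.0
print(view[0])    # 99.0
```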

How to flatten a numpy ndarray along axis?

I have three arrays: longitude (400,600), latitude (400,600) and data (30,400,60). What I am trying to do is extract the values in the data array according to their location (latitude and longitude).
Here is my code:
import numpy
import tables
hdf = "data.hdf5"
h5file = tables.open_file(hdf, mode = "r")
lon = numpy.array(h5file.root.Lonitude)
lat = numpy.array(h5file.root.Latitude)
arr = numpy.array(h5file.root.data)
lon = numpy.array(lon.flat)
lat = numpy.array(lat.flat)
arr = numpy.array(arr.flat)
lonlist=[]
latlist=[]
layer=[]
fre=[]
for i in range(0,len(lon)):
    for j in range(0,30):
        longi = lon[j]
        lati = lat[j]
        layers=[j]
        frequency= arr[i]
        lonlist.append(longi)
        latlist.append(lati)
        layer.append(layers)
        fre.append(frequency)
output = numpy.column_stack((lonlist,latlist,layer,fre))
The problem is that "frequency" is not what I want. I want the data array to be flattened along axis zero, so that "frequency" would be the 30 values at one location. Is there a function in numpy to flatten an ndarray along a particular axis?
You can try np.ravel(your_array), or your_array.shape=-1. The np.ravel function lets you use an optional argument order: choose C for a row-major order or F for a column-major order.
I guess what you actually wanted was just a transpose to change the axis order. Depending on what you do with it, it might be useful to do a .copy() after the transpose to optimize the memory layout, since transpose will not create a copy itself.
Just to add, if you want something beyond F and C order, you can use transposed = ndarray.transpose([1,2,0]) to move the first axis to the end and the last into second position, and then do transposed.ravel() (I assumed C order, so I moved axis 0 to the end). You can also use reshape, which is more powerful than the simple ravel (the returned shape can have any number of dimensions).
Note that unless the strides add up exactly, numpy will have to make a copy of the array. You can avoid that in many cases with the very nice transposed.flat iterator.
>>> a = np.random.rand(2,2,2)
>>> a
array([[[ 0.67379148, 0.95508303],
[ 0.80520281, 0.34666202]],
[[ 0.01862911, 0.33851973],
[ 0.18464121, 0.64637853]]])
>>> np.ravel(a)
array([ 0.67379148, 0.95508303, 0.80520281, 0.34666202, 0.01862911,
0.33851973, 0.18464121, 0.64637853])
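The transpose-then-reshape idea described above can be sketched on a small made-up array as follows. The shape (3, 2, 4) stands in for (layers, rows, columns); the goal is one row per location holding all layer values.

```python
import numpy as np

# Made-up data: 3 layers on a 2 x 4 grid.
data = np.arange(3 * 2 * 4).reshape(3, 2, 4)

# Move the layer axis to the end, then collapse the two grid axes.
per_location = data.transpose(1, 2, 0).reshape(-1, data.shape[0])

print(per_location.shape)  # (8, 3)
print(per_location[0])     # layer values at grid cell (0, 0)
```

Row k of per_location then lines up with element k of the flattened longitude and latitude arrays, since all three use the same C-order flattening of the grid.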
You are essentially unfolding a high-dimensional tensor. Try tensorly.unfold(arr, mode=the_direction_you_want). For example,
import numpy as np
import tensorly as tl
a = np.zeros((3, 4, 5))
b = tl.unfold(a, mode=1)
b.shape # (4, 15)
