How to access random indices from h5 data set? - python

I have some h5 data that I want to sample from using some randomly generated indices. However, if the indices are not in increasing order, the read fails. Is it possible to select randomly generated indices from h5 datasets?
Here is an MWE demonstrating the error:
import h5py
import numpy as np
arr = np.random.random(50).reshape(10,5)
with h5py.File('example1.h5', 'w') as h5fw:
    h5fw.create_dataset('data', data=arr)
random_subset = h5py.File('example1.h5', 'r')['data'][[3, 1]]
# TypeError: Indexing elements must be in increasing order
I could sort the indices, but then the randomness component is lost.

As hpaulj mentioned, random indices aren't a problem for numpy arrays in memory. So yes, it's possible to select data with randomly generated indices from h5 datasets once they are read into numpy arrays. The key is having sufficient memory to hold the entire dataset. The code below shows how to do this:
#random_subset = h5py.File('example1.h5', 'r')['data'][[3, 1]]
arr = h5py.File('example1.h5', 'r')['data'][:]
random_subset = arr[[3,1]]

A potential solution is to pre-sort the desired indices as follows:
idx = np.sort([3,1])
random_subset = h5py.File('example1.h5', 'r')['data'][idx]
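Note that sorting changes the order in which the rows come back. If the random order itself matters, a common trick is to read with sorted indices and then undo the sort afterwards with np.argsort. A minimal sketch, reusing the example1.h5 file from above:
import h5py
import numpy as np
random_idx = np.array([3, 1])            # indices in the desired (random) order
sort_order = np.argsort(random_idx)      # h5py requires increasing indices
with h5py.File('example1.h5', 'r') as h5fr:
    subset_sorted = h5fr['data'][random_idx[sort_order]]
# undo the sort so the rows line up with random_idx again
random_subset = subset_sorted[np.argsort(sort_order)]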

Related

How to concatenate numpy arrays to create a 2d numpy array

I'm working on using AI to give me better odds at winning Keno. (don't laugh lol)
My issue is that when I gather my data it comes in the form of 1d arrays of drawings at a time. I have different files that have gathered the data and formatted it as well as performed simple maths on the data set. Now I'm trying to get the data into a certain shape for my Neural Network layers and am having issues.
formatted_list = file.readlines()
#remove newline chars
formatted_list = list(filter(("\n").__ne__, formatted_list))
#iterate through each drawing, format the ends and split into list of ints
for i in formatted_list:
    i = i[1:]
    i = i[:-2]
    i = [int(j) for j in i.split(",")]
    #convert to numpy array
    temp = np.array(i)
    #t1 = np.reshape(temp, (-1, len(temp)))
    #print(np.shape(t1))
    #append to master list
    master_list.append(temp)
print(np.shape(master_list))
This gives output of "(292,)", which is correct in that there are 292 rows of data, but each row also contains 20 columns. If I uncomment the "t1 = np.reshape(temp, (-1, len(temp)))" and "print(np.shape(t1))" lines, it prints "(1,20)(1,20)(1,20)(1,20)", etc. I want all of those rows stacked together while keeping the columns, so the shape is (292, 20). How can this be accomplished?
I've tried reshaping the final list and many other things with no luck. It either flattens every number into the first dimension, i.e. (5840,), or keeps the single dimension. I was expecting to be able to append each new drawing to a master list, convert it to a numpy array, and reshape it to 292 rows of 20 columns. I've also tried numpy.concatenate with no luck. Thank you.
You can use vstack to concatenate your master_list.
master_list = []
for array in formatted_list:
    master_list.append(array)
master_array = np.vstack(master_list)
Alternatively, if you know how many arrays formatted_list contains and their length, you can preallocate master_array.
import numpy as np
formatted_list = [np.random.rand(20)]*292
master_array = np.zeros((len(formatted_list), len(formatted_list[0])))
for i, array in enumerate(formatted_list):
    master_array[i,:] = array
** Edit **
As mentioned by hpaulj in the comments, np.array(), np.stack() and np.vstack() worked with this input and produced a numpy array with shape (7,20).
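For reference, a small self-contained sketch with dummy rows standing in for the parsed drawings, showing that stacking a list of equal-length 1-D arrays produces the desired 2-D shape:
import numpy as np
rows = [np.arange(20) for _ in range(292)]   # stand-ins for 292 parsed drawings
master_array = np.vstack(rows)               # np.stack(rows) or np.array(rows) also work
print(master_array.shape)                    # (292, 20)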

How to use np.unique on big arrays?

I work with geospatial images in tif format. Thanks to the rasterio lib I can work with these images as numpy arrays of dimension (nb_bands, x, y). Here I manipulate an image that contains patches of unique values that I would like to count (they were generated with the scipy.ndimage.label function).
My idea was to use the unique method of numpy to retrieve the information from these patches as follows:
# identify the clumps
with rio.open(mask) as f:
    mask_raster = f.read(1)
    class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True)
    del mask_raster

# identify the value
with rio.open(src) as f:
    src_raster = f.read(1)
    src_flat = src_raster.flatten()
    del src_raster

values = [src_flat[index] for index in indices]
df = pd.DataFrame({'patchId': indices, 'nb_pixel': count, 'value': values})
My problem is this:
For an image of shape (69940, 70936) (84.7 MB on my disk), np.unique tries to allocate an array of the same shape in uint64 and I get the following error:
Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64
Is it normal that unique recasts my array to uint64?
Is it possible to force it to use a more economical dtype? (Even if all my patches were 1 pixel, np.int32 would be sufficient.)
Is there another solution using a function I don't know?
The uint64 array is probably allocated during argsort here in the source code.
Since the labels from scipy.ndimage.label are consecutive integers starting at zero you can use numpy.bincount:
num_features = np.max(mask_raster)
# bincount needs a 1-D input, so flatten the label raster first
count = np.bincount(mask_raster.ravel(), minlength=num_features+1)
To get values from src you can do the following assignment. It's really inefficient but I don't think it allocates too much memory.
values = np.zeros(num_features+1, dtype=src_raster.dtype)
values[mask_raster] = src_raster
Maybe scipy.ndimage has a function that better suits the use case.
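A tiny self-contained example of the bincount idea, with toy arrays standing in for mask_raster and src_raster:
import numpy as np
mask_raster = np.array([[0, 1, 1],
                        [2, 2, 1]])          # labels
src_raster  = np.array([[9, 5, 5],
                        [7, 7, 5]])          # values
num_features = np.max(mask_raster)
count = np.bincount(mask_raster.ravel(), minlength=num_features + 1)
# count  -> [1, 3, 2], i.e. pixels per label 0, 1, 2
values = np.zeros(num_features + 1, dtype=src_raster.dtype)
values[mask_raster] = src_raster
# values -> [9, 5, 7], one representative src value per label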
I think splitting the numpy array into smaller chunks and accumulating unique:count values would be a memory-efficient solution, as would changing the data type to int16 or similar.
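A rough sketch of that chunked idea, accumulating per-label counts block by block (the helper name and block size are made up for illustration):
import numpy as np
from collections import Counter

def chunked_label_counts(raster, chunk_rows=1024):
    # accumulate label:count pairs one block of rows at a time
    totals = Counter()
    for start in range(0, raster.shape[0], chunk_rows):
        block = raster[start:start + chunk_rows]
        labels, counts = np.unique(block, return_counts=True)
        totals.update(dict(zip(labels.tolist(), counts.tolist())))
    return totals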
I dug into the scipy.ndimage lib and effectively found a solution that avoids the memory explosion. Since it slices the initial raster, it is also faster than I thought:
from scipy import ndimage
import numpy as np
# open the files
with rio.open(mask) as f_mask, rio.open(src) as f_src:
    mask_raster = f_mask.read(1)
    src_raster = f_src.read(1)

# use patches as slicing material
indices = [i for i in range(1, np.max(mask_raster) + 1)]
counts = []
values = []
for i, loc in enumerate(ndimage.find_objects(mask_raster)):
    loc_values, loc_counts = np.unique(mask_raster[loc], return_counts=True)
    # the value of the patch is the value with the highest count
    idx = np.argmax(loc_counts)
    counts.append(loc_counts[idx])
    values.append(loc_values[idx])

df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})

Initializing or populating multiple numpy arrays from h5 file groups

I have an h5 file with 5 groups, each group containing a 3D dataset. I am looking to build a for loop that allows me to extract each group into a numpy array and assign the numpy array to an object with the group header name. I am able to get a number of different methods to work with one group, but when I try to build a for loop that applies the code to all 5 groups, it breaks. For example:
import h5py as h5
import numpy as np
f = h5.File("FFM0012.h5", "r+") #read in h5 file
print(list(f.keys())) #['FFM', 'Image'] for my dataset
FFM = f['FFM'] #Generate object with all 5 groups
print(list(FFM.keys())) #['Amp', 'Drive', 'Phase', 'Raw', 'Zsnsr'] for my dataset
Amp = FFM['Amp'] #Generate object for 1 group
Amp = np.array(Amp) #Turn into numpy array, this works.
Now when I try to apply the same logic with a for loop:
h5_keys = []
FFM.visit(h5_keys.append) #Create list of group names ['Amp', 'Drive', 'Phase', 'Raw', 'Zsnsr']
for h5_key in h5_keys:
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)
print(Amp[30,30,30]) #To check that array is populated
When I run this code I get "NameError: name 'Amp' is not defined". I've tried initializing the numpy array before the for loop with:
h5_keys = []
FFM.visit(h5_keys.append) #Create list of group names
Amp = np.array([])
for h5_key in h5_keys:
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)
print(Amp[30,30,30]) #To check that array is populated
This produces the error message "IndexError: too many indices for array"
I've also tried generating a dictionary and creating numpy arrays from the dictionary. That is a similar story where I can get the code to work for one h5 group, but it falls apart when I build the for loop.
Any suggestions are appreciated!
You seem to have jumped to using h5py and numpy before learning much of Python.
Amp = np.array([])          # creates a numpy array with 0 elements
for h5_key in h5_keys:      # h5_key is set to a new value each iteration
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)  # now you reassign h5_key
print(Amp[30,30,30])        # Amp is still the original (0,) shape array
Try this basic python loop, paying attention to the value of i:
alist = [1,2,3]
for i in alist:
    print(i)
    i = 10
    print(i)
print(alist) # no change to alist
f is the file.
FFM = f['FFM']
is a group
Amp = FFM['Amp']
is a dataset. There are various ways of loading the dataset into a numpy array. I believe the [...] slicing is the currently preferred one; .value used to be used but is now deprecated.
Amp = FFM['Amp'][...]
is an array.
alist = [FFM[key][...] for key in h5_keys]
should create a list of arrays from the FFM group.
If the shapes are compatible, you can concatenate the arrays into one array:
np.array(alist)
np.stack(alist)
np.concatenate(alist, axis=0) # or other axis
etc
adict = {key: FFM[key][...] for key in h5_keys}
should create a dictionary of arrays keyed by dataset names.
In Python, lists and dictionaries are the ways of accumulating objects. The h5py groups behave much like dictionaries. Datasets behave much like numpy arrays, though they remain on the disk until loaded with [...].
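Putting that together for the file and group names from the question, the loop could collect every dataset into a dictionary of arrays and then index them by name. A sketch, assuming the same FFM0012.h5 layout as above:
import h5py
with h5py.File("FFM0012.h5", "r") as f:
    FFM = f['FFM']
    arrays = {key: FFM[key][...] for key in FFM.keys()}   # dataset name -> numpy array
print(arrays['Amp'][30, 30, 30])   # check one element of the 'Amp' array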

Iterate over numpy array to fill a python list

I'm iterating over a numpy array to apply a function to each element and adding the new value to a list so I can keep the original data.
The problem is: it's kinda slow.
Is there a better way to do this (without changing the original array)?
import numpy as np
original_data = np.arange(0,16000, dtype = np.float32)
new_data = [i/max(original_data) for i in original_data]
print('done')
You could simply do:
new_data = original_data/original_data.max()
Numpy already performs this operation element-wise.
In your code there is an extra source of slowness: each call to max(original_data) iterates over all elements of original_data, making the total cost O(n^2).
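For comparison, a short sketch showing the fix at both levels: computing the maximum once removes the O(n^2) behaviour, and the vectorized division removes the Python loop entirely:
import numpy as np
original_data = np.arange(0, 16000, dtype=np.float32)
peak = original_data.max()
new_data_list = [i / peak for i in original_data]   # O(n), but still a Python loop
new_data = original_data / peak                     # vectorized, no Python loop at all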

Shuffling multiple HDF5 datasets in-place

I have multiple HDF5 datasets saved in the same file, my_file.h5. These datasets have different dimensions, but the same number of observations in the first dimension:
features.shape = (1000000, 24, 7, 1)
labels.shape = (1000000,)
info.shape = (1000000, 4)
It is important that the info/label data is correctly connected to each set of features and I therefore want to shuffle these datasets with an identical seed. Furthermore, I would like to shuffle these without ever loading them fully into memory. Is that possible using numpy and h5py?
Shuffling arrays on disk will be time-consuming, as it means that you have to allocate new arrays in the hdf5 file, then copy all the rows in a different order. You can iterate over rows (or use chunks of rows) if you want to avoid loading all the data at once into memory with PyTables or h5py.
An alternative approach could be to keep your data as it is and simply to map new row numbers to old row numbers in a separate array (that you can keep fully loaded in RAM, since it will be only 4MB with your array sizes). For instance, to shuffle a numpy array x,
x = np.random.rand(5)
idx_map = np.arange(x.shape[0])
np.random.shuffle(idx_map)
Then you can use advanced numpy indexing to access your shuffled data,
x[idx_map[2]] # equivalent to x_shuffled[2]
x[idx_map] # equivalent to x_shuffled[:], etc.
This will also work with arrays saved to hdf5. Of course there would be some overhead compared to writing shuffled arrays on disk, but it could be sufficient depending on your use-case.
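If the data is read in batches (e.g. for training), the same idx_map trick can be combined with h5py's increasing-order restriction by sorting each batch of mapped indices before the read and restoring the order afterwards. A sketch, assuming the datasets from the question live in my_file.h5:
import numpy as np
import h5py
with h5py.File('my_file.h5', 'r') as f:
    n = f['features'].shape[0]
    idx_map = np.arange(n)
    np.random.shuffle(idx_map)
    batch = idx_map[:256]                  # shuffled indices for one batch
    order = np.argsort(batch)              # h5py wants increasing indices
    feats = f['features'][batch[order]]    # read in sorted order
    feats = feats[np.argsort(order)]       # restore the shuffled batch order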
Shuffling arrays like this in numpy is straightforward.
Create the large shuffling index (shuffle np.arange(1000000)) and index the arrays:
features = features[I, ...]
labels = labels[I]
info = info[I, :]
This isn't an inplace operation. labels[I] is a copy of labels, not a slice or view.
An alternative
features[I,...] = features
looks on the surface like it is an inplace operation. I doubt that it is, down in the C code. It has to be buffered, because the I values are not guaranteed to be unique. In fact there is a special ufunc .at method for unbuffered operations.
But look at what h5py says about this same sort of 'fancy indexing':
http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
labels[I] selection is implemented, but with restrictions.
List selections may not be empty
Selection coordinates must be given in increasing order
Duplicate selections are ignored
Very long lists (> 1000 elements) may produce poor performance
Your shuffled I is, by definition, not in increasing order. And it is very long.
Also, I don't see anything about using this fancy indexing on the left-hand side, labels[I] = ....
import numpy as np
import h5py

data = h5py.File('original.h5py', 'r')
with h5py.File('output.h5py', 'w') as out:
    # one shared shuffled index, so all datasets stay aligned
    indexes = np.arange(data['some_dataset_in_original'].shape[0])
    np.random.shuffle(indexes)
    for key in data.keys():
        print(key)
        # note: np.take reads the whole dataset into memory before reordering
        feed = np.take(data[key], indexes, axis=0)
        out.create_dataset(key, data=feed)
