Extracting part of a large file using numpy memory map (Python)

Trying to extract a sample from a large file using numpy's memmap:
import numpy as np

# indices - boolean vector indicating which lines we want to extract
file_big = np.memmap('path_big_file', dtype='int16', shape=(indices.shape[0], L))
file_small = np.memmap('new_path_for_small_file', dtype='int16', shape=(indices.sum(), L))
The expected result would be that a new file will be created with only part of the data, as identified by the indices.
# place data in files:
file_small[:] = file_big[indices]
The above is the procedure described in the manual. It does not work: it fails with an out-of-memory error, even though memory should not be an issue, since only memmaps are used and the data is never loaded explicitly.
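One workaround is to copy the selection block by block, so that only one block of selected rows is ever materialized in RAM. Below is a rough sketch (not the manual's procedure), reusing indices, L and the paths above and assuming mode='r' for the existing file and mode='w+' to create the new one:
import numpy as np

# indices - boolean vector indicating which lines we want to extract
n_rows = indices.shape[0]
n_keep = int(indices.sum())

file_big = np.memmap('path_big_file', dtype='int16', mode='r', shape=(n_rows, L))
file_small = np.memmap('new_path_for_small_file', dtype='int16', mode='w+', shape=(n_keep, L))

block = 10_000            # rows per block; tune to the available RAM
out = 0
for start in range(0, n_rows, block):
    stop = min(start + block, n_rows)
    mask = indices[start:stop]
    n_sel = int(mask.sum())
    if n_sel:
        # only this block's selected rows are pulled into memory
        file_small[out:out + n_sel] = file_big[start:stop][mask]
        out += n_sel

file_small.flush()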

Related

Changes in h5 file aren't reflected in xdmf file

Hello, I was given an h5 file with an xdmf file associated with it. The visualization looks like this (the color is just the feature ID). I wanted to add some data to the h5 file to be able to visualize it in ParaView. The newly added data does not appear in ParaView, although it is clearly there when using HDFView. The data I'm trying to add are the datasets titled engineering stress and true stress. The only difference I noticed is that the number of attributes for these is zero, while it's 5 for the rest, but I don't know what to do with that information.
Here's the code I currently have set up:
nf_product = h5py.File(filename, "a")
e_princ = np.empty((280, 150, 280, 3))
t_princ = e_princ
for i in tqdm(range(grain_count)):
    a = np.where(feature_ID == i+1)
    e_princ[a,0] = eng_stress[i,0]
    e_princ[a,1] = eng_stress[i,1]
    e_princ[a,2] = eng_stress[i,2]
    t_princ[a,0] = true_stress[i,0]
    t_princ[a,1] = true_stress[i,1]
    t_princ[a,2] = true_stress[i,2]
EngineeringStress = nf_product.create_dataset('DataContainers/nfHEDM/CellData/EngineeringStressPrinciple', data=np.float32(e_princ))
TrueStress = nf_product.create_dataset('DataContainers/nfHEDM/CellData/TrueStressPrinciple', data=np.float32(t_princ))
I am new to using h5 and xdmf files, so I may be going about this entirely wrong, but the way I understand it, an xdmf file acts as a pointer to the data in the h5 file, so I can't understand why the new data doesn't appear in ParaView.
First, did you close the file with nf_product.close()? If not, new datasets may not have been flushed from memory. You may also need to flush the buffers with nf_product.flush(). Better yet, use the Python with/as file context manager and it is done automatically.
Next, you can simply use data=e_princ (and t_princ); there is no need to cast a NumPy array to a NumPy array.
Finally, verify the values in e_princ and t_princ. I think they will be the same, because they reference the same NumPy object. You need to create t_princ as a separate empty array, the same way as e_princ. Also, these arrays have 4 dimensions, but you only provide 2 indices when you populate them with [a,0]. Be sure that works as expected.
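A minimal sketch of those three fixes together, assuming feature_ID has shape (280, 150, 280) and that eng_stress, true_stress, grain_count and filename exist as in the question:
import numpy as np
import h5py

e_princ = np.empty((280, 150, 280, 3), dtype=np.float32)
t_princ = np.empty((280, 150, 280, 3), dtype=np.float32)   # separate array, not an alias of e_princ

for i in range(grain_count):
    a = np.where(feature_ID == i + 1)        # tuple of 3 index arrays for the first 3 axes
    e_princ[a[0], a[1], a[2], :] = eng_stress[i, :]
    t_princ[a[0], a[1], a[2], :] = true_stress[i, :]

# the context manager flushes and closes the file automatically
with h5py.File(filename, 'a') as nf_product:
    nf_product.create_dataset('DataContainers/nfHEDM/CellData/EngineeringStressPrinciple', data=e_princ)
    nf_product.create_dataset('DataContainers/nfHEDM/CellData/TrueStressPrinciple', data=t_princ)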

Xarray to merge two hdf5 files with different dimension lengths

I have some instrumental data which is saved in HDF5 format as multiple 2-d arrays along with the measuring time. As the attached figures below show, d1 and d2 are two independent files in which the instrument recorded at different times. They have the same data variables, and the only difference is the length of phony_dim_0, which represents the total number of data points, varying with measurement time.
These files need to be loaded into specific software provided by the instrument company to obtain meaningful results. I want to merge multiple files with Python xarray while keeping their original format, and then load one merged file into the software.
Here is my attempt:
files = os.listdir("DATA_PATH")
d1 = xarray.open_dataset(files[0])
d2 = xarray.open_dataset(files[1])
## copy a new one to save the merged data array.
d0 = d1
vars_ = [c for c in d1]
for var in vars_:
    d0[var].values = np.vstack([d1[var], d2[var]])
The error shows like this:
replacement data must match the Variable's shape. replacement data has shape (761, 200); Variable has shape (441, 200)
I thought about two solutions for this problem:
1. expanding the dimension length to the total length of all merged files;
2. creating a new empty dataframe in the same format as d1 and d2.
However, I still could not figure out how to achieve either. Any comments or suggestions would be appreciated.
Supplemental information
dataset example [d1],[d2]
I'm not familiar with xarray, so I can't help with your code. However, you don't need xarray to copy HDF5 data; h5py is designed to work nicely with HDF5 data as NumPy arrays, and it is all you need to merge the data.
A note about Xarray. It uses different nomenclature than HDF5 and h5py. Xarray refers to the files as 'datasets', and calls the HDF5 datasets 'data variables'. HDF5/h5py nomenclature is more frequently used, so I am going to use it for the rest of my post.
There are some things to consider when merging datasets across 2 or more HDF5 files. They are:
Consistency of the data schema (which you have checked).
Consistency of attributes. If datasets have different attribute names or values, the merge process gets a lot more complicated! (Yours appear to be consistent.)
It's preferable to create resizable datasets in the merged file. This simplifies the process, as you don't need to know the total size when you initially create the dataset. Better yet, you can add more data later (if/when you have more files).
I looked at your files. You have 8 HDF5 datasets in each file. One nice thing: the datasets are resizable. That simplifies the merge process. Also, although your datasets have a lot of attributes, they appear to be common to both files. That also simplifies the process.
The code below goes through the following steps to merge the data:
1. Open the new merge file for writing.
2. Open the first data file (read-only).
3. Loop through all datasets:
a. Use the group .copy() function to copy each dataset (data plus maxshape parameters, and attribute names and values).
4. Open the second data file (read-only).
5. Loop through all datasets and do the following:
a. Get the size of the 2 datasets (existing and to be added).
b. Increase the size of the HDF5 dataset with the .resize() method.
c. Write values from the new dataset to the end of the existing dataset.
At the end it loops through all 3 files and prints shape and maxshape for all datasets (for visual comparison).
Code below:
import h5py

files = [ '211008_778183_m.h5', '211008_778624_m.h5', 'merged_.h5' ]

# Create the merge file:
with h5py.File('merged_.h5', 'w') as h5fw:

    # Open first HDF5 file and copy each dataset.
    # Will use maxshape and attributes from the existing dataset.
    with h5py.File(files[0], 'r') as h5fr:
        for ds in h5fr.keys():
            h5fw.copy(h5fr[ds], h5fw, name=ds)

    # Open second HDF5 file and copy data from each dataset.
    # Resizes existing dataset as needed to hold new data.
    with h5py.File(files[1], 'r') as h5fr:
        for ds in h5fr.keys():
            ds_a0 = h5fw[ds].shape[0]
            add_a0 = h5fr[ds].shape[0]
            h5fw[ds].resize(ds_a0 + add_a0, axis=0)
            h5fw[ds][ds_a0:] = h5fr[ds][:]

# Print shape and maxshape for all datasets in all 3 files (for visual comparison).
for fname in files:
    print(f'Working on file: {fname}')
    with h5py.File(fname, 'r') as h5f:
        for ds, h5obj in h5f.items():
            print(f'for: {ds}; shape={h5obj.shape}, maxshape={h5obj.maxshape}')

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
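For example (a rough sketch; the shape and density are made up), a matrix that would never fit in RAM as a dense array can be built directly from its non-zero entries:
import numpy as np
from scipy import sparse

# (row, col, value) triplets for the non-zero entries only
rows = np.arange(0, 1_000_000, 100)
cols = np.arange(0, 1_000_000, 100)
vals = np.ones_like(rows, dtype=np.float64)

# dense storage would need ~8 TB; the sparse matrix stores just 10,000 values
sp = sparse.coo_matrix((vals, (rows, cols)), shape=(1_000_000, 1_000_000)).tocsr()
print(sp.nnz, "stored values for a matrix of shape", sp.shape)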
Another answer (IMO: the correct solution to your problem) is to use a memory-mapped array. This will let numpy load only the necessary parts into RAM when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode and writing more bytes of data. Make sure that any references to the memory-mapped array are re-created any time more data is written to the file so the information is fresh.
Finally, there is compression. Numpy can compress arrays with savez_compressed, which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may get you under the threshold, but this could similarly be applied to the other methods to reduce memory usage. Numpy's built-in compression will only save disk space, not memory. There may be other libraries that perform some sort of streaming compression, but that is beyond the scope of my answer.
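As a small illustration of the compressed route (a sketch with made-up array and file names), note that the whole array comes back into memory on load:
import numpy as np

arr = np.random.rand(1000, 100)
np.savez_compressed('data.npz', columns=arr)    # compressed on disk

with np.load('data.npz') as npz:
    first_col = npz['columns'][:, 0]            # decompresses the whole 'columns' array, then slices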
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np

# open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

# open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)

# read some data
print(np.mean(column_mmap[0:1024]))

# write some data
column_mmap[0:512] = .5

# deletion closes the memory-mapped file and flushes changes to disk.
# del isn't specifically needed as python will garbage collect objects no
# longer accessible. If for example you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array, then when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap

# write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

Reading Large HDF5 Files

I am new to using HDF5 files and I am trying to read files with shapes of (20670, 224, 224, 3). Whenever I try to store the results from the hdf5 file in a list or another data structure, it either takes so long that I abort the execution, or it crashes my computer. I need to be able to read 3 sets of hdf5 files, use their data, manipulate it, use it to train a CNN model, and make predictions.
Any help for reading and using these large HDF5 files would be greatly appreciated.
Currently this is how I am reading the hdf5 file:
db = h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5")
training_db = list(db['data'])
Crashes probably mean you are running out of memory. Like Vignesh Pillay suggested, I would try chunking the data and working on a small piece of it at a time. If you are using the pandas method read_hdf, you can use the iterator and chunksize parameters to control the chunking:
import pandas as pd
data_iter = pd.read_hdf('/tmp/test.hdf', key='test_key', iterator=True, chunksize=100)
for chunk in data_iter:
    # train cnn on chunk here
    print(chunk.shape)
Note this requires the HDF5 file to be in table format.
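If you are creating the HDF5 file yourself with pandas, here is a hedged sketch of writing it in table format (the path and key match the read_hdf call above; the data is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 4), columns=list('abcd'))
# format='table' (PyTables) is what enables iterator/chunksize reads;
# the default format='fixed' does not support them
df.to_hdf('/tmp/test.hdf', key='test_key', format='table', mode='w')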
My answer was updated 2020-08-03 to reflect the code you added to your question.
As #Tober noted, you are running out of memory. Reading a dataset of shape (20670, 224, 224, 3) into a list means holding roughly 3.1 billion values. If you read 3 image sets, it will require even more RAM.
I assume this is image data (maybe 20670 images of shape (224, 224, 3) )?
If so, you can read the data in slices with both h5py and tables (Pytables).
This will return the data as a NumPy array, which you can use directly (no need to manipulate into a different data structure).
Basic process would look like this:
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    # loop to get images 1 by 1
    for icnt in range(20670):
        image_arr = training_db[icnt, :, :, :]
        # then do something with the image
You could also read multiple images by setting the first index to a range (say icnt:icnt+100) then handle looping appropriately.
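For example, a small sketch of that batched variant (the batch size is arbitrary):
import os
import h5py

with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    batch = 100
    for icnt in range(0, training_db.shape[0], batch):
        image_batch = training_db[icnt:icnt + batch, :, :, :]   # NumPy array of up to 100 images
        # then do something with the batch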
Your problem arises because you are running out of memory, so Virtual Datasets come in handy when dealing with large datasets like yours. Virtual datasets allow a number of real datasets to be mapped together into a single, sliceable dataset via an interface layer. You can read more about them here: https://docs.h5py.org/en/stable/vds.html
I would recommend you start with one file at a time. First, create a virtual dataset file of your existing data, like this:
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    data_shape = db['data'].shape
    layout = h5py.VirtualLayout(shape=data_shape, dtype=np.uint8)
    vsource = h5py.VirtualSource(db['data'])
    layout[...] = vsource   # map the source dataset into the layout
    with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'w', libver='latest') as file:
        file.create_virtual_dataset('data', layout, fillvalue=0)
This will create a virtual dataset of your existing training data. Now, if you want to manipulate your data, you should open your file in r+ mode like
with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r+', libver='latest') as file:
    # Do whatever manipulation you want to do here
One more thing I would like to advise: make sure the indices you use for slicing are of int datatype, otherwise you will get an error.
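For example (a small hedged illustration), a computed index that comes out as a float should be cast before it is used in a slice:
import numpy as np
import h5py

with h5py.File("virtual_training_dataset.hdf5", 'r') as f:      # path assumed from above
    start = np.floor(0.25 * f['data'].shape[0])                 # this is a float
    # f['data'][start:start + 100]                              # rejected: slice bounds must be integers
    batch = f['data'][int(start):int(start) + 100]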

"Reading in" large text file into hdf5 via PyTables or PyHDF?

I'm attempting some statistics using SciPy, but my input dataset is quite large (~1.9GB) and in dbf format.
The file is large enough that Numpy returns an error message when I try to create an array with genfromtxt. (I've got 3GB ram, but running win32).
i.e.:
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
ind_sum = numpy.genfromtxt(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf", dtype = (int, int, int, float, float, int), names = True, usecols = (5))
File "C:\Python26\ArcGIS10.0\lib\site-packages\numpy\lib\npyio.py", line 1335, in genfromtxt
for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
MemoryError
From other posts, I see that the chunked array provided by PyTables could be useful, but my problem is reading in this data in the first place. In other words, PyTables or PyHDF can easily create the desired HDF5 output, but what should I do to get my data into an array first?
For instance:
import numpy, scipy, tables
h5file = tables.openFile(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\HET_IND_SUM2.h5", mode="w", title="Diversity Index Results")
group = h5file.createGroup("/", "IND_SUM", "Aggregated Index Values")
and then I could either create a table or array, but how do I refer back to the original dbf data? In the description?
Thanks for any thoughts you might have!
If the data is too big to fit in memory, you can work with a memory-mapped file (it's like a numpy array but stored on disk - see the docs here), though you may be able to get similar results using HDF5, depending on what operations you need to perform on the array. Obviously this will make many operations slower, but that is better than not being able to do them at all.
Because you are hitting a memory limit, I think you cannot use genfromtxt. Instead, you should iterate through your text file one line at a time and write the data to the relevant position in the memmap/HDF5 object.
It is not clear what you mean by "referring back to the original dbf data". Obviously you can just store the filename it came from somewhere. HDF5 objects have "attributes" which are designed to store this kind of metadata.
Also, I have found that using h5py is a much simpler and cleaner way to access hdf5 files than pytables, though this is largely a matter of preference.
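A rough sketch of that line-by-line approach with h5py, with the source file kept as an attribute (the CSV export path, row count and column count are all assumptions, not from the question):
import h5py

src = r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.csv"   # hypothetical text export of the dbf
n_rows, n_cols = 25_000_000, 6                                     # assumed; must be known (or counted) up front

with h5py.File("HET_IND_SUM2.h5", "w") as h5f:
    dset = h5f.create_dataset("IND_SUM", shape=(n_rows, n_cols), dtype="f8")
    dset.attrs["source_file"] = src          # metadata pointing back to the original data
    with open(src) as fh:
        next(fh)                             # skip the header line
        for i, line in enumerate(fh):
            dset[i, :] = [float(v) for v in line.split(",")]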
If the data is in a dbf file, you might try my dbf package -- it only keeps the records in memory that are being accessed, so you should be able to cycle through the records pulling out the data that you need:
import dbf
table = dbf.Table(r"W:\RACER_Analyses\Terrestrial_Heterogeneity\IND_SUM.dbf")
sums = [0, 0, 0, 0.0, 0.0, 0]
for record in table:
    for index in range(5):
        sums[index] += record[index]
