I have an h5 file, which is basically model weights output by Keras. For storage reasons, I'd like to split the large h5 file into smaller pieces and combine them back into a single file when needed. However, the way I do it seems to lose some "metadata" (I'm not sure exactly what; it may be missing more, but judging by the sizes of the combined file and the original file, it doesn't seem to be much).
Here's my splitting script:
prefix = "model_weights"
fname_src = "DiffusiveSizeFactorAI/model_weights.h5"
size_max = 90 * 1024**2 # maximum size allowed in bytes
is_file_open = False
dest_fnames = []
idx = 0
with h5py.File(fname_src, "r") as src:
    for group in src:
        fname = f"{prefix}_{idx}.h5"
        if not is_file_open:
            dest = h5py.File(fname, "w")
            dest_fnames.append(fname)
            is_file_open = True
        group_id = dest.require_group(group)
        src.copy(f"/{group}", group_id)
        size = os.path.getsize(fname)
        if size > size_max:
            dest.close()
            idx += 1
            is_file_open = False
dest.close()
and here's the script that I use for combining back the pieces:
fname_combined = f"{prefix}_combined.h5"
with h5py.File(fname_combined, "w") as combined:
    for fname in dest_fnames:
        with h5py.File(fname, "r") as src:
            for group in src:
                group_id = combined.require_group(group)
                src.copy(f"/{group}", group_id)
Just to add a little context in case it helps with debugging: when I load the "combined" model weights, here's the error I get:
ValueError: Layer count mismatch when loading weights from file. Model expected 108 layers, found 0 saved layers.
Note: the size of the original file and the combined one are about the same (they differ by less than 0.5%), which is why I think that I might be missing some metadata.
I am wondering if there is an alternative solution to your problem. I am assuming you want to deploy the model on an embedded system, which leads to memory restrictions. If that is the case, here are some alternatives:
Use TensorFlow Lite: it is claimed to significantly reduce the size of the model (I haven't really tested this). It also improves other important aspects of deploying ML on the edge. In summary, it can make the model up to 5x smaller; see the conversion sketch after this list.
Apply pruning: pruning gradually zeroes out model weights during the training process to achieve model sparsity. Sparse models are easier to compress, and the zeroes can be skipped during inference for latency improvements.
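To illustrate the TensorFlow Lite option above, here is a minimal conversion sketch using the TF 2.x converter API. It assumes you have the full model saved (architecture plus weights), not just a weights file; the file names are placeholders.

import tensorflow as tf

# Load the Keras model and convert it to TensorFlow Lite with the default
# post-training optimizations (paths are placeholders).
model = tf.keras.models.load_model("model.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)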
Based on an answer from h5py developers, there are two issues:
Every time an h5 file is copied this way, an extra duplicate group level is added to the destination file. Let me explain:
Suppose in src.h5, I have the following structure: /A/B/C. In these two lines:
group_id = dest.require_group(group)
src.copy(f"/{group}", group_id)
group is /A, and so, after copying, an extra /A is added to dest.h5, which results in the following erroneous structure: /A/A/B/C. To fix that, one needs to explicitly pass name="A" as an argument to copy.
Metadata of the root level "/" is not copied in either the splitting or the combining script. To fix that, given that the h5 data structure is very similar to a Python dict, you just need to add:
dest.attrs.update(src.attrs)
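Putting both fixes together, a minimal combining sketch might look like the following. This is only an illustration of the two points above (it assumes each split file carries the root attributes), not the exact code from the repository linked below.

fname_combined = f"{prefix}_combined.h5"
with h5py.File(fname_combined, "w") as combined:
    for fname in dest_fnames:
        with h5py.File(fname, "r") as src:
            # Fix 2: copy the root-level metadata.
            combined.attrs.update(src.attrs)
            for group in src:
                # Fix 1: copy /A into the destination root under the explicit
                # name "A" instead of into a pre-created /A group, so the
                # result is /A/B/C rather than /A/A/B/C.
                src.copy(f"/{group}", combined, name=group)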
For personal use, I've written two helper functions, one that splits up a large h5 file into smaller parts, each not exceeding a specified size (passed as an argument by the user), and another that combines them back into a single h5 file. In case you find them useful, they can be found on GitHub here.
Related
I am trying to read a large number (54K) of 512x512x3 .png images into an array in order to create a dataset and feed it to a Keras model. I am using the code below, but I am getting the cv2.OutOfMemory error (at around image 50K...) pointing to the fourth line of my code. I have been reading a bit about it: I am using the 64-bit version, and the images cannot be resized since they are a fixed input representation. Is there anything that can be done on the memory-management side to make it work?
'''
# Images (512x512x3)
X_data = []
files = glob.glob('C:\\Users\\77901677\\Projects\\images1\\*.png')
for myFile in files:
    image = cv2.imread(myFile)
    X_data.append(image)
dataset_image = np.array(X_data)

# Annotations (multilabel) 512x512x2
Y_data = []
files = glob.glob('C:\\Users\\77901677\\Projects\\annotations1\\*.png')
for myFile in files:
    mask = cv2.imread(myFile)
    # Get rid of the first channel, which is empty
    mask = mask[:, :, 1:]
    Y_data.append(mask)
dataset_mask = np.array(Y_data)
'''
Any ideas or advice are welcome.
You can reduce memory use by cutting one of your variables, because at the moment you hold the data twice (once in the list and once in the numpy array).
You could use yield for this, creating a generator that loads your files one at a time instead of storing them all in an auxiliary variable.
def myGenerator():
    files = glob.glob('C:\\Users\\77901677\\Projects\\annotations1\\*.png')
    for myFile in files:
        mask = cv2.imread(myFile)
        # Get rid of the first channel, which is empty
        yield mask[:, :, 1:]

# Initialise your numpy array here (placeholder shape: files x height x width x channels)
yData = np.zeros((N, H, W, C))

# Initialise the generator
mygenerator = myGenerator()  # create a generator

for i, data in enumerate(mygenerator):
    yData[i, ::] = data  # load the data
But this is not optimal for you. If you plan to train a model in the next step, you will have memory issues for sure. In Keras, you can additionally implement a Keras Sequence generator, which will load your files in batches (similarly to this yield generator) and feed them to your model during training. I recommend this article here, which demonstrates an easy implementation of it; it's what I use for my Keras/TF model pipelines.
It's good practice to use generators when feeding our models large amounts of data.
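As a minimal sketch of such a Sequence, assuming tf.keras and reusing the mask slicing from the question (PngSequence, image_files and mask_files are illustrative names):

import math
import cv2
import numpy as np
from tensorflow.keras.utils import Sequence

class PngSequence(Sequence):
    """Loads one batch of image/mask pairs from disk per __getitem__ call."""

    def __init__(self, image_files, mask_files, batch_size=8):
        self.image_files = image_files
        self.mask_files = mask_files
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.image_files) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        x = np.array([cv2.imread(f) for f in self.image_files[lo:hi]])
        # Drop the empty first channel of each mask, as in the question.
        y = np.array([cv2.imread(f)[:, :, 1:] for f in self.mask_files[lo:hi]])
        return x, y

# Usage: model.fit(PngSequence(image_files, mask_files, batch_size=8), epochs=10)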
I am experimenting with a 3-dimensional zarr-array, stored on disk:
Name: /data
Type: zarr.core.Array
Data type: int16
Shape: (102174, 1100, 900)
Chunk shape: (12, 220, 180)
Order: C
Read-only: True
Compressor: Blosc(cname='zstd', clevel=3, shuffle=BITSHUFFLE, blocksize=0)
Store type: zarr.storage.DirectoryStore
No. bytes: 202304520000 (188.4G)
No. bytes stored: 12224487305 (11.4G)
Storage ratio: 16.5
Chunks initialized: 212875/212875
As I understand it, zarr-arrays can also reside in memory - compressed, as if they were on disk. So I thought why not try to load the entire thing into RAM on a machine with 32 GByte memory. Compressed, the dataset would require approximately 50% of RAM. Uncompressed, it would require about 6 times more RAM than available.
Preparation:
import os
import zarr
from numcodecs import Blosc
import tqdm
zpath = '...' # path to zarr data folder
disk_array = zarr.open(zpath, mode = 'r')['data']
c = Blosc(cname = 'zstd', clevel=3, shuffle = Blosc.BITSHUFFLE)
memory_array = zarr.zeros(
    disk_array.shape, chunks = disk_array.chunks,
    dtype = disk_array.dtype, compressor = c
)
The following experiment fails almost immediately with an out of memory error:
memory_array[:, :, :] = disk_array[:, :, :]
As I understand it, disk_array[:, :, :] will try to create an uncompressed, full-size numpy array, which will obviously fail.
Second attempt, which works but is agonizingly slow:
chunk_lines = disk_array.chunks[0]
chunk_number = disk_array.shape[0] // disk_array.chunks[0]
chunk_remain = disk_array.shape[0] % disk_array.chunks[0] # unhandled ...
for chunk in tqdm.trange(chunk_number):
    chunk_slice = slice(chunk * chunk_lines, (chunk + 1) * chunk_lines)
    memory_array[chunk_slice, :, :] = disk_array[chunk_slice, :, :]
Here, I am trying to read a certain number of chunks at a time and put them into my in-memory array. It works, but it is about 6 to 7 times slower than what it took to write this thing to disk in the first place. EDIT: Yes, it's still slow, but the 6-to-7-times figure turned out to be due to a disk issue.
What's an intelligent and fast way of achieving this? I'd guess, besides not using the right approach, my chunks might also be too small - but I am not sure.
EDIT: Shape, chunk size and compression are supposed to be identical for the on-disk array and the in-memory array. It should therefore be possible to eliminate the decompress-compress procedure in my example above.
I found zarr.convenience.copy but it is marked as an experimental feature, subject to further change.
Related issue on GitHub
You could conceivably try with fsspec.implementations.memory.MemoryFileSystem, which has a .make_mapper() method, with which you can make the kind of object expected by zarr.
However, this is really just a dict of path:io.BytesIO, which you could make yourself, if you want.
There are a couple of ways one might solve this issue today.
Use LRUStoreCache to cache (some) compressed data in memory.
Coerce your underlying store into a dict and use that as your store.
The first option might be appropriate if you only want some frequently used data in memory. Of course, how much you load into memory is something you can configure, so this could be the whole array. Data is only cached on demand, which may be useful for you.
The second option just creates a new in-memory copy of the array by pulling all of the compressed data from disk. The one downside is that if you intend to write back to disk, you will need to do so manually, but that is not too difficult. The update method is pretty handy for copying data between different stores.
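A rough sketch of both options, assuming the zarr v2 API and the zpath variable from the question:

import zarr

# Option 1: wrap the on-disk store in an LRU cache so frequently used
# compressed chunks stay in memory (max_size is in bytes).
store = zarr.DirectoryStore(zpath)
cached_store = zarr.LRUStoreCache(store, max_size=16 * 2**30)
cached_array = zarr.open(cached_store, mode='r')['data']

# Option 2: pull all compressed chunks into a plain dict and open that as the
# store; this copies the compressed bytes without decompressing them.
mem_store = {}
mem_store.update(zarr.DirectoryStore(zpath))
memory_array = zarr.open(mem_store, mode='r')['data']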
I have a very big dataset stored in h5py, and this leads to memory problems when it is loaded in full for subsequent processing. I need to randomly select a subset and work with it. This is for "boosting" in a machine-learning context.
dataset = h5py.File(h5_file, 'r')
train_set_x_all = dataset['train_set_x'][:]
train_set_y_all = dataset['train_set_y'][:]
dataset.close()
p = np.random.permutation(len(train_set_x_all))[:2000] # rand select 2000
train_set_x = train_set_x_all[p]
train_set_y = train_set_y_all[p]
I still somehow need to load the full set and slice it with the index array p. This works for me, since the subsequent training only uses the smaller set, but I wonder whether there's a better way that avoids keeping the full dataset in memory at all.
My goal is to train a neural net for a fixed number of epochs or steps, I would like each step to use a batch of data of a specific size from a .tfrecords file.
Currently I am reading from the file using this loop:
i = 0
data = np.empty(shape=[x, y])
for serialized_example in tf.python_io.tf_record_iterator(filename):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here
    data[i-1] = [Labels[0]]  # more features here
    if i == 3:
        break
    i = i + 1
print data  # do some stuff etc.
I am a bit of a Python noob, and I suspect that creating "i" outside the loop and breaking out when it reaches a certain value is just a hacky work-around.
Is there a way that I can read data from the file but specify "I would like the first 100 values in the byte_list that is contained within the Labels feature" and then subsequently "I would like the next 100 values"?
To clarify, the thing that I am unfamiliar with is looping over a file in this manner, I am not really certain how to manipulate the loop.
Thanks.
Impossible. TFRecords is a streaming reader and has no random access.
A TFRecords file represents a sequence of (binary) strings. The format is not random access, so it is suitable for streaming large amounts of data but not suitable if fast sharding or other non-sequential access is desired.
Expanding on the comment by Shan Carter (although it's not an ideal solution for your question) for archival purposes.
If you'd like to use enumerate() to break out from a loop at a certain iteration, you could do the following:
n = 5  # Iteration you would like to stop at
data = np.empty(shape=[x, y])
for i, serialized_example in enumerate(tf.python_io.tf_record_iterator(filename)):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here
    data[i-1] = [Labels[0], Labels[1]]  # more features here
    if i == n:
        break
print(data)
Addressing your use case for .tfrecords
I would like each step to use a batch of data of a specific size from a .tfrecords file.
As mentioned by TimZaman, .tfrecords are not meant for arbitrary data access. But since you just need to continuously pull batches from the .tfrecords file, you are better off using the tf.data API to feed your model.
Adapted from the tf.data guide:
Constructing a Dataset from .tfrecord files
filepath1 = '/path/to/file.tfrecord'
filepath2 = '/path/to/another_file.tfrecord'
dataset = tf.data.TFRecordDataset(filenames=[filepath1, filepath2])
From here, if you're using the tf.keras API, you could pass dataset as an argument into model.fit like so:
model.fit(x = dataset,
          batch_size = None,
          validation_data = some_other_dataset)
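Note that a raw TFRecordDataset yields serialized protobuf strings, so in practice you usually map a parse function and batch the dataset before fitting. A rough sketch, assuming the 'Labels' feature from the earlier question is stored as a single bytes value (the feature spec, dtype, and batch size are assumptions to adapt to your records):

# Assumed feature spec: adjust names, shapes and dtypes to your records.
feature_spec = {'Labels': tf.io.FixedLenFeature([], tf.string)}

def parse_fn(serialized_example):
    parsed = tf.io.parse_single_example(serialized_example, feature_spec)
    labels = tf.io.decode_raw(parsed['Labels'], tf.uint8)  # bytes -> uint8 tensor
    return labels

dataset = (tf.data.TFRecordDataset([filepath1, filepath2])
           .map(parse_fn)
           .batch(32))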
Extra Stuff
Here's a blog which helps to explain .tfrecord files a little better than the tensorflow documentation.
I am working on two-class text classification. Usually I create pickle files of the trained model and load those pickles later to avoid retraining.
With 12,000 reviews plus more than 50,000 tweets for each class, the trained model size grows to 1.4 GB.
Storing this much model data in a pickle and loading it back is really not feasible or advisable.
Is there any better alternative to this scenario?
Here is sample code. I tried multiple ways of pickling; here I have used the dill package:
def train(self):
    global pos, neg, totals
    retrain = False
    # Load counts if they already exist.
    if not retrain and os.path.isfile(CDATA_FILE):
        # pos, neg, totals = cPickle.load(open(CDATA_FILE))
        pos, neg, totals = dill.load(open(CDATA_FILE, 'r'))
        return
    for file in os.listdir("./suspected/"):
        for word in set(self.negate_sequence(open("./unsuspected/" + file).read())):
            neg[word] += 1
            pos['not_' + word] += 1
    for file in os.listdir("./suspected/"):
        for word in set(self.negate_sequence(open("./suspected/" + file).read())):
            pos[word] += 1
            neg['not_' + word] += 1
    self.prune_features()
    totals[0] = sum(pos.values())
    totals[1] = sum(neg.values())
    countdata = (pos, neg, totals)
    dill.dump(countdata, open(CDATA_FILE, 'w'))
UPDATE: The reason behind the large pickle is that the classification data is very large, and I have considered 1- to 4-grams for feature selection. The classification dataset itself is around 300 MB, so the multigram approach to feature selection creates a large training model.
Pickle is very heavy as a format. It stores all the details of the objects.
It would be much better to store your data in an efficient format like hdf5.
If you are not familiar with hdf5, you can look into storing your data in simple flat text files. You can use csv or json, depending on your data structure. You'll find that either is more efficient than pickle.
You can look at gzip to create and load compressed archives.
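For example, if pos, neg, and totals are plain dicts/lists of counts with string keys, a gzip-compressed JSON dump is one simple alternative to pickling (counts.json.gz is a placeholder name):

import gzip
import json

# Save the count data as gzip-compressed JSON instead of a pickle.
with gzip.open("counts.json.gz", "wt", encoding="utf-8") as f:
    json.dump({"pos": pos, "neg": neg, "totals": totals}, f)

# Load it back.
with gzip.open("counts.json.gz", "rt", encoding="utf-8") as f:
    counts = json.load(f)
pos, neg, totals = counts["pos"], counts["neg"], counts["totals"]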
The problem and solution are explained here. In short, the problem is that when doing featurization, e.g. using CountVectorizer, even though you might ask for a small number of features, e.g. max_features=1000, the transformer still keeps a copy of all possible features for debugging purposes under the hood.
For instance, the CountVectorizer has the following attribute:
stop_words_ : set
    Terms that were ignored because they either:
      - occurred in too many documents (max_df)
      - occurred in too few documents (min_df)
      - were cut off by feature selection (max_features).
    This is only available if no vocabulary was given.
and this causes the model size to become too large. To solve the issue, you can set stop_words_ to None before pickling your model (taken from the example at the link above; please check it for details):
import pickle
model_name = 'clickbait-model-sm.pkl'
cfr_pipeline.named_steps.vectorizer.stop_words_ = None
pickle.dump(cfr_pipeline, open(model_name, 'wb'), protocol=2)