Training a classifier with large data - Python

I am working on a two-class text classification task. Usually I create pickle files of the trained model and load them later to avoid retraining.
With 12,000 reviews plus more than 50,000 tweets per class, the trained model grows to about 1.4 GB.
Storing that much model data in a pickle and loading it back is neither feasible nor advisable.
Is there a better alternative for this scenario?
Here is the sample code. I tried several ways of pickling; here I have used the dill package:
import os
import dill  # pos and neg are assumed to be defaultdict(int); totals is a two-element list

def train(self):
    global pos, neg, totals
    retrain = False
    # Load counts if they already exist.
    if not retrain and os.path.isfile(CDATA_FILE):
        # pos, neg, totals = cPickle.load(open(CDATA_FILE))
        pos, neg, totals = dill.load(open(CDATA_FILE, 'rb'))  # pickle data must be read in binary mode
        return
    # Count words in the negative (unsuspected) class.
    for file in os.listdir("./unsuspected/"):
        for word in set(self.negate_sequence(open("./unsuspected/" + file).read())):
            neg[word] += 1
            pos['not_' + word] += 1
    # Count words in the positive (suspected) class.
    for file in os.listdir("./suspected/"):
        for word in set(self.negate_sequence(open("./suspected/" + file).read())):
            pos[word] += 1
            neg['not_' + word] += 1
    self.prune_features()
    totals[0] = sum(pos.values())
    totals[1] = sum(neg.values())
    countdata = (pos, neg, totals)
    dill.dump(countdata, open(CDATA_FILE, 'wb'))  # write in binary mode
UPDATE: The reason the pickle is so large is that the classification data itself is very large, and I use 1-4 grams for feature selection. The classification dataset alone is around 300 MB, so the multigram features blow up the size of the trained model.

Pickle is a very heavy format: it stores all the details of the objects.
It would be much better to store your data in an efficient format like HDF5.
If you are not familiar with HDF5, you can store your data in simple flat text files instead. You can use CSV or JSON, depending on your data structure; either is more efficient than pickle.
You can also look at gzip to create and load compressed archives.
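For the count dictionaries in this question (pos, neg, totals), a minimal sketch of a gzip-compressed JSON dump that avoids pickle entirely (the file name is illustrative; after loading, pos and neg come back as plain dicts):
import gzip
import json

# Save: plain dicts and lists serialize directly to JSON; gzip shrinks the text considerably.
with gzip.open('countdata.json.gz', 'wt', encoding='utf-8') as f:
    json.dump({'pos': dict(pos), 'neg': dict(neg), 'totals': list(totals)}, f)

# Load: rebuild the three objects without unpickling anything.
with gzip.open('countdata.json.gz', 'rt', encoding='utf-8') as f:
    countdata = json.load(f)
pos, neg, totals = countdata['pos'], countdata['neg'], countdata['totals']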

The problem and its solution are explained here. In short, when doing featurization, e.g. with CountVectorizer, even if you ask for a small number of features (e.g. max_features=1000), the transformer still keeps a copy of all possible features under the hood, for debugging purposes.
For instance, the CountVectorizer has the following attribute:
stop_words_ : set
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
and this causes the model size to become too large. To solve the issue, you can set stop_words_ to None before pickling your model (example adapted from the link above; check it for details):
import pickle
model_name = 'clickbait-model-sm.pkl'
cfr_pipeline.named_steps.vectorizer.stop_words_ = None
pickle.dump(cfr_pipeline, open(model_name, 'wb'), protocol=2)


Splitting up `h5` file and combining the pieces back

I have an h5 file, which is basically model weights output by keras. For some storage requirements, I'd like to split up the large h5 file into smaller pieces, and combine them back into a single file when needed. However, the way I do it seems to miss some "metadata" (not sure, maybe it's missing a lot more, but judging by the size of the combined file and the original file, it seems that I'm not missing much).
Here's my splitting script:
import os
import h5py

prefix = "model_weights"
fname_src = "DiffusiveSizeFactorAI/model_weights.h5"
size_max = 90 * 1024**2  # maximum size allowed in bytes
is_file_open = False
dest_fnames = []
idx = 0

with h5py.File(fname_src, "r") as src:
    for group in src:
        fname = f"{prefix}_{idx}.h5"
        if not is_file_open:
            dest = h5py.File(fname, "w")
            dest_fnames.append(fname)
            is_file_open = True
        group_id = dest.require_group(group)
        src.copy(f"/{group}", group_id)
        size = os.path.getsize(fname)
        if size > size_max:
            dest.close()
            idx += 1
            is_file_open = False
dest.close()
and here's the script that I use for combining back the pieces:
fname_combined = f"{prefix}_combined.h5"

with h5py.File(fname_combined, "w") as combined:
    for fname in dest_fnames:
        with h5py.File(fname, "r") as src:
            for group in src:
                group_id = combined.require_group(group)
                src.copy(f"/{group}", group_id)
Just to add a little bit of context if it helps debugging my case, when I load the "combined" model weights, here's the error I'm getting:
ValueError: Layer count mismatch when loading weights from file. Model expected 108 layers, found 0 saved layers.
Note: the sizes of the original file and the combined one are about the same (they differ by less than 0.5%), which is why I think what is missing is probably just metadata.
I am wondering if there is an alternative solution to your problem. I am assuming you want to deploy the model on an embedded system, which leads to memory restrictions. If that is the case, here are some alternatives:
Use TensorFlow Lite: it claims to significantly reduce model size (I haven't really tested this) and also improves other important aspects of ML deployment on the edge. In summary, it can make the model up to 5x smaller; see the sketch after this list.
Apply pruning: pruning gradually zeroes out model weights during training to achieve model sparsity. Sparse models are easier to compress, and the zeroed weights can be skipped during inference for latency improvements.
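A minimal sketch of the TensorFlow Lite conversion, assuming you have a full saved Keras model rather than a weights-only h5 file (paths are illustrative):
import tensorflow as tf

# Load the trained Keras model (assumes a complete saved model, not just weights).
model = tf.keras.models.load_model("full_model.h5")

# Convert with default post-training optimizations to shrink the file.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)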
Based on an answer from h5py developers, there are two issues:
Every time an h5 file is copied this way, an extra duplicate group level is added to the destination file. Let me explain:
Suppose in src.h5, I have the following structure: /A/B/C. In these two lines:
group_id = dest.require_group(group)
src.copy(f"/{group}", group_id)
group is /A, so after copying, an extra /A is added to dest.h5, which results in the erroneous structure /A/A/B/C. To fix that, one needs to explicitly pass name="A" as an argument to copy.
The metadata of the root level "/" is not copied in either the splitting or the combining script. To fix that, given that the h5 data structure is very similar to a Python dict, you just need to add:
dest.attrs.update(src.attrs)
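Putting both fixes together, a rough sketch of the corrected copy step (copying each top-level group straight into the destination file under an explicit name, then carrying over the root attributes; the size-based splitting logic is left out and the output file name is illustrative):
import h5py

with h5py.File(fname_src, "r") as src, h5py.File("model_weights_0.h5", "w") as dest:
    for group in src:
        # An explicit name avoids the extra /A/A nesting produced by copying into require_group(group).
        src.copy(f"/{group}", dest, name=group)
    # Copy the root-level "/" metadata as well.
    dest.attrs.update(src.attrs)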
For personal use, I've written two helper functions: one splits a large h5 file into smaller parts, each not exceeding a specified size (passed as an argument by the user), and the other combines them back into a single h5 file. In case you find it useful, it can be found on GitHub here.

How to quickly load face features (as np.ndarray) from a txt file

I have a face feature dataset stored in a .txt file. Each record has an image name and a 512d feature vector. The format of each line is shown as follows:
img_id.jpg 0.0637221 0.0835939 0.0283522 -0.0266797 0.0502897 -0.0108895 ... -0.0266797 0.0502897
Each feature should be loaded in a numpy matrix like:
[
[512d-feature],
[512d-feature],
...
[512d-feature]
]
I have tried to load the features with a for loop, but the maximum loading speed seems to be ~5000 records/second. However, the full dataset has nearly 10 million records, so loading the features would take nearly 30 minutes.
for line in tqdm(content):
    ...
    # load feature
    feature = line.strip().split()[1:]
    feature = [float(i) for i in feature]
    feat_matrix[index] = feature
I was wondering whether there is a faster way to load the features and cut down this loading time, e.g. using numpy built-in methods, multi-threading, or parallel loading?
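One sketch worth trying, assuming every line is exactly an image name followed by 512 floats (the file name is illustrative; np.loadtxt is the numpy built-in, although pandas.read_csv tends to parse large whitespace-delimited files faster):
import numpy as np

# usecols skips column 0 (the image name); the remaining 512 columns form the feature vector.
feat_matrix = np.loadtxt("features.txt", usecols=range(1, 513), dtype=np.float32)

# If the image names are needed too, they can be read separately as strings.
img_ids = np.loadtxt("features.txt", usecols=0, dtype=str)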

How to efficiently randomly select a subset of data from an h5py dataset

I have a very big dataset in h5py, and loading it in full leads to memory problems during subsequent processing. I need to randomly select a subset and work with it; this is for "boosting" in a machine-learning context.
import h5py
import numpy as np

dataset = h5py.File(h5_file, 'r')
train_set_x_all = dataset['train_set_x'][:]
train_set_y_all = dataset['train_set_y'][:]
dataset.close()
p = np.random.permutation(len(train_set_x_all))[:2000] # rand select 2000
train_set_x = train_set_x_all[p]
train_set_y = train_set_y_all[p]
I still have to read the full dataset into memory and then slice it with the index array p. This works for me, since training subsequently runs only on the smaller subset, but I wonder whether there is a better way that avoids keeping the full dataset in memory at all.
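One sketch of how to avoid that, relying on the fact that h5py supports fancy indexing with a list of indices in increasing order, so only the sampled rows are read from disk:
import h5py
import numpy as np

with h5py.File(h5_file, 'r') as dataset:
    n = dataset['train_set_x'].shape[0]
    # h5py requires the index array to be sorted in increasing order.
    p = np.sort(np.random.choice(n, size=2000, replace=False))
    train_set_x = dataset['train_set_x'][p]  # reads only the selected rows
    train_set_y = dataset['train_set_y'][p]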

Loading a custom dataset from json annotation files for a Keras classification task

I am new to deep learning and would like to implement a simple classification task using Keras. My dataset contains over 2000 images, and for each image I have a corresponding json file containing its label. Following is the code to load the json files and create the X (image) and Y (label) arrays:
import concurrent.futures
import glob

X = []
Y = []

with concurrent.futures.ProcessPoolExecutor() as executor:
    # Get a list of files to process.
    pattern = jsonpath + '/*.json'
    # print(pattern)
    json_files = glob.glob(pattern)
    for jsonfile, y in zip(json_files, executor.map(create_array, json_files)):
        X.append(y[0])
        Y.append(y[1])
where the function create_array is defined as follows:
import json
import cv2

def create_array(jsonfile):
    array_list = []
    y_list = []
    with open(jsonfile) as f:
        data = json.load(f)
    name = data['annotation']['data_filename']
    img = cv2.imread(imgDIR + '/' + name)
    array_list.append(img)
    l = data['annotation']['data_annotation']['classification'][0]['classification_label']
    y_list.append(l)
    return array_list, y_list
It works for a small number of images, say 15, but for the entire set of 2000 images the program gets killed, or sometimes it fails with the error "MemoryError: out of memory".
Is there a more efficient way to do this? How can I speed up this data pre-processing step so the data can be fed to the Keras classification model?
It seems like your images are pretty much ready for training and your preprocessing is simply about loading the files. The json format might not be the fastest approach when it comes to loading data; if you use something like pickle to save and load your images, you might see a speed boost.
The other question is how to efficiently pass the data to Keras. Normally you would use model.fit, but since not all your data fits into memory you can use model.fit_generator.
The Keras docs give us the following hint:
The generator is run in parallel to the model, for efficiency. For instance, this allows you to do real-time data augmentation on images on CPU in parallel to training your model on GPU.
The use of keras.utils.Sequence guarantees the ordering and guarantees the single use of every input per epoch when using use_multiprocessing=True.
Here is an example of how to implement such a generator.
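For illustration, a minimal sketch of such a keras.utils.Sequence that reuses the question's create_array helper and json_files list (the batch size and the assumption that all images share one shape are mine, and this is not the linked example):
import math
import numpy as np
from tensorflow.keras.utils import Sequence  # keras.utils.Sequence in standalone Keras

class JsonImageSequence(Sequence):
    def __init__(self, json_files, batch_size=32):
        self.json_files = json_files
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return math.ceil(len(self.json_files) / self.batch_size)

    def __getitem__(self, idx):
        batch = self.json_files[idx * self.batch_size:(idx + 1) * self.batch_size]
        images, labels = [], []
        for jsonfile in batch:
            x, y = create_array(jsonfile)  # loads one image and its label on demand
            images.append(x[0])
            labels.append(y[0])
        return np.array(images), np.array(labels)

# model.fit(JsonImageSequence(json_files), epochs=10)  # or model.fit_generator(...) on older Keras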

How can I pick specific records in TensorFlow from a .tfrecords file?

My goal is to train a neural net for a fixed number of epochs or steps; I would like each step to use a batch of data of a specific size from a .tfrecords file.
Currently I am reading from the file using this loop:
i = 0
data = np.empty(shape=[x, y])

for serialized_example in tf.python_io.tf_record_iterator(filename):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here
    data[i-1] = [Labels[0]]  # more features here
    if i == 3:
        break
    i = i + 1

print data  # do some stuff etc.
I am a bit of a Python noob, and I suspect that creating i outside the loop and breaking out when it reaches a certain value is just a hacky workaround.
Is there a way to read data from the file but specify "I would like the first 100 values in the byte_list contained within the Labels feature", and then subsequently "I would like the next 100 values"?
To clarify, the thing I am unfamiliar with is looping over a file in this manner; I am not really certain how to manipulate the loop.
Thanks.
Impossible. TFRecords is a streaming reader and has no random access.
A TFRecords file represents a sequence of (binary) strings. The format is not random access, so it is suitable for streaming large amounts of data but not suitable if fast sharding or other non-sequential access is desired.
Expanding on the comment by Shan Carter (although it's not an ideal solution for your question) for archival purposes.
If you'd like to use enumerate() to break out from a loop at a certain iteration, you could do the following:
n = 5  # iteration you would like to stop at
data = np.empty(shape=[x, y])

for i, serialized_example in enumerate(tf.python_io.tf_record_iterator(filename)):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here
    data[i] = [Labels[0], Labels[1]]  # more features here
    if i == n:
        break

print(data)
Addressing your use case for .tfrecords
I would like each step to use a batch of data of a specific size from a .tfrecords file.
As mentioned by TimZaman, .tfrecords are not meant for arbitrary data access. But since you just need to continuously pull batches from the .tfrecords file, you might be better off using the tf.data API to feed your model.
Adapted from the tf.data guide:
Constructing a Dataset from .tfrecord files
filepath1 = '/path/to/file.tfrecord'
filepath2 = '/path/to/another_file.tfrecord'
dataset = tf.data.TFRecordDataset(filenames = [filepath1, filepath2])
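Before feeding the model you would typically parse the serialized examples and batch them; a rough sketch, where the feature spec for 'Labels' is an assumption based on the question's byte_list:
import tensorflow as tf

def _parse(serialized_example):
    # Assumed spec: 'Labels' stored as a byte_list, as in the question's loop.
    feature_spec = {'Labels': tf.io.FixedLenFeature([], tf.string)}
    return tf.io.parse_single_example(serialized_example, feature_spec)

dataset = dataset.map(_parse).batch(100)  # each training step consumes a batch of 100 records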
From here, if you're using the tf.keras API, you could pass dataset as an argument into model.fit like so:
model.fit(x=dataset,
          batch_size=None,
          validation_data=some_other_dataset)
Extra Stuff
Here's a blog which helps to explain .tfrecord files a little better than the tensorflow documentation.
