Saving MultinomialNB Model to Disk takes too much time and memory - python

I have a dataset (around 120k entries, 8 MB, 4 columns, one with text). I ran a MultinomialNB to classify the text column in order to predict its class (another column).
I did that with the pipeline below (the text column goes through a text-cleaning process, including stopword removal, prior to the pipeline).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf_comp = Pipeline([('vect', CountVectorizer(ngram_range=(1, 6))),
                          ('tfidf', TfidfTransformer(use_idf=False)),
                          ('clf', MultinomialNB(alpha=0.01))])
text_clf_comp = text_clf_comp.fit(X_train_comp, y_train)
The parameters were optimized using GridSearch.
Fitting the pipeline takes 17 s and the model predicts very well.
The problem occurs when I try to save the model using joblib or pickle: it creates a 300 MB file and takes 7 min to run. That makes no sense considering the time to train and the size of the data.
saved_model=joblib.dump(text_clf_comp,'saved_model.joblib')
I created an LSTM model that takes about an hour to train, yet saving it took a couple of seconds and produced a 2 MB file.
Right now it is faster to retrain my MultinomialNB classifier every time than to save and load it.
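For what it's worth, most of those 300 MB are probably the n-gram vocabulary: ngram_range=(1,6) extracts every 1- to 6-gram, so CountVectorizer and MultinomialNB both end up storing statistics for millions of features. Two mitigations worth trying, sketched under that assumption (compress and max_features are standard joblib/CountVectorizer options):

import joblib

# Compress the pickle on the fly; compress=3 is a common trade-off
# between file size and dump/load time.
joblib.dump(text_clf_comp, 'saved_model.joblib', compress=3)

# Capping the vocabulary shrinks the model itself, not just the file:
# CountVectorizer(ngram_range=(1, 6), max_features=100000)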

Related

How to split text data for LSTM training

I'm using a Colab Pro TPU, which offers up to 35 GB of memory. My dataset contains 650,000 sequences. I'm trying to use a bidirectional LSTM to predict the next word.
When I attempt to generate the binary_class vector using to_categorical(), it crashes because of memory limits. I took the first 200k sequences and trained the model; accuracy plateaus at around 65%. Before tweaking the hyperparameters, I wanted to feed in the whole dataset and train the model. Is it possible to split the dataset, generate the sequences in chunks, and join them for training?
Appreciate any suggestions.
Thanks.
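One way to sidestep to_categorical() entirely is to keep the targets as integer word indices and train with sparse_categorical_crossentropy, which consumes integer labels directly, so the 650,000 x vocab_size one-hot matrix is never built. A minimal sketch, assuming placeholder sizes rather than your actual data:

from tensorflow import keras

vocab_size = 50000  # placeholder

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 128),
    keras.layers.Bidirectional(keras.layers.LSTM(128)),
    keras.layers.Dense(vocab_size, activation='softmax'),
])

# Integer class ids go in directly; no one-hot matrix in memory.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X, y_int, ...) where y_int is a 1-D array of word indices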

Having problems while doing multiclass classification with tensorflow

https://colab.research.google.com/drive/1EdCL6YXCAvKqpEzgX8zCqWv51Yum2PLO?usp=sharing
Hello,
Above, I'm trying to identify 5 different types of restorations on dental x-rays with tensorflow. I'm following the steps in the official documentation, but now I'm kind of stuck and I need help. Here are my questions:
1 - I have my data on my local disk. The TF example in the link above downloads the data from a different repository. When I want to test my images, do I have any way other than the code below?:
import numpy as np
from keras.preprocessing import image
from google.colab import files

uploaded = files.upload()

# predicting images
for fn in uploaded.keys():
    path = fn
    img = image.load_img(path, target_size=(180, 180))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    images = np.vstack([x])
    classes = model.predict(images)
    print(fn)
    print(classes)
I'm asking this because the official documentation only shows how to test images one by one, like this:
img = keras.preprocessing.image.load_img(
    sunflower_path, target_size=(img_height, img_width)
)
img_array = keras.preprocessing.image.img_to_array(img)
img_array = tf.expand_dims(img_array, 0)  # Create a batch

predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])

print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)
2 - I'm using the image_dataset_from_directory method, so I don't have a separate validation directory. Is that OK, or should I use ImageDataGenerator? For testing, I picked some images by hand at random from all 5 categories and put them in my test folder, which has 5 subfolders, one per category. Is this what I am supposed to do for prediction, i.e. also separating the test data into different folders? If yes, how can I load all 5 folders simultaneously at test time?
3 - I'm also supposed to create the confusion matrix, but I couldn't understand how to apply this to my code. Some people say to use scikit-learn's confusion matrix, but then I have to define y_true and y_pred values, which I cannot fit into this code. Am I supposed to evaluate 5 different confusion matrices for 5 different predictions, and how?
4 - Sometimes I observe that the validation accuracy starts much higher than the training accuracy. Is this unusual? After 3-4 epochs, the training accuracy catches up with the validation accuracy and continues in a more balanced way. I thought this should not be happening. Is everything alright?
5 - Final question: why does the first epoch take much, much longer than the other epochs? In my setup it takes about 30-40 minutes to complete the first epoch, and then only about a minute to complete every other one. Is there a way to fix this, or does it always happen this way?
thanks.
I am no expert in image processing with tf, but let me try to answer as much as possible:
1
I don't really understand this question, because you are using image_dataset_from_directory, which handles the file loading process for you. What you are doing there looks fine to me.
2
Let me cite tf.keras.preprocessing.image_dataset_from_directory:
Then calling image_dataset_from_directory(main_directory,
labels='inferred') will return a tf.data.Dataset that yields batches
of images from the subdirectories class_a and class_b, together with
labels 0 and 1 (0 corresponding to class_a and 1 corresponding to
class_b).
And ImageDataGenerator:
Generate batches of tensor image data with real-time data augmentation.
The data will be looped over (in batches).
As your test data is handpicked, there is no need for ImageDataGenerator; image_dataset_from_directory returns what you want. If you have test and validation data (which you should), you can use the tf.data.Dataset functions to split the data into train, validation and test sets. This can be a bit clunky, but the time spent learning tf.data.Dataset is well spent.
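To illustrate: image_dataset_from_directory can do the train/validation split itself via validation_split and subset, and the handpicked test folder with its 5 class subfolders loads in one call, all classes at once. A sketch with placeholder paths and the 180x180 size from the question:

import tensorflow as tf

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/train", validation_split=0.2, subset="training",
    seed=123, image_size=(180, 180), batch_size=32)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/train", validation_split=0.2, subset="validation",
    seed=123, image_size=(180, 180), batch_size=32)

# The test folder's 5 subfolders become the 5 classes automatically;
# shuffle=False keeps predictions aligned with the labels later.
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/test", image_size=(180, 180), batch_size=32, shuffle=False)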
3
A confusion matrix gives you the counts from which the F1-score, precision and recall are derived. scikit-learn's confusion_matrix handles the multiclass case directly, with one row and column per class, so you do not need 5 separate matrices. Tensorflow can also track precision and recall for you as metrics during training, so consider using those as well.
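A sketch of the y_true/y_pred part, assuming a non-shuffled test_ds dataset and a trained model as in the question:

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Collect the integer labels from the tf.data.Dataset (it must not be
# shuffled between these two passes) and the model's predictions.
y_true = np.concatenate([labels.numpy() for _, labels in test_ds])
y_pred = np.argmax(model.predict(test_ds), axis=1)

print(confusion_matrix(y_true, y_pred))       # one 5x5 matrix, not 5 separate ones
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1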
4
Depending on how the data is shuffled and structured, this can be normal. When there are more hard cases in the training data, the model will have more difficulty predicting them correctly; when there are more easy examples in the validation data, the model will do better there, which gives you a higher accuracy at that point. It is indeed an indicator that the classes in your training and validation data might not be equally distributed.
5
tf.data.Dataset loads the data lazily. This means the files are not loaded into memory until the training process has started, which results in a very long first epoch (all images have to be read from disk first) and very short later epochs (the images are already there). You can confirm this by checking the GPU usage of your machine; it should often be idle or very low during the first epoch.
To fix this, you can use .prefetch(z) on your dataset variable. prefetch() makes the dataset fetch the next z values while the GPU is already doing some calculations, which might speed up the first epoch.
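A sketch of that, plus cache(), which keeps the decoded images around after the first pass (AUTOTUNE lets tf pick the buffer size):

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # on older TF versions: tf.data.experimental.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)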

Loaded model receives different prediction compared to saved model

I am trying to save a model and load it into a different session, but I am having prediction inconsistencies, and I would appreciate any help that can be offered. So here is what I did...
First, after running the model, I used this code to save the model:
from sklearn.externals import joblib
joblib.dump(clf, "models.pkl")
and then to load the file in a different colaboratory notebook, I used the function
from sklearn.externals import joblib
loaded_model = joblib.load('models.pkl')
then this is the code I used to process a single image for testing:
img_toArray = cv2.imread("/content/ESD/ESD/folder1/img1.png")
new_array = cv2.resize(img_toArray, (220, 220))
new_array = np.array(new_array).reshape(1, 145200)
prediction = loaded_model.predict(new_array)  # the predict call implied by the output below
but this results in an output of array([4]) with every image I test, and I am not sure why.
I have also tried reloading the entire dataset, separating the labels from the features (the images), and using train_test_split to dedicate 90% of the dataset to testing, and when I run the features (images) through this block of code:
loaded_model.predict(np.array(xTest[whatEverNumber]).reshape(1,145200))
I get the right predictions. So I am confused as to what I am doing wrong, because in both examples I am processing the images in basically the same way and then running them through the same prediction method. I would appreciate any help in figuring out what I did wrong.
Extra information that may prove beneficial: I am using colaboratory, and my model is an sklearn SVM that runs through cross_validation_predict and finally an SVM fit function.
Thank you in advance!
Is loaded_model always trained with the same data? You might be encountering this problem because your fitted model is trained on different chunks (folds) of your dataset and you are fitting/saving it with the last iteration only; hence, each time you train it, the model learns from different data (given by each fold) and returns different predictions. That is, if the model fitting happens within your cross-validation loop. May I ask what type of train-test split you used? Shuffled?
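If that is the case, one fix is to use cross-validation only for evaluation and then fit one final estimator on the full training set before dumping it, so every notebook that loads the file sees the same model. A sketch assuming the clf and data names from the question (and a standalone joblib import):

import joblib
from sklearn.model_selection import cross_val_score

# Cross-validation purely for evaluation (one score per fold):
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(scores.mean())

# One final fit on all training data; this exact estimator is saved,
# so it behaves identically in every session that loads it.
clf.fit(X_train, y_train)
joblib.dump(clf, "models.pkl")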

LSTM Does Not Do Well On Second Test Data

During training my LSTM performs well (I use training, validation, and test datasets), and using my test dataset once at the end after training, I get really good values. So I save the meta file and checkpoint.
Then, during inference, I load my checkpoint and meta file and initialize the weights (using sess.run(tf.initialize_variables())), but when I use a second test dataset (different from the one I used during training) my LSTM performance drops from 96% to 20%.
My second test dataset was recorded in conditions similar to my training, validation, and first test datasets, but on a different day.
All my data was recorded using the same webcam, with the same background in all images, so technically I should get similar performance on my first and second test sets.
I shuffled my dataset during training.
I am using tensorflow 1.1.0.
What could be the issue here?
Well, I was reloading my checkpoint during inference, and somehow tensorflow would complain if I did not call the initializer after starting my session like this:
init = tf.global_variables_initializer()
lstm_sess.run(init)
Somehow that seems to randomly initialize my weights rather than reloading the last saved weight values.
So what I did instead is freeze my graph as soon as I finish training; during inference I now reload the frozen graph, and I get the same performance I got with my test dataset during training. It's kinda weird. Maybe I am not saving/reloading my checkpoint correctly?
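For reference, the usual TF1 pattern is to import the meta graph and then let the Saver restore the values, without running the global initializer afterwards, since the initializer overwrites the restored weights with fresh random ones. A sketch with a placeholder checkpoint path:

import tensorflow as tf

with tf.Session() as sess:
    # import the graph structure from the .meta file
    saver = tf.train.import_meta_graph('model.ckpt.meta')  # placeholder path
    # restore assigns the stored values; do NOT run
    # tf.global_variables_initializer() after this
    saver.restore(sess, 'model.ckpt')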

Dataset too big for RAM, how to do efficient epochs

I am currently working with a dataset of about 2 million objects. Before I train on them I have to load them from disk and perform some preprocessing (which makes the dataset much bigger, so it would be inefficient to save the post-processed data).
Right now I just load and train in small batches, but if I want to train for multiple epochs on the full dataset, I would have to reload all the data from the previous epoch multiple times, which ends up taking a lot of time. The alternative is training for multiple epochs on each small batch of data before moving on to the next batch. Could the second method cause any issues (like overfitting)? And is there a better way to do this? I'm using tflearn with Python 3, if there are any built-in methods for that.
tl;dr: Is it okay to train for multiple epochs on subsets of the data before training for a single epoch on all of it?
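For what it's worth, a middle ground is to stream one full pass over all the files per epoch, so each epoch still sees the whole dataset but only one chunk is ever preprocessed in memory at a time; repeatedly running many epochs on one subset before moving on risks the model drifting toward that subset. A sketch with hypothetical load_chunk()/preprocess() helpers standing in for your on-disk format:

import numpy as np

def batch_stream(file_paths, batch_size):
    """Yield preprocessed (X, y) batches, reading from disk on the fly."""
    for path in file_paths:
        X, y = preprocess(load_chunk(path))  # hypothetical helpers
        for i in range(0, len(X), batch_size):
            yield X[i:i + batch_size], y[i:i + batch_size]

for epoch in range(num_epochs):
    np.random.shuffle(file_paths)  # cheap extra randomness each epoch
    for X_batch, y_batch in batch_stream(file_paths, batch_size=64):
        model.fit(X_batch, y_batch, n_epoch=1)  # tflearn DNN.fit; model assumed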
