Sklearn KMeans cluster centers change after saving and loading with Pickle - python

I'm training a KMeans model for vector quantization and alter the cluster centers manually for optimization with a variation of the Kiefer-Wolfowitz algorithm.
I take the cluster centers, adjust them and put them back into the model.
The error occurs when I save the model with pickle and load it later.
When I evaluate the quantizer, I get different results after saving and loading.
To retrieve the centers I have getter and setter methods.
def getWeights(self):
    return self.kmeans.cluster_centers_.flatten()

def setWeights(self, weights):
    self.kmeans.cluster_centers_ = weights.reshape(self.clusters, self.dim)
To save the model I use pickle, but I also tried joblib.
# save
with open(os.path.join(savepath, savename + '.pkl'), 'wb') as f:
    pickle.dump(self.kmeans, f)

# load
with open(os.path.join(savepath, savename + '.pkl'), 'rb') as f:
    self.kmeans = pickle.load(f)
I compared the cluster centers with np.sum(cc1 - cc2) and got
[-23657.44412046 -27826.84822967 -34863.87009913 -22867.6671942 -31120.73019114]
Where are these differences coming from?
Does pickle not save the full precision?
Or is it because of the package versions?
The scikit-learn version is 0.24.2 with Python 3.6 because of some dependencies.
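For reference, a minimal self-contained sketch (with made-up toy data, not the actual quantizer) to check whether KMeans centers survive a pickle round trip bit-for-bit; pickle serializes NumPy arrays exactly, so any non-zero difference would point to something other than lost precision:
import pickle
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 8)          # toy data, only for illustration
km = KMeans(n_clusters=5, random_state=0).fit(X)

km2 = pickle.loads(pickle.dumps(km))                # in-memory round trip

print(np.array_equal(km.cluster_centers_, km2.cluster_centers_))   # expect True
print(np.abs(km.cluster_centers_ - km2.cluster_centers_).max())    # expect 0.0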

Related

Loading a custom dataset from JSON annotation files for a Keras classification task

I am new to deep learning and would like to implement a simple classification task using Keras. My dataset contains over 2000 images, and for each image I have a corresponding JSON file containing the label for that image. The following code loads the JSON files and creates the X (image) and Y (label) arrays:
X = []
Y = []
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Get a list of files to process
    pattern = jsonpath + '/*.json'
    json_files = glob.glob(pattern)
    for jsonfile, y in zip(json_files, executor.map(create_array, json_files)):
        X.append(y[0])
        Y.append(y[1])
where the function create_array is defined as follows:
def create_array(jsonfile):
    array_list = []
    y_list = []
    with open(jsonfile) as f:
        data = json.load(f)
    name = data['annotation']['data_filename']
    img = cv2.imread(imgDIR + '/' + name)
    array_list.append(img)
    l = data['annotation']['data_annotation']['classification'][0]['classification_label']
    y_list.append(l)
    return array_list, y_list
It works for a small number of images, say 15, but for the entire set of 2000 images the program gets killed, or it sometimes fails with the error "MemoryError: out of memory".
Is there a more efficient way to do this? How can I speed up this data pre-processing step so it can be fed into the Keras classification model?
It seems like your images are pretty much ready for training and your preprocessing is simply about loading the files. The JSON format might not be the fastest approach when it comes to loading data. If you use something like pickle to save and load your images, you might see a speed boost.
The other question is how to efficiently pass the data to Keras. Normally you would use model.fit, but since not all of your data fits into memory, you can use model.fit_generator.
The Keras documentation gives us the following hint:
The generator is run in parallel to the model, for efficiency. For
instance, this allows you to do real-time data augmentation on images
on CPU in parallel to training your model on GPU.
The use of keras.utils.Sequence guarantees the ordering and guarantees
the single use of every input per epoch when using
use_multiprocessing=True.
Here is an example of how such a generator can be implemented.
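For illustration, a minimal sketch of such a keras.utils.Sequence based on the loading code from the question; the class name, batch size and the assumption that all images share one shape are mine, not the question's:
import glob
import json
import cv2
import numpy as np
from keras.utils import Sequence

class JsonImageSequence(Sequence):
    def __init__(self, json_dir, img_dir, batch_size=32):
        self.files = sorted(glob.glob(json_dir + '/*.json'))
        self.img_dir = img_dir
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.files) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch = self.files[idx * self.batch_size:(idx + 1) * self.batch_size]
        images, labels = [], []
        for jsonfile in batch:
            with open(jsonfile) as f:
                data = json.load(f)
            img = cv2.imread(self.img_dir + '/' + data['annotation']['data_filename'])
            images.append(img)  # assumes all images have the same shape; resize here if not
            labels.append(data['annotation']['data_annotation']
                          ['classification'][0]['classification_label'])
        # labels may still need to be encoded (e.g. one-hot) before training
        return np.array(images), np.array(labels)

# usage: model.fit_generator(JsonImageSequence(jsonpath, imgDIR), epochs=10)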

Tensorflow frozen inference graph from .meta .info .data and combining frozen inference graphs

I am new to TensorFlow and currently struggling with some issues:
How to get a frozen inference graph from .meta, .data and .info without a pipeline config
I wanted to test pre-trained models for real-time traffic sign detection. The model consists of three files - .meta, .data and .info - but I can't find information on how to convert them into a frozen inference graph without a pipeline config. Everything I find is either outdated or needs a pipeline config.
Also, I tried to train the model myself, but I think the problem is the .ppa files (GTSDB dataset), because with .png or .jpg everything worked just fine.
How to combine two or more frozen inference graphs
I have successfully trained a model on my own dataset (to detect a specific object), but I want that model to work alongside pre-trained models like Faster R-CNN Inception or SSD MobileNet. I understand that I have to load both models, but I have no idea how to make them work at the same time - is that even possible?
UPDATE
I'm halfway there on the first problem - I now have frozen_model.pb. The problem was the output node names; I got confused and didn't know what to put there, so after hours of "investigating" I got this working code:
import os, argparse
import tensorflow as tf

# The original freeze_graph function
# from tensorflow.python.tools.freeze_graph import freeze_graph

dir = os.path.dirname(os.path.realpath(__file__))

def freeze_graph(model_dir):
    """Extract the sub graph defined by the output nodes and convert
    all its variables into constants
    Args:
        model_dir: the root folder containing the checkpoint state file
        output_node_names: a string, containing all the output node's names,
            comma separated
    """
    if not tf.gfile.Exists(model_dir):
        raise AssertionError(
            "Export directory doesn't exist. Please specify an export "
            "directory: %s" % model_dir)

    # if not output_node_names:
    #     print("You need to supply the name of a node to --output_node_names.")
    #     return -1

    # We retrieve our checkpoint fullpath
    checkpoint = tf.train.get_checkpoint_state(model_dir)
    input_checkpoint = checkpoint.model_checkpoint_path

    # We define the full file name of our frozen graph
    absolute_model_dir = "/".join(input_checkpoint.split('/')[:-1])
    output_graph = absolute_model_dir + "/frozen_model.pb"

    # We clear devices to allow TensorFlow to control on which device it will load operations
    clear_devices = True

    # We start a session using a temporary fresh Graph
    with tf.Session(graph=tf.Graph()) as sess:
        # We import the meta graph into the current default Graph
        saver = tf.train.import_meta_graph(input_checkpoint + '.meta', clear_devices=clear_devices)

        # We restore the weights
        saver.restore(sess, input_checkpoint)

        # We use a built-in TF helper to export variables to constants
        output_graph_def = tf.graph_util.convert_variables_to_constants(
            sess,  # The session is used to retrieve the weights
            tf.get_default_graph().as_graph_def(),  # The graph_def is used to retrieve the nodes
            [n.name for n in tf.get_default_graph().as_graph_def().node]  # Every node is treated as an output node
        )

        # Finally we serialize and dump the output graph to the filesystem
        with tf.gfile.GFile(output_graph, "wb") as f:
            f.write(output_graph_def.SerializeToString())
        print("%d ops in the final graph." % len(output_graph_def.node))

    return output_graph_def

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", type=str, default="", help="Model folder to export")
    # parser.add_argument("--output_node_names", type=str, default="", help="The name of the output nodes, comma separated.")
    args = parser.parse_args()
    freeze_graph(args.model_dir)
I had to change a few lines - remove --output_node_names and change output_node_names in output_graph_def to [n.name for n in tf.get_default_graph().as_graph_def().node].
Now I have a new problem - I can't convert the .pb to .pbtxt, and the error is:
ValueError: Input 0 of node prefix/Variable/Assign was passed float from prefix/Variable:0 incompatible with expected float_ref.
And once again, the information on this problem is outdated - everything I found is at least a year old. I'm starting to think that my frozen_graph fix is not correct, and that is the reason why I'm getting this new error.
I would really appreciate some advice on this matter.
If you write
[n.name for n in tf.get_default_graph().as_graph_def().node]
in your convert_variables_to_constants call, you define every node the graph has as an output node, which of course will not work. (This is probably the reason for your ValueError.)
You need to find the name of the real output node. The best way to do this is often to look at the trained model in TensorBoard and analyze the graph there, or to print out every node of your graph. Often the last node that is printed is your output node (ignore everything that has 'gradients' in its name, or 'Adam' if you used that as an optimizer).
An easy way to do this (insert it after you restore the session):
gd = sess.graph.as_graph_def()
for node in gd.node:
    print(node.name)
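Once the real output node is known, the convert_variables_to_constants call from the code above only needs that single name instead of the full node list. A minimal sketch, using a hypothetical node name "softmax_output" (replace it with whatever your printout or TensorBoard shows):
# inside the same session, after saver.restore(sess, input_checkpoint)
output_node_names = ["softmax_output"]  # hypothetical name; use your real output node here
output_graph_def = tf.graph_util.convert_variables_to_constants(
    sess,
    tf.get_default_graph().as_graph_def(),
    output_node_names
)
with tf.gfile.GFile(output_graph, "wb") as f:
    f.write(output_graph_def.SerializeToString())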

load decision tree from file in scikit-learn

In Python scikit-learn, there is a method called export_graphviz to export a decision tree to a dot file.
I want to ask whether there is a method to import a dot file into scikit-learn as a decision tree, like some function called sklearn.tree.import_graphviz()?
AFAIK there is no easy way to do that. Graphviz can only be used to visualise the decision tree. If you wish to save the model, you can use pickle. For example:
import cPickle

# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    cPickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)
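As a side note, scikit-learn also recommends joblib for persisting fitted estimators, since it handles the large NumPy arrays inside them more efficiently; a minimal sketch with the same gnb classifier:
from sklearn.externals import joblib   # in recent scikit-learn versions: import joblib

joblib.dump(gnb, 'my_dumped_classifier.joblib')
gnb_loaded = joblib.load('my_dumped_classifier.joblib')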

How to store and read nolearn.lasagne NeuralNet models using pickle

How do I store the weights and biases of a nolearn.lasagne NeuralNet model? From the documentation, I can't see how to access the NeuralNet's weights and biases in order to store them.
To save the entire nolearn model (training history, parameters and architecture), you can do this:
import sys
import cPickle as pickle

sys.setrecursionlimit(10000)  # you may need this if the network is large
with open("model_file", 'wb') as f:
    pickle.dump(nolearnnet, f, -1)
Please note that in case you train your model on a GPU and pickle it using the above, but want to unpickle it on a CPU (or vice versa), this won't work. In that case you should just save the parameter values, which you can do like this:
weights = lasagne.layers.get_all_param_values(nolearnnet.get_all_layers()[-1])
And now you can save these weights. When you want to load them into another nolearn model, you can just do the following:
lasagne.layers.set_all_param_values(nolearnnet2.get_all_layers()[-1], weights)
It may help to refer to this discussion: https://groups.google.com/forum/#!topic/lasagne-users/BbG95R6SZ0I
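For completeness, a minimal sketch of persisting those parameter values to disk between the two steps above, following the usual Lasagne np.savez pattern (the file name is arbitrary):
import numpy as np
import lasagne

# save the raw parameter arrays (no pickled network object involved)
weights = lasagne.layers.get_all_param_values(nolearnnet.get_all_layers()[-1])
np.savez('model_weights.npz', *weights)

# later: load them and push them into a freshly built net with the same architecture
with np.load('model_weights.npz') as f:
    weights = [f['arr_%d' % i] for i in range(len(f.files))]
lasagne.layers.set_all_param_values(nolearnnet2.get_all_layers()[-1], weights)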

Training classifier with large data

I was trying two-class text classification. Usually I create pickle files of the trained model and load them during the training phase to eliminate retraining.
When I had 12000 reviews plus more than 50000 tweets for each class, the trained model size grew to 1.4 GB.
Storing this large model in a pickle and loading it back is really not feasible or advisable.
Is there any better alternative for this scenario?
Here is sample code; I tried multiple ways of pickling, and here I have used the dill package:
def train(self):
    global pos, neg, totals
    retrain = False
    # Load counts if they already exist.
    if not retrain and os.path.isfile(CDATA_FILE):
        # pos, neg, totals = cPickle.load(open(CDATA_FILE))
        pos, neg, totals = dill.load(open(CDATA_FILE, 'rb'))
        return
    for file in os.listdir("./unsuspected/"):
        for word in set(self.negate_sequence(open("./unsuspected/" + file).read())):
            neg[word] += 1
            pos['not_' + word] += 1
    for file in os.listdir("./suspected/"):
        for word in set(self.negate_sequence(open("./suspected/" + file).read())):
            pos[word] += 1
            neg['not_' + word] += 1
    self.prune_features()
    totals[0] = sum(pos.values())
    totals[1] = sum(neg.values())
    countdata = (pos, neg, totals)
    dill.dump(countdata, open(CDATA_FILE, 'wb'))
UPDATE: The reason behind the large pickle is that the classification data is very large, and I have considered 1-4 grams for feature selection. The classification dataset itself is around 300 MB, so the multigram approach to feature selection creates a large trained model.
Pickle is very heavy as a format. It stores all the details of the objects.
It would be much better to store your data in an efficient format like HDF5.
If you are not familiar with HDF5, you can look into storing your data in simple flat text files. You can use CSV or JSON, depending on your data structure. You'll find that either is more efficient than pickle.
You can also look at gzip to create and load compressed archives.
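As a rough illustration of that suggestion, the count data from the question could be written as gzipped JSON instead of a pickle; this sketch assumes Python 3 and that pos, neg and totals are plain string-keyed dicts or lists of numbers:
import gzip
import json

# save
with gzip.open('counts.json.gz', 'wt') as f:
    json.dump({'pos': pos, 'neg': neg, 'totals': totals}, f)

# load
with gzip.open('counts.json.gz', 'rt') as f:
    data = json.load(f)
pos, neg, totals = data['pos'], data['neg'], data['totals']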
The problem and solution are explained here. In short, the problem is that when doing featurization, e.g. using CountVectorizer, although you might ask for a small number of features, e.g. max_features=1000, the transformer still keeps a copy of all possible features for debugging purposes under the hood.
For instance, CountVectorizer has the following attribute:
stop_words_ : set
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
and this causes the model size to become too large. To solve this issue, you can set stop_words_ to None before pickling your model (taken from the example in the link above; please check the link for details):
import pickle
model_name = 'clickbait-model-sm.pkl'
cfr_pipeline.named_steps.vectorizer.stop_words_ = None
pickle.dump(cfr_pipeline, open(model_name, 'wb'), protocol=2)
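Loading the slimmed-down pipeline back later is then the usual pickle counterpart (same file name as above):
with open(model_name, 'rb') as f:
    cfr_pipeline = pickle.load(f)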
