load decision tree from file in scikit-learn - python

In Python scikit-learn, there is a function called export_graphviz to export a decision tree to a dot file.
I want to ask: is there a method to import a dot file into scikit-learn as a decision tree? Something like a sklearn.tree.import_graphviz() function?

AFAIK there is no easy way to do that. Graphviz is only used to visualise the decision tree, not to round-trip it back into scikit-learn. If you wish to persist the model, you can use pickle instead. For example:
import pickle  # the original answer used cPickle, which is Python 2 only

# save the classifier (gnb is an already-fitted estimator)
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)
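For models that hold large numpy arrays, scikit-learn's persistence docs also suggest joblib, which is more efficient than plain pickle for that case. A minimal sketch (the file name is illustrative):
from joblib import dump, load

# persist the fitted estimator to disk
dump(gnb, 'my_classifier.joblib')

# restore it later
gnb_loaded = load('my_classifier.joblib')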

Related

Sklearn KMeans cluster center change after saving and loading with Pickle

I'm training a KMeans model for vector quantization and alter the cluster centers manually for optimization with a variation of the Kiefer-Wolfowitz algorithm.
I take the cluster centers, adjust them, and put them back into the model.
The error occurs when I save the model with pickle and load it later:
when I evaluate the quantizer I get different results after saving and loading.
To retrieve and set the centers I have getter and setter methods.
def getWeights(self):
    return self.kmeans.cluster_centers_.flatten()

def setWeights(self, weights):
    self.kmeans.cluster_centers_ = weights.reshape(self.clusters, self.dim)
To save the model I use pickle, but I also tried joblib.
# save
with open(os.path.join(savepath, savename + '.pkl'), 'wb') as f:
    pickle.dump(self.kmeans, f)

# load
with open(os.path.join(savepath, savename + '.pkl'), 'rb') as f:
    self.kmeans = pickle.load(f)
I compared the cluster centers with np.sum(cc1 - cc2) and got
[-23657.44412046 -27826.84822967 -34863.87009913 -22867.6671942 -31120.73019114]
Where are these differences coming from?
Does pickle not save the full precision?
Or is it because of package versions?
The scikit-learn version is 0.24.2 on Python 3.6 because of some dependencies.
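For what it's worth, pickle itself round-trips numpy arrays bit-for-bit, which a standalone check can confirm (a sanity sketch; the array is a stand-in for your cluster centers):
import pickle
import numpy as np

arr = np.random.rand(5, 128)  # stand-in for the cluster centers
restored = pickle.loads(pickle.dumps(arr))

# exact equality, not approximate: pickle stores the raw float64 bytes
print(np.array_equal(arr, restored))  # True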

How to load CSV file instead of built in dataset in "Surprise" Python recommender system?

I don't know how to write code that loads a CSV file or .inter file instead of the built-in dataset in this example of evaluating a recommender system:
from surprise import SVD
from surprise import KNNBasic
from surprise import Dataset
from surprise.model_selection import cross_validate
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')
# Use the KNNBasic algorithm.
algo = KNNBasic()
# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
How would the full line of code be where I only need to input datapath and filename? I have tried the website for Surprise, but I didn't find anything. So I don't want the movielens code in the example, but instead a line that loads a datapath and file.
First you need to create an instance of Reader():
reader = Reader(line_format=u'rating user item', sep=',', rating_scale=(1, 6), skip_lines=1)
Note that the field names in line_format can only be 'user', 'item', 'rating' (optionally with 'timestamp' added), and they have nothing to do with the column names in your custom_rating.csv. That's why the skip_lines=1 parameter is defined (it skips the first line of your CSV file, where the column names usually are).
On the other hand, line_format does determine the order of the columns. So just to be clear, my custom_rating.csv looks like this:
rating,userId,movieId
4,1,1
6,1,2
1,1,3
...
Now you can create your data instance:
data = Dataset.load_from_file("custom_rating.csv", reader=reader)
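If your ratings are already in a pandas DataFrame, an alternative worth knowing is Dataset.load_from_df (a sketch; the columns must be passed in user, item, rating order, and the Reader then only needs rating_scale):
import pandas as pd
from surprise import Dataset, Reader

df = pd.read_csv('custom_rating.csv')
reader = Reader(rating_scale=(1, 6))

# load_from_df expects the columns in user, item, rating order
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)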
Finally you can proceed with creating SVD model as shown in examples:
# sample random trainset and testset
# test set is made of 20% of the ratings.
trainset, testset = train_test_split(data, test_size=.2)
# We'll use the famous SVD algorithm.
algo = SVD()
# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)
# Then compute RMSE
accuracy.rmse(predictions)
PS: Don't forget to import the libraries at the beginning of your code :)
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import train_test_split
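Putting it together with your original snippet: the dataset loaded from the CSV drops straight into cross_validate, with the same API as the built-in data:
from surprise import KNNBasic
from surprise.model_selection import cross_validate

# 5-fold cross-validation on the custom CSV data loaded above
cross_validate(KNNBasic(), data, measures=['RMSE', 'MAE'], cv=5, verbose=True)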

Is there a way to extract Tree depths from a Random Forest model?

I have created a Random Forest classifier and I'm trying to produce a histogram of the depths of the trees in my random forest model. I just haven't been able to extract the depth of every tree in the forest.
My RF model is called 'RF_optimised', and I've tried the code below to iterate over my trees and visualise them, which has worked. I have gone through the estimators_ and export_graphviz documentation, but there doesn't seem to be a way to extract the actual depth of each tree.
import pydotplus
from IPython.display import Image, display
from sklearn.externals.six import StringIO  # on Python 3 you can use io.StringIO instead
from sklearn.tree import export_graphviz

for tree_in_forest in RF_optimised.estimators_:
    # Create a fresh string buffer to write to (a fake text file)
    f = StringIO()
    export_graphviz(tree_in_forest, out_file=f,
                    # feature_names=col,
                    filled=True,
                    rounded=True,
                    proportion=True)
    graph = pydotplus.graph_from_dot_data(f.getvalue())
    display(Image(graph.create_png()))
I need a function that iterates over the trees in my Random Forest and stores the depth of the trees in a list or data-frame, in order to produce a histogram later. Can anyone help?
Some exploration in the interpreter shows that each estimator's underlying Tree instance (its tree_ attribute) has a max_depth attribute, which appears to be what I'm looking for; again, it's undocumented.
[estimator.tree_.max_depth for estimator in RF_optimised.estimators_]
did the trick for me :)
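To get the histogram the question asks for, that list can go straight into matplotlib (a minimal sketch, assuming RF_optimised is already fitted):
import matplotlib.pyplot as plt

depths = [estimator.tree_.max_depth for estimator in RF_optimised.estimators_]

# one bin per integer depth value
plt.hist(depths, bins=range(min(depths), max(depths) + 2))
plt.xlabel('Tree depth')
plt.ylabel('Number of trees')
plt.show()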

FastText in Gensim

I am using gensim to load my fastText .vec file as follows:
m = KeyedVectors.load_word2vec_format(filename, binary=False)
However, I am confused about whether I need to load the .bin file to run commands like m.most_similar("dog"), m.wv.syn0, m.wv.vocab.keys(), etc. If so, how do I do it?
Or is the .bin file not needed for this cosine-similarity matching?
Please help me!
The following can be used:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('path/to/model.vec')
model.most_similar("summer")
model.similarity("summer", "winter")
There are many ways to use the model from here.
The gensim library has evolved, so some code fragments have been deprecated. This is an actual working solution:
import gensim.models.wrappers.fasttext
model = gensim.models.wrappers.fasttext.FastTextKeyedVectors.load_word2vec_format(Source + '.vec', binary=False, encoding='utf8')
word_vectors = model.wv
# -- this saves space, if you plan to use only, but not to train, the model:
del model
# -- do your work:
word_vectors.most_similar("etc")
If you want to be able to retrain the gensim model later with additional data, you should save the whole model like this: model.save("fasttext.model").
If you save just the word vectors with model.wv.save_word2vec_format(Path("vectors.txt")), you will still be able to perform any of the functions that vectors provide - like similarity, but you will not be able to retrain the model with more data.
Note that if you are saving the whole model, you should pass a file name as a string instead of wrapping it in get_tmpfile, as suggested in the gensim documentation.
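For completeness: in current gensim (3.8+/4.x) a native fastText .bin file can be loaded directly, which also answers the .bin part of the question (a sketch; the file name is illustrative):
from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

# full model: can be trained further, handles out-of-vocabulary words via n-grams
model = load_facebook_model('model.bin')

# vectors only: lighter, sufficient for similarity queries
wv = load_facebook_vectors('model.bin')
print(wv.most_similar('dog'))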
Maybe I am late in answering this, but you can find your answer in the documentation: https://github.com/facebookresearch/fastText/blob/master/README.md#word-representation-learning
Example use cases
This library has two main use cases: word representation learning and text classification. These were described in the two papers 1 and 2.
Word representation learning
In order to learn word vectors, as described in 1, do:
$ ./fasttext skipgram -input data.txt -output model
where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyper parameters. The binary file can be used later to compute word vectors or to restart the optimization.

How to load SVM data from file in OpenCV 3.1?

I have a problem loading a trained SVM from a file. I use Python and OpenCV 3.1.0. I create the svm object with:
svm = cv2.ml.SVM_create()
Next, I train the svm and save it to a file with:
svm.save('data.xml')
Now I want to load this file in another Python script. In the docs I can't find any method to do it.
Is there a trick to load an svm from a file? Thanks for any responses.
I think it's a little bit confusing that there is no svm.load(filepath) method as a counterpart of svm.save(filepath), but when I read the module help it makes sense that SVM_load is a child of cv2.ml (a sibling of SVM_create).
Be sure that your opencv master branch is up-to-date (currently version 3.1.0-dev)
>>> import cv2
>>> cv2.__version__
'3.1.0-dev'
>>> help(cv2.ml)
returns
SVM_create(...)
    SVM_create() -> retval
SVM_load(...)
    SVM_load(filepath) -> retval
so you can simply use something like:
import os

if not os.path.isfile('svm.dat'):
    svm = cv2.ml.SVM_create()
    # ... set parameters and train here ...
    svm.save('svm.dat')
else:
    svm = cv2.ml.SVM_load('svm.dat')
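Once loaded, the model is used like a freshly trained one; for example (a sketch, assuming samples is a float32 matrix with the same number of columns as the training data):
import numpy as np

# predict() expects a float32 matrix, one row per sample
samples = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
_, results = svm.predict(samples)
print(results)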
