I'm using scikit-learn to cluster text documents, relying on the CountVectorizer, TfidfTransformer and MiniBatchKMeans classes to do that.
New text documents are added to the system all the time, which means that I need to use the classes above to transform the text and predict a cluster. My question is: how should I store the data on disk?
Should I simply pickle the vectorizer, transformer and kmeans objects?
Should I just save the data? If so, how do I add it back to the vectorizer, transformer and kmeans objects?
Any help would be greatly appreciated
It depends on what you want to do.
If you want to find some fixed cluster centers on a training set and then re-use them later to compute cluster assignments for new data, then pickling the models (or just saving the vectorizer's vocabulary, the other models' constructor parameters and the cluster center positions) is ok.
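For that first case, a minimal sketch of the pickling route could look like the following (the tiny corpus, file name and n_clusters are placeholders, not from the question):

import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import MiniBatchKMeans

documents = ["the cat sat on the mat", "dogs chase cats",
             "stock markets fell today", "investors sold shares"]

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0)

# Fit once on the training corpus and persist the fitted objects
tfidf = transformer.fit_transform(vectorizer.fit_transform(documents))
kmeans.fit(tfidf)
with open('text_clustering.pkl', 'wb') as f:
    pickle.dump((vectorizer, transformer, kmeans), f)

# Later: reload and assign clusters to new documents without refitting
with open('text_clustering.pkl', 'rb') as f:
    vectorizer, transformer, kmeans = pickle.load(f)
new_docs = ["a cat chased a dog"]
labels = kmeans.predict(transformer.transform(vectorizer.transform(new_docs)))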
If what you want is to cluster new data, you might want to retrain the whole pipeline on the union of the new and old data, so that the vectorizer's vocabulary can build new features (dimensions) for the new words and the clustering algorithm can find cluster centers that better match the structure of the complete dataset.
Note that in the future we will provide hashing vectorizers (see for instance this pull request on hashing transformers as a first building block), hence storing the vocabulary won't be necessary any more (but you will lose the ability to introspect the "meaning" of the feature dimensions).
As for pickling the models vs using your own representation for their parameters I have answered this part in your previous question here: Persist Tf-Idf data
Yeah, I think the general answer with sk-learn is to pickle and pray.
It seems to me that this is super fragile, compared to having a documented serialization format that doesn't depend on implementation details. But maybe they know this, and won't make backwards-incompatible changes to their classes?
Suppose that I have preprocessed some text data, removed stopwords, urls and so on.
How should I structure these cleaned data in order to make them usable by a classifier like a neural network? Is there a preferred structure, or a rule of thumb? (Bag of words, tf-idf, or anything else?) Also, can you suggest a Python package that automates all this work?
Now I train the model, and things work properly.
The model performs well on the test set too.
How do I treat unseen data?
When I implement the model in a real-life project it will encounter new data: do I have to store the structure (like the tf-idf structure) I used for training and apply it to this new data?
Also, let's suppose that the word "hello" did not appear in the training/validation/test data, so it has no representation, and a real-life sentence I have to classify contains the word "hello".
How do I cope with this problem?
Thanks for all the clarifications.
What you can do is make a class and inside it define a function for each step, like:
import dataset
data cleaning
data preprocessing (BOW, TF-IDF)
model building
predictions
You can follow the code from the link below to get an understanding:
https://github.com/azeem110201/lifecycledatascienceproject
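A rough skeleton of such a class might look like this (the method names and the tf-idf + naive Bayes pipeline are only an illustration, not taken from the linked repository):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

class TextClassifierPipeline:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()
        self.model = MultinomialNB()

    def clean(self, texts):
        # data cleaning: lowercase (add stopword/URL removal as needed)
        return [t.lower() for t in texts]

    def fit(self, texts, labels):
        # data preprocessing (tf-idf) + model building
        X = self.vectorizer.fit_transform(self.clean(texts))
        self.model.fit(X, labels)
        return self

    def predict(self, texts):
        # predictions: reuse the fitted vectorizer on unseen text
        return self.model.predict(self.vectorizer.transform(self.clean(texts)))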
I have implemented an ML model using the naive Bayes algorithm, and I want to add incremental learning. The issue I am facing is that when I train my model, preprocessing generates 1500 features. A month later, using a feedback mechanism, I want to train my model with new data which might contain some new features, possibly fewer or more than the 1500 of my previous dataset. If I use fit_transform to get the new features, my existing feature set gets lost.
I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make it learn incrementally?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # refitting replaces my older feature set

classifier = GaussianNB()
classifier.partial_fit(X, y)
# does not fit because the new feature count does not equal the previous feature count
You could use just transform() for the CountVectorizer() and then partial_fit() for Naive Bayes, as follows, for incremental learning. Remember, transform extracts the same set of features that you learned from the training dataset.
X = cv.transform(corpus)
classifier.partial_fit(X,y)
But you cannot rebuild the features from scratch and continue the incremental learning. The number of features needs to be consistent for any model to learn incrementally.
If you think your new dataset has significantly different features compared to the older one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means we are going to create a new model for the entire available data. You could adopt this if your dataset is small enough to keep in memory!
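A sketch of that full-refit option (old_corpus, new_corpus and the labels are placeholders standing in for your stored old data and the new feedback data):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Placeholder data standing in for the old dataset and the new feedback data
old_corpus, y_old = ["good product", "bad service"], [1, 0]
new_corpus, y_new = ["awesome support", "terrible delivery"], [1, 0]

cv = CountVectorizer()
X = cv.fit_transform(old_corpus + new_corpus).toarray()  # new, larger feature space
y = np.concatenate([y_old, y_new])

classifier = GaussianNB()
classifier.fit(X, y)  # a brand new model trained on all available data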
You cannot with CountVectorizer. You will need to fix the number of features for partial_fit() in GaussianNB.
Now you can use a different preprocessor (in place of CountVectorizer) which can map the inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which is recommended by the scikit-learn authors for exactly the scenario you mentioned. While initializing it, you will need to specify the number of features you want. In most cases, the default value is enough to avoid collisions between the hashes of different words, but you may try experimenting with different numbers. Try using that and check the performance. If it is not on par with CountVectorizer, then you can do what #AI_Learning suggests and make a new model on the whole data (old+new).
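A sketch of the HashingVectorizer route (the corpora and labels are placeholders; the point is that n_features is fixed up front, so old and new batches land in the same feature space):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB

vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
classifier = GaussianNB()

# First batch (placeholder data)
old_corpus, y_old = ["good product", "bad service"], [1, 0]
X_old = vectorizer.transform(old_corpus).toarray()
classifier.partial_fit(X_old, y_old, classes=np.unique(y_old))

# A month later: new texts with new words, but still 4096 features
new_corpus, y_new = ["awesome support", "terrible delivery"], [1, 0]
X_new = vectorizer.transform(new_corpus).toarray()
classifier.partial_fit(X_new, y_new)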
I have two sets of data, let's say A and B.
I want to apply PCA and t-SNE to A and fine-tune the algorithms.
Once, I am satisfied with my tuning I want to save the learnt things to some pickle file.
Now I want to apply the same learnt PCA and t-SNE to set B.
I want t-SNE to produce the same results every time on B. I am hoping this because I am assuming we can also save the state of the learnt t-SNE parameters. If the parameters are the same and I load the same file every time, the result of applying t-SNE to set B should be the same every time.
How can I do this in Sklearn and python?
I am sorry, I am new to ML and python, this may be a very basic question.
Fine-tuning t-SNE amounts to tuning a heuristic algorithm for your data (it's ill-conditioned after all; the higher-dimension to lower-dimension mapping is lossy).
Applying this tuned & learned mapping to other data is done by sklearn's transform.
But: you will see that there is no transform method for t-SNE, and the reason is given here (including further discussion):
It is a transductive learner, like many clustering algorithms: the model is not really applicable beyond the data points it is fed as training.
So whatever you tuned for dataset A, does not really apply to dataset B (including parameters)!
For PCA this is trivial. Use the methods described in docs: model_persistence and use PCA's transform-method (assuming compatible datasets; dimensions!).
To be able to save the models you can use the joblib library:
from joblib import dump, load
After establishing and fitting the model as below with PCA:
pca_model = PCA(n_components=n)
pca_model.fit(A)  # learn the components from your training set A
you can save the model in joblib format in the current directory:
dump(pca_model, 'pca_model.joblib')
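Later, possibly in a different script, you can reload it and apply the same learned projection to set B (a minimal sketch; B stands for your second dataset and must have the same number of columns as A):

from joblib import load

pca_model = load('pca_model.joblib')   # same fitted components as before
B_reduced = pca_model.transform(B)     # deterministic: same result every time for the same B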
I am looking for a way to load vectors I generated previously using scikit-learn's TfidfVectorizer. In general what I wish is to get a better understanding of the TfidfVectorizer's data persistence.
For instance, what I did so far is:
vectorizer = TfidfVectorizer(stop_words=stop)
vect_train = vectorizer.fit_transform(corpus)
Then I wrote 2 functions in order to be able to save and load my vectorizer:
import joblib

def save_model(model, name):
    '''
    Function that enables us to save a trained model
    '''
    joblib.dump(model, '{}.pkl'.format(name))

def load_model(name):
    '''
    Function that enables us to load a saved model
    '''
    return joblib.load('{}.pkl'.format(name))
I checked posts like the one below but I still didn't manage to make much sense of it:
How do I store a TfidfVectorizer for future use in scikit-learn?
What I ultimately wish is to be able to have a training session, then load this set of produced vectors, transform some new text input based on those vectors, and perform cosine_similarity between the old vectors and the new ones generated from them.
One of the reasons that I wish to do this is because the vectorization in such a large dataset takes approximately 10 minutes and I wish to do this once and not every time a new query comes in.
I guess what I should be saving is vect_train, right? But then what is the correct way to first save it and then load it into a newly created instance of TfidfVectorizer?
The first time I tried to save vect_train with joblib, as the kind people at scikit-learn advise, I got 4 files: tfidf.pkl, tfidf.pkl_01.npy, tfidf.pkl_02.npy, tfidf.pkl_03.npy. It would be great if I knew what exactly those are and how I could load them into a new instance of
vectorizer = TfidfVectorizer(stop_words=stop)
created in a different script.
Thank you in advance.
The result of your vect_train = vectorizer.fit_transform(corpus) is twofold: (i) the vectorizer fits your data, that is it learns the corpus vocabulary and the idf for each term, and
(ii) vect_train is instantiated with the vectors of your corpus.
The save_model and load_model functions you propose persist and load the vectorizer, that is the internal parameters that it has learned such as the vocabulary and the idfs. Having loaded the vectorizer, all you need to get vectors is to transform a list with data. It can be unseen data, or the raw data you used during the fit_transform. Therefore, all you need is:
vectorizer = load_model(name)
vect_train = vectorizer.transform(corpus) # (1) or any unseen data
At this point, you have everything you had before saving, but the transformation call (1) will take some time depending on your corpus. In case you want to skip this, you need to also save the content of vect_train, as you correctly wonder in your question. This is a sparse matrix and can be saved/loaded using scipy, you can find information in this question for example. Copying from that question, to actually save the csr matrices you also need:
import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
Concluding, the above functions can be used for saving/loading your vect_train, whereas the ones you provided can be used for saving/loading the vectorizer in order to vectorize the new data.
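For the cosine-similarity use case described in the question, a rough sketch of the full round trip could be (file names and new_texts are placeholders, and it assumes you called save_model and save_sparse_csr after fitting):

from sklearn.metrics.pairwise import cosine_similarity

# Assumed earlier: save_model(vectorizer, 'tfidf_vectorizer') and
# save_sparse_csr('vect_train', vect_train) were called after fit_transform.
vectorizer = load_model('tfidf_vectorizer')       # the fitted TfidfVectorizer
vect_train = load_sparse_csr('vect_train.npz')    # np.savez added the .npz suffix

new_texts = ["some new query text"]               # placeholder for incoming data
vect_new = vectorizer.transform(new_texts)

# Similarity of each new document against every training document
similarities = cosine_similarity(vect_new, vect_train)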
I have an NLP task which basically is supervised text classification. I tagged a corpus with its POS tags, then I used the different vectorizers that scikit-learn provides in order to feed a classification algorithm that scikit-learn also provides. I also have the labels (categories) of the corpus, which I previously obtained in an unsupervised way.
First I POS-tagged the corpus, then I obtained some differents bigrams, they have the following structure:
bigram = [[('word','word'),...,('word','word')]]
It seems that I have everything I need to classify (I already classified some small examples, but not the whole corpus).
I would like to use the bigrams as features in order to present them to a classification algorithm (multinomial naive Bayes, SVM, etc.).
What could be a standard (pythonic) way to arrange all the text data to classify and to show the results of the classified corpus? I was thinking about using ARFF files and numpy arrays, but I guess that could complicate the task unnecessarily. On the other hand, I was thinking about splitting the data into train and test folders, but I don't see how to set up the labels in the train folder.
Your question is very vague. There are books and courses on the subject you can access.
Have a look at this blog for a start (1) and these courses (2 and 3).
The easiest option is load_files, which expects a directory layout
data/
positive/ # class label
1.txt # arbitrary filename
2.txt
...
negative/
1.txt
2.txt
...
...
(This isn't really a standard, it's just convenient and customary. Some ML datasets on the web are offered in this format.)
The output of load_files is a dict-like object with the data in it.
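A minimal usage sketch for that layout (the 'data/' path is whatever directory you created):

from sklearn.datasets import load_files

dataset = load_files('data/', encoding='utf-8')
texts, labels = dataset.data, dataset.target      # raw documents and integer class ids
print(dataset.target_names)                       # e.g. ['negative', 'positive']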
1) larsmans has already mentioned a convenient way to arrange and store your data.
2) When using scikit-learn, numpy arrays always make life easier, as they have many features for changing the arrangement of your data easily.
3) Training data and testing data are labeled in the same way, so you would usually have something like:
bigramFeatureVector = [(featureVector0, label), (featureVector1, label),..., (featureVectorN, label)]
The proportion of training data to testing data highly depends on the size of your data. You should indeed learn about n-fold cross validation, because it will resolve all your doubts, and most probably you will have to use it for more accurate evaluations. To briefly explain it: for 10-fold cross validation, you hold all your data along with labels in one array (something like my example above). Then, in a loop running 10 times, you leave one tenth of the data out for testing and use the rest for training. Once you learn this, you will have no confusion about how training or testing data should look; they should both look exactly the same.
4) How to visualize your classification results depends on what evaluation measures you would like to use. It's unclear in your question, but let me know if you have further questions.
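For example, here is a rough sketch of 10-fold cross validation on bigram features (texts and labels stand in for your own corpus and category ids; nothing here is specific to your data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# texts: list of raw documents, labels: list of category ids (placeholders here)
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),   # bigram bag-of-words features
    MultinomialNB(),
)
scores = cross_val_score(pipeline, texts, labels, cv=10)
print(scores.mean(), scores.std())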