I am looking for a way to load vectors I generated previously using scikit-learn's TfidfVectorizer. In general, what I want is a better understanding of how TfidfVectorizer data persistence works.
For instance, what I did so far is:
vectorizer = TfidfVectorizer(stop_words=stop)
vect_train = vectorizer.fit_transform(corpus)
Then I wrote 2 functions in order to be able to save and load my vectorizer:
import joblib

def save_model(model, name):
    '''
    Function that enables us to save a trained model
    '''
    joblib.dump(model, '{}.pkl'.format(name))

def load_model(name):
    '''
    Function that enables us to load a saved model
    '''
    return joblib.load('{}.pkl'.format(name))
I checked posts like the one below, but I still couldn't make much sense of it:
How do I store a TfidfVectorizer for future use in scikit-learn?
What I ultimately want is to be able to run a training session, then load the set of produced vectors, transform some new text input with the same fitted vectorizer, and compute cosine_similarity between the old vectors and the new ones generated from them.
One of the reasons I want to do this is that vectorizing such a large dataset takes approximately 10 minutes, and I want to do it once and not every time a new query comes in.
I guess what I should be saving is vect_train, right? But then what is the correct way to save it and later load it into a newly created instance of TfidfVectorizer?
The first time I tried to save vect_train with joblib, as the kind people at scikit-learn advise, I got 4 files: tfidf.pkl, tfidf.pkl_01.npy, tfidf.pkl_02.npy, tfidf.pkl_03.npy. It would be great to know what exactly those files are and how I could load them into a new instance of
vectorizer = TfidfVectorizer(stop_words=stop)
created in a different script.
Thank you in advance.
The result of your vect_train = vectorizer.fit_transform(corpus) is twofold: (i) the vectorizer fits your data, that is, it learns the corpus vocabulary and the idf for each term, and (ii) vect_train holds the vectors of your corpus.
The save_model and load_model functions you propose persist and load the vectorizer, that is, the internal parameters it has learned, such as the vocabulary and the idfs. Having loaded the vectorizer, all you need to get vectors is to transform a list of data. It can be unseen data, or the raw data you used during fit_transform. Therefore, all you need is:
vectorizer = load_model(name)
vect_train = vectorizer.transform(corpus) # (1) or any unseen data
At this point, you have everything you had before saving, but the transformation call (1) will take some time depending on your corpus. In case you want to skip this, you also need to save the content of vect_train, as you correctly wonder in your question. This is a sparse matrix and can be saved/loaded with scipy; you can find information in this question, for example. Copying from that question, to actually save the csr matrices you also need:
import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
Concluding, the functions above can be used for saving/loading your vect_train, while the ones you provided can be used for saving/loading the vectorizer in order to vectorize new data.
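Putting the pieces together, here is a minimal end-to-end sketch of the workflow described above (it reuses the helper functions defined earlier; the names 'tfidf', 'train_matrix', and new_queries are placeholders for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Training script: fit once, persist both the vectorizer and the matrix.
vectorizer = TfidfVectorizer(stop_words=stop)
vect_train = vectorizer.fit_transform(corpus)
save_model(vectorizer, 'tfidf')                    # vocabulary + idf weights
save_sparse_csr('train_matrix', vect_train)        # the precomputed training vectors

# Query script (run later, in a different process): no re-fitting needed.
vectorizer = load_model('tfidf')
vect_train = load_sparse_csr('train_matrix.npz')   # np.savez appends .npz
vect_new = vectorizer.transform(new_queries)       # vectors for the new text
similarities = cosine_similarity(vect_new, vect_train)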
Related
Suppose that I have preprocessed some text data, removed stopwords, URLs and so on.
How should I structure this cleaned data in order to make it usable for a classifier like a neural network? Is there a preferred structure, or a rule of thumb? (Bag of words, tf-idf or anything else?) Also, can you suggest some package which will automatically do all the work in Python?
Now I train the model, and things work properly.
The model performs well on the test set too.
How should I treat unseen data?
When I decide to implement the model in a real-life project it will encounter new data: do I have to store the structure (like the tf-idf structure) I used for training and apply it to this new data?
Also, let's suppose that the word "hello" did not appear in the training/validation/test data, so it has no representation, but a real-life sentence I have to classify contains it. How do I cope with this problem?
Thanks for all the clarifications.
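As a side note on the "hello" case above, here is a minimal sketch (the toy sentences are made up) showing that a fitted TfidfVectorizer simply ignores words it never saw during fitting, so the stored tf-idf structure can still be applied to new text:
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["good morning everyone", "nice day today"]   # hypothetical training data
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)             # this is the structure to persist

X_new = vectorizer.transform(["hello nice day"])            # "hello" was never seen
print(X_new.shape[1] == X_train.shape[1])                   # True: same feature space
print(sorted(vectorizer.vocabulary_))                       # vocabulary fixed at fit time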
What you can do is make a class and inside it define functions for each step, like:
import dataset
data cleaning
data preprocessing (BOW, TF-IDF)
model building
predictions
You can follow the code from the link below to get an understanding; a rough sketch of such a class is also given after the link.
https://github.com/azeem110201/lifecycledatascienceproject
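For illustration, a rough sketch of such a class (the method names and the LogisticRegression stand-in for the neural network are assumptions, not taken from the linked repository):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

class TextPipeline:
    """Bundles cleaning, vectorization and the classifier, so the fitted
    tf-idf structure is stored together with the model and reused on new data."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer(stop_words='english')
        self.model = LogisticRegression(max_iter=1000)

    def clean(self, texts):
        # placeholder for your cleaning: strip URLs, stopwords, and so on
        return [t.lower() for t in texts]

    def fit(self, texts, labels):
        X = self.vectorizer.fit_transform(self.clean(texts))
        self.model.fit(X, labels)
        return self

    def predict(self, texts):
        # the same fitted vectorizer is applied; words unseen at fit time are ignored
        X = self.vectorizer.transform(self.clean(texts))
        return self.model.predict(X)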
I often see in many Tensorflow tutorials text like:
To do this calculation, you need the column means. You would obviously need to compute these in real life, but for this example we'll just provide them.
For small or medium-sized CSV datasets, computing the mean is as easy as a pandas method on a dataframe or using scikit-learn.
BUT, if we have a large dataset, say a CSV file that is 50GB, then how do you calculate the mean or other similar statistics? Tensorflow Transform claims that it can calculate global summary statistics, but the documentation doesn't really explain how this works or how to integrate it into a workflow.
Here is the code example from their getting started guide.
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = x_centered * y_normalized
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
        's_integerized': s_integerized
    }
The documentation says that this code will run tft.mean(x) over the entire dataset, but it is not clear how that happens, since x seems limited to the scope of a single batch. Here is the claim in the documentation:
While not obvious in the example above, the user defined preprocessing function is passed tensors representing batches and not individual instances, as happens during training and serving with TensorFlow. On the other hand, analyzers perform a computation over the entire dataset that returns a single value and not a batch of values. x is a Tensor with a shape of (batch_size,), while tft.mean(x) is a Tensor with a shape of ().
So the questions are:
Does tft.mean() run over the entire dataset first, and only after computing the global mean does it begin to load batches?
Are there any more detailed or complete examples of using these tft transforms in a workflow? For example, can these transforms be included in a single batch preprocessing function in a tf.data.Dataset.map() call, or how?
Suppose I was trying to write some code to calculate the average age of individuals in my TensorFlow dataset. Here is the code I have so far. Is this the best way to do something like this, or is there a better way?
I used the TensorFlow 2.0 make_csv_dataset(), which takes care of stacking the examples from the CSV file into a column structure. Note that I took the code for make_csv_dataset() from the new tutorial on the TensorFlow website referenced in the link above.
import numpy as np
import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    file_path,
    batch_size=32,
    label_name=LABEL_COLUMN,
    na_value="?",
    num_epochs=1,
    ignore_errors=True)

list_of_batch_means = []
for ex_features, ex_labels in dataset:                 # TF 2.x: iterate the dataset directly
    age = tf.cast(ex_features['age'], tf.float32)
    batch_length = tf.cast(tf.size(age), tf.float32)   # actual batch size (last batch may be smaller)
    batch_sum = tf.reduce_sum(age)
    list_of_batch_means.append(batch_sum / batch_length)
average_age = np.mean(list_of_batch_means)
As a caveat, I divided batch_sum by the batch length because the final batch will not necessarily be the same size as the other batches, so I did that calculation manually instead of using tf.reduce_mean(). This might be a minor issue if you have a lot of batches, but I just wanted to be as accurate as possible.
Any suggestions would be appreciated.
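For comparison, a hedged alternative sketch (assuming TF 2.x eager execution and the same dataset as above) that accumulates one global sum and count instead of averaging per-batch means, so the smaller final batch is weighted correctly:
# Pull out the 'age' column as float batches.
ages = dataset.map(lambda features, label: tf.cast(features['age'], tf.float32))

def accumulate(state, batch):
    total, count = state
    return total + tf.reduce_sum(batch), count + tf.cast(tf.size(batch), tf.float32)

total, count = ages.reduce((tf.constant(0.0), tf.constant(0.0)), accumulate)
average_age = (total / count).numpy()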
The most important concept in tf.Transform is the preprocessing function. The preprocessing function is the logical description of the transformation of the dataset. A preprocessing function accepts and returns a dictionary of Tensors. There are 2 kinds of functions (steps) used to define a preprocessing function:
Analyze Step
Transform Step
Analyze step: it iterates through the whole dataset and creates a graph. So, for example, in order to calculate the mean, the full dataset is passed over to compute the average of a particular column (this step requires a full pass over the dataset).
Transform step: it basically uses the graph that has been created in the analyze step and transforms the complete dataset.
So, basically, the constants calculated in the analyze step are used in the transform step.
For a better understanding, you can go through this video followed by this presentation, which should solidify how Tensorflow Transform works internally.
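To make the workflow concrete, here is a minimal sketch along the lines of the TFT getting-started guide (assuming apache_beam and tensorflow_transform.beam are installed; the tiny in-memory raw_data is made up for illustration):
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_data = [
    {'x': 1.0, 'y': 1.0, 's': 'hello'},
    {'x': 2.0, 'y': 2.0, 's': 'world'},
    {'x': 3.0, 'y': 3.0, 's': 'hello'},
]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'x': tf.io.FixedLenFeature([], tf.float32),
        'y': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))

def preprocessing_fn(inputs):
    x_centered = inputs['x'] - tft.mean(inputs['x'])        # analyzer: full pass, returns a constant
    y_normalized = tft.scale_to_0_1(inputs['y'])
    s_integerized = tft.compute_and_apply_vocabulary(inputs['s'])
    return {'x_centered': x_centered,
            'y_normalized': y_normalized,
            's_integerized': s_integerized}

# The pipeline runs the analyze pass (computing the constants over the full dataset)
# and then the transform pass; transform_fn can be saved and reused at serving time.
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset
print(transformed_data)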
I have implemented an ML model using the naive Bayes algorithm, and I want to implement incremental learning. The issue I am facing is this: when I train my model, preprocessing generates 1500 features; a month later, using a feedback mechanism, I want to train the model with new data which might contain new features, possibly fewer or more than the previous 1500. If I use fit_transform to get the new features, my existing feature set gets lost.
I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make it learn incrementally?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # replaces my older feature set
classifier = GaussianNB()
classifier.partial_fit(X, y)
# does not fit because the new feature count is not equal to the previous feature count
You could use just transform() for the CountVectorizer() and then partial_fit() for Naive Bayes, as follows, for incremental learning. Remember, transform extracts the same set of features that you learned from the training dataset.
X = cv.transform(corpus)
classifier.partial_fit(X,y)
But you cannot revamp the features from scratch and continue the incremental learning; the number of features needs to stay consistent for any model to learn incrementally.
If you think your new dataset has significantly different features compared to the older one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means creating a new model for all the available data. You could adopt this approach if your dataset is small enough to keep in memory! A minimal sketch of this option is shown below.
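A minimal sketch of that retrain-from-scratch option, assuming old_corpus/old_y and new_corpus/new_y (placeholder names) hold the two rounds of data:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
full_corpus = old_corpus + new_corpus             # combine both rounds of raw text
full_y = old_y + new_y

X_full = cv.fit_transform(full_corpus).toarray()  # new vocabulary, possibly a new feature count
classifier = GaussianNB()
classifier.fit(X_full, full_y)                    # brand-new model trained on all the data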
You cannot do this with CountVectorizer. You will need to fix the number of features for partial_fit() in GaussianNB.
Now you can use a different preprocessor (in place of CountVectorizer) which maps the inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which the scikit-learn authors recommend for just the scenario you mentioned. When initializing it, you will need to specify the number of features you want. In most cases the default value is enough to avoid collisions between hashes of different words, but you may try experimenting with different numbers. Try using that and check the performance; if it is not on par with CountVectorizer, then you can do what @AI_Learning suggests and build a new model on the whole data (old + new). A sketch follows below.
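A minimal sketch of the HashingVectorizer approach, assuming old_corpus/old_y and new_corpus/new_y are placeholders for the two rounds of data and that there are two classes:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB

# Fix the feature space up front so every future batch maps to the same columns.
hv = HashingVectorizer(n_features=2**12, alternate_sign=False)
clf = GaussianNB()

# First training round: the classes must be listed on the first partial_fit call.
X_old = hv.transform(old_corpus).toarray()
clf.partial_fit(X_old, old_y, classes=[0, 1])

# A month later: feedback data with possibly unseen words still maps into the
# same 2**12-dimensional space, so partial_fit keeps working.
X_new = hv.transform(new_corpus).toarray()
clf.partial_fit(X_new, new_y)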
I have two sets of data, let's say A and B.
I want to apply PCA and t-SNE to A and fine-tune the algorithms.
Once I am satisfied with my tuning, I want to save what has been learnt to some pickle file.
Then I want to apply the same learnt PCA and t-SNE to set B.
I want t-SNE to produce the same results on B every time. I am hoping for this because I am assuming we can save the state of the learnt t-SNE parameters as well; if the parameters are the same and I load the same file every time, the result of applying t-SNE to set B should be the same every time.
How can I do this with sklearn and Python?
I am sorry, I am new to ML and Python; this may be a very basic question.
Fine-tuning t-SNE amounts to tuning a heuristic algorithm (it is ill-conditioned after all; the higher-dimension -> lower-dimension mapping is lossy) for your data.
Applying such a tuned and learned mapping to other data is done by sklearn's transform.
But you will see that there is no transform method for t-SNE, and the reason is given here (including further discussion):
It is a transductive learner, like many clustering algorithms: the model is not really applicable beyond the data points it is fed as training.
So whatever you tuned for dataset A does not really apply to dataset B (including parameters)!
For PCA this is trivial: use the methods described in the docs on model_persistence and use PCA's transform method (assuming compatible datasets; dimensions!).
To be able to save the models, you should use the library below:
from joblib import dump, load
after creating the PCA model as below:
from sklearn.decomposition import PCA
pca_model = PCA(n_components=n)
you can save the model in joblib format in the current directory:
dump(pca_model, 'pca_model.joblib')
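To complete the picture, a minimal sketch of fitting on A, persisting, and later applying the same learnt PCA to B (A and B are placeholders for your two datasets; n_components=2 is just for illustration):
from joblib import dump, load
from sklearn.decomposition import PCA

# Script 1: fit on A and persist the fitted model.
pca_model = PCA(n_components=2)
pca_model.fit(A)
dump(pca_model, 'pca_model.joblib')

# Script 2 (possibly much later): load and apply the same learnt mapping to B.
pca_model = load('pca_model.joblib')
B_reduced = pca_model.transform(B)   # deterministic: same result every time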
I'm using scikit-learn to cluster text documents. I'm using the classes CountVectorizer, TfidfTransformer and MiniBatchKMeans to help me do that.
New text documents are added to the system all the time, which means that I need to use the classes above to transform the text and predict a cluster. My question is: how should I store the data on disk?
Should I simply pickle the vectorizer, transformer and kmeans objects?
Should I just save the data? If so, how do I add it back to the vectorizer, transformer and kmeans objects?
Any help would be greatly appreciated.
It depends on what you want to do.
If you want to find some fixed cluster centers on a training set and then re-use them later to compute cluster assignments for new data, then pickling the models (or just saving the vocabulary of the vectorizer, the other models' constructor parameters, and the cluster center positions) is OK.
If what you want is to cluster the new data as well, you might want to retrain the whole pipeline on the union of the new and the old data, so that the vectorizer's vocabulary can build new features (dimensions) for the new words and the clustering algorithm can find cluster centers that better match the structure of the complete dataset.
Note that in the future we will provide hashing vectorizers (see for instance this pull request on hashing transformers as a first building block), hence storing the vocabulary won't be necessary any more (but you will lose the ability to introspect the "meaning" of the feature dimensions).
As for pickling the models vs. using your own representation of their parameters, I have answered that part in your previous question here: Persist Tf-Idf data
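For the first option, a minimal sketch of persisting the fitted objects and reusing them on new documents (the file name, the n_clusters value, and the documents/new_docs lists are placeholders):
import pickle
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Fit once on the initial corpus.
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
kmeans = MiniBatchKMeans(n_clusters=10)

counts = vectorizer.fit_transform(documents)
tfidf = transformer.fit_transform(counts)
kmeans.fit(tfidf)

with open('cluster_models.pkl', 'wb') as f:
    pickle.dump((vectorizer, transformer, kmeans), f)

# Later, in another process: load and assign clusters to new documents.
with open('cluster_models.pkl', 'rb') as f:
    vectorizer, transformer, kmeans = pickle.load(f)

new_tfidf = transformer.transform(vectorizer.transform(new_docs))
labels = kmeans.predict(new_tfidf)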
Yeah, I think the general answer with sk-learn is to pickle and pray.
It seems to me that this is super fragile compared to having a documented serialization format that doesn't depend on implementation details. But maybe they know this, and won't make backwards-incompatible changes to their classes?