How do you train Scikit's LinearSVC on a dataset too big or impractical to fit into memory? I'm trying to use it to classify documents, and I have a few thousand tagged example records, but when I try to load all this text into memory and train LinearSVC, it consumes over 65% of my memory and I'm forced to kill it before my system becomes totally unresponsive.
Is it possible to format my training data as a single file and feed it into LinearSVC with a filename instead of having to call the fit() method?
I found this guide, but it only really covers classification, and assumes training is done incrementally, something LinearSVC doesn't support.
As far as I know, non-incremental implementations like LinearSVC would need the entire data set to train on. Unless you create an incremental version of it, you might be unable to use LinearSVC.
There are classifiers in scikit-learn that can be trained incrementally, just like in the guide you found, which uses an SGDClassifier. The SGDClassifier has the *partial_fit* method, which allows you to train it in batches. A couple of other classifiers also support incremental learning, such as Multinomial Naive Bayes and Bernoulli Naive Bayes.
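As a rough sketch of what batch-wise training on text could look like: the toy corpus, the `iter_batches` helper and the batch size below are made up for illustration, and HashingVectorizer is my own choice because it is stateless and never needs to see the whole corpus. SGDClassifier with hinge loss optimizes a linear SVM objective, so it is the closest incremental stand-in for LinearSVC, not the same solver.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_batches(docs, labels, batch_size):
    """Yield (documents, labels) slices; a stand-in for reading chunks from disk."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], labels[start:start + batch_size]


# Toy corpus (hypothetical data, only here to make the sketch runnable).
docs = ["cheap pills online", "meeting at noon", "win money now",
        "project status report", "free money pills", "lunch meeting tomorrow"] * 20
labels = [1, 0, 1, 0, 1, 0] * 20

# HashingVectorizer keeps no vocabulary, so each batch can be transformed on its own.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
# loss="hinge" gives a linear SVM objective trained with SGD.
clf = SGDClassifier(loss="hinge", random_state=0)

all_classes = [0, 1]  # partial_fit has to be told every class on the first call
for batch_docs, batch_labels in iter_batches(docs, labels, batch_size=30):
    X_batch = vectorizer.transform(batch_docs)
    clf.partial_fit(X_batch, batch_labels, classes=all_classes)

predictions = clf.predict(vectorizer.transform(["free pills", "status meeting"]))
```

Only one batch of text is vectorized at a time, so memory use is bounded by the batch size rather than by the size of the corpus.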
You can use a Generator function like this.
def lineGenerator():
    with open(INPUT_FILENAMES_TITLE[0],'r') as f1:
        title_reader = csv.reader(f1)
        for line in title_reader:
            yield line[0]
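Then you can feed the generator to a vectorizer (for example a HashingVectorizer) and train the classifier on the result. LinearSVC.fit() needs a feature matrix and the labels, so it cannot consume the generator directly:
vectorizer = HashingVectorizer()
X = vectorizer.transform(lineGenerator())
clf = LinearSVC()
clf.fit(X, y)
This assumes INPUT_FILENAMES_TITLE[0] is your filename and that y holds the labels in the same order as the lines.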
Related
For research purposes, I find myself needing to train an SVM via SGD on a large DS (that is, a large number of examples). This makes using scikit-learn's implementation (SGDClassifier) problematic, as it requires loading the entire DS at once.
The algorithm I am familiar with uses n steps of SGD to obtain n different separators w_i, and then averages them (specifics can be seen in slide 12 of https://www.cse.huji.ac.il/~shais/Lectures2014/lecture8.pdf).
This made me think that maybe I can use scikit-learn to train multiple such classifiers and then take the average of the resulting linear separators (assume no bias).
Is this a reasonable line of thinking, or does scikit-learn's implementation not fall under my logic?
Edit: I am well aware of the alternatives for training SVM in different ways, but this is for a specific research purpose. I would just like to know if this line of thinking is possible with scikit-learn's implementation, or if you are aware of an alternative that will allow me to train SVM using SGD without loading an entire DS to memory.
SGDClassifier has a partial_fit method, and one of the primary objectives of partial_fit is to scale sklearn models to large datasets. Using this, you can load a part of the dataset into RAM, feed it to SGD, and keep repeating this until the full dataset has been used.
In the code below, I use KFold mainly to imitate loading chunks of the dataset.
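import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

class GD_SVM(BaseEstimator, ClassifierMixin):
    def __init__(self):
        self.sgd = SGDClassifier(loss='hinge',random_state=42,fit_intercept=True,l1_ratio=0,tol=.001)
    def fit(self,X,y):
        cv = KFold(n_splits=10,random_state=42,shuffle=True)
        # Each test fold plays the role of one chunk loaded from disk
        for _,test_id in cv.split(X,y):
            xt,yt = X[test_id],y[test_id]
            self.sgd = self.sgd.partial_fit(xt,yt,classes=np.unique(y))
        return self
    def predict(self,X):
        return self.sgd.predict(X)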
To test this against regular (linear) SVM:
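from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X,y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X) # For simplicity; a Pipeline is the better choice
cv = RepeatedStratifiedKFold(n_splits=5,n_repeats=5,random_state=43)
sgd = GD_SVM()
svm = LinearSVC(loss='hinge',max_iter=1,random_state=42,
                C=1.0,fit_intercept=True,tol=.001)
r = cross_val_score(sgd,X,y,cv=cv) # or cross_val_score(svm,X,y,cv=cv)
print(r.mean())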
This returned 95% accuracy for the above GD_SVM, and 96% for SVM. On the Digits dataset, SVM had 93% accuracy, while GD_SVM had 91%. These performances are broadly similar but, as the measurements show, not identical. This is expected, since the two approaches use quite different optimization algorithms, but I think careful hyper-parameter tuning would reduce the gap.
Given the concern about loading all of the data into memory, if you have access to more compute resources, you may want to use PySpark's SVM implementation: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#linear-support-vector-machine, as Spark is built for large-scale data processing. I don't know if averaging the separators from multiple Scikit-Learn models would work as expected; based on the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), there isn't a clean way to instantiate a new model with new separators, so it would probably have to be implemented as an ensemble approach.
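For what it's worth, here is a minimal sketch of the averaging idea done by hand, without instantiating a new model: train one hinge-loss SGDClassifier per chunk, average the learned `coef_` and `intercept_` in NumPy, and classify with the sign of the resulting decision value. The synthetic data and the three-way split are assumptions for illustration, and whether the averaged separator is a good classifier for a real problem is not guaranteed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary problem (hypothetical stand-in for the large dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Train one hinge-loss SGD model per chunk; chunks stand in for parts of a large file.
coefs, intercepts = [], []
for X_chunk, y_chunk in zip(np.array_split(X, 3), np.array_split(y, 3)):
    model = SGDClassifier(loss="hinge", random_state=0).fit(X_chunk, y_chunk)
    coefs.append(model.coef_.ravel())
    intercepts.append(model.intercept_[0])

# Average the separators and classify with the sign of the decision value.
w_avg = np.mean(coefs, axis=0)
b_avg = float(np.mean(intercepts))
y_pred = (X @ w_avg + b_avg > 0).astype(int)
accuracy = float((y_pred == y).mean())
```

Note that SGDClassifier also has an `average` parameter, which averages the weights across the SGD updates inside a single model; that may be closer to what the slides describe than averaging separate models.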
If you insist on using the whole DS for training instead of sampling (btw, sampling is what the slides describe) and you do not care about performance, I would train n classifiers, then select only their support vectors and retrain the final version on those support vectors only. This way you effectively dismiss most of the data and concentrate only on the points that are important for the classification.
I am fitting a kNN model in Ubuntu, Python, and sklearn with 32 GB DDR4 RAM.
I want to load the model later without fitting it again, so I decided to dump the fitted model after fitting it.
The input is a large dataset, 1M+ rows, 1052 columns. (Most of the columns are dummy encodings of categorical variables, with values 0 or 1.)
After the KNN was fitted, I got a MemoryError when trying to dump the model with pickle.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 14)
knn = knn.fit(x,y)
import pickle
pickle.dump(knn, open('KNN.sav', 'wb'))
In other posts, the solution is always to use a machine with higher specifications and more RAM. I don't think that is a good solution in the long run.
Updated question: Is there any way to limit model training to 26 of the 32 GB and leave 6 GB for dumping? Or is there another method to save the fitted model?
The first question is whether you really need such a huge training dataset. The model may not need the whole dataset. There is a good code example from sklearn. Otherwise, use a cloud instance to work with a dataset this large.
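A minimal sketch of the subsampling idea, under my own assumptions: the random data stands in for the real `x` and `y`, the subset size is arbitrary, and saving with joblib (which the scikit-learn docs suggest for estimators that carry large NumPy arrays) is an addition of mine, not something the original posts tried. A fitted kNN model stores its training data, so fitting on fewer rows also shrinks what has to be dumped.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real data: binary dummy columns.
x = rng.integers(0, 2, size=(5000, 50)).astype(np.float32)
y = rng.integers(0, 2, size=5000)

# Fit on a random subset of rows instead of the full dataset.
subset = rng.choice(len(x), size=1000, replace=False)
knn = KNeighborsClassifier(n_neighbors=14).fit(x[subset], y[subset])

# Dump and reload the fitted model with joblib, using compression.
model_path = os.path.join(tempfile.mkdtemp(), "KNN.joblib")
joblib.dump(knn, model_path, compress=3)
restored = joblib.load(model_path)
```

Whether a subset is good enough has to be checked on held-out data; this only shows the mechanics.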
I have a very big dataset that cannot be loaded into memory.
I want to use this dataset as the training set of a scikit-learn classifier, for example a LogisticRegression.
Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches?
I believe that some of the classifiers in sklearn have a partial_fit method. This method allows you to pass minibatches of data to the classifier, such that a gradient descent step is performed for each minibatch. You would simply load a minibatch from disk, pass it to partial_fit, release the minibatch from memory, and repeat.
If you are particularly interested in doing this for Logistic Regression, then you'll want to use SGDClassifier, which can be set to use logistic regression when loss = 'log' (recent scikit-learn releases spell this 'log_loss').
You simply pass the features and labels for your minibatch to partial_fit in the same way that you would use fit:
clf.partial_fit(X_minibatch, y_minibatch)
Update:
I recently came across the dask-ml library which would make this task very easy by combining dask arrays with partial_fit. There is an example on the linked webpage.
Have a look at the scaling strategies included in the sklearn documentation:
http://scikit-learn.org/stable/modules/scaling_strategies.html
A good example is provided here:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
I have a big file with training data. I am worried about what happens when I use this code:
clf = RandomForestClassifier()
for chunk in reader:
    clf.fit(chunk, target)
Will clf produce a model for all chunks, or only for the current one? For incremental learning, should I only use classifiers with a partial_fit() method? And how should I normalize the training data in that case (build the normalizer for the whole dataset rather than only the current chunk)?
Yes, for incremental learning you can only use classifiers which implement partial_fit.
StandardScaler has a partial_fit method, so it can be applied online. I'm not sure though if that's the right way to do it, as the transformation will change over time. If you don't expect the data distribution to change much, you can also fit any scaler on a subset of the data and use it later.
Also note that RandomForestClassifier (like all tree-based classifiers) is scale invariant, so it is not clear that standardization has any effect for it.
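One way around the "transformation changes over time" concern is to make two passes over the chunks: a first pass that only updates the scaler, and a second pass that scales each chunk with the final statistics and trains the classifier. A minimal sketch, assuming the chunks can be read twice (the random data and the five-way split are made up):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 4))
y = (X[:, 0] > 50.0).astype(int)
# Stand-in for chunks read from a file.
chunks = list(zip(np.array_split(X, 5), np.array_split(y, 5)))

# Pass 1: accumulate the mean and variance chunk by chunk.
scaler = StandardScaler()
for X_chunk, _ in chunks:
    scaler.partial_fit(X_chunk)

# Pass 2: scale every chunk with the same, final statistics and train on it.
clf = SGDClassifier(random_state=0)
for X_chunk, y_chunk in chunks:
    clf.partial_fit(scaler.transform(X_chunk), y_chunk, classes=[0, 1])
```

After the first pass, `scaler.mean_` and `scaler.var_` match the statistics of the full dataset, so every chunk is transformed consistently.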
RandomForestClassifier does not implement the partial_fit() method, which is what supports incremental learning on chunks of data.
However, you can combine the trainings of RandomForestClassifier as mentioned here using estimators_ and n_estimators.
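A hedged sketch of what such a combination could look like: fit one forest per chunk, then append the trees of the second forest to the first and update `n_estimators`. This manipulates fitted attributes directly and is not an officially documented workflow, so treat it as a hack; it assumes every chunk contains the same classes and features. The synthetic data is only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
(X_a, X_b), (y_a, y_b) = np.array_split(X, 2), np.array_split(y, 2)

# One forest per chunk.
forest_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_a, y_a)
forest_b = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_b, y_b)

# Merge the second forest's trees into the first one.
# Assumption: both chunks contain the same set of classes and the same features.
forest_a.estimators_ += forest_b.estimators_
forest_a.n_estimators = len(forest_a.estimators_)

predictions = forest_a.predict(X)
```

Each tree has still only seen one chunk, so this is an ensemble of per-chunk forests rather than a forest trained on all of the data.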
Yes this will only work for classifiers with partial_fit; depending on how you normalise you may be able to do this chunk-by-chunk (e.g. scaling by a fixed factor or doing a label encoding).
I am learning about text classification, and I classify my own corpus with logistic regression as follows:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', C=7)
classifier.fit(training_matrix, y_train)
prediction = classifier.predict(testing_matrix)
I would like to improve the classification report with the Restricted Boltzmann Machine that scikit-learn provides. From the documentation, I read that it can be used to improve classification recall, f1-score, accuracy, etc. Could anybody help me improve the results? This is what I have tried so far, thanks in advance:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5,
                             max_features=None,
                             ngram_range=(1, 1),
                             norm='l2',
                             use_idf=True)
X_train = vectorizer.fit_transform(X_train_r)
X_test = vectorizer.transform(X_test_r)
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
logistic = LogisticRegression()
rbm= BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
classifier.fit(X_train, y_train)
First, you have to understand the concepts here. RBM can be seen as a powerful clustering algorithm and clustering algorithms are unsupervised, i.e., they don't need labels.
Perhaps, the best way to use RBM in your problem is, first to train an RBM (which only needs data without labels) and then use the RBM weights to initialize a Neural network. To get a logistic regression in the output, you have to add an output layer with logistic reg. cost function to this neural net and train this neural network. This setting may result in performance improvement.
There are a couple of things that could be wrong.
1. You haven't properly calibrated the RBM
Look at the example on the scikit-learn site: http://scikit-learn.org/stable/auto_examples/plot_rbm_logistic_classification.html
In particular, these lines:
rbm.learning_rate = 0.06
rbm.n_iter = 20
# More components tend to give better prediction performance, but larger
# fitting time
rbm.n_components = 100
You don't set these anywhere. In the example, these are obtained through cross validation using a grid search. You should do the same and try to obtain (close to) optimal parameters for your own problem.
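A minimal sketch of such a grid search over the RBM and logistic regression parameters, using the `step__parameter` naming that Pipeline exposes. The digits data, the tiny grid and `n_iter=5` are assumptions chosen so the example runs quickly; for the text problem you would plug in your own tf-idf matrix and a wider grid.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Small slice of the digits data, scaled to [0, 1] as BernoulliRBM expects.
X, y = load_digits(return_X_y=True)
X, y = X[:300] / 16.0, y[:300]

pipeline = Pipeline(steps=[
    ("rbm", BernoulliRBM(n_iter=5, random_state=0)),
    ("logistic", LogisticRegression(max_iter=1000)),
])

# Deliberately tiny grid so the sketch runs quickly; widen it for real use.
param_grid = {
    "rbm__learning_rate": [0.01, 0.06],
    "rbm__n_components": [16, 64],
    "logistic__C": [1.0, 100.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_
```

`search.best_estimator_` is then the pipeline refit on all the data with the best parameter combination found.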
Additionally, you might want to use cross validation to determine other parameters as well, such as the ngram range and the logistic regression parameters. Adding higher-level ngrams usually helps if you can afford the memory and execution time, and for some problems character-level ngrams do better than word-level ones.
2. You are just unlucky
There is nothing that says using an RBM in an intermediate step will definitely improve any performance measure. It can, but it's not a rule, it may very well do nothing or very little for your problem. You have to be prepared for this.
It's worth trying because it shouldn't take long to implement, but be prepared to have to look elsewhere.
Also look at the SGDClassifier and the PassiveAggressiveClassifier. These might improve performance.