I am fitting a kNN model with sklearn in Python on Ubuntu, on a machine with 32 GB of DDR4 RAM.
I want to be able to load the model later without fitting it again, so I decided to dump the fitted model right after fitting.
The input is a large dataset: 1M+ rows and 1052 columns. (Most of the columns are dummy variables for categorical features, taking values in {0, 1}.)
After the kNN model was fitted, I got a MemoryError when trying to dump it with pickle:
from sklearn.neighbors import KNeighborsClassifier
import pickle

knn = KNeighborsClassifier(n_neighbors=14)
knn = knn.fit(x, y)

# Use a context manager so the file is closed even if the dump fails
with open('KNN.sav', 'wb') as f:
    pickle.dump(knn, f)
In other posts, the suggested solution is always to use a machine with higher specifications and more RAM. I don't think that is a good solution in the long run.
Updated question: Is there any way to limit the model training to 26 of the 32 GB and leave 6 GB free for dumping? Or is there some other method to save the fitted model?
There's a question to ask first: do you really need such a huge training dataset? The model may not need the whole dataset, so try fitting it on a subsample first; sklearn's documentation has good code examples for this. Alternatively, use a cloud instance for working with datasets this large.
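If the MemoryError happens specifically while pickling, it may also help to dump with joblib, which serializes large NumPy arrays straight to disk instead of building the whole pickle byte stream in memory first. A minimal sketch, assuming x, y, and the fitted knn from the question:

import joblib  # installed alongside sklearn

# joblib streams NumPy array buffers directly to the file, which usually
# needs less peak RAM than pickle for array-heavy models such as kNN
joblib.dump(knn, 'KNN.joblib')

# Later, in a fresh session:
knn_loaded = joblib.load('KNN.joblib')

Note that a fitted KNeighborsClassifier stores the full training set, so the dump will be roughly as large as the data itself; compressing (joblib.dump(knn, 'KNN.joblib', compress=3)) trades CPU time for a smaller file.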
For research purposes, I find myself needing to train an SVM via SGD on a large DS (that is, one with a large number of examples). This makes using scikit-learn's implementation (SGDClassifier) problematic, as it requires loading the entire DS at once.
The algorithm I am familiar with uses n steps of SGD to obtain n different separators w_i, and then averages them (specifics can be seen in slide 12 of https://www.cse.huji.ac.il/~shais/Lectures2014/lecture8.pdf).
This made me think that maybe I can use scikit-learn to train multiple such classifiers and then take the average of the resulting linear separators (assume no bias).
Is this a reasonable line of thinking, or does scikit-learn's implementation not fall under my logic?
Edit: I am well aware of the alternatives for training SVMs in different ways, but this is for a specific research purpose. I would just like to know whether this line of thinking is possible with scikit-learn's implementation, or whether you are aware of an alternative that will allow me to train an SVM using SGD without loading an entire DS into memory.
SGDClassifier has a partial_fit method, and one of the primary objectives of partial_fit is to scale sklearn models to large datasets. Using it, you can load part of the dataset into RAM, feed it to SGD, and keep repeating this until the full dataset has been used.
In the code below, I use KFold mainly to imitate loading one chunk of the dataset at a time.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

class GD_SVM(BaseEstimator, ClassifierMixin):
    def __init__(self):
        self.sgd = SGDClassifier(loss='hinge', random_state=42,
                                 fit_intercept=True, l1_ratio=0, tol=.001)

    def fit(self, X, y):
        # KFold here only imitates reading one chunk of the data at a time
        cv = KFold(n_splits=10, random_state=42, shuffle=True)
        for _, test_id in cv.split(X, y):
            xt, yt = X[test_id], y[test_id]
            self.sgd = self.sgd.partial_fit(xt, yt, classes=np.unique(y))
        return self

    def predict(self, X):
        return self.sgd.predict(X)
To test this against a regular (linear) SVM:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # for simplicity; a Pipeline is the better choice

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=43)
sgd = GD_SVM()
svm = LinearSVC(loss='hinge', max_iter=1, random_state=42,
                C=1.0, fit_intercept=True, tol=.001)
r = cross_val_score(sgd, X, y, cv=cv)  # cross_val_score(svm, X, y, cv=cv)
print(r.mean())
This returned 95% accuracy for the GD_SVM above, versus 96% for the SVM. On the Digits dataset, the SVM reached 93% accuracy while GD_SVM reached 91%. As these measurements show, the performances are broadly similar but not identical. This is expected, since the two approaches use quite different optimization procedures, but careful hyper-parameter tuning would likely reduce the gap.
Regarding the concern of loading all of the data into memory: if you have access to more compute resources, you may want to use PySpark's SVM implementation (https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#linear-support-vector-machine), since Spark is built for large-scale data processing. I don't know whether averaging the separators from multiple scikit-learn models would work as expected; judging by the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), there isn't a clean way to instantiate a new model with given separators, so it would probably have to be implemented as an ensemble approach.
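For reference, a minimal PySpark sketch based on the linked docs (the file path and parameters are illustrative, not from the original post); Spark reads the data in partitions, so the full DS never has to sit in memory at once:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.appName("svm-example").getOrCreate()

# The libsvm reader produces the 'label'/'features' columns the estimator expects
training = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

lsvc = LinearSVC(maxIter=100, regParam=0.1)
model = lsvc.fit(training)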
If you insist on using the whole DS for training instead of sampling (by the way, sampling is what the slides describe) and you do not care about performance, I would train n classifiers on chunks of the data, select only their support vectors, and then retrain a final version on those support vectors alone. This way you effectively dismiss most of the data and concentrate only on the points that are important for the classification.
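A rough sketch of that chunk-train-then-retrain idea, assuming the data comes as an iterable of in-memory (X_chunk, y_chunk) pairs (the chunks variable and SVC parameters are illustrative):

import numpy as np
from sklearn.svm import SVC

sv_X, sv_y = [], []
for X_chunk, y_chunk in chunks:  # hypothetical iterable of in-memory chunks
    clf = SVC(kernel='linear', C=1.0).fit(X_chunk, y_chunk)
    # support_ holds the indices of this chunk's support vectors
    sv_X.append(X_chunk[clf.support_])
    sv_y.append(y_chunk[clf.support_])

# Retrain the final model on the pooled support vectors only
final_clf = SVC(kernel='linear', C=1.0).fit(np.vstack(sv_X), np.concatenate(sv_y))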
I am trying to transform two datasets, x_train and x_test, using TSNE. I assume the way to do this is to fit TSNE on x_train and then transform both x_test and x_train. But I am not able to transform either of the datasets.
from sklearn.manifold import TSNE

tsne = TSNE(random_state=420, n_components=2, verbose=1,
            perplexity=5, n_iter=350).fit(x_train)
I assume that tsne has been fitted to x_train.
But, when I do this:
x_train_tse = tsne.transform(x_subset)
I get:
AttributeError: 'TSNE' object has no attribute 'transform'
Any help will be appreciated. (I know I could do fit_transform, but wouldn't I get the same error on x_test?)
Judging by the sklearn documentation, TSNE simply does not have any transform method.
Also, TSNE is an unsupervised method for dimensionality reduction/visualization, so it does not really work with a TRAIN and TEST split. You simply take all of your data and use fit_transform to obtain the transformation and plot it.
EDIT - It is actually not possible to learn a transformation and reuse it on different data (i.e. train and test), because t-SNE does not learn a mapping function to a lower-dimensional space; instead it runs an iterative procedure on a subspace to find an equilibrium that minimizes a loss/distance ON SOME DATA.
Therefore, if you want to preprocess and reduce the dimensionality of both a train and a test dataset, the way to go is PCA/SVD or autoencoders. t-SNE will only help you for unsupervised tasks :)
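For example, a minimal sketch with PCA (assuming x_train and x_test are the arrays from the question): unlike TSNE, PCA learns an explicit linear mapping, so it can be fitted on train and reused on test:

from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=420)
x_train_2d = pca.fit_transform(x_train)  # learn the projection on train
x_test_2d = pca.transform(x_test)        # reuse the same projection on test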
As the accepted answer says, there is no separate transform method, and it probably wouldn't work in a train/test setting.
However, you can still use TSNE without information leakage.
Training Time
Calculate the TSNE embedding for each record in the training set and use it as a feature in the classification algorithm.
Testing Time
Append your testing data to your training data and fit_transform the TSNE. Now continue processing your test set, using the TSNE embedding as a feature on those records.
Does this cause information leakage? No.
Inference Time
New records arrive e.g. as images or table rows.
Add the new row(s) to the training table and recalculate the TSNE (i.e. find where the new samples sit in the space relative to your trained samples). Perform any other processing and run your prediction against the rows.
It works fine. Sometimes we worry too much about the train/test split because of Kaggle etc. But the main question is whether your method can be replicated at inference time, with the same expected accuracy, for live use. In this case it can, as the sketch below shows!
The only drawback is that you need your training database available at inference time, and depending on its size, the preprocessing might be costly.
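A minimal sketch of the append-and-recompute idea above, assuming x_train and x_new are NumPy arrays with the same columns; since TSNE has no transform, it is simply re-run on the combined data:

import numpy as np
from sklearn.manifold import TSNE

combined = np.vstack([x_train, x_new])
embedding = TSNE(n_components=2, random_state=420).fit_transform(combined)
new_embedding = embedding[len(x_train):]  # rows for the new sample(s)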
Check out openTSNE[1]. It has everything you need, including a transform method.
You can also save the trained model, for example with pickle.dump.
[1]: https://opentsne.readthedocs.io/en/latest/index.html
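A minimal openTSNE sketch (assuming the openTSNE package is installed and x_train/x_test are NumPy arrays): fit returns an embedding object that, unlike sklearn's TSNE, has a transform method, and the embedding can be pickled for a later session:

import pickle
from openTSNE import TSNE

embedding_train = TSNE(perplexity=5, random_state=420).fit(x_train)
embedding_test = embedding_train.transform(x_test)  # place test points in the trained space

# Persist the trained embedding for a later session
with open('tsne_embedding.sav', 'wb') as f:
    pickle.dump(embedding_train, f)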
I am trying to save a model and load it into a different session, but I am having prediction inconsistencies, and I would appreciate any help that can be offered. So here is what I did...
First, after running the model, I used this code to save the model:
from sklearn.externals import joblib
joblib.dump(clf, "models.pkl")
and then, to load the file in a different Colaboratory notebook, I used:
from sklearn.externals import joblib
loaded_model = joblib.load('models.pkl')
Then this is the code I used to process a single image for testing:
import cv2
import numpy as np

img_toArray = cv2.imread("/content/ESD/ESD/folder1/img1.png")
new_array = cv2.resize(img_toArray, (220, 220))
new_array = np.array(new_array).reshape(1, 145200)  # 220 * 220 * 3 channels, flattened
prediction = loaded_model.predict(new_array)
but this results in an output of array([4]) for every image I test, and I am not sure why.
I have also tried reloading the entire dataset, separating the labels from the features (the images), and using train_test_split to dedicate 90% of the dataset to testing. When I run the test features (images) through this block of code:
loaded_model.predict(np.array(xTest[whatEverNumber]).reshape(1,145200))
I get the right predictions. So I am confused about what I am doing wrong, because in both examples I am processing the images in basically the same way and then running them through the same prediction method. I would appreciate any help in figuring out what I did wrong.
Extra information that may prove beneficial: I am using Colaboratory, and my model is an sklearn SVM that runs through cross_validation_predict and finally an SVM fit function.
Thank you in advance!
Is loaded_model always trained with the same data? You might be encountering this problem because your fitted model is trained on different chunks (folds) of your dataset and you are fitting/saving it on the last iteration only; hence, each time you retrain it, the model learns from different data (given by each fold) and returns different predictions. This happens if model fitting is inside your cross-validation loop. May I ask what type of train/test split you used? Was it shuffled?
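If that is the case, here is a minimal sketch of the fix (clf, X, and y are stand-ins for your estimator and full training set): use cross-validation only for evaluation, then refit once on all the training data before saving:

import joblib
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=5)  # evaluation only; fitted copies are discarded
clf.fit(X, y)                              # one final fit on the full training data
joblib.dump(clf, 'models.pkl')             # save the refitted model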
I have a big file of training data, and I am worried about what happens when I use this code:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
for chunk in reader:
    clf.fit(chunk, target)
Will clf produce a model based on all chunks, or only on the current one? For incremental learning, should I use only classifiers with a partial_fit() method? And how should I normalize the training data in that case (building the normalizer on the whole dataset rather than only the current chunk)?
Yes, for incremental learning you can only use classifiers which implement partial_fit.
StandardScaler has a partial_fit method, so it can be applied online. I'm not sure, though, whether that's the right way to do it, since the transformation will change over time. If you don't expect the data distribution to change much, you can instead fit any scaler on a subset of the data and use it later.
Also note that RandomForestClassifier (like all tree-based classifiers) is scale-invariant, so it is not clear that standardization has any effect on it.
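A minimal sketch of the online combination, assuming chunks is an iterable of (X_chunk, y_chunk) pairs and all_classes is an array of every label (both names are illustrative):

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
clf = SGDClassifier()

for X_chunk, y_chunk in chunks:
    scaler.partial_fit(X_chunk)  # update the running mean/std estimates
    # classes must be known to the model from the first call onwards
    clf.partial_fit(scaler.transform(X_chunk), y_chunk, classes=all_classes)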
The partial_fit() method, which supports incremental learning on chunks of data, is not implemented for RandomForestClassifier.
However, you can combine separately trained RandomForestClassifiers as mentioned here, using the estimators_ and n_estimators attributes.
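A rough sketch of that combination trick, assuming (X1, y1) and (X2, y2) are two in-memory chunks that each contain every class; note this pools the fitted trees through public-but-unofficial attributes rather than a supported sklearn API:

from sklearn.ensemble import RandomForestClassifier

rf1 = RandomForestClassifier(n_estimators=50).fit(X1, y1)
rf2 = RandomForestClassifier(n_estimators=50).fit(X2, y2)

rf1.estimators_ += rf2.estimators_        # pool the fitted trees into one forest
rf1.n_estimators = len(rf1.estimators_)   # keep the estimator count consistent
# rf1 now averages votes over trees trained on both chunks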
Yes, this will only work for classifiers with partial_fit. Depending on how you normalize, you may be able to do it chunk-by-chunk (e.g. scaling by a fixed factor or doing a label encoding).
How do you train Scikit's LinearSVC on a dataset too big or impractical to fit into memory? I'm trying to use it to classify documents, and I have a few thousand tagged example records, but when I try to load all this text into memory and train LinearSVC, it consumes over 65% of my memory and I'm forced to kill it before my system becomes totally unresponsive.
Is it possible to format my training data as a single file and feed it into LinearSVC with a filename instead of having to call the fit() method?
I found this guide, but it only really covers classification, and assumes training is done incrementally, something LinearSVC doesn't support.
As far as I know, non-incremental implementations like LinearSVC need the entire dataset to train on. Unless you create an incremental version of it, you might be unable to use LinearSVC.
There are classifiers in scikit-learn that can be used incrementally, just like in the guide you found, which uses an SGDClassifier. SGDClassifier has a partial_fit method that allows you to train it in batches. A couple of other classifiers also support incremental learning, such as MultinomialNB and BernoulliNB.
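A minimal out-of-core sketch in the spirit of that guide, assuming batches is an iterable of (docs, labels) chunks read from disk and all_classes lists every label (both names are illustrative); HashingVectorizer is stateless, so each batch can be vectorized independently:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # no fit needed; hashing is stateless
clf = SGDClassifier(loss='hinge')                 # hinge loss approximates a linear SVM

for docs, labels in batches:  # e.g. read the file a few thousand lines at a time
    X = vectorizer.transform(docs)
    clf.partial_fit(X, labels, classes=all_classes)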
You can use a Generator function like this.
import csv

def lineGenerator():
    with open(INPUT_FILENAMES_TITLE[0], 'r') as f1:
        title_reader = csv.reader(f1)
        for line in title_reader:
            yield line[0]
Then you can call the Classifier as
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(lineGenerator())
This assumes INPUT_FILENAMES_TITLE[0] is your filename.