Sklearn-GMM on large datasets

I have a large dataset that does not fit in memory. I want to fit a GMM on this dataset.
Can I call GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches of the data?

There is no reason to fit it repeatedly.
Just randomly sample as many data points as your machine can process in a reasonable time. If the variation is not very high, the random sample will have approximately the same distribution as the full dataset.
import numpy as np
# np.random.choice only accepts 1-D input, so sample row indices and index with them
idx = np.random.choice(len(full_dataset), size=10000, replace=False)
randomly_sampled = full_dataset[idx]
# If the data does not fit in memory, you can find a way to randomly sample while you read it
gmm.fit(randomly_sampled)  # gmm is an instance of GMM
And then use
gmm.predict(full_dataset)
# Again, you can predict one by one or batch by batch if you cannot read it all into memory
on the rest to classify them.
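A minimal sketch of the sample-then-predict approach above, assuming a recent scikit-learn where sklearn.mixture.GaussianMixture replaces the removed GMM class. The synthetic full_dataset, the sample size, and batch_size are illustrative assumptions, not values from the question.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for a large dataset: two well-separated 2-D clusters
full_dataset = np.vstack([
    rng.normal(loc=-5.0, size=(5000, 2)),
    rng.normal(loc=5.0, size=(5000, 2)),
])

# Sample row indices without replacement, then fit on the sampled rows only
idx = rng.choice(len(full_dataset), size=1000, replace=False)
gmm = GaussianMixture(n_components=2, random_state=0).fit(full_dataset[idx])

# Predict batch by batch so only one batch has to be in memory at a time
batch_size = 2000
labels = np.concatenate([
    gmm.predict(full_dataset[start:start + batch_size])
    for start in range(0, len(full_dataset), batch_size)
])
```

With data on disk, the slicing in the last step would be replaced by whatever chunked reader you already use.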

fit will always forget previous data in scikit-learn. For incremental fitting, there is the partial_fit function. Unfortunately, GMM doesn't have a partial_fit (yet), so you can't do that.

As Andreas Mueller mentioned, GMM doesn't have partial_fit yet, which would allow you to train the model in an iterative fashion. But you can make use of warm_start by setting its value to True when you create the GMM object. This allows you to iterate over batches of data and continue training the model from where you left off in the last iteration.
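A rough sketch of the warm_start idea, assuming the current GaussianMixture class, which accepts warm_start=True. The make_batch helper is hypothetical and only generates synthetic data. Note that each fit call still runs EM on the current batch alone, merely starting from the previous parameters, so this is not equivalent to fitting on all the data at once and later batches can dominate.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def make_batch(n=500):
    """Hypothetical batch loader: returns synthetic two-cluster data."""
    return np.vstack([
        rng.normal(loc=-5.0, size=(n, 2)),
        rng.normal(loc=5.0, size=(n, 2)),
    ])

# warm_start=True makes each fit() start from the previous solution
gmm = GaussianMixture(n_components=2, warm_start=True, random_state=0)
for _ in range(5):
    gmm.fit(make_batch())

# Component means along the first axis, sorted for easy inspection
centers = np.sort(gmm.means_[:, 0])
```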

I think you can set init_params to the empty string '' when you create the GMM object, so that fit does not reinitialize the parameters on each call; then you might be able to train on the whole data set.

Related

How to use data augmentation with cross validation

I need to apply data augmentation to what would be my training data in each cross-validation split. The problem is that I am using cross-validation, and I can't find a reference on how to adjust my model to use data augmentation with it. My cross-validation indexes the data more or less by hand.
There are articles and general content about data augmentation, but very little, and nothing general, about combining cross-validation with data augmentation.
I need to augment the training data by simply rotating and zooming, cross-validate for the best weights and save them, but I don't know how.
This example can be copy-pasted for better reproducibility. In short, how would I employ data augmentation and also save the weights with the best accuracy?

When training machine learning models, you should not test the model on samples used during the training phase (if you care about realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set, train the model on all the rest, and repeat this procedure for all parts one by one. This way you essentially test your model on all the available data, without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross validation folds is usually 5 or 7. This depends on the amount of labeled data at one's disposal: if you have lots of data, you can afford to leave less data to train the model and increase the test set size. The higher the number of folds, the better the accuracy estimate you can achieve, as the training part grows, but the more time you have to invest in the procedure. In the extreme case one has a leave-one-out training procedure: train on everything but one single sample, effectively making the number of folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, which have a large overlap in their training data. As a result, you should get 5 models that have similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you have trained and train a new model on all the available data, assuming its performance will be the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test part. Each data point created from the training part should be used only for training; the same applies to the test set.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it with the original. Then you follow the regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through the augmenting procedure before concatenating them, and you should be fine.
Please note that splitting the data in a linear way can lead to unbalanced data sets, and it is not the best way of splitting the data. You should consider the functions I've mentioned above.
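A minimal sketch of the procedure described above, under loud assumptions: the augment function is hypothetical (it adds Gaussian noise instead of rotating or zooming images), the data is synthetic, and LogisticRegression stands in for whatever model is actually being trained. Here only the training part of each split is augmented, which is one way to guarantee that no augmented copy of a training sample reaches the test fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def augment(X, y, rng):
    """Hypothetical augmentation: jitter each sample with small Gaussian noise."""
    return X + rng.normal(scale=0.05, size=X.shape), y.copy()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    X_train, y_train = X[train_idx], y[train_idx]

    # Augment the training part only, then concatenate with the originals
    X_aug, y_aug = augment(X_train, y_train, rng)
    X_fit = np.concatenate([X_train, X_aug])
    y_fit = np.concatenate([y_train, y_aug])

    model = LogisticRegression().fit(X_fit, y_fit)
    # Evaluate on the untouched held-out fold
    scores.append(model.score(X[test_idx], y[test_idx]))

mean_score = float(np.mean(scores))
```

To keep the best weights, one option is to track the highest entry in scores inside the loop and save the corresponding model at that point.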

python tsne.transform does not exist?

I am trying to transform two datasets: x_train and x_test using tsne. I assume the way to do this is to fit tsne to x_train, and then transform x_test and x_train. But, I am not able to transform any of the datasets.
tsne = TSNE(random_state = 420, n_components=2, verbose=1, perplexity=5, n_iter=350).fit(x_train)
I assume that tsne has been fitted to x_train.
But, when I do this:
x_train_tse = tsne.transform(x_subset)
I get:
AttributeError: 'TSNE' object has no attribute 'transform'
Any help will be appreciated. (I know I could do fit_transform, but wouldn't I get the same error on x_test?)

Judging by the sklearn documentation, TSNE simply does not have a transform method.
Also, t-SNE is an unsupervised method for dimensionality reduction/visualization, so it does not really work with a train and test split. You simply take all of your data and use fit_transform to get the transformation and plot it.
EDIT - It is actually not possible to learn a transformation and reuse it on different data (i.e. train and test), because t-SNE does not learn a mapping function to a lower-dimensional space; instead it runs an iterative procedure to find an equilibrium that minimizes a loss/distance ON SOME DATA.
Therefore, if you want to preprocess and reduce the dimensionality of both a train and a test dataset, the way to go is PCA/SVD or autoencoders. t-SNE will only help you with unsupervised tasks :)
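As a small illustration of the PCA alternative suggested above, the sketch below fits on the training set and reuses the learned projection on the test set. The array shapes and random data are assumptions made only for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 10))
x_test = rng.normal(size=(50, 10))

# Learn the projection from the training data only
pca = PCA(n_components=2).fit(x_train)

# The same learned mapping is applied to both sets
x_train_2d = pca.transform(x_train)
x_test_2d = pca.transform(x_test)
```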

As the accepted answer says, there is no separate transform method and it probably wouldn't work in a train/test setting.
However, you can still use TSNE without information leakage.
Training Time
Calculate the TSNE per record on the training set and use it as a feature in classification algorithm.
Testing Time
Append your training and testing data and fit_transform the TSNE. Now continue on processing your test set, using the TSNE as a feature on those records.
Does this cause information leakage? No.
Inference Time
New records arrive e.g. as images or table rows.
Add the new row(s) to the training table, calculate TSNE (i.e. where the new sample sits in the space relative to your trained samples). Perform any other processing and run your prediction against the row.
It works fine. Sometimes we worry too much about the train/test split because of Kaggle etc. But the main question is whether your method can be replicated at inference time with the same expected accuracy for live use. In this case, yes it can!
The only drawback is that you need your training database available at inference time and, depending on its size, the preprocessing might be costly.

Check out openTSNE [1]. It has all you need.
You can also save the trained model using pickle.dump for example.
[1]: https://opentsne.readthedocs.io/en/latest/index.html

How to select a subset of mnist training set

I am having trouble selecting a subset of the MNIST training set containing M points to train a 1-NN classifier, because the original training set is too large.
That is, I need to figure out a scheme that takes as input a labeled training set as well as a number M, and returns a subset of the training set of size M.
Besides, uniform-random selection (that is, just picking M of the training points at random) is not allowed.

One option could be to train your network with a data generator.
It loads only one batch of data at a time, so the size of your data is no longer an issue. Furthermore, it can use multithreading,
so loading and possibly preprocessing your data is not a bottleneck.
Here is a good example:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
I hope this helps.

How to apply sklearn's EllipticEnvelope to find out top outliers in the given dataset?

I am using sklearn's EllipticEnvelope to find outliers in a dataset, but I am not sure how to model my problem. Should I just use all the data (without dividing it into training and test sets) and apply fit? Also, how would I obtain the outlyingness of each data point? Should I use predict on the same dataset?

The right way to do this is:
Divide the data into normal points and outliers.
Take a large sample from the normal data as normal_train for fitting the novelty detection model.
Create a test set with a sample from the normal data that is not used in training (say normal_test) and a sample from the outliers (say outlier_test), such that the distribution of the test data (normal_test + outlier_test) retains the population distribution.
Predict on this test data to get the usual metrics (accuracy, sensitivity, positive predictive value, etc.).
Wow. I have come a long way!

Yes, call fit to train it on all your data. Then you can either call predict on all your data, which returns 1 for inliers and -1 for outliers, or build a grid of points that encompasses all your data points, call predict on each grid point, and plot the results, which gives you a nice image of the ellipse.
Here's a full example with comparison to other methods.
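A smaller sketch of the fit/predict workflow described above, on synthetic data with two planted outliers. The contamination value is an assumption chosen for this toy data. For a per-point outlyingness score, the sketch uses decision_function and mahalanobis, two EllipticEnvelope methods that the answers above do not mention, and ranks points by distance to pick the top outliers.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)

# 200 inliers around the origin plus two planted outliers at rows 200 and 201
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0], [-9.0, 7.0]]])

env = EllipticEnvelope(contamination=0.01, random_state=0).fit(X)

labels = env.predict(X)            # 1 for inliers, -1 for outliers
scores = env.decision_function(X)  # lower values are more outlying
dist = env.mahalanobis(X)          # squared Mahalanobis distance to the robust center

# Indices of the two most outlying points, most extreme first
top_k = np.argsort(dist)[::-1][:2]
```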

Training SGDRegressor on a dataset in chunks

For a machine learning task I need to deal with quite large data sets. As a result, I cannot fit the entire data set at once in my algorithm. I am looking for a way to train my algorithm on the data set in parts. Simply feeding new chunks won't work, since my algorithm will just refit and won't take the previous examples into account. Is there a method with which I can feed my algorithm new data while "remembering" the data seen before?
Edit: The algorithm I use is the SGDRegressor from scikit-learn.
The code:
train = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
labels = pd.read_csv(os.path.join(dir,"Labels.csv"),chunksize = 5000)
algo = SGDRegressor(n_iter = 75)
print("looping for chunks in train")
for chunk in train:
    algo.fit(train,labels)

You can use partial_fit to feed parts of training data to SGDRegressor.
See sample code in examples.
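A minimal sketch of chunked training with partial_fit. The iter_chunks generator is hypothetical and produces synthetic data; with the CSV readers from the question, you would instead iterate over zip(train, labels) and pass each pair of chunks to partial_fit.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])

def iter_chunks(n_chunks=50, chunk_size=200):
    """Hypothetical chunk source standing in for pd.read_csv(..., chunksize=...)."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 3))
        y = X @ true_coef + rng.normal(scale=0.1, size=chunk_size)
        yield X, y

algo = SGDRegressor(random_state=0)
for X_chunk, y_chunk in iter_chunks():
    # partial_fit updates the existing weights instead of refitting from scratch
    algo.partial_fit(X_chunk, y_chunk)
```

Each partial_fit call makes one pass over its chunk, so for real data it may help to scale the features first and to loop over the chunks more than once.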
