I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it into memory.
I use sklearn a lot, but for much smaller datasets.
In this situation the classical approach should be something like:
Read only part of the data -> partially train your estimator -> delete the data -> read another part of the data -> continue training your estimator.
I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator with various subsamples of the data.
Now I am wondering: is there an easy way to do that in sklearn?
I am looking for something like:
r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)  # read the next 20 lines
    m.partial_fit(x)
m.predict(new_x)
Maybe sklearn is not the right tool for this kind of thing?
Let me know.
I've used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient Descent, Perceptron and Passive Aggressive, and also Multinomial Naive Bayes, on a Kaggle dataset of over 30 GB. All these classifiers share the partial_fit method which you mention. Some behave better than others, though.
You can find the methodology, the case study and some good resources in this post:
http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
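As a rough illustration of that chunked partial_fit workflow (a minimal sketch; the file name, chunk size, label column and class list are assumptions, not taken from the posts above):
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs the full set of classes up front

for chunk in pd.read_csv('data.csv', chunksize=100_000):
    X = chunk.drop(columns=['label']).values  # hypothetical 'label' column
    y = chunk['label'].values
    clf.partial_fit(X, y, classes=classes)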
I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.
This link may be useful...
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
I agree that h5py is useful but you may wish to use tools that are already in your quiver.
Another thing you can do is to randomly pick whether or not to keep each row in your csv file, and save the result to a .npy file so it loads more quickly. That way you get a sample of your data that lets you start playing with it using all the algorithms, and you can deal with the bigger-data issue along the way (or not at all! sometimes a sample with a good approach is good enough, depending on what you want).
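A minimal sketch of that sampling idea (the file names and the 10% keep rate are assumptions):
import numpy as np
import pandas as pd

keep_fraction = 0.10  # keep roughly 10% of the rows
rng = np.random.default_rng(0)

sampled_chunks = []
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    mask = rng.random(len(chunk)) < keep_fraction
    sampled_chunks.append(chunk[mask])

sample = pd.concat(sampled_chunks, ignore_index=True)
np.save('data_sample.npy', sample.to_numpy())  # .npy loads much faster than re-parsing the CSV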
You may want to take a look at Dask or Graphlab
http://dask.pydata.org/en/latest/
https://turi.com/products/create/
They are similar to pandas but work on large-scale data (using out-of-core dataframes). The problem with pandas is that all the data has to fit into memory.
Both frameworks can be used with scikit-learn. You can load the 22 GB of data into Dask or SFrame, then use it with sklearn.
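For example, a minimal sketch with Dask (the file name and the 'label' column are assumptions); the CSV is read lazily in partitions, and only one partition at a time is materialised and handed to sklearn:
import dask.dataframe as dd
from sklearn.linear_model import SGDClassifier

df = dd.read_csv('data.csv')  # lazy, out-of-core dataframe

clf = SGDClassifier()
for i in range(df.npartitions):
    part = df.get_partition(i).compute()  # bring one partition into memory
    X = part.drop(columns=['label']).values
    y = part['label'].values
    clf.partial_fit(X, y, classes=[0, 1])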
I find it interesting that you have chosen to use Python for statistical analysis rather than R. However, I would start by putting my data into a format that can handle such large datasets. The Python h5py package is fantastic for this kind of storage, allowing very fast access to your data. You will need to chunk up your data into reasonable sizes, say 1-million-element chunks (e.g. 20 columns x 50,000 rows), writing each chunk to the H5 file. Next you need to think about what kind of model you are running, which you haven't really specified.
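A minimal sketch of that chunked HDF5 layout (the file names, chunk size and dtype are assumptions):
import h5py
import pandas as pd

n_cols = 20
with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset(
        'data',
        shape=(0, n_cols),
        maxshape=(None, n_cols),  # resizable along the row axis
        chunks=(50_000, n_cols),  # roughly 1 million elements per chunk
        dtype='float64',
    )
    for chunk in pd.read_csv('data.csv', chunksize=50_000):
        values = chunk.to_numpy()
        dset.resize(dset.shape[0] + len(values), axis=0)
        dset[-len(values):] = values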
The fact is that you will probably have to write the algorithm for the model and the machine-learning cross-validation yourself, because the data is large. Start by writing an algorithm to summarize the data, so that you know what you are looking at. Then, once you decide what model you want to run, you will need to think about what the cross-validation will be. Add a "column" to each chunk of the data set that denotes which validation set each row belongs to. You may choose to assign each chunk to a particular validation set.
Next you will need to write a map-reduce-style algorithm to run your model on the validation subsets. The alternative is simply to run models on each chunk of each validation set and average the results (consider the theoretical validity of this approach).
Consider using Spark, or R and rhdf5, or something similar. I haven't supplied any code because this is a project rather than just a simple coding question.
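A rough sketch of the fold-assignment idea (the fold count, file name and column name are assumptions):
import numpy as np
import pandas as pd

n_folds = 5
rng = np.random.default_rng(0)

for chunk in pd.read_csv('data.csv', chunksize=50_000):
    # tag each row with the validation fold it belongs to
    chunk['fold'] = rng.integers(0, n_folds, size=len(chunk))
    # ... then write the tagged chunk to the H5 file, and later train on rows where fold != k ...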
Related
I have a very large dataset and need to reduce the embeddings from 768 dimensions to 128 dimensions with t-SNE. Since I have more than 1 million rows, it takes weeks to complete the dimension reduction on the whole dataset, so I thought maybe I could separate the dataset into different parts and then run each part separately. I do not have a GPU, so CPU only.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=128, init='pca', random_state=1001, perplexity=30, method='exact', n_iter=250, verbose=1)
X_tsne = tsne.fit_transform(df_dataset[:1000000]) # this will either fail or take a while (most likely overnight)
I am wondering whether my approach is considered OK?
The code above does not use splitting yet; it just loads the whole dataset. I just want to confirm whether splitting into multiple batches and then running fit_transform on each batch is the right way or not.
Also, I checked the link below about whitening sentence representations, but I am not sure whether it would work with my method above by replacing t-SNE with whitening. https://deep-ch.medium.com/dimension-reduction-by-whitening-bert-roberta-5e103093f782
It probably depends what you're trying to do, but I suspect the answer is that it is the wrong thing to do.
Between different batches it would be difficult to guarantee that the reduced-dimension representations are comparable, since they would have been optimised independently, not using the same data. So you could end up with data looking similar in the low-dimensional representation when it isn't similar in the original representation.
It seems like PCA might be more suited to you, since it's very fast. Or UMAP, since it is also fast, but additionally has some ways to work with batched data etc.
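A minimal sketch of the PCA route (the random matrix here is just a stand-in for the real 1M x 768 embedding matrix); unlike per-batch t-SNE, a single fitted PCA maps every row into the same 128-dimensional space:
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

embeddings = np.random.rand(100_000, 768).astype(np.float32)  # stand-in data

pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)

# if even the full matrix does not fit in memory, IncrementalPCA fits in batches
ipca = IncrementalPCA(n_components=128, batch_size=10_000)
reduced_batched = ipca.fit_transform(embeddings)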
I am making some rather big Bayesian networks for generating synthetic data, and I find pomegranate to be a good alternative, as it generates data quickly and easily allows for inputting evidence. I have one problem with it: saving the trained models. Pomegranate's built-in methods store the models as JSONs so big that I run out of memory when I have 30 or so variables, even when using "lighter" algorithms. The models cannot be pickled due to the error:
TypeError: self.distributions_ptr,self.parent_count,self.parent_idxs cannot be converted to a Python object for pickling
I am wondering if anyone has a good alternative for storing pomegranate models, or else knows of a Bayesian Network library that generates data quickly after training. I would be grateful for any tips.
If your model can be learned and stored in memory, it can be saved to a file, but maybe not by 'pickling'. There are many different formats for Bayesian networks (bif, xmlbif, dsl, uai, etc.). I don't know pomegranate, but there is certainly a way to read/save using such a format. With pyAgrum (of which I am one of the authors), you just have to write gum.saveBN(model, "model.xxx") to save it, and then bn=gum.loadBN("model.xxx") to read it. You can choose xxx among all the supported formats, for now: bif|dsl|net|bifxml|o3prm|uai (https://pyagrum.readthedocs.io/en/1.3.1/functions.html#pyAgrum.loadBN).
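For instance, a minimal round trip with pyAgrum (the small example network and the BIF format are just one possible choice):
import pyAgrum as gum

bn = gum.fastBN("A->B->C")     # small example network
gum.saveBN(bn, "model.bif")    # write to one of the supported formats
bn2 = gum.loadBN("model.bif")  # read it back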
As far as I understand, evidence for sampling is just a way to filter the samples by keeping only the samples that respect the constraints (rejection sampling). There is no such direct method in pyAgrum, but it can be done as a post-process:
import pyAgrum as gum

# create a BN with random CPTs
bn = gum.fastBN("A->B{yes|maybe|no}<-C->D->E<-F<-B")

# generate a sample of size 100
g = gum.BNDatabaseGenerator(bn)
g.setRandomVarOrder()
g.drawSamples(100)
df = g.to_pandas()

# filter the dataframe, keeping only samples compatible with the evidence B=yes, E=1
rslt_df = df[(df['B'] == "yes") & (df['E'] == "1")]
I'm interested in learning to rank with pairwise comparison. While working on this I found that XGBoost has a model called XGBRanker which works very well.
I want to find out how XGBRanker manages the training data to get such low memory usage and great results (it uses LambdaMART, I believe). I imagine it must be some kind of lookup table for the features, and maybe it builds the pairs iteratively or does not use all possible permutations with different labels within one group.
I tried looking through the source code but everything keeps referring to some other XGBoost method and I haven't been able to understand it so far.
I would like to create a similar method to train NNs for pairwise comparison but handling the training data has been a huge hurdle so far.
So, more generally, my question would be: how are the pairs created in pairwise ranking algorithms (RankNet, LambdaRank and so on)? Are all pairs used? Only a percentage? Is there some other way of doing this? If you're working with more than 100,000 items you would easily get into the range of hundreds of millions of pairs.
I hope someone has some information about this or knows who might.
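As an illustration of the usual convention (a sketch of the general idea, not XGBoost's actual implementation): within each query group, a pair is typically formed for every two items whose relevance labels differ, which is why the pair count grows roughly quadratically with the group size.
from itertools import combinations

def pairs_within_group(labels):
    # yield index pairs (i, j) where item i should rank above item j
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            yield i, j
        elif labels[j] > labels[i]:
            yield j, i
        # equal labels: no preference, so no pair

# one query group with graded relevance labels
print(list(pairs_within_group([2, 0, 1, 0])))
# [(0, 1), (0, 2), (0, 3), (2, 1), (2, 3)]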
I am trying to generate a Universal Background Model (UBM) based on a huge array of extracted MFCC features, but I keep getting a MemoryError when I am fitting my data to the model. Here is the relevant code section:
files_features.shape
(2469082, 56)
gmm = GMM(n_components = 1024, n_iter = 512, covariance_type = 'full', n_init = 3)
gmm.fit(features)
Is there a way to solve this error, or to decompose the processing of the data so as to avoid the memory error? I am quite new to this field and would appreciate any help I get.
[Update]
Unfortunately, the answers mentioned here do not solve my issue, since the dataset is assumed to have low variance, whereas in my case:
round(np.var(files_features), 3)
47.781
Incremental fitting might be a solution, but scikit-learn does not have a partial_fit for GMMs. I would appreciate any suggestions on how to tackle this, whether alternative library suggestions, partial_fit reference implementations, or processing the data batch by batch (which does not work in this case because GMM.fit() is memoryless: each new call to fit discards what was learned before).
That's fairly straightforward using Dask.
Just use Dask's DataFrame instead of pandas', and everything else should work without any changes.
As an alternative to scikit-learn, you can use Turi's GraphLab Create, which can handle arbitrarily large datasets (though I'm not sure it supports GMMs).
For those who have the same issue, I recommend the use of the Bob library, which supports big data processing and even offers parallel processing.
In my use-case Bob was a great fit for the development of GMM-UBM systems, as all the relevant functionalities are already implemented.
I need to use bag of words (in this case bag of features) to generate descriptor vectors to classify the KTH video dataset. In order to do this, I need to use the k-means clustering algorithm to cluster the extracted features and find the codebook. The extracted features from the dataset form approximately 75,000 vectors of 100 elements each. So I'm facing memory issues using the scipy.cluster.kmeans2 implementation on Ubuntu. I ran some tests and discovered that with 32,000 vectors of 100 elements each, the amount of memory used is around 20 GB (my total memory is 32 GB).
Is there any other Python k-means implementation that is more memory-efficient?
I have already read about Mahout for clustering big data, but I still don't understand what its advantages are. Is it more memory-efficient with the amount of data mentioned?
When you have many samples, consider using sklearn's MiniBatchKMeans, which is an SGD-like method built for exactly this case! (There is also a more tutorial-like intro that does not address memory usage, but I expect MiniBatchKMeans to be better there too for large n_samples. Of course, memory also depends on many other parameters, such as k. In the case of huge n_features it won't help with memory, but that's not your problem here.)
In that case you should carefully tune your mini-batch size.
You can try the classic KMeans implementation there too, as you seem to be only slightly over the memory requirements, and maybe this implementation is more efficient (it is certainly more tunable).
In the latter case, init, n_init, precompute_distances, algorithm and maybe copy_x are all parameters that affect memory consumption.
And furthermore: if(!) your data is sparse, try calling it with sparse matrices. (From reading the kmeans2 docs it seems that is not supported, but sklearn's KMeans does support it!)
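A minimal sketch of the MiniBatchKMeans route (the random matrix stands in for the ~75,000 x 100 descriptor matrix; the codebook size and batch size are assumptions):
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(75_000, 100).astype(np.float32)  # stand-in descriptor matrix

kmeans = MiniBatchKMeans(n_clusters=500, batch_size=1_000, random_state=0)
kmeans.fit(X)  # fits on mini-batches, so peak memory stays modest

codebook = kmeans.cluster_centers_  # 500 x 100 visual-word codebook
labels = kmeans.predict(X)          # nearest codeword for each descriptor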