Python kmeans clustering for large datasets

Python kmeans clustering for large datasets - python

I need to use bag of words (in this case bag of features) to generate descriptor vectors to classify the KTH video dataset. In order to do this, I need to use kmeans clustering algorithm to cluster the extracted features and find the codebook. The extracted features from dataset form approximately 75000 vectors of 100 elements each. So I'm facing memory issues using the scipy.cluster.kmeans2 implementation in Ubuntu. I runed some tests and discovered that with 32000 vector with 100 elements each, the amount of memory used is around 20GB (my total memory is 32GB).
Is there any other Python kmeans implementation more memory effcient?
I already read about Mahout for clustering big data, but I still not understand what is his advantages, is it more memory-efficient with that mentioned amount of data?

When having many samples, consider using sklearn's MiniBatchKMeans, which is a SGD-like method build for this case! (A more tutorial-like intro which does not address memory-usage, but i expect it to be better there for large n_samples. Of course memory also depends on many other parameters like k ... In the case of huge n_features it won't help in regards to memory; but that's not your problem here)
In this case you should carefully tune your mini-batch sizes then.
You can try the classic kmeans implementation there too as you seem to be just quite off the memory-requirements and maybe this implementation is more efficient (more tunable for sure).
In the latter case, init, n_init, precompute_distances, algorithm and maybe copy_x are all parameters having effect on memory-consumption.
And furthermore: if(!) your data is sparse; try calling it with sparse-matrices. (from reading kmeans2-docs it seems it's not supported, but sklearn's kmeans does!)

Related

Efficient way for Computing the Similarity of Multiple Documents using Spacy

I have around 10k docs (mostly 1-2 sentences) and want for each of these docs find the ten most simliar docs of a collection of 60k docs. Therefore, I want to use the spacy library. Due to the large amount of docs this needs to be efficient, so my first idea was to compute both for each of the 60k docs as well as the 10k docs the document vector (https://spacy.io/api/doc#vector) and save them in two matrices. This two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:
Is this actually the most efficient way or is there a clever trick that can speed up this process
If there is no other clever way, I was wondering whether there is at least a clever way to speed up the process of computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import spacy
nlp = spacy.load('en_core_web_lg')
doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
doc = nlp(train_list[i]) #the train list contains the single documents
doc_matrix[i] = doc.vector
Is there for example a way to parallelize this?

Don't do a big matrix operation, instead put your document vectors in an approximate nearest neighbors store (annoy is easy to use) and query the nearest items for each vector.
Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
That said, also check the spaCy speed FAQ.

I personally never worked with sentence similarity/vectors in SpaCy directly, so I can't tell you for sure about your first question, there might be some clever way to do this which is more native to SpaCy/the usual way to do it.
For generally speeding up the SpaCy processing:
Disable components you don't need such as Named Entity Recognition, Part of Speech Tagging etc.
Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop. Then access with for doc in processed_docs: or doc = next(processed_docs) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware, see the documentation.
For your actual "find the n most similar" problem:
This problem is not NLP- or SpaCy-specific but a general problem. There are a lot of sources on how to optimize this for numpy vectors online, you are basically looking for the n nearest datapoints within a large dataset (10000) of high dimensional (300) data. Check out this thread for some general ideas or this thread to for how to perform this kind of search (in this case K-nearest neighbours search) on numpy data.
Generally you should also not forget that in a large dataset (unless filtered) there are going to be documents/sentences which are duplicates or nearly duplicates (only differ by comma or so), so you might want to apply some filtering before performing the search.

Performing PCA in Pyspark Returns Different Results each run

Can someone help me understand why my PCA is getting different results each run?
Im working in Pyspark using Databricks
The current implementation of my code is as below
from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors
pca = PCA(k=35, inputCol="scaled_features", outputCol="pcaFeatures")
model = pca.fit(df.select('scaled_features'))
result = model.transform(df.select('scaled_features'))
print(model.explainedVariance)
If i run this code multiple times, i get different results for Explained Variance
The difference is quite small, but when i try to perform a K-Means Clustering after, the difference changes the result a lot.

PySpark is a distributed computation system and relies on distributed versions of the k-Means and PCA algorithms. In the distributed versions, they can be non-deterministic and have non-zero error bounds (see the links at the bottom) owing to the nature of data locality and a lack of a universal view of the dataset as necessary design constraints.
Each algorithm is designed in a way that no single machine has access to all of the data at one time to allow for datasets are too big to do so. At each step of the calculations, local segments of the data are used to generate data for intermediate results which is then shuffled across nodes with available capacity. Which entry is grouped with what other entry may change the results for PCA. In what order machines are available versus which elements are ready for the next iteration is not easy to synchronize.
k-Means (even on a single machine) can also be very non-deterministic - it is very sensitive to the initial seed centroids of the clusters. Making sure you always start with the same centroids can help (but if it is performed on PCA features that are changing, that will not help). Carefully setting random seeds on each machine that so they do not collide is also something to consider. Also ensuring that the same data is assigned to the same partitions at the beginning of the run with sorting/indexes can help. All of these things together may improve the variance between runs, but there are a lot of moving parts that all play a role.
https://www.cs.cmu.edu/~ninamf/papers/distributedPCAandCoresets.pdf
https://www.researchgate.net/publication/232063041_Principal_Component_Analysis_for_Dimension_Reduction_in_Massive_Distributed_Data_Sets

TFIDF for Large Dataset

I have a corpus which has around 8 million news articles, I need to get the TFIDF representation of them as a sparse matrix. I have been able to do that using scikit-learn for relatively lower number of samples, but I believe it can't be used for such a huge dataset as it loads the input matrix into memory first and that's an expensive process.
Does anyone know, what would be the best way to extract out the TFIDF vectors for large datasets?

Gensim has an efficient tf-idf model and does not need to have everything in memory at once.
Your corpus simply needs to be an iterable, so it does not need to have the whole corpus in memory at a time.
The make_wiki script runs over Wikipedia in about 50m on a laptop according to the comments.

I believe you can use a HashingVectorizer to get a smallish csr_matrix out of your text data and then use a TfidfTransformer on that. Storing a sparse matrix of 8M rows and several tens of thousands of columns isn't such a big deal. Another option would be not to use TF-IDF at all- it could be the case that your system works reasonably well without it.
In practice you may have to subsample your data set- sometimes a system will do just as well by just learning from 10% of all available data. This is an empirical question, there is not way to tell in advance what strategy would be best for your task. I wouldn't worry about scaling to 8M document until I am convinced I need them (i.e. until I have seen a learning curve showing a clear upwards trend).
Below is something I was working on this morning as an example. You can see the performance of the system tends to improve as I add more documents, but it is already at a stage where it seems to make little difference. Given how long it takes to train, I don't think training it on 500 files is worth my time.

I solve that problem using sklearn and pandas.
Iterate in your dataset once using pandas iterator and create a set of all words, after that use it in CountVectorizer vocabulary. With that the Count Vectorizer will generate a list of sparse matrix all of them with the same shape. Now is just use vstack to group them. The sparse matrix resulted have the same information (but the words in another order) as CountVectorizer object and fitted with all your data.
That solution is not the best if you consider the time complexity but is good for memory complexity. I use that in a dataset with 20GB +,
I wrote a python code (NOT THE COMPLETE SOLUTION) that show the properties, write a generator or use pandas chunks for iterate in your dataset.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack
# each string is a sample
text_test = [
'good people beauty wrong',
'wrong smile people wrong',
'idea beauty good good',
]
# scikit-learn basic usage
vectorizer = CountVectorizer()
result1 = vectorizer.fit_transform(text_test)
print(vectorizer.inverse_transform(result1))
print(f"First approach:\n {result1}")
# Another solution is
vocabulary = set()
for text in text_test:
for word in text.split():
vocabulary.add(word)
vectorizer = CountVectorizer(vocabulary=vocabulary)
outputs = []
for text in text_test: # use a generator
outputs.append(vectorizer.fit_transform([text]))
result2 = vstack(outputs)
print(vectorizer.inverse_transform(result2))
print(f"Second approach:\n {result2}")
Finally, use TfidfTransformer.

The lengths of the documents The number of terms in common Whether the terms are common or unusual How many times each term appears

sklearn and large datasets

I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it in memory.
I use a lot sklearn but for much smaller datasets.
In this situations the classical approach should be something like.
Read only part of the data -> Partial train your estimator -> delete the data -> read other part of the data -> continue to train your estimator.
I have seen that some sklearn algorithm have the partial fit method that should allow us to train the estimator with various subsamples of the data.
Now I am wondering is there an easy why to do that in sklearn?
I am looking for something like
r = read_part_of_data('data.csv')
m = sk.my_model
`for i in range(n):
x = r.read_next_chunk(20 lines)
m.partial_fit(x)
m.predict(new_x)
Maybe sklearn is not the right tool for these kind of things?
Let me know.

I've used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient, Perceptron and Passive Agressive and also Multinomial Naive Bayes on a Kaggle dataset of over 30Gb. All these classifiers share the partial_fit method which you mention. Some behave better than others though.
You can find the methodology, the case study and some good resources in of this post:
http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/

I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.
This link may be useful...
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
I agree that h5py is useful but you may wish to use tools that are already in your quiver.
Another thing you can do is to randomly pick whether or not to keep a row in your csv file...and save the result to a .npy file so it loads quicker. That way you get a sampling of your data that will allow you to start playing with it with all algorithms...and deal with the bigger data issue along the way(or not at all! sometimes a sample with a good approach is good enough depending on what you want).

You may want to take a look at Dask or Graphlab
http://dask.pydata.org/en/latest/
https://turi.com/products/create/
They are similar to pandas but working on large scale data (using out-of-core dataframes). The problem with pandas is all data has to fit into memory.
Both frameworks can be used with scikit learn. You can load 22 GB of data into Dask or SFrame, then use with sklearn.

I find it interesting that you have chosen to use Python for statistical analysis rather than R however, I would start by putting my data into a format that can handle such large datasets. The python h5py package is fantastic for this kind of storage - allowing very fast access to your data. You will need to chunk up your data in reasonable sizes say 1 million element chunks e.g. 20 columns x 50,000 rows writing each chunk to the H5 file. Next you need to think about what kind of model you are running - which you haven't really specified.
The fact is that you will probably have to write the algorithm for model and the machine learning cross validation because the data is large. Start by writing an algorithm to summarize the data, so that you know what you am looking at. Then once you decide what model you want to run you will need to think about what the cross validation will be. Put in a "column" into each chunk of the data set that denotes which validation set each row belongs to. You many choose to label each chunk to a particular validation set.
Next you will need to write a map reduce style algorithm to run your model on the validation subsets. The alternative is simply to run models on each chunk of each validation set and average the result (consider the theoretical validity of this approach).
Consider using spark, or R and rhdf5 or something similar. I haven't supplied any code because this is a project rather than just a simple coding question.

Hierarchical clustering of 1 million objects

Can anyone point me to a hierarchical clustering tool (preferable in python) that can cluster ~1 Million objects? I have tried hcluster and also Orange.
hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (saturated memory and eventually crashed).
I am running on a 64bit Xeon CPU (2.53GHz) and 8GB of RAM + 3GB swap on Ubuntu 11.10.

The problem probably is that they will try to compute the full 2D distance matrix (about 8 GB naively with double precision) and then their algorithm will run in O(n^3) time anyway.
You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow and the results are not at all convincing usually. In particular for millions of objects, where you can't just look at the dendrogram to choose the appropriate cut.
If you really want to continue hierarchical clustering, I belive that ELKI (Java though) has a O(n^2) implementation of SLINK. Which at 1 million objects should be approximately 1 million times as fast. I don't know if they already have CLINK, too. And I'm not sure if there actually is any sub-O(n^3) algorithm for other variants than single-link and complete-link.
Consider using other algorithms. k-means for example scales very well with the number of objects (it's just not very good usually either, unless your data is very clean and regular). DBSCAN and OPTICS are quite good in my opinion, once you have a feel for the parameters. If your data set is low dimensional, they can be accelerated quite well with an appropriate index structure. They should then run in O(n log n), if you have an index with O(log n) query time. Which can make a huge difference for large data sets. I've personally used OPTICS on a 110k images data set without problems, so I can imagine it scales up well to 1 million on your system.

To beat O(n^2), you'll have to first reduce your 1M points (documents)
to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
Two possible approaches:
build a hierarchical tree from say 15k points, then add the rest one by one:
time ~ 1M * treedepth
first build 100 or 1000 flat clusters,
then build your hierarchical tree of the 100 or 1000 cluster centres.
How well either of these might work depends critically
on the size and shape of your target tree --
how many levels, how many leaves ?
What software are you using,
and how many hours / days do you have to do the clustering ?
For the flat-cluster approach,
K-d_tree s
work fine for points in 2d, 3d, 20d, even 128d -- not your case.
I know hardly anything about clustering text;
Locality-sensitive_hashing ?
Take a look at scikit-learn clustering --
it has several methods, including DBSCAN.
Added: see also
google-all-pairs-similarity-search
"Algorithms for finding all similar pairs of vectors in sparse vector data", Beyardo et el. 2007
SO hierarchical-clusterization-heuristics

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.