I currently want to calculate all-pair document similarity using cosine similarity and tf-idf features in Python. My basic approach is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
#c = [doc1, doc2, ..., docn]
vec = TfidfVectorizer()
X = vec.fit_transform(c)
del vec
Y = X * X.T
This works perfectly fine, but unfortunately not for my very large datasets. X has shape (350363, 2526183), so the output matrix Y should have shape (350363, 350363). X is very sparse thanks to the tf-idf features and easily fits into memory (around 2 GB only). Yet the multiplication gives me a memory error after running for some time (even though memory is not actually full, so I suppose scipy is clever enough to anticipate the memory usage).
I have already tried playing around with the dtypes, without success. I have also made sure that numpy and scipy have their BLAS libraries linked, although this has no effect on the csr_matrix dot functionality, since that is implemented in C. I thought of maybe using things like memmap, but I am not sure about that.
Does anyone have an idea of how to best approach this?
Even though X is sparse, X * X.T probably won't be; notice that it only takes one nonzero element in common between a given pair of rows to make the corresponding output entry nonzero. You are working on an NLP task, so I am pretty sure there are huge numbers of words which occur in nearly all documents (and, as said before, it does not have to be one word shared by all pairs, just one, possibly different, word for each pair). As a result you get a matrix of 350363^2, which is about 122,000,000,000 elements; unless you have hundreds of gigabytes of RAM, that does not look computable. Try much more aggressive filtering of words in order to force X * X.T to be sparse (remove many common words).
In general you won't be able to compute the Gram matrix of big data unless you enforce sparsity of X * X.T, so that most of your vector (document) pairs have 0 "similarity". This can be done in numerous ways; the easiest is to set some threshold T, compute the dot products yourself, and create an entry in the resulting sparse matrix only if <a,b> > T, treating anything below T as 0.
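A minimal sketch of that thresholding idea (the threshold T and block size here are arbitrary choices, and X is assumed to be the tf-idf matrix from the question): compute X * X.T one block of rows at a time and drop every entry below T so that the result stays sparse.

from scipy import sparse

def thresholded_gram(X, T=0.3, block=1000):
    """Compute X @ X.T block-wise, keeping only similarities >= T."""
    X = sparse.csr_matrix(X)
    stripes = []
    for start in range(0, X.shape[0], block):
        stripe = X[start:start + block] @ X.T   # one horizontal stripe of the result
        stripe.data[stripe.data < T] = 0        # zero out weak similarities...
        stripe.eliminate_zeros()                # ...and physically drop them
        stripes.append(stripe)
    return sparse.vstack(stripes).tocsr()

# Y = thresholded_gram(X, T=0.3)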
You may want to look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eta, which is a hyperparameter when you calculate the number of random projections needed.
To cut a long story short, the scikit-learn class SparseRandomProjection seen here is a transformer to do this for you. If you run it on X after vec.fit_transform you should see a fairly large reduction in feature size.
The formula from sklearn.random_projection.johnson_lindenstrauss_min_dim shows that to preserve distances up to a 10% tolerance, you only need johnson_lindenstrauss_min_dim(350363, .1) = 10942 features. This is an upper bound, so you may be able to get away with much less. Even a 1% tolerance would only need johnson_lindenstrauss_min_dim(350363, .01) = 1028192 features, which is still significantly less than you have right now.
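For concreteness, a short sketch of the projection step, assuming X is the tf-idf matrix from the question and using eps=0.1 to mirror the 10% tolerance quoted above:

from sklearn.random_projection import SparseRandomProjection, johnson_lindenstrauss_min_dim

n_components = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=0.1)
srp = SparseRandomProjection(n_components=n_components, random_state=0)
X_small = srp.fit_transform(X)   # roughly (350363, 10942) instead of (350363, 2526183)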
EDIT:
Simple thing to try - if your data is dtype='float64', try using 'float32'. That alone can save a massive amount of space, especially if you do not need double precision.
If the issue is that you cannot store the "final matrix" in memory either, I would recommend working with the data in an HDF5Store (as seen in pandas using pytables). This link has some good starter code, and you could iteratively calculate chunks of your dot product and write to disk. I have been using this extensively in a recent project on a 45GB dataset, and could provide more help if you decide to go this route.
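As a rough illustration of that chunked-write pattern, here is a sketch using h5py rather than the pandas HDFStore mentioned above (X is assumed to be the sparse tf-idf matrix from the question; note that the full dense 350363 x 350363 result would be on the order of 500 GB even in float32, so in practice you would combine this with the filtering/thresholding ideas above):

import h5py

n, block = X.shape[0], 1000
with h5py.File("similarities.h5", "w") as f:
    dset = f.create_dataset("sim", shape=(n, n), dtype="float32", chunks=(block, n))
    for start in range(0, n, block):
        stripe = X[start:start + block] @ X.T                   # sparse stripe of the result
        dset[start:start + stripe.shape[0]] = stripe.toarray()  # write it to disk and move on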
What you could do is slice out one row of X at a time, multiply it with X.T, and save the resulting row of the output to a file. Then move on to the next row.
It is still the same amount of calculation work but you wouldn't run out of memory.
Using multiprocessing.Pool.map() or multiprocessing.Pool.map_async() you might be able to speed it up, provided you use numpy.memmap() to read the matrix in the mapped function. And you would probably have to write each of the calculated rows to a separate file to merge them later. If you were to return the row from the mapped function it would have to be transferred back to the original process, which would take a lot of memory and IPC bandwidth.
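A hedged sketch of that layout; it sidesteps numpy.memmap by having each worker load the sparse matrix once from an .npz file (memmapping a CSR matrix would mean memmapping its data/indices/indptr arrays separately), batches rows into blocks for efficiency, and writes each block to its own file to be merged later. The file names and block size are placeholders, and "X.npz" is assumed to have been written beforehand with sparse.save_npz("X.npz", X).

from multiprocessing import Pool
from scipy import sparse

def init_worker():
    global X_shared
    X_shared = sparse.load_npz("X.npz").tocsr()     # loaded once per worker process

def compute_block(bounds):
    start, stop = bounds
    block = X_shared[start:stop] @ X_shared.T       # one stripe of X @ X.T
    sparse.save_npz("block_%d.npz" % start, block)  # saved by the worker, merged later
    return start

if __name__ == "__main__":
    n = sparse.load_npz("X.npz").shape[0]
    step = 1000
    jobs = [(i, min(i + step, n)) for i in range(0, n, step)]
    with Pool(processes=4, initializer=init_worker) as pool:
        for done in pool.imap_unordered(compute_block, jobs):
            pass  # results are already on disk; nothing large is sent back over IPC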
I have around 10k docs (mostly 1-2 sentences) and want to find, for each of them, the ten most similar docs in a collection of 60k docs. To do this I want to use the spaCy library. Due to the large number of docs this needs to be efficient, so my first idea was to compute the document vector (https://spacy.io/api/doc#vector) for each of the 60k docs as well as for the 10k docs and save them in two matrices. These two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:
Is this actually the most efficient way, or is there a clever trick that can speed up this process?
If there is no other clever way, I was wondering whether there is at least a clever way to speed up the process of computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')
doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
    doc = nlp(train_list[i])  # train_list contains the individual documents
    doc_matrix[i] = doc.vector
Is there for example a way to parallelize this?
Don't do a big matrix operation; instead, put your document vectors in an approximate nearest neighbors store (annoy is easy to use) and query the nearest items for each vector.
Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
That said, also check the spaCy speed FAQ.
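A small sketch of the Annoy suggestion; collection_matrix (60000, 300) and query_matrix (10000, 300) are placeholder names for the two document-vector matrices described in the question, and n_trees=50 is an arbitrary choice:

from annoy import AnnoyIndex

dim = 300                           # en_core_web_lg vector size
index = AnnoyIndex(dim, "angular")  # angular distance corresponds to cosine similarity
for i, vec in enumerate(collection_matrix):
    index.add_item(i, vec)          # index the 60k collection vectors
index.build(50)                     # more trees give better recall at the cost of build time

# Ten most similar collection docs for each of the 10k query docs
top_ten = [index.get_nns_by_vector(vec, 10) for vec in query_matrix]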
I personally never worked with sentence similarity/vectors in SpaCy directly, so I can't tell you for sure about your first question; there might be some clever way to do this which is more native to SpaCy, or a more usual way to do it.
For generally speeding up the SpaCy processing:
Disable components you don't need such as Named Entity Recognition, Part of Speech Tagging etc.
Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop. Then access the results with for doc in processed_docs: or doc = next(processed_docs) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware; see the documentation.
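A combined sketch of both suggestions above, reusing train_list from the question (the disabled component names and batch_size are illustrative, not required values):

import numpy as np
import spacy

# The vectors come from the vocab, so the tagger, parser and NER are not needed for doc.vector.
nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser", "ner"])

doc_matrix = np.zeros((len(train_list), 300))
for i, doc in enumerate(nlp.pipe(train_list, batch_size=256)):
    doc_matrix[i] = doc.vector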
For your actual "find the n most similar" problem:
This problem is not NLP- or SpaCy-specific but a general one. There are a lot of sources online on how to optimize this for numpy vectors; you are basically looking for the n nearest datapoints within a large dataset (10000) of high-dimensional (300) data. Check out this thread for some general ideas, or this thread for how to perform this kind of search (in this case a K-nearest-neighbours search) on numpy data.
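For example, a hedged sketch of such a K-nearest-neighbours search using scikit-learn (collection_matrix and query_matrix are placeholder names for the 60k and 10k document-vector matrices from the question):

from sklearn.neighbors import NearestNeighbors

# Cosine distance matches the dot-product-as-similarity idea from the question.
nn = NearestNeighbors(n_neighbors=10, metric="cosine")
nn.fit(collection_matrix)                         # (60000, 300) document vectors
distances, indices = nn.kneighbors(query_matrix)  # indices has shape (10000, 10)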
Generally you should also not forget that in a large dataset (unless filtered) there are going to be documents/sentences which are duplicates or near-duplicates (differing only by a comma or so), so you might want to apply some filtering before performing the search.
I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, and some are categorical, in addition to the occasional missing values. It is essential that the clustering is run on all data points, and we look to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large amount of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all numeric attributes into the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
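A hedged sketch of those three steps with scikit-learn; num_cols and cat_cols are placeholders for your actual column names, and the imputation strategies and n_components=10 are illustrative choices, not recommendations:

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# sparse_threshold=0 forces a dense output so that plain PCA can be applied afterwards.
prep = ColumnTransformer([("num", numeric, num_cols),
                          ("cat", categorical, cat_cols)],
                         sparse_threshold=0)
pipe = Pipeline([("prep", prep), ("pca", PCA(n_components=10))])
X_reduced = pipe.fit_transform(df)   # 4 million rows x 10 components, ready for clustering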
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you, since even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way down to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI seems to be the fastest (it takes some messing around to figure out) because it runs in Java. The output of ELKI is a little strange: it writes a file for every cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
Right now I am using numpy.linalg.solve to solve my system, but doing this for a 5000*17956 matrix makes it really time consuming. It runs really slowly and has taken me more than an hour to solve. The running time for solving a matrix equation is probably O(n^3), but I never thought it would be that slow. Is there any way to solve it faster in Python?
My code is something like the following; it solves for a in the equation B.T * U.T = (B.T * B) * a, where m is the number of test cases (in my case over 5000), B is an m*17956 data matrix, and U is 1*m.
C = 0.005 # hyperparameter term for regularization
I = np.identity(17956) # 17956*17956 identity matrix
rhs = np.dot(B.T, U.T) # (17956*m) * (m*1) = 17956*1
lhs = np.dot(B.T, B)+C*I # (17956*m) * (m*17956) = 17956*17956
a = np.linalg.solve(lhs, rhs) # B.T u = B.T B a, solve for a (17956*1)
Update (2 July 2018): The updated question asks about the impact of a regularization term and the type of data in the matrices. In general, this can make a large impact in terms of the datatypes a particular CPU is most optimized for (as a rough rule of thumb, AMD is better with vectorized integer math and Intel is better with vectorized floating point math when all other things are held equal), and the presence of a large number of zero values can allow for the use of sparse matrix libraries. In this particular case though, the changes on the main diagonal (well under 1% of all the values in consideration) will have a negligible impact in terms of runtime.
TLDR;
An hour is reasonable (a cubic regression suggests that this would take around 83 minutes on my machine -- a low-end chromebook).
The pre-processing to generate lhs and rhs accounts for almost none of that time.
You won't be able to solve that exact problem much faster than with numpy.linalg.solve.
If m is small as you suggest and if B is invertible, you can instead solve the equation U.T=Ba in a minute or less.
If this is part of a larger problem, this costly intermediate step might be able to be simplified away from a mathematical framework.
Performance bottlenecks really should be addressed with profiling to figure out which step is causing the issues.
Since this comes from real-world data, you might be able to get away with fewer features (either directly or through a reduction step like PCA, NMF, or LLE), depending on the end goal.
As mentioned in another answer, if the matrix is sufficiently sparse you can get away with sparse linear algebra routines to great effect (many natural language processing data sources are like this).
Since the output is a 1D vector, I would use np.dot(U, B).T instead of np.dot(B.T, U.T). Transposes are neat that way. This avoids doing the transpose on a big matrix like B, though since you have a cubic operation as the dominant step this doesn't matter much for your problem.
Depending on whether you need the original data anymore and whether the matrices involved have any other special properties, you might be able to fiddle with the parameters in scipy.linalg.solve instead for a gain (see the sketch after these notes).
I've had mixed success replacing large matrix equations with block matrix equations falling back on numpy routines. That approach typically saves 5-20% over numpy approaches and takes 1% or so off scipy approaches on my system. I haven't fully explored the reason for the discrepancy.
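To make the scipy.linalg.solve suggestion concrete: lhs = B.T B + C*I is symmetric positive definite (a ridge-regularised Gram matrix), so telling the solver that lets LAPACK use a Cholesky-based routine instead of a general LU factorization. A sketch, reusing B, U and C from the question:

import numpy as np
from scipy.linalg import solve

C = 0.005
lhs = np.dot(B.T, B) + C * np.identity(B.shape[1])  # symmetric positive definite
rhs = np.dot(U, B).T                                # same result as np.dot(B.T, U.T)
a = solve(lhs, rhs, assume_a="pos")                 # Cholesky-based solve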
Assuming your matrix is sparse, the scipy.sparse.linalg module will be useful. Here is the documentation for the whole module, and here is the documentation for spsolve.
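A minimal sketch of that route, assuming B has been built as a scipy sparse matrix (B_sparse here is a placeholder name) and is sparse enough for this to pay off:

from scipy import sparse
from scipy.sparse.linalg import spsolve

C = 0.005
lhs = (B_sparse.T @ B_sparse + C * sparse.identity(B_sparse.shape[1])).tocsc()
rhs = B_sparse.T @ U.ravel()  # U is the 1*m vector from the question
a = spsolve(lhs, rhs)         # a has shape (17956,)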
I'm using scikit-learn (v0.15.2) for non-negative matrix factorization on a large sparse matrix (less than 1% of values > 0). I want to find factors by minimizing errors only on the non-zero values of the matrix (i.e., not calculating errors for entries that are zero), and to favor sparsity. I'm not sure if something's wrong with what I'm trying. The scikit-learn package's NMF and ProjectedGradientNMF have worked well for me before, but it seems that when the matrix size increases, the factorization becomes terribly slow.
I'm talking about matrices with > 10^10 cells. For a matrix with ~10^7 cells, I find the execution time to be good.
The parameters I've used are as follows: nmf_model = NMF(n_components = 100, init='nndsvd', random_state=0, tol = 0.01, sparseness='data').
When I tried slightly different parameters (changing to init='random'), I got the following warning. After the warning, the execution of the script halts.
/lib/python2.7/site-packages/sklearn/decomposition/nmf.py:252: UserWarning: Iteration limit reached in nls subproblem.
warnings.warn("Iteration limit reached in nls subproblem.")
Is there a way to make this faster and solve the above problem? I've tried using scipy sparse matrices (column- and row-sparse), but surprisingly they were slower on the test I did with a smaller matrix (~10^7 cells).
Considering that one would have to run multiple iterations of such a factorization (to choose an ideal number of factors and k-fold cross validation), a faster way to solve this problem is highly desirable.
I'm also open to suggestions of packages/tools not based on sklearn or Python. I understand questions about package/tool choices are not encouraged, but for such a specific use case, knowing what techniques others in the field use would be very helpful.
Maybe a few words on what the initial problem is about could enable us to give better answers.
Matrix Factorization on a very large matrix is always going to be slow due to the nature of the problem.
Suggestions:
Reducing n_components to < 20 will speed it up somewhat. However, the only real improvement in speed will be achieved by limiting the size of the matrix.
Given a matrix like the one you describe, it sounds like you might be trying to factorize a term frequency matrix. If so, you could try using the vectorization functions in scikit-learn to limit the size of the matrix. Most of them have a max_features parameter. Example:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(data)  # data is your collection of raw documents
This will significantly speed up the problem solving.
Should I be completely wrong and this is not a term frequency problem, I would still look into ways to limit the initial matrix you are trying to factorize.
You might want to take a look at this article which discusses more recent techniques on NMF: http://www.cc.gatech.edu/~hpark/papers/nmf_blockpivot.pdf
The idea is to work only on the nonzero entries for factorization which reduces computational time especially when the matrix/matrices involved is/are very sparse.
Also, one of the authors from the same article created NMF implementations on github including the ones mentioned in their article. Here's the link: https://github.com/kimjingu/nonnegfac-python
Hope that helps.
Old question, new answer.
The OP asks for "zero-masked" NMF, where zeros are treated as missing values. This will never be faster than normal NMF. Consider NMF by alternating least squares: there the left-hand side of the systems of equations is generally constant (it is simply the tcrossprod of W or H), but in zero-masked NMF it needs to be re-calculated for every single sample or feature.
I've implemented zero-masked NMF in the RcppML R package. You can install it from CRAN and use the nmf function setting the mask_zeros argument to TRUE:
install.packages("RcppML")
library(Matrix)  # provides rsparsematrix()

A <- rsparsematrix(1000, 1000, 0.1)  # simulate a random sparse matrix
model <- RcppML::nmf(A, k = 10, mask_zeros = TRUE)
My NMF implementation is faster than scikit-learn without masking zeros, and shouldn't be impossibly slow for 99% sparse matrices.
I am trying to apply hierarchical clustering to my dataset, which consists of 14039 user vectors. Each vector has 10 features, where each feature is basically the frequency of tags tagged by that user.
I am using the SciPy API for clustering.
Now I need to calculate the pairwise distances between these 14039 users and pass this distance matrix to the linkage function.
import scipy.cluster.hierarchy as sch
from numpy import set_printoptions

Y = sch.distance.pdist(allUserVector, 'cosine')  # condensed pairwise cosine distances
set_printoptions(threshold='nan')                # print the full array instead of a summary
print Y
But my program gives me MemoryError while calculating the distance matrix itself
File "/usr/lib/pymodules/python2.7/numpy/core/numeric.py", line 1424, in array_str
return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 306, in array2string
separator, prefix)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 210, in _array2string
format_function = FloatFormat(data, precision, suppress_small)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 392, in __init__
self.fillFormat(data)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 399, in fillFormat
non_zero = absolute(data.compress(not_equal(data, 0) & ~special))
MemoryError
Any idea how to fix this? Is my dataset too large? I would guess that clustering 14k users shouldn't be so much work that it causes a MemoryError.
I am running it on an i3 with 4 GB of RAM.
I need to apply DBScan clustering too, but that too needs distance matrix as input.
Any suggestions appreciated.
Edit: I get the error only when I print Y. Any ideas why?
Well, hierarchical clustering doesn't make that much sense for large datasets. It's actually mostly a textbook example in my opinion. The problem with hierarchical clustering is that it doesn't really build sensible clusters. It builds a dendrogram, but with 14000 objects the dendrogram becomes pretty much unusable. And very few implementations of hierarchical clustering have non-trivial methods to extract sensible clusters from the dendrogram. Plus, in the general case, hierarchical clustering is of complexity O(n^3), which makes it scale really badly to large datasets.
DBSCAN technically does not need a distance matrix. In fact, when you use a distance matrix, it will be slow, as computing the distance matrix already is O(n^2). And even then, you can save the O(n^2) memory cost for DBSCAN by computing the distances on the fly, at the cost of computing each distance twice. DBSCAN visits each point once, so there is next to no benefit from using a distance matrix except the symmetry gain. And technically, you could do some neat caching tricks to even reduce that, since DBSCAN also just needs to know which objects are below the epsilon threshold. When the epsilon is chosen reasonably, managing the neighbor sets on the fly will use significantly less memory than O(n^2) at the same CPU cost of computing the distance matrix.
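As one hedged illustration in Python: scikit-learn's DBSCAN accepts the raw vectors and computes neighbourhoods internally, so you do not have to build the full n x n distance matrix yourself. The eps and min_samples values below are placeholders you would need to tune for your tag-frequency data, and allUserVector is the (14039, 10) array from the question.

from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=0.1, min_samples=10, metric="cosine").fit_predict(allUserVector)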
Any really good implementation of DBSCAN (it is spelled all uppercase, btw, as it is an abbreviation, not a scan) however should have support for index structures and then run in O(n log n) runtime.
On http://elki.dbs.ifi.lmu.de/wiki/Benchmarking they run DBSCAN on a 110250 object dataset and 8 dimensions, and the non-indexed variant takes 1446 seconds, the one with index just 219. That is about 7 times faster, including index buildup. (It's not python, however) Similarly, OPTICS is 5 times faster with the index. And their kmeans implementation in my experiments was around 6x faster than WEKA kmeans and using much less memory. Their single-link hierarchical clustering also is an optimized O(n^2) implementation. Actually the only one I've seen so far that is not the naive O(n^3) matrix-editing approach.
If you are willing to go beyond python, that might be a good choice.
It's possible that you really are running out of RAM. Finding pairwise distances between N objects means storing N^2 distances. In your case, N^2 is going to be 14039 ^ 2 = 1.97 * 10^8. If we assume that each distance takes only four bytes (which is almost certainly not the case, as they have to be held in some sort of data structure which may have non-constant overhead) that works out to 800 megabytes. That's a lot of memory for the interpreter to be working with. 32-bit architectures only allow up to 2 GB of process memory, and just your raw data is taking up around 50% of that. With the overhead of the data structure you could be looking at usage much higher than that -- I can't say how much because I don't know the memory model behind SciPy/numpy.
I would try breaking your data sets up into smaller sets, or not constructing the full distance matrix. You can break it down into more manageable chunks (say, 14 subsets of around 1000 elements) and do nearest-neighbor between each chunk and all of the vectors -- then you're looking at loading an order of magnitude less into memory at any one time (14000 * 1000, 14 times instead of 14000 * 14000 once).
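A rough sketch of that chunked nearest-neighbour idea (the chunk size and k are arbitrary, and allUserVector is the array from the question):

import numpy as np
from scipy.spatial.distance import cdist

def nearest_neighbours_chunked(X, k=10, chunk=1000):
    """Find the k nearest neighbours of every row, one chunk of rows at a time."""
    n = X.shape[0]
    nearest = np.empty((n, k), dtype=np.intp)
    for start in range(0, n, chunk):
        d = cdist(X[start:start + chunk], X, metric='cosine')  # only a (chunk, n) block in memory
        np.fill_diagonal(d[:, start:start + chunk], np.inf)    # exclude self-matches
        nearest[start:start + chunk] = np.argsort(d, axis=1)[:, :k]
    return nearest

# neighbours = nearest_neighbours_chunked(allUserVector)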
Edit: agf is completely right on both counts: I missed your edit, and the problem probably comes about when it tries to construct the giant string that represents your matrix. If it's printing floating point values, and we assume 10 characters are printed per element and the string is stored with one byte per character, then you're looking at exactly 2 GB of memory usage just for the string.