I have around 10k docs (mostly 1-2 sentences) and want for each of these docs find the ten most simliar docs of a collection of 60k docs. Therefore, I want to use the spacy library. Due to the large amount of docs this needs to be efficient, so my first idea was to compute both for each of the 60k docs as well as the 10k docs the document vector (https://spacy.io/api/doc#vector) and save them in two matrices. This two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:
Is this actually the most efficient way or is there a clever trick that can speed up this process
If there is no other clever way, I was wondering whether there is at least a clever way to speed up the process of computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import spacy
nlp = spacy.load('en_core_web_lg')
doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
doc = nlp(train_list[i]) #the train list contains the single documents
doc_matrix[i] = doc.vector
Is there for example a way to parallelize this?
Don't do a big matrix operation, instead put your document vectors in an approximate nearest neighbors store (annoy is easy to use) and query the nearest items for each vector.
Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
That said, also check the spaCy speed FAQ.
I personally never worked with sentence similarity/vectors in SpaCy directly, so I can't tell you for sure about your first question, there might be some clever way to do this which is more native to SpaCy/the usual way to do it.
For generally speeding up the SpaCy processing:
Disable components you don't need such as Named Entity Recognition, Part of Speech Tagging etc.
Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop. Then access with for doc in processed_docs: or doc = next(processed_docs) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware, see the documentation.
For your actual "find the n most similar" problem:
This problem is not NLP- or SpaCy-specific but a general problem. There are a lot of sources on how to optimize this for numpy vectors online, you are basically looking for the n nearest datapoints within a large dataset (10000) of high dimensional (300) data. Check out this thread for some general ideas or this thread to for how to perform this kind of search (in this case K-nearest neighbours search) on numpy data.
Generally you should also not forget that in a large dataset (unless filtered) there are going to be documents/sentences which are duplicates or nearly duplicates (only differ by comma or so), so you might want to apply some filtering before performing the search.
Related
I have the following code which takes in 2 sentences and return the similarity:
nlp = spacy.load("en_core_web_md/en_core_web_md-3.2.0")
def get_categories_nlp_sim(cat_1, cat_2):
if (cat_1 != cat_1) or (cat_2 != cat_2):
s = np.nan
else:
doc1 = nlp(cat_1)
doc2 = nlp(cat_2)
s = doc1.similarity(doc2)
return s
So, this seems to give reasonable results but when using it in a for loop of ~1M rows, it just becomes too slow to use.
Any ideas on how to speed this up? or perhaps another NLP library that could do the same thing faster?
Thanks!
If you truly have 1m rows and compare each of them as pairs you would have an astronomical number of comparisons. SpaCys nlp() does a whole lot other than just the stuff needed for the similarity.
What SpaCys similarity() does is use the processed documents vector and calculate a cosine similarity (document vector = average over word vectors), check out source code.
So the probably most efficient possibly way for you to replicate a similarity for this many pairs would be to get a semantic token representation vector for each unique token in the entire corpus using something like Gensims pretrained word2vec model, then for each row calculate the average of the vectors of the tokens in it and then once you have those 1m document vectors as numpy arrays you calculate the cosine similarities using numpy or scipy which is drastically faster than pure Python.
Also check out this thread which is a similar question to yours: Efficient way for Computing the Similarity of Multiple Documents using Spacy
I'm not sure what exactly your main goal is in your code but I am pretty sure that calculating each pairwise similarity is not required or at least not the best way to go ahead and reach that goal, so please share more about the context you need this method in.
After going through the answers and this other related thread Efficient way for Computing the Similarity of Multiple Documents using Spacy, I managed to get a significant speed-up.
I am now using the following code:
nlp = spacy.load(en_core_web_md, exclude=["tagger", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"])
processed_docs_1 = nlp.pipe(texts_1)
processed_docs_2 = nlp.pipe(texts_2)
for _ in range(len(texts_1)):
doc_1 = next(processed_docs_1)
doc_2 = next(processed_docs_2)
s = doc_1.similarity(doc_2)
where texts_1 and texts_2 are of the same length consisting of the pairs to compare (e.g. texts_1[i] with texts_2[i]).
Adding the "exclude" in spacy load resulted in ~ 2x speed up.
Using nlp.pipe as opposed to calling nlp inside the loop resulted in a ~10x speed up. So combined, I obtain ~20x speed up.
I'm trying to compute the text similarity of a search term, A, like "How to make chickens" against a collection of other search terms. To compute similarity I'm using the cosine distance and TF-IDF to transform A into a vector. I'd like to compare the similarity of A against all documents at once.
Currently, my approach involves computing the cosine similarity for A against every other document one at a time, iteratively. I have 100 documents I'm comparing against. If the result of cos_sim(A, X) > 0.8 then I break and say "cool, this is similar".
However, I feel like this might not be a true representation of the overall similarity. Is there a way to pre-compute a vector(s) for my 100 documents at runtime, and every time I see a new search query A, I can compare against this pre-defined vector/document?
I believe I can achieve this by simply combining all documents into one... feels rough though. What are the pros and & cons, and possible solutions? Extra points for efficiency!
This problem is essentially the traditional search problem: Have you tried putting your documents into something like Lucene (Java) or Whoosh (python)? I think they have a cosine-similarity model (but even if they don't, the default may be better).
The general trick all search engines use is that in general, documents are sparse. This means to compute the similarity (e.g., cosine similarity) it only matters what the lengths of the documents are (known way ahead of time) and the terms that they both contain; you can organize a data structure like a back-of-the-book index, called an inverted index that can quickly tell you which documents will get at least a non-zero score.
With only 100 documents, a search engine is probably overkill; you want to pre-compute the TF-IDF vectors and keep them in a numpy matrix. You can then use numpy operations to compute the dot product all at once for all the documents -- it will output a 1x100 vector of the numerators you need. The denominators can similarly be precomputed. A numpy.max(numpy.dot(query, docs)/denom) will then probably be fast enough.
You should profile your code, but I would bet your vector extraction is the slow part; but you should only have to do that once for all queries.
If you had thousands or millions of documents to compare against, you could look into SciKit learn's K-nearest-neighbor structures (e.g., Ball Tree or KDTree, or things like Facebook's FAISS library.
Svenstrup et. al. 2017 propose an interesting way to handle hash collisions in hashing vectorizers: Use 2 different hashing functions, and concatenate their results before modeling.
They claim that the combination of multiple hash functions approximates a single hash function with much larger range (see section 4 of the paper).
I'd like to try this out with some text data I'm working with in sklearn. The idea would be to run the HashingVectorizer twice, with a different hash function each time, and then concatenate the results as an input to my model.
How might I do with with sklearn? There's not an option to change the hash function used, but maybe could modify the vectorizer somehow?
Or maybe there's a way I could achieve this with SparseRandomProjection ?
HashingVectorizer in scikit-learn already includes a mechanism to mitigate hash collisions with alternate_sign=True option. This adds a random sign during token summation which improves the preservation of distances in the hashed space (see scikit-learn#7513 for more details).
By using N hash functions and concatenating the output, one would increase both n_features and the number of non null terms (nnz) in the resulting sparse matrix by N. In other words each token will now be represented as N elements. This is quite wastful memory wise. In addition, since the run time for sparse array computations is directly dependent on nnz (and less so on n_features) this will have a much larger negative performance impact than only increasing n_features. I'm not sure that such approach is very useful in practice.
If you nevertheless want to implement such vectorizer, below are a few comments.
because FeatureHasher is implemented in Cython, it is difficult to modify its functionality from Python without editing/re-compiling the code.
writing a quick pure-python implemnteation of HashingVectorizer could be one way to do it.
otherwise, there is a somewhat experimental re-implementation of HashingVectorizer in the text-vectorize package. Because it is written in Rust (with Python binding), other hash functions are easily accessible and can potentially be added.
I have a corpus which has around 8 million news articles, I need to get the TFIDF representation of them as a sparse matrix. I have been able to do that using scikit-learn for relatively lower number of samples, but I believe it can't be used for such a huge dataset as it loads the input matrix into memory first and that's an expensive process.
Does anyone know, what would be the best way to extract out the TFIDF vectors for large datasets?
Gensim has an efficient tf-idf model and does not need to have everything in memory at once.
Your corpus simply needs to be an iterable, so it does not need to have the whole corpus in memory at a time.
The make_wiki script runs over Wikipedia in about 50m on a laptop according to the comments.
I believe you can use a HashingVectorizer to get a smallish csr_matrix out of your text data and then use a TfidfTransformer on that. Storing a sparse matrix of 8M rows and several tens of thousands of columns isn't such a big deal. Another option would be not to use TF-IDF at all- it could be the case that your system works reasonably well without it.
In practice you may have to subsample your data set- sometimes a system will do just as well by just learning from 10% of all available data. This is an empirical question, there is not way to tell in advance what strategy would be best for your task. I wouldn't worry about scaling to 8M document until I am convinced I need them (i.e. until I have seen a learning curve showing a clear upwards trend).
Below is something I was working on this morning as an example. You can see the performance of the system tends to improve as I add more documents, but it is already at a stage where it seems to make little difference. Given how long it takes to train, I don't think training it on 500 files is worth my time.
I solve that problem using sklearn and pandas.
Iterate in your dataset once using pandas iterator and create a set of all words, after that use it in CountVectorizer vocabulary. With that the Count Vectorizer will generate a list of sparse matrix all of them with the same shape. Now is just use vstack to group them. The sparse matrix resulted have the same information (but the words in another order) as CountVectorizer object and fitted with all your data.
That solution is not the best if you consider the time complexity but is good for memory complexity. I use that in a dataset with 20GB +,
I wrote a python code (NOT THE COMPLETE SOLUTION) that show the properties, write a generator or use pandas chunks for iterate in your dataset.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack
# each string is a sample
text_test = [
'good people beauty wrong',
'wrong smile people wrong',
'idea beauty good good',
]
# scikit-learn basic usage
vectorizer = CountVectorizer()
result1 = vectorizer.fit_transform(text_test)
print(vectorizer.inverse_transform(result1))
print(f"First approach:\n {result1}")
# Another solution is
vocabulary = set()
for text in text_test:
for word in text.split():
vocabulary.add(word)
vectorizer = CountVectorizer(vocabulary=vocabulary)
outputs = []
for text in text_test: # use a generator
outputs.append(vectorizer.fit_transform([text]))
result2 = vstack(outputs)
print(vectorizer.inverse_transform(result2))
print(f"Second approach:\n {result2}")
Finally, use TfidfTransformer.
The lengths of the documents The number of terms in common Whether the terms are common or unusual How many times each term appears
I currently want to calculate all-pair document similarity using cosine similarity and Tfidf features in python. My basic approach is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
#c = [doc1, doc2, ..., docn]
vec = TfidfVectorizer()
X = vec.fit_transform(c)
del vec
Y = X * X.T
Works perfectly fine, but unfortunately, not for my very large datasets. X has a dimension of (350363, 2526183) and hence, the output matrix Y should have (350363, 350363). X is very sparse due to the tfidf features, and hence, easily fits into memory (around 2GB only). Yet, the multiplication gives me a memory error after running for some time (even though the memory is not full but I suppose that scipy is so clever as to expect the memory usage).
I have already tried to play around with the dtypes without any success. I have also made sure that numpy and scipy have their BLAS libraries linked -- whereas this does not have an effect on the csr_matrix dot functionality as it is implemented in C. I thought of maybe using things like memmap, but I am not sure about that.
Does anyone have an idea of how to best approach this?
Even though X is sparse, X * X.T probably won't, notice, that it just needs one nonzero common element in a given pair of rows. You are working with NLP task, so I am pretty sure that there are huge amounts of words which occur in nearly all documents (and as said before - it does not have to be one word for all pairs, but one (possibly different) for each pair. As a result you get a matrix of 350363^2 elements which has about 122,000,000,000 elements, if you don't have 200GB of ram, it does not look computable. Try to perform much more aggresive filtering of words in order to force X * X.T to be sparse (remove many common words)
In general you won't be able to compute Gram matrix of big data, unless you enforce the sparsity of the X * X.T, so most of your vectors' pairs (documents) have 0 "similarity". It can be done in numerous ways, the easiest way is to set some threshold T under which you treat <a,b> as 0 and compute the dot product by yourself, and create an entry in the resulting sparse matrix iff <a,b> > T
You may want to look at the random_projection module in scikit-learn. The Johnson-Lindenstrauss lemma says that a random projection matrix is guaranteed to preserve pairwise distances up to some tolerance eta, which is a hyperparameter when you calculate the number of random projections needed.
To cut a long story short, the scikit-learn class SparseRandomProjection seen here is a transformer to do this for you. If you run it on X after vec.fit_transform you should see a fairly large reduction in feature size.
The formula from sklearn.random_projection.johnson_lindenstrauss_min_dim shows that to preserve up to a 10% tolerance, you only need johnson_lindenstrauss_min_dim(350363, .1) 10942 features. This is an upper bound, so you may be able to get away with much less. Even 1% tolerance would only need johnson_lindenstrauss_min_dim(350363, .01) 1028192 features which is still significantly less than you have right now.
EDIT:
Simple thing to try - if your data is dtype='float64', try using 'float32'. That alone can save a massive amount of space, especially if you do not need double precision.
If the issue is that you cannot store the "final matrix" in memory either, I would recommend working with the data in an HDF5Store (as seen in pandas using pytables). This link has some good starter code, and you could iteratively calculate chunks of your dot product and write to disk. I have been using this extensively in a recent project on a 45GB dataset, and could provide more help if you decide to go this route.
What you could do is slice a row and a column of X, multiply those and save the resulting row to a file. Then move to the next row and column.
It is still the same amount of calculation work but you wouldn't run out of memory.
Using multiprocessing.Pool.map() or multiprocessing.Pool.map_async() you migt be able to speed it up, provided you use numpy.memmap() to read the matrix in the mapped function. And you would probably have to write each of the calculated rows to a separate file to merge them later. If you were to return the row from the mapped function it would have to be transferred back to the original process. That would take a lot of memory and IPC bandwidth.