I have the following code which takes in 2 sentences and return the similarity:
nlp = spacy.load("en_core_web_md/en_core_web_md-3.2.0")
def get_categories_nlp_sim(cat_1, cat_2):
if (cat_1 != cat_1) or (cat_2 != cat_2):
s = np.nan
else:
doc1 = nlp(cat_1)
doc2 = nlp(cat_2)
s = doc1.similarity(doc2)
return s
So, this seems to give reasonable results but when using it in a for loop of ~1M rows, it just becomes too slow to use.
Any ideas on how to speed this up? or perhaps another NLP library that could do the same thing faster?
Thanks!
If you truly have 1m rows and compare each of them as pairs you would have an astronomical number of comparisons. SpaCys nlp() does a whole lot other than just the stuff needed for the similarity.
What SpaCys similarity() does is use the processed documents vector and calculate a cosine similarity (document vector = average over word vectors), check out source code.
So the probably most efficient possibly way for you to replicate a similarity for this many pairs would be to get a semantic token representation vector for each unique token in the entire corpus using something like Gensims pretrained word2vec model, then for each row calculate the average of the vectors of the tokens in it and then once you have those 1m document vectors as numpy arrays you calculate the cosine similarities using numpy or scipy which is drastically faster than pure Python.
Also check out this thread which is a similar question to yours: Efficient way for Computing the Similarity of Multiple Documents using Spacy
I'm not sure what exactly your main goal is in your code but I am pretty sure that calculating each pairwise similarity is not required or at least not the best way to go ahead and reach that goal, so please share more about the context you need this method in.
After going through the answers and this other related thread Efficient way for Computing the Similarity of Multiple Documents using Spacy, I managed to get a significant speed-up.
I am now using the following code:
nlp = spacy.load(en_core_web_md, exclude=["tagger", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"])
processed_docs_1 = nlp.pipe(texts_1)
processed_docs_2 = nlp.pipe(texts_2)
for _ in range(len(texts_1)):
doc_1 = next(processed_docs_1)
doc_2 = next(processed_docs_2)
s = doc_1.similarity(doc_2)
where texts_1 and texts_2 are of the same length consisting of the pairs to compare (e.g. texts_1[i] with texts_2[i]).
Adding the "exclude" in spacy load resulted in ~ 2x speed up.
Using nlp.pipe as opposed to calling nlp inside the loop resulted in a ~10x speed up. So combined, I obtain ~20x speed up.
Related
I have around 10k docs (mostly 1-2 sentences) and want for each of these docs find the ten most simliar docs of a collection of 60k docs. Therefore, I want to use the spacy library. Due to the large amount of docs this needs to be efficient, so my first idea was to compute both for each of the 60k docs as well as the 10k docs the document vector (https://spacy.io/api/doc#vector) and save them in two matrices. This two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:
Is this actually the most efficient way or is there a clever trick that can speed up this process
If there is no other clever way, I was wondering whether there is at least a clever way to speed up the process of computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import spacy
nlp = spacy.load('en_core_web_lg')
doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
doc = nlp(train_list[i]) #the train list contains the single documents
doc_matrix[i] = doc.vector
Is there for example a way to parallelize this?
Don't do a big matrix operation, instead put your document vectors in an approximate nearest neighbors store (annoy is easy to use) and query the nearest items for each vector.
Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
That said, also check the spaCy speed FAQ.
I personally never worked with sentence similarity/vectors in SpaCy directly, so I can't tell you for sure about your first question, there might be some clever way to do this which is more native to SpaCy/the usual way to do it.
For generally speeding up the SpaCy processing:
Disable components you don't need such as Named Entity Recognition, Part of Speech Tagging etc.
Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop. Then access with for doc in processed_docs: or doc = next(processed_docs) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware, see the documentation.
For your actual "find the n most similar" problem:
This problem is not NLP- or SpaCy-specific but a general problem. There are a lot of sources on how to optimize this for numpy vectors online, you are basically looking for the n nearest datapoints within a large dataset (10000) of high dimensional (300) data. Check out this thread for some general ideas or this thread to for how to perform this kind of search (in this case K-nearest neighbours search) on numpy data.
Generally you should also not forget that in a large dataset (unless filtered) there are going to be documents/sentences which are duplicates or nearly duplicates (only differ by comma or so), so you might want to apply some filtering before performing the search.
I'm trying to compute the text similarity of a search term, A, like "How to make chickens" against a collection of other search terms. To compute similarity I'm using the cosine distance and TF-IDF to transform A into a vector. I'd like to compare the similarity of A against all documents at once.
Currently, my approach involves computing the cosine similarity for A against every other document one at a time, iteratively. I have 100 documents I'm comparing against. If the result of cos_sim(A, X) > 0.8 then I break and say "cool, this is similar".
However, I feel like this might not be a true representation of the overall similarity. Is there a way to pre-compute a vector(s) for my 100 documents at runtime, and every time I see a new search query A, I can compare against this pre-defined vector/document?
I believe I can achieve this by simply combining all documents into one... feels rough though. What are the pros and & cons, and possible solutions? Extra points for efficiency!
This problem is essentially the traditional search problem: Have you tried putting your documents into something like Lucene (Java) or Whoosh (python)? I think they have a cosine-similarity model (but even if they don't, the default may be better).
The general trick all search engines use is that in general, documents are sparse. This means to compute the similarity (e.g., cosine similarity) it only matters what the lengths of the documents are (known way ahead of time) and the terms that they both contain; you can organize a data structure like a back-of-the-book index, called an inverted index that can quickly tell you which documents will get at least a non-zero score.
With only 100 documents, a search engine is probably overkill; you want to pre-compute the TF-IDF vectors and keep them in a numpy matrix. You can then use numpy operations to compute the dot product all at once for all the documents -- it will output a 1x100 vector of the numerators you need. The denominators can similarly be precomputed. A numpy.max(numpy.dot(query, docs)/denom) will then probably be fast enough.
You should profile your code, but I would bet your vector extraction is the slow part; but you should only have to do that once for all queries.
If you had thousands or millions of documents to compare against, you could look into SciKit learn's K-nearest-neighbor structures (e.g., Ball Tree or KDTree, or things like Facebook's FAISS library.
I'm using word2vec on a 1 million abstracts dataset (2 billion words). To find most similar documents, I use the gensim.similarities.WmdSimilarity class. When trying to retrieve the best match using wmd_similarity_index[query], the calculation spends most of its time building a dictionary. Here is a piece of log:
2017-08-25 09:45:39,441 : INFO : built Dictionary(127 unique tokens: ['empirical', 'model', 'estimating', 'vertical', 'concentration']...) from 2 documents (total 175 corpus positions)
2017-08-25 09:45:39,445 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
What does this part ? Is it dependent on the query ? Is there a way to do these calculations once for all ?
EDIT: training and scoring phases in my code:
Training and saving to disk:
w2v_size = 300
word2vec = gensim.models.Word2Vec(texts, size=w2v_size, window=9, min_count=5, workers=1, sg=1, hs=1, iter=20) # sg=1 means skip gram is used
word2vec.save(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity(texts, word2vec.wv)
corpus_w2v_wmd_index.save(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
Loading and scoring:
w2v = gensim.models.Word2Vec.load(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
words = [t for t in proc_text if t in w2v.wv]
corpus_w2v_wmd_index = gensim.similarities.docsim.Similarity.load(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
scores_w2v = np.array(corpus_w2v_wmd_index[words])
The "Word Mover's Distance" calculation is relatively expensive – for each pairwise document comparison, it searches for an optimal 'shifting' of semantic positions, and that shifting is itself dependent on the pairwise simple-distances between all words of each compared document.
That is, it involves far more calculation than a simple cosine-distance between two high-dimensional vectors, and it involves more calculation the longer the two documents are.
There isn't much that could be pre-calculated, from the texts corpus, until the query's words are known. (Each pairwise calculation depends on the query's words, and their simple-distances to each corpus document's words.)
That said, there are some optimizations the gensim WmdSimilarity class doesn't yet do.
The original WMD paper described a quicker calculation that could help eliminate corpus texts that couldn't possibly be in the top-N most-WMD-similar results. Theoretically, the gensim WmdSimilarity could also implement this optimization, and give quicker results, at least when initializing the WmdSimilarity with the num_best parameter. (Without it, every query returns all WMD-similarity-scores, so this optimization wouldn't help.)
Also, for now the WmdSimilarity class just calls KeyedVectors.wmdistance(doc1, doc2) for every query-to-corpus-document pair, as raw texts. Thus the pairwise simple-distances from all doc1 words to doc2 words will be recalculated each time, even if many pairs repeat across the corpus. (That is, if 'apple' is in the query and 'orange' is in every corpus doc, it will still calculate the 'apple'-to-'orange' distance repeatedly.)
So, some caching of interim values might help performance. For example, with a query of 1000 words, and a vocabulary of 100,000 words among all corpus documents, the ((1000 * 100,000) / 2) 50 million pairwise word-distances could be precalculated once, using 200MB, then shared by all subsequent WMD-calculations. To add this optimization would require a cooperative refactoring of WmdSimilarity.get_similarities() and KeyedVectors.wmdistance().
Finally, Word2Vec/Doc2Vec applications don't necessarily require or benefit much from stop-word removal or stemming. But because the expense of WMD calculation grows with document and vocabulary size, anything that shrinks effective document sizes could help performance. So various ways of discarding low-value words, or coalescing similar words, may be worth considering when using WMD on large document sets.
This is my first post on SO so I hope I'm not committing any posting crimes just yet ;-). This is verbose because part of what I am trying to do is to validate my process and ensure I understand how this is done without screwing up majorly. I will sum up my questions here:
How can I have a stress value in the 50's from an MDS? I thought it should be between 0 and 1.
Running a clustering function on coordinates obtained through MDS is a big no-no? I ask because my results do not change significantly doing so but it could just be because of my data
I want to validate my k value for the number of clusters using an "elbow" method. How can I compute this knowing that I rely on linkage() and fcluster() to plot a number of clusters against an error value? Any help on methods or calls to access that data or the data I need to compute it would be greatly appreciated.
I am working on a document classification scheme using python 3.4 for a pet projet I have where I want to feed a corpus of several thousand texts and classify them using hierarchical clustering. I also would like to use MDS to graphically represent the cluster structures (I will also use a dendrogram but want to give this a shot).
Anyway, first thing I want to do is validate my procedure to make sure I understand how this works. This is done using NLTK and scikit-learn. My objective is not to call one procedure in scikit-learn that would do everything. Rather, I want to compute my similarity matrix (using a procedure in NLTK for example) and then feed that into a clustering function, using the precomputed parameter in some of the methods I rely on.
So my steps are currently as follows:
Load corpus
Clean up corpus items: remove stop words and
unwanted chars (numerical values and other text that is not relevant
to my objective); use lemmatization (WordNet)
the end result is a matrix with n documents and m terms
Compute the similarity between documents: for each document, compute cosine
similarity against the matrix of terms.
To do that, I use TfidfVectorizer
Note: I am a python newbie so I may not do things in a pythonic way. I apologize in advance...
vectorizer = TfidfVectorizer(tokenizer = tokenize, preprocessor = preprocess)
sparse_matrix = vectorizer.fit_transform(term_dict.values())
The tokenizer and preprocessor methods are dummy methods I had to add so that it would not try and tokenize etc. my dictionary which was previously built.
The cosine similarity matrix is built using:
i = 0
return_matrix = [[0 for x in range(len(document_terms_list))] for x in range(len(document_terms_list))]
for index in enumerate(document_terms_list):
if (i < len(document_terms_list)):
similarity = cosine_similarity(sparse_matrix[i:i+1], sparse_matrix)
M = coo_matrix(similarity)
for k, j, v in zip(M.row, M.col, M.data):
return_matrix[i][j] = v
i += 1
So for 100 documents, return_matrix is basically 100 x 100 with each cell having a similarity between Doc_x and Doc_y.
My next step is to perform the clustering (I want to use complete using scipy's hierarchical clustering).
To reduce dimensionality and be able to visualize results, I first perform an MDS on the data:
mds = manifold.MDS(n_components = 2, dissimilarity = "precomputed", verbose = 1)
results = mds.fit(return_matrix)
coordinates = results.embedding_
My problem arises here: calling mds.stress_ reports a value of about 53. I was under the understanding that my stress value should be somewhere between 0 and 1. Ahem, needless to say that I am speechless with this... This would be my first question. When I print the similarity matrix etc. everything looks relatively good...
To build the clusters, I am currently passing in coordinates to the linkage() and fcluster() functions, i.e. I am passing in the MDS'ed version of my similarity matrix. Now, I wonder if this could be an issue although the results look ok when I look at the clusters assigned to my data. But conceptually, I am not sure this makes sense.
In trying to determine an ideal number of clusters, I want to use an "elbow" method, plotting the variance explained against the number of clusters to have an "ideal" cutoff. I am not sure I see this anywhere in the scikit-learn docs and tutorials. I see places where people do it in R etc. but when I use hierarchical clustering, how can I achieve this? I just don't know where to get the data from the API and what data I am looking for exactly.
Many thanks in advance. I apologize for the length of this post but I figured giving out some context might help.
Cheers,
Greg
I have a corpus which has around 8 million news articles, I need to get the TFIDF representation of them as a sparse matrix. I have been able to do that using scikit-learn for relatively lower number of samples, but I believe it can't be used for such a huge dataset as it loads the input matrix into memory first and that's an expensive process.
Does anyone know, what would be the best way to extract out the TFIDF vectors for large datasets?
Gensim has an efficient tf-idf model and does not need to have everything in memory at once.
Your corpus simply needs to be an iterable, so it does not need to have the whole corpus in memory at a time.
The make_wiki script runs over Wikipedia in about 50m on a laptop according to the comments.
I believe you can use a HashingVectorizer to get a smallish csr_matrix out of your text data and then use a TfidfTransformer on that. Storing a sparse matrix of 8M rows and several tens of thousands of columns isn't such a big deal. Another option would be not to use TF-IDF at all- it could be the case that your system works reasonably well without it.
In practice you may have to subsample your data set- sometimes a system will do just as well by just learning from 10% of all available data. This is an empirical question, there is not way to tell in advance what strategy would be best for your task. I wouldn't worry about scaling to 8M document until I am convinced I need them (i.e. until I have seen a learning curve showing a clear upwards trend).
Below is something I was working on this morning as an example. You can see the performance of the system tends to improve as I add more documents, but it is already at a stage where it seems to make little difference. Given how long it takes to train, I don't think training it on 500 files is worth my time.
I solve that problem using sklearn and pandas.
Iterate in your dataset once using pandas iterator and create a set of all words, after that use it in CountVectorizer vocabulary. With that the Count Vectorizer will generate a list of sparse matrix all of them with the same shape. Now is just use vstack to group them. The sparse matrix resulted have the same information (but the words in another order) as CountVectorizer object and fitted with all your data.
That solution is not the best if you consider the time complexity but is good for memory complexity. I use that in a dataset with 20GB +,
I wrote a python code (NOT THE COMPLETE SOLUTION) that show the properties, write a generator or use pandas chunks for iterate in your dataset.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack
# each string is a sample
text_test = [
'good people beauty wrong',
'wrong smile people wrong',
'idea beauty good good',
]
# scikit-learn basic usage
vectorizer = CountVectorizer()
result1 = vectorizer.fit_transform(text_test)
print(vectorizer.inverse_transform(result1))
print(f"First approach:\n {result1}")
# Another solution is
vocabulary = set()
for text in text_test:
for word in text.split():
vocabulary.add(word)
vectorizer = CountVectorizer(vocabulary=vocabulary)
outputs = []
for text in text_test: # use a generator
outputs.append(vectorizer.fit_transform([text]))
result2 = vstack(outputs)
print(vectorizer.inverse_transform(result2))
print(f"Second approach:\n {result2}")
Finally, use TfidfTransformer.
The lengths of the documents The number of terms in common Whether the terms are common or unusual How many times each term appears