I have around 10k docs (mostly 1-2 sentences) and, for each of them, I want to find the ten most similar docs in a collection of 60k docs. For this I want to use the spaCy library. Due to the large number of docs this needs to be efficient, so my first idea was to compute the document vector (https://spacy.io/api/doc#vector) for each of the 60k docs as well as for the 10k docs and save them in two matrices. These two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:
Is this actually the most efficient way, or is there a clever trick that can speed up this process?
If there is no other clever way, I was wondering whether there is at least a clever way to speed up the process of computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
    doc = nlp(train_list[i])  # train_list contains the individual documents
    doc_matrix[i] = doc.vector
Is there for example a way to parallelize this?
Don't do a big matrix operation; instead, put your document vectors in an approximate nearest neighbors store (annoy is easy to use) and query the nearest items for each vector.
Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
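A minimal sketch with annoy, assuming the two vector matrices from the question have already been computed (the variable names here are illustrative):

from annoy import AnnoyIndex

dim = 300
index = AnnoyIndex(dim, 'angular')  # angular distance ~ cosine similarity

# collection_matrix: (60000, 300) vectors of the collection to search in
for i, vec in enumerate(collection_matrix):
    index.add_item(i, vec)
index.build(50)  # number of trees; more trees -> better recall, slower build

# query_matrix: (10000, 300) vectors of the docs to look up
top10 = [index.get_nns_by_vector(vec, 10) for vec in query_matrix]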
That said, also check the spaCy speed FAQ.
I have personally never worked with sentence similarity/vectors in spaCy directly, so I can't tell you for sure about your first question; there might be some clever way to do this which is more native to spaCy / closer to the usual way to do it.
For generally speeding up the SpaCy processing:
Disable components you don't need, such as Named Entity Recognition, Part-of-Speech Tagging, etc.
Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop. Then access the results with for doc in processed_docs: or doc = next(processed_docs) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware; see the documentation and the sketch below.
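For example, a rough sketch of the loop from the question rewritten with these two points (the batch size is just an illustrative value; en_core_web_lg ships static word vectors, so doc.vector still works with these components disabled):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg', disable=['tagger', 'parser', 'ner'])

doc_matrix = np.zeros((len(train_list), 300))
# nlp.pipe() streams the documents and batches them internally
for i, doc in enumerate(nlp.pipe(train_list, batch_size=256)):
    doc_matrix[i] = doc.vector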
For your actual "find the n most similar" problem:
This problem is not NLP- or spaCy-specific but a general one. There are a lot of sources online on how to optimize this for numpy vectors; you are basically looking for the n nearest datapoints within a large dataset (10000) of high-dimensional (300) data. Check out this thread for some general ideas or this thread for how to perform this kind of search (in this case k-nearest-neighbour search) on numpy data.
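As a concrete example of such a k-nearest-neighbour search, here is a sketch with scikit-learn (the matrix names are illustrative and assumed to be precomputed):

from sklearn.neighbors import NearestNeighbors

# collection_vectors: (60000, 300), query_vectors: (10000, 300)
nn = NearestNeighbors(n_neighbors=10, metric='cosine', algorithm='brute')
nn.fit(collection_vectors)

# indices[i] holds the positions of the 10 closest collection docs for query i
distances, indices = nn.kneighbors(query_vectors)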
Generally, you should also not forget that in a large dataset (unless it is filtered) there are going to be documents/sentences which are duplicates or near-duplicates (differing only by a comma or so), so you might want to apply some filtering before performing the search.
Svenstrup et al. (2017) propose an interesting way to handle hash collisions in hashing vectorizers: use two different hash functions and concatenate their results before modeling.
They claim that the combination of multiple hash functions approximates a single hash function with much larger range (see section 4 of the paper).
I'd like to try this out with some text data I'm working with in sklearn. The idea would be to run the HashingVectorizer twice, with a different hash function each time, and then concatenate the results as an input to my model.
How might I do this with sklearn? There's no option to change the hash function used, but maybe I could modify the vectorizer somehow?
Or maybe there's a way I could achieve this with SparseRandomProjection?
HashingVectorizer in scikit-learn already includes a mechanism to mitigate hash collisions, the alternate_sign=True option. This adds a random sign during token summation, which improves the preservation of distances in the hashed space (see scikit-learn#7513 for more details).
By using N hash functions and concatenating the output, one would increase both n_features and the number of non-zero terms (nnz) in the resulting sparse matrix by a factor of N. In other words, each token will now be represented by N elements. This is quite wasteful memory-wise. In addition, since the run time of sparse array computations depends directly on nnz (and less so on n_features), this will have a much larger negative performance impact than increasing n_features alone. I'm not sure that such an approach is very useful in practice.
If you nevertheless want to implement such vectorizer, below are a few comments.
Because FeatureHasher is implemented in Cython, it is difficult to modify its functionality from Python without editing/re-compiling the code.
Writing a quick pure-Python implementation of HashingVectorizer could be one way to do it.
Otherwise, there is a somewhat experimental re-implementation of HashingVectorizer in the text-vectorize package. Because it is written in Rust (with Python bindings), other hash functions are easily accessible and can potentially be added.
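If you just want to experiment with the concatenation idea without touching the Cython code, one crude workaround (a sketch, not an endorsement) is to fake a second hash function by salting the tokens before hashing, then stacking the two outputs:

import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

texts = ["the quick brown fox", "jumps over the lazy dog"]

# First "hash function": the stock vectorizer.
v1 = HashingVectorizer(n_features=2**18)

# Second "hash function": salt every token so the hasher sees different strings.
def salted_tokenizer(text):
    return [tok + "#salt" for tok in text.split()]

v2 = HashingVectorizer(n_features=2**18, tokenizer=salted_tokenizer)

# Concatenate: each sample now has 2 * 2**18 hashed features.
X = sp.hstack([v1.transform(texts), v2.transform(texts)])

As noted above, this doubles nnz, so it is mainly useful for experiments rather than production.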
I need to find the diameter of a point cloud (the two points with the maximum distance between them) in 3-dimensional space. As a temporary solution, right now I'm just iterating through all possible pairs and comparing the distances between them, which is a very slow O(n^2) solution.
I believe it can be done in O(n log n). It's a fairly easy task in 2D (just find the convex hull and then apply the rotating calipers algorithm), but in 3D I can't imagine how to use rotating calipers, since there is no way to order the points.
Is there any simple way to do it (or ready-to-use implementation in python or C/C++)?
PS: There are similar questions on StackOverflow, but the answers that I found only refer to rotating calipers (or similar) algorithms, which work fine in 2D, but it is not really clear how to implement them in 3D (or higher dimensions).
While O(n log n) expected-time algorithms exist in 3D, they seem tricky to implement (while staying competitive with brute-force O(n^2) algorithms).
An algorithm is described in Har-Peled 2001. The authors provide source code that can optionally be used for optimal computation. I was not able to download the latest version; the "old" version may be enough for your purpose, or you might want to contact the authors for the code.
An alternative approach is presented in Malandain & Boissonnat 2002, and the authors provide code. Although this algorithm is presented as approximate in higher dimensions, it could fit your purpose. Note that their code also provides an implementation of Har-Peled's method for exact computation that you might want to check.
In any case, in real-world usage you should always check that your algorithm remains competitive with the naïve O(n^2) approach.
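For reference, a pragmatic middle ground (a sketch, assuming SciPy is available): the two farthest points are always vertices of the convex hull, so the quadratic search can be restricted to the hull vertices, which are usually far fewer than n:

import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist, squareform

def diameter(points):
    # The diameter endpoints lie on the convex hull, so only check hull vertices.
    hull_pts = points[ConvexHull(points).vertices]
    d = squareform(pdist(hull_pts))  # pairwise distances among hull vertices
    i, j = np.unravel_index(d.argmax(), d.shape)
    return hull_pts[i], hull_pts[j], d[i, j]

p, q, dist = diameter(np.random.rand(100_000, 3))

This is still O(h^2) in the number of hull vertices h, so it does not replace the algorithms from the papers above, but it often wins in practice on random or blob-like point clouds.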
I am interested in performing k-means clustering on a list of words, with the distance measure being Levenshtein.
1) I know there are a lot of frameworks out there, including scipy and orange, that have a k-means implementation. However, they all require some sort of vector as the data, which doesn't really fit my case.
2) I need a good clustering implementation. I looked at python-clustering and realized that it a) doesn't return the sum of all the distances to each centroid, and b) doesn't have any sort of iteration limit or cut-off which ensures the quality of the clustering. python-clustering and the clustering algorithm on daniweb don't really work for me.
Can someone find me a good lib? Google hasn't been my friend
Yeah, I think there isn't a good implementation for what I need.
I have some crazy requirements, like distance caching etc.
So I think I will just write my own lib and release it as GPLv3 soon.
Not really an answer to your specific question, but I recommend glancing at "Programming Collective Intelligence". At the end of each chapter, e.g., clustering, it wanders off into describing all the best reading on the subject.
Maybe have a look at Weka. It is a Java library with some unsupervised learning implementations and nice visualization tools. It has been a while since I used it; I am not sure if it is great for a real production environment, but it is definitely a good starting point.
What about this very nice answer on CrossValidated?
It uses Affinity Propagation instead of k-means, and in that case you can give a precomputed distance matrix as input. I do not think any k-means-based approach could work in your case, since it is based on building a centroid, and in order to do that you have to be in a vector space.
Affinity Propagation has the bonus that it selects the number of clusters automatically, which you can tweak (to get more or fewer clusters) by altering the preference (which by default is the median of all pairwise distances, but you can choose other percentiles).
If you need to specify the exact number of clusters, besides tweaking Affinity Propagation by trial and error, you could look for an implementation of k-medoids (apparently there is no implementation of it in sklearn, but people have asked for it here and there). K-medoids does not build centroids, so it does not need the concept of a vector space. An implementation might therefore accept a precomputed distance matrix as input (I haven't checked the references I give, though).
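A small sketch of the Affinity Propagation route with scikit-learn and a precomputed Levenshtein matrix (this assumes the python-Levenshtein package; note that Affinity Propagation expects similarities, so the negated distances are passed):

import numpy as np
import Levenshtein  # python-Levenshtein package
from sklearn.cluster import AffinityPropagation

words = ["apple", "appel", "ape", "banana", "bananna", "cherry"]

# Pairwise edit distances, negated to turn them into similarities.
dist = np.array([[Levenshtein.distance(a, b) for b in words] for a in words])

ap = AffinityPropagation(affinity="precomputed").fit(-dist)
for label, word in zip(ap.labels_, words):
    print(label, word)

Raising the preference parameter yields more clusters, lowering it yields fewer.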
I have what I think is a simple machine learning question.
Here is the basic problem: I am repeatedly given a new object and a list of descriptions of the object. For example: new_object: 'bob', new_object_descriptions: ['tall','old','funny']. I then have to use some kind of machine learning to find previously handled objects that have the 10 or fewer most similar descriptions, for example past_similar_objects: ['frank','steve','joe']. Next, I have an algorithm that can directly measure whether these objects are indeed similar to bob, for example correct_objects: ['steve','joe']. The classifier is then given this feedback of successful matches as training. Then this loop repeats with a new object.
Here's the pseudo-code:
Classifier = new_classifier()
while True:
    new_object, new_object_descriptions = get_new_object_and_descriptions()
    past_similar_objects = Classifier.classify(new_object, new_object_descriptions)
    correct_objects = calc_successful_matches(new_object, past_similar_objects)
    Classifier.train_successful_matches(new_object, correct_objects)
But, there are some stipulations that may limit what classifier can be used:
There will be millions of objects put into this classifier so classification and training needs to scale well to millions of object types and still be fast. I believe this disqualifies something like a spam classifier that is optimal for just two types: spam or not spam. (Update: I could probably narrow this to thousands of objects instead of millions, if that is a problem.)
Again, I prefer speed when millions of objects are being classified, over accuracy.
Update: The classifier should return the 10 (or fewer) most similar objects, based on feedback from past training. Without this limit, an obvious cheat would be for the classifier to just return all past objects :)
What are decent, fast machine learning algorithms for this purpose?
Note: The calc_successful_matches distance metric is extremely expensive to calculate and that's why I'm using a fast machine learning algorithm to try to guess which objects will be close before I actually do the expensive calculation.
An algorithm that seems to meet your requirements (and is perhaps similar to what John the Statistician is suggesting) is Semantic Hashing. The basic idea is that it trains a deep belief network (a type of neural network that some have called 'neural networks 2.0' and that is a very active area of research right now) to hash the list of descriptions of an object into a binary number, such that the Hamming distance between the numbers corresponds to the similarity of the objects. Since this only requires bitwise operations it can be pretty fast, and since you can use it to build a nearest-neighbor-style algorithm it naturally generalizes to a very large number of classes. This is very good state-of-the-art stuff. Downside: it's not trivial to understand and implement, and requires some parameter tuning. The author provides some Matlab code here. A somewhat easier algorithm to implement, and one closely related to this, is Locality-Sensitive Hashing.
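To make the hashing idea concrete, here is a tiny sketch of random-hyperplane LSH (the simpler relative mentioned above, not Semantic Hashing itself):

import numpy as np

rng = np.random.default_rng(0)
n_bits, n_features = 32, 300

# k random hyperplanes; keeping only the sign of each projection gives a
# k-bit code, and similar vectors tend to agree on most bits.
hyperplanes = rng.standard_normal((n_bits, n_features))

def signature(vec):
    return (hyperplanes @ vec > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))  # cheap comparison between codes

v1, v2 = rng.standard_normal(n_features), rng.standard_normal(n_features)
print(hamming(signature(v1), signature(v2)))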
Now that you say you have an expensive distance function you want to approximate quickly, I'm reminded of another very interesting algorithm that does this, Boostmap. This one uses boosting to create a fast metric which approximates an expensive-to-calculate metric. In a certain sense it's similar to the above idea, but the algorithms used are different. The authors of this paper have several papers on related techniques, all of pretty good quality (published in top conferences), that you might want to check out.
Do you really need a machine learning algorithm for this? What is your metric for similarity? You've mentioned the dimensionality of the number of objects, but what about the size of the trait set for each person? Is there a maximum number of trait types? I might try something like this:
1) Have a dictionary, called map, mapping each trait to a list of names:
for each person p:
    for each trait t in p:
        map[t].add(p)
2) Then, when I want to find the closest person, I'd take my dictionary and create a new temp one:
cnt = dictionary mapping name to count
for each trait t in my person of interest:
    for each person p in map[t]:
        cnt[p]++
Then the entry with the highest count is the closest.
The benefit here is that the map is only created once. If the number of traits per person is small and the number of available trait types is large, the algorithm should be fast.
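A runnable version of both steps (the sample data is made up):

from collections import defaultdict

# Hypothetical data: person -> traits
traits = {
    'frank': ['young', 'short', 'funny'],
    'steve': ['tall', 'old', 'grumpy'],
    'joe':   ['tall', 'old'],
}

# Step 1: build the inverted index once (trait -> people who have it).
trait_map = defaultdict(set)
for person, ts in traits.items():
    for t in ts:
        trait_map[t].add(person)

# Step 2: count shared traits per candidate and keep the top hits.
def closest(query_traits, k=10):
    cnt = defaultdict(int)
    for t in query_traits:
        for person in trait_map[t]:
            cnt[person] += 1
    return sorted(cnt, key=cnt.get, reverse=True)[:k]

print(closest(['tall', 'old', 'funny']))  # ['steve', 'joe', 'frank'] (ties may swap)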
You could use the vector space model (http://en.wikipedia.org/wiki/Vector_space_model). I think what you are trying to learn is how to weight terms in considering how close two object description vectors are to each other, say for example in terms of a simplified mutual information. This could be very efficient as you could hash from terms to vectors, which means you wouldn't have to compare objects without shared features. The naive model would then have an adjustable weight per term (this could either be per term per vector, per term overall, or both), as well as a threshold. The vector space model is a widely used technique (for example, in Apache Lucene, which you might be able to use for this problem), so you'll be able to find out a lot about it through further searches.
Let me give a very simple formulation of this in terms of your example. Given bob: ['tall','old','funny'], I retrieve
frank: ['young','short','funny']
steve: ['tall','old','grumpy']
joe: ['tall','old']
as I am maintaining a hash from funny->{frank,...}, tall->{steve, joe,...}, and old->{steve, joe,...}
I calculate something like the overall mutual information: weight of shared tags/weight of bob's tags. If that weight is over the threshold, I include them in the list.
When training, if I make a mistake I modify the shared tags. If my error was including frank, I reduce the weight for funny, while if I make a mistake by not including Steve or Joe, I increase the weight for tall and old.
You can make this as sophisticated as you'd like, for example by including weights for conjunctions of terms.
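A compact sketch of this scheme (the sample data, threshold, and learning rate are all made up):

from collections import defaultdict

tags = {
    'frank': ['young', 'short', 'funny'],
    'steve': ['tall', 'old', 'grumpy'],
    'joe':   ['tall', 'old'],
}
weights = defaultdict(lambda: 1.0)  # one adjustable weight per tag
THRESHOLD = 0.3

def score(query_tags, person):
    # weight of shared tags / weight of the query's tags
    shared = set(query_tags) & set(tags[person])
    total = sum(weights[t] for t in query_tags)
    return sum(weights[t] for t in shared) / total if total else 0.0

def retrieve(query_tags):
    return {p for p in tags if score(query_tags, p) >= THRESHOLD}

def train(query_tags, predicted, correct, lr=0.1):
    for p in predicted - correct:  # wrongly included: shrink its shared tags
        for t in set(query_tags) & set(tags[p]):
            weights[t] -= lr
    for p in correct - predicted:  # wrongly excluded: boost its shared tags
        for t in set(query_tags) & set(tags[p]):
            weights[t] += lr

bob = ['tall', 'old', 'funny']
predicted = retrieve(bob)                        # {'frank', 'steve', 'joe'}
train(bob, predicted, correct={'steve', 'joe'})  # lowers the weight of 'funny'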
SVM is pretty fast. LIBSVM for Python, in particular, provides a very decent implementation of Support Vector Machine for classification.
This project departs from typical classification applications in two notable ways:
Rather than outputting the class which the new object is thought to belong to (or possibly outputting an array of these classes, each with probability / confidence level), the "classifier" provides a list of "neighbors" which are "close enough" to the new object.
With each new classification, an objective function, independent from the classifier, provides the list of the correct "neighbors"; in turn, the corrected list (a subset of the list provided by the classifier?) is then used to train the classifier.
The idea behind the second point is probably that future objects submitted to the classifier that are similar to the current object should get better "classified" (be associated with a more correct set of previously seen objects), since the ongoing training reinforces connections to positive (correct) matches while weakening the connections to objects which the classifier initially got wrong.
These two characteristics introduce distinct problems.
- The fact that the output is a list of objects rather than a "prototype" (or category identifier of sorts) makes it difficult to scale as the number of objects seen so far grows toward the millions of instances suggested in the question.
- The fact that the training is done on the basis of a subset of the matches found by the classifier, may introduce over-fitting, whereby the classifier could become "blind" to features (dimensions) which it, accidentally, didn't weight as important/relevant, in the early parts of the training. (I may be assuming too much with regards to the objective function in charge of producing the list of "correct" objects)
Possibly, the scaling concern could be handled by having a two-step process, with a first classifier based on the K-Means algorithm or something similar, which would produce a subset of the overall object collection (of objects previously seen) as plausible matches for the current object (effectively filtering out, say, 70% or more of the collection). These possible matches would then be evaluated on the basis of the Vector Space Model (particularly relevant if the feature dimensions are based on factors rather than values) or some other model. The underlying assumption for this two-step process is that the object collection will effectively expose clusters (it may instead just be relatively evenly distributed along the various dimensions).
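As a rough illustration of this two-step idea (purely a sketch, with random stand-in vectors and arbitrary parameters):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for the feature vectors of previously seen objects.
past_vectors = np.random.rand(100_000, 50)

# Step 1: coarse partitioning of the collection.
kmeans = MiniBatchKMeans(n_clusters=100).fit(past_vectors)
labels = kmeans.labels_

def candidates(new_vector, k=10):
    # Keep only objects in the same cluster as the new object...
    c = kmeans.predict(new_vector.reshape(1, -1))[0]
    idx = np.where(labels == c)[0]
    # ...then rank them with a finer score (a plain dot product here).
    sims = past_vectors[idx] @ new_vector
    return idx[np.argsort(-sims)][:k]

print(candidates(np.random.rand(50)))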
Another way to further limit the number of candidates to evaluate, as the set of previously seen objects grows, is to remove near-duplicates and to compare only with one of these (but to supply the full duplicate list in the result, assuming that if the new object is close to the "representative" of this near-duplicate class, all members of the class would also match).
The issue of over-fitting is trickier to handle. A possible approach would be to [sometimes] randomly add objects to the matching list which the classifier would not normally include. The extra objects could be added on the basis of their distance relative to the new object (i.e. making it a bit more probable that a relatively close object is added).
What you describe is somewhat similar to the Locally Weighted Learning algorithm, which, given a query instance, trains a model locally on the neighboring instances, weighted by their distances to the query instance.
Weka (Java) has an implementation of this in weka.classifiers.lazy.LWL