Svenstrup et al. (2017) propose an interesting way to handle hash collisions in hashing vectorizers: use two different hash functions and concatenate their results before modeling.
They claim that the combination of multiple hash functions approximates a single hash function with much larger range (see section 4 of the paper).
I'd like to try this out with some text data I'm working with in sklearn. The idea would be to run the HashingVectorizer twice, with a different hash function each time, and then concatenate the results as an input to my model.
How might I do this with sklearn? There's no option to change the hash function used, but maybe I could modify the vectorizer somehow?
Or maybe there's a way I could achieve this with SparseRandomProjection?
HashingVectorizer in scikit-learn already includes a mechanism to mitigate hash collisions: the alternate_sign=True option. This adds a random sign during token summation, which improves the preservation of distances in the hashed space (see scikit-learn#7513 for more details).
By using N hash functions and concatenating the output, one would increase both n_features and the number of non-zero terms (nnz) in the resulting sparse matrix by a factor of N. In other words, each token will now be represented as N elements. This is quite wasteful memory-wise. In addition, since the run time for sparse array computations depends directly on nnz (and less so on n_features), this will have a much larger negative performance impact than only increasing n_features. I'm not sure that such an approach is very useful in practice.
If you nevertheless want to implement such a vectorizer, below are a few comments.
Because FeatureHasher is implemented in Cython, it is difficult to modify its functionality from Python without editing and re-compiling the code.
Writing a quick pure-Python implementation of HashingVectorizer could be one way to do it.
Otherwise, there is a somewhat experimental re-implementation of HashingVectorizer in the text-vectorize package. Because it is written in Rust (with Python bindings), other hash functions are easily accessible and can potentially be added.
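A minimal pure-Python sketch of the multi-hash idea, using only the standard library. It assumes that hashlib.blake2b with a different salt per block is an acceptable stand-in for "different hash functions"; scikit-learn itself uses MurmurHash3 and the alternate_sign trick, neither of which is reproduced here. The names hash_index and multi_hash_vectorize are hypothetical, and the tokenization is a simplified approximation of the default word pattern.

```python
import hashlib
import re
from collections import Counter


def hash_index(token, salt, n_features):
    """Map a token to a column index using a salted 64-bit BLAKE2b digest."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "little") % n_features


def multi_hash_vectorize(doc, salts=(b"h1", b"h2"), n_features=1024):
    """Return a sparse row {column: count} with one block of columns per salt.

    Block k occupies columns [k * n_features, (k + 1) * n_features), which is
    equivalent to concatenating the outputs of len(salts) hashing vectorizers.
    """
    tokens = re.findall(r"\w\w+", doc.lower())  # words of 2+ characters
    row = Counter()
    for block, salt in enumerate(salts):
        offset = block * n_features
        for token in tokens:
            row[offset + hash_index(token, salt, n_features)] += 1
    return dict(row)


row = multi_hash_vectorize("the cat sat on the mat")
print(len(row), sum(row.values()))
```

Every token contributes one entry per salt, so the number of stored non-zero values grows with the number of hash functions, which is the memory cost described above.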
Related
I have around 10k docs (mostly 1-2 sentences each) and want to find, for each of them, the ten most similar docs in a collection of 60k docs. I want to use the spaCy library for this. Because of the large number of docs this needs to be efficient, so my first idea was to compute the document vector (https://spacy.io/api/doc#vector) for each of the 60k docs as well as the 10k docs and save them in two matrices. These two matrices can be multiplied to get the dot products, which can be interpreted as similarities.
Now, I have basically two questions:
Is this actually the most efficient way, or is there a clever trick that can speed up this process?
If there is no other clever way, I was wondering whether there is at least a clever way to speed up the process of computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')
# One 300-dimensional row per document
doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
    doc = nlp(train_list[i])  # train_list contains the individual documents
    doc_matrix[i] = doc.vector
Is there for example a way to parallelize this?
Don't do a big matrix operation, instead put your document vectors in an approximate nearest neighbors store (annoy is easy to use) and query the nearest items for each vector.
Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
That said, also check the spaCy speed FAQ.
I personally never worked with sentence similarity/vectors in SpaCy directly, so I can't tell you for sure about your first question, there might be some clever way to do this which is more native to SpaCy/the usual way to do it.
For generally speeding up the SpaCy processing:
Disable components you don't need such as Named Entity Recognition, Part of Speech Tagging etc.
Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop. Then access with for doc in processed_docs: or doc = next(processed_docs) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware, see the documentation.
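A small sketch of the nlp.pipe suggestion, written so that it works with any object exposing a spaCy-style pipe() method. The helper name iter_doc_vectors and the batch size of 256 are assumptions for illustration, not recommended values.

```python
def iter_doc_vectors(nlp, texts, batch_size=256):
    """Yield one document vector per text.

    nlp.pipe processes the texts in batches, which avoids the per-call
    overhead of invoking nlp(text) inside a Python loop.
    """
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield doc.vector


# Usage with spaCy (not executed here; assumes the model is installed):
#   import numpy as np
#   import spacy
#   nlp = spacy.load("en_core_web_lg", disable=["ner", "parser"])
#   doc_matrix = np.vstack(list(iter_doc_vectors(nlp, train_list)))
```

Which components can safely be disabled depends on the model and spaCy version, so it is worth checking that doc.vector is still populated after disabling them.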
For your actual "find the n most similar" problem:
This problem is not NLP- or spaCy-specific but a general one. There are a lot of sources online on how to optimize this for numpy vectors; you are basically looking for the n nearest datapoints within a large dataset (10000) of high-dimensional (300) data. Check out this thread for some general ideas or this thread for how to perform this kind of search (in this case K-nearest neighbours search) on numpy data.
Generally you should also not forget that in a large dataset (unless filtered) there are going to be documents/sentences which are duplicates or nearly duplicates (only differ by comma or so), so you might want to apply some filtering before performing the search.
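One simple way to do the filtering mentioned above, as a sketch: normalize each text (lowercase, strip punctuation, collapse whitespace) and keep only the first text for each normalized form. The function names are hypothetical, and this only catches duplicates that differ in case, punctuation, or spacing.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def drop_near_duplicates(texts):
    """Return the indices of texts to keep (first occurrence of each form)."""
    seen = set()
    kept = []
    for i, text in enumerate(texts):
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append(i)
    return kept


print(drop_near_duplicates(["Hello, world!", "hello world", "Goodbye"]))
# → [0, 2]
```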
I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, and some are categorical, in addition to the occasional missing values. It is essential that the clustering is run on all data points, and we look to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed-type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 1.6 x 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all numeric attributes into the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
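To make the first two steps concrete, here is a standard-library sketch that one-hot encodes categorical columns and standardizes numeric ones. It assumes rows are dicts with no missing values; the function name encode and the column names are hypothetical. At 4 million rows a dataframe or array library would be used instead, so this only illustrates the transformation itself.

```python
from statistics import mean, pstdev


def encode(rows, numeric, categorical):
    """Return (feature_names, matrix) with z-scored numeric columns and
    one-hot encoded categorical columns."""
    # Per-column mean and standard deviation (fall back to 1.0 if constant)
    stats = {}
    for col in numeric:
        values = [row[col] for row in rows]
        stats[col] = (mean(values), pstdev(values) or 1.0)
    # Sorted distinct levels give a stable column order for the one-hot part
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}

    names = list(numeric) + [
        f"{col}={level}" for col in categorical for level in levels[col]
    ]
    matrix = []
    for row in rows:
        out = [(row[col] - stats[col][0]) / stats[col][1] for col in numeric]
        for col in categorical:
            out.extend(1.0 if row[col] == level else 0.0 for level in levels[col])
        matrix.append(out)
    return names, matrix


names, matrix = encode(
    [{"x": 1.0, "c": "a"}, {"x": 3.0, "c": "b"}], numeric=["x"], categorical=["c"]
)
print(names)   # → ['x', 'c=a', 'c=b']
print(matrix)  # → [[-1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
```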
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you, since even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way down to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around to figure it out) because it runs in Java. The output of ELKI is a little strange: it writes one file per cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.
Suppose I have multi-dimensional datasets, which have many vectors as data. I am writing an algorithm which needs to do k nearest neighbour searches for all those vectors - classical KNN. However, during my algorithm I add new vectors to the overall dataset and need to include those new vectors into my KNN search. I want to do that efficiently. I looked into KD tree and ball tree of scikit-learn, but they don't allow inserts (by the nature of the concepts). I am not sure whether SR tree or R tree would provide inserts, but in any case, I was not able to find a python implementation for data beyond 3D.
Regarding the search, I am fine with either the query "give me the closest vector" (so 1-NN) or "give me all vectors that are closer than a given radius".
General comment: I don't quite understand why KD-Trees are so popular for high-dimensional kNN queries. In my experience, other trees scale much better with high dimensionality or large datasets (I tested up to 25 million points and (only) up to 40 dimensions). Some more details:
KD-Trees: As far as I know, KD-Trees should support insertion at any time, but there is a chance that they get imbalanced. I don't use python, so I don't know why your KD-tree does not support insertion/deletion on the fly.
Quadtree: Depending on the dimensionality, you could also use quadtrees/octrees, but standard implementations are not good for more than 10 dimensions or so. In the reference above I tested a quadtree with a special 'hypercube' navigation approach. That requires a lot of memory but scales much better with dimensionality in terms of performance.
R-Tree/R*Tree: The original R-Trees are not very good with insertion on the fly. However, if you look at R+Trees, (R-Plus-Tree), they are quite fast with reinsertion and kNN queries.
PH-Trees have basically the same kNN performance as R+Trees, but much better insertion time, because PH-Trees do not need rebalancing while having inherently limited depth and node size. Unfortunately, implementations get a lot more complicated for >=64 dimensions (the tree uses one bit of a long integer for each dimension). I'm not aware of an implementation that supports more than 63 dimensions.
Python:
R+Trees should be available for Python. If not, you could adapt a normal R-Tree (only the insertion algorithm is different).
I heard once of someone starting to implement a PH-Tree in Python, but I haven't seen any open-source variant yet.
If you have some time/interest to do your own implementation, you could look at the Java implementations here and translate them to Python. The library contains various multidimensional indexes, except KD-Trees. KD-Tree implementations that allow on-the-fly insertion can be found here and here.
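As an illustration of the KD-Tree point above, here is a minimal pure-Python sketch of a KD-Tree that supports insertion at any time together with exact 1-NN queries. It does no rebalancing, so the tree can become imbalanced (and the recursive query deep) if points arrive in sorted order. The class names are hypothetical and this is not taken from any of the linked implementations.

```python
import math


class KDNode:
    __slots__ = ("point", "axis", "left", "right")

    def __init__(self, point, axis):
        self.point = point
        self.axis = axis  # dimension this node splits on
        self.left = None
        self.right = None


class KDTree:
    def __init__(self, dim):
        self.dim = dim
        self.root = None

    def insert(self, point):
        """Walk down from the root and attach the point as a new leaf."""
        if self.root is None:
            self.root = KDNode(point, 0)
            return
        node = self.root
        while True:
            side = "left" if point[node.axis] < node.point[node.axis] else "right"
            child = getattr(node, side)
            if child is None:
                next_axis = (node.axis + 1) % self.dim
                setattr(node, side, KDNode(point, next_axis))
                return
            node = child

    def nearest(self, query):
        """Return (point, distance) of the closest stored point (Euclidean)."""
        best = [None, math.inf]

        def visit(node):
            if node is None:
                return
            dist = math.dist(query, node.point)
            if dist < best[1]:
                best[0], best[1] = node.point, dist
            diff = query[node.axis] - node.point[node.axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            visit(near)
            # Only cross the splitting plane if it is closer than the best hit
            if abs(diff) < best[1]:
                visit(far)

        visit(self.root)
        return best[0], best[1]


tree = KDTree(dim=3)
for p in [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.2, 0.1, 0.0)]:
    tree.insert(p)
print(tree.nearest((0.25, 0.1, 0.0))[0])
```

A radius query follows the same traversal: collect every point whose distance is below the radius and cross the splitting plane whenever abs(diff) is smaller than the radius.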
I am interested in the performance of Pyomo to generate an OR model with a huge number of constraints and variables (about 10e6). I am currently using GAMS to launch the optimizations but I would like to use the different python features and therefore use Pyomo to generate the model.
I ran some tests and apparently, when I write a model, the Python methods used to define the constraints are called each time a constraint is instantiated. Before going further with my implementation, I would like to know whether there is a way to directly create a block of constraints based on numpy array data. From my point of view, constructing constraints by block may be more efficient for large models.
Do you think it is possible to obtain performance comparable to GAMS or other AML languages with Pyomo or another Python modelling library?
Thanks in advance for your help!
While you can use NumPy data when creating Pyomo constraints, you cannot currently create blocks of constraints in a single NumPy-style command with Pyomo. For what it's worth, I don't believe that you can in languages like AMPL or GAMS, either. While Pyomo may eventually support users defining constraints using matrix and vector operations, it is not likely that that interface would avoid generating the individual constraints, as the solver interfaces (e.g., NL, LP, MPS files) are all "flat" representations that explicitly represent individual constraints. This is because Pyomo needs to explicitly generate representations of the algebra (i.e., the expressions) to send out to the solvers. In contrast, NumPy only has to calculate the result: it gets its efficiency by creating the data in a C/C++ backend (i.e., not in Python), relying on low-level BLAS operations to compute the results efficiently, and only bringing the result back to Python.
As far as performance and scalability go, I have generated raw models with over 13e6 variables and 21e6 constraints. That said, Pyomo was designed for flexibility and extensibility over speed. Runtimes in Pyomo can be an order of magnitude slower than AMPL when using CPython (although that can shrink to within a factor of 4 or 5 using PyPy). At least historically, AMPL has been faster than GAMS, so the gap between Pyomo and GAMS should be smaller.
I was also wondering the same when I came across this piece of code from Jonas Hörsch and Tom Brown and it was very useful to me:
https://github.com/FRESNA/PyPSA/blob/master/pypsa/opt.py
They define classes that build constraints more efficiently than the original Pyomo parser does. I did some tests on a large model that I have and it reduced the generation time considerably.
You can build big linear (LP) and mixed-integer (MILP) optimization problems in Python with the open-source tool Linopy. Linopy promises a 4-6x speedup and a memory reduction of roughly 50%, reaching roughly Julia JuMP performance. See the benchmark:
The tool is part of the PyPSA ecosystem. This tool is the next-level version of the PyPSA 'opt.py' developments that Jon Cardodo mentioned. It has roughly the same speed, same performance but better usability -- reported by developers.
I am trying to implement the hyperloglog counting algorithm using stochastic averaging. To do that, I need many independent universal hash functions to hash items in different substreams.
I found that there are only a few hash functions available in hashlib, and there seems to be no way for me to provide a seed or something? I am thinking of using different salts for different substreams.
You probably DON'T need different hash functions. A common solution to this problem is to use only part of the hash to compute the HyperLogLog rho statistic, and the other part to select the substream. If you use a good hash function (e.g. murmur3), it effectively behaves as multiple independent ones.
See the "stochastic averaging" section here for an explanation of this:
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
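The single-hash approach can be sketched as follows. This assumes a 64-bit digest from hashlib.blake2b as a standard-library stand-in for murmur3; the top p bits select the substream (register) and the remaining bits supply the rho statistic. The class name and the default p=12 are illustrative choices, and only the small-range (linear counting) correction is included.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (substreams)
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        width = 64 - self.p
        index = h >> width              # top p bits: which substream
        rest = h & ((1 << width) - 1)   # low bits: input to rho
        # rho = 1-based position of the leftmost 1-bit in the low bits
        rho = width - rest.bit_length() + 1
        if rho > self.registers[index]:
            self.registers[index] = rho

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)   # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # linear counting for small counts
        return raw


hll = HyperLogLog(p=12)
for i in range(50000):
    hll.add(f"item-{i}")
print(round(hll.estimate()))
```

If separately seeded hash functions are still wanted, hashlib.blake2b also accepts salt and key arguments, which gives a family of hash functions from one algorithm.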