Is the format/structure of SciPy's condensed distance matrix stable? - python

Several SciPy functions are documented as taking a "condensed distance matrix as returned by scipy.spatial.distance.pdist". Now, inspection shows that what pdist returns is the row-major 1-D array form of the upper off-diagonal part of the distance matrix. This is all well and good, and natural and obvious, but is it documented or defined anywhere? I'd rather not assume anything about a data structure that could suddenly change. (Granted, there aren't a lot of things it could change to, but I guess one possibility would be to wrap the array in an object that allows matrix-like indexing.)

Honestly, this is a better question for the scipy users or dev list, as it's about future plans for scipy.
However, the structure is fairly rigorously documented in the docstrings for both scipy.spatial.distance.pdist and scipy.spatial.distance.squareform.
E.g. for pdist:
Returns a condensed distance matrix Y. For each i and j (where i < j < n), the metric ``dist(u=X[i], v=X[j])`` is computed and stored in the ij-th entry.
See ``squareform`` for information on how to calculate the index of this entry or to convert the condensed distance matrix to a redundant square matrix.
Because of this, and the fact that so many other functions in scipy.spatial expect a distance matrix in this form, I seriously doubt it's going to change without a number of deprecation warnings and announcements.
Modules in scipy itself (as opposed to scipy's scikits) are fairly stable, and a great deal of consideration is put into backwards compatibility when changes are made (because of this, there's quite a bit of legacy "cruft" in scipy: e.g. the fact that the core scipy module is just numpy with different defaults on a couple of functions).
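For what it's worth, here is a minimal sketch (using a small made-up point set) of how the condensed layout relates to the square form; the index formula is the one implied by the row-major upper-triangle ordering described above:

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(5, 3)        # 5 arbitrary points in 3-D
n = X.shape[0]

y = pdist(X)                    # condensed form, length n*(n-1)//2
D = squareform(y)               # redundant n x n square form

# Condensed index of the pair (i, j) with i < j:
i, j = 1, 3
k = n * i - i * (i + 1) // 2 + (j - i - 1)
assert np.isclose(y[k], D[i, j])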

Related

lmfit/scipy.optimize minimization methods description?

Is there any place with a brief description of each of the algorithms for the method parameter in the minimize function of the lmfit package? Neither there nor in the SciPy documentation is there an explanation of the details of each algorithm. Right now I know I can choose between them, but I don't know which one to choose...
My current problem
I am using lmfit in Python to minimize a function. I want to minimize the function within a finite and predefined range where the function has the following characteristics:
It is almost zero everywhere, which makes it numerically identical to zero almost everywhere.
It has a very, very sharp peak in some point.
The peak can be anywhere within the region.
This causes many minimization algorithms to fail. Right now I am using a combination of the brute force method (method="brute") to find a point close to the peak and then feeding that value to the Nelder-Mead algorithm (method="nelder") to perform the final minimization. It works about 50 % of the time; the other 50 % it fails to find the minimum. I wonder if there are better algorithms for cases like this one...
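For reference, a minimal sketch of this two-stage approach, with a made-up sharply peaked objective (the function, bounds, and grid size are illustrative only):

import numpy as np
from lmfit import Minimizer, Parameters

# Made-up objective: numerically zero almost everywhere, with one very
# narrow dip near x0. The scalar return value is what gets minimized.
def objective(params):
    x = params["x"].value
    x0, width = 3.7, 1e-2
    return -np.exp(-((x - x0) / width) ** 2)

params = Parameters()
params.add("x", value=0.0, min=0.0, max=10.0)   # brute needs finite bounds

fitter = Minimizer(objective, params)

# Stage 1: coarse grid search to land somewhere near the peak.
coarse = fitter.minimize(method="brute", Ns=1000)

# Stage 2: refine from the best grid point with Nelder-Mead.
result = fitter.minimize(method="nelder", params=coarse.params)
print(result.params["x"].value)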
I think it is a fair point that docs for lmfit (such as https://lmfit.github.io/lmfit-py/fitting.html#fit-methods-table) and scipy.optimize (such as https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#optimization-scipy-optimize) do not give detailed mathematical descriptions of the algorithms.
Then again, most of the docs for scipy, numpy, and related libraries describe how to use the methods, but do not describe in much mathematical detail how the algorithms work.
In fairness, the different optimization algorithms share many features and the differences between them can get pretty technical. All of these methods try to minimize some metric (often called "cost" or "residual") by changing the values of parameters for the supplied function.
It sort of takes a textbook (or at least a Wikipedia page) to establish the concepts and mathematical terms used for these methods, and then a paper (or at least a Wikipedia page) to describe how each method differs from the others. So I think the basic answer is to look up the different methods.

Scipy Linear algebra LinearOperator function utilised in Conjugate Gradient

I am preconditioning a matrix using spilu. However, to pass this preconditioner into cg (the built-in conjugate gradient method) it is necessary to use the LinearOperator function. Can someone explain the matvec parameter to me, and why I need to use it? Below is my current code
import scipy.sparse.linalg as scla

Ainv=scla.spilu(A,drop_tol= 1e-7)
Ainv=scla.LinearOperator(Ainv.shape,matvec=Ainv)
scla.cg(A,b,maxiter=maxIterations, M = Ainv)
However, this doesn't work and I am given the error TypeError: 'SuperLU' object is not callable. I have played around and tried
Ainv=scla.LinearOperator(Ainv.shape,matvec=Ainv.solve)
instead. This seems to work, but I want to know why matvec needs Ainv.solve rather than just Ainv, and whether it is the right thing to feed to LinearOperator?
Thanks for your time
Without having much experience with this part of scipy, some comments:
According to the docs you don't have to use LinearOperator; the parameter is documented as
M : {sparse matrix, dense matrix, LinearOperator}
so you can use explicit matrices too!
The idea/advantage of the LinearOperator:
Many iterative methods (e.g. cg, gmres) do not need to know the individual entries of a matrix to solve a linear system A*x=b. Such solvers only require the computation of matrix-vector products (docs)
Depending on the task, sometimes even matrix-free approaches are available, which can be much more efficient
The working approach you presented is indeed the correct one (some other sources do it similarly, and some course materials do it like that); a complete sketch follows after these comments
The point of using solve() here, rather than an explicit inverse matrix, is to avoid forming the inverse explicitly (which can be very costly)
A similar idea is very common in BFGS-based optimization algorithms although wiki might not give much insight here
scipy has an extra LinearOperator for exactly this purpose of not forming the inverse explicitly! (Although I think it's only used for statistics / finishing off some optimizations, I have successfully built some LBFGS-based optimizers with it.)
Source: a scicomp.stackexchange discussion of this idea, without touching scipy
Because of that, I would assume spilu is going for exactly this too (returning an object with a solve method)
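A self-contained sketch of that working approach (the tridiagonal test system and tolerances are made up for illustration):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as scla

# Made-up sparse symmetric positive definite test system.
n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

# spilu returns a SuperLU-like object; ilu.solve(v) applies the approximate
# inverse of A to a vector without ever forming A^-1 explicitly.
ilu = scla.spilu(A, drop_tol=1e-7)
M = scla.LinearOperator(A.shape, matvec=ilu.solve)

# cg only needs the preconditioner as a black box that maps v to M @ v,
# which is exactly what LinearOperator provides.
x, info = scla.cg(A, b, M=M, maxiter=200)
print(info)   # 0 indicates convergence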

Do numpy or scipy implement sub-cubic multiplication

I've searched quite a bit, but I've only found homegrown reimplementations of Strassen matrix multiplication.
Wikipedia says that numpy uses BLAS (which includes high-performance implementations of sub-cubic matrix multiplication algorithms, e.g. Strassen's method), but I couldn't find out whether numpy matrix multiplication has some sort of size check and then chooses either naive $O(n^3)$ multiplication or some more sophisticated approach based on the size of the matrix (this is similar to choosing the leaf size or recursion limit in Strassen sub-calls).
I also tried just plotting the log of the runtime vs. the log of the problem size, but looking for a subtle change in slope is non-trivial (because of cache effects, etc. as the problems get larger).
Since the documentation for numpy matrix doesn't mention Strassen (or any alternative sub-cubic algorithm) or the runtime, and since the numpy source in question is low-level C that in turn calls the BLAS library for performance, it isn't easy to tell from the source code, so I thought I would ask:
Does anyone know about the algorithm or big-oh runtime of a numpy.matrix(...) * numpy.matrix(...) call?
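For what it's worth, a rough way to estimate the empirical exponent from timings (square float64 matrices, sizes chosen arbitrarily; BLAS threading and cache effects will add noise to the estimate):

import time
import numpy as np

sizes = [256, 512, 1024, 2048]
times = []
for n in sizes:
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    t0 = time.perf_counter()
    _ = a @ b
    times.append(time.perf_counter() - t0)

# Slope of log(time) vs log(n); a plain O(n^3) GEMM should sit near 3.
slope = np.polyfit(np.log(sizes), np.log(times), 1)[0]
print("empirical exponent ~", round(slope, 2))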

NaN/inf values in scikit-learn manifold learning functions

I have a manifold learning / non-linear dimensionality reduction problem where I know distances between objects up to some threshold, and beyond that I just know that the distance is "far". Also, in some cases some of the distances might be missing. I am trying to use sklearn.manifold to find a 1-D representation. A natural encoding would be to represent "far" distances as inf and missing distances as nan.
However, it seems that currently scikit-learn does not support nan and inf values in distance matrices given to manifold learning functions in sklearn.manifold, since I get ValueError: Array contains NaN or infinity.
Is there a conceptual reason for this? Some methods seem to be especially suitable for inf, e.g. non-metric MDS. Also I know that some implementations of these methods in other languages are able to handle missing/inf values.
Instead of using inf I have considered setting "far" values to a very large number, but I am not sure how this will affect the results.
Update:
I dug in the code of sklearn.manifold.MDS._smacof_single() and found a piece of code and a comment saying that "similarities with 0 are considered as missing values". Is this an undocumented way to specify missing-values? Does this work with all manifold functions?
Short answer: As you mentioned, non-metric MDS is capable of working with incomplete dissimilarity matrices. You are right: setting values to zero will cause them to be interpreted as missing values when using MDS(metric=False). It won't work for other manifold learning procedures that are not based on non-metric MDS, but there might be similar (undocumented) approaches available.
On your question concerning the replacement of inf by high values:
Replacing inf by high values will certainly shape your low-dimensional representation. Whether this is valid is a conceptual question that can only be answered knowing the origin of the inf values. If the inf entries mean something like "these data are reeaaaalllyyyy distant from each other", replacement by high values can make sense (as in your case). If they rather reflect missing knowledge about the dissimilarity, I would not recommend that replacement. If there is no other solution (like non-metric MDS or matrix completion), then I would rather replace such entries with the median of the measurable distances (check out imputation).
Check out my answer to a similar question from 2017.
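A minimal sketch of the zero-as-missing convention with non-metric MDS (the dissimilarity matrix is made up, and the convention is internal to _smacof_single rather than documented API, so treat it with care):

import numpy as np
from sklearn.manifold import MDS

# Made-up precomputed dissimilarity matrix; a 0 off the diagonal marks an
# unknown distance (the undocumented convention discussed above).
D = np.array([
    [0.0, 1.0, 2.0, 0.0],   # distance (0, 3) is unknown, encoded as 0
    [1.0, 0.0, 1.5, 3.0],
    [2.0, 1.5, 0.0, 1.2],
    [0.0, 3.0, 1.2, 0.0],
])

mds = MDS(n_components=1, metric=False, dissimilarity="precomputed",
          random_state=0)
embedding = mds.fit_transform(D)
print(embedding.ravel())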

scipy.interpolate.griddata equivalent in CUDA

I'm trying to perform Fitted Value Iteration (FVI) in python (involving approximating a 5 dimensional function using piecewise linear interpolation).
scipy.interpolate.griddata works perfectly for this. However, I need to call the interpolation routine several thousand times (since FVI is a MC based algorithm).
So basically, the set of points where the function is known is static (and large, say 32k), but the points I need to approximate (which are small perturbations of the original set) are very numerous (say 32k x 5000).
Is there an implementation of what scipy.interpolate.griddata does that's been ported to CUDA?
Alternatively, is there a way to speed up the calculation somehow?
Thanks.
For piecewise linear interpolation, the docs say that scipy.interpolate.griddata uses the methods of scipy.interpolate.LinearNDInterpolator, which in turn uses qhull to compute a Delaunay tessellation of the input points and then performs standard barycentric interpolation: for each query point you determine which simplex (hyper-tetrahedron) it lies inside, then use its barycentric coordinates as the interpolation weights for that simplex's node values.
The tessellation is probably hard to parallelize, but you can access the CPU version with scipy.spatial.Delaunay. The other two steps are easily parallelized, although I don't know of any freely available implementation.
If your known-function points are on a regular grid, the method described here is especially easy to implement in CUDA, and I have worked with actual implementations of it, albeit none publicly available.
So I am afraid you are going to have to do most of the work yourself...
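To make the three steps concrete, here is a CPU sketch of that pipeline using scipy.spatial.Delaunay (random made-up data; the simplex lookup and the barycentric weighting are the parts you would port to CUDA):

import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
points = rng.random((1000, 5))        # known sample locations (5-D)
values = rng.random(1000)             # known function values
queries = rng.random((500, 5))        # points to interpolate at

# Step 1: Delaunay tessellation (qhull, CPU only).
tri = Delaunay(points)

# Step 2: locate the simplex containing each query point (-1 = outside hull).
simplex = tri.find_simplex(queries)

# Step 3: barycentric coordinates from the stored affine transforms,
# then a weighted sum of the simplex vertex values.
T = tri.transform[simplex]                        # (nq, ndim+1, ndim)
r = queries - T[:, -1]
bary = np.einsum("nij,nj->ni", T[:, :-1, :], r)   # first ndim coordinates
weights = np.hstack([bary, 1.0 - bary.sum(axis=1, keepdims=True)])

result = np.einsum("ni,ni->n", values[tri.simplices[simplex]], weights)
result[simplex == -1] = np.nan                    # outside the convex hull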
