Do sklearn algorithms internally work with double precision?

Do sklearn algorithms internally work with double precision? - python

I am using sklearn for machine learning purposes. If I found out correctly, the float type in python works with double precision. Does sklearn work with the same precision internally? I pass data to sklearn in lists/numpy arrays filled with floats (is this even relevant?).
Do I have to be worried about error propagation? I guess I do not, if double precision is used.
Just want to make sure.

sklearn does not seem to specify how it works internally regarding data types. However, it probably makes sense to assume it retains at least the precision of the input data type. So, to be on the safe side, probably specify dtype as double in your data.
In practice error propagation should not be an issue, since most algorithms are approximative in nature, and some of them rely much more on the random initial conditions than accuracy. Recently, there is even the suggestion that we should limit accuracy to save resources, since the impact is small. See for example
https://arxiv.org/pdf/1502.02551.pdf

Related

Solving a large (150 variable) system of linear, ordinary differential equations; running into floating point rounding and/or stiffness problems

EDIT: Original post too vague. I am looking for an algorithm to solve a large-system, solvable, linear IVP that can handle very small floating point values. Solving for the eigenvectors and eigenvalues is impossible with numpy.linalg.eig() as the returned values are complex and should not be, it does not support numpy.float128 either, and the matrix is not symmetric so numpy.linalg.eigh() won't work. Sympy could do it given an infinite amount of time, but after running it for 5 hours I gave up. scipy.integrate.solve_ivp() works with implicit methods (have tried Radau and BDF), but the output is wildly wrong. Are there any libraries, methods, algorithms, or solutions for working with this many, very small numbers?
Feel free to ignore the rest of this.
I have a 150x150 sparse (~500 nonzero entries of 22500) matrix representing a system of first order, linear differential equations. I'm attempting to find the eigenvalues and eigenvectors of this matrix to construct a function that serves as the analytical solution to the system so that I can just give it a time and it will give me values for each variable. I've used this method in the past for similar 40x40 matrices, and it's much (tens, in some cases hundreds of times) faster than scipy.integrate.solve_ivp() and also makes post model analysis much easier as I can find maximum values and maximum rates of change using scipy.optimize.fmin() or evaluate my function at inf to see where things settle if left long enough.
This time around, however, numpy.linalg.eig() doesn't seem to like my matrix and is giving me complex values, which I know are wrong because I'm modeling a physical system that can't have complex rates of growth or decay (or sinusoidal solutions), much less complex values for its variables. I believe this to be a stiffness or floating point rounding problem where the underlying LAPACK algorithm is unable to handle either the very small values (smallest is ~3e-14, and most nonzero values are of similar scale) or disparity between some values (largest is ~4000, but values greater than 1 only show up a handful of times).
I have seen suggestions for similar users' problems to use sympy to solve for the eigenvalues, but when it hadn't solved my matrix after 5 hours I figured it wasn't a viable solution for my large system. I've also seen suggestions to use numpy.real_if_close() to remove the imaginary portions of the complex values, but I'm not sure this is a good solution either; several eigenvalues from numpy.linalg.eig() are 0, which is a sign of error to me, but additionally almost all the real portions are of the same scale as the imaginary portions (exceedingly small), which makes me question their validity as well. My matrix is real, but unfortunately not symmetric, so numpy.linalg.eigh() is not viable either.
I'm at a point where I may just run scipy.integrate.solve_ivp() for an arbitrarily long time (a few thousand hours) which will probably take a long time to compute, and then use scipy.optimize.curve_fit() to approximate the analytical solutions I want, since I have a good idea of their forms. This isn't ideal as it makes my program much slower, and I'm also not even sure it will work with the stiffness and rounding problems I've encountered with numpy.linalg.eig(); I suspect Radau or BDF would be able to navigate the stiffness, but not the rounding.
Anybody have any ideas? Any other algorithms for finding eigenvalues that could handle this? Can numpy.linalg.eig() work with numpy.float128 instead of numpy.float64 or would even that extra precision not help?
I'm happy to provide additional details upon request. I'm open to changing languages if needed.

As mentioned in the comment chain above the best solution for this is to use a Matrix Exponential, which is a lot simpler (and apparently less error prone) than diagonalizing your system with eigenvectors and eigenvalues.
For my case I used scipy.sparse.linalg.expm() since my system is sparse. It's fast, accurate, and simple. My only complaint is the loss of evaluation at infinity, but it's easy enough to work around.

In sklearn.linear_model.Ridge, what exactly is the solverparameter doing?

In the sklearn.linear_model.Ridge method, there is a parameter, solver : {‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’}.
And according to the documentation, we should choose different parameter depending on different types' values which is dense or sparse or just use auto.
So in my opinion, we just choose a specific parameter to make calculations fast to corresponding data.
Are my thoughts right or wrong?
If not mind, could anyone give me some advice because I didn't search and get anything proving my thoughts or not?
Sincerely thanks.

You are almost right.
Some solver work only with specific type of data (dense vs. sparse) or specific type of problem (non-negative weights).
However for many cases you can use multiple solvers (e.g. for sparse problems you have at least sag, sparse_cg and lsqr). These solvers have different characteristics and some of them might work better in some cases and some of them work better in other cases. In some cases some solvers even do not converge.
In many cases, the simple engineering-like answer is to use solver which works best on your data. You just test all of them and measure time to answer.
If you want, more precise answer you should dig into documentation of referenced methods (e.g. scipy.sparse.linalg.lsqr).

Using ideas from HashEmbeddings with sklearn's HashingVectorizer

Svenstrup et. al. 2017 propose an interesting way to handle hash collisions in hashing vectorizers: Use 2 different hashing functions, and concatenate their results before modeling.
They claim that the combination of multiple hash functions approximates a single hash function with much larger range (see section 4 of the paper).
I'd like to try this out with some text data I'm working with in sklearn. The idea would be to run the HashingVectorizer twice, with a different hash function each time, and then concatenate the results as an input to my model.
How might I do with with sklearn? There's not an option to change the hash function used, but maybe could modify the vectorizer somehow?
Or maybe there's a way I could achieve this with SparseRandomProjection ?

HashingVectorizer in scikit-learn already includes a mechanism to mitigate hash collisions with alternate_sign=True option. This adds a random sign during token summation which improves the preservation of distances in the hashed space (see scikit-learn#7513 for more details).
By using N hash functions and concatenating the output, one would increase both n_features and the number of non null terms (nnz) in the resulting sparse matrix by N. In other words each token will now be represented as N elements. This is quite wastful memory wise. In addition, since the run time for sparse array computations is directly dependent on nnz (and less so on n_features) this will have a much larger negative performance impact than only increasing n_features. I'm not sure that such approach is very useful in practice.
If you nevertheless want to implement such vectorizer, below are a few comments.
because FeatureHasher is implemented in Cython, it is difficult to modify its functionality from Python without editing/re-compiling the code.
writing a quick pure-python implemnteation of HashingVectorizer could be one way to do it.
otherwise, there is a somewhat experimental re-implementation of HashingVectorizer in the text-vectorize package. Because it is written in Rust (with Python binding), other hash functions are easily accessible and can potentially be added.

Python's XGBRegressor vs R's XGBoost

I'm using python's XGBRegressor and R's xgb.train with the same parameters on the same dataset and I'm getting different predictions.
I know that XGBRegressor uses 'gbtree' and I've made the appropriate comparison in R, however, I'm still getting different results.
Can anyone lead me in the right direction on how to differentiate the 2 and/or find R's equivalence to python's XGBRegressor?
Sorry if this is a stupid question, thank you.

Since XGBoost uses decision trees under the hood it can give you slightly different results between fits if you do not fix random seed so the fitting procedure becomes deterministic.
You can do this via set.seed in R and numpy.random.seed in Python.
Noting Gregor's comment you might want to set nthread parameter to 1 to achieve full determinism.

Using 'Decimal' numbers with scipy?

I have numbers too large to use with Python's inbuilt types and as such am using the decimal library. I want to use scipy.optimise.brentq with a function that operates on 'decimals' but when the function returns decimal it obviously cannot be used with the optimisation function's float based internals. How can I get around this: How can I use scipy optimisation techniques with the Decimal class for big numbers?

You can't. Scipy heavily relies on numerical algorithms that only deal with true numerical data types, and can't deal with the decimal class.
As a general rule of thumb: If your problem is well-defined and well-conditioned (that's something that numerical mathematicians define), you can just scale it so that it fits into normal python floats, and then you can apply scipy functionality to it.
If your problem, however, involves very small numbers as well as numbers that can't fit into float, there's little you can numerically do about that problem usually: It's hard to find a good solution.
If, however, your function only returns values that would fit into float, then you could just use
lambda x: float(your_function(x))
instead of your_function in brentq.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.