GridSeachCV optimization slowing down

GridSeachCV optimization slowing down - python

I am doing an SVR and SVC optimizations through GridSearchCV with parallization n_jobs=-1 which is 8 in my case, my question is why do the first fits run very fast compared with the last fits? As in the photo 10212 fits took 23.7 sec but the full number of fits 106764 needed 20.7 min, which should be 4.2 minutes only if a linear extrapolation is assumed.
here is a sample of the code:
opt = GridSearchCV(SVR(tol=tol),param_grid=param_grid,scoring=scoring,n_jobs=n_jobs,cv=cv,verbose=verbose)
opt.fit(allr_sets_nor[:,:2],allr_sets_nor[:,2])
and this is the screen log:

Support-Vector-Machine learning is highly dependent on the parameters given.
Parameters like C have consequences on the number of support-vectors and therefore instances with many support-vectors (indirectly controlled by C) are trained much more slowly.
This is a basic caveat of GridSearches.
(Another slightly more complete take on this here by user lejlot)
The learning-algorithms are also based on heuristics, which add some additional hardly-predictable factor to this.

Related

Continue a RandomizedSearchCV fit

In order to tune some machine learning's (or even pipeline's) hyperparameters, sklearn proposes the exhaustive "GridsearchCV" and the randomized "RandomizedSearchCV". The latter samples the provided distributions and test them out, to finally select the best model (and provide the result of each tentative).
But let's say I train 1'000 models using this randomized method. Later, I decide this isn't precise enough, and want to try 1'000 more models. Can I resume the training? Aka, asking to sample more, and try more models without losing current progress. Calling fit() a second time "restarts" and discards previous hyperparameters combinations.
My situation looks like the following:
pipeline_cv = RandomizedSearchCV(pipeline, distribution, n_iter=1000, n_jobs=-1)
pipeline_cv = pipeline_cv.fit(trainX, trainy)
predictions = pipeline_cv.predict(targetX)
Then, later, I'd decide that 1000 iterations are not enough to cover my distributions' space, so I would do something like
pipeline_cv = pipeline_cv.resume(trainX, trainy, n_iter=1000) # doesn't exist?
And then I'd have a model trained across 2'000 hyperparameters combinations.
Is my goal achievable?

There is a Github issue on that back from Sep 2017, but it is still open:
In practice it is useful to search over some parameter space and then continue the search over some related space. We could provide a warm_start parameter to make it easy to accumulate results for further candidates into cv_results_ (without reevaluating parameter combinations that have already tested).
And a similar question in Cross Validated is also effectively unanswered.
So, the answer would seem to be no (plus that the scikit-learn community has not felt the need to include such a possibility).
But let's stop for a moment to think if something like that would be really valuable...
RandomizedSearchCV essentially works by random sampling parameter values from a given distribution; e.g., using the example from the docs:
distributions = dict(C=uniform(loc=0, scale=4),
penalty=['l2', 'l1'])
According to the very basic principles of such random sampling and random number generation (RNG) in general, there is not any guarantee that such a randomly sampled value will not be randomly sampled more than one time, especially if the number of iterations is large. Factor in the fact that RandomizedSearchCV does not do any bookkeeping itself either, hence in principle it can happen that same parameter combinations will be tried more than once in any single run (again, provided that the number of iterations is sufficiently large).
Even in cases of continuous distributions (like the uniform one used above), where the probability of getting exact values already sampled may be very small, there is the routine case of two samples being like 0.678918 and 0.678919, which, however close, they are still different, and count as different trials.
Given the above, I cannot see how "warm starting" a RandomizedSearchCV will be of any practical use. The real value of RandomizedSearchCV lies at the possibility of sampling a usually large area of parameter values - so large that we consider useful to unleash the power of simple random sampling, which, let me repeat, does not itself "remember" past samples returned, and it may very well return samples that are (exactly or approximately) equal to what has been already returned in the past, thus rendering any "warm start" practically irrelevant.
So effectively, simply running two (or more) RandomizedSearchCV processes sequentially (and storing their results) does the job adequately, provided that we do not use the same random seed for different runs (i.e. what is effectively suggested in the Cross Validated thread mentioned above).

NEAT algorithm result precision

I am a PhD student who is trying to use the NEAT algorithm as a controller for a robot and I am having some accuracy issues with it. I am working with Python 2.7 and for it and am using two NEAT python implementations:
The NEAT which is in this GitHub repository: https://github.com/CodeReclaimers/neat-python
Searching in Google, it looks like it has been used in some projects with succed.
The multiNEAT library developed by Peter Chervenski and Shane Ryan: http://www.multineat.com/index.html.
Which appears in the "official" software web page of NEAT software catalog.
While testing the first one, I've found that my program converges quickly to a solution, but this solution is not precise enough. As lack of precision I want to say a deviation of a minimum of 3-5% in the median and average related to the "perfect" solution at the end of the evolution (Depending on the complexity of the problem, an error around 10% is normal for my solutions. Furthermore, I could said that I've "never" seen an error value under the 1% between the solution given by the NEAT and the solution that it is the correct one). I must said that I've tried a lot of different parameter combinations and configurations (this is an old problem for me).
Due to that, I tested the second library. The MultiNEAT library converges quickly and easier that the previous one. (I assume that is due to the C++ implementation instead the pure Python) I get similar results, but I still have the same problem; lack of accuracy. This second library has different configuration parameters too, and I haven't found a proper combination of them to improve the performance of the problem.
My question is:
Is it normal to have this lack of accuracy in the NEAT results? It achieves good solutions, but not good enough for controlling a robot arm, which is what I want to use it for.
I'll write what I am doing in case someone sees some conceptual or technical mistake in the way I set out my problem:
To simplify the problem, I'll show a very simple example: I have a very simple problem to solve, I want a NN that may calculate the following function: y = x^2 (similar results are found with y=x^3 or y = x^2 + x^3 or similar functions)
The steps that I follow to develop the program are:
"Y" are the inputs to the network and "X" the outputs. The
activation functions of the neural net are sigmoid functions.
I create a data set of "n" samples given values to "X" between the
xmin = 0.0 and the xmax = 10.0
As I am using sigmoid functions, I make a normalization of the "Y"
and "X" values:
"Y" is normalized linearly between (Ymin, Ymax) and (-2.0, 2.0) (input range of sigmoid).
"X" is normalized linearly between (Xmin, Xmax) and (0.0, 1.0) (the output range of sigmoid).
After creating the data set, I subdivide in in a train sample (70%
percent of the total amount), a validation sample and a test sample
(15% each one).
At this point, I create a population of individuals for doing
evolution. Each individual of the population is evaluated in all the
train samples. Each position is evaluated as:
eval_pos = xmax - abs(xtarget - xobtained)
And the fitness of the individual is the average value of all the train positions (I've selected the minimum too but it gives me worse performance).
After the whole evaluation, I test the best obtained individual
against the test sample. And here is where I obtained those
"un-precise values". Moreover, during the evaluation process, the
maximum value where "abs(xtarget - xobtained) = 0" is never
obtained.
Furthermore, I assume that how I manipulate the data is right because, I use the same data set for training a neural network in Keras and I get much better results than with NEAT (an error less than a 1% is achievable after 1000 epochs in a layer with 5 neurons).
At this point, I would like to know if what is happened is normal because I shouldn't use a data set of data for developing the controller, it must be learned "online" and NEAT looks like a suitable solution for my problem.
Thanks in advance.
EDITED POST:
Firstly, Thanks for comment nick.
I'll answer your questions below::
I am using the NEAT algorithm.
Yes, I've carried out experiments increasing the number of individuals in the population and the generations number. A typical graph that I get is like this:
Although the population size in this example is not such big, I've obtained similar results in experiments incrementing the number of individuals or the number of generations. Populations of 500 in individuals and 500 generations, for example. In this experiments, the he algorithm converge fast to a solution, but once there, the best solution is stucked and it does not improve any more.
As I mentioned in my previous post, I've tried several experiments with many different parameters configurations... and the graphics are more or less similar to the previous showed.
Furthermore, other two experiments that I've tried were: once the evolution reach the point where the maximum value and the median converge, I generate other population based on that genome with new configuration parameters where:
The mutation parameters change with a high probability of mutation (weight and neuron probability) in order to find new solutions with the aim to "jumping" from the current genome to other better.
The neuron mutation is reduced to 0, while the weight "mutation probability" increase for "mutate weight" in a lower range in order to get slightly modifications with the aim to get a better adjustment of the weights. (trying to get a "similar" functionality as backprop. making slighty changes in the weights)
This two experiments didn't work as I expected and the best genome of the population was also the same of the previous population.
I am sorry, but I do not understand very well what do you want to say with "applying your own weighted penalties and rewards in your fitness function". What do you mean with including weight penalities in the fitness function?
Regards!

Disclaimer: I have contributed to these libraries.
Have you tried increasing the population size to speed up the search and increasing the number of generations? I use it for a trading task, and by increasing the population size my champions were found much sooner.
Another thing to think about is applying your own weighted penalties and rewards in your fitness function, so that anything that doesn't get very close right away is "killed off" sooner and the correct genome is found faster. It should be noted that neat uses a fitness function to learn as a opposed to gradient descent so it wont converge in the same way and its possible you may have to train a bit longer.
Last question, are you using the neat or hyperneat algo from multineat?

Optimizer Tuning in sklearn Gaussian Process Regressor

I'm attempting to use a GaussianProcessRegressor as part of scikit-learn 0.18.1
I'm training on 200 data points and using 13 input features for my kernel - one constant multiplied by a radial basis function with twelve elements. The model runs without complaints, but if I run the same script several times I notice that I sometimes get different solutions. It may be worth noting that several of the optimized parameters are running into the bounds I've provided them (I'm currently working out which features matter).
I've tried increasing the parameter n_restarts_optimizer to 50, and while this takes considerably longer to run it doesn't eliminate the element of apparent randomness. It seems possible to change the optimizer itself, though I've had no luck. From a quick scan it seems the most similar syntactically are scipy's fmin_tnc and fmin_slsqp (other optimizers do not include bounds). However, using either of those cause other issues: for example, fmin_tnc does not return the value of the objective function at its minimum.
Are there any suggestions for how to have a more deterministic script? Ideally I'd like it to print the same values regardless of iteration, because as it stands it feels a bit like a lottery (and therefore drawing any conclusions is questionable).
A snippet of the code I'm using:
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
lbound = 1e-2
rbound = 1e1
n_restarts = 50
n_features = 12 # Actually determined elsewhere in the code
kernel = C(1.0, (lbound,rbound)) * RBF(n_features*[10], (lbound,rbound))
gp = GPR(kernel=kernel, n_restarts_optimizer=n_restarts)
gp.fit(train_input, train_outputs)
test_model, sigma2_pred = gp.predict(test_input, return_std=True)
print gp.kernel_

This uses random values to initialize optimization:
As the LML may have multiple local optima, the optimizer can be
started repeatedly by specifying n_restarts_optimizer.
As far as I understand, there will always be a random factor. Sometimes it will find local minima, which are the bounds you've mentioned.
If your data allows it (invertible X matrix) You can use normal equation if it suits your needs, no random factor there.
You can do (something like a random forest) sampling on top of this, where you run this algorithm several times and choose the best fit or the common value: You will have to weigh consistency versus accuracy.
Hope I understood your question correctly.

pymc3 : Multiple observed values

I have some observational data for which I would like to estimate parameters, and I thought it would be a good opportunity to try out PYMC3.
My data is structured as a series of records. Each record contains a pair of observations that relate to a fixed one hour period. One observation is the total number of events that occur during the given hour. The other observation is the number of successes within that time period. So, for example, a data point might specify that in a given 1 hour period, there were 1000 events in total, and that of those 1000, 100 were successes. In another time period, there may be 1000000 events in total, of which 120000 are successes. The variance of the observations is not constant and depends on the total number of events, and it is partly this effect that I would like to control and model.
My first step in doing this to estimate the underlying success rate. I've prepared the code below which is intended to mimic the situation by providing two sets of 'observed' data by generating it using scipy. However, it does not work properly.
What I expect it to find is:
loss_lambda_factor is roughly 0.1
total_lambda (and total_lambda_mu) is roughly 120.
Instead, the model converges very quickly, but to an unexpected answer.
total_lambda and total_lambda_mu are respectively sharp peaks around 5e5.
loss_lambda_factor is roughly 0.
The traceplot (which I cannot post due to reputation below 10) is fairly uninteresting - quick convergence, and sharp peaks at numbers that do not correspond to the input data. I am curious as to whether there is something fundamentally wrong with the approach I am taking. How should the following code be modified to give the correct / expected result?
from pymc import Model, Uniform, Normal, Poisson, Metropolis, traceplot
from pymc import sample
import scipy.stats
totalRates = scipy.stats.norm(loc=120, scale=20).rvs(size=10000)
totalCounts = scipy.stats.poisson.rvs(mu=totalRates)
successRate = 0.1*totalRates
successCounts = scipy.stats.poisson.rvs(mu=successRate)
with Model() as success_model:
total_lambda_tau= Uniform('total_lambda_tau', lower=0, upper=100000)
total_lambda_mu = Uniform('total_lambda_mu', lower=0, upper=1000000)
total_lambda = Normal('total_lambda', mu=total_lambda_mu, tau=total_lambda_tau)
total = Poisson('total', mu=total_lambda, observed=totalCounts)
loss_lambda_factor = Uniform('loss_lambda_factor', lower=0, upper=1)
success_rate = Poisson('success_rate', mu=total_lambda*loss_lambda_factor, observed=successCounts)
with success_model:
step = Metropolis()
success_samples = sample(20000, step) #, start)
plt.figure(figsize=(10, 10))
_ = traceplot(success_samples)

There is nothing fundamentally wrong with your approach, except for the pitfalls of any Bayesian MCMC analysis: (1) non-convergence, (2) the priors, (3) the model.
Non-convergence: I find a traceplot that looks like this:
This is not a good thing, and to see more clearly why, I would change the traceplot code to show only the second-half of the trace, traceplot(success_samples[10000:]):
The Prior: One major challenge for convergence is your prior on total_lambda_tau, which is a exemplar pitfall in Bayesian modeling. Although it might appear quite uninformative to use prior Uniform('total_lambda_tau', lower=0, upper=100000), the effect of this is to say that you are quite certain that total_lambda_tau is large. For example, the probability that it is less than 10 is .0001. Changing the prior to
total_lambda_tau= Uniform('total_lambda_tau', lower=0, upper=100)
total_lambda_mu = Uniform('total_lambda_mu', lower=0, upper=1000)
results in a traceplot that is more promising:
This is still not what I look for in a traceplot, however, and to get something more satisfactory, I suggest using a "sequential scan Metropolis" step (which is what PyMC2 would default to for an analogous model). You can specify this as follows:
step = pm.CompoundStep([pm.Metropolis([total_lambda_mu]),
pm.Metropolis([total_lambda_tau]),
pm.Metropolis([total_lambda]),
pm.Metropolis([loss_lambda_factor]),
])
This produces a traceplot that seems acceptible:
The model: As #KaiLondenberg responded, the approach you have taken with priors on total_lambda_tau and total_lambda_mu is not a standard approach. You describe widely varying event totals (1,000 one hour and 1,000,000 the next) but your model posits it to be normally distributed. In spatial epidemiology, the approach I have seen for analogous data is a model more like this:
import pymc as pm, theano.tensor as T
with Model() as success_model:
loss_lambda_rate = pm.Flat('loss_lambda_rate')
error = Poisson('error', mu=totalCounts*T.exp(loss_lambda_rate),
observed=successCounts)
I'm sure that there are other ways that will seem more familiar in other research communities as well.
Here is a notebook collecting up these comments.

I see several potential problems with the model.
1.) I would think that the success counts (called error ?) should follow a Binomial(n=total,p=loss_lambda_factor) distribution, not a Poisson.
2.)
Where does the chain start ? Starting at a MAP or MLE configuration would make sense unless you use pure Gibbs sampling. Otherwise the chain might take a long time for burn-in, which might be what's happening here.
3.) Your choice of a hierarchical prior for total_lambda (i.e. normal with two uniform priors on those parameters) ensures that the chain will take a long time to converge, unless you pick your start wisely (as in Point 2.). You essentially introduce a lot of unneccessary degrees of freedom for the MCMC chain to get lost in. Given that total_lambda has to be nonngative, I would choose a Uniform prior for total_lambda in a suitable range (from 0 to the observed maximum for example).
4.) You use the Metropolis Sampler. 20000 samples might not be enough for that one. Try 60000 and discard the first 20000 as burn-in. The Metropolis Sampler can take a while to tune the step size, so it might well be that it spent the first 20000 samples to mainly reject proposals and tune. Try other samplers, such as NUTS.

large scale clustering library possibly with python bindings

I've been trying to cluster some larger dataset. consisting of 50000 measurement vectors with dimension 7. I'm trying to generate about 30 to 300 clusters for further processing.
I've been trying the following clustering implementations with no luck:
Pycluster.kcluster (gives only 1-2 non-empty clusters on my dataset)
scipy.cluster.hierarchy.fclusterdata (runs too long)
scipy.cluster.vq.kmeans (runs out of memory)
sklearn.cluster.hierarchical.Ward (runs too long)
Are there any other implementations which I might miss?

50000 instances and 7 dimensions isn't really big, and should not kill an implementation.
Although it doesn't have python binding, give ELKI a try. The benchmark set they use on their homepage is 110250 instances in 8 dimensions, and they run k-means on it in 60 seconds apparently, and the much more advanced OPTICS in 350 seconds.
Avoid hierarchical clustering. It's really only for small data sets. The way it is commonly implemented on matrix operations is O(n^3), which is really bad for large data sets. So I'm not surprised these two timed out for you.
DBSCAN and OPTICS when implemented with index support are O(n log n). When implemented naively, they are in O(n^2). K-means is really fast, but often the results are not satisfactory (because it always splits in the middle). It should run in O(n * k * iter) which usually converges in not too many iterations (iter<<100). But it will only work with Euclidean distance, and just doesn't work well with some data (high-dimensional, discrete, binary, clusters with different sizes, ...)

Since you're already trying scikit-learn: sklearn.cluster.KMeans should scale better than Ward and supports parallel fitting on multicore machines. MiniBatchKMeans is better still, but won't do random restarts for you.
>>> from sklearn.cluster import MiniBatchKMeans
>>> X = np.random.randn(50000, 7)
>>> %timeit MiniBatchKMeans(30).fit(X)
1 loops, best of 3: 114 ms per loop

My package milk handles this problem easily:
import milk
import numpy as np
data = np.random.rand(50000,7)
%timeit milk.kmeans(data, 300)
1 loops, best of 3: 14.3 s per loop
I wonder whether you meant to write 500,000 data points, because 50k points is not that much. If so, milk takes a while longer (~700 sec), but still handles it well as it does not allocate any memory other than your data and the centroids.

The real answer for actually large scale situations is to use something like FAISS, Facebook Research's library for efficient similarity search and clustering of dense vectors.
See
https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization

OpenCV has a k-means implementation, Kmeans2
Expected running time is on the order of O(n**4) - for an order-of-magnitude approximation, see how long it takes to cluster 1000 points, then multiply that by seven million (50**4 rounded up).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.