In order to tune some machine learning's (or even pipeline's) hyperparameters, sklearn proposes the exhaustive "GridsearchCV" and the randomized "RandomizedSearchCV". The latter samples the provided distributions and test them out, to finally select the best model (and provide the result of each tentative).
But let's say I train 1'000 models using this randomized method. Later, I decide this isn't precise enough, and want to try 1'000 more models. Can I resume the training? Aka, asking to sample more, and try more models without losing current progress. Calling fit() a second time "restarts" and discards previous hyperparameters combinations.
My situation looks like the following:
pipeline_cv = RandomizedSearchCV(pipeline, distribution, n_iter=1000, n_jobs=-1)
pipeline_cv = pipeline_cv.fit(trainX, trainy)
predictions = pipeline_cv.predict(targetX)
Then, later, I'd decide that 1000 iterations are not enough to cover my distributions' space, so I would do something like
pipeline_cv = pipeline_cv.resume(trainX, trainy, n_iter=1000) # doesn't exist?
And then I'd have a model trained across 2'000 hyperparameters combinations.
Is my goal achievable?
There is a Github issue on that back from Sep 2017, but it is still open:
In practice it is useful to search over some parameter space and then continue the search over some related space. We could provide a warm_start parameter to make it easy to accumulate results for further candidates into cv_results_ (without reevaluating parameter combinations that have already tested).
And a similar question in Cross Validated is also effectively unanswered.
So, the answer would seem to be no (plus that the scikit-learn community has not felt the need to include such a possibility).
But let's stop for a moment to think if something like that would be really valuable...
RandomizedSearchCV essentially works by random sampling parameter values from a given distribution; e.g., using the example from the docs:
distributions = dict(C=uniform(loc=0, scale=4),
penalty=['l2', 'l1'])
According to the very basic principles of such random sampling and random number generation (RNG) in general, there is not any guarantee that such a randomly sampled value will not be randomly sampled more than one time, especially if the number of iterations is large. Factor in the fact that RandomizedSearchCV does not do any bookkeeping itself either, hence in principle it can happen that same parameter combinations will be tried more than once in any single run (again, provided that the number of iterations is sufficiently large).
Even in cases of continuous distributions (like the uniform one used above), where the probability of getting exact values already sampled may be very small, there is the routine case of two samples being like 0.678918 and 0.678919, which, however close, they are still different, and count as different trials.
Given the above, I cannot see how "warm starting" a RandomizedSearchCV will be of any practical use. The real value of RandomizedSearchCV lies at the possibility of sampling a usually large area of parameter values - so large that we consider useful to unleash the power of simple random sampling, which, let me repeat, does not itself "remember" past samples returned, and it may very well return samples that are (exactly or approximately) equal to what has been already returned in the past, thus rendering any "warm start" practically irrelevant.
So effectively, simply running two (or more) RandomizedSearchCV processes sequentially (and storing their results) does the job adequately, provided that we do not use the same random seed for different runs (i.e. what is effectively suggested in the Cross Validated thread mentioned above).
Related
I'm using Doc2Vec in gensim library, and finding similiarity between movie, with its name as input.
model = doc2vec.Doc2Vec(vector_size=100, alpha=0.025, min_alpha=0.025, window=5)
model.build_vocab(tagged_corpus_list)
model.train(tagged_corpus_list, total_examples=model.corpus_count, epochs=50)
I set parameter like this, and didn't change preprocessing mechanism of input data, didn't changed original data.
similar_doc = model.dv.most_similar(input)
I also used this code to find most similar movie.
When I restarted code to train this model, the most similar movie has changed, with changed score.
Is this possible? Why? If then, how can I fix the training result?
Yes, this sort of change from run to run is normal. It's well-explained in question 11 of the Gensim FAQ:
Q11: I've trained my Word2Vec / Doc2Vec / etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization
during training. (For example, the training windows are randomly
truncated as an efficient way of weighting nearer words higher. The
negative examples in the default negative-sampling mode are chosen
randomly. And the downsampling of highly-frequent words, as controlled
by the sample parameter, is driven by random choices. These
behaviors were all defined in the original Word2Vec paper's algorithm
description.)
Even when all this randomness comes from a
pseudorandom-number-generator that's been seeded to give a
reproducible stream of random numbers (which gensim does by default),
the usual case of multi-threaded training can further change the exact
training-order of text examples, and thus the final model state.
(Further, in Python 3.x, the hashing of strings is randomized each
re-launch of the Python interpreter - changing the iteration ordering
of vocabulary dicts from run to run, and thus making even the same
string-of-random-number-draws pick different words in different
launches.)
So, it is to be expected that models vary from run to run, even
trained on the same data. There's no single "right place" for any
word-vector or doc-vector to wind up: just positions that are at
progressively more-useful distances & directions from other vectors
co-trained inside the same model. (In general, only vectors that were
trained together in an interleaved session of contrasting uses become
comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as
useful, from run-to-run, as each other. Testing and evaluation
processes should be tolerant of any shifts in vector positions, and of
small "jitter" in the overall utility of models, that arises from the
inherent algorithm randomness. (If the observed quality from
run-to-run varies a lot, there may be other problems: too little data,
poorly-tuned parameters, or errors/weaknesses in the evaluation
method.)
You can try to force determinism, by using workers=1 to limit
training to a single thread – and, if in Python 3.x, using the
PYTHONHASHSEED environment variable to disable its usual string hash
randomization. But training will be much slower than with more
threads. And, you'd be obscuring the inherent
randomness/approximateness of the underlying algorithms, in a way that
might make results more fragile and dependent on the luck of a
particular setup. It's better to tolerate a little jitter, and use
excessive jitter as an indicator of problems elsewhere in the data or
model setup – rather than impose a superficial determinism.
If the change between runs is small – nearest neighbors mostly the same, with a few in different positions – it's best to tolerate it.
If the change is big, there's likely some other problem, like insufficient training data or poorly-chosen parameters.
Notably, min_alpha=0.025 isn't a sensible value - the training is supposed to use a gradually-decreasing value, and the usual default (min_alpha=0.0001) usually doesn't need changing. (If you copied this from an online example: that's a bad example! Don't trust that site unless it explains why it's doing an odd thing.)
Increasing the number of training epochs, from the default epochs=5 to something like 10 or 20 may also help make run-to-run results more consistent, especially if you don't have plentiful training data.
I'm working on a word2vec with negative sampling implementation using python and tensorflow+keras. The initial input for the script is a list of positive target-word pairs, which is processed via looping through them and assigning a number of negative examples to each, then the positive + k negative samples are appended in the corresponding order to a new list. That list is later (after a few adjustments) passed to a keras model.fit():
model.fit([data[:, 0], data[:, 1]], data[:, 2],
batch_size = numof_positives * (numof_negatives + 1))
I looked through some examples, and from what I understand, the batches passed to the neural network should contain the negative context words of those positives that are present in the batch, meaning that shuffling of the data should take place before assigning the negatives. On the other hand, I did not realize that keras' model.fit() has its shuffle argument on True by default, so first it was run with the data being shuffled after the assignment as well. Now that I've added shuffle=False, it seems like it affected the quality of the resulting embedding vectors negatively. Can that be the case? Where should the input be shuffled? What are the implications of passing completely randomly ordered data vs ordered batches?
I may have a few trust issues with the shuffle argument of keras' model.fit(), after experiencing this bug regarding shuffle='batch' first hand.
The value of shuffling with respect to word2vec training (that I'm familiar with) is to avoid cases where all examples of a word, or all similar senses, are clumped together in one range of the training data. (You don't want th model to get really good at those examples where a word/sense is overrepresented... only to then have that overly-specialized performance to be lost when a long run of samples where those same words/senses are completely unrepresented.)
That's something you can achieve with a corpus-shuffle before any batch-shuffle – which might separately bring its own benefits, for similar reasons, by achieving better interleaving of contrasting microexamples.
The word2vec implementations I know tend to do the backprop-updates of a positive-example closely-alongside the corresponding N synthetic negative-examples from the same context. That is, no extra shuffling will move them further from each other.
But it's not impossible further shuffling could help! So I'd not express any opinion on what theoretically "should" happen, compared to any empirical observations, of tradeoffs seen in real runs. (Do what works best!)
(Alternatively, if the actual goal is perfectly-reproducing some other implementation's choices, then you'd mainly want to mimic its actual code, and verify by comparing quantitative results on same-data/same-evaluation tasks.)
Is it possible to pass a vector to a trained neural network so it only chooses from a subset of the classes it was trained to recognize. For example, I have a network trained to recognize numbers and letters, but I know that the images I'm running it on next will not contain lowercase letters (Such as images of serial numbers). Then I pass it a vector telling it not to guess any lowercase letters. Since the classes are exclusive the network ends in a softmax function. Following are just examples of what I'd thought of trying but none really work.
import numpy as np
def softmax(arr):
return np.exp(arr)/np.exp(arr).sum()
#Stand ins for previous layer/NN output and vector of allowed answers.
output = np.array([ 0.15885351,0.94527385,0.33977026,-0.27237907,0.32012873,
0.44839673,-0.52375875,-0.99423903,-0.06391236,0.82529586])
restrictions = np.array([1,1,0,0,1,1,1,0,1,1])
#Ideas -----
'''First: Multilpy restricted before sending it through softmax.
I stupidly tried this one.'''
results = softmax(output*restrictions)
'''Second: Multiply the results of the softmax by the restrictions.'''
results = softmax(output)
results = results*restrictions
'''Third: Remove invalid entries before calculating the softmax.'''
result = output*restrictions
result[result != 0] = softmax(result[result != 0])
All of these have issues. The first one causes invalid choices to default to:
1/np.exp(arr).sum()
since inputs to softmax can be negative this can raise the probability given to an invalid choice and make the answer worse. (Should've looked into it before I tried it.)
The second and third both have similar issues in that they wait until right before an answer is given to apply the restriction. For example, if the network is looking at the letter l, but it starts to determine that it's the number 1, this won't be corrected until the very end with these methods. So if it was on it's way to giving the output of 1 with .80 probability but then this option removed it seems the remaining options will redistribute and the highest valid answer won't be as confident as 80%. The remaining options end up a lot more homogeneous.
An example of what I'm trying to say:
output
Out[75]: array([ 5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
softmax(output)
Out[76]: array([ 0.70454877, 0.14516581, 0.13660832, 0.00894051, 0.00473658])
softmax(output[1:])
Out[77]: array([ 0.49133596, 0.46237183, 0.03026052, 0.01603169])
(Arrays were ordered to make it easier.)
In the original output the softmax gives .70 that the answer is [1,0,0,0,0] but if that's an invalid answer and thus removed the redistribution how assigns the 4 remaining options with under 50% probability which could easily be ignored as too low to use.
I've considered passing a vector into the network earlier as another input but I'm not sure how to do this without requiring it to learn what the vector is telling it to do, which I think would increase time required to train.
EDIT: I was writing way too much in the comments so I'll just post updates here. I did eventually try giving the restrictions as an input to the network. I took the one hot-encoded answer and randomly added extra enabled classes to simulate an answer key and ensure the correct answer was always in the key. When the key had very few enabled categories the network relied heavily on it and it interfered with learning features from the image. When the key had a lot of enabled categories it seemingly ignored the key completely. This could have been a problem that needed optimized, issues with my network architecture, or just needed a tweak to training but I never got around the the solution.
I did find out that removing answers and zeroing were almost the same when I eventually subtracted np.inf instead of multiplying by 0. I was aware of ensembles but as mentioned in a comment to the first response my network was dealing with CJK characters (alphabet was just to make example easier) and had 3000+ classes. The network was already overly bulky which is why I wanted to look into this method. Using binary networks for each individual category was something I hadn't thought of but 3000+ networks seems problematic too (if I understood what you were saying correctly) though I may look into it later.
First of all, I will loosely go through available options you have listed and add some viable alternatives with the pros and cons. It's kinda hard to structure this answer but I hope you'll get what I'm trying to put out:
1. Multiply restricted before sending it through softmax.
Obviously may give higher chance to the zeroed-out entries as you have written, at seems like a false approach at the beginning.
Alternative: replace impossible values with smallest logit value. This one is similar to softmax(output[1:]), though the network will be even more uncertain about the results. Example pytorch implementation:
import torch
logits = torch.Tensor([5.39413513, 3.81445419, 3.75369546, 1.02716988, 0.39189373])
minimum, _ = torch.min(logits, dim=0)
logits[0] = minimum
print(torch.nn.functional.softmax(logits))
which yields:
tensor([0.0158, 0.4836, 0.4551, 0.0298, 0.0158])
Discussion
Citing you: "In the original output the softmax gives .70 that the answer is [1,0,0,0,0] but if that's an invalid answer and thus removed the redistribution how assigns the 4 remaining options with under 50% probability which could easily be ignored as too low to use."
Yes, and you would be in the right when doing that. Even more so, the actual probabilities for this class are actually far lower, around 14% (tensor([0.7045, 0.1452, 0.1366, 0.0089, 0.0047])). By manually changing the output you are essentially destroying the properties this NN has learned (and it's output distribution) rendering some part of your computations pointless. This points to another problem stated in the bounty this time:
2. NN are known to be overconfident for classification problems
I can imagine this being solved in multiple ways:
2.1 Ensemble
Create multiple neural networks and ensemble them by summing logits taking argmax at the end (or softmax and then `argmax). Hypothetical situation with 3 different models with different predictions:
import torch
predicted_logits_1 = torch.Tensor([5.39413513, 3.81419, 3.7546, 1.02716988, 0.39189373])
predicted_logits_2 = torch.Tensor([3.357895, 4.0165, 4.569546, 0.02716988, -0.189373])
predicted_logits_3 = torch.Tensor([2.989513, 5.814459, 3.55369546, 3.06988, -5.89473])
combined_logits = predicted_logits_1 + predicted_logits_2 + predicted_logits_3
print(combined_logits)
print(torch.nn.functional.softmax(combined_logits))
This would gives us the following probabilities after softmax:
[0.11291057 0.7576356 0.1293983 0.00005554 0.]
(notice the first class is now the most probable)
You can use bootstrap aggregating and other ensembling techniques to improve predictions. This approach makes the classifying decision surface smoother and fixes mutual errors between classifiers (given their predictions vary quite a lot). It would take many posts to describe in any greater detail (or separate question with specific problem would be needed), here or here are some which might get you started.
Still I would not mix this approach with manual selection of outputs.
2.2 Transform the problem into binary
This approach might yield better inference time and maybe even better training time if you can distribute it over multiple GPUs.
Basically, each class of yours can either be present (1) or absent (0). In principle you could train N neural networks for N classes, each outputting a single unbounded number (logit). This single number tells whether the network thinks this example should be classified as it's class or not.
If you are sure certain class won't be the outcome for sure you do not run network responsible for this class detection.
After obtaining predictions from all the networks (or subset of networks), you choose the highest value (or highest probability if you use sigmoid activation, though it would be computationally wasteful).
Additional benefit would be simplicity of said networks (easier training and fine-tuning) and easy switch-like behavior if needed.
Conclusions
If I were you I would go with the approach outlined in 2.2 as you could save yourself some inference time easily and would allow you to "choose outputs" in a sensible manner.
If this approach is not enough, you may consider N ensembles of networks, so a mix of 2.2 and 2.1, some bootstrap or other ensembling techniques. This should improve your accuracy as well.
First ask yourself: what is the benefit of excluding certain outputs based on external data. In your post, I don't see why exactly you want to exclude them.
Saving them won't save computation as óne connection (or óne neuron) has effect on multiple outputs: you can't disable connections/neurons.
Is it really necessary to exclude certain classes? If your network is trained well enough, it will know if it's a capital or not.
So my answer: I don't think you should fiddle with any operation before the softmax. This will give you false conclusions. So you have the following options:
Multiply the results of the softmax by the restrictions.
Don't multiply, if the highest class is 'a', convert it to 'A' as output (convert output to lowercase)
Train a network that sees no difference between capital and non-capital letters
I used a Random Forest Classifier in Python and MATLAB. With 10 trees in the ensemble, I got ~80% accuracy in Python and barely 30% in MATLAB. This difference persisted even when MATLAB's random forests were grown with 100 or 200 tress.
What could be the possible reason for this difference between these two programming languages?
The MATLAB code is below:
load 'path\to\feature vector'; % Observations X Features, loaded as segment_features
load 'path\to\targetValues'; % Observations X Target value, loaded as targets
% Set up Division of Data for Training, Validation, Testing
trainRatio = 70/100;
valRatio = 0/100;
testRatio = 30/100;
[trainInd,valInd,testInd] = dividerand(size(segment_features,1),trainRatio,...
valRatio,testRatio);
% Train the Forest
B=TreeBagger(10,segment_features(trainInd,:), target(trainInd),...
'OOBPred','On');
% Test the Network
outputs_test = predict(B,segment_features(testInd, :));
outputs_test = str2num(cell2mat(outputs_test));
targets_test = target(testInd,:);
Accuracy_test=sum(outputs_test==targets_test)/size(testInd,2);
oobErrorBaggedEnsemble = oobError(B);
plot(oobErrorBaggedEnsemble)
xlabel 'Number of grown trees';
ylabel 'Out-of-bag classification error';
The Problem
There are many reasons why the implementation of a random forest in two different programming languages (e.g., MATLAB and Python) will yield different results.
First of all, note that results of two random forests trained on the same data will never be identical by design: random forests often choose features at each split randomly and use bootstrapped samples in the construction of each tree.
Second, different programming languages may have different default values set for the hyperparameters of a random forest (e.g., scikit-learn's random forest classifier uses gini as its default criterion to measure the quality of a split.)
Third, it will depend on the size of your data (which you do not specify in your question). Smaller datasets will yield more variability in the structure of your random forests and, in turn, their output will differ more from one forest to another.
Finally, a decision tree is susceptible to variability in input data (slight data perturbations can yield very different trees). Random forests try to get more stable and accurate solutions by growing many trees, but often 10 (or even 100 or 200) are often not enough trees to get stable output.
Toward a Solution
I can recommend several strategies. First, ensure that the way in which the data are loaded into each respective program are equivalent. Is MATLAB misreading a critical variable in a different way from Python, causing the variable to become non-predictive (e.g., misreading a numeric variable as a string variable?).
Second, once you are confident that your data are loaded identically across your two programs, read the documentation of the random forest functions closely and ensure that you are specifying the same hyperparameters (e.g., criterion) in your two programs. You want to ensure that the random forests in each are being created as similarly as possible.
Third, it will likely be necessary to increase the number of trees to get more stable output from your forests. Ensure that the number of trees in both implementations is the same.
Fourth, a potential difference between programs may come from how the data are split into training vs testing sets. It may be necessary to ensure some method that allows you to replicate the same cross-validation sets across your two programming languages (e.g., if you have a unique ID for each record, assign those with even numbers to training and those with odd numbers to testing).
Finally, you may also benefit from creating multiple forests in each programming language and compare the mean accuracy numbers across iterations. These will give you a better sense of whether differences in accuracy are truly reliable and significant or just a fluke.
Good luck!
I have some observational data for which I would like to estimate parameters, and I thought it would be a good opportunity to try out PYMC3.
My data is structured as a series of records. Each record contains a pair of observations that relate to a fixed one hour period. One observation is the total number of events that occur during the given hour. The other observation is the number of successes within that time period. So, for example, a data point might specify that in a given 1 hour period, there were 1000 events in total, and that of those 1000, 100 were successes. In another time period, there may be 1000000 events in total, of which 120000 are successes. The variance of the observations is not constant and depends on the total number of events, and it is partly this effect that I would like to control and model.
My first step in doing this to estimate the underlying success rate. I've prepared the code below which is intended to mimic the situation by providing two sets of 'observed' data by generating it using scipy. However, it does not work properly.
What I expect it to find is:
loss_lambda_factor is roughly 0.1
total_lambda (and total_lambda_mu) is roughly 120.
Instead, the model converges very quickly, but to an unexpected answer.
total_lambda and total_lambda_mu are respectively sharp peaks around 5e5.
loss_lambda_factor is roughly 0.
The traceplot (which I cannot post due to reputation below 10) is fairly uninteresting - quick convergence, and sharp peaks at numbers that do not correspond to the input data. I am curious as to whether there is something fundamentally wrong with the approach I am taking. How should the following code be modified to give the correct / expected result?
from pymc import Model, Uniform, Normal, Poisson, Metropolis, traceplot
from pymc import sample
import scipy.stats
totalRates = scipy.stats.norm(loc=120, scale=20).rvs(size=10000)
totalCounts = scipy.stats.poisson.rvs(mu=totalRates)
successRate = 0.1*totalRates
successCounts = scipy.stats.poisson.rvs(mu=successRate)
with Model() as success_model:
total_lambda_tau= Uniform('total_lambda_tau', lower=0, upper=100000)
total_lambda_mu = Uniform('total_lambda_mu', lower=0, upper=1000000)
total_lambda = Normal('total_lambda', mu=total_lambda_mu, tau=total_lambda_tau)
total = Poisson('total', mu=total_lambda, observed=totalCounts)
loss_lambda_factor = Uniform('loss_lambda_factor', lower=0, upper=1)
success_rate = Poisson('success_rate', mu=total_lambda*loss_lambda_factor, observed=successCounts)
with success_model:
step = Metropolis()
success_samples = sample(20000, step) #, start)
plt.figure(figsize=(10, 10))
_ = traceplot(success_samples)
There is nothing fundamentally wrong with your approach, except for the pitfalls of any Bayesian MCMC analysis: (1) non-convergence, (2) the priors, (3) the model.
Non-convergence: I find a traceplot that looks like this:
This is not a good thing, and to see more clearly why, I would change the traceplot code to show only the second-half of the trace, traceplot(success_samples[10000:]):
The Prior: One major challenge for convergence is your prior on total_lambda_tau, which is a exemplar pitfall in Bayesian modeling. Although it might appear quite uninformative to use prior Uniform('total_lambda_tau', lower=0, upper=100000), the effect of this is to say that you are quite certain that total_lambda_tau is large. For example, the probability that it is less than 10 is .0001. Changing the prior to
total_lambda_tau= Uniform('total_lambda_tau', lower=0, upper=100)
total_lambda_mu = Uniform('total_lambda_mu', lower=0, upper=1000)
results in a traceplot that is more promising:
This is still not what I look for in a traceplot, however, and to get something more satisfactory, I suggest using a "sequential scan Metropolis" step (which is what PyMC2 would default to for an analogous model). You can specify this as follows:
step = pm.CompoundStep([pm.Metropolis([total_lambda_mu]),
pm.Metropolis([total_lambda_tau]),
pm.Metropolis([total_lambda]),
pm.Metropolis([loss_lambda_factor]),
])
This produces a traceplot that seems acceptible:
The model: As #KaiLondenberg responded, the approach you have taken with priors on total_lambda_tau and total_lambda_mu is not a standard approach. You describe widely varying event totals (1,000 one hour and 1,000,000 the next) but your model posits it to be normally distributed. In spatial epidemiology, the approach I have seen for analogous data is a model more like this:
import pymc as pm, theano.tensor as T
with Model() as success_model:
loss_lambda_rate = pm.Flat('loss_lambda_rate')
error = Poisson('error', mu=totalCounts*T.exp(loss_lambda_rate),
observed=successCounts)
I'm sure that there are other ways that will seem more familiar in other research communities as well.
Here is a notebook collecting up these comments.
I see several potential problems with the model.
1.) I would think that the success counts (called error ?) should follow a Binomial(n=total,p=loss_lambda_factor) distribution, not a Poisson.
2.)
Where does the chain start ? Starting at a MAP or MLE configuration would make sense unless you use pure Gibbs sampling. Otherwise the chain might take a long time for burn-in, which might be what's happening here.
3.) Your choice of a hierarchical prior for total_lambda (i.e. normal with two uniform priors on those parameters) ensures that the chain will take a long time to converge, unless you pick your start wisely (as in Point 2.). You essentially introduce a lot of unneccessary degrees of freedom for the MCMC chain to get lost in. Given that total_lambda has to be nonngative, I would choose a Uniform prior for total_lambda in a suitable range (from 0 to the observed maximum for example).
4.) You use the Metropolis Sampler. 20000 samples might not be enough for that one. Try 60000 and discard the first 20000 as burn-in. The Metropolis Sampler can take a while to tune the step size, so it might well be that it spent the first 20000 samples to mainly reject proposals and tune. Try other samplers, such as NUTS.