I am attempting to curve fit the training error of a neural network as a function of the number of training iterations. An example is shown in red in the image below. Here I've trained for 3000 iterations. What I'm interested in is whether I can find a function that I can fit on the first 1000 (or so) iterations to extrapolate out to 3000 iterations with some reasonable accuracy.
However, I don't know what functional form would be best to use. At first I tried an exponential of the form f(x) = A + B*exp(-C*x), which is shown in blue. Obviously this doesn't work too well: the exponential dies off far too fast and then basically just becomes the constant term.
Perhaps it's just difficult, since the beginning of the training shows a very sharp drop off of the error but then transitions to something much more gradual for higher iterations. But maybe someone with experience in neural network training and/or experience in fitting unknown functions might have some ideas. I've been trying various exponential forms and polynomial fits within scipy/numpy but with no success. I've varied the number of iterations used in the fit as well (including throwing out the small iteration numbers).
Any thoughts?
I think an exponential fit may work. With your f(x) = A + B*exp(-C*x), if I choose A = 0.005, B = 0.045, and C = 1/250, I get:
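For reference, a minimal sketch of fitting that form with scipy.optimize.curve_fit, assuming `iters` and `train_err` hold your recorded iteration numbers and training errors (those names are placeholders, not from your code):

import numpy as np
from scipy.optimize import curve_fit

def f(x, A, B, C):
    return A + B * np.exp(-C * x)

# iters, train_err: placeholder names for your recorded iterations and errors.
popt, _ = curve_fit(f, iters[:1000], train_err[:1000],
                    p0=[0.005, 0.045, 1.0 / 250])   # starting guesses as above
extrapolated = f(np.arange(1, 3001), *popt)         # extend the fit out to 3000 iterations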
It's mostly a matter of parameter tuning. That said, I'm trying to understand your motivation for fitting the learning curve. Interpolation methods usually include an 'extrapolation' option that you can use to predict the error after more epochs. If you want to learn the curve more precisely, you could use another neural network with a linear hidden layer and output to 'learn' the curve itself, though I haven't tried whether that works.
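If you do go the interpolation route, here is a sketch with scipy's interp1d and its extrapolate option (same placeholder names as above; spline extrapolation far outside the fitted range can behave badly, so treat it as a rough check):

from scipy.interpolate import interp1d

# Fit on the first 1000 iterations only, then query beyond them.
g = interp1d(iters[:1000], train_err[:1000], kind='cubic', fill_value='extrapolate')
predicted_error_at_3000 = g(3000)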
Check out this page: http://www.astroml.org/sklearn_tutorial/practical.html
The useful thing it describes for your situation is diagnosing whether your algorithm appears to suffer from high bias or high variance on your data set, and it offers specific directions to take in either case.
I have a simple function f : R -> R, f(x) = x^2 + a, and would like to create a neural network that learns the function as well as it can. Currently I have a PyTorch implementation that, of course, takes inputs from a limited range, x0 to xN, with a particular number of points. Each epoch the training data is randomly perturbed, so that the network does not only see the same grid points every time.
Currently it does a great job of learning the function on the range it is trained on, but is it at all feasible to train it in such a way that it extends this learning beyond that range? At the moment the behavior outside the training range seems to depend on the activation function. For example, with ReLU, the true function (orange) and the network's prediction (blue) are shown below:
I understand that if I transform the input vector into higher dimensions containing higher powers of x (a sketch of this is below), it may work out pretty well, but for the generalized case I plan to implement in the future it won't work as well on non-polynomial functions.
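A minimal sketch of that higher-power input idea in PyTorch (the layer sizes, training range, and value of a are arbitrary choices, not taken from your setup):

import torch
import torch.nn as nn

# Feed [x, x^2] instead of raw x; the target x^2 + a is then linear in the
# inputs, which tends to extrapolate better than learning the square itself.
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
a = 1.0
for step in range(2000):
    x = torch.empty(256, 1).uniform_(-3.0, 3.0)   # training range only
    inp = torch.cat([x, x ** 2], dim=1)           # higher-power features
    loss = nn.functional.mse_loss(model(inp), x ** 2 + a)
    opt.zero_grad()
    loss.backward()
    opt.step()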
One thought that came to mind is from support vector machines and the choice of kernel, and how the radial basis function kernel gets around this generalization issue, but I'm not sure whether that can be applied here without the inner-product machinery of SVMs.
What you want is called extrapolation (as opposed to interpolation, which is predicting a value inside the trained domain/range). There is never a universally good solution for extrapolation: using higher powers can give you a better fit for a specific problem, but if you change the fitted curve slightly (its x- or y-intercept, one of the powers, etc.) the extrapolation will be poor again.
This is also why neural networks use large data sets (to maximize their input range and rely on interpolation) and why over-training / overfitting (which is what you're trying to do) is a bad idea; it never works well in the general case.
I am quite new to machine learning and decided that a good way to start getting some experience would be to play around with some real data sets and the Python scikit-learn library. I used Haberman's Survival data, a binary classification task, which can be found at https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival. I trained a few perceptrons on this data. At some point, I decided to demonstrate the concept of overfitting, so I mapped all 306 data points, of 3 features each, to a very high dimension by taking all terms up to and including the 11th degree. That is a vast 364 features (more than the 306 data points). Yet when I trained the model, I did not achieve zero in-sample error. I figured the reason might be that some points coincide but have different labels, so I removed duplicate data points; again, I could not achieve zero in-sample error. Here is the interesting part of my code, using the scikit-learn methods:
from sklearn.linear_model import Perceptron
from sklearn import preprocessing

perceptron = Perceptron()
polynomial = preprocessing.PolynomialFeatures(11)   # all terms up to and including degree 11
perceptron.fit(polynomial.fit_transform(X), Y)
print(perceptron.score(polynomial.fit_transform(X), Y))
And the output I got was a mere 0.7, an accuracy far from the 1.0 (100%) I expected. What am I missing?
You only have 11 polynomial degrees to work with. If you want to be guaranteed to hit every point, you need nearly as many (if not more) degrees of freedom as you have data points, because each additional degree lets the fitted curve bend again.
Having a bunch of features of the same degree can't really increase your complexity in the way you expect. If your function is first degree, for example, you can't expect it to be anything other than linear, regardless of the number of like terms.
So while you may have more features than data points, since you don't have more polynomial degrees than data points, most of your features are effectively tweaking the same weights.
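As a quick sanity check on those counts, you can ask PolynomialFeatures how its 364 generated terms split across degrees (this only inspects the expansion for 3 inputs at degree 11; it doesn't change the answer above):

from collections import Counter
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(11)
poly.fit(np.zeros((1, 3)))                           # 3 original features, as in the question
degrees = [int(row.sum()) for row in poly.powers_]   # total degree of each generated term
print(len(degrees))                                  # 364 features in total
print(Counter(degrees))                              # number of terms at each degree 0..11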
I'm trying to use SVM from sklearn for a classification problem. I have a highly sparse dataset with more than 50K rows and binary outputs.
The problem is that I don't know how to efficiently choose the parameters, mainly the kernel, gamma, and C.
For the kernels, for example, am I supposed to try all of them and just keep the one that gives the most satisfying results, or is there something about the data that we can look at beforehand to choose the kernel?
Same goes for C and gamma.
Thanks !
Yes, this is mostly a matter of experimentation -- especially as you've told us very little about your data set: separability, linearity, density, connectivity, ... all the characteristics that affect classification algorithms.
Try the linear and Gaussian kernels for starters. If linear doesn't work well and Gaussian does, then try the other kernels.
Once you've found the best one or two kernels, play with the cost and gamma parameters. The cost C is the "slack" parameter: it gives the fit permission to make a certain proportion of raw classification errors as a trade-off for other benefits: width of the margin, simplicity of the partition function, etc. Gamma controls the width of the Gaussian kernel, i.e. how far each training point's influence reaches.
I haven't yet had an application that got more than trivial benefit from altering the cost.
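For what it's worth, the usual mechanical way to run that experiment is a grid search with cross-validation; here is a sketch (the grid values are illustrative, X and y stand in for your sparse feature matrix and binary labels, and on 50K rows an RBF SVC can be slow):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [1e-4, 1e-3, 1e-2, 1e-1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X, y)                                 # X, y: placeholders for your data
print(search.best_params_, search.best_score_)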
My model produces learning curves as shown below. Are these fine? I am a beginner, and everywhere on the internet I see that as the number of training examples increases, the training score should decrease and then converge. But here the training score increases and then converges. Does this indicate a bug in my code or something wrong with my input?
Okay I figured out what was wrong with my code.
train_sizes, train_accuracy, cv_accuracy = lc(
    linear_model.LogisticRegression(solver='lbfgs', penalty='l2', multi_class='ovr'),
    trainData, multiclass_response_train,
    train_sizes=np.array([0.1, 0.33, 0.5, 0.66, 1.0]), cv=5)
I had not entered a regularization parameter for Logistic Regression.
But now,
train_sizes, train_accuracy, cv_accuracy = lc(
    linear_model.LogisticRegression(C=1000, solver='lbfgs', penalty='l2', multi_class='ovr'),
    trainData, multiclass_response_train,
    train_sizes=np.array([0.1, 0.33, 0.5, 0.66, 1.0]), cv=5)
The learning curve looks alright.
Can anybody tell me why this is so? That is, why does the training score increase with the default regularization term but decrease with weaker regularization?
Data details: 10 classes. Images of varying sizes. (Digit Classification - street view digits)
You need to be more precise regarding your metrics. What metrics are used here?
A loss generally means lower is better, while a score usually means higher is better.
This also means that the interpretation of your plot depends on the metrics used during training and cross-validation.
Have a look at the relevant page of scikit-learn:
http://scikit-learn.org/stable/modules/learning_curve.html
The score is typically some measure that needs to be maximized (ROC AUC, accuracy, ...). Intuitively you could expect that the more training examples you see, the better your model gets and hence the higher the score. There are, however, some subtleties regarding overfitting and underfitting that you should keep in mind.
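For completeness, here is a sketch of computing and plotting such a curve with scikit-learn's learning_curve, reusing the question's variable names (trainData and multiclass_response_train are assumed to be in scope; accuracy is used as the score, so higher is better):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import learning_curve

sizes, train_scores, cv_scores = learning_curve(
    linear_model.LogisticRegression(C=1000, solver='lbfgs', multi_class='ovr'),
    trainData, multiclass_response_train,
    train_sizes=np.array([0.1, 0.33, 0.5, 0.66, 1.0]), cv=5, scoring='accuracy')

plt.plot(sizes, train_scores.mean(axis=1), label='training accuracy')
plt.plot(sizes, cv_scores.mean(axis=1), label='cross-validation accuracy')
plt.xlabel('number of training examples')
plt.legend()
plt.show()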
Building off of Alex's answer, it looks like the default regularization parameter for your model underfits the data a bit, because when you relaxed regularization you saw 'more appropriate' learning curves. It doesn't matter how many examples you throw at a model that underfits.
As for why the training score increases rather than decreases in the first case: it's probably a consequence of the multiclass data you're using. With fewer training examples you have fewer images of each class (because learning_curve tries to keep the same class distribution in each CV fold), so with the stronger default regularization (C=1) it may be harder for your model to accurately guess some of the classes.
I was wondering if I could get some help on a problem.
I am creating a tool for a former lab of mine that uses data from a physics-based machine (with a lot of noise) and produces simple x, y coordinates. I want to identify local maxima of the dataset; however, since there is so much noise in the set, you cannot just check the slope between points to determine a peak.
In order to solve this, I was thinking of using polynomial regression to somewhat "smooth out" the data set, then determine local maxima from the resulting model.
I've run through this link: http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html, however, it only shows how to create a model that is a close fit. It doesn't say whether there is a built-in metric for measuring which model is best. Should I do this through chi-squared, or is there some other metric that works better or is built into scikit-learn?
The link provided essentially shows you how to build a ridge regression on top of polynomial features. Consequently this is not necessarily a "close fit", as you can control it through regularization (the alpha parameter), a prior over the parameters.
Now, what do you mean by "best model"? There are infinitely many possible criteria for being the best regression, each tested through a different criterion. You need to answer for yourself what measure you are interested in. Should it be some kind of "golden ratio" between smoothness and closeness of fit? Or maybe you want a model of at most some smoothness that minimizes some error measure (mean squared distance to the points?). Yet another option is to test how well it captures the underlying process, through some kind of typical validation (like cross-validation), where you repeatedly build the model on a subset of the data and check the error on the held-out part.
There are many possible (and completely valid!) approaches; everything depends on the exact question you want to answer. "What is the best model" is not a well-posed question, unfortunately.
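As one concrete instance of the cross-validation approach above, here is a sketch that scores a few polynomial degrees for a ridge fit and then reads local maxima off the smoothed prediction (x and y are placeholders for your measured coordinates; the degree range and alpha are arbitrary):

import numpy as np
from scipy.signal import argrelmax
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = x.reshape(-1, 1)                          # x, y: your noisy measurements (placeholders)
degrees = list(range(3, 12))
scores = [cross_val_score(make_pipeline(PolynomialFeatures(d), Ridge(alpha=1.0)),
                          X, y, cv=5, scoring='neg_mean_squared_error').mean()
          for d in degrees]
best = degrees[int(np.argmax(scores))]        # degree with the best held-out error

model = make_pipeline(PolynomialFeatures(best), Ridge(alpha=1.0)).fit(X, y)
grid = np.linspace(x.min(), x.max(), 2000).reshape(-1, 1)
smoothed = model.predict(grid)
peaks = argrelmax(smoothed)[0]                # indices of local maxima on the smoothed curve
print(grid[peaks].ravel())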