I have a simple function f : R -> R, f(x) = x^2 + a, and would like to create a neural network that learns this function as completely as it can. Currently, I have a PyTorch implementation that takes inputs from a limited range, of course: from x0 to xN, with a particular number of points. Each epoch, the training data is randomly perturbed, so that the network does not just learn the relationship at the same grid points every time.
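For concreteness, a minimal sketch of this kind of setup (the layer sizes, range, and perturbation scale below are illustrative, not my exact values):

```python
import torch
import torch.nn as nn

a = 1.0
net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
grid = torch.linspace(-2.0, 2.0, 200).unsqueeze(1)  # x0 .. xN

for epoch in range(2000):
    x = grid + 0.01 * torch.randn_like(grid)  # perturb the grid each epoch
    loss = nn.functional.mse_loss(net(x), x ** 2 + a)
    opt.zero_grad()
    loss.backward()
    opt.step()
```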
Currently, it does a great job of learning the function on the range it is trained on, but is it at all feasible to train it in such a way that it can extend this learning beyond the training range? At the moment, the behavior outside the training range seems to depend on the activation function. For example, with ReLU, the true function (orange) compared to the network's prediction (blue) is shown below:
I understand that if I transform the input vector to higher dimensions that contain higher powers of x, this may work out pretty well, but for the general case, and for how I plan to use this in the future, it won't work as well on non-polynomial functions.
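Concretely, the transform I have in mind is something like the following (degree 3 is just an illustrative choice):

```python
import torch

def poly_features(x, degree=3):
    # Map a column of inputs x (shape N x 1) to [x, x^2, ..., x^degree].
    return torch.cat([x ** d for d in range(1, degree + 1)], dim=1)
```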
One thought that came to mind is from support vector machines and the choice of a kernel, and how the radial basis kernel gets around this generalization issue, but I'm not sure whether that idea can be applied here without the inner-product structure of SVMs.
What you want is called extrapolation (as opposed to interpolation, which is predicting a value inside the trained domain/range). There is never a good solution for extrapolation: using higher powers can give you a better fit for a specific problem, but if you change the fitted curve slightly (its x- or y-intercept, one of the powers, etc.), the extrapolation will be poor again.
This is also why neural networks use large data sets (to maximize their input range and rely on interpolation), and why over-training / over-fitting (which is effectively what you're trying to do) is a bad idea; it never works well in the general case.
Related
I am working on a project with Wasserstein GANs, and more specifically with an implementation of the improved version of Wasserstein GANs. I have two theoretical questions about WGANs regarding their stability and training process. Firstly, the value of the loss function is notoriously correlated with the quality of the generated samples (as stated here). Is there any additional bibliography that supports that argument?
Secondly, during my experimental phase, I noticed that training my architecture with WGANs is much faster than with a simple version of GANs. Is that common behavior? Is there also some literature analysis of that?
Furthermore, one question about the continuity that is guaranteed by using the Wasserstein loss: I am having some issues understanding this concept in practice. What does it mean that the normal GAN loss is not a continuous function?
You can check the Inception Score and the Frechet Inception Distance for now. And also here. The problem is that, since GANs do not have a unified objective function (there are two networks), there is no agreed way of evaluating and comparing GAN models. Instead, people devise metrics that relate the image distribution and the generator's distribution.
WGANs could be faster due to having a more stable training procedure than vanilla GANs (the Wasserstein metric, weight clipping, and the gradient penalty, if you are using it). I don't know whether there is a literature analysis of training speed, and a WGAN may not always be faster than a simple GAN. Note also that a WGAN cannot find the best Nash equilibrium the way a GAN can.
Think of two distributions, p and q. If these distributions overlap, i.e., their domains overlap, then the KL or JS divergence is differentiable. The problem arises when p and q do not overlap. As in the WGAN paper's example, say we have two pdfs on a 2D space, V = (0, Z) and Q = (K, Z), where K is different from 0 and Z is sampled from a uniform distribution. If you try to take the derivative of the KL/JS divergence of these two pdfs with respect to K, you cannot: the divergence behaves like a binary indicator (equal or not), and we cannot take the derivative of such a function. However, if we use the Wasserstein loss or Earth-Mover distance, we can, since it measures the distance between the two distributions in space. Short story: the normal GAN loss is continuous if and only if the distributions overlap; otherwise it is effectively discrete.
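For reference, the divergences in that example (it is Example 1 in the WGAN paper, with Z uniform on [0, 1]) work out to:

```latex
% V is the law of (0, Z), Q is the law of (K, Z), Z ~ U[0, 1].
JS(V, Q) = \begin{cases} \log 2 & K \neq 0 \\ 0 & K = 0 \end{cases}
\qquad
KL(V \,\|\, Q) = \begin{cases} +\infty & K \neq 0 \\ 0 & K = 0 \end{cases}
\qquad
W(V, Q) = |K|
```

Only the Wasserstein distance varies smoothly with K, so only it gives a useful gradient signal.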
Hope this helps
The most common way to stabilize the training of a WGAN is to replace the weight-clipping technique used in the early WGAN with a gradient penalty (WGAN-GP). This technique seems to outperform the original WGAN. The paper that describes what GP is can be found here:
https://arxiv.org/pdf/1704.00028.pdf
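For reference, the penalty term itself is only a few lines. Here is a minimal PyTorch-style sketch (the function name and the assumption of flat 2-D inputs are mine, not from the paper):

```python
import torch

def gradient_penalty(critic, real, fake):
    # Interpolate randomly between real and fake samples
    # (assumes inputs of shape [batch, features]).
    eps = torch.rand(real.size(0), 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    # Gradient of the critic's scores w.r.t. the interpolated inputs.
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True)[0]
    # Penalize deviation of the gradient norm from 1 (Lipschitz constraint).
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```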
Also, if you need any help with how to implement this, you can check a nice repository that I have found here:
https://github.com/kochlisGit/Keras-GAN
There are also other tricks that you can use to improve the overall quality of your generated images, described in the repository. For example:
Add random Gaussian noise to the inputs of the discriminator that decays over time (see the sketch after this list).
Random/Adaptive Data Augmentations
Separate fake/real batches
etc.
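As a minimal sketch of the first trick, decaying input noise (the starting scale sigma0 and the linear schedule are assumptions, not the repository's exact recipe):

```python
import torch

def noisy_inputs(x, epoch, total_epochs, sigma0=0.1):
    # Gaussian noise on the discriminator inputs that decays linearly
    # to zero over the course of training.
    sigma = sigma0 * max(0.0, 1.0 - epoch / total_epochs)
    return x + sigma * torch.randn_like(x)
```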
I am working on a project in which I have to predict a plane's trajectory.
I have two types of trajectory: the first is the planned one, and the second is the real one, which I recovered after the end of the flight.
The two trajectories are (x, y) points on a map, and I want to predict the real one from the planned one.
What kind of model should I use? I have heard about multivariate regression and recurrent neural networks, but I am not sure about either: I think multivariate regression is not appropriate, and RNNs include time as a parameter, which I would prefer not to use at first.
Do you have any ideas?
Thank you
You could either train single-target regression models and predict the x and y variables independently, or use multi-target regression methods. A commonly used method is based on predictive clustering trees. You can read about various methods at https://towardsdatascience.com/regression-models-with-multiple-target-variables-8baa75aacd to start with. I hope it is somewhat helpful. :)
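As a minimal scikit-learn sketch of the single-target route (the data below is purely hypothetical, one row per flight):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X_planned = rng.normal(size=(500, 20))  # e.g. 10 planned (x, y) points, flattened
Y_real = rng.normal(size=(500, 2))      # the real (x, y) point to predict

# MultiOutputRegressor fits one independent regressor per target column.
model = MultiOutputRegressor(GradientBoostingRegressor())
model.fit(X_planned, Y_real)
predictions = model.predict(X_planned[:5])
```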
I'm a bit of a beginner in the art of machine learning. Here is a rather conceptual question I've been wondering about:
Suppose I have a function X -> Y, say y = x^2. Then, generating enough (x, y) data, I can train a neural network to perform regression on the function and get x^2 for any input x. This is basically also what the Universal Approximation Theorem suggests.
Now, my question is: what if I want the inverse relation, Y -> X? In this case, X is a multi-valued function of Y; for instance, for y > 0, x = ±sqrt(y). I can swap X and Y as input/output data to train the network, but for any given y there should be an equal 50/50 chance of x = sqrt(y) and x = -sqrt(y). Of course, if one trains with mean-squared error, the network has no way of knowing this is a multi-valued function; it would just follow SGD on the loss and output x = 0 for any given y, since the conditional mean of ±sqrt(y) is 0.
Therefore, I wonder whether there is any way a neural network can model a multi-valued function. For instance, my guesses would be:
(1) the neural network could output a collection of, say, the top 2 possible values for X and be trained with cross-entropy. The problem is, if X is a vector or even a matrix (like a bitmap image) instead of a number, we don't know how many X solutions a given Y has (there could very well be infinitely many, i.e. a continuous range), so a "list" of possible values and probabilities won't work; ideally, the neural network should output values randomly and continuously distributed across the possible X solutions.
(2) perhaps this falls into the realm of probabilistic neural networks (PNNs)? Do PNNs model functions whose output follows a given probability distribution (continuous or discrete) over vectors? If so, is it possible to implement a PNN with popular frameworks like TensorFlow + Keras?
(Also, note that this is different from a "multivariate" function, where X and Y may be multi-component vectors, which a traditional network can still easily be trained on. The actual problem here is that the output should be a probability distribution over vectors, which a simple feed-forward network doesn't capture, since it has no inherent randomness.)
Thank you for your kind help!
Image of the forward function Y = X^2 (easily modeled by a network with regression)
Image of the inverse function X = ±sqrt(Y) (the network cannot capture the two-valued function and outputs the average value X = 0 for any Y)
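To make guess (2) concrete, one standard option is a mixture density network, which outputs the parameters of a Gaussian mixture over X so that both branches ±sqrt(y) can be represented. A minimal PyTorch sketch (the architecture and the choice of K = 2 components are illustrative assumptions):

```python
import math
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, k=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, k)         # component means
        self.log_sigma = nn.Linear(hidden, k)  # component log-std-devs
        self.logits = nn.Linear(hidden, k)     # mixing weights (pre-softmax)

    def forward(self, y):
        h = self.body(y)
        return self.mu(h), self.log_sigma(h), self.logits(h)

def mdn_nll(mu, log_sigma, logits, x):
    # Negative log-likelihood of x (shape N x 1) under the predicted mixture.
    log_pi = torch.log_softmax(logits, dim=-1)
    log_prob = (-0.5 * ((x - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```

Trained on (y, x) pairs drawn from both branches, the two components can settle on +sqrt(y) and -sqrt(y) instead of collapsing to their average.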
Try to read the following paper:
https://onlinelibrary.wiley.com/doi/abs/10.1002/ecjc.1028
Mifflin's algorithm (or its more general version, SLQP-GS) mentioned in this paper is available here, and the corresponding paper with a description is here.
I was wondering if I could get some help on a problem.
I am creating a tool for a former lab of mine that uses data from a physics-based machine (with a lot of noise), which produces simple (x, y) coordinates. I want to identify the local maxima of the dataset; however, since there is a lot of noise in the set, you cannot just check the slope between points to determine a peak.
In order to solve this, I was thinking of using polynomial regression to somewhat "smooth out" the data set, and then determine the local maxima from the resulting model.
I've run through this link: http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html. However, it only tells you how to create a model that is a close fit. It doesn't tell you whether there is an integrated metric with which to measure which model is best. Should I do this through chi-squared? Or is there some other metric that works better, or is integrated into scikit-learn?
The link provided essentially shows you how to build a ridge regression on top of polynomial features. Consequently, this is not a "tight fit": you can control it through regularization (the alpha parameter), a prior over the parameters.

Now, what do you mean by "best model"? There are infinitely many possible criteria for being the best regression, each tested in a different way. You need to answer for yourself what measure you are interested in. Should it be some kind of "golden ratio" between smoothness and closeness of fit? Or maybe you want a model of at most some smoothness that minimizes some error measure (mean squared distance to the points?). Yet another option is to test how well it captures the underlying process, through some kind of typical validation (like cross-validation), where you repeatedly build the model on a subset of the data and check the error on the held-out part.

There are many possible (and completely valid!) approaches; everything depends on the exact question you want to answer. "What is the best model?" is, unfortunately, not a well-posed question on its own.
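For instance, the cross-validation route might look like this in scikit-learn (the data and parameter grids below are hypothetical), followed by peak-finding on the smoothed curve:

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)  # noisy instrument data

# Cross-validation picks the degree/alpha that generalize best,
# which is one concrete answer to "which is the best model".
pipe = make_pipeline(PolynomialFeatures(), Ridge())
grid = GridSearchCV(
    pipe,
    {"polynomialfeatures__degree": range(2, 12),
     "ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error", cv=5)
grid.fit(x[:, None], y)

# Local maxima of the smoothed fit.
y_smooth = grid.predict(x[:, None])
peaks, _ = find_peaks(y_smooth)
print(x[peaks])
```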
I am attempting to curve fit the training error of a neural network as a function of the number of training iterations. An example is shown in red in the image below. Here I've trained for 3000 iterations. What I'm interested in is whether I can find a function that I can fit on the first 1000 (or so) iterations to extrapolate out to 3000 iterations with some reasonable accuracy.
However, I don't know what functional form would be best to use. At first I tried an exponential of the form f(x) = A + B*exp(-C*x), which is shown in blue. Obviously this doesn't work too well. The exponential dies off way too fast and then basically just becomes the constant term.
Perhaps it's just difficult, since the beginning of training shows a very sharp drop-off in the error but then transitions to something much more gradual at higher iterations. But maybe someone with experience in neural network training and/or in fitting unknown functions has some ideas. I've been trying various exponential forms and polynomial fits within scipy/numpy, but with no success. I've also varied the number of iterations used in the fit (including throwing out the small iteration numbers).
Any thoughts?
I think an exponential fit may work. In your f(x) = A + B*exp(-C*x), if I choose A = 0.005, B = 0.045, and C = 1/250, I get a reasonable fit.
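A minimal scipy sketch of that fit (the synthetic error curve below is only for illustration; in practice you would fit your recorded errors):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b, c):
    # Decaying exponential with an asymptote, as suggested above.
    return a + b * np.exp(-c * x)

iters = np.arange(1, 3001)
rng = np.random.default_rng(0)
err = 0.005 + 0.045 * np.exp(-iters / 250) + 0.001 * rng.normal(size=iters.size)

# Fit on the first 1000 iterations, then extrapolate to 3000.
popt, _ = curve_fit(f, iters[:1000], err[:1000], p0=(0.005, 0.05, 1 / 250))
extrapolated = f(iters, *popt)
```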
It's just a matter of parameter tuning. That said, I am trying to understand your motivation for fitting the learning curve. Interpolation methods often include an 'extrapolation' option that you can use to predict the error after more epochs. If you want to learn the curve precisely, you could use another neural network with a linear hidden layer and output to 'learn' the curve, though I haven't tried whether that works.
Check out this page: http://www.astroml.org/sklearn_tutorial/practical.html
The useful thing it describes for your situation is how to diagnose whether your algorithm suffers from high bias or high variance on your data set, and it offers specific directions to take in either case.
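For instance, a quick check along those lines with scikit-learn's learning curves (the data and model below are hypothetical) could look like:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# A persistent gap between train and validation error suggests high
# variance; both errors high and close together suggests high bias.
sizes, train_scores, val_scores = learning_curve(
    Ridge(), X, y, cv=5, scoring="neg_mean_squared_error")
print(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1))
```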