Neural network regression with multi-value (probabilistic) functions - python

I'm a bit of a beginner in the art of machine learning. Here is a rather conceptual question I've been wondering:
Suppose I have a function X->Y, say y=x^2, then, generating enough data of X->Y, I can train a neural network to perform regression on the function, and get x^2 with any input x. This is basically also what the Universal Approximation Theorem suggests.
Now, my question is, what if I want the inverse relation, Y->X? In this case, Y is a multi-valued function of X, for instance for X>0, x=+-sqrt(y). I can swap X and Y as input/output data to train the network alright, but for any given y, there should be a random 1/2 - 1/2 chance that x=sqrt(y) and x=-sqrt(y). But of course, if one trains it with min-squared-error, the network wouldn't know this is a multi-value function, and would just follow SGD on the loss function and get x=0, the average value, for any given y.
Therefore, I wonder if there is any way a neural network can model a multi-valued function? For instance, my guess would be
(1) the neural network can output a collection of, say, the top 2 possible values for X and train it with cross-entropy. The problem is, if X is a vector or even a matrix (like a bit-map image) instead of a number, we don't know how many solutions Y=X has (which could very well be an infinite number, i.e. a continuous range), so a "list" of possible values and probabilities won't work - ideally the neural network should output values randomly and continuously distributed across possible X solutions.
(2) perhaps does this fall into the realm of probabilistic neural networks (PNN)? Does PNN model functions that support a given probabilistic distribution (continuous or discrete) of vectors as its output? If so, is it possible to implement PNN with popular frameworks like Tensorflow+Keras?
(Also, note that this is different from a "multivariate" function, which is the case where X,Y could be multi-component vectors, which is still something a traditional network can easily train on. The actual problem in question here is where the output could be a probabilistic distribution of vectors, which is something that a simple feed-forward network doesn't capture, since it doesn't have the inherent randomness.)
Thank you for your kind help!
Image of forward function Y=X^2 (can be easily modeled by network with regression)
Image of inverse function X=+-sqrt(Y) (the network cannot capture the two-value function and outputs the average value X=0 for any Y)

Try to read the following paper:
https://onlinelibrary.wiley.com/doi/abs/10.1002/ecjc.1028
Mifflin's algorithm (or its more general version SLQP-GS) mentioned in this paper is available here and corresponding paper with description is here.

Related

Neural Networks Extending Learning Domain

I have a simple function f : R->R, f(x) = x2 + a, and would like to create a neural network to learn that function, as entirely as it can. Currently, I have a pytorch implementation that takes in inputs of a limited range of course, from x0 to xN with a particular number of points. Each epoch, the training data is randomly perturbed, in efforts to not only learn the relationship on the same grid points each time.
Currently, it does a great job of learning on the function on the range it is trained on, but is it at all feasible to train in such a way that can extend this learning beyond what it is trained on? Currently the behavior outside the training range seems dependent on the activation function. For example, with ReLU, the true function (orange) compared to the networks prediction (blue) are below:
I understand that if I transform the input vector to higher dimensions that contain higher powers of x, it may work out pretty well, but for a generalized case and how I plan to implement this in the future it won't work as well on non-polynomial functions.
One thought that came to mind is from support vector machines and the choice of a kernel, and how the radial basis kernel gets around this generalization issue, but I'm not sure if this can be applied here without the inner product properties of svm.
What you want is called extrapolation (as opposed to interpolation which is predicting a value that is inside the trained domain / range). There is never a good solution for extrapolation and using higher powers can give you a better fit for a specific problem, but if you change the fitted curve slightly (either change its x and y-intercept, one of the powers, etc) the extrapolation will be pretty bad again.
This is also why neural networks use a large data set (to maximize their input range and rely on interpolation) and why over-training / over fitting (which is what you're trying to do) is a bad idea; it never works well in the general case.

Training stability of Wasserstein GANs

I am working on a project with Wasserstein GANs and more specifically with an implementation of the improved version of Wasserstein GANs. I have two theoretical questions about wGANs regarding their stability and training process. Firstly, the result of the loss function notoriously is correlated with the quality of the result of the generated samples (that is stated here). Is there some extra bibliography that supports that argument?
Secondly, during my experimental phase, I noticed that training my architecture using wGANs is much faster than using a simple version of GANs. Is that a common behavior? Is there also some literature analysis about that?
Furthermore, one question about the continuous functions that are guaranteed by using Wasserstein loss. I am having some issues understanding this concept in practice, what it means that the normal GANs loss is not continuous function?
You can check Inception Score and Frechet Inception Distance for now. And also here. The problem is that GANs not having a unified objective functions(there are two networks) there's no agreed way of evaluating and comparing GAN models. INstead people devise metrics that's relating the image distributinos and generator distributions.
wGAN could be faster due to having morestable training procedures as opposed to vanilla GAN(Wasserstein metric, weight clipping and gradient penalty(if you are using it) ) . I dont know if there's a literature analysis for speed and It may not always the case for WGAN faster than a simple GAN. WGAN cannot find the best Nash equlibirum like GAN.
Think two distributions: p and q. If these distributions overlap, i.e. , their domains overlap, then KL or JS divergence are differentiable. The problem arises when p and q don't overlap. As in WGAN paper example, say two pdfs on 2D space, V = (0, Z) , Q = (K , Z) where K is different from 0 and Z is sampled from uniform distribution. If you try to take derivative of KL/JS divergences of these two pdfs well you cannot. This is because these two divergence would be a binary indicator function (equal or not) and we cannot take derivative of these functions. However, if we use Wasserstein loss or Earth-Mover distance, we can take it since we are approximating it as a distance between two points on space. Short story: Normal GAN loss function is continuous iff the distributions have an overlap, otherwise it is discrete.
Hope this helps
The most common way to stabilize the training of a WGAN is to replace the Gradient Clipping technique that was used in the early W-GAN with Gradient Penalty (WGAN-GP). This technique seems outperform the original WGAN. The paper that describes what GP is can be found here:
https://arxiv.org/pdf/1704.00028.pdf
Also, If you need any help of how to implement this, You can check a nice repository that I have found here:
https://github.com/kochlisGit/Keras-GAN
There are also other tricks that You can use to improve the overall quality of your generated images, described in the repository. For example:
Add Random Gaussian Noise at the inputs of the discriminator that decays over the time.
Random/Adaptive Data Augmentations
Separate fake/real batches
etc.

What kind of supervised model to use when I have vectors as input and vectors as output?

I am making a project in which I have to predict a plane trajectory.
I have 2 types of trajectory, the first one is the planned, and the second one is the real one that I recovered after the end of the flight.
The two trajectories are (x,y) points on a map and I want to predict the real one with the planned one.
What kind of model do you use? I heard about multivariate regression or recurrent neural network but I am not sure about both, I think multivariate is not appropriate and rnn include time as parameter and I would not want to use it first.
Do you have any ideas?
Thank you
You could try either training single-target multiple regression models, and predict the x and y variables independently. The other way to go about is to use multi-target regression-based methods. The most commonly used method using Predictive Clustering trees. You can read about various methods from https://towardsdatascience.com/regression-models-with-multiple-target-variables-8baa75aacd to start with. I hope it is somewhat helpful. :)

What activation function to use or modifications to make when neural network gives same output on regression with PyBrain?

I have a neural network with one input, three hidden neurons and one output. I have 720 input and corresponding target values, 540 for training, 180 for testing.
When I train my network using Logistic Sigmoid or Tan Sigmoid function, I get the same outputs while testing, i.e. I get same number for all 180 output values. When I use Linear activation function, I get NaN, because apparently, the value gets too high.
Is there any activation function to use in such a case? Or any improvements to be done? I can update the question with details and code if required.
Neural nets are not stable when fed input data on arbitrary scales (such as between approximately 0 and 1000 in your case). If your output units are tanh they can't even predict values outside the range -1 to 1 or 0 to 1 for logistic units!
You should try recentering/scaling the data (making it have mean zero and unit variance - this is called standard scaling in the datascience community). Since it is a lossless transformation you can revert back to your original scale once you've trained the net and predicted on the data.
Additionally, a linear output unit is probably the best as it makes no assumptions about the output space and I've found tanh units to do much better on recurrent neural networks in low dimensional input/hidden/output nets.
Newmu is right that the scaling is probably the issue here; you need to scale your inputs to lie in the valid range. (Standardization to zero mean, unit variance, as they suggest, though, isn't a great choice since that means about a third of your data will like outside [-1, 1]....) I don't know about pybrain, but in scikit-learn you'd want sklearn.preprocessing.MinMaxScaler.
But, also, in the comments you said your dataset looks like this:
where the horizontal axis is inputs, vertical is targets. So, when you see an input of 200, you have one training example saying it's 80 and one saying it's 320; what do you want it to say then? An "optimal" neural network (which may be hard to achieve) would predict 200 or so.
You may need to think about how to reframe your learning problem to be a more-consistent function from inputs to targets.

Python Neurolab - fixing output range

I am learning some model based on examples ${((x_{i1},x_{i2},....,x_{ip}),y_i)}_{i=1...N}$ using a neural network of Feed Forward Multilayer Perceptron (newff) (using python library neurolab). I expect the output of the NN to be positive for any further simulation of the NN.
How can I make sure that the results of simulation of my learned NN are always positive?
(how I do it in neurolab?)
Simply use a standard sigmoid/logistic activation function on the output neuron. sigmoid(x) > 0 forall real-valued x so that should do what you want.
By default, many neural network libraries will use either linear or symmetric sigmoid outputs (which can go negative).
Just note that it takes longer to train networks with a standard sigmoid output function. It's usually better in practice to let the values go negative and instead transform the outputs from the network into the range [0,1] after the fact (shift up by the minimum, divide by the range (aka max-min)).

Categories