I'm having trouble teaching a neural network the XOR logic function. I've already trained the network successfully using the hyperbolic tangent and ReLU as activation functions (I know ReLU isn't the most appropriate choice for this kind of problem, but I still wanted to test it). However, I can't make it work with the logistic function. My definition of the function is:
def logistic(data):
    return 1.0 / (1.0 + np.exp(-data))
and its derivative:
def logistic_prime(data):
    output = logistic(data)
    return output * (1.0 - output)
where np is the name given to the imported NumPy package. Since XOR only uses 0's and 1's, the logistic function should be an appropriate activation function. Still, the results I get are close to 0.5 in all cases, i.e. any input combination of 0's and 1's results in a value close to 0.5. Is there any error in my reasoning?
Don't hesitate to ask me for more context or more code. Thanks in advance.
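One quick way to rule out the activation itself is to compare logistic_prime against a finite-difference estimate. The following is only a sketch: the test grid and the step size eps are arbitrary choices, not something from the original network.

```python
import numpy as np

def logistic(data):
    return 1.0 / (1.0 + np.exp(-data))

def logistic_prime(data):
    output = logistic(data)
    return output * (1.0 - output)

# Central finite difference as an independent estimate of the derivative.
x = np.linspace(-5.0, 5.0, 11)   # arbitrary test points
eps = 1e-6                       # arbitrary small step
numeric = (logistic(x + eps) - logistic(x - eps)) / (2.0 * eps)
max_abs_error = np.max(np.abs(numeric - logistic_prime(x)))
print(max_abs_error)
```

If the error is tiny, the two definitions agree with each other and the problem most likely lies elsewhere. One thing that may be worth checking: logistic_prime as written expects the pre-activation value, so passing it an already activated output would apply the logistic twice.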
I had the same problem as you.
The problem happens when the data cannot be divided by a linear hyperplane.
Try training on this data:
X = [[-1,0],[0,1],[1,0],[0,-1]]
Y = [1,0,1,0]
If you plot these points on a coordinate plane, you will find that they are not linearly separable.
Train a logistic model on them and the parameters all end up close to 0, with results close to 0.5.
By contrast, Y = [1,1,0,0] is a linearly separable example,
and the logistic model works.
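The two cases above can be reproduced with a small NumPy sketch. It assumes a single logistic unit with no hidden layer, trained by plain gradient descent on the cross-entropy loss; the helper name train_logistic_unit, the learning rate, the step count, and the seed are all made up for illustration. A network with a hidden layer can behave differently, so this only illustrates the linear case.

```python
import numpy as np

def train_logistic_unit(X, Y, lr=0.5, steps=2000, seed=0):
    """Fit a single logistic unit (no hidden layer) by gradient descent."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])   # small random start
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - Y                              # cross-entropy gradient w.r.t. the pre-activation
        w -= lr * (X.T @ err) / len(Y)
        b -= lr * err.mean()
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return w, b, p

X = np.array([[-1, 0], [0, 1], [1, 0], [0, -1]], dtype=float)
Y_xor = np.array([1.0, 0.0, 1.0, 0.0])   # not linearly separable
Y_sep = np.array([1.0, 1.0, 0.0, 0.0])   # linearly separable

w_xor, b_xor, p_xor = train_logistic_unit(X, Y_xor)
w_sep, b_sep, p_sep = train_logistic_unit(X, Y_sep)
print("non-separable:", w_xor, p_xor)
print("separable:    ", w_sep, p_sep)
```

In the first case the weights shrink towards zero and every prediction sits at about 0.5; in the second the predictions land on the correct side of 0.5.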
Hello StackOverflow people,
I am encountering a problem where I don't know what else I can try. First off, I am using a custom loss function (at least I believe that is the problem, but maybe it's something different?) for a mixture density network:
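# Assuming the usual aliases for TensorFlow and TensorFlow Probability
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def nll_loss(mu, sigma, alpha, y):
    # Gaussian mixture with mixing weights alpha and per-component mu, sigma
    gm = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=alpha),
        components_distribution=tfd.Normal(
            loc=mu,
            scale=sigma))
    log_likelihood = gm.log_prob(tf.transpose(y))
    return -tf.reduce_mean(log_likelihood, axis=-1)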
The funny thing is, the network randomly collapses after a varying amount of training.
The things I already tried and checked:
All my input data is scaled between 0 and 1 (both x and y)
I tried multiplying the y's and adding an integer to those, so the distance to zero is increased
Different optimizers
Clipping optimizers
Clipping loss function
Setting the learning rate to 0! (That one puzzles me the most, as I am sure my inputs are correct)
Adding Batch Normalization to every layer of my network
Does anyone have an idea why this is happening? What am I missing? Thank you!
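For reference, here is a plain NumPy sketch of the quantity a loss like nll_loss computes, written with the log-sum-exp trick. It is not TensorFlow Probability's implementation; the function name mixture_nll, the (batch, K) shapes, and the toy values are assumptions made for illustration.

```python
import numpy as np

def mixture_nll(mu, sigma, alpha, y):
    """Mean negative log-likelihood of y under a 1-D Gaussian mixture.

    mu, sigma, alpha: arrays of shape (batch, K); y: array of shape (batch,).
    """
    y = y[:, None]
    # Log-density of each Gaussian component at y
    log_normal = (-0.5 * ((y - mu) / sigma) ** 2
                  - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    log_terms = np.log(alpha) + log_normal                 # shape (batch, K)
    # log-sum-exp over components, shifted by the max for numerical stability
    m = log_terms.max(axis=1, keepdims=True)
    log_likelihood = m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))
    return -log_likelihood.mean()

# Toy values, chosen arbitrarily
mu = np.array([[0.2, 0.8], [0.5, 0.5]])
sigma = np.array([[0.1, 0.2], [0.3, 0.1]])
alpha = np.array([[0.3, 0.7], [0.5, 0.5]])
y = np.array([0.25, 0.4])
loss_value = mixture_nll(mu, sigma, alpha, y)
print(loss_value)
```

Written out this way, the terms that can blow up are log(sigma) and the division by sigma when sigma gets close to zero, and log(alpha) when a mixing weight reaches exactly zero. Whether either of those is what happens in the network above is only a guess.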
I am working on a neural network in PyTorch that simply maps points from the plane to real numbers, for example
model = nn.Sequential(nn.Linear(2,2),nn.ReLU(),nn.Linear(2,1))
What I want to do, since this network defines a map h:R^2->R, is to compute the gradient of this mapping h in the training loop. So for example
for it in range(epochs):
    pred = model(X_train)
    grad = torch.autograd.grad(pred,X_train)
    ....
The training set has been defined as a tensor requiring gradients. My problem is that even though the output for each fixed point is a scalar, I am propagating a set of N=100 points, so the output is actually an Nx1 tensor. This leads to an error saying that autograd can only compute the gradient of scalar functions.
In fact, with this small change
pred = torch.sum(model(X_train))
everything works perfectly. However, I am interested in all the individual gradients, so is there a way to compute all of them together?
Of course, computing the sum as shown above gives exactly the result I expect, but I wanted to know whether this is the only possibility.
There are other possibilities, but using .sum() is the simplest way. Calling .sum() on the final loss vector and computing dpred/dinput will give you the desired output. Here is why:
Since pred = sum(loss) = sum(f(xi)),
where i is the index of input x,
dpred/dinput will be a matrix [dpred/dx0, dpred/dx1, dpred/dx...].
Consider dpred/dx0: it is equal to df(x0)/dx0, since df(xi)/dx0 is 0 for every other i.
PS: Please excuse the crappy mathematical expressions... SO does not support latex/math expressions.
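The argument above can be checked numerically. This is a NumPy stand-in rather than PyTorch code: the random weights mimic the 2-2-1 ReLU network from the question, and the helper names h, per_sample_grads, and grad_of_sum are invented for this sketch. It compares the gradient of the summed output (by finite differences) with the gradient of each individual output (computed analytically).

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)   # hidden layer, like nn.Linear(2, 2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)   # output layer, like nn.Linear(2, 1)

def h(X):
    """Apply the 2-2-1 ReLU network to each row of X; returns shape (N,)."""
    hidden = np.maximum(X @ W1.T + b1, 0.0)
    return (hidden @ W2.T + b2)[:, 0]

def per_sample_grads(X):
    """Analytic gradient of h at each row of X, shape (N, 2)."""
    active = (X @ W1.T + b1 > 0.0).astype(float)        # ReLU derivative per hidden unit
    return (active * W2[0]) @ W1

def grad_of_sum(X, eps=1e-6):
    """Finite-difference gradient of sum(h(X)) w.r.t. every entry of X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        G[idx] = (h(X + step).sum() - h(X - step).sum()) / (2.0 * eps)
    return G

X = rng.normal(size=(5, 2))
print(np.abs(grad_of_sum(X) - per_sample_grads(X)).max())
```

Row i of the summed gradient matches the gradient of h at point i, because the other outputs do not depend on that point. In PyTorch the same idea would be written as torch.autograd.grad(model(X_train).sum(), X_train)[0], which returns one gradient row per input point.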
The code for the loss function in scikit-learn's logistic regression is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
However, it seems to be different from the common form of the logarithmic loss function, which reads:
-(y log(p) + (1-y) log(1-p))
(please see http://wiki.fast.ai/index.php/Log_Loss)
Could anyone tell me how to understand the code for the loss function in scikit-learn's logistic regression, and how it relates to the general form of the logarithmic loss function?
Thank you in advance.
First, note that 0.5 * alpha * np.dot(w, w) is just the L2 regularization term. Setting it aside, the sklearn logistic regression loss reduces to the following
-np.sum(sample_weight * log_logistic(yz))
Also, the np.sum is there because it considers multiple samples, so for a single sample it again reduces to
-sample_weight * log_logistic(yz)
Finally, if you read the documentation, you will note that sample_weight is an optional array of weights assigned to individual samples. If it is not provided, each sample is given unit weight. So it should be equal to one (the original definition of cross-entropy loss does not consider unequal weights for different samples), hence the loss reduces to:
- log_logistic(yz)
which is equivalent to
- log_logistic(y * np.dot(X, w)).
Now, why does it look different (in essence it is the same) from the cross-entropy loss function, i.e.:
- [y log(p) + (1-y) log(1-p)].
The reason is that we can use either of two labeling conventions for binary classification, {0, 1} or {-1, 1}, which results in the two different representations. But they are the same!
More details (on why they are the same) can be found HERE. Note that you should read the response by Manuel Morales.
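The equivalence is easy to check numerically. In the sketch below, log_logistic is a local re-implementation (not the scikit-learn helper), z stands in for np.dot(X, w), and the random values are arbitrary. With p = 1/(1 + exp(-z)), a label of +1 gives -log(p) and a label of -1 gives -log(1 - p), which is exactly what the {0, 1} form produces for y = 1 and y = 0.

```python
import numpy as np

def log_logistic(t):
    """log(1 / (1 + exp(-t))), computed stably as -log(1 + exp(-t))."""
    return -np.logaddexp(0.0, -t)

rng = np.random.default_rng(0)
z = rng.normal(size=8)                    # stand-in for np.dot(X, w)
y_pm = rng.choice([-1.0, 1.0], size=8)    # labels in {-1, 1}
y_01 = (y_pm + 1.0) / 2.0                 # the same labels in {0, 1}

p = 1.0 / (1.0 + np.exp(-z))
loss_pm = -log_logistic(y_pm * z)                                # sklearn-style form
loss_01 = -(y_01 * np.log(p) + (1.0 - y_01) * np.log(1.0 - p))   # textbook log loss
print(np.abs(loss_pm - loss_01).max())
```

The two per-sample losses agree to floating-point precision.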
I've recently had to tweak a neural network. Here's how it works:
Given an image as input, several layers turn it into a mean matrix mu and a covariance matrix sigma.
Then, a sample z is drawn from the Gaussian distribution with parameters mu and sigma.
Several layers turn this sample into an output.
This output is compared to a given image, which gives a cost.
What I want to do is to keep mu and sigma, take multiple samples z, propagate them through the rest of the NN, and compare the multiple images I get to a given image.
Note that the step z -> image output calls another package, which I'd rather not have to dig into...
What I did so far:
At first, I thought I did not need to go through all this hassle: if I take a batch_size of one, it is as if I'm doing Monte Carlo by running the NN multiple times. But I actually need the neural net to try several images before updating the weights, and thus changing mu and sigma.
I simply sampled multiple z and propagated them through the net. But I soon discovered that I was duplicating all the layers, making the code terribly slow and, above all, preventing me from taking as many samples as the MC I'm aiming at requires.
Of course, I updated the loss and data input classes to take that into account.
Do you have any ideas? Basically, I'd like an efficient way to run z -> output multiple times. I still have a lot to learn about tensorflow and keras, so I'm a little bit lost on how to do that. As usual, I apologize if an answer already exists somewhere; I did my best to look for one myself!
Ok, my question was a bit stupid. So as not to duplicate layers, I created multiple slice layers, and then I simply propagated them through the net with previously declared layers. Here's my code :
# First declare layers
a = layer_A()
b = layer_B()
# And so on ...
# Generate samples
samples = generate_samples()([mu, sigma])
# For each Monte Carlo sample, slice it out and reuse the same layer instances
all_output = []
for i in range(mc_samples):
    cur_sample = Lambda(lambda x: K.slice(x, (0, 0, 0, 2*i), (-1, -1, -1, 2)), name="slice-%i" % i)(samples)
    cur_output = a(cur_sample)
    cur_output = b(cur_output)
    all_output.append(cur_output)
output_of_net = keras.layers.concatenate(all_output)
return Model(inputs=inputs, outputs=output_of_net)
Simply loop over the last dimension in the loss function, average, and you're done ! A glimpse at my loss :
loss = 0
for i in range(mc_samples):
    loss += f(y_true[..., i], y_pred[..., i])
return loss/mc_samples
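To make the averaging step concrete, here is a standalone NumPy sketch of the same loop. The per-sample cost f is not shown in the post, so plain MSE is used as a hypothetical stand-in, and the sketch assumes one channel per Monte Carlo sample along the last axis with the target repeated to match.

```python
import numpy as np

def f(y_true, y_pred):
    """Hypothetical per-sample cost (plain MSE); the real f is not shown in the post."""
    return np.mean((y_true - y_pred) ** 2)

def mc_loss(y_true, y_pred, mc_samples):
    """Average the cost over Monte Carlo samples stacked along the last axis."""
    loss = 0.0
    for i in range(mc_samples):
        loss += f(y_true[..., i], y_pred[..., i])
    return loss / mc_samples

rng = np.random.default_rng(0)
mc_samples = 4
target = rng.random((2, 3, 3))                               # (batch, height, width)
y_true = np.repeat(target[..., None], mc_samples, axis=-1)   # same target for every sample
y_pred = rng.random((2, 3, 3, mc_samples))                   # one prediction per sample
loss_value = mc_loss(y_true, y_pred, mc_samples)
print(loss_value)
```

With MSE as the cost and equally sized slices, the loop gives the same number as a single mean over the whole stacked tensor; for other choices of f that shortcut may not hold.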
I am new to this field of study, and this is probably a pretty silly question. I want to build a standard ANN, but I am not sure whether I can use a weighted mean squared error as the loss function.
If we are not treating each sample equally, that is, if we care more about prediction accuracy for some categories of samples than for others, then we want to form a weighted loss function.
Let's say we have a categorical feature ci, where i is the index of the sample, and for simplicity we assume this feature takes a binary value, either 0 or 1. So, we can form the loss function as
(ci + 1)(yi_hat - yi)^2
#and take the sum for all i
Is there going to be any problem with back-propagation? I don't see any issue with calculating the gradient or updating the weights between layers.
And if there is no issue, how can I program this loss function in Keras? It seems that the loss function only takes two parameters, y_true and y_pred, so how can I plug in the vector c?
There is absolutely nothing wrong with that. Functions can declare the constants within themselves or even take the constants from an outside scope:
import keras
import keras.backend as K
# c is taken from the enclosing scope; it must broadcast against the per-sample MSE
c = K.constant([c1,c2,c3,c4,...,cn])
def weighted_loss(y_true,y_pred):
    loss = keras.losses.get('mse')
    return c * loss(y_true,y_pred)
Exactly like yours:
def weighted_loss(y_true,y_pred):
    weighted = (c+1)*K.square(y_true-y_pred)
    return K.sum(weighted)
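On the back-propagation question, a small NumPy sketch may help show why nothing breaks: the weights only scale each sample's contribution to the gradient. The function names mirror the Keras version above but are plain NumPy, and the sample values are made up.

```python
import numpy as np

def weighted_loss(c, y_true, y_pred):
    """Sum over i of (c_i + 1) * (yhat_i - y_i)^2."""
    return np.sum((c + 1.0) * (y_true - y_pred) ** 2)

def weighted_loss_grad(c, y_true, y_pred):
    """Analytic gradient w.r.t. y_pred: 2 * (c_i + 1) * (yhat_i - y_i)."""
    return 2.0 * (c + 1.0) * (y_pred - y_true)

# Made-up values: c marks which samples count double
c = np.array([0.0, 1.0, 1.0, 0.0])
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.0, 3.5, 2.0])

loss_value = weighted_loss(c, y_true, y_pred)
grad_value = weighted_loss_grad(c, y_true, y_pred)
print(loss_value)   # 6.75
print(grad_value)   # gradient entries: 1, -4, 2, -4
```

Samples with c = 1 simply get twice the gradient of an unweighted squared error. As a side note, Keras also accepts per-sample weights through the sample_weight argument of model.fit, which may be a simpler route when c is known for every training sample.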