Can I set the predicted values of a neural network? - python

My question is more theoretical than practical, but I can show some code as well. I have a network that maps random values in a uvw domain to an xyz domain. I want certain values of uvw to go to other certain values in xyz that I already know, since that is the idea of the function I want to obtain, and I want the network to over-fit.
My question is divided into two parts:
Can I set the predicted values I want in the network, so it doesn't have to calculate those predicted values?
Is this going to affect the prediction of the other values?
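For concreteness, here is a minimal sketch of one way to express the idea: keep the usual loss for the free points and add a strongly weighted MSE anchor term for the known pairs (uvw_known, xyz_known, and reconstruction_loss are hypothetical placeholders, not names from my code):
anchor_weight = 10.0  # how strongly to pull the known pairs toward their targets

pred_known = phi(uvw_known)                        # network output at the pinned inputs
anchor_loss = mse_loss_fun(pred_known, xyz_known)  # the hard constraint, expressed as a penalty

loss = reconstruction_loss + anchor_weight * anchor_loss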
Below is my actual training loop; I show it so we have some notation we can talk about.
# The model is a simple fully connected network mapping a 3D parameter point to 3D
phi = common.MLP(in_dim=3, out_dim=3).to(args.device)

# eps is 1/lambda and max_iters is the maximum number of Sinkhorn iterations to do
emd_loss_fun = SinkhornLoss(eps=args.sinkhorn_eps, max_iters=args.max_sinkhorn_iters,
                            stop_thresh=1e-3, return_transport_matrix=True)  # TODO: add r-1 function to the weights

mse_loss_fun = torch.nn.MSELoss()

# Rprop optimizer (this was Adam at first)
optimizer = torch.optim.Rprop(phi.parameters(), lr=args.learning_rate)

fit_start_time = time.time()
for epoch in range(args.num_epochs):
    optimizer.zero_grad()

    # Forward pass of the neural net, evaluating the function at the parametric points t
    y = phi(t)

    # Compute the Sinkhorn divergence between the reconstruction (using the francis library)
    # and the target x.
    # NOTE: the Sinkhorn function expects a batch of b point sets (i.e. tensors of shape [b, n, 3]);
    # since we only have 1, we unsqueeze so x and y have dimension [1, n, 3].
    # P is a preallocated [1, n, n] transport matrix (defined elsewhere); the first `num`
    # points are the ones whose targets are already known.
    with torch.no_grad():
        _, M = emd_loss_fun(phi(t[num:]).unsqueeze(0), x[num:].unsqueeze(0))
        _, Q = emd_loss_fun(phi(t[0:num]).unsqueeze(0), x[0:num].unsqueeze(0))
        P[0, num:, num:] = M[0]
        P[0, 0:num, 0:num] = Q[0]
    # print(y[Q.squeeze().max(0)[1], :])  # debug: inspect the matched points

    # Project the transport matrix onto the space of permutation matrices and compute the L2 loss
    # between the permuted points
    loss = mse_loss_fun(y[P.squeeze().max(0)[1], :], x)
    # loss = mse_loss_fun(P.squeeze() @ y, x)  # use the transport matrix directly

    # Take an optimizer step
    loss.backward()
    optimizer.step()

    print("Epoch %d, loss = %f" % (epoch, loss.item()))

Related

How to customize an LSTM loss function to only consider a given index range of the prediction and target sequence?

I am currently working with an LSTM sequence-to-sequence model for time-domain signal prediction. Because of domain knowledge, I know that the first part of the prediction (about 20%) can never be predicted correctly, since the information required is not available in the given input sequence. The remaining 80% of the predicted sequence is usually predicted quite well. To exclude the first 20% from the training optimization, it would be nice to define a loss function that operates on a given index range, like the numpy code below:
start = int(0.2 * sequence_length)
stop = sequence_length

def mse(pred, target):
    """Mean squared error between two time series np.arrays."""
    return 1 / target.shape[0] * np.sum((pred - target) ** 2)

def range_mse_loss(y_pred, y):
    return mse(y_pred[start:stop], y[start:stop])
How do I have to write this loss function so it works with my preexisting Keras code, where the loss is simply given by model.compile(loss='mse')?
You can slice your tensor to keep just the last 80% of the data:
size = int(y_true.shape[0] * 0.8)  # e.g. for a (100, 1) tensor, size is 80
loss_fn = tf.keras.losses.MeanSquaredError(name='mse')
loss_fn(y_true[-size:], y_pred[-size:])
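To plug this into model.compile(loss=...), you can wrap the slicing in a function taking (y_true, y_pred). A minimal sketch, assuming the sequence runs along axis 1 of a [batch, seq_len, features] tensor (adjust the axis if your tensors are laid out differently):
import tensorflow as tf

def range_mse_loss(y_true, y_pred):
    # Drop the unpredictable first 20% of each sequence, keep the last 80%.
    start = tf.shape(y_true)[1] // 5
    return tf.reduce_mean(tf.square(y_true[:, start:] - y_pred[:, start:]))

# model.compile(optimizer='adam', loss=range_mse_loss)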
You can also pass sample_weight to tf.keras.losses.MeanSquaredError(), using an array of weights whose first 20% are zero:
size = int(y_true.shape[0] * 0.8)  # e.g. for a (100, 1) tensor, size is 80
zeros = tf.zeros((y_true.shape[0] - size,), dtype=tf.float32)
ones = tf.ones((size,), dtype=tf.float32)
weights = tf.concat([zeros, ones], 0)
loss_fn = tf.keras.losses.MeanSquaredError(name='mse')
loss_fn(y_true, y_pred, sample_weight=weights)
There is a caveat with the second solution: the final loss will be lower than in the first solution, because you are zeroing out the first prediction values but not removing them from the denominator in MSE = 1/n * sum((y - y_hat)^2).
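If that bias matters, a minimal sketch of a correction is to rescale by n / n_effective, so the zeroed-out terms no longer dilute the average (this assumes the default SUM_OVER_BATCH_SIZE reduction):
n = tf.cast(tf.size(weights), tf.float32)
n_effective = tf.reduce_sum(weights)  # number of terms with weight 1
corrected_loss = loss_fn(y_true, y_pred, sample_weight=weights) * n / n_effective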
One workaround is to mark those observations as None/NaN and then override the train_step method. Following TensorFlow's tutorial on customizing train_step, you would do something like this:
@tf.function
def train_step(keras_model, data):
    print('custom train_step')  # printed only when the function is (re)traced
    # Unpack the data. Its structure depends on your model and
    # on what you pass to `fit()`.
    x, y = data
    with tf.GradientTape() as tape:
        y_pred = keras_model(x, training=True)  # Forward pass
        # Mask NaN values in the observations, here also assuming that valid targets are > 0.0
        mask = tf.greater(y, 0.0)
        true_y = tf.boolean_mask(y, mask)
        pred_y = tf.boolean_mask(y_pred, mask)
        # Compute the loss value
        # (the loss function is configured in `compile()`)
        loss = keras_model.compiled_loss(true_y, pred_y, regularization_losses=keras_model.losses)
    # Compute gradients
    trainable_vars = keras_model.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    # Update weights
    keras_model.optimizer.apply_gradients(zip(gradients, trainable_vars))
    # Update metrics (includes the metric that tracks the loss)
    keras_model.compiled_metrics.update_state(true_y, pred_y)
    # Return a dict mapping metric names to current value
    return {m.name: m.result() for m in keras_model.metrics}
This will work for all the performance metrics you are tracking. An alternative would be to mask the NaNs inside the loss function itself, but that would be limited to a single loss function and would not cover other loss functions or performance metrics.
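To actually hook this in, the usual pattern from the same tutorial is to subclass keras.Model and override train_step as a method. A minimal sketch that delegates to the function above (inputs, outputs, x_train, and y_train are placeholders for your existing model and data):
import tensorflow as tf

class MaskedModel(tf.keras.Model):
    def train_step(self, data):
        # Reuse the standalone function above, passing the model itself;
        # the global name resolves to the module-level train_step.
        return train_step(self, data)

model = MaskedModel(inputs, outputs)   # your existing functional-API tensors
model.compile(optimizer='adam', loss='mse')
model.fit(x_train, y_train)            # fit() now runs the masked train_step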

Can you reverse a PyTorch neural network and activate the inputs from the outputs?

Can we activate the outputs of a NN to gain insight into how the neurons are connected to the input features?
If I take a basic NN example from the PyTorch tutorials, here is a training example mapping x to y:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
After I've finished training the network to predict y from x inputs, is it possible to reverse the trained NN so that it can now predict x from y inputs?
I don't expect the result to match the original inputs that produced the y outputs. I just expect to see what features the model activates on to match x and y.
If it is possible, how do I rearrange the Sequential model without breaking all the weights and connections?
It is possible, but only for very special cases. For a feed-forward network (Sequential), each of the layers needs to be reversible; that means the following arguments apply to each layer separately. The transformation associated with one layer is y = activation(W*x + b), where W is the weight matrix and b the bias vector. To solve for x we need to perform the following steps:
Reverse the activation; not all activation functions have an inverse, though. For example, the ReLU function does not have an inverse on (-inf, 0). If we use tanh, on the other hand, we can use its inverse, which is 0.5 * log((1 + x) / (1 - x)).
Solve W*x = inverse_activation(y) - b for x; for a unique solution to exist, W must be square and det(W) must be non-zero. We can control the former by choosing a specific network architecture, while the latter depends on the training process.
So for a neural network to be reversible, it must have a very specific architecture: all layers must have the same number of input and output neurons (i.e. square weight matrices), and all the activation functions need to be invertible.
Code: Using PyTorch, we have to invert the network manually, both solving the system of linear equations and applying the inverse activation function. Consider the following example of a 1-layer neural network (since the steps apply to each layer separately, extending this to more than one layer is trivial):
import torch

N = 10  # number of samples
n = 3   # number of neurons per layer

x = torch.randn(N, n)
model = torch.nn.Sequential(
    torch.nn.Linear(n, n), torch.nn.Tanh()
)
y = model(x)

z = y  # use 'z' for the reverse result, starting from the model's output 'y'
for step in list(model.children())[::-1]:
    if isinstance(step, torch.nn.Linear):
        z = z - step.bias[None, ...]
        z = z[..., None]  # the solver expects N column vectors (i.e. shape (N, n, 1))
        z = torch.linalg.solve(step.weight, z)  # in older PyTorch: torch.solve(z, step.weight)[0]
        z = torch.squeeze(z)  # remove the extra dimension added for the solver
    elif isinstance(step, torch.nn.Tanh):
        z = 0.5 * torch.log((1 + z) / (1 - z))

print('Agreement between x and z:', torch.dist(x, z))
If I've understood correctly, there are two questions here:
Is it possible to determine what features in the input have activated neurons?
If so, is it possible to use this information to generate samples from p(x|y)?
Regarding 1, a basic way to determine whether a neuron is sensitive to an input feature x_i is to compute the gradient of that neuron's output w.r.t. x_i. A high gradient indicates sensitivity to a particular input element. There is a rich literature on the subject; for example, you can have a look at guided backpropagation or at Grad-CAM (the latter is about classification with convnets, but it does contain useful ideas).
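As a minimal sketch of that gradient check in PyTorch (generic saliency code, not part of the question; the chosen neuron index is arbitrary):
import torch

x = torch.randn(1, 1000, requires_grad=True)  # one input sample for the model above
out = model(x)                                # 'model' is the trained Sequential network
out[0, 3].backward()                          # pick one output neuron, e.g. index 3

# x.grad now holds d(neuron_3)/d(x_i) for every input feature i;
# large magnitudes mark the features this neuron is sensitive to.
sensitivity = x.grad.abs().squeeze()
print(sensitivity.topk(10).indices)           # the 10 most influential input features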
As for 2, I don't think your approach to "reversing the problem" is correct. The problem is that your network is discriminative, and what it outputs can be seen as argmax_y p(y|x). Note that this is a point-wise estimate, not a full model of the distribution. However, the inverse problem you're interested in seems to be sampling from
p(x|y) = constant * p(y|x) * p(x).
You don't know how to sample from p(y|x), and you don't know anything about p(x). Even if you use a method to discover correlations between the neurons and specific input features, you have only discovered which features were most important to the network's prediction, and depending on the nature of y this might be insufficient. Consider a toy example where your inputs x are 2D points distributed according to some distribution in R^2, and where the output y is binary, such that any (a, b) in R^2 is classified as 1 if a < 1 and as 0 if a > 1. Then a discriminative network could learn the vertical line a = 1 as its decision boundary. Inspecting correlations between neurons and input features would reveal that only the first coordinate was useful for this prediction, but this information is not sufficient for sampling from the full 2D distribution of inputs.
I think that variational autoencoders could be what you're looking for.

Correct way to use custom weight maps in unet architecture

There is a famous trick in the U-Net architecture of using custom per-pixel weight maps to increase accuracy; the details are the weight-map formulation given in the original U-Net paper.
Now, by asking here and in multiple other places, I have learned about 2 approaches. I want to know which one is correct, or whether there is another, more correct approach.
The first is to use torch.nn.functional inside the training loop:
loss = torch.nn.functional.cross_entropy(output, target, w), where w is the computed custom weight.
The second is to use reduction='none' when instantiating the loss function outside the training loop:
criterion = torch.nn.CrossEntropyLoss(reduction='none')
and then, in the training loop, multiplying by the custom weight:
gt  # ground truth, dtype torch.long
pd  # network output
W   # per-element weighting based on the distance map from the U-Net paper
loss = criterion(pd, gt)
loss = W * loss  # ensure that the weights are scaled appropriately
loss = torch.sum(loss.flatten(start_dim=1), dim=1)  # sum the loss per image
loss = torch.mean(loss)  # average across the batch
Now, I am kinda confused: which one is right, is there another way, or are both right?
The weighting portion looks like simply weighted cross-entropy, which is performed like this for a given number of classes (2 in the example below):
weights = torch.FloatTensor([0.3, 0.7])
loss_func = nn.CrossEntropyLoss(weight=weights)
EDIT:
Have you seen this implementation from Patrick Black?
# Set properties
batch_size = 10
out_channels = 2
W = 10
H = 10

# Initialize logits etc. with random values
logits = torch.FloatTensor(batch_size, out_channels, H, W).normal_()
target = torch.LongTensor(batch_size, H, W).random_(0, out_channels)
weights = torch.FloatTensor(batch_size, 1, H, W).random_(1, 3)

# Calculate log probabilities
logp = F.log_softmax(logits, dim=1)

# Gather log probabilities with respect to target
logp = logp.gather(1, target.view(batch_size, 1, H, W))

# Multiply with weights
weighted_logp = (logp * weights).view(batch_size, -1)

# Rescale so that the loss is in approximately the same interval
weighted_loss = weighted_logp.sum(1) / weights.view(batch_size, -1).sum(1)

# Average over the mini-batch
weighted_loss = -1. * weighted_loss.mean()
Note that torch.nn.CrossEntropyLoss() is a class that calls torch.nn.functional.cross_entropy.
See https://pytorch.org/docs/stable/_modules/torch/nn/modules/loss.html#CrossEntropyLoss
You can pass the weights when you define the criterion. Comparing them functionally, both methods are the same.
Now, I do not understand your idea of computing the loss inside the training loop in method 1 and outside the training loop in method 2. If you compute the loss outside the loop, how will you backpropagate?
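A quick sketch to check that equivalence (per-class weights here, since that is what both APIs accept directly; the tensors are made up for the test):
import torch
import torch.nn.functional as F

out = torch.randn(4, 2)          # logits for 4 samples, 2 classes
tgt = torch.randint(0, 2, (4,))  # class indices
w = torch.tensor([0.3, 0.7])     # per-class weights

loss_functional = F.cross_entropy(out, tgt, weight=w)
loss_module = torch.nn.CrossEntropyLoss(weight=w)(out, tgt)
print(torch.allclose(loss_functional, loss_module))  # True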

Python - Neural Network Approximating Sphere Function

I am new to neural networks, and I am using an example neural network I found online to attempt to approximate the sphere function (the sum of a set of squared numbers) using backpropagation.
The initial code is:
from numpy import array, dot, exp, random

class NeuralNetwork():
    def __init__(self):
        # Seed the random number generator, so it generates the same numbers
        # every time the program is run.
        # random.seed(1)

        # Model a single neuron, with 2 input connections and 1 output connection.
        # We assign random weights to a 2 x 1 matrix, with values in the range -1 to 1
        # and mean 0.
        self.synaptic_weights = 2 * random.random((2, 1)) - 1

    # The sigmoid function, which describes an S-shaped curve.
    # We pass the weighted sum of the inputs through this function to
    # normalise them between 0 and 1.
    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))

    # The derivative of the sigmoid function.
    # This is the gradient of the sigmoid curve.
    # It indicates how confident we are about the existing weight.
    def __sigmoid_derivative(self, x):
        return x * (1 - x)

    # Train the network through a process of trial and error,
    # adjusting the synaptic weights each time.
    def train(self, training_set_inputs, training_set_outputs, number_of_training_iterations):
        for iteration in range(number_of_training_iterations):
            # Pass the training set through our neural network (a single neuron).
            output = self.think(training_set_inputs)

            # Calculate the error (difference between the desired output and the predicted output).
            error = training_set_outputs - output

            # Multiply the error by the input and again by the gradient of the sigmoid curve.
            # This means less confident weights are adjusted more, and
            # inputs that are zero do not cause changes to the weights.
            adjustment = dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))

            # Adjust the weights.
            self.synaptic_weights += adjustment

    # The neural network thinks.
    def think(self, inputs):
        # Pass inputs through our neural network (our single neuron).
        return self.__sigmoid(dot(inputs, self.synaptic_weights))

if __name__ == "__main__":
    # Initialise a single-neuron neural network.
    neural_network = NeuralNetwork()

    print("Random starting synaptic weights:")
    print(neural_network.synaptic_weights)

    # The training set: 3 examples, each consisting of 2 input values and 1 output value.
    training_set_inputs = array([[0, 1], [1, 0], [0, 0]])
    training_set_outputs = array([[1, 1, 0]]).T

    # Train the neural network using the training set.
    # Do it 10,000 times and make small adjustments each time.
    neural_network.train(training_set_inputs, training_set_outputs, 10000)

    print("New synaptic weights after training:")
    print(neural_network.synaptic_weights)

    # Test the neural network with a new situation.
    print("Considering new situation [1, 1] -> ?:")
    print(neural_network.think(array([1, 1])))
My aim is to input training data(the sphere function input and outputs) into the neural network to train it and meaningfully adjust the weights. After continuous training the weights should reach a point where reasonably accurate results are given from the training inputs.
I imagine an example of some training sets for the sphere function would be something like:
training_set_inputs = array([[2, 1], [3, 2], [4, 6], [8, 3]])
training_set_outputs = array([[5, 13, 52, 73]]).T
The example I found online can successfully approximate the XOR operation, but when given sphere-function inputs it only gives me an output of 1 when tested on a new example (for example, [6, 7], which should ideally return an approximation of around 85).
From what I have read about neural networks, I suspect this is because I need to normalize the inputs, but I am not entirely sure how to do this. Any help with this, or something to point me in the right direction, would be appreciated a lot. Thank you.
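For reference, a minimal sketch of one common normalization scheme: min-max scale the inputs and targets into [0, 1] (a sigmoid output can never exceed 1, so an unscaled target like 85 is unreachable), then invert the scaling on predictions. The scale constants here are assumptions about the data range:
from numpy import array

x_max = 10.0   # assumed upper bound on each input coordinate
y_max = 200.0  # assumed upper bound on the output (2 * x_max ** 2)

training_set_inputs = array([[2, 1], [3, 2], [4, 6], [8, 3]]) / x_max
training_set_outputs = array([[5, 13, 52, 73]]).T / y_max

# ... train as before, then undo the scaling when reading off a prediction:
# prediction = neural_network.think(array([6, 7]) / x_max) * y_max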

Why does this backpropagation implementation fail to train weights correctly?

I've written the following backpropagation routine for a neural network, using the code here as an example. The issue I'm facing is confusing me and has pushed my debugging skills to their limit.
The problem I am facing is rather simple: as the neural network trains, its weights are driven to zero with no gain in accuracy.
I have attempted to fix it many times, verifying that:
the training sets are correct
the target vectors are correct
the forward step is recording information correctly
the backward step deltas are recording properly
the signs on the deltas are correct
the weights are indeed being adjusted
the deltas of the input layer are all zero
there are no other errors or overflow warnings
Some information:
The training inputs are an 8x8 grid of [0,16) values representing an intensity; this grid represents a numeral digit (converted to a column vector)
The target vector is an output that is 1 in the position corresponding to the correct number
The original weights and biases are being assigned by Gaussian distribution
The activations are a standard sigmoid
I'm not sure where to go from here. I've verified that everything I know to check is operating correctly, and it's still not working, so I'm asking here. The following is the code I'm using to backpropagate:
def backprop(train_set, wts, bias, eta):
    learning_coef = eta / len(train_set[0])
    for next_set in train_set:
        # These record the sum of the cost gradients in the batch
        sum_del_w = [np.zeros(w.shape) for w in wts]
        sum_del_b = [np.zeros(b.shape) for b in bias]

        for test, sol in next_set:
            del_w = [np.zeros(wt.shape) for wt in wts]
            del_b = [np.zeros(bt.shape) for bt in bias]

            # These two helper functions take training set data and make them useful
            next_input = conv_to_col(test)
            outp = create_tgt_vec(sol)

            # Feedforward step
            pre_sig = []; post_sig = []
            for w, b in zip(wts, bias):
                next_input = np.dot(w, next_input) + b
                pre_sig.append(next_input)
                post_sig.append(sigmoid(next_input))
                next_input = sigmoid(next_input)

            # Backpropagation gradient
            delta = cost_deriv(post_sig[-1], outp) * sigmoid_deriv(pre_sig[-1])
            del_b[-1] = delta
            del_w[-1] = np.dot(delta, post_sig[-2].transpose())

            for i in range(2, len(wts)):
                pre_sig_vec = pre_sig[-i]
                sig_deriv = sigmoid_deriv(pre_sig_vec)

                delta = np.dot(wts[-i+1].transpose(), delta) * sig_deriv
                del_b[-i] = delta
                del_w[-i] = np.dot(delta, post_sig[-i-1].transpose())

            sum_del_w = [dw + sdw for dw, sdw in zip(del_w, sum_del_w)]
            sum_del_b = [db + sdb for db, sdb in zip(del_b, sum_del_b)]

        # Modify weights based on current batch
        wts = [wt - learning_coef * dw for wt, dw in zip(wts, sum_del_w)]
        bias = [bt - learning_coef * db for bt, db in zip(bias, sum_del_b)]

    return wts, bias
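For reference, hedged sketches of the helpers the routine relies on, inferred from the description above (these are assumptions, not the original implementations):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

def cost_deriv(output, target):
    # Derivative of the quadratic cost 0.5 * ||output - target||^2
    return output - target

def conv_to_col(grid):
    # Flatten the 8x8 intensity grid into a (64, 1) column vector
    return np.asarray(grid, dtype=float).reshape(-1, 1)

def create_tgt_vec(sol, n_classes=10):
    # One-hot column vector with a 1 in the position of the correct digit
    vec = np.zeros((n_classes, 1))
    vec[sol] = 1.0
    return vec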
Following Shep's suggestion, I checked what happens when training a network of shape [2, 1, 1] to always output 1, and indeed the network trains properly in that case. My best guess at this point is that the gradient adjusts too strongly for the 0s and too weakly for the 1s, resulting in a net decrease despite an increase at each step, but I'm not sure.
I suspect your problem is in the choice of initial weights and in the weight-initialization algorithm. Jeff Heaton, the author of Encog, claims that simple range randomization usually performs worse than other initialization methods, and there are other published comparisons of weight-initialization performance as well. From my own experience, I also recommend initializing your weights with mixed signs: even in cases where all the outputs were positive, weights with mixed signs performed better than weights with the same sign.
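A minimal sketch of a mixed-sign, scaled Gaussian (Xavier/Glorot-style) initialization in NumPy (the layer sizes are placeholders matching an 8x8-pixel, 10-digit setup):
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Zero-mean Gaussian with std sqrt(2 / (n_in + n_out)), which keeps activations
    # in a reasonable range and naturally produces weights of mixed sign.
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

wts = [init_layer(64, 30), init_layer(30, 10)]  # 64 inputs -> 30 hidden -> 10 digits
bias = [np.zeros((30, 1)), np.zeros((10, 1))]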
