Efficient batch derivative operations in PyTorch

I am using PyTorch to implement a neural network that has (say) 5 inputs and 2 outputs:
import torch
import torch.nn as nn
class myNetwork(nn.Module):
    def __init__(self):
        super(myNetwork, self).__init__()
        self.layer1 = nn.Linear(5, 32)
        self.layer2 = nn.Linear(32, 2)
    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x
Obviously, I can feed this an (N x 5) Tensor and get an (N x 2) result,
net = myNetwork()
nbatch = 100
inp = torch.rand([nbatch,5])
inp.requires_grad = True
out = net(inp)
I would now like to compute the derivatives of the NN output with respect to one element of the input vector (let's say the 5th element), for each example in the batch. I know I can calculate the derivatives of one element of the output with respect to all inputs using torch.autograd.grad, and I could use this as follows:
deriv = torch.zeros([nbatch,2])
for i in range(nbatch):
    for j in range(2):
        deriv[i,j] = torch.autograd.grad(out[i,j],inp,retain_graph=True)[0][i,4]
However, this seems very inefficient: it calculates the gradient of out[i,j] with respect to every single element in the batch, and then discards all except one. Is there a better way to do this?

Because of how backpropagation works, computing the gradient with respect to a single input would not save much: you would only save some work in the first layer, since all later layers have to be backpropagated through either way.
So this may not be the optimal way, but it doesn't actually create much overhead, especially if your network has many layers.
By the way, is there a reason you need to loop over nbatch? If you wanted the gradient of each batch element with respect to a parameter, I could understand that, because PyTorch will lump them together, but you seem to be interested solely in the input...
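One possible way to drop the loop over nbatch is sketched below. It assumes that each row of out depends only on the matching row of inp, which holds for the network in the question but not for layers that mix examples, such as batch normalization. Under that assumption, summing one output column over the batch and differentiating once yields every example's derivative in a single backward pass. The nn.Sequential model is only a stand-in for myNetwork.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for myNetwork: 5 inputs, 2 outputs.
net = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 2))
nbatch = 100
inp = torch.rand(nbatch, 5, requires_grad=True)
out = net(inp)

# One backward pass per output column instead of one per (example, output) pair.
# Row i of the gradient of out[:, j].sum() is d out[i, j] / d inp[i, :],
# because out[i, j] does not depend on any other row of inp.
deriv = torch.stack(
    [torch.autograd.grad(out[:, j].sum(), inp, retain_graph=True)[0][:, 4]
     for j in range(out.shape[1])],
    dim=1,
)  # shape (nbatch, 2): derivative w.r.t. the 5th input element
```

This needs 2 backward passes rather than 2 * nbatch, and the result should match the double loop from the question entry by entry.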

Related

Gradient of a function evaluated over a batch

I want to use Tensorflow to calculate the gradients of a function. However, if I use the tf.gradients function, it returns a single list of gradients. How to return a list for each point of the batch?
# in a tensorflow graph I have the following code
tf_x = tf.placeholder(dtype=tf.float32, shape=(None,N_in), name='x')
tf_net #... conveniently defined neural network
tf_y = tf.placeholder(dtype=tf.float32, shape=(None,1), name='y')
tf_cost = (tf_net(tf_x) - tf_y)**2 # this should have length N_samples because I did not apply a tf.reduce_mean
tf_cost_gradients = tf.gradients(tf_cost,tf_net.trainable_weights)
If we run it in a tensorflow session,
# suppose myx = np.random.randn(N_samples,N_in) and myy conveniently chosen
feed = {tf_x:myx, tf_y:myy}
sess.run(tf_cost_gradients,feed)
I get only one list, and not a list for each sample as I would like. I can use
for i in range(len(myx)):
    # slice with i:i+1 to keep the batch dimension expected by the placeholders
    feed = {tf_x:myx[i:i+1], tf_y:myy[i:i+1]}
    sess.run(tf_cost_gradients,feed)
but this is extremely slow! What can I do? Thank you

Although tf.gradients has an 'aggregation_method' parameter, it is not easy to get the individual gradients.
aggregation_method: Specifies the method used to combine gradient terms.
Please see these threads:
https://github.com/tensorflow/tensorflow/issues/15760
https://github.com/tensorflow/tensorflow/issues/4897
In one of the threads (#4897), Ian Goodfellow makes the following suggestion to speed up individual gradient computation:
This is only pseudocode, but basic idea is:
examples = tf.split(batch)
weight_copies = [tf.identity(weights) for x in examples]
output = tf.stack([f(x, w) for x, w in zip(examples, weight_copies)])
cost = cost_function(output)
per_example_gradients = tf.gradients(cost, weight_copies)

Forward Jacobian Of Neural Network in Pytorch is Slow

I am computing the forward Jacobian (the derivative of the outputs with respect to the inputs) of a feedforward neural network with 2 hidden layers in PyTorch. My results are correct but relatively slow. Given the nature of the calculation, I would expect it to be about as fast as a forward pass through the network (or maybe 2-3x as long), but an optimization step on this routine takes ~12x as long as one using the standard mean squared error (in my test example I just want the Jacobian determinant to be 1 at all points), so I assume I am doing something suboptimally. Does anyone know a faster way to code this? My test network has 2 input nodes, followed by 2 hidden layers of 5 nodes each and an output layer of 2 nodes. It uses tanh activation functions on the hidden layers and a linear output layer.
The Jacobian calculations are based on the paper "The Limitations of Deep Learning in Adversarial Settings", which gives a basic recursive definition of the forward derivative (basically you end up multiplying the derivative of your activation functions with the weights and the previous partial derivatives of each layer). This is very similar to forward propagation, which is why I would expect it to be faster than it is. The determinant of the 2x2 Jacobian at the end is then straightforward.
Below is the code for the network and the jacobian
import torch
import torch.nn.functional as F
class Network(torch.nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.h_1_1 = torch.nn.Linear(input_1, hidden_1)
        self.h_1_2 = torch.nn.Linear(hidden_1, hidden_2)
        self.out = torch.nn.Linear(hidden_2, out_1)
    def forward(self, x):
        x = F.tanh(self.h_1_1(x))
        x = F.tanh(self.h_1_2(x))
        x = (self.out(x))
        return x
    def jacobian(self, x):
        a = self.h_1_1.weight
        x = F.tanh(self.h_1_1(x))
        tanh_deriv_tensor = 1 - (x ** 2)
        expanded_deriv = tanh_deriv_tensor.unsqueeze(-1).expand(-1, -1, input_1)
        partials = expanded_deriv * a.expand_as(expanded_deriv)
        a = torch.matmul(self.h_1_2.weight, partials)
        x = F.tanh(self.h_1_2(x))
        tanh_deriv_tensor = 1 - (x ** 2)
        # the last dimension of the partials indexes the network inputs
        expanded_deriv = tanh_deriv_tensor.unsqueeze(-1).expand(-1, -1, input_1)
        partials = expanded_deriv*a
        partials = torch.matmul(self.out.weight, partials)
        determinant = partials[:, 0, 0] * partials[:, 1, 1] - partials[:, 0, 1] * partials[:, 1, 0]
        return determinant
Here are the two error functions being compared. Note that the first one requires an extra forward call through the network to get the output values (labeled action), while the second does not, since it works on the input values.
def actor_loss_fcn1(action, target):
    loss = ((action-target)**2).mean()
    return loss
def actor_loss_fcn2(input): # 12x slower
    jacob = model.jacobian(input)
    loss = ((jacob-1)**2).mean()
    return loss
Any insight on this would be greatly appreciated

The second calculation of 'a' takes the most time on my machine (cpu).
# Here you increase the size of the tensor by a factor of "input_1"
expanded_deriv = tanh_deriv_tensor.unsqueeze(-1).expand(-1, -1, input_1)
partials = expanded_deriv * a.expand_as(expanded_deriv)
# Here torch.matmul() has to handle "input_1" times more computation than in a normal forward call
a = torch.matmul(self.h_1_2.weight, partials)
On my machine, the time to compute the Jacobian is roughly the time it takes torch to compute
a = torch.rand(hidden_2, hidden_1)  # same shape as self.h_1_2.weight
b = torch.rand(n_inputs, hidden_1, input_1)
%timeit torch.matmul(a,b)
I don't think it is possible to speed this up computationally, unless you can move from CPU to GPU, because GPUs do better on large matrices.
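For reference, the forward-Jacobian recursion discussed above can be written with broadcasting instead of explicit expand calls and checked against autograd. This is only a sketch under the layer sizes stated in the question. The names l1, l2, l3 and batch_jacobian are illustrative, and the rewrite is not claimed to be faster, since the batched matmul still does the same amount of work.

```python
import torch

torch.manual_seed(0)
n_inputs, input_1, hidden_1, hidden_2, out_1 = 8, 2, 5, 5, 2

# Illustrative stand-ins for h_1_1, h_1_2 and out in the question.
l1 = torch.nn.Linear(input_1, hidden_1)
l2 = torch.nn.Linear(hidden_1, hidden_2)
l3 = torch.nn.Linear(hidden_2, out_1)

def forward(x):
    return l3(torch.tanh(l2(torch.tanh(l1(x)))))

def batch_jacobian(x):
    h1 = torch.tanh(l1(x))                                         # (N, hidden_1)
    J = (1 - h1 ** 2).unsqueeze(-1) * l1.weight                    # (N, hidden_1, input_1)
    h2 = torch.tanh(l2(h1))                                        # (N, hidden_2)
    J = (1 - h2 ** 2).unsqueeze(-1) * torch.matmul(l2.weight, J)   # (N, hidden_2, input_1)
    return torch.matmul(l3.weight, J)                              # (N, out_1, input_1)

x = torch.rand(n_inputs, input_1)
J = batch_jacobian(x)

# Reference: autograd Jacobian computed one example at a time.
ref = torch.stack([torch.autograd.functional.jacobian(forward, xi) for xi in x])
determinant = J[:, 0, 0] * J[:, 1, 1] - J[:, 0, 1] * J[:, 1, 0]
```

Comparing J with ref is a cheap way to confirm that a hand-written Jacobian is correct before spending time on optimizing it.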

Writing this exotic NN architecture with keras, tensorflow and python

I'm trying to get Keras to train a multiclass classification model that can be written in a network like this:
The only set of trainable parameters are those , all the rest is given. The functions fi are combinations of usual mathematical functions (for example .Sigma stands for summing the previous terms and softmax is the usual function. The (x1,x2,...xn) are elements of train or test set and are a specific subset of the original data already selected.
The model in more depth:
Specifically, given (x_1,x_2,...,x_n) an input in train or test set, the network evaluates
where fi are given mathematical functions, are rows of a particular subset of the original data and the coefficients are the parameters I want to train.
As I'm using keras, I expect it to add a bias term to each row.
After the above evaluation, I will apply a softmax layer (each of the m lines above are numbers that will be inputs for the softmax function).
At the end I want to compile the model and run model.fit as usual.
The problem is that I couldn't translate the expression to Keras syntax.
My attempt:
Following the network scratch above, I first tried to consider each of the expressions of the form as lambda layers in a Sequential Model, but the best I could get to work was a combination of a dense layer with linear activation (which would play the role of a row's parameters: ) followed by a Lambda layer outputting a vector without the required summation, as follows:
model = Sequential()
#single row considered:
model.add(Lambda(lambda x: f_fixedRow(x), input_shape=(nFeatures,)))
#parameters set after lambda layer to get (a1*f(x1,y1),...,an*f(xn,yn)) and not (f(a1*x1,y1),...,f(an*xn,yn))
model.add(Dense(nFeatures, activation='linear'))
#missing summation: sum(x)
#missing evaluation of f in all other rows
model.add(Dense(classes,activation='softmax',trainable=False)) #should get all rows
model.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
Also, I had to define the function in the lambda function call with the argument already fixed (because the lambda function could have only the input layers as variable):
def f_fixedRow(x):
    #picking a particular row (as a vector) to evaluate f in (f works element-wise)
    y=tf.constant(value=x[0,:],dtype=tf.float32)
    return f(x,y)
I managed to write the f function with tensorflow (working element-wise in a row), although this is a possible source for problems in my code (and the above workaround seems unnatural).
I also thought that if I could properly write the element-wise sum of the vector in the aforementioned attempt I could repeat the same procedure in a parallelized manner with the keras Functional API and then insert the output of each parallel model in a softmax function, as I need.
Another approach that I considered was to train the parameters keeping their natural matrix structure seen in Network Description, maybe writing a matrix Lambda layer, but I could not find anything related to this idea.
Anyway, I'm not sure what is a good way to work with this model within keras, maybe I'm missing an important point because of the non standard way the parameters are written or lack of experience with tensorflow. Any suggestions are welcome.

For this answer, it's important that f be a tensor function that operates elementwise (no iterating). This is reasonably easy to achieve; just check the Keras backend functions.
Assumptions:
The x_pk set is constant, otherwise this solution must be reviewed.
The function f is elementwise (if not, please show f for better code)
Your model will need x_pk as a tensor input. And you should do that in a functional API model.
import numpy as np
import keras.backend as K
from keras.layers import Input, Lambda, Activation, Layer
from keras.models import Model
#x_pk data
x_pk_numpy = select_X_pk_samples(x_train)
x_pk_tensor = K.variable(x_pk_numpy)
#number of rows in x_pk
m = len(x_pk_numpy)
#I suggest a fixed batch size for simplicity
batch = some_batch_size
First let's work on the function that will take x and x_pk calling f.
def calculate_f(inputs): #inputs will be a list with x and x_pk
    x, x_pk = inputs
    #since f will work elementwise, let's replicate x and x_pk so they have equal shapes
    #please explain f for better optimization
    # x from (batch, n) to (batch, m, n)
    x = K.stack([x]*m, axis=1)
    # x_pk from (m, n) to (batch, m, n)
    x_pk = K.stack([x_pk]*batch, axis=0)
    #a batch size of 1 could make this even simpler
    #a variable batch size would make this more complicated
    #certain f functions could make this process unnecessary
    return f(x, x_pk)
Now, unlike a Dense layer, this formula multiplies by the a_pk weights elementwise. So we need a custom layer:
class ElementwiseWeights(Layer):
    def __init__(self, **kwargs):
        super(ElementwiseWeights, self).__init__(**kwargs)
    def build(self, input_shape):
        weight_shape = (1,) + input_shape[1:] #shape (1, m, n)
        self.kernel = self.add_weight(name='kernel',
                                      shape=weight_shape,
                                      initializer='uniform',
                                      trainable=True)
        super(ElementwiseWeights, self).build(input_shape)
    def compute_output_shape(self,input_shape):
        return input_shape
    def call(self, inputs):
        return self.kernel * inputs
Now let's build our functional API model:
#x_pk model tensor input
x_pk = Input(tensor=x_pk_tensor) #shape (m, n)
#x usual input with fixed batch size
x = Input(batch_shape=(batch,n)) #shape (batch, n)
#calculate F
out = Lambda(calculate_f)([x, x_pk]) #shape (batch, m, n)
#multiply a_pk
out = ElementwiseWeights()(out) #shape (batch, m, n)
#sum n elements, keep m rows:
out = Lambda(lambda x: K.sum(x, axis=-1))(out) #shape (batch, m)
#softmax
out = Activation('softmax')(out) #shape (batch,m)
Continue this model with whatever you want and finish it:
model = Model([x, x_pk], out)
model.compile(.....)
model.fit(x_train, y_train, ....) #perhaps you might need .fit([x_train], ytrain,...)
Edit for function f
You can have the proposed f like this:
#create the n coefficients:
coefficients = np.array([c0, c1, .... , cn])
coefficients = coefficients.reshape((1,1,n))
def f(x, x_pk):
    c = K.variable(coefficients) #shape (1, 1, n)
    out = (x - x_pk) / c
    return K.exp(out)
This f would accept x with shape (batch, 1, n), without the stack used in the calculate_f function.
Or could accept x_pk with shape (1, m, n), allowing variable batch size.
But I'm not sure it's possible to have both of these shapes together. Testing this might be interesting.
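On the question of using both shapes together, the shape arithmetic can at least be checked outside Keras. The sketch below uses NumPy as a stand-in for the backend, on the assumption that the backend's elementwise operations follow NumPy-style broadcasting (which is worth verifying for the backend in use). The sizes and the coefficient values c are made up for illustration.

```python
import numpy as np

batch, m, n = 4, 3, 5                      # illustrative sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(batch, n))            # one input batch
x_pk = rng.normal(size=(m, n))             # the fixed x_pk rows
c = np.linspace(1.0, 2.0, n).reshape(1, 1, n)  # hypothetical coefficients

def f(x, x_pk):
    # Same formula as the proposed f, written with NumPy.
    return np.exp((x - x_pk) / c)

# Both reduced shapes at once: (batch, 1, n) against (1, m, n).
broadcast = f(x[:, None, :], x_pk[None, :, :])

# The explicit replication used in calculate_f.
stacked = f(np.stack([x] * m, axis=1), np.stack([x_pk] * batch, axis=0))
```

Under NumPy broadcasting the two results have the same shape (batch, m, n) and the same values, which suggests the stacking step could be dropped if the backend broadcasts the same way.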

XOR neural network 2-1-1

I am trying to implement XOR with a neural network that has a topology of 2 inputs, 1 element in the hidden layer, and 1 output. But the learning rate is really bad (0,5). I think it is because I am missing a connection between the inputs and the outputs, but I am not really sure how to add it. I have already added the bias connection so that the learning is better. I am using only NumPy.
import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))
def sigmoid_output_to_derivative(output):
    return output*(1-output)
a=0.1
X = np.array([[0,0],
              [0,1],
              [1,0],
              [1,1]])
np.random.seed(1)
y = np.array([[0],
              [1],
              [1],
              [0]])
bias = np.ones(4)
X = np.c_[bias, X]
synapse_0 = 2*np.random.random((3,1)) - 1
synapse_1 = 2*np.random.random((1,1)) - 1
for j in range(600000):
    layer_0 = X
    layer_1 = sigmoid(np.dot(layer_0,synapse_0))
    layer_2 = sigmoid(np.dot(layer_1,synapse_1))
    layer_2_error = layer_2 - y
    if (j% 10000) == 0:
        print( "Error after "+str(j)+" iterations:" + str(np.mean(np.abs(layer_2_error))))
    layer_2_delta = layer_2_error*sigmoid_output_to_derivative(layer_2)
    layer_1_error = layer_2_delta.dot(synapse_1.T)
    layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
    synapse_1 -= a *(layer_1.T.dot(layer_2_delta))
    synapse_0 -= a *(layer_0.T.dot(layer_1_delta))

You need to be careful with statements like "the learning rate is bad".
Usually the learning rate is the step size that gradient descent takes in the negative gradient direction. So I'm not sure what you mean by a bad learning rate.
I'm also not sure if I understand your code correctly, but the forward step of a neural net is basically a matrix multiplication of the weight matrix for the hidden layer with the input vector. This will (if you set up everything correctly) result in a vector whose length equals the size of your hidden layer. Now you can simply add the bias before applying your logistic function elementwise to this vector.
h_i = f(h_i+bias_in)
Afterwards you can do the same thing for the hidden layer times the output weights and apply its activation to get the outputs.
o_j = f(o_j+bias_h)
The backwards step is to calculate the deltas at output and hidden layer including another elementwise operation with your function
sigmoid_output_to_derivative(output)
and update both weight matrices using the gradients (here the learning rate is needed to define the step size). The gradient for each weight is simply the value of the corresponding node times its delta.
Note: The deltas are calculated differently for output and hidden nodes.
I'd advise you to keep separate variables for the biases, because modern approaches usually update them by summing up the deltas of their connected nodes, multiplying by a (possibly different) learning rate, and subtracting this product from the specific bias.
Take a look at the following tutorial (it uses numpy):
http://peterroelants.github.io/posts/neural_network_implementation_part04/
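The forward and backward steps described above, with the biases kept as separate variables, might look roughly like this in NumPy. This is a sketch, not a fix for the XOR problem itself. The names W_in, b_in, W_out, b_out, n_hidden and lr are illustrative, and using the same learning rate for weights and biases is an assumption made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_output_to_derivative(output):
    return output * (1.0 - output)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
n_hidden = 1                                   # 2-1-1 topology as in the question
W_in = rng.uniform(-1, 1, size=(2, n_hidden))  # input -> hidden weights
b_in = np.zeros((1, n_hidden))                 # hidden bias, kept separately
W_out = rng.uniform(-1, 1, size=(n_hidden, 1)) # hidden -> output weights
b_out = np.zeros((1, 1))                       # output bias, kept separately
lr = 0.1                                       # same step size for weights and biases

losses = []
for step in range(2000):
    # Forward step: h = f(X W_in + b_in), o = f(h W_out + b_out)
    h = sigmoid(X @ W_in + b_in)
    o = sigmoid(h @ W_out + b_out)
    losses.append(float(np.mean((o - y) ** 2)))

    # Backward step: deltas differ for output and hidden nodes.
    delta_out = (o - y) * sigmoid_output_to_derivative(o)
    delta_h = (delta_out @ W_out.T) * sigmoid_output_to_derivative(h)

    # Weight gradient = node value times delta; bias gradient = summed deltas.
    W_out -= lr * (h.T @ delta_out)
    b_out -= lr * delta_out.sum(axis=0, keepdims=True)
    W_in -= lr * (X.T @ delta_h)
    b_in -= lr * delta_h.sum(axis=0, keepdims=True)
```

Setting n_hidden to a larger value changes only the array shapes; the update rules stay the same.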

Writing a custom loss function element by element for Keras

I am new to machine learning, Python and TensorFlow. I am used to coding in C++ or C#, and it is difficult for me to use tf.backend.
I am trying to write a custom loss function for an LSTM network that tries to predict whether the next element of a time series will be positive or negative. My code runs nicely with the binary_crossentropy loss function. I now want to improve my network with a loss function that adds the value of the next time series element if the predicted probability is greater than 0.5 and subtracts it if the probability is less than or equal to 0.5.
I tried something like this:
def customLossFunction(y_true, y_pred):
    temp = 0.0
    for i in range(0, len(y_true)):
        if(y_pred[i] > 0):
            temp += y_true[i]
        else:
            temp -= y_true[i]
    return temp
Obviously, dimensions are wrong but since I cannot step into my function while debugging, it is very hard to get a grasp of dimensions here.
Can you please tell me if I can use an element-by-element function? If yes, how? And if not, could you help me with tf.backend?
Thanks a lot

From the Keras backend functions, you can use the function greater:
import keras.backend as K
def customLossFunction(yTrue, yPred):
    greater = K.greater(yPred, 0.5)
    greater = K.cast(greater, K.floatx()) #has zeros and ones
    multiply = (2*greater) - 1 #has -1 and 1
    modifiedTrue = multiply * yTrue
    #here, it's important to know which dimension you want to sum
    return K.sum(modifiedTrue, axis=?)
The axis parameter should be used according to what you want to sum.
axis=0 -> batch or sample dimension (number of sequences)
axis=1 -> time steps dimension (if you're using return_sequences = True until the end)
axis=2 -> predictions for each step
Now, if you have only a 2D target:
axis=0 -> batch or sample dimension (number of sequences)
axis=1 -> predictions for each sequence
If you simply want to sum everything for every sequence, just omit the axis parameter.
Important note about this function:
Since it contains only values from yTrue, it cannot backpropagate to change the weights. This will lead to a "None values not supported" error or something very similar.
Although yPred (the one that is connected to the model's weights) is used in the function, it is used only to produce a true/false condition, which is not differentiable.
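The sign-mask arithmetic and the non-differentiability point can both be illustrated with plain NumPy. In this sketch the function custom_loss mirrors the backend version (sum over everything, no axis), and the sample values are made up. Nudging y_pred slightly leaves the loss unchanged, which is why no useful gradient reaches the weights.

```python
import numpy as np

def custom_loss(y_true, y_pred):
    greater = (y_pred > 0.5).astype(np.float64)  # zeros and ones
    multiply = 2.0 * greater - 1.0               # -1 and 1
    return np.sum(multiply * y_true)             # sum over everything

# Made-up values: next time series elements and predicted probabilities.
y_true = np.array([1.0, -2.0, 0.5, -0.25])
y_pred = np.array([0.9, 0.2, 0.4, 0.7])

loss = custom_loss(y_true, y_pred)   # 1.0 + 2.0 - 0.5 - 0.25

# Finite-difference slope with respect to y_pred: the loss is piecewise
# constant in y_pred, so a small nudge does not change it.
eps = 1e-4
slope = (custom_loss(y_true, y_pred + eps) - loss) / eps
```

Here slope comes out as zero because none of the predictions crosses the 0.5 threshold when shifted by eps.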
