Multi-task learning with sample weights in tensorflow -- shape problem - python

I'm doing sequence classification with a batch size of 1, 5 outcomes, and a variable number of time steps (14 in this example). My sample weights w have the same shape as my labels y:
y = tf.convert_to_tensor(np.ones(shape = (1,14,5)))
w = tf.convert_to_tensor(np.random.uniform(size = (1,14,5)))
Out[53]: TensorShape([1, 14, 5])
Out[54]: TensorShape([1, 14, 5])
When I try to run this through the loss function, I get the following error:
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=False)
loss_object(y_true = y,
y_pred = y,
sample_weight = w)
InvalidArgumentError: Can not squeeze dim[2], expected a dimension of 1, got 5 [Op:Squeeze]
What's going on? I expected a straightforward multiplication of the (pre-reduction) loss matrix by the weights. How can I fix this?

Super simple fix! TensorFlow squeezes the last dimension of the sample weights because they are supposed to be applied per sample. Therefore, all you need to do is add one dimension to your weight tensor along the last axis:
y = tf.convert_to_tensor(np.ones(shape = (1,14,5)))
w = tf.convert_to_tensor(np.random.uniform(size = (1,14,5,1))) # Change made here
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=False)
loss_object(y_true = y,
y_pred = y,
sample_weight = w)
You can also just change the shape of the weights matrix after creation:
w = tf.expand_dims(w, axis=-1)
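For intuition about the shapes involved, here is a minimal NumPy sketch (not the Keras implementation) of a binary cross-entropy that is weighted element by element before any reduction. It assumes the goal is one weight per time step and outcome; the helper name weighted_bce and the clipping constant eps are illustrative choices, not part of TensorFlow.

```python
import numpy as np

def weighted_bce(y_true, y_pred, w, eps=1e-7):
    """Element-wise binary cross-entropy, weighted before any reduction."""
    p = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    loss = -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return loss * w                       # same shape as the inputs

rng = np.random.default_rng(0)
y = np.ones((1, 14, 5))
p = rng.uniform(0.05, 0.95, size=(1, 14, 5))  # made-up predictions
w = rng.uniform(size=(1, 14, 5))

per_element = weighted_bce(y, p, w)       # shape (1, 14, 5)
per_step = per_element.mean(axis=-1)      # shape (1, 14)
```

Averaging over the last axis, as per_step does, gives the (1, 14) shape that Keras's BinaryCrossentropy produces before it applies sample_weight, which is why a (1, 14, 5) weight tensor cannot be lined up with it directly.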


When and why do we use tf.reduce_mean?

In setting up the model I sometimes see the code:
# Scenario 1
# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
logits=logits, labels=Y))
# Scenario 2
# Evaluate model (with test logits, for dropout to be disabled)
prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
The documentation for tf.reduce_mean states that it "calculates the mean of tensor elements along various dimensions of the tensor." What does it do, in simpler language? When do we need to use it, for example in Scenarios 1 and 2 above? Thank you.
As far as I understand, tensorflow.reduce_mean is the same as numpy.mean. It creates an operation in the underlying tensorflow graph which computes the mean of a tensor.
The most important keyword argument of tensorflow.reduce_mean is axis. Basically, if you have a tensor with shape (4, 3, 2) and axis=1, an empty array with shape (4, 2) will be created, and the mean values along the selected axis will be computed to fill in the empty array. (This is just a pseudo-process to help you make sense of the output, but may not be the actual process)
Here is a simple example to help you understand
import tensorflow as tf
import numpy as np
one = np.linspace(1, 30, 30).reshape(5, 3, 2)
x = tf.placeholder('float32', shape=[5, 3, 2])
op_1 = tf.reduce_mean(x)
op_2 = tf.reduce_mean(x, axis=0)
op_3 = tf.reduce_mean(x, axis=1)
op_4 = tf.reduce_mean(x, axis=2)
with tf.Session() as sess:
    print(sess.run(op_1, feed_dict={x: one}))
    print(sess.run(op_2, feed_dict={x: one}))
    print(sess.run(op_3, feed_dict={x: one}))
    print(sess.run(op_4, feed_dict={x: one}))
The first output is a number because we didn't provide an axis. The shapes of the rest of the outputs are (3, 2), (5, 2) and (5, 3), respectively.
reduce_mean can be useful when the target value is a matrix.
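Since the answer compares tf.reduce_mean to numpy.mean, the same shapes can be checked without a TensorFlow session. This is only a NumPy analogue of the example above; the variable names are illustrative.

```python
import numpy as np

x = np.linspace(1, 30, 30).reshape(5, 3, 2)

overall = np.mean(x)            # no axis: one scalar over all 30 elements
along_0 = np.mean(x, axis=0)    # axis 0 (size 5) is averaged away
along_1 = np.mean(x, axis=1)    # axis 1 (size 3) is averaged away
along_2 = np.mean(x, axis=2)    # axis 2 (size 2) is averaged away

print(overall)                                      # 15.5
print(along_0.shape, along_1.shape, along_2.shape)  # (3, 2) (5, 2) (5, 3)
```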
User #meTchaikovsky explained the general case of tf.reduce_mean. In both of your cases, tf.reduce_mean simply works as a plain mean calculator, i.e., you are not taking the mean along any particular axis of a tensor; you simply divide the sum of the elements in the tensor by the number of elements.
Let's decode what exactly is happening in both cases. For both cases, assume batch_size = 2 and num_classes = 5, meaning that there are two examples per batch.
Now for the first case, tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y) returns an array of shape (2,).
>>import numpy as np
>>import tensorflow as tf
>>sess= tf.InteractiveSession()
>>batch_size = 2
>>num_classes = 5
>>logits = np.random.rand(batch_size,num_classes)
[[0.94108451 0.68186329 0.04000461 0.25996487 0.50391948]
[0.22781201 0.32305269 0.93359371 0.22599208 0.05942905]]
>>labels = np.array([[1,0,0,0,0],[0,1,0,0,0]])
[[1 0 0 0 0]
[0 1 0 0 0]]
>>logits_ = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>Y_ = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>loss_op = tf.nn.softmax_cross_entropy_with_logits(logits=logits_, labels=Y_)
>>loss_per_example = sess.run(loss_op, feed_dict={Y_:labels,logits_:logits})
array([1.2028817, 1.6912657], dtype=float32)
You can see that loss_per_example has shape (2,). Taking the mean of this array gives the average loss over the batch. Hence we calculate
>>loss_per_example_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>final_loss_per_batch = tf.reduce_mean(loss_per_example_holder)
>>final_loss = sess.run(final_loss_per_batch, feed_dict={loss_per_example_holder:loss_per_example})
Coming to your second case:
>>predictions_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>labels_holder = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>prediction_tf = tf.equal(tf.argmax(predictions_holder, 1), tf.argmax(labels_holder, 1))
>>labels_match = sess.run(prediction_tf, feed_dict={predictions_holder:logits,labels_holder:labels})
[ True False]
This output is expected: only for the first example does the highest logit (0.9410, at index 0) coincide with the position of the 1 in labels. To calculate the accuracy, we take the mean of labels_match after casting it to float.
>>labels_match_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>accuracy_calc = tf.reduce_mean(tf.cast(labels_match_holder, tf.float32))
>>accuracy = sess.run(accuracy_calc, feed_dict={labels_match_holder:labels_match})
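The two scenarios can also be reproduced in plain NumPy, which makes the role of the mean easy to see. This is a sketch using the logits and labels printed above, with a hand-written log-softmax standing in for tf.nn.softmax_cross_entropy_with_logits.

```python
import numpy as np

logits = np.array([[0.94108451, 0.68186329, 0.04000461, 0.25996487, 0.50391948],
                   [0.22781201, 0.32305269, 0.93359371, 0.22599208, 0.05942905]])
labels = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0]])

# Scenario 1: softmax cross-entropy per example, then the mean over the batch.
shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
loss_per_example = -(labels * log_probs).sum(axis=1)   # shape (2,)
batch_loss = loss_per_example.mean()                   # a single number

# Scenario 2: accuracy is the mean of the 0/1 "prediction was right" flags.
matches = logits.argmax(axis=1) == labels.argmax(axis=1)
accuracy = matches.astype(np.float32).mean()

print(loss_per_example)  # approximately [1.2029 1.6913]
print(matches, accuracy) # [ True False] 0.5
```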

Shapes in Tensorflow

I am new to Tensorflow and I have problems with combining shapes (n,) with shapes (n,1).
I have this code:
if __name__ == '__main__':
    trainSetX, trainSetY = utils.load_train_set()
    # create placeholders & variables
    X = tf.placeholder(tf.float32, shape=(num_of_features,))
    y = tf.placeholder(tf.float32, shape=(1,))
    W, b = initialize_params()
    # predict y
    y_estim = linear_function(X, W, b)
    y_pred = tf.sigmoid(y_estim)
    # set the optimizer
    loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=y_pred)
    loss_mean = tf.reduce_mean(loss)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=alpha).minimize(loss_mean)
    # training phase
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for idx in range(num_of_examples):
            cur_x, cur_y = trainSetX[idx], trainSetY[idx]
            _, c = sess.run([optimizer, loss_mean], feed_dict={X: cur_x, y: cur_y})
I am trying to implement stochastic gradient descent by feeding one example at a time. The problem is that the data seems to be fed in shape (num_of_features,), while I need (num_of_features,1) for the other functions to work correctly.
For example, the code above raises an error when it computes the prediction of y with this function:
def linear_function(x, w, b):
    y_est = tf.add(tf.matmul(w, x), b)
    return y_est
The error is:
ValueError: Shape must be rank 2 but is rank 1 for 'MatMul' (op: 'MatMul') with input shapes: [1,3197], [3197].
I was trying to use tf.reshape with X and y to somehow solve this problem, but it caused errors in other places.
Is it possible to feed the data in feed_dict={X: cur_x, y: cur_y} in "correct" shape?
Or what is the way to properly implement this?
For matrix multiplications, you need to follow the rule of shapes
(a, b) * (b, c) = (a, c)
Which means you do need to reshape it since the shapes in your code are not following it. Showing what error you got after reshape would help.
Hope this gives you some hint
import tensorflow as tf
a = tf.constant([1, 2], shape=[1, 2])
b = tf.constant([7, 8], shape=[2])
print(a.shape) # => (1, 2)
print(b.shape) # => (2,)
sess = tf.Session()
# r = tf.matmul(a, b)
# print(sess.run(r)) # this gives you error
c = tf.reshape(b, [2, 1])
print(c.shape) # => (2, 1)
r = tf.matmul(a, c)
foo = tf.reshape(r, [1])
foo = sess.run(foo)
print(foo) # this gives you [23]
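The same shape rule can be tried out in NumPy. Note one difference: NumPy's @ operator also accepts rank-1 operands, whereas tf.matmul insists on rank 2, so the reshape below is there to mirror what TensorFlow requires. The second half applies the idea to a single training example; num_of_features = 4 and the all-ones weights are made-up values.

```python
import numpy as np

a = np.array([[1, 2]])         # shape (1, 2)
b = np.array([7, 8])           # shape (2,), rank 1

c = b.reshape(2, 1)            # shape (2, 1), rank 2
r = a @ c                      # (1, 2) @ (2, 1) -> (1, 1)
print(r)                       # [[23]]

# One training example fed as a flat vector, turned into a column.
num_of_features = 4                            # hypothetical size
w = np.ones((1, num_of_features))              # shape (1, 4)
x = np.arange(num_of_features, dtype=float)    # shape (4,)
y_est = w @ x[:, np.newaxis]                   # (1, 4) @ (4, 1) -> (1, 1)
```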

Error with input dimension

I am trying to implement a custom RBF kernel function, but I am getting the following error. I am not sure why it expects a certain number of data points.
Error occurs in this line of code:
rbf_y = rbf_kernel.predict(X_test)
def myKernel(x,y):
    pairwise_dists = squareform(pdist(x, 'euclidean'))
    K = scip.exp(-pairwise_dists ** 2 / .08 ** 2)
    return K
rbf_kernel = svm.SVC(kernel=myKernel, C=1).fit(X_train, Y_train.ravel())
rbf_y = rbf_kernel.predict(X_test)
rbf_accuracy = accuracy_score(Y_test, rbf_y)
ValueError: X.shape[1] = 15510 should be equal to 31488, the number of samples at training time
Data Shape
X_train shape: (31488, 128)
X_test shape: (15510, 128)
Y_train shape: (31488, 1)
Y_test shape: (15510, 1)
Return Shape from Kernel
myKernel(X_train, X_train).shape = (31488, 31488)
A custom kernel kernel(X, Y) should compute a similarity measure between the matrix X and the matrix Y, and the output should be of shape [X.shape[0], Y.shape[0]]. Your kernel function ignores Y, and returns a matrix of shape [X.shape[0], X.shape[0]], which leads to the error you are seeing.
To fix the issue, implement a kernel function that computes a kernel matrix of the correct shape. Scikit-learn's custom kernels documentation has some simple examples of how this might work.
In the case of your specific kernel, you might try cdist(x, y) in place of squareform(pdist(x)).
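A minimal sketch of a kernel with the required output shape is shown below. To keep it dependency-free it computes the squared Euclidean distances with NumPy rather than calling cdist (the result corresponds to cdist(x, y, 'euclidean') ** 2); the function name rbf_kernel, the sigma parameter, and the small random arrays are illustrative assumptions, with sigma = 0.08 taken from the question.

```python
import numpy as np

def rbf_kernel(x, y, sigma=0.08):
    """RBF kernel between rows of x and rows of y; shape (len(x), len(y))."""
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = ((x ** 2).sum(axis=1)[:, None]
          + (y ** 2).sum(axis=1)[None, :]
          - 2.0 * x @ y.T)
    sq = np.maximum(sq, 0.0)          # clip tiny negatives caused by rounding
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, 3))     # stand-in for the (31488, 128) data
X_test = rng.normal(size=(4, 3))      # stand-in for the (15510, 128) data

K_train = rbf_kernel(X_train, X_train)   # (6, 6): train vs. train
K_test = rbf_kernel(X_test, X_train)     # (4, 6): test vs. train
```

The second call is the case the original kernel got wrong: its second dimension must equal the number of training samples, which is what the error message is complaining about.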

Model not learning in tensorflow

I am new to tensorflow and neural networks, and I am trying to create a model that just multiplies two float values together.
I wasn't sure how many neurons I would want, but I picked 10 neurons and tried to see where I could go from that. I figured that would probably introduce enough complexity in order to semi-accurately learn that operation.
Anyways, here is my code:
import tensorflow as tf
import numpy as np
# Teach how to multiply
def generate_data(how_many):
    data = np.random.rand(how_many, 2)
    answers = data[:, 0] * data[:, 1]
    return data, answers
sess = tf.InteractiveSession()
# Input data
input_data = tf.placeholder(tf.float32, shape=[None, 2])
correct_answers = tf.placeholder(tf.float32, shape=[None])
# Use 10 neurons--just one layer for now, but it'll be fully connected
weights_1 = tf.Variable(tf.truncated_normal([2, 10], stddev=.1))
bias_1 = tf.Variable(.1)
# Output of this will be a [None, 10]
hidden_output = tf.nn.relu(tf.matmul(input_data, weights_1) + bias_1)
# Weights
weights_2 = tf.Variable(tf.truncated_normal([10, 1], stddev=.1))
bias_2 = tf.Variable(.1)
# Softmax them together--this will be [None, 1]
calculated_output = tf.nn.softmax(tf.matmul(hidden_output, weights_2) + bias_2)
cross_entropy = tf.reduce_mean(correct_answers * tf.log(calculated_output))
optimizer = tf.train.GradientDescentOptimizer(.5).minimize(cross_entropy)
sess.run(tf.global_variables_initializer())
for i in range(1000):
    x, y = generate_data(100)
    sess.run(optimizer, feed_dict={input_data: x, correct_answers: y})
error = tf.reduce_sum(tf.abs(calculated_output - correct_answers))
x, y = generate_data(100)
print("Total Error: ", error.eval(feed_dict={input_data: x, correct_answers: y}))
It seems that the error is always around 7522.1, which is very bad for just 100 data points, so I assume it is not learning.
My questions: Is my machine learning? If so, what can I do to make it more accurate? If not, how can I make it learn?
There are a few major issues with the code. Aaron has already identified some of them, but there's another important one: calculated_output and correct_answers are not the same shape, so broadcasting creates a 2D matrix when you subtract them. (The shape of calculated_output is (100, 1) and the shape of correct_answers is (100,).) So you need to adjust the shape (for example, by using tf.squeeze on calculated_output).
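The broadcasting effect is easy to reproduce in NumPy, whose broadcasting rules TensorFlow follows. The values below are made up for illustration: outputs fixed at 1 and targets fixed at 0.25, which is the average of a product of two uniform random numbers.

```python
import numpy as np

calculated_output = np.ones((100, 1))     # model output, shape (100, 1)
correct_answers = np.full(100, 0.25)      # targets, shape (100,)

diff = calculated_output - correct_answers
print(diff.shape)                         # (100, 100): every output vs. every target

fixed = np.squeeze(calculated_output, axis=1) - correct_answers
print(fixed.shape)                        # (100,)

total_broadcast = np.abs(diff).sum()      # 100 * 100 * 0.75 = 7500.0
total_fixed = np.abs(fixed).sum()         # 100 * 0.75 = 75.0
```

Under these assumptions the broadcast total comes out at 7500, which is in the same range as the error reported in the question.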
This problem also doesn't really require any non-linearities, so you could get by with no activations and only one layer. The following code gets a total error of about 6 (~0.06 error on average for each test point). Hope that helps!
import tensorflow as tf
import numpy as np
# Teach how to multiply
def generate_data(how_many):
    data = np.random.rand(how_many, 2)
    answers = data[:, 0] * data[:, 1]
    return data, answers
sess = tf.InteractiveSession()
input_data = tf.placeholder(tf.float32, shape=[None, 2])
correct_answers = tf.placeholder(tf.float32, shape=[None])
weights_1 = tf.Variable(tf.truncated_normal([2, 1], stddev=.1))
bias_1 = tf.Variable(.0)
output_layer = tf.matmul(input_data, weights_1) + bias_1
mean_squared = tf.reduce_mean(tf.square(correct_answers - tf.squeeze(output_layer)))
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(mean_squared)
sess.run(tf.global_variables_initializer())
for i in range(1000):
    x, y = generate_data(100)
    sess.run(optimizer, feed_dict={input_data: x, correct_answers: y})
error = tf.reduce_sum(tf.abs(tf.squeeze(output_layer) - correct_answers))
x, y = generate_data(100)
print("Total Error: ", error.eval(feed_dict={input_data: x, correct_answers: y}))
The way you are using softmax is weird. Softmax is normally used when you want to have a probability distribution over a set of classes. In your code it looks like you have a one dimensional output. The softmax is not helping you there.
The cross entropy loss function is appropriate in classification problems but you are doing regression. You should try using a mean squared error loss function instead.
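Both points can be illustrated with a short NumPy sketch. The softmax helper and the sample numbers are illustrative only: a softmax over a single output unit always returns 1, so it cannot carry any information, while a mean squared error compares predictions and targets directly.

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
single_unit = rng.normal(size=(5, 1))     # one output value per example
probs = softmax(single_unit)
print(probs.ravel())                      # [1. 1. 1. 1. 1.]

# Mean squared error, a common loss for regression targets.
pred = np.array([0.1, 0.4, 0.9])
target = np.array([0.0, 0.5, 1.0])
mse = np.mean((pred - target) ** 2)       # 0.01 up to rounding
```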

MLP Neural Network: calculating the gradient (matrices)

What is a good implementation for calculating the gradient in an n-layered neural network?
Weight layers:
First layer weights:     (n_inputs+1, n_units_layer)-matrix
Hidden layer weights: (n_units_layer+1, n_units_layer)-matrix
Last layer weights:     (n_units_layer+1, n_outputs)-matrix
If there is only one hidden layer we would represent the net by using just two (weight) layers:
inputs --first_layer-> network_unit --second_layer-> output
For an n-layer network with more than one hidden layer, we need to implement the (2) step.
A bit vague pseudocode:
weight_layers = [ layer1, layer2 ]             # a list of layers as described above
input_values  = [ [0,0], [1,1], [1,0], [0,1] ] # our test set (corresponds to XOR)
target_output = [ 0, 0, 1, 1 ]                 # what we want to train our net to output
output_layers = []                             # output for the corresponding layers
for layer in weight_layers:
    output <-- calculate the output            # calculate the output from the current layer
    output_layers <-- output                   # store the output from each layer
n_samples = input_values.shape[0]
n_outputs = target_output.shape[1]
error = ( output-target_output )/( n_samples*n_outputs )
""" calculate the gradient here """
Final implementation
The final implementation is available at github.
With Python and numpy that is easy.
You have two options:
1. You can compute everything in parallel for num_instances instances, or
2. you can compute the gradient for one instance (which is actually a special case of 1.).
I will now give some hints on how to implement option 1. I would suggest that you create a new class called Layer. It should have two functions:
forward propagation, with inputs
    X: shape = [num_instances, num_inputs]
    W: shape = [num_outputs, num_inputs]
    b: shape = [num_outputs]
    g: function
        activation function
and output
    Y: shape = [num_instances, num_outputs]
backpropagation, with inputs
    dE/dY: shape = [num_instances, num_outputs]
        backpropagated gradient
    W: shape = [num_outputs, num_inputs]
    b: shape = [num_outputs]
    gd: function
        calculates the derivative of g(A) = Y
        based on Y, i.e. gd(Y) = g'(A)
    Y: shape = [num_instances, num_outputs]
    X: shape = [num_instances, num_inputs]
and outputs
    dE/dX: shape = [num_instances, num_inputs]
        will be backpropagated (dE/dY of lower layer)
    dE/dW: shape = [num_outputs, num_inputs]
        accumulated derivative with respect to weights
    dE/db: shape = [num_outputs]
        accumulated derivative with respect to biases
The implementation is simple:
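def forward(X, W, b):
    A = X.dot(W.T) + b # will be broadcasted
    Y = g(A)
    return Y
def backprop(dEdY, W, b, gd, Y, X):
    Deltas = gd(Y) * dEdY # element-wise multiplication
    dEdX = Deltas.dot(W) # shape [num_instances, num_inputs]
    dEdW = Deltas.T.dot(X) # shape [num_outputs, num_inputs]
    dEdb = Deltas.sum(axis=0)
    return dEdX, dEdW, dEdb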
X of the first layer is taken from your dataset, and then you pass each Y as the X of the next layer in the forward pass.
The dE/dY of the output layer is computed (either for softmax activation function and cross entropy error function or for linear activation function and sum of squared errors) as Y-T, where Y is the output of the network (shape = [num_instances, num_outputs]) and T (shape = [num_instances, num_outputs]) is the desired output. Then you can backpropagate, i.e. dE/dX of each layer is dE/dY of the previous layer.
Now you can use dE/dW and dE/db of each layer to update W and b.
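To show how the pieces fit together, here is a small sketch of a two-layer network on the XOR data, with a finite-difference check of one weight gradient. It follows the shapes described above, but the layer sizes, the tanh hidden layer, the linear output with sum of squared errors, and all helper names are assumptions made for illustration.

```python
import numpy as np

def forward(X, W, b, g):
    return g(X.dot(W.T) + b)                  # b is broadcast over instances

def backprop(dEdY, W, b, gd, Y, X):
    deltas = gd(Y) * dEdY                     # element-wise multiplication
    return deltas.dot(W), deltas.T.dot(X), deltas.sum(axis=0)

def tanh_d(Y):
    return 1.0 - Y ** 2                       # tanh'(A) expressed through Y

def identity(A):
    return A

def identity_d(Y):
    return np.ones_like(Y)

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])   # XOR inputs
T = np.array([[0.], [0.], [1.], [1.]])                    # XOR targets
W1, b1 = rng.normal(scale=0.5, size=(3, 2)), np.zeros(3)  # 2 inputs -> 3 hidden
W2, b2 = rng.normal(scale=0.5, size=(1, 3)), np.zeros(1)  # 3 hidden -> 1 output

def loss(W1_):
    H_ = forward(X, W1_, b1, np.tanh)
    Y_ = forward(H_, W2, b2, identity)
    return 0.5 * np.sum((Y_ - T) ** 2)        # sum of squared errors

# Forward pass: each layer's Y is the next layer's X.
H = forward(X, W1, b1, np.tanh)
Y = forward(H, W2, b2, identity)

# Backward pass: start from dE/dY = Y - T at the output layer.
dEdH, dEdW2, dEdb2 = backprop(Y - T, W2, b2, identity_d, Y, H)
_, dEdW1, dEdb1 = backprop(dEdH, W1, b1, tanh_d, H, X)

# Finite-difference check of a single entry of dE/dW1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus) - loss(W1_minus)) / (2 * eps)
```

If the analytic entry dEdW1[0, 0] and the value in numeric agree closely, the backward pass is consistent with the forward pass; a plain gradient descent update would then be W1 -= learning_rate * dEdW1, and likewise for the other parameters.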
Here is an example for C++: OpenANN.
Btw. you can compare the speed of instance-wise and batch-wise forward propagation:
In [1]: import timeit
In [2]: setup = """import numpy
...: W = numpy.random.rand(10, 5000)
...: X = numpy.random.rand(1000, 5000)"""
In [3]: timeit.timeit('[W.dot(x) for x in X]', setup=setup, number=10)
Out[3]: 0.5420958995819092
In [4]: timeit.timeit('X.dot(W.T)', setup=setup, number=10)
Out[4]: 0.22001314163208008
