I'm working on a simple linear regression model to predict the next step in a series. I'm giving it x/y coordinate data and I want the regressor to predict where the next point on the plot will lie.
I'm using dense layers with AdamOptmizer and have my loss function set to:
tf.reduce_mean(tf.square(layer_out - y))
I'm trying to create linear regression models from scratch (I don't want to utilize the TF estimator package here).
I've seen ways to do it by manually specifying weights and biases, but nothing goes into deep regression.
X = tf.placeholder(tf.float32, [None, self.data_class.batch_size, self.inputs])
y = tf.placeholder(tf.float32, [None, self.data_class.batch_size, self.outputs])
layer_input = tf.layers.dense(inputs=X, units=10, activation=tf.nn.relu)
layer_hidden = tf.layers.dense(inputs=layer_input, units=10, activation=tf.nn.relu)
layer_out = tf.layers.dense(inputs=layer_hidden, units=1, activation=tf.nn.relu)
cost = tf.reduce_mean(tf.square(layer_out - y))
optmizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
training_op = optmizer.minimize(cost)
init = tf.initialize_all_variables()
iterations = 10000
with tf.Session() as sess:
init.run()
for iteration in range(iterations):
X_batch, y_batch = self.data_class.get_data_batch()
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
if iteration % 100 == 0:
mse = cost.eval(feed_dict={X:X_batch, y:y_batch})
print(mse)
array = []
for i in range(len(self.data_class.dates), (len(self.data_class.dates)+self.data_class.batch_size)):
array.append(i)
x_pred = np.array(array).reshape(1, self.data_class.batch_size, 1)
y_pred = sess.run(layer_out, feed_dict={X: x_pred})
print(y_pred)
predicted = np.array(y_pred).reshape(self.data_class.batch_size)
predicted = np.insert(predicted, 0, self.data_class.prices[0], axis=0)
plt.plot(self.data_class.dates, self.data_class.prices)
array = [self.data_class.dates[0]]
for i in range(len(self.data_class.dates), (len(self.data_class.dates)+self.data_class.batch_size)):
array.append(i)
plt.plot(array, predicted)
plt.show()
When I run training I'm getting the same loss value over and over again.
It's not being reduced, like it should, why?
The issue is that I'm applying an activation to the output layer. This is causing that output to go to whatever it activates to.
By specifying in the last layer that activation=None the deep regression works as intended.
Here is the updated architecture:
layer_input = tf.layers.dense(inputs=X, units=150, activation=tf.nn.relu)
layer_hidden = tf.layers.dense(inputs=layer_input, units=100, activation=tf.nn.relu)
layer_out = tf.layers.dense(inputs=layer_hidden, units=1, activation=None)
Related
I am testing some TensorFlow code; I'm seeing this error:
AttributeError: module 'tensorflow' has no attribute 'variable_scope'
I am running TensorFlow version 2.1.0.
Here is the code that I am testing.
# imports
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
# Input data:
# For this tutorial we use the MNIST dataset. MNIST is a dataset of handwritten digits. If you are into machine learning, you might have heard of this dataset by now. MNIST is kind of benchmark of datasets for deep learning. One other reason that we use the MNIST is that it is easily accesible through Tensorflow. If you want to know more about the MNIST dataset you can check Yann Lecun's website. We can easily import the dataset and see the size of training, test and validation set:
# Import MNIST data
# from tensorflow.examples.tutorials.mnist import input_data
#import tensorflow_datasets as tfds
# Construct a tf.data.Dataset
#mnist = tfds.load(name="mnist", split=tfds.Split.TRAIN)
mnist = tf.keras.datasets.mnist
#mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
#print("Size of:")
#print("- Training-set:\t\t{}".format(len(mnist.train.labels)))
#print("- Test-set:\t\t{}".format(len(mnist.test.labels)))
#print("- Validation-set:\t{}".format(len(mnist.validation.labels)))
# hyper-parameters
logs_path = "C:/Users/ryans/MNIST_data/logs/embedding/" # path to the folder that we want to save the logs for Tensorboard
learning_rate = 0.001 # The optimization learning rate
epochs = 10 # Total number of training epochs
batch_size = 100 # Training batch size
display_freq = 100 # Frequency of displaying the training results
# Network Parameters
# We know that MNIST images are 28 pixels in each dimension.
img_h = img_w = 28
# Images are stored in one-dimensional arrays of this length.
img_size_flat = img_h * img_w
# Number of classes, one class for each of 10 digits.
n_classes = 10
# number of units in the first hidden layer
h1 = 200
# Graph:
# Like before, we start by constructing the graph. But, we need to define some functions that we need rapidly in our code.
# weight and bais wrappers
def weight_variable(name, shape):
"""
Create a weight variable with appropriate initialization
:param name: weight name
:param shape: weight shape
:return: initialized weight variable
"""
initer = tf.truncated_normal_initializer(stddev=0.01)
return tf.get_variable('W_' + name,
dtype=tf.float32,
shape=shape,
initializer=initer)
def bias_variable(name, shape):
"""
Create a bias variable with appropriate initialization
:param name: bias variable name
:param shape: bias variable shape
:return: initialized bias variable
"""
initial = tf.constant(0., shape=shape, dtype=tf.float32)
return tf.get_variable('b_' + name,
dtype=tf.float32,
initializer=initial)
def fc_layer(x, num_units, name, use_relu=True):
"""
Create a fully-connected layer
:param x: input from previous layer
:param num_units: number of hidden units in the fully-connected layer
:param name: layer name
:param use_relu: boolean to add ReLU non-linearity (or not)
:return: The output array
"""
with tf.variable_scope(name):
in_dim = x.get_shape()[1]
W = weight_variable(name, shape=[in_dim, num_units])
tf.summary.histogram('W', W)
b = bias_variable(name, [num_units])
tf.summary.histogram('b', b)
layer = tf.matmul(x, W)
layer += b
if use_relu:
layer = tf.nn.relu(layer)
return layer
# Now that we have our helper functions we can create our graph.
# Create graph
# Placeholders for inputs (x), outputs(y)
with tf.compat.v1.variable_scope('Input'):
x = tf.compat.v1.placeholder(tf.float32, shape=[None, img_size_flat], name='X')
tf.summary.image('input_image', tf.reshape(x, (-1, img_w, img_h, 1)), max_outputs=5)
y = tf.compat.v1.placeholder(tf.float32, shape=[None, n_classes], name='Y')
fc1 = fc_layer(x, h1, 'Hidden_layer', use_relu=True)
output_logits = fc_layer(fc1, n_classes, 'Output_layer', use_relu=False)
# Define the loss function, optimizer, and accuracy
with tf.compat.v1.variable_scope('Train'):
with tf.compat.v1.variable_scope('Loss'):
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=output_logits), name='loss')
tf.summary.scalar('loss', loss)
with tf.compat.v1.variable_scope('Optimizer'):
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, name='Adam-op').minimize(loss)
with tf.compat.v1.variable_scope('Accuracy'):
correct_prediction = tf.equal(tf.argmax(output_logits, 1), tf.argmax(y, 1), name='correct_pred')
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')
tf.summary.scalar('accuracy', accuracy)
# Network predictions
cls_prediction = tf.argmax(output_logits, axis=1, name='predictions')
# Initializing the variables
init = tf.global_variables_initializer()
merged = tf.summary.merge_all()
# Session:
# Launch the graph (session)
sess = tf.InteractiveSession() # using InteractiveSession instead of Session to test network in separate cell
sess.run(init)
train_writer = tf.summary.FileWriter(logs_path, sess.graph)
num_tr_iter = int(mnist.train.num_examples / batch_size)
global_step = 0
for epoch in range(epochs):
print('Training epoch: {}'.format(epoch + 1))
for iteration in range(num_tr_iter):
batch_x, batch_y = mnist.train.next_batch(batch_size)
global_step += 1
# Run optimization op (backprop)
feed_dict_batch = {x: batch_x, y: batch_y}
_, summary_tr = sess.run([optimizer, merged], feed_dict=feed_dict_batch)
train_writer.add_summary(summary_tr, global_step)
if iteration % display_freq == 0:
# Calculate and display the batch loss and accuracy
loss_batch, acc_batch = sess.run([loss, accuracy],
feed_dict=feed_dict_batch)
print("iter {0:3d}:\t Loss={1:.2f},\tTraining Accuracy={2:.01%}".
format(iteration, loss_batch, acc_batch))
# Run validation after every epoch
feed_dict_valid = {x: mnist.validation.images, y: mnist.validation.labels}
loss_valid, acc_valid = sess.run([loss, accuracy], feed_dict=feed_dict_valid)
print('---------------------------------------------------------')
print("Epoch: {0}, validation loss: {1:.2f}, validation accuracy: {2:.01%}".
format(epoch + 1, loss_valid, acc_valid))
print('---------------------------------------------------------')
I think the code was designed for an earlier version of TensorFlow. I made a few small modifications to get the code to run on my laptop. Here's the part that I am struggling with.
# Placeholders for inputs (x), outputs(y)
with tf.compat.v1.variable_scope('Input'):
x = tf.compat.v1.placeholder(tf.float32, shape=[None, img_size_flat], name='X')
tf.summary.image('input_image', tf.reshape(x, (-1, img_w, img_h, 1)), max_outputs=5)
y = tf.compat.v1.placeholder(tf.float32, shape=[None, n_classes], name='Y')
fc1 = fc_layer(x, h1, 'Hidden_layer', use_relu=True)
output_logits = fc_layer(fc1, n_classes, 'Output_layer', use_relu=False)
The 'with' statement runs, but I am getting an error on this line:
fc1 = fc_layer(x, h1, 'Hidden_layer', use_relu=True)
I thought the change to 'tf.compat.v1' would oversome the issue of different TensorFlow versions, but apparently not.
I found the code sample here.
https://www.easy-tensorflow.com/tf-tutorials/tensorboard/tb-embedding-visualization
As placeholder is removed from tensorflow 2.0, compat.v1 must be used. However, another problem is incompatibility and can be solved by using tf.compat.v1.disable_eager_execution() before with tf.compat.v1.variable_scope(...):
In a way, you can turn on the eager execution by calling tf.compat.v1.enable_eager_execution
You may check https://www.tensorflow.org/guide/migrate
In the following Keras and Tensorflow implementations of the training of a neural network, how model.train_on_batch([x], [y]) in the keras implementation is different than sess.run([train_optimizer, cross_entropy, accuracy_op], feed_dict=feed_dict) in the Tensorflow implementation? In particular: how those two lines can lead to different computation in training?:
keras_version.py
input_x = Input(shape=input_shape, name="x")
c = Dense(num_classes, activation="softmax")(input_x)
model = Model([input_x], [c])
opt = Adam(lr)
model.compile(loss=['categorical_crossentropy'], optimizer=opt)
nb_batchs = int(len(x_train)/batch_size)
for epoch in range(epochs):
loss = 0.0
for batch in range(nb_batchs):
x = x_train[batch*batch_size:(batch+1)*batch_size]
y = y_train[batch*batch_size:(batch+1)*batch_size]
loss_batch, acc_batch = model.train_on_batch([x], [y])
loss += loss_batch
print(epoch, loss / nb_batchs)
tensorflow_version.py
input_x = Input(shape=input_shape, name="x")
c = Dense(num_classes)(input_x)
input_y = tf.placeholder(tf.float32, shape=[None, num_classes], name="label")
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits_v2(labels=input_y, logits=c, name="xentropy"),
name="xentropy_mean"
)
train_optimizer = tf.train.AdamOptimizer(learning_rate=lr).minimize(cross_entropy)
nb_batchs = int(len(x_train)/batch_size)
init = tf.global_variables_initializer()
with tf.Session() as sess:
sess.run(init)
for epoch in range(epochs):
loss = 0.0
acc = 0.0
for batch in range(nb_batchs):
x = x_train[batch*batch_size:(batch+1)*batch_size]
y = y_train[batch*batch_size:(batch+1)*batch_size]
feed_dict = {input_x: x,
input_y: y}
_, loss_batch = sess.run([train_optimizer, cross_entropy], feed_dict=feed_dict)
loss += loss_batch
print(epoch, loss / nb_batchs)
Note: This question follows Same (?) model converges in Keras but not in Tensorflow , which have been considered too broad but in which I show exactly why I think those two statements are somehow different and lead to different computation.
Yes, the results can be different. The results shouldn't be surprising if you know the following things in advance:
Implementation of corss-entropy in Tensorflow and Keras is different. Tensorflow assumes the input to tf.nn.softmax_cross_entropy_with_logits_v2 as the raw unnormalized logits while Keras accepts inputs as probabilities
Implementation of optimizers in Keras and Tensorflow are different.
It might be the case that you are shuffling the data and the batches passed aren't in the same order. Although it doesn't matter if you run the model for long but initial few epochs can be entirely different. Make sure same batch is passed to both and then compare the results.
I want to use the dropout function of tensorflow to check if I can improve my results (TPR, FPR) of my recurrent neural network.
However I implemented it by following a guide. So I am not sure if I did any mistakes. But if I train my model with with e.g. 10 epochs I get nearly the same results after validation. Thats why I am not sure if I use the dropout function correctly. Is this the right implementation in the following code or did I made something wrong? If I did everything right why do I get nearly the same result then?
hm_epochs = 10
n_classes = 2
batch_size = 128
chunk_size = 341
n_chunks = 5
rnn_size = 32
dropout_prop = 0.5 # Dropout, probability to drop a unit
batch_size_validation = 65536
x = tf.placeholder('float', [None, n_chunks, chunk_size])
y = tf.placeholder('float')
def recurrent_neural_network(x):
layer = {'weights':tf.Variable(tf.random_normal([rnn_size, n_classes])),
'biases':tf.Variable(tf.random_normal([n_classes]))}
x = tf.transpose(x, [1,0,2])
x = tf.reshape(x, [-1, chunk_size])
x = tf.split(x, n_chunks, 0)
lstm_cell = rnn.BasicLSTMCell(rnn_size)
outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
output = tf.matmul(outputs[-1], layer['weights']) + layer['biases']
#DROPOUT Implementation -> is this code really working?
#The result is nearly the same after 20 epochs...
output_layer = tf.layers.dropout(output, rate=dropout_prop)
return output
def train_neural_network(x):
prediction = recurrent_neural_network(x)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction,labels=y))
optimizer = tf.train.AdamOptimizer().minimize(cost)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(1,hm_epochs+1):
epoch_loss = 0
for i in range(0, training_data.shape[0], batch_size):
epoch_x = np.array(training_data[i:i+batch_size, :, :], dtype='float')
epoch_y = np.array(training_labels[i:i+batch_size, :], dtype='float')
if len(epoch_x) != batch_size:
epoch_x = epoch_x.reshape((len(epoch_x), n_chunks, chunk_size))
else:
epoch_x = epoch_x.reshape((batch_size, n_chunks, chunk_size))
_, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
epoch_loss += c
train_neural_network(x)
print("rnn - finished!")
In its most basic form, dropout should happen inside the cell and applied to the weights. You only applied it afterwards. This article explained it pretty well with some good visualization and few variations.
To use it in your code, you can either
Implement your own RNN cell where the keep probability is a parameter that initializes the cell or is a parameter that got passed in when it's called every time.
Use an rnn dropout wrapper here.
I am currently studying TensorFlow. I am trying to create a NN which can accurately assess a prediction model and assign it a score. My plan right now is to combine scores from already existing programs run them through a mlp while comparing them to true values. I have played around with the MNIST data and I am trying to apply what I have learnt to my project. Unfortunately i have a problem
def multilayer_perceptron(x, w1):
# Hidden layer with RELU activation
layer_1 = tf.matmul(x, w1)
layer_1 = tf.nn.relu(layer_1)
# Output layer with linear activation
#out_layer = tf.matmul(layer_1, w2)
return layer_1
def my_mlp (trainer, trainer_awn, learning_rate, training_epochs, n_hidden, n_input, n_output):
trX, trY= trainer, trainer_awn
#create placeholders
x = tf.placeholder(tf.float32, shape=[9517, 5])
y_ = tf.placeholder(tf.float32, shape=[9517, ])
#create initial weights
w1 = tf.Variable(tf.zeros([5, 1]))
#predicted class and loss function
y = multilayer_perceptron(x, w1)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, y_))
#training
train_step = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
with tf.Session() as sess:
# you need to initialize all variables
sess.run(tf.initialize_all_variables())
print("1")
for i in range(training_epochs + 1):
sess.run([train_step], feed_dict={x: [trX['V7'], trX['V8'], trX['V9'], trX['V10'], trX['V12']], y_: trY})
return
The code gives me this error
ValueError: Dimension 0 in both shapes must be equal, but are 9517 and 1
This error occurs when running the line for cross_entropy. I don't understand why this is happing, if you need any more information I would be happy to give it to you.
in your case, y has shape [9517, 1] while y_ has shape [9517]. they are not campatible. Please try to reshape y_ using tf.reshape(y_, [-1, 1])
This was caused by the weights.hdf5 file being incompatible with the new data in the repository. I have updated the repo and it should work now.
I hope you can help me. I'm implementing a small multilayer perceptron using TensorFlow and a few tutorials I found on the internet. The problem is that the net is able to learn something, and by this I mean that I am able to somehow optimize the value of the training error and get a decent accuracy, and that's what I was aiming for. However, I am recording with Tensorboard some strange NaN values for the loss function. Quite a lot actually. Here you can see my latest Tensorboard recording of the loss function output. Please all those triangles followed by discontinuities - those are the NaN values, note also that the general trend of the function is what you would expect it to be.
Tensorboard report
I thought that a high learning rate could be the problem, or maybe a net that's too deep, causing the gradients to explode, so I lowered the learning rate and used a single hidden layer (this is the configuration of the image above, and the code below). Nothing changed, I just caused the learning process to be slower.
Tensorflow Code
import tensorflow as tf
import numpy as np
import scipy.io, sys, time
from numpy import genfromtxt
from random import shuffle
#shuffles two related lists #TODO check that the two lists have same size
def shuffle_examples(examples, labels):
examples_shuffled = []
labels_shuffled = []
indexes = list(range(len(examples)))
shuffle(indexes)
for i in indexes:
examples_shuffled.append(examples[i])
labels_shuffled.append(labels[i])
examples_shuffled = np.asarray(examples_shuffled)
labels_shuffled = np.asarray(labels_shuffled)
return examples_shuffled, labels_shuffled
# Import and transform dataset
dataset = scipy.io.mmread(sys.argv[1])
dataset = dataset.astype(np.float32)
all_labels = genfromtxt('oh_labels.csv', delimiter=',')
num_examples = all_labels.shape[0]
dataset, all_labels = shuffle_examples(dataset, all_labels)
# Split dataset into training (66%) and test (33%) set
training_set_size = 2000
training_set = dataset[0:training_set_size]
training_labels = all_labels[0:training_set_size]
test_set = dataset[training_set_size:num_examples]
test_labels = all_labels[training_set_size:num_examples]
test_set, test_labels = shuffle_examples(test_set, test_labels)
# Parameters
learning_rate = 0.0001
training_epochs = 150
mini_batch_size = 100
total_batch = int(num_examples/mini_batch_size)
# Network Parameters
n_hidden_1 = 50 # 1st hidden layer of neurons
#n_hidden_2 = 16 # 2nd hidden layer of neurons
n_input = int(sys.argv[2]) # number of features after LSA
n_classes = 2;
# Tensorflow Graph input
with tf.name_scope("input"):
x = tf.placeholder(np.float32, shape=[None, n_input], name="x-data")
y = tf.placeholder(np.float32, shape=[None, n_classes], name="y-labels")
print("Creating model.")
# Create model
def multilayer_perceptron(x, weights, biases):
with tf.name_scope("h_layer_1"):
# First hidden layer with SIGMOID activation
layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
layer_1 = tf.nn.sigmoid(layer_1)
#with tf.name_scope("h_layer_2"):
# Second hidden layer with SIGMOID activation
#layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
#layer_2 = tf.nn.sigmoid(layer_2)
with tf.name_scope("out_layer"):
# Output layer with SIGMOID activation
out_layer = tf.add(tf.matmul(layer_1, weights['out']), biases['bout'])
out_layer = tf.nn.sigmoid(out_layer)
return out_layer
# Layer weights
with tf.name_scope("weights"):
weights = {
'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1], stddev=0.01, dtype=np.float32)),
#'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2], stddev=0.05, dtype=np.float32)),
'out': tf.Variable(tf.random_normal([n_hidden_1, n_classes], stddev=0.01, dtype=np.float32))
}
# Layer biases
with tf.name_scope("biases"):
biases = {
'b1': tf.Variable(tf.random_normal([n_hidden_1], dtype=np.float32)),
#'b2': tf.Variable(tf.random_normal([n_hidden_2], dtype=np.float32)),
'bout': tf.Variable(tf.random_normal([n_classes], dtype=np.float32))
}
# Construct model
pred = multilayer_perceptron(x, weights, biases)
# Define loss and optimizer
with tf.name_scope("loss"):
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
with tf.name_scope("adam"):
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
# Initializing the variables
init = tf.initialize_all_variables()
# Define summaries
tf.scalar_summary("loss", cost)
summary_op = tf.merge_all_summaries()
print("Model ready.")
# Launch the graph
with tf.Session() as sess:
sess.run(init)
board_path = sys.argv[3]+time.strftime("%Y%m%d%H%M%S")+"/"
writer = tf.train.SummaryWriter(board_path, graph=tf.get_default_graph())
print("Starting Training.")
for epoch in range(training_epochs):
training_set, training_labels = shuffle_examples(training_set, training_labels)
for i in range(total_batch):
# example loading
minibatch_x = training_set[i*mini_batch_size:(i+1)*mini_batch_size]
minibatch_y = training_labels[i*mini_batch_size:(i+1)*mini_batch_size]
# Run optimization op (backprop) and cost op
_, summary = sess.run([optimizer, summary_op], feed_dict={x: minibatch_x, y: minibatch_y})
# Write log
writer.add_summary(summary, epoch*total_batch+i)
print("Optimization Finished!")
# Test model
test_error = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
accuracy = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(accuracy, np.float32))
test_error, accuracy = sess.run([test_error, accuracy], feed_dict={x: test_set, y: test_labels})
print("Test Error: " + test_error.__str__() + "; Accuracy: " + accuracy.__str__())
print("Tensorboard path: " + board_path)
I'll post the solution here just in case someone gets stuck in a similar way. If you see that plot very carefully, all of the NaN values (the triangles) come on a regular basis, like if at the end of every loop something causes the output of the loss function to just go NaN.
The problem is that, at every loop, I was giving a mini batch of "empty" examples. The problem lies in how I declared my inner training loop:
for i in range(total_batch):
Now what we'd like here is to have Tensorflow go through the entire training set, one minibatch at a time. So let's look at how total_batch was declared:
total_batch = int(num_examples / mini_batch_size)
That is not quite what we'd want to do - as we want to consider the training set only. So changing this line to:
total_batch = int(training_set_size / mini_batch_size)
Fixed the problem.
It is to be noted that Tensorflow seemed to ignore those "empty" batches, computing NaN for the loss but not updating the gradients - that's why the trend of the loss was one of a net that's learning something.