I have a data set like this:
edge_origins = np.array([[0,1,2,3,4],[6,7,8]])
edge_destinations = np.array([[1,2,3,4,5],[7,8,9]])
target = np.array([0,1])
x = [[np.array([0.1,0.5,0.2]),np.array([0.5,0.6,0.23]),
np.array([0.1,0.5,0.5]),np.array([0.1,0.6,0.23]),
np.array([0.1,0.4,0.4]),np.array([0.52,0.6,0.23])],
[np.array([0.1,0.3,0.3]),np.array([0.3,0.6,0.23]),
np.array([0.1,0.1,0.2]),np.array([0.4,0.6,0.23])]]
This is a list of two networks. The first network has 6 nodes with 5 edges and class 0; the second has 4 nodes with 3 edges and class 1.
I want to develop a model in PyTorch that will classify each network into its class, and then I'll give it a new set of networks to classify.
So ultimately, I want to be able to shuffle these lists (simultaneously, i.e. maintaining the order between the data and the classes), split into train and test, and then read the train and test data into two data loaders, and feed these into a PyTorch network.
I wrote this:
edge_origins = np.array([[0,1,2,3,4],[6,7,8]])
edge_destinations = np.array([[1,2,3,4,5],[7,8,9]])
target = np.array([0,1])
x = [[np.array([0.1,0.5,0.2]),np.array([0.5,0.6,0.23]),
np.array([0.1,0.5,0.5]),np.array([0.1,0.6,0.23]),
np.array([0.1,0.4,0.4]),np.array([0.52,0.6,0.23])],
[np.array([0.1,0.3,0.3]),np.array([0.3,0.6,0.23]),
np.array([0.1,0.1,0.2]),np.array([0.4,0.6,0.23])]]
edge_index = torch.tensor([edge_origins, edge_destinations], dtype=torch.long)
dataset = Data(x=x, edge_index=edge_index, y=target, num_classes=len(set(target)))
print(dataset)
And the error is:
edge_index = torch.tensor([edge_origins, edge_destinations], dtype=torch.long)
ValueError: expected sequence of length 5 at dim 2 (got 3)
But then once that is fixed I think the next step is:
torch.manual_seed(12345)
dataset = dataset.shuffle()
train_dataset = dataset[:1] #for toy example
test_dataset = dataset[1:]
print(f'Number of training graphs: {len(train_dataset)}')
print(f'Number of test graphs: {len(test_dataset)}')
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GCNConv(hidden_channels, hidden_channels)
        self.lin = Linear(hidden_channels, dataset.num_classes)

    def forward(self, x, edge_index, batch):
        # 1. Obtain node embeddings
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = self.conv3(x, edge_index)
        # 2. Readout layer
        x = global_mean_pool(x, batch)  # [batch_size, hidden_channels]
        # 3. Apply a final classifier
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin(x)
        return x
model = GCN(hidden_channels=64)
print(model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
def train():
    model.train()
    for data in train_loader:  # Iterate in batches over the training dataset.
        out = model(data.x, data.edge_index, data.batch)  # Perform a single forward pass.
        loss = criterion(out, data.y)  # Compute the loss.
        loss.backward()  # Derive gradients.
        optimizer.step()  # Update parameters based on gradients.
        optimizer.zero_grad()  # Clear gradients.
def test(loader):
    model.eval()
    correct = 0
    for data in loader:  # Iterate in batches over the training/test dataset.
        out = model(data.x, data.edge_index, data.batch)
        pred = out.argmax(dim=1)  # Use the class with highest probability.
        correct += int((pred == data.y).sum())  # Check against ground-truth labels.
    return correct / len(loader.dataset)  # Derive ratio of correct predictions.
for epoch in range(1, 171):
    train()
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    print(f'Epoch: {epoch:03d}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}')
Could someone demonstrate how to get my data into the PyTorch network above?
In PyTorch Geometric, a Data object is used to contain a single graph, so you can iterate through all your arrays like so:
data_list = []
for i in range(2):
    edge_index_curr = torch.tensor([edge_origins[i],
                                    edge_destinations[i]],
                                   dtype=torch.long)
    # node features must be float for GCNConv
    data = Data(x=torch.tensor(np.array(x[i]), dtype=torch.float),
                edge_index=edge_index_curr,
                y=torch.tensor(target[i]))
    data_list.append(data)
You can then use this list of Data objects to create your own DataLoader:
loader = DataLoader(data_list, batch_size=32)
If you need to split into train/val/test (I would advise having more than 2 samples for that), you can do it manually or with sklearn.model_selection, as in the sketch below.
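For example, a minimal sketch of such a split using the data_list built above (with only two graphs the split is purely illustrative; in older PyTorch Geometric versions the DataLoader import path is torch_geometric.data):

from sklearn.model_selection import train_test_split
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader in older versions

# Each Data object carries its own y, so graphs and labels stay aligned through the shuffle
train_list, test_list = train_test_split(data_list, test_size=0.5, random_state=12345)

train_loader = DataLoader(train_list, batch_size=64, shuffle=True)
test_loader = DataLoader(test_list, batch_size=64, shuffle=False)

These loaders can then be fed directly into the train()/test() functions from the question.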
For data augmentation, if you really do have very little data, pytorch-geometric comes with transforms.
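For instance, a minimal sketch applying one of the built-in transforms to the list built above (NormalizeFeatures row-normalizes the node features; pick whichever transform fits your data):

import torch_geometric.transforms as T

# Apply a built-in transform to every graph in the list
transform = T.NormalizeFeatures()
data_list = [transform(d) for d in data_list]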
I've got around 10 GB of training data in numpy array format. However, my RAM is not big enough to load the data and the tensorflow 2.0 model at the same time. I've done plenty of research into tf.data, TFRecords, and generators, but I'm stuck on the actual implementation of the training loop. I currently have the standard training loop shown after the options below, which I used with a loadable subset of the data to test that everything works.
The options I considered are:
1) Splitting the 10 GB of numpy files into a large number of shuffled TFRecord files, each the size of BATCH_SIZE, and then loading them in every train_step of every epoch
2) Splitting the 10 GB of numpy files into smaller shuffled files of 1-2 GB each, and then training the model by loading the smaller files sequentially every epoch
Is there a best practice for this, or a better option I am not considering? I imagine this should be a trivial problem, but I can't find any good solutions online.
if __name__ == '__main__':
    # get the original_dataset
    # train_dataset, valid_dataset
    train_dataset = tf.data.Dataset.from_tensor_slices((mtr, labels))
    train_dataset = train_dataset.shuffle(buffer_size=mtr.shape[0]).batch(BATCH_SIZE)
    valid_dataset = tf.data.Dataset.from_tensor_slices((val_mtr, val_labels))
    valid_dataset = valid_dataset.batch(BATCH_SIZE)

    # create model
    model = get_model()

    # define loss and optimizer
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-04, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

    train_loss = tf.keras.metrics.Mean(name='train_loss')
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')
    valid_loss = tf.keras.metrics.Mean(name='valid_loss')
    valid_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='valid_accuracy')

    @tf.function
    def train_step(images, labels):
        with tf.GradientTape() as tape:
            predictions = model(images, training=True)
            loss = loss_object(y_true=labels, y_pred=predictions)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(grads_and_vars=zip(gradients, model.trainable_variables))
        train_loss(loss)
        train_accuracy(labels, predictions)

    @tf.function
    def valid_step(images, labels):
        predictions = model(images, training=False)
        v_loss = loss_object(labels, predictions)
        valid_loss(v_loss)
        valid_accuracy(labels, predictions)

    # start training
    for epoch in range(EPOCHS):
        train_loss.reset_states()
        train_accuracy.reset_states()
        valid_loss.reset_states()
        valid_accuracy.reset_states()
        step = 0
        for imgs, lbls in train_dataset:
            step += 1
            train_step(imgs, lbls)
            print("Epoch: {}/{}, step: {}/{}, loss: {:.5f}, accuracy: {:.5f}".format(
                epoch + 1, EPOCHS, step, math.ceil(mtr.shape[0] / BATCH_SIZE),
                train_loss.result(), train_accuracy.result()))
        for valid_images, valid_labels in valid_dataset:
            valid_step(valid_images, valid_labels)
        print("Epoch: {}/{}, train loss: {:.5f}, train accuracy: {:.5f}, "
              "valid loss: {:.5f}, valid accuracy: {:.5f}".format(
                  epoch + 1, EPOCHS, train_loss.result(),
                  train_accuracy.result(), valid_loss.result(), valid_accuracy.result()))

    model.save_weights(filepath=save_model_dir, save_format='tf')
Based on the answer below this seemed to work:
def load_data(filename: tf.Tensor):
    # coll[0] = data, coll[1] = label
    coll = pickle.load(open("./TrainData2016/%s" % (str(filename.numpy().decode('utf-8'))), "rb"))
    return tf.convert_to_tensor(coll[0], dtype=tf.float32), tf.convert_to_tensor(coll[1], dtype=tf.int32)

filenames = os.listdir("./TrainData2016")  # You can just shuffle this array and not use tf.data.Dataset.shuffle()
train_dataset = tf.data.Dataset.from_tensor_slices(filenames)
train_dataset = train_dataset.map(lambda x: tf.py_function(func=load_data, inp=[x], Tout=(tf.float32, tf.int32)))
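If each pickle file holds exactly one batch (option 1 above), it may be enough to add prefetching on top of this; a sketch, assuming TF 2.0 where AUTOTUNE lives under tf.data.experimental:

# Overlap loading of the next file with training on the current one;
# each dataset element is already a full batch here, so no .batch() call is needed
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)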
Did you try to use the map function from tf.data.Dataset? Instead of loading the images themselves, you can build the dataset from an array of the filenames only; have a look at the example below:
def load_img(filename):
    # Read image file
    img = tf.io.read_file(filename)
    img = tf.io.decode_png(img)  # have a look at tf.io.decode_*
    # Do some normalization or transformations if needed on img
    # Add label
    label = associated_image_label()  # associate the needed label to your image
    return img, label

filenames = os.listdir("./images_dir")  # You can just shuffle this array and not use tf.data.Dataset.shuffle()
train_dataset = tf.data.Dataset.from_tensor_slices(filenames)
train_dataset = train_dataset.map(load_img)
# Create batches and prefetch (BATCH_SIZE as in your training loop)
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
This way, only the images needed for the current train_step are loaded at a time.
How about using Keras with train_on_batch?
Model.train_on_batch(
    x,
    y=None,
    sample_weight=None,
    class_weight=None,
    reset_metrics=True,
    return_dict=False,
)
https://keras.io/api/models/model_training_apis/
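A rough sketch of how that could look combined with option 2 from the question (the chunk file names and the .npz layout are assumptions for illustration; model and EPOCHS are as in the snippet above):

import numpy as np

BATCH_SIZE = 128
chunk_files = ["chunk_0.npz", "chunk_1.npz"]  # hypothetical 1-2 GB chunks, each with arrays "x" and "y"

for epoch in range(EPOCHS):
    np.random.shuffle(chunk_files)  # shuffle at the chunk level each epoch
    for fname in chunk_files:
        chunk = np.load(fname)      # only one chunk is held in RAM at a time
        x, y = chunk["x"], chunk["y"]
        for i in range(0, len(x), BATCH_SIZE):
            # one gradient update per mini-batch
            model.train_on_batch(x[i:i + BATCH_SIZE], y[i:i + BATCH_SIZE])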
At the moment I'm trying to follow an example of Temperature Forecasting in Keras (as given in chapter 6.3 of F. Chollet's "Deep Learning with Python" book). I'm having some issues with prediction using the generator that is specified. My understanding is that I should be using model.predict_generator for prediction, but I'm unsure how to use the steps parameter for this method and how to get back predictions that are the correct "shape" for my original data.
Ideally, I would like to be able to plot the test set (indices 300001 until the end) and also plot my predictions for this test set (i.e. an array of the same length with predicted values).
An example (Dataset available here: https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip) is as follows:
import numpy as np
# Read in data
fname = ('jena_climate_2009_2016.csv')
f = open(fname)
data = f.read()
f.close()
lines = data.split('\n')
col_names = lines[0].split(',')
col_names = [i.replace('"', "") for i in col_names]
# Parse the records into a float array, dropping the date column
float_data = np.zeros((len(lines) - 1, len(col_names) - 1))
for i, line in enumerate(lines[1:]):
    float_data[i, :] = [float(v) for v in line.split(',')[1:]]
temp = float_data[:, 1]
# Normalize the data
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
def generator(data, lookback, delay, min_index, max_index, shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield (samples, targets)
lookback = 720
step = 6
delay = 144
batch_size = 128  # assumed value; batch_size is used below but was not defined in the snippet
train_gen = generator(float_data, lookback=lookback, delay=delay,
                      min_index=0, max_index=200000, shuffle=True,
                      step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay,
                    min_index=200001, max_index=300000, step=step,
                    batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None, step=step,
                     batch_size=batch_size)
val_steps = (300000 - 200001 - lookback)
test_steps = (len(float_data) - 300001 - lookback)
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.Flatten(input_shape=(lookback // step, float_data.shape[-1])))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
model.fit_generator(train_gen, steps_per_epoch=500,
                    epochs=20, validation_data=val_gen,
                    validation_steps=val_steps)
After some searching around online, I tried some techniques similar to the following:
pred = model.predict_generator(test_gen, steps=test_steps // batch_size)
However the prediction array that I got back was far too long and didn't match up to my original data at all. Has anyone got any suggestions?
For anyone looking at this question now: with newer versions of Keras, you are not required to specify the steps parameter when using predict_generator. Ref: https://github.com/keras-team/keras/issues/11902
If a value is provided, predictions for steps * batch_size examples will be generated. This may result in the exclusion of the last len(test) % batch_size samples, as mentioned by the OP.
Also, it seems to me that setting batch_size=1 defeats the purpose of using the generator, as it is equivalent to iterating over the test data one by one.
Similarly setting steps=1 (when batch_size is not set in test_generator) will read the entire test data at once, which is not ideal for large test data.
For the steps argument of predict_generator, divide the number of images in your test path by the batch size you provided in test_gen. For example: if I have 50 images and provide a batch size of 10, then steps would be 5.
# first separate the test images and test labels
test_images, test_labels = next(test_gen)
# get the class indices
test_labels = test_labels[:, 0]  # this should give you an array of labels
predictions = model.predict_generator(test_gen, steps=num_images // batch_size, verbose=0)  # num_images = total number of test images
predictions[:, 0]  # these are your actual predictions
Your original code looks correct:
pred = model.predict_generator(test_gen, steps=test_steps // batch_size)
I tried and did not see any problem generating a pred of length around 120k. What size did you get?
Actually, both of the step counts in the code are incorrect. They should be:
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (len(float_data) - 300001 - lookback) // batch_size
(Didn't it take forever for your validation to run for each epoch?)
Of course with this correction you can simply use
pred = model.predict_generator(test_gen, steps=test_steps)
As I arrived at a semi-acceptable version of an answer to my own question, I decided to post it for posterity:
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None, step=step,
                     batch_size=1)  # "reset" the generator
pred = model.predict_generator(test_gen, steps=test_steps)
This now has the shape I want to plot it against my original test set. I could also use a more manual approach inspired somewhat by this answer:
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None, step=step,
                     batch_size=1)  # "reset" the generator
truth = []
pred = []
for i in range(test_steps):
    x, y = next(test_gen)
    pred.append(model.predict(x))
    truth.append(y)
pred = np.concatenate(pred)
truth = np.concatenate(truth)
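With pred and truth now the same length, plotting against the original test set is straightforward; a sketch, noting that the values are still normalized, so std[1] and mean[1] from the preprocessing step undo that for the temperature column:

import matplotlib.pyplot as plt

# Undo the normalization applied earlier so both curves are in degrees Celsius
plt.plot(truth * std[1] + mean[1], label='ground truth')
plt.plot(pred.ravel() * std[1] + mean[1], label='predictions')
plt.legend()
plt.show()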
I am teaching myself the tf.data API, using the MNIST dataset for binary classification. The training x and y data are zipped together in the full train_dataset. Chained onto this zip method is first the batch() dataset method: the data is batched with a batch size of 128. Since my training set size is 11623, with batch size 128 I will have 91 batches. The size of the last batch will be 103, which is fine since this is an LSTM. Additionally, I am using drop-out, and I turn the drop-out off when I compute batch accuracy.
The full code is given below:
#Ignore the warnings
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8,7)
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/")
Xtrain = mnist.train.images[mnist.train.labels < 2]
ytrain = mnist.train.labels[mnist.train.labels < 2]
print(Xtrain.shape)
print(ytrain.shape)
#Data parameters
num_inputs = 28
num_classes = 2
num_steps=28
# create the training dataset
Xtrain = tf.data.Dataset.from_tensor_slices(Xtrain).map(lambda x: tf.reshape(x,(num_steps, num_inputs)))
# apply a one-hot transformation to each label for use in the neural network
ytrain = tf.data.Dataset.from_tensor_slices(ytrain).map(lambda z: tf.one_hot(z, num_classes))
# zip the x and y training data together and batch and Prefetch data for faster consumption
train_dataset = tf.data.Dataset.zip((Xtrain, ytrain)).batch(128).prefetch(128)
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,train_dataset.output_shapes)
X, y = iterator.get_next()
training_init_op = iterator.make_initializer(train_dataset)
#### model is here ####
#Network parameters
num_epochs = 2
batch_size = 128
output_keep_var = 0.5
with tf.Session() as sess:
    init.run()
    print("Initialized")
    # Training cycle
    for epoch in range(0, num_epochs):
        num_batch = 0
        print("Epoch: ", epoch)
        avg_cost = 0.
        avg_accuracy = 0
        total_batch = int(11623 / batch_size + 1)
        sess.run(training_init_op)
        while True:
            try:
                _, miniBatchCost = sess.run([trainer, loss], feed_dict={output_keep_prob: output_keep_var})
                miniBatchAccuracy = sess.run(accuracy, feed_dict={output_keep_prob: 1.0})
                print('Batch %d: loss = %.2f, acc = %.2f' % (num_batch, miniBatchCost, miniBatchAccuracy * 100))
                num_batch += 1
            except tf.errors.OutOfRangeError:
                break
When I run this code, it seems to be working and prints:
Batch 0: loss = 0.67276, acc = 0.94531
Batch 1: loss = 0.65672, acc = 0.92969
Batch 2: loss = 0.65927, acc = 0.89062
Batch 3: loss = 0.63996, acc = 0.99219
Batch 4: loss = 0.63693, acc = 0.99219
Batch 5: loss = 0.62714, acc = 0.9765
......
......
Batch 39: loss = 0.16812, acc = 0.98438
Batch 40: loss = 0.10677, acc = 0.96875
Batch 41: loss = 0.11704, acc = 0.99219
Batch 42: loss = 0.10592, acc = 0.98438
Batch 43: loss = 0.09682, acc = 0.97656
Batch 44: loss = 0.16449, acc = 1.00000
However, as one can easily see, something is wrong: only 45 batches are printed, not 91, and I do not know why this is happening. I have tried many things and I think I am missing something.
I could use the repeat() function, but I do not want that, because it would introduce redundant observations in the last batches, and I want the LSTM to handle the uneven final batch.
This is an annoying pitfall when defining a model based directly on the get_next() output of a tf.data iterator. In your loop, you have two sess.run calls, both of which will advance the iterator by one step. This means each loop iteration actually consumes two batches (and also your loss and accuracy calculations are computed on different batches).
Not entirely sure if there is a "canonical" way of fixing this, but you could:
1) Compute the accuracy in the same run call as the cost/training step. This means the accuracy calculation is also affected by the dropout mask, but since it's an approximate value based on only one batch, that shouldn't be a huge issue (see the sketch after these options).
2) Define your model based on a placeholder instead, and in each loop iteration run the get_next op itself, then feed the resulting numpy arrays (i.e. the batch) into the loss/accuracy computations.
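A sketch of the first option applied to the loop above (one sess.run call, so the iterator advances exactly once per iteration; trainer, loss, accuracy and the feed names are taken from the question's code):

while True:
    try:
        # A single run call: trainer, loss and accuracy all see the same batch,
        # and the iterator is advanced only once
        _, miniBatchCost, miniBatchAccuracy = sess.run(
            [trainer, loss, accuracy],
            feed_dict={output_keep_prob: output_keep_var})
        print('Batch %d: loss = %.2f, acc = %.2f' % (num_batch, miniBatchCost, miniBatchAccuracy * 100))
        num_batch += 1
    except tf.errors.OutOfRangeError:
        break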
I'm using this PyTorch implementation of Segnet with pretrained values I found for object segmentation, and it works fine.
Now I want to resume the training from the values I have, using a new dataset with similar images.
How can I do that?
I guess I have to use the "train.py" file found in the repository, but I don't know what to write in order to replace the "fill the batch" comment.
Here is that portion of the code:
def train(epoch):
    model.train()

    # update learning rate
    lr = args.lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # define a weighted loss (0 weight for 0 label)
    weights_list = [0] + [1 for i in range(17)]
    weights = np.asarray(weights_list)
    weigthtorch = torch.Tensor(weights_list)

    if (USE_CUDA):
        loss = nn.CrossEntropyLoss(weight=weigthtorch).cuda()
    else:
        loss = nn.CrossEntropyLoss(weight=weigthtorch)

    total_loss = 0

    # iteration over the batches
    batches = []
    for batch_idx, batch_files in enumerate(tqdm(batches)):
        # containers
        batch = np.zeros((args.batch_size, input_nbr, imsize, imsize), dtype=float)
        batch_labels = np.zeros((args.batch_size, imsize, imsize), dtype=int)

        # fill the batch
        # ...
        # What should I write here?

        batch_th = Variable(torch.Tensor(batch))
        target_th = Variable(torch.LongTensor(batch_labels))

        if USE_CUDA:
            batch_th = batch_th.cuda()
            target_th = target_th.cuda()

        # initialize gradients
        optimizer.zero_grad()

        # predictions
        output = model(batch_th)

        # Loss
        output = output.view(output.size(0), output.size(1), -1)
        output = torch.transpose(output, 1, 2).contiguous()
        output = output.view(-1, output.size(2))
        target = target_th.view(-1)
        l_ = loss(output.cuda(), target)
        total_loss += l_.cpu().data.numpy()
        l_.cuda()
        l_.backward()
        optimizer.step()

    return total_loss / len(files)
If I had to guess, he probably made some DataLoader feeder that extended the PyTorch DataLoader class. See
https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
Near the bottom of the page you can see an example in which they loop over their data loader
for i_batch, sample_batched in enumerate(dataloader):
What this would look like for images, for example, is:
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=False, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batchSize, shuffle=True, num_workers=2)
for batch_idx, (inputs, targets) in enumerate(trainloader):
    # Using the pytorch data loader the inputs and targets are given
    # automatically
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    inputs, targets = Variable(inputs), Variable(targets)
How exactly the author loads his files I don't know, but you could follow the procedure from https://pytorch.org/tutorials/beginner/data_loading_tutorial.html to make your own DataLoader.
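As a starting point, a minimal custom Dataset along those lines might look like the sketch below (the directory layout, the .npy storage, and the class name are assumptions for illustration, not the repository's actual loading code):

import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SegnetDataset(Dataset):
    """Pairs each input image with its per-pixel label map (hypothetical .npy files)."""
    def __init__(self, image_dir, label_dir):
        self.image_dir = image_dir
        self.label_dir = label_dir
        self.files = sorted(os.listdir(image_dir))  # assumes matching file names in both dirs

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        image = np.load(os.path.join(self.image_dir, self.files[idx]))  # (input_nbr, imsize, imsize)
        label = np.load(os.path.join(self.label_dir, self.files[idx]))  # (imsize, imsize)
        return torch.FloatTensor(image), torch.LongTensor(label)

loader = DataLoader(SegnetDataset("imgs/", "labels/"), batch_size=4, shuffle=True)
for batch_th, target_th in loader:
    pass  # this loop replaces the "fill the batch" section of train()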