Tensorflow: How can I restore model for training? (Python)

I want to train a CNN for 20,000 steps. At step 100 I want to save all variables, and after that I want to re-run my code, restore the model, and continue from step 100. I am trying to make it work with the TensorFlow documentation: https://www.tensorflow.org/versions/r0.10/how_tos/variables/index.html but I can't. Any help?
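For reference, a minimal sketch of the save/restore loop being asked about, using tf.train.Saver (TF 1.x API). The one-variable graph and train_op stand in for the real CNN:

import os
import tensorflow as tf

checkpoint_dir = './checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)

# Toy graph standing in for the CNN: one variable and a "training" op.
w = tf.Variable(0.0, name='w')
train_op = tf.assign_add(w, 1.0)

saver = tf.train.Saver()  # build this after the graph is defined

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(checkpoint_dir)
    if ckpt:
        saver.restore(sess, ckpt)                  # variables come back from disk
        start_step = int(ckpt.split('-')[-1]) + 1  # resume after the saved step
    else:
        sess.run(tf.global_variables_initializer())
        start_step = 0

    for step in range(start_step, 20000):
        sess.run(train_op)                         # your training step
        if step == 100:                            # save all variables at step 100
            saver.save(sess, os.path.join(checkpoint_dir, 'model'),
                       global_step=step)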

I'm stuck on something similar, but maybe this link can help you. I'm new to TensorFlow, but I think you can't restore and fit without having to train your model again.

This functionality is still unstable and the documentation is outdated, which makes it confusing. What worked for me (this was a suggestion from people at Google who work directly on TensorFlow) was to use the model_dir parameter in the constructor of my models before training; it tells TensorFlow where to store your model. After training, you just instantiate a model again with the same model_dir and it restores the model from the generated files and checkpoints.
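A hedged sketch of that model_dir idea with tf.estimator (TF 1.x); the tiny model_fn and the random data are only illustrative:

import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode):
    logits = tf.layers.dense(features['x'], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': np.random.rand(32, 4).astype(np.float32)},
    y=np.random.randint(0, 2, 32), shuffle=True, num_epochs=None)

# First run: trains 100 steps and writes checkpoints into model_dir.
est = tf.estimator.Estimator(model_fn=model_fn, model_dir='/tmp/my_model')
est.train(input_fn=train_input_fn, steps=100)

# A later run with the same model_dir restores the latest checkpoint
# automatically and continues from step 100 -- no manual restore code.
est = tf.estimator.Estimator(model_fn=model_fn, model_dir='/tmp/my_model')
est.train(input_fn=train_input_fn, steps=100)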

Related

Can you plot the accuracy graph of a pre-trained model? Deep Learning

I am new to Deep Learning. I finished training a model that took 8 hours to run, but I forgot to plot the accuracy graph before closing the Jupyter notebook.
I need to plot the graph, and I did save the model to my hard disk. But how do I plot the accuracy graph of a pre-trained model? I searched online for solutions and came up empty.
Any help would be appreciated! Thanks!
What framework did you use, and which version? For future problems you may face, this information can play a key role in how we can help you.
Unfortunately, for PyTorch/TensorFlow the model you saved most likely contains only the weights of the neurons, not the training history. Once the Jupyter notebook is closed, the memory is cleared (and with it, the data of your training history).
The only thing you can extract is the final loss/accuracy you had.
However, if you regularly saved versions of the model during training, you can load them and manually compute the accuracy/loss you need. Then you can use matplotlib to reconstruct the graph.
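A sketch of that reconstruction, assuming per-epoch Keras checkpoints named model_01.h5, model_02.h5, ... and that x_val, y_val are your validation arrays (both names are placeholders):

import matplotlib.pyplot as plt
from keras.models import load_model

losses, accs = [], []
for epoch in range(1, 11):                      # however many checkpoints you kept
    model = load_model('model_%02d.h5' % epoch)
    loss, acc = model.evaluate(x_val, y_val, verbose=0)
    losses.append(loss)
    accs.append(acc)

plt.plot(range(1, 11), accs, label='val accuracy')
plt.plot(range(1, 11), losses, label='val loss')
plt.xlabel('epoch')
plt.legend()
plt.show()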
I understand this is probably not the answer you were looking for. However, if the hardware is yours, I would recommend restarting the training; 8 hours is not that much for training a deep learning model.

Tensorflow: How to load a pre-trained ResNet model

I want to use a pre-trained ResNet model which Tensorflow provides here.
First I downloaded the code (resnet_v1.py) to reconstruct the model's graph here. The model's weights (resnet_v1_50.ckpt) can be found on the same page here.
The model can be tested using the following script (resnet_v1_test.py) from here. However, I have trouble extracting the right information from resnet_v1_test.py; a lot of what happens in this script is unclear to me. Which functions are essential to pass a random image through the network? How can I access the weights and activations for further work?
What are the next steps from here? I would appreciate any help!
TL;DR: How can I use the resnet_v1_test.py script to perform classification and access weights and activations?
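For reference, a minimal sketch of a forward pass with the slim version of this model (TF 1.x). It assumes the tf.contrib.slim nets are available and that resnet_v1_50.ckpt sits in the working directory:

import numpy as np
import tensorflow as tf
from tensorflow.contrib import slim
from tensorflow.contrib.slim.nets import resnet_v1

inputs = tf.placeholder(tf.float32, [None, 224, 224, 3])
with slim.arg_scope(resnet_v1.resnet_arg_scope()):
    logits, end_points = resnet_v1.resnet_v1_50(inputs, num_classes=1000,
                                                is_training=False)

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'resnet_v1_50.ckpt')
    image = np.random.rand(1, 224, 224, 3).astype(np.float32)  # a random "image"
    preds, activations = sess.run([logits, end_points], {inputs: image})
    # end_points maps layer names to their activation tensors; the weights
    # are reachable via tf.trainable_variables() once the graph is built.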

Why does the saved model start with initial loss and accuracy values after loading in Keras?

I am building a model for machine comprehension. It's a heavy model that has to be trained on a lot of data, which takes a long time. I have used Keras callbacks to save the model after every epoch and also to save a history of loss and accuracy.
The problem is, when I load the trained model and try to continue its training using the initial_epoch argument, the loss and accuracy values are the same as for an untrained model.
Here is the code: https://github.com/ParikhKadam/bidaf-keras
The code used to save and load model is in /models/bidaf.py
The script I am using to load the model is:
from .models import BidirectionalAttentionFlow
from .scripts.data_generator import load_data_generators
import os
import numpy as np
def main():
    emdim = 400
    bidaf = BidirectionalAttentionFlow(emdim=emdim, num_highway_layers=2,
                                       num_decoders=1, encoder_dropout=0.4, decoder_dropout=0.6)
    bidaf.load_bidaf(os.path.join(os.path.dirname(__file__), 'saved_items', 'bidaf_29.h5'))
    train_generator, validation_generator = load_data_generators(batch_size=16, emdim=emdim, shuffle=True)
    model = bidaf.train_model(train_generator, epochs=50, validation_generator=validation_generator, initial_epoch=29,
                              save_history=False, save_model_per_epoch=False)

if __name__ == '__main__':
    main()
The training history is quite good:
epoch,accuracy,loss,val_accuracy,val_loss
0,0.5021367247352657,5.479433422293752,0.502228641179383,5.451400522458351
1,0.5028450897193741,5.234336488338403,0.5037527732234647,5.0748545675049
2,0.5036885394022954,5.042028017280698,0.5039489093881276,5.0298488218407975
3,0.503893446146289,4.996997425685413,0.5040753162241299,4.976164487656699
4,0.5040576918224873,4.955544574118662,0.5041905890181151,4.931354981493792
5,0.5042372655790888,4.909940965651957,0.5043896965802341,4.881359395178988
6,0.504458428129642,4.8542871887472465,0.5045972716586732,4.815464454729135
7,0.50471843351102,4.791098495962496,0.5048680457262408,4.747811231472629
8,0.5050776754196002,4.713560494026321,0.5054184527602898,4.64730478015052
9,0.5058853749443502,4.580552254050073,0.5071290369370443,4.446513280167718
10,0.5081544614246304,4.341471499420364,0.5132941329030303,4.145318906086552
11,0.5123970410575613,4.081624463197288,0.5178775145611896,4.027316586998608
12,0.5149879128865782,3.9577423109634613,0.5187159608315838,3.950151870168726
13,0.5161411008840144,3.8964761709052578,0.5191430166876064,3.906301355196609
14,0.5168211272672539,3.8585826589385697,0.5191263493850466,3.865382308412537
15,0.5173216891201444,3.830764191839807,0.519219763635108,3.8341492204942607
16,0.5177805591697787,3.805340048675155,0.5197178382215892,3.8204319018292585
17,0.5181171635676399,3.7877712072310343,0.5193657963810704,3.798006804522368
18,0.5184295824699279,3.77086071548255,0.5193122694008523,3.7820449101377243
19,0.5187343664397653,3.7555085003534194,0.5203585262348183,3.776260506494833
20,0.519005008308583,3.7430062334375065,0.5195983755362352,3.7605361109533995
21,0.5192872482429703,3.731001830462149,0.5202017035842986,3.7515058917231405
22,0.5195097722222706,3.7194103983513553,0.5207148585133065,3.7446572377159795
23,0.5197511249107636,3.7101052441559905,0.5207420740297026,3.740088335181619
24,0.5199862479678652,3.701593302911729,0.5200187951731082,3.7254406861185188
25,0.5200847805044403,3.6944093077914464,0.520112738649039,3.7203616696860786
26,0.5203289568582412,3.6844954882274092,0.5217114634669081,3.7214983577364547
27,0.5205629846610852,3.6781935968943595,0.520915311442328,3.705435317731209
28,0.5206827641463226,3.6718110897539193,0.5214088439286978,3.7003081666703377
Also, I have already taken care of loading custom objects such as layers, the loss function, and the accuracy metric.
I am quite frustrated by now, as it took me days to train this model to this point and now I can't resume training. I have referred to various threads in the Keras issues and found many people facing such problems, but no solution.
Someone in a thread said that "Keras will not save RNN states" (I am not using stateful RNNs), and someone else said that "Keras reinitializes all the weights before saving, which we can handle using a flag." If such problems exist in Keras, what is the use of functions like save()?
I have also tried saving only the weights after every epoch, then building the model from scratch and loading those weights into it. That didn't work either. You can find the old code I used to save weights only in the older branches of the GitHub repo listed above.
I have referred to this issue, with no help: #4875
That issue has been open for the past two years. I can't understand what all the developers are doing! Is there anyone here who can help? Should I switch to TensorFlow, or will I face the same issues there too?
Please help...
Edit1:
I haven't tried saving the model using model.save(), but I have seen people on other threads saying that the issue was solved by model.save() and models.save_model(). If it is actually solved, ModelCheckpoint should also save the optimizer state so training can resume, but it doesn't (or can't), for whatever reason. I have verified the code of the ModelCheckpoint callback: it indirectly calls model.save(), which leads to a call to models.save_model(). So theoretically, if the issue is solved in the base, i.e. models.save_model(), it should also be solved in the other functions.
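For anyone who wants to test this path cheaply, here is a toy version of the exact chain in question (ModelCheckpoint -> model.save() -> models.save_model(), then load_model() and fit(initial_epoch=...)); the tiny model and random data are only illustrative:

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

x = np.random.rand(64, 10)
y = np.random.randint(0, 2, 64)

model = Sequential([Dense(1, activation='sigmoid', input_shape=(10,))])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Full-model checkpoint per epoch: architecture + weights + optimizer state.
# Note the {epoch} placeholder is 1-based in Keras 2.
ckpt = ModelCheckpoint('toy_{epoch:02d}.h5', save_weights_only=False)
model.fit(x, y, epochs=3, callbacks=[ckpt], verbose=0)

# load_model restores the optimizer state, so fit() should pick up where it
# left off rather than starting from initial loss values. Pass
# custom_objects={...} here if your model uses custom layers or losses.
model = load_model('toy_03.h5')
model.fit(x, y, epochs=6, initial_epoch=3, verbose=0)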
Sorry, but I don't have a powerful machine to check this in practice. If someone here does, I have shared my code on GitHub and the link is provided in the issue. Please try resuming training with it and help track down the cause of this problem.
I am using a computer provided by a national institute, and students here have to share this single computer for their projects, so I can't use it for such tasks. Thank you.
Edit2:
Recently, I tried to check whether the weights are saved correctly. For that, I evaluated the model with my validation generator and saw that the output loss and accuracy were the same as at the beginning of training. Seeing this, I concluded that it is actually an issue with saving the model's weights. I might be wrong here..
BTW, I have also used multi_gpu_model() in my model code. Can it cause this issue? I can't try training the model on CPU, as the model is too heavy for that and a single epoch would take a few days. Can anyone help with debugging?
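For what it's worth, multi_gpu_model() is a plausible culprit: the Keras documentation recommends saving the template model (the one you pass in), not the parallel model it returns, and checkpointing the wrong one is known to produce broken saves. A hedged sketch, where build_model() and the training arrays are placeholders for your own code:

from keras.utils import multi_gpu_model

template = build_model()                 # your single-GPU model (placeholder)
parallel = multi_gpu_model(template, gpus=2)
parallel.compile(optimizer='adam', loss='categorical_crossentropy')
parallel.fit(x_train, y_train, epochs=10)

template.save('bidaf_checkpoint.h5')     # save the template, not `parallel`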
I see no response to such issues these days. Just list the current issues in the README.md on the Keras GitHub so that users can be aware before trying Keras out and wasting months on it.

Export and import tensorflow network for evaluating states in application

I am writing a neural network in tensorflow and I want to be able to export my final trained network and import it in another program to play a game. I have found multiple forum posts like:
Tensorflow: How to use a trained model in a application?
Tensorflow: how to save/restore a model?
I also saw in the TF documentation that they were using estimators to save the model, but I am not sure if that is what I'm looking for or how to apply it.
But those talk about exporting the entire session, importing it into the application, and using Session.run, and as I understand it that requires providing the predicted output as an input and will run another training step on my network. I don't want to continue training my network - it's finished - I now only want to evaluate a specific state given to me by the game.
Thanks in advance for any help available.
As far as I know, there are two ways of doing it:
checkpoint files (MetaGraph)
SavedModel
SavedModel is very convenient, but its learning curve is steeper than that of checkpoint files. You can check this tutorial.
Also, importing a model does not continue the training run; it basically restores all the variables you learned.
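A minimal sketch of evaluation-only restore from a checkpoint (TF 1.x). The tensor names 'input:0' and 'prediction:0' and the variable game_state are placeholders for whatever you named in your own graph:

import tensorflow as tf

with tf.Session() as sess:
    # Rebuild the graph from the .meta file, then load the trained weights.
    saver = tf.train.import_meta_graph('model.ckpt-20000.meta')
    saver.restore(sess, 'model.ckpt-20000')

    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name('input:0')                 # your input placeholder
    prediction = graph.get_tensor_by_name('prediction:0')   # your output tensor

    # Fetching the prediction tensor runs a forward pass only; the weights
    # change only if you explicitly run the training op, which we never fetch.
    value = sess.run(prediction, feed_dict={x: game_state})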

Transfer learning with inception model in Tensorflow (python)

How can I load a .pb protobuf model and then tweak the network as needed (especially the outer layers) in order to train a new model for completely different classes, effectively doing transfer learning?
I want to do things like training the outer layers with a bigger learning rate than the inner layers, among other things, so I need a way not only to load the graph with the variables, but also to alter the network's structure and hyperparameters.
If anyone has an example to follow with the inception model, it would be amazing!
My question is very similar to this one.
I've searched all over the internet (TF docs, GitHub, StackOverflow, Google...) but I can't seem to find something useful for a novice.
Thanks a lot!
This is the updated tutorial from the official TensorFlow website: https://www.tensorflow.org/hub/tutorials/image_retraining
They use the pre-trained Inception V3 model and everything works fine. You can point the dataset folder to your own dataset.
tf.import_graph_def() is the function for loading a GraphDef:
https://www.tensorflow.org/versions/0.6.0/api_docs/python/framework.html#import_graph_def
Hopefully, once imported, you can make the modifications to the graph that you need. It would be easier, though, to modify the Python code that generated the graph in the first place, if you have access to it.
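A hedged sketch of the import_graph_def route: nodes imported from a GraphDef are constants rather than variables, so the common pattern is to cut the graph at the bottleneck tensor and train a new head on top. The file name and the bottleneck tensor name 'pool_3/_reshape:0' (the Inception v3 bottleneck in the classify_image graph) are assumptions to check against your own .pb:

import tensorflow as tf

# Load the serialized GraphDef from disk (file name is an assumption).
with tf.gfile.GFile('classify_image_graph_def.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# Import it and grab the bottleneck tensor; imported nodes are constants.
bottleneck, = tf.import_graph_def(
    graph_def, name='', return_elements=['pool_3/_reshape:0'])

# New trainable head for your own classes; these are the only variables in
# the graph, so they are the only thing the optimizer can update.
num_classes = 5
labels = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(bottleneck, num_classes)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

If you instead rebuild the network from its Python source (so the inner layers are real variables), you can get different learning rates per layer group by creating two optimizers, passing each a var_list of the variables it should update, and running both ops together with tf.group().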
