I am trying to debug a NaN training issue and thought I would use TensorBoard Debugger V2 to do so. I enabled it with these lines at the top of my training code:
tf.debugging.experimental.enable_dump_debug_info(
    OUTPUT_FOLDER + 'tensorboard' + "/debug/tfdbg2_logdir",
    tensor_debug_mode="FULL_HEALTH",
    circular_buffer_size=-1)
However, no data shows up in TensorBoard even though the folder is being populated with log data. TensorFlow and TensorBoard are both version 2.9.0.
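For context, the Debugger V2 plugin reads from that same dump directory, so TensorBoard is normally launched pointing straight at it. A sketch of the launch command for the snippet above (the concrete value of OUTPUT_FOLDER is elided, as in the code):
tensorboard --logdir <OUTPUT_FOLDER>tensorboard/debug/tfdbg2_logdir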
I am trying to load a Keras model on an AWS server with the following commands:
import tensorflow_hub as hub
from keras.models import load_model
model = load_model(model_path, custom_objects={'KerasLayer': hub.KerasLayer})
but it's giving an error:
FileNotFoundError: Op type not registered 'RegexSplitWithOffsets' in binary running on ip-10-0-xx-xxx.us-east-2.compute.internal. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.
The exact same code was working the other day, but this time it's giving an error. I double-checked that the model file is still at the specified path. The file has not been modified either, so there is no reason to suspect it is corrupt.
I guess the error has something to do with AWS, but I don't know how to resolve it. I tried finding a solution on the web, but nothing helped.
Importing tensorflow_text resolves this issue; the import registers the missing op.
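A minimal sketch of the fix; the import just has to happen before the model is loaded (model_path is assumed from the question):
import tensorflow_text  # noqa: F401 -- registers ops such as RegexSplitWithOffsets
import tensorflow_hub as hub
from keras.models import load_model

model = load_model(model_path, custom_objects={'KerasLayer': hub.KerasLayer})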
I am currently trying to deploy a custom model to AI Platform by following https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1. The model is based on a combination of a pre-trained model from PyTorch and torchvision.transforms. I keep getting the error below, which appears to be related to the 500 MB constraint on custom prediction routines:
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.
setup.py:
from setuptools import setup
from pathlib import Path

base = Path(__file__).parent
REQUIRED_PACKAGES = [line.strip() for line in open(base / "requirements.txt")]
print(f"\nPackages: {REQUIRED_PACKAGES}\n\n")

# [torch==1.3.0, torchvision==0.4.1, ImageHash==4.2.0,
#  Pillow==6.2.1, pyvis==0.1.8.2] installs 800 MB worth of files

setup(description="Extract features of an image",
      author="<>",  # elided in the original
      name='test',
      version='0.1',
      install_requires=REQUIRED_PACKAGES,
      project_urls={
          'Documentation': 'https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines#tensorflow',
          'Deploy': 'https://cloud.google.com/ai-platform/prediction/docs/deploying-models#gcloud_1',
          'AI Platform troubleshooting': 'https://cloud.google.com/ai-platform/training/docs/troubleshooting',
          'Say Thanks!': 'https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43',
          'Google Torch wheels': 'http://storage.googleapis.com/cloud-ai-pytorch/readme.txt',
          'Torch & torchvision wheels': 'https://download.pytorch.org/whl/torch_stable.html'
      },
      python_requires='~=3.7',
      scripts=['predictor.py', 'preproc.py'])
Steps taken:
Tried adding 'torch' and 'torchvision' directly to the REQUIRED_PACKAGES list in the setup.py file in order to provide PyTorch + torchvision as dependencies to be installed during deployment. I am guessing that internally AI Platform downloads the PyPI package for PyTorch, which is 500+ MB, and this causes our model deployment to fail. If I deploy the model with 'torch' only, it seems to work (though of course it throws an error about not being able to find the 'torchvision' library).
File sizes:
torch (torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl, about 111 MB)
torchvision (torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl, about 46 MB), both from https://download.pytorch.org/whl/torch_stable.html and stored on GCS
The zipped predictor model file (.tar.gz format), which is the output of setup.py (5 KB)
A trained PyTorch model (44 MB)
In total, the model dependencies should be less than 250 MB, but I still keep getting this error. I have also tried using the torch and torchvision wheels from Google's mirrored packages (http://storage.googleapis.com/cloud-ai-pytorch/readme.txt), but the same memory issue persists. AI Platform is quite new to us and we would like some input from professionals.
MORE INFO:
GCP CLI Input:
My environment variables:
BUCKET_NAME="something"
MODEL_DIR=<>
VERSION_NAME='v6'
MODEL_NAME="something_model"
STAGING_BUCKET=$MODEL_DIR<>
# TORCH_PACKAGE=$MODEL_DIR"package/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl"
# TORCHVISION_PACKAGE=$MODEL_DIR"package/torchvision-0.4.1+cpu-cp37-cp37m-linux_x86_64.whl"
TORCH_PACKAGE=<>
TORCHVISION_PACKAGE=<>
CUSTOM_CODE_PATH=$STAGING_BUCKET"imt_ai_predict-0.1.tar.gz"
PREDICTOR_CLASS="predictor.MyPredictor"
REGION=<>
MACHINE_TYPE='mls1-c4-m2'
gcloud beta ai-platform versions create $VERSION_NAME \
  --model=$MODEL_NAME \
  --origin=$MODEL_DIR \
  --runtime-version=2.3 \
  --python-version=3.7 \
  --machine-type=$MACHINE_TYPE \
  --package-uris=$CUSTOM_CODE_PATH,$TORCH_PACKAGE,$TORCHVISION_PACKAGE \
  --prediction-class=$PREDICTOR_CLASS
GCP CLI Output:
[1] global
[2] asia-east1
[3] asia-northeast1
[4] asia-southeast1
[5] australia-southeast1
[6] europe-west1
[7] europe-west2
[8] europe-west3
[9] europe-west4
[10] northamerica-northeast1
[11] us-central1
[12] us-east1
[13] us-east4
[14] us-west1
[15] cancel
Please enter your numeric choice: 1
To make this the default region, run `gcloud config set ai_platform/region global`.
Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......failed.
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.
My findings:
I have found articles about people struggling in the same way with the PyTorch package who made it work by installing torch wheels from GCS (https://medium.com/searce/deploy-your-own-custom-model-on-gcps-ai-platform-7e42a5721b43).
I have tried the same approach with torch and torchvision, but no luck so far, and I am waiting for a response from cloudml-feedback@google.com. Any help getting a custom torch + torchvision based predictor working on AI Platform would be great.
Got this fixed by a combination of a few things. I stuck to the 4 GB CPU mls1 machine type and a custom predictor routine (<500 MB).
Install the libraries via the setup.py parameter, but instead of passing just the package name and its version, add the correct torch wheel URL (ideally <100 MB):
REQUIRED_PACKAGES = [line.strip() for line in open(base / "requirements.txt")] + \
    ['torchvision==0.5.0',
     'torch @ https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp37-cp37m-linux_x86_64.whl']
(The name @ URL form is PEP 508's direct-reference syntax; pip downloads that exact wheel instead of the much larger default PyPI distribution.)
I reduced the number of preprocessing steps taken. I couldn't fit them all in, so JSON-serialize what you send and receive between preproc.py and predictor.py:
import json
payload = json.dumps(data_to_send)  # data_to_send: whatever you hand to the predictor class
data = json.loads(payload)          # parse again on the receiving side
Import only the functions you need from a library instead of the whole package:
from torch import zeros, load
# ... your code ...
[Important]
I haven't tested different ways of serializing the trained model; there could be a memory difference there as well depending on which one you use (torch.save, pickle, joblib, etc.).
I found the link below for those whose organization is a GCP partner; they might be able to request more quota (I am guessing from 500 MB to 2 GB or so). I didn't have to go in this direction, as my issue was resolved (and another popped up, lol).
https://cloud.google.com/ai-platform/training/docs/quotas
I'm trying to open TensorBoard using the example code. I ran the example locally and then tried to open it in TensorBoard using the following command: tensorboard --logdir=F:\mnist_tmp\tensorflow\mnist\logs\mnist_with_summaries\.
TensorBoard starts; however, I am not able to see any data, i.e., scalars, graphs, or histograms.
Below is a picture of what I see when I open TensorBoard. If I click on the other tabs, I get a similar text dialog.
Don't use quotes around the directory; just use
tensorboard --logdir=dir
if you are on Windows 10.
My issue was that my logdir contained the drive letter F:, but TensorBoard uses the colon as a separator (for name:path pairs), which is a known bug on Windows.
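A common workaround, assuming the separator behavior described above (this is my addition, not part of the original answer), is to switch onto the drive and pass a relative path so that no colon appears in the --logdir value:
cd /d F:\mnist_tmp\tensorflow\mnist\logs
tensorboard --logdir=mnist_with_summaries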
I am trying to learn how to use TensorBoard and I would like to have it run in my program. I do not understand how to create a log directory. These are the lines I have for running TensorBoard:
summary_writer = tf.train.SummaryWriter('/tensorflow/logdir', sess.graph_def)
tensorboard --logdir=tensorflow/logdir
The error message that I got was
Cannot assign to operator
This first line needs to be in your code (the Python script), as you already have it:
summary_writer = tf.train.SummaryWriter('/tensorflow/logdir', sess.graph_def)
This line, however, you have to run from the Linux shell (not from within the script):
tensorboard --logdir=tensorflow/logdir
(That also explains the error in the question: left inside the .py file, Python parses tensorboard --logdir=tensorflow/logdir as an assignment to the expression tensorboard --logdir, which raises "Cannot assign to operator".)
However, there is a lot more you need to do before TensorBoard really runs:
How to create a Tensorflow Tensorboard Empty Graph
The tutorial on the official TensorFlow website may not be very clear. I was stuck on the same issue before, but to avoid confusing you, I will still use it as a guide here.
First Part (lines of code in the .py file)
Just skip to class tf.train.SummaryWriter in the official guide.
First, you need these lines of code in your .py file to create a dataflow graph.
In TensorFlow, a session is where the graph is launched:
#...create a graph...
# Launch the graph in a session.
sess = tf.Session()
Then, you also need to add these lines to your code:
# Create a summary writer, add the 'graph' to the event file.
writer = tf.train.SummaryWriter(<directory name you create>, sess.graph)
The logs folder will be generated in the directory you assigned after the .py file you created is executed.
Here is sample code you can use:
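Below is a minimal sketch in the TF 0.x style this guide uses; the tiny graph and the log directory are illustrative, not from the original answer:
import tensorflow as tf

# ...create a graph...
a = tf.constant(5.0, name='a')
b = tf.constant(3.0, name='b')
c = tf.add(a, b, name='sum')

# Launch the graph in a session and write it to an event file.
sess = tf.Session()
writer = tf.train.SummaryWriter('/tensorflow/logdir', sess.graph)
print(sess.run(c))
writer.close()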
Second Part (lines of code in your linux terminal)
In your Linux terminal window, type in
tensorboard --logdir="path of your log file"
It will pick up your log file automatically.
Last step (enter the link into your browser)
After you run
tensorboard --logdir="path of your log file"
it will generate an HTTP link, e.g. http://666.6.6.6:6006.
Copy the HTTP link into your web browser.
Enjoy it!
Be careful:
do not cd into the directory where the log file is before running the line of code above,
or TensorBoard might miss the log file.
This YouTube video explains this more explicitly at 9:40.
You can also take a look at how to launch TensorBoard in the official guide.
Hope you get your data graph showing ASAP~
For Colab:
%tensorflow_version 2.x
# torch's SummaryWriter writes TensorBoard-compatible event files
from torch.utils.tensorboard import SummaryWriter

tensorboard_logdir = '/content/tensorboard/'
writer = SummaryWriter(tensorboard_logdir)
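To actually display TensorBoard inside the Colab notebook, you would typically follow this with the notebook magics (standard Colab usage, not part of the original answer):
%load_ext tensorboard
%tensorboard --logdir /content/tensorboard/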
I am using TensorBoard to show the training results of code using TensorFlow (0.7). The previous TensorFlow version had an issue with multiple event files: when I ran my local server using $ tensorboard --logdir=./tmp/, it showed an error if there was more than one event file. The latest version (0.7) no longer shows that error for multiple event files, but it still shows overlapping curves from the multiple event files in TensorBoard. I wonder how to solve this problem. Thanks!
Training my own networks, I write summaries into different subfolders like /tmp/project/train and /tmp/project/eval. If you start TensorBoard using
tensorboard --logdir=/tmp/project/
you will still see multiple graphs, one from each event file in the subfolders, at once, as you mentioned. To see separate graphs, you can start TensorBoard from the desired subfolder:
tensorboard --logdir=/tmp/project/train/
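As a minimal sketch of that layout, using the tf.train.SummaryWriter API from the question's TensorFlow version (the session setup is assumed):
# one writer per run directory; each produces its own event file,
# so TensorBoard treats them as separate runs
train_writer = tf.train.SummaryWriter('/tmp/project/train', sess.graph)
eval_writer = tf.train.SummaryWriter('/tmp/project/eval')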
I second ArnoXf's answer. You should use a different subfolder for each experiment and, assuming the logging root is /tmp, start TensorBoard using:
tensorboard --logdir=/tmp/
If you want to display just a single graph, you can pass that run's directory to your tensorboard call, as described in ArnoXf's answer.
However, with the above call you can also select your graph directly in TensorBoard, i.e., deactivate all others. In the same way you can compare individual runs, as shown in the following screenshot. In my opinion, this is usually the preferred option, as it gives you more flexibility.
A detailed example can be found here.