I have created a script to evaluate a TensorFlow convolutional neural network. It loads some images and does some simple preprocessing:
def main(argv):
    classifier = import_model()
    for path in argv[1:]:
        image_reversed = imread(path).astype(np.float32)
        image_unlayered = np.transpose(image_reversed, (1, 0, 2))
        image = np.reshape(image_unlayered, [1, -1, 480, 3])
        angle = infer_steering_angle(classifier, image)
        print("Steering angle %f for image %s." % (angle, path))
It imports the model using a network-structure function (cnn_model_fn) defined in another file; that function has been verified to at least mostly work, since it is also used to train a network:
def import_model():
    # Load the estimator from its checkpoint directory
    classifier = learn.Estimator(
        model_fn=cnn_model_fn,
        model_dir="/tmp/network2"
    )
    return classifier
Finally, it uses the Estimator.predict function to pass the single image to the network, overriding the default batch_size of 10 and setting it to 1. The prediction should be a tensor with a single element corresponding to the steering angle (this is an end-to-end autonomous-driving regression problem).
def infer_steering_angle(classifier, image):
    output = classifier.predict(
        x=image,
        batch_size=1
    )
    for angle in output:
        return angle
The problem is, it always outputs 0.0 for the steering angle. I've looked over all of this several times, and the only thing I can think of is that I'm misunderstanding the Estimator.predict function. It's rather poorly documented, in that it lacks concrete examples of how it should be used. Does anybody notice anything wrong with how I'm formatting the input or parsing the output?
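For reference, here is a minimal way to dump whatever predict yields (a sketch using the same classifier and image as above), in case the 0.0 comes from my output parsing rather than from the model itself:

output = classifier.predict(x=image, batch_size=1)
for i, prediction in enumerate(output):
    # Print type and value of each yielded element, so a dict of arrays
    # versus a bare scalar is easy to tell apart.
    print("prediction %d: %r (type: %s)" % (i, prediction, type(prediction)))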
UPDATE:
I tried putting this code right in the training file, so the importing can't be the problem. I'm starting to become suspicious it's a problem with the model itself. The code is at https://hastebin.com/rakulonebu.py.
I'm working on a variational auto-encoder and I'd like the prior used in the KL-divergence regularization of the latent distribution to have its loc (mean) and scale (stddev) updated.
The below snippet is a contrived minimal example demonstrating what I'm trying to achieve. This starts to work but then just freezes after some random number of epochs (sometimes 1, sometimes 200, but usually around 7 or 8). There's no error message or anything.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors
tfkl = tf.keras.layers
tfpl = tfp.layers

# ndim, ds (a tf.data.Dataset of (x, y) batches) and N_EPOCHS are defined in the gist.

# Trainable prior parameters
loc = tf.Variable(tf.random.normal([ndim], stddev=0.1, dtype=tf.float32))
scale = tfp.util.TransformedVariable(
    tf.random.normal([ndim], mean=1.0, stddev=0.1, dtype=tf.float32),
    bijector=tfb.Chain([tfb.Shift(1e-5), tfb.Softplus(), tfb.Shift(0.5413)]))
prior = tfd.Independent(tfd.Normal(loc=loc, scale=scale), reinterpreted_batch_ndims=1)

_input = tfkl.Input(shape=(1,))
_loc = tfkl.Dense(ndim, name="loc_params")(_input)
_scale = tfkl.Dense(ndim, name="untransformed_scale_params")(_input)
_scale = tf.math.softplus(_scale + np.log(np.exp(1) - 1)) + 1e-5
_output = tfpl.DistributionLambda(
    make_distribution_fn=lambda t: tfd.Independent(tfd.Normal(loc=t[0], scale=t[1])),
    activity_regularizer=tfpl.KLDivergenceRegularizer(prior, use_exact_kl=True, weight=0.1)
)([_loc, _scale])

model = tf.keras.Model(_input, _output)
model.compile(optimizer='adam', loss=lambda y_true, model_out: -model_out.log_prob(y_true))
hist = model.fit(ds, epochs=N_EPOCHS, verbose=2)
I have a runnable gist here.
A more concrete example, and an architecture close to what I'm trying to update and simplify, is the tfp example for disentangled_vae. In its manual training loop, a new tfd.MultivariateNormalDiag is instantiated on every loop, though it is parameterized using persistent tf.Variables. I'm trying my best to avoid manual training loops, and I'm also trying to move to more Keras-like syntax, so I'd rather not do a direct port of this example.
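For illustration, this is roughly the pattern that example uses, as I understand it: the distribution object is rebuilt on every step, but its parameters live in persistent tf.Variables (the ndim value and the softplus transform below are my own assumptions, not taken from the example):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

ndim = 8  # latent dimensionality, chosen arbitrarily for this sketch
loc = tf.Variable(tf.zeros([ndim]), name="prior_loc")
raw_scale = tf.Variable(tf.zeros([ndim]), name="prior_raw_scale")

def make_prior():
    # Re-instantiated on every call (e.g. once per training step), but the
    # gradients still flow back into the two persistent Variables above.
    return tfd.MultivariateNormalDiag(
        loc=loc, scale_diag=tf.nn.softplus(raw_scale) + 1e-5)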
Any advice is greatly appreciated. Thanks!
Edit: The activity_regularizer seems to work fine when attached to a latent (bottleneck) distribution. I have a more complete example in this Colab notebook. As this works in my architecture, I'm no longer in need of an answer.
However, I highly doubt having model fitting freeze is desirable behaviour, so this remains a problem.
As the machinery works in most circumstances, just not the contrived freezing example above, I no longer consider this a question that needs an answer.
I have reported the errorless freezing behaviour via the tensorflow-probability repository issues page. See here.
I am trying to deploy my object detection model, trained using TensorFlow, to SageMaker. I was able to deploy it without specifying any entry point during model creation, but it turns out that only works for small images (SageMaker has a 5 MB payload limit). The code I used for this is:
from sagemaker.tensorflow.serving import Model

# Initialize model ...
model = Model(
    model_data=s3_path_for_model,
    role=sagemaker_role,
    framework_version="1.14",
    env=env)

# Deploy model ...
predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.t2.medium')

# Test using an image ...
import cv2
import numpy as np

image_content = cv2.imread("PATH_TO_IMAGE",
                           1).astype('uint8').tolist()
body = {"instances": [{"inputs": image_content}]}

# Works fine for small images ...
# I get predictions perfectly with this ...
results = predictor.predict(body)
So, I googled around and found that I need to pass an entry_point for Model() in order to predict for larger images. Something like:
model = Model(
    entry_point="inference.py",
    dependencies=["requirements.txt"],
    model_data=s3_path_for_model,
    role=sagemaker_role,
    framework_version="1.14",
    env=env
)
But doing this gives FileNotFoundError: [Errno 2] No such file or directory: 'inference.py'. A little help here please. I am using sagemaker-python-sdk.
My folder structure is as follows:

model
|__ 001
|   |__ saved_model.pb
|   |__ variables
|       |__ <contents here>
|__ code
    |__ inference.py
    |__ requirements.txt
Note: I have also tried ./code/inference.py and /code/inference.py.
5MB is a hard limit for real-time endpoints.
Are you sure you need to pass such large images for prediction? Most use cases work fine with smaller, lower resolution images.
If you need real-time prediction, one workaround would be to pass the image S3 URI in the prediction request (instead of the image itself), and load the image from S3.
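For example, here is a rough sketch of an inference.py pre-processing handler for the TensorFlow Serving container that implements that workaround (the request format, bucket handling and file names are assumptions, not tested code):

# inference.py (sketch) -- assumes requests look like {"s3_uri": "s3://bucket/key.jpg"}
import json
import boto3
import cv2

s3 = boto3.client("s3")

def input_handler(data, context):
    payload = json.loads(data.read().decode("utf-8"))
    bucket, _, key = payload["s3_uri"][len("s3://"):].partition("/")
    local_path = "/tmp/input_image"
    s3.download_file(bucket, key, local_path)
    image = cv2.imread(local_path, 1).astype("uint8").tolist()
    # Build the TF Serving request body the model expects
    return json.dumps({"instances": [{"inputs": image}]})

def output_handler(response, context):
    # Pass the TF Serving response straight through
    return response.content, context.accept_header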
If you don't need real-time prediction, you should look at batch transform, which doesn't enforce that size restriction: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html
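With the same Model object, a batch transform job would look roughly like this (the instance type and S3 paths are placeholders):

# Create a transformer from the model and run it over inputs stored in S3
transformer = model.transformer(instance_count=1, instance_type="ml.m5.xlarge")
transformer.transform("s3://my-bucket/prediction-inputs/",
                      content_type="application/json")
transformer.wait()
# Results are written to transformer.output_path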
I'm trying to run my Python program. It seems that it should run smoothly, however I encounter an error that I haven't seen before. It says:
free(): invalid pointer
Aborted (core dumped)
However, I'm not sure how to fix the error, since it doesn't give me much information about the problem itself.
At first I thought it was a problem with the sizes of the tensors in my network, but they are completely fine. I've googled the problem a little and can see that it is a problem with allocating (or freeing) memory where I shouldn't, but I don't know how to fix it.
My code is divided into two files, and I use two libraries: one for the Sinkhorn loss function and one to randomly sample a mesh.
import argparse
import point_cloud_utils as pcu
import time
import numpy as np
import torch
import torch.nn as nn
from fml.nn import SinkhornLoss

import common


def main():
    # x is a tensor of shape [n, 3] containing the positions of the vertices
    x = torch.from_numpy(common.loadpointcloud("sphere.txt"))
    # t is a tensor of shape [n, 3] containing a set of nicely distributed samples in the unit cube
    v, f = common.unit_cube()
    t = torch.from_numpy(pcu.sample_mesh_lloyd(v, f, x.shape[0]).astype(np.float32))  # sample a point cloud randomly (cube for now?)

    # The model is a simple fully connected network mapping a 3D parameter point to 3D
    phi = common.MLP(in_dim=3, out_dim=3)

    # Eps is 1/lambda and max_iters is the maximum number of Sinkhorn iterations to do
    emd_loss_fun = SinkhornLoss(eps=1e-3, max_iters=20,
                                stop_thresh=1e-3, return_transport_matrix=True)
    mse_loss_fun = torch.nn.MSELoss()

    # Adam optimizer at first
    optimizer = torch.optim.Adam(phi.parameters(), lr=10e-3)

    fit_start_time = time.time()
    for epoch in range(100):
        optimizer.zero_grad()

        # Do the forward pass of the neural net, evaluating the function at the parametric points
        y = phi(t)

        # Compute the Sinkhorn divergence between the reconstruction (using the francis library) and the target
        # NOTE: The Sinkhorn function expects a batch of b point sets (i.e. tensors of shape [b, n, 3])
        # since we only have 1, we unsqueeze so x and y have dimension [1, n, 3]
        with torch.no_grad():
            _, P = emd_loss_fun(phi(t).unsqueeze(0), x.unsqueeze(0))

        # Project the transport matrix onto the space of permutation matrices and compute the L2 loss
        # between the permuted points
        loss = mse_loss_fun(y[P.squeeze().max(0)[1], :], x)
        # loss = mse_loss_fun(P.squeeze() @ y, x)  # Use the transport matrix directly

        # Take an optimizer step
        loss.backward()
        optimizer.step()

        print("Epoch %d, loss = %f" % (epoch, loss.item()))

    fit_end_time = time.time()
    print("Total time = %f" % (fit_end_time - fit_start_time))
    # Plot the ground truth, reconstructed points, and a mesh representing the fitted function, phi
    common.visualitation(x, t, phi)


if __name__ == "__main__":
    main()
The error message is:
free(): invalid pointer
Aborted (core dumped)
Again, that doesn't help me much. I'd appreciate it a lot if someone has any idea what is happening, or knows more about this error.
Edit: The cause is actually known. The recommended solution is to build both packages from source.
There is a known issue with importing both open3d and PyTorch. The cause was unknown when this answer was first written. https://github.com/pytorch/pytorch/issues/19739
A few possible workarounds exist:
(1) Some people have found that changing the order in which you import the two packages can resolve the issue, though in my personal testing both ways crash.
(2) Other people have found compiling both packages from source to help.
(3) Still others have found that moving the open3d and PyTorch code into separate scripts resolves the issue; a rough sketch of this approach follows.
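Here is a rough sketch of workaround (3), with a hypothetical helper script that does the open3d/point_cloud_utils work in its own process and prints JSON:

import subprocess
import json

# Run the open3d-dependent sampling in a separate interpreter so that
# open3d and PyTorch are never loaded into the same process.
result = subprocess.run(["python", "sample_points.py", "sphere.txt"],
                        capture_output=True, text=True, check=True)
points = json.loads(result.stdout)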
Note for future readers: This bug was filed as issue #21018.
This is not a problem in your Python code. It is a bug in PyTorch (probably) or in Python itself (unlikely, but possible).
free(3) is a C function that releases dynamically allocated memory when it is no longer needed. You cannot (easily) call it from Python, because memory management is a low-level implementation detail normally handled by the Python interpreter. However, you are also using PyTorch, which is written in C++ and C, and does have the ability to directly allocate and free memory.
In this case, some C code has tried to release a block of memory, but the block of memory it tried to release was not dynamically allocated in the first place, which is an error. You should report this behavior to the PyTorch developers. Include as much detail as possible, including the shortest code you can find that reproduces the problem, and the complete output of that program.
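If you want to see what this class of bug looks like from the Python side, here is a deliberately broken snippet (Linux/glibc only, and it will crash on purpose) that hands free() an address malloc() never returned:

import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.free.argtypes = [ctypes.c_void_p]

buf = ctypes.create_string_buffer(16)
# Misaligned address that malloc() never handed out -> glibc aborts with
# a message like "free(): invalid pointer" and dumps core.
libc.free(ctypes.addressof(buf) + 1)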
Today, I ran into something really weird.
I load a Caffe model, feed it input, call net.forward, and check the output data: perfect.
Then I feed the labels into the bottom layer's blobs.diff, call net.backward, and compare the gradients (params.diff) with the results from the same model in a Caffe C++ program. They are different.
Further, when I continued to run net.backward several times in Python, I got different gradients each time. This is not the case for the C++ program, where the gradients stay the same no matter how many times you run net.backward, as long as you do not change the bottom diff.
I checked the bottom layer's blobs and diff; they remained unchanged in both the Python and C++ programs, and the weights were also unchanged. This was really weird.
Can anyone provide some hints? I can provide code if necessary.
Here is part of the code:
def train_one_step(X, y, lr):
    net.blobs['data'].data[...] = X

    # Forward, to get the softmax output
    output = net.forward()
    prob = output['prob']

    # Calculate the cross-entropy loss
    loss = -np.sum(y * np.log(prob + 1e-12)) / prob.shape[0]

    # Write the error signal (y - prob) into the diff of the softmax output blob
    net.blobs['prob'].diff[:] = y[:] - prob[:]

    # Calculate the gradients of the net parameters
    net.backward()

    # Update weights based on gradients and learning rate
    for key in net.params:
        net.params[key][0].data[:] += lr * net.params[key][0].diff[:]
        net.params[key][1].data[:] += lr * net.params[key][1].diff[:]

    return loss, prob
I just want to write out my own step function (the step of the solver), so I can apply some tricks to the loss before it is backpropagated, among other things. I know this is quite inefficient, since a lot of data is exchanged between GPU and CPU.
In order to test it, I kept feeding in the same sample (same X, y), yet I get different diff data each time. That means this function cannot work.
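For completeness, this is roughly the check I mean (a sketch; X and y are one fixed batch, and lr=0 so the weights themselves never move):

import numpy as np

train_one_step(X, y, lr=0.0)
d1 = {k: net.params[k][0].diff.copy() for k in net.params}
train_one_step(X, y, lr=0.0)
d2 = {k: net.params[k][0].diff.copy() for k in net.params}
for k in net.params:
    # With identical inputs and frozen weights, these should all be 0.0
    print(k, np.abs(d1[k] - d2[k]).max())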
I'm trying to implement a CNN using theano/lasagne.
I've made a neural network but can't figure out how to train it with the current state.
This is how I'm trying to get the output of the network with the current_states as input.
train = theano.function([input_var], lasagne.layers.get_output(l.out))
output = train(current_states)
However I get this error:
theano.compile.function_module.UnusedInputError: theano.function was asked to create a function computing outputs given certain inputs, but the provided input variable at index 0 is not part of the computational graph needed to compute the outputs: inputs.
To make this error into a warning, you can pass the parameter on_unused_input='warn' to theano.function. To disable it completely, use on_unused_input='ignore'.
Why is current_states not used?
I want to get the output of the model on the current_states. How do I do this?
(the CNN build code: http://pastebin.com/Gd35RncU)
The following code snippet works for me:
import lasagne, theano
import theano.tensor as T
import numpy as np

input_var = T.tensor4('inputs')
l_out = build_cnn(input_var)
train = theano.function([input_var], lasagne.layers.get_output(l_out))

x = np.random.randn(10, 4, 80, 80).astype(theano.config.floatX)
train(x)
You didn't post your entire code, but check whether your script is actually passing the input_var variable into your build_cnn function. If it is not, then input_var will not be part of your computational graph, which is why Theano raises the error.
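If it helps, here is a small self-contained version of that pattern. The build_cnn body below is made up (the layer shapes and sizes are placeholders), but it shows the difference between passing and not passing input_var:

import lasagne, theano
import theano.tensor as T
import numpy as np

def build_cnn(input_var=None):
    # If input_var is None, the InputLayer creates its own symbolic variable,
    # and any variable you later hand to theano.function stays unused.
    network = lasagne.layers.InputLayer(shape=(None, 4, 80, 80), input_var=input_var)
    network = lasagne.layers.Conv2DLayer(network, num_filters=16, filter_size=(3, 3))
    network = lasagne.layers.DenseLayer(network, num_units=10)
    return network

input_var = T.tensor4('inputs')
l_out = build_cnn(input_var)  # ties the graph to input_var
predict = theano.function([input_var], lasagne.layers.get_output(l_out))
x = np.random.randn(10, 4, 80, 80).astype(theano.config.floatX)
print(predict(x).shape)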