I'm running someone else's program so I can't be 100% open about the code.
I'm using load_model from tensorflow.keras.models to load an h5 model, then calling model.predict(<data>) in a for loop.
I'm also using tqdm to display a progress bar. The problem is that as soon as I get to the loop, my console starts printing output on a new line at every iteration:
I tried to block this output:
import absl.logging
import logging
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
absl.logging.set_verbosity(absl.logging.ERROR)
logging.getLogger("tensorflow").setLevel(logging.ERROR)
logging.getLogger("tensorflow").addHandler(logging.NullHandler(logging.ERROR))
But nothing seems to work.
Solved it: according to the tf.keras.Model documentation, the predict method takes verbose and steps arguments, and passing verbose=0 silences the per-call progress output: https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict
However, the same documentation suggests using the __call__ method when predicting in a loop: model(<data>, training=False), which does not print any progress output.
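In case it helps anyone else, here is a minimal sketch of that loop. The helper name predict_quietly and its arguments are my own (hypothetical), and it assumes model is an already loaded tf.keras model and batches is any iterable of inputs:

```python
def predict_quietly(model, batches):
    """Run inference batch by batch without the Keras progress output.

    Assumption: `model` is a loaded tf.keras model and `batches` is an
    iterable of input batches. Both names are illustrative only.
    """
    outputs = []
    for batch in batches:
        # Calling the model directly avoids the progress line that
        # model.predict(batch) prints by default.
        # Alternative that keeps predict(): model.predict(batch, verbose=0)
        outputs.append(model(batch, training=False))
    return outputs
```

With a real Keras model each element of the returned list is a tensor, so call .numpy() on it if you need NumPy arrays. Wrapping batches in tqdm(...) should then leave only the progress bar on the console.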
I'm trying to understand why the following simple example doesn't successfully complete execution and seems to get stuck on the first line of really_simple_func (on Ubuntu machines, but not Windows). The code is:
import torch as t
import numpy as np
import multiprocessing as mp # I've tried both multiprocessing
# import torch.multiprocessing as mp # and torch.multiprocessing
def really_simple_func():
    temp_val_2 = t.tensor(np.zeros(425447)[0:400000]) # this is the line that blocks.
    return 4.3
if __name__ == "__main__":
    print("Run brief starting")
    some_zeros = np.zeros(425447)
    temp_val = t.tensor(some_zeros[0:400000]) # DELETE THIS LINE TO MAKE IT WORK
    pool = mp.Pool(processes=1)
    job = pool.apply_async(really_simple_func)
    print("just before job.get()")
    result = job.get()
    print("Run brief completed. Reward = {}".format(result))
I have torch 1.11.0 installed, numpy 1.22.3 and have tried both CPU and GPU versions of Torch. When I run this code on two different Ubuntu machines, I get the following output:
Run brief starting
just before job.get()
However, the code never successfully completes (doesn't print the "Run brief completed" line). (It does complete on a third Windows box).
On the Ubuntu machines, if I delete the line with the comment "#DELETE THIS LINE TO MAKE IT WORK" the execution DOES complete, printing the final line as expected. Similarly, if I leave the line defining temp_val in but delete the line with the comment "This is the line that blocks" it will also complete. Moreover, if I reduce the size of the temp_val tensor (say from 400000 to 4000) it will also complete successfully. Finally, it is worth noting that while I can reproduce this behaviour on two different Ubuntu machines, this code does actually complete on my Windows machine - though, as far as I can tell, the versions of key packages, such as torch, are the same.
I don't understand this behaviour. I suspect it is something to do with the way torch allocates memory or stores information. I've tried calling del temp_val to free up memory, but that doesn't seem to fix things. It seems to me that the async call to t.tensor within really_simple_func is stopped from completing if there has already been a call to t.tensor in the main code block, creating a sufficiently large tensor.
I don't understand why this is happening, or even if that is the correct explanation. In any case, what would be best practice if I do need to do some tensor processing within apply_async as well as in the main thread? More generally, what is Torch waiting on when I make a call to t.tensor?
(Obviously, this is just the simplest version of the real code I'm trying to get to work that reproduced this issue. I realise that calling mp.Pool with only one process doesn't really make sense...nor, indeed, does using apply_async to call a function that returns a constant!)
Unfortunately, I cannot provide any answers to your questions.
I can, however, share experiences with seemingly the same issue. I use a Linux machine with torch 1.8.1 and numpy 1.19.2.
When I run the following code on my machine:
with Pool(max_pool) as p:
    pool_outputs = list(
        tqdm(
            p.imap(lambda f: get_model_results_per_query_file(get_preds, tokenizer, f), query_files),
            total=len(query_files)
        )
    )
Here, the function get_model_results_per_query_file contains operations similar to the following:
feats = features.unsqueeze(0).repeat(batch_size, 1, 1).to(device)
(features is a torch tensor)
The first round of jobs automatically fails, and new ones are started immediately (these do not fail, for some reason). The whole process never completes, though, since the main process still seems to be waiting for the first failed jobs.
If I remove the lines in my code involving the repeat function, no jobs fail.
I managed to solve my issue and preserve the same results by adapting a solution similar to yours:
feats = torch.as_tensor(np.tile(features, (batch_size, 1, 1))).to(device)
I believe as_tensor works in a similar fashion to from_numpy in this case.
I only managed to find this solution thanks to your post and your proposed workaround, so thank you!
After some further exploration, here is a brief answer to my own question.
While I still don't fully understand the blocking behaviour (and would welcome any further explanation), I have just seen that the way I'm generating torch tensors from a numpy array is not correct.
In particular, instead of using torch.tensor(temp_val) where temp_val is a numpy array, I should be using torch.from_numpy(temp_val). Doing this fixes the problem.
Alternatively, I can convert temp_val into a list and then create the tensor via torch.tensor(temp_val_as_list) - which also avoids the issue.
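For reference, here is a small sketch of how the three constructors mentioned in this thread differ in whether they copy the NumPy data. It does not explain the hang itself; it only shows that torch.from_numpy (and torch.as_tensor, for a NumPy array of matching dtype on the CPU) share memory with the array, while torch.tensor always makes a copy. The variable names are mine:

```python
import numpy as np
import torch

some_zeros = np.zeros(425447)
view = some_zeros[0:400000]

shared = torch.from_numpy(view)        # shares memory with the NumPy array
copied = torch.tensor(view)            # always copies the data
maybe_shared = torch.as_tensor(view)   # avoids the copy when dtype and device match

# Writing through the NumPy array is visible only in the tensors that share memory.
view[0] = 1.0
print(shared[0].item(), copied[0].item(), maybe_shared[0].item())  # 1.0 0.0 1.0
```

Because the shared tensors alias the array, later in-place changes to either side affect both; use torch.tensor or .clone() if you need an independent copy.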
I'm not sure why the model isn't defined.
The code is taken from here:
https://github.com/DariusAf/MesoNet/blob/master/example.py
Code:
from classifiers import *
from pipeline import *
from keras.preprocessing.image import ImageDataGenerator
classifier = Meso4()
classifier.load('Meso4_DF')
gives error:
classifier = Meso4()
NameError: name 'Meso4' is not defined
The reason for this is that Meso4 is defined in classifiers.py, which is a separate file in the same repository.
Strictly speaking, your problem would be solved by also downloading the classifiers.py file and putting it in the same directory as your example.py file.
However, you should, in general, refrain from copy-pasting code from GitHub unless you know what you are doing, and if you need to wonder if you do, you don't.
Therefore, I recommend actually cloning the repo and working from the local copy.
I am using joblib to parallelize a for loop over my own function.
from joblib import Parallel, delayed
from my_function import my_case_study
result = Parallel(n_jobs=4)(delayed(my_case_study)(i) for i in range(100))
my_case_study is the only function in the my_function.py file, and it takes i as the hyperparameter. my_case_study calls a number of different model-fitting algorithms contained in other Python files, which are imported at the top of my_function.py. my_function.py basically looks like this:
from anotherfile import fun1
from anotherfile import fun2
def my_case_study(i):
    mse1 = fun1(i)
    mse2 = fun2(i)
    return (mse1,mse2)
But then I get the error message: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
How to solve this? Thanks!!
I found the solution in the following link helpful in my case:
https://github.com/joblib/joblib/issues/810
I don't know whether there is a better solution; the comments at the end of that issue mention that there might be some problems with this approach, which I didn't fully understand.
Here is a great question on how to find the first occurrence of NaN in a TensorFlow graph:
Debugging nans in the backward pass
The answer is quite helpful, here is the code from it:
train_op = ...
check_op = tf.add_check_numerics_ops()
sess = tf.Session()
sess.run([train_op, check_op]) # Runs training and checks for NaNs
Apparently, running the training and the numerical check at the same time will result in an error report as soon as a NaN is encountered for the first time.
How do I integrate this into Keras?
In the documentation, I can't find anything that looks like this.
I checked the code, too.
The update step is executed here:
https://github.com/fchollet/keras/blob/master/keras/engine/training.py
There is a function called _make_train_function where an operation to compute the loss and apply updates is created. This is later called to train the network.
I could change the code like this (always assuming that we're running on a tf backend):
check_op = tf.add_check_numerics_ops()
self.train_function = K.function(inputs,
                                 [self.total_loss] + self.metrics_tensors + [check_op],
                                 updates=updates, name='train_function', **self._function_kwargs)
I'm currently trying to set this up properly and am not sure whether the code above actually works.
Maybe there is an easier way?
I've been running into the exact same problem, and found an alternative to the add_check_numerics_ops() function. Instead of going that route, I use the TensorFlow Debugger to walk through my model, following the example in https://www.tensorflow.org/guide/debugger to figure out exactly where my code produces NaNs. This snippet should work for replacing the TensorFlow Session that Keras is using with a debugging session, allowing you to use tfdbg.
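from keras import backend as K  # Keras backend, provides get_session/set_session
from tensorflow.python import debug as tf_debug
sess = K.get_session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)  # wrap the session Keras is using
K.set_session(sess)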
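If the goal is only to stop training as soon as the loss becomes NaN, rather than to locate the op that produced it, Keras also ships a TerminateOnNaN callback. Note that this is a different technique from add_check_numerics_ops and tfdbg above, and it is less informative. A minimal sketch, assuming a TensorFlow 2.x install with tf.keras; the toy model and the NaN targets are made up purely to trigger the callback:

```python
import numpy as np
import tensorflow as tf

# Toy model (hypothetical, only for demonstration).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

x = np.random.rand(32, 4).astype("float32")
y = np.full((32, 1), np.nan, dtype="float32")  # NaN targets force a NaN loss

# TerminateOnNaN stops training when a batch produces a NaN or inf loss.
history = model.fit(
    x, y,
    epochs=5,
    verbose=0,
    callbacks=[tf.keras.callbacks.TerminateOnNaN()],
)
print("epochs actually run:", len(history.history["loss"]))
```

This only tells you that a NaN occurred and when, so the debugger session above is still the better tool for finding where it came from.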
I have tried to modify the CIFAR-10 example to run on the new TensorFlow distributed runtime. However, I get the following error when trying to run the program:
InvalidArgumentError: Cannot assign a device to node 'softmax_linear/biases/ExponentialMovingAverage':
Could not satisfy explicit device specification '/job:local/task:0/device:CPU:0'
I start the cluster using the following commands. On the first node I run:
bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='local|10.31.101.101:7777;10.31.101.224:7778' --job_name=local --task_id=0
...and on the second node I run:
bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='local|10.31.101.101:7777;10.31.101.224:7778' --job_name=local --task_id=1
For the CIFAR-10 multi-GPU code, I make the simple modifications, replacing two lines in the train() function. The following line:
with tf.Graph().as_default(), tf.device('/cpu:0'):
...is replaced with:
with tf.Graph().as_default(), tf.device('/job:local/task:0/cpu:0'):
and the following line:
with tf.device('/gpu:%d' % i):
...is replaced with:
with tf.device('/job:local/task:0/gpu:%d' % i):
In my understanding, the second substitution should take care of the model substitution. Running a simpler example, like the code below, works fine:
with tf.device('/job:local/task:0/cpu:0'):
    c = tf.constant("Hello, distributed TensorFlow!")
sess.run(c)
print(c)
I can't tell from your program, but my guess is that you also have to modify the line that creates the session to specify the address of one of your worker tasks. For example, given your configuration above, you might write:
sess = tf.Session(
    "grpc://10.31.101.101:7777",
    config=tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=FLAGS.log_device_placement))
As it happens, we've been trying to improve that error message to make it less confusing. If you update to the latest version on GitHub and run the same code, you should see an error message that explains why the device specification could not be satisfied.