Any time I train a neural network (TensorFlow with Keras) and then attempt to plot the fit model's loss history with matplotlib, the kernel dies. I don't think the code itself is at fault, because running different code from different validated sources (links below) causes the same problem.
It also appears to be specific to TensorFlow plus matplotlib: if I train a sklearn model and then plot, it works fine.
Example links:
https://github.com/chrisalbon/notes/blob/master/docs/deep_learning/keras/visualize_loss_history.ipynb
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
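For concreteness, here is a minimal sketch of the kind of code that triggers it for me (toy random data and a tiny model, not the exact code from the links, just the same pattern):
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

# Toy data and a tiny model; the kernel dies at the plotting step, not during fit.
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
model = Sequential([Dense(8, activation='relu', input_shape=(4,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')
history = model.fit(X, y, epochs=5, verbose=0)
plt.plot(history.history['loss'])
plt.show()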
Tried fixes:
Restarting the computer
Removing and re-installing Keras and matplotlib
Rolling back matplotlib to a previous version (3.0.2)
Updating Python 3.6 to 3.7.1
Uninstalling Python and Anaconda from the computer and re-installing
Running the code in different browsers (Safari and Chrome)
I believe it has something to do with my installation: I sent the notebook and data to someone else with the exact same computer and setup, and it worked fine for them.
I've also tried running the .py file from the command line to retrieve errors, but nothing happens (no error, no indication that the file is running). Other .py files run, though.
Current versions
OS - Mojave v10.14.5
Python - 3.7.1
Matplotlib - 3.0.3
Keras - 2.2.4
TensorFlow - 1.13.1
After trial and error, the issue appears to stem from a bug in TensorFlow. I'm not sure exactly what is causing it, but when TensorFlow is rolled back to 1.11 the issue no longer occurs. So if you are also experiencing this problem, you may want to try rolling back TensorFlow.
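For reference, the rollback itself is just a pip install of the older version, run in the same environment the notebook uses:
pip install tensorflow==1.11.0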
Related
I just installed the tensorboard profiler with
pip install -U tensorboard_plugin_profile
The version is 2.3.
Tensorflow-Version 2.3
Tensorboard-Version 2.3
cudatoolkit-Version 10.1.243
When I now try to open the Profile tab in TensorBoard, I see the profiler window normally, but it is empty, with the error message:
DEM6561: Failed to load libcupti (is it installed and accessible?)
And the warning:
No step marker observed and hence the step time is unknown. This may happen if (1) training steps are not instrumented (e.g., if you are not using Keras) or (2) the profiling duration is shorter than the step time. For (1), you need to add step instrumentation; for (2), you may try to profile longer.
I think it has something to do with the environment paths and variables, but I don't know how they work with Anaconda's virtual environments. (I don't have a CUDA folder I can link to.)
Has anyone had the same problem, or any ideas about what I can try?
Thanks ahead!
First, make sure that CUPTI has been added to your Path (via Environment Variables if you're using Windows), with an entry that should look like this:
%CUDA_PATH%\extras\CUPTI\lib64
Second, check whether TensorFlow is looking for the correct CUPTI DLL. I've encountered this exact same issue, and as I checked, it appears that TF 2.4 looks for cupti64_110.dll instead of cupti64_2020.1.1.dll. It is currently a known issue and will be addressed in TF 2.5. I'm not sure whether that's also the case with TF 2.3.
I basically resolved the issue by copying the DLL into the same directory and renaming it. Let me know if this helps!
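As a quick sanity check, here is a small Python sketch (assuming Windows, and that CUDA_PATH is set) to see whether the DLL loader can find the CUPTI libraries mentioned above at all:
import ctypes.util
import os

# Ask the DLL loader whether it can locate each CUPTI library on PATH.
# These names are the ones from the discussion above; adjust for your CUDA version.
for name in ("cupti64_110", "cupti64_2020.1.1"):
    print(name, "->", ctypes.util.find_library(name))

# The CUPTI directory that should be on PATH, assuming CUDA_PATH is set:
print(os.path.join(os.environ.get("CUDA_PATH", ""), "extras", "CUPTI", "lib64"))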
I am having an issue with keras leading to my processor seemingly getting bogged down while working through examples.
In the IMDB data set for instance (exercise 3.4.1 in Deep Learning with Python by Chollet if anyone knows the book), running the script:
import keras
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Produces an output looking something like:
[=====>...] - ETA: 59s✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓16105472/17464789
The progress bar updates more and more slowly as the counter moves toward completion.
I'm assuming my installation of Keras/TensorFlow/CUDA/cuDNN is to blame, but I'm curious whether you know of anything obvious that would solve the issue.
I'm running Ubuntu Linux with an NVIDIA GTX 1080, and Keras/TensorFlow (GPU)/CUDA/cuDNN (maybe; that assumes I installed everything correctly, which is probably not accurate).
Thanks!
This progress bar is shown during the first initial download and should not be present on subsequent imports of the data.
There are several issues that might cause this to slow down and/or fail:
Your internet connection is unstable.
There is an issue with the serving of the file. Maybe the repository server is serving a corrupt file? (You could try to force a download from another repository; see How to setup pip to download from mirror repository by default?)
Your local disk or a previous partial download is corrupt: you can try deleting the partial download at ~/.keras/datasets/imdb.npz (see the sketch after this list).
Check whether your hard disk is full.
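For the corrupt-download case, here is a minimal sketch (assuming the default Keras cache location) that deletes the cached file and forces a fresh download:
import os
from keras.datasets import imdb

# Remove a possibly corrupt cached copy so Keras re-downloads it from scratch.
cache_path = os.path.expanduser("~/.keras/datasets/imdb.npz")
if os.path.exists(cache_path):
    os.remove(cache_path)

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)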
I have been trying to get the Machine Learning Setup for ML-Agents for Unity 3D up and running for the past several hours, with no luck.
First I followed this video, which goes over the initial installations that are also outlined in this GitHub repository.
Next, I moved on to part 2 of the video series (here); however, problems started at minute 4:48, where I realized that the tutorial was using v0.2 while I had v0.3.
V0.3 has done away with the PPO.ipynb file shown in the video; everything is now done through the learn.py file.
I then decided to try and follow the official Unity installation guide:
https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Getting-Started-with-Balance-Ball.md
and got to the Training with PPO section, which I have not managed to get through.
The problem arises here. The documentation states:
To summarize, go to your command line, enter the ml-agents directory and type:
python3 python/learn.py <env_file_path> --run-id=<run-identifier> --train
Note: If you're using Anaconda, don't forget to activate the ml-agents
environment first.
I tried to run:
python learn.py ball --run-id=ballBalance --train
but I am greeted with the following traceback:
File "learn.py", line 9, in
from unitytrainers.trainer_controller import TrainerController
File "C:\Users****\Downloads\ml-agents-master\python\unitytrainers__init__.py", line 1, in
from .buffer import *
I have been trying to solve this error message for quite some time now. It seems that the file learn.py is actually being found, but somehow not being interpreted correctly?
First 9 lines of learn.py:
# # Unity ML Agents
# ## ML-Agent Learning
import logging
import os
from docopt import docopt
from unitytrainers.trainer_controller import TrainerController
Any guidance on how I can solve this problem would be appreciated. Would gladly give more information where needed. The steps mentioned above should replicate the problem I am experiencing.
I am not completely sure whether I solved the same problem, but somewhere among my errors it also pointed to line 9 in learn.py.
Nevertheless, I found this: https://github.com/tensorflow/tensorflow/issues/18503
So all I did was install TensorFlow 1.5 by executing:
pip install --upgrade --ignore-installed tensorflow-gpu==1.5
Afterwards it ran through without errors and the training worked fine.
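To confirm the rollback actually took effect in the environment you run learn.py from, you can print the version first:
import tensorflow as tf
print(tf.__version__)  # should report 1.5.x after the rollback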
I use Anaconda on a Windows 10 laptop with Python 2.7 and Spark 2.1. I built a deep learning model using the sknn.mlp package and have completed the model. When I try to predict using the predict function, it throws an error:
'NoneType' object is not callable
I verified the input data. It is a numpy.array and it does not contain null values. Its dimensions are the same as the training data's, and all attributes are the same. I'm not sure what the problem could be.
I don't work with Python on Windows, so this answer will be very vague, but maybe it will guide you in the right direction. Sometimes there are cross-platform errors because one module has not yet been updated for the OS, frequently when another related module gets an update. I recall something similar happening to me with a Django application, which required somebody more familiar with Windows to fix it for me.
Maybe you could try an environment with older versions of your modules until you find the culprit.
I finally solved the problem on windows. Here is the solution in case you face it.
The Theano package was faulty. I installed the latest version from GitHub, and it then threw another error, shown below:
RuntimeError: To use MKL 2018 with Theano you MUST set "MKL_THREADING_LAYER=GNU" in your environment.
To solve this, I created a user environment variable named MKL_THREADING_LAYER and set its value to GNU. After resetting the kernel, it was working.
Hope it helps!
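As an alternative to a system-wide environment variable, you can also set it from Python before Theano is imported; a minimal sketch:
import os
# Must be set before Theano (and MKL) is loaded, per the RuntimeError above.
os.environ["MKL_THREADING_LAYER"] = "GNU"

import theano  # should now import without the MKL 2018 error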
I am trying to debug a Fortran warning in some sklearn code that runs perfectly on my laptop, but after transferring it to my desktop (a fresh Ubuntu 15.10, fresh PyCharm, and fresh Anaconda3), I get the following warning when running sklearn.cross_validation.cross_val_score:
/anaconda3/lib/python3.5/site-packages/sklearn/externals/joblib/hashing.py:197:
DeprecationWarning: Changing the shape of non-C contiguous array by descriptor
assignment is deprecated. To maintain the Fortran contiguity of a
multidimensional Fortran array, use 'a.T.view(...).T' instead
  obj_bytes_view = obj.view(self.np.uint8)
The command I am submitting to cross_val_score is:
test_results = cross_val_score(learner(**learner_args), data, y=classes, n_jobs=n_jobs, scoring='accuracy', cv=LeaveOneOut(data.shape[0]))
where the cv iterator is the sklearn cross-validation object, and nothing else special is going on. What could be happening here? Am I missing an installation step?
Just for the record, for people like me who found this SO post via Google: this has been recorded as issue #6370 for scikit-learn.
As mentioned there:
This problem has been fixed in joblib master. It won't be fixed in scikit-learn until:
1) we do a new joblib release
2) we update scikit-learn master to have the new joblib release
3) if you are using released versions of scikit-learn, which I am guessing you are, you will have to wait until there is a new scikit-learn release
In the meantime, I was able to use the workaround from @bordeo mentioned above:
import warnings
warnings.filterwarnings("ignore")
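If you'd rather not silence every warning, a narrower filter matching just the start of the deprecation message quoted above should also work:
import warnings
# Ignore only the joblib Fortran-contiguity DeprecationWarning.
warnings.filterwarnings("ignore", message="Changing the shape of non-C contiguous")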