A friend and I are working on a deep learning project, and we would like to train the same model on our two different computers and get the same results. Is there a way we can do that?
The TensorFlow version is 2.5.0.
What we have tried so far:
At the start of our code, in the first line of the main function, we set the Python hash seed to 0:
os.environ['PYTHONHASHSEED'] = str(0)
After that, in the training function, we set the seeds appropriately:
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)
On one computer, two runs with the same seed give us the same result; however, across the two different computers the results are different.
For example, with seed = 1, on my PC the loss is around 0.09, but on my friend's it is below 0.06.
Why is that, and how can we solve it?
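For what it's worth, a minimal sketch of the kind of setup that is often suggested for TensorFlow 2.5 (the TF_DETERMINISTIC_OPS flag and the threading calls are my additions, not from your code, and even with them identical numbers across different GPUs/CPUs or library builds are not guaranteed). Also note that PYTHONHASHSEED usually has to be set before the Python process starts, e.g. in the shell, to affect hashing in that process:
import os
# both variables are read when the process / TensorFlow initializes, so set them first
os.environ['PYTHONHASHSEED'] = '0'        # only affects child processes if set after startup
os.environ['TF_DETERMINISTIC_OPS'] = '1'  # asks TF 2.5 to prefer deterministic (GPU) kernels

import random
import numpy as np
import tensorflow as tf

seed = 1
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)

# optional: single-threaded ops avoid order-dependent floating-point reductions
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)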
Related
I'm using XGBoost with the default settings to forecast an hourly time series (my test set is 720 hours), and I always run the model 10 times and analyse the 10 runs at the end. With XGBoost I'm getting exactly the same predictions, for all 720 hours, in all 10 runs, without any change. I have already tried changing the seed, putting a different seed number in each run, and changing random_state too, all with no success. The only thing that changes the predictions is setting subsample = 0.99, but my professor told me he wants to use the whole training dataset, and with subsample = 1 I get the same problem of 10 identical sets of predictions. I have already checked my code and there is no error in the way I'm predicting the time series.
I'm using XGBoost this way:
import xgboost as xg

for run in range(10):
    # fit a fresh model with default hyperparameters on each run
    model = xg.XGBRegressor().fit(X_train, y_train)
Thanks for the help.
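For reference, an illustrative sketch, not the original code: as far as I understand, the default booster has no stochastic component for random_state to act on, so per-run variation only appears once row/column subsampling is enabled (the 0.8 values are arbitrary, and X_test stands in for the test set):
import xgboost as xg

predictions = []
for run in range(10):
    # subsampling introduces randomness; a different random_state per run then changes the fit
    model = xg.XGBRegressor(subsample=0.8, colsample_bytree=0.8, random_state=run)
    model.fit(X_train, y_train)
    predictions.append(model.predict(X_test))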
I am a little bit confused about the randomness of PyTorch models.
I used the following code to fix the random seed so that the training results of the model can be reproduced on the same device:
def seed_torch(seed=2020):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
But what is confusing is that when the code is run on another device with the same hardware configuration and software environment, it produces different results.
I reinstalled the virtual environment to ensure that the versions of the libraries are consistent, but this problem still puzzles me.
Feel free to ask if more code is needed to explain the problem.
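For what it's worth, a hedged sketch of extra settings that are often added on top of the code above (seed_everything is a made-up name, and even with all of this, bit-identical results across two machines are generally not guaranteed, since different GPUs, drivers, or cuDNN builds can select different kernels):
import os
import random
import numpy as np
import torch

def seed_everything(seed=2020):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)                   # seed every GPU, not just the current one
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)           # PyTorch >= 1.8: error on nondeterministic ops
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some cuBLAS ops on CUDA >= 10.2

# DataLoaders with shuffling/workers should also be given a seeded torch.Generator
# (generator=...) and a worker_init_fn that seeds NumPy/random in each worker.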
I implemented a CNN with 3 convolutional layers, with max pooling and dropout after each layer.
I noticed that when I trained the model for the first time it gave me 88% test accuracy, but after retraining it a second time on the same training dataset it gave me 92% test accuracy.
I could not understand this behavior. Is it possible that the model overfitted during the second training run?
Thank you in advance for any help!
It is quite possible if you have not set the seed, with set.seed() in the R language or tf.random.set_seed(any_number) in Python.
Well, I am no expert when it comes to machine learning, but I do know the math behind it. When you train a neural network, you are basically finding a local minimum of the loss function, which means that the end result depends heavily on the initial guess for all of the internal variables.
Usually the variables are randomly initialized, so you can reach quite different results from running the training process multiple times.
That being said, from when I studied the subject I was told that you usually reach similar results regardless of the initial guess of the parameters. However, it is hard to say whether 0.88 and 0.92 would be considered similar or not.
Hope this gives a possible answer to your question.
As mentioned in another answer, you could remove the randomization, both in the parameter initialization and in the ordering of the data used for each epoch of training, by introducing a seed. This ensures that when you run it twice, everything gets "randomized" in exactly the same order. In TensorFlow this is done using, for example, tf.random.set_seed(1); the number 1 can be changed to any other number to get a new seed.
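A small illustrative sketch (my own example, not from the question) of what this looks like in TF 2.x: the global seed covers most things, individual initializers can also take their own seed, and turning off (or seeding) the shuffling keeps the per-epoch data order fixed as well.
import tensorflow as tf

tf.random.set_seed(1)  # global seed; any fixed integer works

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer=tf.keras.initializers.GlorotUniform(seed=1)),
    tf.keras.layers.Dense(1),
])
# model.fit(x, y, epochs=10, shuffle=False)  # shuffle=False keeps the data order identical across runs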
I'm running a reinforcement learning program in a gym environment (BipedalWalker-v2) implemented in TensorFlow. I've set the random seeds of the environment, TensorFlow, and NumPy manually as follows:
os.environ['PYTHONHASHSEED'] = str(42)
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)
env = gym.make('BipedalWalker-v2')
env.seed(0)
config = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
# run the graph with sess
However, I get different results every time I run my program (without changing any code). Why are the results not consistent and what should I do if I want to obtain the same result?
Update:
The only places I can think of that may introduce randomness (other than the neural networks) are:
I use tf.truncated_normal to generate random noise epsilon so as to implement a noisy layer
I use np.random.uniform to randomly select samples from replay buffer
I also notice that the scores I get are pretty consistent for the first 10 episodes, but then begin to differ. Other things, such as losses, also show a similar trend but are not numerically identical.
Update 2
I've also set "PYTHONHASHSEED" and used single-threaded CPU as @jaypops96 described, but still cannot reproduce the result. The code has been updated in the code block above.
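(For clarity, a rough sketch, not my actual code, of what isolating the replay-buffer randomness would look like; replay_buffer and batch_size are placeholder names:)
rng = np.random.RandomState(42)                           # dedicated, seeded stream for sampling
idx = rng.randint(0, len(replay_buffer), size=batch_size)
minibatch = [replay_buffer[i] for i in idx]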
I suggest checking whether your TensorFlow graph contains nondeterministic operations. For example, reduce_sum before TensorFlow 1.2 was one such operation. These operations are nondeterministic because floating-point addition and multiplication are nonassociative (the order in which floating-point numbers are added or multiplied affects the result) and because such operations don't guarantee their inputs are added or multiplied in the same order every time. See also this question.
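A quick way to see that order dependence with plain Python floats:
a, b, c = 1e16, 1.0, -1e16
print((a + b) + c)  # 0.0 -- the 1.0 is absorbed by the huge value before the cancellation
print((a + c) + b)  # 1.0 -- cancelling the huge values first preserves the 1.0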
EDIT (Sep. 20, 2020): The GitHub repository framework-determinism has more information about sources of nondeterminism in machine learning frameworks, particularly TensorFlow.
It seems that TensorFlow neural networks introduce randomness during training that isn't controlled by a NumPy random seed. The randomness appears to come, at the very least, from Python hash operations and from parallelized operations executing in uncontrolled order.
I had success getting 100% reproducibility using a Keras/TensorFlow NN by following the setup steps in this response:
How to get reproducible results in keras
Specifically, I used the formulation proposed by @Poete Maudit in that link.
The key was to set random seed values up front for NumPy, Python, and TensorFlow, and then also to make TensorFlow run on single-threaded CPU in a specially configured session.
Here's the code I used, updated very slightly from the link I posted.
print('Running in 1-thread CPU mode for fully reproducible results training a CNN and generating numpy randomness. This mode may be slow...')
# Seed value
# Apparently you may use different seed values at each stage
seed_value = 1
# 1. Set `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED'] = str(seed_value)
seed_value += 1
# 2. Set `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)
seed_value += 1
# 3. Set `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)
seed_value += 1
# 4. Set `tensorflow` pseudo-random generator at a fixed value
import tensorflow as tf
tf.set_random_seed(seed_value)
# 5. Configure a new global `tensorflow` session
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
tf.keras.backend.set_session(sess)
#rest of code...
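(If you are on TensorFlow 2.x, where ConfigProto and Session no longer exist, a rough, untested sketch of the equivalent of steps 4-5 would be the following; the threading functions must be called before any op has run.)
import tensorflow as tf
tf.random.set_seed(seed_value)
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)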
Maybe you can try setting the number of parallelism threads to 1. I had the same problem: the loss became different in the seventh decimal place starting from the second episode. It was fixed when I set
tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
For learning purposes, I am trying to implement a CNN from scratch, but the results do not seem to improve over random guessing. I know this is not the best approach on home hardware, and following course.fast.ai I have obtained much better results via transfer learning, but for a deeper understanding I would like to see, at least in theory, how one could do it otherwise.
Testing on CIFAR-10 posed no issues - a small CNN trained from scratch in a matter of minutes with an error of less than 0.5%.
However, when trying to test against the Cats vs. Dogs Kaggle dataset, the results did not budge from 50% accuracy. The architecture is basically a copy of AlexNet, including the non-state-of-the-art choices (large filters, histogram equalization, Nesterov-SGD optimizer). For more details, I put the code in a notebook on GitHub:
https://github.com/mspinaci/deep-learning-examples/blob/master/dogs_vs_cats_with_AlexNet.ipynb
(I also tried different architectures, more VGG-like and using the Adam optimizer, but the result was the same; the reason I followed the structure above was to match as closely as possible the Caffe procedure described here:
https://github.com/adilmoujahid/deeplearning-cats-dogs-tutorial
and that seems to converge quickly enough, according to the author's description here: http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/).
I was expecting some fitting to happen quickly, possibly flattening out due to the many suboptimal choices made (e.g. small dataset, no data augmentation). Instead, I saw no improvement at all, as the notebook shows.
So I thought that maybe I was simply overestimating my GPU and my patience, and that the model was too complicated to even overfit my data in a few hours (I ran 70 epochs, each consisting of roughly 360 batches of 64 images). Therefore I tried to overfit as hard as I could, running these other models:
https://github.com/mspinaci/deep-learning-examples/blob/master/Ridiculously%20overfitting%20models...%20or%20maybe%20not.ipynb
The purely linear model started showing some overfitting: around 53.5% training accuracy vs 52% validation accuracy (which I guess is thus my best result). That matched my expectations. However, to try to overfit as hard as I could, the second model is a simple 2-layer feedforward neural network, without any regularization, that I trained on just 2000 images with batch sizes up to 500. I was expecting the NN to overfit wildly, quickly getting to 100% train accuracy (after all, it has 77M parameters for 2k pictures!). Instead, nothing happened, and the accuracy flattened to 50% quickly enough.
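To make that overfitting attempt concrete, this is roughly the kind of minimal check I mean (a sketch, not the notebook code; X_small/y_small and the input shape are placeholders):
from keras.models import Sequential
from keras.layers import Flatten, Dense

# a tiny fully connected model on a handful of images: if labels and preprocessing are right,
# training accuracy should reach ~100% after enough epochs
model = Sequential([
    Flatten(input_shape=(227, 227, 3)),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_small, y_small, epochs=200, batch_size=len(X_small))  # e.g. 20 images with 0/1 labels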
Any tip about why none of the "multi-layer" models seems able to pick up any feature (be it "true" or due to overfitting) would be very much appreciated!
Note on versions etc.: the notebooks were run on Python 2.7, Keras 2.0.8, Theano 0.9.0. The OS is Windows 10, and the GPU is a not-so-powerful GeForce GTX 960M, which should nevertheless be sufficient for basic tasks.