AMD plaidml vs CPU Tensorflow - Unexpected results - python

I am currently running a simple script to train the mnist dataset.
Running the training on my CPU via TensorFlow gives me 49us/sample and a 3s epoch, using the following code:
# CPU
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
model.compile(optimizer='adam', loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)
When I run the dataset on my AMD Radeon Pro 580 (opencl_amd_radeon_pro_580_compute_engine, set up via PlaidML), I get 249us/sample and a 15s epoch, using the following code:
# GPU
import plaidml.keras
plaidml.keras.install_backend()
import keras
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = keras.utils.normalize(x_train, axis=1)
x_test = keras.utils.normalize(x_test, axis=1)
model = keras.models.Sequential()
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3)
I can see my CPU firing up for the CPU test and my GPU maxing out for the GPU test, but I am very confused as to why the CPU is outperforming the GPU by a factor of 5.
Is this the expected result?
Am I doing something wrong in my code?

It seems I've found the right solution, at least for a macOS/Keras/AMD GPU setup.
TL;DR:
Do not use OpenCL; use Metal instead.
Do not use TensorFlow 2.0; use the Keras-only API.
Here are the details:
Run plaidml-setup and pick Metal đŸ€˜đŸ» (this is important!):
...
Multiple devices detected (You can override by setting PLAIDML_DEVICE_IDS).
Please choose a default device:
1 : llvm_cpu.0
2 : metal_intel(r)_uhd_graphics_630.0
3 : metal_amd_radeon_pro_560x.0
Default device? (1,2,3)[1]:3
...
Make sure you save the changes:
Save settings to /Users/alexanderegorov/.plaidml? (y,n)[y]:y
Success!
Now run the MNIST example; you should see something like:
INFO:plaidml:Opening device "metal_amd_radeon_pro_560x.0"
This is it. I have made a comparison using plaidbench keras mobilenet:
metal_amd_radeon_pro_560x.0 FASTEST!
Example finished, elapsed: 0.435s (compile), 8.057s (execution)
opencl_amd_amd_radeon_pro_560x_compute_engine.0
Example finished, elapsed: 3.197s (compile), 14.620s (execution)
llvm_cpu.0
Example finished, elapsed: 3.619s (compile), 47.837s (execution)
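If you prefer to pin the device from code rather than from the saved settings, the PLAIDML_DEVICE_IDS override mentioned in the plaidml-setup output can be exported before the backend is installed. A minimal sketch, assuming the Metal device name listed above:
import os

# Device name taken from the plaidml-setup listing above (an assumption for this sketch).
os.environ["PLAIDML_DEVICE_IDS"] = "metal_amd_radeon_pro_560x.0"

import plaidml.keras
plaidml.keras.install_backend()  # Keras calls now run on the selected Metal device

import keras
print(keras.backend.backend())   # should report the PlaidML backend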

I think there are two aspects of the observed situation:
plaidml is not that great in my experience and I’ve had similar results, sadly.
Moving data to the GPU is slow. In this case the MNIST data is really small, and the time to move it to the GPU outweighs the "benefit" of parallelizing the computation. The TensorFlow CPU backend probably parallelizes the matrix multiplications as well, but it is much faster here because the data is small and sits closer to the processing unit.
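One quick way to see this overhead effect (a rough illustration, not a fix) is to raise the batch size so each transfer to the GPU carries more work per round trip. A minimal sketch against the same PlaidML/Keras MNIST model as above; the batch sizes are illustrative values:
import time
import plaidml.keras
plaidml.keras.install_backend()
import keras
from keras.datasets import mnist

(x_train, y_train), _ = mnist.load_data()
x_train = keras.utils.normalize(x_train, axis=1)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Larger batches amortize the per-batch host-to-GPU transfer and kernel launch cost.
# Note the first fit also includes PlaidML's one-time kernel compilation.
for batch_size in (32, 512):
    start = time.time()
    model.fit(x_train, y_train, epochs=1, batch_size=batch_size, verbose=0)
    print("batch_size=%d took %.1fs" % (batch_size, time.time() - start))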

Related

I am trying to fit an CuDNNLSTM model and I am getting an error

I am learning about recurrent neural networks and I found the CuDNNLSTM layer, which is much faster than the usual LSTM. So I have tried to fit a CuDNNLSTM model, but the only thing the program displays is "Epoch 1"; then nothing happens and my kernel dies (I am working in jupyter-notebook). In the Jupyter terminal I found this:
2022-05-25 22:22:59.693801: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
2022-05-25 22:23:00.149065: E tensorflow/stream_executor/cuda/cuda_driver.cc:1018] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2022-05-25 22:23:00.149218: E tensorflow/stream_executor/gpu/gpu_timer.cc:55] INTERNAL: Error destroying CUDA event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2022-05-25 22:23:00.150008: E tensorflow/stream_executor/gpu/gpu_timer.cc:60] INTERNAL: Error destroying CUDA event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2022-05-25 22:23:00.150355: F tensorflow/stream_executor/cuda/cuda_dnn.cc:217] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0) Failed to set cuDNN stream.
I have installed tensorflow-gpu and the CuDNN and CUDA versions compatible with my TensorFlow version:
tensorflow version: 2.9.0
CUDA version: 11.2
CuDNN version: 8.1
I have also tried the same model with plain LSTM layers, and that worked, but it is very slow, so I want to figure out how to use the CuDNNLSTM model.
My code:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.compat.v1.keras.layers import CuDNNLSTM
mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train/255.0
X_test = X_test/255.0
model = Sequential()
model.add(CuDNNLSTM(128, input_shape=(X_train.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(CuDNNLSTM(128))
model.add(Dropout(0.2))
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))
opt = tf.keras.optimizers.Adam(learning_rate=1e-3, decay=1e-5)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=opt,
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=3, validation_data=(X_test, y_test))
If somebody has had the same problem or knows how to fix it, I would be grateful for help.
Thanks in advance.
Have you tried this with the tanh activation function? From my understanding, it has to use that (see the sketch after the quoted requirements below). Details from the Keras docs:
Long Short-Term Memory layer - Hochreiter 1997.
See the Keras RNN API guide for details about the usage of the RNN API.
Based on available runtime hardware and constraints, this layer will choose different implementations (cuDNN-based or pure-TensorFlow) to maximize the performance. If a GPU is available and all the arguments to the layer meet the requirement of the cuDNN kernel (see below for details), the layer will use a fast cuDNN implementation.
The requirements to use the cuDNN implementation are:
activation == tanh
recurrent_activation == sigmoid
recurrent_dropout == 0
unroll is False
use_bias is True
Inputs, if use masking, are strictly right-padded.
Eager execution is enabled in the outermost context.
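So rather than reaching for the deprecated tf.compat.v1 CuDNNLSTM layer, a sketch of the same model using tf.keras.layers.LSTM with its defaults, which already satisfy the requirements quoted above (tanh activation, sigmoid recurrent activation, no recurrent dropout), might look like this:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

mnist = tf.keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0

model = Sequential()
# With the default activation='tanh' and recurrent_activation='sigmoid',
# TF 2.x picks the cuDNN kernel automatically when a GPU is available.
model.add(LSTM(128, input_shape=X_train.shape[1:], return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(32, activation="relu"))
model.add(Dense(10, activation="softmax"))

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=3, validation_data=(X_test, y_test))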

convert tf keras model to scikit MLP NN

I am experimenting with training an NLTK classifier model with TensorFlow and Keras. Would anyone know if this could be recreated with the sklearn neural network MLPClassifier? For what I am using ML for, I don't think I need TensorFlow; I'd prefer something simpler and easier to install/deploy.
I don't have a lot of machine learning wisdom here; any tips are greatly appreciated, even just a description of what this deep-learning TensorFlow/Keras model does.
So my tf keras model architecture looks like this:
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

training = []  # populated elsewhere with (pattern, intent) feature/label pairs
random.shuffle(training)
training = np.array(training)
# create train and test lists. X - patterns, Y - intents
train_x = list(training[:,0])
train_y = list(training[:,1])
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))
# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
# Fit the model
model.fit(np.array(train_x), np.array(train_y), epochs=200, batch_size=5, verbose=1)
So for the sklearn neural network, am I on track at all with this below? Can someone help me understand what exactly the TensorFlow model architecture is and what cannot be duplicated with sklearn? I sort of understand that TensorFlow is probably much more powerful than sklearn, which is something simpler.
#Importing MLPClassifier
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(128,64),activation ='relu',solver='sgd',random_state=1)
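For comparison, a hedged sketch of how the Keras architecture above might map onto MLPClassifier parameters, assuming train_x/train_y are the same lists as above; the Dropout layers have no direct MLPClassifier equivalent, the L2 penalty alpha being the closest knob:
import numpy as np
from sklearn.neural_network import MLPClassifier

# MLPClassifier wants class indices rather than one-hot rows,
# so collapse the one-hot train_y used by the Keras model.
y = np.argmax(np.array(train_y), axis=1)

model = MLPClassifier(hidden_layer_sizes=(128, 64),  # the two hidden Dense layers
                      activation='relu',
                      solver='sgd',
                      learning_rate_init=0.01,        # lr from the Keras SGD
                      momentum=0.9,
                      nesterovs_momentum=True,
                      alpha=1e-4,                     # L2 penalty; rough stand-in for Dropout
                      max_iter=200,                   # epochs analogue
                      random_state=1)
model.fit(np.array(train_x), y)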
Just google converting a Keras model to PyTorch; there are quite a few tutorials out there for that... It doesn't look easy, but it may be worth the effort for whatever you need it for...
Going down this road, I can get good enough results just using the sklearn MLP neural network... without the hassle of getting TensorFlow installed properly.
Also, on a cloud Linux instance TensorFlow requires a LOT more memory and storage than a FREE account on pythonanywhere.com provides, but a free account seems just fine with sklearn.
When experimenting with the sklearn MLP NN, for whatever reason I got better results just leaving the architecture at its default and playing around with the learning rate:
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(learning_rate_init=0.0001,max_iter=9000,shuffle=True).fit(train_x, train_y)

Why does my CNN model train in less time on my local CPU than on hosted options?

I'm completely new to deep learning and I've been following a few tutorials that mostly use hosted Jupyter notebooks (Azure and Colaboratory). I'm at a stage where I'm looking to start experimenting with my own neural networks; however, I'm a little confused about where I should be training my Keras models. To decide, I ran the following model in a few different places, and in summary my i5 6500 CPU came 2nd, which I found incredibly confusing. More confusing is that running Google Cloud Compute with 8 virtual CPUs was slower than running on my CPU. I have yet to try my GTX 1060 GPU; however, it seems reasonable to assume that it will perform even better than my CPU. Why am I getting these results, and where do people usually train their ML models? My results are below.
from keras.datasets import mnist
from keras.preprocessing.image import load_img, array_to_img
from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
image_height, image_width = 28, 28
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, image_height * image_width)
x_test = x_test.reshape(10000, image_height * image_width)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255.0
x_test /= 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dense(512, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
I tried the above snippet in the following locations. Below are the per epoch times.
My i5 6500 CPU: 20s
Colaboratory Notebook with CPU: 27s
Colaboratory Notebook with GPU: 8s (expected)
Colaboratory Notebook with TPU: 26s
Azure Notebook with CPU: 60s
Google Cloud Compute Jupyterlab: 4vCPUs: 36s
Google Cloud Compute Jupyterlab: 8vCPUs: 40s
Unfortunately, running Google Cloud Compute with GPU requires me to upgrade my free account so I didn't get to try that.
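For what it's worth, per-epoch numbers like the ones above can be measured on any of these machines with a small Keras callback rather than by eyeballing the progress bar. A minimal sketch, reusing the model and data from the snippet above:
import time
from keras.callbacks import Callback

class EpochTimer(Callback):
    """Records wall-clock seconds for each epoch."""
    def on_train_begin(self, logs=None):
        self.times = []
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()
    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self._start)

timer = EpochTimer()
history = model.fit(x_train, y_train, epochs=2,
                    validation_data=(x_test, y_test),
                    callbacks=[timer])
print("per-epoch seconds:", [round(t, 1) for t in timer.times])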

Different accuracies on different machines using same seeds, code and dataset

I am trying to develop a CNN for signature recognition to identify which person a given signature belongs to. There are 3 different classes(persons) and 23 signatures for each of them. Having this little amount of samples, I decided to use the Keras ImageDataGenerator to create additional images.
However, testing the CNN on different machines (Windows 10 and macOS) gives different accuracy scores when evaluating the model on the test data: 100% on Windows and 93% on macOS. Both machines run Python 3.7.3 64-bit.
The data is split using train_test_split from sklearn.model_selection with 0.8 for training and 0.2 for testing, random_state is 1. Data and labels are properly normalised and adjusted to fit into the CNN. Various numbers for steps_per_epoch, batch_size and epoch have been tried.
I have tried using both np.random.seed and tensorflow.set_random_seed, getting 100% accuracy on the test data using seed(1) on the PC; however, the same seed on the other machine still yields a different accuracy score.
Here is the CNN architecture along with the method call for creating additional images. The following code yields an accuracy of 100% on one machine and 93.33% on the other.
from numpy.random import seed
from tensorflow import set_random_seed  # TF 1.x API, as mentioned above
from sklearn.model_selection import train_test_split
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# data, labels and input_shape are prepared earlier from the signature images
seed(185)
set_random_seed(185)
X_train, X_test, y_train, y_test = train_test_split(data, labels, train_size=0.8, test_size=0.2, random_state=1)
datagen = ImageDataGenerator()
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit_generator(datagen.flow(X_train, y_train, batch_size=64), validation_data=(X_test,y_test),steps_per_epoch= 30, epochs=10)
model.evaluate(X_test, y_test)
EDIT
So after more research I've discovered that using different hardware, specifically different graphics cards, will result in varying accuracies.
Saving the trained model and using that to evaluate data on different machines is the ideal solution.
Providing the solution here (Answer Section), even though it is present in the Question Section, for the benefit of the community.
Using different hardware, specifically different graphics cards, will result in varying accuracies even though we use the same seeds, code and dataset. Saving the trained model and using it to evaluate data on different machines is the ideal solution.
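A hedged sketch of that workflow with the Keras model above: train and save on one machine, then load and evaluate the identical weights on the other, so the hardware-dependent training step only happens once (the filename is illustrative):
# On the machine where the model was trained (after model.fit_generator above):
model.save('signature_cnn.h5')           # architecture + weights + optimizer state

# On any other machine:
from keras.models import load_model
trained = load_model('signature_cnn.h5')
loss, acc = trained.evaluate(X_test, y_test)
print('accuracy: %.4f' % acc)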

Learning Keras model by using Distributed Tensorflow

I have two GPUs installed on two different machines. I want to build a cluster that allows me to train a Keras model using the two GPUs together.
The Keras blog shows two snippets of code in its Distributed training section and links to the official TensorFlow documentation.
My problem is that I don't know how to train my model and put into practice what is described in the TensorFlow documentation.
For example, what should I do if I want to execute the following code on a cluster with multiple GPUs?
# For a single-input model with 2 classes (binary classification):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
In the first and second parts of the blog he explains how to use Keras models with TensorFlow.
I also found this example of Keras with distributed training.
And here is another with Horovod.
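If you end up on TensorFlow 2.x with tf.keras rather than standalone Keras, one option (my assumption, not from the blog above) is tf.distribute.MultiWorkerMirroredStrategy, where each machine runs the same script with its own TF_CONFIG. A minimal sketch; the hostnames, ports and worker index are placeholders:
import os, json
import numpy as np
import tensorflow as tf

# Each worker sets TF_CONFIG with the full cluster and its own index
# (shown here for worker 0; the second machine would use "index": 1).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["machine1:12345", "machine2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Model creation and compilation must happen inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu', input_shape=(100,)),  # input_dim=100 in the question
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Same dummy data as in the question; the strategy shards the batches across workers.
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))
model.fit(data, labels, epochs=10, batch_size=32)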
