Puzzling mxnet performance under Python vs Mathematica - python

I compared mxnet performance between Mathematica and Python, observed more than an order of magnitude of difference, and would like advice on how to improve performance under Python.
My NN is an MLP for regression, with 3 float inputs, fully connected layers of 8, 16, 24, and 8 neurons, and 2 float outputs; Sigmoid is used everywhere except on the input and output neurons. The optimizer used in Mathematica is Adam, so I used that in Python too, with the same parameters. The training dataset contains 4215 records mapping xyY colors to Munsell Hue and Chroma.
Mathematica is version 11.2 (released in 2017), which uses mxnet under the hood for deep learning tasks. On the Python side, I use the latest release with mxnet-mkl, and I checked that MKLDNN is enabled.
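For reference, one way to confirm the MKL-DNN build flag from Python (assuming MXNet 1.5 or newer, where mxnet.runtime exposes the compile-time feature list):

import mxnet as mx
from mxnet.runtime import Features

print(mx.__version__)
print(Features().is_enabled('MKLDNN'))   # True if the binary was built with MKL-DNN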
My Mathematica license runs on an MS Surface Pro notebook with Windows 10 (i7-7660U, 2.5 GHz, 2 cores, 4 hyperthreads, AVX2). I ran Python on the same computer for comparison.
Here are the times for training runs of 32768 epochs:

Batch size     128     256     512     1024    2048    4096
Mathematica    8m12s   5m14s   3m34s   2m57s   3m4s    3m48s
PythonMxNet    286m    163m    93m     65m     49m     47m
I tried the mxnet environment variable optimization tricks suggested by Intel, but that only made the times about 120% slower.
I also switched all the arrays from float64 to float32, on the hypothesis that MKL could process twice as many operations in the same amount of time (excluding overhead, of course) with its SIMD registers, but I saw no improvement at all.
The reason I switched my NN work from Mathematica to Python is that I want to train the NN on different and more powerful computers. I also don't like having my notebook tied up with NN learning tasks.
How should I interpret those results?
What may be the cause of those performance differences?
Is there anything I can do to gain some performance under Python?
Or is this simply the unavoidable overhead imposed by the Python interpreter?
Edit:
The script for generating the NN:
def get_MunsellNet( Layers, NbInputs ):
    net = nn.HybridSequential()
    for l in range( len( Layers ) ):
        if l == 0:
            net.add( nn.Dense( Layers[ l ], activation = 'sigmoid', dtype = mu.DType, in_units = NbInputs ) )
        else:
            net.add( nn.Dense( Layers[ l ], activation = 'sigmoid', dtype = mu.DType ) )
    net.add( nn.Dense( 2, dtype = mu.DType ) )
    net.hybridize()
    net.initialize( mx.init.Uniform(), ctx = ctx )
    return net
The NN is created with this:
mu.DType = 'f8'
NbInputs = 3
Norm = 'None' # Possible normalizer are: 'None', 'Unit', 'RMS', 'RRMS', 'Std'
Train_Dataset = mnr.Build_HCTrainData( NbInputs, Norm, Datasets = [ 'all.dat', 'fill.dat' ] )
Test_Dataset1 = mnr.Build_HCTestData( 'real.dat', NbInputs )
Test_Dataset2 = mnr.Build_HCTestData( 'test.dat', NbInputs )
Layers = [ 8, 16, 24, 8 ]
Net = mnn.get_MunsellNet( Layers, NbInputs )
Loss_Fn = mx.gluon.loss.L2Loss()
Learning_Rate = 0.0005
Optimizer = 'Adam'
Batch_Size = 4096
Epochs = 500000
And trained with this:
if __name__ == '__main__':
    global Train_Data_Loader
    Train_Data_Loader = mx.gluon.data.DataLoader( Train_Dataset, batch_size = Batch_Size, shuffle = True, num_workers = mnn.NbWorkers )
    Trainer = mx.gluon.Trainer( Net.collect_params(), Optimizer, {'learning_rate': Learning_Rate} )
    Estimator = estimator.Estimator( net = Net,
                                     loss = Loss_Fn,
                                     trainer = Trainer,
                                     context = mnn.ctx )
    LossRecordHandler = mnu.ProgressRecorder( Epochs, Test_Dataset1, Test_Dataset2, NbInputs, Net_Name, Epochs / 10 * 8 )
    for n in range( 10 ):
        LossRecordHandler.ResetStates()
        Train_Metric = Estimator.prepare_loss_and_metrics()
        Net.initialize( force_reinit = True )
        # ignore warnings for nightly test on CI only
        with warnings.catch_warnings():
            warnings.simplefilter( "ignore" )
            Estimator.fit( train_data = Train_Data_Loader,
                           epochs = Epochs,
                           event_handlers = [ LossRecordHandler ] )
Let me know if you need more code clips.

In general, Python overhead can be large, especially when compute kernels are launched frequently with small inputs. The benchmark above gives some evidence that a large chunk of the gap is due to this overhead: as the batch size increases from 128 to 4096 (32x), the mxnet-to-Mathematica time ratio drops from about 35 to about 12, which is consistent with a speed-up from fewer kernel invocations. I will update the answer once there are more details to this question.
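If the overhead hypothesis holds, one direction worth trying is to take the Estimator and DataLoader out of the per-batch path. Below is a minimal sketch, not the asker's exact code, of a hand-written Gluon loop over data that is already resident as NDArrays; the layer sizes and hyperparameters mirror the question, while static_alloc/static_shape and the sparse synchronization are assumptions aimed at reducing per-call Python work:

import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon import nn

ctx = mx.cpu()

net = nn.HybridSequential()
for units in [8, 16, 24, 8]:
    net.add(nn.Dense(units, activation='sigmoid'))
net.add(nn.Dense(2))
net.initialize(mx.init.Uniform(), ctx=ctx)
# static_alloc/static_shape let the cached graph reuse buffers between calls
net.hybridize(static_alloc=True, static_shape=True)

loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.0005})

def train(X, Y, epochs, batch_size):
    """X, Y: NDArrays already resident on `ctx`, e.g. mx.nd.array(numpy_data, ctx=ctx)."""
    n = X.shape[0]
    for epoch in range(epochs):
        for start in range(0, n, batch_size):
            data = X[start:start + batch_size]
            label = Y[start:start + batch_size]
            with autograd.record():
                loss = loss_fn(net(data), label)
            loss.backward()
            trainer.step(data.shape[0])
        # Synchronize only occasionally; blocking on every batch (asnumpy(),
        # scalar conversion, metric updates) keeps Python in the hot path.
        if (epoch + 1) % 1000 == 0:
            mx.nd.waitall()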

Related

Training loss not decreasing when training - tensorflow gpu

I am training a graph neural network on a node of a cluster with one Titan RTX GPU. I am using Tensorflow-gpu 1.15, and it recognizes the GPU successfully. The training involves some tensor operations of type float64; the training set consists of 256K sparse block-circulant matrices of moderate size. I evaluate 256 samples per run and the batch size is set to 32.
When I look at the loss graph in TensorBoard, I notice that even after evaluating more than 100K samples (after 24 hours of training) my training loss is not decaying at all: it looks noisy and quite flat. This is the plot from TensorBoard:
The loss is measured as the Frobenius norm of an error matrix and it is supposed to decay. I am also using the Adam optimizer with a learning rate of 1e-3.
Any insights on why it is behaving like this? It is basically not learning anything.
I did some quick profiling to see which operations are the slowest, but could not find anything significant.
Could it be the GPU I am using, with a loss of performance due to the heavy memory allocation of float64? When I check GPU usage, about 60% of the memory is allocated (and I have the option to release it after each operation). A quick timing check for the float64 question is sketched after the version list below.
Any suggestion or tips?
I have been using:
Tensorflow-gpu 1.15,
CUDA 10.0.130,
NCCL 2.4.7-CUDA-10.0.130,
cuDNN 7.6.3-CUDA-10.0.130.
Running on a remote server with 4 Titan RTX GPUs (I am using 1 of them).
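As a quick, hedged check of the float64 hypothesis raised above (matrix size and repetition count are arbitrary): time the same matmul in float32 and float64 on the GPU. Titan RTX, like most GeForce-class cards, has much lower float64 than float32 throughput, so a large gap here would point at the dtype rather than at the model itself.

import time
import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

def time_matmul(dtype, n=2048, reps=50):
    # Build a small graph with a single matmul in the requested precision
    a = tf.constant(np.random.rand(n, n), dtype=dtype)
    b = tf.constant(np.random.rand(n, n), dtype=dtype)
    c = tf.linalg.matmul(a, b)
    with tf.compat.v1.Session() as sess:
        sess.run(c)                      # warm-up run
        start = time.time()
        for _ in range(reps):
            sess.run(c)
        return (time.time() - start) / reps

print('float32:', time_matmul(tf.float32))
print('float64:', time_matmul(tf.float64))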
tf.float64 by itself is not the problem, as long as you select the correct optimizer; I am running in v1 compatibility mode ('tf.compat.v1.disable_eager_execution()'). Things to check:
Select the correct input data and target for the optimizer.
Select the correct tf.Variable.
Select an optimized equation or method.
Inputs may require specific conversion methods to tf.float64 in compatibility mode ('tf.compat.v1.disable_eager_execution()').
Make sure the running session sees the same inputs and the same updates of variables, arrays, or feed_dict.
The purpose of the loss matters: it depends on whether you need to find similarities or to find the categories of groups.
Sample: with a similarity loss, re-occurrences in a series are treated as the same; with a plain all-pixels comparison, every small change is seen as different.
import os
from os.path import exists
import tensorflow as tf
import matplotlib.pyplot as plt
from skimage.transform import resize
import numpy as np
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
None
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
config = tf.config.experimental.set_memory_growth(physical_devices[0], True)
print(physical_devices)
print(config)
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Variables
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
learning_rate = 0.1
global_step = 0
tf.compat.v1.disable_eager_execution()
BATCH_SIZE = 1
IMG_SIZE = (32, 32)
history = [ ]
history_Y = [ ]
list_file = [ ]
list_label = [ ]
for file in os.listdir("F:\\datasets\\downloads\\dark\\train") :
    image = plt.imread( "F:\\datasets\\downloads\\dark\\train\\" + file )
    image = resize(image, (32, 32))
    image = np.reshape( image, (1, 32, 32, 3) )
    list_file.append( image )
    list_label.append(1)
optimizer = tf.compat.v1.train.AdamOptimizer(
    learning_rate=0.1,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-08,
    use_locking=False,
    name='Adam'
)
var1 = tf.Variable(255.0, dtype=tf.dtypes.float64)
var2 = tf.Variable(10.0, dtype=tf.dtypes.float64)
X_var = tf.compat.v1.get_variable('X', dtype = tf.float64, initializer = tf.random.normal((1, 32, 32, 3), dtype=tf.dtypes.float64))
y_var = tf.compat.v1.get_variable('Y', dtype = tf.float64, initializer = tf.random.normal((1, 32, 32, 3), dtype=tf.dtypes.float64))
Z = tf.nn.l2_loss((var1 - X_var) ** 2 + (var2 - y_var) ** 2, name="loss")
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
loss = tf.reduce_mean(input_tensor=tf.square(Z))
training_op = optimizer.minimize(loss)
previous_train_loss = 0
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    image = list_file[0]
    X = image
    Y = image
    for i in range(1000):
        global_step = global_step + 1
        train_loss, temp = sess.run([loss, training_op], feed_dict={X_var:X, y_var:Y})
        history.append( train_loss )
        if global_step % 2 == 0 :
            var2 = var2 - 0.001
        if global_step % 4 == 0 and train_loss <= previous_train_loss :
            var1 = var1 - var2 + 0.5
        print( 'steps: ' + str(i) )
        print( 'train_loss: ' + str(train_loss) )
        previous_train_loss = train_loss
    sess.close()
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Graph
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
history = history[:-1]
plt.plot(np.asarray(history))
plt.xlabel('Epoch')
plt.ylabel('loss')
plt.legend(loc='lower right')
plt.show()
Without cosine similarity: all pixels are compared, and every small change is seen as a different meaning.
With cosine similarity: re-occurrences of a series are treated as the same pattern.

Inconsistencies between tf.contrib.layer.fully_connected, tf.layers.dense, tf.contrib.slim.fully_connected, tf.keras.layers.Dense

I am trying to implement policy gradient for a contextual bandit problem (https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-1-5-contextual-bandits-bff01d1aad9c).
I am defining a model in tensorflow to solve this problem using a single fully-connected layer.
I am trying out different APIs from tensorflow, but want to avoid the contrib package since it is not officially supported by tensorflow. I am interested in the keras API since I am already familiar with the functional interface, and it is now implemented as tf.keras. However, I can only get correct results when using tf.contrib.slim.fully_connected or tf.contrib.layers.fully_connected (the former calls the latter).
The following two snippets work correctly (one_hot_encoded_state_input and num_actions both adhere to the expected tensor shapes for the layers).
import tensorflow.contrib.slim as slim
action_probability_distribution = slim.fully_connected(
    one_hot_encoded_state_input, \
    num_actions, \
    biases_initializer=None, \
    activation_fn=tf.nn.sigmoid, \
    weights_initializer=tf.ones_initializer())
and
from tensorflow.contrib.layers import fully_connected
action_probability_distribution = fully_connected(
    one_hot_encoded_state_input,
    num_actions, \
    biases_initializer=None, \
    activation_fn=tf.nn.sigmoid, \
    weights_initializer=tf.ones_initializer())
On the other hand, neither of the following work:
action_probability_distribution = tf.layers.dense(
    one_hot_encoded_state_input, \
    num_actions, \
    activation=tf.nn.sigmoid, \
    bias_initializer=None, \
    kernel_initializer=tf.ones_initializer())
nor
action_probability_distribution = tf.keras.layers.Dense(
    num_actions, \
    activation='sigmoid', \
    bias_initializer=None, \
    kernel_initializer = 'Ones')(one_hot_encoded_state_input)
The last two cases use tensorflow's high-level APIs, layers and keras. Ideally, I would like to know whether I am implementing the first two cases incorrectly with the last two, or whether the latter two are simply not equivalent to the former two.
For completeness, here is the entire code needed to run this (Note: python 3.5.6 and tensorflow 1.12.0 were used).
import tensorflow as tf
import numpy as np
tf.reset_default_graph()
num_states = 3
num_actions = 4
learning_rate = 1e-3
state_input = tf.placeholder(shape=(None,),dtype=tf.int32, name='state_input')
one_hot_encoded_state_input = tf.one_hot(state_input, num_states)
# DOESN'T WORK
action_probability_distribution = tf.keras.layers.Dense(num_actions, activation='sigmoid', bias_initializer=None, kernel_initializer = 'Ones')(one_hot_encoded_state_input)
# WORKS
# import tensorflow.contrib.slim as slim
# action_probability_distribution = slim.fully_connected(one_hot_encoded_state_input,num_actions,\
# biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())
# WORKS
# from tensorflow.contrib.layers import fully_connected
# action_probability_distribution = fully_connected(one_hot_encoded_state_input,num_actions,\
# biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())
# DOESN'T WORK
# action_probability_distribution = tf.layers.dense(one_hot_encoded_state_input,num_actions, activation=tf.nn.sigmoid, bias_initializer=None, kernel_initializer=tf.ones_initializer())
action_probability_distribution = tf.squeeze(action_probability_distribution)
action_chosen = tf.argmax(action_probability_distribution)
reward_input = tf.placeholder(shape=(None,), dtype=tf.float32, name='reward_input')
action_input = tf.placeholder(shape=(None,), dtype=tf.int32, name='action_input')
responsible_weight = tf.slice(action_probability_distribution, action_input, [1])
loss = -(tf.log(responsible_weight)*reward_input)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
update = optimizer.minimize(loss)
bandits = np.array([[0.2,0,-0.0,-5],
[0.1,-5,1,0.25],
[-5,5,5,5]])
assert bandits.shape == (num_states, num_actions)
def get_reward(state, action):  # the lower the value of bandits[state][action], the higher the likelihood of reward
    if np.random.randn() > bandits[state][action]:
        return 1
    return -1
max_episodes = 10000
epsilon = 0.1
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    rewards = np.zeros(num_states)
    for episode in range(max_episodes):
        state = np.random.randint(0,num_states)
        action = sess.run(action_chosen, feed_dict={state_input:[state]})
        if np.random.rand(1) < epsilon:
            action = np.random.randint(0, num_actions)
        reward = get_reward(state, action)
        sess.run([update, action_probability_distribution, loss], feed_dict = {reward_input: [reward], action_input: [action], state_input: [state]})
        rewards[state] += reward
        if episode%500 == 0:
            print(rewards)
When using the chunks commented # WORKS, the agent learns and maximizes reward across all three states. On the other hand, those commented # DOESN'T WORK don't learn and typically converge extremely quickly to choosing one action. For example, working behaviour prints a reward list of positive, increasing numbers (good cumulative reward for each state); non-working behaviour looks like a reward list where only one action has increasing cumulative reward, usually at the expense of the others (negative cumulative reward).
For anyone who runs into this issue (especially since tensorflow has many APIs for the same thing), the difference comes down to bias initialization and defaults. In tf.contrib and tf.slim, biases_initializer=None means that no bias is used. Replicating this with tf.layers and tf.keras requires use_bias=False, not bias_initializer=None.
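To make the equivalence concrete, here is a minimal sketch (assuming the TF 1.12 API used in the question; tensor names mirror the ones above) of the two non-contrib variants rewritten with use_bias=False:

import tensorflow as tf

tf.reset_default_graph()
num_states, num_actions = 3, 4

state_input = tf.placeholder(shape=(None,), dtype=tf.int32, name='state_input')
one_hot_encoded_state_input = tf.one_hot(state_input, num_states)

# tf.layers equivalent of slim.fully_connected(..., biases_initializer=None):
# disable the bias entirely rather than passing bias_initializer=None
layers_version = tf.layers.dense(
    one_hot_encoded_state_input,
    num_actions,
    activation=tf.nn.sigmoid,
    use_bias=False,
    kernel_initializer=tf.ones_initializer())

# tf.keras equivalent, same idea
keras_version = tf.keras.layers.Dense(
    num_actions,
    activation='sigmoid',
    use_bias=False,
    kernel_initializer='Ones')(one_hot_encoded_state_input)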

Fine-tuning a neural network in tensorflow

I've been working on this neural network with the intent to predict TBA (time based availability) of simulated windmill parks based on certain attributes. The neural network runs just fine, and gives me some predictions, however I'm not quite satisfied with the results. It fails to notice some very obvious correlations that I can clearly see by myself. Here is my current code:
# Import
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
maxi = 0.96
mini = 0.7
# Make data a np.array
data = pd.read_csv('datafile_ML_no_avg.csv')
data = data.values
# Shuffle the data
shuffle_indices = np.random.permutation(np.arange(len(data)))
data = data[shuffle_indices]
# Training and test data
data_train = data[0:int(len(data)*0.8),:]
data_test = data[int(len(data)*0.8):int(len(data)),:]
# Scale data
scaler = MinMaxScaler(feature_range=(mini, maxi))
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)
# Build X and y
X_train = data_train[:, 0:5]
y_train = data_train[:, 6:7]
X_test = data_test[:, 0:5]
y_test = data_test[:, 6:7]
# Number of stocks in training data
n_args = X_train.shape[1]
multi = int(8)
# Neurons
n_neurons_1 = 8*multi
n_neurons_2 = 4*multi
n_neurons_3 = 2*multi
n_neurons_4 = 1*multi
# Session
net = tf.InteractiveSession()
# Placeholder
X = tf.placeholder(dtype=tf.float32, shape=[None, n_args])
Y = tf.placeholder(dtype=tf.float32, shape=[None,1])
# Initialize1s
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg",
distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()
# Hidden weights
W_hidden_1 = tf.Variable(weight_initializer([n_args, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))
# Output weights
W_out = tf.Variable(weight_initializer([n_neurons_4, 1]))
bias_out = tf.Variable(bias_initializer([1]))
# Hidden layer
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2),
bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3),
bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4),
bias_hidden_4))
# Output layer (transpose!)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))
# Cost function
mse = tf.reduce_mean(tf.squared_difference(out, Y))
# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)
# Init
net.run(tf.global_variables_initializer())
# Fit neural net
batch_size = 10
mse_train = []
mse_test = []
# Run
epochs = 10
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        net.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 50) == 0:
            mse_train.append(net.run(mse, feed_dict={X: X_train, Y: y_train}))
            mse_test.append(net.run(mse, feed_dict={X: X_test, Y: y_test}))
pred = net.run(out, feed_dict={X: X_test})
print(pred)
I have tried tweaking the number of hidden layers, the number of nodes per layer, the number of epochs, and different activation functions and optimizers. However, I am quite new to neural networks, so there might be something very obvious that I'm missing.
Thanks in advance to anyone who managed to read through all of that.
It would make this much easier if you shared a small dataset that illustrates the problem. However, I will list some common issues with non-standard datasets and how to overcome them.
Possible solutions
Regularization and validation-based optimization - these are always worth trying when looking for some extra accuracy. See dropout methods here (original paper), and some overview here; a minimal dropout sketch follows this list.
Unbalanced data - sometimes time-series categories/events behave like anomalies, or are simply unbalanced. If you read a book, words like the or it appear far more often than warehouse. This becomes a problem if your main task is to detect the word warehouse and you train your network (even LSTMs) in the traditional way. A way to overcome this is to balance the samples (create balanced datasets) or to give more weight to low-frequency categories.
Model structure - sometimes fully connected layers are not enough. See computer vision problems, for instance, where we train using convolution layers. Convolution and pooling layers enforce structure on the model, which is suitable for images; this is also a form of regularization, since those layers have fewer parameters. In time-series problems convolutions are also possible, and it turns out they work just fine. See the example in Conditional Time Series Forecasting with Convolution Neural Networks.
The above suggestions are presented in the order I would suggest trying them.
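As a concrete illustration of the first suggestion, here is a minimal, hedged sketch of adding dropout between hidden layers, in the style of the TF 1.x graph code from the question (the layer sizes and keep probability below are illustrative only, not taken from the question):

import tensorflow as tf

tf.reset_default_graph()
n_args, n_neurons_1 = 5, 64                      # illustrative sizes
X = tf.placeholder(dtype=tf.float32, shape=[None, n_args])
# Defaults to 1.0 (no dropout) so evaluation runs need no extra feed
keep_prob = tf.placeholder_with_default(1.0, shape=())

weight_initializer = tf.variance_scaling_initializer(mode="fan_avg",
                                                     distribution="uniform", scale=1)
W_hidden_1 = tf.Variable(weight_initializer([n_args, n_neurons_1]))
bias_hidden_1 = tf.Variable(tf.zeros_initializer()([n_neurons_1]))

hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_1 = tf.nn.dropout(hidden_1, keep_prob=keep_prob)   # drop after the activation
# ...repeat the same pattern for the remaining hidden layers...

# During training, feed a keep probability below 1.0, e.g.:
#   net.run(opt, feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.8})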
Good luck!

Tensorflow 170 times slower than Theano for RNN implementation

I am trying to implement an RNN in Tensorflow (0.11), based on this paper.
They have a Theano implementation here, which I am comparing my implementation against. When I run their Theano implementation, it finishes 10 epochs in about 1 hour, while my Tensorflow implementation needs about 17 hours just to finish 1 epoch. I am wondering if anyone could look at my code and tell me if there are obvious problems slowing it down.
The purpose of the RNN is to predict the next item a user is going to click on, given their previous clicks. The items are represented by unique IDs that are given as input to the RNN as 1-HOT vectors.
So the RNN is built like this:
[INPUT (1-HOT representation, size 37803)] -> [GRU layer (size 100)] -> [FeedForward layer]
and the output from the FF layer is a vector with the same size as the input vector, where high values indicate that the item corresponding to that index is very likely to be the next one clicked.
num_hidden = 100
x = tf.placeholder(tf.float32, [None, max_length, n_items], name="InputX")
y = tf.placeholder(tf.float32, [None, max_length, n_items], name="TargetY")
session_length = tf.placeholder(tf.int32, [None], name="SeqLenOfInput")
output, state = rnn.dynamic_rnn(
    rnn_cell.GRUCell(num_hidden),
    x,
    dtype=tf.float32,
    sequence_length=session_length
)
layer = {'weights': tf.Variable(tf.random_normal([num_hidden, n_items])),
         'biases': tf.Variable(tf.random_normal([n_items]))}
output = tf.reshape(output, [-1, num_hidden])
prediction = tf.matmul(output, layer['weights'])
y_flat = tf.reshape(y, [-1, n_items])
final_output = tf.nn.softmax_cross_entropy_with_logits(prediction,y_flat)
cost = tf.reduce_sum(final_output)
optimizer = tf.train.AdamOptimizer().minimize(cost)
Both implementations are tested on the same hardware. Both implementations utilize the GPU.
EDIT:
The Theano model has the same structure. (1-HOT input -> GRU layer with 100 units -> FeedForward)
I tested the Theano version with the same parameters as I used in my model (cross-entropy loss, batch size = 200, Adam optimizer with the same learning rate, no dropout in either model), but the speed difference is still the same.
EDIT (2016-12-07):
Using file queues to queue batches instead of using feed_dict helped a lot.
I still need to do other optimizations to make it faster (one further possibility is sketched after the code below). Anyway, here is how I used file queues to speed things up.
# Create filename_queue
filename_queue = tf.train.string_input_producer(train_files, shuffle=True)
min_after_dequeue = 1024
capacity = min_after_dequeue + 3*batch_size
examples_queue = tf.RandomShuffleQueue(
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    dtypes=[tf.string])
# Create multiple readers to populate the queue of examples
enqueue_ops = []
for i in range(n_readers):
    reader = tf.TextLineReader()
    _key, value = reader.read(filename_queue)
    enqueue_ops.append(examples_queue.enqueue([value]))
tf.train.queue_runner.add_queue_runner(
    tf.train.queue_runner.QueueRunner(examples_queue, enqueue_ops))
example_string = examples_queue.dequeue()
# Default values, and type of the columns, first is sequence_length
# +1 since first field is sequence length
record_defaults = [[0]]*(max_sequence_length+1)
enqueue_examples = []
for thread_id in range(n_preprocess_threads):
    example = tf.decode_csv(value, record_defaults=record_defaults)
    # Split the row into input/target values
    sequence_length = example[0]
    features = example[1:-1]
    targets = example[2:]
    enqueue_examples.append([sequence_length, features, targets])
# Batch together examples
session_length, x_unparsed, y_unparsed = tf.train.batch_join(
    enqueue_examples,
    batch_size=batch_size,
    capacity=2*n_preprocess_threads*batch_size)
# Parse the examples in a batch
x = tf.one_hot(x_unparsed, depth=n_classes)
y = tf.one_hot(y_unparsed, depth=n_classes)
# From here on, x, y and session_length can be used in the model
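A minimal sketch of one such further optimization: for a 37803-way output layer, sampled softmax can replace the full softmax_cross_entropy_with_logits during training. The shapes, placeholder names, and num_sampled value below are illustrative assumptions; keyword arguments are used because the positional order of this function differs between TF versions, and note that it expects integer class IDs rather than 1-HOT vectors, with the weight matrix stored as [num_classes, num_hidden].

import tensorflow as tf

num_hidden = 100
n_items = 37803
num_sampled = 512                       # number of negative classes to sample per step

softmax_w = tf.Variable(tf.random_normal([n_items, num_hidden]))   # [classes, hidden]
softmax_b = tf.Variable(tf.zeros([n_items]))

rnn_output = tf.placeholder(tf.float32, [None, num_hidden])        # flattened GRU outputs
target_ids = tf.placeholder(tf.int64, [None, 1])                   # clicked-item IDs

train_loss = tf.reduce_sum(tf.nn.sampled_softmax_loss(
    weights=softmax_w,
    biases=softmax_b,
    labels=target_ids,
    inputs=rnn_output,
    num_sampled=num_sampled,
    num_classes=n_items))
# At evaluation time, compute the full scores as before, e.g.
# tf.matmul(rnn_output, softmax_w, transpose_b=True) + softmax_b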

Using python multiprocessing for sklearn NN

I am using the dev version of the Python sklearn package, which includes the NN implementation.
My task is to train 4 NNs with different input data and then average their predictions:
X_median = preprocessing.scale(data_median)
X_min = preprocessing.scale(data_min)
X_max = preprocessing.scale(data_max)
X_mean = preprocessing.scale(data_mean)
I create the neural networks like this:
NN1 = MLPClassifier(hidden_layer_sizes = (50), activation = 'logistic', algorithm='adam', alpha= 0 , max_iter = 40, batch_size = 10, learning_rate = 'adaptive', shuffle = True, random_state=1)
NN2 = MLPClassifier(hidden_layer_sizes = (50), activation = 'logistic', algorithm='adam', alpha= 0 , max_iter = 40, batch_size = 10, learning_rate = 'adaptive', shuffle = True, random_state=1)
NN3 = MLPClassifier(hidden_layer_sizes = (50), activation = 'logistic', algorithm='adam', alpha= 0 , max_iter = 40, batch_size = 10, learning_rate = 'adaptive', shuffle = True, random_state=1)
NN4 = MLPClassifier(hidden_layer_sizes = (50), activation = 'logistic', algorithm='adam', alpha= 0 , max_iter = 40, batch_size = 10, learning_rate = 'adaptive', shuffle = True, random_state=1)
(standard sklearn function)
and I want to train them on the datasets described above.
Without using a pool, my code would look like this:
NN1.fit(X_mean,train_y)
NN2.fit(X_median,train_y)
NN3.fit(X_min,train_y)
NN4.fit(X_max,train_y)
Of course, since all 4 trainings are independent, I want to run them in parallel, and I assume I should use a pool for this. However, I do not completely understand how the computation is performed. I assumed I should write something like this:
pool = Pool()
pool.apply_async(NN1.fit, args = (X_mean, train_y))
However, this does not produce any results; I can even call it like this (passing only one argument) and the program finishes without any errors!
pool.apply_async(NN1.fit, args = (X_mean,)).
What would be the correct way to perform such computations?
Can someone recommend a good resource for understanding the usage of Python multiprocessing?
Finally, I made it work :)
I based my solution on this answer. First, create two helper functions:
1)
def Myfunc(MyNN,X,train_y):
    MyNN.fit(X,train_y)
    return MyNN
This one just makes the desired function global so it can be passed to the pool methods.
2)
def test_star(a_b):
    return Myfunc(*a_b)
This is the key part: a helper function that takes one argument and splits it into the number of arguments Myfunc needs.
Then just create
mylist = [(NN_mean,X_mean, train_y), (NN_median,X_median, train_y)]
and execute
NN_mean, NN_median = pool.map(test_star, mylist).
From my point of view this solution is quite ugly, but it works. I hope someone can create a more elegant one and post it :).
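For what it's worth, a somewhat more compact variant is possible with Pool.starmap (available since Python 3.3), which unpacks the argument tuples itself so no wrapper helper is needed. This is a minimal, self-contained sketch with toy data standing in for the four scaled datasets; the modern sklearn API spells the optimizer argument 'solver' rather than 'algorithm':

from multiprocessing import Pool
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_one(nn, X, y):
    nn.fit(X, y)          # runs in a child process
    return nn             # the fitted estimator is pickled back to the parent

if __name__ == '__main__':
    # Toy stand-ins for X_mean, X_median, X_min, X_max and train_y
    rng = np.random.RandomState(1)
    train_y = rng.randint(0, 2, size=200)
    datasets = [rng.rand(200, 10) for _ in range(4)]
    nets = [MLPClassifier(hidden_layer_sizes=(50,), activation='logistic',
                          solver='adam', max_iter=40, batch_size=10,
                          random_state=1) for _ in range(4)]
    with Pool() as pool:
        fitted = pool.starmap(fit_one, zip(nets, datasets, [train_y] * 4))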
