Tensorflow (GPU) vs. Numpy - python

So I've got two implementations of a linear regression using gradient descent: one in Tensorflow, one in Numpy. I'm finding the one in Numpy is about 3x faster than the one in Tensorflow. Here's my code -
Tensorflow:
class network_cluster(object):

    def __init__(self, data_frame, feature_cols, label_cols):
        self.init_data(data_frame, feature_cols, label_cols)
        self.init_tensors()

    def init_data(self, data_frame, feature_cols, label_cols):
        self.data_frame = data_frame
        self.feature_cols = feature_cols
        self.label_cols = label_cols

    def init_tensors(self):
        self.features = tf.placeholder(tf.float32)
        self.labels = tf.placeholder(tf.float32)
        self.weights = tf.Variable(tf.random_normal((len(self.feature_cols),
                                                     len(self.label_cols))))
        self.const = tf.Variable(tf.random_normal((len(self.label_cols),)))

    def linear_combiner(self):
        return tf.add(tf.matmul(self.features, self.weights), self.const)

    def predict(self):
        return self.linear_combiner()

    def error(self):
        return tf.reduce_mean(tf.pow(self.labels - self.predict(), 2), axis = 0)

    def learn_model(self, epocs = 100):
        optimizer = tf.train.AdadeltaOptimizer(1).minimize(self.error())
        error_rcd = []
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for epoc in range(epocs):
                _, error = sess.run([optimizer, self.error()], feed_dict={
                    self.features: self.data_frame[self.feature_cols],
                    self.labels: self.data_frame[self.label_cols]
                })
                error_rcd.append(error[0])
        return error_rcd

    def get_coefs(self):
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            coefs = sess.run([self.weights, self.const])
        return coefs
test_cluster = network_cluster(dataset, ['ship_jumps', 'npc_kills', 'ship_kills', 'pod_kills'], ['hour_of_week'])
%timeit test_cluster.learn_model(epocs = 100)
And numpy:
def grad_descent(dataset, features, predictor, max_iters = 10000):

    def initialize_model(dataset, features, predictor):
        constant_array = np.ones(shape = (len(dataset), 1))
        features_array = dataset.loc[:, features].values
        features_array = np.append(constant_array, features_array, axis = 1)
        predict_array = dataset.loc[:, predictor].values
        betas = np.zeros(shape = (len(features) + 1, len(predictor)))
        return (features_array, predict_array, betas)

    def calc_gradient(features_array, predict_array, betas):
        prediction = np.dot(features_array, betas)
        predict_error = predict_array - prediction
        gradient = -2 * np.dot(features_array.transpose(), predict_error)
        gradient_two = 2 * np.expand_dims(np.sum(features_array ** 2, axis = 0), axis = 1)
        return (gradient, gradient_two)

    def update_betas(gradient, gradient_two, betas):
        new_betas = betas - ((gradient / gradient_two) / len(betas))
        return new_betas

    def model_error(features_array, predict_array, betas):
        prediction = np.dot(features_array, betas)
        predict_error = predict_array - prediction
        model_error = np.sqrt(np.mean(predict_error ** 2))
        return model_error

    features_array, predict_array, betas = initialize_model(dataset, features, predictor)
    prior_error = np.inf
    for iter_count in range(max_iters):
        gradient, gradient_two = calc_gradient(features_array, predict_array, betas)
        betas = update_betas(gradient, gradient_two, betas)
        curr_error = model_error(features_array, predict_array, betas)
        if curr_error == prior_error:
            break
        prior_error = curr_error
    return (betas, iter_count, curr_error)
%timeit grad_descent(dataset, ['ship_jumps', 'npc_kills', 'ship_kills', 'pod_kills'], ['hour_of_week'], max_iters = 100)
I'm testing using the Spyder IDE, and I do have an Nvidia GPU (960). The Tensorflow code clocks in at ~20 seconds, with the Numpy code at about 7 seconds on the same dataset. The dataset is almost 1 million rows.
I would have expected Tensorflow to beat out Numpy handily here, but that's not the case. Granted I am new to using Tensorflow, and the Numpy implementation doesn't use a class, but still, 3x better with Numpy?!
Hoping for some thoughts/ideas on what I'm doing wrong here.

Without looking at your code in detail (not that much experience with TF):
This comparison is flawed!
Yaroslav's comment is of course true: GPU-computing has some overhead (at least data-preparation; not sure what kind of compiling is clocked here)
It seems you are comparing pure GD to Adadelta in full-batch mode:
Adadelta of course implies some overhead (there are more operations than calculating the gradient and multiplying the current iterate), as it is one of the common adaptive (per-parameter learning-rate) methods, and those come with a price!
The idea is to invest some additional operations per iteration in order to:
reduce the number of iterations needed for a given learning rate
(and, much more relevant in practice: achieve good convergence with the default learning rates)
It seems you are just running 100 epochs with each and clocking this.
That's not meaningful!
It is very much possible that the final objectives are very different:
if 100 iterations are not enough,
or the initial learning rate is badly chosen,
or, because there is no early stopping, a possibly better algorithm with proven convergence (according to some criterion) wastes additional time doing all iterations until 100 is reached!
(Adadelta was probably designed for the SGD setting, not GD.)
A like-for-like sketch, where both sides run plain full-batch gradient descent, follows below.
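To make the timing less lopsided, one option is to have the TensorFlow side run plain full-batch gradient descent as well. A minimal sketch against the question's network_cluster class (TF 1.x API; the learning-rate value and the idea of building the error op only once are my assumptions, not something from the original post):

def learn_model_plain_gd(self, epocs = 100, learning_rate = 0.05):
    # Sketch: plain full-batch gradient descent; the learning rate is a guess.
    error_op = self.error()  # build the error op once, outside the training loop
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(error_op)
    error_rcd = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        feed = {self.features: self.data_frame[self.feature_cols],
                self.labels: self.data_frame[self.label_cols]}
        for epoc in range(epocs):
            _, error = sess.run([optimizer, error_op], feed_dict = feed)
            error_rcd.append(error[0])
    return error_rcd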
It is very hard to compare such different algorithms, especially when using just one task / dataset.
Even if you were to introduce early stopping, you would observe random-seed-dependent, nondeterministic performance that is hard to interpret.
You are basically measuring iteration time, but that is not a good measure. Compare first-order methods (gradients -> SGD, GD, ...) with second-order methods (Hessian -> Newton): the latter is very slow per iteration, but usually shows quadratic convergence behaviour, resulting in far fewer iterations needed! In NN applications this comparison is more like LBFGS vs. SGD/... (although I don't know if LBFGS is available in TF; Torch supports it). LBFGS is known to achieve local quadratic convergence, which is again hard to interpret in real-world tasks (especially as the limited-memory approximation of the inverse Hessian is itself a parameter of LBFGS). The same comparison can be made in linear programming, where the simplex method has fast iterations while interior-point methods (basically Newton-based, though handling constrained optimization needs some additional ideas) are much slower per iteration, despite often being faster to reach convergence overall.
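To make the first-order vs. second-order point concrete, the same least-squares objective can be handed to an off-the-shelf quasi-Newton method and the iteration count compared. A rough sketch (it uses scipy, which the question does not; X is assumed to be the design matrix with a constant column, y the target vector):

import numpy as np
from scipy.optimize import minimize

def lbfgs_linreg(X, y):
    # Least-squares objective and its exact gradient.
    def objective(beta):
        r = X.dot(beta) - y
        return 0.5 * np.dot(r, r)
    def gradient(beta):
        return X.T.dot(X.dot(beta) - y)
    beta0 = np.zeros(X.shape[1])
    res = minimize(objective, beta0, jac = gradient, method = 'L-BFGS-B')
    return res.x, res.nit  # coefficients and the number of iterations actually needed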
What I ignored here: nearly all theoretical results regarding convergence and the like are limited to convex and smooth functions. NNs are typically non-convex, which makes evaluating these performance measures even harder. But your problem here is convex, of course.
I also have to admit that my answer only scratches the surface of this complex problem, even though unconstrained smooth convex optimization is one of the easier tasks in numerical optimization (compared to constrained, nonsmooth, nonconvex optimization).
For a general introduction to numerical optimization, which also talks a lot about first-order vs. second-order methods (and the many methods in between), I recommend Numerical Optimization by Nocedal and Wright, which can be found on the web.

Related

Vectorized beam search decoder is not faster on GPU - Tensorflow 2

I'm trying to run an RNN beam search on a tf.keras.Model in a vectorized way to have it work completely on the GPU. However, despite having everything as tf.function, as vectorized as I can make it, it runs at exactly the same speed with or without a GPU. Attached is a minimal example with a fake model. In reality, for n=32, k=32, steps=128, which is what I would want to work with, this takes 20s (per n=32 samples) to decode, both on CPU and on GPU!
I must be missing something. When I train the model, on GPU a training iteration (128 steps) with batch size 512 takes 100ms, and on CPU a training iteration with batch size 32 takes 1 sec. The GPU isn't saturated at batch size 512. I get that I have overhead from doing the steps individually and doing a blocking operation per step, but in terms of computation my overhead is negligible compared to the rest of the model.
I also get that using a tf.keras.Model in this way is probably not ideal, but is there another way to wire output tensors via a function back to the input tensors, and particularly also rewire the states?
Full working example:
https://gist.github.com/meowcat/e3eaa4b8543a7c8444f4a74a9074b9ae
@tf.function
def decode_beam(states_init, scores_init, y_init, steps, k, n):
    states = states_init
    scores = scores_init
    xstep = embed_y_to_x(y_init)

    # Keep the results in TensorArrays
    y_chain = tf.TensorArray(dtype="int32", size=steps)
    sequences_chain = tf.TensorArray(dtype="int32", size=steps)
    scores_chain = tf.TensorArray(dtype="float32", size=steps)

    for i in range(steps):
        # model_decode is the trained model with 3.5 million trainable params.
        # Run a single step of the RNN model.
        y, states = model_decode([xstep, states])
        # Add scores of step n to previous scores
        # (I left out the sequence end killer for this demo)
        scores_y = tf.expand_dims(tf.reshape(scores, y.shape[:-1]), 2) + tm.log(y)
        # Reshape into (n, k, tokens) and find the best k sequences to continue
        # for each of n candidates
        scores_y = tf.reshape(scores_y, [n, -1])
        top_k = tm.top_k(scores_y, k, sorted=False)
        # Transform the indices. I was using tf.unravel_index but
        # `tf.debugging.set_log_device_placement(True)` indicated that this
        # would be placed on the CPU, thus I rewrote it
        top_k_index = tf.reshape(
            top_k[1] + tf.reshape(tf.range(n), (-1, 1)) * scores_y.shape[1], [-1])
        ysequence = top_k_index // y.shape[2]
        ymax = top_k_index % y.shape[2]
        # This gives us two (n*k,) tensors with parent sequence (ysequence)
        # and chosen character (ymax) per sequence.
        # For continuation, pick the states, and "return" the scores
        states = tf.gather(states, ysequence)
        scores = tf.reshape(top_k[0], [-1])
        # Write the results into the TensorArrays, and embed for the next step
        xstep = embed_y_to_x(ymax)
        y_chain = y_chain.write(i, ymax)
        sequences_chain = sequences_chain.write(i, ysequence)
        scores_chain = scores_chain.write(i, scores)

    # Done: stack up the results and return them
    sequences_final = sequences_chain.stack()
    y_final = y_chain.stack()
    scores_final = scores_chain.stack()
    return sequences_final, y_final, scores_final
There was a lot going on here. I will comment on it because it might help others to resolve TensorFlow performance issues.
Profiling
The GPU profiler library (cupti) was not loading correctly on the cluster, stopping me from doing any useful profiling on the GPU. That was fixed, so I get useful profiles of the GPU now.
Note this very useful answer (the only one on the web) that shows how to profile arbitrary TensorFlow 2 code, rather than Keras training:
https://stackoverflow.com/a/56698035/1259675
logdir = "log"
writer = tf.summary.create_file_writer(logdir)
tf.summary.trace_on(graph=True, profiler=True)
# run any #tf.function decorated functions here
sequences, y, scores = decode_beam_steps(
y_init, states_init, scores_init,
steps = steps, k = k, n = n, pad_mask = pad_mask)
with writer.as_default():
tf.summary.trace_export(name="model_trace", step=0, profiler_outdir=logdir)
tf.summary.trace_off()
Note that an old Chromium version is needed to look at the profiling results, since at the time (4-17-20) this fails in current Chrome/Chromium.
Small optimizations
The graph was made a bit lighter but not significantly faster by using unroll=True in the LSTM cells used by the model (not shown here), since only one step is needed so the symbolic loop only adds clutter. This significantly slashes time for the first iteration of the function above, when AutoGraph builds the graph. Note that this time is enormous (see below).
unroll=False (the default) builds in 300 seconds, unroll=True builds in 100 seconds. Note that the performance itself stays the same (15-20 sec/iteration for n=32, k=32).
implementation=1 made it slightly slower, so I stayed with the default of implementation=2.
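For reference, both options are plain constructor arguments of tf.keras.layers.LSTM; a minimal sketch (the unit count is a made-up placeholder, not the model from the gist):

import tensorflow as tf

lstm = tf.keras.layers.LSTM(
    256,                  # made-up unit count
    return_state = True,  # the decoder step hands the states back, as in the loop above
    unroll = True,        # no symbolic loop in the graph; cheaper tracing in this case
    implementation = 2)   # the default; implementation=1 was slightly slower here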
Using tf.while_loop instead of relying on AutoGraph
The `for i in range(steps)` loop was the main suspect. I had it both in the inlined version shown above and in a modularized one:
for i in range(steps):
    ystep, states = model_decode([xstep, states])
    ymax, ysequence, states, scores = model_beam_step(
        ystep, states, scores, k, n, pad_mask)
    xstep = model_rtox(ymax)
    y_chain = y_chain.write(i, ymax)
    sequences_chain = sequences_chain.write(i, ysequence)
    scores_chain = scores_chain.write(i, scores)
where model_beam_step does all the beam search math. Unsurprisingly, both performed equally badly, and in particular, both took ~100/300 seconds on the first run, when AutoGraph traced the graph. Further, tracing the graph with the profiler gives a crazy 30-50 MB file that won't easily load in TensorBoard and more or less crashes it. The profile had dozens of parallel GPU streams with a single operation each.
Substituting this with a tf.while_loop slashed the setup time to zero (back_prop=False makes only very little difference), and produces a nice 500 KB graph that can easily be looked at in TensorBoard and profiled in a useful way with 4 GPU streams.
beam_steps_cond = lambda i, y_, seq_, sc_, xstep, states, scores: i < steps

def decode_beam_steps_body(i, y_, seq_, sc_, xstep, states, scores):
    y, states = model_decode([xstep, states])
    ymax, ysequence, states, scores = model_beam_step(
        y, states, scores, k, n, pad_mask)
    xstep = model_rtox(ymax)
    y_ = y_.write(i, ymax)
    seq_ = seq_.write(i, ysequence)
    sc_ = sc_.write(i, scores)
    i = i + 1
    return i, y_, seq_, sc_, xstep, states, scores

_, y_chain, sequences_chain, scores_chain, _, _, _ = \
    tf.while_loop(
        cond = beam_steps_cond,
        body = decode_beam_steps_body,
        loop_vars = [i, y_chain, sequences_chain, scores_chain,
                     xstep, states, scores],
        back_prop = False
    )
Finally, the real problem
Finally being able to look at the profile in a meaningful way showed me that the real issue was an output postprocessing function that runs on the CPU. I didn't suspect it because it had run fast earlier, but I had overlooked that a beam search modification I made leads to far more than k sequences per candidate, which massively slows the processing down. Thus, it was eating every benefit I could gain from doing the decoding step efficiently on the GPU. Without this postprocessing, the GPU runs at more than 2 iterations per second. Refactoring the postprocessing (which is extremely fast if done right) into TensorFlow resolved the issue.
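The actual postprocessing code is not shown in this answer, but the general shape of such a refactor is usually the same: keep the per-step results as tensors and prune them with TF ops on the device instead of pulling them back into Python/NumPy between steps. A purely hypothetical sketch of that kind of move (the function name, the `keep` cap, and the pruning criterion are all made up):

def prune_candidates(scores, ysequence, ymax, keep):
    # Hypothetical: keep only the `keep` best candidates, entirely with TF ops,
    # so nothing leaves the GPU between decoding steps.
    top = tf.math.top_k(scores, k = keep, sorted = False)
    idx = top.indices
    return top.values, tf.gather(ysequence, idx), tf.gather(ymax, idx)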

Pytorch: How to create an update rule that doesn't come from derivatives?

I want to implement the following algorithm, taken from this book, section 13.6:
I don't understand how to implement the update rule in pytorch (the rule for w is quite similar to that of theta).
As far as I know, torch requires a loss for loss.backward().
This form does not seem to apply to the quoted algorithm.
I'm still certain there is a correct way of implementing such update rules in pytorch.
Would greatly appreciate a code snippet of how the w weights should be updated, given that V(s,w) is the output of the neural net, parameterized by w.
EDIT: Chris Holland suggested a way to implement this, and I implemented it. It does not converge on Cartpole, and I wonder if I did something wrong.
The critic does converge on the solution of gamma*f = f - 1, i.e. f = 1/(1 - gamma) = 1 + gamma + gamma^2 + ... + gamma^inf,
meaning gamma=1 diverges, gamma=0.99 converges on 100, gamma=0.5 converges on 2, and so on. Regardless of the actor or policy.
The code:
def _update_grads_with_eligibility(self, is_critic, delta, discount, ep_t):
    gamma = self.args.gamma
    if is_critic:
        params = list(self.critic_nn.parameters())
        lamb = self.critic_lambda
        eligibilities = self.critic_eligibilities
    else:
        params = list(self.actor_nn.parameters())
        lamb = self.actor_lambda
        eligibilities = self.actor_eligibilities

    is_episode_just_started = (ep_t == 0)
    if is_episode_just_started:
        eligibilities.clear()
        for i, p in enumerate(params):
            if not p.requires_grad:
                continue
            eligibilities.append(torch.zeros_like(p.grad, requires_grad=False))

    # eligibility traces
    for i, p in enumerate(params):
        if not p.requires_grad:
            continue
        eligibilities[i][:] = (gamma * lamb * eligibilities[i]) + (discount * p.grad)
        p.grad[:] = delta.squeeze() * eligibilities[i]
and
expected_reward_from_t = self.critic_nn(s_t)
probs_t = self.actor_nn(s_t)

expected_reward_from_t1 = torch.tensor([[0]], dtype=torch.float)
if s_t1 is not None:  # s_t is not a terminal state, s_t1 exists.
    expected_reward_from_t1 = self.critic_nn(s_t1)

delta = r_t + gamma * expected_reward_from_t1.data - expected_reward_from_t.data

negative_expected_reward_from_t = -expected_reward_from_t

self.critic_optimizer.zero_grad()
negative_expected_reward_from_t.backward()
self._update_grads_with_eligibility(is_critic=True,
                                    delta=delta,
                                    discount=discount,
                                    ep_t=ep_t)
self.critic_optimizer.step()
EDIT 2:
Chris Holland's solution works. The problem originated from a bug in my code that caused the line
if s_t1 is not None:
    expected_reward_from_t1 = self.critic_nn(s_t1)
to always get called, thus expected_reward_from_t1 was never zero, and thus no stopping condition was specified for the bellman equation recursion.
With no reward engineering, gamma=1, lambda=0.6, and a single hidden layer of size 128 for both actor and critic, this converged on a rather stable optimal policy within 500 episodes.
Even faster with gamma=0.99 (the best discounted episode reward is about 86.6).
BIG thank you to @Chris Holland, who "gave this a try".
I am gonna give this a try.
.backward() does not need a loss function; it just needs a differentiable scalar output. It approximates a gradient with respect to the model parameters. Let's just look at the first case, the update for the value function.
We have one gradient appearing for v, and we can approximate this gradient by
v = model(s)
v.backward()
This gives us a gradient of v which has the dimensions of your model parameters. Assuming we have already calculated the other parameter updates, we can calculate the actual optimizer update:
for i, p in enumerate(model.parameters()):
    z_theta[i][:] = gamma * lamda * z_theta[i] + l * p.grad
    p.grad[:] = alpha * delta * z_theta[i]
We can then use opt.step() to update the model parameters with the adjusted gradient.
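Putting the pieces together, here is a minimal usage sketch for the critic update. This is my own assembly of the snippets above, not Chris Holland's or the asker's exact code: `model` is the value network V(s, w), `s` a state tensor, and gamma, lamda, alpha, delta the scalars from the algorithm; any extra factor such as the `l`/`discount` term in the snippets above would simply multiply `p.grad` in the trace update.

import torch

opt = torch.optim.SGD(model.parameters(), lr = 1.0)       # lr = 1: the step size lives in alpha
z_w = [torch.zeros_like(p) for p in model.parameters()]   # eligibility traces for the critic

v = model(s)            # V(s, w); assumed to be a single-element tensor
opt.zero_grad()
(-v).backward()         # opt.step() subtracts p.grad, so backprop -v to effectively ascend

for i, p in enumerate(model.parameters()):
    z_w[i][:] = gamma * lamda * z_w[i] + p.grad            # accumulate the trace
    p.grad[:] = alpha * delta * z_w[i]                     # replace the gradient with the trace update

opt.step()              # net effect: w <- w + alpha * delta * z (z being the usual positive-gradient trace)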

Bayesian fit of cosine wave taking longer than expected

In a recent homework, I was asked to perform a Bayesian fit over a set of data a and b using a Metropolis algorithm. The relationship between a and b is given by:
e(t) = e_0*cos(w*t)
w = 2 * pi
The Metropolis algorithm is (it works fine with other fits):
def metropolis(logP, args, v0, Nsteps, stepSize):
    vCur = v0
    logPcur = logP(vCur, *args)
    v = []
    Nattempts = 0
    for i in range(Nsteps):
        while(True):
            # Propose step:
            vNext = vCur + stepSize*np.random.randn(*vCur.shape)
            logPnext = logP(vNext, *args)
            Nattempts += 1
            # Accept/reject step:
            Pratio = (1. if logPnext > logPcur else np.exp(logPnext - logPcur))
            if np.random.rand() < Pratio:
                vCur = vNext
                logPcur = logPnext
                v.append(vCur)
                break
    acceptRatio = Nsteps*(1./Nattempts)
    return np.array(v), acceptRatio
I have tried a Bayesian fit of the cosine wave using the Metropolis algorithm above:
e_0 = -0.00155

def strain_t(e_0, t):
    return e_0*np.cos(2*np.pi*t)

data = pd.read_csv('stressStrain.csv')
t = np.array(data['t'])
e = strain_t(e_0, t)

def logfitstrain_t(params, t, e):
    e_0 = params[0]
    sigmaR = params[1]
    strainModel = strain_t(e_0, t)
    return np.sum(-0.5*((e-strainModel)/sigmaR)**2 - np.log(sigmaR))

params0 = np.array([-0.00155, np.std(t)])
params, accRatio = metropolis(logfitstrain_t, (t, e), params0, 1000, 0.042)
print('Acceptance ratio:', accRatio)

e0 = np.mean(params[0])
print('e0=', e0)

e_t = e0*np.cos(2*np.pi*t)
sns.jointplot(t, e_t, kind='hex', color='purple')
The .csv contains the time values t used above.
There isn't any error message after I hit run, but it takes forever for Python to give me an output. What did I do wrong here?
Why it might "take forever"
Your algorithm is designed to run until it accepts a given number of proposals (1000 in the example). Thus, if it's running for a long time, you're likely rejecting a bunch of proposals. This can happen when the step size is too large, leading new proposals to end up in distant, low probability regions of the likelihood space. Try reducing your step size. This may require you to also increase the number of samples to ensure the posterior space becomes adequately explored.
A more serious concern
Because you only append accepted proposals to the chain v, you haven't actually implemented the Metropolis algorithm, and instead obtain a biased set of samples that will tend to overrepresent less likely regions of the posterior space. A true Metropolis implementation re-appends the previous proposal whenever the new proposal is rejected. You can still enforce a minimum number of accepted proposals, but you really must append something every time.
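A minimal sketch of what that looks like with the question's function signature (my rewrite, not a reference solution): the chain gets exactly one entry per step, whether or not the proposal is accepted.

def metropolis(logP, args, v0, Nsteps, stepSize):
    # Standard Metropolis chain: append the current state every step,
    # re-appending the previous state whenever a proposal is rejected.
    vCur = v0
    logPcur = logP(vCur, *args)
    v = []
    Naccepted = 0
    for i in range(Nsteps):
        vNext = vCur + stepSize * np.random.randn(*vCur.shape)
        logPnext = logP(vNext, *args)
        if np.log(np.random.rand()) < logPnext - logPcur:   # accept
            vCur, logPcur = vNext, logPnext
            Naccepted += 1
        v.append(vCur)   # append something every time
    return np.array(v), Naccepted / float(Nsteps)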

is k-means ++ suitable for large data?

I used this k-means++ Python code for initializing k centers, but it takes very long for large data, for example 400,000 points in 2 dimensions:
class KPlusPlus(KMeans):
    def _dist_from_centers(self):
        cent = self.mu
        X = self.X
        D2 = np.array([min([np.linalg.norm(x-c)**2 for c in cent]) for x in X])
        self.D2 = D2

    def _choose_next_center(self):
        self.probs = self.D2/self.D2.sum()
        self.cumprobs = self.probs.cumsum()
        r = random.random()
        ind = np.where(self.cumprobs >= r)[0][0]
        return(self.X[ind])

    def init_centers(self):
        self.mu = random.sample(self.X, 1)
        while len(self.mu) < self.K:
            self._dist_from_centers()
            self.mu.append(self._choose_next_center())

    def plot_init_centers(self):
        X = self.X
        fig = plt.figure(figsize=(5, 5))
        plt.xlim(-1, 1)
        plt.ylim(-1, 1)
        plt.plot(zip(*X)[0], zip(*X)[1], '.', alpha=0.5)
        plt.plot(zip(*self.mu)[0], zip(*self.mu)[1], 'ro')
        plt.savefig('kpp_init_N%s_K%s.png' % (str(self.N), str(self.K)),
                    bbox_inches='tight', dpi=200)
Is there a way to speed up k-means++?
Initial seeding has a large impact on k-means execution time. In this post you can find some strategies to speed it up.
Perhaps you could consider using Siddhesh Khandelwal's k-means variant, which was published in the Proceedings of the European Conference on Information Retrieval (ECIR 2017).
Siddhesh provided a Python implementation on GitHub, accompanied by some other previous heuristic algorithms.
K-means++ initialization takes O(n*k) to run. This is reasonably fast for small k and large n, but if you choose k too large, it will take some time. It is about as expensive as one iteration of the (slow) Lloyd variant, so it will usually pay off to use k-means++.
Your implementation is worse, at least O(n*k²), because it performs unnecessary recomputations. And it probably always chooses the same point as the next center.
Note that you also only have the initialization, not the actual k-means yet.
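To illustrate the O(n*k) version: keep one running minimum squared distance per point and only compute distances to the most recently added center. A standalone sketch in plain NumPy (not a drop-in replacement for the KPlusPlus class above):

import numpy as np

def kmeanspp_init(X, k, rng=None):
    # X: (n, d) array of points; returns a (k, d) array of initial centers.
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                    # first center: uniform at random
    d2 = np.sum((X - centers[0]) ** 2, axis=1)        # squared distance to nearest center
    for _ in range(k - 1):
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)                  # sample proportionally to D^2
        centers.append(X[idx])
        # update the running minimum with distances to the *new* center only: O(n) per center
        d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))
    return np.array(centers)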
I haven't run any experiment yet, but Scalable K-Means++ seems rather good for very large data sets (perhaps for those even larger than what you describe).
You can find the paper here and another post explaining it here.
Unfortunately, I haven't seen any code around I'd trust...

How to use propertly theano.scan in the SGD updates of a simple Probabilistic Matrix Factorization algorithm?

I am trying to implement Probabilistic Matrix Factorization with Stochastic Gradient Descent updates, in theano, without using a for loop.
I have just started learning the basics of theano; unfortunately, in my experiment I get this error:
UnusedInputError: theano.function was asked to create a function
computing outputs given certain inputs, but the provided input
variable at index 0 is not part of the computational graph needed
to compute the outputs: trainM.
The source code is the following:
def create_training_set_matrix(training_set):
    return np.array([
        [_i, _j, _Rij]
        for (_i, _j), _Rij
        in training_set
    ])

def main():
    R = movielens.small()

    U_values = np.random.random((config.K, R.shape[0]))
    V_values = np.random.random((config.K, R.shape[1]))
    U = theano.shared(U_values)
    V = theano.shared(V_values)

    lr = T.dscalar('lr')
    trainM = T.dmatrix('trainM')

    def step(curr):
        i = T.cast(curr[0], 'int32')
        j = T.cast(curr[1], 'int32')
        Rij = curr[2]
        eij = T.dot(U[:, i].T, V[:, j])
        T.inc_subtensor(U[:, i], lr * eij * V[:, j])
        T.inc_subtensor(V[:, j], lr * eij * U[:, i])
        return {}

    values, updates = theano.scan(step, sequences=[trainM])
    scan_fn = function([trainM, lr], values)

    print "training pmf..."
    for training_set in cftools.epochsloop(R, U_values, V_values):
        training_set_matrix = create_training_set_matrix(training_set)
        scan_fn(training_set_matrix, config.lr)
I realize that it's a rather unconventional way to use theano.scan: do you have a suggestion on how I could implement my algorithm better?
The main difficulty lies in the updates: a single update depends on possibly all the previous updates. For this reason I defined the latent matrices U and V as shared variables (I hope I did that correctly).
The version of theano I am using is: 0.8.0.dev0.dev-8d6800181bedb03a4bced4f456338e5194524317
Any hint and suggestion is highly appreciated. I am available to provide further details.
