I've recently switched to PyTorch after getting frustrated debugging TensorFlow, and I understand that it is almost completely equivalent to coding in NumPy. My question is: which Python constructs are we permitted to use in a PyTorch model (to be put completely on the GPU)? For example, an if-else has to be implemented as follows in TensorFlow:
a = tf.Variable([1,2,3,4,5], dtype=tf.float32)
b = tf.Variable([6,7,8,9,10], dtype=tf.float32)
p = tf.placeholder(dtype=tf.float32)
ps = tf.placeholder(dtype=tf.bool)

li = [None]*5
li_switch = [True, False, False, True, True]
for i in range(5):
    li[i] = tf.Variable(tf.random.normal([5]))

sess = tf.Session()
sess.run(tf.global_variables_initializer())

def func_0():
    return tf.add(a, p)

def func_1():
    return tf.subtract(b, p)

with tf.device('GPU:0'):
    my_op = tf.cond(ps, func_1, func_0)

for i in range(5):
    # feed the current value of the variable, not the variable object itself
    print(sess.run(my_op, feed_dict={p: sess.run(li[i]), ps: li_switch[i]}))
How would the structure of the above code change in PyTorch? How do I place the variables and ops above on the GPU, and how do I parallelize the list inputs to the graph in PyTorch?
In PyTorch, the code can be written just like normal Python code.
CPU

import torch

a = torch.FloatTensor([1,2,3,4,5])
b = torch.FloatTensor([6,7,8,9,10])
cond = torch.randn(5)

for ci in cond:
    if ci > 0:
        print(torch.add(a, 1))
    else:
        print(torch.sub(b, 1))
GPU
Move the tensors to the GPU like this:
a = torch.FloatTensor([1,2,3,4,5]).to('cuda')
b = torch.FloatTensor([6,7,8,9,10]).to('cuda')
cond = torch.randn(5).to('cuda')
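If the code also needs to run on machines without a GPU, a small device-agnostic sketch (my own addition, assuming a CPU fallback is acceptable) is to pick the device once and send everything to it:

device = 'cuda' if torch.cuda.is_available() else 'cpu'  # falls back to CPU when CUDA is unavailable
a = torch.FloatTensor([1,2,3,4,5]).to(device)
b = torch.FloatTensor([6,7,8,9,10]).to(device)
cond = torch.randn(5).to(device)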
import torch.nn as nn

class Cond(nn.Module):
    def __init__(self):
        super(Cond, self).__init__()

    def forward(self, cond, a, b):
        result = torch.empty(cond.shape[0], a.shape[0]).cuda()
        for i, ci in enumerate(cond):
            if ci > 0:
                result[i] = torch.add(a, 1)
            else:
                result[i] = torch.sub(b, 1)
        return result

cond_model = Cond().to('cuda')
output = cond_model(cond, a, b)
https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#cuda-tensors
To initialize your a and b tensors in PyTorch, you do the following:
a = torch.tensor([1,2,3,4,5], dtype=torch.float32)
b = torch.tensor([6,7,8,9,10], dtype=torch.float32)
But, since you need them to be completely on the GPU, you have to use the magic .cuda() function. So, it would be:
a = torch.tensor([1,2,3,4,5], dtype=torch.float32).cuda()
b = torch.tensor([6,7,8,9,10], dtype=torch.float32).cuda()
This moves the tensors to the GPU.
Another way of initializing is:
a = torch.FloatTensor([1,2,3,4,5]).cuda()
b = torch.FloatTensor([6,7,8,9,10]).cuda()
If we need samples from a normal distribution, we use torch.randn (there is also torch.rand, which samples from a uniform distribution).
li = torch.randn(5, 5)
(Catch the bug: it has to be initialized on CUDA, since you cannot do operations between tensors that live on separate processing units, i.e., CPU and GPU.)
li = torch.randn(5, 5).cuda()
There is no difference for the li_switch initialization.
One possible way of handling your func_0 and func_1 is to declare them as
def func_0(li_value):
    return torch.add(a, li_value)

def func_1(li_value):
    return torch.sub(b, li_value)
Then, for the predicate function call, it could be as simple as doing this:
for i, pred in enumerate(li_switch):
    if pred:
        func_0(li[i])
    else:
        func_1(li[i])
However, I suggest vectorizing your operations and doing something like:
li_switch = torch.tensor([True, False, False, True, True])
torch.add(a, li[li_switch]).sum(dim=0)
torch.sub(b, li[~li_switch]).sum(dim=0)
This is much more efficient.
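If you need the per-row results rather than the two pooled sums, a fully vectorized sketch (assuming the same a, b, li and li_switch as above, all on the GPU) can pick between the two branches with torch.where:

li_switch = torch.tensor([True, False, False, True, True]).cuda()  # keep the mask on the same device as the data
added = torch.add(a, li)    # row i holds a + li[i]
subbed = torch.sub(b, li)   # row i holds b - li[i]
result = torch.where(li_switch[:, None], added, subbed)  # row i follows branch li_switch[i], as in the loop above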
Related
I'm making my own implementation of the Graphormer architecture. Since this architecture needs to add an edge-based bias to the output of the key-query multiplication in the self-attention mechanism, I am adding that bias by hand and doing the matrix multiplication of the attention weights with the data outside the attention mechanism:
import torch as th
from torch import nn

# Variable initialization
B, T, C, H = 2, 3, 4, 2
self_attn = nn.MultiheadAttention(C, H, batch_first=True)

# Tensors
x = th.randn(B, T, C)
attn_bias = th.ones((B, T, T))

# Self-attention mechanism
_, attn_wei = self_attn(query=x, key=x, value=x)

# Adding attention bias
if attn_bias is not None:
    attn_wei = attn_wei + attn_bias

x = attn_wei @ x  # TODO: use value(x) instead of x
print(x)
This works, but to use the full potential of self-attention, the last matrix multiplication should be x = attn_wei @ value(x), and I am not able to get the value projection from the self_attn object, even though it should have something like that inside it.
How could I do this?
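One possible direction (a sketch of my own, relying on nn.MultiheadAttention's internal packed projection parameters, which are an implementation detail rather than public API): when embed_dim == kdim == vdim, the module stores in_proj_weight of shape (3*C, C) with the query, key and value projections stacked in that order, so the value projection is the last third:

import torch.nn.functional as F

# assumption: the q/k/v projections are packed in that order inside in_proj_weight
w_v = self_attn.in_proj_weight[2 * C:, :]   # (C, C) value projection weight
b_v = self_attn.in_proj_bias[2 * C:]        # (C,)  value projection bias
v = F.linear(x, w_v, b_v)                   # value(x); use the original x, before it is overwritten

x = attn_wei @ v
x = self_attn.out_proj(x)                   # the module would normally also apply its output projection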
I'm creating some GPflow models in which I need the observations before and after a threshold x0 to be independent a priori. I could achieve this with two separate GP models, or with a ChangePoints kernel with infinite steepness, but neither solution works well with my future extensions in mind (MOGP in particular).
I figured I could easily construct what I want from scratch, so I made a new Combination kernel object, which uses the appropriate child kernel before or after x0. This works as intended when I evaluate the kernel on a set of input points; the expected correlations between points before and after the threshold are zero, and the rest is determined by the child kernels:
import numpy as np
import gpflow
from gpflow.kernels import Matern32
import matplotlib.pyplot as plt
import tensorflow as tf
from gpflow.kernels import Combination

class IndependentKernel(Combination):
    def __init__(self, kernels, x0, forcing_variable=0, name=None):
        self.x0 = x0
        self.forcing_variable = forcing_variable
        super().__init__(kernels, name=name)

    def K(self, X, X2=None):
        # threshold X, X2 based on self.x0, and construct a joint tensor
        if X2 is None:
            X2 = X
        fv = self.forcing_variable
        mask = tf.dtypes.cast(X[:, fv] >= self.x0, tf.int32)
        X_partitioned = tf.dynamic_partition(X, mask, 2)
        X2_partitioned = tf.dynamic_partition(X2, mask, 2)
        K_pre = self.kernels[0].K(X_partitioned[0], X2_partitioned[0])
        K_post = self.kernels[1].K(X_partitioned[1], X2_partitioned[1])
        zero_block_1 = tf.zeros([K_pre.shape[0], K_post.shape[1]], tf.float64)
        zero_block_2 = tf.zeros([K_post.shape[0], K_pre.shape[1]], tf.float64)
        upper_row = tf.concat([K_pre, zero_block_1], axis=1)
        lower_row = tf.concat([zero_block_2, K_post], axis=1)
        return tf.concat([upper_row, lower_row], axis=0)
    #

    def K_diag(self, X):
        fv = self.forcing_variable
        mask = tf.dtypes.cast(X[:, fv] >= self.x0, tf.int32)
        X_partitioned = tf.dynamic_partition(X, mask, 2)
        return tf.concat([self.kernels[0].K_diag(X_partitioned[0]),
                          self.kernels[1].K_diag(X_partitioned[1])],
                         axis=1)
    #
#

def f(x):
    return np.sin(6*(x-0.7))

x0 = 0.3
n = 100
x = np.linspace(0, 1, n)
sigma = 0.5
y = np.random.normal(loc=f(x), scale=sigma)
fv = 0
X = x[:, None]

kernel = IndependentKernel([Matern32(), Matern32()], x0=x0, name='indep')
x_pred = np.linspace(0, 1, 100)
K = kernel.K(x_pred[:, None])  # <- kernel is evaluated correctly here
However, when I want to train a GPflow model with this kernel, I receive the error message TypeError: Expected int32, got None of type 'NoneType' instead. This appears to result from the sub-kernel matrices K_pre and K_post being of size (None, 1), instead of the expected square matrices (which they correctly are if I evaluate the kernel 'manually').
m = gpflow.models.GPR(data=(X, y[:, None]), kernel=kernel)

gpflow.optimizers.Scipy().minimize(m.training_loss,
                                   m.trainable_variables,
                                   options=dict(maxiter=10000),
                                   method="L-BFGS-B")  # <- K_pre & K_post are of size (None, 1) now?
What can I do to make the kernel properly trainable?
I am using GPflow 2.1.3 and TensorFlow 2.4.1.
This is not a GPflow issue but a subtlety of TensorFlow's eager vs. graph mode. In eager mode (the default behaviour when you interact with tensors "manually", as when calling the kernel yourself), K_pre.shape works just as expected. In graph mode (which is what happens when you wrap code in tf.function()), this does not always work (e.g. the shape might depend on tf.Variables with None shapes), and you have to use tf.shape(K_pre) instead to obtain the dynamic shape, which depends on the actual values inside the variables. GPflow's Scipy class by default wraps the loss-and-gradient computation inside tf.function() to speed up optimization. If you explicitly turn this off by passing compile=False to the minimize() call, your code example runs fine. If you instead replace the .shape attributes with tf.shape() calls to fix it properly, it likewise runs fine.
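For illustration, here is a sketch of the zero-block construction inside K with that fix applied (only the .shape lookups change; the rest of the method stays as in the question):

# inside IndependentKernel.K, after computing K_pre and K_post:
# tf.shape() returns the dynamic shape, so this also works under tf.function() (graph mode)
zero_block_1 = tf.zeros([tf.shape(K_pre)[0], tf.shape(K_post)[1]], tf.float64)
zero_block_2 = tf.zeros([tf.shape(K_post)[0], tf.shape(K_pre)[1]], tf.float64)
upper_row = tf.concat([K_pre, zero_block_1], axis=1)
lower_row = tf.concat([zero_block_2, K_post], axis=1)
return tf.concat([upper_row, lower_row], axis=0)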
The included minimum working example tries to vectorize a function which adds 10.0 to the input and does so through a while loop which adds 1.0 each time (for 10 times). The function runs perfectly when using tf.map_fn and fails when using tf.vectorized_map.
The function will not run when using tf.vectorized_map, and the error points towards "Either add a converter or set --op_conversion_fallback_to_while_loop=True, which may run slower."
What am I doing wrong? Note: This is a duplicate of the question I asked at https://github.com/tensorflow/tensorflow/issues/46559.
if __name__ == "__main__":
import tensorflow as tf
#tf.function()
def add(a):
i = tf.constant(0, dtype = tf.int32)
c = tf.constant(1., dtype = tf.float32)
loop_index = lambda i, c, a: i < 10
def body(i, c, a):
a = c + a
i = i + 1
return i,c, a
i,c, a = tf.while_loop(loop_index, body, [i,c, a],\
shape_invariants=[tf.TensorShape(()), tf.TensorShape(()),tf.TensorShape([1])], back_prop= False, parallel_iterations=1)
return a
counter = tf.reshape(tf.range(0, 40, delta = 1, dtype = tf.float32), shape = [40,1])
all_ = tf.vectorized_map(add, counter) # does not work
# all_ = tf.map_fn(add, counter) # works as expected
print(all_, '<-- should be [40,1] float32 tensor with elements [10., 11., ...49.]')
Unfortunately this issue is only reproducible with TF 2.2.0 + CUDA 10.1 in a conda environment, and there is nothing wrong with the code. The fix is to run the code with a different version of TensorFlow. The issue assignee, ravikyram, posted a Colab notebook on the GitHub issue demonstrating the same.
I have TensorFlow code that runs well and accurately, but occupies a lot of memory. Specifically, in my code I have a for-loop that looks something like this:
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5]) # shape = [None, 5, 5]
myarray2 = tf.Variable( np.zeros([K,5,5]), dtype=tf.float32 )
vals = []
for k in range(0,K):
    tmp = tf.reduce_sum(myarray1*myarray2[k], axis=(1,2))
    vals.append(tmp)
result = tf.reduce_min( tf.stack(vals, axis=-1), axis=-1 )
Unfortunately, that takes a lot of memory as K gets to be big in my application. So, I want to have a better way of doing it. For example, in numpy/python, you would just keep track of the minimum value as you iterate through the loops, and update it on each iteration. It seems like I could use tf.assign, as:
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5]) # shape = [None, 5, 5]
myarray2 = tf.Variable( np.zeros([K,5,5]), dtype=tf.float32 )
min_value = tf.Variable(myarray1, validate_shape=False, trainable=False)
for k in range(0,K):
    tmp = myarray1*myarray2[k]
    idx = tf.where(tmp<min_value)
    tf.scatter_nd_assign(min_value, idx, tmp[idx], use_locking=True)
result = min_value
While this code builds the graph (when validate_shape=False), it fails to run because it complains that min_value has not been initialized. The issue is, when I run the initializer as:
sess.run(tf.global_variables_initializer())
or
sess.run(tf.variables_initializer(tf.trainable_variables()))
it complains that I am not feeding in a placeholder. This actually makes sense because the definition of min_value depends on myarray1 in the graph.
What I would actually want to do is define a dummy variable that doesn't depend on myarray1's values, but does match its shape. I would like these values to be initialized as some number (in this case something large is fine), as I will manually ensure these are overwritten in the network.
Note: as far as I know, you currently cannot define a variable with an unknown shape (unless you feed in another variable of the desired shape and set validate_shape=False). Maybe there is another way?
Any help / suggestions appreciated.
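For reference, a minimal sketch of the running-minimum idea described in the question (my own assumption, TF 1.x style): keep a single element-wise minimum with tf.minimum instead of stacking all K results, which needs no extra variable and no assignment at all:

K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5])
myarray2 = tf.Variable(np.zeros([K,5,5]), dtype=tf.float32)

running_min = tf.reduce_sum(myarray1*myarray2[0], axis=(1,2))
for k in range(1, K):
    candidate = tf.reduce_sum(myarray1*myarray2[k], axis=(1,2))
    running_min = tf.minimum(running_min, candidate)  # element-wise running minimum
result = running_min  # same value as reducing over the stacked vals, without keeping the stack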
Try this; if you don't know how to feed a placeholder, read the tutorial.
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5]) # shape = [None, 5, 5]

################### ADD THIS ####################
sess = tf.Session()
FOO = sess.run(myarray1, feed_dict={myarray1: YOURDATA})  # get myarray1's value
# replace all myarray1 below with FOO
##################################################

myarray2 = tf.Variable( np.zeros([K,5,5]), dtype=tf.float32 )
min_value = tf.Variable(FOO, validate_shape=False, trainable=False)
for k in range(0,K):
    tmp = FOO*myarray2[k]
    idx = tf.where(tmp<min_value)
    tf.scatter_nd_assign(min_value, idx, tmp[idx], use_locking=True)
result = min_value
-------above new 15.April.2018------
Since I don't know your input data, let me walk through some steps.
Step_1: make a placeholder for the input data
x = tf.placeholder(tf.float32, shape=[None,2])
Step_2: Get batches of data
batch_x = [[1,2],[3,4]]  # example
# since x has shape [None,2] and batch_x has 2 rows of size 2, the batch size here is 2
Step_3: make a session
sess=tf.Session()
If you have variables, add the following code to initialize them before the calculation:
init = tf.global_variables_initializer()
sess.run(init)
Step_4:
yourplaceholderdictionary = {x: batch_x}
sess.run(x, feed_dict=yourplaceholderdictionary)
Always feed your placeholders so they get the values needed for the calculation.
There is a very helpful PDF, "TensorFlow and Deep Learning without a PhD"; you can also find it on YouTube under the same title.
I am trying to use TensorFlow to calculate the minimum Euclidean distance between each column of a matrix and all other columns (excluding the column itself):
with graph.as_default():
    ...
    def get_diversity(matrix):
        num_rows = matrix.get_shape()[0].value
        num_cols = matrix.get_shape()[1].value
        identity = tf.ones([1, num_cols], dtype=tf.float32)
        diversity = 0
        for i in range(num_cols):
            col = tf.reshape(matrix[:, i], [num_rows, 1])
            col_extended_to_matrix = tf.matmul(col, identity)
            difference_matrix = (col_extended_to_matrix - matrix) ** 2
            sum_vector = tf.reduce_sum(difference_matrix, 0)
            mask = tf.greater(sum_vector, 0)
            non_zero_vector = tf.select(mask, sum_vector, tf.ones([num_cols], dtype=tf.float32) * 9e99)
            min_diversity = tf.reduce_min(non_zero_vector)
            diversity += min_diversity
        return diversity / num_cols
    ...
    diversity = get_diversity(matrix1)
    ...
When I call get_diversity() once per 1000 iterations (on the scale of 300k) it works just fine. But when I try to call it at every iteration the interpreter returns:
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 2.99MiB. See logs for memory state.
I was thinking that was because TF creates a new set of variables each time get_diversity() is called. I tried this:
def get_diversity(matrix, scope):
    scope.reuse_variables()
    ...

with tf.variable_scope("diversity") as scope:
    diversity = get_diversity(matrix1, scope)
But it did not fix the problem.
How can I fix this allocation issue and use get_diversity() with a large number of iterations?
Assuming you call get_diversity() multiple times in your training loop, Aaron's comment is a good one: instead of rebuilding the graph each time, you can do something like the following:
diversity_input = tf.placeholder(tf.float32, [None, None], name="diversity_input")
diversity = get_diversity(diversity_input)

# ...

with tf.Session() as sess:
    for _ in range(NUM_ITERATIONS):
        # ...
        diversity_val = sess.run(diversity, feed_dict={diversity_input: ...})
This will avoid creating new operations each time round the loop, which should prevent the memory leak. This answer has more details.
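As a further option (my own sketch, not taken from the answer above), the whole diversity computation can be written as one pairwise-distance matrix, so the graph contains a fixed handful of ops regardless of num_cols. Note that, unlike the tf.select trick in the question, this masks only the diagonal rather than every zero distance:

def get_diversity_vectorized(matrix):
    # matrix has shape [num_rows, num_cols]; distances are taken between columns
    col_sq = tf.reduce_sum(tf.square(matrix), 0)        # squared norm of each column
    gram = tf.matmul(matrix, matrix, transpose_a=True)  # pairwise column dot products
    # ||c_i - c_j||^2 = ||c_i||^2 - 2 c_i.c_j + ||c_j||^2
    sq_dist = tf.expand_dims(col_sq, 1) - 2.0 * gram + tf.expand_dims(col_sq, 0)
    # push the diagonal (each column against itself) out of the minimum
    sq_dist += tf.diag(tf.fill([tf.shape(matrix)[1]], 1e9))
    return tf.reduce_mean(tf.reduce_min(sq_dist, 1))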