I have a tensor A of shape [186, 3].
If I create a new empty tensor as follows:
tempTens = torch.tensor(np.zeros((186,3)), requires_grad = True).cuda()
I then apply some operations to a block of A and write the result into tempTens, which I use exclusively for further computation, say like this:
tempTens[20,:] = SomeMatrix * A[20,:]
Will the gradients actually propagate correctly? Let's say I have a cost function that optimizes the output of tempTens against some ground truth.
In this case, tempTens[20,:] = SomeMatrix * A[20,:] is an in-place operation with respect to tempTens, which is generally not guaranteed to work with autograd. However, if you create a new variable by applying an operation like concatenation
output = torch.cat([SomeMatrix * A[20, :], torch.zeros(166, 3, device='cuda')], dim=0)
you will get the same result mathematically (a matrix whose first 20 rows come from SomeMatrix * A[20, :] and whose remaining 166 rows are 0), but it will work properly with autograd. This is, generally speaking, the right way to approach this kind of problem.
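For completeness, here is a rough end-to-end sketch of that out-of-place idea (my own illustration, assuming SomeMatrix is a length-3 tensor so that the elementwise product SomeMatrix * A[20, :] is again a length-3 row, and that a CUDA device is available):

import torch

A = torch.randn(186, 3, device='cuda', requires_grad=True)
SomeMatrix = torch.randn(3, device='cuda')

row = (SomeMatrix * A[20, :]).unsqueeze(0)                         # shape (1, 3), stays in the graph
tempTens = torch.cat([torch.zeros(20, 3, device='cuda'),
                      row,
                      torch.zeros(165, 3, device='cuda')], dim=0)  # shape (186, 3)

target = torch.zeros(186, 3, device='cuda')                        # some ground truth
loss = ((tempTens - target) ** 2).mean()
loss.backward()                                                    # only row 20 of A.grad is nonzero

Only the row that actually depends on A receives gradient, which is exactly what the in-place assignment was trying to achieve.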
I apologize for the poor question title but I'm not sure quite how to phrase it. Here's the problem I'm trying to solve: I have two NNs working off of the same input dataset in my code. One of them is a traditional network while the other is used to limit the acceptable range of the first. This works by using a tf.where() statement which works fine in most cases, such as this toy example:
pcts = tf.constant([0.04, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.04, 0.04, 0.04])
legal_actions = tf.where(pcts >= 0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
Which gives the correct result: legal_actions = [0,1,1,1,1,1,1,0,0,0]
I can then multiply this by the output of my first network to limit its Q values to only those of the legal actions. In a case like the above this works great.
However, it is also possible that my original vector looks something like this, with low values in the middle of the high values: pcts= [0.04,0.06,0.06,0.04,0.04,0.06,0.06,0.04,0.04,0.04]
Using the same code as above my legal_actions comes out as this: legal_actions = [0,1,1,0,0,1,1,0,0,0]
Based on the code I have this is correct, however, I'd like to include any zeros in the middle as part of my legal_actions. In other words, I'd like this second example to be the same as the first. Working in basic TF this is easy to do in several different ways, such as in this reproducible example (it's also easy to do with sparse tensors):
import tensorflow as tf
pcts= tf.placeholder(tf.float32, shape=(10,))
legal_actions = tf.where(pcts>=0.05, tf.ones_like(pcts), tf.zeros_like(pcts))
mask = tf.where(tf.greater(legal_actions,0))
legals = tf.cast(tf.range(tf.reduce_min(mask),tf.reduce_max(mask)+1),tf.int64)
oh = tf.one_hot(legals,10)
oh = tf.reduce_sum(oh,0)
with tf.Session() as sess:
    print(sess.run(oh, feed_dict={pcts: [0.04, 0.06, 0.06, 0.04, 0.04, 0.06, 0.06, 0.04, 0.04, 0.04]}))
The problem that I'm running into is when I try to apply this to my actual code which is reading in batches from a file. I can't figure out a way to fill in the "gaps" in my tensor without the range function and/or I can't figure out how to make the range function work with batches (it will only make one range at a time, not one per batch, as near as I can tell). Any suggestions on how to either make what I'm working on work or how to solve the problem a completely different way would be appreciated.
Try this code:
import tensorflow as tf

pcts = tf.random.uniform((2, 3, 4))
a = pcts >= 0.5
shape = tf.shape(pcts)[-1]
a = tf.reshape(a, (-1, shape))
a = tf.cast(a, dtype=tf.float32)

def rng(t):
    # running max from the left and from the right; their elementwise minimum
    # is 1 exactly between the first and last 1 in t, which fills interior gaps
    left = tf.scan(lambda acc, x: tf.maximum(acc, x), t)
    right = tf.scan(lambda acc, x: tf.maximum(acc, x), t, reverse=True)
    return tf.minimum(left, right)

a = tf.map_fn(lambda x: rng(x), a)
a = tf.reshape(a, tf.shape(pcts))
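As a quick sanity check (a small sketch of my own, assuming TF 2.x eager execution; under TF 1.x evaluate it inside a tf.Session as in the question), applying rng to the 1-D example from the question fills in the interior gap:

pcts_1d = tf.constant([0.04, 0.06, 0.06, 0.04, 0.04, 0.06, 0.06, 0.04, 0.04, 0.04])
legal = tf.cast(pcts_1d >= 0.05, tf.float32)
print(rng(legal).numpy())  # [0. 1. 1. 1. 1. 1. 1. 0. 0. 0.]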
I would like to downsample a 3d array by taking the most frequent value (mode) of the original values. After some research, I found the block_reduce function in the skimage library. For example, if I wanted to take the average of each block, I could do it easily:
import numpy as np
from skimage.measure import block_reduce

image = np.arange(4*4*4).reshape(4, 4, 4)
new_image = block_reduce(image, block_size=(2, 2, 2), func=np.mean, cval=np.mean(image))
In my case, I want to pass the func argument a mode function. However, numpy doesn't have a mode function. According to the documentation, the passed function should accept 'axis' as an argument. I tried some workarounds such as writing my own function and combining np.unique and np.argmax, as well as passing scipy.stats.mode as the function. All of them failed.
I wrote some nested for loops to do this, but it's way too slow for large arrays. Is there an easy way to do this? I don't necessarily need to use the scikit-image library.
Thank you in advance.
Let's start with the assumption that the input image shape is divisible by block_size, i.e. corresponding shape dimensions are divisible by each size parameter of block_size.
So, as pre-processing, we need to make blocks out of the input image, like so -
def blockify(image, block_size):
    # reshape into a (num_blocks, block_volume) 2D array
    shp = image.shape
    out_shp = [s // b for s, b in zip(shp, block_size)]
    reshape_shp = np.c_[out_shp, block_size].ravel()
    nC = np.prod(block_size)
    return image.reshape(reshape_shp).transpose(0, 2, 4, 1, 3, 5).reshape(-1, nC)
Next up, for our specific case of mode finding, we will use bincount2D_vectorized along with argmax -
# https://stackoverflow.com/a/46256361/ by Divakar
def bincount2D_vectorized(a):
    N = a.max() + 1
    a_offs = a + np.arange(a.shape[0])[:, None] * N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)
out = bincount2D_vectorized(blockify(image, block_size=(2,2,2))).argmax(1)
Alternatively, we can use scipy.stats.mode along the block axis -
from scipy.stats import mode
out = mode(blockify(image, block_size=(2, 2, 2)), axis=1)[0]
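Putting the pieces together on the 4x4x4 example from the question (a small sketch, assuming the blockify and bincount2D_vectorized helpers defined above):

import numpy as np

image = np.arange(4*4*4).reshape(4, 4, 4)
blocks = blockify(image, block_size=(2, 2, 2))       # shape (8, 8): 8 blocks of 8 voxels each
out = bincount2D_vectorized(blocks).argmax(1)        # mode of each block
new_image = out.reshape(2, 2, 2)                     # back on the downsampled grid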
Finally, if the initial divisibility assumption doesn't hold, we need to pad the image with an appropriate pad value; np.pad can do that as part of the blockify step.
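A possible padding step could look like this (again just a sketch; the pad value 0 is an arbitrary choice and will bias the mode of the edge blocks, so pick whatever suits your data):

def pad_to_blocks(image, block_size, pad_value=0):
    # pad each axis up to the next multiple of the corresponding block size
    pads = [(0, (-s) % b) for s, b in zip(image.shape, block_size)]
    return np.pad(image, pads, mode='constant', constant_values=pad_value)

image5 = np.random.randint(0, 4, (5, 5, 5))
padded = pad_to_blocks(image5, (2, 2, 2))            # shape (6, 6, 6)
out = bincount2D_vectorized(blockify(padded, (2, 2, 2))).argmax(1).reshape(3, 3, 3)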
I created a neural network with Keras and added a Lambda layer to perform some calculations, but it shows poor performance at inference time.
I was able to run inference successfully using a batch of one input, and I added one more loop to handle multiple inputs. Everything works, but the performance is somewhat poor. I figured using a larger batch would make things a lot faster. My question is whether I am handling batches correctly (is it really necessary to use another loop?), as I have not found any Keras or TensorFlow documentation dealing with this topic in more depth.
Below is a code with a structure similar to the one I'm using in the Lambda layer.
def GenericFunc(x, batch=10, channels=64):
    y, group = [], []
    for i in range(batch):
        for j in range(channels):
            y.append(backend.sum(x[0, :, :, j]))
        group.append(tf.convert_to_tensor(y, dtype=np.float32))
        y = []
    yy = backend.stack(group, axis=0)
    tensor_stack = backend.reshape(yy, [batch, channels])
    return tensor_stack
Any suggestions will be welcome!
Never use loops. Tensors are made for tensor operations.
def GenericFunc(x):
    y = backend.sum(x, axis=1)
    y = backend.sum(y, axis=1)
    return y
It will probably also work with
def GenericFunc(x):
    return backend.sum(x, axis=[1, 2])
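For reference, wiring the loop-free version into a model could look like this (a sketch only; the input shape and the tf.keras import path are illustrative assumptions):

from tensorflow.keras import layers, models, backend

def generic_func(x):
    return backend.sum(x, axis=[1, 2])   # sum over height and width, per sample and per channel

inp = layers.Input(shape=(32, 32, 64))
out = layers.Lambda(generic_func)(inp)   # output shape: (batch, 64)
model = models.Model(inp, out)

The Lambda layer then handles any batch size without an explicit loop.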
"In general, you get better performance creating batches of linear constraints rather than creating them one at a time." - The wise programmer. I'm just wondering whether this still holds for a huge problem.
To be clear, I have a (35k x 40) dataset and I want to run an SVM on it. I need to produce the Gram matrix of this dataset; that part is fine, but passing the coefficients to CPLEX is a mess and takes hours. Here is my code:
import numpy as np

nn = 35000
XXt = np.random.rand(nn, nn)  # the Gram matrix of the dataset
yy = np.random.rand(nn)       # the label vector of the dataset
temp = ((yy * XXt).T) * yy
xg, yg = np.meshgrid(range(nn), range(nn))
indici = np.dstack([yg, xg])
quadratic_part = []
for ii in xrange(nn):
    for indd in indici[ii][ii:]:
        quadratic_part.append([indd[0], indd[1], temp[indd[0], indd[1]]])
The 'quadratic_part' is a list of triples of the form [i, j, c_ij], where c_ij is the coefficient stored in temp. It will be passed to the function 'objective.set_quadratic_coefficients()' of the CPLEX Python API.
Is there a wiser way to do this?
P.S. I may also have a memory problem, so instead of storing the whole 'quadratic_part' list it would be better to call 'objective.set_quadratic_coefficients()' several times... you know what I mean?!
Under the hood, objective.set_quadratic makes use of the CPXXcopyquad function in the C Callable Library, whereas objective.set_quadratic_coefficients uses CPXXcopyqpsep.
Here is an example (bear in mind that I am not a numpy expert; it's quite possible there's a better way to do that part):
import numpy as np
import cplex
nn = 5 # a small example size here
XXt = np.random.rand(nn, nn)  # the Gram matrix of the dataset
yy = np.random.rand(nn)       # the label vector of the dataset
temp = ((yy * XXt).T) * yy
# create symmetric matrix
tempu = np.triu(temp)  # upper triangle
iu1 = np.triu_indices(nn, 1)
tempu.T[iu1] = tempu[iu1]  # copy upper into lower
ind = np.array([[x for x in range(nn)] for x in range(nn)])
qmat = []
for i in range(nn):
    qmat.append([np.arange(nn), tempu[i]])
c = cplex.Cplex()
c.variables.add(lb=[0]*nn)
c.objective.set_quadratic(qmat)
c.write("test2.lp")
Your Q matrix is completely dense, so depending on the amount of memory you have, this technique may not scale. When it's possible, though, you should get better performance initializing your Q matrix with objective.set_quadratic. Perhaps you'll need to use some hybrid technique where you use both set_quadratic and set_quadratic_coefficients.
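To illustrate the chunked variant the P.S. asks about (a rough sketch, not benchmarked; the chunk size is arbitrary, and the list-of-(i, j, value)-triples call is the same one the question already uses), you could generate the upper triangle with numpy and stream it to set_quadratic_coefficients in pieces:

import numpy as np
import cplex

nn = 1000
XXt = np.random.rand(nn, nn)   # stand-in for the Gram matrix
yy = np.random.rand(nn)
temp = ((yy * XXt).T) * yy

c = cplex.Cplex()
c.variables.add(lb=[0.0] * nn)

iu, ju = np.triu_indices(nn)   # upper triangle, diagonal included
vals = temp[iu, ju]
chunk = 100000
for start in range(0, len(iu), chunk):
    triples = list(zip(iu[start:start + chunk].tolist(),
                       ju[start:start + chunk].tolist(),
                       vals[start:start + chunk].tolist()))
    c.objective.set_quadratic_coefficients(triples)

This keeps peak memory bounded by the chunk size rather than by the full triple list.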
Background:
Usually I define a Theano function with an input like 'x = fmatrix()'. However, while modifying Keras (a deep learning library based on Theano) to make it work with the CTC cost, I noticed a very weird problem: if one input of the cost function is declared as
x = tensor.zeros(shape=[M,N], dtype='float32')
instead of
x = fmatrix()
the training process will converge much faster.
A simplified problem:
The full code involved is quite big, so I tried to simplify the problem as follows: take a function for computing the Levenshtein edit distance,
import theano
from theano import tensor
from theano.ifelse import ifelse

def editdist(s, t):
    def update(x, previous_row, target):
        current_row = previous_row + 1
        current_row = tensor.set_subtensor(current_row[1:], tensor.minimum(current_row[1:], tensor.add(previous_row[:-1], tensor.neq(target, x))))
        current_row = tensor.set_subtensor(current_row[1:], tensor.minimum(current_row[1:], current_row[0:-1] + 1))
        return current_row
    source, target = ifelse(tensor.lt(s.shape[0], t.shape[0]), (t, s), (s, t))
    previous_row = tensor.arange(target.size + 1, dtype=theano.config.floatX)
    result, updates = theano.scan(fn=update, sequences=source, outputs_info=previous_row, non_sequences=target, name='editdist')
    return result[-1, -1]
then I define two functions f1 and f2 like:
x1 = tensor.fvector()
x2 = tensor.fvector()
r1 = editdist(x1,x2)
f1 = theano.function([x1,x2], r1)
x3 = tensor.zeros(3, dtype='float32')
x4 = tensor.zeros(3, dtype='float32')
r2 = editdist(x3,x4)
f2 = theano.function([x3,x4], r2)
When computing with f1 and f2, the results are different:
>>> f1([1, 2, 3], [1, 3, 3])
array(1.0)
>>> f2([1, 2, 3], [1, 3, 3])
array(3.0)
f1 gives the right result, but f2 doesn't.
So my problem is: what is the right way to define a theano function? And, what actually went wrong about f2?
Update:
I'm using Theano version 0.8.0.dev0. I just tried Theano 0.7.0, and both f1 and f2 give the correct result. Maybe this is a bug in Theano?
Update_1st 1-27-2016:
According to @lamblin's explanation on this issue (https://github.com/Theano/Theano/issues/3925#issuecomment-175088918), this was actually a bug in Theano and has been fixed in the latest (1-26-2016) version. For convenience, lamblin's explanation is quoted here:
"The first way is the most natural one, but in theory both should be equivalent.
x3 and x4 are created as the output of an "alloc" operation, the input of which would be the constant 3, rather than free inputs like x1 and x2, but that should not matter since you pass [x3, x4] as inputs to theano.function, which should cut the computation graph right there.
My guess is that scan is optimizing prematurely, believing that x3 or x4 is guaranteed to always be the constant 0, and does some simplifications that proved incorrect when values are provided for them. That would be an actual bug in scan."
Update_2nd 1-27-2016:
Unfortunately the bug is not totally fixed yet. In the background section I mentioned that declaring one input of the cost function as tensor.zeros() makes convergence much faster; I've now found the reason: when the input is declared as tensor.zeros(), the cost function gives an incorrect result, although mysteriously this helps convergence.
I put together a simplified reproduction demo here (https://github.com/daweileng/TheanoDebug); run ctc_bench.py and you can see the results.
theano.tensor.zeros(...) can't take any value other than 0, unless of course you add nodes to the graph and modify parts of the zeros tensor using theano.tensor.set_subtensor. An input tensor created with theano.tensor.fmatrix, on the other hand, can take any value you feed it.
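A tiny sketch illustrating the difference (my own example, not from the original post):

import numpy as np
import theano
from theano import tensor

x = tensor.fmatrix('x')                           # free input: takes whatever you feed it
f_free = theano.function([x], x.sum())
print(f_free(np.ones((2, 2), dtype='float32')))   # 4.0

z = tensor.zeros((2, 2), dtype='float32')         # fixed at 0 unless the graph overwrites parts of it
z_mod = tensor.set_subtensor(z[0, :], 1.0)        # explicitly adds nodes with set_subtensor
f_zeros = theano.function([], z_mod.sum())
print(f_zeros())                                  # 2.0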