What does Tensorflow really do when the Gradient descent optimizer is applied to a "loss" placeholder that is not a number (a tensor of size 1) but rather a vector (a 1-dimensional tensor of size 2, 3, 4, or more)?
Is it like doing the descent on the sum of the components?
The answer to your second question is "no".
As for the first: just like in the one-dimensional case (e.g. y = f(x), x in R), where the direction the algorithm takes is defined by the derivative of the function with respect to its single variable, in the multidimensional case the overall direction is defined by the derivative of the function with respect to each variable.
This means the size of the step you'll take in each direction will be determined by the value of the derivative of the variable corresponding to that direction.
Since there's no way to properly type math on Stack Overflow, instead of messing around with it I'll suggest you take a look at this article.
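To make the per-coordinate step sizes concrete, here is a minimal numpy sketch (the quadratic loss and learning rate are made up for illustration):

```python
import numpy as np

# Hypothetical loss f(v) = v[0]**2 + 10 * v[1]**2, whose gradient is
# (2*v[0], 20*v[1]): the slope, and hence the step, differs per coordinate.
def grad(v):
    return np.array([2.0 * v[0], 20.0 * v[1]])

v = np.array([1.0, 1.0])
lr = 0.01

step = lr * grad(v)  # the step along each axis scales with that partial derivative
v = v - step

print(step)  # the second component is ten times the first
```
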
Tensorflow first reduces your loss to a scalar and then optimizes that.
Related
I'm working on implementing a very simple auto diff library in Rust to expand my knowledge on how it is done. I have most everything working, but when implementing negative log likelihood, I realized that I have some confusion on how to handle the derivative for the following scenario (I've written it in PyTorch below).
x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grads=True)
y = x - torch.sum(x)
I've looked around, experimented, and am still a little confused on what is actually happening here. I know that the derivative with respect to x of the equation above is [-2, -2, -2], but there are a number of ways to get there, and when I expand the equation to the following:
x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grads=True)
y = torch.exp(x - torch.sum(x))
I am completely lost and have no idea how it derived the gradients for x.
I'm assuming the above equations are being rewritten to something like this:
y = (x - [torch.sum(x), torch.sum(x), torch.sum(x)])
but am not sure, and am really struggling to find info on the topic of scalars being expanded to vectors or whatever is actually happening. If someone could point me in the right direction that would be awesome!
If helpful, I can include the gradients pytorch computes of the above equations.
Your code won't work in PyTorch without modification because it doesn't specify what the gradients w.r.t. y are. You need them to call y.backward(), which computes the gradients w.r.t. x. From your all -2 result, I figured the gradients must be all ones.
The "scalar expansion" is called broadcasting. As you already know, broadcasting is performed whenever two tensor operands have mismatched shapes. My guess is that it is implemented in the same way as any other operation in PyTorch that knows how to compute the gradients w.r.t. its inputs given the gradients w.r.t. its outputs. A simple example is given below which (a) works with your given test case and (b) still lets us use PyTorch's autograd to compute the gradients automatically (see also PyTorch's docs on extending autograd):
import torch
from typing import Tuple

class Broadcast(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, length: int) -> torch.Tensor:
        assert x.ndim == 0, "input must be a scalar tensor"
        assert length > 0, "length must be greater than zero"
        return x.new_full((length,), x.item())

    @staticmethod
    def backward(ctx, grad: torch.Tensor) -> Tuple[torch.Tensor, None]:
        # The gradient w.r.t. a broadcast scalar is the sum of the incoming
        # gradients; the length argument gets no gradient.
        return grad.sum(), None
Now, by setting broadcast = Broadcast.apply we can call the broadcasting ourselves instead of letting PyTorch perform it automatically.
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x - broadcast(torch.sum(x), x.size(0))
y.backward(torch.ones_like(y))
assert torch.allclose(torch.tensor(-2.), x.grad)
Note that I don't know how PyTorch actually implements it. The implementation above is just to illustrate how the broadcasting operation might be written for automatic differentiation to work, which hopefully answers your question.
Firstly, a few things: the argument is requires_grad, not requires_grads. Second, you can only require gradients for a floating point or complex dtype.
Now, scalar addition/multiplication (note that subtraction/division can be viewed as addition of a negative number/multiplication by a reciprocal) simply adds/multiplies the scalar with every element of the tensor. Hence,
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x - 1
evaluates to:
y = tensor([-1., 0., 1.], grad_fn=<SubBackward0>)
Thus, in your case, torch.sum(x) is basically a scalar quantity that is subtracted from all the elements of the tensor x.
If you are more interested in the gradient part, check the pytorch documentation on autograd [ref]. It states the following:
The graph is differentiated using the chain rule. If any of tensors are non-scalar (i.e. their data has more than one element) and require gradient, then the Jacobian-vector product would be computed, in this case the function additionally requires specifying grad_tensors. It should be a sequence of matching length, that contains the “vector” in the Jacobian-vector product, usually the gradient of the differentiated function w.r.t. corresponding tensors (None is an acceptable value for all tensors that don’t need gradient tensors).
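As a plain-numpy sanity check of the gradients discussed above (not part of either answer, just the chain rule written out): for y = x - sum(x) the Jacobian is I - 1, so backpropagating a vector of ones yields -2 in every slot, and the exp case simply scales the upstream gradient elementwise first.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
n = x.size

# Jacobian of y = x - sum(x): dy_i/dx_j = delta_ij - 1
J = np.eye(n) - np.ones((n, n))

# Backward pass with an upstream gradient of all ones
# (what y.backward(torch.ones_like(y)) supplies in the question)
grad_x = J.T @ np.ones(n)
print(grad_x)  # [-2. -2. -2.]

# For y = exp(x - sum(x)) the chain rule first scales the upstream
# gradient elementwise by exp(x - sum(x)), then applies the same Jacobian
upstream = np.ones(n) * np.exp(x - x.sum())
grad_x_exp = J.T @ upstream
```
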
It has been firmly established that my_tensor.detach().numpy() is the correct way to get a numpy array from a torch tensor.
I'm trying to get a better understanding of why.
In the accepted answer to the question just linked, Blupon states that:
You need to convert your tensor to another tensor that isn't requiring a gradient in addition to its actual value definition.
In the first discussion he links to, albanD states:
This is expected behavior because moving to numpy will break the graph and so no gradient will be computed.
If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.
In the second discussion he links to, apaszke writes:
Variable's can’t be transformed to numpy, because they’re wrappers around tensors that save the operation history, and numpy doesn’t have such objects. You can retrieve a tensor held by the Variable, using the .data attribute. Then, this should work: var.data.numpy().
I have studied the internal workings of PyTorch's autodifferentiation library, and I'm still confused by these answers. Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?
What is a Variable? How does it relate to a tensor?
I feel that a thorough, high-quality Stack Overflow answer that explains the reason for this to new users of PyTorch who don't yet understand autodifferentiation is called for here.
In particular, I think it would be helpful to illustrate the graph through a figure and show how the disconnection occurs in this example:
import torch
tensor1 = torch.tensor([1.0,2.0],requires_grad=True)
print(tensor1)
print(type(tensor1))
tensor1 = tensor1.numpy()
print(tensor1)
print(type(tensor1))
I think the most crucial point to understand here is the difference between a torch.tensor and an np.ndarray:
While both objects are used to store n-dimensional matrices (aka "Tensors"), a torch.tensor has an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.
So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.tensor can be used interchangeably.
However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but (and more importantly) the computational graph leading to these values. This computational graph is then used (using the chain rule of derivatives) to compute the derivative of the loss function w.r.t each of the independent variables used to compute the loss.
As mentioned before, np.ndarray object does not have this extra "computational graph" layer and therefore, when converting a torch.tensor to np.ndarray you must explicitly remove the computational graph of the tensor using the detach() command.
Computational Graph
From your comments it seems like this concept is a bit vague. I'll try and illustrate it with a simple example.
Consider a simple function of two (vector) variables, x and w:
x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)
y = x @ w  # inner-product of x and w
z = y ** 2 # square the inner product
If we are only interested in the value of z, we need not worry about any graphs; we simply move forward from the inputs, x and w, to compute y and then z.
However, what would happen if we do not care so much about the value of z, but rather want to ask the question "what is w that minimizes z for a given x"?
To answer that question, we need to compute the derivative of z w.r.t w.
How can we do that?
Using the chain rule we know that dz/dw = dz/dy * dy/dw. That is, to compute the gradient of z w.r.t w we need to move backward from z back to w computing the gradient of the operation at each step as we trace back our steps from z to w. This "path" we trace back is the computational graph of z and it tells us how to compute the derivative of z w.r.t the inputs leading to z:
z.backward() # ask pytorch to trace back the computation of z
We can now inspect the gradient of z w.r.t w:
w.grad # the resulting gradient of z w.r.t w
tensor([0.8010, 1.9746, 1.5904, 1.0408])
Note that this is exactly equal to
2*y*x
tensor([0.8010, 1.9746, 1.5904, 1.0408], grad_fn=<MulBackward0>)
since dz/dy = 2*y and dy/dw = x.
Each tensor along the path stores its "contribution" to the computation:
z
tensor(1.4061, grad_fn=<PowBackward0>)
And
y
tensor(1.1858, grad_fn=<DotBackward>)
As you can see, y and z store not only the "forward" values of <x, w> or y**2, but also the computational graph -- the grad_fn that is needed to compute the derivatives (using the chain rule) when tracing back the gradients from z (the output) to w (the inputs).
These grad_fn are essential components of torch.tensors; without them one cannot compute derivatives of complicated functions. However, np.ndarrays do not have this capability at all, and they do not carry this information.
Please see this answer for more information on tracing back the derivative using the backward() function.
Since both np.ndarray and torch.tensor share a common "layer" storing an n-d array of numbers, pytorch uses the same storage to save memory:
numpy() → numpy.ndarray
Returns self tensor as a NumPy ndarray. This tensor and the returned ndarray share the same underlying storage. Changes to self tensor will be reflected in the ndarray and vice versa.
The other direction works in the same way as well:
torch.from_numpy(ndarray) → Tensor
Creates a Tensor from a numpy.ndarray.
The returned tensor and ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa.
Thus, when creating an np.ndarray from a torch.tensor (or vice versa), both objects reference the same underlying storage in memory. Since np.ndarray does not store/represent the computational graph associated with the array, this graph must be explicitly removed using detach() when numpy and torch are meant to share the same storage.
Note that if you wish, for some reason, to use pytorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably.
with torch.no_grad():
    x_t = torch.rand(3, 4)
    y_np = np.ones((4, 2), dtype=np.float32)
    x_t @ torch.from_numpy(y_np)  # dot product in torch
    np.dot(x_t.numpy(), y_np)  # the same dot product in numpy
I asked, Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?
Yes, the new tensor will not be connected to the old tensor through a grad_fn, and so any operations on the new tensor will not carry gradients back to the old tensor.
Writing my_tensor.detach().numpy() is simply saying, "I'm going to do some non-tracked computations based on the value of this tensor in a numpy array."
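A minimal sketch of that disconnection (values are made up, just to show that nothing flows back through a detached tensor):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Route a computation through detach(): the graph is cut at that point
y = x.detach() * 3.0
print(y.requires_grad)  # False -- y has no grad_fn, so nothing can flow back to x

# The same computation without detach() keeps the graph intact
z = (x * 3.0).sum()
z.backward()
print(x.grad)  # tensor([3., 3.])
```
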
The Dive into Deep Learning (d2l) textbook has a nice section describing the detach() method, although it doesn't talk about why a detach makes sense before converting to a numpy array.
Thanks to jodag for helping to answer this question. As he said, Variables are obsolete, so we can ignore that comment.
I think the best answer I can find so far is in jodag's doc link:
To stop a tensor from tracking history, you can call .detach() to detach it from the computation history, and to prevent future computation from being tracked.
and in albanD's remarks that I quoted in the question:
If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.
In other words, the detach method means "I don't want gradients," and it is impossible to track gradients through numpy operations (after all, that is what PyTorch tensors are for!)
This is a little showcase of a tensor -> numpy array connection:
import torch
tensor = torch.rand(2)
numpy_array = tensor.numpy()
print('Before edit:')
print('Tensor:', tensor)
print('Numpy array:', numpy_array)
tensor[0] = 10
print()
print('After edit:')
print('Tensor:', tensor)
print('Numpy array:', numpy_array)
Output:
Before edit:
Tensor: tensor([0.1286, 0.4899])
Numpy array: [0.1285522 0.48987144]
After edit:
Tensor: tensor([10.0000, 0.4899])
Numpy array: [10. 0.48987144]
The value of the first element is shared by the tensor and the numpy array. Changing it to 10 in the tensor changed it in the numpy array as well.
I'm implementing my own neural network from scratch (educational purposes, I know there are faster and better libraries for this) and for that I'm trying to calculate the derivative of a fully connected layer. I know the following:
and assuming I have a way to calculate the derivative of f using f.derivative(<some_matrix>), how can I use numpy to efficiently calculate the derivative of f(XW) with respect to W as seen in the picture?
I want to be able to calculate the derivative for N different inputs at the same time (giving me a 4-d tensor, 1 dimension for the N samples, and 3 dimensions for the derivative in the image).
Note: f.derivative takes in a matrix of N inputs with d features each, and returns the derivative of each of the input points.
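A hedged numpy sketch of one way to build that 4-d derivative with einsum, assuming f is applied elementwise and that f.derivative(Z) returns the elementwise f'(Z) (both are assumptions about the setup shown in the picture; fc_derivative is a hypothetical helper name):

```python
import numpy as np

def fc_derivative(X, W, f_derivative):
    """Derivative of f(X @ W) w.r.t. W for N samples at once (sketch).

    Assumes f is elementwise and f_derivative(Z) returns f'(Z) elementwise
    (an assumption about the question's f.derivative).
    Returns shape (N, m, d, m): entry [n, i, j, k] is
    d f(X W)[n, i] / d W[j, k] = f'(Z[n, i]) * X[n, j] * (i == k).
    """
    Z = X @ W                   # (N, m)
    Fp = f_derivative(Z)        # (N, m), elementwise f'
    m = W.shape[1]
    # delta(i == k) makes the derivative block-diagonal across output units
    return np.einsum('ni,nj,ik->nijk', Fp, X, np.eye(m))

# Example with f = identity, so f' = 1 everywhere:
X = np.random.rand(5, 3)
W = np.random.rand(3, 4)
D = fc_derivative(X, W, lambda Z: np.ones_like(Z))
print(D.shape)  # (5, 4, 3, 4): N samples plus the 3 derivative dimensions
```
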
I need to solve equations in Tensorflow in the form A(y)x = b, where A is a large sparse band matrix and also a function of some other tensor say y. Naturally, the solution x will be a function of tensor y too. After solving for x, I want to take gradient of x with respect to y.
I considered two options:
1. Use a sparse external library to efficiently invert A, such as scipy.sparse. For this I need to convert the tensors to numpy array and then back to tensors. The problem with this approach is that I cannot use gradient tape with external libraries such as scipy.sparse.
2. Use Tensorflow's matrix inversion that works with gradient tape. This is extremely slow for large matrices, since it does not utilize the sparsity of the tensor. I was unable to find a sparse invert implementation in Tensorflow.
A small simplified example of what I need:
y = tf.constant(3.14)
A = my_sparse_tensor(shape=(1000, 1000))  # Arbitrary function that returns a sparse tensor
b = tf.ones(shape=(1000, 1))
with tf.GradientTape() as g:
    g.watch(y)
    A = A * y
    x = tf.matmul(sparse_invert(A), b)
dx_dy = g.gradient(x, y)
Of course the dependence of A on y is much more complicated than in this example.
Is there any way to do this in Tensorflow, or do I have to restrict myself to tf.linalg.inv ?
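Independently of how sparse_invert ends up implemented, the gradient the tape has to produce for this particular A = A * y example can be checked with plain numpy on a small dense stand-in (sizes and values below are made up): since x(y) = (A0 * y)^{-1} b = (1/y) A0^{-1} b, we have dx/dy = -x / y.

```python
import numpy as np

rng = np.random.default_rng(0)
A0 = rng.random((4, 4)) + 4 * np.eye(4)  # small, well-conditioned dense stand-in
b = np.ones((4, 1))
y = 3.14

# x(y) = (A0 * y)^{-1} b = (1 / y) * A0^{-1} b, hence dx/dy = -x / y
x = np.linalg.solve(A0 * y, b)
analytic = -x / y

# Finite-difference check of the analytic gradient
eps = 1e-6
x_eps = np.linalg.solve(A0 * (y + eps), b)
numeric = (x_eps - x) / eps

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```
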
I have a tensor in the shape (n_samples, n_steps, n_features). I want to decompose this into a tensor of shape (n_samples, n_components).
I need a method of decomposition that has a .fit(...) so that I can apply the same decomposition to a new batch of samples. I have been looking at Tucker Decomposition and PARAFAC Decomposition, but neither have that crucial .fit(...) and .transform(...) functionality. (Or at least I think they don't?)
I could use PCA and train it on a representative sample and then call .transform(...) on the remaining samples, but I would rather have some sort of tensor decomposition that can handle all of the samples at once, so as to get a better idea of the differences between each sample.
This is what I mean by "tensor":
In fact tensors are merely a generalisation of scalars and vectors; a scalar is a zero rank tensor, and a vector is a first rank tensor. The rank (or order) of a tensor is defined by the number of directions (and hence the dimensionality of the array) required to describe it.
If you have any questions, please ask, I'll try to clarify my problem if needed.
EDIT: The best solution would be some type of kernel but I have yet to find a kernel that can deal with n-rank Tensors and not just 2D data
You can do this using the development (master) version of TensorLy. Specifically, you can use the new partial_tucker function (it is not yet updated in the documentation...).
Note that the following solution preserves the structure of the tensor, i.e. a tensor of shape (n_samples, n_steps, n_features) is decomposed into a (smaller) tensor of shape (n_samples, n_components_1, n_components_2).
Code
Short answer: this is a very basic class that does what you want (and it would work on tensors of arbitrary order).
import tensorly as tl
from tensorly.decomposition._tucker import partial_tucker
class TensorPCA:
    def __init__(self, ranks, modes):
        self.ranks = ranks
        self.modes = modes

    def fit(self, tensor):
        self.core, self.factors = partial_tucker(tensor, modes=self.modes, ranks=self.ranks)
        return self

    def transform(self, tensor):
        return tl.tenalg.multi_mode_dot(tensor, self.factors, modes=self.modes, transpose=True)
Usage
Given an input tensor, you can use the previous class by first instantiating it with the desired ranks (size of the core tensor) and modes on which to perform the decomposition (in your 3D case, 1 and 2 since indexing starts at zero):
tpca = TensorPCA(ranks=[4, 5], modes=[1, 2])
tpca.fit(tensor)
Given a new tensor originally called new_tensor, you can project it using the transform method:
tpca.transform(new_tensor)
Explanation
Let's go through the code with an example: first let's import the necessary bits:
import numpy as np
import tensorly as tl
from tensorly.decomposition._tucker import partial_tucker
We then generate a random tensor:
tensor = np.random.random((10, 11, 12))
The next step is to decompose it along its second and third dimensions, or modes (as the first dimension corresponds to the samples):
core, factors = partial_tucker(tensor, modes=[1, 2], ranks=[4, 5])
The core corresponds to the transformed input tensor while factors is a list of two projection matrices, one for the second mode and one for the third mode. Given a new tensor, you can project it to the same subspace (the transform method) by projecting each of its last two dimensions:
tl.tenalg.multi_mode_dot(tensor, factors, modes=[1, 2], transpose=True)
The transposition here is equivalent to an inverse since the factors are orthogonal.
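For reference, the same mode-1/mode-2 projection can be written in plain numpy as an einsum contraction of the last two axes with the factor matrices (the orthonormal factors below are stand-ins; in the answer they come from partial_tucker):

```python
import numpy as np

tensor = np.random.random((10, 11, 12))

# Stand-in orthonormal factor matrices (one per projected mode)
U1 = np.linalg.qr(np.random.random((11, 4)))[0]  # (11, 4), mode 1
U2 = np.linalg.qr(np.random.random((12, 5)))[0]  # (12, 5), mode 2

# Contract axis 1 with U1 and axis 2 with U2 -- this is what
# multi_mode_dot(tensor, [U1, U2], modes=[1, 2], transpose=True) computes
core = np.einsum('ijk,ja,kb->iab', tensor, U1, U2)
print(core.shape)  # (10, 4, 5)
```
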
Finally, a note on the terminology: in general, even though it is sometimes done, it is probably best not to use the terms order and rank of a tensor interchangeably. The order of a tensor is simply its number of dimensions, while the rank of a tensor is usually a much more complicated notion, which you could think of as a generalization of the notion of matrix rank.