In numpy, we use ndarray.reshape() for reshaping an array.
I noticed that in PyTorch, people use torch.view(...) for the same purpose, but at the same time there is also a torch.reshape(...).
So I am wondering: what are the differences between them, and when should I use one rather than the other?
torch.view has existed for a long time. It returns a tensor with the new shape, and the returned tensor shares the underlying data with the original tensor.
See the documentation here.
On the other hand, it seems that torch.reshape was introduced recently, in version 0.4. According to the documentation:
Returns a tensor with the same data and number of elements as input, but with the specified shape. When possible, the returned tensor will be a view of input. Otherwise, it will be a copy. Contiguous inputs and inputs with compatible strides can be reshaped without copying, but you should not depend on the copying vs. viewing behavior.
This means that torch.reshape may return either a copy or a view of the original tensor; you cannot count on it returning one or the other. According to the developer:
if you need a copy use clone() if you need the same storage use view(). The semantics of reshape() are that it may or may not share the storage and you don't know beforehand.
Another difference is that reshape() can operate on both contiguous and non-contiguous tensors, while view() can only operate on contiguous tensors. Also see here about the meaning of contiguous.
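To illustrate (a small sketch of my own; the variable names are arbitrary):
>>> import torch
>>> t = torch.zeros(4)
>>> v = t.view(2, 2)    # shares storage with t
>>> c = t.clone()       # always an independent copy
>>> t.fill_(7)
tensor([7., 7., 7., 7.])
>>> v                   # the view reflects the change
tensor([[7., 7.],
        [7., 7.]])
>>> c                   # the clone does not
tensor([0., 0., 0., 0.])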
Although both torch.view and torch.reshape are used to reshape tensors, here are the differences between them.
As the name suggests, torch.view merely creates a view of the original tensor. The new tensor will always share its data with the original tensor. This means that if you change the original tensor, the reshaped tensor will change and vice versa.
>>> z = torch.zeros(3, 2)
>>> x = z.view(2, 3)
>>> z.fill_(1)
>>> x
tensor([[1., 1., 1.],
        [1., 1., 1.]])
To ensure that the new tensor always shares its data with the original, torch.view imposes some contiguity constraints on the shapes of the two tensors [docs]. More often than not this is not a concern, but sometimes torch.view throws an error even if the shapes of the two tensors are compatible. Here's a famous counter-example.
>>> z = torch.zeros(3, 2)
>>> y = z.t()
>>> y.size()
torch.Size([2, 3])
>>> y.view(6)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: invalid argument 2: view size is not compatible with input tensor's
size and stride (at least one dimension spans across two contiguous subspaces).
Call .contiguous() before .view().
torch.reshape doesn't impose any contiguity constraints, but also doesn't guarantee data sharing. The new tensor may be a view of the original tensor, or it may be a new tensor altogether.
>>> z = torch.zeros(3, 2)
>>> y = z.reshape(6)
>>> x = z.t().reshape(6)
>>> z.fill_(1)
tensor([[1., 1.],
        [1., 1.],
        [1., 1.]])
>>> y
tensor([1., 1., 1., 1., 1., 1.])
>>> x
tensor([0., 0., 0., 0., 0., 0.])
TL;DR:
If you just want to reshape tensors, use torch.reshape. If you're also concerned about memory usage and want to ensure that the two tensors share the same data, use torch.view.
view() will try to change the shape of the tensor while keeping the underlying data allocation the same, so the data will be shared between the two tensors. reshape() will create a new underlying memory allocation if necessary.
Let's create a tensor:
a = torch.arange(8).reshape(2, 4)
The memory is allocated as one contiguous block (the tensor is C contiguous, i.e. the rows are stored next to each other).
stride() gives the number of elements you need to skip over in memory to move to the next element along each dimension:
a.stride()
(4, 1)
We want its shape to become (4, 2); we can use view:
a.view(4,2)
The underlying data allocation has not changed, the tensor is still C contiguous:
a.view(4, 2).stride()
(2, 1)
Let's try with a.t(). transpose() doesn't modify the underlying memory allocation, and therefore a.t() is not contiguous.
a.t().is_contiguous()
False
Although it is not contiguous, the stride information is sufficient to iterate over the tensor:
a.t().stride()
(1, 4)
view() doesn't work anymore:
a.t().view(2, 4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
So what would the memory allocation for the (2, 4) shape we wanted from view(2, 4) have to look like? The stride would have to be something like (4, 2), but we would have to jump back to the beginning of the storage after reaching the end; no consistent stride can express that, so it doesn't work.
In this case, reshape() would create a new tensor with a different memory allocation to make the transpose contiguous:
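For example (a quick check of my own, reusing the a defined above):
a.t().reshape(2, 4)
tensor([[0, 4, 1, 5],
        [2, 6, 3, 7]])
a.t().reshape(2, 4).is_contiguous()
True
a.t().reshape(2, 4).data_ptr() == a.data_ptr()
False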
Note that we can use view to split the first dimension of the transpose.
Unlike what is said in the accepted and other answers, view() can operate on non-contiguous tensors!
a.t().view(2, 2, 2)
a.t().view(2, 2, 2).stride()
(2, 1, 4)
According to the documentation:
For a tensor to be viewed, the new view size must be compatible with its original size and stride, i.e., each new view dimension must either be a subspace of an original dimension, or only span across original dimensions d, d+1, …, d+k that satisfy the following contiguity-like condition: ∀ i = d, …, d+k−1, stride[i] = stride[i+1] × size[i+1].
Here it works because the first two dimensions of the view(2, 2, 2) result are subspaces of the transpose's first dimension.
For more information about contiguity, have a look at my answer in this thread.
Tensor.reshape() is more robust. It will work on any tensor, while Tensor.view() works only on a tensor t where t.is_contiguous() == True.
Explaining non-contiguous vs. contiguous memory is another story, but you can always make the tensor t contiguous by calling t.contiguous(), and then you can call view() without the error.
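For example (a toy tensor of my own):
>>> t = torch.zeros(3, 2).t()   # transposed, hence non-contiguous
>>> t.is_contiguous()
False
>>> t.contiguous().view(6)      # contiguous() makes a contiguous copy, then view works
tensor([0., 0., 0., 0., 0., 0.])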
I would say the answers here are technically correct, but there is another reason for reshape to exist: PyTorch is usually considered more convenient than other frameworks because it is closer to Python and NumPy. It is interesting that the question involves NumPy.
Let's look at size and shape in PyTorch. size is a method, so you call it as x.size(). shape in PyTorch is not a method: just as in NumPy, you use it as an attribute, x.shape. So it's handy to have both of them in PyTorch; if you come from NumPy, it is nice to be able to use the same attribute.
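For example (my own illustration):
>>> x = torch.zeros(2, 3)
>>> x.size()   # PyTorch-style method
torch.Size([2, 3])
>>> x.shape    # NumPy-style attribute, giving the same result
torch.Size([2, 3])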
Related
I'm working on implementing a very simple auto diff library in Rust to expand my knowledge of how it is done. I have almost everything working, but when implementing negative log likelihood, I realized that I have some confusion about how to handle the derivative in the following scenario (I've written it in PyTorch below).
x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grads=True)
y = x - torch.sum(x)
I've looked around, experimented, and am still a little confused about what is actually happening here. I know that the derivative with respect to x of the equation above is [-2, -2, -2], but there are a number of ways to get there, and when I expand the equation to the following:
x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grads=True)
y = torch.exp(x - torch.sum(x))
I am completely lost and have no idea how it derived the gradients for x.
I'm assuming the above equations are being rewritten to something like this:
y = (x - [torch.sum(x), torch.sum(x), torch.sum(x)])
but am not sure, and am really struggling to find info on the topic of scalars being expanded to vectors or whatever is actually happening. If someone could point me in the right direction that would be awesome!
If helpful, I can include the gradients pytorch computes of the above equations.
Your code won't work with PyTorch without modification because it doesn't specify what the gradients w.r.t. y are. You need them to call y.backward(), which computes the gradients w.r.t. x. From your all -2 result, I figured the gradients must be all ones.
The "scalar expansion" is called broadcasting. As you already know, broadcasting is performed whenever two tensor operands have mismatched shapes. My guess is that it is implemented in the same way as any other operation in PyTorch that knows how to compute the gradients w.r.t its inputs given the gradients w.r.t its outputs. A simple example is given below which (a) works with your given test case and (b) allows us to still use PyTorch's autograd to compute the gradients automatically (see also PyTorch's docs on extending autograd):
import torch
from typing import Tuple

class Broadcast(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, length: int) -> torch.Tensor:
        assert x.ndim == 0, "input must be a scalar tensor"
        assert length > 0, "length must be greater than zero"
        return x.new_full((length,), x.item())

    @staticmethod
    def backward(ctx, grad: torch.Tensor) -> Tuple[torch.Tensor, None]:
        # sum the incoming gradient back into the scalar; length gets no gradient
        return grad.sum(), None
Now, by setting broadcast = Broadcast.apply we can call the broadcasting ourselves instead of letting PyTorch perform it automatically.
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x - broadcast(torch.sum(x), x.size(0))
y.backward(torch.ones_like(y))
assert torch.allclose(torch.tensor(-2.), x.grad)
Note that I don't know how PyTorch actually implements it. The implementation above is just to illustrate how the broadcasting operation might be written for automatic differentiation to work, which hopefully answers your question.
First, a couple of things: the argument is requires_grad, not requires_grads. Second, you can only require gradients for a floating point or complex dtype.
Now, scalar addition/multiplication (note that subtraction/division can be viewed as addition of a negative number/multiplication by a reciprocal) simply adds/multiplies the scalar with every element of the tensor. Hence,
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x - 1
evaluates to:
y = tensor([0., 1., 2.], grad_fn=<SubBackward0>)
Thus, in your case, torch.sum(x) is basically a scalar quantity that is subtracted from all the elements of the tensor x.
If you are more interested in the gradient part, check the PyTorch documentation on autograd [ref]. It states the following:
The graph is differentiated using the chain rule. If any of tensors are non-scalar (i.e. their data has more than one element) and require gradient, then the Jacobian-vector product would be computed, in this case the function additionally requires specifying grad_tensors. It should be a sequence of matching length, that contains the “vector” in the Jacobian-vector product, usually the gradient of the differentiated function w.r.t. corresponding tensors (None is an acceptable value for all tensors that don’t need gradient tensors).
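For instance, with the first equation from the question and an all-ones vector as the grad argument (my own check, assuming that is indeed the vector used to get the all -2 result):
>>> x = torch.tensor([1., 2., 3.], requires_grad=True)
>>> y = x - torch.sum(x)
>>> y.backward(torch.ones_like(y))   # the "vector" in the Jacobian-vector product
>>> x.grad
tensor([-2., -2., -2.])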
I have a tensor:
t1 = torch.randn(564, 400)
I want to unroll it to a 1-d tensor that's 225600 long.
How can I do this?
Note the difference between view and reshape as suggested by Kris -
From reshape's docstring:
When possible, the returned tensor will be a view
of input. Otherwise, it will be a copy. Contiguous inputs and inputs with compatible strides can be reshaped without copying...
So if your tensor is not contiguous, calling reshape should handle what you would otherwise have had to handle yourself had you used view; that is, it saves you from having to call t1.contiguous().view(...) for non-contiguous tensors.
Also, one could use flatten, t1 = t1.flatten(), as an equivalent of view(-1), which is more readable.
PyTorch is much like NumPy, so you can simply do:
t1 = t1.view(-1) or t1 = t1.reshape(-1)
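A quick sanity check (my own, with the shapes from the question):
>>> t1 = torch.randn(564, 400)
>>> t1.view(-1).shape
torch.Size([225600])
>>> t1.reshape(-1).shape
torch.Size([225600])
>>> t1.flatten().shape
torch.Size([225600])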
Both .flatten() and .view(-1) flatten a tensor in PyTorch. What's the difference?
Does .flatten() copy the data of the tensor?
Is .view(-1) faster?
Is there any situation that .flatten() doesn't work?
In addition to @adeelh's comment, there is another difference: torch.flatten() results in a call to .reshape(), and the differences between .reshape() and .view() are:
[...] torch.reshape may return a copy or a view of the original tensor. You can not count on that to return a view or a copy.
Another difference is that reshape() can operate on both contiguous and non-contiguous tensors, while view() can only operate on contiguous tensors. Also see here about the meaning of contiguous.
For context:
The community had requested a flatten function for a while, and after Issue #7743 the feature was implemented in PR #8578.
You can see the implementation of flatten here, where a call to .reshape() can be seen in the return line.
flatten is simply a convenient alias for a common use case of view.¹
There are several others:

Function               Equivalent view logic
flatten()              view(-1)
flatten(start, end)    view(*t.shape[:start], -1, *t.shape[end+1:])
squeeze()              view(*[s for s in t.shape if s != 1])
unsqueeze(i)           view(*t.shape[:i], 1, *t.shape[i:])
Note that flatten allows you to flatten a specific contiguous subset of dimensions, with the start_dim and end_dim arguments.
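For example (a small illustration of my own of the start_dim/end_dim behaviour and its view equivalent from the table above):
>>> t = torch.zeros(2, 3, 4, 5)
>>> t.flatten(1, 2).shape                         # flatten dims 1..2 only
torch.Size([2, 12, 5])
>>> t.view(*t.shape[:1], -1, *t.shape[3:]).shape  # the equivalent view
torch.Size([2, 12, 5])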
¹ Actually, it is the superficially equivalent reshape under the hood.
First of all, .view() works only on contiguous data, while .flatten() works on both contiguous and non-contiguous data. Functions like transpose, which produce non-contiguous data, can be handled by .flatten() but not by .view(). As for copying: neither .view() nor .flatten() copies data when operating on contiguous data. However, for non-contiguous data, .flatten() first copies the data into contiguous memory and then changes the dimensions, so any change in the new tensor will not affect the original tensor.
ten=torch.zeros(2,3)
ten_view=ten.view(-1)
ten_view[0]=123
ten
>>tensor([[123., 0., 0.],
          [  0., 0., 0.]])
ten=torch.zeros(2,3)
ten_flat=ten.flatten()
ten_flat[0]=123
ten
>>tensor([[123., 0., 0.],
          [  0., 0., 0.]])
In the above code, the tensor ten has a contiguous memory allocation. Any change to ten_view or ten_flat is reflected in the tensor ten.
ten=torch.zeros(2,3).transpose(0,1)
ten_flat=ten.flatten()
ten_flat[0]=123
ten
>>tensor([[0., 0.],
          [0., 0.],
          [0., 0.]])
In this case the non-contiguous transposed tensor ten is passed to flatten(). Any changes made to ten_flat are not reflected in ten.
I'm following an example in the book "Deep Learning with Python" by Francois Chollet.
There's an example (p. 70) where they convert an array of ints to an array of float32.
The relevant lines are
from keras.datasets import imdb
(tr_data, tr_labels), (ts_data, ts_labels) = imdb.load_data(num_words=10000)
...
import numpy as np
y_train = np.asarray(tr_labels).astype('float32')
tr_labels is simply an array of ints
array([1, 0, 0, ..., 0, 1, 0])
y_train is an array of float32
array([1., 0., 0., ..., 0., 1., 0.], dtype=float32)
But why do we need to call np.asarray() when this alone seems to do the trick:
y_train = tr_labels.astype('float32')
Just wondering if numpy.asarray() does some additional data processing I'm not aware of.
No, it's not necessary.
np.asarray is sometimes useful if you aren't sure what the input type is (or if it can change): it won't make a copy into a new array if the input is already an ndarray, so it shouldn't be a slowdown if tr_labels is already an array. Along a similar vein, if you want to allow subclasses of ndarray you can use np.asanyarray, which will pass through any subclass of ndarray (such as np.matrix or masked arrays) without extra copying. These are just two examples of the many array-creation functions numpy provides for existing data. There are often multiple right answers, but sometimes one may be more efficient (memory wise) than another.
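A small demonstration of the no-copy behaviour (my own example):
>>> import numpy as np
>>> a = np.array([1, 0, 1])
>>> np.asarray(a) is a            # already an ndarray: returned as-is, no copy
True
>>> np.asarray([1, 0, 1]) is a    # a plain list is converted into a new array
False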
I generally use MATLAB and Octave, and I recently switched to Python and NumPy.
In numpy when I define an array like this
>>> a = np.array([[2,3],[4,5]])
it works great and size of the array is
>>> a.shape
(2, 2)
which is also the same as in MATLAB
But when I extract the entire first column and check its size
>>> b = a[:,0]
>>> b.shape
(2,)
I get size (2,). What is this? I expected the size to be (2, 1). Perhaps I have misunderstood a basic concept. Can anyone clarify this for me?
A 1D numpy array* is literally 1D - it has no size in any second dimension, whereas in MATLAB, a '1D' array is actually 2D, with a size of 1 in its second dimension.
If you want your array to have size 1 in its second dimension you can use its .reshape() method:
a = np.zeros(5,)
print(a.shape)
# (5,)
# explicitly reshape to (5, 1)
print(a.reshape(5, 1).shape)
# (5, 1)
# or use -1 in the first dimension, so that its size in that dimension is
# inferred from its total length
print(a.reshape(-1, 1).shape)
# (5, 1)
Edit
As Akavall pointed out, I should also mention np.newaxis as another method for adding a new axis to an array. Although I personally find it a bit less intuitive, one advantage of np.newaxis over .reshape() is that it allows you to add multiple new axes in an arbitrary order without explicitly specifying the shape of the output array, which is not possible with the .reshape(-1, ...) trick:
a = np.zeros((3, 4, 5))
print(a[np.newaxis, :, np.newaxis, ..., np.newaxis].shape)
# (1, 3, 1, 4, 5, 1)
np.newaxis is just an alias of None, so you could do the same thing a bit more compactly using a[None, :, None, ..., None].
* An np.matrix, on the other hand, is always 2D, and will give you the indexing behavior you are familiar with from MATLAB:
a = np.matrix([[2, 3], [4, 5]])
print(a[:, 0].shape)
# (2, 1)
For more info on the differences between arrays and matrices, see here.
Typing help(np.shape) gives some insight into what is going on here. For starters, you can get the output you expect by typing:
b = np.array([a[:, 0]]).T
Basically numpy defines things a little differently than MATLAB. In the numpy environment, a vector only has one dimension, and an array is a vector of vectors, so it can have more dimensions. In your first example, your array is a vector of two vectors, i.e.:
a = np.array([vec1, vec2])
So a has two dimensions, and in your example the number of elements in both dimensions is the same, 2. Your array is therefore 2 by 2. When you take a slice out of this, you are reducing the number of dimensions that you have by one. In other words, you are taking a vector out of your array, and that vector only has one dimension, which also has 2 elements, but that's it. Your vector is now 2 by _. There is nothing in the second spot because the vector is not defined there.
You could think of it in terms of spaces too. Your first array is in the space R^(2x2) and your second vector is in the space R^(2). This means that the array is defined on a different (and bigger) space than the vector.
That was a lot to basically say that you took a slice out of your array, and unlike MATLAB, numpy does not represent vectors (1 dimensional) in the same way as it does arrays (2 or more dimensions).
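One practical consequence (my own example): indexing a column with an integer drops the dimension, while indexing with a slice keeps it, which is one way to get the MATLAB-like (2, 1) shape:
>>> a = np.array([[2, 3], [4, 5]])
>>> a[:, 0].shape     # integer index: dimension is dropped
(2,)
>>> a[:, 0:1].shape   # slice: dimension is kept
(2, 1)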