Is numpy.asarray() necessary here?

I'm following an example in the book "Deep Learning with Python" by Francois Chollet.
There's an example (pg. 70) where they convert an array of ints to an array of float32.
The relevant lines are
from keras.datasets import imdb
(tr_data, tr_labels), (ts_data, ts_labels) = imdb.load_data(num_words=10000)
...
import numpy as np
y_train = np.asarray(tr_labels).astype('float32')
tr_labels is simply an array of ints
array([1, 0, 0, ..., 0, 1, 0])
y_train is an array of float32
array([1., 0., 0., ..., 0., 1., 0.], dtype=float32)
But why do we need to call np.asarray() when this alone seems to do the trick:
y_train = tr_labels.astype('float32')
Just wondering if numpy.asarray() does some additional data processing I'm not aware of.

No, it's not necessary.
np.asarray is sometimes useful if you aren't sure what the datatype is (or if it can change), and it won't make a copy into a new array if the input is already an ndarray, so it shouldn't cause a slowdown if tr_labels is already an array.
In a similar vein, if you want to allow subclasses of ndarray you can use np.asanyarray, which passes through any subclass of ndarray (such as np.matrix or masked arrays) without extra copying. These are just two examples of the many array creation functions NumPy provides for building arrays from existing data. There are often multiple right answers, but sometimes one may be more efficient (memory-wise) than another.
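For example, a quick check (a small sketch, assuming tr_labels is already an ndarray) shows that asarray returns the input unchanged while np.array makes a fresh copy:
import numpy as np

tr_labels = np.array([1, 0, 0, 1])                  # hypothetical stand-in for the IMDb labels
print(np.asarray(tr_labels) is tr_labels)           # True: no copy is made
print(np.array(tr_labels) is tr_labels)             # False: np.array copies by default
y_train = np.asarray(tr_labels).astype('float32')   # astype always returns a new float32 array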

Add a level to Numpy array

I have a problem with a numpy array.
In particular, suppose we have a matrix
x = np.array([[1., 2., 3.], [4., 5., 6.]])
with shape (2,3). I want to convert each float number into a list, so as to obtain the array [[[1.], [2.], [3.]], [[4.], [5.], [6.]]] with shape (2,3,1).
I tried to convert each float number to a list (i.e., x[0][0] = [x[0][0]]) but it does not work.
Can anyone help me? Thanks
What you want is to add another dimension to your numpy array. One way of doing it is using reshape:
x = x.reshape(2,3,1)
output:
[[[1.]
  [2.]
  [3.]]

 [[4.]
  [5.]
  [6.]]]
There is a function in NumPy to perform exactly what @Valdi_Bo mentions. You can use np.expand_dims and add a new dimension along axis 2, as follows:
x = np.expand_dims(x, axis=2)
Refer:
np.expand_dims
Actually, you want to add a dimension (not level).
To do it, run:
result = x[...,np.newaxis]
Its shape is just (2, 3, 1).
Or save the result back under x.
You are trying to add a new dimension to the numpy array. There are multiple ways of doing this, as other answers mentioned: np.expand_dims, np.newaxis, np.reshape, etc. But I usually use the following, as I find it the most readable, especially when you are vectorizing operations over multiple tensors and dealing with complex broadcasting (check this Bounty question that I solved with this method).
x[:,:,None].shape
(2,3,1)
x[None,:,None,:,None].shape
(1,2,1,3,1)
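For instance, here is a rough sketch (not from the original answer) of how the extra axes let two arrays broadcast against each other:
import numpy as np

a = np.arange(6).reshape(2, 3)          # shape (2, 3)
b = np.arange(4)                        # shape (4,)
out = a[:, :, None] * b[None, None, :]  # broadcasts to shape (2, 3, 4)
print(out.shape)                        # (2, 3, 4)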
Well, maybe this is overkill for the array you have, but definitely the most efficient solution is to use np.lib.stride_tricks.as_strided. This way no data is copied.
import numpy as np
x = np.array([[1., 2., 3.], [4., 5., 6.]])
newshape = x.shape[:-1] + (x.shape[-1], 1)
newstrides = x.strides + x.strides[-1:]
a = np.lib.stride_tricks.as_strided(x, shape=newshape, strides=newstrides)
results in:
array([[[1.],
        [2.],
        [3.]],

       [[4.],
        [5.],
        [6.]]])
>>> a.shape
(2, 3, 1)
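You can confirm that no data was copied (a quick check, not part of the original snippet):
print(np.shares_memory(a, x))   # True: the strided view reuses x's buffer
a[0, 0, 0] = 42.0
print(x[0, 0])                  # 42.0: changing the view changes x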

What is the difference between .flatten() and .view(-1) in PyTorch?

Both .flatten() and .view(-1) flatten a tensor in PyTorch. What's the difference?
Does .flatten() copy the data of the tensor?
Is .view(-1) faster?
Is there any situation that .flatten() doesn't work?
In addition to @adeelh's comment, there is another difference: torch.flatten() results in a call to .reshape(), and the differences between .reshape() and .view() are:
[...] torch.reshape may return a copy or a view of the original tensor. You can not count on that to return a view or a copy.
Another difference is that reshape() can operate on both contiguous and non-contiguous tensor while view() can only operate on contiguous tensor. Also see here about the meaning of contiguous
For context:
The community had requested a flatten function for a while, and after Issue #7743 the feature was implemented in PR #8578.
You can see the implementation of flatten here, where a call to .reshape() can be seen in the return line.
flatten is simply a convenient alias of a common use-case of view.¹
There are several others:
Function              Equivalent view logic
flatten()             view(-1)
flatten(start, end)   view(*t.shape[:start], -1, *t.shape[end+1:])
squeeze()             view(*[s for s in t.shape if s != 1])
unsqueeze(i)          view(*t.shape[:i], 1, *t.shape[i:])
Note that flatten allows you to flatten a specific contiguous subset of dimensions, with the start_dim and end_dim arguments.
¹ Actually, flatten uses the superficially equivalent reshape under the hood.
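A quick sanity check of the flatten(start, end) row (a small sketch, not from the original answer):
import torch

t = torch.arange(24).reshape(2, 3, 4)
a = t.flatten(1, 2)                            # flatten dims 1..2 -> shape (2, 12)
b = t.view(*t.shape[:1], -1, *t.shape[3:])     # the equivalent view expression
print(a.shape == b.shape, torch.equal(a, b))   # True True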
First of all, .view() works only on contiguous data, while .flatten() works on both contiguous and non-contiguous data. Functions like transpose, which produce non-contiguous data, can be acted upon by .flatten() but not by .view().
Coming to copying of data: neither .view() nor .flatten() copies data when working on contiguous data. However, in the case of non-contiguous data, .flatten() first copies the data into contiguous memory and then changes the dimensions. Any change in the new tensor would not affect the original tensor.
ten=torch.zeros(2,3)
ten_view=ten.view(-1)
ten_view[0]=123
ten
>>tensor([[123., 0., 0.],
[ 0., 0., 0.]])
ten=torch.zeros(2,3)
ten_flat=ten.flatten()
ten_flat[0]=123
ten
>>tensor([[123., 0., 0.],
[ 0., 0., 0.]])
In the above code, the tensor ten has a contiguous memory allocation. Any changes to ten_view or ten_flat are reflected in the tensor ten.
ten=torch.zeros(2,3).transpose(0,1)
ten_flat=ten.flatten()
ten_flat[0]=123
ten
>>tensor([[0., 0.],
[0., 0.],
[0., 0.]])
In this case the non-contiguous transposed tensor ten is used for flatten(). Any changes made to ten_flat are not reflected in ten.
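One extra way to see whether the data was copied (a quick check, assuming a recent PyTorch) is to compare the storage addresses with data_ptr():
ten = torch.zeros(2, 3)
print(ten.flatten().data_ptr() == ten.data_ptr())       # True: contiguous, no copy
print(ten.t().flatten().data_ptr() == ten.data_ptr())   # False: non-contiguous, data copied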

What are the differences between these Numpy array creation functions? [duplicate]

What is the difference between NumPy's np.array and np.asarray? When should I use one rather than the other? They seem to generate identical output.
The definition of asarray is:
def asarray(a, dtype=None, order=None):
    return array(a, dtype, copy=False, order=order)
So it is like array, except it has fewer options, and copy=False. array has copy=True by default.
The main difference is that array (by default) will make a copy of the object, while asarray will not unless necessary.
Since other questions are being redirected to this one which ask about asanyarray or other array creation routines, it's probably worth having a brief summary of what each of them does.
The differences are mainly about when to return the input unchanged, as opposed to making a new array as a copy.
array offers a wide variety of options (most of the other functions are thin wrappers around it), including flags to determine when to copy. A full explanation would take just as long as the docs (see Array Creation), but briefly, here are some examples:
Assume a is an ndarray, and m is a matrix, and they both have a dtype of float32:
np.array(a) and np.array(m) will copy both, because that's the default behavior.
np.array(a, copy=False) and np.array(m, copy=False) will copy m but not a, because m is a matrix rather than a plain ndarray (and subok defaults to False).
np.array(a, copy=False, subok=True) and np.array(m, copy=False, subok=True) will copy neither, because m is a matrix, which is a subclass of ndarray.
np.array(a, dtype=int, copy=False, subok=True) will copy both, because the dtype is not compatible.
Most of the other functions are thin wrappers around array that control when copying happens:
asarray: The input will be returned uncopied iff it's a compatible ndarray (copy=False).
asanyarray: The input will be returned uncopied iff it's a compatible ndarray or subclass like matrix (copy=False, subok=True).
ascontiguousarray: The input will be returned uncopied iff it's a compatible ndarray in contiguous C order (copy=False, order='C').
asfortranarray: The input will be returned uncopied iff it's a compatible ndarray in contiguous Fortran order (copy=False, order='F').
require: The input will be returned uncopied iff it's compatible with the specified requirements string.
copy: The input is always copied.
fromiter: The input is treated as an iterable (so, e.g., you can construct an array from an iterator's elements, instead of an object array with the iterator); always copied.
There are also convenience functions, like asarray_chkfinite (same copying rules as asarray, but raises ValueError if there are any nan or inf values), and constructors for subclasses like matrix or for special cases like record arrays, and of course the actual ndarray constructor (which lets you create an array directly out of strides over a buffer).
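For instance (a small sketch; np.matrix is used only because it is a convenient ndarray subclass):
import numpy as np

m = np.matrix(np.ones((2, 2)))
print(type(np.asarray(m)))       # <class 'numpy.ndarray'>: forced back to the base class
print(type(np.asanyarray(m)))    # <class 'numpy.matrix'>: subclass passed through
print(np.asanyarray(m) is m)     # True: no copy was made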
The difference can be demonstrated by this example:
Generate a matrix.
>>> A = numpy.matrix(numpy.ones((3, 3)))
>>> A
matrix([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
Use numpy.array to modify A. Doesn't work because you are modifying a copy.
>>> numpy.array(A)[2] = 2
>>> A
matrix([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
Use numpy.asarray to modify A. It works because you are modifying A itself.
>>> numpy.asarray(A)[2] = 2
>>> A
matrix([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 2., 2., 2.]])
The differences are mentioned quite clearly in the documentation of array and asarray. They lie in the argument list and hence in the action of the function depending on those parameters.
The function definitions are :
numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
and
numpy.asarray(a, dtype=None, order=None)
The following arguments are those that may be passed to array and not asarray, as mentioned in the documentation:
copy : bool, optional If true (default), then the object is copied.
Otherwise, a copy will only be made if __array__ returns a copy, if
obj is a nested sequence, or if a copy is needed to satisfy any of the
other requirements (dtype, order, etc.).
subok : bool, optional If True, then sub-classes will be
passed-through, otherwise the returned array will be forced to be a
base-class array (default).
ndmin : int, optional Specifies the minimum number of dimensions that
the resulting array should have. Ones will be pre-pended to the shape
as needed to meet this requirement.
asarray(x) is like array(x, copy=False)
Use asarray(x) when you want to ensure that x will be an array before any other operations are done. If x is already an array, no copy is made, so there is no redundant performance hit.
Here is an example of a function that ensures x is converted into an array first.
def mysum(x):
    return np.asarray(x).sum()
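It then works on plain Python sequences and arrays alike, for example:
print(mysum([1, 2, 3]))              # 6
print(mysum(np.array([1.5, 2.5])))   # 4.0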
Here's a simple example that can demonstrate the difference.
The main difference is that array always makes a copy of the original data by default, whereas asarray copies only when it has to (for example, when a different dtype is requested, as below).
import numpy as np
a = np.arange(0.0, 10.2, 0.12)
int_cvr = np.asarray(a, dtype = np.int64)
Because the requested dtype differs, asarray makes a copy here, so the contents of the original array a remain untouched, and we can perform any operation on the data through the new object int_cvr without modifying the original array.
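You can confirm this (a quick check, not part of the original snippet):
print(np.shares_memory(a, int_cvr))         # False: the dtype changed, so asarray copied
print(np.shares_memory(a, np.asarray(a)))   # True: same dtype requested, no copy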
Let's understand the difference between np.array() and np.asarray() with an example:
np.array(): Converts input data (list, tuple, array, or other sequence type) to an ndarray and copies the input data by default.
np.asarray(): Converts input data to an ndarray but does not copy if the input is already an ndarray.
# Create an array...
arr = np.ones(5)        # array([1., 1., 1., 1., 1.])
# Now try to modify arr through the array() method. Let's see...
np.array(arr)[3] = 200
arr                     # array([1., 1., 1., 1., 1.])
No change in the array, because we modified a copy of arr.
Now, modify arr through the asarray() method.
np.asarray(arr)[3] = 200
arr                     # array([  1.,   1.,   1., 200.,   1.])
The change occurs in this array because we are working with the original array now.

What's the difference between reshape and view in pytorch?

In numpy, we use ndarray.reshape() for reshaping an array.
I noticed that in pytorch, people use torch.view(...) for the same purpose, but at the same time, there is also a torch.reshape(...) existing.
So I am wondering what the differences are between them and when I should use either of them?
torch.view has existed for a long time. It will return a tensor with the new shape. The returned tensor will share the underlying data with the original tensor.
See the documentation here.
On the other hand, it seems that torch.reshape has been introduced recently in version 0.4. According to the documentation, this method will
Returns a tensor with the same data and number of elements as input, but with the specified shape. When possible, the returned tensor will be a view of input. Otherwise, it will be a copy. Contiguous inputs and inputs with compatible strides can be reshaped without copying, but you should not depend on the copying vs. viewing behavior.
It means that torch.reshape may return a copy or a view of the original tensor. You can not count on that to return a view or a copy. According to the developer:
if you need a copy use clone(); if you need the same storage use view(). The semantics of reshape() are that it may or may not share the storage and you don't know beforehand.
Another difference is that reshape() can operate on both contiguous and non-contiguous tensor while view() can only operate on contiguous tensor. Also see here about the meaning of contiguous.
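A small sketch of the distinction the developer describes (data_ptr() is used here only to expose the underlying storage address):
import torch

t = torch.arange(6)
v = t.view(2, 3)     # shares storage with t
c = t.clone()        # independent copy of the data
print(v.data_ptr() == t.data_ptr())   # True
print(c.data_ptr() == t.data_ptr())   # False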
Although both torch.view and torch.reshape are used to reshape tensors, here are the differences between them.
As the name suggests, torch.view merely creates a view of the original tensor. The new tensor will always share its data with the original tensor. This means that if you change the original tensor, the reshaped tensor will change and vice versa.
>>> z = torch.zeros(3, 2)
>>> x = z.view(2, 3)
>>> z.fill_(1)
>>> x
tensor([[1., 1., 1.],
[1., 1., 1.]])
To ensure that the new tensor always shares its data with the original, torch.view imposes some contiguity constraints on the shapes of the two tensors [docs]. More often than not this is not a concern, but sometimes torch.view throws an error even if the shapes of the two tensors are compatible. Here's a famous counter-example.
>>> z = torch.zeros(3, 2)
>>> y = z.t()
>>> y.size()
torch.Size([2, 3])
>>> y.view(6)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: invalid argument 2: view size is not compatible with input tensor's
size and stride (at least one dimension spans across two contiguous subspaces).
Call .contiguous() before .view().
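As the error message suggests, calling .contiguous() first (which copies the data into a contiguous layout) makes the view work:
>>> y.contiguous().view(6)
tensor([0., 0., 0., 0., 0., 0.])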
torch.reshape doesn't impose any contiguity constraints, but also doesn't guarantee data sharing. The new tensor may be a view of the original tensor, or it may be a new tensor altogether.
>>> z = torch.zeros(3, 2)
>>> y = z.reshape(6)
>>> x = z.t().reshape(6)
>>> z.fill_(1)
tensor([[1., 1.],
[1., 1.],
[1., 1.]])
>>> y
tensor([1., 1., 1., 1., 1., 1.])
>>> x
tensor([0., 0., 0., 0., 0., 0.])
TL;DR:
If you just want to reshape tensors, use torch.reshape. If you're also concerned about memory usage and want to ensure that the two tensors share the same data, use torch.view.
view() will try to change the shape of the tensor while keeping the underlying data allocation the same, thus data will be shared between the two tensors. reshape() will create a new underlying memory allocation if necessary.
Let's create a tensor:
a = torch.arange(8).reshape(2, 4)
The memory is allocated as one contiguous block (it is C contiguous, i.e. the rows are stored next to each other). stride() gives the number of elements that must be skipped to reach the next element along each dimension:
a.stride()
(4, 1)
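As a side note (not part of the original answer), NumPy reports strides in bytes rather than elements, which you can see by converting the same tensor:
>>> a.stride()          # PyTorch: elements to skip per dimension
(4, 1)
>>> a.numpy().strides   # NumPy: bytes per dimension (int64, so 8 bytes per element)
(32, 8)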
We want its shape to become (4, 2); we can use view:
a.view(4,2)
The underlying data allocation has not changed, the tensor is still C contiguous:
a.view(4, 2).stride()
(2, 1)
Let's try with a.t(). Transpose() doesn't modify the underlying memory allocation and therefore a.t() is not contiguous.
a.t().is_contiguous()
False
Although it is not contiguous, the stride information is sufficient to iterate over the tensor
a.t().stride()
(1, 4)
view() doesn't work anymore:
a.t().view(2, 4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
Suppose instead we wanted to obtain the shape (2, 4) from the transpose by using view(2, 4). What would the memory allocation have to look like? The stride would have to be something like (4, 2), but we would have to go back to the beginning of the tensor after we reach the end, so it doesn't work. In this case, reshape() would create a new tensor with a different memory allocation to make the transpose contiguous.
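You can verify the new allocation (a quick check, reusing the same a as above; data_ptr() returns the address of the underlying storage):
>>> b = a.t().reshape(2, 4)
>>> b.data_ptr() == a.data_ptr()
False
>>> a.view(4, 2).data_ptr() == a.data_ptr()
True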
Note that we can use view to split the first dimension of the transpose.
Unlike what is said in the accepted and other answers, view() can operate on non-contiguous tensors!
a.t().view(2, 2, 2)
a.t().view(2, 2, 2).stride()
(2, 1, 4)
According to the documentation:
For a tensor to be viewed, the new view size must be compatible with its original size and stride, i.e., each new view dimension must either be a subspace of an original dimension, or only span across original dimensions d, d+1, …, d+k that satisfy the following contiguity-like condition: stride[i] = stride[i+1] × size[i+1] for all i = d, …, d+k−1.
Here that's because the first two dimensions after applying view(2, 2, 2) are subspaces of the transpose's first dimension.
For more information about contiguity have a look at my answer in this thread
Tensor.reshape() is more robust. It will work on any tensor, while Tensor.view() works only on tensor t where t.is_contiguous()==True.
Explaining contiguous versus non-contiguous memory is another story, but you can always make the tensor t contiguous by calling t.contiguous(), and then you can call view() without the error.
I would say the answers here are technically correct, but there's another reason for the existence of reshape: pytorch is usually considered more convenient than other frameworks because it is closer to python and numpy. It's interesting that the question involves numpy.
Let's look at size and shape in pytorch. size is a method, so you call it as x.size(). shape in pytorch is not a method but an attribute, just as in numpy, where shape is not a function either and you use it as x.shape. So it's handy to have both of them in pytorch; if you came from numpy, it is nice to be able to use the same spelling.
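For example, both spellings give the same information:
>>> x = torch.zeros(2, 3)
>>> x.size()
torch.Size([2, 3])
>>> x.shape
torch.Size([2, 3])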

grouping data in arrays (python)

I'm trying to make a nice ordered way of grouping objects in an array. Now I've tried the following, but it gives me an error.
Any tips?
#body: mass, [x,y], [vx,vy], [ax, ay]
bodies = np.array([[1E3, [0,0], [0,0], [0.0]],\
[1, [0,200], [31.6,0], [0,0]]])
ValueError: setting an array element with a sequence.
You can use dtype=object, and then store anything you want—floats, tuples, lists, arrays. But really, that's not a good idea; you pretty much lose all the benefits of numpy.
And, because it's a bad idea, numpy doesn't make it easy for you. If you construct an array out of a list, it assumes any sub-lists are dimensions of the array, and if it can't make any sense of things that way, it gives you this error.
Why not just store the bodies as flat rows of numbers? You already have to interpret the rows as bodies at a higher level, and really, how is x, y = bodies[1][1] any better than x, y = bodies[1][1:3]?
If you really want to, you could create an array with one more dimension, but… why?
You also might want to consider using pandas instead of raw numpy, or using a database instead of using numpy in the first place, or just keeping each body as a Python object (whether sticking them in a numpy array or not), or something else entirely. Without knowing what you're trying to accomplish, it's hard to be sure what fits your needs. But it's pretty unlikely that what you're trying to do is the right thing to do.
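A minimal sketch of the flat-row idea (the column order mass, x, y, vx, vy, ax, ay is just one possible convention, not something from the question):
import numpy as np

# one row per body: mass, x, y, vx, vy, ax, ay
bodies = np.array([[1e3, 0.0,   0.0,  0.0, 0.0, 0.0, 0.0],
                   [1.0, 0.0, 200.0, 31.6, 0.0, 0.0, 0.0]])

masses    = bodies[:, 0]    # shape (2,)
positions = bodies[:, 1:3]  # shape (2, 2)
x, y = bodies[1, 1:3]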
You could use a compound (structured) dtype:
bodies = np.array([(1E3, [0, 0],   [0, 0],    [0, 0]),
                   (1,   [0, 200], [31.6, 0], [0, 0])],
                  dtype=[('mass', float), ('xy', '2float'),
                         ('vxy', '2float'), ('axy', '2float')])
and access the "columns" with
In [63]: bodies['mass']
Out[63]: array([ 1000., 1.])
In [64]: bodies['xy']
Out[64]:
array([[   0.,    0.],
       [   0.,  200.]])
etc.,
but this will not make your life any easier.
I am making an n-body simulator
Calculating distances between objects will be a common operation in an n-body simulator. You might want to use scipy.spatial.distance.pdist or cdist for that. Notice that these functions expect their inputs to be NumPy ndarrays of simple, homogeneous dtype. So if you were to use an array of compound dtype, you'd always have to slice out a field first before you could use any of these functions.
Therefore, it would probably be simpler to just store arrays of simple, homogeneous dtype from the beginning and avoid the array of compound dtype.
I suggest making multiple 1-dimensional arrays:
mass = np.array(...)
x = np.array(...)
y = np.array(...)
or maybe use some 2D arrays of simple, homogeneous dtype:
pos = np.array([(x0, y0), (x1, y1), ...], dtype='float')
All your equations will be more readable this way too.
Instead of accessing the 2D position array with bodies['xy'] you would simply write pos. That's one less set of brackets your eyes will have to parse.
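For instance, with a plain 2D pos array the pairwise distances mentioned above become a one-liner (a sketch with made-up coordinates):
import numpy as np
from scipy.spatial.distance import cdist

pos = np.array([(0.0, 0.0), (0.0, 200.0), (100.0, 100.0)], dtype=float)
dist = cdist(pos, pos)   # (3, 3) matrix of pairwise Euclidean distances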
