Numpy: could not broadcast input array from shape (3) into shape (1)

I want to create a numpy array b where each component is a 2D matrix whose dimensions are determined by the coordinates of vector a.
What I get doing the following satisfies me:
>>> a = [3,4,1]
>>> b = [np.zeros((a[i], a[i - 1] + 1)) for i in range(1, len(a))]
>>> np.array(b)
array([ array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]]),
array([[ 0., 0., 0., 0., 0.]])], dtype=object)
but I have found this pathological case where it does not work:
>>> a = [2,1,1]
>>> b = [np.zeros((a[i], a[i - 1] + 1)) for i in range(1, len(a))]
>>> b
[array([[ 0., 0., 0.]]), array([[ 0., 0.]])]
>>> np.array(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not broadcast input array from shape (3) into shape (1)

I will present a solution to the problem, but do take into account what was said in the comments. Having Numpy arrays that are not aligned prevents most of the useful operations from working their magic. Consider using lists instead.
That being said, curious error indeed. I got the thing to work by assigning in a basic for-loop instead of using the np.array call.
a = [2,1,1]
b = np.zeros(len(a)-1, dtype=object)
for i in range(1, len(a)):
    b[i-1] = np.zeros((a[i], a[i - 1] + 1))
And the result:
>>> b
array([array([[0., 0., 0.]]), array([[0., 0.]])], dtype=object)

This is a bit peculiar. Typically, numpy tries to create a single array with a common data type from the input of np.array. A list of arrays is interpreted with the list as the new leading dimension. For instance, np.array([np.zeros((3, 1)), np.zeros((3, 1))]) produces a 2 x 3 x 1 array. This can only happen if the arrays in your list match in shape. Otherwise, you end up with an array of arrays (with dtype=object), which, as commented, is not really an ideal scenario.
However, your error seems to occur when the first dimension matches. Numpy for some reason tries to broadcast the arrays somehow and fails. I can reproduce your error even if the arrays are of higher dimension, as long as the first dimension between arrays matches.
I know this isn't a solution, but this wouldn't fit in a comment. As noted by @roganjosh, making this kind of array really gives you no benefit. You're better off sticking to a list of arrays for readability and to avoid the cost of creating these arrays.
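A minimal sketch of both behaviours described in this answer. The helper name to_object_array is made up for this illustration and is not a NumPy function; it fills a pre-allocated object array one slot at a time, so numpy never has to infer a common shape.

```python
import numpy as np

def to_object_array(arrays):
    """Wrap a list of arrays in a 1-D object array (hypothetical helper)."""
    out = np.empty(len(arrays), dtype=object)
    for i, arr in enumerate(arrays):
        out[i] = arr  # each slot stores a reference to the original array
    return out

# Matching shapes: np.array stacks them along a new leading axis.
stacked = np.array([np.zeros((3, 1)), np.zeros((3, 1))])
print(stacked.shape)  # (2, 3, 1)

# Only the first dimension matches: build the object array by hand.
ragged = to_object_array([np.zeros((1, 3)), np.zeros((1, 2))])
print(ragged.shape, ragged[0].shape, ragged[1].shape)
```

A plain Python list of the two arrays carries the same information, which is why the advice above is to prefer the list unless an object array is really needed.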

Related

Why replacing values in numpy array does not always work

I am trying to replace/overwrite values in an array using the following commands:
import numpy as np
test = np.array([[4,5,0],[0,0,0],[0,0,6]])
test
Out[20]:
array([[4., 5., 0.],
[0., 0., 0.],
[0., 0., 6.]])
test[np.where(test[...,0] != 0)][...,1:3] = np.array([[10,11]])
test
Out[22]:
array([[4., 5., 0.],
[0., 0., 0.],
[0., 0., 6.]])
However, as one can see in Out[22], the array test has not been modified. So I am concluding that it is not possible to simply overwrite part of an array or just a few cells.
Nevertheless, in other contexts, it is possible to overwrite a few cells of an array. For example, in the code below:
test = np.array([[1,2,0],[0,0,0],[0,0,3]])
test
Out[11]:
array([[1., 2., 0.],
[0., 0., 0.],
[0., 0., 3.]])
test[test>0]
Out[12]:
array([1., 2., 3.])
test[test>0] = np.array([4,5,6])
test
Out[14]:
array([[4., 5., 0.],
[0., 0., 0.],
[0., 0., 6.]])
Therefore, my 2 questions:
1- Why does the first command
test[np.where(test[...,0] != 0)][...,1:3] = np.array([10,11])
not allow modifying the array test? Why does it not allow accessing the array cells and overwriting them?
2- How could I make it work considering that for my code I would need to select the cells using the command above?
Many thanks!

I'll do you one up. This does work:
test[...,1:3][np.where(test[...,0] != 0)] = np.array([[10,11]])
array([[ 4, 10, 11],
[ 0, 0, 0],
[ 0, 0, 6]])
Why? It's the combination of two factors - numpy indexing and .__setitem__ calls.
The python interpreter sort of reads lines backwards: it first evaluates everything to the left of the last pair of brackets, and when it gets to =, it calls .__setitem__ on the result. __setitem__ is (hopefully) a method of that object, and takes two inputs: the indices (whatever is between [...] just before the =) and the value being assigned.
a[b] = c #is interpreted as
a.__setitem__(b, c)
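To see which method is called on which object in a chained assignment, here is a small sketch with a made-up class, Traced, that only records calls. It is purely illustrative and has nothing to do with numpy's own implementation.

```python
class Traced:
    """Hypothetical class that logs indexing calls instead of storing data."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # Indexing hands back a new, unnamed object, like a copy would be.
        return Traced("temporary", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key, value))


log = []
obj = Traced("obj", log)
obj["a"]["b"] = 1  # __getitem__ on obj, then __setitem__ on the temporary

for entry in log:
    print(entry)
```

The assignment lands on the temporary object returned by the first pair of brackets, not on obj itself.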
Now, when we index in numpy we have three basic ways we can do it.
slicing (returns views)
'advanced indexing' (returns copies)
'simple indexing' (also returns copies)
One major difference between "advanced" and "simple" indexing is that a numpy array's __setitem__ function can interpret advanced indexes. And views mean the data addresses are the same, so we don't need __setitem__ to get to them.
So:
test[np.where(test[...,0] != 0)][...,1:3] = np.array([[10,11]]) #is interpreted as
(test[np.where(test[...,0] != 0)]).__setitem__((..., slice(1, 3)),
                                               np.array([[10,11]]))
But, since np.where(test[...,0] != 0) is an advanced index, (test[np.where(test[...,0] != 0)]) returns a copy, which is then lost because it is never assigned to a name. The assignment does set the selected elements of that copy to [10,11], but the copy is discarded right afterwards and test itself is untouched.
If we do:
test[..., 1:3][np.where(test[..., 0] != 0)] = np.array([[10, 11]]) #is interpreted as
(test[..., 1:3]).__setitem__( np.where(test[...,0] != 0), np.array([[10,11]]) )
test[...,1:3] is a view, so it still points to the same memory. Now __setitem__ looks for the locations in test[...,1:3] that correspond to np.where(test[...,0] != 0), and sets them equal to np.array([[10,11]]). And everything works.
You can also do this:
test[np.where(test[...,0] != 0), 1:3] = np.array([10, 11])
Now, since all the indexing is in one set of brackets, it's calling test.__setitem__ on those indices, which sets the data correctly as well.
Even simpler (and most pythonic) would be:
test[test[...,0] != 0, 1:3] = np.array([10,11])
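The copy-versus-view distinction can be checked directly. The sketch below uses np.shares_memory to test whether an indexing result still points at the data of test, and compares the chained assignment with the single-bracket one on copies of the same array (the names mask, chained and single are only for this example).

```python
import numpy as np

test = np.array([[4., 5., 0.], [0., 0., 0.], [0., 0., 6.]])
mask = test[..., 0] != 0  # rows whose first column is non-zero

# Advanced indexing yields a copy; slicing yields a view.
print(np.shares_memory(test, test[np.where(mask)]))  # False
print(np.shares_memory(test, test[..., 1:3]))        # True

chained = test.copy()
chained[np.where(mask)][..., 1:3] = [10, 11]  # writes into a temporary copy

single = test.copy()
single[mask, 1:3] = [10, 11]  # one __setitem__ call on the array itself

print(np.array_equal(chained, test))  # True: nothing changed
print(single[0])                      # [ 4. 10. 11.]
```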

How do I convert a list of arrays to a single multidimensional numpy array?

I am trying to extract features from .wav files by using MFCC's extracted from wav files.
I'm having trouble converting my list of MFCC's to a numpy array. From my understanding, the error is due to the MFCC's within the MFCC list not being the same dimensions, however I'm not sure of the best way to resolve this.
When running this code below:
X = []
y = []
_min, _max = float('inf'), -float('inf')
for _ in tqdm(range(len(new_dataset))):
    rand_class = np.random.choice(class_distribution.index, p=prob_distribution)
    file = np.random.choice(new_dataset[new_dataset.word == rand_class].index)
    label = new_dataset.at[file, 'word']
    X_sample = new_dataset.at[file,'coeff']
    _min = min(np.amin(X_sample), _min)  # running minimum over all samples
    _max = max(np.amax(X_sample), _max)  # running maximum over all samples
    X.append(X_sample if config.mode == 'conv' else X_sample.T)
    y.append(classes.index(label))
X, y = np.array(X), np.array(y) #crashes here
I get the following error message:
Traceback (most recent call last):
File "<ipython-input-150-8689abab6bcf>", line 14, in <module>
X, y = np.array(X), np.array(y)
ValueError: could not broadcast input array from shape (13,97) into shape (13)
adding print(X_sample.shape) in the loop produces:
:
(13, 74)
(13, 83)
(13, 99)
(13, 99)
(13, 99)
(13, 55)
(13, 92)
(13, 99)
(13, 99)
(13, 78)
...
From checking, it seems the MFCC's don't all have the same shape because the recordings are not all the same length.
I'd like to know if I'm correct in my assumption that this is the issue, and if so, how do I fix it? If this isn't the issue then I'd equally like to know the solution.
Thanks in advance!

This reproduces your error:
In [186]: np.array([np.zeros((4,5)),np.ones((4,6))])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-186-e369332b8a05> in <module>
----> 1 np.array([np.zeros((4,5)),np.ones((4,6))])
ValueError: could not broadcast input array from shape (4,5) into shape (4)
If the arrays all have the same shape:
In [187]: np.array([np.zeros((4,6)),np.ones((4,6))]).shape
Out[187]: (2, 4, 6)
If one or more differs in the first dimension, we get an object dtype array, essentially an array wrapper around the list:
In [188]: np.array([np.zeros((4,6)),np.ones((3,6))]).shape
Out[188]: (2,)
Don't try to combine arrays that (may) differ in shape unless you understand what you need, and what you intend to do with the result. It is possible to make an object dtype array with the first case, but the construction process is a bit roundabout. I won't go into that unless you really need such an array.

You will need to truncate or pad the time dimension in order to make the MFCCs into arrays of the same size. If the lengths vary a lot, you can use fixed-length analysis windows (say over 1 or 10 seconds of MFCCs) and have several of these per input audio clip. This principle is shown here: How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)?
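A rough sketch of the pad-or-truncate idea, assuming each MFCC array has shape (n_coeffs, n_frames) as in the printed shapes above. The function name pad_or_truncate and the target length of 99 frames are choices made for this example only; zero padding is one option among several.

```python
import numpy as np

def pad_or_truncate(mfcc, n_frames):
    """Force an (n_coeffs, time) array to exactly n_frames columns (hypothetical helper)."""
    mfcc = mfcc[:, :n_frames]             # drop any frames beyond the target
    missing = n_frames - mfcc.shape[1]    # 0 if already long enough
    return np.pad(mfcc, ((0, 0), (0, missing)), mode="constant")

# Stand-ins for real MFCCs of different durations.
samples = [np.ones((13, 74)), np.ones((13, 99)), np.ones((13, 120))]

X = np.array([pad_or_truncate(s, 99) for s in samples])
print(X.shape)  # (3, 13, 99)
```

Once every sample has the same shape, np.array can stack them into one 3D array without the broadcast error.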

The first case (arrays that match in the first dimension but differ in the second, as in In [186] above) does work if we create the object array first and then assign the list to it:
In [189]: arr = np.zeros(2,object)
In [190]: arr[:] = [np.zeros((4,5)),np.ones((4,6))]
In [191]: arr
Out[191]:
array([array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]]),
array([[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.]])], dtype=object)

How to append numpy.array to other numpy.array?

I want to create a 2D numpy.array knowing at the beginning only its shape, i.e. shape=2. Then, in a for loop, I want to create one-dimensional numpy.arrays and add them to the main matrix of shape=2, so I'll get something like this:
matrix=
[numpy.array 1]
[numpy.array 2]
...
[numpy.array n]
How can I achieve that? I try to use:
matrix = np.empty(shape=2)
for i in np.arange(100):
    array = np.zeros(random_value)
    matrix = np.append(matrix, array)
But as a result of print(np.shape(matrix)) after the loop, I get something like:
(some_number, )
How can I append each new array in the next row of the matrix? Thank you in advance.

I would suggest working with a list:
matrix = []
for i in range(10):
    a = np.ones(2)
    matrix.append(a)
matrix = np.array(matrix)
A list does not have the downside of being copied in memory every time you use append, so you avoid the problem described by ali_m. At the end of your operation you just convert the list object into a numpy array.

I suspect the root of your problem is the meaning of 'shape' in np.empty(shape=2)
If I run a small version of your code
matrix = np.empty(shape=2)
for i in np.arange(3):
    array = np.zeros(3)
    matrix = np.append(matrix, array)
I get
array([ 9.57895902e-259, 1.51798693e-314, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000])
See those 2 odd numbers at the start? Those are produced by np.empty(shape=2). That matrix starts as a (2,) shaped array, not an empty 2d array. append just adds sets of 3 zeros to that, resulting in a (11,) array.
Now if you started with a 2d array with the right number of columns, and did concatenate on the 1st dimension, you would get a multirow array. (Rows only have meaning in 2d or larger.)
mat=np.zeros((1,3))
for i in range(1,3):
    mat = np.concatenate([mat, np.ones((1,3))*i],axis=0)
produces:
array([[ 0., 0., 0.],
[ 1., 1., 1.],
[ 2., 2., 2.]])
A better way of doing an iterative construction like this is with list append
alist = []
for i in range(0,3):
    alist.append(np.ones((1,3))*i)
mat=np.vstack(alist)
alist is:
[array([[ 0., 0., 0.]]), array([[ 1., 1., 1.]]), array([[ 2., 2., 2.]])]
mat is
array([[ 0., 0., 0.],
[ 1., 1., 1.],
[ 2., 2., 2.]])
With vstack you can get by with np.ones((3,)), since it turns all of its inputs into 2d arrays.
append would work, but it also requires the axis=0 parameter, and it takes just 2 arrays at a time. It gets misused, often by mistaken analogy to the list append. It is just another front end to concatenate. So I prefer not to use it.
Notice that other posters assumed your random value changed during the iteration. That would produce arrays of differing lengths. For 1d appending that would still produce the long 1d array. But a 2d append wouldn't work, because a 2d array can't be ragged.
mat = np.zeros((2,),int)
for i in range(4):
    mat=np.append(mat,np.ones((i,),int)*i)
# array([0, 0, 1, 2, 2, 3, 3, 3])
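For completeness, a sketch of two row-by-row constructions that do produce a 2d result, assuming every row has the same length. The sizes n_rows and n_cols are arbitrary values picked for the illustration.

```python
import numpy as np

n_rows, n_cols = 3, 3

# Option 1: start from a genuinely empty 2d array with zero rows.
mat = np.empty((0, n_cols))
for i in range(n_rows):
    row = np.full((1, n_cols), float(i))   # a 2d row, shape (1, n_cols)
    mat = np.append(mat, row, axis=0)      # copies the whole array each time

# Option 2: allocate the final shape once and fill the rows in place.
filled = np.empty((n_rows, n_cols))
for i in range(n_rows):
    filled[i] = i  # the scalar is broadcast across the row

print(mat.shape, np.array_equal(mat, filled))
```

Option 2 avoids the repeated copying, but it only applies when the final number of rows is known before the loop starts; otherwise the list plus vstack pattern above is usually the simpler choice.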

The function you are looking for is np.vstack
Here is a modified version of your example
import numpy as np
matrix = np.empty(shape=2)
for i in np.arange(3):
    array = np.zeros(2)
    matrix = np.vstack((matrix, array))
The result is (the first row comes from np.empty, so its values are uninitialized and need not be zero):
array([[ 0., 0.],
[ 0., 0.],
[ 0., 0.],
[ 0., 0.]])

Python time optimisation of for loop using newaxis

I need to calculate n number of points(3D) with equal spacing along a defined line(3D).
I know the starting and end point of the line. First, I used
for k in range(nbin):
    step = k/float(nbin-1)
    bin_point.append(beam_entry+(step*(beamlet_intersection-beam_entry)))
Then I found that using append for large arrays takes more time, so I changed the code like this:
bin_point = [start_point+((k/float(nbin-1))*(end_point-start_point)) for k in range(nbin)]
I got a suggestion that using newaxis will further improve the time.
The modified code looks like this.
step = arange(nbin) / float(nbin-1)
bin_point = start_point + ( step[:,newaxis,newaxis]*((end_point - start_point))[newaxis,:,:] )
But I could not understand newaxis. I also doubt whether the same code will work if the structure or the shape of start_point and end_point changes. Similarly, how can I use newaxis to modify the following code?
for j in range(32): # for all los
    line_dist[j] = sqrt([sum(l) for l in (end_point[j]-start_point[j])**2])
Sorry for being so clunky, to be more clear the structure of the start_point and end_point are
array([ [[1,1,1],[],[],[]....[]],
[[],[],[],[]....[]],
[[],[],[],[]....[]]......,
[[],[],[],[]....[]] ])

Explanation of the newaxis version in the question: these are not matrix multiplies, ndarray multiply is element-by-element multiply with broadcasting. step[:,newaxis,newaxis] is num_steps x 1 x 1 and point[newaxis,:,:] is 1 x num_points x num_dimensions. Broadcasting together ndarrays with shape (num_steps x 1 x 1) and (1 x num_points x num_dimensions) will work, because the broadcasting rules are that every dimension should be either 1 or the same; it just means "repeat the array with dimension 1 as many times as the corresponding dimension of the other array". This results in an ndarray with shape (num_steps x num_points x num_dimensions) in a very efficient way; the i, j, k subscript will be the k-th coordinate of the i-th step along the j-th line (given by the j-th pair of start and end points).
Walkthrough:
>>> start_points = numpy.array([[1, 0, 0], [0, 1, 0]])
>>> end_points = numpy.array([[10, 0, 0], [0, 10, 0]])
>>> steps = numpy.arange(10)/9.0
>>> start_points.shape
(2, 3)
>>> steps.shape
(10,)
>>> steps[:,numpy.newaxis,numpy.newaxis].shape
(10, 1, 1)
>>> (steps[:,numpy.newaxis,numpy.newaxis] * start_points).shape
(10, 2, 3)
>>> (steps[:,numpy.newaxis,numpy.newaxis] * (end_points - start_points)) + start_points
array([[[ 1., 0., 0.],
[ 0., 1., 0.]],
[[ 2., 0., 0.],
[ 0., 2., 0.]],
[[ 3., 0., 0.],
[ 0., 3., 0.]],
[[ 4., 0., 0.],
[ 0., 4., 0.]],
[[ 5., 0., 0.],
[ 0., 5., 0.]],
[[ 6., 0., 0.],
[ 0., 6., 0.]],
[[ 7., 0., 0.],
[ 0., 7., 0.]],
[[ 8., 0., 0.],
[ 0., 8., 0.]],
[[ 9., 0., 0.],
[ 0., 9., 0.]],
[[ 10., 0., 0.],
[ 0., 10., 0.]]])
As you can see, this produces the correct answer :) In this case broadcasting (10,1,1) and (2,3) results in (10,2,3). What you had is broadcasting (10,1,1) and (1,2,3) which is exactly the same and also produces (10,2,3).
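As a cross-check of the walkthrough, the sketch below recomputes the points and compares them with np.linspace, which accepts array-valued endpoints in NumPy 1.16 and later. Using linspace here is an alternative brought in for comparison, not something the question asked for.

```python
import numpy as np

start_points = np.array([[1, 0, 0], [0, 1, 0]])
end_points = np.array([[10, 0, 0], [0, 10, 0]])
steps = np.arange(10) / 9.0

# Shape that broadcasting (10, 1, 1) against (2, 3) will produce.
print(np.broadcast(steps[:, np.newaxis, np.newaxis], start_points).shape)

manual = start_points + steps[:, np.newaxis, np.newaxis] * (end_points - start_points)
via_linspace = np.linspace(start_points, end_points, num=10)

print(manual.shape)
print(np.allclose(manual, via_linspace))
```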
The code for the distance part of the question does not need newaxis: the inputs are num_points x num_dimensions, the output is num_points, so one dimension has to be removed. That is actually the axis you sum along. This should work:
line_dist = numpy.sqrt( numpy.sum( (end_point - start_point) ** 2, axis=1 ) )
Here numpy.sum(..., axis=1) means sum along that axis only, rather than all elements: a ndarray with shape num_points x num_dimensions summed along axis=1 produces a result with num_points, which is correct.
EDIT: removed code example without broadcasting.
EDIT: fixed up order of indexes.
EDIT: added line_dist
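A small runnable version of the distance calculation, reusing the points from the walkthrough. It sums over axis=-1 instead of axis=1, on the assumption that the coordinates always sit on the last axis; that way the same line also covers point arrays with an extra leading dimension, as in the question. np.linalg.norm is shown as an equivalent.

```python
import numpy as np

start_points = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
end_points = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])

diff = end_points - start_points
line_dist = np.sqrt(np.sum(diff ** 2, axis=-1))  # one length per line
same = np.linalg.norm(diff, axis=-1)             # equivalent built-in
print(line_dist)

# With an extra leading axis (e.g. several lines of sight), only the
# coordinate axis is reduced.
stacked = np.stack([diff, 2 * diff])             # shape (2, 2, 3)
print(np.linalg.norm(stacked, axis=-1).shape)    # (2, 2)
```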

I'm not through understanding all you wrote, but some things I already can tell you; maybe they help.
newaxis is rather a marker than a function (in fact, it is plain None). It is used to add an (unused) dimension to a multi-dimensional value. With it you can make a 3D value out of a 2D value (or even more). Each dimension already there in the input value must be represented by a colon : in the index (assuming you want to use all values, otherwise it gets complicated beyond our usecase), the dimensions to be added are denoted by newaxis.
Example:
input is a one-dimensional vector (1D): 1,2,3
output shall be a matrix (2D).
There are two ways to accomplish this; the vector could fill the lines with one value each, or the vector could fill just the first and only line of the matrix. The first is created by vector[:,newaxis], the second by vector[newaxis,:]. Results of this:
>>> array([ 7,8,9 ])[:,newaxis]
array([[7],
[8],
[9]])
>>> array([ 7,8,9 ])[newaxis,:]
array([[7, 8, 9]])
(Dimensions of multi-dimensional values are represented by nesting of arrays of course.)
If you have more dimensions in the input, use the colon more than once (otherwise the deeper nested dimensions are simply ignored, i.e. the arrays are treated as simple values). I won't paste a representation of this here as it won't clarify things due to the optical complexity when 3D and 4D values are written on a 2D display using nested brackets. I hope it gets clear anyway.
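Rather than printing the nested brackets, it may be enough to look at the shapes. A short sketch for a 2D input, where points is just an arbitrary example array:

```python
import numpy as np

points = np.arange(6).reshape(2, 3)    # a 2D value with shape (2, 3)

print(points[np.newaxis, :, :].shape)  # (1, 2, 3)
print(points[:, np.newaxis, :].shape)  # (2, 1, 3)
print(points[:, :, np.newaxis].shape)  # (2, 3, 1)
print(np.newaxis is None)              # True
```

Each newaxis inserts a dimension of length 1 at its position, and each colon keeps an existing dimension as it is.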

The newaxis reshapes the array in such a way so that when you multiply numpy uses broadcasting. Here is a good tutorial on broadcasting.
step[:, newaxis, newaxis] is the same as step.reshape((step.shape[0], 1, 1)) (if step is 1d). Either method for reshaping should be very fast because reshaping arrays in numpy is very cheap: it just makes a view of the array. It matters even less here because you should only be doing it once.
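A quick check of that equivalence, using a small example array for step. np.shares_memory is used to confirm that both results are views on the original data rather than copies.

```python
import numpy as np

step = np.arange(5) / 4.0                 # 1d array, shape (5,)

a = step[:, np.newaxis, np.newaxis]
b = step.reshape((step.shape[0], 1, 1))

print(a.shape, b.shape)                   # (5, 1, 1) (5, 1, 1)
print(np.array_equal(a, b))               # True
print(np.shares_memory(a, step), np.shares_memory(b, step))  # True True
```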

don't understand python's behaviour of three dimensional lists (references to sublists are the same)

I don't understand why the following code behaves the way it does.
import numpy as np
nbr_arrays = 4
nbr_fields_per_array = 3
nbr_subfields_per_field = 2
# pre-allocate zeros list
zeros = np.zeros(nbr_subfields_per_field)
data = []
for array in range(nbr_arrays):
    # pre-allocate the subarray
    empty_array = []
    for empty_array_index in range(nbr_fields_per_array):
        empty_array.append(zeros)
    # append pre subarray to data
    data.append(empty_array)
    # fill up data
    for j in range(nbr_fields_per_array):
        for k in range(nbr_subfields_per_field):
            data[array][j][k] = j*k*array
The generated output data reads now:
[[array([ 0., 6.]), array([ 0., 6.]), array([ 0., 6.])],
[array([ 0., 6.]), array([ 0., 6.]), array([ 0., 6.])],
[array([ 0., 6.]), array([ 0., 6.]), array([ 0., 6.])],
[array([ 0., 6.]), array([ 0., 6.]), array([ 0., 6.])]]
Even zeros reads completely differently:
array([ 0., 6.])
If I look at the identity of the different lists, this is what I get:
id(data[0][0])
Out[72]: 45790208
id(data[1][0])
Out[66]: 45790208
id(data[2][0])
Out[67]: 45790208
id(data[3][0])
Out[68]: 45790208
id(zeros)
Out[69]: 45790208
Why are all the references the same? And why does zeros suddenly contain non-zero values?
I'd really appreciate it if somebody could explain to me what exactly is happening here, and how I have to modify my code to see the expected behaviour (output).
EDIT:
not using zeros but using [[0]*nbr_subfields_per_field for x in range(nbr_fields_per_array)] instead gives me the expected result. but why? why doesn't the original code work?
Modified code that works:
data = []
for array in range(nbr_arrays):
    empty_array = [[0]*nbr_subfields_per_field for x in range(nbr_fields_per_array)]
    ''' this is causing the weird behaviour
    empty_array = []
    for empty_array_index in range(nbr_fields_per_array):
        empty_array.append(zeros)
    '''
    data.append(empty_array)
    for j in range(nbr_fields_per_array):
        for k in range(nbr_subfields_per_field):
            data[array][j][k] = j*k*array

# pre-allocate zeros list
zeros = np.zeros(nbr_subfields_per_field)
This creates a single object.
for empty_array_index in range(nbr_fields_per_array):
    empty_array.append(zeros)
This keeps appending the same object.
Stop pre-allocating.
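The aliasing can be shown in a few lines. In this sketch the names aliased and independent are made up for the comparison; the only difference between the two lists is whether np.zeros is called once or once per slot.

```python
import numpy as np

nbr_fields_per_array = 3
nbr_subfields_per_field = 2

# One array object, referenced from every slot of the list.
zeros = np.zeros(nbr_subfields_per_field)
aliased = [zeros for _ in range(nbr_fields_per_array)]
aliased[0][1] = 6.0  # also visible through aliased[1], aliased[2] and zeros

# A fresh array object per slot.
independent = [np.zeros(nbr_subfields_per_field) for _ in range(nbr_fields_per_array)]
independent[0][1] = 6.0  # only the first slot changes

print(aliased[2] is zeros, zeros)
print(independent[2] is independent[0], independent[2])
```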

Numpy can set up multidimensional arrays for you, if you want. Since you're going to initialize the whole array immediately after creating it, the empty method seems like the most appropriate:
data = np.empty((nbr_arrays, nbr_fields_per_array, nbr_subfields_per_field))
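A sketch of how that array could then be filled with the same values as the loop in the question. The broadcasting variant at the end is an extra shown for comparison, and the names a, j, k and vectorized exist only in this example.

```python
import numpy as np

nbr_arrays = 4
nbr_fields_per_array = 3
nbr_subfields_per_field = 2

# np.empty leaves the contents uninitialized, so every cell must be written.
data = np.empty((nbr_arrays, nbr_fields_per_array, nbr_subfields_per_field))
for array in range(nbr_arrays):
    for j in range(nbr_fields_per_array):
        for k in range(nbr_subfields_per_field):
            data[array, j, k] = j * k * array

# Same values without explicit loops, via broadcasting.
a = np.arange(nbr_arrays)[:, np.newaxis, np.newaxis]
j = np.arange(nbr_fields_per_array)[np.newaxis, :, np.newaxis]
k = np.arange(nbr_subfields_per_field)[np.newaxis, np.newaxis, :]
vectorized = (a * j * k).astype(float)

print(data.shape, np.array_equal(data, vectorized))
```

Because data is a single 3D array rather than nested lists of references, each cell has its own storage and the aliasing problem cannot occur.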
