I want to create a NumPy array of np.ndarray objects from an iterable. I have a function that returns an np.ndarray of some constant shape, and I need to collect its results into an array, something like this:
OUTPUT_SHAPE = some_constant

def foo(input) -> np.ndarray:
    # processing
    # generates an np.ndarray of shape OUTPUT_SHAPE
    return output

inputs = [i for i in range(100000)]
iterable = (foo(input) for input in inputs)
arr = np.fromiter(iterable, np.ndarray)
This obviously gives an error:
cannot create object arrays from iterator
I cannot first create a list and then convert it to an array, because that would temporarily keep a copy of every output array alive, nearly doubling the memory used, and I have very limited memory.
Can anyone help me?
You probably shouldn't make an object array. You should probably make an ordinary 2D array of non-object dtype. As long as you know the number of results the iterator will give in advance, you can avoid most of the copying you're worried about by doing it like this:
arr = numpy.empty((num_iterator_outputs, OUTPUT_SHAPE), dtype=whatever_appropriate_dtype)
for i, output in enumerate(iterable):
    arr[i] = output
This only needs to hold arr and a single output in memory at once, instead of arr and every output.
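For instance, a minimal runnable sketch of that pattern; the concrete shape, dtype, and body of foo here are placeholders for the real ones:

import numpy as np

OUTPUT_SHAPE = 4  # placeholder for the real constant length

def foo(x):
    # placeholder for the real processing
    return np.full(OUTPUT_SHAPE, x, dtype=np.float64)

inputs = range(100000)
iterable = (foo(x) for x in inputs)

arr = np.empty((len(inputs), OUTPUT_SHAPE), dtype=np.float64)
for i, output in enumerate(iterable):
    arr[i] = output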
If you really want an object array, you can get one. The simplest way would be to go through a list, which will not perform the copying you're worried about as long as you do it right:
outputs = list(iterable)
arr = numpy.empty(len(outputs), dtype=object)
arr[:] = outputs
Note that if you just try to call numpy.array on outputs, it will try to build a 2D array, which will cause the copying you're worried about. This is true even if you specify dtype=object - it'll try to build a 2D array of object dtype, and that'll be even worse, for both usability and memory.
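A quick demonstration of the difference, with small arrays standing in for the real outputs:

import numpy as np

outputs = [np.zeros(3) for _ in range(4)]

np.array(outputs).shape                # (4, 3): a 2D copy of all the data
np.array(outputs, dtype=object).shape  # (4, 3): a 2D object array, even worse

arr = np.empty(len(outputs), dtype=object)
arr[:] = outputs
arr.shape                              # (4,): a 1D array of references, no data copied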
An object dtype array contains references, just like a list.
Define 3 arrays:
In [589]: a,b,c = np.arange(3), np.ones(3), np.zeros(3)
put them in a list:
In [590]: alist = [a,b,c]
and in an object dtype array:
In [591]: arr = np.empty(3,object)
In [592]: arr[:] = alist
In [593]: arr
Out[593]:
array([array([0, 1, 2]), array([1., 1., 1.]), array([0., 0., 0.])],
dtype=object)
In [594]: alist
Out[594]: [array([0, 1, 2]), array([1., 1., 1.]), array([0., 0., 0.])]
Modify one, and see the change in the list and array:
In [595]: b[:] = [1,2,3]
In [596]: b
Out[596]: array([1., 2., 3.])
In [597]: alist
Out[597]: [array([0, 1, 2]), array([1., 2., 3.]), array([0., 0., 0.])]
In [598]: arr
Out[598]:
array([array([0, 1, 2]), array([1., 2., 3.]), array([0., 0., 0.])],
dtype=object)
A numeric dtype array created from these copies all values:
In [599]: arr1 = np.stack(arr)
In [600]: arr1
Out[600]:
array([[0., 1., 2.],
[1., 2., 3.],
[0., 0., 0.]])
So even if your use of fromiter worked, it wouldn't be any different, memory-wise, from a list accumulation:
alist = []
for i in range(n):
    alist.append(constant_array)
For a special application dealing with numpy arrays of different lengths, I need my preferably-numpy array, not just a list, to have the form np.ndarray[np.ndarray[ ], np.ndarray[ ], ..., dtype=object]. Given a sequence, list, etc. of numpy arrays, I want them always to have this form. However, for a list of numpy arrays of the same length, e.g.,
np.array([np.arange(4), np.arange(4)], dtype=object)
gives me np.ndarray[np.ndarray[[]], dtype=object], so I came up with the workaround below.
Is there any other magic option that could be passed to np.array(), or another method that gives the desired result more directly?
Workaround:
inp_arr_a = np.asarray([np.arange(4), np.arange(3)], dtype=object)
inp_arr_b = np.array([np.arange(4), np.arange(4)])
def split_to_obj_arr(arr):
    return np.delete(np.array([*arr, 'dummy'], dtype=object), -1, 0)
gives for split_to_obj_arr(inp_arr_a)
array([array([0, 1, 2, 3]), array([0, 1, 2])], dtype=object)
and for split_to_obj_arr(inp_arr_b)
array([array([0, 1, 2, 3]), array([0, 1, 2, 3])], dtype=object)
np.array(...) by design tries to return as high a dimensional numeric array as possible. If the inputs are ragged it will raise a future-warning (unless you specify object dtype) and return the object array containing arrays. Or with some combinations of shapes it will raise an error.
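A quick illustration of that behaviour (the exact warning/error details vary with the NumPy version):

import numpy as np

np.array([np.arange(4), np.arange(4)]).shape                # (2, 4): stacked into a 2d numeric array
np.array([np.arange(4), np.arange(3)], dtype=object).shape  # (2,): ragged, so a 1d object array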
Forcing an object dtype by appending a dummy element and then deleting it is one way around this. I prefer creating the None-filled array first, and assigning the elements:
In [80]: def foo(alist):
...: res = np.empty(len(alist), object)
...: res[:] = alist
...: return res
...:
In [81]: foo([[],[]])
Out[81]: array([list([]), list([])], dtype=object)
In [82]: foo([np.array([]),np.array([])])
Out[82]: array([array([], dtype=float64), array([], dtype=float64)], dtype=object)
In [83]: foo([np.ones((2,3)),np.zeros((2,3))])
Out[83]:
array([array([[1., 1., 1.],
[1., 1., 1.]]),
array([[0., 0., 0.],
[0., 0., 0.]])], dtype=object)
In [84]: foo([np.array([2,3]),np.array([1,2])])
Out[84]: array([array([2, 3]), array([1, 2])], dtype=object)
Creating a 2d object array like this is also possible, but trickier. It may be simpler to reshape a 1d as needed after.
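For instance, a sketch of the reshape route (the shapes here are arbitrary):

import numpy as np

arrs = [np.arange(n) for n in (1, 2, 3, 4, 5, 6)]  # six 1d arrays of assorted lengths

res = np.empty(len(arrs), dtype=object)
res[:] = arrs
res2d = res.reshape(2, 3)  # now a (2, 3) object array of arrays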
I want to create a numpy array b where each component is a 2D matrix, whose dimensions are determined by the coordinates of vector a.
What I get doing the following satisfies me:
>>> a = [3,4,1]
>>> b = [np.zeros((a[i], a[i - 1] + 1)) for i in range(1, len(a))]
>>> np.array(b)
array([ array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]]),
array([[ 0., 0., 0., 0., 0.]])], dtype=object)
but I have found this pathological case where it does not work:
>>> a = [2,1,1]
>>> b = [np.zeros((a[i], a[i - 1] + 1)) for i in range(1, len(a))]
>>> b
[array([[ 0., 0., 0.]]), array([[ 0., 0.]])]
>>> np.array(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not broadcast input array from shape (3) into shape (1)
I will present a solution to the problem, but do take into account what was said in the comments. Having Numpy arrays that are not aligned prevents most of the useful operations from working their magic. Consider using lists instead.
That being said, curious error indeed. I got the thing to work by assigning in a basic for-loop instead of using the np.array call.
a = [2,1,1]
b = np.zeros(len(a)-1, dtype=object)
for i in range(1, len(a)):
    b[i-1] = np.zeros((a[i], a[i - 1] + 1))
And the result:
>>> b
array([array([[0., 0., 0.]]), array([[0., 0.]])], dtype=object)
This is a bit peculiar. Typically, numpy will try to create one array with a common data type from the input of np.array, interpreting the list as a new dimension. For instance, np.array([np.zeros((3, 1)), np.zeros((3, 1))]) would produce a 2 x 3 x 1 array. That can only happen if the arrays in your list match in shape. Otherwise, you end up with an array of arrays (with dtype=object), which, as commented, is not really an ideal scenario.
However, your error seems to occur when the first dimension matches. Numpy for some reason tries to broadcast the arrays somehow and fails. I can reproduce your error even if the arrays are of higher dimension, as long as the first dimension between arrays matches.
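A sketch that reproduces it with the shapes from the question; on newer NumPy versions the message differs (it complains about an inhomogeneous shape and suggests dtype=object instead):

import numpy as np

b = [np.zeros((1, 3)), np.zeros((1, 2))]  # first dimensions match, trailing ones differ

try:
    np.array(b)
except ValueError as e:
    print(e)  # e.g. "could not broadcast input array from shape (3) into shape (1)"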
I know this isn't a solution, but this wouldn't fit in a comment. As noted by @roganjosh, making this kind of array really gives you no benefit. You're better off sticking to a list of arrays for readability and to avoid the cost of creating these arrays.
I want to create a 2D numpy.array knowing at the beginning only its shape, i.e. shape=2. Now, I want to create in a for loop the ith one-dimensional numpy.array and add it to the main matrix of shape=2, so I'll get something like this:
matrix=
[numpy.array 1]
[numpy.array 2]
...
[numpy.array n]
How can I achieve that? I tried to use:
matrix = np.empty(shape=2)
for i in np.arange(100):
    array = np.zeros(random_value)
    matrix = np.append(matrix, array)
But as a result of print(np.shape(matrix)) after the loop, I get something like:
(some_number, )
How can I append each new array in the next row of the matrix? Thank you in advance.
I would suggest working with a list:
matrix = []
for i in range(10):
    a = np.ones(2)
    matrix.append(a)
matrix = np.array(matrix)
A list does not have the downside of being copied in memory every time you use append, so you avoid the problem described by ali_m. At the end of your operation, you just convert the list into a numpy array.
I suspect the root of your problem is the meaning of 'shape' in np.empty(shape=2).
If I run a small version of your code
matrix = np.empty(shape=2)
for i in np.arange(3):
    array = np.zeros(3)
    matrix = np.append(matrix, array)
I get
array([ 9.57895902e-259, 1.51798693e-314, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000])
See those 2 odd numbers at the start? Those are produced by np.empty(shape=2). That matrix starts as a (2,) shaped array, not an empty 2d array. append just adds sets of 3 zeros to that, resulting in a (11,) array.
Now if you started with a 2d array with the right number of columns, and did concatenate on the 1st dimension, you would get a multirow array (rows only have meaning in 2d or larger).
mat = np.zeros((1,3))
for i in range(1,3):
    mat = np.concatenate([mat, np.ones((1,3))*i], axis=0)
produces:
array([[ 0., 0., 0.],
[ 1., 1., 1.],
[ 2., 2., 2.]])
A better way of doing an iterative construction like this is with list append
alist = []
for i in range(0,3):
    alist.append(np.ones((1,3))*i)
mat = np.vstack(alist)
alist is:
[array([[ 0., 0., 0.]]), array([[ 1., 1., 1.]]), array([[ 2., 2., 2.]])]
mat is
array([[ 0., 0., 0.],
[ 1., 1., 1.],
[ 2., 2., 2.]])
With vstack you can get by with np.ones((3,)), since it turns all of its inputs into 2d arrays.
np.append would work, but it also requires the axis=0 parameter and two 2d arrays. It gets misused, often by mistaken analogy to the list append. It is really just another front end to concatenate, so I prefer not to use it.
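For comparison, the same loop written with np.append; it behaves identically to the concatenate version above:

mat = np.zeros((1,3))
for i in range(1,3):
    mat = np.append(mat, np.ones((1,3))*i, axis=0)
# same result:
# array([[ 0., 0., 0.],
#        [ 1., 1., 1.],
#        [ 2., 2., 2.]])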
Notice that other posters assumed your random value changed during the iteration. That would produce arrays of differing lengths. For 1d appending, that would still produce one long 1d array. But a 2d append wouldn't work, because a 2d array can't be ragged.
mat = np.zeros((2,), int)
for i in range(4):
    mat = np.append(mat, np.ones((i,), int)*i)
# array([0, 0, 1, 2, 2, 3, 3, 3])
The function you are looking for is np.vstack
Here is a modified version of your example
import numpy as np
matrix = np.empty(shape=2)
for i in np.arange(3):
    array = np.zeros(2)
    matrix = np.vstack((matrix, array))
The result is
array([[ 0., 0.],
[ 0., 0.],
[ 0., 0.],
[ 0., 0.]])
I don't understand why the following code behaves the way it does.
import numpy as np

nbr_arrays = 4
nbr_fields_per_array = 3
nbr_subfields_per_field = 2

# pre-allocate zeros list
zeros = np.zeros(nbr_subfields_per_field)

data = []
for array in range(nbr_arrays):
    # pre-allocate the subarray
    empty_array = []
    for empty_array_index in range(nbr_fields_per_array):
        empty_array.append(zeros)
    # append pre-built subarray to data
    data.append(empty_array)

    # fill up data
    for j in range(nbr_fields_per_array):
        for k in range(nbr_subfields_per_field):
            data[array][j][k] = j*k*array
The generated output data now reads:
[[array([ 0., 6.]), array([ 0., 6.]), array([ 0., 6.])],
[array([ 0., 6.]), array([ 0., 6.]), array([ 0., 6.])],
[array([ 0., 6.]), array([ 0., 6.]), array([ 0., 6.])],
[array([ 0., 6.]), array([ 0., 6.]), array([ 0., 6.])]]
Even zeros reads completely differently:
array([ 0., 6.])
If I look at the identity of the different elements, this is what I get:
id(data[0][0])
Out[72]: 45790208
id(data[1][0])
Out[66]: 45790208
id(data[2][0])
Out[67]: 45790208
id(data[3][0])
Out[68]: 45790208
id(zeros)
Out[69]: 45790208
Why are all the references the same? And why does zeros suddenly contain non-zero values?
I'd really appreciate it if somebody could explain to me what exactly is happening here, and how I have to modify my code to see the expected behaviour (output).
EDIT:
Not using zeros but using [[0]*nbr_subfields_per_field for x in range(nbr_fields_per_array)] instead gives me the expected result. But why? Why doesn't the original code work?
Modified code that works:
data = []
for array in range(nbr_arrays):
    empty_array = [[0]*nbr_subfields_per_field for x in range(nbr_fields_per_array)]
    ''' this is causing the weird behaviour
    empty_array = []
    for empty_array_index in range(nbr_fields_per_array):
        empty_array.append(zeros)
    '''
    data.append(empty_array)

    for j in range(nbr_fields_per_array):
        for k in range(nbr_subfields_per_field):
            data[array][j][k] = j*k*array
# pre-allocate zeros list
zeros = np.zeros(nbr_subfields_per_field)
This creates a single object.
for empty_array_index in range(nbr_fields_per_array):
    empty_array.append(zeros)
This keeps appending the same object.
Stop pre-allocating.
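If you did want to keep the list-building approach, appending a fresh copy on each iteration removes the aliasing; a minimal sketch:

empty_array = []
for empty_array_index in range(nbr_fields_per_array):
    # each element is now a distinct array, so later writes don't alias
    empty_array.append(zeros.copy())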
Numpy can set up multidimensional arrays for you, if you want. Since you're going to initialize the whole array immediately after creating it, the empty method seems like the most appropriate:
data = np.empty((nbr_arrays, nbr_fields_per_array, nbr_subfields_per_field))
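The fill loops from the question then index straight into it, with no pre-built nested lists needed:

for array in range(nbr_arrays):
    for j in range(nbr_fields_per_array):
        for k in range(nbr_subfields_per_field):
            data[array, j, k] = j*k*array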
In pure Python you can grow matrices column by column pretty easily:
data = []
for i in something:
    newColumn = getColumnDataAsList(i)
    data.append(newColumn)
NumPy's array doesn't have an append function. The hstack function doesn't work on zero-sized arrays, thus the following won't work:
data = numpy.array([])
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data = numpy.hstack((data, newColumn))  # ValueError: arrays must have same number of dimensions
So, my options are either to handle the initialization inside the loop with an appropriate condition:
data = None
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    if data is None:
        data = newColumn
    else:
        data = numpy.hstack((data, newColumn))  # works
... or to use a Python list and convert it later to an array:
data = []
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data.append(newColumn)
data = numpy.array(data)
Both variants seem a little bit awkward to me. Are there nicer solutions?
NumPy actually does have an append function, which it seems might do what you want, e.g.,
import numpy as NP
my_data = NP.random.random_integers(0, 9, 9).reshape(3, 3)
new_col = NP.array((5, 5, 5)).reshape(3, 1)
res = NP.append(my_data, new_col, axis=1)
Your second snippet (hstack) will work if you add another line, e.g.:
my_data = NP.random.random_integers(0, 9, 16).reshape(4, 4)
# the line to add--does not depend on array dimensions
new_col = NP.zeros_like(my_data[:,-1]).reshape(-1, 1)
res = NP.hstack((my_data, new_col))
hstack gives the same result as concatenate((my_data, new_col), axis=1); I'm not sure how they compare performance-wise.
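A quick sketch verifying that equivalence, reusing my_data and new_col from above:

res_h = NP.hstack((my_data, new_col))
res_c = NP.concatenate((my_data, new_col), axis=1)
assert (res_h == res_c).all()  # identical results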
While that's the most direct answer to your question, I should mention that looping through a data source to populate a target via append, while just fine in Python, is not idiomatic NumPy. Here's why:
Initializing a NumPy array is relatively expensive, and with this conventional Python pattern you incur that cost, more or less, at each loop iteration (i.e., each append to a NumPy array is roughly like initializing a new array with a different size).
For that reason, the common pattern in NumPy for iterative addition of columns to a 2D array is to initialize an empty target array once (or pre-allocate a single 2D NumPy array having all of the empty columns), then successively populate those empty columns by setting the desired column-wise offset (index). Much easier to show than to explain:
>>> # initialize your skeleton array using 'empty' for lowest-memory footprint
>>> M = NP.empty(shape=(10, 5), dtype=float)
>>> # create a small function to mimic step-wise populating this empty 2D array:
>>> fnx = lambda v : NP.random.randint(0, 10, v)
>>> # populate the array as in the OP, except each iteration just re-sets
>>> # the values of M at successive column-wise offsets
>>> for index, itm in enumerate(range(5)):
...     M[:,index] = fnx(10)
>>> M
array([[ 1., 7., 0., 8., 7.],
[ 9., 0., 6., 9., 4.],
[ 2., 3., 6., 3., 4.],
[ 3., 4., 1., 0., 5.],
[ 2., 3., 5., 3., 0.],
[ 4., 6., 5., 6., 2.],
[ 0., 6., 1., 6., 8.],
[ 3., 8., 0., 8., 0.],
[ 5., 2., 5., 0., 1.],
[ 0., 6., 5., 9., 1.]])
Of course, if you don't know in advance what size your array should be, just create one much bigger than you need and trim the 'unused' portions when you finish populating it:
>>> M[:3,:3]
array([[ 9., 3., 1.],
[ 9., 6., 8.],
[ 9., 7., 5.]])
Usually you don't keep resizing a NumPy array when you create it. What don't you like about your third solution? If it's a very large matrix/array, then it might be worth allocating the array before you start assigning its values:
x = len(something)
y = getColumnDataAsNumpyArray.someLengthProperty
data = numpy.zeros((x, y))
for i in something:
    data[i] = getColumnDataAsNumpyArray(i)
hstack can work on zero-sized arrays:
import numpy as np

N = 5
M = 15
a = np.ndarray(shape=(N, 0))
for i in range(M):
    b = np.random.rand(N, 1)
    a = np.hstack((a, b))
Generally it is expensive to keep reallocating the NumPy array, so your third solution is really the best performance-wise.
However I think hstack will do what you want - the cue is in the error message,
ValueError: arrays must have same number of dimensions
I'm guessing that newColumn has two dimensions (rather than being a 1D vector), so you need data to also have two dimensions, for example data = np.array([[]]). Alternatively, make newColumn a 1D vector (generally, if things are 1D it is better to keep them 1D in NumPy, so that broadcasting etc. work better); in that case, use np.squeeze(newColumn), and hstack or vstack should work with your original definition of data.
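A minimal sketch of that last suggestion, assuming each column arrives as an (N, 1) array; note that the hstack result is one flat 1D array, which you can reshape afterwards if you need the 2D layout:

import numpy as np

data = np.array([])  # the original 1D, zero-length definition
for i in range(5):
    newColumn = np.random.rand(3, 1)  # stand-in for getColumnDataAsNumpyArray(i)
    data = np.hstack((data, np.squeeze(newColumn)))
# data.shape == (15,); data.reshape(5, 3).T recovers the (3, 5) column layout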