In pure Python you can grow matrices column by column pretty easily:
data = []
for i in something:
    newColumn = getColumnDataAsList(i)
    data.append(newColumn)
NumPy arrays don't have an append method, and the hstack function doesn't work on zero-sized arrays, so the following won't work:
data = numpy.array([])
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data = numpy.hstack((data, newColumn))  # ValueError: arrays must have same number of dimensions
So my options are either to handle the initialization inside the loop with an appropriate condition:
data = None
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    if data is None:
        data = newColumn
    else:
        data = numpy.hstack((data, newColumn))  # works
... or to use a Python list and convert it to an array later:
data = []
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data.append(newColumn)
data = numpy.array(data)
Both variants seem a little bit awkward to me. Are there nicer solutions?
NumPy actually does have an append function, which it seems might do what you want, e.g.,
import numpy as NP
my_data = NP.random.random_integers(0, 9, 9).reshape(3, 3)
new_col = NP.array((5, 5, 5)).reshape(3, 1)
res = NP.append(my_data, new_col, axis=1)
Your second snippet (hstack) will work if you add another line, e.g.,
my_data = NP.random.random_integers(0, 9, 16).reshape(4, 4)
# the line to add--does not depend on array dimensions
new_col = NP.zeros_like(my_data[:,-1]).reshape(-1, 1)
res = NP.hstack((my_data, new_col))
hstack gives the same result as concatenate((my_data, new_col), axis=1); I'm not sure how they compare performance-wise.
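If you want to check the performance yourself, a quick (and unscientific) timing sketch is easy to set up -- the shapes below are just placeholders, not taken from the question:
import timeit
import numpy as NP

my_data = NP.random.randint(0, 10, (1000, 1000))
new_col = NP.zeros((1000, 1))

# time 100 repetitions of each call; both copy the data into a new array
t_hstack = timeit.timeit(lambda: NP.hstack((my_data, new_col)), number=100)
t_concat = timeit.timeit(lambda: NP.concatenate((my_data, new_col), axis=1), number=100)
print(t_hstack, t_concat)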
While that's the most direct answer to your question, I should mention that looping through a data source to populate a target via append, while just fine in Python, is not idiomatic NumPy. Here's why:
Initializing a NumPy array is relatively expensive, and with this conventional Python pattern you incur that cost, more or less, at each loop iteration (i.e., each append to a NumPy array is roughly like initializing a new array with a different size).
For that reason, the common pattern in NumPy for iterative addition of columns to a 2D array is to initialize an empty target array once (or pre-allocate a single 2D NumPy array having all of the empty columns), then successively populate those empty columns by setting the desired column-wise offset (index)--much easier to show than to explain:
>>> # initialize your skeleton array using 'empty' for lowest-memory footprint
>>> M = NP.empty(shape=(10, 5), dtype=float)
>>> # create a small function to mimic step-wise populating this empty 2D array:
>>> fnx = lambda v : NP.random.randint(0, 10, v)
Populate the NumPy array as in the OP, except each iteration just re-sets the values of M at successive column-wise offsets:
>>> for index, itm in enumerate(range(5)):
...     M[:, index] = fnx(10)
>>> M
array([[ 1., 7., 0., 8., 7.],
[ 9., 0., 6., 9., 4.],
[ 2., 3., 6., 3., 4.],
[ 3., 4., 1., 0., 5.],
[ 2., 3., 5., 3., 0.],
[ 4., 6., 5., 6., 2.],
[ 0., 6., 1., 6., 8.],
[ 3., 8., 0., 8., 0.],
[ 5., 2., 5., 0., 1.],
[ 0., 6., 5., 9., 1.]])
Of course, if you don't know in advance what size your array should be, just create one much bigger than you need and trim the 'unused' portions when you finish populating it:
>>> M[:3,:3]
array([[ 9., 3., 1.],
[ 9., 6., 8.],
[ 9., 7., 5.]])
Usually you don't keep resizing a NumPy array as you build it. What don't you like about your third solution? If it's a very large matrix/array, then it might be worth allocating the array before you start assigning its values:
x = len(something)
y = getColumnDataAsNumpyArray.someLengthProperty
data = numpy.zeros( (x,y) )
for i in something:
    data[i] = getColumnDataAsNumpyArray(i)
hstack can work on zero-sized arrays:
import numpy as np
N = 5
M = 15
a = np.ndarray(shape=(N, 0))
for i in range(M):
    b = np.random.rand(N, 1)
    a = np.hstack((a, b))
Generally it is expensive to keep reallocating the NumPy array, so your third solution is really the best performance-wise.
However, I think hstack will do what you want - the clue is in the error message:
ValueError: arrays must have same number of dimensions
I'm guessing that newColumn has two dimensions (rather than being a 1D vector), so data also needs two dimensions, e.g. data = np.array([[]]). Alternatively, make newColumn a 1D vector (generally if things are 1D it is better to keep them 1D in NumPy, so broadcasting, etc. work better): use np.squeeze(newColumn), and then hstack or vstack should work with your original definition of data.
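For illustration, a minimal sketch of the 1D route -- get_column here is a hypothetical stand-in for the question's getColumnDataAsNumpyArray, assumed to return an (n, 1) column:
import numpy as np

def get_column(i):                 # hypothetical stand-in for getColumnDataAsNumpyArray
    return np.full((4, 1), float(i))

data = np.array([])                # 1D and empty, as in the question
for i in range(3):
    new_column = np.squeeze(get_column(i))    # (4, 1) -> (4,)
    data = np.hstack((data, new_column))      # plain 1D concatenation, no ValueError

print(data.shape)                  # (12,) -- the columns laid end to end

# to keep the 2D column structure instead, stack the 1D pieces as rows and transpose
cols = np.vstack([np.squeeze(get_column(i)) for i in range(3)]).T
print(cols.shape)                  # (4, 3)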
Related
Data is in a structured array:
import numpy as np
dtype = [(field, float) for field in ['x', 'y', 'z', 'prop1', 'prop2']]
data = np.array([(1,2,3,4,5), (6,7,8,9,10), (11,12,13,14,15)], dtype=dtype)
For some operations, the positions are accessed as a single nx3 array, for example:
positions = data[['x', 'y', 'z']].view(dtype=float).reshape(-1, 3)
ranges = np.sqrt(np.sum(positions**2, 1))
Since numpy 1.12, the following warning is emitted:
FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a
structured array.
This code may break in numpy 1.13 because this will return a view instead of a copy -- see release notes for details.
Here is the corresponding entry in the release notes:
Indexing a structured array with multiple fields (eg, arr[['f1', 'f3']]) will return a view into the original array in 1.13, instead of a copy. Note the returned view will have extra padding bytes corresponding to intervening fields in the original array, unlike the copy in 1.12, which will affect code such as arr[['f1', 'f3']].view(newdtype).
How to port this code to numpy >=1.13?
Checking on numpy 1.13 the announced change doesn't appear to have happened yet. So let's simulate the future:
The future behavior will presumably be not to copy the data but to create a dtype that has only the fields you want, but the itemsize of the original dtype. So there will be gaps in each element, parts of memory that are not used.
xyz_tp = np.dtype({'names': list('xyz'),
                   'formats': tuple(data.dtype.fields[f][0] for f in 'xyz'),
                   'offsets': tuple(data.dtype.fields[f][1] for f in 'xyz'),
                   'itemsize': data.dtype.itemsize})
xyz = data.view(xyz_tp)
xyz
# array([( 1., 2., 3.), ( 6., 7., 8.), ( 11., 12., 13.)],
# dtype={'names':['x','y','z'], 'formats':['<f8','<f8','<f8'], 'offsets':[0,8,16], 'itemsize':40})
The unused memory locations and their contents are ignored but still there, so if you view with a builtin dtype they'll reappear.
xyz.view(float)
# array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
# 12., 13., 14., 15.])
# Ouch!
The general fix would be to cast to a contiguous (no gaps) dtype with the same fields. This will force a copy:
xyz_cont_tp = np.dtype({'names': list('xyz'), 'formats': 3*('<f8',)})
xyz.astype(xyz_cont_tp).view(float).reshape(-1, 3)
# array([[ 1., 2., 3.],
# [ 6., 7., 8.],
# [ 11., 12., 13.]])
In the special case of your selected fields being contiguous and of the same type, you can also do:
np.lib.stride_tricks.as_strided(data.view(float), shape=(3,3), strides=data.strides + (8,))
# array([[ 1., 2., 3.],
# [ 6., 7., 8.],
# [ 11., 12., 13.]])
This method does not copy data but creates a genuine view.
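One way to convince yourself that it really is a view (continuing with the data array from the question): writing through it changes the original structured array.
xyz_view = np.lib.stride_tricks.as_strided(data.view(float), shape=(3, 3),
                                           strides=data.strides + (8,))
xyz_view[0, 0] = 99.0     # write through the view...
data['x']
# array([ 99.,   6.,  11.])   # ...and the underlying structured array changes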
Another way that works for several adjacent float fields: here, for the 3 fields starting at 'x', we obtain the same result with:
np.ndarray((len(data), 3), float, data, offset=data.dtype.fields['x'][1],
           strides=(data.strides[0], np.dtype(float).itemsize))
I have a large 2D array that I would like to declare once, and occasionally change only some values depending on a parameter, without traversing the whole array.
To build this array, I have subclassed the numpy ndarray class with dtype=object and assigned a function to the elements I want to change, e.g.:
def f(parameter):
    return parameter**2

for i in range(np.shape(A)[0]):
    A[i,i] = f
    for j in range(np.shape(A)[0]):
        A[i,j] = 1.
I have then overridden the __getitem__ method so that it returns the evaluation of the function with the given parameter if the element is callable, and otherwise returns the value itself.
def __getitem__(self, key):
    value = super(numpy.ndarray, self).__getitem__(key)
    if callable(value):
        return value(*self.args)
    else:
        return value
where self.args were previously given to the instance of myclass.
However, I need to work with float arrays at the end, and I can't simply convert this array into a dtype=float array with this technique. I also tried to use numpy views, which does not work either for dtype=object.
Do you have any better alternative? Should I override the view method rather than __getitem__?
Edit: I may have to use Cython in the future, so if you have a solution involving e.g. C pointers, I am interested.
In this case, it does not make sense to bind a transformation function to every index of your array.
Instead, a more efficient approach is to define a transformation as a function, together with the subset of the array it applies to. Here is a basic implementation:
import numpy as np

class LazyEvaluation(object):
    def __init__(self):
        self.transforms = []

    def add_transform(self, function, selection=slice(None), args={}):
        self.transforms.append((function, selection, args))

    def __call__(self, x):
        y = x.copy()
        for function, selection, args in self.transforms:
            y[selection] = function(y[selection], **args)
        return y
that can be used as follows:
x = np.ones((6, 6))*2
le = LazyEvaluation()
le.add_transform(lambda x: 0, [[3], [0]]) # equivalent to x[3,0]
le.add_transform(lambda x: x**2, (slice(4), slice(4,6))) # equivalent to x[:4, 4:6]
le.add_transform(lambda x: -1, np.diag_indices(x.shape[0], x.ndim), ) # setting the diagonal
result = le(x)
print(result)
which prints,
array([[-1., 2., 2., 2., 4., 4.],
[ 2., -1., 2., 2., 4., 4.],
[ 2., 2., -1., 2., 4., 4.],
[ 0., 2., 2., -1., 4., 4.],
[ 2., 2., 2., 2., -1., 2.],
[ 2., 2., 2., 2., 2., -1.]])
This way you can easily support all advanced Numpy indexing (element by element access, slicing, fancy indexing etc.), while at the same time keeping your data in an array with a native data type (float, int, etc) which is much more efficient than using dtype='object'.
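If a transform depends on a runtime parameter, the args dictionary of add_transform already covers that; a small usage sketch (the parameter name p is arbitrary):
le2 = LazyEvaluation()
le2.add_transform(lambda x, p: x * p, np.diag_indices(6), args={'p': 10})
print(le2(np.ones((6, 6))))   # diagonal entries become 10, the rest stay 1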
I want to create a 2D numpy.array, knowing at the beginning only its shape, i.e. shape=2. Now I want to create, in a for loop, the i-th one-dimensional numpy.array and add it to the main matrix of shape=2, so I'll get something like this:
matrix=
[numpy.array 1]
[numpy.array 2]
...
[numpy.array n]
How can I achieve that? I try to use:
matrix = np.empty(shape=2)
for i in np.arange(100):
    array = np.zeros(random_value)
    matrix = np.append(matrix, array)
But as a result of print(np.shape(matrix)), after loop, I get something like:
(some_number, )
How can I append each new array in the next row of the matrix? Thank you in advance.
I would suggest working with a list:
matrix = []
for i in range(10):
    a = np.ones(2)
    matrix.append(a)
matrix = np.array(matrix)
A list does not have the downside of being copied in memory every time you use append, so you avoid the problem described by ali_m. At the end of your operation you just convert the list object into a numpy array.
I suspect the root of your problem is the meaning of 'shape' in np.empty(shape=2)
If I run a small version of your code
matrix = np.empty(shape=2)
for i in np.arange(3):
    array = np.zeros(3)
    matrix = np.append(matrix, array)
I get
array([ 9.57895902e-259, 1.51798693e-314, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000])
See those 2 odd numbers at the start? Those are produced by np.empty(shape=2). That matrix starts as a (2,) shaped array, not an empty 2d array. append just adds sets of 3 zeros to that, resulting in a (11,) array.
Now, if you started with a 2d array with the right number of columns and did concatenate on the 1st dimension, you would get a multirow array (rows only have meaning in 2d or larger).
mat = np.zeros((1,3))
for i in range(1,3):
    mat = np.concatenate([mat, np.ones((1,3))*i], axis=0)
produces:
array([[ 0., 0., 0.],
[ 1., 1., 1.],
[ 2., 2., 2.]])
A better way of doing an iterative construction like this is with list append:
alist = []
for i in range(0,3):
    alist.append(np.ones((1,3))*i)
mat = np.vstack(alist)
alist is:
[array([[ 0., 0., 0.]]), array([[ 1., 1., 1.]]), array([[ 2., 2., 2.]])]
mat is
array([[ 0., 0., 0.],
[ 1., 1., 1.],
[ 2., 2., 2.]])
With vstack you can get by with np.ones((3,)), since it turns all of its inputs into 2d arrays.
append would work, but it also requires the axis=0 parameter and exactly 2 arrays. It gets misused, often by mistaken analogy to the list append; it is just another front end to concatenate, so I prefer not to use it.
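To make that concrete, a small sketch of the equivalent calls (the shapes are arbitrary; vstack accepts the 1d rows directly, while concatenate and append need explicit 2d rows and axis=0):
import numpy as np

rows_1d = [np.ones(3) * i for i in range(3)]        # three (3,) vectors
rows_2d = [np.ones((1, 3)) * i for i in range(3)]   # three (1, 3) rows

a = np.vstack(rows_1d)                         # promotes each (3,) to (1, 3), result (3, 3)
b = np.concatenate(rows_2d, axis=0)            # result (3, 3)
c = np.append(rows_2d[0], rows_2d[1], axis=0)  # only two arrays at a time, result (2, 3)
print(a.shape, b.shape, c.shape)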
Notice that other posters assumed your random value changed during the iteration. That would produce arrays of differing lengths. For 1d appending that would still produce one long 1d array, but a 2d append wouldn't work, because a 2d array can't be ragged.
mat = np.zeros((2,), int)
for i in range(4):
    mat = np.append(mat, np.ones((i,), int)*i)
# array([0, 0, 1, 2, 2, 3, 3, 3])
The function you are looking for is np.vstack
Here is a modified version of your example
import numpy as np
matrix = np.empty(shape=2)
for i in np.arange(3):
    array = np.zeros(2)
    matrix = np.vstack((matrix, array))
The result is
array([[ 0., 0.],
[ 0., 0.],
[ 0., 0.],
[ 0., 0.]])
I have a list of lists with 1,200 rows and 500,000 columns. How do I convert it into a numpy array?
I've read the solutions on Bypass "Array is too big" python error but they are not helping.
I tried to put them into a numpy array:
import random
import numpy as np
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
np.array(lol)
[Error]:
ValueError: array is too big.
Then I tried pandas:
import random
import pandas as pd
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
pd.lib.to_object_array(lol).astype(float)
[Error]:
ValueError: array is too big.
I've also tried hdf5 as @askewchan suggested:
import h5py
filearray = h5py.File('project.data','w')
data = filearray.create_dataset('tocluster',(len(data),len(data[0])),dtype='f')
data[...] = data
[Error]:
data[...] = data
File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 367, in __setitem__
val = numpy.asarray(val, order='C')
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
return array(a, dtype, copy=False, order=order)
File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 455, in __array__
arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
ValueError: array is too big.
This post shows that I can store a huge numpy array on disk: Python: how to store a numpy multidimensional array in PyTables?. But I can't even get my list of lists into a numpy array =(
On a system with 32GB of RAM and 64-bit Python your code:
import random
import numpy as np
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
np.array(lol)
works just fine for me but it's probably not the best route to take. This is the kind of thing PyTables was built for. Since you're dealing with homogeneous data you can use the Array class or, better yet, the CArray class (which supports compression). This can be done as follows:
import numpy as np
import tables as pt
# Create container
h5 = pt.open_file('myarray.h5', 'w')
filters = pt.Filters(complevel=6, complib='blosc')
carr = h5.create_carray('/', 'carray', atom=pt.Float32Atom(), shape=(1200, 500000), filters=filters)
# Fill the array
m, n = carr.shape
for j in xrange(m):
    carr[j,:] = np.random.randn(n)
h5.close() # "myarray.h5" (~2.2 GB)
# Open file
h5 = pt.open_file('myarray.h5', 'r')
carr = h5.root.carray
# Display some numbers from array
print carr[973:975, :4]
print carr.dtype
If you print carr.flavor it will return 'numpy'. You can use this carr in the same way you can use a NumPy array. The information is stored on disk but is still quite fast.
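For example, slicing the CArray gives back an ordinary in-memory NumPy array, so the usual NumPy operations work on the slice (a small sketch; the slice sizes here are arbitrary):
# a slice of the CArray is read from disk into a regular NumPy array
chunk = carr[:100, :1000]
print type(chunk)               # <type 'numpy.ndarray'>
print chunk.mean(), chunk.std()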
With h5py / hdf5:
import numpy as np
import h5py
lol = np.empty((1200, 5000)).tolist()
f = h5py.File('big.hdf5', 'w')
bd = f.create_dataset('big_dataset', (len(lol), len(lol[0])), dtype='f')
bd[...] = lol
Then, I believe you can access your big dataset bd as if it were an array, but it is stored and accessed from disk, not memory:
In [14]: bd[0, 1:10]
Out[14]:
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)
And you can have several 'datasets' in the one file (multiple arrays).
abd = f.create_dataset('another_big_dataset', (len(lol), len(lol[0])), dtype='f')
abd[...] = lol
abd += 10
Then:
In [24]: abd[:3, :10]
Out[24]:
array([[ 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.],
[ 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.],
[ 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.]], dtype=float32)
In [25]: bd[:3, :10]
Out[25]:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
My computer can't handle your example, so I can't test this with an array your size but I hope it works!
Depending on what you want to do with your array, you might have more luck with pytables, which does a lot more than h5py.
See also:
Python Numpy Very Large Matrices
exporting from/importing to numpy, scipy in SQLite and HDF5 formats
Have you tried assigning a dtype? This works for me.
import random
import numpy as np
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
ar = np.array(lol, dtype=np.float64)
Another option is to use blaze. http://blaze.pydata.org/
import random
import blaze
lol = [[random.uniform(0,1) for j in range(500000)] for i in range(1200)]
ar = blaze.array(lol)
The problem seems to be that you are using something (either OS or python) which is only 32bit, which is the source of the size limitation. The solution is to upgrade to 64bit.
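One quick way to check which interpreter you are running (standard-library only, nothing project-specific):
import struct
import sys

print(sys.version)                  # the build string usually mentions 32 bit / 64 bit
print(struct.calcsize('P') * 8)     # pointer size of this interpreter: 32 or 64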
An alternative is the following:
lol = np.empty((1200,500000))
for i in range(lol.shape[0]):
    lol[i] = [random.uniform(0,1) for j in range(lol.shape[1])]
This is reasonably close to your initial form, I hope it can fit into your code. I cannot test with your numbers, as I don't have enough RAM to handle the array.
I need to calculate n points (3D) with equal spacing along a defined line (3D).
I know the starting and end points of the line. First, I used:
for k in range(nbin):
    step = k/float(nbin-1)
    bin_point.append(beam_entry+(step*(beamlet_intersection-beam_entry)))
Then I found that using append for large arrays takes more time, so I changed the code to this:
bin_point = [start_point+((k/float(nbin-1))*(end_point-start_point)) for k in range(nbin)]
I got a suggestion that using newaxis will further improve the time.
The modified code looks like this.
step = arange(nbin) / float(nbin-1)
bin_point = start_point + (step[:,newaxis,newaxis]*(end_point - start_point)[newaxis,:,:])
But I could not understand the newaxis function, and I am also unsure whether the same code will work if the structure or shape of start_point and end_point changes. Similarly, how can I use newaxis to modify the following code?
for j in range(32): # for all los
    line_dist[j] = sqrt([sum(l) for l in (end_point[j]-start_point[j])**2])
Sorry for being so clunky; to be clearer, the structure of start_point and end_point is:
array([ [[1,1,1],[],[],[]....[]],
[[],[],[],[]....[]],
[[],[],[],[]....[]]......,
[[],[],[],[]....[]] ])
Explanation of the newaxis version in the question: these are not matrix multiplies, ndarray multiply is element-by-element multiply with broadcasting. step[:,newaxis,newaxis] is num_steps x 1 x 1 and point[newaxis,:,:] is 1 x num_points x num_dimensions. Broadcasting together ndarrays with shape (num_steps x 1 x 1) and (1 x num_points x num_dimensions) will work, because the broadcasting rules are that every dimension should be either 1 or the same; it just means "repeat the array with dimension 1 as many times as the corresponding dimension of the other array". This results in an ndarray with shape (num_steps x num_points x num_dimensions) in a very efficient way; the i, j, k subscript will be the k-th coordinate of the i-th step along the j-th line (given by the j-th pair of start and end points).
Walkthrough:
>>> start_points = numpy.array([[1, 0, 0], [0, 1, 0]])
>>> end_points = numpy.array([[10, 0, 0], [0, 10, 0]])
>>> steps = numpy.arange(10)/9.0
>>> start_points.shape
(2, 3)
>>> steps.shape
(10,)
>>> steps[:,numpy.newaxis,numpy.newaxis].shape
(10, 1, 1)
>>> (steps[:,numpy.newaxis,numpy.newaxis] * start_points).shape
(10, 2, 3)
>>> (steps[:,numpy.newaxis,numpy.newaxis] * (end_points - start_points)) + start_points
array([[[ 1., 0., 0.],
[ 0., 1., 0.]],
[[ 2., 0., 0.],
[ 0., 2., 0.]],
[[ 3., 0., 0.],
[ 0., 3., 0.]],
[[ 4., 0., 0.],
[ 0., 4., 0.]],
[[ 5., 0., 0.],
[ 0., 5., 0.]],
[[ 6., 0., 0.],
[ 0., 6., 0.]],
[[ 7., 0., 0.],
[ 0., 7., 0.]],
[[ 8., 0., 0.],
[ 0., 8., 0.]],
[[ 9., 0., 0.],
[ 0., 9., 0.]],
[[ 10., 0., 0.],
[ 0., 10., 0.]]])
As you can see, this produces the correct answer :) In this case broadcasting (10,1,1) and (2,3) results in (10,2,3). What you had is broadcasting (10,1,1) and (1,2,3) which is exactly the same and also produces (10,2,3).
The code for the distance part of the question does not need newaxis: the inputs are num_points x num_dimensions, the output is num_points, so one dimension has to be removed. That is actually the axis you sum along. This should work:
line_dist = numpy.sqrt(numpy.sum((end_point - start_point) ** 2, axis=1))
Here numpy.sum(..., axis=1) means sum along that axis only, rather than over all elements: an ndarray with shape num_points x num_dimensions summed along axis=1 produces a result with shape num_points, which is correct.
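Using the start_points / end_points arrays from the walkthrough above as a quick check (both lines have length 9):
>>> numpy.sqrt(numpy.sum((end_points - start_points) ** 2, axis=1))
array([ 9.,  9.])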
EDIT: removed code example without broadcasting.
EDIT: fixed up order of indexes.
EDIT: added line_dist
I haven't worked through everything you wrote yet, but some things I can already tell you; maybe they help.
newaxis is a marker rather than a function (in fact, it is plain None). It is used to add an (unused) dimension to a multi-dimensional value. With it you can make a 3D value out of a 2D value (or even more). Each dimension already present in the input value must be represented by a colon : in the index (assuming you want to use all values, otherwise it gets complicated beyond our use case); the dimensions to be added are denoted by newaxis.
Example:
input is a one-dimensional vector (1D): 1,2,3
output shall be a matrix (2D).
There are two ways to accomplish this: the vector could fill the rows with one value each, or the vector could fill just the first and only row of the matrix. The first is created by vector[:,newaxis], the second by vector[newaxis,:]. Results of this:
>>> array([ 7,8,9 ])[:,newaxis]
array([[7],
[8],
[9]])
>>> array([ 7,8,9 ])[newaxis,:]
array([[7, 8, 9]])
(Dimensions of multi-dimensional values are represented by nesting of arrays of course.)
If you have more dimensions in the input, use the colon more than once (otherwise the deeper nested dimensions are simply ignored, i.e. the arrays are treated as simple values). I won't paste a representation of this here as it won't clarify things due to the optical complexity when 3D and 4D values are written on a 2D display using nested brackets. I hope it gets clear anyway.
newaxis reshapes the array in such a way that, when you multiply, numpy uses broadcasting. Here is a good tutorial on broadcasting.
step[:, newaxis, newaxis] is the same as step.reshape((step.shape[0], 1, 1)) (if step is 1d). Either method of reshaping should be very fast, because reshaping arrays in numpy is very cheap; it just makes a view of the array, and you should only be doing it once anyway.
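A two-line check of that equivalence, with step as an arbitrary 1d array:
import numpy as np

step = np.linspace(0.0, 1.0, 10)
a = step[:, np.newaxis, np.newaxis]                # shape (10, 1, 1)
b = step.reshape((step.shape[0], 1, 1))            # same shape, same values
print(a.shape == b.shape, np.array_equal(a, b))    # True True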