I have a large 2D array that I would like to declare once and then occasionally change only some values depending on a parameter, without traversing the whole array.
To build this array, I have subclassed the numpy ndarray class with dtype=object and assigned a function to the elements I want to change, e.g.:
def f(parameter):
    return parameter**2
for i in range(np.shape(A)[0]):
    A[i,i]=f
    for j in range(np.shape(A)[0]):
        A[i,j]=1.
I have then overridden the __getitem__ method so that it returns the evaluation of the function with given parameter if it is callable, otherwise return the value itself.
def __getitem__(self, key):
    # Delegate to ndarray.__getitem__, then evaluate callables lazily.
    value = super().__getitem__(key)
    if callable(value):
        return value(*self.args)
    else:
        return value
where self.args were previously given to the instance of myclass.
However, I need to work with float arrays at the end, and I can't simply convert this array into a dtype=float array with this technique. I also tried to use numpy views, which does not work either for dtype=object.
Do you have any better alternative? Should I override the view method rather than __getitem__?
Edit: I may have to use Cython in the future, so if you have a solution involving e.g. C pointers, I am interested.
In this case, it does not make sense to bind a transformation function to every index of your array.
Instead, a more efficient approach would be to define a transformation as a function, together with the subset of the array it applies to. Here is a basic implementation:
import numpy as np
class LazyEvaluation(object):
    def __init__(self):
        self.transforms = []
    def add_transform(self, function, selection=slice(None), args={}):
        self.transforms.append( (function, selection, args))
    def __call__(self, x):
        y = x.copy()
        for function, selection, args in self.transforms:
            y[selection] = function(y[selection], **args)
        return y
that can be used as follows:
x = np.ones((6, 6))*2
le = LazyEvaluation()
le.add_transform(lambda x: 0, ([3], [0])) # equivalent to x[3,0]; a tuple, since recent NumPy treats a nested list as a single index array
le.add_transform(lambda x: x**2, (slice(4), slice(4,6))) # equivalent to x[:4,4:6]
le.add_transform(lambda x: -1, np.diag_indices(x.shape[0], x.ndim)) # setting the diagonal
result = le(x)
print(result)
which prints,
array([[-1., 2., 2., 2., 4., 4.],
[ 2., -1., 2., 2., 4., 4.],
[ 2., 2., -1., 2., 4., 4.],
[ 0., 2., 2., -1., 4., 4.],
[ 2., 2., 2., 2., -1., 2.],
[ 2., 2., 2., 2., 2., -1.]])
This way you can easily support all advanced Numpy indexing (element by element access, slicing, fancy indexing etc.), while at the same time keeping your data in an array with a native data type (float, int, etc) which is much more efficient than using dtype='object'.
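As a complement, here is a minimal sketch of the same idea applied to the original question, where a few entries depend on a parameter. The class name `ParametrizedArray` and its methods `add_rule` and `update` are hypothetical names chosen for illustration; the point is only that the data stays in a float array and only the registered entries are recomputed when the parameter changes.

```python
import numpy as np

class ParametrizedArray:
    """Hypothetical helper: a float array plus parameter-dependent entries."""

    def __init__(self, base):
        # Keep the data in a native float array (copied from `base`).
        self.values = np.array(base, dtype=float)
        self.rules = []  # list of (index, function) pairs

    def add_rule(self, index, function):
        # `index` can be anything NumPy accepts for indexing.
        self.rules.append((index, function))

    def update(self, parameter):
        # Recompute only the registered entries, not the whole array.
        for index, function in self.rules:
            self.values[index] = function(parameter)
        return self.values

pa = ParametrizedArray(np.ones((4, 4)))
pa.add_rule(np.diag_indices(4), lambda p: p ** 2)  # diagonal depends on p
A = pa.update(3.0)
```

Calling `pa.update` again with another parameter rewrites only the diagonal, and `A` can be passed directly to code that expects a float array.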
Related
I want to write a function that takes an array arr of arbitrary size and shape as its first parameter and overwrites every value that lies in the given interval [a, b] with c. The numbers a, b and c are passed to the function as parameters, as in the input and output below:
arr = np.array([[[5., 2., -5.], [4., 3., 1.]]])
overwrite_interval(arr, -2., 2., 100.) -> ndarray([[[5., 100., -5.], [4., 3., 100.]]])
def overwrite_interval(arr , a , b , c):
    for i in arr[:,:]:
        arr[a,b] = c
arr = np.array([[[5., 2., -5.], [4., 3., 1.]]])
assert overwrite_interval(arr, -2., 2., 100.) #-> ndarray([[[5., 100., -5.], [4., 3., 100.]]])
I think the way you've worded your question doesn't line up with the example you've given. Firstly, the example array you've given is 3D, not 2D. You can do
>>> arr.shape
(1,2,3)
>>> arr.ndim
3
Presumably this is a mistake, and you want your array to be 2D, so you would do
arr = np.array([[5., 2., -5.], [4., 3., 1.]])
instead.
Secondly, if a and b are the bounds of a value interval rather than indexes, so that every element between a and b should be set to c, then the np.where function is great for this.
def overwrite_interval(arr , a , b , c):
    # indices of all elements with a <= value <= b
    inds = np.where((arr >= a) & (arr <= b))
    arr[inds] = c
    return arr
np.where returns a tuple of index arrays, so it can sometimes be easier to work with the boolean array directly. In that case, the function would look like this:
def overwrite_interval(arr , a , b , c):
    # boolean mask, True where a <= value <= b
    inds = (arr >= a) & (arr <= b)
    arr[inds] = c
    return arr
Does this work for you, and is this your intended meaning? Note that the solution I've provided would work as is if you still meant for the initial array to be a 3D array.
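As a quick check, the boolean-mask version can be run on the 3D example from the question. The second function, `overwritten_copy`, is a hypothetical variant added here for illustration: it uses the three-argument form of np.where to return a new array and leave the input untouched.

```python
import numpy as np

def overwrite_interval(arr, a, b, c):
    # In-place: modifies `arr` and returns it.
    mask = (arr >= a) & (arr <= b)
    arr[mask] = c
    return arr

def overwritten_copy(arr, a, b, c):
    # Hypothetical non-mutating variant: picks c where the mask is True,
    # and the original value elsewhere.
    return np.where((arr >= a) & (arr <= b), c, arr)

original = np.array([[[5., 2., -5.], [4., 3., 1.]]])
copy_result = overwritten_copy(original, -2., 2., 100.)

arr = original.copy()
result = overwrite_interval(arr, -2., 2., 100.)
```

Both `result` and `copy_result` keep the shape (1, 2, 3) of the input, which illustrates that the masking approach does not depend on the number of dimensions.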
I am trying to use python to compute the output of a function, say:
$f(x, y) = x + y$
Where x and y are the coordinates of the point in the array. So, the point 5, 5 would have the value 10. This will essentially generate an image of (x,y) and an associated pixel intensity value.
Right now I have a 100x100 dataframe in Python/Pandas, and want to know how to actually perform this calculation. My best guess is to iterate over each row and, using the index of the row (y) and the index of the element (x), pass these two values into the function and set the point to that value.
This is essentially a basic multivariable graphing problem. Was hoping someone had some experience doing stuff like this. Thank you!
There are numpy functions fromfunction and indices. They'll probably do what you want.
import numpy as np
np.fromfunction( lambda r, c: r+c, shape = (5,5))
# array([[0., 1., 2., 3., 4.],
# [1., 2., 3., 4., 5.],
# [2., 3., 4., 5., 6.],
# [3., 4., 5., 6., 7.],
# [4., 5., 6., 7., 8.]])
fromfunction takes a function as its first argument, followed by the shape. It calls the function once, passing one index array per axis, so the function needs as many positional arguments as the shape has dimensions and should operate on whole arrays.
np.indices((3,3))
# array([[[0, 0, 0], # Row coordinates
# [1, 1, 1],
# [2, 2, 2]],
#
# [[0, 1, 2], # Column coordinates
# [0, 1, 2],
# [0, 1, 2]]])
These can be used as function arguments to drive your results.
There are also np.ogrid and np.mgrid which generate np.arrays to use in any calculations. A lot depends on exactly what you want to do.
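For instance, a minimal sketch with np.ogrid for the 100x100 case from the question could look like the following. The variable names `rows`, `cols` and `image` are chosen here for illustration only.

```python
import numpy as np

# ogrid returns "open" grids: rows has shape (100, 1), cols has shape (1, 100).
rows, cols = np.ogrid[0:100, 0:100]

# Broadcasting expands the sum to a full (100, 100) array.
image = rows + cols

# mgrid returns the fully expanded index arrays instead, stacked along axis 0.
full = np.mgrid[0:3, 0:3]
```

Here `image[5, 5]` is 10, matching the example in the question, and the array can be wrapped in `pandas.DataFrame(image)` if a dataframe is needed.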
Edit: np.fromfunction with keyword arguments.
def test ( a, b, c, m0=1, m1 =1): # Specify function with kwargs.
    return a * m0 + b * m1 + c
np.fromfunction(test, (4, 3, 5 ), m0=100, m1=10) # Change the kwargs at run time.
# array([[[ 0., 1., 2., 3., 4.],
# [ 10., 11., 12., 13., 14.],
# [ 20., 21., 22., 23., 24.]],
# [[100., 101., 102., 103., 104.],
# [110., 111., 112., 113., 114.],
# [120., 121., 122., 123., 124.]],
# [[200., 201., 202., 203., 204.],
# [210., 211., 212., 213., 214.],
# [220., 221., 222., 223., 224.]],
# [[300., 301., 302., 303., 304.],
# [310., 311., 312., 313., 314.],
# [320., 321., 322., 323., 324.]]])
Data is in a structured array:
import numpy as np
dtype = [(field, float) for field in ['x', 'y', 'z', 'prop1', 'prop2']]
data = np.array([(1,2,3,4,5), (6,7,8,9,10), (11,12,13,14,15)], dtype=dtype)
For some operations, the positions are accessed as a single nx3 array, for example:
positions = data[['x', 'y', 'z']].view(dtype=float).reshape(-1, 3)
ranges = np.sqrt(np.sum(positions**2, 1))
Since numpy 1.12, the following warning is emitted:
FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a
structured array.
This code may break in numpy 1.13 because this will return a view instead of a copy -- see release notes for details.
Here is the corresponding entry in the release notes:
Indexing a structured array with multiple fields (eg, arr[['f1', 'f3']]) will return a view into the original array in 1.13, instead of a copy. Note the returned view will have extra padding bytes corresponding to intervening fields in the original array, unlike the copy in 1.12, which will affect code such as arr[['f1', 'f3']].view(newdtype).
How to port this code to numpy >=1.13?
Checking on numpy 1.13 the announced change doesn't appear to have happened yet. So let's simulate the future:
The future behavior will presumably be not to copy the data but to create a dtype that has only the fields you want, but the itemsize of the original dtype. So there will be gaps in each element, parts of memory that are not used.
xyz_tp = np.dtype({'names': list('xyz'),
                   'formats': tuple(data.dtype.fields[f][0] for f in 'xyz'),
                   'offsets': tuple(data.dtype.fields[f][1] for f in 'xyz'),
                   'itemsize': data.dtype.itemsize})
xyz = data.view(xyz_tp)
xyz
# array([( 1., 2., 3.), ( 6., 7., 8.), ( 11., 12., 13.)],
# dtype={'names':['x','y','z'], 'formats':['<f8','<f8','<f8'], 'offsets':[0,8,16], 'itemsize':40})
The unused memory locations and their content are ignored but still there, so if you view with a builtin dtype they'll reappear.
xyz.view(float)
# array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
# 12., 13., 14., 15.])
# Ouch!
The general fix would be to cast to a contiguous (no gaps) dtype with the same fields. This will force a copy:
xyz_cont_tp = np.dtype({'names': list('xyz'), 'formats': 3*('<f8',)})
xyz.astype(xyz_cont_tp).view(float).reshape(-1, 3)
# array([[ 1., 2., 3.],
# [ 6., 7., 8.],
# [ 11., 12., 13.]])
In the special case of your selected fields being contiguous and of same type you can also do:
np.lib.stride_tricks.as_strided(data.view(float), shape=(3,3), strides=data.strides + (8,))
# array([[ 1., 2., 3.],
# [ 6., 7., 8.],
# [ 11., 12., 13.]])
This method does not copy data but creates a genuine view.
Another way works for several adjacent float fields. Here, for 3 fields starting from 'x', we obtain the same result with:
np.ndarray((len(data),3), float, data, offset= data.dtype.fields['x'][1], strides= (data.strides[0], np.dtype(float).itemsize))
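As an alternative to the approaches above, and assuming NumPy 1.16 or newer is available, the helper `structured_to_unstructured` from `numpy.lib.recfunctions` converts a multi-field selection to a plain 2D array without having to build the dtype by hand. The sketch below reuses the `data` array from the question.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dtype = [(field, float) for field in ['x', 'y', 'z', 'prop1', 'prop2']]
data = np.array([(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15)],
                dtype=dtype)

# Convert the selected fields to an (n, 3) float array; padding bytes from
# the intervening fields are handled by the helper.
positions = rfn.structured_to_unstructured(data[['x', 'y', 'z']])
ranges = np.sqrt(np.sum(positions**2, axis=1))
```

This keeps the calling code close to the original two lines from the question, at the cost of requiring a more recent NumPy.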
I have a question concerning overriding operators. I just tried to override the __add__(self, other) operator in a custom class, such that one of its elements (a numpy array) can be added to another numpy array. To make both directions of summing possible, I declared both the __add__ and the __radd__ operators. A small example:
import numpy as np
class MyClass():
    def __init__(self, x):
        self.x = x
        self._mat = self._calc_mat()
    def _calc_mat(self):
        return np.eye(2)*self.x
    def __add__(self, other):
        return self._mat + other
    def __radd__(self, other):
        return self._mat + other
def some_function(x):
    return x + np.ones(4).reshape((2,2))
def some_other_function(x):
    return np.ones(4).reshape((2,2)) + x
inst = MyClass(3)
some_function(x=inst)
some_other_function(x=inst)
Strangely, I get two different outputs. The first output, from some_function, is just as expected:
Out[1]
array([[ 4., 1.],
[ 1., 4.]])
The second output gives me something odd:
Out[2]:
array([[array([[ 4., 1.],
[ 1., 4.]]),
array([[ 4., 1.],
[ 1., 4.]])],
[array([[ 4., 1.],
[ 1., 4.]]),
array([[ 4., 1.],
[ 1., 4.]])]], dtype=object)
Does somebody have an idea why is that?
Thanks, Markus :-)
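The issue is that the NumPy array also implements an __add__ method, and it is called before your __radd__. Because your object is not an array, NumPy treats it as a scalar of dtype=object and broadcasts the addition over every element, which is why you get an object array of arrays.
You can see this answer for a solution: https://stackoverflow.com/a/22633052/7033869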
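One possible way to get the expected behaviour, assuming NumPy 1.13 or newer (this is a sketch and not necessarily what the linked answer proposes), is to set `__array_ufunc__ = None` on the class. NumPy then returns NotImplemented from the array's own `__add__`, and Python falls back to the `__radd__` of the custom class.

```python
import numpy as np

class MyClass:
    # Opt out of NumPy's ufunc machinery so that ndarray.__add__ defers to
    # this class's __radd__ instead of broadcasting over object elements.
    __array_ufunc__ = None

    def __init__(self, x):
        self.x = x
        self._mat = np.eye(2) * self.x

    def __add__(self, other):
        return self._mat + other

    def __radd__(self, other):
        return self._mat + other

inst = MyClass(3)
left = inst + np.ones((2, 2))    # uses MyClass.__add__
right = np.ones((2, 2)) + inst   # now uses MyClass.__radd__
```

With this change both `left` and `right` are ordinary 2x2 float arrays rather than an object array of arrays.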
In pure Python you can grow matrices column by column pretty easily:
data = []
for i in something:
    newColumn = getColumnDataAsList(i)
    data.append(newColumn)
NumPy's array doesn't have the append function. The hstack function doesn't work on zero sized arrays, thus the following won't work:
data = numpy.array([])
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data = numpy.hstack((data, newColumn)) # ValueError: arrays must have same number of dimensions
So, my options are either to move the initialization inside the loop with an appropriate condition:
data = None
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    if data is None:
        data = newColumn
    else:
        data = numpy.hstack((data, newColumn)) # works
... or to use a Python list and convert it to an array later:
data = []
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data.append(newColumn)
data = numpy.array(data)
Both variants seem a little bit awkward to me. Are there nicer solutions?
NumPy actually does have an append function, which it seems might do what you want, e.g.,
import numpy as NP
my_data = NP.random.randint(0, 10, 9).reshape(3, 3) # randint replaces the deprecated random_integers
new_col = NP.array((5, 5, 5)).reshape(3, 1)
res = NP.append(my_data, new_col, axis=1)
Your second snippet (hstack) will work if you add another line, e.g.,
my_data = NP.random.randint(0, 10, 16).reshape(4, 4)
# the line to add--does not depend on array dimensions
new_col = NP.zeros_like(my_data[:,-1]).reshape(-1, 1)
res = NP.hstack((my_data, new_col))
hstack gives the same result as concatenate((my_data, new_col), axis=1); I'm not sure how they compare performance-wise.
While that's the most direct answer to your question, I should mention that looping through a data source to populate a target via append, while just fine in Python, is not idiomatic NumPy. Here's why:
Initializing a NumPy array is relatively expensive, and with this conventional Python pattern you incur that cost, more or less, at each loop iteration (i.e., each append to a NumPy array allocates a new array of a different size and copies the data into it).
For that reason, the common pattern in NumPy for iterative addition of columns to a 2D array is to initialize an empty target array once (that is, pre-allocate a single 2D NumPy array having all of the empty columns) and then successively populate those empty columns by setting the desired column-wise offset (index). This is much easier to show than to explain:
>>> # initialize your skeleton array using 'empty', which allocates without filling in values
>>> M = NP.empty(shape=(10, 5), dtype=float)
>>> # create a small function to mimic step-wise populating this empty 2D array:
>>> fnx = lambda v : NP.random.randint(0, 10, v)
>>> # populate the NumPy array as in the OP, except each iteration just re-sets the values of M at successive column-wise offsets
>>> for index, itm in enumerate(range(5)):
...     M[:,index] = fnx(10)
>>> M
array([[ 1., 7., 0., 8., 7.],
[ 9., 0., 6., 9., 4.],
[ 2., 3., 6., 3., 4.],
[ 3., 4., 1., 0., 5.],
[ 2., 3., 5., 3., 0.],
[ 4., 6., 5., 6., 2.],
[ 0., 6., 1., 6., 8.],
[ 3., 8., 0., 8., 0.],
[ 5., 2., 5., 0., 1.],
[ 0., 6., 5., 9., 1.]])
>>> # of course, if you don't know in advance what size your array should be,
>>> # just create one much bigger than you need and trim the 'unused' portions
>>> # when you finish populating it
>>> M[:3,:3]
array([[ 1., 7., 0.],
[ 9., 0., 6.],
[ 2., 3., 6.]])
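When the number of columns is not known in advance, a middle ground between appending inside the loop and pre-allocating is to collect the columns in a Python list and stack them once at the end. The sketch below uses np.column_stack; the function `get_column` is a hypothetical stand-in for the question's getColumnDataAsNumpyArray.

```python
import numpy as np

def get_column(i, n_rows=4):
    """Hypothetical stand-in for getColumnDataAsNumpyArray."""
    return np.full(n_rows, float(i))

# Appending to a list is cheap; the array is allocated only once, at the end.
columns = [get_column(i) for i in range(3)]

# column_stack turns each 1D array into one column of the 2D result.
data = np.column_stack(columns)
```

Unlike numpy.array(columns), which would place each collected array in a row, column_stack yields one column per list entry, so no transpose is needed.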
Usually you don't keep resizing a NumPy array when you create it. What don't you like about your third solution? If it's a very large matrix/array, then it might be worth allocating the array before you start assigning its values:
x = len(something)
y = getColumnDataAsNumpyArray.someLengthProperty
data = numpy.zeros( (x,y) )
for i in something:
    data[i] = getColumnDataAsNumpyArray(i)
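A runnable sketch of this pre-allocation idea, written so that each result fills one column, might look as follows. The function `get_column` and the sizes `n_rows` and `n_cols` are assumptions made for illustration, standing in for getColumnDataAsNumpyArray and the real dimensions.

```python
import numpy as np

def get_column(i, n_rows):
    """Hypothetical stand-in for getColumnDataAsNumpyArray."""
    return np.arange(n_rows, dtype=float) + 10 * i

n_rows, n_cols = 4, 3

# Allocate the full array once; every entry is overwritten in the loop.
data = np.empty((n_rows, n_cols))
for i in range(n_cols):
    data[:, i] = get_column(i, n_rows)  # write into column i in place
```

No array is reallocated inside the loop, which is the main advantage over repeated hstack or append calls.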
The hstack can work on zero sized arrays:
import numpy as np
N = 5
M = 15
a = np.ndarray(shape = (N, 0))
for i in range(M):
    b = np.random.rand(N, 1)
    a = np.hstack((a, b))
Generally it is expensive to keep reallocating the NumPy array, so your third solution is really the best performance-wise.
However, I think hstack will do what you want. The clue is in the error message:
ValueError: arrays must have same number of dimensions
I'm guessing that newColumn has two dimensions (rather than being a 1D vector), so data also needs to have two dimensions, for example data = np.array([[]]). Alternatively, make newColumn a 1D vector with np.squeeze(newColumn) (generally, if things are 1D it is better to keep them 1D in NumPy, so that broadcasting etc. works better), in which case hstack or vstack should work with your original definition of data.