Custom data types in numpy arrays - python

I'm creating a numpy array which is to be filled with objects of a particular class I've made. I'd like to initialize the array such that it will only ever contain objects of that class. For example, here's what I'd like to do, and what happens if I do it.
class Kernel:
    pass
>>> L = np.empty(4,dtype=Kernel)
TypeError: data type not understood
I can do this:
>>> L = np.empty(4,dtype=object)
and then assign each element of L as a Kernel object (or any other type of object). It would be so neat were I able to have an array of Kernels, though, from both a programming point of view (type checking) and a mathematical one (operations on sets of functions).
Is there any way for me to specify the data type of a numpy array using an arbitrary class?

If your Kernel class has a predictable amount of member data, then you could define a dtype for it instead of a class. e.g. if it's parameterized by 9 floats and an int, you could do
kerneldt = np.dtype([('myintname', np.int32), ('myfloats', np.float64, 9)])
arr = np.empty(dims, dtype=kerneldt)
You'll have to do some coercion to turn them into objects of class Kernel every time you want to manipulate methods of a single kernel but that's one way to store the actual data in a NumPy array. If you want to only store a reference, then the object dtype is the best you can do without subclassing ndarray.
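For concreteness, here is a minimal sketch of that coercion step. The Kernel constructor signature below is made up purely for illustration; only the dtype itself comes from the example above:
import numpy as np

class Kernel:
    def __init__(self, myint, myfloats):   # hypothetical constructor
        self.myint = myint
        self.myfloats = myfloats

kerneldt = np.dtype([('myintname', np.int32), ('myfloats', np.float64, 9)])
arr = np.zeros(3, dtype=kerneldt)           # the data lives in one contiguous block

rec = arr[0]                                # a structured scalar (one record)
k = Kernel(int(rec['myintname']), np.asarray(rec['myfloats']))  # coerce back to a Kernel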

It has to be a Numpy scalar type:
http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#arrays-scalars-built-in
or a subclass of ndarray:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html#numpy.ndarray

As far as I know, enforcing a single type for elements in a numpy.ndarray has to be done manually (unless the array contains Numpy scalars): there is no built-in checking mechanism (your array has dtype=object). If you really want to enforce a single type, you have to subclass ndarray and implement the checks in the appropriate methods (__setitem__, etc.).
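As a rough illustration of that approach, here is a minimal (and incomplete) sketch of such a subclass; it only guards scalar assignment, so slice assignment and other construction routes would need the same treatment:
import numpy as np

class Kernel:       # stand-in for your actual class
    pass

class KernelArray(np.ndarray):
    def __new__(cls, length):
        # the underlying storage is still an ordinary object array
        return np.empty(length, dtype=object).view(cls)

    def __setitem__(self, index, value):
        if not isinstance(value, Kernel):
            raise TypeError("KernelArray only accepts Kernel instances")
        super().__setitem__(index, value)

L = KernelArray(4)
L[0] = Kernel()    # fine
# L[1] = 42        # raises TypeError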
If you want to implement operations on a set of functions (Kernel objects), you might be able to do so by defining the proper operations directly in your Kernel class. This is what I did for my uncertainties.py module, which handles numpy.ndarrays of numbers with uncertainties.

Related

How do I define an array of objects of an extension type in Cython without using list? [duplicate]

I would like to have a cython array of a cdef class:
cdef class Child:
    cdef int i
    def do(self):
        self.i += 1

cdef class Mother:
    cdef Child[:] array_of_child
    def __init__(self):
        for i in range(100):
            self.array_of_child[i] = Child()
The answer is no - it is not really possible in a useful way: newsgroup post of essentially the same question
It wouldn't be possible to have a direct array (allocated in a single chunk) of Childs. Partly because, if anything else ever gets a reference to a Child in the array, that Child has to be kept alive (but not the whole array), which wouldn't be possible to ensure if they were all allocated in the same chunk of memory. Additionally, if the array were ever resized (if this is a requirement), that would invalidate any other references to the objects within the array.
Therefore you're left with having an array of pointers to Child. Such a structure would be fine, but internally would look almost exactly like a Python list (so there's really no benefit to doing something more complicated in Cython...).
There are a few sensible workarounds:
The workaround suggested in the newsgroup post is just to use a python list. You could also use a numpy array with dtype=object. If you need to access a cdef function in the class you can do a cast first:
cdef Child c = <Child?>a[0] # omit the ? if you don't want
# the overhead of checking the type.
c.some_cdef_function()
Internally both these options are stored as a C array of PyObject pointers to your Child objects and so are not as inefficient as you probably assume.
A further possibility might be to store your data as a C struct (cdef struct ChildStruct: ....) which can be readily stored as an array. When you need a Python interface to that struct you can either define Child so it contains a copy of the ChildStruct (but modifications won't propagate back to your original array), or a pointer to the ChildStruct (but then you need to be careful to ensure the memory is not freed while the Child pointing to it is alive).
You could use a Numpy structured array - this is pretty similar to using an array of C structs except Numpy handles the memory, and provides a Python interface.
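For instance, a minimal sketch using a structured array; the single int field here is just a stand-in for whatever Child actually holds:
import numpy as np

child_dt = np.dtype([('i', np.int32)])

children = np.zeros(100, dtype=child_dt)   # one contiguous block, no Python objects
children['i'] += 1                         # vectorised update of every "child"
print(children[0])                         # (1,)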
The memoryview syntax in your question is valid: cdef Child[:] array_of_child. This can be initialized from a numpy array of dtype object:
array_of_child = np.array([Child() for i in range(100)])
In terms of data structure, this is an array of pointers (i.e. the same as a Python list, but it can be multi-dimensional). It avoids the need for <Child> casting. The important thing it doesn't do is any kind of type checking - if you feed an object that isn't a Child into the array it won't notice (because the underlying dtype is object), but will give nonsense answers or segmentation faults.
In my view this approach gives you a false sense of security about two things: first, that you have made a more efficient data structure (you haven't, it's basically the same as a list); second, that you have any kind of type safety. However, it does exist. (If you want to use memoryviews, e.g. for multi-dimensional arrays, it would probably be better to use a memoryview of type object - this is honest about the underlying dtype.)

Fastest way to get subset of numpy array in Cython

I have a Cython function that takes a 2d nd.array (numpy array) of integers and returns a 1d numpy array whose length is the same as the input 2d array.
import numpy as np
cimport numpy as np
np.import_array()
cimport cython
def func(np.ndarray[np.float_t, ndim=2] input_arr):
    cdef np.ndarray[np.float_t, ndim=1] new_arr = ...
    # do stuff
    return new_arr
In another loop in the program, I want to call func, but pass it a 2d array that is created dynamically from another 2d array. Right now I have:
my_2d_numpy_array = np.array([[0.5, 0.1], [0.1, 10]]) # assume this is defined
cdef int N = 10000
cdef int k
for j in xrange(N):
    # find some element k of interest
    # create a 2d array on the fly containing just the k-th row and pass it to func()
    func(np.array([my_2d_numpy_array[k]], dtype=float))  # KEY LINE
This works, but I think that the call to np.array each time inside the loop creates a huge overhead, because it goes back to Python. Since func only reads the array and doesn't modify it, how can I just pass it a view of the array as a pointer, without making a new array by going back to Python? I'm only interested in pulling out the kth row of my_2d_numpy_array and passing that to func()
Update: A related question: if I am using an nd.array inside the loop but don't need the full functionality of nd.array in func, can I make func instead take something like a static C array and somehow treat the nd.array as that? Will that save costs? Presumably then you don't have to pass an object to func (nd.array is an object)
You want to use Cython memory views.
They are designed for passing array slices between functions that are a part of the same Cython module.
You may need to inline the function within your Cython module to get the full performance benefit, but that isn't always necessary.
You can take a look at the documentation.
I recently wrote a rather lengthy answer to another question that looks into when memory views should be used.
If you want a more detailed examination of why slicing works well with memory views, have a look at this blog post.
If you don't use memory views, the slicing involving NumPy arrays still involves a Python call and is not performed in C.
For your specific case, here are a few thoughts:
If you are passing array slices between functions in your Cython module you should be able to use a memory view to pass the slices.
This approach does depend on compile-time optimizations, so if you need to pass an array between two functions that are compiled at separate times, you will have to use a pointer to pass data between functions.
This will mean doing some careful pointer arithmetic, but it should still work.
If you need to do slicing and use NumPy functions, you may just end up having to use NumPy arrays, but it could be worth trying to use NumPy arrays and memory views that view the same data.
That way you will be able to pass slices as memory views, while only having to create NumPy arrays when you really need them.
Also, I would recommend making the function func a C-function so that you don't have to go through the overhead of calling a Python function when you call it.
You can do that by using the cdef or cpdef keyword to declare it.
Use cdef if you don't need to call it from outside the module.
Use cpdef if you want a C function and a corresponding Python wrapper that is accessible to Python.
func(my_2d_numpy_array[k:k+1])
Slicing my_2d_numpy_array instead of indexing it gets you the view you wanted with the shape you wanted.
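A quick way to convince yourself of the difference, in plain NumPy outside Cython:
import numpy as np

my_2d_numpy_array = np.array([[0.5, 0.1], [0.1, 10.0]])
k = 1

row_view = my_2d_numpy_array[k:k+1]            # slicing: shape (1, 2), no data copied
row_copy = np.array([my_2d_numpy_array[k]])    # the original approach: allocates a new array

print(row_view.shape)                          # (1, 2)
print(row_view.base is my_2d_numpy_array)      # True: it shares memory with the original
print(row_copy.base is my_2d_numpy_array)      # False: it does not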

Subclassing numpy scalar types

I'm trying to subclass numpy.complex64 in order to make use of the way numpy stores the data (contiguous, alternating real and imaginary part) but use my own __add__, __sub__, ... routines.
My problem is that when I make a numpy.ndarray, setting dtype=mysubclass, I get a numpy.ndarray with dtype=numpy.complex64 instead, which results in numpy not using my own functions for additions, subtractions and so on.
Example:
import numpy as np
class mysubclass(np.complex64):
    pass
a = mysubclass(1+1j)
A = np.empty(2, dtype=mysubclass)
print type(a)
print repr(A)
Output:
<class '__main__.mysubclass'>
array([ -2.07782988e-20 +4.58546896e-41j, -2.07782988e-20 +4.58546896e-41j], dtype=complex64)
Does anyone know how to do this?
Thanks in advance - Soren
The NumPy type system is only designed to be extended from C, via the PyArray_RegisterDataType function. It may be possible to access this functionality from Python using ctypes but I wouldn't recommend it; better to write an extension in C or Cython, or subclass ndarray as @seberg describes.
There's a simple example dtype in the NumPy source tree: newdtype_example/floatint.c. If you're into Pyrex, reference.pyx in the pytables source may be worth a look.
Note that scalars and arrays are quite different in numpy. np.complex64 is a scalar type (and note that it is single precision, i.e. two 32-bit floats, not double precision). You will not be able to change the array like that; you will need to subclass the array instead and then override its __add__ and __sub__.
If that is all you want to do, it should just work otherwise look at http://docs.scipy.org/doc/numpy/user/basics.subclassing.html since subclassing an array is not that simple.
However, if you also want to use this type as a scalar (for example when you index scalars out of the array), it gets more difficult, at least currently. You can get a little further by defining __array_wrap__ to convert scalars to your own scalar type for some reduce functions; for indexing to work in all cases it appears to me that you may have to define your own __getitem__ currently.
In all cases with this approach, you still use the complex datatype, and all functions that are not explicitly overridden will still behave the same. @ecatmur mentioned that you can create new datatypes from the C side, if that is really what you want.
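If subclassing the array is enough for your needs, a bare-bones sketch might look like the following; the class name is made up, and it does not handle __array_wrap__ or scalar indexing:
import numpy as np

class MyComplex64Array(np.ndarray):   # hypothetical subclass
    def __new__(cls, data):
        return np.asarray(data, dtype=np.complex64).view(cls)

    def __add__(self, other):
        # put your own addition rule here; this one just delegates to NumPy
        return super().__add__(other)

a = MyComplex64Array([1+1j, 2+2j])
print(type(a + a))   # the result stays a MyComplex64Array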

Python-numpy test for ndarray using ndim

I'm working on a project in Python requiring a lot of numerical array calculations. Unfortunately (or fortunately, depending on your POV), I'm very new to Python, but have been doing MATLAB and Octave programming (APL before that) for years. I'm very used to having every variable automatically typed to a matrix float, and still getting used to checking input types.
In many of my functions, I require the input S to be a numpy.ndarray of size (n,p), so I have to both test that type(S) is numpy.ndarray and get the values (n,p) = numpy.shape(S). One potential problem is that the input could be a list/tuple/int/etc..., another problem is that the input could be an array of shape (), i.e. S.ndim == 0. It occurred to me that I could simultaneously test the variable type, fix the S.ndim == 0 problem, then get my dimensions like this:
# first simultaneously test for ndarray and get proper dimensions
try:
    if (S.ndim == 0):
        S = S.copy(); S.shape = (1,1);
    # define dimensions p, and p2
    (p,p2) = numpy.shape(S);
except AttributeError:  # got here because input is not something array-like
    raise AttributeError("blah blah blah");
Though it works, I'm wondering if this is a valid thing to do? The docstring for ndim says
If it is not already an ndarray, a conversion is attempted.
and we surely know that numpy can easily convert an int/tuple/list to an array, so I'm confused why an AttributeError is being raised for these types of inputs, when numpy should be doing this
numpy.array(S).ndim;
which should work.
When doing input validation for NumPy code, I always use np.asarray:
>>> np.asarray(np.array([1,2,3]))
array([1, 2, 3])
>>> np.asarray([1,2,3])
array([1, 2, 3])
>>> np.asarray((1,2,3))
array([1, 2, 3])
>>> np.asarray(1)
array(1)
>>> np.asarray(1).shape
()
This function has the nice feature that it only copies data when necessary; if the input is already an ndarray, the data is left in-place (only the type may be changed, because it also gets rid of that pesky np.matrix).
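You can check the no-copy behaviour directly:
>>> a = np.array([1, 2, 3])
>>> np.asarray(a) is a                 # already an ndarray: the same object comes back
True
>>> np.asarray(a, dtype=float) is a    # a dtype change forces a new array
False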
The docstring for ndim says
That's the docstring for the function np.ndim, not the ndim attribute, which non-NumPy objects don't have. You could use that function, but the effect would be that the data might be copied twice, so instead do:
S = np.asarray(S)
(p, p2) = S.shape
This will raise a ValueError if S.ndim != 2.
[Final note: you don't need ; in Python if you just follow the indentation rules. In fact, Python programmers eschew the semicolon.]
Given the comments to @larsmans' answer, you could try:
if not isinstance(S, np.ndarray):
    raise TypeError("Input not a ndarray")
if S.ndim == 0:
    S = np.reshape(S, (1,1))
(p, p2) = S.shape
First, you check explicitly whether S is a (subclass of) ndarray. Then, you use np.reshape to copy your data (reshaping it, of course) if needed. Finally, you get the dimensions.
Note that in most cases, the np functions will first try to access the corresponding method of a ndarray, then attempt to convert the input to a ndarray (sometimes keeping it a subclass, as in np.asanyarray, sometimes not, as in np.asarray). In other terms, it's always more efficient to use the method rather than the function: that's why we're using S.shape and not np.shape(S).
Another point: np.asarray, np.asanyarray, np.atleast_1d... are all particular cases of the more generic function np.array. For example, asarray sets the optional copy argument of array to False, asanyarray does the same and also sets subok=True, atleast_1d sets ndmin=1, atleast_2d sets ndmin=2... In other terms, it's always possible to use np.array with the appropriate arguments instead. But as mentioned in some comments, it's a matter of style. Shortcuts can often improve readability, which is always an objective to keep in mind.
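For example, these two spellings have roughly the same effect on a 0-d array:
>>> x = np.array(5)                  # a 0-d array
>>> np.atleast_1d(x).shape
(1,)
>>> np.array(x, ndmin=1).shape       # roughly the same thing via np.array
(1,)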
In any case, when you use np.array(..., copy=True), you're explicitly asking for a copy of your initial data, a bit like doing a list([....]). Even if nothing else changed, your data will be copied. That has the advantages of its drawbacks (as we say in French), you could for example change the order from row-first C to column-first F. But anyway, you get the copy you wanted.
With np.array(input, copy=False), a new array is not necessarily created: if input was already a ndarray (of a compatible dtype) it is simply reused, pointing to the same block of memory (that is, no waste of memory); otherwise a new array is created "from scratch". The interesting case is of course when input was already a ndarray.
Using this new array in a function may or may not change the original input, depending on the function. You have to check the documentation of the function you want to use to see whether it returns a copy or not. The NumPy developers try hard to limit unnecessary copies (following the Python example), but sometimes it can't be avoided. The documentation should tell explicitly what happens, if it doesn't or it's unclear, please mention it.
np.array(...) may raise some exceptions if something goes awry. For example, trying to use dtype=float with an input like ["STRING", 1] will raise a ValueError. However, I must admit I can't remember which exceptions are raised in which cases, please edit this post accordingly.
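The string-to-float case mentioned above, for instance (the exact message may vary between versions):
>>> np.array(["STRING", 1], dtype=float)
Traceback (most recent call last):
  ...
ValueError: could not convert string to float: 'STRING'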
Welcome to Stack Overflow. This comes down to almost a style choice, but the most common way I've seen to deal with this kind of situation is to convert the input to an array. Numpy provides some useful tools for this. numpy.asarray has already been mentioned, but here are a few more: numpy.atleast_1d is similar to asarray, but reshapes () arrays to (1,); numpy.atleast_2d is the same as above but also reshapes 0d and 1d arrays to 2d, i.e. (3,) to (1, 3). The reason we convert "array_like" inputs to arrays is partly just because we're lazy (for example, sometimes it can be easier to write foo([1, 2, 3]) than foo(numpy.array([1, 2, 3]))), but this is also the design choice made within numpy itself. Notice that the following works:
>>> numpy.mean([1., 2., 3.])
2.0
In the docs for numpy.mean we can see that a should be "array_like".
Parameters
----------
a : array_like
Array containing numbers whose mean is desired. If `a` is not an
array, a conversion is attempted.
That being said, there are situations when you want to only accept arrays as arguments and not all "array_like" types.

Immutable numpy array?

Is there a simple way to create an immutable NumPy array?
If one has to derive a class from ndarray to do this, what's the minimum set of methods that one has to override to achieve immutability?
You can make a numpy array unwriteable:
a = np.arange(10)
a.flags.writeable = False
a[0] = 1
# Gives: ValueError: assignment destination is read-only
Also see the discussion in this thread:
http://mail.scipy.org/pipermail/numpy-discussion/2008-December/039274.html
and the documentation:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flags.html
I have a subclass of Array at this gist: https://gist.github.com/sfaleron/9791418d7023a9985bb803170c5d93d8
It makes a copy of its argument and marks that as read-only, so you should only be able to shoot yourself in the foot if you are very deliberate about it. My immediate need was for it to be hashable, so I could use them in sets, so that works too. It isn't a lot of code, but about 70% of the lines are for testing, so I won't post it directly.
Note that it's not a drop-in replacement; it won't accept any keyword args like a normal Array constructor. Instances will behave like Arrays, though.
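For reference, here is a much-reduced sketch of the idea (not the gist's code, just the general shape of it):
import numpy as np

class FrozenArray(np.ndarray):
    def __new__(cls, data):
        obj = np.array(data, copy=True).view(cls)   # defensive copy of the input
        obj.flags.writeable = False                  # reject in-place modification
        return obj

    def __hash__(self):
        # hash the shape and raw bytes so equal-content arrays hash alike
        return hash((self.shape, self.tobytes()))

    def __eq__(self, other):
        # return a plain bool (not an elementwise array) so instances work in sets
        return isinstance(other, np.ndarray) and np.array_equal(self, other)

a = FrozenArray([1, 2, 3])
s = {a, FrozenArray([1, 2, 3])}   # hashable, so usable in a set; len(s) == 1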
Setting the flag directly didn't work for me, but using ndarray.setflags did work:
a = np.arange(10)
a.setflags(write=False)
a[0] = 1 # ValueError
