I am creating various classes for computational geometry that all subclass numpy.ndarray. The DataCloud class, which is typical of these classes, has Python properties (for example, convex_hull, delaunay_triangulation) that would be time-consuming and wasteful to calculate more than once. I want to do each calculation once and only once, and just in time, because for a given instance I might not need a given property at all. It is easy enough to set this up by setting self.__convex_hull = None in the constructor and, if/when the convex_hull property is called, doing the required calculation, setting self.__convex_hull, and returning the calculated value.
The problem is that once any of those complicated properties is invoked, any change made to the contents from outside my subclass, through the various numpy (as opposed to DataCloud subclass) methods, will invalidate all the calculated properties, and I won't know about it. For example, suppose external code simply does this to the instance: datacloud[3, 8] = 5. So is there any way to either (1) make the ndarray base class read-only once any of those properties is calculated, or (2) have ndarray set some indicator that there has been a change to its contents (which for my purposes makes it dirty), so that invoking any of the complex properties will then trigger a recalculation?
Looks like the answer is numpy.ndarray.setflags, called on the instance once a cached property has been computed:
datacloud.setflags(write=False)
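For illustration, a minimal sketch of option (1), using the DataCloud/convex_hull names from the question; the actual convex hull calculation is elided, and the __new__/__array_finalize__ boilerplate is just the usual pattern for ndarray subclasses:

import numpy as np

class DataCloud(np.ndarray):
    def __new__(cls, input_array):
        # view casting: the usual way to build an ndarray subclass from existing data
        return np.asarray(input_array).view(cls)

    def __array_finalize__(self, obj):
        # make sure the cache attribute exists on new instances, views and copies
        self._convex_hull = getattr(obj, "_convex_hull", None)

    @property
    def convex_hull(self):
        if self._convex_hull is None:
            self._convex_hull = ...  # expensive calculation would go here
            # freeze the underlying data once a cached result exists
            self.setflags(write=False)
        return self._convex_hull

datacloud = DataCloud(np.zeros((10, 10)))
_ = datacloud.convex_hull
try:
    datacloud[3, 8] = 5
except ValueError as exc:
    print(exc)  # assignment destination is read-only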
Related
I have a Python-based system which operates on abstract simplified representations of floor plans. Currently, a floor plan is represented as a plain NumPy array, and operations on it are implemented as functions which accept that array:
floor_plan = numpy.ndarray(...)
def find_rooms(floor_plan: numpy.ndarray) ...
I'd like to refactor the system to instead use a dedicated floor plan class, with these operations defined as methods on that class. This will allow me to also store supplementary information on floor plan instances, perform centralized caching on its methods, etc.
However, I'd also like to retain the ability to treat a floor plan as an array. At the least, this is useful during the refactoring process, as all existing code will continue to work even when the new class is introduced. I can accomplish this by having my new class inherit from numpy.ndarray:
class FloorPlan(numpy.ndarray):
def find_rooms(self) ...
floor_plan = FloorPlan(...)
But there's a problem: if I derive an array from an operation on this subtype, then the derived array is also of the same subtype:
type(floor_plan[:, 0]) == FloorPlan
This is intentional in NumPy's design (https://numpy.org/doc/stable/user/basics.subclassing.html). However, it's not appropriate for my purposes, as a floor plan has discrete semantics which don't necessarily apply to every derived value.
Is there any way to disable or avoid this 'propagation' of my array subtype to derived arrays?
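For illustration, a minimal sketch reproducing the behaviour described above; the zeros(...).view(FloorPlan) construction is just one common way to obtain a subclass instance, and casting a derived value back with view(numpy.ndarray) is shown only as a per-result workaround, not as a complete answer to the question:

import numpy

class FloorPlan(numpy.ndarray):
    def find_rooms(self):
        ...

# view casting: one common way to construct a subclass instance
floor_plan = numpy.zeros((4, 4)).view(FloorPlan)

print(type(floor_plan[:, 0]))            # <class '__main__.FloorPlan'>

# a derived value can be cast back to a plain array, but only explicitly
column = floor_plan[:, 0].view(numpy.ndarray)
print(type(column))                      # <class 'numpy.ndarray'>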
Let's say I have a class Foo that stores some statistical data, and I want to encapsulate the access to these data using Python properties. This is particularly useful, for example, when I only store the variance of a variable and want to have access to its standard deviation: in that case, I could define the property Foo.std and make it return the square root of the variance.
The problem with this approach is that if I need to access Foo.std multiple times, I will calculate the square root each time; furthermore, since the notation of a property is exactly like that of an attribute, the user of my class might not be aware that a computation is taking place every time the property is accessed.
One alternative in this example would be to calculate the standard deviation every time I update my variance, and set it as an attribute. However, that would be inefficient if I don't need to access it at every update.
My question is: what is the best approach to use a Python property efficiently, when you need to perform a costly calculation? Should I cache the result after the first call and delete the cache in case of an update? Or should I rather not use a property and use a method Foo.get_std() instead?
Usually you can do this through caching. For example you can write:
class Foo:

    def __init__(self, some, other, arguments):
        # ...
        self._std = None

    @property
    def std(self):
        if self._std is None:
            # ... calculate standard deviation
            self._std = ...
        return self._std

    def alter_state(self, some, arguments):
        # update state
        # ...
        self._std = None
So here we have a property std, but also an attribute _std. In case the standard deviation is not yet calculated, or the object state is altered such that the standard deviation might have changed, we set _std to None. Now if we access .std, we first check whether _std is None. If it is, we calculate the standard deviation, store it in _std, and return it, so that, as long as the object is not changed, we can later simply retrieve the cached value.
If we alter the object such that the standard deviation might have changed, we set _std back to None, to force re-evaluation in case we access .std again.
If we alter the state of a Foo object twice before recalculating the standard deviation, we will only recalculate it once. So you can change the Foo object frequently with (close to) no extra cost involved (other than setting self._std to None). If you have a huge dataset and you update it extensively, you will only put effort into calculating the standard deviation again when you actually need it.
Furthermore, this can also be an opportunity to update statistical measures in cases where that is (very) cheap. Say for instance you have a list of objects that you frequently update in bulk. If you increment all elements by a constant, then the mean will also increase by that constant. So a function that alters the state in a way that lets some metrics be updated cheaply might update those metrics directly instead of setting them to None.
Note that whether .std is a property or a function is irrelevant here: the user does not have to know how often it has to be calculated. The std() function will guarantee that, once calculated, a second retrieval will be quite fast.
Adding to Willem's answer: starting Python 3.8, we now have functools.cached_property. The official documentation even uses std and variance as examples. I'm linking the 3.9 documentation (https://docs.python.org/3.9/library/functools.html#functools.cached_property) since it has additional explanation on how it works.
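For reference, a sketch along the lines of the example in those docs (the DataSet/stdev names follow the documentation; statistics.stdev stands in for whatever calculation Foo actually performs):

from functools import cached_property
import statistics

class DataSet:
    def __init__(self, sequence_of_numbers):
        self._data = tuple(sequence_of_numbers)

    @cached_property
    def stdev(self):
        # computed on first access, then stored on the instance
        return statistics.stdev(self._data)

ds = DataSet([1.0, 2.0, 4.0, 8.0])
print(ds.stdev)   # computed
print(ds.stdev)   # returned from the cache

Note that cached_property does not invalidate itself when the underlying data changes; the cached value has to be cleared explicitly (for example with del ds.stdev) to force recomputation.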
I'm building a game in Python with Pygame. I have created a class that acts as a button. When clicked it cycles through various states (it'll do more later.)
What I need is a grid of them that I can adjust the size of dynamically. I have a function that iterates and assigns the class objects coordinates and makes the grid (the images spawn but end up being static), but I'm not sure it's creating the objects correctly. Even if it does, they get dropped from memory when the function ends since they're local.
I've looked at a lot of stuff and people say to use dictionaries to store dynamically created objects, but I need the class objects to still be able to use their functions. Is there a way to iterate a group of class objects with dynamic names and retain them? I can manually make a huge grid of them, but I'd really rather not have to explicitly create and assign 100 or more coordinates to objects.
The dictionary (or list, or whatever) is the answer. I don't understand why you would think that storing instances in a container would mean you can't access their methods. You can, in exactly the same way as before.
For example, if your class has a method "my_method" you can call it on an object of that class retrieved from a dictionary:
my_dict['my_instance'].my_method()
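For example, a minimal sketch of building such a grid in a loop; the Button class and its handle_click method here are hypothetical stand-ins for the questioner's own class that cycles through states:

# a hypothetical stand-in for the questioner's button class
class Button:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.state = 0

    def handle_click(self):
        self.state = (self.state + 1) % 3  # cycle through states

# build a 10x10 grid of buttons keyed by their grid position
buttons = {}
for row in range(10):
    for col in range(10):
        buttons[(row, col)] = Button(x=col * 40, y=row * 40)

# any stored instance can still use its methods
buttons[(3, 7)].handle_click()
print(buttons[(3, 7)].state)  # 1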
What is the most efficient (in terms of processing speed and memory utilisation) method for passing a large number of user-input variables as arguments to a function, and for returning multiple results?
A long string of arguments and return values each time I call the function - e.g. (a,b,c,d,e,f,g) = MyFunction(a,b,c,d,e,f,g) - seems inelegant, and I'm guessing is also inefficient; especially if I have to call the function repeatedly or recursively.
However, defining the whole list of variables as global outside of the function is also ugly, and carries the danger of the same variable name being inadvertently reused for different purposes as my program grows.
I've tried putting all the variables into a single array or list and passed that to the function as a single argument, as this seems neater.
Am I correct in thinking that this is also more efficient, even for huge arrays, since it is only the pointer to the start of the array that is passed to the function each time, not the whole array itself?
If arrays are the best method for passing a large number of variables to/from a function, at what point does this efficiency saving kick in - e.g. is it better to pass the arguments separately if there are fewer than 5, but use an array or list if 5 or more arguments are required?
A previous discussion on StackExchange:
Elegant way to pass multiple arguments to a function
has recommended using struct rather than vectors/arrays for passing multiple arguments. Why is this method preferred to using arrays, and at what point do efficiency savings justify the added complexity of using struct?
Are there any other methods that I should consider which will work in Python or C/C++? (For example, I'm new to object-oriented programming, but wonder if this might offer a solution which is specific to Python?)
Many thanks
All of this depends on the target system and its calling convention for functions. This answer applies to C and C++ only.
Generally, the use of file-scope variables will be the fastest option. In such cases, the variable should never be declared as global (accessible throughout the whole project), but as static (accessible from the local file only).
Still, such static file-scope variables should be avoided for several reasons: they make the code harder to read and maintain, undisciplined use may lead to "spaghetti code", they create re-entrancy issues, and they add extra identifiers to the file-scope namespace.
It should be noted that when the number of parameters is limited, keeping them as separate parameters might increase performance, as the compiler may then store some of them in CPU registers instead of on the stack. CPU registers are the fastest way of passing parameters to a function. How this works is very system-specific. However, writing your program in the hope that parameters get passed through CPU registers is premature optimization in most cases.
The best, de facto way of passing multiple arguments is indeed to create a custom struct (or C++ class) containing all of the arguments. This structure is then passed by reference to the function. Try to make it so that the struct contains only variables related to each other. Consider putting variables that are not related to each other, or special just for one given function, in a separate parameter. Good program design supersedes efficiency in most cases.
The reason why a struct/class is preferable to an array is partly that the variables together form a unique type, but also that they will likely have different types from each other. Making an array out of variables that all have different types doesn't make sense.
And in C++, a class offers other advantages over an array, such as constructors and destructors, custom assignment operators etc.
It will obviously depend on what you want to do, because each of the containers has a different purpose.
For sure, in terms of processing speed and memory, you should pass a pointer or a reference to a container (structure, class, array, tuple...), so that only the address of the container is copied rather than all the data.
However, you should not create a structure, or put all your variables in the same container, just to be able to hand them to a function as a single parameter. All the variables that you put in one data structure should be related.
In the example that you gave, there are multiple variables of different types. That is why a structure is preferred: an array requires all elements to have the same type. In Python you could use a named tuple to store variables of different types, as in the sketch below.
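A minimal sketch of that last point (the field names here are purely illustrative):

from collections import namedtuple

# group related inputs of different types into one named container
Inputs = namedtuple("Inputs", ["name", "count", "scale"])

def process(inputs):
    # fields are accessed by name, and only a reference is passed around
    return inputs.name, inputs.count * inputs.scale

result = process(Inputs(name="run1", count=3, scale=2.5))
print(result)  # ('run1', 7.5)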
I am writing a piece of scientific software in Python which comprises both a Poisson equation solver (using the Newton method) on a rectangular mesh, and a particle-in-cell code. I've written the Newton Solver and the particle-in-cell code as separate functions, which are called by my main script.
I had originally written the code as one large script, but decided to break up the script so that it was more modular, and so that the individual functions could be called on their own. My problem is that I have a large number of "global" variables which I consider parameters for the problem. This includes mostly problem constants and parameters which define the problem geometry and mesh (such as dimensions, locations of certain boundary conditions, boundary conditions etc.).
These parameters are required by both the main script and the individual functions. My question is: What is the best way (and most proper) to store these variables such that they can be accessed by both the main script and the functions.
My current solution is to define a class in a separate module (parameters.py) as so:
class Parameters:
    length = 0.008
    width = 0.0014
    nz = 160
    nr = 28
    dz = length / nz
    dr = width / nr
    ...
In my main script I then have:
from parameters import Parameters
par = Parameters()
coeff_a = -2 * (1/par.dr**2 + 1/par.dz**2)
...
This method allows me to then use par as a container for my parameters which can be passed to any functions I want. It also provides an easy way to easily set up the problem space to run just one of the functions on their own. My only concern is that each function does not require everything stored in par, and hence it seems inefficient passing it forward all the time. I could probably remove many of the parameters from par, but then I would need to recalculate them every time a function is called, which seems even more inefficient.
Is there a standard solution which people use in these scenarios? I should mention that my functions are not changing the attributes of par, just reading them. I am also interested in achieving high performance, if possible.
Generally, when your program requires many parameters in different places, it makes sense to come up with a neat configuration system, usually a class that provides a certain interface to your own code.
Upon instantiation of that class, you have a configuration object at hand which you can pass around. In some places you might want to populate it, in other places you just might want to use it. In any case, this configuration object will be globally accessible. If your program is a Python package, then this configuration mechanism might be written in its own module which you can import from all other modules in your package.
The configuration class might provide useful features such as parameter registration (a certain code section says that it needs a certain parameter to be set), definition of defaults and parameter validation.
The actual population of parameters is then based on defaults, user-given commandline arguments or user-given input files.
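As a rough sketch of such a configuration module (the module name config.py, the Config class, and the parameter names are illustrative, loosely based on the question's Parameters example):

# config.py: a shared configuration object for the whole package
class Config:
    def __init__(self):
        # defaults; these could instead be populated from command-line
        # arguments or an input file
        self.length = 0.008
        self.nz = 160

    @property
    def dz(self):
        # derived quantity, always consistent with the current parameters
        return self.length / self.nz

config = Config()  # single instance, created when the module is first imported

# any other module in the package can then do:
#   from config import config
#   coeff = 1 / config.dz ** 2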
To make Jan-Philip Gehrcke's answer more concrete, check out A global class pattern for python (by the way, it's just a normal class, nothing special about "global", but you can pass it around "globally").
Before actually implementing this in my own program, I had the same idea but wanted to find out how others would do it (like questioner nicholls). I was a bit skeptical about implementing this in the first place; in particular, it looked quite strange to instantiate a class in the module itself. But it works fine.
However, there are some things to keep in mind:
It is not super clean. For instance, someone who doesn't know the functions in your module wouldn't expect that a parameter in the configuration class needs to be set first.
If you have to reload your module/functions but want to maintain the values set in your configuration class, you should not instantiate the configuration class again: if "mem" not in locals(): mem = Mem()
It's not advised to assign a parameter from your configuration class as a default argument of a function, for example function(a, b=mem.defaultB): the default is evaluated once, when the function is defined, so you cannot change it later. Instead, write function(a, b=None) and inside the body do if b is None: b = mem.defaultB. Then you can also adjust your configuration class after you have loaded your module/functions (see the sketch at the end of this answer).
Certainly there are more issues...
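To illustrate the default-argument caveat above, a small sketch (Mem and defaultB follow the names used in this answer; the values are made up):

class Mem:
    def __init__(self):
        self.defaultB = 10

mem = Mem()

def bad(a, b=mem.defaultB):
    # b was fixed to 10 when the function was defined
    return a + b

def good(a, b=None):
    # b is looked up each time the function runs
    if b is None:
        b = mem.defaultB
    return a + b

mem.defaultB = 99
print(bad(1))   # 11  (still uses the old default)
print(good(1))  # 100 (picks up the new value)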