Let's say I have a class Foo that stores some statistical data, and I want to encapsulate the access to these data using Python properties. This is particularly useful, for example, when I only store the variance of a variable and want to have access to its standard deviation: in that case, I could define the property Foo.std and make it return the square root of the variance.
The problem with this approach is that if I need to access Foo.std multiple times, I will calculate the square root each time; furthermore, since the notation of a property is exactly like that of an attribute, the user of my class might not be aware that a computation is taking place every time the property is accessed.
One alternative in this example would be to calculate the standard deviation every time I update my variance, and set it as an attribute. However, that would be inefficient if I don't need to access it at every update.
My question is: what is the best approach to use a Python property efficiently, when you need to perform a costly calculation? Should I cache the result after the first call and delete the cache in case of an update? Or should I rather not use a property and use a method Foo.get_std() instead?
Usually you can do this through caching. For example you can write:
class Foo:
    def __init__(self, also, other, arguments):
        # ...
        self._std = None

    @property
    def std(self):
        if self._std is None:
            # ... calculate standard deviation
            self._std = ...
        return self._std

    def alter_state(self, some, arguments):
        # update state
        self._std = None
So here we have a property std but also an attribute _std. Whenever the standard deviation has not yet been calculated, or the object state has been altered such that the standard deviation might have changed, _std is None. When we access .std, we first check whether _std is None. If it is, we calculate the standard deviation, store it in _std and return it. As long as the object is not changed, later accesses simply retrieve the stored value.
If we alter the object such that the standard deviation might have changed, we set _std back to None, to force re-evaluation in case we access .std again.
If we alter the state of a Foo object twice before recalculating the standard deviation, we will only recalculate it once. So you can change the Foo object frequently at (close to) no extra cost, apart from setting self._std to None. If you have a huge dataset and you update it extensively, you only spend effort on calculating the standard deviation again when you actually need it.
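To make the laziness concrete, here is a minimal runnable sketch of the pattern. The `add` method, the `_values` list and the `recomputations` counter are illustrative assumptions that are not part of the answer above; the counter exists only to show how often the calculation actually runs.

```python
import math


class Foo:
    def __init__(self, values):
        self._values = list(values)
        self._std = None          # None means "not computed yet" or "stale"
        self.recomputations = 0   # hypothetical counter, for demonstration only

    @property
    def std(self):
        if self._std is None:
            n = len(self._values)
            mean = sum(self._values) / n
            variance = sum((v - mean) ** 2 for v in self._values) / n
            self._std = math.sqrt(variance)
            self.recomputations += 1
        return self._std

    def add(self, value):
        self._values.append(value)
        self._std = None          # invalidate; recompute lazily on next access


foo = Foo([1.0, 2.0, 3.0])
foo.add(4.0)
foo.add(5.0)          # two updates, no recalculation yet
first = foo.std       # computed here, once
second = foo.std      # served from the cache
print(foo.recomputations)
```

Two updates followed by two reads trigger a single calculation, which is the behaviour described above.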
Furthermore, this can also be an opportunity to update statistical measures directly when doing so is (very) cheap. Say, for instance, you have a list of objects that you frequently update in bulk. If you increment all elements by a constant, the mean also increments by that constant. So methods that alter the state in a way that lets some metrics be adjusted easily might update those metrics instead of setting them to None.
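A minimal sketch of that idea, assuming a hypothetical `Sample` class with a cached mean: `shift` adjusts the cached value in place because the new mean is known without looking at the data, while `append` falls back to invalidation.

```python
class Sample:
    def __init__(self, values):
        self._values = list(values)
        self._mean = None

    @property
    def mean(self):
        if self._mean is None:
            self._mean = sum(self._values) / len(self._values)
        return self._mean

    def shift(self, constant):
        """Add a constant to every element."""
        self._values = [v + constant for v in self._values]
        if self._mean is not None:
            self._mean += constant   # cheap update instead of invalidation

    def append(self, value):
        self._values.append(value)
        self._mean = None            # no shortcut used here: invalidate


s = Sample([1.0, 2.0, 3.0])
before = s.mean
s.shift(10.0)
after = s.mean       # no pass over the data was needed for this read
```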
Note that whether .std is a property or a method is irrelevant. The user does not have to know how often the value has to be calculated. Either way, the caching guarantees that once the value has been calculated, a second retrieval will be fast.
Adding to Willem's answer: starting with Python 3.8, we now have functools.cached_property. The official documentation even uses std and variance as examples. I'm linking the 3.9 documentation (https://docs.python.org/3.9/library/functools.html#functools.cached_property) since it has additional explanation on how it works.
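A short sketch of how `functools.cached_property` might be combined with invalidation. The `variance` setter is an assumption made for illustration; the relevant fact is that `cached_property` stores the computed value in the instance `__dict__` under the property's name, so removing that entry forces a recomputation.

```python
import math
from functools import cached_property


class Foo:
    def __init__(self, variance):
        self._variance = variance

    @property
    def variance(self):
        return self._variance

    @variance.setter
    def variance(self, value):
        self._variance = value
        # Drop the cached value, if any, so std is recomputed on next access.
        self.__dict__.pop("std", None)

    @cached_property
    def std(self):
        return math.sqrt(self._variance)


foo = Foo(4.0)
a = foo.std          # computed and stored in foo.__dict__["std"]
foo.variance = 9.0   # invalidates the cached value
b = foo.std          # recomputed from the new variance
```

Writing `del foo.std` has the same invalidating effect, but raises AttributeError if nothing has been cached yet, which is why the sketch uses `pop` with a default.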
I am creating various classes for computational geometry that all subclass numpy.ndarray. The DataCloud class, which is typical of these classes, has Python properties (for example, convex_hull, delaunay_trangulation) that would be time consuming and wasteful to calculate more than once. I want to do calculations once and only once. Also, just in time, because for a given instance, I might not need a given property at all. It is easy enough to set this up by setting self.__convex_hull = None in the constructor and, if/when the convex_hull property is called, doing the required calculation, setting self.__convex_hull, and returning the calculated value.
The problem is that once any of those complicated properties is invoked, any changes to the contents made, external to my subclass, by the various numpy (as opposed to DataCloud subclass) methods will invalidate all the calculated properties, and I won't know about it. For example, suppose external code simply does this to the instance: datacloud[3,8] = 5. So is there any way to either (1) make the ndarray base class read-only once any of those properties is calculated or (2) have ndarray set some indicator that there has been a change to its contents (which for my purposes makes it dirty), so that then invoking any of the complex properties will require recalculation?
Looks like the answer is:
np.ndarray.setflags(write=False)
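As a hedged illustration of option (1), the sketch below calls `setflags` on a plain array instance rather than on a DataCloud subclass, which is not shown in the question. After the call, an assignment like the one in the question raises ValueError.

```python
import numpy as np

data = np.zeros((4, 10))
data.setflags(write=False)      # same effect as: data.flags.writeable = False

try:
    data[3, 8] = 5
    blocked = False
except ValueError:
    blocked = True              # NumPy refuses to assign to a read-only array
```

One caveat to keep in mind: views that were created while the array was still writable are not locked by this call, so the flag is a guard against accidental writes rather than a full guarantee.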
I have a class in Python which trains a model for the given data:
class Model(object):
    def __init__(self, data):
        self.data = data
        self.result = None

    def train(self):
        # ... some code for training the model ...
        self.result = ...
Once I create a Model object,
myModel = Model(myData)
the model is not trained. Then I can call the train method to initiate the training:
myModel.train()
Then myModel.result will be updated in-place.
Also, I could re-write the train method as:
import copy  # at module level, needed for copy.copy below

def train(self):
    # ... some code for training the model ...
    result = ...
    # avoid update in-place
    trainedModel = copy.copy(self)
    trainedModel.result = result
    return trainedModel
In this way, by calling myTrainedModel = myModel.train() I have a new object and the state of the original myModel is not changed.
My question is: Which is a more common way to store the returned result from a method in a class?
My question is: Which is a more common way to store the returned
result from a method in a class?
It's really hard to say here. Your example narrows it down to a very specific use case, and even if it was broader, an answer perfectly devoid of subjectivity would probably be impossible to find.
Nevertheless, I might be able to provide some info that could help you guide your decisions.
Pure Functions
Pure functions are functions that trigger no side-effects. They don't modify any state outside of the function. They're generally known to be some of the easiest types of functions to use correctly, since side effects are a common tripping point in development ("Which part of this system caused this state to change to this?") A function that has zero side effects offers little to trip over.
Your second version is a pure function. It has no side effects: it returns a newly-trained Model. It doesn't affect anything that already exists.
Pure functions are also inherently thread-safe. Since they modify no shared state, they're extremely friendly to a concurrent paradigm.
Side Effects
Nevertheless, a function that triggers side effects is often a practical necessity in many programs. From a single-threaded efficiency perspective, any function that faces the choice between modifying a complex state or returning a whole new one can be significantly bottlenecked by doing the latter.
Imagine, as a gross case, a function that plots one pixel on an image returning a whole new image with a pixel plotted to it instead of modifying the one you pass in. That would tend to immediately become a significant bottleneck. On the other hand, if the result we're returning is not complex (ex: just an integer or very simple aggregate), often the pure function is even faster.
So functions that trigger a side effect (ideally only one logical side effect to avoid becoming a confusing source of bugs) are often a practical necessity in some cases when the results are complex and expensive to create.
Pure or "Impure"
So the choice here kind of boils down to a pure function or "impure" function that has one side effect. Since we're dealing with an object-oriented scenario, another way to look at this is mutability vs. immutability (which often has similar differences to pure and "impure" functions). We can train a Model or create and return a trained Model without touching an existing one.
The choice of which might be "better" would depend on what you are after. If safety and maintainability are your goals, the pure version might help a bit more. If the cost of creating and returning a new model is expensive and efficiency is your primary goal, training an existing model might help you avoid a bottleneck.
If in doubt, I would suggest the pure version generally. Qualities like safety and maintainability that improve productivity tend to come first before worrying about performance. Later you can grab a profiler and drill down to your hotspots, and if you find that returning whole new trained models is a bottleneck, you could, say, add a new method to train a model in-place which you use for your most critical code paths.
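A minimal sketch of that suggestion, with both variants side by side. The `_fit` helper and the `train_inplace` name are assumptions, and the "training" is a stand-in (the mean of the data) so that the example runs.

```python
import copy


class Model:
    def __init__(self, data):
        self.data = data
        self.result = None

    def _fit(self):
        # Placeholder for the real training step (assumption): the mean.
        return sum(self.data) / len(self.data)

    def train(self):
        """Return a trained copy and leave this instance untouched."""
        trained = copy.copy(self)
        trained.result = self._fit()
        return trained

    def train_inplace(self):
        """Train this instance directly, avoiding the copy."""
        self.result = self._fit()
        return self


model = Model([1.0, 2.0, 3.0])
trained = model.train()
```

Note that `copy.copy` is shallow: the trained copy shares the same `data` object with the original, so copying is cheap, but mutating the data through either object affects both.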
The classmethod() is an inbuilt function in Python, which returns a class method for a given function.
class Student:
    name = "MadhavArora"

    def print_name(obj):
        print("The name is : ", obj.name)

Student.print_name = classmethod(Student.print_name)
Student.print_name()
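The same conversion is more commonly written with decorator syntax. The sketch below uses a hypothetical `describe` method that returns the string instead of printing it, so the result can be inspected.

```python
class Student:
    name = "MadhavArora"

    @classmethod
    def describe(cls):
        # cls is the class itself, so no instance is needed.
        return "The name is : " + cls.name


message = Student.describe()
print(message)
```

A class method can also be called on an instance; it still receives the class as its first argument.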
I have a function f(x), which does something and returns values (a tuple).
I have another function that calls this function after processing some parameters (the whole function operation is irrelevant to the question), and now I would like to know whether there is anything wrong with returning the function itself, versus running the function, storing the output in a variable and returning the variable.
A variable has a cost, and assigning a value to a variable has a cost; but besides that, is there anything happening behind the scenes that would make one better than the other?
def myfunction(self):
    # [do something]
    return f(x)
is the same as
def myfunction(self):
    # [do something]
    b = f(x)
    return b
or is one to be preferred over the other (and why)? I am asking purely from the OOP perspective, without considering that creating and assigning variables has a cost in terms of memory and CPU cycles.
That doesn't return the function. Returning the function would look like return f. You're returning the result of the function call. Generally speaking, the only reason to save that result before returning it is if you plan to do some other kind of processing on it before the return, in which case it's faster to just refer to a saved value rather than recomputing it. Another reason to save it would be for clarity, turning what might be a long one-liner with extensive chaining into several steps.
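The distinction between returning the result and returning the function can be sketched as follows; the body of `f` and the two wrapper names are made up for illustration.

```python
def f(x):
    return (x, x * 2)


def returns_result(x):
    return f(x)      # calls f and returns the tuple it produced


def returns_function(x):
    return f         # returns the function object itself, uncalled


result = returns_result(3)
func = returns_function(3)
```

Here `result` is a tuple, while `func` is `f` itself and still has to be called to produce a value.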
There's a possibility that those two functions might produce different results if you have some kind of asynchronous process that modifies your data in the background between saving the reference and returning it, but that's something you'll have to keep in mind based on your program's situation.
In a nutshell, save it if you want to refer to it, or just return it directly otherwise.
Those are practically identical; use whichever one you think is more readable. If the performance of one versus the other actually matters for you, perhaps Python is not the best choice ;).
The cost difference between these is utterly negligible: in the worst case, one extra dictionary store, one extra dictionary lookup and one extra string in memory. In practice it won't even be that bad, since CPython stores local variables in a C array, so it's more like two C-level pointer indirections.
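If you want to see the difference for yourself, the standard `dis` module prints the bytecode of both variants. This is only a sketch with a made-up `f`; the exact instructions differ between CPython versions, so no output is reproduced here.

```python
import dis


def f(x):
    return (x, x + 1)


def direct(x):
    return f(x)


def via_variable(x):
    b = f(x)
    return b


dis.dis(direct)        # disassembly of the version without the variable
dis.dis(via_variable)  # disassembly of the version with the local variable
```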
As a matter of style, I would usually avoid the unnecessary variable, but it's possible that it might be better in particular cases. As a guideline, think about things like whether the amalgamated version leads to an excessively long line of code, whether the extra variable has a better name than e.g. result, and how clear it is that the function call is the result you need (and if it isn't, whether and how much a variable helps).
I am writing a piece of scientific software in Python which comprises both a Poisson equation solver (using the Newton method) on a rectangular mesh, and a particle-in-cell code. I've written the Newton Solver and the particle-in-cell code as separate functions, which are called by my main script.
I had originally written the code as one large script, but decided to break up the script so that it was more modular, and so that the individual functions could be called on their own. My problem is that I have a large number of "global" variables which I consider parameters for the problem. This includes mostly problem constants and parameters which define the problem geometry and mesh (such as dimensions, locations of certain boundary conditions, boundary conditions etc.).
These parameters are required by both the main script and the individual functions. My question is: what is the best (and most proper) way to store these variables such that they can be accessed by both the main script and the functions?
My current solution is to define a class in a separate module (parameters.py) as so:
class Parameters:
    length = 0.008
    width = 0.0014
    nz = 160
    nr = 28
    dz = length/nz
    dr = width/nr
    ...
In my main script I then have:
from parameters import Parameters
par = Parameters()
coeff_a = -2 * (1/par.dr**2 + 1/par.dz**2)
...
This method allows me to use par as a container for my parameters, which can be passed to any function I want. It also provides an easy way to set up the problem space to run just one of the functions on its own. My only concern is that each function does not require everything stored in par, and hence it seems inefficient to pass it forward all the time. I could probably remove many of the parameters from par, but then I would need to recalculate them every time a function is called, which seems even more inefficient.
Is there a standard solution which people use in these scenarios? I should mention that my functions are not changing the attributes of par, just reading them. I am also interested in achieving high performance, if possible.
Generally, when your program requires many parameters in different places, it makes sense to come up with a neat configuration system, usually a class that provides a certain interface to your own code.
Upon instantiation of that class, you have a configuration object at hand which you can pass around. In some places you might want to populate it, in other places you just might want to use it. In any case, this configuration object will be globally accessible. If your program is a Python package, then this configuration mechanism might be written in its own module which you can import from all other modules in your package.
The configuration class might provide useful features such as parameter registration (a certain code section says that it needs a certain parameter to be set), definition of defaults and parameter validation.
The actual population of parameters is then based on defaults, user-given commandline arguments or user-given input files.
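One possible shape for such a configuration class, sketched with a frozen dataclass from the standard library. The field names and defaults are taken from the question; the validation rules and the choice of a dataclass are assumptions, not something the answer prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameters:
    length: float = 0.008
    width: float = 0.0014
    nz: int = 160
    nr: int = 28

    def __post_init__(self):
        # Simple validation; the specific checks are illustrative.
        if self.length <= 0 or self.width <= 0:
            raise ValueError("length and width must be positive")
        if self.nz <= 0 or self.nr <= 0:
            raise ValueError("nz and nr must be positive")

    @property
    def dz(self):
        return self.length / self.nz

    @property
    def dr(self):
        return self.width / self.nr


par = Parameters()                 # defaults
coeff_a = -2 * (1 / par.dr**2 + 1 / par.dz**2)
fine = Parameters(nz=320)          # override one parameter, keep the rest
```

Passing `par` to a function passes a reference to the same object, not a copy, so handing the whole container to functions that only read a few attributes costs essentially nothing. Freezing the instance matches the stated requirement that functions only read the parameters.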
To make Jan-Philip Gehrcke's answer more figurative, check out A global class pattern for python (btw: it's just a normal class, nothing special about "global" - but you can pass it around "globally").
Before actually implementing this in my own program, I had the same idea but wanted to find out how others would do it (like questioner nicholls). I was a bit skeptical to implement this in the first place, in particular it looked quite strange to instantiate a class in the module itself. But it works fine.
However, there are some things to keep in mind:
It is not super clean. For instance, someone who doesn't know the functions in your module wouldn't expect that a parameter in a configuration class needs to be set.
If you have to reload your module/functions but want to maintain the values set in your configuration class, you should not instantiate the configuration class again: if "mem" not in locals(): mem = Mem()
It's not advised to use a parameter from your configuration class as a default argument for a function, for example function(a, b=mem.defaultB). The default is evaluated once, when the function is defined, so you cannot change it afterwards. Instead, do function(a, b=None): if b is None: b=mem.defaultB. Then you can still adjust your configuration class after you have loaded your module/functions.
Certainly there are more issues...
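The default-argument point in the list above can be demonstrated in a few lines. `Mem` and `defaultB` follow the names used in the answer; the functions `early` and `late` are hypothetical.

```python
class Mem:
    def __init__(self):
        self.defaultB = 1


mem = Mem()


def early(a, b=mem.defaultB):
    # The default was evaluated once, when the def statement ran.
    return a + b


def late(a, b=None):
    # The default is looked up on every call.
    if b is None:
        b = mem.defaultB
    return a + b


mem.defaultB = 100      # change the configuration after the functions exist
early_result = early(1)
late_result = late(1)
```

`early` still uses the value that was current at definition time, while `late` picks up the changed configuration.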
I have been trying to optimize this code:
import numpy as np
import scipy.integrate
import scipy.special
from scipy.interpolate import interp1d

def spectra(mE, a):
    pdf = lambda E: np.exp(-(a+1)*E/mE)*(((a+1)*E/mE)**(a+1))/(scipy.special.gamma(a+1)*E)
    u = []
    n = np.random.uniform()
    E = np.arange(0.01, 150, 0.1)
    for i in E:
        u.append(scipy.integrate.quad(pdf, 0, i)[0])
    f = interp1d(u, E)
    return f(n)
I was trying to create a lookup table out of f, but it appears that every time I call the function it does the integration all over again. Is there a way to put in something like an if statement, which will let me create f once for values of mE and a and then just call that afterwards?
Thanks for the help.
Cheers.
It sounds like what you want to do is return a known value if the function is re-called with the same (mE, a) values, and perform the calculation if the input is new.
That is called memoization. See for example "What is memoization and how can I use it in Python?". Note that in modern versions of Python, you can create a decorator to apply the memoization of the function, which lets you be a little neater.
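A sketch of how this could look with the standard `functools.lru_cache` decorator: the expensive table is built once per (mE, a) pair and only the random draw happens on each call. To keep the example free of third-party imports, it swaps in a pure-Python trapezoid rule for `scipy.integrate.quad` and a nearest-grid lookup for `interp1d`; both substitutions and the `build_table` name are assumptions, and the starting value of the integral assumes a > 0.

```python
import bisect
import math
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def build_table(mE, a):
    """Tabulate the cumulative integral of the pdf once per (mE, a)."""
    def pdf(E):
        x = (a + 1) * E / mE
        return math.exp(-x) * x ** (a + 1) / (math.gamma(a + 1) * E)

    energies = [0.01 + 0.1 * i for i in range(1500)]
    cdf = []
    total, prev_E, prev_p = 0.0, 0.0, 0.0   # pdf tends to 0 at E = 0 for a > 0
    for E in energies:
        p = pdf(E)
        total += 0.5 * (p + prev_p) * (E - prev_E)   # trapezoid rule
        cdf.append(total)
        prev_E, prev_p = E, p
    return tuple(energies), tuple(cdf)


def spectra(mE, a):
    energies, cdf = build_table(mE, a)      # cached after the first call
    n = random.random()
    # First grid point whose cumulative value reaches n (no interpolation).
    i = min(bisect.bisect_left(cdf, n), len(energies) - 1)
    return energies[i]


sample_1 = spectra(10.0, 2.0)   # builds the table
sample_2 = spectra(10.0, 2.0)   # reuses it
```

`lru_cache` matches arguments by exact equality, so it only helps when the same (mE, a) values recur exactly.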
Most probably you'll not be able to store values of spectra(x, y) and reasonably retrieve them by exact values of floating-point x and y. You rarely encounter exact same floating point values in real life.
Note that I don't think you can cache f directly, because it depends on a long list of floats. Its possible input space is so large that finding a close match seems very improbable to me.
If you cache values of spectra() you could retrieve the value for a close enough pair of arguments with a reasonable probability.
The problem is searching for such close pairs. A hash table cannot work (we need imprecise matches), an ordered list and binary search cannot work either (we have 2 dimensions). I'd use a quad tree or some other form of spatial index. You can build it dynamically and efficiently search for closest known points near your given point.
If you find a cached point really close to the one you need, you can just return the cached value. If no point is close enough, you compute the value and add the new point to the index, in the hope that it will be reused in the future. Maybe you could even interpolate if your point lies between two known points close by.
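The lookup-or-compute logic can be sketched as below. For brevity this uses a linear scan over the stored points instead of the quad tree or spatial index suggested above, so it shows the behaviour, not the efficient search. `NearestCache`, `cached_call`, the tolerance value and the stand-in `expensive` function are all assumptions, and cached values are assumed never to be None.

```python
import math


class NearestCache:
    """Cache keyed by 2-D points, matched within a tolerance (linear scan)."""

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self._entries = []   # list of ((x, y), value) pairs

    def lookup(self, x, y):
        best = None
        best_dist = self.tolerance
        for (px, py), value in self._entries:
            dist = math.hypot(x - px, y - py)
            if dist <= best_dist:
                best, best_dist = value, dist
        return best          # None if no stored point is close enough

    def store(self, x, y, value):
        self._entries.append(((x, y), value))


def cached_call(cache, func, x, y):
    value = cache.lookup(x, y)
    if value is None:
        value = func(x, y)
        cache.store(x, y, value)
    return value


calls = []


def expensive(x, y):
    calls.append((x, y))     # record real evaluations, for demonstration
    return x + y


cache = NearestCache(tolerance=0.01)
first = cached_call(cache, expensive, 10.0, 2.0)
second = cached_call(cache, expensive, 10.001, 2.0)   # close enough: reused
third = cached_call(cache, expensive, 12.0, 2.0)      # too far: computed
```

If the two arguments live on very different scales, a plain Euclidean distance is probably not the right closeness measure; scaling each coordinate first would be one option.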
The big prerequisite, of course, is that a sufficient number of points in the cache have a chance to be reused. To estimate this, run some of your calculations and store the (mE, a) pairs somewhere (e.g. in a file), then plot them. You'll instantly see if you have groups of points close to one another. You can look for tight clusters without plotting, too, of course. If you have enough clusters that are tight (where you could reuse one point's value for another), your cache will work. If not, don't bother implementing it.