How to conveniently pass around global settings which can be modified - python

I often find myself writing code that takes a set of parameters, does some calculation and then returns the result to another function, which also requires some of the parameters to do some other manipulation, and so on. I end up with a lot of functions where I have to pass parameters around, such as f(x, y, N, epsilon) which then calls g(y, N, epsilon) and so on. All the while I have to include the parameters N and epsilon in every function call and not lose track of them, which is quite tedious.
What I want is to avoid endlessly passing these parameters around, while still being able to change some of them within a single for loop, e.g.
for epsilon in [1, 2, 3]:
    f(..., epsilon)
I usually have around 10 parameters to keep track of (these are physics problems) and do not know beforehand which I have to vary and which I can keep to a default.
The options I thought of are
Creating a global settings = {'epsilon': 1, 'N': 100} object, which is used by every function. However, I have always been told that putting stuff in the global namespace is bad. I am also afraid that this will not play nice with modifying the settings object within the for loop.
Passing around a settings object as a parameter in every function. This means that I can keep track of the object as it passed around, and makes it play nice with the for loop. However, it is still passing around, which seems stupid to me.
Is there another, third, option that I have not considered? Most of the solutions I can find are for the case where your settings are only set once, as you start up the program, and are then left unchanged.

I believe this is primarily a matter of preference among coding styles. I'm going to offer my opinion on the ones you posted as well as some other alternatives.
First, creating a global settings variable is not bad by itself. Problems arise when global settings are treated as mutable state rather than kept immutable. Since you want to modify parameters on the fly, it's a dangerous option.
Second, passing the settings around is quite common in functional languages, and it's not stupid, although it can look clumsy if you're not used to it. The advantage of passing state this way is that you can isolate changes in the settings dictionary you pass around without corrupting the original one. The bad thing is that Python complicates immutability a bit because of shared references, and you can end up making many deepcopy calls to avoid accidentally mutating shared state, which is inefficient. Unless your dict is flat (not nested), I would not go that way.
import copy

settings = {'epsilon': 1, 'N': 100}

# Shallow copy with an override; fine for a flat dict.
for x in [1, 2, 3]:
    f(..., dict(settings, epsilon=x))

# Safe way for nested settings.
ephemeral = copy.deepcopy(settings)
for x in [1, 2, 3]:
    ephemeral['epsilon'] = x
    f(..., ephemeral)
Now, there's another option which kind of mixes the other two, probably the one I'd take. Make a global, immutable settings variable and write your function signatures to accept optional keyword arguments. This way you get the advantages of both: you avoid continuously passing variables around, and you can modify values on the fly without corrupting the shared defaults:
def f(..., **kwargs):
    epsilon = kwargs.get('epsilon', settings['epsilon'])
    ...
You may also create a function that encapsulates the aforementioned behavior to decouple variable extraction from function definition. There are many possibilities.
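For example, a minimal sketch of such a helper; the get_param name and the defaults shown here are illustrative assumptions, not something from the original post:

settings = {'epsilon': 1, 'N': 100}  # global defaults, treated as read-only

def get_param(name, overrides):
    # Return an override if given, otherwise fall back to the global default.
    return overrides.get(name, settings[name])

def f(x, **overrides):
    epsilon = get_param('epsilon', overrides)
    N = get_param('N', overrides)
    return x * epsilon / N

# Defaults are used everywhere except epsilon, which is varied in the loop.
for epsilon in [1, 2, 3]:
    print(f(10.0, epsilon=epsilon))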
Hope this helps.

Related

Is there a way for a caller of multiple functions to forward a function ref to selected functions in a purely functional way?

Problem
I have a function make_pipeline that accepts an arbitrary number of functions, which it then calls to perform sequential data transformation. The resulting call chain performs transformations on a pandas.DataFrame. Some, but not all, of the functions it may call need to operate on a sub-array of the DataFrame. I have written multiple selector functions. However, at present each member function of the chain has to be explicitly given the user-selected selector/filter function. This is VERY error-prone, and accessibility is very important because the end code is aimed at non-specialists (possibly with no Python/programming knowledge), so it must be "batteries-included". This entire project is written in a functional style (that's what's always worked for me).
Sample Code
filter_func = simple_filter()
# The API looks like this
make_pipeline(
    load_data("somepath", header=[1, 0]),
    transform1(arg1, arg2),
    transform2(arg1, arg2, data_filter=filter_func),  # This function needs access to the user-defined filter function
    transform3(arg1, arg2, data_filter=filter_func),  # This function needs access to the user-defined filter function
    transform4(arg1, arg2),
)
Expected API
filter_func = simple_filter()
# The API looks like this
make_pipeline(
    load_data("somepath", header=[1, 0]),
    transform1(arg1, arg2),
    transform2(arg1, arg2),
    transform3(arg1, arg2),
    transform4(arg1, arg2),
)
Attempted
I thought that if the data_filter alias is available in the caller's namespace, it would also become available (something similar to a closure) to all the functions it calls. This seems to happen with some toy examples but won't work in this case (UnboundLocalError).
What's a good way to make a function defined in one place available to certain interested functions in the call chain? I'm trying to avoid global.
Notes/Clarification
I've had problems with OOP and mutable states in the past, and functional programming has worked quite well. Hence I've set a goal for myself to NOT use classes (to the extent that Python enables me to anyways). So no classes.
I should have probably clarified this initially: in the pipeline, the output of every function is a DataFrame, and the input of every function (except load_data, obviously) is a DataFrame. The functions are decorated with a wrapper that calls functools.partial because we want the user to supply the args to each function but not execute it. The actual execution is done by a for loop in make_pipeline.
Each function accepts df: pandas.DataFrame plus all arguments that are specific to that function. The statement seen above, transform1(arg1, arg2, ...), actually calls the decorated transform1, which returns functools.partial(transform, arg1, arg2, ...), which now has a signature like transform(df: pandas.DataFrame).
load_dataframe is just a convenience function to load the initial DataFrame so that all the other functions can begin operating on it. It just felt more intuitive to users to have it as part of the chain rather than a separate call.
The problem is this: I need a way for a filter function to be initialized (called) in only one place, such that every function in the call chain that needs access to the filter function gets it, without it being explicitly passed as an argument to said function. If you're wondering why, it's because I feel end users will find it unintuitive and arbitrary that some functions need it and some don't. I'm also pretty certain that they will make all kinds of errors, like passing different filters, forgetting it sometimes, etc.
(Update) I've also tried inspect.signature() in make_pipeline to check whether each function accepts a data_filter argument and pass it on. However, this raises an incorrect function signature error for some unclear reason (likely because of the decorators/partial calls). If signature could return the non-partial function signature, this would solve the issue, but I couldn't find much info in the docs.
Turns out it was pretty easy. The solution is inspect.signature.
import inspect
from typing import Any, Callable, Optional

def make_pipeline(*args, data_filter: Optional[Callable[..., Any]] = None):
    d = args[0]
    for arg in args[1:]:
        if "data_filter" in inspect.signature(arg).parameters:
            d = arg(d, data_filter=data_filter)
        else:
            d = arg(d)
    return d
Leaving this here mostly for reference because I think this is a mini design pattern. I've also seen function.__closure__ mentioned on an unrelated subject; that may also work, but would likely be more complicated.
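For completeness, a minimal self-contained sketch of this dispatch pattern; the step functions and the use of plain functions plus functools.partial are illustrative assumptions, not the original decorated transforms:

import inspect
from functools import partial

def add_offset(d, offset):
    return {k: v + offset for k, v in d.items()}

def drop_keys(d, data_filter=None):
    # Only this step declares a data_filter parameter, so only it receives the filter.
    return {k: v for k, v in d.items() if data_filter is None or data_filter(k)}

def make_pipeline(*steps, data_filter=None):
    d = steps[0]
    for step in steps[1:]:
        if "data_filter" in inspect.signature(step).parameters:
            d = step(d, data_filter=data_filter)
        else:
            d = step(d)
    return d

result = make_pipeline(
    {"a": 1, "b": 2},
    partial(add_offset, offset=10),
    drop_keys,
    data_filter=lambda key: key != "b",
)
print(result)  # {'a': 11}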

Avoiding global variables but also too many function arguments (Python)

Let's say I have a python module that has a lot of functions that rely on each other, processing each others results. There's lots of cohesion.
That means I'll be passing back and forth a lot of arguments. Either that, or I'd be using global variables.
What are best practices to deal with such a situation? Things that come to mind are replacing those parameters with dictionaries, but I don't necessarily like how that changes the function signature to something less expressive. Or I could wrap everything into a class, but that feels like I'm cheating and using "pseudo"-global variables.
I'm asking specifically for how to deal with this in Python but I understand that many of those things would apply to other languages as well.
I don't have a specific code example right now; it's just something that came to mind when I was thinking about this issue.
Examples could be: You have a function that calculates something. In the process, a lot of auxiliary stuff is calculated. Your processing routines need access to this auxiliary stuff, and you don't want to just re-compute it.
This is a very generic question so it is hard to be specific. What you seem to be describing is a bunch of inter-related functions that share data. That pattern is usually implemented as an Object.
Instead of a bunch of functions, create a class with a lot of methods. For the common data, use attributes. Set the attributes, then call the methods. The methods can refer to the attributes without them being explicitly passed as parameters.
As RobertB said, an object seems the clearest way. Could be as simple as:
import math

class myInfo:
    def __init__(self, x=0.0, y=0.0):
        self.x = x
        self.y = y
        self.messWithDist()

    def messWithDist(self):
        self.dist = math.sqrt(self.x * self.x + self.y * self.y)

blob = myInfo(3, 4)
print(blob.dist)    # 5.0

blob.x = 5
blob.y = 7
blob.messWithDist()
print(blob.dist)    # 8.602...
If some of the functions shouldn't really be part of such an object, you can just define them as (non-member, independent) functions, and pass the blob as one parameter. For example, by un-indenting the def of messWithDist, then calling messWithDist(blob) instead of blob.messWithDist().
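A minimal sketch of that free-function variant (it assumes the same blob object created above):

import math

def messWithDist(info):
    # Works on any object exposing x and y attributes.
    info.dist = math.sqrt(info.x * info.x + info.y * info.y)

messWithDist(blob)   # instead of blob.messWithDist()
print(blob.dist)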

Is it better practice to pass sometimes complex dicts for parameters instead of parameter objects?

I've been programming Python for a year now, having come from a Java background, and I've noticed that, at least in my organization, the style for passing complex parameters to functions is to use dicts or tuples, rather than instances of a specialized parameter class. For example, we have a method that takes three dicts, each structured in a particular way, each of which is itself formatted as tuples. It's complicated for me to build args and to read the code. Here's an example of a passed arg:
{'[A].X': ((DiscreteMarginalDistribution, ('red', 'blue')), ()),
 '[A].Y': ((DiscreteConditionalDistribution, ('yellow', 'green'), ('red', 'blue')),
           (IdentityAggregator('[A].X'),))}
My questions are:
Is passing dicts/tuples like this a common Python idiom?
When, if ever, do you write Python code to use the latter (parameter instances)? E.g., when the nested structure surpasses some complexity threshold.
Thanks in advance!
Yes, it is very usual to pass a dictionary to Python functions in order to reduce the number of arguments. Dictionary-style configuration with proper key naming is much more readable than just using tuples.
I consider it rather uncommon to dynamically construct dedicated instances of a custom config class; I'd stick with dictionaries for that. In case your config dict and its consumer go out of sync, you get KeyErrors, which are pretty easy to debug.
Some recommendations and reasoning:
If some parts of your application require really really complex configuration, I consider it a good idea to have a configuration object that properly represents the current config. However, in my projects I never ended up passing such objects as function arguments. This smells. In some applications, I have a constant global configuration object, set up during bootstrap. Such an object is globally available and treated as "immutable".
Single functions should never be so complex that they require to retrieve a tremendously complex configuration. This indicates that you should split your code into several components, each subunit having a rather simple parameterization.
If the runtime configuration of a function is somewhat more complex than is easily handled with normal (keyword) arguments, it is absolutely common to pass a dictionary as a "lightweight" configuration object, so to speak. A well thought-through selection of key names makes such an approach very readable. Of course you can also build up a hierarchy in case one level is not enough for your use case.
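For instance, a minimal sketch of such a lightweight config dict; the key names and the run_simulation function are made-up examples:

config = {
    'solver': {'epsilon': 1e-6, 'max_iterations': 100},
    'output': {'path': 'results.csv', 'verbose': True},
}

def run_simulation(value, config):
    # Each component pulls out the sub-dict it needs; a misspelled key fails loudly with a KeyError.
    solver_cfg = config['solver']
    return value * solver_cfg['epsilon']

print(run_simulation(2.0, config))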
Most importantly, please note that in many cases the best way is to explicitly define the parameterization of a function via its signature, using the normal argument specification:
def f(a, b, c, d, e):
    ...
In the calling code, you can then prepare the values for these arguments in a dictionary:
arguments = {
    'a': 1,
    'b': 2,
    'c': 3,
    'd': 4,
    'e': "x",
}
and then use Python's syntactic sugar for keyword expansion upon the function call:
f(**arguments)

Is it bad style to reassign long variables as a local abbreviation?

I prefer to use long identifiers to keep my code semantically clear, but in the case of repeated references to the same identifier, I'd like for it to "get out of the way" in the current scope. Take this example in Python:
def define_many_mappings_1(self):
    self.define_bidirectional_parameter_mapping("status", "current_status")
    self.define_bidirectional_parameter_mapping("id", "unique_id")
    self.define_bidirectional_parameter_mapping("location", "coordinates")
    # etc...
Let's assume that I really want to stick with this long method name, and that these arguments are always going to be hard-coded.
Implementation 1 feels wrong because most of each line is taken up with a repetition of characters. The lines are also rather long in general, and will exceed 80 characters easily when nested inside of a class definition and/or a try/except block, resulting in ugly line wrapping. Let's try using a for loop:
def define_many_mappings_2(self):
    mappings = [("status", "current_status"),
                ("id", "unique_id"),
                ("location", "coordinates")]
    for mapping in mappings:
        self.define_parameter_mapping(*mapping)
I'm going to lump together all similar iterative techniques under the umbrella of Implementation 2, which has the improvement of separating the "unique" arguments from the "repeated" method name. However, I dislike that this has the effect of placing the arguments before the method they're being passed into, which is confusing. I would prefer to retain the "verb followed by direct object" syntax.
I've found myself using the following as a compromise:
def define_many_mappings_3(self):
    d = self.define_bidirectional_parameter_mapping
    d("status", "current_status")
    d("id", "unique_id")
    d("location", "coordinates")
In Implementation 3, the long method is aliased by an extremely short "abbreviation" variable. I like this approach because it is immediately recognizable as a set of repeated method calls on first glance while having less redundant characters and much shorter lines. The drawback is the usage of an extremely short and semantically unclear identifier "d".
What is the most readable solution? Is the usage of an "abbreviation variable" acceptable if it is explicitly assigned from an unabbreviated version in the local scope?
itertools to the rescue again! Try using starmap - here's a simple demo:
list(itertools.starmap(min,[(1,2),(2,2),(3,2)]))
prints
[1,2,2]
starmap returns a lazy iterator, so to actually invoke the methods you have to consume it, here with list.
import itertools

def define_many_mappings_4(self):
    list(itertools.starmap(
        self.define_parameter_mapping,
        [
            ("status", "current_status"),
            ("id", "unique_id"),
            ("location", "coordinates"),
        ]))
Normally I'm not a fan of using a dummy list construction to invoke a sequence of functions, but this arrangement seems to address most of your concerns.
If define_parameter_mapping returns None, then you can replace list with any, and then all of the function calls will get made, and you won't have to construct that dummy list.
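A small sketch of that any-based variant; the method name define_many_mappings_5 is made up here, and it assumes define_parameter_mapping returns None:

import itertools

def define_many_mappings_5(self):
    # any() short-circuits on the first truthy result; since every call returns
    # None (which is falsy), all calls are made and no intermediate list is built.
    any(itertools.starmap(
        self.define_parameter_mapping,
        [
            ("status", "current_status"),
            ("id", "unique_id"),
            ("location", "coordinates"),
        ]))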
I would go with Implementation 2, but it is a close call.
I think #2 and #3 are equally readable. Imagine if you had 100s of mappings... Either way, I cannot tell what the code at the bottom is doing without scrolling to the top. In #2 you are giving a name to the data; in #3, you are giving a name to the function. It's basically a wash.
Changing the data is also a wash, since either way you just add one line in the same pattern as what is already there.
The difference comes if you want to change what you are doing to the data. For example, say you decide to add a debug message for each mapping you define. With #2, you add a statement to the loop, and it is still easy to read. With #3, you have to create a lambda or something. Nothing wrong with lambdas -- I love Lisp as much as anybody -- but I think I would still find #2 easier to read and modify.
But it is a close call, and your taste might be different.
I think #3 is not bad, although I might pick a slightly longer identifier than d. Often this type of thing becomes data-driven, though, and then you would find yourself using a variation of #2 where you loop over the result of a database query or entries from a config file.
There's no right answer, so you'll get opinions on all sides here, but I would by far prefer to see #2 in any code I was responsible for maintaining.
#1 is verbose, repetitive, and difficult to change (e.g. say you need to call two methods on each pair or add logging -- then you must change every line). But this is often how code evolves, and it is a fairly familiar and harmless pattern.
#3 suffers the same problem as #1, but is slightly more concise at the cost of requiring what is basically a macro and thus new and slightly unfamiliar terms.
#2 is simple and clear. It lays out your mappings in data form, and then iterates them using basic language constructs. To add new mappings, you only need add a line to the array. You might end up loading your mappings from an external file or URL down the line, and that would be an easy change. To change what is done with them, you only need change the body of your for loop (which itself could be made into a separate function if the need arose).
Your complaint of #2 of "object before verb" doesn't bother me at all. In scanning that function, I would basically first assume the verb does what it's supposed to do and focus on the object, which is now clear and immediately visible and maintainable. Only if there were problems would I look at the verb, and it would be immediately evident what it is doing.

When are function local python variables created?

When are function-local variables created? For example, in the following code, is the dictionary d1 created each time the function f1 is called, or only once when it is compiled?
def f1():
    d1 = {1: 2, 3: 4}
    return id(d1)

d2 = {1: 2, 3: 4}

def f2():
    return id(d2)
Is it faster in general to define a dictionary within function scope or to define it globally (assuming the dictionary is used only in that function). I know it is slower to look up global symbols than local ones, but what if the dictionary is large?
Much python code I've seen seems to define these dictionaries globally, which would seem not to be optimal. But also in the case where you have a class with multiple 'encoding' methods, each with a unique (large-ish) lookup dictionary, it's awkward to have the code and data spread throughout the file.
Local variables are created when assigned to, i.e., during the execution of the function.
If every execution of the function needs (and does not modify!-) the same dict, creating it once, before the function is ever called, is faster. As an alternative to a global variable, a fake argument with a default value is even (marginally) faster, since it's accessed as fast as a local variable but also created only once (at def time):
def f(x, y, _d={1:2, 3:4}):
I'm using the name _d, with a leading underscore, to point out that it's meant as a private implementation detail of the function. Nevertheless it's a bit fragile: a bumbling caller might accidentally and erroneously pass three arguments (the third one would be bound as _d within the function, likely causing bugs), or the function's body might mistakenly alter _d, so this is only recommended as an optimization to use when profiling reveals it's really needed. A global dict is also subject to erroneous alterations, so, even though it's faster than building a local dict afresh on every call, you might still pick the latter possibility to achieve higher robustness (although the global dict solution, plus good unit tests to catch any "oops"es in the function, is the recommended alternative ;-).
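A small self-contained sketch of that trick; the encode function and its lookup table are made up, and the id check just shows that _d is built once, at def time:

def encode(value, _d={1: 2, 3: 4}):
    # _d is created once, when the def statement runs, and reused on every call.
    return _d.get(value, value), id(_d)

print(encode(1))  # (2, <some id>)
print(encode(3))  # (4, <the same id>), confirming _d was not rebuilt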
If you look at the disassembly with the dis module you'll see that the creation and filling of d1 is done on every execution. Given that dictionaries are mutable this is unlikely to change anytime soon, at least until good escape analysis comes to Python virtual machines. On the other hand lookup of global constants will get speculatively optimized with the next generation of Python VM's such as unladen-swallow (the speculation part is that they are constant).
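A quick way to check this yourself with the dis module (the exact bytecode names vary between Python versions):

import dis

def f1():
    d1 = {1: 2, 3: 4}
    return id(d1)

# The disassembly shows the dict-building opcodes inside the function body,
# i.e. they run on every call rather than once at compile time.
dis.dis(f1)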
Speed is relative to what you're doing. If you're writing a database-intensive application, I doubt your application is going to suffer one way or another from your choice of global versus local variables. Use a profiler to be sure. ;-)
As Alex noted, the locals are initialized when the function is called. An easy way to demonstrate this for yourself:
import random

def f():
    d = [random.randint(1, 100), random.randint(100, 1000)]
    print(d)

f()
f()
f()
