Luigi: how to pass different arguments to leaf tasks? - python

This is my second attempt at understanding how to pass arguments to dependencies in Luigi. The first one was here.
The idea is: I have TaskC which depends on TaskB, which depends on TaskA, which depends on Task0. I want this whole sequence to be exactly the same always, except I want to be able to control what file Task0 reads from; let's call it path. Luigi's philosophy is normally that each task should only know about the tasks it depends on, and their parameters. The problem with this is that TaskC, TaskB, and TaskA would all have to accept the variable path for the sole purpose of then passing it to Task0.
So, the solution that Luigi provides for this is called Configuration Classes.
Here's some example code:
from pathlib import Path

import luigi
from luigi import Task, TaskParameter, IntParameter, LocalTarget, Parameter

class config(luigi.Config):
    path = Parameter(default="defaultpath.txt")

class Task0(Task):
    path = Parameter(default=config.path)
    arg = IntParameter(default=0)

    def run(self):
        print(f"READING FROM {self.path}")
        Path(self.output().path).touch()

    def output(self): return LocalTarget(f"task0{self.arg}.txt")

class TaskA(Task):
    arg = IntParameter(default=0)

    def requires(self): return Task0(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskA{self.arg}.txt")

class TaskB(Task):
    arg = IntParameter(default=0)

    def requires(self): return TaskA(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskB{self.arg}.txt")

class TaskC(Task):
    arg = IntParameter(default=0)

    def requires(self): return TaskB(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskC{self.arg}.txt")
(Ignore all the output and run stuff. They're just there so the example runs successfully.)
The point of the above example is controlling the line print(f"READING FROM {self.path}") without having tasks A, B, C depend on path.
Indeed, with Configuration Classes I can control the Task0 argument. If Task0 is not passed a path parameter, it takes its default value, which is config().path.
My problem now is that this appears to me to work only at "build time", when the interpreter first loads the code, but not at run time (the details aren't clear to me).
So neither of these works:
A)
if __name__ == "__main__":
    for i in range(3):
        config.path = f"newpath_{i}"
        luigi.build([TaskC(arg=i)], log_level="INFO")
===== Luigi Execution Summary =====
Scheduled 4 tasks of which:
* 4 ran successfully:
- 1 Task0(path=defaultpath.txt, arg=2)
- 1 TaskA(arg=2)
- 1 TaskB(arg=2)
- 1 TaskC(arg=2)
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
I'm not sure why this doesn't work.
B)
if __name__ == "__main__":
    for i in range(3):
        luigi.build([TaskC(arg=i), config(path=f"newpath_{i}")], log_level="INFO")
===== Luigi Execution Summary =====
Scheduled 5 tasks of which:
* 5 ran successfully:
- 1 Task0(path=defaultpath.txt, arg=2)
- 1 TaskA(arg=2)
- 1 TaskB(arg=2)
- 1 TaskC(arg=2)
- 1 config(path=newpath_2)
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
This actually makes sense: there are two config instances, and I only managed to change the path of one of them.
Help?
EDIT: Of course, having path reference a global variable works, but then it's not a Parameter in the usual Luigi sense.
EDIT2: I tried point 1) of the answer below:
config has the same definition:

class config(luigi.Config):
    path = Parameter(default="defaultpath.txt")
I fixed the mistake pointed out, i.e. Task0 is now:
class Task0(Task):
    path = Parameter(default=config().path)
    arg = IntParameter(default=0)

    def run(self):
        print(f"READING FROM {self.path}")
        Path(self.output().path).touch()

    def output(self): return LocalTarget(f"task0{self.arg}.txt")
and finally I did:
if __name__ == "__main__":
    for i in range(3):
        config.path = Parameter(f"file_{i}")
        luigi.build([TaskC(arg=i)], log_level="WARNING")
This doesn't work, Task0 still gets path="defaultpath.txt".

So what you're trying to do is create tasks with params without passing these params through the parent tasks. That is completely understandable, and I have been annoyed at times trying to handle this.
Firstly, you are using the config class incorrectly. When using a Config class, as noted in https://luigi.readthedocs.io/en/stable/configuration.html#configuration-classes, you need to instantiate the object. So, instead of:
class Task0(Task):
    path = Parameter(default=config.path)
    ...

you would use:

class Task0(Task):
    path = Parameter(default=config().path)
    ...
While this now ensures you are using a value and not a Parameter object, it still does not solve your problem. When the class Task0 is created, config().path is evaluated, so path is not assigned a reference to config().path but rather its value at that moment (which will always be defaultpath.txt). When the class is used in the correct manner, luigi constructs the Task object using only the luigi.Parameter attributes as the attribute names on the new instance, as seen here: https://github.com/spotify/luigi/blob/master/luigi/task.py#L436
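The evaluation-time point is easy to see in isolation, without luigi at all: a class attribute's default is computed once, when the class statement runs, so rebinding the source name later changes nothing:

default_path = "defaultpath.txt"

class Demo:
    value = default_path      # snapshot taken here, at class-creation time

default_path = "newpath_0"    # rebinding the name later has no effect
print(Demo.value)             # -> defaultpath.txt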
So, I see a few possible paths forward.
1.) The first is to set the config path at runtime like you had, except set it to be a Parameter object like this:
config.path = luigi.Parameter(f"newpath_{i}")
However, this would take a lot of work to get your tasks using config.path to work, as they would now need to take in their parameters differently (the defaults can't be evaluated when the class is created).
2.) The much easier way is to simply specify the arguments for your classes in the config file. If you look at https://github.com/spotify/luigi/blob/master/luigi/task.py#L825, you'll see that the Config class in Luigi is actually just a Task class, so you can do anything with it that you could do with a task and vice-versa. Therefore, you could just have this in your config file:

[Task0]
path = newpath_1
...
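If you need the value to change per run from Python, one possibility (a hedged, untested sketch) is to write into the loaded configuration programmatically before each build, since a value in the [Task0] config section takes precedence over the Python-level default; the defensive section handling below is an assumption:

import luigi
import luigi.configuration

conf = luigi.configuration.get_config()
for i in range(3):
    if not conf.has_section('Task0'):
        conf.add_section('Task0')      # make sure the section exists
    conf.set('Task0', 'path', f"newpath_{i}")
    luigi.build([TaskC(arg=i)], log_level="INFO")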
3.) But, since you seem to want to run multiple tasks with different arguments for each, I would just recommend passing the args in through the parents, as Luigi encourages you to do. Then you could run everything with:
luigi.build([TaskC(arg=i) for i in range(3)])
4.) Finally, if you really need to get rid of passing dependencies, you can create a ParameterizedTaskParameter that extends luigi.ObjectParameter and uses the pickle of a task instance as the object.
Of the above solutions, I highly suggest either 2 or 3. 1 would be difficult to program around, and 4 would create some very ugly parameters and is a bit more advanced.
Edit: Solutions 1 and 2 are more hacks than anything; the recommended approach is just to bundle parameters in a DictParameter.
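As a minimal sketch of that last suggestion (assuming TaskB and TaskA are changed to accept and forward cfg the same way): the settings travel as one bundled parameter, and only Task0 unpacks them:

class TaskC(Task):
    arg = IntParameter(default=0)
    cfg = luigi.DictParameter(default={"path": "defaultpath.txt"})

    def requires(self):
        # one dict travels down the chain instead of many loose parameters
        return TaskB(arg=self.arg, cfg=self.cfg)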

Related

include multiple functions in Python class and object

I am very new to the concept of "classes" and "objects" in Python. I succeeded in defining a single function using:
# build the object "test"
class test:
    def __init__(self, raw_data):
        self.method1 = raw_data*10
        self.method2 = raw_data*20
        self.method3 = raw_data*30

# a quick test using "raw_data = 1"
output = test(1)

# here three methods are all working
print(output.method1)
print(output.method2)
print(output.method3)

# outputs
10
20
30
But in real work, how can I include a lot of functions/processing steps under this "class" or "object" thing, so I can run all of them together? The code below fails (only the first function was working):
# build the object "test"
class test:
    def __init__(self, raw_data):
        self.method1 = raw_data*10

    def compute_method_2(self, raw_data):
        self.method2 = raw_data*20

    def compute_method_3(self, raw_data):
        self.method3 = raw_data*30

# a quick test using "raw_data = 1"
output = test(1)

# now only the first calculation worked
print(output.method1)
print(output.method2)
print(output.method3)

# Error report:
AttributeError: 'test' object has no attribute 'method2'
Many thanks for your help!
To clarify why I want to split the functions: This is just a simplified example. In real work, there are multiple functions needed for different processing steps, and those functions work on different items.
Usually, Python methods only run if you choose to run them. You've gotten tripped up because your first encounter with methods is __init__(), but that is actually a weird exception to that rule: it's run immediately when you create each object of that class (that's the whole point of __init__()). So you need to run those methods manually if you want them to run:
# a quick test using "raw_data = 1"
output = test(1)

# run other computations
output.compute_method_2(1)
output.compute_method_3(1)

# now all the values are available
print(output.method1)
print(output.method2)
print(output.method3)
If you want the methods to run when you create the objects, which it looks like you do here, it's better to put that code in __init__() rather than manually calling the methods every time you make an object - remember, that's why __init__() is there! But maybe your __init__() was getting too big and that's why you wanted to split it up. In that case, you can still put your code into methods, but call them from __init__() (and then you don't need to call them separately like in the above example):
class test:
    def __init__(self, raw_data):
        self.method1 = raw_data*10
        self.compute_method_2(raw_data)
        self.compute_method_3(raw_data)

    def compute_method_2(self, raw_data):
        self.method2 = raw_data*20

    def compute_method_3(self, raw_data):
        self.method3 = raw_data*30
By the way, a "method" is a member function of a class, like your compute_method_2() function (and like __init__()!). Data members are not methods, so it is confusing that you used names like self.method2 for these.
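For instance, with purely illustrative names, the same class reads more clearly when the data members don't look like methods:

class test:
    def __init__(self, raw_data):
        self.result1 = raw_data * 10      # data member, not a method
        self.compute_result_2(raw_data)
        self.compute_result_3(raw_data)

    def compute_result_2(self, raw_data):
        self.result2 = raw_data * 20

    def compute_result_3(self, raw_data):
        self.result3 = raw_data * 30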

Converting one pdb command to another, but it's not working

Need help. I am trying to change the functionality of the c command to the quit command; these changes are needed for the further creation of new commands. I don't know what I am doing wrong, or how these two things are different: the first one works fine, but the second one does not, even though I am just changing the behaviour.
db = pdb.Pdb()
db.do_c = db.do_quit
no = 3
db.runcall(fun,no)
But this is not working; in this case, self.do_quit is not even getting called.
class dbg(pdb.Pdb):
    def custom_quit(self, arg):
        self.do_quit

db = dbg()
no = 3
db.do_c = db.custom_quit
db.runcall(fun, no)
I am just running it on a simple function fun:
def fun(no):
    print("a")
    print("b")
    for i in range(0, no):
        print(i)
    return 'abc'
On the command c, it does nothing.
The usual way to extend a method in a class is to use the same name for the method (that is, override it) while calling super() to preserve the original method's functionality. So you can change your custom method to:
class dbg(pdb.Pdb):
    def do_quit(self, arg):
        super().do_quit(arg)
        print('do something else')
        return 1
and monkey patch it with:
db.do_c = db.do_quit # do_quit as usual
Take a look into pdb.py and search for the do_quit function; you'll see that something is done there that you have to reproduce, or somehow preserve, including the return 1.
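Putting the pieces together, a minimal self-contained sketch of this override (a sketch of the approach above, not verified against every pdb version; fun is the question's example function):

import pdb

class dbg(pdb.Pdb):
    def do_quit(self, arg):
        super().do_quit(arg)        # original quit behaviour (sets the quit flag)
        print('do something else')  # custom behaviour added on top
        return 1                    # a truthy return stops the (Pdb) command loop

def fun(no):
    print("a")
    print("b")
    for i in range(0, no):
        print(i)
    return 'abc'

db = dbg()
db.do_c = db.do_quit  # typing "c" at the (Pdb) prompt now quits
db.runcall(fun, 3)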

python multiprocessing : setting class attribute value

I have a class called Experiment and another called Case. One Experiment is made of many individual cases. See the class definitions below:
from multiprocessing import Process

class Experiment(object):
    def __init__(self, name):
        self.name = name
        self.cases = []
        self.cases.append(Case('a'))
        self.cases.append(Case('b'))
        self.cases.append(Case('c'))

    def sr_execute(self):
        for c in self.cases:
            c.setVars(6)

class Case(object):
    def __init__(self, name):
        self.name = name

    def setVars(self, var):
        self.var = var
In my Experiment class, I have a function called sr_execute. This function shows the desired behavior: I am interested in iterating through all cases and setting an attribute for each of them. When I run the following code,
if __name__ == '__main__':
    #multiprocessing.freeze_support()
    e = Experiment('exp')
    e.sr_execute()
    for c in e.cases: print c.name, c.var
I get,
a 6
b 6
c 6
This is the desired behavior.
However, I would like to do this in parallel using multiprocessing. To do this, I add an mp_execute() function to the Experiment class:
def mp_execute(self):
    processes = []
    for c in self.cases:
        processes.append(Process(target=c.setVars, args=(6,)))
    [p.start() for p in processes]
    [p.join() for p in processes]
However, this does not work. When I execute the following,
if __name__ == '__main__':
    #multiprocessing.freeze_support()
    e = Experiment('exp')
    e.mp_execute()
    for c in e.cases: print c.name, c.var
I get an error,
AttributeError: 'Case' object has no attribute 'var'
Apparently, I am unable to set a class attribute using multiprocessing. Any clues as to what is going on?
When you call:
def mp_execute(self):
    processes = []
    for c in self.cases:
        processes.append(Process(target=c.setVars, args=(6,)))
    [p.start() for p in processes]
    [p.join() for p in processes]
when you create the Process, it uses a copy of your object, and the modifications to that copy are not passed back to the main program, because different processes have different address spaces. It would work if you used Threads instead, since in that case no copy is created.
Also note that your code will probably fail on Windows, because you are passing a method as target, and Windows requires the target to be picklable (and instance methods are not picklable). The target should be a function defined at the top level of a module in order to work on all OSes.
If you want to communicate the changes to the main process you could:
- Use a Queue to pass the result
- Use a Manager to build a shared object
Either way, you must handle the communication "explicitly", by setting up a "channel" (like a Queue) or by setting up shared state; a sketch of the Manager approach follows below.
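A minimal sketch of the Manager option (illustrative names; a shared dict stands in for the per-Case attribute, and the target is a top-level function, so it also works on Windows):

from multiprocessing import Process, Manager

def set_var(shared, name, value):
    shared[name] = value  # the write lands in the manager process, visible to all

if __name__ == '__main__':
    manager = Manager()
    shared = manager.dict()
    processes = [Process(target=set_var, args=(shared, name, 6))
                 for name in ('a', 'b', 'c')]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(dict(shared))  # {'a': 6, 'b': 6, 'c': 6}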
Style note: do not use list-comprehensions in this way:

[p.join() for p in processes]

It's simply wrong: you are only wasting space creating a list of Nones. It is also probably slower than the right way, since it has to append the elements to the list:

for p in processes:
    p.join()
Some say that list-comprehensions are slightly faster than for loops, however:
- The difference in performance is so small that it generally doesn't matter.
- They are faster if and only if you consider this kind of loop:

a = []
for element in something:
    a.append(element)

If the loop, like in this case, does not create a list, then the for loop will be faster.
By the way: some use map in the same way to perform side-effects. This again is wrong: you won't gain much in speed for the same reason as before, and it fails completely in Python 3, where map returns an iterator and hence will not execute the functions at all, making the code less portable.
@Bakuriu's answer offers good styling and efficiency suggestions. And it is true that each process gets a copy of the master process's stack, hence the changes made by forked processes will not be reflected in the address space of the master process unless you utilize some form of IPC (e.g. Queue, Pipe, Manager).
But the particular AttributeError: 'Case' object has no attribute 'var' error that you are getting has an additional reason, namely that your Case objects do not yet have the var attribute at the time you launch your processes. Instead, the var attribute is created in the setVars() method.
Your forked processes do indeed create the variable when they call setVars() (and actually even set it to 6), but alas, this change is only in the copies of Case objects, i.e. not reflected in the master process's memory space (where the variable still does not exist).
To see what I mean, change your Case class to this:
class Case(object):
    def __init__(self, name):
        self.name = name
        self.var = 7  # Create var in the constructor.

    def setVars(self, var):
        self.var = var
By adding the var member variable in the constructor, your master process will have access to it. Of course, the changes in the forked processes will still not be reflected in the master process, but at least you don't get the error:
a 7
b 7
c 7
Hope this sheds light on what's going on. =)
SOLUTION:
The least intrusive (to the original code) thing to do is to use a ctypes object from shared memory:

from multiprocessing import Value

class Case(object):
    def __init__(self, name):
        self.name = name
        self.var = Value('i', 7)  # Use a ctypes "int" from shared memory.

    def setVars(self, var):
        self.var.value = var  # Set the variable's "value" attribute.
and change your main() to print c.var.value:
for c in e.cases: print c.name, c.var.value # Print the "value" attribute.
Now you have the desired output:
a 6
b 6
c 6

Pickling Self and Return to Run-state?

I found the following post extremely helpful:
How to pickle yourself?
However, the limitation of this solution is that when the class is reloaded, it is not returned in its "runtime" state; i.e. it will reload all the variables and the general state of the class at the moment it was dumped, but it won't continue running from that point.
Consider:
class someClass(object):
    def doSomething(self):
        i = 0
        while i <= 20:
            execute
            i += 1
            if i == 10:
                self.dumpState()

    def dumpState(self):
        with open('somePickleFile', 'wb') as handle:
            pickle.dump(self, handle)

    @classmethod
    def loadState(cls, file_name):
        with open(file_name, 'rb') as handle:
            return pickle.load(handle)
If the above is run, by creating an instance of someClass:
sC = someClass()
sC.doSomething()
sC.loadState('somePickleFile')
This does not return the class to its runtime state; it does not continue through the while loop until i == 20.
This may not be the correct approach, but I am trying to find a way to capture the runtime state of my program, i.e. freeze/hibernate it, and then relaunch it after possibly moving it to another machine. This is due to issues I have with time restrictions enforced by a queuing system on a cluster which does not support checkpointing.
That approach won't be possible with Pickle and Unpickle alone without your code being aware of it.
Pickle can save fundamental Python objects, and ordinary user classes that reference those fundamental types. But it can't freeze information of a running context as you want.
Python does allow limited (yet powerful) ways of accessing a running code context through its frame objects - you can get a frame object with a call to inspect.currentframe in the inspect module. This will let you see the currently running line of code, the local variables, their contents, and so on - but there is no way in pure Python, without resorting to raw memory manipulation of the Python interpreter's data structures, to rebuild a mid-execution frame object and jump execution there.
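A tiny illustration of that frame introspection (read-only inspection; it does not let you rebuild or resume a frame):

import inspect

def where_am_i():
    frame = inspect.currentframe()
    # the current line number and the names of the local variables
    print(frame.f_lineno, sorted(frame.f_locals))

where_am_i()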
So, for that approach, it would be better to "freeze" the entire process and its memory data structures using an O.S.-level mechanism (there probably is a way to do that in Linux, and it should work if no file/file-like resources are in use by the process).
Or, from within Python, like you want, you have to keep "book keeping" of all your state data in a manner that Pickle will be able to "see" it. In your basic example, you should refactor your code to something like:
class someClass(object):
    def setup(self):
        self.i = 0

    def doSomething(self):
        while self.i <= 20:
            execute
            self.i += 1
            if self.i == 10:
                self.dumpState()
    ...

    @classmethod
    def loadState(cls, file_name):
        with open(file_name, 'rb') as handle:
            self = pickle.load(handle)
        if self.i <= 20:  # or other check for "running context"
            return self.doSomething()
The fundamental difference here is the book-keeping of the otherwise-local i variable as an object attribute, and the separation of the initialization code. In this way, all the state needed to continue the execution - for this small example - is recorded in the object's attributes, which can be properly pickled.
loadState is a classmethod returning a new instance of someClass (or something else pickled into the file). So you should write instead:
sC = someClass()
sC.doSomething()
sC = someClass.loadState('somePickleFile')
I believe pickle only keeps the attribute values of the instance, not the internal state of any methods executing. It will not save the fact that a method was executing, and it won't save the values of the local variables, like i in your example.

How to avoid excessive parameter passing?

I am developing a medium-sized program in Python spread across 5 modules. The program accepts command line arguments using OptionParser in the main module, e.g. main.py. These options are later used to determine how methods in other modules behave (e.g. a.py, b.py). As I extend the ability for the user to customise the behaviour of the program, I find that I end up requiring a user-defined parameter in a method in a.py that is not directly called by main.py, but is instead called by another method in a.py:
main.py:

import a

p = some_command_line_argument_value
a.meth1(p)

a.py:

def meth1(p):
    # some code
    res = meth2(p)
    # some more code w/ res

def meth2(p):
    # do something with p
This excessive parameter passing seems wasteful and wrong, but as hard as I try I cannot think of a design pattern that solves this problem. While I had some formal CS education (minor in CS during my B.Sc.), I've only really come to appreciate good coding practices since I started using Python. Please help me become a better programmer!
Create objects of types relevant to your program, and store the command line options relevant to each in them. Example:
import WidgetFrobnosticator
f = WidgetFrobnosticator()
f.allow_concave_widgets = option_allow_concave_widgets
f.respect_weasel_pins = option_respect_weasel_pins

# Now the methods of WidgetFrobnosticator have access to your command-line
# parameters, in a way that's not dependent on the input format.

import PlatypusFactory
p = PlatypusFactory()
p.allow_parthenogenesis = option_allow_parthenogenesis
p.max_population = option_max_population

# The platypus factory knows about its own options, but not those of the
# WidgetFrobnosticator or vice versa. This makes each class easier to read
# and implement.
Maybe you should organize your code more into classes and objects? As I was writing this, Jimmy showed a class-instance based answer, so here is a pure class-based answer. This would be most useful if you only ever wanted a single behavior; if there is any chance at all you might want different defaults some of the time, you should use ordinary object-oriented programming in Python, i.e. pass around class instances with the property p set in the instance, not the class.
class Aclass(object):
    p = None

    @classmethod
    def init_p(cls, value):
        cls.p = value

    @classmethod
    def meth1(cls):
        # some code
        res = cls.meth2()
        # some more code w/ res

    @classmethod
    def meth2(cls):
        # do something with cls.p
        pass

from a import Aclass as ac
ac.init_p(some_command_line_argument_value)
ac.meth1()
ac.meth2()
If "a" is a real object and not just a set of independent helper methods, you can create an "p" member variable in "a" and set it when you instantiate an "a" object. Then your main class will not need to pass "p" into meth1 and meth2 once "a" has been instantiated.
[Caution: my answer isn't specific to python.]
I remember that Code Complete called this kind of parameter a "tramp parameter". Googling for "tramp parameter" doesn't return many results, however.
Some alternatives to tramp parameters might include:
- Put the data in a global variable
- Put the data in a static variable of a class (similar to global data)
- Put the data in an instance variable of a class
- Pseudo-global variable: hidden behind a singleton, or some dependency injection mechanism
Personally, I don't mind a tramp parameter as long as there's no more than one; i.e. your example is OK for me, but I wouldn't like ...
import a
p1 = some_command_line_argument_value
p2 = another_command_line_argument_value
p3 = a_further_command_line_argument_value
a.meth1(p1, p2, p3)
... instead I'd prefer ...
import a
p = several_command_line_argument_values
a.meth1(p)
... because if meth2 decides that it wants more data than before, I'd prefer if it could extract this extra data from the original parameter which it's already being passed, so that I don't need to edit meth1.
With objects, parameter lists should normally be very small, since most appropriate information is a property of the object itself. The standard way to handle this is to configure the object properties and then call the appropriate methods of that object. In this case set p as an attribute of a. Your meth2 should also complain if p is not set.
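A minimal sketch of that configure-then-call style (illustrative names), including the complaint from meth2 when p was never configured:

class A(object):
    def __init__(self):
        self.p = None

    def meth1(self):
        res = self.meth2()  # p is no longer threaded through the call
        return res * 2

    def meth2(self):
        if self.p is None:
            raise ValueError("p has not been configured")
        return len(self.p)

a = A()
a.p = "some command-line value"  # configure the object once...
a.meth1()                        # ...then call its methods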
Your example is reminiscent of the code smell Message Chains. You may find the corresponding refactoring, Hide Delegate, informative.

Categories