Understanding some Luigi issues - Python

Look at class ATask:
class ATask(luigi.Task):
    config = luigi.Parameter()

    def requires(self):
        # Some tasks, maybe
        ...

    def output(self):
        return luigi.LocalTarget("A.txt")

    def run(self):
        with open("A.txt", "w") as f:
            f.write("Complete")
Now look at class BTask:
class BTask(luigi.Task):
    config = luigi.Parameter()

    def requires(self):
        return ATask(config=self.config)

    def output(self):
        return luigi.LocalTarget("B.txt")

    def run(self):
        with open("B.txt", "w") as f:
            f.write("Complete")
The first question: is there a chance that, while ATask is running and writing "A.txt", BTask will start before ATask has finished writing?
The second question: if I start execution like
luigi.build([BTask(config=some_config)], local_scheduler=True)
and the pipeline fails somewhere inside, can I find out about that from the outside, e.g. via the return value of luigi.build or something else?

No, Luigi won't start executing BTask until ATask has finished (i.e., until ATask has finished writing its target file).
If you want a detailed response from luigi.build in case of an error, you must pass an extra keyword argument, detailed_summary=True, to the build/run methods and then access summary_text, this way:
luigi_run_result = luigi.build(..., detailed_summary=True)
print(luigi_run_result.summary_text)
For details, see "Response of luigi.build()/luigi.run()" in the Luigi documentation.
Also, you may be interested in this answer about how to access the error / exception: https://stackoverflow.com/a/33396642/3219121

Related

luigi - how to create a dependency not between files, but between tasks? (or how to not involve the output method)

Given two Luigi tasks, how can I add one as a requirement for the other, in such a way that if the required task is done, the second task can start, with no output involved?
Currently I get RuntimeError: Unfulfilled dependency at run time: MyTask___home_..., even though the task completed OK, because my requires / output methods are not configured correctly.
class ShellTask(ExternalProgramTask):
    """
    ExternalProgramTask subclass dedicated to one task, with the ability
    to capture output.

    Args:
        shell_cmd (str): The shell command to be run in a subprocess.
        capture_output (bool, optional): If True, the output is not displayed
            to the console, and is printed after the task is done via
            logger.info (both stdout + stderr). Defaults to True.
    """
    shell_cmd = luigi.Parameter()
    requirement = luigi.Parameter(default='')
    succeeded = False

    def on_success(self):
        self.succeeded = True

    def requires(self):
        return eval(self.requirement) if self.requirement else None

    def program_args(self):
        """
        Must be implemented in an ExternalProgramTask subclass.

        Args:
            shell_cmd (luigi.Parameter (str)): the shell command to be passed
                as args to the run method (run should not be overridden!).

        Returns:
            A script that would be run in a subprocess.Popen.
        """
        return self.shell_cmd.split()


class MyTask(ShellTask):
    pass


if __name__ == '__main__':
    task_0 = MyTask(
        shell_cmd='...',
        requirement="MyTask(shell_cmd='...')",
    )
    luigi.build([task_0], workers=2, local_scheduler=False)
I hoped using on_success could prompt something to the caller task, but I couldn't figure out how.
I'm currently working around this in the following way:
1) implementing the output method based on the input of the task (much like the eval(requirement) I did),
2) implementing the run method (calling the super run and then writing "ok" to the output),
3) deleting the output files from main,
4) calling it something like this:
if __name__ == '__main__':
    clean_output_files(['_.txt'])
    task = MyTask(
        shell_cmd='...',
        requirement="MyTask(shell_cmd='...', output_file='_.txt')",
    )
So within your first Luigi task, you could call your second task by making it a requirement.
For example:
class TaskB(luigi.Task):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.complete_flag = False

    def run(self):
        self.complete_flag = True
        print('do something')

    def complete(self):
        return self.complete_flag


class TaskA(luigi.Task):
    def requires(self):
        return TaskB()

    def run(self):
        print('Carry on with other logic')
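The flag-based complete() idea can be sketched without luigi at all; the toy build function below runs requirements depth-first and skips anything already complete. The Task class, build function, and flag handling here are illustrative stand-ins, not the real luigi API. One caveat: an in-memory flag is only visible inside the process that set it, so across separate scheduler runs or worker processes a persistent target (e.g. a marker file) remains the safer completion signal.

```python
# Toy, luigi-free sketch of flag-based task completion.
# `Task`, `requires`, `complete`, and `run` mimic the shape of the
# luigi API for illustration only.

class Task:
    def __init__(self):
        self.complete_flag = False

    def requires(self):
        return []

    def complete(self):
        return self.complete_flag

    def run(self):
        self.complete_flag = True


def build(task, log):
    """Run `task` after its requirements, depth-first, skipping complete ones."""
    for req in task.requires():
        build(req, log)
    if not task.complete():
        task.run()
        log.append(type(task).__name__)


class TaskB(Task):
    pass


class TaskA(Task):
    def __init__(self, dep):
        super().__init__()
        self.dep = dep

    def requires(self):
        return [self.dep]


log = []
a = TaskA(TaskB())
build(a, log)   # runs TaskB first, then TaskA
build(a, log)   # both flags are set now, so nothing reruns
print(log)      # → ['TaskB', 'TaskA']
```

Because the second build call finds both flags already set, no task runs twice within the process; restart the process, though, and both flags reset to False.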

How to check output dynamically with Luigi

I realize I likely need to use dynamic requirements to accomplish the following task, however I have not been able to wrap my head around what this would look like in practice.
The goal is to use Luigi to generate data and add it to a database, without knowing ahead of time what data will be generated.
Take the following example using mongodb:
import luigi
from uuid import uuid4
from luigi.contrib import mongodb
import pymongo


# Make up IDs, though in practice the IDs may be generated from an API
class MakeID(luigi.Task):
    def run(self):
        with self.output().open('w') as f:
            f.write(','.join([str(uuid4()) for e in range(10)]))

    # Write the data to a file
    def output(self):
        return luigi.LocalTarget('data.csv')


class ToDataBase(luigi.Task):
    def requires(self):
        return MakeID()

    def run(self):
        with self.input().open('r') as f:
            ids = f.read().split(',')
        # Add some fake data to simulate generating new data
        count_data = {key: value for value, key in enumerate(ids)}
        # Add data to the database
        self.output().write(count_data)

    def output(self):
        # Attempt to read the (possibly non-existent) file of IDs
        # to check whether the task is complete
        with self.input().open('r') as f:
            valid_ids = f.read().split(',')
        client = pymongo.MongoClient('localhost', 27017, ssl=False)
        return mongodb.MongoRangeTarget(client, 'myDB', 'myData',
                                        valid_ids, 'myField')


if __name__ == '__main__':
    luigi.run()
The goal is to obtain data, modify it and then add it to a database.
The above code fails when run because the output method of ToDataBase runs before the requires method, so while the function has access to the input, the input does not yet exist. Regardless, I still need to check that the data was actually added to the database.
This github issue is close to what I am looking for, though as I mentioned I have not been able to figure out dynamic requirements for this use case in practice.
The solution is to create a third task (Dynamic in the example below) that yields the task that is waiting on the dynamic input, with the dependency passed as a parameter rather than returned from a requires method.
class ToDatabase(luigi.Task):
    fp = luigi.Parameter()

    def output(self):
        with open(self.fp, 'r') as f:
            valid_ids = [str(e) for e in f.read().split(',')]
        client = pymongo.MongoClient('localhost', 27017, ssl=False)
        return mongodb.MongoRangeTarget(client, 'myDB', 'myData',
                                        valid_ids, 'myField')

    def run(self):
        with open(self.fp, 'r') as f:
            valid_ids = [str(e) for e in f.read().split(',')]
        self.output().write({k: 5 for k in valid_ids})


class Dynamic(luigi.Task):
    def output(self):
        return self.input()

    def requires(self):
        return MakeID()

    def run(self):
        yield ToDatabase(fp=self.input().path)
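For intuition, Luigi's dynamic dependencies work by treating run() as a generator: when run yields a task, the worker pauses the current task, completes the yielded work, and then resumes run(). The following is a toy, luigi-free model of that protocol; the class and function names are invented for illustration:

```python
# Toy model of luigi-style dynamic dependencies: run() may be a
# generator that yields sub-tasks discovered only at runtime.
import inspect

def execute(task, log):
    result = task.run(log)
    if inspect.isgenerator(result):
        for subtask in result:       # each yielded task is run to completion
            execute(subtask, log)    # before the generator is resumed

class WriteIds:
    def run(self, log):
        log.append('WriteIds')

class ToDatabase:
    def __init__(self, ids):
        self.ids = ids
    def run(self, log):
        log.append('ToDatabase({} ids)'.format(len(self.ids)))

class Dynamic:
    def run(self, log):
        yield WriteIds()              # dependency known only at runtime
        ids = ['a', 'b', 'c']         # pretend we read these from WriteIds' output
        yield ToDatabase(ids)

log = []
execute(Dynamic(), log)
print(log)   # → ['WriteIds', 'ToDatabase(3 ids)']
```

The key point the toy captures: the second yield can depend on data produced by the first, which is exactly what a static requires() method cannot express.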

JSON serialization error when creating a Luigi task graph

I'm trying to batch up the processing of a few Jupyter notebooks using Luigi, and I've run into a problem.
I have two classes. The first, transform.py:
import nbformat
import nbconvert
import luigi
from nbconvert.preprocessors.execute import CellExecutionError


class Transform(luigi.Task):
    """Foo."""
    notebook = luigi.Parameter()
    requirements = luigi.ListParameter()

    def requires(self):
        return self.requirements

    def run(self):
        nb = nbformat.read(self.notebook, nbformat.current_nbformat)
        # https://nbconvert.readthedocs.io/en/latest/execute_api.html
        ep = nbconvert.preprocessors.ExecutePreprocessor(timeout=600,
                                                         kernel_name='python3')
        try:
            ep.preprocess(nb, {'metadata': {'path': "/".join(self.notebook.split("/")[:-1])}})
            with self.output().open('w') as f:
                nbformat.write(nb, f)
        except CellExecutionError:
            pass  # TODO

    def output(self):
        return luigi.LocalTarget(self.notebook)
This defines a Luigi task that takes a notebook as input (along with possible prior requirements to running this task) and ought to run that notebook and report a success or failure as output.
To run Transform tasks I have a tiny Runner class:
import luigi


class Runner(luigi.Task):
    requirements = luigi.ListParameter()

    def requires(self):
        return self.requirements
To run my little job, I do:
from transform import Transform
trans = Transform("../tests/fixtures/empty_valid_errorless_notebook.ipynb", [])
from runner import Runner
run_things = Runner([trans])
But this raises TypeError: Object of type 'Transform' is not JSON serializable!
Is my luigi task format correct? If so, is it obvious what component in run is making the entire class unserializable? If not, how should I go about debugging this?
requires() is supposed to return a task or tasks, not a parameter.
e.g.,
class Runner(luigi.Task):
    notebooks = luigi.ListParameter()

    def requires(self):
        required_tasks = []
        for notebook in self.notebooks:
            required_tasks.append(Transform(notebook))
        return required_tasks


class Transform(luigi.Task):
    notebook = luigi.Parameter()

    def requires(self):
        return []
# then to run at the command line:
luigi --module YourModule Runner --notebooks '["notebook1.ipynb","notebook2.ipynb"]'
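The original TypeError is reproducible without luigi: ListParameter values are serialized with the standard json module, which accepts lists of plain strings but rejects arbitrary objects such as a task instance. A minimal demonstration follows; this Transform is a plain stand-in class, not a luigi task:

```python
import json

class Transform:
    """Stand-in for the luigi Task that was passed as a parameter value."""
    def __init__(self, notebook):
        self.notebook = notebook

# What Runner([trans]) effectively asks the parameter machinery to do:
try:
    json.dumps([Transform("notebook.ipynb")])
except TypeError as err:
    print(err)   # e.g. "Object of type Transform is not JSON serializable"

# A list of notebook paths, by contrast, serializes cleanly:
print(json.dumps(["notebook1.ipynb", "notebook2.ipynb"]))
```

This is why the accepted answer passes a list of notebook file names and constructs the Transform tasks inside requires() instead.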

Can luigi rerun tasks when the task dependencies become out of date?

As far as I know, a luigi.Target can either exist, or not.
Therefore, once a luigi.Target exists, the task that produces it won't be recomputed.
I'm looking for a way to force recomputation of the task, if one of its dependencies is modified, or if the code of one of the tasks changes.
One way you could accomplish your goal is by overriding the complete(...) method.
The documentation for complete is straightforward.
Simply implement a function that checks your constraint, and returns False if you want to recompute the task.
For example, to force recomputation when a dependency has been updated, you could do:
import os
import time

def complete(self):
    """Flag this task as incomplete if any requirement is incomplete
    or has been updated more recently than this task."""
    def mtime(path):
        return time.ctime(os.path.getmtime(path))

    # assuming 1 output
    if not os.path.exists(self.output().path):
        return False
    self_mtime = mtime(self.output().path)

    # the below assumes a list of requirements, each with a list of outputs. YMMV
    for el in self.requires():
        if not el.complete():
            return False
        for output in el.output():
            if mtime(output.path) > self_mtime:
                return False
    return True
This will return False when any requirement is incomplete, when any requirement has been modified more recently than this task, or when the output of the current task does not exist.
Detecting when code has changed is harder. You could use a similar scheme (checking mtime), but it'd be hit-or-miss unless every task has its own file.
Because of the ability to override complete, any logic you want for recomputation can be implemented. If you want a particular complete method for many tasks, I'd recommend sub-classing luigi.Task, implementing your custom complete there, and then inheriting your tasks from the sub-class.
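For the code-change case, a content hash is more reliable than mtime: store a digest of the task's source file alongside the output and treat the task as incomplete when the digest no longer matches. The sketch below shows just the digest part; the throwaway file handling is purely illustrative:

```python
import hashlib
import os
import tempfile

def file_digest(path):
    """SHA-256 hex digest of a file's contents; changes exactly when the code changes."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

# Demonstrate with a throwaway "task source" file:
fd, src = tempfile.mkstemp(suffix='.py')
os.close(fd)
with open(src, 'w') as f:
    f.write("def run():\n    return 1\n")
before = file_digest(src)

with open(src, 'w') as f:   # simulate editing the task's code
    f.write("def run():\n    return 2\n")
after = file_digest(src)

print(before != after)      # → True: the edit is detected regardless of timestamps
os.remove(src)
```

Unlike mtime, the digest is unaffected by touch, copies, or clock skew, at the cost of re-reading the file on every completeness check.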
I'm late to the game, but here's a mixin that improves the accepted answer to support multiple input / output files.
import os
import time


class MTimeMixin:
    """
    Mixin that flags a task as incomplete if any requirement
    is incomplete or has been updated more recently than this task.

    This is based on http://stackoverflow.com/a/29304506, but extends
    it to support multiple input / output dependencies.
    """
    def complete(self):
        def to_list(obj):
            if type(obj) in (type(()), type([])):
                return obj
            else:
                return [obj]

        def mtime(path):
            return time.ctime(os.path.getmtime(path))

        if not all(os.path.exists(out.path) for out in to_list(self.output())):
            return False
        self_mtime = min(mtime(out.path) for out in to_list(self.output()))

        # the below assumes a list of requirements, each with a list of outputs. YMMV
        for el in to_list(self.requires()):
            if not el.complete():
                return False
            for output in to_list(el.output()):
                if mtime(output.path) > self_mtime:
                    return False
        return True
To use it, you would just declare your class using, for example, class MyTask(MTimeMixin, luigi.Task).
The above code works well for me, except that I believe that for a proper timestamp comparison, mtime(path) must return a float instead of a string ("Sat" > "Mon", alphabetically). Thus simply:
def mtime(path):
    return os.path.getmtime(path)
instead of:
def mtime(path):
    return time.ctime(os.path.getmtime(path))
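The pitfall is easy to demonstrate: time.ctime returns a fixed-format string that begins with the weekday abbreviation, so comparing two such strings compares weekday names alphabetically rather than comparing the underlying times:

```python
import time

earlier = 0            # the Unix epoch
later = 2 * 86400      # two days after the epoch

print(later > earlier)         # floats order correctly: True
print(time.ctime(earlier))     # e.g. 'Thu Jan  1 00:00:00 1970' (local-timezone-dependent)
print(time.ctime(later))       # e.g. 'Sat Jan  3 00:00:00 1970'

# String comparison starts at the weekday abbreviation, so the later
# timestamp's string sorts *before* the earlier one's here:
print(time.ctime(later) > time.ctime(earlier))   # → False for these dates
```

With the string version of mtime, the mixin's "more recently modified" test can therefore silently give the wrong answer depending on which weekdays the timestamps fall on.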
Regarding the Mixin suggestion from Shilad Sen posted above, consider this example:
# Filename: run_luigi.py
import luigi
from MTimeMixin import MTimeMixin


class PrintNumbers(luigi.Task):
    def requires(self):
        return []

    def output(self):
        return luigi.LocalTarget("numbers_up_to_10.txt")

    def run(self):
        with self.output().open('w') as f:
            for i in range(1, 11):
                f.write("{}\n".format(i))


class SquaredNumbers(MTimeMixin, luigi.Task):
    def requires(self):
        return [PrintNumbers()]

    def output(self):
        return luigi.LocalTarget("squares.txt")

    def run(self):
        with self.input()[0].open() as fin, self.output().open('w') as fout:
            for line in fin:
                n = int(line.strip())
                out = n * n
                fout.write("{}:{}\n".format(n, out))


if __name__ == '__main__':
    luigi.run()
where MTimeMixin is as in the post above. I run the task once using
luigi --module run_luigi SquaredNumbers
Then I touch the file numbers_up_to_10.txt and run the task again. Luigi then gives the following complaint:
File "c:\winpython-64bit-3.4.4.6qt5\python-3.4.4.amd64\lib\site-packages\luigi-2.7.1-py3.4.egg\luigi\local_target.py", line 40, in move_to_final_destination
    os.rename(self.tmp_path, self.path)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'squares.txt-luigi-tmp-5391104487' -> 'squares.txt'
This may just be a Windows problem; it is not an issue on Linux, where "mv a b" simply replaces the old b if it already exists and is not write-protected. We can fix it with the following patch to luigi/local_target.py:
def move_to_final_destination(self):
    if os.path.exists(self.path):
        os.rename(self.path, self.path + time.strftime("_%Y%m%d%H%M%S.txt"))
    os.rename(self.tmp_path, self.path)
Also for completeness here is the Mixin again as a separate file, from the other post:
import os


class MTimeMixin:
    """
    Mixin that flags a task as incomplete if any requirement
    is incomplete or has been updated more recently than this task.

    This is based on http://stackoverflow.com/a/29304506, but extends
    it to support multiple input / output dependencies.
    """
    def complete(self):
        def to_list(obj):
            if type(obj) in (type(()), type([])):
                return obj
            else:
                return [obj]

        def mtime(path):
            return os.path.getmtime(path)

        if not all(os.path.exists(out.path) for out in to_list(self.output())):
            return False
        self_mtime = min(mtime(out.path) for out in to_list(self.output()))

        # the below assumes a list of requirements, each with a list of outputs. YMMV
        for el in to_list(self.requires()):
            if not el.complete():
                return False
            for output in to_list(el.output()):
                if mtime(output.path) > self_mtime:
                    return False
        return True

Best way to mix and match components in a python app

I have a component that uses a simple pub/sub module I wrote as a message queue. I would like to try out other implementations like RabbitMQ. However, I want to make this backend change configurable so I can switch between my implementation and 3rd party modules for cleanliness and testing.
The obvious answer seems to be to:
Read a config file
Create a modifiable settings object/dict
Modify the target component to lazily load the specified implementation.
something like:
# component.py
from test.queues import Queue

class Component:
    def __init__(self, Queue=Queue):
        self.queue = Queue()

    def publish(self, message):
        self.queue.publish(message)
# queues.py
import test.settings as settings

def Queue(*args, **kwargs):
    klass = settings.get('queue')
    return klass(*args, **kwargs)
I'm not sure whether __init__ should take in the Queue class; I figure it would make it easy to specify the queue used while testing.
Another thought I had was something like mock.patch (http://www.voidspace.org.uk/python/mock/patch.html), though that seems like it would get messy. The upside would be that I wouldn't have to modify the code to support swapping components.
Any other ideas or anecdotes would be appreciated.
EDIT: Fixed indent.
One thing I've done before is to create a common class that each specific implementation inherits from. Then there's a spec that can easily be followed, and each implementation can avoid repeating certain code they'll all share.
This is a contrived example, but you can see how you could make the saver object use any of the specified classes, and the rest of your code wouldn't care.
class SaverTemplate(object):
    def __init__(self, name, obj):
        self.name = name
        self.obj = obj

    def save(self):
        raise NotImplementedError


import json

class JsonSaver(SaverTemplate):
    def save(self):
        with open(self.name + '.json', 'w') as file:
            json.dump(self.obj, file)


import pickle

class PickleSaver(SaverTemplate):
    def save(self):
        with open(self.name + '.pickle', 'wb') as file:
            pickle.dump(self.obj, file, protocol=pickle.HIGHEST_PROTOCOL)


import yaml

class YamlSaver(SaverTemplate):
    def save(self):
        with open(self.name + '.yaml', 'w') as file:
            yaml.dump(self.obj, file)


saver = PickleSaver('whatever', foo)
saver.save()
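To make the backend choice configuration-driven rather than hard-coded, the template idea can be combined with a small registry keyed by a settings value. Everything below (the class names, the settings dict, make_queue) is invented for illustration:

```python
# Minimal config-driven backend selection; all names are illustrative.

class MemoryQueue:
    """Simple in-process pub/sub stand-in."""
    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)

class LoggingQueue(MemoryQueue):
    """An alternative backend exposing the same interface."""
    def publish(self, message):
        super().publish(message)

# Registry mapping a config value to a backend class:
BACKENDS = {'memory': MemoryQueue, 'logging': LoggingQueue}

settings = {'queue': 'memory'}      # would normally come from a config file

def make_queue():
    """Look the backend class up at call time, so tests can swap settings."""
    return BACKENDS[settings['queue']]()

class Component:
    def __init__(self, queue_factory=make_queue):
        self.queue = queue_factory()   # injectable for tests
    def publish(self, message):
        self.queue.publish(message)

c = Component()
c.publish('hello')
print(type(c.queue).__name__, c.queue.messages)   # → MemoryQueue ['hello']
```

Resolving the class at call time (inside make_queue) rather than at import time is what lets a test, or a changed config file, swap in RabbitMQ or a fake without touching Component.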
