JSON serialization error when creating a Luigi task graph

JSON serialization error when creating a Luigi task graph - python

I'm trying to batch up the processing of a few Jupyter notebooks using Luigi, and I've run into a problem.
I have two classes. The first, transform.py:
import nbformat
import nbconvert
import luigi
from nbconvert.preprocessors.execute import CellExecutionError
class Transform(luigi.Task):
"""Foo."""
notebook = luigi.Parameter()
requirements = luigi.ListParameter()
def requires(self):
return self.requirements
def run(self):
nb = nbformat.read(self.notebook, nbformat.current_nbformat)
# https://nbconvert.readthedocs.io/en/latest/execute_api.html
ep = nbconvert.preprocessors.ExecutePreprocessor(timeout=600, kernel_name='python3')
try:
ep.preprocess(nb, {'metadata': {'path': "/".join(self.notebook.split("/")[:-1])}})
with self.output().open('w') as f:
nbformat.write(nb, f)
except CellExecutionError:
pass # TODO
def output(self):
return luigi.LocalTarget(self.notebook)
This defines a Luigi task that takes a notebook as input (along with possible prior requirements to running this task) and ought to run that notebook and report a success or failure as output.
To run Transform tasks I have a tiny Runner class:
import luigi
class Runner(luigi.Task):
requirements = luigi.ListParameter()
def requires(self):
return self.requirements
To run my little job, I do:
from transform Transform
trans = Transform("../tests/fixtures/empty_valid_errorless_notebook.ipynb", [])
from runner import Runner
run_things = Runner([trans])
But this raises TypeError: Object of type 'Transform' is not JSON serializable!
Is my luigi task format correct? If so, is it obvious what component in run is making the entire class unserializable? If not, how should I go about debugging this?

requires() is supposed to return a task or tasks, not a parameter.
e.g.,
class Runner(luigi.Task):
notebooks = luigi.ListParameter()
def requires(self):
required_tasks = []
for notebook in self.notebooks:
required_tasks.append(Transform(notebook))
return required_tasks
class Transform(luigi.Task):
notebook = luigi.Parameter()
def requires(self):
return []
# then to run at cmd line
luigi --module YourModule Runner --noteboooks '["notebook1.pynb","notebook2.pynb"]'

Related

understanding of some Luigi issues

Look at class ATask
class ATask(luigi.Task):
config = luigi.Parameter()
def requires(self):
# Some Tasks maybe
def output(self):
return luigi.LocalTarget("A.txt")
def run(self):
with open("A.txt", "w") as f:
f.write("Complete")
Now look at class BTask
class BTask(luigi.Task):
config = luigi.Parameter()
def requires(self):
return ATask(config = self.config)
def output(self):
return luigi.LocalTarget("B.txt")
def run(self):
with open("B.txt", "w") as f:
f.write("Complete")
Question is there is a chance that while TaskA running and start write "A.txt" taskB will start before taskA finishing writing?
The second is that if I start execution like
luigi.build([BTask(config=some_config)], local_scheduler=True )
And if this pipilene fail inside - Could I somehow to know outside about this like return value of luigi.build or smth else?

No, luigi won't start executing TaskB until TaskA has finished (ie, until it has finished writing the target file)
If you want to get a detailed response for luigi.build in case of error, you must pass an extra keyword argument: detailed_summary=True to build/run methods and then access the summary_text, this way:
luigi_run_result = luigi.build(..., detailed_summary=True)
print(luigi_run_result.summary_text)
For details on that, please read Response of luigi.build()/luigi.run() in Luigi documentation.
Also, you may be interested in this answer about how to access the error / exception: https://stackoverflow.com/a/33396642/3219121

luigi - how to create a dependency not between files, but between tasks? (or how to not involve the output method)

Given two luigi tasks, how can I add one as a requirement for the other, in a way that if the required is done, the second task could start, with no output involved?
Currently I get RuntimeError: Unfulfilled dependency at run time: MyTask___home_... even though the task completed ok, because my requires / output methods are not configured right...
class ShellTask(ExternalProgramTask):
"""
ExternalProgramTask's subclass dedicated for one task with the capture output ability.
Args:
shell_cmd (str): The shell command to be run in a subprocess.
capture_output (bool, optional): If True the output is not displayed to console,
and printed after the task is done via
logger.info (both stdout + stderr).
Defaults to True.
"""
shell_cmd = luigi.Parameter()
requirement = luigi.Parameter(default='')
succeeded = False
def on_success(self):
self.succeeded = True
def requires(self):
return eval(self.requirement) if self.requirement else None
def program_args(self):
"""
Must be implemented in an ExternalProgramTask subclass.
Returns:
A script that would be run in a subprocess.Popen.
Args:
shell_cmd (luigi.Parameter (str)): the shell command to be passed as args
to the run method (run should not be overridden!).
"""
return self.shell_cmd.split()
class MyTask(ShellTask):
"""
Args: if __name__ == '__main__':
clean_output_files(['_.txt'])
task = MyTask(
shell_cmd='...',
requirement="MyTask(shell_cmd='...', output_file='_.txt')",
)
"""
pass
if __name__ == '__main__':
task_0 = MyTask(
shell_cmd='...',
requirement="MyTask(shell_cmd='...')",
)
luigi.build([task_0], workers=2, local_scheduler=False)
I hoped using the on_success could prompt something to the caller task, but I didn't figure out how to.
I'm currently overcoming this in the following way:
0) implement the output method based on the input of the task (much like the eval(requirement) I did
2) implement the run method (calling the super run and then writing "ok" to output
3) deleting the output files from main.
4) calling it somehitng like this:
if __name__ == '__main__':
clean_output_files(['_.txt'])
task = MyTask(
shell_cmd='...',
requirement="MyTask(shell_cmd='...', output_file='_.txt')",
)

So within your first luigi task, you could call your second Task within by making it a requirement.
For example:
class TaskB(luigi.Task):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.complete_flag = False
def run(self):
self.complete_flag = True
print('do something')
def complete(self):
return self.is_complete
class TaskA(luigi.Task):
def requires(self):
return TaskB()
def run(self):
print('Carry on with other logic')

How to check output dynamically with Luigi

I realize I likely need to use dynamic requirements to accomplish the following task, however I have not been able to wrap my head around what this would look like in practice.
The goal is to use Luigi to generate data and add it to a database, without knowing ahead of time what data will be generated.
Take the following example using mongodb:
import luigi
from uuid import uuid4
from luigi.contrib import mongodb
import pymongo
# Make up IDs, though in practice the IDs may be generated from an API
class MakeID(luigi.Task):
def run(self):
with self.output().open('w') as f:
f.write(','.join([str(uuid4()) for e in range(10)]))
# Write the data to file
def output(self):
return luigi.LocalTarget('data.csv')
class ToDataBase(luigi.Task):
def requires(self):
return MakeID()
def run(self):
with self.input().open('r') as f:
ids = f.read().split(',')
# Add some fake data to simulate generating new data
count_data = {key: value for value, key in enumerate(ids)}
# Add data to the database
self.output().write(count_data)
def output(self):
# Attempt to read non-existent file to get the IDs to check if task is complete
with self.input().open('r') as f:
valid_ids = f.read().split(',')
client = pymongo.MongoClient('localhost',
27017,
ssl=False)
return mongodb.MongoRangeTarget(client,
'myDB',
'myData',
valid_ids,
'myField')
if __name__ == '__main__':
luigi.run()
The goal is to obtain data, modify it and then add it to a database.
The above code fails when run because the output method of ToDataBase runs before the require method so the while the function has access to the input, the input does not yet exist. Regardless I still need to check to be sure that the data was added to the database.
This github issue is close to what I am looking for, though as I mentioned I have not been able to figure out dynamic requirements for this use case in practice.

The solution is to create a third task (in the example Dynamic) that yields the task that is waiting on dynamic input and making the dependency a parameter rather than a requires method.
class ToDatabase(luigi.Task):
fp = luigi.Parameter()
def output(self):
with open(self.fp, 'r') as f:
valid_ids = [str(e) for e in f.read().split(',')]
client = pymongo.MongoClient('localhost', 27017, ssl=False)
return mongodb.MongoRangeTarget(client, 'myDB', 'myData',
valid_ids, 'myField')
def run(self):
with open(self.fp, 'r') as f:
valid_ids = [str(e) for e in f.read().split(',')]
self.output().write({k: 5 for k in valid_ids})
class Dynamic(luigi.Task):
def output(self):
return self.input()
def requires(self):
return MakeIDs()
def run(self):
yield(AddToDatabase(fp=self.input().path))

MySQL Targets in Luigi workflow

My TaskB requires TaskA, and on completion TaskA writes to a MySQL table, and then TaskB is to take in this output to the table as its input.
I cannot seem to figure out how to do this in Luigi. Can someone point me to an example or give me a quick example here?

The existing MySqlTarget in luigi uses a separate marker table to indicate when the task is complete. Here's the rough approach I would take...but your question is very abstract, so it is likely to be more complicated in reality.
import luigi
from datetime import datetime
from luigi.contrib.mysqldb import MySqlTarget
class TaskA(luigi.Task):
rundate = luigi.DateParameter(default=datetime.now().date())
target_table = "table_to_update"
host = "localhost:3306"
db = "db_to_use"
user = "user_to_use"
pw = "pw_to_use"
def get_target(self):
return MySqlTarget(host=self.host, database=self.db, user=self.user, password=self.pw, table=self.target_table,
update_id=str(self.rundate))
def requires(self):
return []
def output(self):
return self.get_target()
def run(self):
#update table
self.get_target().touch()
class TaskB(luigi.Task):
def requires(self):
return [TaskA()]
def run(self):
# reading from target_table

Running Hadoop jar using Luigi python

I need to run a Hadoop jar job using Luigi from python. I searched and found examples of writing mapper and reducer in Luigi but nothing to directly run a Hadoop jar.
I need to run a Hadoop jar compiled directly. How can I do it?

You need to use the luigi.contrib.hadoop_jar package (code).
In particular, you need to extend HadoopJarJobTask. For example, like that:
from luigi.contrib.hadoop_jar import HadoopJarJobTask
from luigi.contrib.hdfs.target import HdfsTarget
class TextExtractorTask(HadoopJarJobTask):
def output(self):
return HdfsTarget('data/processed/')
def jar(self):
return 'jobfile.jar'
def main(self):
return 'com.ololo.HadoopJob'
def args(self):
return ['--param1', '1', '--param2', '2']
You can also include building a jar file with maven to the workflow:
import luigi
from luigi.contrib.hadoop_jar import HadoopJarJobTask
from luigi.contrib.hdfs.target import HdfsTarget
from luigi.file import LocalTarget
import subprocess
import os
class BuildJobTask(luigi.Task):
def output(self):
return LocalTarget('target/jobfile.jar')
def run(self):
subprocess.call(['mvn', 'clean', 'package', '-DskipTests'])
class YourHadoopTask(HadoopJarJobTask):
def output(self):
return HdfsTarget('data/processed/')
def jar(self):
return self.input().fn
def main(self):
return 'com.ololo.HadoopJob'
def args(self):
return ['--param1', '1', '--param2', '2']
def requires(self):
return BuildJobTask()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

JSON serialization error when creating a Luigi task graph - python

Related

understanding of some Luigi issues

luigi - how to create a dependency not between files, but between tasks? (or how to not involve the output method)

How to check output dynamically with Luigi

MySQL Targets in Luigi workflow

Running Hadoop jar using Luigi python

Categories

Resources