Re-using generic tasks in Luigi (Python)

I'm having trouble understanding how to make re-usable tasks in Luigi, and then use them in a concrete situation.
For example, I have two generic tasks that do something to a file and then output the result:
class GffFilter(luigi.Task):
    "Filters a GFF file to only one feature"
    feature = luigi.Parameter()
    out_file = luigi.Parameter()
    in_file = luigi.Parameter()
    ...

class BgZip(luigi.Task):
    "bgzips a file"
    out_file = luigi.Parameter()
    in_file = luigi.Parameter()
    ...
Now, I want a workflow that first filters, then bgzips a specific file using these tasks:
class FilterSomeFile(luigi.WrapperTask):
    def requires(self):
        return GffFilter(in_file='some.gff3', out_file='some.genes.gff3', feature='gene')

    def output(self):
        return self.input()

class BgZipSomeFile(luigi.Task):
    def run(self):
        filtered = FilterSomeFile()
        BgZip(filtered)
But this is awkward. In the first task I have no run method, and I'm just using dependencies to use the generic task. Is this correct? Should I be using inheritance here instead?
Then in the second task, I can't use dependencies, because I need the output from FilterSomeFile in order to use BgZip. But using dynamic dependencies seems wrong, because luigi can't build a proper dependency graph.
How should I make a Luigi workflow out of my generic tasks?

But this is awkward. In the first task I have no run method, and I'm just using dependencies to use the generic task. Is this correct?
Yes. According to this page, WrapperTask is a dummy task whose purpose is to define a workflow of tasks; it therefore doesn't perform any actions by itself. Instead, by defining several requirements, the task is complete when every requirement listed in its requires method has been completed. The main difference between a WrapperTask and a regular Task is that you don't need to define an output method to signal that the task succeeded, as can be seen here.
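For reference, this is roughly what WrapperTask does internally (a paraphrase of the Luigi source, not a verbatim copy; check your installed version):

class WrapperTask(luigi.Task):
    """Dummy task: done as soon as all of its requirements are done."""
    def complete(self):
        # No output() target of its own; completion is derived from the requirements.
        return all(r.complete() for r in luigi.task.flatten(self.requires()))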
Then in the second task, I can't use dependencies, because I need the output from FilterSomeFile in order to use BgZip. But using dynamic dependencies seems wrong, because luigi can't build a proper dependency graph.
Theoretically, you could give FilterSomeFile the same output as GffFilter, make BgZipSomeFile require FilterSomeFile, and then use FilterSomeFile.output() in BgZipSomeFile.run to access the filtered file. However, this solution would be somewhat strange because:
The wrapper task only "runs" one other task, so the wrapped task could be used directly, without having to create a WrapperTask. A better usage of WrapperTask would involve merging BgZipSomeFile and FilterSomeFile into a single subclass of WrapperTask.
A Task is being instantiated in a run method. This results in a dynamic dependency, which is not needed for this problem.
Finally, the input of GffFilter is hardcoded in the FilterSomeFile task, which makes the workflow less useful. This can be avoided by making the WrapperTask itself receive parameters and pass them on to its requirements.
A better solution would be:
import luigi as lg

class A(lg.Task):
    inFile = lg.Parameter()
    outFile = lg.Parameter()

    def run(self):
        with open(self.inFile, "r") as oldFile:
            text = oldFile.read()
        text += "*" * 10 + "\n" + "This text was added by task A.\n" + "*" * 10 + "\n"
        print(text)
        with open(self.outFile, "w") as newFile:
            newFile.write(text)

    def output(self):
        return lg.LocalTarget(self.outFile)

class B(lg.Task):
    inFile = lg.Parameter()
    outFile = lg.Parameter()

    def run(self):
        with open(self.inFile, "r") as oldFile:
            text = oldFile.read()
        text += "*" * 10 + "\n" + "This text was added by task B.\n" + "*" * 10 + "\n"
        with open(self.outFile, "w") as newFile:
            newFile.write(text)

    def output(self):
        return lg.LocalTarget(self.outFile)

class CustomWorkflow(lg.WrapperTask):
    mainOutFile = lg.Parameter()
    mainInFile = lg.Parameter()
    tempFile = "/tmp/myTempFile.txt"

    def requires(self):
        return [A(inFile=self.mainInFile, outFile=self.tempFile),
                B(inFile=self.tempFile, outFile=self.mainOutFile)]
This code can be run from the command line with:
PYTHONPATH='.' luigi --module pythonModuleContainingTheTasks --local-scheduler CustomWorkflow --mainInFile ./text.txt --mainOutFile ./procText.txt
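If you prefer to trigger the workflow from Python instead of the command line, luigi.build should work as well (a minimal sketch; the file paths are placeholders):

import luigi

# Placeholder paths; adjust to your data.
luigi.build(
    [CustomWorkflow(mainInFile='./text.txt', mainOutFile='./procText.txt')],
    local_scheduler=True,
)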


python importlib seems to be sharing data between instances

Ok, so I am having a weird one. I am running python in a SideFX Hython (their custom build) implementation that is using PDG. The only real difference between Hython and vanilla Python is some internal functions for handling geometry data and compiled nodes, which shouldn't be an issue even though they are being used.
The way the code runs, I am generating a list of files from the disk which creates PDG work items. Those work items are then processed in parallel by PDG. Here is the code for that:
import importlib.util
import pdg
import os
from pdg.processor import PyProcessor
import json

class CustomProcessor(PyProcessor):
    def __init__(self, node):
        PyProcessor.__init__(self, node)
        self.extractor_module = 'GeoExtractor'

    def onGenerate(self, item_holder, upstream_items, generation_type):
        for upstream_item in upstream_items:
            new_item = item_holder.addWorkItem(parent=upstream_item, inProcess=True)
        return pdg.result.Success

    def onCookTask(self, work_item):
        spec = importlib.util.spec_from_file_location("callback", "Geo2Custom.py")
        GE = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(GE)
        GE.convert(f"{work_item.attribValue('directory')}/{work_item.attribValue('filename')}{work_item.attribValue('extension')}",
                   work_item.index, f'FRAME {work_item.index}', self.extractor_module)
        return pdg.result.Success

def bulk_convert(path_pattern, extractor_module='GeoExtractor'):
    type_registry = pdg.TypeRegistry.types()
    try:
        type_registry.registerNode(CustomProcessor, pdg.nodeType.Processor,
                                   name="customprocessor", label="Custom Processor", category="Custom")
    except Exception:
        pass
    whereItWorks = pdg.GraphContext("testBed")
    whatWorks = whereItWorks.addScheduler("localscheduler")
    whatWorks.setWorkingDir(os.getcwd(), '$HIP')
    whereItWorks.setValues(f'{whatWorks.name}', {'maxprocsmenu': -1, 'tempdirmenu': 0, 'verbose': 1})
    findem = whereItWorks.addNode("filepattern")
    whereItWorks.setValue(f'{findem.name}', 'pattern', path_pattern, 0)
    generic = whereItWorks.addNode("genericgenerator")
    whereItWorks.setValue(generic.name, 'itemcount', 4, 0)
    custom = whereItWorks.addNode("customprocessor")
    custom.extractor_module = extractor_module
    node1 = [findem]
    node2 = [custom] * len(node1)
    for n1, n2 in zip(node1, node2):
        whereItWorks.connect(f'{n1.name}.output', f'{n2.name}.input')
        n2.cook(True)
        for node in whereItWorks.graph.nodes():
            node.dirty(False)
        whereItWorks.disconnect(f'{n1.name}.output', f'{n2.name}.input')
    print("FULLY DONE")
# Geo2Custom.py (the module loaded by onCookTask above)
import os
import hou
import traceback
import CustomWriter
import importlib

def convert(filename, frame_id, marker, extractor_module='GeoExtractor'):
    Extractor = importlib.__import__(extractor_module)
    base, ext = os.path.splitext(filename)
    if ext == '.sc':
        base = os.path.splitext(base)[0]
    dest_file = base + ".custom"
    geo = hou.Geometry()
    geo.loadFromFile(filename)
    try:
        frame = Extractor.extract_geometry(geo, frame_id)
    except Exception as e:
        print(f'F{frame_id} Geometry extraction failed: {traceback.format_exc()}.')
        return None
    print(f'F{frame_id} Geometry extracted. Writing file {dest_file}.')
    try:
        CustomWriter.write_frame(frame, dest_file)
    except Exception as e:
        print(f'F{frame_id} writing failed: {e}.')
    print(marker + " SUCCESS")
The onCookTask code is run when the work item is processed.
Inside of the GeoExtractor.py program I am importing the geometry file defined by the work item, then converting it into a couple Pandas dataframes to collate and process the massive volumes of data quickly, which is then passed to a custom set of functions for writing binary files to disk from the Pandas data.
Everything appears to run flawlessly, until I check my output binaries and see that they grow in file size far more than they should, indicating that something is either being shared between instances or not cleared from memory, and that subsequent loads of the extractor code are appending to dataframes that share the same names.
I have run the GeoExtractor code sequentially with the python instance closing between each file conversion using the exact same code and the files are fine, growing only very slowly as the geometry data volume grows, so the issue has to lie somewhere in the parallelization of it using PDG and calling the GeoExtractor.py code over and over for each work item.
I have contemplated moving the importlib stuff to the __init__() for the class leaving only the call to the member function in the onCookTask() function. Maybe even going so far as to pass a unique variable for each work item which is used inside GeoExtractor to create a closure of the internal functions so they are unique instances in memory.
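For what it's worth, the refactor I'm considering would look roughly like this (untested sketch, using the same Geo2Custom.py path and convert signature as in the code above):

class CustomProcessor(PyProcessor):
    def __init__(self, node):
        PyProcessor.__init__(self, node)
        self.extractor_module = 'GeoExtractor'
        # Load the callback module once per processor instance instead of once per work item.
        spec = importlib.util.spec_from_file_location("callback", "Geo2Custom.py")
        self.GE = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(self.GE)

    def onCookTask(self, work_item):
        path = f"{work_item.attribValue('directory')}/{work_item.attribValue('filename')}{work_item.attribValue('extension')}"
        self.GE.convert(path, work_item.index, f'FRAME {work_item.index}', self.extractor_module)
        return pdg.result.Success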
I tried to do a stripped down version of GeoExtractor and since I'm not sure where the leak is, I just ended up pulling out comments with proprietary or superfluous information and changing some custom library names, but the file ended up kinda long so I am including a pastebin: https://pastebin.com/4HHS8D2W
As for CustomGeometry and CustomWriter, there is no working form of either of those libraries that will be NDA safe, so unfortunately they have to stay blackboxed. The CustomGeometry is a handful of container classes which organize all of the data coming out of the geometry, and the writer is a formatter/writer for the binary format we are utilizing. I am hoping the issue wouldn't be in either of them.
Edit 1: I fixed an issue in the example code.
Edit 2: Added larger examples.

Run pytest for each file in directory

I'm trying to build a routine that calls a pytest class for each PDF document in the current directory... Let me explain.
Let's say I have this test file:
import pytest

class TestHeader:
    # asserts...

class TestBody:
    # asserts...
This script needs to test each pdf document in my cwd
Here is my best attempt:
import glob
import pytest

class TestHeader:
    # asserts...

class TestBody:
    # asserts...

filelist = glob.glob('*.pdf')
for file in filelist:
    # magically call pytest for each file
How would I approach this?
EDIT: Complementing my question.
I have a huge function that extracts each document's data (let's call it extract_pdf); this function returns a tuple (header, body).
Current attempt looks like this:
import glob
import pytest

class TestHeader:
    # asserts...

class TestBody:
    # asserts...

filelist = glob.glob('*.pdf')
for file in filelist:
    header, body = extract_pdf(file)
    pytest.main(<pass header and body as args for pytest>)
I need to parse each document prior to testing. Can it be done this way?
The best way to do this is to parametrize the test cases dynamically.
This can be achieved using the pytest_generate_tests hook:
import glob

def pytest_generate_tests(metafunc):
    filelist = glob.glob('*.pdf')
    metafunc.parametrize("fileName", filelist)
NOTE: fileName should be one of the arguments of your test function.
This will result in the test case being executed for each of the files in the directory, and the generated test cases will be named like
TestFunc[File1]
TestFunc[File2]
TestFunc[File3]
.
.
and so on..
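For completeness, a test written against that hook could look roughly like this (extract_pdf is the parsing helper from the question):

class TestHeader:
    def test_header_fields(self, fileName):
        # pytest injects one fileName per collected PDF
        header, _body = extract_pdf(fileName)
        assert header is not None  # replace with the real assertions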
This is expanding on the existing answer by @ArunKalirajaBaskaran.
The problem is that you have different test classes that want to use the same data, but you want to parse the data only once. If it is ok for you to read all data at once, you could read them into global variables and use these for parametrizing your tests:
import glob

def extract_data():
    filenames = []
    headers = []
    bodies = []
    for filename in glob.glob('*.pdf'):
        header, body = extract_pdf(filename)
        filenames.append(filename)
        headers.append(header)
        bodies.append(body)
    return filenames, headers, bodies

filenames, headers, bodies = extract_data()

def pytest_generate_tests(metafunc):
    if "header" in metafunc.fixturenames:
        # use the filename as ID for better test names
        metafunc.parametrize("header", headers, ids=filenames)
    elif "body" in metafunc.fixturenames:
        metafunc.parametrize("body", bodies, ids=filenames)

class TestHeader:
    def test_1(self, header):
        ...

    def test_2(self, header):
        ...

class TestBody:
    def test_1(self, body):
        ...
This is the same as using

class TestHeader:
    @pytest.mark.parametrize("header", headers, ids=filenames)
    def test_1(self, header):
        ...

    @pytest.mark.parametrize("header", headers, ids=filenames)
    def test_2(self, header):
        ...
pytest_generate_tests just adds a bit of convenience so you don't have to repeat the parametrize decorator for each test.
The downside of this is of course that you will read in all of the data at once, which may cause a problem with memory usage if there are a lot of files. Your approach with pytest.main will not work, because that is the same as calling pytest on the command line with the given parameters. Parametrization can be done at the fixture level or on the test level (like here), but both need the parameters already evaluated at load time, so I don't see a possibility to do this lazily (apart from putting it all into one test). Maybe someone else has a better idea...
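One possible workaround along those lines is to parametrize only on the filenames, which are known at collection time, and defer the parsing to a fixture; a rough sketch, assuming extract_pdf from the question:

import glob
import pytest

# With scope="module", each PDF is parsed once per module rather than once per test.
@pytest.fixture(scope="module", params=sorted(glob.glob('*.pdf')), ids=str)
def parsed_pdf(request):
    return extract_pdf(request.param)  # returns (header, body)

class TestDocument:
    def test_header(self, parsed_pdf):
        header, _body = parsed_pdf
        assert header  # replace with real header assertions

    def test_body(self, parsed_pdf):
        _header, body = parsed_pdf
        assert body  # replace with real body assertions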

Loading and dumping multiple yaml files with ruamel.yaml (python)

Using python 2 (atm) and ruamel.yaml 0.13.14 (RedHat EPEL)
I'm currently writing some code to load yaml definitions, but they are split up in multiple files. The user-editable part contains eg.
users:
  xxxx1:
    timestamp: '2018-10-22 11:38:28.541810'
    << : *userdefaults
  xxxx2:
    << : *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
the defaults are stored in another file, which is not editable:
userdefaults: &userdefaults
  # Default values for user settings
  fileCountQuota: 1000
  diskSizeQuota: "300g"
I can process these together by loading both and concatenating the strings, and then running them through merged_data = list(yaml.load_all("{}\n{}".format(defaults_data, user_data), Loader=yaml.RoundTripLoader)), which correctly resolves everything. (When not using RoundTripLoader I get errors that the references cannot be resolved, which is normal.)
Now, I want to do some updates via Python code (e.g. update the timestamp), and for that I need to write back just the user part. And that's where things get hairy. So far I haven't found a way to write back just that YAML document, not both.
First of all, unless there are multiple documents in your defaults file, you don't have to use load_all, as you don't concatenate two documents into a multiple-document stream. If you had, by using a format string with a document-end marker ("{}\n...\n{}") or with a directives-end marker ("{}\n---\n{}"), your aliases would not carry over from one document to another, as per the YAML specification:
It is an error for an alias node to use an anchor that does not
previously occur in the document.
The anchor has to be in the document, not just in the stream (which can consist of multiple
documents).
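To see the difference concretely, a quick check with the same old API as in the question should behave roughly like this (a sketch; file names as in the question):

from ruamel import yaml

with open('defaults.yaml') as fp:
    defaults_data = fp.read()
with open('user.yaml') as fp:
    user_data = fp.read()

# One document: the alias in the user part can see the anchor from the defaults part.
merged = yaml.load("{}\n{}".format(defaults_data, user_data),
                   Loader=yaml.RoundTripLoader)

# Two documents: the anchor does not carry over, so resolving the alias fails.
try:
    list(yaml.load_all("{}\n---\n{}".format(defaults_data, user_data),
                       Loader=yaml.RoundTripLoader))
except yaml.YAMLError as exc:
    print("expected failure:", exc)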
I tried some hocus pocus, pre-populating the already represented dictionary
of anchored nodes:
import sys
import datetime
from ruamel import yaml

def load():
    with open('defaults.yaml') as fp:
        defaults_data = fp.read()
    with open('user.yaml') as fp:
        user_data = fp.read()
    merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                            Loader=yaml.RoundTripLoader)
    return merged_data

class MyRTDGen(object):
    class MyRTD(yaml.RoundTripDumper):
        def __init__(self, *args, **kw):
            pps = kw.pop('pre_populate', None)
            yaml.RoundTripDumper.__init__(self, *args, **kw)
            if pps is not None:
                for pp in pps:
                    try:
                        anchor = pp.yaml_anchor()
                    except AttributeError:
                        anchor = None
                    node = yaml.nodes.MappingNode(
                        u'tag:yaml.org,2002:map', [], flow_style=None, anchor=anchor)
                    self.represented_objects[id(pp)] = node

    def __init__(self, pre_populate=None):
        assert isinstance(pre_populate, list)
        self._pre_populate = pre_populate

    def __call__(self, *args, **kw):
        kw1 = kw.copy()
        kw1['pre_populate'] = self._pre_populate
        myrtd = self.MyRTD(*args, **kw1)
        return myrtd

def update(md, file_name):
    ud = md.pop('userdefaults')
    MyRTD = MyRTDGen([ud])
    yaml.dump(md, sys.stdout, Dumper=MyRTD)
    with open(file_name, 'w') as fp:
        yaml.dump(md, fp, Dumper=MyRTD)

md = load()
md['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
update(md, 'user.yaml')
Since the PyYAML based API requires a class instead of an object, you need to use a class generator that adds the data elements to pre-populate on the fly from within yaml.dump().
But this doesn't work, as a node only gets written out with an anchor once it is
determined that the anchor is used (i.e. there is a second reference). So actually the
first merge key gets written out as an anchor. And although I am quite familiar
with the code base, I could not get this to work properly in a reasonable amount of time.
So instead, I would just rely on the fact that there is only one key that matches
the first key of users.yaml at the root level of the dump of the combined updated
file and strip anything before that.
import sys
import datetime
from ruamel import yaml

with open('defaults.yaml') as fp:
    defaults_data = fp.read()
with open('user.yaml') as fp:
    user_data = fp.read()
merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                        Loader=yaml.RoundTripLoader)

# find the key
for line in user_data.splitlines():
    line = line.split('# ')[0].rstrip()  # end of line comment, not checking for strings
    if line and line[-1] == ':' and line[0] != ' ':
        split_key = line
        break

merged_data['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
buf = yaml.compat.StringIO()
yaml.dump(merged_data, buf, Dumper=yaml.RoundTripDumper)
document = split_key + buf.getvalue().split('\n' + split_key)[1]
sys.stdout.write(document)
which gives:
users:
  xxxx1:
    <<: *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
  xxxx2:
    <<: *userdefaults
    timestamp: '2018-10-23 09:59:13.829978'
I had to make a virtualenv to make sure I could run the above with ruamel.yaml==0.13.14. That version is from the time I was still young (I won't claim to have been innocent); there have been over 85 releases of the library since then.
I can understand that you might not be able to run anything but Python 2 at the moment and cannot compile/use a newer version. But what you really should do is install virtualenv (which can be done using EPEL, but also without further "polluting" your system installation), make a virtualenv for the code you are developing, and install the latest version of ruamel.yaml (and your other libraries) in there. You can also do that if you need to distribute your software to other systems; just install virtualenv there as well. I have all my utilities under /opt/util, and manage them with virtualenvutils, a wrapper around virtualenv.
For writing the user part, you will have to manually split the multi-document output of yaml.dump() and write the appropriate part back to the users YAML file.
import datetime
import StringIO
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='rt')
data = None
with open('defaults.yaml', 'r') as defaults:
    with open('users.yaml', 'r') as users:
        raw = "{}\n{}".format(''.join(defaults.readlines()), ''.join(users.readlines()))
        data = list(yaml.load_all(raw))

data[0]['users']['xxxx1']['timestamp'] = datetime.datetime.now().isoformat()

with open('users.yaml', 'w') as outfile:
    sio = StringIO.StringIO()
    yaml.dump(data[0], sio)
    out = sio.getvalue()
    outfile.write(out.split('\n\n')[1])  # write the second part here as this is the contents of users.yaml

luigi dependencies change at runtime

I have a luigi preprocessing task that splits my raw data into smaller files. These files will then be processed by the actual pipeline.
So regarding the parameters, I would like each pipeline run to take one preprocessed file id as a parameter. However, this file id is only generated in the preprocessing step and is thus only known at runtime. To illustrate my idea I provide this not-working code:
import luigi
import subprocess
import random

class GenPipelineFiles(luigi.Task):
    input_file = luigi.Parameter()

    def requires(self):
        pass

    def output(self):
        for i in range(random.randint(0, 10)):
            yield luigi.LocalTarget("output/{}_{}.txt".format(self.input_file, i))

    def run(self):
        for iout in self.output:
            command = "touch {}".format(iout.fname)
            subprocess.call(command, shell=True)

class RunPipelineOnSmallChunk(luigi.Task):
    pass

class Experiment(luigi.WrapperTask):
    input_file = luigi.Parameter(default="ex")

    def requires(self):
        file_ids = GenPipelineFiles(input_file=self.input_file)
        for file_id in file_ids:
            yield RunPipelineOnSmallChunk(directory=self.input_file, file_id=file_id)

luigi.run()
The wrapper task Experiment should
first, somehow require the splitting of the raw data into documents
secondly, require the actual pipeline with the obtained file id of the preprocessing.
The random number of output files in GenPipelineFiles indicates that this cannot be hard-coded into Experiment's requires.
A question that is probably related to this one is that a Luigi task seemingly has only one input target and one output target. Perhaps a note on how to model multiple outputs in GenPipelineFiles could also solve the problem.
One simple approach to dealing with multiple outputs is to create a directory named after the input file and put the output files from the split into that directory. That way the dependent task can just check for the existence of the directory. Let's say I have an input file 123.txt; I then make a directory 123_split with files 1.txt, 2.txt, 3.txt as the output of GenPipelineFiles, and then a directory 123_processed with 1.txt, 2.txt, 3.txt as the output of RunPipelineOnSmallChunk.
For your requires method in Experiment, you have to return the tasks you want to run, in a list for example. The way you have written file_ids = GenPipelineFiles(input_file=self.input_file) makes me think the run method of that object is not being called, because it is not being returned by the method.
Here's some sample code that works with targets on a per-file basis (but not a task per file). I still think it is safer to have a single output target of a directory, or a sentinel file of some kind to indicate you are done (a sketch of that variant follows the sample code below). Atomicity is lost unless the task ensures each target is created.
PYTHONPATH=. luigi --module sampletask RunPipelineOnSmallChunk --local-scheduler
sampletask.py
import luigi
import os
import subprocess
import random

class GenPipelineFiles(luigi.Task):
    inputfile = luigi.Parameter()
    num_targets = random.randint(0, 10)

    def requires(self):
        pass

    def get_prefix(self):
        return self.inputfile.split(".")[0]

    def get_dir(self):
        return "split_{}".format(self.get_prefix())

    def output(self):
        targets = []
        for i in range(self.num_targets):
            targets.append(luigi.LocalTarget("{}/{}_{}.txt".format(self.get_dir(), self.get_prefix(), i)))
        return targets

    def run(self):
        if not os.path.exists(self.get_dir()):
            os.makedirs(self.get_dir())
        for iout in self.output():
            command = "touch {}".format(iout.path)
            subprocess.call(command, shell=True)

class RunPipelineOnSmallChunk(luigi.Task):
    inputfile = luigi.Parameter(default="test")

    def get_prefix(self):
        return self.inputfile.split(".")[0]

    def get_dir(self):
        return "processed_{}".format(self.get_prefix())

    @staticmethod
    def clean_input_path(path):
        return path.replace("split", "processed")

    def requires(self):
        return GenPipelineFiles(self.inputfile)

    def output(self):
        targets = []
        for target in self.input():
            targets.append(luigi.LocalTarget(RunPipelineOnSmallChunk.clean_input_path(target.path)))
        return targets

    def run(self):
        if not os.path.exists(self.get_dir()):
            os.makedirs(self.get_dir())
        for iout in self.output():
            command = "touch {}".format(iout.path)
            subprocess.call(command, shell=True)
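A minimal sketch of the sentinel-file idea mentioned above (hypothetical class name; it trades per-file targets for a single marker written after all chunk files exist):

import os
import luigi

class GenPipelineFilesSentinel(luigi.Task):
    """Hypothetical variant: one sentinel target instead of a target per chunk."""
    inputfile = luigi.Parameter()

    def get_dir(self):
        return "split_{}".format(self.inputfile.split(".")[0])

    def output(self):
        # A single marker file signals that the whole split finished.
        return luigi.LocalTarget(os.path.join(self.get_dir(), "_SUCCESS"))

    def run(self):
        if not os.path.exists(self.get_dir()):
            os.makedirs(self.get_dir())
        # ... write the chunk files into self.get_dir() here ...
        with self.output().open("w") as marker:
            marker.write("done\n")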

python unittest howto

I'd like to know how I could unit-test the following module.
def download_distribution(url, tempdir):
    """ Method which downloads the distribution from PyPI """
    print "Attempting to download from %s" % (url,)
    try:
        url_handler = urllib2.urlopen(url)
        distribution_contents = url_handler.read()
        url_handler.close()
        filename = get_file_name(url)
        file_handler = open(os.path.join(tempdir, filename), "w")
        file_handler.write(distribution_contents)
        file_handler.close()
        return True
    except (ValueError, IOError):
        return False
Unit test proponents will tell you that unit tests should be self-contained, that is, they should not access the network or the filesystem (especially not in writing mode). Network and filesystem tests are beyond the scope of unit tests (though you might cover them with integration tests).
Generally speaking, for such a case I'd extract the urllib and file-writing code into separate functions (which would not be unit-tested) and inject mock functions during unit testing.
I.e. (slightly abbreviated for better reading):
def get_web_content(url):
    # Extracted code
    url_handler = urllib2.urlopen(url)
    content = url_handler.read()
    url_handler.close()
    return content

def write_to_file(content, filename, tempdir):
    # Extracted code
    file_handler = open(os.path.join(tempdir, filename), "w")
    file_handler.write(content)
    file_handler.close()

def download_distribution(url, tempdir):
    # Original code, after extractions
    distribution_contents = get_web_content(url)
    filename = get_file_name(url)
    write_to_file(distribution_contents, filename, tempdir)
    return True
And, on the test file:
import module_I_want_to_test
def mock_web_content(url):
return """Some fake content, useful for testing"""
def mock_write_to_file(content, filename, tmpdir):
# In this case, do nothing, as we don't do filesystem meddling while unit testing
pass
module_I_want_to_test.get_web_content = mock_web_content
module_I_want_to_test.write_to_file = mock_write_to_file
class SomeTests(unittest.Testcase):
# And so on...
And I second Daniel's suggestion: you should read some more in-depth material on unit testing.
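To round that off, the body of SomeTests could be filled in roughly like this (the URL is a placeholder):

class SomeTests(unittest.TestCase):
    def test_download_returns_true_with_mocked_io(self):
        # get_web_content and write_to_file are replaced by the mocks above,
        # so no network or filesystem access happens here.
        result = module_I_want_to_test.download_distribution(
            "http://example.com/fake-1.0.tar.gz", "/tmp")
        self.assertTrue(result)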
Vague question. If you're just looking for a primer for unit testing in general with a Python slant, I recommend Mark Pilgrim's "Dive Into Python" which has a chapter on unit testing with Python. Otherwise you need to clear up what specific issues you are having testing that code.
To mock urlopen you can pre-fetch some example responses that you can then use in your unit tests. Here's an example to get you started:
def urlopen(url):
    urlclean = url.split('?')[0]  # ignore GET parameters
    files = {
        'http://example.com/foo.xml': 'foo.xml',
        'http://example.com/bar.xml': 'bar.xml',
    }
    return file(files[urlclean])

yourmodule.urllib2.urlopen = urlopen
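One caveat with assigning the stub at module level is that it stays in place for every test; wrapping the swap in setUp/tearDown keeps it scoped per test (a sketch, assuming the stub above lives in the same test module and that yourmodule imports urllib2 as in the question):

import unittest
import yourmodule

class UrlopenStubTests(unittest.TestCase):
    def setUp(self):
        # Swap in the stub for the duration of each test...
        self._real_urlopen = yourmodule.urllib2.urlopen
        yourmodule.urllib2.urlopen = urlopen  # the stub defined above

    def tearDown(self):
        # ...and restore the real function afterwards.
        yourmodule.urllib2.urlopen = self._real_urlopen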
