I have a Luigi preprocessing task that splits my raw data into smaller files. These files will then be processed by the actual pipeline.
Regarding the parameters, I would like to require each pipeline run with one preprocessed file id as a parameter. However, this file id is only generated in the preprocessing step and is thus only known at runtime. To illustrate my idea, I provide this (not working) code:
import luigi
import subprocess
import random

class GenPipelineFiles(luigi.Task):
    input_file = luigi.Parameter()

    def requires(self):
        pass

    def output(self):
        for i in range(random.randint(0, 10)):
            yield luigi.LocalTarget("output/{}_{}.txt".format(self.input_file, i))

    def run(self):
        for iout in self.output():
            command = "touch {}".format(iout.path)
            subprocess.call(command, shell=True)

class RunPipelineOnSmallChunk(luigi.Task):
    pass

class Experiment(luigi.WrapperTask):
    input_file = luigi.Parameter(default="ex")

    def requires(self):
        file_ids = GenPipelineFiles(input_file=self.input_file)
        for file_id in file_ids:
            yield RunPipelineOnSmallChunk(directory=self.input_file, file_id=file_id)

luigi.run()
The wrapper task Experiment should
first, somehow require the splitting of the raw data into documents,
secondly, require the actual pipeline with the file id obtained from the preprocessing.
The random number of output files in GenPipelineFiles indicates that this cannot be hard-coded into Experiment's requires.
A question that is probably related to this one is the fact that a Luigi task apparently only has one input target and one output target. A note on how to model multiple outputs in GenPipelineFiles could probably also solve the problem.
One simple approach to dealing with multiple outputs is to create a directory named after the input file and put the output files from the split into that directory. That way the dependent task can just check for the existence of the directory. Let's say I have an input file 123.txt; I then make a directory 123_split with files 1.txt, 2.txt, 3.txt as the output of GenPipelineFiles, and then a directory 123_processed with 1.txt, 2.txt, 3.txt as the output of RunPipelineOnSmallChunk.
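As a rough illustration of that directory-plus-sentinel idea (my own sketch, with made-up names, not the sample code shown further below), the split task could expose a single target, a sentinel file inside the split directory, so downstream tasks only need to check one thing:

import os
import luigi

class SplitFile(luigi.Task):
    """Hypothetical split task whose only output is a sentinel file."""
    input_file = luigi.Parameter()

    def split_dir(self):
        # e.g. "123.txt" -> "123_split"
        return "{}_split".format(os.path.splitext(self.input_file)[0])

    def output(self):
        # One target for the whole split, regardless of how many chunks exist.
        return luigi.LocalTarget(os.path.join(self.split_dir(), "_SUCCESS"))

    def run(self):
        os.makedirs(self.split_dir(), exist_ok=True)
        # ... write the chunk files 1.txt, 2.txt, ... into self.split_dir() ...
        with self.output().open("w") as sentinel:
            sentinel.write("done")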
For your requires method in Experiment, you have to return the tasks you want to run, in a list for example. The way you have written file_ids = GenPipelineFiles(input_file=self.input_file) makes me think the run method of that object is not being called, because it is not being returned by the method.
Here's some sample code that works with targets on a per-file basis (but not a task-per-file basis). I still think it is safer to have a single output target, such as a directory or a sentinel file of some kind, to indicate you are done. Atomicity is lost unless the task ensures each target is created.
PYTHONPATH=. luigi --module sampletask RunPipelineOnSmallChunk --local-scheduler
sampletask.py
import luigi
import os
import subprocess
import random

class GenPipelineFiles(luigi.Task):
    inputfile = luigi.Parameter()
    num_targets = random.randint(0, 10)

    def requires(self):
        pass

    def get_prefix(self):
        return self.inputfile.split(".")[0]

    def get_dir(self):
        return "split_{}".format(self.get_prefix())

    def output(self):
        targets = []
        for i in range(self.num_targets):
            targets.append(luigi.LocalTarget("{}/{}_{}.txt".format(self.get_dir(), self.get_prefix(), i)))
        return targets

    def run(self):
        if not os.path.exists(self.get_dir()):
            os.makedirs(self.get_dir())
        for iout in self.output():
            command = "touch {}".format(iout.path)
            subprocess.call(command, shell=True)

class RunPipelineOnSmallChunk(luigi.Task):
    inputfile = luigi.Parameter(default="test")

    def get_prefix(self):
        return self.inputfile.split(".")[0]

    def get_dir(self):
        return "processed_{}".format(self.get_prefix())

    @staticmethod
    def clean_input_path(path):
        return path.replace("split", "processed")

    def requires(self):
        return GenPipelineFiles(self.inputfile)

    def output(self):
        targets = []
        for target in self.input():
            targets.append(luigi.LocalTarget(RunPipelineOnSmallChunk.clean_input_path(target.path)))
        return targets

    def run(self):
        if not os.path.exists(self.get_dir()):
            os.makedirs(self.get_dir())
        for iout in self.output():
            command = "touch {}".format(iout.path)
            subprocess.call(command, shell=True)
I'm trying to build a routine that calls a pytest class for each PDF document in the current directory... Let me explain.
Let's say I have this test file:
import pytest

class TestHeader:
    # asserts...

class TestBody:
    # asserts...
This script needs to test each PDF document in my cwd.
Here is my best attempt:
import glob
import pytest

class TestHeader:
    # asserts...

class TestBody:
    # asserts...

filelist = glob.glob('*.pdf')
for file in filelist:
    # magically call pytest for each file
How would I approach this?
EDIT: Complementing my question.
I have a huge function that extracts each document's data; let's call it extract_pdf.
This function returns a tuple (header, body).
Current attempt looks like this:
import glob
import pytest

class TestHeader:
    # asserts...

class TestBody:
    # asserts...

filelist = glob.glob('*.pdf')
for file in filelist:
    header, body = extract_pdf(file)
    pytest.main(<pass header and body as args for pytest>)
I need to parse each document prior to testing. Can it be done this way?
The best way to do this is through dynamic parametrization of the test cases.
This can be achieved using the pytest_generate_tests hook.
import glob

def pytest_generate_tests(metafunc):
    filelist = glob.glob('*.pdf')
    metafunc.parametrize("fileName", filelist)
NOTE: fileName should be one of the arguments to your test function.
This will result in executing the test case once for each file in the directory, and the test cases will show up as
TestFunc[File1]
TestFunc[File2]
TestFunc[File3]
and so on.
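For completeness, here is a minimal sketch (my own addition) of what a test consuming that fileName argument might look like; extract_pdf is the question's own helper and the assertion is only a placeholder:

def test_header(fileName):
    # fileName is injected by the pytest_generate_tests hook above,
    # once per PDF file found in the directory.
    header, body = extract_pdf(fileName)  # extract_pdf: the question's parsing helper
    assert header is not None             # placeholder assertion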
This is expanding on the existing answer by @ArunKalirajaBaskaran.
The problem is that you have different test classes that want to use the same data, but you want to parse the data only once. If it is ok for you to read all data at once, you could read them into global variables and use these for parametrizing your tests:
import glob

def extract_data():
    filenames = []
    headers = []
    bodies = []
    for filename in glob.glob('*.pdf'):
        header, body = extract_pdf(filename)
        filenames.append(filename)
        headers.append(header)
        bodies.append(body)
    return filenames, headers, bodies

filenames, headers, bodies = extract_data()

def pytest_generate_tests(metafunc):
    if "header" in metafunc.fixturenames:
        # use the filename as ID for better test names
        metafunc.parametrize("header", headers, ids=filenames)
    elif "body" in metafunc.fixturenames:
        metafunc.parametrize("body", bodies, ids=filenames)

class TestHeader:
    def test_1(self, header):
        ...
    def test_2(self, header):
        ...

class TestBody:
    def test_1(self, body):
        ...
This is the same as using
class TestHeader:
    @pytest.mark.parametrize("header", headers, ids=filenames)
    def test_1(self, header):
        ...

    @pytest.mark.parametrize("header", headers, ids=filenames)
    def test_2(self, header):
        ...
pytest_generate_tests just adds a bit of convenience so you don't have to repeat the parametrize decorator for each test.
The downside of this is of course that you will read in all of the data at once, which may cause a problem with memory usage if there are a lot of files. Your approach with pytest.main will not work, because that is the same as calling pytest on the command line with the given parameters. Parametrization can be done at the fixture level or at the test level (like here), but both need the parameters already evaluated at load time, so I don't see a possibility to do this lazily (apart from putting it all into one test). Maybe someone else has a better idea...
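For reference, a minimal sketch of the fixture-level variant mentioned above (my own addition, reusing the headers and filenames lists from the snippet earlier); note that, just like the test-level version, the data is still read when the module is imported:

import pytest

# Fixture-level parametrization: each test requesting `header` runs once per file,
# but headers/filenames are still evaluated at load time.
@pytest.fixture(params=headers, ids=filenames)
def header(request):
    return request.param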
Suppose in "./data_writers/excel_data_writer.py", I have:
from generic_data_writer import GenericDataWriter

class ExcelDataWriter(GenericDataWriter):
    def __init__(self, config):
        super().__init__(config)
        self.sheet_name = config.get('sheetname')

    def write_data(self, pandas_dataframe):
        pandas_dataframe.to_excel(
            self.get_output_file_path_and_name(),  # implemented in GenericDataWriter
            sheet_name=self.sheet_name,
            index=self.index)
In "./data_writers/csv_data_writer.py", I have:
from generic_data_writer import GenericDataWriter

class CSVDataWriter(GenericDataWriter):
    def __init__(self, config):
        super().__init__(config)
        self.delimiter = config.get('delimiter')
        self.encoding = config.get('encoding')

    def write_data(self, pandas_dataframe):
        pandas_dataframe.to_csv(
            self.get_output_file_path_and_name(),  # implemented in GenericDataWriter
            sep=self.delimiter,
            encoding=self.encoding,
            index=self.index)
In "./datawriters/generic_data_writer.py", I have:
import os

class GenericDataWriter:
    def __init__(self, config):
        self.output_folder = config.get('output_folder')
        self.output_file_name = config.get('output_file')
        self.output_file_path_and_name = os.path.join(self.output_folder, self.output_file_name)
        self.index = config.get('include_index')  # whether to include the index column from the Pandas dataframe in the output file
Suppose I have a JSON config file that has a key-value pair like this:
{
    "__comment__": "Here, user can provide the path and python file name of the custom data writer module she wants to use.",
    "custom_data_writer_module": "./data_writers/excel_data_writer.py",
    "there_are_more_key_value_pairs_in_this_JSON_config_file": "for other input parameters"
}
In "main.py", I want to import the data writer module based on the custom_data_writer_module provided in the JSON config file above. So I wrote this:
import os
import importlib

def main():
    # Do other things to read and process data
    data_writer_class_file = config.get('custom_data_writer_module')
    data_writer_module = importlib.import_module(
        os.path.splitext(os.path.split(data_writer_class_file)[1])[0])
    dw = data_writer_module.what_should_this_be?  # <=== Here, what should I do to instantiate the right specific data writer (Excel or CSV) class instance?
    for df in dataframes_to_write_to_output_file:
        dw.write_data(df)

if __name__ == "__main__":
    main()
As I asked in the code above, I want to know if there is a way to retrieve and instantiate the class defined in a Python module, assuming that there is ONLY ONE class defined in the module. Or, if there is a better way to refactor my code (using some sort of pattern) without changing the structure of the JSON config file described above, I'd like to learn from the Python experts on StackOverflow. Thank you in advance for your suggestions!
You can do this easily with vars:
cls1, = [v for k, v in vars(data_writer_module).items()
         if isinstance(v, type)]
dw = cls1(config)
The comma enforces that exactly one class is found. If the module is allowed to do anything like from collections import deque (or even foo=str), you might need to filter based on v.__module__.
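As a rough sketch of that filtering idea (my own addition, assuming data_writer_module is the module object imported in main.py), you could keep only the classes actually defined in that module:

# Only keep classes whose __module__ matches the imported module itself,
# so names imported from elsewhere (e.g. from collections import deque) are ignored.
candidates = [v for v in vars(data_writer_module).values()
              if isinstance(v, type) and v.__module__ == data_writer_module.__name__]
cls1, = candidates  # still enforces that exactly one class remains
dw = cls1(config)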
I'm having trouble understanding how to make re-usable tasks in Luigi, and then use them in a concrete situation.
For example, I have two generic tasks that do something to a file and then output the result:
class GffFilter(luigi.Task):
    "Filters a GFF file to only one feature"
    feature = luigi.Parameter()
    out_file = luigi.Parameter()
    in_file = luigi.Parameter()
    ...

class BgZip(luigi.Task):
    "bgZips a file"
    out_file = luigi.Parameter()
    in_file = luigi.Parameter()
    ...
Now, I want a workflow that first filters, then bgzips a specific file using these tasks:
class FilterSomeFile(luigi.WrapperTask):
    def requires(self):
        return GffFilter(in_file='some.gff3', out_file='some.genes.gff3', feature='gene')

    def output(self):
        return self.input()

class BgZipSomeFile(luigi.Task):
    def run(self):
        filtered = FilterSomeFile()
        BgZip(filtered)
But this is awkward. In the first task I have no run method, and I'm just using dependencies to use the generic task. Is this correct? Should I be using inheritance here instead?
Then in the second task, I can't use dependencies, because I need the output from FilterSomeFile in order to use BgZip. But using dynamic dependencies seems wrong, because Luigi can't build a proper dependency graph.
How should I make a Luigi workflow out of my generic tasks?
But this is awkward. In the first task I have no run method, and I'm just using dependencies to use the generic task. Is this correct?
Yes, according to this page, the WrapperTask is a dummy task whose purpose is to define a workflow of tasks; it therefore doesn't perform any actions by itself. Instead, by defining several requirements, this task will be complete when every requirement listed in the requires method has been completed. The main difference between a WrapperTask and a regular Task is that you don't need to define an output method to signal that the task succeeded, as can be seen here.
Then in the second task, I can't use dependencies, because I need the output from FilterSomeFile in order to use BgZip. But using dynamic dependencies seems wrong, because luigi can't build a proper dependency graph.
Theoretically, you could make FilterSomeFile have the same output as GffFilter, make BgZipSomeFile require FilterSomeFile, and then use FilterSomeFile.output() in BgZipSomeFile.run to access the filtered file. However, this solution would be somewhat strange because:
The wrapper task only "runs" one other task, so the wrapped task could be used directly, without having to create a WrapperTask. A better usage of WrapperTask would involve merging BgZipSomeFile and FilterSomeFile into a single subclass of WrapperTask.
A Task is being instantiated in a run method. This results in a dynamic dependency, but that is not needed for this problem.
Finally, the input of GffFilter is hardcoded in the FilterSomeFile task, which makes the workflow less useful. This could be avoided by making the WrapperTask still receive parameters and pass them on to its requirements.
A better solution would be:
import luigi as lg

class A(lg.Task):
    inFile = lg.Parameter()
    outFile = lg.Parameter()

    def run(self):
        with open(self.inFile, "r") as oldFile:
            text = oldFile.read()
        text += "*" * 10 + "\n" + "This text was added by task A.\n" + "*" * 10 + "\n"
        print(text)
        with open(self.outFile, "w") as newFile:
            newFile.write(text)

    def output(self):
        return lg.LocalTarget(self.outFile)

class B(lg.Task):
    inFile = lg.Parameter()
    outFile = lg.Parameter()

    def run(self):
        with open(self.inFile, "r") as oldFile:
            text = oldFile.read()
        text += "*" * 10 + "\n" + "This text was added by task B.\n" + "*" * 10 + "\n"
        with open(self.outFile, "w") as newFile:
            newFile.write(text)

    def output(self):
        return lg.LocalTarget(self.outFile)

class CustomWorkflow(lg.WrapperTask):
    mainOutFile = lg.Parameter()
    mainInFile = lg.Parameter()
    tempFile = "/tmp/myTempFile.txt"

    def requires(self):
        return [A(inFile=self.mainInFile, outFile=self.tempFile),
                B(inFile=self.tempFile, outFile=self.mainOutFile)]
This code can be run from the command line with:
PYTHONPATH='.' luigi --module pythonModuleContainingTheTasks --local-scheduler CustomWorkflow --mainInFile ./text.txt --mainOutFile ./procText.txt
I am using Pytest to test an executable. This .exe file reads a configuration file on startup.
I have written a fixture to spawn this .exe file at the start of each test and close it down at the end of the test. However, I cannot work out how to tell the fixture which configuration file to use. I want the fixture to copy a specified config file to a directory before spawning the .exe file.
@pytest.fixture
def session(request):
    copy_config_file(specific_file)  # how do I specify the file to use?
    link = spawn_exe()
    def fin():
        close_down_exe()
    request.addfinalizer(fin)
    return link
# needs to use config file foo.xml
def test_1(session):
    session.talk_to_exe()

# needs to use config file bar.xml
def test_2(session):
    session.talk_to_exe()
How do I tell the fixture to use foo.xml for test_1 function and bar.xml for test_2 function?
Thanks
John
One solution is to use pytest.mark for that:
import pytest

@pytest.fixture
def session(request):
    m = request.node.get_closest_marker('session_config')
    if m is None:
        pytest.fail('please use "session_config" marker')
    specific_file = m.args[0]
    copy_config_file(specific_file)
    link = spawn_exe()
    yield link
    close_down_exe(link)

@pytest.mark.session_config("foo.xml")
def test_1(session):
    session.talk_to_exe()

@pytest.mark.session_config("bar.xml")
def test_2(session):
    session.talk_to_exe()
Another approach would be to just change your session fixture slightly to delegate the creation of the link to the test function:
import pytest

@pytest.fixture
def session_factory(request):
    links = []
    def make_link(specific_file):
        copy_config_file(specific_file)
        link = spawn_exe()
        links.append(link)
        return link
    yield make_link
    for link in links:
        close_down_exe(link)

def test_1(session_factory):
    session = session_factory('foo.xml')
    session.talk_to_exe()

def test_2(session_factory):
    session = session_factory('bar.xml')
    session.talk_to_exe()
I prefer the latter, as it's simpler to understand and allows for more improvements later, for example if you need to use @pytest.mark.parametrize in a test based on the config value. Also notice that the latter allows spawning more than one executable in the same test.
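To illustrate that last point, here is a hedged sketch (my own addition, not part of the original answer) of combining the factory fixture with parametrization over config files; the file names are simply the ones used in the question:

import pytest

@pytest.mark.parametrize("config_file", ["foo.xml", "bar.xml"])
def test_talks_with_each_config(session_factory, config_file):
    # One test invocation per config file; the factory spawns a fresh exe each time.
    session = session_factory(config_file)
    session.talk_to_exe()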
I often find myself writing a Python script which takes parameters:
python my_script.py input_file output_file other_parameter_a other_parameter_b optional_parameter_c
Now, I want the option to either run the script on a single file, as the above would do, or run it on every single file in a directory. I find myself writing a new script my_script_run_on_directory.py that looks up every file in a directory and then calls my_script.py. So, I would have:
python my_script_run_on_directory.py directory_input directory_output other_parameter_a other_parameter_b optional_parameter_c
I need to do this often and I keep writing a new directory script for each my_script. Is there a better way to do this? I thought of using decorators, but I'm not sure what the best way to do this is.
I suppose what I want is something like:
python general_run_on_directory_script.py my_script directory_input directory_output <and all other paremeters needed for my_script>
As for your question on what to use: in general, I'd say abstract the generic code away into a function that takes a specific function as an argument. Using a decorator is a rather clean way to do this, so in my opinion, yes, it is a good solution.
Simple case (always expecting the same argument for your function):
import os

# Define the decorator: it takes the function to execute as an argument
def dir_or_file_decorator(func):
    def newFunc(path):
        if os.path.isdir(path):
            filenames = os.listdir(path)
            for filename in filenames:
                filepath = os.path.join(path, filename)
                func(filepath)
        else:
            func(path)
    return newFunc

# Define the function we want to decorate
@dir_or_file_decorator
def print_file_name(filepath):
    print(filepath)

# Run some tests
print('Testing file')
print_file_name(r'c:\testdir\testfile1.txt')
print('Testing dir')
print_file_name(r'c:\testdir')

# The @decorator syntax is just syntactic sugar. The code below shows what actually happens
def print_file_name2(filepath):
    print(filepath)

decorated_func = dir_or_file_decorator(print_file_name2)
print('Testing file')
decorated_func(r'c:\testdir\testfile1.txt')
print('Testing dir')
decorated_func(r'c:\testdir')

# Output:
#  Testing file
#  c:\testdir\testfile1.txt
#  Testing dir
#  c:\testdir\testfile1.txt
#  c:\testdir\testfile2.txt
More complicated cases:
Extra arguments in your functions:
import os

def dir_or_file_decorator(func):
    def newFunc(path, *args, **kwargs):
        if os.path.isdir(path):
            filenames = os.listdir(path)
            for filename in filenames:
                filepath = os.path.join(path, filename)
                func(filepath, *args, **kwargs)
        else:
            func(path, *args, **kwargs)
    return newFunc

@dir_or_file_decorator
def print_file_name_and_args(path, extra):
    print(extra, path)

# We can use the parameter order in the function (our decorator assumes path is the first one)
print_file_name_and_args(r'c:\testdir', 'extra for test 1')
# Or we can just be safe and use named arguments (our decorator assumes the argument is named path)
print_file_name_and_args(extra='extra for test 1', path=r'c:\testdir')
# A combination of both is possible too (but I feel it's more complicated and hence more prone to error)
print_file_name_and_args(r'c:\testdir', extra='extra for test 1')

# Output (in all 3 cases):
#  extra for test 1 c:\testdir\testfile1.txt
#  extra for test 1 c:\testdir\testfile2.txt
Having to return values as well:
import os

def dir_or_file_decorator_with_results(concatenateResultFunc):
    def dir_or_file_decorator(func):
        def newFunc(path, *args, **kwargs):
            if os.path.isdir(path):
                results = []
                filenames = os.listdir(path)
                for filename in filenames:
                    filepath = os.path.join(path, filename)
                    results.append(func(filepath, *args, **kwargs))
                return concatenateResultFunc(results)
            else:
                return func(path, *args, **kwargs)
        return newFunc
    return dir_or_file_decorator

# Our function to concatenate the results in case of a directory
def concatenate_results(results):
    return ','.join(results)

# We pass the function used to concatenate the results in case of a directory when we apply the decorator.
# What happens is that we create a new dir_or_file_decorator that uses the specified concatenateResultFunc.
# That newly created decorator is then applied to our function.
@dir_or_file_decorator_with_results(concatenate_results)
def get_file_name_and_args(extra, path):
    return extra + ' -> ' + path

# Test again
print(get_file_name_and_args(r'c:\testdir', 'extra for test 1'))

# Output:
#  c:\testdir\testfile1.txt -> extra for test 1,c:\testdir\testfile2.txt -> extra for test 1
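Finally, as a rough sketch (my own addition, not part of the original answer), a decorated function can be wired to the command-line usage from the question. This reuses the *args version of dir_or_file_decorator defined above; the function and argument names are made up and the argument handling is deliberately minimal:

import sys

@dir_or_file_decorator
def process(path, output_dir):
    # Placeholder for the real per-file work that my_script.py would do.
    print('processing', path, '->', output_dir)

if __name__ == '__main__':
    # python my_script.py <file_or_directory> <output_dir>
    process(sys.argv[1], sys.argv[2])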