First time into the realm of Luigi (and Python!) and have some questions. Relevant code is:
from Database import Database
import luigi

class bbSanityCheck(luigi.Task):
    conn = luigi.Parameter()
    date = luigi.Parameter()

    def __init__(self, *args, **kwargs):
        super(bbSanityCheck, self).__init__(*args, **kwargs)
        self.has_run = False

    def run(self):
        print "Entering run of bb sanity check"
        # DB STUFF HERE THAT DOESN'T MATTER
        print "Are we in la-la land?"

    def complete(self):
        print "BB Sanity check being asked for completeness: " , self.has_run
        return self.has_run

class Pipeline(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        db = Database('cbs')
        self.conn = db.connect()
        print "I'm about to yield!"
        return bbSanityCheck(conn = self.conn, date = self.date)

    def run(self):
        print "Hello World"
        self.conn.query("""SELECT *
                           FROM log_blackbook""")
        result = self.conn.store_result()
        print result.fetch_row()

    def complete(self):
        return False

if __name__=='__main__':
    luigi.run()
The output is below (with the DB results removed):
DEBUG: Checking if Pipeline(date=2013-03-03) is complete
I'm about to yield!
INFO: Scheduled Pipeline(date=2013-03-03)
I'm about to yield!
DEBUG: Checking if bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03) is complete
BB Sanity check being asked for completeness: False
INFO: Scheduled bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 5150] Running bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
Entering run of bb sanity check
Are we in la-la land?
INFO: [pid 5150] Done bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: There are 1 pending tasks possibly being run by other workers
INFO: Worker was stopped. Shutting down Keep-Alive thread
So the questions:
1.) Why does "I'm about to yield" get printed twice?
2.) Why is "hello world" never printed?
3.) What is the "1 pending tasks possibly run by other workers"?
I prefer super-ultra clean output because it is way easier to maintain. I'm hoping I can get these warning equivalents ironed out.
I've also noted that requires() uses either "yield" or "return item, item2, item3". I've read about yield and understand it. What I don't get is which convention is considered superior here, or whether there are subtle differences that I, being new to the language, am not getting.
I think you're misunderstanding how luigi works in general.
(1) I'm not sure about that one. It looks to me like the same message being printed at both INFO and DEBUG level, although Luigi may also call requires() more than once while scheduling, which would print it twice.
(2) You're trying to run Pipeline, which depends on bbSanityCheck. bbSanityCheck.complete() never returns True because has_run is never set to True in bbSanityCheck. So the Pipeline task can never run and print "Hello World", because its dependency is never complete.
(3) That's probably because you have this pending task (it's actually Pipeline). Luigi understands that it is impossible for it to run and shuts down.
I would personally not use has_run as a way to check whether a task has run, but instead check for the existence of the result of the job. That is, if the job does something to the database, then complete() should check that the expected contents are there.
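The idea of deriving completeness from the job's result, rather than from an in-memory flag, can be sketched in plain Python. This is only a sketch and does not import Luigi: MarkerTask and the marker file path are hypothetical stand-ins. In a real Luigi task you would normally define output() and rely on the default complete(), which checks that the output target exists.

```python
import os
import tempfile


class MarkerTask(object):
    """Hypothetical stand-in for a task; it does not import Luigi."""

    def __init__(self, marker_path):
        self.marker_path = marker_path

    def run(self):
        # Do the real work here, then record a durable result.
        with open(self.marker_path, "w") as f:
            f.write("done\n")

    def complete(self):
        # Completeness is derived from the result, not from an in-memory
        # flag, so it holds across processes and repeated checks.
        return os.path.exists(self.marker_path)


marker = os.path.join(tempfile.mkdtemp(), "bb_sanity_check.done")
task = MarkerTask(marker)
before = task.complete()  # nothing written yet
task.run()
after = task.complete()   # the marker file now exists
```

A fresh MarkerTask pointing at the same path also reports itself complete, which an instance attribute such as has_run cannot do.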
I have the following setup
class TaskA(luigi.Task):
    def requires(self):
        yield TaskB()
        if not get_results_from_task_B_written_on_S3():
            print('Did not find any results and will exit')
            return
        else:
            print('Found results and will proceed')
            yield TaskC()
            results = get_results_from_task_C_written_on_S3()
            # do other stuff

class TaskB(luigi.Task):
    def run(self):
        # process and write results to s3
        pass

    def output(self):
        return URITarget('b_path')

class TaskC(luigi.Task):
    def run(self):
        # process and write results to s3
        pass

    def output(self):
        return URITarget('c_path')
The Luigi logs show the following:
Did not find any results and will exit
Found results and will proceed
To me it seems like the control flow enters both the if and the else branch. Since this is in principle impossible, I suspect that Luigi attempts to run the pipeline twice. The first time it produces this
Did not find any results and will exit
since it cannot find any results written to S3 by TaskB.
Then TaskB actually finishes its execution and writes its results to S3. TaskA reruns, finds the results from TaskB on S3 and produces
Found results and will proceed
But then it seems like the yield of TaskC is not working. It's just stuck there indefinitely.
This is just my assumption about Luigi's behavior. Please let me know if I'm wrong about this.
I need this modularisation of tasks B and C into separate tasks since it makes testing much easier. TaskC is a fairly complex task whose test setup would be much more involved than testing its constituents separately.
Part of the problem is that requires() can get called multiple times during scheduling. Therefore, the first time your TaskA.requires() gets called, it yields TaskB. But the next time TaskA.requires() is called, you are yielding TaskB again and you hit the else block. That first call to TaskA.requires() is the only one that gets used for the actual scheduling dependencies.
I wrote a test program just to test this out and you can see in my output how many times TaskB.output() is called.
import luigi

taskC_complete = False
taskB_complete = False

def get_results_from_task_C_written_on_S3():
    return taskC_complete

def get_results_from_task_B_written_on_S3():
    return taskB_complete

def set_taskB_complete():
    # global is needed, otherwise this only assigns a local variable
    global taskB_complete
    taskB_complete = True

def set_taskC_complete():
    global taskC_complete
    taskC_complete = True

class TaskA(luigi.Task):
    def requires(self):
        yield TaskB()
        if not get_results_from_task_B_written_on_S3():
            print('Did not find any results and will exit')
            return
        else:
            print('Found results and will proceed')
            yield TaskC()
            results = get_results_from_task_C_written_on_S3()

class TaskB(luigi.Task):
    def run(self):
        print("Task B")

    def output(self):
        # prints the path each time output() is called (and returns None)
        return print('b_path')

class TaskC(luigi.Task):
    def run(self):
        print("Task C")

    def output(self):
        return print('c_path')

if __name__ == '__main__':
    luigi_run_results = luigi.build([TaskA()], workers=1,
                                    local_scheduler=True, detailed_summary=True, log_level='INFO')
This code outputs
Did not find any results and will exit
b_path
Task B
Did not find any results and will exit
b_path
Did not find any results and will exit
b_path
Did not find any results and will exit
b_path
Although the code is not a perfect replica of what you are attempting, here's the output from the scheduler which shows what will actually run:
INFO: Informed scheduler that task TaskA__99914b932b has status PENDING
INFO: Informed scheduler that task TaskB__99914b932b has status PENDING
I'm not sure what exactly you're trying to achieve, but read up on their documentation on task dependencies. You're better off trying to yield other tasks in your run() function for TaskA.
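To see how both messages can show up in one log, it helps to remember that the body of a generator function starts again from the top every time the function is called. The following is a plain-Python sketch, not Luigi code: strings stand in for tasks and a dict stands in for the S3 check (both are assumptions for illustration), and it says nothing about exactly when Luigi makes each call.

```python
results_on_s3 = {"b": False}  # stand-in for "TaskB has written its result"


def requires():
    # Generator body: re-executed from the top on every call.
    yield "TaskB"
    if not results_on_s3["b"]:
        print("Did not find any results and will exit")
        return
    print("Found results and will proceed")
    yield "TaskC"


first_call = list(requires())   # before TaskB has produced anything
results_on_s3["b"] = True       # pretend TaskB has now written its result
second_call = list(requires())  # a later call sees the new state
```

The two calls return different dependency lists, so anything that only used the first list would never learn about TaskC.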
I am new to Luigi and exploring its possibilities. I ran into a problem with a task I defined (with requires, run and output methods). In run(), I'm executing the contents of a file.
However, if the file does not exist, the task does not fail. Is there something I'm missing?
import luigi
import logging
import time
import sys, os

logging.basicConfig(filename='Execution.log',level=logging.DEBUG)
date = time.strftime("%Y%m%d")

class CreateTable(luigi.Task):
    def run(self):
        os.system('./createsample.hql')
        # with self.output().open('w') as f:
        #     f.write('Completed')

    def output(self):
        return luigi.LocalTarget('/tmp/CreateTable_Success_%s.csv' % date)
Output :
INFO: [pid 15553] Worker Worker(salt=747259359, workers=1, host=host-.com, username=root, pid=15553) running CreateTable()
sh: ./createsample.hql: No such file or directory
INFO: [pid 15553] Worker Worker(salt=747259359, workers=1, host=host-.com, username=root, pid=15553) done CreateTable()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task CreateTable__99914b932b has status DONE
Technically your code works and the Python part of your job ran successfully. The problem is that you are doing a system call that fails because the file does not exist.
What you need to do here is to check the return code of the system call. Return code 0 means it ran successfully. Any other outcome will yield a non-zero return code:
rc = os.system('./createsample.hql')
if rc:
    raise Exception("something went wrong")
You might want to use the subprocess module for system calls to have more flexibility (and complexity): https://docs.python.org/2/library/subprocess.html
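As a minimal sketch of the subprocess route, subprocess.check_call raises an exception on failure instead of handing back a return code that has to be checked by hand. The helper name run_or_fail and the commands used here are assumptions for illustration; they are not part of the original task.

```python
import subprocess
import sys


def run_or_fail(cmd):
    """Run cmd (a list of arguments) and raise if it does not succeed."""
    # check_call raises CalledProcessError on a non-zero exit code, and
    # OSError if the executable cannot be started at all.
    subprocess.check_call(cmd)
    return 0


# A command that succeeds.
ok = run_or_fail([sys.executable, "-c", "pass"])

# A command that exits with a non-zero code.
try:
    run_or_fail([sys.executable, "-c", "import sys; sys.exit(3)"])
    failed_code = None
except subprocess.CalledProcessError as exc:
    failed_code = exc.returncode
```

Called from inside run() without a try/except, the exception propagates out of run(), which is what makes Luigi mark the task as failed rather than done.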
My initial files are in AWS S3. Could someone point me to how I need to set this up in a Luigi Task?
I reviewed the documentation and found luigi.S3, but it is not clear to me what to do with it. I then searched the web and only found links to mortar-luigi and implementations on top of Luigi.
UPDATE
After following the example provided by matagus (I also created the ~/.boto file as suggested):
# coding: utf-8
import luigi
from luigi.s3 import S3Target, S3Client

class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/19170205.txt')

class ProcessS3File(luigi.Task):
    def requieres(self):
        return MyS3File()

    def output(self):
        return luigi.LocalTarget('/tmp/resultado.txt')

    def run(self):
        result = None
        for input in self.input():
            print("Doing something ...")
            with input.open('r') as f:
                for line in f:
                    result = 'This is a line'
        if result:
            out_file = self.output().open('w')
            out_file.write(result)
When I execute it, nothing happens:
DEBUG: Checking if ProcessS3File() is complete
INFO: Informed scheduler that task ProcessS3File() has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) running ProcessS3File()
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) done ProcessS3File()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task ProcessS3File() has status DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) was stopped. Shutting down Keep-Alive thread
As you can see, the message Doing something... never prints. What is wrong?
The key here is to define an ExternalTask that has no inputs and whose outputs are the files you already have living in S3. The Luigi docs mention this in "Requiring another Task":
Note that requires() can not return a Target object. If you have a simple Target object that is created externally you can wrap it in a Task class
So, basically you end up with something like this:
import luigi
# in newer Luigi releases this module lives at luigi.contrib.s3
from luigi.s3 import S3Target
from somewhere import do_something_with

class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/path/to/file')

class ProcessS3File(luigi.Task):
    def requires(self):
        return MyS3File()

    def output(self):
        return S3Target('s3://my-bucket/path/to/output-file')

    def run(self):
        result = None
        # this returns a file stream that reads the file from your AWS S3 bucket
        with self.input().open('r') as f:
            result = do_something_with(f)
        # and then you write the result to the output target
        # (it would be better to serialize the result first, but this is a pretty simple example)
        with self.output().open('w') as out_file:
            out_file.write(result)
UPDATE:
Luigi uses boto to read files from and/or write them to AWS S3, so in order to make this code work, you'll need to provide your credentials in your boto config file ~/.boto (the boto documentation lists other possible config file locations):
[Credentials]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>
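One more thing worth checking in the updated question code: the method there is spelled requieres, not requires. Python does not complain about an extra, unused method, so the base class's default requires() (no dependencies) would still be the one in effect, which would explain why the loop over self.input() never runs. A plain-Python sketch of that effect follows; BaseTask is a hypothetical, simplified stand-in for luigi.Task, not the real class.

```python
class BaseTask(object):
    """Hypothetical stand-in for luigi.Task: no dependencies by default."""

    def requires(self):
        return []

    def input(self):
        # Simplified: hand back whatever requires() declared.
        return self.requires()


class Misspelled(BaseTask):
    def requieres(self):  # typo: the base class never calls this
        return ["MyS3File"]


class Correct(BaseTask):
    def requires(self):  # correctly overrides the default
        return ["MyS3File"]


misspelled_inputs = Misspelled().input()
correct_inputs = Correct().input()
```

With the misspelled name the task has nothing to iterate over, so it finishes immediately and is reported as done.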
Problem
I have a chain of pipelines:
class PipelineA(base_handler.PipelineBase):
    def run(self, *args):
        # do something
        pass

class PipelineB(base_handler.PipelineBase):
    def run(self, *args):
        # do something
        pass

class EntryPipeline(base_handler.PipelineBase):
    def run(self):
        if some_condition():
            self.abort("Condition failed. Pipeline aborted!")
        yield PipelineA()
        mr_output = yield mapreduce_pipeline.MapreducePipeline(
            # mapreduce configs here
            # ...
        )
        yield PipelineB(mr_output)

p = EntryPipeline()
p.start()
In EntryPipeline, I am testing some conditions before starting PipelineA, MapreducePipeline and PipelineB. If a condition fails, I want to abort EntryPipeline and all subsequent pipelines.
Questions
What is the graceful way to abort a pipeline? Is self.abort() the correct way to do it, or do I need sys.exit()?
What if I want to abort from inside PipelineA? For example, PipelineA kicks off successfully but should prevent the subsequent pipelines (MapreducePipeline and PipelineB) from starting.
Edit:
I ended up moving the condition statement outside of EntryPipeline, so start the whole thing only if the condition is true. Otherwise I think Nick's answer is correct.
Since the docs currently say "TODO: Talk about explicit abort and retry"
we'll have to read the source:
https://github.com/GoogleCloudPlatform/appengine-pipelines/blob/master/python/src/pipeline/pipeline.py#L703
def abort(self, abort_message=''):
    """Mark the entire pipeline up to the root as aborted.

    Note this should only be called from *outside* the context of a running
    pipeline. Synchronous and generator pipelines should raise the 'Abort'
    exception to cause this behavior during execution.

    Args:
      abort_message: Optional message explaining why the abort happened.

    Returns:
      True if the abort signal was sent successfully; False if the pipeline
      could not be aborted for any reason.
    """
So if you have a handle to some_pipeline that isn't self, you can call some_pipeline.abort(). But if you want to abort yourself, you need to raise Abort(), and that will bubble up to the top and kill the whole tree.
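The control flow of raising from inside a generator run() can be sketched without the pipeline library. Everything here is a stand-in for illustration: the Abort class below is a local placeholder for the library's Abort exception, entry_pipeline_run imitates EntryPipeline.run(), and strings replace the child pipelines. How the real framework reacts to the exception is not shown.

```python
class Abort(Exception):
    """Local placeholder for the pipeline library's Abort exception."""


started = []  # records which child pipelines were reached


def entry_pipeline_run(condition_failed):
    # Generator body standing in for EntryPipeline.run().
    if condition_failed:
        # Raising before the first yield means no child is ever yielded.
        raise Abort("Condition failed. Pipeline aborted!")
    started.append("PipelineA")
    yield "PipelineA"
    started.append("PipelineB")
    yield "PipelineB"


try:
    list(entry_pipeline_run(condition_failed=True))
    aborted = False
except Abort:
    aborted = True
```

Because the exception is raised before any yield, none of the children are handed to whoever is driving the generator.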
I'm trying to test a queued Redis job, but the metadata doesn't seem to be passed between the task and the originator. The job IDs appear to match, so I'm perplexed. Maybe some fresh eyes can help me work out the problem:
The task is as per the documentation:
from rq import get_current_job

def do_test(word):
    job = get_current_job()
    print job.get_id()
    job.meta['word'] = word
    job.save()
    print "saved: ", job.meta['word']
    return True
The rqworker log prints the job_id and word after it is saved
14:32:32 *** Listening on default...
14:33:07 default: labeller.do_test('supercalafragelistic') (a6e2e579-df26-411a-b017-8788d621149f)
a6e2e579-df26-411a-b017-8788d621149f
saved: supercalafragelistic
14:33:07 Job OK, result = True
14:33:07 Result is kept for 500 seconds.
The task is invoked from a unittest:
import time
import unittest
from rq import Queue, use_connection

class RedisQueueTestCase(unittest.TestCase):
    """
    Requires running "rqworker" on the localhost cmdline
    """
    def setUp(self):
        use_connection()
        self.q = Queue()

    def test_enqueue(self):
        job = self.q.enqueue(do_test, "supercalafragelistic")
        while True:
            print job.get_id(), job.get_status(), job.meta.get('word')
            if job.is_finished:
                print "Result: ", job.result, job.meta.get('word')
                break
            time.sleep(0.25)
And generates this log showing the same job_id and correct result, but the meta variable word is never populated.
Testing started at 2:33 PM ...
a6e2e579-df26-411a-b017-8788d621149f queued None
a6e2e579-df26-411a-b017-8788d621149f finished None
Result: True None
Process finished with exit code 0
I tried adding a long delay so the log has a chance to see the task in started, but not finished state (in case meta is cleared when it finishes), but it didn't make any difference.
Any idea what I've missed?
The local job doesn't automatically update itself after a save occurs at the remote end. One must do a refresh to update it. Before the refactoring this was not necessary as I was doing a fetch_job with the job_id on every request.
So the test routine needs to include a refresh() (or fetch_job) to reflect any changes:
    def test_enqueue(self):
        job = self.q.enqueue(do_test, "supercalafragelistic")
        while True:
            job.refresh() #<--- well, duh, freddy
            print job.get_id(), job.get_status(), job.meta.get('word')
            if job.is_finished:
                print "Result: ", job.result, job.meta.get('word')
                break
            time.sleep(0.25)
Which works a bit better:
Testing started at 5:14 PM ...
6ea0163f-b5d5-411a-906a-f765aa0b3cc6 queued None 0 []
6ea0163f-b5d5-411a-906a-f765aa0b3cc6 started supercalafragelistic
6ea0163f-b5d5-411a-906a-f765aa0b3cc6 finished supercalafragelistic
Result: True supercalafragelistic
The fact that get_status() was updating fooled me into overlooking this: get_status() is a method that goes and looks up the current status, whereas meta is just a reference to some possibly stale local data.
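The stale-snapshot behaviour can be imitated without Redis or rq. In this sketch FakeStore and LocalJob are hypothetical stand-ins (not the rq API): two handles share one store, and a handle only sees what the other saved after it refreshes.

```python
class FakeStore(dict):
    """Hypothetical stand-in for Redis: maps job id -> saved meta dict."""


class LocalJob(object):
    """Hypothetical stand-in for a job handle; not the rq API."""

    def __init__(self, job_id, store):
        self.id = job_id
        self._store = store
        self.meta = {}  # local snapshot, only as fresh as the last refresh

    def save(self):
        # Push the local snapshot to the shared store.
        self._store[self.id] = dict(self.meta)

    def refresh(self):
        # Replace the local snapshot with whatever the store holds now.
        self.meta = dict(self._store.get(self.id, {}))


store = FakeStore()
client_job = LocalJob("job-1", store)  # handle held by the test
worker_job = LocalJob("job-1", store)  # handle held by the worker

worker_job.meta["word"] = "supercalafragelistic"
worker_job.save()

stale = client_job.meta.get("word")  # snapshot taken before the save
client_job.refresh()
fresh = client_job.meta.get("word")  # snapshot taken after the save
```

The same job id on both sides does not imply shared state; only the store is shared, and each handle keeps its own copy of meta.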