Luigi: Task is never invoked - python

I have the following setup
class TaskA(luigi.Task):
    def requires(self):
        yield TaskB()
        if not get_results_from_task_B_written_on_S3():
            print('Did not find any results and will exit')
            return
        else:
            print('Found results and will proceed')
            yield TaskC()
            results = get_results_from_task_C_written_on_S3()
            # do other stuff
class TaskB(luigi.Task):
    def run(self):
        pass  # process and write results to s3
    def output(self):
        return URITarget('b_path')
class TaskC(luigi.Task):
    def run(self):
        pass  # process and write results to s3
    def output(self):
        return URITarget('c_path')
The Luigi logs show the following:
Did not find any results and will exit
Found results and will proceed
To me it seems like the control flow enters both the if and the else branch. Since this is impossible in principle, I suspect that Luigi attempts to run the pipeline twice. The first time, it prints
Did not find any results and will exit
since it cannot find any results from TaskB written on S3.
Then TaskB actually finishes its execution and writes its results to S3. TaskA reruns, finds the results from TaskB on S3, and prints
Found results and will proceed
But then it seems like the yield of TaskC is not working. It's just stuck there indefinitely.
This is just my assumption of Luigi's behavior. Please let me know if I'm wrong about this.
I need this modularisation of tasks B and C into separate tasks since it makes testing much easier. TaskC is a fairly complex task whose test setup would be much more involved than testing its constituents separately.

Part of the problem is that requires() can get called multiple times during scheduling. Therefore, the first time your TaskA.requires() gets called, it yields TaskB. But the next time TaskA.requires() is called, you are yielding TaskB again and you hit the else block. That first call to TaskA.requires() is the only one that gets used for the actual scheduling dependencies.
I wrote a test program just to test this out and you can see in my output how many times TaskB.output() is called.
import luigi
taskC_complete = False
taskB_complete = False
def get_results_from_task_C_written_on_S3():
    return taskC_complete
def get_results_from_task_B_written_on_S3():
    return taskB_complete
def set_taskB_complete():
    taskB_complete = True
def set_taskC_complete():
    taskC_complete = True
class TaskA(luigi.Task):
    def requires(self):
        yield TaskB()
        if not get_results_from_task_B_written_on_S3():
            print('Did not find any results and will exit')
            return
        else:
            print('Found results and will proceed')
            yield TaskC()
            results = get_results_from_task_C_written_on_S3()
class TaskB(luigi.Task):
    def run(self):
        print("Task B")
    def output(self):
        return print('b_path')
class TaskC(luigi.Task):
    def run(self):
        print("Task C")
    def output(self):
        return print('c_path')
if __name__ == '__main__':
    luigi_run_results = luigi.build([TaskA()], workers=1,
                                    local_scheduler=True, detailed_summary=True, log_level='INFO')
This code outputs
Did not find any results and will exit
b_path
Task B
Did not find any results and will exit
b_path
Did not find any results and will exit
b_path
Did not find any results and will exit
b_path
Although the code is not a perfect replica of what you are attempting, here's the output from the scheduler which shows what will actually run:
INFO: Informed scheduler that task TaskA__99914b932b has status PENDING
INFO: Informed scheduler that task TaskB__99914b932b has status PENDING
I'm not sure what exactly you're trying to achieve, but read up on the Luigi documentation on task dependencies. You're better off yielding the other tasks from TaskA's run() method, which Luigi treats as dynamic dependencies.
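The repeated evaluation is easier to see without Luigi at all. The sketch below is a plain-Python stand-in, not Luigi code: requires() is an ordinary generator function, the results_on_s3 flag is a hypothetical placeholder for "TaskB has written its output", and the two list(...) calls imitate requires() being evaluated once before and once after TaskB has run.

```python
# Plain-Python stand-in for TaskA.requires(); no Luigi involved.
results_on_s3 = False  # hypothetical flag: has "TaskB" written its output yet?
messages = []          # collects what the real code would print


def requires():
    yield "TaskB"
    if not results_on_s3:
        messages.append("Did not find any results and will exit")
        return
    messages.append("Found results and will proceed")
    yield "TaskC"


# First evaluation: "TaskB" has not produced anything yet.
first_call = list(requires())

# "TaskB" finishes, then requires() is evaluated again from the top.
results_on_s3 = True
second_call = list(requires())

print(first_call)
print(second_call)
print(messages)
```

Both messages end up in the log because the generator body runs once per evaluation, not because a single evaluation takes both branches.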

Related

Python multiprocessing.Process calls join by itself

I have this code:
class ExtendedProcess(multiprocessing.Process):
    def __init__(self):
        super(ExtendedProcess, self).__init__()
        self.stop_request = multiprocessing.Event()
    def join(self, timeout=None):
        logging.debug("stop request received")
        self.stop_request.set()
        super(ExtendedProcess, self).join(timeout)
    def run(self):
        logging.debug("process has started")
        while not self.stop_request.is_set():
            print "doing something"
        logging.debug("proc is stopping")
When I call start() on the process, it should run forever, since self.stop_request is never set. But after a few milliseconds join() is called by itself and breaks out of run(). What is going on? Why is join() being called by itself?
Moreover, when I start a debugger and go line by line, it suddenly works fine. What am I missing?
OK, thanks to ely's answer the reason hit me:
There is a race condition:
The new process is created.
While it is starting up and about to run logging.debug("process has started"), the main function reaches its end.
The main program exits, and on exit Python calls join() on every child process that is still running.
Since the process has not yet reached "while not self.stop_request.is_set()", the overridden join() runs "self.stop_request.set()" first. Now stop_request.is_set() is true, the loop body never runs, and the process stops.
As mentioned in the updated question, this is because of a race condition. Below I put an initial example highlighting a simplistic race condition where the race is against the overall program exit, but this could also be caused by other types of scope exits or other general race conditions involving your process.
I copied your class definition and added some "main" code to run it, here's my full listing:
import logging
import multiprocessing
import time
class ExtendedProcess(multiprocessing.Process):
    def __init__(self):
        super(ExtendedProcess, self).__init__()
        self.stop_request = multiprocessing.Event()
    def join(self, timeout=None):
        logging.debug("stop request received")
        self.stop_request.set()
        super(ExtendedProcess, self).join(timeout)
    def run(self):
        logging.debug("process has started")
        while not self.stop_request.is_set():
            print("doing something")
            time.sleep(1)
        logging.debug("proc is stopping")
if __name__ == "__main__":
    p = ExtendedProcess()
    p.start()
    while True:
        pass
The above code listing runs as expected for me using both Python 2.7.11 and 3.6.4. It loops infinitely and the process never terminates:
ely@eschaton:~/programming$ python extended_process.py
doing something
doing something
doing something
doing something
doing something
... and so on
However, if I instead use this code in my main section, it exits right away (as expected):
if __name__ == "__main__":
    p = ExtendedProcess()
    p.start()
This exits because the interpreter reaches the end of the program. On the way out, multiprocessing automatically joins any non-daemonic child process that is still running, which calls the overridden join() on p and sets stop_request.
Note this could also explain why it works for you in the debugger. That is an interactive session, so after you start p the debugger lets you wait around and inspect it; the program does not reach its exit while you are stepping through, so join() is not called automatically.
Just to verify the join behavior too, I also tried with this main block:
if __name__ == "__main__":
    log = logging.getLogger()
    log.setLevel(logging.DEBUG)
    p = ExtendedProcess()
    p.start()
    st_time = time.time()
    while time.time() - st_time < 5:
        pass
    p.join()
    print("Finished!")
and it works as expected:
ely@eschaton:~/programming$ python extended_process.py
DEBUG:root:process has started
doing something
doing something
doing something
doing something
doing something
DEBUG:root:stop request received
DEBUG:root:proc is stopping
Finished!

How to use "with" with a list of objects

Suppose I have a class that will spawn a thread and implements .__enter__ and .__exit__ so I can use it as such:
with SomeAsyncTask(params) as t:
    # do stuff with `t`
    t.thread.start()
    t.thread.join()
.__exit__ might perform certain actions for clean-up purposes (e.g. removing temp files).
That all works fine until I have a list of SomeAsyncTasks that I would like to start all at once.
list_of_async_task_params = [params1, params2, ...]
How should I use with on the list? I'm hoping for something like this:
with [SomeAsyncTask(params) for params in list_of_async_task_params] as tasks:
    # do stuff with `tasks`
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()
I think contextlib.ExitStack is exactly what you're looking for. It's a way of combining an indeterminate number of context managers into a single one safely (so that an exception while entering one context manager won't cause it to skip exiting the ones it's already entered successfully).
The example from the docs is pretty instructive:
with ExitStack() as stack:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    # All opened files will automatically be closed at the end of
    # the with statement, even if attempts to open files later
    # in the list raise an exception
This can be adapted to your "hoped for" code pretty easily:
import contextlib
with contextlib.ExitStack() as stack:
    tasks = [stack.enter_context(SomeAsyncTask(params))
             for params in list_of_async_task_params]
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()
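To see the safety property in action, here is a small sketch using a made-up Resource context manager (hypothetical, standing in for SomeAsyncTask) whose third instance fails on entry. The events list only exists to record the order of calls.

```python
import contextlib

events = []  # records enter/exit calls in the order they happen


class Resource:
    """Hypothetical context manager that can be told to fail on entry."""

    def __init__(self, name, fail_on_enter=False):
        self.name = name
        self.fail_on_enter = fail_on_enter

    def __enter__(self):
        if self.fail_on_enter:
            raise RuntimeError("cannot enter " + self.name)
        events.append("enter " + self.name)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        events.append("exit " + self.name)
        return False  # do not suppress exceptions


resources = [Resource("a"), Resource("b"), Resource("c", fail_on_enter=True)]

try:
    with contextlib.ExitStack() as stack:
        # Entering "c" raises, but "a" and "b" are already registered
        # on the stack and will still be exited.
        entered = [stack.enter_context(r) for r in resources]
except RuntimeError as exc:
    events.append("caught: " + str(exc))

print(events)
```

The managers that were entered successfully are exited in reverse order before the exception reaches the except clause, which is the behaviour a hand-rolled list wrapper has to reimplement.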
Note: Somehow I missed the fact that your Thread subclass was also a context manager itself, so the code below doesn't make that assumption. Nevertheless, it might be helpful when using more "generic" kinds of threads (where using something like contextlib.ExitStack wouldn't be an option).
Your question is a little light on details, so I made some up; however, this might be close to what you want. It defines an AsyncTaskListContextManager class that has the __enter__() and __exit__() methods required to support the context manager protocol (and the associated with statement).
import threading
from time import sleep
class SomeAsyncTask(threading.Thread):
    def __init__(self, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.name = name
        self.status_lock = threading.Lock()
        self.running = False
    def run(self):
        with self.status_lock:
            self.running = True
        while True:
            with self.status_lock:
                if not self.running:
                    break
                print('task {!r} running'.format(self.name))
            sleep(.1)
        print('task {!r} stopped'.format(self.name))
    def stop(self):
        with self.status_lock:
            self.running = False
class AsyncTaskListContextManager:
    def __init__(self, params_list):
        self.threads = [SomeAsyncTask(params) for params in params_list]
    def __enter__(self):
        for thread in self.threads:
            thread.start()
        return self
    def __exit__(self, *args):
        for thread in self.threads:
            if thread.is_alive():
                thread.stop()
                thread.join()  # wait for it to terminate
        return None  # allows exceptions to be processed normally
params = ['Fee', 'Fie', 'Foe']
with AsyncTaskListContextManager(params) as task_list:
    for _ in range(5):
        sleep(1)
    print('leaving task list context')
print('end-of-script')
Output:
task 'Fee' running
task 'Fie' running
task 'Foe' running
task 'Foe' running
task 'Fee' running
task 'Fie' running
... etc
task 'Fie' running
task 'Fee' running
task 'Foe' running
leaving task list context
task 'Foe' stopped
task 'Fie' stopped
task 'Fee' stopped
end-of-script
@martineau's answer should work. Here's a more generic method that should work for other cases. Note that exceptions are not handled in __exit__(): if one manager's __exit__() fails, the rest won't be called. A generic solution would probably raise an aggregate exception and allow you to handle it. Another corner case is when your second manager's __enter__() method raises an exception; the first manager's __exit__() will not be called in that case.
class list_context_manager:
    def __init__(self, managers):
        self.managers = managers
    def __enter__(self):
        for m in self.managers:
            m.__enter__()
        return self.managers
    def __exit__(self, exc_type, exc_value, traceback):
        # forward the exception details to every wrapped manager
        for m in self.managers:
            m.__exit__(exc_type, exc_value, traceback)
It can then be used like in your question:
with list_context_manager([SomeAsyncTask(params) for params in list_of_async_task_params]) as tasks:
    # do stuff with `tasks`
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()

Run functions parallel in python 2.7 to use output of an function at the end of other functions

I am new to Python and have never used its parallel processing modules such as threading or multiprocessing. I am working on a real-time problem where the output of one function is used as the input of another. There is one big function which takes almost 3 seconds to complete. It is like a process where a person submits some documents at reception and, while the documents are being verified, is directed elsewhere for other checks. If the result of the document verification is not available at the end of these checks, the program fails.
def parallel_running_function(*args):
    """It is the function which will take 3 seconds to complete"""
    output = "various documents matching and verification"
    return output
def check_1(*args):
    """ check one for the task"""
def check_2(*args):
    """ check two for the task"""
def check_3(*args):
    """ check three for the task"""
def check_4(*args):
    """ check 4 for the task"""
def main_function():
    output = parallel_running_function() # need to run this function
    # parallel with other functions
    output_1 = check_1()
    output_2 = check_2()
    output_3 = check_3()
    output_4 = check_4()
    if output:
        "program is successful"
    else:
        "program is failed"
I need the output of the parallel running function to be available here, along with the outputs of the other functions. If I don't get the output of that function here, the program will fail or give a wrong result.
I am using Python 2.7. I have read multiple posts about this problem that use Python's threading, subprocess and multiprocessing modules, but I couldn't find a concrete solution. From those posts it seems I need the multiprocessing module. Can someone please give me an idea of how to solve this problem?
You can do something like this:
import multiprocessing
pool = None
def parallel_running_function(*args):
    """It is the function which will take 3 seconds to complete"""
    output = "various documents matching and verification"
    return output
def check_1(*args):
    """ check one for the task"""
def check_2(*args):
    """ check two for the task"""
def check_3(*args):
    """ check three for the task"""
def check_4(*args):
    """ check 4 for the task"""
def main_function():
    res = pool.apply_async(parallel_running_function)
    res_1 = pool.apply_async(check_1)
    res_2 = pool.apply_async(check_2)
    res_3 = pool.apply_async(check_3)
    res_4 = pool.apply_async(check_4)
    output = res.get()
    output_1 = res_1.get()
    output_2 = res_2.get()
    output_3 = res_3.get()
    output_4 = res_4.get()
    if output:
        print "program is successful"
    else:
        print "program is failed"
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    main_function()
The main process will block when get is called, but the other processes will still be running.
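If only the slow function needs to run in the background, a lighter variant is to push just that one call to a worker and run the checks in the main thread. The sketch below swaps in multiprocessing.pool.ThreadPool (threads rather than processes, which is an assumption that the slow step mostly waits on I/O); the function bodies are stand-ins, and the 3-second delay is shortened to 0.2 s so the example finishes quickly.

```python
import time
from multiprocessing.pool import ThreadPool


def parallel_running_function():
    """Stand-in for the slow verification step (shortened to 0.2 s here)."""
    time.sleep(0.2)
    return "various documents matching and verification"


def check(number):
    """Stand-in for check_1 .. check_4."""
    return "check %d done" % number


def main_function():
    pool = ThreadPool(processes=1)
    try:
        # Start the slow step in a worker thread; this call returns at once.
        pending = pool.apply_async(parallel_running_function)
        # Run the checks in the main thread while the slow step is working.
        check_outputs = [check(n) for n in range(1, 5)]
        # Block only now, when the slow result is actually needed.
        output = pending.get(timeout=10)
    finally:
        pool.close()
        pool.join()
    return output, check_outputs


output, check_outputs = main_function()
print("program is successful" if output else "program is failed")
```

For CPU-bound verification work, threads will not run in parallel with the checks, so the process-based pool shown in the answer is the safer choice there.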

How to abort App Engine pipelines gracefully?

Problem
I have a chain of pipelines:
class PipelineA(base_handler.PipelineBase):
    def run(self, *args):
        # do something
class PipelineB(base_handler.PipelineBase):
    def run(self, *args):
        # do something
class EntryPipeline(base_handler.PipelineBase):
    def run(self):
        if some_condition():
            self.abort("Condition failed. Pipeline aborted!")
        yield PipelineA()
        mr_output = yield mapreduce_pipeline.MapreducePipeline(
            # mapreduce configs here
            # ...
        )
        yield PipelineB(mr_output)
p = EntryPipeline()
p.start()
In EntryPipeline, I am testing some conditions before starting PipelineA, MapreducePipeline and PipelineB. If the condition fails, I want to abort EntryPipeline and all subsequent pipelines.
Questions
What is a graceful pipeline abortion? Is self.abort() the correct way to do it or do I need sys.exit()?
What if I want to do the abortion inside PipelineA? e.g. PipelineA kicks off successfully, but prevent subsequent pipelines(MapreducePipeline and PipelineB) from starting.
Edit:
I ended up moving the condition statement outside of EntryPipeline, so start the whole thing only if the condition is true. Otherwise I think Nick's answer is correct.
Since the docs currently say "TODO: Talk about explicit abort and retry"
we'll have to read the source:
https://github.com/GoogleCloudPlatform/appengine-pipelines/blob/master/python/src/pipeline/pipeline.py#L703
def abort(self, abort_message=''):
    """Mark the entire pipeline up to the root as aborted.
    Note this should only be called from *outside* the context of a running
    pipeline. Synchronous and generator pipelines should raise the 'Abort'
    exception to cause this behavior during execution.
    Args:
      abort_message: Optional message explaining why the abort happened.
    Returns:
      True if the abort signal was sent successfully; False if the pipeline
      could not be aborted for any reason.
    """
So if you have a handle to some_pipeline that isn't self, you can call some_pipeline.abort(). But if you want to abort from inside a running pipeline, you need to raise Abort(), which will bubble up to the top and kill the whole tree.

Storing "meta" data on redis job is not working?

I'm trying to test a queued redis job but the meta data doesn't seem to be passing between the task and the originator. The job_ids appear to match, so I'm perplexed. Maybe some fresh eyes can help me work out the problem:
The task is as per the documentation:
from rq import get_current_job
def do_test(word):
    job = get_current_job()
    print job.get_id()
    job.meta['word'] = word
    job.save()
    print "saved: ", job.meta['word']
    return True
The rqworker log prints the job_id and word after it is saved
14:32:32 *** Listening on default...
14:33:07 default: labeller.do_test('supercalafragelistic') (a6e2e579-df26-411a-b017-8788d621149f)
a6e2e579-df26-411a-b017-8788d621149f
saved: supercalafragelistic
14:33:07 Job OK, result = True
14:33:07 Result is kept for 500 seconds.
The task is invoked from a unittest:
class RedisQueueTestCase(unittest.TestCase):
    """
    Requires running "rqworker" on the localhost cmdline
    """
    def setUp(self):
        use_connection()
        self.q = Queue()
    def test_enqueue(self):
        job = self.q.enqueue(do_test, "supercalafragelistic")
        while True:
            print job.get_id(), job.get_status(), job.meta.get('word')
            if job.is_finished:
                print "Result: ", job.result, job.meta.get('word')
                break
            time.sleep(0.25)
And generates this log showing the same job_id and correct result, but the meta variable word is never populated.
Testing started at 2:33 PM ...
a6e2e579-df26-411a-b017-8788d621149f queued None
a6e2e579-df26-411a-b017-8788d621149f finished None
Result: True None
Process finished with exit code 0
I tried adding a long delay so the log has a chance to see the task in started, but not finished state (in case meta is cleared when it finishes), but it didn't make any difference.
Any idea what I've missed?
The local job doesn't automatically update itself after a save occurs at the remote end. One must do a refresh to update it. Before the refactoring this was not necessary as I was doing a fetch_job with the job_id on every request.
So the test routine needs to include a refresh() (or fetch_job) to reflect any changes:
def test_enqueue(self):
    job = self.q.enqueue(do_test, "supercalafragelistic")
    while True:
        job.refresh()  # <--- well, duh, freddy
        print job.get_id(), job.get_status(), job.meta.get('word')
        if job.is_finished:
            print "Result: ", job.result, job.meta.get('word')
            break
        time.sleep(0.25)
Which works a bit better:
Testing started at 5:14 PM ...
6ea0163f-b5d5-411a-906a-f765aa0b3cc6 queued None 0 []
6ea0163f-b5d5-411a-906a-f765aa0b3cc6 started supercalafragelistic
6ea0163f-b5d5-411a-906a-f765aa0b3cc6 finished supercalafragelistic
Result: True supercalafragelistic
The fact that get_status() was updating fooled me into overlooking this: get_status() is a method that goes and looks up the current status, whereas meta is just a reference to some possibly stale local data.
