How to abort App Engine pipelines gracefully? - python

Problem
I have a chain of pipelines:
class PipelineA(base_handler.PipelineBase):
    def run(self, *args):
        pass  # do something
class PipelineB(base_handler.PipelineBase):
    def run(self, *args):
        pass  # do something
class EntryPipeline(base_handler.PipelineBase):
    def run(self):
        if some_condition():
            self.abort("Condition failed. Pipeline aborted!")
        yield PipelineA()
        mr_output = yield mapreduce_pipeline.MapreducePipeline(
            # mapreduce configs here
            # ...
        )
        yield PipelineB(mr_output)
p = EntryPipeline()
p.start()
In EntryPipeline, I test some conditions before starting PipelineA, MapreducePipeline and PipelineB. If the condition fails, I want to abort EntryPipeline and all subsequent pipelines.
Questions
What is a graceful way to abort a pipeline? Is self.abort() the correct way to do it, or do I need sys.exit()?
What if I want to abort from inside PipelineA? E.g., PipelineA kicks off successfully, but I want to prevent the subsequent pipelines (MapreducePipeline and PipelineB) from starting.
Edit:
I ended up moving the condition check outside of EntryPipeline, so the whole thing starts only if the condition is true. Otherwise, I think Nick's answer is correct.

Since the docs currently say "TODO: Talk about explicit abort and retry"
we'll have to read the source:
https://github.com/GoogleCloudPlatform/appengine-pipelines/blob/master/python/src/pipeline/pipeline.py#L703
def abort(self, abort_message=''):
    """Mark the entire pipeline up to the root as aborted.
    Note this should only be called from *outside* the context of a running
    pipeline. Synchronous and generator pipelines should raise the 'Abort'
    exception to cause this behavior during execution.
    Args:
      abort_message: Optional message explaining why the abort happened.
    Returns:
      True if the abort signal was sent successfully; False if the pipeline
      could not be aborted for any reason.
    """
So if you have a handle to some_pipeline that isn't self, you can call some_pipeline.abort(). But if you want to abort from within the running pipeline itself, you need to raise Abort(); that will bubble up to the top and kill the whole tree.
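To illustrate why raising works from inside run(): a generator's body executes only when something advances it, so an exception raised before the first yield means no child is ever handed over. The sketch below is plain Python, not the Pipeline API. `Abort`, `entry_run` and `drive` are hypothetical stand-ins for the library's exception, a generator run() method, and the framework's evaluation loop.

```python
class Abort(Exception):
    """Stand-in for the pipeline library's Abort exception (assumption)."""


def entry_run(condition_failed):
    """Mimics a generator run(): children are handed over via yield."""
    if condition_failed:
        # Raised before the first yield, so no child is ever produced.
        raise Abort("Condition failed. Pipeline aborted!")
    yield "PipelineA"
    yield "MapreducePipeline"
    yield "PipelineB"


def drive(generator):
    """Hypothetical driver: collect children until done or aborted."""
    started = []
    try:
        for child in generator:
            started.append(child)
    except Abort as exc:
        return started, str(exc)
    return started, None


print(drive(entry_run(True)))
print(drive(entry_run(False)))
```

How the real framework schedules the yielded children is not modeled here; the sketch only shows the control flow of raising inside a generator.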


Luigi: Task is never invoked

I have the following setup
class TaskA(luigi.Task):
    def requires(self):
        yield TaskB()
        if not get_results_from_task_B_written_on_S3():
            print('Did not find any results and will exit')
            return
        else:
            print('Found results and will proceed')
            yield TaskC()
            results = get_results_from_task_C_written_on_S3()
            # do other stuff
class TaskB(luigi.Task):
    def run(self):
        pass  # process and write results to s3
    def output(self):
        return URITarget('b_path')
class TaskC(luigi.Task):
    def run(self):
        pass  # process and write results to s3
    def output(self):
        return URITarget('c_path')
The Luigi logs show the following:
Did not find any results and will exit
Found results and will proceed
To me, it seems like the control flow enters both the if and the else branch. Since this is impossible in principle, I suspect that Luigi attempts to run the pipeline twice. The first time, it produces this:
Did not find any results and will exit
since it cannot find any results written to S3 by TaskB.
Then TaskB actually finishes its execution and writes its results to S3. TaskA reruns, finds the results from TaskB on S3, and produces
Found results and will proceed
But then it seems like the yield of TaskC is not working. It's just stuck there indefinitely.
This is just my assumption of Luigi's behavior. Please let me know if I'm wrong about this.
I need this modularisation of tasks B and C into separate tasks since it makes testing much easier. TaskC is a fairly complex task whose test setup would be much more involved than testing its constituents separately.
Part of the problem is that requires() can get called multiple times during scheduling. Therefore, the first time your TaskA.requires() gets called, it yields TaskB. But the next time TaskA.requires() is called, you are yielding TaskB again and you hit the else block. That first call to TaskA.requires() is the only one that gets used for the actual scheduling dependencies.
I wrote a test program just to test this out and you can see in my output how many times TaskB.output() is called.
import luigi
taskC_complete = False
taskB_complete = False
def get_results_from_task_C_written_on_S3():
    return taskC_complete
def get_results_from_task_B_written_on_S3():
    return taskB_complete
def set_taskB_complete():
    global taskB_complete  # without this, the assignment only creates a local
    taskB_complete = True
def set_taskC_complete():
    global taskC_complete
    taskC_complete = True
class TaskA(luigi.Task):
    def requires(self):
        yield TaskB()
        if not get_results_from_task_B_written_on_S3():
            print('Did not find any results and will exit')
            return
        else:
            print('Found results and will proceed')
            yield TaskC()
            results = get_results_from_task_C_written_on_S3()
class TaskB(luigi.Task):
    def run(self):
        print("Task B")
    def output(self):
        # Prints every time output() is called; returns None
        return print('b_path')
class TaskC(luigi.Task):
    def run(self):
        print("Task C")
    def output(self):
        return print('c_path')
if __name__ == '__main__':
    luigi_run_results = luigi.build([TaskA()], workers=1,
                                    local_scheduler=True, detailed_summary=True, log_level='INFO')
This code outputs
Did not find any results and will exit
b_path
Task B
Did not find any results and will exit
b_path
Did not find any results and will exit
b_path
Did not find any results and will exit
b_path
Although the code is not a perfect replica of what you are attempting, here's the output from the scheduler which shows what will actually run:
INFO: Informed scheduler that task TaskA__99914b932b has status PENDING
INFO: Informed scheduler that task TaskB__99914b932b has status PENDING
I'm not sure exactly what you're trying to achieve, but read the Luigi documentation on task dependencies. You're better off yielding the other tasks from TaskA's run() method (Luigi calls these dynamic dependencies).
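The repeated-call behavior described in this answer can be reproduced without Luigi. In this sketch, `requires` is an ordinary generator function and `b_done` is a hypothetical flag standing in for "TaskB's results exist on S3". Each call re-runs the body from the top, so the branch taken depends on the state at the time of that call.

```python
b_done = False  # hypothetical stand-in for "TaskB output exists on S3"


def requires():
    """Plain generator mimicking the shape of TaskA.requires()."""
    yield "TaskB"
    if not b_done:
        return  # the generator ends here; TaskC is never yielded
    yield "TaskC"


first_call = list(requires())   # evaluated before TaskB has produced output
b_done = True                   # pretend TaskB has now finished
second_call = list(requires())  # a later call sees the new state

print(first_call)   # ['TaskB']
print(second_call)  # ['TaskB', 'TaskC']
```

If only the result of the first call is used to build the dependency graph, as the answer describes, TaskC never gets scheduled even though a later call yields it.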

Celery ignores retry_backoff, instead retries 180 seconds repeatedly

Celery version number: 4.4.5
I have a function decorated like so:
@app.task(bind=True, retry_backoff=5, retry_jitter=False, retry_kwargs={"max_retries": 5})
def foo(self):
    try:
        pass  # work
    except Exception:
        try:
            _log.info("retrying task")
            self.retry()
        except MaxRetriesExceededError:
            _log.error("Permanent failure")
I would expect this to retry after 5 seconds, then again after 10, then again after 20, then 40, then 80.
Instead, celery logs 'retrying task after 180 seconds', which it does. It then repeats the same procedure twice to make three retries in total, before giving up.
From what I've read on the docs, this seems to be the correct way to do it. Am I doing something wrong?
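The retry_backoff option applies only to automatic retries, which you enable with the autoretry_for task decorator parameter:
A boolean, or a number. If this option is set to True, autoretries will be delayed following the rules of exponential backoff.
In your case, you are calling self.retry() yourself, so the retry backoff doesn't apply. The manual call falls back to the task defaults, default_retry_delay (180 seconds) and max_retries (3), which matches the three 180-second retries you observed.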
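For reference, the schedule the question expects (5, 10, 20, 40, 80) is plain exponential backoff. The helper below is a hypothetical sketch, not a Celery function; `backoff_delay`, `factor` and `maximum` are illustrative names. When retrying manually, a value computed this way could be passed as the `countdown` argument of self.retry(), with `self.request.retries` as the attempt counter.

```python
def backoff_delay(retries, factor=5, maximum=600):
    """Delay in seconds before retry number `retries` (0-based), no jitter."""
    # Double the delay on every attempt, capped at `maximum`.
    return min(maximum, factor * (2 ** retries))


delays = [backoff_delay(n) for n in range(5)]
print(delays)  # [5, 10, 20, 40, 80]
```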
EDIT:
To handle the cleanup actions after failure, consider this example:
from celery import Celery
from celery.utils.log import get_task_logger
app = Celery(broker='pyamqp://')
logger = get_task_logger(__name__)
def cleanup(self, exc, task_id, args, kwargs, einfo):
    logger.error('An error has occurred, cleaning up...')
@app.task(autoretry_for=(ZeroDivisionError,), retry_kwargs={'max_retries': 3},
          retry_backoff=True, on_failure=cleanup)
def fail():
    return 1/0
When you call the fail task, it will be retried 3 times and then raise a ZeroDivisionError exception. It will also call the cleanup function to do the cleanup. So you don't care whether the task gets retried; you react to the fact that the task failed and handle it accordingly in the on_failure callback. If your actions should depend on which exception occurred, you can use the arguments that cleanup gets called with.

How to use "with" with a list of objects

Suppose I have a class that will spawn a thread and implements .__enter__ and .__exit__ so I can use it as such:
with SomeAsyncTask(params) as t:
    # do stuff with `t`
    t.thread.start()
    t.thread.join()
.__exit__ might perform certain actions for clean-up purposes (i.e., removing temp files, etc.).
That works all fine until I have a list of SomeAsyncTasks that I would like to start all at once.
list_of_async_task_params = [params1, params2, ...]
How should I use with on the list? I'm hoping for something like this:
with [SomeAsyncTask(params) for params in list_of_async_task_params] as tasks:
    # do stuff with `tasks`
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()
I think contextlib.ExitStack is exactly what you're looking for. It's a way of combining an indeterminate number of context managers into a single one safely (so that an exception while entering one context manager won't cause it to skip exiting the ones it's already entered successfully).
The example from the docs is pretty instructive:
with ExitStack() as stack:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    # All opened files will automatically be closed at the end of
    # the with statement, even if attempts to open files later
    # in the list raise an exception
This can be adapted to your "hoped for" code pretty easily:
import contextlib
with contextlib.ExitStack() as stack:
    tasks = [stack.enter_context(SomeAsyncTask(params))
             for params in list_of_async_task_params]
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()
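To see the pattern run end to end, here is a self-contained sketch. The `SomeAsyncTask` class below is a hypothetical stand-in for the one in the question (its `log` parameter exists only to record what happened), so treat it as an illustration rather than the asker's actual class.

```python
import contextlib
import threading


class SomeAsyncTask:
    """Hypothetical stand-in for the question's class."""

    def __init__(self, params, log):
        self.params = params
        self.log = log
        self.thread = threading.Thread(target=self._work)

    def _work(self):
        self.log.append(("ran", self.params))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.log.append(("cleaned up", self.params))
        return False  # do not swallow exceptions


log = []
with contextlib.ExitStack() as stack:
    tasks = [stack.enter_context(SomeAsyncTask(p, log)) for p in ("a", "b")]
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()

print(log)
```

The clean-up entries appear in reverse order of entry, because ExitStack unwinds its context managers last-in, first-out.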
Note: Somehow I missed the fact that your Thread subclass was also a context manager itself, so the code below doesn't make that assumption. Nevertheless, it might be helpful when using more "generic" kinds of threads (where using something like contextlib.ExitStack wouldn't be an option).
Your question is a little light on details, so I made some up; however, this might be close to what you want. It defines an AsyncTaskListContextManager class that has the __enter__() and __exit__() methods required to support the context manager protocol (and the associated with statements).
import threading
from time import sleep
class SomeAsyncTask(threading.Thread):
    def __init__(self, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.name = name
        self.status_lock = threading.Lock()
        self.running = False
    def run(self):
        with self.status_lock:
            self.running = True
        while True:
            with self.status_lock:
                if not self.running:
                    break
            print('task {!r} running'.format(self.name))
            sleep(.1)
        print('task {!r} stopped'.format(self.name))
    def stop(self):
        with self.status_lock:
            self.running = False
class AsyncTaskListContextManager:
    def __init__(self, params_list):
        self.threads = [SomeAsyncTask(params) for params in params_list]
    def __enter__(self):
        for thread in self.threads:
            thread.start()
        return self
    def __exit__(self, *args):
        for thread in self.threads:
            if thread.is_alive():
                thread.stop()
                thread.join()  # wait for it to terminate
        return None  # allows exceptions to be processed normally
params = ['Fee', 'Fie', 'Foe']
with AsyncTaskListContextManager(params) as task_list:
    for _ in range(5):
        sleep(1)
    print('leaving task list context')
print('end-of-script')
Output:
task 'Fee' running
task 'Fie' running
task 'Foe' running
task 'Foe' running
task 'Fee' running
task 'Fie' running
... etc
task 'Fie' running
task 'Fee' running
task 'Foe' running
leaving task list context
task 'Foe' stopped
task 'Fie' stopped
task 'Fee' stopped
end-of-script
@martineau's answer should work. Here's a more generic method that should work for other cases. Note that exceptions are not handled in __exit__(): if one manager's __exit__() fails, the rest won't be called. A generic solution would probably throw an aggregate exception and allow you to handle it. Another corner case is when the second manager's __enter__() method throws an exception: the first manager's __exit__() will not be called in that case.
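class list_context_manager:
    def __init__(self, managers):
        self.managers = managers
    def __enter__(self):
        for m in self.managers:
            m.__enter__()
        return self.managers
    def __exit__(self, exc_type, exc_value, traceback):
        # The with statement always passes these three arguments
        for m in self.managers:
            m.__exit__(exc_type, exc_value, traceback)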
It can then be used like in your question:
with list_context_manager([SomeAsyncTask(params) for params in list_of_async_task_params]) as tasks:
    # do stuff with `tasks`
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()
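The corner case where a later __enter__() raises is the one contextlib.ExitStack, mentioned in the earlier answer, takes care of. The sketch below shows that behavior with a hypothetical `Managed` class that records its enter and exit calls; the names are illustrative and not part of any answer above.

```python
import contextlib


class Managed:
    """Hypothetical context manager that records enter/exit calls."""

    def __init__(self, name, events, fail_on_enter=False):
        self.name = name
        self.events = events
        self.fail_on_enter = fail_on_enter

    def __enter__(self):
        if self.fail_on_enter:
            raise RuntimeError("cannot enter " + self.name)
        self.events.append("enter " + self.name)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit " + self.name)
        return False  # let the exception propagate


events = []
managers = [Managed("first", events),
            Managed("second", events, fail_on_enter=True)]
try:
    with contextlib.ExitStack() as stack:
        for m in managers:
            stack.enter_context(m)
except RuntimeError as exc:
    events.append("caught: " + str(exc))

print(events)
```

The first manager is exited even though the second one never entered successfully, which the hand-written list_context_manager does not guarantee.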

Finding the cause of a BrokenProcessPool in python's concurrent.futures

In a nutshell
I get a BrokenProcessPool exception when parallelizing my code with concurrent.futures. No further error is displayed. I want to find the cause of the error and ask for ideas of how to do that.
Full problem
I am using concurrent.futures to parallelize some code.
with ProcessPoolExecutor() as pool:
    mapObj = pool.map(myMethod, args)
I end up with (and only with) the following exception:
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
Unfortunately, the program is complex and the error appears only after the program has run for 30 minutes. Therefore, I cannot provide a nice minimal example.
In order to find the cause of the issue, I wrapped the method that I run in parallel with a try-except-block:
def myMethod(*args):
    try:
        ...
    except Exception as e:
        print(e)
The problem remained the same and the except block was never entered. I conclude that the exception does not come from my code.
My next step was to write a custom ProcessPoolExecutor class that is a child of the original ProcessPoolExecutor and allows me to replace some methods with customized ones. I copied and pasted the original code of the method _process_worker and added some print statements.
def _process_worker(call_queue, result_queue):
    """Evaluates calls from call_queue and places the results in result_queue.
    ...
    """
    while True:
        call_item = call_queue.get(block=True)
        if call_item is None:
            # Wake up queue management thread
            result_queue.put(os.getpid())
            return
        try:
            r = call_item.fn(*call_item.args, **call_item.kwargs)
        except BaseException as e:
            print("??? Exception ???")  # newly added
            print(e)  # newly added
            exc = _ExceptionWithTraceback(e, e.__traceback__)
            result_queue.put(_ResultItem(call_item.work_id, exception=exc))
        else:
            result_queue.put(_ResultItem(call_item.work_id,
                                         result=r))
Again, the except block is never entered. This was to be expected, because I already ensured that my code does not raise an exception (and if everything worked well, the exception should be passed to the main process).
Now I am lacking ideas how I could find the error. The exception is raised here:
def submit(self, fn, *args, **kwargs):
    with self._shutdown_lock:
        if self._broken:
            raise BrokenProcessPool('A child process terminated '
                'abruptly, the process pool is not usable anymore')
        if self._shutdown_thread:
            raise RuntimeError('cannot schedule new futures after shutdown')
        f = _base.Future()
        w = _WorkItem(f, fn, args, kwargs)
        self._pending_work_items[self._queue_count] = w
        self._work_ids.put(self._queue_count)
        self._queue_count += 1
        # Wake up queue management thread
        self._result_queue.put(None)
        self._start_queue_management_thread()
        return f
The process pool is set to be broken here:
def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
    ...
    """
    executor = None
    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread
    def shutdown_worker():
        ...
    reader = result_queue._reader
    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)
        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:  # THIS BLOCK IS ENTERED WHEN THE ERROR OCCURS
            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...
It is (or seems to be) a fact that a process terminates, but I have no clue why. Are my thoughts correct so far? What are possible causes that make a process terminate without a message? (Is this even possible?) Where could I apply further diagnostics? Which questions should I ask myself in order to come closer to a solution?
I am using python 3.5 on 64bit Linux.
I think I was able to get as far as possible:
I changed the _queue_management_worker method in my changed ProcessPoolExecutor module such that the exit code of the failed process is printed:
def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
    ...
    """
    executor = None
    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread
    def shutdown_worker():
        ...
    reader = result_queue._reader
    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)
        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:
            # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
            vals = list(processes.values())
            for s in ready:
                j = sentinels.index(s)
                print("is_alive()", vals[j].is_alive())
                print("exitcode", vals[j].exitcode)
            # -------------------------------------------
            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...
Afterwards I looked up the meaning of the exit code:
from multiprocessing.process import _exitcode_to_name
print(_exitcode_to_name[my_exit_code])
whereby my_exit_code is the exit code that was printed in the block I inserted to the _queue_management_worker. In my case the code was -11, which means that I ran into a segmentation fault. Finding the reason for this issue will be a huge task but goes beyond the scope of this question.
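As an alternative to the private _exitcode_to_name mapping, the public signal module can translate the code. This is a minimal sketch; `describe_exitcode` is a hypothetical helper name, and it relies on the multiprocessing convention that a negative exit code -N means the child was terminated by signal N.

```python
import signal


def describe_exitcode(exitcode):
    """Return a readable description of a multiprocessing.Process exit code."""
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # A negative exit code -N means the child was killed by signal N.
        try:
            return "killed by " + signal.Signals(-exitcode).name
        except ValueError:
            return "killed by unknown signal {}".format(-exitcode)
    return "exited with status {}".format(exitcode)


print(describe_exitcode(-signal.SIGSEGV))
```

For a segmentation fault like the one found here, calling faulthandler.enable() at the start of the worker function is one way to get a Python traceback on stderr at the moment of the crash.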
If you are using macOS, there is a known issue: some versions of macOS fork in a way that Python does not consider fork-safe in some scenarios. The workaround that worked for me is to set the no_proxy environment variable.
Edit ~/.bash_profile and include the following (it might be better to specify a list of domains or subnets here instead of *):
no_proxy='*'
Refresh the current context:
source ~/.bash_profile
I saw and worked around the issue with Python 3.6.0 on macOS 10.14.1 and 10.13.x.
Sources:
Issue 30388
Issue 27126

Where did the Luigi task go?

First time into the realm of Luigi (and Python!) and have some questions. Relevant code is:
from Database import Database
import luigi
class bbSanityCheck(luigi.Task):
    conn = luigi.Parameter()
    date = luigi.Parameter()
    def __init__(self, *args, **kwargs):
        super(bbSanityCheck, self).__init__(*args, **kwargs)
        self.has_run = False
    def run(self):
        print "Entering run of bb sanity check"
        # DB STUFF HERE THAT DOESN'T MATTER
        print "Are we in la-la land?"
    def complete(self):
        print "BB Sanity check being asked for completeness: " , self.has_run
        return self.has_run
class Pipeline(luigi.Task):
    date = luigi.DateParameter()
    def requires(self):
        db = Database('cbs')
        self.conn = db.connect()
        print "I'm about to yield!"
        return bbSanityCheck(conn = self.conn, date = self.date)
    def run(self):
        print "Hello World"
        self.conn.query("""SELECT *
                           FROM log_blackbook""")
        result = self.conn.store_result()
        print result.fetch_row()
    def complete(self):
        return False
if __name__=='__main__':
    luigi.run()
Output is here (with relevant DB returns removed 'cause):
DEBUG: Checking if Pipeline(date=2013-03-03) is complete
I'm about to yield!
INFO: Scheduled Pipeline(date=2013-03-03)
I'm about to yield!
DEBUG: Checking if bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03) is complete
BB Sanity check being asked for completeness: False
INFO: Scheduled bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 5150] Running bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
Entering run of bb sanity check
Are we in la-la land?
INFO: [pid 5150] Done bbSanityCheck(conn=<_mysql.connection open to 'sas1.rad.wc.truecarcorp.com' at 223f050>, date=2013-03-03)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: There are 1 pending tasks possibly being run by other workers
INFO: Worker was stopped. Shutting down Keep-Alive thread
So the questions:
1.) Why does "I'm about to yield" get printed twice?
2.) Why is "hello world" never printed?
3.) What is the "1 pending tasks possibly run by other workers"?
I prefer super-ultra clean output because it is way easier to maintain. I'm hoping I can get these warning equivalents ironed out.
I've also noted that requires() uses either "yield" or "return item, item2, item3". I've read about yield and understand it. What I don't get is which convention is considered superior here, or whether there are subtle differences that I, being new to the language, am not getting.
I think you're misunderstanding how Luigi works in general.
(1) Hmm, not sure about that. To me it looks more like the same thing being printed at both the INFO and DEBUG levels.
(2) You're trying to run Pipeline, which depends on bbSanityCheck. bbSanityCheck.complete() never returns True because you never set has_run to True in bbSanityCheck. So the Pipeline task can never run and print "Hello World", because its dependencies are never complete.
(3) That's probably because you have this pending task (it's actually Pipeline). But Luigi understands that it is impossible for it to run and shuts down.
I would personally not use has_run as a way to check whether a task has run, but instead check for the existence of the result of this job. I.e., if this job does something to the database, then complete() should check that the expected contents are there.
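The idea of deriving completeness from the result can be sketched in plain Python. `SanityCheck` below is a hypothetical task-like class, not a luigi.Task, and the marker file stands in for whatever the real job writes (here, database contents). The point is that a fresh instance still reports completion, which an in-memory flag such as has_run cannot do.

```python
import os
import tempfile


class SanityCheck:
    """Hypothetical task-like class; not a luigi.Task."""

    def __init__(self, result_path):
        self.result_path = result_path

    def run(self):
        with open(self.result_path, "w") as handle:
            handle.write("ok\n")

    def complete(self):
        # Completeness is derived from the result, not from an in-memory flag.
        return os.path.exists(self.result_path)


with tempfile.TemporaryDirectory() as workdir:
    task = SanityCheck(os.path.join(workdir, "sanity_check.done"))
    before = task.complete()
    task.run()
    after = task.complete()
    fresh_instance_sees_it = SanityCheck(task.result_path).complete()

print(before, after, fresh_instance_sees_it)
```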
