GAE: Task on backend instance killed without warning - python

TL;DR:
How can I work around this bug in Appengine: sometimes is_shutting_down returns False, and in a second or two, the instance is shut down?
Details
I have a backend instance on a Google Appengine application (Python). The backend instance is used to generate reports, which sometimes take minutes or even hours to finish.
To deal with unexpected shutdowns, I watch runtime.is_shutting_down() and store the report's intermediate state in the DB when it returns True.
Here's the portion of code where I check it:
from google.appengine.api import runtime
# ...

def my_report_function():
    # ...
    # Check if we should interrupt and reschedule to avoid timeout error.
    duration_sec = time.time() - start
    too_long = MAX_SEC < duration_sec
    is_shutting_down = runtime.is_shutting_down()
    log.debug('Does this report iteration need to wrap it up soon? '
              'Too long? %s (%s sec). Shutting down? %s'
              % (too_long, duration_sec, is_shutting_down))
    if too_long or is_shutting_down:
        # save the state of the report, reschedule the next iteration, and return
Sometimes it works, but sometimes I see the following in the Appengine log:
D 2013-06-20 18:41:56.893 Does this report iteration need to wrap it up soon? Too long? False (348.865950108 sec). Shutting down? False
E 2013-06-20 18:42:00.248 Process terminated because the backend took too long to shutdown.
Clearly, fewer than 30 seconds passed between the moment runtime.is_shutting_down() returned False and the moment Appengine killed the backend.
Does anybody know why this is happening, and whether there is a workaround for this?
Thank you in advance!

There is demo code from Google IO here http://backends-io.appspot.com/
The included counter_v3_with_write_behind.py demonstrates a pattern:
On '/_ah/start' set a shutdown hook via
runtime.set_shutdown_hook(something_to_save_progress_and_requeue_task)
It looks like your code asks 'are you shutting down right now? If not, go do something that may take a while.' The pattern should instead listen for 'shut down ASAP or you lose everything', as sketched below.
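A minimal sketch of that pattern, assuming a webapp2 start handler; save_report_state and reschedule_report_task are hypothetical placeholders for whatever persistence and re-queueing logic your report needs:

import webapp2
from google.appengine.api import runtime

def shutdown_hook():
    # Called by App Engine when the instance is about to be stopped.
    save_report_state()        # hypothetical: persist intermediate results
    reschedule_report_task()   # hypothetical: enqueue a task to continue later

class StartHandler(webapp2.RequestHandler):
    def get(self):
        # Register the hook as soon as the backend receives /_ah/start.
        runtime.set_shutdown_hook(shutdown_hook)

app = webapp2.WSGIApplication([('/_ah/start', StartHandler)])

With the hook registered you can still keep the periodic is_shutting_down() check, but the hook is the mechanism App Engine provides for reacting to an imminent shutdown.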

Related

"The object invoked has disconnected from its clients" win32com.client.Dispatch('CANalyzer.Application')

I have Python code that starts CANalyzer and stops it after n seconds (defined by the user with a tkinter GUI) inside a timer thread. Here is the code:
pythoncom.CoInitialize()
CANalyzer = win32com.client.Dispatch('CANalyzer.Application')
self.CAN_id = pythoncom.CoMarshalInterThreadInterfaceInStream(pythoncom.IID_IDispatch, CANalyzer)
then in the thread:
def timer_Stop_reply(CAN_id):
    pythoncom.CoInitialize()
    CAN = win32com.client.Dispatch(
        pythoncom.CoGetInterfaceAndReleaseStream(CAN_id, pythoncom.IID_IDispatch)
    )
    CAN.Measurement.Stop()
    self.stopped_DSE = 1
    pythoncom.CoUninitialize()
    print('\n=== Stopping Trace ===')
Unfortunately, there is a sort of timeout after 400 s; I get this error: (-2147417848, 'The object invoked has disconnected from its clients.', None, None). How can I avoid this problem? Is there something like a keep-alive? I need to run CANalyzer for more than 10 min in my test, so this error is really annoying.
Thanks
OK, at the moment the workaround is to re-dispatch CANalyzer in the thread when the test time is > 400 s. Anyway, if someone finds a better solution I'd be glad to read it!
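For reference, a minimal sketch of that re-dispatch workaround, assuming the same timer_Stop_reply thread as above; the 400 s threshold is the one observed in the question, and test_time_sec is an illustrative parameter:

import pythoncom
import win32com.client

def timer_Stop_reply(CAN_id, test_time_sec):
    pythoncom.CoInitialize()
    try:
        if test_time_sec > 400:
            # The marshalled proxy is likely disconnected by now, so dispatch
            # a fresh CANalyzer COM object inside this thread instead.
            CAN = win32com.client.Dispatch('CANalyzer.Application')
        else:
            CAN = win32com.client.Dispatch(
                pythoncom.CoGetInterfaceAndReleaseStream(
                    CAN_id, pythoncom.IID_IDispatch))
        CAN.Measurement.Stop()
    finally:
        pythoncom.CoUninitialize()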

Python: PySerial disconnecting from device randomly

I have a process that runs data acquisition using PySerial. It's working fine now, but there's a weird thing I had to do to make it work continuously, and I'm not sure this is normal, so I'm asking this question.
What happens: It looks like the connection drops now and then! Around once every 30-60 minutes, with big error bars (it could go for hours and be OK, but sometimes it happens often).
My question: Is this standard?
My temporary solution: I wrote a simple "reopen" function that looks like this:
def ReopenDevice(devObject):
    try:
        devObject.close()
        devObject.open()
    except Exception as e:
        print("Error while trying to connect to device " + devObject.port + ". The error says: " + str(e))
        time.sleep(2)
What I do is this: if data pulling fails for 2 minutes, I reopen the device with this function, and it continues working with no problems.
My program model: It's a GUI program, where the user clicks something like "Start", and that button does some preparations and runs a function through multiprocessing.Process() that starts with:
devObj = serial.Serial()
#... other params
devObj.open()
and that function then runs a while loop that keeps polling data with something like:
bytesToRead = devObj.inWaiting()
if bytesToRead != 0:
    buffer = decodeString(devObj.read(bytesToRead))
    # process buffer and push it to a list...
The way I know the problem happened is that devObj.inWaiting() keeps returning zero, no matter how much data is waiting on the device!
Is this behavior expected, and is it something a program should always account for, whether or not it currently happens?
The problem was reduced a lot by not calling inWaiting() very frequently. Anyway, I kept the reconnect part to ensure that my program never fails. Thanks to "Kobi K" for suggesting the possible cause of the problem.
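A rough sketch of that watchdog, assuming the devObj, decodeString and ReopenDevice shown above; the 120-second threshold and the poll interval are illustrative values:

import time

SILENT_TIMEOUT_SEC = 120   # reopen after two minutes without data
last_data_time = time.time()

while True:
    bytesToRead = devObj.inWaiting()
    if bytesToRead != 0:
        buffer = decodeString(devObj.read(bytesToRead))
        # ... process buffer and push it to a list ...
        last_data_time = time.time()
    elif time.time() - last_data_time > SILENT_TIMEOUT_SEC:
        # No data for too long: assume the connection dropped and reopen it.
        ReopenDevice(devObj)
        last_data_time = time.time()
    time.sleep(0.05)   # avoid hammering inWaiting(), which made the problem worse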

Python Memory leak using Yocto

I'm running a Python script on a Raspberry Pi that constantly checks a Yocto button, and when it gets pressed it puts data from a different sensor in a database.
A code snippet of what constantly runs is:
# when all is set and done, run the program
Active = True
while Active:
    if ResponseType == "b":
        while Active:
            try:
                if GetButtonPressed(ResponseValue):
                    DoAllSensors()
                    time.sleep(5)
                else:
                    time.sleep(0.5)
            except KeyboardInterrupt:
                Active = False
            except Exception, e:
                print str(e)
                print "exception raised, continuing after 10 seconds"
                time.sleep(10)
The GetButtonPressed(ResponseValue) function looks like the following:
def GetButtonPressed(number):
    global buttons
    if ModuleCheck():
        if buttons[number - 1].get_calibratedValue() < 300:
            return True
    else:
        print "module not online"
    return False

def ModuleCheck():
    global moduleb
    return moduleb.isOnline()
I'm not quite sure about what might be going wrong. But it takes about an hour before the RPI runs out of memory.
Memory usage increases constantly, even though the button is only pressed once every 15 minutes or so.
That already tells me that the problem must be in the code displayed above.
The problem is that the yocto_api.YAPI object will continue to accumulate _Event objects in its class-wide _DataEvents attribute until you call YAPI.HandleEvents. If you're not using the API's callbacks, it's easy to think (I did, for hours) that you never need to call this. The API docs aren't at all clear on the point:
If your program includes significant loops, you may want to include a call to this function to make sure that the library takes care of the information pushed by the modules on the communication channels. This is not strictly necessary, but it may improve the reactivity of the library for the following commands.
I did some playing around with API-level callbacks before I decided to periodically poll the sensors in my own code, and it's possible that some setting got left enabled in them that is causing these events to accumulate. If that's not the case, I can't imagine why they would say calling YHandleEvents is "not strictly necessary," unless they make ARM devices with unlimited RAM in Switzerland.
Here's the magic static method that thou shalt call periodically, no matter what. I'm doing so once every five seconds and that is taking care of the problem without loading down the system at all. API code that would accumulate unwanted events still smells to me, but it's time to move on.
# noinspection PyUnresolvedReferences
@staticmethod
def HandleEvents(errmsgRef=None):
    """
    Maintains the device-to-library communication channel.
    If your program includes significant loops, you may want to include
    a call to this function to make sure that the library takes care of
    the information pushed by the modules on the communication channels.
    This is not strictly necessary, but it may improve the reactivity
    of the library for the following commands.
    This function may signal an error in case there is a communication problem
    while contacting a module.

    @param errmsg : a string passed by reference to receive any error message.

    @return YAPI.SUCCESS when the call succeeds.
    On failure, throws an exception or returns a negative error code.
    """
    errBuffer = ctypes.create_string_buffer(YAPI.YOCTO_ERRMSG_LEN)
    # noinspection PyUnresolvedReferences
    res = YAPI._yapiHandleEvents(errBuffer)
    if YAPI.YISERR(res):
        if errmsgRef is not None:
            # noinspection PyAttributeOutsideInit
            errmsgRef.value = YByte2String(errBuffer.value)
        return res
    while len(YAPI._DataEvents) > 0:
        YAPI.yapiLockFunctionCallBack(errmsgRef)
        if not (len(YAPI._DataEvents)):
            YAPI.yapiUnlockFunctionCallBack(errmsgRef)
            break
        ev = YAPI._DataEvents.pop(0)
        YAPI.yapiUnlockFunctionCallBack(errmsgRef)
        ev.invokeData()
    return YAPI.SUCCESS
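For context, a hedged sketch of how that periodic call can be folded into the question's polling loop; the five-second interval is the one mentioned above, and the module name yocto_api matches the standard Yoctopuce Python library:

import time
from yocto_api import YAPI

HANDLE_EVENTS_INTERVAL = 5  # seconds between housekeeping calls
last_handled = time.time()

while Active:
    if GetButtonPressed(ResponseValue):
        DoAllSensors()
        time.sleep(5)
    else:
        time.sleep(0.5)
    if time.time() - last_handled >= HANDLE_EVENTS_INTERVAL:
        # Drain the accumulated _DataEvents so they cannot grow without bound.
        YAPI.HandleEvents()
        last_handled = time.time()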

Twisted Python pause/postpone reactor

I'm pretty new to Twisted. I have an HTTP client that queries a server with a rate limit; when I hit the limit the server responds with HTTP 204, and when I handle that response I'm probably doing something nasty, like this:
def handleResponse(r, ip):
    if r.code == 204:
        print 'Got 204, sleeping'
        time.sleep(120)
        return None
    else:
        jsonmap[ip] = ''
        whenFinished = twisted.internet.defer.Deferred()
        r.deliverBody(PrinterClient(whenFinished, ip))
        return whenFinished
I'm doing this because I want to pause all the tasks.
I have two behaviours in mind: either re-run the tasks that hit a 204 later in the same execution (I don't know if that's possible), or just log the errors and re-run them in another execution of the program. Another problem that may arise is that I've set a timeout on the connection in order to cancel the deferred after a pre-defined amount of time (see the code below) if there's no response from the server:
timeoutCall = reactor.callLater(60, d.cancel)

def completed(passthrough):
    if timeoutCall.active():
        timeoutCall.cancel()
    return passthrough

d.addCallback(handleResponse, ip)
d.addErrback(handleError, ip)
d.addBoth(completed)
Another problem that I may encounter is that if I'm sleeping I may hit this timeout and all my requests will be cancelled.
I hope that I've been enough precise.
Thank you in advance.
Jeppo
Don't use time.sleep() in any Twisted-based code. This violates basic assumptions that any other Twisted-based code you might be using makes.
Instead, if want to delay something by N seconds, use reactor.callLater(N, someFunction).
Once you remove the sleep calls from your program, the problem of unrelated timeouts being hit just because you've stopped the reactor from processing events will go away.
For anyone stumbling across this thread, it's imperative that you never call time.sleep(...); however, it is possible to create a Deferred that does nothing but sleep... which you can use to compose delays into a deferred chain:
def make_delay_deferred(seconds, result=None):
    d = Deferred()
    reactor.callLater(seconds, d.callback, result)
    return d
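As a hedged illustration of how that composes with the code from the question, handleResponse could return such a delayed Deferred instead of sleeping; retryRequest is a hypothetical helper that re-issues the original query for that ip:

from twisted.internet import defer, reactor

def handleResponse(r, ip):
    if r.code == 204:
        print 'Got 204, retrying in 120 seconds'
        d = defer.Deferred()
        reactor.callLater(120, d.callback, ip)
        d.addCallback(retryRequest)   # hypothetical: re-issue the request for ip
        return d
    else:
        jsonmap[ip] = ''
        whenFinished = defer.Deferred()
        r.deliverBody(PrinterClient(whenFinished, ip))
        return whenFinished

Because the reactor keeps running during the 120-second delay, the other outstanding requests and their timeouts are unaffected.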

Find out whether celery task exists

Is it possible to find out whether a task with a certain task id exists? When I try to get the status, I always get PENDING.
>>> AsyncResult('...').status
'PENDING'
I want to know whether a given task id is a real celery task id and not a random string. I want different results depending on whether there is a valid task for a certain id.
There may have been a valid task in the past with the same id but the results may have been deleted from the backend.
Celery does not write a state when the task is sent; this is partly an optimization (see the documentation).
If you really need it, it's simple to add:
from celery import current_app
# `after_task_publish` is available in celery 3.1+
# for older versions use the deprecated `task_sent` signal
from celery.signals import after_task_publish

# when using celery versions older than 4.0, use body instead of headers
@after_task_publish.connect
def update_sent_state(sender=None, headers=None, **kwargs):
    # the task may not exist if sent using `send_task` which
    # sends tasks by name, so fall back to the default result backend
    # if that is the case.
    task = current_app.tasks.get(sender)
    backend = task.backend if task else current_app.backend

    backend.store_result(headers['id'], None, "SENT")
Then you can test for the PENDING state to detect that a task has not (seemingly) been sent:
>>> result.state != "PENDING"
AsyncResult.state returns PENDING in case of unknown task ids.
PENDING
Task is waiting for execution or unknown. Any task id that is not
known is implied to be in the pending state.
http://docs.celeryproject.org/en/latest/userguide/tasks.html#pending
You can provide custom task ids if you need to distinguish unknown ids from existing ones:
>>> from tasks import add
>>> from celery.utils import uuid
>>> r = add.apply_async(args=[1, 2], task_id="celery-task-id-"+uuid())
>>> id = r.task_id
>>> id
'celery-task-id-b774c3f9-5280-4ebe-a770-14a6977090cd'
>>> if not "blubb".startswith("celery-task-id-"): print "Unknown task id"
...
Unknown task id
>>> if not id.startswith("celery-task-id-"): print "Unknown task id"
...
Right now I'm using the following scheme:
Get the task id.
Set a memcache key like 'task_%s' % task.id to the message 'Started'.
Pass the task id to the client.
Now from the client I can monitor the task status (set from task messages to memcache).
When the task is ready, it sets the memcache key to 'Ready'.
When the client sees the task is ready, it starts a special task that deletes the key from memcache and does the necessary cleanup actions.
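A hedged sketch of that scheme, using the python-memcached client; the 'task_%s' key format mirrors the description above, and start_task / task_status / mark_ready / cleanup are illustrative helper names:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def start_task(task, *args, **kwargs):
    result = task.apply_async(args=args, kwargs=kwargs)
    mc.set('task_%s' % result.id, 'Started')
    return result.id            # pass this id to the client

def task_status(task_id):
    # Returns 'Started', 'Ready', or None if the key was never set
    # (unknown id) or has already been cleaned up.
    return mc.get('task_%s' % task_id)

def mark_ready(task_id):
    # Called from inside the task when it finishes.
    mc.set('task_%s' % task_id, 'Ready')

def cleanup(task_id):
    mc.delete('task_%s' % task_id)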
You need to call .get() on the AsyncResult object you create to actually fetch the result from the backend.
See the Celery FAQ.
To further clarify my answer:
Any string is technically a valid ID; there is no way to validate a task ID itself. The only way to find out whether a task exists is to ask the backend if it knows about it, and to do that you must use .get().
This introduces the problem that .get() blocks when the backend doesn't have any information about the task ID you supplied; this is by design, to allow you to start a task and then wait for its completion.
In the case of the original question I'm going to assume that the OP wants to get the state of a previously completed task. To do that you can pass a very small timeout and catch timeout errors:
from celery.exceptions import TimeoutError

try:
    # fetch the result from the backend
    # your backend must be fast enough to return
    # results within 100ms (0.1 seconds)
    result = AsyncResult('blubb').get(timeout=0.1)
except TimeoutError:
    result = None

if result:
    print "Result exists; state=%s" % (result.state,)
else:
    print "Result does not exist"
It should go without saying that this only works if your backend is storing results; if it's not, there's no way to know whether a task ID is valid or not, because nothing is keeping a record of them.
Even more clarification.
What you want to do cannot be accomplished using the AMQP backend, because it does not store results; it forwards them.
My suggestion would be to switch to a database backend so that the results are in a database that you can query outside of the existing celery modules. If no tasks exist in the result database you can assume the ID is invalid.
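As a hedged example of that suggestion, switching to a database-backed result store is a configuration change; the SQLAlchemy URL below is purely illustrative:

from celery import Celery

# 'db+sqlite:///results.sqlite' is just an example SQLAlchemy result-backend URL;
# any database supported by SQLAlchemy can be used instead.
app = Celery('project',
             broker='amqp://localhost//',
             backend='db+sqlite:///results.sqlite')

With results persisted in a database, a task ID that has no row in the results table can reasonably be treated as unknown.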
So I have this idea:
import project.celery_tasks as tasks

def task_exist(task_id):
    found = False
    # tasks is my imported task module from celery
    # it is located under /project/project, where the settings.py file is located
    i = tasks.app.control.inspect()

    s = i.scheduled()
    for e in s:
        if task_id in s[e]:
            found = True
            break

    a = i.active()
    if not found:
        for e in a:
            if task_id in a[e]:
                found = True
                break

    r = i.reserved()
    if not found:
        for e in r:
            if task_id in r[e]:
                found = True
                break

    # if checking the status returns pending, yet we found it in any queues... it means it exists...
    # if it returns pending, yet we didn't find it on any of the queues... it doesn't exist
    return found
According to https://docs.celeryproject.org/en/stable/userguide/monitoring.html the different types of queue inspections are:
active,
scheduled,
reserved,
revoked,
registered,
stats,
query_task,
so pick and choose as you please.
And there might be a better way to go about checking the queues for their tasks, but this should work for me, for now.
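For example, query_task from the list above can look up a specific id directly (a hedged sketch, assuming Celery 4+ and at least one running worker; tasks.app is the same app object used above):

def task_known_to_workers(task_id):
    # query_task returns {worker_name: {task_id: [state, request_info]}}
    # for every requested id the worker currently knows about.
    replies = tasks.app.control.inspect().query_task(task_id) or {}
    return any(task_id in found for found in replies.values())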
Try
AsyncResult('blubb').state
That may work; it should return something different. Please correct me if I'm wrong.
if built_in_status_check(task_id) == 'pending':
    if registry_exists(task_id):
        print 'Pending'
    else:
        print 'Task does not exist'
