I'm building a Flask app that processes some big files from users. Since this can take a few seconds, I want the work to be done by an executor. I'm using Flask-Executor and it works very well on my Windows development machine.
But on the Ubuntu host it simply doesn't: after submitting a task, the job method gets called, logs one line to the log file and then stops. No return statement, no exception thrown, nothing. How is this even possible?
Here I create the executor:
from flask_executor import Executor
executor = Executor()
executor.init_app(app)
After the file upload, I submit the job:
task_list.append((process_upload, full_file_path, current_user, readme_path))
Nothing special, so here's my method:
def process_upload(file_path: str, user: User, readme_path: str):
    try:
        _process_upload(file_path, user, readme_path)
    except BaseException:
        logger.error(traceback.format_exc())
def _process_upload(file_path: str, user: User, readme_path: str):
    logger.info(f" --------- processing file {file_path} in background from user {user.username}")
    logger.info("1")
    path_only, file_name_with_ext = os.path.split(file_path)
    logger.info("2")
The process method is actually bigger and does some fancy things, but I'm not even getting that far. Here's the log file:
2022-02-22 10:56:02,055 - ThreadPoolExecutor-0_0 - INFO - --------- processing file /var/www/myproject/files/test/test.dat in background from user test
I know it's not just a logging problem, because the task never actually finishes its work either.
So how on earth can a submitted task just stop? Why does it work on Windows but not on a perfectly ordinary Ubuntu box? And how can I find and fix the problem?
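One thing I can try is the following debugging sketch (the actual executor.submit call isn't shown above, so I'm assuming the task_list tuples end up there): keep the Future that executor.submit returns and log any exception from a done callback, so a failure that never reaches the task body still shows up in the log.

# Debugging sketch (assumption: the task_list tuples are eventually passed to executor.submit).
# flask-executor returns a concurrent.futures.Future, so an exception that escaped the
# task body, or prevented it from starting at all, can be logged here.
future = executor.submit(process_upload, full_file_path, current_user, readme_path)

def _log_task_outcome(fut):
    exc = fut.exception()  # None if the task finished without raising
    if exc is not None:
        logger.error("background task died", exc_info=exc)
    else:
        logger.info("background task finished cleanly")

future.add_done_callback(_log_task_outcome)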
For learning purposes I want to implement the following:
I have a script that runs, for example, Selenium in the background, and some log messages that help me see in the terminal what is going on.
But I want to get the same messages in my REST request to the Angular app.
print('Started')
print('Logged in')
...
print('Processing')
...
print('Success')
In my view.py file
class RunTask(viewsets.ViewSet):
    queryset = Task.objects.all()

    @action(detail=False, methods=['GET'], name='Run Test Script')
    def run(self, request, *args, **kwargs):
        result = task()
        if result['success']:
            return Response(data=result)
        else:
            return Response(data=result['message'])
def task():
    print('Started')
    print('Logged in')
    ...
    print('Processing')
    ...
    print('Success')
    return {
        'success': True,  # or False
        'message': 'my status message'
    }
Now it only shows me the final result of the task, but I want to get the intermediate messages as well, so the frontend can show the process status.
I can't figure out how to organize this. How can I tell Angular about my process status?
Unfortunately, it's not that simple. The REST API does let you start the task, but since it runs in the same thread, the HTTP request will block until the task is finished before sending the response. Your print statements won't appear in the HTTP response but in your server output (if you look at the shell where you ran python manage.py runserver, you'll see those print statements).
Now, if you want that output in real time, you'll have to look at WebSockets. They let you open a "tunnel" between the browser and the server and send/receive messages in real time. The django-channels library allows you to implement them.
However, for long-running background tasks (like a Selenium scraper), I would advise looking into the Celery task queue. Basically, your Django process schedules tasks into the queue, and the tasks in the queue are then executed by one (or more!) "worker" processes. The advantage is that your Django process isn't blocked by the long task: it just adds some work to the queue and then responds.
When you add a task to the queue, Celery gives you a unique identifier for it, which you can return in the HTTP response. You can then implement another endpoint that takes a task id as a parameter and returns the state of the task (is it pending? done? failed?).
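A rough sketch of that pattern might look like the following; the task name, the Redis address and the URL wiring are illustrative assumptions, not taken from your question.

# tasks.py -- illustrative Celery app and task (broker/backend addresses are placeholders)
from celery import Celery

celery_app = Celery('demo',
                    broker='redis://localhost:6379/0',
                    backend='redis://localhost:6379/0')

@celery_app.task
def long_task():
    # the Selenium work would live here
    return 'Success'


# views.py -- start the task, then let the frontend poll its state by id
from celery.result import AsyncResult
from rest_framework.decorators import api_view
from rest_framework.response import Response

@api_view(['POST'])
def start_task(request):
    result = long_task.delay()
    return Response({'task_id': result.id})

@api_view(['GET'])
def task_state(request, task_id):
    result = AsyncResult(task_id, app=celery_app)
    # result.state is e.g. PENDING, STARTED, SUCCESS or FAILURE
    return Response({'state': result.state})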
For this to work, you'll have to set up a "broker", a kind of database that stores the tasks to do and their results (typically RabbitMQ or Redis). The Celery documentation explains this well: https://docs.celeryproject.org/en/latest/getting-started/brokers/index.html
Whichever way you choose, it's not trivial and will need quite some work before you see results; but it's interesting to see how it expands the possibilities of a classical HTTP server.
I integrated my project with Celery this way. Inside views.py, after receiving the request from the user:
def upload(request):
    if "POST" == request.method:
        # save the file
        task_parse.delay()
        # continue
and in tasks.py
from __future__ import absolute_import
from celery import shared_task
from uploadapp.main import aunit

@shared_task
def task_parse():
    aunit()
    return True
In short, the shared task runs the function aunit() from a third Python file, main.py, located in the uploadapp/ directory.
Let's assume that aunit() is a resource-heavy process that takes time (like file parsing). Since I integrated it with Celery, it now runs completely asynchronously, which is good for me. So: the task starts -> Celery processes it -> it finishes and Celery sets the status to finished. I can view that using Flower.
But what I want is to notify the user of my app, through the Django UI, that their task is done processing as soon as Celery has finished on the back side and set the status to SUCCESS.
Now, I know this is possible if:
1.) I constantly request the STATUS and check whether it returns SUCCESS or not.
How do I do that via Celery? How can you query a Celery task's status from your views.py and notify the user asynchronously with just Celery's Python module?
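For option 1.), I imagine something like the following sketch; the task-id plumbing, the status URL and the uploadapp.tasks import path are my own assumptions, not code I already have.

# views.py -- sketch of polling the task state from Django
from celery.result import AsyncResult
from django.http import JsonResponse
from uploadapp.tasks import task_parse  # module path assumed from the question

def upload(request):
    if "POST" == request.method:
        # save the file ...
        async_result = task_parse.delay()
        # hand the id back so the page can poll the status endpoint below
        return JsonResponse({'task_id': async_result.id})

def task_status(request, task_id):
    result = AsyncResult(task_id)
    return JsonResponse({
        'state': result.state,        # PENDING / STARTED / SUCCESS / FAILURE
        'done': result.successful(),  # True once the task finished without error
    })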
You need a real-time mechanism. I would suggest Firebase: update the Firebase Realtime Database field for the user id with a boolean=True at the end of the Celery task, then implement a JavaScript function that listens for changes to the user_id object in the Firebase database and updates the UI.
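A sketch of the server-side half of that idea, assuming the firebase_admin SDK; the credentials file, the databaseURL, the task_status/<user_id> path and the user_id argument (which task_parse doesn't take in the question) are placeholders. The Angular/JavaScript listener is the other half.

# tasks.py -- sketch: flip a Firebase Realtime Database flag when the task finishes
import firebase_admin
from firebase_admin import credentials, db
from celery import shared_task
from uploadapp.main import aunit

cred = credentials.Certificate('serviceAccountKey.json')
firebase_admin.initialize_app(cred, {'databaseURL': 'https://your-project.firebaseio.com'})

@shared_task
def task_parse(user_id):
    aunit()
    # the browser listens on this path and updates the UI when it flips to True
    db.reference('task_status/{}'.format(user_id)).set(True)
    return True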
I'm using Python 2.7 (sigh), celery==3.1.19, librabbitmq==1.6.1, rabbitmq-server-3.5.6-1.noarch, and redis 2.8.24 (from redis-cli info).
I'm attempting to send a message from a celery producer to a celery consumer, and obtain the result back in the producer. There is 1 producer and 1 consumer, but 2 rabbitmq's (as brokers) and 1 redis (for results) in between.
The problem I'm facing is:
In the consumer, I get back an AsyncResult via async_result = ZipUp.delay(unique_directory), but async_result.ready() never returns True (at least not for 9 seconds), even for a consumer task that does essentially nothing but return a string.
I can see, in the rabbitmq management web interface, my message being received by the rabbitmq exchange, but it doesn't show up in the corresponding rabbitmq queue. Also, a log message sent at the very beginning of the ZipUp task doesn't appear to be getting logged.
Things work if I don't try to get a result back from the AsyncResult! But I'm kinda hoping to get the result of the call - it's useful :).
Below are configuration specifics.
We're setting up Celery as follows for returns:
CELERY_RESULT_BACKEND = 'redis://%s' % _SHARED_WRITE_CACHE_HOST_INTERNAL

CELERY_RESULT = Celery('TEST', broker=CELERY_BROKER)
CELERY_RESULT.conf.update(
    BROKER_HEARTBEAT=60,
    CELERY_RESULT_BACKEND=CELERY_RESULT_BACKEND,
    CELERY_TASK_RESULT_EXPIRES=100,
    CELERY_IGNORE_RESULT=False,
    CELERY_RESULT_PERSISTENT=False,
    CELERY_ACCEPT_CONTENT=['json'],
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
)
We have another Celery configuration that doesn't expect a return value, and that works - in the same program. It looks like:
CELERY = Celery('TEST', broker=CELERY_BROKER)
CELERY.conf.update(
    BROKER_HEARTBEAT=60,
    CELERY_RESULT_BACKEND=CELERY_BROKER,
    CELERY_TASK_RESULT_EXPIRES=100,
    CELERY_STORE_ERRORS_EVEN_IF_IGNORED=False,
    CELERY_IGNORE_RESULT=True,
    CELERY_ACCEPT_CONTENT=['json'],
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
)
The celery producer's stub looks like:
@CELERY_RESULT.task(name='ZipUp', exchange='cognition.workflow.ZipUp_%s' % INTERNAL_VERSION)
def ZipUp(directory):  # pylint: disable=invalid-name
    """ Task stub """
    _unused_directory = directory
    raise NotImplementedError
It's been mentioned that using queue= instead of exchange= in this stub would be simpler. Can anyone confirm that (I googled but found exactly nothing on the topic)? Apparently you can just use queue= unless you want to use fanout or something fancy like that, since not all celery backends have the concept of an exchange.
Anyway, the celery consumer starts out with:
@task(queue='cognition.workflow.ZipUp_%s' % INTERNAL_VERSION, name='ZipUp')
@StatsInstrument('workflow.ZipUp')
def ZipUp(directory):  # pylint: disable=invalid-name
    '''
    Zip all files in directory, password protected, and return the pathname of the new zip archive.

    :param directory Directory to zip
    '''
    try:
        LOGGER.info('zipping up {}'.format(directory))
But "zipping up" doesn't get logged anywhere. I searched every (disk-backed) file on the celery server for that string, and got two hits: /usr/bin/zip, and my celery task's code - and no log messages.
Any suggestions?
Thanks for reading!
It appears that using the following task stub in the producer solved the problem:
@CELERY_RESULT.task(name='ZipUp', queue='cognition.workflow.ZipUp_%s' % INTERNAL_VERSION)
def ZipUp(directory):  # pylint: disable=invalid-name
    """ Task stub """
    _unused_directory = directory
    raise NotImplementedError
In short, it's using queue= instead of exchange=.
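For completeness, here's a hedged sketch of how the producer side can then call the stub and wait for the worker's result; the 30-second timeout is arbitrary, and unique_directory and LOGGER are names taken from the question's code.

# producer side -- call the queue=-based stub above and block for the result
from celery.exceptions import TimeoutError

async_result = ZipUp.delay(unique_directory)
try:
    zip_path = async_result.get(timeout=30)  # raises TimeoutError if no worker answers in time
    LOGGER.info('zip archive created at {}'.format(zip_path))
except TimeoutError:
    LOGGER.error('ZipUp task timed out')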
How do you check that a function is executed by Celery?
def notification():
    # in_celery() returns True if called from celery_test(),
    # False if called from not_celery_test()
    if in_celery():
        # Send mail directly without creation of additional celery subtask
        ...
    else:
        # Send mail with creation of celery task
        ...

@celery.task()
def celery_test():
    notification()

def not_celery_test():
    notification()
Here is one way to do it, using celery.current_task. This is the code to be used by the task:
def notification():
    from celery import current_task
    if not current_task:
        print("directly called")
    elif current_task.request.id is None:
        print("called synchronously")
    else:
        print("dispatched")

@app.task
def notify():
    notification()
This is code you can run to exercise the above:
from core.tasks import notify, notification

print("DIRECT")
notification()

print("NOT DISPATCHED")
notify()

print("DISPATCHED")
notify.delay().get()
My task code in the first snippet was in a module named core.tasks. And I shoved the code in the last snippet in a custom Django management command. This tests 3 cases:
1. Calling notification directly.
2. Calling notification through a task executed synchronously. That is, this task is not dispatched through Celery to a worker. The code of the task executes in the same process that calls notify.
3. Calling notification through a task run by a worker. The code of the task executes in a different process from the process that started it.
The output was:
NOT DISPATCHED
called synchronously
DISPATCHED
DIRECT
directly called
There is no line from the print in the task on the output after DISPATCHED because that line ends up in the worker log:
[2015-12-17 07:23:57,527: WARNING/Worker-4] dispatched
Important note: I initially was using if current_task is None in the first test but it did not work. I checked and rechecked. Somehow Celery sets current_task to an object which looks like None (if you use repr on it, you get None) but is not None. Unsure what is going on there. Using if not current_task works.
Also, I've tested the code above in a Django application but I've not used it in production. There may be gotchas I don't know.
I'm using Celery with Django for an online game.
I've written middleware to check whether Celery is available and running, based on this answer: Detect whether Celery is Available/Running
My code actually looks like this:
from celery.task.control import inspect

class CeleryCheckMiddleware(object):
    def process_request(self, request):
        insp = inspect().stats()
        if not insp:
            return render(...)
        else:
            return None
But I forgot the caveat in the comment at the bottom of that answer, 'I've discovered that the above adds two reply.celery.pidbox queues to rabbitmq every time it's run. This leads to an incremental increase in rabbitmq's memory usage.'
I'm now (only one day later!) noticing occasional 500 errors starting from the line insp = inspect().stats() and terminating with OSError: [Errno 4] Interrupted system call.
Is there a memory-safe way to check whether Celery is available and running?
This feels very heavy. You might be better off running an async task and collecting the result with an acceptable timeout. It's naive, but it shouldn't impact resources much, depending on how often you have to call it, though...
from celery.exceptions import TimeoutError
from django.conf import settings

@app.task
def celery_alive():
    return "OK"

def process_request(self, request):
    res = celery_alive.apply_async()
    try:
        return "OK" == res.get(timeout=settings.ACCEPTABLE_TRANSACTION_TIME)
    except TimeoutError:
        return False
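If you want the middleware from the question to short-circuit requests when that probe fails, a sketch could look like this; the helper name, the 2-second default timeout and the 503 JSON response are my own choices, not part of the answer above.

# sketch: wire the probe above into the middleware from the question
from celery.exceptions import TimeoutError
from django.http import JsonResponse

def celery_is_alive(timeout=2):
    try:
        return "OK" == celery_alive.apply_async().get(timeout=timeout)
    except TimeoutError:
        return False

class CeleryCheckMiddleware(object):
    def process_request(self, request):
        if not celery_is_alive():
            return JsonResponse({'detail': 'Celery workers unavailable'}, status=503)
        return None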
The script below worked for me.
import logging

# Import the celery app from the project
from application_package import app as celery_app

logger = logging.getLogger(__name__)

def get_celery_worker_status():
    insp = celery_app.control.inspect()
    nodes = insp.stats()
    if not nodes:
        raise Exception("celery is not running.")
    logger.info("celery workers are: {}".format(nodes))
    return nodes
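An illustrative call site, assuming the function above is importable (e.g. from a health-check view or a management command):

try:
    workers = get_celery_worker_status()
    print("online workers: {}".format(list(workers)))
except Exception as exc:
    print("celery check failed: {}".format(exc))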