I have a Flask server running within gunicorn.
In my flask application I want to handle large upload files (>20GB), so I plan on letting a celery task do the handling of the large file.
The problem is that retrieving the file from request.files already takes quite long, and in the meantime gunicorn terminates the worker handling that request. I could increase the timeout, but the maximum file size is currently unknown, so I don't know how much time I would need.
My plan was to make the request context available to the celery task, as described here: http://xion.io/post/code/celery-include-flask-request-context.html, but I cannot make it work.
Q1: Is the signature right?
I set the signature with
celery.signature(handle_large_file, args={}, kwargs={})
and nothing is complaining. I get the arguments I pass from the flask request handler to the celery task, but that's it. Should I somehow get a handle to the context here?
Q2: How to use the context?
I would have thought if the flask request context was available I could just use request.files in my code, but then I get the warning that I am out of context.
Using celery 4.4.0
Code:
# in celery.py:
from flask import request
from celery import Celery

celery = Celery('celery_worker',
                backend=Config.CELERY_RESULT_BACKEND,
                broker=Config.CELERY_BROKER_URL)

@celery.task(bind=True)
def handle_large_file(task_object, data):
    # do something with the large file...
    # what I'd like to do:
    files = request.files['upfile']
    ...

celery.signature(handle_large_file, args={}, kwargs={})
# in main.py
def create_app():
    app = Flask(__name__.split('.')[0])
    ...
    celery_worker.conf.update(app.config)

    # copy from the blog
    class RequestContextTask(Task): ...
    celery_worker.Task = RequestContextTask
# in Controller.py
@FILE.route("", methods=['POST'])
def upload():
    data = dict()
    ...
    handle_large_file.delay(data)
What am I missing?
Related
Context
I developed a Flask API that sends tasks to my computing environment.
To use it, you make a POST request to the API.
The API then receives your request, processes it, and sends the necessary data, through the RabbitMQ broker, as a message to be handled by the computing environment.
At the end, the result should be sent back to the API.
Some code
Here is an example of my API and my Celery application:
# main.py
# Packages
import time
from flask import Flask
from flask import request, jsonify, make_response

# Own module
from celery_app import celery_app

# Environment
app = Flask(__name__)

# Endpoint
@app.route("/test", methods=["POST"])
def test():
    """
    Test route

    Returns
    -------
    Json formatted output
    """
    # Do some preprocessing in here
    result = celery_app.send_task("tasks.Client", args=[1, 2])
    while result.state == "PENDING":
        time.sleep(0.01)
    result = result.get()
    if result["success"]:
        result_code = 200
    else:
        result_code = 500
    output = str(result)
    return make_response(
        jsonify(
            text=output,
            code_status=result_code,
        ),
        result_code,
    )

# Main thread
if __name__ == "__main__":
    app.run()
In a different file, I have set up my Celery application, connected to the RabbitMQ queue:
# celery_app.py
from celery import Celery, Task

celery_app = Celery("my_celery",
                    broker=f"amqp://{USER}:{PASSWORD}@{HOSTNAME}:{PORT}/{COLLECTION}",
                    backend="rpc://"
                    )
celery_app.conf.task_serializer = "pickle"
celery_app.conf.result_serializer = "pickle"
celery_app.conf.accept_content = ["pickle"]
celery_app.conf.broker_connection_max_retries = 5
celery_app.conf.broker_pool_limit = 1

class MyTask(Task):
    def run(self, a, b):
        return a + b

celery_app.register_task(MyTask())
To run it, you should launch:
python3 main.py
Do not forget to run the celery worker (after registering tasks in it)
Then you can make a post request on it:
curl -X POST http://localhost:8000/test
The problem to resolve
When this simple API is running, I send requests to my endpoint.
Unfortunately, it fails 1 time out of 4.
I get 2 error messages:
The first message is:
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
Then, because of the timeout, my server has lost the message, so:
File "main.py", line x, in test
result = celery_app.send_task("tasks.Client", args=[1, 2])
amqp.exceptions.InvalidCommand: Channel.close_ok: (503) COMMAND_INVALID - unimplemented method
Resolve this error
There are 2 solutions to get around this problem:
retry sending the task until it has failed 5 times in a row (try / except amqp.exceptions.InvalidCommand)
change the timeout value.
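The first option could be sketched generically like this (send_with_retry is an illustrative helper of my own; the real except clause would target amqp.exceptions.InvalidCommand):

```python
# Generic sketch of the retry idea: call `send` until it succeeds, giving up
# after it has failed `attempts` times in a row. In the real code the except
# clause would catch amqp.exceptions.InvalidCommand specifically.
def send_with_retry(send, attempts=5):
    last_exc = None
    for _ in range(attempts):
        try:
            return send()
        except Exception as exc:
            last_exc = exc
    raise last_exc
```

Usage would then be something like send_with_retry(lambda: celery_app.send_task("tasks.Client", args=[1, 2])).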
Unfortunately, these don't seem to be the best ways to solve it.
Can you help me?
Regards
PS:
my_packages:
Flask==2.0.2
python==3.6
celery==4.4.5
rabbitmq==latest
1. PreconditionFailed
I changed my RabbitMQ version from latest to 3.8.14.
Then, I set a Celery task timeout using time_limit and soft_time_limit.
And it works :)
2. InvalidCommand
To resolve this problem, I used Celery's retry functionality.
I set up:
@celery_app.task(
    max_retries=3,
    autoretry_for=(InvalidCommand,),
)
I am getting an error
redis.exceptions.ConnectionError: Error 24 connecting to redis-service:6379. Too many open files.
...
OSError: [Errno 24] Too many open files
I know this can be fixed by increasing the ulimit, but I don't think that's the issue here, and this is a service running in a container.
The application starts up and works correctly for 48 hours, and then I get the above error.
This implies that the number of connections grows over time.
What my application is basically doing
background_task (run using celery) -> collects data from postgres and sets it on redis
prometheus reaches the app at '/metrics' which is a django view -> collects data from redis and serves the data using django prometheus exporter
The code looks something like this
views.py
from prometheus_client.core import GaugeMetricFamily, REGISTRY
from my_awesome_app.taskbroker.celery import app

class SomeMetricCollector:
    def get_sample_metrics(self):
        with app.connection_or_acquire() as conn:
            client = conn.channel().client
            result = client.get('some_metric_key')
        return {'some_metric_key': result}

    def collect(self):
        sample_metrics = self.get_sample_metrics()
        for key, value in sample_metrics.items():
            yield GaugeMetricFamily(key, 'This is a custom metric', value=value)

REGISTRY.register(SomeMetricCollector())
tasks.py
# This is my boilerplate taskbroker app
from my_awesome_app.taskbroker.celery import app
# How it's collecting data from postgres is trivial to this issue.
from my_awesome_app.utility_app.utility import some_value_calculated_from_query

@app.task()
def app_metrics_sync_periodic():
    with app.connection_or_acquire() as conn:
        client = conn.channel().client
        client.set('some_metric_key', some_value_calculated_from_query(), ex=21600)
    return True
I don't think the background data collection in tasks.py is causing the Redis connections to grow; rather, it's the Django view '/metrics' in views.py that is causing it.
Can you please tell me what I am doing wrong here, or whether there is a better way to read from Redis from a Django view? The Prometheus instance scrapes the Django application every 5s.
This answer is according to my use case and research.
The issue here, as I understand it, is that each request to /metrics spawns a new thread in which views.py creates new connections in the Celery broker's connection pool.
This can be easily handled by letting Django manage its own Redis connection pool through cache backend and Celery manage its own Redis connection pool and not use each other's connection pools from their respective threads.
Django Side
config.py
# CACHES
# ------------------------------------------------------------------------------
# For more details on options for your cache backend please refer to
# https://docs.djangoproject.com/en/3.1/ref/settings/#backend
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://localhost:6379/0",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
    }
}
views.py
from prometheus_client.core import GaugeMetricFamily, REGISTRY
# *: Replacing celery app with Django cache backend
from django.core.cache import cache

class SomeMetricCollector:
    def get_sample_metrics(self):
        # *: This is how you will get the new client, which is still context managed.
        with cache.client.get_client() as client:
            result = client.get('some_metric_key')
        return {'some_metric_key': result}

    def collect(self):
        sample_metrics = self.get_sample_metrics()
        for key, value in sample_metrics.items():
            yield GaugeMetricFamily(key, 'This is a custom metric', value=value)

REGISTRY.register(SomeMetricCollector())
This will ensure that Django maintains its own Redis connection pool and does not cause new connections to be spun up unnecessarily.
Celery Side
tasks.py
# This is my boilerplate taskbroker app
from my_awesome_app.taskbroker.celery import app
# How it's collecting data from postgres is trivial to this issue.
from my_awesome_app.utility_app.utility import some_value_calculated_from_query

@app.task()
def app_metrics_sync_periodic():
    with app.connection_or_acquire() as conn:
        # *: This will force celery to always look into the existing connection pool for a connection.
        client = conn.default_channel.client
        client.set('some_metric_key', some_value_calculated_from_query(), ex=21600)
    return True
How do I monitor connections?
There is a nice prometheus celery exporter which will help you monitor your celery task activity, though I am not sure how you can add connection pool and connection monitoring to it.
The easiest way to manually verify whether the connections grow every time /metrics is hit on the web app is:
$ redis-cli
127.0.0.1:6379> CLIENT LIST
...
The CLIENT LIST command will help you see whether the number of connections is growing.
Sadly, I don't use queues, but I would recommend using them. This is how my worker runs:
$ celery -A my_awesome_app.taskbroker worker --concurrency=20 -l ERROR -E
I have created a flask application using Blueprints.
This application receives data via paho.mqtt.client.
This is also the trigger to processes the data and run processes afterwards.
'system' is a blueprint containing mqtt.py and functions.py
functions.py contains the function to process the data once received
mqtt.py contains the definition of the mqtt client
mqtt.py
import json

from app.system import functions
import paho.mqtt.client as mqtt
# ....

def on_message(mqttc, obj, msg):
    try:
        data = json.loads(msg.payload.decode('utf-8'))
        # start main process
        functions.process(data)
    except Exception as e:
        print("error: ", e)
Once I receive data and the on_message callback is triggered, I get an out-of-application-context error:
error: Working outside of application context.
This typically means that you attempted to use functionality that needed
to interface with the current application object in some way. To solve
this, set up an application context with app.app_context(). See the
documentation for more information.
How can I get the application context within the on_message callback?
I tried importing current_app and using something like this
from flask import current_app
# ...

def on_message(mqttc, obj, msg):
    try:
        data = json.loads(msg.payload.decode('utf-8'))
        app = current_app._get_current_object()
        with app.app_context():
            # start main process
            functions.process(data)
    except Exception as e:
        print("error: ", e)
I still get the same error
There is this package - https://flask-mqtt.readthedocs.io/en/latest/ - that might help, but it only works with one worker instance.
Most of the time you set the application context when you create the app object.
So wherever you create your app is where you should initialize the extension. In your case it sounds like functions.py needs mqtt.py to carry out its logic, so you should initialize your mqtt client in your application creation.
From the flask docs - http://flask.pocoo.org/docs/1.0/appcontext/
If you see that error while configuring your application, such as when
initializing an extension, you can push a context manually since you
have direct access to the app. Use app_context() in a with block, and
everything that runs in the block will have access to current_app.
def create_app():
    app = Flask(__name__)
    with app.app_context():
        init_db()
        # initialize the mqtt client here
    return app
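If initializing the client inside create_app() is not convenient, another option (my own sketch, with `process` standing in for functions.process) is to give the callback a reference to the concrete app object instead of relying on current_app, which is not set in paho's background network thread:

```python
# Sketch: hold a reference to the real app object and push its context inside
# the callback thread; `process` stands in for functions.process here.
import json
from flask import Flask, current_app

app = Flask(__name__)

def process(data):
    # Can now use current_app, extensions, db, etc.
    return current_app.name, data

def on_message(mqttc, obj, msg):
    data = json.loads(msg.payload.decode('utf-8'))
    with app.app_context():
        return process(data)
```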
I'd like to call generate_async_audio_service from a view and have it asynchronously generate audio files for the list of words using a threading pool and then commit them to a database.
I keep running into an error saying that I'm working outside of the application context, even though I'm creating a new polly and s3 instance each time.
How can I generate/upload multiple audio files at once?
from flask import current_app
from multiprocessing.pool import ThreadPool
from Server.database import db
import boto3
import io
import uuid

def upload_audio_file_to_s3(file):
    app = current_app._get_current_object()
    with app.app_context():
        s3 = boto3.client(service_name='s3',
                          aws_access_key_id=app.config.get('BOTO3_ACCESS_KEY'),
                          aws_secret_access_key=app.config.get('BOTO3_SECRET_KEY'))
        extension = file.filename.rsplit('.', 1)[1].lower()
        file.filename = f"{uuid.uuid4().hex}.{extension}"
        s3.upload_fileobj(file,
                          app.config.get('S3_BUCKET'),
                          f"{app.config.get('UPLOADED_AUDIO_FOLDER')}/{file.filename}",
                          ExtraArgs={"ACL": 'public-read', "ContentType": file.content_type})
    return file.filename

def generate_polly(voice_id, text):
    app = current_app._get_current_object()
    with app.app_context():
        polly_client = boto3.Session(
            aws_access_key_id=app.config.get('BOTO3_ACCESS_KEY'),
            aws_secret_access_key=app.config.get('BOTO3_SECRET_KEY'),
            region_name=app.config.get('AWS_REGION')).client('polly')
        response = polly_client.synthesize_speech(VoiceId=voice_id,
                                                  OutputFormat='mp3', Text=text)
    return response['AudioStream'].read()

def generate_polly_from_term(vocab_term, gender='m'):
    app = current_app._get_current_object()
    with app.app_context():
        audio = generate_polly('Celine', vocab_term.term)
        file = io.BytesIO(audio)
        file.filename = 'temp.mp3'
        file.content_type = 'mp3'
    return vocab_term.id, upload_audio_file_to_s3(file)

def generate_async_audio_service(terms):
    pool = ThreadPool(processes=12)
    results = pool.map(generate_polly_from_term, terms)
    # do something w/ results
This is not necessarily a fleshed-out answer, but rather than putting things into comments I'll put it here.
Celery is a task manager for Python. The reason you would want to use it: if tasks ping Flask but take longer to finish than the interval at which new tasks come in, certain tasks will be blocked and you won't get all of your results. To fix this, you hand the work to another process. This goes like so:
1) Client sends a request to Flask to process audio files
2) The files land in Flask to be processed; Flask will send an asynchronous task to Celery.
3) Celery is notified of the task and stores its state in some sort of messaging system (RabbitMQ and Redis are the canonical examples)
4) Flask is now unburdened from that task and can receive more
5) Celery finishes the task, including the upload to your database
Celery and Flask are then two separate python processes communicating with one another. That should satisfy your multithreaded approach. You can also retrieve the state from a task through Flask if you want the client to verify that the task was/was not completed. The route in your Flask app.py would look like:
@app.route('/my-route', methods=['POST'])
def process_audio():
    # Get your files and save to common temp storage
    save_my_files(target_dir, files)
    response = celery_app.send_task('celery_worker.files', args=[target_dir])
    return jsonify({'task_id': response.task_id})
Where celery_app comes from another module worker.py:
import os
from celery import Celery
env = os.environ
# This is for a rabbitMQ backend
CELERY_BROKER_URL = env.get('CELERY_BROKER_URL', 'amqp://0.0.0.0:5672/0')
CELERY_RESULT_BACKEND = env.get('CELERY_RESULT_BACKEND', 'rpc://')
celery_app = Celery('tasks', broker=CELERY_BROKER_URL, backend=CELERY_RESULT_BACKEND)
Then, your celery process would have a worker configured something like:
import os
from celery import Celery
from celery.signals import after_task_publish

env = os.environ

CELERY_BROKER_URL = env.get('CELERY_BROKER_URL')
CELERY_RESULT_BACKEND = env.get('CELERY_RESULT_BACKEND', 'rpc://')

# Set celery_app with name 'tasks' using the above broker and backend
celery_app = Celery('tasks', broker=CELERY_BROKER_URL, backend=CELERY_RESULT_BACKEND)

@celery_app.task(name='celery_worker.files')
def async_files(path):
    # Get file from path
    # Process
    # Upload to database
    # This is just if you want to return an actual result; fill it in with whatever
    return {'task_state': "FINISHED"}
This is relatively basic, but could serve as a starting point. I will say that Celery's behavior and setup are not always the most intuitive, but this will leave your Flask app available to whoever wants to send files to it without blocking anything else.
Hopefully that's somewhat helpful.
I'm trying to deploy a flask app on heroku that uses background tasks in Celery. I've implemented the application factory pattern so that the celery processes are not bound to any one instance of the flask app.
This works locally, and I have yet to see an error. But when deployed to heroku, the same results always occur: the celery task (I'm only using one) succeeds the first time it is run, but any subsequent celery calls to that task fail with sqlalchemy.exc.DatabaseError: (psycopg2.DatabaseError) SSL error: decryption failed or bad record mac. If I restart the celery worker, the cycle continues.
There are multiple issues that show this same error, but none specify a proper solution. I initially believed implementing the application factory pattern would have prevented this error from manifesting, but it's not quite there.
In app/__init__.py I create the celery and db objects:
celery = Celery(__name__, broker=Config.CELERY_BROKER_URL)
db = SQLAlchemy()

def create_app(config_name):
    app = Flask(__name__)
    app.config.from_object(config[config_name])
    db.init_app(app)
    return app
My flask_celery.py file creates the actual Flask app object:
import os
from app import celery, create_app
app = create_app(os.getenv('FLASK_CONFIG', 'default'))
app.app_context().push()
And I start celery with this command:
celery worker -A app.flask_celery.celery --loglevel=info
This is what the actual celery task looks like:
@celery.task()
def task_process_stuff(stuff_id):
    stuff = Stuff.query.get(stuff_id)
    stuff.processed = True
    db.session.add(stuff)
    db.session.commit()
    return stuff
Which is invoked by:
task_process_stuff.apply_async(args=[stuff.id], countdown=10)
Library Versions
Flask 0.12.2
SQLAlchemy 1.1.11
Flask-SQLAlchemy 2.2
Celery 4.0.2
The solution was to add db.engine.dispose() at the beginning of the task, disposing of all db connections before any work begins:
@celery.task()
def task_process_stuff(stuff_id):
    db.engine.dispose()
    stuff = Stuff.query.get(stuff_id)
    stuff.processed = True
    db.session.commit()
    return stuff
As I need this functionality across all of my tasks, I added it to task_prerun:
@task_prerun.connect
def on_task_init(*args, **kwargs):
    db.engine.dispose()