I'd like to call generate_async_audio_service from a view and have it asynchronously generate audio files for a list of words using a thread pool, then commit them to the database.
I keep running into an error saying I'm working outside of the application context, even though I'm creating a new Polly and S3 instance each time.
How can I generate and upload multiple audio files at once?
from flask import current_app
from multiprocessing.pool import ThreadPool
from Server.database import db
import boto3
import io
import uuid
def upload_audio_file_to_s3(file):
app = current_app._get_current_object()
with app.app_context():
s3 = boto3.client(service_name='s3',
aws_access_key_id=app.config.get('BOTO3_ACCESS_KEY'),
aws_secret_access_key=app.config.get('BOTO3_SECRET_KEY'))
extension = file.filename.rsplit('.', 1)[1].lower()
file.filename = f"{uuid.uuid4().hex}.{extension}"
s3.upload_fileobj(file,
app.config.get('S3_BUCKET'),
f"{app.config.get('UPLOADED_AUDIO_FOLDER')}/{file.filename}",
ExtraArgs={"ACL": 'public-read', "ContentType": file.content_type})
return file.filename
def generate_polly(voice_id, text):
app = current_app._get_current_object()
with app.app_context():
polly_client = boto3.Session(
aws_access_key_id=app.config.get('BOTO3_ACCESS_KEY'),
aws_secret_access_key=app.config.get('BOTO3_SECRET_KEY'),
region_name=app.config.get('AWS_REGION')).client('polly')
response = polly_client.synthesize_speech(VoiceId=voice_id,
OutputFormat='mp3', Text=text)
return response['AudioStream'].read()
def generate_polly_from_term(vocab_term, gender='m'):
app = current_app._get_current_object()
with app.app_context():
audio = generate_polly('Celine', vocab_term.term)
file = io.BytesIO(audio)
file.filename = 'temp.mp3'
file.content_type = 'mp3'
return vocab_term.id, upload_audio_file_to_s3(file)
def generate_async_audio_service(terms):
pool = ThreadPool(processes=12)
results = pool.map(generate_polly_from_term, terms)
# do something w/ results
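For context on the error: current_app only resolves in a thread that has an application context pushed, so the ThreadPool workers need the real app object handed to them and a context pushed explicitly. A rough sketch of that idea, reusing generate_polly_from_term from above (an illustration, not a verified fix):
from functools import partial
from multiprocessing.pool import ThreadPool
from flask import current_app

def generate_polly_from_term_with_app(app, vocab_term, gender='m'):
    # push an app context inside the worker thread before doing any work
    with app.app_context():
        return generate_polly_from_term(vocab_term, gender)

def generate_async_audio_service(terms):
    # grab the real app object while still in the request thread
    app = current_app._get_current_object()
    with ThreadPool(processes=12) as pool:
        results = pool.map(partial(generate_polly_from_term_with_app, app), terms)
    # do something w/ results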
This is not necessarily a fleshed-out answer, but rather than putting things into comments I'll put it here.
Celery is a task queue for Python. The reason you would want to use it: if tasks hit Flask faster than they can be finished, some of them will be blocked and you won't get all of your results. To fix this, you hand the work off to another process. This goes like so:
1) Client sends a request to Flask to process audio files
2) The files land in Flask to be processed; Flask sends an asynchronous task to Celery.
3) Celery is notified of the task and stores its state in some sort of messaging system (RabbitMQ and Redis are the canonical examples)
4) Flask is now unburdened from that task and can receive more
5) Celery finishes the task, including the upload to your database
Celery and Flask are then two separate python processes communicating with one another. That should satisfy your multithreaded approach. You can also retrieve the state from a task through Flask if you want the client to verify that the task was/was not completed. The route in your Flask app.py would look like:
@app.route('/my-route', methods=['POST'])
def process_audio():
# Get your files and save to common temp storage
save_my_files(target_dir, files)
response = celery_app.send_task('celery_worker.files', args=[target_dir])
return jsonify({'task_id': response.task_id})
Where celery_app comes from another module worker.py:
import os
from celery import Celery
env = os.environ
# This is for a RabbitMQ broker (with the rpc:// result backend)
CELERY_BROKER_URL = env.get('CELERY_BROKER_URL', 'amqp://0.0.0.0:5672/0')
CELERY_RESULT_BACKEND = env.get('CELERY_RESULT_BACKEND', 'rpc://')
celery_app = Celery('tasks', broker=CELERY_BROKER_URL, backend=CELERY_RESULT_BACKEND)
Then, your Celery process would have a worker configured something like:
import os
from celery import Celery
from celery.signals import after_task_publish
env = os.environ
CELERY_BROKER_URL = env.get('CELERY_BROKER_URL')
CELERY_RESULT_BACKEND = env.get('CELERY_RESULT_BACKEND', 'rpc://')
# Set celery_app with name 'tasks' using the above broker and backend
celery_app = Celery('tasks', broker=CELERY_BROKER_URL, backend=CELERY_RESULT_BACKEND)
@celery_app.task(name='celery_worker.files')
def async_files(path):
# Get file from path
# Process
# Upload to database
# This is just if you want to return an actual result, you can fill this in with whatever
return {'task_state': "FINISHED"}
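And if you want the status check mentioned earlier, a minimal sketch on the Flask side (assuming the same app and the celery_app imported from worker.py; the route name is made up):
@app.route('/my-route/<task_id>', methods=['GET'])
def check_task(task_id):
    # look up the task state stored in the result backend
    result = celery_app.AsyncResult(task_id)
    return jsonify({'task_id': task_id, 'state': result.state})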
This is relatively basic, but could serve as a starting point. I will say that some of Celery's behavior and setup is not always the most intuitive, but this will leave your Flask app available to whoever wants to send files to it without blocking anything else.
Hopefully that's somewhat helpful
I have a FastAPI app from which I want to call a Celery task.
I cannot import the task because the two live in different code bases, so I have to call it by name.
in tasks.py
import os
from celery import Celery
imagery = Celery(
    "imagery", broker=os.getenv("BROKER_URL"), backend=os.getenv("REDIS_URL")
)
...
@imagery.task(bind=True, name="filter")
def filter_task(self, **kwargs) -> Dict[str, Any]:
print('running task')
The celery worker is running with this command:
celery worker -A worker.imagery -P threads --loglevel=INFO --queues=imagery
Now in my FastAPI code base I want to run the filter task.
So my understanding is I have to use the celery.send_task() function
In app.py I have
import os
from celery import Celery, states
from celery.execute import send_task
from fastapi import FastAPI
from starlette.responses import JSONResponse, PlainTextResponse
from app import models
app = FastAPI()
tasks = Celery(broker=os.getenv("BROKER_URL"), backend=os.getenv("REDIS_URL"))
@app.post("/filter", status_code=201)
async def upload_images(data: models.FilterProductsModel):
"""
TODO: use a celery task(s) to query the database and upload the results to S3
"""
data = ['ok', 'un test']
result = tasks.send_task('workers.imagery.filter', args=list(data))
return PlainTextResponse(f"here is the id: {str(result.ready())}")
After calling the /filter endpoint, I don't see any task being picked up by the worker.
So I tried different names in send_task():
filter
imagery.filter
worker.imagery.filter
How come my task never gets picked up by the worker and nothing shows in the log?
Is my task name wrong?
Edit:
The worker process runs in Docker. Here is the full path of the file on its disk.
tasks.py : /workers/worker.py
So if I follow the import scheme, the name of the task would be workers.worker.filter, but this does not work and nothing gets printed in the Docker logs. Is a print supposed to appear in the STDOUT of the Celery CLI?
Your Celery worker is subscribed to the imagery queue only. On the other hand, you send the task to the default queue (if you did not change the configuration, that queue is named celery) with result = tasks.send_task('workers.imagery.filter', args=list(data)). It is not surprising that you do not see the task being executed by your worker, as you have been sending tasks to the default queue the whole time.
To fix this, try the following:
result = tasks.send_task('workers.imagery.filter', args=list(data), queue='imagery')
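Alternatively (a sketch, not tested here), you can route that task name to the queue once via configuration, so plain send_task() calls land in the right place; the name must match whatever the worker has the task registered under:
tasks.conf.task_routes = {'workers.imagery.filter': {'queue': 'imagery'}}
result = tasks.send_task('workers.imagery.filter', args=list(data))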
OP Here.
This is the solution I used.
from celery import signature
task = signature("filter", kwargs=data.dict(), queue="imagery")
res = task.delay()
As mentioned by @DejanLekic, I had to specify the queue.
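If you later want to check on the task from FastAPI, a small sketch (assuming the tasks Celery instance from app.py above; the endpoint name is made up):
from celery.result import AsyncResult

@app.get("/filter/{task_id}")
async def filter_status(task_id: str):
    # read the task state from the result backend
    res = AsyncResult(task_id, app=tasks)
    return {"task_id": task_id, "state": res.state}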
TLDR:
I need to set up a Flask app with multiprocessing such that the API and the STOMP queue listener run in separate processes and therefore do not interfere with each other's operations.
Details:
I am building a Python Flask app that has API endpoints and also creates a message queue listener to connect to an ActiveMQ queue with the stomp package.
I need to implement multiprocessing such that the API and listener do not block each other's operation. That way the API will accept new requests and the listener will continue to listen for new messages and carry out tasks accordingly.
A simplified version of the code is shown below (some details are omitted for brevity).
Problem: The multiprocessing is causing the application to get stuck. The worker's run method is not called consistently, and therefore the listener never gets created.
# Start the worker as a subprocess -- this is not working -- app gets stuck before the worker's run method is called
m = Manager()
shared_state = m.dict()
worker = MyWorker(shared_state=shared_state)
worker.start()
After several days of troubleshooting, I suspect the problem is that the multiprocessing is not set up correctly. I was able to confirm this: when I stripped out all of the multiprocessing code and called the worker's run method directly, all of the queue management code worked correctly; the CustomWorker module created the listener, created the message, and picked up the message. This indicates that the queue management code is fine and that the source of the problem is most likely the multiprocessing setup.
# Removing the multiprocessing and calling the worker's run method directly works without getting stuck so the issue is likely due to multiprocessing not being setup correctly
worker = MyWorker()
worker.run()
Here is the code I have so far:
App
This part of the code creates the API and attempts to create a new process to create the queue listener. The 'custom_worker_utils' module is a custom module that creates the stomp listener in the CustomWorker() class run method.
from flask import Flask, request, make_response, jsonify
from flask_restx import Resource, Api
import sys, os, logging, time
basedir = os.path.dirname(os.getcwd())
sys.path.append('..')
from custom_worker_utils.custom_worker_utils import *
from multiprocessing import Manager
# app.py
def create_app():
app = Flask(__name__)
app.config['BASE_DIR'] = basedir
api = Api(app, version='1.0', title='MPS Worker', description='MPS Common Worker')
logger = get_logger()
'''
This is a placeholder to trigger the sending of a message to the first queue
'''
@api.route('/initialapicall', endpoint="initialapicall", methods=['GET', 'POST', 'PUT', 'DELETE'])
class InitialApiCall(Resource):
#Sends a message to the queue
def get(self, *args, **kwargs):
mqconn = get_mq_connection()
message = create_queue_message(initial_tracker_file)
mqconn.send('/queue/test1', message, headers = {"persistent":"true"})
return make_response(jsonify({'message': 'Initial Test Call Worked!'}), 200)
# Start the worker as a subprocess -- this is not working -- app gets stuck before the worker's run method is called
m = Manager()
shared_state = m.dict()
worker = MyWorker(shared_state=shared_state)
worker.start()
# Removing the multiprocessing and calling the worker's run method directly works without getting stuck so the issue is likely due to multiprocessing not being setup correctly
#worker = MyWorker()
#worker.run()
return app
Custom worker utils
The run() method connects to the queue and creates the listener with the stomp package.
# custom_worker_utils.py
from multiprocessing import Manager, Process
from _datetime import datetime
import os, time, json, stomp, requests, logging, random
'''
The listener
'''
class MyListener(stomp.ConnectionListener):
def __init__(self, p):
self.process = p
self.logger = p.logger
self.conn = p.mqconn
self.conn.connect(_user, _password, wait=True)
self.subscribe_to_queue()
def on_message(self, headers, message):
message_data = json.loads(message)
ticket_id = message_data[constants.TICKET_ID]
prev_status = message_data[constants.PREVIOUS_STEP_STATUS]
task_name = message_data[constants.TASK_NAME]
#Run the service
if prev_status == "success":
resp = self.process.do_task(ticket_id, task_name)
elif hasattr(self, 'revert_task'):
resp = self.process.revert_task(ticket_id, task_name)
else:
resp = True
if (resp):
self.logger.debug('Acknowledging')
self.logger.debug(resp)
self.conn.ack(headers['message-id'], self.process.conn_id)
else:
self.conn.nack(headers['message-id'], self.process.conn_id)
def on_disconnected(self):
self.conn.connect('admin', 'admin', wait=True)
self.subscribe_to_queue()
def subscribe_to_queue(self):
queue = os.getenv('QUEUE_NAME')
self.conn.subscribe(destination=queue, id=self.process.conn_id, ack='client-individual')
def get_mq_connection():
conn = stomp.Connection([(_host, _port)], heartbeats=(4000, 4000))
conn.connect(_user, _password, wait=True)
return conn
class CustomWorker(Process):
def __init__(self, **kwargs):
super(CustomWorker, self).__init__()
self.logger = logging.getLogger("Worker Log")
log_level = os.getenv('LOG_LEVEL', 'WARN')
self.logger.setLevel(log_level)
self.mqconn = get_mq_connection()
self.conn_id = random.randrange(1,100)
for k, v in kwargs.items():
setattr(self, k, v)
def revert_task(self, ticket_id, task_name):
# If the subclass does not implement this,
# then there is nothing to undo so just return True
return True
def run(self):
lst = MyListener(self)
self.mqconn.set_listener('queue_listener', lst)
while True:
pass
Seems like Celery is exactly what you need.
Celery is a task queue that can distribute work across worker-processes and even across machines.
Miguel Grinberg wrote a great post about this, showing how to accept tasks via Flask and run them as Celery tasks.
Good Luck!
To resolve this issue I have decided to run the Flask API and the message queue listener as two entirely separate applications in the same Docker container. I have installed and configured supervisord to start and manage the two processes individually.
[supervisord]
nodaemon=true
logfile=/home/appuser/logs/supervisord.log
[program:gunicorn]
command=gunicorn -w 1 -c gunicorn.conf.py "app:create_app()" -b 0.0.0.0:8081 --timeout 10000
directory=/home/appuser/app
user=appuser
autostart=true
autorestart=true
stdout_logfile=/home/appuser/logs/supervisord_worker_stdout.log
stderr_logfile=/home/appuser/logs/supervisord_worker_stderr.log
[program:mqlistener]
command=python3 start_listener.py
directory=/home/appuser/mqlistener
user=appuser
autostart=true
autorestart=true
stdout_logfile=/home/appuser/logs/supervisord_mqlistener_stdout.log
stderr_logfile=/home/appuser/logs/supervisord_mqlistener_stderr.log
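start_listener.py is not shown above; a minimal sketch of it (assuming it simply reuses the CustomWorker class from the question, run directly as its own process rather than forked from Flask):
from custom_worker_utils.custom_worker_utils import CustomWorker

if __name__ == '__main__':
    worker = CustomWorker()
    worker.run()  # blocks forever, keeping the stomp listener subscribed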
I have a very simple Flask app example which uses a celery worker to process a task asynchronously:
app.py
app.config['CELERY_BROKER_URL'] = os.environ.get('REDISCLOUD_URL', 'redis://localhost:6379')
app.config['CELERY_RESULT_BACKEND']= os.environ.get('REDISCLOUD_URL', 'redis://localhost:6379')
app.config['SQLALCHEMY_DATABASE_URI'] = conn_str
celery = make_celery(app)
db.init_app(app)
@app.route('/')
def index():
return "Working"
@app.route('/test')
def test():
task = reverse.delay("hello")
return task.id
@celery.task(name='app.reverse')
def reverse(string):
return string[::-1]
if __name__ == "__main__":
app.run()
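For reference, make_celery here is along the lines of the standard Flask docs factory pattern (a sketch; the exact helper is not shown above):
from celery import Celery

def make_celery(app):
    celery = Celery(app.import_name,
                    broker=app.config['CELERY_BROKER_URL'],
                    backend=app.config['CELERY_RESULT_BACKEND'])
    celery.conf.update(app.config)

    class ContextTask(celery.Task):
        def __call__(self, *args, **kwargs):
            # run every task inside a Flask app context
            with app.app_context():
                return self.run(*args, **kwargs)

    celery.Task = ContextTask
    return celery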
To run it locally, I run celery -A app.celery worker --loglevel=INFO
in one terminal, and python app.py in another terminal.
I'm wondering how I can deploy this application on Google Cloud. I don't want to use App Engine Task Queues since they are only compatible with Python 2. Is there a good piece of documentation for doing something like this? Thanks
App Engine Task Queues is the previous version of Google Cloud Tasks, which has full support for App Engine Flex/Standard and Python 3.x runtimes.
You need to create a Cloud Tasks queue and an App Engine service to handle the tasks.
gcloud command to create a queue:
gcloud tasks queues create [QUEUE_ID]
Task handler code
from flask import Flask, request
app = Flask(__name__)
@app.route('/example_task_handler', methods=['POST'])
def example_task_handler():
"""Log the request payload."""
payload = request.get_data(as_text=True) or '(empty payload)'
print('Received task with payload: {}'.format(payload))
return 'Printed task payload: {}'.format(payload)
Code to push a task
"""Create a task for a given queue with an arbitrary payload."""
import datetime
from google.cloud import tasks_v2
client = tasks_v2.CloudTasksClient()
# replace with your values.
# project = 'my-project-id'
# queue = 'my-appengine-queue'
# location = 'us-central1'
# payload = 'hello'
# in_seconds = None
parent = client.queue_path(project, location, queue)
# Construct the request body.
task = {
'app_engine_http_request': { # Specify the type of request.
'http_method': tasks_v2.HttpMethod.POST,
'relative_uri': '/example_task_handler'
}
}
if payload is not None:
# The API expects a payload of type bytes.
converted_payload = payload.encode()
# Add the payload to the request.
task['app_engine_http_request']['body'] = converted_payload
if in_seconds is not None:
timestamp = datetime.datetime.utcnow() + datetime.timedelta(seconds=in_seconds)
# Add the timestamp to the tasks.
task['schedule_time'] = timestamp
# Use the client to build and send the task.
response = client.create_task(parent=parent, task=task)
print('Created task {}'.format(response.name))
return response
requirements.txt
Flask==1.1.2
gunicorn==20.0.4
google-cloud-tasks==2.0.0
You can check the full example on the GCP Python samples GitHub page.
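Wired into a Flask app like yours, the push snippet above might look roughly like this (a sketch; the project, location, and queue names are placeholders):
from flask import Flask
from google.cloud import tasks_v2

app = Flask(__name__)
client = tasks_v2.CloudTasksClient()
parent = client.queue_path('my-project-id', 'us-central1', 'my-appengine-queue')

@app.route('/test')
def test():
    # enqueue a Cloud Task instead of a Celery task
    task = {
        'app_engine_http_request': {
            'http_method': tasks_v2.HttpMethod.POST,
            'relative_uri': '/example_task_handler',
            'body': 'hello'.encode(),
        }
    }
    response = client.create_task(parent=parent, task=task)
    return response.name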
I have a Flask server running within gunicorn.
In my Flask application I want to handle large file uploads (>20 GB), so I plan on letting a Celery task do the handling of the large file.
The problem is that retrieving the file from request.files already takes quite long; in the meantime, gunicorn terminates the worker handling that request. I could increase the timeout, but the maximum file size is currently unknown, so I don't know how much time I would need.
My plan was to make the request context available to the Celery task, as described here: http://xion.io/post/code/celery-include-flask-request-context.html, but I cannot make it work.
Q1 Is the signature right?
I set the signature with
celery.signature(handle_large_file, args={}, kwargs={})
and nothing complains. I get the arguments I pass from the Flask request handler to the Celery task, but that's it. Should I somehow get a handle to the context here?
Q2 how to use the context?
I would have thought that if the Flask request context were available I could just use request.files in my code, but then I get the warning that I am out of context.
Using celery 4.4.0
Code:
# in celery.py:
from flask import request
from celery import Celery
celery = Celery('celery_worker',
backend=Config.CELERY_RESULT_BACKEND,
broker=Config.CELERY_BROKER_URL)
@celery.task(bind=True)
def handle_large_file(task_object, data):
# do something with the large file...
# what I'd like to do:
files = request.files['upfile']
...
celery.signature(handle_large_file, args={}, kwargs={})
# in main.py
def create_app():
app = Flask(__name__.split('.')[0])
...
celery_worker.conf.update(app.config)
# copy from the blog
class RequestContextTask(Task):...
celery_worker.Task = RequestContextTask
# in Controller.py
@FILE.route("", methods=['POST'])
def upload():
data = dict()
...
handle_large_file.delay(data)
What am I missing?
I'm building a Flask application which relies on Celery to process some long running tasks. Each task will essentially append a dictionary to a shared list once it has finished processing - this list is shared by the celery workers and the routes of the Flask application. The Flask component essentially consists of a set of routes to retrieve the contents of the shared list and modify the order of the elements.
I think I have successfully shared the list between the Celery workers using a Manager from Python's multiprocessing module. However, the changes made to this list are not seen by the Flask application. Here is a minimal application which illustrates the issue:
import os
import json
from flask import Flask
from multiprocessing import Manager
from celery import Celery
application = Flask(__name__)
redis_url = os.environ.get('REDIS_URL')
if redis_url is None:
redis_url = 'redis://localhost:6379/0'
# Set the secret key to enable cookies
application.secret_key = 'some secret key'
application.config['SESSION_TYPE'] = 'filesystem'
# Redis and Celery configuration
application.config['BROKER_URL'] = redis_url
application.config['CELERY_RESULT_BACKEND'] = redis_url
celery = Celery(application.name, broker=redis_url)
celery.conf.update(BROKER_URL=redis_url,
CELERY_RESULT_BACKEND=redis_url)
manager = Manager()
shared_queue = manager.list() # THIS IS THE SHARED LIST
@application.route("/submit", methods=['GET'])
def submit_song():
add_song_to_queue.delay()
return 'Added a song to the queue'
@application.route("/playlist", methods=['GET', 'POST'])
def get_playlist():
playlist = []
i = 0
queue_size = len(shared_queue)
while i < queue_size:
    print(shared_queue[i])
    playlist.append(shared_queue[i])
    i += 1
return json.dumps(playlist)
@celery.task
def add_song_to_queue():
shared_queue.append({'some':'data!'})
print(len(shared_queue))
if __name__ == "__main__":
application.run(host='0.0.0.0', debug=True)
In the Celery logs I can clearly see that the dictionaries are being appended to the list and that the size of the list increases. However, when I access the /playlist route in my browser, I always get an empty list.
Does anyone know how I can get the list to be shared among all the workers and the Flask application?
I found a solution by moving away from Celery and instead using multiprocessing.Pool as a task queue, with shared memory through a Manager, as shown in the sample code in the question. This link has an excellent example of how this approach can be integrated with Flask: http://gouthamanbalaraman.com/blog/python-multiprocessing-as-a-task-queue.html
from multiprocessing import Pool
from flask import Flask
app = Flask(__name__)
_pool = None
def expensive_function(x):
# import packages that are used in this function
# do your expensive, time-consuming work here
return x*x
@app.route('/expensive_calc/<int:x>')
def route_expcalc(x):
f = _pool.apply_async(expensive_function,[x])
r = f.get(timeout=2)
return 'Result is %d'%r
if __name__=='__main__':
_pool = Pool(processes=4)
try:
# insert production server deployment code
app.run()
except KeyboardInterrupt:
_pool.close()
_pool.join()
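Applied back to the playlist example from the question, a sketch of the same pattern with a Manager list shared between the Flask process and the pool workers (names assumed; not a drop-in replacement):
import json
from multiprocessing import Manager, Pool
from flask import Flask

app = Flask(__name__)
_pool = None
_shared_queue = None  # a Manager().list() proxy, created in __main__

def add_song_to_queue(queue):
    # runs in a pool worker; appends through the manager proxy
    queue.append({'some': 'data!'})
    return len(queue)

@app.route('/submit', methods=['GET'])
def submit_song():
    _pool.apply_async(add_song_to_queue, [_shared_queue])
    return 'Added a song to the queue'

@app.route('/playlist', methods=['GET', 'POST'])
def get_playlist():
    return json.dumps(list(_shared_queue))

if __name__ == '__main__':
    manager = Manager()
    _shared_queue = manager.list()
    _pool = Pool(processes=4)
    app.run(host='0.0.0.0')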