Cannot access user_data when using job queue - python

I have a function that I need to be able to execute both by sending a command and automatically via a job queue. Inside that function I need access to user_data to store data that is specific to each user. The problem is that when I execute it via the job queue, user_data is None and thus unavailable. How can I fix that?
Here's how I'm currently doing this, simplified:
import datetime

from settings import TEST_TOKEN, BACKUP_USER
from telegram.ext import Updater, CommandHandler, CallbackContext
from pytz import timezone


def job_daily(context: CallbackContext):
    job(BACKUP_USER, context)


def job_command(update, context):
    job(update.message.chat_id, context)


def job(chat_id, context):
    print(context.user_data)


def main():
    updater = Updater(TEST_TOKEN, use_context=True)
    dispatcher = updater.dispatcher
    job_queue = updater.job_queue

    # To run it automatically
    tehran = timezone("Asia/Tehran")
    due = datetime.time(15, 3, tzinfo=tehran)
    job_queue.run_daily(job_daily, due)

    # To run it via command
    dispatcher.add_handler(CommandHandler("job", job_command))

    updater.start_polling()
    updater.idle()


if __name__ == "__main__":
    main()
Now when I send the command /job, thus executing job_command, the job function prints {}, which means that I can access user_data. But when job_daily is executed, the job function prints None, meaning that I don't have access to user_data. The same goes for chat_data.

In a callback function of python-telegram-bot, context.user_data and context.chat_data depend on the update. More precisely, PTB takes update.effective_user/chat.id and provides the corresponding user/chat_data. Job callbacks are not triggered by an update (but by a time-based trigger), so there's no reasonable way to provide context.user_data.
What you can do, when scheduling the job from within a handler callback, where user_data is available, is to pass it as context argument to the job:
context.job_queue.run_*(..., context=context.user_data)
Then within the job callback, you can retrieve it as user_data = context.job.context
In your case, you schedule the job in main, which is not a handler callback and hence you don't have context.user_data (not even context). If you have a specific user id for which you'd like to pass the user_data, you can get that user_data as user_data = updater.dispatcher.user_data[user_id], which is the same object as context.user_data (for updates from this particular user).
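For illustration, here's a minimal sketch of both variants, assuming the v13 Updater-based API from the question (the reminder callback and the 60-second delay are made-up examples):

# 1) Scheduling from a handler callback, where user_data is available:
def remind_command(update, context):
    context.job_queue.run_once(job_reminder, 60, context=context.user_data)

def job_reminder(context: CallbackContext):
    user_data = context.job.context  # the dict passed above
    print(user_data)

# 2) Scheduling in main() for one specific user id:
def schedule_daily(updater, due):
    user_data = updater.dispatcher.user_data[BACKUP_USER]  # same object as context.user_data for that user
    updater.job_queue.run_daily(job_daily, due, context=user_data)
    # inside job_daily, read it back via context.job.context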
Disclaimer: I'm currently the maintainer of python-telegram-bot.

Related

How to stop execution of FastAPI endpoint after a specified time to reduce CPU resource usage/cost?

Use case
The client microservice, which calls /do_something, has a 60-second timeout on its requests.post() call. This timeout is fixed and can't be changed. So if /do_something takes 10 minutes, the client is no longer waiting for the response after 60 seconds, yet /do_something keeps running and wasting CPU for the full 10 minutes, which increases the cost. We have a limited budget.
The current code looks like this:
import time
from uvicorn import Server, Config
from random import randrange
from fastapi import FastAPI

app = FastAPI()


def some_func(text):
    """
    Some computationally heavy function
    whose execution time depends on input text size
    """
    randinteger = randrange(1, 120)
    time.sleep(randinteger)  # simulate processing of text
    return text


@app.get("/do_something")
async def do_something():
    response = some_func(text="hello world")
    return {"response": response}


# Running
if __name__ == '__main__':
    server = Server(Config(app=app, host='0.0.0.0', port=3001))
    server.run()
Desired Solution
Here /do_something should stop processing the current request after 60 seconds and wait for the next request to process.
If execution of the endpoint is force-stopped after 60 seconds, we should be able to log it with a custom message.
This should not kill the service and should work with multithreading/multiprocessing.
I tried this, but when the timeout happens the server gets killed.
Any solution to fix this?
import logging
import time
import timeout_decorator
from uvicorn import Server, Config
from random import randrange
from fastapi import FastAPI

app = FastAPI()


@timeout_decorator.timeout(seconds=2, timeout_exception=StopIteration, use_signals=False)
def some_func(text):
    """
    Some computationally heavy function
    whose execution time depends on input text size
    """
    randinteger = randrange(1, 30)
    time.sleep(randinteger)  # simulate processing of text
    return text


@app.get("/do_something")
async def do_something():
    try:
        response = some_func(text="hello world")
    except StopIteration:
        logging.warning('Stopped </do_something> endpoint due to timeout!')
    else:
        logging.info('Completed </do_something> endpoint')
        return {"response": response}


# Running
if __name__ == '__main__':
    server = Server(Config(app=app, host='0.0.0.0', port=3001))
    server.run()
This answer is not about improving CPU time, as you mentioned in the comments section, but rather explains what would happen if you defined an endpoint with normal def or async def, and provides solutions for when you need to run blocking operations inside an endpoint.
You are asking how to stop the processing of a request after a while, in order to process further requests. It does not really make sense to start processing a request and then (60 seconds later) stop it as if it never happened, wasting server resources all that time and keeping other requests waiting. You should instead leave the handling of requests to the FastAPI framework itself. When you define an endpoint with async def, it is run on the main thread (in the event loop); i.e., the server processes requests sequentially, as long as there is no await call inside the endpoint (just like in your case). The keyword await passes function control back to the event loop. In other words, it suspends the execution of the surrounding coroutine and tells the event loop to let something else run until the awaited task completes (and has returned the result data). The await keyword only works within an async function.
Since you perform a heavy CPU-bound operation inside your async def endpoint (by calling your some_func() function), and you never give up control for other requests to run in the event loop (e.g., by awaiting for some coroutine), the server will be blocked and wait for that request to be fully processed and complete, before moving on to the next one(s)—have a look at this answer for more details.
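To illustrate the difference (a toy sketch, not the solution to your problem): the first endpoint below yields control while waiting, while the second blocks the whole event loop, just like your some_func() call does:

import asyncio
import time
from fastapi import FastAPI

app = FastAPI()

@app.get("/yields")
async def yields():
    await asyncio.sleep(10)   # hands control back to the event loop; other requests keep being served
    return {"done": True}

@app.get("/blocks")
async def blocks():
    time.sleep(10)            # blocking call inside async def: nothing else runs until it returns
    return {"done": True}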
Solutions
One solution would be to define your endpoint with normal def instead of async def. In brief, when you declare an endpoint with normal def in FastAPI, it is run in an external threadpool that is then awaited, instead of being called directly (as that would block the server); hence, FastAPI still works asynchronously.
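For instance, under this approach the endpoint from the question would simply become:

@app.get("/do_something")
def do_something():
    # runs in FastAPI's external threadpool, so it doesn't block the event loop
    response = some_func(text="hello world")
    return {"response": response}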
Another solution, as described in this answer, is to keep the async def definition and run the CPU-bound operation in a separate thread and await it, using Starlette's run_in_threadpool(), thus ensuring that the main thread (event loop), where coroutines are run, does not get blocked. As described by @tiangolo here, "run_in_threadpool is an awaitable function, the first parameter is a normal function, the next parameters are passed to that function directly. It supports sequence arguments and keyword arguments". Example:
from fastapi.concurrency import run_in_threadpool
res = await run_in_threadpool(cpu_bound_task, text='Hello world')
Since this is about a CPU-bound operation, it would be preferable to run it in a separate process, using ProcessPoolExecutor, as described in the link provided above. In this case, this could be integrated with asyncio, in order to await the process to finish its work and return the result(s). Note that, as described in the link above, it is important to protect the main loop of code to avoid recursive spawning of subprocesses, etc—essentially, your code must be under if __name__ == '__main__'. Example:
import concurrent.futures
from functools import partial
import asyncio

loop = asyncio.get_running_loop()
with concurrent.futures.ProcessPoolExecutor() as pool:
    res = await loop.run_in_executor(pool, partial(cpu_bound_task, text='Hello world'))
About Request Timeout
Regarding the recent update to your question about the client having a fixed 60s request timeout: if you are not behind a proxy such as Nginx that would allow you to set the request timeout, and/or you are not using gunicorn, which would also allow you to adjust the request timeout, you could use a middleware, as suggested here, to set a timeout for all incoming requests. The suggested middleware (an example is given below) uses asyncio's .wait_for() function, which waits for an awaitable function/coroutine to complete with a timeout. If a timeout occurs, it cancels the task and raises asyncio.TimeoutError.
Regarding your comment below:
My requirement is not unblocking next request...
Again, please read carefully the first part of this answer to understand that if you define your endpoint with async def and do not await some coroutine inside, but instead perform some CPU-bound task (as you already do), it will block the server until it is completed (and even the approach below won't work as expected). That's like saying that you would like FastAPI to process one request at a time; in that case, there is no reason to use an ASGI framework such as FastAPI, which takes advantage of the async/await syntax (i.e., processing requests asynchronously) in order to provide fast performance. Hence, you either need to drop the async definition from your endpoint (as mentioned earlier), or, preferably, run your synchronous CPU-bound task using a ProcessPoolExecutor, as described earlier.
Also, your comment in some_func():
Some computationally heavy function whose execution time depends on
input text size
indicates that instead of (or along with) setting a request timeout, you could check the length of the input text (using a dependency function, for instance) and raise an HTTPException if the text's length exceeds some pre-defined value that is known beforehand to require more than 60s of processing. That way, your system won't waste resources trying to perform a task that you already know will not be completed.
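A rough sketch of that idea (MAX_TEXT_LENGTH and the dependency name are made up; pick a threshold based on your own measurements):

from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()
MAX_TEXT_LENGTH = 5000  # hypothetical cut-off known to need more than 60s of processing


def check_text_length(text: str):
    if len(text) > MAX_TEXT_LENGTH:
        raise HTTPException(status_code=413, detail='Input text is too long to be processed in time')
    return text


@app.get('/do_something')
async def do_something(text: str = Depends(check_text_length)):
    # hand the validated text to the CPU-bound task (e.g., via the ProcessPoolExecutor shown below)
    return {'accepted_length': len(text)}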
Working Example
import time
import uvicorn
import asyncio
import concurrent.futures
from functools import partial
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from starlette.status import HTTP_504_GATEWAY_TIMEOUT
from fastapi.concurrency import run_in_threadpool

REQUEST_TIMEOUT = 2  # adjust timeout as desired
app = FastAPI()


@app.middleware('http')
async def timeout_middleware(request: Request, call_next):
    try:
        return await asyncio.wait_for(call_next(request), timeout=REQUEST_TIMEOUT)
    except asyncio.TimeoutError:
        return JSONResponse({'detail': 'Request exceeded the time limit for processing'},
                            status_code=HTTP_504_GATEWAY_TIMEOUT)


def cpu_bound_task(text):
    time.sleep(5)
    return text


@app.get('/')
async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        res = await loop.run_in_executor(pool, partial(cpu_bound_task, text='Hello world'))
    return {'response': res}


if __name__ == '__main__':
    uvicorn.run(app)

How to stop the execution of a long process if something changes in the db?

I have a view that sends a message to a RabbitMQ queue.
message = {'origin': 'Bytes CSV',
           'data': {'csv_key': str(csv_entry.key),
                    'csv_fields': csv_fields,
                    'order_by': order_by,
                    'filters': filters}}
...
queue_service.send(message=message, headers={}, exchange_name=EXCHANGE_IN_NAME,
                   routing_key=MESSAGES_ROUTING_KEY.replace('#', 'bytes_counting.create'))
On my consumer, I have a long process to generate a CSV.
def create(self, data):
    csv_obj = self._get_object(key=data['csv_key'])

    if csv_obj.status == CSVRequestStatus.CANCELED:
        self.logger.info(f'CSV {csv_obj.key} was canceled by the user')
        return

    result = self.generate_result_data(filters=data['filters'], order_by=data['order_by'], csv_obj=csv_obj)
    csv_data = self._generate_csv(result=result, csv_fields=data['csv_fields'], csv_obj=csv_obj)
    file_key = self._post_csv(csv_data=csv_data, csv_obj=csv_obj)

    csv_obj.status = CSVRequestStatus.READY
    csv_obj.status_additional = CSVRequestStatusAdditional.SUCCESS
    csv_obj.file_key = file_key
    csv_obj.ready_at = timezone.now()
    csv_obj.save(update_fields=['status', 'status_additional', 'ready_at', 'file_key'])

    self.logger.info(f'CSV {csv_obj.name} created')
The long process happens inside self._generate_csv, because self.generate_result_data returns a queryset, which is lazy.
As you can see, if a user changes the status of the csv_request through an endpoint BEFORE the message starts to be consumed, the process will not run. My goal is to also allow cancellation during the execution of self._generate_csv.
So far I tried to use threading, but unsuccessfully.
How can I achieve my goal?
Thanks a lot!
Why don't you check out the Celery library? Using Celery with Django and a RabbitMQ broker is much easier than directly leveraging RabbitMQ queues.
Celery has a built-in function, revoke, to terminate an ongoing task:
>>> from celery.task.control import revoke
>>> revoke(task_id, terminate=True)
related SO answer
celery docs
For your use case, you probably want something like (code snippets):
## celery/tasks.py
from yourproject.celery import app  # import your project's Celery app instance ("yourproject" is a placeholder)

@app.task(name="create_csv", queue="my_queue")  # explicit name so send_task("create_csv", ...) resolves it
def create_csv(message):
    # ...snip...
    pass


## main.py
from celery import uuid, current_app

def start_task(task_id, message):
    current_app.send_task(
        "create_csv",
        args=[message],
        task_id=task_id,
    )

def kill_task(task_id):
    current_app.control.revoke(task_id, terminate=True)


## signals.py
from django.db.models.signals import post_save
from django.dispatch import receiver

from .main import kill_task
from .models import MyModel

# choose appropriate signal to listen for DB change
@receiver(post_save, sender=MyModel)
def handler(sender, instance, **kwargs):
    kill_task(instance.task_id)
Use celery.uuid to generate task IDs, which can be stored in the DB or a cache, and use the same task ID to control the task, i.e. request termination.
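For example (a hypothetical caller; the task_id field on your model is assumed):

## views.py (hypothetical)
from celery import uuid

from .main import start_task


def request_csv(message, csv_obj):
    task_id = uuid()                           # generate the id up front
    csv_obj.task_id = task_id                  # persist it so the signal handler can revoke the task later
    csv_obj.save(update_fields=['task_id'])
    start_task(task_id, message)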
Since self._generate_csv is the slowest part, the obvious solution would be to work within this function.
To do this, you can divide the creation of the CSV file into several pieces. After creating each piece, check the status and see whether you can continue creating the file. At the very end, glue all the pieces into a finished file.
Here is a method for combining multiple files into one.
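A rough sketch of that approach (the chunk size and the _write_piece/_combine_pieces helpers are made up; it assumes the status can be re-read from the DB between pieces):

CHUNK_SIZE = 1000  # hypothetical number of rows per piece

def _generate_csv(self, result, csv_fields, csv_obj):
    pieces = []
    for start in range(0, result.count(), CHUNK_SIZE):
        csv_obj.refresh_from_db(fields=['status'])           # pick up status changes made via the endpoint
        if csv_obj.status == CSVRequestStatus.CANCELED:
            self.logger.info(f'CSV {csv_obj.key} was canceled by the user')
            return None
        chunk = result[start:start + CHUNK_SIZE]              # the lazy queryset is evaluated piece by piece
        pieces.append(self._write_piece(chunk, csv_fields))   # hypothetical helper writing one piece
    return self._combine_pieces(pieces)                       # glue all the pieces into the finished file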

Load new model in background

I made an API for my AI model, but I would like not to have any downtime when I update the model. I'm searching for a way to load the new model in the background and, once it's loaded, swap the old model with the new one. I tried passing values between subprocesses, but it doesn't work well. Do you have any idea how I can do that?
You can place the serialized model in raw storage, like an S3 bucket if you're on AWS. In S3's case, you can use bucket versioning, which might prove helpful. Then set up some sort of trigger. You can definitely get creative here, and I've thought about this a lot. In practice, the best options I've tried are:
1. Set up an endpoint that, when called, will go open the new model at whatever location you store it at. Set up a webhook on the storage/S3 bucket that will send a quick automated call to the given endpoint and auto-load that new item.
2. Same thing as #1, but instead you just manually load it. In both cases you'll really want some security on that endpoint, or anyone that finds your site can just absolutely abuse your stack.
3. Set a timer at startup that calls a given function nightly, running internally within the application itself. The function is invoked and then goes and reloads.
There could be other ideas I'm not smart enough (yet!) to use; just trying to start some dialogue.
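A minimal sketch of option 1 (the token, path, and load_model() helper are made up; downloading from S3 instead of a local path works the same way):

import pickle

from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
model = None                        # model currently being served
RELOAD_TOKEN = 'change-me'          # hypothetical shared secret guarding the endpoint
MODEL_PATH = '/models/latest.pkl'   # hypothetical location the new model is written to


def load_model(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


@app.post('/reload-model')
def reload_model(x_token: str = Header(...)):
    if x_token != RELOAD_TOKEN:
        raise HTTPException(status_code=403, detail='Not authorized')
    global model
    new_model = load_model(MODEL_PATH)  # load fully first (runs in the threadpool, as this is a normal def endpoint)
    model = new_model                   # then swap the reference, so requests never see a half-loaded model
    return {'status': 'reloaded'}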
Found a way to do it with async and multiprocessing
import asyncio
import random
from uvicorn import Server, Config
from fastapi import FastAPI
import time
from multiprocessing import Process, Manager

app = FastAPI()
value = {"latest": 1, "b": 2}


@app.get("/")
async def root():
    global value
    return {"message": value}


def background_loading(d):
    time.sleep(2)
    d["test"] = 3


async def update():
    while True:
        global value
        manager = Manager()
        d = manager.dict()
        p1 = Process(target=background_loading, args=(d,))
        p1.daemon = True
        p1.start()
        while p1.is_alive():
            await asyncio.sleep(5)
        print(f'Update to value to {d}')
        value = d


if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    config = Config(app=app, loop=loop)
    server = Server(config)
    loop.create_task(update())
    loop.run_until_complete(server.serve())

python django race condition with celery

I'm working on a Python Django project. Here is what I want:
A user accesses Page1 with an object argument; the object's longFunction() is triggered and handed off to Celery so the page can be returned immediately.
If a user tries to access Page2 with the same object argument, I want the page to hang until the longFunction() triggered by Page1 has finished.
So I tried locking the MySQL row with objects.select_for_update(), but it doesn't work.
Here is a simplified version of my code:
def Page1(request, arg_id):
    obj = Vm.objects.select_for_update().get(id=arg_id)
    obj.longFunction.delay()
    return render_to_response(...)


def Page2(request, arg_id):
    vm = Vm.objects.select_for_update().get(id=arg_id)
    return render_to_response(...)
I want Page2 to hang at the line vm = Vm.objects.select_for_update().get(id=arg_id) until longFunction() has completed. I'm new to Celery, and it looks like the MySQL connection initiated in Page1 is lost when Page1 returns, even if longFunction() is not finished.
Is there another way I can achieve that?
Thanks
Maybe this can be helpful for you:
import time

from celery.result import AsyncResult
from yourapp.celery import app


def Page1(request, arg_id):
    obj = Vm.objects.select_for_update().get(id=arg_id)
    celery_task_id = obj.longFunction.delay().id  # .delay() returns an AsyncResult; keep its id
    return render_to_response(...)


def Page2(request, arg_id, celery_task_id):
    task = AsyncResult(app=app, id=celery_task_id)
    while task.state != "SUCCESS":  # task.state is re-fetched on every access
        time.sleep(1)  # wait or do whatever you want
    vm = Vm.objects.select_for_update().get(id=arg_id)
    return render_to_response(...)
More info at http://docs.celeryproject.org/en/latest/reference/celery.states.html
The database lock from select_for_update is released when the transaction closes (in Page1). This lock doesn't get carried over to the Celery task. You can lock in the Celery task, but that won't solve your problem, because Page2 might get loaded before the Celery task obtains the lock.
Mikel's answer will work. You could also put a lock in the cache, as described in the Celery cookbook.
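A minimal sketch of that cache lock (the lock name and expiry are made up; it assumes Django's cache framework is configured):

from django.core.cache import cache

LOCK_EXPIRE = 60 * 10  # seconds; safety net in case the task dies without releasing the lock


def acquire_lock(lock_id):
    # cache.add is atomic: it returns False if the key already exists
    return cache.add(lock_id, 'locked', LOCK_EXPIRE)


def release_lock(lock_id):
    cache.delete(lock_id)


# inside the Celery task wrapping longFunction():
#     lock_id = f'vm-lock-{vm_id}'
#     if acquire_lock(lock_id):
#         try:
#             ...  # longFunction's body
#         finally:
#             release_lock(lock_id)
# Page2 can then poll cache.get(lock_id) and wait until the lock disappears.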

Celery dynamic queue creation and routing

I'm trying to call a task and, if a queue for that task doesn't exist, create it and immediately insert the called task into that queue. I have the following code:
@task
def greet(name):
    return "Hello %s!" % name


def run():
    result = greet.delay(args=['marc'], queue='greet.1',
                         routing_key='greet.1')
    print result.ready()
then I have a custom router:
class MyRouter(object):

    def route_for_task(self, task, args=None, kwargs=None):
        if task == 'tasks.greet':
            return {'queue': kwargs['queue'],
                    'exchange': 'greet',
                    'exchange_type': 'direct',
                    'routing_key': kwargs['routing_key']}
        return None
This creates an exchange called greet.1 and a queue called greet.1, but the queue is empty. The exchange should just be called greet, and it should know how to route a routing key like greet.1 to the queue called greet.1.
Any ideas?
When you do the following:
task.apply_async(queue='foo', routing_key='foobar')
Then Celery will take default values from the 'foo' queue in CELERY_QUEUES, or if it does not exist then automatically create it using (queue=foo, exchange=foo, routing_key=foo).
So if 'foo' does not exist in CELERY_QUEUES you will end up with:
queues['foo'] = Queue('foo', exchange=Exchange('foo'), routing_key='foo')
The producer will then declare that queue, but since you override the routing_key, it will actually send the message using routing_key = 'foobar'.
This may seem strange, but the behavior is actually useful for topic exchanges, where you publish to different topics.
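For reference, this is what an explicit declaration of the intended queue would look like in CELERY_QUEUES (it only helps if the queue names are known up front, which is not quite the dynamic case you're asking about):

from kombu import Exchange, Queue

CELERY_QUEUES = (
    Queue('greet.1', exchange=Exchange('greet', type='direct'), routing_key='greet.1'),
)

# with this in place, apply_async(queue='greet.1') uses the 'greet' exchange
# instead of auto-creating a separate 'greet.1' exchange.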
It's harder to do what you want, though: you can create the queue yourself and declare it, but that won't work well with automatic message publish retries.
It would be better if the queue argument to apply_async could support a custom kombu.Queue instead, which would be both declared and used as the destination. Maybe you could open an issue for that at http://github.com/celery/celery/issues
