Flask and long-running tasks - Python

I'm writing a small web server using Flask that needs to do the following things:
On the first request, serve the basic page and kick off a long (15-60 second) data processing task. The data processing task queries a second server which I do not control, updates a local database, and then performs some calculations on the results to show in the web page.
The page issues several AJAX requests that all depend on parts of the result from the long task, so I need to wait until the processing is done.
Subsequent requests for the first page would ideally re-use the previous request's result if they come in while the processing task is ongoing (or even shortly thereafter).
I tried using flask-cache (specifically SimpleCache), but ran into an issue as it seems the cache pickles the result, when I'd really rather keep the exact object.
I suppose I could re-write what I'm caching to be pickle-able, and then implement a single worker thread to do the processing.
Is there a better way of handling this kind of workflow?

I think the best way to handle long data processing is something like Celery:
Send a request to start the task and receive a task ID.
Then periodically send AJAX requests to check the task's progress and, once it is finished, receive the result of the task execution.
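A minimal sketch of that pattern with Flask and Celery might look like the following; the process_data task, the Redis broker URL, and the endpoint paths are assumptions for illustration, not part of the original setup:

from celery import Celery
from flask import Flask, jsonify

app = Flask(__name__)
# Assumed broker/result backend; substitute whatever you actually run.
celery = Celery(__name__, broker="redis://localhost:6379/0",
                backend="redis://localhost:6379/0")

@celery.task
def process_data():
    # Hypothetical long task: query the second server, update the local
    # database, run the calculations, and return the summary for the page.
    return {"status": "done"}

@app.route("/start")
def start():
    task = process_data.delay()          # kick off the task, return immediately
    return jsonify(task_id=task.id)

@app.route("/status/<task_id>")
def status(task_id):
    result = celery.AsyncResult(task_id)
    if result.ready():
        return jsonify(state=result.state, result=result.get())
    return jsonify(state=result.state)   # client polls until ready

The AJAX calls that depend on the processed data can poll /status/<task_id> until it reports success, and subsequent page loads can reuse the stored task ID instead of starting a new task.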

Related

Which requests should be handled by the webserver and which by a task queue worker?

I am working on a Python web app that uses Celery to schedule and execute user job requests.
Most of the time the requests submitted by a user can't be resolved immediately and thus it makes sense to me to schedule them in a queue.
However, now that I have the whole queuing architecture in place, I'm confused about whether I should delegate all the request processing logic to the queue/workers or if I should leave some of the work to the webserver itself.
For example, apart from the job scheduling, there are times where a user only needs to perform a simple database query, or retrieve a static JSON file. Should I also delegate these "synchronous" requests to the queue/workers?
Right now, my webserver controllers don't do anything except validating incoming JSON request schemas and forwarding them to the queue. What are the pros and cons of having a dumb webserver like this?
I believe the setup you have right now, with the workers also taking the small jobs, is good. That way the workers would be overloaded first in the event of an attack or a huge influx of requests. :)

Is it a bad practice to use sleep() in a web server in production?

I'm working with Django 1.8 and Python 2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 milliseconds) between each piece of data that I send:
from time import sleep

while True:
    send(data)   # send the next chunk over the already-open socket
    sleep(0.01)  # pause ~10 ms so the receiver can keep up
So my question is: is it considered bad practice to simply use sleep() to create that pause? Is there perhaps a more efficient approach?
UPDATED:
The reason I need to create that pause is that the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesn't return anything after receiving, let alone processing, the data. Leaving that brief pause ensures that each chunk of data I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
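As a rough illustration of that pattern in Django (the send_chunks task, the view names, and the use of a client-supplied UUID are my own assumptions, not something from the question):

import uuid

from celery.result import AsyncResult
from django.http import JsonResponse

from myapp.tasks import send_chunks   # hypothetical Celery task that feeds the socket

def start_sending(request):
    task_id = str(uuid.uuid4())        # supply our own ID rather than relying on the generated one
    send_chunks.apply_async(task_id=task_id)
    return JsonResponse({"task_id": task_id})

def task_status(request, task_id):
    result = AsyncResult(task_id)
    payload = {"state": result.state}
    if result.ready():
        payload["result"] = result.get()
    return JsonResponse(payload)       # the client refreshes this URL every so often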
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.

Scraping website using Celery

Currently, my structure is Flask, Redis, RabbitMQ and Celery. In my scraping, I am using requests and BeautifulSoup.
My Flask app runs in production on Apache with WSGI, with app.run(threaded=True).
I have 25 APIs: 10 are for scraping the URL itself (headers, etc.), and the rest use a 3rd-party API for that URL.
I am using chord for processing my APIs and getting data from the APIs using requests.
For my chord header I have 3 workers, while on my callback I only have 1.
I am hitting a bottleneck with ConnectTimeoutError and MaxRetryError. Some threads I read said to add a timeout for every process, because getting these errors means you are overloading the remote server.
The problem is that, since I am using a chord, there is no point in adding a sleep, because the 25 API calls will run at the same time. Has anyone encountered this? Or am I doing this wrong?
The threads I read seem to suggest switching from requests to pycurl, or using Scrapy. But I don't think that's the case here, since ConnectTimeoutError is about my host overloading a specific URL's server.
My chord process:
callback = create_document.s(url, company_logo, api_list)
header = [api_request.s(key) for key in api_list.keys()]
result = chord(header)(callback)
The api_request task uses requests.
If you want to limit the number of scrapes running at the same time, you can create an enqueue task that checks whether another task sharing the same properties as the one you want to run is already running. If one is, the enqueue task sleeps for a few seconds and checks again; once nothing matching is running, it queues the task you actually want. This lets you use sleeps alongside asynchronous tasks. You can even count the running tasks and allow more than one at a time: run 5 at once, see whether that is throttled enough, and queue another whenever you see one finish. A sketch of that check, using the worker inspection API, follows the documentation link below.
::EDIT::
Documentation for Celery Inspect
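A minimal sketch of that gate, assuming a Celery app named app and a hypothetical scrape_url task (the 5-task cap and 5-second poll interval are arbitrary choices):

import time

from myproject.celery import app          # hypothetical Celery app module
from myproject.tasks import scrape_url    # hypothetical scraping task

MAX_CONCURRENT = 5

def count_running(task_name):
    # inspect().active() maps each worker to the tasks it is currently executing.
    active = app.control.inspect().active() or {}
    return sum(1 for tasks in active.values()
               for task in tasks if task["name"] == task_name)

@app.task
def enqueue_scrape(url):
    # Wait until fewer than MAX_CONCURRENT scrapes are running, then queue one.
    while count_running(scrape_url.name) >= MAX_CONCURRENT:
        time.sleep(5)
    scrape_url.delay(url)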

Running several processes in parallel while reading from Kafka and sending http requests in Python

I have been searching for how to run parallel processes in Python and have stumbled upon concurrent.futures and multiprocessing as the most interesting options. Sadly, I haven't been able to implement them correctly, since they seem to run one worker after another instead of at the same time. Right now my process takes a little too long, and I think I can make it faster.
Say I coded a function in Python that connects to a Kafka queue and reads JSON messages; each message is then sent to a REST service using requests to get some information that completes the data, and is posted afterwards; then I just update my database with the response.
I need this function to run in several processes, reading from the queue at the same time while making the requests, until I'm out of messages.
What would be the best approach to do so?
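One common shape for this, sketched below under the assumption of the kafka-python client and placeholder topic, broker, and service names: each process runs its own consumer in a shared consumer group, so Kafka splits the topic's partitions across the processes and they genuinely run at the same time.

import json
from multiprocessing import Process

import requests
from kafka import KafkaConsumer   # assumes the kafka-python package

def update_database(message, enriched):
    pass   # stand-in for the real persistence code

def worker():
    # One consumer per process; the shared group_id makes Kafka assign
    # each process a disjoint set of partitions.
    consumer = KafkaConsumer(
        "jobs",                                   # hypothetical topic
        bootstrap_servers="localhost:9092",       # assumed broker address
        group_id="enrichers",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    for message in consumer:
        payload = message.value
        # Hypothetical REST call that returns the data needed to complete the record.
        resp = requests.post("http://rest-service/enrich", json=payload, timeout=30)
        update_database(payload, resp.json())

if __name__ == "__main__":
    processes = [Process(target=worker) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()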

Better ways to handle AppEngine requests that time out?

Sometimes, with requests that do a lot, Google AppEngine returns an error. I have been handling this by some trickery: memcaching intermediate processed data and just requesting the page again. This often works because the memcached data does not have to be recalculated and the request finishes in time.
However... this hack requires seeing an error, going back, and clicking again. Obviously less than ideal.
Any suggestions?
inb4: "optimize your process better", "split your page into sub-processes", and "use taskqueue".
Thanks for any thoughts.
Edit - To clarify:
Long wait for requests is ok because the function is administrative. I'm basically looking to run a data-mining function. I'm searching over my datastore and modifying a bunch of objects. I think the correct answer is that AppEngine may not be the right tool for this. I should be exporting the data to a computer where I can run functions like this on my own. It seems AppEngine is really intended for serving with lighter processing demands. Maybe the quota/pricing model should offer the option to increase processing timeouts and charge extra.
If interactive user requests are hitting the 30 second deadline, you have bigger problems: your user has almost certainly given up and left anyway.
What you can do depends on what your code is doing. There's a lot to be optimized by batching datastore operations, or reducing them by changing how you model your data; you can offload work to the Task Queue; for URLFetches, you can execute them in parallel. Tell us more about what you're doing and we may be able to provide more concrete suggestions.
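For the parallel URLFetch part, the legacy Python runtime's urlfetch API supports an asynchronous pattern roughly along these lines (the URLs and the 10-second deadline are placeholders):

from google.appengine.api import urlfetch

urls = ["http://example.com/a", "http://example.com/b"]   # placeholder URLs

# Start all fetches without blocking, then collect the results.
rpcs = []
for url in urls:
    rpc = urlfetch.create_rpc(deadline=10)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append(rpc)

results = []
for rpc in rpcs:
    result = rpc.get_result()   # blocks only until this particular fetch finishes
    results.append(result.content if result.status_code == 200 else None)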
I have been handling something similar by building a custom automatic retry dispatcher on the client. Whenever an ajax call to the server fails, the client will retry it.
This works very well if your page is ajaxy. If your app spits out entire HTML pages, then you can use a two-pass process: first send an empty page containing only an AJAX request. Then, when AppEngine receives that AJAX request, it outputs the same HTML you had before. If the AJAX call succeeds, it fills the DOM with the result; if it fails, it retries once.
