How to implement an auto-scaling solution requiring synchronous subscription? - python

I have to process requests published to a Pub/Sub topic by the users of my "service" in a Python application that has a main loop.
Each instance of the service can only process one request at a time due to a limitation of the application. Bursts of requests (~10 at the beginning, growing to ~10^6-10^7) will alternate with mostly idle time. The cold-start time of the application is very high compared to the processing time for a single request.
My code will be a plug-in that polls the subscription, calls methods in the application based on the messages, and then saves data to Cloud Storage and BigQuery.
I have read through the Cloud documentation, and it seems that Cloud Run is the right solution, together with the synchronous (pull) subscription API in Python. I excluded Cloud Functions because it seems more suited for asynchronous, push-style workloads.
However, I did not fully understand how the auto-scaling works. Basically, it would have to scale based on the average processing time for each request and the length of the queue, taking the average start-up time of a container into account.
Unfortunately, I did not find a tutorial or example for such a use case, especially not one that really explains how the auto-scaling happens in detail.
Does anyone have something like that, or could someone explain it here?
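For reference, a minimal synchronous pull loop with the google-cloud-pubsub client might look like the sketch below; the project and subscription names and the handle() callback are placeholders, not part of the actual setup:

from google.api_core import exceptions
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

with subscriber:
    while True:
        try:
            # Block until a message arrives or the timeout expires.
            response = subscriber.pull(
                request={"subscription": subscription_path, "max_messages": 1},
                timeout=30.0,
            )
        except exceptions.DeadlineExceeded:
            continue  # no messages within the timeout; poll again
        for received in response.received_messages:
            handle(received.message.data)  # hypothetical single-request handler
            subscriber.acknowledge(
                request={"subscription": subscription_path,
                         "ack_ids": [received.ack_id]}
            )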

Related

How to design a multi-client preprocessing software pipeline using AWS?

My software's goal is to automate the preprocessing pipeline. The pipeline has three code blocks:
Fetching the data - either via an API or by the client uploading a CSV to an S3 bucket.
Processing the data - my goal is to unify the data from the different clients into a single end schema.
Storing the unified data in a database.
I know it is a very common system, but I failed to find the best design for it.
The requirements are:
The system is not real-time; for each client I plan to fetch the new data every X days, and it does not matter if it finishes even a day later.
The processing part is unique per client's data; of course there are some common features, but also a lot of client-specific features and manipulation.
I want the system to be automated.
I thought of the following:
1. The Lambda solution:
Schedule a Lambda for each client which fetches the data every X days; that Lambda triggers another Lambda which does the processing. But if I have 100 clients, it will be awful to handle 200 Lambdas.
2.1 Making a project called Api with a different script for each client, and a schedule for each script on EC2 or ECS.
2.2 Having another project called Processing in which a parent class holds the common code and all the client subclasses inherit from it; the API script activates the relevant processing script.
In the end I am very confused about what the best practice is. I only found examples which handle one client, or general architecture approaches/block diagrams which are too broad.
Because I know it is such a common system, I would appreciate learning from others' experience.
Would appreciate any reference links or wisdom.
Take a look at Step Functions; they allow you to decouple the execution of each stage and let you reuse your Lambdas.
By passing input into the Step Function, the top Lambda may be able to make decisions which feed the others.
To schedule this, use a scheduled CloudWatch Events rule.
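As an illustration only, the top-level dispatching Lambda in such a Step Function might look like the sketch below; the per-client configuration dictionary and its field names are assumptions, not a prescribed design:

# Hypothetical per-client configuration; in practice this could live in
# DynamoDB, S3, or the Step Function input itself.
CLIENT_CONFIG = {
    "client_a": {"source": "api", "endpoint": "https://example.com/data"},
    "client_b": {"source": "s3", "bucket": "client-b-uploads"},
}

def handler(event, context):
    """Decide how to fetch data for the client named in the input.

    The returned dict becomes the input of the next state in the
    Step Function, so downstream Lambdas can branch on it instead of
    needing one Lambda per client.
    """
    client_id = event["client_id"]
    config = CLIENT_CONFIG[client_id]
    return {
        "client_id": client_id,
        "fetch_mode": config["source"],
        "fetch_params": config,
    }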

Architecting an ML API in Django

I work on a project for our clients which is heavily ML-based and computationally intensive (as in complex, multi-level similarity scores, NLP, etc.). For the prototype, we delivered a Django REST Framework API with access to a database from the client; with every request to specific endpoints, it would literally do all the ML work on the fly (in the backend).
Now that we are scaling and more user activity is taking place in production, the app seems to lag a lot. Simple profiling shows that a single POST request can take up to 20 seconds to respond. So no matter how much I optimize in terms of horizontal scaling, I can't get rid of the bottleneck of all the calculations happening with the API calls. I have a hunch that caching could be a kind of solution, but I am not sure. I can imagine a lot of 'theoretical' solutions, but I don't want to reinvent the wheel (or shall I say, re-discover the wheel).
Are there specific design architectures for ML or computationally intensive REST API calls that I can refer to in redesigning my project?
Machine learning and natural language processing systems are often resource-hungry, and in many cases there is not much one can do about that directly. Some operations simply take longer than others, but this is actually not the main problem in your case. The main problem is that the user doesn't get any feedback while the backend does its job, which is not a good user experience.
Therefore, it is not recommended to perform resource-heavy computation within the traditional HTTP request-response cycle. Instead of calling the ML logic within the API view and waiting for it to finish, consider setting up an asynchronous task queue to perform the heavy lifting independently of the synchronous request-response cycle.
In the context of Django, the standard task queue implementation would be Celery. Setting it up will require some learning and additional infrastructure (e.g. a Redis instance and worker servers), but there is really no way around it if you don't want to break the user experience.
Once you have set everything up, you can start an asynchronous task whenever your API endpoint receives a request and immediately inform the user, via a normal view response, that their request is being carried out. Once the ML task has finished and its results have been written to the database (using a Django model, of course), you can notify the user (e.g. via mail or directly in the browser via WebSockets) to view the analysis results in a dedicated results view.
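A minimal sketch of that pattern with Celery and Django REST Framework might look like this; run_analysis and the document_id field are hypothetical names used for illustration:

from celery import shared_task
from rest_framework.response import Response
from rest_framework.views import APIView

@shared_task
def run_analysis(document_id):
    # The heavy ML work happens here, in a Celery worker process,
    # not in the web server; results are written to the database.
    ...

class AnalyzeView(APIView):
    def post(self, request):
        document_id = request.data["document_id"]
        # Enqueue the task and respond immediately.
        async_result = run_analysis.delay(document_id)
        return Response({"task_id": async_result.id, "status": "queued"},
                        status=202)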

What is `/_ah/background` with Google Flex VM

I started using Google Flex VMs recently, and in the logs there are multiple requests to /_ah/background that last ~1 hour each time. The only reference to these I could find is this question, which mentions they have to do with background threads, but I don't believe that's the case here, as:
nowhere do we use background threads
that API is deprecated and I'm not even sure we can use it
we do use processes but they're short-lived (nowhere near an hour) and don't print any log messages
Any ideas?
/_ah/background is used in Flex for App Engine API calls that are made outside of an incoming-request processing context (threads, async I/O, ...).
Even if you don't do that directly, log flushing is still done asynchronously (not as part of incoming-request processing).
This is an implementation detail, and there is a plan to hide it while still finding a way to show (in logs, traces, ...) information about these API calls.
FWIW, I thought sharing this picture would provide a little insight into the frequency and (possible performance) impact these /_ah/background requests create.
Is there any workaround to clear this?

How to measure memory usage of a web request when using Werkzeug/Flask?

Is there a way to measure the amount of memory allocated by an arbitrary web request in a Flask/Werkzeug app? By arbitrary, I mean I'd prefer a technique that lets me instrument code at a high enough level that I don't have to change it to test memory usage of different routes. If that's not possible but it's still possible to do this by wrapping individual requests with a little code, so be it.
In a PHP app I wrote a while ago, I accomplished this by calling the memory_get_peak_usage() function both at the start and the end of the request and taking the difference.
Is there an analog in Python/Flask/Werkzeug? Using Python 2.7.9 if it matters.
First of all, one should understand the main difference between how PHP and Python process requests. Roughly speaking, each PHP worker accepts only one request, handles it, and then dies (or reinitializes the interpreter). PHP was designed directly for this; it is a request-processing language by nature. So it's pretty simple to measure per-request memory usage: a request's peak memory usage equals the worker's peak memory usage. It's a language feature.
At the same time, Python usually uses another approach to handle requests. There are two main models - synchronous and asynchronous request processing. However, both have the same difficulty when it comes to measuring per-request memory usage. The reason is that one Python worker handles many requests (concurrently or sequentially) during its lifetime, so it's hard to attribute memory usage to a single request.
However, one can adapt the underlying framework and application code to collect memory usage. One possible solution is to use some kind of events. For example, one can raise an abstract mem_usage event before the request, at the beginning of a view function, at the end of a view function, at important places within the business logic, and so on. Then there should be a subscriber for such events, doing something like:
import resource

# Peak resident set size of the current process so far
# (kilobytes on Linux, bytes on macOS).
mem_usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
This subscriber has to accumulate such usage data and, on app_request_teardown/after_request, send it to the metrics collection system along with information about the current request.endpoint or route or whatever.
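For example, a minimal sketch of such a subscriber wired into Flask's request hooks (the metrics sink here is just the app logger, as a stand-in) could be:

import resource
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def record_mem_before():
    g.mem_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

@app.teardown_request
def record_mem_after(exc=None):
    mem_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is a process-wide high-water mark, so this delta only
    # shows how much the current request raised the peak, if at all.
    app.logger.info("endpoint=%s peak_rss_delta=%d",
                    request.endpoint, mem_after - g.mem_before)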
Also, using a memory profiler is a good idea, but usually not for production use.
Further reading about request processing models:
CGI
FastCGI
PHP specific
Another possible solution is to use sys.settrace. Using this tool one can measure memory usage even per line of code. Usage examples can be found in the memory_profiler project. Of course, it will slow down the code significantly.
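For instance, memory_profiler's line-by-line mode (a development-time tool, per the caveat above) can be used like this:

from memory_profiler import profile

@profile  # prints per-line memory usage to stdout when the function runs
def handle_request():
    payload = [0] * 10 ** 6  # allocate a few MB so there is something to measure
    return sum(payload)

if __name__ == "__main__":
    handle_request()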

Python "Task Server"

My question is: which Python framework should I use to build my server?
Notes:
This server talks HTTP with its clients: GET and POST (via pyAMF).
Clients "submit" "tasks" for processing and then, sometime later, retrieve the associated "task_result".
Submit and retrieve might be separated by days - different HTTP connections.
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When the server gets a "task", it queues it for processing.
The server manages this queue and, when tasks get to the top, arranges for them to be processed.
The processing is performed by a long-running (15 mins?) external program (via subprocess) which is fed the task XML and which produces a "task_result" lump of XML that the server picks up and stores (for later client retrieval).
It serves a couple of basic HTML pages showing the queue and processing status (admin purposes only).
I've experimented with twisted.web, using SQLite as the database and threads to handle the long-running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?
I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular python implementations of message queues:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/
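To illustrate the producer/consumer split, a sketch with beanstalkd via the beanstalkc client (the host/port and process_task_xml are placeholders) might be:

import beanstalkc

# Producer side (in the webserver): enqueue a task.
queue = beanstalkc.Connection(host="localhost", port=11300)
queue.put("<task>...</task>")  # the task XML as the job body

# Consumer side (in a long-running worker process):
job = queue.reserve()       # blocks until a job is available
process_task_xml(job.body)  # hypothetical processing function
job.delete()                # remove the job once it has succeeded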
I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue; no further queueing is needed. You get a request, you spawn the subprocess as a child, and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill, because your XML inputs and results can just sit in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only as many requests as fit within some "max number of children" limit are taken out.
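A minimal sketch of such a governor (the "solver" command and the MAX_CHILDREN value are placeholders):

import subprocess

MAX_CHILDREN = 4   # assumed limit; tune for your hardware
pending = []       # FIFO queue of task XML file paths
children = []      # (process, task) pairs currently running

def submit(task_xml_path):
    pending.append(task_xml_path)

def poll_children():
    # Reap children that have finished.
    for proc, task in children[:]:
        if proc.poll() is not None:
            children.remove((proc, task))
    # Start queued tasks while we are under the limit.
    while pending and len(children) < MAX_CHILDREN:
        task = pending.pop(0)
        proc = subprocess.Popen(["solver", task])  # hypothetical external program
        children.append((proc, task))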
My reaction is to suggest Twisted, but you've already looked at this. Still, I stick by my answer. Without knowing your personal pain points, I can at least share some things that helped me reduce almost all of the deferred-madness that arises when you have several dependent, blocking actions you need to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
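As a rough sketch of the style (run_external_tool and save_result are hypothetical functions that each return a Deferred):

from twisted.internet import defer

@defer.inlineCallbacks
def handle_task(task_xml):
    # Reads like straight-line code, but each yield waits on a
    # Deferred without blocking the reactor.
    result = yield run_external_tool(task_xml)  # hypothetical, returns a Deferred
    yield save_result(result)                   # hypothetical, returns a Deferred
    defer.returnValue(result)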
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable command-line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances, perhaps) via ssh, assuming you have the keys set up already.
You can have a look at Celery.
It seems any Python web framework will suit your needs. I work with a similar system on a daily basis, and I can tell you that your solution with threads and SQLite for queue storage is about as simple as it gets.
Assuming order doesn't matter in your queue, threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If that is a concern, I'd suggest a single-threaded application that does the items in the queue one by one.
