I think I don't completely understand the deployment process. Here is what I know:
when we need to do hot deployment -- meaning that we need to change the code that is live -- we can do it by reloading the modules, but
imp.reload is a bad idea, and we should restart the application instead of reloading the changed modules
ideally the running code should be a clone of your code repository, and any time you need to deploy, you just pull the changes
Now, let's say I have multiple instances of a WSGI app running behind a reverse proxy like nginx (on ports like 8011, 8012, ...). Let's also assume that I get 5 requests per second.
In this case, how should I update my code in all the running instances of the application?
If I stop all the instances, then update all of them, then restart them all -- I will certainly lose some requests
If I update each instance one by one -- then the instances will be in inconsistent states (some will be running old code, and some new) until all of them are updated. Now if a request hits an updated instance, and then a subsequent (and related) request hits an older instance (yet to be updated) -- then I will get wrong results.
Can somebody explain thoroughly how busy applications like this are hot-deployed?
For deployment across several hot instances that are behind a load balancer like nginx I like to do rolling deployments with a tool like Fabric.
1. Fabric connects you to Server 1
2. Shut down the web server
3. Deploy the changes, either by using your VCS or by transferring a tarball with the new application
4. Start up the web server
5. Go back to step 1 and connect to the next server
That way you're never offline, and the switch is seamless: when nginx tries to round-robin to a web server that has been taken down, it notices and moves on to the next one instead. As soon as the node/instance is back up, it goes back into production use.
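The loop above can be sketched in plain Python. This is only an illustration of the ordering, not Fabric code: the `rolling_deploy` function and the `stop`, `deploy` and `start` callables are hypothetical names, and in a real setup they would wrap whatever remote commands your tooling runs.

```python
from typing import Callable, Iterable, List


def rolling_deploy(
    servers: Iterable[str],
    stop: Callable[[str], None],
    deploy: Callable[[str], None],
    start: Callable[[str], None],
) -> List[str]:
    """Update servers one at a time so the rest keep serving traffic."""
    updated = []
    for server in servers:
        stop(server)     # take this node out of service
        deploy(server)   # pull from the VCS or unpack a tarball
        start(server)    # bring it back before touching the next node
        updated.append(server)
    return updated


# Record the order of operations instead of touching real machines.
log = []
result = rolling_deploy(
    ["server1", "server2"],
    stop=lambda s: log.append(("stop", s)),
    deploy=lambda s: log.append(("deploy", s)),
    start=lambda s: log.append(("start", s)),
)
```

The point of the ordering is that a server is fully started again before the next one is stopped, so at most one node is out of the pool at any time.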
EDIT:
You can use the ip_hash module in nginx to ensure that all requests from one IP address go to the same server for the length of the session.
This directive causes requests to be distributed between upstreams based on the IP-address of the client.
The key for the hash is the class-C network address of the client. This method guarantees that the client request will always be transferred to the same server. But if this server is considered inoperative, then the request of this client will be transferred to another server. This gives a high probability clients will always connect to the same server.
What this means for you is that once your web server is updated and a client has connected to the new instance, all connections for that session will continue to be forwarded to the same server.
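To make the idea concrete, here is a minimal sketch of hashing on the client's network address. It is not nginx's actual hash function: `pick_server` is a hypothetical helper, CRC32 is just a convenient deterministic hash, and the upstream addresses are made up.

```python
import zlib
from typing import List


def pick_server(client_ip: str, servers: List[str]) -> str:
    """Map a client to a server using the first three octets of its IPv4 address."""
    network = ".".join(client_ip.split(".")[:3])  # e.g. "203.0.113"
    index = zlib.crc32(network.encode("ascii")) % len(servers)
    return servers[index]


servers = ["127.0.0.1:8011", "127.0.0.1:8012", "127.0.0.1:8013"]
first = pick_server("203.0.113.7", servers)
second = pick_server("203.0.113.200", servers)  # same /24 network as above
```

Because only the network part of the address feeds the hash, both clients above land on the same upstream, and a given client keeps landing there as long as the server list does not change.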
This does leave you with the following situation:
Client connects to site, gets served from Server 1
Server 1 is updated before client finishes whatever they're doing
Client potentially left in a state of limbo?
This scenario raises the question: are you removing things from your API/site which could potentially leave the client in a state of limbo? If all you're doing is, for example, updating UI elements or adding pages, but not changing any back-end APIs, you should not have any problems. If you are removing API functions, you might end up with issues.
Couldn't you take half your servers offline (say by pulling them out of the load balancing pool) and then update those? Then bring them back online while simultaneously pulling down the other half. Then update those and bring them back online.
This will ensure that you stay online while also ensuring that you never have the old and new versions of your application online at the same time. Yes, this will mean that your site would run at half its capacity during the time. But that might be ok?
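A small simulation of that sequence, assuming at least two servers in the pool. The `half_and_half` function and the version labels are hypothetical; the sketch only tracks which versions are reachable at each step.

```python
from typing import Dict, List, Set


def half_and_half(pool: List[str], new_version: str, versions: Dict[str, str]) -> List[Set[str]]:
    """Upgrade the pool in two halves; return the versions online at each step."""
    middle = len(pool) // 2
    first_half, second_half = pool[:middle], pool[middle:]
    online_history = []

    # Step 1: the first half leaves the pool and is upgraded; the second half serves.
    online_history.append({versions[s] for s in second_half})
    for server in first_half:
        versions[server] = new_version

    # Step 2: swap. The upgraded half serves while the other half is upgraded.
    online_history.append({versions[s] for s in first_half})
    for server in second_half:
        versions[server] = new_version

    # Step 3: everything is back in the pool.
    online_history.append({versions[s] for s in pool})
    return online_history


versions = {"a": "v1", "b": "v1", "c": "v1", "d": "v1"}
history = half_and_half(list(versions), "v2", versions)
```

Each entry in `history` contains exactly one version, which is the property this approach buys you in exchange for running at half capacity.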
Related
Basically I'm running a Flask web server that crunches a bunch of data and sends it back to the user. We aren't expecting many users (about 60), but I've noticed what could be an issue with concurrency. Right now, if I open a tab and send a request to have some data crunched, it takes about 30s. For our application that's ok.
If I open another tab and send the same request at the same time, Gunicorn will handle both concurrently. This is great if we have two separate users making two separate requests. But what happens if one user opens 4 or 8 tabs and sends the same request? It backs up the server for everyone else. Is there a way I can tell Gunicorn to only accept 1 request at a time from the same IP?
A better solution than the answer by @jon would be to limit access at the web server instead of the application server. It is good practice to keep the responsibilities of the different layers of your application separate. Ideally the application server, Flask, should have no configuration for rate limiting or anything to do with where the requests are coming from. The responsibility of the web server, in this case nginx, is to route each request to the right destination based on certain parameters. The limiting should be done at this layer.
Now, coming to the limiting, you could do it by using the limit_req_zone directive in the http block of the nginx config:
    http {
        limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
        ...
        server {
            ...
            location / {
                limit_req zone=one burst=5;
                proxy_pass ...
            }
        }
    }
where $binary_remote_addr is the IP of the client, no more than 1 request per second on average is allowed, and bursts may not exceed 5 requests.
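To get a feel for what rate=1r/s with burst=5 means, here is a simplified leaky-bucket model in Python. It is a sketch of the general idea, not nginx's implementation: the `LeakyBucket` class is hypothetical, it tracks a single client, and it only decides accept versus reject (nginx additionally delays the queued requests unless told otherwise).

```python
class LeakyBucket:
    """Simplified per-client limiter: `rate` requests/second, `burst` allowed excess."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self.excess = 0.0   # requests above the steady rate
        self.last = None    # time of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:
            self.last = now
            return True
        # Drain the bucket for the elapsed time, then count this request.
        excess = max(self.excess - self.rate * (now - self.last) + 1, 0.0)
        if excess > self.burst:
            return False    # over the burst allowance: reject
        self.excess = excess
        self.last = now
        return True


bucket = LeakyBucket(rate=1, burst=5)
# Seven requests arriving in the same instant.
same_instant = [bucket.allow(now=0.0) for _ in range(7)]
# One more request a second later, after the bucket has drained a little.
later = bucket.allow(now=1.0)
```

In this model the first request plus five burst requests are accepted, the seventh is rejected, and one more slot opens up after a second.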
Pro-tip: Since subsequent requests from the same IP are held in a queue, there is a good chance of nginx timing out. It is therefore advisable to set a longer proxy_read_timeout and, if the reports take longer, to adjust the Gunicorn timeout as well.
Documentation of limit_req_zone
A blog post by nginx on rate limiting can be found here
This is probably NOT best handled at the flask level. But if you had to do it there, then it turns out someone else already designed a flask plugin to do just this:
https://flask-limiter.readthedocs.io/en/stable/
If a request takes at least 30s then make your limit by address for one request every 30s. This will solve the issue of impatient users obsessively clicking instead of waiting for a very long process to finish.
This isn't exactly what you requested, since it means that longer/shorter requests may overlap and allow multiple requests at the same time, which doesn't fully exclude the behavior you describe of multiple tabs, etc. That said, if you are able to tell your users to wait 30 seconds for anything, it sounds like you are in the driver's seat for setting UX expectations. A good wait/progress message will probably help too, if you can build an asynchronous server interaction.
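For illustration, the "one request per address every 30s" rule can be sketched without any plugin. This is not the flask-limiter API: `PerAddressLimiter` is a hypothetical class, the state lives in one process only (so it would not be shared across several Gunicorn workers), and the addresses are examples.

```python
from typing import Dict


class PerAddressLimiter:
    """Allow one request per address every `interval` seconds."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self.last_accepted: Dict[str, float] = {}

    def allow(self, address: str, now: float) -> bool:
        last = self.last_accepted.get(address)
        if last is not None and now - last < self.interval:
            return False
        self.last_accepted[address] = now
        return True


limiter = PerAddressLimiter(interval=30)
first_tab = limiter.allow("198.51.100.4", now=0)
second_tab = limiter.allow("198.51.100.4", now=5)   # same user, too soon
other_user = limiter.allow("198.51.100.9", now=5)   # different address
after_wait = limiter.allow("198.51.100.4", now=30)
```

The second tab is turned away while a different address is unaffected, which is the behavior the question asks for, with the overlap caveat described above.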
I know the GIL blocks Python from running its threads across cores. If so, why is Python used in web servers, and how do companies like YouTube and Instagram handle it?
PS: I know alternatives like multiprocessing can solve it. But it would be great if anyone could describe a scenario they handled themselves.
Python is used for server-side handling in web servers, but not (usually) as the web server itself.
In a normal setup, Apache or another web server manages many server-side processes (Python usually plugs in through WSGI). Note that Apache usually serves "static" files directly. So we have one Apache server, many parallel Apache processes (to handle connections and basic HTTP), and many Python processes, each of which handles one connection at a time.
These processes are independent of each other (they just share the same resources), so you can program your server-side part easily, without worrying about deadlocks. It is mostly a trade-off between code performance and being able to produce code quickly and easily without huge problems. But web servers with Python usually scale very well (also on large sites), and servers are cheaper than programmers.
Note: security is also increased by having just one request in a process.
The GIL exists in CPython (the Python interpreter written in C, and the most widely used one). Other interpreters such as Jython or IronPython don't have this problem, because they don't have a GIL.
Even with CPython you can still get parallelism: do the heavy work in C and call it from your Python code, as NumPy and similar libraries do.
Another point: even when your page uses Flask or Django, in production you put it behind Apache, Nginx, or similar, which acts as a real load balancer and can serve the page to many people at the same time.
Take it from the Flask docs:
Flask’s built-in server is not suitable for production as it doesn’t scale well and by default serves only one request at a time.
[...]
If you want to deploy your Flask application to a WSGI server not listed here, look up the server documentation about how to use a WSGI app with it. Just remember that your Flask application object is the actual WSGI application.
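To show what "the actual WSGI application" means in practice, here is a bare WSGI callable using only the standard library. It is not Flask: `application` and the `call_app` helper are hypothetical names, and `call_app` simply invokes the callable the way a WSGI server would, without opening a socket.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A bare WSGI application: any WSGI server can call this function."""
    body = b"Hello from a WSGI app\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app the way a server would, without a network connection."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fill in the remaining required keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(application)
```

A Flask application object exposes this same calling convention, which is why servers such as Gunicorn can run several copies of it in separate processes.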
Although a bit late, I will try to give a generic and useful answer.
@Giacomo Catenazzi's answer is a good one, but part of it is factually incorrect.
API requests (or other forms of web requests) are served from an already running process. These 'already running' processes are created by a server such as Gunicorn, which on startup creates a specified number of processes that run your web application's code and wait continuously to serve incoming requests.
Needless to say, each of these processes is limited by the GIL to running one thread at a time. But over its lifetime one process handles more than one (normally many) requests. It helps here to understand the flow of a request.
We will take Flask as an example, but this applies to most web frameworks. When a request comes from Nginx, it is handed over to Gunicorn, which interacts with your web application via WSGI. When the request reaches the framework, an app context is created and some variables are pushed into it. Then it follows the normal route that most people are familiar with: routing, db calls, response creation and so on. The response is then handed back to Gunicorn, again via WSGI. When the response is handed over, the app context is torn down. So it's the app context, not the process, that is created on every new request.
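The distinction can be sketched in a few lines. This is not Flask's real context machinery: `RequestContext` and `handle` are hypothetical stand-ins that only show a long-lived process creating and discarding one context per request.

```python
import itertools
import os

_context_numbers = itertools.count(1)


class RequestContext:
    """Per-request state, created on entry and discarded on exit."""

    def __init__(self, path: str) -> None:
        self.number = next(_context_numbers)
        self.path = path

    def __enter__(self) -> "RequestContext":
        return self

    def __exit__(self, *exc_info) -> None:
        self.path = None  # torn down when the response is handed back


def handle(path: str) -> dict:
    """Serve one request inside a fresh context, in the current process."""
    with RequestContext(path) as ctx:
        return {"pid": os.getpid(), "context": ctx.number, "path": ctx.path}


# One worker process serving three requests in a row.
responses = [handle(path) for path in ("/a", "/b", "/c")]
```

All three responses carry the same process id but a different context number, which mirrors the point above: the worker stays, the context is per request.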
Also, I have talked only about the sync worker in Gunicorn, but it also offers async workers, which can handle multiple requests concurrently through coroutines. But that's a separate topic.
So answering your question:
Nginx (Capable of handling multiple requests at a time)
Gunicorn creates a pool of n processes at startup and also manages the pool, in the sense that if a process exits or gets stuck, it kills/recreates it and adds it back to the pool.
Each process handling 1 request at a time.
Read more about gunicorn's design and how it can be used to help you achieve your requirements. This is a good thread about gunicorn with flask understanding. And this is a great resource to understand flask app context
I have an AWS server that handles end-user registration. It runs an EC2 Linux instance that serves our API via Apache & Python, and which is connected to its data on a separate Amazon RDS instance running MySQL.
To remotely admin the system, I set states in a mysql table to control the availability of the registration API to the public user, and also the level of logging for our Python API, which may reference up to 5 concurrent admin preferences (i.e. not a single "log level")
Because our API provides almost two dozen different functions, we need to check the state of the system's availability before any individual function is accessed. That means there's an SQL SELECT statement against that table (which only has one record) for every call in a user's session, which might involve a half-dozen API calls. We need to check to see if the availability status has changed, so the user doesn't start an API call and have the database become unavailable in the middle of the process. Same for the logging preferences.
The API calls return the server's availability, and estimated downtime, back to the calling program (NOT a web browser interface) which handles that situation gracefully.
Is this a commonly accepted approach for handling this? Should I care if I'm over-polling the status table? And should I set up MySQL with my status table in such a way as to make my constant checking more efficient (e.g. cached?) when Python obtains its data?
I should note that we might have thousands of simultaneous users making API requests, not tens of thousands, or millions.
Your strategy seems off-track, here.
Polling a status table should not be a major hot spot. A small table, with proper indexes, queried outside a transaction, is a lightweight operation. With an appropriately-provisioned server, such a query should be done entirely in memory, requiring no disk access.
But that doesn't mean it's a fully viable strategy.
We need to check to see if the availability status has changed, so the user doesn't start an API call and have the database become unavailable in the middle of the process.
This will prove impossible. You need time travel capability for this strategy to succeed.
Consider this: the database becoming unavailable in the middle of a process wouldn't be detected by your approach. Only the lack of availability at the beginning would be detected. And that's easy enough to detect, anyway -- you will realize that as soon as you try to do something.
Set appropriate timeouts. The MySQL client library should have support for a connect timeout, as well as a timeout which will cause your application to see an error if a query runs longer than is acceptable or a network disruption causes the connection to be lost mid-query. I don't know whether this exists or what it's called in Python but in the C client library, this is MYSQL_OPT_READ_TIMEOUT and is very handy for preventing a hang when for whatever reason you get no response from the database within an acceptable period of time.
Use database transactions, so that a failure to process a request results in no net change to the database. A MySQL transaction is implicitly rolled back if the connection between the application and the database is lost.
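As a rough illustration of the transaction point, here is a sketch using sqlite3 from the Python standard library as a stand-in for MySQL (so it does not demonstrate client timeouts or lost connections). The table name and the `register` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registrations (email TEXT PRIMARY KEY)")


def register(conn: sqlite3.Connection, emails) -> bool:
    """Insert all emails or none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for email in emails:
                conn.execute(
                    "INSERT INTO registrations (email) VALUES (?)", (email,)
                )
        return True
    except sqlite3.Error:
        return False  # report the failure; the table is unchanged


ok = register(conn, ["a@example.com", "b@example.com"])
failed = register(conn, ["c@example.com", "a@example.com"])  # duplicate key midway
count = conn.execute("SELECT COUNT(*) FROM registrations").fetchone()[0]
```

The second call fails partway through, and the row it had already inserted is rolled back, so the request leaves no net change behind.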
Implementing error handling and recovery in your code is likely a more viable approach, and a better design, than trying to prevent your code from running when the service is unavailable, because there is no check interval small enough to fully avoid a database becoming unavailable "in the middle" of a request.
In any event, polling a database table with each request seems like the wrong approach, not to mention the fact that an outage on the health status table's server makes your service fail unnecessarily when the service itself might have been healthy but failed to prove that.
On the other hand, I don't know your architecture, but assuming your front-end involves something like Amazon Application Load Balancer or HAProxy, the health checks against the API service endpoint can actually perform the test. If you configure your check interval for, say, 10 seconds, and making a request to the check endpoint (say GET /health-check) actually verifies end-to-end availability of the necessary components (e.g. database access) then the API service can effectively take itself offline when a problem occurs. It remains offline until it starts returning success again.
The advantage here is that the workload involved in health checking is consistent -- it happens every 10 seconds, increasing with the number of nodes providing the service, but not increasing with actual request traffic, because you don't have to perform a check for each request. This means you have a window of a few seconds between the actual loss of availability and the detection of that loss, but the requests that get through in the meantime will fail anyway.
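A health-check endpoint of this kind might look roughly like the following. This is a sketch under assumptions: it is a plain WSGI app rather than your actual API, sqlite3 stands in for the real database, and `make_health_app`, `probe` and `broken_connect` are hypothetical names.

```python
import sqlite3
from wsgiref.util import setup_testing_defaults


def make_health_app(connect):
    """Build a WSGI app whose /health-check exercises the database."""

    def app(environ, start_response):
        if environ.get("PATH_INFO") != "/health-check":
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found\n"]
        try:
            conn = connect()
            try:
                conn.execute("SELECT 1").fetchone()  # end-to-end database test
            finally:
                conn.close()
        except Exception:
            start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
            return [b"unhealthy\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok\n"]

    return app


def probe(app):
    """Call the health endpoint the way a load balancer's check would."""
    environ = {"PATH_INFO": "/health-check"}
    setup_testing_defaults(environ)
    statuses = []
    app(environ, lambda status, headers, exc_info=None: statuses.append(status))
    return statuses[0]


def broken_connect():
    raise sqlite3.OperationalError("database unreachable")


healthy = probe(make_health_app(lambda: sqlite3.connect(":memory:")))
unhealthy = probe(make_health_app(broken_connect))
```

The balancer only needs the status code: a 503 takes the node out of rotation until the check starts succeeding again.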
HAProxy -- and presumably other tools like Varnish or Nginx -- can help you handle graceful failures in other ways as well, by timing out failed requests at a layer before the API endpoint so that the caller gets a response even though the service itself didn't respond. An example from one of my environments is a shopping page where an external API call is made by the application when a site visitor is browsing items by category. If this request runs longer than it should, the proxy can interrupt the request and return a preconfigured static error page to the system making the request with an error -- say, in JSON or XML, that the requesting application will understand -- so that the hard failure becomes a softer one. This fake response can, for example in this case, return an empty JSON array of "items found."
It isn't entirely clear to me, now, whether these APIs are yours, or are external APIs that you are aggregating. If the latter, then HAProxy is a good solution here, too, but facing the other direction -- the back-end faces outward and your service contacts its front-end. You access the external service through the proxy and the proxy checks the remote service and will immediately return an error back to your application if the target API is unhealthy. I use this solution to access an external trouble ticketing system from one of my apps. An additional advantage, here, is that the proxy logs allow me to collect usage, performance, and reliability data about all of the many requests passed to that external service regardless of which of dozens of internal systems may access it, with far better visibility than I could achieve by trying to collect it from all of the internal application servers that access that external service.
This is really troublesome for me. I have a Telegram bot that runs on Django and Python 2.7. During development I used django sslserver and everything worked fine. Today I deployed it using Gunicorn behind nginx, and the code behaves very differently than it did on my localhost. I tried everything I could, since I have already started getting users, but to no avail. It seems to me that most Python objects lose their state after each request, and this is what might be causing the problems. The library I use has a class that handles the conversation with a Telegram user, and the state of the conversation is stored in a class instance. Sometimes when new requests come in, those values have already been lost. Has anyone faced this? Is there a way to solve the problem quickly? I am in a critical situation and need a quick solution.
Gunicorn has a preforking worker model -- meaning that it launches several independent subprocesses, each of which is responsible for handling a subset of the load.
If you're relying on internal application state being consistent across all threads involved in offering your service, you'll want to turn the number of workers down to 1, to ensure that all those threads are within the same process.
Of course, this is a stopgap -- if you want to be able to scale your solution to run on production loads, or have multiple servers backing your application, then you'll want to modify your system to persist the relevant state to a shared store, rather than relying on content being available in-process.
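A minimal sketch of the shared-store idea, using a SQLite file as a stand-in for whatever store you pick (a database table, a cache server, and so on). `ConversationStore`, the `state` table and the step name are all hypothetical.

```python
import os
import sqlite3
import tempfile


class ConversationStore:
    """Keep per-chat state in a shared database instead of in process memory."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS state (chat_id INTEGER PRIMARY KEY, step TEXT)"
            )

    def save(self, chat_id: int, step: str) -> None:
        with self.conn:  # commit so other workers can see the change
            self.conn.execute(
                "INSERT OR REPLACE INTO state (chat_id, step) VALUES (?, ?)",
                (chat_id, step),
            )

    def load(self, chat_id: int):
        row = self.conn.execute(
            "SELECT step FROM state WHERE chat_id = ?", (chat_id,)
        ).fetchone()
        return row[0] if row else None


# Two store objects stand in for two Gunicorn workers sharing one database file.
db_path = os.path.join(tempfile.mkdtemp(), "state.db")
worker_a = ConversationStore(db_path)
worker_b = ConversationStore(db_path)

worker_a.save(chat_id=42, step="awaiting_email")
seen_by_b = worker_b.load(chat_id=42)
```

Whichever worker receives the next message for chat 42 can pick the conversation up where the previous worker left it.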
There are a bunch of techniques I can think of for doing this:
Setting up a replica web-server on a different port and/or IP, then using DNS as load-balancer; restarting one server at a time
Utilising more explicit load-balancing (which PaaS such as Heroku and OpenShift use) with implicit replicas
Using some in-built mechanism (e.g.: in nginx)
I am working within an IaaS solution, and will be setting up git and some listeners to handle this whole setup.
What's the best method of restarting the web-server—so my latest revision of my Python web-app can go live—without noticeably affecting site visitors/users/clients?
The simpler the better, no silver bullet.
For a single server, a graceful restart mechanism can be helpful. It starts new processes to accept new requests and keeps the old processes alive until their in-flight requests have finished. Nginx already does this; see http://wiki.nginx.org/CommandLine#Stopping_or_Restarting_Nginx
For multiple servers, using a reverse proxy is good practice. An example structure looks like this, and it can easily be built using Nginx:
If some of the backend servers break down, the reverse proxy can dispatch requests to the other healthy servers without affecting users. You can customize the load-balancing strategy for fine-grained control, and you can flexibly add servers to scale up or take servers out for troubleshooting or code updates.
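The dispatching behavior can be sketched as a round-robin that skips unavailable backends. This is a toy model, not how Nginx is implemented: the `dispatch` function, the `down` set and the backend names are hypothetical.

```python
from itertools import cycle
from typing import List, Set


def dispatch(backends: List[str], down: Set[str], requests: int) -> List[str]:
    """Round-robin over backends, skipping any that are marked down."""
    if not set(backends) - down:
        raise RuntimeError("no healthy backend available")
    rotation = cycle(backends)
    served_by = []
    while len(served_by) < requests:
        backend = next(rotation)
        if backend not in down:
            served_by.append(backend)
    return served_by


backends = ["app1", "app2", "app3"]
all_up = dispatch(backends, down=set(), requests=4)
during_update = dispatch(backends, down={"app2"}, requests=4)
```

While app2 is out for a code update, its share of the traffic simply falls to app1 and app3, and it rejoins the rotation once it is no longer marked down.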