We have a Flask app running behind uWSGI with 4 processes. It's an API which serves data from one of our two Elasticsearch clusters.
On app bootstrap each process pulls config from an external DB to check which ES cluster is active and connects to it.
Every now and then a POST request comes in (from the AWS SNS service) which informs all the clients to switch ES cluster. That triggers the same function as on bootstrap: pull config from the DB and reconnect to the active ES cluster.
It works well running as a single process, but when we have more than one process running, only one of them gets updated (the one which picks up the POST request)... while the other processes are still connected to the inactive cluster.
Pulling config on each request to make sure that the ES cluster we use is active would be too slow. I'm thinking of installing Redis locally and storing the active_es_cluster key there... any other ideas?
I think there are two routes you could go down.
Have an endpoint "/set_es_cluster" that gets hit by your SNS POST request. This endpoint then sets the key "active_es_cluster", which is read on every ES request by your other processes (a sketch of this follows below). The downside of this is that on each ES request you need to do a Redis lookup first.
Have a separate process that receives the POST request specifically (I assume the clusters are not changing often). The purpose of this process is to receive the POST request and have uWSGI gracefully restart your other Flask processes.
The advantages of the second option:
Don't have to hit Redis on every request
Let uWSGI handle the restarts for you (which it does well)
You already set up the config pulling at runtime anyway, so it should "just work" with your existing application
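A rough sketch of the first route, assuming a local Redis instance and the redis-py client (the endpoint name, key name and cluster addresses are illustrative, not part of the original setup):

import json

import redis
from elasticsearch import Elasticsearch
from flask import Flask, request

app = Flask(__name__)
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Placeholder cluster addresses; in the real app these would come from the external DB.
ES_CLUSTERS = {
    "cluster_a": ["http://es-a:9200"],
    "cluster_b": ["http://es-b:9200"],
}

@app.route("/set_es_cluster", methods=["POST"])
def set_es_cluster():
    # SNS delivers the notification as JSON in the request body.
    payload = json.loads(request.data)
    r.set("active_es_cluster", payload["cluster"])
    return "", 204

def get_es_client():
    # Every worker process reads the shared key before talking to ES,
    # so a switch made by one worker is visible to all of them.
    active = r.get("active_es_cluster") or "cluster_a"
    return Elasticsearch(ES_CLUSTERS[active])

For the second route, a handler running under uWSGI can call uwsgi.reload() (from the built-in uwsgi module) to trigger a graceful restart of the workers, or you can use uWSGI's touch-reload option and simply touch the watched file from the handler.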
Related
In short:
I have a Django application being served up by Apache on a Google Compute Engine VM.
I want to access a secret from Google Secret Manager in my Python code (when the Django app is initialising).
When I do 'python manage.py runserver', the secret is successfully retrieved. However, when I get Apache to run my application, it hangs when it sends a request to the secret manager.
Too much detail:
I followed the answer to this question: GCP VM Instance is not able to access secrets from Secret Manager despite of appropriate Roles. I created a service account (not the default) and gave it the 'cloud-platform' scope. I also gave it the 'Secret Manager Admin' role in the web console.
After initially running into trouble, I downloaded a JSON key for the service account from the web console and set the GOOGLE_APPLICATION_CREDENTIALS env var to point to it.
When I run the django server directly on the VM, everything works fine. When I let Apache run the application, I can see from the logs that the service account credential json is loaded successfully.
However, when I make my first API call, via google.cloud.secretmanager.SecretManagerServiceClient.list_secret_versions, the application hangs. I don't even get a 500 error in my browser, just an eternal loading icon. I traced the execution as far as:
grpc._channel._UnaryUnaryMultiCallable._blocking, line 926: 'call = self._channel.segregated_call(...'
It never gets past that line. I couldn't figure out where that call goes, so I couldn't inspect it any further than that.
Thoughts
I don't understand GCP service accounts / API access very well. I can't understand why this difference is occurring between the Django dev server and Apache, given that they're both using the same service account credentials from JSON. I'm also surprised that the application just hangs in the Google library rather than throwing an exception. There's even a timeout option when sending a request, but changing this doesn't make any difference.
I wonder if it's somehow related to the fact that I'm running the Django dev server under my own account, while Apache is running under whatever user account it uses?
Update
I tried changing the user/group that apache runs as to match my own. No change.
I enabled logging for gRPC itself. There is a clear difference between running with Apache and running with the Django dev server.
On Django:
secure_channel_create.cc:178] grpc_secure_channel_create(creds=0x17cfda0, target=secretmanager.googleapis.com:443, args=0x7fe254620f20, reserved=(nil))
init.cc:167] grpc_init(void)
client_channel.cc:1099] chand=0x2299b88: creating client_channel for channel stack 0x2299b18
...
timer_manager.cc:188] sleep for a 1001 milliseconds
...
client_channel.cc:1879] chand=0x2299b88 calld=0x229e440: created call
...
call.cc:1980] grpc_call_start_batch(call=0x229daa0, ops=0x20cfe70, nops=6, tag=0x7fe25463c680, reserved=(nil))
call.cc:1573] ops[0]: SEND_INITIAL_METADATA...
call.cc:1573] ops[1]: SEND_MESSAGE ptr=0x21f7a20
...
So, a channel is created, then a call is created, and then we see gRPC start to execute the operations for that call (as far as I read it).
On Apache:
secure_channel_create.cc:178] grpc_secure_channel_create(creds=0x7fd5bc850f70, target=secretmanager.googleapis.com:443, args=0x7fd583065c50, reserved=(nil))
init.cc:167] grpc_init(void)
client_channel.cc:1099] chand=0x7fd5bca91bb8: creating client_channel for channel stack 0x7fd5bca91b48
...
timer_manager.cc:188] sleep for a 1001 milliseconds
...
timer_manager.cc:188] sleep for a 1001 milliseconds
...
So, a channel is created... and then nothing. No call, no operations. The Python code is sitting there waiting for gRPC to make this call, which it never does.
The problem appears to be that the forking behaviour of Apache breaks gRPC somehow. I couldn't nail down the precise cause, but after I began to suspect that forking was the issue, I found this old gRPC issue that indicates that forking is a bit of a tricky area.
I tried to reconfigure Apache to use a different 'Multi-processing Module', but as my experience in this is limited, I couldn't get gRPC to work under any of them.
In the end, I switched to using nginx/uwsgi instead of Apache/mod_wsgi, and I did not have the same issue. If you're trying to solve a problem like this and you have to use Apache, I'd advise investigating Apache's forking behaviour further, how gRPC handles forking, and the different MPMs available for Apache.
I'm facing a similar issue when running my Flask application with eventlet==0.33.0 and gunicorn https://github.com/benoitc/gunicorn/archive/ff58e0c6da83d5520916bc4cc109a529258d76e1.zip#egg=gunicorn==20.1.0. When calling secret_client.access_secret_version it hangs forever.
It used to work fine with an older eventlet version, but we needed to upgrade to the latest version of eventlet due to security reasons.
I experienced a similar issue and I was able to solve with the following:
import grpc.experimental.gevent as grpc_gevent
from gevent import monkey
from google.cloud import secretmanager

# Patch the standard library for gevent, then install gRPC's experimental
# gevent support so its internal threads cooperate with the event loop
# instead of blocking forever.
monkey.patch_all()
grpc_gevent.init_gevent()

client = secretmanager.SecretManagerServiceClient()
I am running a flask application which has an option in the UI where the user can click a button and it calls an endpoint to perform some analysis.
The application is served as follows:
from waitress import serve
serve(app, host="0.0.0.0", port=5000)
After around ~1 minute, I am receiving a gateway timeout in the UI:
504 Gateway Time-out
However, the Flask application keeps doing the work in the background, and after 2 minutes it completes the processing; I can see that it submits the data on the DB side. So the process itself is not timing out.
I already tried passing the channel_timeout argument with a much higher value (the default seems to be 120 seconds), but with no luck. I know it would make more sense to implement this in a different way where the user doesn't have to wait for these two minutes, but I am asking whether there is such a timeout set by default and whether it can be increased.
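For reference, this is roughly how that argument is passed to waitress (the value here is illustrative):

serve(app, host="0.0.0.0", port=5000, channel_timeout=300)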
The application is deployed in K8s and the UI is exposed via ingress. Could the timeout come from ingress instead?
The problem was with the ingress controller default timeout.
I managed to get around this issue by changing the implementation and having this job run as a background task instead.
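For reference, had raising the timeout been preferable to restructuring the work, the ingress timeouts can usually be increased with annotations. A sketch for the NGINX ingress controller (values are in seconds and purely illustrative; other controllers use different annotations):

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"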
I have an API that does the operations below. I am using Python, the Django framework, and gunicorn/nginx for deployment.
The API is deployed in AWS Lightsail. A request comes in every 2 seconds.
1. Receives data from the client.
2. Creates a record in the local SQLite database and sends a response.
3. Runs a task asynchronously in a thread. The entire task takes about 1 second on average.
   a. Gets the updated record from step 2 by ID. (0 sec)
   b. Posts data to another API using requests. (0.5 sec)
   c. Updates the database (AWS RDS). (0.5 sec)
Setup:
I have a ThreadPoolExecutor with max_workers=12.
gunicorn has one worker, as the instance has 1 vCPU. I don't use gunicorn workers with threads, since I have to perform some other tasks within the API.
The reason asyncio is not used is that database updates in Django are not supported with it, so I kept the POST to the other API in the thread as well.
Each request is unique; there are no duplicate requests.
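To make the setup concrete, here is a rough sketch of the pattern described above; Record, OTHER_API_URL and the field names are illustrative placeholders, not the actual code:

from concurrent.futures import ThreadPoolExecutor

import requests
from django.http import JsonResponse

from myapp.models import Record  # hypothetical model holding the incoming data

OTHER_API_URL = "https://example.com/other-api"  # placeholder
executor = ThreadPoolExecutor(max_workers=12)

def process_record(record_id):
    record = Record.objects.get(pk=record_id)                     # step 3a: re-read the record saved in step 2
    requests.post(OTHER_API_URL, json={"data": record.payload})  # step 3b: post to the other API (~0.5 s)
    record.synced = True
    record.save(using="rds")                                      # step 3c: update AWS RDS (~0.5 s)

def receive(request):
    record = Record.objects.create(payload=request.body.decode())  # step 2: write to local SQLite
    executor.submit(process_record, record.pk)                      # step 3: hand off to the thread pool
    return JsonResponse({"id": record.pk})                          # respond immediately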
Even if I keep max_workers at 1 in the thread pool, it bursts past the 10% CPU mark on the $5 AWS instance, and the API only receives a request every 2 seconds.
I am not able to profile this situation to see what is causing the CPU usage.
There are a couple of reasons I can think of:
The gunicorn master is constantly checking on the worker.
The OS is managing context switching between the threads.
Any pointers on how to profile this would be helpful.
I have a Flask application. It works with a database, and I'm using SQLAlchemy for this. So I have one question:
Flask handles requests one by one. So, for example, I have two users, A and B, who are concurrently modifying the same record in a database table.
How can I tell user B that user A has changed this record? There must be some message sent to user B.
In the development server version, when you do app.run(), you get a single synchronous process, which means at most one request is processed at a time. So you cannot accept multiple users at the same time.
However, gunicorn is a solid, easy-to-use WSGI server that will let you spawn multiple workers (separate processes), and even comes with asynchronous workers when you need to deploy your application.
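For instance, a minimal invocation along these lines starts four worker processes (assuming the Flask instance is named app inside app.py):

gunicorn --workers 4 --bind 0.0.0.0:8000 app:app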
However, to answer your question: since they run on separate threads, the data that exists in the database at the specific time the query is run in that thread is what will be used/returned.
I hope this answers your query.
I am trying to serve bokeh documents via Django using the bokeh-server executable, which creates a Tornado instance. The bokeh documents can be accessed via the URL provided by the Session.object_link method. When it is navigated to, the bokeh-server executable writes this to stdout (IP addresses have been replaced with ellipses):
INFO:tornado.access:200 POST /bokeh/bb/71cee48b-5122-4275-bd4f-d137ea1374e5/gc (...) 222.55ms
INFO:tornado.access:200 GET /bokeh/bb/71cee48b-5122-4275-bd4f-d137ea1374e5/ (...) 110.15ms
INFO:tornado.access:200 POST /bokeh/bb/71cee48b-5122-4275-bd4f-d137ea1374e5/gc (...) 232.66ms
INFO:tornado.access:200 GET /bokeh/bb/71cee48b-5122-4275-bd4f-d137ea1374e5/ (...) 114.16ms
This appears to be communication between the python instance running the Django WSGI app (initialized by Apache running mod_wsgi) and the bokeh-server executable.
When the browser is sent the response, including the graphs and data etc. required for the bokeh interface, there is some initial networking to the browser, followed by networking if there is any interaction with the graphs which have python callbacks. When the user closes the window or browser, the same networking above continues. Moreover, the networking only stops when the Django or bokeh-server processes are killed.
In order to start a bokeh session and pass a URL back to the Django template, it is necessary to start the bokeh session in a new thread:
from threading import Thread

def get_bokeh_url(self, context):
    # Start polling the document in a background thread so the request can return.
    t = Thread(target=self.run)
    t.start()
    return self.session.object_link(self.document.context)

def run(self):
    # Blocks indefinitely, servicing the document's interaction callbacks.
    return self.session.poll_document(self.document)
self.session and self.document were both initialized before the thread was started. So at the point where get_bokeh_url is called, there are some graphs on the document, some of which have interaction callbacks and session has been created but not polled via poll_document (which appears necessary for interaction).
The thread keeps running forever unless you kill either Django or bokeh-server. This means that when more requests come through, more threads build up and the amount of networking increases.
My question is, is there a way to kill the thread once the document is no longer being viewed in a browser?
One answer that I have been pondering would be to send a quick request to the server when the browser closes and somehow kill the thread for that document. I've tried deleting the documents from the bokeh interface, but this has no effect.
The bokeh server periodically checks whether there are connections to a session. If there have been no connections for some time, the session is expired and destroyed.
As of version 0.12.1, the check interval and maximum connectionless time default to 17 and 60 seconds, respectively. You can override them by running the server like this (the values are given in milliseconds):
bokeh serve --check-unused-sessions 1000 --unused-session-lifetime 1000 app.py
This is rather hard to find in the docs; it's described in the CLI documentation and in the developer guide, in the section on Applications, Sessions and Connections in the Server Architecture chapter. There's also a closed GitHub issue on this topic: Periodic callbacks continue after tabs are closed #3770
If you need custom logic whenever a session is destroyed, use the directory deploy format for your app and add a server_lifecycle.py file containing your Lifecycle Hooks, specifically this one:
def on_session_destroyed(session_context):
    ''' If present, this function is called when a session is closed. '''
    pass
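For reference, the directory format mentioned above looks roughly like this (the app name is illustrative), with server_lifecycle.py sitting next to main.py; the app is then started with bokeh serve myapp:

myapp/
    main.py
    server_lifecycle.py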