uwsgi.is_connected() delay with nginx - python

I have a setup with nginx, uwsgi, and gevent. When testing the setup's ability to handle premature client disconnects, I found that uwsgi isn't exactly responding in a timely manner.
This is how I detect that a disconnect has occurred inside of my python code:
import sys
import gevent

while True:
    if 'uwsgi' in sys.modules:
        import uwsgi  # only available when running under uWSGI
        fileDescriptor = uwsgi.connection_fd()
        if not uwsgi.is_connected(fileDescriptor):
            logger.debug("Connection was lost (client disconnect)")
            break
    gevent.sleep(2)  # avoid hammering the CPU
So when uwsgi signals a lost connection, I break out of this loop. The gevent.sleep(2) at the bottom of the loop keeps it from hammering the CPU.
With that in place, nginx logs the closed connection like this:
2016/08/16 19:23:23 [info] 32452#0: *1 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending to client, client: 192.168.56.1, server: <removed>, request: "GET /myurl HTTP/1.1", upstream: "uwsgi://127.0.0.1:8070", host: "<removed>:8443"
nginx is immediately aware of the disconnect; it produces this log entry within milliseconds of the client disconnecting. Yet uwsgi doesn't notify my code of the disconnect until seconds, sometimes almost a minute, later:
DEBUG - Connection was lost (client disconnect) - 391 ms[08/16/16 19:24:04 UTC])
The uwsgi.log file created via daemonize suggests it saw the disconnect a second before nginx did, yet waited half a minute to actually tell my code:
[pid: 32208|app: 0|req: 2/2] 192.168.56.1 () {32 vars in 382 bytes} [Tue Aug 16 19:23:22 2016] GET /myurl => generated 141 bytes in 42030 msecs (HTTP/1.1 200) 2 headers in 115 bytes (4 switches on core 999
This is my setup in nginx:
upstream bottle {
    server 127.0.0.1:8070;
}

server {
    listen 8443;
    ssl on;
    ssl_certificate /etc/pki/tls/certs/server.crt;
    ssl_certificate_key /etc/pki/tls/private/server.key;
    server_name <removed>;

    # Load configuration files for the default server block.
    include /etc/nginx/default.d/*.conf;

    location / {
        include uwsgi_params;
        #proxy_read_timeout 5m;
        uwsgi_buffering off;
        uwsgi_ignore_client_abort off;
        proxy_ignore_client_abort off;
        proxy_cache off;
        chunked_transfer_encoding off;
        #uwsgi_read_timeout 5m;
        #uwsgi_send_timeout 5m;
        uwsgi_pass bottle;
    }
}
The odd part to me is that the timestamp from uwsgi says it saw the disconnect right when nginx did, yet it doesn't write that log entry until my code sees the disconnect ~30 seconds later. From my perspective it looks like uwsgi is either misreporting or locking up, yet I can't find any errors from it.
Any help is appreciated. I've tried removing any buffering and delays from nginx, without success.

Related

Sagemaker Batch Transform Job "upstream prematurely closed connection" when surpassing 30 minutes

I am serving a SageMaker model through a custom Docker container using the guide that AWS provides. The container runs a simple nginx -> gunicorn/wsgi -> Flask server.
I am facing an issue where my transform requests time out at around 30 minutes in all instances, despite the fact that they should be able to continue up to 60 minutes. I need requests to be able to run for the SageMaker maximum of 60 minutes due to the data-intensive nature of the request.
Through experience working with this setup for some months, I know that there are 3 factors that should affect the time my server has to respond to requests:
SageMaker itself will cap invocation requests according to the InvocationsTimeoutInSeconds parameter set when creating the batch transform job.
The nginx.conf file must be configured such that keepalive_timeout, proxy_read_timeout, proxy_send_timeout, and proxy_connect_timeout are all equal to or greater than the maximum timeout.
The gunicorn server must have its timeout configured to be equal to or greater than the maximum timeout.
I have verified that when I create my batch transform job, InvocationsTimeoutInSeconds is set to 3600 (1 hour).
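For reference, this is roughly how that parameter is passed when creating the job with boto3 (the job name, model name, S3 URIs and instance type below are placeholders, not my real values):

import boto3

sm = boto3.client("sagemaker")

# Placeholder job/model names and S3 locations, shown only to illustrate
# where InvocationsTimeoutInSeconds lives in the request.
sm.create_transform_job(
    TransformJobName="my-batch-transform-job",
    ModelName="my-model",
    ModelClientConfig={
        "InvocationsTimeoutInSeconds": 3600,  # allow up to 1 hour per invocation
        "InvocationsMaxRetries": 0,
    },
    TransformInput={
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://my-bucket/input/"}},
        "ContentType": "application/json",
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/output/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)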
My nginx.conf looks like this:
worker_processes 1;
daemon off; # Prevent forking
pid /tmp/nginx.pid;
error_log /var/log/nginx/error.log;

events {
    # defaults
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    access_log /var/log/nginx/access.log combined;

    sendfile on;
    client_max_body_size 30M;
    keepalive_timeout 3920s;

    upstream gunicorn {
        server unix:/tmp/gunicorn.sock;
    }

    server {
        listen 8080 deferred;
        client_max_body_size 80m;
        keepalive_timeout 3920s;
        proxy_read_timeout 3920s;
        proxy_send_timeout 3920s;
        proxy_connect_timeout 3920s;
        send_timeout 3920s;

        location ~ ^/(ping|invocations) {
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $http_host;
            proxy_redirect off;
            proxy_pass http://gunicorn;
        }

        location / {
            return 404 "{}";
        }
    }
}
I start the gunicorn server like this:
import os
import signal
import subprocess

def start_server():
    # model_server_workers, model_server_timeout and sigterm_handler are
    # defined elsewhere in this serving script.
    print('Starting the inference server with {} workers.'.format(model_server_workers))
    print('Model server timeout {}.'.format(model_server_timeout))

    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(3600),
                                 '-k', 'sync',
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '--log-level', 'debug',
                                 '-w', str(1),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = set([nginx.pid, gunicorn.pid])
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    sigterm_handler(nginx.pid, gunicorn.pid)
    print('Inference server exiting')
Despite all this, whenever a transform job takes longer than approximately 30 minutes, I see this message in my logs and the transform job status becomes Failed:
2023/01/07 08:23:14 [error] 11#11: *4 upstream prematurely closed connection while reading response header from upstream, client: 169.254.255.130, server: , request: "POST /invocations HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/invocations", host: "169.254.255.131:8080"
I am close to thinking there is a bug in AWS Batch Transform, but perhaps I am missing some other variable (perhaps in the nginx.conf) that leads to premature upstream termination of my request.
By looking at hardware metrics, I was able to determine that the upstream termination only happened when the server was near its memory limit. So my guess is that the OS was killing the gunicorn worker, and the 30-minute mark was just a coincidence in my long-running test cases.
My solution was to increase the memory available on the server.
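One way to confirm that kind of kill, rather than inferring it from metrics (this is an assumption on my part; it needs dmesg access and I didn't capture this at the time), is to scan the kernel log for OOM-killer entries:

import subprocess

def find_oom_kills():
    """Scan the kernel ring buffer for OOM-killer activity (requires dmesg access)."""
    out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return [line for line in out.splitlines()
            if "Out of memory" in line or "oom-kill" in line.lower()]

for line in find_oom_kills():
    print(line)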

First request doesnt terminate (no FIN) uWSGI + nginx

I am using nginx as a reverse proxy in front of a uWSGI server (flask apps).
Due to a memory leak, I use --max-requests to reload workers after a given number of calls.
The issue is the following: when a worker has just restarted/started, the first request it receives hangs between uWSGI and nginx. The processing time inside the Flask app is as usual and very quick, but the client waits until uwsgi_send_timeout is triggered.
Using tcpdump to watch the request (nginx is XXX.14 and uWSGI is XXX.11), you can see in the time column that it hangs for 300 seconds (uwsgi_send_timeout) even though the HTTP request has been received by nginx. uWSGI just doesn't send a [FIN] packet to signal that the connection is closed, so nginx eventually triggers the timeout and closes the session.
The end client receives a truncated response, with a 200 status code, which is very frustrating.
This happens at every worker reload, and only once, on the first request, no matter how big the request is.
Does anyone have a workaround for this issue? Have I misconfigured something?
uwsgi.ini
[uwsgi]
# Get the location of the app
module = api:app
plugin = python3
socket = :8000
manage-script-name = true
mount = /=api:app
cache2 = name=xxx,items=1024
# Had to increase buffer-size because of big authentication requests.
buffer-size = 8192
## Workers management
# Number of workers
processes = $(UWSGI_PROCESSES)
master = true
# Number of requests managed by 1 worker before reloading (reload is time expensive)
max-requests = $(UWSGI_MAX_REQUESTS)
lazy-apps = true
single-interpreter = true
nginx-server.conf
server {
    listen 443 ssl http2;
    client_max_body_size 50M;

    location #api {
        include uwsgi_params;
        uwsgi_pass api:8000;
        uwsgi_read_timeout 300;
        uwsgi_send_timeout 300;
    }
For some weird reason, adding the parameter uwsgi_buffering off; to the nginx config fixed the issue.
I still don't understand why, but for now this fixes my problem. If anyone has a valid explanation, don't hesitate.
server {
    listen 443 ssl http2;
    client_max_body_size 50M;

    location #api {
        include uwsgi_params;
        uwsgi_pass api:8000;
        uwsgi_buffering off;
        uwsgi_read_timeout 300;
        uwsgi_send_timeout 300;
    }

nginx bad gateway 502 in django

I have configured a Django site using nginx, gunicorn and supervisor on Ubuntu 14.04, and it has been working perfectly for more than 2 years without any lag in requests or responses.
My site has a script/management command that takes a database dump and pushes it to S3 through a cron job. A few days ago the code stopped working and started throwing socket.error: [Errno 104] Connection reset by peer. I posted the complete traceback here but couldn't get any response, so I started googling around and found a post saying that, to get rid of the socket.error: [Errno 104] Connection reset by peer error, the following lines should be added to /etc/sysctl.conf:
# Workaround for TCP Window Scaling bugs in other people's equipment
net.ipv4.tcp_wmem = 4096 16384 512000
net.ipv4.tcp_rmem = 4096 87380 512000
I added them and ran $ sudo sysctl -p, then executed the db backup/S3 upload command python manage.py db_backup, but I still hit the same socket.error: [Errno 104] Connection reset by peer error. So I reverted the changes (removed the lines added to /etc/sysctl.conf above), re-ran $ sudo sysctl -p, and got my previous settings back.
Also, in my nginx configuration I have:
ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # SSLv2
I read somewhere that removing TLSv1 from the ssl_protocols setting above would solve the socket.error: [Errno 104] Connection reset by peer problem, so I removed it and restarted the nginx server. The db_backup management command then seemed to work, but I added TLSv1 back to ssl_protocols just to make sure and to confirm with someone else.
Now the actual problem: after making the above changes, reverting them, and restarting supervisor and nginx, my site has become damn slow.
I have different sections on the site, like:
Home page
Contact us
Many pages that show/list records fetched from the postgres database
The home page and contact-us page work as usual, but the database-backed pages fail to load even after 3 minutes and display 502 Bad Gateway nginx/1.4.6 (Ubuntu).
I tried everything, like restarting postgres, nginx and supervisor, and double-checked /etc/sysctl.conf to make sure it doesn't have any new changes. Everything seems to be fine, but I can't understand why the site has become so slow.
Nginx and gunicorn files
server {
    listen 80;
    server_name example.com www.example.com m.example.com;

    location / {
        return 301 https://www.example.com$request_uri;
        # proxy_pass http://127.0.0.1:8001;
    }

    location /static/ {
        alias /user/apps/example_webapp/project/new_media/;
    }
}

server {
    listen 443 ssl;
    server_name example.com www.example.com m.example.com;

    ssl_certificate /etc/ssl/example/example.com.chained.crt;
    ssl_certificate_key /etc/ssl/example/www.example.com.key;
    ssl_session_timeout 20m;
    ssl_session_cache shared:SSL:10m; # ~ 40,000 sessions
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # SSLv2
    # ssl_ciphers ALL:!aNull:!eNull:!SSLv2:!kEDH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+EXP:#STRENGTH;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    client_max_body_size 20M;

    location / {
        proxy_pass http://127.0.0.1:8001;
        proxy_connect_timeout 300s;
        proxy_read_timeout 300s;
    }

    location /static/ {
        alias /user/apps/example_webapp/project/new_media/;
    }
}
Gunicorn
bind = "127.0.0.1:8001"
workers = 3
loglevel = "debug"
proc_name = "project"
daemon = False
pythonpath = "/user/apps/project_name/"
errorlog = "/user/apps/gunicorn_configurations/gunicorn_logfiles/gunicorn_errors.log"
timeout = 90
So can anyone please let me know how to bring my site back to its original state? Also, what might cause it to slow down suddenly without any apparent reason, and where should I look for errors?

How do I cleanup in uwsgi after nginx so_keepalive timeout?

My nginx configuration is like:
server {
    listen 80 so_keepalive=30m::;

    location /wsgi {
        uwsgi_pass uwsgicluster;
        include uwsgi_params;
        uwsgi_read_timeout 30000;
        uwsgi_buffering off;
    }
    ...
}
In my python:
def application_(environ, start_response):
    body = queue.Queue()
    ...
    gevent.spawn(redis_wait, environ, body, channels)
    return body

def redis_wait(environ, body, channels):
    server = redis.Redis(connection_pool=REDIS_CONNECTION_POOL)
    client = server.pubsub()
    try:
        for channel in channels:
            client.subscribe(channel)
        messages = client.listen()
        for message in messages:
            if message['type'] != 'message' and message['type'] != 'pmessage':
                continue
            body.put(message['data'])
    finally:
        client.unsubscribe()
        client.close()
The problem occurs when the client connection is interrupted (the network connection is abruptly lost, the application terminates, etc.): redis shows that the connection on the server is still open. Even with so_keepalive, the connection isn't being cleaned up. How do I fix this?
EDIT: I've noticed through the nginx_status page that the active connection count does go down after the disconnect. The problem is that uwsgi isn't getting notified of this.
You have to wait on the uwsgi socket as well as the redis socket so that you can be notified in case the uwsgi socket closes. Example here: http://nullege.com/codes/show/src%40k%40o%40kozmic-ci-HEAD%40tailer%40__init__.py/72/uwsgi.connection_fd/python
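In case that link goes away, here is a minimal sketch of the idea, assuming the fd from uwsgi.connection_fd() is captured inside application_() and handed to the greenlet, and using redis-py's get_message() with a timeout instead of the blocking listen() (the argument names and polling interval are mine, not from the linked example):

import sys

import gevent
import redis

def redis_wait(fd, body, channels):
    # fd is the value of uwsgi.connection_fd(), captured inside application_()
    # while the request greenlet still owns the connection.
    server = redis.Redis(connection_pool=REDIS_CONNECTION_POOL)
    client = server.pubsub()
    try:
        for channel in channels:
            client.subscribe(channel)
        while True:
            # Bail out as soon as uwsgi reports the client connection is gone.
            if 'uwsgi' in sys.modules:
                import uwsgi
                if not uwsgi.is_connected(fd):
                    break
            # Poll redis with a timeout instead of blocking forever in listen().
            message = client.get_message(timeout=1.0)
            if message and message['type'] in ('message', 'pmessage'):
                body.put(message['data'])
            gevent.sleep(0)  # yield to other greenlets
    finally:
        client.unsubscribe()
        client.close()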

How do I debug a HTTP 502 error?

I have a Python Tornado server sitting behind a nginx frontend. Every now and then, but not every time, I get a 502 error. I look in the nginx access log and I see this:
127.0.0.1 - - [02/Jun/2010:18:04:02 -0400] "POST /a/question/updates HTTP/1.1" 502 173 "http://localhost/tagged/python" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"
and in the error log:
2010/06/02 18:04:02 [error] 14033#0: *1700 connect() failed (111: Connection refused)
while connecting to upstream, client: 127.0.0.1, server: _,
request: "POST /a/question/updates HTTP/1.1",
upstream: "http://127.0.0.1:8888/a/question/updates", host: "localhost", referrer: "http://localhost/tagged/python"
I don't think any errors show up in the Tornado log. How would you go about debugging this? Is there something I can put in the Tornado or nginx configuration to help debug this?
The line from the error log is very informative, in my opinion. It says the connection was refused by the upstream; it contains the client IP, the nginx server config, the request line, the hostname, the upstream URL and the referrer.
It is pretty clear you must look at the upstream (or firewall) to find out the reason.
In case you'd like to see how nginx processes the request and why it chooses specific server and location sections, there is a beautiful "debug" mode. (Note: your nginx binary must be built with debugging support, i.e. the --with-debug option.) Then:
error_log /path/to/your/error.log debug;
will turn on debugging for all requests. The debugging information in the error log takes some time to get used to, but it's worth the effort.
Do not use this "as is" on high-traffic sites! It generates a lot of information and your error log will grow very fast. If you need to debug requests in production, use the debug_connection directive:
events {
    debug_connection 1.2.3.4;
}
It turns debugging on for the specific client IP address only.
