Django times out instead of returning exception - python

I've got a remote server running Nginx -> gunicorn -> django. When I hit a view that causes an exception, I would expect a 500 server error page to be returned. Instead, it hangs for ~10 seconds and I get a 502 bad gateway.
When I look in the gunicorn logs, they indicate a worker timed out and was killed. No exceptions are logged, and no admin emails are sent. The gunicorn logs:
[2016-02-16 16:47:30 -0600] [5809] [CRITICAL] WORKER TIMEOUT (pid:5817)
[2016-02-16 22:47:30 +0000] [5817] [INFO] Worker exiting (pid: 5817)
[2016-02-16 16:47:30 -0600] [5833] [INFO] Booting worker with pid: 5833
On my local machine, everything works as expected. They are both running identical settings.py (DEBUG is False). I reduced it to a test case of
def foo(request):
    raise Exception('bar')
Browsing to it locally, it immediately returns the 500 server error page, as well as firing off admin emails. On the remote server, the browser spins for a while then nginx returns the bad gateway response. No emails are sent, no exceptions are logged.
Regular pages return immediately with the responses I expect. It appears to exhibit the bad behavior only if an exception is thrown.
What might cause such behavior?

I figured it out. The firewall wasn't allowing outbound SMTP connections. Django hung trying to send the email.
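One way to keep a blocked SMTP connection from hanging the worker is to put a short timeout on the email backend. A minimal sketch, assuming Django 1.8+ and the standard SMTP backend (the host and port below are placeholders):

# settings.py -- host/port are placeholders; EMAIL_TIMEOUT makes the send fail fast
EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
EMAIL_HOST = 'smtp.example.com'  # placeholder
EMAIL_PORT = 587
EMAIL_TIMEOUT = 5  # seconds

With this set, a blocked outbound SMTP connection should fail after a few seconds instead of tripping gunicorn's worker timeout, so the 500 response can still be returned.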

As a first step I would increase the nginx timeouts:
proxy_connect_timeout 300s;
proxy_read_timeout 300s;
and the gunicorn timeout:
--timeout 180
It may also help to log the unhandled exceptions to a file so you can see what the worker is actually doing when it hangs.
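For the logging part, here is a minimal sketch of a Django LOGGING setting that writes unhandled request exceptions to a file (the log path is a placeholder):

# settings.py -- the filename is a placeholder; django.request receives the errors behind 500 responses
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'errors_file': {
            'class': 'logging.FileHandler',
            'filename': '/var/log/django/errors.log',  # placeholder path
        },
    },
    'loggers': {
        'django.request': {
            'handlers': ['errors_file'],
            'level': 'ERROR',
            'propagate': True,
        },
    },
}

That way the traceback ends up in a file even when the admin email never goes out.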

Related

nginx gunicorn 502 bad gateway: upstream prematurely closed connection while reading response header from upstream

I have a Django app with which users can create video collages using multiple videos. The problem is that on production, when uploading videos to Amazon S3, I get a 502 Bad Gateway (it works fine locally). Does anyone know what could be wrong? I already set
client_max_body_size 100M
and
fastcgi_buffers 8 16k;
fastcgi_buffer_size 32k;
fastcgi_connect_timeout 3000;
fastcgi_send_timeout 3000;
fastcgi_read_timeout 3000;
Thanks in advance.
Full error:
2017/12/31 23:50:51 [error] 1279#1279: *1 upstream prematurely closed connection while reading response header from upstream,
client: 107.205.110.154,
server: movingcollage.com,
request: "POST /create-collage/ HTTP/1.1",
upstream: "http://unix:/home/mike/movingcollage/movingcollage.sock:/create-collage/",
host: "movingcollage.com", referrer: "http://movingcollage.com/create-collage/"
If the problem were an nginx timeout it would give you a 504 error. A 502 means the connection was closed by the process behind nginx, which in your case I guess is gunicorn timing out. Try launching it with the -t 3000 param (to match your nginx conf).
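If you prefer a config file over command-line flags, the same timeout can live in a gunicorn config module. A minimal sketch of a gunicorn.conf.py, assuming the socket path from your error log (the worker count is a placeholder):

# gunicorn.conf.py -- a sketch; adjust the paths and worker count to your deployment
bind = 'unix:/home/mike/movingcollage/movingcollage.sock'
workers = 3          # placeholder
timeout = 3000       # match the fastcgi_*_timeout values in nginx
graceful_timeout = 30

You would then start it with something like gunicorn -c gunicorn.conf.py yourproject.wsgi:application, where yourproject stands in for the actual project module.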

upstream timed out error in Python Flask web app using Fabric on nginx

I have my Python Flask web app hosted on nginx. When I try to execute a request, it shows a timeout error in the nginx error log, as shown below:
[error] 2084#0: *1 upstream timed out (110: Connection timed out) while reading response header from upstream,
client: 192.168.2.224, server: 192.168.2.131, request: "POST /execute HTTP/1.1",
upstream: "uwsgi://unix:/home/jay/PythonFlaskApp/app.sock",
host: "192.168.2.131:9000", referrer: "http://192.168.2.131:9000/"
If I run the app locally it works and responds fine.
Does anyone have an idea what might be wrong?
The error shown in the browser console is:
Gateway Time-out
Here is the nginx config file:
server {
    listen 9000;
    server_name 192.168.2.131;

    location / {
        include uwsgi_params;
        proxy_read_timeout 300;
        uwsgi_pass unix:/home/jay/PythonFlaskApp/app.sock;
    }
}
And here is the Python Fabric code that I am trying to execute. I'm not sure if this is causing the issue, but in any case here is the code:
from fabric.api import *
from flask import request, jsonify

# 'application' is the Flask app object defined elsewhere in the module
@application.route("/execute", methods=['POST'])
def execute():
    try:
        machineInfo = request.json['info']
        ip = machineInfo['ip']
        username = machineInfo['username']
        password = machineInfo['password']
        command = machineInfo['command']
        isRoot = machineInfo['isRoot']

        env.host_string = username + '@' + ip
        env.password = password
        resp = ''
        with settings(warn_only=True):
            if isRoot:
                resp = sudo(command)
            else:
                resp = run(command)
        return jsonify(status='OK', message=resp)
    except Exception, e:
        print 'Error is ' + str(e)
        return jsonify(status='ERROR', message=str(e))
I have a uWSGI config file for the web app and started it with an upstart script. Here is the uWSGI conf file:
[uwsgi]
module = wsgi
master = true
processes = 5
socket = app.sock
chmod-socket = 660
vacuum = true
die-on-term = true
And here is the upstart script:
description "uWSGI server instance configured to serve Python Flask App"
start on runlevel [2345]
stop on runlevel [!2345]
setuid jay
setgid www-data
chdir /home/jay/PythonFlaskApp
exec uwsgi --ini app.ini
I followed a tutorial on running a Flask app on nginx.
This is likely a problem with the Fabric task, not with Flask. Have you tried isolating / removing Fabric from the application, just for troubleshooting purposes? You could try stubbing out a value for resp, rather than actually executing the run/sudo commands in your function. I would bet that the app works just fine if you do that.
And so that would mean that you've got a problem with Fabric executing the command in question. First thing you should do is verify this by mocking up an example Fabfile on the production server using the info you're expecting in one of your requests, and then running it with fab -f <mock_fabfile.py>.
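A minimal sketch of such a mock fabfile; the host, credentials, and command are placeholders you would replace with values from one of your real requests:

# mock_fabfile.py -- placeholders throughout; run with: fab -f mock_fabfile.py test_command
from fabric.api import env, run

env.host_string = 'jay@192.168.2.224'  # placeholder user@host
env.password = 'secret'                # placeholder

def test_command():
    # Run the same command your /execute endpoint would run
    run('uname -a')

If this hangs or prompts for input on the production box, you have reproduced the timeout outside of Flask and uWSGI.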
It's also worth noting that using with settings(warn_only=True): can result in suppression of error messages. I think that you should remove this, since you are in a troubleshooting scenario. From the docs on Managing Output:
warnings: Warning messages. These are often turned off when one expects a given operation to fail, such as when using grep to test existence of text in a file. If paired with setting env.warn_only to True, this can result in fully silent warnings when remote programs fail. As with aborts, this setting does not control actual warning behavior, only whether warning messages are printed or hidden.
As a third suggestion, you can get more info out of Fabric by using the show('debug') context manager, as well as enabling Paramiko's logging:
from fabric.api import env, run
from fabric.context_managers import show

# You can also enable Paramiko's logging like so:
import logging
logging.basicConfig(level=logging.DEBUG)

def my_task():
    with show('debug'):
        run('my command...')
The Fabric docs have some additional suggestions for troubleshooting: http://docs.fabfile.org/en/1.6/troubleshooting.html. (1.6 is an older/outdated version, but the concepts still apply.)

django errno 104 Connection reset by peer

I am trying to run my Django server on an Ubuntu instance on AWS EC2. I am using gunicorn to run the server like this:
gunicorn --workers 4 --bind 127.0.0.1:8000 woc.wsgi:application --name woc-server --log-level=info --worker-class=tornado --timeout=90 --graceful-timeout=10
When I make a request I get a 502 Bad Gateway in the browser. Here is the server log: http://pastebin.com/Ej5KWrWs
Some sections of the settings.py file where behaviour changes based on the hostname are below (iUbuntu is the hostname of my laptop):
if socket.gethostname() == 'iUbuntu':
    '''
    Development mode
    "iUbuntu" is the hostname of Ishan's PC
    '''
    DEBUG = TEMPLATE_DEBUG = True
else:
    '''
    Production mode
    Anywhere else than Ishan's PC is considered as production
    '''
    DEBUG = TEMPLATE_DEBUG = False

if socket.gethostname() == 'iUbuntu':
    '''Development'''
    ALLOWED_HOSTS = ['*', ]
else:
    '''Production Won't let anyone pretend as us'''
    ALLOWED_HOSTS = ['domain.com', 'www.domain.com',
                     'api.domain.com', 'analytics.domain.com',
                     'ops.domain.com', 'localhost', '127.0.0.1']
(I don't get the purpose of this section of the code. Since I inherited the code from someone and the server was working, I didn't bother removing it without understanding what it does.)
if socket.gethostname() == 'iUbuntu':
    MAIN_SERVER = 'http://localhost'
else:
    MAIN_SERVER = 'http://domain.com'
I can't figure out what the problem is here. The same code runs fine with gunicorn on my laptop.
I have also made a small hello-world node.js app serving on port 8000 to test the nginx configuration, and it runs fine, so there are no nginx errors.
UPDATE:
I set DEBUG to True and copied the Traceback http://pastebin.com/ggFuCmYW
UPDATE:
Thanks to the reply by @ARJMP. This is indeed a problem with the celery consumer not being able to connect to the broker.
I am configuring celery like this: app.config_from_object('woc.celeryconfig'), and the contents of celeryconfig.py are:
BROKER_URL = 'amqp://celeryuser:celerypassword@localhost:5672/MyVHost'
CELERY_RESULT_BACKEND = 'rpc://'
I am running the worker like this: celery worker -A woc.async -l info --autoreload --include=woc.async -n woc_celery.%h
And the error that I am getting is:
consumer: Cannot connect to amqp://celeryuser:**@127.0.0.1:5672/MyVHost: [Errno 104] Connection reset by peer.
Ok so your problem as far as I can tell is that your celery worker can't connect to the broker. You have some middleware trying to call a celery task, so it will fail on every request (unless that analyse_urls.delay(**kw) is conditional)
I found a similar issue that was solved by upgrading their version of celery.
Another cause could be that the EC2 instance can't connect to the message queue server because the EC2 security group won't allow it. If the message queue is running on a separate server, you may have to make sure you've allowed the connection between the EC2 instance and the message queue through AWS EC2 Security Groups
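A quick way to check from the EC2 instance whether the broker port is reachable at all (a sketch; the host and port are placeholders for wherever RabbitMQ actually runs):

# check_broker.py -- placeholder host/port; run this on the EC2 instance
import socket

try:
    socket.create_connection(('127.0.0.1', 5672), timeout=5).close()
    print('Broker port is reachable')
except socket.error as exc:
    print('Cannot reach broker: %s' % exc)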
Try setting the RabbitMQ connection timeout to 30 seconds; this usually clears up problems connecting to a server.
You can add connection_timeout to your connection string:
BROKER_URL = 'amqp://celeryuser:celerypassword@server.example.com:5672/MyVHost?connection_timeout=30'
Note the format with the question mark: ?connection_timeout=30
This is a query string parameter for the RMQ connection string.
Also, make sure the URL points to your production server name/URL, and not localhost, in your production environment.
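One way to do that, mirroring the hostname check already used in the question's settings.py (a sketch; server.example.com and the credentials are placeholders for the real broker):

# celeryconfig.py -- a sketch; credentials and the production host are placeholders
import socket

if socket.gethostname() == 'iUbuntu':
    # Development: broker runs locally
    BROKER_URL = 'amqp://celeryuser:celerypassword@localhost:5672/MyVHost?connection_timeout=30'
else:
    # Production: point at the real broker host, not localhost
    BROKER_URL = 'amqp://celeryuser:celerypassword@server.example.com:5672/MyVHost?connection_timeout=30'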

Celery Closes Unexpectedly After Longer Inactivity

So I am using RabbitMQ + Celery to create a simple RPC architecture. I have one RabbitMQ message broker and one remote worker which runs the Celery daemon.
There is a third server which exposes a thin RESTful API. When it receives an HTTP request, it sends a task to the remote worker, waits for the result, and returns a response.
This works great most of the time. However, I have noticed that after a longer period of inactivity (say 5 minutes with no incoming requests), the Celery worker behaves strangely. The first 3 tasks received after such a period return this error:
exchange.declare: connection closed unexpectedly
After three failed tasks it works again. If there are no tasks for a longer period of time, the same thing happens. Any idea?
My init script for the Celery worker:
# description "Celery worker using sync broker"
console log
start on runlevel [2345]
stop on runlevel [!2345]
setuid richard
setgid richard
script
chdir /usr/local/myproject/myproject
exec /usr/local/myproject/venv/bin/celery worker -n celery_worker_deamon.%h -A proj.sync_celery -Q sync_queue -l info --autoscale=10,3 --autoreload --purge
end script
respawn
My celery config:
# Synchronous blocking tasks
BROKER_URL_SYNC = 'amqp://guest:guest@localhost:5672//'
# Asynchronous non blocking tasks
BROKER_URL_ASYNC = 'amqp://guest:guest@localhost:5672//'
#: Only add pickle to this list if your broker is secured
#: from unwanted access (see userguide/security.html)
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'UTC'
CELERY_ENABLE_UTC = True
CELERY_BACKEND = 'amqp'
# http://docs.celeryproject.org/en/latest/userguide/tasks.html#disable-rate-limits-if-they-re-not-used
CELERY_DISABLE_RATE_LIMITS = True
# http://docs.celeryproject.org/en/latest/userguide/routing.html
CELERY_DEFAULT_QUEUE = 'sync_queue'
CELERY_DEFAULT_EXCHANGE = "tasks"
CELERY_DEFAULT_EXCHANGE_TYPE = "topic"
CELERY_DEFAULT_ROUTING_KEY = "sync_task.default"
CELERY_QUEUES = {
    'sync_queue': {
        'binding_key': 'sync_task.#',
    },
    'async_queue': {
        'binding_key': 'async_task.#',
    },
}
Any ideas?
EDIT:
Ok, now it appears to happen randomly. I noticed this in RabbitMQ logs:
=WARNING REPORT==== 6-Jan-2014::17:31:54 ===
closing AMQP connection <0.295.0> (some_ip_address:36842 -> some_ip_address:5672):
connection_closed_abruptly
Is your RabbitMQ server or your Celery worker behind a load balancer by any chance? If yes, then the load balancer is closing the TCP connection after some period of inactivity. In which case, you will have to enable heartbeat from the client (worker) side. If you do, I would not recommend using the pure Python amqp lib for this. Instead, replace it with librabbitmq.
connection_closed_abruptly is reported when a client disconnects without going through the proper AMQP shutdown handshake:
channel.close(...)
Request a channel close.
This method indicates that the sender wants to close the channel.
This may be due to internal conditions (e.g. a forced shut-down) or due to
an error handling a specific method, i.e. an exception.
When a close is due to an exception, the sender provides the class and method id of
the method which caused the exception.
After sending this method, any received methods except Close and Close-OK MUST be discarded. The response to receiving a Close after sending Close must be to send Close-Ok.
channel.close-ok():
Confirm a channel close.
This method confirms a Channel.Close method and tells the recipient
that it is safe to release resources for the channel.
A peer that detects a socket closure without having received a
Channel.Close-Ok handshake method SHOULD log the error.
Here is an issue about that.
Can you set your custom configuration for BROKER_HEARTBEAT and BROKER_HEARTBEAT_CHECKRATE and check again, for example:
BROKER_HEARTBEAT = 10
BROKER_HEARTBEAT_CHECKRATE = 2.0

How do I debug a HTTP 502 error?

I have a Python Tornado server sitting behind an nginx frontend. Every now and then, but not every time, I get a 502 error. I look in the nginx access log and I see this:
127.0.0.1 - - [02/Jun/2010:18:04:02 -0400] "POST /a/question/updates HTTP/1.1" 502 173 "http://localhost/tagged/python" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"
and in the error log:
2010/06/02 18:04:02 [error] 14033#0: *1700 connect() failed (111: Connection refused)
while connecting to upstream, client: 127.0.0.1, server: _,
request: "POST /a/question/updates HTTP/1.1",
upstream: "http://127.0.0.1:8888/a/question/updates", host: "localhost", referrer: "http://localhost/tagged/python"
I don't think any errors show up in the Tornado log. How would you go about debugging this? Is there something I can put in the Tornado or nginx configuration to help debug this?
The line from the error log is very informative in my opinion. It says the connection was refused by the upstream, and it contains the client IP, the nginx server config, the request line, the hostname, the upstream URL, and the referrer.
It is pretty clear you must look at the upstream (or a firewall in between) to find out the reason.
In case you'd like to see how nginx processes the request and why it chooses specific server and location sections, there is a very useful "debug" mode. (Note: your nginx binary must be built with the --with-debug option.) Then:
error_log /path/to/your/error.log debug;
will turn on debugging for all requests. The debugging information in the error log takes some time to get used to interpreting, but it's worth the effort.
Do not use this "as is" on high-traffic sites! It generates a lot of information and your error log will grow very fast. If you need to debug requests in production, use the debug_connection directive:
events {
    debug_connection 1.2.3.4;
}
It turns debugging on for the specific client IP address only.
