AWS Elastic Beanstalk health check issue - python

My web application is built with Django, the web server is Nginx, and I deploy it as a Docker image on Elastic Beanstalk.
Normally there is no problem, but when the load balancer scales out EC2 instances, my web server starts returning 502 Bad Gateway.
I checked the Elastic Beanstalk application logs: about 16% of the requests returned 5xx errors. At that point the load balancer scales out EC2 instances, the web server goes into the 502 Bad Gateway state, and the Elastic Beanstalk application transitions to the Degraded state.
Is this a common problem when the load balancer performs health checks? If not, how do I turn off the health check?
I am attaching a screenshot for reference.

As far as I know, a 502 Bad Gateway error like this can be mitigated only by manually checking the major links you have on your website and verifying that they are accessible through a simple GET request.
In the case of my website, I had issues with the login page and an about page (sadly about 33% of my site), which is why I got a 5xx error on the health check after uploading to EC2. I solved the problem by simply making those links work on the server: some functionality only ran on localhost and not on AWS, so I fixed that and got an OK status in the health check.
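A quick way to run that check is a small script that sends a GET request to each major page and reports anything that does not come back with a 200; here is a minimal sketch (the URL list is illustrative, replace it with your own pages):

import requests

# illustrative list of the site's major pages
urls = [
    "https://example.com/",
    "https://example.com/login",
    "https://example.com/about",
]

for url in urls:
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException as exc:  # unreachable host, TLS error, etc.
        status = exc
    if status != 200:
        print(f"{url} -> {status}")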
I don't think there is a point in removing the health check, as it gives vital information about your website, and you probably don't want your website to have inaccessible pages.
Keep track of the logs to narrow down the problem.
I hope you find the solution.

While your code is being deployed, you will get 502 errors because the EC2 instance fails the health check call. You need to adjust the load balancer health check default settings to allow enough time for your deployment to complete. Allow even more time if you also restart the server after each deployment.
The AWS load balancer sends a health check request to each registered instance every N seconds using the path you specify; the default interval is 30 seconds. If the health check fails N times (default is 2) for any of your running instances, health changes to Degraded or Severe depending on the percentage of your instances that are not responding.
Send a request that should return a 200 response code. Default is '/index.html'
Wait for N seconds before timing out (default 5 seconds)
Try again after N interval seconds (default 30 seconds)
If N consecutive calls fail, change the health state to warning or severe (default unhealthy threshold is 2)
After N consecutive successful calls, return the health state to OK (default is 10).
With the default settings, if any web server instance is down for more than a minute (2 tries of 30 seconds each), it is considered an outage. It will take 5 minutes (10 tries every 30 seconds) to get back to OK status.
For a detailed explanation and configuration options please check AWS documentation: Configure Health Checks for Elastic Load Balancing
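For reference, the same settings can also be adjusted programmatically. Here is a minimal sketch with boto3, assuming a classic load balancer (the default for Elastic Beanstalk environments of that era); the load balancer name is a placeholder, and the values mirror the defaults described above, so raise Interval and UnhealthyThreshold to give deployments more time:

import boto3

elb = boto3.client("elb")  # classic ELB API

elb.configure_health_check(
    LoadBalancerName="awseb-my-environment",  # placeholder name
    HealthCheck={
        "Target": "HTTP:80/index.html",  # path that should return a 200
        "Interval": 30,                  # seconds between health check requests
        "Timeout": 5,                    # seconds to wait for a response
        "UnhealthyThreshold": 2,         # consecutive failures before Degraded/Severe
        "HealthyThreshold": 10,          # consecutive successes before OK again
    },
)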

Related

How to adjust timeout for python app served with waitress

I am running a Flask application which has a button in the UI that calls an endpoint to perform some analysis.
The application is served as follows:
from waitress import serve
# "app" is the Flask application instance defined elsewhere in the project
serve(app, host="0.0.0.0", port=5000)
After around one minute, I receive a gateway timeout in the UI:
504 Gateway Time-out
However, the Flask application keeps working in the background, and after two minutes it completes the processing; I can see that it submits the data on the DB side. So the process itself is not timing out.
I have already tried setting the channel_timeout argument to a much higher value (the default seems to be 120 seconds), but with no luck. I know it would make more sense to implement this differently so the user doesn't have to wait for these two minutes, but I am asking whether such a timeout is set by default and whether it can be increased.
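For reference, this is how that argument is passed to waitress (same app object as above; the value here is illustrative):

from waitress import serve

# channel_timeout is a waitress adjustment, default 120 seconds
serve(app, host="0.0.0.0", port=5000, channel_timeout=300)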
The application is deployed in K8s and the UI is exposed via ingress. Could the timeout come from ingress instead?
The problem was with the ingress controller default timeout.
I managed to get around this issue by changing the implementation and having this job run as a background task instead.
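A minimal sketch of that background-task approach using plain threading (the endpoint and function names are illustrative; a real task queue such as Celery or RQ would be the more robust option):

import threading

from flask import Flask, jsonify

app = Flask(__name__)

def run_analysis():
    # the long-running work that previously exceeded the ingress timeout
    ...

@app.route("/analysis", methods=["POST"])
def start_analysis():
    # start the work in a background thread and return immediately,
    # so the response is sent long before any proxy or ingress timeout fires
    threading.Thread(target=run_analysis, daemon=True).start()
    return jsonify({"status": "accepted"}), 202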

FastAPI-Docker Image on Azure App Service throws [CRITICAL] WORKER TIMEOUT

I'm running a custom FastAPI-Docker Image (more specifically the python3.7 one) on an Azure App Service (2 Cores, 3.5 GB Memory, OS: Linux). The image is hosted using an Azure Container Registry.
The container serves requests that have a PDF file attached, processes the PDF file, and returns the result to a website. A user can then verify whether the result is correct. If the result is correct, the website sends a second POST request to the FastAPI service to store the results in an AtlasDB.
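For context, a minimal sketch of what such an upload endpoint typically looks like in FastAPI (the route and helper names here are illustrative, not the actual service code):

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def process_document(data: bytes) -> dict:
    # placeholder for the actual, long-running PDF processing
    return {"pages": 0}

@app.post("/process")
async def process_pdf(file: UploadFile = File(...)):
    contents = await file.read()  # read the uploaded PDF into memory
    return {"result": process_document(contents)}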
Initially, when the Azure App Service has just been started, everything works fine. However, after a while documents can no longer be uploaded through the website. The service crashes with the following error:
2020-11-02T14:21:40.225130726Z [2020-11-02 14:21:40 +0000] 1 [CRITICAL] WORKER TIMEOUT (pid:9)
This, however, only happens after some documents have been uploaded. The amount of time it takes until the error is raised is not deterministic: sometimes it is about 90 seconds and sometimes it takes up to 3 minutes until the timeout is reached. Other documents that can be processed in under 90 seconds are still processed just fine.
At first, I thought it had to do with the gunicorn workers. I created a custom gunicorn_conf.py (based on the standard config) and increased WORKERS_PER_CORE to 3 as well as GRACEFUL_TIMEOUT and TIMEOUT to 180 seconds.
But the error still remains. Additionally, I upscaled the App Service plan to see whether more computing power would solve the problem. Unfortunately, it did not.
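For reference, a minimal sketch of what such a gunicorn_conf.py might look like, with the values mentioned above (whether the base image picks this file up depends on how the image is built, so treat the names and values as illustrative):

import multiprocessing

# illustrative gunicorn_conf.py; gunicorn reads these module-level settings
cores = multiprocessing.cpu_count()
workers_per_core = 3

workers = max(int(workers_per_core * cores), 2)
worker_class = "uvicorn.workers.UvicornWorker"  # FastAPI is served through uvicorn workers
bind = "0.0.0.0:80"

# give long-running PDF processing more time before a worker is killed
timeout = 180
graceful_timeout = 180
keepalive = 5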
Whenever I start the Docker image locally, everything works flawlessly.
My guess is that the error is still gunicorn-worker related, but I couldn't find a solution to the problem. I also downloaded the App Service logs and the container logs, but there is no further information or hint as to why the worker crashes.
Does anyone have some advice on how to find the issue?
Thank you in advance and have a nice day!

Will taking a snapshot of my RabbitMQ instance on AWS harm my production app?

I have a Django web app using Celery and Supervisord, connected to a t2.micro RabbitMQ instance. I want to upgrade to a t2.large but am wondering whether taking a snapshot will affect anything. I did not originally build this setup, so I am trying to learn. Will proceeding with the upgrade only require me to switch the RabbitMQ IP address? What precautions should I take?
Taking a snapshot of any form of datastore usually has a certain tax on the underlying hardware in terms of CPU and IOPS. Given you are currently running on a t2 instance, and assuming you have burst credits remaining, taking a snapshot is probably acceptable, as the instance size suggests your traffic is low. Once you have provisioned the new instance, setting its connection string (IP address, or DNS name if you set one through a proxy) in your Django settings should be sufficient to start routing traffic to the new instance.
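For example, if the broker URL lives in the Django settings, the change is roughly this; the exact setting name depends on how Celery is wired up in your project, and the credentials and host below are placeholders:

# settings.py
CELERY_BROKER_URL = "amqp://user:password@new-rabbitmq-host:5672//"  # placeholder values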
Just FYI, AWS has a hosted RabbitMQ option available which takes care of much of the heavy lifting for you :)

Google AppEngine - Using queues with max_concurrent_requests set to 1: Process terminated because the request deadline was exceeded

I've set up a TaskQueue for my AppEngine API. The API processes large requests and may take up to three hours to finish computing, working on one task at a time.
I've set 'max_concurrent_requests' to 1 in the queue.yaml file, so that only one task will be active at a time, and increased the gunicorn timeout as well.
My problem is that the Cloud Tasks requests seem to have a timeout of 10 minutes, after which they throw the following error:
Process terminated because the request deadline was exceeded. Please ensure that your HTTP server is listening for requests on 0.0.0.0 and on the port defined by the PORT environment variable. (Error code 123)
How can I configure my queue to wait idle for a previous task to finish, instead of simply timing out?
You've most likely set your App Engine service scaling element to "autoscaling" (or didn't define it, autoscaling being the default value) in the app.yaml file.
Instances in autoscaling have a deadline of 10 minutes, as documented here. You'll need to re-deploy your service with an updated app.yaml file that sets the scaling element to "manual scaling" or "basic scaling" to allow your tasks to run for up to 24 hours.

How can I troubleshoot why flask file system session data is lost

I have an application (running in Docker and managed by Marathon) where I use server-side Flask sessions - FileSystemSessionInterface (permanent sessions).
My problem is that if the user waits too long to go to the next step, the session data is lost.
One of my assumptions was that this is because of Marathon, which health-checks the application with an HTTP GET request every 2 seconds. This opens a new session file on every request, and my assumption was that the maximum number of open files was being reached. However, when I check inside the Docker container how many session files are open, the number is not that big, around 350 files.
Did anyone have this problem? Any ideas on why my session data disappears?
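For reference, this is roughly how the filesystem session backend is configured with Flask-Session; the directory and numeric values here are illustrative, not my exact settings. Note that SESSION_FILE_THRESHOLD caps how many session files the backend keeps, and older files are pruned once it is exceeded:

from flask import Flask
from flask_session import Session

app = Flask(__name__)
app.config.update(
    SECRET_KEY="change-me",                  # placeholder
    SESSION_TYPE="filesystem",               # FileSystemSessionInterface
    SESSION_PERMANENT=True,                  # permanent sessions, as in the question
    PERMANENT_SESSION_LIFETIME=3600,         # illustrative lifetime in seconds
    SESSION_FILE_DIR="/tmp/flask_session",   # illustrative location for the session files
    SESSION_FILE_THRESHOLD=500,              # max stored session files before old ones are pruned
)
Session(app)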
