I'm running a custom FastAPI Docker image (more specifically the python3.7 one) on an Azure App Service (2 cores, 3.5 GB memory, OS: Linux). The image is hosted in an Azure Container Registry.
The container serves requests that have a PDF file attached, processes the PDF, and returns the result to a website. A user can then verify whether the result is correct. If it is, the website sends a second POST request to the FastAPI service to store the results in an AtlasDB.
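For context, the two endpoints look roughly like this. This is only a sketch: the route names, helper functions and payload shapes below are placeholders, not the actual application code.

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def run_pdf_processing(pdf_bytes: bytes) -> dict:
    # Placeholder for the actual (potentially long-running) PDF processing step.
    return {"num_bytes": len(pdf_bytes)}

def save_to_atlas(result: dict) -> None:
    # Placeholder for the write to the Atlas database.
    pass

@app.post("/process")
async def process_pdf(file: UploadFile = File(...)):
    pdf_bytes = await file.read()
    return {"result": run_pdf_processing(pdf_bytes)}

@app.post("/store")
async def store_result(result: dict):
    save_to_atlas(result)
    return {"status": "stored"}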
Initially, right after the Azure App Service has started, everything works fine. However, after a while documents can no longer be uploaded through the website; the service crashes with the following error:
2020-11-02T14:21:40.225130726Z [2020-11-02 14:21:40 +0000] 1 [CRITICAL] WORKER TIMEOUT (pid:9)
This only happens after some documents have already been uploaded, and the time until the error is raised is not deterministic: sometimes it is about 90 seconds, sometimes up to 3 minutes until the timeout is reached. Documents that can be processed in under 90 seconds are still handled just fine.
At first I thought it had to do with the gunicorn workers. I created a custom gunicorn_conf.py (based on the standard config) and increased WORKERS_PER_CORE to 3 as well as GRACEFUL_TIMEOUT and TIMEOUT to 180 seconds.
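The relevant overrides look roughly like this (a minimal sketch modelled on the image's standard gunicorn_conf.py; it assumes the values are read from environment variables as in that config, which may differ from the real file):

import multiprocessing
import os

# Overridden settings, read from environment variables as in the standard config.
workers_per_core = float(os.getenv("WORKERS_PER_CORE", "3"))
workers = max(int(workers_per_core * multiprocessing.cpu_count()), 2)
graceful_timeout = int(os.getenv("GRACEFUL_TIMEOUT", "180"))
timeout = int(os.getenv("TIMEOUT", "180"))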
But the error remains. Additionally, I scaled up the App Service plan to see whether more computing power would solve the problem. Unfortunately, it does not.
Whenever I start the Docker image locally, everything works flawlessly.
My guess is that the error is still gunicorn-worker related, but I couldn't find any solution to the problem. I also downloaded the App Service logs and container logs, but there isn't any further information or hint as to why the worker crashes.
Does anyone have some advice on how to find the issue?
Thank you in advance and have a nice day!
Related
In short:
I have a Django application being served up by Apache on a Google Compute Engine VM.
I want to access a secret from Google Secret Manager in my Python code (when the Django app is initialising).
When I do 'python manage.py runserver', the secret is successfully retrieved. However, when I get Apache to run my application, it hangs when it sends a request to the secret manager.
Too much detail:
I followed the answer to this question: "GCP VM Instance is not able to access secrets from Secret Manager despite of appropriate Roles". I created a service account (not the default one) and gave it the 'cloud-platform' scope. I also gave it the 'Secret Manager Admin' role in the web console.
After initially running into trouble, I downloaded a JSON key for the service account from the web console and set the GOOGLE_APPLICATION_CREDENTIALS env var to point to it.
When I run the Django dev server directly on the VM, everything works fine. When I let Apache run the application, I can see from the logs that the service account credential JSON is loaded successfully.
However, when I make my first API call, via google.cloud.secretmanager.SecretManagerServiceClient.list_secret_versions, the application hangs. I don't even get a 500 error in my browser, just an eternal loading icon. I traced the execution as far as:
grpc._channel._UnaryUnaryMultiCallable._blocking, line 926 : 'call = self._channel.segregated_call(...'
It never gets past that line. I couldn't figure out where that call goes, so I couldn't inspect it any further than that.
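For reference, the hanging call is essentially the following (the project and secret IDs are placeholders, and depending on the client library version the argument may be passed directly rather than as a request dict):

from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
# Placeholder project and secret IDs -- the real ones don't matter for the hang.
parent = "projects/my-project/secrets/my-secret"
# Under the Django dev server this returns normally; under Apache it never returns.
for version in client.list_secret_versions(request={"parent": parent}):
    print(version.name)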
Thoughts
I don't understand GCP service accounts / API access very well. I can't understand why this difference is occurring between the Django dev server and Apache, given that they're both using the same service account credentials from the JSON file. I'm also surprised that the application just hangs in the Google library rather than throwing an exception. There's even a timeout option when sending a request, but changing this doesn't make any difference.
I wonder if it's somehow related to the fact that I'm running the Django dev server under my own account, while Apache runs under whatever user account it uses?
Update
I tried changing the user/group that Apache runs as to match my own. No change.
I enabled logging for gRPC itself. There is a clear difference between running under Apache and running under the Django dev server.
On Django:
secure_channel_create.cc:178] grpc_secure_channel_create(creds=0x17cfda0, target=secretmanager.googleapis.com:443, args=0x7fe254620f20, reserved=(nil))
init.cc:167] grpc_init(void)
client_channel.cc:1099] chand=0x2299b88: creating client_channel for channel stack 0x2299b18
...
timer_manager.cc:188] sleep for a 1001 milliseconds
...
client_channel.cc:1879] chand=0x2299b88 calld=0x229e440: created call
...
call.cc:1980] grpc_call_start_batch(call=0x229daa0, ops=0x20cfe70, nops=6, tag=0x7fe25463c680, reserved=(nil))
call.cc:1573] ops[0]: SEND_INITIAL_METADATA...
call.cc:1573] ops[1]: SEND_MESSAGE ptr=0x21f7a20
...
So, a channel is created, then a call is created, and then we see gRPC start to execute the operations for that call (as far as I can read it).
On Apache:
secure_channel_create.cc:178] grpc_secure_channel_create(creds=0x7fd5bc850f70, target=secretmanager.googleapis.com:443, args=0x7fd583065c50, reserved=(nil))
init.cc:167] grpc_init(void)
client_channel.cc:1099] chand=0x7fd5bca91bb8: creating client_channel for channel stack 0x7fd5bca91b48
...
timer_manager.cc:188] sleep for a 1001 milliseconds
...
timer_manager.cc:188] sleep for a 1001 milliseconds
...
So, a channel is created... and then nothing. No call, no operations. The Python code is sitting there waiting for gRPC to make this call, which it never does.
The problem appears to be that the forking behaviour of Apache breaks gRPC somehow. I couldn't nail down the precise cause, but after I began to suspect that forking was the issue, I found this old gRPC issue that indicates that forking is a bit of a tricky area.
I tried to reconfigure Apache to use a different 'Multi-Processing Module' (MPM), but as my experience in this area is limited, I couldn't get gRPC to work under any of them.
In the end, I switched to using nginx/uwsgi instead of Apache/mod_wsgi, and I did not have the same issue. If you're trying to solve a problem like this and you have to use Apache, I'd advise further investigating Apache forking, how gRPC handles forking, and the different MPMs available for Apache.
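One concrete thing worth trying along those lines, which I did not verify against the mod_wsgi case, is gRPC's experimental fork support. It has to be switched on via environment variables before anything imports grpc, e.g. at the top of the WSGI entry point (the module and settings names below are placeholders):

import os

# gRPC's (experimental) fork support must be enabled before grpc is imported anywhere.
os.environ.setdefault("GRPC_ENABLE_FORK_SUPPORT", "true")
os.environ.setdefault("GRPC_POLL_STRATEGY", "poll")

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # placeholder project name

from django.core.wsgi import get_wsgi_application

application = get_wsgi_application()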
I'm facing a similar issue when running my Flask application with eventlet==0.33.0 and gunicorn (installed from https://github.com/benoitc/gunicorn/archive/ff58e0c6da83d5520916bc4cc109a529258d76e1.zip#egg=gunicorn==20.1.0). When calling secret_client.access_secret_version, it hangs forever.
It used to work fine with an older eventlet version, but we needed to upgrade to the latest version of eventlet due to security reasons.
I experienced a similar issue and was able to solve it with the following:
# Monkey-patch the standard library as early as possible, before other imports.
from gevent import monkey
monkey.patch_all()

# Make gRPC cooperate with gevent.
import grpc.experimental.gevent as grpc_gevent
grpc_gevent.init_gevent()

from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
I built and deployed an app on GAE. Yesterday everything seemed to be working fine: requests sent to the app every few seconds were successful, with a response time of about 2.5 seconds. Today GAE keeps spinning up a new instance for every request, or fails to create even one, resulting in unacceptably high response times (and much higher charges) or even 500 server errors.
I tried suspending and restarting the app a few times; it works again for a couple of requests, then reverts to the same behavior. In the console I can see that a new instance is shut down immediately after serving a request, or, in case of a server error, that GAE was unable to start a new instance.
I checked the quotas in the console; nothing seems to hint that I cannot send multiple requests from the same IP.
Has anyone experienced such issues, and if so, what could be the cause(s) and remedies? Please note that I am very new to GAE, so I have no further clue right now where to start.
EDIT: I just realized the average memory used by an instance (F2 in my case, which gives you 256 MB) is very close to the maximum (250 MB). Could that be the issue? I will upgrade to F4 (512 MB) and see what happens.
As per the documentation, a new instance may be created based on request rate, response latencies, and other application metrics.
Therefore, it’s expected behaviour for the GAE Standard instances to scale up and down depending on the traffic they receive.
Also, if the maximum memory usage for the instance class is reached, a shutdown process will be triggered as explained here.
As for the failures to create a new instance, it's hard to tell what may be causing it without the Stackdriver Logging information. Off the top of my head, you may receive HTTP 500 errors due to having reached the response limit, but it could indeed happen for any other reason as well.
Finally, taking into account the nature of the issues, I think it's a good idea to test the GAE app's behaviour with a larger instance class and compare the results. If you no longer experience this with an F4 instance class, it's safe to assume that the previous instance class was simply not enough to satisfy the app's requirements.
I have been working on a website for over a year now, using Django and Python 3 primarily. A few buddies and I built a front end where a user enters some parameters and submits; this goes to GAE to run the job and return the results.
In my local dev environment, everything works well. I have two separate dev environments. One builds the entire service in a Docker container and produces the desired results in roughly 11 seconds. The other runs the source files locally on my computer and connects to the Postgres database hosted in Google Cloud; there the Python application takes roughly 2 minutes, because of the latency of the POSTs/GETs between my local machine and the cloud.
Once I perform the gcloud app deploy and attempt to run it in production, it never finishes. I have some print statements built into the code, so I know it gets to the part where the submitted parameters reach the Python code. I monitor via this command on my local computer: gcloud app logs read.
I suspect that since my local computer is a beast (an i7-7770 processor with 64 GB of RAM), it runs the whole thing with no problem. But on GAE, I don't think it's providing machines capable of doing the job efficiently (not enough compute, not enough RAM). That's my guess.
So, I need help figuring out how to troubleshoot this. I tried changing my app.yaml file so that resources would scale to 16 GB of memory, but it would never deploy; I received an error 13.
One other note: after it spins trying to run the job for 60 minutes, the website crashes and displays this message:
502 Server Error
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
OK, so just in case anybody in the future is having a similar problem: the constant crashing of my Google App Engine workers was caused by using Pandas DataFrames in the production environment. I don't know exactly what Pandas was doing, but I kept getting MemoryErrors that would crash the site, and they didn't appear to occur on a single line of code. That is, they happened somewhere in a Pandas DataFrame operation, seemingly at random.
I am still using a Pandas DataFrame simply to read in a CSV file. I then use
data_I_care_about = dict(zip(df.col1, df.col2))  # two DataFrame columns -> a plain dict
# or
other_data = df.col3.values.tolist()  # a single column -> a plain Python list
and then go to town with the processing. As a note, on my local machine (my development environment, basically) it took 6 seconds to run from start to finish. That's a long time for a web request, but I was in a hurry, which is why I used Pandas to begin with.
After refactoring, the same job completed in roughly 200 ms using plain Python lists and dicts (again, in my dev environment). The website is up and running very smoothly now; it takes at most 7 seconds after pressing "Submit" for the back end to return the data sets and render them on the web page. Thanks for the help, peeps!
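For anyone curious, the refactored read-in step was roughly along these lines (the file and column names are placeholders, not my actual code):

import csv

# Placeholder file and column names; the point is simply that no DataFrame is
# kept alive in memory -- only a plain dict and a plain list.
data_I_care_about = {}
other_data = []
with open("input.csv", newline="") as f:
    for row in csv.DictReader(f):
        data_I_care_about[row["col1"]] = row["col2"]
        other_data.append(row["col3"])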
My web application is built with Django, the web server is Nginx, and I deploy with a Docker image on Elastic Beanstalk.
Normally there is no problem, but when the load balancer scales out EC2 instances, my web server starts returning 502 Bad Gateway.
In the Elastic Beanstalk application logs, about 16% of the requests returned 5xx errors; at that point the load balancer scales out EC2 instances, the web server transitions to the 502 Bad Gateway state, and the Elastic Beanstalk application goes into the Degraded state.
Is this a common problem when the load balancer performs health checks? If not, how can I turn off the health check?
I am attaching a captured image for reference.
As far as I know, a 502 Bad Gateway error can be mitigated only by manually checking whether the major links on your website are accessible through a simple GET request.
In the case of my website, I had issues with the login page and an about page (which, sadly, was about 33% of my website), which is why after uploading to EC2 I got a 5xx error on the health check. I solved the problem by simply making those links work on the server (some functionality only ran on localhost and not on AWS, so I fixed that and got an OK status in the health check).
I don't think there is a point in removing the health check, as it gives vital information about your website, and you probably don't want your website to have inaccessible pages.
Keep track of the logs to narrow down the problem.
I hope you find the solution.
While your code is being deployed, you will get 502 errors because the EC2 instance fails the health check call. You need to adjust the load balancer's default health check settings to allow enough time for your deployment to complete. Allow more time if you also restart the server after each deployment.
The AWS load balancer sends a health check request to each registered instance every N seconds, using the path you specify. The default interval is 30 seconds. If the health check fails N times in a row (the default is 2) for any of your running instances, the health state changes to Degraded or Severe, depending on the percentage of instances that are not responding.
Send a request that should return a 200 response code (the default path is '/index.html').
Wait N seconds before timing out (default: 5 seconds).
Try again after the N-second interval (default: 30 seconds).
If N consecutive calls fail, change the health state to Warning or Severe (the default unhealthy threshold is 2).
After N consecutive successful calls, return the health state to OK (the default healthy threshold is 10).
With the default settings, if any web server instance is down for more than a minute (2 tries, 30 seconds apart), it is considered an outage. It will then take 5 minutes (10 successful tries, every 30 seconds) to get back to OK status.
For a detailed explanation and the configuration options, please check the AWS documentation: Configure Health Checks for Elastic Load Balancing.
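If you prefer to change these settings from code rather than the console, and your environment is fronted by a classic load balancer, a sketch with boto3 looks like this (the load balancer name and path are placeholders):

import boto3

# Placeholder load balancer name and health check path; applies to a classic ELB.
elb = boto3.client("elb")
elb.configure_health_check(
    LoadBalancerName="my-load-balancer",
    HealthCheck={
        "Target": "HTTP:80/health",   # request that should return 200
        "Interval": 30,               # seconds between checks
        "Timeout": 5,                 # seconds to wait for a response
        "UnhealthyThreshold": 2,      # consecutive failures before unhealthy
        "HealthyThreshold": 10,       # consecutive successes before healthy again
    },
)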
On all versions of the app I started getting the error "uncaught application failure". I'm on the python27 runtime. The errors began to appear suddenly; the last app deployment was ~5 hours ago. From time to time I get the expected result, but mostly I see the error. Nothing useful in the logs. Any suggestions?
Response body:
<html><head><title>s~lawinsidercontracts : uncaught application failure</title><body><pre>
<br></pre></body></html>
In the logs this looks like:
2013-12-20 22:22:54.987
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.
W 2013-12-20 22:22:54.987
A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. (Error code 121)
UPDATE: Now I see this issue only with the default version. Whichever version is made the default gets this error. Different source code does not make any difference. Temporarily solved by splitting 99% of traffic to a non-default version (it works!).
Please note: it does not seem to be a bug in my code; I tried completely different sources. It seems more like an internal error in the instance(s), and I would like to get feedback from the GAE team.
3 days without errors. Apparently the problem resolved itself.