Django Channels "took too long to shut down and was killed" - python

I am using Django Channels for a real-time progress bar. Through this progress bar the client gets live feedback on a simulation, which can take longer than 5 minutes depending on the data size. Now to the problem: the client can start the simulation successfully, but while it is running no other page can be loaded. Furthermore, I get the following error message:
Application instance <Task pending coro=<StaticFilesWrapper.__call__() running at /.../python3.7/site-packages/channels/staticfiles.py:44> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/futures.py:348, <TaskWakeupMethWrapper object at 0x123aa88e8>()]>> for connection <WebRequest at 0x123ab7e10 method=GET uri=/2/ clientproto=HTTP/1.1> took too long to shut down and was killed.
Only after the simulation has finished can further pages be loaded.
There are already posts on this topic, but they do not contain a working solution for me, for example the following issue:
https://github.com/django/channels/issues/1119
One suggestion there is to downgrade the Channels version. However, that leads to other errors for me, and it also has to work with the newer versions.
The consumer, routing and ASGI code are implemented exactly as described in the tutorial: https://channels.readthedocs.io/en/stable/tutorial/index.html
requirements.txt:
Django==3.1.2
channels==3.0.3
channels-redis==3.3.1
asgiref==3.2.10
daphne==3.0.2
I am grateful for any hint or tip.
Best regards,
Dennis
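A hedged sketch of the usual way around this symptom (other pages not loading while the simulation runs): the blocking simulation should not run directly in the consumer coroutine on the event loop, but in a worker thread, so Daphne stays free to serve other requests and to shut application instances down cleanly. run_simulation below is a hypothetical placeholder, not code from the original project:
import asyncio
import json
import time

from channels.generic.websocket import AsyncWebsocketConsumer


def run_simulation(report_progress):
    # Hypothetical stand-in for the real long-running, blocking simulation.
    for percent in range(0, 100, 10):
        time.sleep(1)                # the blocking work
        report_progress(percent)


class SimulationConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        await self.accept()

    async def receive(self, text_data=None, bytes_data=None):
        loop = asyncio.get_event_loop()

        def report(percent):
            # Called from the worker thread: hand the send back to the event loop.
            asyncio.run_coroutine_threadsafe(
                self.send(text_data=json.dumps({"progress": percent})), loop
            )

        # Run the blocking simulation in a thread so the event loop stays free.
        await loop.run_in_executor(None, run_simulation, report)
        await self.send(text_data=json.dumps({"progress": 100, "done": True}))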

Related

Heroku with Django, Celery and CloudAMQP - timeout error

I am building an online shop, following Chapter 7 in the book "Django 3 by Example." The book was written by Antonio Melé.
Everything works fine in my local machine. It also works well when I deploy it to Heroku.
However, when I try to use Celery and RabbitMQ (CloudAMQP, Little Lemur, on Heroku), the email the worker is supposed to send to the customer is not sent. The task takes more than 30 seconds and then it crashes:
heroku[router]: at=error code=H12 desc="Request timeout" method=POST
I have created a tasks.py file with the task of sending emails. My settings.py file includes the following lines for Celery:
broker_url = os.environ.get('CLOUDAMQP_URL')
broker_pool_limit = 1
broker_heartbeat = None
broker_connection_timeout = 30
result_backend = None
event_queue_expires = 60
worker_prefetch_multiplier = 1
worker_concurrency = 50
This was taken from https://www.cloudamqp.com/docs/celery.html
And my Procfile is as follows:
web: gunicorn shop.wsgi --log-file -
worker: celery worker --app=tasks.app
Am I missing something?
Thanks!
I'm fairly familiar with Heroku, though not with your tech stack. The general approach to dealing with a Heroku timeout is this:
First, determine exactly what is causing the timeout. One or more things are taking a lot of time.
Now you have 3 main options.
Heroku Scheduler (or one of several similar add-ons). Very useful if you can run the work as a script via a terminal command, and a check every 10 minutes/1 hour/24 hours to see whether the script needs to run is good enough for you. I typically find this the most straightforward solution, but it's not always an acceptable one. Depending on what you are emailing, an email being delayed 5-15 minutes might be acceptable.
Background process worker. This looks like what you are trying to do with Celery, but it isn't configured correctly; I probably can't help much with that.
Optimize. The reason Heroku sets a 30-second timeout is that, generally speaking, there isn't a good reason for a user to wait 30 seconds for a response. I'm scratching my head as to why sending an email would take more than 30 seconds, unless you need to send a few hundred of them or the email is very, very large. Alternatively, you might be doing a lot of work before you send the email, which raises the question of why not do that work separately from the send-email step. I suspect you should look into the why of this before you try to get a background process worker set up.
After several days of trying to solve this issue, I contacted the support department at CloudAMQP.
They helped me figure out that the problem was related to Celery not identifying my BROKER_URL properly.
Then I came across this helpful comment by @jainal09 here. There is an extra variable that should be set in settings.py:
CELERY_BROKER_URL = '<broker address given by Heroku config>'
Adding that extra line solved the problem. Now Heroku is able to send the email correctly.
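For reference, a minimal sketch of the usual Django/Celery wiring that CELERY_BROKER_URL plugs into, assuming the project package is named shop as in the Procfile (the module layout here is illustrative, not taken from the original post):
# shop/celery.py
import os

from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'shop.settings')

app = Celery('shop')
# Pick up all CELERY_-prefixed settings from settings.py,
# including CELERY_BROKER_URL = os.environ.get('CLOUDAMQP_URL').
app.config_from_object('django.conf:settings', namespace='CELERY')
# Find tasks.py modules in the installed Django apps.
app.autodiscover_tasks()
With that layout the Procfile worker entry would typically point at the project app, for example worker: celery worker --app=shop --loglevel=info, rather than at tasks.app.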

Dask worker seems to die, but I cannot find the worker log to figure out why

I have a piece of Dask code that runs on my local machine and works 90% of the time, but it gets stuck sometimes. Stuck means: no crash, no error printed, no CPU usage, and it never ends.
I googled and think it may be due to a worker dying. It would be very useful if I could see the worker log and figure out why.
But I cannot find my worker log. I edited config.yaml to add logging, but I still see nothing on stderr.
Then I went to dashboard --> info --> logs and saw a blank page.
The code gets stuck at
X_test = df_test.to_dask_array(lengths=True)
or
proba = y_pred_proba_train[:, 1].compute()
and my ~/.config/dask/config.yaml or ~/.dask/config.yaml looks like
logging:
  distributed: info
  distributed.client: warning
  distributed.worker: debug
  bokeh: error
I am using
python 3.6
dask 1.1.4
All I need is a way to see the log so that I can try to figure out what goes wrong.
Thanks
Joseph
Worker logs are usually managed by whatever system you use to set up Dask.
Perhaps you used something like Kubernetes or Yarn or SLURM?
These systems all have ways to get logs back.
Unfortunately, once a Dask worker is no longer running, Dask itself has no ability to collect logs for you. You need to use the system that you use to launch Dask.
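Since this runs on a local machine (so none of Kubernetes/Yarn/SLURM applies), here is a hedged sketch of making worker output visible when you build the LocalCluster yourself; silence_logs and Client.get_worker_logs are the relevant knobs, though exact behaviour may depend on the dask/distributed version:
import logging

from dask.distributed import Client, LocalCluster

# Start the local cluster explicitly so worker log records down to DEBUG
# level are forwarded to this process's stderr instead of being silenced.
cluster = LocalCluster(n_workers=4, threads_per_worker=1,
                       silence_logs=logging.DEBUG)
client = Client(cluster)

# ... run the computation that gets stuck here ...

# Logs of workers that are still alive can also be pulled over the network.
print(client.get_worker_logs())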

Persist Completed Pipeline in Luigi Visualiser

I'm starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, I've noticed that a few minutes after the last job (named MasterEnd) completes, all of the nodes disappear from the graph except for MasterEnd. This is a little inconvenient, as I'd like to see that everything is complete for the day/past days.
Further, if in the visualiser I go directly to the last job's URL, it can't find any history that it ran: Couldn't find task MasterEnd(date=2015-09-17, base_url=http://aws.east.com/, log_dir=/home/ubuntu/logs/). I have verified that it ran successfully this morning.
One thing to note is that I have a cron that runs this pipeline every 15 minutes to check for a file on S3. If it exists, it runs, otherwise it stops. I'm not sure if that is causing the removal of tasks from the visualiser or not. I've noticed it generates a new PID every run, but I couldn't find a way to persist one PID/day in the docs.
So, my questions: Is it possible to persist the completed graph for the current day in the visualiser? And is there a way to see what has happened in the past?
Appreciate all the help
I'm not 100% positive if this is correct, but this is what I would try first. When you call luigi.run, pass it --scheduler-remove-delay. I'm guessing this is how long the scheduler waits before forgetting a task after all of its dependents have completed. If you look through luigi's source, the default is 600 seconds. For example:
luigi.run(["--workers", "8", "--scheduler-remove-delay","86400")], main_task_cls=task_name)
If you configure the remove_delay setting in your luigi.cfg, it will keep the tasks around for longer:
[scheduler]
record_task_history = True
state_path = /x/s/hadoop/luigi/var/luigi-state.pickle
remove_delay = 86400
Note: there is a typo in the documentation ("remove-delay" instead of "remove_delay"), which is being fixed under https://github.com/spotify/luigi/issues/2133
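As a further, hedged note on the second question (seeing what has happened in the past): record_task_history = True only records anything if the scheduler also has a task-history database configured, roughly as in the sketch below (the SQLite filename is purely illustrative):
[scheduler]
record_task_history = True

[task_history]
db_connection = sqlite:////x/s/hadoop/luigi/var/task-history.db
With that in place, past runs should be browsable from the scheduler's history page rather than only from the live dependency graph.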

How to get rid of unkillable Freeswitch channels

After upgrading from Freeswitch 1.2.9 (1.2.9+git~20130506T233047Z~7c88f35451) to Freeswitch 1.4.21 (1.4.21-35~64bit), freeswitch stopped dropping channels after they were hung up, and when we tried to do a manual uuid_kill, it gives us this lovely error:
-ERR No such channel!
Even though show channels clearly shows that the channel is still there. From the bugs on jira.freeswitch.com that I've seen, it looks like it may be a code problem. A little more info on our environment/code:
We have a Python Twisted loop that connects to the client so the client can run commands on the server and vice versa. As soon as that Twisted connection dies (the client is closed/disconnected), the channels are killed as well, but we need the channels to die before then, as we're taking a lot of calls per second and need them to die when the other end is disconnected. We can't close and reopen the client every time a call is done, or reconnect, as that would take far too much time and defeat the purpose of our use of the software.
Once again, this error only started happening when we changed to installing the FreeSWITCH server using apt-get instead of building directly from source. This lets us get a new server up and running much faster, and we would rather not take the extra time to go back to our previous method. Please tell me if there's any code you would like to look at, and ask for any clarification you need, but we would really like to get this fixed soon. Thanks in advance!
Edit: For more clarification, we're mainly using mod_callcenter, mod_conference, and mod_sofia with our software.
Edit 2: For a little more clarification, we're running this on Ubuntu 14.04 Server
We are using an ESL connection to connect to and run commands in FreeSWITCH from Python, and we think that's the root of the problem. We tried exiting the connection, but that destroys both channels.
Also, all of the bugs filed already for this problem on Jira are closed for not being bugs. I thought I may have a bit more success here, as it is a programming type question.
You need to reproduce the issue in a test environment and file a bug report in Jira. Ideally you should also try reproducing it with the latest master branch (only Debian 8 is supported):
https://freeswitch.org/confluence/display/FREESWITCH/Debian+8+Jessie
I had a similar problem when I used mod_perl, and a Perl object was referring to a session, and it was not properly destructed (if I remember it right, I had two Perl objects attached to the same session). That resulted in channels which were impossible to kill.
I suppose you are using an ESL connection between your application and FreeSWITCH, right?

Amazon Elastic MapReduce - SIGTERM

I have an EMR streaming job (Python) which normally works fine (e.g. 10 machines processing 200 inputs). However, when I run it against large data sets (12 machines processing a total of 6000 inputs, at about 20 seconds per input), after 2.5 hours of crunching I get the following error:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
If I am reading this correctly, the subprocess failed with code 143 because someone sent a SIGTERM signal to the streaming job.
Is my understanding correct? If so: When would the EMR infrastructure send a SIGTERM?
I figured out what was happening, so here's some information if anyone else experiences similar problems.
The key to me was to look at the "jobtracker" logs. These live in your task's logs/ folder on S3, under:
<logs folder>/daemons/<id of node running jobtracker>/hadoop-hadoop-jobtracker-XXX.log.
There were multiple lines of the following kind:
2012-08-21 08:07:13,830 INFO org.apache.hadoop.mapred.TaskInProgress
(IPC Server handler 29 on 9001): Error from attempt_201208210612_0001_m_000015_0:
Task attempt_201208210612_0001_m_000015_0 failed to report status
for 601 seconds. Killing!
So my code was timing out and being killed (it was going beyond the 10-minute task timeout). For 10 minutes I wasn't doing any I/O, which was certainly unexpected (I would typically do an I/O every 20 seconds).
I then discovered this article:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
"In one of our science projects, we have a few Hadoop Streaming jobs that run over ruby and rely on libxml to parse documents. This creates a perfect storm of badness – the web is full of really bad html and libxml occasionally goes into infinite loops or outright segfaults. On some documents, it always segfaults."
It nailed it. I must be experiencing one of these "libxml going into infinite loop" situations (I am using libxml heavily -- only with Python, not Ruby).
The final step for me was to trigger skip mode (instructions here: Setting hadoop parameters with boto?).
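Another common way to deal with the 600-second report-status timeout is to have the streaming script emit heartbeat lines on stderr, which Hadoop Streaming interprets as status updates and counters. A rough sketch of a mapper doing this, where process_record is a hypothetical stand-in for the real per-input work:
import sys

def process_record(line):
    # Hypothetical stand-in for the real (slow) per-input work.
    return line.strip()

for i, line in enumerate(sys.stdin):
    # Heartbeat: lines of this form on stderr reset the task's status timer.
    sys.stderr.write("reporter:status:processing input %d\n" % i)
    sys.stderr.write("reporter:counter:MyJob,InputsSeen,1\n")
    sys.stderr.flush()
    sys.stdout.write(process_record(line) + "\n")
Note that this only helps while the script keeps making progress between records; if the per-record work itself hangs (as in the libxml infinite-loop case above), the heartbeats stop too, which is where skip mode comes in.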
I ran into this output from Amazon EMR ("subprocess failed with code 143"). My streaming job was using PHP curl to send data to a server whose security group did not include the MapReduce job servers. Therefore the reducer was timing out and being killed. Ideally I'd add my jobs to the same security group, but I opted to simply add a URL security-token parameter in front of my API.
