I have a web project that uses nginx, uwsgi, and web.py: nginx is used for load balancing, uwsgi is used as the web server, and web.py is the web framework.
I start it with this command: "/usr/local/bin/uwsgi -d /home/sheng/www/lr-server/../log/lr-server/uwsgi.log -s 127.0.0.1:8666 -w rc_main -t 20 -M -p 20 --pidfile /home/sheng/www/lr-server/master.pid --enable-threads -R 800"
This command means it will spawn twenty processes to receive requests, and each process will handle at most 800 requests.
Shown below is a normal process:
sheng 12414 15051 21 10:04 ? 00:01:45 /usr/local/bin/uwsgi -d /home/sheng/www/lr-server/../log/lr-server/uwsgi.log -s 127.0.0.1:8666 -w rc_main -t 20 -M -p 20 --pidfile /home/sheng/www/lr-server/master.pid --enable-threads -R 800
15051 is the parent PID.
Usually it works normally, but it produces strange processes when the server is very busy and many requests take a long time. Shown below is a strange process:
sheng 23370 1 0 09:08 ? 00:00:00 /usr/local/bin/uwsgi -d /home/sheng/www/lr-server/../log/lr-server/uwsgi.log -s 127.0.0.1:8666 -w rc_main -t 20 -M -p 20 --pidfile /home/sheng/www/lr-server/master.pid --enable-threads -R 800
You will notice that this process's PID is 23370, but its parent PID is 1, as if it were a defunct process. In fact, however, this process takes up memory and will not receive any requests.
I had hoped to produce 20 normal processes to receive requests, but now it produces more than 80 strange processes. Can anyone tell me why, and what I can do to solve this problem?
I have found out the reason.
In my Python project, Python sometimes uses R to do a scientific calculation.
R is slow, so a colleague of mine used parallelization: R forks some child processes to do the calculation.
Unfortunately, if a request takes more than 20 seconds, uwsgi kills the Python process, but for unknown reasons the child processes that R forked are not killed; they are the strange processes I see.
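Until the root cause is fixed, one stop-gap is to reap the orphans by hand. This is only a sketch and rests on two assumptions taken from the output above: the orphans keep the uwsgi command line, and the daemonized master's PID is the one in the pidfile (so it must be excluded).

# kill processes that match the uwsgi command line, were reparented to
# init (PPID 1), and are not the master recorded in the pidfile
MASTER=$(cat /home/sheng/www/lr-server/master.pid)
ps -eo pid,ppid,args | awk -v m="$MASTER" \
    '$2 == 1 && $1 != m && /uwsgi .*-w rc_main/ {print $1}' | xargs -r kill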
I would like some help on how to properly set up a complicated job on an HPC. At some point in my Python code I want to submit a job using os.system("bsub -K < mama.sh"); I found that the -K argument actually waits for the job to end before continuing. Now I want this mama.sh script to call 5 other jobs (kid1.sh, kid2.sh ... kid5.sh) that would run in parallel (to reduce computational time). Each of these 5 child scripts will run a piece of Python code. mama.sh should wait until all 5 jobs have finished before continuing.
I thought of something like this:
#!/bin/sh
#BSUB -q hpc
#BSUB -J kids[1-5]
#BSUB -n 5
#BSUB -W 10:00
#BSUB -R "rusage[mem=6GB]"
#BSUB -R "span[hosts=1]"
# -- end of LSF options --
module load python3/3.8
python3 script%I.py
ORRR
python3 script1.py
python3 script2.py
python3 script3.py
python3 script4.py
python3 script5.py
Maybe the above doesn't make sense at all though. Is there any way to actually do that?
Thanks in advance
As far as I know, you can accomplish this at different levels.
Here are two easy ways:
parallelize your Python code with import multiprocessing
parallelize your shell script with &, so a command is executed in the background (a fuller sketch follows the example below):
python3 script1.py &
python3 script2.py
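Putting that second option together with the header from the question, mama.sh might look like the sketch below. It assumes all five kid scripts can share the single 5-core job requested by the #BSUB options (so the job-array form -J kids[1-5] is replaced by a plain job name); the wait builtin is what makes the script block until every background script has finished, which in turn keeps bsub -K waiting on the Python side.

#!/bin/sh
#BSUB -q hpc
#BSUB -J kids
#BSUB -n 5
#BSUB -W 10:00
#BSUB -R "rusage[mem=6GB]"
#BSUB -R "span[hosts=1]"
# -- end of LSF options --
module load python3/3.8
# launch all five scripts in the background...
python3 script1.py &
python3 script2.py &
python3 script3.py &
python3 script4.py &
python3 script5.py &
# ...and do not exit until all of them have finished
wait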
Is there a way to kill uvicorn cleanly?
That is, I can type ^C at it if it is running in the foreground on a terminal. This causes the uvicorn process to die and all of the worker processes to be cleaned up (i.e., they go away).
On the other hand, if uvicorn is running in the background without a terminal, then I can't figure out a way to kill it cleanly. It seems to ignore SIGTERM, SIGINT, and SIGHUP. I can kill it with SIGKILL (i.e. -9), but then the worker processes remain alive, and I have to track all the worker processes down and kill them too. This is not ideal.
I am using uvicorn with CPython 3.7.4, uvicorn version 0.11.2, and FastAPI 0.46.0 on Red Hat Enterprise Linux Server 7.3 (Maipo).
That's because you're running uvicorn as your only server. uvicorn is not a process manager and, as such, it does not manage its workers' life cycle. That's why they recommend running uvicorn with gunicorn+UvicornWorker for production.
That said, you can kill the spawned workers and trigger their shutdown using the command below:
$ kill $(pgrep -P $uvicorn_pid)
The reason this works, while a kill on the parent PID does not, is that when you ^C something, the signal is transmitted to all of its spawned processes that are attached to the controlling terminal.
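For reference, a minimal sketch of the gunicorn+UvicornWorker setup mentioned above; the module path main:app, the worker count, and the pidfile location are placeholders, not part of the question:

# gunicorn acts as the process manager; UvicornWorker runs the ASGI app
gunicorn -k uvicorn.workers.UvicornWorker -w 4 --bind 0.0.0.0:8000 \
    --pid /tmp/gunicorn.pid main:app

# later, a plain SIGTERM to the master shuts down the workers cleanly
kill -TERM "$(cat /tmp/gunicorn.pid)"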
lsof -i :8000
This will list the processes using port 8000. If you are using a different port for FastAPI, change the port number. I was using Postman and Python for FastAPI, so look for the python process and copy its PID (usually a 4-5 digit number).
Then run
kill -9 PID
where PID is the number you copied.
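The same idea can be collapsed into a single line, since lsof -t prints just the PIDs (port 8000 is assumed here, as above):

# kill every process listening on port 8000 (use your FastAPI port)
kill -9 $(lsof -t -i :8000)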
You can try running the command below:
kill -9 $(ps -ef | grep uvicorn | awk '{print $2}')
or
create an alias for the command and keep using that.
For example:
alias uvicornprocess="kill -9 \$(ps -ef | grep uvicorn | awk '{print \$2}')"
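A pgrep-based variant of the same idea avoids the grep process matching itself (it is still a blunt instrument: it signals every uvicorn instance on the machine):

# send SIGKILL to every process whose command line matches "uvicorn"
kill -9 $(pgrep -f uvicorn)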
In my case uvicorn managed to spawn new processes while pgrep -P was killing old ones,
so I decided to kill the whole process group at once, just like ^C does:
PID="$(pgrep -f example:app)"
if [[ -n "$PID" ]]
then
PGID="$(ps --no-headers -p $PID -o pgid)"
kill -SIGINT -- -${PGID// /}
fi
Each line explained:
pgrep -f example:app gets the PID of the parent uvicorn ... example:app
[[ -n "$PID" ]] checks this PID is not empty, to avoid further steps when uvicorn is not running
ps --no-headers -p $PID -o pgid gets PGID (Process Group ID) this PID is part of
kill -SIGINT is similar to polite ^C (you may use kill -9 for non-polite instant kill)
-- means the next token is a positional argument, not a named option, even if it starts with -
-${PGID// /} passing the PGID as a negative number lets kill know it is a PGID, not a PID
${PGID// /} removes all spaces ps added to PGID to align a column
For my dissertation at university, I'm working on a coding leaderboard system where users can compile / run untrusted code through temporary docker containers. The system seems to be working well so far, but one problem I'm facing is that when code for an infinite loop is submitted, e.g.:
while True:
    print "infinite loop"
the system goes haywire. The problem is that when I'm creating a new docker container, the Python interpreter prevents docker from killing the child container as data is still being printed to STDOUT (forever). This leads to the huge vulnerability of docker eating up all available system resources until the machine running the system completely freezes.
So my question is, is there a better way of setting a timeout on a docker container than my current method that will actually kill the docker container and make my system secure (code originally taken from here)?
#!/bin/bash
set -e
to=$1
shift
cont=$(docker run -d "$@")
code=$(timeout "$to" docker wait "$cont" || true)
docker kill $cont &> /dev/null
echo -n 'status: '
if [ -z "$code" ]; then
echo timeout
else
echo exited: $code
fi
echo output:
# pipe to sed simply for pretty nice indentation
docker logs $cont | sed 's/^/\t/'
docker rm $cont &> /dev/null
Edit: The default timeout in my application (passed to the $to variable) is "10s" / 10 seconds.
I've tried looking into adding a timer and sys.exit() to the python source directly, but this isn't really a viable option as it seems rather insecure because the user could submit code to prevent it from executing, meaning the problem would still persist. Oh the joys of being stuck on a dissertation... :(
You could set up your container with a ulimit on the max CPU time, which will kill the looping process. A malicious user can get around this, though, if they're root inside the container.
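For example (a sketch only; the image name and command are placeholders, not the asker's setup), docker run accepts a ulimit for CPU seconds:

# RLIMIT_CPU of 10 CPU-seconds inside the container; a looping process is
# signalled (SIGXCPU, then SIGKILL) once it has burned that much CPU time
docker run --rm --ulimit cpu=10 my-sandbox-image python /usercode/script.py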
There's another S.O. question, "Setting absolute limits on CPU for Docker containers" that describes how to limit the CPU consumption of containers. This would allow you to reduce the effect of malicious users.
I agree with Abdullah, though, that you ought to be able to docker kill the runaway from your supervisor.
If you want to run the containers without providing any protection inside them, you can use runtime constraints on resources.
In your case, -m 100M --cpu-quota 50000 might be reasonable.
That way it won't eat up the parent's system resources until you get around to killing it.
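A hedged example of those runtime constraints (again, the image name and command are placeholders):

# cap memory at 100 MB and CPU at 50% of one core (50000 of every 100000 microseconds)
docker run --rm -m 100M --cpu-quota 50000 my-sandbox-image python /usercode/script.py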
I have come up with a solution for this problem.
First, you must kill the docker container when the time limit is reached:
#!/bin/bash
set -e
did=$(docker run -it -d -v "/my_real_path/$1":/usercode virtual_machine ./usercode/compilerun.sh 2>> $1/error.txt)
sleep 10 && docker kill $did &> /dev/null && echo -n "timeout" >> $1/error.txt &
docker wait "$did" &> /dev/null
docker rm -f $did &> /dev/null
The container runs in detached mode (-d option), so it runs in the background.
Then sleep is also run in the background.
Then we wait for the container to stop. If it doesn't stop within 10 seconds (the sleep timer), the container is killed.
As you can see, the docker run process calls a script named compilerun.sh:
#!/bin/bash
gcc -o /usercode/file /usercode/file.c 2> /usercode/error.txt && ./usercode/file < /usercode/input.txt | head -c 1M > /usercode/output.txt
maxsize=1048576
actualsize=$(wc -c <"/usercode/output.txt")
if [ $actualsize -ge $maxsize ]; then
echo -e "1MB file size limit exceeded\n\n$(cat /usercode/output.txt)" > /usercode/output.txt
fi
It starts by compiling and running a C program (that's my use case; I am sure the same can be done for Python).
This part:
command | head -c 1M > /usercode/output.txt
is responsible for limiting the output size: it allows at most 1 MB of output.
After that, I just check whether the file has reached 1 MB. If so, I write a message at the beginning of the output file.
The --stop-timeout option is not killing the container if the timeout is exceeded.
Instead, use --ulimit cpu=<timeout in seconds> to kill the container if the timeout is exceeded.
This is based on the CPU time for the process inside the container.
I guess you can use signals in Python, as on Unix, to set a timeout. You can use an alarm for a specific time, say 50 seconds, and catch it. The following link might help you:
signals in python
Use the --stop-timeout option when running your docker container. This will send SIGKILL once the timeout has occurred.
I'm trying to use a Python script to run a series of OOMMF simulations on a Unix cluster, but I'm getting stuck at the point where I send a command from Python to bash. I'm using the line:
subprocess.check_call('qsub shellfile.sh')
This returns exit code 191. What is exit code 191? I can't seem to find it online. It may be a PBS error rather than a Unix error, but I'm not sure. The error doesn't seem to be in the shell file itself, since the only commands in there are:
#!/bin/bash
# This is an example submit script for the hello world program.
# OPTIONS FOR PBS PRO ==============================================================
#PBS -l walltime=1:00:00
# This specifies the job should run for no longer than 24 hours
#PBS -l select=1:ncpus=8:mem=2048mb
# This specifies the job needs 1 'chunk', with 1 CPU core, and 2048 MB of RAM (memory).
#PBS -j oe
# This joins up the error and output into one file rather that making two files
##PBS -o $working_folder/$PBS_JOBID-oommf_log
# This send your output to the file "hello_output" rather than the standard filename
# OPTIONS FOR PBS PRO ==============================================================
#PBS -P HPCA-000987-EFR
#PBS -M ppxsb3@nottingham.ac.uk
#PBS -m abe
# Here we just use Unix command to run our program
echo "Running on hostname"
sleep 20
echo "Finished job now""
Which should just print the hostname and 'Finished job now'
Thanks
Exit code 191 indicates that the project code associated with the job is invalid. This is set on line 13:
#PBS -P HPCA-000974-EFG
Which tells the cluster which project the code is associated with.
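An easy way to confirm this is to bypass Python and submit the script directly from a shell on the cluster; if the project code is the culprit, qsub itself reports the same status (a sketch using the file name from the question):

# submit directly and inspect the exit status; 191 here points at the PBS
# options (e.g. the -P project code) rather than at the Python call
qsub shellfile.sh
echo $?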
Context
I'm adding a few pieces to an existing, working system.
There is a control machine (a local Linux PC) running some test scripts which involve sending lots of commands to several different machines remotely via SSH. The test framework is written in Python and uses Fabric to access the different machines.
All commands are handled with a generic calling function, simplified below:
def cmd(host, cmd, args):
    ...
    with fabric.api.settings(host_string=..., user='root', use_ssh_config=True, disable_known_hosts=True):
        return fabric.api.run('%s %s' % (cmd, args))
The actual commands sent to each machine usually involve running an existing Python script on the remote side. Those Python scripts do some jobs, which include invoking external commands (using system and subprocess). The run() command called on the test PC returns when the remote Python script is done.
At one point I needed one of those remote Python scripts to launch a background task: starting an openvpn server and client using openvpn --config /path/to/config.openvpn. In a normal Python script I would just use &:
system('openvpn --config /path/to/config.openvpn > /var/log/openvpn.log 2>&1 &')
When this script is called remotely via Fabric, one must explicitly use nohup, dtach, screen and the like to run the job in the background. I got it working with:
system("nohup openvpn --config /path/to/config.openvpn > /var/log/openvpn.log 2>&1 < /dev/null &")
The Fabric FAQ goes into some details about this.
It works fine for certain background commands.
Problem: doesn't work for all types of background commands
This technique doesn't work for all the commands I need. In some scripts, I need to launch a background atop command (it's a top on steroids) and redirect its stdout to a file.
My code (note: using atop -P for parseable output):
system('nohup atop -P%s 1 < /dev/null | grep %s > %s 2>&1 &' % (dataset, grep_options, filename))
When the script containing that command is called remotely via Fabric, the atop process is immediately killed. The output file is generated but it's empty. Calling the same script while logged into the remote machine over SSH works fine; the atop command dumps data periodically into my output file.
Some googling and digging around brought me to interesting information about background jobs using Fabric, but my problem seems to be specific to certain types of background jobs. I've tried:
appending sleep
running with pty=False
replacing nohup with dtach -n: same symptoms
I read about commands like top failing in Fabric with stdin redirected to /dev/null; I'm not quite sure what to make of it. I played around with different combinations of (non-)redirection of STDIN, STDOUT and STDERR
Looks like I'm running out of ideas.
Fabric seems overkill for what we are doing. We don't even use the "fabfile" method because it's integrated into a nose framework and I run the tests by invoking nosetests. Maybe I should resort to dropping Fabric in favor of manual SSH commands, although I don't like the idea of changing a working system just because it doesn't support one of my newer modules.
In my environment, it looks like it is working:
from fabric.api import sudo
def atop():
sudo('nohup atop -Pcpu 1 </dev/null '
'| grep cpu > /tmp/log --line-buffered 2>&1 &',
pty=False)
result:
fabric:~$ fab atop -H web01
>>>[web01] Executing task 'atop'
>>>[web01] sudo: nohup atop -Pcpu 1 </dev/null | grep cpu > /tmp/log --line-buffered 2>&1 &
>>>
>>>Done.
web01:~$ cat /tmp/log
>>>cpu web01 1374246222 2013/07/20 00:03:42 361905 100 0 5486 6968 0 9344927 3146 0 302 555 0 2494 100
>>>cpu web01 1374246223 2013/07/20 00:03:43 1 100 0 1 0 0 99 0 0 0 0 0 2494 100
>>>cpu web01 1374246224 2013/07/20 00:03:44 1 100 0 1 0 0 99 0 0 0 0 0 2494 100
...
The atop command may need the superuser. This doesn't work:
from fabric.api import run
def atop():
run('nohup atop -Pcpu 1 </dev/null '
'| grep cpu > /tmp/log --line-buffered 2>&1 &',
pty=False)
On the other hand, this works:
from fabric.api import run
def atop():
run('sudo nohup atop -Pcpu 1 </dev/null '
'| grep cpu > /tmp/log --line-buffered 2>&1 &',
pty=False)