I have a Python script that uses multiprocessing's pool.map( ... ) to run a large number of calculations in parallel. Each calculation consists of the script setting up input for a Fortran program, using subprocess.Popen( ... , stdin=PIPE, stdout=PIPE, stderr=PIPE ) to run the program, feeding the input to it, and reading the output. The script then parses the output, extracts the needed numbers, and does it again for the next run.
def main():
    # Read a configuration file
    # do initial setup
    pool = multiprocessing.Pool(processes=maxProc)
    runner = CalcRunner( things that are the same for each run )
    runNumsAndChis = pool.map(runner, xrange(startRunNum, endRunNum))
    # dump the data that makes it past a cut to disk

class CalcRunner(object):
    def __init__(self, stuff):
        # set up member variables
    def __call__(self, runNumber):
        # get parameters for this run
        params = self.getParams(runNumber)
        inFileLines = []
        # write the lines of the new input file to a list
        makeInputFile(inFileLines, ... )
        process = subprocess.Popen(cmdString, bufsize=0, stdin=subprocess.PIPE, ... )
        output = process.communicate("".join(inFileLines))
        # get the needed numbers from stdout
        chi2 = getChiSq(output[0])
        return [runNumber, chi2]
...
Anyway, on to the reason for the question. I submit this script to a grid engine system to break this huge parameter-space sweep into 1000 twelve-core tasks (I chose 12 since most of the grid machines have 12 cores). When a single task runs on a single 12-core machine, about 1/3 of the machine's time is spent doing system stuff and the other 2/3 doing the user calculations, presumably setting up inputs to ECIS (the aforementioned Fortran code), running ECIS, and parsing its output. However, sometimes 5 tasks get sent to a 64-core machine to utilize 60 of its cores. On that machine, 40% of the time is spent doing system stuff and 1-2% doing user stuff.
First of all, where are all the system calls coming from? I tried writing a version of the program that runs ECIS once per separate thread and keeps piping new input to it, and that version spends FAR more time in system (and is slower overall), so it doesn't seem to be due to all the process creation and deletion.
Second of all, how do I go about decreasing the amount of time spent on system calls?
At a guess, the "open a process once and keep sending input to it" version was slower because I had to turn gfortran's output buffering off to get anything back from the process; nothing else worked (short of modifying the Fortran code... which isn't happening).
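For illustration, the "launch once and keep streaming" variant would look something like the sketch below; the executable name and the end-of-run marker are placeholders, and it assumes the Fortran program flushes its output after each run:

import subprocess

# sketch of the persistent-process variant: start the program once and stream
# one input block per run through its stdin/stdout
proc = subprocess.Popen(["./fortran_program"],           # placeholder executable name
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        universal_newlines=True)

def run_once(inFileLines):
    proc.stdin.write("".join(inFileLines))
    proc.stdin.flush()
    outputLines = []
    while True:
        line = proc.stdout.readline()
        if not line or line.startswith("END OF RUN"):    # placeholder end-of-run marker
            break
        outputLines.append(line)
    return outputLines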
The OS on my home test machines where I developed this is Fedora 14. The OS on the grid machines is a recent version of Red Hat.
I have tried playing around with bufsize, setting it to -1 (system default), 0 (unbuffered), 1 (line buffered), and 64 KB; none of that seems to change things.
Related
I am conducting an experiment to load RAM to 100% on Mac OS. I stumbled upon the method described here: https://unix.stackexchange.com/a/99365
I decided to do the same and wrote the two programs presented below. While the first program is executing, the system reports that the process takes 120 GB, but the memory usage graph stays stable. When the second program is executing, a warning pops up almost immediately that the system does not have enough resources. The second program creates ten parallel processes that increase memory consumption in approximately the same way.
First program:
import time

def load_ram(vm, timer):
    x = (vm * 1024 * 1024 * 1024 * 8) * (0,)
    begin_time = time.time()
    while time.time() - begin_time < timer:
        pass
    print("end")
Memory occupied by the first program
Second program:
import os

def load_ram(vm, timer):
    file_sh = open("bash_file.sh", "w")
    str_to_bash = """
    VM=%d;
    for i in {1..10};
    do
        python -c "x=($VM*1024*1024*1024*8)*(0,); import time; time.sleep(%d)" & echo "started" $i ;
    done""" % (int(vm), int(timer))
    file_sh.write(str_to_bash)
    file_sh.close()
    os.system("bash bash_file.sh")
Memory occupied by the second program
Memory occupied by the second program + system message
Parameters: vm = 16, timer = 30.
In the first program, the memory used comes to about 128 gigabytes (after that, a kill message pops up in the terminal and the process stops). The second program takes up more than 160 gigabytes, as shown in the picture, and none of these ten processes completes. The warning that the system is low on resources is displayed even when each process takes up only 10 gigabytes (that is, 100 gigabytes in total).
Given this situation, two questions arise:
Why, with the same memory consumption (120 gigabytes), does the system in the first case act as if the process does not exist, while in the second case it immediately buckles under the same load?
Where does the figure of 120 gigabytes come from if my computer has only 16 gigabytes of RAM?
Thank you for your attention!
I have a python script (grouper.py) that accepts 1 argument as input. Currently, due to the size of the input file, I must break the input argument up into 20 chunks, open 20 terminals, and run all 20 at once.
Is there a way to loop through all 20 input arguments, kicking off a python process for each?
import pandas as pd

def fun(i):
    j = pd.read_csv(i)
    # do some heavy processing
    return j

for i in inputfiles:
    print(i)
    outputfile = fun(i)
    outputfile.to_csv('outputfile.csv', index=False)
The above code of mine does each input file one at a time... Is there a way to run all 20 input files at once??
Thanks!!
Q : Script to kick-off multiple instances of another script that takes an input parameter?
GNU parallel solves this straight from the CLI:
parallel python grouper.py {} ::: file1 file2 file3 ...... file20
Given that a machine with about 20+ CPU cores runs this, the # do some heavy processing part need not remain constrained to merely [CONCURRENT] CPU-scheduling and may indeed perform the work in an almost [PARALLEL] fashion (provided there are no race conditions on shared resources).
for i in inputfiles:
    print(i)
    outputfile = fun(i)
...
is a pure-[SERIAL] iterator, producing just a sequence of passes, so launching the processes straight from the CLI may be the cheapest solution available. Python's joblib and other multiprocessing tools can spawn copies of the running python interpreter, yet that comes at a rather remarkable add-on cost if batch processing from a single CLI command already suffices for the target: processing a known list of files into another set of output files.
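If you do want to stay inside Python rather than the CLI, a minimal sketch with the standard multiprocessing.Pool might look like this; the per-file output name is an assumption added so that parallel runs don't overwrite each other:

import multiprocessing as mp
import pandas as pd

def fun(path):
    j = pd.read_csv(path)
    # do some heavy processing
    return j

def process_one(path):
    # write one output file per input file (assumed naming scheme)
    fun(path).to_csv(path + '.out.csv', index=False)
    return path

if __name__ == '__main__':
    inputfiles = ['file1', 'file2']   # your 20 input files go here
    with mp.Pool(processes=mp.cpu_count()) as pool:
        for done in pool.imap_unordered(process_one, inputfiles):
            print('finished', done)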
I have 2 separate scripts working with the same variables.
To be more precise, one script edits the variables and the other one uses them (it would be nice if it could edit them too, but that's not absolutely necessary).
This is what I am currently doing:
When code 1 edits a variable, it dumps it into a JSON file.
Code 2 repeatedly opens the JSON file to get the variables.
This method is really not elegant and the while loop is really slow.
How can I share variables across scripts?
My first script gets data from a midi controller and sends web requests.
My second script is for LED strips (those run thanks to the same midi controller). Both scripts run in a "while true" loop.
I can't simply put them in the same script, since every web request would slow the LEDs down. I am currently just sharing the variables via a JSON file.
If enough people ask for it I will post the whole code, but I have been told not to do this.
Considering the information you provided, meaning...
Both scripts run in a "while true" loop.
I can't simply put them in the same script, since every web request would slow the LEDs down.
To me, you have 2 choices:
Use a client/server model. You have 2 machines. One acts as the server, and the second as the client. The server runs a script with an infinite loop that consistently updates the data, and you would have an API that simply reads and exposes the current state of your file/database to the client. The client would be on another machine, and as I understand it, it would simply request the current data and process it (a minimal sketch follows after these two options).
Make a single multiprocessing script. Each script would run in a separate process and would manage its own memory. As you also want to share variables between your two programs, you could pass as an argument an object that is shared between both of them. See this resource to help you.
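For the client/server option, a minimal sketch using the standard library's multiprocessing.connection might look like the following; the address, authkey, and the 'fader' variable are placeholders chosen purely for illustration:

server.py (owns the variables and answers read requests):

from multiprocessing.connection import Listener

state = {'fader': 0}                      # placeholder shared variable

with Listener(('localhost', 6000), authkey=b'secret') as listener:
    while True:
        with listener.accept() as conn:
            state['fader'] += 1           # stand-in for "midi controller updates a value"
            conn.send(state)              # expose the current state to the client

client.py (polls the server whenever it needs the current values):

from multiprocessing.connection import Client

while True:
    with Client(('localhost', 6000), authkey=b'secret') as conn:
        state = conn.recv()
    print('current state:', state)        # stand-in for the LED-strip update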
Note that there are more solutions to this. For instance, you're using a JSON file that you are constantly opening and closing (that is probably what takes the most time in your program). You could use a real database that is opened only once and queried many times, while still being updated.
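For the database idea, a minimal sketch with the standard library's sqlite3 could look like this; the table and variable names are placeholders, and in practice each script would open its own connection to the same file:

import sqlite3

# writer side (script 1): update a named value in place
conn = sqlite3.connect('shared.db')
conn.execute('CREATE TABLE IF NOT EXISTS vars (name TEXT PRIMARY KEY, value TEXT)')
conn.execute("INSERT OR REPLACE INTO vars VALUES ('fader', '42')")
conn.commit()

# reader side (script 2): fetch the current value whenever it is needed
row = conn.execute("SELECT value FROM vars WHERE name = 'fader'").fetchone()
print('fader =', row[0])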
a Manager from multiprocessing lets you do this sort of thing pretty easily
first I simplify your "midi controller and sends web-request" code down to something that just sleeps for random amounts of time and updates a variable in a managed dictionary:
from time import sleep
from random import random

def slow_fn(d):
    i = 0
    while True:
        sleep(random() ** 2)
        i += 1
        d['value'] = i
next we simplify the "LED strip" control down to something that just prints to the screen:
from time import perf_counter, sleep

def fast_fn(d):
    last = perf_counter()
    while True:
        sleep(0.05)
        value = d.get('value')
        now = perf_counter()
        print(f'fast {value} {(now - last) * 1000:.2f}ms')
        last = now
you can then run these functions in separate processes:
import multiprocessing as mp

if __name__ == '__main__':  # guard needed when the start method is spawn (Windows/macOS)
    with mp.Manager() as manager:
        d = manager.dict()
        procs = []
        for fn in [slow_fn, fast_fn]:
            p = mp.Process(target=fn, args=[d])
            procs.append(p)
            p.start()
        for p in procs:
            p.join()
the "fast" output happens regularly with no obvious visual pauses
I wrote a function using the multiprocessing package from Python and tried to boost the speed of my code.
from arch.univariate import ARX, GARCH
from multiprocessing import Process
import multiprocessing
import time

def batch_learning(X, lag_array=None):
    """
    X is a time series array
    lag_array contains all possible lag numbers
    """
    # init a queue used for triggering different processes
    queue = multiprocessing.JoinableQueue()
    data = multiprocessing.Queue()

    # a worker called ARX_fit triggered by queue.get()
    def ARX_fit(queue):
        while True:
            q = queue.get()
            q.volatility = GARCH()
            print "Starting to fit lags %s" % str(q.lags.size/2)
            try:
                q_res = q.fit(update_freq=500)
            except:
                print "Error:...."
            print "finished lags %s" % str(q.lags.size/2)
            queue.task_done()

    # init four processes
    for i in range(4):
        process_i = Process(target=ARX_fit, name="Process_%s" % str(i), args=(queue,))
        process_i.start()

    # put ARX model objects into queue continuously
    for num in lag_array:
        queue.put(ARX(X, lags=num))

    # sync processes here
    queue.join()
    return
After calling function:
batch_learning(a, lag_array=range(1,10))
However it got stuck in the middle and I got the print out messages as below:
Starting to fit lags 1
Starting to fit lags 3
Starting to fit lags 2
Starting to fit lags 4
finished lags 1
finished lags 2
Starting to fit lags 5
finished lags 3
Starting to fit lags 6
Starting to fit lags 7
finished lags 4
Starting to fit lags 8
finished lags 6
finished lags 5
Starting to fit lags 9
It runs forever without any further printouts on my Mac OS El Capitan. Then, using PyCharm's debug mode and thanks to Tim Peters' suggestions, I successfully found out that the processes actually quit unexpectedly. Under debug mode, I could pinpoint that it is actually the svd function inside numpy.linalg.pinv(), used by the arch library, that causes this problem. Then my question is: why? It works with a single-process for-loop, but it cannot work with 2 or more processes. I don't know how to fix this problem. Is it a numpy bug? Can anyone help me a bit here?
I have to answer this question myself and provide my solutions. I have already solved this issue, thanks to the help from Tim Peters and aganders.
Multiprocessing usually hangs when you use numpy/scipy libraries on Mac OS because of the Accelerate framework Apple ships, which is used as a replacement for the OpenBLAS that numpy is normally built on. Simply put, in order to solve a similar problem, you have to do the following:
uninstall numpy and scipy (scipy needs to be matched with proper version of numpy)
follow the procedure on this link to rebuild numpy with OpenBLAS.
reinstall scipy and test your code to see if it works.
A heads-up for testing your multiprocessing code on Mac OS: when you run your code, it is better to set an environment variable:
OPENBLAS_NUM_THREADS=1 python import_test.py
The reason for doing this is that OpenBLAS by default creates 2 threads for each core to run on, in which case there are 8 threads running (2 per core) even though you only set up 4 processes. This creates a bit of overhead from thread switching. I tested the OPENBLAS_NUM_THREADS=1 configuration to limit each process to 1 thread per core, and it is indeed faster than the default setting.
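As a quick sanity check, a small sketch like the one below can show which BLAS backend your numpy is linked against and pin the OpenBLAS thread count from inside the script (the environment variable must be set before numpy is imported; the file name and matrix sizes are just example choices):

# import_test.py -- check the BLAS backend and limit OpenBLAS threads (sketch)
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # must be set before importing numpy

import numpy as np

np.show_config()                           # prints which BLAS/LAPACK numpy was built against

# a tiny pinv call to confirm the linear algebra path works in this configuration
a = np.random.rand(200, 100)
print(np.linalg.pinv(a).shape)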
There's not much to go on here, and the code indentation is wrong so it's hard to guess what you're really doing. To the extent I can guess, what you're seeing could happen if the OS killed a process in a way that didn't raise a Python exception.
One thing to try: first make a list, ps, of your four process_i objects. Then before queue.join() add:
while ps:
    new_ps = []
    for p in ps:
        if p.is_alive():
            new_ps.append(p)
        else:
            print("*********", p.name, "exited with", p.exitcode)
    ps = new_ps
    time.sleep(1)
So about once per second, this just runs through the list of worker processes to see whether any have (unexpectedly!) died. If one (or more) has, it displays the process name (which you supplied already) and the process exit code (as given by your OS). If that triggers, it would be a big clue.
If none die, then we have to wonder whether
q_res=q.fit(update_freq=500)
"simply" takes a very long time for some q states.
I have a command line program I'm running and I pipe in text as arguments:
somecommand.exe < someparameters_tin.txt
It runs for a while (typically a good fraction of an hour to several hours) and then writes results in a number of text files. I'm trying to write a script to launch several of these simultaneously, using all the cores on a many core machine. On other OSs I'd fork, but that's not implemented in many scripting languages for Windows. Python's multiprocessing looks like it might do the trick so I thought I'd give it a try, although I don't know python at all. I'm hoping someone can tell me what I'm doing wrong.
I wrote a script (below) which I point at a directory; it finds the executable and input files and launches them using pool.map with a pool of n and a function using call. What I see is that initially (with the first set of n processes launched) it seems fine, using n cores at 100%. But then I see the processes go idle, using none or only a few percent of their CPUs. There are always n processes there, but they aren't doing much. It appears to happen when they go to write the output data files, and once it starts, everything bogs down; overall core utilization ranges from a few percent to occasional peaks of 50-60%, but never gets near 100%.
If I can attach it (edit: I can't, at least for now) here's a plot of run times for the processes. The lower curve was when I opened n command prompts and manually kept n processes going at a time, easily keeping the computer near 100%. (The line is regular, slowly increasing from near 0 to 0.7 hours across 32 different processes varying a parameter.) The upper line is the result of some version of this script -- the runs times are inflated by about 0.2 hours on average and are much less predictable, like I'd taken the bottom line and added 0.2 + a random number.
Here's a link to the plot:
Run time plot
Edit: and now I think I can add the plot.
What am I doing wrong?
from multiprocessing import Pool, cpu_count, Lock
from subprocess import call
import glob, time, os, shlex, sys
import random

def launchCmd(s):
    mypid = os.getpid()
    try:
        retcode = call(s, shell=True)
        if retcode < 0:
            print >>sys.stderr, "Child was terminated by signal", -retcode
        else:
            print >>sys.stderr, "Child returned", retcode
    except OSError, e:
        print >>sys.stderr, "Execution failed:", e
if __name__ == '__main__':
    # ******************************************************************
    # change this to the path you have the executable and input files in
    mypath = 'E:\\foo\\test\\'
    # ******************************************************************
    startpath = os.getcwd()
    os.chdir(mypath)
    # find list of input files
    flist = glob.glob('*_tin.txt')
    elist = glob.glob('*.exe')
    # this will not act as expected if there's more than one .exe file in that directory!
    ex = elist[0] + ' < '
    print
    print 'START'
    print 'Path: ', mypath
    print 'Using the executable: ', ex
    nin = len(flist)
    print 'Found ', nin, ' input files.'
    print '-----'
    clist = [ex + s for s in flist]
    cores = cpu_count()
    print 'CPU count ', cores
    print '-----'
    # ******************************************************
    # change this to the number of processes you want to run
    nproc = cores - 1
    # ******************************************************
    pool = Pool(processes=nproc, maxtasksperchild=1)  # start nproc worker processes
    # mychunk = int(nin/nproc)  # this didn't help
    # list.reverse(clist)       # neither did this, or randomizing the list
    pool.map(launchCmd, clist)  # launch processes
    os.chdir(startpath)         # return to original working directory
    print 'Done'
Is there any chance that the processes are trying to write to a common file? Under Linux it would probably just work, clobbering data but not slowing down; but under Windows one process might get the file and all the other processes might hang waiting for the file to become available.
If you replace your actual task list with some silly tasks that use CPU but don't write to disk, does the problem reproduce? For example, you could have tasks that compute the md5sum of some large file; once the file was cached, the other tasks would be pure CPU followed by a single line of output to stdout. Or compute some expensive function or something.
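As a quick experiment along those lines, a sketch of such a CPU-only dummy task, dropped in place of launchCmd, might look like this (the repeat count is a placeholder, chosen only to burn CPU):

import hashlib

def cpuOnlyTask(path):
    # read the file once (cached afterwards), then hash it repeatedly; no output files written
    with open(path, 'rb') as f:
        data = f.read()
    digest = None
    for _ in range(200):                  # placeholder repeat count to keep the CPU busy
        digest = hashlib.md5(data).hexdigest()
    print('%s %s' % (path, digest))       # single line of output, no disk writes
    return digest

# used the same way as launchCmd, e.g.:
# pool.map(cpuOnlyTask, flist, chunksize=1)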
I think I know this. When you call map, it breaks the list of tasks into 'chunks' for each process. By default, it uses chunks large enough that it can send one to each process. This works on the assumption that all the tasks take about the same length of time to complete.
In your situation, presumably the tasks can take very different amounts of time to complete. So some workers finish before others, and those CPUs sit idle. If that's the case, then this should work as expected:
pool.map(launchCmd, clist, chunksize=1)
Less efficient, but it should mean that each worker gets more tasks as it finishes until they're all complete.
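In the same spirit, pool.imap_unordered also hands tasks to workers one at a time (its default chunksize is 1) and yields results as each task finishes, which additionally lets you watch progress; a rough sketch:

# same hand-one-task-at-a-time behaviour, plus per-task progress reporting
for _ in pool.imap_unordered(launchCmd, clist):
    print('another task finished')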