I need to create a very big array with over 10^7 columns, which then has to be filtered/modified depending on some criteria. There is a set of 24 different criteria (2 x 4 x 3 combinations), which means the filtering/modification needs to be done 24 times, and each result is saved in a different specified directory.
Since this takes a very long time, I am looking into using multiprocessing to speed up the process. Can anyone help me out? Here is some example code:
import itertools
import numpy as np
sample_size = 1000000
variables = 25
x_array = np.random.rand(variables, sample_size)
x_dir = ['x1', 'x2']
y_dir = ['y1', 'y2', 'y3', 'y4']
z_dir = ['z1', 'z2', 'z3']
x_directories = [0, 1]
y_directories = [0, 1, 2, 3]
z_directories = [0, 1, 2]
directory_combinations = itertools.product(x_directories, y_directories, z_directories)
for k, t, h in directory_combinations:
    target_dir = main_dir + '/' + x_dir[k] + '/' + y_dir[t] + '/' + z_dir[h]
    for i in range(sample_size):
        # x_array gets filtered/modified
        ...
    # x_array gets saved in target_dir directory as a dataframe after modification
Basically, with multiprocessing I am hoping to have either each loop iteration handled by a single core (out of the 16 I have available) or each loop iteration sped up by using all 16 cores.
Many thanks in advance!
Take the loop
for k, t, h in directory_combinations:
    ...
and rewrite its body as a function, for example:
def func(k, t, h):
    ...

pool = multiprocessing.Pool(12)
pool.starmap_async(func, directory_combinations, 32)
This spawns 12 processes that apply func to each (k, t, h) tuple of three arguments. The data is transferred to the processes in chunks of 32 items.
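For illustration, here is a minimal, self-contained sketch of that pattern. The array size, the output root MAIN_DIR, the result.npy filename and the body of func are placeholders for the real filtering/saving logic, not part of the original question:
import itertools
import multiprocessing
import os

import numpy as np

SAMPLE_SIZE = 1000        # kept small so the sketch runs quickly
VARIABLES = 25
MAIN_DIR = 'output'       # hypothetical root directory for the results

X_DIR = ['x1', 'x2']
Y_DIR = ['y1', 'y2', 'y3', 'y4']
Z_DIR = ['z1', 'z2', 'z3']

# On Linux the workers inherit x_array via fork; on Windows each worker
# re-imports the module, so seed it to keep the data identical everywhere.
np.random.seed(0)
x_array = np.random.rand(VARIABLES, SAMPLE_SIZE)


def func(k, t, h):
    target_dir = os.path.join(MAIN_DIR, X_DIR[k], Y_DIR[t], Z_DIR[h])
    os.makedirs(target_dir, exist_ok=True)
    modified = x_array.copy()  # placeholder for the real filtering/modification
    np.save(os.path.join(target_dir, 'result.npy'), modified)


if __name__ == '__main__':
    combos = list(itertools.product(range(2), range(4), range(3)))  # 24 tuples
    with multiprocessing.Pool(12) as pool:
        # chunksize=1 because there are only 24 tasks; .get() blocks until all are done
        pool.starmap_async(func, combos, chunksize=1).get()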
The following code first creates x_array in shared memory and initializes each process in the pool with a global variable x_array, which refers to this shared array.
I would move the code that makes a copy of this global x_array, processes it and then writes out the dataframe into a function, worker, which is passed the target directory as an argument.
import itertools
import numpy as np
import ctypes
import multiprocessing as mp
SAMPLE_SIZE = 1000000
VARIABLES = 25
def to_numpy_array(shared_array, shape):
    '''Create a numpy array backed by a shared memory Array.'''
    arr = np.ctypeslib.as_array(shared_array)
    return arr.reshape(shape)


def to_shared_array(arr, ctype):
    shared_array = mp.Array(ctype, arr.size, lock=False)
    temp = np.frombuffer(shared_array, dtype=arr.dtype)
    temp[:] = arr.flatten(order='C')
    return shared_array


def init_pool(shared_array, shape):
    global x_array
    # Recreate x_array using the shared memory array:
    x_array = to_numpy_array(shared_array, shape)


def worker(target_dir):
    # make a copy of x_array with np.copy
    x_array_copy = np.copy(x_array)
    for i in range(SAMPLE_SIZE):
        # x_array_copy gets filtered/modified
        ...
    # x_array_copy gets saved in the target_dir directory as a dataframe after modification


def main():
    main_dir = '.'  # for example
    x_dir = ['x1', 'x2']
    y_dir = ['y1', 'y2', 'y3', 'y4']
    z_dir = ['z1', 'z2', 'z3']
    x_directories = [0, 1]
    y_directories = [0, 1, 2, 3]
    z_directories = [0, 1, 2]
    directory_combinations = itertools.product(x_directories, y_directories, z_directories)
    target_dirs = [main_dir+'/'+x_dir[k]+'/'+y_dir[t]+'/'+z_dir[h] for k, t, h in directory_combinations]

    x_array = np.random.rand(VARIABLES, SAMPLE_SIZE)
    shape = x_array.shape
    # Create the array in shared memory (the data is float64, hence c_double):
    shared_array = to_shared_array(x_array, ctypes.c_double)
    # Recreate x_array using the shared memory array as the base:
    x_array = to_numpy_array(shared_array, shape)
    # Create a pool of 12 processes, each initialized with the shared array:
    pool = mp.Pool(12, initializer=init_pool, initargs=(shared_array, shape))
    pool.map(worker, target_dirs)


# This is required for Windows:
if __name__ == '__main__':
    main()
This is a school assignment (parallel normalization of each column of a matrix) and, besides other problems you may see, I found it particularly difficult to find something as easy as a list = [] that you can list.append() entire lists to in a loop, without predefining the dimensions.
Here is what I have so far with the line in question at the end. Thank you in advance for any help!
from multiprocessing import Pool
import numpy as np
def fct_norm(col):
    mn = col.min()
    mx = col.max()
    col_norm = np.zeros((6, 1))
    for i in range(6):
        col_norm[i, 0] = (col[i] - mn) / (mx - mn)
    return col_norm


if __name__ == "__main__":
    pool = Pool()
    arr = np.random.uniform(0, 100, size=(6, 3))
    # maybe predefine arr_norm here?
    for i in range(2):
        print("i = ", i)
        col = arr[:, i]
        result = pool.map(fct_norm, [col])
        norm_arr = HOW_TO_ADD_EACH_RESULT_COLUMN_TO_A_NEW_ARRAY?
The function you need to concatenate a number of columns is np.hstack. However, a bigger problem is that pool.map is not used in the correct way in the original code.
As written, there is no parallel execution over the columns, since each call to pool.map gets only a single column. The idea is to pass an iterable with several values at the same time - in this case, multiple columns - to pool.map.
Since numpy iterates over rows rather than columns, the matrix must be transposed (using the .T operator). Also, after the pool is finished, it is good practice to close it. One way to handle this automatically is to use a context manager (i.e., the with Pool() as pool: construct), as then it will be closed automatically.
This all taken together gives the following solution:
from multiprocessing import Pool
import numpy as np
def fct_norm(col):
    mn = col.min()
    mx = col.max()
    col_norm = np.zeros((6, 1))
    for i in range(6):
        col_norm[i, 0] = (col[i] - mn) / (mx - mn)
    return col_norm


if __name__ == "__main__":
    arr = np.random.uniform(0, 100, size=(6, 3))
    with Pool() as pool:
        norm_arr = np.hstack(pool.map(fct_norm, arr.T))
    # Here norm_arr is available for further operations.
Thus, the whole operation can be performed in two lines.
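As a quick sanity check (just a usage sketch placed inside the same if __name__ == "__main__" block, not part of the solution itself), each column of norm_arr should now run from 0 to 1:
print(norm_arr.shape)        # (6, 3): three normalized columns stacked side by side
print(norm_arr.min(axis=0))  # [0. 0. 0.]
print(norm_arr.max(axis=0))  # [1. 1. 1.]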
I am using a permutation test (pulling random sub-samples) to test the difference between 2 experiments. Each experiment was carried out 100 times (= 100 replicas of each). Each replica consists of 801 measurement points over time. Now I would like to perform a kind of permutation test (or bootstrapping) in order to determine how many replicas per experiment (and how many (time) measurement points) I need to obtain a certain reliability level.
For this purpose I have written a code from which I have extracted the minimal working example (with lots of things hard-coded) (please see below). The input data is generated as random numbers. Here np.random.rand(100, 801) for 100 replicas and 801 time points.
This code works in principle; however, the produced curves are sometimes not smoothly falling, as one would expect when choosing random sub-samples 5000 times. Here is the output of the code below:
It can be seen that at 2 positions on the x-axis there is an upward peak which should not be there. If I change the random seed from 52389 to 324235 it is gone and the curve is smooth. It seems there is something wrong with the way the random numbers are chosen?
Why is this the case? I have semantically similar code in Matlab, and there the curves are completely smooth at only 1000 permutations (here 5000).
Do I have a coding mistake or is the numpy random number generator not good?
Does anyone see the problem here?
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import current_process, cpu_count, Process, Queue
import matplotlib.pylab as pl
def groupDiffsInParallel(queue, d1, d2, nrOfReplicas, nrOfPermuts, timesOfInterestFramesIter):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts])  # e.g. 100 x 3000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())

            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())

            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)

    queue.put(allResults)


def evalDifferences_parallel(d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100x801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # 17
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames-1)/100  # sim time == 100 ns
    timesOfInterestFrames = [x*framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000

    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas, nrOfPermuts])  # e.g. 17 x 100 x 3000
    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')

    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel, args=(queue, d1, d2, nrOfReplicas, nrOfPermuts, timesOfInterestFramesIter))
        p.start()
        jobs.append(p)
        print('Process {} started work on time \"{} ns\"'.format(timesOfInterestFramesIterIdx, timesOfInterestNs[timesOfInterestFramesIterIdx]), end='\n', flush=True)

    # collect the results
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter in enumerate(timesOfInterestFrames):
        oneResult = queue.get()
        allResults[timesOfInterestFramesIterIdx, :, :] = oneResult
        print('Process number {} returned the results.'.format(timesOfInterestFramesIterIdx), end='\n', flush=True)

    # hold main thread and wait for the child processes to complete, then join back the resources in the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")

    allResultsMeanOverPermuts = allResults.mean(axis=2)  # size: 17 x 100

    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1  # zero index!
    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0

    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5, label="nReps="+str(lineIdx+1))
        ctr += 1

    ax[axId].set_xticks(range(nrOfTimesOfInterest))  # careful: this is not the same as plt.xticks!!
    ax[axId].set_xticklabels(timesOfInterestNs)
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values boot strapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0], ax[axId].get_xlim()[1]+1])  # increase x max by 2
    plt.show()
##### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
------------- UPDATE ---------------
Changing the random number generator from numpy to "from random import randint" does not fix the problem:
from:
d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
to:
d1Idxs = [randint(0, nrOfReplicas-1) for p in range(repsPerGroupIdx)]
d2Idxs = [randint(0, nrOfReplicas-1) for p in range(repsPerGroupIdx)]
--- UPDATE 2 ---
timesOfInterestNs can just be set to:
timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
to speed it up on machines with fewer cores.
--- UPDATE 3 ---
Re-initialising the random seed in each child process (the random seed is otherwise replicated across child processes) does not fix the problem either:
import re
import time

pid = str(current_process())
pid = int(re.split("(\W)", pid)[6])
ms = int(round(time.time() * 1000))
mySeed = np.mod(ms, 4294967295)
mySeed = mySeed + 25000 * pid + 100 * pid + pid
mySeed = np.mod(mySeed, 4294967295)
np.random.seed(seed=mySeed)
--- UPDATE 4 ---
On a Windows machine you will need a
if __name__ == '__main__':
guard to avoid creating subprocesses recursively (and a crash).
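For completeness, this is what that guard looks like when wrapped around the main part of the script above:
if __name__ == '__main__':
    np.random.seed(83737)  # some number for reproducibility
    d1 = np.random.rand(100, 801)
    d2 = np.random.rand(100, 801)

    np.random.seed(52389)
    evalDifferences_parallel(d1, d2)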
I guess this is the classical multiprocessing mistake. Nothing guarantees that the processes will finish in the same order in which they were started. This means that you cannot be sure that the instruction allResults[timesOfInterestFramesIterIdx, :, :] = oneResult will store the result of process timesOfInterestFramesIterIdx at location timesOfInterestFramesIterIdx in allResults. To make it clearer, let's say timesOfInterestFramesIterIdx is 2; then you have absolutely no guarantee that oneResult is the output of process 2.
I have implemented a very quick fix below. The idea is to track the order in which the processes have been launched by adding an extra argument to groupDiffsInParallel which is then stored in the queue and thereby serves as a process identifier when the results are gathered.
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import cpu_count, Process, Queue
import matplotlib.pylab as pl
def groupDiffsInParallel(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                         timesOfInterestFramesIter,
                         timesOfInterestFramesIterIdx):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts])  # e.g. 100 x 3000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())

            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())

            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)

    queue.put({'allResults': allResults,
               'number': timesOfInterestFramesIterIdx})


def evalDifferences_parallel(d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100x801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70,
                         80, 90, 100]  # 17
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames-1)/100  # sim time == 100 ns
    timesOfInterestFrames = [x*framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000

    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas,
                           nrOfPermuts])  # e.g. 17 x 100 x 3000
    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')

    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel,
                    args=(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                          timesOfInterestFramesIter,
                          timesOfInterestFramesIterIdx))
        p.start()
        jobs.append(p)
        print('Process {} started work on time \"{} ns\"'.format(
            timesOfInterestFramesIterIdx,
            timesOfInterestNs[timesOfInterestFramesIterIdx]),
            end='\n', flush=True)

    # collect the results
    resultdict = {}
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        resultdict.update(queue.get())
        allResults[resultdict['number'], :, :] = resultdict['allResults']
        print('Process number {} returned the results.'.format(
            resultdict['number']), end='\n', flush=True)

    # hold main thread and wait for the child processes to complete, then
    # join back the resources in the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")

    allResultsMeanOverPermuts = allResults.mean(axis=2)  # size: 17 x 100

    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40,
                                     50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1  # zero index!
    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0

    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5,
                      label="nReps="+str(lineIdx+1))
        ctr += 1

    ax[axId].set_xticks(range(nrOfTimesOfInterest))
    # careful: this is not the same as plt.xticks!!
    ax[axId].set_xticklabels(timesOfInterestNs)
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values boot "
                        + "strapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0], ax[axId].get_xlim()[1]+1])
    # increase x max by 2
    plt.show()
# #### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
This is the output I get, which obviously shows that the order in which the processes return is shuffled compared to the starting order.
20 cores available
Starting ...
Process 0 started work on time "0.25 ns"
Process 1 started work on time "0.5 ns"
Process 2 started work on time "1 ns"
Process 3 started work on time "2 ns"
Process 4 started work on time "3 ns"
Process 5 started work on time "4 ns"
Process 6 started work on time "5 ns"
Process 7 started work on time "10 ns"
Process 8 started work on time "20 ns"
Process 9 started work on time "30 ns"
Process 10 started work on time "40 ns"
Process 11 started work on time "50 ns"
Process 12 started work on time "60 ns"
Process 13 started work on time "70 ns"
Process 14 started work on time "80 ns"
Process 15 started work on time "90 ns"
Process 16 started work on time "100 ns"
Process number 3 returned the results.
Process number 0 returned the results.
Process number 4 returned the results.
Process number 7 returned the results.
Process number 1 returned the results.
Process number 2 returned the results.
Process number 5 returned the results.
Process number 8 returned the results.
Process number 6 returned the results.
Process number 9 returned the results.
Process number 10 returned the results.
Process number 11 returned the results.
Process number 12 returned the results.
Process number 13 returned the results.
Process number 14 returned the results.
Process number 15 returned the results.
Process number 16 returned the results.
All parallel done.
And here is the figure that is produced.
Not sure if you're still hung up on this issue, but I just ran your code on my machine (MacBook Pro, 15-inch, 2018) in Jupyter 4.4.0 and my graphs are smooth with the exact same seed values you originally posted:
##### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
Perhaps there's nothing wrong with your code, and nothing special about the 324235 seed, and you just need to double-check your module versions, since any changes to the source code made in more recent versions could affect your results. For reference, I'm using numpy 1.15.4, matplotlib 3.0.2 and multiprocessing 2.6.2.1.
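As a quick way to check this (a trivial sketch; it just prints whatever is installed in the environment that actually runs the script):
import sys
import numpy
import matplotlib

print(sys.version)             # Python interpreter version
print(numpy.__version__)       # e.g. 1.15.4
print(matplotlib.__version__)  # e.g. 3.0.2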
Suppose I want to plot the density on the x-y plane; the density is defined as:
def density(x, y):
    return x**2 + y**2
I have many points (x1,y1), (x2,y2), ... to calculate, so I want to do it in parallel. I found the multiprocessing docs and tried the following:
pointsList = [(1,1), (2,2), (3,3)]
from multiprocessing import Pool
if __name__ == '__main__':
    with Pool() as p:
        print(p.map(density,pointsList ))
An error occurs, and it seems that I failed to pass the args to the function. How do I do this?
Edit:
the error is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-647-1e2a1f0007fb> in <module>()
5 from multiprocessing import Pool
6 if __name__ == '__main__':
----> 7 with Pool() as p:
8 print(p.map(density,pointsList ))
AttributeError: __exit__
Edit 2:
If I can't do this simple parallelization in Python 2.7, how can I do it in Python 3.5, for instance?
The use of Pool in a context manager was added in Python 3.3. Since you tagged Python 2.7, you can't use the with syntax.
Documentation:
New in version 3.3: Pool objects now support the context management protocol – see Context Manager Types. __enter__() returns the pool object, and __exit__() calls terminate().
Here's the working example you wanted, for Python 3.3+:
def density(args):
    x, y = args
    return x**2 + y**2

pointsList = [(1,1), (2,2), (3,3)]

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool() as p:
        print(p.map(density, pointsList))
And since you're also using Python 2.7, you just need to not use the context manager and call p.terminate() instead:
def density(args):
    x, y = args
    return x**2 + y**2

pointsList = [(1,1), (2,2), (3,3)]

from multiprocessing import Pool

if __name__ == '__main__':
    p = Pool()
    print(p.map(density, pointsList))
    p.terminate()
You need to change the density function to unpack the tuple argument:
def density(z):
    (x, y) = z
    return x**2 + y**2
Try not using the with statement and close the pool yourself after you are done with it.
This way is compatible with both Python 2 and 3:
from multiprocessing import Pool

pointsList = [(1,1), (2,2), (3,3)]

# note: this assumes density takes a single (x, y) tuple, as shown above
p = Pool()
print(p.map(density, pointsList))
p.close()
or using contextlib module
from multiprocessing import Pool
import contextlib

pointsList = [(1,1), (2,2), (3,3)]

with contextlib.closing(Pool()) as p:
    print(p.map(density, pointsList))
I'm using Python's multiprocessing lib to speed up some code (least squares fitting with scipy).
It works fine on 3 different machines, but it shows a strange behaviour on a 4th machine.
The code:
import numpy as np
from scipy.optimize import least_squares
import time
import parmap
from multiprocessing import Pool
p0 = [1., 1., 0.5]
def f(p, xx):
    return p[0]*np.exp(-xx ** 2 / p[1] ** 2) + p[2]


def errorfunc(p, xx, yy):
    return f(p, xx) - yy


def do_fit(yy, xx):
    return least_squares(errorfunc, p0[:], args=(xx, yy))


if __name__ == '__main__':
    # create data
    x = np.linspace(-10, 10, 1000)
    y = []
    np.random.seed(42)
    for i in range(1000):
        y.append(f([np.random.rand(1) * 10, np.random.rand(1), 0.], x) + np.random.rand(len(x)))

    # fit without multiprocessing
    t1 = time.time()
    for y_data in y:
        p1 = least_squares(errorfunc, p0[:], args=(x, y_data))
    t2 = time.time()
    print t2 - t1

    # fit with multiprocessing lib
    times = []
    for p in range(1, 13):
        my_pool = Pool(p)
        t3 = time.time()
        results = parmap.map(do_fit, y, x, pool=my_pool)
        t4 = time.time()
        times.append(t4 - t3)
        my_pool.close()
    print times
For the 3 machines where it works, it speeds up roughly in the expected way. E.g. on my i7 laptop it gives:
[4.92650294303894, 2.5883090496063232, 1.7945551872253418, 1.629533052444458,
1.4896039962768555, 1.3550388813018799, 1.1796400547027588, 1.1852478981018066,
1.1404039859771729, 1.2239141464233398, 1.1676840782165527, 1.1416618824005127]
I'm running Ubuntu 14.10, Python 2.7.6, numpy 1.11.0 and scipy 0.17.0.
I tested it on another Ubuntu machine, a Dell PowerEdge R210 with similar results and on a MacBook Pro Retina (here with Python 2.7.11, and same numpy and scipy versions).
The computer that causes issues is a PowerEdge R710 (two hexcores) running Ubuntu 15.10, Python 2.7.11 and same numpy and scipy version as above.
However, I don't observe any speedup. Times are around 6 seconds, no matter what poolsize I use. In fact, it is slightly better for a poolsize of 2 and gets worse for more processes.
htop shows that somehow more processes get spawned than I would expect.
E.g. on my laptop htop shows one entry per process (which matches the poolsize) and eventually each process shows 100% CPU load.
On the PowerEdge R710 I see about 8 python processes for a poolsize of 1 and about 20 processes for a poolsize of 2 etc. each of which shows 100% CPU load.
I checked BIOS settings of the R710 and I couldn't find anything unusual.
What should I look for?
EDIT:
In answer to the comment, I used another simple script. Surprisingly, this one seems to 'work' on all machines:
from multiprocessing import Pool
import time
import math
import numpy as np
def f_np(x):
    return x**np.sin(x) + np.fabs(np.cos(x))**np.arctan(x)


def f(x):
    return x**math.sin(x) + math.fabs(math.cos(x))**math.atan(x)


if __name__ == '__main__':
    print "#pool", ", numpy", ", pure python"
    for p in range(1, 9):
        pool = Pool(processes=p)
        np.random.seed(42)
        a = np.random.rand(1000, 1000)
        t1 = time.time()
        for i in range(5):
            pool.map(f_np, a)
        t2 = time.time()
        for i in range(5):
            pool.map(f, range(1000000))
        print p, t2-t1, time.time()-t2
        pool.close()
gives:
#pool , numpy , pure python
1 1.34186911583 5.87641906738
2 0.697530984879 3.16030216217
3 0.470160961151 2.20742988586
4 0.35701417923 1.73128080368
5 0.308979988098 1.47339701653
6 0.286448001862 1.37223601341
7 0.274246931076 1.27663207054
8 0.245123147964 1.24748778343
on the machine that caused the trouble. There are no more threads (or processes?) spawned than I would expect.
It looks like numpy is not the problem, but as soon as I use scipy.optimize.least_squares the issue arises.
Using htop on the processes shows a lot of sched_yield() calls, which I don't see if I don't use scipy.optimize.least_squares and which I also don't see on my laptop even when using least_squares.
According to here, there is an issue when OpenBLAS is used together with joblib.
Similar issues occur when MKL is used (see here).
The solution given here also worked for me:
Adding
import os
os.environ['MKL_NUM_THREADS'] = '1'
at the beginning of my python script solves the issue.
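As a sketch of that workaround: the environment variables must be set before numpy and scipy are imported, and which variable matters depends on the BLAS library your build is linked against (OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are the analogous knobs for OpenBLAS/OpenMP builds; setting all of them is harmless):
import os

# must run before numpy/scipy are imported, otherwise the BLAS thread pool
# is already configured
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

import numpy as np
from scipy.optimize import least_squares
from multiprocessing import Pool

# ... the rest of the fitting script from the question follows unchanged,
# with each worker process now limited to a single BLAS thread.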