Why is the curve of my permutation test analysis not smooth? - python

I am using a permutation test (drawing random sub-samples) to test the difference between 2 experiments. Each experiment was carried out 100 times (= 100 replicas each). Each replica consists of 801 measurement points over time. Now I would like to perform a kind of permutation (or bootstrapping) in order to test how many replicas per experiment (and how many time points) I need to obtain a certain reliability level.
For this purpose I have written code, from which I have extracted the minimal working example below (with lots of things hard-coded). The input data is generated as random numbers: np.random.rand(100, 801) for 100 replicas and 801 time points.
This code works in principle, however the produced curves are sometimes not smoothly falling, as one would expect when drawing random sub-samples 5000 times. Here is the output of the code below:
It can be seen that at 2 of the x-axis positions there is an upward peak which should not be there. If I change the random seed from 52389 to 324235 it is gone and the curve is smooth. It seems there is something wrong with the way the random numbers are chosen?
Why is this the case? I have semantically similar code in Matlab, and there the curves are completely smooth at already 1000 permutations (here: 5000).
Do I have a coding mistake, or is the numpy random number generator not good?
Does anyone see the problem here?
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import current_process, cpu_count, Process, Queue
import matplotlib.pylab as pl


def groupDiffsInParallel(queue, d1, d2, nrOfReplicas, nrOfPermuts, timesOfInterestFramesIter):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts])  # e.g. 100 x 3000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())

            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())

            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)
    queue.put(allResults)


def evalDifferences_parallel(d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100x801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # 17
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames - 1) / 100  # sim time == 100 ns
    timesOfInterestFrames = [x * framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000

    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas, nrOfPermuts])  # e.g. 17 x 100 x 3000
    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')

    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel, args=(queue, d1, d2, nrOfReplicas, nrOfPermuts, timesOfInterestFramesIter))
        p.start()
        jobs.append(p)
        print('Process {} started work on time "{} ns"'.format(timesOfInterestFramesIterIdx, timesOfInterestNs[timesOfInterestFramesIterIdx]), end='\n', flush=True)

    # collect the results
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter in enumerate(timesOfInterestFrames):
        oneResult = queue.get()
        allResults[timesOfInterestFramesIterIdx, :, :] = oneResult
        print('Process number {} returned the results.'.format(timesOfInterestFramesIterIdx), end='\n', flush=True)

    # hold the main thread and wait for the child processes to complete, then
    # join back the resources in the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")

    allResultsMeanOverPermuts = allResults.mean(axis=2)  # size: 17 x 100

    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1  # zero index!
    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0

    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5, label="nReps=" + str(lineIdx + 1))
        ctr += 1

    ax[axId].set_xticks(range(nrOfTimesOfInterest))  # careful: this is not the same as plt.xticks!!
    ax[axId].set_xticklabels(timesOfInterestNs)
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values of bootstrapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0], ax[axId].get_xlim()[1] + 1])  # increase x max by 1
    plt.show()


##### MAIN ####
np.random.seed(83737)  # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)

np.random.seed(52389)  # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
------------- UPDATE ---------------
Changing the random number generator from numpy to "from random import randint" does not fix the problem:
from:
d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
to:
d1Idxs = [randint(0, nrOfReplicas-1) for p in range(repsPerGroupIdx)]
d2Idxs = [randint(0, nrOfReplicas-1) for p in range(repsPerGroupIdx)]
--- UPDATE 2 ---
timesOfInterestNs can just be set to:
timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
to speed it up on machines with fewer cores.
--- UPDATE 3 ---
Re-initialising the random seed in each child process (otherwise the random seed is replicated across child processes) does not fix the problem either:
import re
import time

pid = str(current_process())
pid = int(re.split(r"(\W)", pid)[6])
ms = int(round(time.time() * 1000))
mySeed = np.mod(ms, 4294967295)
mySeed = mySeed + 25000 * pid + 100 * pid + pid
mySeed = np.mod(mySeed, 4294967295)
np.random.seed(seed=mySeed)
--- UPDATE 4 ---
On a Windows machine you will need a:
if __name__ == '__main__':
to avoid creating subprocesses recursively (and a crash).

I guess this is the classical multiprocessing mistake. Nothing guarantees that the processes will finish in the same order as the one they started. This means that you cannot be sure that the instruction allResults[timesOfInterestFramesIterIdx, :, :] = oneResult will store the result of process timesOfInterestFramesIterIdx at the location timesOfInterestFramesIterIdx in allResults. To make it clearer, let's say timesOfInterestFramesIterIdx is 2, then you have absolutely no guarantee that oneResult is the output of process 2.
I have implemented a very quick fix below. The idea is to track the order in which the processes have been launched by adding an extra argument to groupDiffsInParallel which is then stored in the queue and thereby serves as a process identifier when the results are gathered.
import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import cpu_count, Process, Queue
import matplotlib.pylab as pl


def groupDiffsInParallel(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                         timesOfInterestFramesIter,
                         timesOfInterestFramesIterIdx):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts])  # e.g. 100 x 3000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())

            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())

            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)
    # ship the process identifier along with the results
    queue.put({'allResults': allResults,
               'number': timesOfInterestFramesIterIdx})


def evalDifferences_parallel(d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100x801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70,
                         80, 90, 100]  # 17
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames - 1) / 100  # sim time == 100 ns
    timesOfInterestFrames = [x * framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000

    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas,
                           nrOfPermuts])  # e.g. 17 x 100 x 3000
    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')

    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel,
                    args=(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                          timesOfInterestFramesIter,
                          timesOfInterestFramesIterIdx))
        p.start()
        jobs.append(p)
        print('Process {} started work on time "{} ns"'.format(
            timesOfInterestFramesIterIdx,
            timesOfInterestNs[timesOfInterestFramesIterIdx]),
            end='\n', flush=True)

    # collect the results; 'number' identifies the process they came from
    resultdict = {}
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        resultdict.update(queue.get())
        allResults[resultdict['number'], :, :] = resultdict['allResults']
        print('Process number {} returned the results.'.format(
            resultdict['number']), end='\n', flush=True)

    # hold the main thread and wait for the child processes to complete, then
    # join back the resources in the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")

    allResultsMeanOverPermuts = allResults.mean(axis=2)  # size: 17 x 100

    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40,
                                     50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1  # zero index!
    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0

    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5,
                      label="nReps=" + str(lineIdx + 1))
        ctr += 1

    ax[axId].set_xticks(range(nrOfTimesOfInterest))
    # careful: this is not the same as plt.xticks!!
    ax[axId].set_xticklabels(timesOfInterestNs)
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values of "
                        "bootstrapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0], ax[axId].get_xlim()[1] + 1])
    # increase x max by 1
    plt.show()


# #### MAIN ####
np.random.seed(83737)  # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)

np.random.seed(52389)  # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
This is the output I get, which obviously shows that the order in which the processes return is shuffled compared to the starting order.
20 cores available
Starting ...
Process 0 started work on time "0.25 ns"
Process 1 started work on time "0.5 ns"
Process 2 started work on time "1 ns"
Process 3 started work on time "2 ns"
Process 4 started work on time "3 ns"
Process 5 started work on time "4 ns"
Process 6 started work on time "5 ns"
Process 7 started work on time "10 ns"
Process 8 started work on time "20 ns"
Process 9 started work on time "30 ns"
Process 10 started work on time "40 ns"
Process 11 started work on time "50 ns"
Process 12 started work on time "60 ns"
Process 13 started work on time "70 ns"
Process 14 started work on time "80 ns"
Process 15 started work on time "90 ns"
Process 16 started work on time "100 ns"
Process number 3 returned the results.
Process number 0 returned the results.
Process number 4 returned the results.
Process number 7 returned the results.
Process number 1 returned the results.
Process number 2 returned the results.
Process number 5 returned the results.
Process number 8 returned the results.
Process number 6 returned the results.
Process number 9 returned the results.
Process number 10 returned the results.
Process number 11 returned the results.
Process number 12 returned the results.
Process number 13 returned the results.
Process number 14 returned the results.
Process number 15 returned the results.
Process number 16 returned the results.
All parallel done.
And the figure which is produced.

Not sure if you're still hung up on this issue, but I just ran your code on my machine (MacBook Pro, 15-inch, 2018) in Jupyter 4.4.0, and my graphs are smooth with the exact same seed values you originally posted:
##### MAIN ####
np.random.seed(83737) # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)
np.random.seed(52389) # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
Perhaps there's nothing wrong with your code and nothing special about the 324235 seed; you may just need to double-check your module versions, since changes to the source code in more recent versions could affect your results. For reference, I'm using numpy 1.15.4, matplotlib 3.0.2 and multiprocessing 2.6.2.1.

Related

Use multiprocessing's 'Pool' together with 'RawArray'

Whenever I try to use shared memory with Python's 'multiprocessing' module to fill a huge array in parallel, I use something like:
import numpy as np
from multiprocessing import Process, RawArray


def tf(x, arr):
    # view the shared buffer as a numpy array (the buffer is 'f', i.e. float32)
    arr = np.frombuffer(arr, dtype=np.float32).reshape((10, 10, 10))
    arr[x] = np.random.random((10, 10))


mpa = RawArray('f', 1000)
ncpu = 4
procs = []
for i in range(10):
    procs.append(Process(target=tf, args=(i, mpa)))
    procs[-1].start()
    if len(procs) == ncpu:
        procs[0].join()
        procs.pop(0)

for p in procs:
    p.join()

arr = np.frombuffer(mpa, dtype=np.float32).reshape((10, 10, 10))
to ensure that only as many processes are active as I have CPUs. If I try to use 'Pool' and 'apply_async', the array is not altered for some reason. So I wonder whether it is possible to use 'Pool', or any other intended way, to manage the number of active processes.
The above code works, but it is not the most efficient, since I only check whether the process I added first has finished in order to decide whether another process should be added.

Saving images in a loop faster than multithreading / multiprocessing

Here's a timed example of multiple image arrays of different sizes being saved in a loop as well as concurrently using threads / processes:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import cv2
import numpy as np


def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)


if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    t1 = perf_counter()
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            with executor(workers) as ex:
                futures = [
                    ex.submit(save_img, i, img, temp_dir)
                    for (i, img) in enumerate(ll)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
And I get these durations on my i5 mbp:
Time for 100: 0.09495482999999982 seconds
Time for 100 (ThreadPoolExecutor): 0.14151873999999998 seconds
Time for 100 (ProcessPoolExecutor): 1.5136184309999998 seconds
Time for 1000: 0.36972280300000016 seconds
Time for 1000 (ThreadPoolExecutor): 0.619205703 seconds
Time for 1000 (ProcessPoolExecutor): 2.016624468 seconds
Time for 10000: 4.232915643999999 seconds
Time for 10000 (ThreadPoolExecutor): 7.251599262 seconds
Time for 10000 (ProcessPoolExecutor): 13.963426469999998 seconds
Aren't threads / processes expected to take less time to achieve the same thing? And why not in this case?
The timings in the code are wrong because the timer t is not reset before testing the pools. Nevertheless, the relative order of the timings is correct. A possible version with a timer reset is:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import cv2
import numpy as np


def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)


if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            t = perf_counter()
            with executor(workers) as ex:
                futures = [
                    ex.submit(save_img, i, img, temp_dir)
                    for (i, img) in enumerate(ll)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
Multithreading is faster, especially for I/O-bound processes. In this case, compressing the images is CPU-intensive, so depending on the implementation of OpenCV and of the Python wrapper, multithreading can be much slower. In many cases the culprit is CPython's GIL, but I am not sure if this is the case here (I do not know whether the GIL is released during the imwrite call). In my setup (i7 8th gen), threading is as fast as the loop for 100 images and barely faster for 1000 and 10000 images. If ThreadPoolExecutor reuses threads, there is overhead involved in assigning a new task to an existing thread; if it does not reuse threads, there is overhead involved in launching a new thread.
Multiprocessing circumvents the GIL issue but has some other problems. First, pickling the data to pass between processes takes some time, and in the case of images it can be very expensive. Second, on Windows, spawning a new process takes a lot of time. A simple test to see the overhead (both for processes and threads) is to replace the save_img function with one that does nothing but still needs pickling, etc.:
def save_img(idx, image, dst):
    if idx != idx:
        print("impossible!")
and by a similar one without parameters to see the overhead of spawning the processes, etc.
The timings in my setup show that 2.3 seconds are needed just to spawn the 10000 processes and 0.6 extra seconds for pickling, which is much more than the time needed for processing.
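The GIL point above can be illustrated with a small stdlib-only sketch: for pure-Python CPU-bound work, four threads take roughly as long as a serial loop (the numbers vary per machine; this is an illustration, not a benchmark):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    # pure-Python CPU work; the GIL serialises this across threads
    s = 0
    for i in range(n):
        s += i * i
    return s

N = 1_000_000

t = time.perf_counter()
serial = [busy(N) for _ in range(4)]
t_serial = time.perf_counter() - t

t = time.perf_counter()
with ThreadPoolExecutor(4) as ex:
    threaded = list(ex.map(busy, [N] * 4))
t_threaded = time.perf_counter() - t

# threaded is typically no faster than serial for pure-Python CPU work
print(f'serial {t_serial:.2f}s, threaded {t_threaded:.2f}s')
```

If cv2.imwrite released the GIL for the whole call, the image-saving threads would not be bound this way, which is why the answer hedges on where the time goes.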
A way to improve the throughput and keep the overhead to a minimum is to break the work into chunks and submit each chunk to the worker:
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import cv2
import numpy as np


def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)


def multi_save_img(idx_start, images, dst):
    for idx, image in zip(range(idx_start, idx_start + len(images)), images):
        cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)


if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        # split the work into one contiguous chunk per worker
        chunk_size = len(ll) // workers
        ends = [chunk_size * (_ + 1) for _ in range(workers)]
        ends[-1] += len(ll) % workers
        starts = [chunk_size * _ for _ in range(workers)]
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            t = perf_counter()
            with executor(workers) as ex:
                futures = [
                    ex.submit(multi_save_img, start, ll[start:end], temp_dir)
                    for (start, end) in zip(starts, ends)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
This should give you a significant boost over a simple for loop, both for the multiprocessing and the multithreading approach.

Multiprocessing for a range of loops in Python?

I have a very big array that I need to create, with over 10^7 columns, which needs to get filtered/modified depending on some criteria. There is a set of 24 different criteria (2x4x3 due to combinations), which means the filtering/modification needs to be done 24 times, and each result is saved in a different specified directory.
Since this takes a very long time, I am looking into using multiprocessing to speed up the process. Can anyone help me out? Here is an exemplary code:
import itertools
import numpy as np
sample_size = 1000000
variables = 25
x_array = np.random.rand(variables, sample_size)
x_dir = ['x1', 'x2']
y_dir = ['y1', 'y2', 'y3', 'y4']
z_dir = ['z1', 'z2', 'z3']
x_directories = [0, 1]
y_directories = [0, 1, 2, 3]
z_directories = [0, 1, 2]
directory_combinations = itertools.product(x_directories, y_directories, z_directories)
for k, t, h in directory_combinations:
    target_dir = main_dir + '/' + x_dir[k] + '/' + y_dir[t] + '/' + z_dir[h]
    for i in range(sample_size):
        # x_array gets filtered/modified
        # x_array gets saved in target_dir directory as a dataframe after modification
        pass
Basically, with multiprocessing I am hoping for each loop to be handled by a single core out of the 16 I have available, or for each loop iteration to be sped up by using all 16 cores.
Many thanks in advance!
Take one of the loops and rewrite it as a function. For example,
for k, t, h in directory_combinations:
becomes, for example,
def func(k, t, h):
    ...

pool = multiprocessing.Pool(12)
pool.starmap_async(func, directory_combinations, 32)
pool.close()
pool.join()  # wait for all submitted work to finish
This spawns 12 processes that apply func to each 3-tuple of arguments; the data is transferred to the processes in chunks of 32 tuples.
The following code first creates x_array in shared memory and initializes each process in the pool with the global variable x_array, which is this shared array.
I would move the code that creates a copy of this global x_array, processes it, and then writes out the dataframe into a function, worker, which is passed the target directory as an argument.
import itertools
import numpy as np
import ctypes
import multiprocessing as mp

SAMPLE_SIZE = 1000000
VARIABLES = 25


def to_numpy_array(shared_array, shape):
    '''Create a numpy array backed by a shared memory Array.'''
    arr = np.ctypeslib.as_array(shared_array)
    return arr.reshape(shape)


def to_shared_array(arr, ctype):
    shared_array = mp.Array(ctype, arr.size, lock=False)
    temp = np.frombuffer(shared_array, dtype=arr.dtype)
    temp[:] = arr.flatten(order='C')
    return shared_array


def init_pool(shared_array, shape):
    global x_array
    # Recreate x_array using the shared memory array:
    x_array = to_numpy_array(shared_array, shape)


def worker(target_dir):
    # make a copy of x_array with np.copy
    x_array_copy = np.copy(x_array)
    for i in range(SAMPLE_SIZE):
        # x_array_copy gets filtered/modified
        ...
    # x_array_copy gets saved in target_dir directory as a dataframe after modification


def main():
    main_dir = '.'  # for example
    x_dir = ['x1', 'x2']
    y_dir = ['y1', 'y2', 'y3', 'y4']
    z_dir = ['z1', 'z2', 'z3']
    x_directories = [0, 1]
    y_directories = [0, 1, 2, 3]
    z_directories = [0, 1, 2]
    directory_combinations = itertools.product(x_directories, y_directories, z_directories)
    target_dirs = [main_dir + '/' + x_dir[k] + '/' + y_dir[t] + '/' + z_dir[h]
                   for k, t, h in directory_combinations]

    x_array = np.random.rand(VARIABLES, SAMPLE_SIZE)
    shape = x_array.shape
    # Create the array in shared memory (c_double matches the float64 data):
    shared_array = to_shared_array(x_array, ctypes.c_double)
    # Recreate x_array using the shared memory array as the base:
    x_array = to_numpy_array(shared_array, shape)
    # Create a pool of 12 processes sharing the array with each process:
    pool = mp.Pool(12, initializer=init_pool, initargs=(shared_array, shape))
    pool.map(worker, target_dirs)


# This is required for Windows:
if __name__ == '__main__':
    main()

How to Parallelize a GridSearch Scan with Talos

While Talos supports GPU parallelization, how do you extend the Scan object to support CPU + GPU parallelization?
Following the approach of breaking up the scan experiments into processes:
import multiprocessing as mp
from itertools import product
import talos
import os


# Helper function to create configuration chunks
def chunkify(lst, n):
    return [lst[i::n] for i in range(n)]


# a Talos Scan configuration superset
playbook_configurations = {
    "input_lstm_dim": [5, 15, 30, 50],
    "dense_a_dim": [None, 5],
    "dense_b_dim": [None, 5],
    "dense_c_dim": [None, 5],
    "dropout_a_rate": [None, 0.7, 0.5, 0.3],
    "epochs": [100],
    "verbose": [verbose_flag],
    "batch_normalization": [None, 1]
}

# Thread-safe queue for scan results
output = mp.Queue()


# Actual scan to run within each process
def process_scan(playbook_scan_settings, output):
    scan = talos.Scan(
        ...
        params=playbook_scan_settings,
    )
    ...
    output.put(results)  # pump results onto the queue


# Process count based on core affinity
cpu_count = len(os.sched_getaffinity(0))

# Cartesian product of the Talos configuration
playbook_configurations_cartesian_product = [
    dict(zip(playbook_configurations, v))
    for v in product(*playbook_configurations.values())]

# Configuration chunks to assign to each process
playbook_configuration_groups = chunkify(
    playbook_configurations_cartesian_product, cpu_count)

processes = []
for playbook_configuration_group in playbook_configuration_groups:
    # merged (array) configuration for the process group
    playbook_scan_settings = {}
    for g in playbook_configuration_group:
        for k, v in g.items():
            if k not in playbook_scan_settings:
                playbook_scan_settings[k] = []
            if v not in playbook_scan_settings[k]:
                playbook_scan_settings[k].append(v)
    if bool(playbook_scan_settings):
        # process to scan the merged configuration for the process group
        processes.append(mp.Process(
            target=process_scan, args=(playbook_scan_settings, output)))

for p in processes:
    p.start()

for p in processes:
    p.join()

# Collect the results from the message queue
results = [output.get() for p in processes]
You can easily pump the Report object, winning models, and metrics per scan tier into the message queue for final selection.

If statement over dask array

Hi everybody, can you tell me why an if statement over a dask array is so slow, and how to solve it?
import dask.array as da
import time

x = da.random.binomial(1, 0.5, 200, 200)
s = time.time()
if da.any(x):
    e = time.time()
    print('duration = ', e - s)
output: duration = 0.368
Dask array is lazy by default, so no work happens until you call .compute() on your array.
In your case you are implicitly calling .compute() when you place your dask array into an if statement, which converts things into booleans.
x = da.random.random(...)  # this is free
y = x + x.T                # this is free
z = y.any()                # this is free
if z:                      # everything above happens now
    ...
I took a look at the dask source code. Essentially, when you call functions on dask arrays it performs a "reduction" of the array. Intuitively this is necessary because, behind the scenes, dask arrays are stored as separate "blocks" that can live individually in memory, on disk, etc., but you need to somehow pull pieces of them together for function calls.
So the time you are noticing is the initial overhead of performing the reduction. Note that if you increase the size of the array to 2M, it takes about the same time as for 200. At 20M it only takes about 1 s.
import dask.array as da
import time

# 200 case
x = da.random.binomial(1, 0.5, 200, 200)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    e = time.time()
    print('duration = ', e - s)
# duration = 0.362557172775

# 2M case
x = da.random.binomial(1, 0.5, 2000000, 2000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    e = time.time()
    print('duration = ', e - s)
# duration = 0.132781982422

# 20M case
x = da.random.binomial(1, 0.5, 20000000, 20000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    e = time.time()
    print('duration = ', e - s)
# duration = 1.08430886269

# 200M case
x = da.random.binomial(1, 0.5, 200000000, 200000000)
print(x.shape)
s = time.time()
print("start")
if da.any(x):
    e = time.time()
    print('duration = ', e - s)
# duration = 8.83682179451
