Solving LPs in parallel in Python

I have a list of LPs that I want to solve in parallel.
So far I have tried both multiprocessing and joblib, but both use only 1 CPU (out of 8).
My code:
import subprocess
from multiprocessing import Pool, cpu_count
from scipy.optimize import linprog
import numpy as np
from joblib import Parallel, delayed

def is_in_convex_hull(arg):
    A, v = arg
    res = linprog(np.zeros(A.shape[1]), A_eq=A, b_eq=v)
    return res['success']

def convex_hull_LP(A):
    pool = Pool(processes=cpu_count())
    res = pool.map(is_in_convex_hull,
                   [(np.delete(A, i, axis=1), A[:, i]) for i in range(A.shape[1])])
    pool.close()
    pool.join()
    return [i for i in range(A.shape[1]) if not res[i]]
Now in IPython I run:
A = np.random.randint(0,60,size = (40,300))
%time l1 = convex_hull_LP(A)
%time l2 = Parallel(n_jobs=8)(delayed(is_in_convex_hull)((np.delete(A,i,axis=1),A[:,i])) for i in range(A.shape[1]))
Both take about 7 seconds, but only a single CPU is used, although 8 different process IDs are shown.
Other Threads
With the answer from "Python multiprocessing.Pool() doesn't use 100% of each CPU" I got 100% on all cores, but I think an LP is complicated enough to be the bottleneck.
I couldn't make sense of "Multiprocess in python uses only one process".
My Questions
How can I split the jobs over all available CPUs?
Or is it even possible to run this on the GPU?
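If the cause turns out to be CPU-affinity pinning by an imported numerical library (a frequent culprit behind "8 worker PIDs but only one busy core" reports), one Linux-only workaround is to reset the affinity inside each worker. This is only a speculative sketch of that idea, reusing the imports and the is_in_convex_hull function from the snippet above; it is not a confirmed fix for this case:
import os
from multiprocessing import cpu_count

def is_in_convex_hull(arg):
    # Linux-only: undo any CPU affinity a library import may have set,
    # so this worker can be scheduled on any available core
    # (assumption: affinity pinning is what keeps all workers on one CPU)
    os.sched_setaffinity(0, range(cpu_count()))
    A, v = arg
    res = linprog(np.zeros(A.shape[1]), A_eq=A, b_eq=v)
    return res['success']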

Related

How to wait for the worker processes in Python multiprocessing.pool.Pool without closing it?

I'm benchmarking this script on a 6-core CPU with Ubuntu 22.04.1 and Python 3.10.6. It is supposed to show usage of all available CPU cores with the par function vs. a single core with the ser function.
import numpy as np
from multiprocessing import Pool
import timeit as ti

def foo(n):
    return -np.sort(-np.arange(n))[-1]

def par(reps, bigNum, pool):
    for i in range(bigNum, bigNum+reps):
        pool.apply_async(foo, args=(i,))

def ser(reps, bigNum):
    for i in range(bigNum, bigNum+reps):
        foo(i)

if __name__ == '__main__':
    bigNum = 9_000_000
    reps = 6

    fun = 'par(reps, bigNum, pool)'
    t = 1000 * np.array(ti.repeat(stmt=fun, setup='pool=Pool(reps);'+fun, globals=globals(), number=1, repeat=10))
    print(f'{fun}: {np.amin(t):6.3f}ms {np.median(t):6.3f}ms')

    fun = 'ser(reps, bigNum)'
    t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=10))
    print(f'{fun}: {np.amin(t):6.3f}ms {np.median(t):6.3f}ms')
Right now, the par function only shows the time needed to spin up the worker processes. What do I need to change in par to make it wait for all worker processes to complete before returning? Note that I would like to reuse the process pool between calls.
You need to get the result from apply_async in order to wait for it:
def par(reps, bigNum, pool):
    jobs = []
    for i in range(bigNum, bigNum+reps):
        jobs.append(pool.apply_async(foo, args=(i,)))
    for job in jobs:
        job.get()
For long loops you should use map, imap, or imap_unordered instead of apply_async: they have less overhead, you can control the chunksize for faster serialization of small objects, and you can pass generators to them to save memory or even allow infinite generators (with imap).
def par(reps, bigNum, pool):
    pool.map(foo, range(bigNum, bigNum+reps), chunksize=1)
Note: PEP 8 indentation in Python is 4 spaces, not 2.
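To illustrate the point above about generators and chunksize, here is a minimal sketch (reusing the foo and pool names from the example, purely for illustration): imap_unordered consumes the generator lazily and yields results as workers finish.
def par_lazy(reps, bigNum, pool):
    # the generator is consumed lazily; chunksize batches small items
    # together to reduce serialization overhead
    args = (i for i in range(bigNum, bigNum + reps))
    for result in pool.imap_unordered(foo, args, chunksize=2):
        print(result)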

Nested dask delayed or futures

Looking for best practice for nested parallel jobs. I couldn't nest dask delayed or futures, so I mixed both to get it to work. Is this not recommended? Is there a better way to do this? Example:
import dask
from dask.distributed import Client
import random
import time

client = Client()

def rndSeries(x):
    time.sleep(1)
    return random.sample(range(1, 50), x)

def sqNum(x):
    time.sleep(1)
    return x**2

def subProcess(li):
    results = []
    for i in li:
        r = dask.delayed(sqNum)(i)
        results.append(r)
    return dask.compute(sum(results))[0]

futures = []
for i in range(10):
    x = client.submit(rndSeries, random.randrange(5, 10, 1))
    y = client.submit(subProcess, x)
    futures.append(y)
client.gather(futures)
Consider modifying your script to have a deterministic workflow. If you start with 1 worker, you will see that the process completes in 20 seconds (as expected: 2 tasks of 1 second + 6 tasks of 3 seconds). If you have 2 workers, the execution time drops to 10 seconds.
import dask
from dask.distributed import Client, LocalCluster
import time
import numpy as np

cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

# if inside jupyter split the code below into a new cell
# to see accurate timing
%%time
def rndSeries(x):
    time.sleep(1)
    return np.random.rand()

def sqNum(x):
    time.sleep(3)
    return 1

def subProcess(li):
    results = []
    li = [1, 2, 3]
    for i in li:
        r = dask.delayed(sqNum)(i)
        results.append(r)
    return dask.compute(sum(results))[0]

futures = []
for i in range(2):
    x = client.submit(rndSeries, np.random.rand())
    y = client.submit(subProcess, x)
    futures.append(y)
client.gather(futures)
What happens if you have 6 workers? Execution time is now 4 seconds (the lowest possible for this task), so it seems the only drawback of calling dask.compute() inside a future is that it forces the results of the delayeds onto a single worker. This is probably OK in many cases; however, if the combined resource requirements of all delayed tasks exceed the resources of a single worker, then the best way to proceed is to submit tasks from tasks: https://distributed.dask.org/en/latest/task-launch.html
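A rough sketch of that tasks-from-tasks pattern, assuming your dask version exposes worker_client (the sqNum name is the one from the example above):
from dask.distributed import worker_client

def subProcess(li):
    # inside a running task, open a client to the scheduler so the
    # nested sqNum calls are spread over all workers instead of being
    # computed on this single worker
    with worker_client() as nested:
        futures = [nested.submit(sqNum, i) for i in [1, 2, 3]]
        return sum(nested.gather(futures))
With this variant the nested work is scheduled like any other task, so the single-worker restriction described above should no longer apply.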

More parallel processes than available processors in pathos

I used to be able to run 100 parallel processes this way:
import time
from multiprocessing import Process

def run_in_parallel(some_list):
    proc = []
    for list_element in some_list:
        time.sleep(20)
        p = Process(target=main, args=(list_element,))
        p.start()
        proc.append(p)
    for p in proc:
        p.join()

run_in_parallel(some_list)
but now my inputs are a bit more complicated and I'm getting "that" pickle error, so I had to switch to pathos.
The following minimal example of my code works well, but it seems to be limited by the number of threads. How can I get pathos to scale up to 100 parallel processes? My CPU only has 4 cores. My processes are idling most of the time, but they have to run for days. I don't mind having that time.sleep(20) in there for the initialization.
import itertools
from pathos.multiprocessing import ProcessingPool as Pool

input = zip(itertools.repeat((variable1, variable2, class1), len(some_list)), some_list)
p = Pool()
p.map(main, input)
Edit: Ideally I would like to do p = Pool(nodes=len(some_list)), which of course does not work.
I'm the pathos author. I'm not sure I'm interpreting your question correctly -- it's a bit easier to interpret a question when it comes with a minimal working code sample. However...
Is this what you mean?
>>> def name(x):
...     import multiprocess as mp
...     return mp.process.current_process().name
...
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool(ncpus=10)
>>> p.map(name, range(10))
['PoolWorker-1', 'PoolWorker-2', 'PoolWorker-3', 'PoolWorker-4', 'PoolWorker-6', 'PoolWorker-5', 'PoolWorker-7', 'PoolWorker-8', 'PoolWorker-9', 'PoolWorker-10']
>>> p.map(name, range(20))
['PoolWorker-1', 'PoolWorker-2', 'PoolWorker-3', 'PoolWorker-4', 'PoolWorker-6', 'PoolWorker-5', 'PoolWorker-7', 'PoolWorker-8', 'PoolWorker-9', 'PoolWorker-10', 'PoolWorker-1', 'PoolWorker-2', 'PoolWorker-3', 'PoolWorker-4', 'PoolWorker-6', 'PoolWorker-5', 'PoolWorker-7', 'PoolWorker-8', 'PoolWorker-9', 'PoolWorker-10']
>>>
Then, for example, if you want to reconfigure to use only 4 CPUs, you can do this:
>>> p.ncpus = 4
>>> p.map(name, range(20))
['PoolWorker-11', 'PoolWorker-11', 'PoolWorker-12', 'PoolWorker-12', 'PoolWorker-13', 'PoolWorker-13', 'PoolWorker-14', 'PoolWorker-14', 'PoolWorker-11', 'PoolWorker-11', 'PoolWorker-12', 'PoolWorker-12', 'PoolWorker-13', 'PoolWorker-13', 'PoolWorker-14', 'PoolWorker-14', 'PoolWorker-11', 'PoolWorker-11', 'PoolWorker-12', 'PoolWorker-12']
I'd worry that if you have only 4 cores but want 100-way parallelism, you may not get the scaling you expect. Depending on how long the function you want to parallelize takes, you might want to use one of the other pools, such as pathos.threading.ThreadPool or an MPI-centric pool from pyina.
With only 4 cores and 100 processes, the 4 cores will have 100 instances of Python spawned at once. That can be a serious memory hit, and the multiple instances of Python on a single core will compete for CPU time, so it might be best to play with the configuration a bit to find the right mix of resource oversubscription and resource idling.
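Since the processes are idle most of the time, the thread pool mentioned above may be the lighter-weight fit. A minimal sketch, assuming pathos.threading.ThreadPool accepts nodes= like the process pool does, with slow_task as a hypothetical stand-in for your main:
import time
from pathos.threading import ThreadPool

def slow_task(x):       # hypothetical stand-in for main()
    time.sleep(20)      # mostly idle, so 100 threads are cheap
    return x

tp = ThreadPool(nodes=100)   # nodes= assumed to mirror the process-pool API
results = tp.map(slow_task, range(100))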

Python multiprocessing: no performance gain with multiple processes [duplicate]

This question already has an answer here: How can I improve CPU utilization when using the multiprocessing module? (1 answer). Closed 8 years ago.
Using multiprocessing, I tried to parallelize a function, but I see no performance improvement:
from MMTK import *
from MMTK.Trajectory import Trajectory, TrajectoryOutput, SnapshotGenerator
from MMTK.Proteins import Protein, PeptideChain
import numpy as np
import time

filename = 'traj_prot_nojump.nc'
trajectory = Trajectory(None, filename)

def calpha_2dmap_mult(trajectory=trajectory, t=range(0, len(trajectory))):
    dist = []
    universe = trajectory.universe
    proteins = universe.objectList(Protein)
    chain = proteins[0][0]
    traj = trajectory[t]
    dt = 1000  # calculate distance every 1000 steps
    for n, step in enumerate(traj):
        if n % dt == 0:
            universe.setConfiguration(step['configuration'])
            for i in np.arange(len(chain)-1):
                for j in np.arange(len(chain)-1):
                    dist.append(universe.distance(chain[i].peptide.C_alpha,
                                                  chain[j].peptide.C_alpha))
    return dist

c0 = time.time()
dist1 = calpha_2dmap_mult(trajectory, range(0, 11001))
c1 = time.time() - c0
print(c1)

# Multiprocessing
from multiprocessing import Pool, cpu_count
pool = Pool(processes=4)
c0 = time.time()
dist_pool = [pool.apply(calpha_2dmap_mult, args=(trajectory, t,)) for t in
             [range(0, 2001), range(3000, 5001), range(6000, 8001),
              range(9000, 11001)]]
c1 = time.time() - c0
print(c1)
The time spent calculating the distances is the 'same' without (70.1s) and with multiprocessing (70.2s)! I was perhaps not expecting an improvement by a factor of 4, but I was at least expecting some improvement!
Does someone know what I did wrong?
Pool.apply is a blocking operation:
[Pool.apply is the] equivalent of the apply() built-in function. It blocks until the result is ready, so apply_async() is better suited for performing work in parallel ..
In this case Pool.map is likely more appropriate for collecting the results; the map itself blocks but the sequence elements / transformations are processed in parallel.
In addition to using partial application (or a manual equivalent of it), also consider expanding the data itself. It's the same cat in a different skin.
data = ((trajectory, r) for r in [range(0,2001), ..])
result = pool.map(.., data)
This can in turn be expanded:
def apply_data(d):
    return calpha_2dmap_mult(*d)

result = pool.map(apply_data, data)
The function (or a simple argument-expanding proxy for it) needs to accept a single argument, but all the data is now mapped as a single unit.
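A hedged sketch of the partial-application route mentioned above, fixing trajectory and mapping only over the ranges (it assumes calpha_2dmap_mult and trajectory are defined at module level, as in the question):
from functools import partial
from multiprocessing import Pool

ranges = [range(0, 2001), range(3000, 5001),
          range(6000, 8001), range(9000, 11001)]

with Pool(processes=4) as pool:
    # partial fixes the first argument, so each worker receives only a range
    worker = partial(calpha_2dmap_mult, trajectory)
    dist_pool = pool.map(worker, ranges)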

Parallelizing a Numpy vector operation

Let's use, for example, numpy.sin()
The following code will return the value of the sine for each value of the array a:
import numpy
a = numpy.arange( 1000000 )
result = numpy.sin( a )
But my machine has 32 cores, so I'd like to make use of them. (The overhead might not be worthwhile for something like numpy.sin() but the function I actually want to use is quite a bit more complicated, and I will be working with a huge amount of data.)
Is this the best (read: smartest or fastest) method:
from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    result = pool.map(numpy.sin, a)
or is there a better way to do this?
There is a better way: numexpr
Slightly reworded from their main page:
It's a multi-threaded VM written in C that analyzes expressions, rewrites them more efficiently, and compiles them on the fly into code that gets near-optimal parallel performance for both memory- and CPU-bound operations.
For example, on my 4-core machine, evaluating the sine is just slightly less than 4 times faster than numpy:
In [1]: import numpy as np
In [2]: import numexpr as ne
In [3]: a = np.arange(1000000)
In [4]: timeit ne.evaluate('sin(a)')
100 loops, best of 3: 15.6 ms per loop
In [5]: timeit np.sin(a)
10 loops, best of 3: 54 ms per loop
Documentation, including the supported functions, is available here. You'll have to check, or give us more information, to see whether your more complicated function can be evaluated by numexpr.
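As a small follow-up sketch: numexpr lets you cap the thread count with set_num_threads, and a compound expression is evaluated in one multi-threaded pass without materializing intermediate arrays (the expression here is just an illustration):
import numpy as np
import numexpr as ne

a = np.arange(1_000_000, dtype=np.float64)
ne.set_num_threads(8)  # cap the number of worker threads numexpr uses
result = ne.evaluate('sin(a)**2 + cos(a)**2')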
Well, here is an interesting note. If you run the following commands:
import numpy
from multiprocessing import Pool
a = numpy.arange(1000000)
pool = Pool(processes = 5)
result = pool.map(numpy.sin, a)
UnpicklingError: NEWOBJ class argument has NULL tp_new
I wasn't expecting that, so what's going on? Well:
>>> help(numpy.sin)
Help on ufunc object:
sin = class ufunc(__builtin__.object)
| Functions that operate element by element on whole arrays.
|
| To see the documentation for a specific ufunc, use np.info(). For
| example, np.info(np.sin). Because ufuncs are written in C
| (for speed) and linked into Python with NumPy's ufunc facility,
| Python's help() function finds this page whenever help() is called
| on a ufunc.
Yep, numpy.sin is implemented in C, so you can't really use it directly with multiprocessing; we have to wrap it in another function.
perf:
import time
import numpy
from multiprocessing import Pool
def numpy_sin(value):
    return numpy.sin(value)
a = numpy.arange(1000000)
pool = Pool(processes = 5)
start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.032201
Multithreaded 10.550432
Wow, I wasn't expecting that either. Well, there are a couple of issues: for starters, we are using a Python function (even if it's just a wrapper) versus a pure C function, and there's also the overhead of copying the values. multiprocessing by default doesn't share data, so each value needs to be copied back and forth.
Do note that if we properly segment our data:
import time
import numpy
from multiprocessing import Pool
def numpy_sin(value):
    return numpy.sin(value)
a = [numpy.arange(100000) for _ in xrange(10)]
pool = Pool(processes = 5)
start = time.time()
result = numpy.sin(a)
end = time.time()
print 'Singled threaded %f' % (end - start)
start = time.time()
result = pool.map(numpy_sin, a)
pool.close()
pool.join()
end = time.time()
print 'Multithreaded %f' % (end - start)
$ python perf.py
Singled threaded 0.150192
Multithreaded 0.055083
So what can we take from this? Multiprocessing is great, but we should always test and compare; sometimes it's faster and sometimes it's slower, depending on how it's used.
Granted, you are not using numpy.sin but some other function; I would recommend you first verify that multiprocessing will indeed speed up the computation, since the overhead of copying values back and forth may affect you.
Either way, I also believe that using pool.map is the best, safest method of parallelizing code...
I hope this helps.
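For completeness, a Python 3 sketch of the same chunking idea that splits one large array into per-core blocks and reassembles the result (np.array_split and np.concatenate are standard NumPy; the actual speedup still needs to be measured on your own machine):
import numpy as np
from multiprocessing import Pool, cpu_count

def numpy_sin(block):
    # each worker gets one large block, so per-call overhead is amortized
    return np.sin(block)

if __name__ == '__main__':
    a = np.arange(1_000_000)
    chunks = np.array_split(a, cpu_count())   # one block per core
    with Pool() as pool:
        result = np.concatenate(pool.map(numpy_sin, chunks))
    assert np.allclose(result, np.sin(a))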
SciPy actually has a pretty good writeup on this subject here.
