How to sum a loop in parallel using multiprocessing in Python

I am having difficulty understanding how to use Python's multiprocessing module.
I have a sum from 1 to n where n = 10^10; that range is far too large to fit into a list, which seems to be the approach taken by many of the multiprocessing examples online.
Is there a way to "split up" the range into segments of a certain size and then perform the sum for each segment?
For instance
def sum_nums(low, high):
    result = 0
    for i in range(low, high+1):
        result += i
    return result
I want to compute sum_nums(1, 10**10) by breaking it up into many calls: sum_nums(1, 1000) + sum_nums(1001, 2000) + sum_nums(2001, 3000) ... and so on. I know there is a closed-form formula n(n+1)/2, but pretend we don't know that.
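For reference, a minimal sketch of how the segment boundaries could be generated; the helper name segments and the chunk size are illustrative, not part of the question's code:

def segments(n, size):
    # Produce (low, high) pairs covering 1..n in fixed-size chunks.
    return [(lo, min(lo + size - 1, n)) for lo in range(1, n + 1, size)]

print(segments(10, 3))   # [(1, 3), (4, 6), (7, 9), (10, 10)]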
Here is what I've tried
import multiprocessing

def sum_nums(low, high):
    result = 0
    for i in range(low, high+1):
        result += i
    return result

if __name__ == "__main__":
    n = 1000
    procs = 2
    sizeSegment = n/procs

    jobs = []
    for i in range(0, procs):
        process = multiprocessing.Process(target=sum_nums, args=(i*sizeSegment+1, (i+1)*sizeSegment))
        jobs.append(process)

    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

    # where is the result?

I find the usage of multiprocessing.Pool and map() much simpler.
Using your code:
from multiprocessing import Pool

def sum_nums(args):
    low = int(args[0])
    high = int(args[1])
    return sum(range(low, high+1))

if __name__ == "__main__":
    n = 1000
    procs = 2
    sizeSegment = n/procs

    # Create the list of segments
    jobs = []
    for i in range(0, procs):
        jobs.append((i*sizeSegment+1, (i+1)*sizeSegment))

    pool = Pool(procs).map(sum_nums, jobs)
    result = sum(pool)
>>> print result
>>> 500500
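As a usage note, the same approach scales to the original n = 10**10. A minimal Python 3 sketch; the worker count of 8 and the integer division for the segment size are my assumptions (plain / would give a float in Python 3):

from multiprocessing import Pool

def sum_nums(args):
    low, high = args
    return sum(range(low, high + 1))   # range is lazy in Python 3

if __name__ == "__main__":
    n = 10**10
    procs = 8                          # illustrative worker count
    size = n // procs                  # integer segment size
    jobs = [(i*size + 1, (i+1)*size) for i in range(procs)]
    with Pool(procs) as pool:
        result = sum(pool.map(sum_nums, jobs))
    print(result)                      # 50000000005000000000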

You can do this sum without multiprocessing at all, and it's probably simpler, if not faster, to just use generators.
# prepare a generator of generators each at 1000 point intervals
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> list(xr)[:3]
[xrange(1, 1001), xrange(1001, 2001), xrange(2001, 3001)]
# sum, using two map functions
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> sum(map(sum, map(lambda x:x, xr)))
50000000005000000000L
However, if you want to use multiprocessing, you can do this too. I'm using a fork of multiprocessing that is better at serialization (but is otherwise not really different).
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> import pathos
>>> mmap = pathos.multiprocessing.ProcessingPool().map
>>> tmap = pathos.multiprocessing.ThreadingPool().map
>>> sum(tmap(sum, mmap(lambda x:x, xr)))
50000000005000000000L
The version w/o multiprocessing is faster and takes about a minute on my laptop. The multiprocessing version takes a few minutes due to the overhead of spawning multiple python processes.
If you are interested, get pathos here: https://github.com/uqfoundation

First, the best way to get around the memory issue is to use an iterator/generator instead of a list:
def sum_nums(low, high):
    result = 0
    for i in xrange(low, high+1):
        result += i
    return result
In Python 3, range() is already lazy, so this change is only needed in Python 2.
Now, multiprocessing comes in when you want to split the processing across different processes or CPU cores. If you don't need to control the individual workers, then the easiest method is to use a process pool. This lets you map a function over the pool and collect the output. Alternatively, you can use apply_async to submit jobs to the pool one at a time and get back a deferred result, which you retrieve with .get():
import multiprocessing
from multiprocessing import Pool
from time import time

def sum_nums(low, high):
    result = 0
    for i in xrange(low, high+1):
        result += i
    return result

# map requires a function that takes a single argument
def sn((low, high)):
    return sum_nums(low, high)

if __name__ == '__main__':
    #t = time()
    # takes forever
    #print sum_nums(1,10**10)
    #print '{} s'.format(time() -t)

    p = Pool(4)
    n = int(1e8)
    r = range(0, 10**10+1, n)
    results = []

    # using apply_async
    t = time()
    for arg in zip([x+1 for x in r], r[1:]):
        results.append(p.apply_async(sum_nums, arg))
    # wait for results
    print sum(res.get() for res in results)
    print '{} s'.format(time() - t)

    # using the process pool's map
    t = time()
    print sum(p.map(sn, zip([x+1 for x in r], r[1:])))
    print '{} s'.format(time() - t)
On my machine, just calling sum_nums with 10**10 takes almost 9 minutes, but using a Pool(8) and n=int(1e8) reduces this to just over a minute.

Related

Python-MultiThreading: Can MultiThreading improve "for loop" performance?

As far as I understand, multithreading is an ideal option for I/O-bound applications.
Therefore, I tested a "for loop" without any I/O (see the code below).
However, it still reduced the execution time from 6.3 s to 3.7 s.
Is this result correct, or is there a mistake in my assumption?
from multiprocessing.dummy import Pool as ThreadPool
import time

# normal
l = []
s = time.time()
for i in range(0, 10000):
    for j in range(i):
        l.append(j * 10)
e = time.time()
print(f"case1: {e-s}")  # 6.3 sec

# multiThread
def func(x):
    for i in range(x):
        l_.append(i * 10)

with ThreadPool(50) as pool:
    l_ = []
    s = time.time()
    pool.map(func, range(0, 10000))
e = time.time()
print(f"case2: {e-s}")  # 3.7 sec
What you are seeing is just Python specializing the function with faster op-codes in the multithreaded version, because it is a function that is called multiple times; see PEP 659, the Specializing Adaptive Interpreter. This only applies to Python 3.11 and later.
Changing the non-multithreaded version so that it is also a function called multiple times gives almost the same performance (Python 3.11):
from multiprocessing.dummy import Pool as ThreadPool
import time

l = []
def f2(i):
    for j in range(i):
        l.append(j * 10)

def f1():
    # normal
    for i in range(0, 10_000):
        f2(i)

s = time.time()
f1()
e = time.time()
print(f"case1: {e-s}")

# multiThread
def func(x):
    global l_
    for i in range(x):
        l_.append(i * 10)

with ThreadPool(50) as pool:
    l_ = []
    s = time.time()
    pool.map(func, range(0, 10_000))
e = time.time()
print(f"case2: {e-s}")
case1: 3.9744303226470947
case2: 4.036579370498657
Threading in Python IS going to be slower for functions that need the GIL, and manipulating lists requires the GIL, so using threading here will be slower. Python threads only improve performance if the GIL is dropped (which happens on I/O and on calls into external C libraries). If you ever see otherwise, then either your code drops the GIL or your benchmark is flawed.
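A minimal sketch of that distinction (the timings are only indicative and machine-dependent): time.sleep releases the GIL, so those calls overlap across threads, while the pure-Python loop holds the GIL and the threads effectively run one at a time.

from multiprocessing.dummy import Pool as ThreadPool
import time

def io_like(_):
    time.sleep(0.1)           # GIL released while sleeping, so threads overlap

def cpu_bound(_):
    s = 0
    for i in range(10**6):    # GIL held for the whole loop, so threads serialize
        s += i
    return s

if __name__ == "__main__":
    for name, func in (("io_like", io_like), ("cpu_bound", cpu_bound)):
        start = time.time()
        with ThreadPool(8) as pool:
            pool.map(func, range(8))
        print(name, "with 8 threads:", round(time.time() - start, 3), "s")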
In general it is true that multithreading is better suited for I/O bound operations. However, in this trivial case it is clearly not so.
It's worth pointing out that multiprocessing will outperform either of the strategies implemented in the OP's code.
Here's a rewrite that demonstrates 3 techniques:
from functools import wraps
from time import perf_counter
# the original snippet omits these imports; the TPE/PPE aliases used below imply them
from concurrent.futures import ThreadPoolExecutor as TPE, ProcessPoolExecutor as PPE

def timeit(func):
    @wraps(func)
    def _wrapper(*args, **kwargs):
        start = perf_counter()
        result = func(*args, **kwargs)
        duration = perf_counter() - start
        print(f'Function {func.__name__}{args} {kwargs} Took {duration:.4f} seconds')
        return result
    return _wrapper

@timeit
def case1():
    l = []
    for i in range(0, 10000):
        for j in range(i):
            l.append(j * 10)

def process(n):
    l = []
    for j in range(n):
        l.append(j * 10)

@timeit
def case2():
    with TPE() as tpe:
        tpe.map(process, range(0, 10_000))

@timeit
def case3():
    with PPE() as ppe:
        ppe.map(process, range(0, 10_000))

if __name__ == '__main__':
    for func in case1, case2, case3:
        func()
Output:
Function case1() {} Took 3.3104 seconds
Function case2() {} Took 2.6354 seconds
Function case3() {} Took 1.7245 seconds
In this case the trivial processing is probably outweighed by the overheads of thread management. If case1 were even more CPU-intensive, you'd probably begin to see rather different results.
Multi threading is ideal for I/O applications because it allows a server/host to accept multiple connections, and if a single request is slow or hangs, it can continue serving the other connections without blocking them.
That isn't mutually exclusive with speeding up a simple for-loop, if no coordination between threads is required, as in your trivial example above. If each iteration of the loop is completely independent of the others, then it's also well suited to multithreading, and that's why you're seeing a speed-up.

Multiprocessing: Shared memory slower than pickling?

I am trying to get acquainted with multiprocessing in Python. Performance does not work out as I expected; therefore, I am seeking advice on how to make things work more efficiently.
Let me first state my objective: I basically have a bunch of lists of data. Each of these lists can be processed independently, say by some dummy routine do_work. My implementation in my actual program is slow (slower than doing the same in a single process serially). I was wondering whether this is due to the pickling/unpickling overhead involved in multiprocess programming.
Therefore, I tried to implement a version using shared memory. Since the way I distribute the work ensures that no two processes write to the same piece of memory at the same time, I use multiprocessing.RawArray and RawValue. As it turns out, the version with shared memory is even slower.
My code is as follows: main_pass and worker_pass implement the parallelisation using return-statements, while main_shared and worker_shared use shared memory.
import multiprocessing, time, timeit, numpy as np

data = None

def setup():
    return np.random.randint(0, 100, (1000, 100000)).tolist(), list(range(1000))

def do_work(input):
    output = []
    for j in input:
        if j % 3 == 0:
            output.append(j)
    return output

def main_pass():
    global data
    data, instances = setup()
    with multiprocessing.Pool(4) as pool:
        start = time.time()
        new_blocks = pool.map(worker_pass, instances)
        print("done", time.time() - start)

def worker_pass(i):
    global data
    return do_work(data[i])

def main_shared():
    global data
    data, instances = setup()
    data = [(a := multiprocessing.RawArray('i', block), multiprocessing.RawValue('i', len(a))) for block in data]
    with multiprocessing.Pool(4) as pool:
        start = time.time()
        pool.map(worker_shared, instances)
        print("done", time.time() - start)
    new_blocks = [list(a[:l.value]) for a, l in data]
    print(new_blocks)

def worker_shared(i):
    global data
    array, length = data[i]
    new_block = do_work(array[:length.value])
    array[:len(new_block)] = new_block
    length.value = len(new_block)

import timeit

if __name__ == '__main__':
    multiprocessing.set_start_method('fork')
    print(timeit.timeit(lambda: main_pass(), number=1))
    print(timeit.timeit(lambda: main_shared(), number=1))
The timings I get:
done 7.257717132568359
10.633161254
done 7.889772891998291
38.037218965
So the version run first (using return) is way faster than the one writing the result to shared memory.
Why is this?
Btw., is it possible to measure the time spent on pickling/unpickling conveniently?
Info: I am using python 3.9 on MacOS 10.15.
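One rough way to gauge the pickling cost is to time pickle.dumps/pickle.loads on a representative block directly; a minimal sketch (the 100000-element block mirrors the setup above, and the numbers are only indicative):

import pickle, timeit
import numpy as np

block = np.random.randint(0, 100, 100000).tolist()
dump_time = timeit.timeit(lambda: pickle.dumps(block), number=100) / 100
payload = pickle.dumps(block)
load_time = timeit.timeit(lambda: pickle.loads(payload), number=100) / 100
print(f"dumps: {dump_time * 1e3:.2f} ms per block, loads: {load_time * 1e3:.2f} ms per block")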
What you say about the returned output from worker_pass being transferred via pickling is true, but that additional overhead clearly does not outweigh the additional work being done by worker_shared to "repack" the RawArray instances. Where shared memory achieves a performance improvement is when you are forced to use pickling for the worker_pass case anyway, as on platforms that use spawn to create new processes.
In the following spawn demo I seed the random number generator with a specific value so that both runs generate the same values, and I print the sum of all the returned numbers to confirm that both runs do equivalent processing. It is clear that the shared-memory arrays now perform better if you only time the pool creation (which is where the overhead lies for the non-shared-memory case) plus the map call. But once you include the additional setup and post-processing time that the shared-memory arrays require, the difference is not that significant:
import multiprocessing, time, timeit, numpy as np

def setup():
    np.random.seed(seed=1)
    return np.random.randint(0, 100, (1000, 100000)).tolist(), list(range(1000))

def init_process_pool(the_data):
    global data
    data = the_data

def do_work(input):
    output = []
    for j in input:
        if j % 3 == 0:
            output.append(j)
    return output

def main_pass():
    data, instances = setup()
    start = time.time()
    with multiprocessing.Pool(4, initializer=init_process_pool, initargs=(data,)) as pool:
        new_blocks = pool.map(worker_pass, instances)
        print("done", time.time() - start)
    print(sum(sum(new_block) for new_block in new_blocks))

def worker_pass(i):
    global data
    return do_work(data[i])

def main_shared():
    data, instances = setup()
    data = [(a := multiprocessing.RawArray('i', block), multiprocessing.RawValue('i', len(a))) for block in data]
    start = time.time()
    with multiprocessing.Pool(4, initializer=init_process_pool, initargs=(data,)) as pool:
        pool.map(worker_shared, instances)
        print("done", time.time() - start)
    new_blocks = [list(a[:l.value]) for a, l in data]
    #print(new_blocks)
    print(sum(sum(new_block) for new_block in new_blocks))

def worker_shared(i):
    global data
    array, length = data[i]
    new_block = do_work(array[:length.value])
    array[:len(new_block)] = new_block
    length.value = len(new_block)

import timeit

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    print(timeit.timeit(lambda: main_pass(), number=1))
    print(timeit.timeit(lambda: main_shared(), number=1))
Prints:
done 17.68915629386902
1682969169
20.2827687
done 3.9250364303588867
1682969169
23.2993996

Why is Python multithreading in this example so slow?

From what I understand, multithreading is supposed to speed things up when the program is I/O-bound. Why is the example below so much slower, and how can I make it produce the exact same result as the single-threaded version?
Single-threaded:
Time: 1.8115384578704834
import time

start_time = time.time()

def testThread(num):
    num = ""
    for i in range(500):
        num += str(i % 10)
    a.write(num)

def main():
    test_list = [x for x in range(3000)]
    for i in test_list:
        testThread(i)

if __name__ == '__main__':
    a = open('single.txt', 'w')
    main()
    print(time.time() - start_time)
Multi-threaded:
Time: 22.509746551513672
import threading
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.pool import ThreadPool
import time

start_time = time.time()

def testThread(num):
    num = ""
    for i in range(500):
        num += str(i % 10)
    with global_lock:
        a.write(num)

def main():
    test_list = [x for x in range(3000)]
    with ThreadPool(4) as executor:
        results = executor.map(testThread, test_list)
    # with ThreadPoolExecutor() as executor:
    #     results = executor.map(testThread, test_list)

if __name__ == '__main__':
    a = open('multi.txt', 'w')
    global_lock = threading.Lock()
    main()
    print(time.time() - start_time)
Also, how is ThreadPoolExecutor different from ThreadPool?
Edit. Forgot about GIL, silly me.
The code below shows a general approach that applies to other multithreaded languages as well, since they have the same problem.
However, in this case the language is Python. Due to the GIL, there will only ever be one active thread at a time here, as none of the threads ever actually yields. (The OS may forcefully swap threads mid-way, but that doesn't change the fact that only one thread ever runs at a time.)
For computation and file I/O, Python won't see any gains from multithreading.
To increase computation speed, you can multiprocess (a sketch of that follows the sequential-write example below). However, you still can't speed up the file I/O:
for i in range(500):
    num += str(i % 10)
with global_lock:
    a.write(num)
You now have 4 threads locking and writing, which is the slower part of the function.
Essentially, you have 4 threads doing 1 task, and you have added a ton of overhead and wait time on top of it.
From the comments, here is something that may help:
unordered_output_list = []

def testThread(num):
    num = ""
    for i in range(500):
        num += str(i % 10)
    unordered_output_list.append(num)

def main():
    test_list = [x for x in range(3000)]
    with ThreadPool(4) as executor:
        results = executor.map(testThread, test_list)
    # with ThreadPoolExecutor() as executor:
    #     results = executor.map(testThread, test_list)
    for num in unordered_output_list:
        a.write(num)

if __name__ == '__main__':
    a = open('multi.txt', 'w')
    main()
    print(time.time() - start_time)
Here we do the processing in parallel (a little speed up), then we write in sequence.
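A minimal sketch of the multiprocessing variant mentioned earlier (build the strings in worker processes, then write sequentially in the parent); the pool size and chunksize are illustrative assumptions, not values from the original code:

from multiprocessing import Pool
import time

def build_string(num):
    # CPU-bound part: build the 500-character string in a worker process.
    # num is unused, mirroring the original testThread, which overwrote it.
    return "".join(str(i % 10) for i in range(500))

if __name__ == '__main__':
    start_time = time.time()
    with Pool(4) as pool:
        chunks = pool.map(build_string, range(3000), chunksize=100)
    # File I/O stays in the parent process and is written sequentially.
    with open('multi_mp.txt', 'w') as f:
        for chunk in chunks:
            f.write(chunk)
    print(time.time() - start_time)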
From this answer
Short answer: physically writing to the same disk from multiple
threads at the same time, will never be faster than writing to that
disk from one thread (talking about normal hard disks here). In some
cases it can even be a lot slower.

Getting a pickle error when trying to run processes

What I'm trying to do is run prime-number decomposition for a list of numbers in several processes at once. I have a threaded version that works, but I can't seem to get it working with processes.
import math
from Queue import Queue
import multiprocessing

def primes2(n):
    primfac = []
    num = n
    d = 2
    while d * d <= n:
        while (n % d) == 0:
            primfac.append(d)  # supposing you want multiple factors repeated
            n //= d
        d += 1
    if n > 1:
        primfac.append(n)
    myfile = open('processresults.txt', 'a')
    myfile.write(str(num) + ":" + str(primfac) + "\n")
    return primfac

def mp_factorizer(nums, nprocs):

    def worker(nums, out_q):
        """ The worker function, invoked in a process. 'nums' is a
            list of numbers to factor. The results are placed in
            a dictionary that's pushed to a queue.
        """
        outdict = {}
        for n in nums:
            outdict[n] = primes2(n)
        out_q.put(outdict)

    # Each process will get 'chunksize' nums and a queue to put his out
    # dict into
    out_q = Queue()
    chunksize = int(math.ceil(len(nums) / float(nprocs)))
    procs = []

    for i in range(nprocs):
        p = multiprocessing.Process(
            target=worker,
            args=(nums[chunksize * i:chunksize * (i + 1)],
                  out_q))
        procs.append(p)
        p.start()

    # Collect all results into a single result dict. We know how many dicts
    # with results to expect.
    resultdict = {}
    for i in range(nprocs):
        resultdict.update(out_q.get())

    # Wait for all worker processes to finish
    for p in procs:
        p.join()

    print resultdict

if __name__ == '__main__':
    mp_factorizer((400243534500, 100345345000, 600034522000, 9000045346435345000), 4)
I'm getting a pickle error shown below:
Any help would be greatly appreciated :)
You need to use multiprocessing.Queue instead of a regular Queue.
This is because each Process runs in a separate memory space, and some objects aren't picklable, such as a regular queue (Queue.Queue). To overcome this, the multiprocessing library provides a Queue class that is actually a proxy to a queue.
Also, you should extract def worker(...) out to module level like any other function. This could be your main problem, because of how a process is forked at the OS level.
You can also use a multiprocessing.Manager.
Dynamically created functions cannot be pickled and therefore cannot be used as the target of a Process; the function worker needs to be defined at global scope instead of inside the definition of mp_factorizer.
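Putting both fixes together, a minimal sketch (assuming Python 3; the worker sits at module level, results travel over a multiprocessing.Queue, and the chunking mirrors the question's code; the simplified primes2 here drops the file writing):

import math
import multiprocessing

def primes2(n):
    # Trial-division factorization, as in the question (file writing omitted).
    primfac, d = [], 2
    while d * d <= n:
        while n % d == 0:
            primfac.append(d)
            n //= d
        d += 1
    if n > 1:
        primfac.append(n)
    return primfac

def worker(nums, out_q):
    # Factor each number and push one dict of results onto the queue.
    outdict = {}
    for n in nums:
        outdict[n] = primes2(n)
    out_q.put(outdict)

def mp_factorizer(nums, nprocs):
    out_q = multiprocessing.Queue()     # shareable across processes, unlike Queue.Queue
    chunksize = int(math.ceil(len(nums) / float(nprocs)))
    procs = []
    for i in range(nprocs):
        p = multiprocessing.Process(
            target=worker,
            args=(nums[chunksize * i:chunksize * (i + 1)], out_q))
        procs.append(p)
        p.start()
    resultdict = {}
    for i in range(nprocs):
        resultdict.update(out_q.get())
    for p in procs:
        p.join()
    return resultdict

if __name__ == '__main__':
    print(mp_factorizer([400243534500, 100345345000, 600034522000, 9000045346435345000], 4))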

Python Multiprocessing with a single function

I have a simulation that is currently running, but the ETA is about 40 hours -- I'm trying to speed it up with multi-processing.
It essentially iterates over 3 values of one variable (L) and over 99 values of a second variable (a). Using these values, it runs a complex simulation and returns 9 different standard deviations. Thus (even though I haven't coded it that way yet) it is essentially a function that takes two values as inputs (L, a) and returns 9 values.
Here is the essence of the code I have:
STD_1 = []
STD_2 = []
# etc.

for L in range(0,6,2):
    for a in range(1,100):
        ### simulation code ###
        STD_1.append(value_1)
        STD_2.append(value_2)
        # etc.
Here is what I can modify it to:
master_list = []

def simulate(a,L):
    ### simulation code ###
    return (a, L, STD_1, STD_2)  # etc.

for L in range(0,6,2):
    for a in range(1,100):
        master_list.append(simulate(a,L))
Since each of the simulations is independent, this seems like an ideal place to implement some sort of multi-threading/processing.
How exactly would I go about coding this?
EDIT: Also, will everything be returned to the master list in order, or could it possibly be out of order if multiple processes are working?
EDIT 2: This is my code -- but it doesn't run correctly. It asks if I want to kill the program right after I run it.
import multiprocessing

data = []
for L in range(0,6,2):
    for a in range(1,100):
        data.append((L,a))
print(data)

def simulation(arg):
    # unpack the tuple
    a = arg[1]
    L = arg[0]
    STD_1 = a**2
    STD_2 = a**3
    STD_3 = a**4
    # simulation code #
    return((STD_1, STD_2, STD_3))

print("1")
p = multiprocessing.Pool()
print("2")
results = p.map(simulation, data)
EDIT 3: Also, what are the limitations of multiprocessing? I've heard that it doesn't work on OS X. Is this correct?
Wrap the data for each iteration up into a tuple.
Make a list data of those tuples
Write a function f to process one tuple and return one result
Create p = multiprocessing.Pool() object.
Call results = p.map(f, data)
This will run as many instances of f as your machine has cores in separate processes.
Edit1: Example:
from multiprocessing import Pool

data = [('bla', 1, 3, 7), ('spam', 12, 4, 8), ('eggs', 17, 1, 3)]

def f(t):
    name, a, b, c = t
    return (name, a + b + c)

p = Pool()
results = p.map(f, data)
print results
Edit2:
Multiprocessing should work fine on UNIX-like platforms such as OSX. Only platforms that lack os.fork (mainly MS Windows) need special attention. But even there it still works. See the multiprocessing documentation.
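One caveat worth adding: on platforms that spawn rather than fork worker processes (Windows, and macOS by default since Python 3.8), the pool creation and map call need to sit under the __main__ guard, otherwise the child processes re-execute the module's top level and multiprocessing raises an error. A minimal sketch of the Edit1 example adjusted for that:

from multiprocessing import Pool

data = [('bla', 1, 3, 7), ('spam', 12, 4, 8), ('eggs', 17, 1, 3)]

def f(t):
    name, a, b, c = t
    return (name, a + b + c)

if __name__ == '__main__':
    # Guard required on spawn-based platforms so child processes
    # can import this module without re-running the pool setup.
    with Pool() as p:
        results = p.map(f, data)
    print(results)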
Here is one way to run it in parallel threads:
import threading

L_a = []
for L in range(0,6,2):
    for a in range(1,100):
        L_a.append((L,a))
        # Add the rest of your objects here

def RunParallelThreads():
    # Create an index list
    indexes = range(0, len(L_a))
    # Create the output list
    output = [None for i in indexes]
    # Create all the parallel threads
    threads = [threading.Thread(target=simulate, args=(output,i)) for i in indexes]
    # Start all the parallel threads
    for thread in threads: thread.start()
    # Wait for all the parallel threads to complete
    for thread in threads: thread.join()
    # Return the output list
    return output

def simulate(list, index):
    (L,a) = L_a[index]
    list[index] = (a,L)  # Add the rest of your objects here

master_list = RunParallelThreads()
Use Pool().imap_unordered if ordering is not important; it yields each result as soon as it is ready, without blocking on the others.
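A minimal sketch of that, reusing the simulation function and data list from the question's EDIT 2 (results arrive in completion order, not input order):

from multiprocessing import Pool

def simulation(arg):
    L, a = arg
    return (a**2, a**3, a**4)   # stand-in for the real simulation code

if __name__ == '__main__':
    data = [(L, a) for L in range(0, 6, 2) for a in range(1, 100)]
    with Pool() as p:
        # imap_unordered yields each result as soon as its worker finishes
        for result in p.imap_unordered(simulation, data):
            print(result)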
