Multiprocessing and lists

Multiprocessing and lists - python

I have been trying to optimise my code using the multiprocessing module, but I think I have fallen for the trap of premature optimization.
For example, when running this code:
num = 1000000
l = mp.Manager().list()
for i in range(num):
l.append(i)
l_ = Counter(l)
It takes several times longer than this:
num = 1000000
l = []
for i in range(num):
l.append(i)
l_ = Counter(l)
What is the reason the multiprocessing list is slower than regular lists? And are there ways to make them as efficient?

Shared memroy data structures are meant to be shared between processes. To synchronize accesses, they need to be locked. On the other hand, a list ([]) does not require a lock.
With / without locking makes a difference.

Related

Python nested for loop faster than single for loop

Why is the nested for loop faster than the single for loop?
start = time()
k = 0
m = 0
for i in range(1000):
for j in range(1000):
for l in range(100):
m+=1
#for i in range(100000000):
# k +=1
print int(time() - start)
For the single for loop I get a time of 14 seconds and for the nested for loop of 10 seconds

The relevant context is explained in this topic.
In short, range(100000000) builds a huge list in Python 2, whereas with the nested loops you only build lists with a total of 1000 + 1000 + 100 = 2100 elements. In Python 3, range is smarter and lazy like xrange in Python 2.
Here are some timings for the following code. Absolute runtime depends on the system, but comparing the values with each other is valuable.
import timeit
runs = 100
code = '''k = 0
for i in range(1000):
for j in range(1000):
for l in range(100):
k += 1'''
print(timeit.timeit(stmt=code, number=runs))
code = '''k = 0
for i in range(100000000):
k += 1'''
print(timeit.timeit(stmt=code, number=runs))
Outputs:
CPython 2.7 - range
264.650791883
372.886064053
Interpretation: building huge lists takes time.
CPython 2.7 - range exchanged with xrange
231.975350142
221.832423925
Interpretation: almost equal, as expected. (Nested for loops should have slightly
larger overhead than a single for loop.)
CPython 3.6 - range
365.20924194483086
437.26447860104963
Interpretation: Interesting! I did not expect this. Anyone?

It is because you are using Python2. Range generates a list of numbers, and has to allocate that list. In the first nested loop you are allocating 1000 + 1000 + 100, so the list size is 2100, while in the other one the list has a size of 100000000, which is much bigger.
In python2 is better to use a generator, xrange(), a generator yields the numbers instead of building and allocating a list with them.
Aditionally and for further information you can read this question that it is related to this but in python3

In Python 2, range creates a list with all of the numbers within the list. Try swapping range with xrange and you should see them take comparable time or the single loop approach may work a bit faster.

during the nested loops python has to allocate 1000+1000+100=2100 values for the counters whereas in the single loop it has to allocate 10M. This is what's taking the extra time
i have tested this in python 3.6 and the behaviour is similar, i would say it's very likely this is a memory allocation issue.

Multiprocessing in Python not faster than doing it sequentially

I want to do something parallelly but it always goes slower. I put an example of two code snippets which can be compared. The multiprocessing way needs 12 seconds on my laptop. The sequential way only 3 seconds. I thought multiprocessing is faster.
I know that the task in this way does not make any sense but it is just made to compare the two ways. I know bubble sort can be replaced by faster ways.
Thanks.
Multiprocessing way:
from multiprocessing import Process, Manager
import os
import random
myArray = []
for i in range(1000):
myArray.append(random.randint(1,1000))
def getRandomSample(myset, sample_size):
sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
return([myset[i] for i in sorted_list])
def bubbleSort(iterator,alist, return_dictionary):
sample_list = (getRandomSample(alist, 100))
for passnum in range(len(sample_list)-1,0,-1):
for i in range(passnum):
if sample_list[i]>alist[i+1]:
temp = alist[i]
sample_list[i] = alist[i+1]
sample_list[i+1] = temp
return_dictionary[iterator] = sample_list
if __name__ == '__main__':
manager = Manager()
return_dictionary = manager.dict()
jobs = []
for i in range(3000):
p = Process(target=bubbleSort, args=(i,myArray,return_dictionary))
jobs.append(p)
p.start()
for proc in jobs:
proc.join()
print return_dictionary.values()
The other way:
import os
import random
myArray = []
for i in range(1000):
myArray.append(random.randint(1,1000))
def getRandomSample(myset, sample_size):
sorted_list = sorted(random.sample(xrange(len(myset)), sample_size))
return([myset[i] for i in sorted_list])
def bubbleSort(alist):
sample_list = (getRandomSample(alist, 100))
for passnum in range(len(sample_list)-1,0,-1):
for i in range(passnum):
if sample_list[i]>alist[i+1]:
temp = alist[i]
sample_list[i] = alist[i+1]
sample_list[i+1] = temp
return(sample_list)
if __name__ == '__main__':
results = []
for i in range(3000):
results.append(bubbleSort(myArray))
print results

Multiprocessing is faster if you have multiple cores and do the parallelization properly. In your example you create 3000 processes which causes enormous amount on context switching between them. Instead of that use Pool to schedule the jobs for processes:
def bubbleSort(alist):
sample_list = (getRandomSample(alist, 100))
for passnum in range(len(sample_list)-1,0,-1):
for i in range(passnum):
if sample_list[i]>alist[i+1]:
temp = alist[i]
sample_list[i] = alist[i+1]
sample_list[i+1] = temp
return(sample_list)
if __name__ == '__main__':
pool = Pool(processes=4)
for x in pool.imap_unordered(bubbleSort, (myArray for x in range(3000))):
pass
I removed all the output and did some tests on my 4 core machine. As expected the code above was about 4 times faster than your sequential example.

Multiprocessing is not just magically faster. The thing is that your computer still has to do the same amount of work. It's like if you try to do multiple tasks at once, it's not going to be faster.
In a "normal" program, doing it sequential is easier to read and write (that it is that much faster too surprises me a little). Multiprocessing is especially useful if you have to wait for another process like a web request (you can send multiple at once and don't have to wait for each) or having some sort of event loop.
My guess as to why it is faster is that Python already uses multiprocessing internally wherever it makes sense (don't quote me on that). Also with threading it has to keep track of what is where, which means more overhead.
So, if we go back to the example in the real world, if you give a task to somebody else and instead of waiting for it, you do other things at the same time as them, then you are faster.

Why is direct indexing of an array significantly faster than iteration?

Just some Python code for an example:
nums = [1,2,3]
start = timer()
for i in range(len(nums)):
print(nums[i])
end = timer()
print((end-start)) #computed to 0.0697546862831
start = timer()
print(nums[0])
print(nums[1])
print(nums[2])
end = timer()
print((end-start)) #computed to 0.0167170338524
I can grasp that some extra time will be taken in the loop because the value of i must be incremented a few times, but the difference between the running times of these two different methods seems a lot bigger than I expected. Is there something else happening underneath the hood that I'm not considering?

Short answer: it isn't, unless the loop is very small. The for loop has a small overhead, but the way you're doing it is inefficient. By using range(len(nums)) you're effectively creating another list and iterating through that, then doing the same index lookups anyway. Try this:
for i in nums:
print(i)
Results for me were as expected:
>>> import timeit
>>> timeit.timeit('nums[0];nums[1];nums[2]', setup='nums = [1,2,3]')
0.10711812973022461
>>> timeit.timeit('for i in nums:pass', setup='nums = [1,2,3]')
0.13474011421203613
>>> timeit.timeit('for i in range(len(nums)):pass', setup='nums = [1,2,3]')
0.42371487617492676
With a bigger list the advantage of the loop becomes apparent, because the incremental cost of accessing an element by index outweighs the one-off cost of the loop:
>>> timeit.timeit('for i in nums:pass', setup='nums = range(0,100)')
1.541944980621338
timeit.timeit(';'.join('nums[%s]' % i for i in range(0,100)), setup='nums = range(0,100)')
2.5244338512420654
In python 3, which puts a greater emphasis on iterators over indexable lists, the difference is even greater:
>>> timeit.timeit('for i in nums:pass', setup='nums = range(0,100)')
1.6542046590038808
>>> timeit.timeit(';'.join('nums[%s]' % i for i in range(0,100)), setup='nums = range(0,100)')
10.331634456000756

With such a small array you're probably measuring noise first, and then the overhead of calling range(). Note that range not only has to increment a variable a few times, it also creates an object that holds its state (the current value) because it's a generator. The function call and object creation are two things you don't pay for in the second example and for very short iterations they will probably dwarf three array accesses.
Essentially your second snippet does loop unrolling, which is a viable and frequent technique of speeding up performance-critical code.

The for loop have a cost in any case, and the one you write is especially costly. Here is four versions, using timeit for measure time:
from timeit import timeit
NUMS = [1, 2, 3]
def one():
for i in range(len(NUMS)):
NUMS[i]
def one_no_access():
for i in range(len(NUMS)):
i
def two():
NUMS[0]
NUMS[1]
NUMS[2]
def three():
for i in NUMS:
i
for func in (one, one_no_access, two, three):
print(func.__name__ + ':', timeit(func))
Here is the found times:
one: 1.0467438200000743
one_no_access: 0.8853238560000136
two: 0.3143197629999577
three: 0.3478466749998006
The one_no_access show the cost of the expression range(len(NUMS)).
While lists in python are stocked contiguously in memory, the random access of elements is in O(1), explaining two as the quicker.

multiprocessing full capacity in Python

I wrote the following code which call function (compute_cluster) 6 times in parallel (each run of this function is independent of the other run and each run write the results in a separate file), the following is my code:
global L
for L in range(6,24):
pool = Pool(6)
pool.map(compute_cluster,range(1,3))
pool.close()
if __name__ == "__main__":
main(sys.argv)
despite the fact that I'm running this code on a I7 processor machine, and no matter how much I set the Pool to it's always running only two processes in parallel so is there any suggestion on how can I run 6 processes in parallel? such that the first three processes use L=6 and call compute_cluster with parameter values from 1:3 in parallel and at the same time the other three processes run the same function with the same parameter values but this time the Global L value is 7 ?
any suggestions is highly appreciated

There are a few things wrong here. First, as to why you always only have 2 processes going at a time -- The reason is because range(1, 3) only returns 2 values. So you're only giving the pool 2 tasks to do before you close it.
The second issue is that you're relying on global state. In this case, the code probably works, but it's limiting your performance since it is the factor which is preventing you from using all your cores. I would parallelize the L loop rather than the "inner" range loop. Something like1:
def wrapper(tup):
l, r = tup
# Even better would be to get rid of `L` and pass it to compute_cluster
global L
L = l
compute_cluster(r)
for r in range(1, 3):
p = Pool(6)
p.map(wrapper, [(l, r) for l in range(6, 24)])
p.close()
This works with the global L because each spawned process picks up its own copy of L -- It doesn't get shared between processes.
1Untested code
As pointed out in the comments, we can even pull the Pool out of the loop:
p = Pool(6)
p.map(wrapper, [(l, r) for l in range(6, 24) for r in range(1, 3)])
p.close()

Long argument lists and performance

This is surely no python-specific question, but I am looking for a python-specific answer - if any. It is about putting code blocks with a large number of variables into functions (or alike?). Let me assume this code
##!/usr/bin/env python
# many variables: built in types, custom made objects, you name it.
# Let n be a 'substantial' number, say 47.
x1 = v1
x2 = v2
...
xn = vn
# several layers of flow control, for brevity only 2 loops
for i1 in range(ri1):
for i2 in range(ri2):
y1 = f1(i1,i2)
y2 = f2(i1,i2)
# Now, several lines of work
do_some_work
# involving HEAVY usage and FREQUENT (say several 10**3 times)
# access to all of x1,...xn, (and maybe y1,y2)
# One of the main points is that slowing down access to x1,...,xn
# will turn into a severe bottleneck for the performance of the code.
# now other things happen. These may or may not involve modification
# of x1,...xn
# some place later in the code, again, several layers of flow control,
# not necessarily identical to the first occur
for j1 in range(rj1):
y1 = g1(j1)
y2 = g2(j1)
# Now, again
do_some_work # <---- this is EXACTLY THE SAME code block as above
# a.s.o.
Obviously I would like to put 'do_some_work' into something like a function (or maybe something better?).
What would be the most performant way to do this in python
without function calls with a confusingly large numbers of arguments
without performance lossy indirection to access x1,...,xn (Say, by wrapping them into another list, class, or alike)
without using x1,...,xn as globals in a function do_some_work(...)
I have to admit, that I always find myself returning to globals.

A simple and dirty(probably not optimal) banchmark:
import timeit
def test_no_func():
(x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19) = range(20)
for i1 in xrange(100):
for i2 in xrange(100):
for i3 in xrange(100):
results = [x0+x1+x2+x3+x4+x5+x6 for _ in xrange(100)]
results.extend(x7+x8+x9+x10+x11+x12+x13+x14+x15 for _ in xrange(100))
results.extend(x16+x17+x18+x19+x0 for _ in xrange(500))
for j1 in xrange(100):
for j2 in xrange(100):
for i3 in xrange(100):
results = [x0+x1+x2+x3+x4+x5+x6 for _ in xrange(100)]
results.extend(x7+x8+x9+x10+x11+x12+x13+x14+x15 for _ in xrange(100))
results.extend(x16+x17+x18+x19+x0 for _ in xrange(500))
def your_func(x_vars):
# of the number is not too big you can simply unpack.
# 150 is a bit too much for unpacking...
(x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19) = x_vars
results = [x0+x1+x2+x3+x4+x5+x6 for _ in xrange(100)]
results.extend(x7+x8+x9+x10+x11+x12+x13+x14+x15 for _ in xrange(100))
results.extend(x16+x17+x18+x19+x0 for _ in xrange(500))
return results
def test_func():
(x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19) = range(20)
for i1 in xrange(100):
for i2 in xrange(100):
for i3 in xrange(100):
results = your_func(val for key,val in locals().copy().iteritems() if key.startswith('x'))
for j1 in xrange(100):
for j2 in xrange(100):
for i3 in xrange(100):
results = your_func(val for key,val in locals().copy().iteritems() if key.startswith('x'))
print timeit.timeit('test_no_func()', 'from __main__ import test_no_func', number=1)
print timeit.timeit('test_func()', 'from __main__ import test_func, your_func', number=1)
Result:
214.810357094
227.490054131
which is about 5% slower passing the arguments. But probably you can't do much better than this introducing 1 million function calls...

Global variables are significantly slower than local variables.
Also, it's almost always a bad idea to use lots of different variable names. Better use a single data structure, for example a dictionary:
d = {"x1": "foo", "x2": "bar", "y1": "baz"}
etc.
Then you can pass d to your functions (which is very fast since just the address of the dict will be passed, not the entire dictionary), and access its contents from there.
if d["x2"] = "eggs":
d["x1"] = "spam"

I recommend using python cProfile module. Just run your script this way:
python -m cProfile your_script.py
in different modes (with and without function wrapper) and see how fast it works. I don't think accessing the variables is a bottleneck. Usually, loops and repeated operations are.
Secondly, I suggest thinking of abstracting the function, since you use i1, i2, etc.
those many variables might need to be in a list or a dictionary, and
cycles can be abstracted with itertools:
from itertools import product
equal_sums = 0
for l in product(range(10), repeat=6): # instead of 6 nested loops over range(10)
if sum(l[:3]) == sum(l[3:]):
equal_sums += 1

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Multiprocessing and lists - python

Shared memroy data structures are meant to be shared between processes. To synchronize accesses, they need to be locked. On the other hand, a list ([]) does not require a lock. With / without locking makes a difference.

Related

Python nested for loop faster than single for loop

Multiprocessing in Python not faster than doing it sequentially

Why is direct indexing of an array significantly faster than iteration?

multiprocessing full capacity in Python

Long argument lists and performance

Categories

Resources