Long argument lists and performance - python

This is surely not a Python-specific question, but I am looking for a Python-specific answer, if there is one. It is about putting code blocks with a large number of variables into functions (or something similar). Assume this code:
#!/usr/bin/env python
# many variables: built-in types, custom-made objects, you name it.
# Let n be a 'substantial' number, say 47.
x1 = v1
x2 = v2
...
xn = vn
# several layers of flow control, for brevity only 2 loops
for i1 in range(ri1):
    for i2 in range(ri2):
        y1 = f1(i1,i2)
        y2 = f2(i1,i2)
        # Now, several lines of work
        do_some_work
        # involving HEAVY usage and FREQUENT (say several 10**3 times)
        # access to all of x1,...xn, (and maybe y1,y2)
        # One of the main points is that slowing down access to x1,...,xn
        # will turn into a severe bottleneck for the performance of the code.
# now other things happen. These may or may not involve modification
# of x1,...xn
# some place later in the code, again, several layers of flow control,
# not necessarily identical to the first occur
for j1 in range(rj1):
    y1 = g1(j1)
    y2 = g2(j1)
    # Now, again
    do_some_work # <---- this is EXACTLY THE SAME code block as above
# and so on
Obviously I would like to put 'do_some_work' into something like a function (or maybe something better?).
What would be the most performant way to do this in Python
- without function calls with a confusingly large number of arguments,
- without performance-lossy indirection to access x1,...,xn (say, by wrapping them into another list, class, or the like),
- without using x1,...,xn as globals in a function do_some_work(...)?
I have to admit that I always find myself returning to globals.

A quick and dirty (probably not optimal) benchmark:
import timeit
def test_no_func():
    (x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19) = range(20)
    for i1 in xrange(100):
        for i2 in xrange(100):
            for i3 in xrange(100):
                results = [x0+x1+x2+x3+x4+x5+x6 for _ in xrange(100)]
                results.extend(x7+x8+x9+x10+x11+x12+x13+x14+x15 for _ in xrange(100))
                results.extend(x16+x17+x18+x19+x0 for _ in xrange(500))
    for j1 in xrange(100):
        for j2 in xrange(100):
            for i3 in xrange(100):
                results = [x0+x1+x2+x3+x4+x5+x6 for _ in xrange(100)]
                results.extend(x7+x8+x9+x10+x11+x12+x13+x14+x15 for _ in xrange(100))
                results.extend(x16+x17+x18+x19+x0 for _ in xrange(500))
def your_func(x_vars):
    # if the number is not too big you can simply unpack.
    # 150 is a bit too much for unpacking...
    (x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19) = x_vars
    results = [x0+x1+x2+x3+x4+x5+x6 for _ in xrange(100)]
    results.extend(x7+x8+x9+x10+x11+x12+x13+x14+x15 for _ in xrange(100))
    results.extend(x16+x17+x18+x19+x0 for _ in xrange(500))
    return results
def test_func():
    (x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19) = range(20)
    for i1 in xrange(100):
        for i2 in xrange(100):
            for i3 in xrange(100):
                results = your_func(val for key,val in locals().copy().iteritems() if key.startswith('x'))
    for j1 in xrange(100):
        for j2 in xrange(100):
            for i3 in xrange(100):
                results = your_func(val for key,val in locals().copy().iteritems() if key.startswith('x'))
print timeit.timeit('test_no_func()', 'from __main__ import test_no_func', number=1)
print timeit.timeit('test_func()', 'from __main__ import test_func, your_func', number=1)
Result:
214.810357094
227.490054131
which means passing the arguments is about 6% slower. But you probably can't do much better than this when introducing about 2 million function calls...

Global variables are significantly slower than local variables.
Also, it's almost always a bad idea to use lots of different variable names. Better use a single data structure, for example a dictionary:
d = {"x1": "foo", "x2": "bar", "y1": "baz"}
etc.
Then you can pass d to your functions (which is very fast since just the address of the dict will be passed, not the entire dictionary), and access its contents from there.
if d["x2"] == "eggs":
    d["x1"] = "spam"
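A minimal sketch of how the two points above (locals are fast, one container is easier to pass) can be combined: pass a single dict, then bind the hot values to local names once at the top of the function. The function name `do_some_work`, the `repeat` parameter, the keys and the arithmetic are all hypothetical placeholders, not the asker's real workload.

```python
def do_some_work(params, y1, y2, repeat=1000):
    # Bind the frequently used values to local names once; inside the
    # loop they are then plain local-variable reads, not dict lookups.
    x1 = params["x1"]
    x2 = params["x2"]
    x3 = params["x3"]
    total = 0
    for _ in range(repeat):
        total += x1 * y1 + x2 * y2 + x3
    return total


# Hypothetical values, only to show the call shape.
params = {"x1": 2, "x2": 3, "x3": 5}
result = do_some_work(params, y1=10, y2=100, repeat=4)
```

The function call itself passes only one reference, and the unpacking cost is paid once per call rather than once per access. Whether this beats globals in a given program is something to measure, not assume.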

I recommend using Python's cProfile module. Just run your script this way:
python -m cProfile your_script.py
in different modes (with and without function wrapper) and see how fast it works. I don't think accessing the variables is a bottleneck. Usually, loops and repeated operations are.
Secondly, I suggest thinking about abstracting the function, since you use i1, i2, etc. That many variables probably belong in a list or a dictionary, and nested loops can be abstracted with itertools:
from itertools import product
equal_sums = 0
for l in product(range(10), repeat=6): # instead of 6 nested loops over range(10)
    if sum(l[:3]) == sum(l[3:]):
        equal_sums += 1

Related

Python : Cost of forming list with comprehension vs standalone function

What costs more in python, making a list with comprehension or with a standalone function?
It appears I failed to find previous posts asking the same question. While other answers go into detail about bytes and the internal workings of Python, and that is indeed helpful, I felt that visual graphs help to show that there is a continuous trend.
I don't yet have a good enough understanding of the low level workings of python, so those answers are a bit foreign for me to try and comprehend.
I am currently an undergrad in CS, and I am continually amazed with how powerful python is. I recently did a small experiment to test the cost of forming lists with comprehension versus a standalone function. For example:
def make_list_of_size(n):
    retList = []
    for i in range(n):
        retList.append(0)
    return retList
creates a list of size n containing zeros.
It is well known that this function is O(n). I wanted to explore the growth of the following:
def comprehension(n):
    return [0 for i in range(n)]
which makes the same list.
Let us explore!
This is the code I used for timing, and note the order of function calls (which way did I make the list first). I made the list with a standalone function first, and then with comprehension. I have yet to learn how to turn off garbage collection for this experiment, so, there is some inherent measurement error, brought about when garbage collection kicks in.
'''
file: listComp.py
purpose: to test the cost of making a list with comprehension
         versus a standalone function
'''
import time as T
def get_overhead(n):
    tic = T.time()
    for i in range(n):
        pass
    toc = T.time()
    return toc - tic
def make_list_of_size(n):
    aList = [] #<-- O(1)
    for i in range(n): #<-- O(n)
        aList.append(n) #<-- O(1)
    return aList #<-- O(1)
def comprehension(n):
    return [n for i in range(n)] #<-- O(?)
def do_test(size_i,size_f,niter,file):
    delta = 100
    size = size_i
    while size <= size_f:
        overhead = get_overhead(niter)
        reg_tic = T.time()
        for i in range(niter):
            reg_list = make_list_of_size(size)
        reg_toc = T.time()
        comp_tic = T.time()
        for i in range(niter):
            comp_list = comprehension(size)
        comp_toc = T.time()
        #--------------------
        reg_cost_per_iter = (reg_toc - reg_tic - overhead)/niter
        comp_cost_per_iter = (comp_toc - comp_tic - overhead)/niter
        file.write(str(size)+","+str(reg_cost_per_iter)+
                   ","+str(comp_cost_per_iter)+"\n")
        print("SIZE: "+str(size)+ " REG_COST = "+str(reg_cost_per_iter)+
              " COMP_COST = "+str(comp_cost_per_iter))
        if size == 10*delta:
            delta *= 10
        size += delta
def main():
    fname = input()
    file = open(fname,'w')
    do_test(100,1000000,2500,file)
    file.close()
main()
I did three tests. Two of them went up to list size 100,000; the third went up to 1,000,000.
[Plots: overlay of both timing curves, no zoom]
I found these results to be intriguing. Although both methods have a big-O notation of O(n), the cost, with respect to time, is less for comprehension for making the same list.
I have more information to share, including the same test done with the list made with comprehension first, and then with the standalone function.
I have yet to run a test without garbage collection.
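On the garbage-collection concern: the standard `timeit` module temporarily turns off garbage collection while timing by default, so it is one way to repeat this experiment without that source of noise. The following is a minimal sketch under that assumption; `best_time` and its `number`/`repeat` defaults are hypothetical choices, not part of the original experiment.

```python
import timeit


def make_list_of_size(n):
    ret = []
    for _ in range(n):
        ret.append(0)
    return ret


def comprehension(n):
    return [0 for _ in range(n)]


def best_time(func, size, number=50, repeat=3):
    # timeit disables the garbage collector during each timing run;
    # taking the minimum over several repeats reduces scheduling noise.
    timer = timeit.Timer(lambda: func(size))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for size in (100, 1000, 10000):
        print(size,
              best_time(make_list_of_size, size),
              best_time(comprehension, size))
```

The printed numbers are seconds per call and will vary by machine and interpreter version, so no expected output is given here.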

Accelerating for loops with list methods in Python

I have for loops in Python that iterate nearly 2.5 million times, and it takes a long time to get a result. In JS I can do this in about 1 second, but Python takes 6 seconds on my computer. I must use Python in this case. Here is the code:
for i in xrange(k,p,2):
    arx = process[i]
    ary = process[i+1]
    for j in xrange(7,-1,-1):
        nx = arx + dirf[j]
        ny = ary + dirg[j]
        ind = ny*w+nx
        if data[ind] == e[j]:
            process[c]=nx
            c=c+1
            process[c]=ny
            c=c+1
            matrix[ind]=1
Here are some of the lists from the code:
process = [None] * (11000*4) # its items are replaced with integers after this assignment
dirf = [-1,0,1,-1,1,-1,0,1]
dirg = [1,1,1,0,0,-1,-1,-1]
e = [9, 8, 7, 6, 4, 3, 2, 1]
The data list consists of the 'r' channel values of the pixels of an RGBA image.
data = imgobj.getdata(0)
How can I speed this up? What am I doing wrong? Are there other approaches to for loops? Thanks.
Here are a few suggestions for improving your code:
That inner xrange is being used a lot: what if you made it a regular list and did something like this:
inner = range(7,-1,-1) # make the actual list once (in Python 2, range returns a list)
for i in xrange(k,p,2): # first for
    # stuff
    for j in inner: # reference an actual list instead of creating a new xrange each time
        pass # inner loop body
Evidence:
n = range(7,-1,-1)
def one():
    z = 0
    for k in xrange(100):
        for i in n:
            z+=1
def two():
    z = 0
    for k in xrange(100):
        for i in xrange(7,-1,-1):
            z+=1
if __name__ == '__main__':
    import timeit
    print("one:")
    print(timeit.timeit("one()",number=1000000 ,setup="from __main__ import one"))
    print("two:")
    print(timeit.timeit("two()",number=1000000 ,setup="from __main__ import two"))
Result:
one:
37.798637867
two:
63.5098838806
If the code I wrote is comparable, it appears to indicate that referencing the prebuilt inner list, instead of creating a new xrange each time, really speeds things up.
[edit] Referencing a local variable is faster than accessing a global one, so if this is correct, place the list definition as close to the loop as possible without rebuilding it on every iteration.
You are also changing process twice. If that's not needed, just choose one.
As you mentioned in the comments, you are working with images. I am not sure whether the following is relevant, but perhaps you could use OpenCV, which has a Python API on top of C code. That might speed it up. As others have mentioned, numpy and your own Cython extensions will speed this up considerably.
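Staying in pure Python, one further micro-optimization in the same spirit is to precompute the per-direction work outside the loops. Since ind = (ary + dirg[j])*w + (arx + dirf[j]), the flat-index offset dirg[j]*w + dirf[j] is constant for each direction. The sketch below assumes the loop in the question is indented as shown there; the function name `expand` and the `steps` table are hypothetical, it is written for Python 3 (`range` instead of `xrange`), and it has not been timed against the original.

```python
DIRF = [-1, 0, 1, -1, 1, -1, 0, 1]
DIRG = [1, 1, 1, 0, 0, -1, -1, -1]
E = [9, 8, 7, 6, 4, 3, 2, 1]


def expand(process, k, p, c, data, matrix, w):
    # One tuple per direction: (dx, dy, flat-index offset, expected value),
    # reversed so the directions are visited in the original j = 7..0 order.
    steps = [(dx, dy, dy * w + dx, ev)
             for dx, dy, ev in zip(DIRF, DIRG, E)][::-1]
    for i in range(k, p, 2):
        arx = process[i]
        ary = process[i + 1]
        base = ary * w + arx  # flat index of the current pixel
        for dx, dy, off, ev in steps:
            ind = base + off
            if data[ind] == ev:
                process[c] = arx + dx
                process[c + 1] = ary + dy
                c += 2
                matrix[ind] = 1
    return c  # new write position in process
```

Like the original loop, this does no bounds checking, so pixels on the image border behave exactly as they did before.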

Multiprocessing and lists

I have been trying to optimise my code using the multiprocessing module, but I think I have fallen for the trap of premature optimization.
For example, when running this code:
import multiprocessing as mp
from collections import Counter
num = 1000000
l = mp.Manager().list()
for i in range(num):
    l.append(i)
l_ = Counter(l)
It takes several times longer than this:
num = 1000000
l = []
for i in range(num):
    l.append(i)
l_ = Counter(l)
What is the reason the multiprocessing list is slower than regular lists? And are there ways to make them as efficient?
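Shared-memory data structures are meant to be shared between processes. To synchronize access, they need to be locked. A plain list ([]), on the other hand, does not require a lock. In addition, mp.Manager().list() returns a proxy: the real list lives in a separate manager process, so every append is forwarded to that process.
Locking and inter-process communication are what make the difference.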
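If the per-call overhead is what dominates, one way to reduce it is to make fewer, larger calls on the shared list. A minimal sketch, assuming the target object supports `extend` (a plain list does, and so does the list proxy returned by a manager); `fill_in_batches` and `batch_size` are hypothetical names and the batch size is an arbitrary choice.

```python
def fill_in_batches(shared, items, batch_size=10000):
    # Collect items in an ordinary local list and hand them over in
    # chunks, so the shared object sees one call per batch, not per item.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            shared.extend(batch)
            batch = []
    if batch:  # flush the remainder
        shared.extend(batch)
    return shared
```

With `l = mp.Manager().list()`, calling `fill_in_batches(l, range(num))` would issue roughly num / batch_size calls instead of num. This has not been benchmarked here, and it will still not match a plain list.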

Python: Fast mapping and lookup between two lists

I'm currently working on a high-performance Python 2.7 project that uses lists tens of thousands of elements in size. Obviously, every operation must be performed as fast as possible.
So, I have two lists: One of them is a list of unique arbitrary numbers, let's call it A, and the other one is a linear list starting with 1 and with the same length as the first list, named B, which represents indices in A (starting with 1)
Something like enumerate, starting with 1.
For example:
A = [500, 300, 400, 200, 100] # The order here is arbitrary, they can be any integers, but every integer can only exist once
B = [ 1, 2, 3, 4, 5] # This is fixed, starting from 1, with exactly as many elements as A
If I have an element of B (called e_B) and want the corresponding element in A, I can simply do correspond_e_A = A[e_B - 1]. No problem.
But now I have a huge list of random, non-unique integers, and I want to know the indices of the integers that are in A, and what the corresponding elements in B are.
I think I have a reasonable solution for the first question:
indices_of_existing = numpy.nonzero(numpy.in1d(random_list, A))[0]
What is great about this approach is that there is no need to map() single operations, numpy's in1d just returns a list like [True, True, False, True, ...]. Using nonzero() I can get the indices of the elements in random_list that exist in A. Perfect, I think.
But for the second question, I'm stumped.
I tried something like:
corresponding_e_B = map(lambda x: numpy.where(A==x)[0][0] + 1, random_list)
This correctly gives me the indices, but it's not optimal, because firstly I need a map(), secondly I need a lambda, and finally numpy.where() does not stop after the item was found once (remember, A has only unique elements), meaning that it scales horribly with huge datasets like mine.
I took a look at bisect, but it seems bisect only works with single requests, not with lists, meaning that I'd still have to use map() and build my list elementwise (that's slow, isn't it?)
Since I'm quite new to Python, I was hoping anyone here might have an idea? Maybe a library I don't know yet?
I think you would be well advised to use a hash table for your lookups instead of numpy.in1d, which uses an O(n log n) merge sort as a preprocessing step.
>>> A = [500, 300, 400, 200, 100]
>>> index = { k:i for i,k in enumerate(A, 1) }
>>> random_list = [200, 100, 50]
>>> [i for i,x in enumerate(random_list) if x in index]
[0, 1]
>>> map(index.get, random_list)
[4, 5, None]
>>> filter(None, map(index.get, random_list))
[4, 5]
This is Python 2. In Python 3, map and filter return lazy iterators, so you would need to wrap the filter call in list() if you want the result as a list.
I have tried to use builtin functions as much as possible to push the computational burden to the C side (assuming you use CPython). All the names are resolved upfront, so it should be pretty fast.
In general, for maximum performance, you might want to consider using PyPy, a great alternative Python implementation with JIT compilation.
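For readers on Python 3, the same dictionary-based lookup can be sketched with list comprehensions, which return lists directly. The helper names `build_index` and `lookup` are hypothetical, and the sample data extends the interpreter session above with one repeated value.

```python
def build_index(A):
    # Map each value of A to its 1-based position, i.e. its element of B.
    return {value: pos for pos, value in enumerate(A, 1)}


def lookup(index, random_list):
    # Positions in random_list whose value occurs in A ...
    positions = [i for i, x in enumerate(random_list) if x in index]
    # ... and the corresponding elements of B, in the same order.
    b_values = [index[x] for x in random_list if x in index]
    return positions, b_values


index = build_index([500, 300, 400, 200, 100])
positions, b_values = lookup(index, [200, 100, 50, 200])
# positions == [0, 1, 3], b_values == [4, 5, 4]
```

Building the dictionary is a one-time O(len(A)) cost; each membership test and lookup afterwards is a hash lookup rather than a scan of A.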
A benchmark to compare multiple approaches is never a bad idea:
import sys
is_pypy = '__pypy__' in sys.builtin_module_names
import timeit
import random
if not is_pypy:
    import numpy
import bisect
n = 10000
m = 10000
q = 100
A = set()
while len(A) < n: A.add(random.randint(0,2*n))
A = list(A)
queries = set()
while len(queries) < m: queries.add(random.randint(0,2*n))
queries = list(queries)
# these two solve question one (find indices of queries that exist in A)
if not is_pypy:
    def fun11():
        for _ in range(q):
            numpy.nonzero(numpy.in1d(queries, A))[0]
def fun12():
    index = set(A)
    for _ in range(q):
        [i for i,x in enumerate(queries) if x in index]
# these three solve question two (find according entries of B)
def fun21():
    index = { k:i for i,k in enumerate(A, 1) }
    for _ in range(q):
        [index[i] for i in queries if i in index]
def fun22():
    index = { k:i for i,k in enumerate(A, 1) }
    for _ in range(q):
        list(filter(None, map(index.get, queries)))
def findit(keys, values, key):
    # bisect_left returns the position of an existing key; plain bisect
    # (bisect_right) would point one past it and never match
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    return values[i]
def fun23():
    keys, values = zip(*sorted((k,i) for i,k in enumerate(A,1)))
    for _ in range(q):
        list(filter(None, [findit(keys, values, x) for x in queries]))
if not is_pypy:
    # note this does not filter out nonexisting elements
    def fun24():
        I = numpy.argsort(A)
        ss = numpy.searchsorted
        maxi = len(I)
        for _ in range(q):
            a = ss(A, queries, sorter=I)
            I[a[a<maxi]]
tests = ("fun12", "fun21", "fun22", "fun23")
if not is_pypy: tests = ("fun11",) + tests + ("fun24",)
if is_pypy:
    # warmup
    for f in tests:
        timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=20)
# actual timing
for f in tests:
    print("%s: %.3f" % (f, timeit.timeit("%s()" % f, setup = "from __main__ import %s" % f, number=3)))
Results:
$ python2 -V
Python 2.7.6
$ python3 -V
Python 3.3.3
$ pypy -V
Python 2.7.3 (87aa9de10f9ca71da9ab4a3d53e0ba176b67d086, Dec 04 2013, 12:50:47)
[PyPy 2.2.1 with GCC 4.8.2]
$ python2 test.py
fun11: 1.016
fun12: 0.349
fun21: 0.302
fun22: 0.276
fun23: 2.432
fun24: 0.897
$ python3 test.py
fun11: 0.973
fun12: 0.382
fun21: 0.423
fun22: 0.341
fun23: 3.650
fun24: 0.894
$ pypy ~/tmp/test.py
fun12: 0.087
fun21: 0.073
fun22: 0.128
fun23: 1.131
You can tweak n (size of A), m (size of random_list) and q (number of times the whole lookup is repeated) to your scenario. To my surprise, my attempt to be clever and use builtin functions instead of list comprehensions has not paid off, since fun22 is not a lot faster than fun21 (only ~10% in Python 2 and ~25% in Python 3, but almost 75% slower in PyPy). A case of premature optimization here. I guess the difference is due to the fact that fun22 builds up an unnecessary temporary list per query in Python 2. We also see that binary search is pretty bad (look at fun23).
import numpy as np
def numpy_optimized(index, values):
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    return I[Q]
This calculates the same thing as the OP's code, but with the indices in matching order to the values queried, which I imagine is an improvement in functionality. It is up to twice as fast as the OP's solution on my machine, which puts it slightly ahead of the non-PyPy solutions, if I interpret your benchmarks correctly.
Or, in case we cannot assume every entry of index is present in values and would like missing queries to be silently dropped:
def numpy_optimized_filtered(index, values):
    values = np.asarray(values)
    index = np.asarray(index)
    I = np.argsort(values)
    Q = np.searchsorted(values, index, sorter=I)
    Q[Q == len(values)] = 0 # queries above max(values) would otherwise index past the end of I
    Z = I[Q]
    return Z[values[Z]==index]

Python append performance

I'm having some performance problems with 'append' in Python.
I'm writing an algorithm that checks if there are two overlapping circles in a (large) set of circles.
I start by putting the extreme points of the circles (x_i-R_i & x_i+R_i) in a list and then sorting the list.
class Circle:
    def __init__(self, middle, radius):
        self.m = middle
        self.r = radius
In between I generate N random circles and put them in the 'circles' list.
"""
Makes a list with all the extreme points of the circles.
Format = [Extreme, left/right ~ 0/1 extreme, index]
Separate function for performance reasons, Python handles local variables faster.
Garbage collection is temporarily disabled since a bug in Python makes list.append run in O(n) time instead of O(1)
"""
def makeList():
    """gc.disable()"""
    list = []
    append = list.append
    for circle in circles:
        append([circle.m[0]-circle.r, 0, circles.index(circle)])
        append([circle.m[0] + circle.r, 1, circles.index(circle)])
    """gc.enable()"""
    return list
When running this with 50k circles it takes over 75 seconds to generate the list. As you might see in the comments I wrote I disabled garbage collect, put it in a separate function, used
append = list.append
append(foo)
instead of just
list.append(foo)
I disabled gc because, after some searching, it seems there is a bug in Python that causes append to run in O(n) instead of O(1) time.
So is this way the fastest way or is there a way to make this run faster?
Any help is greatly appreciated.
Instead of
for circle in circles:
    ... circles.index(circle) ...
use
for i, circle in enumerate(circles):
    ... i ...
This could decrease your O(n^2) to O(n).
Your whole makeList could be written as:
sum([[[circle.m[0]-circle.r, 0, i], [circle.m[0]+circle.r, 1, i]] for i, circle in enumerate(circles)], [])
Be aware, though, that sum(..., []) copies the accumulated list at every step, so this one-liner is itself quadratic for large inputs.
Your performance problem is not in the append() method, but in your use of circles.index(), which makes the whole thing O(n^2).
A further (comparatively minor) improvement is to use a list comprehension instead of list.append():
mylist = [[circle.m[0] - circle.r, 0, i]
          for i, circle in enumerate(circles)]
mylist += [[circle.m[0] + circle.r, 1, i]
           for i, circle in enumerate(circles)]
Note that this will give the data in a different order (which should not matter as you are planning to sort it anyway).
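If the original order (left extreme, then right extreme, per circle) should be kept, a linear-time variant can be sketched with enumerate and itertools.chain.from_iterable. The function name `make_list` and the two sample circles are hypothetical; the Circle class mirrors the one in the question so that the block runs on its own.

```python
from itertools import chain


class Circle:
    def __init__(self, middle, radius):
        self.m = middle
        self.r = radius


def make_list(circles):
    # enumerate supplies the index directly, so there is no O(n)
    # circles.index() search per circle; chain.from_iterable flattens
    # the (left, right) pairs without repeatedly copying a growing list.
    return list(chain.from_iterable(
        ([c.m[0] - c.r, 0, i], [c.m[0] + c.r, 1, i])
        for i, c in enumerate(circles)
    ))


circles = [Circle((0.0, 0.0), 1.0), Circle((5.0, 2.0), 2.0)]
points = sorted(make_list(circles))
```

Each entry has the same [extreme, 0/1, index] format as in the question, so the subsequent sort works unchanged.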
I've just run several tests to improve the speed of append-based list building. They should be helpful for you.
The variants tested:
Using plain Python
Using list(map(lambda ...)), which is said to be a bit faster than for + append
Using Cython
Using Numba (jit)
CODE CONTENT: take the numbers 0 to 9999999, square them, and put them into a new list using append.
Using Python
import timeit
st1 = timeit.default_timer()
def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append( i**2 )
    return result
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.7 s
Using list(map(lambda
import timeit
st1 = timeit.default_timer()
result = list(map(lambda x : x**2 , range(0,10000000) ))
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.6 s
Using Cython
The code in a .pyx file:
def f1():
    cdef double i # cdef (not cpdef) declares a typed local variable
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append( i**2 )
    return result
I compiled it and ran it from a .py file.
In the .py file:
import timeit
from c1 import *
st1 = timeit.default_timer()
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 1.6 s
Using Numba - jit
import timeit
from numba import jit
st1 = timeit.default_timer()
@jit(nopython=True, cache=True)
def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append( i**2 )
    return result
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 0.57 s
CONCLUSION:
As you mentioned above, switching to the simple append form boosted the speed a bit, and Cython is much faster than plain Python. However, Numba turned out to be the best choice in terms of speed improvement for for + append!
Try using deque from the collections module to append a large number of rows without performance degrading. Then convert the deque back to a DataFrame using list comprehensions.
Sample case:
import pandas as pd
from collections import deque
d = deque()
for row in rows:
    d.append([value_x, value_y])
df = pd.DataFrame({'column_x':[item[0] for item in d],'column_y':[item[1] for item in d]})
This is a real time-saver.
If performance were an issue, I would avoid using append. Instead, preallocate a list and then fill it in. I would also avoid using index to find the position within the list "circles". Here's a rewrite. It's not compact, but I'll bet it's fast because of the unrolled loop. Note that it produces one flat list with six entries per circle rather than a list of triples.
def makeList():
    """gc.disable()"""
    mylist = 6*len(circles)*[None]
    for i in range(len(circles)):
        j = 6*i
        mylist[j] = circles[i].m[0]-circles[i].r
        mylist[j+1] = 0
        mylist[j+2] = i
        mylist[j+3] = circles[i].m[0] + circles[i].r
        mylist[j+4] = 1
        mylist[j+5] = i
    return mylist
