I'm having some performance problems with 'append' in Python.
I'm writing an algorithm that checks if there are two overlapping circles in a (large) set of circles.
I start by putting the extreme points of the circles (x_i-R_i & x_i+R_i) in a list and then sorting the list.
class Circle:
    def __init__(self, middle, radius):
        self.m = middle
        self.r = radius
In between I generate N random circles and put them in the 'circles' list.
"""
Makes a list with all the extreme points of the circles.
Format = [Extreme, left/right ~ 0/1 extreme, index]
Separate function for performance reasons; Python handles local variables faster.
Garbage collection is temporarily disabled since a bug in Python makes list.append run in O(n) time instead of O(1).
"""
def makeList():
    """gc.disable()"""
    list = []
    append = list.append
    for circle in circles:
        append([circle.m[0] - circle.r, 0, circles.index(circle)])
        append([circle.m[0] + circle.r, 1, circles.index(circle)])
    """gc.enable()"""
    return list
When running this with 50k circles it takes over 75 seconds to generate the list. As you might see in the comments, I disabled garbage collection, put the code in a separate function, and used
append = list.append
append(foo)
instead of just
list.append(foo)
I disabled gc because, after some searching, it seems there's a bug in Python causing append to run in O(n) instead of O(1) time.
So is this the fastest way, or is there a way to make this run faster?
Any help is greatly appreciated.
Instead of
for circle in circles:
    ... circles.index(circle) ...
use
for i, circle in enumerate(circles):
    ... i ...
This could decrease your O(n^2) to O(n).
Your whole makeList could be written as:
sum([[[circle.m[0]-circle.r, 0, i], [circle.m[0]+circle.r, 1, i]] for i, circle in enumerate(circles)], [])
Your performance problem is not in the append() method, but in your use of circles.index(), which makes the whole thing O(n^2).
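For completeness, here is a sketch of makeList using enumerate instead of circles.index() (same output format as the question's version; not benchmarked here):
def makeList():
    result = []
    append = result.append
    for i, circle in enumerate(circles):
        append([circle.m[0] - circle.r, 0, i])
        append([circle.m[0] + circle.r, 1, i])
    return result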
A further (comparatively minor) improvement is to use a list comprehension instead of list.append():
mylist = [[circle.m[0] - circle.r, 0, i]
          for i, circle in enumerate(circles)]
mylist += [[circle.m[0] + circle.r, 1, i]
           for i, circle in enumerate(circles)]
Note that this will give the data in a different order (which should not matter as you are planning to sort it anyway).
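For what it's worth, the two comprehensions could also be fused into a single pass over circles, which keeps the original per-circle ordering; this is just a sketch:
mylist = [point
          for i, circle in enumerate(circles)
          for point in ([circle.m[0] - circle.r, 0, i],
                        [circle.m[0] + circle.r, 1, i])]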
I've just run several tests to improve the speed of "append". They should definitely be helpful for you.
Using Python
Using list(map(lambda ...)), known to be a bit faster than for + append
Using Cython
Using Numba - jit
CODE CONTENT: take the numbers 0 to 9999999, square them, and put them into a new list using append.
Using Python
import timeit
st1 = timeit.default_timer()
def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.7 s
Using list(map(lambda ...))
import timeit
st1 = timeit.default_timer()
result = list(map(lambda x : x**2 , range(0,10000000) ))
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 3.6 s
Using Cython
the code in a .pyx file
def f1():
    cdef double i  # cpdef is for functions; a typed local variable needs cdef
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result
I then compiled it and called it from a .py file.
In the .py file:
import timeit
from c1 import *
st1 = timeit.default_timer()
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 1.6 s
Using Numba - jit
import timeit
from numba import jit
st1 = timeit.default_timer()
@jit(nopython=True, cache=True)
def f1():
    a = range(0, 10000000)
    result = []
    append = result.append
    for i in a:
        append(i**2)
    return result
f1()
st2 = timeit.default_timer()
print("RUN TIME : {0}".format(st2-st1))
RUN TIME : 0.57 s
CONCLUSION:
As you mentioned above, changing the simple append form boosts the speed a bit, and Cython is much faster than plain Python. However, it turns out that Numba is the best choice for speeding up 'for + append'!
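As a side note, the snippets above time whole script sections with default_timer. Here is a sketch of the list-comprehension equivalent of the same workload, driven by timeit.timeit so only the call itself is measured (numbers will of course vary by machine):
import timeit

def f_comp():
    return [i ** 2 for i in range(10000000)]

# timeit accepts a callable; number=1 since one run already takes seconds.
print("RUN TIME : {0:.2f} s".format(timeit.timeit(f_comp, number=1)))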
Try using deque from the collections package to append large numbers of rows without the performance degrading, and then convert the deque back to a DataFrame with a list comprehension.
Sample Case:
import pandas as pd            # pd is used below
from collections import deque

d = deque()
for row in rows:
    d.append([value_x, value_y])
df = pd.DataFrame({'column_x': [item[0] for item in d],
                   'column_y': [item[1] for item in d]})
This is a real time-saver.
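A possibly simpler variant of the last step, since pd.DataFrame accepts a list of rows directly (a sketch with made-up stand-in rows):
import pandas as pd
from collections import deque

d = deque()
for x, y in [(1, 2), (3, 4)]:        # stand-in for the real rows
    d.append([x, y])

# list(d) turns the deque into a list of rows; DataFrame takes it as-is.
df = pd.DataFrame(list(d), columns=['column_x', 'column_y'])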
If performance were an issue, I would avoid using append. Instead, preallocate an array and then fill it up. I would also avoid using index to find position within the list "circles". Here's a rewrite. It's not compact, but I'll bet it's fast because of the unrolled loop.
def makeList():
    """gc.disable()"""
    mylist = 6*len(circles)*[None]
    for i in range(len(circles)):
        j = 6*i
        mylist[j] = circles[i].m[0] - circles[i].r
        mylist[j+1] = 0
        mylist[j+2] = i
        mylist[j+3] = circles[i].m[0] + circles[i].r
        mylist[j+4] = 1
        mylist[j+5] = i
    return mylist
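If the nested [extreme, side, index] format from the question should be preserved, the same preallocation idea could look like this (a sketch, not benchmarked here):
def makeList():
    mylist = [None] * (2 * len(circles))
    for i, circle in enumerate(circles):
        x, r = circle.m[0], circle.r
        mylist[2*i] = [x - r, 0, i]
        mylist[2*i + 1] = [x + r, 1, i]
    return mylist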
Related
What costs more in python, making a list with comprehension or with a standalone function?
It appears I failed to find previous posts asking the same question. While other answers go into detail about bytes and the internal workings of Python, and that is indeed helpful, I felt like the visual graphs help to show that there is a continuous trend.
I don't yet have a good enough understanding of the low level workings of python, so those answers are a bit foreign for me to try and comprehend.
I am currently an undergrad in CS, and I am continually amazed with how powerful python is. I recently did a small experiment to test the cost of forming lists with comprehension versus a standalone function. For example:
def make_list_of_size(n):
    retList = []
    for i in range(n):
        retList.append(0)
    return retList
creates a list of size n containing zeros.
It is well known that this function is O(n). I wanted to explore the growth of the following:
def comprehension(n):
    return [0 for i in range(n)]
Which makes the same list.
let us explore!
This is the code I used for timing; note the order of the function calls (which way I made the list first): I made the list with the standalone function first, and then with the comprehension. I have yet to learn how to turn off garbage collection for this experiment, so there is some inherent measurement error, introduced when garbage collection kicks in.
'''
file: listComp.py
purpose: to test the cost of making a list with comprehension
         versus a standalone function
'''
import time as T

def get_overhead(n):
    tic = T.time()
    for i in range(n):
        pass
    toc = T.time()
    return toc - tic

def make_list_of_size(n):
    aList = []                    #<-- O(1)
    for i in range(n):            #<-- O(n)
        aList.append(n)           #<-- O(1)
    return aList                  #<-- O(1)

def comprehension(n):
    return [n for i in range(n)]  #<-- O(?)

def do_test(size_i, size_f, niter, file):
    delta = 100
    size = size_i
    while size <= size_f:
        overhead = get_overhead(niter)

        reg_tic = T.time()
        for i in range(niter):
            reg_list = make_list_of_size(size)
        reg_toc = T.time()

        comp_tic = T.time()
        for i in range(niter):
            comp_list = comprehension(size)
        comp_toc = T.time()

        #--------------------
        reg_cost_per_iter = (reg_toc - reg_tic - overhead)/niter
        comp_cost_per_iter = (comp_toc - comp_tic - overhead)/niter
        file.write(str(size) + "," + str(reg_cost_per_iter) +
                   "," + str(comp_cost_per_iter) + "\n")
        print("SIZE: " + str(size) + " REG_COST = " + str(reg_cost_per_iter) +
              " COMP_COST = " + str(comp_cost_per_iter))

        if size == 10*delta:
            delta *= 10
        size += delta

def main():
    fname = input()
    file = open(fname, 'w')
    do_test(100, 1000000, 2500, file)
    file.close()

main()
I did three tests. Two of them were up to list size 100000, the third was up to 1*10^6
[Plot: overlay of standalone-function vs. comprehension cost per iteration, no zoom]
I found these results intriguing. Although both methods are O(n), the comprehension takes less time to build the same list.
I have more information to share, including the same test done with the list made with comprehension first, and then with the standalone function.
I have yet to run a test without garbage collection.
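One way to see where the difference comes from, without digging into the internals, is to compare the bytecode of the two approaches with the dis module (a sketch; the exact output varies between Python versions):
import dis

def with_append(n):
    out = []
    for i in range(n):
        out.append(0)
    return out

def with_comprehension(n):
    return [0 for i in range(n)]

dis.dis(with_append)          # looks up and calls the append method on every pass
dis.dis(with_comprehension)   # the comprehension body uses the LIST_APPEND opcode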
Just some Python code for an example:
from timeit import default_timer as timer   # assuming timer() was timeit's default_timer

nums = [1, 2, 3]

start = timer()
for i in range(len(nums)):
    print(nums[i])
end = timer()
print((end - start))  # computed to 0.0697546862831

start = timer()
print(nums[0])
print(nums[1])
print(nums[2])
end = timer()
print((end - start))  # computed to 0.0167170338524
I can grasp that some extra time will be taken in the loop because the value of i must be incremented a few times, but the difference between the running times of these two different methods seems a lot bigger than I expected. Is there something else happening underneath the hood that I'm not considering?
Short answer: it isn't, unless the loop is very small. The for loop has a small overhead, but the way you're doing it is inefficient. By using range(len(nums)) you're effectively creating another list and iterating through that, then doing the same index lookups anyway. Try this:
for i in nums:
    print(i)
Results for me were as expected:
>>> import timeit
>>> timeit.timeit('nums[0];nums[1];nums[2]', setup='nums = [1,2,3]')
0.10711812973022461
>>> timeit.timeit('for i in nums:pass', setup='nums = [1,2,3]')
0.13474011421203613
>>> timeit.timeit('for i in range(len(nums)):pass', setup='nums = [1,2,3]')
0.42371487617492676
With a bigger list the advantage of the loop becomes apparent, because the incremental cost of accessing an element by index outweighs the one-off cost of the loop:
>>> timeit.timeit('for i in nums:pass', setup='nums = range(0,100)')
1.541944980621338
>>> timeit.timeit(';'.join('nums[%s]' % i for i in range(0,100)), setup='nums = range(0,100)')
2.5244338512420654
In Python 3, which puts a greater emphasis on iterators over indexable lists, the difference is even greater:
>>> timeit.timeit('for i in nums:pass', setup='nums = range(0,100)')
1.6542046590038808
>>> timeit.timeit(';'.join('nums[%s]' % i for i in range(0,100)), setup='nums = range(0,100)')
10.331634456000756
With such a small array you're probably measuring noise first, and then the overhead of calling range(). Note that range not only has to increment a variable a few times, it also creates an object that holds its state (the current value), because it produces its values lazily. The function call and object creation are two things you don't pay for in the second example, and for very short iterations they will probably dwarf three array accesses.
Essentially your second snippet does loop unrolling, which is a viable and frequent technique of speeding up performance-critical code.
The for loop has a cost in any case, and the one you wrote is especially costly. Here are four versions, using timeit to measure the time:
from timeit import timeit

NUMS = [1, 2, 3]

def one():
    for i in range(len(NUMS)):
        NUMS[i]

def one_no_access():
    for i in range(len(NUMS)):
        i

def two():
    NUMS[0]
    NUMS[1]
    NUMS[2]

def three():
    for i in NUMS:
        i

for func in (one, one_no_access, two, three):
    print(func.__name__ + ':', timeit(func))
Here are the measured times:
one: 1.0467438200000743
one_no_access: 0.8853238560000136
two: 0.3143197629999577
three: 0.3478466749998006
one_no_access shows the cost of the expression range(len(NUMS)).
Since lists in Python are stored contiguously in memory, random access to an element is O(1), which explains why two is the quickest.
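A quick check of the constant-time claim is to time indexing at both ends of a large list; the position should not matter (a sketch, Python 3.5+ for the globals argument):
import timeit

big = list(range(10 ** 6))
print(timeit.timeit('big[10]', globals={'big': big}))
print(timeit.timeit('big[999999]', globals={'big': big}))
# Both take roughly the same time, consistent with contiguous storage.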
I have a for loop in Python that iterates nearly 2.5 million times, and it takes a long time to get a result. In JS I can make this happen in about 1 second, but Python takes 6 seconds on my computer. I must use Python in this case. Here is the code:
for i in xrange(k, p, 2):
    arx = process[i]
    ary = process[i+1]
    for j in xrange(7, -1, -1):
        nx = arx + dirf[j]
        ny = ary + dirg[j]
        ind = ny*w + nx
        if data[ind] == e[j]:
            process[c] = nx
            c = c + 1
            process[c] = ny
            c = c + 1
            matrix[ind] = 1
Here is some lists from code:
process = [None] * (11000*4); its items will be replaced with integers after assignment.
dirf = [-1,0,1,-1,1,-1,0,1]
dirg = [1,1,1,0,0,-1,-1,-1]
e = [9, 8, 7, 6, 4, 3, 2, 1]
the data list consists of the 'r' values from the pixels of an RGBA image.
data = imgobj.getdata(0)
How can I speed this up? What am I doing wrong? Are there any other approaches to for loops? Thanks.
Here are a few suggestions for improving your code:
That inner xrange is being used a lot: what if you made that a regular list and just did something like this:
inner = range(7, -1, -1)      # build the actual list once
for i in xrange(k, p, 2):     # your outer loop
    # ...
    for j in inner:           # reference the prebuilt list, not a fresh xrange
        # ...
Evidence :
n = range(7, -1, -1)

def one():
    z = 0
    for k in xrange(100):
        for i in n:
            z += 1

def two():
    z = 0
    for k in xrange(100):
        for i in xrange(7, -1, -1):
            z += 1

if __name__ == '__main__':
    import timeit
    print("one:")
    print(timeit.timeit("one()", number=1000000, setup="from __main__ import one"))
    print("two:")
    print(timeit.timeit("two()", number=1000000, setup="from __main__ import two"))
"result"
one:
37.798637867
two:
63.5098838806
If the code I wrote is comparable, it would appear to indicate that referencing the inner list and not generating really speeds it up.
[edit] Referencing a local variable is faster than accessing a global, so if this is correct, place the list definition as close to the loop as possible without regenerating it every time.
You are also changing process twice. If it's not needed, just choose one.
As you mentioned in the comments, you're working with images. I am not sure if the following is relevant, but perhaps you could use OpenCV, which has a Python API to C code. That might speed it up. As others have mentioned, numpy and your own Cython extensions will speed this up considerably.
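As a rough illustration of the NumPy route, the 8-neighbour check can be done for a whole batch of points at once. This is only a sketch under the assumption that data and matrix are 1-D NumPy arrays and that xs/ys hold the coordinates currently being processed; it is not a drop-in replacement for the loop above:
import numpy as np

dirf = np.array([-1, 0, 1, -1, 1, -1, 0, 1])
dirg = np.array([1, 1, 1, 0, 0, -1, -1, -1])
e = np.array([9, 8, 7, 6, 4, 3, 2, 1])

def expand(xs, ys, data, matrix, w):
    # Broadcast every point against all 8 neighbour offsets at once.
    nx = xs[:, None] + dirf[None, :]        # shape (npoints, 8)
    ny = ys[:, None] + dirg[None, :]
    ind = ny * w + nx
    hit = data[ind] == e[None, :]           # boolean mask of matching pixels
    matrix[ind[hit]] = 1
    return nx[hit], ny[hit]                 # coordinates to feed the next round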
I've profiled my application, and it spends 90% of its time in plus_minus_variations.
The function finds ways to make various numbers given a list of numbers using addition and subtraction.
For example:
Input
1, 2
Output
1+2=3
1-2=-1
-1+2=1
-1-2=-3
This is my current code. I think it could be improved a lot in terms of speed.
import itertools
import operator

def plus_minus_variations(nums):
    result = dict()
    for i, ops in zip(xrange(2 ** len(nums)),
                      itertools.product([-1, 1], repeat=len(nums))):
        total = sum(map(operator.mul, ops, nums))
        result[total] = ops
    return result
I'm mainly looking for a different algorithm to approach this with. My current one seems pretty inefficient. However, if you have optimization suggestions about the code itself, I'd be happy to hear those too.
Additional:
It's okay if the result is missing some of the answers (or has some extraneous answers) if it finishes a lot faster.
If there are multiple ways to get a number, any of them are fine.
For the list sizes I'm using, 99.9% of the ways produce duplicate numbers.
It's okay if the result doesn't have the way that the numbers were produced, if, again, it finishes a lot faster.
If it is OK not to keep track of how each number was produced, there is no reason to recalculate the sum of every combination of numbers from scratch. You can store intermediate results:
def combine(l, r):
    res = set()
    for x in l:
        for y in r:
            res.add(x + y)
            res.add(x - y)
            res.add(-x + y)
            res.add(-x - y)
    return list(res)

def pmv(nums):
    if len(nums) > 1:
        l = pmv(nums[:len(nums)/2])
        r = pmv(nums[len(nums)/2:])
        return combine(l, r)
    return nums
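Example use (Python 2, matching the integer division above):
print(sorted(pmv([1, 2, 3])))    # [-6, -4, -2, 0, 2, 4, 6]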
EDIT: if the way the numbers were generated is important, you can use this variant:
def combine(l, r):
    res = dict()
    for x, q in l.iteritems():
        for y, w in r.iteritems():
            if not res.has_key(x + y):
                res[x + y] = w + q
                res[-x - y] = [-i for i in res[x + y]]
            if not res.has_key(x - y):
                res[x - y] = w + [-i for i in q]
                res[-x + y] = [-i for i in res[x - y]]
    return res

def pmv(nums):
    if len(nums) > 1:
        l = pmv(nums[:len(nums)/2])
        r = pmv(nums[len(nums)/2:])
        return combine(l, r)
    return {nums[0]: [1]}
My tests show that it is still faster than the other solutions.
EDITED:
Aha! The code is in Python 3, inspired by tyz:
from functools import reduce  # only in Python 3

def process(old, num):
    new = set(map(num.__add__, old))  # use itertools.imap for Python 2
    new.update(map((-num).__add__, old))
    return new

def pmv(nums):
    n = iter(nums)
    x = next(n)
    result = {x, -x}  # set([x, -x]) for Python 2
    return reduce(process, n, result)
Instead of splitting in half and recursing, I use reduce to fold the numbers in one by one; this greatly reduces the number of function calls.
It takes less than 1 second to compute 256 numbers.
Why product then mult?
def pmv(nums):
    return {sum(i): i for i in itertools.product(*((num, -num) for num in nums))}
It can be faster if you don't need how the numbers were produced:
def pmv(nums):
    return set(map(sum, itertools.product(*((num, -num) for num in nums))))
This seems to be significantly faster for large random lists, I guess you could further micro-optimize it, but I prefer readability.
I chunk the list into smaller pieces and create variations for each chunk. Since you get far fewer than 2 ** len(chunk) variations, it's going to be faster. The chunk length is 6; you can play with it to find the optimal chunk length.
def pmv(nums):
    chunklen = 6
    res = dict()
    res[0] = ()
    for i in xrange(0, len(nums), chunklen):
        part = plus_minus_variations(nums[i:i+chunklen])
        resnew = dict()
        for (i, j) in itertools.product(res, part):
            resnew[i + j] = tuple(list(res[i]) + list(part[j]))
        res = resnew
    return res
You can get something like a 50% speedup (at least for short lists) just by doing:
from itertools import product, imap
from operator import mul

def plus_minus_variations(nums):
    result = {}
    for ops in product((-1, 1), repeat=len(nums)):
        result[sum(imap(mul, ops, nums))] = ops
    return result
imap won't create intermediate lists you don't need. Importing into the local namespace saves the time attribute lookup takes. Tuples are faster than lists. Don't store unneeded intermediate items.
I tried this with a dict comprehension but it was a bit slower. I tried it with a set comprehension (not saving the ops) and it was the same speed.
I don't know why you were using zip and xrange at all... you weren't using the result in your calculation.
Edit: I get significant speedups with this version all the way up to the point where your version gives a memory error, not just for short lists.
From a mathematical point of view, you finally arrive at all multiples of the greatest common divisor of your start values.
For example:
Start values 2, 4: gcd(2, 4) is 2, so the generated numbers are ..., -4, -2, 0, 2, 4, ...
Start values 3, 5: gcd(3, 5) is 1, so you get all integers.
Start values 12, 18, 15: gcd(12, 18, 15) is 3, so you get ..., -6, -3, 0, 3, 6, ...
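A small check of this observation, assuming Python 3 for math.gcd:
from functools import reduce
from math import gcd

def common_step(nums):
    # Every sum of the form +/-n1 +/-n2 ... is a multiple of gcd(nums).
    return reduce(gcd, nums)

print(common_step([12, 18, 15]))   # 3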
This simple iterative method computes all possible sums. It could be about 5 times faster than the recursive method by @tyz.
def pmv(nums):
    sums = set()
    sums.add(0)
    for i in range(0, len(nums)):
        partialsums = set()
        for s in sums:
            partialsums.add(s + nums[i])
            partialsums.add(s - nums[i])
        sums = partialsums
    return sums
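For example, pmv([2, 4]) gives {-6, -2, 2, 6}, all multiples of gcd(2, 4) = 2, consistent with the gcd remark above:
print(sorted(pmv([2, 4])))    # [-6, -2, 2, 6]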
Based on this older thread, it looks like the cost of list functions in Python is:
Random access: O(1)
Insertion/deletion to front: O(n)
Insertion/deletion to back: O(1)
Can anyone confirm whether this is still true in Python 2.6/3.x?
Take a look here. It's a PEP for a different kind of list. The version specified is 2.6/3.0.
Append (insertion to back) is O(1), while insertion (everywhere else) is O(n). So yes, it looks like this is still true.
Operation     Complexity
Copy          O(n)
Append        O(1)
Insert        O(n)
Get Item      O(1)
Set Item      O(1)
Del Item      O(n)
Iteration     O(n)
Get Slice     O(k)
Del Slice     O(n)
Set Slice     O(n+k)
Extend        O(k)
Sort          O(n log n)
Multiply      O(nk)
Python 3 is mostly an evolutionary change, with no big changes in the data structures and their complexities.
The canonical source for Python collections is TimeComplexity on the Wiki.
That's correct: inserting at the front forces a move of all the elements to make room.
collections.deque offers similar functionality, but is optimized for insertion on both sides.
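A quick way to see the difference, assuming Python 3 (the numbers are illustrative only):
import timeit

print(timeit.timeit('lst.insert(0, None)', setup='lst = []', number=100000))
print(timeit.timeit('dq.appendleft(None)',
                    setup='from collections import deque; dq = deque()',
                    number=100000))
# insert(0, ...) slows down as the list grows; appendleft stays roughly constant.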
I know this post is old, but I recently did a little test myself. The complexity of list.insert() appears to be O(n)
Code:
'''
Independent Study, Timing insertion list method in python
'''
import time

def make_list_of_size(n):
    ret_list = []
    for i in range(n):
        ret_list.append(n)
    return ret_list

#Estimate overhead of timing loop
def get_overhead(niters):
    '''
    Returns the time it takes to iterate a for loop niter times
    '''
    tic = time.time()
    for i in range(niters):
        pass  #Just blindly iterate, niter times
    toc = time.time()
    overhead = toc - tic
    return overhead

def tictoc_midpoint_insertion(list_size_initial, list_size_final, niters, file):
    overhead = get_overhead(niters)
    list_size = list_size_initial
    #insertion_pt = list_size//2   #<------- INSERTION POINT ASSIGNMENT LOCATION 1
    #insertion_pt = 0              #<------- INSERTION POINT ASSIGNMENT LOCATION 4 (insert at beginning)
    delta = 100
    while list_size <= list_size_final:
        #insertion_pt = list_size//2   #<--- INSERTION POINT ASSIGNMENT LOCATION 2
        x = make_list_of_size(list_size)
        tic = time.time()
        for i in range(niters):
            insertion_pt = len(x)//2   #<--- INSERTION POINT ASSIGNMENT LOCATION 3
            #insertion_pt = len(x)     #<--- INSERTION POINT ASSIGNMENT LOCATION 5 (insert at true end)
            x.insert(insertion_pt, 0)
        toc = time.time()
        cost_per_iter = (toc - tic)/niters  #overall time cost per iteration
        cost_per_iter_no_overhead = (toc - tic - overhead)/niters  #time per iteration without
                                                                   #the overhead cost of pure iteration
        print("list size = {:d}, cost without overhead = {:f} sec/iter".format(list_size, cost_per_iter_no_overhead))
        file.write(str(list_size) + ',' + str(cost_per_iter_no_overhead) + '\n')
        if list_size >= 10*delta:
            delta *= 10
        list_size += delta

def main():
    fname = input()
    file = open(fname, 'w')
    niters = 10000
    tictoc_midpoint_insertion(100, 10000000, niters, file)
    file.close()

main()
See the 5 positions where insertion can be done. The cost is, of course, a function of how large the list is and how close you are to the beginning of the list (i.e. how many memory locations have to be shifted).
Fwiw, there is a faster (for some ops; insert is O(log n)) list implementation called BList if you need it.