I have implemented a naive merge sort algorithm in Python. The algorithm and test code are below:
import time
import random
import matplotlib.pyplot as plt
import math
from collections import deque
def sort(unsorted):
    if len(unsorted) <= 1:
        return unsorted
    to_merge = deque(deque([elem]) for elem in unsorted)
    while len(to_merge) > 1:
        left = to_merge.popleft()
        right = to_merge.popleft()
        to_merge.append(merge(left, right))
    return to_merge.pop()

def merge(left, right):
    result = deque()
    while left or right:
        if left and right:
            elem = left.popleft() if left[0] > right[0] else right.popleft()
        elif not left and right:
            elem = right.popleft()
        elif not right and left:
            elem = left.popleft()
        result.append(elem)
    return result
LOOP_COUNT = 100
START_N = 1
END_N = 1000

def test(fun, test_data):
    start = time.clock()
    for _ in xrange(LOOP_COUNT):
        fun(test_data)
    return time.clock() - start

def run_test():
    timings, elem_nums = [], []
    test_data = random.sample(xrange(100000), END_N)
    for i in xrange(START_N, END_N):
        loop_test_data = test_data[:i]
        elapsed = test(sort, loop_test_data)
        timings.append(elapsed)
        elem_nums.append(len(loop_test_data))
        print "%f s --- %d elems" % (elapsed, len(loop_test_data))
    plt.plot(elem_nums, timings)
    plt.show()

run_test()
As far as I can see, everything is OK and I should get a nice N*log(N) curve as a result. But the picture differs a bit:
Things I've tried to investigate the issue:
PyPy. The curve is ok.
Disabled the GC using the gc module. Wrong guess. Debug output showed that it doesn't even run until the end of the test.
Memory profiling using meliae - nothing special or suspicious.
I had another implementation (a recursive one using the same merge function), and it behaves the same way. The more full test cycles I run, the more "jumps" there are in the curve.
So how can this behaviour be explained and - hopefully - fixed?
UPD: changed lists to collections.deque
UPD2: added the full test code
UPD3: I use Python 2.7.1 on Ubuntu 11.04, on a quad-core 2 GHz notebook. I tried to turn off most of the other processes: the number of spikes went down, but at least one of them was still there.
You are simply picking up the impact of other processes on your machine.
You run your sort function 100 times for input size 1 and record the total time spent on this. Then you run it 100 times for input size 2, and record the total time spent. You continue doing so until you reach input size 1000.
Let's say that once in a while your OS (or you yourself) starts doing something CPU-intensive. Let's say this "spike" lasts as long as it takes you to run your sort function 5000 times. This means that the execution times would look slow for 5000 / 100 = 50 consecutive input sizes. A while later, another spike happens, and another range of input sizes looks slow. This is precisely what you see in your chart.
I can think of one way to avoid this problem. Run your sort function just once for each input size: 1, 2, 3, ..., 1000. Repeat this process 100 times, using the same 1000 inputs (it's important, see explanation at the end). Now take the minimum time spent for each input size as your final data point for the chart.
That way, the spikes should affect each input size only a few times out of 100 runs; and since you're taking the minimum, they will likely have no impact on the final chart at all.
If your spikes are really really long and frequent, you of course might want to increase the number of repetitions beyond the current 100 per input size.
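A minimal sketch of this min-of-repeats scheme (assuming the sort() function from the question, and Python 2 as in the rest of the post):

import time
import random

REPEATS = 100
START_N, END_N = 1, 1000
test_data = random.sample(xrange(100000), END_N)

best = {}  # best[i]: fastest time observed for input size i
for _ in xrange(REPEATS):
    for i in xrange(START_N, END_N):
        loop_data = test_data[:i]
        start = time.clock()
        sort(loop_data)                        # sort() from the question
        elapsed = time.clock() - start
        if elapsed < best.get(i, float('inf')):
            best[i] = elapsed                  # keep only the minimum per input size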
Looking at your spikes, I notice the execution slows down by exactly a factor of 3 during a spike. I'm guessing the OS gives your Python process one slot out of three during high load. Whether my guess is correct or not, the approach I recommend should resolve the issue.
EDIT:
I realized that I didn't clarify one point in my proposed solution to your problem.
Should you use the same input in each of your 100 runs for a given input size? Or should you use 100 different (random) inputs?
Since I recommended taking the minimum of the execution times, the inputs should be the same (otherwise you'll get an incorrect result, as you'll be measuring the best-case complexity instead of the average complexity!).
But when you use only one fixed input per input size, you create some noise in your chart, since some inputs are simply faster to sort than others.
So a better solution is to resolve the system load problem, without creating the problem of only one input per input size (this is obviously pseudocode):
seed = 'choose whatever you like'
repeats = 4
inputs_per_size = 25
runtimes = defaultdict(lambda: float('inf'))

for r in range(repeats):
    random.seed(seed)
    for i in range(inputs_per_size):
        for n in range(1000):
            input = generate_random_input(size=n)
            execution_time = get_execution_time(input)
            if runtimes[(n, i)] > execution_time:
                runtimes[(n, i)] = execution_time

for n in range(1000):
    runtimes[n] = sum(runtimes[(n, i)] for i in range(inputs_per_size)) / inputs_per_size
Now you can use runtimes[n] to build your plot.
Of course, depending on how noisy your system is, you might change (repeats, inputs_per_size) from (4, 25) to, say, (10, 10), or even (25, 4).
I can reproduce the spikes using your code:
You should choose an appropriate timing function (time.time() vs. time.clock() -- from timeit import default_timer), the number of repetitions in a test (how long each test takes), and the number of tests to choose the minimal time from. This gives you better precision and less external influence on the results. Read the note from the timeit.Timer.repeat() docs:
It’s tempting to calculate mean and standard deviation from the result
vector and report these. However, this is not very useful. In a
typical case, the lowest value gives a lower bound for how fast your
machine can run the given code snippet; higher values in the result
vector are typically not caused by variability in Python’s speed, but
by other processes interfering with your timing accuracy. So the min()
of the result is probably the only number you should be interested in.
After that, you should look at the entire vector and apply common
sense rather than statistics.
The timeit module can choose appropriate parameters for you:
$ python -mtimeit -s 'from m import testdata, sort; a = testdata[:500]' 'sort(a)'
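The same measurement can be done from Python code (a sketch; `m`, `testdata`, and `sort` are the module and names assumed in the shell command above):

from timeit import repeat

times = repeat("sort(a)",
               setup="from m import testdata, sort; a = testdata[:500]",
               repeat=5, number=100)
print min(times) / 100   # best per-call time, per the advice quoted above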
Here's the timeit-based performance curve:
The figure shows that sort() behavior is consistent with O(n*log(n)):
|------------------------------+-------------------|
| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 1.00 log2(N) + 1.25e-015     | N                 |
| 2.00 log2(N) + 5.31e-018     | N*N               |
| 1.19 log2(N) + 1.116         | N*log2(N)         |
| 1.37 log2(N) + 2.232         | N*log2(N)*log2(N) |
|------------------------------+-------------------|
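The reference slopes in the table can be reproduced with a least-squares fit in log-log space; a sketch for the N*log2(N) row (just one way to do it, not necessarily what make-figures.py does):

import numpy as np

ns = 2.0 ** np.arange(1, 21)          # N = 2 .. 2**20
g = ns * np.log2(ns)                  # candidate function N*log2(N)
slope, intercept = np.polyfit(np.log2(ns), np.log2(g), 1)
print "%.2f log2(N) + %.3f" % (slope, intercept)   # close to the 1.19 log2(N) + 1.116 row above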
To generate the figure I've used make-figures.py:
$ python make-figures.py --nsublists 1 --maxn=0x100000 -s vkazanov.msort -s vkazanov.msort_builtin
where:
# adapt sorting functions for make-figures.py
def msort(lists):
    assert len(lists) == 1
    return sort(lists[0])  # `sort()` from the question

def msort_builtin(lists):
    assert len(lists) == 1
    return sorted(lists[0])  # builtin
Input lists are described here (note: the input is sorted, so the builtin sorted() function shows the expected O(N) performance).
Related
I am using the following code unchanged in form but changed in content:
import numpy as np
import matplotlib.pyplot as plt
import random
from random import seed
from random import randint
import math
from math import *
from random import *
import statistics
from statistics import *
n = 1000
T_plot = [0]
X_relm = [0]

class Objs:
    def __init__(self, xIn, yIn, color):
        self.xIn = xIn
        self.yIn = yIn
        self.color = color

    def yfT(self, t):
        return self.yIn*t + self.yIn*t

    def xfT(self, t):
        return self.xIn*t - self.yIn*t

xi = np.random.uniform(0, 1, n)
yi = np.random.uniform(0, 1, n)

O1 = [Objs(xIn=i, yIn=j, color=choice(["Black", "White"])) for i, j
      in zip(xi, yi)]
X = sorted(O1, key=lambda x: x.xIn)

dt = 1/(2*n)
T = 20
iter = 40000

Black = []
White = []
Xrelm = []

for i in range(1, iter+1):
    t = i*dt
    for j in range(n-1):
        check = X[j].xfT(t) - X[j+1].xfT(t)
        if check < 0:
            X[j], X[j+1] = X[j+1], X[j]
            if check < -10:
                X[j].color, X[j+1].color = X[j+1].color, X[j].color
        if X[j].color == "Black":
            Black.append(X[j].xfT(t))
        else:
            White.append(X[j].xfT(t))
    Xrel = mean(Black) - mean(White)
    Xrelm.append(Xrel)

plot1 = plt.figure(1)
plt.plot(T_plot, Xrelm)
plt.xlabel("time")
plt.ylabel("Relative ")
and it keeps running (I left it for 10 hours) without giving output for some parameters, simply because the workload is too big, I guess. I know that my code is not totally faulty (in the sense that it should give something, even if wrong) because it does give output for fewer time steps and other parameters.
So, I am focusing on trying to optimize my code so that it takes less time to run. This is a routine task for coders, but I am a newbie and I am coding simply because the simulation will help in my field. So any general insights on how to make one's code faster are appreciated.
Besides that, I want to ask whether defining a function a priori for the inner loop will save any time.
I do not think it should save any time, since I am doing the same thing either way, but I am not sure; maybe it does. If it doesn't, any insights on how to deal with nested loops more efficiently, along with general ones, are appreciated.
(I have tried to shorten the code as far as I could and still not miss relevant information)
There are several issues in your code:
The mean is recomputed from scratch over the growing lists on every iteration. Thus, the total cost of mean(Black)-mean(White) is quadratic in the number of elements.
The statistics.mean function is not efficient: using a basic sum and division is much faster. In fact, a manual mean is about 25~30 times faster on my machine.
The CPython interpreter is very slow, so you should avoid using loops as much as possible (OOP code does not help either). If this is not possible and your computation is expensive, then consider using natively compiled code. You can use tools like PyPy, Numba or Cython, or possibly rewrite a part in C.
Note that strings are generally quite slow and there is no reason to use them here. Consider using enumerations instead (i.e. integers) -- see the sketch at the end of this answer.
Here is code fixing the first two points:
dt = 1/(2*n)
T = 20
iter = 40000

Black = []
White = []
Xrelm = []

cur1, cur2 = 0, 0
sum1, sum2 = 0.0, 0.0

for i in range(1, iter+1):
    t = i*dt
    for j in range(n-1):
        check = X[j].xfT(t) - X[j+1].xfT(t)
        if check < 0:
            X[j], X[j+1] = X[j+1], X[j]
            if check < -10:
                X[j].color, X[j+1].color = X[j+1].color, X[j].color
        if X[j].color == "Black":
            Black.append(X[j].xfT(t))
        else:
            White.append(X[j].xfT(t))
    delta1, delta2 = sum(Black[cur1:]), sum(White[cur2:])
    sum1, sum2 = sum1+delta1, sum2+delta2
    cur1, cur2 = len(Black), len(White)
    Xrel = sum1/cur1 - sum2/cur2
    Xrelm.append(Xrel)
Consider resetting Black and White to an empty list if you do not use them later.
This is several hundred times faster. It now takes 2 minutes as opposed to >20h (estimated) for the initial code.
Note that using compiled code should be at least 10 times faster here, so the execution time should be no more than a dozen seconds or so.
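For the enumeration point above, a small sketch of what replacing the color strings with integers could look like (BLACK and WHITE are hypothetical constants, not part of the original code):

from random import choice

BLACK, WHITE = 0, 1              # hypothetical constants replacing "Black"/"White"

class Objs:
    def __init__(self, xIn, yIn, color):
        self.xIn = xIn
        self.yIn = yIn
        self.color = color       # now an int: BLACK or WHITE

    def xfT(self, t):
        return self.xIn*t - self.yIn*t

objs = [Objs(0.1*k, 0.2*k, choice([BLACK, WHITE])) for k in range(10)]
black_xs = [o.xfT(1.0) for o in objs if o.color == BLACK]   # integer comparison instead of string comparison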
As mentioned in earlier comments, this one is a bit too broad to answer.
To illustrate: your iteration itself doesn't take very long:
import time

start = time.time()
for i in range(10000):
    for j in range(10000):
        pass
end = time.time()

print (end-start)
On my not-so-great machine that takes ~2s to complete.
So the looping portion is only a tiny fraction of your 10h+ run time.
The detail of what you're doing in the loop is the key.
Whilst very basic, the timing approach shown in the code above could be applied to sections of your existing code to work out which bit(s) are the least performant; you could then raise a new question with some more specific, actionable detail.
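If you want more detail than manual timing, the standard library's cProfile module can break the run time down per function (a sketch; run_simulation is a hypothetical stand-in for your own entry point):

import cProfile
import pstats

cProfile.run("run_simulation()", "profile.out")   # profile your own entry point
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)    # show the 10 most expensive call paths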
So, I'm a beginner in Python (coding in general, really), and I've tried to make this little program which generates a random number of rods in 305 attempts:
import random

rods = 0

def blazerods():
    global rods
    seed = random.randint(0, 100000000000)
    random.seed(seed)
    i = 0
    rods = 0
    for i in range(0, 305):
        rnd = random.random()
        if rnd < 0.50:
            rods += 1
    print(rods)
    return rods

while 1==1:
    blazerods()
    if rods >= 211:
        break
The goal is to get 211 or more rods. However, I ran the program for 30 minutes without results.
My questions are: Is it even possible to get 211 or higher with just this code I included?
Can I make it more likely that rods ends up at 211 or more (still a very unlikely result, of course) without changing the chance (50%)?
Is random.seed(seed) even useful?
The probability distribution of rods is Binomial(305, 0.5); that is, the probability of getting exactly n rods is (305 choose n) * 0.5^305.
To get the probability of getting at least 211, you need to sum these terms from n = 211 to 305. Wolfram Alpha gives that as about 8.8e-12.
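You can also check this number directly with the standard library (math.comb needs Python 3.8+):

from math import comb

# P(rods >= 211) = sum over k of C(305, k) * 0.5**305, for k = 211..305
p = sum(comb(305, k) for k in range(211, 306)) / 2**305
print(p)   # roughly 8.8e-12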
So... it is really, really unlikely and you will have to wait a long time.
If your loop runs 1000 times a second, you can expect to reach 211 or more rods about once every 4 years.
If I remember correctly, Matt Parker from the Youtube channel Stand-up Maths has something to say about this particular case in his video "How lucky is too lucky".
As pointed out by Jens, this is easy to calculate via the Binomial distribution. The SciPy stats module allows you to calculate this by doing:
from scipy import stats
# i.e. 305 draws with equal probability
d = stats.binom(305, 0.5)
# the probability of seeing something greater than this value
p = d.sf(210)
which should give you the same value as Jens got: ~8.8e-12.
Next we can use the datetime module to convert this number into the expected time you have to wait:
from datetime import timedelta
time_per_try = timedelta(seconds=1/1000)
print(time_per_try / p)
which should give you ~1300 days, or about 3.6 years. Technically, this is the expected (mean) waiting time rather than a 50% point (there is roughly a 63% chance of having seen a success by then), and it could appear much sooner or later.
You can calculate reasonable values of when this would happen, using the negative binomial distribution. In Python, this looks like:
for q in stats.nbinom(1, p).ppf([0.025, 0.975]):
    print(time_per_try * q)
where the 0.025 and 0.975 values give you the 95% confidence interval you hear scientists talking about.
It tells you that if you had 20 computers running your algorithm in parallel, each doing 1000 tests per second, you could expect the first one to finish in around a month while the slowest one would likely be going on for more than 10 years.
I want to program the following (I've just started to learn Python):
f[i]:=f[i-1]-(1/n)*(1-(1-f[i-1])^n)-(1/n)*(f[i-1])^n+(2*f[0]/n);
with f[0] = x, where x belongs to [0,1] and n is a constant integer.
My try:
import pylab as pl
import numpy as np

N = 20
n = 100
h = 0.01
T = np.arange(0, 1+h, h)

def f(i):
    if i == 0:
        return T
    else:
        return f(i-1)-(1./n)*(1-(1-f(i-1))**n)-(1./n)*(f(i-1))**n+2.*T/n

pl.figure(figsize=(10, 6), dpi=80)
pl.plot(T, f(N), color="red", linestyle='--', linewidth=2.5)
pl.show()
For N=10 (number of iterations) it returns the correct plot fast enough, but for N=20 it keeps running and running (more than 30 minutes already).
The reason why your run time is so slow is that, like the simplistic calculation of the nth Fibonacci number, it runs in exponential time (in this case 3^N). To see this: before f(i) can return its value, it has to call f(i-1) 3 times, but then each of those has to call f(i-2) 3 times (3*3 calls), and then each of those has to call f(i-3) 3 times (3*3*3 calls), and so on. In this example, as others have shown, this can be calculated simply in linear time. The reason it is slow for N = 20 is that your function has to be called roughly 3^20 = 3486784401 times before you get the answer!
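A small sketch that makes this growth visible by counting calls in a recurrence of the same shape (g and calls are hypothetical names, not part of the original code):

calls = 0

def g(i):
    # mirrors the three uses of f(i-1) in the question's function
    global calls
    calls += 1
    if i == 0:
        return 1.0
    return g(i-1) + g(i-1) + g(i-1)

for i in range(1, 11):
    calls = 0
    g(i)
    print("i=%d calls=%d" % (i, calls))   # calls == (3**(i+1) - 1) // 2, i.e. roughly 3**i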
You calculate f(i-1) three times in a single recursion layer -- so after the first call you already "know" the answer but still calculate it two more times. A naive fix:
fi_1 = f(i-1)
return fi_1-(1./n)*(1-(1-fi_1)**n)-(1./n)*(fi_1)**n+2.*T/n
But of course we can still do better and cache every evaluation of f:
cache = {}

def f_cached(i):
    if i not in cache:
        cache[i] = f(i)
    return cache[i]
Then replace every occurrence of f with f_cached.
There are also libraries out there that can do that for you automatically (with a decorator).
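For example, in Python 3 the standard library ships such a decorator, functools.lru_cache; a minimal sketch applied to the recurrence from the question (the decorated f replaces the original one, so its internal calls hit the cache too):

from functools import lru_cache
import numpy as np

n, N = 100, 20
h = 0.01
T = np.arange(0, 1 + h, h)

@lru_cache(maxsize=None)       # memoizes f(i), so each level is computed only once
def f(i):
    if i == 0:
        return T
    prev = f(i - 1)            # cached after the first evaluation
    return prev - (1./n)*(1 - (1 - prev)**n) - (1./n)*prev**n + 2.*T/n

result = f(N)                  # runs in linear, not exponential, time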
While recursion often yields nice and easy formulas, Python is not that good at evaluating them (see tail recursion). You are probably better off rewriting it in an iterative way and calculating that.
First of all, you are calculating f(i-1) three times when you can save its result in a variable and calculate it only once:
t = f(i-1)
return t-(1./n)*(1-(1-t)**n)-(1./n)*(t)**n+2.*T/n
It will increase the speed of the program, but I would also recommend calculating f without using recursion:
fs = T
for i in range(1, N+1):
    tmp = fs
    fs = tmp-(1./n)*(1-(1-tmp)**n)-(1./n)*(tmp)**n+2.*T/n
I am solving homework 1 of the Caltech Machine Learning course (http://work.caltech.edu/homework/hw1.pdf). To solve questions 7-10 we need to implement a PLA (perceptron learning algorithm). This is my implementation in Python:
import sys,math,random

w=[]        # stores the weights
data=[]     # stores the vector X(x1,x2,...)
output=[]   # stores the output(y)

# returns 1 if dot product is more than 0
def sign_dot_product(x):
    global w
    dot=sum([w[i]*x[i] for i in xrange(len(w))])
    if(dot>0):
        return 1
    else:
        return -1

# checks if a point is misclassified
def is_misclassified(rand_p):
    return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)

# loads data in the following format:
# x1 x2 ... y
# In the present case for d=2
# x1 x2 y
def load_data():
    f=open("data.dat","r")
    global w
    for line in f:
        data_tmp=([1]+[float(x) for x in line.split(" ")])
        data.append(data_tmp[0:-1])
        output.append(data_tmp[-1])

def train():
    global w
    w=[random.uniform(-1,1) for i in xrange(len(data[0]))]  # initializes w with random weights
    iter=1
    while True:
        rand_p=random.randint(0,len(output)-1)  # randomly picks a point
        check=[0]*len(output)  # check is a list. The ith location is 1 if the ith point is correctly classified
        while not is_misclassified(rand_p):
            check[rand_p]=1
            rand_p=random.randint(0,len(output)-1)
            if sum(check)==len(output):
                print "All points successfully satisfied in ",iter-1," iterations"
                print iter-1,w,data[rand_p]
                return iter-1
        sign=output[rand_p]
        w=[w[i]+sign*data[rand_p][i] for i in xrange(len(w))]  # changing weights
        if iter>1000000:
            print "greater than 1000"
            print w
            return 10000000
        iter+=1

load_data()

def simulate():
    #tot_iter=train()
    tot_iter=sum([train() for x in xrange(100)])
    print float(tot_iter)/100

simulate()
The problem: according to the answer to question 7, it should take around 15 iterations for the perceptron to converge for the given training set size, but my implementation takes an average of 50000 iterations. The training data is supposed to be randomly generated, but I am generating data for simple lines such as x=4, y=2, etc. Is this the reason why I am getting the wrong answer, or is there something else wrong? A sample of my training data (separable using y=2):
1 2.1 1
231 100 1
-232 1.9 -1
23 232 1
12 -23 -1
10000 1.9 -1
-1000 2.4 1
100 -100 -1
45 73 1
-34 1.5 -1
It is in the format x1 x2 output(y)
It is clear that you are doing a great job learning both Python and classification algorithms with your effort.
However, some of the stylistic inefficiencies in your code make it difficult to help you, and they create a chance that part of the problem could be a miscommunication between you and the professor.
For example, does the professor wish for you to use the Perceptron in "online mode" or "offline mode"? In "online mode" you should move sequentially through the data points and you should not revisit any points. From the assignment's conjecture that it should require only 15 iterations to converge, I am curious if this implies the first 15 data points, in sequential order, would result in a classifier that linearly separates your data set.
By instead sampling randomly with replacement, you might be causing yourself to take much longer (although, depending on the distribution and size of the data sample, this is admittedly unlikely since you'd expect roughly that any 15 points would do about as well as the first 15).
The other issue is that after you detect a correctly classified point (cases where not is_misclassified evaluates to True), if you then witness a new random point that is misclassified, your code will kick down into the larger section of the outer while loop and then go back to the top, where it will overwrite the check vector with all 0s.
This means that the only way your code will detect that it has correctly classified all the points is if the particular random sequence in which it evaluates them (in the inner while loop) happens to come up as a string of all 1's -- that is, every point it happens to draw on that pass gets classified correctly, with no misclassified point in between.
I can't quite formalize why I think that will make the program take much longer, but it seems like your code is requiring a much stricter form of convergence, where it sort of has to learn everything all at once in one monolithic pass late in the training stage, after having already been updated a bunch.
One easy way to check whether my intuition about this is wrong would be to move the line check=[0]*len(output) outside of the while loop altogether and only initialize it one time.
Some general advice to make the code easier to manage:
Don't use global variables. Instead, let your function that loads and preps the data return those things.
There are a few places where you say, for example,
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
This kind of thing can be simplified to
return sign_dot_product(data[rand_p]) != output[rand_p]
which is easier to read and conveys what criteria you're trying to check for in a more direct manner.
I doubt efficiency plays an important role since this seems to be a pedagogical exercise, but there are a number of ways to refactor your use of list comprehensions that might be beneficial. And if possible, just use NumPy which has native array types. Witnessing how some of these operations have to be expressed with list operations is lamentable. Even if your professor doesn't want you to implement with NumPy because she or he is trying to teach you pure fundamentals, I say just ignore them and go learn NumPy. It will help you with jobs, internships, and practical skill with these kinds of manipulations in Python vastly more than fighting with the native data types to do something they were not designed for (array computing).
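To illustrate the NumPy suggestion (and a convergence check that looks at all points each pass), here is a minimal PLA sketch -- a hypothetical helper, not the assignment's required structure; it assumes X already carries the bias column of 1s and y holds +/-1 labels:

import numpy as np

def pla(X, y, max_updates=100000):
    # X: (m, d) array whose first column is the constant 1 (bias); y: (m,) array of +/-1 labels
    w = np.zeros(X.shape[1])
    for updates in range(max_updates):
        preds = np.where(np.dot(X, w) > 0, 1, -1)   # same convention as sign_dot_product above
        wrong = np.flatnonzero(preds != y)          # indices of all misclassified points
        if wrong.size == 0:
            return w, updates                       # converged: every point classified correctly
        j = np.random.choice(wrong)                 # pick one misclassified point at random
        w = w + y[j] * X[j]                         # standard perceptron update
    return w, max_updates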
I'm attempting to learn Python, and I thought trying to develop my own prime sieve would be an interesting problem for the afternoon. Whenever I've needed primes so far, I would just import a version of the Sieve of Eratosthenes that I found online -- that's what I used as my benchmark.
After trying several different optimizations, I thought I had written a pretty decent sieve:
def sieve3(n):
    top = n+1
    sieved = dict.fromkeys(xrange(3, top, 2), True)
    for si in sieved:
        if si * si > top:
            break
        if sieved[si]:
            for j in xrange((si*2) + si, top, si*2):  # [****]
                sieved[j] = False
    return [2] + [pr for pr in sieved if sieved[pr]]
Using the first 1,000,000 integers as my range, this code would generate the correct number of primes and was only about 3-5x slower than my benchmark. I was about to call it a day and pat myself on the back when I tried it on a larger range, and it no longer worked!
n = 1,000 -- Benchmark = 168 in 0.00010 seconds
n = 1,000 -- Sieve3 = 168 in 0.00022 seconds
n = 4,194,304 -- Benchmark = 295,947 in 0.288 seconds
n = 4,194,304 -- Sieve3 = 295,947 in 1.443 seconds
n = 4,194,305 -- Benchmark = 295,947 in 3.154 seconds
n = 4,194,305 -- Sieve3 = 2,097,153 in 0.8465 seconds
I think the problem comes from the line with [****], but I can't figure out why it's so broken. It's supposed to mark each odd multiple of si as False, and it works most of the time, but for anything above 4,194,304 the sieve is broken. (To be fair, it breaks on random other numbers too, like 10,000 for instance.)
I made a change and it significantly slowed my code down, but it would actually work for all values. This version includes all numbers (not just odds) but is otherwise identical.
def sieve2(n):
    top = n+1
    sieved = dict.fromkeys(xrange(2, top), True)
    for si in sieved:
        if si * si > top:
            break
        if sieved[si]:
            for j in xrange((si*2), top, si):
                sieved[j] = False
    return [pr for pr in sieved if sieved[pr]]
Can anyone help me figure out why my original function (sieve3) doesn't work consistently?
Edit: I forgot to mention that when sieve3 'breaks', sieve3(n) returns about n/2 values.
The sieve requires the loop over candidate primes to be ordered. The code in question is enumerating the keys of a dictionary, which are not guaranteed to be ordered. Instead, go ahead and use the xrange you used to initialize the dictionary for your main sieve loop as well as the return result loop as follows:
def sieve3(n):
    top = n+1
    sieved = dict.fromkeys(xrange(3, top, 2), True)
    for si in xrange(3, top, 2):
        if si * si > top:
            break
        if sieved[si]:
            for j in xrange(3*si, top, si*2):
                sieved[j] = False
    return [2] + [pr for pr in xrange(3, top, 2) if sieved[pr]]
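A quick sanity check of the corrected version against the counts quoted in the question:

print len(sieve3(1000))       # 168, matching the benchmark above
print len(sieve3(4194305))    # 295947, matching the benchmark above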
It's because dictionary keys are not ordered. Some of the time, by chance, for si in sieved: will loop through your keys in increasing order.
With your last example, the first value si gets is big enough to break the loop immediately.
You can simply use:
for si in sorted(sieved):
Well, look at the runtime -- on the last case you showed, your sieve was almost 4 times faster than the benchmark, while it had previously been about 5 times slower. That is a red flag: maybe you aren't performing all of the iterations? (And it is faster while returning roughly 7 times as many "primes"...)
I don't have time to look into the code more right now, but I hope this helps.