What is the logic behind the <statement, setup> design for timeit?

I am new to Python and figured I'd play around with problems on Project Euler to have something concrete to do meanwhile.
I came across the idea of timing different solutions to see how they rate against each other. That simple task turned out to be too complicated for my taste however. I read that the time.clock() calls are not accurate enough on unix systems (seconds resolution is simply pathetic with modern processors). Thus I stumbled upon the timeit module which seems to be the first choice for profiling tasks.
I have to say I really don't understand why they went with such a counter-intuitive way to go about it. I can't seem to get it to work, without needing to rewrite/restructure my code, which I find very frustrating.
Take the code below and nevermind for a second that it's neither pretty nor particularly efficient:
import math
import sys
from timeit import Timer

def digitsum(number):
    rem = 0
    while number > 0:
        rem += number % 10
        number //= 10
    return rem

def prime_form(p):
    if p == 2 or p == 3 or p == 5:
        return True
    elif (p-1) % 6 != 0 and (p+1) % 6 != 0:
        return False
    elif digitsum(p) % 3 == 0:
        return False
    elif p % 10 == 0 or p % 10 == 5:
        return False
    else:
        return True

def lfactor(n):
    if n <= 3:
        return 1
    limit = int(math.sqrt(n))
    if limit % 2 == 0:
        limit -= 1
    lfac = 1
    for i in range(3, limit+1, 2):
        if prime_form(i):
            (div, rem) = divmod(n, i)
            if rem == 0:
                lfac = max(lfac, max(lfactor(div), lfactor(i)))
    return lfac if lfac != 1 else n

number = int(sys.argv[1])
t = Timer("""print lfactor(number)""", """import primefacs""")
t.timeit(100)
#print lfactor(number)
If I would like to time the line print lfactor(number), why should I go through a bunch of hoops, trying to define a setup statement etc.? I understand why one would want debug tools that are detached from the code being tested (à la unit testing), but shouldn't there be a simple and straightforward way to get the process time of a chunk of code without much hassle (importing/defining a setup etc.)? What I am thinking of here is something like the way one would do it in Java:
long t0 = System.currentTimeMillis();
// do something
long t = System.currentTimeMillis() - t0;
.. or even better with MATLAB, using the tic/toc commands:
tic
x = A\b;
t(n) = toc;
Hope this doesn't come across as a rant; I am really trying to understand "the pythonian way of thinking", but honestly it doesn't come naturally here, not at all...

Simple: the logic behind the statement and setup is that the setup is not part of the code you want to benchmark. Normally a Python module is loaded once, while the functions inside it are run more than once, often many times.
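To make that split concrete, here is a minimal library-level sketch (the statement and setup strings are toy examples of mine, not the OP's code): the setup runs once per timeit() call and is excluded from the measurement, while the statement runs number times inside the timed loop.
from timeit import Timer

# setup: executed once, NOT included in the measured time
setup = "data = range(1000)"
# statement: executed `number` times; only this is timed
stmt = "sum(data)"

t = Timer(stmt, setup)
print t.timeit(number=10000)   # total seconds for 10000 runs of stmt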
A Pythonic way of using timeit?
$ python -m timeit -h
Tool for measuring execution time of small code snippets.
This module avoids a number of common traps for measuring execution
times. See also Tim Peters' introduction to the Algorithms chapter in
the Python Cookbook, published by O'Reilly.
Library usage: see the Timer class.
Command line usage:
python timeit.py [-n N] [-r N] [-s S] [-t] [-c] [-h] [--] [statement]
Options:
-n/--number N: how many times to execute 'statement' (default: see below)
-r/--repeat N: how many times to repeat the timer (default 3)
-s/--setup S: statement to be executed once initially (default 'pass')
-t/--time: use time.time() (default on Unix)
-c/--clock: use time.clock() (default on Windows)
-v/--verbose: print raw timing results; repeat for more digits precision
-h/--help: print this usage message and exit
--: separate options from statement, use when statement starts with -
statement: statement to be timed (default 'pass')
[cut]
$ python -m timeit -s 'from primefacs import lfactor' 'lfactor(42)'
$ # the line above works: the import happens in the setup, so only lfactor(42) is timed
$ # this does not work, primefacs is not bound, i.e. not loaded:
$ python -m timeit 'primefacs.lfactor(42)'
$ # this does not work either, lfactor is not defined:
$ python -m timeit 'lfactor(42)'
$ # this works, but the time to import primefacs is benchmarked too
$ # (the module is only loaded the first time; successive runs use the cache):
$ python -m timeit 'import primefacs; primefacs.lfactor(42)'
As you can see, the way timeit works is much more intuitive than you think.
Edit to add:
I read that the time.clock() calls are not accurate enough on unix
systems (seconds resolution is simply pathetic with modern
processors).
quoting the documentation:
On Unix, return the current processor time as a floating point number
expressed in seconds. The precision, and in fact the very definition
of the meaning of “processor time”, depends on that of the C function
of the same name, but in any case, this is the function to use for
benchmarking Python or timing algorithms... The resolution is
typically better than one microsecond.
going on..
I have to say I really don't understand why they went with such a
counter-intuitive way to go about it. I can't seem to get it to work,
without needing to rewrite/restructure my code, which I find very
frustrating.
Yes, it could be, but this is one of those cases where the documentation can help you: here is a link to the examples for the impatient, and here a gentler introduction to timeit.

When timing a statement, you want to time just that statement, not the setup. The setup could be considerably slower than the statement-under-test.
Note that timeit runs your statement thousands of times to get a reasonable average. It does this to eliminate the effects of OS scheduling and other processes (including but not limited to disk buffer flushing, cronjob execution, memory swapping, etc); only an average time would have any meaning when comparing different code alternatives.
For your case, just test lfactor(number) directly, and just use the timeit() function:
timeit.timeit('lfactor(number)', 'from __main__ import lfactor, number')
The setup code retrieves the lfactor() function, as well as number (taken from sys.argv), from the main script; neither name would otherwise be visible to the timed statement.
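Putting that together, a minimal sketch of how the script could be laid out (the body of lfactor is a trivial stand-in of mine so the sketch runs on its own; substitute the real digitsum/prime_form/lfactor from the question):
import sys
import timeit

def lfactor(n):
    # stand-in body; use the real lfactor from the question here
    return n

number = int(sys.argv[1])

# The setup pulls both names from this script's own namespace and is not timed.
total = timeit.timeit('lfactor(number)',
                      'from __main__ import lfactor, number',
                      number=100)
print 'average seconds per call:', total / 100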
There is absolutely no point in performance testing the print statement, that's not what you are trying to time. Using timeit is not about seeing the result of the call, just the time it takes to run it. Since the code-under-test is run thousands of times, all you'd get is thousands of prints of (presumably) the same result.
Note that usually timeit is used to compare performance characteristics of short python snippets; to find performance bottlenecks in more complex code, use profiling instead.
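For the profiling case, a minimal sketch with the standard-library cProfile/pstats (assuming lfactor and number are defined at module level as above; the file name is arbitrary):
import cProfile
import pstats

cProfile.run('lfactor(number)', 'lfactor.prof')   # write raw stats to a file
stats = pstats.Stats('lfactor.prof')
stats.strip_dirs().sort_stats('cumulative').print_stats(10)   # show the 10 most expensive entries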
If you want to time just one run, use the timeit.default_timer() function to get the most accurate timer for your platform:
timer = timeit.default_timer
start = timer()
print lfactor(number)
time_taken = timer() - start
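If you miss the tic/toc style from the question, a tiny context manager around default_timer gives you something similar; this is just a convenience sketch, not part of timeit:
from contextlib import contextmanager
import timeit

@contextmanager
def tictoc(label='elapsed'):
    timer = timeit.default_timer   # most precise wall-clock timer for the platform
    start = timer()
    yield
    print '%s: %f seconds' % (label, timer() - start)

# usage:
# with tictoc('lfactor'):
#     lfactor(number)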

Related

How to make PyCharm profiler show only timings of my source code, not any libraries?

When I run the PyCharm profiler (a quick intro video is here: https://www.youtube.com/watch?v=QSueV8MYtlw ) I get thousands of lines like hasattr or npyio.py (where did that come from? I do not even use numpy) which do not help me understand what's going on at all.
How can I make the PyCharm profiler show only timings of my source code, not any libraries or system calls?
In other words, can the time spent in system calls and libraries be assigned to my functions which call them?
In other words (version two), all I want is the number of milliseconds next to each line of my Python code, nothing else.
I created some code to provide an example and hopefully an acceptable answer:
import datetime as dt

class something:
    def something_else(self):
        other_list = range(100000)
        for num in other_list:
            datetimeobj = dt.datetime.fromtimestamp(num)
            print(num)
            print(datetimeobj)

    def something_different(self):
        other_list = range(100000)
        for num in other_list:
            datetimeobj = dt.datetime.fromtimestamp(num)
            print(num)
            print(datetimeobj)

st = something()
st.something_else()
st.something_different()
The code resulted in the picture below, which I have sorted by name. (In my case this is possible because all of the built-in methods are prefixed by "<".) After doing this I can now see that main took 100% of the total time (column: Time (ms)). something_else took 50.8% of the time and something_different took 49.2% (also totaling 100%) (column: Time (ms)). The time spent inside each of the two home-grown methods was 2.0% each (column: Own Time (ms)). This means that the underlying calls from something_else accounted for 48.8%, and from something_different for 47.2%, while the parts that I wrote accounted for 4.0% of the total time. The remaining 96.0% is spent in the built-in methods that I call.
Your questions were:
How can I make the PyCharm profiler show only timings of my source code, not any libraries or system calls? -> That's what you see in the column "Own Time (ms)" -> 2.0% (time spent inside the specific method itself).
In other words, can the time spent in system calls and libraries be assigned to my functions which call them? -> That's what you see in the column: "Time (ms)" (Time spent including underlying methods.)
Subtract the two columns and you get time spent only in underlying methods.
Unfortunately I have been unable to find a way to filter inside the profiler, but it is possible to export the list by copying it, and that way you could do the filtering yourself, e.g. on "<built-in", to clean up the data.
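An alternative to copying the list out of PyCharm is to profile with the standard-library cProfile and filter the report yourself; pstats' print_stats() accepts regular-expression restrictions, so you can keep only rows from your own file. A rough sketch (the my_script name is hypothetical):
import cProfile
import pstats

import my_script   # the module containing the example class above

cProfile.run('my_script.something().something_else()', 'out.prof')

stats = pstats.Stats('out.prof')
# the string argument is a regex filter on the printed rows; matching your own
# file name hides the built-in and library entries
stats.strip_dirs().sort_stats('tottime').print_stats('my_script')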

Python loop slower than Excel VBA?

I ran a little test between Excel (VBA) and Python performing a simple loop. Code listed below. To my surprise VBA was significantly faster than Python, almost 6 times faster. I thought that, since Python runs from the command line, its performance would be better. Do you guys have any comments on this?
Python
import time
import ctypes  # An included library with Python install.

start_time = time.time()
for x in range(0, 1000000):
    print x
x = ("--- %s seconds ---" % (time.time() - start_time))
ctypes.windll.user32.MessageBoxA(0, x, "Your title", 1)
Excel (VBA)
Sub looptest()
    Dim MyTimer As Double
    MyTimer = Timer
    Dim rng As Range, cell As Range
    Set rng = Range("A1:A1000000")
    x = 1
    For Each cell In rng
        cell.Value = x
        x = x + 1
    Next cell
    MsgBox Timer - MyTimer
End Sub
Your two code samples are not doing the same thing. In the Python code, the inner loop has to:
Ask for the next number in range(0, 1000000).
Display it.
In the VBA code, Excel has to:
Ask for the next cell in Range("A1:A1000000") (which has nothing to do with Python ranges).
Set the cell.Value property.
Run through various code Excel executes whenever it changes a cell.
Check to see if any formulas need to be recalculated.
Display it.
Increment x.
Let's rewrite this so the Python and VBA loops do the same thing, as near as we can:
Python
import time
import ctypes

start_time = time.time()
x = 0
while x <= 1000000:
    x = x + 1
x = ("--- %s seconds ---" % (time.time() - start_time))
ctypes.windll.user32.MessageBoxA(0, x, "Your title", 1)
VBA
Declare Function QueryPerformanceCounter Lib "kernel32" (t As Currency) As Boolean
Declare Function QueryPerformanceFrequency Lib "kernel32" (t As Currency) As Boolean

Sub looptest()
    Dim StartTime As Currency
    QueryPerformanceCounter StartTime
    x = 0
    Do While x <= 1000000
        x = x + 1
    Loop
    Dim EndTime As Currency
    QueryPerformanceCounter EndTime
    Dim Frequency As Currency
    QueryPerformanceFrequency Frequency
    MsgBox Format$((EndTime - StartTime) / Frequency, "0.000")
End Sub
On my computer, Python takes about 96 ms, and VBA 33 ms – VBA performs three times faster. If you throw in a Dim x As Long, it performs six times faster.
Why? Well, let's look at how each gets run. Python internally compiles your .py file into a .pyc, and runs it under the Python VM. Another answer describes the Python case in detail. Excel compiles VBA into MS P-Code, and runs it under the Visual Basic VM.
At this point, it doesn't matter that python.exe is command-line and Excel is GUI. The VM runs your code, and it lives a little deeper in the bowels of your computer. Performance depends on what specific instructions are in the compiled code, and how efficiently the VM runs these instructions. In this case, the VB VM ran its P-Code faster than the Python VM ran its .pyc.
The slow part here is the print. Printing to the console is incredibly slow, so you should avoid it entirely. I assume that setting cell values in Excel is simply much faster.
If you want to compare computation speed, you should not have any I/O inside the loop. Instead, only measure the time it takes to process the whole loop without doing anything inside it (or doing something trivial like adding a number). If you do that, you will see that Python is very fast.
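To illustrate, a rough sketch in the same style as the question (exact numbers will vary by machine; the point is only that the print dominates):
import time

# loop with console I/O
start = time.time()
for x in range(100000):
    print x
with_print = time.time() - start

# the same loop without I/O
start = time.time()
for x in range(100000):
    pass
without_print = time.time() - start

print "with print:    %.3f s" % with_print
print "without print: %.3f s" % without_print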
It varies with the computation you need to perform, and it is difficult to prove which is faster with a single simple comparison. So I will share my experience in a generic way, without proofs, but with some comments on things that can make a big difference.
In my experience, if you compare a simple for loop with exactly the same code between pure Python and pure VBA, VBA is around 3 times faster. But nobody writes the same code in different languages; you have to apply each language's best practices.
If you apply VBA best practices you can make it even faster by declaring variable types and other similar optimizations not available in Python. In my experience this can make the code around 2-3 times faster, so VBA code can end up around 6-9 times faster than a simple for loop in Python.
On the other hand, if you apply Python best practices, you generally won't write a plain for loop at all. You will use a list comprehension, or numpy and scipy, which run in compiled C libraries. Those solutions are much faster than VBA code.
In general, if you perform complex matrix calculations that you can do with numpy and scipy, Python will be faster than VBA. In other cases, VBA is faster.
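As a rough illustration of that point (assumes numpy is installed; the speedup depends on the machine and the operation):
import time
import numpy as np

n = 1000000

# plain Python loop
start = time.time()
total = 0
for i in xrange(n):
    total += i
print "python loop:", time.time() - start

# vectorised equivalent, running in compiled C inside numpy
start = time.time()
total = np.arange(n).sum()
print "numpy sum:  ", time.time() - start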
In Python you can also use Numba, which adds a bit of complexity to the code, but it generates compiled code (and can even run it on the GPU), making your code faster still.
This is my experience with pure computation, mainly involving in-memory arrays. I haven't compared performance for I/O with a GUI, external files and databases, network communication, or APIs.

How to make a recursive program run for a long time without getting RunTimeError in Python

This code is the recursive factorial function.
The problem is that if I want to calculate a very large number, it generates this error:
RuntimeError : maximum recursion depth exceeded
import time

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

print "The factorial of the number is: ", factorial(1500)
time.sleep(3600)
The goal is for the recursive factorial function to be able to keep calculating for up to one hour.
This is a really bad idea. Python is not at all well-suited for recursing that many times. I'd strongly recommend you switch this to a loop which checks a timer and stops when it reaches the limit.
But if you're seriously interested in increasing the recursion limit in CPython (the default depth is 1000), there's a sys setting for that, sys.setrecursionlimit. Note that, as the documentation says, "the highest possible limit is platform-dependent", meaning there's no way to know when your program will fail. Nor is there any way you, I, or CPython could ever tell whether your program will recurse for something as irrelevant to the actual execution of your code as "an hour". (Just for fun, I tried this with a method that passes along an int counting how many times it has already recursed, and I got to 9755 before IDLE totally restarted itself.)
Here's an example of a way I think you should do this:
# be sure to import time
start_time = time.time()
counter = 1

# will execute for an hour
while time.time() < start_time + 3600:
    factorial(counter)  # presumably you'd want to do something with the return value here
    counter += 1
You should also keep in mind that regardless of whether you use iteration or recursion, (unless you're using a separate thread) you're still going to be blocking the entire program for the entirety of the hour.
Don't do that. There is an upper limit on how deep your recursion can get. Instead, do something like this:
def factorial(n):
    result = 1
    for i in range(1, n+1):
        result *= i
    return result
Any recursive function can be rewritten to an iterative function. If your code is fancier than this, show us the actual code and we'll help you rewrite it.
A few things to note here:
You can increase the recursion limit with:
import sys
sys.setrecursionlimit(someNumber)  # maybe 20000 or bigger
This basically just raises your limit for recursion. Note that in order for it to run for one hour, this number would have to be so unreasonably big that it is practically impossible. This is one of the problems with recursion, and it is why people reach for iterative programs.
So basically what you want is practically impossible, and you would rather take a loop/while approach.
Moreover, your sleep call does not do what you want. Sleep just forces you to wait additional time (freezing your program).
It is a guard against a stack overflow. You can change the recursion limit with sys.setrecursionlimit(newLimit), where newLimit is an integer.
Python isn't a functional language. Rewriting the algorithm iteratively, if possible, is generally a better idea.

How to get REALLY fast Python over a simple loop

I'm working on a SPOJ problem, INTEST. The goal is to specify the number of test cases (n) and a divisor (k), then feed your program n numbers. The program will accept each number on a newline of stdin and after receiving the nth number, will tell you how many were divisible by k.
The only challenge in this problem is getting your code to be FAST because k can be anything up to 10^7 and n can be as high as 10^9.
I'm trying to write it in Python and have trouble speeding it up. Any ideas?
Edit 2: I finally got it to pass at 10.54 seconds. I used nearly all of your answers to get there, and thus it was hard to choose one as 'correct', but I believe the one I chose sums it up the best. Thanks to you all. Final passing code is below.
Edit: I included some of the suggested updates in the included code.
Extensions and third-party modules are not allowed. The code is also run by the SPOJ judge machine, so I do not have the option of changing interpreters.
import sys
import psyco
psyco.full()

def main():
    from sys import stdin, stdout
    first_in = stdin.readline()
    thing = first_in.split()
    n = int(thing[0])
    k = int(thing[1])
    total = 0
    list = stdin.readlines()
    for item in list:
        if int(item) % k == 0:
            total += 1
    stdout.write(str(total) + "\n")

if __name__ == "__main__":
    main()
[Edited to reflect new findings and passing code on spoj]
Generally, when using Python for spoj:
Don't use "raw_input", use sys.stdin.readlines(). That can make a difference for large input. Also, if possible (and it is, for this problem), read everything at once (sys.stdin. readlines()), instead of reading line by line ("for line in sys.stdin...").
Similarly, don't use "print", use sys.stdout.write() - and don't forget "\n". Of course, this is only relevant when printing multiple times.
As S.Mark suggested, use psyco. It's available for both python2.5 and python2.6, at spoj (test it, it's there, and easy to spot: solutions using psyco usually have a ~35Mb memory usage offset). It's really simple: just add, after "import sys": import psyco; psyco.full()
As Justin suggested, put your code (except psyco incantation) inside a function, and simply call it at the end of your code
Sometimes creating a list and checking its length can be faster than creating a list and adding its components.
Favour list comprehensions (and generator expressions, when possible) over "for" and "while" as well. For some constructs, map/reduce/filter may also speed up your code.
Using (some of) these guidelines, I've managed to pass INTEST. Still testing alternatives, though.
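For reference, a compact sketch applying those guidelines (read everything at once, count inside a function, write with sys.stdout; psyco left out here):
import sys

def main():
    data = sys.stdin.readlines()           # read all input in one go
    n, k = map(int, data[0].split())       # first line: test-case count and divisor
    # generator expression instead of an explicit counting loop
    total = sum(1 for line in data[1:] if int(line) % k == 0)
    sys.stdout.write(str(total) + "\n")

if __name__ == "__main__":
    main()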
Hey, I got it to be within the time limit. I used the following:
Psyco with Python 2.5.
a simple loop with a variable to keep count in
my code was all in a main() function (except the psyco import) which I called.
The last one is what made the difference. I believe that it has to do with variable visibility, but I'm not completely sure. My time was 10.81 seconds. You might get it to be faster with a list comprehension.
Edit:
Using a list comprehension brought my time down to 8.23 seconds. Bringing the line from sys import stdin, stdout inside the function shaved off a little more, bringing my time down to 8.12 seconds.
Use psyco, it will JIT your code, which is very effective when there are big loops and lots of calculation.
Edit: Looks like third-party modules are not allowed,
so you may try converting your loop to a list comprehension; it is supposed to run at C level, so it should be a little bit faster.
sum(1 if int(line) % k == 0 else 0 for line in sys.stdin)
Just recently Alex Martelli said that invoking code inside a function outperforms code run at module level (I can't find the post, though).
So, why don't you try:
import sys
import psyco
psyco.full()

def main():
    first_in = raw_input()
    thing = first_in.split()
    n = int(thing[0])
    k = int(thing[1])
    total = 0
    i = 0
    total = sum(1 if int(line) % k == 0 else 0 for line in sys.stdin)
    print total

if __name__ == "__main__":
    main()
IIRC the reason was that code inside a function can be optimized.
Using list comprehensions with psyco is counterproductive.
This code:
count = 0
for l in sys.stdin:
    count += not int(l) % k
runs twice as fast as
count = sum(not int(l)%k for l in sys.stdin)
when using psyco.
For other readers, here is the INTEST problem statement. It's intended to be an I/O throughput test.
On my system, I was able to shave 15% off the execution time by replacing the loop with the following:
print sum(1 for line in sys.stdin if int(line) % k == 0)

Why is subtraction faster than addition in Python?

I was optimising some Python code, and tried the following experiment:
import time

start = time.clock()
x = 0
for i in range(10000000):
    x += 1
end = time.clock()
print '+=', end - start

start = time.clock()
x = 0
for i in range(10000000):
    x -= -1
end = time.clock()
print '-=', end - start
The second loop is reliably faster, anywhere from a whisker to 10%, depending on the system I run it on. I've tried varying the order of the loops, number of executions etc, and it still seems to work.
Stranger,
for i in range(10000000, 0, -1):
(ie running the loop backwards) is faster than
for i in range(10000000):
even when loop contents are identical.
What gives, and is there a more general programming lesson here?
I can reproduce this on my Q6600 (Python 2.6.2); increasing the range to 100000000:
('+=', 11.370000000000001)
('-=', 10.769999999999998)
First, some observations:
This is 5% for a trivial operation. That's significant.
The speed of the native addition and subtraction opcodes is irrelevant. It's in the noise floor, completely dwarfed by the bytecode evaluation. We're talking about one or two native instructions amid thousands.
The bytecode generates exactly the same number of instructions; the only difference is INPLACE_ADD vs. INPLACE_SUBTRACT and +1 vs -1.
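You can check that observation yourself with the dis module; a quick sketch:
import dis

# disassemble each statement and compare the bytecode side by side
dis.dis(compile("x += 1", "<add>", "exec"))
dis.dis(compile("x -= -1", "<sub>", "exec"))
# The listings should differ only in INPLACE_ADD vs. INPLACE_SUBTRACT
# and in the constant 1 vs. -1.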
Looking at the Python source, I can make a guess. This is handled in ceval.c, in PyEval_EvalFrameEx. INPLACE_ADD has a significant extra block of code to handle string concatenation. That block doesn't exist in INPLACE_SUBTRACT, since you can't subtract strings. That means INPLACE_ADD contains more native code. Depending (heavily!) on how the code is generated by the compiler, this extra code may be inline with the rest of the INPLACE_ADD code, which means additions can hit the instruction cache harder than subtractions. This could be causing extra cache misses, which could cause a significant performance difference.
This is heavily dependent on the system you're on (different processors have different amounts of cache and cache architectures), the compiler in use, including the particular version and compilation options (different compilers will decide differently which bits of code are on the critical path, which determines how assembly code is lumped together), and so on.
Also, the difference is reversed in Python 3.0.1 (+: 15.66, -: 16.71); no doubt this critical function has changed a lot.
$ python -m timeit -s "x=0" "x+=1"
10000000 loops, best of 3: 0.151 usec per loop
$ python -m timeit -s "x=0" "x-=-1"
10000000 loops, best of 3: 0.154 usec per loop
Looks like you have some measurement bias.
I think the "general programming lesson" is that it is really hard to predict, solely by looking at the source code, which sequence of statements will be the fastest. Programmers at all levels frequently get caught up by this sort of "intuitive" optimisation. What you think you know may not necessarily be true.
There is simply no substitute for actually measuring your program performance. Kudos for doing so; answering why undoubtedly requires delving deep into the implementation of Python, in this case.
With byte-compiled languages such as Java, Python, and .NET, it is not even sufficient to measure performance on just one machine. Differences between VM versions, native code translation implementations, CPU-specific optimisations, and so on will make this sort of question ever more tricky to answer.
"The second loop is reliably faster ..."
That's your explanation right there. Re-order your script so the subtraction test is timed first, then the addition, and suddenly addition becomes the faster operation again:
-= 3.05
+= 2.84
Obviously something happens to the second half of the script that makes it faster. My guess is that the first call to range() is slower because python needs to allocate enough memory for such a long list, but it is able to re-use that memory for the second call to range():
import time
start = time.clock()
x = range(10000000)
end = time.clock()
del x
print 'first range()',end-start
start = time.clock()
x = range(10000000)
end = time.clock()
print 'second range()',end-start
A few runs of this script show that the extra time needed for the first range() accounts for nearly all of the time difference between '+=' and '-=' seen above:
first range() 0.4
second range() 0.23
It's always a good idea when asking a question to say what platform and what version of Python you are using. Sometimes it doesn't matter. This is NOT one of those times:
time.clock() is appropriate only on Windows. Throw away your own measuring code and use -m timeit as demonstrated in pixelbeat's answer.
Python 2.x's range() builds a list. If you are using Python 2.x, replace range with xrange and see what happens.
Python 3.x's int is Python 2.x's long.
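A quick way to see the range/xrange point with timeit (Python 2 only; xrange yields numbers lazily instead of building the whole list):
import timeit

# range() materialises a 10-million element list on every run
print timeit.timeit("for i in range(10000000): pass", number=10)

# xrange() avoids the big allocation
print timeit.timeit("for i in xrange(10000000): pass", number=10)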
Is there a more general programming lesson here?
The more general programming lesson here is that intuition is a poor guide when predicting run-time performance of computer code.
One can reason about algorithmic complexity, hypothesise about compiler optimisations, estimate cache performance and so on. However, since these things can interact in non-trivial ways, the only way to be sure about how fast a particular piece of code is going to be is to benchmark it in the target environment (as you have rightfully done.)
With Python 2.5 the biggest problem here is using range, which will allocate a list that big to iterate over it. When using xrange, whichever is done second is a tiny bit faster for me. (Not sure if range has become a generator in Python 3.)
Your experiment is faulty. The way this experiment should be designed is to write 2 different programs - 1 for addition, 1 for subtraction. They should be exactly the same and run under the same conditions with the data being put to file. Then you need to average the runs (at least several thousand), but you'd need a statistician to tell you an appropriate number.
If you wanted to analyze different methods of addition, subtraction, and looping, again each of those should be a separate program.
Experimental error might arise from processor heat and other activity going on in the CPU, so I'd execute the runs in a variety of patterns...
That would be remarkable, so I have thoroughly evaluated your code and also set up the experiment as I would find more correct (all declarations and function calls outside the loop). I ran both versions five times.
Running your code validated your claims:
-= consistently takes less time; 3.6% on average.
Running my code, though, contradicts the outcome of your experiment:
+= takes on average (not always) 0.5% less time.
To show all results I have put plots online:
Your evaluation: http://bayimg.com/kadAeaAcN
My evaluation: http://bayimg.com/KadaAaAcN
So, I conclude that your experiment has a bias, and it is significant.
Finally here is my code:
import time

addtimes = [0.] * 100
subtracttimes = [0.] * 100
range100 = range(100)
range10000000 = range(10000000)
j = 0
i = 0
x = 0
start = 0.

for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x += 1
    addtimes[j] = time.clock() - start

for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x -= -1
    subtracttimes[j] = time.clock() - start

print '+=', sum(addtimes)
print '-=', sum(subtracttimes)
The running loop backwards is faster because the computer has an easier time comparing if a number is equal to 0.
