I'm working on a SPOJ problem, INTEST. The goal is to specify the number of test cases (n) and a divisor (k), then feed your program n numbers. The program will accept each number on its own line of stdin and, after receiving the nth number, will tell you how many were divisible by k.
The only challenge in this problem is getting your code to be FAST because k can be anything up to 10^7 and n can be as high as 10^9.
I'm trying to write it in Python and am having trouble speeding it up. Any ideas?
Edit 2: I finally got it to pass at 10.54 seconds. I used nearly all of your answers to get there, and thus it was hard to choose one as 'correct', but I believe the one I chose sums it up the best. Thanks to you all. Final passing code is below.
Edit: I included some of the suggested updates in the included code.
Extensions and third-party modules are not allowed. The code is also run by the SPOJ judge machine, so I do not have the option of changing interpreters.
import sys
import psyco
psyco.full()

def main():
    from sys import stdin, stdout
    first_in = stdin.readline()
    thing = first_in.split()
    n = int(thing[0])
    k = int(thing[1])
    total = 0
    lines = stdin.readlines()
    for item in lines:
        if int(item) % k == 0:
            total += 1
    stdout.write(str(total) + "\n")

if __name__ == "__main__":
    main()
[Edited to reflect new findings and passing code on spoj]
Generally, when using Python for spoj:
Don't use "raw_input", use sys.stdin.readlines(). That can make a difference for large input. Also, if possible (and it is, for this problem), read everything at once (sys.stdin. readlines()), instead of reading line by line ("for line in sys.stdin...").
Similarly, don't use "print", use sys.stdout.write() - and don't forget "\n". Of course, this is only relevant when printing multiple times.
As S.Mark suggested, use psyco. It's available for both python2.5 and python2.6, at spoj (test it, it's there, and easy to spot: solutions using psyco usually have a ~35Mb memory usage offset). It's really simple: just add, after "import sys": import psyco; psyco.full()
As Justin suggested, put your code (except the psyco incantation) inside a function, and simply call it at the end of your code.
Sometimes building a whole list (e.g. with a comprehension) and checking its length can be faster than appending items one by one and counting as you go.
Favour list comprehensions (and generator expressions, when possible) over "for" and "while" as well. For some constructs, map/reduce/filter may also speed up your code.
Using (some of) these guidelines, I've managed to pass INTEST. Still testing alternatives, though.
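For reference, here is a minimal sketch of how those guidelines can combine (without psyco); reading the whole input with sys.stdin.read().split() instead of readlines() is my own variation, not necessarily what the passing submission used:

import sys

def main():
    # Read everything at once, then work on the in-memory tokens.
    data = sys.stdin.read().split()
    n, k = int(data[0]), int(data[1])
    # data[2:] holds the n numbers, one token each.
    divisible = sum(1 for tok in data[2:] if int(tok) % k == 0)
    sys.stdout.write(str(divisible) + "\n")

if __name__ == "__main__":
    main()

Everything runs inside main(), so the loop variables are locals rather than module globals.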
Hey, I got it to be within the time limit. I used the following:
Psyco with Python 2.5.
a simple loop with a variable to keep count in
my code was all in a main() function (except the psyco import) which I called.
The last one is what made the difference. I believe it has to do with variable scope (local variables inside a function are looked up faster than globals), but I'm not completely sure. My time was 10.81 seconds. You might get it to be faster with a list comprehension.
Edit:
Using a list comprehension brought my time down to 8.23 seconds. Bringing the line from sys import stdin, stdout inside of the function shaved off a little more, bringing my time down to 8.12 seconds.
Use psyco, it will JIT your code, very effective when there is big loop and calculations.
Edit: Looks like third-party modules are not allowed.
So, you may try converting your loop to a list comprehension (or a generator expression, as below); it is supposed to run at C level, so it should be a little bit faster.
sum(1 if int(line) % k == 0 else 0 for line in sys.stdin)
Just recently Alex Martelli said that code invoked inside a function outperforms code run at module level (I can't find the post, though).
So, why don't you try:
import sys
import psyco
psyco.full()

def main():
    first_in = raw_input()
    thing = first_in.split()
    n = int(thing[0])
    k = int(thing[1])
    total = sum(1 if int(line) % k == 0 else 0 for line in sys.stdin)
    print total

if __name__ == "__main__":
    main()
IIRC the reason was that code inside a function can be optimized (local variable lookups are faster than global ones).
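If you want to see that effect for yourself, here is a rough timing sketch of my own (the absolute numbers are illustrative and will vary by machine and interpreter):

import time

N = 10 ** 7

# Module-level loop: "total" and "i" are globals, resolved by dictionary lookup.
start = time.time()
total = 0
for i in range(N):
    total += 1
print("module level: %.2f s" % (time.time() - start))

def main():
    # The same loop with locals: resolved by index into a fixed-size array.
    start = time.time()
    total = 0
    for i in range(N):
        total += 1
    print("in a function: %.2f s" % (time.time() - start))

main()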
Using list comprehensions with psyco is counter productive.
This code:
count = 0
for l in sys.stdin:
    count += not int(l) % k
runs twice as fast as
count = sum(not int(l)%k for l in sys.stdin)
when using psyco.
For other readers, here is the INTEST problem statement. It's intended to be an I/O throughput test.
On my system, I was able to shave 15% off the execution time by replacing the loop with the following:
print sum(1 for line in sys.stdin if int(line) % k == 0)
I'm trying to decide which one to use when I need to acquire lines of input from STDIN, so I wonder how I need to choose them in different situations.
I found a previous post (https://codereview.stackexchange.com/questions/23981/how-to-optimize-this-simple-python-program) saying that:
How can I optimize this code in terms of time and memory used? Note that I'm using different function to read the input, as sys.stdin.readline() is the fastest one when reading strings and input() when reading integers.
Is that statement true?
The builtin input and sys.stdin.readline functions don't do exactly the same thing, and which one is faster may depend on the details of exactly what you're doing. As aruisdante commented, the difference is less in Python 3 than it was in Python 2, when the quote you provide was from, but there are still some differences.
The first difference is that input has an optional prompt parameter that will be displayed if the interpreter is running interactively. This leads to some overhead, even if the prompt is empty (the default). On the other hand, it may be faster than doing a print before each readline call, if you do want a prompt.
The next difference is that input strips off any newline from the end of the input. If you're going to strip that anyway, it may be faster to let input do it for you, rather than doing sys.stdin.readline().strip().
A final difference is how the end of the input is indicated. input will raise an EOFError when you call it if there is no more input (stdin has been closed on the other end). sys.stdin.readline on the other hand will return an empty string at EOF, which you need to know to check for.
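A small sketch of the two EOF conventions side by side (my own illustration, Python 3; in real code you would pick one or the other, since stdin can only be consumed once):

import sys

# With input(): loop until EOFError is raised.
lines = []
try:
    while True:
        lines.append(input())
except EOFError:
    pass

# With sys.stdin.readline(): loop until it returns the empty string.
lines = []
while True:
    line = sys.stdin.readline()
    if line == "":                 # "" is only returned at EOF
        break
    lines.append(line.rstrip("\n"))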
There's also a third option, using the file iteration protocol on sys.stdin (that is, for line in sys.stdin:). This is likely to behave much like calling readline, but with perhaps nicer logic to it.
I suspect that while differences in performance between your various options may exist, they're likely to be smaller than the time cost of simply reading the file from the disk (if it is large) and doing whatever you are doing with it. I suggest that you avoid the trap of premature optimization and just do what is most natural for your problem, and if the program is too slow (where "too slow" is very subjective), do some profiling to see what is taking the most time. Don't put a whole lot of effort into deciding between the different ways of taking input unless it actually matters.
As Linn1024 says, for reading large amounts of data input() is much slower.
A simple example is this:
import sys

for i in range(int(sys.argv[1])):
    sys.stdin.readline()
This takes about 0.25μs per iteration:
$ time yes | py readline.py 1000000
yes 0.05s user 0.00s system 22% cpu 0.252 total
Changing that to sys.stdin.readline().strip() takes that to about 0.31μs.
Changing readline() to input() is about 10 times slower:
$ time yes | py input.py 1000000
yes 0.05s user 0.00s system 1% cpu 2.855 total
Notice that it's still pretty fast though, so it only really matters when you are reading many thousands or millions of entries, as in the loop above.
input() checks whether stdin is a TTY every time it runs (via a syscall), so it works much more slowly than sys.stdin.readline():
https://github.com/python/cpython/blob/af2f5b1723b95e45e1f15b5bd52102b7de560f7c/Python/bltinmodule.c#L1981
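Because of that, a common workaround in I/O-heavy scripts (my suggestion, not part of the answer above) is to rebind input to the faster readline; just remember that readline keeps the trailing newline:

import sys

input = sys.stdin.readline    # shadow the builtin with the faster reader

n = int(input())              # int() ignores the trailing "\n"
values = [int(input()) for _ in range(n)]
print(sum(values))

With that one-line rebinding in place, the rest of the script can keep calling input() as usual.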
import sys

def solve(N, A):
    for _ in range(N):
        A.append(int(sys.stdin.readline().strip()))
    return A

def main():
    N = int(sys.stdin.readline().strip())
    A = []
    result = solve(N, A)
    print(result)

main()
This code is a recursive factorial function.
The problem is that if I want to calculate a very large number, it generates this error:
RuntimeError: maximum recursion depth exceeded
import time

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

print "The factorial of the number is: ", factorial(1500)
time.sleep(3600)
The goal is for the recursive function to keep calculating factorials for up to a maximum of one hour.
This is a really bad idea. Python is not at all well suited to recursing that many times. I'd strongly recommend you switch this to a loop which checks a timer and stops when it reaches the limit.
But, if you're seriously interested in increasing the recursion limit in CPython (the default depth is 1000), there's a sys setting for that, sys.setrecursionlimit. Note, as it says in the documentation, that "the highest possible limit is platform-dependent" - meaning there's no way to know when your program will fail. Nor is there any way you, I or Python could ever tell whether your program will recurse for something as irrelevant to the actual execution of your code as "an hour." (Just for fun, I tried this with a method that passes an int counting how many times it's already recursed, and I got to 9755 before IDLE totally restarted itself.)
Here's an example of a way I think you should do this:
# be sure to import time
start_time = time.time()
counter = 1
# will execute for an hour
while time.time() < start_time + 3600:
    factorial(counter)  # presumably you'd want to do something with the return value here
    counter += 1
You should also keep in mind that regardless of whether you use iteration or recursion, (unless you're using a separate thread) you're still going to be blocking the entire program for the entirety of the hour.
Don't do that. There is an upper limit on how deep your recursion can get. Instead, do something like this:
def factorial(n):
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result
Any recursive function can be rewritten to an iterative function. If your code is fancier than this, show us the actual code and we'll help you rewrite it.
A few things to note here:
You can increase the recursion limit with:
import sys
sys.setrecursionlimit(someNumber) # may be 20000 or bigger.
This will basically just increase your recursion limit. Note that in order for it to run for one hour, this number would have to be so unreasonably big that it is practically impossible. This is one of the problems with recursion, and it is why people turn to iterative programs.
So basically what you want is practically impossible with recursion, and you would be better off with a loop/while approach.
Moreover, your sleep call does not do what you want; sleep just forces the program to wait for additional time (freezing it).
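Putting those points together, a sketch of the loop/while approach that actually respects a one-hour budget could look like this (the incremental result *= n step is my own shortcut, so each iteration does not recompute the whole factorial):

import time

start = time.time()
n = 0
result = 1
# Grow the factorial one step at a time until an hour has passed.
while time.time() - start < 3600:
    n += 1
    result *= n          # result is now n!
print("Largest factorial computed within one hour: %d!" % n)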
It is a guard against a stack overflow. You can change the recursion limit with sys.setrecursionlimit(newLimit), where newLimit is an integer.
Python isn't a functional language. Rewriting the algorithm iteratively, if possible, is generally a better idea.
I am new to Python and figured I'd play around with problems on Project Euler to have something concrete to do meanwhile.
I came across the idea of timing different solutions to see how they rate against each other. That simple task turned out to be too complicated for my taste however. I read that the time.clock() calls are not accurate enough on unix systems (seconds resolution is simply pathetic with modern processors). Thus I stumbled upon the timeit module which seems to be the first choice for profiling tasks.
I have to say I really don't understand why they went with such a counter-intuitive way to go about it. I can't seem to get it to work, without needing to rewrite/restructure my code, which I find very frustrating.
Take the code below and nevermind for a second that it's neither pretty nor particularly efficient:
import math
import sys
from timeit import Timer

def digitsum(number):
    rem = 0
    while number > 0:
        rem += number % 10
        number //= 10
    return rem

def prime_form(p):
    if p == 2 or p == 3 or p == 5:
        return True
    elif (p-1) % 6 != 0 and (p+1) % 6 != 0:
        return False
    elif digitsum(p) % 3 == 0:
        return False
    elif p % 10 == 0 or p % 10 == 5:
        return False
    else:
        return True

def lfactor(n):
    if n <= 3:
        return 1
    limit = int(math.sqrt(n))
    if limit % 2 == 0:
        limit -= 1
    lfac = 1
    for i in range(3, limit+1, 2):
        if prime_form(i):
            (div, rem) = divmod(n, i)
            if rem == 0:
                lfac = max(lfac, max(lfactor(div), lfactor(i)))
    return lfac if lfac != 1 else n

number = int(sys.argv[1])
t = Timer("""print lfactor(number)""", """import primefacs""")
t.timeit(100)
#print lfactor(number)
If I want to time the line print lfactor(number), why should I have to jump through hoops, defining a setup statement and so on? I understand why one would want debugging tools that are detached from the code being tested (a la unit testing), but shouldn't there be a simple and straightforward way to get the process time of a chunk of code without much hassle (importing/defining a setup etc.)? What I am thinking of here is something like the way one would do it in Java:
long t0 = System.currentTimeMillis();
// do something
long t = System.currentTimeMillis() - t0;
.. or even better with MATLAB, using the tic/toc commands:
tic
x = A\b;
t(n) = toc;
Hope this doesn't come across as a rant; I am really trying to understand "the Pythonic way of thinking", but honestly it doesn't come naturally here, not at all...
Simple: the logic behind separating the statement and the setup is that the setup is not part of the code you want to benchmark. Normally a Python module is loaded once, while the functions inside it are run more than once, often much more.
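As library usage, that separation looks like this (assuming the question's code is saved as primefacs.py; only the lfactor(42) calls are timed, the import in the setup is not):

from timeit import Timer

# setup runs once; only the repeated executions of stmt are timed
t = Timer(stmt="lfactor(42)", setup="from primefacs import lfactor")
print(t.timeit(number=100))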
A Pythonic way of using timeit?
$ python -m timeit -h
Tool for measuring execution time of small code snippets.
This module avoids a number of common traps for measuring execution
times. See also Tim Peters' introduction to the Algorithms chapter in
the Python Cookbook, published by O'Reilly.
Library usage: see the Timer class.
Command line usage:
python timeit.py [-n N] [-r N] [-s S] [-t] [-c] [-h] [--] [statement]
Options:
-n/--number N: how many times to execute 'statement' (default: see below)
-r/--repeat N: how many times to repeat the timer (default 3)
-s/--setup S: statement to be executed once initially (default 'pass')
-t/--time: use time.time() (default on Unix)
-c/--clock: use time.clock() (default on Windows)
-v/--verbose: print raw timing results; repeat for more digits precision
-h/--help: print this usage message and exit
--: separate options from statement, use when statement starts with -
statement: statement to be timed (default 'pass')
[cut]
$ python -m timeit -s 'from primefacs import lfactor' 'lfactor(42)'
$ # this does not work: primefacs is not bound, i.e. not imported
$ python -m timeit 'primefacs.lfactor(42)'
$ # this does not work either: lfactor is not defined
$ python -m timeit 'lfactor(42)'
$ # this works, but the time to import primefacs is benchmarked too
$ # (though it is only actually loaded the first time; later runs use the cached module)
$ python -m timeit 'import primefacs; primefacs.lfactor(42)'
As you can see, the way timeit works is much more intuitive than you think.
Edit to add:
I read that the time.clock() calls are not accurate enough on unix
systems (seconds resolution is simply pathetic with modern
processors).
quoting the documentation:
On Unix, return the current processor time as a floating point number
expressed in seconds. The precision, and in fact the very definition
of the meaning of “processor time”, depends on that of the C function
of the same name, but in any case, this is the function to use for
benchmarking Python or timing algorithms... The resolution is
typically better than one microsecond.
Going on...
I have to say I really don't understand why they went with such a
counter-intuitive way to go about it. I can't seem to get it to work,
without needing to rewrite/restructure my code, which I find very
frustrating.
Yes, it could be, but this is one of those cases where the documentation can help you: here is a link to the examples for the impatient, and here a more gentle introduction to timeit.
When timing a statement, you want to time just that statement, not the setup. The setup could be considerably slower than the statement-under-test.
Note that timeit runs your statement thousands of times to get a reasonable average. It does this to eliminate the effects of OS scheduling and other processes (including but not limited to disk buffer flushing, cronjob execution, memory swapping, etc); only an average time would have any meaning when comparing different code alternatives.
For your case, just test lfactor(number) directly, and just use the timeit() function:
timeit.timeit('lfactor(number)', 'from __main__ import lfactor, number')
The setup code retrieves the lfactor() function, as well as number taken from sys.argv from the main script; the function and number won't otherwise be seen.
There is absolutely no point in performance testing the print statement; that's not what you are trying to time. Using timeit is not about seeing the result of the call, just the time it takes to run it. Since the code-under-test is run thousands of times, all you'd get is thousands of prints of (presumably) the same result.
Note that usually timeit is used to compare performance characteristics of short python snippets; to find performance bottlenecks in more complex code, use profiling instead.
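For completeness, a minimal profiling sketch using the standard cProfile module (my own example, reusing the lfactor and number names from the question, which must be defined in the running script):

import cProfile
import pstats

# Profile one full call and list the ten most expensive functions.
cProfile.run("lfactor(number)", "lfactor.prof")
pstats.Stats("lfactor.prof").sort_stats("cumulative").print_stats(10)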
If you want to time just one run, use the timeit.default_timer() function to get the most accurate timer for your platform:
timer = timeit.default_timer
start = timer()
print lfactor(number)
time_taken = timer() - start
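If you miss MATLAB's tic/toc, it is also easy to roll a small helper on top of default_timer; this is my own sketch, not a standard library feature:

import timeit
from contextlib import contextmanager

@contextmanager
def tictoc(label="elapsed"):
    start = timeit.default_timer()
    yield
    print("%s: %.6f s" % (label, timeit.default_timer() - start))

# usage:
# with tictoc("lfactor"):
#     lfactor(number)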
I was optimising some Python code, and tried the following experiment:
import time

start = time.clock()
x = 0
for i in range(10000000):
    x += 1
end = time.clock()
print '+=', end - start

start = time.clock()
x = 0
for i in range(10000000):
    x -= -1
end = time.clock()
print '-=', end - start
The second loop is reliably faster, anywhere from a whisker to 10%, depending on the system I run it on. I've tried varying the order of the loops, number of executions etc, and it still seems to work.
Stranger,
for i in range(10000000, 0, -1):
(ie running the loop backwards) is faster than
for i in range(10000000):
even when loop contents are identical.
What gives, and is there a more general programming lesson here?
I can reproduce this on my Q6600 (Python 2.6.2); increasing the range to 100000000:
('+=', 11.370000000000001)
('-=', 10.769999999999998)
First, some observations:
This is 5% for a trivial operation. That's significant.
The speed of the native addition and subtraction opcodes is irrelevant. It's in the noise floor, completely dwarfed by the bytecode evaluation. We're talking about one or two native instructions amid thousands.
The bytecode generates exactly the same number of instructions; the only difference is INPLACE_ADD vs. INPLACE_SUBTRACT and +1 vs -1.
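You can check the bytecode claim yourself with the dis module; a quick sketch (opcode names are those of Python 2.x and early 3.x, newer versions fold both into BINARY_OP):

import dis

def add_one(x):
    x += 1
    return x

def sub_neg_one(x):
    x -= -1
    return x

# Both disassemble to the same number of instructions; only the
# in-place opcode and the constant (1 vs -1) differ.
dis.dis(add_one)
dis.dis(sub_neg_one)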
Looking at the Python source, I can make a guess. This is handled in ceval.c, in PyEval_EvalFrameEx. INPLACE_ADD has a significant extra block of code, to handle string concatenation. That block doesn't exist in INPLACE_SUBTRACT, since you can't subtract strings. That means INPLACE_ADD contains more native code. Depending (heavily!) on how the code is being generated by the compiler, this extra code may be inline with the rest of the INPLACE_ADD code, which means additions can hit the instruction cache harder than subtraction. This could be causing extra L2 cache hits, which could cause a significant performance difference.
This is heavily dependent on the system you're on (different processors have different amounts of cache and cache architectures), the compiler in use, including the particular version and compilation options (different compilers will decide differently which bits of code are on the critical path, which determines how assembly code is lumped together), and so on.
Also, the difference is reversed in Python 3.0.1 (+: 15.66, -: 16.71); no doubt this critical function has changed a lot.
$ python -m timeit -s "x=0" "x+=1"
10000000 loops, best of 3: 0.151 usec per loop
$ python -m timeit -s "x=0" "x-=-1"
10000000 loops, best of 3: 0.154 usec per loop
Looks like you've got some measurement bias.
I think the "general programming lesson" is that it is really hard to predict, solely by looking at the source code, which sequence of statements will be the fastest. Programmers at all levels frequently get caught up by this sort of "intuitive" optimisation. What you think you know may not necessarily be true.
There is simply no substitute for actually measuring your program performance. Kudos for doing so; answering why undoubtedly requires delving deep into the implementation of Python, in this case.
With byte-compiled languages such as Java, Python, and .NET, it is not even sufficient to measure performance on just one machine. Differences between VM versions, native code translation implementations, CPU-specific optimisations, and so on will make this sort of question ever more tricky to answer.
"The second loop is reliably faster ..."
That's your explanation right there. Re-order your script so the subtraction test is timed first, then the addition, and suddenly addition becomes the faster operation again:
-= 3.05
+= 2.84
Obviously something happens to the second half of the script that makes it faster. My guess is that the first call to range() is slower because python needs to allocate enough memory for such a long list, but it is able to re-use that memory for the second call to range():
import time

start = time.clock()
x = range(10000000)
end = time.clock()
del x
print 'first range()', end - start

start = time.clock()
x = range(10000000)
end = time.clock()
print 'second range()', end - start
A few runs of this script show that the extra time needed for the first range() accounts for nearly all of the time difference between '+=' and '-=' seen above:
first range() 0.4
second range() 0.23
It's always a good idea when asking a question to say what platform and what version of Python you are using. Sometimes it doesn't matter. This is NOT one of those times:
time.clock() is appropriate only on Windows. Throw away your own measuring code and use -m timeit as demonstrated in pixelbeat's answer.
Python 2.X's range() builds a list. If you are using Python 2.x, replace range with xrange and see what happens.
Python 3.X's int is Python2.X's long.
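For the range/xrange point, the change is a one-word edit; a sketch (Python 2 only, since xrange is gone in Python 3):

# Python 2: range(10000000) builds a 10-million element list up front,
# while xrange(10000000) produces the numbers lazily, one at a time.
x = 0
for i in xrange(10000000):
    x += 1
print x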
Is there a more general programming lesson here?
The more general programming lesson here is that intuition is a poor guide when predicting run-time performance of computer code.
One can reason about algorithmic complexity, hypothesise about compiler optimisations, estimate cache performance and so on. However, since these things can interact in non-trivial ways, the only way to be sure about how fast a particular piece of code is going to be is to benchmark it in the target environment (as you have rightfully done.)
With Python 2.5 the biggest problem here is using range, which will allocate a list that big to iterate over it. When using xrange, whichever is done second is a tiny bit faster for me. (Not sure if range has become a generator in Python 3.)
Your experiment is faulty. The way this experiment should be designed is to write two different programs - one for addition, one for subtraction. They should be exactly the same and run under the same conditions, with the results written to a file. Then you need to average the runs (at least several thousand), but you'd need a statistician to tell you an appropriate number.
If you wanted to analyze different methods of addition, subtraction, and looping, again each of those should be a separate program.
Experimental error might arise from processor heat and other activity going on in the CPU, so I'd execute the runs in a variety of patterns...
That would be remarkable, so I have thoroughly evaluated your code and also set up the experiment as I would consider more correct (all declarations and function calls outside the loop). I have run both versions five times.
Running your code validated your claims:
-= consistently takes less time; 3.6% on average
Running my code, though, contradicts the outcome of your experiment:
+= takes on average (not always) 0.5% less time.
To show all results I have put plots online:
Your evaluation: http://bayimg.com/kadAeaAcN
My evaluation: http://bayimg.com/KadaAaAcN
So, I conclude that your experiment has a bias, and it is significant.
Finally here is my code:
import time

addtimes = [0.] * 100
subtracttimes = [0.] * 100
range100 = range(100)
range10000000 = range(10000000)
j = 0
i = 0
x = 0
start = 0.

for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x += 1
    addtimes[j] = time.clock() - start

for j in range100:
    start = time.clock()
    x = 0
    for i in range10000000:
        x -= -1
    subtracttimes[j] = time.clock() - start

print '+=', sum(addtimes)
print '-=', sum(subtracttimes)
Running the loop backwards is faster because the computer has an easier time comparing whether a number is equal to 0.