I want to calculate the runtime of two different algorithms in the same program. When I wrote a program that timed each one individually, I obtained very different results, so to test this new program I had Python time the same algorithm twice. When I did this (in the program below), I found that the runtimes of the same algorithm were in fact different! What am I missing, and how do I fix this so I can compare algorithms?
import timeit

def calc1(x):
    return x*x+x+1

def calc2(x):
    return x*x+x+1

def main():
    x = int(input("Input a number to be tested: "))
    start1 = timeit.default_timer()
    result1 = calc1(x)
    end1 = timeit.default_timer()
    start2 = timeit.default_timer()
    result2 = calc2(x)
    end2 = timeit.default_timer()
    print("Result of calculation 1 was {0}; time to compute was {1} seconds.".format(result1, end1-start1))
    print("Result of calculation 2 was {0}; time to compute was {1} seconds.".format(result2, end2-start2))

main()
I think two things are going on here: in part you're being bitten by Windows power management, and in part your testing method is flawed.
On the first part, I also got bitten by the fact that Windows, by default, throttles CPU throughput to save power. Right now there is a dramatic difference between your calculated runtimes; it can be reduced considerably just by putting a nonsense calculation like something = 5**1000000 immediately after x = int(input("Input a number to be tested: ")) to ramp the CPU back up before the timing starts. I hate Windows 10, so I don't know off the top of my head how to change this; switching "Power Options" to "High Performance" and removing CPU throttling (I'll edit once I've checked the exact steps) will also reduce this gap considerably.
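For example (a sketch; only the added line matters, the rest of main() stays exactly as in your program):
def main():
    x = int(input("Input a number to be tested: "))
    _ = 5 ** 1000000  # throwaway work so a throttled CPU can ramp up before the timing starts
    # ... the timing code from the question continues unchanged ...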
The second issue is that you only run one test cycle. You cannot get stable results from this. Instead, you need multiple iterations. For example, with a million iterations you would see some similarity between the numbers:
import timeit

# setup defines the functions and an example input; only the calc*(x) calls are timed
setup = "def calc1(x):\n    return x*x+x+1\ndef calc2(x):\n    return x*x+x+1\nx = 42"
exec1 = timeit.timeit(stmt='calc1(x)', setup=setup, number=1000000)
exec2 = timeit.timeit(stmt='calc2(x)', setup=setup, number=1000000)
print("Execution of first loop: {}".format(exec1))
print("Execution of second loop: {}".format(exec2))
Depending on your IDE (if it's Canopy/Spyder, for example), there may be cleaner ways of running timeit, such as using your existing definitions of calc1 and calc2 directly.
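For instance (a minimal sketch, assuming calc1 and calc2 are defined exactly as in your program), timeit also accepts a callable, so you can time your existing functions directly:
import timeit

def calc1(x):
    return x*x+x+1

def calc2(x):
    return x*x+x+1

x = 42  # fixed example input; use whatever value you want to test
exec1 = timeit.timeit(lambda: calc1(x), number=1000000)
exec2 = timeit.timeit(lambda: calc2(x), number=1000000)
print("Execution of first loop: {}".format(exec1))
print("Execution of second loop: {}".format(exec2))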
The code below takes around 15 seconds to get the result, but when I run it sequentially it only takes around 11 seconds. What can be the reason for this?
import multiprocessing
import os
import time

def square(x):
    # print(os.getpid())
    return x*x

if __name__=='__main__':
    start_time = time.time()
    p = multiprocessing.Pool()
    r = range(100000000)
    p1 = p.map(square,r)
    end_time = time.time()
    print('time_taken::',end_time-start_time)
Sequential code
start_time = time.time()
d = list(map(square,range(100000000)))
end_time = time.time()
Regarding your code example, there are two important factors which influence runtime performance gains achievable by parallelization:
First, you have to take the administrative overhead into account: spawning new processes is rather expensive compared to simple arithmetic operations, so you only gain performance once the computation's complexity exceeds a certain threshold, which is not the case in your example above.
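As a rough illustration (a minimal sketch, not a rigorous benchmark): timing a trivial function both ways shows the process and communication overhead dominating, much like in the question.
import time
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    n = 10**7
    t0 = time.time()
    list(map(square, range(n)))   # plain sequential map
    t1 = time.time()
    with Pool() as p:
        p.map(square, range(n))   # each item is pickled, sent to a worker, and sent back
    t2 = time.time()
    print("sequential:", t1 - t0)
    print("parallel:  ", t2 - t1)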
Secondly, you have to think of a "clever" way of splitting your computation into parts that can be executed independently. In the given code example, you can optimize the chunks you pass to the worker processes created by multiprocessing.Pool, so that each process gets a self-contained package of computations to perform.
E.g., this could be accomplished with the following modifications of your code:
import math
from multiprocessing import Pool

def square(x):
    return x ** 2

def square_chunk(i, j):
    return list(map(square, range(i, j)))

def calculate_in_parallel(n, c=4):
    """Calculates a list of squares in a parallelized manner"""
    result = []
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_results = p.starmap(
            square_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
        for res in partial_results:
            result += res
    return result
Please note that I used the operation x**2 (instead of the heavily optimized x*x) to increase the load and underline the resulting runtime differences.
Here, the Pool's starmap() function is used, which unpacks the arguments of the passed tuples, so we can effectively pass more than one argument to the mapped function. Furthermore, the workload is distributed evenly across the available cores: each worker computes the squares for the range (i, min(i + step, n)), where step denotes the chunk size, calculated as the maximum number divided by the CPU count.
By running the code with different parametrizations, one can clearly see that the performance gain grows as the maximum number (denoted n) increases. As expected, using more cores in parallel also reduces the runtime.
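A hypothetical driver along these lines (a sketch that assumes it lives in the same module as the calculate_in_parallel code above; the sizes are arbitrary) makes the comparison visible:
import time

if __name__ == "__main__":
    for n in (10**5, 10**7):
        t0 = time.time()
        parallel_result = calculate_in_parallel(n, c=4)
        t1 = time.time()
        sequential_result = [x ** 2 for x in range(n)]
        t2 = time.time()
        assert parallel_result == sequential_result   # both produce the same list of squares
        print("n={}: parallel {:.2f}s, sequential {:.2f}s".format(n, t1 - t0, t2 - t1))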
Edit:
As @KellyBundy pointed out, parallelism especially shines when you minimize not only the input to the worker processes but the output as well. Several measurements calculating the sum of the squared numbers (sum(map(square, range(i, j)))) instead of returning (and concatenating) lists showed an even larger improvement in runtime, as the following figure illustrates.
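A sketch of that variant (sum_chunk and sum_in_parallel are names I made up; the chunking mirrors the code above):
import math
from multiprocessing import Pool

def square(x):
    return x ** 2

def sum_chunk(i, j):
    # Each worker returns a single integer, so almost nothing has to be
    # pickled and sent back to the parent process.
    return sum(map(square, range(i, j)))

def sum_in_parallel(n, c=4):
    step = math.ceil(n / c)
    with Pool(c) as p:
        partial_sums = p.starmap(
            sum_chunk, [(i, min(i + step, n)) for i in range(0, n, step)]
        )
    return sum(partial_sums)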
The difference in C++ is huge, but not in Python. I used similar code in C++, and the result is very different: integer comparison is 20-30 times faster than string comparison.
Here is my example code:
import random, time

rand_nums = []
rand_strs = []
total_num = 1000000
for i in range(total_num):
    randint = random.randint(0,total_num*10)
    randstr = str(randint)
    rand_nums.append(randint)
    rand_strs.append(randstr)

start = time.time()
for i in range(total_num-1):
    b = rand_nums[i+1]>rand_nums[i]
end = time.time()
print("integer compare:",end-start) # 0.14269232749938965 seconds

start = time.time()
for i in range(total_num-1):
    b = rand_strs[i+1]>rand_strs[i]
end = time.time() # 0.15730643272399902 seconds
print("string compare:",end-start)
I can't explain why it's so slow in C++, but in Python the reason is clear from your test code: the random strings usually differ in the first byte, so the comparison time for those cases should be pretty much the same.
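You can see this directly with a small sketch (not from the original post): strings that differ at the very first character compare much faster than strings that share a long common prefix.
import timeit

# Differ at the first character: the comparison bails out immediately.
fast = timeit.timeit("a > b",
                     setup="a = '1' + '0' * 1000; b = '2' + '0' * 1000",
                     number=1000000)
# Share a 1000-character prefix: the comparison has to scan much further.
slow = timeit.timeit("a > b",
                     setup="a = '0' * 1000 + '1'; b = '0' * 1000 + '2'",
                     number=1000000)
print("differ at first char:", fast)
print("long common prefix:  ", slow)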
Also, note that much of your overhead is in the loop control and list accesses. You'd get a much more accurate measure if you removed those factors by zipping the lists:
for s1, s2 in zip(rand_strs, rand_strs[1:]):
    b = s1 > s2
The difference in C++ is huge, but not in Python.
The time spent in the comparison is minimal compared to the rest of the loop in Python. The actual comparison operation is implemented in Python's standard library C code, while the loop will execute through the interpreter.
As a test, you can run this code that performs all the same operations as the string comparison loop, except without the comparison:
start = time.time()
for i in range(total_num-1):
    b = rand_strs[i+1], rand_strs[i]
end = time.time()
print("no compare:",end-start)
The times are pretty close to each other, though for me string comparison is always the slowest of the three loops:
integer compare: 1.2947499752044678
string compare: 1.3821675777435303
no compare: 1.3093421459197998
I was just experimenting with some code and I found something that makes no sense to me.
>>> import timeit
>>> timeit.timeit("524288000/1024/1024")
0.05489620000000173
>>> timeit.timeit("524288000//1024//1024")
0.030612500000017917
>>>
Using // in calculations is faster than using /.
But when I repeated it, these were the results:
>>> timeit.timeit("524288000//1024//1024")
0.02494899999999234
>>> timeit.timeit("524288000/1024/1024")
0.02480830000001788
and now / is faster than //, which makes no sense to me.
Why is this?
Edit:
These are the results of the experiment with the number of repetitions set to 10000:
avg for /: 0.0261193088
avg for //: 0.025788395899999896
When you time a function, you take the difference between the time when the instruction finished and the time when it started, but a lot happens under the hood besides the algorithm you're timing.
Try reading some books about operating systems and you'll understand this better.
To do this kind of experiment you should repeat the measurement thousands of times to smooth out the variation.
Try the code below, but if you want to do real experiments, change the loop value to something larger:
import timeit

loops = 100

oneSlashAvg = 0
for i in range(loops):
    oneSlashAvg += timeit.timeit("524288000/1024/1024")
print(oneSlashAvg/loops)

doubleSlashAvg = 0
for i in range(loops):
    doubleSlashAvg += timeit.timeit("524288000//1024//1024")
print(doubleSlashAvg/loops)
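Another option (a sketch, not part of the answer above): let timeit.repeat do the repetition and take the minimum of several runs, which is less sensitive to background load.
import timeit

one_slash = min(timeit.repeat("524288000/1024/1024", repeat=5, number=1000000))
double_slash = min(timeit.repeat("524288000//1024//1024", repeat=5, number=1000000))
print(one_slash)
print(double_slash)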
I am trying to make a sensor using a BeagleBone Black (BBB) and Python. I need to get as much data as possible per second from the sensor. The code below allows me to collect about 100,000 data points per second.
import Adafruit_BBIO_GPIO as GPIO
import time

GPIO.setup("P8_13", GPIO.IN)

def get_data(n):
    my_list = []
    start_time = time.time()
    for i in range(n):
        my_list.append(GPIO.input("P8_13"))
    end_time = time.time() - start_time
    print "Time: {}".format(end_time)
    return my_list

n = 100000
get_data(n)
If n = 1,000,000, it takes around 10 seconds to populate my_list, which is the same rate as when n = 100,000 and the time is 1 second.
I decided to try Cython to get better results, since I've heard it can significantly speed up Python code. I followed the basic Cython tutorial: I created a data.pyx file with the Python code above, then created a setup.py and, finally, built the Cython extension.
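(For reference, the build script from the basic Cython tutorial is essentially the following, with the module name taken from my description above:)
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("data.pyx"))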
Unfortunately, that did not help me at all. So, I am wondering if I am using Cython inappropriately or in this case, when there are no "heavy math computations", Cython cannot help too much. Any suggestions on how to speed up my code are highly appreciated!
You can start by adding a static type declaration:
import Adafruit_BBIO_GPIO as GPIO
import time

GPIO.setup("P8_13", GPIO.IN)

def get_data(int n): # declared as an int
    my_list = []
    start_time = time.time()
    for i in range(n):
        my_list.append(GPIO.input("P8_13"))
    end_time = time.time() - start_time
    print "Time: {}".format(end_time)
    return my_list

n = 100000
get_data(n)
This allows the loop itself to be converted into a pure C loop, with the disadvantage that n is no longer arbitrary precision (so if you try to pass a value larger than about 2 billion, you'll get undefined behavior). This can be mitigated by changing int to unsigned long long, which allows values up to 2**64 - 1, or around 18 quintillion. The unsigned qualifier means you won't be able to pass a negative value.
You'll get a much more substantial speed boost if you can eliminate the list. Try replacing it with an array. Cython can work more efficiently with arrays than with lists.
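A rough sketch of that idea in plain Python (get_data_array is a made-up name, and GPIO is assumed to be imported and set up as in your code above): preallocate a C-backed integer buffer from the array module and fill it by index instead of appending to a list.
from array import array

def get_data_array(n):
    # Preallocate a buffer of n C ints instead of growing a Python list.
    buf = array('i', [0]) * n
    for i in range(n):
        buf[i] = GPIO.input("P8_13")
    return buf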
I tried your same code, but with a different build of Adafruit_BBIO; the one-million count takes only about 3 seconds to run on my Rev C board.
I thought the main change in the board from Rev B to Rev C was that the eMMC was increased from 2 GB to 4 GB.
If you go and get the current Adafruit_BBIO, all you have to change in your code above is the first import statement: it should be import Adafruit_BBIO.GPIO as GPIO.
What have you tried next?
Ron
I have implemented a naive merge sort algorithm in Python. The algorithm and test code are below:
import time
import random
import matplotlib.pyplot as plt
import math
from collections import deque

def sort(unsorted):
    if len(unsorted) <= 1:
        return unsorted
    to_merge = deque(deque([elem]) for elem in unsorted)
    while len(to_merge) > 1:
        left = to_merge.popleft()
        right = to_merge.popleft()
        to_merge.append(merge(left, right))
    return to_merge.pop()

def merge(left, right):
    result = deque()
    while left or right:
        if left and right:
            elem = left.popleft() if left[0] > right[0] else right.popleft()
        elif not left and right:
            elem = right.popleft()
        elif not right and left:
            elem = left.popleft()
        result.append(elem)
    return result

LOOP_COUNT = 100
START_N = 1
END_N = 1000

def test(fun, test_data):
    start = time.clock()
    for _ in xrange(LOOP_COUNT):
        fun(test_data)
    return time.clock() - start

def run_test():
    timings, elem_nums = [], []
    test_data = random.sample(xrange(100000), END_N)
    for i in xrange(START_N, END_N):
        loop_test_data = test_data[:i]
        elapsed = test(sort, loop_test_data)
        timings.append(elapsed)
        elem_nums.append(len(loop_test_data))
        print "%f s --- %d elems" % (elapsed, len(loop_test_data))
    plt.plot(elem_nums, timings)
    plt.show()

run_test()
As far as I can see everything is OK, and I should get a nice N*logN curve as a result. But the picture differs a bit:
Things I've tried to investigate the issue:
PyPy. The curve is ok.
Disabled the GC using the gc module. Wrong guess. Debug output showed that it doesn't even run until the end of the test.
Memory profiling using meliae - nothing special or suspicious.
I had another implementation (a recursive one using the same merge function), and it acts in a similar way. The more full test cycles I run, the more "jumps" there are in the curve.
So how can this behaviour be explained and - hopefully - fixed?
UPD: changed lists to collections.deque
UPD2: added the full test code
UPD3: I use Python 2.7.1 on Ubuntu 11.04, on a quad-core 2 GHz notebook. I tried to turn off most of the other processes: the number of spikes went down, but at least one of them was still there.
You are simply picking up the impact of other processes on your machine.
You run your sort function 100 times for input size 1 and record the total time spent on this. Then you run it 100 times for input size 2, and record the total time spent. You continue doing so until you reach input size 1000.
Let's say once in a while your OS (or you yourself) start doing something CPU-intensive. Let's say this "spike" lasts as long as it takes you to run your sort function 5000 times. This means that the execution times would look slow for 5000 / 100 = 50 consecutive input sizes. A while later, another spike happens, and another range of input sizes look slow. This is precisely what you see in your chart.
I can think of one way to avoid this problem. Run your sort function just once for each input size: 1, 2, 3, ..., 1000. Repeat this process 100 times, using the same 1000 inputs (it's important, see explanation at the end). Now take the minimum time spent for each input size as your final data point for the chart.
That way, the spikes should affect each input size only a few times out of the 100 runs; and since you're taking the minimum, they will likely have no impact on the final chart at all.
If your spikes are really really long and frequent, you of course might want to increase the number of repetitions beyond the current 100 per input size.
Looking at your spikes, I notice the execution slows down by exactly a factor of 3 during a spike. I'm guessing the OS gives your Python process one slot out of three during high load. Whether or not my guess is correct, the approach I recommend should resolve the issue.
EDIT:
I realized that I didn't clarify one point in my proposed solution to your problem.
Should you use the same input in each of your 100 runs for a given input size? Or should you use 100 different (random) inputs?
Since I recommended taking the minimum of the execution times, the inputs should be the same (otherwise you'll get incorrect results, as you'll be measuring the best-case complexity instead of the average complexity!).
But when you take the same inputs, you create some noise in your chart since some inputs are simply faster than others.
So a better solution is to resolve the system load problem, without creating the problem of only one input per input size (this is obviously pseudocode):
import random
from collections import defaultdict

seed = 'choose whatever you like'
repeats = 4
inputs_per_size = 25

runtimes = defaultdict(lambda: float('inf'))

for r in range(repeats):
    random.seed(seed)
    for i in range(inputs_per_size):
        for n in range(1000):
            # generate_random_input() and get_execution_time() are placeholders
            # for your own input generator and timing wrapper.
            input = generate_random_input(size=n)
            execution_time = get_execution_time(input)
            if runtimes[(n, i)] > execution_time:
                runtimes[(n, i)] = execution_time

for n in range(1000):
    runtimes[n] = sum(runtimes[(n, i)] for i in range(inputs_per_size)) / inputs_per_size
I can reproduce the spikes using your code:
You should choose an appropriate timing function (time.time() vs. time.clock() -- from timeit import default_timer), the number of repetitions in a test (which determines how long each test takes), and the number of tests to take the minimal time from. This gives you better precision and less external influence on the results. Read the note from the timeit.Timer.repeat() docs:
It’s tempting to calculate mean and standard deviation from the result
vector and report these. However, this is not very useful. In a
typical case, the lowest value gives a lower bound for how fast your
machine can run the given code snippet; higher values in the result
vector are typically not caused by variability in Python’s speed, but
by other processes interfering with your timing accuracy. So the min()
of the result is probably the only number you should be interested in.
After that, you should look at the entire vector and apply common
sense rather than statistics.
The timeit module can choose appropriate parameters for you:
$ python -mtimeit -s 'from m import testdata, sort; a = testdata[:500]' 'sort(a)'
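The rough Python-API equivalent of that command line (a sketch: the module name m and the slice size come from the command above, and number is fixed here rather than auto-ranged the way the command-line interface does it):
import timeit

times = timeit.repeat(stmt='sort(a)',
                      setup='from m import testdata, sort; a = testdata[:500]',
                      repeat=5, number=100)
print(min(times))  # per the docs quoted above, the minimum is the number to look at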
Here's timeit-based performance curve:
The figure shows that sort() behavior is consistent with O(n*log(n)):
|------------------------------+-------------------|
| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 1.00 log2(N) + 1.25e-015     | N                 |
| 2.00 log2(N) + 5.31e-018     | N*N               |
| 1.19 log2(N) + 1.116         | N*log2(N)         |
| 1.37 log2(N) + 2.232         | N*log2(N)*log2(N) |
To generate the figure I've used make-figures.py:
$ python make-figures.py --nsublists 1 --maxn=0x100000 -s vkazanov.msort -s vkazanov.msort_builtin
where:
# adapt sorting functions for make-figures.py
def msort(lists):
    assert len(lists) == 1
    return sort(lists[0]) # `sort()` from the question

def msort_builtin(lists):
    assert len(lists) == 1
    return sorted(lists[0]) # builtin
Input lists are described here (note: the input is sorted, so the builtin sorted() function shows the expected O(N) performance).