I'm calculating the Euclidean distance between two vectors represented by tuples.
(u[0]-v[0])**2 + (u[1]-v[1])**2 + (u[3]-v[3])**2 ...
The hard-coded way of doing this is pretty fast. However, I would like to make no assumptions about the length of these vectors. That results in solutions like:
sum([(a-b)**2 for a, b in izip(u, v)]) # Faster without generator
or
sum = 0
for i in xrange(len(u)):
    sum += (u[i]-v[i])**2
which turn out to be much (at least twice) slower than the first version. Is there some smart way of optimizing this, without resorting to NumPy/SciPy? I'm aware that those packages offer the fastest way of doing such things, but at the moment, I'm more trying to get experience with optimizing "bare Python". What I found works fast is to dynamically build a string that defines the function and exec() it, but that's really a last resort, I would say...
The requirements:
CPython 2.7
Standard library only
"Real" (e.g. no exec()), pure Python
Even though my question is about the matter of small operations in general, you may assume in your solution that one of the vectors remains the same over several function calls.
mysum = 0
for a, b in izip(u, v):
    mysum += (a-b)**2
About 35% faster than #3
PS: have you tried Cython (not CPython) or Shedskin?
As I understand it, you don't really need to make the code faster; you just want to know why it's slower. To answer that, let's look at the disassembly. For the purposes of this discussion, I'm going to wrap each method in a function. The loading of u and v and the return instruction can be ignored in each disassembly.
def test1(u, v):
    return (u[0]-v[0])**2 + (u[1]-v[1])**2 + (u[3]-v[3])**2
dis.dis(test1)
0 LOAD_FAST 0 (u)
3 LOAD_CONST 1 (0)
6 BINARY_SUBSCR
7 LOAD_FAST 1 (v)
10 LOAD_CONST 1 (0)
13 BINARY_SUBSCR
14 BINARY_SUBTRACT
15 LOAD_CONST 2 (2)
18 BINARY_POWER
19 LOAD_FAST 0 (u)
22 LOAD_CONST 3 (1)
25 BINARY_SUBSCR
26 LOAD_FAST 1 (v)
29 LOAD_CONST 3 (1)
32 BINARY_SUBSCR
33 BINARY_SUBTRACT
34 LOAD_CONST 2 (2)
37 BINARY_POWER
38 BINARY_ADD
39 LOAD_FAST 0 (u)
42 LOAD_CONST 4 (3)
45 BINARY_SUBSCR
46 LOAD_FAST 1 (v)
49 LOAD_CONST 4 (3)
52 BINARY_SUBSCR
53 BINARY_SUBTRACT
54 LOAD_CONST 2 (2)
57 BINARY_POWER
58 BINARY_ADD
59 RETURN_VALUE
I cut the first example off at a length of 3 because it would just repeat the same pattern over and over. You can quickly see that there is no function call overhead and pretty much the interpreter is doing the minimum possible work on these operands to achieve your result.
def test2(u, v):
    return sum((a-b)**2 for a, b in izip(u, v))
dis.dis(test2)
0 LOAD_GLOBAL 0 (sum)
3 LOAD_CONST 1 (<code object <genexpr> at 02C6F458, file "<pyshell#10>", line 2>)
6 MAKE_FUNCTION 0
9 LOAD_GLOBAL 1 (izip)
12 LOAD_FAST 0 (u)
15 LOAD_FAST 1 (v)
18 CALL_FUNCTION 2
21 GET_ITER
22 CALL_FUNCTION 1
25 CALL_FUNCTION 1
28 RETURN_VALUE
What we see here is that we create a function out of the generator expression and load 2 globals (sum and izip). Global lookups are slower than local lookups; we can't avoid making them once, but if they're going to be called in a tight loop, many people assign them to a local, such as _izip or _sum. We then suffer 4 expensive bytecode operations in a row: calling izip, getting an iterator from the izip result, calling the function created for the generator expression, and then calling the sum function (which will consume the iterator and add each item before returning).
def test3(u, v):
    sum = 0
    for i in xrange(len(u)):
        sum += (u[i]-v[i])**2
dis.dis(test3)
0 LOAD_CONST 1 (0)
3 STORE_FAST 2 (sum)
6 SETUP_LOOP 52 (to 61)
9 LOAD_GLOBAL 0 (xrange)
12 LOAD_GLOBAL 1 (len)
15 LOAD_FAST 0 (u)
18 CALL_FUNCTION 1
21 CALL_FUNCTION 1
24 GET_ITER
25 FOR_ITER 32 (to 60)
28 STORE_FAST 3 (i)
31 LOAD_FAST 2 (sum)
34 LOAD_FAST 0 (u)
37 LOAD_FAST 3 (i)
40 BINARY_SUBSCR
41 LOAD_FAST 1 (v)
44 LOAD_FAST 3 (i)
47 BINARY_SUBSCR
48 BINARY_SUBTRACT
49 LOAD_CONST 2 (2)
52 BINARY_POWER
53 INPLACE_ADD
54 STORE_FAST 2 (sum)
57 JUMP_ABSOLUTE 25
60 POP_BLOCK
61 LOAD_CONST 0 (None)
64 RETURN_VALUE
What we see here is a more straightforward version of what is happening in test2. There is no generator expression or call to sum, but we've replaced that function call overhead with unnecessary calls to len and xrange, plus two subscript operations per iteration, by doing xrange(len(u)) instead of the faster solution suggested by @Lucas Malor.
def test4(u, v):
    mysum = 0
    for a, b in izip(u, v):
        mysum += (a-b)**2
    return mysum
dis.dis(test4)
0 LOAD_CONST 1 (0)
3 STORE_FAST 2 (mysum)
6 SETUP_LOOP 47 (to 56)
9 LOAD_GLOBAL 0 (izip)
12 LOAD_FAST 0 (u)
15 LOAD_FAST 1 (v)
18 CALL_FUNCTION 2
21 GET_ITER
22 FOR_ITER 30 (to 55)
25 UNPACK_SEQUENCE 2
28 STORE_FAST 3 (a)
31 STORE_FAST 4 (b)
34 LOAD_FAST 2 (mysum)
37 LOAD_FAST 3 (a)
40 LOAD_FAST 4 (b)
43 BINARY_SUBTRACT
44 LOAD_CONST 2 (2)
47 BINARY_POWER
48 INPLACE_ADD
49 STORE_FAST 2 (mysum)
52 JUMP_ABSOLUTE 22
55 POP_BLOCK
56 LOAD_FAST 2 (mysum)
59 RETURN_VALUE
The above represents @Lucas Malor's contribution, and it's faster in a few ways. It replaces subscript operations with unpacking while reducing the number of calls to 1. This is, in many cases, as fast as you're going to get with the constraints you've given us.
Note that it would only be worth evaluating a run-time generated string similar to the function in test1 if you were going to call the function enough times to merit the overhead. Note also that as the length of u and v becomes increasingly large (which is typically how algorithms of this type are evaluated) the function call overhead of the other solutions becomes increasingly small and therefore, in most cases, the most readable solution is vastly superior. At the same time, even though it's slower in small cases, if the length of your sequences, u and v, may be very long, I recommend a generator expression as opposed to a list comprehension. The memory savings will cause much faster execution in most cases (and faster gc).
Overall, my recommendation is that the tiny speedup for short sequences is just not worth the increase in code size, or the inconsistent behavior across other implementations of Python that you invite by performing micro-optimizations. The "best" solution is almost certainly test2.
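If you want to check these trade-offs on your own machine, a minimal timing sketch could look like the one below. It is written in Python 3 syntax (an assumption on my part; there zip plays the role izip plays in 2.7), the function names dist_index, dist_zip and dist_genexpr are made up for illustration, and the default-argument trick for binding globals to locals is just one common way to do it. Absolute numbers will vary with interpreter version and hardware.

```python
# Sketch in Python 3 syntax: zip here plays the role izip plays in Python 2.7.
import timeit


def dist_index(u, v):
    # Index-based loop, as in test3.
    total = 0
    for i in range(len(u)):
        total += (u[i] - v[i]) ** 2
    return total


def dist_zip(u, v):
    # Loop over paired elements, as in test4.
    total = 0
    for a, b in zip(u, v):
        total += (a - b) ** 2
    return total


def dist_genexpr(u, v, _sum=sum, _zip=zip):
    # Generator expression as in test2, with the globals bound to local
    # names through default arguments (a hypothetical variation).
    return _sum((a - b) ** 2 for a, b in _zip(u, v))


if __name__ == "__main__":
    u = (1.0, 2.0, 3.0, 4.0)
    v = (4.0, 3.0, 2.0, 1.0)
    for func in (dist_index, dist_zip, dist_genexpr):
        seconds = timeit.timeit(lambda: func(u, v), number=100000)
        print("%-12s %.3f s" % (func.__name__, seconds))
```

Try it with both short and long tuples; the ranking for 4 elements is not necessarily the ranking for 4000.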
I found out something weird.
I defined two test functions as such:
def with_brackets(n=10000):
    d = dict()
    for i in range(n):
        d["hello"] = i

def with_setitem(n=10000):
    d = dict()
    st = d.__setitem__
    for i in range(n):
        st("hello", i)
One would expect the two functions to be roughly the same execution speed. However:
>>> timeit(with_brackets, number=1000)
0.6558860000222921
>>> timeit(with_setitem, number=1000)
0.9857697170227766
There is possibly something I missed, but it does seem like setitem is almost twice as long, and I don't really understand why. Isn't dict[key] = x supposed to call __setitem__?
(Using CPython 3.9)
Edit: Using timeit instead of time
Isn't dict[key] = x supposed to call __setitem__?
Strictly speaking, no. Running both your functions through dis.dis, we get (I am only including the for loop):
>>> dis.dis(with_brackets)
...
>> 22 FOR_ITER 12 (to 36)
24 STORE_FAST 3 (i)
5 26 LOAD_FAST 0 (n)
28 LOAD_FAST 1 (d)
30 LOAD_CONST 1 ('hello')
32 STORE_SUBSCR
34 JUMP_ABSOLUTE 22
...
Vs
>>> dis.dis(with_setitem)
...
>> 28 FOR_ITER 14 (to 44)
30 STORE_FAST 4 (i)
6 32 LOAD_FAST 2 (setitem)
34 LOAD_CONST 1 ('hello')
36 LOAD_FAST 0 (n)
38 CALL_FUNCTION 2
40 POP_TOP
42 JUMP_ABSOLUTE 28
...
Using __setitem__ explicitly involves a function call (note CALL_FUNCTION and POP_TOP instead of just STORE_SUBSCR; that's the difference under the hood). Function calls add some overhead, so the bracket syntax compiles to more efficient bytecode.
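To reproduce the comparison yourself, something like the following sketch should do. The two functions are the ones from the question, except that I added a return so the results can be compared; the repeat count of 200 is arbitrary. The exact ratio will depend on your CPython version.

```python
import timeit


def with_brackets(n=10000):
    d = dict()
    for i in range(n):
        d["hello"] = i      # compiles to a single STORE_SUBSCR
    return d


def with_setitem(n=10000):
    d = dict()
    st = d.__setitem__      # bound method, looked up once
    for i in range(n):
        st("hello", i)      # goes through the generic call machinery
    return d


if __name__ == "__main__":
    for func in (with_brackets, with_setitem):
        print(func.__name__, timeit.timeit(func, number=200))
```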
I have this function in Python:
digit_sum = 0
while number > 0:
    digit_sum += (number % 10)
    number = number // 10
For determining the time complexity, I applied the following logic:
Line 1: 1 basic operation (assignment), gets executed 1 time so gets a value of 1
Line 2: 2 basic operations (reading the variable 'number' and comparing against zero), gets executed n+1 times so gets a value of 2*(n+1)
Line 3: 4 basic operations (reading the variable 'number', %10, computing the sum, and assignment), gets executed n times so gets a value of 4*n
Line 4: 3 basic operations (reading the variable 'number', //10 and assignment), gets executed n times so gets a value of 3*n
This brings me to a total of 1 + 2n+2 + 4n + 3n = 9n+3
But my textbook has a solution of 8n+3. Where did I go wrong in my logic?
Thanks,
Alex
When talking about complexity, all you really care about is asymptotic complexity. Here, that is O(n). The 8 or 9 or 42 doesn't really matter, especially as there is no way for you to know it.
Thus counting "operations" is pointless. It exposes the architectural details of the underlying processor (be it an actual hardware processor or an interpreter). The only way to get the "real" count of operations would be to look at a specific implementation (for instance, CPython 3.6.0) and see the bytecode it generates from your program.
Here is what my CPython 2.7.12 does:
>>> def test(number):
... digit_sum = 0
... while number > 0:
... digit_sum += (number % 10)
... number = number // 10
...
>>> import dis
>>> dis.dis(test)
2 0 LOAD_CONST 1 (0)
3 STORE_FAST 1 (digit_sum)
3 6 SETUP_LOOP 40 (to 49)
>> 9 LOAD_FAST 0 (number)
12 LOAD_CONST 1 (0)
15 COMPARE_OP 4 (>)
18 POP_JUMP_IF_FALSE 48
4 21 LOAD_FAST 1 (digit_sum)
24 LOAD_FAST 0 (number)
27 LOAD_CONST 2 (10)
30 BINARY_MODULO
31 INPLACE_ADD
32 STORE_FAST 1 (digit_sum)
5 35 LOAD_FAST 0 (number)
38 LOAD_CONST 2 (10)
41 BINARY_FLOOR_DIVIDE
42 STORE_FAST 0 (number)
45 JUMP_ABSOLUTE 9
>> 48 POP_BLOCK
>> 49 LOAD_CONST 0 (None)
52 RETURN_VALUE
I'll let you draw your own conclusions as to what you want to actually count as a basic operation. The Python interpreter interprets bytecodes one after the other, so arguably you have 15 "basic operations" inside your loop. That's the closest you can get to a meaningful number. Still, every operation in there has a different runtime, so that 15 carries no valuable information.
Also, keep in mind this is specific to CPython 2.7.12. It's very likely another version will generate something else, taking advantage of new bytecodes that might make it possible to express some operations in a simpler way.
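If you want to see how the count changes on your own interpreter, here is a rough sketch using dis.get_instructions, which exists in Python 3.4 and later (so not in the 2.7.12 used above). The helper name instruction_count is made up, and I added a return to the function so it can be checked. Note this counts every instruction in the function, not just the ones inside the loop.

```python
import dis


def digit_sum(number):
    # Same loop as in the question, with a return added so it can be checked.
    total = 0
    while number > 0:
        total += number % 10
        number = number // 10
    return total


def instruction_count(func):
    # Number of bytecode instructions the running interpreter compiled func to.
    return sum(1 for _ in dis.get_instructions(func))


if __name__ == "__main__":
    print(instruction_count(digit_sum))
```

Running this under two different CPython releases is a quick way to convince yourself that the constant in front of n is not a property of the algorithm.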
Consider the following Python 2 code
from timeit import default_timer
def floor():
    for _ in xrange(10**7):
        1 * 12 // 39 * 2 // 39 * 23 - 234

def normal():
    for _ in xrange(10**7):
        1 * 12 / 39 * 2 / 39 * 23 - 234
t1 = default_timer()
floor()
t2 = default_timer()
normal()
t3 = default_timer()
print 'Floor %.3f' % (t2 - t1)
print 'Normal %.3f' % (t3 - t2)
And the output, on my computer, is
Floor 0.254
Normal 1.766
So, why is the floor division operator // faster than the normal division operator / when both of them are doing the same thing?
The Python interpreter is pre-calculating the expression inside the loop in floor, but not in normal.
Here's the code for floor:
>>> dis.dis(floor)
5 0 SETUP_LOOP 24 (to 27)
3 LOAD_GLOBAL 0 (xrange)
6 LOAD_CONST 9 (10000000)
9 CALL_FUNCTION 1
12 GET_ITER
>> 13 FOR_ITER 10 (to 26)
16 STORE_FAST 0 (_)
6 19 LOAD_CONST 15 (-234)
22 POP_TOP
23 JUMP_ABSOLUTE 13
>> 26 POP_BLOCK
>> 27 LOAD_CONST 0 (None)
30 RETURN_VALUE
You can see that the expression is already calculated LOAD_CONST 15 (-234).
Here's the same for normal:
>>> dis.dis(normal)
9 0 SETUP_LOOP 44 (to 47)
3 LOAD_GLOBAL 0 (xrange)
6 LOAD_CONST 9 (10000000)
9 CALL_FUNCTION 1
12 GET_ITER
>> 13 FOR_ITER 30 (to 46)
16 STORE_FAST 0 (_)
10 19 LOAD_CONST 10 (12)
22 LOAD_CONST 5 (39)
25 BINARY_DIVIDE
26 LOAD_CONST 6 (2)
29 BINARY_MULTIPLY
30 LOAD_CONST 5 (39)
33 BINARY_DIVIDE
34 LOAD_CONST 7 (23)
37 BINARY_MULTIPLY
38 LOAD_CONST 8 (234)
41 BINARY_SUBTRACT
42 POP_TOP
43 JUMP_ABSOLUTE 13
>> 46 POP_BLOCK
>> 47 LOAD_CONST 0 (None)
50 RETURN_VALUE
This time, the calculation is only partially simplified (e.g. the initial 1 * is omitted), and most of the operations are performed at runtime.
It looks like Python 2.7 doesn't do constant folding on expressions containing the ambiguous / operator (which may be integer or float division depending on its operands). Adding from __future__ import division at the top of the program causes the constant to be folded in normal just as it was in floor (although the result is different, of course, since / is now float division).
normal
10 0 SETUP_LOOP 24 (to 27)
3 LOAD_GLOBAL 0 (xrange)
6 LOAD_CONST 9 (10000000)
9 CALL_FUNCTION 1
12 GET_ITER
>> 13 FOR_ITER 10 (to 26)
16 STORE_FAST 0 (_)
11 19 LOAD_CONST 15 (-233.6370808678501)
22 POP_TOP
23 JUMP_ABSOLUTE 13
>> 26 POP_BLOCK
>> 27 LOAD_CONST 0 (None)
30 RETURN_VALUE
It's not like the interpreter couldn't do the constant folding with the default / operator, but it doesn't. Perhaps the code was back-ported from Python 3, and it wasn't considered important to make it work with the ambiguous division operator.
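A quick way to check whether a given interpreter folded an expression, without reading the disassembly by eye, is to scan the instructions for leftover BINARY_* opcodes. This is only a sketch for Python 3.4+ (dis.get_instructions does not exist in 2.7); the names folded, not_folded and has_binary_ops are mine, and I use return so the expression is not discarded as a dead statement.

```python
import dis


def folded():
    # Only literals: the compiler is free to precompute this value.
    return 1 * 12 // 39 * 2 // 39 * 23 - 234


def not_folded(a, b):
    # Operands are variables, so the work has to happen at run time.
    return a * 12 // b * 2 // b * 23 - 234


def has_binary_ops(func):
    # True if any BINARY_* instruction (arithmetic or subscript) survives
    # in the compiled bytecode. Opcode names differ between versions
    # (BINARY_FLOOR_DIVIDE, BINARY_OP, ...), hence the prefix check.
    return any(ins.opname.startswith("BINARY_")
               for ins in dis.get_instructions(func))


if __name__ == "__main__":
    print("folded:", has_binary_ops(folded))
    print("not_folded:", has_binary_ops(not_folded))
```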
You can examine the compiled bytecode of a particular Python function using the dis module:
def floor():
    12 // 39

def normal():
    12 / 39
>>> dis.dis(floor)
2 0 LOAD_CONST 3 (0)
3 POP_TOP
4 LOAD_CONST 0 (None)
7 RETURN_VALUE
>>> dis.dis(normal)
2 0 LOAD_CONST 1 (12)
3 LOAD_CONST 2 (39)
6 BINARY_DIVIDE
7 POP_TOP
8 LOAD_CONST 0 (None)
11 RETURN_VALUE
"produce the same result" doesn't imply "implemented the same way".
Also note that these operators don't always produce the same result, as explained here:
Why Python's Integer Division Floors
So performance measurement is pretty much implementation dependent.
Usually, hardware floating-point division takes longer than integer division.
It might be that Python's classic division (which you call normal) is implemented with hardware floating-point division and truncated back to an integer only in the final stage, while floor division (which you call floor) is implemented with hardware integer division, which is a lot faster.
I think I'll start at primary school. When you were learning how to add, subtract and multiply, you could easily do it by counting on your fingers, and you would multiply by adding several times. However, when you were learning to divide, you probably ran into more annoying algorithms like long division, which take multiple steps of integer-dividing numbers and their factors until you are left with nothing or something with no divisors.
This is because it's genuinely harder to divide a number than to multiply, add or subtract numbers and we often have to do multiple operations that estimate the division getting closer and closer to the true value. This algorithm will often perform many more steps after the decimal place to find the 10th, 100th etc. decimal place, and requires an operation for each position. (There are more efficient algorithms for this, but they all require more time to find more decimal places.)
Therefore, if we instead do integer division, we can halt the algorithm after it finds the value in the ones position. This means it can avoid the 'infinite' other decimal places making it a lot more efficient. (I used quotations around infinite as the algorithm generally has a stop point after a certain number of positions or it finds the point where values repeat endlessly as any rational number has one of these).
Halting this algorithm early makes it a lot faster. There is also less information necessary to find the answer (as everything after the decimal point is unimportant), so it's probably possible to find a more efficient algorithm to solve the problem.
I am looking into the performance of loop-like structures in Python and found the following statements:
Besides the syntactic benefit of list comprehensions, they are often
as fast or faster than equivalent use of map.
(Performance Tips)
List comprehensions run a bit faster than equivalent for-loops (unless
you're just going to throw away the result).
(Python Speed)
I am wondering what difference under the hood gives list comprehension this advantage. Thanks.
Test one: throwing away the result.
Here's our dummy function:
def examplefunc(x):
    pass
And here are our challengers:
def listcomp_throwaway():
    [examplefunc(i) for i in range(100)]

def forloop_throwaway():
    for i in range(100):
        examplefunc(i)
I won't do an analysis of its raw speed, only of why, per the OP's question. Let's take a look at the diff of the bytecode.
--- List comprehension
+++ For loop
@@ -1,15 +1,16 @@
- 55 0 BUILD_LIST 0
+ 59 0 SETUP_LOOP 30 (to 33)
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (100)
9 CALL_FUNCTION 1
12 GET_ITER
- >> 13 FOR_ITER 18 (to 34)
+ >> 13 FOR_ITER 16 (to 32)
16 STORE_FAST 0 (i)
- 19 LOAD_GLOBAL 1 (examplefunc)
+
+ 60 19 LOAD_GLOBAL 1 (examplefunc)
22 LOAD_FAST 0 (i)
25 CALL_FUNCTION 1
- 28 LIST_APPEND 2
- 31 JUMP_ABSOLUTE 13
- >> 34 POP_TOP
- 35 LOAD_CONST 0 (None)
- 38 RETURN_VALUE
+ 28 POP_TOP
+ 29 JUMP_ABSOLUTE 13
+ >> 32 POP_BLOCK
+ >> 33 LOAD_CONST 0 (None)
+ 36 RETURN_VALUE
The race is on. Listcomp's first move is to build an empty list, while for loop's is to set up a loop. Both of them then proceed to load the global range and the constant 100, and call range. Then they both get an iterator over the result, get the next item, and store it into the variable i. Then they load examplefunc and i and call examplefunc. Listcomp appends the result to the list and starts the loop over again. For loop discards the result (POP_TOP) and jumps back to the top. When the loop ends, listcomp pops the finished list and for loop pops its loop block; then they both load None and return it.
So who seems better in this analysis? Here, list comprehension does some redundant operations such as building the list and appending to it, if you don't care about the result. For loop is pretty efficient too.
If you time them, using a for loop is about one-third faster than a list comprehension. (In this test, examplefunc divided its argument by five and threw it away instead of doing nothing at all.)
Test two: Keeping the result like normal.
No dummy function this test. So here are our challengers:
def listcomp_normal():
    l = [i*5 for i in range(100)]

def forloop_normal():
    l = []
    for i in range(100):
        l.append(i*5)
The diff isn't of any use to us this time, so here is the bytecode of each in its own block.
List comp's bytecode:
55 0 BUILD_LIST 0
3 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (100)
9 CALL_FUNCTION 1
12 GET_ITER
>> 13 FOR_ITER 16 (to 32)
16 STORE_FAST 0 (i)
19 LOAD_FAST 0 (i)
22 LOAD_CONST 2 (5)
25 BINARY_MULTIPLY
26 LIST_APPEND 2
29 JUMP_ABSOLUTE 13
>> 32 STORE_FAST 1 (l)
35 LOAD_CONST 0 (None)
38 RETURN_VALUE
For loop's bytecode:
59 0 BUILD_LIST 0
3 STORE_FAST 0 (l)
60 6 SETUP_LOOP 37 (to 46)
9 LOAD_GLOBAL 0 (range)
12 LOAD_CONST 1 (100)
15 CALL_FUNCTION 1
18 GET_ITER
>> 19 FOR_ITER 23 (to 45)
22 STORE_FAST 1 (i)
61 25 LOAD_FAST 0 (l)
28 LOAD_ATTR 1 (append)
31 LOAD_FAST 1 (i)
34 LOAD_CONST 2 (5)
37 BINARY_MULTIPLY
38 CALL_FUNCTION 1
41 POP_TOP
42 JUMP_ABSOLUTE 19
>> 45 POP_BLOCK
>> 46 LOAD_CONST 0 (None)
49 RETURN_VALUE
As you can probably already tell, the list comprehension has fewer instructions than the for loop does.
List comprehension's checklist:
Build an anonymous empty list.
Load range.
Load 100.
Call range.
Get the iterator.
Get the next item on that iterator.
Store that item onto i.
Load i.
Load the integer five.
Multiply times five.
Append to the list.
Repeat from "Get the next item" through "Append to the list" until range is exhausted.
Point l to the resulting list.
For loop's checklist:
Build an anonymous empty list.
Point l to the anonymous empty list.
Setup a loop.
Load range.
Load 100.
Call range.
Get the iterator.
Get the next item on that iterator.
Store that item onto i.
Load the list l.
Load the attribute append on that list.
Load i.
Load the integer five.
Multiply times five.
Call append.
Discard append's return value (POP_TOP).
Jump back to the top of the loop (JUMP_ABSOLUTE).
(Not including these steps: Load None, return it.)
The list comprehension doesn't have to do these things:
Look up append on the list every iteration; it appends through the dedicated LIST_APPEND instruction instead.
Call append as a method, a wrapper that in turn appends to the list.
Discard append's return value before jumping back to the top.
In conclusion, listcomp is a lot faster if you are going to use the values, but if you don't it's pretty slow.
Real speeds
Test one: for loop is faster by about one-third*
Test two: list comprehension is faster by about two-thirds*
*About -> accurate to the second decimal place
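For anyone who wants to rerun test two, here is a small timing sketch. It is Python 3 syntax (the bytecode above is from Python 2, so the instruction listings will not match exactly), the functions return their lists so the results can be compared, and the repeat count is arbitrary. I make no promise that you will see the same two-thirds figure.

```python
import timeit


def listcomp_normal():
    return [i * 5 for i in range(100)]


def forloop_normal():
    result = []
    for i in range(100):
        result.append(i * 5)
    return result


if __name__ == "__main__":
    for func in (listcomp_normal, forloop_normal):
        print(func.__name__, timeit.timeit(func, number=100000))
```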
I just ran across the disassembler function in Python, but I couldn't make out what its output means. Can anyone explain how it works and what it is used for, based on the results for the factorial function (recursive and loop-based)?
The recursive code and the corresponding dis code:
>>> def fact(n):
...     if n==1:
...         return 1
...     return n*fact(n-1)
...
>>> dis.dis(fact)
2 0 LOAD_FAST 0 (n)
3 LOAD_CONST 1 (1)
6 COMPARE_OP 2 (==)
9 POP_JUMP_IF_FALSE 16
3 12 LOAD_CONST 1 (1)
15 RETURN_VALUE
4 >> 16 LOAD_FAST 0 (n)
19 LOAD_GLOBAL 0 (fact)
22 LOAD_FAST 0 (n)
25 LOAD_CONST 1 (1)
28 BINARY_SUBTRACT
29 CALL_FUNCTION 1
32 BINARY_MULTIPLY
33 RETURN_VALUE
And the factorial function using loop gives the following result:
>>> def factor(n):
...     f=1
...     while n>1:
...         f*=n
...         n-=1
...
>>> dis.dis(factor)
2 0 LOAD_CONST 1 (1)
3 STORE_FAST 1 (f)
3 6 SETUP_LOOP 36 (to 45)
>> 9 LOAD_FAST 0 (n)
12 LOAD_CONST 1 (1)
15 COMPARE_OP 4 (>)
18 POP_JUMP_IF_FALSE 44
4 21 LOAD_FAST 1 (f)
24 LOAD_FAST 0 (n)
27 INPLACE_MULTIPLY
28 STORE_FAST 1 (f)
5 31 LOAD_FAST 0 (n)
34 LOAD_CONST 1 (1)
37 INPLACE_SUBTRACT
38 STORE_FAST 0 (n)
41 JUMP_ABSOLUTE 9
>> 44 POP_BLOCK
>> 45 LOAD_CONST 0 (None)
48 RETURN_VALUE
Can anyone tell me how to determine which one is faster?
To measure how fast something is running, use the timeit module, which comes with Python.
The dis module is used to get some idea of what the bytecode may look like, and it's very specific to CPython.
One use of it is to see what, when and how storage is assigned for variables in a loop or method. However, this is a specialized module that is not normally used for efficiency calculations; use timeit to figure out how fast something is, and then dis to get an understanding of what is going on under the hood - to arrive at a possible why.
It's impossible to determine which one will be faster simply by looking at the bytecode; each VM has a different cost associated with each opcode and so runtimes can vary widely.
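As a rough sketch of the timeit approach for the two factorial functions: the code below uses the definitions from the question, with a return added to the loop version so its result can be compared (the original returns None). The argument 20 and the repeat count are arbitrary choices of mine.

```python
import timeit


def fact(n):
    # Recursive version from the question.
    if n == 1:
        return 1
    return n * fact(n - 1)


def factor(n):
    # Loop version from the question; the return is added here so the
    # result can be compared (the original returns None).
    f = 1
    while n > 1:
        f *= n
        n -= 1
    return f


if __name__ == "__main__":
    for func in (fact, factor):
        seconds = timeit.timeit(lambda: func(20), number=100000)
        print("%-8s %.3f s" % (func.__name__, seconds))
```

Once timeit tells you which one is faster, dis can help you guess why, for example by showing the CALL_FUNCTION in every recursive step.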
The dis.dis() function disassembles a function into its bytecode interpretation.
Timing
As stated by Ignacio, the pure length of the bytecode does not accurately represent the running time, due to differences in how Python interpreters actually run each opcode. The timeit module is what you want to use there.
Actual Purpose
There are several uses of this function, but they are not things that most people would end up doing. You can look at the output as part of the process of optimizing or debugging speed issues. It would also likely prove useful when working directly on the Python interpreter, or writing your own. You can look at the documentation here to see a full list of the opcodes (though, just as that page will state, it's perfectly likely to change between versions of Python).
Overall, this is not something you'd really use much in a production application (unless your application is a Python disassembler!), but when you really, really need to optimize your code and debug at the lowest level, this is where the function comes in handy.