In which scenarios is it useful to use disassembly in Python? - python

The dis module can be used to disassemble Python methods, functions and classes into low-level interpreter instructions.
I know that dis information can be used to:
1. Find race conditions in programs that use threads
2. Find possible optimizations
From your experience, do you know any other scenarios where Python's disassembly feature could be useful?

dis is useful, for example, when you have two pieces of code doing the same thing and you wonder where the performance difference lies.
Example: list += [item] vs list.append(item)

def f(x): return 2*x

def f1(fun, nums):
    result = []
    for item in nums:
        result += [fun(item)]
    return result

def f2(fun, nums):
    result = []
    for item in nums:
        result.append(fun(item))
    return result
timeit.timeit says that f2(f, range(100)) is approximately twice as fast as f1(f, range(100)). Why?
(Interestingly f2 is roughly as fast as map(f, range(100)) is.)
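If you want to reproduce the measurement, here is a minimal sketch (my addition, not the original answer's; absolute numbers depend on your machine and Python version):

import timeit

# assumes f, f1 and f2 are defined in the same script, as above
setup = 'from __main__ import f, f1, f2'
print(timeit.timeit('f1(f, range(100))', setup=setup, number=10000))
print(timeit.timeit('f2(f, range(100))', setup=setup, number=10000))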
f1
You can see the whole output of dis by calling dis.dis(f1); here is just line 4.
 4          19 LOAD_FAST                2 (result)
            22 LOAD_FAST                1 (fun)
            25 LOAD_FAST                3 (item)
            28 CALL_FUNCTION            1
            31 BUILD_LIST               1
            34 INPLACE_ADD
            35 STORE_FAST               2 (result)
            38 JUMP_ABSOLUTE           13
       >>   41 POP_BLOCK
f2
Again, here is only line 4:
 4          19 LOAD_FAST                2 (result)
            22 LOAD_ATTR                0 (append)
            25 LOAD_FAST                1 (fun)
            28 LOAD_FAST                3 (item)
            31 CALL_FUNCTION            1
            34 CALL_FUNCTION            1
            37 POP_TOP
            38 JUMP_ABSOLUTE           13
       >>   41 POP_BLOCK
Spot the difference
In f1 we need to:
Call fun on item (opcode 28)
Make a list out of it (opcode 31, expensive!)
Add it to result (opcode 34)
Store the returned value in result (opcode 35)
In f2, instead, we just:
Call fun on item (opcode 31)
Call append on result (opcode 34; C code: fast!)
This explains why the (imho) more expressive list += [item] is much slower than the list.append() method.
Other than that, dis.dis is mainly useful out of curiosity, and for trying to reconstruct code from .pyc files you don't have the source for, without spending a fortune :)

I see the dis module as being, essentially, a learning tool. Understanding what opcodes a certain snippet of Python code generates is a start to getting more "depth" to your grasp of Python -- rooting the "abstract" understanding of its semantics into a sample of (a bit more) concrete implementation. Sometimes the exact reason a certain Python snippet behaves the way it does may be hard to grasp "top-down" with pure reasoning from the "rules" of Python semantics: in such cases, reinforcing the study with some "bottom-up" verification (based on a possible implementation, of course -- other implementations would also be possible;-) can really help the study's effectiveness.
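As a small concrete illustration of that bottom-up verification (my example, not the original answer's): disassembling a tuple swap shows that CPython never actually builds a tuple for it.

import dis

def swap(a, b):
    a, b = b, a
    return a, b

dis.dis(swap)
# On CPython up to 3.10 the swap line compiles to ROT_TWO (a pure
# stack rotation) rather than BUILD_TUPLE/UNPACK_SEQUENCE -- something
# that is hard to deduce from the language rules alone.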

For day-to-day Python programming, not much. However, it is useful if you want to find out why doing something one way is faster than another way. I've also sometimes used it to figure out exactly how the interpreter handles some obscure bits of code. But really, I come up with a practical use-case for it very infrequently.
On the other hand, if your goal is to understand python rather than just being able to program in it, then it is an invaluable tool. For instance, ever wonder how function definition works? Here you go:
>>> from dis import dis
>>> def f():
...     def foo(x=[1, 2, 3]):
...         y = [4,]
...         return x + y
...
>>> dis(f)
 2           0 LOAD_CONST               1 (1)
             3 LOAD_CONST               2 (2)
             6 LOAD_CONST               3 (3)
             9 BUILD_LIST               3
            12 LOAD_CONST               4 (<code object foo at 0xb7690770, file "<stdin>", line 2>)
            15 MAKE_FUNCTION            1
            18 STORE_FAST               0 (foo)
            21 LOAD_CONST               0 (None)
            24 RETURN_VALUE
You can see that this happens by pushing the constants 1, 2, and 3 onto the stack, building a list from them (the default argument), loading the compiled code object for foo, making a function out of that code object, and storing the result in the variable foo.
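If you also want to look inside foo's body, you can disassemble the nested code object too. A sketch of mine (recent CPython versions will even recurse into nested code objects automatically when you call dis.dis(f)):

import dis

def f():
    def foo(x=[1, 2, 3]):
        y = [4,]
        return x + y

# fish the nested code object out of f's constants and disassemble it
foo_code = next(c for c in f.__code__.co_consts if hasattr(c, 'co_code'))
dis.dis(foo_code)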

Related

Why is a=a*100 almost two times faster than a*=100? [duplicate]

Following the question about chaining the *= and += operators, and the good comment from Tom Wojcik ("Why would you assume aaa *= 200 is faster than aaa = aaa * 200?"), I tested it in a Jupyter notebook:
%%timeit aaa = np.arange(1, 101, 1)
aaa *= 100

%%timeit aaa = np.arange(1, 101, 1)
aaa = aaa * 100
And I was surprised, because the first test takes longer than the second one: 1530 ns and 952 ns, respectively. Why are these values so different?
TL;DR: this question comes down to the performance difference between an in-place binop (INPLACE_*, as in aaa *= 100) and a plain binop (BINARY_*, as in aaa = aaa * 100). The difference can be seen by using the dis module:
import numpy as np
import dis

aaa = np.arange(1, 101, 1)
dis.dis('''
for i in range(1000000):
    aaa *= 100
''')
 3          14 LOAD_NAME                2 (aaa)
            16 LOAD_CONST               1 (100)
            18 INPLACE_MULTIPLY
            20 STORE_NAME               2 (aaa)
            22 JUMP_ABSOLUTE           10
       >>   24 POP_BLOCK
       >>   26 LOAD_CONST               2 (None)
            28 RETURN_VALUE
dis.dis('''
for i in range(1000000):
    aaa = aaa * 100
''')
 3          14 LOAD_NAME                2 (aaa)
            16 LOAD_CONST               1 (100)
            18 BINARY_MULTIPLY
            20 STORE_NAME               2 (aaa)
            22 JUMP_ABSOLUTE           10
       >>   24 POP_BLOCK
       >>   26 LOAD_CONST               2 (None)
            28 RETURN_VALUE
Then, back to your question: which one is actually faster?
Unfortunately, it's hard to say in general, and here's why:
You can check compile.c in the CPython source directly. If you trace a bit into the CPython code, here is the difference in the call paths:
inplace_binop -> compiler_augassign -> compiler_visit_stmt
binop -> compiler_visit_expr1 -> compiler_visit_expr -> compiler_visit_kwonlydefaults
Since the function calls and logic are different, there are tons of factors (including your input size(*), your CPU, etc.) that can matter for performance as well; you'll need to profile to optimize your code based on your use case.
*: from the comments, you can check this post to see how performance varies with input size.
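To see how input size alone can shift the result, here is a rough sketch (mine, not the answerer's; numbers will vary): the in-place version pays a fixed overhead per statement but avoids allocating a temporary array, so it tends to win as the array grows.

import timeit
import numpy as np

for n in (100, 1000000):
    # float dtype so repeated multiplication doesn't overflow integers
    setup = 'import numpy as np; aaa = np.arange(1, %d, dtype=float)' % n
    t_inplace = timeit.timeit('aaa *= 1.0001', setup=setup, number=1000)
    t_binary = timeit.timeit('aaa = aaa * 1.0001', setup=setup, number=1000)
    print(n, t_inplace, t_binary)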
The += operator appeared in the C language in the 1970s and, in line with C's "smart assembler" idea, corresponds to a distinctly different machine instruction and addressing mode:
a = a * 100 and a *= 100 produce the same effect, but at a low level they correspond to different ways for the processor to work.
a *= 100 means:
  find the place identified by a
  multiply it by 100 in place
a = a * 100 means:
  evaluate a * 100:
    find the place identified by a
    copy it into an accumulator
    multiply the accumulator by 100
  store the result in a:
    find the place identified by a
    copy the accumulator to it
Python is coded in C and inherited this syntax from C, but since there is no translation/optimization step before execution in an interpreted language, the two spellings are not necessarily so intimately related. However, an interpreter can dispatch to different execution routines for the different forms of the expression, taking advantage of different machine code depending on how the expression is formed and on the evaluation context.
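At the Python level the same split shows up in which hook gets called: a *= 100 tries __imul__ first (so a mutable type can update itself in place), while a = a * 100 always goes through __mul__ and rebinds the name to a new object. A small illustrative sketch (my own, not from the answer):

class Tracked:
    def __init__(self, value):
        self.value = value

    def __imul__(self, n):
        print('__imul__: updating in place')    # a *= 100 lands here
        self.value *= n
        return self

    def __mul__(self, n):
        print('__mul__: building a new object')  # a = a * 100 lands here
        return Tracked(self.value * n)

a = Tracked(5)
a *= 100       # the same object is mutated
b = Tracked(5)
b = b * 100    # a brand-new object is bound to b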

Does python create an object for string constants in equality comparisons?

In Python, for comparisons like the one below, does Python create a temporary object for the string constant "help" and then continue with the equality comparison? Would the object be GC'ed at some later point?
s1 = "nohelp"
if s1 == "help":
# Blah Blah
String literals, like all Python constants, are created during compile time, when the source code is translated to byte code. And because all Python strings are immutable the interpreter can re-use the same string object if it encounters the same string literal in multiple places. It can even do that if the literal string is created via concatenation of literals, but not if the string is built by concatenating a string literal to an existing string object.
Here's a short demo that creates a few identical strings inside and outside of functions. It also dumps the disassembled byte code of one of the functions.
from __future__ import print_function
from dis import dis

def f1(s):
    a = "help"
    print('f1', id(s), id(a))
    return s > a

def f2(s):
    a = "help"
    print('f2', id(s), id(a))
    return s > a

a = "help"
print(id(a))
print(f1("he" + "lp"))
b = "h"
print(f2(b + "elp"))

print("\nf1")
dis(f1)
typical output on a 32 bit machine running Python 2.6.6
3073880672
f1 3073880672 3073880672
False
f2 3073636576 3073880672
False

f1
 26           0 LOAD_CONST               1 ('help')
              3 STORE_FAST               1 (a)

 27           6 LOAD_GLOBAL              0 (print)
              9 LOAD_CONST               2 ('f1')
             12 LOAD_GLOBAL              1 (id)
             15 LOAD_FAST                0 (s)
             18 CALL_FUNCTION            1
             21 LOAD_GLOBAL              1 (id)
             24 LOAD_FAST                1 (a)
             27 CALL_FUNCTION            1
             30 CALL_FUNCTION            3
             33 POP_TOP

 28          34 LOAD_FAST                0 (s)
             37 LOAD_FAST                1 (a)
             40 COMPARE_OP               4 (>)
             43 RETURN_VALUE
Note that the ids of all the "help" strings are identical, apart from the one constructed with b + "elp".
(BTW, Python will concatenate adjacent string literals, so instead of writing "he" + "lp" I could've written "he" "lp", or even "he""lp").
The string literals themselves are not freed until the process is cleaning itself up at termination, however a string like b would be GC'ed if it went out of scope.
Note that in CPython (standard Python) when objects are GC'ed their memory is returned to Python's allocation system for recycling, not to the OS. Python does return unneeded memory to the OS, but only in special circumstances. See Releasing memory in Python and Why doesn't memory get released to system after large queries (or series of queries) in django?
Another question that discusses this topic: Why strings object are cached in python

Python Dictionary vs If Statement Speed

I have found a few links talking about switch cases being faster in C++ than if/else because they can be optimized at compilation. I then found some suggestions that using a dictionary may be faster than an if statement. However, most of those conversations are about someone's work and just end up concluding that you should optimize other parts of the code first, and that it won't matter unless you're doing millions of if/else checks. Can anyone explain why this is?
Say I have 100 unique numbers that are going to be streamed into a Python program constantly. I want to check which number it is, then execute something. So I could either do a ton of if/else checks, or I could put each number in a dictionary. For argument's sake, let's say it's a single thread.
Does someone understand the layer between Python and the low-level execution who can explain how this works?
Thanks :)
However, most of those conversations are about someone's work and just
end up concluding that you should optimize other parts of the code first,
and that it won't matter unless you're doing millions of if/else checks.
Can anyone explain why this is?
Generally, you should only bother to optimize code if you really need to, i.e. if the program's performance is unusably slow.
If this is the case, you should use a profiler to determine which parts are actually causing the most problems. For Python, the cProfile module is pretty good for this.
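For example, a minimal sketch (with a hypothetical workload standing in for a real program):

import cProfile

def main():
    # hypothetical workload; replace with your real entry point
    return sum(i * i for i in range(100000))

cProfile.run('main()')   # prints call counts and cumulative times per function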
Does someone understand the layer between Python and the low-level
execution who can explain how this works?
If you want to get an idea of how your code executes, take a look at the dis module.
A quick example...
import dis

# Here are the things we might want to do
def do_something_a():
    print 'I did a'

def do_something_b():
    print 'I did b'

def do_something_c():
    print 'I did c'

# Case 1
def f1(x):
    if x == 1:
        do_something_a()
    elif x == 2:
        do_something_b()
    elif x == 3:
        do_something_c()

# Case 2
FUNC_MAP = {1: do_something_a, 2: do_something_b, 3: do_something_c}

def f2(x):
    FUNC_MAP[x]()

# Show how the functions execute
print 'Case 1'
dis.dis(f1)
print '\n\nCase 2'
dis.dis(f2)
...which outputs...
Case 1
 18           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               2 (==)
              9 POP_JUMP_IF_FALSE       22

 19          12 LOAD_GLOBAL              0 (do_something_a)
             15 CALL_FUNCTION            0
             18 POP_TOP
             19 JUMP_FORWARD            44 (to 66)

 20     >>   22 LOAD_FAST                0 (x)
             25 LOAD_CONST               2 (2)
             28 COMPARE_OP               2 (==)
             31 POP_JUMP_IF_FALSE       44

 21          34 LOAD_GLOBAL              1 (do_something_b)
             37 CALL_FUNCTION            0
             40 POP_TOP
             41 JUMP_FORWARD            22 (to 66)

 22     >>   44 LOAD_FAST                0 (x)
             47 LOAD_CONST               3 (3)
             50 COMPARE_OP               2 (==)
             53 POP_JUMP_IF_FALSE       66

 23          56 LOAD_GLOBAL              2 (do_something_c)
             59 CALL_FUNCTION            0
             62 POP_TOP
             63 JUMP_FORWARD             0 (to 66)
        >>   66 LOAD_CONST               0 (None)
             69 RETURN_VALUE


Case 2
 29           0 LOAD_GLOBAL              0 (FUNC_MAP)
              3 LOAD_FAST                0 (x)
              6 BINARY_SUBSCR
              7 CALL_FUNCTION            0
             10 POP_TOP
             11 LOAD_CONST               0 (None)
             14 RETURN_VALUE
...so it's pretty easy to see which function has to execute the most instructions.
As for which is actually faster, that's something you'd have to check by profiling the code.
The if/elif/else structure compares the key it was given to a sequence of possible values one by one until it finds a match in the condition of some if statement, then reads what it is supposed to execute from inside the if block. This can take a long time, because so many checks (n/2 on average, for n possible values) have to be made for every lookup.
The reason that a sequence of if statements is more difficult to optimize than a switch statement is that the condition checks (what's inside the parens in C++) might conceivably change the state of some variable that's involved in the next check, so you have to do them in order. The restrictions on switch statements remove that possibility, so the order doesn't matter (I think).
Python dictionaries are implemented as hash tables. The idea is this: if you could deal with arbitrarily large numbers and had infinite RAM, you could create a huge array of function pointers that is indexed just by casting whatever your lookup value is to an integer and using that as the index. Lookup would be virtually instantaneous.
You can't do that, of course, but you can create an array of some manageable length, pass the lookup value to a hash function (which generates some integer, depending on the lookup value), then % your result with the length of your array to get an index within the bounds of that array. That way, lookup takes as much time as is needed to call the hash function once, take the modulus, and jump to an index. If the amount of different possible lookup values is large enough, the overhead of the hash function becomes negligible compared to those n/2 condition checks.
(Actually, since many different lookup values will inevitably map to the same index, it's not quite that simple. You have to check for and resolve possible conflicts, which can be done in a number of ways. Still, the gist of it is as described above.)
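To make the gist concrete, here is a toy sketch of that idea (mine; CPython's real dict is far more sophisticated, with open addressing, perturbation, and resizing):

# a deliberately naive hash table: hash, wrap into the array, jump
table_size = 8
slots = [None] * table_size

def insert(key, value):
    index = hash(key) % table_size   # hash, then wrap into array bounds
    slots[index] = (key, value)      # real tables must resolve collisions here

def lookup(key):
    index = hash(key) % table_size   # one hash + one modulus, no scanning
    entry = slots[index]
    if entry is not None and entry[0] == key:
        return entry[1]
    raise KeyError(key)

insert(42, 'do_something_a')
print(lookup(42))                    # 'do_something_a'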

Finding the variables (read or write)

I'd like to develop a small debugging tool for Python programs. In dynamic slicing, how can I find the variables that are accessed in a statement, and the type of access (read or write) for each of those variables (in Python)?
Write: a statement can change the program state.
Read: a statement can read the program state.
For example, in these 4 lines we have:
(1) x = a + b => writes {x} & reads {a, b}
(2) y = 6 => writes {y} & reads {}
(3) while (n > 1) => writes {} & reads {n}
(4) n = n - 1 => writes {n} & reads {n}
Not sure what your goal is. Perhaps dis is what you're looking for?
>>> import dis
>>> dis.dis("x=a+b")
 1           0 LOAD_NAME                0 (a)
             3 LOAD_NAME                1 (b)
             6 BINARY_ADD
             7 STORE_NAME               2 (x)
            10 LOAD_CONST               0 (None)
            13 RETURN_VALUE
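Going a step further than eyeballing the output, you can classify names programmatically by their opcodes. A sketch of mine, assuming Python 3.4+ where dis.get_instructions exists:

import dis

def accesses(stmt):
    """Rough read/write classification: it only looks at LOAD_*/STORE_*
    opcodes on plain names, so it misses attributes, subscripts, etc."""
    reads, writes = set(), set()
    for ins in dis.get_instructions(stmt):
        if ins.opname in ('LOAD_NAME', 'LOAD_GLOBAL', 'LOAD_FAST'):
            reads.add(ins.argval)
        elif ins.opname in ('STORE_NAME', 'STORE_GLOBAL', 'STORE_FAST'):
            writes.add(ins.argval)
    return reads, writes

print(accesses('x = a + b'))   # ({'a', 'b'}, {'x'})
print(accesses('n = n - 1'))   # ({'n'}, {'n'})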

Python import X or from X import Y? (performance)

If there is a library from which I'm going to use at least two methods, is there any difference in performance or memory usage between the following?
from X import method1, method2
and
import X
There is a difference, because with the import x version there are two name lookups at each call site: one for the module name and a second for the function name; with from x import y, on the other hand, you have only one lookup.
You can see this quite well, using the dis module:
import dis
import random

def f_1():
    random.seed()

dis.dis(f_1)

  0 LOAD_GLOBAL              0 (random)
  3 LOAD_ATTR                0 (seed)
  6 CALL_FUNCTION            0
  9 POP_TOP
 10 LOAD_CONST               0 (None)
 13 RETURN_VALUE

from random import seed

def f_2():
    seed()

dis.dis(f_2)

  0 LOAD_GLOBAL              0 (seed)
  3 CALL_FUNCTION            0
  6 POP_TOP
  7 LOAD_CONST               0 (None)
 10 RETURN_VALUE
As you can see, using the form from x import y makes the call itself a bit faster.
On the other hand, the import statement itself is less expensive for import x than for from x import y, because there is one lookup fewer; let's look at the disassembled code:
def f_3():
    import random

dis.dis(f_3)

  0 LOAD_CONST               1 (-1)
  3 LOAD_CONST               0 (None)
  6 IMPORT_NAME              0 (random)
  9 STORE_FAST               0 (random)
 12 LOAD_CONST               0 (None)
 15 RETURN_VALUE

def f_4():
    from random import seed

dis.dis(f_4)

  0 LOAD_CONST               1 (-1)
  3 LOAD_CONST               2 (('seed',))
  6 IMPORT_NAME              0 (random)
  9 IMPORT_FROM              1 (seed)
 12 STORE_FAST               0 (seed)
 15 POP_TOP
 16 LOAD_CONST               0 (None)
 19 RETURN_VALUE
I do not know the exact reason, but it seems the form from x import y does more work at import time (note the extra IMPORT_FROM and POP_TOP), and is therefore more expensive than you might anticipate; for this reason, if the imported function is used only once, import x may come out faster overall, while if it is used more than once, from x import y becomes faster.
That said, as usual, I would suggest you not base your decision on how to import modules and functions on this knowledge, because this is just premature optimization.
Personally, I think that in a lot of cases explicit namespaces are much more readable, and I would suggest you do the same: use your own sense of aesthetics :-)
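If you want to measure the call-site difference yourself, here is a quick sketch (mine; the numbers depend entirely on your machine and Python version):

import timeit

# one module-attribute lookup per call vs. a single global lookup
print(timeit.timeit('random.random()', setup='import random'))
print(timeit.timeit('random()', setup='from random import random'))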
There is no memory or speed difference (the whole module has to be evaluated either way, because the last line could be Y = something_else). Unless your computer is from the 1980s, it doesn't matter anyway.
It can matter if you are calling a function many times in a loop (millions of times or more). Doing the double dictionary lookup will eventually accumulate. The example below shows roughly a 20% difference.
Times quoted are for Python 3.4 on a 64-bit Win7 machine. (Change the range command to xrange for Python 2.7.)
This example is based closely on the book High Performance Python, although their third example, of local function lookups being better still, no longer seemed to hold for me.
import math
from math import sin

def tight_loop_slow(iterations):
    """
    >>> %timeit tight_loop_slow(10000000)
    1 loops, best of 3: 3.2 s per loop
    """
    result = 0
    for i in range(iterations):
        # this call to sin requires two dictionary lookups
        result += math.sin(i)

def tight_loop_fast(iterations):
    """
    >>> %timeit tight_loop_fast(10000000)
    1 loops, best of 3: 2.56 s per loop
    """
    result = 0
    for i in range(iterations):
        # this call to sin requires only one lookup
        result += sin(i)
I don't believe there's any real difference, and generally worrying about that little amount of memory isn't typically worth it. If you're going to be pressing memory considerations, it will far more likely be in your code.
