Python Module Initialization Order? - python

I am a Python newbie coming from a C++ background. While I know it's not Pythonic to try to find a matching concept using my old C++ knowledge, I think this question is still a general question to ask:
Under C++, there is a well known problem called global/static variable initialization order fiasco, due to C++'s inability to decide which global/static variable would be initialized first across compilation units, thus a global/static variable depending on another one in different compilation units might be initialized earlier than its dependency counterparts, and when dependant started to use the services provided by the dependency object, we would have undefined behavior. Here I don't want to go too deep on how C++ solves this problem. :)
On the Python world, I do see uses of global variables, even across different .py files, and one typycal usage case I saw was: initialize one global object in one .py file, and on other .py files, the code just fearlessly start using the global object, assuming that it must have been initialized somewhere else, which under C++ is definitely unaccept by myself, due to the problem I specified above.
I am not sure if the above use case is common practice in Python (Pythonic), and how does Python solve this kind of global variable initialization order problem in general?

Under C++, there is a well known problem called global/static variable initialization order fiasco, due to C++'s inability to decide which global/static variable would be initialized first across compilation units,
I think that statement highlights a key difference between Python and C++: in Python, there is no such thing as different compilation units. What I mean by that is, in C++ (as you know), two different source files might be compiled completely independently from each other, and thus if you compare a line in file A and a line in file B, there is nothing to tell you which will get placed first in the program. It's kind of like the situation with multiple threads: you cannot say whether a particular statement in thread 1 will be executed before or after a particular statement in thread 2. You could say C++ programs are compiled in parallel.
In contrast, in Python, execution begins at the top of one file and proceeds in a well-defined order through each statement in the file, branching out to other files at the points where they are imported. In fact, you could almost think of the import directive as an #include, and in that way you could identify the order of execution of all the lines of code in all the source files in the program. (Well, it's a little more complicated than that, since a module only really gets executed the first time it's imported, and for other reasons.) If C++ programs are compiled in parallel, Python programs are interpreted serially.
Your question also touches on the deeper meaning of modules in Python. A Python module - which is everything that is in a single .py file - is an actual object. Everything declared at "global" scope in a single source file is actually an attribute of that module object. There is no true global scope in Python. (Python programmers often say "global" and in fact there is a global keyword in the language, but it always really refers to the top level of the current module.) I could see that being a bit of a strange concept to get used to coming from a C++ background. It took some getting used to for me, coming from Java, and in this respect Java is a lot more similar to Python than C++ is. (There is also no global scope in Java)
I will mention that in Python it is perfectly normal to use a variable without having any idea whether it has been initialized/defined or not. Well, maybe not normal, but at least acceptable under appropriate circumstances. In Python, trying to use an undefined variable raises a NameError; you don't get arbitrary behavior as you might in C or C++, so you can easily handle the situation. You may see this pattern:
try:
duck.quack()
except NameError:
pass
which does nothing if duck does not exist. Actually, what you'll more commonly see is
try:
duck.quack()
except AttributeError:
pass
which does nothing if duck does not have a method named quack. (AttributeError is the kind of error you get when you try to access an attribute of an object, but the object does not have any attribute by that name.) This is what passes for a type check in Python: we figure that if all we need the duck to do is quack, we can just ask it to quack, and if it does, we don't care whether it's really a duck or not. (It's called duck typing ;-)

Python import executes new Python modules from beginning to end. Subsequent imports only result in a copy of the existing reference in sys.modules, even if still in the middle of importing the module due to a circular import. Module attributes ("global variables" are actually at the module scope) that have been initialized before the circular import will exist.
main.py:
import a
a.py:
var1 = 'foo'
import b
var2 = 'bar'
b.py:
import a
print a.var1 # works
print a.var2 # fails

Related

Python execution from bytecode [duplicate]

I am trying to understand how Python works (because I use it all the time!). To my understanding, when you run something like python script.py, the script is converted to bytecode and then the interpreter/VM/CPython–really just a C Program–reads in the python bytecode and executes the program accordingly.
How is this bytecode read in? Is it similar to how a text file is read in C? I am unsure how the Python code is converted to machine code. Is it the case that the Python interpreter (the python command in the CLI) is really just a precompiled C program that is already converted to machine code and then the python bytecode files are just put through that program? In other words, is my Python program never actually converted into machine code? Is the python interpreter already in machine code, so my script never has to be?
Yes, your understanding is correct. There is basically (very basically) a giant switch statement inside the CPython interpreter that says "if the current opcode is so and so, do this and that".
http://hg.python.org/cpython/file/3.3/Python/ceval.c#l790
Other implementations, like Pypy, have JIT compilation, i.e. they translate Python to machine codes on the fly.
If you want to see the bytecode of some code (whether source code, a live function object or code object, etc.), the dis module will tell you exactly what you need. For example:
>>> dis.dis('i/3')
1 0 LOAD_NAME 0 (i)
3 LOAD_CONST 0 (3)
6 BINARY_TRUE_DIVIDE
7 RETURN_VALUE
The dis docs explain what each bytecode means. For example, LOAD_NAME:
Pushes the value associated with co_names[namei] onto the stack.
To understand this, you have to know that the bytecode interpreter is a virtual stack machine, and what co_names is. The inspect module docs have a nice table showing the most important attributes of the most important internal objects, so you can see that co_names is an attribute of code objects which holds a tuple of names of local variables. In other words, LOAD_NAME 0 pushes the value associated with the 0th local variable (and dis helpfully looks this up and sees that the 0th local variable is named 'i').
And that's enough to see that a string of bytecodes isn't enough; the interpreter also needs the other attributes of the code object, and in some cases attributes of the function object (which is also where the locals and globals environments come from).
The inspect module also has some tools that can help you further in investigating live code.
This is enough to figure out a lot of interesting stuff. For example, you probably know that Python figures out at compile time whether a variable in a function is local, closure, or global, based on whether you assign to it anywhere in the function body (and on any nonlocal or global statements); if you write three different functions and compare their disassembly (and the relevant other attributes) you can pretty easily figure out exactly what it must be doing.
(The one bit that's tricky here is understanding closure cells. To really get this, you will need to have 3 levels of functions, to see how the one in the middle forwards things along for the innermost one.)
To understand how the bytecode is interpreted and how the stack machine works (in CPython), you need to look at the ceval.c source code. The answers by thy435 and eyquem already cover this.
Understanding how pyc files are read only takes a bit more information. Ned Batchelder has a great (if slightly out-of-date) blog post called The structure of .pyc files, that covers all of the tricky and not-well-documented parts. (Note that in 3.3, some of the gory code related to importing has been moved from C to Python, which makes it much easier to follow.) But basically, it's just some header info and the module's code object, serialized by marshal.
To understand how source gets compiled to bytecode, that's the fun part.
Design of CPython's Compiler explains how everything works. (Some of the other sections of the Python Developer's Guide are also useful.)
For the early stuff—tokenizing and parsing—you can just use the ast module to jump right to the point where it's time to do the actual compiling. Then see compile.c for how that AST gets turned into bytecode.
The macros can be a bit tough to work through, but once you grasp the idea of how the compiler uses a stack to descend into blocks, and how it uses those compiler_addop and friends to emit bytecodes at the current level, it all makes sense.
One thing that surprises most people at first is the way functions work. The function definition's body is compiled into a code object. Then the function definition itself is compiled into code (inside the enclosing function body, module, etc.) that, when executed, builds a function object from that code object. (Once you think about how closures must work, it's obvious why it works that way. Each instance of the closure is a separate function object with the same code object.)
And now you're ready to start patching CPython to add your own statements, right? Well, as Changing CPython's Grammar shows, there's a lot of stuff to get right (and there's even more if you need to create new opcodes). You might find it easier to learn PyPy as well as CPython, and start hacking on PyPy first, and only come back to CPython once you know that what you're doing is sensible and doable.
Having read the answer of thg4535, I am sure you will find interesting the following explanations on ceval.c : Hello, ceval.c!
This article is part of a series written by Yaniv Aknin whose I'm sort of a fan: Python's Innards
When we run the python programs: 1_python source code compile with Cpython to the bytecode (bytecode is the binary file with .pyc format which seralize with marshal and it is set of stack structures that solve with pvm) 2_then the pvm (python virtual machine/python interpreter) is stackbase machine (the machine which solve task with stack data structure) which loop inside bytecode line by line and execute it.
What executes the bytecode?
The bytecode tells the Python interpreter which C code to execute.

Why doesn't Python spot errors before execution?

Suppose I have the following code in Python:
a = "WelcomeToTheMachine"
if a == "DarkSideOfTheMoon":
awersdfvsdvdcvd
print "done!"
Why doesn't this error? How does it even compile? In Java or C#, this would get spotted during compilation.
Python isn't a compiled language, that's why your code doesn't throw compilation errors.
Python is a byte code interpreted language. Technically the source code gets "compiled" to byte code, but then the byte code is just in time (JIT) compiled if using PyPy or Pyston otherwise it's line by line interpreted.
The workflow is as follows :
Your Python Code -> Compiler -> .pyc file -> Interpreter -> Your Output
Using the standard python runtime What does all this mean? Essentially all the heavy work happens during runtime, unlike with C or C++ where the source code in it's entirety is analyzed and translated to binary at compile time.
During "compiling", python pretty much only checks your syntax. Since awersdfvsdvdcvd is a valid identifier, no error is raised until that line actually gets executed. Just because you use a name which wasn't defined doesn't mean that it couldn't have been defined elsewhere... e.g.:
globals()['awersdfvsdvdcvd'] = 1
earlier in the file would be enough to suppress the NameError that would occur if the line with the misspelled name was executed.
Ok, so can't python just look for globals statements as well? The answer to that is again "no" -- From module "foo", I can add to the globals of module "bar" in similar ways. And python has no way of knowing what modules are or will be imported until it's actually running (I can dynamically import modules at runtime too).
Note that most of the reasons that I'm mentioning for why Python as a language can't give you a warning about these things involve people doing crazy messed up things. There are a number of tools which will warn you about these things (making the assumption that you aren't going to do stupid stuff like that). My favorite is pylint, but just about any python linter should be able to warn you about undefined variables. If you hook a linter up to your editor, most of the time you can catch these bugs before you ever actually run the code.
Because Python is an interpreted language. This means that if Python's interpreter doesn't arrive to that line, it won't produce any error.
There's nothing to spot: It's not an "error" as far as Python-the-language is concerned. You have a perfectly valid Python program. Python is a dynamic language, and the identifiers you're using get resolved at runtime.
An equivalent program written in C#, Java or C++ would be invalid, and thus the compilation would fail, since in all those languages the use of an undefined identifier is required to produce a diagnostic to the user (i.e. a compile-time error). In Python, it's simply not known whether that identifier is known or not at compile time. So the code is valid. Think of it this way: in Python, having the address of a construction site (a name) doesn't require the construction to have even started yet. What's important is that by the time you use the address (name) as if there was a building there, there better be a building or else an exception is raised :)
In Python, the following happens:
a = "WelcomeToTheMachine" looks up the enclosing context (here: the module context) for the attribute a, and sets the attribute 'a' to the given string object stored in a pool of constants. It also caches the attribute reference so the subsequent accesses to a will be quicker.
if a == "DarkSideOfTheMoon": finds the a in the cache, and executes a binary comparison operator on object a. This ends up in builtins.str.__eq__. The value returned from this operator is used to control the program flow.
awersdfvsdvdcvd is an expression, whose value is the result of a lookup of the name 'awersdfvsdvdcvd'. This expression is evaluted. In your case, the name is not found in the enclosing contexts, and the lookup raises the NameError exception.
This exception propagates to the matching exception handler. Since the handler is outside of all the nested code blocks in the current module, the print function never gets a chance of being called. The Python's built-in exception handler signals the error to the user. The interpreter (a misnomer!) instance has nothing more to do. Since the Python process doesn't try to do anything else after the interpreter instance is done, it terminates.
There's absolutely nothing that says that the program will cause a runtime error. For example, awersdfvsdvdcvd could be set in an enclosing scope before the module is executed, and then no runtime error would be raised. Python allows fine control over the lifetime of a module, and your code could inject the value for awersdfvsdvdcvd after the module has been compiled, but before it got executed. It takes just a few lines of fairly straightforward code to do that.
This is, in fact, one of the many dynamic programming techniques that get used in Python programs. Their judicious use makes possible the kinds of functionality that C++ will not natively get in decades or ever, and that are very cumbersome in both C# and Java. Of course, Python has a performance cost - nothing is free.
If you like to get such problems highlighted at compilation time, there are tools you can easily integrate in an IDE that would spot this problem. E.g. PyCharm has a built-in static checker, and this error would be highlighted with the red squiggly line as expected.

How to share same variable between imported modules

There are two Python scripts: master.py and to_be_imported.py
Here is the master.py:
import os
os.foo = 12345
import to_be_imported
And here is the to_be_imported.py:
import os
if hasattr(os, 'foo'):
print 'os hasattr foo: %s'%os.foo
Now when I run master.py I get this:
os hasattr foo: 12345
indicating that the imported module to_be_imported.py picks up the variable declared inside the process that imported it (master.py).
While it works fine I would like to know why it works and also to make sure it is a safe practice.
If a module is already imported, subsequent imports to the module uses the cached version of the module. Even if you reference it via different names as in the following case
import os as a
import os as b
Both refer to the same os module that was imported the first time. So it is obvious that the variable assigned to a module will be shared.
You can verify it using the built-in python function id()
Nothing is a bad idea per se, but you must remember few things:
Modules are objects in Python. They are loaded only once and added to sys.modules. These objects can also be added attributes like regular objects (with no messy implementation of setattr).
Since they are objects, but not instantiable ones, you must consider them as singletons (they are singletons, after all), and you must consider the disadvantages and benefits of such model:
a. Singletons are only one object. Are you sure that accessing their attributes is concurrency-safe?
b. Modules are global objects. Are you sure you can track the whole behavior and access to their members? Are you sure you will be able to debug errors there?
Is the code something you will work with others?
While no idea is better than other, good practices tell us that using global variables is not well-seen, specially if we have a team to work with. On the other hand: if your code is concurrent and/or reentrant, avoid using global variables or relying on module attributes. OTOH you will have no problem assigning attributes like that. They will last for the life of your script execution.
This is not the place to chose the best alternative. Depending on how you state your problem, you can ask it either on programmers or codereview. You can chose many variants to share state without using global variables in modules, like passing those variables inside a state back and forth across arguments, or learning and using OOP. But, again, this site is no scope for that.

How exactly is Python Bytecode Run in CPython?

I am trying to understand how Python works (because I use it all the time!). To my understanding, when you run something like python script.py, the script is converted to bytecode and then the interpreter/VM/CPython–really just a C Program–reads in the python bytecode and executes the program accordingly.
How is this bytecode read in? Is it similar to how a text file is read in C? I am unsure how the Python code is converted to machine code. Is it the case that the Python interpreter (the python command in the CLI) is really just a precompiled C program that is already converted to machine code and then the python bytecode files are just put through that program? In other words, is my Python program never actually converted into machine code? Is the python interpreter already in machine code, so my script never has to be?
Yes, your understanding is correct. There is basically (very basically) a giant switch statement inside the CPython interpreter that says "if the current opcode is so and so, do this and that".
http://hg.python.org/cpython/file/3.3/Python/ceval.c#l790
Other implementations, like Pypy, have JIT compilation, i.e. they translate Python to machine codes on the fly.
If you want to see the bytecode of some code (whether source code, a live function object or code object, etc.), the dis module will tell you exactly what you need. For example:
>>> dis.dis('i/3')
1 0 LOAD_NAME 0 (i)
3 LOAD_CONST 0 (3)
6 BINARY_TRUE_DIVIDE
7 RETURN_VALUE
The dis docs explain what each bytecode means. For example, LOAD_NAME:
Pushes the value associated with co_names[namei] onto the stack.
To understand this, you have to know that the bytecode interpreter is a virtual stack machine, and what co_names is. The inspect module docs have a nice table showing the most important attributes of the most important internal objects, so you can see that co_names is an attribute of code objects which holds a tuple of names of local variables. In other words, LOAD_NAME 0 pushes the value associated with the 0th local variable (and dis helpfully looks this up and sees that the 0th local variable is named 'i').
And that's enough to see that a string of bytecodes isn't enough; the interpreter also needs the other attributes of the code object, and in some cases attributes of the function object (which is also where the locals and globals environments come from).
The inspect module also has some tools that can help you further in investigating live code.
This is enough to figure out a lot of interesting stuff. For example, you probably know that Python figures out at compile time whether a variable in a function is local, closure, or global, based on whether you assign to it anywhere in the function body (and on any nonlocal or global statements); if you write three different functions and compare their disassembly (and the relevant other attributes) you can pretty easily figure out exactly what it must be doing.
(The one bit that's tricky here is understanding closure cells. To really get this, you will need to have 3 levels of functions, to see how the one in the middle forwards things along for the innermost one.)
To understand how the bytecode is interpreted and how the stack machine works (in CPython), you need to look at the ceval.c source code. The answers by thy435 and eyquem already cover this.
Understanding how pyc files are read only takes a bit more information. Ned Batchelder has a great (if slightly out-of-date) blog post called The structure of .pyc files, that covers all of the tricky and not-well-documented parts. (Note that in 3.3, some of the gory code related to importing has been moved from C to Python, which makes it much easier to follow.) But basically, it's just some header info and the module's code object, serialized by marshal.
To understand how source gets compiled to bytecode, that's the fun part.
Design of CPython's Compiler explains how everything works. (Some of the other sections of the Python Developer's Guide are also useful.)
For the early stuff—tokenizing and parsing—you can just use the ast module to jump right to the point where it's time to do the actual compiling. Then see compile.c for how that AST gets turned into bytecode.
The macros can be a bit tough to work through, but once you grasp the idea of how the compiler uses a stack to descend into blocks, and how it uses those compiler_addop and friends to emit bytecodes at the current level, it all makes sense.
One thing that surprises most people at first is the way functions work. The function definition's body is compiled into a code object. Then the function definition itself is compiled into code (inside the enclosing function body, module, etc.) that, when executed, builds a function object from that code object. (Once you think about how closures must work, it's obvious why it works that way. Each instance of the closure is a separate function object with the same code object.)
And now you're ready to start patching CPython to add your own statements, right? Well, as Changing CPython's Grammar shows, there's a lot of stuff to get right (and there's even more if you need to create new opcodes). You might find it easier to learn PyPy as well as CPython, and start hacking on PyPy first, and only come back to CPython once you know that what you're doing is sensible and doable.
Having read the answer of thg4535, I am sure you will find interesting the following explanations on ceval.c : Hello, ceval.c!
This article is part of a series written by Yaniv Aknin whose I'm sort of a fan: Python's Innards
When we run the python programs: 1_python source code compile with Cpython to the bytecode (bytecode is the binary file with .pyc format which seralize with marshal and it is set of stack structures that solve with pvm) 2_then the pvm (python virtual machine/python interpreter) is stackbase machine (the machine which solve task with stack data structure) which loop inside bytecode line by line and execute it.
What executes the bytecode?
The bytecode tells the Python interpreter which C code to execute.

Is there a way to automatically verify all imports in a Python script without running it?

Suppose that I have a somewhat long Python script (too long to hand-audit) that contains an expensive operation, followed by a bunch of library function calls that are dependent on the output of the expensive operation.
If I have not imported all the necessary modules for the library function calls, then Python will error out only after the expensive operation has finished, because Python interprets line by line.
Is there a way to automatically verify that I have all the necessary imports without either a) manually verifying it line by line or b) running through the expensive operation each time I miss a library?
Another way to put that question is whether there is a tool that will do what the C compiler does with respect to verifying dependencies before run time.
No, this is not possible, because dependencies can be injected at runtime.
Consider:
def foo(break_things):
if not break_things:
globals()['bar'] = lambda: None
long_result = ...
foo(long_result > 0)
bar()
Which depending on the runtime value of long_result, may give NameError: name 'bar' is not defined.
There is a module called snakefood:
Generate dependency graphs from Python code
It uses the AST to parse
the Python files.
This is very reliable, it always runs. No module is
loaded. Loading modules to figure out dependencies is almost always
problem, because a lot of codebases run initialization code in the
global namespace, which often requires additional setup. Snakefood is
guaranteed not to have this problem (it just runs, no matter what).
You can get a list of imports by calling sfood-imports <script.py>. Then you can import each module in the list one by one and see if it throws ImportError.
Or, just use pylint. Quote from docs:
Error detection
checking if declared interfaces are truly implemented
checking if modules are imported
Hope that helps.

Categories