Python interpretation model in comparison to direct and virtual machine compilation - python

I have been compiling diagrams (pun intended) in hope of understanding the different implementations of common programming languages. I understand whether code is compiled or interpreted depends on the implementation of the code, and is not an aspect of the programming language itself.
I am interested in comparing Python interpretation with direct compilation (ex of C++)
and the virtual machine model (ex Java or C#)
In light of these two diagrams above, could you please help me develop a similar flowchart of how the .py file is converted to .pyc, uses the standard libraries (I gather they are called modules) and then actually run. Many programmers on SO indicate that python as a scripting language is not executed by the CPU but rather the interpreter, but that sounds quite impossible because ultimately hardware must be doing the computation.

First off, this is an implementation detail. I am limiting my answer to CPython and PyPy because I am familiar with them. Answers for Jython, IronPython, and other implementations will differ - probably radically.
Python is closer to the "virtual machine model". Python code is, contrary to the statements of some too-loud-for-their-level-of-knowledge people and despite everyone (including me) conflating it in casual discussion, never interpreted. It is always compiled to bytecode (again, on CPython and PyPy) when it is loaded. If it was loaded because a module was imported and was loaded from a .py file, a .pyc file may be created to cache the compilation output. This step is not mandatory; you can turn it off via various means, and program execution is not affected the tiniest bit (except that the next process to load the module has to do it again). However, the compilation to bytecode is not avoidable, the bytecode is generated in memory if it is not loaded from disk.
This bytecode (the exact details of which are an implementation detail and differ between versions) is then executed, at module level, which entails building function objects, class objects, and the like. These objects simply reuse (hold a pointer to) the bytecode which is already in memory. This is unlike C++ and Java, where code and classes are set in stone during/after compilation. During execution, import statements may be encountered. I lack the space, time and understanding to describe the import machinery, but the short story is:
If it was already imported once, you get that module object (another runtime construct for a thing static languages only have at compile time). A couple of builtin modules (well, all of them in PyPy, for reasons beyond the scope of this question) are already imported before any Python code runs, simply because they are so tightly integrated with the core of the interpreter and so fundamental. sys is such a module. Some Python code may also run beforehand, especially when you start the interactive interpreter (look up site.py).
Otherwise, the module is located. The rules for this are not our concern. In the end, these rules arrive at either a Python file or a dynamically-linked piece of machine code (.DLL on Windows, though Python modules specifically use the extension .pyd but that's just a name; on unix the equivalent .so is used).
The module is first loaded into memory (loaded dynamically, or parsed and compiled to bytecode).
Then, the module is initialized. Extension modules have a special function for that which is called. Python modules are simply run, from top to bottom. In well-behaved modules this just sets up global data, defines functions and classes, and imports dependencies. Of course, anything else can also happen. The resulting module object is cached (remember step one) and returned.
All of this applies to standard library modules as well as third party modules. That's also why you can get a confusing error message if you call a script of yours just like a standard library module which you import in that script (it imports itself, albeit without crashing due to caching - one of many things I glossed over).
How the bytecode is executed (the last part of your question) differs. CPython simply interprets it, but as you correctly note, that doesn't mean it magically doesn't use the CPU. Instead, there is a large ugly loop which detects what bytecode instruction shall be executed next, and then jumps to some native code which carries out the semantics of that instruction. PyPy is more interesting; it starts off interpreting but records some stats along the way. When it decides it's worth doing so, it starts recording what the interpreter does in detail, and generates some highly optimized native code. The interpreter is still used for other parts of the Python code. Note that it's the same with many JVMs and possibly .NET, but the diagram you cite glosses over that.

For the reference implementation of python:
(.py) -> python (checks for .pyc) -> (.pyc) -> python (execution dynamically loads modules)
There are other implementations. Most notable are:
jython which compiles (.py) to (.class) and follows the java pattern from there
pypy which employs a JIT as it compiles (.py). the chain from there could vary (pypy could be run in cpython, jython or .net environments)

Python is technically a scripted language but it is also compiled, python source is taken from its source file and fed into the interpreter which often compiles the source to bytecode either internally and then throws it away or externally and saves it like a .pyc
Yes python is a single virtual machine that then sits ontop of the actual hardware but all python bytecode is, is a series of instructions for the pvm (python virtual machine) much like assembler for the actual CPU.

Related

Is a vulnerability identified in Python considered a vulnerability in Jython?

A lil confused as the diffs between Python, Jython and CPython.
I understand Jython is an implementation of Python in Java and CPython is the same except that it is implemented in C.
But what I'm confused on really is identifying vulnerabilities in Python.
Such as the two below.
For example - CVE-2016-5636 - Here it appears that the vulnerability can't be reproduced in Jython.
https://bugzilla.redhat.com/show_bug.cgi?id=1345857
Similarly looking at - CVE-2016-5699
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-5699
It says
"CRLF injection vulnerability in the HTTPConnection.putheader function in urllib2 and urllib in CPython (aka Python) before 2.7.10 and 3.x before 3.4.4 allows remote attackers to inject arbitrary HTTP headers via CRLF sequences in a URL."
Does this mean CVE-2016-5699 is not vulnerable in Jython?
So overall- What I'm wondering is if a vulnerability in Python means it is vulnerable in Jython?
Not necessarily. When you refer to such things as "Python", you are potentially referring to two different things:
The Python language
The Python virtual machine (VM) or other specific implementation
Normally the Python language doesn't change (much) across different implementations. What changes is how the language is processed, including what external system functions are called.
Python without any other distinction would normally refer to CPython, the standard implementation. Others, as you mention above, are Jython and IronPython. Each of these runs in a different VM: JVM for Jython and dotnet for IronPython. These VMs might, for example, allocate memory differently thus preventing a memory-based error from occurring in a different VM. In the case of CVE-2016-5636 mentioned, it's noted that Jython calls the Java version of zip, while CPython likely calls the C version of zip.
In short - if the flaw occurs in how the language approaches a problem, it's likely to affect all implementations. Otherwise, you will need to check each platform's vulnerability on an individual basis.
Addendum: According the Red Hat tracker for CVE-2016-5699, this is language error and therefore is likely (but not guaranteed) to be vulnerable in all implementations until updated.
Not necessarily, it all depends which parts of Python standard lib JPython uses, which are modified, which are re-implemented, which are omitted...
urllib is a part of standard Python distribution, and you can find urllib.py in Lib folder of both, standard CPython and JPython. Sadly, they even state it in their code:
__version__ = '1.17' # XXX This version is not always updated :-(
So you cannot rely on that to figure out if the Python code itself is at fault (and is it fixed in a specific version).
Also, the exploit doesn't have to be necessary related to the actual Python wrapping around the lower level bytecode and ultimately the interpreter - it can be in any of those things, or combination of them. That's why they say that the exploit doesn't exist in a particular CPython version as it is assumed that the whole stack, together with the standard lib, is updated.
So, unless the exploit specifically says that the problem is in the Python code itself (e.g. in urllib.py in your example) and that it is fixed in a specific version of the said module, you cannot be sure that it's not due to the underlying interpreter, and if it is - whether the same applies for both PVM and JVM.

What does it mean when people say CPython is written in C?

From what I know, CPython programs are compiled into intermediate bytecode, which is executed by the virtual machine. Then how does one identify without knowing beforehand that CPython is written in C. Isn't there some common DNA for both which can be matched to identify this?
The interpreter is written in C.
It compiles Python code into bytecode, and then an evaluation loop interprets that bytecode to run your code.
You identify what Python is written in by looking at it's source code. See the source for the evaluation loop for example.
Note that the Python.org implementation is but one Python implementation. We call it CPython, because it is implemented in C. There are other implementations too, written in other languages. Jython is written in Java, IronPython in C#, and then there is PyPy, which is written in a (subset of) Python, and runs many tasks faster than CPython.
Python isn't written in C. Arguably, Python is written in an esoteric English dialect using BNF.
However, all the following statements are true:
Python is a language, consisting of a language specification and a bunch of standard modules
Python source code is compiled to a bytecode representation
this bytecode could in principle be executed directly by a suitably-designed processor but I'm not aware of one actually existing
in the absence of a processor that natively understands the bytecode, some other program must be used to translate the bytecode to something a hardware processor can understand
one real implementation of this runtime facility is CPython
CPython is itself written in C, but ...
C is a language, consisting of a language specification and a bunch of standard libraries
C source code is compiled to some bytecode format (typically something platform-specific)
this platform specific format is typically the native instruction set of some processor (in which case it may be called "object code" or "machine code")
this native bytecode doesn't retain any magical C-ness: it is just instructions. It doesn't make any difference to the processor which language the bytecode was compiled from
so the CPython executable which translates your Python bytecode is a sequence of
instructions executing directly on your processor
so you have: Python bytecode being interpreted by machine code being interpreted by the hardware processor
Jython is another implementation of the same Python runtime facility
Jython is written in Java, but ...
Java is a language, consisting of a spec, standard libraries etc. etc.
Java source code is compiled to a different bytecode
Java bytecode is also executable either on suitable hardware, or by some runtime facility
The Java runtime environment which provides this facility may also be written in C
so you have: Python bytecode being interpreted by Java bytecode being interpreted by machine code being interpreted by the hardware processor
You can add more layers indefinitely: consider that your "hardware processor" may really be a software emulation, or that hardware processors may have a front-end that decodes their "native" instruction set into another internal bytecode.
All of these layers are defined by what they do (executing or interpreting instructions according to some specification), not how they implement it.
Oh, and I skipped over the compilation step. The C compiler is typically written in C (and getting any language to the stage where it can compile itself is traditionally significant), but it could just as well be written in Python or Java. Again, the compiler is defined by what it does (transforms some source language to some output such as a bytecode, according to the language spec), rather than how it is implemented.
I found a good understanding of my original doubt here:
http://amitsaha.github.io/site/notes/articles/c_python_compiler_interpreter.html

Why do C programs require decompilers but python programs dont?

If I write a python script, anyone can simply point an editor to it and read it. But for programming written in C, one would have to use decompilers and hex tables and such. Why is that? I mean I simply can't open up the Safari web browser and look at its code.
Note: The author disavows a deep expertise in this subject. Some assertions may be incorrect.
Python actually is compiled into bytecode, which is what gets run by the python interpreter. Whenever you use a Python module, Python will generate a .pyc file with a name corresponding to the module. This is the equivalent of the .o file that's generated when you compile a C file.
So if you want something to disassemble, the .pyc file would be it :)
The process that Python goes through when compiling a module is pretty similar to what gcc or another C compiler does with C source code. The major difference is that it happens transparently as part of execution of the file. It's also optional: when running a non-module, i.e. an end-user script, Python will just interpret the code rather than compiling it first.
So really your question is "Why are python programs distributed as source rather than as compiled modules?" Or, put another way, "Why are C applications distributed as compiled binaries rather than as source code?"
It used to be very common for C applications to be distributed as source code. This was back before operating systems and their various subentities (i.e. linux distributions) became more established. Some distros, for example gentoo, still distribute apps as source code. Apps which are a bit more cutting edge or obscure are still distributed as source code for all platforms they target.
The reason for this is compatibility, and dependencies. The reason you can run the precompiled binary Safari on a Mac, or Firefox on Ubuntu Linux, is because it's been specifically built for that operating system, architecture (e.g. x86_64), and set of libraries.
Unfortunately, compilation of a large app is pretty slow, and needs to be redone at least partially every time the app is updated. Thus the motivation for binary distributions.
So why not create a binary distribution of Python? For one thing, as Aaron mentions, modules would need to be recompiled for each new version of the Python bytecode. But this would be similar to rebuilding a C app to link with a newer version of a dynamic library — Python modules are analogous in this sense to C libraries.
The real reason is that Python compilation is very much quicker than C compilation. This is in part, I think, because of the dynamic nature of the language, and also because it's not as thorough of a compilation. This has its tradeoffs: in particular, Python apps run much more slowly than do their C counterparts, because Python has to interpret the compiled bytecode into instructions for the processor, whereas the C app already contains such instructions.
That all being said, there is a program called py2exe that will take a Python module and distribution and build a precompiled windows executable, including in it the logic of the module and its dependencies, including Python itself. I guess the point of this is to avoid having to coerce people into installing Python on their Windows system just to run your app. Under linux, or I think even OS/X, Python is usually already installed, so precompilation is not really necessary. Linux systems also have super-dandy package managers that will transparently install dependencies such as Python if they are not already installed.
Python is a script language, runs in a virtual machine through an interpeter.
C is a compiled language, the code compiled to binary code which the computer can run without all that extra stuff Python needs.
This is sorta a big topic. You should look into your local friendly Computer Science curriculum, you'll find a lot of great stuff on this subject there.
The short answer is the Python is an "interpreted" language, which means that it requires a machine language program (the python interpreter) to run the python program, adding a layer of indirection. C or C++ are different. They are compiled directly to machine code, which runs directly on your processor.
There is a lot of additional voodoo to be learned here, however. Technically Python is compiled to a bytecode, and modern interpreters do more and more "Just in Time" compilation, so the boundaries between compiled and interpreted code are getting fuzzier all the time.
In several comments you asked: "Is it then possible to compile python to an executable binary file and then simply distribute that?"
From a theoretical viewpoint, there's no question the answer is yes -- a Python program could be compiled to, and distributed as, fully compiled machine code.
From a practical viewpoint, it's open to a lot more question. There are a few things like Unladen Swallow, Psyco, Shed Skin, and PyPy that you might want to know about though.
Unladen Swallow is primarily an attempt at making Python run faster, but part of the plan to do so involves using LLVM for its back-end. LLVM can (among other things) produce native machine code output. The last couple of releases of Unladen Swallow have used LLVM for native code generation, but 1) the most recent update on the web site is from late 2009, and 2) the release notes for that version say: "The Unladen Swallow team does not recommend wide adoption of the 2009Q3 release."
Psyco works as a plug-in for Python that basically does JIT compilation, so even though it can speed up execution (quite a lot in some cases), it doesn't produce a machine-code executable you can distribute. In short, while it's sort of similar to what you want, it's not intended to do exactly what you've asked for.
Shed Skin Python-to-C++ produces C++ as its output, and you then compile the C++ and (potentially) distribute the result of that. Shedskin is currently at version 0.5 -- i.e., nobody's claiming that it's a finished, released product. On the other hand, development is ongoing, and each release does seem to include pretty substantial improvements.
PyPy is a Python implementation written in Python. Their intent is to allow code production to be "plugged in" without affecting the rest of the implementation -- but while they currently support 4 different code generation models, I don't believe any of them results in producing native machine code that runs directly on the hardware.
Bottom line: work has been done and is being done with the intent of doing what you asked about, but at least to my knowledge there's not really anything I could reasonably recommend as a finished product that you can really depend on to do the job right now. The primary emphasis is really on execution speed, not producing standalone executables.
Yes, you can - it's called disassembling, and allows you to look at the code of Safari perfectly well. The thing is, C, among other languages, compiles to native code, i.e. code that your CPU can "understand" and execute.
More or less obviously, the level of abstraction present in the instruction set of your CPU is much smaller than that of a high level language like Python. The CPU instructions are not concerned with "downloading that URI", but more "check if that bit is set in a hardware register".
So, in conclusion, the level of complexity present in a native application is much higher when looking at the machine code, so many people simply can't make any sense of what is going on there, it's hard to get the big picture. With experience and time at your hands, it is possible though - people do it all the time, reversing applications and all.
you can't open up and read the code that actually runs for python either. Try
import dis
def foo():
for i in range(100):
print i
print dis.dis(foo)
That will show you the (human readable) bytcode of the foo program. equivalently, you can save the file and import it from the interactive python interpreter. This will create a .pyc file with the same basename as the script. open that with a hex editor and you are looking at the actually python bytecode.
The reason for the difference is that python changes up it's byte code between releases so that you would either need to distribute a different version of a binary only release for each version of python. This would be a pain.
With C, it's compiled to native code and so the byte code is much more stable making binary only releases possible.
because C code is complied to object (machine) code and python code is compiled into an intermediate byte code. I am not sure if you are even referring to the byte code of python - you must be referring to the source file itself which is directly executable (hiding the byte code from you!). C needs to be compiled and linked.
Python scripts are parsed and converted to binary only when they're run - i.e., they're text files and you can read them with an editor.
C code is compiled and linked to an executable binary file before they can be run. Normally, only this executable binary file is distributed - hence you need a decompiler. You can always view the source code, if you've access to it.
Not all C programs require decompilers. There's lots of C code distributed in source form. And some Python programs do require decompilers, if distributed as bytecode (.pyc files).
But, to the extent that your assumptions are valid, it's because C is a compiled language while Python is an interpreted language.
Python scripts are analogous to a man looking at a to-do list written in English (or language he understands). The man has to do all the work, every time that list of things has to be done.
If the man, instead of doing the steps on his own each time, creates and programs a robot which can carry out those steps again and again (and probably faster than him), that robot is analogous to the C program.
The man in the python case is called the "interpreter" and in the C case is called the "compiler", and the C robot is called the compiled program/executable.
When you look at the python program source, you see the to-do list. In case of the robot, you see the gears, motors and batteries, etc, which look very different from the to-do list. If you could get hold of the C "to-do" list, it looks somewhat like the python code, just in a different language.
G-WAN executes ANSI C scripts on the fly -making it just like Python scripts.
This can be server-side scripts (using G-WAN as a Web server) or any general-purpose C program and you can link any existing library.
Oh, and G-WAN C scripts are much faster than Python, PHP or Java...

How exactly does a python (django) request happen? does it have to reparse all the codebase?

With a scripting language like python (or php), things are not compiled down to bytecode like in .net or java.
So does this mean that on every request, it has to go through the entire application and parse/compile it? Or at least all the code required for the given call stack?
With a scripting language like python
(or php), things are not compiled down
to bytecode like in .net or java.
Wrong: everything you import in Python gets compiled to bytecode (and saved as .pyc files if you can write to the directory containing the source you're importing -- standard libraries &c are generally pre-compiled, depending on the installation choices of course). Just keep the main script short and simple (importing some module and calling a function in it) and you'll be using compiled bytecode throughout. (Python's compiler is designed to be extremely fast -- with implications including that it doesn't do a lot of otherwise reasonable optimizations -- but avoiding it altogether is still faster;-).
When running as CGI, yes, the entire project needs to be loaded for each request. FastCGI and mod_wsgi keep the project in memory and talk to it over a socket.

If Python is interpreted, what are .pyc files?

Python is an interpreted language. But why does my source directory contain .pyc files, which are identified by Windows as "Compiled Python Files"?
I've been given to understand that
Python is an interpreted language...
This popular meme is incorrect, or, rather, constructed upon a misunderstanding of (natural) language levels: a similar mistake would be to say "the Bible is a hardcover book". Let me explain that simile...
"The Bible" is "a book" in the sense of being a class of (actual, physical objects identified as) books; the books identified as "copies of the Bible" are supposed to have something fundamental in common (the contents, although even those can be in different languages, with different acceptable translations, levels of footnotes and other annotations) -- however, those books are perfectly well allowed to differ in a myriad of aspects that are not considered fundamental -- kind of binding, color of binding, font(s) used in the printing, illustrations if any, wide writable margins or not, numbers and kinds of builtin bookmarks, and so on, and so forth.
It's quite possible that a typical printing of the Bible would indeed be in hardcover binding -- after all, it's a book that's typically meant to be read over and over, bookmarked at several places, thumbed through looking for given chapter-and-verse pointers, etc, etc, and a good hardcover binding can make a given copy last longer under such use. However, these are mundane (practical) issues that cannot be used to determine whether a given actual book object is a copy of the Bible or not: paperback printings are perfectly possible!
Similarly, Python is "a language" in the sense of defining a class of language implementations which must all be similar in some fundamental respects (syntax, most semantics except those parts of those where they're explicitly allowed to differ) but are fully allowed to differ in just about every "implementation" detail -- including how they deal with the source files they're given, whether they compile the sources to some lower level forms (and, if so, which form -- and whether they save such compiled forms, to disk or elsewhere), how they execute said forms, and so forth.
The classical implementation, CPython, is often called just "Python" for short -- but it's just one of several production-quality implementations, side by side with Microsoft's IronPython (which compiles to CLR codes, i.e., ".NET"), Jython (which compiles to JVM codes), PyPy (which is written in Python itself and can compile to a huge variety of "back-end" forms including "just-in-time" generated machine language). They're all Python (=="implementations of the Python language") just like many superficially different book objects can all be Bibles (=="copies of The Bible").
If you're interested in CPython specifically: it compiles the source files into a Python-specific lower-level form (known as "bytecode"), does so automatically when needed (when there is no bytecode file corresponding to a source file, or the bytecode file is older than the source or compiled by a different Python version), usually saves the bytecode files to disk (to avoid recompiling them in the future). OTOH IronPython will typically compile to CLR codes (saving them to disk or not, depending) and Jython to JVM codes (saving them to disk or not -- it will use the .class extension if it does save them).
These lower level forms are then executed by appropriate "virtual machines" also known as "interpreters" -- the CPython VM, the .Net runtime, the Java VM (aka JVM), as appropriate.
So, in this sense (what do typical implementations do), Python is an "interpreted language" if and only if C# and Java are: all of them have a typical implementation strategy of producing bytecode first, then executing it via a VM/interpreter.
More likely the focus is on how "heavy", slow, and high-ceremony the compilation process is. CPython is designed to compile as fast as possible, as lightweight as possible, with as little ceremony as feasible -- the compiler does very little error checking and optimization, so it can run fast and in small amounts of memory, which in turns lets it be run automatically and transparently whenever needed, without the user even needing to be aware that there is a compilation going on, most of the time. Java and C# typically accept more work during compilation (and therefore don't perform automatic compilation) in order to check errors more thoroughly and perform more optimizations. It's a continuum of gray scales, not a black or white situation, and it would be utterly arbitrary to put a threshold at some given level and say that only above that level you call it "compilation"!-)
They contain byte code, which is what the Python interpreter compiles the source to. This code is then executed by Python's virtual machine.
Python's documentation explains the definition like this:
Python is an interpreted language, as
opposed to a compiled one, though the
distinction can be blurry because of
the presence of the bytecode compiler.
This means that source files can be
run directly without explicitly
creating an executable which is then
run.
There is no such thing as an interpreted language. Whether an interpreter or a compiler is used is purely a trait of the implementation and has absolutely nothing whatsoever to do with the language.
Every language can be implemented by either an interpreter or a compiler. The vast majority of languages have at least one implementation of each type. (For example, there are interpreters for C and C++ and there are compilers for JavaScript, PHP, Perl, Python and Ruby.) Besides, the majority of modern language implementations actually combine both an interpreter and a compiler (or even multiple compilers).
A language is just a set of abstract mathematical rules. An interpreter is one of several concrete implementation strategies for a language. Those two live on completely different abstraction levels. If English were a typed language, the term "interpreted language" would be a type error. The statement "Python is an interpreted language" is not just false (because being false would imply that the statement even makes sense, even if it is wrong), it just plain doesn't make sense, because a language can never be defined as "interpreted."
In particular, if you look at the currently existing Python implementations, these are the implementation strategies they are using:
IronPython: compiles to DLR trees which the DLR then compiles to CIL bytecode. What happens to the CIL bytecode depends upon which CLI VES you are running on, but Microsoft .NET, GNU Portable.NET and Novell Mono will eventually compile it to native machine code.
Jython: interprets Python sourcecode until it identifies the hot code paths, which it then compiles to JVML bytecode. What happens to the JVML bytecode depends upon which JVM you are running on. Maxine will directly compile it to un-optimized native code until it identifies the hot code paths, which it then recompiles to optimized native code. HotSpot will first interpret the JVML bytecode and then eventually compile the hot code paths to optimized machine code.
PyPy: compiles to PyPy bytecode, which then gets interpreted by the PyPy VM until it identifies the hot code paths which it then compiles into native code, JVML bytecode or CIL bytecode depending on which platform you are running on.
CPython: compiles to CPython bytecode which it then interprets.
Stackless Python: compiles to CPython bytecode which it then interprets.
Unladen Swallow: compiles to CPython bytecode which it then interprets until it identifies the hot code paths which it then compiles to LLVM IR which the LLVM compiler then compiles to native machine code.
Cython: compiles Python code to portable C code, which is then compiled with a standard C compiler
Nuitka: compiles Python code to machine-dependent C++ code, which is then compiled with a standard C compiler
You might notice that every single one of the implementations in that list (plus some others I didn't mention, like tinypy, Shedskin or Psyco) has a compiler. In fact, as far as I know, there is currently no Python implementation which is purely interpreted, there is no such implementation planned and there never has been such an implementation.
Not only does the term "interpreted language" not make sense, even if you interpret it as meaning "language with interpreted implementation", it is clearly not true. Whoever told you that, obviously doesn't know what he is talking about.
In particular, the .pyc files you are seeing are cached bytecode files produced by CPython, Stackless Python or Unladen Swallow.
These are created by the Python interpreter when a .py file is imported, and they contain the "compiled bytecode" of the imported module/program, the idea being that the "translation" from source code to bytecode (which only needs to be done once) can be skipped on subsequent imports if the .pyc is newer than the corresponding .py file, thus speeding startup a little. But it's still interpreted.
To speed up loading modules, Python caches the compiled content of modules in .pyc.
CPython compiles its source code into "byte code", and for performance reasons, it caches this byte code on the file system whenever the source file has changes. This makes loading of Python modules much faster because the compilation phase can be bypassed. When your source file is foo.py , CPython caches the byte code in a foo.pyc file right next to the source.
In python3, Python's import machinery is extended to write and search for byte code cache files in a single directory inside every Python package directory. This directory will be called __pycache__ .
Here is a flow chart describing how modules are loaded:
For more information:
ref:PEP3147
ref:“Compiled” Python files
THIS IS FOR BEGINNERS,
Python automatically compiles your script to compiled code, so called byte code, before running it.
Running a script is not considered an import and no .pyc will be created.
For example, if you have a script file abc.py that imports another module xyz.py, when you run abc.py, xyz.pyc will be created since xyz is imported, but no abc.pyc file will be created since abc.py isn’t being imported.
If you need to create a .pyc file for a module that is not imported, you can use the py_compile and compileall modules.
The py_compile module can manually compile any module. One way is to use the py_compile.compile function in that module interactively:
>>> import py_compile
>>> py_compile.compile('abc.py')
This will write the .pyc to the same location as abc.py (you can override that with the optional parameter cfile).
You can also automatically compile all files in a directory or directories using the compileall module.
python -m compileall
If the directory name (the current directory in this example) is omitted, the module compiles everything found on sys.path
Python (at least the most common implementation of it) follows a pattern of compiling the original source to byte codes, then interpreting the byte codes on a virtual machine. This means (again, the most common implementation) is neither a pure interpreter nor a pure compiler.
The other side of this is, however, that the compilation process is mostly hidden -- the .pyc files are basically treated like a cache; they speed things up, but you normally don't have to be aware of them at all. It automatically invalidates and re-loads them (re-compiles the source code) when necessary based on file time/date stamps.
About the only time I've seen a problem with this was when a compiled bytecode file somehow got a timestamp well into the future, which meant it always looked newer than the source file. Since it looked newer, the source file was never recompiled, so no matter what changes you made, they were ignored...
Python's *.py file is just a text file in which you write some lines of code. When you try to execute this file using say "python filename.py"
This command invokes Python Virtual Machine. Python Virtual Machine has 2 components: "compiler" and "interpreter". Interpreter cannot directly read the text in *.py file, so this text is first converted into a byte code which is targeted to the PVM (not hardware but PVM). PVM executes this byte code. *.pyc file is also generated, as part of running it which performs your import operation on file in shell or in some other file.
If this *.pyc file is already generated then every next time you run/execute your *.py file, system directly loads your *.pyc file which won't need any compilation(This will save you some machine cycles of processor).
Once the *.pyc file is generated, there is no need of *.py file, unless you edit it.
tldr; it's a converted code from the source code, which the python VM interprets for execution.
Bottom-up understanding: the final stage of any program is to run/execute the program's instructions on the hardware/machine. So here are the stages preceding execution:
Executing/running on CPU
Converting bytecode to machine code.
Machine code is the final stage of conversion.
Instructions to be executed on CPU are given in machine code. Machine code can be executed directly by CPU.
Converting Bytecode to machine code.
Bytecode is a medium stage. It could be skipped for efficiency, but sacrificing portability.
Converting Source code to bytecode.
Source code is a human readable code. This is what is used when working on IDEs (code editors) such as Pycharm.
Now the actual plot. There are two approaches when carrying any of these stages: convert [or execute] a code all at once (aka compile) and convert [or execute] the code line by line (aka interpret).
For example, we could compile a source code to bytecode, compile bytecode to machine code, interpret machine code for execution.
Some implementations of languages skip stage 3 for efficiency, i.e. compile source code into machine code and then interpret machine code for execution.
Some implementations skip all middle steps and interpret the source code directly for execution.
Modern languages often involve both compiling an interpreting.
JAVA for example, compiles source code to bytecode [that is how JAVA source is stored, as a bytecode, compile bytecode to machine code [using JVM], and interpret machine code for execution. [Thus JVM is implemented differently for different OSs, but the same JAVA source code could be executed on different OS that have JVM installed.]
Python for example, compile source code to bytecode [usually found as .pyc files accompanying the .py source codes], compile bytecode to machine code [done by a virtual machine such as PVM and the result is an executable file], interpret the machine code/executable for execution.
When can we say that a language is interpreted or compiled?
The answer is by looking into the approach used in execution. If it executes the machine code all at once (== compile), then it's a compiled language. On the other hand, if it executes the machine code line-by-line (==interpret) then it's an interpreted language.
Therefore, JAVA and Python are interpreted languages.
A confusion might occur because of the third stage, that's converting bytecode to machine code. Often this is done using a software called a virtual machine. The confusion occurs because a virtual machine acts like a machine, but it's actually not! Virtual machines are introduced for portability, having a VM on any REAL machine will allow us to execute the same source code. The approach used in most VMs [that's the third stage] is compiling, thus some people would say it's a compiled language. For the importance of VMs, we often say that such languages are both compiled and interpreted.
Python code goes through 2 stages. First step compiles the code into .pyc files which is actually a bytecode. Then this .pyc file(bytecode) is interpreted using CPython interpreter. Please refer to this link. Here process of code compilation and execution is explained in easy terms.
Its important distinguish language specification from language implementations:
Language specification is just a document with the formal specification of the language, with its context free grammar and definition of the semantic rules (like specifying primitive types and scope dynamics).
Language implementation is just a program (a compiler) that implement the use of the language according to its specification.
Any compiler consists of two independent parts: a frontend and backend. The frontend receives the source code, validate it and translate it into an intermediate code. After that, a backend translate it to machine code to run in a physical or a virtual machine.
An interpreter is a compiler, but in this case it can produce a way of executing the intermediate code directly in a virtual machine.
To execute python code, its necessary transform the code in a intermediate code, after that the code is then "assembled" as bytecode that can be stored in a file.pyc, so no need to compile modules of a program every time you run it.
You can view this assembled python code using:
from dis import dis
def a(): pass
dis(a)
Anyone can build a Compiler to static binary in Python language, as can build an interpreter to C language. There are tools (lex/yacc) to simplify and automate the proccess of building a compiler.
Machines don't understand English or any other languages, they understand only byte code, which they have to be compiled (e.g., C/C++, Java) or interpreted (e.g., Ruby, Python), the .pyc is a cached version of the byte code.
https://www.geeksforgeeks.org/difference-between-compiled-and-interpreted-language/
Here is a quick read on what is the difference between compiled language vs interpreted language, TLDR is interpreted language does not require you to compile all the code before run time and thus most of the time they are not strict on typing etc.

Categories