A post by Ned Batchelder back in 2008 [1] claimed that Python code objects are marshalled in Python bytecode files:
[...] The entire rest of the file is just the output of marshal.dump() of the code object that results from compiling the source file.
This may apply to code objects created at the Python level, but how exactly are these code objects marshalled at the C level?
I've looked over the documentation for Python code objects (version 3.5). Not much is said about them, and it certainly doesn't provide any information about how they're marshalled:
The C structure of the objects used to describe code objects. The fields of this type are subject to change at any time.
I've also tried to take a look at the Python code object's source code, but couldn't really find anything that answered my question.
If Python code objects are implemented using structs at the C level - which I believe they are - how then can they be marshalled into bytecode files? As far as my research goes, C provides no built-in facilities for marshalling, and I couldn't find any third-party libraries for it either.
If code objects are marshalled at the C level, how exactly is it done? If not, how are they encoded into the bytecode?
[1] I specifically noted the date because this detail may have changed between Python versions.
This may apply to code objects created at the Python level, but how exactly are these code objects marshalled at the C level?
With marshal.dump, or perhaps one of the related interfaces like marshal.dumps.
It's not as simple as "delegate to some C standard library function"; the marshalling and unmarshalling implementation comprises an 1820-line file you'd have to read if you want to see every detail. With how Python-specific its data structures and functionality are, most of the object tree traversal and serialization need to be written from scratch.
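From the Python side, the same machinery is reachable through the marshal module. The following is a minimal sketch, not a description of the C internals; the source string and the "<example>" filename are made up for illustration.

```python
import marshal

# Compile a tiny module; the result is a code object (a C struct under the hood).
source = "answer = 6 * 7"
code = compile(source, "<example>", "exec")

# Serialize the code object to bytes with the marshal format.
payload = marshal.dumps(code)

# Rebuild an equivalent code object from those bytes and run it.
restored = marshal.loads(payload)
namespace = {}
exec(restored, namespace)
result = namespace["answer"]
```

The marshal format is specific to a Python version, so bytes produced by one interpreter version are not guaranteed to load in another.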
For a project idea of mine, I have the following need, which is quite precise:
I would like to be able to execute Python code (pre-compiled beforehand if necessary) on a per-bytecode-instruction basis. I also need to access what's inside the Python VM (frame stack, data stacks, etc.). Ideally, I would also like to remove a lot of Python built-in features and reimplement a few of them my own way (such as file writing).
All of this must be coded in C# (I'm using Unity).
I'm okay with losing a few of Python's actual features, especially concerning complicated stuff with imports, etc. However, I would like most of it to stay intact.
I looked a little bit into IronPython's code but it remains very obscure to me and it seems quite enormous too. I began translating Byterun (a Python bytecode interpreter written in Python) but I face a lot of difficulties as Byterun leverages a lot of Python's features to... interpret Python.
Today, I don't ask for a pre-made solution (except if you have one in mind?), but rather for some advice, places to look at, etc. Do you have any ideas about the things I should research first?
I've tried to do my own implementation of the Python VM in the distant past and learned a lot but never came even close to a fully working implementation. I used the C implementation as a starting point, specifically everything in https://github.com/python/cpython/tree/main/Objects and
https://github.com/python/cpython/blob/main/Python/ceval.c (look for switch(opcode))
Here are some pointers:
Come to grips with the Python object model. Implement an abstract PyObject class with the necessary methods for instancing, attribute access, indexing and slicing, calling, comparisons, arithmetic operations and representation. Provide concrete implementations for None, booleans, ints, floats, strings, tuples, lists and dictionaries.
Implement the core of your VM: a Frame object that loops over the opcodes and dispatches, using a giant switch statement (following the C implementation here), to the corresponding methods of the PyObject. The frame should maintain a stack of PyObjects for the operands of the opcodes. Depending on the opcode, arguments are popped from and pushed onto this stack. A dict can be used to store and retrieve local variables. Use the Frame object to create a PyObject for function objects.
Get familiar with the idea of a namespace and the way Python builds on the concept of namespaces. Implement a module, a class and an instance object, using the dict to map (attribute)names to objects.
Finally, add as many builtin functions as you think you need to get a useful implementation.
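The frame loop from the pointers above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the five opcodes and the (opcode, argument) instruction format are hypothetical and do not match CPython's real bytecode, and plain Python values stand in for the PyObject hierarchy.

```python
# Hypothetical opcodes for illustration only; CPython's real opcode set is
# much larger and changes between versions.
LOAD_CONST, LOAD_NAME, STORE_NAME, BINARY_ADD, RETURN_VALUE = range(5)


class Frame:
    """Runs one sequence of (opcode, argument) pairs with its own stack."""

    def __init__(self, instructions, constants, names):
        self.instructions = instructions
        self.constants = constants
        self.names = names
        self.stack = []      # operand stack
        self.locals = {}     # local variables, keyed by name

    def run(self):
        pc = 0
        while pc < len(self.instructions):
            opcode, arg = self.instructions[pc]
            pc += 1
            # The if/elif chain plays the role of the giant switch statement.
            if opcode == LOAD_CONST:
                self.stack.append(self.constants[arg])
            elif opcode == LOAD_NAME:
                self.stack.append(self.locals[self.names[arg]])
            elif opcode == STORE_NAME:
                self.locals[self.names[arg]] = self.stack.pop()
            elif opcode == BINARY_ADD:
                right = self.stack.pop()
                left = self.stack.pop()
                self.stack.append(left + right)
            elif opcode == RETURN_VALUE:
                return self.stack.pop()
            else:
                raise ValueError(f"unknown opcode {opcode}")
        return None


# Equivalent of: x = 2; return x + 40
program = [
    (LOAD_CONST, 0), (STORE_NAME, 0),
    (LOAD_NAME, 0), (LOAD_CONST, 1), (BINARY_ADD, None),
    (RETURN_VALUE, None),
]
frame = Frame(program, constants=[2, 40], names=["x"])
result = frame.run()
```

Because the loop advances one instruction at a time, pausing after each step or inspecting frame.stack and frame.locals between steps is a small change to run().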
I think it is easy to underestimate the amount of work you're getting yourself into, but ... have fun!
How exactly can Python call a C library? Tensorflow, for example, I believe is written mostly in C, but can be used from Python. I'm thinking of implementing something like this in my own (interpreted) programming language (written in Go, but I assume it would be a similar process).
What happens when a Python program calls a C function? I'm thinking either RPC or DLLs, but both of them seem unlikely.
CPython has two main ways to call C code: either by loading a shared library and calling its symbols, or by packaging C code as Python binary (extension) modules and then calling them from Python code as though they were ordinary Python modules. The latter is how high-performance parts of the standard library are implemented, e.g. json.
Loading a shared library and calling functions from it using the ctypes module is rather trivial, and you can find a lot of examples here: https://docs.python.org/3/library/ctypes.html
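As a minimal sketch of the ctypes route: to avoid assuming where a particular shared library lives on your system, this example uses ctypes.pythonapi, the handle ctypes exposes for the interpreter's own C library, and calls the C function Py_GetVersion from it. Loading your own library with ctypes.CDLL and a path works the same way once you have the handle.

```python
import ctypes
import sys

# ctypes.pythonapi is a handle to the C library that implements the interpreter.
get_version = ctypes.pythonapi.Py_GetVersion

# Declare the C signature: const char *Py_GetVersion(void)
get_version.argtypes = []
get_version.restype = ctypes.c_char_p

# The call crosses from Python into C and returns a C string as bytes.
version = get_version().decode("ascii")
```

Declaring argtypes and restype matters: without them ctypes assumes int return values, which silently truncates pointers on 64-bit platforms.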
Packing your C code as binary Python module requires a lot of boilerplate and careful attention to details such as ref counting, null pointers, etc, and is documented here: https://docs.python.org/2/extending/extending.html
There are several libraries that automate the process and generate binding code for you. One example is boost.python: https://www.boost.org/doc/libs/1_65_0/libs/python/doc/html/tutorial/index.html
I am curious whether there is a way to work with Avro in Python the same way as in the Java or C++ implementations.
According to the official Avro Python documentation, I have to provide an Avro schema at runtime to encode/decode data. But is there a way to use a code generator, as in Java/C++?
Update: My coworker put together a pretty good library for doing this, avro-to-python. We have been using it in production for over a year now on some pretty complex schemas.
I had to implement something like this for php: avro-to-php
pyschema is a pretty good start, but the documentation is poor. You'll need to look at the source code to see how it all works. You can use it to read Avro schemas and generate Python source code. It adds another layer of abstraction and as such slows things down a bit more.
I've asked this question a couple of times recently in the Pulsar slack channel and my belief is that no tool currently exists that can convert an Avro schema to a Python class that is compatible with the Pulsar Python client library.
The Pulsar Python client library expects the Python class to inherit from the Record class (https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/pulsar/schema/definition.py#L57), and for every field in the Python class to inherit from the Field class (https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/pulsar/schema/definition.py#L141), both of which are defined in the Pulsar Python client library.
So, an Avro to Python converter would have to import the Record class and Field class from the Python client library, and so if such a converter exists, someone in the Pulsar Slack community really should know about it.
Further, the Pulsar Python client library is missing support for Avro keywords like "doc", "namespace", and for null default values. So even if an Avro to Python converter exists for Pulsar, likely, the converted Python class cannot be properly consumed by the Pulsar Python client library.
I don't see any indication of an existing Avro schema -> Python class code generator in the docs (which explicitly mention code generation for the Java case) for arbitrary Python interpreters. If you're using Jython, you could use the Java code generator to make a class that you access in your Jython code.
Unlike Java and C++, failing to have code generation doesn't affect Python performance much (in the CPython case anyway), since class instances are implemented in terms of dicts anyway (there are exceptions to this rule in a sense, but they mostly change memory usage, not the fact that dict lookup is always involved). That makes code generation largely "nice to have" syntactic sugar, not a necessary feature for development; with some effort, you could always implement a converter that writes out a class definition and runs it through exec in Python to get a similar effect (this is how collections.namedtuple classes are defined).
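A minimal sketch of such a converter, assuming a flat record schema: the schema dict and the generate_class_source helper below are hypothetical, ignore Avro types, defaults and nested records, and are not part of any Avro library.

```python
# Hypothetical schema, shaped like an Avro record definition.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}


def generate_class_source(record_schema):
    """Return Python source for a plain class matching the record's fields."""
    field_names = [field["name"] for field in record_schema["fields"]]
    params = ", ".join(field_names)
    lines = [f"class {record_schema['name']}:"]
    lines.append(f"    def __init__(self, {params}):")
    for name in field_names:
        lines.append(f"        self.{name} = {name}")
    lines.append("        pass")  # keeps the body valid even with no fields
    return "\n".join(lines) + "\n"


source = generate_class_source(schema)

# exec runs arbitrary code, so only feed it schemas you trust.
namespace = {}
exec(source, namespace)
User = namespace["User"]
user = User("Ada", 36)
```

The generated source could equally be written to a .py file and imported, which is closer to what the Java and C++ generators do.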
I was reading Deitel's C++ programming book. In this book they mention how a programmer should release only the interface part of his code and not the implementation.
So carrying this over to python:
I have 2 files:
1) the implementation file = accountClass.py and
2) the interface file = useAccountClass.py
I have compiled the implementation file and have obtained the .pyc file. So when I provide my code to someone else, I would provide him with the .pyc file and the interface file, right?
Also, if I provide someone else with ONLY the .pyc file, can I expect him to write the interface on his own? I'm going to say no. But there's this one nagging doubt that I have:
The creators of numpy and scipy did not share the implementation with us end users. And I don't think they shared any interfaces either. But we can still search for the different classes and their methods inside both numpy and scipy. So, using this example of numpy and scipy, I guess what I'm trying to ask is:
Is it possible for someone else to create an interface to my code if I provide him/her with only the compiled implementation file (in this case accountClass.pyc)? How will that person know what classes and methods I have defined in my implementation? I mean, will they use the
if __name__ == "__main__":
    blah blah
or is there some other way??
You got that entirely wrong. Or perhaps it's a horrible book whose author got something seriously wrong. Code using other code should indeed, barring significant counterarguments, adhere to an interface and not care about the details of the implementation. However, even in the world of static compilation to machine code (e.g. C++), this does not mean you should lock away the source code of the implementation.
Whether someone has access to the implementation, and whether they make use of that knowledge while writing a specific piece of code, are completely different issues. Heck, even the author of the implementation can/should still program to an interface when working on other code (e.g. other modules). Likewise, even if you lock the implementation away from someone, they may very well rely on implementation quirks which are not part of the interface. If anyone in the world of static compilation to machine code provides only headers and object files, and not the source code, it's because the projects are closed source, not to encourage good programming practices among clients.
In Python, your question makes no sense - there are no "interface" and "implementation" files, there's just code which is run and defines functions, classes, and other values. There is no such thing as an interface file you'd provide. You provide an implementation - and (hopefully) documentation which details both interface and possibly implementation details. And once a module is imported, the class objects, function objects, and other objects contain plenty of information (including, in many cases, the text from which large parts of the documentation were generated). This is also true for extension modules like numpy. And note that their implementation is accessible, it's just not included in all distributions because it's of little use. With Python code, you practically have to distribute the source code because anything else is platform-specific.
On a side note, .pyc files are pretty high level, and easily understood when disassembled (which is as easy as importing the module and running the stdlib module dis on any function inside). I consider this a minor technicality as it's already the wrong question to ask.
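To make this concrete, here is a minimal sketch of what anyone can recover from a loaded class at runtime. The Account class is a hypothetical stand-in for something shipped only as a .pyc; it is defined inline so the example is self-contained.

```python
import dis
import inspect


class Account:
    """Hypothetical stand-in for a class shipped only as a .pyc file."""

    def __init__(self, balance=0):
        self.balance = balance

    def deposit(self, amount):
        """Add amount to the balance and return the new balance."""
        self.balance += amount
        return self.balance


# Public functions defined on the class, discovered at runtime.
methods = [
    name
    for name, member in inspect.getmembers(Account, inspect.isfunction)
    if not name.startswith("_")
]

# Parameter names and docstrings are stored on the function object itself.
signature = str(inspect.signature(Account.deposit))
doc = inspect.getdoc(Account.deposit)

# The bytecode can be disassembled into readable opcode names.
opnames = [ins.opname for ins in dis.get_instructions(Account.deposit)]
```

Calling dis.dis(Account.deposit) prints the same disassembly in a human-readable listing.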
Deitel's advice to C++ programmers doesn't apply to Python, for a number of reasons:
Python isn't compiled to machine code, so no matter what form you provide the program in, it will be relatively easy for someone to read the code.
Python doesn't have .h and .c files; all you can provide is the .py or .pyc files.
Treating code as a secret is kind of silly anyway. What is in your code that you need to keep hidden from others?
Numpy and Scipy are largely implemented in C, which is why you don't have the source, for your own convenience. You can get the source if you like. The "interface" to that code is the module that you can import and then call.
You should not confuse "user interface" with "class interface". If you have a useAccountClass file, that file probably performs some task using the classes and methods defined in the accountClass file, if I understood right.
If you send the file to another person, they are not supposed to "guess" what your compiled class does. That's what DOCUMENTATION is for: a description of the functions contained in the module (compiled or not), which parameters they take, which values they return, and what they are expected to do, the "meaning" of the task they perform.
As an abstract example, let's suppose you have an image processing class. If that class has the function findCircles(image), the documentation should explain that it takes an image, possibly containing circles, and returns a list or array of coordinates of the centers of circles contained in the image. HOW the circles are detected is not important, you don't need to know that to use the function. Now if the function was called like findCircles(image, gaussian_threshold=10), the caller would have to know the function uses some "gaussian_threshold" parameter, that is, the caller would NEED to know about the function's entrails, and in OOP this is Not Good. If you decided to use another algorithm in the future, every code using that function would have to be rewritten, because the gaussian_threshold most probably wouldn't make sense anymore.
So, the interface, in OOP, is the abstraction used to communicate to the object only the canonical parameters or inputs it needs to know to perform a task in the language of the problem, not in the language of the implementation (that can change anytime).
The documentation, in this sense, is a contract that assures to the user (in this case, another developer) that the function will perform as expected if sane inputs are given to it.
Now the FINAL USER, a non-technical person wanting to use your program, would need the WHOLE working program (controls and views), not only the class definitions (the model).
Hope this helps, and I must recommend the books "Code Complete 2nd ed." and "Pragmatic Programmer - From Journeyman to Master" as VERY enlightening readings on the broad topic.
I just read through PEP 237, which concerns the unification of ints and longs in Python3k. The approach used in this seems very interesting. The approach is to create a new type "integer" which is the abstract base class of int and long. Also, performing operations on ints which result in very large numbers will no longer raise an OverflowError; instead they'll return a long.
I'd like to see and try to understand the underlying implementation of this in Python3k. How should I go about that? Which files contain the details about "type" implementations?
So far I have only ventured out to read most of the non-C python stdlib modules; hence I am unclear on where exactly to look.
Start with Include/longobject.h and Objects/longobject.c. These paths are relative to the root of a Python source tree. Make sure to arm yourself with an editor suitable for browsing C code conveniently, or generate an interlinked HTML reference with GNU global.
Also, it would surely help to read this article on internals of objects in Python 3, as well as its sequel.
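Before diving into the C sources, the behaviour they implement can be observed from Python 3 itself. This is a small sketch; the variable names are arbitrary, and sys.int_info reports how the running build lays out its integers internally.

```python
import sys

small = sys.maxsize    # largest value that fits a C Py_ssize_t on this build
big = small + 1        # no OverflowError; the result is still a plain int
huge = 2 ** 200        # far beyond any machine word

# Integers are stored as arrays of fixed-width "digits"; this is the width in bits.
bits_per_digit = sys.int_info.bits_per_digit
```

The digit width reported here corresponds to the digit type used throughout the long object implementation.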