I'm currently going through a basic compsci course. We use Python's in a lot. I'm curious how it's implemented, what the code that powers in looks like.
I can think of how my implementation of such a thing would work, but something I've learned after turning in a couple homework assignments is that my ways of doing things are usually pretty terrible and inefficient.. So I want to start investigating 'good' code.
The thing about builtin functions and types and operators and so on is that they are not implemented in Python. Rather, they're implemented in C, which is a much more painful and verbose programming language that won't always translate well to Python (usually because things are easier some other way in Python.)
With that said, you can investigate all of Python's implementation online, via their public source repository.
The implementation for in is scattered -- there's one implementation per type, plus a more general implementation that calls the type-specific implementation (more on that later). For example, for lists, we'd look for the implementation of lists. In the Python source tree, the source for all builtin objects is in the Objects directory. In that directory you'll find listobject.c , which contains the implementation for the list object and all its methods.
On the repository at the time of answering, if you look at line 393 you'll find the implementation of the in operator (also known as the __contains__ method, which explains the name of the function). It's fairly straightforward, just loops through all the elements of the list until either the element is found, or there's no more elements, and returns the result of the search. :)
If it helps, in Python the idiomatic way to write this would be:
def __contains__(self, obj):
for item in self:
if item == obj:
return True
return False
I said earlier that there was a more general implementation. That can be seen in the implementation of PySequence_Contains in abstract.c. It tries to call the type-specific version, and if that fails, resorts to regular iteration. That loop there is what a regular Python for loop looks like when you write it in C (using the Python C-API).
From the Data Model section of the Python Language Reference:
The membership test operators (in and not in) are normally implemented
as an iteration through a sequence. However, container objects can
supply the following special method with a more efficient
implementation, which also does not require the object be a sequence.
object.__contains__(self, item)
Called to implement membership test operators. Should return true if item is in self,
false otherwise. For mapping objects, this should
consider the keys of the mapping rather than the values or the
key-item pairs.
For objects that don’t define __contains__(), the membership test first tries
iteration via __iter__(), then the old sequence iteration
protocol via __getitem__(), see this section in the language
reference.
So, by default Python iterates over a sequence to implement the in operator. If an object defines the __contains__ method, Python uses it instead of iterating. So what happens in the __contains__ method? To know exactly, you would have to browse the source. But I can tell you that Python's lists implement __contains__ using iteration. Python dictionaries and sets are implemented as hash tables, and therefore support faster membership testing.
Python's built-in methods are written in the C language - you can see their code by checking out Python's source code yourself.
However, if you want to take a look of an equivalent implementation of all the methods in Python itself, you can check PyPy - which features a Python implementation 100% written in Python and a subset of it (rpython).
The in operator calls the __contains__ method in a string object - so you can check for the implementation of the string in both projects - but the actual searching code will be buried deeper.
Here is some of the code in CPython for it, for example:
http://hg.python.org/cpython/file/c310233b1d64/Objects/stringlib/fastsearch.h
You can browse the Python-Source online: http://hg.python.org/
A good start is to clone the repository you need and then use grep to find the things you need.
Related
For a project idea of mine, I have the following need, which is quite precise:
I would like to be able to execute Python code (pre-compiled before hand if necessary) on a per-bytecode-instruction basis. I also need to access what's inside the Python VM (frame stack, data stacks, etc.). Ideally, I would also like to remove a lot of Python built-in features and reimplement a few of them my own way (such as file writing).
All of this must be coded in C# (I'm using Unity).
I'm okay with loosing a few of Python's actual features, especially concerning complicated stuff with imports, etc. However, I would like most of it to stay intact.
I looked a little bit into IronPython's code but it remains very obscure to me and it seems quite enormous too. I began translating Byterun (a Python bytecode interpreter written in Python) but I face a lot of difficulties as Byterun leverages a lot of Python's features to... interpret Python.
Today, I don't ask for a pre-made solution (except if you have one in mind?), but rather for some advice, places to look at, etc. Do you have any ideas about the things I should research first?
I've tried to do my own implementation of the Python VM in the distant past and learned a lot but never came even close to a fully working implementation. I used the C implementation as a starting point, specifically everything in https://github.com/python/cpython/tree/main/Objects and
https://github.com/python/cpython/blob/main/Python/ceval.c (look for switch(opcode))
Here are some pointers:
Come to grips with the Python object model. Implement an abstract PyObject class with the necessary methods for instancing, attribute access, indexing and slicing, calling, comparisons, aritmetic operations and representation. Provide concrete implemetations for None, booleans, ints, floats, strings, tuples, lists and dictionaries.
Implement the core of your VM: a Frame object that loops over the opcodes and dispatches, using a giant switch statment (following the C implementation here), to the corresponding methods of the PyObject. The frame should maintains a stack of PyObjects for the operants of the opcodes. Depending on the opcode, arguments are popped from and pushed on this stack. A dict can be used to store and retrieve local variables. Use the Frame object to create a PyObject for function objects.
Get familiar with the idea of a namespace and the way Python builds on the concept of namespaces. Implement a module, a class and an instance object, using the dict to map (attribute)names to objects.
Finally, add as many builtin functions as you think you need to get a usefull implementation.
I think it is easy to underestimate the amount of work you're getting yourself into, but ... have fun!
I am in doubt if a regular dict and an OrderedDict objects are truly interchangeable in a sense that the same function or method could return once a dict and other times an OrderedDict depending on the input arguments or for example in case of a class method depending on some other internal class instance attributes. If returning an OrderedDict would be significantly more costly than returning just a regular dict which should suffice as well why doing it the hard way? Would it be pythonic to create such a function or menthod? I use Python 2.7.
I have seen "Why should functions always return the same type?" and I felt my case is more special and less obvious to the unseasoned eye.
Python uses duck typing. as long as an object presents the necessary interface required for an action, the type of object is irrelevant. An OrderedDict will present all the interfaces a dict would, so what the function returns is irrelevant. It'll be up to whatever uses what was returned. You should figure out the implementation of the use not the returning function.
What's your use case? Detailing that would help in suggesting an answer or alternatives.
A fairly small question: does anyone know about a pre-made suite of Python unit tests that just check if a class conforms to one of the standard Python data structure interfaces (e.g., lists, sets, dictionaries, queues, etc). It's not overly hard to write them, but I'd hate to bother doing so if someone has already done this. It seems like very basic functionality that someone has probably done already.
The use case is that I am using a factory pattern to create data structures due to different restrictions related to platforms. As such, I need to be able to test that the resulting created objects still conform to the standard interfaces on the surface. Also, I should note that by "conform" I mean that the tests should check not just that the interface functions exist, but also check that they work (e.g., can set and retrieve a value in a map, for instance). Python 2.7 tests would be preferred.
First, "the standard Python data structure interfaces" are not lists, sets, dictionaries, queues, etc. Those are specific implementations of the interfaces. (And queue isn't even a data structure in the sense you're thinking of—its salient features are that its operations are atomic, and put and get optionally synchronize on a Condition, and so on.)
Anyway, the interfaces are defined in five different not-quite-compatible ways.
The Built-in Types section of the documentation describes what it means to be an iterator type, a sequence type, etc. However, these are not nearly as rigorous as you'd expect for reference documentation (at least if you're used to, say, C++ or Java).
I'm not aware of any tests for such a thing, so I think you'd have to build them from scratch.
The collections module contains Collections Abstract Base Classes that define the interfaces, and provide a way to register "virtual subclasses" via the abc module. So, you can declare "I am a mapping" by inheriting from collections.Mapping, or calling collections.Mapping.register. But that doesn't actually prove that you are a mapping, just that you're claiming to be. (If you inherit from Mapping, it also acts as a mixin that helps you complete the interface by implementing, e.g., __contains__ on top of __getitem__.)
If you want to test the ABC meaning, defuz's answer is very close, and with a little more work I think he or someone else can complete it.
The CPython C API defines an Abstract Objects Layer. While this is not actually authoritative for the language, it's obviously intended that the C-API protocols and the language-level interfaces are supposed to match. And, unlike the latter, the former are rigorously defined. And of course the source code from CPython 2.7, and maybe other implementations like PyPy, may help.
There are tests for this that come with CPython, but really, they're for testing that calling PyMapping_GetItem from C properly calls your mymapping.__getitem__ in Python, which is really at a tangent to what you want to test, so I don't think it will help much.
The actual concrete classes have additional interface on top of the protocols, that you may want to test, but that's harder to describe. In particular, the way the __new__ and __init__ methods work is often important. Implementing the Mapping protocol means someone can construct an empty Foo instance and add items to it with foo[key] = value, but it doesn't mean someone can construct Foo(key=value), or Foo({key: value}) or Foo([(key, value)]).
And for this case, there are existing tests that come with all of the standard Python implementations. CPython comes with a very extensive test suite that includes things like test_dict.py. PyPy runs all the (Python-level) CPython tests, and some extra ones besides.
You will obviously have to modify these tests to run on an arbitrary class instead of one hardcoded into the tests, and you may also have to modify them to handle whichever definition you pick. Plus, they probably test more than you asked for. You just want to know if a class conforms to the protocol, not whether its methods do the right thing, right? But still, I think they're a good starting point.
Finally, the C API defines a Concrete Objects Layer that, although it's not authoritative, matches the previous definition and is more rigorously defined.
Unfortunately, the tests for this one are definitely not going to be very useful to you, because they're checking things like whether PyDict_Check and PyDict_GetItem work on your class, which they will not for any mapping defined in pure Python.
If you do build something complete for any of these definitions, I would strongly suggest putting it on PyPI, and posting about it to python-list, so you get feedback (and bug reports).
There are abstract base classes in standart module collections based on ABC module.
You have to inherit your classes from these classes to be sure that your classes correspond to the standard behavior:
import collections
class MyDict(collections.Mapping):
...
Also, your can test already existed class that does not obviously inherit the abstract class:
class MyPerfectDict(object):
... realization ...
def is_inherit(cls, abstract):
try:
class Test(abstract, cls): pass
test = Test()
except TypeError:
return False
else:
return True
is_inherit(MyPerfectDict, Mapping) # False
is_inherit(dict, Mapping) # True
To ask my very specific question I find I need quite a long introduction to motivate and explain it -- I promise there's a proper question at the end!
While reading part of a large Python codebase, sometimes one comes across code where the interface required of an argument is not obvious from "nearby" code in the same module or package. As an example:
def make_factory(schema):
entity = schema.get_entity()
...
There might be many "schemas" and "factories" that the code deals with, and "def get_entity()" might be quite common too (or perhaps the function doesn't call any methods on schema, but just passes it to another function). So a quick grep isn't always helpful to find out more about what "schema" is (and the same goes for the return type). Though "duck typing" is a nice feature of Python, sometimes the uncertainty in a reader's mind about the interface of arguments passed in as the "schema" gets in the way of quickly understanding the code (and the same goes for uncertainty about typical concrete classes that implement the interface). Looking at the automated tests can help, but explicit documentation can be better because it's quicker to read. Any such documentation is best when it can itself be tested so that it doesn't get out of date.
Doctests are one possible approach to solving this problem, but that's not what this question is about.
Python 3 has a "parameter annotations" feature (part of the function annotations feature, defined in PEP 3107). The uses to which that feature might be put aren't defined by the language, but it can be used for this purpose. That might look like this:
def make_factory(schema: "xml_schema"):
...
Here, "xml_schema" identifies a Python interface that the argument passed to this function should support. Elsewhere there would be code that defines that interface in terms of attributes, methods & their argument signatures, etc. and code that allows introspection to verify whether particular objects provide an interface (perhaps implemented using something like zope.interface / zope.schema). Note that this doesn't necessarily mean that the interface gets checked every time an argument is passed, nor that static analysis is done. Rather, the motivation of defining the interface is to provide ways to write automated tests that verify that this documentation isn't out of date (they might be fairly generic tests so that you don't have to write a new test for each function that uses the parameters, or you might turn on run-time interface checking but only when you run your unit tests). You can go further and annotate the interface of the return value, which I won't illustrate.
So, the question:
I want to do exactly that, but using Python 2 instead of Python 3. Python 2 doesn't have the function annotations feature. What's the "closest thing" in Python 2? Clearly there is more than one way to do it, but I suspect there is one (relatively) obvious way to do it.
For extra points: name a library that implements the one obvious way.
Take a look at plac that uses annotations to define a command-line interface for a script. On Python 2.x it uses plac.annotations() decorator.
The closest thing is, I believe, an annotation library called PyAnno.
From the project webpage:
"The Pyanno annotations have two functions:
Provide a structured way to document Python code
Perform limited run-time checking "
My karate instructor is fond of saying, "a block is a lock is a throw is a blow." What he means is this: When we come to a technique in a form, although it might seem to look like a block, a little creativity and examination shows that it can also be seen as some kind of joint lock, or some kind of throw, or some kind of blow.
So it is with the way the django template syntax uses the dot (".") character. It perceives it first as a dictionary lookup, but it will also treat it as a class attribute, a method, or list index - in that order. The assumption seems to be that, one way or another, we are looking for a piece of knowledge. Whatever means may be employed to store that knowledge, we'll treat it in such a way as to get it into the template.
Why doesn't python do the same? If there's a case where I might have assigned a dictionary term spam['eggs'], but know for sure that spam has an attribute eggs, why not let me just write spam.eggs and sort it out the way django templates do?
Otherwise, I have to except an AttributeError and add three additional lines of code.
I'm particularly interested in the philosophy that drives this setup. Is it regarded as part of strong typing?
django templates and python are two, unrelated languages. They also have different target audiences.
In django templates, the target audience is designers, who proabably don't want to learn 4 different ways of doing roughly the same thing ( a dictionary lookup ). Thus there is a single syntax in django templates that performs the lookup in several possible ways.
python has quite a different audience. developers actually make use of the many different ways of doing similar things, and overload each with distinct meaning. When one fails it should fail, because that is what the developer means for it to do.
JUST MY correct OPINION's opinion is indeed correct. I can't say why Guido did it this way but I can say why I'm glad that he did.
I can look at code and know right away if some expression is accessing the 'b' key in a dict-like object a, the 'b' attribute on the object a, a method being called on or the b index into the sequence a.
Python doesn't have to try all of the above options every time there is an attribute lookup. Imagine if every time one indexed into a list, Python had to try three other options first. List intensive programs would drag. Python is slow enough!
It means that when I'm writing code, I have to know what I'm doing. I can't just toss objects around and hope that I'll get the information somewhere somehow. I have to know that I want to lookup a key, access an attribute, index a list or call a method. I like it that way because it helps me think clearly about the code that I'm writing. I know what the identifiers are referencing and what attributes and methods I'm expecting the object of those references to support.
Of course Guido Van Rossum might have just flipped a coin for all I know (He probably didn't) so you would have to ask him yourself if you really want to know.
As for your comment about having to surround these things with try blocks, it probably means that you're not writing very robust code. Generally, you want your code to expect to get some piece of information from a dict-like object, list-like object or a regular object. You should know which way it's going to do it and let anything else raise an exception.
The exception to this is that it's OK to conflate attribute access and method calls using the property decorator and more general descriptors. This is only good if the method doesn't take arguments.
The different methods of accessing
attributes do different things. If
you have a function foo the two lines
of code
a = foo,
a = foo()
do two
very different things. Without
distinct syntax to reference and call
functions there would be no way for
python to know whether the variable
should be a reference to foo or the
result of running foo. The () syntax removes the ambiguity.
Lists and dictionaries are two very different data structures. One of the things that determine which one is appropriate in a given situation is how its contents can be accessed (key Vs index). Having separate syntax for both of them reinforces the notion that these two things are not the same and neither one is always appropriate.
It makes sense for these distinctions to be ignored in a template language, the person writing the html doesn't care, the template language doesn't have function pointers so it knows you don't want one. Programmers who write the python that drive the template however do care about these distinctions.
In addition to the points already posted, consider this. Python uses special member variables and functions to provide metadata about the object. Both the interpreter and programmers make heavy use of these. For example, both dicts and lists have a __len__ member function. Now, if a dict's data were accessed by using the . operator, a potential ambiguity arises if the dict has a key called __len__. You could special-case these, but many objects have a __dict__ attribute which is a mapping of member names and values. If that object happened to be a container, which also defined a __len__ attribute, you would end up with an utter mess.
Problems like this would end up turning Python into a mishmash of special cases that the programmer would have to constantly be aware of. This would detract from the reason why many people use Python in the first place, i.e., its elegant simplicity.
Now, consider that new users often shadow built-ins (if the code in SO questions is any indication) and having something like this starts to look like a really bad idea, since it would exacerbate the problem many-fold.
In addition to the responses above, it's not practical to merge dictionary lookup and object lookup in general because of the restrictions on object members.
What if your key has whitespace? What if it's an int, or a frozenset, etc.? Dot notation can't account for these discrepancies, so while it's an acceptable tradeoff for a templating language, it's unacceptable for a general-purpose programming language like Python.