How do I avoid subclassing a pandas DataFrame using composition?


The pandas documentation recommends against sub-classing their data structures. One of their recommended alternatives is to use composition, but they just point readers to a Wikipedia article on composition vs. inheritance. That article and other resources I've found have not helped me understand how to extend a pandas DataFrame using composition. Can someone explain composition in this context and tell me about cases where composition might be a preferred alternative to sub-classing pd.DataFrame? A simple example or a link to information that's more instructive than Wikipedia articles would be very helpful.
In this question I'm specifically asking how composition should be used in cases where someone might be tempted to subclass pd.DataFrame. I understand there are other solutions to extending a Python object that do not involve composition, and I asked another question about extending pandas DataFrames that resulted in a different solution using a wrapper class.
I didn't understand that "wrapping" and "composition" refer to the same approach here, as noted in MaxYarmolinsky's answer below. The answer to the question I linked to above has a more complete discussion about using composition in this case, which may require handling __getattr__, __getitem__, and __setitem__ properly (I realize this is obvious to people who know what they're doing, but I had to ask my previous question because I had failed to get/set items when I tried on my own).

A little googling shows how to create a simple class like the one you describe through composition.
import pandas as pd
class mydataframe:
    def __init__(self, data):
        # Hold the DataFrame as an attribute instead of inheriting from it
        self.coredataframe = pd.DataFrame(data)
        self.otherattribute = None
Then you can add methods and attributes of your own...
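As a rough illustration of the forwarding mentioned in the question (__getattr__, __getitem__, and __setitem__), here is a minimal sketch of a wrapper that holds a DataFrame and delegates to it. The names FrameWrapper, label, and column_total are hypothetical and only serve the example; this is one possible way to do it, not an approach endorsed by the pandas documentation.

```python
import pandas as pd


class FrameWrapper:
    """Hypothetical wrapper that holds a DataFrame instead of subclassing it."""

    def __init__(self, data, label=None):
        self._df = pd.DataFrame(data)
        self.label = label  # example of an extra attribute of our own

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so our own
        # attributes and methods take precedence over the DataFrame's.
        if name == "_df":
            # Guard against infinite recursion if _df is not set yet
            raise AttributeError(name)
        return getattr(self._df, name)

    def __getitem__(self, key):
        # Forward wrapper[key] to the underlying DataFrame
        return self._df[key]

    def __setitem__(self, key, value):
        # Forward wrapper[key] = value to the underlying DataFrame
        self._df[key] = value

    def column_total(self, column):
        # Example of a custom method added on top of the DataFrame
        return int(self._df[column].sum())


wrapped = FrameWrapper({"a": [1, 2, 3]}, label="demo")
wrapped["b"] = wrapped["a"] * 2      # uses __getitem__ and __setitem__
print(wrapped.shape)                 # forwarded through __getattr__
print(wrapped.column_total("b"))
```

Note that methods forwarded this way return plain pandas objects, not FrameWrapper instances, so results need re-wrapping if the extra attributes should survive.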

In OOP, inheritance models an "is-a" relationship, whereas composition models "has-a".
In general, you should reach for composition over inheritance unless you have a specific polymorphic design in mind, as composition is less tightly coupled and more modular. Inheritance is the strongest form of coupling available, and strong coupling leads to maintenance difficulties (everything is connected and hard to separate), whereas composition is much easier to refactor.
Inheritance can also lead to confusing inheritance hierarchies if care is not taken with the design or if the design grows incrementally.
That said, don't be afraid to use inheritance for polymorphism. But be wary of using it for simple code reuse.

Related

Both Inheritance and composition in Python, bad practice?

I'm working on a project where the natural approach is to implement a main object with sub-components based on different classes, e.g. a PC consisting of CPU, GPU, ...
I've started with a composition structure, where the components have attributes and functions inherent to their sub-system and whenever higher level attributes are needed, they are given as arguments.
Now, as I'm adding more functionality, it would make sense to have different types of the main object, e.g. a notebook, which would extend the PC class, but still have a CPU, etc. At the moment, I'm using a separate script, which contains all the functions related to the type.
Would it be considered bad practice to combine inheritance and composition, by using child classes for different types of the main object?

In short
Preferring composition over inheritance does not exclude inheritance, and does not automatically make the combination of both a bad practice. It's about making an informed decision.
More details
The recommendation to prefer composition over inheritance is a rule of thumb. It was first coined by the GoF. If you read their full argumentation, you'll see that it's not about composition being good and inheritance bad; it's that composition is more flexible and therefore more suitable in many cases.
But you'll need to decide case by case. And indeed, if you consider some variant of the composite pattern, specialization of the leaf and composite classes can be perfectly justified in some situations:
polymorphism could avoid a lot of if and case statements,
composition could in some circumstances require additional call-forwarding overhead that might not be necessary when it's really about type specialization,
a combination of composition and inheritance could be used to get the best of both worlds (caution: if applied carelessly, it could also give the worst of both worlds).
Note: If you provide a short overview of the context with a UML diagram, more arguments could be provided for your particular case. Meanwhile, this question on SE could also be of interest.
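To make the combination concrete, here is a minimal sketch based on the PC example from the question. All class and attribute names (CPU, GPU, PC, Notebook, battery_wh, describe) are assumptions made up for illustration: the PC has its components (composition), while a Notebook is a specialised PC (inheritance).

```python
class CPU:
    def __init__(self, cores):
        self.cores = cores


class GPU:
    def __init__(self, memory_gb):
        self.memory_gb = memory_gb


class PC:
    """Main object composed of sub-components (has-a)."""

    def __init__(self, cpu, gpu):
        self.cpu = cpu
        self.gpu = gpu

    def describe(self):
        return f"{self.cpu.cores}-core CPU, {self.gpu.memory_gb} GB GPU"


class Notebook(PC):
    """A specialised PC (is-a) that still has a CPU and a GPU."""

    def __init__(self, cpu, gpu, battery_wh):
        super().__init__(cpu, gpu)
        self.battery_wh = battery_wh

    def describe(self):
        # Extend the parent's behaviour instead of duplicating it
        return f"{super().describe()}, {self.battery_wh} Wh battery"


machines = [PC(CPU(8), GPU(12)), Notebook(CPU(4), GPU(2), 56)]
descriptions = [machine.describe() for machine in machines]
print(descriptions)
```

Code that only needs a PC can treat both objects the same way, which is the polymorphism argument from the list above.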

Should Domain Model Classes always depend on primitives?

Halfway through Architecture Patterns with Python, I have two questions about how the Domain Model classes should be structured and instantiated. Assume that in my Domain Model I have the class DepthMap:
import numpy as np
class DepthMap:
    def __init__(self, map: np.ndarray):
        self.map = map
According to what I understood from the book, this class is not correct since it depends on Numpy, and it should depend only on Python primitives, hence the question: Should Domain Model classes rely only on Python primitives, or is there an exception?
Assuming the answer to the previous question is that classes should depend solely on primitives, what would be the correct way to create a DepthMap from a Numpy array? Assume now that there are more formats from which I can make a DepthMap object.
from typing import List
import numpy as np
class DepthMap:
    def __init__(self, map: List):
        self.map = map
    @classmethod
    def from_numpy(cls, map: np.ndarray):
        return cls(map.tolist())
    @classmethod
    def from_str(cls, map: str):
        # Parse the comma-separated string passed in as `map`
        return cls([float(i) for i in map.split(',')])
or a factory:
class DepthMapFactory:
    @staticmethod
    def from_numpy(map: np.ndarray):
        return DepthMap(map.tolist())
    @staticmethod
    def from_str(map: str):
        return DepthMap([float(i) for i in map.split(',')])
I think even the Repository Pattern, which they go through in the book, could fit in here:
class StrRepository:
    def get(self, map: str):
        return DepthMap([float(i) for i in map.split(',')])
class NumpyRepository:
    def get(self, map: np.ndarray):
        return DepthMap(map.tolist())
The second question: When creating a Domain Model Object from different sources, what is the correct approach?
Note: My background is not software; hence some OOP concepts may be incorrect. Instead of downvoting, please comment and let me know how to improve the question.

I wrote the book, so I can at least have a go at answering your question.
You can use things other than primitives (str, int, boolean etc) in your domain model. Generally, although we couldn't show it in the book, your model classes will contain whole hierarchies of objects.
What you want to avoid is your technical implementation leaking into your code in a way that makes it hard to express your intent. It would probably be inappropriate to pass instances of Numpy arrays around your codebase, unless your domain is Numpy. We're trying to make code easier to read and test by separating the interesting stuff from the glue.
To that end, it's fine for you to have a DepthMap class that exposes some behaviour, and happens to have a Numpy array as its internal storage. That's not any different to you using any other data structure from a library.
If you've got data as a flat file or something, and there is complex logic involved in creating the Numpy array, then I think a Factory is appropriate. That way you can keep the boring, ugly code for producing a DepthMap at the edge of your system, and out of your model.
If creating a DepthMap from a string is really a one-liner, then a classmethod is probably better because it's easier to find and understand.
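A minimal sketch of what this could look like, assuming the string format is a comma-separated list of numbers as in the question. The method names nearest and mean_depth are hypothetical examples of domain behaviour and do not come from the book.

```python
import numpy as np


class DepthMap:
    """Domain class that happens to use a Numpy array as internal storage."""

    def __init__(self, values):
        # The array is an implementation detail; callers see only DepthMap
        self._values = np.asarray(values, dtype=float)

    @classmethod
    def from_str(cls, text):
        # One-liner construction, so a classmethod is easy to find and read
        return cls([float(item) for item in text.split(",")])

    def nearest(self):
        # Hypothetical domain behaviour: the smallest depth in the map
        return float(self._values.min())

    def mean_depth(self):
        # Hypothetical domain behaviour: the average depth
        return float(self._values.mean())


depth_map = DepthMap.from_str("1.5,2.0,4.0")
print(depth_map.nearest(), depth_map.mean_depth())
```

The rest of the codebase passes DepthMap objects around and calls their methods, so Numpy stays behind the class boundary.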

I think it's perfectly fine to depend on libraries that are pure language extensions, or else you will just end up having to define tons of "interface contracts" (Python doesn't have interfaces as a language construct -- but those can be conceptual) to abstract away these data structures, and in the end those newly introduced contracts will probably be poor abstractions anyway and just result in additional complexity.
That means your domain objects can generally depend on these pure types. On the other hand, I also think these types should be considered as language "primitives" (native may be more accurate) just like datetime, and that you'd want to avoid primitive obsession.
In other words, DepthMap, which is a domain concept, is allowed to depend on Numpy for its construction (no abstraction necessary here), but Numpy shouldn't necessarily be allowed to flow deep into the domain (unless it's the appropriate abstraction).
Or in pseudo-code, this could be bad:
someOperation(Numpy: depthMap);
Where this may be better:
class DepthMap(Numpy: data);
someOperation(DepthMap depthMap);
And regarding the second question, from a DDD perspective, if the DepthMap class has a Numpy array as its internal structure but has to be constructed from other sources (string or list, for example), would the best approach be a repository pattern? Or is this just for handling databases and a Factory is a better approach?
The Repository pattern is exclusively for storage/retrieval, so it wouldn't be appropriate. Now, you may have a factory method directly on DepthMap that accepts a Numpy array, or you may have a dedicated factory. If you want to decouple DepthMap from Numpy then it could make sense to introduce a dedicated factory, but it seems unnecessary here at first glance.

Should Domain Model classes rely only on Python primitives
Speaking purely from a domain-driven-design perspective, there's absolutely no reason that this should be true.
Your domain dynamics are normally going to be described using the language of your domain, i.e. the manipulation of ENTITIES and VALUE OBJECTS (Evans, 2003) that are facades that place domain semantics on top of your data structures.
The underlying data structures, behind the facades, are whatever you need to get the job done.
There is nothing in domain driven design requiring that you forsake a well-tested off the shelf implementation of a highly optimized Bazzlefraz and instead write your own from scratch.
Part of the point of domain driven design is that we want to be making our investment into the code that helps the business, not the plumbing.

Python: Nested Class vs Inheritance [duplicate]

Closed 10 years ago.
There are two schools of thought on how to best extend, enhance, and reuse code in an object-oriented system:
Inheritance: extend the functionality of a class by creating a subclass. Override superclass members in the subclasses to provide new functionality. Make methods abstract/virtual to force subclasses to "fill-in-the-blanks" when the superclass wants a particular interface but is agnostic about its implementation.
Aggregation: create new functionality by taking other classes and combining them into a new class. Attach a common interface to this new class for interoperability with other code.
What are the benefits, costs, and consequences of each? Are there other alternatives?
I see this debate come up on a regular basis, but I don't think it's been asked on Stack Overflow yet (though there is some related discussion). There's also a surprising lack of good Google results for it.

It's not a matter of which is the best, but of when to use what.
In the 'normal' cases a simple question is enough to find out if we need inheritance or aggregation.
If the new class is more or less the same as the original class, use inheritance. The new class is now a subclass of the original class.
If the new class must have the original class, use aggregation. The new class now has the original class as a member.
However, there is a big gray area. So we need several other tricks.
If we have used inheritance (or we plan to use it) but we only use part of the interface, or we are forced to override a lot of functionality to keep the correlation logical, then we have a big nasty smell that indicates we should have used aggregation.
If we have used aggregation (or we plan to use it) but we find out we need to copy almost all of the functionality, then we have a smell that points in the direction of inheritance.
To cut it short: we should use aggregation if part of the interface is not used or has to be changed to avoid an illogical situation. We only need to use inheritance if we need almost all of the functionality without major changes. And when in doubt, use aggregation.
Another possibility, for the case that we have a class that needs part of the functionality of the original class, is to split the original class into a root class and a subclass, and let the new class inherit from the root class. But take care not to create an illogical separation.
Let's add an example. We have a class 'Dog' with methods: 'Eat', 'Walk', 'Bark', 'Play'.
class Dog
  Eat;
  Walk;
  Bark;
  Play;
end;
We now need a class 'Cat' that needs 'Eat', 'Walk', 'Purr', and 'Play'. So first, try to extend it from Dog.
class Cat is Dog
  Purr;
end;
Looks all right, but wait. This cat can Bark (cat lovers will kill me for that). And a barking cat violates the principles of the universe. So we need to override the Bark method so that it does nothing.
class Cat is Dog
  Purr;
  Bark = null;
end;
OK, this works, but it smells bad. So let's try an aggregation:
class Cat
  has Dog;
  Eat = Dog.Eat;
  Walk = Dog.Walk;
  Play = Dog.Play;
  Purr;
end;
OK, this is nice. This cat does not bark anymore, not even silently. But it still has an internal dog that wants out. So let's try solution number three:
class Pet
  Eat;
  Walk;
  Play;
end;
class Dog is Pet
  Bark;
end;
class Cat is Pet
  Purr;
end;
This is much cleaner. No internal dogs. And cats and dogs are at the same level. We can even introduce other pets to extend the model. Unless it is a fish, or something that does not walk. In that case we again need to refactor. But that is something for another time.
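For readers who prefer real code to the pseudocode, solution number three could be sketched in Python roughly as follows. The method bodies and the name attribute are assumptions added so that the example runs; only the class layout comes from the answer.

```python
class Pet:
    """Root class holding what dogs and cats share."""

    def __init__(self, name):
        self.name = name

    def eat(self):
        return f"{self.name} eats"

    def walk(self):
        return f"{self.name} walks"

    def play(self):
        return f"{self.name} plays"


class Dog(Pet):
    def bark(self):
        return f"{self.name} barks"


class Cat(Pet):
    def purr(self):
        return f"{self.name} purrs"


cat = Cat("Tom")
dog = Dog("Rex")
print(cat.eat(), cat.purr(), dog.bark())
print(hasattr(cat, "bark"))  # the cat has no bark method at all
```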

At the beginning of GOF they state
Favor object composition over class inheritance.
This is further discussed here

The difference is typically expressed as the difference between "is a" and "has a". Inheritance, the "is a" relationship, is summed up nicely in the Liskov Substitution Principle. Aggregation, the "has a" relationship, is just that - it shows that the aggregating object has one of the aggregated objects.
Further distinctions exist as well - private inheritance in C++ indicates a "is implemented in terms of" relationship, which can also be modeled by the aggregation of (non-exposed) member objects as well.

Here's my most common argument:
In any object-oriented system, there are two parts to any class:
Its interface: the "public face" of the object. This is the set of capabilities it announces to the rest of the world. In a lot of languages, the set is well defined into a "class". Usually these are the method signatures of the object, though it varies a bit by language.
Its implementation: the "behind the scenes" work that the object does to satisfy its interface and provide functionality. This is typically the code and member data of the object.
One of the fundamental principles of OOP is that the implementation is encapsulated (i.e., hidden) within the class; the only thing that outsiders should see is the interface.
When a subclass inherits from a superclass, it typically inherits both the implementation and the interface. This, in turn, means that you're forced to accept both as constraints on your class.
With aggregation, you get to choose either implementation or interface, or both -- but you're not forced into either. The functionality of an object is left up to the object itself. It can defer to other objects as it likes, but it's ultimately responsible for itself. In my experience, this leads to a more flexible system: one that's easier to modify.
So, whenever I'm developing object-oriented software, I almost always prefer aggregation over inheritance.
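A small sketch of the interface argument, using a stack built on a Python list. The classes ListStack and Stack are hypothetical examples, not taken from the answer: the inheriting version accepts the whole list interface, while the aggregating version chooses what to expose.

```python
class ListStack(list):
    """Inherits from list: gets insert, sort, indexing, etc., wanted or not."""

    def push(self, item):
        self.append(item)


class Stack:
    """Aggregates a list: exposes only the operations a stack should have."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)


inherited = ListStack()
inherited.push(1)
inherited.push(2)
inherited.insert(0, 99)  # allowed, although it breaks stack discipline

composed = Stack()
composed.push(1)
composed.push(2)
print(hasattr(composed, "insert"))  # the list interface is not exposed
```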

I gave an answer to "Is a" vs "Has a" : which one is better?.
Basically I agree with other folks: use inheritance only if your derived class truly is the type you're extending, not merely because it contains the same data. Remember that inheritance means the subclass gains the methods as well as the data.
Does it make sense for your derived class to have all the methods of the superclass? Or do you just quietly promise yourself that those methods should be ignored in the derived class? Or do you find yourself overriding methods from the superclass, making them no-ops so no one calls them inadvertently? Or giving hints to your API doc generation tool to omit the method from the doc?
Those are strong clues that aggregation is the better choice in that case.

I see a lot of "is-a vs. has-a; they're conceptually different" responses on this and the related questions.
The one thing I've found in my experience is that trying to determine whether a relationship is "is-a" or "has-a" is bound to fail. Even if you can correctly make that determination for the objects now, changing requirements mean that you'll probably be wrong at some point in the future.
Another thing I've found is that it's very hard to convert from inheritance to aggregation once there's a lot of code written around an inheritance hierarchy. Just switching from a superclass to an interface means changing nearly every subclass in the system.
And, as I mentioned elsewhere in this post, aggregation tends to be more flexible than inheritance.
So, you have a perfect storm of arguments against inheritance whenever you have to choose one or the other:
Your choice will likely be the wrong one at some point
Changing that choice is difficult once you've made it.
Inheritance tends to be a worse choice as it's more constraining.
Thus, I tend to choose aggregation -- even when there appears to be a strong is-a relationship.

The question is normally phrased as Composition vs. Inheritance, and it has been asked here before.

I wanted to make this a comment on the original question, but 300 characters bites [;<).
I think we need to be careful. First, there are more flavors than the two rather specific examples made in the question.
Also, I suggest that it is valuable not to confuse the objective with the instrument. One wants to make sure that the chosen technique or methodology supports achievement of the primary objective, but I don't think an out-of-context which-technique-is-best discussion is very useful. It does help to know the pitfalls of the different approaches along with their clear sweet spots.
For example, what are you out to accomplish, what do you have available to start with, and what are the constraints?
Are you creating a component framework, even a special purpose one? Are interfaces separable from implementations in the programming system or is it accomplished by a practice using a different sort of technology? Can you separate the inheritance structure of interfaces (if any) from the inheritance structure of classes that implement them? Is it important to hide the class structure of an implementation from the code that relies on the interfaces the implementation delivers? Are there multiple implementations to be usable at the same time, or is the variation more over time as a consequence of maintenance and enhancement? These and more need to be considered before you fixate on a tool or a methodology.
Finally, is it that important to lock distinctions in the abstraction and how you think of it (as in is-a versus has-a) to different features of the OO technology? Perhaps so, if it keeps the conceptual structure consistent and manageable for you and others. But it is wise not to be enslaved by that and the contortions you might end up making. Maybe it is best to stand back a level and not be so rigid (but leave good narration so others can tell what's up). [I look for what makes a particular portion of a program explainable, but some times I go for elegance when there is a bigger win. Not always the best idea.]
I'm an interface purist, and I am drawn to the kinds of problems and approaches where interface purism is appropriate, whether building a Java framework or organizing some COM implementations. That doesn't make it appropriate for everything, not even close to everything, even though I swear by it. (I have a couple of projects that appear to provide serious counter-examples against interface purism, so it will be interesting to see how I manage to cope.)

I'll cover the where-these-might-apply part. Here's an example of both, in a game scenario. Suppose, there's a game which has different types of soldiers. Each soldier can have a knapsack which can hold different things.
Inheritance here?
There's a marine, a green beret and a sniper. These are types of soldiers. So, there's a base class Soldier with Marine, Green Beret and Sniper as derived classes.
Aggregation here?
The knapsack can contain grenades, guns (different types), knife, medikit, etc. A soldier can be equipped with any of these at any given point in time, plus he can also have a bulletproof vest which acts as armor when attacked and his injury decreases to a certain percentage. The soldier class contains an object of bulletproof vest class and the knapsack class which contains references to these items.
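The two relationships from this scenario could be sketched as below. The attribute names, the health value of 100, and the way the vest scales damage are all assumptions invented for illustration; the answer only describes the class relationships.

```python
class Knapsack:
    """Holds references to whatever the soldier carries."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)


class BulletproofVest:
    def __init__(self, absorption):
        self.absorption = absorption  # fraction of damage absorbed (assumed model)

    def reduce(self, damage):
        return damage * (1 - self.absorption)


class Soldier:
    """Base class; has-a knapsack and optionally has-a vest (aggregation)."""

    def __init__(self, name, vest=None):
        self.name = name
        self.health = 100.0
        self.knapsack = Knapsack()
        self.vest = vest

    def take_hit(self, damage):
        if self.vest is not None:
            damage = self.vest.reduce(damage)
        self.health -= damage


class Marine(Soldier):   # is-a Soldier (inheritance)
    pass


class Sniper(Soldier):   # is-a Soldier (inheritance)
    pass


sniper = Sniper("Ana", vest=BulletproofVest(0.5))
sniper.knapsack.add("medikit")
sniper.take_hit(40)
print(sniper.health, sniper.knapsack.items)
```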

I think it's not an either/or debate. It's just that:
is-a (inheritance) relationships occur less often than has-a (composition) relationships.
Inheritance is harder to get right, even when it's appropriate to use it, so due diligence has to be taken because it can break encapsulation, encourage tight coupling by exposing implementation and so forth.
Both have their place, but inheritance is riskier.
Although of course it wouldn't make sense for a class Shape to 'have-a' Point class and a Square class. Here inheritance is due.
People tend to think about inheritance first when trying to design something extensible; that is what's wrong.

Favouring happens when both candidates qualify: A and B are options and you favour A. The reason is that composition offers more extension/flexibility possibilities than generalization. This extension/flexibility refers mostly to runtime/dynamic flexibility.
The benefit is not immediately visible. To see the benefit you need to wait for the next unexpected change request. So in most cases those who stuck to generalization fail when compared to those who embraced composition (except one obvious case mentioned later). Hence the rule. From a learning point of view, if you can implement dependency injection successfully then you should know which one to favour and when. The rule helps you in making a decision as well; if you are not sure, select composition.
Summary: Composition: the coupling is reduced by just having some smaller things you plug into something bigger, and the bigger object just calls the smaller object back. Generalization: from an API point of view, defining that a method can be overridden is a stronger commitment than defining that a method can be called (there are very few occasions when generalization wins). And never forget that with composition you are using inheritance too, from an interface instead of a big class.

Both approaches are used to solve different problems. You don't always need to aggregate over two or more classes when inheriting from one class.
Sometimes you do have to aggregate a single class because that class is sealed or otherwise has non-virtual members you need to intercept, so you create a proxy layer. That obviously isn't valid in terms of inheritance, but as long as the class you are proxying has an interface you can subscribe to, this can work out fairly well.

How many private variables are too many? Capsulizing classes? Class Practices?

Okay, so I am currently working on an in-house statistics package for Python. It's mainly geared towards working in combination with the ArcGIS geoprocessor, for model comparison and tools.
Anyway, I have a single class that calculates statistics. Let's just call it Stats. Now my Stats class is getting to the point of being very large. It uses statistics calculated by other statistics to calculate other statistics sets, and so on. This leads to a lot of private variables that are kept simply to prevent recalculation. However, there are certain ones that, while used quite frequently, are often only used by one or two key subsections of functionality (e.g. summation of matrix diagonals, and probabilities). It's starting to become a major eyesore, and I feel as if I am doing this terribly wrong.
So is this bad?
A coworker recommended that I simply start putting core and common functionality together in the main class, and then have capsules that take a reference to the main class and do whatever functionality they need within themselves. E.g. for calculating the accuracy of model predictions, I would create a capsule that simply takes a reference to the parent and offloads all of the calculations needed for model predictions.
Is something like this really a good idea? Is there a better way? Right now I have over a dozen different sub-statistics that are dumped to a text file to make a smallish report. The code base is growing, and I would love to start splitting up more and more of my Python classes. I am just not sure what the best way of doing this is.

Why not create a class for each statistic you need to compute, and when one of the statistics requires another, just pass an instance of the latter to the computing method? However, little is known about your code and the required functionality. Maybe you could describe more broadly what kind of statistics you need to calculate and how they depend on each other?
Anyway, if I had to compute certain statistics, I would instantly turn to creating a separate class for each of them. I did this once, when I was writing a code statistics library for Python. Every statistic, like how many times a class is inherited or how often a function was called, was a separate class. This way each of them was simple; however, I didn't need to use any of them in the others.
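A minimal sketch of the one-class-per-statistic idea, where one statistic receives an instance of another instead of recomputing it. The Mean and Variance classes are hypothetical stand-ins, since the question does not say which statistics are involved.

```python
class Mean:
    """One small class per statistic; computes its value once."""

    def __init__(self, values):
        self.value = sum(values) / len(values)


class Variance:
    """Depends on the mean, so it receives a Mean instance as an argument."""

    def compute(self, values, mean):
        # Population variance, reusing the already-computed mean
        return sum((v - mean.value) ** 2 for v in values) / len(values)


data = [2.0, 4.0, 6.0]
mean = Mean(data)
variance = Variance().compute(data, mean)
print(mean.value, variance)
```

Each class stays small, and the dependency between statistics is visible in the method signature rather than hidden in shared private variables.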

I can think of a couple of solutions. One would be to simply store values in an array with an enum like so:
StatisticType = enum('AveragePerDay','MedianPerDay'...)
Another would be to use inheritance like so:
class StatisticBase:
    ...
class AveragePerDay(StatisticBase):
    ...
class MedianPerDay(StatisticBase):
    ...
There is no hard and fast rule on "too many", however a guideline is that if the list of fields, properties, and methods when collapsed, is longer than a single screen full, it's probably too big.

It's a common anti-pattern for a class to become "too fat" (have too much functionality and related state), and while this is commonly observed about "base classes" (whence the "fat base class" monicker for the anti-pattern), it can really happen without any inheritance involved.
Many design patterns (DPs for short) can help you re-factor your code to whittle down the large, untestable, unmaintainable "fat class" to a nice package of cooperating classes (which can be used through "Facade" DPs for simplicity): consider, for example, State, Strategy, Memento, Proxy.
You could attack this problem directly, but I think, especially since you mention in a comment that you're looking at it as a general class design topic, it may offer you a good opportunity to dig into the very useful field of design patterns, and especially "refactoring to patterns" (Joshua Kerievsky's book by that title is excellent, though it doesn't touch on Python-specific issues).
Specifically, I believe you'll be focusing mostly on a few Structural and Behavioral patterns (since I don't think you have much need for Creational ones for this use case, except maybe "lazy initialization" of some of your expensive-to-compute state that's only needed in certain cases -- see this wikipedia entry for a pretty exhaustive listing of DPs, with classification and links for further explanations of each).
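As one possible take on the "lazy initialization" idea for expensive-to-compute state, the standard library's functools.cached_property (Python 3.8+) computes a value on first access and stores it on the instance. The ConfusionStats class and its attributes are hypothetical, loosely modelled on the matrix-diagonal example from the question; diagonal_calls exists only to show that the computation runs once.

```python
from functools import cached_property


class ConfusionStats:
    """Hypothetical helper that caches derived values instead of storing them by hand."""

    def __init__(self, matrix):
        self.matrix = matrix
        self.diagonal_calls = 0  # counts how often the diagonal sum is computed

    @cached_property
    def diagonal_sum(self):
        # Runs on first access only; the result is then stored on the instance
        self.diagonal_calls += 1
        return sum(self.matrix[i][i] for i in range(len(self.matrix)))

    @cached_property
    def total(self):
        return sum(sum(row) for row in self.matrix)

    @property
    def accuracy(self):
        # Built from cached pieces, so repeated access is cheap
        return self.diagonal_sum / self.total


stats = ConfusionStats([[8, 2], [1, 9]])
first = stats.accuracy
second = stats.accuracy
print(first, stats.diagonal_calls)
```

This removes the need for hand-written private cache variables, at the cost of assuming the underlying matrix does not change after construction.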

Since you are asking about best practices you might want to check out pylint (http://www.logilab.org/857). It has many good suggestions about code style including ones relating to how many private variables in a class.

Are inner-classes unpythonic?

My colleague just pointed out that my use of inner-classes seemed to be "unpythonic". I guess it violates the "flat is better than nested" heuristic.
What do people here think? Are inner-classes something which are more appropriate to Java etc than Python?
NB : I don't think this is a "subjective" question. Surely style and aesthetics are objective within a programming community.
Related Question: Is there a benefit to defining a class inside another class in Python?

This may not deserve a [subjective] tag on StackOverflow, but it's subjective on the larger stage: some language communities encourage nesting and others discourage it. So why would the Python community discourage nesting? Because Tim Peters put it in The Zen of Python? Does it apply to every scenario, always, without exception? Rules should be taken as guidelines, meaning you should not switch off your brain when applying them. You must understand what the rule means and why it's important enough that someone bothered to make a rule.
The biggest reason I know to keep things flat is because of another philosophy: do one thing and do it well. Lots of little special purpose classes living inside other classes is a sign that you're not abstracting enough. I.e., you should be removing the need and desire to have inner classes, not just moving them outside for the sake of following rules.
But sometimes you really do have some behavior that should be abstracted into a class, and it's a special case that only obtains within another single class. In that case you should use an inner class because it makes sense, and it tells anyone else reading the code that there's something special going on there.
Don't slavishly follow rules.
Do understand the reason for a rule and respect that.
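One case where an inner class arguably makes sense is a helper that only has meaning inside its enclosing class. The sketch below is a hypothetical example of that situation, a linked stack whose node type is never used anywhere else.

```python
class LinkedStack:
    class _Node:
        """Only meaningful inside LinkedStack, so it lives inside it."""

        def __init__(self, value, next_node):
            self.value = value
            self.next_node = next_node

    def __init__(self):
        self._head = None

    def push(self, value):
        # New node points at the previous head
        self._head = self._Node(value, self._head)

    def pop(self):
        node = self._head
        if node is None:
            raise IndexError("pop from empty stack")
        self._head = node.next_node
        return node.value


stack = LinkedStack()
stack.push(1)
stack.push(2)
popped = [stack.pop(), stack.pop()]
print(popped)
```

The nesting is one level deep and callers never need to reach through it, which signals to readers that _Node is an implementation detail.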

"Flat is better than nested" is focused on avoiding excessive nesting -- i.e., seriously deep hierarchies. One level of nesting, per se, is no big deal: as long as your nesting still respects (a weakish form of) the Law of Demeter, as typically confirmed by the fact that you don't find yourself writing stuff like onething.another.andyet.anotherone (too many dots in an expression are a "code smell" suggesting you've gone too deep in nesting and need to refactor to flatten things out), I wouldn't worry too much.

Actually, I'm not sure if I agree with the whole premise that "Flat is better than nested". Sometimes, quite often actually, the best way to represent something is hierarchically... Even nature itself often uses hierarchy.
