Python: custom deepcopy in class hierarchy

I have a class hierarchy representing linear algebra expressions (as expression trees), something like this (it’s actually more complicated than that, but it should be enough to give you an idea):
Expression
    Operator
        Times
        Plus
    Symbol
        Matrix
        Scalar
I’m manipulating those expressions in many different ways, so I have to copy them a lot because I want to explore different ways of manipulating them (I really don’t see any way around copying them). Not surprisingly, copy.deepcopy() is slow.
Not all class attributes need to be deepcopied, and those expression trees will always be trees (if not, something is fundamentally wrong), so I can probably save some time by simplifying the memoization (or not using it at all).
I want to make copying faster by implementing a custom copy function. This could either mean implementing __deepcopy__(), writing an entirely new function or using something like get_state and set_state, I really don’t care. At the moment, I don’t care about pickling.
I’m using Python 3.
Since I have this class hierarchy where new attributes are introduced at different levels, I would prefer not to start from scratch on each level. Instead, it would be desirable to reuse some functionality of the superclass similar to how in __init__, one usually calls __init__ of the superclass. However, I would prefer not to call __init__ because that sometimes does some extra stuff that is not necessary when copying.
How do I do that in the fastest and most pythonic way? I was not able to find any reasonable guidelines regarding how to implement this copy functionality under those circumstances. I looked at the implementation of deepcopy both in Python 2 and 3. In Python 2, a bunch of different methods are used (get_state/set_state, using __dict__,…), in Python 3, I’m not able to find the corresponding function at all.

I found one possible solution myself:
class Expression(object):
    ...
    def __deepcopy__(self, memo):
        cpy = object.__new__(type(self))
        # manually copy all attributes of Expression here
        return cpy

class Operator(Expression):
    ...
    def __deepcopy__(self, memo):
        cpy = Expression.__deepcopy__(self, memo)
        # manually copy all attributes exclusive to Operator here
        return cpy
The new object is always created in __deepcopy__ of Expression. By using object.__new__(type(self)), the new object has the type of the object on which __deepcopy__ was originally called (for example Operator).
Unfortunately, this solution is not much faster than the default deepcopy. I'm wondering if this is because of all those calls to __deepcopy__ all the way to the top of the class hierarchy. I will investigate this.
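To make the pattern above concrete, here is a minimal runnable sketch. The attribute names `name` and `operands` are hypothetical (the real classes are not shown), and it assumes every child is itself an Expression and that the structure really is a tree, so the memo dictionary is passed along but never consulted.

```python
import copy


class Expression:
    def __init__(self, name):
        self.name = name  # hypothetical attribute; a str is immutable, so it can be shared

    def __deepcopy__(self, memo):
        # Create the instance without running __init__, keeping the subclass type.
        cpy = object.__new__(type(self))
        cpy.name = self.name
        return cpy


class Operator(Expression):
    def __init__(self, name, operands):
        super().__init__(name)
        self.operands = operands  # hypothetical attribute: list of child expressions

    def __deepcopy__(self, memo):
        cpy = super().__deepcopy__(memo)
        # Assumption: a tree has no shared nodes, so memo lookups are skipped
        # and children are copied by calling their __deepcopy__ directly.
        cpy.operands = [child.__deepcopy__(memo) for child in self.operands]
        return cpy


tree = Operator("Plus", [Expression("A"), Operator("Times", [Expression("B"), Expression("c")])])
clone = copy.deepcopy(tree)
print(type(clone).__name__, clone.operands[1].operands[0].name)
```

Calling `child.__deepcopy__(memo)` directly avoids the dispatch work that `copy.deepcopy` does for each node; whether that is a measurable gain should be checked with a profiler on the real trees.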

Related

Should Domain Model Classes always depend on primitives?

Halfway through Architecture Patterns with Python, I have two questions about how should the Domain Model Classes be structured and instantiated. Assume on my Domain Model I have the class DepthMap:
import numpy as np

class DepthMap:
    def __init__(self, map: np.ndarray):
        self.map = map
According to what I understood from the book, this class is not correct since it depends on Numpy, and it should depend only on Python primitives, hence the question: Should Domain Model classes rely only on Python primitives, or is there an exception?
Assuming the answer to the previous question is that classes should depend solely on primitives, what would be the correct way to create a DepthMap from a Numpy array? Assume now that there are more formats from which I can make a DepthMap object.
from typing import List

class DepthMap:
    def __init__(self, map: List):
        self.map = map

    @classmethod
    def from_numpy(cls, map: np.ndarray):
        return cls(map.tolist())

    @classmethod
    def from_str(cls, map: str):
        return cls([float(i) for i in map.split(',')])
or a factory:
class DepthMapFactory:
    @staticmethod
    def from_numpy(map: np.ndarray):
        return DepthMap(map.tolist())

    @staticmethod
    def from_str(map: str):
        return DepthMap([float(i) for i in map.split(',')])
I think even the Repository Pattern, which they go through in the book, could fit in here:
class StrRepository:
    def get(self, map: str):
        return DepthMap([float(i) for i in map.split(',')])

class NumpyRepository:
    def get(self, map: np.ndarray):
        return DepthMap(map.tolist())
The second question: When creating a Domain Model Object from different sources, what is the correct approach?
Note: My background is not software; hence some OOP concepts may be incorrect. Instead of downvoting, please comment and let me know how to improve the question.

I wrote the book, so I can at least have a go at answering your question.
You can use things other than primitives (str, int, boolean etc) in your domain model. Generally, although we couldn't show it in the book, your model classes will contain whole hierarchies of objects.
What you want to avoid is your technical implementation leaking into your code in a way that makes it hard to express your intent. It would probably be inappropriate to pass instances of Numpy arrays around your codebase, unless your domain is Numpy. We're trying to make code easier to read and test by separating the interesting stuff from the glue.
To that end, it's fine for you to have a DepthMap class that exposes some behaviour, and happens to have a Numpy array as its internal storage. That's not any different to you using any other data structure from a library.
If you've got data as a flat file or something, and there is complex logic involved in creating the Numpy array, then I think a Factory is appropriate. That way you can keep the boring, ugly code for producing a DepthMap at the edge of your system, and out of your model.
If creating a DepthMap from a string is really a one-liner, then a classmethod is probably better because it's easier to find and understand.
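As a rough illustration of that advice, the sketch below keeps the storage type private and exposes a one-line classmethod constructor. To stay runnable without third-party packages it uses the standard library's `array.array` as a stand-in for a Numpy array; the `max_depth` method and the `_values` attribute are hypothetical names, not taken from the book.

```python
from array import array


class DepthMap:
    """Domain object; the storage type is an internal detail."""

    def __init__(self, values):
        # Assumption: array('d') stands in for a Numpy array in this sketch.
        self._values = array("d", values)

    @classmethod
    def from_str(cls, text: str) -> "DepthMap":
        # Construction is a one-liner, so a classmethod keeps it easy to find.
        return cls(float(part) for part in text.split(","))

    def max_depth(self) -> float:
        # Callers use domain behaviour instead of touching the storage directly.
        return max(self._values)


depth_map = DepthMap.from_str("0.5, 1.25, 3.0")
print(depth_map.max_depth())
```

If parsing the input grew beyond a line or two, moving it into a separate factory at the edge of the system would keep the model class focused on behaviour.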

I think it's perfectly fine to depend on libraries that are pure language extensions. Otherwise you will end up having to define tons of "interface contracts" (Python doesn't have interfaces as a language construct, but they can be conceptual) to abstract away these data structures, and in the end those newly introduced contracts will probably be poor abstractions anyway and just add complexity.
That means your domain objects can generally depend on these pure types. On the other hand, I also think these types should be considered language "primitives" ("native" may be more accurate), just like datetime, and that you'd want to avoid primitive obsession.
In other words, DepthMap, which is a domain concept, is allowed to depend on Numpy for its construction (no abstraction necessary here), but Numpy shouldn't necessarily be allowed to flow deep into the domain (unless it's the appropriate abstraction).
Or in pseudo-code, this could be bad:
someOperation(Numpy: depthMap);
Where this may be better:
class DepthMap(Numpy: data);
someOperation(DepthMap depthMap);
And regarding the second question, from a DDD perspective, if the DepthMap class has a Numpy array as its internal structure but has to be constructed from other sources (string or list, for example), would the best approach be a repository pattern? Or is this just for handling databases, and a Factory is a better approach?
The Repository pattern is exclusively for storage/retrieval, so it wouldn't be appropriate. Now, you may have a factory method directly on DepthMap that accepts a Numpy array, or you may have a dedicated factory. If you want to decouple DepthMap from Numpy then it could make sense to introduce a dedicated factory, but it seems unnecessary here at first glance.

Should Domain Model classes rely only on Python primitives
Speaking purely from a domain-driven-design perspective, there's absolutely no reason that this should be true.
Your domain dynamics are normally going to be described using the language of your domain, i.e. the manipulation of ENTITIES and VALUE OBJECTS (Evans, 2003) that are facades that place domain semantics on top of your data structures.
The underlying data structures, behind the facades, are whatever you need to get the job done.
There is nothing in domain driven design requiring that you forsake a well-tested off the shelf implementation of a highly optimized Bazzlefraz and instead write your own from scratch.
Part of the point of domain driven design is that we want to be making our investment into the code that helps the business, not the plumbing.

Python data model / internal methods

I would like to understand the data model in python (3.8.2), so maybe someone can explain in simple words, why every basic data type is a class-object and has so many attributes and functions.
Consider the following code:
>>> num = 1337
>>> type(num)
<class 'int'>
>>> dir(num)
['__abs__',
 '__add__',
 '__and__',
 ...
 'real',
 'to_bytes']
So I can do:
>>> num.__add__(1)
1338
So my question is: why are such functions part of the user defined variables and isn't there quite some overhead to this?

"everything is object in Python", so "everything" (maybe not exactly everything, I don't know) is a class instance and has potentially attributes and methods.
That said, you have to know that, by convention, every method whose name begins with __ (without also ending in __) is considered private: Python mangles such names, which makes them almost protected in practice, and you should not use them directly.
Every method whose name starts and ends with __ is called a magic method and has a pretty well-defined role. These methods can be considered a sort of hook bound to particular operators or moments in the object lifecycle.
For example __add__ is the method bound to the + operator. When you do an addition, you are actually calling the __add__ method of a number, giving it another number as parameter.
As you noticed, you also can call directly the __add__ method if you want, but it is not encouraged.
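A small sketch of that hook mechanism, using a hypothetical `Money` class that is not part of the question: defining `__add__` is what lets instances take part in the + operator.

```python
class Money:
    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called by the interpreter when evaluating `self + other`.
        return Money(self.cents + other.cents)


total = Money(150) + Money(275)
print(total.cents)        # 425

# Built-in ints go through the same hook, which is why this works too:
print((1337).__add__(1))  # 1338
```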
Python is a high-level interpreted language. It provides many facilities and a clear syntax that let you concentrate on real problems rather than implementation details.
Those facilities do not come from nowhere and are a benefit provided by (among others) an internal data model.
This data model is certainly not as lightweight as a custom handcrafted C data structure specially designed for a single problem, but it is practical and usable in many situations (and still reasonably lightweight and well performing for many uses).
If you are concerned about memory optimisation and extreme performance, you should probably look at lower-level languages like C or Go.
But keep in mind that solving any problem could then probably take much more time :)

Why so many methods:
Python is an object-oriented language. It's built on objects interacting with each other. Maybe there's no specific reason why Python has to be object-oriented, but the guy who wrote it gets to decide what kind of language it is and he chose this type.
In Python, the objects get to decide how an operation is implemented. That's why you can have natural syntax like 'str1' + 'str2' or 23 + 40. Note, these are both written in the same way, but will produce different results because the + operation is carried out on different objects.
What about memory management:
If the objects didn't decide how to implement an operation, the logic would still have to be stored somewhere. It would probably take up at least some memory.
The question is how much memory an object uses. To find out, you can use the sys.getsizeof function (exact values depend on the Python version and platform):
import sys
print(sys.getsizeof(int))  # size of the int class in bytes, e.g. 416
print(sys.getsizeof(1))    # size of a single int instance, e.g. 28
So, the memory of the class (int) is much larger than the object (1). This means that the instances of integers are much smaller than their classes, so memory doesn't inflate as drastically.

Can monads do what this OOP can't?

Trying to understand monads and wondering if they'd be useful for data transformation programming in Python, I looked at many introductory explanations. But I don't understand why monads are important.
Is the following Python code a good representation of the Maybe monad?
class Some:
    def __init__(self, val):
        self.val = val

    def apply(self, func):
        return func(self.val)

class Error:
    def apply(self, func):
        return Error()

a = Some(1)
b = a.apply(lambda x: Some(x + 5))
Can you give an example of a monad solution, which cannot be transformed into such OOP code?
(Do you think monads for data transformation in OOP languages can be useful?)

Here is a post that discusses monads using Swift as the language that might help you make more sense of them: http://www.javiersoto.me/post/106875422394
The basic point is that a monad is an object that has three components.
A constructor that can take some other object, or objects, and create a monad wrapping that object.
A method that can apply a function that knows nothing about the monad to its contents and return the result wrapped in a monad.
A method that can apply a function that takes the raw object and returns a result wrapped in the monad, and return that monad.
Note that this means even an Array class can be a monad if the language treats methods as first class objects, and the necessary methods exist.
If the language doesn't treat methods as first class objects, even if it's an OO language, it will not be able to implement Monads. I think your confusion may be coming from the fact that you are using a multi-paradigm language (Python) and assuming it is a pure OO language.
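The three components listed above could be sketched in Python roughly as follows. The class name `Maybe`, the method names `map` and `flat_map`, and the helper `safe_reciprocal` are illustrative choices, not names from the linked post.

```python
class Maybe:
    """Holds a value, or nothing when `is_some` is False."""

    def __init__(self, value=None, is_some=True):
        # Component 1: the constructor wraps a plain value.
        self.value = value
        self.is_some = is_some

    @classmethod
    def nothing(cls):
        return cls(is_some=False)

    def map(self, func):
        # Component 2: func knows nothing about Maybe; the result is re-wrapped.
        return Maybe(func(self.value)) if self.is_some else self

    def flat_map(self, func):
        # Component 3: func itself returns a Maybe, which is returned as-is.
        return func(self.value) if self.is_some else self


def safe_reciprocal(x):
    return Maybe(1 / x) if x != 0 else Maybe.nothing()


ok = Maybe(4).map(lambda x: x - 2).flat_map(safe_reciprocal)
failed = Maybe(2).map(lambda x: x - 2).flat_map(safe_reciprocal).map(lambda x: x + 1)
print(ok.value, failed.is_some)  # 0.5 False
```

Once a step produces nothing, the later steps are skipped without any explicit checks at the call site, which is the practical benefit of the pattern.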

The apply you've written corresponds to the monad function called bind.
Given that and the constructor Some (which for a generalized monad is called unit or return), you can define the functions fmap and apply, which promote other kinds of functions: where bind promotes and uses a function that takes plain data and returns a monad, fmap does so for functions from plain data to plain data, and apply does so for functions that are themselves wrapped inside the monad.
You would have issues, I think, defining the last monadic operation: join. join takes a nested monad and flattens it to a single layer. So this is a method that can't be used on all objects of the class, only those with a particular structure. And while join can be avoided if you define bind instead (you can write join in terms of bind and the identity function), it's inconvenient to define bind for array-like monads. Functions on arrays are more naturally described in terms of fmap (loop over the array and apply the function) and join (take a nested array and flatten it).
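For array-like monads, the relationship between fmap, join and bind described here can be sketched with plain Python lists. The function names follow the answer's terminology; the implementation is only an illustration.

```python
def fmap(func, items):
    # Apply a plain function to every element.
    return [func(item) for item in items]


def join(nested):
    # Flatten exactly one level of nesting.
    return [item for inner in nested for item in inner]


def bind(items, func):
    # func returns a list per element; bind is join applied after fmap.
    return join(fmap(func, items))


def join_via_bind(nested):
    # The reverse direction: join is bind with the identity function.
    return bind(nested, lambda inner: inner)


print(bind([1, 2, 3], lambda x: [x, x * 10]))  # [1, 10, 2, 20, 3, 30]
print(join_via_bind([[1, 2], [3]]))            # [1, 2, 3]
```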
I think if you can attack these challenges, you can implement a monad in whatever language you choose. And I do think that monads are useful in many languages, as many sub-computations can be described using them. They represent a known solution to many common problems, and if they're already available to you, that means code you don't have to test or debug; they're tools that already work.
Implementing them yourself lets you use the functional literature and way of thinking to attack problems. Whether that convenience is worth the effort of implementing them is up to you.

copy.deepcopy or create a new object?

I'm developing a real-time application and sometimes I need to create new instances of objects, always with the same data.
At first I just instantiated them, but then I thought copy.deepcopy might be faster. Now I find people who say deepcopy is horribly slow.
I can't do a simple copy.copy because my object contains lists.
My question is: is there a faster way, or do I just need to give up and instantiate them again?
Thank you for your time

I believe copy.deepcopy() is still pure Python, so it's unlikely to give you any speed boost.
It sounds to me a little like a classic case of early optimisation. I would suggest writing your code to be intuitive which, in my opinion, is simply instantiating each object. Then you can profile it and see where savings need to be made, if anywhere. It may well be in your real-world use-case that some completely different piece of code will be a bottleneck.
EDIT: One thing I forgot to mention in my original answer - if you're copying a list, make sure you use slice notation (new_list = old_list[:]) rather than iterating through it in Python, which will be slower. This won't do a deep copy, however, so if your lists have other lists or dictionaries you'll need to use deepcopy(). For dict objects, use the copy() method.
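A short sketch of the difference, with made-up data: a slice copies only the outer list, so nested lists stay shared, while deepcopy() duplicates them as well.

```python
import copy

old_list = [[1, 2], [3, 4]]

shallow = old_list[:]           # new outer list, same inner lists
deep = copy.deepcopy(old_list)  # inner lists are copied too

old_list[0].append(99)

print(shallow[0])  # [1, 2, 99]  (inner list is shared)
print(deep[0])     # [1, 2]      (fully independent)

settings = {"retries": 3}
settings_copy = settings.copy()  # shallow copy of a dict
settings_copy["retries"] = 5
print(settings["retries"])       # 3
```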
If you still find constructing your objects is what's taking the time, you can then consider how to speed it up. You could experiment with __slots__, although they're typically about saving memory more than CPU time so I doubt they'll buy you much. In the extreme case, you can push your object out to a C extension module, which is likely to be a lot faster at the expense of increased complexity. This is always the approach I've taken in the past, where I use native C data structures under the hood and use Python's special methods to wrap a "list-like" or "dict-like" interface on top. This does rely on you being happy with coding in C, of course.
(As an aside I would avoid C++ unless you have a compelling reason, C++ Python extensions are slightly more fiddly to get building than plain C - it's entirely possible if you have a good motivation, however)
If your object has, for example, very long lists then you might get some mileage out of a sort of copy-on-write approach, where clones of objects just keep the same references instead of copying the lists. Every time you access them you could use sys.getrefcount() to see if it's safe to update in-place or whether you need to take a copy. This approach is likely to be error-prone and overly complex, but I thought I'd mention it for interest.
You could also look at your object hierarchy and see whether you can break objects up such that parts which don't need to be duplicated can be shared between other objects. Again, you'd need to take care when modifying such shared objects.
The important point, however, is that you first want to get your code correct and then make your code fast once you understand the best ways to do that from real world usage.

Python 3.2: Information on class, class attributes, and values of objects

I'm new to Python, and I'm learning about the class function. Does anyone know of any good sources/examples for this? Here's an example I wrote up. I'm trying to read more about self and init
class Example:
    def __init__(a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d

test = Example(1, 1, 1, 1)
I've been to python.org, as well as this site. I've also been reading beginners Python books, but I'd like more information.
Thanks.

A couple of generic clarifications here:
Python's "class" keyword is not a function, it's a statement which signals to the language that the following code describes a class (a user-defined data type and its associated behavior). "class" takes a name (and a possibly empty list of "parent" classes) ... and introduces a "suite" (an indented block of code).
The "def" keyword is a similar statement which defines a function. In your example (which should have read def __init__(self, a, b, c, d)) you're defining a special type of function which is "part of" (associated with, bound to) the Example class. It's also possible (and fairly common) to create unbound functions. Commonly, in Python, unbound functions are simply called "functions" while those which are part of a class are called "methods" ... or "instance functions."
classes are templates for instantiating objects. The terms "instance" and "object" are synonymous in this context. Your example "test" is an instance ... and the Python interpreter "instantiates" and initializes that object according to the class description.
A class is also a "type", that is to say that it's a user definition of a type of data and its associated methods. "class" and "type" are somewhat synonymous in Python though they are conventionally used in different ways. The core "types" of Python data (integers, real numbers, imaginary/complex numbers, strings, lists, tuples, and dictionaries) are all referred to as "types" while the more complex data/operational structures are called classes. Early versions of Python were implemented with constraints that made the distinction between "type" and "class" more than merely a matter of terminological difference. However, the last several versions of Python have eliminated those underlying technical distinctions. Those distinctions related to "subclassing" (inheritance).
classes can be described as a set of additions and modifications to another class. This is called "inheritance" and the class which is derived from another in this manner is referred to as a "subclass." It's common for programmers to create hierarchies of classes ... with specific variations all deriving from more common bases. It's also common to define related functionality within the same files or sets of files. These are "class libraries" and sometimes they are built as "packages."
__init__() is a method; in particular it's the initializer for Python objects (instances of a class).
Python generally uses __...__ (prefixing and suffixing pairs of underscore characters around selected keywords) for "special" method or attribute names ... which is intended to reduce the likelihood that its naming choices will conflict with the meaningful names that you might wish to give to your own methods. While you can name your other methods and attributes in this __XXXX__ style --- Python will not inherently treat that as an error --- it's an extremely bad idea to do so. Even if you don't pick any of the currently defined special names there's no guarantee that some future version of Python won't conflict with your usage later.
"methods" are functions ... but they are a type of function which is bound (associated with) a particular instance of a particular class. There are also "class methods" which are associated with the class rather than with a specific instance of the class.
In your example self.b, self.c and so on are "attributes" or "members" of the object (instance). (The terms are synonymous).
In general the purpose of object-oriented programming is to provide ways of describing types of data and operations on those types of data in a way that's amenable to both human comprehension and computerized interpretation and execution. (All programming languages are intended to strike some balance between human readability/comprehension and machine parsing/execution; but object-oriented languages do so specifically with a focus on the description and definition of "types," and the instantiation of those types into "objects" and finally the various interactions among those objects.)
"self" is a Python-specific convention for naming the special (and generally required) first argument to any bound method. It's a "self" reference (a way for the code in a method to refer to the object/instance's own attributes and other methods without any ambiguity). While you can call the first argument to your bound methods "a" (as you've unwittingly done in your Example), it's an extremely bad idea to do so. Not only will it likely confuse you later ... it will make no sense to anyone else trying to read your Python code.
The term "object-oriented" is confusing unless one is aware of the comparisons to other forms of programming language. It's an evolution from "procedural" programming. The simplest gist of that comparison comes from considering the sorts of functions one would define in a procedural language, where one might have to define and separately name different functions to perform analogous operations on different types of data: print_student_record(this_student) vs. print_teacher_report(some_report) --- a programming model which necessitates a fair amount of overhead on the part of the programmer, to keep track of which functions work on which types. This sort of problem is eliminated in OO (object-oriented) programming where one can, conceivably, call on this.print_() ... and, assuming one has created compatible classes, this will "print" regardless of whether "this" is a student (record/object) or a teacher (report/record/object). That's greatly oversimplified but useful for understanding the pressures which led to the development and adoption of OO based programming.
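A minimal sketch of that idea, with two hypothetical classes that each provide their own print_() method, so a single call site handles both types:

```python
class Student:
    def __init__(self, name):
        self.name = name

    def print_(self):
        line = f"Student record: {self.name}"
        print(line)
        return line  # returned as well, so callers can reuse the text


class TeacherReport:
    def __init__(self, title):
        self.title = title

    def print_(self):
        line = f"Teacher report: {self.title}"
        print(line)
        return line


# One loop works for both types; no separately named
# print_student_record / print_teacher_report functions are needed.
lines = [this.print_() for this in (Student("Ada"), TeacherReport("Term 1"))]
```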
In Python it's possible to create classes with little or no functionality. Your example does nothing, yet, but transfer a set of arguments into "attributes" (members) during initialization (instantiation). After that you could use these attributes in programming statements like: test.a += 1 and print(test.a). This is possible because Python is a "multi-paradigm" language. It supports procedural as well as object-oriented programming styles. Objects used this way are very similar to "structs" from the C programming language (predecessor to C++) and to the "records" in Pascal, etc. That style of programming is largely considered to be obsolete (particularly when using a modern, "very high level" language such as Python).
The gist of what I'm getting at is this ... you'll want to learn how to think of your data as the combination of its "parts" (attributes) and the functionality that changes, manipulates, validates, and handles input, output, and possibly storage, of those attributes.
For example, if you were writing a "code breaker" program to solve simple ciphers you might implement a "Histogram" object which counts the letter frequencies of a given coded message. That would have attributes (one integer for every letter) and behavior (feeding portions of the coded message(s) into the instance, splitting the strings into individual characters, filtering out all the non-letter characters, converting all the letters to upper or lower case, and counting them --- that is, incrementing the integer corresponding to each letter). Additionally you'd need to have some way of querying the histogram ... for example getting a list of the letters sorted by their frequency in the cipher text.
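One way such a Histogram class might look is sketched below. The method names `feed` and `by_frequency`, the `counts` attribute and the sample cipher text are all made up for illustration.

```python
class Histogram:
    """Counts letter frequencies, ignoring case and non-letter characters."""

    def __init__(self):
        self.counts = {}  # attribute: letter -> number of occurrences

    def feed(self, text):
        # Behaviour: normalise case, skip non-letters, count the rest.
        for char in text.upper():
            if char.isalpha():
                self.counts[char] = self.counts.get(char, 0) + 1

    def by_frequency(self):
        # Most frequent first; ties are broken alphabetically for a stable order.
        return sorted(self.counts, key=lambda letter: (-self.counts[letter], letter))


histogram = Histogram()
histogram.feed("Uif tfdsfu jt pvu!")
print(histogram.by_frequency()[:3])  # ['F', 'U', 'T']
```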
Once you had such a "histogram" class then you could think of ways to use that for your solver. For example, to solve a cryptogram puzzle you might compute the histogram then try substituting "etaon" for the five most common ciphered letters ... then check how many of the "partial" strings (t.e for "the") match words, trying permutations, and so on. Each of these might be its own class. A key point of programming is that your histogram class might be useful for counting all sorts of other things (even in a simple voting system or popularity contest). A particular subclass or instantiation might make it a histogram of letters while others could be re-used for other types of "things" you want counted. Similarly the code that iterates over permutations of some list might be used in any number of simulation, optimization, and related programs. (In fact Python's standard libraries already include "Counter" and "permutations" in the "collections" and "itertools" modules, respectively.)
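The standard-library tools mentioned at the end can be combined roughly like this. The cipher text and the choice to look at only the two most common letters are arbitrary, chosen to keep the sketch small.

```python
from collections import Counter
from itertools import permutations

cipher_text = "Uif tfdsfu jt pvu!"

# Counter does the histogram work: letter -> number of occurrences.
letters = Counter(char for char in cipher_text.upper() if char.isalpha())
top_two = [letter for letter, _ in letters.most_common(2)]

# Try every assignment of candidate plaintext letters to those cipher letters.
candidates = [dict(zip(top_two, guess)) for guess in permutations("et", 2)]
print(len(candidates))
```

Each dictionary in `candidates` is one possible substitution table that a solver could test against a word list.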
Of course you're going to hear of all of these concepts repeatedly as you study programming. This has been a rather rambling attempt to kickstart that process. I know I've been a bit repetitious in a few points here --- part of that is because I'm typing this at 4am after having started work at 7am yesterday; but part of it serves a pedagogical purpose as well.

There's an error in your class definition. Always include the variable self in your __init__ method. It represents the instance of the object itself and should be included as the first parameter to all of your methods.
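For illustration, here is what happens with and without self. The class name `Broken` is made up to show the failing variant next to the corrected `Example`.

```python
class Broken:
    def __init__(a, b, c, d):  # `self` is missing, so `a` would receive the instance
        pass


try:
    Broken(1, 1, 1, 1)  # the instance plus four arguments is one too many
except TypeError as error:
    print("TypeError:", error)


class Example:
    def __init__(self, a, b, c, d):
        # `self` is the instance being initialised; Python passes it automatically.
        self.a = a
        self.b = b
        self.c = c
        self.d = d


test = Example(1, 1, 1, 1)
test.a += 1
print(test.a)  # 2
```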
What do you want to accomplish with this class? Up until now, it just stores a few variables. Try adding a few methods to spice things up a little bit! There is a plethora of available resources on classes in Python. To start, you may wanna give this one a try:
Python Programming - Classes

I'm learning python now also, and there is an intro class that's pretty good on codecademy.com
http://www.codecademy.com/tracks/python
It has a section that goes through an exercise on classes. Hope this helps
