I would like to have informative representations for my composite objects (i.e., objects composed of other, potentially composite, objects). However, because my code fundamentally deals with high-precision numbers (please don't ask me why I don't just use doubles), I end up with representations like you see here: http://pastebin.com/jpLgAfxC. Would it be better to just stick with the default __repr__?
Whether to have a verbose repr depends on what you want to accomplish. For complex or composite objects, I know which I'd prefer of the following:
Point(x=1.12, y=2.2, z=-1.9)
<__main__.Point object at 0x103011890>
They both tell me what type the object is, but only the first is clear about all of the (relevant) values involved, and avoids low-level information that is only relevant on the rarest of occasions.
I like to see the real values. But, yours is a special case, given that your values are so frightfully humongous:
72401317106217603290426741268390656010621951704689382948334809645
87850348552960901165648762842931879347325584704068956434195098288
38279057775096090002410493665682226331178331461681861612403032369
73237863637784679012984303024949059416189689048527978878840119376
5152408961823197987224502419157858495179687559851
Values that long cannot be useful for most development or debugging purposes. I'm sure there are times you need the full serialization--to send to and from files, for example. But those have to be fairly rare, no? I can't imagine you really remember all 309 digits, or can determine on visual inspection whether the number above is the same as the one below:
72401317106217603290426741268390656010621951704689382948334809645
87850348552960901165648762842931879347325584704068956434195098288
38279057775096090002410493665682226331178331461681861612403032369
73327863637784679012984303024949059416189689048527978878840119376
5152408961823197987224502419157858495179687559851
They're not the same. But unless you're Spock or The Terminator, you wouldn't know that from a quick glance. (And actually, I've made it easier here, length-wrapping to avoid having to horizontally scroll.)
So I would recommend (massively) shortening their representation, to make the output more tractable. This is like printing out the entire chapter text every time you want to print a Chapter object. Overkill.
Instead, try something much shorter and easier to work with. Truncation and/or ellipsis are useful. e.g.
72401...59851
7240131710...
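For instance, a minimal truncating __repr__ along these lines might be (just a sketch; self.value stands for whatever attribute holds the full integer):
def __repr__(self):
    digits = str(self.value)
    if len(digits) <= 12:              # short values: show them whole
        return '{}({})'.format(self.__class__.__name__, digits)
    return '{}({}...{})'.format(       # long values: first and last 5 digits
        self.__class__.__name__, digits[:5], digits[-5:])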
You can use the object id as well. If your high-precision type is HP, then:
HP(0x103011890)
At least then you will be able to tell them apart. One ugliness of using object ids, however, is that logically equivalent objects created separately get different ids, and so appear different when they are not. You can get around that by creating your own short hash function. There's a bit of an art to hashing, but for reprs, even something simple would work. E.g.:
import binascii, struct

def shorthash(s):
    """
    Given a Python value, produce a short alphanumeric hash that
    helps identify it for debugging purposes. A riff on
    http://stackoverflow.com/a/2511059/240490
    Enhanced to remove trailing boilerplate, and to work
    on either Python 2 or Python 3.
    """
    hashbytes = binascii.b2a_base64(struct.pack('l', hash(s)))
    return hashbytes.decode('utf-8').rstrip().rstrip("=")
Then define your repr in the high-precision class:
def __repr__(self):
    clsname = self.__class__.__name__
    return '{0}({1})'.format(clsname, shorthash(self.value))
Where self.value is whatever local attribute, property, or method creates the multi-hundred-digit value. If you're subclassing int, this could be just self.
This gets you to:
HP(Tea+5MY0WwA)
The two massive, almost identical numbers above? Using this scheme, they render out to:
HP(XhkG0358Fx4)
HP(27CdIG5elhQ)
Which are obviously different. You can combine this with a bit of a value representation. E.g. a few alternatives:
HP(~7.24013e308 # XhkG0358Fx4)
HP(dig='72401...59851', ndigits=309, hash='XhkG0358Fx4')
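For instance, the last alternative could be produced by a __repr__ like this (a sketch that assumes the shorthash helper above and a self.value attribute holding the full integer):
def __repr__(self):
    digits = str(self.value)
    return "{}(dig='{}...{}', ndigits={}, hash='{}')".format(
        self.__class__.__name__,
        digits[:5], digits[-5:],
        len(digits),
        shorthash(self.value))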
You'll find these shorter values more useful in debugging contexts. You can, of course, keep around a method or property (e.g. .value, .digits, or .alldigits) for those cases in which you need every last bit, but define the common case as something more easily consumed.
Thank you to Demian for the pointer to https://docs.python.org/2/reference/datamodel.html#object.__repr__, specifically:
This is typically used for debugging, so it is important that the
representation is information-rich and unambiguous.
http://pastebin.com/jpLgAfxC is probably the best possible __repr__ in this case.
Related
Python seems to lack a base class for "all numbers", e.g. int, float, complex, long (in Python 2). This is unfortunate and a bit inconvenient to me.
I'm writing a function for matching data types onto each other (a tree matching / searching algorithm). For this I'd like to test the input values for being lists, dicts, strings, or "numbers". Each of the four cases is handled separately, but I do not want to distinguish between int, long, float, or complex (though the latter will probably not appear). This would be rather easy to achieve if all number types were derived from a common base type, but unfortunately they are not, AFAICS.
This enforcement of explicitness makes it obvious that the unusual complex is included, which raises questions I'd rather not think about. My design says "all number types" rather than that explicit list. I also do not want to explicitly list all possible number types coming from other libraries like numpy or similar.
First question, rather a theoretical one: Why didn't the designers make all number types inherit a common number base class? Is there a good reason for this, maybe a theory about it which lets it seem not recommended? Or would it make sense to propose this idea for later versions of Python?
Second question, the more practical one: Is there a recommended way of checking a value for being a number, maybe a library call I'm not aware of? The straightforward version, isinstance(x, (int, float, complex, long)), looks like a crutch to me, isn't compatible with Python 3 (which no longer has a long type), and doesn't include library-based number types like numpy.int32.
There is actually a base class for those types you listed.
If you're not looking at numpy types, a good starting point would be numbers.Complex:
>>> import numbers
>>> isinstance(1+9j, numbers.Complex)
True
>>> isinstance(1L, numbers.Complex)
True
>>> isinstance(1., numbers.Complex)
True
>>> isinstance(1, numbers.Complex)
True
It gets a bit messier when you start to include those from numpy, however, the numbers.Complex abstract base class already handles a good number of the mentioned cases.
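If you want the broadest built-in check, the root of that hierarchy is numbers.Number, which also covers types such as decimal.Decimal and fractions.Fraction (a small sketch; whether third-party types like numpy scalars register with these ABCs depends on the library and version):
import numbers
from decimal import Decimal
from fractions import Fraction

for value in (1, 1.5, 1 + 9j, Decimal('1.5'), Fraction(1, 3)):
    print(isinstance(value, numbers.Number))   # True for every one of these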
Sorry, a bit late to the party. But this might be helpful for future readers of this page. I suggest you use Python's "duck typing" character and its EAFP ("Easier to Ask Forgiveness than Permission") philosophy: in other words, just try using the object in question as a number. Write something like this:
def isnumber(thing):
    try:
        thing + 0
        return True
    except TypeError:
        return False
It should work for any type of number, including user-defined classes.
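For example, a quick check of how this duck-typed test behaves (illustrative only; the last two lines assume numpy is installed):
>>> isnumber(3)
True
>>> isnumber(2.5 + 1j)
True
>>> isnumber('42')            # strings don't support + 0
False
>>> import numpy
>>> isnumber(numpy.int32(7))
True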
USAGE CONTEXT ADDED AT END
I often want to operate on an abstract object like a list. e.g.
def list_ish(thing):
    for i in xrange(0, len(thing)):
        print thing[i]
Now this is appropriate if thing is a list, but it will fail if thing is a dict, for example. What is the pythonic way to ask "do you behave like a list?"
NOTE:
hasattr(thing, '__getitem__') and not hasattr(thing, 'keys')
this will work for all cases I can think of, but I don't like defining a duck type negatively, as I expect there could be cases that it does not catch.
really what I want is to ask:
"hey, do you operate on integer indices in the way I expect a list to do?" e.g.
thing[i], thing[4:7] = [...], etc.
NOTE: I do not want to simply execute my operations inside of a large try/except, since they are destructive. It is not cool to try and fail here.
USAGE CONTEXT
-- A "point-lists" is a list-like-thing that contains dict-like-things as its elements.
-- A "matrix" is a list-like-thing that contains list-like-things
-- I have a library of functions that operate on point-lists and also in an analogous way on matrix like things.
-- for example, From the users point of view destructive operations like the "spreadsheet-like" operations "column-slice" can operate on both matrix objects and also on point-list objects in an analogous way -- the resulting thing is like the original one, but only has the specified columns.
-- since this particular operation is destructive, it would not be cool to proceed as if an object were a matrix, only to find out partway through the operation that it was really a point-list or none-of-the-above.
-- I want my 'is_matrix' and 'is_point_list' tests to be performant, since they sometimes occur inside inner loops. So I would be satisfied with a test which only investigated element zero for example.
-- I would prefer tests that do not involve construction of temporary objects, just to determine an object's type, but maybe that is not the python way.
In general I find the whole duck-typing thing to be kinda messy and fraught with bugs and slowness, but maybe I don't yet think like a true Pythonista
happy to drink more kool-aid...
One thing you can do, that should work quickly on a normal list and fail on a normal dict, is taking a zero-length slice from the front:
try:
    thing[:0]
except TypeError:
    pass  # probably not list-like
else:
    pass  # probably list-like
The slice fails on dicts because slices are not hashable.
However, str and unicode also pass this test, and you mention that you are doing destructive edits. That means you probably also want to check for __delitem__ and __setitem__:
def supports_slices_and_editing(thing):
    if hasattr(thing, '__setitem__') and hasattr(thing, '__delitem__'):
        try:
            thing[:0]
            return True
        except TypeError:
            pass
    return False
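For illustration, here is how that check behaves on the usual suspects:
>>> supports_slices_and_editing([1, 2, 3])   # list: sliceable and editable
True
>>> supports_slices_and_editing({'a': 1})    # dict: slicing raises TypeError
False
>>> supports_slices_and_editing('abc')       # str: no __setitem__/__delitem__
False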
I suggest you organize the requirements you have for your input, and the range of possible inputs you want your function to handle, more explicitly than you have so far in your question. If you really just wanted to handle lists and dicts, you'd be using isinstance, right? Maybe what your method does could only ever delete items, or only ever replace items, so you don't need to check for the other capability. Document these requirements for future reference.
When dealing with built-in types, you can use the Abstract Base Classes. In your case, you may want to test against collections.Sequence or collections.MutableSequence (spelled collections.abc.Sequence / collections.abc.MutableSequence on Python 3.3+):
if isinstance(your_thing, collections.Sequence):
    pass  # access your_thing as a list
This is supported in all Python versions after (and including) 2.6.
If you are using your own classes to build your_thing, I'd recommend that you inherit from these abstract base classes as well (directly or indirectly). This way, you can ensure that the sequence interface is implemented correctly, and avoid all the typing mess.
And for third-party libraries, there's no simple way to check for a sequence interface, if the third-party classes didn't inherit from the built-in types or abstract classes. In this case you'll have to check for every interface that you're going to use, and only those you use. For example, your list_ish function used __len__ and __getitem__, so only check whether these two methods exist. A wrong behavior of __getitem__ (e.g. a dict) should raise an exception.
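As a concrete illustration (a small sketch; remember the collections.abc spelling on Python 3.3+), MutableSequence separates lists from both dicts and strings:
import collections

for thing in ([1, 2, 3], {'a': 1}, 'abc'):
    print('{}: {}'.format(type(thing).__name__,
                          isinstance(thing, collections.MutableSequence)))
# list: True
# dict: False
# str: False  (a Sequence, but not a mutable one)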
Perhaps there is no ideal pythonic answer here, so I am proposing a 'hack' solution, but I don't know enough about the class structure of Python to know if I am getting this right:
def is_list_like(thing):
    return hasattr(thing, '__setslice__')

def is_dict_like(thing):
    return hasattr(thing, 'keys')
My reduced goals here are simply to have performant tests that will:
(1) never call a dict-thing or a string-like-thing a list
(2) return the right answer for built-in Python types
(3) return the right answer if someone implements a "full" set of core methods for a list/dict
(4) be fast (ideally, not allocate objects during the test)
EDIT: Incorporated ideas from @DanGetz
I noticed that many libraries nowadays seem to prefer the use of strings over enum-type variables for parameters.
Where people would previously use enums, e.g. dateutil.rrule.FR for a Friday, it seems that this has shifted towards using strings (e.g. 'FRI').
Same in numpy (or pandas for that matter), where searchsorted, for example, uses strings (e.g. side='left' or side='right') rather than a defined enum. For the avoidance of doubt, before Python 3.4 this could have been easily implemented as an enum like so:
class SIDE:
    RIGHT = 0
    LEFT = 1
And the advantages of enum-type variables are clear: you can't misspell them without raising an error, they offer proper support for IDEs, etc.
So why use strings at all, instead of sticking to enum types? Doesn't this make the programs much more prone to user errors? It's not like enums create an overhead - if anything they should be slightly more efficient. So when and why did this paradigm shift happen?
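(For completeness, the Python 3.4+ enum module version would look roughly like this:)
import enum

class SIDE(enum.Enum):
    RIGHT = 0
    LEFT = 1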
I think enums are safer especially for larger systems with multiple developers.
As soon as the need arises to change the value of such an enum, looking up and replacing a string in many places is not my idea of fun :-)
The most important criterion IMHO is the usage: for use within a module or even a package, a string seems fine; in a public API I'd prefer enums.
[update]
As of today (2019) Python introduced dataclasses - combined with optional type annotations and static type analyzers like mypy I think this is a solved problem.
As for efficiency, attribute lookup is somewhat expensive in Python compared to most computer languages, so I guess some libraries may still choose to avoid it for performance reasons.
[original answer]
IMHO it is a matter of taste. Some people like this style:
def searchsorted(a, v, side='left', sorter=None):
    ...
    assert side in ('left', 'right'), "Invalid side '{}'".format(side)
    ...
numpy.searchsorted(a, v, side='right')
Yes, if you call searchsorted with side='foo' you may get an AssertionError way later at runtime - but at least the bug will be pretty easy to spot by looking at the traceback.
While other people may prefer (for the advantages you highlighted):
numpy.searchsorted(a, v, side=numpy.CONSTANTS.SIDE.RIGHT)
I favor the first because I think seldom-used constants are not worth the namespace cruft. You may disagree, and people may align with either side due to other concerns.
If you really care, nothing prevents you from defining your own "enums":
class SIDE(object):
    RIGHT = 'right'
    LEFT = 'left'
numpy.searchsorted(a, v, side=SIDE.RIGHT)
I think it is not worth it, but again it is a matter of taste.
[update]
Stefan made a fair point:
As soon as the need arises to change the value of such an enum, looking up and replacing a string in many places is not my idea of fun :-)
I can see how painful this can be in a language without named parameters - using the example you have to search for the string 'right' and get a lot of false positives. In Python you can narrow it down searching for side='right'.
Of course if you are dealing with an interface that already has a defined set of enums/constants (like an external C library) then yes, by all means mimic the existing conventions.
I understand this question has already been answered, but there is one thing that has not been addressed at all: the fact that you must explicitly ask a Python Enum member for its value when you want the value it stores.
>>> from enum import Enum
>>> class Test(Enum):
...     WORD = 'word'
...     ANOTHER = 'another'
...
>>> str(Test.WORD.value)
'word'
>>> str(Test.WORD)
'Test.WORD'
One simple solution to this problem is to offer an implementation of __str__()
>>> class Test(Enum):
...     WORD = 'word'
...     ANOTHER = 'another'
...     def __str__(self):
...         return self.value
...
>>> Test.WORD
<Test.WORD: 'word'>
>>> str(Test.WORD)
'word'
Yes, adding .value is not a huge deal, but it is an inconvenience nonetheless. Using regular strings requires zero extra effort, no extra classes, and no redefinition of any default class methods. Still, in many cases you must explicitly convert to the string value, where a plain str would pose no problem.
I prefer strings for debugging reasons. Compare an object like
side=1, opt_type=0, order_type=6
to
side='BUY', opt_type='PUT', order_type='FILL_OR_KILL'
I also like "enums" where the values are strings:
class Side(object):
    BUY = 'BUY'
    SELL = 'SELL'
    SHORT = 'SHORT'
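If you want real enum members that still behave like strings, one option (not from the original answer, just a sketch) is to mix str into an Enum, so that members compare equal to the plain strings:
from enum import Enum

class Side(str, Enum):
    BUY = 'BUY'
    SELL = 'SELL'
    SHORT = 'SHORT'

print(Side.BUY == 'BUY')   # True: the str mixin gives plain string equality
print(Side.BUY.value)      # 'BUY'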
Strictly speaking Python does not have enums - or at least it didn't prior to v3.4
https://docs.python.org/3/library/enum.html
I prefer to think of your example as programmer defined constants.
In argparse, one set of constants has string values. While the code uses the constant names, users more often use the strings.
e.g. argparse.ZERO_OR_MORE = '*'
argparse.OPTIONAL = '?'
numpy is one of the older 3rd-party packages (at least its roots like Numeric are). String values are more common than enums. In fact I can't offhand think of any enums (as you define them).
I intend to implement a library which includes lots of constants. Now there are already many questions about constants on stackoverflow. But I am pondering about a special case which complicates things and I am not sure which way to go. Probably someone already solved a similar problem and can help me avoid obvious fails :).
To be more precise, I am going to implement SMPP (Short Message Peer-to-Peer) related stuff. SMPP consists of many fields which have different possible values and are sometimes composed of different parts. The esm_class field, for example, is an 8-bit integer, with bits 1-0, 5-2 and 7-6 being 'subfields'. So the esm_class value consists of three constants 'concatenated' together.
But the story goes on. The esm_class subfields have different meanings depending on the type of message they are being used in.
Now a few questions arose:
What's the best way, to organise the different constants?
How do I define different constants for the same field (or subfield)?
What's the best way to add a description to each constant?
For question 3 I went for tuples: CONST_NAME = (0x01, 'value meaning',).
My first idea for the other problems was to organise the constants in modules. This would for example give us:
constants.esm_class which contains all esm_class values
constants.data_coding which contains all possible encodings
etc (for each field a module in constants)
For data_coding this works well. But esm_class actually consists of 'subfields'. So I thought about:
constants.esm_class.types for the message type subfield
constants.esm_class.modes for the message modes
constants.esm_class.features for the features field
This already is quite long, resulting in constants.esm_class.features.UDHI_AND_REPLAY_PATH for example. Not very nice, somehow. And that's not yet all I need... actually I even need to split the constants.esm_class.types apart because the type values have different meanings in different contexts (basically incoming messages vs. outgoing short messages). So this would result in even one more submodule, if I followed the schema: constants.esm_class.types.outgoing and constants.esm_class.types.incoming
Additionally, I also need a way to concatenate types, modes and features to create the esm_class value and I also need to split esm_class apart for parsing. So there needs to be a generic way to build field values from constants and also to detect that a certain value is illegal (i.e. has a value not defined by a constant).
My question: what do you think is the best approach? Should I go for classes? From what I've read, best practice in Python is to define constants at module level. I also see some issues with using classes. So would a new data type for each possible field do it? I'm stuck and looking forward to your ideas!
Edit: I intend to use the lib with python 3.3+ and probably I want to make it compatible to 2.7, so Enums from Python 3.4 unfortunately are no solution.
Edit 2: Probably I packed too much information in too few text. So I will give an example.
esm_class in SMPP is kind of a derived value, consisting of a type, a mode and a feature. If esm_class has, for example, the value 00111100, it means that the feature equals 00, the mode 1111 and the type again 00. That's what I refer to as a 'subfield'. The constants will be put together with bitwise operations.
I will make an attempt at giving an answer (I hope I've understood you correctly). Elaborating on my comments above, I would go for something like:
class esm_class:
    BITMASK_FEATURE = 0b11000000
    BITMASK_MODE    = 0b00111100
    BITMASK_TYPE    = 0b00000011

    class Features:
        FEATURE_1 = 0b00
        FEATURE_2 = 0b01
        FEATURE_3 = 0b10
        FEATURE_4 = 0b11

    class Modes:
        MODE_1 = 0b0000
        # …

    class Types:
        TYPE_1 = 0b00
        TYPE_2 = 0b01
        # …

    def __init__(self, type_, mode, feature):
        self.type = type_
        self.mode = mode
        self.feature = feature

    def __bytes__(self):
        '''
        Use this to write an instance of esm_class to a file-like
        object or stream, e.g.:
            mysocket.send(bytes(my_esm_class_obj))
        '''
        return ((self.feature << 6) | (self.mode << 2) |
                self.type).to_bytes(1, byteorder='big')

    @classmethod
    def parse(cls, filelike):
        '''
        Use this to parse incoming data from a file-like object
        and create an instance of esm_class accordingly, e.g.:
            my_esm_class_obj = esm_class.parse(stream)
        '''
        data = filelike.read(1)[0]  # one byte, as an int
        return cls(type_=data & cls.BITMASK_TYPE,
                   mode=(data & cls.BITMASK_MODE) >> 2,
                   feature=(data & cls.BITMASK_FEATURE) >> 6)
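A quick round-trip check of the sketch above, using io.BytesIO as the file-like object (the field values here are made up purely for illustration):
import io

obj = esm_class(type_=0b01, mode=0b0010, feature=0b10)
raw = bytes(obj)                          # b'\x89', i.e. 0b10001001
parsed = esm_class.parse(io.BytesIO(raw))
print(parsed.feature, parsed.mode, parsed.type)   # 2 2 1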
Now, the code can certainly be optimized but the point is: I tried to abstract away the binary representation as soon as possible, in order to arrive at a class hierarchy representing the structures appearing in the protocol. Accordingly, I didn't simply put the constants / the values in different modules but organized them in such a way that they appear together with the structure they are part of – which is also the structure you're going to use in the rest of your code.
Further things to note:
You can certainly omit the classes Features, Modes and Types if you like and put all values in the scope of esm_class. However, I would do this only if the symbols' names FEATURE_1, MODE_2 and so on make it clear which part of the bitstring they correspond to (i.e. the feature part or the mode part or…).
I would usually avoid Python 3.4's enums because IMO they are tedious to deal with when you need to access a symbol's value (i.e. its value attribute), but that's also because I don't need the symbols' names themselves. Your mileage may vary.
Finally, in the comments above, chepner made a very good point regarding whether to store descriptions in your code at all. I recommend you follow his advice.
I prefer to use long identifiers to keep my code semantically clear, but in the case of repeated references to the same identifier, I'd like for it to "get out of the way" in the current scope. Take this example in Python:
def define_many_mappings_1(self):
    self.define_bidirectional_parameter_mapping("status", "current_status")
    self.define_bidirectional_parameter_mapping("id", "unique_id")
    self.define_bidirectional_parameter_mapping("location", "coordinates")
    # etc...
Let's assume that I really want to stick with this long method name, and that these arguments are always going to be hard-coded.
Implementation 1 feels wrong because most of each line is taken up with a repetition of characters. The lines are also rather long in general, and will exceed 80 characters easily when nested inside of a class definition and/or a try/except block, resulting in ugly line wrapping. Let's try using a for loop:
def define_many_mappings_2(self):
    mappings = [("status", "current_status"),
                ("id", "unique_id"),
                ("location", "coordinates")]
    for mapping in mappings:
        self.define_parameter_mapping(*mapping)
I'm going to lump together all similar iterative techniques under the umbrella of Implementation 2, which has the improvement of separating the "unique" arguments from the "repeated" method name. However, I dislike that this has the effect of placing the arguments before the method they're being passed into, which is confusing. I would prefer to retain the "verb followed by direct object" syntax.
I've found myself using the following as a compromise:
def define_many_mappings_3(self):
    d = self.define_bidirectional_parameter_mapping
    d("status", "current_status")
    d("id", "unique_id")
    d("location", "coordinates")
In Implementation 3, the long method is aliased by an extremely short "abbreviation" variable. I like this approach because it is immediately recognizable as a set of repeated method calls on first glance while having less redundant characters and much shorter lines. The drawback is the usage of an extremely short and semantically unclear identifier "d".
What is the most readable solution? Is the usage of an "abbreviation variable" acceptable if it is explicitly assigned from an unabbreviated version in the local scope?
itertools to the rescue again! Try using starmap - here's a simple demo:
list(itertools.starmap(min,[(1,2),(2,2),(3,2)]))
prints
[1,2,2]
starmap returns a lazy iterator, so to actually invoke the methods, you have to consume it, e.g. with list.
import itertools

def define_many_mappings_4(self):
    list(itertools.starmap(
        self.define_parameter_mapping,
        [
            ("status", "current_status"),
            ("id", "unique_id"),
            ("location", "coordinates"),
        ]))
Normally I'm not a fan of using a dummy list construction to invoke a sequence of functions, but this arrangement seems to address most of your concerns.
If define_parameter_mapping returns None, then you can replace list with any, and then all of the function calls will get made, and you won't have to construct that dummy list.
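That variant might look like this (a sketch; it relies on every call returning the falsy None, so any() never short-circuits and simply drives the iterator):
import itertools

def define_many_mappings_5(self):
    any(itertools.starmap(
        self.define_parameter_mapping,
        [
            ("status", "current_status"),
            ("id", "unique_id"),
            ("location", "coordinates"),
        ]))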
I would go with Implementation 2, but it is a close call.
I think #2 and #3 are equally readable. Imagine if you had 100s of mappings... Either way, I cannot tell what the code at the bottom is doing without scrolling to the top. In #2 you are giving a name to the data; in #3, you are giving a name to the function. It's basically a wash.
Changing the data is also a wash, since either way you just add one line in the same pattern as what is already there.
The difference comes if you want to change what you are doing to the data. For example, say you decide to add a debug message for each mapping you define. With #2, you add a statement to the loop, and it is still easy to read. With #3, you have to create a lambda or something. Nothing wrong with lambdas -- I love Lisp as much as anybody -- but I think I would still find #2 easier to read and modify.
But it is a close call, and your taste might be different.
I think #3 is not bad, although I might pick a slightly longer identifier than d. But this type of thing often becomes data-driven, so you would then find yourself using a variation of #2 where you loop over the result of a database query or something from a config file.
There's no right answer, so you'll get opinions on all sides here, but I would by far prefer to see #2 in any code I was responsible for maintaining.
#1 is verbose, repetitive, and difficult to change (e.g. say you need to call two methods on each pair or add logging -- then you must change every line). But this is often how code evolves, and it is a fairly familiar and harmless pattern.
#3 suffers the same problem as #1, but is slightly more concise at the cost of requiring what is basically a macro and thus new and slightly unfamiliar terms.
#2 is simple and clear. It lays out your mappings in data form, and then iterates them using basic language constructs. To add new mappings, you only need add a line to the array. You might end up loading your mappings from an external file or URL down the line, and that would be an easy change. To change what is done with them, you only need change the body of your for loop (which itself could be made into a separate function if the need arose).
Your complaint about #2's "object before verb" ordering doesn't bother me at all. In scanning that function, I would basically first assume the verb does what it's supposed to do and focus on the object, which is now clear, immediately visible, and maintainable. Only if there were problems would I look at the verb, and it would be immediately evident what it is doing.