What are the best practices for __repr__ with a collection class in Python?

I have a custom Python class which essentially encapsulates a list of some kind of object, and I'm wondering how I should implement its __repr__ function. I'm tempted to go with the following:
class MyCollection:
    def __init__(self, objects=[]):
        self._objects = []
        self._objects.extend(objects)

    def __repr__(self):
        return f"MyCollection({self._objects})"
This has the advantage of producing a valid Python output which fully describes the class instance. However, in my real-world case, the object list can be rather large and each object may have a large repr by itself (they are arrays themselves).
What are the best practices in such situations? Accept that the repr might often be a very long string? Are there potential issues related to this (debugger UI, etc.)? Should I implement some kind of shortening scheme using an ellipsis? If so, is there a good/standard way to achieve this? Or should I skip listing the collection's content altogether?

The official documentation describes how you should handle __repr__:
Called by the repr() built-in function to compute the “official” string representation of an object. If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description...> should be returned. The return value must be a string object. If a class defines __repr__() but not __str__(), then __repr__() is also used when an “informal” string representation of instances of that class is required.
This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.
Python 3 __repr__ Docs
Lists, strings, sets, tuples and dictionaries all print out the entirety of their collection in their __repr__ method.
Your current code follows the documentation's example well. Though I would suggest changing your __init__ method so it looks more like this:
class MyCollection:
    def __init__(self, objects=None):
        if objects is None:
            objects = []
        self._objects = objects

    def __repr__(self):
        return f"MyCollection({self._objects})"
You generally want to avoid using mutable objects as default arguments. Technically, because your __init__ creates a new list and only extends it with the argument's contents (so the shared default list is never mutated), it will still work perfectly fine, but Python's documentation still suggests you avoid this.
It is good programming practice to not use mutable objects as default values. Instead, use None as the default value and inside the function, check if the parameter is None and create a new list/dictionary/whatever if it is.
https://docs.python.org/3/faq/programming.html#why-are-default-values-shared-between-objects
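To see why the FAQ warns about this, note that the default list is created once, at function definition time, and shared between calls. A standard demonstration of the pitfall (not part of the answer above):

def append_to(item, bucket=[]):
    # 'bucket' is created once and reused by every call that
    # doesn't pass its own list
    bucket.append(item)
    return bucket

>>> append_to(1)
[1]
>>> append_to(2)  # the default list remembers the previous call
[1, 2]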
If you're interested in how another library handles it differently, the repr for NumPy arrays only shows the first three and last three items when the array length is greater than 1,000. It also formats the items so they all use the same amount of space (in the example below, 1000 takes up four characters, so 0 is padded with three spaces to match).
>>> import numpy as np
>>> repr(np.array([i for i in range(1001)]))
'array([   0,    1,    2, ...,  998,  999, 1000])'
To mimic this numpy array style you could implement a __repr__ method like this in your class:
class MyCollection:
    def __init__(self, objects=None):
        if objects is None:
            objects = []
        self._objects = objects

    def __repr__(self):
        # If length is less than 1,000, return the full list.
        if len(self._objects) < 1000:
            return f"MyCollection({self._objects})"
        else:
            # Get the first and last three items
            items_to_display = self._objects[:3] + self._objects[-3:]
            # Find which item has the longest repr
            max_length_repr = max(items_to_display, key=lambda x: len(repr(x)))
            # Get the length of the item with the longest repr
            padding = len(repr(max_length_repr))
            # Create a list of the reprs of each item and apply the padding
            values = [repr(item).rjust(padding) for item in items_to_display]
            # Insert the '...' between the 3rd and 4th items
            values.insert(3, '...')
            # Convert the list to a string joined by commas
            array_as_string = ', '.join(values)
            return f"MyCollection([{array_as_string}])"
>>> repr(MyCollection([1,2,3,4]))
'MyCollection([1, 2, 3, 4])'
>>> repr(MyCollection([i for i in range(1001)]))
'MyCollection([   0,    1,    2, ...,  998,  999, 1000])'
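As a side note, if you want a ready-made shortening scheme, the standard library's reprlib module produces size-limited reprs with an ellipsis. A minimal sketch of that alternative (not part of the NumPy-style approach above):

import reprlib

class MyCollection:
    def __init__(self, objects=None):
        self._objects = list(objects) if objects is not None else []

    def __repr__(self):
        # reprlib.repr truncates long containers with '...'
        return f"MyCollection({reprlib.repr(self._objects)})"

>>> repr(MyCollection(range(1001)))
'MyCollection([0, 1, 2, 3, 4, 5, ...])'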

Related

What is the reason for list.__str__ using __repr__ of its elements?
Example:
class Unit:
    def __str__(self):
        return "unit"

    def __repr__(self):
        return f"<unit id={id(self)}>"
>>> str(Unit())
'unit'
>>> str([Unit()])
'[<unit id=1491139133008>]'
If lists used the str() of their elements rather than their repr(), then what would an output of [1, 2] mean? Is that a list with two integers, two strings, or one of each? Or is it a list with one element, the string "1, 2"? Python made the only choice that allows you to determine anything at all about the contents of a list by looking at its string representation.
It's because the check list.__str__ is object.__str__ returns True (list doesn't define its own __str__), and object.__str__ is basically implemented like:
def __str__(self):
    return repr(self)
Unless a type implements its own __str__, it uses object.__str__. Most builtin types in Python do not implement their own __str__ because they do not have a logical choice for a human-friendly string representation; list is no different.
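You can check this identity directly:
>>> list.__str__ is object.__str__
True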

Get array of a specific attribute of all objects from an array of objects

I have an array containing a number of objects with many attributes. I want to get the values of specific attributes as a list in a simple way.
I do know that I can make a list of each attribute and save them into individual lists as follows.
attr = [o.attr for o in objarray]
But as there are a lot of attributes, and these need to be analyzed using plots, distributions, etc., this is not an efficient way.
In my case, I am analyzing an array of 'Structure' objects, which have attributes like lattice constants, positions of atoms, etc. The objects also have functions to get distances, angles, etc., which return the corresponding values when given the indices of atoms. What I want is to get a list of values (which may be an attribute like a lattice constant, or the output of one of the object's functions, like the distance between two atoms), one for each of the structures in the array. Making an individual list for each of the values needed (as mentioned above) is inefficient, as many such lists may need to be made, and the values needed may differ depending on the purpose.
What I need is to get a list of values by something in the manner of:
objarray[a:b].attr
which can be used easily for plotting and other functions. But this doesn't work and gives an error:
AttributeError: 'list' object has no attribute 'attr'
Alternatively, is there a way to make an array of objects which treats objects in the above-mentioned way?
I would probably use the getattr built-in function for this purpose.
>>> my_object.my_attribute = 5
>>> getattr(my_object, 'my_attribute')
5
To create the numpy array as you would want:
import numpy as np

def get_attrs(obj, attributes):
    """Return the requested attributes of an object as a list."""
    return [getattr(obj, attr) for attr in attributes]

attributes = ['a', 'b', 'c']
attributes_per_object = np.array([get_attrs(obj, attributes) for obj in all_objects])
This answer is inspired by the answer from energya, which uses the getattr built-in function. That answer builds a function that returns several attributes of one specific object, while the question asked for one specific attribute of all the objects in the array. So, using the same getattr function, here is how to get a NumPy array of a specific attribute (or method result) of all objects:
import numpy as np

def get_attrs(all_objects, attribute, args=None):
    """Return the requested attribute of all the objects as an array."""
    if args is None:
        # The requested attribute is a plain variable
        return np.array([getattr(obj, attribute) for obj in all_objects])
    else:
        # The requested attribute is a method; call it with *args
        return np.array([getattr(obj, attribute)(*args) for obj in all_objects])
# For getting a variable 'my_object.a' of all objects
attribute_list = get_attrs(all_objects, 'a')
# For getting a method 'my_object.func(*args)' of all objects
attribute_list = get_attrs(all_objects, 'func', args)
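As a related standard-library option (not used in the answers above), operator.attrgetter performs the same lookup and can fetch several attributes at once:

from operator import attrgetter

class Obj:
    def __init__(self, a, b):
        self.a, self.b = a, b

all_objects = [Obj(1, 2), Obj(3, 4)]

# attrgetter('a') pulls obj.a from each object;
# attrgetter('a', 'b') returns (obj.a, obj.b) tuples
a_values = [attrgetter('a')(obj) for obj in all_objects]  # [1, 3]
pairs = list(map(attrgetter('a', 'b'), all_objects))      # [(1, 2), (3, 4)]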
I was also looking for something similar and this is how I did it.
low = 0
high = 10
x = [objarray[i].some_attr for i in range(low, high)]
Probably there's a better way.

Python ordered list search versus searching sets for list of objects

I have two lists of objects. Let's call the lists a and b. The objects (for our intents and purposes) are defined as below:
from fuzzywuzzy import fuzz  # fuzzy string matching

class MyObj:
    def __init__(self, string: str, integer: int):
        self.string = string
        self.integer = integer

    def __eq__(self, other):
        if self.integer != other.integer:
            return False
        # fuzzywuzzy checks whether the strings are "similar enough"
        return fuzz.ratio(self.string, other.string) > 90
Now what I want to achieve is to check which objects in list a are "in" list b (return true against == when compared to some object in list b).
Currently I'm just looping through them as follows:
for obj in a:
    for other_obj in b:
        if obj == other_obj:
            <do something>
            break
I strongly suspect that there is a faster way of implementing this. The lists are long. Up to like 100 000 objects each. So this is a big bottleneck in my code.
I looked at this answer Fastest way to search a list in python and it suggests that sets work much better. I'm a bit confused by this though:
How significant is the "removal of duplicates" speedup? I don't expect to have many duplicates in my lists.
Can sets remove duplicates and properly hash when I have defined the eq the way I have?
How would this compare with pre-ordering the list, and using something like binary search? A set is unordered...
So what is the best approach here? Please provide implementation guidelines in the answer as well.
TL;DR, when using fuzzy comparison techniques, sets and sorting can be very difficult to work with without some normalization method. You can try to be smart about reducing search spaces as much as possible, but care should be taken to do it consistently.
If a class defines __eq__ and not __hash__, it is not hashable.
For instance, consider the following class
class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __repr__(self):
        return f'{self.first} {self.last}'

    def __eq__(self, other):
        return (self.first == other.first) and (self.last == other.last)
Now, if you were to try to create a set with these elements
>>> {Name('Neil', 'Stackoverflow-user')}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'Name'
So, in the case of Name, you would simply define a __hash__ method. However, in your case, this is more difficult since you have fuzzy equality semantics. The only way I can think of to get around this is to have a normalization function that you can prove will be consistent, and use the normalized string instead of the actual string as part of your hash. Take Floats as dictionary keys as an example of needing to normalize in order to use a "fuzzy" type like floats as keys.
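For the simple Name case, the missing piece would look like this (a sketch; for MyObj, hashing only the exact self.integer field would likewise be consistent with its fuzzy __eq__):

class Name:
    def __init__(self, first, last):
        self.first = first
        self.last = last

    def __repr__(self):
        return f'{self.first} {self.last}'

    def __eq__(self, other):
        return (self.first == other.first) and (self.last == other.last)

    def __hash__(self):
        # Equal objects must hash equally; hashing the same fields
        # that __eq__ compares guarantees that
        return hash((self.first, self.last))

>>> {Name('Neil', 'Stackoverflow-user')}
{Neil Stackoverflow-user}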
For sorting and binary searching, since you are fuzzy-searching, you still need to be careful with things like binary search. As an example, assume equality is determined by being within a certain Levenshtein distance. Then book and hook will be similar to each other (distance = 1), yet hack, at a distance of 2 from hook, sorts closer to hook than book does alphabetically. So how will you define a good sort order for fuzzy searching in this case?
One thing to try would be to use some form of group-by/bucketing, like a dictionary of the type Dict[int, List[MyObj]], where instances of MyObj are classified by their one constant, the self.integer field. Then you can try comparing smaller sub-lists. This would at least reduce search spaces by clustering.
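A minimal sketch of that bucketing idea, assuming MyObj as defined in the question (find_matches is a hypothetical helper name):

from collections import defaultdict

def find_matches(a, b):
    # Group list b by the exact integer field so the expensive fuzzy
    # string comparison only runs within the matching bucket
    buckets = defaultdict(list)
    for obj in b:
        buckets[obj.integer].append(obj)
    for obj in a:
        for candidate in buckets.get(obj.integer, ()):
            if obj == candidate:
                yield obj, candidate
                break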

What does keyword "in" really do when is used to test if a sequence (list, tuple, string etc.) contains a value. A loop until find the value?

given the following code:
numbers = [1, 2, 3, 1, 2, 4, 5]
if 5 in numbers:
    .................
We can see that we have a list (numbers) with 7 items, and I want to know whether the keyword in does a loop behind the scenes to find a match in the list.
It depends. It will call __contains__() on the container class (the right-hand side). That can be implemented as a loop, but for some classes it can be computed by some other, faster method where possible.
You can even define it on your own class, as in this illustrative example:
class ContainsEverything:
    def __contains__(self, item):
        return True
c = ContainsEverything()
>>> None in c
True
>>> 4 in c
True
For container types in general, this is documented under __contains__ in the data model chapter and Membership test operations in the expressions chapter.
When you write this:
x in s
… what Python does is effectively (slightly oversimplified):
try:
    return s.__contains__(x)
except AttributeError:
    for value in s:
        if value == x:
            return True
    return False
So, any type that defines a __contains__ method can do whatever it wants; any type that doesn't, Python automatically loops over it as an iterable. (Which in turn calls s.__iter__ if present, the old-style sequence API with s.__getitem__ if not.)
For the builtin sequence types list and tuple, the behavior is defined under Sequence Types — list, tuple, range and (again) Membership test operations:
True if an item of s is equal to x, else False
… which is exactly the same as the fallback behavior.
Semantically:
For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).
For list and tuple, this is in fact implemented by looping over all of the elements, and it's hard to imagine how it could be implemented otherwise. (In CPython, it's slightly faster because it can do the loop directly over the underlying array, instead of using an iterator, but it's still linear time.)
However, some other builtin types do something smarter. For example:
range does the closed-form arithmetic to check whether the value is in the range in constant time.
set, frozenset, and dict use the underlying hash table to look up the value in constant time.
There are some third-party types, like sorted-dict collections based on trees or skiplists or similar, that can't search in constant time but can search in logarithmic time by walking the tree. So, they'll (hopefully) implement __contains__ in logarithmic time.
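For example, range answers membership arithmetically, so the checks below return instantly even for an astronomically large range:
>>> r = range(0, 10**18, 3)
>>> 999999999999999999 in r   # divisible by 3 and within bounds
True
>>> 999999999999999997 in r   # not a multiple of the step
False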
Also note that if you use the ABC/mixin helpers in collections.abc to define your own Sequence or Mapping type, you get a __contains__ implementation for free. For sequences, this works by iterating over all the elements; for mappings, it works by trying self[key] and catching KeyError.
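For example, here is a minimal Sequence subclass (Squares is a hypothetical name for illustration); it only provides __len__ and __getitem__, and inherits a working linear-time __contains__ from the mixin:

from collections.abc import Sequence

class Squares(Sequence):
    """The first n perfect squares."""
    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if not 0 <= i < self._n:
            raise IndexError(i)
        return i * i

>>> 49 in Squares(10)  # the inherited __contains__ iterates for us
True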

Python lists/arrays: disable negative indexing wrap-around in slices

While I find the negative number wraparound (i.e. A[-2] indexing the second-to-last element) extremely useful in many cases, when it happens inside a slice it is usually more of an annoyance than a helpful feature, and I often wish for a way to disable that particular behaviour.
Here is a canned 2D example below, but I have had the same peeve a few times with other data structures and in other numbers of dimensions.
import numpy as np

A = np.random.randint(0, 2, (5, 10))

def foo(i, j, r=2):
    '''sum of neighbours within r steps of A[i,j]'''
    return A[i-r:i+r+1, j-r:j+r+1].sum()
In the slice above, I would rather that any negative number in the slice be treated the same as None, rather than wrapping to the other end of the array.
Because of the wrapping, the otherwise nice implementation above gives incorrect results at boundary conditions and requires a patch of some sort, like:
def ugly_foo(i, j, r=2):
    def thing(n):
        return None if n < 0 else n
    return A[thing(i-r):i+r+1, thing(j-r):j+r+1].sum()
I have also tried zero-padding the array or list, but it is still inelegant (it requires adjusting the lookup indices accordingly) and inefficient (it requires copying the array).
Am I missing some standard trick or elegant solution for slicing like this? I noticed that Python and NumPy already handle the case where you specify too large a number nicely: if the index is greater than the shape of the array, it behaves the same as if it were None.
My guess is that you would have to create your own subclass wrapper around the desired objects and re-implement __getitem__() to convert negative keys to None, and then call the superclass __getitem__.
Note: what I am suggesting is to subclass existing custom classes, but NOT builtins like list or dict. This is simply to make a utility around another class, not to confuse the normal expected operations of a list type. It would be something you would want to use within a certain context for a period of time until your operations are complete. It is best to avoid making a globally different change that will confuse users of your code.
Datamodel:
object.__getitem__(self, key)
Called to implement evaluation of self[key]. For sequence types, the accepted keys should be integers and slice objects. Note that the special interpretation of negative indexes (if the class wishes to emulate a sequence type) is up to the __getitem__() method. If key is of an inappropriate type, TypeError may be raised; if of a value outside the set of indexes for the sequence (after any special interpretation of negative values), IndexError should be raised. For mapping types, if key is missing (not in the container), KeyError should be raised.
You could even create a wrapper that simply takes an instance as an arg, and just defers all __getitem__() calls to that private member, while converting the key, for cases where you can't or don't want to subclass a type, and instead just want a utility wrapper for any sequence object.
Quick example of the latter suggestion:
class NoWrap(object):
    def __init__(self, obj, default=None):
        self._obj = obj
        self._default = default

    def __getitem__(self, key):
        if isinstance(key, int) and key < 0:
            return self._default
        return self._obj[key]

>>> x = list(range(-10, 10))
>>> x_wrapped = NoWrap(x)
>>> x_wrapped[5]
-5
>>> print(x_wrapped[-1])
None
>>> x_wrapped = NoWrap(x, 'FOO')
>>> x_wrapped[-1]
'FOO'
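Since the question is specifically about slices, a slice-aware variant of the same idea might look like this (a sketch; NoWrapSlices is a hypothetical name):

class NoWrapSlices(object):
    def __init__(self, obj):
        self._obj = obj

    @staticmethod
    def _clamp(n):
        # treat negative bounds the way None is treated
        return None if (n is not None and n < 0) else n

    def __getitem__(self, key):
        if isinstance(key, slice):
            key = slice(self._clamp(key.start), self._clamp(key.stop), key.step)
        return self._obj[key]

>>> A = list(range(10))
>>> A[-2:3]                # wrap-around makes this empty
[]
>>> NoWrapSlices(A)[-2:3]  # -2 is clamped to None, i.e. A[:3]
[0, 1, 2]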
While you could subclass e.g. list as suggested by jdi, Python's slicing behaviour is not something anyone's going to expect you to muck about with.
Changing it is likely to lead to some serious head-scratching by other people working with your code when it doesn't behave as expected - and it could take a while before they go looking at the special methods of your subclass to see what's actually going on.
See: Action at a distance
I think this isn't ugly enough to justify new classes and wrapping things.
Then again it's your code.
def foo(i, j, r=2):
    '''sum of neighbours within r steps of A[i,j]'''
    # clamp the start indices at zero so they cannot wrap around
    return A[max(0, i-r):i+r+1, max(0, j-r):j+r+1].sum()  # ugly, but works
(Downvoting is fun, so I've added some more options)
I found out something quite unexpected (for me): in Python 2, __getslice__(i, j) does not wrap! Instead, negative indices are just ignored (note that __getslice__ was removed in Python 3), so:
lst[1:3] == lst.__getslice__(1, 3)
lst[-3:-1] gives the two items before the last one, but lst.__getslice__(-3, -1) == []
and finally:
lst[-2:1] == [], but lst.__getslice__(-2, 1) == lst[0:1]
Surprising, interesting, and completely useless.
If this only needs to apply in a few specific operations, a simple and straightforward if index >= 0: do_something(array[i]) / if index < 0: raise IndexError would do.
If it needs to apply more widely, it's still the same logic, just wrapped in one manner or another.
