What is a flexible, hybrid Python collection object?

As a way to get used to python, I am trying to translate some of my code to python from Autohotkey_L.
I am immediately running into tons of choices for collection objects. Can you help me figure out a built in type or a 3rd party contributed type that has as much as possible, the functionality of the AutoHotkey_L object type and its methods.
AutoHotkey_L Objects have features of a python dict, list, and a class instance.
I understand that there are tradeoffs for space and speed, but I am just interested in functionality rather than optimization issues.

Don't write Python as <another-language>. Write Python as Python.
The data structure should be chosen just to have the minimal ability you need to use.
list — an ordered sequence of elements, with 1 flexible end.
collections.deque — an ordered sequence of elements, with 2 flexible ends (e.g. a queue).
set / frozenset — an unordered collection of unique elements.
collections.Counter — an unordered collection of elements with counts (a multiset).
dict — a key-value mapping (unordered before Python 3.7, insertion-ordered since).
collections.OrderedDict — an ordered key-value relationship.
bytes / bytearray — a list of bytes.
array.array — a homogeneous list of primitive types.
Looking at the interface of Object,
dict would be the most suitable for finding a value by key
collections.OrderedDict would be the most suitable for the push/pop stuff.
MinIndex / MaxIndex would need a sorted key-value relationship (e.g. a red-black tree). There's no such type in the standard library, but there are 3rd-party implementations.
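For example, a minimal sketch using the third-party sortedcontainers package (an assumption; any sorted-mapping library would do) to cover the MinIndex / MaxIndex use case:

from sortedcontainers import SortedDict  # third-party: pip install sortedcontainers

d = SortedDict({3: 'c', 1: 'a', 2: 'b'})
min_key, _ = d.peekitem(0)   # smallest key, like MinIndex -> 1
max_key, _ = d.peekitem(-1)  # largest key, like MaxIndex -> 3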

It would be impossible to recommend a particular class without knowing how you intend to use it. If you are using this particular object as an ordered sequence where elements can be repeated, then you should use a list; if you are looking up values by their key, then use a dictionary. You will get very different algorithmic runtime complexity with the different data types. It really does not take that much time to determine when to use which type; I suggest you give it some further consideration.
If you really can't decide, though, here is a possibility:
class AutoHotKeyObject(object):
    def __init__(self):
        self.list_value = []
        self.dict_value = {}

    def getDict(self):
        return self.dict_value

    def getList(self):
        return self.list_value
With the above, you could use both the list and dictionary features, like so:
obj = AutoHotKeyObject()
obj.getList().append(1)
obj.getList().append(2)
obj.getList().append(3)
print(obj.getList())  # Prints [1, 2, 3]

obj.getDict()['a'] = 1
obj.getDict()['b'] = 2
print(obj.getDict())  # Prints {'a': 1, 'b': 2}

Related

Python 3.7+: Access elements of a dictionary by order

So I understand that, since Python 3.7, dicts are ordered, but the documentation doesn't seem to list methods of using said ordering. For example, how do I access say the first element of the dict by order, independent of keys? What operations can I do with the ordered dict?
As an example, I am working on implementing a Least Frequently Used cache, where I need to track not only the number of uses of a key, but also use Least Recently Used information as a tie break. I could use a dict of queues to implement a priority queue, but then I lose O(1) lookup within the queue. If I use a dict of dicts I can retain the advantages of a hashed set AND implement it as a priority queue... I think. I just need to be able to pop the first element of the dictionary. Alas, there is no dict.popleft().
For now I am converting the keys to a list and just using the first element of the list, but while this works (the dict is keeping ordering), the conversion is costly.
import collections

LFU_queue = collections.defaultdict(collections.defaultdict)
LFU_queue[1].update({"key_1": None})
LFU_queue[1].update({"key_32": None})
LFU_queue[1].update({"key_6": None})
# Inspecting this, I DO get what I expect, which is a
# dict of dicts with the given ordering:
# {1: {"key_1": None, "key_32": None, "key_6": None}}
# Here is where I'd love to be able to do something like
# LFU_queue[1].popleft() to return {"key_1": None}
list(LFU_queue[1])[0] works, but is less than ideal
As others commented, OrderedDict is the right choice for that problem. But you can also get the first entry from a plain dict via items(), like this:
d = {}
d['a'] = 1
d['b'] = 2
first_pair = next(iter(d.items()), None)
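If you also need to remove that first entry (the popleft() behaviour the question asks for), a rough equivalent with a plain dict is:

d = {'a': 1, 'b': 2}
first_key = next(iter(d))       # first key in insertion order
first_value = d.pop(first_key)  # remove it, roughly a popleft()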
If you are interested, see PEP 372 and PEP 468 for the history of dictionary ordering.
With PEP 468, you can implement syntactic sugar like this:
from ctypes import Structure, c_int, c_ulong

def c_fields(**kwargs):
    return list(kwargs.items())

class XClientMessageEvent(Structure):
    _fields_ = c_fields(
        type = c_int,
        serial = c_ulong,
        send_event = c_int,
        ...
    )

Use of dict instead of class

I don't understand when should I use dictionaries instead of classes on Python. If a class can do the same as dictionaries and more, why do we use dictionaries? For example, the next two snippets do the same:
Using class:
class PiggyBank:
    def __init__(self, dollars, cents):
        self.dollars = dollars
        self.cents = cents

piggy1 = PiggyBank(2, 2)
print(piggy1.dollars)   # 2
print(piggy1.__dict__)  # {'dollars': 2, 'cents': 2}
Using dictionary:
def create_piggy(dollars, cents):
    return {'dollars': dollars, 'cents': cents}

piggy2 = create_piggy(2, 2)
print(piggy2['dollars'])  # 2
print(piggy2)             # {'dollars': 2, 'cents': 2}
So at the end, I am creating two static objects with the same information. When should I use a class or a function for creating an instance?
You can use a dictionary if it suffices for your use case and you are not forced to use a class. But if you need some of the added benefits of a class (like the ones below), you may write it as a class; a sketch follows the list.
Some use cases when you might need Class
When you need to update code frequently
When you need encapsulation
When you need to reuse the same interface, code, or logic
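As a minimal sketch (using the standard dataclasses module, available since Python 3.7), here is the PiggyBank from the question as a class with a bit of behavior attached that a bare dict can't carry:

from dataclasses import dataclass

@dataclass
class PiggyBank:
    dollars: int
    cents: int

    def total_cents(self) -> int:
        # behavior lives next to the data it operates on
        return self.dollars * 100 + self.cents

piggy = PiggyBank(2, 2)
print(piggy.total_cents())  # 202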
dicts are often called "associative arrays". They're like regular lists in Python, except that instead of using an integer as an index, you can use strings (or any hashable type).
You can certainly implement list-like and dict-like objects (as well as set-like objects and str-like objects) with Python classes, but why would you if there's already an object type that does exactly what you need?
So if all you're looking for is an associative array, a dict will most likely serve you just fine. But if you jump through hoops to implement a class that already does what a dict does, you'll be "re-inventing the wheel," as well as giving the future readers of your code extra work trying to figure out that all your class is doing is re-implementing a dict with no real extra functionality.
Basically, if all that you need is a dict, just use a dict. Otherwise, you'll be writing a lot of extra code (that may be prone to bugs) for no real gain.
You would typically want to use a dictionary when you're doing something dynamic -- something where you don't know all of the keys or information when you're writing the code. Note also that classes have restrictions dicts don't, such as attribute names having to be valid identifiers (you can't use a bare number as an attribute name).
As a classic example (which has better solutions using collections.Counter, but we'll still keep it for its educational value), if you have a list of values that you want to count you can use a dictionary to do so efficiently:
items_to_count = ['a', 'a', 'a', 'b', 5, 5]
count = {}
for item in items_to_count:
    if item in count:
        count[item] += 1
    else:
        count[item] = 1
# count == {'a': 3, 'b': 1, 5: 2}
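For comparison, the collections.Counter solution mentioned above collapses the loop to a single call:

from collections import Counter

count = Counter(['a', 'a', 'a', 'b', 5, 5])
# Counter({'a': 3, 5: 2, 'b': 1})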
why do we use dictionaries?
Dictionaries are built-in containers with quite a bit of functionality that can be used at no extra coding cost to us. Since they are built in, a lot of that functionality is probably optimized.

Python - Versioned list instead of immutable list?

Update:
As of CPython 3.6, dictionaries have a version (thank you pylang for showing this to me).
If they added the same version to list and made it public, all 3 asserts from my original post would pass! It would definitely meet my needs. Their implementation differs from what I envisioned, but I like it.
As it is, I don't feel I can use dictionary version:
It isn't public. Jake Vanderplas shows how to expose it in a post, but he cautions: definitely not code you should use for any purpose beyond simply having fun. I agree with his reasons.
In all of my use cases, the data is conceptually arrays of elements each of which has the same structure. A list of tuples is a natural fit. Using a dictionary would make the code less natural and probably more cumbersome.
Does anyone know if there are plans to add version to list?
Are there plans to make it public?
If there are plans to add version to list and make it public, I would feel awkward putting forward an incompatible VersionedList now. I would just implement the bare minimum I need and get by.
Original post below
Turns out that many of the times I wanted an immutable list, a VersionedList would have worked almost as well (sometimes even better).
Has anyone implemented a versioned list?
Is there a better, more Pythonic, concept that meets my needs? (See motivation below.)
What I mean by a versioned list is:
A class that behaves like a list
Any change to an instance or elements in the instance results in instance.version() being updated. So, if alist is a normal list:
a = VersionedList(alist)
a_version = a.version()
change(a)
assert a_version != a.version()
reverse_last_change(a)
If a list was hashable, hash() would achieve the above and meet all the needs identified in the motivation below. We need to define 'version()' in a way that doesn't have all of the same problems as 'hash()'.
If identical data in two lists is highly unlikely to ever happen except at initialization, we aren't going to have a reason to test for deep equality. From https://docs.python.org/3.5/reference/datamodel.html#object.hash: "The only required property is that objects which compare equal have the same hash value." If we don't impose this requirement on version(), it seems likely that version() won't have all of the same problems that make lists unhashable. So, unlike hash, identical contents doesn't mean the same version:
#contents of 'a' are now identical to original, but...
assert a_version != a.version()
b = VersionedList(alist)
c = VersionedList(alist)
assert b.version() != c.version()
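A minimal sketch of that behaviour (names are illustrative; it only notices mutations that go through the list API, not in-place changes to mutable elements) might look like this:

import itertools

class VersionedList(list):
    # one shared counter, so a version number is never reused;
    # this also gives two freshly built lists different versions
    _counter = itertools.count()

    def __init__(self, iterable=()):
        super().__init__(iterable)
        self._version = next(self._counter)

    def version(self):
        return self._version

    def _bump(self):
        self._version = next(self._counter)

    def append(self, item):
        super().append(item)
        self._bump()

    def __setitem__(self, index, value):
        super().__setitem__(index, value)
        self._bump()

    # extend, insert, pop, remove, sort, reverse, __delitem__, ...
    # would need the same treatment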
For VersionedList, it would be good if any attempt to modify the result of __getitem__ automatically resulted in a copy instead of modifying the underlying implementation data. I think that the only other option would be to have __getitem__ always return a copy of the elements, and this would be very inefficient for all of the use cases I can think of. I think we need to restrict the elements to immutable objects (deeply immutable, for example: exclude tuples with list elements). I can think of 3 ways to achieve this:
Only allow elements that can't contain mutable elements (int, str, etc are fine, but exclude tuples). (This is far too limiting for my cases)
Add code to __init__, __setitem__, etc. to traverse inputs to deeply check for mutable sub-elements. (expensive, any way to avoid this?)
Also allow more complex elements, but require that they are deeply immutable. Perhaps require that they expose a deeply_immutable attribute. (This turns out to be easy for all the use cases I have)
Motivation:
If I am analyzing a dataset, I often have to perform multiple steps that return large datasets (note: since the dataset is ordered, it is best represented by a List not a set).
If at the end of several steps (ex: 5) it turns out that I need to perform different analysis (ex: back at step 4), I want to know that the dataset from step 3 hasn't accidentally been changed. That way I can start at step 4 instead of repeating steps 1-3.
I have functions (control-points, first-derivative, second-derivative, offset, outline, etc) that depend on and return array-valued objects (in the linear algebra sense). The base 'array' is knots.
control-points() depends on: knots, algorithm_enum
first-derivative() depends on: control-points(), knots
offset() depends on: first-derivative(), control-points(), knots, offset_distance
outline() depends on: offset(), end_type_enum
If offset_distance changes, I want to avoid having to recalculate first-derivative() and control-points(). To avoid recalculation, I need to know that nothing has accidentally changed the resultant 'arrays'.
If 'knots' changes, I need to recalculate everything and not depend on the previous resultant 'arrays'.
To achieve this, knots and all of the 'array-valued' objects could be VersionedList.
FYI: I had hoped to take advantage of an efficient class like numpy.ndarray. In most of my use cases, the elements logically have structure. Having to mentally keep track of multi-dimensions of indexes meant implementing and debugging the algorithms was many times more difficult with ndarray. An implementation based on lists of namedtuples of namedtuples turned out to be much more sustainable.
Private dicts in 3.6
In Python 3.6, dictionaries are versioned (PEP 509, via a private field) and compact (issue 27350), which lets them track changes and preserve insertion order, respectively. These features hold when using the CPython 3.6 implementation. Despite the caveats, Jake VanderPlas gives a detailed demonstration in his blog post of exposing this versioning feature from CPython within normal Python. We can use his approach to:
determine when a dictionary has been updated
preserve the order
Example
import numpy as np

d = {"a": np.array([1, 2, 3]),
     "c": np.array([1, 2, 3]),
     "b": np.array([8, 9, 10]),
    }

for i in range(3):
    print(d.get_version())  # monkey-patch, see Details below
# 524938
# 524938
# 524938
Notice the version number does not change until the dictionary is updated, as shown below:
d.update({"c": np.array([10, 11, 12])})
d.get_version()
# 534448
In addition, the insertion order is preserved (the following was tested in restarted sessions of Python 3.5 and 3.6):
list(d.keys())
# ['a', 'c', 'b']
You may be able to take advantage of this new dictionary behavior, saving you from implementing a new datatype.
Details
For those interested, get_version() is a monkey-patched method available on any dictionary, implemented in Python 3.6 using the following code adapted from Jake VanderPlas' blog post. This code was run prior to calling get_version().
import types
import ctypes
import sys

assert (3, 6) <= sys.version_info < (3, 7)  # valid only in Python 3.6

py_ssize_t = ctypes.c_ssize_t

# Emulate the PyObjectStruct from CPython
class PyObjectStruct(ctypes.Structure):
    _fields_ = [('ob_refcnt', py_ssize_t),
                ('ob_type', ctypes.c_void_p)]

# Create a DictStruct class to wrap existing dictionaries
class DictStruct(PyObjectStruct):
    _fields_ = [("ma_used", py_ssize_t),
                ("ma_version_tag", ctypes.c_uint64),
                ("ma_keys", ctypes.c_void_p),
                ("ma_values", ctypes.c_void_p),
               ]

    def __repr__(self):
        return (f"DictStruct(size={self.ma_used}, "
                f"refcount={self.ob_refcnt}, "
                f"version={self.ma_version_tag})")

    @classmethod
    def wrap(cls, obj):
        assert isinstance(obj, dict)
        return cls.from_address(id(obj))

assert object.__basicsize__ == ctypes.sizeof(PyObjectStruct)
assert dict.__basicsize__ == ctypes.sizeof(DictStruct)

# Code for monkey-patching existing dictionaries
class MappingProxyStruct(PyObjectStruct):
    _fields_ = [("mapping", ctypes.POINTER(DictStruct))]

    @classmethod
    def wrap(cls, D):
        assert isinstance(D, types.MappingProxyType)
        return cls.from_address(id(D))

assert types.MappingProxyType.__basicsize__ == ctypes.sizeof(MappingProxyStruct)

def mappingproxy_setitem(obj, key, val):
    """Set an item in a read-only mapping proxy"""
    proxy = MappingProxyStruct.wrap(obj)
    ctypes.pythonapi.PyDict_SetItem(proxy.mapping,
                                    ctypes.py_object(key),
                                    ctypes.py_object(val))

mappingproxy_setitem(dict.__dict__,
                     'get_version',
                     lambda self: DictStruct.wrap(self).ma_version_tag)

What is the use case of the immutable objects

What is the use case of immutable types/objects like tuple in Python?
tuple('hi')
('h', 'i')
Where can we use such unchangeable sequences?
One common use case is the list of (unnamed) arguments to a function.
In [1]: def foo(*args):
   ...:     print(type(args))
   ...:

In [2]: foo(1, 2, 3)
<class 'tuple'>
Technically, tuples are semantically different to lists.
When you have a list, you have something that is... a list. Of items of some sort. And therefore can have items added or removed to it.
A tuple, on the other hand, is a set of values in a given order. It just happens to be one value that is made up of more than one value. A composite value.
For example. Say you have a point. X, Y. You could have a class called Point, but that class would have a dictionary to store its attributes. A point is only two values which are, most of the time, used together. You don't need the flexibility or the cost of a dictionary for storing named attributes, you can use a tuple instead.
myPoint = 70, 2
Points are always X and Y. Always 2 values. They are not lists of numbers. They are two values in which the order of a value matters.
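If you want the two positions to carry names without paying for a full class or a dict, collections.namedtuple is a common middle ground:

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
my_point = Point(70, 2)
print(my_point.x, my_point.y)  # 70 2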
Another example of tuple usage. A function that creates links from a list of tuples. The tuples must be the href and then the label of the link. Fixed order. Order that has meaning.
def make_links(*tuples):
    # each tuple is (href, label), in that fixed order
    return "".join('<a href="%s">%s</a>' % t for t in tuples)

make_links(
    ("//google.com", "Google"),
    ("//stackoverflow.com", "Stack Overflow"),
)
So the reason tuples don't change is because they are supposed to be one single value. You can only assign the whole thing at once.
Here is a good resource that describes the difference between tuples and lists, and the reasons for using each: https://mail.python.org/pipermail/tutor/2001-September/008888.html
The main reason outlined in that link is that tuples are immutable and more lightweight than, say, lists. This makes them useful only in certain situations, but if those situations can be identified, tuples take up far fewer resources.
Immutable objects will make life simpler in many cases. They are especially applicable for value types, where objects don't have an identity so they can be easily replaced. And they can make concurrent programming way safer and cleaner (most of the notoriously hard to find concurrency bugs are ultimately caused by mutable state shared between threads). However, for large and/or complex objects, creating a new copy of the object for every single change can be very costly and/or tedious. And for objects with a distinct identity, changing an existing objects is much more simple and intuitive than creating a new, modified copy of it.

How to combine hash codes in in Python3?

I am more familiar with the "Java way" of building complex / combined hash codes from superclasses in subclasses. Is there a better / different / preferred way in Python 3? (I cannot find anything specific to Python3 on this matter via Google.)
class Superclass:
    def __init__(self, data):
        self.__data = data

    def __hash__(self):
        return hash(self.__data)

class Subclass(Superclass):
    def __init__(self, data, more_data):
        super().__init__(data)
        self.__more_data = more_data

    def __hash__(self):
        # Just a guess...
        return hash(super()) + 31 * hash(self.__more_data)
To simplify this question, please assume self.__data and self.__more_data are simple, hashable data, such as str or int.
The easiest way to produce good hashes is to put your values in a standard hashable Python container, then hash that. This includes combining hashes in subclasses. I'll explain why, and then how.
Base requirements
First things first:
If two objects test as equal, then they MUST have the same hash value
Objects that have a hash MUST produce the same hash over time.
Only when you follow those two rules can your objects safely be used in dictionaries and sets. The hash not changing is what keeps dictionaries and sets from breaking, as they use the hash to pick a storage location, and won't be able to locate the object again given another object that tests equal if the hash changed.
Note that it doesn’t even matter if the two objects are of different types; True == 1 == 1.0 so all have the same hash and will all count as the same key in a dictionary.
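A quick way to see this:

d = {}
d[1] = 'int'
d[True] = 'bool'
d[1.0] = 'float'
print(d)  # {1: 'float'} -- all three keys collide into a single entry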
What makes a good hash value
You'd want to combine the components of your object value in ways that will produce, as much as possible, different hashes for different values. That includes things like ordering and specific meaning, so that two attributes that represent different aspects of your value, but that can hold the same type of Python objects, still result in different hashes, most of the time.
Note that it's fine if two objects that represent different values (won't test equal) have equal hashes. Reusing a hash value won't break sets or dictionaries. However, if a lot of different object values produce equal hashes then that reduces their efficiency, as you increase the likelihood of collisions. Collisions require collision resolution and collision resolution takes more time, so much so that you can mount denial-of-service attacks on servers with predictable hashing implementations (*).
So you want a nice wide spread of possible hash values.
Pitfalls to watch out for
The documentation for the object.__hash__ method includes some advice on how to combine values:
The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
but only using XOR will not produce good hash values, not when the values whose hashes you XOR together can be of the same type but have different meaning depending on the attribute they've been assigned to. To illustrate with an example:
>>> class Foo:
...     def __init__(self, a, b):
...         self.a = a
...         self.b = b
...     def __hash__(self):
...         return hash(self.a) ^ hash(self.b)
...
>>> hash(Foo(42, 'spam')) == hash(Foo('spam', 42))
True
Because the hashes for self.a and self.b were just XOR-ed together, we got the same hash value for either order, effectively halving the number of usable hashes. Do so with more attributes and you cut the number of unique hashes down rapidly. So you may want to include a bit more information in the hash about each attribute, if the same values can be used in different elements that make up the hash.
Next, know that while Python integers are unbounded, hash values are not. That is to say, hash values have a finite range. From the same documentation:
Note: hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds.
This means that if you used addition or multiplication or other operations that increase the number of bits needed to store the hash value, you will end up losing the upper bits and so reduce the number of different hash values again.
Next, if you combine multiple hashes with XOR that already have a limited range, chances are you end up with an even smaller number of possible hashes. Try XOR-ing the hashes of 1000 random integers in the range 0-10, for an extreme example.
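A quick way to run that experiment (hash(n) == n for small non-negative ints, so every XOR result stays within a handful of bits):

import random
from functools import reduce
from operator import xor

# 100 independent runs, each XOR-ing the hashes of 1000 small ints;
# the results can only ever land on a few distinct values (0..15 here)
results = {reduce(xor, (hash(random.randint(0, 10)) for _ in range(1000)))
           for _ in range(100)}
print(len(results))  # a small number, despite 100 runs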
Hashing, the easy way
Python developers have long since wrestled with the above pitfalls, and solved it for the standard library types. Use this to your advantage. Put your values in a tuple, then hash that tuple.
Python tuples use a simplified version of the xxHash algorithm to capture order information and to ensure a broad range of hash values. So for different attributes, you can capture the different meanings by giving them different positions in a tuple, then hashing the tuple:
def __hash__(self):
    return hash((self.a, self.b))
This ensures you get unique hash values for unique orderings.
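Compare with the XOR pitfall above:

# unlike XOR, the tuple hash is order-sensitive
print(hash((42, 'spam')) == hash(('spam', 42)))  # False (no collision expected)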
If you are subclassing something, put the hash of the parent implementation into one of the tuple positions:
def __hash__(self):
    return hash((super().__hash__(), self.__more_data))
Hashing a hash value does reduce it to a 60-bit or 30-bit value (on 64-bit or 32-bit platforms, respectively), but that's not a big problem when combined with other values in a tuple. If you are really concerned about this, put None in the tuple as a placeholder and XOR the parent hash (so super().__hash__() ^ hash((None, self.__more_data))). But this is overkill, really.
If you have multiple values whose relative order doesn't matter, and don't want to XOR these all together one by one, consider using a frozenset() object for fast processing, combined with a collections.Counter() object if values are not meant to be unique. The frozenset() hash operation accounts for small hash ranges by reshuffling the bits in hashes first:
# unordered collection hashing
from collections import Counter
hash(frozenset(Counter(...).items()))
As always, all values in the tuple or frozenset() must be hashable themselves.
Consider using dataclasses
For most objects you write __hash__ functions for, you actually want to be using a dataclass generated class:
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Foo:
    a: Union[int, str]
    b: Union[int, str]
Dataclasses are given a sane __hash__ implementation when frozen=True or unsafe_hash=True, using a tuple() of all the field values.
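Continuing the Foo dataclass above, usage then looks like:

# frozen=True makes instances immutable and generates __eq__ and __hash__
# from the field values, so equal instances hash equally
foo1 = Foo(42, 'spam')
foo2 = Foo(42, 'spam')
print(foo1 == foo2)              # True
print(hash(foo1) == hash(foo2))  # True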
(*) Python protects your code against such hash collision attacks by using a process-wide random hash seed to hash strings, bytes and datetime objects.
The python documentation suggests that you use xor to combine hashes:
The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
I'd also recommend xor over addition and multiplication because of this:
Note
hash() truncates the value returned from an object’s custom __hash__() method to the size of a Py_ssize_t. This is typically 8 bytes on 64-bit builds and 4 bytes on 32-bit builds. If an object’s __hash__() must interoperate on builds of different bit sizes, be sure to check the width on all supported builds. An easy way to do this is with python -c "import sys; print(sys.hash_info.width)"
This documentation is the same for python 2.7 and python 3.4, by the way.
A note on symmetry and xoring items with themselves.
As pointed out in the comments, XOR is symmetric, so order of operations disappears. The XOR of two identical elements is also zero. So if that's not desired, mix in some rotations or shifts, or, even better, use this solution's suggestion of taking the hash of a tuple of the identifying elements. If you don't want to preserve order, consider using the frozenset.
Instead of combining multiple strings together, use tuples as they are hashable in python.
from typing import Tuple

t: Tuple[str, str, int] = ('Field1', 'Field2', 33)
print(t.__hash__())
This will make the code also easier to read.
For anyone reading this, XORing hashes is a bad idea because it is possible for a particular sequence of duplicate hash values to XOR together and effectively remove an element from the hash set.
For instance:
(hash('asd') ^ hash('asd') ^ hash('derp')) == hash('derp')
and even:
(hash('asd') ^ hash('derp') ^ hash('asd')) == hash('derp')
So if you're using this technique to figure out if a certain set of values is in the combined hash, where you could potentially have duplicate values added to the hash, using XOR can result in that value's removal from the set. Instead you should consider OR, which has the same property of avoiding unbounded integer growth that the previous poster mentioned, but ensures duplicates are not removed.
(hash('asd') | hash('asd') | hash('derp')) != hash('derp')
If you're looking to explore this more, you should look up Bloom filters.
