Update:
As of CPython 3.6, dictionaries have a version (thank you pylang for showing this to me).
If they added the same version to list and made it public, all 3 asserts from my original post would pass! It would definitely meet my needs. Their implementation differs from what I envisioned, but I like it.
As it is, I don't feel I can use dictionary version:
It isn't public. Jake Vanderplas shows how to expose it in a post, but he cautions: definitely not code you should use for any purpose beyond simply having fun. I agree with his reasons.
In all of my use cases, the data is conceptually arrays of elements each of which has the same structure. A list of tuples is a natural fit. Using a dictionary would make the code less natural and probably more cumbersome.
Does anyone know if there are plans to add version to list?
Are there plans to make it public?
If there are plans to add version to list and make it public, I would feel awkward putting forward an incompatible VersionedList now. I would just implement the bare minimum I need and get by.
Original post below
Turns out that many of the times I wanted an immutable list, a VersionedList would have worked almost as well (sometimes even better).
Has anyone implemented a versioned list?
Is there a better, more Pythonic, concept that meets my needs? (See motivation below.)
What I mean by a versioned list is:
A class that behaves like a list
Any change to an instance or elements in the instance results in instance.version() being updated. So, if alist is a normal list:
a = VersionedList(alist)
a_version = a.version()
change(a)
assert a_version != a.version()
reverse_last_change(a)
If a list was hashable, hash() would achieve the above and meet all the needs identified in the motivation below. We need to define 'version()' in a way that doesn't have all of the same problems as 'hash()'.
If identical data in two lists is highly unlikely to ever happen except at initialization, we aren't going to have a reason to test for deep equality. From (https://docs.python.org/3.5/reference/datamodel.html#object.hash) The only required property is that objects which compare equal have the same hash value. If we don't impose this requirement on 'version()', it seems likely that 'version()' won't have all of the same problems that makes lists unhashable. So unlike hash, identical contents doesn't mean the same version
#contents of 'a' are now identical to original, but...
assert a_version != a.version()
b = VersionedList(alist)
c = VersionedList(alist)
assert b.version() != c.version()
For VersionList, it would be good if any attempt to modify the result of __get__ automatically resulted in a copy instead of modifying the underlying implementation data. I think that the only other option would be to have __get__ always return a copy of the elements, and this would be very inefficient for all of the use cases I can think of. I think we need to restrict the elements to immutable objects (deeply immutable, for example: exclude tuples with list elements). I can think of 3 ways to achieve this:
Only allow elements that can't contain mutable elements (int, str, etc are fine, but exclude tuples). (This is far too limiting for my cases)
Add code to __init__, __set__, etc to traverse inputs to deeply check for mutable sub-elements. (expensive, any way to avoid this?)
Also allow more complex elements, but require that they are deeply immutable. Perhaps require that they expose a deeply_immutable attribute. (This turns out to be easy for all the use cases I have)
Motivation:
If I am analyzing a dataset, I often have to perform multiple steps that return large datasets (note: since the dataset is ordered, it is best represented by a List not a set).
If at the end of several steps (ex: 5) it turns out that I need to perform different analysis (ex: back at step 4), I want to know that the dataset from step 3 hasn't accidentally been changed. That way I can start at step 4 instead of repeating steps 1-3.
I have functions (control-points, first-derivative, second-derivative, offset, outline, etc) that depend on and return array-valued objects (in the linear algebra sense). The base 'array' is knots.
control-points() depends on: knots, algorithm_enum
first-derivative() depends on: control-points(), knots
offset() depends on: first-derivative(), control-points(), knots, offset_distance
outline() depends on: offset(), end_type_enum
If offset_distance changes, I want to avoid having to recalculate first-derivative() and control-points(). To avoid recalculation, I need to know that nothing has accidentally changed the resultant 'arrays'.
If 'knots' changes, I need to recalculate everything and not depend on the previous resultant 'arrays'.
To achieve this, knots and all of the 'array-valued' objects could be VersionedList.
FYI: I had hoped to take advantage of an efficient class like numpy.ndarray. In most of my use cases, the elements logically have structure. Having to mentally keep track of multi-dimensions of indexes meant implementing and debugging the algorithms was many times more difficult with ndarray. An implementation based on lists of namedtuples of namedtuples turned out to be much more sustainable.
Private dicts in 3.6
In Python 3.6, dictionaries are now private (PEP 509) and compact (issue 27350), which track versions and preserve order respectively. These features are presently true when using the CPython 3.6 implementation. Despite the challenge, Jake VanderPlas demonstrates in his blog post a detailed demonstration of exposing this versioning feature from CPython within normal Python. We can use his approach to:
determine when a dictionary has been updated
preserve the order
Example
import numpy as np
d = {"a": np.array([1,2,3]),
"c": np.array([1,2,3]),
"b": np.array([8,9,10]),
}
for i in range(3):
print(d.get_version()) # monkey-patch
# 524938
# 524938
# 524938
Notice the version number does not change until the dictionary is updated, as shown below:
d.update({"c": np.array([10, 11, 12])})
d.get_version()
# 534448
In addition, the insertion order is preserved (the following was tested in restarted sessions of Python 3.5 and 3.6):
list(d.keys())
# ['a', 'c', 'b']
You may be able to take advantage of this new dictionary behavior, saving you from implementing a new datatype.
Details
For those interested, the latter get_version()is a monkey-patched method for any dictionary, implemented in Python 3.6 using the following modified code derived from Jake VanderPlas' blog post. This code was run prior to calling get_version().
import types
import ctypes
import sys
assert (3, 6) <= sys.version_info < (3, 7) # valid only in Python 3.6
py_ssize_t = ctypes.c_ssize_t
# Emulate the PyObjectStruct from CPython
class PyObjectStruct(ctypes.Structure):
_fields_ = [('ob_refcnt', py_ssize_t),
('ob_type', ctypes.c_void_p)]
# Create a DictStruct class to wrap existing dictionaries
class DictStruct(PyObjectStruct):
_fields_ = [("ma_used", py_ssize_t),
("ma_version_tag", ctypes.c_uint64),
("ma_keys", ctypes.c_void_p),
("ma_values", ctypes.c_void_p),
]
def __repr__(self):
return (f"DictStruct(size={self.ma_used}, "
f"refcount={self.ob_refcnt}, "
f"version={self.ma_version_tag})")
#classmethod
def wrap(cls, obj):
assert isinstance(obj, dict)
return cls.from_address(id(obj))
assert object.__basicsize__ == ctypes.sizeof(PyObjectStruct)
assert dict.__basicsize__ == ctypes.sizeof(DictStruct)
# Code for monkey-patching existing dictionaries
class MappingProxyStruct(PyObjectStruct):
_fields_ = [("mapping", ctypes.POINTER(DictStruct))]
#classmethod
def wrap(cls, D):
assert isinstance(D, types.MappingProxyType)
return cls.from_address(id(D))
assert types.MappingProxyType.__basicsize__ == ctypes.sizeof(MappingProxyStruct)
def mappingproxy_setitem(obj, key, val):
"""Set an item in a read-only mapping proxy"""
proxy = MappingProxyStruct.wrap(obj)
ctypes.pythonapi.PyDict_SetItem(proxy.mapping,
ctypes.py_object(key),
ctypes.py_object(val))
mappingproxy_setitem(dict.__dict__,
'get_version',
lambda self: DictStruct.wrap(self).ma_version_tag)
Related
Tl;dr is bold-faced text.
I'm working with an image dataset that comes with boolean "one-hot" image annotations (Celeba to be specific). The annotations encode facial features like bald, male, young. Now I want to make a custom one-hot list (to test my GAN model). I want to provide a literate interface. I.e., rather than specifying features[12]=True knowing that 12 - counting from zero - corresponds to the male feature, I want something like features[male]=True or features.male=True.
Suppose the header of my .txt file is
Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young
and I want to codify Young, Bald, and Chubby. The expected output is
[ 0. 0. 0. 1. 0. 1. 0. 0. 1.]
since Bald is the fourth entry of the header, Chubby is the sixth, and so on. What is the clearest way to do this without expecting a user to know Bald is the fourth entry, etc.?
I'm looking for a Pythonic way, not necessarily the fastest way.
Ideal Features
In rough order of importance:
A way to accomplish my stated goal that is already standard in the Python community will take precedence.
A user/programmer should not need to count to an attribute in the .txt header. This is the point of what I'm trying to design.
A user should not be expected to have non-standard libraries like aenum.
A user/programmer should not need to reference the .txt header for attribute names/available attributes. One example: if a user wants to specify the gender attribute but does not know whether to use male or female, it should be easy to find out.
A user/programmer should be able to find out the available attributes via documentation (ideally generated by Sphinx api-doc). That is, the point 4 should be possible reading as little code as possible. Attribute exposure with dir() sufficiently satisfies this point.
The programmer should find the indexing tool natural. Specifically, zero-indexing should be preferred over subtracting from one-indexing.
Between two otherwise completely identical solutions, one with better performance would win.
Examples:
I'm going to compare and contrast the ways that immediately came to my mind. All examples use:
import numpy as np
header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
"Bald Bangs Chubby Male Wearing_Necktie Young")
NUM_CLASSES = len(header.split()) # 9
1: Dict Comprehension
Obviously we could use a dictionary to accomplish this:
binary_label = np.zeros([NUM_CLASSES])
classes = {head: idx for (idx, head) in enumerate(header.split())}
binary_label[[classes["Young"], classes["Bald"], classes["Chubby"]]] = True
print(binary_label)
For what it's worth, this has the fewest lines of code and is the only one that doesn't rely on a standard library over builtins. As for negatives, it isn't exactly self-documenting. To see the available options, you must print(classes.keys()) - it's not exposed with dir(). This borders on not satisfying feature 5 because it requires a user to know classes is a dict to exposure features AFAIK.
2: Enum:
Since I'm learning C++ right now, Enum is the first thing that came to mind:
import enum
binary_label = np.zeros([NUM_CLASSES])
Classes = enum.IntEnum("Classes", header)
features = [Classes.Young, Classes.Bald, Classes.Chubby]
zero_idx_feats = [feat-1 for feat in features]
binary_label[zero_idx_feats] = True
print(binary_label)
This gives dot notation and the image options are exposed with dir(Classes). However, enum uses one-indexing by default (the reason is documented). The work-around makes me feel like enum is not the Pythonic way to do this, and entirely fails to satisfy feature 6.
3: Named Tuple
Here's another one out of the standard Python library:
import collections
binary_label = np.zeros([NUM_CLASSES])
clss = collections.namedtuple(
"Classes", header)._make(range(NUM_CLASSES))
binary_label[[clss.Young, clss.Bald, clss.Chubby]] = True
print(binary_label)
Using namedtuple, we again get dot notation and self-documentation with dir(clss). But, the namedtuple class is heavier than enum. By this I mean, namedtuple has functionality I do not need. This solution appears to be a leader among my examples, but I do not know if it satisfies feature 1 or if an alternative could "win" via feature 7.
4: Custom Enum
I could really break my back:
binary_label = np.zeros([NUM_CLASSES])
class Classes(enum.IntEnum):
Arched_Eyebrows = 0
Attractive = 1
Bags_Under_Eyes = 2
Bald = 3
Bangs = 4
Chubby = 5
Male = 6
Wearing_Necktie = 7
Young = 8
binary_label[
[Classes.Young, Classes.Bald, Classes.Chubby]] = True
print(binary_label)
This has all the advantages of Ex. 2. But, it comes with obvious the obvious drawbacks. I have to write out all the features (there's 40 in the real dataset) just to zero-index! Sure, this is how to make an enum in C++ (AFAIK), but it shouldn't be necessary in Python. This is a slight failure on feature 6.
Summary
There are many ways to accomplish literate zero-indexing in Python. Would you provide a code snippet of how you would accomplish what I'm after and tell me why your way is right?
(edit:) Or explain why one of my examples is the right tool for the job?
Status Update:
I'm not ready to accept an answer yet in case anyone wants to address the following feedback/update, or any new solution appears. Maybe another 24 hours? All the responses have been helpful, so I upvoted everyone's so far. You may want to look over this repo I'm using to test solutions. Feel free to tell me if my following remarks are (in)accurate or unfair:
zero-enum:
Oddly, Sphinx documents this incorrectly (one-indexed in docs), but it does document it! I suppose that "issue" doesn't fail any ideal feature.
dotdict:
I feel that Map is overkill, but dotdict is acceptable. Thanks to both answerers that got this solution working with dir(). However, it doesn't appear that it "works seamlessly" with Sphinx.
Numpy record:
As written, this solution takes significantly longer than the other solutions. It comes in at 10x slower than a namedtuple (fastest behind pure dict) and 7x slower than standard IntEnum (slowest behind numpy record). That's not drastic at current scale, nor a priority, but a quick Google search indicates np.in1d is in fact slow. Let's stick with
_label = np.zeros([NUM_CLASSES])
_label[[header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]] = True
unless I've implemented something wrong in the linked repo. This brings the execution speed into a range that compares with the other solutions. Again, no Sphinx.
namedtuple (and rassar's critiques)
I'm not convinced of your enum critique. It seems to me that you believe I'm approaching the problem wrong. It's fine to call me out on that, but I don't see how using the namedtuple is fundamentally different from "Enum [which] will provide separate values for each constant." Have I misunderstood you?
Regardless, namedtuple appears in Sphinx (correctly numbered, for what it's worth). On the Ideal Features list, this chalks up identically to zero-enum and profiles ahead of zero-enum.
Accepted Rationale
I accepted the zero-enum answer because the answer gave me the best challenger for namedtuple. By my standards, namedtuple is marginally the best solution. But salparadise wrote the answer that helped me feel confident in that assessment. Thanks to all who answered.
How about a factory function to create a zero indexed IntEnum since that is the object that suits your needs, and Enum provides flexibility in construction:
from enum import IntEnum
def zero_indexed_enum(name, items):
# splits on space, so it won't take any iterable. Easy to change depending on need.
return IntEnum(name, ((item, value) for value, item in enumerate(items.split())))
Then:
In [43]: header = ("Arched_Eyebrows Attractive Bags_Under_Eyes "
...: "Bald Bangs Chubby Male Wearing_Necktie Young")
In [44]: Classes = zero_indexed_enum('Classes', header)
In [45]: list(Classes)
Out[45]:
[<Classes.Arched_Eyebrows: 0>,
<Classes.Attractive: 1>,
<Classes.Bags_Under_Eyes: 2>,
<Classes.Bald: 3>,
<Classes.Bangs: 4>,
<Classes.Chubby: 5>,
<Classes.Male: 6>,
<Classes.Wearing_Necktie: 7>,
<Classes.Young: 8>]
You can use a custom class which I like to call as DotMap or as mentioned here is this SO discussion as Map:
https://stackoverflow.com/a/32107024/2598661 (Map, longer complete version)
https://stackoverflow.com/a/23689767/2598661 (dotdict, shorter lighter version)
About Map:
It has the features of a dictionary since the input to a Map/DotMap is a dict. You can access attributes using features['male'].
Additionally you can access the attributes using dot i.e. features.male and the attributes will be exposed when you do dir(features).
It is only as heavy as it needs to be in order to enable the dot functionality.
Unlike namedtuple you don't need to pre-define it and you can add and remove keys willy nilly.
The Map function described in the SO question is not Python3 compatible because it uses iteritems(). Just replace it with items() instead.
About dotdict:
dotdict provides the same advantages of Map with the exception that it does not override the dir() method therefore you will not be able to obtain the attributes for documentation. #SigmaPiEpsilon has provided a fix for this here.
It uses the dict.get method instead of dict.__getitem__ therefore it will return None instead of throwing KeyError when you are access attributes that don't exist.
It does not recursively apply dotdict-iness to nested dicts therefore you won't be able to use features.foo.bar.
Here's the updated version of dotdict which solves the first two issues:
class dotdict(dict):
__getattr__ = dict.__getitem__ # __getitem__ instead of get
__setattr__ = dict.__setitem__
__delattr__ = dict.__delitem__
def __dir__(self): # by #SigmaPiEpsilon for documentation
return self.keys()
Update
Map and dotdict don't have the same behavior as pointed out by #SigmaPiEpsilon so I added separate descriptions for both.
Of your examples, 3 is the most pythonic answer to your question.
1, as you said, does not even answer your question, since the names are not explicit.
2 uses enums, which though being in the standard library are not pythonic and generally not used in these scenarios in Python.
(Edit): In this case you only really need two different constants - the target values and the other ones. An Enum will provide separate values for each constant, which is not what the goal of your program is and seems to be a roundabout way of approaching the problem.
4 is just not maintainable if a client wants to add options, and even as it is it's painstaking work.
3 uses well-known classes from the standard library in a readable and succinct way. Also, it does not have any drawbacks, as it is perfectly explicit. Being too "heavy" doesn't matter if you don't care about performance, and anyway the lag will be unnoticeable with your input size.
Your requirements if I understand correctly can be divided into two parts:
Access the position of header elements in the .txt by name in the most pythonic way possible and with minimum external dependencies
Enable dot access to the data structure containing the names of the headers to be able to call dir() and setup easy interface with Sphinx
Pure Python Way (no external dependencies)
The most pythonic way to solve the problem is of course the method using dictionaries (dictionaries are at the heart of python). Searching a dictionary through key is also much faster than other methods. The only problem is this prevents dot access. Another answer mentions the Map and dotdict as alternatives. dotdict is simpler but it only enable dot access, it will not help in the documentation aspect with dir() since dir() calls the __dir__() method which is not overridden in these cases. Hence it will only return the attributes of Python dict and not the header names. See below:
>>> class dotdict(dict):
... __getattr__ = dict.get
... __setattr__ = dict.__setitem__
... __delattr__ = dict.__delitem__
...
>>> somedict = {'a' : 1, 'b': 2, 'c' : 3}
>>> somedotdict = dotdict(somedict)
>>> somedotdict.a
1
>>> 'a' in dir(somedotdict)
False
There are two options to get around this problem.
Option 1: Override the __dir__() method like below. But this will only work when you call dir() on the instances of the class. To make the changes apply for the class itself you have to create a metaclass for the class. See here
#add this to dotdict
def __dir__(self):
return self.keys()
>>> somedotdictdir = dotdictdir(somedict)
>>> somedotdictdir.a
1
>>> dir(somedotdictdir)
['a', 'b', 'c']
Option 2: A second option which makes it much closer to user-defined object with attributes is to update the __dict__ attribute of the created object. This is what Map also uses. A normal python dict does not have this attribute. If you add this then you can call dir() to get attributes/keys and also all the additional methods/attributes of python dict. If you just want the stored attribute and values you can use vars(somedotdictdir) which is also useful for documentation.
class dotdictdir(dict):
def __init__(self, *args, **kwargs):
dict.__init__(self, *args, **kwargs)
self.__dict__.update({k : v for k,v in self.items()})
def __setitem__(self, key, value):
dict.__setitem__(self, key, value)
self.__dict__.update({key : value})
__getattr__ = dict.get #replace with dict.__getitem__ if want raise error on missing key access
__setattr__ = __setitem__
__delattr__ = dict.__delitem__
>>> somedotdictdir = dotdictdir(somedict)
>>> somedotdictdir
{'a': 3, 'c': 6, 'b': 4}
>>> vars(somedotdictdir)
{'a': 3, 'c': 6, 'b': 4}
>>> 'a' in dir(somedotdictdir)
True
Numpy way
Another option will be to use a numpy record array which allows dot access. I noticed in your code you are already using numpy. In this case too __dir__() has to be overrridden to get the attributes. This may result in faster operations (not tested) for data with lots of other numeric values.
>>> headers = "Arched_Eyebrows Attractive Bags_Under_Eyes Bald Bangs Chubby Male Wearing_Necktie Young".split()
>>> header_rec = np.array([tuple(range(len(headers)))], dtype = zip(headers, [int]*len(headers)))
>>> header_rec.dtype.names
('Arched_Eyebrows', 'Attractive', 'Bags_Under_Eyes', 'Bald', 'Bangs', 'Chubby', 'Male', 'Wearing_Necktie', 'Young')
>>> np.in1d(header_rec.item(), [header_rec[key].item() for key in ["Young", "Bald", "Chubby"]]).astype(int)
array([0, 0, 0, 1, 0, 1, 0, 0, 1])
In Python 3, you will need to use dtype=list(zip(headers, [int]*len(headers))) since zip became its own object.
USAGE CONTEXT ADDED AT END
I often want to operate on an abstract object like a list. e.g.
def list_ish(thing):
for i in xrange(0,len(thing)):
print thing[i]
Now this appropriate if thing is a list, but will fail if thing is a dict for example. what is the pythonic why to ask "do you behave like a list?"
NOTE:
hasattr('__getitem__') and not hasattr('keys')
this will work for all cases I can think of, but I don't like defining a duck type negatively, as I expect there could be cases that it does not catch.
really what I want is to ask.
"hey do you operate on integer indicies in the way I expect a list to do?" e.g.
thing[i], thing[4:7] = [...], etc.
NOTE: I do not want to simply execute my operations inside of a large try/except, since they are destructive. it is not cool to try and fail here....
USAGE CONTEXT
-- A "point-lists" is a list-like-thing that contains dict-like-things as its elements.
-- A "matrix" is a list-like-thing that contains list-like-things
-- I have a library of functions that operate on point-lists and also in an analogous way on matrix like things.
-- for example, From the users point of view destructive operations like the "spreadsheet-like" operations "column-slice" can operate on both matrix objects and also on point-list objects in an analogous way -- the resulting thing is like the original one, but only has the specified columns.
-- since this particular operation is destructive it would not be cool to proceed as if an object were a matrix, only to find out part way thru the operation, it was really a point-list or none-of-the-above.
-- I want my 'is_matrix' and 'is_point_list' tests to be performant, since they sometimes occur inside inner loops. So I would be satisfied with a test which only investigated element zero for example.
-- I would prefer tests that do not involve construction of temporary objects, just to determine an object's type, but maybe that is not the python way.
in general I find the whole duck typing thing to be kinda messy, and fraught with bugs and slowness, but maybe I dont yet think like a true Pythonista
happy to drink more kool-aid...
One thing you can do, that should work quickly on a normal list and fail on a normal dict, is taking a zero-length slice from the front:
try:
thing[:0]
except TypeError:
# probably not list-like
else:
# probably list-like
The slice fails on dicts because slices are not hashable.
However, str and unicode also pass this test, and you mention that you are doing destructive edits. That means you probably also want to check for __delitem__ and __setitem__:
def supports_slices_and_editing(thing):
if hasattr(thing, '__setitem__') and hasattr(thing, '__delitem__'):
try:
thing[:0]
return True
except TypeError:
pass
return False
I suggest you organize the requirements you have for your input, and the range of possible inputs you want your function to handle, more explicitly than you have so far in your question. If you really just wanted to handle lists and dicts, you'd be using isinstance, right? Maybe what your method does could only ever delete items, or only ever replace items, so you don't need to check for the other capability. Document these requirements for future reference.
When dealing with built-in types, you can use the Abstract Base Classes. In your case, you may want to test against collections.Sequence or collections.MutableSequence:
if isinstance(your_thing, collections.Sequence):
# access your_thing as a list
This is supported in all Python versions after (and including) 2.6.
If you are using your own classes to build your_thing, I'd recommend that you inherit from these abstract base classes as well (directly or indirectly). This way, you can ensure that the sequence interface is implemented correctly, and avoid all the typing mess.
And for third-party libraries, there's no simple way to check for a sequence interface, if the third-party classes didn't inherit from the built-in types or abstract classes. In this case you'll have to check for every interface that you're going to use, and only those you use. For example, your list_ish function used __len__ and __getitem__, so only check whether these two methods exist. A wrong behavior of __getitem__ (e.g. a dict) should raise an exception.
Perhaps their is no ideal pythonic answer here, so I am proposing a 'hack' solution, but don't know enough about the class structure of python to know if I am getting this right:
def is_list_like(thing):
return hasattr(thing, '__setslice__')
def is_dict_like(thing):
return hasattr(thing, 'keys')
My reduce goals here are to simply have performant tests that will:
(1) never call a dict-thing, nor a string-like-thing a list List item
(2) returns the right answer for python types
(3) will return the right answer if someone implement a "full" set of core method for a list/dict
(4) is fast (ideally does not allocate objects during the test)
EDIT: Incorporated ideas from #DanGetz
The question arose when answering to another SO question (there).
When I iterate several times over a python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale of changing the order ? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it inbetween the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call inbetween could produce very hard to find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python) integers less than the machine word size (32 bit or 64 bit) hash to themself, but text strings, bytes strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
shuffle(data)
lb = list(frozenset(data))
if lb != la:
print(''.join(data), ''.join(lb))
break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python
program repeatedly (not random, not
input dependent), will I get the same
ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object) :
def __init__(self,val) :
self.val = val
def __repr__(self) :
return str(self.val)
x = set()
for y in range(500) :
x.add(Foo(y))
print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent so I should say that I'm running the macports Python2.6 on snow-leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will somethimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
The answer is simply a NO.
Python set operation is NOT stable.
I did a simple experiment to show this.
The code:
import random
random.seed(1)
x=[]
class aaa(object):
def __init__(self,a,b):
self.a=a
self.b=b
for i in range(5):
x.append(aaa(random.choice('asf'),random.randint(1,4000)))
for j in x:
print(j.a,j.b)
print('====')
for j in set(x):
print(j.a,j.b)
Run this for twice, you will get this:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.
As a way to get used to python, I am trying to translate some of my code to python from Autohotkey_L.
I am immediately running into tons of choices for collection objects. Can you help me figure out a built in type or a 3rd party contributed type that has as much as possible, the functionality of the AutoHotkey_L object type and its methods.
AutoHotkey_L Objects have features of a python dict, list, and a class instance.
I understand that there are tradeoffs for space and speed, but I am just interested in functionality rather than optimization issues.
Don't write Python as <another-language>. Write Python as Python.
The data structure should be chosen just to have the minimal ability you need to use.
list — an ordered sequence of elements, with 1 flexible end.
collections.deque — an ordered sequence of elements, with 2 flexible ends (e.g. a queue).
set / frozenset — an unordered sequence of unique elements.
collections.Counter — an unordered sequence of non-unique elements.
dict — an unordered key-value relationship.
collections.OrderedDict — an ordered key-value relationship.
bytes / bytearray — a list of bytes.
array.array — a homogeneous list of primitive types.
Looking at the interface of Object,
dict would be the most suitable for finding a value by key
collections.OrderedDict would be the most suitable for the push/pop stuff.
when you need MinIndex / MaxIndex, where a sorted key-value relationship (e.g. red black tree) is required. There's no such type in the standard library, but there are 3rd party implementations.
It would be impossible to recommend a particular class without knowing how you intend on using it. If you are using this particular object as an ordered sequence where elements can be repeated, then you should use a list; if you are looking up values by their key, then use a dictionary. You will get very different algorithmic runtime complexity with the different data types. It really does not take that much time to determine when to use which type.... I suggest you give it some further consideration.
If you really can't decide, though, here is a possibility:
class AutoHotKeyObject(object):
def __init__(self):
self.list_value = []
self.dict_value = {}
def getDict(self):
return self.dict_value
def getList(self):
return self.list_value
With the above, you could use both the list and dictionary features, like so:
obj = AutoHotKeyObject()
obj.getList().append(1)
obj.getList().append(2)
obj.getList().append(3)
print obj.getList() # Prints [1, 2, 3]
obj.getDict()['a'] = 1
obj.getDict()['b'] = 2
print obj.getDict() # Prints {'a':1, 'b':2}
I'm currently implementing a complex microbial food-web in Python using SciPy.integrate.ode. I need the ability to easily add species and reactions to the system, so I have to code up something quite general. My scheme looks something like this:
class Reaction(object):
def __init__(self):
#stuff common to all reactions
def __getReactionRate(self, **kwargs):
raise NotImplementedError
... Reaction subclasses that
... implement specific types of reactions
class Species(object):
def __init__(self, reactionsDict):
self.reactionsDict = reactionsDict
#reactionsDict looks like {'ReactionName':reactionObject, ...}
#stuff common to all species
def sumOverAllReactionsForThisSpecies(self, **kwargs):
#loop over all the reactions and return the
#cumulative change in the concentrations of all solutes
...Species subclasses where for each species
... are defined and passed to the superclass constructor
class FermentationChamber(object):
def __init__(self, speciesList, timeToSolve, *args):
#do initialization
def step(self):
#loop over each species, which in turn loops
#over each reaction inside it and return a
#cumulative dictionary of total change for each
#solute in the whole system
if __name__==__main__:
f = FermentationChamber(...)
o = ode(...) #initialize ode solver
while o.successful() and o.t<timeToSolve:
o.integrate()
#process o.t and o.y (o.t contains the time points
#and o.y contains the solution matrix)
So, the question is, when I iterate over the dictionaries in Species.sumOverAllReactionsForThisSpecies() and FermentationChamber.step(), is the iteration order of the dictionaries guaranteed to be the same if no elements are added or removed from the dictionaries between the first and the last iteration? That is, can I assume that the order of the numpy array created at each iteration from the dictionary will not vary? For example, if a dictionary has the format {'Glucose':10, 'Fructose':12}, if an Array created from this dictionary will always have the same order (it doesn't matter what that order is, as long as it's deterministic).
Sorry for the mega-post, I just wanted to let you know where I'm coming from.
Yes, the same order is guaranteed if it is not modified.
See the docs here.
Edit:
Regarding if changing the value (but not adding/removing a key) will affect the order, this is what the comments in the C-source says:
/* CAUTION: PyDict_SetItem() must guarantee that it won't resize the
* dictionary if it's merely replacing the value for an existing key.
* This means that it's safe to loop over a dictionary with PyDict_Next()
* and occasionally replace a value -- but you can't insert new keys or
* remove them.
*/
It seems that its not an implementation detail, but a requirement of the language.
It depends on the Python version.
Python 3.7+
Dictionary iteration order is guaranteed to be in order of insertion.
Python 3.6
Dictionary iteration order happens to be in order of insertion in CPython implementation, but it is not a documented guarantee of the language.
Prior versions
Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond. https://docs.python.org/2/library/stdtypes.html#dictionary-view-objects
The -R option
Python 2.6 added the -R option as (insufficient, it turned out) protection against hash flooding attacks. In Python 2 turning this on affected dictionary iteration order (the properties specified above were still maintained, but the specific iteration order would be different from one execution of the program to the next). For this reason, the option was off by default.
In Python 3, the -R option is on by default since Python 3.3, which adds nondeterminism to dict iteration order, as every time Python interpreter is run, the seed value for hash computation is generated randomly. This situation lasts until CPython 3.6 which changed dict implementation in a way so that the hash values of entries do not influence iteration order.
Source
Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6. https://docs.python.org/3.8/library/stdtypes.html
What’s New In Python 3.6: The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon (this may change in the future, but it is desired to have this new dict implementation in the language for a few releases before changing the language spec to mandate order-preserving semantics for all current and future Python implementations; this also helps preserve backwards-compatibility with older versions of the language where random iteration order is still in effect, e.g. Python 3.5). https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-compactdict
Provided no modifications are made to the dictionary, the answer is yes. See the docs here.
However, dictionaries are unordered by nature in Python. In general, it's not the best practice to rely on dictionaries for sensitive sorted data.
An example of an a more robust solution would be Django's SortedDict data structure.
If you want the order to be consistent, I would do something to force a particular order. Although you might be able to convince yourself that the order is guaranteed, and you might be right, it seems fragile to me, and it will be mysterious to other developers.
For example, you emphasize always in your question. Is it important that it be the same order in Python 2.5 and 2.6? 2.6 and 3.1? CPython and Jython? I wouldn't count on those.
I also would recommend not relying on the fact the dictionaries order is non-random.
If you want a built in solution to sorting you dictionary read http://www.python.org/dev/peps/pep-0265/
Here is the most relevant material:
This PEP is rejected because the need for it has been largely
fulfilled by Py2.4's sorted() builtin function:
>>> sorted(d.iteritems(), key=itemgetter(1), reverse=True)
[('b', 23), ('d', 17), ('c', 5), ('a', 2), ('e', 1)]
or for just the keys:
>>> sorted(d, key=d.__getitem__, reverse=True)
['b', 'd', 'c', 'a', 'e']
Also, Python 2.5's heapq.nlargest() function addresses the common use
case of finding only a few of the highest valued items:
>>> nlargest(2, d.iteritems(), itemgetter(1))
[('b', 23), ('d', 17)]
Python 3.1 has a collections.OrderedDict class that can be used for this purpose. It's very efficient, too: "Big-O running times for all methods are the same as for regular dictionaries."
The code for OrderedDict itself is compatible with Python 2.x, though some inherited methods (from the _abcoll module) do use Python 3-only features. However, they can be modified to 2.x code with minimal effort.