Limit for hashing nested tuples?


A few lines of code that demonstrate what I'm asking are:
>>> x = ()
>>> for i in range(1000000):
...     x = (x,)
...
>>> x.__hash__()
=============================== RESTART: Shell ===============================
The 1000000 may be excessive, but it demonstrates that there is some form of limit when hashing nested tuples (and I assume other objects). Just to clarify, I didn't restart the shell, it did that automatically when I attempted the hash.
What I want to know is what is this limit, why does it happen (and why did it not raise an error), and is there a way around it (so that I can put tuples like this into sets or dictionaries).

The __hash__ method of tuple calculates the hash of each item in the tuple, so in your case it behaves like a recursive function. A deeply nested tuple therefore leads to very deep recursion, and at some point there is probably not enough memory on the stack to go "one level deeper". That's also why the shell restarts without a Python exception: the recursion is done in C (at least for CPython). You could use a debugger, e.g. gdb, to get more information about the crash.
There is no global hard limit. The limit depends on your system (e.g. how much stack is available), on how many function calls are involved internally, and on how much stack each of those calls requires.
However, that could qualify as a bug in the implementation, so it would be a good idea to report it on the CPython issue tracker.
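As for a way around it, here is a minimal sketch of one option, assuming you are free to wrap the tuple in your own class: compute the hash with an explicit stack instead of relying on tuple.__hash__. The names iterative_hash and HashedTuple are made up for this example, and the mixing formula is an arbitrary choice, not what CPython uses.

```python
def iterative_hash(obj):
    """Hash arbitrarily nested tuples with an explicit stack instead of recursion."""
    mask = (1 << 64) - 1
    result = 0x345678
    stack = [obj]
    while stack:
        item = stack.pop()
        if isinstance(item, tuple):
            # Mix in the length so differently shaped nestings give different token streams.
            token = hash(("tuple", len(item)))
            # Push children reversed so they are visited left to right.
            stack.extend(reversed(item))
        else:
            token = hash(item)
        result = ((result * 1000003) ^ token) & mask
    return result


class HashedTuple:
    """Hypothetical wrapper that caches the iterative hash for use in sets and dicts."""

    def __init__(self, value):
        self.value = value
        self._hash = iterative_hash(value)

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        # Identity comparison only; a structural comparison would need its own loop.
        return isinstance(other, HashedTuple) and self.value is other.value


x = ()
for _ in range(100_000):
    x = (x,)

seen = {HashedTuple(x)}
print(HashedTuple(x) in seen)
```

Note that the wrapper only sidesteps hashing. Comparing two such tuples with == or printing them still walks the nesting, so those operations would need similar treatment.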

Related

Python recursive algorithm segmentation fault

I'm pretty bad with recursion as it is, but this algorithm naturally seems like it's best done recursively. Basically, I have a list of all the function calls made in a C program, throughout multiple files. This list is unordered. My recursive algorithm attempts to make a tree of all the functions called, starting from the main method.
This works perfectly fine for smaller programs, but when I tried it out with larger ones I got a segmentation fault. I read that the issue might be that I'm exceeding the C stack limit, since I had already raised the recursion limit in Python.
Would appreciate some help here, thanks.
functions = set containing a list of function calls and their info, type Function. The data in node is of type Function.
# dataclass
class Function:
    name : str
    file : str
    id : int
    calls : set
    ....
Here's the algorithm.
def order_functions(node, functions, defines):
    calls = set()
    # Checking if the called function is user-defined
    for call in node.data.calls:
        if call in defines:
            calls.add(call)
    node.data.calls = calls
    if len(calls) == 0:
        return node
    for call in node.data.calls:
        child = Node(next((f for f in functions if f.name == call), None))
        node.add_child(child)
        Parser.order_functions(child, functions, defines)
    return node

If you exceed the predefined limit on the call stack size, the best idea probably is to rewrite an iterative version of your program. If you have no idea on how deeply your recursion will go, then don't use recursion.
More information here, and maybe if you need to implement an iterative version you can get inspiration from this post.
The main point here is that Python doesn't perform any tail recursion elimination. Recursive functions therefore cannot be relied on for inputs whose nesting depth is unknown or unbounded.
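To give an idea of what an iterative version could look like, here is a rough sketch with an explicit stack. It is not a drop-in replacement for the code in the question: the Node class, the build_call_tree name and the calls_by_name dictionary (function name mapped to the names it calls) are all assumptions made for the example. It also skips calls that would revisit a function already on the current path, on the guess that recursive C functions may be part of the problem.

```python
class Node:
    """Hypothetical stand-in for the Node class used in the question."""

    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, child):
        self.children.append(child)


def build_call_tree(root_name, calls_by_name):
    """Build the call tree without recursion.

    calls_by_name maps a function name to an iterable of the names it calls.
    """
    root = Node(root_name)
    # Each stack entry holds a node plus the names on the path from the root to it.
    stack = [(root, (root_name,))]
    while stack:
        node, path = stack.pop()
        for callee in calls_by_name.get(node.data, ()):
            if callee in path:
                continue  # skip (mutually) recursive calls, which would never terminate
            child = Node(callee)
            node.add_child(child)
            stack.append((child, path + (callee,)))
    return root


graph = {"main": ["parse", "run"], "parse": ["main", "lex"], "run": [], "lex": []}
tree = build_call_tree("main", graph)
print([child.data for child in tree.children])
```

The depth of the tree is now limited by available memory rather than by the interpreter or C stack. Carrying the whole path in every stack entry is simple but costs time proportional to the depth at each step, so a very deep graph may want a smarter cycle check.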

Which sequence type is better for a comparison and why? (Python)

I have a condition that compares one object to several others, like so:
if 'a' in ('a','b','c','e'):
The sequence was created for this purpose and doesn't exist anywhere else in the function. What are the pros and cons to grouping it as a tuple, list, or set, given that they all seem to work the same and the list is short? Which would be idiomatic?

Use a set until you have good reason not to. (And then use a list.)
I would consider a set to be more idiomatic. It conveys the meaning more clearly, since order doesn't matter, only membership.
And to be clear, a set is a collection but not a "sequence type" (even though it's iterable), because it's semantically "unordered".
Why not use a set?
Sets may only contain hashable types. And, this is important, they will raise a TypeError instead of simply returning False when you ask if an unhashable type is in the set. If you might get an unhashable object on either side of the in operator, you're out of luck. Sometimes you can use hashable elements instead (like frozenset instead of set or tuple instead of list), sometimes you can't.
But tuples and lists don't have to hash their elements.
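A short illustration of the difference described above (the variable names are just for the example):

```python
allowed_set = {'a', 'b', 'c', 'e'}
allowed_tuple = ('a', 'b', 'c', 'e')
candidate = ['a']  # a list is unhashable

# The tuple compares element by element, so an unhashable candidate is fine.
print(candidate in allowed_tuple)

# The set has to hash the candidate first, which fails for a list.
try:
    candidate in allowed_set
except TypeError as exc:
    print("TypeError:", exc)
```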
Why a list over a tuple?
The main advantage of a list is that it avoids a syntactic quirk of one-element tuples. Say you have ('foo', 'bar') and later decide to remove the 'bar'. Then you have ('foo'). Oops, see what I did there? It was actually supposed to be ('foo',). It's easy to forget the comma. And the in check still works for strings like ('foo'), since in checks for substrings. This can subtly change the meaning of your program. 'oo' is in ('foo'), but not in ('foo',).
A one-item list like ['foo'] doesn't have that problem. [And as user2357112 pointed out, a constant list is going to get compiled to a tuple anyway.]
Note that a one-item set, like {'a'} doesn't have that problem either. An empty {} is a dict instead, but that's not going to cause any issues with an in check because it's also an empty collection.
But you should arguably be using == instead of in when comparing against only one element.
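The one-element pitfall is easy to check for yourself:

```python
bare_string = ('foo')   # parentheses only, so this is just the string 'foo'
one_tuple = ('foo',)    # the trailing comma makes it a tuple

print(type(bare_string).__name__, type(one_tuple).__name__)
print('oo' in bare_string)  # substring test on a string
print('oo' in one_tuple)    # membership test on a tuple
print('oo' in ['foo'])      # one-item list
print('oo' in {'foo'})      # one-item set
```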
That's it for clarity. Now for the micro-optimizations. Premature optimization is the root of all evil. Don't optimize at the expense of readability before it's actually necessary.
A set lookup is faster if it's not too small, since a tuple's elements have to be checked one-by-one which (on average) grows with the size of the tuple, while a set is backed by a hashtable (like a dict), which has a small constant overhead. If the distribution of cases isn't uniform, this means that the order of elements in the tuple matters a lot. Putting the more common cases first in the tuple will make the checks much faster than the reverse, on average.
How small does the collection have to be for the set's constant overhead to matter? Profile and see. Performance can vary based on a lot of factors. It's not just the number of elements, but also how long an equality check takes, where the elements are located in memory, etc.
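A minimal way to "profile and see" with timeit might look like this. The helper name time_lookup and the iteration count are arbitrary choices, and the numbers will differ between machines and Python versions, so no output is shown.

```python
import timeit

candidates = ('a', 'b', 'c', 'e')
as_tuple = tuple(candidates)
as_set = frozenset(candidates)


def time_lookup(container, needle, number=200_000):
    """Return the seconds taken by `number` membership tests."""
    return timeit.timeit(lambda: needle in container, number=number)


# First element, last element, and a miss: the tuple cost depends on position.
for needle in ('a', 'e', 'z'):
    print(needle, time_lookup(as_tuple, needle), time_lookup(as_set, needle))
```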
A tuple should have a slightly smaller overhead both in memory and construction time than the other collections. But the construction overhead doesn't really matter if the compiler can make it load as a saved constant value. (This can happen when all the elements are themselves constant at compile time. You can use the dis module to confirm this is happening.)
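To check the saved-constant claim, you can look at the constants stored in the compiled code, or at the disassembly. This is a sketch for CPython; the helper name constants_of is made up, and what exactly ends up in co_consts is an implementation detail that may change between versions.

```python
import dis


def constants_of(expression):
    """Compile an expression and return the constants stored in its code object."""
    return compile(expression, "<demo>", "eval").co_consts


# If the literal was folded, the whole collection shows up as one constant.
print(constants_of("x in ('a', 'b', 'c', 'e')"))
print(constants_of("x in ['a', 'b', 'c', 'e']"))
print(constants_of("x in {'a', 'b', 'c', 'e'}"))

# The disassembly shows the same thing as a single load instruction.
dis.dis("x in ('a', 'b', 'c', 'e')")
```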

List comprehension is sorting automatically [duplicate]

The question arose when answering another SO question (there).
When I iterate several times over a python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale of changing the order ? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?

There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard to find bugs.
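The reason this particular pair works as an example, at least on CPython: -1 is reserved internally as an error code, so hash(-1) is -2, the same as hash(-2). The two values collide in the hash table, and which one gets the first slot depends on insertion order.

```python
s1 = {-1, -2}
s2 = {-2, -1}

print(hash(-1), hash(-2))   # both -2 on CPython
print(s1 == s2)             # equal as sets
print(list(s1), list(s2))   # iteration order may still differ
```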

A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python) integers less than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, bytes strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef

And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)
x = set()
for y in range(500):
    x.add(Foo(y))
print(list(x)[-10:])
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long stretches of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
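One way to test the __eq__/__hash__ hypothesis, sketched here under the assumption that CPython's default hash is derived from the object's identity (its memory address), which can differ between runs: give Foo a hash based on val instead. The helper last_ten is only there to make the experiment repeatable.

```python
class Foo:
    def __init__(self, val):
        self.val = val

    def __repr__(self):
        return str(self.val)

    def __eq__(self, other):
        return isinstance(other, Foo) and self.val == other.val

    def __hash__(self):
        # Hash by value, so the hash no longer depends on where the object lives.
        return hash(self.val)


def last_ten():
    x = set()
    for y in range(500):
        x.add(Foo(y))
    return list(x)[-10:]


print(last_ten())
```

With the value-based hash the layout of the set depends only on the values and the insertion order, so repeated calls (and repeated runs) should give the same list.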

It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
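A bare-bones sketch of that idea, with the keys of an OrderedDict acting as the set members. This is only one possible shape for such a class, not a standard library type.

```python
from collections import OrderedDict
from collections.abc import MutableSet


class OrderedSet(MutableSet):
    """Set that remembers insertion order by storing items as OrderedDict keys."""

    def __init__(self, iterable=()):
        self._items = OrderedDict.fromkeys(iterable)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)

    def __repr__(self):
        return "OrderedSet(%r)" % list(self._items)


s = OrderedSet("banana")
s.add("x")
s.discard("a")
print(s)
```

Inheriting from MutableSet supplies the usual operators (|, &, -, comparisons) on top of the five methods defined here.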

The answer is simply NO.
Python set iteration order is NOT stable from one run to the next.
I did a simple experiment to show this.
The code:
import random
random.seed(1)
x = []
class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b
for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))
for j in x:
    print(j.a, j.b)
print('====')
for j in set(x):
    print(j.a, j.b)
Run this twice and you will get:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
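To see the effect of PYTHONHASHSEED, here is a small sketch that runs the same one-liner in fresh interpreters with a chosen seed. The helper name set_order and the sample strings are made up for the example. As far as I know the seed only influences the salted hashes (str, bytes), so it would not stabilise objects that fall back on the default identity-based hash, like the class in the experiment above.

```python
import os
import subprocess
import sys

SNIPPET = "print(list({'apple', 'banana', 'cherry', 'date'}))"


def set_order(seed):
    """Run SNIPPET in a new interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(set_order(0))
print(set_order(0))  # same seed, same order
print(set_order(1))  # a different seed may give a different order
```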

As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)

The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.

Why Python’s function call semantics pass-in keyword arguments are not ordered?

Using the double star syntax in a function definition, we obtain a regular dictionary. The problem is that it loses the order in which the user supplied the arguments. Sometimes we want to know in which order keyword arguments were passed to the function.
Since a function call usually does not involve many arguments, I don't think performance is the issue, so I wonder why the default is not to maintain the order.
I know we can use:
from collections import OrderedDict
def my_func(kwargs):
    print(kwargs)
my_func(OrderedDict(a=1, b=42))
But it is less concise than:
def my_func(**kwargs):
    print(kwargs)
my_func(a=1, b=42)
[EDIT 1]:
1) I thought there were 2 cases:
I need to know the order, and this behaviour is known to the user through the documentation.
I do not need the order, so I do not care whether it is ordered or not.
I did not consider that even if the user knows the function uses the order, they could write:
a = dict(a=1, b=42)
my_func(**a)
because they did not know that a dict is not ordered (even if they should know).
2) I thought that the overhead would not be huge for a few arguments, so the benefit of having a new way to manage arguments would outweigh this downside.
But it seems (from Joe's answer) that the overhead is not negligible.
[EDIT 2]:
It seems that the PEP 0468 -- Preserving the order of **kwargs in a function is going in this direction.
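For readers on a current interpreter: PEP 468 was accepted and is part of Python 3.6 and later, so **kwargs now arrives in the order the caller wrote the arguments. A quick check (my_func here returns the names instead of printing them, purely to make the result easy to inspect):

```python
def my_func(**kwargs):
    # On Python 3.6+ the keys come back in call order.
    return list(kwargs)


print(my_func(b=42, a=1))
print(my_func(a=1, b=42))

# Unpacking a dict keeps that dict's own order (insertion order on 3.7+).
a = dict(z=0, a=1)
print(my_func(first=True, **a))
```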

Because dictionaries are not ordered by definition. I think it really is that simple. The point of kwargs is to take care of exactly those formal parameters which are not ordered. If you did know the order then you could receive them as 'normal' parameters or *args.
Here is a dictionary definition.
CPython implementation detail: Keys and values are listed in an
arbitrary order which is non-random, varies across Python
implementations, and depends on the dictionary’s history of insertions
and deletions.
http://docs.python.org/2/library/stdtypes.html#dict
Python's dictionaries are central to the way the whole language works, so they are highly optimised. Adding ordering would impact performance and require more storage and processing overhead.
You may have a case where that's not true, but I think that's more exceptional than common. Adding a feature 'just in case' for a very hot code path is not a sensible design decision.
EDIT:
Just FYI
>>> import timeit
>>> timeit.timeit(stmt="z = dict(x)", setup='x = ((("one", "two"), ("three", "four"), ("five", "six")))', number=1000000)
1.6569631099700928
>>> timeit.timeit(stmt="z = OrderedDict(x)", setup='from collections import OrderedDict; x = ((("one", "two"), ("three", "four"), ("five", "six")))', number=1000000)
31.618864059448242
That's roughly a 19x speed difference (31.6 s versus 1.66 s) in constructing a smallish 'normal' size dictionary. OrderedDict is part of the standard library, so I don't imagine there's much more performance that can be squeezed out of it.

As a counter-argument, here is an example of the complicated semantics this would cause. There are a couple of cases here:
The function always gets an unordered dictionary.
The function always gets an ordered dictionary - given this, we don't know if the order has any meaning, as if the user passes in an unordered data structure, the order will be arbitrary, while the data type implies order.
The function gets whatever is passed in - this seems ideal, but it's not that simple.
What about the case of some_func(a=1, b=2, **unordered_dict)? There is implicit ordering in the original keyword arguments, but then the dict is unordered. There is no clear choice here between ordered or not.
Given this, I'd say that ordering the keyword arguments wouldn't be useful, as it would be impossible to tell if the order is just an arbitrary one. This would cloud the semantics of function calling.
Given that, any benefit gained by making this a part of calling is lost - instead, just expect an OrderedDict as an argument.

If your function's arguments are so correlated that both name and order matter, consider using a specific data structure or define a class to hold them. Chances are, you'll want them together in other places in your code, and possibly define other functions/methods that use them.

Retrieving the order of key-word arguments passed via **kwargs would be extremely useful in the particular project I am working on. It is about making a kind of n-d numpy array with meaningful dimensions (right now called dimarray), particularly useful for geophysical data handling.
I have posted a developed question with examples here:
How to retrieve the original order of key-word arguments passed to a function call?

side effect gotchas in python/numpy? horror stories and narrow escapes wanted

I am considering moving from Matlab to Python/numpy for data analysis and numerical simulations. I have used Matlab (and SML-NJ) for years, and am very comfortable in the functional environment without side effects (barring I/O), but am a little reluctant about the side effects in Python. Can people share their favorite gotchas regarding side effects, and if possible, how they got around them? As an example, I was a bit surprised when I tried the following code in Python:
lofls = [[]] * 4  # an accident waiting to happen!
lofls[0].append(7)  # not what I was expecting...
print(lofls)  # gives [[7], [7], [7], [7]]
# instead, I should have done this (I think)
lofls = [[] for x in range(4)]
lofls[0].append(7)  # only appends to the first list
print(lofls)  # gives [[7], [], [], []]
thanks in advance

Confusing references to the same (mutable) object with references to separate objects is indeed a "gotcha" (suffered by all non-functional languages, ones which have mutable objects and, of course, references). A frequently seen bug in beginners' Python code is misusing a default value which is mutable, e.g.:
def addone(item, alist=[]):
    alist.append(item)
    return alist
This code may be correct if the purpose is to have addone keep its own state (and return the one growing list to successive callers), much as static data would work in C; it's not correct if the coder is wrongly assuming that a new empty list will be made at each call.
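For the case where a fresh list per call is what the coder wanted, the usual idiom is a None default. The two function names below are only there to put both behaviours side by side:

```python
def addone_shared(item, alist=[]):
    # The default list is created once, when the def statement runs.
    alist.append(item)
    return alist


def addone_fresh(item, alist=None):
    # None is a sentinel; a new list is created on every call that omits alist.
    if alist is None:
        alist = []
    alist.append(item)
    return alist


print(addone_shared(1), addone_shared(2))  # the same list object, printed twice
print(addone_fresh(1), addone_fresh(2))    # two independent lists
```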
Raw beginners used to functional languages can also be confused by the command-query separation design decision in Python's built-in containers: mutating methods that don't have anything in particular to return (i.e., the vast majority of mutating methods) return nothing (specifically, they return None) -- they're doing all their work "in-place". Bugs coming from misunderstanding this are easy to spot, e.g.
alist = alist.append(item)
is pretty much guaranteed to be a bug -- it appends an item to the list referred to by name alist, but then rebinds name alist to None (the return value of the append call).
While the first issue I mentioned is about an early-binding that may mislead people who think the binding is, instead, a late one, there are issues that go the other way, where some people's expectations are for an early binding while the binding is, instead, late. For example (with a hypothetical GUI framework...):
for i in range(10):
    Button(text="Button #%s" % i,
           click=lambda: say("I'm #%s!" % i))
This will show ten buttons saying "Button #0", "Button #1", etc., but, when clicked, each and every one of them will say it's #9, because the i within the lambda is late bound (with a lexical closure). A fix is to take advantage of the fact that default values for arguments are early-bound (as I pointed out about the first issue!-) and change the last line to
click=lambda i=i: say("I'm #%s!" % i))
Now lambda's i is an argument with a default value, not a free variable (looked up by lexical closure) any more, and so the code works as intended (there are other ways too, of course).
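The same effect can be reproduced without any GUI framework. In this sketch say just returns the string, and functools.partial is shown as one of the "other ways" to bind the value early:

```python
from functools import partial


def say(i):
    return "I'm #%s!" % i


# Late binding: every lambda looks up i when it is called, after the loop has ended.
late = [lambda: say(i) for i in range(3)]

# Early binding through a default argument, evaluated when each lambda is created.
early = [lambda i=i: say(i) for i in range(3)]

# Early binding through functools.partial, which stores the argument immediately.
bound = [partial(say, i) for i in range(3)]

print([f() for f in late])
print([f() for f in early])
print([f() for f in bound])
```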

I stumbled upon this one again recently (after years of Python) while trying to remove a small dependency on numpy.
If you come from Matlab you should use and trust numpy functions for handling arrays of a single type. Together with matplotlib, it makes for a smooth transition.
import numpy as np
np.zeros((4,)) # to make an array full of zeros [0,0,0,0]
np.zeros((4,1)) # another one full of zeros but 2 dimensions [[0],[0],[0],[0]]
np.zeros((4,0)) # an empty array like [[],[],[],[]]
np.zeros((0,4)) # another empty array, which can not be represented with python lists o_O
etc.
