Python: Analyzing complex statements during execution

I am wondering if there is any way to get some meta-information about how the interpreter evaluates a Python statement during execution.
Let's assume we have a complex statement composed of single statements joined with or (A, B, ... are boolean functions):
if A or B and ((C or D and E) or F) or G and H:
and I want to know which part of the statement caused it to evaluate to True, so I can do something with this knowledge. In the example, there are three possible candidates:
A
B and ((C or D and E) or F)
G and H
And in the second case, I would like to know if it was (C or D and E) or F that evaluated to True and so on...
Is there any way to do this without parsing the statement? Can I hook into the interpreter in some way, or use the inspect module in a way that I haven't found yet? I do not want to debug; it's really about knowing which part of this or-chain triggered the statement at runtime.
Edit - further information: The type of application that I want to use this in is a categorizing algorithm that inputs an object and outputs a certain category for this object, based on its attributes. I need to know which attributes were decisive for the category.
As you might guess, the complex statement from above comes from the categorization algorithm. The code for this algorithm is generated from a formal pseudo-code and contains about 3,000 nested if-elif-statements that determine the category in a hierarchical way like
if obj.attr1 < 23 and (is_something(obj.attr10) or eats_spam_for_breakfast(obj)):
    return 'Category1'
elif obj.attr3 == 'Welcome Home' or count_something(obj) >= 2:
    return 'Category2a'
elif ...
So aside from the category itself, I need to flag the attributes that were decisive for that category, so if I'd delete all other attributes, the object would still be assigned to the same category (due to the ors within the statements). The statements can be really long, up to 1,000 chars, and deeply nested. Every object can have up to 200 attributes.
Thanks a lot for your help!
Edit 2: Haven't found time in the last two weeks. Thanks for providing this solution, it works!

Could you recode your original code:
if A or B and ((C or D and E) or F) or G and H:
as, say:
e = Evaluator()
if e('A or B and ((C or D and E) or F) or G and H'):
...? If so, there's hope!-). The Evaluator class, upon __call__, would compile its string argument, then eval the result with an empty real dict for globals and a pseudo-dict for locals that delegates value lookups to the locals and globals of its caller (that takes just a little black magic, but nothing too bad;-) and also takes note of the names it has looked up. Given the short-circuiting behavior of Python's and and or, you can infer from the set of names that were actually looked up which one determined the truth value of the expression (or of each subexpression): in X or Y or Z, the first true value (if any) will be the last one looked up, and in X and Y and Z, the first false one will.
Would this help? If yes, and if you need help with the coding, I'll be happy to expand on this, but first I'd like some confirmation that getting the code for Evaluator would indeed be solving whatever problem it is that you're trying to address!-)
Edit: so here's code implementing Evaluator and exemplifying its use:
import inspect
import random

class TracingDict(object):
    def __init__(self, loc, glob):
        self.loc = loc
        self.glob = glob
        self.vars = []
    def __getitem__(self, name):
        try: v = self.loc[name]
        except KeyError: v = self.glob[name]
        self.vars.append((name, v))
        return v

class Evaluator(object):
    def __init__(self):
        f = inspect.currentframe()
        f = inspect.getouterframes(f)[1][0]
        self.d = TracingDict(f.f_locals, f.f_globals)
    def __call__(self, expr):
        return eval(expr, {}, self.d)

def f(A, B, C, D, E):
    e = Evaluator()
    res = e('A or B and ((C or D and E) or F) or G and H')
    print 'R=%r from %s' % (res, e.d.vars)

for x in range(20):
    A, B, C, D, E, F, G, H = [random.randrange(2) for x in range(8)]
    f(A, B, C, D, E)
and here's output from a sample run:
R=1 from [('A', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 0), ('B', 1), ('C', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 0), ('B', 0), ('G', 1), ('H', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 0), ('B', 1), ('C', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 0), ('B', 1), ('C', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 1)]
R=0 from [('A', 0), ('B', 0), ('G', 0)]
R=1 from [('A', 1)]
R=1 from [('A', 1)]
R=1 from [('A', 1)]
R=0 from [('A', 0), ('B', 0), ('G', 0)]
R=1 from [('A', 0), ('B', 1), ('C', 1)]
You can see that often (about 50% of the time) A is true, which short-circuits everything. When A is false, B evaluates -- when B is also false, then G is next, when B is true, then C.
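For the categorisation use case from the question's edit, bare-name lookups aren't enough: expressions like obj.attr1 < 23 read attributes, and those reads never hit TracingDict. A hedged adaptation (my own sketch, not part of the original answer; TracingProxy and Obj are hypothetical names) wraps the object in a recording proxy:
class TracingProxy(object):
    def __init__(self, obj):
        # store via __dict__ to avoid triggering __getattr__ recursively
        self.__dict__['_obj'] = obj
        self.__dict__['accessed'] = []
    def __getattr__(self, name):
        # record every attribute actually read during evaluation
        v = getattr(self._obj, name)
        self.accessed.append((name, v))
        return v

class Obj(object):  # stand-in for the real categorised object
    attr1 = 5
    attr3 = 'Welcome Home'

p = TracingProxy(Obj())
if eval('obj.attr1 < 23 or obj.attr3 == "Welcome Home"', {}, {'obj': p}):
    print p.accessed  # attributes read before the short-circuit: [('attr1', 5)]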

As far as I remember, Python's boolean operators do not return True or False per se:
Important exception: the Boolean operations or and and always return one of their operands.
(The Python Standard Library: Truth Value Testing)
Therefore, the following is valid:
A = 1
B = 0
result = B or A # result == 1
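A couple more minimal illustrations of the operand-returning behaviour (my own examples):
print [] or 'fallback'       # fallback -- 'or' returns the first truthy operand
print 0 and 'never reached'  # 0 -- 'and' returns the first falsy operand
print 3 and 'last operand'   # last operand -- all operands truthy, so the last is returned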

The Python interpreter doesn't give you a way to introspect the evaluation of an expression at runtime. The sys.settrace() function lets you register a callback that is invoked for every line of source code, but that's too coarse-grained for what you want to do.
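To make that granularity concrete, here is a minimal sketch of registering a line-level tracer (my own illustration; the callback name is arbitrary):
import sys

def tracer(frame, event, arg):
    # called per 'call' event and, via the returned local tracer, per 'line' --
    # never per subexpression, so short-circuit decisions stay invisible
    if event == 'line':
        print 'line', frame.f_lineno, 'in', frame.f_code.co_name
    return tracer

sys.settrace(tracer)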
That said, I've experimented with a crazy hack to have the function invoked for every bytecode executed: Python bytecode tracing.
But even then, I don't know how to find the execution state, for example, the values on the interpreter stack.
I think the only way to get at what you want is to modify the code algorithmically. You could either transform your source (though you said you didn't want to parse the code), or you could transform the compiled bytecode. Neither is a simple undertaking, and I'm sure there are a dozen difficult hurdles to overcome if you try it.
Sorry to be discouraging...
BTW: What application do you have for this sort of technology?

I would just put something like this before the big statement (assuming the statement is in a class):
for i in ("A", "B", "C", "D", "E", "F", "G", "H"):
    print i, self.__dict__[i]

"""I do not want to debug, it's really about knowing which part of this or-chain triggered the statement at runtime.""": you might need to explain what is the difference between "debug" and "knowing which part".
Do you mean that you, the observer, need to be told at runtime what is going on (why??) so that you can do something different, or do you mean that the code needs to "know" so that it can do something different?
In any case, assuming that your A, B, C etc don't have side effects, why can't you simply split up your or-chain and test the components:
part1 = A
part2 = B and ((C or D and E) or F)
part3 = G and H
whodunit = "1" if part1 else "2" if part2 else "3" if part3 else "nobody"
print "Perp is", whodunit
if part1 or part2 or part3:
    do_something()
??
Update:
"""The difference between debug and 'knowing which part' is that I need to assign a flag for the variables that were used in the statement that first evaluated to True (at runtime)"""
So you are saying that given the condition "A or B", if A is True and B is True, A gets all the glory (or all the blame)? I'm finding it very hard to believe that categorisation software such as you describe is based on "or" having a short-circuit evaluation. Are you sure that there's an intent behind the code being "A or B" and not "B or A"? Could the order be random, or influenced by the order in which the variables were originally input?
In any case, generating Python code automatically and then reverse-engineering it appears to be a long way around the problem. Why not just generate code with the part1 = yadda; part2 = blah; etc nature?
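A hedged sketch of what such generated code could look like for the first branches from the question (the helper functions are the ones the question already uses and are assumed to be defined; the trace list is my own addition):
def categorize(obj, trace):
    part1 = obj.attr1 < 23
    part2 = is_something(obj.attr10) or eats_spam_for_breakfast(obj)
    if part1 and part2:
        # record which named parts fired, instead of reverse-engineering the expression
        trace.append(('Category1', part1, part2))
        return 'Category1'
    part3 = obj.attr3 == 'Welcome Home'
    part4 = count_something(obj) >= 2
    if part3 or part4:
        trace.append(('Category2a', part3, part4))
        return 'Category2a'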


Generator expressions vs generator functions and surprisingly eager evaluation

For reasons that are not relevant I am combining some data structures in a certain way, while also replacing Python 2.7's default dict with OrderedDict. The data structures use tuples as keys in dictionaries. Please ignore those details (the replacement of the dict type is not useful below, but it is in the real code).
import __builtin__
import collections
import contextlib
import itertools

def combine(config_a, config_b):
    return (dict(first, **second) for first, second in itertools.product(config_a, config_b))

@contextlib.contextmanager
def dict_as_ordereddict():
    dict_orig = __builtin__.dict
    try:
        __builtin__.dict = collections.OrderedDict
        yield
    finally:
        __builtin__.dict = dict_orig
This works as expected initially (dict can take non-string keyword arguments as a special case):
print 'one level nesting'
with dict_as_ordereddict():
    result = combine(
        [{(0, 1): 'a', (2, 3): 'b'}],
        [{(4, 5): 'c', (6, 7): 'd'}]
    )
print list(result)
print
Output:
one level nesting
[{(0, 1): 'a', (4, 5): 'c', (2, 3): 'b', (6, 7): 'd'}]
However, when nesting calls to the combine generator expression, it can be seen that the dict reference is treated as OrderedDict, lacking the special behaviour of dict to use tuples as keyword arguments:
print 'two level nesting'
with dict_as_ordereddict():
    result = combine(combine(
        [{(0, 1): 'a', (2, 3): 'b'}],
        [{(4, 5): 'c', (6, 7): 'd'}]
    ),
        [{(8, 9): 'e', (10, 11): 'f'}]
    )
print list(result)
print
Output:
two level nesting
Traceback (most recent call last):
  File "test.py", line 36, in <module>
    [{(8, 9): 'e', (10, 11): 'f'}]
  File "test.py", line 8, in combine
    return (dict(first, **second) for first, second in itertools.product(config_a, config_b))
  File "test.py", line 8, in <genexpr>
    return (dict(first, **second) for first, second in itertools.product(config_a, config_b))
TypeError: __init__() keywords must be strings
Furthermore, implementing via yield instead of a generator expression fixes the problem:
def combine_yield(config_a, config_b):
    for first, second in itertools.product(config_a, config_b):
        yield dict(first, **second)

print 'two level nesting, yield'
with dict_as_ordereddict():
    result = combine_yield(combine_yield(
        [{(0, 1): 'a', (2, 3): 'b'}],
        [{(4, 5): 'c', (6, 7): 'd'}]
    ),
        [{(8, 9): 'e', (10, 11): 'f'}]
    )
print list(result)
print
Output:
two level nesting, yield
[{(0, 1): 'a', (8, 9): 'e', (2, 3): 'b', (4, 5): 'c', (6, 7): 'd', (10, 11): 'f'}]
Questions:
Why does some item (only the first?) from the generator expression get evaluated earlier than required in the second example, and what is it required for?
Why is it not evaluated in the first example? I actually expected this behaviour in both.
Why does the yield-based version work?
Before going into the details note the following: itertools.product evaluates the iterator arguments in order to compute the product. This can be seen from the equivalent Python implementation in the docs (the first line is relevant):
def product(*args, **kwds):
    pools = map(tuple, args) * kwds.get('repeat', 1)
    ...
You can also try this with a custom class and a short test script:
import itertools

class Test:
    def __init__(self):
        self.x = 0
    def __iter__(self):
        return self
    def next(self):
        print('next item requested')
        if self.x < 5:
            self.x += 1
            return self.x
        raise StopIteration()

t = Test()
itertools.product(t, t)
Creating the itertools.product object will show in the output that all the iterators' items are immediately requested.
This means that as soon as you call itertools.product, the iterator arguments are evaluated. This is important because in the first case the arguments are just two lists, so there's no problem. Then you evaluate the final result via list(result) after the context manager dict_as_ordereddict has returned, and so all calls to dict will resolve to the normal builtin dict.
Now for the second example the inner call to combine still works fine, returning a generator expression which is then used as one of the arguments to the second combine's call to itertools.product. As we've seen above, these arguments are evaluated immediately, and so the generator object is asked to generate its values. In order to do so, it needs to resolve dict. However, we're still inside the context manager dict_as_ordereddict, and for that reason dict will be resolved as OrderedDict, which doesn't accept non-string keys as keyword arguments.
It is important to notice here that the first version, which uses return, needs to create the generator object in order to return it. That involves creating the itertools.product object. So that version is exactly as lazy as itertools.product.
Now to the question of why the yield version works. By using yield, calling the function returns a generator. This is a truly lazy version in the sense that execution of the function body doesn't start until items are requested. This means neither the inner nor the outer call to combine will start executing the function body, and thus invoking itertools.product, until the items are requested via list(result). You can check that by putting an additional print statement inside that function and right after the context manager:
def combine(config_a, config_b):
    print 'start'
    # return (dict(first, **second) for first, second in itertools.product(config_a, config_b))
    for first, second in itertools.product(config_a, config_b):
        yield dict(first, **second)

with dict_as_ordereddict():
    result = combine(combine(
        [{(0, 1): 'a', (2, 3): 'b'}],
        [{(4, 5): 'c', (6, 7): 'd'}]
    ),
        [{(8, 9): 'e', (10, 11): 'f'}]
    )
print 'end of context manager'
print list(result)
print
With the yield version we'll notice that it prints the following:
end of context manager
start
start
That is, the generators are started only when the results are requested via list(result). This is different from the return version (switch the commented lines in the code above), where you'll see
start
start
and before the end of the context manager is reached the error is already raised.
On a side note, in order for your code to work, the replacement of dict needs to be ineffective (and it is, in the first version), so I don't see why you would use that context manager at all. Secondly, dict literals are not ordered in Python 2, and neither are keyword arguments, so that also defeats the purpose of using OrderedDict. Also note that in Python 3 the non-string keyword argument behaviour of dict has been removed, and the clean way to update dictionaries with arbitrary keys is dict.update.
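For instance, a hedged rewrite of combine that sidesteps the keyword-argument special case entirely (my own sketch):
def combine(config_a, config_b):
    for first, second in itertools.product(config_a, config_b):
        merged = dict(first)
        merged.update(second)  # update() accepts mappings with any hashable keys
        yield merged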

Python change namespace to access attributes of a class

I have a class with a lot of attributes to be set. In order to do so, I do:
opts.a = 1
opts.b = 2
# ...
opts.xyz = 0
After repeatedly writing opts. at the beginning of each variable, I am wondering: is it possible to wrap this in a function or context so that the namespace is set to the class attributes and I don't have to write opts all the time? I.e. something like:
with opts:
a = 1
b = 2
# ...
xyz = 3
I thought of moving the code inside a function of the class, but that doesn't make things easier to read or write, since then I'd need to write self instead of opts every time.
One side condition for my case: the __setattr__ of the class should be called, since I did override that with a custom function to store the order in which attributes are set.
If I take your question and the comments together, you want to use the class-as-enum approach, but with mutable values, you want a quick way to update multiple attributes, and you want to store the order in which the attributes are set.
Here is a partial implementation that does what I think you want:
import collections

class MyClass:
    def __init__(self):
        self.__dict__['values'] = collections.OrderedDict()
    def __setattr__(self, a, v):
        self.__dict__['values'][a] = v
    def __getattr__(self, a):
        return self.__dict__['values'][a]
    def __repr__(self):
        return repr(self.__dict__['values'])
Given this definition you can do:
>>> m = MyClass()
>>> m.z = 56
>>> m.y = 22
>>> m.b = 34
>>> m.c = 12
>>> m
OrderedDict([('z', 56), ('y', 22), ('b', 34), ('c', 12)])
Quick update:
>>> m.values.update(f=2, g=3, h=5)
>>> m
OrderedDict([('z', 56), ('y', 22), ('b', 34), ('c', 12), ('h', 5), ('g', 3), ('f', 2)])
It is true that the relative order of f-g is a bit surprising, but since the update happens in a single statement, there is an argument for saying that the updates are simultaneous and so the relative order is arbitrary.
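If you also want bulk assignment that still routes through __setattr__ (the side condition in the question), a small hedged helper preserves the order of an explicit pair list (set_options is my own hypothetical name):
def set_options(obj, pairs):
    # pairs: an ordered sequence of (name, value); setattr() triggers __setattr__
    for name, value in pairs:
        setattr(obj, name, value)

set_options(m, [('a', 1), ('b', 2), ('xyz', 3)])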

Trying to find majority element in a list

I'm writing a function to find a majority in a Python list.
I'm thinking that if I can write a hash function that maps every element to a single slot in a new array, or to a unique identifier for use in a dictionary, that should be the best approach and should be doable, but I am not sure how to progress. My hash function is obviously useless; any tips on what I can/should do, or whether this is even a reasonable approach?
def find_majority(k):
    def hash_it(q):
        return q
    map_of = [0] * len(k)
    for i in k:
        mapped_to = hash_it(i)  # hash function
        map_of[mapped_to] += 1

find_majority([1,2,3,4,3,3,2,4,5,6,1,2,3,4,5,1,2,3,4,6,5])
Python's standard library has a class called Counter (in the collections module) that will do this for you.
>>> from collections import Counter
>>> c = Counter([1,2,3,4,3,3,2,4,5,6,1,2,3,4,5,1,2,3,4,6,5])
>>> c.most_common()
[(3, 5), (2, 4), (4, 4), (1, 3), (5, 3), (6, 2)]
>>> value, count = c.most_common()[0]
>>> print value
3
See the docs.
http://docs.python.org/2/library/collections.html#collections.Counter
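Strictly speaking, a majority element must occur in more than half of the positions; a small hedged extension of the Counter approach checks that (true_majority is my own hypothetical helper):
from collections import Counter

def true_majority(seq):
    value, count = Counter(seq).most_common(1)[0]
    # a true majority occurs more than len(seq) / 2 times
    return value if count * 2 > len(seq) else None

print true_majority([1, 1, 1, 2])  # 1
print true_majority([1, 2, 3])     # None (no majority)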
There is also an easy one-liner:
l = [1,2,3,4,3,3,2,4,5,6,1,2,3,4,5,1,2,3,4,6,5]
print(max(set(l), key = l.count)) # 3
I think your approach is to use another array as big as k as your "hash map". If k is huge but the number of unique elements is not so huge, you would be wasting a lot of space. Furthermore, to find the majority, you would have to loop through your map_of hashmap/array to find the max.
On the other hand, a dictionary/set (where hashing is not your concern, and the underlying array structure will probably be more compact for average cases) seems a little more appropriate. Needless to say, with the occurring elements as keys and their occurrences as values, you can find what you want in one single iteration.
So, something like:
def find_majority(k):
    myMap = {}
    maximum = ('', 0)  # (occurring element, occurrences)
    for n in k:
        if n in myMap: myMap[n] += 1
        else: myMap[n] = 1
        # Keep track of maximum on the go
        if myMap[n] > maximum[1]: maximum = (n, myMap[n])
    return maximum
And as expected, we get what we want.
>>> find_majority([1,2,3,4,3,3,2,4,5,6,1,2,3,4,5,1,2,3,4,6,5])
(3, 5)
Of course, Counters and other cool modules will let you do what you want in finer syntax.

looking at a number in a tuple

Right now my function is not recognizing the numbers in the list coeff as numbers. I am trying to pair up items from the two lists and then sort them into different lists based on the value of mul. But everything is going into the negative list. How do I make sure it considers mul as a number in each if statement?
def balance_equation(species, coeff):
    data = zip(coeff, species)
    positive = []
    negative = []
    for (mul, el) in data:
        if mul < 0:
            negative.append((el, mul))
        if mul > 0:
            positive.append((el, mul))
Edit: I meant to originally include this call:
balance_equation(['H2O','A2'], ['6','-4'])
Your problem is that in the way you call it (balance_equation(['H2O','A2'],['6','-4'])), mul is a string rather than an int ('6' or '-4' rather than 6 or -4). Change your if statement to:
if int(mul) < 0:
    negative.append((el, mul))
if int(mul) > 0:
    positive.append((el, mul))
This converts mul to an integer before comparing it to 0.
Well, the first problem is that your function returns None, throwing away the two lists, so there's no way to even see whether it's doing the right thing.
If you fix that, you'll see that it is doing the right thing.
def balance_equation(species, coeff):
    data = zip(coeff, species)
    positive = []
    negative = []
    for (mul, el) in data:
        if mul < 0:
            negative.append((el, mul))
        if mul > 0:
            positive.append((el, mul))
    return negative, positive
>>> n, p = balance_equation('abcdef', range(-3, 3))
>>> n
[('a', -3), ('b', -2), ('c', -1)]
>>> p
[('e', 1), ('f', 2)]
So, there are two possibilities:
Since the code you pasted is clearly not the actual code you're running, maybe you fixed the bug while rewriting it to post here.
You're not calling it with sensible inputs. For example, if you pass the parameters backward, since species is presumably a collection of strings, they'll all end up positive. Or, likewise, if you pass the coeffs as string representations of integers.
If it's the last problem—you're passing, say, 'abcdef', ['-3', '-2', '-1', '0', '1', '2', '3'], and you want to deal with that within balance_equation instead of in the calling code, that's easy. Just add this line before the zip:
coeff = [int(x) for x in coeff]
Or change your zip to:
data = zip((int(x) for x in coeff), species)
By the way, I'm assuming you're on CPython 2. In Python 3, trying to compare a string to 0 will raise a TypeError instead of always returning True, while in other Python 2 implementations it might always return False instead of True…
I think you have your answer, but there's also a simpler way of doing this in Python:
for (mul, el) in data:
    append_to = negative.append if mul < 0 else positive.append
    append_to(el)
Not sure what "should happen" to 0 though

transitive closure python tuples

Does anyone know if there's a python builtin for computing transitive closure of tuples?
I have tuples of the form (1,2),(2,3),(3,4) and I'm trying to get (1,2),(2,3),(3,4),(1,3),(2,4),(1,4).
Thanks.
There's no builtin for transitive closures.
They're quite simple to implement though.
Here's my take on it:
def transitive_closure(a):
    closure = set(a)
    while True:
        new_relations = set((x, w) for x, y in closure for q, w in closure if q == y)
        closure_until_now = closure | new_relations
        if closure_until_now == closure:
            break
        closure = closure_until_now
    return closure
call:
transitive_closure([(1,2),(2,3),(3,4)])
result:
set([(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (2, 4)])
call:
transitive_closure([(1,2),(2,1)])
result:
set([(1, 2), (1, 1), (2, 1), (2, 2)])
Just a quick attempt:
def transitive_closure(elements):
    elements = set([(x, y) if x < y else (y, x) for x, y in elements])
    relations = {}
    for x, y in elements:
        if x not in relations:
            relations[x] = []
        relations[x].append(y)
    closure = set()
    def build_closure(n):
        def f(k):
            for y in relations.get(k, []):
                closure.add((n, y))
                f(y)
        f(n)
    for k in relations.keys():
        build_closure(k)
    return closure
Executing it, we'll get
In [3]: transitive_closure([(1,2),(2,3),(3,4)])
Out[3]: set([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)])
We can perform the "closure" operation from a given "start node" by repeatedly taking a union of "graph edges" from the current "endpoints" until no new endpoints are found. We need to do this at most (number of nodes - 1) times, since this is the maximum length of a path. (Doing things this way avoids getting stuck in infinite recursion if there is a cycle; it will waste iterations in the general case, but avoids the work of checking whether we are done i.e. that no changes were made in a given iteration.)
from collections import defaultdict

def transitive_closure(elements):
    edges = defaultdict(set)
    # map from first element of input tuples to "reachable" second elements
    for x, y in elements:
        edges[x].add(y)
    for _ in range(len(elements) - 1):
        edges = defaultdict(set, (
            (k, v.union(*(edges[i] for i in v)))
            for (k, v) in edges.items()
        ))
    return set((k, i) for (k, v) in edges.items() for i in v)
(I actually tested it for once ;) )
Suboptimal, but conceptually simple solution:
def transitive_closure(a):
    closure = set()
    for x, _ in a:
        closure |= set((x, y) for y in dfs(x, a))
    return closure

def dfs(x, a):
    """Yields single elements from a in depth-first order, starting from x"""
    for y in [y for w, y in a if w == x]:
        yield y
        for z in dfs(y, a):
            yield z
This won't work when there's a cycle in the relation, i.e. a reflexive point.
Here's one essentially the same as the one from @soulcheck that works on adjacency lists rather than edge lists:
def inplace_transitive_closure(g):
    """g is an adjacency list graph implemented as a dict of sets"""
    done = False
    while not done:
        done = True
        for v0, v1s in g.items():
            old_len = len(v1s)
            for v2s in [g[v1] for v1 in v1s]:
                v1s |= v2s
            done = done and len(v1s) == old_len
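A quick usage sketch (my own; note that every vertex must appear as a key, or g[v1] will raise KeyError):
g = {1: set([2]), 2: set([3]), 3: set([4]), 4: set()}
inplace_transitive_closure(g)
print g  # {1: set([2, 3, 4]), 2: set([3, 4]), 3: set([4]), 4: set([])}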
If you have a lot of tuples (more than 5000), you might want to consider using the scipy code for matrix powers (see also http://www.ics.uci.edu/~irani/w15-6B/BoardNotes/MatrixMultiplication.pdf):
from scipy.sparse import csr_matrix as csr

def get_closure(tups, n):  # n is the maximum path length of your relation
    index2id = list(set([tup[0] for tup in tups]) | set([tup[1] for tup in tups]))
    id2index = {index2id[i]: i for i in xrange(len(index2id))}
    # Unfortunately you have to make the relation reflexive first -
    # you could also add the diagonal to M instead
    tups_re = tups + [(index2id[i], index2id[i]) for i in xrange(len(index2id))]
    M = csr(([True for tup in tups_re],
             ([id2index[tup[0]] for tup in tups_re],
              [id2index[tup[1]] for tup in tups_re])),
            shape=(len(index2id), len(index2id)), dtype=bool)
    M_ = M ** n
    temp = M_.nonzero()
    # TODO: you might want to remove the added reflexivity tuples again
    return [(index2id[temp[0][i]], index2id[temp[1][i]]) for i in xrange(len(temp[0]))]
In the best case, you can choose n wisely if you know a bit about your relation/graph, that is, how long the longest path can be. Otherwise you have to use M.shape[0], which might blow up in your face.
This detour also has its limits; in particular, you should be sure that the closure does not get too large (that the connectivity is not too strong), but you would have the same problem in a pure Python implementation.
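A hypothetical usage, assuming a maximum path length of 3 for the example relation (the reflexive pairs noted in the TODO will still be present in the result):
print sorted(get_closure([(1, 2), (2, 3), (3, 4)], 3))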
You can create a graph from those tuples and then run a connected-components algorithm on it. NetworkX is a library that supports connected components.
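A hedged sketch along those lines: connected components only capture reachability for symmetric relations, but for directed tuples recent NetworkX versions also ship a transitive_closure helper (availability depends on your version):
import networkx as nx

G = nx.DiGraph([(1, 2), (2, 3), (3, 4)])
closure = nx.transitive_closure(G)
print(sorted(closure.edges()))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]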
