Understand a Python program programmatically without executing it - python

I am implementing a workflow management system in which the workflow developer overrides a small process function and inherits from a Workflow class. The class offers a method named add_component to add a component to the workflow (a component is the execution of a piece of software, or can be more complex).
In order to display status, my Workflow class needs to know which components have been added to the workflow. To do so I tried 2 things:
execute the process function twice: the first pass gathers all the required components, the second is the real execution. The problem is that if the workflow developer does anything other than adding components (inserting rows into a database, creating a file), it will be done twice!
parse the Python code of the function to extract only the add_component lines. This works, but if a component sits inside an if/else branch and should not be executed, it still appears in the monitoring!
I'm wondering whether there is another solution (I thought about making my workflow an XML file or something easier to parse, but that is less flexible).
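The parsing approach described above can be sketched with the standard-library ast module; this finds every add_component call without executing anything, but, as the question notes, it cannot tell which branch would actually run (the method name add_component is taken from the question, the component names here are made up):

```python
import ast
import textwrap

source = textwrap.dedent("""
    def process(self):
        self.add_component("download")
        if self.fast_mode:
            self.add_component("quick_check")
        else:
            self.add_component("full_check")
""")

components = []
for node in ast.walk(ast.parse(source)):
    # Look for calls like self.add_component(...) or add_component(...)
    if isinstance(node, ast.Call):
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name == "add_component":
            # Record the first positional argument when it is a literal string
            if node.args and isinstance(node.args[0], ast.Constant):
                components.append(node.args[0].value)

print(components)  # all three names are reported, even though only one branch would run
```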

You cannot know what a program does without "executing" it (you could do it in a context where you mock the things you don't want modified, but that looks like shooting at a moving target).
If you do handmade parsing there will always be issues you miss.
You should break the code into two functions:
a first one that can only add_component(s), without any side effects, but with the possibility to run real code to check the environment etc. and decide which components to add;
a second one that can have side effects and relies on the added components.
Using XML (or any other static format) is similar, except:
you are certain there are no side effects (no need to rely on the programmer respecting the documentation);
much less flexibility, so be sure you need it.
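A minimal sketch of that two-function split; the method names declare_components and execute are hypothetical, chosen to match the question's add_component:

```python
class Workflow:
    def __init__(self):
        self.components = []

    def add_component(self, component):
        self.components.append(component)

    def run(self):
        # Phase 1: side-effect-free declaration; safe to call for monitoring alone
        self.declare_components()
        print("will run:", self.components)  # the status display now knows every component
        # Phase 2: the real execution, with side effects
        self.execute()

class MyWorkflow(Workflow):
    def declare_components(self):
        # May inspect the environment, but must not modify anything
        self.add_component("convert")
        self.add_component("analyze")

    def execute(self):
        for component in self.components:
            ...  # run the software, write to the database, etc.

MyWorkflow().run()
```

Monitoring can call declare_components alone as often as it likes, since that phase is side-effect-free by contract.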

Related

What is the best or proper way to allow debugging of generated code?

For various reasons, in one project I generate executable code by generating an AST from various source files, then compiling that to bytecode (though the question could also work for cases where the bytecode is generated directly, I guess).
From some experimentation, it looks like the debugger more or less just uses the lineno information embedded in the AST alongside the filename passed to compile in order to provide a representation for the debugger's purposes, however this assumes the code being executed comes from a single on-disk file.
That is not necessarily the case for my project, the executable code can be pieced together from multiple sources, and some or all of these sources may have been fetched over the network, or been retrieved from non-disk storage (e.g. database).
And so my Y questions, which may be the wrong ones (hence the background):
is it possible to provide a memory buffer of some sort, or is it necessary to generate a singular on-disk representation of the "virtual source"?
how well would the debugger deal with jumping around between the different bits and pieces if the virtual source can't or should not be linearised?[0]
and just in case, is the assumption of Python only supporting a single contiguous source file correct or can it actually be fed multiple sources somehow?
[0] for instance a web-style literate program would be debugged in its original form, jumping between the code sections, not in the so-called "tangled" form
Some of this can be handled by the trepan3k debugger. For other things, various hooks are in place.
First of all, it can debug based on bytecode alone. But of course stepping won't be possible if the line number table doesn't exist. For that reason, if for no other, I would add a "line number" for each logical stopping point, such as the beginning of statements. The numbers don't have to be line numbers; they could just count from 1 or be indexes into some other table. This is more or less how Go's Pos type works.
The debugger will let you set a breakpoint on a function, but that function has to exist, and when you start a Python program most of the functions you define don't exist yet. So the typical way to do this is to modify the source to call the debugger at some point. In trepan3k the lingo for this is:
from trepan.api import debug; debug()
Do that at a place where the functions you want to break on have already been defined.
And the functions can be specified as methods on existing variables, e.g. self.my_function()
One of the advanced features of this debugger is that it will decompile the bytecode to produce source code. There is a command called deparse which will show you the context around where you are currently stopped.
Deparsing bytecode, though, is a bit difficult, so depending on which kind of bytecode you have, the results may vary.
As for the virtual-source problem: that situation is somewhat tolerated in the debugger, since that kind of thing has to go on when there is no source. To facilitate this and remote debugging (where the local and remote file locations can differ), we allow for filename remapping.
Another library, pyficache, is used for this remapping; I believe it has the ability to remap contiguous lines of one file into lines in another file, and I think you could use this over and over again. However, so far there hasn't been a need for this, and that code is pretty old, so someone would have to beef up trepan3k here.
Lastly, related to trepan3k is trepan-xpy, a CPython bytecode debugger which can step bytecode instructions even when the line number table is empty.
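Independent of trepan3k, one standard-library trick is relevant to the virtual-source question: pdb and traceback look source lines up through linecache, so you can register an in-memory source under an invented filename and compile against that same name. A minimal sketch, where the fake filename is made up for illustration:

```python
import linecache

source = "x = 6\ny = 7\nresult = x * y\n"
fake_name = "<generated:pipeline-step-1>"  # hypothetical virtual filename

# Prime linecache so debuggers and tracebacks can display this "file".
# Cache entries are (size, mtime, lines, fullname); mtime=None disables
# the staleness check that would otherwise evict a file that isn't on disk.
linecache.cache[fake_name] = (
    len(source),
    None,
    source.splitlines(True),  # keep line endings
    fake_name,
)

code = compile(source, fake_name, "exec")
namespace = {}
exec(code, namespace)
print(namespace["result"])                        # 42
print(linecache.getline(fake_name, 3).rstrip())   # result = x * y
```

With this in place, a traceback or pdb session inside the generated code shows real source lines instead of blanks, one cache entry per virtual "file".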

How do I get the local variables defined in a Python function?

I'm working on a project to do some static analysis of Python code. We're hoping to encode certain conventions that go beyond questions of style or detecting code duplication. I'm not sure this question is specific enough, but I'm going to post it anyway.
A few of the ideas that I have involve being able to build a certain understanding of how the various parts of source code work so we can impose these checks. For example, in part of our application that's exposing a REST API, I'd like to validate something like the fact that if a route is defined as a GET, then arguments to the API are passed as URL arguments rather than in the request body.
I'm able to get something like that to work by pulling all the routes, which are pretty nicely structured, and there are guarantees of consistency given the route has to be created as a route object. But once I know that, say, a given route is a GET, figuring out how the handler function uses arguments requires some degree of interpretation of the function source code.
Naïvely, something like inspect.getsourcelines will allow me to get the source code, but on further examination that's not the best solution because I immediately have to build interpreter-like features, such as figuring out whether a line is a comment, and then do something like use regular expressions to hunt down places where state is moved from the request context to a local variable.
Looking at tools like PyLint, they seem mostly focused on high-level "universals" of static analysis, and (at least on superficial inspection) don't have obvious ways of extracting this sort of understanding at a lower level.
Is there a more systematic way to get this representation of the source code, either with something in the standard library or with another tool? Or is the only way to do this writing a mini-interpreter that serves my purposes?
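The usual systematic answer is the standard-library ast module: parse the handler source into a tree and walk it, rather than building regex-based interpreter-like features. As a hedged sketch of the GET-route check described above (the attribute names request.args and request.json are assumptions, in the Flask style; the handler is invented):

```python
import ast

handler_source = '''
def get_user(user_id):
    name = request.args.get("name")      # URL argument: fine for a GET
    payload = request.json["extra"]      # request body: worth flagging in a GET
    return name, payload
'''

def request_attributes_used(source):
    """Return the attributes of `request` that a handler touches, e.g. {'args', 'json'}."""
    accessed = set()
    for node in ast.walk(ast.parse(source)):
        # Match expressions of the form request.<attr>
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == "request"):
            accessed.add(node.attr)
    return accessed

used = request_attributes_used(handler_source)
if "json" in used:
    print("GET handler reads the request body:", sorted(used))
```

Comments and string formatting never get in the way, because the walk operates on syntax nodes, not text lines.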

How many private variables is too much?

I'm currently working on my first larger scale piece of software, and am running into an ugly situation. Most features I have added thus far have required an additional private member in order to function properly. This is due to the fact that most features have been giving more power to the user, by allowing them to modify my program through either arguments passed to the constructor, or methods that specify a setting they wish to toggle.
I currently have around 13 private variables, and can see this spiraling out of control. The constructor code is starting to look very ugly. I was wondering if this is just a result of adding features, or if there was a creative/clever way to avoid this issue.
I would recommend abstracting the concept of "behavior".
You'd have a base class "behavior" which actually performs the requested action, or manages the modification to behavior. Then you can initialize your code using an array of "parameters" and "behaviors".
Your startup code would become a simple "for" loop, and to add/remove behaviors you just add or remove to the list.
Of course, the tough part of this is actually fitting the activities of the behavior classes into your overall program flow. But I'm guessing that a focus on "single responsibility principle" would help figure that out.
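A minimal sketch of that idea, with hypothetical behavior names: the constructor collapses into a loop over behavior objects instead of thirteen private variables and a sprawling argument list.

```python
class Behavior:
    """Base class: each behavior owns one configurable aspect of the program."""
    def apply(self, app):
        raise NotImplementedError

class VerboseLogging(Behavior):
    def apply(self, app):
        app.settings["log_level"] = "debug"

class AutoSave(Behavior):
    def __init__(self, interval_s):
        self.interval_s = interval_s
    def apply(self, app):
        app.settings["autosave_interval_s"] = self.interval_s

class App:
    def __init__(self, behaviors=()):
        self.settings = {}
        # Startup is a simple loop; adding a feature means adding a Behavior,
        # not another private variable and constructor argument.
        for behavior in behaviors:
            behavior.apply(self)

app = App([VerboseLogging(), AutoSave(interval_s=30)])
print(app.settings)
```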

How do I proceed with memory, .so filenames and hex offsets

Don't flame me for this, but it's genuine. I am writing a multi-threaded Python application that runs for a very long time, typically 2-3 hours with 10 processes. The machine isn't slow; there are just a lot of calculations.
The issue is that sometimes the application will hang at about 85-90% of the way there because of outside tools.
I've broken this test up into smaller pieces that can then run successfully but the long running program hangs.
For example, let's say I have to analyze some data in a list that is 100,000,000 items long.
Broken up into twenty 5,000,000-item lists, all the smaller parts run to completion.
Trying to do the 100,000,000-item job, it hangs towards the end. I use some outside tools that I cannot change, so I am just trying to see what's going on.
I set up DTrace and run
sudo dtrace -n 'syscall:::entry / execname == "python2.7" / { #[ustack()] = count() }'
on my program right when it hangs and I get an output like the code sample below.
libc.so.7`__sys_recvfrom+0xa
_socket.so`0x804086ecd
_socket.so`0x8040854ac
libpython2.7.so.1`PyEval_EvalFrameEx+0x52d7
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800b3317d
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800abb5a1
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`0x800aa3855
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`PyEval_EvalFrameEx+0x4de2
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800abb5a1
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`0x800aa3855
libpython2.7.so.1`PyObject_Call+0x64
that code just repeats over and over. I tried looking into the DTrace Python probes, but those seem busted two sides from Tuesday, so this might be the closest that I'll get.
My question: I have a fuzzy idea that libpython2.7.so.1 is the shared library that holds the function PyObject_Call at a hex offset of 0x64.
Is that right?
How can I decipher this? I don't know what to even call this, so I can't google for answers or guides.
You should probably start by reading Showing the stack trace from a running Python application.
Your specific
question was about the interpretation of DTrace's ustack() action and
so this reply may be more than you need. This is because one of the
design principles of DTrace is to show the exact state of a system.
So, even though you're interested in the Python aspect of your
program, DTrace is revealing its underlying implementation.
The output you've presented is a stack, which is a way of
describing the state of a thread at a specific point in its
execution. For example, if you had the code
void c(void) { pause(); }
void b(void) { c(); }
void a(void) { b(); }
and you asked for a stack whilst execution was within pause() then
you might see something like
pause()
c()
b()
a()
Whatever tool you use will find the current instruction and its
enclosing function before finding the "return address", i.e. the
point to which that function will eventually return; repeating this
procedure yields a stack. Thus, although the stack should be read
from the top to the bottom as a series of return addresses, it's typically
read in the other direction as a series of callers. Note that
subtleties in the way that the program's corresponding
instructions are assembled mean that this second interpretation
can sometimes be misleading.
To extend the example above, it's likely that a(), b() and c() are
all present within the same library --- and that there may be
functions with the same names in other libraries. Thus it's
useful to display, for each function, the object to which it
belongs. Thus the stack above could become
libc.so`pause()
libfoo.so`c()
libfoo.so`b()
libfoo.so`a()
This goes some way towards allowing a developer to identify how a
program ended up in a particular state: function c() in libfoo
has called pause(). However, there's more to be done: if c()
looked like
void c() {
pause();
pause();
}
then in which call to pause() is the program waiting?
The functions a(), b() and c() will be sequences
of instructions that will typically occupy a contiguous region of
memory. Calling one of the functions involves little more than
making a note of where to return when finished (i.e. the return
address) and then jumping to whichever memory address corresponds
to the function's start. Functions' start addresses and sizes are
recorded in a "symbol table" that is embedded in the object; it's
by reading this table that a debugger is able to find the function
that contains a given location such as a return address. Thus a
specific point within a function can be described by an offset,
usually expressed in hex, from the start. So an even better
version of the stack above might be
libc.so`pause()+0x12
libfoo.so`c()+0x42
libfoo.so`b()+0x12
libfoo.so`a()+0x12
At this point, the developer can use a "disassembler" on libfoo.so
to display the instructions within c(); comparison with c()'s
source code would allow him to reveal the specific line from which
the call to pause() was made.
Before concluding this description of stacks, it's worth making
one more observation. Given the presence of sufficient "debug
data" in a library such as libfoo, a better debugger would be able
to go the extra mile and display the source code file name and
line number instead of the hexadecimal offset for each "frame" in
the stack.
So now, to return to the stack in your question,
libpython2.7.so.1 is a library whose functions perform the job
of executing a Python script. Functions in the Python script are
converted into executable instructions on the fly, so my guess is
that the fragment
libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
means that PyEval_EvalFrameEx() is functionality within libpython
itself that calls a Python function (i.e. something written in
Python) that resides in memory near the address 0x800b33250. A
simple debugger can see that this address belongs to libpython but
won't find a corresponding entry in the library's symbol table;
left with no choice, it simply prints the "raw" address.
So, you need to look at the Python script so see what it's
doing but, unfortunately, there's no indication of the names of
the functions in the Python component of the stack.
There are a few ways to proceed. The first is to find a
version of libpython, if one exists, with a "DTrace helper". This
is some extra functionality that lets DTrace see the state of the
Python program itself in addition to the surrounding
implementation. The result is that each Python frame would be
annotated with the corresponding point in the Python source code.
Another, if you're on Solaris, is to use pstack(1); this has
native support for Python.
Finally, try a specific Python debugger.
It's also worth pointing out that your dtrace invocation will show
you all the stacks seen, sorted by popularity, whenever the
program "python2.7" makes a system call. From your description,
this probably isn't what you want. If you're trying to understand
the behaviour of a hang then you probably want to start with a
single snapshot of the python2.7 process at the time of the
hang.
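As a concrete starting point, a common pure-Python recipe (in the spirit of "Showing the stack trace from a running Python application" mentioned above) installs a signal handler that dumps every thread's Python-level stack on demand, so a hung process can be inspected without DTrace. The choice of SIGUSR1 is just a convention, and this sketch assumes a Unix platform:

```python
import signal
import sys
import traceback

def dump_all_stacks(signum=None, frame=None):
    """Print a Python-level stack trace for every running thread."""
    for thread_id, top_frame in sys._current_frames().items():
        print(f"--- thread {thread_id} ---")
        # format_stack walks from the thread's current frame back to its root
        print("".join(traceback.format_stack(top_frame)), end="")

# Send SIGUSR1 to the hung process (kill -USR1 <pid>) to trigger a dump
signal.signal(signal.SIGUSR1, dump_all_stacks)
```

Unlike the DTrace ustack() output, this shows function names and line numbers from the Python script itself, which is exactly the information missing from the raw libpython frames.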

Design of a multi-level abstraction software

I'm working on designing a piece of software now that has a few levels of abstraction. This might be the most complex piece of code I've ever started designing, and it has a requirement for easy upgradability, so I want to make sure I'm on the right track before I start coding anything.
Essentially, there will be 3 main levels of classes. These levels will need to talk with each other.
The first is the input source data. There are currently 2 main types of input data, which produce similar, but not identical, output. The main goal of these classes will be to get the data from the two different sources and convert it into a common interface, for use in the rest of the program.
The second set will be an adapter for an external library. The library has been periodically updated, and I have no reason to suspect it will not continue to be updated over the years. Most likely each upgrade will remain very similar to the previous one, but small changes might be needed to support a new library version. This level will be responsible for taking the inputs and formatting them for use by an output class.
The last class is the outputs. I don't think that multiple versions will be required for this, but there will need to be at least two different output directories specified. I suspect the easiest thing to do would be to simply pass in an output directory when the output class is created, and that is the only level of abstraction required. This class will be frequently updated, but there is no requirement to support multiple versions.
Set up the code as follows, essentially following a bridge pattern, but with multiple abstraction layers.
The input class will be the abstraction. The two current means of getting the data will be the two concrete classes, and more concrete classes can be added if required.
The wrapper class will be a factory pattern. Most of the code should be common between the various implementations, so this should work well to handle minute differences.
The output class will be included as part of the implementor class. There isn't really a pattern required, as only one version will ever be needed for this class. Also, the implementor will likely be a singleton.
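A hedged sketch of that layering in Python (all class names are hypothetical): an abstract input interface with two concrete sources, a factory that selects a library adapter by version, and a single output class configured only with its directory.

```python
from abc import ABC, abstractmethod

class InputSource(ABC):
    """Abstraction: every source yields records through one common interface."""
    @abstractmethod
    def records(self):
        ...

class FileSource(InputSource):
    def records(self):
        return [{"value": 1}]

class NetworkSource(InputSource):
    def records(self):
        return [{"value": 2}]

class LibraryAdapter:
    """Wraps one version of the external library; subclass per version."""
    def format(self, record):
        return f"v1:{record['value']}"

class LibraryAdapterV2(LibraryAdapter):
    def format(self, record):
        return f"v2:{record['value']}"

def make_adapter(version):
    # Factory: the minute differences between library versions live in subclasses
    return {1: LibraryAdapter, 2: LibraryAdapterV2}[version]()

class Output:
    def __init__(self, directory):
        self.directory = directory  # the only configuration this level needs
    def write(self, line):
        return f"{self.directory}/{line}"

source = NetworkSource()
adapter = make_adapter(2)
output = Output("/tmp/run1")
results = [output.write(adapter.format(r)) for r in source.records()]
print(results)
```

Supporting a new library version then means adding one adapter subclass and one factory entry, without touching the input or output levels.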
