In C python, accessing the bytecode evaluation stack - python

Given a C Python frame pointer, how do I look at arbitrary evaluation stack entries? (Some specific stack entries can be found via locals(), I'm talking about other stack entries.)
I asked a broader question like this a while ago:
getting the C python exec argument string or accessing the evaluation stack
but here I want to focus on being able to read CPython stack entries at runtime.
I'll take a solution that works on CPython 2.7 or any Python later than Python 3.3. However if you have things that work outside of that, share that and, if there is no better solution I'll accept that.
I'd prefer not to modify the CPython code. In Ruby, I have in fact done this to get what I want. I can speak from experience that this is probably not the way we want to work. But again, if there's no better solution, I'll take that. (My understanding with respect to SO points is that I lose them in the bounty either way. So I'm happy to see them go to the person who has shown the most good spirit and willingness to look at this, assuming it works.)
update: See the comment by user2357112 tldr; Basically this is hard-to-impossible to do. (Still, if you think you have the gumption to try, by all means do so.)
So instead, let me narrow the scope to this simpler problem which I think is doable:
Given a python stack frame, like inspect.currentframe(), find the beginning of the evaluation stack. In the C version of the structure, this is f_valuestack. From that we then need a way in Python to read off the Python values/objects from there.
update 2 well the time period for a bounty is over and no one (including my own summary answer) has offered concrete code. I feel this is a good start though and I now understand the situation much more than I had. In the obligatory "describe why you think there should be a bounty" I had listed one of the proffered choices "to draw more attention to this problem" and to that extent where there had been something less than a dozen views of the prior incarnation of the problem, as I type this it has been viewed a little under 190 times. So this is a success. However...
If someone in the future decides to carry this further, contact me and I'll set up another bounty.
Thanks all.

This is sometimes possible, with ctypes for direct C struct member access, but it gets messy fast.
First off, there's no public API for this, on the C side or the Python side, so that's out. We'll have to dig into the undocumented insides of the C implementation. I'll be focusing on the CPython 3.8 implementation; the details should be similar, though likely different, in other versions.
A PyFrameObject struct has an f_valuestack member that points to the bottom of its evaluation stack. It also has an f_stacktop member that points to the top of its evaluation stack... sometimes. During execution of a frame, Python actually keeps track of the top of the stack using a stack_pointer local variable in _PyEval_EvalFrameDefault:
stack_pointer = f->f_stacktop;
assert(stack_pointer != NULL);
f->f_stacktop = NULL; /* remains NULL unless yield suspends frame */
There are two cases in which f_stacktop is restored. One is if the frame is suspended by a yield (or yield from, or any of the multiple constructs that suspend coroutines through the same mechanism). The other is right before calling a trace function for a 'line' or 'opcode' trace event. f_stacktop is cleared again when the frame unsuspends, or after the trace function finishes.
That means that if
you're looking at a suspended generator or coroutine frame, or
you're currently in a trace function for a 'line' or 'opcode' event for a frame
then you can access the f_valuestack and f_stacktop pointers with ctypes to find the lower and upper bounds of the frame's evaluation stack and access the PyObject * pointers stored in that range. You can even get a superset of the stack contents without ctypes with gc.get_referents(frame_object), although this will contain other referents that aren't on the frame's stack.
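As a minimal sketch of the ctypes-free route just mentioned, the snippet below suspends a generator inside a for loop and checks that the loop's iterator, which the frame holds only on its evaluation stack, shows up among the referents. The names SOURCE, numbers and stack_referents are made up for illustration. One assumption beyond the text above: on newer CPython versions (3.11 and later, as far as I can tell) the suspended frame's stack is reported for the generator object rather than the frame object, so the sketch gathers referents from both.

```python
import gc

# Module-level iterator: it is not a local of the generator, so the only
# place the suspended frame holds it is on its evaluation stack.
SOURCE = iter(range(3))

def numbers():
    for item in SOURCE:   # the for loop keeps SOURCE on the value stack
        yield item

def stack_referents(gen):
    """Return a superset of the suspended generator frame's stack contents."""
    # Assumption: depending on the CPython version, the stack entries are
    # reported for the frame object or for the generator that owns it.
    return gc.get_referents(gen.gi_frame) + gc.get_referents(gen)

g = numbers()
first = next(g)                      # the frame is now suspended at the yield
referents = stack_referents(g)
found = any(r is SOURCE for r in referents)
print(first, found)
```

The result also contains non-stack referents (the code object, globals and so on), which is why it can only be treated as a superset.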
Debuggers use trace functions, so this gets you value stack entries for the top stack frame while debugging, most of the time. It does not get you value stack entries for any other stack frames on the call stack, and it doesn't get you value stack entries while tracing an 'exception' event or any other trace events.
When f_stacktop is NULL, determining the frame's stack contents is close to impossible. You can still see where the stack begins with f_valuestack, but you can't see where it ends. The stack top is stored in a C-level stack_pointer local variable that's really hard to access.
There's the frame's code object's co_stacksize, which gives an upper bound on the stack size, but it doesn't give the actual stack size.
You can't tell where the stack ends by examining the stack itself, because Python doesn't null out the pointers on the stack when it pops entries.
gc.get_referents doesn't return value stack entries when f_stacktop is null. It doesn't know how to retrieve stack entries safely in this case either (and it doesn't need to, because if f_stacktop is null and stack entries exist, the frame is guaranteed reachable).
You might be able to examine the frame's f_lasti to determine the last bytecode instruction it was on and try to figure out where that instruction would leave the stack, but that would take a lot of intimate knowledge of Python bytecode and the bytecode evaluation loop, and it's still ambiguous sometimes (because the frame might be halfway through an instruction). This would at least give you a lower bound on the current stack size, though, letting you safely inspect at least some of it.
Frame objects have independent value stacks that aren't contiguous with each other, so you can't look at the bottom of one frame's stack to find the top of another. (The value stack is actually allocated within the frame object itself.)
You might be able to hunt down the stack_pointer local variable with some GDB magic or something, but it'd be a mess.
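For orientation, here is a small sketch of the two pieces of information from the list above that are visible from plain Python without ctypes: f_lasti and co_stacksize. The helper names stack_bounds and sample are hypothetical. It only reports the upper bound on the stack size; it does not recover the actual stack top.

```python
import inspect

def stack_bounds(frame):
    """Return (f_lasti, co_stacksize) using only Python-visible attributes."""
    # co_stacksize is an upper bound on the number of value stack slots,
    # not the number of slots currently in use.
    return frame.f_lasti, frame.f_code.co_stacksize

def sample(a, b):
    # Inspect our own frame while it is executing.
    return stack_bounds(inspect.currentframe()), a * b + a

(lasti, max_slots), value = sample(2, 3)
print(lasti, max_slots, value)
```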

Note added later: See crusaderky's get_stack.py which might be worked into a solution here.
Here are two potential partial solutions, since this problem has no simple obvious answer, short of:
modifying the CPython interpreter, or
instrumenting the bytecode beforehand, such as via x-python
Thanks to user2357112 for enlightenment on the difficulty of the problem, and for descriptions of:
the various Python stacks used at runtime,
the non-contiguous evaluation stack,
the transiency of the evaluation stack and
the stack pointer top, which only lives as a C local variable (which at run time might be, or is likely, only saved in the value of a register).
Now to potential solutions...
The first solution is to write a C extension to access f_valuestack, which is the bottom (not top) of a frame's evaluation stack. From that you can access values, and that too would have to go in the C extension. The main problem here, since this is the stack bottom, is to understand which entry is the top or the one you are interested in. The code object records the maximum stack depth of the function (co_stacksize).
The C extension would wrap the PyFrameObject so it can get access to the unexposed field f_valuestack. Although the PyFrameObject can change from Python version to Python version (so the extension might have to check which python version is running), it still is doable.
From that, use an abstract virtual machine to figure out which entry position you'd be at for a given offset stored in f_lasti.
Something similar for my purposes would be to use a real but alternative VM, like Ned Batchelder's byterun. It runs a Python bytecode interpreter in Python.
Note added later: I have made some largish revisions in order to support Python 2.5 .. 3.7 or so and this is now called x-python
The advantage here would be that, since this acts as a second VM, stores don't change the running of the current, real CPython VM. However, you'd still need to deal with interactions with external persistent state (e.g. calls across sockets or changes to files). And byterun would need to be extended to cover all opcodes and Python versions that might be needed.
By the way, for multi-version access to bytecode in a uniform way (since not only does bytecode change a bit, but also the collections of routines to access it), see xdis.
So although this isn't a general solution, it could probably work for the special case of trying to figure out the operand of, say, an EXEC op, which appears only briefly on the evaluation stack.

I wrote some code to do this. It seems to work so I'll add it to this question.
How it does it is by disassembling the instructions, and using dis.stack_effect to get the effect of each instruction on stack depth. If there's a jump, it sets the stack level at the jump target.
I think stack level is deterministic, i.e. it is always the same at any given bytecode instruction in a piece of code no matter how it was reached. So you can get stack depth at a particular bytecode by looking at the bytecode disassembly directly.
There's a slight catch which is that if you are in an active call, the code position is shown as last instruction being the call, but the stack state is actually that before the call. This is good because it means you can recreate the call arguments from the stack, but you need to be aware that if the instruction is a call that is ongoing, the stack will be at the level of the previous instruction.
Here's the code from my resumable exception thing that does this:
import dis

cdef get_stack_pos_after(object code,int target,logger):
    # stack depth after each instruction, keyed by bytecode offset
    stack_levels={}
    # stack depth expected at each jump target, keyed by target offset
    jump_levels={}
    cur_stack=0
    for i in dis.get_instructions(code):
        offset=i.offset
        argval=i.argval
        arg=i.arg
        opcode=i.opcode
        if offset in jump_levels:
            cur_stack=jump_levels[offset]
        no_jump=dis.stack_effect(opcode,arg,jump=False)
        if opcode in dis.hasjabs or opcode in dis.hasjrel:
            # a jump - mark the stack level at jump target
            yes_jump=dis.stack_effect(opcode,arg,jump=True)
            if argval not in jump_levels:
                jump_levels[argval]=cur_stack+yes_jump
        cur_stack+=no_jump
        stack_levels[offset]=cur_stack
        logger(offset,i.opname,argval,cur_stack)
    return stack_levels[target]
https://github.com/joemarshall/unthrow
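For readers without Cython, the same idea can be sketched in plain Python. This is an illustrative variant, not the unthrow implementation: the function name stack_levels is hypothetical, it returns the whole offset-to-depth map instead of a single entry, and it assumes Python 3.8 or later (for the jump argument of dis.stack_effect). It does not model exception handlers, so treat the result as approximate for code that uses try blocks.

```python
import dis

def stack_levels(code):
    """Map each instruction offset to the stack depth after it executes."""
    levels = {}
    jump_levels = {}
    depth = 0
    for ins in dis.get_instructions(code):
        if ins.offset in jump_levels:
            depth = jump_levels[ins.offset]
        if ins.opcode in dis.hasjabs or ins.opcode in dis.hasjrel:
            # Record the depth the jump target will see if the jump is taken.
            taken = dis.stack_effect(ins.opcode, ins.arg, jump=True)
            jump_levels.setdefault(ins.argval, depth + taken)
        depth += dis.stack_effect(ins.opcode, ins.arg, jump=False)
        levels[ins.offset] = depth
    return levels

def add(a, b):
    return a + b

levels = stack_levels(add.__code__)
deepest = max(levels.values())
print(levels, deepest, add.__code__.co_stacksize)
```

For a straight-line function like add, the deepest level should not exceed the code object's co_stacksize, which gives a quick sanity check on the computed depths.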

I've tried to do this in this package. As others point out, the main difficulty is in determining the top of the Python stack. I try to do this with some heuristics, which I've documented here.
The overall idea is that by the time my snapshotting function is called, the stack consists of the locals (as you point out), the iterators of nested for loops, and any exception triplets currently being handled. There's enough information in Python 3.6 & 3.7 to recover these states and therefore the stacktop.
I also relied on a tip from user2357112 to pave a way to making this work in Python 3.8.

Related

Code block in python in order to free memory

Pretty simple question:
I have some code to show some graphs, and it prepares data for the graphs, and I don't want to waste memory (limited)... is there a way to have a "local scope" so when we get to the end, everything inside is freed?
I come from C++ where you can define code inside { ... } so at the end everything is freed, and you don't have to care about anything
Anything like that in python?
The only thing I can think of is:
def tmp():
    ... code ...
tmp()
but is very ugly, and for sure I don't want to list all the del x at the end
If anything holds a reference to your object, it cannot be freed. By default, anything at the global scope is going to be held in the global namespace (globals()), and as far as the interpreter knows, the very next line of source code could reference it (or, another module could import it from this current module), so globals cannot be implicitly freed, ever.
This forces your hand to either explicitly delete references to objects with del, or to put them within the local scope of a function. This may seem ugly, but if you follow the philosophy that a function should do one thing and one thing well (thanks Unix!), you will already segment your code into functions already. On the one-off exceptions where you allocate a lot of memory early on in your function, and no longer need it midway through, you can del the reference to it.
I know this isn't the answer you want to hear, but it's the reality of Python. You could accomplish something similar by nesting function defs or classes inside, but this is kinda hacky (or in the class case, which wouldn't require calling/instantiating, extremely hacky).
I will also mention, there is a gc built in module for interacting with the garbage collector. Here, you can trigger an immediate garbage collection (otherwise python will eventually get around to collecting the things you del refs to), as well as inspect how many references a given object has.
If you're curious where the allocations are happening, you can also use the built in tracemalloc module to trace said allocations.
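A rough sketch of how these pieces fit together, assuming nothing beyond the standard library: the large list lives only inside a function (the hypothetical summarize), and tracemalloc reports how much memory is still allocated afterwards compared with the peak. The list size is arbitrary.

```python
import gc
import tracemalloc

def summarize():
    # The big list is local, so its last reference disappears on return.
    data = [i * i for i in range(200_000)]
    return sum(data)

tracemalloc.start()
total = summarize()
gc.collect()                         # optional: force a collection pass now
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"still allocated: {current} bytes, peak: {peak} bytes")
```

The gap between the peak and the current figure is the memory that was released when the function returned, without any explicit del.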
The mechanism that handles freeing memory in Python is called the "garbage collector", which means there's no reason to use del in the overwhelming majority of Python code.
When programming in Python, you are "not supposed" to care about such low level things as allocating and freeing memory for your variables.
That being said, putting your code into functions (although preferably called something clearer than tmp()) is most definitely a good idea, as it will make your code much more readable and "Pythonic".
Coming from C++, you have already stumbled on one of the main differences (drawbacks) of Python: memory management. Python's garbage collector will delete objects once they fall out of scope. Freeing the memory of objects, though, doesn't guarantee that this memory will actually return to the system; instead, a rather big portion may be kept reserved by the Python program even if it is not used. If you face a memory problem and you want to give memory back to the system, the only safe method is to run the memory-intensive function in a separate process. Every Python process has its own interpreter, and any memory consumed by the process returns to the system when the process exits.

How to find a C stack pointer associated with execution of a CPython stack frame

Update: If it helps narrow down the question for anyone, this question is really more about the CPython API and whether or not I'm missing some way to reach information that I need. I'm not asking for solutions to a broader problem, but rather in working on a broader problem I hit upon a specific question about CPython and whether or not it provided a way that was not obvious to me to obtain some specific information. I only tagged the question c because by its nature it requires some C expertise, but it is not a general question about C or specific architectures/platforms.
See also the note below about one possible approach using PyEval_SetTrace, though I was hoping there might be a better way. As another example, there exists a PyMain_GetArgcArgv which would do the trick here, but only if the Python interpreter were started from the python executable rather than embedded (which might be an acceptable limitation). Also PyMain_GetArgcArgv is not documented as part of the API.
I would like to be able to find the address of a C stack frame (i.e. the __builtin_frame_address(0) as defined appropriately for that platform) that is most closely associated with a Python stack frame. In particular I'd like to find the outer-most frame--or close to it--associated with a Python function call, to be defined better below.
The context, to summarize, is that I'm wrapping a C library that uses an obscure custom-purpose garbage collector which needs a pointer to the bottom of the stack--at least as far back as there are local variables pointing to objects that should be tracked by the GC. Ideally I could mark the bottom of the stack once; in this case since it is being wrapped in a Python module it is sufficient to go down to the outer-most Python stack frame. The best available alternative would be to manually mark the stack bottom whenever entering calls to the library, but this is not ideal, and also would require patching to the library (which may be needed either way), as it currently only allows setting the stack bottom address once, during an initialization function.
How exactly a Python stack frame is associated with a C stack frame is ill-defined as it is, as there is technically no hard-and-fast connection between the two. However, for the practical purpose at hand it would be at or close to (depending on compiler optimizations, etc.) the PyEval_EvalFrameEx call for the frame being executed (I'm not interested in frames that are not currently on the call stack since it's obviously a meaningless question in that case).
This is all obviously very CPython-specific and that's OK for my purposes. That being the case, there's no reason technically that the CPython PyFrameObject struct implementation couldn't carry information like this on one of its members, but as far as I can tell there's nothing specifically stored on PyFrameObjects that would allow me to associate it with a C stack frame. For example, my problem would be "solved" well-enough, for the purposes of this application, if there were something in PyFrameObject like f_cstack that were used like:
PyObject* _Py_HOT_FUNCTION
_PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
{
...
f->f_executing = 1;
f->f_cstack = &f;
...
}
This would work AFAICT--even though f is typically passed in a register, my gcc will handle code like this by pushing f on the stack and storing its address on the stack. Unfortunately there is currently nothing like this I can find.
The best idea I've been able to come up with would be to register a PyEval_SetTrace handler, which would be called upon entering Python stack frames and thus give me the opportunity to root around the stack from there. But really for the application at hand I only need to be able to find the "outer-most" PyEval_EvalFrameEx call, which there will be one of for any running Python code. So installing a trace callback won't necessarily get me that, and it's additional overhead I don't need for every function call.
I fear there is not currently a good solution to this, though it would be handy if there were.
(P.S. I'm also only concerned about the main stack, and not threads, though any solution that would work on the main thread would likely have a similar solution on auxiliary threads).
In general and in principle, you probably cannot always do what you want (it is well known that C implementations might not even need any call stack in some cases), since compilers like GCC (or Clang) are sometimes capable of tail-call optimizations (which, combined with link-time optimizations, could give surprising results). Some calling conventions or compilation modes (e.g. gcc -fomit-frame-pointer -m32 on 32-bit x86) make traversal of the call stack difficult (at least, without additional data).
In practice, you should investigate using the GNU backtrace function and even better Ian Taylor's libbacktrace. This libbacktrace library parses DWARF debug information (so it might be Linux specific and perhaps won't work on Windows). On Linux, dladdr(3) is able to get a symbol name close to a given address.
So you'd better compile both your main program and the Python runtime (and perhaps additional libraries) with the -g flag passed to gcc or g++ (to get DWARF debug information), then use libbacktrace. Remember that GCC is able to handle both -g and optimization flags like -O2 at the same time. The performance of the binary or library does not suffer (since the optimizations are still done by the GCC compiler).
For hunting memory leaks (which was indirectly mentioned in some comment, but not in the question itself), some tools are available (e.g. valgrind). Asking if they are adequate for a mixed Python + C program is a different question.
Garbage collection bugs are painful to hunt (and I have written several GCs myself, notably in my obsolete GCC MELT and in my bismon, so I speak from experience; read also the GC handbook). Mixing a GC with another one (Python's refcounting mechanism is a GC mechanism) is painful and brittle. It could be more reasonable in practice to split your software into several processes using inter-process communication facilities (and these are operating system specific).
Since CPython is free software, you might fork it to add libbacktrace support inside (and doing that should be reasonably easy, technically speaking).

How do I proceed with memory, .so filenames and hex offsets

Don't flame me for this but it's genuine. I am writing a multi threaded python application that runs for a very long time, typically 2-3 hours with 10 processes. This machine isn't slow it's just a lot of calculations.
The issue is that sometimes the application will hang about 85-90% of the way there because of outside tools.
I've broken this test up into smaller pieces that can then run successfully but the long running program hangs.
for example let's say I have to analyze some data on a list that 100,000,000 items long.
Breaking it up into twenty 5,000,000 lists all the smaller parts runs to completion.
Trying to do the 100,000,000 project it hangs towards the end. I use some outside tools that I cannot change so I am just trying to see what's going on.
I setup Dtrace and run
sudo dtrace -n 'syscall:::entry / execname == "python2.7" / { @[ustack()] = count() }'
on my program right when it hangs and I get an output like the code sample below.
libc.so.7`__sys_recvfrom+0xa
_socket.so`0x804086ecd
_socket.so`0x8040854ac
libpython2.7.so.1`PyEval_EvalFrameEx+0x52d7
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800b3317d
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800abb5a1
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`0x800aa3855
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`PyEval_EvalFrameEx+0x4de2
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800abb5a1
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`0x800aa3855
libpython2.7.so.1`PyObject_Call+0x64
that code just repeats over and over. I tried looking into the Dtrace python probes but those seems busted two sides from Tuesday so this might be the closest that I'll get.
My question: I have a fuzzy idea that libpython2.7.so.1 is the shared library that holds the function PyObject_Call at a hex offset of 0x64.
Is that right?
How can I decipher this? I don't know what to even call this so that I can google for answers or guides.
You should probably start by reading Showing the stack trace from a running Python application.
Your specific
question was about the interpretation of DTrace's ustack() action and
so this reply may be more than you need. This is because one of the
design principles of DTrace is to show the exact state of a system.
So, even though you're interested in the Python aspect of your
program, DTrace is revealing its underlying implementation.
The output you've presented is a stack, which is a way of
describing the state of a thread at a specific point in its
execution. For example, if you had the code
void c(void) { pause(); }
void b(void) { c(); }
void a(void) { b(); }
and you asked for a stack whilst execution was within pause() then
you might see something like
pause()
c()
b()
a()
Whatever tool you use will find the current instruction and its
enclosing function before finding the "return address", i.e. the
point to which that function will eventually return; repeating this
procedure yields a stack. Thus, although the stack should be read
from the top to the bottom as a series of return addresses, it's typically
read in the other direction as a series of callers. Note that
subtleties in the way that the program's corresponding
instructions are assembled mean that this second interpretation
can sometimes be misleading.
To extend the example above, it's likely that a(), b() and c() are
all present within the same library --- and that there may be
functions with the same names in other libraries. Thus it's
useful to display, for each function, the object to which it
belongs. Thus the stack above could become
libc.so`pause()
libfoo.so`c()
libfoo.so`b()
libfoo.so`a()
This goes some way towards allowing a developer to identify how a
program ended up in a particular state: function c() in libfoo
has called pause(). However, there's more to be done: if c()
looked like
void c() {
    pause();
    pause();
}
then in which call to pause() is the program waiting?
The functions a(), b() and c() will be sequences
of instructions that will typically occupy a contiguous region of
memory. Calling one of the functions involves little more than
making a note of where to return when finished (i.e. the return
address) and then jumping to whichever memory address corresponds
to the function's start. Functions' start addresses and sizes are
recorded in a "symbol table" that is embedded in the object; it's
by reading this table that a debugger is able to find the function
that contains a given location such as a return address. Thus a
specific point within a function can be described by an offset,
usually expressed in hex, from the start. So an even better
version of the stack above might be
libc.so`pause()+0x12
libfoo.so`c()+0x42
libfoo.so`b()+0x12
libfoo.so`a()+0x12
At this point, the developer can use a "disassembler" on libfoo.so
to display the instructions within c(); comparison with c()'s
source code would allow him to reveal the specific line from which
the call to pause() was made.
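The lookup described above (find the function whose address range contains a return address, then print the offset from its start) can be sketched in a few lines. Everything here is hypothetical: the symbol table, its addresses and sizes, and the helper name symbolize are invented for illustration, and real tools read this data from the object file instead.

```python
import bisect

# Hypothetical symbol table: (start address, size, name), sorted by start.
SYMBOLS = [
    (0x1000, 0x40, "a"),
    (0x1040, 0x30, "b"),
    (0x1070, 0x80, "c"),
]
STARTS = [start for start, _, _ in SYMBOLS]

def symbolize(addr, module="libfoo.so"):
    """Render an address as module`function+offset, or as a raw address."""
    i = bisect.bisect_right(STARTS, addr) - 1   # last symbol starting at or before addr
    if i >= 0:
        start, size, name = SYMBOLS[i]
        if addr < start + size:
            return f"{module}`{name}+{addr - start:#x}"
    # No symbol covers this address, so fall back to the raw value.
    return f"{module}`{addr:#x}"

print(symbolize(0x10b2))
print(symbolize(0x2000))
```

The fallback branch is what produces frames that show only a module name and a raw address: the address lies inside the library, but no table entry covers it.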
Before concluding this description of stacks, it's worth making
one more observation. Given the presence of sufficient "debug
data" in a library such as libfoo, a better debugger would be able
to go the extra mile and display the source code file name and
line number instead of the hexadecimal offset for each "frame" in
the stack.
So now, to return to the stack in your question,
libpython(2.7.so.1) is a library whose functions perform the job
of executing a Python script. Functions in the Python script are
converted into executable instructions on the fly, so my guess is
that the fragment
libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
means that PyEval_EvalFrameEx() is functionality within libpython
itself that calls a Python function (i.e. something written in
Python) that resides in memory near the address 0x800b33250. A
simple debugger can see that this address belongs to libpython but
won't find a corresponding entry in the library's symbol table;
left with no choice, it simply prints the "raw" address.
So, you need to look at the Python script to see what it's
doing but, unfortunately, there's no indication of the names of
the functions in the Python component of the stack.
There are a few ways to proceed. The first is to find a
version of libpython, if one exists, with a "DTrace helper". This
is some extra functionality that lets DTrace see the state of the
Python program itself in addition to the surrounding
implementation. The result is that each Python frame would be
annotated with the corresponding point in the Python source code.
Another, if you're on Solaris, is to use pstack(1); this has
native support for Python.
Finally, try a specific Python debugger.
It's also worth pointing out that your dtrace invocation will show
you all the stacks seen, sorted by popularity, whenever the
program "python2.7" makes a system call. From your description,
this probably isn't what you want. If you're trying to understand
the behaviour of a hang then you probably want to start with a
single snapshot of the python2.7 process at the time of the
hang.

Counting number of symbols in Python script

I have a Telit module which runs Python 1.5.2+ (http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16k) and the amount of RAM (~1 MB). Refer to pages 113 and 114 for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and memory usage (stack and heap usage).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object[, default]), provided that it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even so, this doesn't always work, and it will raise a TypeError when an object's size cannot be determined if you don't specify the default parameter.
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
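As a rough illustration of the reflection approach combined with sys.getsizeof, here is a sketch for a desktop CPython 3 interpreter. It will not run as-is on the module's Python 1.5.2 interpreter, and the helper name symbol_report and the sample namespace are made up. The sizes are shallow: a container's size excludes the objects it refers to.

```python
import sys

def symbol_report(namespace):
    """Return (name, type name, shallow size in bytes) for each entry."""
    report = []
    for name, value in sorted(namespace.items()):
        # The second argument is returned when the size cannot be determined.
        size = sys.getsizeof(value, -1)
        report.append((name, type(value).__name__, size))
    return report

example = {"count": 3, "label": "modem", "samples": [1, 2, 3]}
report = symbol_report(example)
for name, kind, size in report:
    print(f"{name:10} {kind:6} {size}")
print("symbols:", len(report))
```

Passing vars(some_module) or globals() instead of the sample dictionary gives a count of the names defined at module level.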
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post makes me recall my pain once with Telit GM862-GPS modules. My code was exactly at the point that the number of variables, strings, etc. added up to the limit. Of course, I didn't know this fact by then. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer because the Python interpreter is not a full-fledged version. What I did was to reuse the same local variable names as much as possible. I also deleted docstrings for functions (those count too) and replaced them with # comments.
In the end, I want to say that this module is good for small applications. The python interpreter does not support threads or interrupts so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.

Are there any Python reference counting/garbage collection gotchas when dealing with C code?

Just for the sheer heck of it, I've decided to create a Scheme binding to libpython so you can embed Python in Scheme programs. I'm already able to call into Python's C API, but I haven't really thought about memory management.
The way mzscheme's FFI works is that I can call a function, and if that function returns a pointer to a PyObject, then I can have it automatically increment the reference count. Then, I can register a finalizer that will decrement the reference count when the Scheme object gets garbage collected. I've looked at the documentation for reference counting, and don't see any problems with this at first glance (although it may be sub-optimal in some cases). Are there any gotchas I'm missing?
Also, I'm having trouble making heads or tails of the cyclic garbage collector documentation. What things will I need to bear in mind here? In particular, how do I make Python aware that I have a reference to something so it doesn't collect it while I'm still using it?
Your link to http://docs.python.org/extending/extending.html#reference-counts is the right place. The Extending and Embedding and Python/C API sections of the documentation are the ones that will explain how to use the C API.
Reference counting is one of the annoying parts of using the C API. The main gotcha is keeping everything straight: Depending on the API function you call, you may or may not own the reference to the object you get. Be careful to understand whether you own it (and thus cannot forget to DECREF it or give it to something that will steal it) or are borrowing it (and must INCREF it to keep it and possibly to use it during your function). The most common bugs involving this are 1) remembering incorrectly whether you own a reference returned by a particular function and 2) believing you're safe to borrow a reference for a longer time than you are.
You do not have to do anything special for the cyclic garbage collector. It's just there to patch up a flaw in reference counting and doesn't require direct access.
The biggest gotcha I know with ref counting and the C API is the __del__ thing. When you have a borrowed reference to something, you think you can get away without INCREF'ing because you don't give up the GIL while you use that reference. But, if you end up deleting an object (by, for example, removing it from a list), it's possible that you trigger a __del__ call, which might remove the reference you're borrowing from under your feet. Very tricky.
If you INCREF (and then DECREF, of course) all borrowed references as soon as you get them, there shouldn't be any problem.
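To see the take-a-reference, release-it-later pattern from the foreign side without writing C, here is a small sketch using ctypes as a stand-in for the Scheme FFI. Py_IncRef and Py_DecRef are the function forms of the INCREF/DECREF macros; the variable names are arbitrary, and the comments about the finalizer describe the intended Scheme-side behaviour rather than anything this snippet does.

```python
import ctypes
import sys

# Py_IncRef and Py_DecRef are the function forms of Py_INCREF / Py_DECREF.
incref = ctypes.pythonapi.Py_IncRef
decref = ctypes.pythonapi.Py_DecRef
for func in (incref, decref):
    func.argtypes = [ctypes.py_object]
    func.restype = None

obj = object()
before = sys.getrefcount(obj)
incref(obj)                 # the foreign side now owns one reference
held = sys.getrefcount(obj)
decref(obj)                 # what the finalizer would do on collection
after = sys.getrefcount(obj)
print(before, held, after)
```

As long as every incref is matched by exactly one decref, Python's own accounting is left as it was found.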
