Accessing Python Objects in a Core Dump

Is there any way to discover the Python value of a PyObject* from a core file in gdb?

It's a lot of work, but of course it can be done, especially if you have all the symbols. Look at the header files for the specific version of Python (and the compilation options used to build it): they define PyObject as a struct which includes, first and foremost, a pointer to a type. Lots of macros are used, so you may want to recompile that Python from source with exactly the same flags plus an added -E to stop after preprocessing, so you can refer to the specific C code that produced the bits you're seeing in the core dump.
A type object has, among many other things, a string (array of char) that's its name, and from it you can infer what exactly objects of that type contain -- be it content directly, or maybe some content (such as a length, i.e. number of items) and a pointer to the actual data.
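As a purely illustrative sketch of that layout, here is what the start of every object looks like when mirrored with ctypes on a recent 64-bit, non-debug CPython; the PyObjectHeader name is made up, and the exact fields and sizes depend on the Python version and build flags, so use it only as a map of what to chase in the dump:

import ctypes

# Mirrors the PyObject header described above: a reference count followed by
# a pointer to the type object (whose tp_name is the string you would read
# to identify the object's type in the core dump).
class PyObjectHeader(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),  # reference count
        ("ob_type", ctypes.c_void_p),     # pointer to the PyTypeObject
    ]

x = "hello"
hdr = PyObjectHeader.from_address(id(x))  # id() is the object's address in CPython
print(hdr.ob_refcnt, hex(hdr.ob_type))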
I've done such super-advanced post-mortem debugging a couple of times (starting with VERY precise knowledge of the Python versions involved and all the prepared preprocessed sources &c) and each time it took me a day or two (were I still a freelancer charging by the hour, if I had to bid on such a task I'd say at least 20 hours -- at my not-cheap hourly rates!-).
IOW, it's worth it only if it's really truly the only way out of some very costly pickle. On the plus side, it WILL teach you more about Python's internals than you ever thought was there, even after memorizing every line of the sources. Good luck, you'll need some!!!

Related

What is the best or proper way to allow debugging of generated code?

For various reasons, in one project I generate executable code by generating an AST from various source files, then compiling that to bytecode (though the question could also apply to cases where the bytecode is generated directly, I guess).
From some experimentation, it looks like the debugger more or less just uses the lineno information embedded in the AST, alongside the filename passed to compile(), to provide a representation for its purposes; however, this assumes the code being executed comes from a single on-disk file.
That is not necessarily the case for my project: the executable code can be pieced together from multiple sources, and some or all of these sources may have been fetched over the network or retrieved from non-disk storage (e.g. a database).
And so my Y questions, which may be the wrong ones (hence the background):
is it possible to provide a memory buffer of some sort, or is it necessary to generate a singular on-disk representation of the "virtual source"?
how well would the debugger deal with jumping around between the different bits and pieces if the virtual source can't or should not be linearised[0]
and just in case, is the assumption of Python only supporting a single contiguous source file correct or can it actually be fed multiple sources somehow?
[0] for instance a web-style literate program would be debugged in its original form, jumping between the code sections, not in the so-called "tangled" form
Some of this can be handled by the trepan3k debugger. For other things various hooks are in place.
First of all, it can debug based on bytecode alone. But of course stepping by line won't be possible if the line number table doesn't exist. For that reason if for no other, I would add a "line number" for each logical stopping point, such as the beginning of each statement. The numbers don't have to be line numbers; they could just count from 1, or be indexes into some other table. This is more or less how Go's Pos type works.
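As a minimal sketch of that idea (plain CPython, nothing trepan3k-specific; the <generated> file name is made up), you can stitch generated sources together and hand out synthetic line numbers yourself before compiling:

import ast

src_a = "x = 1"
src_b = "y = x + 1\nprint(y)"

# Merge two generated sources into one module and renumber each top-level
# statement 1, 2, 3, ... against a synthetic file name, so tracebacks and
# debuggers get one distinct stopping point per statement.
tree = ast.parse(src_a)
tree.body += ast.parse(src_b).body
for i, stmt in enumerate(tree.body, start=1):
    for node in ast.walk(stmt):
        if hasattr(node, "lineno"):
            node.lineno = node.end_lineno = i

code = compile(tree, filename="<generated>", mode="exec")
exec(code)  # prints 2; locations now point into <generated> at the synthetic lines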
The debugger will let you set a breakpoint on a function, but that function has to exist, and when you start any Python program most of the functions you define don't exist yet. So the typical way to do this is to modify the source to call the debugger at some point. In trepan3k the lingo for this is:
from trepan.api import debug; debug()
Do that at a place where the other functions you want to break on have already been defined.
And the functions can be specified as methods on existing variables, e.g. self.my_function()
One of the advanced features of this debugger is that it will decompile the bytecode to produce source code. There is a command called deparse which will show you the context around where you are currently stopped.
Deparsing bytecode is a bit difficult, though, so depending on which kind of bytecode you have, the results may vary.
As for the virtual source problem, well that situation is somewhat tolerated in the debugger, since that kind of thing has to go on when there is no source. And to facilitate this and remote debugging (where the file locations locally and remotely can be different), we allow for filename remapping.
Another library, pyficache, is used for this remapping; I believe it has the ability to remap contiguous lines of one file into lines in another file, and I think you could use this over and over again. However, so far there hasn't been a need for this, and that code is pretty old, so someone would have to beef up trepan3k here.
Lastly, related to trepan3k is trepan-xpy, a CPython bytecode debugger which can step bytecode instructions even when the line number table is empty.

Is Python memory-safe?

With Deno being the new Node.js rival and all, the memory-safe nature of Rust has been mentioned in a lot of news articles; one particular piece stated that Rust and Go are good for their memory-safe nature, as are Swift and Kotlin, but the latter two are not that widely used for systems programming.
Safe Rust is the true Rust programming language. If all you do is write Safe Rust, you will never have to worry about type-safety or memory-safety. You will never endure a dangling pointer, a use-after-free, or any other kind of Undefined Behavior.
This piqued my interest in understanding whether Python can be regarded as memory-safe and, if yes or no, how safe or unsafe.
From the outset, the Wikipedia article on memory safety does not even mention Python, and the article on Python seems to mention only memory management.
The closest I've come to finding an answer was this one by Daniel:
The Wikipedia article associates type-safe with memory-safe, meaning that the same memory area cannot be accessed as e.g. integer and string. In this way Python is type-safe. You cannot change the type of an object implicitly.
But even this only seems to imply a connection between two aspects (using an association from Wikipedia, which again is debatable) and no definitive answer on whether Python can be regarded as memory-safe.
Wikipedia lists the following examples of memory safety issues:
Access errors: invalid read/write of a pointer
Buffer overflow - out-of-bound writes can corrupt the content of adjacent objects, or internal data (like bookkeeping information for the heap) or return addresses.
Buffer over-read - out-of-bound reads can reveal sensitive data or help attackers bypass address space layout randomization.
Python at least tries to protect against these.
Race condition - concurrent reads/writes to shared memory
That's actually not that hard to do in languages with mutable data structures. (Advocates of functional programming and immutable data structures often use this fact as an argument in their favor).
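Here is a small sketch of such a race in Python: even with the GIL, counter += 1 is several bytecodes, so threads can interleave between the read and the write and updates get lost:

import threading

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        counter += 1  # read-modify-write: not atomic

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # can be well below 800_000 when updates are lost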
Invalid page fault - accessing a pointer outside the virtual memory space. A null pointer dereference will often cause an exception or program termination in most environments, but can cause corruption in operating system kernels or systems without memory protection, or when use of the null pointer involves a large or negative offset.
Use after free - dereferencing a dangling pointer storing the address of an object that has been deleted.
Uninitialized variables - a variable that has not been assigned a value is used. It may contain an undesired or, in some languages, a corrupt value.
Null pointer dereference - dereferencing an invalid pointer or a pointer to memory that has not been allocated
Wild pointers arise when a pointer is used prior to initialization to some known state. They show the same erratic behaviour as dangling pointers, though they are less likely to stay undetected.
There's no real way to prevent someone from trying to access a null pointer. In C# and Java, this results in an exception. In C++, this results in undefined behavior.
Memory leak - when memory usage is not tracked or is tracked incorrectly
Stack exhaustion - occurs when a program runs out of stack space, typically because of too deep recursion. A guard page typically halts the program, preventing memory corruption, but functions with large stack frames may bypass the page.
Memory leaks in languages like C#, Java, and Python have different meanings than they do in languages like C and C++ where you manage memory manually. In C or C++, you get a memory leak by failing to deallocate allocated memory. In a language with managed memory, you don't have to explicitly de-allocate memory, but it's still possible to do something quite similar by accidentally maintaining a reference to an object somewhere even after the object is no longer needed.
This is actually quite easy to do with things like event handlers in C# and long-lived collection classes; I've actually worked on projects where there were memory leaks in spite of the fact that we were using managed memory. In one sense, working with an environment that has managed memory can actually make these issues more dangerous because programmers can have a false sense of security. In my experience, even experienced engineers often fail to do memory profiling or write test cases to check for this (likely due to the environment giving them a false sense of security).
Stack exhaustion is quite easy to do in Python too (e.g. with infinite recursion).
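A minimal illustration; CPython converts runaway recursion into a RecursionError before the C stack is actually exhausted (subject to the caveat above about functions with large stack frames):

import sys

def recurse():
    return recurse()  # no base case

try:
    recurse()
except RecursionError:
    print("hit the recursion limit of", sys.getrecursionlimit())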
Heap exhaustion - the program tries to allocate more memory than the amount available. In some languages, this condition must be checked for manually after each allocation.
Still quite possible - I'm rather embarrassed to admit that I've personally done that in C# (although not in Python yet).
Double free - repeated calls to free may prematurely free a new object at the same address. If the exact address has not been reused, other corruption may occur, especially in allocators that use free lists.
Invalid free - passing an invalid address to free can corrupt the heap.
Mismatched free - when multiple allocators are in use, attempting to free memory with a deallocation function of a different allocator.
Unwanted aliasing - when the same memory location is allocated and modified twice for unrelated purposes.
Unwanted aliasing is actually quite easy to do in Python. Here's an example in Java (full disclosure: I wrote the accepted answer); you could just as easily do something quite similar in Python. The others are managed by the Python interpreter itself.
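A Python sketch of the same kind of unwanted aliasing (not the linked Java example): two names end up referring to one mutable object, so a change made for one purpose shows up somewhere unrelated:

defaults = ["-Wall"]
release_flags = defaults      # aliased, not copied
release_flags.append("-O2")   # meant to affect only release_flags...
print(defaults)               # ['-Wall', '-O2'] -- ...but defaults changed too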
So, it would seem that memory-safety is relative. Depending on exactly what you consider a "memory-safety issue," it can actually be quite difficult to entirely prevent. High-level languages like Java, C#, and Python can prevent many of the worst of these errors, but there are other issues that are difficult or impossible to completely prevent.

Python strings are immutable, so why does s.split() return a list of new strings?

By looking at the CPython implementation it seems the return value of a string split() is a list of newly allocated strings. However, since strings are immutable it seems one could have made substrings out of the original string by pointing at the offsets.
Am I understanding the current behavior of CPython correctly? Are there reasons for not opting for this space optimization? One reason I can think of is that the parent string cannot be freed until all its substrings are.
Without a crystal ball I can't tell you why CPython does it that way. However, there are some reasons why you might choose to do it that way.
The problem is that a small string might hold a reference to a much larger backing array. For example, suppose I read in an 8 GB HTTP access log file to analyze which user agents access my file the most, and I do that just with fp.read() and then run a regex on the whole file at once rather than going one line at a time.
I want to know about the top 10 most common user agents, so I keep this around in a list.
Then I want to do the same analysis for 100 other files, to see how the top 10 user agents have changed over time. Boom! My program is trying to use 800 GB of memory and gets killed. Why? How do I debug this?
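You can't reproduce this with str in current CPython because slicing copies, but memoryview really does share its underlying buffer, so it serves as a stand-in to illustrate the retention hazard the sharing design would introduce:

# Illustrative stand-in: a tiny "substring" keeps the whole ~100 MB buffer alive.
big = bytearray(100_000_000)   # pretend this is the 8 GB log file
agent = memoryview(big)[:20]   # keep only a small slice of it
del big                        # the buffer cannot be released while agent exists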
Java used this sharing technique prior to Java 7, so the same reasoning applies. See Java 7 String - substring complexity and JDK-4513622: (str) keeping a substring of a field prevents GC for object.
Also note that having strings share memory would require you to follow a pointer from the string object to the string data. In CPython, the string data is usually placed directly after a header in memory, so you don't need to follow a pointer. This reduces the number of allocations required and reduces data dependencies when reading strings.
In the current CPython implementation, strings are reference-counted; it is assumed that a string cannot hold references to other objects because a string is not a container. This means that garbage collection does not need to inspect or trace over string objects (because they're entirely covered by the reference counting). But it's actually worse than that: Old versions of Python did not have a tracing garbage collector at all; GC was new in 2.0. Before that, any cyclic garbage would simply leak.
A competently-implemented substring-to-offset algorithm should not form cycles. So in theory, a cyclic garbage collector is not a prerequisite for this. However, because we're doing reference counting instead of tracing, the child objects become responsible for Py_DECREF()ing their parent objects at end-of-life. Otherwise the parent leaks. This means you cannot just chuck the whole string into the free list when it reaches end-of-life; you have to check whether it's a substring, and branching is potentially expensive. Python was historically designed to do string processing (like Perl, but with nicer syntax), which means creating and destroying a lot of strings. Furthermore, all variable names are internally stored as strings, so even if the user is not doing string processing, the interpreter is. Slowing down the string deallocation process by even a little could have a serious impact on performance.
CPython internally uses NUL-terminated strings in addition to storing a length. This is a very early design choice, present since the very first version of Python, and still true in the latest version.
You can see that in Include/unicodeobject.h where PyASCIIObject says "wchar_t representation (null-terminated)" and PyCompactUnicodeObject says "UTF-8 representation (null-terminated)". (Recent CPython implementations select from one of 4 back-end string types, depending on the Unicode encoding needs.)
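You can observe the representation choice indirectly from Python: the per-character cost jumps from 1 to 2 to 4 bytes as the widest code point in the string grows (exact header overhead varies by version):

import sys

for s in ("abcd",            # ASCII/Latin-1: 1 byte per character
          "abc\u20ac",       # needs UCS-2: 2 bytes per character
          "abc\U0001F600"):  # needs UCS-4: 4 bytes per character
    print(ascii(s), sys.getsizeof(s))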
Many Python extension modules expect a NUL terminated string. It would be difficult to implement substrings as slices into a larger string and preserve the low-level C API. Not impossible, as it could be done using a copy-on-C-API-access. Or Python could require all extension writers to use a new subslice-friendly API. But that complexity is not worthwhile given the problems found from experience in other languages which implement subslice references, as Dietrich Epp described.
I see little in Kevin's answer which is applicable to this question. The decision had nothing to do with the lack of circular garbage collection before Python 2.0, nor could it. Substring slices are implemented with an acyclic data structure. 'Competently-implemented' isn't a relevant requirement, as it would take a perverse sort of incompetence or malice to turn it into a cyclic data structure.
Nor would there necessarily be extra branch overhead in the deallocator. If the source string were one type and the substring slice another type, then Python's normal type dispatch would automatically use the correct deallocator, with no additional overhead. Even if there were an extra branch, we know that branching overhead in this case is not "expensive". Python 3.3 (because of PEP 393) has those 4 back-end Unicode types, and decides what to do based on branching. String access occurs much more often than deallocation, so any deallocation overhead due to branching would be lost in the noise.
It is mostly true that in CPython "variable names are internally stored as strings". (The exception is that local variables are stored as indices into a local array.) However, these names are also interned into a global dictionary using PyUnicode_InternInPlace(). There is therefore no deallocation overhead because these strings are not deallocated, outside of cases involving dynamic dispatch using non-interned strings, like through getattr().
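The same interning machinery is exposed to Python code as sys.intern(), which gives a quick way to see the sharing in action:

import sys

a = sys.intern("some_long_attribute_name")
b = sys.intern("".join(["some_long_", "attribute_name"]))  # built at runtime, then interned
print(a is b)  # True: both names refer to the single interned string object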

Counting number of symbols in Python script

I have a Telit module which runs [Python 1.5.2+](http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16k) and the amount of RAM (~1 MB). Refer to pages 113 and 114 of that document for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and the memory usage (stack and heap usage).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object, [default]), provided it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even then, this doesn't always work, and it will raise a TypeError when an object's size cannot be determined if you don't specify the default parameter.
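On a desktop CPython it looks like this (whether Telit's interpreter supports it at all is a separate question):

import sys

print(sys.getsizeof(42))        # an int, interpreter header included
print(sys.getsizeof("hello"))   # string header plus character data
# The optional second argument is returned, instead of raising TypeError,
# for objects that cannot report their size:
print(sys.getsizeof(object(), -1))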
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
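For example (assuming dis is available and the target interpreter is bytecode-based like CPython):

import dis

def greet(name):
    return "Hello, " + name

dis.dis(greet)  # shows the constants, names and instructions the function uses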
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post reminds me of my pain once with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this fact back then. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer, because the Python interpreter is not a full-fledged one. What I did was reuse the same local variable names as much as possible. I also deleted docstrings for functions (those count too) and replaced them with # comments.
In the end, I want to say that this module is good for small applications. The Python interpreter does not support threads or interrupts, so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.

Python3 int, long unification implementation

I just read through PEP 237, concerning the unification of ints and longs in Python 3000. The approach used there seems very interesting: a new type "integer" is created as the abstract base class of int and long, and operations on ints which result in very large numbers no longer raise an OverflowError; they return a long instead.
I'd like to see and try to understand the underlying implementation of this in Python3k. How should I go about that? Which files contain the details about "type" implementations?
So far I have only ventured out to read most of the non-C python stdlib modules; hence I am unclear on where exactly to look.
Start with Include/longobject.h and Objects/longobject.c. These paths are relative to the root of a Python source tree. Make sure to arm yourself with an editor suitable for browsing C code conveniently, or generate an interlinked HTML reference with GNU GLOBAL.
Also, it would surely help to read this article on internals of objects in Python 3, as well as its sequel.
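As a quick sanity check of the unified behavior, in Python 3 there is a single arbitrary-precision int type and no overflow boundary left to observe:

import sys

n = sys.maxsize + 1               # would have become a long in Python 2
print(type(n))                    # <class 'int'>
print(2 ** 100)                   # 1267650600228229401496703205376, no OverflowError
print(isinstance(2 ** 100, int))  # True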
