How do I proceed with memory, .so filenames and hex offsets

Don't flame me for this, but it's a genuine question. I am writing a multi-threaded Python application that runs for a very long time, typically 2-3 hours with 10 processes. The machine isn't slow; it's just a lot of calculations.
The issue is that sometimes the application will hang about 85-90% of the way through because of outside tools.
I've broken this test up into smaller pieces that each run successfully, but the long-running program hangs.
For example, let's say I have to analyze some data in a list that is 100,000,000 items long.
If I break it up into twenty lists of 5,000,000 items, all the smaller parts run to completion.
When I try to do the whole 100,000,000-item job, it hangs towards the end. I use some outside tools that I cannot change, so I am just trying to see what's going on.
I set up DTrace and ran
sudo dtrace -n 'syscall:::entry / execname == "python2.7" / { @[ustack()] = count() }'
on my program right when it hangs and I get an output like the code sample below.
libc.so.7`__sys_recvfrom+0xa
_socket.so`0x804086ecd
_socket.so`0x8040854ac
libpython2.7.so.1`PyEval_EvalFrameEx+0x52d7
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800b3317d
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800abb5a1
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`0x800aa3855
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`PyEval_EvalFrameEx+0x4de2
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800abb5a1
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`0x800aa3855
libpython2.7.so.1`PyObject_Call+0x64
That output just repeats over and over. I tried looking into the DTrace Python probes, but those seem busted two sides from Tuesday, so this might be the closest that I'll get.
My question: I have a fuzzy idea that libpython2.7.so.1 is the shared library that holds the function PyObject_Call, and that 0x64 is a hex offset into that function.
Is that right?
How can I decipher this? I don't even know what to call this so that I can google for answers or guides.

You should probably start by reading "Showing the stack trace from a running Python application".
Your specific
question was about the interpretation of DTrace's ustack() action and
so this reply may be more than you need. This is because one of the
design principles of DTrace is to show the exact state of a system.
So, even though you're interested in the Python aspect of your
program, DTrace is revealing its underlying implementation.
The output you've presented is a stack, which is a way of
describing the state of a thread at a specific point in its
execution. For example, if you had the code
void c(void) { pause(); }
void b(void) { c(); }
void a(void) { b(); }
and you asked for a stack whilst execution was within pause() then
you might see something like
pause()
c()
b()
a()
Whatever tool you use will find the current instruction and its
enclosing function before finding the "return address", i.e. the
point to which that function will eventually return; repeating this
procedure yields a stack. Thus, although the stack should be read
from the top to the bottom as a series of return addresses, it's typically
read in the other direction as a series of callers. Note that
subtleties in the way that the program's corresponding
instructions are assembled mean that this second interpretation
can sometimes be misleading.
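The same idea can be seen from inside Python itself. This is only a small sketch, assuming Python 3.5 or later (where the entries returned by traceback.extract_stack() have a .name attribute); the functions a, b and c simply mirror the C example above.

```python
import traceback

def c():
    # extract_stack() lists frames oldest-first; the last entry is this function
    return [entry.name for entry in traceback.extract_stack()]

def b():
    return c()

def a():
    return b()

names = a()

# Print the innermost frame first, the way the stack above is drawn
for name in reversed(names[-3:]):
    print(name + "()")
```

Anything listed before a in names belongs to whatever called the script (the module level, a test runner and so on).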
To extend the example above, it's likely that a(), b() and c() are
all present within the same library --- and that there may be
functions with the same names in other libraries. Thus it's
useful to display, for each function, the object to which it
belongs. Thus the stack above could become
libc.so`pause()
libfoo.so`c()
libfoo.so`b()
libfoo.so`a()
This goes some way towards allowing a developer to identify how a
program ended up in a particular state: function c() in libfoo
has called pause(). However, there's more to be done: if c()
looked like
void c() {
    pause();
    pause();
}
then in which call to pause() is the program waiting?
The functions a(), b() and c() will be sequences
of instructions that will typically occupy a contiguous region of
memory. Calling one of the functions involves little more than
making a note of where to return when finished (i.e. the return
address) and then jumping to whichever memory address corresponds
to the function's start. Functions' start addresses and sizes are
recorded in a "symbol table" that is embedded in the object; it's
by reading this table that a debugger is able to find the function
that contains a given location such as a return address. Thus a
specific point within a function can be described by an offset,
usually expressed in hex, from the start. So an even better
version of the stack above might be
libc.so`pause()+0x12
libfoo.so`c()+0x42
libfoo.so`b()+0x12
libfoo.so`a()+0x12
At this point, the developer can use a "disassembler" on libfoo.so
to display the instructions within c(); comparison with c()'s
source code would allow him to reveal the specific line from which
the call to pause() was made.
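To make the symbol-table lookup concrete, here is a toy sketch of how a tool might turn a raw address into the object`function+0xoffset form. The table contents, the addresses and the helper name symbolize are all made up for illustration; a real tool reads the table from the object file.

```python
import bisect

# Hypothetical symbol table: (start address, size in bytes, name), sorted by start
SYMBOLS = [
    (0x1000, 0x40, "a"),
    (0x1040, 0x40, "b"),
    (0x1080, 0x60, "c"),
]
STARTS = [start for start, _size, _name in SYMBOLS]

def symbolize(obj_name, address):
    """Render an address as object`function+0xoffset, or as raw hex if no symbol covers it."""
    # Find the last symbol whose start address is <= address
    i = bisect.bisect_right(STARTS, address) - 1
    if i >= 0:
        start, size, name = SYMBOLS[i]
        if address < start + size:
            return "%s`%s+0x%x" % (obj_name, name, address - start)
    # No enclosing symbol: all that can be printed is the address itself
    return "%s`0x%x" % (obj_name, address)

print(symbolize("libfoo.so", 0x10c2))   # inside c, 0x42 bytes from its start
print(symbolize("libfoo.so", 0x2000))   # not covered by any symbol
```

The second call shows why a stack can contain bare addresses: the address lies inside the object, but no entry in the table covers it.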
Before concluding this description of stacks, it's worth making
one more observation. Given the presence of sufficient "debug
data" in a library such as libfoo, a better debugger would be able
to go the extra mile and display the source code file name and
line number instead of the hexadecimal offset for each "frame" in
the stack.
So now, to return to the stack in your question,
libpython(2.7.so.1) is a library whose functions perform the job
of executing a Python script. Functions in the Python script are
converted into executable instructions on the fly, so my guess is
that the fragment
libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
means that PyEval_EvalFrameEx() is functionality within libpython
itself that calls a Python function (i.e. something written in
Python) that resides in memory near the address 0x800b33250. A
simple debugger can see that this address belongs to libpython but
won't find a corresponding entry in the library's symbol table;
left with no choice, it simply prints the "raw" address.
So, you need to look at the Python script to see what it's
doing but, unfortunately, there's no indication of the names of
the functions in the Python component of the stack.
There are a few ways to proceed. The first is to find a
version of libpython, if one exists, with a "DTrace helper". This
is some extra functionality that lets DTrace see the state of the
Python program itself in addition to the surrounding
implementation. The result is that each Python frame would be
annotated with the corresponding point in the Python source code.
Another, if you're on Solaris, is to use pstack(1); this has
native support for Python.
Finally, try a specific Python debugger.
It's also worth pointing out that your dtrace invocation will show
you all the stacks seen, sorted by popularity, whenever the
program "python2.7" makes a system call. From your description,
this probably isn't what you want. If you're trying to understand
the behaviour of a hang then you probably want to start with a
single snapshot of the python2.7 process at the time of the
hang.
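If you can modify the Python program, one way to get such a snapshot at the Python level is to have the process dump its own thread stacks on request. This is a sketch rather than a recipe: the helper names are made up, it relies on sys._current_frames() (a CPython implementation detail), and it assumes a Unix-like system where SIGUSR1 exists.

```python
import signal
import sys
import threading
import traceback

def format_all_thread_stacks():
    """Return a text snapshot of every thread's current Python stack."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):" % (thread_id, names.get(thread_id, "?")))
        chunks.extend(line.rstrip() for line in traceback.format_stack(frame))
    return "\n".join(chunks)

def install_dump_handler(signum=None):
    """Print a snapshot to stderr whenever the process receives the signal."""
    if signum is None:
        signum = getattr(signal, "SIGUSR1", None)  # not available on Windows
    if signum is None:
        return False
    signal.signal(
        signum,
        lambda _sig, _frame: sys.stderr.write(format_all_thread_stacks() + "\n"),
    )
    return True
```

Call install_dump_handler() early in each process, then send the signal (for example with kill -USR1 and the process id) once the program appears to hang. Python runs signal handlers in the main thread, so the dump only appears once the main thread gets back to executing bytecode.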

Related

What is the best or proper way to allow debugging of generated code?

For various reasons, in one project I generate executable code by generating an AST from various source files and then compiling that to bytecode (though the question could also work for cases where the bytecode is generated directly, I guess).
From some experimentation, it looks like the debugger more or less just uses the lineno information embedded in the AST alongside the filename passed to compile in order to provide a representation for the debugger's purposes, however this assumes the code being executed comes from a single on-disk file.
That is not necessarily the case for my project, the executable code can be pieced together from multiple sources, and some or all of these sources may have been fetched over the network, or been retrieved from non-disk storage (e.g. database).
And so my Y questions, which may be the wrong ones (hence the background):
is it possible to provide a memory buffer of some sort, or is it necessary to generate a singular on-disk representation of the "virtual source"?
how well would the debugger deal with jumping around between the different bits and pieces if the virtual source can't or should not be linearised[0]
and just in case, is the assumption of Python only supporting a single contiguous source file correct or can it actually be fed multiple sources somehow?
[0] for instance a web-style literate program would be debugged in its original form, jumping between the code sections, not in the so-called "tangled" form

Some of this can be handled by the trepan3k debugger. For other things various hooks are in place.
First of all it can debug based on bytecode alone. But of course stepping instructions won't be possible if the line number table doesn't exist. And for that reason if for no other, I would add a "line number" for each logical stopping point, such as at the beginning of statements. The numbers don't have to be line numbers, they could just count from 1 or be indexes into some other table. This is more or less how go's Pos type position works.
The debugger will let you set a breakpoint on a function, but that function has to exist, and when you start any Python program most of the functions you define don't exist yet. So the typical way to do this is to modify the source to call the debugger at some point. In trepan3k the lingo for this is:
from trepan.api import debug; debug()
Do that at a point where the other functions you want to break on have already been defined.
And the functions can be specified as methods on existing variables, e.g. self.my_function()
One of the advanced features of this debugger is that it will decompile the bytecode to produce source code. There is a command called deparse which will show you the context around where you are currently stopped.
Deparsing bytecode though is a bit difficult so depending on which kind of bytecode you get the results may vary.
As for the virtual source problem, well that situation is somewhat tolerated in the debugger, since that kind of thing has to go on when there is no source. And to facilitate this and remote debugging (where the file locations locally and remotely can be different), we allow for filename remapping.
Another library, pyficache, is used for this remapping; I believe it has the ability to remap contiguous lines of one file into lines in another file. And I think you could use this over and over again. However, so far there hasn't been a need for this, and that code is pretty old, so someone would have to beef up trepan3k here.
Lastly, related to trepan3k is trepan-xpy, a CPython bytecode debugger which can step bytecode instructions even when the line number table is empty.
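Independently of any particular debugger, the standard library offers one way to give in-memory source a "virtual" filename: register the text in linecache under the same name that is passed to compile(). The sketch below is an assumption about how this could be applied to the question, not something trepan3k requires; the helper name and the filename "<db:snippet-42>" are invented.

```python
import linecache
import traceback

def compile_virtual(source, virtual_name):
    """Compile source under a made-up filename and register its text with linecache."""
    lines = source.splitlines(True)
    # Cache entry layout: (size, mtime, lines, fullname).
    # mtime=None makes linecache.checkcache() leave the entry alone.
    linecache.cache[virtual_name] = (len(source), None, lines, virtual_name)
    return compile(source, virtual_name, "exec")

SOURCE = "def fetched():\n    return 1 / 0\n"
code = compile_virtual(SOURCE, "<db:snippet-42>")

namespace = {}
exec(code, namespace)
try:
    namespace["fetched"]()
except ZeroDivisionError:
    report = traceback.format_exc()

print(report)
```

The traceback shows both the virtual filename and the offending source line. Since every code object carries its own co_filename, fragments from different origins can be compiled under different virtual names; mapping several sources into one code object would still need something like the line remapping described above.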

Does Python load in function arguments into registers or does it keep them on the stack?

So I'm writing a function that takes in a tuple as an argument and does a bunch of stuff to it. Here is what that looks like:
def swap(self, location):
    if (location[0] < 0 or location[1] < 0 or
            location[0] >= self.r or location[1] >= self.c):
        return False
    self.board[0][0] = self.board[location[0]][location[1]]
    self.board[location[0]][location[1]] = 0
    self.empty = (location[0],location[1])
I'm trying to make my code as efficient as possible, so since I am not modifying the values of location, does it make sense to load the variables in registers (loc0 = location[0]; loc1 = location[1]) for faster computations (zero-cycle read) or is location already loaded into registers by the Python compiler when it's passed in as a function argument?
Edit: I bit the bullet and ran some tests. Here are the results (in seconds) for this function running 10 million times with the repeating inputs: "up", "down", "left", "right" (respectively)
Code as is:
run#1: 19.39
run#2: 17.18
run#3: 16.85
run#4: 16.90
run#5: 16.74
run#6: 16.76
run#7: 16.94
Code after defining location[0] and location[1] in the beginning of the function:
run#1: 14.83
run#2: 14.79
run#3: 14.88
run#4: 15.033
run#5: 14.77
run#6: 14.94
run#7: 14.67
That's an average of 16% increase in performance. Definitely not insignificant for my case. Of course, this is not scientific as I need to do more tests in more environments with more inputs, but enough for my simple use case!
Times measured using Python 2.7 on a Macbook Pro (Early 2015), which has a Broadwell i5-5257U CPU (2c4t max turbo 3.1GHz, sustained 2.7GHz, 3MB L3 cache).
IDE was: PyCharm Edu 3.5.1 JRE: 1.8.0_112-release-408-b6 x86_64 JVM: OpenJDK 64-Bit Server VM .
Unfortunately, this is for a class that grades based on code speed.

If you're using an interpreter, it's unlikely that any Python variables will live in registers between different expressions. You could look at how the Python source compiled to byte-code.
Python bytecode (the kind stored in files outside the interpreter) is stack-based (http://security.coverity.com/blog/2014/Nov/understanding-python-bytecode.html). This byte-code is then interpreted or JIT-compiled to native machine code. Regular python only interprets, so it's not plausible for it to keep python variables in machine registers across multiple statements.
An interpreter written in C might keep the top of the bytecode stack in a local variable inside an interpret loop, and the C compiler might keep that C variable in a register. So repeated use of the same Python variable might end up not having too many store/reload round-trips.
Note that store-forwarding latency on your Broadwell CPU is about 4 or 5 clock cycles, nowhere near the hundreds of cycles for a round-trip to DRAM. A store/reload doesn't even have to wait for the store to retire and commit to L1D cache; it's forwarded directly from the store buffer. (Related: http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ and http://agner.org/optimize/, and other links in the x86 tag wiki.) Load-use latency is also only 5 clock cycles for an L1D cache hit (latency from address being ready to data being ready; you can measure it by pointer-chasing through a linked list in asm). There's enough interpreter overhead (total number of instructions it runs to figure out what to do next) that this probably isn't even the bottleneck.
Keeping a specific python variable in a register is not plausible at all for an interpreter. Even if you wrote an interpreter in asm, the fundamental problem is that registers aren't addressable. An x86 add r14d, eax instruction has to have both registers hard-coded into the instruction's machine-code. (Every other ISA works the same way: register numbers are part of the machine-code for the instruction, with no indirection based on any data). Even if the interpreter did the work to figure out that it needed to "add reg-var #3 to reg-var #2" (i.e. decoding the bytecode stack operations back into register variables for an internal representation that it interprets), it would have to use a different function than any other combination of registers.
Given an integer, the only ways to get the value of the Nth register are branching to an instruction that uses that register, or storing all the registers to memory and indexing the resulting array. (Or maybe some kind of branchless compare and mask stuff).
Anyway, trying to do anything specific about this is not profitable, which is why people just write the interpreter in C and let the C compiler do a (hopefully) good job of optimizing the machine code that will actually run.
Or you write a JIT-compiler like Sun did for Java (the HotSpot VM). IDK if there are any for Python. See Does the Python 3 interpreter have a JIT feature?.
A JIT-compiler does actually turn the Python code into machine code, where register state mostly holds Python variables rather than interpreter data. Again, without a JIT compiler (or ahead-of-time compiler), "keeping variables in registers" is not a thing.
It's probably faster because it avoids the [] operator and other overhead (see Bren's answer, which you accepted)
Footnote: a couple ISAs have memory-mapped registers. e.g. AVR (8-bit RISC microcontrollers), where the chip also has built-in SRAM containing the low range of memory addresses that includes the registers. So you can do an indexed load and get register contents, but you might as well have done that on memory that wasn't holding architectural register contents.

The Python VM only uses a stack to execute its bytecode, and this stack is completely independent of the hardware stack. You can use dis to disassemble your code to see how your changes affect the generated bytecode.
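As a rough illustration of using dis for this, the sketch below counts subscript operations in two hypothetical variants of the bounds check (the function names direct and cached are made up). It assumes Python 3.4+ for dis.get_instructions; the opcode names are CPython details that differ between versions, which is why the check accepts two spellings.

```python
import dis

def direct(location, r, c):
    return (location[0] < 0 or location[1] < 0 or
            location[0] >= r or location[1] >= c)

def cached(location, r, c):
    loc0 = location[0]
    loc1 = location[1]
    return loc0 < 0 or loc1 < 0 or loc0 >= r or loc1 >= c

def count_subscripts(func):
    """Count bytecode instructions that perform an obj[index] lookup."""
    total = 0
    for instr in dis.get_instructions(func):
        # Assumption: subscripting shows up as BINARY_SUBSCR, or as BINARY_OP
        # with "[]" as its argument description on CPython versions that merged it.
        if instr.opname == "BINARY_SUBSCR" or (
                instr.opname == "BINARY_OP" and instr.argrepr == "[]"):
            total += 1
    return total

print("direct:", count_subscripts(direct))
print("cached:", count_subscripts(cached))
```

dis.dis(direct) prints the full listing if you want to read the individual stack operations.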

It will be a little faster if you store these two variables:
loc0 = location[0]
loc1 = location[1]
Because there will be only two lookups instead of four.
Btw, if you want to use Python, you shouldn't worry about performance at such a low level.

Those kinds of details are not part of the specified behavior of Python. As Ignacio's answer says, CPython does it one way, but that is not guaranteed by the language itself. Python's description of what it does is very far removed from low-level notions like registers, and most of the time it's not useful to worry about how what Python does maps onto those details. Python is a high-level language whose behavior is defined in terms of high-level abstractions, akin to an API.
In any case, doing something like loc0 = location[0] in Python code has nothing to do with setting registers. It's just creating a new Python name pointing to an existing Python object.
That said, there is a performance difference, because if you use location[0] everywhere, the actual lookup will (or at least may -- in theory a smart Python implementation could optimize this) happen again and again every time the expression location[0] is evaluated. But if you do loc0 = location[0] and then use loc0 everywhere, you know the lookup only happens once. In typical situations (e.g., location is a Python list or dict, you're not running this code gazillions of times in a tight loop) this difference will be tiny.
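To measure the difference rather than guess, something along these lines could be used. It is a sketch only: the Board class is a stand-in reconstructed from the question's method, swap_cached is one possible rewrite, and the timings depend entirely on the machine and Python version.

```python
import timeit

class Board:
    def __init__(self, r, c):
        self.r, self.c = r, c
        self.board = [[1] * c for _ in range(r)]
        self.empty = (0, 0)

    def swap_direct(self, location):
        # The method from the question, unchanged
        if (location[0] < 0 or location[1] < 0 or
                location[0] >= self.r or location[1] >= self.c):
            return False
        self.board[0][0] = self.board[location[0]][location[1]]
        self.board[location[0]][location[1]] = 0
        self.empty = (location[0], location[1])

    def swap_cached(self, location):
        # Same behaviour, but each lookup is done once and kept in a local name
        loc0, loc1 = location
        if loc0 < 0 or loc1 < 0 or loc0 >= self.r or loc1 >= self.c:
            return False
        row = self.board[loc0]
        self.board[0][0] = row[loc1]
        row[loc1] = 0
        self.empty = (loc0, loc1)

def time_swaps(number=100000):
    """Return the seconds taken by `number` calls of each variant."""
    board = Board(4, 4)
    return {
        "direct": timeit.timeit(lambda: board.swap_direct((1, 2)), number=number),
        "cached": timeit.timeit(lambda: board.swap_cached((1, 2)), number=number),
    }

print(time_swaps())
```

Running it several times and taking the minimum of each variant gives a steadier comparison than a single run.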

In C python, accessing the bytecode evaluation stack

Given a C Python frame pointer, how do I look at arbitrary evaluation stack entries? (Some specific stack entries can be found via locals(), I'm talking about other stack entries.)
I asked a broader question like this a while ago:
getting the C python exec argument string or accessing the evaluation stack
but here I want to focus on being able to read CPython stack entries at runtime.
I'll take a solution that works on CPython 2.7 or any Python later than Python 3.3. However if you have things that work outside of that, share that and, if there is no better solution I'll accept that.
I'd prefer not to modify the C Python code. In Ruby, I have in fact done this to get what I want. I can speak from experience that this is probably not the way we want to work. But again, if there's no better solution, I'll take that. (My understanding with respect to SO points is that I lose them in the bounty either way. So I'm happy to see it go to the person who has shown the most good spirit and willingness to look at this, assuming it works.)
update: See the comment by user2357112 tldr; Basically this is hard-to-impossible to do. (Still, if you think you have the gumption to try, by all means do so.)
So instead, let me narrow the scope to this simpler problem which I think is doable:
Given a python stack frame, like inspect.currentframe(), find the beginning of the evaluation stack. In the C version of the structure, this is f_valuestack. From that we then need a way in Python to read off the Python values/objects from there.
update 2 well the time period for a bounty is over and no one (including my own summary answer) has offered concrete code. I feel this is a good start though and I now understand the situation much more than I had. In the obligatory "describe why you think there should be a bounty" I had listed one of the proffered choices "to draw more attention to this problem" and to that extent where there had been something less than a dozen views of the prior incarnation of the problem, as I type this it has been viewed a little under 190 times. So this is a success. However...
If someone in the future decides to carry this further, contact me and I'll set up another bounty.
Thanks all.

This is sometimes possible, with ctypes for direct C struct member access, but it gets messy fast.
First off, there's no public API for this, on the C side or the Python side, so that's out. We'll have to dig into the undocumented insides of the C implementation. I'll be focusing on the CPython 3.8 implementation; the details should be similar, though likely different, in other versions.
A PyFrameObject struct has an f_valuestack member that points to the bottom of its evaluation stack. It also has an f_stacktop member that points to the top of its evaluation stack... sometimes. During execution of a frame, Python actually keeps track of the top of the stack using a stack_pointer local variable in _PyEval_EvalFrameDefault:
stack_pointer = f->f_stacktop;
assert(stack_pointer != NULL);
f->f_stacktop = NULL; /* remains NULL unless yield suspends frame */
There are two cases in which f_stacktop is restored. One is if the frame is suspended by a yield (or yield from, or any of the multiple constructs that suspend coroutines through the same mechanism). The other is right before calling a trace function for a 'line' or 'opcode' trace event. f_stacktop is cleared again when the frame unsuspends, or after the trace function finishes.
That means that if
you're looking at a suspended generator or coroutine frame, or
you're currently in a trace function for a 'line' or 'opcode' event for a frame
then you can access the f_valuestack and f_stacktop pointers with ctypes to find the lower and upper bounds of the frame's evaluation stack and access the PyObject * pointers stored in that range. You can even get a superset of the stack contents without ctypes with gc.get_referents(frame_object), although this will contain other referents that aren't on the frame's stack.
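The gc.get_referents route is the easiest one to try. The following is a CPython-specific sketch for the suspended-generator case; where the referents are reported (on the frame object or on the generator itself) is an implementation detail that has moved between versions, so the hypothetical helper below simply queries both.

```python
import gc

def counter():
    for value in range(3):
        # While suspended here, the iterator driving the loop is held
        # on the frame's evaluation stack rather than in a local variable.
        yield value

def referents_of_suspended(gen):
    """Collect objects referenced by a suspended generator and by its frame."""
    found = list(gc.get_referents(gen))
    found.extend(gc.get_referents(gen.gi_frame))
    return found

gen = counter()
next(gen)  # run up to the first yield so the frame is suspended mid-loop

type_names = sorted({type(obj).__name__ for obj in referents_of_suspended(gen)})
print(type_names)
```

The result is a superset, as described above: it mixes the stack entries with the code object, globals and other things the frame refers to, and it carries no information about stack order.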
Debuggers use trace functions, so this gets you value stack entries for the top stack frame while debugging, most of the time. It does not get you value stack entries for any other stack frames on the call stack, and it doesn't get you value stack entries while tracing an 'exception' event or any other trace events.
When f_stacktop is NULL, determining the frame's stack contents is close to impossible. You can still see where the stack begins with f_valuestack, but you can't see where it ends. The stack top is stored in a C-level stack_pointer local variable that's really hard to access.
There's the frame's code object's co_stacksize, which gives an upper bound on the stack size, but it doesn't give the actual stack size.
You can't tell where the stack ends by examining the stack itself, because Python doesn't null out the pointers on the stack when it pops entries.
gc.get_referents doesn't return value stack entries when f_stacktop is null. It doesn't know how to retrieve stack entries safely in this case either (and it doesn't need to, because if f_stacktop is null and stack entries exist, the frame is guaranteed reachable).
You might be able to examine the frame's f_lasti to determine the last bytecode instruction it was on and try to figure out where that instruction would leave the stack, but that would take a lot of intimate knowledge of Python bytecode and the bytecode evaluation loop, and it's still ambiguous sometimes (because the frame might be halfway through an instruction). This would at least give you a lower bound on the current stack size, though, letting you safely inspect at least some of it.
Frame objects have independent value stacks that aren't contiguous with each other, so you can't look at the bottom of one frame's stack to find the top of another. (The value stack is actually allocated within the frame object itself.)
You might be able to hunt down the stack_pointer local variable with some GDB magic or something, but it'd be a mess.

Note added later: See crusaderky's get_stack.py which might be worked into a solution here.
Here are two potential partial solutions, since this problem has no simple obvious answer short of:
modifying the CPython interpreter, or
instrumenting the bytecode beforehand, such as via x-python
Thanks to user2357112 for enlightenment on the difficulty of the problem, and for descriptions of:
the various Python stacks used at runtime,
the non-contiguous evaluation stack,
the transiency of the evaluation stack and
the stack pointer top which only lives as a C local
variable (which at run time, might be or is likely only saved in the
value of a register).
Now to potential solutions...
The first solution is to write a C extension to access f_valuestack which is the bottom (not top) of a frame. From that you can access values, and that too would have to go in the C extension. The main problem here, since this is the stack bottom, is to understand which entry is the top or one you are interested in. The code records the maximum stack depth in the function.
The C extension would wrap the PyFrameObject so it can get access to the unexposed field f_valuestack. Although the PyFrameObject can change from Python version to Python version (so the extension might have to check which python version is running), it still is doable.
From that, use an abstract virtual machine to figure out which entry position you'd be at for a given offset stored in f_lasti.
Something similar for my purposes would be to use a real but alternative VM, like Ned Batchelder's byterun. It runs a Python bytecode interpreter in Python.
Note added later: I have made some largish revisions in order to support Python 2.5 .. 3.7 or so and this is now called x-python
The advantage here would be that since this acts as a second VM so stores don't change the running of the current and real CPython VM. However you'd still need to deal with the fact of interacting with external persistent state (e.g. calls across sockets or changes to files). And byterun would need to be extended to cover all opcodes and Python versions that potentially might be needed.
By the way, for multi-version access to bytecode in a uniform way (since not only does bytecode change a bit, but also the collections of routines to access it), see xdis.
So although this isn't a general solution, it could probably work for the special case of trying to figure out the value of, say, the argument to an EXEC opcode, which appears only briefly on the evaluation stack.

I wrote some code to do this. It seems to work so I'll add it to this question.
How it does it is by disassembling the instructions, and using dis.stack_effect to get the effect of each instruction on stack depth. If there's a jump, it sets the stack level at the jump target.
I think stack level is deterministic, i.e. it is always the same at any given bytecode instruction in a piece of code no matter how it was reached. So you can get stack depth at a particular bytecode by looking at the bytecode disassembly directly.
There's a slight catch which is that if you are in an active call, the code position is shown as last instruction being the call, but the stack state is actually that before the call. This is good because it means you can recreate the call arguments from the stack, but you need to be aware that if the instruction is a call that is ongoing, the stack will be at the level of the previous instruction.
Here's the code from my resumable exception thing that does this:
cdef get_stack_pos_after(object code,int target,logger):
    stack_levels={}
    jump_levels={}
    cur_stack=0
    for i in dis.get_instructions(code):
        offset=i.offset
        argval=i.argval
        arg=i.arg
        opcode=i.opcode
        if offset in jump_levels:
            cur_stack=jump_levels[offset]
        no_jump=dis.stack_effect(opcode,arg,jump=False)
        if opcode in dis.hasjabs or opcode in dis.hasjrel:
            # a jump - mark the stack level at jump target
            yes_jump=dis.stack_effect(opcode,arg,jump=True)
            if not argval in jump_levels:
                jump_levels[argval]=cur_stack+yes_jump
        cur_stack+=no_jump
        stack_levels[offset]=cur_stack
        logger(offset,i.opname,argval,cur_stack)
    return stack_levels[target]
https://github.com/joemarshall/unthrow

I've tried to do this in this package. As others point out, the main difficulty is in determining the top of the Python stack. I try to do this with some heuristics, which I've documented here.
The overall idea is that by the time my snapshotting function is called, the stack consists of the locals (as you point out), the iterators of nested for loops, and any exception triplets currently being handled. There's enough information in Python 3.6 & 3.7 to recover these states and therefore the stacktop.
I also relied on a tip from user2357112 to pave a way to making this work in Python 3.8.

Understand programmatically a python code without executing it

I am implementing a workflow management system, where the workflow developer overloads a little process function and inherits from a Workflow class. The class offers a method named add_component in order to add a component to the workflow (a component is the execution of a software or can be more complex).
My Workflow class in order to display status needs to know what components have been added to the workflow. To do so I tried 2 things:
execute the process function twice: the first run gathers all the required components, and the second is the real execution. The problem is that if the workflow developer does anything other than adding components (adding an element to a database, creating a file), it will be done twice!
parse the Python code of the function to extract only the add_component lines. This works, but if some components are inside an if / else statement and a component should not be executed, it still appears in the monitoring!
I'm wondering if there is other solution (I thought about making my workflow being an XML or something to parse easier but this is less flexible).

You cannot know what a program does without "executing" it (this could be in some context where you mock the things you don't want to be modified, but that looks like shooting at a moving target).
If you do a handmade parsing there will always be some issues you miss.
You should break the code in two functions :
a first one where the code can only add_component(s) without any side
effects, but with the possibility to run real code to check the
environment etc. to know which components to add.
a second one that can have side effects and rely on the added components.
Using an XML (or any static format) is similar except :
you are certain there are no side effects (don't need to rely on the programmer respecting the documentation)
you have much less flexibility (but be sure you actually need that flexibility).
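The two-function split could look roughly like this. It is a sketch under assumptions: apart from add_component and process, which come from the question, every name here (declare_components, run, MyWorkflow, the component names) is invented for illustration.

```python
class Workflow:
    def __init__(self):
        self.components = []

    def add_component(self, name):
        self.components.append(name)

    def declare_components(self):
        """Override: only call add_component() here; no other side effects."""
        raise NotImplementedError

    def process(self):
        """Override: the real work, free to have side effects."""
        raise NotImplementedError

    def run(self):
        self.declare_components()        # phase 1: safe to call for monitoring
        planned = list(self.components)
        result = self.process()          # phase 2: executed exactly once
        return planned, result


class MyWorkflow(Workflow):
    def __init__(self, use_filter):
        super().__init__()
        self.use_filter = use_filter

    def declare_components(self):
        # Conditions are evaluated for real, so the monitoring view only
        # contains components that will actually run.
        self.add_component("align")
        if self.use_filter:
            self.add_component("filter")

    def process(self):
        return "ran " + ", ".join(self.components)


print(MyWorkflow(use_filter=False).run())
```

Nothing enforces that declare_components stays free of side effects; that remains a convention the workflow developer has to respect, which is the trade-off against a static format.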

Counting number of symbols in Python script

I have a Telit module which runs Python 1.5.2+ (http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16k) and the amount of RAM (~1 MB). Refer to pages 113 and 114 for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and the memory usage (stack and heap).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.

Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object, [default]), provided that it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even then, this doesn't always work, and it will raise a TypeError when an object's size cannot be determined if you don't specify the default parameter.
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
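As a rough idea of what such a reflection-based dump might look like on a desktop CPython, here is a sketch. The helper name symbol_report is made up, and sys.getsizeof was only added in Python 2.6, so it should not be assumed to exist on the module's 1.5.2-based interpreter.

```python
import sys

def symbol_report(namespace):
    """Return (name, type name, size in bytes) rows for a namespace dict.

    The size is the shallow size reported by sys.getsizeof, or -1 when the
    object cannot report one.
    """
    rows = []
    for name in sorted(namespace):
        value = namespace[name]
        size = sys.getsizeof(value, -1)  # the default avoids a TypeError
        rows.append((name, type(value).__name__, size))
    return rows

if __name__ == "__main__":
    rows = symbol_report(dict(globals()))
    for name, type_name, size in rows:
        print("%-24s %-20s %d" % (name, type_name, size))
    print("%d names" % len(rows))
```

Passing dict(globals()) or vars(some_module) gives module-level names only; locals inside functions and attribute names are not covered by this count.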
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.

This post reminds me of the pain I once had with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this at the time. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this out.
What you are looking for might not have a good answer because the Python interpreter is not a full-fledged version. What I did was to reuse the same local variable names as much as possible. I also deleted the docstrings for functions (those count too) and replaced them with #comments.
In the end, I want to say that this module is good for small applications. The python interpreter does not support threads or interrupts so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.
