I'm having a problem with a long-running script. It runs in a multithreaded environment and performs crawling tasks.
On long runs the script's memory consumption becomes huge, and after profiling memory with guppy's hpy I saw that most of it is taken up by strings.
I'm not storing that many strings: I just read the content of HTML pages into memory in order to store them in the database. After that, the string is not used anymore (the variable containing it is assigned the next string).
The problem came to my attention because every new string has, according to sys.getrefcount, at least 2 references (1 from my variable and 1 internal). It seems that assigning another value to my variable does not remove the internal reference, so the string stays in memory.
What can I do to be sure that strings are garbage collected?
Thank you in advance
EDIT:
1- I'm using Django ORM
2- I'm obtaining all of those strings from 2 sources:
2.1- Directly from the socket (urllib2.urlopen(url).read())
2.2- Parsing those responses, extracting new URIs from every HTML page, and feeding them back into the system
SOLVED
Finally, I found the cause. The script is part of a Django environment, and it seems that Django is doing some caching or something similar under the hood. I turned off debugging and everything started to work as expected (rebinding an identifier drops the reference to the old object, and those objects are then collected).
For anyone who uses some kind of framework layer over Python, be aware of its configuration: it seems that some debug settings, combined with an intensive process, can lead to memory leaks.
You say:
I saw that every new string (with sys.getrefcount) have, at least, 2 references
But did you carefully read the description of getrefcount()?
sys.getrefcount(object)
Return the reference count of the object. The count returned is generally one higher than you might expect, because it includes the (temporary) reference as an argument to getrefcount().
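A quick way to see the extra reference is to compare counts before and after adding one known reference. This is a minimal sketch for CPython; the names `page` and `holder` are made up for illustration, and the absolute counts can differ between interpreter versions, so only the difference is meaningful.

```python
import sys

# Build the string at runtime so it is not shared with a compile-time constant.
page = "".join(["<html>", "body", "</html>"])

before = sys.getrefcount(page)   # includes the temporary reference held by the call
holder = [page]                  # add exactly one more reference
after = sys.getrefcount(page)

print(before, after)             # absolute values vary by CPython version
print(after - before)            # the one reference added by `holder`
```

A count of 2 for a freshly created string is therefore normal and does not by itself indicate a hidden reference.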
You should explain more about your program.
What is the size of the HTML strings it holds?
How are they obtained? Are you sure you close all file handles, all socket connections, and so on?
You'd need to find out who keeps the "internal" reference to your strings.
Perhaps the library you're using to write to DB (you didn't specify how you write to DB).
I find objgraph very useful for tasks like this: https://pypi.python.org/pypi/objgraph
E.g.
import objgraph
objgraph.show_backrefs([mystring], filename='a.png')
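If objgraph is not available, the standard library's gc.get_referrers gives a rougher view of the same information. The sketch below is an illustration only: `cache` is a hypothetical stand-in for whatever hidden structure keeps the string alive, and get_referrers only reports containers that the collector tracks.

```python
import gc

def referrer_types(obj):
    """Return the type names of the tracked containers that refer to obj."""
    return sorted({type(ref).__name__ for ref in gc.get_referrers(obj)})

# Hypothetical stand-in for a library cache that silently keeps pages alive.
cache = []
html = "".join(["<html>", "x" * 50, "</html>"])
cache.append(html)

print(referrer_types(html))
found = any(ref is cache for ref in gc.get_referrers(html))
print("held by cache:", found)
```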
I'm new to Python and have been tasked with optimizing some code and am trying to understand why my change has slowed things down. The code I'm working with is in a backend flask app.
The changes I made involved removing the use of temporary object that was being used to store data before copying all fields to a MongoEngine document object. All fields would get assigned to this temporary object, and then there was a conversion function that cast all fields to their proper data types for storage. Instead of using this temporary object, I just instantiated the MongoEngine document and replaced all lines that were assigning to the temporary object to instead assign to the document. I didn't add any lines, just replaced existing ones.
When I checked the changes using the Werkzeug application profiler for Flask, it showed 336,897 calls to __setattr__() before the changes and 502,953 calls after them.
I'm just wondering if there's any explanation for this other than me inadvertently increasing the calls somehow (I don't think this is the case because I've reviewed the changes using git diff a few times and I didn't notice anything).
I appreciate any help I can get. Sorry for not providing any code examples (I don't want to expose the company's code). However, if needed I can try my best to write some example code to show what I did.
Before: [screenshot: __setattr__() calls before changes]
After: [screenshot: __setattr__() calls after changes]
I am doing some experiments with the Python garbage collector, and I would like to check whether a memory address is in use or not. In the following example, I have dereferenced the string (surely) at ls[2]. If I run the garbage collector, I can still see surely at the original address. I would like to be sure that the address is now writable. Is there a way to check this in Python?
from ctypes import string_at
from sys import getsizeof
import gc
ls = ['This','will be','surely','deleted']
idsurely= id(ls[2])
sizesurely = getsizeof(ls[2])
ls[2] = 'probably'
print(ls)
print(string_at(idsurely,sizesurely))
gc.collect()
# I check there is nothing in the garbage
print(gc.garbage)
print(string_at(idsurely,sizesurely))
I am interested in this mainly from a theoretical point of view, so I am not saying that it has any practical use. My goal is to show how memory works for a tutorial. I want to show that the data is still there and that the bytes at the address can now be overwritten. So far the output of the script is as expected. I just want to prove the last step.
Not possible.
There is no central registry of used or unused memory addresses in Python. There isn't even a central registry of all objects (the cyclic GC doesn't know about all of them), and even if you had a registry of all objects, that wouldn't be enough to determine what memory locations are in use. Additionally, you can't just read arbitrary memory addresses, or write to arbitrary deallocated addresses. That'll quickly lead to segfaults or worse.
Finally, I would strongly advise against using this kind of thing in a tutorial even if you did find something to make it work. When you put something in a tutorial, a large fraction of people reading the tutorial are going to think it's something they're supposed to learn. Programming newbies should not be misled into thinking that examining possibly-deallocated memory locations is something they should be doing.
Your experiments are way off base. id (solely as a CPython implementation detail) does get the memory address of the object in question, but we're talking about the Python object itself, not the data it contains. sys.getsizeof returns a number that roughly corresponds to how much memory the object occupies, but there is no guarantee that memory is contiguous.
By sheer coincidence, this almost works on str (though it will perform a buffer overread if the string in question has cached copies of its UTF-8 or wchar_t form, so you're risking crashing your program), but even then your test is flawed; CPython interns string literals that look like legal variable names, so if the string in question appears as a literal anywhere else in your program (including as the name of some class or function in some module you imported), it won't actually go away when you replace it. Similar implicit caches can occur if the literal string appears in any function, anywhere (it ends up being not only interned, but stored in the constants for that function).
Update: On testing, in an actual script, the reference count for 'surely' when you hold onto a copy of it is 3, which drops to 2 when you replace it with 'probably'. Turns out constants are being cached even at global scope. The only reason the interactive interpreter doesn't exhibit this behavior is that it effectively evals each line separately, so the constant cache is discarded when the eval completes.
And even if all that's not a problem, most (almost all) memory managers (CPython's specialized small object heap and the general heap it's built on) don't actually zero out memory when it's released, so if you do look at the same address shortly after it really was released, it'll probably have pretty similar data in it.
Lastly, your gc.collect() call won't change anything except by coincidence (of whatever happens during gc possibly allocating memory by side-effect). str is not a garbage collected type, as it cannot contain references to other Python objects, so it's impossible for it to be a link in a reference cycle, and the CPython garbage collector is solely concerned with collecting cyclic garbage; CPython is reference counted, so anything that's not part of a reference cycle is cleaned up automatically and immediately when the last reference disappears.
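The point about reference counting can be observed without touching raw memory. The sketch below uses a small hypothetical `Page` class (plain str objects cannot be weakly referenced) and `weakref.finalize` to record the moment the object is destroyed. The cyclic collector is disabled to show that it plays no part; the behaviour described is CPython-specific.

```python
import gc
import weakref

class Page:
    """Hypothetical wrapper; plain str objects cannot be weakly referenced."""
    def __init__(self, html):
        self.html = html

events = []
gc.disable()                       # the cyclic collector is not needed here
try:
    page = Page("<html>first</html>")
    weakref.finalize(page, events.append, "first page freed")
    page = Page("<html>second</html>")   # rebinding drops the last reference
    freed_immediately = list(events)     # snapshot taken without any gc.collect()
finally:
    gc.enable()

print(freed_immediately)
```

On CPython the callback should already have run by the time the snapshot is taken, because the first object is destroyed as soon as its reference count reaches zero.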
The short answer this all leads up to is: There is no way to determine, within CPython, non-heuristically, if a particular memory address has been released to the free store and made available for reuse. CPython's memory management scheme is pure implementation detail, and exposing APIs at that level of detail would create compatibility concerns when people depended on them.
The closest you're going to get is using something like the tracemalloc module to perform basic snapshotting and compute differences in the snapshot. That's not going to give you a window into whether a specific address is still in use though AFAICT; at best it can tell you where an address that's definitely in use was allocated.
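As a rough illustration of that snapshot-and-compare workflow, here is a minimal sketch. The workload (a list called `pages` holding 100 page-sized strings) is made up; the output shows which source lines gained memory, not whether any particular address is free.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Hypothetical workload: keep 100 page-sized strings alive.
pages = ["x" * 10_000 + str(i) for i in range(100)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Differences grouped by source line, largest first.
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)
grown = sum(stat.size_diff for stat in stats if stat.size_diff > 0)
print("bytes gained:", grown)
```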
The other approach (specific to CPython) you can use is to just check the reference counts before replacing the object; if sys.getrefcount for a given name/attribute reports 2, then deleting (or rebinding) that name/attribute will release it (assuming no threads that might create additional references between the test and the del/rebind). You expect 2, not 1, because calling sys.getrefcount creates a temporary reference to the object in question. If it reports a number greater than 2, deleting/rebinding could still lead to the object being deleted eventually when the cyclic garbage collector runs, if the object was part of a reference cycle, but for a reference count of 2 (or 1 for something otherwise unnamed, e.g. sys.getrefcount(''.join(('f', '9'))) or the like), the behavior will be deterministic.
From the documentation about gc:
... the collector supplements the reference counting already used in Python...
And from gc.is_tracked():
Returns True if the object is currently tracked by the garbage collector, False otherwise. As a general rule, instances of atomic types aren’t tracked and instances of non-atomic types (containers, user-defined objects…) are.
Strings are not tracked by the garbage collector:
In [1]: import gc
In [2]: test = 'surely'
In [3]: gc.is_tracked(test)
Out[3]: False
Looking at the documentation, there doesn't seem to be a method for accessing the reference counting from within the language.
Note that at least for me, using string_at doesn't work from the interactive interpreter. It does work in a script.
I've been having a hard time using a large dictionary (~86GB, 1.75 billion keys) to process a big dataset (2TB) using multiprocessing in Python.
Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded, worker processes (ideally >32) are created that must look up values in the dictionary but not modify its contents, in order to process the ~2TB dataset. The data set needs to be processed in parallel, otherwise the task would take over a month.
Here are the nine approaches (all failing) that I have tried:
1. Store the dictionary as a global variable in the Python program and then fork the ~32 worker processes. Theoretically this method might work since the dictionary is not being modified and therefore the COW mechanism of fork on Linux would mean that the data structure would be shared and not copied among processes. However, when I attempt this, my program crashes on os.fork() inside of multiprocessing.Pool.map from OSError: [Errno 12] Cannot allocate memory. I'm convinced that this is because the kernel is configured to never overcommit memory (/proc/sys/vm/overcommit_memory is set to 2, and I can't configure this setting on the machine since I don't have root access).
2. Load the dictionary into a shared-memory dictionary with multiprocessing.Manager.dict. With this approach I was able to fork the 32 worker processes without crashing, but the subsequent data processing is orders of magnitude slower than another version of the task that required no dictionary (the only difference is no dictionary lookup). I theorize that this is because of the inter-process communication between the manager process containing the dictionary and each worker process, which is required for every single dictionary lookup. Although the dictionary is not being modified, it is being accessed many many times, often simultaneously by many processes.
3. Copy the dictionary into a C++ std::map and rely on Linux's COW mechanism to prevent it from being copied (like approach #1 except with the dictionary in C++). With this approach, it took a long time to load the dictionary into std::map and subsequently crashed from ENOMEM on os.fork() just as before.
4. Copy the dictionary into pyshmht. It takes far too long to copy the dictionary into pyshmht.
5. Try using SNAP's HashTable. The underlying implementation in C++ allows for it to be made and used in shared memory. Unfortunately the Python API does not offer this functionality.
6. Use PyPy. Crash still happened as in #1.
7. Implement my own shared-memory hash table in Python on top of multiprocessing.Array. This approach still resulted in the out of memory error that occurred in #1.
8. Dump the dictionary into dbm. After trying to dump the dictionary into a dbm database for four days and seeing an ETA of "33 days", I gave up on this approach.
9. Dump the dictionary into Redis. When I try to dump the dictionaries (the 86GB dict is loaded from 1024 smaller dicts) into Redis using redis.mset I get a connection reset by peer error. When I try to dump the key-value pairs using a loop, it takes an extremely long time.
How can I process this dataset in parallel efficiently without requiring inter-process communication to look up values in this dictionary? I would welcome any suggestions for solving this problem!
I'm using Python 3.6.3 from Anaconda on Ubuntu on a machine with 1TB RAM.
Edit: What finally worked:
I was able to get this to work using Redis. To get around the issue in #9, I had to chunk the large key-value insertion and lookup queries into "bite-sized" chunks so that it was still processing in batches, but didn't time out from too large a query. Doing this allowed the insertion of the 86GB dictionary to be performed in 45 minutes (with 128 threads and some load balancing), and the subsequent processing was not hampered in performance by the Redis lookup queries (finished in 2 days).
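The post does not include the chunking code. A minimal sketch of the batching idea could look like the following; `batches`, `load_in_batches` and `send_batch` are hypothetical names, and `send_batch` stands for whatever bulk-insert call the client library provides. The batch size of 10,000 is an arbitrary assumption.

```python
from itertools import islice

def batches(pairs, size):
    """Yield dicts of at most `size` items from an iterable of (key, value)."""
    iterator = iter(pairs)
    while True:
        batch = dict(islice(iterator, size))
        if not batch:
            return
        yield batch

def load_in_batches(pairs, send_batch, size=10_000):
    """Send every batch through `send_batch` and return the number of keys sent."""
    sent = 0
    for batch in batches(pairs, size):
        send_batch(batch)      # assumption: e.g. a Redis client's bulk-set call
        sent += len(batch)
    return sent

# Demo with a plain list standing in for the Redis client.
received = []
total = load_in_batches(((f"k{i}", f"v{i}") for i in range(25)), received.append, size=10)
print(total, [len(b) for b in received])
```

Because the input is consumed lazily, the 1024 smaller dictionaries could be streamed through one after another without building a second full copy in memory.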
Thank you all for your help and suggestions.
You should probably use a system that's meant for sharing large amounts of data with many different processes -- like a Database.
Take your giant dataset and create a schema for it and dump it into a database. You could even put it on a separate machine.
Then launch as many processes as you want, across as many hosts as you want, to process the data in parallel. Pretty much any modern database will be more than capable of handling the load.
Instead of using a dictionary, use a data structure that compresses data, but still has fast lookups.
e.g:
keyvi: https://github.com/cliqz-oss/keyvi
keyvi is a FSA-based key-value data structure optimized for space & lookup speed. multiple processes reading from keyvi will re-use the memory, because a keyvi structure is memory mapped and it uses shared memory. Since your worker processes don't need to modify the data structure, I think this would be your best bet.
marisa trie: https://github.com/pytries/marisa-trie static trie structure for Python, based on the marisa-trie C++ library. Like keyvi, marisa-trie also uses memory-mapping. Multiple processes using the same trie will use the same memory.
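To illustrate why memory-mapped, read-only structures suit this workload, here is a minimal standard-library sketch. It is not how keyvi or marisa-trie work internally; it only shows the general idea of a sorted, fixed-width table that is mapped read-only and binary-searched, so that every process mapping the same file can share the OS page cache. The record widths, file name and helper names are assumptions.

```python
import mmap
import os
import tempfile

KEY_WIDTH = 16     # assumed fixed widths in bytes; longer keys are not handled
VALUE_WIDTH = 16
RECORD_WIDTH = KEY_WIDTH + VALUE_WIDTH

def build_table(path, mapping):
    """Write the mapping as sorted, NUL-padded fixed-width records."""
    records = sorted(
        (key.encode().ljust(KEY_WIDTH, b"\0"), value.encode().ljust(VALUE_WIDTH, b"\0"))
        for key, value in mapping.items()
    )
    with open(path, "wb") as handle:
        for key, value in records:
            handle.write(key + value)

def lookup(table, key):
    """Binary-search the mapped table; return the value or None."""
    target = key.encode().ljust(KEY_WIDTH, b"\0")
    low, high = 0, len(table) // RECORD_WIDTH
    while low < high:
        middle = (low + high) // 2
        start = middle * RECORD_WIDTH
        candidate = table[start:start + KEY_WIDTH]
        if candidate == target:
            value = table[start + KEY_WIDTH:start + RECORD_WIDTH]
            return value.rstrip(b"\0").decode()
        if candidate < target:
            low = middle + 1
        else:
            high = middle
    return None

path = os.path.join(tempfile.mkdtemp(), "table.bin")
build_table(path, {"foo": "bar", "key": "value", "abc": "xyz"})
with open(path, "rb") as handle:
    with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as table:
        results = [lookup(table, k) for k in ("foo", "key", "missing")]
print(results)
```

Each worker would open and map the file itself after it starts, so nothing large has to be copied or pickled when the processes are created.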
EDIT:
To use keyvi for this task, you can first install it with pip install pykeyvi. Then use it like this:
from pykeyvi import StringDictionaryCompiler, Dictionary
# Create the dictionary
compiler = StringDictionaryCompiler()
compiler.Add('foo', 'bar')
compiler.Add('key', 'value')
compiler.Compile()
compiler.WriteToFile('test.keyvi')
# Use the dictionary
dct = Dictionary('test.keyvi')
dct['foo'].GetValue()
> 'bar'
dct['key'].GetValue()
> 'value'
marisa trie is just a trie, so it wouldn't work as a mapping out of the box, but you can for example use a delimiter char to separate keys from values.
If you can successfully load that data into a single process in point 1, you can most likely work around the problem of fork doing copies by using gc.freeze introduced in https://bugs.python.org/issue31558
You have to use python 3.7+ and call that function before you fork. (or before you do the map over process pool)
Since this requires a virtual copy of the whole memory for the CoW to work, you need to make sure your overcommit settings allow you to do that.
As most people here already mentioned:
Don't use that big a dictionary, Dump it on a Database instead!!!
After dumping your data into a database, using indexes will help reduce data retrieval times.
A good indexing explanation for PostgreSQL databases here.
You can optimize your database even further (I give a PostgreSQL example because that is what I mostly use, but those concepts apply to almost every database)
Assuming you did the above (or if you want to use the dictionary either way...), you can implement a parallel and asynchronous processing routine using Python's asyncio (in the standard library since Python 3.4; the async def syntax needs Python >= 3.5).
The base idea is to create a mapping method to assign (map) an asynchronous task to each item of an iterable and register each task to asyncio's event_loop.
Finally, we will collect all those promises with asyncio.gather and we will wait to receive all the results.
A skeleton code example of this idea:
import asyncio
async def my_processing(value):
    # do stuff with the value...
    processed_value = value  # placeholder for the real result
    return processed_value
def my_async_map(my_coroutine, my_iterable):
    my_loop = asyncio.get_event_loop()
    my_future = asyncio.gather(
        *(my_coroutine(val) for val in my_iterable)
    )
    return my_loop.run_until_complete(my_future)
# my_ginormous_iterable is your own input data
my_async_map(my_processing, my_ginormous_iterable)
You can use gevent instead of asyncio, but keep in mind that asyncio is part of the standard library.
Gevent implementation:
import gevent
from gevent.pool import Group
def my_processing(value):
    # do stuff with the value...
    processed_value = value  # placeholder for the real result
    return processed_value
def my_async_map(my_coroutine, my_iterable):
    my_group = Group()
    return my_group.map(my_coroutine, my_iterable)
# my_ginormous_iterable is your own input data
my_async_map(my_processing, my_ginormous_iterable)
The already mentioned keyvi (http://keyvi.org) sounds like the best option to me, because "python shared memory dictionary" describes exactly what it is. I am the author of keyvi, call me biased, but give me the chance to explain:
Shared memory makes it scalable, especially for Python, where the GIL forces you to use multiprocessing rather than threading. That's why a heap-based in-process solution wouldn't scale. Also, shared memory can be bigger than main memory; parts can be swapped in and out.
External process network based solutions require an extra network hop, which you can avoid by using keyvi, this makes a big performance difference even on the local machine. The question is also whether the external process is single-threaded and therefore introduces a bottleneck again.
I wonder about your dictionary size: 86GB: there is a good chance that keyvi compresses that nicely, but hard to say without knowing the data.
As for processing: Note that keyvi works nicely in pySpark/Hadoop.
Your usecase BTW is exactly what keyvi is used for in production, even on a higher scale.
The redis solution sounds good, at least better than some database solution. For saturating the cores you should use several instances and divide the key space using consistent hashing. But still, using keyvi, I am sure, would scale way better. You should try it, if you have to repeat the task and/or need to process more data.
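For readers unfamiliar with the term, dividing the key space with consistent hashing can be sketched as below. This is a minimal illustration, not code from keyvi or any Redis client; the `HashRing` class, the instance names and the choice of 100 virtual points per node are all assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring; `replicas` virtual points per node."""

    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        digest = hashlib.sha256(text.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        """Return the node owning the first ring point at or after the key's hash."""
        index = bisect.bisect_left(self._points, self._hash(key)) % len(self._ring)
        return self._ring[index][1]

# Hypothetical instance names.
ring = HashRing(["redis-0", "redis-1", "redis-2"])
bigger = HashRing(["redis-0", "redis-1", "redis-2", "redis-3"])
keys = [f"key{i}" for i in range(1000)]
moved = sum(ring.node_for(k) != bigger.node_for(k) for k in keys)
print("keys that moved after adding a node:", moved)
```

The useful property is that adding an instance only moves keys onto the new instance; keys never shuffle between the existing ones.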
Last but not least, you find nice material on the website, explaining the above in more detail.
Maybe you should try doing it in a database, or try using Dask and let it take care of the low-level multiprocessing details. Then you can focus on the main problem you want to solve with that large dataset.
Here is the link you may want to look at: Dask
Well I do believe that the Redis or a database would be the easiest and quickest fix.
But from what I understood, why not reduce the problem from your second solution? That is, first try to load a portion of the billion keys into memory (say 50 million). Then, using multiprocessing, create a pool to work on the 2 TB file. If the lookup for a line succeeds, push the data to a list of processed lines. If it fails, push the line to a second list of lines still to be resolved. Once you have finished reading the data set, pickle your processed list and flush the keys you have stored from memory. Then load the next portion of keys and repeat the process, this time reading from your list of unresolved lines instead. Once it is finished completely, read all your pickled objects.
This should handle the speed issue that you were facing. Of course, I have very little knowledge of your data set and do not know if this is even feasible. Of course, you might be left with lines that did not get a proper dictionary key read, but at this point your data size would be significantly reduced.
Don't know if that is of any help.
Another solution could be to use some existing database driver which can allocate / retire pages as necessary and deal with the index lookup quickly.
dbm has a nice dictionary interface available and with automatic caching of pages may be fast enough for your needs. If nothing is modified, you should be able to effectively cache the whole file at VFS level.
Just remember to disable locking, open in not synch-ed mode, and open for 'r' only so nothing impacts caching/concurrent access.
Since you're only looking to create a read-only dictionary it is possible that you can get better speed than some off the shelf databases by rolling your own simple version. Perhaps you could try something like:
import os.path
import functools
db_dir = '/path/to/my/dbdir'
def write(key, value):
    path = os.path.join(db_dir, key)
    with open(path, 'w') as f:
        f.write(value)
# @functools.lru_cache(maxsize=None)  # uncomment to memoize reads in-process
def read(key):
    path = os.path.join(db_dir, key)
    with open(path) as f:
        return f.read()
This will create a folder full of text files. The name of each file is the dictionary key and the contents are the value. Timing this myself I get about 300us per write (using a local SSD). Using those numbers theoretically the time taken to write your 1.75 billion keys would be about a week but this is easily parallelisable so you might be able to get it done a lot faster.
For reading I get about 150us per read with warm cache and 5ms cold cache (I mean the OS file cache here). If your access pattern is repetitive you could memoize your read function in process with lru_cache as above.
You may find that storing this many files in one directory is not possible with your filesystem or that it is inefficient for the OS. In that case you can do like the .git/objects folder: Store the key abcd in a file called ab/cd (i.e. in a file cd in folder ab).
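A sketch of that .git/objects-style layout is shown below. The helper name `key_to_path`, the two-character prefix and the `_short` folder for keys that are too short to split are assumptions for illustration, and keys are assumed to contain no path separators.

```python
import os
import tempfile

def key_to_path(db_dir, key, prefix_len=2):
    """Map key 'abcd' to db_dir/ab/cd; short keys go under '_short' (assumption)."""
    if len(key) <= prefix_len:
        return os.path.join(db_dir, "_short", key)
    return os.path.join(db_dir, key[:prefix_len], key[prefix_len:])

def write(db_dir, key, value):
    path = key_to_path(db_dir, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create the prefix folder on demand
    with open(path, "w") as f:
        f.write(value)

def read(db_dir, key):
    with open(key_to_path(db_dir, key)) as f:
        return f.read()

db_dir = tempfile.mkdtemp()      # throwaway directory for the demo
write(db_dir, "abcd", "some value")
write(db_dir, "ab", "short key")
print(read(db_dir, "abcd"), "|", read(db_dir, "ab"))
```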
The above would take something like 15TB on disk based on a 4KB block size. You could make it more efficient on disk and for OS caching by trying to group together keys by the first n letters so that each file is closer to the 4KB block size. The way this would work is that you have a file called abc which stores key value pairs for all keys that begin with abc. You could create this more efficiently if you first output each of your smaller dictionaries into a sorted key/value file and then mergesort as you write them into the database so that you write each file one at a time (rather than repeatedly opening and appending).
While the majority suggestion of "use a database" here is wise and proven, it sounds like you may want to avoid using a database for some reason (and you are finding the load into the db to be prohibitive), so essentially it seems you are IO-bound, and/or processor-bound. You mention that you are loading the 86GB index from 1024 smaller indexes. If your key is reasonably regular, and evenly-distributed, is it possible for you to go back to your 1024 smaller indexes and partition your dictionary? In other words, if, for example, your keys are all 20 characters long, and comprised of the letters a-z, create 26 smaller dictionaries, one for all keys beginning with 'a', one for keys beginning 'b' and so on. You could extend this concept to a large number of smaller dictionaries dedicated to the first 2 characters or more. So, for example, you could load one dictionary for the keys beginning 'aa', one for keys beginning 'ab' and so on, so you would have 676 individual dictionaries. The same logic would apply for a partition over the first 3 characters, using 17,576 smaller dictionaries. Essentially I guess what I'm saying here is "don't load your 86GB dictionary in the first place". Instead use a strategy that naturally distributes your data and/or load.
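The partitioning idea can be sketched in a few lines. The sample data and the helper names `partition_by_prefix` and `lookup` are made up; in practice each shard would be built from the 1024 smaller dictionaries and handed to the worker responsible for that prefix.

```python
from collections import defaultdict

def partition_by_prefix(pairs, prefix_len=1):
    """Group (key, value) pairs into one dict per key prefix."""
    shards = defaultdict(dict)
    for key, value in pairs:
        shards[key[:prefix_len]][key] = value
    return dict(shards)

def lookup(shards, key, prefix_len=1):
    """Look a key up in the shard that owns its prefix."""
    return shards.get(key[:prefix_len], {}).get(key)

# Made-up sample data.
data = {"apple": "1", "avocado": "2", "banana": "3", "cherry": "4"}
shards = partition_by_prefix(data.items(), prefix_len=1)
print(sorted(shards))
print(lookup(shards, "banana"), lookup(shards, "durian"))

# Shard counts for lowercase a-z keys with 2 and 3 prefix characters.
print(26 ** 2, 26 ** 3)
```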
By looking at the CPython implementation it seems the return value of a string split() is a list of newly allocated strings. However, since strings are immutable it seems one could have made substrings out of the original string by pointing at the offsets.
Am I understanding the current behavior of CPython correctly ? Are there reasons for not opting for this space optimization ? One reason I can think of is that the parent string cannot be freed until all its substrings are.
Without a crystal ball I can't tell you why CPython does it that way. However, there are some reasons why you might choose to do it that way.
The problem is that a small string might hold a reference to a much larger backing array. For example, suppose I read in an 8 GB HTTP access log file to analyze which user agents access my file the most, and I do that just by fp.read() and then run a regex on the whole file at once rather than going one line at a time.
I want to know about the top 10 most common user agents, so I keep those ten strings around in a list. If substrings shared storage with their parent, each of them would keep the whole 8 GB buffer alive.
Then I want to do the same analysis for 100 other files, to see how the top 10 user agents have changed over time. Boom! My program is trying to use 800 GB of memory and gets killed. Why? How do I debug this?
Java used this sharing technique prior to Java 7, so the same reasoning applies. See "Java 7 String - substring complexity" and "JDK-4513622: (str) keeping a substring of a field prevents GC for object".
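CPython's str does not offer shared substrings, but a memoryview over bytes does share storage with its source, which makes the retention trade-off easy to see. This is only an analogy under that assumption; the 10 MB buffer stands in for the large log file.

```python
import sys

big = bytes(10_000_000)            # stand-in for a large file read into memory

shared = memoryview(big)[:10]      # a slice that points into the original buffer
copied = big[:10]                  # an independent 10-byte copy

print(shared.obj is big)           # the view keeps the whole buffer reachable
print(sys.getsizeof(copied))       # the copy only costs a few dozen bytes
```

As long as `shared` is alive, the full 10 MB buffer cannot be freed, whereas keeping only `copied` lets the large object go.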
Also note that having strings share memory would require you to follow a pointer from the string object to the string data. In CPython, the string data is usually placed directly after a header in memory, so you don't need to follow a pointer. This reduces the number of allocations required and reduces data dependencies when reading strings.
In the current CPython implementation, strings are reference-counted; it is assumed that a string cannot hold references to other objects because a string is not a container. This means that garbage collection does not need to inspect or trace over string objects (because they're entirely covered by the reference counting). But it's actually worse than that: Old versions of Python did not have a tracing garbage collector at all; GC was new in 2.0. Before that, any cyclic garbage would simply leak.
A competently-implemented substring-to-offset algorithm should not form cycles. So in theory, a cyclic garbage collector is not a prerequisite for this. However, because we're doing reference counting instead of tracing, the child objects become responsible for Py_DECREF()ing their parent objects at end-of-life. Otherwise the parent leaks. This means you cannot just chuck the whole string into the free list when it reaches end-of-life; you have to check whether it's a substring, and branching is potentially expensive. Python was historically designed to do string processing (like Perl, but with nicer syntax), which means creating and destroying a lot of strings. Furthermore, all variable names are internally stored as strings, so even if the user is not doing string processing, the interpreter is. Slowing down the string deallocation process by even a little could have a serious impact on performance.
CPython internally uses NUL-terminated strings in addition to storing a length. This is a very early design choice, present since the very first version of Python, and still true in the latest version.
You can see that in Include/unicodeobject.h where PyASCIIObject says "wchar_t representation (null-terminated)" and PyCompactUnicodeObject says "UTF-8 representation (null-terminated)". (Recent CPython implementations select from one of 4 back-end string types, depending on the Unicode encoding needs.)
Many Python extension modules expect a NUL terminated string. It would be difficult to implement substrings as slices into a larger string and preserve the low-level C API. Not impossible, as it could be done using a copy-on-C-API-access. Or Python could require all extension writers to use a new subslice-friendly API. But that complexity is not worthwhile given the problems found from experience in other languages which implement subslice references, as Dietrich Epp described.
I see little in Kevin's answer which is applicable to this question. The decision had nothing to do with the lack of circular garbage collection before Python 2.0, nor could it. Substring slices are implemented with an acyclic data structure. 'Competently-implemented' isn't a relevant requirement as it would take a perverse sort of incompetence or malice to turn it into a cyclic data structure.
Nor would there necessarily be extra branch overhead in the deallocator. If the source string were one type and the substring slice another type, then Python's normal type dispatcher would automatically use the correct deallocator, with no additional overhead. Even if there were an extra branch, we know that branching overhead in this case is not "expensive". Python 3.3 (because of PEP 393) has those 4 back-end Unicode types, and decides what to do based on branching. String access occurs much more often than deallocation, so any deallocation overhead due to branching would be lost in the noise.
It is mostly true that in CPython "variable names are internally stored as strings". (The exception is that local variables are stored as indices into a local array.) However, these names are also interned into a global dictionary using PyUnicode_InternInPlace(). There is therefore no deallocation overhead because these strings are not deallocated, outside of cases involving dynamic dispatch using non-interned strings, like through getattr().
I have a Telit module which runs Python 1.5.2+ (http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16k) and the amount of RAM (~ 1MB). Refer to pages 113 and 114 for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and the memory usage (stack and heap).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object[, default]), provided that it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even so, this doesn't always work and will raise a TypeError when an object's size cannot be determined if you don't specify the default parameter.
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
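The linked recipe is not reproduced here. As a rough sketch of the same idea on a modern CPython (the Telit interpreter is much older and may lack sys.getsizeof entirely), the following lists the names in a namespace with their types and shallow sizes. `symbol_table` is a made-up helper name and the sample namespace is invented.

```python
import sys

def symbol_table(namespace):
    """Return (name, type name, shallow size in bytes) for each entry, sorted by name."""
    return [
        (name, type(value).__name__, sys.getsizeof(value, -1))
        for name, value in sorted(namespace.items())
    ]

# Made-up namespace; pass globals() or vars(module) to inspect real code.
sample = {"counter": 0, "buffer": "x" * 100, "handlers": []}
for name, kind, size in symbol_table(sample):
    print(f"{name:10} {kind:6} {size}")
print("symbols:", len(sample))
```

The sizes are shallow: a list's size does not include the objects it refers to, so the totals are only a lower bound on real memory use.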
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post makes me recall the pain I once had with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this at the time. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer because the Python interpreter is not a full-fledged version. What I did was to reuse the same local variable names as much as possible. I also deleted the doc strings for functions (those count too) and replaced them with # comments.
In the end, I want to say that this module is good for small applications. The python interpreter does not support threads or interrupts so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.