Let's say I have a Python program that does the following:
do_some_long_data_crunching_computation()
call_some_fiddly_new_crashing_function()
Is there a way to "freeze" and serialize the state of the Python program (all globals and such) right after the first long computation returns, so I can iterate on the development of the new function and restart execution from that point?
This is somewhat possible, if you are running the program from the interpreter, but is there any other way?
Well, no.
The point is that Python and your OS aren't free of side effects (Python isn't even remotely a functional language, though it has some features of functional languages), so restoring the full state of your program doesn't really work in the general case. You'd basically have to re-run the program with exactly the machine state you had when you started it the last time.
What you can do instead is save, after your long operation, the state of the variables that are important to you, using pickle or similar.
Now, I know you'd like to avoid taking care of what to store and how to restore it, but that's basically a sign of unclean design: you should, as far as possible, store the state of your computation in a single state object; serializing and deserializing that object is then easy. Don't store computational state in globals!
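As a rough sketch of that idea (the State class and the checkpoint file name are made up for this example; the two functions are the ones from the question), you could checkpoint a single state object with pickle and skip the expensive step on the next run:

import os
import pickle

CHECKPOINT = "state.pkl"   # hypothetical checkpoint file

class State(object):
    """One object that holds everything the rest of the program needs."""
    def __init__(self, crunched_data):
        self.crunched_data = crunched_data

if os.path.exists(CHECKPOINT):
    # Resume: skip the long computation entirely.
    with open(CHECKPOINT, "rb") as f:
        state = pickle.load(f)
else:
    state = State(do_some_long_data_crunching_computation())
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

# This part can now crash and be re-run as often as needed.
call_some_fiddly_new_crashing_function(state)

Delete the checkpoint file whenever the long computation itself changes, or the resumed state will be stale.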
I occasionally have Python programs that take a long time to run, and that I want to be able to save the state of and resume later. Does anyone have a clever way of saving the state either every x seconds, or when the program is exiting?
Put all of your "state" data in one place and use pickle.
The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling”, or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
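For the "every x seconds or on exit" part, a minimal sketch (the state dict, save_state, and the file name are invented for the example) is to route all saves through one function, register it with atexit, and also call it from the main loop on a timer:

import atexit
import pickle
import time

state = {"progress": 0, "results": []}   # all mutable state lives here

def save_state(path="state.pkl"):
    with open(path, "wb") as f:
        pickle.dump(state, f)

atexit.register(save_state)   # saved on normal interpreter exit

SAVE_INTERVAL = 60            # seconds between periodic saves
last_save = time.time()

for item in range(1000000):   # stand-in for the real work loop
    state["progress"] = item
    # ... do the actual work and append to state["results"] ...
    if time.time() - last_save > SAVE_INTERVAL:
        save_state()
        last_save = time.time()

Note that atexit handlers don't run if the process is killed with SIGKILL, so the periodic save is what protects long runs.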
If you want to save everything, including the entire namespace and the line of code currently executing to be restarted at any time, there is not a standard library module to do that.
As another poster said, the pickle module can save pretty much everything into a file and then load it again, but you would have to specifically design your program around the pickle module (i.e. saving your "state" -- including variables, etc -- in a class).
If you're OK with OOP, consider giving each class a method that writes a serialized version of the instance (using pickle) to a file. Then add a second method that loads that data back into the instance; if the pickled file is there, call the load method instead of the processing one.
I use this approach for ML and it really speeds up my workflow.
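A sketch of that pattern (the class name, cache file, and the fit step are all made up for the example):

import os
import pickle

class Model(object):
    CACHE = "model.pkl"   # hypothetical cache file

    def fit(self, data):
        # Stand-in for the slow training / processing step.
        self.params = sum(data)
        return self

    def save(self):
        with open(self.CACHE, "wb") as f:
            pickle.dump(self, f)

    @classmethod
    def load(cls):
        with open(cls.CACHE, "rb") as f:
            return pickle.load(f)

# Use the cached result when it exists, otherwise compute and cache it.
if os.path.exists(Model.CACHE):
    model = Model.load()
else:
    model = Model().fit([1, 2, 3])
    model.save()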
In the traditional programming approach, the obvious way to save the state of variables or objects at some point of execution is serialization.
So if you want to run the program starting from some heavy, already-computed state, you only need to start from the deserialization part.
These steps are mostly needed in data-science modelling, where we load a huge CSV or other data, compute on it, and do not want to recompute it every time we run the program.
http://jupyter.org/ - a tool which does this serialization/deserialization automatically, without any manual work on your part.
All you need to do is execute a selected portion of your Python code, say lines 10-15, which depend on the previous lines 1-9. Jupyter keeps the state produced by lines 1-9 for you. Explore a tutorial and give it a try.
So I'm writing a function that takes in a tuple as an argument and does a bunch of stuff to it. Here is what that looks like:
def swap(self, location):
    if (location[0] < 0 or location[1] < 0 or
            location[0] >= self.r or location[1] >= self.c):
        return False
    self.board[0][0] = self.board[location[0]][location[1]]
    self.board[location[0]][location[1]] = 0
    self.empty = (location[0], location[1])
I'm trying to make my code as efficient as possible, so since I am not modifying the values of location, does it make sense to load the variables in registers (loc0 = location[0]; loc1 = location[1]) for faster computations (zero-cycle read) or is location already loaded into registers by the Python compiler when it's passed in as a function argument?
Edit: I bit the bullet and ran some tests. Here are the results (in seconds) for this function running 10 million times with the repeating inputs: "up", "down", "left", "right" (respectively)
Code as is:
run#1: 19.39
run#2: 17.18
run#3: 16.85
run#4: 16.90
run#5: 16.74
run#6: 16.76
run#7: 16.94
Code after defining location[0] and location[1] in the beginning of the function:
run#1: 14.83
run#2: 14.79
run#3: 14.88
run#4: 15.033
run#5: 14.77
run#6: 14.94
run#7: 14.67
That's an average of 16% increase in performance. Definitely not insignificant for my case. Of course, this is not scientific as I need to do more tests in more environments with more inputs, but enough for my simple use case!
Times measured using Python 2.7 on a Macbook Pro (Early 2015), which has a Broadwell i5-5257U CPU (2c4t max turbo 3.1GHz, sustained 2.7GHz, 3MB L3 cache).
The IDE was PyCharm Edu 3.5.1 (JRE 1.8.0_112-release-408-b6 x86_64, JVM: OpenJDK 64-Bit Server VM).
Unfortunately, this is for a class that grades based on code speed.
If you're using an interpreter, it's unlikely that any Python variables will live in registers between different expressions. You could look at how the Python source compiles to byte-code.
Python bytecode (the kind stored in files outside the interpreter) is stack-based (http://security.coverity.com/blog/2014/Nov/understanding-python-bytecode.html). This byte-code is then interpreted or JIT-compiled to native machine code. Regular CPython only interprets, so it's not plausible for it to keep Python variables in machine registers across multiple statements.
An interpreter written in C might keep the top of the bytecode stack in a local variable inside an interpret loop, and the C compiler might keep that C variable in a register. So repeated use of the same Python variable might end up not having too many store/reload round-trips.
Note that store-forwarding latency on your Broadwell CPU is about 4 or 5 clock cycles, nowhere near the hundreds of cycles for a round-trip to DRAM. A store/reload doesn't even have to wait for the store to retire and commit to L1D cache; it's forwarded directly from the store buffer. Related: http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/ and http://agner.org/optimize/ (and other links in the x86 tag wiki). Load-use latency is also only about 5 clock cycles for an L1D cache hit (latency from the address being ready to the data being ready; you can measure it by pointer-chasing through a linked list in asm). There's enough interpreter overhead (the total number of instructions it runs to figure out what to do next) that this probably isn't even the bottleneck.
Keeping a specific python variable in a register is not plausible at all for an interpreter. Even if you wrote an interpreter in asm, the fundamental problem is that registers aren't addressable. An x86 add r14d, eax instruction has to have both registers hard-coded into the instruction's machine-code. (Every other ISA works the same way: register numbers are part of the machine-code for the instruction, with no indirection based on any data). Even if the interpreter did the work to figure out that it needed to "add reg-var #3 to reg-var #2" (i.e. decoding the bytecode stack operations back into register variables for an internal representation that it interprets), it would have to use a different function than any other combination of registers.
Given an integer, the only ways to get the value of the Nth register are branching to an instruction that uses that register, or storing all the registers to memory and indexing the resulting array. (Or maybe some kind of branchless compare and mask stuff).
Anyway, trying to do anything specific about this is not profitable, which is why people just write the interpreter in C and let the C compiler do a (hopefully) good job of optimizing the machine code that will actually run.
Or you write a JIT-compiler like Sun did for Java (the HotSpot VM). IDK if there are any for Python. See Does the Python 3 interpreter have a JIT feature?.
A JIT-compiler does actually turn the Python code into machine code, where register state mostly holds Python variables rather than interpreter data. Again, without a JIT compiler (or ahead-of-time compiler), "keeping variables in registers" is not a thing.
It's probably faster because it avoids the [] operator and other overhead (see Bren's answer, which you accepted)
Footnote: a couple ISAs have memory-mapped registers. e.g. AVR (8-bit RISC microcontrollers), where the chip also has built-in SRAM containing the low range of memory addresses that includes the registers. So you can do an indexed load and get register contents, but you might as well have done that on memory that wasn't holding architectural register contents.
The Python VM only uses a stack to execute its bytecode, and this stack is completely independent of the hardware stack. You can use dis to disassemble your code to see how your changes affect the generated bytecode.
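For example, a quick sketch (the functions are cut-down stand-ins for the one in the question) that lets you compare the bytecode of the two versions:

import dis

def repeated(location):
    # Indexes the tuple twice: two subscript operations in the bytecode.
    return location[0] + location[0]

def cached(location):
    # Indexes once, then reuses a cheap local variable for the rest.
    loc0 = location[0]
    return loc0 + loc0

dis.dis(repeated)
dis.dis(cached)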
It will be a little faster if you store these two variables:
loc0 = location[0]
loc1 = location[1]
Because there will be only two lookups instead of a fresh lookup every time location[0] or location[1] is used.
Btw, if you want to use Python, you shouldn't worry about performance at this low a level.
Those kinds of details are not part of the specified behavior of Python. As Ignacio's answer says, CPython does it one way, but that is not guaranteed by the language itself. Python's description of what it does is very far removed from low-level notions like registers, and most of the time it's not useful to worry about how what Python does maps onto those details. Python is a high-level language whose behavior is defined in terms of high-level abstractions, akin to an API.
In any case, doing something like loc0 = location[0] in Python code has nothing to do with setting registers. It's just creating a new Python name pointing at an existing Python object.
That said, there is a performance difference, because if you use location[0] everywhere, the actual lookup will (or at least may -- in theory a smart Python implementation could optimize this) happen again and again every time the expression location[0] is evaluated. But if you do loc0 = location[0] and then use loc0 everywhere, you know the lookup only happens once. In typical situations (e.g., location is a Python list or dict, you're not running this code gazillions of times in a tight loop) this difference will be tiny.
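If you want to measure that difference in isolation, a quick (and unscientific) sketch with timeit, using a made-up tuple:

import timeit

setup = "location = (3, 4)"

# Four subscript lookups every time:
repeated = "location[0] + location[1] + location[0] + location[1]"
# Two lookups, then cheap local-variable reads:
cached = "loc0 = location[0]; loc1 = location[1]; loc0 + loc1 + loc0 + loc1"

print(timeit.timeit(repeated, setup=setup, number=1000000))
print(timeit.timeit(cached, setup=setup, number=1000000))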
Don't flame me for this, but it's genuine. I am writing a multi-threaded Python application that runs for a very long time, typically 2-3 hours with 10 processes. The machine isn't slow; it's just a lot of calculations.
The issue is that sometimes the application will hang about 85-90% of the way there because of outside tools.
I've broken this test up into smaller pieces that then run successfully, but the long-running program hangs.
For example, let's say I have to analyze some data in a list that is 100,000,000 items long.
Breaking it up into twenty 5,000,000-item lists, all the smaller parts run to completion.
Trying to do the full 100,000,000-item job, it hangs towards the end. I use some outside tools that I cannot change, so I am just trying to see what's going on.
I set up DTrace and run
sudo dtrace -n 'syscall:::entry / execname == "python2.7" / { #[ustack()] = count() }'
on my program right when it hangs and I get an output like the code sample below.
libc.so.7`__sys_recvfrom+0xa
_socket.so`0x804086ecd
_socket.so`0x8040854ac
libpython2.7.so.1`PyEval_EvalFrameEx+0x52d7
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800b3317d
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800abb5a1
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`0x800aa3855
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`PyEval_EvalFrameEx+0x4de2
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
libpython2.7.so.1`0x800abb5a1
libpython2.7.so.1`PyObject_Call+0x64
libpython2.7.so.1`0x800aa3855
libpython2.7.so.1`PyObject_Call+0x64
That output just repeats over and over. I tried looking into the DTrace Python probes, but those seem busted two sides from Tuesday, so this might be the closest that I'll get.
My question: I have a fuzzy idea that libpython2.7.so.1 is the shared library that holds the function PyObject_Call, at a hex offset of 0x64.
Is that right?
How can I decipher this? I don't know what to even call this so that I can google for answers or guides.
You should probably start by reading Showing the stack trace from a running Python application.
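One approach along those lines (a sketch, assuming a Unix platform and CPython's sys._current_frames; the handler name is made up) is to install a signal handler that dumps the Python-level stacks of every thread, so you can see where the program is stuck from the Python side:

import signal
import sys
import traceback

def dump_stacks(signum, frame):
    # Print the Python-level stack of every thread in this process.
    for thread_id, stack in sys._current_frames().items():
        print("Thread %s:" % thread_id)
        traceback.print_stack(stack)

signal.signal(signal.SIGUSR1, dump_stacks)

# From a shell, while the program is hung:  kill -USR1 <pid>

Note that Python signal handlers only run between bytecodes, so if the process is stuck inside a blocking C call the dump may not appear until that call returns.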
Your specific
question was about the interpretation of DTrace's ustack() action and
so this reply may be more than you need. This is because one of the
design principles of DTrace is to show the exact state of a system.
So, even though you're interested in the Python aspect of your
program, DTrace is revealing its underlying implementation.
The output you've presented is a stack, which is a way of
describing the state of a thread at a specific point in its
execution. For example, if you had the code
void c(void) { pause(); }
void b(void) { c(); }
void a(void) { b(); }
and you asked for a stack whilst execution was within pause() then
you might see something like
pause()
c()
b()
a()
Whatever tool you use will find the current instruction and its
enclosing function before finding the "return address", i.e. the
point to which that function will eventually return; repeating this
procedure yields a stack. Thus, although the stack should be read
from the top to the bottom as a series of return addresses, it's typically
read in the other direction as a series of callers. Note that
subtleties in the way that the program's corresponding
instructions are assembled mean that this second interpretation
can sometimes be misleading.
To extend the example above, it's likely that a(), b() and c() are
all present within the same library --- and that there may be
functions with the same names in other libraries. Thus it's
useful to display, for each function, the object to which it
belongs. Thus the stack above could become
libc.so`pause()
libfoo.so`c()
libfoo.so`b()
libfoo.so`a()
This goes some way towards allowing a developer to identify how a
program ended up in a particular state: function c() in libfoo
has called pause(). However, there's more to be done: if c()
looked like
void c() {
pause();
pause();
}
then in which call to pause() is the program waiting?
The functions a(), b() and c() will be sequences
of instructions that will typically occupy a contiguous region of
memory. Calling one of the functions involves little more than
making a note of where to return when finished (i.e. the return
address) and then jumping to whichever memory address corresponds
to the function's start. Functions' start addresses and sizes are
recorded in a "symbol table" that is embedded in the object; it's
by reading this table that a debugger is able to find the function
that contains a given location such as a return address. Thus a
specific point within a function can be described by an offset,
usually expressed in hex, from the start. So an even better
version of the stack above might be
libc.so`pause()+0x12
libfoo.so`c()+0x42
libfoo.so`b()+0x12
libfoo.so`a()+0x12
At this point, the developer can use a "disassembler" on libfoo.so
to display the instructions within c(); comparison with c()'s
source code would allow him to reveal the specific line from which
the call to pause() was made.
Before concluding this description of stacks, it's worth making
one more observation. Given the presence of sufficient "debug
data" in a library such as libfoo, a better debugger would be able
to go the extra mile and display the source code file name and
line number instead of the hexadecimal offset for each "frame" in
the stack.
So now, to return to the stack in your question,
libpython(2.7.so.1) is a library whose functions perform the job
of executing a Python script. Functions in the Python script are
converted into executable instructions on the fly, so my guess is
that the fragment
libpython2.7.so.1`0x800b33250
libpython2.7.so.1`PyEval_EvalFrameEx+0x4e2f
libpython2.7.so.1`PyEval_EvalCodeEx+0x665
means that PyEval_EvalFrameEx() is functionality within libpython
itself that calls a Python function (i.e. something written in
Python) that resides in memory near the address 0x800b33250. A
simple debugger can see that this address belongs to libpython but
won't find a corresponding entry in the library's symbol table;
left with no choice, it simply prints the "raw" address.
So, you need to look at the Python script so see what it's
doing but, unfortunately, there's no indication of the names of
the functions in the Python component of the stack.
There are a few ways to proceed. The first is to find a
version of libpython, if one exists, with a "DTrace helper". This
is some extra functionality that lets DTrace see the state of the
Python program itself in addition to the surrounding
implementation. The result is that each Python frame would be
annotated with the corresponding point in the Python source code.
Another, if you're on Solaris, is to use pstack(1); this has
native support for Python.
Finally, try a specific Python debugger.
It's also worth pointing out that your dtrace invocation will show
you all the stacks seen, sorted by popularity, whenever the
program "python2.7" makes a system call. From your description,
this probably isn't what you want. If you're trying to understand
the behaviour of a hang then you probably want to start with a
single snapshot of the python2.7 process at the time of the
hang.
I have a Telit module which runs [Python 1.5.2+](http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module, and method names I can use (< 500), the size of each variable (16 KB), and the amount of RAM (~1 MB). Refer to pages 113 and 114 of that document for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and the memory usage (stack and heap usage).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object, [default]), provided that it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even still, this doesn't always work, and it will raise a TypeError when an object's size cannot be determined if you don't specify the default parameter.
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
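If dir()/globals() and sys.getsizeof() are available on your interpreter (getsizeof only appeared in CPython 2.6, so it may well be missing from the Telit firmware), a rough sketch of dumping top-level names with approximate sizes looks like this:

import sys

def dump_symbols(namespace):
    # Print each top-level name with the approximate size of its object.
    for name in sorted(namespace):
        obj = namespace[name]
        try:
            size = sys.getsizeof(obj)
        except TypeError:
            size = -1   # size could not be determined
        print("%-30s %6d bytes  %s" % (name, size, type(obj).__name__))

dump_symbols(globals())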
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post makes me recall my pain once with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this fact at the time. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this limit.
What you are looking for might not have a good answer, because the Python interpreter on the module is not a full-fledged version. What I did was to reuse the same local variable names as much as possible. I also deleted doc strings for functions (those count too) and replaced them with #comments.
In the end, I want to say that this module is good for small applications. The Python interpreter does not support threads or interrupts, so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.
I thought I once read on SO that Python will compile and run slightly more quickly if commonly called code is placed into methods or separate files. Does putting Python code in methods have an advantage over separate files or vice versa? Could someone explain why this is? I'd assume it has to do with memory allocation and garbage collection or something.
It doesn't matter. Don't structure your program around code speed; structure it around coder speed. If you write something in Python and it's too slow, find the bottleneck with cProfile and speed it up. How do you speed it up? You try things and profile them. In general, function call overhead in critical loops is high. Byte compiling your code takes a very small amount of time and only needs to be done once.
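If you haven't used it before, a minimal sketch of that profiling step (the slow_part function is invented for the example):

import cProfile

def slow_part():
    # Stand-in for whatever turns out to be the hot code.
    return sum(i * i for i in range(10 ** 6))

# Prints per-function call counts and times, sorted by cumulative time.
cProfile.run("slow_part()", sort="cumulative")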
No. Regardless of where you put your code, it has to be parsed once and compiled if necessary. The distinction between putting code in methods or in different files might make an insignificant performance difference, but you shouldn't worry about it.
About the only language right now where you have to worry about structuring your code "right" is JavaScript, because it has to be downloaded from the net to the client's computer. That's why there are so many compressors and obfuscators for it. This kind of thing isn't done with Python because it's not needed.
Two things:
Code in separate modules is compiled into bytecode the first time it is run and saved as a precompiled .pyc file, so it doesn't have to be recompiled on the next run as long as the source hasn't been modified since. This can give a small performance advantage, but only at program startup.
Also, Python stores variables etc. a bit more efficiently if they are placed inside functions instead of at the top level of a file. But I don't think that's what you're referring to here, is it?
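A small illustration of that second point (CPython output; exact opcode names vary by version): locals are read with LOAD_FAST, an indexed array access, while top-level names go through LOAD_GLOBAL / LOAD_NAME dictionary lookups.

import dis

x = 1

def use_global():
    return x + x   # each access of x is a LOAD_GLOBAL (name lookup in a dict)

def use_local():
    y = 1
    return y + y   # each access of y is a LOAD_FAST (indexed slot, a bit cheaper)

dis.dis(use_global)
dis.dis(use_local)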