Python execution process in depth - python

I'm beginning to learn the execution process of Python. I came across an article which attempts to explain the CPython virtual machine: https://tech.blog.aknin.name/2010/04/02/pythons-innards-introduction/
However, I find his writing lacking depth. When the command $ python -c 'print("Hello, world!")' is executed, am I correct to say that the Python interpreter will be called, and the source code print("Hello, world!") will pass through a series of lexing, parsing, and compilation stages, followed finally by execution on the virtual machine? Could you clarify which functions are called and exactly what they accomplish?
Any resources which point to an in-depth explanation are also welcome!
That said, let’s start with a bird’s-eye overview of what happens when you do this: $ python -c 'print("Hello, world!")'. Python’s binary is executed, the standard C library initialization that pretty much any process does happens, and then the main function starts executing (see its source, ./Modules/python.c: main, which soon calls ./Modules/main.c: Py_Main). After some mundane initialization stuff (parse arguments, see if environment variables should affect behaviour, assess the situation of the standard streams and act accordingly, etc.), ./Python/pythonrun.c: Py_Initialize is called. In many ways, this function is what ‘builds’ and assembles together the pieces needed to run the CPython machine and turns ‘a process’ into ‘a process with a Python interpreter in it’. Among other things, it creates two very important Python data structures: the interpreter state and thread state. It also creates the built-in module sys and the module which hosts all builtins. In later posts we will cover all of these in depth.
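You can also watch each of these stages from Python itself. Here is a minimal sketch using only standard-library modules (tokenize, ast, dis; the exact output varies by CPython version) that runs the same one-liner through lexing, parsing, compilation and execution:

import ast
import dis
import io
import tokenize

source = 'print("Hello, world!")'

# Lexing: split the source into a stream of tokens
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok.type, repr(tok.string))

# Parsing: build an abstract syntax tree from the tokens
tree = ast.parse(source)
print(ast.dump(tree))

# Compilation: turn the AST into a code object holding bytecode
code = compile(tree, "<cmdline>", "exec")
dis.dis(code)

# Execution: the virtual machine interprets the bytecode
exec(code)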

Related

Why use Python's os module methods instead of executing shell commands directly?

I am trying to understand the motivation behind using Python's library functions for executing OS-specific tasks such as creating files/directories, changing file attributes, etc., instead of just executing those commands via os.system() or subprocess.call().
For example, why would I want to use os.chmod instead of doing os.system("chmod...")?
I understand that it is more "Pythonic" to use Python's available library methods as much as possible instead of just executing shell commands directly. But is there any other motivation for doing this, from a functionality point of view?
I am only talking about executing simple one-line shell commands here. When we need more control over the execution of the task, I understand that using the subprocess module makes more sense, for example.
It's faster: os.system and subprocess.call create new processes, which is unnecessary for something this simple. In fact, os.system and subprocess.call with the shell argument usually create at least two new processes: the first one being the shell, and the second one being the command that you're running (if it's not a shell built-in like test).
Some commands are useless in a separate process. For example, if you run os.system("cd dir/"), it will change the current working directory of the child process, but not of the Python process. You need to use os.chdir for that.
You don't have to worry about special characters interpreted by the shell. os.chmod(path, mode) will work no matter what the filename is, whereas os.system("chmod 777 " + path) will fail horribly if the filename is something like ; rm -rf ~. (Note that you can work around this if you use subprocess.call without the shell argument.)
You don't have to worry about filenames that begin with a dash. os.chmod("--quiet", mode) will change the permissions of the file named --quiet, but os.system("chmod 777 --quiet") will fail, as --quiet is interpreted as an option. This is true even for subprocess.call(["chmod", "777", "--quiet"]).
You have fewer cross-platform and cross-shell concerns, as Python's standard library is supposed to deal with that for you. Does your system have a chmod command? Is it installed? Does it support the parameters that you expect it to support? The os module tries to be as cross-platform as possible and documents it when that's not possible.
If the command you're running has output that you care about, you need to parse it, which is trickier than it sounds, as you may forget about corner cases (filenames with spaces, tabs and newlines in them), even when you don't care about portability.
It is safer. To give you an idea, here is an example script:
import os
# Python 2 example: raw_input returns the user's text verbatim
file = raw_input("Please enter a file: ")
# The filename is pasted straight into a shell command line
os.system("chmod 777 " + file)
If the input from the user was test; rm -rf ~, this would then delete the home directory.
This is why it is safer to use the built-in function.
It is also why you should use subprocess instead of os.system.
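For contrast, a minimal sketch of the safe alternatives (the hostile "filename" is made up; neither call involves a shell, so the name is never interpreted as shell syntax):

import os
import subprocess

path = "; rm -rf ~"  # a hostile "filename" taken from user input

# Safe: os.chmod makes the system call directly; the name is just a name
os.chmod(path, 0o777)  # raises FileNotFoundError unless such a file exists

# Also safe: an argument list bypasses the shell entirely;
# "--" additionally guards against dash-prefixed names
subprocess.call(["chmod", "777", "--", path])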
There are four strong cases for preferring Python's more specific methods in the os module over using os.system or the subprocess module when executing a command:
Redundancy - spawning another process is redundant and wastes time and resources.
Portability - many of the methods in the os module are available on multiple platforms, while many shell commands are OS-specific.
Understanding the results - spawning a process to execute arbitrary commands forces you to parse the results from the output and determine if and why a command has done something wrong.
Safety - a process can potentially execute any command it's given. This is a weak design, and it can be avoided by using specific methods in the os module.
Redundancy (see redundant code):
You're actually executing a redundant "middle man" on your way to the eventual system calls (chmod in your example). This middle man is a new process or sub-shell.
From os.system:
Execute the command (a string) in a subshell ...
And subprocess is just a module to spawn new processes.
You can do what you need without spawning these processes.
Portability (see source code portability):
The os module's aim is to provide generic operating-system services, and its description starts with:
This module provides a portable way of using operating system dependent functionality.
You can use os.listdir on both Windows and Unix. Trying to use os.system / subprocess for this functionality will force you to maintain two calls (for ls / dir) and check what operating system you're on. This is not as portable and will cause even more frustration later on (see Understanding the command's results below).
Understanding the command's results:
Suppose you want to list the files in a directory.
If you're using os.system("ls") / subprocess.call(['ls']), you can only get the process's output back, which is basically a big string with the file names.
How can you tell a file with a space in its name apart from two separate files?
What if you have no permission to list the files?
How should you map the data to python objects?
These are only off the top of my head, and while there are solutions to these problems - why solve a problem again that has already been solved for you?
This is an example of following the Don't Repeat Yourself principle (often referred to as "DRY") by not repeating an implementation that already exists and is freely available to you.
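As a minimal sketch of the difference (the directory name is made up; error handling is only indicative):

import os
import subprocess

try:
    names = os.listdir("some_dir")  # a real list of str, one entry per file
except PermissionError as exc:
    print("cannot list directory:", exc)

# The shell route hands you one big string and no structured errors:
raw = subprocess.check_output(["ls", "some_dir"])
names_maybe = raw.decode().split("\n")  # breaks on names containing newlines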
Safety:
os.system and subprocess are powerful. It's good when you need this power, but it's dangerous when you don't. When you use os.listdir, you know it cannot do anything other than list files or raise an error. When you use os.system or subprocess to achieve the same behaviour, you can potentially end up doing something you did not mean to do.
Injection Safety (see shell injection examples):
If you use input from the user as a new command, you've basically given them a shell. This is much like SQL injection giving the user a shell inside the DB.
An example would be a command of the form:
# ... read some user input
os.system(user_input + " some continuation")
This can easily be exploited to run arbitrary code using the input NASTY COMMAND;# to create the eventual:
os.system("NASTY COMMAND; # some continuation")
There are many such commands that can put your system at risk.
For a simple reason - when you call a shell function, it creates a sub-shell which is destroyed after your command exits, so if you change directory in a shell, it does not affect your environment in Python.
Besides, creating a sub-shell is time-consuming, so shelling out to OS commands will hurt your performance.
EDIT
I ran some timing tests:
In [379]: %timeit os.chmod('Documents/recipes.txt', 0755)
10000 loops, best of 3: 215 us per loop
In [380]: %timeit os.system('chmod 0755 Documents/recipes.txt')
100 loops, best of 3: 2.47 ms per loop
In [382]: %timeit call(['chmod', '0755', 'Documents/recipes.txt'])
100 loops, best of 3: 2.93 ms per loop
The internal function runs more than 10 times faster.
EDIT2
There may be cases when invoking an external executable yields better results than a Python package - I just remembered an email from a colleague of mine noting that the performance of gzip called through subprocess was much higher than that of the Python package he used. But that is certainly not the case when we are talking about standard os module functions emulating standard OS commands.
Shell calls are OS-specific, whereas Python os module functions are not, in most cases. They also avoid spawning a subprocess.
It's far more efficient. The "shell" is just another OS binary which contains a lot of system calls. Why incur the overhead of creating the whole shell process just for that single system call?
The situation is even worse when you use os.system for something that's not a shell built-in. You start a shell process which in turn starts an executable which then (two processes away) makes the system call. At least subprocess would have removed the need for a shell intermediary process.
This is not specific to Python, either. systemd is such an improvement to Linux startup times for the same reason: it makes the necessary system calls itself instead of spawning a thousand shells.
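A minimal sketch of the process counts involved (the file name f is made up):

import os
import subprocess

os.system("chmod 644 f")                # python -> sh -> chmod: two extra processes
subprocess.call(["chmod", "644", "f"])  # python -> chmod: one extra process
os.chmod("f", 0o644)                    # a direct system call: no extra process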

Finding a line of a C module called by a python script that segfaults

I have a caller.py which repeatedly calls routines from some_c_thing.so, which was created from some_c_thing.c. When I run it, it segfaults - is there a way for me to detect which line of C code is segfaulting?
This might work:
Make sure the native library is compiled with debug symbols (the -g switch for gcc).
Run python under gdb and let it crash:
gdb --args python caller.py
run # tell gdb to run the program
# script runs and crashes
bt # print backtrace, which should show the crashing line
If the crash happens in the native library code, then this should reveal the line.
If the native library code just corrupts something or violates some postconditions, and the crash happens in the Python interpreter's code, then this will not be helpful. In that case your options are code review, adding debug prints (the first step would be to just log entry and exit of each C function to detect which is the last C function called before the crash, then adding more fine-grained logging for variable values, etc.), and finally using a debugger to see what happens, with the usual debugger techniques (breakpoints, stepping, watches...).
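If you also want the Python-side location of the crash, the standard library's faulthandler module (Python 3.3+) can dump the Python traceback when the process receives a fatal signal such as SIGSEGV. A minimal sketch, assuming it is enabled at the top of caller.py and that some_c_thing is imported as an extension module:

import faulthandler
faulthandler.enable()  # on SIGSEGV, print the Python-level traceback to stderr

import some_c_thing  # the crashing extension is imported and used as before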
Take Python and the .so file(s) out of the equation. See what params are being passed, if any, and call the routines from a debugger capable of stepping through C code and binaries.
Here is a link to an article describing a simple C debugging process, in case you're not familiar with debugging C (command line interface). Here is another link on using NetBeans to debug C. Also using Eclipse...
This could help: gdb: break in shared library loaded by python (might also turn out to be a dupe)
Segfault... check whether the number and types of the variables you pass to that C function (in the .so) are correct. If they're not aligned, a segfault is the usual result.

Possible to run a delayed code execution?

Is it possible to run a small set of code automatically after a script has run?
I am asking this because, for some reason, if I add this set of code into the main script, though it works, it displays a list of tab errors (the tab is already there, but Maya states that it cannot find it, or something of that sort).
I realized that after running my script, Maya seems to 'load' its own setup of refreshing, along with some plugins made by my company. As such, if I run the small set of code after my main script execution and the Maya/plugin 'refresher', it works with no problem. I would like to make the process as automated as possible, all within a script if that is possible...
Thus, is it possible to do so? Like a delayed sort of coding method?
FYI, the main script's execution time depends on the number of elements in the scene. The more there are, the longer it takes...
Maya has a command, maya.cmds.evalDeferred, that is meant for this purpose. It waits until no more Maya processing is pending and then evaluates the given code.
You can also use maya.cmds.scriptJob for the same purpose.
Note: while eval is considered dangerous and insecure, in a Maya context it's really normal, mainly because everything in Maya is inherently insecure, as nearly all GUI items are just eval commands that the user may modify. So the second you let anybody use your Maya shell, your security is breached.
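A minimal sketch (assuming it runs inside Maya's Python interpreter; post_setup is a made-up name for your small set of code):

import maya.cmds as cmds

def post_setup():
    # the small set of code that must run after the main script
    # and Maya's own deferred refresh have finished
    print("running deferred step")

# evalDeferred also accepts a string of code instead of a callable
cmds.evalDeferred(post_setup)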

Is it possible to pass a callable programmatically to a python instance (invoked with different permissions)

Assume I have the Python code

def my_great_func(an_arg):
    a_file = open("/user/or/root/file", "w")
    a_file.write("bla")
which I want to maintain without paying attention to invocation with or without privileges. At the same time, I don't want to invoke the script with sudo / enforce invocation with sudo (although this would be a legitimate practice) or enable setuid for my Python interpreter (generally a bad idea...). An idea is now to start a second instance of the Python interpreter and communicate over processes/pipes. In order to maximize the maintainability of the code, it would be nice to simply pass the callable to the instance (e.g. started with subprocess.Popen and addressed by its PID) like I would pass it to multiprocessing.Process (which I can't use because I can't setuid in the subprocess). I imagine something like
# please consider this pseudo python code
pid = subprocess.Popen(["sudo", "python"]).get_pid()
thelib.pass_callable(pid, target, args)
or even
interpreter_instance = greatlib.Python(target, args)
interpreter_instance.start()
interpreter_instance.wait()
Is that possible and covered by existing libs?
Generally speaking, you don't want any script to run as Super User unless the script invoking it was itself run as Super User. This is not only an issue of good practice and secure programming, but also of programmer etiquette: if any part of your program requires Super User rights, this intention should be made known before you even begin the program.
With that in mind, the Python thread library should work just fine for this.
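If you do want to experiment with passing a callable to a privileged child interpreter, here is a minimal sketch (not an existing library; run_privileged and the inline runner are made up). It relies on the fact that pickle serializes a module-level function by reference, so the child must be able to import the same module the function lives in:

import pickle
import subprocess
import sys

# A tiny runner executed by the privileged child: it unpickles a
# (function, args) pair from stdin and calls it.
RUNNER = ("import pickle, sys\n"
          "func, args = pickle.load(sys.stdin.buffer)\n"
          "func(*args)\n")

def run_privileged(func, *args):
    proc = subprocess.Popen(["sudo", sys.executable, "-c", RUNNER],
                            stdin=subprocess.PIPE)
    proc.communicate(pickle.dumps((func, args)))
    return proc.returncode

run_privileged(my_great_func, "some arg") would then execute the function as root, subject to sudo's password prompt.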

How to control gdb within C or Python code without the GDB Python API?

I am trying to write a program in Python or C that can debug C code by using gdb.
I've read Tom's solution to Invoke and control GDB from Python. But those are more or less solutions for scripting gdb in Python. Since I am going to use an arm-gdb to debug an embedded program, I cannot enable Python scripting in my gdb.
My goal is to create a high-level abstraction of gdb: for example, launch gdb, set some breakpoints and continue, from within my code. I have also read some material on the gdb/mi interface. Could anyone tell me how to use the gdb/mi interface to create a gdb process and communicate with gdb from C/Python code? (Luckily, my arm-gdb supports the gdb/mi interface.)
As promised in the comments above, I have published my (early, incomplete, almost certainly buggy) ruby work to http://github.com/mcarpenter/rubug.
Here's an example (you can find this in examples/breakpoint). Function check_for_crash is a callback that may be invoked after the program called factorial is set running. The breakpoint takes a function name (fac; the leading colon just indicates that this is a Ruby symbol, which to all intents and purposes here is a lightweight string).
EXE = 'factorial'

def check_for_crash(gdb, event)
  case event.type
  when :command_response
    raise RuntimeError, 'oops' unless
      [ :done, :running ].include? event.response.result
  when :breakpoint
    puts 'Breakpoint reached'
    pp event
    gdb.continue
  when :exit
    puts 'Exit'
    gdb.stop_event_loop
    exit
  end
end

gdb = Rubug::Gdb.new
resp = gdb.file EXE
gdb.register_callback(method :check_for_crash)
gdb.break(:fac)
gdb.run '5 > /dev/null'
gdb.start_event_loop
It is only fair to warn you that the code may be... crufty. Currently (this is where I stopped) nothing much works (subsequent to a gdb update midway through my work; see Grammar below).
There are a bunch of examples in the directory of the same name that might prove helpful, however. To (attempt to!) run them, you will need to do something like this:
rake clean
rake grammar
rake make
cd examples/simple_fuzzer
ruby -I ../../lib -r rubygems simple_fuzzer.rb
Given the time that this was written, you should probably go with ruby 1.8 if you have the choice (I wasn't into 1.9 at the time and there are probably string encoding issues under 1.9).
Parsing of responses is performed by treetop (http://treetop.rubyforge.org), a PEG parser. Looking at the grammar with fresh eyes, I'm sure that it could be simplified. You will need to install this (and any other required gems) using gem install ....
Some more tips if you do Pythonize:
Documentation
There is little outside "Debugging with GDB" (ch. 22). I've thrown this PDF, and just ch. 22 as a separate file, into the docs section of the repository.
Async
The protocol is asynchronous (at first I assumed this was a command/response type protocol; this was a mistake). If I were to re-implement this, I'd probably use something like EventMachine or libevent rather than rolling my own select() loop.
Grammar
The grammar is a little... confusing. Although the documentation (27.2.2) states that a response "consists of zero or more out of band records followed, optionally, by a single result record":
`output -> ( out-of-band-record )* [ result-record ] "(gdb)" nl`
you should be aware that, since anything can arrive at any time, a read() on the socket can apparently return async/result/more async/terminator(!). For example, I see this with my current gdb:
=thread-group-started,id="i1",pid="1086"
=thread-created,id="1",group-id="i1"
^running
*running,thread-id="all"
(gdb)
The line starting with ^ is a result record; all the others are async records (then the terminator). This seems like a fairly significant flaw in the specification.
Speed
My main focus is security, and I was interested in MI for automated fuzzing, binary inspection, etc. For this purpose GDB/MI is too slow (the cost of starting the program in the debugger). YMMV.
MI / CLI mapping
There were some things in the standard gdb CLI command set that I could not see how to implement using MI commands. I have skeleton code for something like this:
gdb = Gdb::MI.new
gdb.cli(:file, '/bin/ls')
gdb.cli(:set, :args, '> /dev/null')
gdb.cli(:run)
gdb.cli(:quit)
(which is nice and clear, I think, for us non-MI-expert-but-gdb-knowledgeable users). I can't now remember what those problematic things were (it's over a year since I looked at this), but if those neurons do fire I'll come back and update this.
Alternatives
When I first started out on this road I found a blog posting from Jamis Buck: http://weblog.jamisbuck.org/2006/9/25/gdb-wrapper-for-ruby. This wraps a gdb command line session in popen(), which made me wince a little. In particular, one might expect it to be brittle, since gdb makes no guarantees about the stability of the CLI output. You may (or may not) prefer this approach.
If you're on Windows then PyDbg / PaiMei may be of interest: http://code.google.com/p/paimei/
You might also like the book Gray Hat Python: Python Programming for Hackers and Reverse Engineers (Seitz). Again, mostly Windows-based, but it might prove inspirational.
The links you listed are more about invoking Python from GDB, but you're asking how to invoke GDB from Python or C. The GDB/MI interface is definitely the way to go. Eclipse, Emacs, and KDevelop use GDB/MI to abstract the debugging interface. I've personally used KDevelop with three different cross-compiled gdb versions for ARM, AVR and H8S. The MI protocol is designed to be parsed by software, so the syntax is very regular.
A Google search yielded a Python GDB wrapper that should get you started.
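If you go the Python route, a minimal sketch of driving GDB/MI directly over pipes looks like this (the factorial program and the fac breakpoint are carried over from the Ruby example above; a real wrapper should parse the MI records properly rather than reading line by line, given the async caveats already mentioned):

import subprocess

gdb = subprocess.Popen(["gdb", "--interpreter=mi2", "./factorial"],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                       text=True)

def read_until_prompt():
    # Collect output records until GDB prints its "(gdb)" terminator
    records = []
    for line in gdb.stdout:
        if line.strip() == "(gdb)":
            break
        records.append(line.rstrip())
    return records

read_until_prompt()  # discard the startup banner

def mi(command):
    # Send one MI command and return the records it produced
    gdb.stdin.write(command + "\n")
    gdb.stdin.flush()
    return read_until_prompt()

print(mi("-break-insert fac"))  # set a breakpoint on fac
print(mi("-exec-run"))          # start the program under the debugger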
What about using pexpect (http://www.noah.org/python/pexpect/)? It is the Python version of Expect (http://en.wikipedia.org/wiki/Expect), which is very useful for automating tasks with external commands.
