Python C API and Unicode strings - python

I need to convert between Python objects and C strings of various encodings. Going from a C string to a Unicode object was fairly simple using PyUnicode_Decode, but I'm not sure how to go the other way.
//char* can be a wchar_t or any other element size, just make sure it is correctly terminated for its encoding
Unicode(const char *str, size_t bytes, const char *encoding="utf-16", const char *errors="strict")
:Object(PyUnicode_Decode(str, bytes, encoding, errors))
{
//check for any python exceptions
ExceptionCheck();
}
I want to create another function that takes the Python Unicode string and puts it in a buffer using a given encoding, e.g.:
//fills buffer with a null terminated string in encoding
void AsCString(char *buffer, size_t bufferBytes,
const char *encoding="utf-16", const char *errors="strict")
{
...
}
I suspect it has something to do with PyUnicode_AsEncodedString, however that returns a PyObject so I'm not sure how to put that into my buffer...
Note: both methods above are members of a C++ Unicode class that wraps the Python API
I'm using Python 3.0

"I suspect it has something to do with PyUnicode_AsEncodedString, however that returns a PyObject so I'm not sure how to put that into my buffer..."
The PyObject returned is a bytes object (a PyBytesObject in Python 3; in Python 2 it was a PyStringObject), so you just need to use PyBytes_Size and PyBytes_AsString (PyString_Size and PyString_AsString in Python 2) to get a pointer to the object's internal buffer and memcpy it to your own buffer.
If you're looking for a way to go directly from a PyUnicode object into your own char buffer, I don't think that you can do that.
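The encode-then-copy approach can be tried out without compiling anything by calling the same C API functions through ctypes.pythonapi. This is only a sketch under assumptions: it is CPython-only, the helper name as_c_string and the SPARE_BYTES constant are made up for illustration, and it zero-fills the buffer instead of writing an encoding-specific terminator.

```python
import ctypes

# CPython-only: ctypes.pythonapi exposes the C API of the running interpreter.
api = ctypes.pythonapi
api.PyUnicode_AsEncodedString.restype = ctypes.py_object
api.PyUnicode_AsEncodedString.argtypes = [ctypes.py_object, ctypes.c_char_p, ctypes.c_char_p]
api.PyBytes_Size.restype = ctypes.c_ssize_t
api.PyBytes_Size.argtypes = [ctypes.py_object]
api.PyBytes_AsString.restype = ctypes.c_void_p  # keep the raw address, no conversion
api.PyBytes_AsString.argtypes = [ctypes.py_object]

# Assumption: 4 zero bytes are enough to terminate UTF-8, UTF-16 and UTF-32 text.
SPARE_BYTES = 4


def as_c_string(text, buffer, encoding="utf-16", errors="strict"):
    """Encode text and copy it into a ctypes char buffer, zero-terminated.

    Returns the number of encoded bytes, not counting the terminator.
    """
    encoded = api.PyUnicode_AsEncodedString(
        text, encoding.encode("ascii"), errors.encode("ascii")
    )
    size = api.PyBytes_Size(encoded)
    if size + SPARE_BYTES > ctypes.sizeof(buffer):
        raise ValueError("buffer too small for the encoded string")
    # Zero-fill first so whatever follows the copied bytes acts as the terminator.
    ctypes.memset(buffer, 0, ctypes.sizeof(buffer))
    # This is the memcpy step: from the bytes object's storage into our buffer.
    ctypes.memmove(buffer, api.PyBytes_AsString(encoded), size)
    return size


buf = ctypes.create_string_buffer(32)
written = as_c_string("hello", buf, encoding="utf-8")
```

In C or C++ the same three calls apply directly; the only extra work is releasing the reference returned by PyUnicode_AsEncodedString with Py_DECREF once the bytes have been copied.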

Related

Why is PyBytes_AsStringAndSize() writing the wrong size byte array?

I am working on a research project that calls some Python functions from C, and I am trying to return a 256-byte bytearray from a Python script to my C program using the Python/C API. I want to store the returned byte array as a char array in my C program so I can later write it to a file. However, when I convert the bytes PyObject using PyBytes_AsStringAndSize(), it only appears to write 8 bytes of data despite my specifying 256. Could anyone explain what is causing this behaviour? I have tried scouring the documentation and searching online and haven't found help. Any ideas would be much appreciated!
int len = PyBytes_Size(pValue); //pValue is the object returned from our python function
printf("C:object returned. Length of pValue Object: %i\n", len);
PyObject *pBytes = PyBytes_FromObject(pValue);
int bLen = PyBytes_Size(pBytes);
printf("C:object converted to bytes. Length of pBytes: %i\n", bLen);
Py_ssize_t size = len;
char * test;
PyBytes_AsStringAndSize(pBytes,&test,&size); //Stores contents of returned pyobject into char array test
printf("Length of new byte array: %lu \n", sizeof(test));
I have looked all throughout the C/python api documentation and online but haven't found any clues so far.
Whenever the code is run, it produces the following output:
C:object returned. Length of pValue Object: 256
C:object converted to bytes. Length of pBytes: 256
Length of new byte array: 8
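A likely explanation (not taken from the thread itself): test is declared as char *, so sizeof(test) reports the size of a pointer, which is 8 on a 64-bit build, and not the amount of data behind it. The actual byte count is what PyBytes_AsStringAndSize() stores in size. The sketch below reproduces this from Python through ctypes.pythonapi; it is CPython-only, and payload is a stand-in for the object returned by the script.

```python
import ctypes

# CPython-only: call the real PyBytes_AsStringAndSize of the running interpreter.
api = ctypes.pythonapi
api.PyBytes_AsStringAndSize.restype = ctypes.c_int
api.PyBytes_AsStringAndSize.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_void_p),   # char **buffer
    ctypes.POINTER(ctypes.c_ssize_t),  # Py_ssize_t *length
]

payload = bytes(range(256))  # stand-in for the 256-byte object from the script

data_ptr = ctypes.c_void_p()
size = ctypes.c_ssize_t()
api.PyBytes_AsStringAndSize(payload, ctypes.byref(data_ptr), ctypes.byref(size))

# What sizeof(test) measures in the C code: the pointer itself.
pointer_size = ctypes.sizeof(ctypes.c_char_p)

# What should be used instead: the length reported through the out-parameter.
copied = ctypes.string_at(data_ptr.value, size.value)

print("pointer size:", pointer_size, "data length:", size.value)
```

Under that reading, all 256 bytes are already available through the pointer, and printing size (or writing size bytes to the file) gives the expected result. strlen() would not help either, since binary data may contain zero bytes.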

How to extract a memory address from inside a Python object

I'm using a binary Python library that returns a Buffer object. This object is basically a wrapper of a C object containing a pointer to the actual memory buffer. What I need is to get the memory address contained in that pointer from Python, the problem is that the Buffer object doesn't have a Python method to obtain it, so I need to do some hacky trick to get it.
For the moment I found an ugly and unsafe way to get the pointer value:
I know the internal structure of the C object:
typedef struct _Buffer {
PyObject_VAR_HEAD
PyObject *parent;
int type; /* GL_BYTE, GL_SHORT, GL_INT, GL_FLOAT */
int ndimensions;
int *dimensions;
union {
char *asbyte;
short *asshort;
int *asint;
float *asfloat;
double *asdouble;
void *asvoid;
} buf;
} Buffer;
So I wrote this Python code:
import ctypes
import struct
import sys
# + PyObject_VAR_HEAD size
# + 8 bytes from PyObject *parent
# + 4 bytes from int type
# + 4 bytes from int ndimensions
# + 8 bytes from int *dimensions
# = 24
offset = sys.getsizeof(0) + 24
buffer_pointer_addr = id(buffer) + offset
buffer_pointer_data = ctypes.string_at(buffer_pointer_addr, 8)
buffer_pointer_value = struct.unpack('Q', buffer_pointer_data)[0]
This is working consistently for me. As you can see I'm getting the memory address of the Python Buffer object with id(buffer), but as you may know that's not the actual pointer to the buffer, but just a Python number that in CPython happens to be the memory address to the Python object.
So then I'm adding the offset that I calculated by adding the sizes of all the variables in the C struct. I'm hardcoding the byte sizes (which is obviously completely unsafe) except for the PyObject_VAR_HEAD, that I get with sys.getsizeof(0).
By adding the offset I get the memory address that contains the pointer to the actual buffer, then I use ctypes to extract it with ctypes.string_at hardcoding the size of the pointer as 8 bytes (I'm on a 64bit OS), then I use struct.unpack to convert it to an actual Python int.
So now my question is: how could I implement a safer solution without hardcoding all the sizes? (if it exists). Maybe something with ctypes? It's OK if it only works on CPython.
I found a safer solution after reading up on C struct padding. It relies on the following assumptions:
The code will only be used on CPython.
The buffer pointer is the last member of the C struct.
The size of the buffer pointer can be taken from the void * C type, since no member of the union in the C struct is larger. In any case, all data pointer types have the same size on most modern platforms.
The C struct members are exactly the ones shown in the question.
Given these assumptions and the rules found here: https://stackoverflow.com/a/38144117/8861787,
there will be no padding at the end of the struct, so we can extract the pointer without hardcoding anything:
import ctypes
import sys
# Get the size of the Buffer Python object
buffer_obj_size = sys.getsizeof(buffer)
# Get the size of void * C-type
buffer_pointer_size = ctypes.sizeof(ctypes.c_void_p)
# Calculate the address to the pointer assuming that it's at the end of the C Struct
buffer_pointer_addr = id(buffer) + buffer_obj_size - buffer_pointer_size
# Get the actual pointer value as a Python Int
buffer_pointer_value = (ctypes.c_void_p).from_address(buffer_pointer_addr).value
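Another option along the "maybe something with ctypes" line is to mirror the C struct as a ctypes.Structure and let ctypes work out sizes, alignment and offsets. This is a sketch, not a verified layout: the class name BufferLayout and the helper are hypothetical, and it assumes a standard CPython build where PyObject_VAR_HEAD expands to ob_refcnt, ob_type and ob_size (debug or free-threaded builds use a different header).

```python
import ctypes


class BufferLayout(ctypes.Structure):
    """Hypothetical ctypes mirror of the C Buffer struct from the question."""

    _fields_ = [
        # PyObject_VAR_HEAD (assumed layout of a standard CPython build)
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        # Members of the Buffer struct itself
        ("parent", ctypes.c_void_p),
        ("type", ctypes.c_int),
        ("ndimensions", ctypes.c_int),
        ("dimensions", ctypes.c_void_p),
        ("buf", ctypes.c_void_p),  # the union holds only pointers
    ]


def buffer_pointer_value(object_address):
    """Read the buf pointer of a Buffer-shaped struct located at object_address."""
    return BufferLayout.from_address(object_address).buf


# Offset of the pointer, computed by ctypes instead of being added up by hand.
buf_offset = BufferLayout.buf.offset
```

With the real object this would be called as buffer_pointer_value(id(buffer)). It still depends on the struct definition matching the library's actual one, so it removes the hand-computed sizes but not the underlying fragility.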

Passing byte string from Python to C

I am writing a python extension in C and I am trying to pass a bytes object to my function. Obviously the 's' token is for strings; I have tried 'O', 'N', and a few others with no luck. Is there a token I can use to parse a bytes object? If not is there an alternative method to parse bytes objects?
static PyObject *test(PyObject *self, PyObject *args)
{
char *dev;
uint8_t *key;
if(!PyArg_ParseTuple(args, "ss", &dev, &key))
return NULL;
printf("%s\n", dev);
for (int i = 0; i < 32; i++)
{
printf("Val %d: %d\n", i, key[i]);
}
Py_RETURN_NONE;
}
Calling from python: test(b"device", f.read(32)).
If you read the parsing format string docs, it's pretty clear.
s is solely for getting a NUL terminated UTF-8 encoded C-style string from a str object (so it's appropriate for your first argument, but not your second).
y* is specifically called out in the docs with (emphasized in original text):
This is the recommended way to accept binary data.
y# would also work, at the expense of requiring the caller to provide immutable bytes-like objects, excluding stuff like bytearray and mmap.mmap objects.
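To see from the Python side which arguments count as "bytes-like", one rough check is whether memoryview() accepts them, since both memoryview and y* rely on the buffer protocol. This is an illustration rather than an exact model of y* (which also requires a contiguous buffer); exports_buffer is a made-up helper name.

```python
import mmap


def exports_buffer(obj):
    """Return True if obj exports the buffer protocol (is bytes-like)."""
    try:
        with memoryview(obj):
            return True
    except TypeError:
        return False


candidates = {
    "bytes": b"\x00" * 32,
    "bytearray": bytearray(32),
    "mmap": mmap.mmap(-1, 32),  # anonymous 32-byte mapping
    "str": "device",
}

results = {name: exports_buffer(obj) for name, obj in candidates.items()}
candidates["mmap"].close()

print(results)
```

The str entry is the only one rejected, which matches the split described above: s for text, y* for binary data such as the result of f.read(32).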

Passing binary data from Python to C API extension

I'm writing a Python (2.6) extension, and I have a situation where I need to pass an opaque binary blob (with embedded null bytes) to my extension.
Here is a snippet of my code:
from authbind import authenticate
creds = 'foo\x00bar\x00'
authenticate(creds)
which throws the following:
TypeError: argument 1 must be string without null bytes, not str
Here is some of authbind.cc:
static PyObject* authenticate(PyObject *self, PyObject *args) {
const char* creds;
if (!PyArg_ParseTuple(args, "s", &creds))
return NULL;
}
So far, I have tried passing the blob as a raw string, like creds = '%r' % creds, but that not only gives me embedded quotes around the string but also turns the \x00 bytes into their literal string representations, which I do not want to mess around with in C.
How can I accomplish what I need? I know about the y, y# and y* PyArg_ParseTuple() format characters in 3.2, but I am limited to 2.6.
OK, I figured out a solution with the help of this link.
I used a PyByteArrayObject (docs here) like this:
from authbind import authenticate
creds = 'foo\x00bar\x00'
authenticate(bytearray(creds))
And then in the extension code:
static PyObject* authenticate(PyObject *self, PyObject *args) {
PyByteArrayObject *creds;
if (!PyArg_ParseTuple(args, "O", &creds))
return NULL;
char* credsCopy;
credsCopy = PyByteArray_AsString((PyObject*) creds);
}
credsCopy now points to the bytearray's internal buffer (despite the name, no copy is made), which holds the bytes exactly as they are needed.
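One thing to keep in mind with this approach: because the blob contains embedded null bytes, its length has to come from PyByteArray_Size() and not from strlen(). The sketch below shows the difference by calling both C API functions through ctypes.pythonapi. It is CPython-only and written for Python 3 so that it can be run as is; the two functions exist in 2.6 as well.

```python
import ctypes

# CPython-only: call the bytearray C API of the running interpreter.
api = ctypes.pythonapi
api.PyByteArray_AsString.restype = ctypes.c_void_p  # raw address of the internal buffer
api.PyByteArray_AsString.argtypes = [ctypes.py_object]
api.PyByteArray_Size.restype = ctypes.c_ssize_t
api.PyByteArray_Size.argtypes = [ctypes.py_object]

creds = bytearray(b"foo\x00bar\x00")

ptr = api.PyByteArray_AsString(creds)
length = api.PyByteArray_Size(creds)

# Reading with the explicit length recovers the whole blob.
whole_blob = ctypes.string_at(ptr, length)

# Reading it like a C string (as strlen or printf("%s") would) stops at the first NUL.
c_string_view = ctypes.string_at(ptr)

print(length, whole_blob, c_string_view)
```

The pointer is only valid while the bytearray is alive and not resized, so an extension that needs the data later should memcpy length bytes into its own storage.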

Python to C/C++ const char question

I am extending Python with some C++ code.
One of the functions I'm using has the following signature:
int PyArg_ParseTupleAndKeywords(PyObject *arg, PyObject *kwdict,
char *format, char **kwlist, ...);
(link: http://docs.python.org/release/1.5.2p2/ext/parseTupleAndKeywords.html)
The parameter of interest is kwlist. In the link above, examples on how to use this function are given. In the examples, kwlist looks like:
static char *kwlist[] = {"voltage", "state", "action", "type", NULL};
When I compile this using g++, I get the warning:
warning: deprecated conversion from string constant to ‘char*’
So, I can change the static char* to a static const char*. Unfortunately, I can't change the Python code. So with this change, I get a different compilation error (can't convert char** to const char**). Based on what I've read here, I can turn on compiler flags to ignore the warning or I can cast each of the constant strings in the definition of kwlist to char *. Currently, I'm doing the latter. What are other solutions?
Sorry if this question has been asked before. I'm new.
Does PyArg_ParseTupleAndKeywords() expect to modify the data you are passing in? Normally, in idiomatic C++, a const <something> * points to an object that the callee will only read from, whereas <something> * points to an object that the callee can write to.
If PyArg_ParseTupleAndKeywords() expects to be able to write to the char * you are passing in, you've got an entirely different problem over and above what you mention in your question.
Assuming that PyArg_ParseTupleAndKeywords does not want to modify its parameters, the idiomatically correct way of dealing with this problem would be to declare kwlist as const char *kwlist[] and use const_cast to remove its const-ness when calling PyArg_ParseTupleAndKeywords() which would make it look like this:
PyArg_ParseTupleAndKeywords(..., ..., ..., const_cast<char **>(kwlist), ...);
There is an accepted answer from seven years ago, but I'd like to add an alternative solution, since this topic seems to be still relevant.
If you don't like the const_cast solution, you can also create a write-able version of the string array.
char s_voltage[] = "voltage";
char s_state[] = "state";
char s_action[] = "action";
char s_type[] = "type";
char *kwlist[] = {s_voltage, s_state, s_action, s_type, NULL};
The char name[] = ".." form copies your string to a writable location.
