I often have to write code in other languages that interact with C structs. Most typically this involves writing Python code with the struct or ctypes modules.
So I'll have a .h file full of struct definitions, and I have to manually read through them and duplicate those definitions in my Python code. This is time consuming and error-prone, and it's difficult to keep the two definitions in sync when they change frequently.
Is there some tool or library in any language (it doesn't have to be C or Python) which can take a .h file and produce a structured list of its structs and their fields? I'd love to be able to write a script to automatically generate my struct definitions in Python, and I don't want to have to process arbitrary C code to do it. Regular expressions would work great about 90% of the time and then cause endless headaches for the remaining 10%.
If you compile your C code with debugging (-g), pahole (git) can give you the exact structure layouts being used.
$ pahole /bin/dd
…
struct option {
    const char * name;    /*  0  8 */
    int has_arg;          /*  8  4 */

    /* XXX 4 bytes hole, try to pack */

    int * flag;           /* 16  8 */
    int val;              /* 24  4 */

    /* size: 32, cachelines: 1, members: 4 */
    /* sum members: 24, holes: 1, sum holes: 4 */
    /* padding: 4 */
    /* last cacheline: 32 bytes */
};
…
This should be quite a lot nicer to parse than straight C.
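As a rough sketch of how much easier that output is to handle, a couple of regular expressions suffice to pull out each member's declaration, offset, and size. The exact layout of pahole's output can vary between versions, so treat the pattern below as an assumption to adapt rather than a guaranteed parser:

```python
import re

# Matches pahole member lines of the form:
#   const char * name; /* 0 8 */
MEMBER_RE = re.compile(r'^\s*(.+?)\s*;\s*/\*\s*(\d+)\s+(\d+)\s*\*/')

def parse_members(pahole_output):
    """Return a list of (declaration, offset, size) tuples."""
    members = []
    for line in pahole_output.splitlines():
        m = MEMBER_RE.match(line)
        if m:
            members.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return members

sample = """
struct option {
    const char * name; /* 0 8 */
    int has_arg; /* 8 4 */
    int * flag; /* 16 8 */
    int val; /* 24 4 */
};
"""
print(parse_members(sample))
```

Lines that don't carry an offset/size comment (the struct header, holes, the summary comments) simply don't match, which is exactly what you want.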
> Regular expressions would work great about 90% of the time and then cause endless headaches for the remaining 10%.
The headaches happen in the cases where the C code contains syntax that you didn't think of when writing your regular expressions. Then you go back and realise that C can't really be parsed by regular expressions, and life becomes not fun.
Try turning it around: define your own simple format, which allows less tricks than C does, and generate both the C header file and the Python interface code from your file:
define socketopts
int16 port
int32 ipv4address
int32 flags
Then you can easily write some Python to convert this to:
typedef struct {
short port;
int ipv4address;
int flags;
} socketopts;
and also to emit a Python class which uses struct to pack/unpack three values (possibly two of them big-endian and the other native-endian, up to you).
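A minimal sketch of such a generator, using the field names and the int16/int32 keywords from the example above (the type tables are assumptions you would extend for your own needs):

```python
# Map the custom type keywords to C types and struct-module format codes.
C_TYPES = {'int16': 'short', 'int32': 'int'}
STRUCT_CODES = {'int16': 'h', 'int32': 'i'}

def generate(definition):
    """Turn a simple 'define <name> / <type> <field>' spec into a C
    typedef and a Python struct format string."""
    lines = [l.split() for l in definition.strip().splitlines()]
    name = lines[0][1]          # "define socketopts" -> "socketopts"
    fields = lines[1:]          # [["int16", "port"], ...]
    c_code = "typedef struct {\n"
    for ftype, fname in fields:
        c_code += "    %s %s;\n" % (C_TYPES[ftype], fname)
    c_code += "} %s;" % name
    fmt = ''.join(STRUCT_CODES[ftype] for ftype, _ in fields)
    return c_code, fmt

spec = """\
define socketopts
int16 port
int32 ipv4address
int32 flags
"""
c_code, fmt = generate(spec)
print(c_code)
print(fmt)   # format string for struct.pack/unpack
```

Since both outputs come from the same source file, the C header and the Python code can never drift out of sync.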
Have a look at Swig or SIP, which will generate interface code for you, or use ctypes.
Have you looked at Swig?
I have quite successfully used GCCXML on fairly large projects. You get an XML representation of the C code (including structures) which you can post-process with some simple Python.
ctypes-codegen or ctypeslib (same thing, I think) will generate ctypes Structure definitions (also other things, I believe, but I only tried structs) by parsing header files using GCCXML. It's no longer supported, but will likely work in some cases.
A friend of mine wrote a C parser for this task, which he uses together with cog.
I created a C program and chose to use Python as my GUI creator. I wrote the Python GUI and decided to create a struct:
typedef struct GUIcommunicator
{
    char action[MAX_SIZE];
    char params[MAX_PARAMS][MAX_SIZE];
} GUIcommunicator;
Using this answer, I want to return the struct from Python and receive it in my C code, but I don't know how, and I'd like some help.
(The function that sends the struct (the Python code) already has the action and the params as a list of strings.)
If you are developing a Linux-based application, you can perform inter-process communication using FIFO files.
For Windows-based applications, you could use named pipes.
Regardless of the direction of the call, you end up with a PyObject* in C that refers to a list of strings. Fetch each element, then fetch the string data pointer and size for a Python 2 str or (the UTF-8 encoding of) a Python 3 str. Check the size against MAX_SIZE (probably MAX_SIZE-1) and beware of embedded null characters, which your struct presumably cannot represent.
There are of course ways of accepting any iterable, converting elements to strings, etc., if you want that flexibility.
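On the Python side, one way to hand the data over in exactly the layout of the GUIcommunicator struct is to pack it into fixed-size, null-padded byte fields before writing it to the FIFO or pipe. This is only a sketch: MAX_SIZE and MAX_PARAMS are placeholders that must match the values in your C header, and it assumes the struct contains no padding between the char arrays (true here, since char arrays have no alignment requirements):

```python
import struct

MAX_SIZE = 64     # placeholder: must match the #define in the C header
MAX_PARAMS = 4    # placeholder: must match the #define in the C header

def pack_gui_command(action, params):
    """Serialize action and params into the byte layout of
    struct GUIcommunicator (each char array null-padded to MAX_SIZE)."""
    if len(params) > MAX_PARAMS:
        raise ValueError("too many params")
    fmt = '%ds' % MAX_SIZE * (1 + MAX_PARAMS)   # e.g. '64s' repeated 5 times
    padded = list(params) + [b''] * (MAX_PARAMS - len(params))
    return struct.pack(fmt, action, *padded)

data = pack_gui_command(b'draw', [b'circle', b'red'])
print(len(data))   # (1 + MAX_PARAMS) * MAX_SIZE bytes
```

The C side can then read sizeof(struct GUIcommunicator) bytes from the pipe directly into the struct, with no per-field parsing needed.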
During a Cython meetup, a speaker pointed to other data types such as cython.ssize_t. The type ssize_t is briefly mentioned in this Wikipedia article, but it is not well explained. Similarly, the Cython documentation describes types only in terms of how they are automatically converted.
What are all the data types available in Cython and what are their specifications?
You basically have access to most of the C types.
Here are the equivalents of all the Python types (if I'm not missing any), taken from the O'Reilly Cython book:
Python bool:
bint (a boolean, stored as a C int)
Python int and long
[unsigned] char
[unsigned] short
[unsigned] int
[unsigned] long
[unsigned] long long
Python float
float
double
long double
Python complex
float complex
double complex
Python bytes / str / unicode
char *
std::string
For size_t and Py_ssize_t, keep in mind that these are aliases.
Py_ssize_t is defined in Python.h, which Cython imports implicitly. It can hold the size (in bytes) of the largest object the Python interpreter will ever create.
size_t, on the other hand, is a standard C89 type, defined in <stddef.h>.
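Since these Cython types are just the underlying C types, you can check their sizes on your platform from plain Python with ctypes. The sizes of long, size_t, and ssize_t are platform-dependent (for example, long is 4 bytes on 64-bit Windows but 8 on 64-bit Linux), so run this locally rather than trusting any table:

```python
import ctypes

# Print the size of each C type as compiled into this Python build.
for name, ctype in [('char', ctypes.c_char),
                    ('short', ctypes.c_short),
                    ('int', ctypes.c_int),
                    ('long', ctypes.c_long),
                    ('long long', ctypes.c_longlong),
                    ('float', ctypes.c_float),
                    ('double', ctypes.c_double),
                    ('size_t', ctypes.c_size_t),
                    ('ssize_t', ctypes.c_ssize_t)]:
    print('%-10s %d bytes' % (name, ctypes.sizeof(ctype)))
```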
I have a unicode string f. I want to memset it to 0, so that print f displays null bytes (\0).
I am using ctypes.memset to achieve this:
> >>> f
> u'abc'
> >>> print ("%s" % type(f))
> <type 'unicode'>
> >>> import ctypes
> >>> ctypes.memset(id(f)+50,0,6)
> 4363962530
> >>> f
> u'abc'
> >>> print f
> abc
Why did the memory location not get memset in the case of a unicode string?
It works perfectly for a str object.
Thanks for the help.
First, this is almost certainly a very bad idea. Python expects strings to be immutable. There's a reason that even the C API won't let you change their contents after they're flagged ready. If you're just doing this to play around with the interpreter's implementation, that can be fun and instructive, but if you're doing it for any real-life purpose, you're probably doing something wrong.
In particular, if you're doing it for "security", what you almost certainly really want to do is to not create a unicode in the first place, but instead create, say, a bytearray with the UTF-16 or UTF-32 encoding of your string, which can be zeroed out in a way that's safe, portable, and a lot easier.
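A minimal sketch of that safer approach: keep the secret in a mutable bytearray from the start, then zero it in place when done. This assumes you control the code that creates the value, so that no immutable str copy of it is ever made:

```python
# Keep the sensitive data in a mutable buffer instead of an immutable str.
secret = bytearray(u'abc'.encode('utf-32-le'))

# ... use the buffer ...

# Zero it in place when finished: no copies, no interpreter internals.
for i in range(len(secret)):
    secret[i] = 0
print(secret)   # all null bytes
```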
Anyway, there's no reason to expect that two completely different types should store their buffers at the same offset.
In CPython 2.x, a str is a PyStringObject:
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;
That ob_sval is the buffer; the offset should be 36 on 64-bit builds and (I think) 24 on 32-bit builds.
In a comment, you say:
I read it somewhere and also the offset for a string type is 37 in my system which is what sys.getsizeof('') shows -> >>> sys.getsizeof('') 37
The offset for a string buffer is actually 36, not 37. And the fact that it's even that close is just a coincidence of the way str is implemented. (Hopefully you can understand why by looking at the struct definition—if not, you definitely shouldn't be writing code like this.) There's no reason to expect the same trick to work for some other type without looking at its implementation.
A unicode is a PyUnicodeObject:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;   /* Length of raw Unicode data in buffer */
    Py_UNICODE *str;     /* Raw Unicode buffer */
    long hash;           /* Hash value; -1 if not set */
    PyObject *defenc;    /* (Default) Encoded version as Python
                            string, or NULL; this is used for
                            implementing the buffer protocol */
} PyUnicodeObject;
Its buffer is not even inside the object itself; that str member is a pointer to the buffer (which is not guaranteed to be right after the struct). Its offset should be 24 on 64-bit builds, and (I think) 20 on 32-bit builds. So, to do the equivalent, you'd need to read the pointer there, then follow it to find the location to memset.
If you're using a narrow-Unicode build, it should look like this:
>>> ctypes.POINTER(ctypes.c_uint16 * len(g)).from_address(id(g)+24).contents[:]
[97, 98, 99]
That's the ctypes translation of finding (uint16_t *)(((char *)g)+24) and reading the array that starts at *that and ends at *(that+len(g)), which is what you'd have to do if you were writing C code and didn't have access to the unicodeobject.h header.
(In the test I just quoted, g is at 0x10a598090, while its str member points to 0x10a3b09e0, so the buffer is not immediately after the struct, or anywhere near it; it's about 2MB before it.)
For a wide-Unicode build, the same thing with c_uint32.
So, that should show you what you want to memset.
And you should also see a serious implication for your attempt at "security" here. (If I have to point it out, that's yet another indication that you should not be writing this code.)
Say I have the following code in C++:
union {
    int32_t i;
    uint32_t ui;
};

i = SomeFunc();
std::string test(std::to_string(ui));
std::ofstream outFile(test);
And say I had the value of i somehow in Python, how would I be able to get the name of the file?
For those of you who are unfamiliar with C++: what I am doing here is writing some value in signed 32-bit integer format to i and then interpreting the same bitwise representation as an unsigned 32-bit integer through ui. I am taking the same 32 bits and interpreting them in two different ways.
How can I do this in Python? There does not seem to be any explicit type specification in Python, so how can I reinterpret some set of bits in a different way?
EDIT: I am using Python 2.7.12
I would use Python's struct module for interpreting bits in different ways.
Something like the following prints -12 as an unsigned integer ("i" packs a signed 32-bit int, "I" unpacks the same bytes as unsigned):
import struct
p = struct.pack("i", -12)
print("{}".format(struct.unpack("I", p)[0]))
Is there a way in Python to unpack C structures created using #pragma pack(x) or __attribute__((packed)) with the struct module?
Alternatively, how to determine the manner in which python struct handles padding?
Use the struct module.
It is flexible in terms of byte order (big vs. little endian) and alignment (packing). See Byte Order, Size, and Alignment. It defaults to native byte order and alignment (pretty much meaning however Python was compiled).
Native example
C:
struct foo {
    int bar;
    char t;
    char x;
};
Python:
struct.pack('IBB', bar, t, x)
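To see how struct handles padding, compare native mode (which inserts alignment padding between fields, like the compiler does for an unpacked struct) against standard mode with an "=" or "<" prefix (which never pads, matching #pragma pack(1)). The native size in the comment assumes a typical platform where int is 4 bytes and aligned to 4:

```python
import struct

# A char followed by an int: in native mode the char is padded
# so the int lands on its alignment boundary.
print(struct.calcsize('BI'))    # native mode: typically 8 (1 + 3 padding + 4)
print(struct.calcsize('=BI'))   # standard mode, no padding: 5
```

So for structs declared with #pragma pack(1) or __attribute__((packed)), prefix the format string with "=" (native byte order) or "<"/">" (explicit byte order) to disable padding.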