Python convert C header file to dict

I have a C header file which contains a series of classes, and I'm trying to write a function which will take those classes and convert them to a Python dict. A sample of the file is at the bottom.
The format is something like:
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
I'm hoping to turn it into something like
{CFGFunctions:{ABC:{AA:"myFuncName"}, BB:...}}
# Or
{CFGFunctions:{ABC:{AA:{myFuncName:"string or list or something"}, BB:...}}}
In the end, I'm aiming to get the filepath string (which is actually a path to a folder... but anyway), and the class names in the same class as the file/folder path.
I've had a look on SO, and Google and so on, but most things I've found have been about splitting lines into dicts, rather than n-deep 'blocks'.
I know I'll have to loop through the file; however, I'm not sure of the most efficient way to convert it to the dict.
I'm thinking I'd need to grab the outside class and its relevant brackets, then do the same for the text remaining inside.
If none of that makes sense, it's cause I haven't quite made sense of the process myself haha
If any more info is needed, I'm happy to provide.
The following code is a quick mockup of what I'm sorta thinking...
It is only a rough mockup, but it's sort of the process that I'm thinking of:
import re

def get_data():
    fh = open('CFGFunctions.h', 'r')
    data = {}  # will contain final data model
    # would probably refactor some of this into a function to allow better looping
    start = ""  # starting class name
    brackets = 0  # number of open brackets
    text = ""  # temp storage for lines inside block while looping
    for line in fh:
        # find the first class (the start of the outer block)
        mt = re.search(r'class ([\w_]+)\s*{', line)
        if mt and start == "":
            start = mt.group(1)
            brackets += 1
            continue
        if start:
            # once we have the first class, count all other open brackets
            if '{' in line:
                # and inc our counter
                brackets += line.count('{')
            if '}' in line:
                # find the close, and decrement
                brackets -= line.count('}')
                # if we are back to the initial block, break out of the loop
                if brackets == 0:
                    break
            text += line
    fh.close()
    data[start] = {'tempText': text}
    return data
====
Sample file
class CfgFunctions {
    class ABC {
        class Control {
            file = "abc\abc_sys_1\Modules\functions";
            class assignTracker {
                description = "";
                recompile = 1;
            };
            class modulePlaceMarker {
                description = "";
                recompile = 1;
            };
        };
        class Devices
        {
            file = "abc\abc_sys_1\devices\functions";
            class registerDevice { recompile = 1; };
            class getDeviceSettings { recompile = 1; };
            class openDevice { recompile = 1; };
        };
    };
};
EDIT:
If possible, if I have to use a package, I'd like to have it in the programs directory, not the general python libs directory.

As you detected, parsing is necessary to do the conversion. Have a look at the package PyParsing, which is a fairly easy-to-use library for implementing parsing in your Python program.
Edit: This is a very stripped-down version of what it takes to recognize a minimalistic grammar - somewhat like the example at the top of the question. It is only a starting point, but it should put you in the right direction:
from pyparsing import (Forward, Group, Keyword, Literal, Optional,
                       ParseException, QuotedString, Word, ZeroOrMore,
                       alphanums, alphas)

test_code = """
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
"""

identifier = Word(alphas + '_', alphanums + '_')
class_tkn = Keyword('class')
lbrace_tkn = Literal('{')
rbrace_tkn = Literal('}')
semicolon_tkn = Literal(';')
assign_tkn = Literal('=')

# an assignment such as:  file = "abc/aa/functions"  or  recompile = 1;
value = QuotedString('"') | Word(alphanums + '_./\\')
assignment = Group(identifier + assign_tkn + value + Optional(semicolon_tkn))

# class blocks nest, so the rule has to be declared (Forward) before it is defined
class_block = Forward()
class_block <<= Group(class_tkn + identifier + lbrace_tkn +
                      ZeroOrMore(class_block | assignment) +
                      rbrace_tkn + Optional(semicolon_tkn))

def test_parser(test):
    try:
        results = class_block.parseString(test)
        print(test, ' -> ', results)
    except ParseException as s:
        print("Syntax error:", s)

def main():
    test_parser(test_code)
    return 0

if __name__ == '__main__':
    main()
Also, this code is only the parser - it recognizes and prints the parsed structure, but it does not yet build the dict you want. As you can see in the PyParsing docs, you can later add the actions you want. But the first step would be to recognize what you want to translate.
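As an illustration, here is a sketch (assuming the Group-ed token layout produced by the grammar above) of a post-processing step that walks the parsed tokens and builds the nested dict the question asks for:

# Sketch only: a class group comes out as ['class', name, '{', child..., '}', ';']
# and an assignment group as [key, '=', value] or [key, '=', value, ';'].
def to_dict(tokens):
    name = tokens[1]
    body = {}
    for item in tokens[3:]:
        if isinstance(item, str):
            continue  # skip the literal '{', '}' and ';' tokens
        if item[0] == 'class':
            body.update(to_dict(item))  # nested class block
        else:
            body[item[0]] = item[2]     # assignment: key = value
    return {name: body}

# usage: to_dict(class_block.parseString(test_code)[0])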
And a last note: Do not underestimate the complexities of parsing code... Even with a library like PyParsing, which takes care of much of the work, there are many ways to get mired in infinite loops and other amenities of parsing. Implement things step-by-step!
EDIT: A few sources for information on PyParsing are:
http://werc.engr.uaf.edu/~ken/doc/python-pyparsing/HowToUsePyparsing.html
http://pyparsing.wikispaces.com/
(Particularly interesting is http://pyparsing.wikispaces.com/Publications, with a long list of articles - several of them introductory - on PyParsing)
http://pypi.python.org/pypi/pyparsing_helper is a GUI for debugging parsers
There is also a 'pyparsing' tag here on Stack Overflow, where Paul McGuire (the PyParsing author) seems to be a frequent guest.
NOTE:
From PaulMcG in the comments below: Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing

Related

How to get two RichText features to be mutually exclusive

So basically I've added two custom features for coloring text to a RichTextBlock, and I'd like to make them mutually exclusive, so that selecting one for a portion of text automatically deselects the other color button, much as is already the case for the h tags.
I've searched for a bit but didn't find much, so I guess I could use some help, be it advice, instruction or even code.
My features go like this:

# imports as in the Wagtail rich-text docs (module paths may differ across Wagtail versions)
from wagtail.admin.rich_text.converters.html_to_contentstate import InlineStyleElementHandler
from wagtail.admin.rich_text.editors.draftail import features as draftail_features
from wagtail.core import hooks

@hooks.register('register_rich_text_features')
def register_redtext_feature(features):
    feature_name = 'redtext'
    type_ = 'RED_TEXT'
    tag = 'span'
    control = {
        'type': type_,
        'label': 'Red',
        'style': {'color': '#bd003f'},
    }
    features.register_editor_plugin(
        'draftail', feature_name, draftail_features.InlineStyleFeature(control)
    )
    db_conversion = {
        'from_database_format': {tag: InlineStyleElementHandler(type_)},
        'to_database_format': {
            'style_map': {
                type_: {'element': tag, 'props': {'class': 'text-primary'}}
            }
        },
    }
    features.register_converter_rule(
        'contentstate', feature_name, db_conversion
    )
The other one is similar, but the color is different.
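For reference, a sketch of what that second feature might look like (the 'bluetext' name and color here are hypothetical placeholders, mirroring the registration above):

# Hypothetical second feature, mirroring the one above with only names/colors changed.
@hooks.register('register_rich_text_features')
def register_bluetext_feature(features):
    feature_name = 'bluetext'
    type_ = 'BLUE_TEXT'
    tag = 'span'
    control = {
        'type': type_,
        'label': 'Blue',
        'style': {'color': '#0055bd'},
    }
    features.register_editor_plugin(
        'draftail', feature_name, draftail_features.InlineStyleFeature(control)
    )
    features.register_converter_rule('contentstate', feature_name, {
        'from_database_format': {tag: InlineStyleElementHandler(type_)},
        'to_database_format': {
            'style_map': {type_: {'element': tag, 'props': {'class': 'text-secondary'}}}
        },
    })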
This is possible, but it requires jumping through many hoops in Wagtail. The h1…h6 tags work like this out of the box because they are block-level formatting – each block within the editor can only be of one type. Here you’re creating this RED_TEXT formatting as inline formatting ("inline style"), which intentionally supports multiple formats being applied to the same text.
If you want to achieve this mutually exclusive implementation anyway – you’ll need to write custom JS code to auto-magically remove the desired styles from the text when attempting to add a new style.
Here is a function that does just that. It goes through all of the characters in the user’s selection, and removes the relevant styles from them:
/**
 * Remove all of the COLOR_ styles from the current selection.
 * This is to ensure only one COLOR_ style is applied per range of text.
 * Replicated from https://github.com/thibaudcolas/draftjs-filters/blob/f997416a0c076eb6e850f13addcdebb5e52898e5/src/lib/filters/styles.js#L7,
 * with additional "is the character in the selection" logic.
 */
// assumed imports, not part of the original snippet:
import { CharacterMetadata, ContentState, SelectionState } from "draft-js";

export const filterColorStylesFromSelection = (
  content: ContentState,
  selection: SelectionState,
) => {
  const blockMap = content.getBlockMap();
  const startKey = selection.getStartKey();
  const endKey = selection.getEndKey();
  const startOffset = selection.getStartOffset();
  const endOffset = selection.getEndOffset();
  let isAfterStartKey = false;
  let isAfterEndKey = false;
  const blocks = blockMap.map((block) => {
    const isStartBlock = block.getKey() === startKey;
    const isEndBlock = block.getKey() === endKey;
    isAfterStartKey = isAfterStartKey || isStartBlock;
    isAfterEndKey = isAfterEndKey || isEndBlock;
    const isBeforeEndKey = isEndBlock || !isAfterEndKey;
    const isBlockInSelection = isAfterStartKey && isBeforeEndKey;
    // Skip filtering through the block chars if out of selection.
    if (!isBlockInSelection) {
      return block;
    }
    let altered = false;
    const chars = block.getCharacterList().map((char, i) => {
      const isAfterStartOffset = i >= startOffset;
      const isBeforeEndOffset = i < endOffset;
      const isCharInSelection =
        // If the selection is on a single block, the char needs to be in-between start and end offsets.
        (isStartBlock &&
          isEndBlock &&
          isAfterStartOffset &&
          isBeforeEndOffset) ||
        // Start block only: after start offset
        (isStartBlock && !isEndBlock && isAfterStartOffset) ||
        // End block only: before end offset.
        (isEndBlock && !isStartBlock && isBeforeEndOffset) ||
        // Neither start nor end: just "in selection".
        (isBlockInSelection && !isStartBlock && !isEndBlock);
      let newChar = char;
      if (isCharInSelection) {
        char
          .getStyle()
          .filter((type) => type.startsWith("COLOR_"))
          .forEach((type) => {
            altered = true;
            newChar = CharacterMetadata.removeStyle(newChar, type);
          });
      }
      return newChar;
    });
    return altered ? block.set("characterList", chars) : block;
  });
  return content.merge({
    blockMap: blockMap.merge(blocks),
  });
};
This is taken from the Draftail ColorPicker demo, which you can see running in the Draftail Storybook’s "Custom formats" example.
To implement this kind of customisation in Draftail, you’d need to use the controls API. Unfortunately that API isn’t currently supported out of the box in Wagtail’s integration of the editor (see wagtail/wagtail#5580), so at the moment, in order for this to work, you’d need to customise Draftail’s initialisation within Wagtail as well.

How to read inline-styles from WxPython

I'm trying to put text into a RichTextCtrl and then, after the user has made edits, I want to get the edited text back out along with the styles. It's the second part I'm having trouble with: of all the methods for getting styles out of the buffer, none are really user-friendly.
The best I've come up with is to walk through the text a character at a time with GetStyleForRange(range, style). There has got to be a better way to do this! Here's my code now, which walks through gathering a list of text segments and styles.
Please give me a better way to do this. I have to be missing something.
buffer: wx.richtext.RichTextBuffer = self.rtc.GetBuffer()
end = len(buffer.GetText())

# Variables for text/style reading loop
ch: str
curStyle: str
i: int = 0
style = wx.richtext.RichTextAttr()
text: List[str] = []
textItems: List[Tuple[str, str]] = []

# Read the style of the first character
self.rtc.GetStyleForRange(wx.richtext.RichTextRange(i, i + 1), style)
curStyle = self.describeStyle(style)

# Loop until we hit the end. Use a while loop so we can control the index increment.
while i < end + 1:
    # Read the current character and its style as `ch` and `newStyle`
    ch = buffer.GetTextForRange(wx.richtext.RichTextRange(i, i))
    self.rtc.GetStyleForRange(wx.richtext.RichTextRange(i, i + 1), style)
    newStyle = self.describeStyle(style)

    # If the style has changed, we flush the collected text and start new collection
    if text and newStyle != curStyle and ch != '\n':
        newText = "".join(text)
        textItems.append((newText, curStyle))
        text = []
        self.rtc.GetStyleForRange(wx.richtext.RichTextRange(i + 1, i + 2), style)
        curStyle = self.describeStyle(style)
    # Otherwise, collect the character and continue
    else:
        i += 1
        text.append(ch)

# Capture the last text being collected
newText = "".join(text)
textItems.append((newText, newStyle))
Here's a C++ version of the solution I mentioned in the comment above. It's a simple tree walk using a queue, so I think it should be easy to translate to Python.
const wxRichTextBuffer& buffer = m_richText1->GetBuffer();
std::deque<const wxRichTextObject*> objects;
objects.push_front(&buffer);

while ( !objects.empty() )
{
    const wxRichTextObject* curObject = objects.front();
    objects.pop_front();

    if ( !curObject->IsComposite() )
    {
        wxRichTextRange range = curObject->GetRange();
        const wxRichTextAttr& attr = curObject->GetAttributes();

        // Do something with range and attr here.
    }
    else
    {
        // This is a composite object. Add its children to the queue.
        // The children are added in reverse order to do a depth first walk.
        const wxRichTextCompositeObject* curComposite =
            static_cast<const wxRichTextCompositeObject*>(curObject);

        size_t childCount = curComposite->GetChildCount();
        for ( int i = childCount - 1 ; i >= 0 ; --i )
        {
            objects.push_front(curComposite->GetChild(i));
        }
    }
}
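And here is a rough Python translation of the walk above (an untested sketch; the method names mirror wxPython's wrappers of the same wxWidgets API, e.g. IsComposite, GetRange, GetAttributes, GetChildCount, GetChild):

from collections import deque

import wx.richtext

def walk_buffer(rtc: wx.richtext.RichTextCtrl):
    # Start the walk at the buffer itself, which is a composite object.
    objects = deque([rtc.GetBuffer()])
    while objects:
        cur_object = objects.popleft()
        if not cur_object.IsComposite():
            rng = cur_object.GetRange()
            attr = cur_object.GetAttributes()
            # Do something with rng and attr here.
        else:
            # Composite object: add its children to the queue in reverse
            # order so the walk stays depth first.
            for i in range(cur_object.GetChildCount() - 1, -1, -1):
                objects.appendleft(cur_object.GetChild(i))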

How to get only function blocks using sly

I need to get the function blocks (definition and everything, not just declaration), in order to get a function dependency graph. From the function dependency graph, identify connected components and modularize my insanely huge C codebase, one file at a time.
Problem: I need a C parser to identify function blocks, just that, nothing more. We have custom types etc., but the signature goes
storage_class return_type function_name ( comma separated type value pairs )
{
    //some content I view as generic stuff
}
Solution that I've come up with: Use sly and pycparser, like any sane person would do, obviously.
Problem with pycparser: It needs fully preprocessed input, pulling in headers from other files, just to identify the code blocks. In my case, includes go six levels deep. I am sorry I can't show the actual code.
Attempted code with sly:
from sly import Lexer, Parser
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

class CLexer(Lexer):
    ignore = ' \t\n'
    tokens = {LEXEME, PREPROP, FUNC_DECL, FUNC_DEF, LBRACE, RBRACE, SYMBOL}
    literals = {'(', ')', ',', '\n', '<', '>', '-', ';', '&', '*', '=', '!'}

    LBRACE = r'\{'
    RBRACE = r'\}'
    FUNC_DECL = r'[a-z]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]*\([a-zA-Z_\* \,\t\n]+\)[ ]*\;'
    FUNC_DEF = r'[a-zA-Z_0-9]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]*\([a-zA-Z_\* \,\t\n]+\)'
    PREPROP = r'#[a-zA-Z_][a-zA-Z0-9_\" .\<\>\/\(\)\-\+]*'
    LEXEME = r'[a-zA-Z0-9]+'
    SYMBOL = r'[-!$%^&*\(\)_+|~=`\[\]\:\"\;\'\<\>\?\,\.\/]'

    def __init__(self):
        self.nesting_level = 0
        self.lineno = 0

    @_(r'\n+')
    def newline(self, t):
        self.lineno += t.value.count('\n')

    @_(r'[-!$%^&*\(\)_+|~=`\[\]\:\"\;\'\<\>\?\,\.\/]')
    def symbol(self, t):
        t.type = 'symbol'
        return t

    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

class CParser(Parser):
    # Get the token list from the lexer (required)
    tokens = CLexer.tokens

    @_('PREPROP')
    def expr(self, p):
        return p.PREPROP

    @_('FUNC_DECL')
    def expr(self, p):
        return p.FUNC_DECL

    @_('func')
    def expr(self, p):
        return p.func

    # Grammar rules and actions
    @_('FUNC_DEF LBRACE stmt RBRACE')
    def func(self, p):
        return p.func_def + p.lbrace + p.stmt + p.rbrace

    @_('LEXEME stmt')
    def stmt(self, p):
        return p.LEXEME

    @_('SYMBOL stmt')
    def stmt(self, p):
        return p.SYMBOL

    @_('empty')
    def stmt(self, p):
        return p.empty

    @_('')
    def empty(self, p):
        pass

with open('inputfile.c') as f:
    data = "".join(f.readlines())

data = comment_remover(data)
lexer = CLexer()
parser = CParser()
while True:
    try:
        result = parser.parse(lexer.tokenize(data))
        print(result)
    except EOFError:
        break
Error:
None
None
None
.
.
.
.
None
None
yacc: Syntax error at line 1, token=PREPROP
yacc: Syntax error at line 1, token=LBRACE
yacc: Syntax error at line 1, token=PREPROP
yacc: Syntax error at line 1, token=LBRACE
yacc: Syntax error at line 1, token=PREPROP
.
.
.
.
.
INPUT:
#include <mycustomheader1.h>         //defines type T1
#include <somedir/mycustomheader2.h> //defines type T2
#include <someotherdir/somefile.c>

MACRO_THINGY_DEFINED_IN_SOMEFILE(M1,M2)

static T1 function_name_thats_way_too_long_than_usual(int *a, float* b, T2* c)
{
    //some code I don't even care about at this point
}

extern T2 function_name_thats_way_too_long_than_usual(int *a, char* b, T1* c)
{
    //some code I don't even care about at this point
}
DESIRED OUTPUT:
function1:

static T1 function_name_thats_way_too_long_than_usual(int *a, float* b, T2* c)
{
    //some code I don't even care about at this point
}

function2:

extern T2 function_name_thats_way_too_long_than_usual(int *a, char* b, T1* c)
{
    //some code I don't even care about at this point
}
pycparser has a func_defs example to do exactly what you need, but IIUC you're having issues with the preprocessing?
This post describes in some detail why pycparser needs preprocessed files, and how to set it up. If you control the build system it's actually pretty easy. Once you have preprocessed files, the example mentioned above should work.
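For reference, a minimal sketch of that approach, modeled on pycparser's examples/func_defs.py (the filename here is hypothetical, and the input must already be preprocessed):

from pycparser import parse_file, c_ast, c_generator

class FuncDefVisitor(c_ast.NodeVisitor):
    def __init__(self):
        self.generator = c_generator.CGenerator()

    def visit_FuncDef(self, node):
        # node covers the whole definition: storage class, signature and body.
        print('%s at %s:' % (node.decl.name, node.decl.coord))
        print(self.generator.visit(node))

ast = parse_file('preprocessed_input.c')  # hypothetical, already-preprocessed file
FuncDefVisitor().visit(ast)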
I will also note that statically finding function dependencies is not an easy problem, because of function pointers. You also won't be able to do this accurately with a single file - this needs multi-file analysis.

fgetc causes a segfault after running the second time

I have an application that tries to read a specific key file and this can happen multiple times during the program's lifespan. Here is the function for reading the file:
__status
_read_key_file(const char * file, char ** buffer)
{
    FILE * pFile = NULL;
    long fsize = 0;

    pFile = fopen(file, "rb");
    if (pFile == NULL) {
        _set_error("Could not open file: ", 1);
        return _ERROR;
    }

    // Get the filesize
    while (fgetc(pFile) != EOF) {
        ++fsize;
    }

    *buffer = (char *) malloc(sizeof(char) * (fsize + 1));

    // Read the file and write it to the buffer
    rewind(pFile);
    size_t result = fread(*buffer, sizeof(char), fsize, pFile);
    if (result != fsize) {
        _set_error("Reading error", 0);
        fclose(pFile);
        return _ERROR;
    }

    fclose(pFile);
    pFile = NULL;
    return _OK;
}
Now the problem is that for a single open/read/close it works just fine, except when I run the function the second time - it will always segfault at this line: while(fgetc(pFile) != EOF)
Tracing with gdb, it shows that the segfault occurs deeper within the fgetc function itself.
I am a bit lost, but obviously am doing something wrong, since if I try to tell the size with fseek/ftell, I always get a 0.
Some context:

Language: C
System: Linux (Ubuntu 16 64bit)
Please ignore functions and names with underscores, as they are defined somewhere else in the code.
The program is designed to run as a dynamic library, loaded in Python via ctypes.
EDIT
Right, it seems there's more than meets the eye. Jean-François Fabre spawned an idea that I tested and it worked; however, I am still confused as to why.
Some additional context:
Suppose there's a function in C that looks something like this:
_status
init(_conn_params cp) {
    _status status = _NONE;

    if (!cp.pkey_data) {
        _set_error("No data, open the file", 0);
        if (!cp.pkey_file) {
            _set_error("No public key set", 0);
            return _ERROR;
        }
        status = _read_key_file(cp.pkey_file, &cp.pkey_data);
        if (status != _OK) return status;
    }

    /* SOME ADDITIONAL WORK AND CHECKING DONE HERE */
    return status;
}
Now in Python (using 3.5 for testing), we generate those conn_params and then call the init function:
from ctypes import *

libCtest = CDLL('./lib/lib.so')

class _conn_params(Structure):
    _fields_ = [
        # Some params
        ('pkey_file', c_char_p),
        ('pkey_data', c_char_p),
        # Some additional params
    ]

#################### PART START #################
cp = _conn_params()
cp.pkey_file = "public_key.pem".encode('utf-8')

status = libCtest.init(cp)
status = libCtest.init(cp)  # Will cause a segfault
##################### PART END ###################

# However if we do
#################### PART START #################
cp = _conn_params()
cp.pkey_file = "public_key.pem".encode('utf-8')
status = libCtest.init(cp)

# And then
cp = _conn_params()
cp.pkey_file = "public_key.pem".encode('utf-8')
status = libCtest.init(cp)
##################### PART END ###################
The second PART START / PART END will not cause the segfault in this context.
Would anyone know the reason why?
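A hedged aside: by default ctypes assumes int for every argument and return value unless told otherwise, so declaring the signature explicitly is a common first step when chasing crashes like this. A sketch, not a confirmed fix:

from ctypes import c_int

# Not a confirmed fix: spell out the signature instead of letting ctypes
# default to int for the argument and the return value.
libCtest.init.argtypes = [_conn_params]
libCtest.init.restype = c_int

status = libCtest.init(cp)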

Parsing C in python with libclang but generated the wrong AST

I want to use libclang's Python binding to generate the AST of some C code. The source code is shown below.
#include <stdlib.h>
#include "adlist.h"
#include "zmalloc.h"

list *listCreate(void)
{
    struct list *list;

    if ((list = zmalloc(sizeof(*list))) == NULL)
        return NULL;
    list->head = list->tail = NULL;
    list->len = 0;
    list->dup = NULL;
    list->free = NULL;
    list->match = NULL;
    return list;
}
And the implementation I wrote:
#!/usr/bin/python
# vim: set fileencoding=utf-8

import clang.cindex
import asciitree
import sys

def node_children(node):
    return (c for c in node.get_children() if c.location.file.name == sys.argv[1])

def print_node(node):
    text = node.spelling or node.displayname
    kind = str(node.kind)[str(node.kind).index('.') + 1:]
    return '{} {}'.format(kind, text)

if len(sys.argv) != 2:
    print("Usage: dump_ast.py [header file name]")
    sys.exit()

clang.cindex.Config.set_library_file('/usr/lib/llvm-3.6/lib/libclang-3.6.so')
index = clang.cindex.Index.create()
translation_unit = index.parse(sys.argv[1], ['-x', 'c++', '-std=c++11', '-D__CODE_GENERATOR__'])
print(asciitree.draw_tree(translation_unit.cursor, node_children, print_node))
But the final output of this test looks like this:
TRANSLATION_UNIT adlist.c
  +--FUNCTION_DECL listCreate
     +--COMPOUND_STMT
        +--DECL_STMT
           +--STRUCT_DECL list
           +--VAR_DECL list
              +--TYPE_REF struct list
Obviously, the final result is wrong: much of the code is left unparsed. I have tried to traverse the translation unit, but the result is just like the tree shows: many nodes are missing. Why is that? And is there any method to solve the problem? Thank you!
I guess the reason is that libclang is unable to parse malloc(), because stdlib has not actually been pulled in when parsing this code, nor has a user-defined definition been provided for malloc.
The parse did not complete successfully, probably because you're missing some include paths.
You can confirm what the exact problem is by printing the diagnostic messages.
translation_unit = index.parse(sys.argv[1], args)
for diag in translation_unit.diagnostics:
    print(diag)
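If the diagnostics show missing headers, the usual fix is to pass the include directories in the args list, the same way you would to the compiler. A sketch with made-up paths:

# The '-I' paths below are placeholders for wherever adlist.h / zmalloc.h live.
args = ['-x', 'c', '-I.', '-I/path/to/project/include']
translation_unit = index.parse(sys.argv[1], args)
for diag in translation_unit.diagnostics:
    print(diag)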
