How to get only function blocks using sly - python

I need to get the function blocks (definition and everything, not just declaration), in order to get a function dependency graph. From the function dependency graph, identify connected components and modularize my insanely huge C codebase, one file at a time.
Problem: I need a C parser to identify function blocks, just that, nothing more. We have custom types etc., but every signature has the form
storage_class return_type function_name ( comma separated type value pairs )
{
//some content I view as generic stuff
}
Solution I've come up with: use sly and pycparser, like any sane person would, obviously.
Problem with pycparser: it needs the preprocessor to run over files included from elsewhere just to identify the code blocks, and in my case the includes go six levels deep. I am sorry I can't show the actual code.
Attempted code with sly:
from sly import Lexer, Parser
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

class CLexer(Lexer):
    ignore = ' \t\n'
    tokens = {LEXEME, PREPROP, FUNC_DECL, FUNC_DEF, LBRACE, RBRACE, SYMBOL}
    literals = {'(', ')', ',', '\n', '<', '>', '-', ';', '&', '*', '=', '!'}

    LBRACE = r'\{'
    RBRACE = r'\}'
    FUNC_DECL = r'[a-z]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]*\([a-zA-Z_\* \,\t\n]+\)[ ]*\;'
    FUNC_DEF = r'[a-zA-Z_0-9]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]*\([a-zA-Z_\* \,\t\n]+\)'
    PREPROP = r'#[a-zA-Z_][a-zA-Z0-9_\" .\<\>\/\(\)\-\+]*'
    LEXEME = r'[a-zA-Z0-9]+'
    SYMBOL = r'[-!$%^&*\(\)_+|~=`\[\]\:\"\;\'\<\>\?\,\.\/]'

    def __init__(self):
        self.nesting_level = 0
        self.lineno = 0

    @_(r'\n+')
    def newline(self, t):
        self.lineno += t.value.count('\n')

    @_(r'[-!$%^&*\(\)_+|~=`\[\]\:\"\;\'\<\>\?\,\.\/]')
    def symbol(self, t):
        t.type = 'symbol'
        return t

    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

class CParser(Parser):
    # Get the token list from the lexer (required)
    tokens = CLexer.tokens

    @_('PREPROP')
    def expr(self, p):
        return p.PREPROP

    @_('FUNC_DECL')
    def expr(self, p):
        return p.FUNC_DECL

    @_('func')
    def expr(self, p):
        return p.func

    # Grammar rules and actions
    @_('FUNC_DEF LBRACE stmt RBRACE')
    def func(self, p):
        return p.func_def + p.lbrace + p.stmt + p.rbrace

    @_('LEXEME stmt')
    def stmt(self, p):
        return p.LEXEME

    @_('SYMBOL stmt')
    def stmt(self, p):
        return p.SYMBOL

    @_('empty')
    def stmt(self, p):
        return p.empty

    @_('')
    def empty(self, p):
        pass

with open('inputfile.c') as f:
    data = f.read()
data = comment_remover(data)

lexer = CLexer()
parser = CParser()
while True:
    try:
        result = parser.parse(lexer.tokenize(data))
        print(result)
    except EOFError:
        break
Error:
None
None
None
...
None
None
yacc: Syntax error at line 1, token=PREPROP
yacc: Syntax error at line 1, token=LBRACE
yacc: Syntax error at line 1, token=PREPROP
yacc: Syntax error at line 1, token=LBRACE
yacc: Syntax error at line 1, token=PREPROP
...
INPUT:
#include <mycustomheader1.h> //defines type T1
#include <somedir/mycustomheader2.h> //defines type T2
#include <someotherdir/somefile.c>
MACRO_THINGY_DEFINED_IN_SOMEFILE(M1,M2)
static T1 function_name_thats_way_too_long_than_usual(int *a, float* b, T2* c)
{
//some code I don't even care about at this point
}
extern T2 function_name_thats_way_too_long_than_usual(int *a, char* b, T1* c)
{
//some code I don't even care about at this point
}
DESIRED OUTPUT:
function1 :
static T1 function_name_thats_way_too_long_than_usual(int *a, float* b, T2* c)
{
//some code I don't even care about at this point
}
function2 :
extern T2 function_name_thats_way_too_long_than_usual(int *a, char* b, T1* c)
{
//some code I don't even care about at this point
}

pycparser has a func_defs example to do exactly what you need, but IIUC you're having issues with the preprocessing?
This post describes in some detail why pycparser needs preprocessed files, and how to set it up. If you control the build system it's actually pretty easy. Once you have preprocessed files, the example mentioned above should work.
I will also note that statically finding function dependencies is not an easy problem, because of function pointers. You also won't be able to do this accurately with a single file - this needs multi-file analysis.
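That said, if full parsing keeps fighting the preprocessor, a cruder route that often suffices for extracting whole blocks is a signature regex plus brace counting. Below is a minimal sketch (the helper name `extract_functions` and the test source are made up): it assumes comments and string literals were already stripped (e.g. with the `comment_remover` above) and that function signatures start at column 0 while control statements are indented, so it is a heuristic, not a real C parser.

```python
import re

# Heuristic: a column-0 run of identifiers/spaces/stars, then "(...)", then "{".
# Matches "storage_class return_type name(args)" style definitions.
SIG = re.compile(r'^\w[\w\s\*]*\([^;{}()]*\)\s*\{', re.MULTILINE)

def extract_functions(source):
    """Return every top-level function block (signature + body) in source."""
    blocks = []
    for m in SIG.finditer(source):
        depth, i = 1, m.end()          # m.end() is just past the opening '{'
        while i < len(source) and depth:
            if source[i] == '{':
                depth += 1
            elif source[i] == '}':
                depth -= 1
            i += 1
        blocks.append(source[m.start():i])
    return blocks

# Hypothetical input modeled on the question's INPUT sample
src = (
    "MACRO_THINGY_DEFINED_IN_SOMEFILE(M1,M2)\n"
    "static T1 long_function_name(int *a, float* b, T2* c)\n"
    "{\n    a = b;\n}\n"
    "extern T2 other_function_name(int *a, char* b, T1* c)\n"
    "{\n    if (a) { c = a; }\n}\n"
)
for n, block in enumerate(extract_functions(src), 1):
    print('function{} :'.format(n))
    print(block)
```

The macro invocation is skipped because it is not followed by a `{`, and nested braces inside a body are handled by the depth counter. A signature-shaped line at column 0 inside a body would still fool it, which is the price of not really parsing C.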

Related

ply.yacc error: 'ERROR: no rules of the form p_rulename are defined'

I am coding a parser for the C-minus language. The lexer is ready and working properly, so I began developing the parser, but I can't get past the first part: I'm receiving an error that won't let me move forward, because I can't see what's right and what's wrong; I only see the error reproduced below. I tried changing the parser builder, but it still doesn't work.
The code below is the lexer, which works: it identifies all the symbols of the grammar.
reserved = {
    'else' : 'ELSE',
    'if' : 'IF',
    'int' : 'INT',
    'return' : 'RETURN',
    'void' : 'VOID',
    'while' : 'WHILE'
}

tokens = [
    'ID',
    'NUM',
    'PLUS',
    'MINUS',
    'MULT',
    'DIV',
    'LESS',
    'LESSOREQUAL',
    'GREAT',
    'GREATOREQUAL',
    'DOUBLEEQUAL',
    'NOTEQUAL',
    'EQUAL',
    'SEMICOLON',
    'COLON',
    'LPAREN',
    'RPAREN',
    'LBRACKET',
    'RBRACKET',
    'LKEY',
    'RKEY',
    'COMENT'
] + list(reserved.values())

def t_COMENT(t):
    r'/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/'
    return t

def t_ID(t):
    r'[a-zA-Z]+'
    return t

def t_NUM(t):
    r'[0-9]+'
    return t

def t_PLUS(t):
    r'\+'
    return t

def t_MINUS(t):
    r'\-'
    return t

def t_MULT(t):
    r'\*'
    return t

def t_DIV(t):
    r'\/'
    return t

def t_LESS(t):
    r'\<'
    return t

def t_LESSOREQUAL(t):
    r'\<\='
    return t

def t_GREAT(t):
    r'\>'
    return t

def t_GREATOREQUAL(t):
    r'\>\='
    return t

def t_DOUBLEEQUAL(t):
    r'\=\='
    return t

def t_NOTEQUAL(t):
    r'\!\='
    return t

def t_EQUAL(t):
    r'\='
    return t

def t_SEMICOLON(t):
    r'\;'
    return t

def t_COLON(t):
    r'\,'
    return t

def t_LPAREN(t):
    r'\('
    return t

def t_RPAREN(t):
    r'\)'
    return t

def t_LBRACKET(t):
    r'\['
    return t

def t_RBRACKET(t):
    r'\]'
    return t

def t_LKEY(t):
    r'\{'
    return t

def t_RKEY(t):
    r'\}'
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")

def t_error(t):
    print("ERROR: Illegal character '{0}' at line {1}".format(t.value[0], t.lineno))
    t.lexer.skip(1)

t_ignore = ' \t'
The code below is the parser, which is still in development. I can't test any of the functions I'm creating because of the error that occurs.
import ply.yacc as yacc
import lexer

tokens = lexer.tokens

class Parser():
    def p_program(p):
        'program: declaration_list'
        p[0] = p[1]

    def p_declaration_list(p):
        '''declaration_list: declaration_list declaration
                           | declaration'''
        p[0] = (0, (p[1], 0))
Main:
import ply.yacc as yacc
import ply.lex as lex
from tabulate import tabulate
import sys
from lexer import *
from parser import Parser

lexer = lex.lex()
with open(sys.argv[1], 'r') as f:
    lexer.input(f.read())
tok_array = [[tok.type, tok.value, tok.lexpos, tok.lineno] for tok in lexer]
print(tabulate(tok_array, headers=['Tipo','Valor','Posição','Linha']),'\n')
print('passou aqui 1')

parser = yacc.yacc()
with open(sys.argv[1], 'r') as f:
    parser.input(f.read())
tok_array = [[tok.type, tok.value, tok.lexpos, tok.lineno] for tok in parser]
print(tabulate(tok_array, headers=['Tipo','Valor','Posição','Linha']),'\n')
Below is the complete error:
ERROR: no rules of the form p_rulename are defined
Traceback (most recent call last):
  File "main.py", line 16, in <module>
    parser = yacc.yacc()
  File "/home/tlunafar/.local/lib/python3.8/site-packages/ply/yacc.py", line 3323, in yacc
    raise YaccError('Unable to build parser')
ply.yacc.YaccError: Unable to build parser
Here is the C-minus program I am testing:
int gcd(int u) {
    if (v == 0) return u;
    &
    else return gcd(v, u-u/v*v);
    /* comment */
}
Where exactly is this error? Can anybody show me?
If you put the parser definitions into a class, or you try to build the parser from a different module, you need the module= parameter to tell yacc where the rules are; otherwise it can't find them, and you get an error saying that no rules were defined. So instead of parser = yacc.yacc(), you need:
parser = yacc.yacc(module=Parser)
Note that all of the parser rules need to be in the same namespace; that includes the definition of tokens. So you'll need to put that inside the class:
class Parser():
    tokens = lexer.tokens
    # ...
Also, Ply insists that productions be written with whitespace on both sides of the colon, so you'll have to fix that. And there are a number of other errors; notably, parsers are not called the same way as lexers; they don't return a generator of tokens. Generally, you only call the parser once to parse the entire input. The details are in the Ply manual.

ctypes dll declaration of variables

I'm trying to use the following function with ctypes, and I'm having trouble declaring all the parameters and variables correctly.
The documentation of the C code is as follows:
/* global variables */
int main ()
char sDeviceSerialNumber[32];
FEUSB_GetScanListPara( 0, "Device-ID", sDeviceSerialNumber ) ;
sDeviceSerialNumber is supposed to be an output value of the function, which I need in Python for further use.
Python code:
def FEUSB_GetScanListPara(iIndex, cPara):
    libfeusb.FEUSB_GetScanListPara.argtypes = [ctypes.c_int,
                                               ctypes.c_wchar_p,
                                               ctypes.POINTER(ctypes.c_char_p)]
    libfeusb.FEUSB_GetScanListPara.restype = ctypes.c_int
    iIndex = ctypes.c_int(iIndex)
    cValue_buffer = ctypes.create_string_buffer(32)
    cValue = ctypes.c_char_p(ctypes.addressof(cValue_buffer))
    value = libfeusb.FEUSB_GetScanListPara(iIndex, cPara, ctypes.byref(cValue))

if __name__ == "__main__":
    i = 0
    RFID.FEUSB_GetScanListPara(i, "Device-ID")
When I call the function with the code above, I get an error code, FEUSB_ERR_UNKNOWN_PARAMETER, therefore I assume that I do not declare the parameters correctly.
Any input is appreciated!
EDIT 1
def FEUSB_GetScanListPara(iIndex, cPara):
    libfeusb.FEUSB_GetScanListPara.argtypes = [ctypes.c_int,
                                               ctypes.c_char_p,
                                               ctypes.c_char_p]
    libfeusb.FEUSB_GetScanListPara.restype = ctypes.c_int
    cValue = ctypes.create_string_buffer(32)
    value = libfeusb.FEUSB_GetScanListPara(iIndex, cPara,
                                           ctypes.byref(cValue))
    print("1.0", cPara, "back value", " = ", value)
    print("1.1", cPara, " = ", cValue.value)
    print("######")

if __name__ == "__main__":
    data = RFID.FEUSB_GetScanListPara(i, b"Device-ID")
Python Console:
FEUSB_ClearScanList = 0
FEUSB_Scan = 0
FEUSB_GetScanListSize = 1
Traceback (most recent call last):
  File "C:\xxxxx\3.1_ObidRFID_test\OBID_RFID_06.py", line 265, in <module>
    data = RFID.FEUSB_GetScanListPara(i, b"Device-ID")
  File "C:\xxxxx\3.1_ObidRFID_test\OBID_RFID_06.py", line 89, in FEUSB_GetScanListPara
    value = libfeusb.FEUSB_GetScanListPara(iIndex, cPara, ctypes.byref(cValue))
ArgumentError: argument 3: <class 'TypeError'>: wrong type
EDIT 2
working code
def FEUSB_GetScanListPara(iIndex, cPara):
    libfeusb.FEUSB_GetScanListPara.argtypes = [ctypes.c_int,
                                               ctypes.c_char_p,
                                               ctypes.c_char_p]
    libfeusb.FEUSB_GetScanListPara.restype = ctypes.c_int
    cValue = ctypes.create_string_buffer(32)
    return_value = libfeusb.FEUSB_GetScanListPara(0, b'Device-ID', cValue)
Your declaration of .argtypes would match the C prototype of:
int FEUSB_GetScanListPara(int, wchar_t*, char**)
You haven't provided the exact C prototype, but from your example of:
FEUSB_GetScanListPara( 0, "Device-ID", sDeviceSerialNumber ) ;
and knowing wchar_t* is not a common interface parameter, you probably actually have simple char* declarations like:
int FEUSB_GetScanListPara(int, const char*, char*);
I'm assuming the 2nd parameter is an input parameter and 3rd parameter is an output parameter. Note that c_char_p corresponds to a byte string so use b'DeviceID' for cPara. Also if you have to allocate the buffer, the 3rd parameter is unlikely to be char**. If the API itself is not returning a pointer, but filling out an already allocated buffer, char* and hence ctypes.c_char_p is appropriate. You correctly use create_string_buffer() for an output parameter.
Note you don't need to wrap iIndex in a c_int. From .argtypes, ctypes knows the 1st parameter is a c_int and converts it for you. That's also the default if no .argtypes is provided, but better to be explicit and provide .argtypes.
This code should work. I don't have the DLL to verify:
import ctypes as ct

libfeusb = ct.CDLL('./FEUSB')  # assuming the DLL is in the same directory

libfeusb.FEUSB_GetScanListPara.argtypes = ct.c_int, ct.c_char_p, ct.c_char_p
libfeusb.FEUSB_GetScanListPara.restype = ct.c_int

cValue = ct.create_string_buffer(32)
ret = libfeusb.FEUSB_GetScanListPara(0, b'Device-ID', cValue)
if ret == 0:
    print(cValue.value)
else:
    print('error:', ret)
If you still have issues, edit your question with a minimal, reproducible example. Make sure to provide the real C prototype.
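The caller-allocated output-buffer pattern can also be exercised without the FEUSB DLL at all, against plain libc (POSIX only; strcpy stands in here for any C function that fills a char* you hand it):

```python
import ctypes

# On POSIX, CDLL(None) opens the running process, which exposes libc symbols.
libc = ctypes.CDLL(None)
libc.strcpy.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
libc.strcpy.restype = ctypes.c_char_p

# Caller-allocated buffer, like "char sDeviceSerialNumber[32];" in the C example.
buf = ctypes.create_string_buffer(32)
libc.strcpy(buf, b'Device-ID')   # C writes through the char* - no byref() needed
print(buf.value)                 # b'Device-ID'
```

Note that the buffer object itself is passed where a c_char_p is expected; ctypes converts a c_char array to a pointer to its first element, which is exactly what a char* out-parameter needs.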

Use the installed version of libsqlite3 from Python

Is it possible to use the installed version of SQLite3 from Python, with ctypes? If so, how?
On a Mac, the below works without error:
from ctypes import CDLL
libsqlite3 = CDLL("libsqlite3.dylib")
... but then from https://www.sqlite.org/c3ref/sqlite3.html
Each open SQLite database is represented by a pointer to an instance of the opaque structure named "sqlite3".
(emphasis mine)
which to me suggests you can't really make a ctypes.Structure for the database, say to then pass to sqlite3_open.
(Context: I want to use parts of SQLite from Python that are not exposed by the built-in sqlite3 module)
The sqlite3 API uses an opaque pointer, so in the end there is no need to know its memory layout - one can just use a void pointer.
For example, opening a sqlite3-database would create such a pointer:
int sqlite3_open(
  const char *filename,   /* Database filename (UTF-8) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);
i.e. the second parameter is a pointer to a pointer. This function creates the structure and gives us its address - no need to know the exact layout of the structure at all.
Later, we only need the address of this structure to be able to use further functionality, i.e.:
int sqlite3_close(sqlite3*);
Type safety is ensured by the compiler; once we have the machine code, the gloves are off and we can pass anything instead of sqlite3* to the function - we just have to ensure that it works. Any pointer can be replaced by void* as long as it points to valid memory (i.e. with the correct memory layout). That leads to:
import ctypes

libsqlite3 = ctypes.CDLL("libsqlite3.dylib")

sqlite3_handle = ctypes.c_void_p()  # nullptr

# pass handle by reference:
res = libsqlite3.sqlite3_open(b"mydb.db", ctypes.byref(sqlite3_handle))
print("open result", res)                # check res == 0
print("pointer value:", sqlite3_handle)  # address is set

# do whatever is needed...

# example usage of handle:
res = libsqlite3.sqlite3_close(sqlite3_handle)
print("close result", res)  # check res == 0

sqlite3_handle = None  # make sure nobody accesses a dangling pointer
This is somewhat quick and dirty: usually one needs to set argument types and the return-value type. But for the functions above the defaults give correct behavior, so I've skipped this (otherwise important) step.
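For completeness, that skipped step for the two calls would look roughly like this (a sketch; the fallback library name is an assumption for Linux and may differ on your system):

```python
import ctypes
import ctypes.util

# find_library usually resolves to libsqlite3.so.0 on Linux / the dylib on macOS
name = ctypes.util.find_library('sqlite3') or 'libsqlite3.so.0'
libsqlite3 = ctypes.CDLL(name)

# Declaring argtypes/restype lets ctypes convert and check arguments for us
libsqlite3.sqlite3_open.argtypes = [ctypes.c_char_p,
                                    ctypes.POINTER(ctypes.c_void_p)]
libsqlite3.sqlite3_open.restype = ctypes.c_int
libsqlite3.sqlite3_close.argtypes = [ctypes.c_void_p]
libsqlite3.sqlite3_close.restype = ctypes.c_int

handle = ctypes.c_void_p()
res_open = libsqlite3.sqlite3_open(b':memory:', ctypes.byref(handle))
res_close = libsqlite3.sqlite3_close(handle)
print(res_open, res_close)  # both 0 (SQLITE_OK) on success
```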
Based on ead's answer, this is a more complete example of how to use libsqlite3 from Python, which is also at https://gist.github.com/michalc/a3147997e21665896836e0f4157975cb
The below defines a (generator) function, query
from contextlib import contextmanager
from collections import namedtuple
from ctypes import cdll, byref, string_at, c_char_p, c_int, c_double, c_int64, c_void_p
from sys import platform

def query(db_file, sql, params=()):
    libsqlite3 = cdll.LoadLibrary({'linux': 'libsqlite3.so', 'darwin': 'libsqlite3.dylib'}[platform])
    libsqlite3.sqlite3_errstr.restype = c_char_p
    libsqlite3.sqlite3_errmsg.restype = c_char_p
    libsqlite3.sqlite3_column_name.restype = c_char_p
    libsqlite3.sqlite3_column_double.restype = c_double
    libsqlite3.sqlite3_column_int64.restype = c_int64
    libsqlite3.sqlite3_column_blob.restype = c_void_p
    libsqlite3.sqlite3_column_bytes.restype = c_int64

    SQLITE_ROW = 100
    SQLITE_DONE = 101
    SQLITE_TRANSIENT = -1
    SQLITE_OPEN_READWRITE = 0x00000002

    bind = {
        type(0): libsqlite3.sqlite3_bind_int64,
        type(0.0): libsqlite3.sqlite3_bind_double,
        type(''): lambda pp_stmt, i, value: libsqlite3.sqlite3_bind_text(
            pp_stmt, i, value.encode('utf-8'), len(value.encode('utf-8')), SQLITE_TRANSIENT),
        type(b''): lambda pp_stmt, i, value: libsqlite3.sqlite3_bind_blob(
            pp_stmt, i, value, len(value), SQLITE_TRANSIENT),
        type(None): lambda pp_stmt, i, _: libsqlite3.sqlite3_bind_null(pp_stmt, i),
    }

    extract = {
        1: libsqlite3.sqlite3_column_int64,
        2: libsqlite3.sqlite3_column_double,
        3: lambda pp_stmt, i: string_at(
            libsqlite3.sqlite3_column_blob(pp_stmt, i),
            libsqlite3.sqlite3_column_bytes(pp_stmt, i),
        ).decode(),
        4: lambda pp_stmt, i: string_at(
            libsqlite3.sqlite3_column_blob(pp_stmt, i),
            libsqlite3.sqlite3_column_bytes(pp_stmt, i),
        ),
        5: lambda pp_stmt, i: None,
    }

    def run(func, *args):
        res = func(*args)
        if res != 0:
            raise Exception(libsqlite3.sqlite3_errstr(res).decode())

    def run_with_db(db, func, *args):
        if func(*args) != 0:
            raise Exception(libsqlite3.sqlite3_errmsg(db).decode())

    @contextmanager
    def get_db(db_file):
        db = c_void_p()
        run(libsqlite3.sqlite3_open_v2, db_file.encode(), byref(db), SQLITE_OPEN_READWRITE, None)
        try:
            yield db
        finally:
            run_with_db(db, libsqlite3.sqlite3_close, db)

    @contextmanager
    def get_pp_stmt(db, sql):
        pp_stmt = c_void_p()
        run_with_db(db, libsqlite3.sqlite3_prepare_v3, db, sql.encode(), -1, 0, byref(pp_stmt), None)
        try:
            yield pp_stmt
        finally:
            run_with_db(db, libsqlite3.sqlite3_finalize, pp_stmt)

    with \
            get_db(db_file) as db, \
            get_pp_stmt(db, sql) as pp_stmt:

        for i, param in enumerate(params):
            run_with_db(db, bind[type(param)], pp_stmt, i + 1, param)

        row_constructor = namedtuple('Row', (
            libsqlite3.sqlite3_column_name(pp_stmt, i).decode()
            for i in range(0, libsqlite3.sqlite3_column_count(pp_stmt))
        ))

        while True:
            res = libsqlite3.sqlite3_step(pp_stmt)
            if res == SQLITE_DONE:
                break
            if res != SQLITE_ROW:
                raise Exception(libsqlite3.sqlite3_errstr(res).decode())

            yield row_constructor(*(
                extract[libsqlite3.sqlite3_column_type(pp_stmt, i)](pp_stmt, i)
                for i in range(0, len(row_constructor._fields))
            ))
which can be used as, for example:
for row in query('my.db', 'SELECT * FROM my_table WHERE a = ?;', ('b',)):
    print(row)

Parsing C in python with libclang but generated the wrong AST

I want to use the libclang Python binding to generate the AST of some C code. The source code is shown below.
#include <stdlib.h>
#include "adlist.h"
#include "zmalloc.h"

list *listCreate(void)
{
    struct list *list;

    if ((list = zmalloc(sizeof(*list))) == NULL)
        return NULL;
    list->head = list->tail = NULL;
    list->len = 0;
    list->dup = NULL;
    list->free = NULL;
    list->match = NULL;
    return list;
}
And the implementation I wrote:
#!/usr/bin/python
# vim: set fileencoding=utf-8
import clang.cindex
import asciitree
import sys

def node_children(node):
    return (c for c in node.get_children() if c.location.file.name == sys.argv[1])

def print_node(node):
    text = node.spelling or node.displayname
    kind = str(node.kind)[str(node.kind).index('.')+1:]
    return '{} {}'.format(kind, text)

if len(sys.argv) != 2:
    print("Usage: dump_ast.py [header file name]")
    sys.exit()

clang.cindex.Config.set_library_file('/usr/lib/llvm-3.6/lib/libclang-3.6.so')
index = clang.cindex.Index.create()
translation_unit = index.parse(sys.argv[1], ['-x', 'c++', '-std=c++11', '-D__CODE_GENERATOR__'])
print(asciitree.draw_tree(translation_unit.cursor, node_children, print_node))
But the final output of this test looks like this:
TRANSLATION_UNIT adlist.c
+--FUNCTION_DECL listCreate
+--COMPOUND_STMT
+--DECL_STMT
+--STRUCT_DECL list
+--VAR_DECL list
+--TYPE_REF struct list
Obviously, the final result is wrong: much of the code is left unparsed. I have tried traversing the translation unit, but the result is just what the tree shows - many nodes are missing. Why is that? And is there any way to solve the problem? Thank you!
I guess the reason is that libclang is unable to parse malloc(), because the header declaring it is not visible to this parse, nor has a user-defined definition been provided for malloc.
The parse did not complete successfully, probably because you're missing some include paths.
You can confirm what the exact problem is by printing the diagnostic messages.
translation_unit = index.parse(sys.argv[1], args)
for diag in translation_unit.diagnostics:
    print(diag)

Python convert C header file to dict

I have a C header file which contains a series of classes, and I'm trying to write a function that will take those classes and convert them to a Python dict. A sample of the file is down the bottom.
The format is something like:
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
I'm hoping to turn it into something like
{CFGFunctions:{ABC:{AA:"myFuncName"}, BB:...}}
# Or
{CFGFunctions:{ABC:{AA:{myFuncName:"string or list or something"}, BB:...}}}
In the end, I'm aiming to get the filepath string (which is actually a path to a folder... but anyway), and the class names in the same class as the file/folder path.
I've had a look on SO and Google and so on, but most things I've found have been about splitting lines into dicts, rather than n-deep 'blocks'.
I know I'll have to loop through the file, however, I'm not sure the most efficient way to convert it to the dict.
I'm thinking I'd need to grab the outside class and its relevant brackets, then do the same for the text remaining inside.
If none of that makes sense, it's cause I haven't quite made sense of the process myself haha
If any more info is needed, I'm happy to provide.
The following code is a quick mockup of what I'm sort of thinking...
It is most likely BROKEN and probably does NOT WORK, but it's the sort of process I'm thinking of:
import re

def get_data():
    fh = open('CFGFunctions.h', 'r')
    data = {}  # will contain final data model
    # would probably refactor some of this into a function to allow better looping
    start = ""    # starting class name
    brackets = 0  # number of brackets
    text = ""     # temp storage for lines inside block while looping
    for line in fh:
        # find the class (start)
        mt = re.match(r'Class ([\w_]+) {', line)
        if mt:
            if start == "":
                start = mt.group(1)
        else:
            # once we have the first class, find all other open brackets
            mt = re.match(r'{', line)
            if mt:
                # and inc our counter
                brackets += 1
            mt2 = re.match(r'}', line)
            if mt2:
                # find the close, and decrement
                brackets -= 1
                # if we are back to the initial block, break out of the loop
                if brackets == 0:
                    break
        text += line
    data[start] = {'tempText': text}
====
Sample file
class CfgFunctions {
    class ABC {
        class Control {
            file = "abc\abc_sys_1\Modules\functions";
            class assignTracker {
                description = "";
                recompile = 1;
            };
            class modulePlaceMarker {
                description = "";
                recompile = 1;
            };
        };
        class Devices
        {
            file = "abc\abc_sys_1\devices\functions";
            class registerDevice { recompile = 1; };
            class getDeviceSettings { recompile = 1; };
            class openDevice { recompile = 1; };
        };
    };
};
EDIT:
If possible, if I have to use a package, I'd like to have it in the programs directory, not the general python libs directory.
As you detected, parsing is necessary to do the conversion. Have a look at the package PyParsing, which is a fairly easy-to-use library to implement parsing in your Python program.
Edit: This is a very symbolic version of what it would take to recognize a very minimalistic grammar - somewhat like the example at the top of the question. It won't work as-is, but it might put you in the right direction:
from pyparsing import ZeroOrMore, OneOrMore, Keyword, Literal, ParseException

test_code = """
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
"""

class_tkn = Keyword('class')
lbrace_tkn = Literal('{')
rbrace_tkn = Literal('}')
semicolon_tkn = Keyword(';')
assign_tkn = Keyword(';')

# NOTE: 'identifier' and 'assignment' are left undefined here, and a
# self-referencing rule like this needs pyparsing's Forward - that is
# part of the work left to do.
class_block = (class_tkn + identifier + lbrace_tkn +
               OneOrMore(class_block | ZeroOrMore(assignment)) +
               rbrace_tkn + semicolon_tkn)

def test_parser(test):
    try:
        results = class_block.parseString(test)
        print(test, ' -> ', results)
    except ParseException as s:
        print("Syntax error:", s)

def main():
    test_parser(test_code)
    return 0

if __name__ == '__main__':
    main()
Also, this code is only the parser - it does not generate any output. As you can see in the PyParsing docs, you can later add the actions you want. But the first step is to recognize what you want to translate.
And a last note: Do not underestimate the complexities of parsing code... Even with a library like PyParsing, which takes care of much of the work, there are many ways to get mired in infinite loops and other amenities of parsing. Implement things step-by-step!
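For a grammar this small, a hand-rolled tokenizer plus recursive descent also works with nothing but the standard library. Below is a sketch (the helper names `parse_block` and `header_to_dict` are made up); it assumes well-formed input like the samples above, where semicolons after values and closing braces are optional:

```python
import re

def parse_block(tokens, i):
    """Parse tokens after an opening '{' into a dict; return (dict, next index)."""
    body = {}
    while tokens[i] != '}':
        if tokens[i] == 'class':
            name = tokens[i + 1]          # tokens[i + 2] is the '{'
            body[name], i = parse_block(tokens, i + 3)
        else:                             # "key = value" assignment
            body[tokens[i]] = tokens[i + 2].strip('"')
            i += 3
        while i < len(tokens) and tokens[i] == ';':
            i += 1                        # semicolons are optional in the samples
    return body, i + 1                    # skip the closing '}'

def header_to_dict(text):
    # quoted strings, identifiers/numbers, and the four punctuation tokens
    tokens = re.findall(r'"[^"]*"|\w+|[{}=;]', text)
    result, i = {}, 0
    while i < len(tokens):
        if tokens[i] == 'class':
            result[tokens[i + 1]], i = parse_block(tokens, i + 3)
        else:
            i += 1
    return result

sample = '''
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
'''
print(header_to_dict(sample))
```

Each nested class becomes a nested dict and each assignment a string value, which matches the second desired-output shape in the question.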
EDIT: A few sources for information on PyParsing are:
http://werc.engr.uaf.edu/~ken/doc/python-pyparsing/HowToUsePyparsing.html
http://pyparsing.wikispaces.com/
(Particularly interesting is http://pyparsing.wikispaces.com/Publications, with a long list of articles - several of them introductory - on PyParsing)
http://pypi.python.org/pypi/pyparsing_helper is a GUI for debugging parsers
There is also a 'tag' Pyparsing here on stackoverflow, Where Paul McGuire (the PyParsing author) seems to be a frequent guest.
NOTE: From PaulMcG in the comments below: Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing
