Regular expression to match C's multiline preprocessor statements - python

I need to match multiline preprocessor statements such as:
#define max(a,b) \
({ typeof (a) _a = (a); \
typeof (b) _b = (b); \
_a > _b ? _a : _b; })
The goal is to match everything between #define and the final }), but I can't figure out how to write the regexp. It needs to work in Python, using the re module.
Could somebody help me please?
Thanks

This should do it:
r'(?m)^#define (?:.*\\\r?\n)*.*$'
(?:.*\\\r?\n)* matches zero or more lines ending with backslashes, then .*$ matches the final line.
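As a quick check of the pattern above, here is a minimal sketch that runs it against the macro from the question. The names MACRO_RE and source are made up for this illustration, and the trailing "int x = 1;" line is added only to show where the match stops.

```python
import re

# The pattern from the answer above; (?m) makes ^ and $ match at line boundaries.
MACRO_RE = re.compile(r'(?m)^#define (?:.*\\\r?\n)*.*$')

# The macro from the question, followed by one unrelated line of code.
source = (
    "#define max(a,b) \\\n"
    "({ typeof (a) _a = (a); \\\n"
    "typeof (b) _b = (b); \\\n"
    "_a > _b ? _a : _b; })\n"
    "int x = 1;\n"
)

# There are no capturing groups, so findall returns the full text of each match.
matches = MACRO_RE.findall(source)
for text in matches:
    print(text)
```

Since the continuation group is repeated zero or more times, the same pattern also matches single-line #define statements.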

I think something like this will work:
m = re.compile(r"^#define[\s\S]+?}\)*$", re.MULTILINE)
matches = m.findall(your_string_here)
This assumes that your macros all end with '}', with an optional ')' at the end.

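I think the above solution might not work for a macro that contains a multi-line comment, because the lines inside the comment do not end with a backslash: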
#define MACRO_ABC(abc, djhg) \
do { \
int i; \
/*
* multi line comment
*/ \
(int)i; \
} while(0);
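One possible workaround for the comment case, assuming it is acceptable to strip block comments before matching, is sketched below. COMMENT_RE and the variable names are hypothetical, and the comment pattern is naive: it would also remove a /* ... */ sequence that appears inside a string literal.

```python
import re

# Naive block-comment pattern: non-greedy, and DOTALL lets . cross newlines.
COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)
# The line-continuation pattern from the first answer.
MACRO_RE = re.compile(r"^#define (?:.*\\\r?\n)*.*$", re.MULTILINE)

source = (
    "#define MACRO_ABC(abc, djhg) \\\n"
    "do { \\\n"
    "int i; \\\n"
    "/*\n"
    "* multi line comment\n"
    "*/ \\\n"
    "(int)i; \\\n"
    "} while(0);\n"
)

# Matching the raw text stops at the first comment line without a backslash.
truncated = MACRO_RE.findall(source)

# Replace each comment with a single space, then match again.
without_comments = COMMENT_RE.sub(" ", source)
macros = MACRO_RE.findall(without_comments)
```

Replacing the comment with a space rather than an empty string keeps the tokens on either side of it separated.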

Related

Remove outer most curly bracket with Regex

I am trying to remove the outermost curly brackets while keeping only the inner string. My code almost works, except when
expr = 'namespace P {\\na; b;}'
# I expect '\\na; b;'
# but I get 'namespace P {\na; b;}' instead
Any idea how to fix my regex string?
import doctest
import re
def remove_outer_curly_bracket(expr):
    """
    >>> remove_outer_curly_bracket('P {')
    'P {'
    >>> remove_outer_curly_bracket('P')
    'P'
    >>> remove_outer_curly_bracket('P { a; b(); { c1(d,e); } }')
    ' a; b(); { c1(d,e); } '
    >>> remove_outer_curly_bracket('a { }')
    ' '
    >>> remove_outer_curly_bracket('')
    ''
    >>> remove_outer_curly_bracket('namespace P {\\na; b;}')
    '\\na; b;'
    """
    r = re.findall(r'[.]*\{(.*)\}', expr)
    return r[0] if r else expr
doctest.testmod()
This suffices:
def remove_outer_curly_bracket(expr):
    r = re.search(r'{(.*)}', expr, re.DOTALL)
    return r.group(1) if r else expr
The match will start as soon as possible, so the first { will indeed match the leftmost opening brace. Because * is greedy, .* will want to be as large as possible, which will ensure } will match the last closing brace.
Neither of the braces is a special character, and does not need escaping; also, [.]* matches any number of periods in a row, and will not help you at all in this task.
This will not work sensibly if the braces are not balanced; for example, "{ { x }" will return " { x ", but fortunately your examples do not include such cases.
EDIT: That said, this is just prettifying the original a bit. The functionality is unchanged. As blhsing says in comments, it seems your code is doing what it is supposed to. It even passes your tests.
EDIT2: There is nothing special in 'namespace P {\\na; b;}'. I believe you meant 'namespace P {\na; b;}'? With a line break inside? Indeed, that would not have worked. I changed my code so it does. The issue is that normally . matches every character except newline. We can modify that behaviour by supplying the flag re.DOTALL.
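The effect of re.DOTALL can be seen by running the same pattern with and without the flag on a string that contains a real line break. This is a small sketch; the variable names are made up for the illustration.

```python
import re

# A string with an actual newline between the braces.
text = 'namespace P {\na; b;}'

# Without DOTALL, . stops at the newline, so no closing brace can be reached.
without_flag = re.search(r'{(.*)}', text)

# With DOTALL, . also matches the newline and the group spans both lines.
with_flag = re.search(r'{(.*)}', text, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(1)))
```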

Easy/Simple way to write switch-like regular expressions

I'm new to Python and wondering what the best way is to write the Perl code below in Python:
if ($line =~ /(\d)/) {
$a = $1
}
elsif ($line =~ /(\d\d)/) {
$b = $1
}
elsif ($line =~ /(\d\d\d)/) {
$c = $1
}
What I want to do is retrieve a specific part of each line within a large set of lines. In Python, all I can come up with is the code below, which is very ugly.
res = re.search(r'(\d)', line)
if res:
    a = res.group(1)
else:
    res = re.search(r'(\d\d)', line)
    if res:
        b = res.group(1)
    else:
        res = re.search(r'(\d\d\d)', line)
        if res:
            c = res.group(1)
Does anyone know a better way to write the same thing without a non-built-in module?
EDIT:
How would you write it if you need to parse lines using very different regular expressions?
My point is that it should be simple, so that anyone can understand what the code is doing.
In perl, we can write:
if ($line =~ /^this is a sample line (.+) and contain single value$/) {
$name = $1
}
elsif ($line =~ /^this is another sample: (.+):(.+) two values here$/) {
($address, $call) = ($1, $2)
}
elsif ($line =~ /^ahhhh thiiiss isiss (\d+) last sample line$/) {
$description = $1
}
In my view, this kind of Perl code is very simple and easy to understand.
EDIT2:
I found the same discussion here:
http://bytes.com/topic/python/answers/750203-checking-string-against-multiple-patterns
So there's no way to write this in Python as simply as in Perl...
You could write yourself a helper function that stores the result of the match on the function object, so that you don't need to re-run the regex inside the if statement:
def search(patt, text):
    search.result = re.search(patt, text)
    return search.result
if search(r'(\d)', line):
    a = search.result.group(1)
elif search(r'(\d\d)', line):
    b = search.result.group(1)
elif search(r'(\d\d\d)', line):
    c = search.result.group(1)
In Python 3.8 and later, you can use an assignment expression:
if res := re.search(r'(\d)', line):
    a = res.group(1)
elif res := re.search(r'(\d\d)', line):
    b = res.group(1)
elif res := re.search(r'(\d\d\d)', line):
    c = res.group(1)
The order of the alternatives is very important. If you use the pattern (\d)|(\d\d)|(\d\d\d), the first alternative matches any single digit on its own, so the regex engine never tries the other two. Put the longest alternative first instead:
res = re.search(r'(\d\d\d)|(\d\d)|(\d)', line)
if res:
    a, b, c = res.group(3), res.group(2), res.group(1)
This is similar to Perl, except that Python uses 'elif' instead of 'elsif', puts ':' after the test, replaces curly braces with indentation, and makes the parentheses optional. There are many resources on the web that describe Python statements, and they are easy to find with a Google search.
if re.search(r'(\d)', line):
    a = re.search(r'(\d)', line).group(1)
elif re.search(r'(\d\d)', line):
    b = re.search(r'(\d\d)', line).group(1)
elif re.search(r'(\d\d\d)', line):
    c = re.search(r'(\d\d\d)', line).group(1)
Of course the logic of the code is flawed since 'b' and 'c' never get set but I think this is the syntax you were looking for.
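For the case raised in the question's edit, where each branch uses a very different pattern, one possible approach with only the standard library is a list of labelled patterns tried in order. This is a sketch, not an established idiom: RULES, parse_line and the labels are hypothetical names, while the three patterns are copied from the Perl example in the question.

```python
import re

# (label, compiled pattern) pairs, tried in order like the Perl if/elsif chain.
RULES = [
    ("name", re.compile(r"^this is a sample line (.+) and contain single value$")),
    ("address_call", re.compile(r"^this is another sample: (.+):(.+) two values here$")),
    ("description", re.compile(r"^ahhhh thiiiss isiss (\d+) last sample line$")),
]

def parse_line(line):
    """Return (label, captured groups) for the first rule that matches, else None."""
    for label, pattern in RULES:
        match = pattern.search(line)
        if match:
            return label, match.groups()
    return None

print(parse_line("this is a sample line Bob and contain single value"))
print(parse_line("this is another sample: Main St:555 two values here"))
```

Adding a new case then means adding one entry to the list rather than another nested branch.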

regex to remove hyphens and spaces

I've got the string:
<u>40 -04-11</u>
How do I remove the spaces and hyphens so it returns 400411?
Currently I've got this:
(<u[^>]*>)(\-\s)(<\/u>)
But I can't figure out why it isn't working. Any insight would be appreciated.
Thanks
(<u[^>]*>)(\-\s)(<\/u>)
Your pattern above doesn't tell your regex where to expect numbers.
(<u[^>]*>)(?:-|\s|(\d+))*(<\/u>)
That should get you started, but not being a python guy, I can't give you the exact replacement syntax. Just be aware that the digits are in a repeating capture group.
Edit: This is an edit in response to your comment. Like I said, not a python guy, but this will probably do what you need if you hold your tongue just right.
import re
def repl(matchobj):
    if matchobj.group(1) is None:
        return ''
    else:
        return matchobj.group(1)
source = '<u>40 -04-11</u>40 -04-11<u>40 -04-11</u>40 -04-11'
print(re.sub(r'(?:\-|\s|(\d+))(?=[^><]*?<\/u>)', repl, source))
Results in:
>>>'<u>400411</u>40 -04-11<u>400411</u>40 -04-11'
If the above offends the Python deities, I promise to sacrifice the next PHP developer I come across. :)
You don't really need a regex; you could use:
>>> '<u>40 -04-11</u>'.replace('-','').replace(' ','')
'<u>400411</u>'
Using Perl syntax:
s{
(<u[^>]*>) (.*?) (</u>)
}{
my ($start, $body, $end) = ($1, $2, $3);
$body =~ s/[-\s]//g;
$start . $body . $end
}xesg;
Or if Python doesn't have an equivalent to /e,
my $out = '';
while (
$in =~ m{
\G (.*?)
(?: (<u[^>]*>) (.*?) (</u>) | \z )
}sg
) {
my ($pre, $start, $body, $end) = ($1, $2, $3, $4);
$out .= $pre;
if (defined($start)) {
$body =~ s/[-\s]//g;
$out .= $start . $body . $end;
}
}
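Python's re.sub accepts a function as the replacement argument, which plays a role similar to Perl's /e. Below is a minimal sketch of the first Perl snippet translated to Python; the names strip_inside_u and clean are hypothetical.

```python
import re

def strip_inside_u(text):
    """Remove hyphens and whitespace, but only between <u ...> and </u>."""
    def clean(match):
        start, body, end = match.group(1), match.group(2), match.group(3)
        # Only the body between the tags is cleaned; the tags are kept as-is.
        return start + re.sub(r"[-\s]", "", body) + end
    # DOTALL corresponds to the /s modifier in the Perl version.
    return re.sub(r"(<u[^>]*>)(.*?)(</u>)", clean, text, flags=re.DOTALL)

print(strip_inside_u('<u>40 -04-11</u>40 -04-11'))
```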
I'm admittedly not very good at regexes, but the way I would do this is by:
Doing a match on a <u>...</u> pair
Doing a re.sub on the bit between the tags, using group().
That looks like this:
import re
example_str = "<u> 76-6-76s</u> 34243vvfv"
tmp = re.search(r"(<u[^>]*>)(.*?)(</u>)", example_str).group(2)
clean_str = re.sub(r"\D", "", tmp)
>>>'76676'
You should state your problem more precisely. At first I didn't understand it.
Having read your comment (only between the <u> and </u> tags), I can now propose:
import re
ss = '87- 453- kol<u>40 -04-11</u> maa78-55 98 12'
print(re.sub('(?<=<u>).+?(?=</u>)',
             lambda mat: ''.join(c for c in mat.group() if c not in ' -'),
             ss))
result
87- 453- kol<u>400411</u> maa78-55 98 12

how to get the function declaration or definitions using regex

I want to get only function prototypes like
int my_func(char, int, float)
void my_func1(void)
my_func2()
from C files using regex and python.
Here is my regex format: ".*\(.*|[\r\n]\)\n"
This is a convenient script I wrote for such tasks, but it won't give the return types. It only extracts function names and argument lists.
# Extract routine signatures from a C++ module
import re
def loadtxt(filename):
    "Load a text file into a string. File exceptions are allowed to propagate."
    with open(filename) as f:
        return f.read()
# regex group1, name group2, arguments group3
rproc = r"((?<=[\s:~])(\w+)\s*\(([\w\s,<>\[\].=&':/*]*?)\)\s*(const)?\s*(?={))"
code = loadtxt('your file name here')
cppwords = ['if', 'while', 'do', 'for', 'switch']
procs = [(i.group(2), i.group(3)) for i in re.finditer(rproc, code)
         if i.group(2) not in cppwords]
for name, args in procs:
    print(name + '(' + args + ')')
See if your C compiler has an option to output a file of just the prototypes of what it is compiling. For gcc, it's -aux-info FILENAME
I think regex isn't best solution in your case. There are many traps like comments, text in string etc., but if your function prototypes share common style:
type fun_name(args);
then \w+ \w+\(.*\); should work in most cases:
mn> egrep "\w+ \w+\(.*\);" *.h
md5.h:extern bool md5_hash(const void *buff, size_t len, char *hexsum);
md5file.h:int check_md5files(const char *filewithsums, const char *filemd5sum);
I think this one should do the work:
r"^\s*[\w_][\w\d_]*\s*.*\s*[\w_][\w\d_]*\s*\(.*\)\s*$"
which will be expanded into:
string begin:
^
any number of whitespaces (including none):
\s*
return type:
- start with letter or _:
[\w_]
- continue with any letter, digit or _:
[\w\d_]*
any number of whitespaces:
\s*
any number of any characters
(for allow pointers, arrays and so on,
could be replaced with more detailed checking):
.*
any number of whitespaces:
\s*
function name:
- start with letter or _:
[\w_]
- continue with any letter, digit or _:
[\w\d_]*
any number of whitespaces:
\s*
open arguments list:
\(
arguments (allow none):
.*
close arguments list:
\)
any number of whitespaces:
\s*
string end:
$
It's not totally correct for matching all possible combinations, but it should work in most cases. If you want it to be more accurate, just let me know.
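The breakdown above can be written directly in Python with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is a sketch that applies the regex line by line to the three prototypes from the question; PROTOTYPE_RE and the sample lines are made up for the illustration.

```python
import re

# The same regex as above, laid out with re.VERBOSE.
PROTOTYPE_RE = re.compile(r"""
    ^\s*                # optional leading whitespace
    [\w_][\w\d_]*       # return type
    \s* .* \s*          # anything in between, e.g. pointers or qualifiers
    [\w_][\w\d_]*       # function name
    \s* \( .* \)        # argument list, possibly empty
    \s*$                # optional trailing whitespace
""", re.VERBOSE)

lines = [
    "int my_func(char, int, float)",
    "void my_func1(void)",
    "my_func2()",
    "int counter = 0;",
]

# Keep only the lines that the pattern accepts as prototypes.
prototypes = [line for line in lines if PROTOTYPE_RE.match(line)]
for line in prototypes:
    print(line)
```

The pattern accepts "my_func2()" only through backtracking, with the "return type" part taking the start of the name. It would also accept other lines that merely end in a parenthesised group, so it is best treated as a first filter.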
EDIT:
Disclaimer - I'm quite new to both Python and Regex, so please be indulgent ;)
There are LOTS of pitfalls in trying to "parse" C code (or at least extract some information from it) with just regular expressions. I would definitely borrow a C grammar for your favourite parser generator (say Bison, or whatever alternative there is for Python; there are C grammars available as examples everywhere) and add the actions in the corresponding rules.
Also, do not forget to run the C preprocessor on the file before parsing.
I built on Nick Dandoulakis's answer for a similar use case. I wanted to find the definition of the socket function in glibc. This finds a bunch of functions with "socket" in the name but socket was not found, highlighting what many others have said: there are probably better ways to extract this information, like tools provided by compilers.
# find_functions.py
#
# Extract routine signatures from a C++ module
import re
import sys
def loadtxt(filename):
    # Load a text file into a string. File exceptions are allowed to propagate.
    with open(filename) as f:
        return f.read()
# regex group1, name group2, arguments group3
rproc = r"((?<=[\s:~])(\w+)\s*\(([\w\s,<>\[\].=&':/*]*?)\)\s*(const)?\s*(?={))"
path = sys.argv[1]
code = loadtxt(path)
cppwords = ['if', 'while', 'do', 'for', 'switch']
procs = [i.group(1) for i in re.finditer(rproc, code)
         if i.group(2) not in cppwords]
for proc in procs:
    print(path + ": " + proc)
Then
$ cd glibc
$ find . -name "*.c" -print0 | xargs -0 -n 1 python find_functions.py | grep ':.*socket'
./hurd/hurdsock.c: _hurd_socket_server (int domain, int dead)
./manual/examples/mkfsock.c: make_named_socket (const char *filename)
./manual/examples/mkisock.c: make_socket (uint16_t port)
./nscd/connections.c: close_sockets (void)
./nscd/nscd.c: nscd_open_socket (void)
./nscd/nscd_helper.c: wait_on_socket (int sock, long int usectmo)
./nscd/nscd_helper.c: open_socket (request_type type, const char *key, size_t keylen)
./nscd/nscd_helper.c: __nscd_open_socket (const char *key, size_t keylen, request_type type,
./socket/socket.c: __socket (int domain, int type, int protocol)
./socket/socketpair.c: socketpair (int domain, int type, int protocol, int fds[2])
./sunrpc/key_call.c: key_call_socket (u_long proc, xdrproc_t xdr_arg, char *arg,
./sunrpc/pm_getport.c: __get_socket (struct sockaddr_in *saddr)
./sysdeps/mach/hurd/socket.c: __socket (int domain, int type, int protocol)
./sysdeps/mach/hurd/socketpair.c: __socketpair (int domain, int type, int protocol, int fds[2])
./sysdeps/unix/sysv/linux/socket.c: __socket (int fd, int type, int domain)
./sysdeps/unix/sysv/linux/socketpair.c: __socketpair (int domain, int type, int protocol, int sv[2])
In my case, this and this might help me, except it seems like I will need to read assembly code to reuse the strategy described there.
The regular expression below also handles destructors and const member functions:
^\s*\~{0,1}[\w_][\w\d_]*\s*.*\s*[\w_][\w\d_]*\s*\(.*\)\s*(const){0,1}$

python regex to match multi-line preprocessor macro

What follows is a regular expression I have written to match multi-line pre-processor macros in C / C++ code. I'm by no means a regular expressions guru, so I'd welcome any advice on how I can make this better.
Here's the regex:
\s*#define(.*\\\n)+[\S]+(?!\\)
It should match all of this:
#define foo(x) if(x) \
doSomething(x)
But only some of this (it shouldn't match the next line of code):
#define foo(x) if(x) \
doSomething(x)
normalCode();
And also shouldn't match single-line preprocessor macros.
I'm pretty sure that the regex above works, but as I said, there is probably a better way of doing it, and I imagine that there are ways of breaking it. Can anyone suggest any?
This is a simple test program I knocked up:
#!/usr/bin/env python
TEST1="""
#include "Foo.h"
#define bar foo\\
x
#include "Bar.h"
"""
TEST2="""
#define bar foo
#define x 1 \\
12 \\
2 \\\\ 3
Foobar
"""
TEST3="""
#define foo(x) if(x) \\
doSomething(x)
"""
TEST4="""
#define foo(x) if(x) \\
doSomething(x)
normalCode();
"""
import re
matcher = re.compile(r"^[ \t]*#define(.*\\\n)+.*$",re.MULTILINE)
def extractDefines(s):
    mo = matcher.search(s)
    if not mo:
        print(mo)
        return
    print(mo.group(0))
extractDefines(TEST1)
extractDefines(TEST2)
extractDefines(TEST3)
extractDefines(TEST4)
The re I used:
r"^[ \t]*#define(.*\\\n)+.*$"
is very similar to the one you used. The changes:
[ \t] instead of \s, to avoid matching newlines before the start of the define.
I rely on + being greedy, so I can use a simple .*$ at the end to get the first line of the define that doesn't end with \
start = r"^\s*#define\s+"
continuation = r"(?:.*\\\n)+"
lastline = r".*$"
re_multiline_macros = re.compile(start + continuation + lastline,
re.MULTILINE)
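To see how these patterns behave on the cases from the question, here is a small sketch using the [ \t]* variant, since \s* at the start can also swallow preceding blank lines. MULTILINE_MACRO_RE and the sample source are hypothetical names and data for the illustration.

```python
import re

# One or more continuation lines ending in a backslash, then one final line.
MULTILINE_MACRO_RE = re.compile(r"^[ \t]*#define(?:.*\\\n)+.*$", re.MULTILINE)

source = (
    "#define bar foo\n"            # single-line macro: should be skipped
    "#define foo(x) if(x) \\\n"    # multi-line macro: should be matched
    "doSomething(x)\n"
    "normalCode();\n"              # ordinary code: should not be included
)

macros = [m.group(0) for m in MULTILINE_MACRO_RE.finditer(source)]
for macro in macros:
    print(macro)
```

Because the continuation group uses + rather than *, a #define with no trailing backslash never matches, which is the behaviour the question asks for.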
