Python regex to match multi-line preprocessor macros
What follows is a regular expression I have written to match multi-line pre-processor macros in C / C++ code. I'm by no means a regular expressions guru, so I'd welcome any advice on how I can make this better.
Here's the regex:
\s*#define(.*\\\n)+[\S]+(?!\\)
It should match all of this:
#define foo(x) if(x) \
doSomething(x)
But it should match only part of this (it shouldn't match the line of normal code at the end):
#define foo(x) if(x) \
doSomething(x)
normalCode();
And also shouldn't match single-line preprocessor macros.
I'm pretty sure that the regex above works, but as I said, there's probably a better way of doing it, and I imagine there are ways of breaking it. Can anyone suggest any?
This is a simple test program I knocked up:
#!/usr/bin/env python
TEST1="""
#include "Foo.h"
#define bar foo\\
x
#include "Bar.h"
"""
TEST2="""
#define bar foo
#define x 1 \\
12 \\
2 \\\\ 3
Foobar
"""
TEST3="""
#define foo(x) if(x) \\
doSomething(x)
"""
TEST4="""
#define foo(x) if(x) \\
doSomething(x)
normalCode();
"""
import re
matcher = re.compile(r"^[ \t]*#define(.*\\\n)+.*$",re.MULTILINE)
def extractDefines(s):
    mo = matcher.search(s)
    if not mo:
        print(mo)
        return
    print(mo.group(0))
extractDefines(TEST1)
extractDefines(TEST2)
extractDefines(TEST3)
extractDefines(TEST4)
The re I used:
r"^[ \t]*#define(.*\\\n)+.*$"
is very similar to the one you used; the changes:
- [ \t] avoids matching newlines at the start of the define.
- I rely on + being greedy, so I can use a simple .*$ at the end to get the first line of the define that doesn't end with \.
start = r"^\s*#define\s+"
continuation = r"(?:.*\\\n)+"
lastline = r".*$"
re_multiline_macros = re.compile(start + continuation + lastline,
re.MULTILINE)
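As a quick sanity check, the composed pattern behaves as described on both a multi-line and a single-line macro (a sketch using abbreviated test strings, not the full TEST fixtures above):

```python
import re

start = r"^\s*#define\s+"
continuation = r"(?:.*\\\n)+"
lastline = r".*$"
re_multiline_macros = re.compile(start + continuation + lastline,
                                 re.MULTILINE)

multi = "#define foo(x) if(x) \\\n    doSomething(x)\nnormalCode();"
single = "#define bar foo\nnormalCode();"

m = re_multiline_macros.search(multi)
print(m.group(0))                          # match stops before normalCode()
print(re_multiline_macros.search(single))  # None: single-line macros don't match
```

The (?:.*\\\n)+ part requires at least one backslash-continued line, which is what excludes single-line macros.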
Related
python re.search not working on multiline string
I have this file loaded into a string:

// some preceding stuff
static char header_data[] = {
 1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,
 1,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,
 1,1,0,1,0,1,0,1,1,0,1,0,1,0,1,1,
 1,0,1,1,1,0,0,1,1,0,0,1,1,1,0,1,
 0,0,0,1,1,1,1,1,1,1,1,1,1,0,1,1,
 1,0,0,0,1,1,0,1,1,1,1,1,0,1,1,1,
 0,1,0,0,0,1,0,0,1,1,1,1,0,0,0,0,
 0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,
 0,1,1,1,0,0,0,0,0,0,1,1,0,1,1,0,
 0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,
 1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,1,
 1,1,0,1,1,1,1,1,1,1,1,0,0,0,1,1,
 1,0,1,1,1,0,0,1,1,0,0,0,0,0,1,1,
 1,1,0,1,0,1,0,1,1,1,1,0,0,0,0,1,
 1,1,1,0,1,1,0,1,1,0,1,1,1,1,0,1,
 1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1
};

I want to get only the block with ones and zeros, and then somehow process it. I imported re and tried:

In [11]: re.search('static char header_data(.*);', src, flags=re.M)

In [12]: re.findall('static char header_data(.*);', src, flags=re.M)
Out[12]: []

Why doesn't it match anything? How can I fix this? (It's Python 3.)
You need to use the re.S flag, not re.M. re.M (re.MULTILINE) controls the behavior of ^ and $ (whether they match at the start/end of the entire string or of each line). re.S (re.DOTALL) controls the behavior of the . and is the option you need when you want to allow the dot to match newlines. See also the documentation.
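A minimal sketch of the difference, using a shortened version of the string from the question:

```python
import re

src = "static char header_data[] = {\n1,0,1\n};"

# re.M only changes where ^ and $ match; the dot still stops at newlines,
# and there is no ';' on the same line as 'header_data', so no match:
print(re.search(r"header_data(.*);", src, flags=re.M))  # None

# re.S lets the dot cross newlines, so everything up to the ';' is captured:
m = re.search(r"header_data(.*);", src, flags=re.S)
print(m.group(1))
```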
"and then somehow process it." Here we go, to get a usable list out of the file:

import re
match = re.search(r"static char header_data\[\] = {(.*?)};", src, re.DOTALL)
if match:
    header_data = "".join(match.group(1).split()).split(',')
    print(header_data)

.*? is a non-greedy match, so you really will get just the value between this set of braces. A more explicit way, without DOTALL or MULTILINE, would be:

match = re.search(r"static char header_data\[\] = {([01,\s\r\n]*?)};", src)
If the format of the file does not change, you might as well not resort to re but use slices. Something along these lines could be useful:

>>> lines = file_in_string.split()
>>> lines[9:-1]
['1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,',
 '1,1,1,0,1,1,0,1,1,0,1,1,0,1,1,1,',
 '1,1,0,1,0,1,0,1,1,0,1,0,1,0,1,1,',
 '1,0,1,1,1,0,0,1,1,0,0,1,1,1,0,1,',
 '0,0,0,1,1,1,1,1,1,1,1,1,1,0,1,1,',
 '1,0,0,0,1,1,0,1,1,1,1,1,0,1,1,1,',
 '0,1,0,0,0,1,0,0,1,1,1,1,0,0,0,0,',
 '0,1,1,0,0,0,0,0,0,1,1,1,1,1,1,0,',
 '0,1,1,1,0,0,0,0,0,0,1,1,0,1,1,0,',
 '0,0,0,0,1,0,0,0,1,0,0,1,0,1,0,0,',
 '1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,1,',
 '1,1,0,1,1,1,1,1,1,1,1,0,0,0,1,1,',
 '1,0,1,1,1,0,0,1,1,0,0,0,0,0,1,1,',
 '1,1,0,1,0,1,0,1,1,1,1,0,0,0,0,1,',
 '1,1,1,0,1,1,0,1,1,0,1,1,1,1,0,1,',
 '1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1']

Here file_in_string holds the file contents shown in the question; the nine whitespace-separated tokens before the data rows are why the slice starts at index 9, and the trailing '};' is dropped by the -1.
Increase C++ regex replace performance
I'm a beginner C++ programmer working on a small C++ project in which I have to process a number of relatively large XML files and remove the XML tags from them. I've succeeded in doing so using the C++0x regex library. However, I'm running into some performance issues. Just reading in the files and executing regex_replace over their contents takes around 6 seconds on my PC. I can bring this down to 2 by adding some compiler optimization flags. Using Python, however, I can get it done in less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?

My C++ code:

std::regex xml_tags_regex("<[^>]*>");

for (std::vector<std::string>::iterator it = _files.begin();
     it != _files.end(); it++) {
    std::ifstream file(*it);
    file.seekg(0, std::ios::end);
    size_t size = file.tellg();
    std::string buffer(size, ' ');
    file.seekg(0);
    file.read(&buffer[0], size);
    buffer = regex_replace(buffer, xml_tags_regex, "");
    file.close();
}

My Python code:

regex = re.compile('<[^>]*>')
for filename in filenames:
    with open(filename) as f:
        content = f.read()
        content = regex.sub('', content)

P.S. I don't really care about processing the complete file at once. I just found that reading a file line by line, word by word or character by character slowed it down considerably.
C++11 regex replace is indeed rather slow, as of yet, at least. PCRE performs much better in terms of pattern matching speed; however, PCRECPP provides very limited means for regular-expression-based substitution. Citing the man page:

    You can replace the first match of "pattern" in "str" with "rewrite". Within "rewrite", backslash-escaped digits (\1 to \9) can be used to insert text matching corresponding parenthesized group from the pattern. \0 in "rewrite" refers to the entire matching text.

This is really poor compared to Perl's 's' command. That is why I wrote my own C++ wrapper around PCRE that handles regular-expression-based substitution in a fashion close to Perl's 's', and also supports 16- and 32-bit character strings: PCRSCPP.

Command string syntax

Command syntax follows the Perl s/pattern/substitute/[options] convention. Any character (except the backslash \) can be used as a delimiter, not just /, but make sure that the delimiter is escaped with a backslash (\) if used in the pattern, substitute or options substrings, e.g.:

    s/\\/\//g to replace all backslashes with forward ones

Remember to double backslashes in C++ code, unless using a raw string literal (see string literal):

    pcrscpp::replace rx("s/\\\\/\\//g");

Pattern string syntax

The pattern string is passed directly to pcre*_compile, and thus has to follow PCRE syntax as described in the PCRE documentation.

Substitute string syntax

Substitute string backreferencing syntax is similar to Perl's:

    $1 ... $n: nth capturing subpattern matched
    $& and $0: the whole match
    ${label}: labeled subpattern matched; label is up to 32 alphanumerical + underscore characters ('A'-'Z','a'-'z','0'-'9','_'), first character must be alphabetical
    $` and $' (backtick and tick) refer to the areas of the subject before and after the match, respectively. As in Perl, the unmodified subject is used, even if a global substitution previously matched.

Also, the following escape sequences get recognized:

    \n: newline
    \r: carriage return
    \t: horizontal tab
    \f: form feed
    \b: backspace
    \a: alarm, bell
    \e: escape
    \0: binary zero

Any other escape sequence \<char> is interpreted as <char>, meaning that you have to escape backslashes too.

Options string syntax

In Perl-like manner, the options string is a sequence of allowed modifier letters. PCRSCPP recognizes the following modifiers:

Perl-compatible flags:
    g: global replace, not just the first match
    i: case insensitive match (PCRE_CASELESS)
    m: multi-line mode: ^ and $ additionally match positions after and before newlines, respectively (PCRE_MULTILINE)
    s: let the scope of the . metacharacter include newlines (treat newlines as ordinary characters) (PCRE_DOTALL)
    x: allow extended regular expression syntax, enabling whitespace and comments in complex patterns (PCRE_EXTENDED)

PHP-compatible flags:
    A: "anchor" pattern: look only for "anchored" matches: ones that start with zero offset. In single-line mode is identical to prefixing all pattern alternative branches with ^ (PCRE_ANCHORED)
    D: treat dollar $ as subject end assertion only, overriding the default: end, or immediately before a newline at the end. Ignored in multi-line mode (PCRE_DOLLAR_ENDONLY)
    U: invert * and + greediness logic: make ungreedy by default, ? switches back to greedy. (?U) and (?-U) in-pattern switches remain unaffected (PCRE_UNGREEDY)
    u: Unicode mode. Treat pattern and subject as UTF8/UTF16/UTF32 strings. Unlike in PHP, also affects newlines, \R, \d, \w, etc. matching ((PCRE_UTF8/PCRE_UTF16/PCRE_UTF32) | PCRE_NEWLINE_ANY | PCRE_BSR_UNICODE | PCRE_UCP)

PCRSCPP's own flags:
    N: skip empty matches (PCRE_NOTEMPTY)
    T: treat substitute as a trivial string, i.e., make no backreference and escape sequence interpretation
    n: discard non-matching portions of the string to replace

Note: PCRSCPP does not automatically add newlines; the replacement result is a plain concatenation of matches. Be specifically aware of this in multiline mode.

I wrote a simple speed test, which stores a 10x copy of the file "move.sh" and tests regex performance on the resulting string:

#include <pcrscpp.h>
#include <string>
#include <iostream>
#include <fstream>
#include <regex>
#include <chrono>

int main (int argc, char *argv[]) {
    const std::string file_name("move.sh");
    pcrscpp::replace pcrscpp_rx(R"del(s/(?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n)/$1\n$2\n/Dgn)del");
    std::regex std_rx (R"del((?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n))del");

    std::ifstream file (file_name);
    if (!file.is_open ()) {
        std::cerr << "Unable to open file " << file_name << std::endl;
        return 1;
    }
    std::string buffer;
    {
        file.seekg(0, std::ios::end);
        size_t size = file.tellg();
        file.seekg(0);
        if (size > 0) {
            buffer.resize(size);
            file.read(&buffer[0], size);
            buffer.resize(size - 1); // strip '\0'
        }
    }
    file.close();

    std::string bigstring;
    bigstring.reserve(10*buffer.size());
    for (std::string::size_type i = 0; i < 10; i++)
        bigstring.append(buffer);

    int n = 10;
    std::cout << "Running tests " << n << " times: be patient..." << std::endl;

    std::chrono::high_resolution_clock::duration std_regex_duration, pcrscpp_duration;
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::string result1, result2;
    for (int i = 0; i < n; i++) {
        // clear result
        std::string().swap(result1);
        t1 = std::chrono::high_resolution_clock::now();
        result1 = std::regex_replace (bigstring, std_rx, "$1\\n$2",
                                      std::regex_constants::format_no_copy);
        t2 = std::chrono::high_resolution_clock::now();
        std_regex_duration = (std_regex_duration*i + (t2 - t1)) / (i + 1);

        // clear result
        std::string().swap(result2);
        t1 = std::chrono::high_resolution_clock::now();
        result2 = pcrscpp_rx.replace_copy (bigstring);
        t2 = std::chrono::high_resolution_clock::now();
        pcrscpp_duration = (pcrscpp_duration*i + (t2 - t1)) / (i + 1);
    }

    std::cout << "Time taken by std::regex_replace: " << std_regex_duration.count()
              << " ms" << std::endl
              << "Result size: " << result1.size() << std::endl;
    std::cout << "Time taken by pcrscpp::replace: " << pcrscpp_duration.count()
              << " ms" << std::endl
              << "Result size: " << result2.size() << std::endl;
    return 0;
}

(Note that the std and pcrscpp regular expressions do the same thing here; the trailing newline in the expression for pcrscpp is due to std::regex_replace not stripping newlines despite std::regex_constants::format_no_copy.)

I launched it on a large (20.9 MB) shell move script:

Running tests 10 times: be patient...
Time taken by std::regex_replace: 12090771487 ms
Result size: 101087330
Time taken by pcrscpp::replace: 5910315642 ms
Result size: 101087330

As you can see, PCRSCPP is more than 2x faster. And I expect this gap to grow as pattern complexity increases, since PCRE deals with complicated patterns much better. I originally wrote the wrapper for myself, but I think it can be useful for others too. Regards, Alex
I don't think you're doing anything "wrong" per se; the C++ regex library just isn't as fast as the Python one (for this use case, at this time, at least). This isn't too surprising: the Python regex code is all C/C++ under the hood as well, and has been tuned over the years to be pretty fast, as that's a fairly important feature in Python. But there are other options in C++ for getting things faster if you need to. I've used PCRE ( http://pcre.org/ ) in the past with great results, though I'm sure there are other good ones out there these days as well.

For this case in particular, however, you can also achieve what you're after without regexes, which in my quick tests yielded a 10x performance improvement. The following code scans your input string, copying everything to a new buffer; when it hits a < it starts skipping over characters until it sees the closing >:

std::string buffer(size, ' ');
std::string outbuffer(size, ' ');
// ... read in buffer from your file

size_t outbuffer_len = 0;
for (size_t i = 0; i < buffer.size(); ++i) {
    if (buffer[i] == '<') {
        while (i < buffer.size() && buffer[i] != '>') {
            ++i;
        }
    } else {
        outbuffer[outbuffer_len] = buffer[i];
        ++outbuffer_len;
    }
}
outbuffer.resize(outbuffer_len);
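The same scan is easy to express in Python if you want to compare it against re.sub directly (a sketch of the algorithm described above, not a full XML parser; it assumes well-formed, non-nested tags):

```python
def strip_tags(buffer: str) -> str:
    # Copy characters to the output, skipping everything from '<'
    # up to and including the next '>'.
    out = []
    i = 0
    n = len(buffer)
    while i < n:
        if buffer[i] == '<':
            while i < n and buffer[i] != '>':
                i += 1
        else:
            out.append(buffer[i])
        i += 1
    return ''.join(out)

print(strip_tags("<p>hello <b>world</b></p>"))  # hello world
```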
replacing string in python
I have the following sequences in C code:

variable == T_CONSTANT

or

variable != T_CONSTANT

Using Python, how can I replace these with SOME_MACRO(variable) or !SOME_MACRO(variable), respectively?
A very simple and error-prone method is to use regular expressions:

>>> s = "a == T_CONSTANT"
>>> import re
>>> re.sub(r"(\w+)\s*==\s*T_CONSTANT", r"SOME_MACRO(\1)", s)
'SOME_MACRO(a)'

A similar regex can be used for the != part.
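Spelling out both substitutions on a small made-up input (a sketch; the != case just prepends '!' to the replacement):

```python
import re

s = "a == T_CONSTANT and b != T_CONSTANT"
s = re.sub(r"(\w+)\s*==\s*T_CONSTANT", r"SOME_MACRO(\1)", s)
s = re.sub(r"(\w+)\s*!=\s*T_CONSTANT", r"!SOME_MACRO(\1)", s)
print(s)  # SOME_MACRO(a) and !SOME_MACRO(b)
```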
OK, I know you've already accepted the other answer. However, I just couldn't help but throw this out, because your problem is one that often comes up and regular expressions may not always be suitable. This chunk of code defines a tiny, limited, non-recursive Parsing Expression Grammar, which allows you to describe what you're searching for in terms of a series of compiled regular expressions, alternatives (tuples of different matching strings) and plain strings. This format can be more convenient than a regular expression for a computer language because it looks similar to the formal specifications of the language's syntax. Basically, [varname, ("==", "!="), "T_CONSTANT"] describes what you're looking for, and the action() function describes what you want to do when you find it. I've included a corpus of example "C" code to demonstrate the parser.

import re

# The match() function takes a parsing specification and a list of
# words and tries to match them together. It will accept compiled
# regular expressions, tuples of strings or plain strings.
__matcher = re.compile("x")  # Dummy for testing spec elements.

def match(spec, words):
    if len(spec) > len(words):
        return False
    for i in range(len(spec)):
        if type(__matcher) is type(spec[i]):
            if not spec[i].match(words[i]):
                return False
        elif type(()) is type(spec[i]):
            if words[i] not in spec[i]:
                return False
        else:
            if words[i] != spec[i]:
                return False
    return True

# parse() takes a parsing specification, an action to execute if the
# spec matches and the text to parse. There can be multiple matches in
# the text. It splits and rejoins on spaces. A better tokenisation
# method is left to the reader...
def parse(spec, action, text):
    words = text.strip().split()
    n = len(spec)
    out = []
    while words:
        if match(spec, words[:n+1]):
            out.append(action(words[:n+1]))
            words = words[n:]
        else:
            out.append(words[0])
            words = words[1:]
    return " ".join(out)

# This code is only executed if this file is run standalone (so you
# can use the above as a library module...)
if "__main__" == __name__:
    # This is a chunk of bogus C code to demonstrate the parser with:
    corpus = """\
/* This is a dummy. */
variable == T_CONSTANT
variable != T_CONSTANT /* Prefix! */
variable != T_CONSTANT
variable == T_CONSTANT /* This is a test. */
variable != T_CONSTANT ; variable == T_CONSTANT /* Note contrived placement of semi. */
x = 9 + g;
"""
    # This compiled regular expression defines a legal C/++ variable
    # name. Note "^" and "$" guards to make sure the full token is matched.
    varname = re.compile("^[A-Za-z_][A-Za-z0-9_]*$")

    # This is the parsing spec, which describes the kind of expression
    # we're searching for.
    spec = [varname, ("==", "!="), "T_CONSTANT"]

    # The action() function describes what to do when we have a match.
    def action(words):
        if "!=" == words[1]:
            return "!SOME_MACRO(%s)" % words[0]
        else:
            return "SOME_MACRO(%s)" % words[0]

    # Process the corpus line by line, applying the parser to each line.
    for line in corpus.split("\n"):
        print(parse(spec, action, line))

Here's the result if you run it:

/* This is a dummy. */
SOME_MACRO(variable)
!SOME_MACRO(variable) /* Prefix! */
!SOME_MACRO(variable)
SOME_MACRO(variable) /* This is a test. */
!SOME_MACRO(variable) ; SOME_MACRO(variable) /* Note contrived placement of semi. */
x = 9 + g;

Oh well, I had fun! ;-)
Regular expression to match C's multiline preprocessor statements
What I need is to match multiline preprocessor statements such as:

#define max(a,b) \
 ({ typeof (a) _a = (a); \
    typeof (b) _b = (b); \
    _a > _b ? _a : _b; })

The point is to match everything between #define and the last }), but I still can't figure out how to write the regexp. I need it to work in Python, using the "re" module. Could somebody help me please? Thanks
This should do it:

r'(?m)^#define (?:.*\\\r?\n)*.*$'

(?:.*\\\r?\n)* matches zero or more lines ending with backslashes, then .*$ matches the final line.
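Applying this pattern to the max macro from the question (a sketch; the trailing "int x;" line stands in for following code that must not be matched):

```python
import re

src = ('#define max(a,b) \\\n'
       '({ typeof (a) _a = (a); \\\n'
       '   typeof (b) _b = (b); \\\n'
       '   _a > _b ? _a : _b; })\n'
       'int x;\n')

m = re.search(r'(?m)^#define (?:.*\\\r?\n)*.*$', src)
print(m.group(0))  # the match ends at "})", before "int x;"
```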
I think something like this will work:

m = re.compile(r"^#define[\s\S]+?}\)*$", re.MULTILINE)
matches = m.findall(your_string_here)

This assumes that your macros all end with '}', with an optional ')' at the end.
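For macros that do end with '})', this behaves as advertised (a quick sketch on the question's example):

```python
import re

src = ('#define max(a,b) \\\n'
       '({ typeof (a) _a = (a); \\\n'
       '   _a > _b ? _a : _b; })\n'
       'int x;\n')

m = re.compile(r"^#define[\s\S]+?}\)*$", re.MULTILINE)
matches = m.findall(src)
print(matches[0])  # the whole macro, ending at "})"
```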
I think the above solution might not work for:

#define MACRO_ABC(abc, djhg) \
do { \
    int i; \
    /*
     * multi line comment
     */ \
    (int)i; \
} while(0);
how to get the function declaration or definitions using regex
I want to get only function prototypes like:

int my_func(char, int, float)
void my_func1(void)
my_func2()

from C files using regex and Python. Here is my regex format:

".*\(.*|[\r\n]\)\n"
This is a convenient script I wrote for such tasks, but it won't give the function types. It's only for function names and the argument list.

# Extract routine signatures from a C++ module
import re

def loadtxt(filename):
    "Load text file into a string. I let FILE exceptions pass."
    f = open(filename)
    txt = ''.join(f.readlines())
    f.close()
    return txt

# regex: group 1 whole signature, group 2 name, group 3 arguments
rproc = r"((?<=[\s:~])(\w+)\s*\(([\w\s,<>\[\].=&':/*]*?)\)\s*(const)?\s*(?={))"

code = loadtxt('your file name here')
cppwords = ['if', 'while', 'do', 'for', 'switch']
procs = [(i.group(2), i.group(3)) for i in re.finditer(rproc, code)
         if i.group(2) not in cppwords]
for i in procs:
    print(i[0] + '(' + i[1] + ')')
See if your C compiler has an option to output a file of just the prototypes of what it is compiling. For gcc, it's -aux-info FILENAME
I think a regex isn't the best solution in your case. There are many traps, like comments, text in strings, etc., but if your function prototypes share a common style:

type fun_name(args);

then \w+ \w+\(.*\); should work in most cases:

mn> egrep "\w+ \w+\(.*\);" *.h
md5.h:extern bool md5_hash(const void *buff, size_t len, char *hexsum);
md5file.h:int check_md5files(const char *filewithsums, const char *filemd5sum);
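The same pattern can be used from Python instead of egrep (a sketch; the header contents are made up, and the same caveats about comments and strings apply):

```python
import re

# Hypothetical header text; the first two lines follow the
# "type fun_name(args);" style the pattern assumes.
header = """extern bool md5_hash(const void *buff, size_t len, char *hexsum);
int check_md5files(const char *filewithsums, const char *filemd5sum);
#define NOT_A_PROTOTYPE 1
"""

protos = [line for line in header.splitlines()
          if re.search(r"\w+ \w+\(.*\);", line)]
for p in protos:
    print(p)
```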
I think this one should do the work:

r"^\s*[\w_][\w\d_]*\s*.*\s*[\w_][\w\d_]*\s*\(.*\)\s*$"

which expands into:

string begin: ^
any number of whitespaces (including none): \s*
return type:
  - starts with a letter or _: [\w_]
  - continues with any letter, digit or _: [\w\d_]*
any number of whitespaces: \s*
any number of any characters (to allow pointers, arrays and so on; could be replaced with more detailed checking): .*
any number of whitespaces: \s*
function name:
  - starts with a letter or _: [\w_]
  - continues with any letter, digit or _: [\w\d_]*
any number of whitespaces: \s*
open arguments list: \(
arguments (allow none): .*
close arguments list: \)
any number of whitespaces: \s*
string end: $

It's not totally correct for matching all possible combinations, but should work in most cases. If you want it to be more accurate, just let me know.

EDIT: Disclaimer: I'm quite new to both Python and regex, so please be indulgent ;)
There are LOTS of pitfalls in trying to "parse" C code (or extract some information from it, at least) with just regular expressions. I would definitely grab a C grammar for your favourite parser generator (say Bison, or whatever alternative there is for Python; there are C grammars as examples everywhere) and add the actions in the corresponding rules. Also, do not forget to run the C preprocessor on the file before parsing.
I built on Nick Dandoulakis's answer for a similar use case. I wanted to find the definition of the socket function in glibc. This finds a bunch of functions with "socket" in the name, but socket itself was not found, highlighting what many others have said: there are probably better ways to extract this information, like tools provided by compilers.

# find_functions.py
#
# Extract routine signatures from a C++ module
import re
import sys

def loadtxt(filename):
    # Load text file into a string. Ignore FILE exceptions.
    f = open(filename)
    txt = ''.join(f.readlines())
    f.close()
    return txt

# regex: group 1 whole signature, group 2 name, group 3 arguments
rproc = r"((?<=[\s:~])(\w+)\s*\(([\w\s,<>\[\].=&':/*]*?)\)\s*(const)?\s*(?={))"

file = sys.argv[1]
code = loadtxt(file)
cppwords = ['if', 'while', 'do', 'for', 'switch']
procs = [i.group(1) for i in re.finditer(rproc, code)
         if i.group(2) not in cppwords]
for i in procs:
    print(file + ": " + i)

Then:

$ cd glibc
$ find . -name "*.c" -print0 | xargs -0 -n 1 python find_functions.py | grep ':.*socket'
./hurd/hurdsock.c: _hurd_socket_server (int domain, int dead)
./manual/examples/mkfsock.c: make_named_socket (const char *filename)
./manual/examples/mkisock.c: make_socket (uint16_t port)
./nscd/connections.c: close_sockets (void)
./nscd/nscd.c: nscd_open_socket (void)
./nscd/nscd_helper.c: wait_on_socket (int sock, long int usectmo)
./nscd/nscd_helper.c: open_socket (request_type type, const char *key, size_t keylen)
./nscd/nscd_helper.c: __nscd_open_socket (const char *key, size_t keylen, request_type type,
./socket/socket.c: __socket (int domain, int type, int protocol)
./socket/socketpair.c: socketpair (int domain, int type, int protocol, int fds[2])
./sunrpc/key_call.c: key_call_socket (u_long proc, xdrproc_t xdr_arg, char *arg,
./sunrpc/pm_getport.c: __get_socket (struct sockaddr_in *saddr)
./sysdeps/mach/hurd/socket.c: __socket (int domain, int type, int protocol)
./sysdeps/mach/hurd/socketpair.c: __socketpair (int domain, int type, int protocol, int fds[2])
./sysdeps/unix/sysv/linux/socket.c: __socket (int fd, int type, int domain)
./sysdeps/unix/sysv/linux/socketpair.c: __socketpair (int domain, int type, int protocol, int sv[2])

In my case, this and this might help me, except it seems like I will need to read assembly code to reuse the strategy described there.
The regular expression below also covers the definition of destructors and const functions:

^\s*\~{0,1}[\w_][\w\d_]*\s*.*\s*[\w_][\w\d_]*\s*\(.*\)\s*(const){0,1}$
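A quick check of this pattern on a few made-up declarations (a sketch; as with the pattern it extends, it is approximate rather than a real C++ parser):

```python
import re

# MULTILINE so ^ and $ anchor each line of a larger source string.
pattern = re.compile(
    r"^\s*\~{0,1}[\w_][\w\d_]*\s*.*\s*[\w_][\w\d_]*\s*\(.*\)\s*(const){0,1}$",
    re.MULTILINE)

print(bool(pattern.search("int my_func(char c, int i)")))  # True
print(bool(pattern.search("~MyClass()")))                  # True
print(bool(pattern.search("int size() const")))            # True
```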