I have a file produced by strace which contains all the system calls. Now I want to get the name of all system calls. Therefore, say if I have mprotect listed 4 times, I only need to list it 1 time, that is I only need to list unique system calls.
One method that comes to mind is to use regular expressions in Python (or any other language that supports them) to first extract all the system calls and then eliminate the duplicates. To that end, I was first testing my regular expression using the search feature of Notepad++. I want to match anything of the form blah(, so I devised the following regular expression:
[a-zA-Z_](
but Notepad++ found nothing. What do you think is the correct regular expression for this?
Why do you think you need regular expressions for this? The output of strace is a sequence of lines, each starting with
<c_identifier>(
and C identifiers can't contain (, so you can just take the part up to the ( to get the name of each system call. In Python, this one-liner computes the set of distinct system calls:
syscalls = set(ln.split('(', 1)[0] for ln in strace_output)
(You can do this in one line of Awk as well, if you'd rather work in the shell than in Python.)
Notepad++ should have told you that the regular expression is invalid. The latest version does.
In regular expressions, parentheses have special meaning, so you have to escape them:
[a-zA-Z_]\(
will find h( in blah(, since the part in the brackets isn't quantified (as @CharlesDuffy pointed out).
To match the entire blah(, use
[a-zA-Z_]+\(
It should be [a-zA-Z_]+\( instead, because round brackets are metacharacters in regular expressions.
First off: Yes, I know that trying to accomplish this purely with regular expressions is foolish, but I need to do this within the context of Carbon Rewrite Rules which are essentially Python regular expressions, eg:
^collectd\.([a-z0-9]+)\. = \1.system.
I'm trying to migrate our monitoring systems from a Nagios-based system to one based on Collectd. However, collectd's write_graphite plugin is hard-coded to produce metrics named $prefix.host_example_com.$metric and our existing metrics are stored as $prefix.com.example.host.$metric.
Note: The hostnames do not have a fixed number of sections, they might be bar.foo, baz.bar.foo, bif.baz.bar.foo, etc.
So basically it seems to boil down to accomplishing this within a single re.sub() call.
So far I've got:
metric = 'collectd.foo_bar_baz.some.metric'
pattern = r'^collectd\.(?:([^_.]+)_?)+(.*)$'
print re.sub(pattern, r'\1 \2', metric)
This outputs baz .some.metric, and I can't even get it to repeat the capture group, let alone have the first idea about how to reverse and join an arbitrary number of backreferences.
Is such a thing even possible in a single re.sub() call or should I just resign myself to a fate of terribly named/organized metrics and queries full of wildcards?
I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.
Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
You can use something like this:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.
I am converting some MATLAB code to C. Currently I have some lines with powers using ^, which are rather easy to handle with something along the lines of \(?(\w*)\)?\^\(?(\w*)\)?
This works fine for converting (glambda)^(galpha), using the sub routine in Python: pattern.sub(r'pow(\g<1>,\g<2>)', '(glambda)^(galpha)')
My problem comes with nested parenthesis
So I have a string like:
glambdastar^(1-(1-gphi)*galpha)*(glambdaq)^(-(1-gphi)*galpha);
And I can not figure out how to convert that line to:
pow(glambdastar,(1-(1-gphi)*galpha))*pow(glambdaq,-(1-gphi)*galpha);
Unfortunately, regular expressions aren't the right tool for handling nested structures. Some regular expression engines (such as .NET's) have some support for recursion, but most, including Python's re, do not, and can only handle as many levels of nesting as you build into the expression (which gets ugly fast).
What you really need for this is a simple parser. For example, iterate over the string counting parentheses and storing their locations in a list. When you find a ^ character, put the most recently closed parenthesis group into a "left" variable, then watch the group formed by the next opening parenthesis. When it closes, use it as the "right" value and print the pow(left, right) expression.
I think you can use recursion here.
Once you figure out the Left and Right parts, pass each of those to your function again.
The base case would be that no ^ operator is found, so you will not need to add the pow() function to your result string.
The function will return a string with all the correct pow()'s in place.
I'll come up with an example of this if you want.
Nested parentheses cannot be described by a regexp and require a full parser (one able to understand a grammar, which is something more powerful than a regexp). I do not think there is a pure-regexp solution.
See recent discussion function-parser-with-regex-in-python (one of many similar discussions). Then follow the suggestion to pyparsing.
An alternative would be to iterate until all ^ have been exhausted, no?
Ruby code:
# assuming str contains the string of data with the expressions you wish to convert
while str.include?('^')
  # beware: this loops forever if a remaining '^' is not between \w characters
  str.gsub!(/(\w+)\^(\w+)/, 'pow(\1,\2)')
end
Is there any way to beat the 100-group limit for regular expressions in Python? Also, could someone explain why there is a limit.
There is a limit because it would take too much memory to store the complete state machine efficiently. I'd say that if you have more than 100 groups in your re, something is wrong either in the re itself or in the way you are using them. Maybe you need to split the input and work on smaller chunks or something.
I found the easiest way was to
import regex as re
instead of
import re
The regex module doesn't have re's 100-group limit. (Its default _MAXCACHE of 500, versus 100 for re, is a separate setting that controls the compiled-pattern cache, not the number of groups.) This is one of the many reasons I find regex to be a better module than re.
If I'm not mistaken, the "new" regex module (currently third-party, but intended to eventually replace the re module in the stdlib) does not have this limit, so you might give that a try.
I'm not sure what you're doing exactly, but try using a single group, with a lot of OR clauses inside... so (this)|(that) becomes (this|that). You can do clever things with the results by passing a function that does something with the particular word that is matched:
newContents, num = cregex.subn(lambda m: replacements[m.group(0)], contents)
If you really need so many groups, you'll probably have to do it in stages... one pass for a dozen big groups, then another pass inside each of those groups for all the details you want.
I doubt you really need to process 100 named groups in subsequent code or use them in a replacement; it would be quite impractical. If you just need groups to express rich conditions in the regexp, you can use non-capturing groups:
(?:word1|word2)(?:word3|word4)
and so on. Complex scenarios, including nested groups, are possible.
There is no limit on the number of non-capturing groups.
First, as others have said, there are probably good alternatives to using 100 groups. The re.findall method might be a useful place to start. If you really need more than 100 groups, the only workaround I see is to modify the core Python code.
In [python-install-dir]/lib/sre_compile.py simply modify the compile() function by removing the following lines:
# in lib/sre_compile.py
if pattern.groups > 100:
    raise AssertionError(
        "sorry, but this version only supports 100 named groups"
        )
For a slightly more flexible version, just define a constant at the top of the sre_compile module, and have the above line compare to that constant instead of 100.
Funnily enough, in the (Python 2.5) source there is a comment indicating that the 100 group limit is scheduled to be removed in future versions.
I've found that Python 3 doesn't have this limitation, whereas the same code run in the latest 2.7 raises this error.
When I ran into this, I had a really complex pattern that was actually composed of a bunch of high-level patterns joined by ORs, like this:
pattern_string = u"pattern1|" \
                 u"pattern2|" \
                 u"patternN"
pattern = re.compile(pattern_string, re.UNICODE)
for match in pattern.finditer(string_to_search):
    pass  # Extract data from the groups in the match.
As a workaround, I turned the pattern into a list and I used that list as follows:
pattern_strings = [
    u"pattern1",
    u"pattern2",
    u"patternN",
]
patterns = [re.compile(pattern_string, re.UNICODE) for pattern_string in pattern_strings]
for pattern in patterns:
    for match in pattern.finditer(string_to_search):
        pass  # Extract data from the groups in the match.
    string_to_search = pattern.sub(u"", string_to_search)
I would say you could reduce the number of groups by using non-capturing groups, but whatever it is that you're doing seems like you want all these groupings.
In my case, I have a dictionary of n words and want to create a single regex that matches all of them, i.e. if my dictionary is
hello
goodbye
my regex would be: (^|\s)hello($|\s)|(^|\s)goodbye($|\s) ... it's the only way to do it, and it works fine on small dictionaries, but when you have more than 50 words, well...
It's very easy to resolve this error:
Open the re module and you'll see the constant _MAXCACHE = 100.
Change the value to 1000, for example, and do a test. (Note, though, that _MAXCACHE controls the size of the compiled-pattern cache, not the number of groups a pattern may contain.)
I am trying to convert the following Perl regex I found in the Video::Filename Perl module to a Python 2.5.4 regex to parse a filename
# Perl > v5.10
re => '^(?:(?<name>.*?)[\/\s._-]*)?(?<openb>\[)?(?<season>\d{1,2})[x\/](?<episode>\d{1,2})(?:-(?:\k<season>x)?(?<endep>\d{1,2}))?(?(<openb>)\])(?:[\s._-]*(?<epname>[^\/]+?))?$',
I would like to use named groups too, and I know in Python the regex extension for named groups is different, but I am not 100% sure on the syntax.
This is what I tried:
# Python (not working)
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:\kP<season>x)?(?P<endep>\d{1,2}))?(?(P<openb>)\])(?:[\s._-]*(?P<epname>[^\/]+?))?$')
The error I get:
raise error, v # invalid expression
sre_constants.error: bad character in group name
For example, this one I managed to convert and it works. But the one above I can't seem to get right. I get a compilation error in Python.
# Perl:
re => '^(?:(?<name>.*?)[\/\s._-]+)?(?:s|se|season|series)[\s._-]?(?<season>\d{1,2})[x\/\s._-]*(?:e|ep|episode|[\/\s._-]+)[\s._-]?(?<episode>\d{1,2})(?:-?(?:(?:e|ep)[\s._]*)?(?<endep>\d{1,2}))?(?:[\s._]?(?:p|part)[\s._]?(?<part>\d+))?(?<subep>[a-z])?(?:[\/\s._-]*(?<epname>[^\/]+?))?$',
# Python (working):
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]+)?(?:s|se|season|series)[\s._-]?(?P<season>\d{1,2})[x\/\s._-]*(?:e|ep|episode|[\/\s._-]+)[\s._-]?(?P<episode>\d{1,2})(?:-?(?:(?:e|ep)[\s._]*)?(?P<endep>\d{1,2}))?(?:[\s._]?(?:p|part)[\s._]?(?P<part>\d+))?(?P<subep>[a-z])?(?:[\/\s._-]*(?P<epname>[^\/]+?))?$')
I am not sure where to start looking.
There are 2 problems with your translation. First of all, the second mention of openb is a conditional expression, and in Python the condition must reference the group name bare, as (?(openb)...), rather than (?(P<openb>)...).
Next, you didn't translate the \k<season> backreference; Python uses (?P=season) to match the same group again. The following compiles for me:
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:(?P=season)x)?(?P<endep>\d{1,2}))?(?(openb)\])(?:[\s._-]*(?P<epname>[^\/]+?))?$')
If I were you, though, I'd use re.VERBOSE to split this expression over multiple lines and add copious documentation, so you can keep understanding the expression in the future if it needs to remain maintainable.
(edited after realising the second openb reference was a conditional expression, and to properly translate the backreference).
I found the offending part but can't figure out what exactly is wrong without wrapping my mind around the whole thing.
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:\kP<season>x)?(?P<endep>\d{1,2}))?
(?(P<openb>)\])  # this part here causes the error message
(?:[\s._-]*(?P<epname>[^\/]+?))?$')
The problem seems to be that group names in Python must be valid Python identifiers (check the documentation). The parentheses seem to be the problem. Removing them gives
(?(P<openb>)\])  # with parentheses
(?P<openb>\])    # without parentheses
which instead fails with: redefinition of group name 'openb' as group 6; was group 2
Those regexps are the product of a sick and twisted mind... :-)
Anyway, (?()) are conditions in both Python and Perl, and the Perl syntax above looks like it should be the same as the Python syntax, i.e., it evaluates as true if the named group exists.
Where to start looking? The documentation for the modules are here:
http://docs.python.org/library/re.html
http://www.perl.com/doc/manual/html/pod/perlre.html
I may be wrong, but you tried to get the backreference using:
(?:\k<season>x)
Isn't the syntax \g<name> in Python?