For a certain project of mine I need to parse enum typedefs from a .h file.
For example, let's take the following simple case:
typedef enum
{
data1, /*aaagege*/
data2,
data3
}ESample;
This is a very simple declaration (without assignments or anything more complex), and yet the regular expression that I wrote seems to perform very poorly.
Here is my expression:
typedef\s+enum\s*\{(?:\s+(\w+)[^\n]*)+\s*\}(\w+)\s*;
I've tested the expression on one of my files (about 2000 lines of code) and it took ages.
The first thing that I tried to do is to make everything possible not greedy like so:
typedef\s+?enum\s*?\{(?:\s+?(\w+?)[^\n]*?)+?\s*?\}(\w+?)\s*?;
But that only made things worse.
Any suggestions as to how I can make this perform better? If you could add an explanation of your suggested solution and why it is better than mine, it would help me a lot.
Thanks in advance,
Kfir
The reason it's slow is because of your nested repeats (marked with ^):
(?:\s+(\w+)[^\n]*)+
     ^            ^
This causes nested backtracking, which leads to exponential running times.
But you have a larger problem which is that putting a group inside a repeat means that only the last match of the group is kept:
>>> print m.groups()
('data3', 'ESample')
You can't parse C with a regex:
// w00t /* "testing */ "strings n comments \"here"//
printf("/* haha gotcha\" epic stuff") /* "more text // */;
/* typedef test {
val,
"string",
*/ typedef test ??<
val,
"commentstring/*\"//",
??>
But if you just want a quick hack to parse all the typedefs:
typedef\s+enum\s*{[^}]*}[^;]+;
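If you go that route, here is a rough usage sketch (not a real C parser; header_text is just an assumed variable holding the contents of the .h file):
import re

for block in re.findall(r'typedef\s+enum\s*{[^}]*}[^;]+;', header_text):
    body = block[block.index('{') + 1:block.index('}')]
    name = block[block.index('}') + 1:].strip(' \t\n;')
    members = []
    for line in body.splitlines():
        m = re.match(r'\s*(\w+)', line)  # first identifier on the line; trailing comments are ignored
        if m:
            members.append(m.group(1))
    print(name, members)  # e.g. ESample ['data1', 'data2', 'data3']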
The first thing that I tried to do is to make everything possible not greedy... But that only made things worse.
Of course it did! How couldn't it? Look at this regex:
\w+\s
It will (greedily) eat up all the word characters, and when those are out, it will look for a space character. Now consider:
\w+?\s
This eats up one word character, then checks for a space. Failing that, it eats another word character and checks for a space. It checks every word character to see if it's a space.
Generally, non-greedy is slower than greedy because it has to check the same characters twice. Sometimes, non-greedy produces different results, but when it doesn't, always use greedy. In fact, Perl has possessive quantifiers:
\w++\s
Which means "be greedy, and if that fails to match, don't bother giving any characters back, because you're too greedy." The example above still works fine (and may even be faster), but the behavior is easier to see with this:
\w++h
That example will always fail, because any "h" character at the end of a word will get permanently eaten up by \w++, whereas if it was just \w+ it'd get eaten up, but then given back once the match failed once to see if it would succeed.
Unfortunately Python doesn't have the possessive form to my knowledge (though in the comments, @tchrist suggests an alternative Python regex library), so the first example is about as fast as I suspect you'll get. You might also find a speedup by searching for occurrences of the string "enum" and working from there instead of using a single giant regex to search through an entire file.
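The alternative library mentioned above is presumably the third-party regex module on PyPI (an assumption on my part); it does support possessive quantifiers, so the \w++h example can actually be tried out:
import regex   # pip install regex; this is the third-party module, not the stdlib re

print(regex.search(r'\w+h',  'breadth'))   # matches 'breadth': backtracking gives the final 'h' back
print(regex.search(r'\w++h', 'breadth'))   # None: \w++ never gives the 'h' back, so the match fails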
I'm trying to write an iterative LL(k) parser, and I've gotten strings down pretty well, because they have a start and end token, and so you can just "".join(tokenlist[string_start:string_end]).
Numbers, however, do not, and only consist of .0123456789. They can occur at any given point in a program, have any arbitrary length and are delimited purely by non-numerals.
Some examples, because that definition is pretty vague:
56 123.45/! is 56 and 123.45 followed by two other tokens
565.5345.345 % is 565.5345, 0.345 and two other tokens (incl. whitespace)
The problem I'm trying to solve is how the parser should figure out where a numeric literal ends. (Note that this is a context-free, self-modifying interpretive grammar, so there is no separate lexical analysis to be done.)
I could and have solved this with iteration:
def _next_notinst(self, atindex, subs = DIGITS):
    """return the next index of a char not in subs"""
    for i, e in enumerate(self.toklist[atindex:]):
        if e not in subs:
            return i - len(self.toklist)
        else:
            break
    return self.idx.v
(I don't think I need to clarify the variables, since it's an example and extremely straightforward.)
Great! That works, but there are at least two issues:
It's O(n) for a number with digit-length n. Not ideal.*
The parser class of which this method is a member is already using a while True: to cycle over arbitrary parts of the string, and I would prefer not having remotely nested loops when I don't need to.
From the previous bullet: since the parser uses arbitrary k lookahead and skipahead, parsing each individual token is absolutely not what I want.
I don't want to use RegEx, mostly because I don't know it, and using it for this right now would make my code incomprehensible to me, its creator.
There must be a simple, < O(n) solution to this, that simply collects the contiguous numerals in a string given a starting point, up until a non-numeral.
*Yes, I'm fully aware the parser itself is O(n), but we don't also need the number catenator to be > O(n). If you don't believe me, the string catenator is O(1) because it simply looks for the next unescaped " in the program and then joins all the chars up to that. Can't I do the same thing for numbers?
My other answer was actually erroneous due to lack of testing.
I decided to suck it up and learn a little bit of RegEx just because it's the only other way to solve this.
^([.\d]+[.\d]+|[.\d]) works for what I want, and matches these:
123.43.453""
.234234!/%
but not, for example:
"1233
I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.
Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
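A quick sanity check of the composed pattern with some made-up inputs:
import re

print(bool(re.search(final_regex, 'acaaasda')))  # True: 'asda' is present, so the 'a' branch matches
print(bool(re.search(final_regex, 'bcxyz')))     # True: no 'asda', so the 'b' branch matches
print(bool(re.search(final_regex, 'acxyz')))     # False: starts with 'a' but 'asda' never appears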
You can use something like this:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.
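For comparison, the same made-up inputs against this lookahead version; note that when the b branch matches, the <pie> group simply stays unset:
import re

pattern = r'^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$'
for s in ('acaaasda', 'bcxyz', 'acxyz'):
    m = re.search(pattern, s)
    print(s, '->', 'no match' if m is None else m.group('pie'))
# acaaasda -> asda
# bcxyz -> None
# acxyz -> no match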
I want to search for strings that occur between certain strings. For example,
\start
\problem{number}
\subproblem{number}
/* strings that I want to get */
\subproblem{number}
/* strings that I want to get */
\problem{number}
\subproblem{number}
...
...
\end
More specifically, I want to get the problem number and subproblem number, and the strings in between, which are the answers.
I came up with an expression like
'(\\problem{(.*?)}\n)? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
but it seems like it doesn't work as I expect. What is wrong with this expression?
This one:
(?:\\problem\{(.*?)\}\n)?\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
returns three matches for me:
Match 1:
group 1: "number"
group 2: "number"
group 3: "/* strings that I want to get */"
Match 2:
group 1: null
group 2: "number"
group 3: "/* strings that I want to get */"
Match 3:
group 1: "number"
group 2: "number"
group 3: " ...\n ..."
However I'd rather parse it in two steps.
First find the problem's number (group 1) and content (group 2) using:
\\problem\{(.*?)\}\n(.+?)\\end
Then find the subproblem's numbers (group 1) and contents (group 2) inside that content using:
\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
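In Python that two-step approach could look roughly like this (a sketch only: doc is assumed to hold the input text, and the problem-level pattern here uses a lookahead terminator instead of \end so that consecutive \problem blocks are both captured; \end is appended as a sentinel for the inner pattern's lookahead):
import re

problem_re = re.compile(r'\\problem\{(.*?)\}\n(.+?)(?=\\problem\{|\\end)', re.DOTALL)
subproblem_re = re.compile(r'\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)',
                           re.DOTALL)

for prob_number, prob_body in problem_re.findall(doc):
    for sub_number, answer in subproblem_re.findall(prob_body + '\\end'):
        print(prob_number, sub_number, repr(answer))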
TeX is pretty complicated and I'm not sure how I feel about parsing it using regular expressions.
That said, your regular expression has two issues:
You're using a space character where you should just consume all whitespace
You need to use a lookahead assertion for your final group so that it doesn't get eaten up (because you need to match it at the beginning of the regex the next time around)
Give this a try:
>>> v
'\\start\n\n\\problem{number}\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\problem{number}\n\\subproblem{number}\n ...\n ...\n\\end\n'
>>> re.findall(r'(?:\\problem{(.*?)})?\s*\\subproblem{(.*?)}\s*(.*?)\s*(?=\\problem{|\\subproblem{|\\end)', v, re.DOTALL)
[('number', 'number', '/* strings that I want to get */'), ('', 'number', '/* strings that I want to get */'), ('number', 'number', '...\n ...')]
If the question really is "What is wrong with this expression?", here's the answer:
You're trying to match newlines with a .*?. You need (?s) for that to work.
You have explicit spaces and newlines in the middle of the regex that don't have any corresponding characters in the source text. You need (?x) for that to work.
That may not be all that's wrong with the expression. But just adding (?sx), turning it into a raw string (because I don't trust myself to mix Python quoting and regex quoting properly), and removing the \n gives me this:
r'(?sx)(\\problem{(.*?)})? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
That returns 2 matches instead of 0, and it's probably the smallest change to your regex that works.
However, if the question is "How can I parse this?", rather than "What's wrong with my existing attempt?", I think impl's solution makes more sense (and I also agree with the point about using regex to parse TeX usually being a bad idea), or, even better, doing it in two steps as Regexident does.
If using regex to parse TeX is not a good idea, then what method would you suggest for parsing TeX?
First of all, as a general rule of thumb, if I can't write the regex to solve a problem by myself, I don't want to solve it with a regex, because I'll have a hard time figuring it out a few months from now. Sometimes I break it down into subexpressions, or use (?x) and load it up with comments, but usually I look for another way.
More importantly, if you have a real parser that can consume your language and give you a tree (or whatever's appropriate) that you can walk and search—as with, e.g. etree for XML—then you've got 90% of a solution for every problem you're going to come up with in dealing with that language. A quick&dirty regex (especially one you can't write on your own) only gets you 10% of the way to solving the next problem. And more often than not, if I've got a problem today, I'm going to have more of them in the next few months.
So, what's a good parser for TeX in Python? Honestly, I don't know. I know scipy/matplotlib has something that does it, so I'd probably look there first. Beyond that, check Google, PyPI, and maybe tex.stackexchange.com. The first things that turn up in a search are Texcaller and plasTeX. I have no idea how good they are, or if they're appropriate for your use case, but it shouldn't take long to skim the tutorials and find out.
If it turns out that there's nothing out there, and it comes down to writing something myself with, e.g., pyparsing vs. regexes, then it's a tougher choice. Some languages, it's very easy to define just the subset you care about and leave the rest as giant uninterpreted tokens, in which case a real parser will be just as easy as a regex, so you might as well go that way. Other languages, you have to handle half the syntax before you can do anything useful, so I wouldn't even try. I'd have to put a bit of time into thinking about it and experimenting both ways before deciding which way to go.
I wondered how you would go about tokenizing strings in English (or other Western languages) if whitespace were removed.
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'.
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? Not asking for a completed answer, just more how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
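For a flavour of how that can look, here is a minimal sketch (the tiny word set is only a stand-in for a real English dictionary, and this version prefers the longest dictionary hit at each position):
english_words = {"like", "we", "said", "ass", "as", "shit", "hit"}

def greedy_segment(text):
    """Split a whitespace-free string by greedily taking the longest known word."""
    words, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):      # try the longest candidate first
            if text[i:j].lower() in english_words:
                match = text[i:j]
                break
        if match is None:                      # no dictionary word starts here
            match = text[i]                    # fall back to a single character
        words.append(match)
        i += len(match)
    return words

print(greedy_segment("likewesaid"))  # ['like', 'we', 'said']
print(greedy_segment("asshit"))      # ['ass', 'hit'], the ambiguity mentioned above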
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other -- getting a list of pair and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
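For a flavour of the statistical route, here is a much-simplified sketch that scores candidate splits with unigram frequencies (a stand-in for the pair/triple tables described above) and keeps the most probable segmentation:
import math
from functools import lru_cache

word_counts = {"like": 400, "we": 900, "said": 300}   # toy counts standing in for corpus statistics
total = sum(word_counts.values())

def most_probable_split(text, max_len=15):
    """Return the segmentation that maximises the product of word probabilities."""
    @lru_cache(maxsize=None)
    def best(start):
        if start == len(text):
            return 0.0, ()
        options = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            prob = word_counts.get(word, 0.5) / total   # unseen words get a tiny probability
            tail_score, tail = best(end)
            options.append((math.log(prob) + tail_score, (word,) + tail))
        return max(options)
    return list(best(0)[1])

print(most_probable_split("likewesaid"))  # ['like', 'we', 'said'] with these toy counts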
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave. It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution.
textstring = ('"likewesaid, we\'lldowhatwecan. Trytoreconnectyou, towhatyouwant," '
              'said the Sheep Man. "Butwecan\'tdoit-alone. Yougottaworktoo."')

indiv_characters = list(textstring)  # splits the string into individual characters

teststring = ''
sequential_indiv_word_list = []

for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # Test teststring here against an English dictionary (e.g. via an API that
    # returns True/False depending on whether it exists as an entry).
    if in_english_dict:
        sequential_indiv_word_list.append(teststring)
        teststring = ''

# At the end, just assemble a sentence from the pieces of sequential_indiv_word_list
# by putting a space between each word.
There are some more issues to be worked out. For instance, if a substring never matches a dictionary entry, this would obviously not work, since it would just keep adding characters forever; however, since your demo string had some spaces, you could have it recognize those too and automatically start over at each of them.
Also, you need to account for punctuation; write conditionals like
if cur_char == ',' or cur_char == '.':
    # do action to start a new "word" automatically
I was recently bitten by a subtle bug.
const char * int2str[] = {
"zero", // 0
"one", // 1
"two" // 2
"three",// 3
nullptr };
assert( int2str[1] == std::string("one") ); // passes
assert( int2str[2] == std::string("two") ); // fails
If you have godlike code review powers you'll notice I forgot the , after "two".
After the considerable effort to find that bug I've got to ask why would anyone ever want this behavior?
I can see how this might be useful for macro magic, but then why is this a "feature" in a modern language like python?
Have you ever used string literal concatenation in production code?
Sure, it's the easy way to make your code look good:
char *someGlobalString = "very long "
"so broken "
"onto multiple "
"lines";
The best reason, though, is for weird printf formats, like type forcing:
uint64_t num = 5;
printf("Here is a number: %"PRIX64", what do you think of that?", num);
There are a bunch of those defined in <inttypes.h>, and they can come in handy if you have type size requirements. A few examples:
PRIo8 PRIoLEAST16 PRIoFAST32 PRIoMAX PRIoPTR
It's a great feature that allows you to combine preprocessor strings with your strings.
// Here we define the correct printf modifier for time_t
#ifdef TIME_T_LONG
#define TIME_T_MOD "l"
#elif defined(TIME_T_LONG_LONG)
#define TIME_T_MOD "ll"
#else
#define TIME_T_MOD ""
#endif
// And here we merge the modifier into the rest of our format string
printf("time is %" TIME_T_MOD "u\n", time(0));
I see several C and C++ answers, but none of them really answer why, or what the rationale for this feature actually was. In C++ this feature comes from C, and we can find the rationale by going to Rationale for International Standard—Programming Languages—C, section 6.4.5 String literals, which says (emphasis mine):
A string can be continued across multiple lines by using the backslash–newline line continuation, but this requires that the continuation of the string start in the first position of the next line. To permit more flexible layout, and to solve some preprocessing problems (see §6.10.3), the C89 Committee introduced string literal concatenation. Two string literals in a row are pasted together, with no null character in the middle, to make one combined string literal. This addition to the C language allows a programmer to extend a string literal beyond the end of a physical line without having to use the backslash–newline mechanism and thereby destroying the indentation scheme of the program. An explicit concatenation operator was not introduced because the concatenation is a lexical construct rather than a run-time operation.
Python seems to have the same reason: it reduces the need for ugly \ continuations in long string literals. This is covered in section 2.4.2, String literal concatenation, of The Python Language Reference.
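As a quick Python illustration, both spellings below produce the same single string, but the second avoids the trailing backslash:
with_backslash = "a rather long literal \
that continues on the next line"

with_concatenation = ("a rather long literal "
                      "that continues on the next line")

assert with_backslash == with_concatenation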
Cases where this can be useful:
Generating strings including components defined by the preprocessor (this is perhaps the largest use case in C, and it's one I see very, very frequently).
Splitting string constants over multiple lines
To provide a more concrete example for the former:
// in version.h
#define MYPROG_NAME "FOO"
#define MYPROG_VERSION "0.1.2"
// in main.c
puts("Welcome to " MYPROG_NAME " version " MYPROG_VERSION ".");
I'm not sure about other programming languages, but for example C# doesn't allow you to do this (and I think this is a good thing). As far as I can tell, most of the examples that show why this is useful in C++ would still work if you could use some special operator for string concatenation:
string someGlobalString = "very long " +
"so broken " +
"onto multiple " +
"lines";
This may not be as comfortable, but it is certainly safer. In your motivating example, the code would be invalid unless you added either , to separate elements or + to concatenate strings...
From the python lexical analysis reference, section 2.4.2:
This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even to add comments to parts of strings
http://docs.python.org/reference/lexical_analysis.html
For rationale, expanding and simplifying Shafik Yaghmour’s answer: string literal concatenation originated in C (hence inherited by C++), as did the term, for two reasons (references are from Rationale for the ANSI C Programming Language):
For formatting: to allow long string literals to span multiple lines with proper indentation – in contrast to line continuation, which destroys the indentation scheme (3.1.4 String literals); and
For macro magic: to allow the construction of string literals by macros (via stringizing) (3.8.3.2 The # operator).
It is included in the modern languages Python and D because they copied it from C, though in both of these it has been proposed for deprecation, as it is bug-prone (as you note) and unnecessary (since one can just have a concatenation operator and constant folding for compile-time evaluation; you can’t do this in C because strings are pointers, and so you can’t add them).
It’s not simple to remove because that breaks compatibility, and because replacing it with an operator requires care about precedence (implicit concatenation happens during lexing, prior to operators); hence it’s still present.
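The "during lexing" part is easy to see in CPython: the compiled function below ends up with a single string constant rather than any run-time concatenation (a quick check with the standard dis module):
import dis

def implicit():
    return "spam " "eggs"   # adjacent literals are joined before bytecode is generated

dis.dis(implicit)   # the disassembly shows a single constant 'spam eggs'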
Yes, it is used in production code. The Google Python Style Guide: Line length specifies:
When a literal string won't fit on a single line, use parentheses for implicit line joining.
x = ('This will build a very long long '
'long long long long long long string')
See “String literal concatenation” at Wikipedia for more details and references.
So that you can split long string literals across lines.
And yes, I've seen it in production code.
While people have taken the words out of my mouth about the practical uses of the feature, nobody has so far tried to defend the choice of syntax.
For all I know, the typo that can slip through as a result was probably just overlooked. After all, it seems robustness against typos wasn't at the front of Dennis's mind, as shown further by:
if (a = b);
{
printf("%d", a);
}
Furthermore, there's the possible view that it wasn't worth using up an extra symbol for concatenation of string literals - after all, there isn't much else that can be done with two of them, and having a symbol there might create temptation to try to use it for runtime string concatenation, which is above the level of C's built-in features.
Some modern, higher-level languages based on C syntax have discarded this notation presumably because it is typo-prone. But these languages have an operator for string concatenation, such as + (JS, C#), . (Perl, PHP), ~ (D, though this has also kept C's juxtaposition syntax), and constant folding (in compiled languages, anyway) means that there is no runtime performance overhead.
Another sneaky error I've seen in the wild is people presuming that two single quotes are a way to escape the quote (as it is commonly used for double quotes in CSV files, for example), so they'll write things like the following in python:
print('Beggars can''t be choosers')
which outputs Beggars cant be choosers instead of the Beggars can't be choosers the coder desired.
As for the original "why" question (why is this a "feature" in a modern language like python?): in my opinion, I concur with the OP that it shouldn't be.