I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic like" things.
I have a very simple DSL, for example:
ELEMENT TYPE<TYPE> elemName {
TYPE<TYPE> memberName;
}
Where the <TYPE> part is optional and valid only for some types.
Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:
How do I look for tokens that are longer than 1 char?
How do I break up the text in the different parts? For example, after a TYPE I can have a whitespace or a < or a whitespace followed by a <. How do I address that?
Short answer
All your questions boil down to the fact that you are not tokenizing your string before parsing it.
Long answer
The process of parsing is actually split in two distinct parts: lexing and parsing.
Lexing
What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking "How do I break up the text in the different parts?"
You can do it yourself by checking your string against a list of regexps using re, or you can use a well-known library such as PLY. If you are using Python 3, I will be biased toward a lexing-parsing library that I wrote, ComPyl.
So, proceeding with ComPyl, the syntax you are looking for seems to be the following:
from compyl.lexer import Lexer
rules = [
    (r'\s+', None),
    (r'\w+', 'ID'),
    (r'< *\w+ *>', 'TYPE'),  # Will match your <TYPE> token with inner whitespaces
    (r'{', 'L_BRACKET'),
    (r'}', 'R_BRACKET'),
]
lexer = Lexer(rules=rules, line_rule='\n')
# See ComPyl doc to figure how to proceed from here
Notice that the first rule, (r'\s+', None), is what solves your issue with whitespace: it tells the lexer to match any whitespace characters and ignore them. Of course, if you do not want to use a lexing tool, you can simply add a similar rule to your own re-based implementation.
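For illustration, here is a minimal sketch of such a do-it-yourself tokenizer built only on the standard library's re module (the token names and the tokenize helper are just illustrative choices, not part of any library):

import re

TOKEN_SPEC = [
    ('SKIP',      r'\s+'),          # whitespace: matched, then ignored
    ('TYPE_ARG',  r'<\s*\w+\s*>'),  # the optional <TYPE> part
    ('ID',        r'\w+'),          # ELEMENT, TYPE, names...
    ('L_BRACKET', r'\{'),
    ('R_BRACKET', r'\}'),
    ('SEMI',      r';'),
]
MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(text):
    # Scan the input left to right; each match tells us which named group fired.
    for match in MASTER_RE.finditer(text):
        kind = match.lastgroup
        if kind != 'SKIP':
            yield (kind, match.group())

source = """ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}"""
print(list(tokenize(source)))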
Parsing
You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there are a lot of tools that can do that for you (the PLY and ComPyl libraries offer LR(1) parsers, which are more powerful but harder to hand-write; see the difference between LL(1) and LR(1) here).
Simply notice that now that you know how to tokenize your string, the issue of How do I look for tokens that are longer than 1 char? has been solved. You are now parsing, not a stream of characters, but a stream of tokens that encapsulate the matched words.
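To make that concrete, here is a minimal, hand-written recursive descent sketch over such a token stream (the function and dictionary key names are made up; the token tags follow the hand-rolled tokenizer sketched earlier):

def parse_element(tokens):
    pos = 0

    def peek():
        # Tag of the current token, or None at end of input.
        return tokens[pos][0] if pos < len(tokens) else None

    def expect(tag):
        # Consume the current token if it has the expected tag, else fail.
        nonlocal pos
        if peek() != tag:
            raise SyntaxError('expected %s, got %r' % (tag, tokens[pos:pos + 1]))
        value = tokens[pos][1]
        pos += 1
        return value

    def parse_typed_name():
        # TYPE [<TYPE>] name
        base = expect('ID')
        type_arg = expect('TYPE_ARG') if peek() == 'TYPE_ARG' else None
        name = expect('ID')
        return {'type': base, 'type_arg': type_arg, 'name': name}

    expect('ID')                       # the ELEMENT keyword
    header = parse_typed_name()
    expect('L_BRACKET')
    members = []
    while peek() != 'R_BRACKET':
        members.append(parse_typed_name())
        expect('SEMI')
    expect('R_BRACKET')
    return {'element': header, 'members': members}

tokens = [('ID', 'ELEMENT'), ('ID', 'TYPE'), ('TYPE_ARG', '<TYPE>'),
          ('ID', 'elemName'), ('L_BRACKET', '{'),
          ('ID', 'TYPE'), ('TYPE_ARG', '<TYPE>'), ('ID', 'memberName'),
          ('SEMI', ';'), ('R_BRACKET', '}')]
print(parse_element(tokens))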
Olivier's answer regarding lexing/tokenizing and then parsing is helpful.
However, for relatively simple cases, some parsing tools are able to handle your kind of requirements without needing a separate tokenizing step. parsy is one of those. You build up parsers from smaller building blocks - there is good documentation to help.
An example of a parser done with parsy for your kind of grammar is here: http://parsy.readthedocs.io/en/latest/howto/other_examples.html#proto-file-parser .
It is significantly more complex than yours, but shows what is possible. Where whitespace is allowed (but not required), it uses the lexeme utility (defined at the top) to consume optional whitespace.
You may need to tighten up your understanding of where whitespace is necessary and where it is optional, and what kind of whitespace you really mean.
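To give a flavour of that style, here is a rough, untested sketch for the grammar in the question using parsy's regex, string and seq combinators; treat it as a starting point under those assumptions rather than a finished parser:

from parsy import regex, string, seq

whitespace = regex(r'\s*')
lexeme = lambda p: p << whitespace          # a token followed by optional whitespace

ident     = lexeme(regex(r'\w+'))
type_arg  = lexeme(string('<') >> regex(r'\w+') << string('>')).optional()
lbrace    = lexeme(string('{'))
rbrace    = lexeme(string('}'))
semicolon = lexeme(string(';'))

# TYPE [<TYPE>] name ;
member  = seq(ident, type_arg, ident) << semicolon
# ELEMENT TYPE [<TYPE>] name { members }
element = seq(lexeme(string('ELEMENT')) >> ident, type_arg, ident,
              lbrace >> member.many() << rbrace)

source = """ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}"""
print(element.parse(source))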
Related
I would like to generate strings matching my regexes using Python 3. For this I am using the handy library rstr.
My regexes:
^[abc]+.
[a-z]+
My task:
I must find a generic way to create a string that would match both my regexes.
What I cannot do:
Modify both regexes or join them in any way. I consider this an ineffective solution, especially in the case of incompatible regexes:
import re
import rstr
regex1 = re.compile(r'^[abc]+.')
regex2 = re.compile(r'[a-z]+')
for index in range(0, 1000):
    generated_string = rstr.xeger(regex1)
    if re.fullmatch(regex2, generated_string):
        break
else:
    raise Exception('Regexes are probably incompatible.')
print('String matching both regexes is: {}'.format(generated_string))
Is there any workaround or any magical library that can handle this? Any insights appreciated.
Questions which are seemingly similar, but not helpful in any way:
Match a line with multiple regex using Python
The asker already has the string, which they just want to check against multiple regexes in the most elegant way. In my case, I need to generate a string in a smart way that would match both regexes.
If you want a really generic way, you can't really use a brute-force approach.
What you are looking for is to create some kind of representation of the regexps (as rstr does through a call to sre_parse.py) and then call some SMT solver to satisfy both criteria.
For Haskell there is https://github.com/audreyt/regex-genex which uses the Yices SMT solver to do just that, but I doubt there is anything like this for Python. If I were you, I'd bite the bullet and call it as an external program from your Python program.
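If you go that route, calling it from Python is just a subprocess invocation. A rough sketch, assuming the genex executable is on your PATH and accepts the regexes as command-line arguments, printing candidate strings to stdout (check the regex-genex README for its exact interface):

import subprocess

# Assumption: genex takes the patterns as positional arguments and writes
# matching strings to standard output; adjust to the actual CLI if it differs.
result = subprocess.run(
    ['genex', r'^[abc]+.', r'[a-z]+'],
    capture_output=True, text=True, check=True,
)
print(result.stdout)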
I don't know if there is something that can fulfill your needs much more smoothly.
But I would do something like this (as you've already done):
Create a Regex object with the re.compile() function.
Generate String based on 1st regex.
Pass the string you've got into the 2nd regex object using search() method.
If that passes... you're done, the string passed both regexes.
Maybe you can create a function that takes both regexes as parameters and tests them "2 by 2" using the same logic (a sketch of such a helper follows the list below).
And then if you have 8 regexes to match...
Just do:
call (regex1, regex2)
call (regex2, regex3)
call (regex4, regex5)
...
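A minimal sketch of such a helper, reusing rstr from the question (the function name here is made up): generate candidates from the first pattern and return the first one that the second pattern fully matches:

import re
import rstr

def generate_matching_both(pattern_a, pattern_b, attempts=1000):
    # Generate from pattern_a, keep the first candidate accepted by pattern_b.
    compiled_b = re.compile(pattern_b)
    for _ in range(attempts):
        candidate = rstr.xeger(pattern_a)
        if compiled_b.fullmatch(candidate):
            return candidate
    raise ValueError('No string matching both patterns was found.')

print(generate_matching_both(r'^[abc]+.', r'[a-z]+'))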
I solved this using a slightly alternative approach. Notice that the second regex is basically insurance that only lowercase letters are generated in our new string.
I used Google's Python package sre_yield, which allows limiting the charset. The package is also available on PyPI. My code:
import sre_yield
import string
sre_yield.AllStrings(r'^[abc]+.', charset=string.ascii_lowercase)[0]
# returns `aa`
Lots of questions on how to tokenize some string using regexp.
I'm looking however, to tokenize the regexp pattern itself, I'm sure there are some posts on the subject but I cannot find them.
Examples:
^\w$ -> ['^', '\w', '$']
[3-7]* -> ['[3-7]*']
\w+\s\w+ -> ['\w+', '\s', '\w+']
(xyz)*\s[a-zA-Z]+[0-9]? -> ['(xyz)*','\s','[a-zA-Z]+','[0-9]?']
I'm assuming this work is done in Python under the hood when some regexp function is called.
One place to start: the PyPy project has an implementation of Python in (mostly) Python. The re.py in its source distribution calls sre_compile.py:_compile() to do the work. You might be able to hack that to provide the output form you want.
Edit: Also, the JavaScript XRegExp library parses regexes in an extended syntax and renders them to standard syntax. The parser routine may help you.
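If you only need to see how a pattern breaks into units, you can also poke at the internal sre_parse module that CPython uses before compiling; it is version-dependent and undocumented, but a small sketch shows the idea:

import sre_parse  # internal module; deprecated/renamed in newer Python versions

parsed = sre_parse.parse(r'\w+\s\w+')
for opcode, argument in parsed:
    # Each entry is an (opcode, argument) pair, e.g. MAX_REPEAT for '\w+'
    # and IN for the '\s' character class.
    print(opcode, argument)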
I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.
Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)
You can use something like this:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.
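A quick check of that pattern against the sample string from the question; note the named group is still available when the a branch is taken:

import re

m = re.search(r'^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$', 'acaaasda')
print(m.group(0), m.group('pie'))   # prints: acaaasda asda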
I want to search for a string that occurs between certain strings. For example,
\start
\problem{number}
\subproblem{number}
/* strings that I want to get */
\subproblem{number}
/* strings that I want to get */
\problem{number}
\subproblem{number}
...
...
\end
More specifically, I want to get the problem number and subproblem number, and the strings in between, which are the answers.
I came up with an expression like
'(\\problem{(.*?)}\n)? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
but it seems like it doesn't work as I expect. What is wrong with this expression?
This one:
(?:\\problem\{(.*?)\}\n)?\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
returns three matches for me:
Match 1:
group 1: "number"
group 2: "number"
group 3: "/* strings that I want to get */"
Match 2:
group 1: null
group 2: "number"
group 3: "/* strings that I want to get */"
Match 3:
group 1: "number"
group 2: "number"
group 3: " ...\n ..."
However I'd rather parse it in two steps.
First find the problem's number (group 1) and content (group 2) using:
\\problem\{(.*?)\}\n(.+?)\\end
Then find the subproblem's numbers (group 1) and contents (group 2) inside that content using:
\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
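Roughly, in Python, with a tiny made-up document and the step-1 pattern adapted to use a lookahead so that each problem body stops at the next \problem or at \end:

import re

doc = r"""\start
\problem{1}
\subproblem{1a}
answer text 1a
\subproblem{1b}
answer text 1b
\problem{2}
\subproblem{2a}
answer text 2a
\end
"""

# Step 1: each problem's number and body.
problem_re = re.compile(r'\\problem\{(.*?)\}\n(.*?)(?=\\problem\{|\\end)', re.DOTALL)
# Step 2: each subproblem's number and answer inside a problem body.
subproblem_re = re.compile(r'\\subproblem\{(.*?)\}\n+(.*?)\n*(?=\\subproblem|\Z)', re.DOTALL)

for prob_num, body in problem_re.findall(doc):
    for sub_num, answer in subproblem_re.findall(body):
        print(prob_num, sub_num, repr(answer))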
TeX is pretty complicated and I'm not sure how I feel about parsing it using regular expressions.
That said, your regular expression has two issues:
You're using a space character where you should just consume all whitespace
You need to use a lookahead assertion for your final group so that it doesn't get eaten up (because you need to match it at the beginning of the regex the next time around)
Give this a try:
>>> v
'\\start\n\n\\problem{number}\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\problem{number}\n\\subproblem{number}\n ...\n ...\n\\end\n'
>>> re.findall(r'(?:\\problem{(.*?)})?\s*\\subproblem{(.*?)}\s*(.*?)\s*(?=\\problem{|\\subproblem{|\\end)', v, re.DOTALL)
[('number', 'number', '/* strings that I want to get */'), ('', 'number', '/* strings that I want to get */'), ('number', 'number', '...\n ...')]
If the question really is "What is wrong with this expression?", here's the answer:
You're trying to match newlines with a .*?. You need (?s) for that to work.
You have explicit spaces and newlines in the middle of the regex that don't have any corresponding characters in the source text. You need (?x) for that to work.
That may not be all that's wrong with the expression. But just adding (?sx), turning it into a raw string (because I don't trust myself to mix Python quoting and regex quoting properly), and removing the \n gives me this:
r'(?sx)(\\problem{(.*?)}? \\subproblem{(.*?)} (.*?)) (\\problem|\\subproblem|\\end)'
That returns 2 matches instead of 0, and it's probably the smallest change to your regex that works.
However, if the question is "How can I parse this?", rather than "What's wrong with my existing attempt?", I think impl's solution makes more sense (and I also agree with the point about using regex to parse TeX being usually a bad idea), or, even better, doing it in two steps as Regexident does.
If using regex to parse TeX is not a good idea, then what method would you suggest for parsing TeX?
First of all, as a general rule of thumb, if I can't write the regex to solve a problem by myself, I don't want to solve it with a regex, because I'll have a hard time figuring it out a few months from now. Sometimes I break it down into subexpressions, or use (?x) and load it up with comments, but usually I look for another way.
More importantly, if you have a real parser that can consume your language and give you a tree (or whatever's appropriate) that you can walk and search—as with, e.g. etree for XML—then you've got 90% of a solution for every problem you're going to come up with in dealing with that language. A quick&dirty regex (especially one you can't write on your own) only gets you 10% of the way to solving the next problem. And more often than not, if I've got a problem today, I'm going to have more of them in the next few months.
So, what's a good parser for TeX in Python? Honestly, I don't know. I know scipy/matplotlib has something that does it, so I'd probably look there first. Beyond that, check Google, PyPI, and maybe tex.stackexchange.com. The first things that turn up in a search are Texcaller and plasTeX. I have no idea how good they are, or if they're appropriate for your use case, but it shouldn't take long to skim the tutorials and find out.
If it turns out that there's nothing out there, and it comes down to writing something myself with, e.g., pyparsing vs. regexes, then it's a tougher choice. Some languages, it's very easy to define just the subset you care about and leave the rest as giant uninterpreted tokens, in which case a real parser will be just as easy as a regex, so you might as well go that way. Other languages, you have to handle half the syntax before you can do anything useful, so I wouldn't even try. I'd have to put a bit of time into thinking about it and experimenting both ways before deciding which way to go.
I'd like to know if it's a good idea to avoid regex.
Actually, I have avoided it in any case, and some people have been giving me advice that I shouldn't avoid it, since if you know what everything means, like:
[] '|' \A \B \d \D \W \w \S \Z $ * ? ...
it would be easy to read, right? But I feel like by avoiding regex I would have more readable code.
It gets more unreadable when it's bigger, for example validators.py:
email_re = re.compile(
    r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
    r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-\011\013\014\016-\177])*"'  # quoted-string
    r')@(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$', re.IGNORECASE)  # domain
So, I'd like to know: is there a reason not to avoid regex?
No, don't avoid regular expressions. They're actually quite a nifty little tool and will save you a lot of work if you use them wisely.
What you do need to avoid is trying to use it for everything, a malaise that appears to strike those new to regular expressions before they become a little more tempered and a little less enamoured :-)
For example, don't use it to validate email addresses. The way you validate an email address is to send an email to it with a link that the receiver has to click on to complete the "transaction".
There are billions of valid email addresses (according to the RFCs) that have no physical email receiver behind them. The only way to be certain that there is a receiver is to send an email and wait for proof positive that it was received and acted upon.
If I find myself writing a regular expression that's more than, let's say, 60 characters, I step back to see if there's a more readable way. Similarly, if I write a regular expression and come back a week later and can't instantly recognise what it does, I think about replacing it. This particular paragraph consists of my opinions of course, but they've served me well :-)
Regular expressions are a tool. They are perfectly suited to some tasks and not to others. Like any tool, use them when they are the right tool for the job. Don't just avoid them because somebody said they were bad. Learn how to use them and then you can decide for yourself rather than depending on someone else's dogma.
If you choose to use a more general parsing approach, like pyparsing or PLY, you will never require regular expressions (which can only match a small subset of the languages matchable with such general parsers). However, lexers such as the one in PLY are typically built around regular expressions (which are a perfect match for a lexer's needs!), so you will probably have to avoid that (as well as powerful tools such as BeautifulSoup when any "normal" user would be able to keep using and enjoying it by simply passing a regular expression object as the selector, since BeautifulSoup fully supports that) and will have to recode a lot of such existing parsers with your chosen general-purpose parsing package.
Performance may suffer greatly, of course, by using extremely general tools in cases where simpler, highly optimized and concise ones would be a perfect solution -- and the size of your code may "blow up" to being very large in many common cases. But if you don't mind having programs twice as big and twice as slow, and are determined to avoid regular expressions at all costs, you can do that.
On the other hand, if your main concern is with readability (quite an understandable and commendable concern, too), then the re.VERBOSE option, by allowing abundant use of whitespace and comments within the RE's pattern, can really do wonders for that goal without removing any of REs' advantages (except by diluting a sometimes-excessive conciseness;-). You WILL want to also keep at least one general-purpose parsing system under your belt, of course (rather than stretch REs to do tasks they're wrong for, as so many people unfortunately do!) -- but a minimal command of REs will serve you well in so many cases (including, for example, full use of BeautifulSoup and many other tools which can accept REs as parameters to apply them appropriately) that I think it's quite to be recommended.
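As a small illustration of re.VERBOSE (with a made-up phone-number pattern): whitespace and comments inside the pattern are ignored, so a dense expression can be laid out readably:

import re

us_phone = re.compile(r"""
    (\d{3})     # area code
    [-\s.]?     # optional separator
    (\d{3})     # exchange
    [-\s.]?
    (\d{4})     # line number
""", re.VERBOSE)

print(us_phone.match('555-867-5309').groups())  # ('555', '867', '5309')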
Just for some comparison, here is my version of an email format check without regexp (with test cases), and one readable regexp offered to me as an alternative (though sending an email after it is accepted is a great idea):
# -*- coding: utf8 -*-
import string
print("Valid letters in this computer are: " + string.letters)

import re

def validateEmail(a):
    sep = [x for x in a if not (x.isalpha() or
                                x.isdigit() or
                                x in r"!#$%&'*+-/=?^_`{|}~]")]
    sepjoined = ''.join(sep)
    ## sep joined must be ..@.... form
    if len(a) > 255 or sepjoined.strip('.') != '@': return False
    end = a
    for i in sep:
        part, i, end = end.partition(i)
        if len(part) < 2: return False
    return len(end) > 1

def emailval(address):
    pattern = r"[\.\w]{2,}[@]\w+[.]\w+"
    return re.match(pattern, address)

if __name__ == '__main__':
    emails = ["test.@web.com", "test+john@web.museum", "test+john@web.m",
              "a@n.dk", "and.bun@webben.de", "marjaliisa.hämäläinen@hel.fi",
              "marja-liisa.hämäläinen@hel.fi", "marjaliisah@hel.", 'tony@localhost',
              '1234@23.45', 'me@somewhere']
    print('\n\t'.join(["Valid emails are:"] +
                      filter(validateEmail, emails)))
    print('\n\t'.join(["Regexp gives wrong answer:"] +
                      filter(emailval, emails)))
""" Output:
Valid letters in this computer are: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Valid emails are:
test+john@web.museum
and.bun@webben.de
tony@localhost
1234@23.45
me@somewhere
Regexp gives wrong answer:
test.@web.com
and.bun@webben.de
1234@23.45
"""
EDIT: Cleaned up the regex filter function from this ancient code and edited it for a more permissive version based on @detly's link. It is good enough for me as a first check when filling a form, before sending the confirmation email. Finally, I put in the 255-character length limit check mentioned in the comments.
This code purposely does not accept the normal a@b as a valid email address, but does accept me@somewhere. Another thing is that it depends on what isalpha returns. So this output, which is from Ideone.com, has not accepted the Scandinavian öä even though they are valid nowadays. When run on my home computer, those are accepted. This happens even when the coding line is there.
(Deleted a regular expression which purported to be an "official" one but is in fact not found in the RFC it claimed to be from.)
This regex may be amusing as it is an attempt to precisely match the e-mail address grammar provided in an older version of the Internet mail standards.
Regular expressions are likely the right tool for extracting/validating email addresses...
To extract one or more email addresses from raw text:
import re

pat_e = re.compile(r'(?P<email>[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,})')

def extract_emails(text):
    # Hypothetical wrapper function so the snippet runs as-is.
    emails = []
    for r in pat_e.finditer(text):
        emails.append(r.group('email'))
    return emails
To see if a single piece of text is a valid email:
import re

pat_m = re.compile(r'([\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$)')

def is_valid_email(text):
    # Hypothetical wrapper function so the snippet runs as-is.
    if pat_m.match(text):
        return True
    return False