I am writing a Pyparsing grammar to convert Creole markup to HTML. I'm stuck because there's a bit of conflict trying to parse these two constructs:
Image link: {{image.jpg|title}}
Ignore formatting: {{{text}}}
The way I'm parsing the image link is as follows (note that this converts perfectly fine):
def parse_image(s, l, t):
    try:
        link, title = t[0].split("|")
    except ValueError:
        raise ParseFatalException(s,l,"invalid image link reference: " + t[0])
    return '<img src="{0}" alt="{1}" />'.format(link, title)
image = QuotedString("{{", endQuoteChar="}}")
image.setParseAction(parse_image)
Next, I wrote a rule so that when {{{text}}} is encountered, simply return what's between the opening and closing braces without formatting it:
n = QuotedString("{{{", endQuoteChar="}}}")
n.setParseAction(lambda x: x[0])
However, when I try to run the following test case:
text = italic | bold | hr | newline | image | n
print text.transformString("{{{ //ignore formatting// }}}")
I get the following stack trace:
Traceback (most recent call last):
File "C:\Users\User\py\kreyol\parser.py", line 36, in <module>
print text.transformString("{{{ //ignore formatting// }}}")
File "C:\Python27\lib\site-packages\pyparsing.py", line 1210, in transformString
raise exc
pyparsing.ParseFatalException: invalid image link reference: { //ignore formatting// (at char 0), (line:1, col:1)
From what I understand, the parser encounters the {{ first and tries to parse the text as an image instead of text without formatting. How can I solve this ambiguity?
The issue is with this expression:
text = italic | bold | hr | newline | image | n
Pyparsing works strictly left-to-right, with no lookahead. Using '|' operators, you construct a pyparsing MatchFirst expression, which will match the first match of all the alternatives, even if a later match is better.
You can change the evaluation to use "longest match" by using the '^' operator instead:
text = italic ^ bold ^ hr ^ newline ^ image ^ n
This would have a performance penalty in that every expression is tested, even though there is no possibility of a better match.
An easier solution is to just reorder the expressions in your list of alternatives: test for n before image:
text = italic | bold | hr | newline | n | image
Now when evaluating alternatives, it will look for the leading {{{ of n before the leading {{ of image.
This often crops up when people define numeric terms, and accidentally define something like:
integer = Word(nums)
realnumber = Combine(Word(nums) + '.' + Word(nums))
number = integer | realnumber
In this case, number will never match a realnumber, since the leading whole number part will be parsed as an integer. The fix, as in your case, is to either use the '^' operator, or just reorder:
number = realnumber | integer
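The same first-match behaviour can be seen without pyparsing. The sketch below is only an analogy using regex alternation from the standard library re module, which also takes the first alternative that matches rather than the longest one; the helper name first_token and the sample inputs are made up for illustration.

```python
import re

# Analogy only: like pyparsing's '|', regex alternation tries alternatives
# left to right and keeps the first one that matches at the current position.
integer_first = re.compile(r"\d+|\d+\.\d+")
real_first = re.compile(r"\d+\.\d+|\d+")

# The same ordering problem with the two brace constructs from the question.
image_first = re.compile(r"\{\{.*?\}\}|\{\{\{.*?\}\}\}")
nowiki_first = re.compile(r"\{\{\{.*?\}\}\}|\{\{.*?\}\}")


def first_token(pattern, s):
    """Return the text matched at the start of s, or None (hypothetical helper)."""
    m = pattern.match(s)
    return m.group(0) if m else None


print(first_token(integer_first, "3.14"))       # 3
print(first_token(real_first, "3.14"))          # 3.14
print(first_token(image_first, "{{{ x }}}"))    # {{{ x }}
print(first_token(nowiki_first, "{{{ x }}}"))   # {{{ x }}}
```

In both pairs, putting the more specific alternative first is what lets it win.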
So I have several examples of raw text from which I have to extract the characters after 'Terms'. The common pattern I see is that the word 'Terms' is followed by a '\n', and the value ends with another '\n'. I want to extract all the characters (words, numbers, symbols) between these two \n, but only after the keyword 'Terms'.
Some examples of text are given below:
1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n
The code I have written is given below:
def get_term_regex(s):
    raw_text = s
    term_regex1 = r'(TERMS\s*\\n(.*?)\\n)'
    try:
        if ('TERMS' or 'Terms') in raw_text:
            pattern1 = re.search(term_regex1,raw_text)
            #print(pattern1)
            return pattern1
    except:
        pass
But I am not getting any output, as there is no match.
The expected output is:
1) Direct deposit; Routing #256078514, acct. #160935
2) Due on receipt
3) NET 30 DAYS
Any help would be really appreciated.
Try the following:
import re
text = '''1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n''' # \n are real new lines
for m in re.finditer(r'(TERMS|Terms)\W*\n(.*?)\n', text):
    print(m.group(2))
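Note that your regex could not deal with the third 'line' because there is a colon : after TERMS. So I replaced \s with \W. Also, in a raw string \\n matches a literal backslash followed by the letter n, not a newline, so if your text contains real newlines the original pattern can never match.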
('TERMS' or 'Terms') in raw_text might not be what you want. It does not raise a syntax error, but it is just the same as 'TERMS' in raw_text: when Python evaluates the parenthesized part, both 'TERMS' and 'Terms' are truthy, so or returns the first truthy value, i.e. 'TERMS'. As a result, 'Terms' is never checked by that condition!
So you might instead want something like ('TERMS' in raw_text) or ('Terms' in raw_text), although it is quite verbose.
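A small sketch of the difference, assuming a sample string of my own; the function names has_terms_buggy and has_terms are hypothetical and only there to contrast the two checks.

```python
def has_terms_buggy(raw_text):
    # ('TERMS' or 'Terms') evaluates to 'TERMS' before `in` is applied,
    # so only the upper-case spelling is ever tested.
    return ('TERMS' or 'Terms') in raw_text


def has_terms(raw_text, keywords=('TERMS', 'Terms')):
    # Test each keyword separately; any() is a less verbose spelling of
    # ('TERMS' in raw_text) or ('Terms' in raw_text).
    return any(keyword in raw_text for keyword in keywords)


sample = "\nTerms\nDue on receipt\n"   # made-up sample input
print('TERMS' or 'Terms')              # TERMS
print(has_terms_buggy(sample))         # False
print(has_terms(sample))               # True
```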
I am trying to use Regex to look through a specific part of a string and take what is in between, but I can't get the right Regex pattern for this.
My biggest issue is with trying to form a Regex pattern for this. I've tried a bunch of variations close to the example listed. It should be close.
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# Regex to search between parameters and make result lowercase if there are any uppercase Chars
result = (re.search("(?<=Link Created)(.+?)(?=Research Done)", text).lower())
# Gets rid of whitespace in case they move the []/[x] around
result = result.replace(" ", "")
if any(x in result for x in toFind):
    print("Exists")
else:
    print("Doesn't Exist")
Happy Path:
I take string (text) and use Regex expression to get the substring between Link Created and Research Done.
Then make the result lowercase and get rid of whitespace just in case they move the []/[x]s. Then it looks at the string (result) for '[]' or '[x]' and print.
Actual Output:
At the moment all I keep getting is None because the Regex syntax is off...
If you want . to match newlines, you have to use the re.S option.
Also, it would seem a better idea to check if the regex matched before proceeding with further calls. Your call to lower() gave me an error because the regex didn't match, so calling result.group(0).lower() only when result evaluates as true is safer.
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# Regex to search between parameters and make result lowercase if there are any uppercase Chars
result = (re.search("(?<=Link Created)(.+?)(?=Research Done)", text, re.S))
if result:
    # Gets rid of whitespace in case they move the []/[x] around
    result = result.group(0).lower().replace(" ", "")
    if any(x in result for x in toFind):
        print("Exists")
    else:
        print("Doesn't Exist")
else:
    print("re did not match")
PS: all the re options are documented in the re module documentation. Search for re.DOTALL for the details on re.S (they're synonyms). If you want to combine options, use bitwise OR. E.g., re.S|re.I will have . match newline and do case-insensitive matching.
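A minimal sketch of combining the two flags, using a shortened version of the table text from the question; the helper name between is hypothetical.

```python
import re

text = "|Link Created | [] |\n|Research Done | [X] "


def between(text, flags=0):
    """Return what lies between the two markers, or None (hypothetical helper)."""
    m = re.search(r"link created(.+?)research done", text, flags)
    return m.group(1) if m else None


# Without flags: the lower-case pattern does not match, and '.' would also
# stop at the newline between the two table rows.
print(between(text))                      # None

# re.I alone fixes the case mismatch but still cannot cross the newline.
print(between(text, re.I))                # None

# re.S | re.I: '.' matches the newline and case is ignored.
print(repr(between(text, re.S | re.I)))   # ' | [] |\n|'
```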
I believe it's the \n newline characters giving issues. You can get around this using [\s\S]+ as such:
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# New regex to match text between
result = re.search(r"Link Created([\s\S]+)Research Done", text).group(1)
# Remove all newlines, tabs, whitespace and column separators
result = re.sub(r"[\n\t\s\|]*", "", result)
if any(x in result for x in toFind):
    print("Exists")
else:
    print("Doesn't Exist")
Seems like regex is overkill for this particular job unless I am missing something (also not clear to me why you need the step that removes the whitespace from the substring). You could just split on "Link Created" and then split the following string on "Research Done".
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
s = text.split("Link Created")[1].split("Research Done")[0].lower()
if "[]" in s or "[x]" in s:
    print("Exists")
else:
    print("Doesn't Exist")
# Exists
My question is quite simple
I'm trying to come up with a RE to select any set of words or statement in between two characters.
For example, if the strings are something like this:
') as whatever '
and it can also look like
') as whatever\r\n'
So I need to extract 'whatever' from this string.
The Regex I came up with is this :
\)\sas\s(.*?)\s
It works fine and extracts 'whatever', but this will only work for the first example, not the second. What should I do in the case of the second string?
I'm basically looking for an OR condition kind of thing!
Any help would be appreciated
Thanks in advance
The question is not very clear but maybe the regular expression syntax you are looking for might be something like this:
\)\sas\s(.*?)[\s\r\n]
This says that the text you are interested in is followed by a whitespace character, a carriage return, or a newline (\s already covers \r and \n, so they are listed only for explicitness).
EDIT
As an example, take the following Python 2 code. A character class in square brackets matches any one of the characters listed in it, so no '|' is needed between them (inside brackets, '|' and spaces are literal characters, not an OR operator). I used it to catch strings in which the next character is whitespace, '\r', '\n', '.' or 'd'.
import re
a = ') as whatever '
b = ') as whatever\r\n'
c = ') as whatever.'
d = ') as whateverd'
a_res = re.findall(r'\)\sas\s(.*?)[\s\r\n]', a)[0] #ending with whitespace, \r or new line char
b_res = re.findall(r'\)\sas\s(.*?)[\s\r\n]', b)[0]
c_res = re.findall(r'\)\sas\s(.*?)[\s\r\n.]', c)[0] #ending with whitespace, \r, new line char or .
d_res = re.findall(r'\)\sas\s(.*?)[\s\r\n.d]', d)[0] #ending with whitespace, \r, new line char, . or d
print(a_res, len(a_res))
print(b_res, len(b_res))
print(c_res, len(c_res))
print(d_res, len(d_res))
Your regex already works as you intended for both cases, because \s matches \r (and \n) as well as a space. Please check it:
import re
a =') as whatever '
b=') as whatever\r\n'
print(re.findall(r'\)\sas\s(.*?)\s', a)[0])
print(re.findall(r'\)\sas\s(.*?)\s', b)[0])
This will output:
whatever
whatever
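To check this yourself, here is a short sketch using only the standard re module. It confirms that \s covers \r and \n, and that '|' inside square brackets is an ordinary character rather than alternation. The variable names are my own.

```python
import re

pattern = re.compile(r"\)\sas\s(.*?)\s")

samples = [") as whatever ", ") as whatever\r\n"]
results = [pattern.findall(s) for s in samples]
print(results)  # [['whatever'], ['whatever']]

# \s matches every ASCII whitespace character, including \r and \n.
whitespace_ok = all(re.fullmatch(r"\s", ch) for ch in " \t\n\r\f\v")
print(whitespace_ok)  # True

# Inside [...], '|' is a literal pipe: the class [a|b] matches 'a', '|' or 'b'.
pipe_is_literal = bool(re.fullmatch(r"[a|b]", "|"))
print(pipe_is_literal)  # True
```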
I'm trying to write a simple int expression parser using tatsu, a PEG-based Python parser generator. Here is my code:
import tatsu
grammar = r'''
start = expression $ ;
expression = add | sub | term ;
add = expression '+' term ;
sub = expression '-' term ;
term = mul | div | number ;
mul = term '*' number ;
div = term '/' number ;
number = [ '-' ] /\d+/ ;
'''
parser = tatsu.compile(grammar)
print(parser.parse('2-1'))
The output of this program is ['-', '1'] instead of the expected ['2', '-', '1'].
I get the correct output if I either:
Remove support for unary minus, i.e. change the last rule to number = /\d+/ ;
Remove the term, mul and div rules, and support only addition and subtraction
Replace the second rule with expression = add | sub | mul | div | number ;
The last option actually works without leaving any feature out, but I don't understand why it works. What is going on?
EDIT: If I just flip the add/sub/mul/div rules to get rid of left recursion, it also works. But then evaluating the expressions becomes a problem, since the parse tree is flipped. (3-2-1 becomes 3-(2-1))
There are left recursion cases that TatSu doesn't handle, and work on fixing that is currently on hold.
You can use left/right join/gather operators to control the associativity of parsed expressions in a non-left-recursive grammar.
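To illustrate the general idea of getting left associativity from a grammar that is not left-recursive, here is a hand-written sketch in plain Python. It is not TatSu and does not use its join/gather operators. It parses expression = term { ('+' | '-') term } with a loop and folds the results to the left; all function names and the tuple-based tree shape are assumptions made for this example.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/])")


def tokenize(text):
    """Split the input into number and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError("unexpected input at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_number(tokens):
    # number = [ '-' ] /\d+/
    sign = tokens.pop(0) if tokens and tokens[0] == "-" else ""
    if not tokens or not tokens[0].isdigit():
        raise ValueError("number expected")
    return sign + tokens.pop(0)


def parse_term(tokens):
    # term = number { ('*' | '/') number }, folded to the left
    node = parse_number(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        node = (node, op, parse_number(tokens))
    return node


def parse_expression(tokens):
    # expression = term { ('+' | '-') term }, folded to the left
    node = parse_term(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        node = (node, op, parse_term(tokens))
    return node


def parse(text):
    tokens = tokenize(text)
    tree = parse_expression(tokens)
    if tokens:
        raise ValueError("unexpected token %r" % tokens[0])
    return tree


print(parse("2-1"))      # ('2', '-', '1')
print(parse("3-2-1"))    # (('3', '-', '2'), '-', '1')
print(parse("2*-3+4"))   # (('2', '*', '-3'), '+', '4')
```

Because each loop wraps the tree built so far as the left operand, 3-2-1 groups as (3-2)-1 without the grammar ever calling itself on the left.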
I am trying to write a Python 3.6.0 script to find elements in a page. It extracts the line after words that appear in 2 formats : "Element:" Or "Element :" (with a space before the ":").
So I tried to use regular expressions. It works only half the time and I could not figure out what is wrong in my code. Here is the code with an example:
import re
TestString = r"""Some text
Year: 2015.12.10
Some other text
"""
ListOfTags = ["Year(?= ?):", "Year(?=\s?):", "Year(?= *):"]
for i in range(0, len(ListOfTags)):
    try:
        TagsFound = re.search(str(ListOfTags[i]) + '(.+?)\n', TestString).group(1)
        print(TransformString('"' + ListOfTags[i] + '"') + " returns: " + TagsFound)
    except AttributeError:
        # TestString not found in the original string (or something else ???)
        TagsFound = ''
        print("No tag found..")
(With this code, I could test several expressions at a time)
Here, when the expression is "Year: 2015.12.10" all the regular expressions work and return " 2015.12.10"
But, they don't work when it is "Year :" (with a space before the ":")...
I also tried the expressions "Year( ?):", "Year(\s?):", "Year( *):" , "Year( |:?)( |:?)" but they did not work.
I think regular expressions may be overkill here (unless you have a good reason for using them). You could try processing your text line by line. For each line you could use the partition method on the str to split it at the first colon found.
for line in TestString.splitlines():
    if ':' in line:
        tag, __, value = line.partition(':')
        #Now see if this is a tag you care about and do something with the value
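A fuller sketch of the partition approach follows, with a regex alternative for comparison. The lookahead patterns in the question fail on "Year :" because a lookahead such as (?= ?) is zero-width: it does not consume the space, so the following ":" is then matched against the space. Consuming the optional whitespace with \s* avoids that. The function name find_tags and the "Month : December" line are assumptions added for illustration.

```python
import re

test_string = """Some text
Year: 2015.12.10
Month : December
Some other text
"""


def find_tags(text, wanted):
    """Collect 'Tag: value' and 'Tag : value' lines for the wanted tags."""
    found = {}
    for line in text.splitlines():
        if ':' in line:
            tag, _, value = line.partition(':')
            tag = tag.strip()          # drops the optional space before ':'
            if tag in wanted:
                found[tag] = value.strip()
    return found


print(find_tags(test_string, {"Year", "Month"}))
# {'Year': '2015.12.10', 'Month': 'December'}

# Regex alternative: \s* consumes any whitespace before and after the colon.
year_pattern = re.compile(r"Year\s*:\s*(.+)")
print(year_pattern.search("Year : 2015.12.10\n").group(1))   # 2015.12.10
print(year_pattern.search("Year: 2015.12.10\n").group(1))    # 2015.12.10
```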