Pyparsing nested transformString - python

I had something working for a while that transforms a tag from Lua to HTML, but recently I hit a special case where those tags can be nested. Here is a quick sample from my code:
from pyparsing import Literal, Word, Suppress, SkipTo, LineEnd, hexnums
text = "|c71d5FFFFI'm saying something in color|cFFFFFFFF then in white |r|r"
def colorize(t):
    hexRGB = "".join(list(t.hex)[:6])
    return "<span style=\"color:#{};\">{}</span>".format(hexRGB, t.content)
vbar = Literal("|")
eol = LineEnd().suppress()
endTag = ((vbar + (Literal("r")|Literal("R"))|eol))
parser = (
    Suppress(vbar + (Literal("c")|Literal("C"))) +
    Word(hexnums, exact=8).setResultsName("hex") +
    SkipTo(endTag).setResultsName("content") +
    Suppress(endTag)
).addParseAction(colorize)
result = parser.transformString(text)
print (result)
I saw a similar question, "Pyparsing: nested Markdown emphasis", but my problem is a bit different: sometimes there is no closing tag and the end of the line acts as one.

You can add a while loop that re-applies transformString to the result until it stops changing, i.e. until all the color tags have been converted:
from pyparsing import Literal, Word, Suppress, SkipTo, LineEnd, hexnums
def colorize(t):
    hexRGB = "".join(list(t.hex)[:6])
    return "<span style=\"color:#{};\">{}</span>".format(hexRGB, t.content)
vbar = Literal("|")
eol = LineEnd().suppress()
endTag = ((vbar + (Literal("r")|Literal("R"))|eol))
parser = (
    Suppress(vbar + (Literal("c")|Literal("C"))) +
    Word(hexnums, exact=8).setResultsName("hex") +
    SkipTo(endTag).setResultsName("content") +
    Suppress(endTag)
).addParseAction(colorize)
# Sample input from the question.
text = "|c71d5FFFFI'm saying something in color|cFFFFFFFF then in white |r|r"
result = parser.transformString(text)
# Each pass converts the outermost remaining tag; stop when a pass changes nothing.
new_result = parser.transformString(result)
while result != new_result:
    result = new_result
    new_result = parser.transformString(result)
print(result)
when text = "|c71d5FFFFI'm saying something in color|cFFFFFFFF then in white |r|r":
output:
<span style="color:#71d5FF;">I'm saying something in color<span style="color:#FFFFFF;"> then in white</span></span>
when text = "|c71d5FFFFI'm saying something in color"
output:
<span style="color:#71d5FF;">I'm saying something in color</span>
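For comparison, the nesting rule described above (an opening tag starts a span, |r closes the innermost one, and the end of the line closes whatever is still open) can be sketched in a single pass with a depth counter. This is only an illustrative alternative that uses the standard re module instead of pyparsing; the name colorize_line and the choice to silently drop a stray |r are assumptions, not part of the original code.

```python
import re

# An opening tag is |c or |C plus 8 hex digits; a closing tag is |r or |R.
TAG = re.compile(r"\|[cC]([0-9a-fA-F]{8})|\|[rR]")

def colorize_line(line):
    """Convert nested |cXXXXXXXX ... |r tags on one line into nested <span> elements."""
    out = []
    depth = 0  # number of spans currently open
    pos = 0
    for m in TAG.finditer(line):
        out.append(line[pos:m.start()])  # plain text before the tag
        pos = m.end()
        if m.group(1) is not None:
            # Opening tag: keep the first six hex digits, as colorize() does.
            out.append('<span style="color:#{};">'.format(m.group(1)[:6]))
            depth += 1
        elif depth > 0:
            out.append("</span>")
            depth -= 1
        # A closing tag with nothing open is dropped (assumption).
    out.append(line[pos:])
    # The end of the line acts as the closing tag for anything still open.
    out.append("</span>" * depth)
    return "".join(out)

text = "|c71d5FFFFI'm saying something in color|cFFFFFFFF then in white |r|r"
print(colorize_line(text))
```

Because the function works on one line at a time, multi-line input would need to be split with splitlines() first.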

Related

Iterating with multiple variables

I am taking a course from Georgia Tech and I have spent all evening trying to figure this out, but I haven't been able to. My task is as follows:
Write a function called my_TAs. The function should take as
input three strings: first_TA, second_TA, and third_TA. It
should return as output the string, "[first_TA], [second_TA], and [third_TA] are awesome!", with the values replacing the
variable names.
For example, my_TAs("Sridevi", "Lucy", "Xu") would return
the string "Sridevi, Lucy, and Xu are awesome!".
Hint: Notice that because you're returning a string instead
of printing a string, you can't use the print() statement
-- you'll have to create the string yourself, then return it.
My function returns "Joshua are Awesome!" instead of a string containing all three names. I tried this
result = str(first_TA), str(second_TA), str(third_TA) + "are awesome!"
but it didn't work.
def my_TAs(first_TA, second_TA, third_TA):
    result = str(first_TA) + " are Awesome!"
    return result
first_TA = "Joshua"
second_TA = "Jackie"
third_TA = "Marguerite"
test_first_TA = "Joshua"
test_second_TA = "Jackie"
test_third_TA = "Marguerite"
print(my_TAs(test_first_TA, test_second_TA, test_third_TA))
You can use f-strings to accomplish this:
def my_TAs(first_TA, second_TA, third_TA):
    return f"{first_TA}, {second_TA}, and {third_TA} are awesome!"
test_first_TA = "Joshua"
test_second_TA = "Jackie"
test_third_TA = "Marguerite"
print(my_TAs(test_first_TA, test_second_TA, test_third_TA))
Output:
Joshua, Jackie, and Marguerite are awesome!
Use + instead of , to concatenate the strings:
def my_TAs(first_TA, second_TA, third_TA):
    # Parentheses let the expression continue onto the next line.
    result = (str(first_TA) + ", " + str(second_TA) + ", and " + str(third_TA)
              + " are awesome!")
    return result
first_TA = "Joshua"
second_TA = "Jackie"
third_TA = "Marguerite"
test_first_TA = "Joshua"
test_second_TA = "Jackie"
test_third_TA = "Marguerite"
print(my_TAs(test_first_TA, test_second_TA, test_third_TA))
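To see why the comma-separated attempt from the question fails, it helps to look at what that expression actually produces: commas build a tuple, and the + applies only to the last element. The sketch below contrasts the two; the helper name my_TAs_with_commas is made up for illustration.

```python
def my_TAs_with_commas(first_TA, second_TA, third_TA):
    # Commas create a tuple of three items; only the third gets the suffix appended.
    return str(first_TA), str(second_TA), str(third_TA) + "are awesome!"

def my_TAs(first_TA, second_TA, third_TA):
    # str.format builds one string with the separators spelled out explicitly.
    return "{}, {}, and {} are awesome!".format(first_TA, second_TA, third_TA)

attempt = my_TAs_with_commas("Sridevi", "Lucy", "Xu")
print(type(attempt).__name__)  # the attempt is a tuple, not a str
print(attempt)
print(my_TAs("Sridevi", "Lucy", "Xu"))
```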

Splits based on pyparsing

I want to do this (using pyparsing):
Package:numpy11 Package:scipy
will be split into
[["Package:", "numpy11"], ["Package:", "scipy"]]
My code so far is
from pyparsing import Group, Literal, OneOrMore, Word, printables
package_header = Literal("Package:")
single_package = Word(printables + " ") + ~Literal("Package:")
full_parser = OneOrMore(Group(package_header + single_package))
The current output is this
([(['Package:', 'numpy11 Package:scipy'], {})], {})
I was hoping for something like this
([(['Package:', 'numpy11'], {})], [(['Package:', 'scipy'], {})], {})
Essentially, the rest of the text matches pp.printables. I am aware that I can use Word, but I want to match all printables except the Literal.
How do I accomplish this? Thank you.
You shouldn't need the negative lookahead. Word(printables) already stops at whitespace, so dropping the " " from the character set is enough:
from pyparsing import *
package_header = Literal("Package:")
single_package = Word(printables)
full_parser = OneOrMore( Group( package_header + single_package ) )
print(full_parser.parseString("Package:numpy11 Package:scipy"))
prints:
[['Package:', 'numpy11'], ['Package:', 'scipy']]
Update: to parse packages delimited by | you can use the delimitedList() function (now you can also have spaces in package names):
from pyparsing import *
package_header = Literal("Package:")
package_name = Regex(r'[^|]+') # | is a printable, so create a regex that excludes it.
package = Group(package_header + package_name)
full_parser = delimitedList(package, delim="|" )
print(full_parser.parseString("Package:numpy11 foo|Package:scipy"))
prints:
[['Package:', 'numpy11 foo'], ['Package:', 'scipy']]
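The "everything up to the next Package: header" idea from the question can also be illustrated with a lookahead in the standard re module. This is a swapped-in technique, not pyparsing, and the names PACKAGE and split_packages are hypothetical; it is meant only as a sketch of how a lookahead bounds each package name when no delimiter is available.

```python
import re

# "Package:" followed by the shortest text that ends right before
# the next "Package:" header or the end of the string.
PACKAGE = re.compile(r"(Package:)\s*(.*?)(?=\s*Package:|$)")

def split_packages(text):
    """Return [header, name] pairs for a single line of text."""
    return [list(pair) for pair in PACKAGE.findall(text)]

print(split_packages("Package:numpy11 Package:scipy"))
print(split_packages("Package:numpy 11 foo Package:scipy"))
```

Unlike Word(printables), this version tolerates spaces inside a package name, at the cost of assuming that the text "Package:" never appears inside a name.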

Python docx add_paragraph() inserts leading newline

I'm able to use a paragraph object to select font size, color, bold, etc. within a table cell. But, add_paragraph() seems to always insert a leading \n into the cell and this messes up the formatting on some tables.
If I just use the cell.text('') method it doesn't insert this newline but then I can't control the text attributes.
Is there a way to eliminate this leading newline?
Here is my function:
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Inches, Pt, RGBColor

def add_table_cell(table, row, col, text, fontSize=8, r=0, g=0, b=0, width=-1):
    cell = table.cell(row, col)
    if width != -1:
        cell.width = Inches(width)
    para = cell.add_paragraph(style=None)
    para.alignment = WD_ALIGN_PARAGRAPH.LEFT
    run = para.add_run(text)
    run.bold = False
    run.font.size = Pt(fontSize)
    # Assigning rgb also makes the color type RGB, so no separate type assignment is needed.
    run.font.color.rgb = RGBColor(r, g, b)
I tried the following and it worked for me. I'm not sure whether it is the best approach:
cells[0].text = 'Some text' #Write the text to the cell
#Modify the paragraph alignment, first paragraph
cells[0].paragraphs[0].paragraph_format.alignment=WD_ALIGN_PARAGRAPH.CENTER
The solution I found is to set the text attribute instead of calling add_paragraph(), and then use add_run() on the existing paragraph:
row_cells[0].text = ''
row_cells[0].paragraphs[0].add_run('Total').bold = True
row_cells[0].paragraphs[0].paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT
I've looked through the documentation for _Cell, and add_paragraph() is not the problem. The problem is that a new cell already contains one paragraph by default.
class docx.table._Cell:
paragraphs: ... By default, a new cell contains a single paragraph. Read-only
Therefore, if you want add_paragraph() to produce the first paragraph of the cell, you have to delete the default paragraph first. Since python-docx doesn't have paragraph.delete(), you can use the function mentioned in this GitHub issue: feature: Paragraph.delete()
def delete_paragraph(paragraph):
    p = paragraph._element
    p.getparent().remove(p)
    p._p = p._element = None
Therefore, you should do something like:
cell = table.cell(0,0)
paragraph = cell.paragraphs[0]
delete_paragraph(paragraph)
paragraph = cell.add_paragraph('text you want to add', style='style you want')
Update at 10/8/2022
Sorry, the above approach is kind of unnecessary.
It's much more intuitive to edit the default paragraph than to delete it and add a new one.
In the function add_table_cell, just replace para = cell.add_paragraph(style=None) with para = cell.paragraphs[0].
Setting para.style = None afterwards is not necessary, as that should be the default for a new paragraph.
Here is what worked for me. I don't call add_paragraph(); I just reference the first paragraph with para = cell.paragraphs[0]. Everything after that is the usual API calls.
table = doc.add_table( rows=1, cols=3 ) # bar codes
for tableRow in table.rows:
    for cell in tableRow.cells:
        para = cell.paragraphs[0]
        run = para.add_run( "*" + specIDStr + "*" )
        font = run.font
        font.name = 'Free 3 of 9'
        font.size = Pt( 20 )
        run = para.add_run( "\n" + specIDStr
            + "\n" + firstName + " " + lastName
            + "\tDOB: " + dob )
        font = run.font
        font.name = 'Arial'
        font.size = Pt( 8 )

Parsing Python textfile with tags

I am parsing a 300-page document with Python and I need to find the attribute values of the Response element that follows the ThisVal element. The Response element is used in multiple places for different Vals, so I need to find what is in the Response element's attributes after finding the ThisVal element.
If it helps, the tokens are unique to ThisVal, but are different in every document.
11:44:49 <ThisVal Token="5" />
11:44:49 <Response Token="5" Code="123123" elements="x.one,x.two,x.three,x.four,x.five,x.six,x.seven" />
Have you considered using pyparsing? I've found it to be very useful for this kind of thing. Below is my attempt at a solution to your problem.
import pyparsing as pp
document = """11:44:49 <ThisVal Token="5" />
11:44:49 <Response Token="5" Code="123123" elements="x.one,x.two,x.three,x.four,x.five,x.six,x.seven" />
"""
num = pp.Word(pp.nums)
colon = ":"
start = pp.Suppress("<")
end = pp.Suppress("/>")
eq = pp.Suppress("=")
tag_name = pp.Word(pp.alphas)("tag_name")
value = pp.QuotedString("\"")
timestamp = pp.Suppress(num + colon + num + colon + num)
other_attr = pp.Group(pp.Word(pp.alphas) + eq + value)
tag = start + tag_name + pp.ZeroOrMore(other_attr)("attr") + end
tag_line = timestamp + tag
thisval_found = False
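for line in document.splitlines():
    result = tag_line.parseString(line)
    print("Tag: {}\nAttributes: {}\n".format(result.tag_name, result.attr))
    # Compare the parsed tag name (not the tag_name parser element) with "Response".
    if thisval_found and result.tag_name == "Response":
        for a in result.attr:
            if a[0] == "elements":
                print("FOUND: {}".format(a[1]))
    # Remember whether this line was a ThisVal tag for the next iteration.
    thisval_found = result.tag_name == "ThisVal"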
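If pulling in pyparsing is not an option, the same "remember the ThisVal, then inspect the following Response" logic can be sketched with the standard re module. This variant matches on the Token value instead of strict adjacency, which is an assumption based on the sample where both tags carry Token="5"; the function name responses_after_thisval and the extra Token="9" line are made up for illustration.

```python
import re

# One self-closing tag per line: <Name attr="value" ... />
TAG_LINE = re.compile(r'<(?P<name>\w+)\s+(?P<attrs>[^>]*?)\s*/>')
ATTR = re.compile(r'(\w+)="([^"]*)"')

def responses_after_thisval(lines):
    """Yield the attributes of each Response whose Token was announced by a ThisVal."""
    thisval_tokens = set()
    for line in lines:
        m = TAG_LINE.search(line)
        if m is None:
            continue  # not a tag line
        attrs = dict(ATTR.findall(m.group("attrs")))
        if m.group("name") == "ThisVal":
            thisval_tokens.add(attrs.get("Token"))
        elif m.group("name") == "Response" and attrs.get("Token") in thisval_tokens:
            yield attrs

document = """11:44:49 <ThisVal Token="5" />
11:44:49 <Response Token="5" Code="123123" elements="x.one,x.two,x.three,x.four,x.five,x.six,x.seven" />
11:44:50 <Response Token="9" Code="0" elements="y.one" />
"""

for attrs in responses_after_thisval(document.splitlines()):
    print(attrs["Code"], attrs["elements"].split(","))
```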

Cannot parse correctly this file with pyparsing

I am trying to parse a file using the amazing python library pyparsing but I am having a lot of problems...
The file I am trying to parse is something like:
sectionOne:
    list:
    - XXitem
    - XXanotherItem
    key1: value1
    product: milk
    release: now
    subSection:
        skey : sval
        slist:
        - XXitem
    mods:
    - XXone
    - XXtwo
    version: last
sectionTwo:
    base: base-0.1
    config: config-7.0-7
As you can see, it is an indented configuration file, and this is more or less how I have tried to define the grammar:
The file can have one or more sections.
Each section is formed by a section name and a section content.
Each section has an indented content.
Each section content can have one or more key/value pairs or a subsection.
Each value can be just a single word or a list of items.
A list of items is a group of one or more items.
Each item is a HYPHEN + a name starting with 'XX'.
I have tried to create this grammar using pyparsing but with no success.
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange(r"[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
pair = pyparsing.Group(key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
section_value = pyparsing.OneOrMore(pair | section)
section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section << pyparsing.Group(section_name + section_content)
parser = pyparsing.OneOrMore(section)
def main():
    try:
        with open('simple.info', 'r') as content_file:
            content = content_file.read()
        print("content:\n", content)
        print("\n")
        result = parser.parseString(content)
        print("result1:\n", result)
        print("len", len(result))
        pprint.pprint(result.asList())
    except (pyparsing.ParseException, pyparsing.ParseFatalException) as err:
        # Show the offending line with a caret under the error column.
        print(err.line)
        print(" " * (err.column - 1) + "^")
        print(err)

if __name__ == '__main__':
    main()
This is the result :
result1:
[['sectionOne', [[['list', ['XXitem', 'XXanotherItem']], ['key1', 'value1'], ['product', 'milk'], ['release', 'now'], ['subSection', [[['skey', 'sval'], ['slist', ['XXitem']], ['mods', ['XXone', 'XXtwo']], ['version', 'last']]]]]]], ['sectionTwo', [[['base', 'base-0.1'], ['config', 'config-7.0-7']]]]]
len 2
[
  ['sectionOne',
    [[
      ['list', ['XXitem', 'XXanotherItem']],
      ['key1', 'value1'],
      ['product', 'milk'],
      ['release', 'now'],
      ['subSection',
        [[
          ['skey', 'sval'],
          ['slist', ['XXitem']],
          ['mods', ['XXone', 'XXtwo']],
          ['version', 'last']
        ]]
      ]
    ]]
  ],
  ['sectionTwo',
    [[
      ['base', 'base-0.1'],
      ['config', 'config-7.0-7']
    ]]
  ]
]
As you can see I have two main problems:
1.- Each section content is nested twice into a list
2.- the key "version" is parsed inside the "subSection" when it belongs to the "sectionOne"
My real target is to be able to get a structure of python nested dictionaries with the keys and values to easily extract the info for each field, but the pyparsing.Dict is something obscure to me.
Could anyone please help me ?
Thanks in advance
( sorry for the long post )
You really are pretty close - congrats, indented parsers are not the easiest to write with pyparsing.
Look at the commented changes. Those marked with 'A' are changes to fix your two stated problems. Those marked with 'B' add Dict constructs so that you can access the parsed data as a nested structure using the names in the config.
The biggest culprit is that indentedBlock does some extra Group'ing for you, which gets in the way of Dict's name-value associations. Using ungroup to peel that away lets Dict see the underlying pairs.
Best of luck with pyparsing!
import pprint
import pyparsing
NEWLINE = pyparsing.LineEnd().suppress()
VALID_CHARACTERS = pyparsing.srange(r"[a-zA-Z0-9_\-\.]")
COLON = pyparsing.Suppress(pyparsing.Literal(":"))
HYPHEN = pyparsing.Suppress(pyparsing.Literal("-"))
XX = pyparsing.Literal("XX")
list_item = HYPHEN + pyparsing.Combine(XX + pyparsing.Word(VALID_CHARACTERS))
list_of_items = pyparsing.Group(pyparsing.OneOrMore(list_item))
key = pyparsing.Word(VALID_CHARACTERS) + COLON
pair_value = pyparsing.Word(VALID_CHARACTERS) + NEWLINE
value = (pair_value | list_of_items)
#~ A: pair = pyparsing.Group(key + value)
pair = (key + value)
indentStack = [1]
section = pyparsing.Forward()
section_name = pyparsing.Word(VALID_CHARACTERS) + COLON
#~ A: section_value = pyparsing.OneOrMore(pair | section)
section_value = (pair | section)
#~ B: section_content = pyparsing.indentedBlock(section_value, indentStack, True)
section_content = pyparsing.Dict(pyparsing.ungroup(pyparsing.indentedBlock(section_value, indentStack, True)))
#~ A: section << Group(section_name + section_content)
section << (section_name + section_content)
#~ B: parser = pyparsing.OneOrMore(section)
parser = pyparsing.Dict(pyparsing.OneOrMore(pyparsing.Group(section)))
Now instead of pprint(result.asList()) you can write:
print (result.dump())
to show the Dict hierarchy:
[['sectionOne', ['list', ['XXitem', 'XXanotherItem']], ... etc. ...
- sectionOne: [['list', ['XXitem', 'XXanotherItem']], ... etc. ...
  - key1: value1
  - list: ['XXitem', 'XXanotherItem']
  - mods: ['XXone', 'XXtwo']
  - product: milk
  - release: now
  - subSection: [['skey', 'sval'], ['slist', ['XXitem']]]
    - skey: sval
    - slist: ['XXitem']
  - version: last
- sectionTwo: [['base', 'base-0.1'], ['config', 'config-7.0-7']]
  - base: base-0.1
  - config: config-7.0-7
allowing you to write statements like:
print (result.sectionTwo.base)
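Since the stated goal is a structure of nested Python dictionaries, here is a rough standard-library sketch of the same indentation rules, written by hand rather than with pyparsing. It assumes the layout shown in SAMPLE (four spaces per level, list items directly under their key), which is a guess at the original file's whitespace; parse_config and SAMPLE are hypothetical names, and error handling is minimal.

```python
def parse_config(text):
    """Turn an indented 'key: value' file into nested dicts (lists for '- item' lines)."""
    root = {}
    stack = [(-1, root)]  # (indent of the owning key, dict collecting its children)
    owner = None          # (parent dict, key) of the latest "key:" line without a value
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-"):
            if owner is None:
                raise ValueError("list item without a preceding 'key:' line: " + line)
            parent, key = owner
            if not isinstance(parent[key], list):
                # The key turned out to introduce a list, not a sub-section.
                parent[key] = []
                stack.pop()
            parent[key].append(line[1:].strip())
            continue
        indent = len(raw) - len(raw.lstrip())
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Close every section indented at least as far as this line.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if value:
            parent[key] = value
            owner = None
        else:
            # Could be a sub-section or a list; decided when the next line is read.
            parent[key] = {}
            stack.append((indent, parent[key]))
            owner = (parent, key)
    return root

SAMPLE = """\
sectionOne:
    list:
    - XXitem
    - XXanotherItem
    key1: value1
    subSection:
        skey : sval
        slist:
        - XXitem
    mods:
    - XXone
    - XXtwo
    version: last
sectionTwo:
    base: base-0.1
"""

config = parse_config(SAMPLE)
print(config["sectionOne"]["version"])
print(config["sectionTwo"]["base"])
```

The indentation stack plays the role that indentStack plays for indentedBlock: a line belongs to the nearest open section that is indented less than the line itself.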
