pyparsing how to SkipTo end of indented block? - python

I am trying to parse a structure like this with pyparsing:
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
I need something like:
import pyparsing as pp
colon = pp.Suppress(':')
term = pp.Word(pp.alphanums + "_")
description = pp.SkipTo(next_identifier)
definition = term + colon + description
grammar = pp.OneOrMore(definition)
But I am struggling to define the next_identifier of the SkipTo clause since the identifiers may appear freely in the description text.
It seems that I need to include the indentation in the grammar, so that I can SkipTo the next non-indented line.
I tried:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) +
    pp.indentedBlock(
        pp.ZeroOrMore(
            pp.SkipTo(pp.LineEnd())
        ),
        indent_stack
    )
)
But I get the error:
ParseException: not a subentry (at char 55), (line:2, col:1)
Char 55 is at the very beginning of the run-on line:
...will wrap\n on to the next line...
^
Which seems a bit odd, because that char position is clearly followed by the whitespace which makes it an indented subentry.
My traceback in ipdb looks like:
5311 def checkSubIndent(s,l,t):
5312 curCol = col(l,s)
5313 if curCol > indentStack[-1]:
5314 indentStack.append( curCol )
5315 else:
-> 5316 raise ParseException(s,l,"not a subentry")
5317
ipdb> indentStack
[1]
ipdb> curCol
1
I should add that the whole structure above that I'm matching may also be indented (by an unknown amount), so a solution like:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) + pp.LineEnd() +
    pp.ZeroOrMore(
        pp.White(' ') + pp.SkipTo(pp.LineEnd()) + pp.LineEnd()
    )
)
...which works for the example as presented will not work in my case as it will consume the subsequent definitions.

When you use indentedBlock, the argument you pass in is the expression for each line in the block, so it shouldn't be indentedBlock(ZeroOrMore(line_expression), stack), just indentedBlock(line_expression, stack). Pyparsing includes a builtin expression for "everything from here to the end of the line", named restOfLine, so we will just use that for the expression for each line in the indented block:
import pyparsing as pp
NL = pp.LineEnd().suppress()
label = pp.ungroup(pp.Word(pp.alphas, pp.alphanums+'_') + pp.Suppress(":"))
indent_stack = [1]
# see corrected version below
#description = pp.Group((pp.Empty()
#                        + pp.restOfLine + NL
#                        + pp.ungroup(pp.indentedBlock(pp.restOfLine, indent_stack))))
description = pp.Group(pp.restOfLine + NL
                       + pp.Optional(pp.ungroup(~pp.StringEnd()
                                                + pp.indentedBlock(pp.restOfLine,
                                                                   indent_stack))))
labeled_text = pp.Group(label("label") + pp.Empty() + description("description"))
We use ungroup to remove the extra level of nesting created by indentedBlock but we also need to remove the per-line nesting that is created internally in indentedBlock. We do this with a parse action:
def combine_parts(tokens):
    # recombine description parts into a single list
    tt = tokens[0]
    new_desc = [tt.description[0]]
    new_desc.extend(t[0] for t in tt.description[1:])
    # reassign the rebuilt description into the parsed token structure
    tt['description'] = new_desc
    tt[1][:] = new_desc

labeled_text.addParseAction(combine_parts)
At this point, we are pretty much done. Here is your sample text parsed and dumped:
parsed_data = (pp.OneOrMore(labeled_text)).parseString(sample)
print(parsed_data[0].dump())
['identifier', ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']]
- description: ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']
- label: 'identifier'
Or this code to pull out the label and description fields:
for item in parsed_data:
    print(item.label)
    print('..' + '\n..'.join(item.description))
    print()
identifier
..some description text here which will wrap
..on to the next line. the follow-on text should be
..indented. it may contain identifier: and any text
..at all is allowed
next_identifier
..more description, short this time
last_identifier
..blah blah
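As a cross-check that does not use pyparsing at all, the same grouping can be sketched in plain Python. This is only a sketch under two assumptions that are not in the answer above: labels sit at the indentation of the first non-blank line (so the whole structure may be indented by any amount), and every continuation line is indented deeper than that. The names LABEL and parse_definitions are made up for the illustration.

```python
import re

# Hypothetical helper: a label is a word followed by a colon at the start of a line.
LABEL = re.compile(r"^\s*(\w+):\s*(.*)$")

def parse_definitions(text):
    """Group lines into (label, [description lines]) using indentation."""
    entries = []
    base_indent = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        indent = len(line) - len(line.lstrip())
        if base_indent is None:
            base_indent = indent  # the first non-blank line sets the label column
        match = LABEL.match(line)
        if match and indent <= base_indent:
            # a new definition starts only at the label column
            entries.append((match.group(1), [match.group(2)]))
        elif entries:
            # deeper-indented text belongs to the current description,
            # even if it happens to contain "identifier:"
            entries[-1][1].append(line.strip())
    return entries
```

Because only lines at the label column can start a new entry, an "identifier:" inside wrapped text is left alone.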

Related

How to store string in quotation that contains two words?

I wrote the search code below, and I want to store whatever is between " " as a single item in the list. How can I do that? In this case I have three lists, but the second one, should, is not what I want.
import re
message='read read read'
others = ' '.join(re.split('\(.*\)', message))
others_split = others.split()
to_compile = re.compile('.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
should = ors_string.split(' ')
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')
Output:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly', 'needed"', 'empty']
must_not: ['russia', '"destination good"']
Wanted result:
must: ['read', '"find find"', 'within', '"plane"']
should: ['"exactly needed"', 'empty'] <---
must_not: ['russia', '"destination good"']
I also get this error after editing the message. How do I handle it?
Traceback (most recent call last):
ors_string = to_match.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Your should list is built with should = ors_string.split(' '), which splits on whitespace; that is why the quoted phrase is broken apart in the list. The following code gives the output you requested, but I'm not sure that it solves your problem for future inputs.
import re
message = 'read "find find":within("exactly needed" OR empty) "plane" -russia -"destination good"'
others = ' '.join(re.split(r'\(.*\)', message))
others_split = others.split()
to_compile = re.compile(r'.*\((.*)\).*')
to_match = to_compile.match(message)
ors_string = to_match.group(1)
# Split on OR instead of whitespace.
should = ors_string.split('OR')
to_remove_or = "OR"
while to_remove_or in should:
    should.remove(to_remove_or)
# Remove the surrounding whitespace that is left after the split.
should = [word.strip() for word in should]
must = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and not term.startswith('-')]
must_not = [term for term in re.findall(r'\(.*?\)|(-?(?:".*?"|\w+))', message) if term and term.startswith('-')]
must_not = [s.replace("-", "") for s in must_not]
print(f'must: {must}')
print(f'should: {should}')
print(f'must_not: {must_not}')

text format in python-docx

I need to enter text in thousands of Word documents. I found python-docx suitable for this purpose. But the string I have to enter has font color and style, like:
{\color{Red}Mr Mike} Bold Class, Italics College, City
Is it possible to format that?
As an mwe,
from docx import Document
document = Document('doc.docx')
Name = ["Name1", "Name2", "Name3"]
Class = ["Class1", "Class2", "Class3"]
for i in range(len(Name)):
    string = Name[i] + ", " + Class[i] + " Class"
    # Name[i] should be red, bold, 12 pt; Class[i] should be 12 pt, italics
    for p in document.paragraphs:
        if p.text.find("REPLACE") >= 0:
            p.text = p.text.replace("REPLACE", string)
    document.save(Name[i] + '.docx')
Character formatting is applied at the run level. So you will need to work at that level if you want "inline" differences in character formatting. In general, a paragraph is composed of zero or more runs. All characters in a run share the same character-level formatting ("font"). Pseudo-code for a possible approach is:
old = "foo"
new = "bar"
text = paragraph.text
start = text.find(old)
prefix = text[:start]
suffix = text[start+len(old):]
# --- assigning to `paragraph.text` leaves the paragraph with
# --- exactly one run
paragraph.text = prefix
# --- add an additional run for your replacement word, keeping
# --- a reference to it so you can set its visual attributes
run = paragraph.add_run(new)
run.italic = True
# --- etc. to get the formatting you want ---
# --- add the suffix to complete the original paragraph text ---
paragraph.add_run(suffix)
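The pseudo-code above could be wrapped into a small helper along these lines. It is a sketch, not a definitive implementation: the function names and the style_run callback are made up for illustration, and the paragraph is assumed to behave like a python-docx Paragraph (a text property plus add_run()). The styling function uses Pt and RGBColor from docx.shared.

```python
def replace_with_styled_run(paragraph, old, new, style_run):
    """Replace the first `old` in `paragraph` with `new`, placed in its own run."""
    text = paragraph.text
    start = text.find(old)
    if start < 0:
        return None                      # nothing to replace
    suffix = text[start + len(old):]
    paragraph.text = text[:start]        # leaves the paragraph with one run
    run = paragraph.add_run(new)         # separate run for the new text
    style_run(run)                       # the caller decides the formatting
    if suffix:
        paragraph.add_run(suffix)        # restore the rest of the paragraph
    return run


def red_bold_12pt(run):
    # Imported here so the helper above does not need python-docx to be importable.
    from docx.shared import Pt, RGBColor
    run.bold = True
    run.font.size = Pt(12)
    run.font.color.rgb = RGBColor(0xFF, 0x00, 0x00)
```

For the question's case, one option is to use two placeholders in the template (for example a hypothetical NAME and CLASS) and call the helper once per placeholder, passing red_bold_12pt for the name and an italic styling function for the class.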

How can I tokenize this text into sentences with Regex

"You could not possibly have come at a better time, my dear Watson,"
he said cordially. 'It is not worth your while to wait,' she went
on."You can pass through the door; no one hinders." And then, seeing that I smiled and shook my head, she suddenly threw aside her
constraint and made a step forward, with her hands wrung together.
Look at the highlighted area. How can I possibly distinguish a case where '"' is followed by a period (.) to end a sentence and a case where a period (.) is followed by a '"'
I have tried this piece for the tokenizer. It works well except for just that one part.
(([^।\.?!]|[।\.?!](?=[\"\']))+\s*[।\.?!]\s*)
Edit: I am not planning to use any NLP toolkit to solve this problem.
Use NLTK instead of regular expressions here:
from nltk import sent_tokenize
parts = sent_tokenize(your_string)
# ['"You could not possibly have come at a better time, my dear Watson," he said cordially.', "'It is not worth your while to wait,' she went on.", '"You can pass through the door; no one hinders."', 'And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.']
I found this function a while ago (adapted here to run on Python 3):
import re

def split_into_sentences(text):
    caps = r"([A-Z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov|mobi|info|edu)"
    # Python 3 strings are already Unicode; only bytes need decoding
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    text = " {0} ".format(text)
    text = text.replace("\n", " ")
    # protect periods that do not end a sentence by marking them <prd>
    text = re.sub(prefixes, r"\1<prd>", text)
    text = re.sub(websites, r"<prd>\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + "[.] ", r" \1<prd> ", text)
    text = re.sub(acronyms + " " + starters, r"\1<stop> \2", text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]", r"\1<prd>\2<prd>\3<prd>", text)
    text = re.sub(caps + "[.]" + caps + "[.]", r"\1<prd>\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, r" \1<stop> \2", text)
    text = re.sub(" " + suffixes + "[.]", r" \1<prd>", text)
    text = re.sub(" " + caps + "[.]", r" \1<prd>", text)
    # move sentence-ending punctuation outside a closing double quote
    if '"' in text: text = text.replace('."', '".')
    if "!" in text: text = text.replace('!"', '"!')
    if "?" in text: text = text.replace('?"', '"?')
    # mark the real sentence ends, then restore the protected periods
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
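For the specific case asked about (a period followed by a quote versus a quote followed by a period), a regex-only sketch is possible. It rests on an assumption that is not in the question: a quote that comes directly after a terminator and directly before a capital letter opens the next sentence, while a quote followed by whitespace closes the current one. The names SENTENCE and split_sentences are made up here.

```python
import re

# A sentence ends at . ? ! or the danda when the terminator is followed by
#   - a closing quote and then whitespace or end of text   (hinders." And)
#   - whitespace or end of text                             (cordially. 'It)
#   - an opening quote directly before a capital letter     (on."You)
SENTENCE = re.compile(r'''.+?[.?!।](?:["'](?=\s|$)|(?=\s|$|["'][A-Z]))''', re.S)

def split_sentences(text):
    """Return the sentences in text with internal whitespace normalised."""
    return [" ".join(m.group().split()) for m in SENTENCE.finditer(text)]
```

This does not handle abbreviations such as "Mr." and it drops trailing text that has no terminator, so it only covers the quoting question, not general sentence splitting.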

parse a function with pyparsing

I am trying to parse some Lua functions with pyparsing. It works for almost everything I need, except for one case where the last parameter is just a word.
So this is my code. This is my first parser using pyparsing but I did my best to structure it logically:
To explain my comments within the code:
trigger_async(<object>, <name>, <param>)
trigger_async(<parameters>)
<param> = <name> = <type>
from pyparsing import (Word, alphanums, quotedString, Optional, Suppress,
                       Group, ZeroOrMore, delimitedList, Combine)

def parse_events_from_text(text):
    variable = Word(alphanums + "-_.:()")
    # Entity on which the event will be triggered
    obj = variable.setResultsName("object")
    # Name of the event
    name = (quotedString + Optional(Suppress("..") + variable)).setResultsName("name")
    # Parameter List of the event
    paramName = variable.setResultsName("name")
    paramType = variable.setResultsName("type")
    param = Group(paramName + Suppress("=") + paramType).setResultsName("parameter")
    paramList = Group(Optional(Suppress("{") + ZeroOrMore(delimitedList(param)) + Suppress("}"))).setResultsName("parameters")
    function_parameter = obj | name | paramList
    # Function Start
    trigger = "trigger"
    # named async_suffix because `async` is a reserved word in Python 3.7+
    async_suffix = Optional("_async")
    # Function Call
    functionOpening = Combine(trigger + async_suffix + "(").setResultsName("functionOpening")
    functionCall = ZeroOrMore(Group(functionOpening + delimitedList(function_parameter) + Suppress(")")))
    resultsList = functionCall.searchString(text)
    results = []
    for resultsL in resultsList:
        if len(resultsL) != 0:
            if resultsL not in results:
                results.append(resultsL)
    return results
So the parser was written for those kinds of events:
trigger(self._entity, 'game:construction:changed', { entity = target })
trigger_async(entity, 'game:heal:healer_damaged', { healer = entity })
trigger_async(entity, 'game:heal:healer_damaged', { healer = entity, entity = target, test = party})
trigger_async(entity, 'game:heal:healer')
trigger(entity.function(), 'game:heal:healer', {})
But the problem is if there aren't any curly braces:
trigger(entity, 'game:heal:healer', entity.test)
it won't work because of my declared variable
variable = Word(alphanums + "-_.:()")
where parentheses are allowed, so the variable consumes the final ")" and the function call is left without its closing parenthesis. If I write
trigger(entity,'game:heal:healer',entity.test))
it would work.
I sat down to rewrite the parser, but I don't know how. Somehow I must express that a parenthesis belongs to the variable only if the variable has one opening and one closing parenthesis, like so:
trigger(entity,'game:heal:healer',entity.test(input))
else don't eat up that closing brace.
trigger(entity,'game:heal:healer',entity.test) <-- Variable, don't eat it!
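One possible direction, sketched under the assumption that a variable carries at most one pair of parentheses with an optional single argument: take the parentheses out of the Word and describe the call form explicitly. The names identifier and call are made up for this illustration; the rest of the parser is left as it is.

```python
from pyparsing import Word, alphanums, Combine, Optional

# plain name: no parentheses allowed inside the word itself
identifier = Word(alphanums + "-_.:")
# call form: name, "(", optional single argument, ")" with no whitespace between
call = Combine(identifier + "(" + Optional(identifier) + ")")
# try the call form first, fall back to the plain name
variable = call | identifier
```

With this definition a trailing ")" is consumed only when a matching "(" came before it, so the closing parenthesis of the surrounding trigger(...) call stays available to the outer expression.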

Highlighting certain characters in Tkinter

I'm creating a simple text editor in Python 3.4 and Tkinter. At the moment I'm stuck on the find feature.
I can find characters successfully but I'm not sure how to highlight them. I've tried the tag method without success, error:
str object has no attribute 'tag_add'.
Here's my code for the find function:
def find(): # is called when the user clicks a menu item
    findin = tksd.askstring('Find', 'String to find:')
    contentsfind = editArea.get(1.0, 'end-1c') # editArea is a scrolledtext area
    findcount = 0
    for x in contentsfind:
        if x == findin:
            findcount += 1
            print('find - found ' + str(findcount) + ' of ' + findin)
    if findcount == 0:
        nonefound = ('No matches for ' + findin)
        tkmb.showinfo('No matches found', nonefound)
        print('find - found 0 of ' + findin)
The user inputs text into a scrolledtext field, and I want to highlight the matching strings on that scrolledtext area.
How would I go about doing this?
Use tag_add to add a tag to a region. Also, instead of getting all the text and searching it yourself, you can use the search method of the widget. It will return the start of the match, and can also report how many characters matched. You can then use that information to add the tag.
It would look something like this:
...
editArea.tag_configure("find", background="yellow")
...
def find():
    findin = tksd.askstring('Find', 'String to find:')
    countVar = tk.IntVar()
    index = "1.0"
    matches = 0
    while True:
        index = editArea.search(findin, index, "end", count=countVar)
        if index == "": break
        matches += 1
        start = index
        end = editArea.index("%s + %s c" % (index, countVar.get()))
        editArea.tag_add("find", start, end)
        index = end
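To see what the start and end indices look like without opening a window, the same loop can be sketched on a plain string. This is not the widget's search method, only an illustration of Tk text indices, which have the form "line.column" with lines counted from 1 and columns from 0. The function name find_spans is made up, and matches that span a line break are not handled.

```python
def find_spans(text, needle):
    """Return Tk-style (start, end) index pairs for each match of needle."""
    spans = []
    if not needle:
        return spans  # an empty search string would never advance
    for lineno, line in enumerate(text.split("\n"), start=1):
        col = line.find(needle)
        while col != -1:
            end = col + len(needle)
            spans.append(("%d.%d" % (lineno, col), "%d.%d" % (lineno, end)))
            col = line.find(needle, end)  # continue after this match
    return spans
```

Each pair has the same shape as the start and end values passed to tag_add in the answer above.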
