How do you parse sections of text with Lark in Python?

I'm trying to figure out how to use the Lark Python Module to parse a document that looks like this:
---> TITLE
Introduction
---> CONTENT
The quick
Brown fox
---> TEST
Jumps over
---> CONTENT
The lazy dog
Each ---> marks the start of a section of a specific type; the section's content runs until the next ---> line starts a new section.
So far, I have this:
from lark import Lark

parser = Lark(r"""
    start: section*
         | line*
    section.1 : "---> " SECTION_TITLE "\n\n"
    SECTION_TITLE.1 : "TITLE" | "CONTENT" | "SOURCE" | "OUTPUT"
    line.-1: ANY_LINE
    ANY_LINE.-1: /.+\n*/
""", start='start')

with open("src/index.mdx") as _in:
    print(parser.parse(_in.read()))
It parses the file, but everything shows up in ANY_LINE tokens instead of splitting out the section headers. I'm new to this type of parser and feel like I'm missing something obvious, but I haven't been able to figure it out.

I think this is doing what I'm after. Not marking this as the answer for now, in case other folks have better ideas.
parser = Lark(r"""
    start: section*
    section : THING SECTION_TITLE line*
    THING : "--->"
    SECTION_TITLE : "TITLE" | "CONTENT" | "SOURCE" | "OUTPUT" | "TEST"
    line: ANY_LINE
    ANY_LINE.-1: /.+\n*/
    %import common.WS
    %ignore WS
""", start='start')

Related

How to use Signature or User ID to get user's information?

I'm a beginner at Wikimedia, and I'm using the wiki API to finish my project. My dataset looks like this:
rev_id    | comment              | timestamp            | page_id  | page_title      | user_id | user_text
----------|----------------------|----------------------|----------|-----------------|---------|----------
352194497 | Welcome to Wikipedia | 2010-03-26T18:16:48Z | 26709696 | 116.197.206.138 | 8356162 | Mlpearc
I'm trying to find some user information for these comment posters. However, I find that the "user_text" here is not the user name but the signature. If I use the official API demo get_users.py to get the information, it returns an error, because some signatures have spaces in them while usernames are all a single word. For example, with the code below I can get the information for Catrope and Bob using Catrope|Bob, but it doesn't work if I use Catrope|Tide rolls, where Tide rolls is the signature.
import requests

S = requests.Session()
URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query",
    "format": "json",
    "list": "users",
    "ususers": "Catrope|Tide rolls",
    "usprop": "blockinfo|groups|editcount|registration|emailable|gender"
}

R = S.get(url=URL, params=PARAMS)
DATA = R.json()

USERS = DATA["query"]["users"]

for u in USERS:
    print(str(u["name"]) + " has " + str(u["editcount"]) + " edits.")
So my question is: is there any way to get user information through the signature using the API? And since we also have page_id and user_id here, will that information be helpful? Thank you so much in advance!
Update: I used Bob Ben here as a fake ID; it has now been replaced by a real one. Problem solved by using _ to replace the space. (Thanks for the reminder from AXO.)
You've not mentioned the error and traceback that you're getting. The code sample should work fine as long as the username exists, even if the username has a space in it.
But user account "Bob Ben" is not registered. In such cases the API replies with {'name': 'Bob Ben', 'missing': ''}.
So your code could be:
for u in USERS:
    if 'missing' not in u:
        print(u["name"] + " has " + str(u["editcount"]) + " edits.")
    else:
        print(u["name"], "is not registered.")
BTW, if for some reason you prefer not to use a space, you may use _ (underscore) instead. A blank space is equivalent to an underscore.
Regarding "user information", I'm not sure what kind of information you're looking for. According to API:Users one may get blockinfo|groups|groupmemberships|implicitgroups|rights|editcount|registration|emailable|gender|centralids|cancreate using the usprop parameter. But if some other information, for example the information on the user page, is to be fetched, then you'll perhaps need to use one of the methods mentioned in API:Get the contents of a page to get the contents of the user page and then write a program to look for the information you need.

How to output CDATA using yattag library

With the following code, I'm trying to generate an XML file that contains <documents> tags.
string = "dasdd Wonder asdf new single, “Tomorrow” #URL# | " \
"oiojk asfddsf releases new asdfdf, “gfsg” | " \
"Identity of asfqw who dasd off asdfsdf Mainland jtyjyjui revealed #URL#"
from yattag import Doc, indent
import html, re
doc, tag, text = Doc().tagtext()
with tag('author', lang='en'):
with tag('documents'):
for tweet in string.split(' | '):
with tag('document'):
tweet = html.unescape(tweet)
text('<![CDATA[{}]]'.format(tweet))
result = indent(doc.getvalue(), indentation=' ' * 4, newline='\n')
with open('test.xml', 'w', encoding='utf-8') as f:
f.write(result)
I wanted to add a CDATA token around the text, but when I open the generated file using Notepad++, instead of having the output as:
<document><![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]></document>
it appears like (with HTML entities):
<document>&lt;![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]</document>
I tried to use the html library (the html.unescape line) to get rid of the HTML entities, but I wasn't able to.
How can I solve this encoding issue?
The text method always escapes '<', replacing it with '&lt;'. If you wanted no escaping of that kind, you would use the asis method instead (it inserts the string "as is"). But, in your case, it would be more appropriate to use Yattag's cdata method.
from yattag import Doc
help(Doc.cdata)
cdata(self, strg, safe=False)
    Appends a CDATA section containing the supplied string.

    You don't have to worry about potential ]]> sequences that would terminate
    the CDATA section. They are replaced with ]]]]><![CDATA[>.
    If you're sure your string does not contain ]]>, you can pass safe = True.
    If you do that, your string won't be searched for ]]> sequences.
So, in your case, you can do:
for tweet in string.split(' | '):
    with tag('document'):
        tweet = html.unescape(tweet)
        doc.cdata(tweet)
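Put together with the rest of your script, a minimal self-contained version (reusing the `string` value from the question) would look roughly like this:

from yattag import Doc
import html

string = "dasdd Wonder asdf new single, “Tomorrow” #URL# | " \
         "oiojk asfddsf releases new asdfdf, “gfsg” | " \
         "Identity of asfqw who dasd off asdfsdf Mainland jtyjyjui revealed #URL#"

doc, tag, text = Doc().tagtext()

with tag('author', lang='en'):
    with tag('documents'):
        for tweet in string.split(' | '):
            with tag('document'):
                doc.cdata(html.unescape(tweet))   # written verbatim inside <![CDATA[ ... ]]>

result = doc.getvalue()   # the indent(...) call from the question can still be applied here

with open('test.xml', 'w', encoding='utf-8') as f:
    f.write(result)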

maintaining formatting of imported text with mako and rst2pdf

I've created a template which renders pdf files from csv input. However, when the csv input fields contain user formatting, with line breaks and indentations, it messes with the rst2pdf formatting engine. Is there a way to consistently deal with user input in a way that doesn't break the document flow, but also maintains the formatting of the input text? Example script below:
from mako.template import Template
from rst2pdf.createpdf import RstToPdf

mytext = """This is the first line
Then there is a second
Then a third
    This one could be indented
I'd like it to maintain the formatting."""

template = """
My PDF Document
===============

It starts with a paragraph, but after this I'd like to insert `mytext`.
It should keep the formatting intact, though I don't know what formatting to expect.

${mytext}
"""

mytemplate = Template(template)
pdf = RstToPdf()
pdf.createPdf(text=mytemplate.render(mytext=mytext), output='foo.pdf')
I have tried adding the following function in the template to insert | at the start of each line, but that doesn't seem to work either.
<%!
def wrap(text):
    return text.replace("\\n", "\\n|")
%>
Then ${mytext} would become |${mytext | wrap}. This throws the error:
<string>:10: (WARNING/2) Inline substitution_reference start-string without end-string.
Actually, it turns out I was on the right track; I just needed a space between the | and the text. So the following code works:
from mako.template import Template
from rst2pdf.createpdf import RstToPdf

mytext = """This is the first line
Then there is a second
Then a third
    How about an indent?
I'd like it to maintain the formatting."""

template = """
<%!
def wrap(text):
    return text.replace("\\n", "\\n| ")
%>
My PDF Document
===============

It starts with a paragraph, but after this I'd like to insert `mytext`.
It should keep the formatting intact.

| ${mytext | wrap}
"""

mytemplate = Template(template)
pdf = RstToPdf()
# print(mytemplate.render(mytext=mytext))
pdf.createPdf(text=mytemplate.render(mytext=mytext), output='foo.pdf')
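For what it's worth, the leading "| " works because it turns the inserted text into a reStructuredText line block, which preserves line breaks and initial indentation. If you want to see the intermediate reST that rst2pdf receives, you can print the rendered template before creating the PDF (this is just the commented-out line above, spelled out; the variable name `rendered` is mine):

rendered = mytemplate.render(mytext=mytext)
print(rendered)   # the inserted text should now appear as a series of '| ' lines
pdf.createPdf(text=rendered, output='foo.pdf')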

Regular Expression to remove " ' from a string in Python

I am fetching my result from an RSS feed using the following code:
try:
    desc = item.xpath('description')[0].text
    if date is not None:
        desc = date + "\n" + "\n" + desc
except:
    desc = None
But sometimes the description in the feed contains a few HTML character entities, as below:
The text from XML looks like &quot; and with &apos; and other &amp;...; stuff
I do not want these to show up when the content is displayed. Is there a regular expression to remove them?
I used something called "Unescaping XML"; I don't know if it's helpful to you.
see : http://wiki.python.org/moin/EscapingXml
from xml.sax.saxutils import unescape

unescape("&lt; &amp; &gt;")
'< & >'

unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})
'\' "'
Edit: Just saw this, may be interesting (not tested): unescape with urllib.
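On Python 3.4+ there's also html.unescape in the standard library, which handles all named and numeric character references without you having to pass a mapping; applied to a string like the one in the question:

import html

raw = "The text from XML looks like &quot; and with &apos; and other &amp;...; stuff"
print(html.unescape(raw))
# -> The text from XML looks like " and with ' and other &...; stuff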

Find all strings in python code files

I would like to list all strings within my large Python project.
Imagine the different possibilities to create a string in Python:
mystring = "hello world"
mystring = ("hello "
"world")
mystring = "hello " \
"world"
I need a tool that outputs "filename, linenumber, string" for each string in my project. Strings that are spread over multiple lines using "\" or implicit concatenation inside parentheses should be reported on a single line.
Any ideas how this could be done?
unwind's suggestion of using the ast module in 2.6 is a good one. (There's also the undocumented _ast module in 2.5.) Here's example code for that:
code = """a = 'blah'
b = '''multi
line
string'''
c = u"spam"
"""
import ast
root = ast.parse(code)
class ShowStrings(ast.NodeVisitor):
def visit_Str(self, node):
print "string at", node.lineno, node.col_offset, repr(node.s)
show_strings = ShowStrings()
show_strings.visit(root)
The problem is multiline strings. If you run the above you'll get:
string at 1 4 'blah'
string at 4 -1 'multi\nline\nstring'
string at 5 4 u'spam'
You see that it doesn't report the start of the multiline string, only the end. There's no good solution for that using the builtin Python tools.
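As an aside for readers on current Python: from 3.8 onwards, as far as I know, ast reports the starting line of a multiline string and also exposes end_lineno/end_col_offset, and string literals appear as ast.Constant nodes, so a modern equivalent of the visitor above would be roughly:

import ast

code = '''a = 'blah'
b = """multi
line
string"""
c = "spam"
'''

root = ast.parse(code)
for node in ast.walk(root):
    # On Python 3.8+ string literals are ast.Constant nodes with a str value
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        print("string at", node.lineno, node.col_offset, repr(node.value))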
Another option is to use my 'python4ply' module. This is a grammar definition for Python for PLY, which is a parser generator. Here's how you might use it:
import compiler
import compiler.visitor
# from python4ply; requires the ply parser generator
import python_yacc

code = """a = 'blah'
b = '''multi
line
string'''
c = u"spam"
d = 1
"""

tree = python_yacc.parse(code, "<string>")
#print tree

class ShowStrings(compiler.visitor.ASTVisitor):
    def visitConst(self, node):
        if isinstance(node.value, basestring):
            print "string at", node.lineno, repr(node.value)

visitor = ShowStrings()
compiler.walk(tree, visitor)
The output from this is
string at 1 'blah'
string at 2 'multi\nline\nstring'
string at 5 u'spam'
There's no support for column information. (There is some mostly complete commented out code to support that, but it's not fully tested.) Then again, I see you don't need it. It also means working with Python's 'compiler' module, which is clumsier than the AST module.
Still, with 30-40 lines of code you should have exactly what you want.
Python's included tokenize module will also do the trick.
from __future__ import with_statement
import sys
import tokenize

for filename in sys.argv[1:]:
    with open(filename) as f:
        for toktype, tokstr, (lineno, _), _, _ in tokenize.generate_tokens(f.readline):
            if toktype == tokenize.STRING:
                strrepr = repr(eval(tokstr))
                print filename, lineno, strrepr
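A Python 3 version of the same idea, using tokenize.open to respect the source file's encoding declaration, might look like this; note it prints the literal as written rather than eval'ing it:

import sys
import tokenize

for filename in sys.argv[1:]:
    with tokenize.open(filename) as f:                     # honours any coding: declaration
        for tok in tokenize.generate_tokens(f.readline):
            if tok.type == tokenize.STRING:
                print(filename, tok.start[0], tok.string)  # start[0] is the line number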
If you can do this in Python, I'd suggest starting by looking at the ast (Abstract Syntax Tree) module, and going from there.
Are you asking about the I18N utilities in Python?
http://docs.python.org/library/gettext.html#internationalizing-your-programs-and-modules
There's a utility called po-utils (formerly xpot) that can help with this.
http://po-utils.progiciels-bpi.ca/README.html
You may also consider parsing your code with pygments.
I don't know the other solutions, but it sure is very simple to use.
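A sketch of that approach, assuming Pygments is installed: get_tokens_unprocessed gives a character offset for each token, which can be converted to a line number by counting newlines (note that Pygments may split one literal into several tokens, e.g. the quotes, the body, and any escapes):

import sys
from pygments.lexers import PythonLexer
from pygments.token import String

for filename in sys.argv[1:]:
    with open(filename) as f:
        code = f.read()
    for index, token_type, value in PythonLexer().get_tokens_unprocessed(code):
        if token_type in String:                       # matches String.Double, String.Single, ...
            lineno = code.count("\n", 0, index) + 1    # convert character offset to line number
            print(filename, lineno, repr(value))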
Gettext might help you. Put your strings in _(...) structures:
a = _('Test')
b = a
c = _('Another text')
Then run in the shell prompt:
pygettext test.py
You'll get a messages.pot file with the information you need:
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2009-02-25 08:48+BRT\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: ENCODING\n"
"Generated-By: pygettext.py 1.5\n"

#: teste.py:1
msgid "Test"
msgstr ""

#: teste.py:3
msgid "Another text"
msgstr ""
