I would like to add an additional syntax to Python-Markdown: if n is a positive integer, >>n should expand into a link to post n. (Double angle brackets (>>) are a conventional syntax for creating links on imageboard forums.)
By default, Python-Markdown expands >>n into nested blockquotes: <blockquote><blockquote>n</blockquote></blockquote>. Is there a way to create links out of >>n while preserving the rest of the blockquote's default behavior? In other words, if x is a positive integer, >>x should expand into a link, but if x is not a positive integer, >>x should still expand into nested blockquotes.
I have read the relevant wiki article: Tutorial 1 Writing Extensions for Python Markdown. Based on what I learned in the wiki, I wrote a custom extension:
import markdown
import xml.etree.ElementTree as ET
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern

class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        # Create link.
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        IMAGEBOARD_LINK_RE = '>>(?P<number>[1-9][0-9]*)'
        imageboard_link = ImageboardLinkPattern(IMAGEBOARD_LINK_RE)
        md.inlinePatterns['imageboard_link'] = imageboard_link

html = markdown.markdown('>>123',
                         extensions=[ImageboardLinkExtension()])
print(html)
However, >>123 still produces <blockquote><blockquote>123</blockquote></blockquote>. What is wrong with the implementation above?
The problem is that your new syntax conflicts with the preexisting blockquote syntax. Your extension would presumably work if it were ever called; due to the conflict, however, that never happens. Note that there are five types of processors. As documented:
Preprocessors alter the source before it is passed to the parser.
Block Processors work with blocks of text separated by blank lines.
Tree Processors modify the constructed ElementTree.
Inline Processors are common tree processors for inline elements, such as *strong*.
Postprocessors munge the output of the parser just before it is returned.
Of importance here is that the processors are run in that order. In other words, all block processors are run before any inline processors are run. Therefore, the blockquote block processor runs first on your input and removes the double angle brackets, wrapping the rest of the line in double blockquote tags. By the time your inline processor sees the document, your regex no longer matches, so your handler is never called.
That being said, an inline processor is the correct way to implement a link syntax. However, you would need to do one of two things to make it work.
Alter the syntax so that it does not clash with any preexisting syntax; or
Alter the blockquote behavior to avoid the conflict.
Personally, I would recommend option 1, but I understand you are trying to implement a preexisting syntax from another environment. So, if you want to explore option 2, then I would suggest perhaps making the blockquote syntax a little more strict. For example, while it is not required, the recommended syntax is to always insert a space after the angle bracket in a blockquote. It should be relatively simple to alter the BlockquoteProcessor to require the space, which would cause your syntax to no longer clash.
This is actually pretty simple. As you may note, the entire syntax is defined via a rather simple regex:
RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')
You simply need to rewrite that so that the space is required rather than optional ('> ' rather than '>[ ]?'). First import the existing processor (and re), subclass it, and then override the regex:
import re
from markdown.blockprocessors import BlockquoteProcessor

class CustomBlockquoteProcessor(BlockquoteProcessor):
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')
Finally, you just need to tell Markdown to use your custom class rather than the default. Add the following to the extendMarkdown method of your ImageboardLinkExtension class:
md.parser.blockprocessors.register(CustomBlockquoteProcessor(md.parser), 'quote', 20)
Now the blockquote syntax will no longer clash with your link syntax and you will get an opportunity to have your code run on the text. Just be careful to remember to always include the now required space for any actual blockquotes.
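Putting the pieces together, here is a minimal end-to-end sketch. It assumes a recent Python-Markdown where both registries use register(); the inline-pattern priority of 175 is an arbitrary assumption, and re-registering under the name 'quote' replaces the default blockquote processor:

import re

import markdown
import xml.etree.ElementTree as ET
from markdown.blockprocessors import BlockquoteProcessor
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern

class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class CustomBlockquoteProcessor(BlockquoteProcessor):
    # Require a space after '>', so '>>123' is no longer a blockquote.
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        # Replace the default blockquote processor with the stricter one.
        md.parser.blockprocessors.register(
            CustomBlockquoteProcessor(md.parser), 'quote', 20)
        # Register the inline link pattern (the priority is an assumption).
        md.inlinePatterns.register(
            ImageboardLinkPattern('>>(?P<number>[1-9][0-9]*)'),
            'imageboard_link', 175)

html = markdown.markdown('>>123', extensions=[ImageboardLinkExtension()])
print(html)  # e.g. <p><a href="#post-123">&gt;&gt;123</a></p>

With this, >>123 becomes a link while "> quoted" still produces a blockquote (remember that the space is now required).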
I wrote a library using just the ast and inspect modules [plus astor on Python < 3.9] to parse and emit internal Python constructs.
I just realised that I really need to preserve comments after all. Preferably without resorting to RedBaron or LibCST, as I just need to emit the unaltered commentary; is there a clean and concise way of comment-preserving parsing/emitting Python source with just the stdlib?
What I ended up doing was writing a simple parser, without a meta-language, in 339 source lines:
https://github.com/offscale/cdd-python/blob/master/cdd/cst_utils.py
Implementation of Concrete Syntax Tree [List!]
Reads source character by character;
Once the end of a statement† is detected, add the statement, tagged with its type, to a 1D list;
†end of statement: end of line if line.lstrip().startswith("#"), or if the line doesn't end with '\\' and balanced_parens(line) holds; otherwise continue munching until that condition is true… plus some edge cases around multiline strings and the like (a toy sketch follows this list);
Once finished there is a big (1D) list where each element is a namedtuple with a value property.
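A toy sketch of that end-of-statement test; the helper names here are hypothetical, and the real implementation in cdd/cst_utils.py differs (notably around strings):

def balanced_parens(text):
    """Naive bracket balance check; ignores brackets inside strings."""
    depth = 0
    for ch in text:
        if ch in '([{':
            depth += 1
        elif ch in ')]}':
            depth -= 1
    return depth == 0

def is_end_of_statement(buffered_lines):
    """The end-of-statement rule described above, applied to the lines
    accumulated so far for the current statement."""
    line = buffered_lines[-1]
    return (line.lstrip().startswith('#')
            or (not line.rstrip('\n').endswith('\\')
                and balanced_parens(''.join(buffered_lines))))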
Integration with the built-in Abstract Syntax Tree (ast)
Limit the ast nodes to modify (never remove) to: the docstrings of {ClassDef, AsyncFunctionDef, FunctionDef} (the first body element, a Constant or Str), plus Assign and AnnAssign;
cst_idx, cst_node = find_cst_at_ast(cst_list, _node);
if doc_str node then maybe_replace_doc_str_in_function_or_class(_node, cst_idx, cst_list)
…
Now the cst_list contains only the changes to those aforementioned nodes, and only when the change is more than whitespace. It can be turned back into a string with "".join(map(attrgetter("value"), cst_list)), for passing to eval or writing straight out to a source file (e.g., overriding it in place).
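For illustration, reconstructing source from such a list; CSTNode here is a stand-in for whatever namedtuple the library actually uses:

from collections import namedtuple
from operator import attrgetter

CSTNode = namedtuple('CSTNode', ('value',))
cst_list = [CSTNode('x = 1\n'), CSTNode('# keep me\n'), CSTNode('y = 2\n')]
print(''.join(map(attrgetter('value'), cst_list)), end='')
# x = 1
# # keep me
# y = 2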
Quality control
100% test coverage
100% doc coverage
Support for last 6 versions of Python (including latest alpha)
CI/CD
(Apache-2.0 OR MIT) licensed
Limitations
Lack of a meta-language: specifically, not using Python's provided grammar means new syntax elements won't automatically be supported (match/case is supported, but if new syntax has been introduced since, it isn't [yet?] supported… at least not automatically);
Not built into the stdlib, so stdlib changes could break compatibility;
Deleting nodes is [probably] not supported;
Nodes can be incorrectly identified if there are shadow variables or similar issues that linters should point out.
Comments can be preserved by capturing them with the tokenizer and merging them back into the generated source code.
Given a toy program in a program variable, we can demonstrate how comments get lost in the AST:
import ast
program = """
# This comment lost
p1v = 4 + 4
p1l = ['a', # Implicit line joining comment for a lost
       'b'] # Ending comment for b lost
def p1f(x):
    "p1f docstring"
    # Comment in function p1f lost
    return x
print(p1f(p1l), p1f(p1v))
"""
tree = ast.parse(program)
print('== Full program code:')
print(ast.unparse(tree))
The output shows all comments gone:
== Full program code:
p1v = 4 + 4
p1l = ['a', 'b']
def p1f(x):
    """p1f docstring"""
    return x
print(p1f(p1l), p1f(p1v))
However, if we scan the comments with the tokenizer, we can use this to merge the comments back in:
from io import StringIO
import tokenize
def scan_comments(source):
    """ Scan source code file for relevant comments
    """
    # Find the token id for comments
    for k, v in tokenize.tok_name.items():
        if v == 'COMMENT':
            comment = k
            break
    comtokens = []
    with StringIO(source) as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            if token.type != comment:
                continue
            comtokens += [token]
    return comtokens
comtokens = scan_comments(program)
print('== Comment after p1l[0]\n\t', comtokens[1])
Output (edited to split long line):
== Comment after p1l[0]
TokenInfo(type=60 (COMMENT),
string='# Implicit line joining comment for a lost',
start=(4, 12), end=(4, 54),
line="p1l = ['a', # Implicit line joining comment for a lost\n")
Using a slightly modified version of ast.unparse(), replacing the methods maybe_newline() and traverse() with modified versions, you should be able to merge all comments back in at their approximate locations, using the location info from the comment scanner (the start attribute) combined with the location info from the AST; most nodes have a lineno attribute.
Approximate, not exact: see for example the list variable assignment. Its source code is split over two lines, but ast.unparse() generates only one line (see the output in the second code segment). Also, you need to make sure to update the location info in the AST using ast.increment_lineno() after adding code. It seems some more calls to maybe_newline() might be needed in the library code (or its replacement).
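To make the idea concrete, here is a deliberately naive merge that re-inserts full-line comments at their original row numbers in the unparsed output. It is only a sketch: as just explained, rows drift whenever ast.unparse() reflows code, so a real merger has to reconcile the AST's lineno info as well.

def naive_merge(unparsed, comtokens):
    """Insert full-line comments back by original row (approximate!)."""
    lines = unparsed.splitlines()
    inserted = 0
    for tok in comtokens:
        if tok.line.strip() != tok.string:
            continue  # skip trailing comments for brevity
        row, col = tok.start  # 1-based position in the original source
        idx = min(row - 1 + inserted, len(lines))
        lines.insert(idx, ' ' * col + tok.string)
        inserted += 1
    return '\n'.join(lines)

# Reusing program, tree and scan_comments from the snippets above; note
# how the first comment lands one line too low, because unparse dropped
# the leading blank line.
print(naive_merge(ast.unparse(tree), scan_comments(program)))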
What advantages and/or disadvantages are there to using a "snippets" plugin, e.g. snipmate, ultisnips, for VIM as opposed to simply using the builtin "abbreviations" functionality?
Are there specific use-cases where declaring iabbr, cabbr, etc. lack some major features that the snippets plugins provide? I've been unsuccessful in finding a thorough comparison between these two "features" and their respective implementations.
As @peter-rincker pointed out in a comment:
It should be noted that abbreviations can execute code as well. Often via <c-r>= or via an expression abbreviation (<expr>). Example which expands ## to the current file's path: :iabbrev ## <c-r>=expand('%:p')<cr>
As an example for Python, let's compare a snipmate snippet and an abbreviation in Vim for inserting a class declaration.
Snipmate
# New Class
snippet cl
    class ${1:ClassName}(${2:object}):
        """${3:docstring for $1}"""
        def __init__(self, ${4:arg}):
            ${5:super($1, self).__init__()}
            self.$4 = $4
            ${6}
Vimscript
au FileType python :iabbr cl class ClassName(object):<CR><Tab>"""docstring for ClassName"""<CR>def __init__(self, arg):<CR><Tab>super(ClassName, self).__init__()<CR>self.arg = arg
Am I missing some fundamental functionality of "snippets", or am I correct in assuming they are mostly overkill, when Vim's abbr and templates (:help template) are able to do most of the stuff snippets do?
I assume it's easier to implement snippets, and they provide additional aesthetic/visual features. For instance, if I use abbr in Vim along with other plugins for running/testing Python code inside Vim (e.g. syntastic, pytest, ropevim, pep8), am I missing out on some key features that snippets provide?
Everything that can be done with snippets can be done with abbreviations and vice-versa. You can have (mirrored or not) placeholders with abbreviations, you can have context-sensitive snippets.
There are two important differences:
Abbreviations are triggered when the abbreviation text has been typed and a non-word character (or Esc) is hit. Snippets are triggered on demand, and shortcuts are possible (no need to type while + tab; w + tab may be enough).
It's much easier to define new snippets (or to maintain old ones) than to define abbreviations. With abbreviations, a lot of boilerplate code is required when we want to do neat things.
There are a few other differences. For instance, abbreviations are always triggered everywhere, and seeing for expand into for(placeholder) {\n} within a comment or a string context is certainly not what the end user expects. With snippets, this is not a problem any more: we can expect the end user to know what he's doing when he asks to expand a snippet. Still, we can propose context-aware snippets that expand throw into #throw {domain::exception} {explanation} within a comment, or into throw domain::exception({message}); elsewhere.
Snippets
Rough superset of Vim's native abbreviations. Here are the highlights:
Only trigger on key press
Uses placeholders which a user can jump between
Exist only for insert mode
Dynamic expansions
Abbreviations
Great for common typos and small snippets.
Native to Vim so no need for plugins
Typically expand on whitespace or <c-]>
Some special rules on trigger text (See :h abbreviations)
Can be used in command mode via :cabbrev (often used to create command aliases)
No placeholders
Dynamic expansions
Conclusion
For the most part snippets are more powerful and provide many features that other editors enjoy, but you can use both, and many people do. Abbreviations enjoy the benefit of being native, which can be useful in remote environments. Abbreviations also enjoy another clear advantage: they can be used in command mode.
Snippets are more powerful.
Depending on the implementation, snippets can let you change (or accept defaults for) multiple placeholders and can even execute code when the snippet is expanded.
For example with ultisnips, you can have it execute shell commands and Vimscript, but also Python code.
An (ultisnips) example:
snippet hdr "General file header" b
# file: `!v expand('%:t')`
# vim:fileencoding=utf-8:ft=`!v &filetype`
# ${1}
#
# Author: ${2:J. Doe} ${3:<jdoe#gmail.com>}
# Created: `!v strftime("%F %T %z")`
# Last modified: `!v strftime("%F %T %z")`
endsnippet
This presents you with three placeholders to fill in (it gives default values for two of them), and sets the filename, filetype and current date and time.
After the word "snippet", the start line contains three items:
the trigger string,
a description and
options for the snippet.
Personally I mostly use the b option, where the snippet is expanded at the beginning of a line, and the w option, which expands the snippet if the trigger string starts at the beginning of a word.
Note that you have to type the trigger string and then input a key or key combination that actually triggers the expansion. So a snippet is not expanded unless you want it to.
Additionally, snippets can be specialized by filetype. Suppose you want to define four levels of headings, h1 .. h4. You can have the same name expand differently between e.g. an HTML, markdown, LaTeX or restructuredtext file.
Snippets are like the built-in :abbreviate on steroids, usually with:
parameter insertions: You can insert (type or select) text fragments in various places inside the snippet. An abbreviation just expands once.
mirroring: Parameters may be repeated (maybe even in transformed fashion) elsewhere in the snippet, usually updated as you type.
multiple stops inside: You can jump from one point to another within the snippet, sometimes even recursively expand snippets within one.
There are three things to evaluate in a snippet plugin: first, the features of the snippet engine itself; second, the quality and breadth of snippets provided by the author or others; and third, how easy it is to add new snippets.
I'm using the ABPY library (library here) for Python, but I think it's for an older version; I'm using Python 3.3.
I did fix some print errors, but that's about as much as I know; I'm really new to programming.
I want to fetch a webpage, filter out the advertising, and then print it again.
EDITED: after Sg'te'gmuj told me how to convert from Python 2.x to 3.x, this is my new code:
#!/usr/local/bin/python3.1
import cgitb;cgitb.enable()
import urllib.request
response = urllib.request.build_opener()
response.addheaders = [('User-agent', 'Mozilla/5.0')]
response = urllib.request.urlopen("http://www.youtube.com")
html = response.read()
from abpy import Filter
with open("easylist.txt") as f:
    ABPFilter = Filter(file('easylist.txt'))
    ABPFilter.match(html)
print("Content-type: text/html")
print()
print (html)
Now it is displaying a blank page.
I just took a peek at the library; it seems that the file "easylist.txt" does not exist. You need to create the file and populate it with the appropriate filters (in whatever format ABP specifies).
Additionally, it appears it takes a file object; try something like this instead:
with open("easylist.txt") as f:
    ABPFilter = Filter(f)
I can't say this is wholly accurate though, since I have no experience with the library, but looking at its code I'd suspect either of the two is the problem, if not both.
Addendum #1
Looking at the code more in depth, I have to agree that even if the fix I supplied works, you're going to have more problems (the library is in 2.x as you suggested, while you're using 3.x). I'd suggest using Python's 2to3 tool to convert typical Python 2 code to Python 3 code (it's not foolproof though). The command line is as follows:
2to3 -w abpy.py
That will convert it from Python 2.x to 3.x code, and re-write the source file.
Addendum #2
The code to pass the file object should be the "f" variable, as shown above (modified to represent that; I wasn't paying attention and just left the old file function call in the argument).
You need to pass a URI to the function as well:
ABPFilter.match(URI)
You'll need to modify the code to pass those items in as an array (I'm assuming, at least); I'm playing with it now to see. At present I'm getting a rule error (not a Python error, but merely error handling used by abpy.py, which is good because it suggests this is the right train of thought).
The code for the Filter.match function is as following (after using the 2to3 Python script):
def match(self, url, elementtype=None):
    tokens = RE_TOK.split(url)
    print(tokens)
    for tok in tokens:
        if len(tok) > 2:
            if tok in self.index:
                for rule in self.index[tok]:
                    if rule.match(url, elementtype=elementtype):
                        print(str(rule))
What this means is that you're, at present, at a point where you need to program the functionality yourself; it appears this module only reports the matching rule. However, that is still useful.
It also means you're going to have to modify this function to take the HTML in place of the "url" parameter. You'd regex the HTML (this may be rather intensive) for a list of URIs and then run each item through the match loop (a sketch follows below). Where you go from there to actually filter the nodes, I'm not sure; but there is a list of filter types, so I'm assuming there is a typical procedure ABP uses to remove the nodes (possibly, in some cases, merely by removing the given URI from the HTML?).
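A hypothetical sketch of that approach; the regex is illustrative and far from exhaustive, and it assumes abpy's Filter and its (2to3-converted) match shown above:

import re
import urllib.request

from abpy import Filter

html = urllib.request.urlopen("http://www.youtube.com").read()
text = html.decode('utf-8', errors='replace')

with open("easylist.txt") as f:
    adblock_filter = Filter(f)

# Run every candidate URI in the page through the filter; Filter.match
# prints any rule that matches, per the converted code above.
for uri in re.findall(r'(?:href|src)=["\']([^"\']+)["\']', text):
    adblock_filter.match(uri)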
References
http://docs.python.org/3.3/library/2to3.html
Situation:
I am writing a basic templating system in Python/mod_python that reads in a main HTML template and replaces instances of ":value:" throughout the document with additional HTML or db results and then returns it as a view to the user.
I am not trying to replace every instance of one substring; the values vary, but there is a finite list of what's acceptable. The syntax for the values is [colon]value[colon]; examples might be :gallery:, :related:, :comments:. The replacement may be additional static HTML or a call to a function, and the functions may vary as well.
Question:
What's the most efficient way to read in the main HTML file and replace the unknown combination of values with their appropriate replacement?
Thanks in advance for any thoughts/solutions,
c
There are dozens of templating options that already exist. Consider genshi, mako, jinja2, django templates, or more.
You'll find that you're reinventing the wheel with little/no benefit.
If you can't use an existing templating system for whatever reason, your problem seems best tackled with regular expressions:
import re

valre = re.compile(r':\w+:')

def dosub(correspvals, correspfuns, lastditch):
    def f(value):
        v = value.group()[1:-1]
        if v in correspvals:
            return correspvals[v]
        if v in correspfuns:
            return correspfuns[v]()  # or whatever args you need
        # what if a value has neither a corresponding value to
        # substitute, NOR a function to call? Whatever...:
        return lastditch(v)
    return f

replacer = dosub(adict, another, somefun)
thehtml = valre.sub(replacer, thehtml)
Basically you'll need two dictionaries (one mapping values to corresponding replacement strings, another mapping values to corresponding functions to be called) and a function to be called as a last-ditch attempt for values that can't be found in either dictionary; the code above shows how to put these things together (I'm using a closure; a class would of course do just as well) and how to apply them for the required replacement task.
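For instance, a minimal setup reusing valre and dosub from the code above (the sample names and values are purely illustrative):

adict = {'gallery': '<div class="gallery">…</div>'}
another = {'comments': lambda: '<ol id="comments"></ol>'}

def somefun(name):
    return f'<!-- unknown value: {name} -->'

thehtml = '<body>:gallery: :comments: :mystery:</body>'
print(valre.sub(dosub(adict, another, somefun), thehtml))
# -> <body><div class="gallery">…</div> <ol id="comments"></ol> <!-- unknown value: mystery --></body>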
This is probably a job for a templating engine, and for Python there are a number of choices. In this Stack Overflow question people have listed their favourites, and some helpfully explain why: What is your single favorite Python templating engine?
I'm writing a C parser using PLY, and recently ran into a problem.
This code:
typedef int my_type;
my_type x;
is correct C code, because my_type is defined as a type before being used as such. I handle it by filling a type symbol table in the parser that the lexer uses to differentiate between types and simple identifiers.
However, while the type declaration rule ends with SEMI (the ';' token), PLY shifts the token my_type from the second line before deciding it's done with the first one. Because of this, I have no chance to pass the update in the type symbol table to the lexer, and it sees my_type as an identifier and not a type.
Any ideas for a fix?
The full code is at: http://code.google.com/p/pycparser/source/browse/trunk/src/c_parser.py
Not sure how I can create a smaller example out of this.
Edit:
Problem solved. See my solution below.
Not sure why you're doing that level of analysis in your lexer.
Lexical analysis should probably be used to separate the input stream into lexical tokens (number, line-change, keyword, and so on). It's the parsing phase that should be doing that level of analysis, including table lookups for typedefs and such.
That's the way I've always separated the duties between lex and yacc, my tools of choice.
With some help from Dave Beazley (PLY's creator), my problem was solved.
The idea is to use special sub-rules and do the actions in them. In my case, I split the declaration rule to:
def p_decl_body(self, p):
    """ decl_body : declaration_specifiers init_declarator_list_opt
    """
    # <<Handle the declaration here>>

def p_declaration(self, p):
    """ declaration : decl_body SEMI
    """
    p[0] = p[1]
decl_body is always reduced before the token after SEMI is shifted in, so my action gets executed at the correct time.
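As a purely illustrative sketch of what "handle the declaration here" might involve (the attribute and table names are hypothetical; the real pycparser code differs), the action can record typedef names as soon as decl_body reduces, before the lexer classifies the next token:

def p_decl_body(self, p):
    """ decl_body : declaration_specifiers init_declarator_list_opt
    """
    spec, decls = p[1], p[2]
    # Hypothetical bookkeeping: if this declaration is a typedef, add
    # each declared name to the table the lexer consults when deciding
    # between ID and TYPEID.
    if decls is not None and 'typedef' in spec.storage:
        for decl in decls:
            self.type_symbol_table.add(decl.name)
    p[0] = (spec, decls)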
I think you need to move the check for whether an ID is a TYPEID from c_lexer.py to c_parser.py.
As you said, since the parser is looking ahead 1 token, you can't make that decision in the lexer.
Instead, alter your parser to check IDs to see if they are TYPEIDs in declarations, and, if they aren't, generate an error.
As Pax Diablo said in his excellent answer, the lexer/tokenizer's job isn't to make those kinds of decisions about tokens. That's the parser's job.