Running the Pygments default C++ lexer on the following text, class foo{};, results in this:
(Token.Keyword, 'class')
(Token.Text, ' ')
(Token.Name.Class, 'foo')
(Token.Punctuation, '{')
(Token.Punctuation, '}')
(Token.Punctuation, ';')
Note that the token foo has the type Token.Name.Class.
If I change the class name to foobar, I want to be able to run the default lexer only on the touched tokens, in this case the original tokens foo and {.
Q: How can I save the lexer state so that tokenizing foobar{ will give tokens with type Token.Name.Class?
Having this feature would speed up syntax highlighting of large source files that change right in the middle (for example, while the user is typing). There seems to be no documented way of doing this, and no information on how to do it with the default Pygments lexers.
Are there any other syntax highlighting systems that support this behavior?
EDIT:
Regarding performance, here is an example: http://tpcg.io/ESYjiF
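For a rough sense of the cost of re-lexing a whole buffer, a minimal timing sketch along these lines (an illustration, not the code behind the link above) can be used:
import time
import pygments
from pygments.lexers import CppLexer

lexer = CppLexer()
code = "class foo{};\n" * 50000  # stand-in for a large source file

start = time.perf_counter()
tokens = list(pygments.lex(code, lexer))
print(len(tokens), "tokens in", time.perf_counter() - start, "seconds")
Every edit re-pays that full cost, which is what the question is trying to avoid.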
From my understanding of the source code, what you want is not possible.
I won't dig in and try to explain every single relevant line of code, but basically, here is what happens:
Your Lexer class is pygments.lexers.c_cpp.CLexer, which inherits from pygments.lexer.RegexLexer.
The pygments.lex(code, lexer) function does nothing more than call the get_tokens method on the lexer and handle errors.
lexer.get_tokens basically decodes the source code into a Unicode string and calls self.get_tokens_unprocessed.
get_tokens_unprocessed is defined by each lexer; in your case the relevant method is pygments.lexers.c_cpp.CFamilyLexer.get_tokens_unprocessed.
CFamilyLexer.get_tokens_unprocessed basically gets tokens from RegexLexer.get_tokens_unprocessed and reprocesses some of them.
Finally, RegexLexer.get_tokens_unprocessed loops over the defined token rules (something like (("function", ('pattern-to-find-c-function',)), ("class", ('pattern-to-find-c-class',)))) and, for each type (function, class, comment...), finds all matches within the source text, then processes the next type.
This behavior makes what you want impossible, because it loops over token types, not over the text.
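For illustration, calling get_tokens_unprocessed directly shows the (index, tokentype, value) tuples this chain produces, where index is the character offset into the source (a quick sketch):
from pygments.lexers import CppLexer

for index, tokentype, value in CppLexer().get_tokens_unprocessed('class foo{};'):
    print(index, tokentype, repr(value))
# e.g. 0 Token.Keyword 'class' ... 6 Token.Name.Class 'foo' ...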
To make my point more obvious, I added 2 lines of code to the library, in file pygments/lexer.py at line 628:
for rexmatch, action, new_state in statetokens:
    print('looking for {}'.format(action))
    m = rexmatch(text, pos)
    print('found: {}'.format(m))
And ran it with this code:
import pygments
import pygments.lexers
lexer = pygments.lexers.get_lexer_for_filename("foo.h")
sample="""
class foo{};
"""
print(list(lexer.get_tokens(sample)))
Output:
[...]
looking for Token.Keyword.Reserved
found: None
looking for Token.Name.Builtin
found: None
looking for <function bygroups.<locals>.callback at 0x7fb1f29b52f0>
found: None
looking for Token.Name
found: <_sre.SRE_Match object; span=(6, 9), match='foo'>
[...]
As you can see, the token types are what the code iterates over.
Taking that, together with the fact (as Tarun Lalwani said in the comments) that a single new character can break the whole source-code structure, you cannot do better than re-lexing the whole text on each update.
Related
I wrote a library using just the ast and inspect libraries to parse and emit [it uses astor on Python < 3.9] internal Python constructs.
Just realised that I really need to preserve comments after all, preferably without resorting to RedBaron or LibCST, as I just need to emit the unaltered commentary. Is there a clean and concise way of comment-preserving parsing/emitting of Python source with just the stdlib?
What I ended up doing was writing a simple parser, without a meta-language, in 339 source lines:
https://github.com/offscale/cdd-python/blob/master/cdd/cst_utils.py
Implementation of Concrete Syntax Tree [List!]
Reads source character by character;
Once the end of a statement† is detected, add the statement type to a 1D list;
†end of statement: the end of the line if line.lstrip().startswith("#") or (not line.endswith('\\') and balanced_parens(line)); otherwise continue munching until that condition holds… plus some edge cases around multiline strings and the like (a rough sketch of this check follows below);
Once finished, there is a big (1D) list where each element is a namedtuple with a value property.
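A rough sketch of that end-of-statement check (balanced_parens here is a simplified stand-in; the real cdd.cst_utils code also handles multiline strings and the other edge cases mentioned above):
def balanced_parens(line):
    # Simplified: a statement can only end once every bracket opened on the
    # accumulated line has been closed (brackets inside strings are ignored here).
    counts = {ch: 0 for ch in '()[]{}'}
    for ch in line:
        if ch in counts:
            counts[ch] += 1
    return (counts['('] == counts[')'] and
            counts['['] == counts[']'] and
            counts['{'] == counts['}'])

def is_statement_end(line):
    stripped = line.lstrip()
    return stripped.startswith('#') or (
        not line.rstrip('\n').endswith('\\') and balanced_parens(line))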
Integration with builtin Abstract Syntax Tree ast
Limit the ast nodes to modify (not remove) to: {ClassDef, AsyncFunctionDef, FunctionDef} docstrings (first body element Constant|Str), Assign, and AnnAssign;
cst_idx, cst_node = find_cst_at_ast(cst_list, _node);
if doc_str node then maybe_replace_doc_str_in_function_or_class(_node, cst_idx, cst_list)
…
Now the cst_list contains only changes to the aforementioned nodes, and only when the change is more than whitespace, and it can be turned back into a string with "".join(map(attrgetter("value"), cst_list)) for passing to eval or writing straight out to a source file (e.g., overriding it in place).
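Illustratively (the real namedtuples in cdd.cst_utils carry more fields than this single-field stand-in), the final emit step is just:
from collections import namedtuple
from operator import attrgetter

CstNode = namedtuple('CstNode', ('value',))  # simplified stand-in

cst_list = [
    CstNode('def f():\n'),
    CstNode('    """Replaced docstring."""\n'),
    CstNode('    return 1\n'),
]
print("".join(map(attrgetter("value"), cst_list)), end='')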
Quality control
100% test coverage
100% doc coverage
Support for last 6 versions of Python (including latest alpha)
CI/CD
(Apache-2.0 OR MIT) licensed
Limitations
Lack of a meta-language (specifically, not using Python's provided grammar) means new syntax elements won't automatically be supported (match/case is supported, but any syntax introduced since then isn't [yet?] supported… at least not automatically);
Not built into the stdlib, so the stdlib could break compatibility;
Deleting nodes is [probably] not supported;
Nodes can be incorrectly identified if there are shadow variables or similar issues that linters should point out.
Comments can be preserved by capturing them with the tokenizer and merging them back into the generated source code.
Given a toy program in a program variable, we can demonstrate how comments get lost in the AST:
import ast
program = """
# This comment lost
p1v = 4 + 4
p1l = ['a', # Implicit line joining comment for a lost
       'b'] # Ending comment for b lost
def p1f(x):
    "p1f docstring"
    # Comment in function p1f lost
    return x
print(p1f(p1l), p1f(p1v))
"""
tree = ast.parse(program)
print('== Full program code:')
print(ast.unparse(tree))
The output shows all comments gone:
== Full program code:
p1v = 4 + 4
p1l = ['a', 'b']
def p1f(x):
"""p1f docstring"""
return x
print(p1f(p1l), p1f(p1v))
However, if we scan the comments with the tokenizer, we can
use this to merge the comments back in:
from io import StringIO
import tokenize
def scan_comments(source):
    """ Scan source code file for relevant comments
    """
    # Find token for comments
    for k, v in tokenize.tok_name.items():
        if v == 'COMMENT':
            comment = k
            break
    comtokens = []
    with StringIO(source) as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            if token.type != comment:
                continue
            comtokens += [token]
    return comtokens
comtokens = scan_comments(program)
print('== Comment after p1l[0]\n\t', comtokens[1])
Output (edited to split long line):
== Comment after p1l[0]
TokenInfo(type=60 (COMMENT),
          string='# Implicit line joining comment for a lost',
          start=(4, 12), end=(4, 54),
          line="p1l = ['a', # Implicit line joining comment for a lost\n")
Using a slightly modified version of ast.unparse(), replacing the methods maybe_newline() and traverse() with modified versions, you should be able to merge back in all comments at their approximate locations, using the location info from the comment scanner (the start variable) combined with the location info from the AST; most nodes have a lineno attribute.
Not exactly, though. See for example the list variable assignment: the source code is split over two lines, but ast.unparse() generates only one line (see the output in the second code segment).
Also, you need to make sure to update the location info in the AST using ast.increment_lineno() after adding code.
It seems some more calls to maybe_newline() might be needed in the library code (or its replacement).
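As a very rough illustration of the idea and of its limits, one could splice the scanned comments back into the unparsed output by their original line numbers, reusing tree and comtokens from the snippets above; the result is only approximate, for exactly the reasons just given:
import ast

def merge_comments_approx(tree, comtokens):
    # Naive sketch: insert each comment at its original line number.
    # Trailing comments and the line shifts introduced by ast.unparse()
    # are not handled, so positions are approximate.
    out_lines = ast.unparse(tree).splitlines()
    # Insert from the bottom up so earlier insertions don't shift later indices.
    for tok in reversed(comtokens):
        lineno = tok.start[0]  # 1-based line in the original source
        out_lines.insert(min(lineno - 1, len(out_lines)), tok.string)
    return "\n".join(out_lines)

print(merge_comments_approx(tree, comtokens))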
I would like to add an additional syntax to Python-Markdown: if n is a positive integer, >>n should expand into a link to post n. (Double angle brackets (>>) are a conventional syntax for creating links in imageboard forums.)
By default, Python-Markdown expands >>n into nested blockquotes: <blockquote><blockquote>n</blockquote></blockquote>. Is there a way to create links out of >>n, while preserving the rest of the blockquote's default behavior? In other words, if x is a positive integer, >>x should expand into a link, but if x is not a positive integer, >>x should still expand into nested blockquotes.
I have read the relevant wiki article: Tutorial 1 Writing Extensions for Python Markdown. Based on what I learned in the wiki, I wrote a custom extension:
import markdown
import xml.etree.ElementTree as ET
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern
class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        # Create link.
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        IMAGEBOARD_LINK_RE = '>>(?P<number>[1-9][0-9]*)'
        imageboard_link = ImageboardLinkPattern(IMAGEBOARD_LINK_RE)
        md.inlinePatterns['imageboard_link'] = imageboard_link

html = markdown.markdown('>>123',
                         extensions=[ImageboardLinkExtension()])
print(html)
However, >>123 still produces <blockquote><blockquote>123</blockquote></blockquote>. What is wrong with the implementation above?
The problem is that your new syntax conflicts with the preexisting blockquote syntax. Your extension would presumably work if it were ever called. However, due to the conflict, that never happens. Note that there are five types of processors. As documented:
Preprocessors alter the source before it is passed to the parser.
Block Processors work with blocks of text separated by blank lines.
Tree Processors modify the constructed ElementTree.
Inline Processors are common tree processors for inline elements, such as *strong*.
Postprocessors munge the output of the parser just before it is returned.
Of importance here is that the processors are run in that order. In other words, all block processors are run before any inline processors are run. Therefore, the blockquote block processor runs first on your input and removes the double angle bracket, wrapping the rest of the line in double blockquote tags. By the time your inline processor sees the document, your regex will no longer match and will therefore never be called.
That being said, an inline processor is the correct way to implement a link syntax. However, you would need to do one of two things to make it work.
Alter the syntax so that it does not clash with any preexisting syntax; or
Alter the blockquote behavior to avoid the conflict.
Personally, I would recommend option 1, but I understand you are trying to implement a preexisting syntax from another environment. So, if you want to explore option 2, then I would suggest perhaps making the blockquote syntax a little more strict. For example, while it is not required, the recommended syntax is to always insert a space after the angle bracket in a blockquote. It should be relatively simple to alter the BlockquoteProcessor to require the space, which would cause your syntax to no longer clash.
This is actually pretty simple. As you may note, the entire syntax is defined via a rather simple regex:
RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')
You simply need to rewrite that so that zero whitespace is no longer accepted ("> " with a required space, rather than ">[ ]?"). First import and subclass the existing processor, then override the regex:
import re
from markdown.blockprocessors import BlockquoteProcessor

class CustomBlockquoteProcessor(BlockquoteProcessor):
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')
Finally, you just need to tell Markdown to use your custom class rather than the default. Add the following to the extendMarkdown method of your ImageboardLinkExtension class:
md.parser.blockprocessors.register(CustomBlockquoteProcessor(md.parser), 'quote', 20)
Now the blockquote syntax will no longer clash with your link syntax and you will get an opportunity to have your code run on the text. Just be careful to remember to always include the now required space for any actual blockquotes.
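Putting both pieces together, a minimal sketch of the full extension (assuming Python-Markdown 3.x; the priority of 175 for the inline pattern is just a reasonable value, not something required by the question) might look like this:
import re
import markdown
import xml.etree.ElementTree as ET
from markdown.blockprocessors import BlockquoteProcessor
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern

class CustomBlockquoteProcessor(BlockquoteProcessor):
    # Require a space after '>' so that '>>123' is no longer treated as a blockquote.
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')

class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        md.inlinePatterns.register(
            ImageboardLinkPattern(r'>>(?P<number>[1-9][0-9]*)'),
            'imageboard_link', 175)
        md.parser.blockprocessors.register(
            CustomBlockquoteProcessor(md.parser), 'quote', 20)

print(markdown.markdown('>>123', extensions=[ImageboardLinkExtension()]))
# Roughly: <p><a href="#post-123">&gt;&gt;123</a></p>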
I am trying to use Sphinx to generate documentation for my Python package. I have included well-documented docstrings within all of my functions. I have called sphinx-quickstart to create the template, filled in the template, and called make html to generate the documentation. My problem is that comments within my Python modules are also showing up in the rendered documentation. The comments are not within the functions, but rather form a header at the top of the .py files. I am required to have these comment blocks and cannot remove them, but I don't want them to show up in my documentation.
I'm currently using automodule to pull all of the functions out of the module. I know I can use autofunction to get the individual functions one by one, and this avoids the file headers, but I have a lot of functions and there must be a better way.
How can I adjust my conf.py file, or something else, so as to use automodule but only pick up the functions and not the comments outside of the functions?
This is what my .py file looks like:
"""
This code is a part of a proj repo.
https://github.com/CurtLH/proj
author: greenbean4me
date: 2018-02-07
"""
def hit(nail):
"""
Hit the nail on the head
This function takes a string and returns the first character in the string
Parameter
----------
nail : str
Can be any word with any length
Return
-------
x : str
First character in the string
Example
-------
>>> x = hit("hello")
>>> print(x)
>>> "h"
"""
return nail[0]
This is what the auto-generated documentation looks like:
Auto-Generated Documentation
This code is a part of a proj repo.
https://github.com/CurtLH/proj
author: greenbean4me date: 2018-02-07
hammer.hit(nail)
Hit the nail on the head
This function takes a string and returns the first character in the string
nail : str
Can be any word with any length
x : str
First character in the string
>>> x = hit("hello")
>>> print(x)
>>> "h"
For a more comprehensive example, check out this example repo on GitHub: https://github.com/CurtLH/proj
As far as I know, there is no way to configure autodoc not to do this.
There is, however, a simple workaround: Adding an empty docstring at the top of your module.
"""""" # <- add this
"""
This code is a part of a proj repo.
https://github.com/CurtLH/proj
author: greenbean4me
date: 2018-02-07
"""
It's barely noticeable, and will trick autodoc into not displaying your module's real docstring.
I am working on a project that uses the Unity3D game engine. For some of the pipeline requirements, it is best to be able to update some files from external tools using Python. Unity's meta and anim files are in YAML, so I thought this would be straightforward enough using PyYAML.
The problem is that Unity's format uses custom attributes and I am not sure how to work with them as all the examples show more common tags used by Python and Ruby.
Here is what the top lines of a file look like:
%YAML 1.1
%TAG !u! tag:unity3d.com,2011:
--- !u!74 &7400000
AnimationClip:
  m_ObjectHideFlags: 0
  m_PrefabParentObject: {fileID: 0}
...
When I try to read the file I get this error:
could not determine a constructor for the tag 'tag:unity3d.com,2011:74'
Now, after looking at all the other questions asked, this tag scheme does not seem to resemble those in other questions and answers. For example, this file uses "!u!", and I was unable to figure out what it means or how something similar would behave (my wild, uneducated guess is that it looks like an alias or namespace).
I could hack around it and strip the tags out, but that is not the ideal way to do this. I am looking for a solution that will properly handle the tags and allow me to parse and encode the data in a way that preserves the proper format.
Thanks,
-R
I also had this problem, and the internet was not very helpful. After bashing my head against this problem for 3 days, I was able to sort it out...or at least get a working solution. If anyone wants to add more info, please do. But here's what I got.
1) The documentation on Unity's YAML file format (they call it a "textual scene file" because it contains text that is human-readable): http://docs.unity3d.com/Manual/TextualSceneFormat.html
It is a YAML 1.1 compliant format. So you should be able to use PyYAML or any other Python YAML library to load up a YAML object.
Okay, great. But it doesn't work. Every YAML library has issues with this file.
2) The file is not correctly formed. It turns out, the Unity file has some syntactical issues that make YAML parsers error out on it. Specifically:
2a) At the top, it uses a %TAG directive to create an alias for the string "unity3d.com,2011". It looks like:
%TAG !u! tag:unity3d.com,2011:
What this means is anywhere you see "!u!", replace it with "tag:unity3d.com,2011".
2b) Then it goes on to use "!u!" all over the place before each object stream. But the problem is that - to be YAML 1.1 compliant - it should actually declare a tag alias for each stream (any time a new object starts with "--- "). Declaring it once at the top and never again is only valid for the first stream, and the next stream knows nothing about "!u!", so it errors out.
Also, this tag is useless. It basically appends "tag:unity3d.com,2011" to each entry in the stream. Which we don't care about. We already know it's a Unity YAML file. Why clutter the data?
3) The object types are given by Unity's Class ID. Here is the documentation on that:
http://docs.unity3d.com/Manual/ClassIDReference.html
Basically, each stream is defined as a new class of object...corresponding to the IDs in that link. So a "GameObject" is "1", etc. The line looks like this:
--- !u!1 &100000
So the "--- " defines a new stream. The "!u!" is an alias for "tag:unity3d.com,2011" and the "&100000" is the file ID for this object (inside this file, if something references this object, it uses this ID....remember YAML is a node-based representation, so that ID is used to denote a node connection).
The next line is the root of the YAML object, which happens to be the name of the Unity Class...example "GameObject". So it turns out we don't actually need to translate from Class ID to Human Readable node type. It's right there. If you ever need to use it, just take the root node. And if you need to construct a YAML object for Unity, just keep a dictionary around based on that documentation link to translate "GameObject" to "1", etc.
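If you ever do need the reverse mapping (human-readable name back to Class ID), a tiny hand-maintained dictionary built from that reference page is enough; only a few illustrative entries are shown here:
# A few entries from Unity's Class ID reference; extend as needed.
UNITY_CLASS_IDS = {
    'GameObject': 1,
    'Transform': 4,
    'AnimationClip': 74,
}
print(UNITY_CLASS_IDS['AnimationClip'])  # 74, matching the "--- !u!74" header above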
The other problem is that most YAML parsers (PyYAML is the one I tested) only support 3 types of YAML objects out of the box:
Scalar
Sequence
Mapping
You can define/extend custom nodes. But this amounts to hand-writing your own YAML parser, because you have to define EXPLICITLY how each YAML constructor is created and what it outputs. Why would I use a library like PyYAML and then go ahead and write my own parser to read these custom nodes? The whole point of using a library is to leverage previous work and get all that functionality from day one. I spent 2 days trying to make a new constructor for each class ID in Unity. It never worked, and I got into the weeds trying to build the constructors correctly.
THE GOOD NEWS/SOLUTION:
Turns out, all the Unity nodes I've ever run into so far are basic "Mapping" nodes in YAML. So you can throw away the custom node mapping and just let PyYAML auto-detect the node type. From there, everything works great!
In PyYAML, you can pass a file object or a string. So, my solution was to write a simple pre-parser to strip out the bits that confuse PyYAML (the parts where Unity's syntax is non-compliant) and feed the resulting string to PyYAML.
1) Remove line 2 entirely, or just ignore it:
%TAG !u! tag:unity3d.com,2011:
We don't care. We know it's a unity file. And the tag does nothing for us.
2) For each stream declaration, remove the tag alias ("!u!") and remove the class ID. Leave the fileID. Let PyYAML auto-detect the node as a Mapping node.
--- !u!1 &100000
becomes...
--- &100000
3) The rest, output as is.
The code for the pre-parser looks like this:
def removeUnityTagAlias(filepath):
    """
    Name: removeUnityTagAlias()
    Description: Loads a file object from a Unity textual scene file, which is in a pseudo-YAML style, and strips the
    parts that are not YAML 1.1 compliant. Then returns a string as a stream, which can be passed to PyYAML.
    Essentially removes the "!u!" tag directive and the class type, keeping the "&" file ID. PyYAML seems to handle
    the rest just fine after that.
    Returns: String (YAML stream as string)
    """
    result = str()
    sourceFile = open(filepath, 'r')
    for lineNumber, line in enumerate(sourceFile.readlines()):
        if line.startswith('--- !u!'):
            result += '--- ' + line.split(' ')[2] + '\n'  # remove the tag, but keep the file ID
        else:
            # Just copy the contents...
            result += line
    sourceFile.close()
    return result
To create a PyYAML object from a Unity textual scene file, call your pre-parser function on the file:
import yaml
# This fixes Unity's YAML %TAG alias issue.
fileToLoad = '/Users/vlad.dumitrascu/<SOME_PROJECT>/Client/Assets/Gear/MeleeWeapons/SomeAsset_test.prefab'
UnityStreamNoTags = removeUnityTagAlias(fileToLoad)
ListOfNodes = list()
for data in yaml.load_all(UnityStreamNoTags):
    ListOfNodes.append(data)

# Example, print each object's name and type
for node in ListOfNodes:
    if 'm_Name' in node[node.keys()[0]]:
        print('Name: ' + node[node.keys()[0]]['m_Name'] + ' NodeType: ' + node.keys()[0])
    else:
        print('Name: ' + 'No Name Attribute' + ' NodeType: ' + node.keys()[0])
Hope that helps!
-Vlad
PS. To Answer the next issue in making this usable:
You also need to walk the entire project directory and parse all ".meta" files for the "GUID", which is Unity's inter-file reference. So, when you see a reference in a Unity YAML file for something like:
m_Materials:
  - {fileID: 2100000, guid: 4b191c3a6f88640689fc5ea3ec5bf3a3, type: 2}
That file is somewhere else. And you can recursively open that one to find out any dependencies.
I just ripped through the game project and saved a dictionary of GUID:Filepath Key:Value pairs which I can match against.
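A sketch of that pass (assuming each .meta file is plain YAML with a top-level guid key, and that the asset it describes is the same path minus the ".meta" suffix):
import os
import yaml

def build_guid_index(project_root):
    # Map each GUID found in a .meta file to the asset path it describes.
    guid_to_path = {}
    for dirpath, _dirnames, filenames in os.walk(project_root):
        for filename in filenames:
            if not filename.endswith('.meta'):
                continue
            meta_path = os.path.join(dirpath, filename)
            with open(meta_path) as f:
                meta = yaml.safe_load(f)
            if meta and 'guid' in meta:
                guid_to_path[meta['guid']] = meta_path[:-len('.meta')]
    return guid_to_path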
I'm writing a C parser using PLY, and recently ran into a problem.
This code:
typedef int my_type;
my_type x;
is correct C code, because my_type is defined as a type before being used as such. I handle it by filling a type symbol table in the parser, which the lexer uses to differentiate between types and simple identifiers.
However, while the type declaration rule ends with SEMI (the ';' token), PLY shifts the token my_type from the second line before deciding it's done with the first one. Because of this, I have no chance to pass the type symbol table update to the lexer, and it sees my_type as an identifier and not a type.
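For context, the lexer-side lookup I'm describing works roughly like this (a simplified sketch, not the actual pycparser code; type_table and keywords are illustrative names):
# The parser fills type_table whenever a typedef declaration is reduced;
# the lexer consults it to decide between ID and TYPEID.
def t_ID(t):
    r'[A-Za-z_][A-Za-z_0-9]*'
    if t.value in type_table:
        t.type = 'TYPEID'
    else:
        t.type = keywords.get(t.value, 'ID')
    return t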
Any ideas for a fix?
The full code is at: http://code.google.com/p/pycparser/source/browse/trunk/src/c_parser.py
Not sure how I can create a smaller example out of this.
Edit:
Problem solved. See my solution below.
Not sure why you're doing that level of analysis in your lexer.
Lexical analysis should probably be used to separate the input stream into lexical tokens (number, line-change, keyword and so on). It's the parsing phase that should be doing that level of analysis, including table lookups for typedefs and such.
That's the way I've always separated the duties between lex and yacc, my tools of choice.
With some help from Dave Beazley (PLY's creator), my problem was solved.
The idea is to use special sub-rules and do the actions in them. In my case, I split the declaration rule to:
def p_decl_body(self, p):
    """ decl_body : declaration_specifiers init_declarator_list_opt
    """
    # <<Handle the declaration here>>

def p_declaration(self, p):
    """ declaration : decl_body SEMI
    """
    p[0] = p[1]
decl_body is always reduced before the token after SEMI is shifted in, so my action gets executed at the correct time.
I think you need to move the check for whether an ID is a TYPEID from c_lexer.py to c_parser.py.
As you said, since the parser is looking ahead 1 token, you can't make that decision in the lexer.
Instead, alter your parser to check IDs to see if they are TYPEIDs in declarations, and, if they aren't, generate an error.
As Pax Diablo said in his excellent answer, the lexer/tokenizer's job isn't to make those kinds of decisions about tokens. That's the parser's job.