PLY: Token shifting problem in C parser - python

I'm writing a C parser using PLY, and recently ran into a problem.
This code:
typedef int my_type;
my_type x;
is correct C code, because my_type is defined as a type before it is used as one. I handle it by filling a type symbol table in the parser that gets used by the lexer to differentiate between types and simple identifiers.
However, although the type declaration rule ends with SEMI (the ';' token), PLY shifts the my_type token from the second line before it decides it's done with the first. Because of this, I have no chance to pass the type symbol table update to the lexer, so it sees my_type as an identifier and not a type.
Any ideas for a fix?
The full code is at: http://code.google.com/p/pycparser/source/browse/trunk/src/c_parser.py
Not sure how I can create a smaller example out of this.
Edit:
Problem solved. See my solution below.

Not sure why you're doing that level of analysis in your lexer.
Lexical analysis should probably be used to separate the input stream into lexical tokens (number, line-change, keyword and so on). It's the parsing phase that should be doing that level of analysis, including table lookups for typedefs and such.
That's the way I've always separated the duties between lex and yacc, my tools of choice.

With some help from Dave Beazley (PLY's creator), my problem was solved.
The idea is to use special sub-rules and do the actions in them. In my case, I split the declaration rule into:
def p_decl_body(self, p):
    """ decl_body : declaration_specifiers init_declarator_list_opt
    """
    # <<Handle the declaration here>>

def p_declaration(self, p):
    """ declaration : decl_body SEMI
    """
    p[0] = p[1]
decl_body is always reduced before the token after SEMI is shifted in, so my action gets executed at the correct time.
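For reference, here is a rough sketch of what the decl_body action and the lexer hook might look like together; self.typedef_names and the declaration objects are illustrative assumptions, not pycparser's actual API:
# Sketch only: the point is that the decl_body action runs before PLY
# shifts the first token of the *next* declaration, so the typedef table
# is up to date when the lexer tokenizes my_type on the following line.
def p_decl_body(self, p):
    """ decl_body : declaration_specifiers init_declarator_list_opt
    """
    spec, decls = p[1], p[2]
    if spec.is_typedef:                          # hypothetical attribute
        for d in decls or []:
            self.typedef_names.add(d.name)       # table shared with the lexer
    p[0] = (spec, decls)

# Lexer side: PLY gives this token the default type 'ID'; upgrade it to
# TYPEID when the identifier is a known typedef name.
def t_ID(self, t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    if t.value in self.typedef_names:
        t.type = 'TYPEID'
    return t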

I think you need to move the check for whether an ID is a TYPEID from c_lexer.py to c_parser.py.
As you said, since the parser is looking ahead 1 token, you can't make that decision in the lexer.
Instead, alter your parser to check IDs to see if they are TYPEIDs in declarations and, if they aren't, generate an error.
As Pax Diablo said in his excellent answer, the lexer/tokenizer's job isn't to make those kinds of decisions about tokens. That's the parser's job.
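For illustration, a parser-side check could look roughly like this; the typedef_names table and the production name are hypothetical, and a real C grammar would need more care to keep such a rule unambiguous:
# Sketch: reject an identifier that is used where a type name is required.
# self.typedef_names is a hypothetical table filled in by typedef declarations.
def p_type_specifier_typename(self, p):
    """ type_specifier : ID """
    if p[1] not in self.typedef_names:
        print("Line %d: '%s' is not a type name" % (p.lineno(1), p[1]))
        raise SyntaxError          # triggers PLY's normal error recovery
    p[0] = p[1]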

Related

How does VS Code get IntelliSense hints from Python comments?

I am seeing that VS Code is getting intellisense hints from the comments I put in my class functions:
def GyroDriveOnHeading(self, desiredHeading, desiredDistance):
    """
    Drives the robot very straight on a given heading for a \
    given distance, using the acceleration and the gyro. \
    Accelerates to prevent wheel slipping. \
    Gyro keeps the robot pointing on the desired heading.
    Minimum distance that this will work for is about 16cm.
    If you need to go a very short distance, use move_tank.
    Parameters
    -------------
    desiredHeading: On what heading should the robot drive (float)
        type: float
        values: any. Best if the desired heading is close to the current heading. Unpredictable robot movement may occur for large heading differences.
        default: no default value
    desiredDistance: How far the robot should go in cm (float)
        type: float
        values: any value above 16.0. You can enter smaller numbers, but the robot will still go 16cm
        default: no default value
    Example
    -------------
    import base_robot
    br = base_robot.BaseRobot()
    br.GyroDriveOnHeading(90, 40) #drive on heading 90 for 40 cm
    """
Which gives me a really nice popup when I use that function:
As you can see here, since I am about to enter the first parameter, desiredHeading, the intellisense was smart enough to know that the line in the comments under "Parameters" that starts with the variable name should be the first thing displayed in the hint. And indeed, once I type the first parameter and a comma, the first line of the intellisense popup changes to show the information about desiredDistance.
But I would like to know more about how the comments should be written. I read about the NumPy style guide as being the closest thing to a widely adopted standard, but when I changed the parameter documentation format to match NumPy (and something called Sphinx has something to do with this too, I think), the popups were not the same. Really, I just want to see the documentation on how to document (yikes!) my Python code so it renders correct IntelliSense. For example, how can I bold a word in the middle of a sentence? Are there other formatting options available?
This is just for a middle-school robotics club, nothing like production code for real programmers. Nothing is broken, I just want to learn more about how this works.
Those are docstrings in Python; here is an introduction to them:
https://docs.python.org/3.10/tutorial/controlflow.html#documentation-strings
https://peps.python.org/pep-0287/
In addition, you can add type annotations to the parameters, stub-style, like this:
def open(url: str, new: int = ..., autoraise: bool = ...) -> bool: ...
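As a sketch (names adapted from the question, not an official VS Code specification), a NumPy-style docstring combined with type annotations is what Pylance typically renders nicely in its hover and signature popups:
def gyro_drive_on_heading(self, desired_heading: float, desired_distance: float) -> None:
    """Drive the robot very straight on a given heading for a given distance.

    Parameters
    ----------
    desired_heading : float
        Heading to drive on, in degrees. Best kept close to the current
        heading; large differences may cause unpredictable movement.
    desired_distance : float
        Distance to drive, in cm. Values below 16.0 still travel about 16 cm.

    Examples
    --------
    >>> br = base_robot.BaseRobot()   # assumes the question's base_robot module
    >>> br.gyro_drive_on_heading(90, 40)
    """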

Can Python-Markdown support imageboard-style links?

I would like to add an additional syntax to Python-Markdown: if n is a positive integer, >>n should expand into a link to post n. (Double angled brackets (>>) are a conventional syntax for creating links in imageboard forums.)
By default, Python-Markdown expands >>n into nested blockquotes: <blockquote><blockquote>n</blockquote></blockquote>. Is there a way to create links out of >>n, while preserving the rest of blockquote's default behavior? In other words, if x is a positive integer, >>x should expand into a link, but if x is not a positive integer, >>x should still expand into nested blockquotes.
I have read the relevant wiki article: Tutorial 1 Writing Extensions for Python Markdown. Based on what I learned in the wiki, I wrote a custom extension:
import markdown
import xml.etree.ElementTree as ET
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern

class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        # Create link.
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        IMAGEBOARD_LINK_RE = '>>(?P<number>[1-9][0-9]*)'
        imageboard_link = ImageboardLinkPattern(IMAGEBOARD_LINK_RE)
        md.inlinePatterns['imageboard_link'] = imageboard_link

html = markdown.markdown('>>123', extensions=[ImageboardLinkExtension()])
print(html)
However, >>123 still produces <blockquote><blockquote>123</blockquote></blockquote>. What is wrong with the implementation above?
The problem is that your new syntax conflicts with the preexisting blockquote syntax. Your extension would presumably work if it was ever called. However, due to the conflict, that never happens. Note that there are five types of processors. As documented:
Preprocessors alter the source before it is passed to the parser.
Block Processors work with blocks of text separated by blank lines.
Tree Processors modify the constructed ElementTree
Inline Processors are common tree processors for inline elements, such as *strong*.
Postprocessors munge the output of the parser just before it is returned.
Of importance here is that the processors are run in that order. In other words, all block processors are run before any inline processors are run. Therefore, the blockquote block processor runs first on your input and removes the double angle bracket, wrapping the rest of the line in double blockquote tags. By the time your inline processor sees the document, your regex will no longer match and will therefore never be called.
That being said, an inline processor is the correct way to implement a link syntax. However, you would need to do one of two things to make it work.
Alter the syntax so that it does not clash with any preexisting syntax; or
Alter the blockquote behavior to avoid the conflict.
Personally, I would recommend option 1, but I understand you are trying to implement a preexisting syntax from another environment. So, if you want to explore option 2, then I would suggest perhaps making the blockquote syntax a little more strict. For example, while it is not required, the recommended syntax is to always insert a space after the angle bracket in a blockquote. It should be relatively simple to alter the BlockquoteProcessor to require the space, which would cause your syntax to no longer clash.
This is actually pretty simple. As you may note, the entire syntax is defined via a rather simple regex:
RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')
You simply need to rewrite that so that zero whitespace after the angle bracket is no longer accepted (a required '> ' rather than '>[ ]?'). First import and subclass the existing processor, then override the regex:
from markdown.blockprocessors import BlockquoteProcessor
class CustomBlockquoteProcessor(BlockquoteProcessor):
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')
Finally, you just need to tell Markdown to use your custom class rather than the default. Add the following to the extendMarkdown method of your ImageboardLinkExtension class:
md.parser.blockprocessors.register(CustomBlockquoteProcessor(md.parser), 'quote', 20)
Now the blockquote syntax will no longer clash with your link syntax and you will get an opportunity to have your code run on the text. Just be careful to remember to always include the now required space for any actual blockquotes.
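Putting the pieces together, a complete extension might look roughly like this; the register priorities (175 for the inline pattern, 20 for the quote processor) are assumptions based on Python-Markdown 3.x defaults rather than values taken from the question:
import re
import xml.etree.ElementTree as ET

import markdown
from markdown.blockprocessors import BlockquoteProcessor
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern

class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        # Build something like <a href="#post-123">>>123</a>
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class CustomBlockquoteProcessor(BlockquoteProcessor):
    # Require a space after '>' so '>>123' is no longer treated as a blockquote.
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        md.inlinePatterns.register(
            ImageboardLinkPattern('>>(?P<number>[1-9][0-9]*)'),
            'imageboard_link', 175)
        # 'quote' is the registry name of the stock blockquote processor;
        # registering under the same name replaces it.
        md.parser.blockprocessors.register(
            CustomBlockquoteProcessor(md.parser), 'quote', 20)

html = markdown.markdown('>>123', extensions=[ImageboardLinkExtension()])
print(html)  # expected, roughly: <p><a href="#post-123">&gt;&gt;123</a></p>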

Python pygments lexer state preservation

Running the default Pygments lexer on the following C++ text, class foo{};, results in this:
(Token.Keyword, 'class')
(Token.Text, ' ')
(Token.Name.Class, 'foo')
(Token.Punctuation, '{')
(Token.Punctuation, '}')
(Token.Punctuation, ';')
Note that the token foo has the type Token.Name.Class.
If I change the class name to foobar, I want to be able to run the default lexer only on the touched tokens, in this case the original tokens foo and {.
Q: How can I save the lexer state so that tokenizing foobar{ will give tokens with type Token.Name.Class?
Having this feature would optimize syntax highlighting for large source files that suffered a change (user is typing text) right in the middle of the file for example. There seems no documented way of doing this and no information on how to do this using the default pygments lexers.
Are there any other syntax highlighting systems that have support for this behavior ?
EDIT:
Regarding performance here is an example: http://tpcg.io/ESYjiF
From my understanding of the source code what you want is not possible.
I won't dig in and try to explain every single relevant line of code, but basically, here is what happens:
Your Lexer class is pygments.lexers.c_cpp.CLexer, which inherits from pygments.lexer.RegexLexer.
The pygments.lex(lexer, code) function does nothing more than call the get_tokens method on lexer and handle errors.
lexer.get_tokens basically decodes the source code into a unicode string and calls self.get_tokens_unprocessed.
get_tokens_unprocessed is defined by each Lexer; in your case the relevant method is pygments.lexers.c_cpp.CFamilyLexer.get_tokens_unprocessed.
CFamilyLexer.get_tokens_unprocessed basically gets tokens from RegexLexer.get_tokens_unprocessed and reprocesses some of them.
Finally, RegexLexer.get_tokens_unprocessed loops on the defined token types (something like (("function", ('pattern-to-find-c-function',)), ("class", ('function-to-find-c-class',)))) and for each type (function, class, comment...) finds all matches within the source text, then processes the next type.
This behavior makes what you want impossible because it loops on token types, not on text.
To make my point more obvious, I added 2 lines of code to the library, in file pygments/lexer.py at line 628:
for rexmatch, action, new_state in statetokens:
    print('looking for {}'.format(action))
    m = rexmatch(text, pos)
    print('found: {}'.format(m))
And ran it with this code:
import pygments
import pygments.lexers
lexer = pygments.lexers.get_lexer_for_filename("foo.h")
sample="""
class foo{};
"""
print(list(lexer.get_tokens(sample)))
Output:
[...]
looking for Token.Keyword.Reserved
found: None
looking for Token.Name.Builtin
found: None
looking for <function bygroups.<locals>.callback at 0x7fb1f29b52f0>
found: None
looking for Token.Name
found: <_sre.SRE_Match object; span=(6, 9), match='foo'>
[...]
As you can see, the token types are what the code iterates on.
Given that, and (as Tarun Lalwani said in the comments) the fact that a single new character can break the whole source-code structure, you cannot do better than re-lexing the whole text at each update.

Parser that preserves comments and recovers from errors

I'm working on a GUI editor for a proprietary config format. Basically the editor will parse the config file, display the object properties so that users can edit them from the GUI, and then write the objects back to the file.
I've got the parse - edit - write part done, except for:
The parsed data structure only includes object property information, so comments and whitespace are lost on write
If there is any syntax error, the rest of the file is skipped
How would you address these issues? What is the usual approach to this problem? I'm using Python and the Parsec module (https://pythonhosted.org/parsec/documentation.html); however, any help and general direction is appreciated.
I've also tried Pylens (https://pythonhosted.org/pylens/), which is really close to what I need except it cannot skip syntax errors.
You asked about typical approaches to this problem. Here are two projects which tackle similar challenges to the one you describe:
sketch-n-sketch: "Direct manipulation" interface for vector images, where you can either edit the image-describing source language, or edit the image it represents directly and see those changes reflected in the source code. Check out the video presentation, it's super cool.
Boomerang: Using lenses to "focus" on the abstract meaning of some concrete syntax, alter that abstract model, and then reflect those changes in the original source.
Both projects have yielded several papers describing the approaches their authors took. As far as I can tell, the Lens approach is popular, where parsing and printing become the get and put functions of a Lens which takes some source code and focuses on the abstract concept which that code describes.
Eventually I ran out of research time and had to settle for a rather manual form of skipping. Basically, each time the parser fails we advance the cursor one character and try again. Any parts skipped by this process, whether whitespace, comments or syntax errors, are dumped into a Text structure. The code is quite reusable, except that you have to incorporate it into every place where results repeat and the original parser may fail.
Here's the code, in case it helps anyone. It is written for Parsy.
from parsy import Parser, Result

class Text(object):
    '''Structure to contain all the parts that the parser does not understand.
    A better name would be Whitespace
    '''
    def __init__(self, text=''):
        self.text = text

    def __repr__(self):
        return "Text(text='{}')".format(self.text)

    def __eq__(self, other):
        return self.text.strip() == getattr(other, 'text', '').strip()

def many_skip_error(parser, skip=lambda t, i: i + 1, until=None):
    '''Repeat the original `parser`, aggregate results into `values`
    and errors into `Text`.
    '''
    @Parser
    def _parser(stream, index):
        values, result = [], None
        while index < len(stream):
            result = parser(stream, index)
            # Original parser success
            if result.status:
                values.append(result.value)
                index = result.index
            # Check for end condition, effectively `manyTill` in Parsec
            elif until is not None and until(stream, index).status:
                break
            # Aggregate skipped text into last `Text` value, or create a new one
            else:
                if len(values) > 0 and isinstance(values[-1], Text):
                    values[-1].text += stream[index]
                else:
                    values.append(Text(stream[index]))
                index = skip(stream, index)
        return Result.success(index, values).aggregate(result)
    return _parser

# Example usage
skip_error_parser = many_skip_error(original_parser)
On another note, I guess the real issue here is that I'm using a parser combinator library instead of a proper two-stage parsing process. In traditional parsing, the tokenizer would handle retaining or skipping any whitespace, comment, or syntax error, making them all effectively whitespace that is invisible to the parser.
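To illustrate that idea, here is a minimal sketch of such a tokenizer using only the standard library; the token names and the regex are made up for a generic config-like format, not the proprietary format from the question:
import re

# Sketch of a tokenizer that keeps comments and whitespace as ordinary tokens.
# A later parsing stage can ignore WS/COMMENT/JUNK tokens, while a writer can
# replay every token verbatim, so nothing is lost on a round trip.
TOKEN_RE = re.compile(r"""
      (?P<COMMENT>\#[^\n]*)
    | (?P<WS>\s+)
    | (?P<NAME>[A-Za-z_][A-Za-z0-9_]*)
    | (?P<NUMBER>\d+)
    | (?P<OP>[=;{}])
    | (?P<JUNK>.)          # anything unrecognised becomes a 'junk' token
""", re.VERBOSE)

def tokenize(text):
    for m in TOKEN_RE.finditer(text):
        yield m.lastgroup, m.group()

source = "width = 80  # columns\nhei ght = ??\n"
tokens = list(tokenize(source))
# Every character ends up in some token, so the text round-trips exactly.
assert "".join(tok_text for _, tok_text in tokens) == source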

Snippets vs. Abbreviations in Vim

What advantages and/or disadvantages are there to using a "snippets" plugin, e.g. snipmate or UltiSnips, for Vim as opposed to simply using the built-in "abbreviations" functionality?
Are there specific use-cases where declaring iabbr, cabbr, etc. lack some major features that the snippets plugins provide? I've been unsuccessful in finding a thorough comparison between these two "features" and their respective implementations.
As #peter-rincker pointed out in a comment:
It should be noted that abbreviations can execute code as well. Often via <c-r>= or via an expression abbreviation (<expr>). Example which expands ## to the current file's path: :iabbrev ## <c-r>=expand('%:p')<cr>
As an example for Python, let's compare a snipmate snippet and an abbreviation in Vim for inserting the lines of a class declaration.
Snipmate
# New Class
snippet cl
    class ${1:ClassName}(${2:object}):
        """${3:docstring for $1}"""
        def __init__(self, ${4:arg}):
            ${5:super($1, self).__init__()}
            self.$4 = $4
            ${6}
Vimscript
au FileType python :iabbr cl class ClassName(object):<CR><Tab>"""docstring for ClassName"""<CR>def __init__(self, arg):<CR><Tab>super(ClassName, self).__init__()<CR>self.arg = arg
Am I missing some fundamental functionality of "snippets", or am I correct in assuming they are overkill for the most part, when Vim's abbreviations and templates (:help template) are able to do most of the stuff snippets do?
I assume it's easier to implement snippets, and they provide additional aesthetic/visual features. For instance, if I use abbreviations in Vim along with other plugins for running/testing Python code inside Vim (e.g. syntastic, pytest, ropevim, pep8, etc.), am I missing out on some key features that snippets provide?
Everything that can be done with snippets can be done with abbreviations and vice-versa. You can have (mirrored or not) placeholders with abbreviations, you can have context-sensitive snippets.
There are two important differences:
Abbreviations are triggered when the abbreviation text has been typed and a non-word character (or Esc) is hit. Snippets are triggered on demand, and shortcuts are possible (no need to type while + tab; w + tab may be enough).
It's much easier to define new snippets (or to maintain old ones) than to define abbreviations. With abbreviations, a lot of boilerplate code is required when we want to do neat things.
There are a few other differences. For instance, abbreviations are always triggered everywhere. And seeing for expanded into for(placeholder) {\n} within a comment or a string context is certainly not what the end-user expects. With snippets, this is not a problem any more: we can expect the end-user to know what he's doing when he asks to expand a snippet. Still, we can propose context-aware snippets that expand throw into #throw {domain::exception} {explanation} within a comment, or into throw domain::exception({message}); elsewhere.
Snippets
Rough superset of Vim's native abbreviations. Here are the highlights:
Only trigger on key press
Uses placeholders which a user can jump between
Exist only for insert mode
Dynamic expansions
Abbreviations
Great for common typos and small snippets.
Native to Vim so no need for plugins
Typically expand on whitespace or <c-]>
Some special rules on trigger text (See :h abbreviations)
Can be used in command mode via :cabbrev (often used to create command aliases)
No placeholders
Dynamic expansions
Conclusion
For the most part snippets are more powerful and provide many features that other editors enjoy, but you can use both, and many people do. Abbreviations enjoy the benefit of being native, which can be useful in remote environments. Abbreviations also have another clear advantage: they can be used in command mode.
Snippets are more powerful.
Depending on the implementation, snippets can let you change (or accept defaults for) multiple placeholders and can even execute code when the snippet is expanded.
For example, with UltiSnips you can have it execute shell commands and Vimscript, but also Python code.
An (ultisnips) example:
snippet hdr "General file header" b
# file: `!v expand('%:t')`
# vim:fileencoding=utf-8:ft=`!v &filetype`
# ${1}
#
# Author: ${2:J. Doe} ${3:<jdoe#gmail.com>}
# Created: `!v strftime("%F %T %z")`
# Last modified: `!v strftime("%F %T %z")`
endsnippet
This presents you with three placeholders to fill in (it gives default values for two of them), and sets the filename, filetype and current date and time.
After the word "snippet", the start line contains three items:
the trigger string,
a description and
options for the snippet.
Personally I mostly use the b option where the snippet is expanded at the beginning of a line and the w option that expands the snippet if the trigger string starts at the beginning of a word.
Note that you have to type the trigger string and then input a key or key combination that actually triggers the expansion. So a snippet is not expanded unless you want it to.
Additionally, snippets can be specialized by filetype. Suppose you want to define four levels of headings, h1 .. h4. You can have the same name expand differently between e.g. an HTML, markdown, LaTeX or restructuredtext file.
snippets are like the built-in :abbreviate on steroids, usually with:
parameter insertions: You can insert (type or select) text fragments in various places inside the snippet. An abbreviation just expands once.
mirroring: Parameters may be repeated (maybe even in transformed fashion) elsewhere in the snippet, usually updated as you type.
multiple stops inside: You can jump from one point to another within the snippet, sometimes even recursively expand snippets within one.
There are three things to evaluate in a snippet plugin: first, the features of the snippet engine itself; second, the quality and breadth of the snippets provided by the author or others; and third, how easy it is to add new snippets.
