Regular expression forums in Python with examples

I'm new to Python and I'm working with code that requires me to make substantial use of regular expressions.
I've gone through the Python documentation for regular expressions (http://docs.python.org/2/library/re.html).
However, since I'm new to Python, I find it hard to actually apply the functions specified in the documentation to the data I need to work with.
I was wondering if there's a forum out there that explains regular expressions in Python with multiple examples for each function and all its possible variations. The more the merrier.
I've found this (http://docs.python.org/2/howto/regex.html) and this (http://flockhart.virtualave.net/RBIF0100/regexp.html) so far. I've found them useful, but I was wondering if there is something out there that's better.

You can use RegexOne for excellent tutorials on regular expressions in general.
You can use Debuggex (which is built by me) if you want to visualize and understand what a specific regex is doing.
If you want something python-specific, you can try Google's python regex tutorial.
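If a quick taste helps before diving into those, here is a small illustrative sketch of the re calls most tutorials cover first (the pattern and strings are made up for demonstration):

import re

text = "Contact: alice@example.com, bob@example.org"
pattern = r"[\w.]+@[\w.]+"
print(re.findall(pattern, text))           # every match: both addresses
print(re.search(pattern, text).group())    # first match only
print(re.sub(pattern, "<email>", text))    # replace every match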

You can try pythonregex; it's Python-specific and generates Python code for your regular expressions.
Enjoy!

Related

Python package to parse identifiers in a program (C, Scala, Lisp)?

In the title I mention 3 different languages in which I would like to find out if a Python package exists that can give me a list of identifiers for a program in any of those; it doesn't have to be all three of them, as I doubt there would be one like that. So my question is: does a function or class exist in Python that allows me to get a list of identifiers for a specific program in a language, preferably one of the 3 I listed in the title? Any help appreciated.
In general, this is not possible without having a nearly complete language implementation.
C has a rudimentary preprocessor, which can hide function declarations from ad hoc scanning. Lisp has powerful metaprogramming, which means you can only extract the definitions using a full-featured Lisp compiler; simple parsing won't help at all.
Scala is the simplest of these three, but its syntax is still bloated and you'll need at least a complete parser. Python is hardly the right tool for doing this sort of thing anyway.
There's pycparser, which you can use to generate a C AST from code and then traverse it to get whatever you want.
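For instance, a minimal hedged sketch (the C snippet is illustrative; real files must be run through a preprocessor first, since pycparser does not handle preprocessor directives):

from pycparser import c_parser, c_ast

class FuncDefVisitor(c_ast.NodeVisitor):
    def visit_FuncDef(self, node):
        print(node.decl.name)  # name of each function definition

source = "int add(int a, int b) { return a + b; }"  # illustrative input
ast = c_parser.CParser().parse(source)
FuncDefVisitor().visit(ast)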
There's this simple Lisp interpreter in Python from which you should be able to scavenge the parser.
And I doubt there's anything similar and readily available for Scala, but you can use something like ply to make a parser. It won't be as easy, but it will do.

What should I know about Python to identify comments in different source files?

I have a need to identify comments in different kinds of source files in a given directory (for example Java, XML, JavaScript, Bash). I have decided to do this using Python (as an attempt to learn Python). The questions I have are:
1) What should I know about Python to get this done? (I have an idea that regular expressions will be useful, but are there alternatives/other modules that will be useful? Libraries that I can use to get this done?)
2) Is Python a good choice for such a task? Would some other language make this easier to accomplish?
Your problem is really one of programming language parsing. I believe regular expressions will let you find comments in most of these languages. The good thing is that you have regular expressions almost everywhere: Perl, Python, Ruby, AWK, Sed, etc.
But, as the other answer said, you'd be better off using some parsing machinery, and if not a full-blown parser, then a lexer. For Python, check out the Pygments library, which has lexers for many languages already implemented.
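A minimal sketch of that approach, assuming Java input (the snippet is illustrative); Pygments files every comment token under the Comment token type, whatever the language:

from pygments.lexers import get_lexer_by_name
from pygments.token import Comment

code = '// a line comment\nint x = 1; /* a block comment */'
lexer = get_lexer_by_name('java')
for token_type, value in lexer.get_tokens(code):
    if token_type in Comment:  # matches Comment.Single, Comment.Multiline, ...
        print(value)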
1) What you need to know about is parsing, not regexes. Additionally, you will need the os module and some knowledge about Python's file handling. Dive Into Python (http://www.diveintopython.net/) is a good start here. I'd recommend chapter 6 (and maybe 1-5 as well :) ).
2) Python is a good start. Another language is not going to make it easier, just different. Python is already pretty simple to start with.
I would recommend not using regexes for this task, as it is as simple as searching for comment signs and line feeds.
The pyparsing module directly supports several styles of comments. E.g.,
from pyparsing import javaStyleComment

text = 'int x = 1;  // a trailing comment'  # sample input
for tokens, start, end in javaStyleComment.scanString(text):
    print(tokens[0])  # the matched comment text
So if your goal is just getting the job done, look into this since the comment parsers are likely to be more robust than anything you'd throw together. If you're more interested in learning to do it yourself, this might be too much processed food for your taste.

Regex? Search Engine?

I've read through some documentation on the re module that comes with built-in python, but I just can't seem to get a grasp on it. In fact, I'm not exactly sure that is what I'm looking for, so let me explain:
I have a huge dictionary. What I want is to be able to type in a search term, say for example hello, and then have it search through the dictionary and give me a list like this:
hello, hell, hello world, hello123. Basically anything resembling the search criteria. Would I use regex for this or something else?
Since you are using Python, you should look at Xapian; it has great Python bindings.
What you are asking for is way more sophisticated than what regular expressions are for.
You need full text search, with stemming and other tricks to do the fuzzy matching.
You might want to look at something that can compute a Levenshtein (edit) distance. There's an excellent article here on how to build something like you are talking about from scratch (in Python! and it has since been ported to lots of other languages).
You might not want to go the "from-scratch" route, but the article will give you lots of interesting background that should help you decide which tool has the right level of sophistication for you. Xapian, as suggested above, Lucene, and other full-text search engines will provide this kind of capability, and it can be very sophisticated, but then again you might not need all that.
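If you want a feel for the underlying idea, here is a from-scratch sketch of the classic dynamic-programming edit distance (the standard algorithm, not the article's code):

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("hello", "hell"))      # 1
print(levenshtein("hello", "hello123"))  # 3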
There is a new regex module in the PyPI repository (which may eventually replace the current Python re module).
It allows fuzzy matching.
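For example, a small sketch using the module's fuzzy syntax, where {e<=1} permits at most one edit (the word list is illustrative):

import regex

words = ["hello", "hell", "hello world", "helo", "goodbye"]
print([w for w in words if regex.fullmatch(r"(?:hello){e<=1}", w)])
# ['hello', 'hell', 'helo']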

Mini-languages in Python

I'm after creating a simple mini-language parser in Python, programming close to the problem domain and all that.
Anyway, I was wondering how the people on here would go about doing that - what are the preferred ways of doing this kind of thing in Python?
I'm not going to give specific details of what I'm after because at the moment I'm just investigating how easy this whole field is in Python.
Pyparsing is handy for writing "little languages". I gave a presentation at PyCon'06 on writing a simple adventure game engine, in which the language being parsed and interpreted was the game command set ("inventory", "take sword", "drop book", etc.). (Source code here.)
You can also find links to other pyparsing articles at the pyparsing wiki.
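To give a flavor, here is a hedged, cut-down sketch of what such a command grammar can look like in pyparsing (the verbs are illustrative, not the talk's actual grammar):

from pyparsing import oneOf, Word, alphas, Optional

verb = oneOf("take drop inventory")
command = verb + Optional(Word(alphas))
print(command.parseString("take sword"))  # ['take', 'sword']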
I have limited but positive experience with PLY (Python Lex-Yacc). It combines Lex and Yacc functionality in a single Python package. You may want to check it out.
Fellow Stackoverflow'er Ned Batchelder has a nice overview of available tools on his website. There's also an overview on the Python website itself.
I would recommend funcparserlib. It was written especially for parsing little languages and DSLs and it is faster and smaller than pyparsing (see stats on its homepage). Minimalists and functional programmers should like funcparserlib.
Edit: By the way, I'm the author of this library, so my opinion may be biased.
Python is such a wonderfully simple and extensible language that I'd suggest merely creating a comprehensive python module, and coding against that.
I see that while I typed up the above, PLY has already been mentioned.
If you asked me this now, I would try the textX library for Python. You can very easily create a DSL with it in Python! Advantages are that it creates an AST for you, and lexing and parsing are combined.
http://igordejanovic.net/textX/
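As a taste, a hedged sketch of a tiny textX DSL (the grammar and commands are made up for illustration):

from textx import metamodel_from_str

grammar = '''
Model: commands+=Command;
Command: TakeCommand | DropCommand;
TakeCommand: 'take' item=ID;
DropCommand: 'drop' item=ID;
'''
mm = metamodel_from_str(grammar)
model = mm.model_from_str('take sword drop book')
for cmd in model.commands:
    print(cmd.item)  # sword, then book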
In order to be productive, I'd always use a parser generator like CocoPy (Tutorial) to have your grammar transformed into a (correct) parser (unless you want to implement the parser manually for the sake of learning).
The rest is writing the actual interpreter/compiler (create stack-based bytecode or an in-memory AST to be interpreted, and then evaluate it).

Resources for lexing, tokenising and parsing in python

Can people point me to resources on lexing, parsing and tokenising with Python?
I'm doing a little hacking on an open source project (hotwire) and wanted to make a few changes to the code that lexes, parses and tokenises the commands entered into it. As it is real working code, it is fairly complex and a bit hard to work out.
I haven't worked on code to lex/parse/tokenise before, so I was thinking one approach would be to work through a tutorial or two on this aspect. I would hope to learn enough to navigate around the code I actually want to alter. Is there anything suitable out there? (Ideally it could be done in an afternoon without having to buy and read the dragon book first ...)
Edit: (7 Oct 2008) None of the below answers quite give what I want. With them I could generate parsers from scratch, but I want to learn how to write my own basic parser from scratch, not using lex and yacc or similar tools. Having done that I can then understand the existing code better.
So could someone point me to a tutorial where I can build a basic parser from scratch, using just python?
I'm a happy user of PLY. It is a pure-Python implementation of Lex & Yacc, with lots of small niceties that make it quite Pythonic and easy to use. Since Lex & Yacc are the most popular lexing & parsing tools and are used for most projects, PLY has the advantage of standing on giants' shoulders. A lot of knowledge exists online about Lex & Yacc, and you can freely apply it to PLY.
PLY also has a good documentation page with some simple examples to get you started.
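To give a flavor, here is a minimal hedged lexer sketch in PLY (the token set is illustrative; a real language would add yacc grammar rules on top):

import ply.lex as lex

tokens = ('NUMBER', 'PLUS')  # token names PLY should recognise

t_PLUS = r'\+'
t_ignore = ' \t'             # characters to skip silently

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)          # skip illegal characters

lexer = lex.lex()
lexer.input("3 + 4")
for tok in lexer:
    print(tok)  # LexToken(NUMBER,3,...), LexToken(PLUS,'+',...), ...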
For a listing of lots of Python parsing tools, see this.
This question is pretty old, but maybe my answer will help someone who wants to learn the basics. I find this resource to be very good: it is a simple interpreter written in Python without the use of any external libraries, so it will help anyone who would like to understand the internal workings of parsing, lexing, and tokenising:
"A Simple Interpreter from Scratch in Python": Part 1, Part 2, Part 3, and Part 4.
For medium-complex grammars, PyParsing is brilliant. You can define grammars directly within Python code, no need for code generation:
>>> from pyparsing import Word, alphas
>>> greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
>>> hello = "Hello, World!"
>>> print hello, "->", greet.parseString( hello )
Hello, World! -> ['Hello', ',', 'World', '!']
(Example taken from the PyParsing home page).
With parse actions (functions that are invoked when a certain grammar rule is triggered), you can convert parses directly into abstract syntax trees, or any other representation.
There are many helper functions that encapsulate recurring patterns, like operator hierarchies, quoted strings, nesting or C-style comments.
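For example, a small hedged sketch of a parse action converting matched digit strings to integers as they are parsed:

from pyparsing import Word, nums, delimitedList

integer = Word(nums).setParseAction(lambda t: int(t[0]))
print(delimitedList(integer).parseString("1, 2, 3").asList())  # [1, 2, 3]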
pygments is a source code syntax highlighter written in Python. It has lexers and formatters, and it may be interesting to peek at its source.
Here are a few things to get you started (roughly from simplest to most complex, least to most powerful):
http://en.wikipedia.org/wiki/Recursive_descent_parser
http://en.wikipedia.org/wiki/Top-down_parsing
http://en.wikipedia.org/wiki/LL_parser
http://effbot.org/zone/simple-top-down-parsing.htm
http://en.wikipedia.org/wiki/Bottom-up_parsing
http://en.wikipedia.org/wiki/LR_parser
http://en.wikipedia.org/wiki/GLR_parser
When I learned this stuff, it was in a semester-long 400-level university course. We did a number of assignments where we did parsing by hand; if you want to really understand what's going on under the hood, I'd recommend the same approach.
This isn't the book I used, but it's pretty good: Principles of Compiler Design.
Hopefully that's enough to get you started :)
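To make the first item on that list concrete, here is a from-scratch sketch of a recursive descent parser (really a parser-evaluator) for simple arithmetic, using just Python:

import re

def tokenize(s):
    return re.findall(r'\d+|[+*()]', s)  # numbers and the operators we know

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None
    def eat(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok
    def expr(self):    # expr := term ('+' term)*
        value = self.term()
        while self.peek() == '+':
            self.eat()
            value += self.term()
        return value
    def term(self):    # term := factor ('*' factor)*
        value = self.factor()
        while self.peek() == '*':
            self.eat()
            value *= self.factor()
        return value
    def factor(self):  # factor := NUMBER | '(' expr ')'
        if self.peek() == '(':
            self.eat()
            value = self.expr()
            self.eat()  # consume the closing ')'
            return value
        return int(self.eat())

print(Parser(tokenize("2 + 3 * (4 + 1)")).expr())  # 17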
Have a look at the standard module shlex and modify a copy of it to match the syntax you use for your shell; it is a good starting point.
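For instance (the command string is just for demonstration):

import shlex

print(shlex.split('cp "my file.txt" /tmp --verbose'))
# ['cp', 'my file.txt', '/tmp', '--verbose']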
If you want all the power of a complete solution for lexing/parsing, ANTLR can generate python too.
Federico Tomassetti has a good, concise write-up of everything related, from BNF to binary deciphering, covering:
lexing,
parsing,
abstract syntax trees (ASTs), and
constructs/code generators.
He even mentions Parsing Expression Grammars (PEGs).
https://tomassetti.me/parsing-in-python/
I suggest http://www.canonware.com/Parsing/, since it is pure Python and you don't need to learn a grammar, but it isn't widely used and has comparatively little documentation. The heavyweights are ANTLR and PyParsing. ANTLR can generate Java and C++ parsers too, as well as AST walkers, but you will have to learn what amounts to a new language.
