So I need to write a parser (or simulator) which would take an input file with simple code written in my own pseudo-code-like language, for instance:
a = 5
b = 5 * a
[FOR 10]
b = b * 5
[ENDFOR]
[IF b>30]
a = a + 3
[ENDIF]
So the pseudo language only supports integer variables, basic operations on them (+, -, /, *), a basic for loop and a basic if statement. I need to build a parser that will in the end deliver the final values of a and b (or of any other variables used in the code).
I was thinking of trying to do this in XML so I could simulate the loop and the if with tags, but I am not really sure if this is the right (or most efficient) approach. Any suggestions?
Quick edit ^^:
It's not about creating my own programming language; it's part of a bigger project. I need a simple way of evaluating small snippets of code written like the example and getting the states of the variables used after simulating it. That's why I wanted to use XML. This is not intended to be a programming language of any sort.
Much of this may already be implemented using the examples from the pyparsing wiki, such as this one, or this one, which uses the more current operatorPrecedence helper method.
EDIT: The links to the pyparsing wikispace are dead, but you can find another wiki in the GitHub repository here: https://github.com/pyparsing/pyparsing/wiki
Take a look at PLY, Python's implementation of the glorious lex/yacc. You can definitely write a compiler or interpreter for your language with this tool.
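If pulling in a parser generator feels like overkill for snippets this small, a hand-rolled interpreter along the following lines may already be enough. This is only a sketch (the run, evaluate and block_end helpers are names made up here): it assumes the [FOR]/[IF] blocks are not nested, and it evaluates expressions with eval against an empty builtins namespace, which is fine for trusted snippets but is not a security sandbox.
import re

def run(source):
    """Interpret the tiny assignment / [FOR n] / [IF cond] language from the question."""
    env = {}

    def evaluate(expr):
        # arithmetic and comparisons only; no builtins, only our own variables
        return eval(expr, {"__builtins__": {}}, env)

    def block_end(lines, start, end_tag):
        # find the matching end tag (this sketch does not handle nested blocks)
        for j in range(start, len(lines)):
            if lines[j] == end_tag:
                return j
        raise SyntaxError("missing " + end_tag)

    def execute(lines):
        i = 0
        while i < len(lines):
            line = lines[i]
            if line.startswith("[FOR"):
                count = int(re.match(r"\[FOR\s+(\d+)\]", line).group(1))
                end = block_end(lines, i + 1, "[ENDFOR]")
                for _ in range(count):
                    execute(lines[i + 1:end])
                i = end + 1
            elif line.startswith("[IF"):
                cond = re.match(r"\[IF\s*(.+)\]", line).group(1)
                end = block_end(lines, i + 1, "[ENDIF]")
                if evaluate(cond):
                    execute(lines[i + 1:end])
                i = end + 1
            else:
                name, expr = line.split("=", 1)
                env[name.strip()] = evaluate(expr)
                i += 1

    execute([ln.strip() for ln in source.splitlines() if ln.strip()])
    return env

print(run("a = 5\nb = 5 * a\n[FOR 10]\nb = b * 5\n[ENDFOR]\n[IF b>30]\na = a + 3\n[ENDIF]"))
# -> {'a': 8, 'b': 244140625}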
I want to understand how Python works at a base level, and this will hopefully help me understand a bit more about the inner workings of other compiled/interpreted languages. Unfortunately, the compilers class is a bit away for now. From what I read on this site and elsewhere, people answering "what base language is Python written in" seem to convey that there's a difference between the "rules" of a language and how those rules are implemented for actual use.
So, is it correct to say that Python (and other high-level languages) is essentially just a set of rules "written" in some natural language? And then the matter of how the language is actually used (compiled/interpreted to actually create things) can vary, with various languages being used to implement the compilers or interpreters? In that case, CPython, IronPython, and Jython would be syntactically equal languages which all follow the same set of rules, just with those rules implemented in their respective languages.
Please let me know if my understanding of this is correct, if you have anything to add that might further solidify my understanding, or if I'm blatantly wrong.
Code written in Python should be able to run on any Python interpreter. Python is essentially a specification for a programming language with a reference implementation (CPython). Whenever the Python specifications and PEPs are ambiguous, the other interpreters usually choose to implement the same behavior, unless they have reason not to.
That being said, it's entirely possible that a program written in Python will behave differently on different implementations, because many programmers venture into "undefined behavior." For example, CPython has a "Global Interpreter Lock", which means only one thread is actually executing Python code at a time (modulo some conditions), but other interpreters do not have that behavior. As a result, there are differences in atomicity (e.g., each bytecode instruction is atomic in CPython) compared to other interpreters.
You can consider it like C. C is a language specification, but there are many compilers implementing it: GCC, LLVM, Borland, MSVC++, ICC, etc. There are programming languages and implementations of those programming languages.
You are correct when you make the distinction between what a language means and how it does what it means.
What it means
The first step in compiling a language is to parse its code to generate an Abstract Syntax Tree (AST). That is a tree that defines what the code you wrote means, what it is supposed to do. For example, if you have the following code
a = 1
if a:
    print('not zero')
It would generate a tree that looks more or less like this.
           code
    ________|________
   |                 |
declaration          if
   __|__          ___|___
  |     |        |       |
  a     1        a     print
                         |
                    'not zero'
This represents what the code means, but tells us nothing about how it executes it.
Edit: of course, the above is far from what Python's parser would actually generate; I made plenty of oversimplifications for the sake of readability. Luckily for us, if you are curious about what is actually generated, you can import ast, which provides a Python parser.
import ast
code = """
a = 1
if a:
    print('not zero')
"""
my_ast = ast.parse(code)
Enjoy inspecting my_ast.
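For example, continuing from the snippet above, ast.dump gives a readable rendering of the tree (the indent argument requires Python 3.9+); you should see a Module whose body holds an Assign node and an If node, much like the sketch above:
print(ast.dump(my_ast, indent=4))   # pretty-print the parsed tree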
What it does
Once you have an AST, you can convert it back to whatever you want. It can be C, it can be machine code; you can even convert it back to Python if you wish. The most used implementation of Python is CPython, which is written in C.
What is going on under the hood is thus pretty close to your understanding. First, a language is a set of rules that defines a behaviour, and only then is there an implementation of that language that defines how it does it. And yes, of course, you can have different implementations of the same language with slight differences in behaviour.
Basically it's a bunch of dictionary data structures implementing functions, modules, etc. The global variables and their values live in a per-module dictionary. Variables within a class are another dictionary. Those within an object are yet another dictionary and so are those within a function. Even a function call has its own dictionary so that different calls have different copies of the local variables.
Unlike most other languages it has no lexical scope and, in my opinion, was designed to be implemented as simply as possible by one coder using dictionaries.
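A quick way to see those dictionaries from the interpreter (a small illustration with made-up names, not part of the original answer):
class Point:
    scale = 2              # lives in the class's dictionary

p = Point()
p.x = 3                    # lives in the instance's dictionary

def shout(word):
    loud = word.upper()    # locals() exposes the per-call dictionary
    return locals()

print(type(globals()))           # <class 'dict'> -- the module namespace
print(Point.__dict__["scale"])   # 2              -- the class namespace
print(p.__dict__)                # {'x': 3}       -- the instance namespace
print(shout("hi"))               # {'word': 'hi', 'loud': 'HI'}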
I used to define macros (not just constants) in C, like
#define loop(i,a,b) for(i=a; i<b; ++i)
#define long_f(a,b,c) (a*0.123 + a*b*5.6 - 0.235*c + 7.23*c - 5*a*a + 1.5)
Is there a way of doing this in python using a preprocess instead of a function?
*By preprocess I mean something that replaces the occurrences of the definition before running the code (actually not the whole code, just the rest of it; since the definition is itself part of the code, I guess it would otherwise all be replaced at runtime).
If there is, is it worth it? Will there be a significant difference in run time?
Is there a way? Yes. There's always a way. Should you do it? Probably not.
Just define a function that does what you want. If you are just concerned about code getting really long and want a one-liner, you can use a lambda function.
long_f = lambda a,b,c: a*0.123 + a*b*5.6 - 0.235*c + 7.23*c - 5*a*a + 1.5
long_f(1,2,3) == 28.808
And of course your first example is already way prettier in Python.
for i in range(a,b):
...
Edit: for completeness, I should answer the question as asked. If you ABSOLUTELY MUST preprocess your Python code, you can use any programming language designed for templating things like web pages. For example, I've heard of PHP being used for preprocessing code: instead of HTML, you write your code, and where you want something preprocessed, you insert your PHP blocks.
Well, if you're going to perform some really hard calculations that could be performed in advance, then perhaps this makes sense: usually users are happier with fast programs than with slow ones.
But I'm afraid Python isn't a good choice when it comes to 'raw performance', that is, the speed of arithmetic calculations, at least if we're talking about the standard Python implementation, CPython.
Alternatively, you could check other variants:
PyPy. This is an alternative Python implementation written in (a restricted subset of) Python. Thanks to a JIT compiler it gives better performance but requires a lot more memory.
Cython. This is an extension of Python which allows one to [conveniently] create compilable snippets for performance-critical parts of the code.
Use whatever external preprocessor you like. M4 and FilePP are what come to my mind first, but there are plenty of them.
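If you really want textual substitution rather than a function, even a few lines of Python can act as a crude preprocessor. This is only a sketch (the expand helper and the MACROS table are made up for illustration), and the naive regex cannot cope with nested parentheses in the arguments:
import re

# Hypothetical macro table: macro name -> (parameter names, expansion template)
MACROS = {
    "loop": (("i", "a", "b"), "for {i} in range({a}, {b}):"),
}

def expand(source):
    """Textually expand macro 'calls' before the code is run."""
    def replace(match):
        params, template = MACROS[match.group(1)]
        args = [arg.strip() for arg in match.group(2).split(",")]
        return template.format(**dict(zip(params, args)))
    pattern = r"\b(%s)\(([^)]*)\)" % "|".join(MACROS)
    return re.sub(pattern, replace, source)

print(expand("loop(i, 0, 10)\n    print(i)"))
# for i in range(0, 10):
#     print(i)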
This is a follow-up question to the last question I asked, which you can read here:
Using Python's basic I/O to manipulate or create Python Files?
Basically I was asking for the best way to allow Python to edit other Python programs' sources, for the purpose of screwing around/experimenting and seeing if I could hook them up to either a genetic algorithm or some sort of backprop network to get results.
Basically, one of the answers suggests making every Python operator or code-bit, such as '=', 'if', etc., into classes which can then be used by me/various other programs to piece together/edit other Python files, utilizing Python's basic file I/O as the means of manipulation.
The classes would each, upon initialization, get their own unique ID, which would be stored, along with line number, type, etc, in an SQLite3 database for logging purposes, and would be operated upon by other classes/the basic file i/o system.
My question is: Is this a sane task? If not, what can I change or do differently? Am I undertaking something that will be worthwhile, or is it completely idiotic? If you need clarification please ask, I want to know if what I'm doing seems reasonable to an outside source, or if I should reconsider/scrap the whole deal...
Thanks in advance.
Don't generate code; modify the AST for the code instead.
import ast
p = ast.parse("testcode")
Docs for AST: http://docs.python.org/library/ast.html
Example of modifying code: http://docs.python.org/library/ast.html#ast.NodeTransformer
Example: (modifying 3*3 to 3+3)
from ast import *
tree = parse("print(3*3)") #parse the code
tree.body[0].value.args[0].op = Add() #modify multiplication to plus
codeobj = compile(tree,"","exec") # compile the code
exec(codeobj) #run it - it should print 6 (not 9 which the original code would print)
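For anything bigger than a one-off tweak, the ast.NodeTransformer approach from the docs linked above is usually cleaner. Here is a small sketch (MulToAdd is a made-up class name) performing the same 3*3 to 3+3 rewrite:
import ast

class MulToAdd(ast.NodeTransformer):
    """Rewrite every multiplication into an addition."""
    def visit_BinOp(self, node):
        self.generic_visit(node)            # transform any nested expressions first
        if isinstance(node.op, ast.Mult):
            node.op = ast.Add()
        return node

tree = ast.parse("print(3*3)")
tree = MulToAdd().visit(tree)
ast.fix_missing_locations(tree)
exec(compile(tree, "<ast>", "exec"))        # prints 6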
BTW, I am interested in genetic algorithms.
If you start a project, I can help you.
One of the practices I have gotten into in Python from the beginning is to reduce the number of variables I create, compared to the number I would create when trying to do the same thing in SAS or Fortran.
For example, here is some code I wrote tonight:
def idMissingFilings(dEFilings,indexFilings):
    inBoth=set(indexFilings.keys()).intersection(dEFilings.keys())
    missingFromDE=[]
    for each in inBoth:
        if len(dEFilings[each])<len(indexFilings[each]):
            dEtemp=[]
            for filing in dEFilings[each]:
                #dateText=filing.split("\\")[-1].split('-')[0]
                #year=dateText[0:5]
                #month=dateText[5:7]
                #day=dateText[7:]
                #dETemp.append(year+"-"+month+"-"+day+"-"+filing[-2:])
                dEtemp.append(filing.split('\\')[-1].split('-')[0][1:5]+"-"+filing.split('\\')[-1].split('-')[0][5:7]+"-"+filing.split('\\')[-1].split('-')[0][7:]+"-"+filing[-2:])
            indexTemp=[]
            for infiling in indexFilings[each]:
                indexTemp.append(infiling.split('|')[3]+"-"+infiling[-6:-4])
            tempMissing=set(indexTemp).difference(dEtemp)
            for infiling in indexFilings[each]:
                if infiling.split('|')[3]+"-"+infiling[-6:-4] in tempMissing:
                    missingFromDE.append(infiling)
    return missingFromDE
Now I split one of the strings I am processing 4 times in the line dEtemp.append(blah blah blah)
filing.split('\\')
Historically in Fortran or SAS if I were to attempt the same I would have 'sliced' my string once and assigned a variable to each part of the string that I was going to use in this expression.
I am constantly forcing myself to use expressions instead of first resolving to a value and using the value. The only reason I do this is that I am learning by mimicking other people's code, but it has been in the back of my mind to ask this question: where can I find a cogent discussion of why one approach is better than the other?
The code compares a set of documents on a drive with a source list of those documents and checks whether all of those from the source are represented on the drive.
Okay, the commented section is much easier to read, and that is how I decided to respond to nosklo's answer.
Yeah, it is not better to put everything in the expression. Please use variables.
Using variables is not only better because you will do the operation only once and save the value for multiple uses. The main reason is that code becomes more readable that way. If you name the variable right, it doubles as free implicit documentation!
Use more variables. Python is known for its readability; taking away that feature is not "Pythonic" (see https://docs.python-guide.org/writing/style/). Code that is more readable will be easier for others to understand, and easier for you to understand later.
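Applied to the inner loop from the question, that advice might look something like this (a sketch only; the intermediate names are mine, and it reuses dEFilings, each and dEtemp from the surrounding function):
for filing in dEFilings[each]:
    date_text = filing.split("\\")[-1].split("-")[0]      # split the string once
    year, month, day = date_text[1:5], date_text[5:7], date_text[7:]
    dEtemp.append(year + "-" + month + "-" + day + "-" + filing[-2:])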
A while ago, when I was learning Javascript, I studied Javascript: the good parts, and I particularly enjoyed the chapters on the bad and the ugly parts. Of course, I did not agree with everything, as summing up the design defects of a programming language is to a certain extent subjective - although, for instance, I guess everyone would agree that the keyword with was a mistake in Javascript. Nevertheless, I find it useful to read such reviews: even if one does not agree, there is a lot to learn.
Is there a blog entry or some book describing design mistakes for Python? For instance I guess some people would count the lack of tail call optimization a mistake; there may be other issues (or non-issues) which are worth learning about.
You asked for a link or other source, but there really isn't one. The information is spread over many different places. What really constitutes a design mistake, and do you count just syntactic and semantic issues in the language definition, or do you include pragmatic things like platform and standard library issues and specific implementation issues? You could say that Python's dynamism is a design mistake from a performance perspective, because it makes it hard to make a straightforward efficient implementation, and it makes it hard (I didn't say completely impossible) to make an IDE with code completion, refactoring, and other nice things. At the same time, you could argue for the pros of dynamic languages.
Maybe one approach to start thinking about this is to look at the language changes from Python 2.x to 3.x. Some people would of course argue that print being a function is inconvenient, while others think it's an improvement. Overall, there are not that many changes, and most of them are quite small and subtle. For example, map() and filter() return iterators instead of lists, range() behaves like xrange() used to, and dict methods like dict.keys() return views instead of lists. Then there are some changes related to integers, and one of the big changes is binary/string data handling. It's now text and data, and text is always Unicode. There are several syntactic changes, but they are more about consistency than revamping the whole language.
From this perspective, it appears that Python has been pretty well designed on the language (syntax and semantics) level since at least 2.x. You can always argue about indentation-based block syntax, but we all know that doesn't lead anywhere... ;-)
Another approach is to look at what alternative Python implementations are trying to address. Most of them address performance in some way, some address platform issues, and some add to or change the language itself to solve certain kinds of tasks more efficiently:
Unladen Swallow wants to make Python significantly faster by optimizing the runtime byte-compilation and execution stages.
Stackless adds functionality for efficient, heavily threaded applications: constructs like microthreads and tasklets, channels for bidirectional tasklet communication, scheduling to run tasklets cooperatively or preemptively, and serialisation to suspend and resume tasklet execution.
Jython allows using Python on the Java platform, and IronPython on the .NET platform.
Cython is a Python dialect which allows calling C functions and declaring C types, allowing the compiler to generate efficient C code from Cython code.
Shed Skin brings implicit static typing into Python and generates C++ for standalone programs or extension modules.
PyPy implements Python in a subset of Python and changes some implementation details, such as adding garbage collection instead of reference counting; the purpose is to allow Python language and implementation development to become more efficient thanks to the higher-level language.
PyV8 bridges Python and JavaScript through the V8 JavaScript engine – you could say it's solving a platform issue.
Psyco is a special kind of JIT that dynamically generates specialized versions of the running code for the data that is currently being handled, which can give speedups for your Python code without having to write optimised C modules.
Of these, something can be said about the current state of Python by looking at PEP 3146, which outlines how Unladen Swallow would be merged into CPython. This PEP is accepted and is thus the Python developers' judgement of the most feasible direction to take at the moment. Note that it addresses performance, not the language per se.
So really I would say that Python's main design problems are in the performance domain – but these are basically the same challenges that any dynamic language has to face, and the Python family of languages and implementations are trying to address the issues. As for outright design mistakes like the ones listed in Javascript: the good parts, I think the meaning of "mistake" needs to be more explicitly defined, but you may want to check out the following for thoughts and opinions:
FLOSS Weekly 11: Guido van Rossum (podcast August 4th, 2006)
The History of Python blog
Is there a blog entry or some book describing design mistakes for Python?
Yes.
It's called the Py3K list of backwards-incompatible changes.
Start here: http://docs.python.org/release/3.0.1/whatsnew/3.0.html
Read all the Python 3.x release notes for additional details on the mistakes in Python 2.
My biggest peeve with Python - and one which was not really addressed in the move to 3.x - is the lack of proper naming conventions in the standard library.
Why, for example, does the datetime module contain a class itself called datetime? (To say nothing of why we have separate datetime and time modules, but also a datetime.time class!) Why is datetime.datetime in lower case, but decimal.Decimal is upper case? And please, tell me why we have that terrible mess under the xml namespace: xml.sax, but xml.etree.ElementTree - what is going on there?
Try these links:
http://c2.com/cgi/wiki?PythonLanguage
http://c2.com/cgi/wiki?PythonProblems
Things that frequently surprise inexperienced developers are candidate mistakes. Here is one, default arguments:
http://www.deadlybloodyserious.com/2008/05/default-argument-blunders/
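In case that link goes stale: the classic surprise is a mutable default value, which is evaluated once when the function is defined rather than on every call. A minimal illustration (function names are made up here):
def append_item(item, bucket=[]):      # the [] is created once, at def time
    bucket.append(item)
    return bucket

print(append_item(1))                  # [1]
print(append_item(2))                  # [1, 2] -- the same list is reused!

def append_item_fixed(item, bucket=None):
    if bucket is None:                 # create a fresh list on every call instead
        bucket = []
    bucket.append(item)
    return bucket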
A personal language peeve of mine is name binding for lambdas / local functions:
fns = []
for i in range(10):
    fns.append(lambda: i)
for fn in fns:
    print(fn()) # !!! always 9 - not what I'd naively expect
IMO, I'd much prefer looking up the names referenced in a lambda at declaration time. I understand the reasons for why it works the way it does, but still...
You currently have to work around it by binding i to a new name whose value doesn't change, using a function closure.
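The closure-based version of that workaround is a small factory function that captures the value once per call (make_fn is a made-up name):
def make_fn(value):
    return lambda: value        # each call to make_fn gets its own 'value' binding

fns = [make_fn(i) for i in range(10)]
print(fns[3]())                 # 3, as naively expected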
This is more of a minor problem with the language than a fundamental mistake, but: property overriding. If you override a property (using getters and setters), there is no easy way of getting the parent class's property.
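To illustrate the awkwardness, here is a small Python 3 sketch (Base and Child are made-up names): the parent's getter can be reached through super() or through the property object on the base class, but it's clumsier than a plain method call.
class Base:
    @property
    def value(self):
        return 42

class Child(Base):
    @property
    def value(self):
        # reach the overridden getter either through super() ...
        parent_value = super().value
        # ... or by calling the parent property's getter directly
        assert parent_value == Base.value.fget(self)
        return parent_value + 1

print(Child().value)   # 43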
Yeah, it's strange but I guess that's what you get for having mutable variables.
I think the reason is that "i" refers to a box which has a mutable value, and the "for" loop changes that value over time, so reading the box's value later gets you the only value that's left.
I don't know how one would fix that short of making it a functional programming language without mutable variables (at least without unchecked mutable variables).
The workaround I use is creating a new variable with a default value (default values are evaluated at DEFINITION time in Python, which is annoying at other times), which copies the value into the new box:
fns = []
for i in range(10):
    fns.append(lambda j=i: j)
for fn in fns:
    print(fn()) # works
I find it surprising that nobody mentioned the global interpreter lock.
One of the things I find most annoying in Python is using writelines() and readlines() on a file. readlines() not only returns a list of lines, it also leaves the \n characters at the end of each line, so you always end up doing something like this to strip them:
lines = [l.replace("\n", "").replace("\r", "") for l in f.readlines()]
And when you want to use writelines() to write lines to a file, you have to add \n at the end of every line in the list before you write them, sort of like this:
f.writelines([l + "\n" for l in lines])
writelines() and readlines() should take care of endline characters in an OS independent way, so you don't have to deal with it yourself.
You should just be able to go:
lines = f.readlines()
and it should return a list of lines, without \n or \r characters at the end of the lines.
Likewise, you should just be able to go:
f.writelines(lines)
to write a list of lines to a file, and it should use the operating system's preferred end-of-line characters when writing the file; you shouldn't need to add them to the list yourself first.
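Until then, the usual workarounds are str.splitlines() on the way in and a single join on the way out (a sketch, assuming text-mode files; the file names are placeholders):
with open("data.txt") as f:
    lines = f.read().splitlines()        # no trailing '\n' or '\r' on any line

with open("out.txt", "w") as f:
    f.write("\n".join(lines) + "\n")     # text mode translates '\n' for the platform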
My biggest dislike is range(), because it doesn't do what you'd expect, e.g.:
>>> for i in range(1,10): print i,
1 2 3 4 5 6 7 8 9
A naive user coming from another language would expect 10 to be printed as well.
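The stop value is exclusive, so you have to ask for one more if you want the endpoint (Python 3 spelling here):
print(list(range(1, 10)))       # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(range(1, 10 + 1)))   # ends at 10, as the naive user expected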
You asked for links; I wrote a document on that topic some time ago: http://segfaulthunter.github.com/articles/biggestsurprise/
I think there's a lot of weird stuff in Python in the way builtins/constants are handled. Like the following:
True = "hello"
False = "hello"
print True == False
That prints True...
def sorted(x):
    print "Haha, pwned"

sorted([4, 3, 2, 1])
Lolwut? sorted is a builtin global function. The worst example in practice is list, which people tend to use as a convenient name for a local variable and end up clobbering the global builtin.
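For example, a small illustration of the list case (Python 3 spelling):
list = [1, 2, 3]            # shadows the builtin in this module
# list("abc")               # would now raise TypeError: 'list' object is not callable
del list                    # remove the shadow and the builtin is visible again
print(list("abc"))          # ['a', 'b', 'c']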