Best Practices for Text Generation in Python

Best Practices for Text Generation in Python - python

I'm writing a python script that generates another python script based off an external file. A small section of my code can be seen below. I haven't been exposed to many examples of these kinds of scripts, so I was wondering what the best practices were.
As seen in the last two lines of the code example, the techniques that I'm using can be unwieldy at times.
SIG_DICT_NAME = "sig_dict"
SIG_LEN_KEYWORD = "len"
SIG_BUS_IND_KEYWORD = "ind"
SIG_EP_ADDR_KEYWORD = "ep_addr"
KEYWORD_DEC = "{} = \"{}\""
SIG_LEN_KEYWORD_DEC = KEYWORD_DEC.format(SIG_LEN_KEYWORD, SIG_LEN_KEYWORD)
SIG_BUS_IND_KEYWORD_DEC = KEYWORD_DEC.format(SIG_BUS_IND_KEYWORD,
SIG_BUS_IND_KEYWORD)
SIG_EP_ADDR_KEYWORD_DEC = KEYWORD_DEC.format(SIG_EP_ADDR_KEYWORD,
SIG_EP_ADDR_KEYWORD)
SIG_DICT_DEC = "{} = dict()"
SIG_DICT_BODY_LINE = "{}[{}.{}] = {{{}:{}, {}:{}, {}:{}}}"
#line1 = SIG_DICT_DEC.format(SIG_DICT_NAME)
#line2 = SIG_DICT_BODY.format(SIG_DICT_NAME, x, y, z...)

You don't really see examples of this kind of thing because your solution might be a wee bit over-engineered ;)
I'm guessing that you're trying to collect some "state of things", and then you want to run a script to process that "state of things". Rather than writing a meta-script, what is typically far more convenient is to write a script that will do the processing (say, process.py), and another script that will do the collecting of the "state of things" (say, collect.py).
Then you can take the results from collect.py and throw them at process.py and write out todays_results.txt or some such:
collect.py -> process.py -> 20150207_results.txt
If needed, you can write intermediate files to disk with something like:
with open('todays_progress.txt') as f_out:
for thing, state in states_of_things.iteritems():
f.write('{}<^_^>{}\n'.format(state, thing))
Then you can parse it back in later with something like:
with open('todays_progress.txt') as f_in:
lines = f_in.read().splitlines()
things, states = [x, y for x, y in lines.split('<^_^>')]
states_of_things = dict(zip(things, states))
More complicated data structures than a flat dict? Well, this is Python. There's probably more than one module for that! Off the top of my head I would suggest json if plaintext will do, or pickle if you need some more detailed structures. Two warnings with pickle: custom objects don't always get reinstantiated well, and it's vulnerable to code injection attacks, so only use it if your entire workflow is trusted.
Hope this helps!

You seem to be translating keyword-by-keyword.
It would almost certainly be better to read each "sentence" into a representative Python class; you could then run the simulation directly, or have each class write itself to an "output sentence".
Done correctly, this should be much easier to write and debug and produce more idiomatic output.

Related

How to use refactoring method with functions on python code?

I am in the learning phase of writing python code's. I have created the below code and have got results successfully however, i have been asked to refactor the code and i am not very sure how to proceed. I did refer to multiple post related to refactoring but got more confused and was not clear how its done. Any assistance will be appreciated. Thanks.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
data = pd.read_excel (r'S:\folder\file1.xlsx')
df_mail =pd.DataFrame(data,columns= ['CustomerName','CDAAccount','Transit'])
print(df_mail)
df_maillist =df_mail.rename(columns={'CDAAccount':'ACOUNT_NUM','Transit':'BRANCH_NUM'})
print(df_maillist)
## 1) Read SAS files
pathcifbas = 'S:\folder\custbas.sas7bdat'
pathcifadr = 'S:\folder\cusadr.sas7bdat'
pathcifacc = 'S:\folder\cusact.sas7bdat'
##custbas.sas7bdat
columns=['CIFNUM','CUSTOMR_LANGUG_C']
dfcifbas = pd.read_sas(pathcifbas)
print(dfcifbas.head())
df_langprf= dfcifbas[columns]
print(df_langprf.head())
df_lang =df_langprf.rename(columns={'CUSTOMR_LANGUG_C':'Language Preference'})
print(df_lang)
## cusadr.sas7bdat
dfcifadr = pd.read_sas(pathcifadr)
print(dfcifadr.head())
cols=['CIFNUM','ADRES_STREET_NUM','ADRES_STREET_NAME','ADRES_CITY','ADRES_STATE_PROV_C','FULL_POSTAL','ADRES_COUNTRY_C','ADRES_SPECL_ADRES']
df_adr= dfcifadr[cols]
print(df_adr.head)
### Renaming the columns
df_adrress =df_adr.rename(columns={'ADRES_CITY':'City','ADRES_STATE_PROV_C':'Province','FULL_POSTAL':'Postal Code','ADRES_COUNTRY_C':'Country','ADRES_SPECL_ADRES':'Special Address'})
print(df_adrress)
## cusact.sas7bdat
dfcifacc = pd.read_sas(pathcifacc)
print(dfcifacc.head())
colmns=['CIFNUM','ACOUNT_NUM','BRANCH_NUM','APLICTN_ID']
df_acc= dfcifacc[colmns]
print(df_acc)
## Filtering the tables with ['APLICTN_ID']== b'CDA'
df_cda= df_acc['APLICTN_ID']== b'CDA'
print(df_cda.head())
df_acccda = df_acc[df_cda]
print(df_acccda)
## Joining dataframes (df_lang), (df_adrress) and (df_acccda) on CIF_NUM
from functools import reduce
Combine_CIFNUM= [df_acccda,df_lang,df_adrress ]
df_cifnum = reduce(lambda left,right: pd.merge(left,right,on='CIFNUM'), Combine_CIFNUM)
print(df_cifnum)
#convert multiple columns object byte to string
df_cifnumstr= df_cifnum.select_dtypes([np.object])
df_cifnumstr=df_cifnumstr.stack().str.decode('latin1').unstack()
for col in df_cifnumstr:
df_cifnum[col] = df_cifnumstr[col]
print(df_cifnum) ## Combined Data Frame
# Joining Mail list with df_cifnum(combined dataframe)
Join1_mailcifnum=pd.merge(df_maillist,df_cifnum, on=['ACOUNT_NUM','BRANCH_NUM'],how='left')
print(Join1_mailcifnum)
## dropping unwanted columns
Com_maillist= Join1_mailcifnum.drop(['CIFNUM','APLICTN_ID'], axis =1)
print(Com_maillist)
## concatenating Street Num + Street Name = Street Address
Com_maillist["Street Address"]=(Com_maillist['ADRES_STREET_NUM'].map(str)+ ' ' + Com_maillist['ADRES_STREET_NAME'].map(str))
print (Com_maillist.head())
## Rearranging columns
Final_maillist= Com_maillist[["CustomerName","ACOUNT_NUM","BRANCH_NUM","Street Address","City","Province","Postal Code","Country","Language Preference","Special Address"]]
print(Final_maillist)
## Export to excel
Final_maillist.to_excel(r'S:\Data Analysis\folder\Final_List.xlsx',index= False, sheet_name='Final_Maillist',header=True)```

Good code refactoring can be composed of many different steps, and depending on what your educator/client/manager/etc. expects, could involve vastly different amounts of effort and time spent. It's a good idea to ask this person what expectations they have for this specific project and start there.
However, for someone relatively new to Python I'd recommend you start with readability and organization. Make sure all your variable names are explicit and readable (assuming you're not using a required pattern like Hungarian notation). As a starting point, the Python naming conventions tend to use lowercase letters and underscores, with exceptions for certain objects or class names. Python actually has a really in-depth style guide called PEP-8. You can find it here
https://www.python.org/dev/peps/pep-0008/
A personal favorite of mine are comments. Comments should always contain the "why" of something, not necessarily the "how" (your code should be readable enough to make this part relatively obvious). This is a bit harder for smaller scripts or assignments where you don't have a ton of individual choice, but it's good to keep in mind.
If you've learned about object oriented programming, you should definitely split up tasks into functions and classes. In your specific case, you could create individual functions for things like loading files, performing specific operations on the file contents, and exporting. If you notice a bunch of functions that tend to have similar themes, that may be a good time to look into creating a class for those functions!
Finally, and again this is a personal preference (for basic scripts anyways), but I like to see a main declaration for readability and organization.
# imports go here!
# specific functions
def some_function():
return
if __name__ == "__main__":
# the start of your program goes here!
This is all pretty heavily simplified for the purposes of just starting out. There are plenty of other resources that can go more in depth in organization, good practices, and optimization.
Best of luck!

Python - any property file or data format that is mostly free-form?

I'm about to roll my own property file parser. I've got a somewhat odd requirement where I need to be able to store metadata in an existing field of a GUI. The data needs to be easily parse-able and human readable, preferably with some flexibility in defining the data (no yaml for example).
I was thinking I could do something like this:
this is random text that is truly a description
.metadata.
owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla
I was thinking I could use something like '.metadata.' to denote that anything below that line is metadata to be parsed. Then, I would treat the properties almost like java properties where I would read each line in and build a map (or object) to hold the metadata, which would then be outputted and searchable via a simple web app.
My real question before I roll this on my own, is can anyone suggest a better method for solving this problem? A specific data format or library that would fit this use case? I would normally use something like yaml or the like, but there's no good way for me to validate that the data is indeed in yaml format when it is saved.

You have 3 problems:
How to fit two different things into one box.
If you are mixing free form text and something that is more tightly defined, you are always going to end up with stuff that you can't parse. Then you will have a never ending battle of trying to deal with the rubbish that gets put in. Is there really no other way?
How to define a simple format for metadata that is robust enough for simple use.
This is a hard problem - all attempts to do so seem to expand until they become quite complicated (e.g. YAML). You will probably have custom requirements for your domain, so what you've proposed may be best.
How to parse that format.
For this I would recommend parsy.
It would be quite simple to split the text on .metadata. and then parse what remains.
Here is an example using parsy:
from parsy import *
attribute = letter.at_least(1).concat()
name = attribute.sep_by(string("."))
value = regex(r"[^\n]+")
definition = seq(name << string(":") << string(" ").many(), value)
metadata = definition.sep_by(string("\n"))
Example usage:
>>> metadata.parse_partial("""owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla""")
([[['owner', 'first'], 'rick'],
[['owner', 'second'], 'bob'],
[['property'], 'blue'],
[['pets', 'mammals', 'dog'], 'rufus'],
[['pets', 'mammals', 'cat'], 'ludmilla']],
'')

YAML is a simple and nice solution. There is a YAML library in Python:
import yaml
output = {'a':1,'b':{'c':output = {'a':1,'b':{'c':[2,3,4]}}}}
print yaml.dump(output,default_flow_style=False)
Giving as a result:
a: 1
b:
c:
- 2
- 3
- 4
You can also parse from string and so. Just explore it and check if it fits your requeriments.
Good luck!

Extracting information from unconventional text files? (Python)

I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains some python code which names a sequence of lists. They look something like this:
#PHASE = 0
x = np.array(1,2,...)
y = np.array(3,4,...)
z = np.array(5,6,...)
#PHASE = 30
x = np.array(1,4,...)
y = np.array(2,5,...)
z = np.array(3,6,...)
#PHASE = 40
...
And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into it's own file which can then be read by ascii.read() as a Table object for manipulation in a different section of code.
My current method is extremely inefficient, both in terms of resources and time/energy required to assemble. It goes something like this: Start with a function
def makeTable(a,b,c):
output = Table()
output['x'] = a
output['y'] = b
output['z'] = c
return output
Then for each phase, I have manually copy-pasted the relevant part of the text file into a cell and appended a line of code
fileName_phase = makeTable(a,b,c)
Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.
Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.
This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?

If efficiency and code reuse instead of copy is the goal, I think that Classes might provide a good way. I'm going to sleep now, but I'll edit later. Here's my thoughts: create a class called FileWithArrays and use a parser to read the lines and put them inside the object FileWithArrays you will create using the class. Once that's done, you can then create a method to transform the object in a table.
P.S. A good idea for the parser is to store all the lines in a list and parse them one by one, using list.pop() to auto shrink the list. Hope it helps, tomorrow I'll look more on it if this doesn't help a lot. Try to rewrite/reformat the question if I misunderstood anything, it's not very easy to read.

I will suggest a way which will be scorned by many but will get your work done.
So apologies to every one.
The prerequisites for this method is that you absolutely trust the correctness of the input files. Which I guess you do. (After all he is your collaborator).
So the key point here is that the text in the file is code which means it can be executed.
So you can do something like this
import re
import numpy as np # this is for the actual code in the files. You might have to install numpy library for this to work.
file = open("xyz.txt")
content = file.read()
Now that you have all the content, you have to separate it by phase.
For this we will use the re.split function.
phase_data = re.split("#PHASE = .*\n", content)
Now we have the content of each phase in an array.
Now comes for the part of executing it.
for phase in phase_data:
if len(phase.strip()) == 0:
continue
exec(phase)
table = makeTable(x, y, z) # the x, y and z are defined by the exec.
# do whatever you want with the table.
I will reiterate that you have to absolutely trust the contents of the file. Since you are executing it as code.
But your work seems like a scripting one and I believe this will get your work done.
PS : The other "safer" alternative to exec is to have a sandboxing library which takes the string and executes it without affecting the parent scope.

To avoid the safety issue of using exec as suggested by #Ajay Brahmakshatriya, but keeping his first processing step, you can create your own minimal 'phase parser', something like:
VARS = 'xyz'
def makeTable(phase):
assert len(phase) >= 3
output = Table()
for i in range(3):
line = [s.strip() for s in phase[i].split('=')]
assert len(line) == 2
var, arr = line
assert var == VARS[i]
assert arr[:10]=='np.array([' and arr[-2:]=='])'
output[var] = np.fromstring(arr[10:-2], sep=',')
return output
and then call
table = makeTable(phase)
instead of
exec(phase)
table = makeTable(x, y, z)
You could also skip all these assert statements without compromising safety, if the file is corrupted or not formatted as expected the error that will be thrown might just be harder to understand...

Fastest way to compare and replace key value pairs in Python

I have a number of files where I want to replace all instances of a specific string with another one.
I currently have this code:
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# Open file for substitution
replaceFile = open('file', 'r+')
# read in all the lines
lines = replaceFile.readlines()
# seek to the start of the file and truncate
# (this is cause i want to do an "inline" replace
replaceFile.seek(0)
replaceFile.truncate()
# Loop through each line from file
for line in lines:
# Loop through each Key in the mappings dict
for i in mappings.keys():
# if the key appears in the line
if i in line:
# do replacement
line = line.replace(i, mappings[i])
# Write the line to the file and move to next line
replaceFile.write(line)
This works ok, but it is very slow for the size of the mappings and the size of the files I am dealing with.
For instance, in the "mappings" dict there are 60728 key value pairs.
I need to process up to 50 files and replace all instances of "key" with the corresponding value, and each of the 50 files is approximately 250000 lines.
There are also multiple instances where there are multiple keys that need to be replaced on the one line, hence I cant just find the first match and then move on.
So my question is:
Is there a faster way to do the above?
I have thought about using a regex, but I am not sure how to craft one that will do multiple in-line replaces using key/value pairs from a dict.
If you need more info, let me know.

If this performance is slow, you'll have to find something fancy. It's just about all running at C-level:
for filename in filenames:
with open(filename, 'r+') as f:
data = f.read()
f.seek(0)
f.truncate()
for k, v in mappings.items():
data = data.replace(k, v)
f.write(data)
Note that you can run multiple processes where each process tackles a portion of the total list of files. That should make the whole job a lot faster. Nothing fancy, just run multiple instances off the shell, each with a different file list.
Apparently str.replace is faster than regex.sub.
So I got to thinking about this a bit more: suppose you have a really huge mappings. So much so that the likelihood of any one key in mappings being detected in your files is very low. In this scenario, all the time will be spent doing the searching (as pointed out by #abarnert).
Before resorting to exotic algorithms, it seems plausible that multiprocessing could at least be used to do the searching in parallel, and thereafter do the replacements in one process (you can't do replacements in multiple processes for obvious reasons: how would you combine the result?).
So I decided to finally get a basic understanding of multiprocessing, and the code below looks like it could plausibly work:
import multiprocessing as mp
def split_seq(seq, num_pieces):
# Splits a list into pieces
start = 0
for i in xrange(num_pieces):
stop = start + len(seq[i::num_pieces])
yield seq[start:stop]
start = stop
def detect_active_keys(keys, data, queue):
# This function MUST be at the top-level, or
# it can't be pickled (multiprocessing using pickling)
queue.put([k for k in keys if k in data])
def mass_replace(data, mappings):
manager = mp.Manager()
queue = mp.Queue()
# Data will be SHARED (not duplicated for each process)
d = manager.list(data)
# Split the MAPPINGS KEYS up into multiple LISTS,
# same number as CPUs
key_batches = split_seq(mappings.keys(), mp.cpu_count())
# Start the key detections
processes = []
for i, keys in enumerate(key_batches):
p = mp.Process(target=detect_active_keys, args=(keys, d, queue))
# This is non-blocking
p.start()
processes.append(p)
# Consume the output from the queues
active_keys = []
for p in processes:
# We expect one result per process exactly
# (this is blocking)
active_keys.append(queue.get())
# Wait for the processes to finish
for p in processes:
# Note that you MUST only call join() after
# calling queue.get()
p.join()
# Same as original submission, now with MUCH fewer keys
for key in active_keys:
data = data.replace(k, mappings[key])
return data
if __name__ == '__main__':
# You MUST call the mass_replace function from
# here, due to how multiprocessing works
filenames = <...obtain filenames...>
mappings = <...obtain mappings...>
for filename in filenames:
with open(filename, 'r+') as f:
data = mass_replace(f.read(), mappings)
f.seek(0)
f.truncate()
f.write(data)
Some notes:
I have not executed this code yet! I hope to test it out sometime but it takes time to create the test files and so on. Please consider it as somewhere between pseudocode and valid python. It should not be difficult to get it to run.
Conceivably, it should be pretty easy to use multiple physical machines, i.e. a cluster with the same code. The docs for multiprocessing show how to work with machines on a network.
This code is still pretty simple. I would love to know whether it improves your speed at all.
There seem to be a lot of hackish caveats with using multiprocessing, which I tried to point out in the comments. Since I haven't been able to test the code yet, it may be the case that I haven't used multiprocessing correctly anyway.

According to http://pravin.paratey.com/posts/super-quick-find-replace, regex is the fastest way to go for Python. (Building a Trie data structure would be fastest for C++) :
import sys, re, time, hashlib
class Regex:
# Regex implementation of find/replace for a massive word list.
def __init__(self, mappings):
self._mappings = mappings
def replace_func(self, matchObj):
key = matchObj.group(0)
if self._mappings.has_key(key):
return self._mappings[key]
else:
return key
def replace_all(self, filename):
text = ''
with open(filename, 'r+') as fp
text = fp.read()
text = re.sub("[a-zA-Z]+", self.replace_func, text)
fp = with open(filename, "w") as fp:
fp.write(text)
# mapping dictionary of (find, replace) tuples defined
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# initialize regex class with mapping tuple dictionary
r = Regex(mappings)
# replace file
r.replace_all( 'file' )

The slow part of this is the searching, not the replacing. (Even if I'm wrong, you can easily speed up the replacing part by first searching for all the indices, then splitting and replacing from the end; it's only the searching part that needs to be clever.)
Any naive mass string search algorithm is obviously going to be O(NM) for an N-length string and M substrings (and maybe even worse, if the substrings are long enough to matter). An algorithm that searched M times at each position, instead of M times over the whole string, might be offer some cache/paging benefits, but it'll be a lot more complicated for probably only a small benefit.
So, you're not going to do much better than cjrh's implementation if you stick with a naive algorithm. (You could try compiling it as Cython or running it in PyPy to see if it helps, but I doubt it'll help much—as he explains, all the inner loops are already in C.)
The way to speed it up is to somehow look for many substrings at a time. The standard way to do that is to build a prefix tree (or suffix tree), so, e.g, "original-1" and "original-2" are both branches off the same subtree "original-", so they don't need to be handled separately until the very last character.
The standard implementation of a prefix tree is a trie. However, as Efficient String Matching: An Aid to Bibliographic Search and the Wikipedia article Aho-Corasick string matching algorithm explain, you can optimize further for this use case by using a custom data structure with extra links for fallbacks. (IIRC, this improves the average case by logM.)
Aho and Corasick further optimize things by compiling a finite state machine out of the fallback trie, which isn't appropriate to every problem, but sounds like it would be for yours. (You're reusing the same mappings dict 50 times.)
There are a number of variant algorithms with additional benefits, so it might be worth a bit of further research. (Common use cases are things like virus scanners and package filters, which might help your search.) But I think Aho-Corasick, or even just a plain trie, is probably good enough.
Building any of these structures in pure Python might add so much overhead that, at M~60000, the extra cost will defeat the M/logM algorithmic improvement. But fortunately, you don't have to. There are many C-optimized trie implementations, and at least one Aho-Corasick implementation, on PyPI. It also might be worth looking at something like SuffixTree instead of using a generic trie library upside-down if you think suffix matching will work better with your data.
Unfortunately, without your data set, it's hard for anyone else to do a useful performance test. If you want, I can write test code that uses a few different modules, that you can then run against you data. But here's a simple example using ahocorasick for the search and a dumb replace-from-the-end implementation for the replace:
tree = ahocorasick.KeywordTree()
for key in mappings:
tree.add(key)
tree.make()
for start, end in reversed(list(tree.findall(target))):
target = target[:start] + mappings[target[start:end]] + target[end:]

This use a with block to prevent leaking file descriptors. The string replace function will ensure all instances of key get replaced within the text.
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# Open file for substitution
with open('file', 'r+') as fd:
# read in all the data
text = fd.read()
# seek to the start of the file and truncate so file will be edited inline
fd.seek(0)
fd.truncate()
for key in mappings.keys():
text = text.replace(key, mappings[key])
fd.write(text)

Writing a compiler for a DSL in python

I am writing a game in python and have decided to create a DSL for the map data files. I know I could write my own parser with regex, but I am wondering if there are existing python tools which can do this more easily, like re2c which is used in the PHP engine.
Some extra info:
Yes, I do need a DSL, and even if I didn't I still want the experience of building and using one in a project.
The DSL contains only data (declarative?), it doesn't get "executed". Most lines look like:
SOMETHING: !abc #123 #xyz/123
I just need to read the tree of data.

I've always been impressed by pyparsing. The author, Paul McGuire, is active on the python list/comp.lang.python and has always been very helpful with any queries concerning it.

Here's an approach that works really well.
abc= ONETHING( ... )
xyz= ANOTHERTHING( ... )
pqr= SOMETHING( this=abc, that=123, more=(xyz,123) )
Declarative. Easy-to-parse.
And...
It's actually Python. A few class declarations and the work is done. The DSL is actually class declarations.
What's important is that a DSL merely creates objects. When you define a DSL, first you have to start with an object model. Later, you put some syntax around that object model. You don't start with syntax, you start with the model.

Yes, there are many -- too many -- parsing tools, but none in the standard library.
From what what I saw PLY and SPARK are popular. PLY is like yacc, but you do everything in Python because you write your grammar in docstrings.
Personally, I like the concept of parser combinators (taken from functional programming), and I quite like pyparsing: you write your grammar and actions directly in python and it is easy to start with. I ended up producing my own tree node types with actions though, instead of using their default ParserElement type.
Otherwise, you can also use existing declarative language like YAML.

I have written something like this in work to read in SNMP notification definitions and automatically generate Java classes and SNMP MIB files from this. Using this little DSL, I could write 20 lines of my specification and it would generate roughly 80 lines of Java code and a 100 line MIB file.
To implement this, I actually just used straight Python string handling (split(), slicing etc) to parse the file. I find Pythons string capabilities to be adequate for most of my (simple) parsing needs.
Besides the libraries mentioned by others, if I were writing something more complex and needed proper parsing capabilities, I would probably use ANTLR, which supports Python (and other languages).

For "small languages" as the one you are describing, I use a simple split, shlex (mind that the # defines a comment) or regular expressions.
>>> line = 'SOMETHING: !abc #123 #xyz/123'
>>> line.split()
['SOMETHING:', '!abc', '#123', '#xyz/123']
>>> import shlex
>>> list(shlex.shlex(line))
['SOMETHING', ':', '!', 'abc', '#', '123']
The following is an example, as I do not know exactly what you are looking for.
>>> import re
>>> result = re.match(r'([A-Z]*): !([a-z]*) #([0-9]*) #([a-z0-9/]*)', line)
>>> result.groups()
('SOMETHING', 'abc', '123', 'xyz/123')

DSLs are a good thing, so you don't need to defend yourself :-)
However, have you considered an internal DSL ? These have so many pros versus external (parsed) DSLs that they're at least worth consideration. Mixing a DSL with the power of the native language really solves lots of the problems for you, and Python is not really bad at internal DSLs, with the with statement handy.

On the lines of declarative python, I wrote a helper module called 'bpyml' which lets you declare data in python in a more XML structured way without the verbose tags, it can be converted to/from XML too, but is valid python.
https://svn.blender.org/svnroot/bf-blender/trunk/blender/release/scripts/modules/bpyml.py
Example Use
http://wiki.blender.org/index.php/User:Ideasman42#Declarative_UI_In_Blender

Here is a simpler approach to solve it
What if I can extend python syntax with new operators to introduce new functionally to the language? For example, a new operator <=> for swapping the value of two variables.
How can I implement such behavior? Here comes AST module.
The last module is a handy tool for handling abstract syntax trees. What’s cool about this module is it allows me to write python code that generates a tree and then compiles it to python code.
Let’s say we want to compile a superset language (or python-like language) to python:
from :
a <=> b
to:
a , b = b , a
I need to convert my 'python like' source code into a list of tokens.
So I need a tokenizer, a lexical scanner for Python source code. Tokenize module
I may use the same meta-language to define both the grammar of new 'python-like' language and then build the structure of the abstract syntax tree AST
Why use AST?
AST is a much safer choice when evaluating untrusted code
manipulate the tree before executing the code Working on the Tree
from tokenize import untokenize, tokenize, NUMBER, STRING, NAME, OP, COMMA
import io
import ast
s = b"a <=> b\n" # i may read it from file
b = io.BytesIO(s)
g = tokenize(b.readline)
result = []
for token_num, token_val, _, _, _ in g:
# naive simple approach to compile a<=>b to a,b = b,a
if token_num == OP and token_val == '<=' and next(g).string == '>':
first = result.pop()
next_token = next(g)
second = (NAME, next_token.string)
result.extend([
first,
(COMMA, ','),
second,
(OP, '='),
second,
(COMMA, ','),
first,
])
else:
result.append((token_num, token_val))
src = untokenize(result).decode('utf-8')
exp = ast.parse(src)
code = compile(exp, filename='', mode='exec')
def my_swap(a, b):
global code
env = {
"a": a,
"b": b
}
exec(code, env)
return env['a'], env['b']
print(my_swap(1,10))
Other modules using AST, whose source code may be a useful reference:
textX-LS: A DSL used to describe a collection of shapes and draw it for us.
pony orm: You can write database queries using Python generators and lambdas with translate to SQL query sting—pony orm use AST under the hood
osso: Role Based Access Control a framework handle permissions.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Best Practices for Text Generation in Python - python

Related

How to use refactoring method with functions on python code?

Python - any property file or data format that is mostly free-form?

Extracting information from unconventional text files? (Python)

Fastest way to compare and replace key value pairs in Python

Writing a compiler for a DSL in python

Categories

Resources