Parsing an existing config file - python

I have a config file that is in the following form:
protocol sample_thread {
{ AUTOSTART 0 }
{ BITMAP thread.gif }
{ COORDS {0 0} }
{ DATAFORMAT {
{ TYPE hl7 }
{ PREPROCS {
{ ARGS {{}} }
{ PROCS sample_proc }
} }
} }
}
The real file may not have these exact fields, and I'd rather not have to describe the structure of the data to the parser before it parses.
I've looked for other configuration file parsers, but none that I've found seem to be able to accept a file of this syntax.
I'm looking for a module that can parse a file like this, any suggestions?
If anyone is curious, the file in question was generated by Quovadx Cloverleaf.

pyparsing is pretty handy for quick and simple parsing like this. A bare minimum would be something like:
import pyparsing
string = pyparsing.CharsNotIn("{} \t\r\n")
group = pyparsing.Forward()
# parenthesize the whole alternation: << binds tighter than |
group << (pyparsing.Group(pyparsing.Literal("{").suppress() +
                          pyparsing.ZeroOrMore(group) +
                          pyparsing.Literal("}").suppress())
          | string)
toplevel = pyparsing.OneOrMore(group)
Then use it as:
>>> toplevel.parseString(text)
['protocol', 'sample_thread', [['AUTOSTART', '0'], ['BITMAP', 'thread.gif'],
['COORDS', ['0', '0']], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS',
[['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]
From there you can get as sophisticated as you want (parse numbers separately from strings, look for specific field names, etc.). The above is pretty general: it just looks for strings (defined as any run of non-whitespace characters other than "{" and "}") and {}-delimited lists of strings.

Taking Brian's pyparsing solution another step, you can create a quasi-deserializer for this format by using the Dict class:
import pyparsing
string = pyparsing.CharsNotIn("{} \t\r\n")
# use Word instead of CharsNotIn, to do whitespace skipping
stringchars = pyparsing.printables.replace("{","").replace("}","")
string = pyparsing.Word( stringchars )
# define a simple integer, plus auto-converting parse action
integer = pyparsing.Word("0123456789").setParseAction(lambda t : int(t[0]))
group = pyparsing.Forward()
group << ( pyparsing.Group(pyparsing.Literal("{").suppress() +
pyparsing.ZeroOrMore(group) +
pyparsing.Literal("}").suppress())
| integer | string )
toplevel = pyparsing.OneOrMore(group)
sample = """
protocol sample_thread {
{ AUTOSTART 0 }
{ BITMAP thread.gif }
{ COORDS {0 0} }
{ DATAFORMAT {
{ TYPE hl7 }
{ PREPROCS {
{ ARGS {{}} }
{ PROCS sample_proc }
} }
} }
}
"""
print toplevel.parseString(sample).asList()
# Now define something a little more meaningful for a protocol structure,
# and use Dict to auto-assign results names
LBRACE,RBRACE = map(pyparsing.Suppress,"{}")
protocol = ( pyparsing.Keyword("protocol") +
string("name") +
LBRACE +
pyparsing.Dict(pyparsing.OneOrMore(
pyparsing.Group(LBRACE + string + group + RBRACE)
) )("parameters") +
RBRACE )
results = protocol.parseString(sample)
print results.name
print results.parameters.BITMAP
print results.parameters.keys()
print results.dump()
Prints
['protocol', 'sample_thread', [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS',
[0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]
sample_thread
thread.gif
['DATAFORMAT', 'COORDS', 'AUTOSTART', 'BITMAP']
['protocol', 'sample_thread', [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS', [0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]]
- name: sample_thread
- parameters: [['AUTOSTART', 0], ['BITMAP', 'thread.gif'], ['COORDS', [0, 0]], ['DATAFORMAT', [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]]]
- AUTOSTART: 0
- BITMAP: thread.gif
- COORDS: [0, 0]
- DATAFORMAT: [['TYPE', 'hl7'], ['PREPROCS', [['ARGS', [[]]], ['PROCS', 'sample_proc']]]]
I think you will get further faster with pyparsing.
-- Paul

I'll try and answer what I think is the missing question(s)...
Configuration files come in many formats. There are well known formats such as *.ini or apache config - these tend to have many parsers available.
Then there are custom formats. That is what yours appears to be (it could be some well-defined format you and I have never seen before - but until you know what that is it doesn't really matter).
I would start with the software this came from and see if they have a programming API that can load/produce these files. If nothing is obvious give Quovadx a call. Chances are someone has already solved this problem.
Otherwise you're probably on your own to create your own parser.
Writing a parser for this format would not be terribly difficult assuming that your sample is representative of a complete example. It's a hierarchy of values where each node can contain either a value or a child hierarchy of values. Once you've defined the basic types that the values can contain the parser is a very simple structure.
You could write this reasonably quickly using something like Lex/Flex or just a straight-forward parser in the language of your choosing.
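As a rough illustration of how small such a hand-written parser can be, here is a minimal sketch in Python. It assumes the format consists only of braces and whitespace-separated bare words, as in the sample; the function names `tokenize` and `parse` are made up for this example, and quoting or escape rules that the real Cloverleaf format may have are not handled.

```python
import re

def tokenize(text):
    # Each brace is its own token; anything else is a run of
    # characters that are neither braces nor whitespace.
    return re.findall(r"[{}]|[^{}\s]+", text)

def parse(tokens):
    """Turn a flat token list into nested Python lists."""
    stack = [[]]  # stack[-1] is the list currently being filled
    for tok in tokens:
        if tok == "{":
            stack.append([])
        elif tok == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced '{'")
    return stack[0]
```

Calling `parse(tokenize(text))` on the sample yields nested lists of strings, which can then be post-processed into whatever structure is convenient.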

You can easily write a Python script that converts this to a Python dict. The format looks almost like hierarchical name-value pairs. The only problem seems to be
COORDS {0 0}, where {0 0} isn't a name-value pair but a list,
so who knows what other such cases exist in the format.
I think your best bet is to get a spec for the format and write a simple Python script to read it.

Your config file is similar in shape to JSON: replace every "{" and "}" with "[" and "]", then quote the bare words and add commas between items. Most languages have a built-in JSON parser (PHP, Ruby, Python, etc.), and if not, there are libraries available to handle it for you.
If you cannot change the format of the configuration file, you can read the whole file as a string and make those replacements by whatever means you prefer. Then you can parse the string as JSON, and you're set.
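A plain character swap is not quite enough, because JSON also needs quoted strings and commas between items. The sketch below shows one possible way to do the conversion in Python. The helper name `braces_to_json` is hypothetical, and it assumes the file contains only braces and whitespace-separated bare words.

```python
import json
import re

def braces_to_json(text):
    """Rewrite brace-delimited text as a JSON array literal."""
    tokens = re.findall(r"[{}]|[^{}\s]+", text)
    pieces = []
    prev = "["  # pretend we just opened the outer array
    for tok in tokens:
        # Braces become brackets; every other token becomes a JSON string.
        cur = {"{": "[", "}": "]"}.get(tok, json.dumps(tok))
        # A comma is needed between two items, but not right after
        # an opening bracket or right before a closing one.
        if prev != "[" and cur != "]":
            pieces.append(",")
        pieces.append(cur)
        prev = cur
    return "[" + "".join(pieces) + "]"
```

The result can be passed to `json.loads`, which returns nested lists of strings. Every value comes back as a string, so numbers would need converting afterwards.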

I searched a little on the Cheese Shop, but I didn't find anything helpful for your example. Check the Examples page, and this specific parser (its syntax resembles yours a bit). I think this should help you write your own.

Look into LEX and YACC. A bit of a learning curve, but they can generate parsers for any language.

Maybe you could write a simple script that converts your config into an XML file and then read it using lxml, Beautiful Soup or anything else? Your converter could use PyParsing or regular expressions, for example.

Related

Snakemake expand() arguments

I inherited a complicated Snakemake setup. It uses a configfile that contains
{
"sub": [
1234,
],
"ses": [
"1"
],
"task": [
"fake"
],
"run": [
"1"
],
"acq": [
"mb"
],
"bids_dir": "../../bids"
In the all recipe, it uses for input calls to expand() that look like this.
expand('data/{task}/preproc/acq-{acq}/sub-{sub}/ses-{ses}/run-{run}/bold.nii', **config)
Then, I have a recipe that looks like this:
rule getRawFunc:
input:
rawFunc = config['bids_dir'] + '/sub-{sub}/ses-{ses}/func/sub-{sub}_ses-{ses}_task-{task}_acq-{acq}_run-{run}_bold.nii.gz'
output:
func = temp('data/{task}/preproc/acq-{acq}/sub-{sub}/ses-{ses}/run-{run}/bold.nii')
shell:
'gunzip -c {input} > {output}'
I am not understanding why it needs config['bids_dir'] to get the value for that, but it seemingly does not need that to expand the values for {sub} and the like.
I looked at the section about expand at
https://snakemake.readthedocs.io/en/latest/snakefiles/configuration.html#standard-configuration
and that and the tutorials explain the use of config['bids_dir'] well, it's just that **config that I am not quite getting.
Further explication would be most appreciated!
In general, the expand function requires a template and the keyword arguments to use when filling in the template, like so:
expand('{a}_{b}', a='some', b='test')
# this will return ['some_test']
Now, in Python one can do dictionary unpacking by placing two asterisks before the dictionary, as in '**some_dict'. This passes each entry of the dictionary as a 'key=value' keyword argument. In the example above, we can get the same result by unpacking a dictionary:
some_dict = {'a': 'some', 'b': 'test'}
expand('{a}_{b}', **some_dict)
# this will return ['some_test']
You can read more details in this answer.
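To make the unpacking concrete without needing Snakemake installed, here is a small plain-Python stand-in. `expand_sketch` is a hypothetical helper written for this illustration, not Snakemake's implementation: it fills the template with every combination of the supplied values and ignores keys that the template does not mention, such as bids_dir.

```python
from itertools import product

def expand_sketch(pattern, **wildcards):
    """Rough stand-in for expand(): one string per combination of values."""
    # Only keep the keyword arguments that actually appear in the pattern.
    names = [n for n in wildcards if "{" + n + "}" in pattern]
    # A bare string counts as a single value, not a sequence of characters.
    values = [[wildcards[n]] if isinstance(wildcards[n], str) else wildcards[n]
              for n in names]
    return [pattern.format(**dict(zip(names, combo)))
            for combo in product(*values)]

config = {
    "sub": [1234], "ses": ["1"], "task": ["fake"],
    "run": ["1"], "acq": ["mb"], "bids_dir": "../../bids",
}
targets = expand_sketch(
    "data/{task}/preproc/acq-{acq}/sub-{sub}/ses-{ses}/run-{run}/bold.nii",
    **config
)
```

Inside a rule's input and output, by contrast, {sub} and friends are wildcards that Snakemake fills in by matching the requested output file, not values read from the config. That is why the rule has to spell out config['bids_dir'] while the placeholders need no such lookup.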

Modifying a python dictionary from user inputted dot notation

I'm trying to provide an API like interface in my Django python application that allows someone to input an id and then also include key/values with the request as form data.
For example the following field name and values for ticket 111:
ticket.subject = Hello World
ticket.group_id = 12345678
ticket.collaborators = [123, 4567, 890]
ticket.custom_fields: [{id: 32656147,value: "something"}]
On the backend, I have a corresponding Dict that should match this structure (and i'd do validation). Something like this:
ticket: {
subject: "some subject I want to change",
group_id: 99999,
collaborator_ids: [ ],
custom_fields: [
{
id: 32656147,
value: null
}
]
}
1) I'm not sure exactly the best way to parse the dot notation there, and
2) Assuming I am able to parse it, how would I be able to change the values of the Dict to match what was passed in. I'd imagine maybe something like a class with these inputs?
class SetDictValueFromUserInput(userDotNotation, userNewValue, originalDict)
...
SetDictValueFromUserInput("ticket.subject", "hello world", myDict)
The fastest way is probably to split the string and index based on the separator. For example:
obj = "ticket.subject".split(".")
actual_obj = eval(obj[0])  # risky: eval runs arbitrary code; you can avoid it with if statements and predefined variables
actual_obj[obj[1]] = value
To support deeper paths, such as ticket.subject.name, walk the intermediate keys with a for loop:
for key in obj[1:-1]:  # every key between the object name and the final key
    actual_obj = actual_obj[key]  # descend one level
actual_obj[obj[-1]] = value
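The same idea can be written without eval by starting from the dictionary itself. The following is a minimal sketch; `set_by_path` is a hypothetical helper name, and it assumes every path component is a dictionary key (list indices are not handled) and that only existing keys may be overwritten.

```python
def set_by_path(data, dotted_path, value):
    """Set data[k1][k2]...[kn] = value for a path like 'k1.k2.kn'."""
    keys = dotted_path.split(".")
    target = data
    for key in keys[:-1]:
        target = target[key]  # KeyError here means the path does not exist
    if keys[-1] not in target:
        # Refuse to create new keys, so user input cannot add fields.
        raise KeyError(dotted_path)
    target[keys[-1]] = value

my_dict = {"ticket": {"subject": "some subject I want to change", "group_id": 99999}}
set_by_path(my_dict, "ticket.subject", "hello world")
```

Rejecting unknown keys doubles as a basic validation step, since the backend dictionary then defines which fields a request is allowed to touch.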

Line-length based custom python JSON encoding for serializables

My problem is similar to Can I implement custom indentation for pretty-printing in Python’s JSON module? and How to change json encoding behaviour for serializable python object? but instead I'd like to collapse lines together if the entire JSON encoded structure can fit on that single line, with configurable line length, in Python 2.X and 3.X. The output is intended for easy-to-read documentation of the JSON structures, rather than debugging. Clarifying: the result MUST be valid JSON, and allow for the regular JSON encoding features of OrderedDicts/sort_keys, default handlers, and so forth.
The solution from custom indentation does not apply, as the individual structures would need to know their serialized lengths in advance, thus adding a NoIndent class doesn't help as every structure might or might not be indented. The solution from changing the behavior of json serializable does not apply as there aren't any (weird) custom overrides on the data structures, they're just regular lists and dicts.
For example, instead of:
{
"@context": "http://linked.art/ns/context/1/full.jsonld",
"id": "http://lod.example.org/museum/ManMadeObject/0",
"type": "ManMadeObject",
"classified_as": [
"aat:300033618",
"aat:300133025"
]
}
I would like to produce:
{
"@context": "http://linked.art/ns/context/1/full.jsonld",
"id": "http://lod.example.org/museum/ManMadeObject/0",
"type": "ManMadeObject",
"classified_as": ["aat:300033618", "aat:300133025"]
}
At any level of nesting within the structure, and across any numbers of levels of nesting until the line length was reached. Thus if there was a list with a single object inside, with a single key/value pair, it would become:
{
"@context": "http://linked.art/ns/context/1/full.jsonld",
"id": "http://lod.example.org/museum/ManMadeObject/0",
"type": "ManMadeObject",
"classified_as": [{"id": "aat:300033618"}]
}
It seems like a recursive descent parser on the indented output would work, along the lines of @robm's approach to custom indentation, but the complexity seems to quickly approach that of writing a JSON parser and serializer.
Otherwise it seems like a very custom JSONEncoder is needed.
Your thoughts appreciated!
Very inefficient, but seems to work so far:
import re

def _collapse_json(text, collapse):
    js_indent = 2
    lines = text.splitlines()
    out = [lines[0]]
    while lines:
        l = lines.pop(0)
        # width of the leading whitespace on this line
        indent = len(re.split(r'\S', l, 1)[0])
        if indent and l.rstrip()[-1] in ['[', '{']:
            # start of a nested structure: gather lines up to its closing bracket
            curr = indent
            temp = []
            stemp = []
            while lines and curr <= indent:
                if temp and curr == indent:
                    break
                temp.append(l[curr:])
                stemp.append(l.strip())
                l = lines.pop(0)
                indent = len(re.split(r'\S', l, 1)[0])
            temp.append(l[curr:])
            stemp.append(l.lstrip())
            short = " " * curr + ''.join(stemp)
            if len(short) < collapse:
                # the whole structure fits on one line
                out.append(short)
            else:
                # too long: recurse so inner structures can still collapse
                ntext = '\n'.join(temp)
                nout = _collapse_json(ntext, collapse)
                for no in nout:
                    out.append(" " * curr + no)
        elif indent:
            out.append(l)
    out.append(l)
    return out

def collapse_json(text, collapse):
    return '\n'.join(_collapse_json(text, collapse))
Happy to accept something else that produces the same output without crawling up and down constantly!
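One way to avoid post-processing the indented text is to make the fit-or-expand decision while serializing. The sketch below is an alternative approach, not a drop-in replacement for a JSONEncoder subclass: `dumps_collapsed` and its parameters are names invented for this example, it supports sort_keys but leaves out default handlers for brevity, and it assumes dictionary keys are strings.

```python
import json

def dumps_collapsed(obj, width=80, indent=2, sort_keys=False, _level=0, _used=0):
    """Serialize obj as JSON, keeping any sub-structure on one line if it fits.

    _level is the current nesting depth; _used counts characters already
    claimed on the current line besides indentation (e.g. '"key": ').
    """
    flat = json.dumps(obj, sort_keys=sort_keys)
    fits = _level * indent + _used + len(flat) <= width
    # Scalars and empty containers cannot be split any further.
    if fits or not isinstance(obj, (list, dict)) or not obj:
        return flat

    pad = " " * (indent * (_level + 1))
    if isinstance(obj, dict):
        items = sorted(obj.items()) if sort_keys else obj.items()
        parts = []
        for key, value in items:
            head = json.dumps(key) + ": "
            # +1 leaves room for a trailing comma
            body = dumps_collapsed(value, width, indent, sort_keys,
                                   _level + 1, len(head) + 1)
            parts.append(pad + head + body)
        opener, closer = "{", "}"
    else:
        parts = [pad + dumps_collapsed(v, width, indent, sort_keys, _level + 1, 1)
                 for v in obj]
        opener, closer = "[", "]"
    return opener + "\n" + ",\n".join(parts) + "\n" + " " * (indent * _level) + closer
```

Each structure is serialized flat once per nesting level it sits under, so this is still not linear, but it never has to re-parse text, and the output is valid JSON by construction because every piece comes from json.dumps.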

Comparing Swift and Python Dictionary objects

I'm trying to get familiar with Swift, so I'm doing some basic computations that I would normally do in Python.
I want to get a value from a dictionary using a key. In Python I would simply :
sequences = ["ATG","AAA","TAG"]
D_codon_aa = {"ATG": "M", "AAA": "R", "TAG": "*"}
for seq in sequences:
    print(D_codon_aa[seq])
>>>
M
R
*
When I try this in Swift.
let sequences = ["ATG","AAA","TAG"]
let D_codon_aa = ["ATG": "M", "AAA": "R", "TAG": "*"]
for seq in sequences
{
var codon = D_codon_aa[seq]
println(codon)
}
>>>
Optional("M")
Optional("R")
Optional("*")
1) What is Optional() and why is it around the dictionary value?
2) Why can't I make a dictionary with multiple types of objects inside?
In Python I can do this:
sequence= {'A':0,'C':1, 'G':'2', 'T':3.0}
In Swift I can't do this:
let sequences = ["A":0,"C":1, "G":"2", "T":3.0]
1:
Look at the declaration of the dictionary's subscript:
subscript(key: Key) -> Value?
It returns an optional because you can pass any key to the subscript, but the key might not be associated with a value. In that case it returns nil; otherwise it returns the value wrapped in an optional.
2: Actually, you can, if you declare your dictionary as, e.g., [String: AnyObject]. Then you can associate keys with any values that conform to the AnyObject protocol.
Updated
And your example let sequences = ["A":0,"C":1, "G":"2", "T":3.0] compiles fine in Xcode 6.1.1.

Find string of method in a file

Right now I have this to find a method
def getMethod(text, a, filetype):
    start = a
    fin = a
    if filetype == "cs":
        for x in range(a, 0, -1):
            if text[x] == "{":
                start = x
                break
        for x in range(a, len(text)):
            if text[x] == "}":
                fin = x
                break
    return text[start:fin + 1]
How can I get the method that the index a is in?
I can't just search for { and } because there can be things like new { }, which breaks this approach.
If I had a file with a few methods and wanted to find which method a given index is in, I would want the body of that method. For example, if I had the file
private string x(){
return "x";
}
private string b(){
return "b";
}
private string z(){
return "z";
}
private string a(){
var n = new {l = "l"};
return "a";
}
And I got the index of "a", which let's say is 100,
then I want to find the body of that method, that is, everything within { and }.
So this...
{
var n = new {l = "l"};
return "a";
}
But using what I have now it would return:
{l = "l"};
return "a";
}
If my interpretation is correct, it seems you are attempting to parse C# source code to find the C# method that includes a given position a in a .cs file, the content of which is contained in text.
Unfortunately, if you want to do a complete and accurate job, I think you would need a full C# parser.
If that sounds like too much work, I'd think about using a version of ctags that is compatible with C# to generate a tag file and then search in the tag file for the method that applies to a given source file line instead of the original source file.
As Simon stated, if your problem is to parse source code, the best bet is to get a proper parser for that language.
If you're just looking to match up the braces however, there is a well-known algorithm for that: Python parsing bracketed blocks
Just be aware that since source code is a complex beast, don't expect this to work 100%.
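For reference, the brace-matching idea can be sketched with a stack of open-brace positions. `enclosing_block` is a hypothetical helper written for this illustration. It assumes braces are balanced and it does not recognise braces inside string literals or comments, so it is a heuristic rather than a C# parser.

```python
def enclosing_block(text, pos, depth=0):
    """Return the {...} block at nesting level `depth` that contains text[pos].

    depth=0 selects an outermost block; use depth=1 if the methods sit
    inside a class body, and so on. Returns None if no block matches.
    """
    stack = []  # indices of '{' characters that are still open
    for i, ch in enumerate(text):
        if ch == "{":
            stack.append(i)
        elif ch == "}" and stack:
            start = stack.pop()
            # len(stack) is now the nesting level of the block just closed
            if len(stack) == depth and start <= pos <= i:
                return text[start:i + 1]
    return None
```

Because the inner new { } pair is opened and closed while the method's own brace is still on the stack, it is skipped and the whole method body is returned.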
