Parsing a text file with a special markup - python

I need to parse a DSL file using Python. A DSL file is a text file whose content is marked up with the special tags used by ABBYY Lingvo.
It looks like:
activate
[m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
[m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
[m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
[m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
{{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
{{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}
As far as I can see, the only option is to parse this file with regexps. But I doubt it can be done, since the tags in this format have a hierarchy, with some of them nested inside others.
I can't use the usual XML and HTML parsers. They are perfect for creating a tree structure of the document, but they are designed for the specific tags of HTML and XML.
What is the best way to parse a file in such a format? Is there any Python library for that purpose?

"some engine which allows to create a tree basing on nesting tag structure".
Look at http://www.dabeaz.com/ply/
You may be able to define the syntax quickly and easily as a set of Lexical rules and some grammar productions.
If you don't like that one, here's a list of alternatives.
http://wiki.python.org/moin/LanguageParsing
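To make the "lexical rules plus a tree" idea concrete, here is a minimal standard-library sketch. It is not PLY, just a hand-rolled tokenizer and a stack. The names TOKEN and parse are made up for illustration, and it assumes tags nest properly: a stray closing tag is ignored, and a tag that is never closed (such as [m1] in the sample) simply keeps everything after it as its children.

```python
import re

# Lexical rules: a closing tag, an opening tag (with an optional argument),
# or a run of plain text. Escaped brackets such as \[ count as text.
TOKEN = re.compile(
    r"""
    \[/(?P<close>[^\]\s]+)\]                         # closing tag, e.g. [/b]
  | \[(?P<open>[^\]/\s]+)(?:\s+(?P<arg>[^\]]*))?\]   # opening tag, e.g. [c darkcyan]
  | (?P<text>(?:\\.|[^\[\\])+)                       # plain text and escapes
    """,
    re.VERBOSE,
)


def parse(markup):
    """Build a tree of dicts from bracket-tagged markup using a stack."""
    root = {"tag": None, "arg": None, "children": []}
    stack = [root]
    for match in TOKEN.finditer(markup):
        if match.group("text") is not None:
            stack[-1]["children"].append(match.group("text"))
        elif match.group("open") is not None:
            node = {"tag": match.group("open"),
                    "arg": match.group("arg"),
                    "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Close the nearest open tag with this name; ignore stray closers.
            name = match.group("close")
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == name:
                    del stack[i:]
                    break
    return root


tree = parse(r"[c darkcyan]\[[b]activate[/b]\][/c] verb")
```

Once the tree exists, walking it to pull out, say, every [ex] node is an ordinary recursive function. A real grammar tool becomes worth it when the format has rules this sketch ignores.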

Using regular expressions for anything beyond trivial cases will cause heartache and pain.
If you insist on using a regex (NOT RECOMMENDED), look at the methods used HERE on XML
If by ".dsl" you mean the ABBYY Lingvo dictionary format, you may want to look at stardict. It can read the ABBYY DSL format.

Related

Python - any property file or data format that is mostly free-form?

I'm about to roll my own property file parser. I've got a somewhat odd requirement where I need to be able to store metadata in an existing field of a GUI. The data needs to be easily parse-able and human readable, preferably with some flexibility in defining the data (no yaml for example).
I was thinking I could do something like this:
this is random text that is truly a description
.metadata.
owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla
I was thinking I could use something like '.metadata.' to denote that anything below that line is metadata to be parsed. Then, I would treat the properties almost like java properties where I would read each line in and build a map (or object) to hold the metadata, which would then be outputted and searchable via a simple web app.
My real question, before I roll my own, is: can anyone suggest a better method for solving this problem? Is there a specific data format or library that would fit this use case? I would normally use something like YAML, but there's no good way for me to validate that the data is indeed in YAML format when it is saved.
You have 3 problems:
1. How to fit two different things into one box.
If you are mixing free form text and something that is more tightly defined, you are always going to end up with stuff that you can't parse. Then you will have a never ending battle of trying to deal with the rubbish that gets put in. Is there really no other way?
2. How to define a simple format for metadata that is robust enough for simple use.
This is a hard problem - all attempts to do so seem to expand until they become quite complicated (e.g. YAML). You will probably have custom requirements for your domain, so what you've proposed may be best.
3. How to parse that format.
For this I would recommend parsy.
It would be quite simple to split the text on .metadata. and then parse what remains.
Here is an example using parsy:
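from parsy import letter, regex, seq, string

# One segment of a dotted name, e.g. "owner"
attribute = letter.at_least(1).concat()
# A dotted name such as "pets.mammals.dog", parsed into a list of segments
name = attribute.sep_by(string("."))
# The value is everything up to the end of the line
value = regex(r"[^\n]+")
# "name: value", discarding the colon and any spaces after it
definition = seq(name << string(":") << string(" ").many(), value)
metadata = definition.sep_by(string("\n"))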
Example usage:
>>> metadata.parse_partial("""owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla""")
([[['owner', 'first'], 'rick'],
[['owner', 'second'], 'bob'],
[['property'], 'blue'],
[['pets', 'mammals', 'dog'], 'rufus'],
[['pets', 'mammals', 'cat'], 'ludmilla']],
'')
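If pulling in a parser library feels like too much, the same split-then-parse idea can be sketched with plain string methods. This is only an illustration: split_field and build_map are hypothetical names, and it assumes the marker sits on a line by itself and that each metadata line has the form "dotted.name: value".

```python
def split_field(text, marker=".metadata."):
    """Return (description, metadata_lines) split at the marker line."""
    lines = text.splitlines()
    if marker in lines:
        i = lines.index(marker)
        return "\n".join(lines[:i]), lines[i + 1:]
    return text, []


def build_map(meta_lines):
    """Turn 'a.b.c: value' lines into nested dictionaries."""
    result = {}
    for line in meta_lines:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not a 'name: value' line: %r" % line)
        *parents, leaf = key.strip().split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value.strip()
    return result


field = """this is random text that is truly a description
.metadata.
owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla"""

description, meta_lines = split_field(field)
metadata_map = build_map(meta_lines)
```

Raising on a malformed line gives you the validate-on-save step the question asks about: reject the field if build_map fails.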
YAML is a simple and nice solution. There is a YAML library in Python:
import yaml
output = {'a': 1, 'b': {'c': [2, 3, 4]}}
print(yaml.dump(output, default_flow_style=False))
Giving as a result:
a: 1
b:
  c:
  - 2
  - 3
  - 4
You can also parse from a string, and so on. Explore it and check whether it fits your requirements.
Good luck!

Using DTDs to Parse XML

I'm attempting to parse the USPTO data that is hosted Here. I have also retrieved the DTDs associated with the files. My question is: is it possible to use these to parse the files, or are they only used for validation? I have already used one as a guideline for parsing some of the documents, but doing it the way I am would require having a separate parser for each DTD. Here is an example snippet of how I'm currently doing it.
from datetime import datetime

# <!ELEMENT document-id (country, doc-number, kind?, name?, date?)>
def parseDocumentId(ref):
    data = {}
    data["Country"] = ref.find("country").text
    data["ID"] = ref.find("doc-number").text
    # Optional children: find() returns None when the element is absent
    if ref.find("date") is not None:
        d = ref.find("date").text
        try:
            date = datetime.strptime(d, "%Y%m%d").date()
        except (TypeError, ValueError):
            # empty element or a date that is not in YYYYMMDD form
            date = None
        data["Date"] = date
    if ref.find("kind") is not None:
        data["Kind"] = ref.find("kind").text
    if ref.find("name") is not None:
        data["Name"] = ref.find("name").text
    return data
This way just seems very manual to me, so I'm curious whether there is a better way to help automate the process.
Note: I'm using lxml for parsing.
DTDs only help you check that a document follows its specification. You can build a dictionary to tokenize the document and then parse it. In any case, I believe that using lxml is the better way.
The usual approach to processing XML is to use an off-the-shelf XML parser for your programming language, and from its API construct whatever data structures you want to have. When many XML documents using the same XML vocabulary must be processed, it may make sense to generate a parser for that class of XML documents using a tool, or even to construct a parser by hand. But most programs use generic XML parsers instead of custom-constructed parsers.
To store XML documents in a database, however, it may not be necessary to employ an XML parser at all (except perhaps in checking beforehand that the documents are all in fact well-formed): all XML databases and many SQL databases have the ability to read and ingest XML documents.
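One way to avoid a separate hand-written function per DTD is a generic converter that walks whatever elements are present. The following is a rough sketch using the standard library's ElementTree (lxml exposes the same find/iteration interface). The function name element_to_dict and the sample values are invented for illustration, and attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element into nested dicts keyed by child tag name."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    data = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in data:
            # A repeated tag becomes a list of values.
            if not isinstance(data[child.tag], list):
                data[child.tag] = [data[child.tag]]
            data[child.tag].append(value)
        else:
            data[child.tag] = value
    return data


# Hypothetical sample shaped like the document-id element from the question.
sample = (
    "<document-id><country>US</country>"
    "<doc-number>1234567</doc-number><date>20140101</date></document-id>"
)
record = element_to_dict(ET.fromstring(sample))
```

Type conversions such as turning the date string into a date object can then be applied in one place, keyed by tag name, instead of once per DTD.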

Output list of links grouped by extension or base URL - built on regex using python.

I've been working on this assignment for a while now. The regex is not particularly difficult, but I don't quite follow how to get the output they want.
"Your program should:
Read the HTML of a webpage (which has been stored as a text file);
Extract all the domains referred to and list all the full http addresses related to these domains;
Extract all the resource types referred to and list all the full http addresses related to these resource types.
Please solve the task using regular expressions and re functions/methods. I suggest using ‘finditer’ and ‘groups’ (there might be other possibilities). Please do not use string functions where re is better suited."
The output is supposed to look like this
www.fairfaxmedia.co.nz
http://www.fairfaxmedia.co.nz
www.essentialmums.co.nz
http://www.essentialmums.co.nz/
http://www.essentialmums.co.nz/
http://www.essentialmums.co.nz/
www.nzfishingnews.co.nz
http://www.nzfishingnews.co.nz/
www.nzlifeandleisure.co.nz
http://www.nzlifeandleisure.co.nz/
www.weatherzone.co.nz
http://www.weatherzone.co.nz/
www.azdirect.co.nz
http://www.azdirect.co.nz/
i.stuff.co.nz
http://i.stuff.co.nz/
ico
http://static.stuff.co.nz/781/3251781.ico
zip
http://static2.stuff.co.nz/1392867595/static/jwplayer/skin/Modieus.zip
mp4
http://file2.stuff.co.nz/1394587586/272/9819272.mp4
I really need help with how to filter things so the output shows up like that.
Create a list of tuples (keyword, url).
Sort it by keyword.
Group it per keyword using itertools.groupby.
For each keyword, print the keyword and then all of its URLs (printed indented).
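Those steps could look roughly like the sketch below. The two patterns are my own guesses at what counts as a URL and as a resource type, so treat URL_RE, EXT_RE and the helper names as assumptions to adapt to the actual page.

```python
import re
from itertools import groupby
from operator import itemgetter

# Assumed patterns: an http(s) URL, and a file extension at the end of its path.
URL_RE = re.compile(r"""https?://(?P<domain>[\w.-]+)(?P<path>/[^\s"'<>]*)?""")
EXT_RE = re.compile(r"\.(?P<ext>[A-Za-z0-9]+)$")


def grouped(pairs):
    """Sort (keyword, url) tuples and collect the URLs under each keyword."""
    ordered = sorted(pairs)
    return [(key, [url for _, url in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def by_domain(html):
    return grouped((m.group("domain"), m.group(0))
                   for m in URL_RE.finditer(html))


def by_extension(html):
    pairs = []
    for m in URL_RE.finditer(html):
        ext = EXT_RE.search(m.group("path") or "")
        if ext:
            pairs.append((ext.group("ext"), m.group(0)))
    return grouped(pairs)


def report(groups):
    for key, urls in groups:
        print(key)
        for url in urls:
            print("    " + url)


html = ('<a href="http://i.stuff.co.nz/">x</a> '
        '<link href="http://static.stuff.co.nz/781/3251781.ico"> '
        '<a href="http://i.stuff.co.nz/">y</a>')
report(by_domain(html))
report(by_extension(html))
```

Note that groupby only merges adjacent items, which is why the sort has to come first.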

How can I anonymise XML data for selected tags?

My question is as follows:
I have to read a big XML file (50 MB) and anonymise some tags/fields that hold private data, such as name, surname, address, email, phone number, etc.
I know exactly which tags in XML are to be anonymised.
s|<a>alpha</a>|MD5ed(alpha)|e;
s|<h>beta</h>|MD5ed(beta)|e;
where alpha and beta refer to any characters within, which will also be hashed, using probably an algorithm like MD5.
I will only convert the tag value, not the tags themselves.
I hope I am clear enough about my problem. How do I achieve this?
You have to do something like the following in Python.
import hashlib
import xml.etree.ElementTree as xml  # or lxml or whatever

theDoc = xml.parse("sample.xml")
for alphaTag in theDoc.findall("xpath/to/tag"):
    print(alphaTag, alphaTag.text)
    # md5() needs bytes, so encode the text first
    alphaTag.text = hashlib.md5(alphaTag.text.encode("utf-8")).hexdigest()
xml.dump(theDoc)
Using regexps is indeed dangerous unless you know the exact format of the file, that format is easy to parse with regexps, and you are sure it will not change in the future.
Otherwise you could indeed use XML::Twig, as below. An alternative would be to use XML::LibXML, although the file might be a bit big to load entirely in memory (then again, maybe not, memory is cheap these days), so you might have to use the pull mode, which I don't know much about.
Compact XML::Twig code:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use Digest::MD5 'md5_base64';
my @tags_to_anonymize= qw( name surname address email phone);
# the handler for each element ($_) sets its content with the md5 and then flushes
my %handlers= map { $_ => sub { $_->set_text( md5_base64( $_->text))->flush } } @tags_to_anonymize;
XML::Twig->new( twig_roots => \%handlers, twig_print_outside_roots => 1)
->parsefile( "my_big_file.xml")
->flush;
Bottom line: don't parse XML using regex.
Use your language's DOM parsing libraries instead, and if you know the elements you need to anonymize, grab them using XPath and hash their contents by setting their innerText/innerHTML properties (or whatever your language calls them).
As Welbog said, don't try to parse XML with a regex. You'll regret it eventually.
Probably the easiest way to do this is using XML::Twig. It can process XML in chunks, which lets you handle very large files.
Another possibility would be using SAX, especially with XML::SAX::Machines. I've never really used that myself, but it's a stream-oriented system, so it should be able to handle large files. The downside is that you'll probably have to write more code to collect the text inside each tag that you care about (where XML::Twig will collect that text for you).
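For comparison, here is what the stream-oriented route might look like in Python, using the standard library's xml.sax rather than XML::SAX::Machines. It is a sketch under a few assumptions: the Anonymiser class and the anonymise helper are made-up names, and the tags to hash are assumed to contain only text, with no child elements.

```python
import hashlib
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class Anonymiser(XMLFilterBase):
    """SAX filter that replaces the text of selected tags with its MD5."""

    def __init__(self, parent, tags):
        super().__init__(parent)
        self._tags = set(tags)
        self._buffer = None  # collects text while inside a selected tag

    def startElement(self, name, attrs):
        super().startElement(name, attrs)
        if name in self._tags:
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)  # hold the text back
        else:
            super().characters(content)

    def endElement(self, name):
        if self._buffer is not None:
            text = "".join(self._buffer)
            self._buffer = None
            if text:
                digest = hashlib.md5(text.encode("utf-8")).hexdigest()
                super().characters(digest)
        super().endElement(name)


def anonymise(xml_text, tags):
    out = io.StringIO()
    reader = Anonymiser(xml.sax.make_parser(), tags)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


result = anonymise(
    "<people><name>alpha</name><city>Oslo</city></people>", ["name"])
```

As noted above, the extra code is mostly the text collection: SAX may deliver one element's text in several characters() calls, hence the buffer. For a real 50 MB file you would hand the parser an open file and write to an output file instead of strings.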

Writing a compiler for a DSL in python

I am writing a game in python and have decided to create a DSL for the map data files. I know I could write my own parser with regex, but I am wondering if there are existing python tools which can do this more easily, like re2c which is used in the PHP engine.
Some extra info:
Yes, I do need a DSL, and even if I didn't I still want the experience of building and using one in a project.
The DSL contains only data (declarative?), it doesn't get "executed". Most lines look like:
SOMETHING: !abc #123 #xyz/123
I just need to read the tree of data.
I've always been impressed by pyparsing. The author, Paul McGuire, is active on the python list/comp.lang.python and has always been very helpful with any queries concerning it.
Here's an approach that works really well.
abc= ONETHING( ... )
xyz= ANOTHERTHING( ... )
pqr= SOMETHING( this=abc, that=123, more=(xyz,123) )
Declarative. Easy-to-parse.
And...
It's actually Python. A few class declarations and the work is done. The DSL is actually class declarations.
What's important is that a DSL merely creates objects. When you define a DSL, first you have to start with an object model. Later, you put some syntax around that object model. You don't start with syntax, you start with the model.
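A rough sketch of that idea, with the caveat that every class and field name here is invented (the original only shows ONETHING, ANOTHERTHING and SOMETHING with elided arguments): define the object model as plain classes, then run the map file as Python in a namespace that exposes just those classes.

```python
from dataclasses import dataclass


# Hypothetical object model; the real one depends on the game.
@dataclass
class OneThing:
    name: str


@dataclass
class AnotherThing:
    path: str


@dataclass
class Something:
    this: OneThing
    that: int
    more: tuple = ()


MODEL = (OneThing, AnotherThing, Something)

MAP_SOURCE = """
abc = OneThing("abc")
xyz = AnotherThing("xyz")
pqr = Something(this=abc, that=123, more=(xyz, 123))
"""


def load_map(source):
    """Run a map file as Python and return the model objects it defines."""
    namespace = {cls.__name__: cls for cls in MODEL}
    exec(source, namespace)  # only for map files you wrote yourself
    return {name: value for name, value in namespace.items()
            if isinstance(value, MODEL)}


objects = load_map(MAP_SOURCE)
```

The trade-off is that the map file is real code, so this only suits data files you trust.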
Yes, there are many -- too many -- parsing tools, but none in the standard library.
From what I have seen, PLY and SPARK are popular. PLY is like yacc, but you do everything in Python because you write your grammar in docstrings.
Personally, I like the concept of parser combinators (taken from functional programming), and I quite like pyparsing: you write your grammar and actions directly in Python and it is easy to start with. I ended up producing my own tree node types with actions though, instead of using their default ParserElement type.
Otherwise, you can also use an existing declarative language like YAML.
I have written something like this at work to read in SNMP notification definitions and automatically generate Java classes and SNMP MIB files from them. Using this little DSL, I could write 20 lines of my specification and it would generate roughly 80 lines of Java code and a 100-line MIB file.
To implement this, I actually just used straight Python string handling (split(), slicing etc.) to parse the file. I find Python's string capabilities to be adequate for most of my (simple) parsing needs.
Besides the libraries mentioned by others, if I were writing something more complex and needed proper parsing capabilities, I would probably use ANTLR, which supports Python (and other languages).
For "small languages" such as the one you are describing, I use a simple split, shlex (mind that # starts a comment, so everything after it is dropped) or regular expressions.
>>> line = 'SOMETHING: !abc #123 #xyz/123'
>>> line.split()
['SOMETHING:', '!abc', '#123', '#xyz/123']
>>> import shlex
>>> list(shlex.shlex(line))
['SOMETHING', ':', '!', 'abc']
The following is an example, as I do not know exactly what you are looking for.
>>> import re
>>> result = re.match(r'([A-Z]*): !([a-z]*) #([0-9]*) #([a-z0-9/]*)', line)
>>> result.groups()
('SOMETHING', 'abc', '123', 'xyz/123')
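Extending the regex idea to a whole file might look like the sketch below. Since I don't know what ! and # mean in your format, it just records each token's sigil and value. LINE_RE, TOKEN_RE and parse_map are hypothetical names.

```python
import re

# Assumed line shape: an upper-case key, a colon, then !value / #value tokens.
LINE_RE = re.compile(r"^(?P<key>[A-Z_]+):\s*(?P<rest>.*)$")
TOKEN_RE = re.compile(r"(?P<sigil>[!#])(?P<value>\S+)")


def parse_map(text):
    """Parse 'KEY: !abc #123' lines into {key: [(sigil, value), ...]}."""
    data = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError("line %d: cannot parse %r" % (lineno, line))
        data[match.group("key")] = [
            (tok.group("sigil"), tok.group("value"))
            for tok in TOKEN_RE.finditer(match.group("rest"))
        ]
    return data


parsed = parse_map("SOMETHING: !abc #123 #xyz/123\n")
```

Reporting the line number on failure costs one line and makes hand-edited map files much easier to debug.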
DSLs are a good thing, so you don't need to defend yourself :-)
However, have you considered an internal DSL? These have so many advantages over external (parsed) DSLs that they're at least worth considering. Mixing a DSL with the power of the native language really solves lots of the problems for you, and Python is not bad at internal DSLs, with the with statement coming in handy.
Along the lines of declarative Python, I wrote a helper module called 'bpyml' which lets you declare data in Python in a more XML-like structure without the verbose tags. It can be converted to and from XML, but it is valid Python.
https://svn.blender.org/svnroot/bf-blender/trunk/blender/release/scripts/modules/bpyml.py
Example Use
http://wiki.blender.org/index.php/User:Ideasman42#Declarative_UI_In_Blender
Here is a simpler approach.
What if I could extend Python's syntax with new operators to add new functionality to the language? For example, a new operator <=> that swaps the values of two variables.
How can I implement such behavior? This is where the ast module comes in.
The ast module is a handy tool for working with abstract syntax trees. What's cool about it is that it lets me write Python code that generates a tree and then compiles that tree to Python code.
Let's say we want to compile a superset language (a Python-like language) to Python:
from:
a <=> b
to:
a, b = b, a
First I need to convert my Python-like source code into a list of tokens, so I need a tokenizer: a lexical scanner for Python source code. The tokenize module provides one.
I can use the same meta-language to define the grammar of the new Python-like language and then build the structure of the abstract syntax tree (AST).
Why use AST?
It is a much safer choice when evaluating untrusted code.
You can manipulate the tree before executing the code.
from tokenize import untokenize, tokenize, NAME, OP, COMMA
import io
import ast

s = b"a <=> b\n"  # I may read it from a file
b = io.BytesIO(s)
g = tokenize(b.readline)
result = []
for token_num, token_val, _, _, _ in g:
    # naive simple approach to compile a<=>b to a,b = b,a
    # ("<=>" is tokenized as the two operators "<=" and ">")
    if token_num == OP and token_val == '<=' and next(g).string == '>':
        first = result.pop()
        next_token = next(g)
        second = (NAME, next_token.string)
        result.extend([
            first,
            (COMMA, ','),
            second,
            (OP, '='),
            second,
            (COMMA, ','),
            first,
        ])
    else:
        result.append((token_num, token_val))
src = untokenize(result).decode('utf-8')
exp = ast.parse(src)
code = compile(exp, filename='', mode='exec')

def my_swap(a, b):
    # run the compiled "a, b = b, a" in a namespace holding the two values
    env = {
        "a": a,
        "b": b
    }
    exec(code, env)
    return env['a'], env['b']

print(my_swap(1, 10))
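To illustrate the point about inspecting the tree before anything runs, here is a small sketch of an arithmetic evaluator that walks the AST and rejects every node type it does not explicitly allow. The name safe_eval and the choice of allowed operators are mine, and this is a toy, not a hardened sandbox.

```python
import ast
import operator

# Only these binary operators are accepted.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expression):
    """Evaluate plain arithmetic by walking the AST instead of calling eval()."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        # Names, calls, attribute access and everything else are refused.
        raise ValueError("unsupported syntax: " + type(node).__name__)

    return visit(ast.parse(expression, mode="eval"))


value = safe_eval("2 * (3 + 4) - 1")
```

Something like safe_eval("__import__('os')") raises ValueError here, because a Call node is not on the allowed list.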
Other projects that use the AST, and whose source code may be a useful reference:
textX-LS: a DSL used to describe a collection of shapes and draw them.
pony orm: lets you write database queries as Python generators and lambdas, which are translated to SQL query strings; pony orm uses the AST under the hood.
osso: a role-based access control framework for handling permissions.
