I am putting together a website where the content is maintained as reStructuredText that is then converted into HTML. I need more control than e.g. rst2html.py gives, so I am using my own Python script that uses things like
docutils.core.publish_parts(source, writer_name='html')
to create the HTML.
publish_parts() gives me useful parts like the title, body, etc. However, it seems I must look elsewhere to get the values of RST fields like
:Authors:
:version:
etc. For this, I have been using publish_doctree() as in
doctree = core.publish_doctree(source).asdom()
and then going through this recursively using getElementsByTagName() as in
doctree.getElementsByTagName('authors')
doctree.getElementsByTagName('version')
etc.
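For reference, here is roughly how I have wrapped this up (the helper names are my own; only publish_doctree() and the standard minidom calls come from docutils and the stdlib):

from docutils import core

def _text(node):
    # Concatenate all text descendants of a minidom node.
    return ''.join(child.data if child.nodeType == child.TEXT_NODE
                   else _text(child)
                   for child in node.childNodes)

def docinfo_field(source, name):
    # Return the text of a docinfo field ('version', 'authors', ...),
    # or None if the document does not define it.
    dom = core.publish_doctree(source).asdom()
    nodes = dom.getElementsByTagName(name)
    return _text(nodes[0]) if nodes else None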
Using publish_doctree() to extract fields does the job, and that's good, but it does seem more convoluted than using e.g. publish_parts().
My question is simply whether this is the recommended way of extracting these RST fields, or whether there is a more direct, less convoluted way. If not, that is fine, but I thought I would ask in case I am missing something.
I'm writing an ORM-like library for Mongo. I've written some models and want to make sure that they, and the machinery that supports them, are correct, so I'd like to write unit tests for them. I figured the best way to do that would be simply to write out some test data as JSON, then pass it to my models and check that the valid data is considered valid and the invalid data invalid.
My question is where to put that data: it seems like a lot of non-TestCase material to put in test_models.py, but adding a separate test_data_for_models.json brings its own headaches (versioning, etc.).
Which is more recommended/idiomatic? Bear in mind I don't really need to generate any data (I'll likely just adapt data that's already in the db); I'm just unsure where to put it.
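To make the trade-off concrete, the embedded option would look roughly like this (Comment and is_valid() are hypothetical stand-ins for my models; since JSON documents map directly to Python dicts, no separate file is needed):

# test_models.py -- fixtures live next to the tests that use them
import unittest

from models import Comment  # hypothetical model under test

VALID = [
    {'author': 'alice', 'body': 'hello world'},
]
INVALID = [
    {'author': '', 'body': None},
]

class CommentTest(unittest.TestCase):
    def test_valid_data_is_accepted(self):
        for data in VALID:
            self.assertTrue(Comment(data).is_valid())

    def test_invalid_data_is_rejected(self):
        for data in INVALID:
            self.assertFalse(Comment(data).is_valid())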
I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following:
from lxml import etree
for e, tag in etree.iterparse(source, tag='Foo'):
    print(tag.xpath('bar/baz')[42])  # there's actually a function call here
The problem is, some of the documents have a namespace declaration and some don't have any. That means that in the code above, both the tag='Foo' filter and the xpath call will fail.
For now I've been putting up with the ugly
for e, tag in etree.iterparse(source):
    if tag.tag.endswith('Foo'):
        print(tag.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42])
but this is so awful that I want to get it right even though it works fine. (I guess it should be slower, too.)
Is there a way to write sane code that would account for both cases using iterparse?
For now I can only think of catching start-ns and end-ns events and updating a "state-keeping" variable, which I'll have to pass to the function that is called within the loop to do the work. The function will then construct the xpath queries accordingly. This makes some sense, but I'm wondering if there's a simpler way around this.
P.S. I've obviously tried searching around, but haven't found a solution that would work both with and without a namespace. I would also accept a solution that eliminates namespaces from the XML, but only if it doesn't store the whole tree in RAM in the process.
All elements have a .nsmap mapping attribute; use it to detect your namespace and branch accordingly.
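For example, a minimal sketch of that branching (source and the bar/baz lookup are taken from your question, and this assumes the namespaced documents use a default, unprefixed namespace):

from lxml import etree

def localname(tag):
    # '{uri}Foo' -> 'Foo'; an un-namespaced 'Foo' passes through unchanged
    return tag.rsplit('}', 1)[-1]

def bar_baz(elem):
    ns = elem.nsmap.get(None)  # default namespace URI, or None
    if ns is None:
        return elem.xpath('bar/baz')
    # XPath 1.0 has no default namespace, so bind the URI to a prefix
    return elem.xpath('n:bar/n:baz', namespaces={'n': ns})

for e, elem in etree.iterparse(source):
    if localname(elem.tag) == 'Foo':
        print(bar_baz(elem)[42])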
(note) I would appreciate help generalizing the title. I am sure that this is a class of problems in OO land and probably has a well-known pattern; I just don't know a better way to describe it.
I'm considering the following: our server script will be called by an outside program and will have a bunch of text dumped at it (usually XML).
There are multiple possible types of data we could be getting, and multiple versions of the data representation we could be getting, e.g. "Report Type A, version 1.2" vs. "Report Type A, version 2.0"
We will generally want to take the same action with all the data: namely, determine what sort and version it is, parse it with a custom parser, then call a synchronize-to-database function on it.
We will definitely be adding types and versions as time goes on.
So, what's a good design pattern here? I can come up with two, but both seem like they may have some problems.
Option 1
Write a monolithic ID script which determines the type, then imports and calls the properly named class functions.
Benefits:
Probably pretty easy to debug.
Only one file that does the parsing.
Downsides:
Seems hack-ish.
It would be nice not to have to encode knowledge of the data formats in two places, once for identification and once for the actual merging.
Option 2
Write an "ID" function for each class; returns Yes / No / Maybe when given identifying text.
the ID script now imports a bunch of classes, instantiates them on the text and asks if the text and class type match.
Upsides:
Cleaner in that everything lives in one module?
Downsides:
Slower? Depends on the logic of running through the classes.
Put abstractly: should Python instantiate a bunch of classes and consult an ID function on each, or should it instantiate one (or many) ID classes, each of which has a paired item class, or do this some other way?
You could use the Strategy pattern, which would allow you to separate the logic for the different formats that need to be parsed into concrete strategies. Your code would typically inspect a portion of the file up front and then decide on a concrete strategy.
As far as recognizing your files goes, I would find a fast way to identify the file without implementing the full format definition, perhaps via a header or other unique feature at the beginning of the document. Then, once you know how to handle the file, you can pick the best concrete strategy for it to handle the parsing and the writes to the database.
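A rough sketch of what that dispatch might look like (the class name, the header marker, and the stub bodies are all invented for illustration):

class ReportTypeA12(object):
    """Concrete strategy for "Report Type A, version 1.2"."""

    @staticmethod
    def matches(text):
        # Cheap sniff on the header; the marker string is hypothetical.
        return '<report type="A" version="1.2"' in text

    def parse(self, text):
        # Stub: a real parser would build custom objects here.
        return {'type': 'A', 'version': '1.2', 'raw': text}

    def sync_to_db(self, record):
        # Stub: a real implementation would write to the database.
        print('would sync %r' % (record,))

STRATEGIES = [ReportTypeA12]  # append new types/versions here

def handle(text):
    for cls in STRATEGIES:
        if cls.matches(text):
            strategy = cls()
            strategy.sync_to_db(strategy.parse(text))
            return
    raise ValueError('unrecognized document')

Adding a new type or version then means writing one new class and appending it to the list; the ID logic and the merge logic for each format live together in one place.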
I'm creating a simple comment-like application and need to convert normal URLs into links, image links into images, and yt/vimeo/etc. links into Flash objects. E.g.:
http://foo.bar to <a href="http://foo.bar">http://foo.bar</a>
http://foo.bar/image.gif to <img src="http://foo.bar/image.gif"/>
etc.
Of course I can write all of that myself, but I think it's such an obvious piece of code that somebody has already written it (maybe even with splitting text into paragraphs). I was googling for some time but couldn't find anything comprehensive, just a few snippets. Does such a filter (or something like it) exist?
Thanks!
PS. There is urlize, but it works only for the first case.
Write a custom filter to handle all the necessary cases. Look at the source code for urlize to get started; you'll also need the urlize function from django.utils.html.
In your filter, first test for the plain-URL case and call urlize on it, then handle the image case and any other cases you may have.
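For instance, a minimal sketch (the filter name, the image regex, and the word-by-word splitting are simplifications of my own, and the flash-object case is left out):

import re

from django import template
from django.utils.html import urlize
from django.utils.safestring import mark_safe

register = template.Library()

IMAGE_RE = re.compile(r'^https?://\S+\.(png|gif|jpe?g)$', re.IGNORECASE)

@register.filter
def linkify(text):
    # Turn image URLs into <img> tags ourselves; hand everything
    # else to urlize, which linkifies plain URLs.
    out = []
    for word in text.split():
        if IMAGE_RE.match(word):
            out.append('<img src="%s"/>' % word)
        else:
            out.append(urlize(word))
    return mark_safe(' '.join(out))

Note that splitting on whitespace collapses the original spacing, so any paragraph handling would need to happen before this filter runs.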
I am trying to save user's bold/italic/font/etc tags in a GtkTextView.
Using GtkTextBuffer.get_text() does not return the tags.
The best documentation I have found on this is:
http://www.pygtk.org/docs/pygtk/class-gtktextbuffer.html#method-gtktextbuffer--register-serialize-format
However, I do not understand the function arguments.
It would be infinitely handy to have an example of how these are used to save/load a textview with tags in it.
Edit: I would like to clarify what I am trying to accomplish. Basically I want to save/load the textview's text + tags; I have no desire to do anything more complicated than that. I am using pickle as the file format, so I don't need any help on how to save it or in what format. I just need a way to pull/push the data so that the user loses nothing that he/she sees on screen. Thank you.
If you need to save the tags because you just want to copy the text into another text buffer, you can use gtk.TextBuffer.insert_range().
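For example (a minimal sketch; per the GTK+ docs, the two buffers must share the same tag table for this to work across buffers):

import gtk

table = gtk.TextTagTable()
src = gtk.TextBuffer(table)
dst = gtk.TextBuffer(table)

# ... fill src with tagged text ...

start, end = src.get_bounds()
dst.insert_range(dst.get_end_iter(), start, end)  # copies text + tags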
If you need to save the text with tags into another format readable by other programs, I once wrote a library with a GTK text buffer serializer to and from RTF. It doesn't have any Python bindings though. But in any case the code is a good example of how to use the serializer facility. Link: Osxcart
I haven't worked with GtkTextBuffer's serialization. Reading the documentation you linked, I would suggest trying the default serializer, by calling
textbuffer.register_serialize_tagset()
This gives you GTK+'s built-in proprietary serializer. Being proprietary here means that it doesn't serialize into some well-known format; but if all you need is the ability to save out the text buffer's contents and load them back, this should be fine.
Of course the source code is available inside GTK+ if you really want to figure out how it works; I would recommend against trying to implement e.g. a stand-alone de-serializer, though, since GTK+ probably makes no guarantees that the format will remain as-is.
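Based on the documentation you linked, usage would look roughly like this (an untested sketch; textview is the gtk.TextView from your application, and the serialized result is a plain byte string, so pickling it as you planned should work):

buf = textview.get_buffer()

# Save: register the built-in tagset format, then dump the buffer.
fmt = buf.register_serialize_tagset()
start, end = buf.get_bounds()
data = buf.serialize(buf, fmt, start, end)  # byte string; pickle this

# Load: register the matching deserializer and insert the data back.
fmt = buf.register_deserialize_tagset()
buf.deserialize(buf, fmt, buf.get_start_iter(), data)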