Use iterparse and, subsequently, xpath on documents with inconsistent namespace declarations - python

I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following:
from lxml import etree
for e, tag in etree.iterparse(source, tag='Foo'):
    print(tag.xpath('bar/baz')[42])  # there's actually a function call here
The problem is that some of the documents have a namespace declaration and some don't have any. That means that in the code above, neither the tag='Foo' filter nor the XPath expression works on the namespaced documents.
For now I've been putting up with the ugly
for e, tag in etree.iterparse(source):
    if tag.tag.endswith('Foo'):
        print(tag.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42])
but this is so awful that I want to get it right even though it works fine. (I guess it should be slower, too.)
Is there a way to write sane code that would account for both cases using iterparse?
For now I can only think of catching start-ns and end-ns events and updating a "state-keeping" variable, which I'll have to pass to the function that is called within the loop to do the work. The function will then construct the xpath queries accordingly. This makes some sense, but I'm wondering if there's a simpler way around this.
P.S. I've obviously tried searching around, but haven't found a solution that would work both with and without a namespace. I would also accept a solution that eliminates namespaces from the XML, but only if it doesn't store the whole tree in RAM in the process.

All elements have a .nsmap mapping attribute; use it to detect your namespace and branch accordingly.
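For example, a minimal sketch of that branching (untested; it assumes the namespace, when present, is declared as the default namespace, and do_something is a hypothetical stand-in for your real per-element work):

from lxml import etree

def handle_foos(source):
    for event, el in etree.iterparse(source):
        ns = el.nsmap.get(None)  # default namespace URI, if any
        if el.tag != ('{%s}Foo' % ns if ns else 'Foo'):
            continue
        if ns:
            # lxml's XPath cannot use a default namespace directly,
            # so bind the URI to an explicit prefix
            hits = el.xpath('n:bar/n:baz', namespaces={'n': ns})
        else:
            hits = el.xpath('bar/baz')
        do_something(hits)  # hypothetical: the function call from the question
        el.clear()  # release the processed subtree to keep memory bounded

Since there is no tag filter on iterparse here, every end event passes through the tag check; if that turns out to be slow, recent lxml versions also accept the wildcard tag='{*}Foo' to match Foo in any namespace.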

Related

Find all possible stacktraces for a function

First of all, I'm not sure I'm asking for something that is doable.
Sometimes, when I have to do some refactoring in a large codebase, I need to change the output or even the signature of a certain function. Before I do that, I have to make sure that the output is handled correctly by the other functions that are going to call it.
The way I do it is by searching for the function name, say get_timezone_from_user, and then I see that the function is used by two other functions called format and change_timezone_for_user. Then I start looking for where those two functions are used, and so on, until I end up with a graph where each path looks like a stack trace.
There are two problems.
The first one is that doing this manually is pretty time-consuming.
The second is that when I look at all occurrences of a function name like format, I will often find occurrences of the word format in different contexts. It can be a comment or even another function defined somewhere else.
My question is simple:
Is there an IDE or a tool that lets you find where in your code a certain function is called? Even better would be a tool that works recursively, so that it draws the whole graph.
I'm trying to achieve this in Python if that helps.
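As an illustration of why a syntax-aware tool beats plain text search for the second problem, here is a minimal sketch (the helper name is made up) that uses the stdlib ast module to report only genuine call sites of a name in one file, skipping comments and strings:

import ast

def find_call_sites(path, func_name):
    """Yield line numbers where func_name is actually called."""
    with open(path) as f:
        tree = ast.parse(f.read(), filename=path)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # covers both foo(...) and obj.foo(...)
            name = getattr(node.func, 'id', getattr(node.func, 'attr', None))
            if name == func_name:
                yield node.lineno

Repeating this over every file and feeding the callers' names back in would give the recursive graph described above, though it still conflates identically named functions from different modules.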

restructuredtext: what is the best way of extracting bibliographic and other fields?

I am putting together a website where the content is maintained as reStructuredText that is then converted into HTML. I need more control than e.g. rst2html.py gives me, so I am using my own Python script with things like
docutils.core.publish_parts(source, writer_name='html')
to create the HTML.
publish_parts() gives me useful parts like the title, body, etc. However, it seems I must look elsewhere to get the values of rst fields like
:Authors:
:version:
etc. For this, I have been using publish_doctree() as in
doctree = core.publish_doctree(source).asdom()
and then going through this recursively using getElementsByTagName() as in
doctree.getElementsByTagName('authors')
doctree.getElementsByTagName('version')
etc.
Using publish_doctree() to extract fields does the job, and that's good, but it does seem more convoluted than using e.g. publish_parts().
My question is simply whether this is the best recommended way of extracting out these rst fields, or is there a more direct and less convoluted way? If not, that is fine, but I thought I would inquire in case I am missing something.
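For reference, a compact version of the doctree/DOM approach described above; a minimal sketch, assuming the usual docinfo layout (an :Authors: field, for instance, becomes an authors element wrapping individual author nodes, so you would query the inner tag):

from docutils import core

def _text(node):
    # concatenate every text node beneath a minidom node
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return ''.join(_text(child) for child in node.childNodes)

def rst_docinfo(source, tags=('author', 'version', 'date')):
    """Pull docinfo fields such as :version: out of a reST source string."""
    dom = core.publish_doctree(source).asdom()
    return {tag: [_text(el) for el in dom.getElementsByTagName(tag)]
            for tag in tags}

Called as rst_docinfo(open('page.rst').read()), this returns a dict mapping each field name to its text values.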

Getting Vim to be aware of ctag type annotations for python

I use Vim+Ctags to write Python, and my problem is that Vim often jumps to the import for a tag, rather than the definition. This is a common issue, and has already been addressed in a few posts here.
This post shows how to remove the imports from the tags file. This works quite well, except that sometimes it is useful to have tags from the imports (e.g. when you want to list all places where a class/function has been imported).
This post shows how to get to the definition without removing the imports from the tags file. This is basically what I've been doing so far (I've just remapped :tjump to a single keystroke). However, you still need to navigate the list of tags that comes up to find the definition entry.
It would be nice if it were possible to just tell Vim to "go to the definition" with a single key chord. Exuberant Ctags annotates the tag entries with the type of entry (e.g. c for classes, i for imports). Does anyone know if there is a way to get Vim to utilize these annotations, so that I could say things like "go to the first tag that is not of type i"?
Unfortunately, there's no way for Vim itself to do that inference business and jump to an import or a definition depending on some context: when searching for a tag in your tags file, Vim stops at the first match whatever it is. A plugin may help but I'm not aware of such a thing.
Instead of <C-]> or :tag foo, you could use g] or :ts foo which shows you a list of matches (with kinds and a preview of the line of each match) instead of jumping to the first one. This way, you are able to tell Vim exactly where you want to go.
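If you do end up going the tags-file route from the first post, a throwaway script is enough to regenerate a definitions-only copy. A rough sketch, assuming the default Exuberant Ctags output where the kind letter is the first extension field after the ;" marker:

def strip_import_tags(src_path, dst_path):
    """Copy a ctags file, dropping entries whose kind is 'i' (imports)."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            if line.startswith('!_TAG_'):  # keep the header lines
                dst.write(line)
                continue
            _, sep, ext = line.partition(';"')
            kind = ext.strip().split('\t')[0] if sep else ''
            if kind != 'i':
                dst.write(line)

Pointing Vim's 'tags' option at the filtered file gives definition-first jumps, while the unfiltered file stays available for import searches.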

Pythonic way to ID a mystery file, then call a filetype-specific parser for it? Class creation q's

(note) I would appreciate help generalizing the title. I am sure that this is a class of problems in OO land that probably has a reasonable pattern; I just don't know a better way to describe it.
I'm considering the following: our server script will be called by an outside program and have a bunch of text dumped at it (usually XML).
There are multiple possible types of data we could be getting, and multiple versions of the data representation we could be getting, e.g. "Report Type A, version 1.2" vs. "Report Type A, version 2.0"
We will generally want to do the same thing with all the data: determine what sort and version it is, parse it with a custom parser, and then call a synchronize-to-database function on it.
We will definitely be adding types and versions as time goes on.
So, what's a good design pattern here? I can come up with two, but both seem like they may have some problems.
Option 1
Write a monolithic ID script which determines the type, and then imports and calls the properly named class functions.
Benefits
Probably pretty easy to debug.
Only one file that does the parsing.
Downsides
Seems hack-ish.
It would be nice not to have to encode knowledge of the data formats in two places: once for the ID, once for the actual merging.
Option 2
Write an "ID" function for each class; returns Yes / No / Maybe when given identifying text.
the ID script now imports a bunch of classes, instantiates them on the text and asks if the text and class type match.
Upsides:
Cleaner in that everything lives in one module?
Downsides:
Slower? Depends on the logic of running through the classes.
Put abstractly: should Python instantiate a bunch of classes and consume an ID function on each, should it instantiate one (or many) ID classes which have a paired item class, or is there some other way?
You could use the Strategy pattern, which lets you separate the logic for the different formats into concrete strategies: your code inspects a portion of the input and then picks the strategy that handles it.
Rather than defining a full grammar for your files, I would find a fast way to identify the file without implementing the full definition, perhaps a header or other unique feature at the beginning of the document. Once you know what you are dealing with, you pick the matching concrete strategy, and that strategy handles the parsing and the writes to the database.
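A minimal sketch of that shape, close to Option 2 above (the report class, its sniffing heuristic, and the method names are all made up for illustration):

class ReportA_v1:
    @staticmethod
    def matches(text):
        # cheap sniff on the start of the document, not a full parse
        return '<ReportA version="1.2"' in text[:200]

    def parse(self, text):
        ...  # build the custom Python objects for this format

    def sync_to_db(self, objects):
        ...  # write them out

PARSERS = [ReportA_v1]  # append new types/versions as they appear

def handle(text):
    for cls in PARSERS:
        if cls.matches(text):
            parser = cls()
            return parser.sync_to_db(parser.parse(text))
    raise ValueError('unrecognized report format')

Adding a type or version then means writing one new class and appending it to the list; the dispatch loop and the knowledge of each format live in a single place.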

Python analyze method calls from other classes/modules

I've got a codebase of around 5.3k LOC with around 30 different classes. The code is already very well formatted, and I want to improve it further by prefixing methods that are only called in the module they were defined in with a "_", in order to indicate that. Yes, it would have been a good idea to do that from the beginning, but now it's too late :D
Basically, I'm searching for a tool that will tell me if a method is not called outside of the module it was defined in. I'm not looking for something that will automatically convert the whole thing to use underscores, just a "simple" thing that tells me where I have to look for prefixing.
I took a look at the ast module, but there's no easy way to get a list of method definitions and calls, and parsing the plain text yields far too many false positives. I don't insist on spending day(s) reinventing the wheel when there might be an existing solution to my problem.
For me, this sounds like a special case of coverage.
Thus I'd take a look at coverage.py or figleaf and modify it to ignore calls that originate in the method's own module; any method left uncovered after a full run is only used internally.
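A dynamic approach like that needs a run that exercises everything; as a static alternative, a rough pass with the stdlib ast module can at least produce a candidate list. A sketch, with the caveat that it matches methods by bare name, so identically named methods on different classes get conflated and the output is a hint list, not proof:

import ast
import pathlib
from collections import defaultdict

def module_private_candidates(root):
    defined = defaultdict(set)  # module path -> names defined there
    callers = defaultdict(set)  # name -> module paths that call it
    for path in pathlib.Path(root).rglob('*.py'):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                defined[str(path)].add(node.name)
            elif isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
                callers[node.func.attr].add(str(path))
    for mod, names in defined.items():
        for name in sorted(names):
            if callers[name] <= {mod}:  # never called from another module
                yield mod, name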
