python xml.etree - how to search on more than one attribute - python

I have an XML file with this line:
<op type="create" file="C:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/SimpleNetwork/Makefile" found="0"/>
I want to use xml.etree to search on more than one attribute:
result = tree.search('.//op[#type="create" #file="c:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/HmiLogging/Makefile"]')
But I get an error
raise SyntaxError("invalid predicate")
I tried this (added and), still got same error
'.//op[#type="create" and #file="c:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/HmiLogging/Makefile"]'
Tried adding &&, still got same error
'.//op[#type="create" && #file="c:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/HmiLogging/Makefile"]'
Finally, tried &, still got same error
'.//op[#type="create" & #file="c:/Users/mureadr/Desktop/A/HMI_FORGF/bld/armle-v7/release/HmiLogging/Makefile"]'
I'm guessing that this is a limitation of xml.etree.
Probably I shouldn't use it in the future, but I'm almost done with my project.
For N attributes, how do I use etree.xml to be able to search on all N attributes?

You can use multiple square brackets in succession
'.//op[#type="create"][#file="/some/path"]'
UPDATE: I see that you are using python's xml.etree module. I am not sure if the above answer is valid for that module (It has extremely limited support for XPath). I'd suggest using the go-to library for all XML tasks -- LXML. If you'd use lxml, it would be simply doc.xpath(".//op[..][..]")

Related

How to access attribute value in xml containing namespace using ElementTree in python

XML file:
<?xml version="1.0" encoding="iso-8859-1"?>
<rdf:RDF xmlns:cim="http://iec.ch/TC57/2008/CIM-schema-cim13#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<cim:Terminal rdf:ID="A_T1">
<cim:Terminal.ConductingEquipment rdf:resource="#A_EF2"/>
<cim:Terminal.ConnectivityNode rdf:resource="#A_CN1"/>
</cim:Terminal>
</rdf:RDF>
I want to get the Terminal.ConnnectivityNode element's attribute value and Terminal element's attribute value also as output from the above xml. I have tried in below way!
Python code:
from elementtree import ElementTree as etree
tree= etree.parse(r'N:\myinternwork\files xml of bus systems\cimxmleg.xml')
cim= "{http://iec.ch/TC57/2008/CIM-schema-cim13#}"
rdf= "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
Appending the below line to the code
print tree.find('{0}Terminal'.format(cim)).attrib
output1: : Is as expected
{'{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID': 'A_T1'}
If we Append with this below line to above code
print tree.find('{0}Terminal'.format(cim)).attrib['rdf:ID']
output2: key error in rdf:ID
If we append with this below line to above code
print tree.find('{0}Terminal/{0}Terminal.ConductivityEquipment'.format(cim))
output3 None
How to get output2 as A_T1 & Output3 as #A_CN1?
What is the significance of {0} in the above code, I have found that it must be used through net didn't get the significance of it?
First off, the {0} you're wondering about is part of the syntax for Python's built-in string formatting facility. The Python documentation has a fairly comprehensive guide to the syntax. In your case, it simply gets substituted by cim, which results in the string {http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal.
The problem here is that ElementTree is a bit silly about namespaces. Instead of being able to simply supply the namespace prefix (like cim: or rdf:), you have to supply it in XPath form. This means that rdf:id becomes {http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID, which is very clunky.
ElementTree does support a way to use the namespace prefix for finding tags, but not for attributes. This means you'll have to expand rdf: to {http://www.w3.org/1999/02/22-rdf-syntax-ns#} yourself.
In your case, it could look as following (note also that ID is case-sensitive):
tree.find('{0}Terminal'.format(cim)).attrib['{0}ID'.format(rdf)]
Those substitutions expand to:
tree.find('{http://iec.ch/TC57/2008/CIM-schema-cim13#}Terminal').attrib['{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID']
With those hoops jumped through, it works (note that the ID is A_T1 and not #A_T1, however). Of course, this is all really annoying to have to deal with, so you could also switch to lxml and have it mostly handled for you.
Your third case doesn't work simply because 1) it's named Terminal.ConductingEquipment and not Terminal.ConductivityEquipment, and 2) if you really want A_CN1 and not A_EF2, that's the ConnectivityNode and not the ConductingEquipment. You can get A_CN1 with tree.find('{0}Terminal/{0}Terminal.ConnectivityNode'.format(cim)).attrib['{0}resource'.format(rdf)].

How can I parse an XML document into a Python object?

I'm trying to consume an XML API. I'd like to have some Python objects that represent the XML data. I have several XSD and some example API responses from the documentation.
http://www.isan.org/schema/v1.11/common/common.xsd
http://www.isan.org/schema/v1.21/common/serial.xsd
http://www.isan.org/schema/v1.11/common/version.xsd
http://www.isan.org/ISAN/isan.xsd
http://www.isan.org/schema/v1.11/common/title.xsd
http://www.isan.org/schema/v1.11/common/externalid.xsd
http://www.isan.org/schema/v1.11/common/participant.xsd
http://www.isan.org/schema/v1.11/common/language.xsd
http://www.isan.org/schema/v1.11/common/country.xsd
Here's one example XML response:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
xmlns:title="http://www.isan.org/schema/v1.11/common/title"
xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
xmlns:common="http://www.isan.org/schema/v1.11/common/common"
xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
xmlns:language="http://www.isan.org/schema/v1.11/common/language"
xmlns:country="http://www.isan.org/schema/v1.11/common/country">
<common:status>
<common:DataType>SERIAL_HEADER_TYPE</common:DataType>
<common:ISAN root="0000-0002-3B9F"/>
<common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>
<serial:SerialHeaderId root="0000-0002-3B9F"/>
<serial:MainTitles>
<title:TitleDetail>
<title:Title>Braquo</title:Title>
<title:Language>
<language:LanguageLabel>French</language:LanguageLabel>
<language:LanguageCode>
<language:CodingSystem>ISO639_2</language:CodingSystem>
<language:ISO639_2Code>FRE</language:ISO639_2Code>
</language:LanguageCode>
</title:Language>
<title:TitleKind>ORIGINAL</title:TitleKind>
</title:TitleDetail>
</serial:MainTitles>
<serial:TotalEpisodes>11</serial:TotalEpisodes>
<serial:TotalSeasons>0</serial:TotalSeasons>
<serial:MinDuration>
<common:TimeUnit>MIN</common:TimeUnit>
<common:TimeValue>45</common:TimeValue>
</serial:MinDuration>
<serial:MaxDuration>
<common:TimeUnit>MIN</common:TimeUnit>
<common:TimeValue>144</common:TimeValue>
</serial:MaxDuration>
<serial:MinYear>2009</serial:MinYear>
<serial:MaxYear>2009</serial:MaxYear>
<serial:MainParticipantList>
<participant:Participant>
<participant:FirstName>Frédéric</participant:FirstName>
<participant:LastName>Schoendoerffer</participant:LastName>
<participant:RoleCode>DIR</participant:RoleCode>
</participant:Participant>
<participant:Participant>
<participant:FirstName>Karole</participant:FirstName>
<participant:LastName>Rocher</participant:LastName>
<participant:RoleCode>ACT</participant:RoleCode>
</participant:Participant>
</serial:MainParticipantList>
<serial:CompanyList>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>R.T.B.F.</common:CompanyName>
</common:Company>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>Capa Drama</common:CompanyName>
</common:Company>
<common:Company>
<common:CompanyKind>PRO</common:CompanyKind>
<common:CompanyName>Marathon</common:CompanyName>
</common:Company>
</serial:CompanyList>
</serial:serialHeaderType>
I tried simply ignoring the XSD and using lxml.objectify on the XML I'd get from the API. I had a problem with namespaces. Having to refer to every child node with its explicit namespace was a real pain and doesn't make for readable code.
from lxml import objectify
obj = objectify.fromstring(response)
print obj.MainTitles.TitleDetail
# This will fail to find the element because you need to specify the namespace
print obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail']
# Or something like that, I couldn't get it to work, and I'd much rather use attributes and not specify the namespace
So then I tried generateDS to create some Python class definitions for me. I've lost the error messages that this attempt gave me but I couldn't get it to work. It would generate a module for each XSD that I gave it but it wouldn't parse the example XML.
I'm now trying pyxb and this seems much nicer so far. It's generating nicer definitions than generateDS (splitting them into multiple, reusable modules) but it won't parse the XML:
from models import serial
obj = serial.CreateFromDocument(response)
Traceback (most recent call last):
...
File "/vagrant/isan/isan.py", line 58, in lookup
return serial.CreateFromDocument(resp.content)
File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
instance = handler.rootObject()
File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>
The unrecognised node is the <serial:serialHeaderType> node from the example. Looking at the pyxb source it seems that this error comes about "if the top-level element got processed as a DOM instance" but I don't know what this means or how to prevent it.
I've run out of steam for trying to explore this, I don't know what to do next.
I have had a lot of luck parsing XML into Python using Beautiful Soup. It is extremely straightforward, and they provide pretty strong documentation. Check it out here:
http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
UnrecognizedDOMRootNodeError indicates that PyXB could not locate the element in a namespace for which it has bindings registered. In your case it fails on the first element, which is {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.
The schema for that namespace defines a complexType named SerialHeaderType but does not define an element with the name serialHeaderType. In fact it defines no top-level elements. So PyXB can't recognize it, and the XML does not validate.
Either there's an additional schema for the namespace that you'll need to locate which provides elements, or the message you're sending really doesn't validate. That may be because somebody's expecting a implicit mapping from a complex type to an element with that type, or because it's a fragment that would normally be found within some other element where that QName is a member element name.
UPDATE: You can hand-craft an element in that namespace by adding the
following to the generated bindings in serial.py:
serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)
If you do that, you won't get the UnrecognizedDOMRootNodeError but you
will get an IncompleteElementContentError at:
<common:status>
<common:DataType>SERIAL_HEADER_TYPE</common:DataType>
<common:ISAN root="0000-0002-3B9F"/>
<common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>
which provides the following details:
The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
Any accepted content has been stored in instance
The following element and wildcard content would be accepted:
An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
No content remains unconsumed
Reviewing the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing but required.
So it seems these documents are not meant to be validated, and PyXB is
probably the wrong technology to use.

How to convert jenkins job configuration config.xml to YAML format in python to be used jenkins-job-builder?

jenkins-job-builder is a nice tool to help me to maintain jobs in YAML files. see example in configuration chapter.
Now I had lots of old jenkins jobs, it will be nice to have a python script xml2yaml to convert the existing jenkins job config.xml to YAML file format.
Do you any suggestions to had a quick solution in python ?
I don't need it to be used in jenkins-job-builder directly, just can be converted it into YAML for reference.
For the convert, some part can be ignored like namespace.
config.xml segment looks like:
<project>
<logRotator class="hudson.tasks.LogRotator">
<daysToKeep>-1</daysToKeep>
<numToKeep>20</numToKeep>
<artifactDaysToKeep>-1</artifactDaysToKeep>
<artifactNumToKeep>-1</artifactNumToKeep>
</logRotator>
...
</project>
The yaml output could be:
- project:
logrotate:
daysToKeep: -1
numToKeep: 20
artifactDaysToKeep: -1
artifactNumToKeep: -1
If you are not familiar with config.xml in jenkins, you can check infra_backend-merge-all-repo job in https://ci.jenkins-ci.org
I'm writing a program that does this conversion from XML to YAML. It can dynamically query a Jenkins server and translate all the jobs to YAML.
https://github.com/ktdreyer/jenkins-job-wrecker
Right now it works for very simple jobs. I've taken a safe/pessimistic approach and the program will bail if it encounters XML that it cannot yet translate.
It's hard to tell from your question exactly what you're looking for here, but assuming you're looking for the basic structure:
Python has good support on most platforms for XML Parsing. Chances are you'll want to use something simple and easy to use like minidom. See the XML Processing Modules in the python docs for your version of python.
Once you've opened the file, looking for project and then parsing down from there and using a simple mapping should work pretty well given the simplicity of the yaml format.
from xml.dom.minidom import parse
def getText(nodelist):
rc = []
for node in nodelist:
if node.nodeType == node.TEXT_NODE:
rc.append(node.data)
return ''.join(rc)
def getTextForTag(nodelist,tag):
elements = nodelist.getElementsByTagName(tag)
if (elements.length>0):
return getText( elements[0].childNodes)
return ''
def printValueForTag(parent, indent, tag, valueName=''):
value = getTextForTag( parent,tag)
if (len(value)>0):
if (valueName==''):
valueName = tag
print indent + valueName+": "+value
def emitLogRotate(indent, rotator):
print indent+"logrotate:"
indent+=' '
printValueForTag( rotator,indent, 'daysToKeep')
printValueForTag( rotator,indent, 'numToKeep')
def emitProject(project):
print "- project:"
# all projects have log rotators, so no need to chec
emitLogRotate(" ",project.getElementsByTagName('logRotator')[0])
# next section...
dom = parse('config.xml')
emitProject(dom)
This snippet will print just a few lines of the eventual configuration file, but it puts you in the right direction for a simple translator. Based on what I've seen, there's not much room for an automatic translation scheme due to naming differences. You could streamline the code as you iterate for more options and to be table driven, but that's "just a matter of programming", this will at least get you started with the DOM parsers in python.
I suggest querying and accessing the xml with xpath expressions using xmlstarlet on the command line and in shell scripts. No trouble with low-level programmatical access to XML. XMLStarlet is an XPath swiss-army knife on the command line.
"xmlstarlet el" shows you the element structure of the entire XML as XPath expressions.
"xmlstarlet sel -t -c XPath-expression" will extract exactly what you want.
Maybe you want to spend an hour (or two) on freshing up your XPath know-how in advance.
You will shed a couple of tears, once you recognize how much time you spent with programming XML access before you used XMLStarlet.

Python 3.3 library abpy, file undefined

I'm using a library ABPY (library here) for python but it is in older version i think. I'm using Python 3.3.
I did fix some PRINT errors, but that's how much i know, I'm really new on programing.
I want to fetch some webpage and filter it from advertising and then print it again.
EDITED after Sg'te'gmuj told me how to convert from python 2.x to 3.x this is my new code:
#!/usr/local/bin/python3.1
import cgitb;cgitb.enable()
import urllib.request
response = urllib.request.build_opener()
response.addheaders = [('User-agent', 'Mozilla/5.0')]
response = urllib.request.urlopen("http://www.youtube.com")
html = response.read()
from abpy import Filter
with open("easylist.txt") as f:
ABPFilter = Filter(file('easylist.txt'))
ABPFilter.match(html)
print("Content-type: text/html")
print()
print (html)
Now it is displaying a blank page
Just took a peek at the library, it seems that the file "easylist.txt" does not exist; you need to create the file, and populate it with the appropriate filters (in whatever format ABP specifies).
Additionally, it appears it takes a file object; try something like this instead:
with open("easylist.txt") as f:
ABPFilter = Filter(f)
I can't say this is wholly accurate though since I have no experience with the library, but looking at it's code I'd suspect either of the two are the problem, if not both.
Addendum #1
Looking at the code more in-depth, I have to agree that even if that fix I supplied does work, you're going to have more problems (it's in 2.x as you suggested, when you're using 3.x). I'd suggest utilizing Python's 2to3 function, to convert from typical Python 2 to Python 3 code (it's not foolproof though). The command line would be as so:
2to3 -w abpy.py
That will convert it from Python 2.x to 3.x code, and re-write the source file.
Addendum #2
The code to pass the file object should be the "f" variable, as shown above (modified to represent that; I wasn't paying attention and just left the old file function call in the argument).
You need to pass a URI to the function as well:
ABPFilter.match(URI)
You'll need to modify the code to pass those items into an array (I'm assuming at least); I'm playing with it now to see. At present I'm getting a rule error (not a Python error; but merely error handling used by abpy.py, which is good because it suggests that it's the right train of thought).
The code for the Filter.match function is as following (after using the 2to3 Python script):
def match(self, url, elementtype=None):
tokens = RE_TOK.split(url)
print(tokens)
for tok in tokens:
if len(tok) > 2:
if tok in self.index:
for rule in self.index[tok]:
if rule.match(url, elementtype=elementtype):
print(str(rule))
What this means is you're, at present, at a point where you need to program the functionality; it appears this module only indicates the rule. However, that is still useful.
What this means is that you're going to have to modify this function to take the HTML, in place of the the "url" parameter. You're going to regex the HTML (this may be rather intensive) for a list of URIs and then run each item through the match loop Where you go from there to actually filter the nodes, I'm not sure; but there is a list of filter types, so I'm assuming there is a typical procedural ABP does to remove the nodes (possibly, in some cases merely by removing the given URI from the HTML?)
References
http://docs.python.org/3.3/library/2to3.html

BioPython Pubmed Eutils url?

I'm trying to run some queries against Pubmed's Eutils service. If I run them on the website I get a certain number of records returned, in this case 13126 (link to pubmed).
A while ago I bodged together a python script to build a query to do much the same thing, and the resultant url returns the same number of hits (link to Eutils result).
Of course, not having any formal programming background, it was all a bit cludgy, so I'm trying to do the same thing using Biopython. I think the following code should do the same thing, but it returns a greater number of hits, 23303.
from Bio import Entrez
Entrez.email = "A.N.Other#example.com"
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
record = Entrez.read(handle)
print(record["Count"])
I'm fairly sure it's just down to some subtlety in how the url is being generated, but I can't work out how to see what url is being generated by Biopython. Can anyone give me some pointers?
Thanks!
EDIT:
It's something to do with how the url is being generated, as I can get back the original number of hits by modifying the code to include double quotes around the search term, thus:
handle = Entrez.esearch(db='pubmed', term='"stem+cell"[ALL]', datetype='pdat', mindate='2012', maxdate='2012')
I'm still interested in knowing what url is being generated by Biopython as it'll help me work out how i have to structure the search term for when i want to do more complicated searches.
handle = Entrez.esearch(db="pubmed", term="stem+cell[All Fields]",datetype="pdat", mindate="2012", maxdate="2012")
print(handle.url)
You've solved this already (Entrez likes explicit double quoting round combined search terms), but currently the URL generated is not exposed via the API. The simplest trick would be to edit the Bio/Entrez/__init__.py file to add a print statement inside the _open function.
Update: Recent versions of Biopython now save the URL as an attribute of the returned handle, i.e. in this example try doing print(handle.url)

Categories