Python xml parsing using minidom


I just started learning how to parse XML using minidom. I tried to get the authors' names (XML data below) using the following code:
from xml.dom import minidom
xmldoc = minidom.parse("cora.xml")
author = xmldoc.getElementsByTagName('author')
for author in author:
    authorID = author.getElementsByTagName('author id')
    print authorID
I got empty brackets([]) all the way. Can someone please help me out? I will also need the title and venue. Thanks in advance. See xml data below:
<?xml version="1.0" encoding="UTF-8"?>
<coraRADD>
<publication id="ahlskog1994a">
<author id="199">M. Ahlskog</author>
<author id="74"> J. Paloheimo</author>
<author id="64"> H. Stubb</author>
<author id="103"> P. Dyreklev</author>
<author id="54"> M. Fahlman</author>
<title>Inganas</title>
<title>and</title>
<title>M.R.</title>
<venue>
<venue pubid="ahlskog1994a" id="1">
<name>Andersson</name>
<name> J Appl. Phys.</name>
<vol>76</vol>
<date> (1994). </date>
</venue>

getElementsByTagName() only finds elements (tags), not attributes. To read an attribute, use the Element.getAttribute() method instead:
for author in author:
    authorID = author.getAttribute('id')
    print(authorID)
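As a self-contained illustration of the difference between elements and attributes, here is a minimal sketch. It assumes a trimmed-down, well-formed version of the data from the question, inlined as a string (the names XML and authors are only for this example), and reads both the id attribute and the text of each author element:

```python
from xml.dom import minidom

# Trimmed-down, well-formed stand-in for cora.xml (assumption for this sketch)
XML = """<coraRADD>
  <publication id="ahlskog1994a">
    <author id="199">M. Ahlskog</author>
    <author id="74"> J. Paloheimo</author>
    <title>Inganas</title>
  </publication>
</coraRADD>"""

doc = minidom.parseString(XML)

authors = []
for author in doc.getElementsByTagName("author"):
    author_id = author.getAttribute("id")   # attribute value, always a string
    name = author.firstChild.data.strip()   # text node inside the element
    authors.append((author_id, name))

print(authors)
```

getAttribute() returns an empty string when the attribute is missing, so no extra check is needed for elements without an id.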
If you are still learning about parsing XML, you really want to stay away from the DOM. The DOM API is overly verbose because it was designed to fit many different programming languages.
The ElementTree API is easier to use:
import xml.etree.ElementTree as ET
tree = ET.parse('cora.xml')
root = tree.getroot()
# loop over all publications
for pub in root.findall('publication'):
    print(' '.join([t.text for t in pub.findall('title')]))
    for author in pub.findall('author'):
        print('Author id: {}'.format(author.attrib['id']))
        print('Author name: {}'.format(author.text))
    for venue in pub.findall('.//venue[@id]'):  # all venue tags with id attribute
        print(', '.join([name.text for name in venue.findall('name')]))

Related

I need to update multiple elements in xml file using dom python

I am a beginner with Python. What I am trying to do is update the InvoiceStatus of a certain list of invoices: I want to set it to "N" instead of "A". Below is an extract of the XML file:
<?xml version="1.0" encoding="WINDOWS-1252"?>
<AuditFile>
<Header>
<AuditFileVersion>1.04_01</AuditFileVersion>
<CompanyID>51630</CompanyID>
</Header>
<MasterFiles>
<Customer>
<CustomerID>20201376</CustomerID>
<AccountID>20000</AccountID>
</Customer>
</MasterFiles>
<SourceDocuments>
<SalesInvoices>
<NumberOfEntries>981</NumberOfEntries>
<Invoice>
<InvoiceNo>F2 UF/3510000211</InvoiceNo>
<ATCUD>0</ATCUD>
<DocumentStatus>
<InvoiceStatus>A</InvoiceStatus>
<SourceBilling>P</SourceBilling>
</DocumentStatus>
<InvoiceNo>F2 UF/3510020247</InvoiceNo>
<ATCUD>0</ATCUD>
<DocumentStatus>
<InvoiceStatus>A</InvoiceStatus>
<SourceBilling>P</SourceBilling>
</DocumentStatus>
<InvoiceNo>F2 UF/3510020247</InvoiceNo>
<ATCUD>0</ATCUD>
<DocumentStatus>
<InvoiceStatus>A</InvoiceStatus>
<SourceBilling>P</SourceBilling>
</DocumentStatus>
</Invoice>
</SalesInvoices>
</SourceDocuments>
</AuditFile>
Here is the script:
from xml.dom import minidom
def reemplazaTexto(nodo, textonuevo):
    nodo.firstChild.replaceWholeText(textonuevo)
doc = minidom.parse('sample.xml')
print(doc.toxml())
invoices = doc.getElementsByTagName('InvoiceStatus')
for nodo in invoices:
    reemplazaTexto(nodo, 'N')
print(doc.toxml())
But this script modifies all the InvoiceStatus. I would appreciate a hand on this.
Cheers,
Axel
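One possible approach, sketched here under stated assumptions rather than as a definitive answer: start from each InvoiceNo element, skip the ones that are not in the list, and walk forward through the siblings to the DocumentStatus that follows it. The set name targets and the shortened inline XML are hypothetical, and the sketch assumes that every DocumentStatus comes after its InvoiceNo as a sibling, as in the extract above:

```python
from xml.dom import minidom

# Shortened stand-in for the file in the question (assumption for this sketch)
XML = """<Invoice>
<InvoiceNo>F2 UF/3510000211</InvoiceNo>
<DocumentStatus><InvoiceStatus>A</InvoiceStatus></DocumentStatus>
<InvoiceNo>F2 UF/3510020247</InvoiceNo>
<DocumentStatus><InvoiceStatus>A</InvoiceStatus></DocumentStatus>
</Invoice>"""

# Hypothetical list of invoice numbers whose status should change
targets = {"F2 UF/3510020247"}

doc = minidom.parseString(XML)
for number in doc.getElementsByTagName("InvoiceNo"):
    if number.firstChild.data.strip() not in targets:
        continue
    # Walk forward to the DocumentStatus that follows this InvoiceNo
    node = number.nextSibling
    while node is not None and node.nodeName != "DocumentStatus":
        node = node.nextSibling
    if node is None:
        continue
    for status in node.getElementsByTagName("InvoiceStatus"):
        status.firstChild.data = "N"

statuses = [s.firstChild.data for s in doc.getElementsByTagName("InvoiceStatus")]
print(statuses)
```

Only the status that follows a listed invoice number is changed; the other one keeps its original value.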

How do I extract internal nested tag that has the same name as the external tag?

I am new to data science and was hoping to get your input into this query. When I parse and try and use findall() for "Title", I am getting all the values of Title. What I really want is the value of 'Title' tags within RelatedTerms.
Can anyone help?
Thanks,
<?xml version="1.0" encoding="utf-8"?>
<Terms>
<Term>
<Title>.177 (4.5mm) Airgun</Title>
<Description>The standard airgun calibre for international target shooting.
</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>
</Terms>

Use beautifulsoup:
from bs4 import BeautifulSoup
temp ="""<Terms>
<Term>
<Title>.177 (4.5mm) Airgun</Title>
<Description>The standard airgun calibre for international target shooting.
</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>"""
temp = BeautifulSoup(temp, "lxml")
# the lxml HTML parser lowercases tag names, so search in lowercase
s = temp.find('relatedterms')
print(s.find_all('title'))
Output:
[<title>Shooting sport equipment</title>]

Using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml") # Replace "file.xml" with the name of your XML file
root = tree.getroot()
for related_terms in root.findall("./Term/RelatedTerms"):
    for title_internal in related_terms.findall("./Term/Title"):
        print(title_internal.text)
Output:
Shooting sport equipment
Replace "file.xml" in tree = ET.parse("file.xml") with the name of your XML file.
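To see why the path matters, here is a small self-contained sketch that inlines a condensed copy of the XML from the question (the variable names are only for illustration). It compares a search for every Title with a path that goes through RelatedTerms:

```python
import xml.etree.ElementTree as ET

# Condensed copy of the question's data (assumption for this sketch)
XML = """<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
</Terms>"""

root = ET.fromstring(XML)

# iter() visits every Title in the document, outer and inner alike
all_titles = [t.text for t in root.iter("Title")]

# An explicit path only matches Title elements nested under RelatedTerms
related_titles = [t.text for t in root.findall("./Term/RelatedTerms/Term/Title")]

print(all_titles)
print(related_titles)
```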

Parse Grobid .tei.xml output with Beautiful Soup

I am trying to use Beautiful Soup to extract elements from a .tei.xml file that was generated using Grobid.
I can get title(s) using:
titles = soup.findAll('title')
What is the correct syntax to access the 'lower level' elements? (Author / Affiliation etc)
This is a portion of the tei.xml file that is the Grobid output:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 /data/grobid-0.5.1/grobid-home/schemas/xsd/Grobid.xsd"
xmlns:xlink="http://www.w3.org/1999/xlink">
<teiHeader xml:lang="en">
<encodingDesc>
<appInfo>
<application version="0.5.1-SNAPSHOT" ident="GROBID" when="2018-08-15T14:51+0000">
<ref target="https://github.com/kermitt2/grobid">GROBID - A machine learning software for extracting information from scholarly documents</ref>
</application>
</appInfo>
</encodingDesc>
<fileDesc>
<titleStmt>
<title level="a" type="main">The Role of Artificial Intelligence in Software Engineering</title>
</titleStmt>
<publicationStmt>
<publisher/>
<availability status="unknown"><licence/></availability>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Mark</forename><surname>Harman</surname></persName>
<affiliation key="aff0">
<orgName type="department">CREST Centre</orgName>
<orgName type="institution">University College London</orgName>
<address>
<addrLine>Malet Place</addrLine>
<postCode>WC1E 6BT</postCode>
<settlement>London</settlement>
<country key="GB">UK</country>
</address>
</affiliation>
</author>
<title level="a" type="main">The Role of Artificial Intelligence in Software Engineering</title>
</analytic>
<monogr>
<imprint>
<date/>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
Thanks.

BeautifulSoup lowercases the tag names. Here are some examples:
title = soup.html.body.teiheader.filedesc.analytic.title.string
for author in soup.html.body.teiheader.filedesc.sourcedesc.find_all('author'):
    tag_or_none = author.persname.forename
    first_affiliation = author.affiliation
Also see the BeautifulSoup documentation which covers everything.
I'm working on a similar problem now and looking for collaboration. Let me know if you want to team up -- sof#nconnor.com

Parsing soap/XML response in Python

I am trying to parse the XML below using Python. I do not know what kind of XML this is, as I have never worked with it before; I got it from a Microsoft API response.
My question is how to parse it and get the value of BinarySecurityToken in my Python code.
I referred to this question: Parse XML SOAP response with Python
But it looks like that one also needs an xmlns value to get the text. In my XML, however, I can't see any xmlns value near the element through which I could get the value.
Please let me know how to get the value of a specific field from the XML below using Python.
<?xml version="1.0" encoding="utf-8" ?>
<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:wsa="http://www.w3.org/2005/08/addressing">
<S:Header>
<wsa:Action xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wsa="http://www.w3.org/2005/08/addressing" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="Action" S:mustUnderstand="1">http://schemas.xmlsoap.org/ws/2005/02/trust/RSTR/Issue</wsa:Action>
<wsa:To xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wsa="http://www.w3.org/2005/08/addressing" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="To" S:mustUnderstand="1">http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To>
<wsse:Security S:mustUnderstand="1">
<wsu:Timestamp xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="TS">
<wsu:Created>2017-06-12T10:23:01Z</wsu:Created>
<wsu:Expires>2017-06-12T10:28:01Z</wsu:Expires>
</wsu:Timestamp>
</wsse:Security>
</S:Header>
<S:Body>
<wst:RequestSecurityTokenResponse xmlns:S="http://www.w3.org/2003/05/soap-envelope" xmlns:wst="http://schemas.xmlsoap.org/ws/2005/02/trust" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:saml="urn:oasis:names:tc:SAML:1.0:assertion" xmlns:wsp="http://schemas.xmlsoap.org/ws/2004/09/policy" xmlns:psf="http://schemas.microsoft.com/Passport/SoapServices/SOAPFault">
<wst:TokenType>urn:passport:compact</wst:TokenType>
<wsp:AppliesTo xmlns:wsa="http://www.w3.org/2005/08/addressing">
<wsa:EndpointReference>
<wsa:Address>https://something.something.something.com</wsa:Address>
</wsa:EndpointReference>
</wsp:AppliesTo>
<wst:Lifetime>
<wsu:Created>2017-06-12T10:23:01Z</wsu:Created>
<wsu:Expires>2017-06-13T10:23:01Z</wsu:Expires>
</wst:Lifetime>
<wst:RequestedSecurityToken>
<wsse:BinarySecurityToken Id="Compact0">my token</wsse:BinarySecurityToken>
</wst:RequestedSecurityToken>
<wst:RequestedAttachedReference>
<wsse:SecurityTokenReference>
<wsse:Reference URI="wwwww=">
</wsse:Reference>
</wsse:SecurityTokenReference>
</wst:RequestedAttachedReference>
<wst:RequestedUnattachedReference>
<wsse:SecurityTokenReference>
<wsse:Reference URI="swsw=">
</wsse:Reference>
</wsse:SecurityTokenReference>
</wst:RequestedUnattachedReference>
</wst:RequestSecurityTokenResponse>
</S:Body>
</S:Envelope>

This declaration is part of the start tag of the root element:
xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
It means that elements with the wsse prefix (such as BinarySecurityToken) are in the http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd namespace.
The solution is basically the same as in the answer to the linked question. It's just another namespace:
import xml.etree.ElementTree as ET
tree = ET.parse('soap.xml')
print(tree.find('.//{http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd}BinarySecurityToken').text)
Here is another way of doing it:
import xml.etree.ElementTree as ET
ns = {"wsse": "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"}
tree = ET.parse('soap.xml')
print(tree.find('.//wsse:BinarySecurityToken', ns).text)
The output in both cases is my token.
See https://docs.python.org/2.7/library/xml.etree.elementtree.html#parsing-xml-with-namespaces.
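The two lookups above can be tried without a file. The following sketch inlines a heavily shortened envelope (an assumption for this example; only the namespace URI and the element name come from the question) and shows that the prefix used in the dict is arbitrary, since ElementTree matches on the namespace URI:

```python
import xml.etree.ElementTree as ET

WSSE = ("http://docs.oasis-open.org/wss/2004/01/"
        "oasis-200401-wss-wssecurity-secext-1.0.xsd")

# Heavily shortened stand-in for the SOAP response (assumption for this sketch)
SOAP = """<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope"
            xmlns:wsse="%s">
  <S:Body>
    <wsse:BinarySecurityToken Id="Compact0">my token</wsse:BinarySecurityToken>
  </S:Body>
</S:Envelope>""" % WSSE

root = ET.fromstring(SOAP)

# The prefix "sec" is made up here; only the URI it maps to has to match
ns = {"sec": WSSE}
token_a = root.find(".//sec:BinarySecurityToken", ns)

# Same lookup with the URI written out in {uri}localname form
token_b = root.find(".//{%s}BinarySecurityToken" % WSSE)

print(token_a.tag)
print(token_a.text)
```

Both calls return the same element, and its tag is stored in the {uri}localname form regardless of the prefix used in the document.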

Creating a namespace dict helped me. Thank you @mzjn for linking that article.
In my SOAP response, I found that I had to use the full namespace-qualified name of the element to extract the text.
For example, I am working with the FedEx API, and one element that I needed to find was TrackDetails. My initial .find() looked like .find('{http://fedex.com/ws/track/v16}TrackDetails')
I was able to simplify this to the following:
ns = {'TrackDetails': 'http://fedex.com/ws/track/v16'}
tree.find('TrackDetails:TrackDetails',ns)
You see TrackDetails twice because I named the key TrackDetails in the dict, but you could name this anything you want. Just helped me to remember what I was working on in my project, but the TrackDetails after the : is the actual element in the SOAP response that I need.
Hope this helps someone!

Parsing large xml data using python's elementtree

I'm currently learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.
My code is right below, and a bit of the XML data follows it.
import xml.etree.ElementTree as ET
tree = ET.fromstring("C:\pbc.xml")
root = tree.getroot()
for article in root.findall('article'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'): # all venue tags with id attribute
        print 'journal'
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>
<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
<month>November</month>
<year>1974</year>
</article>

You are using .fromstring() where you need .parse():
import xml.etree.ElementTree as ET
tree = ET.parse(r"C:\pbc.xml")
root = tree.getroot()
.fromstring() expects to be given the XML data itself as a string, not a filename.
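The difference can be checked with a short sketch. It uses an in-memory stand-in for the file (io.BytesIO and the one-line XML are assumptions for this example) so that it runs without C:\pbc.xml being present:

```python
import io
import xml.etree.ElementTree as ET

# One-line stand-in for the real data (assumption for this sketch)
XML = "<dblp><article><author>E. F. Codd</author></article></dblp>"

# fromstring() takes the XML text itself and returns the root Element
root_from_string = ET.fromstring(XML)

# parse() takes a filename or file object and returns an ElementTree
tree = ET.parse(io.BytesIO(XML.encode("utf-8")))
root_from_parse = tree.getroot()

# Passing a path to fromstring() makes the parser read the path as XML
try:
    ET.fromstring("C:\\pbc.xml")
    error = None
except ET.ParseError as exc:
    error = str(exc)

print(root_from_string.tag, root_from_parse.tag)
print(error)
```

The exact wording of the ParseError can vary between Python versions, but the call fails either way because a file path is not XML.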
If the document is really large (many megabytes or more), you should use the ET.iterparse() function instead and clear the elements you have processed:
# note: the tag= keyword is accepted by lxml.etree.iterparse,
# not by the standard-library xml.etree.ElementTree.iterparse
for event, article in ET.iterparse('C:\\pbc.xml', tag='article'):
    for title in article.findall('title'):
        print('Title: {}'.format(title.text))
    for author in article.findall('author'):
        print('Author name: {}'.format(author.text))
    for journal in article.findall('journal'):
        print('journal')
    article.clear()

with open(r"C:\pbc.xml", 'rb') as f:
    root = ET.fromstring(f.read().strip())
Unlike ET.parse, ET.fromstring expects a string with XML content, not the name of a file.
Also in contrast to ET.parse, ET.fromstring returns a root Element, not a Tree. So you should omit
root = tree.getroot()
Also, the XML snippet you posted needs a closing </dblp> to be parsable. I assume your real data has that closing tag...
The iterparse provided by xml.etree.ElementTree does not have a tag argument, although lxml.etree.iterparse does have a tag argument.
Try:
# Python 2 code; on Python 3 use html.entities, chr() and .items() instead
import xml.etree.ElementTree as ET
import htmlentitydefs
filename = "test.xml"
# http://stackoverflow.com/a/10792473/190597 (lambacck)
parser = ET.XMLParser()
parser.entity.update((x, unichr(i)) for x, i in htmlentitydefs.name2codepoint.iteritems())
context = ET.iterparse(filename, events = ('end', ), parser=parser)
for event, elem in context:
    if elem.tag == 'article':
        for author in elem.findall('author'):
            print('Author name: {}'.format(author.text))
        for journal in elem.findall('journal'):
            print(journal.text)
        elem.clear()
Note: To use iterparse your XML must be well-formed, which means among other things that there cannot be empty lines at the beginning of the file.
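The pattern of filtering on elem.tag and clearing each finished element can be tried on a small in-memory document. This is a minimal sketch using only the standard library; the inline data and the io.BytesIO wrapper are assumptions standing in for the real file:

```python
import io
import xml.etree.ElementTree as ET

# Small stand-in for the DBLP file (assumption for this sketch)
XML = b"""<dblp>
<article key="persons/Codd71a">
  <author>E. F. Codd</author>
  <journal>IBM Research Report, San Jose, California</journal>
</article>
<article key="persons/Hall74">
  <author>Patrick A. V. Hall</author>
  <journal>Technical Rep. UKSC 0060</journal>
</article>
</dblp>"""

authors = []
for event, elem in ET.iterparse(io.BytesIO(XML), events=("end",)):
    # "end" fires for every element; only act once a whole article is complete
    if elem.tag != "article":
        continue
    authors.extend(a.text for a in elem.findall("author"))
    elem.clear()  # drop the article's children to keep memory use low

print(authors)
```

Child elements must not be cleared on their own "end" events, otherwise their text is already gone by the time the enclosing article is processed.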

You'd better not feed the XML file's meta-information to the parser. The parser does fine as long as the tags are properly closed, but it may not recognize the <?xml declaration. So omit the first two lines and try again. :-)
