Parsing large xml data using python's elementtree


I'm currently learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.
My code is right below, followed by a bit of the XML data.
import xml.etree.ElementTree as ET
tree = ET.fromstring("C:\pbc.xml")
root = tree.getroot()
for article in root.findall('article'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'): # all venue tags with id attribute
        print 'journal'
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>
<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
<month>November</month>
<year>1974</year>
</article>

You are using .fromstring() instead of .parse():
import xml.etree.ElementTree as ET
tree = ET.parse(r"C:\pbc.xml")  # raw string keeps the backslash literal
root = tree.getroot()
.fromstring() expects to be given the XML data in a bytestring, not a filename.
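To make the difference concrete, here is a minimal Python 3 sketch. The XML text and the temporary file are made up for illustration and are not the asker's data. It shows that fromstring() returns the root Element directly, that parse() returns a tree object, and that handing a path to fromstring() raises ParseError because the path itself gets parsed as XML.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file contents.
XML_TEXT = "<dblp><article><title>Example</title></article></dblp>"

# fromstring() takes the XML text itself and returns the root Element.
root_from_text = ET.fromstring(XML_TEXT)

# parse() takes a filename (or file object) and returns an ElementTree,
# so getroot() is needed to reach the root Element.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(XML_TEXT)
    path = handle.name
try:
    root_from_file = ET.parse(path).getroot()
finally:
    os.remove(path)

# Passing the path to fromstring() makes the parser read the path as XML.
try:
    ET.fromstring(path)
    path_error = None
except ET.ParseError as exc:
    path_error = exc

print(root_from_text.tag, root_from_file.tag)
print(type(path_error).__name__)
```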
If the document is really large (many megabytes or more), use the ET.iterparse() function instead and clear elements once you have processed them:
for event, article in ET.iterparse(r'C:\pbc.xml'):
    # the standard-library iterparse() has no tag argument, so filter here
    if article.tag != 'article':
        continue
    for title in article.findall('title'):
        print('Title: {}'.format(title.text))
    for author in article.findall('author'):
        print('Author name: {}'.format(author.text))
    for journal in article.findall('journal'):
        print('journal')
    article.clear()

with open(r"C:\pbc.xml", 'rb') as f:
    root = ET.fromstring(f.read().strip())
Unlike ET.parse, ET.fromstring expects a string with XML content, not the name of a file.
Also in contrast to ET.parse, ET.fromstring returns a root Element, not a Tree. So you should omit
root = tree.getroot()
Also, the XML snippet you posted needs a closing </dblp> to be parsable. I assume your real data has that closing tag...
The iterparse provided by xml.etree.ElementTree does not have a tag argument, although lxml.etree.iterparse does have a tag argument.
Try:
import xml.etree.ElementTree as ET
import htmlentitydefs  # Python 2 module; named html.entities in Python 3
filename = "test.xml"
# http://stackoverflow.com/a/10792473/190597 (lambacck)
parser = ET.XMLParser()
# teach the parser the named HTML entities used in the data
parser.entity.update((x, unichr(i)) for x, i in htmlentitydefs.name2codepoint.iteritems())
context = ET.iterparse(filename, events=('end',), parser=parser)
for event, elem in context:
    if elem.tag == 'article':
        for author in elem.findall('author'):
            print('Author name: {}'.format(author.text))
        for journal in elem.findall('journal'):
            print(journal.text)
        elem.clear()
Note: To use iterparse your XML must be well-formed, which means, among other things, that there cannot be empty lines at the beginning of the file.
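For readers on Python 3, the same streaming idea can be sketched with the standard library alone. This is an illustration under assumptions, not the answerer's code: the sample bytes are a trimmed copy of the question's data without the DOCTYPE or any named entities, and iter_articles is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed, well-formed sample based on the question's data (assumption:
# no DOCTYPE and no named entities, so no custom parser is needed).
SAMPLE = b"""<dblp>
<article key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
</article>
<article key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
</article>
</dblp>"""


def iter_articles(source):
    """Yield (title, authors) for each <article>, then clear that element."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue  # children are kept until their <article> has ended
        title = elem.findtext("title")
        authors = [author.text for author in elem.findall("author")]
        yield title, authors
        elem.clear()  # drop the processed article's children and text


records = list(iter_articles(io.BytesIO(SAMPLE)))
for title, authors in records:
    print(title, authors)
```

Only complete article elements are cleared, so the title and author children are still intact when each article is read.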

It is better not to feed the XML file's meta-information to the parser. The parser does well as long as the tags are properly closed, but it may not recognize the <?xml declaration. So omit the first two lines and try again. :-)

Related

How do I extract internal nested tag that has the same name as the external tag?

I am new to data science and was hoping to get your input into this query. When I parse and try and use findall() for "Title", I am getting all the values of Title. What I really want is the value of 'Title' tags within RelatedTerms.
Can anyone help?
Thanks,
<?xml version="1.0" encoding="utf-8"?>
<Terms>
<Term>
<Title>.177 (4.5mm) Airgun</Title>
<Description>The standard airgun calibre for international target shooting.
</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>
</Terms>

Use beautifulsoup:
from bs4 import BeautifulSoup
temp ="""<Terms>
<Term>
<Title>.177 (4.5mm) Airgun</Title>
<Description>The standard airgun calibre for international target shooting.
</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>"""
temp = BeautifulSoup(temp, "lxml")
# the lxml HTML parser lowercases tag names, so search in lowercase
s = temp.find('relatedterms')
print(s.find_all('title'))
Output:
[<title>Shooting sport equipment</title>]

Using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse("file.xml") # Replace "file.xml" with the name of your XML file
root = tree.getroot()
for related_terms in root.findall("./Term/RelatedTerms"):
    for title_internal in related_terms.findall("./Term/Title"):
        print(title_internal.text)
Output:
Shooting sport equipment
Replace "file.xml" in tree = ET.parse("file.xml") with the name of your XML file.
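The two nested loops can also be collapsed into a single path expression. The following is a small Python 3 sketch that parses the question's sample from a string instead of a file. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined so the sketch is runnable.
XML_TEXT = """<Terms>
<Term>
<Title>.177 (4.5mm) Airgun</Title>
<Description>The standard airgun calibre for international target shooting.</Description>
<RelatedTerms>
<Term>
<Title>Shooting sport equipment</Title>
<Relationship>Narrower Term</Relationship>
</Term>
</RelatedTerms>
</Term>
</Terms>"""

root = ET.fromstring(XML_TEXT)

# Only Title elements that sit under RelatedTerms/Term, at any depth.
nested_titles = [t.text for t in root.findall(".//RelatedTerms/Term/Title")]

# For comparison: only the Title elements of the top-level Term entries.
top_level_titles = [t.text for t in root.findall("./Term/Title")]

print(nested_titles)
print(top_level_titles)
```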

Create array of values from specific element in XML using Python

I have an XML file which has many elements. I would like to create a list/array of all the values which have a specific element name, in my case "pair:ApplicationNumber".
I've gone over a lot of the other questions however I am not able to find an answer. I know that I can do this by loading the text file and going over it using pandas however, I'm sure there's a much better way.
I was unsuccessful with ElementTree as well as with xml.dom using minidom.
My code currently looks as follows:
import os
from xml.dom import minidom
WindowsUser = os.getenv('username')
XMLPath = os.path.join('C:\\Users', WindowsUser, 'Downloads', 'ApplicationsByCustomerNumber.xml')
xmldoc = minidom.parse(XMLPath)
itemlist = xmldoc.getElementsByTagName('pair:ApplicationNumber')
for s in itemlist:
    print(s.attributes['pair:ApplicationNumber'].value)
An example XML file looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<pair:PatentApplicationList xsi:schemaLocation="urn:us:gov:uspto:pair PatentApplicationList.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pair="urn:us:gov:uspto:pair">
<pair:FileHeader>
<pair:FileCreationTimeStamp>2017-07-10T10:52:12.12</pair:FileCreationTimeStamp>
</pair:FileHeader>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62383607</pair:ApplicationNumber>
<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Application Dispatched from Preexam, Not Yet Docketed</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-09-16</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1354-T-02-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-09-06</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-05-30T21:40:37.37</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-05-30</pair:LastTransactionDate>
<pair:LastTransactionDescription>Email Notification</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62292372</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-11-01</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>681-S-23-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-02-08</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-20T21:59:26.26</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-20</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62289245</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
<pair:ApplicationStatusText>Abandoned -- Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
<pair:ApplicationStatusDate>2016-10-26</pair:ApplicationStatusDate>
<pair:AttorneyDocketNumber>1526-P-01-US</pair:AttorneyDocketNumber>
<pair:FilingDate>2016-01-31</pair:FilingDate>
<pair:LastModifiedTimestamp>2017-06-15T21:24:13.13</pair:LastModifiedTimestamp>
<pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
<pair:LastTransactionDate>2017-06-15</pair:LastTransactionDate>
<pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction>
<pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator>
</pair:ApplicationStatusData>
</pair:PatentApplicationList>

When the file is parsed with ElementTree, the "pair:" prefix of each tag is expanded to the namespace URI declared by xmlns:pair, so the tag is really {urn:us:gov:uspto:pair}ApplicationNumber. That is why it doesn't match 'pair:ApplicationNumber', even though it looks like it should.
I've used ElementTree to extract the application numbers as follows (I've used a local XML file in my examples, rather than the full path in your code).
Example 1:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root:
    if 'ApplicationStatusData' in item.tag:
        for child in item:
            if 'ApplicationNumber' in child.tag:
                print(child.text)
Example 2:
from xml.etree import ElementTree
tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()
for item in root.iter('{urn:us:gov:uspto:pair}ApplicationStatusData'):
    for child in item.iter('{urn:us:gov:uspto:pair}ApplicationNumber'):
        print(child.text)
Hope this may be useful.
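Since the question asks for a list rather than printed output, here is a minimal Python 3 sketch that collects the values with a prefix-to-URI mapping passed to findall(). The inlined XML is a shortened version of the question's file, and the names NS and application_numbers are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's file, inlined for a runnable sketch.
XML_TEXT = """<pair:PatentApplicationList xmlns:pair="urn:us:gov:uspto:pair">
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62383607</pair:ApplicationNumber>
<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
</pair:ApplicationStatusData>
<pair:ApplicationStatusData>
<pair:ApplicationNumber>62292372</pair:ApplicationNumber>
<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
</pair:ApplicationStatusData>
</pair:PatentApplicationList>"""

# Map the prefix used in the search path to the namespace URI.
NS = {"pair": "urn:us:gov:uspto:pair"}

root = ET.fromstring(XML_TEXT)
application_numbers = [
    node.text for node in root.findall(".//pair:ApplicationNumber", NS)
]
print(application_numbers)
```

The prefix in the mapping only has to match the prefix used in the search path. The URI is what ties it to the document.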

Python: Xml parsing method

I have a problem with a Python script which is used to parse an XML file. This is the XML file:
file.xml
<Tag1 SchemaVersion="1.1" xmlns="http://www.microsoft.com/axe">
<RandomTag>TextText</RandomTag>
<Tag2 xmlns="http://schemas.datacontract.org/2004/07">
<AnotherRandom>Abc</AnotherRandom>
</Tag2>
</Tag1>
I am using xml.etree.ElementTree as the parsing method. My task is to change the text inside RandomTag (in this case "TextText"). This is the Python code:
python code
import xml.etree.ElementTree as ET
customXmlFile = 'file.xml'
ns = {
    'ns': 'http://www.microsoft.com/axe',
    'sc': 'http://schemas.datacontract.org/2004/07/Microsoft.Assessments.Relax.ObjectModel_V1'
}
tree = ET.parse(customXmlFile)
root = tree.getroot()
node = root.find('ns:RandomTag', namespaces=ns)
node.text = 'NEW TEXT'
ET.register_namespace('', 'http://www.microsoft.com/axe')
tree.write(customXmlFile + ".new",
           xml_declaration=True,
           encoding='utf-8',
           method="xml")
I don't get any runtime errors and the code works fine, but all the namespaces are declared on the first node (Tag1), and a shortcut prefix is used for Tag2 and AnotherRandom. Also, the SchemaVersion is deleted.
file.xml.new - output
<?xml version='1.0' encoding='utf-8'?>
<Tag1 xmlns="http://www.microsoft.com/axe" xmlns:ns1="http://schemas.datacontract.org/2004/07" SchemaVersion="1.1">
<RandomTag>NEW TEXT</RandomTag>
<ns1:Tag2>
<ns1:AnotherRandom>Abc</ns1:AnotherRandom>
</ns1:Tag2>
</Tag1>
file.xml.new - desired output
<Tag1 SchemaVersion="1.1" xmlns="http://www.microsoft.com/axe">
<RandomTag>TextText</RandomTag>
<Tag2 xmlns="http://schemas.datacontract.org/2004/07">
<AnotherRandom>NEW TEXT</AnotherRandom>
</Tag2>
</Tag1>
What should I change to get exactly the same XML format as at the beginning, with only that text changed?

This is a bit of a hack but will do kind of what you want. However, playing around with namespaces like this surely violates the XML standard. I suggest you check out lxml if you want better handling of namespaces.
You must call register_namespace() before parsing in the file. Since repeated calls to that function overwrite previous mapping, you must manually edit the internal dict.
import xml.etree.ElementTree as ET
customXmlFile = 'test.xml'
ns = {'ns': 'http://www.microsoft.com/axe',
      'sc': 'http://schemas.datacontract.org/2004/07/'}
ET.register_namespace('', 'http://www.microsoft.com/axe')
ET._namespace_map['http://schemas.datacontract.org/2004/07'] = ''
tree = ET.parse(customXmlFile)
root = tree.getroot()
node = root.find('ns:RandomTag', namespaces=ns)
node.text = 'NEW TEXT'
tree.write(customXmlFile + ".new",
           xml_declaration=True,
           encoding='utf-8',
           method="xml")
For more information about this see:
http://effbot.org/zone/element-namespaces.htm
Saving XML files using ElementTree
Cannot write XML file with default namespace

Python xml parsing using minidom

I just started learning how to parse xml using minidom. I tried to get the author's names (xml data is down below) using the following code:
from xml.dom import minidom
xmldoc = minidom.parse("cora.xml")
author = xmldoc.getElementsByTagName ('author')
for author in author:
    authorID=author.getElementsByTagName('author id')
    print authorID
I got empty brackets([]) all the way. Can someone please help me out? I will also need the title and venue. Thanks in advance. See xml data below:
<?xml version="1.0" encoding="UTF-8"?>
<coraRADD>
<publication id="ahlskog1994a">
<author id="199">M. Ahlskog</author>
<author id="74"> J. Paloheimo</author>
<author id="64"> H. Stubb</author>
<author id="103"> P. Dyreklev</author>
<author id="54"> M. Fahlman</author>
<title>Inganas</title>
<title>and</title>
<title>M.R.</title>
<venue>
<venue pubid="ahlskog1994a" id="1">
<name>Andersson</name>
<name> J Appl. Phys.</name>
<vol>76</vol>
<date> (1994). </date>
</venue>

You can only find tags with getElementsByTagName(), not attributes. You'll need to access those through the Element.getAttribute() method instead:
for author in author:
    authorID = author.getAttribute('id')
    print(authorID)
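As a self-contained Python 3 illustration of the same point, the sketch below reads both the id attribute and the author's name with minidom. The inlined XML is a trimmed, well-formed variant of the question's data (an assumption, since the posted snippet is cut off), and the variable names are illustrative.

```python
from xml.dom import minidom

# Trimmed, well-formed variant of the question's data (assumption).
XML_TEXT = """<coraRADD>
<publication id="ahlskog1994a">
<author id="199">M. Ahlskog</author>
<author id="74"> J. Paloheimo</author>
<title>Inganas</title>
</publication>
</coraRADD>"""

doc = minidom.parseString(XML_TEXT)

authors = []
for node in doc.getElementsByTagName("author"):
    author_id = node.getAttribute("id")  # attribute value, as a string
    name = node.firstChild.data.strip()  # text content of the element
    authors.append((author_id, name))

print(authors)
```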
If you are still learning about parsing XML, you really want to stay away from the DOM. The DOM API is overly verbose to fit many different programming languages.
The ElementTree API would be easier to use:
import xml.etree.ElementTree as ET
tree = ET.parse('cora.xml')
root = tree.getroot()
# loop over all publications
for pub in root.findall('publication'):
    print(' '.join([t.text for t in pub.findall('title')]))
    for author in pub.findall('author'):
        print('Author id: {}'.format(author.attrib['id']))
        print('Author name: {}'.format(author.text))
    for venue in pub.findall('.//venue[@id]'):  # all venue tags with id attribute
        print(', '.join([name.text for name in venue.findall('name')]))

Python xml etree DTD from a StringIO source?

I'm adapting the following code (created via advice in this question), which took an XML file and its DTD and converted them to a different format. For this problem only the loading section is important:
xmldoc = open(filename)
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(xmldoc, parser)
This worked fine, whilst using the file system, but I'm converting it to run via a web framework, where the two files are loaded via a form.
Loading the xml file works fine:
tree = etree.parse(StringIO(data['xml_file']))
But as the DTD is linked to in the top of the xml file, the following statement fails:
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
tree = etree.parse(StringIO(data['xml_file']), parser)
Via this question, I tried:
etree.DTD(StringIO(data['dtd_file']))
tree = etree.parse(StringIO(data['xml_file']))
Whilst the first line doesn't cause an error, the second falls over on unicode entities the DTD is meant to pick up (and does so in the file system version):
XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46
How do I go about correctly loading this DTD?

Here's a short but complete example, using the custom resolver technique @Steven mentioned.
from StringIO import StringIO  # Python 2; use io.StringIO on Python 3
from lxml import etree
data = dict(
    xml_file = '''<?xml version="1.0"?>
<!DOCTYPE x SYSTEM "a.dtd">
<x><y>&eacute;zz</y></x>
''',
    dtd_file = '''<!ENTITY eacute "&#233;">
<!ELEMENT x (y)>
<!ELEMENT y (#PCDATA)>
''')
class DTDResolver(etree.Resolver):
    def resolve(self, url, id, context):
        # serve the uploaded DTD text instead of reading "a.dtd" from disk
        return self.resolve_string(data['dtd_file'], context)
xmldoc = StringIO(data['xml_file'])
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
parser.resolvers.add(DTDResolver())
try:
    tree = etree.parse(xmldoc, parser)
except etree.XMLSyntaxError as e:
    # handle xml and validation errors
    raise

You could probably use a custom resolver. The docs actually give an example of doing this to provide a dtd.
