I'm trying to build a script to read an XML file.
This is my first time parsing XML, and I'm doing it in Python with xml.etree.ElementTree. The section of the file that I would like to process looks like this:
<component>
  <section>
    <id root="42CB916B-BB58-44A0-B8D2-89B4B27F04DF" />
    <code code="34089-3" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="DESCRIPTION SECTION" />
    <title mediaType="text/x-hl7-title+xml">DESCRIPTION</title>
    <text>
      <paragraph>Renese<sup>®</sup> is designated generically as polythiazide, and chemically as 2<content styleCode="italics">H</content>-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.</paragraph>
      <paragraph>Inert Ingredients: dibasic calcium phosphate; lactose; magnesium stearate; polyethylene glycol; sodium lauryl sulfate; starch; vanillin. The 2 mg tablets also contain: Yellow 6; Yellow 10.</paragraph>
    </text>
    <effectiveTime value="20051214" />
  </section>
</component>
<component>
  <section>
    <id root="CF5D392D-F637-417C-810A-7F0B3773264F" />
    <code code="42229-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="SPL UNCLASSIFIED SECTION" />
    <title mediaType="text/x-hl7-title+xml">ACTION</title>
    <text>
      <paragraph>The mechanism of action results in an interference with the renal tubular mechanism of electrolyte reabsorption. At maximal therapeutic dosage all thiazides are approximately equal in their diuretic potency. The mechanism whereby thiazides function in the control of hypertension is unknown.</paragraph>
    </text>
    <effectiveTime value="20051214" />
  </section>
</component>
The full file can be downloaded from:
https://dailymed.nlm.nih.gov/dailymed/getFile.cfm?setid=abd6ecf0-dc8e-41de-89f2-1e36ed9d6535&type=zip&name=Renese
Here is my code:
import xml.etree.ElementTree as ElementTree
import re

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

# Remove the default namespace definition (xmlns="http://some/namespace")
xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)

tree = ElementTree.fromstring(xmlstring)

for title in tree.iter('title'):
    print(title.text)
So far I'm able to print the titles, but I would also like to print the corresponding text captured in the <paragraph> tags.
I have tried this:
for title in tree.iter('title'):
    print(title.text)
    for paragraph in title.iter('paragraph'):
        print(paragraph.text)
But I get no output from paragraph.text.
Doing this instead:
for title in tree.iter('title'):
    print(title.text)
    for paragraph in tree.iter('paragraph'):
        print(paragraph.text)
I do print the text of the paragraphs, but (obviously) all of it is printed for every title found in the XML structure.
I would like to find a way to 1) identify the title; 2) print the corresponding paragraph(s).
How can I do it?
If you are willing to use lxml, then the following is a solution which uses XPath:
import re
from lxml.etree import fromstring

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)
doc = fromstring(xmlstring.encode())  # lxml only accepts bytes input, hence we encode

for title in doc.xpath('//title'):  # for all title nodes
    title_text = title.xpath('./text()')  # get text value of the node
    # get all text values of the paragraph nodes that appear lower (//paragraph)
    # in the hierarchy than the parent (..) of <title>
    paragraphs_for_title = title.xpath('..//paragraph/text()')
    print(title_text[0] if title_text else '')
    for paragraph in paragraphs_for_title:
        print(paragraph)
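If you would rather stay with xml.etree.ElementTree, a similar result can be obtained by iterating section by section, so that each title is paired only with the paragraphs of its own <section>. A rough sketch along the lines of the code in the question (not tested against the full DailyMed file):

import xml.etree.ElementTree as ElementTree
import re

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    # strip the default namespace, as in the question
    xmlstring = re.sub(r'\sxmlns="[^"]+"', '', f.read(), count=1)

root = ElementTree.fromstring(xmlstring)

for section in root.iter('section'):
    title = section.find('title')
    if title is None:
        continue
    print(title.text)
    for paragraph in section.iter('paragraph'):
        # itertext() also collects text inside nested tags such as <sup> or <content>
        print(''.join(paragraph.itertext()).strip())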
I am using XPath to parse an XML file:
from lxml import etree

example = '''<div n="0001" type="car" xml:id="_3a327f0002">
  <p xml:id="_3a327f0003">
    1. A car is
    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
    <p xml:id="_3a327f0005"> - has a big motor;</p>
    <p xml:id="_3a327f0006"> - and also has <lb/>
    big seats.
    </p>
  </p>
</div>'''
I want to serialize the above XML in the following way:
{"_3a327f0003": "1. A car is",
 "_3a327f0004": "- big, yellow and red;",
 "_3a327f0005": "- has a big motor;",
 "_3a327f0006": "- and also has big seats."}
Basically, I am extracting the text and building a dictionary where each text belongs to its xml:id. My code is as follows:
parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(example.encode(), parser=parser)

all_paras = XML_tree.xpath('.//p[@xml:id]')

list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.text
    for att in para.attrib:
        mykey = att
        if 'id' in mykey:
            mykey = 'xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)

PDM_XML_serializer(example)
It works, except that for a node like:
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
it will not extract the text that comes after the <lb/> element ("big seats.").
How should I modify
XML_tree.xpath('.//p[@xml:id]')
in order to get all the text from <p> to </p>?
EDIT:
para.itertext() could be used, but then the first node gives back all the text of the other nodes as well.
Using xml.etree.ElementTree
import xml.etree.ElementTree as ET

xml = '''<div n="0001" type="car" xml:id="_3a327f0002">
  <p xml:id="_3a327f0003">
    1. A car is
    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
    <p xml:id="_3a327f0005"> - has a big motor;</p>
    <p xml:id="_3a327f0006"> - and also has <lb/>
    big seats.
    </p>
  </p>
</div>'''

def _get_element_txt(element):
    txt = element.text
    children = list(element)
    if children:
        txt += children[0].tail.strip()
    return txt

root = ET.fromstring(xml)
data = {p.attrib['{http://www.w3.org/XML/1998/namespace}id']: _get_element_txt(p)
        for p in root.findall('.//p/p')}

for k, v in data.items():
    print(f'{k} --> {v}')
output
_3a327f0004 --> - big, yellow and red;
_3a327f0005 --> - has a big motor;
_3a327f0006 --> - and also has big seats.
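If a paragraph could contain more than one inline child (for example several <lb/> elements), a slightly more general helper might join the element's own text with the tail of every child; a sketch under that assumption:

def _get_element_txt(element):
    # element.text plus the tail text that follows each inline child (e.g. <lb/>)
    parts = [element.text or '']
    parts.extend(child.tail or '' for child in element)
    return ' '.join(part.strip() for part in parts if part.strip())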
Using lxml.etree, parse all elements in all_paras in a dict comprehension. Since your XML uses the special xml prefix and lxml does not yet support parsing namespace prefixes in attributes (see @mzjn's answer here), the code below works around this with next + iter to retrieve the attribute value.
Additionally, to retrieve all text values between nodes, xpath("text()") is used with str.strip and str.join to clean up whitespace and line breaks and concatenate the pieces together.
from lxml import etree

example = '''<div n="0001" type="car" xml:id="_3a327f0002">
  <p xml:id="_3a327f0003">
    1. A car is
    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
    <p xml:id="_3a327f0005"> - has a big motor;</p>
    <p xml:id="_3a327f0006"> - and also has <lb/>
    big seats.
    </p>
  </p>
</div>'''

XML_tree = etree.fromstring(example)
all_paras = XML_tree.xpath('.//p[@xml:id]')

output = {
    next(iter(t.attrib.values())): " ".join(i.strip()
                                            for i in t.xpath("text()")).strip()
    for t in all_paras
}

output
# {
#     '_3a327f0003': '1. A car is',
#     '_3a327f0004': '- big, yellow and red;',
#     '_3a327f0005': '- has a big motor;',
#     '_3a327f0006': '- and also has big seats.'
# }
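As an alternative to next(iter(...)), the xml:id attribute can also be read by its fully qualified (Clark) name, which does not depend on it being the only attribute on the element; a small sketch:

XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

output = {
    t.get(XML_ID): " ".join(i.strip() for i in t.xpath("text()")).strip()
    for t in all_paras
}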
You could use lxml's itertext() to get the text content of the p element:
mydict['text'] = ''.join(para.itertext())
See this question as well for a more generic solution.
This modifies the XPath to exclude the "A car is" text, as per your example. It also uses the XPath functions string() and normalize-space() to evaluate the para node as a string, join its text nodes, and clean up the text to match your example.
from lxml import etree

example = '''<div n="0001" type="car" xml:id="_3a327f0002">
  <p xml:id="_3a327f0003">
    1. A car is
    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
    <p xml:id="_3a327f0005"> - has a big motor;</p>
    <p xml:id="_3a327f0006"> - and also has <lb/>
    big seats.
    </p>
  </p>
</div>'''

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(example.encode(), parser=parser)

all_paras = XML_tree.xpath('./p/p[@xml:id]')

list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.xpath('normalize-space(string(.))')
    for att in para.attrib:
        mykey = att
        if 'id' in mykey:
            mykey = 'xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)

PDM_XML_serializer(example)
If these tags are just noise for you, you can simply remove them before parsing the XML:
XML_tree = etree.fromstring(example.replace('<lb/>', '').encode(), parser=parser)
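Another option, if lxml is already in play, is etree.strip_tags, which removes the tags themselves but keeps their text and tails attached to the parent; a sketch:

from lxml import etree

XML_tree = etree.fromstring(example.encode(), parser=parser)
# Drop the <lb/> markers in place; any surrounding text stays inside the parent <p>.
etree.strip_tags(XML_tree, 'lb')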
I have:
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    t = elem.tag
    idx = k = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t

events = ("start", "end")
title = None

for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)
    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'page':
            print(title, elem.text)
This seems to give the title just fine, but the page text always seems blank. What am I missing?
I haven't been able to open the file (it's huge), but I think this is an accurate snippet:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.0-wmf.12</generator>
    <case>first-letter</case>
    <namespaces>
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{Redr|move|from CamelCase|up}}</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>766348469</id>
      <parentid>766047928</parentid>
      <timestamp>2017-02-19T18:08:07Z</timestamp>
      <contributor>
        <username>GreenC bot</username>
        <id>27823944</id>
      </contributor>
      <minor />
      <comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
</mediawiki>
The best approach is to use the MWXML Python package, which is part of the Mediawiki Utilities (installable with pip3 install mwxml). MWXML is designed to solve this specific problem and is widely used. The software was created by research staff at the Wikimedia Foundation and is maintained by researchers inside and outside of the foundation.
Here's a code example adapted from an example notebook distributed with the library that prints out page IDs, revision IDs, timestamp, and the length of the text:
import mwxml
import glob

paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')

def process_dump(dump, path):
    for page in dump:
        for revision in page:
            yield page.id, revision.id, revision.timestamp, len(revision.text)

for page_id, rev_id, rev_timestamp, rev_textlength in mwxml.map(process_dump, paths):
    print("\t".join(str(v) for v in [page_id, rev_id, rev_timestamp, rev_textlength]))
The full example from which this is adapted reports the number of added and removed image links within each revision. It is fully documented but includes only 25 lines of code.
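For a single uncompressed dump file like the one in the question, the same idea can be applied without mwxml.map; a minimal sketch, assuming mwxml.Dump.from_file accepts an open file object as shown in the package's README:

import mwxml

# Iterate one decompressed dump file directly (path taken from the question).
dump = mwxml.Dump.from_file(open('data/enwiki-20190620-pages-articles-multistream.xml'))

for page in dump:
    for revision in page:
        print(page.title, len(revision.text or ''))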
The text property refers to the text between the element tags (i.e. <tag>text</tag>), not to all the child elements. Thus, in the case of the title element one has:
<title>AccessibleComputing</title>
and the text between the tags is AccessibleComputing.
In the case of the page element, the only text defined is '\n ' and there are other child elements (see below), including the title element:
<page>
  <title>Anarchism</title>
  <ns>0</ns>
  <id>12</id>
  ...
</page>
See the w3schools page for more details.
If you want to parse the file, I would recommend using either the findall method:
from lxml import etree
from lxml.etree import tostring

tree = etree.parse('data/enwiki-20190620-pages-articles-multistream.xml')
root = tree.getroot()

# iterate through all the titles
for title in root.findall(".//title", namespaces=root.nsmap):
    print(tostring(title))
    print(title.text)
which generates this output:
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">AccessibleComputing</title>\n '
AccessibleComputing
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Anarchism</title>\n '
Anarchism
or, using an explicit namespace prefix mapping:
nsmap = root.nsmap
nsmap['x'] = root.nsmap[None]
nsmap.pop(None)

# iterate through all the pages
for page in root.findall(".//x:page", namespaces=nsmap):
    print(page)
    print(repr(page.text))  # which prints '\n    '
    print('number of children: %i' % len(page.getchildren()))
and the output is:
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc610c8>
'\n '
number of children: 5
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc71bc8>
'\n '
number of children: 5
Please see the lxml tutorial for more details.
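Putting that together, the title and wikitext of each page can be read directly with namespace-qualified paths; a sketch building on the code above (the export-0.10 namespace URI comes from the snippet in the question):

ns = {'mw': 'http://www.mediawiki.org/xml/export-0.10/'}

for page in root.findall('.//mw:page', namespaces=ns):
    title = page.findtext('mw:title', namespaces=ns)
    text = page.findtext('.//mw:text', namespaces=ns)
    # print the title and the first 60 characters of the page text
    print(title, (text or '')[:60])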
You are trying to get the content of the text property of the <page> element, but that is just whitespace.
To get the text of the <text> element, just change
elif tname == 'page':
to
elif tname == 'text':
For XML parsing I use the untangle package from PyPI, which presents a complete document view. Then you have:
import untangle

doc = untangle.parse('data/enwiki-20190620-pages-articles-multistream.xml')

for page in doc.mediawiki.page:
    print(page.title.cdata)
    for text in page.revision.text:
        print(text.cdata)
To get the Wikipedia article, you need to access the content of the text property of the <text> element, and not the <page> element.
Here is the corrected version of your code:
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t

events = ("start", "end")
title = None

for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)
    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'text':
            print(title, elem.text)
        elem.clear()
Since the Wikipedia dump is quite large, don't forget the elem.clear() at the end of the for loop.
As mentioned in mzjn's answer, the content of the text property of the <page> element is just whitespace.
<?xml version="1.0" encoding="utf-8"?>
<bookstore name="Libreria Pastor">
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>
      <writer>Giada De Laurentiis</writer>
      <resumer>Pepe Lopez</resumer>
    </author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>
      <writer>J K. Rowling</writer>
      <resumer>Ana Martinez</resumer>
    </author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="PROGRAMMING">
    <title lang="en">Python for All</title>
    <author>
      <writer>M.L. Jobs</writer>
      <resumer>Delton Jones</resumer>
    </author>
    <year>2015</year>
    <price>39.99</price>
  </book>
</bookstore>
from xml.dom import minidom

arbol_dom = minidom.parse('C:\\Users\\MiguelRG\\Desktop\\sge\\Pythons\\e3.xml')

listaBibliotecas = arbol_dom.getElementsByTagName("bookstore");
listaLibros = arbol_dom.getElementsByTagName("book");
listaAutores = arbol_dom.getElementsByTagName("author");

for biblioteca in listaBibliotecas:
    print(biblioteca.tagName);
    print("Nombre : " + biblioteca.getAttribute("name"));
    print("Tiene hijos:" + str(biblioteca.hasChildNodes()));
    for l in listaLibros:
        print("Tipo: " + l.tagName);
        print("Categoria: " + l.getAttribute("category"));
        print("Titulo : " + l.childNodes[0].nodeValue);
        print("Lenguaje : " + l.getAttribute("lang"));
        for a in listaAutores:
            print("Escritor : " + str(a.childNodes[0].nodeValue));
            print("Resumen por : " + str(a.childNodes[1].nodeValue));
            break;
I want to read that XML with this program or something similar, but I can't get the information inside the titles, the price, and so on. I need to print the bookstore information first, then the information for every book, and then the information for the authors.
Any help will be appreciated.
Thank you.
There are a lot of nodes in an XML document. For instance, with
<book>
  <title>I Am The Very Model</title>
</book>
title is not childNodes[0]. That one is a text node containing the newline and spaces between <book> and <title>. You need to search the child nodes for the title element, and the easiest way to do that is with getElementsByTagName. Once you get the right element, there may be multiple nodes holding text, and you need to enumerate all of them to find the text you want. You also need to decide which bits of white space around a node can be stripped, or you risk having odd gaps in your output.
One reason to move to ElementTree or lxml is that they tend to tidy this up and give you an easier API.
You also need to be careful where you call getElementsByTagName. When you did listaAutores = arbol_dom.getElementsByTagName("author"); you got all of the authors in the document, when you really just wanted the author for a given book.
As an aside, get rid of the extra semicolons at the ends of the lines. They are unneeded and drive Python programmers nuts!
As another aside, print adds spaces and converts objects to strings. Just use its functionality instead of string concatenation so that your code has a consistent look and feel.
from xml.dom import minidom

arbol_dom = minidom.parse('test.xml')

def get_elem_text(elem):
    """Join the text in all immediate child text nodes."""
    return ''.join(node.data for node in elem.childNodes
                   if node.nodeType == node.TEXT_NODE)

for biblioteca in arbol_dom.getElementsByTagName("bookstore"):
    print(biblioteca.tagName)
    print("Nombre :", biblioteca.getAttribute("name"))
    print("Tiene hijos:", biblioteca.hasChildNodes())
    for l in biblioteca.getElementsByTagName("book"):
        print("Tipo:", l.tagName)
        print("Categoria:", l.getAttribute("category"))
        print("Titulo :", get_elem_text(l.getElementsByTagName("title")[0]))
        # lang is an attribute of <title>, not of <book>
        print("Lenguaje :", l.getElementsByTagName("title")[0].getAttribute("lang"))
        for a in l.getElementsByTagName("author"):
            print("Escritor :",
                  get_elem_text(a.getElementsByTagName("writer")[0]))
            print("Resumen por :",
                  get_elem_text(a.getElementsByTagName("resumer")[0]))
            break
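For comparison with the remark above about ElementTree giving an easier API, here is a rough equivalent of the same traversal with xml.etree.ElementTree (a sketch against the bookstore XML in the question):

import xml.etree.ElementTree as ET

bookstore = ET.parse('test.xml').getroot()

print(bookstore.tag, bookstore.get('name'))
for book in bookstore.findall('book'):
    title = book.find('title')
    print('Categoria:', book.get('category'))
    print('Titulo   :', title.text)
    print('Lenguaje :', title.get('lang'))
    author = book.find('author')
    print('Escritor :', author.findtext('writer'))
    print('Resumen  :', author.findtext('resumer'))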
I would suggest using xmltodict.
import xmltodict

xml = None
with open('test.xml', 'r') as xmlfile:
    xml = xmlfile.read()

data = xmltodict.parse(xml)
books = data['bookstore']['book']

for book in books:
    print('\n-------------')
    print(book['title']['#text'])
    print(book['author']['writer'])
    print(book['price'])
    print(book['year'])
The output would look like this:
-------------
Everyday Italian
Giada De Laurentiis
30.00
2005
-------------
Harry Potter
J K. Rowling
29.99
2005
-------------
Python for All
M.L. Jobs
39.99
2015
You can install it with pip:
pip install xmltodict
Then you can access all the information in a standard dict.
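Attributes come through with xmltodict's default "@" prefix, much like the "#text" key used above, so the title's language, for example, can be read like this (a small sketch):

for book in books:
    # '@lang' is the lang attribute of <title>; '#text' is its text content
    print(book['title']['@lang'], '-', book['title']['#text'])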
I am using lxml to parse a web document, and I want to get all the text in a <p> element, so I used the following code:
from lxml import etree

page = etree.HTML("<html><p>test1 <br /> test2</p></html>")
print(page.xpath("//p")[0].text)  # this just prints "test1", not "test1 <br/> test2"
The problem is that I want to get all the text in <p>, which is test1 <br /> test2 in the example, but lxml just gives me test1.
How can I get all text in <p> element?
Several other possible ways:
p = page.xpath("//p")[0]
print(etree.tostring(p, method="text", encoding="unicode"))
or using the XPath string() function (notice that XPath position indexes start from 1 instead of 0):
page.xpath("string(//p[1])")
Maybe like this:
from lxml import etree

pag = etree.HTML("<html><p>test1 <br /> test2</p></html>")

# get all the text nodes
print(pag.xpath("//p/text()"))
# ['test1 ', ' test2']

# concatenate them
print("".join(pag.xpath("//p/text()")))
# test1  test2
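Note that //p/text() only returns the direct text children of <p>; if some text sat inside a nested element, it would be skipped, whereas itertext() walks the whole subtree. A small sketch (the extra <b> is added here just for illustration):

from lxml import etree

page = etree.HTML("<html><p>test1 <br /> test2 <b>test3</b></p></html>")
p = page.xpath("//p")[0]

print("".join(p.itertext()))
# test1  test2 test3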
I have an XML file similar to this:
<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>
I want to remove all text in <b> or <u> elements (and descendants), and print the rest. This is what I tried:
from __future__ import print_function
import xml.etree.ElementTree as ET

tree = ET.parse('a.xml')
root = tree.getroot()

parent_map = {c: p for p in root.iter() for c in p}

for item in root.findall('.//b'):
    parent_map[item].remove(item)
for item in root.findall('.//u'):
    parent_map[item].remove(item)

print(''.join(root.itertext()).strip())
(I used the recipe in this answer to build the parent_map). The problem, of course, is that with remove(item) I'm also removing the text after the element, and the result is:
Some that I
whereas what I want is:
Some text that I want to keep.
Is there any solution?
If you don't end up finding anything better, you can use clear() instead of remove(), keeping the tail of the element:
import xml.etree.ElementTree as ET

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

tree = ET.fromstring(data)
a = tree.find('a')
for element in a:
    if element.tag in ('b', 'u'):
        tail = element.tail
        element.clear()
        element.tail = tail

print(ET.tostring(tree, encoding='unicode'))
prints (see empty b and u tags):
<root>
<a>Some <b /> text <i>that</i> I <u /> want to keep.</a>
</root>
Also, here's a solution using xml.dom.minidom:
import xml.dom.minidom

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

dom = xml.dom.minidom.parseString(data)
a = dom.getElementsByTagName('a')[0]
for child in a.childNodes:
    if getattr(child, 'tagName', '') in ('u', 'b'):
        a.removeChild(child)

print(dom.toxml())
prints:
<?xml version="1.0" ?><root>
<a>Some text <i>that</i> I want to keep.</a>
</root>
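If lxml is an option, etree.strip_elements with with_tail=False looks like a more direct route: it drops the <b> and <u> subtrees entirely while keeping the text that follows them. A sketch with the same input:

from lxml import etree

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

tree = etree.fromstring(data)
# Remove <b> and <u> (and their contents) but keep each element's tail text.
etree.strip_elements(tree, 'b', 'u', with_tail=False)

print(''.join(tree.itertext()).strip())
# Some  text that I want to keep.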