I have a Wikipedia dump that I want to parse, and I am running into some mysterious problems while using Python's XML parser, ElementTree.
My current problem is that ElementTree does not seem to find text that is actually there. Here is an example of the data:
<page>
<title>Cengiz Han</title>
<ns>0</ns>
<id>10</id>
<revision>
<id>20337884</id>
<parentid>20218916</parentid>
<timestamp>2019-01-29T14:02:43Z</timestamp>
<contributor>
<username>CommonsDelinker</username>
<id>31545</id>
</contributor>
<comment>China_11b.jpg dosyası Map_of_China_1142.jpg ile değiştirildi</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
...some long Genghis Khan stuff...
</text>
</page>
Now when I parse it with this:
for event, elem in et.iterparse('dataset/wiki_test', events=('start', 'end', 'start-ns', 'end-ns')):
    if event == 'start':
        if elem.tag == 'page':
            if len(list(elem)) == 0:
                continue
            title = elem.find('title').text
            if title == None or 'MediaWiki' in title:
                elem.clear()
                continue
            wiki_id = elem.find('id')
            if wiki_id == None:
                elem.clear()
                continue
            wiki_id = wiki_id.text
            revision = elem.find('revision')
            if revision != None:
                print(list(revision))
                text = revision.find('text').text
                print(text)
                if text != None:
                    count += 1
                    titles += title + '\n'
                    page = {'wiki_id': wiki_id, 'title': title, 'text': text.text}
                    pages += json.dumps(page, ensure_ascii=False) + '\n'
                    elem.clear()
The revision.find('text').text line seems to find no text for some elements, including the one above, and those elements make up about one seventh of my data, which is annoying. The same happened with page -> id for some other entries, where it claimed the element did not exist at all. I worked around that by ignoring those entries, but I don't really want to do that, and the error makes no sense to me at all.
Here is another page, which works totally fine.
<page>
<title>Mustafa Suphi</title>
<ns>0</ns>
<id>22</id>
<revision>
<id>20077185</id>
<parentid>20017115</parentid>
<timestamp>2018-10-14T08:31:32Z</timestamp>
<contributor>
<username>Vikiçizer</username>
<id>90501</id>
</contributor>
<comment>/* top */düzeltme [[Vikipedi:AWB|AWB]] ile</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
...some Mustafa Suphi stuff...
</text>
<sha1>m5finh6h2kr8h2fbtmsatp5fhz1siwq</sha1>
</revision>
</page>
What am I doing wrong?
You have posted two examples: "working" and "not working".
In the "not working" one there is no closing
</revision>
tag. Are you sure this is the XML you actually have, or is that just a copy & paste mistake?
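For what it's worth, a second pitfall in the posted loop is that it inspects pages on the 'start' event: at that point the parser has not yet read the element's children, so find('title') or find('revision') can legitimately return None. Handling only 'end' events avoids this. A minimal sketch, with a hypothetical inline sample standing in for 'dataset/wiki_test':

```python
import io
import json
import xml.etree.ElementTree as ET

# hypothetical stand-in for the real dump file
sample = """<pages>
<page>
  <title>Cengiz Han</title>
  <id>10</id>
  <revision><id>20337884</id><text>...some long Genghis Khan stuff...</text></revision>
</page>
</pages>"""

pages = []
for event, elem in ET.iterparse(io.StringIO(sample), events=('end',)):
    if elem.tag != 'page':
        continue
    # on the 'end' event, all children of <page> have been parsed
    title = elem.findtext('title')
    wiki_id = elem.findtext('id')
    text = elem.findtext('revision/text')
    if title and wiki_id and text:
        pages.append(json.dumps({'wiki_id': wiki_id, 'title': title, 'text': text},
                                ensure_ascii=False))
    elem.clear()
```

The same pattern applies to the real dump; if the dump declares a default namespace, the tag names additionally need the namespace stripped, as discussed in later answers.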
I would like to extract the comment section of an XML file. The information I want is found inside the Tag element, within a nested Text tag, and is "**EXAMPLE**".
The structure of the XML file is shown below.
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I tried something like the code below but couldn't get the information that I want.
def read_cooments(xml):
    tree = lxml.etree.parse(xml)
    Comments = {}
    for comment in tree.xpath("//Boxes/Box"):
        get_id = comment.attrib['Id']
        Comments[get_id] = []
        for group in comment.xpath(".//Tag"):
            Comments[get_id].append(group.text)
    df_name1 = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in Comments.items()]))
Can anyone help me extract the comments from the XML file shown above? Any help is appreciated!
Use the code given below:
def read_comments(xml):
    tree = etree.parse(xml)
    rows = []
    for box in tree.xpath('//Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            continue
        rows.append([id, txtNode.text.strip()])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
Note that if you create a DataFrame within a function, it is a local
variable of that function and is not visible from outside.
A better and more readable approach (which I used) is for the function to return
this DataFrame.
The function also contains continue in two places, to guard against possible
error cases, when either a Box element does not contain a Tag child or
the Tag does not contain any Text child element.
I also noticed that there is no need to replace the &lt; or &gt; entities
with < or > in my own code, as lxml performs the unescaping on its own.
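That automatic unescaping is easy to verify in a couple of lines (the standard library parser behaves the same way as lxml here):

```python
import xml.etree.ElementTree as ET

# the file stores the embedded document escaped; the parser returns it unescaped
root = ET.fromstring('<Tag>&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;</Tag>')
print(root.text)                   # <Text>**EXAMPLE** </Text>
inner = ET.fromstring(root.text)   # the recovered string is itself parseable XML
print(inner.text.strip())          # **EXAMPLE**
```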
Edit
My test is as follows. Start from the imports:
import pandas as pd
from lxml import etree
I used a file containing:
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag>&lt;?xml version="1.0"?&gt;
&lt;PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"&gt;
&lt;Text&gt;**EXAMPLE** &lt;/Text&gt;
&lt;/PFDComment&gt;</Tag>
</Box>
</Boxes>
I called the above function:
df_name1 = read_comments('Boxes.xml')
and when I printed df_name1, I got:
id Comment
0 3 **EXAMPLE**
If something goes wrong, use the "extended" version of the above function,
with test printouts:
def read_comments(xml):
    tree = etree.parse(xml)
    rows = []
    for box in tree.xpath('//Box'):
        id = box.attrib['Id']
        tagTxt = box.findtext('Tag')
        if tagTxt is None:
            print('No Tag element')
            continue
        txtNode = etree.XML(tagTxt).find('Text')
        if txtNode is None:
            print('No Text element')
            continue
        txt = txtNode.text.strip()
        print(f'{id}: {txt}')
        rows.append([id, txt])
    return pd.DataFrame(rows, columns=['id', 'Comment'])
and take a look at printouts.
I have:
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    t = elem.tag
    idx = k = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t

events = ("start", "end")
title = None

for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)
    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'page':
            print(title, elem.text)
This seems to give the title just fine, but the page text always seems blank. What am I missing?
I haven't been able to open the file (it's huge), but I think this is an accurate snippet:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>https://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.29.0-wmf.12</generator>
<case>first-letter</case>
<namespaces>
...
</namespaces>
</siteinfo>
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>631144794</id>
<parentid>381202555</parentid>
<timestamp>2014-10-26T04:50:23Z</timestamp>
<contributor>
<username>Paine Ellsworth</username>
<id>9092818</id>
</contributor>
<comment>add [[WP:RCAT|rcat]]s</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{Redr|move|from CamelCase|up}}</text>
<sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
</revision>
</page>
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>766348469</id>
<parentid>766047928</parentid>
<timestamp>2017-02-19T18:08:07Z</timestamp>
<contributor>
<username>GreenC bot</username>
<id>27823944</id>
</contributor>
<minor />
<comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
...
</text>
</revision>
</page>
</mediawiki>
The best approach is to use the MWXML Python package, which is part of the Mediawiki Utilities (installable with pip3 install mwxml). MWXML is designed to solve this specific problem and is widely used. The software was created by research staff at the Wikimedia Foundation and is maintained by a set of researchers inside and outside of the foundation.
Here's a code example adapted from an example notebook distributed with the library that prints out page IDs, revision IDs, timestamp, and the length of the text:
import mwxml
import glob

paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')

def process_dump(dump, path):
    for page in dump:
        for revision in page:
            yield page.id, revision.id, revision.timestamp, len(revision.text)

for page_id, rev_id, rev_timestamp, rev_textlength in mwxml.map(process_dump, paths):
    print("\t".join(str(v) for v in [page_id, rev_id, rev_timestamp, rev_textlength]))
The full example from which this is adapted reports the number of added and removed image links within each revision. It is fully documented but includes only 25 lines of code.
The text attribute refers to the text between the element's tags (i.e. <tag>text</tag>) and not to the child elements. Thus, in the case of the title element one has:
<title>AccessibleComputing</title>
and the text between the tags is AccessibleComputing.
In the case of the page element, the only text defined is '\n ' and there are other child elements (see below), including the title element:
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
...
</page>
See more details on the w3schools page.
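The distinction is easy to check directly; a minimal sketch using the standard library:

```python
import xml.etree.ElementTree as ET

page = ET.fromstring('<page>\n    <title>Anarchism</title>\n</page>')
print(repr(page.text))           # '\n    ' -- only the whitespace before the first child
print(page.find('title').text)   # Anarchism
```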
If you want to parse the file, I would recommend using either the findall method:
from lxml import etree
from lxml.etree import tostring

tree = etree.parse('data/enwiki-20190620-pages-articles-multistream.xml')
root = tree.getroot()

# iterate through all the titles
for title in root.findall(".//title", namespaces=root.nsmap):
    print(tostring(title))
    print(title.text)
which generates this output:
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">AccessibleComputing</title>\n '
AccessibleComputing
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Anarchism</title>\n '
Anarchism
or the xpath method:
nsmap = root.nsmap
nsmap['x'] = root.nsmap[None]
nsmap.pop(None)

# iterate through all the pages
for page in root.xpath(".//x:page", namespaces=nsmap):
    print(page)
    print(repr(page.text))  # which prints '\n    '
    print('number of children: %i' % len(page))
and the output is:
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc610c8>
'\n '
number of children: 5
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc71bc8>
'\n '
number of children: 5
Please see lxml tutorial for more details.
You are trying to get the content of the text property of the <page> element, but that is just whitespace.
To get the text of the <text> element, just change
elif tname == 'page':
to
elif tname == 'text':
For XML parsing I use the untangle package from PyPI, which presents a complete document view. Then you have:
import untangle

doc = untangle.parse('data/enwiki-20190620-pages-articles-multistream.xml')
for page in doc.mediawiki.page:
    print(page.title.cdata)
    for text in page.revision.text:
        print(text.cdata)
To get the Wikipedia article, you need to access the content of the text property of the <text> element, and not the <page> element.
Here is the corrected version of your code:
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t

events = ("start", "end")
title = None

for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)
    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'text':
            print(title, elem.text)
        elem.clear()
Since the Wikipedia dump is quite large, don't forget the elem.clear() at the end of the for loop.
As mentioned in mzjn's answer, the content of the text property of the <page> element is just whitespace.
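One caveat worth adding for very large dumps: elem.clear() empties each element, but the cleared elements themselves still accumulate as children of the root that iterparse builds. Grabbing the root from the first event and clearing it after each finished page keeps memory flat. A sketch, with an inline sample standing in for the real dump:

```python
import io
import xml.etree.ElementTree as ET

sample = ('<mediawiki>'
          '<page><title>A</title><revision><text>a-text</text></revision></page>'
          '<page><title>B</title><revision><text>b-text</text></revision></page>'
          '</mediawiki>')

texts = []
context = ET.iterparse(io.StringIO(sample), events=('start', 'end'))
event, root = next(context)            # first event is the start of the root element
for event, elem in context:
    if event == 'end' and elem.tag == 'text':
        texts.append(elem.text)
    elif event == 'end' and elem.tag == 'page':
        root.clear()                   # drop the finished <page> subtree entirely
print(texts)                           # ['a-text', 'b-text']
```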
I'm trying to build a script to read an XML file.
This is my first time parsing XML, and I'm doing it in Python with xml.etree.ElementTree. The section of the file that I would like to process looks like this:
<component>
<section>
<id root="42CB916B-BB58-44A0-B8D2-89B4B27F04DF" />
<code code="34089-3" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="DESCRIPTION SECTION" />
<title mediaType="text/x-hl7-title+xml">DESCRIPTION</title>
<text>
<paragraph>Renese<sup>®</sup> is designated generically as polythiazide, and chemically as 2<content styleCode="italics">H</content>-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.</paragraph>
<paragraph>Inert Ingredients: dibasic calcium phosphate; lactose; magnesium stearate; polyethylene glycol; sodium lauryl sulfate; starch; vanillin. The 2 mg tablets also contain: Yellow 6; Yellow 10.</paragraph>
</text>
<effectiveTime value="20051214" />
</section>
</component>
<component>
<section>
<id root="CF5D392D-F637-417C-810A-7F0B3773264F" />
<code code="42229-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="SPL UNCLASSIFIED SECTION" />
<title mediaType="text/x-hl7-title+xml">ACTION</title>
<text>
<paragraph>The mechanism of action results in an interference with the renal tubular mechanism of electrolyte reabsorption. At maximal therapeutic dosage all thiazides are approximately equal in their diuretic potency. The mechanism whereby thiazides function in the control of hypertension is unknown.</paragraph>
</text>
<effectiveTime value="20051214" />
</section>
</component>
The full file can be downloaded from:
https://dailymed.nlm.nih.gov/dailymed/getFile.cfm?setid=abd6ecf0-dc8e-41de-89f2-1e36ed9d6535&type=zip&name=Renese
Here is my code:
import xml.etree.ElementTree as ElementTree
import re

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

# Remove the default namespace definition (xmlns="http://some/namespace")
xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)
tree = ElementTree.fromstring(xmlstring)

for title in tree.iter('title'):
    print(title.text)
So far I'm able to print the titles, but I would also like to print the corresponding text that is captured in the <paragraph> tags.
I have tried this:
for title in tree.iter('title'):
    print(title.text)
    for paragraph in title.iter('paragraph'):
        print(paragraph.text)
But I get no output from paragraph.text.
Doing
for title in tree.iter('title'):
    print(title.text)
    for paragraph in tree.iter('paragraph'):
        print(paragraph.text)
the text of the paragraphs is printed, but (obviously) it is printed in full for each title found in the XML structure.
I would like to find a way to 1) identify the title; 2) print the corresponding paragraph(s).
How can I do it?
If you are willing to use lxml, then the following is a solution which uses XPath:
import re
from lxml.etree import fromstring

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)
doc = fromstring(xmlstring.encode())  # lxml only accepts bytes input, hence the encode

for title in doc.xpath('//title'):  # for all title nodes
    title_text = title.xpath('./text()')  # get the text value of the node
    # get all text values of the paragraph nodes that appear lower (//paragraph)
    # in the hierarchy than the parent (..) of <title>
    paragraphs_for_title = title.xpath('..//paragraph/text()')
    print(title_text[0] if title_text else '')
    for paragraph in paragraphs_for_title:
        print(paragraph)
I am parsing big XMLs (~500 MB) with the help of the lxml library in Python. I had used BeautifulSoup with the lxml-xml parser for small files, but for huge XMLs it was inefficient, as it reads the whole file before parsing it.
I need to parse the XML to get root-to-leaf paths (except for the outermost tag).
E.g. this XML:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE A>
<A>
<B>
<C>
abc
</C>
<D>
abd
</D>
</B>
</A>
should give the following keys and values as output (root-to-leaf paths):
A.B.C = abc
A.B.D = abd
Here's the code that I've written to parse it
(ignore1 and ignore2 are tags that need to be ignored, and tu.clean_text() is a function which removes unnecessary characters):
def fast_parser(filename, keys, values, ignore1, ignore2):
    context = etree.iterparse(filename, events=('start', 'end',))
    path = list()
    i = 0
    lastevent = ""
    for event, elem in context:
        i += 1
        tag = elem.tag if "}" not in elem.tag else elem.tag.split('}', 1)[1]
        if tag == ignore1 or tag == ignore2:
            pass
        elif event == "start":
            path.append(tag)
        elif event == "end":
            if lastevent == "start":
                keys.append(".".join(path))
                values.append(tu.clean_text(elem.text))
            # free memory
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
            if len(path) > 0:
                path.pop()
        lastevent = event
    del context
    return keys, values
I have already referred to the following article for parsing a large file: ibm.com/developerworks/xml/library/x-hiperfparse/#listing4
Here's a screenshot of the top command. Memory usage goes beyond 2 GB for a ~500 MB XML file, so I suspect that memory is not getting freed.
I have already gone through a few StackOverflow questions, but they didn't help. Please advise.
I took the code from https://stackoverflow.com/a/7171543/131187, chopped out comments and print statements, and added a suitable func to get this. I wouldn't like to guess how much time it would take to process a 500 MB file!
Even in writing func I have done nothing original, having adopted the original author's use of the xpath expression 'ancestor-or-self::*' to provide the absolute path that you want.
However, since this code conforms more closely to the original scripts, it might not leak memory.
import lxml.etree as ET

input_xml = 'temp.xml'
for line in open(input_xml).readlines():
    print(line[:-1])

def mod_fast_iter(context, func, *args, **kwargs):
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def func(elem):
    content = '' if not elem.text else elem.text.strip()
    if content:
        ancestors = elem.xpath('ancestor-or-self::*')
        print('%s=%s' % ('.'.join([_.tag for _ in ancestors]), content))

print('\nResult:\n')
context = ET.iterparse(open(input_xml, 'rb'), events=('end',))
mod_fast_iter(context, func)
Output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE A>
<A>
<B>
<C>
abc
</C>
<D>
abd
</D>
</B>
</A>
Result:
A.B.C=abc
A.B.D=abd
I am trying to use ElementTree's iterparse function to filter nodes based on their text and write them to a new file. I am using iterparse because the input file is large (100+ MB).
input.xml
<xmllist>
<page id="1">
<title>movie title 1</title>
<text>this is a movie in theatres</text>
</page>
<page id="2">
<title>movie title 2</title>
<text>this is a horror film</text>
</page>
<page id="3">
<title></title>
<text>actor in film</text>
</page>
<page id="4">
<title>some other topic</title>
<text>nothing related</text>
</page>
</xmllist>
Expected output (all pages where the text has "movie" or "film" in them)
<xmllist>
<page id="1">
<title>movie title 1</title>
<text>this is a movie in theatres</text>
</page>
<page id="2">
<title>movie title 2</title>
<text>this is a horror film</text>
</page>
<page id="3">
<title></title>
<text>actor in film</text>
</page>
</xmllist>
Current code
import xml.etree.cElementTree as etree
from xml.etree.cElementTree import dump

output_file = open('/tmp/outfile.xml', 'w')

for event, elem in iter(etree.iterparse("/tmp/test.xml", events=('start', 'end'))):
    if event == "end" and elem.tag == "page":  # need to add condition to search for strings
        output_file.write(elem)
        elem.clear()
How do I add the regular expression to filter based on page's text attribute?
You're looking for a child, not an attribute, so it's simplest to analyze the title as it "passes by" in the iteration and remember the result until you reach the end of the corresponding page:
import re

good_page = False
for event, elem in etree.iterparse("/tmp/test.xml", events=('start', 'end')):
    if event == 'end':
        if elem.tag == 'title':
            good_page = elem.text and re.search(r'film|movie', elem.text)
        elif elem.tag == 'page':
            if good_page:
                output_file.write(etree.tostring(elem, encoding='unicode'))
            good_page = False
            elem.clear()
re.search returns None if the pattern is not found, and the if treats that as false, so we avoid writing pages without a title text as well as pages whose title text does not match your desired RE.
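If, as the expected output suggests, the filter should really look at the <text> child rather than the title (page 3 has an empty title but matching text), the same remember-until-page-end pattern works. The snippet below is a self-contained sketch with the input inlined in place of the real files:

```python
import io
import re
import xml.etree.ElementTree as ET

xml_data = """<xmllist>
<page id="1"><title>movie title 1</title><text>this is a movie in theatres</text></page>
<page id="3"><title></title><text>actor in film</text></page>
<page id="4"><title>some other topic</title><text>nothing related</text></page>
</xmllist>"""

kept = []
good_page = False
for event, elem in ET.iterparse(io.StringIO(xml_data), events=('end',)):
    if elem.tag == 'text':
        # remember whether this page's <text> matched until the page closes
        good_page = bool(elem.text and re.search(r'film|movie', elem.text))
    elif elem.tag == 'page':
        if good_page:
            kept.append(ET.tostring(elem, encoding='unicode'))
        good_page = False
        elem.clear()

print(''.join(kept))
```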