How can I parse a Wikipedia XML dump with Python?

I have:
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    t = elem.tag
    idx = k = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t

events = ("start", "end")
title = None
for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)
    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'page':
            print(title, elem.text)
This seems to give the title just fine, but the page text always seems blank. What am I missing?
I haven't been able to open the file (it's huge), but I think this is an accurate snippet:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.0-wmf.12</generator>
    <case>first-letter</case>
    <namespaces>
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{Redr|move|from CamelCase|up}}</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>766348469</id>
      <parentid>766047928</parentid>
      <timestamp>2017-02-19T18:08:07Z</timestamp>
      <contributor>
        <username>GreenC bot</username>
        <id>27823944</id>
      </contributor>
      <minor />
      <comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
</mediawiki>

The best approach is to use the MWXML Python package, which is part of the MediaWiki Utilities (installable with pip3 install mwxml). MWXML is designed to solve this specific problem and is widely used. The software was created by research staff at the Wikimedia Foundation and is maintained by researchers inside and outside the foundation.
Here's a code example adapted from an example notebook distributed with the library that prints out page IDs, revision IDs, timestamp, and the length of the text:
import mwxml
import glob

paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')

def process_dump(dump, path):
    for page in dump:
        for revision in page:
            yield page.id, revision.id, revision.timestamp, len(revision.text)

for page_id, rev_id, rev_timestamp, rev_textlength in mwxml.map(process_dump, paths):
    print("\t".join(str(v) for v in [page_id, rev_id, rev_timestamp, rev_textlength]))
The full example from which this is adapted reports the number of added and removed image links within each revision. It is fully documented but includes only 25 lines of code.
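If you have just the single uncompressed dump file from the question rather than a directory of compressed chunks, you can also skip mwxml.map() and iterate over one dump directly. A minimal sketch, assuming the file path from the question and that revision.text may be None:
import mwxml

# Sketch: iterate over a single local dump file instead of mapping over many paths.
dump = mwxml.Dump.from_file(open('data/enwiki-20190620-pages-articles-multistream.xml'))

for page in dump:
    for revision in page:
        # revision.text holds the raw wikitext of the revision (it may be None)
        print(page.title, len(revision.text or ""))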

The text attribute refers to the text between the element's tags (i.e. <tag>text</tag>), not to the child elements. Thus, in the case of the title element one has:
<title>AccessibleComputing</title>
and the text between the tags is AccessibleComputing.
In the case of the page element, the only text defined is '\n ', and the actual content lives in other child elements (see below), including the title element:
<page>
  <title>Anarchism</title>
  <ns>0</ns>
  <id>12</id>
  ...
</page>
See the w3schools page for more details.
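A minimal illustration of this point with ElementTree, using a toy snippet instead of the real dump:
import xml.etree.ElementTree as ET

page = ET.fromstring("<page>\n    <title>AccessibleComputing</title>\n</page>")

print(repr(page.text))           # '\n    ' -- only the whitespace before the first child
print(page.find('title').text)   # AccessibleComputing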
If you want to parse the file, I would recommend using either the findall method:
from lxml import etree
from lxml.etree import tostring

tree = etree.parse('data/enwiki-20190620-pages-articles-multistream.xml')
root = tree.getroot()

# iterate through all the titles
for title in root.findall(".//title", namespaces=root.nsmap):
    print(tostring(title))
    print(title.text)
which generates this output:
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">AccessibleComputing</title>\n '
AccessibleComputing
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Anarchism</title>\n '
Anarchism
or the xpath method:
nsmap = root.nsmap
nsmap['x'] = root.nsmap[None]
nsmap.pop(None)

# iterate through all the pages (xpath does not accept an empty prefix,
# so the default namespace is remapped to the prefix x)
for page in root.xpath(".//x:page", namespaces=nsmap):
    print(page)
    print(repr(page.text))  # which prints '\n '
    print('number of children: %i' % len(page.getchildren()))
and the output is:
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc610c8>
'\n '
number of children: 5
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc71bc8>
'\n '
number of children: 5
Please see the lxml tutorial for more details.

You are trying to get the content of the text property of the <page> element, but that is just whitespace.
To get the text of the <text> element, just change
elif tname == 'page':
to
elif tname == 'text':

For XML parsing I use the untangle package from PyPI, which presents a complete document view. Then you have:
import untangle

doc = untangle.parse('data/enwiki-20190620-pages-articles-multistream.xml')
for page in doc.mediawiki.page:
    print(page.title.cdata)
    for text in page.revision.text:
        print(text.cdata)

To get the Wikipedia article, you need to access the content of the text property of the <text> element, and not the <page> element.
Here is the corrected version of your code:
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t

events = ("start", "end")
title = None
for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)
    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'text':
            print(title, elem.text)
    elem.clear()
Since the Wikipedia dump is quite large, don't forget the elem.clear() at the end of the for loop.
As mentioned in mzjn's answer, the text property of the <page> element contains just whitespace.

Related

ElementTree does not seem to get some texts/elements in the tree

I have a Wikipedia dump that I want to parse, and I am having some difficulties/mysterious problems while using the Python XML parser ElementTree.
My most recent problem is that ElementTree does not seem to find text that is actually there. This is an example of the data:
<page>
  <title>Cengiz Han</title>
  <ns>0</ns>
  <id>10</id>
  <revision>
    <id>20337884</id>
    <parentid>20218916</parentid>
    <timestamp>2019-01-29T14:02:43Z</timestamp>
    <contributor>
      <username>CommonsDelinker</username>
      <id>31545</id>
    </contributor>
    <comment>China_11b.jpg dosyası Map_of_China_1142.jpg ile değiştirildi</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">
      ...some long Genghis Khan stuff...
    </text>
</page>
Now when I parse it with this:
for event, elem in et.iterparse('dataset/wiki_test', events=('start', 'end', 'start-ns', 'end-ns')):
    if event == 'start':
        if elem.tag == 'page':
            if len(list(elem)) == 0:
                continue
            title = elem.find('title').text
            if title == None or 'MediaWiki' in title:
                elem.clear()
                continue
            wiki_id = elem.find('id')
            if wiki_id == None:
                elem.clear()
                continue
            wiki_id = wiki_id.text
            revision = elem.find('revision')
            if revision != None:
                print(list(revision))
                text = revision.find('text').text
                print(text)
                if text != None:
                    count += 1
                    titles += title + '\n'
                    page = {'wiki_id': wiki_id, 'title': title, 'text': text.text}
                    pages += json.dumps(page, ensure_ascii=False) + '\n'
            elem.clear()
The revision.find('text').text line seems to find no text for some elements, including the one above, and those make up about one seventh of my data, which is annoying. The same happened with page->id for some other entries, where it claimed the element did not exist at all. I worked around that by ignoring those entries, but I don't really want to do that, and the error does not make sense to me at all.
Here is another page, which works totally fine.
<page>
<title>Mustafa Suphi</title>
<ns>0</ns>
<id>22</id>
<revision>
<id>20077185</id>
<parentid>20017115</parentid>
<timestamp>2018-10-14T08:31:32Z</timestamp>
<contributor>
<username>Vikiçizer</username>
<id>90501</id>
</contributor>
<comment>/* top */düzeltme [[Vikipedi:AWB|AWB]] ile</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
...some Mustafa Suphi stuff...
</text>
<sha1>m5finh6h2kr8h2fbtmsatp5fhz1siwq</sha1>
</revision>
</page>
What am I doing wrong?
You have posted two examples: "working" and "not working".
In the "not working" one there is no
</revision>
tag. Are you sure this is the XML you actually have, or is it just a copy & paste mistake?

Print nested element of xml with python etree

I'm trying to build a script to read an XML file.
This is my first time parsing XML and I'm doing it in Python with xml.etree.ElementTree. The section of the file that I would like to process looks like:
<component>
  <section>
    <id root="42CB916B-BB58-44A0-B8D2-89B4B27F04DF" />
    <code code="34089-3" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="DESCRIPTION SECTION" />
    <title mediaType="text/x-hl7-title+xml">DESCRIPTION</title>
    <text>
      <paragraph>Renese<sup>®</sup> is designated generically as polythiazide, and chemically as 2<content styleCode="italics">H</content>-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.</paragraph>
      <paragraph>Inert Ingredients: dibasic calcium phosphate; lactose; magnesium stearate; polyethylene glycol; sodium lauryl sulfate; starch; vanillin. The 2 mg tablets also contain: Yellow 6; Yellow 10.</paragraph>
    </text>
    <effectiveTime value="20051214" />
  </section>
</component>
<component>
  <section>
    <id root="CF5D392D-F637-417C-810A-7F0B3773264F" />
    <code code="42229-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="SPL UNCLASSIFIED SECTION" />
    <title mediaType="text/x-hl7-title+xml">ACTION</title>
    <text>
      <paragraph>The mechanism of action results in an interference with the renal tubular mechanism of electrolyte reabsorption. At maximal therapeutic dosage all thiazides are approximately equal in their diuretic potency. The mechanism whereby thiazides function in the control of hypertension is unknown.</paragraph>
    </text>
    <effectiveTime value="20051214" />
  </section>
</component>
The full file can be downloaded from:
https://dailymed.nlm.nih.gov/dailymed/getFile.cfm?setid=abd6ecf0-dc8e-41de-89f2-1e36ed9d6535&type=zip&name=Renese
Here is my code:
import xml.etree.ElementTree as ElementTree
import re

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

# Remove the default namespace definition (xmlns="http://some/namespace")
xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)
tree = ElementTree.fromstring(xmlstring)

for title in tree.iter('title'):
    print(title.text)
So far I'm able to print the titles, but I would also like to print the corresponding text captured in the <paragraph> tags.
I have tried this:
for title in tree.iter('title'):
    print(title.text)
    for paragraph in title.iter('paragraph'):
        print(paragraph.text)
But I get no output from paragraph.text.
Doing
for title in tree.iter('title'):
    print(title.text)
    for paragraph in tree.iter('paragraph'):
        print(paragraph.text)
I can print the text of the paragraphs, but (obviously) it is printed in full for each title found in the XML structure.
I would like to find a way to 1) identify the title and 2) print the corresponding paragraph(s).
How can I do it?
If you are willing to use lxml, then the following is a solution which uses XPath:
import re
from lxml.etree import fromstring

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)
doc = fromstring(xmlstring.encode())  # lxml only accepts bytes input, hence the encode

for title in doc.xpath('//title'):  # for all title nodes
    title_text = title.xpath('./text()')  # get the text value of the node
    # get all text values of the paragraph nodes that appear lower (//paragraph)
    # in the hierarchy than the parent (..) of <title>
    paragraphs_for_title = title.xpath('..//paragraph/text()')
    print(title_text[0] if title_text else '')
    for paragraph in paragraphs_for_title:
        print(paragraph)
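If you would rather stay with the standard library, here is a sketch along the same lines that reuses the namespace-stripping from the question and groups paragraphs by their enclosing <section>; itertext() is used so that text nested in tags such as <sup> and <content> is not lost:
import re
import xml.etree.ElementTree as ElementTree

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = re.sub(r'\sxmlns="[^"]+"', '', f.read(), count=1)

root = ElementTree.fromstring(xmlstring)
for section in root.iter('section'):
    title = section.find('title')
    print(title.text if title is not None else '')
    for paragraph in section.iter('paragraph'):
        # itertext() joins the text of the paragraph and of its nested elements
        print(''.join(paragraph.itertext()))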

Search xml for text and return element/node

I'd like to be able to search an XML-formatted file by a text value and return the id it belongs to. I've looked through the Python library's XML documentation but only saw examples for searching by elements/nodes. I have a simplified XML sample below; I'd like to search for "3x3 Eyes", for example, and get back "2". The search should match the exact text, ignoring case. There will normally be multiple title entries under each anime, so the search can stop at the first match. Thanks
<?xml version="1.0" encoding="UTF-8"?>
<animetitles>
  <anime aid="1">
    <title type="official" xml:lang="fr">Crest of the Stars</title>
    <title type="official" xml:lang="fr">Crest of the Stars</title>
  </anime>
  <anime aid="2">
    <title type="official" xml:lang="en">3x3 Eyes</title>
  </anime>
  <anime aid="3">
    <title type="official" xml:lang="en">3x3 Eyes: Legend of the Divine Demon</title>
  </anime>
</animetitles>
tree = et.parse( ... )

# Unique match
results = []
for anime in tree.findall('anime'):
    for title in anime.findall('title'):
        if title.text == '3x3 Eyes':
            results.append(anime.get('aid'))
print results

# Everything that starts with
results = []
for anime in tree.findall('anime'):
    for title in anime.findall('title'):
        if title.text.startswith('3x3 Eyes'):
            results.append(anime.get('aid'))
print results
First one returns [2], the second one [2, 3].
Or a little bit more cryptic but, hey, why not :)
results = [anime.get('aid') for anime in tree.findall('anime')
           for title in anime.findall('title') if title.text == '3x3 Eyes']
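Since the question also asks for a case-insensitive exact match that stops at the first hit, here is a small variation (a sketch; the sample is assumed to be saved as animetitles.xml):
import xml.etree.ElementTree as ET

def find_aid(tree, wanted):
    """Return the aid of the first <anime> whose <title> equals wanted, ignoring case."""
    wanted = wanted.lower()
    for anime in tree.findall('anime'):
        for title in anime.findall('title'):
            if title.text is not None and title.text.lower() == wanted:
                return anime.get('aid')
    return None

tree = ET.parse('animetitles.xml')   # hypothetical filename for the sample above
print(find_aid(tree, '3x3 eyes'))    # prints 2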
You can use ElementTree for your purpose.
import xml.etree.ElementTree as ET

tree = ET.parse('a.xml')
root = tree.getroot()

def findParentAttrib(string):
    for neighbor in root.iter():
        for parent in neighbor.getiterator():
            for child in parent:
                if child.text == string:
                    return parent.attrib['aid']

print findParentAttrib("3x3 Eyes")  # returns 2

Getting subelements using lxml and iterparse

I am trying to write a parsing algorithm to efficiently pull data from an XML document. I am currently rolling through the document based on elements and children, but would like to use iterparse instead. One issue is that I have a list of elements from which, when found, I want to pull the child data, but it seems that with iterparse my options are either to filter on a single element name or to get every single element.
Example xml:
<?xml version="1.0" encoding="UTF-8"?>
<data_object xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <source id="0">
    <name>Office Issues</name>
    <datetime>2012-01-13T16:09:15</datetime>
    <data_id>7</data_id>
  </source>
  <event id="125">
    <date>2012-11-06</date>
    <state_id>7</state_id>
  </event>
  <state id="7">
    <name>Washington</name>
  </state>
  <locality id="2">
    <name>Olympia</name>
    <state_id>7</state_id>
    <type>City</type>
  </locality>
  <locality id="3">
    <name>Town</name>
    <state_id>7</state_id>
    <type>Town</type>
  </locality>
</data_object>
Code example:
from lxml import etree

fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]

with open(fname) as xml_doc:
    context = etree.iterparse(xml_doc, events=("start", "end"))
    context = iter(context)
    event, root = context.next()
    base = False
    b_name = ""
    for event, elem in context:
        if event == "start" and elem.tag in ELEMENT_LIST:
            base = True
            bname = elem.tag
            children = elem.getchildren()
            child_list = []
            for child in children:
                child_list.append(child.tag)
            print bname + ":" + str(child_list)
        elif event == "end" and elem.tag in ELEMENT_LIST:
            base = False
            root.clear()
With iterparse you cannot limit parsing to a set of tags; you can only restrict it to a single tag (by passing the tag argument). However, it is easy to do manually what you would like to achieve, as in the following snippet:
from lxml import etree

fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]

with open(fname) as xml_doc:
    context = etree.iterparse(xml_doc, events=("start", "end"))
    for event, elem in context:
        if event == "start" and elem.tag in ELEMENT_LIST:
            print "this elem is interesting, do some processing: %s: [%s]" % (elem.tag, ", ".join(child.tag for child in elem))
        elem.clear()
you limit your search to the interesting tags only. An important part of using iterparse is elem.clear(), which frees memory once an element is no longer needed. That is why it is memory efficient; see http://lxml.de/parsing.html#modifying-the-tree
I would use XPath instead. It's much more elegant than walking the document on your own, and I assume more efficient as well.
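For example, a sketch of that XPath approach against the test.xml from the question (the document declares only the xsi prefix, so no namespace map is needed) could look like this:
from lxml import etree

ELEMENT_LIST = ["source", "event", "state", "locality"]

tree = etree.parse("test.xml")
# Build one XPath union expression that selects all interesting top-level elements.
expr = " | ".join("/data_object/%s" % tag for tag in ELEMENT_LIST)

for elem in tree.xpath(expr):
    print("%s: [%s]" % (elem.tag, ", ".join(child.tag for child in elem)))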
Use tag='{http://www.sitemaps.org/schemas/sitemap/0.9}url'.
A similar question with the right answer: https://stackoverflow.com/a/7019273/1346222
#!/usr/bin/python
# coding: utf-8
""" Parsing xml file. Basic example """

from StringIO import StringIO
from lxml import etree
import urllib2

sitemap = urllib2.urlopen(
    'http://google.com/sitemap.xml',
    timeout=10
).read()

NS = {
    'x': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'x2': 'http://www.google.com/schemas/sitemap-mobile/1.0'
}

res = []
urls = etree.iterparse(StringIO(sitemap), tag='{http://www.sitemaps.org/schemas/sitemap/0.9}url')
for event, url in urls:
    t = []
    t = url.xpath('.//x:loc/text() | .//x:priority/text()', namespaces=NS)
    t.append(url.xpath('boolean(.//x2:mobile)', namespaces=NS))
    res.append(t)

Parsing XML file etree module

I'm reading an XML file using the ElementTree module. I'm using the following code to print the values of the <page> and <title> tags. My code works fine, but I want a small change: if the id attribute of <page id='...'> exists, then print the value of the tag. Is it possible? Thanks
import xml.etree.cElementTree as etree
from pprint import pprint

tree = etree.parse('find_title.xml')
for value in tree.getiterator(tag='title'):
    print value.text
for value in tree.getiterator(tag='page'):
    pprint(value.attrib)
Here is my XML file:
<mediawiki>
  <siteinfo>
    <sitename>Wiki</sitename>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
    </namespaces>
  </siteinfo>
  <page id="31239628" orglength="6822" newlength="4524" stub="0" categories="0" outlinks="1" urls="10">
    <title>Title</title>
    <categories></categories>
    <links>15099779</links>
    <urls>
    </urls>
    <text>
      Books
    </text>
  </page>
</mediawiki>
for el in tree.getiterator(tag='page'):
    page_id = el.get('id', None)  # returns the second argument if id does not exist
    if page_id:
        print page_id, el.find('title').text
    else:
        pprint(el.attrib)
Edit: Updated for the comment: "Thanks can i print page_id and title at same time? Means 31239628 - Title"
The element.get() method is used to retrieve optional attribute values from a tag:
>>> page_id = tree.find('page').get('id')
>>> if page_id:
...     print page_id
...
31239628
