Parse Grobid .tei.xml output with Beautiful Soup - python

I am trying to use Beautiful Soup to extract elements from a .tei.xml file that was generated using Grobid.
I can get title(s) using:
titles = soup.findAll('title')
What is the correct syntax to access the 'lower level' elements? (Author / Affiliation etc)
This is a portion of the tei.xml file that is the Grobid output:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 /data/grobid-0.5.1/grobid-home/schemas/xsd/Grobid.xsd"
xmlns:xlink="http://www.w3.org/1999/xlink">
<teiHeader xml:lang="en">
<encodingDesc>
<appInfo>
<application version="0.5.1-SNAPSHOT" ident="GROBID" when="2018-08-15T14:51+0000">
<ref target="https://github.com/kermitt2/grobid">GROBID - A machine learning software for extracting information from scholarly documents</ref>
</application>
</appInfo>
</encodingDesc>
<fileDesc>
<titleStmt>
<title level="a" type="main">The Role of Artificial Intelligence in Software Engineering</title>
</titleStmt>
<publicationStmt>
<publisher/>
<availability status="unknown"><licence/></availability>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">Mark</forename><surname>Harman</surname></persName>
<affiliation key="aff0">
<orgName type="department">CREST Centre</orgName>
<orgName type="institution">University College London</orgName>
<address>
<addrLine>Malet Place</addrLine>
<postCode>WC1E 6BT</postCode>
<settlement>London</settlement>
<country key="GB">UK</country>
</address>
</affiliation>
</author>
<title level="a" type="main">The Role of Artificial Intelligence in Software Engineering</title>
</analytic>
<monogr>
<imprint>
<date/>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
Thanks.

If the file is parsed with an HTML parser (for example BeautifulSoup(f, 'lxml')), BeautifulSoup lowercases the tag names and wraps the document in html and body tags. Here are some examples:
title = soup.html.body.teiheader.filedesc.analytic.title.string
for author in soup.html.body.teiheader.filedesc.sourcedesc.find_all('author'):
    tag_or_none = author.persname.forename
    first_affiliation = author.affiliation
Also see the BeautifulSoup documentation which covers everything.
I'm working on a similar problem now and looking for collaboration. Let me know if you want to team up -- sof#nconnor.com
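If you would rather keep the original tag case and avoid a third-party dependency, the same lower-level elements can be reached with the standard library's ElementTree instead of Beautiful Soup. This is only a sketch under assumptions: the SAMPLE string is a trimmed copy of the Grobid output from the question, and extract_authors and TEI_NS are illustrative names, not part of Grobid or any library.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI; the "tei" prefix is an arbitrary local choice.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Trimmed copy of the Grobid output shown in the question.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader xml:lang="en">
<fileDesc>
<sourceDesc><biblStruct><analytic>
<author>
<persName><forename type="first">Mark</forename><surname>Harman</surname></persName>
<affiliation key="aff0">
<orgName type="department">CREST Centre</orgName>
<orgName type="institution">University College London</orgName>
</affiliation>
</author>
<title level="a" type="main">The Role of Artificial Intelligence in Software Engineering</title>
</analytic></biblStruct></sourceDesc>
</fileDesc>
</teiHeader>
</TEI>"""


def extract_authors(tei_text):
    """Return one dict per <author> found under <analytic>."""
    root = ET.fromstring(tei_text)
    authors = []
    for author in root.iterfind(".//tei:analytic/tei:author", TEI_NS):
        authors.append({
            "forename": author.findtext(
                "tei:persName/tei:forename", default="", namespaces=TEI_NS),
            "surname": author.findtext(
                "tei:persName/tei:surname", default="", namespaces=TEI_NS),
            "institution": author.findtext(
                "tei:affiliation/tei:orgName[@type='institution']",
                default="", namespaces=TEI_NS),
        })
    return authors


authors = extract_authors(SAMPLE)
print(authors)
```

Every TEI element lives in the http://www.tei-c.org/ns/1.0 namespace, so each step of the path needs the prefix; a bare 'author' would match nothing.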

Related

how to read a xml file python? [duplicate]

I have a big XML file with several article nodes. I have included only one with the problem. I try to parse it in Python to filter some data and I get the error
File "<string>", line unknown
ParseError: undefined entity &Ouml;: line 90, column 17
Sample of the XML file
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
</dblp>
From my search in Google, I found that this kind of error appears if you have issues in the node names. However, the line with the error is the second author, in the text.
This is my Python code
import xml.etree.ElementTree as etree
with open('xaa.xml', 'r') as xml_file:
    xml_tree = etree.parse(xml_file)
The declaration of the Ouml entity is presumably in the DTD (dblp.dtd), but ElementTree does not support external DTDs. ElementTree only recognizes entities declared directly in the XML file (in the "internal subset"). This is a working example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
<!ENTITY Ouml 'Ö'>
]>
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
</dblp>
To parse the XML file in the question without errors, you need a more powerful XML library that supports external DTDs. lxml is a good choice for that.
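If installing lxml is not an option and you only need a handful of known entities, a stopgap with plain ElementTree is to register them on the parser yourself. This is a different technique from loading the DTD: the entity table below is filled in by hand, and the names XML_TEXT and parse_with_entities are illustrative. It relies on the document keeping its DOCTYPE line, as the file in the question does.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the file from the question; the DOCTYPE still points
# at the external DTD, which ElementTree will not load.
XML_TEXT = """<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
</article>
</dblp>"""


def parse_with_entities(xml_text, entities):
    """Parse xml_text, resolving the given entity names by hand."""
    parser = ET.XMLParser()
    # XMLParser.entity maps entity names to replacement text.
    parser.entity.update(entities)
    return ET.fromstring(xml_text, parser=parser)


root = parse_with_entities(XML_TEXT, {"Ouml": "\u00d6"})
author_names = [author.text for author in root.iter("author")]
print(author_names)
```

Any entity that is missing from the dictionary still raises the same "undefined entity" error, so this only scales to files whose entities you can list in advance.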

What is the best way to use Xpath for processing larger XML files?

I have to search a large XML file (4 GB) for values. Basically I have to write around 30 different XPath expressions and store the results in a list. When I try to parse the XML, I get a memory error. I have tried lxml and ElementTree with start and end events, but still no luck: the processing time is too high and my PyCharm/Jupyter notebook throws a memory error.
Is there a better way to do it? The implementation is not restricted to any programming language, but I prefer Python because it is the one I know best. Thanks in advance.
Example of a search: to get the value of year where category is cooking, I use /bookstore/book[@category='cooking']/year, which gives 2005.
Similarly, I have to find the values of other tags based on my business requirements. My real XML is not as simple as the example below.
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
</bookstore>
With the example given, the problem can be readily handled using streaming in XSLT 3.0:
<xsl:mode streamable="yes"/>
<xsl:template match="book[@category='cooking']">
  <xsl:value-of select="year"/>
</xsl:template>
I know the example is a simplification, but we can only judge whether the real data/query is streamable by seeing the real data/query. The devil is in the detail.
I don't know if this will meet your needs. Try it first. If there is a problem, we can continue to communicate.
from simplified_scrapy import SimplifiedDoc,req,utils
def getBook(lines,category,year):
    html = "".join(lines)
    book = SimplifiedDoc(html).book
    if book.category==category and book.year.text==year:
        return book.category,book.title.text,book.authors.text,book.year.text,book.price.text
lst = []
with open('bookstore.xml', 'r') as file: # bookstore.xml is your xml file path
    lines = []
    flag = False
    for line in file:
        if flag or line.find('<book ')>=0:
            flag = True
            lines.append(line)
            if line.find('</book>')>=0:
                b = getBook(lines,'web','2003')
                if b:
                    lst.append(b)
                    break # If you only want the first one, add a break
                flag = False
                lines = []
print (lst)
Result:
[('web', 'XQuery Kick Start', ['James McGovern', 'Per Bothner', 'Kurt Cagle', 'James Linn', 'Vaidyanathan Nagarajan'], '2003', '49.99')]
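Since the question mentions trying start and end events, here is a minimal sketch of how that approach is usually kept within memory using only the standard library: each finished book is inspected and then discarded before the next one is read. The function name find_years and the in-memory SAMPLE are assumptions for illustration; for the real file you would pass its path instead of a BytesIO object, and a more complex document may need different clearing logic.

```python
import io
import xml.etree.ElementTree as ET


def find_years(source, category):
    """Stream <book> elements and collect <year> for one category."""
    years = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "book":
            if elem.get("category") == category:
                years.append(elem.findtext("year"))
            # Drop the finished <book> elements so the tree does not grow.
            root.clear()
    return years


# Small stand-in for the 4 GB file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking"><title lang="en">Everyday Italian</title><year>2005</year></book>
<book category="web"><title lang="en">XQuery Kick Start</title><year>2003</year></book>
</bookstore>"""

years = find_years(io.BytesIO(SAMPLE), "cooking")
print(years)
```

Without the root.clear() call, iterparse still builds the whole tree in the background, which is the usual reason the event-based approach runs out of memory anyway.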

Blank XML Namespace processing With Python

I am trying to parse an XML file using Python. Example snippet of the XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<raml xmlns="raml21.xsd" version="2.1">
<series xmlns="" scope="USA" name="Arizona">
<header>
<log action="created"/>
</header>
<x_ns color="Blue">
<p name="timeZone">(GMT-10)</p>
</x_ns>
<x_ns color="Red">
<p name="AvgHeight">175</p>
</x_ns>
<x_ns color="black">
<p name="AvgWeight">235</p>
</x_ns>
The problem is that the namespace keeps changing, so as an alternative I tried to read the xmlns string first and then build a dictionary of namespaces from it, using the code below:
root = raw_xml.getroot()
namespace_temp1=root.tag.split("}")
namespace_temp2=namespace_temp1[0].strip('{')
namespaces_auto={}
tag_name =["x","y","z","w","v"]
ns_name=[namespace_temp2,namespace_temp2,namespace_temp2,namespace_temp2,namespace_temp2]
namespace_temp3=zip(tag_name,ns_name)
for tag,ns in namespace_temp3:
    namespaces_auto[tag]=ns
namespaces=namespaces_auto
To access a particular tag with a namespace I am using the following code:
for data in raw_xml.findall('x:x_ns',namespaces):
This pretty much solves the problem, but it gets stuck when a child node has a blank xmlns, as seen in the series tag (xmlns=""). I am not sure how to extend the code to handle this case.
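For what it's worth, xmlns="" switches the default namespace off, so series and everything below it are in no namespace at all, and a search for 'x:x_ns' (which expands to {raml21.xsd}x_ns) cannot match them. One way around both the changing namespace and the blank one, assuming Python 3.8 or later, is ElementTree's {*} wildcard, which matches a tag in any namespace or in none. The sketch below uses a closed-off copy of the snippet above; RAML and values are illustrative names.

```python
import xml.etree.ElementTree as ET

# Closed-off copy of the snippet from the question.
RAML = """<raml xmlns="raml21.xsd" version="2.1">
  <series xmlns="" scope="USA" name="Arizona">
    <header><log action="created"/></header>
    <x_ns color="Blue"><p name="timeZone">(GMT-10)</p></x_ns>
    <x_ns color="Red"><p name="AvgHeight">175</p></x_ns>
    <x_ns color="black"><p name="AvgWeight">235</p></x_ns>
  </series>
</raml>"""

root = ET.fromstring(RAML)

# "{*}x_ns" matches x_ns whatever its namespace is, including none.
# ".//" searches all descendants, not only direct children of the root.
values = {
    p.get("name"): p.text
    for x_ns in root.iterfind(".//{*}x_ns")
    for p in x_ns.iterfind("{*}p")
}
print(values)
```

If you need only the elements that are in no namespace, the same Python versions accept "{}x_ns" for that.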

Generating standard xml and making request in Python/Django

I need to generate a standard XML as follows using python:
<?xml version="1.0" encoding="UTF-8"?>
<req:ShipmentValidateRequestAP xmlns:req="http://www.example.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.example.com/schema">
<root>
<doc>
<field1>some value1</field1>
<field2>some vlaue2</field2>
</doc>
</root>
</req:ShipmentValidateRequestAP>
And this needs to be sent in a request to a server URL:
https://sampletest-ea.example.com/XMLShippingServlet
Can anybody please help me implement this?
You can do this with the help of xml.dom.minidom: http://docs.python.org/2/library/xml.dom.minidom.html.
It is a simple library.
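A minimal sketch of what that could look like with minidom plus urllib from the standard library. The helper names build_request_xml and post_xml, the Content-Type header, and the use of POST are assumptions, since the question does not say what the server expects; post_xml is defined but deliberately not called here.

```python
import urllib.request
from xml.dom import minidom


def build_request_xml(field1, field2):
    """Build the request document from the question and return it as bytes."""
    doc = minidom.Document()
    envelope = doc.createElement("req:ShipmentValidateRequestAP")
    # minidom does not add namespace declarations itself, so set them as attributes.
    envelope.setAttribute("xmlns:req", "http://www.example.com")
    envelope.setAttribute("xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance")
    envelope.setAttribute("xsi:schemaLocation", "http://www.example.com/schema")
    doc.appendChild(envelope)

    root_elem = doc.createElement("root")
    envelope.appendChild(root_elem)
    doc_elem = doc.createElement("doc")
    root_elem.appendChild(doc_elem)

    for name, value in (("field1", field1), ("field2", field2)):
        field = doc.createElement(name)
        field.appendChild(doc.createTextNode(value))
        doc_elem.appendChild(field)

    # Passing an encoding makes toxml() return bytes with an XML declaration.
    return doc.toxml(encoding="UTF-8")


def post_xml(url, payload):
    """POST the XML bytes and return the raw response body (assumed contract)."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml"},  # assumption
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


payload = build_request_xml("some value1", "some value2")
print(payload.decode("utf-8"))
```

To send it you would call post_xml('https://sampletest-ea.example.com/XMLShippingServlet', payload); in a Django view the same two helpers can be used unchanged.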

Python xml parsing using minidom

I just started learning how to parse xml using minidom. I tried to get the author's names (xml data is down below) using the following code:
from xml.dom import minidom
xmldoc = minidom.parse("cora.xml")
author = xmldoc.getElementsByTagName('author')
for author in author:
    authorID = author.getElementsByTagName('author id')
    print(authorID)
I got empty brackets([]) all the way. Can someone please help me out? I will also need the title and venue. Thanks in advance. See xml data below:
<?xml version="1.0" encoding="UTF-8"?>
<coraRADD>
<publication id="ahlskog1994a">
<author id="199">M. Ahlskog</author>
<author id="74"> J. Paloheimo</author>
<author id="64"> H. Stubb</author>
<author id="103"> P. Dyreklev</author>
<author id="54"> M. Fahlman</author>
<title>Inganas</title>
<title>and</title>
<title>M.R.</title>
<venue>
<venue pubid="ahlskog1994a" id="1">
<name>Andersson</name>
<name> J Appl. Phys.</name>
<vol>76</vol>
<date> (1994). </date>
</venue>
You can only find tags with getElementsByTagName(), not attributes. You'll need to access those through the Element.getAttribute() method instead:
for author in author:
    authorID = author.getAttribute('id')
    print(authorID)
If you are still learning about parsing XML, you really want to stay away from the DOM. The DOM API is overly verbose to fit many different programming languages.
The ElementTree API would be easier to use:
import xml.etree.ElementTree as ET
tree = ET.parse('cora.xml')
root = tree.getroot()
# loop over all publications
for pub in root.findall('publication'):
    print(' '.join([t.text for t in pub.findall('title')]))
    for author in pub.findall('author'):
        print('Author id: {}'.format(author.attrib['id']))
        print('Author name: {}'.format(author.text))
    for venue in pub.findall('.//venue[@id]'):  # all venue tags with id attribute
        print(', '.join([name.text for name in venue.findall('name')]))
