Entity 'ouml' error while using lxml to parse dblp data - python

I am trying to parse DBLP data (XML format). So far my code is:
# -*- coding: utf-8 -*-
from lxml import etree  # lxml import library
parser = etree.XMLParser(load_dtd=True)
tree = etree.parse("dblp.xml", parser)
root = tree.getroot()
I tried running the code and I get the following error:
tree = etree.parse("dblp.xml", parser)  # Parse the xml with tree structure
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFile
File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
File "dblp.xml", line 70
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 70, column 27
How can I resolve this error?
Note: I have the XML and DTD files in the same location.

I recently encountered the same issue whilst parsing DBLP's XML database. In my case, I was missing the appropriate .dtd file for my dblp.xml (which provides the necessary information for parsing certain custom entities, including ouml). The top of your file should look something like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
The .dtd file specified on the second line should be located in the same directory as the dblp.xml file that you're attempting to parse. You can download the appropriate .dtd file for your XML file from here: http://dblp.org/xml/release/
$ ls
dblp-2017-08-29.dtd dblp-2018-11-01.xml
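A quick way to confirm this before parsing is to check that the DTD named in the DOCTYPE declaration really sits next to the XML file. The sketch below is only an illustration: referenced_dtd is a hypothetical helper (not part of lxml), it scans just the first few kilobytes of the file, and it assumes the system identifier is a path relative to the XML file, as in the DBLP dumps.

```python
import os
import re

# Matches the system identifier in a declaration such as
# <!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
DOCTYPE_RE = re.compile(rb'<!DOCTYPE\s+\S+\s+SYSTEM\s+["\']([^"\']+)["\']')


def referenced_dtd(xml_path, head_bytes=4096):
    """Return (dtd_path, exists) for the DTD named in the DOCTYPE, or None.

    Hypothetical helper: only the first `head_bytes` bytes are scanned, and
    the system identifier is assumed to be relative to the XML file.
    """
    with open(xml_path, 'rb') as fh:
        head = fh.read(head_bytes)
    match = DOCTYPE_RE.search(head)
    if match is None:
        return None  # no DOCTYPE with a SYSTEM identifier near the top
    dtd_name = match.group(1).decode('ascii', 'replace')
    xml_dir = os.path.dirname(os.path.abspath(xml_path))
    dtd_path = os.path.join(xml_dir, dtd_name)
    return dtd_path, os.path.exists(dtd_path)
```

If the second item of the result is False, the parser has nowhere to load the entity definitions from, which would be consistent with the "Entity 'ouml' not defined" error.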
Given the size of dblp.xml, you may also want to use lxml.etree.iterparse to stream the contents of the file instead. Below is some of the code that I used to obtain entries for certain types of publication within the database.
import lxml.etree

fn = 'dblp.xml'
# load_dtd=True lets the parser pick up the entity definitions from the DTD
for event, elem in lxml.etree.iterparse(fn, load_dtd=True):
    if elem.tag not in ['article', 'inproceedings', 'proceedings']:
        continue
    # find() returns the first matching child element, or None if absent
    title = elem.find('title')
    year = elem.find('year')
    authors = elem.find('author')
    venue = elem.find('venue')
    ...
    elem.clear()  # release the record's contents once it has been handled

Related

Read XML using pandas parsing to csv

I have the following code to extract data from XML to a CSV file, but there is an error and I don't know how to solve it.
Can anyone help, please?
url = "http://90.161.233.78:65519/services/user/records.xml?begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?var=EDSLINEEMBEDDED.Module2.API1?period=900"
s = unescape(requests.get(url).text)[5:-6]
df = pd.read_xml(s, xpath="//record/* | //dateTime")
df["field"] = df["field"].ffill()
df.to_csv('output0.csv')
The error is:
doc = fromstring(
File "src\lxml\etree.pyx", line 3252, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1800, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 2
Consider passing the URL directly to pandas.read_xml(), without requests and without unescaping the content first. Per the docs (emphasis added):
path_or_buffer: str, path object, or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a read() function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.
import pandas as pd
url = (
"http://90.161.233.78:65519/services/user/records.xml?"
"begin=04052022?end=06052022?var=EDSLINEEMBEDDED.Module2.VI1?"
"var=EDSLINEEMBEDDED.Module2.API1?period=900"
)
df = pd.read_xml(url, xpath="//record/* | //dateTime")
# FILL PARENT TEXT FORWARD TO CHILD ITEMS
df["dateTime"] = df["dateTime"].ffill()
# DROP UNNEEDED ROWS (assign the result, otherwise the filter is discarded)
df = df[(pd.notnull(df["id"])) & (pd.notnull(df["value"]))]
df.to_csv('output0.csv')

Python: append multiple sub-elements to a parent element by reading from a string

I am new to Python and am trying to parse multiple XML elements, given as a string, and add them as sub-elements of an existing XML element.
For example:
<student>
  <name>abc</name>
  <description>abcd</description>
  <regno>200</regno>
</student>
I have a string which can represent further student info, and which may contain nested information as well. For example:
"<grade>100</grade><address-info><street>xyz</street><city>efgh</city><zip>505050</zip></address-info>"
I need this string to be parsed and added inside the student element, resulting in something like:
<student>
  <name>abc</name>
  <description>abcd</description>
  <regno>200</regno>
  <grade>100</grade>
  <address-info>
    <street>xyz</street>
    <city>efgh</city>
    <zip>505050</zip>
  </address-info>
</student>
I tried using the append method, which results in an error:
def add_cfg_under_correct_student(in_name, cfg_to_be_added, root):
    if root is None:
        return True
    for student in root.findall('student'):
        name = student.find('name')
        if name.text != in_name:
            continue
        student.append(ET.fromstring(cfg_to_be_added))
    return True
This is the error I got:
Traceback (most recent call last):
add_cfg_under_correct_student
student.append(ET.fromstring(cfg_to_be_added))
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:79003)
File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118334)
File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117014)
File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111258)
File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105102)
File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106810)
File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105664)
File "", line 1
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 1, column 27
Then I tried ET.ElementTree(ET.fromstring(xmlstring)), as suggested in this answer, but I still get a similar error.
Then I looked at another answer on adding multiple elements at once in this question, but it doesn't exactly solve my scenario.
Does the solution mentioned in that question, using extend, also work on sub-elements which could in turn have sub-elements of their own?
Please help.
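One approach that may work here, sketched under a few assumptions: the string holds several sibling elements rather than a single document, which is consistent with the "Extra content at the end of the document" error. Wrapping it in a temporary root element (the name wrapper is arbitrary) makes it well-formed, and extend then attaches the parsed children, nested ones included, to the student. The sketch uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies; lxml.etree offers the same fromstring, find, findall and extend calls. The students root element and the False return value for "no match" are assumptions of mine, not taken from the question.

```python
import xml.etree.ElementTree as ET


def add_cfg_under_correct_student(in_name, cfg_to_be_added, root):
    """Attach every top-level element of the fragment to the matching student."""
    # The fragment has several top-level elements, so wrap it in a
    # throwaway root ("wrapper" is an arbitrary name) to make it parseable.
    wrapper = ET.fromstring('<wrapper>' + cfg_to_be_added + '</wrapper>')
    for student in root.findall('student'):
        name = student.find('name')
        if name is None or name.text != in_name:
            continue
        # list() snapshots the children before they are attached; nested
        # children such as <street> travel along with their parent element.
        student.extend(list(wrapper))
        return True
    return False  # assumption: report when no student matched


# Assumed document layout: <student> elements directly under one root.
root = ET.fromstring(
    '<students><student><name>abc</name><description>abcd</description>'
    '<regno>200</regno></student></students>'
)
cfg = ('<grade>100</grade><address-info><street>xyz</street>'
       '<city>efgh</city><zip>505050</zip></address-info>')
added = add_cfg_under_correct_student('abc', cfg, root)
student = root.find('student')
print([child.tag for child in student])
```

The wrapper element itself is never inserted; only its children are moved across, so the student ends up with grade and address-info as direct children.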

lxml: some XML from URL give this lxml.etree.XMLSyntaxError

I have a script which is supposed to extract some terms from XML files, given a list of URLs.
All the URLs give access to XML data.
At first it works fine, opening, parsing and extracting correctly, but then it is interrupted by some XML files with this error:
File "<stdin>", line 18, in <module>
File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
File "parser.pxi", line 1555, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82511)
File "parser.pxi", line 1585, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:82832)
File "parser.pxi", line 1468, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:81688)
File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:78735)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
From my search, it might be because some XML files contain white space, but I'm not sure that is the problem. I can't tell which files give the error.
Is there a way to get around this error?
Here is my script:
URLlist = ["http://www.uniprot.org/uniprot/"+x+".xml" for x in IDlist]
for id, item in zip(IDlist, URLlist):
    goterm_location = []
    goterm_function = []
    goterm_process = []
    location_list[id] = []
    function_list[id] = []
    biological_list[id] = []
    try:
        textfile = urllib2.urlopen(item)
    except urllib2.HTTPError:
        print("URL", item, "could not be read.")
        continue
    #Try to solve empty line error#
    tree = etree.parse(textfile)
    #root = tree.getroot()
    for node in tree.iter('{http://uniprot.org/uniprot}dbReference'):
        if node.attrib.get('type') == 'GO':
            for child in node:
                value = child.attrib.get('value')
                if value.startswith('C:'):
                    goterm_C = node.attrib.get('id')
                    if goterm_C:
                        location_list[id].append(goterm_C)
                if value.startswith('F:'):
                    goterm_F = node.attrib.get('id')
                    if goterm_F:
                        function_list[id].append(goterm_F)
                if value.startswith('P:'):
                    goterm_P = node.attrib.get('id')
                    if goterm_P:
                        biological_list[id].append(goterm_P)
I have tried:
tree = etree.iterparse(textfile, events = ("start","end"));
OR
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(textfile, parser)
Without success.
Any help would be greatly appreciated
"I can't tell which files give the error"
Debug by printing the name of the file/URL prior to parsing. Then you'll see which file(s) cause the error.
Also, read the error message:
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
This suggests that the downloaded XML file is empty. Once you have determined the URL(s) that cause the problem, try downloading the file and check its contents. I suspect it might be empty.
You can ignore problematic files (empty or otherwise syntactically invalid) by using a try/except block when parsing:
    try:
        tree = etree.parse(textfile)
    except etree.XMLSyntaxError:
        print('Skipping invalid XML from URL {}'.format(item))
        continue  # go on to the next URL
Or you could check just for empty files by checking the 'Content-length' header, or even by reading the resource returned by urlopen(), but I think that the above is better as it will also catch other potential errors.
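The two ideas can also be combined: read the body first, skip it if it is empty, and catch the parser's error for anything else that is malformed. Here is a minimal sketch of that, written against the standard library's xml.etree.ElementTree so it runs without extra dependencies (its error class is ET.ParseError; with lxml the exception to catch would be etree.XMLSyntaxError instead). parse_or_skip and the sample responses are made up for illustration; in the script above the bytes would come from urlopen(item).read().

```python
import xml.etree.ElementTree as ET


def parse_or_skip(data, label):
    """Return the root element parsed from `data` (bytes), or None if unusable."""
    if not data.strip():
        print('Skipping empty response from {}'.format(label))
        return None
    try:
        return ET.fromstring(data)
    except ET.ParseError as err:
        print('Skipping invalid XML from {}: {}'.format(label, err))
        return None


# Stand-in responses; in the real script these would be urlopen(item).read().
responses = {
    'good': b'<entry><dbReference type="GO" id="GO:0005737"/></entry>',
    'empty': b'',
    'broken': b'<entry><unclosed></entry>',
}
parsed = {}
for label, data in responses.items():
    root = parse_or_skip(data, label)
    if root is not None:
        parsed[label] = root
print(sorted(parsed))
```

Because the label is printed for every skipped response, this also answers the "which files give the error" part.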
I got the same error message in Python 3.6:
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
In my case the XML file was not empty; the issue was the encoding. Initially I used utf-8:
from lxml import etree
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='utf-8')
Changing the encoding to iso-8859-1 solved my issue:
etree.iterparse('my_xml_file.xml', tag='MyTag', encoding='iso-8859-1')
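Rather than guessing, it may help to check which encoding the file itself declares before overriding anything. The following is a small sketch with a hypothetical helper, declared_encoding, which reads the XML declaration from the first bytes of the file. It assumes an ASCII-compatible encoding, so it will not work for UTF-16 files, and it only reports what the file claims, which may differ from how the bytes were actually written.

```python
import re

# Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
XML_DECL_RE = re.compile(
    rb'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def declared_encoding(path, head_bytes=200):
    """Return the encoding named in the XML declaration, or None if absent."""
    with open(path, 'rb') as fh:
        head = fh.read(head_bytes)
    # Drop a UTF-8 byte order mark, if any, before looking for "<?xml".
    match = XML_DECL_RE.match(head.lstrip(b'\xef\xbb\xbf'))
    return match.group(1).decode('ascii') if match else None
```

The returned name could then be passed as the encoding argument shown above.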

IOError passing requests Response.content to lxml.etree.parse() [duplicate]

I have the following xml on a webpage -
<entry>
<id>1750</id>
<title>variablename</title>
<source>
com.tidalsoft.webclient.tes.dsp.db.datatypes.Variable
</source>
<tes:variable>
<tes:ownername>ownergroup</tes:ownername>
<tes:productiondate>2015-08-17T00:00:00-0400</tes:productiondate>
<tes:readonly>N</tes:readonly>
<tes:publish>N</tes:publish>
<tes:description>
Decription Here
</tes:description>
<tes:startcalendar>0</tes:startcalendar>
<tes:ownerid>666</tes:ownerid>
<tes:type>1</tes:type>
<tes:lastusermodifiedtime>2015-06-15T15:42:27-0400</tes:lastusermodifiedtime>
<tes:innervalue>\\share\location</tes:innervalue>
<tes:calc>N</tes:calc>
<tes:name>variablename</tes:name>
<tes:startdate>1899-12-30T00:00:00-0500</tes:startdate>
<tes:pub>Y</tes:pub>
<tes:lastvalue>\\share\location</tes:lastvalue>
<tes:id>1750</tes:id>
<tes:startdateasstring>18991230000000</tes:startdateasstring>
<tes:lastchangetime>2015-06-15T15:42:27-0400</tes:lastchangetime>
<tes:clientcachelastchangetime>2015-08-17T09:56:49-0400</tes:clientcachelastchangetime>
</tes:variable>
</entry>
I'm trying to parse this data. I fetch it with requests:
r = requests.get(url, auth=('username', 'password'))
but when I try to parse the content I get errors:
>>> xmlObject = etree.parse(r.content)
Traceback (most recent call last):
File "apiTest.py", line 46, in <module>
xmlObject = etree.parse(r.content)
File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src\lxml\lxml.etree.c:72517)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:105979)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106278)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105277)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100227)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95786)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94818)
IOError: Error reading file ''
On the last line, what appears between the quotes is the XML shown at the beginning, as a string:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><entry xmlns="http://purl.org/atom/ns#"><id>1750</id><title>....
The data is served with content-type: text/xml.
etree.parse expects a filename, a file-like object, or a URL as its first argument (see help(etree.parse)). It does not expect an XML string. To parse an XML string use
xmlObject = etree.fromstring(r.content)
Note that etree.fromstring returns a lxml.etree._Element. In contrast, etree.parse returns a lxml.etree._ElementTree. Given the _Element, you can obtain the _ElementTree with the getroottree method:
xmlTree = xmlObject.getroottree()
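The distinction can be seen in a short, self-contained sketch. It uses the standard library's xml.etree.ElementTree, whose parse and fromstring follow the same convention as lxml's (parse wants a file name or file-like object, fromstring wants the XML itself). The content bytes stand in for r.content, and the tes namespace URI is a placeholder of mine, since the real one is not shown in the question.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for r.content; the tes namespace URI here is a placeholder.
content = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<entry xmlns="http://purl.org/atom/ns#" xmlns:tes="http://example.com/tes">'
    b'<id>1750</id><title>variablename</title>'
    b'<tes:variable><tes:ownerid>666</tes:ownerid></tes:variable>'
    b'</entry>'
)

# fromstring takes the XML itself and returns the root element.
root = ET.fromstring(content)

# parse takes a file name or file-like object, so wrap the bytes in BytesIO.
tree = ET.parse(io.BytesIO(content))

# Both elements are namespaced, so lookups need a prefix-to-URI mapping.
ns = {'atom': 'http://purl.org/atom/ns#', 'tes': 'http://example.com/tes'}
title = root.find('atom:title', ns).text
owner_id = tree.getroot().find('tes:variable/tes:ownerid', ns).text
print(title, owner_id)
```

Wrapping the bytes in io.BytesIO is an alternative when a tree object is wanted directly; passing the raw bytes to parse is what leads to the "Error reading file" message, because they are treated as a file name.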

Using Python and lxml to validate XML against an external DTD

I'm trying to validate an XML file against an external DTD referenced in the doctype tag. Specifically:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
...the rest of the document...
I'm using Python 3.3 and the lxml module. From reading http://lxml.de/validation.html#validation-at-parse-time, I've thrown this together:
enexFile = open(sys.argv[2], mode="rb") # sys.argv[2] is the path to an XML file in local storage.
enexParser = etree.XMLParser(dtd_validation=True)
enexTree = etree.parse(enexFile, enexParser)
From what I understand of validation.html, the lxml library should now take care of retrieving the DTD and performing validation. But instead, I get this:
$ ./mapwrangler.py validate notes.enex
Traceback (most recent call last):
File "./mapwrangler.py", line 27, in <module>
enexTree = etree.parse(enexFile, enexParser)
File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: Validation failed: no DTD found !, line 3, column 43
This surprises me, because if I turn off validation, then the document parses in just fine and I can do print(enexTree.docinfo.doctype) to get
$ ./mapwrangler.py validate notes.enex
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export3.dtd">
So it looks to me like there shouldn't be any problem finding the DTD.
Thanks for your help.
You need to add no_network=False when constructing the parser object. This option is set to True by default.
From the documentation of parser options at http://lxml.de/parsing.html#parsers:
no_network - prevent network access when looking up external documents (on by default)
For a reason I still don't know, my problem was related to where the XML catalog was located on my local file system.
In my case, I use an XML editor that has a tight integration with a component content management system (CCMS, in this case SDL Trisoft 2011 R2). When the editor connects to the CCMS, DTDs, catalog files and a bunch of other files are synced. These files end up on the local file system in:
C:\Users\[username]\AppData\Local\Trisoft\InfoShare Client\[id]\Config\DocTypes\catalog.xml
I could not get that to work. Simply COPYING the whole catalog to another location fixed things, and this works:
import os
from lxml import etree

f = r"path/to/my/file.xml"
# set XML catalog file path
os.environ['XML_CATALOG_FILES'] = r'C:\DATA\Mydoctypes\catalog.xml'
# configure parser
parser = etree.XMLParser(dtd_validation=True, no_network=True)
# validate
try:
    valid = etree.parse(f, parser=parser)
    print("This file is valid against the DTD.")
except etree.XMLSyntaxError as error:
    print("This file is INVALID against the DTD!")
    print(error)
Obviously this is not ideal, but it works.
Could it be something to do with file permissions, or perhaps that good old "file path too long" problem in Windows? I have not tried whether a symbolic link would work.
I am using Windows 7, Python 2.7.11 and the version of lxml is (3.6.0).
