I am trying to use Python's xml.etree.ElementTree.parse() function to parse an XML file I created by exporting all of the content from a WordPress blog. However, when I try like so:
import xml.etree.ElementTree as xml
tree = xml.parse('/path/to/file.xml')
I get the following error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
ParseError: unbound prefix: line 189, column 1
Here's what's on line 189 of my XML file:
<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blogname.wordpress.com/osd.xml" title="blog name" />
I've seen many questions about this error coming up with Android development, but I can't tell if and how that applies to my situation. Can anyone help with this?
Apologies to everyone for whom this was stupidly obvious, but it turns out I simply didn't have a namespace definition for "atom" in the document. I'm guessing that "unbound prefix" means that the prefix "atom" wasn't "bound" to a namespace definition?
Anyway, adding said definition has solved the problem. Although it makes me wonder why WordPress exports XML files without proper definitions for all of the namespaces they use...
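The error can be reproduced in isolation with a minimal sketch like the one below, which uses only the standard library. The sample strings and the helper name try_parse are made up for illustration; the namespace URI is the usual Atom one, which WordPress exports normally bind to the atom prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the atom prefix is used but never declared.
UNBOUND = '<rss version="2.0"><atom:link rel="search" href="http://example.com/osd.xml" /></rss>'

# Same document with the prefix bound to the Atom namespace on the root element.
BOUND = (
    '<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">'
    '<atom:link rel="search" href="http://example.com/osd.xml" />'
    '</rss>'
)

def try_parse(text):
    """Return (root, None) on success or (None, error message) on failure."""
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        return None, str(err)

root, error = try_parse(UNBOUND)
print(error)          # message starts with "unbound prefix"

root, error = try_parse(BOUND)
print(root[0].tag)    # {http://www.w3.org/2005/Atom}link
```

Once the prefix is bound, the parser expands atom:link to the full namespace-qualified tag name.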
If you remove all the namespace prefixes, it parses absolutely fine.
Change
<s:home>USA</s:home>
to
<home>USA</home>
Just in case it helps someone some day, I was also working with a WordPress XML export (WordPress eXtended RSS) file in Python and was getting the same error. In my case, WordPress had included most of the correct namespace definitions. However, the XML had iTunes podcast information as well, and the iTunes namespace declaration was not present.
I fixed it by adding xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" into the RSS declaration block. So this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
became this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
>
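To find out which declarations are missing before editing the file by hand, a rough check along these lines may help. It is only a heuristic sketch: the function name undeclared_prefixes and the sample string are invented for illustration, and the regular expressions look only at element names, so prefixes on attributes or text inside comments and CDATA sections are not handled properly.

```python
import re

def undeclared_prefixes(xml_text):
    """Return prefixes used in element names but never declared via xmlns:prefix."""
    declared = set(re.findall(r'xmlns:([A-Za-z_][\w.-]*)\s*=', xml_text))
    used = set(re.findall(r'</?([A-Za-z_][\w.-]*):[A-Za-z_]', xml_text))
    # The xml prefix is predefined and never needs a declaration.
    return sorted(used - declared - {"xml"})

# Hypothetical sample: wp is declared, itunes is not.
SAMPLE = (
    '<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">'
    '<wp:author>someone</wp:author>'
    '<itunes:author>someone</itunes:author>'
    '</rss>'
)

print(undeclared_prefixes(SAMPLE))  # ['itunes']
```

Each prefix it reports needs a matching xmlns:prefix="..." attribute on the root element, as in the fix above.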
I have a big XML file with several article nodes. I have included only one with the problem. I try to parse it in Python to filter some data and I get the error
File "<string>", line unknown
ParseError: undefined entity &Ouml;: line 90, column 17
Sample of the XML file
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
</dblp>
From my search in Google, I found that this kind of error appears if you have issues in the node names. However, the line with the error is the second author, in the text.
This is my Python code
import xml.etree.ElementTree as etree  # assumed: the question does not show its import

with open('xaa.xml', 'r') as xml_file:
    xml_tree = etree.parse(xml_file)
The declaration of the Ouml entity is presumably in the DTD (dblp.dtd), but ElementTree does not support external DTDs. ElementTree only recognizes entities declared directly in the XML file (in the "internal subset"). This is a working example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp [
<!ENTITY Ouml 'Ö'>
]>
<dblp>
<article mdate="2019-10-25" key="tr/gte/TR-0146-06-91-165" publtype="informal">
<author>Alejandro P. Buchmann</author>
<author>M. Tamer &Ouml;zsu</author>
<author>Dimitrios Georgakopoulos</author>
<title>Towards a Transaction Management System for DOM.</title>
<journal>GTE Laboratories Incorporated</journal>
<volume>TR-0146-06-91-165</volume>
<month>June</month>
<year>1991</year>
<url>db/journals/gtelab/index.html#TR-0146-06-91-165</url>
</article>
</dblp>
To parse the XML file in the question without errors, you need a more powerful XML library that supports external DTDs. lxml is a good choice for that.
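The difference between an internal and an external entity declaration can be checked with the standard library alone. This is a small sketch with made-up, shortened documents (INTERNAL and EXTERNAL are illustrative names); the dblp.dtd file is never read, which is exactly the point.

```python
import xml.etree.ElementTree as ET

# Entity declared in the internal subset: ElementTree can resolve it.
INTERNAL = """<!DOCTYPE dblp [
<!ENTITY Ouml "&#214;">
]>
<dblp><author>M. Tamer &Ouml;zsu</author></dblp>"""

# Entity declared only in an external DTD, which ElementTree never loads.
EXTERNAL = """<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp><author>M. Tamer &Ouml;zsu</author></dblp>"""

author = ET.fromstring(INTERNAL).find("author").text
print(author == "M. Tamer \u00d6zsu")  # True

try:
    ET.fromstring(EXTERNAL)
    external_error = None
except ET.ParseError as err:
    external_error = str(err)
print(external_error)  # reports the undefined entity
```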
The following addon.xml gives me the Python error 'failed to parse':
(I've used an online checker and it says "error on line 33 at column 15: Opening and ending tag mismatch: description line 0 and extension", which points to the very end of the </extension> end tag at the end of the document.)
Any advice would be appreciated. This worked yesterday and I have no idea why it has stopped working.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<addon id="plugin.audio.criminalpodcast" name="Criminal Podcast" version="1.1.0" provider-name="leopheard">
<requires>
<import addon="xbmc.python" version="2.1.0"/>
<import addon="script.module.xbmcswift2" version="2.4.0"/>
<import addon="script.module.beautifulsoup4" version="4.3.1"/>
<import addon="script.module.requests" version="1.1.0"/>
<import addon="script.module.routing" version="0.2.0"/> </requires>
<provides>audio</provides> </extension>
<extension point="xbmc.addon.metadata">
<platform>all</platform>
<language></language>
<summary lang="en"></summary>
<description lang="en">description </description>
<license>The MIT License (MIT)</license>
<forum>https://forum.kodi.tv/showthread.php?tid=344790</forum>
<email>leopheard#gmail.com</email>
<source>https://github.com/leopheard/criminalpodcast</source>
<website>http://www.thisiscriminal.com</website>
<audio_guide></audio_guide>
<assets>
<icon>icon.png</icon>
<fanart>fanart.jpg</fanart>
<screenshot>resources/media/Criminal_SocialShare_2.png</screenshot>
<screenshot>resources/media/Criminal_SocialShare_3.png</screenshot>
<screenshot>resources/media/Radiotopia-logo.png</screenshot>
</assets>
Your "XML" file is not well-formed, so it cannot be parsed. Find out how it was created, correct the process so the problem does not occur again, and then regenerate the file.
Files that are vaguely XML-like but not well-formed are pretty much useless. Repair is sometimes possible if the errors are very systematic, but that doesn't appear to be the case here.
Most of the time, a "failed to parse" error message is caused by the XML file itself.
Check your XML file for correct formatting.
I once forgot the root tag and got the same error message.
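A quick way to find where a file stops being well-formed is to let a parser report the position. The sketch below uses the standard library; locate_parse_error and the BROKEN sample are hypothetical names for illustration, not part of Kodi or of the add-on in question.

```python
import xml.etree.ElementTree as ET

def locate_parse_error(xml_text):
    """Return None if xml_text is well-formed, else (message, line, column)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return str(err), line, column
    return None

# Hypothetical broken document: </extension> closes a tag that was never opened.
BROKEN = "<addon>\n  <provides>audio</provides>\n</extension>\n</addon>"

print(locate_parse_error(BROKEN))
print(locate_parse_error("<addon><provides>audio</provides></addon>"))  # None
```

The reported line is where the parser gave up, which is often later than the actual mistake (here, the missing opening tag), so it is worth reading upwards from that point.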
I'm getting an error when trying to grab a value from my XML. I get "Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
Here is my code:
import requests
import lxml.etree
from requests.auth import HTTPBasicAuth
r= requests.get("https://somelinkhere/folder/?parameter=abc", auth=HTTPBasicAuth('username', 'password'))
print r.text
root = lxml.etree.fromstring(r.text)
textelem = root.find("opensearch:totalResults")
print textelem.text
I'm getting this error:
Traceback (most recent call last):
File "tickets2.py", line 8, in <module>
root = lxml.etree.fromstring(r.text)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
File "src/lxml/parser.pxi", line 1814, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124471)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Here is what the XML looks like; I'm trying to grab the value in the last line.
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:apple-wallpapers="http://www.apple.com/ilife/wallpapers" xmlns:g-custom="http://base.google.com/cns/1.0" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:cc="http://web.resource.org/cc/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:g-core="http://base.google.com/ns/1.0">
<title>Feed from some link here</title>
<link rel="self" href="https://somelinkhere/folder/?parameter=abc" />
<link rel="first" href="https://somelinkhere/folder/?parameter=abc" />
<id>https://somelinkhere/folder/?parameter=abc</id>
<updated>2018-03-06T17:48:09Z</updated>
<dc:creator>company.com</dc:creator>
<dc:date>2018-03-06T17:48:09Z</dc:date>
<opensearch:totalResults>4</opensearch:totalResults>
I have tried various changes from links like https://twigstechtips.blogspot.com/2013/06/python-lxml-strings-with-encoding.html and http://makble.com/how-to-parse-xml-with-python-and-lxml but I keep running into the same error.
Instead of r.text, which guesses at the text encoding and decodes it, try using r.content which accesses the response body as bytes. (See http://docs.python-requests.org/en/latest/user/quickstart/#response-content.)
You could also use r.raw. See "parsing XML file gets UnicodeEncodeError (ElementTree) / ValueError (lxml)" for more info.
Once that issue is fixed, you'll have the issue of the namespace. The element you're trying to find (opensearch:totalResults) has the prefix opensearch which is bound to the uri http://a9.com/-/spec/opensearch/1.1/.
You can find the element by combining the namespace uri and the local name (Clark notation):
{http://a9.com/-/spec/opensearch/1.1/}totalResults
See http://lxml.de/tutorial.html#namespaces for more info.
Here's an example with both changes implemented:
os = "{http://a9.com/-/spec/opensearch/1.1/}"
root = lxml.etree.fromstring(r.content)
textelem = root.find(os + "totalResults")
print(textelem.text)
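An alternative to concatenating the URI by hand is to pass a prefix-to-URI mapping to find(). The sketch below uses the standard library's ElementTree so it runs without a network call; the BODY bytes are a made-up stand-in for r.content, and NS is an illustrative name. lxml's find() accepts a namespaces mapping in the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body as bytes, including an encoding declaration.
BODY = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<feed xmlns="http://www.w3.org/2005/Atom"'
    b' xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">'
    b'<opensearch:totalResults>4</opensearch:totalResults>'
    b'</feed>'
)

# The prefix used here is local to this mapping; only the URI has to match.
NS = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}

root = ET.fromstring(BODY)
total = root.find("opensearch:totalResults", NS)
print(total.text)  # 4
```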
I am trying to get the contents of a sub tag using lxml. The XML file I am parsing is valid, but for some reason when I try to parse the child element it seems to think I have invalid XML. I have seen from other posts that this error is usually generated when there isn't a closing tag, but the XML parses fine in a browser. Any ideas why this is happening?
Contents of XML file (test.xml):
<?xml version="1.0" encoding="UTF-8"?>
<Group id="RHEL-07-010010">
<title>SRG-OS-000257-GPOS-00098</title>
<description>&lt;GroupDescription&gt;&lt;/GroupDescription&gt; </description>
<Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
<version>RHEL-07-010010</version>
<title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
<description>&lt;VulnDiscussion&gt;Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278- GPOS-00108&lt;/VulnDiscussion&gt;&lt;FalsePositives&gt;&lt; /FalsePositives&gt;&lt;FalseNegatives&gt;&lt; /FalseNegatives&gt;&lt;Documentable&gt;false&lt; /Documentable&gt;&lt;Mitigations&gt;&lt; /Mitigations&gt;&lt;SecurityOverrideGuidance&gt;&lt; /SecurityOverrideGuidance&gt;&lt;PotentialImpacts&gt;&lt; /PotentialImpacts&gt;&lt;ThirdPartyTools&gt;&lt; /ThirdPartyTools&gt;&lt;MitigationControl&gt;&lt; /MitigationControl&gt;&lt;Responsibility&gt;&lt; /Responsibility&gt;&lt;IAControls&gt;&lt;/IAControls&gt;</description>
<ident system="http://iase.disa.mil/cci">CCI-001494</ident>
<ident system="http://iase.disa.mil/cci">CCI-001496</ident>
<fixtext fixref="F-RHEL-07-010010_fix">Run the following command to determine which package owns the file:
# rpm -qf &lt;filename&gt;
Reset the permissions of files within a package with the following command:
#rpm --setperms &lt;packagename&gt;
Reset the user and group ownership of files within a package with the following command:
#rpm --setugids &lt;packagename&gt;</fixtext>
<fix id="F-RHEL-07-010010_fix" />
<check system="C-RHEL-07-010010_chk">
<check-content-ref name="M" href="VMS_XCCDF_Benchmark_SRG.xml" />
<check-content>Verify the file permissions, ownership, and group membership of system files and commands match the vendor values.
Check the file permissions, ownership, and group membership of system files and commands with the following command:
# rpm -Va | grep '^.M'
If there is any output from the command, this is a finding.</check-content>
</check>
</Rule>
</Group>
I am trying to get the contents of the VulnDiscussion tag. I can get the contents of the parent tag, description, like this:
from lxml import etree as ET
xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    print(description)
This produces the following output:
<GroupDescription></GroupDescription>
<VulnDiscussion>Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-GPOS-00108</VulnDiscussion> <FalsePositives></FalsePositives><FalseNegatives> </FalseNegatives><Documentable>false</Documentable><Mitigations></Mitigations> <SecurityOverrideGuidance></SecurityOverrideGuidance><PotentialImpacts> </PotentialImpacts><ThirdPartyTools></ThirdPartyTools><MitigationControl> </MitigationControl><Responsibility></Responsibility><IAControls></IAControls>
So far so good, now I try and extract the contents of VulnDiscussion with this code:
for description in xml.xpath('//description/text()'):
    vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
    print(vulnDiscussion)
and get the following error :
vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
File "src/lxml/lxml.etree.pyx", line 3192, in lxml.etree.XML (src/lxml/lxml.etree.c:78763)
File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117021)
File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111265)
File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
File "<string>", line 3
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 3, column 79
An XML document can have only one root element, but each string returned by xml.xpath('//description/text()') contains several sibling elements. Wrap them in a single element so that the string parses as a document with one root.
Also note that the text in the original XML has a space before each closing tag ("< /"), which you should remove.
from lxml import etree as ET
xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    # add a root tag and remove the space before each closing tag
    x = ET.XML('<Testroot>' + description.replace('< /', '</') + '</Testroot>')
    vulnDiscussion = next(iter(x.xpath('//VulnDiscussion/text()')), None)
    if vulnDiscussion:
        print(vulnDiscussion)
Output
Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278- GPOS-00108
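The single-root rule can be seen with a much smaller input. The following sketch uses the standard library rather than lxml, with a made-up FRAGMENT string and a hypothetical helper parse_fragment; the wrapper element name is arbitrary.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with two sibling elements and no common root.
FRAGMENT = "<VulnDiscussion>Some text</VulnDiscussion><Documentable>false</Documentable>"

def parse_fragment(fragment):
    """Parse a string of sibling elements by wrapping it in a temporary root."""
    return ET.fromstring("<wrapper>" + fragment + "</wrapper>")

try:
    ET.fromstring(FRAGMENT)
    bare_error = None
except ET.ParseError as err:
    bare_error = str(err)
print(bare_error)  # parse error caused by content after the first root element

wrapped = parse_fragment(FRAGMENT)
print(wrapped.findtext("VulnDiscussion"))  # Some text
```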
I am trying to validate an XML document using a group of Schematron schemas, where a main schema includes the others.
Main schematron:
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns="http://purl.oclc.org/dsdl/schematron"
xmlns:sch="http://purl.oclc.org/dsdl/schematron"
xmlns:sh="http://www.unece.org/cefact/namespaces/StandardBusinessDocumentHeader"
xmlns:ef="http://www.efatura.gov.tr/envelope-namespace">
<sch:include href="UBL-TR_Codelist.sch#codes"/>
<sch:include href="UBL-TR_Common_Schematron.sch#abstracts"/>
<sch:ns prefix="sh" uri="http://www.unece.org/cefact/namespaces/StandardBusinessDocumentHeader" />
<sch:ns prefix="ef" uri="http://www.efatura.gov.tr/package-namespace" />
<!-- .... -->
<sch:pattern id="document">
<sch:rule context="sh:StandardBusinessDocument">
<sch:extends rule="DocumentCheck"/>
</sch:rule>
</sch:pattern>
</sch:schema>
Common schematron:
<sch:schema xmlns="http://purl.oclc.org/dsdl/schematron"
xmlns:sch="http://purl.oclc.org/dsdl/schematron">
<sch:pattern name="AbstractRules" id="abstracts">
<sch:p>Pattern for storing abstract rules</sch:p>
<!-- Rule to validate StandardBusinessDocument -->
<sch:rule abstract="true" id="DocumentCheck">
<sch:assert test="sh:StandardBusinessDocumentHeader">sh:StandardBusinessDocumentHeader is a required element.</sch:assert>
<sch:assert test="ef:Package">ef:Package is a required element.</sch:assert>
</sch:rule>
</sch:pattern>
</sch:schema>
The problem is this: in the main schema, if I put an assertion tag directly, for example:
<assert test="sum(//Percent)=100">Sum is not 100%.</assert>
between the "rule" tags, like this:
<sch:pattern id="document">
<sch:rule context="sh:StandardBusinessDocument">
<assert test="sum(//Percent)=100">Sum is not 100%.</assert>
</sch:rule>
</sch:pattern>
then lxml's isoschematron.Schematron class accepts my main schematron. Otherwise it throws an error like this:
Traceback (most recent call last):
File "C:\SUNUCU\validate\v.py", line 102, in <module>
schematron = etree.Schematron(s)
File "schematron.pxi", line 116, in lxml.etree.Schematron.__init__ (src\lxml\lxml.etree.c:156251)
SchematronParseError: Document is not a valid Schematron schema
I've tried it with the etree.Schematron class and it throws "SchematronParseError: invalid schematron schema:" too.
I think the problem is related to Schematron's
<sch:extends />
tag. That is, the errors appear when the schema uses a rule or assertion defined in an external file.
What is the correct way to work with related, combined Schematron files in Python?
Thanks in advance.
I think the problem was much simpler:
it seems like you forgot the "sch:" XML-namespace prefix in the <assert> element.