Python XML Parsing Child Tag - python

I am trying to get the contents of a sub tag using lxml. The XML file I am parsing is valid, but for some reason when I try to parse the child element, lxml seems to think I have invalid XML. I have seen in other posts that this error is usually raised when a closing tag is missing, but the XML parses fine in a browser. Any ideas why this is happening?
Contents of XML file (test.xml):
<?xml version="1.0" encoding="UTF-8"?>
<Group id="RHEL-07-010010">
<title>SRG-OS-000257-GPOS-00098</title>
<description><GroupDescription></GroupDescription> </description>
<Rule id="RHEL-07-010010_rule" severity="high" weight="10.0">
<version>RHEL-07-010010</version>
<title>The file permissions, ownership, and group membership of system files and commands must match the vendor values.</title>
<description><VulnDiscussion>Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278- GPOS-00108</VulnDiscussion><FalsePositives>< /FalsePositives><FalseNegatives>< /FalseNegatives><Documentable>false< /Documentable><Mitigations>< /Mitigations><SecurityOverrideGuidance>< /SecurityOverrideGuidance><PotentialImpacts>< /PotentialImpacts><ThirdPartyTools>< /ThirdPartyTools><MitigationControl>< /MitigationControl><Responsibility>< /Responsibility><IAControls></IAControls></description>
<ident system="http://iase.disa.mil/cci">CCI-001494</ident>
<ident system="http://iase.disa.mil/cci">CCI-001496</ident>
<fixtext fixref="F-RHEL-07-010010_fix">Run the following command to determine which package owns the file:
# rpm -qf <filename>
Reset the permissions of files within a package with the following command:
#rpm --setperms <packagename>
Reset the user and group ownership of files within a package with the following command:
#rpm --setugids <packagename></fixtext>
<fix id="F-RHEL-07-010010_fix" />
<check system="C-RHEL-07-010010_chk">
<check-content-ref name="M" href="VMS_XCCDF_Benchmark_SRG.xml" />
<check-content>Verify the file permissions, ownership, and group membership of system files and commands match the vendor values.
Check the file permissions, ownership, and group membership of system files and commands with the following command:
# rpm -Va | grep '^.M'
If there is any output from the command, this is a finding.</check-content>
</check>
</Rule>
</Group>
I am trying to get the contents of the VulnDiscussion tag. I can get the contents of the parent tag, description, like this:
from lxml import etree as ET
xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    print(description)
This produces the following output:
<GroupDescription></GroupDescription>
<VulnDiscussion>Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278-GPOS-00108</VulnDiscussion> <FalsePositives></FalsePositives><FalseNegatives> </FalseNegatives><Documentable>false</Documentable><Mitigations></Mitigations> <SecurityOverrideGuidance></SecurityOverrideGuidance><PotentialImpacts> </PotentialImpacts><ThirdPartyTools></ThirdPartyTools><MitigationControl> </MitigationControl><Responsibility></Responsibility><IAControls></IAControls>
So far so good, now I try and extract the contents of VulnDiscussion with this code:
for description in xml.xpath('//description/text()'):
    vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
    print(vulnDiscussion)
and get the following error :
vulnDiscussion = next(iter(ET.XML(description).xpath('//VulnDiscussion/text()')), None)
File "src/lxml/lxml.etree.pyx", line 3192, in lxml.etree.XML (src/lxml/lxml.etree.c:78763)
File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117021)
File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:111265)
File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
File "<string>", line 3
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 3, column 79

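An XML document can have only one root element, but each string returned by xml.xpath('//description/text()') can contain several sibling elements, which is why the parser reports extra content at the end of the document. Wrap the string in a single element so that the fragment you parse has exactly one root.
Also note that the text in the original XML has a space before the name in each closing tag (< /FalsePositives>), which you should remove.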
from lxml import etree as ET
xml = ET.parse("test.xml")
for description in xml.xpath('//description/text()'):
    # add a root tag and remove the space before the closing tag name
    x = ET.XML('<Testroot>' + description.replace('< /', '</') + '</Testroot>')
    vulnDiscussion = next(iter(x.xpath('//VulnDiscussion/text()')), None)
    if vulnDiscussion:
        print(vulnDiscussion)
Output
Discretionary access control is weakened if a user or group has access permissions to system files and directories greater than the default.
Satisfies: SRG-OS-000257-GPOS-00098, SRG-OS-000278- GPOS-00108
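The single-root rule is easy to reproduce in isolation. The following is a minimal sketch, not the answerer's code: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages, and the fragment string and the Wrapper element name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two sibling elements with no common parent: not a well-formed document.
fragment = "<VulnDiscussion>Some text</VulnDiscussion><Documentable>false</Documentable>"

try:
    ET.fromstring(fragment)
    parsed_bare = True
except ET.ParseError:
    # The parser rejects the second top-level element.
    parsed_bare = False

# Wrapping the fragment in one (hypothetical) root element makes it parseable.
wrapped = ET.fromstring("<Wrapper>" + fragment + "</Wrapper>")
discussion = wrapped.findtext("VulnDiscussion")
print(parsed_bare, discussion)
```

The wrapper name is arbitrary; it only has to be a name that does not clash with anything you later search for.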

Related

parse xml file when element contains smth. special with python

I would like to parse an XML file and write some parts of it to a CSV file, using Python. I am pretty new to programming and XML. I have read a lot, but I couldn't find a useful example for my problem.
My XML file looks like this:
<Host name="1.1.1.1">
<Properties>
<tag name="id">1</tag>
<tag name="os">windows</tag>
<tag name="ip">1.11.111.1</tag>
</Properties>
<Report id="123">
<output>
Host is configured to get updates from another server.
Update status:
last detected: 2015-12-02 18:48:28
last downloaded: 2015-11-17 12:34:22
last installed: 2015-11-23 01:05:32
Automatic settings:.....
</output>
</Report>
<Report id="123">
<output>
Host is configured to get updates from another server.
Environment Options:
Automatic settings:.....
</output>
</Report>
</Host>
My XML file contains 500 of these entries! I only want to parse the XML blocks where the output contains Update status, because I want to write the three dates (last detected, last downloaded, and last installed) to my CSV file. I would also like to add the id, os, and ip.
I tried the ElementTree library, but I am not able to filter element.text for outputs that contain Update status. At the moment I can extract all text and attributes from the whole file, but I cannot filter the blocks where the output contains Update status, last detected, last downloaded, or last installed.
Can anyone give some advice on how to achieve this?
desired output:
id:1
os:windows
ip:1.11.111.1
last detected: 2015-12-02 18:48:28
last downloaded: 2015-11-17 12:34:22
last installed:2015-11-23 01:05:32
All of this information should be written to a .csv file.
At the moment my code looks like this:
#!/usr/bin/env python
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("file.xml")
root = tree.getroot()
# open csv file for writing
data = open('test.csv', 'w')
# create csv writer object
csvwriter = csv.writer(data)
# filter xml file
for tag in root.findall(".Host/Properties/tag[@name='ip']"):
    print(tag.text)  # gives all IPs from the whole XML
for output in root.iter('output'):
    print(output.text)  # gives all outputs from the whole XML
data.close()
Best regards
It's relatively straightforward when you start at the <Host> element and work your way down.
Iterate all the nodes, but only output something when the substring "Update status:" occurs in the value of <output>:
for host in tree.iter("Host"):
    # look up the three properties once per host
    host_id = host.find('./Properties/tag[@name="id"]')
    host_os = host.find('./Properties/tag[@name="os"]')
    host_ip = host.find('./Properties/tag[@name="ip"]')
    for output in host.iter("output"):
        if output.text is not None and "Update status:" in output.text:
            print("id:" + host_id.text)
            print("os:" + host_os.text)
            print("ip:" + host_ip.text)
            for line in output.text.splitlines():
                if ("last detected:" in line or
                        "last downloaded" in line or
                        "last installed" in line):
                    print(line.strip())
outputs this for your sample XML:
id:1
os:windows
ip:1.11.111.1
last detected: 2015-12-02 18:48:28
last downloaded: 2015-11-17 12:34:22
last installed: 2015-11-23 01:05:32
Minor point: That's not really CSV, so writing that to a *.csv file as-is wouldn't be very clean.
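If actual CSV output is wanted, one possible approach is to collect one row per matching output block and hand the rows to csv.DictWriter. This is a rough sketch under assumptions that are not in the thread: the Hosts wrapper element, the trimmed sample data, and the column names are invented for illustration, and an in-memory buffer stands in for the real test.csv file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed, made-up sample; the <Hosts> wrapper is an assumption.
SAMPLE = """<Hosts>
  <Host name="1.1.1.1">
    <Properties>
      <tag name="id">1</tag>
      <tag name="os">windows</tag>
      <tag name="ip">1.11.111.1</tag>
    </Properties>
    <Report id="123">
      <output>
Update status:
last detected: 2015-12-02 18:48:28
last downloaded: 2015-11-17 12:34:22
last installed: 2015-11-23 01:05:32
      </output>
    </Report>
    <Report id="124">
      <output>Environment Options:</output>
    </Report>
  </Host>
</Hosts>"""

FIELDS = ["id", "os", "ip", "last detected", "last downloaded", "last installed"]


def host_rows(root):
    """Yield one dict per <output> block that mentions 'Update status:'."""
    for host in root.iter("Host"):
        props = {t.get("name"): t.text for t in host.findall("./Properties/tag")}
        for output in host.iter("output"):
            text = output.text or ""
            if "Update status:" not in text:
                continue
            row = {key: props.get(key, "") for key in ("id", "os", "ip")}
            for line in text.splitlines():
                # Split "last detected: 2015-..." at the first colon only.
                key, sep, value = line.strip().partition(":")
                if sep and key in FIELDS:
                    row[key] = value.strip()
            yield row


buffer = io.StringIO()  # stands in for open("test.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
rows = list(host_rows(ET.fromstring(SAMPLE)))
writer.writerows(rows)
print(buffer.getvalue())
```

Using partition(":") rather than split(":") keeps the colons inside the timestamps intact.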

ParseError while parsing AndroidManifest.xml in python

I'm trying to parse an AndroidManifest.xml file to extract information, and I get this error when loading my file:
xml.etree.ElementTree.ParseError: not well-formed (invalid token):
line 1, column 0
Here is my code :
import xml.etree.cElementTree as ET
tree = ET.ElementTree(file='AndroidManifest.xml')
root = tree.getroot()
My XML file seems well formed :
<?xml version="1.0" encoding="utf-8"?>
<manifest
xmlns:android="http://schemas.android.com/apk/res/android"
android:versionCode="132074037"
android:versionName="193.0.0.21.98"
android:installLocation="0"
package="com.facebook.orca">
How can I fix that and parse my XML to get the android:versionName attribute?
Solved
I was trying to parse an AndroidManifest.xml taken from an unzipped APK, but inside an APK the manifest is stored in Android's binary XML format, so it cannot be opened, read, or parsed as plain text. I could read it only in Android Studio, which decodes the manifest automatically.
To get information from the AndroidManifest.xml of an APK, the best way is to use the aapt command-line tool:
/Users/{Path_to_your_sdk}/sdk/build-tools/28.0.3/aapt dump badging com.squareup.cash.apk | sed -n "s/.*versionName='\([^']*\).*/\1/p"
And you will obtain the versionName of your app. Hope it will help.
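Since the question was about Python, the sed step could also be done there. The sketch below only covers the text extraction; the sample line is hypothetical and merely follows the versionName='...' shape that the sed expression above expects, and running aapt itself (for instance through subprocess.run) is left out.

```python
import re

# Same idea as the sed expression: capture what sits between the quotes
# that follow versionName=.
VERSION_NAME = re.compile(r"versionName='([^']*)'")


def version_name(badging_output):
    """Return the first versionName value found in the given text, or None."""
    match = VERSION_NAME.search(badging_output)
    return match.group(1) if match else None


# Hypothetical output line, made up for illustration.
sample = "package: name='com.example.app' versionCode='42' versionName='1.2.3'"
print(version_name(sample))
```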

Python XML Iterparse halt on text

I am new to python, using 3.x, and am running into an issue with an XML file that I'm testing/learning on. When I look at the raw file (which is ASCII encoded btw), the issue (I'm pretty sure) is that there's a U+00A0 code in there.
The XML is as follows:
<?xml version="1.0" encoding="utf-8"?>
<XMLSetData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.clientsite.com/subdir/r2.4/v1">
<FileCreationDate>2018-05-05T11:35:44.1043858-05:00</FileCreationDate>
<XMLSetDataList>
<DataIDNumber>99345346</DataIDNumber>
<DataName>RSRS TVL5697 ULLĀ  Georgetown</DataName>
</XMLSetDataList>
</XMLSetData>
Using Notepad++, it shows me that the text has "xA0 " instead of " " (two spaces) between ULL and Georgetown. So when I do the code below:
import xml.etree.ElementTree as ET
events = ("end", "start-ns", "end-ns")
for event, elem in ET.iterparse(xml_file, events=events):
    if event == "end":
        eltag = elem.tag
        eltext = elem.text
        print(eltag, eltext)
It gives me an error stating:
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1222, in iterator
yield from pullparser.read_events()
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1297, in read_events
raise event
File "C:\Users\d\AppData\Local\Programs\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1269, in feed
self._parser.feed(data)
File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 6, column 30
How do I fix this / get around it? If I remove the xA0 part, it parses fine, but obviously something like this may come up again, and I'd like to programmatically handle it.
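The thread has no answer to this question, so the following is only one possible workaround, offered under a stated assumption: the file declares UTF-8 but actually contains a raw 0xA0 byte, which is a no-break space in Latin-1 and not a valid UTF-8 sequence on its own. The helper name open_as_utf8 and the sample bytes are made up. If the bytes do not decode as UTF-8, the sketch re-encodes the whole input from a fallback encoding before parsing.

```python
import io
import xml.etree.ElementTree as ET


def open_as_utf8(raw, fallback="latin-1"):
    """Return a binary stream of UTF-8 bytes, re-encoding from `fallback` if needed."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the whole input is really in the fallback encoding.
        raw = raw.decode(fallback).encode("utf-8")
    return io.BytesIO(raw)


# Made-up document containing a raw 0xA0 byte.
raw = b'<?xml version="1.0" encoding="utf-8"?><DataName>ULL\xa0 Georgetown</DataName>'

names = [elem.text for _, elem in ET.iterparse(open_as_utf8(raw), events=("end",))]
print(names)
```

For a real file, the bytes would come from open(xml_file, "rb").read(). Note that this would mangle a file that is mostly valid UTF-8 with a single stray byte, because the fallback is applied to the entire input.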

Parsing HTML tag with ":" with lxml

I am new to Python and I'm trying to parse an HTML page with lxml. I want to get the text from a <p> tag, but inside it there is a strange tag like this:
<p style="margin-left:0px;padding:0 0 0 0;float:left;">
<g:plusone size="medium">
</g:plusone>
</p>
How can I ignore this tag inside <p>? I want to strip all tags containing ":" from any HTML page, because other lxml functions don't work properly with tags like this.
parser=etree.HTMLParser()
tree = etree.parse('problemtags.html',parser)
root=tree.getroot()
text = [ b.text for b in root.iterfind(".//p")]
I expect to get the text inside the <p> tags, but this fails on fragments like the one above, with the message "b'Tag g:plusone invalid'". All I need is to ignore incorrect tags like this one. I don't know how many such tags I will encounter in the future, but I think the problem really is the ":", because when I read .tag to get the name, it is just "plusone", not "g:plusone".
Here is a way I found to clean up the html:
from io import StringIO
from lxml import etree
s = '''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
<g:plusone size="medium">
</g:plusone>
</p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(s), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result.decode())  # tostring() returns bytes on Python 3
This prints
<html><body><p style="margin-left:0px;padding:0 0 0 0;float:left;">
<plusone size="medium">
</plusone>
</p></body></html>
To get the root element (an etree._Element) from an etree._ElementTree, call getroot():
root = tree.getroot()
print(type(root))  # prints <class 'lxml.etree._Element'>
According to the documentation of the _Element class, lxml.etree._Element is the class of all element objects; in other words, it is what the etree.Element factory returns. For example:
el = etree.Element("an_etree.Element_reference")
print(type(el))  # prints <class 'lxml.etree._Element'>
The g: is a namespace prefix. The actual tag name is only plusone. So, lxml is correct in only returning plusone as the tag name. See a summary of namespaces here.
As I understand it, lxml's HTML Parser is not namespace aware. However, the XML Parser is. Presumably, given that this HTML document contains XML, it is most likely actually an XHTML document (if not, then it is probably an invalid HTML document and you cannot expect lxml to parse it correctly). Therefore, you need to run it through the XML Parser rather than HTML Parser. lxml's namespace API is explained in their tutorial.
However, with the fragment you provided the parser returns this:
>>> d = etree.fromstring('''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
... <g:plusone size="medium">
... </g:plusone>
... </p>''')
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)
File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102470)
File "parser.pxi", line 1674, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:101299)
File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:96481)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:91290)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:92476)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: Namespace prefix g on plusone is not defined, line 2, column 23
Note that it complains that the "Namespace prefix g on plusone is not defined." Presumably, the namespace prefix is defined elsewhere in your document. As I don't know what that is, I'll just make something up and define it on the plusone tag in your fragment:
>>> d = etree.fromstring('''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
... <g:plusone xmlns:g="something" size="medium">
... </g:plusone>
... </p>''')
>>> d
<Element p at 0x2563cd8>
>>> d.tag
'p'
>>> d[0]
<Element {something}plusone at 0x2563940>
>>> d[0].tag
'{something}plusone'
Notice that the g: prefix was replaced with the actual namespace ({something} in this case, because I set it with xmlns:g="something"). Usually the namespace would actually be a URI, so you may find that your tag looks something like this: {http://where.it/is/from.xml}plusone
Nevertheless, I find working with namespaces rather bothersome when they are not necessary. You may actually find it easier to use the HTML parser, which ignores the namespaces. Now that you know that the tag is named plusone, not g:plusone, you may be able to get on with your work using just the HTML parser.
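If you do go the namespace-aware route, lookups need the expanded name. The sketch below shows two ways to write such a lookup. As an assumption to keep it dependency-free, it uses the standard library's xml.etree.ElementTree, which uses the same {uri}name notation, rather than lxml; "something" is the placeholder namespace from the example above.

```python
import xml.etree.ElementTree as ET

# "something" is the placeholder namespace URI from the example above.
doc = ET.fromstring(
    '<p style="float:left;">'
    '<g:plusone xmlns:g="something" size="medium"></g:plusone>'
    '</p>'
)

# Option 1: spell out the expanded name in {uri}name form.
by_clark = doc.find("{something}plusone")

# Option 2: map a prefix of your choosing to the URI for this lookup only.
by_prefix = doc.find("g:plusone", namespaces={"g": "something"})

print(by_clark.tag, by_prefix.get("size"))
```

The prefix in the second lookup does not have to match the one used in the document; only the URI matters.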

WordPress XML ParseError: unbound prefix?

I am trying to use Python's xml.etree.ElementTree.parse() function to parse an XML file I created by exporting all of the content from a WordPress blog. However, when I try like so:
import xml.etree.ElementTree as xml
tree = xml.parse('/path/to/file.xml')
I get the following error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
self._raiseerror(v)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
raise err
ParseError: unbound prefix: line 189, column 1
Here's what's on line 189 of my XML file:
<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blogname.wordpress.com/osd.xml" title="blog name" />
I've seen many questions about this error coming up with Android development, but I can't tell if and how that applies to my situation. Can anyone help with this?
Apologies to everyone for whom this was stupidly obvious, but it turns out I simply didn't have a namespace definition for "atom" in the document. I'm guessing that "unbound prefix" means that the prefix "atom" wasn't "bound" to a namespace definition?
Anyway, adding said definition has solved the problem. Although it makes me wonder why WordPress exports XML files without proper definitions for all of the namespaces they use...
If you remove all the namespace prefixes, it works absolutely fine.
Change
<s:home>USA</s:home>
to
<home>USA</home>
Just in case it helps someone some day, I was also working with a WordPress XML export (WordPress eXtended RSS) file in Python and was getting the same error. In my case, WordPress had included most of the correct namespace definitions. However, the XML had iTunes podcast information as well, and the iTunes namespace declaration was not present.
I fixed it by adding xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" into the RSS declaration block. So this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
>
became this:
<!-- generator="WordPress/4.9.8" created="2018-08-06 03:12" -->
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.2/"
xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
>
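The same failure and fix can be reproduced in a few lines. This is a minimal sketch rather than anything taken from a WordPress export: the one-element rss document is made up, and the Atom namespace URI is supplied here as the standard one, since the thread does not say which declaration was actually added.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"  # the standard Atom namespace URI

# Made-up document that uses the atom: prefix without declaring it.
broken = (
    '<rss version="2.0">'
    '<atom:link rel="search" href="http://blogname.wordpress.com/osd.xml" />'
    '</rss>'
)

try:
    ET.fromstring(broken)
    error = None
except ET.ParseError as exc:
    error = str(exc)  # the message begins with "unbound prefix"

# Declaring the prefix on the root element binds it for every descendant.
fixed = broken.replace(
    '<rss version="2.0">', '<rss version="2.0" xmlns:atom="%s">' % ATOM_NS, 1
)
link = ET.fromstring(fixed).find("{%s}link" % ATOM_NS)
print(error, link.get("rel"))
```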
