Fortify XML Injection issue in Python

I need to create a very small snippet of HTML in my Python application, and I use ElementTree to build it. When scanning my application, Fortify detects an XML Injection vulnerability in the following code:
from xml.etree.ElementTree import Element, SubElement
html = Element('html')
head = SubElement(html, 'head')
I tried to escape the text using
from xml.etree.ElementTree import Element, SubElement
from xml.sax.saxutils import escape
html = Element(escape('html'))
head = SubElement(html, escape('head'))
but Fortify still detects a vulnerability. How can I rewrite this code so that Fortify doesn't complain?

Related

Parsing file object XML with lxml returns external entity error [duplicate]

I'm trying to parse an XML document I retrieve from the web, but it crashes after parsing with this error:
': failed to load external entity "<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>
That is the second line in the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is the code I have so far:
import urllib2
import lxml.etree as etree
file = urllib2.urlopen("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
data = file.read()
file.close()
tree = etree.parse(data)
In concert with what mzjn said, if you do want to pass a string to etree.parse(), just wrap it in a StringIO object.
Example:
from lxml import etree
from StringIO import StringIO
myString = "<html><p>blah blah blah</p></html>"
tree = etree.parse(StringIO(myString))
This method is used in the lxml documentation.
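On Python 3 the StringIO module no longer exists; the equivalent classes live in io. Here is a minimal sketch of the same wrapping idea, using the standard library's xml.etree.ElementTree in place of lxml so that it runs without third-party packages (that substitution is my assumption; the sample markup is the one from the example above):

```python
import io
import xml.etree.ElementTree as ET

# XML content already held in memory, e.g. the result of a read() call
data = b"<html><p>blah blah blah</p></html>"

# parse() wants a file name or file-like object, so wrap the bytes
tree = ET.parse(io.BytesIO(data))
root = tree.getroot()

print(root.tag)             # html
print(root.find("p").text)  # blah blah blah
```

io.BytesIO is used here because data read from a network response is normally bytes; io.StringIO plays the same role for text.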
etree.parse(source) expects source to be one of
a file name/path
a file object
a file-like object
a URL using the HTTP or FTP protocol
The problem is that you are supplying the XML content as a string.
You can also do without urllib2.urlopen(). Just use
tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
Demonstration (using lxml 2.3.4):
>>> from lxml import etree
>>> tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
>>> tree.getroot()
<Element {http://www.w3.org/2005/Atom}feed at 0xedaa08>
>>>
In a competing answer, it is suggested that lxml fails because of the stylesheet referenced by the processing instruction in the document. But that is not the problem here. lxml does not try to load the stylesheet, and the XML document is parsed just fine if you do as described above.
If you want to actually load the stylesheet, you have to be explicit about it. Something like this is needed:
from lxml import etree
tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
# Create an _XSLTProcessingInstruction object
pi = tree.xpath("//processing-instruction()")[0]
# Parse the stylesheet and return an ElementTree
xsl = pi.parseXSL()
The lxml docs for parse say: "To parse from a string, use the fromstring() function instead."
parse(...)
parse(source, parser=None, base_url=None)
Return an ElementTree object loaded with source elements. If no parser
is provided as second argument, the default parser is used.
The ``source`` can be any of the following:
- a file name/path
- a file object
- a file-like object
- a URL using the HTTP or FTP protocol
To parse from a string, use the ``fromstring()`` function instead.
Note that it is generally faster to parse from a file path or URL
than from an open file object or file-like object. Transparent
decompression from gzip compressed sources is supported (unless
explicitly disabled in libxml2).
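As a rough illustration of the fromstring() route, here is a sketch that uses the standard library's xml.etree.ElementTree, which offers a fromstring() function with the same basic behaviour. The tiny Atom-like document is made up for the example and is not the real Green Button feed:

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory document standing in for the downloaded feed
data = '<feed xmlns="http://www.w3.org/2005/Atom"><title>demo</title></feed>'

# fromstring() takes the XML content itself and returns the root Element
root = ET.fromstring(data)

ATOM = "{http://www.w3.org/2005/Atom}"
print(root.tag)                      # {http://www.w3.org/2005/Atom}feed
print(root.find(ATOM + "title").text)  # demo
```

Note that fromstring() returns the root element directly, whereas parse() returns an ElementTree on which you call getroot().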
You're getting that error because the XML you're loading references an external resource:
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>
lxml doesn't know how to resolve GreenButtonDataStyleSheet.xslt. You and I probably realize that it's going to be available relative to your original URL, http://www.greenbuttondata.org/data/15MinLP_15Days.xml. The trick is to tell lxml how to go about loading it.
The lxml documentation includes a section titled "Document loading and URL resolving", which has just about all the information you need.
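To make the "relative to your original URL" part concrete, here is a small sketch that only computes where the stylesheet reference would point. It uses urllib.parse.urljoin from the standard library; it does not fetch anything, and I have not checked that the resulting URL actually exists:

```python
from urllib.parse import urljoin

# URL the document was downloaded from (taken from the question)
document_url = "http://www.greenbuttondata.org/data/15MinLP_15Days.xml"
# href value from the xml-stylesheet processing instruction
stylesheet_href = "GreenButtonDataStyleSheet.xslt"

# Resolve the relative href against the document's own URL
stylesheet_url = urljoin(document_url, stylesheet_href)
print(stylesheet_url)
# http://www.greenbuttondata.org/data/GreenButtonDataStyleSheet.xslt
```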

Remove ns0 from XML

I have an XML file where I would like to edit certain attributes. I am able to properly edit the attributes but when I write the changes to the file, the tags have a strange "ns0" added onto them. How can I get rid of this? This is what I have tried and have been unsuccessful. I am working in Python and using lxml.
import xml.etree.ElementTree as ET
from xml.etree import ElementTree as etree
from lxml import etree, objectify
frag_xml_tree = ET.parse(xml_name)
frag_root = frag_xml_tree.getroot()
for e in frag_root:
    for elem in frag_root.iter(e):
        elem.attrib[frag_param_name] = update_val
etree.register_namespace("", "http://www.w3.org/2001")
frag_xml_tree.write(xml_name)
However, when I do this, I only get the error Invalid tag name u''. I thought this error came up if the xml tags started with digits but that is not the case with my xml. I am really stuck on how to proceed. Thanks
Actually the way to do it seemed to be a combination of two things.
The import statement is import xml.etree.ElementTree as ET
ET.register_namespace("", NAMESPACE) is the correct call, where NAMESPACE is the namespace listed in the input XML, i.e. the URL after xmlns.
Here's the corrected code using only xml.etree.ElementTree instead of lxml:
import xml.etree.ElementTree as ET
frag_xml_tree = ET.parse(xml_name)
frag_root = frag_xml_tree.getroot()
for e in frag_root:
    for elem in frag_root.iter(e):
        elem.attrib[frag_param_name] = update_val
ET.register_namespace("", "http://www.w3.org/2001")
frag_xml_tree.write(xml_name)
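To see the effect in isolation, here is a small self-contained sketch. The namespace URL and the document are invented for the demonstration; in real code NAMESPACE would be whatever follows xmlns in your input file:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and document, for illustration only
NAMESPACE = "http://example.com/ns"
xml_text = '<root xmlns="http://example.com/ns"><item name="a"/></root>'

root = ET.fromstring(xml_text)
root.find("{%s}item" % NAMESPACE).set("name", "b")  # edit an attribute

# Without a registered prefix, ElementTree invents ns0 when serializing
before = ET.tostring(root, encoding="unicode")

# Register the namespace as the default (empty prefix), then serialize again
ET.register_namespace("", NAMESPACE)
after = ET.tostring(root, encoding="unicode")

print(before)
print(after)
```

The first printed line carries the ns0: prefixes and the second does not. Keep in mind that register_namespace() changes a module-wide registry, so it affects every later serialization in the same process.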
The following snippet removes ns0 throughout the XML file:
for i in range(len(root)):
    print(root[i])
ET.register_namespace("", NAMESPACE)
tree.write('TP_updated2.xml', xml_declaration=True, method='xml', encoding="utf8", default_namespace=None)
Here NAMESPACE is the URL after xmlns.

How to (push) parse XML files in Python?

I've already seen this question, but it's from 2009.
What's a simple modern way to handle XML files in Python 3?
I.e., from this TLD (adapted from here):
<?xml version="1.0" encoding="UTF-8" ?>
<taglib>
  <tlib-version>1.0</tlib-version>
  <short-name>bar-baz</short-name>
  <tag>
    <name>present</name>
    <tag-class>condpkg.IfSimpleTag</tag-class>
    <body-content>scriptless</body-content>
    <attribute>
      <name>test</name>
      <required>true</required>
      <rtexprvalue>true</rtexprvalue>
    </attribute>
  </tag>
</taglib>
I want to parse TLD files (Java Server Pages Tag Library Descriptors), to obtain some sort of structure in Python (I have still to decide about that part).
Hence, I need a push parser. But I won't do much more with it, so I'd prefer a simple API (I'm new to Python).
xml.etree.ElementTree is still there, in the standard library:
import xml.etree.ElementTree as ET
data = """your xml here"""
tree = ET.fromstring(data)
print(tree.find('tag/name').text) # prints "present"
If you look outside of the standard library, there is a very popular and fast lxml module that follows the ElementTree interface and supports Python3:
from lxml import etree as ET
data = """your xml here"""
tree = ET.fromstring(data)
print(tree.find('tag/name').text) # prints "present"
Besides, there is lxml.objectify that allows you to deal with XML structure like with a Python object.
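If you really do want push-style parsing with only the standard library, one possible approach is to give xml.etree.ElementTree.XMLParser a target object and feed it data in chunks. This is only a sketch: the TagNameCollector class, the 64-byte chunk size, and the choice to collect just the tag names are all my own assumptions, not something from the question.

```python
import xml.etree.ElementTree as ET

TLD = b"""<?xml version="1.0" encoding="UTF-8" ?>
<taglib>
  <tlib-version>1.0</tlib-version>
  <short-name>bar-baz</short-name>
  <tag>
    <name>present</name>
    <tag-class>condpkg.IfSimpleTag</tag-class>
    <attribute>
      <name>test</name>
      <required>true</required>
    </attribute>
  </tag>
</taglib>"""


class TagNameCollector:
    """Hypothetical parser target: the parser pushes events into it."""

    def __init__(self):
        self.path = []    # stack of currently open element names
        self.names = []   # text of every taglib/tag/name element
        self._text = []   # character data of the innermost element

    def start(self, tag, attrib):
        self.path.append(tag)
        self._text = []

    def data(self, text):
        self._text.append(text)

    def end(self, tag):
        # Only direct <name> children of <tag>, not attribute names
        if self.path == ["taglib", "tag", "name"]:
            self.names.append("".join(self._text).strip())
        self.path.pop()

    def close(self):
        return self.names


parser = ET.XMLParser(target=TagNameCollector())
for i in range(0, len(TLD), 64):      # feed the document piece by piece
    parser.feed(TLD[i:i + 64])
tag_names = parser.close()            # returns whatever target.close() returns
print(tag_names)                      # ['present']
```

For a file this small, the plain fromstring() approach above is simpler; a target-based parser is mostly worth it when the data arrives incrementally or is too large to hold as a tree.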

Reading a page and parsing it with minidom.parse or minidom.parseString in Python?

I have tried each of these two snippets:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parse(res)
which gives me the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Or this:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parseString(res.read())
which gives me the same error. res.read() reads fine and is a string.
I would like to parse through the code later. How can I do this using xml.dom.minidom?
The reason you're getting this error is that the page isn't valid XML. It's HTML 5. The doctype right at the top tells you this, even if you ignore the content type. You can't parse HTML with an XML parser.*
If you want to stick with what's in the stdlib, you can use html.parser (Python 3.x) / HTMLParser (2.x).** However, you may want to consider third-party libraries like lxml (which, despite the name, can parse HTML), html5lib, or BeautifulSoup (which wraps up a lower-level parser in a really nice interface).
* Well, unless it's XHTML, or the XML output of HTML5, but that's not the case here.
** Do not use htmllib unless you're using an old version of Python without a working HTMLParser. This module is deprecated for a reason.
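Here is a minimal sketch of the stdlib route on Python 3. The PAGE string is an invented stand-in for a downloaded page, and LinkCollector is a hypothetical helper name; the point is only that html.parser copes with markup that makes minidom raise.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up HTML5 page; the unclosed <br> is legal HTML but not well-formed XML
PAGE = (
    "<!DOCTYPE html><html><head><title>Demo</title></head>"
    "<body><p>One<br>Two</p><a href='/next'>next</a></body></html>"
)

# An XML parser rejects the page
try:
    minidom.parseString(PAGE)
    xml_ok = True
except ExpatError:
    xml_ok = False


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed(PAGE)
collector.close()

print(xml_ok)           # False
print(collector.links)  # ['/next']
```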
