Escaping '<' and '>' in xml when using xml.dom.minidom - python

I am stuck while escaping "<" and ">" in the xml file using xml.dom.minidom.
I tried to get the Unicode hex values from http://slayeroffice.com/tools/unicode_lookup/ and use those instead.
I also tried the standard "&lt;" and "&gt;" entities, but still with no success.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
s1 = '<hello>bhaskar</hello>'
text = doc.createTextNode(s1)
e.appendChild(text)
e.toxml()
'<abc>&lt;hello&gt;bhaskar&lt;/hello&gt;</abc>'
same result with writexml()
Also tried specifying the encoding ('UTF-8', 'utf-8', 'utf') in the toxml() and writexml() calls, but with the same results.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
s1 = u'<hello>bhaskar</hello>'
text = doc.createTextNode(s1)
e.appendChild(text)
e.toxml()
u'<abc>&lt;hello&gt;bhaskar&lt;/hello&gt;</abc>'
Tried other ways, but with the same results.
The only way I could work around it was by overriding the writer:
import xml.dom.minidom as md

# XXX Hack to handle '<' and '>'
def wd(writer, data):
    data = data.replace("&lt;", "<").replace("&gt;", ">")
    writer.write(data)

md._write_data = wd
Edit - This is the code.
import xml.dom.minidom as md
doc = md.Document()
entity_descr = doc.createElement("EntityDescriptor")
doc.appendChild(entity_descr)
entity_descr.setAttribute('xmlns', 'urn:oasis:names:tc:SAML:2.0:metadata')
entity_descr.setAttribute('xmlns:saml', 'urn:oasis:names:tc:SAML:2.0:assertion')
entity_descr.setAttribute('xmlns:ds', 'http://www.w3.org/2000/09/xmldsig#')
# Get the entity_id from saml20_idp_settings
entity_descr.setAttribute('entityID', self.group['entity_id'])
idpssodescr = doc.createElement('IDPSSODescriptor')
idpssodescr.setAttribute('WantAuthnRequestsSigned', 'true')
idpssodescr.setAttribute('protocolSupportEnumeration',
                         'urn:oasis:names:tc:SAML:2.0:protocol')
entity_descr.appendChild(idpssodescr)
keydescr = doc.createElement('KeyDescriptor')
keydescr.setAttribute('use', 'signing')
idpssodescr.appendChild(keydescr)
keyinfo = doc.createElement('ds:KeyInfo')
keyinfo.setAttribute('xmlns:ds', 'http://www.w3.org/2000/09/xmldsig#')
keydescr.appendChild(keyinfo)
x509data = doc.createElement('ds:X509Data')
keyinfo.appendChild(x509data)
# check this part
s = "this is a cert blah blah"
x509cert = doc.createElement('ds:X509Certificate')
cert = doc.createTextNode(s)
x509cert.appendChild(cert)
x509data.appendChild(x509cert)
sso = doc.createElement('SingleSignOnService')
sso.setAttribute('Binding', 'urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect')
sso.setAttribute('Location', 'http://googleapps/singleSignOn')
idpssodescr.appendChild(sso)
# Write the metadata file.
fobj = open('metadata.xml', 'w')
doc.writexml(fobj, " ", "", "\n", "UTF-8")
fobj.close()
This produces
<?xml version="1.0" encoding="UTF-8"?>
<EntityDescriptor entityID="skar" xmlns="urn:oasis:names:tc:SAML:2.0:metadata"
xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
<IDPSSODescriptor WantAuthnRequestsSigned="true"
protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
<KeyDescriptor use="signing">
<ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<ds:X509Data>
<ds:X509Certificate>
this is a cert blah blah
</ds:X509Certificate>
</ds:X509Data>
</ds:KeyInfo>
</KeyDescriptor>
<SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
Location="http:///singleSignOn"/>
</IDPSSODescriptor>
</EntityDescriptor>
Note the "This is a cert" comes seperately
Have broken my head over this but with the same results.

This is not a bug, it is a feature. To insert actual XML, insert DOM objects instead. Text inside an XML tag needs to be entity escaped though to be valid XML.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
eh = doc.createElement("hello")
s1 = 'bhaskar'
text = doc.createTextNode(s1)
eh.appendChild(text)
e.appendChild(eh)
e.toxml()
EDIT: I don't know what Python's API is like, but it looks very similar to C#'s, so you might be able to do something like e.innerXml = s1 to do what you're trying to do... but that could be bad. The better thing to do is parse it and appendChild it as well.
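For illustration, here is a minimal sketch of that parse-and-append approach (not from the original answer; it assumes the fragment is itself well-formed XML):
from xml.dom.minidom import Document, parseString

# Parse the fragment into its own Document, then graft its root element
# into the target document with importNode + appendChild.
doc = Document()
e = doc.createElement("abc")

fragment = parseString('<hello>bhaskar</hello>')        # a separate Document
node = doc.importNode(fragment.documentElement, True)   # deep copy into doc
e.appendChild(node)

print(e.toxml())   # <abc><hello>bhaskar</hello></abc>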
EDIT 2: I just ran this via Python locally, and there's definitely something wrong on your end, not in the libraries. Make sure that your string doesn't have any newlines or whitespace at the start of it. For reference, the test code I used was:
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.dom.minidom import Document
>>> cert = "---- START CERTIFICATE ----\n Hello world\n---- END CERTIFICATE ---"
>>> doc = Document()
>>> e = doc.createElement("cert")
>>> certEl = doc.createTextNode(cert)
>>> e.appendChild(certEl)
<DOM Text node "'---- START'...">
>>> print e.toxml()
<cert>---- START CERTIFICATE ----
Hello world
---- END CERTIFICATE ---</cert>
>>>
EDIT 3: The final edit. The problem is in your writexml call. Simply using the following fixes this:
doc.writexml(fobj)
# or
doc.writexml(fobj, "", " ", "")
Unfortunately, it seems that you won't be able to use the newline parameter to get pretty printing though... it seems that the Python library (or at least minidom) is written rather poorly and will modify text nodes while printing them. Not so much a poor implementation as a naive one. A shame really...

If you use "<" as text in XML, you need to escape it, else it is considered markup. So xml.dom is right in escaping it, since you've asked for a text node.
Assuming you really want to insert a piece of XML, I recommend to use createElement("hello"). If you have a fragment of XML that you don't know the structure of, you should first parse it, and then move the nodes of that parse result into the other tree.
If you want to hack, you can inherit from xml.dom.minidom.Text and override its writexml method. See the minidom source for details.
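A minimal sketch of that hack (the class name is my own, and note that the output is no longer guaranteed to be well-formed XML):
import xml.dom.minidom as md

class RawText(md.Text):
    # Emit the node's data verbatim, skipping minidom's entity escaping.
    def writexml(self, writer, indent="", addindent="", newl=""):
        writer.write(self.data)

doc = md.Document()
e = doc.createElement("abc")
t = RawText()
t.data = '<hello>bhaskar</hello>'
t.ownerDocument = doc
e.appendChild(t)

print(e.toxml())   # <abc><hello>bhaskar</hello></abc>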

Related

Getting raw text from lxml

Trying to get the text from an HtmlElement in lxml. For example, I have the HTML read in by
thing = lxml.html.fromstring("<code><div></code>")
But when I call thing.text I get <div>, meaning that lxml is translating escape characters. Is there a way to get this raw text, i.e., <div>? This is part of the output when I do lxml.html.tostring(thing), but that includes the opening and closing tags which I don't want.
Tried calling tostring with a few different encoding options but no luck.
So I looked into it a bit closer:
cdef tostring(...) in src\lxml\etree.pyx - see https://github.com/lxml/lxml/blob/master/src/lxml/etree.pyx
cdef _tostring(...) in src\lxml\serializer.pxi - see https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi
and I couldn't find anything that would suggest you could get the escaped string by configuring the parameters of the tostring() function. It seems like it will always return the unescaped string maybe due to security concerns ...
The way I see it, you would have to use another function such as html.escape to get the escaped string:
import lxml.html
from html import escape as html_escape

thing = lxml.html.fromstring("<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;</code>")
raw_thing = lxml.html.tostring(thing, method="text", encoding="unicode")  # <div>MY TEST DIV</div>
escaped_thing = html_escape(raw_thing)  # &lt;div&gt;MY TEST DIV&lt;/div&gt;
print(escaped_thing)
Essentially what you are looking for is lxml.html.tostring(root, method='text', encoding="unicode"):
import lxml.html
thing = lxml.html.fromstring("<code><div>MY TEST DIV</div></code>")
output = lxml.html.tostring(thing, method='xml', encoding="unicode")
print(output) # <code><div>MY TEST DIV</div></code>
The problem is that it cannot separate the root element from its child in <code><div>MY TEST DIV</div></code>
However with a different approach you can get the desired output:
import xml.etree.ElementTree as ET

thing = """
<code>&lt;div&gt;MY TEST DIV&lt;/div&gt;<div>&lt;div&gt;AAA&lt;/div&gt;</div><div>&lt;div&gt;BBB&lt;/div&gt;</div></code>
"""

root = ET.fromstring(thing)
root_text = ET._escape_attrib(root.text)
print(root_text)

for child in root:
    child_text = ET._escape_attrib(child.text)
    print(child_text)
The code above prints out:
&lt;div&gt;MY TEST DIV&lt;/div&gt;
&lt;div&gt;AAA&lt;/div&gt;
&lt;div&gt;BBB&lt;/div&gt;

How to output CDATA using yattag library

I'm trying, with the following code, to generate an XML file that contains <documents> tags.
string = "dasdd Wonder asdf new single, “Tomorrow” #URL# | " \
"oiojk asfddsf releases new asdfdf, “gfsg” | " \
"Identity of asfqw who dasd off asdfsdf Mainland jtyjyjui revealed #URL#"
from yattag import Doc, indent
import html, re
doc, tag, text = Doc().tagtext()
with tag('author', lang='en'):
with tag('documents'):
for tweet in string.split(' | '):
with tag('document'):
tweet = html.unescape(tweet)
text('<![CDATA[{}]]'.format(tweet))
result = indent(doc.getvalue(), indentation=' ' * 4, newline='\n')
with open('test.xml', 'w', encoding='utf-8') as f:
f.write(result)
I wanted to add a CDATA token around the text, but when I open the generated file using Notepad++, instead of having the output as:
<document><![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]></document>
it appears like this (with HTML entities):
<document>&lt;![CDATA[oiojk asfddsf releases new asdfdf, “gfsg”]]</document>
I tried to use the html library (the html.unescape line) to get rid of the HTML entities, but I wasn't able to.
How can I solve this encoding issue?
The text method always replaces '<' with '&lt;'. If you wanted no escaping of that kind, you would use the asis method instead (it inserts the string "as is"). But, in your case, it would be more appropriate to use Yattag's cdata method.
from yattag import Doc
help(Doc.cdata)
cdata(self, strg, safe=False) appends a CDATA section containing the supplied string.
You don't have to worry about potential ]]> sequences that would terminate
the CDATA section. They are replaced with ]]]]><![CDATA[>.
If you're sure your string does not contain ]]>, you can pass safe = True.
If you do that, your string won't be searched for ]]> sequences.
So, in your case, you can do:
for tweet in string.split(' | '):
    with tag('document'):
        tweet = html.unescape(tweet)
        doc.cdata(tweet)

XML parsing from URL with python

I am trying to parse XML from a URL.
Originally my code looked like this:
from xml.dom import minidom
xmldoc = minidom.parse('all.xml')
Sensor0Elm = xmldoc.getElementsByTagName('t0')
Sensor1Elm = xmldoc.getElementsByTagName('t1')
Sensor2Elm = xmldoc.getElementsByTagName('t2')
Sensor0Elm = Sensor0Elm[0]
Sensor1Elm = Sensor1Elm[0]
Sensor2Elm = Sensor2Elm[0]
Sensor0 = Sensor0Elm.childNodes[0].data
Sensor1 = Sensor1Elm.childNodes[0].data
Sensor2 = Sensor2Elm.childNodes[0].data
Sensor0 = float(Sensor0)
Sensor1 = float(Sensor1)
Sensor2 = float(Sensor2)
In this case the XML I intended to parse was on my local hard drive, and it worked perfectly!
The next step was to parse XML from a URL. A sensor meter from Allnet constantly publishes XML data on the network, which is accessible with a browser at the following URL: 192.168.60.242/xml
this is the embedded xml:
<HTML><HEAD><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
<devicename>ALL4000</devicename>
<n0>0</n0><t0> 1.27</t0><min0> 0.00</min0><max0> 2.55</max0><l0>-55</l0><h0>150</h0><s0>102</s0>
<n1>1</n1><t1> 2.53</t1><min1> 2.32</min1><max1> 10487.04</max1><l1>-55</l1><h1>150</h1><s1>102</s1>
<n2>2</n2><t2> 2.45</t2><min2> 0.00</min2><max2> 2.55</max2><l2>-55</l2><h2>150</h2><s2>102</s2>
<n3>3</n3><t3>-20480.00</t3><min3> 0.00</min3><max3> 5580.80</max3><l3>-55</l3><h3>150</h3><s3>0</s3>
<n4>4</n4><t4>-20480.00</t4><min4> 40.96</min4><max4> 41943.04</max4><l4>-55</l4><h4>150</h4><s4>0</s4>
<n5>5</n5><t5>-20480.00</t5><min5> 10.24</min5><max5> 0.08</max5><l5>-55</l5><h5>150</h5><s5>0</s5>
<n6>6</n6><t6>-20480.00</t6><min6> 0.00</min6><max6>-20480.00</max6><l6>-55</l6><h6>150</h6><s6>0</s6>
<n7>7</n7><t7>-20480.00</t7><min7> 0.00</min7><max7> 0.00</max7><l7>-55</l7><h7>150</h7><s7>0</s7>
<n8>8</n8><t8>-20480.00</t8><min8> 336855.04</min8><max8> 1342177.28</max8><l8>-55</l8><h8>150</h8><s8>0</s8>
<n9>9</n9><t9>-20480.00</t9><min9> 0.00</min9><max9> 0.00</max9><l9>-55</l9><h9>150</h9><s9>0</s9>
<n10>10</n10><t10>-20480.00</t10><min10> 0.00</min10><max10> 0.00</max10><l10>-55</l10><h10>150</h10><s10>0</s10>
<n11>11</n11><t11>-20480.00</t11><min11> 0.00</min11><max11> 0.00</max11><l11>-55</l11><h11>150</h11><s11>0</s11>
<n12>12</n12><t12>-20480.00</t12><min12> 0.00</min12><max12> 0.00</max12><l12>-55</l12><h12>150</h12><s12>0</s12>
<n13>13</n13><t13>-20480.00</t13><min13> 0.00</min13><max13> 0.00</max13><l13>-55</l13><h13>150</h13><s13>0</s13>
<n14>14</n14><t14>-20480.00</t14><min14> 0.00</min14><max14> 0.00</max14><l14>-55</l14><h14>150</h14><s14>0</s14>
<n15>15</n15><t15>-20480.00</t15><min15> 0.00</min15><max15> 0.00</max15><l15>-55</l15><h15>150</h15><s15>0</s15>
<fn0>1</fn0><ft0>0</ft0><fs0>0</fs0>
<fn1>2</fn1><ft1>0</ft1><fs1>0</fs1>
<fn2>3</fn2><ft2>0</ft2><fs2>0</fs2>
<fn3>4</fn3><ft3>0</ft3><fs3>0</fs3>
<fn4>5</fn4><ft4>0</ft4><fs4>0</fs4>
<fn5>6</fn5><ft5>0</ft5><fs5>0</fs5>
<fn6>7</fn6><ft6>0</ft6><fs6>0</fs6>
<fn7>8</fn7><ft7>0</ft7><fs7>0</fs7>
<fn8>9</fn8><ft8>0</ft8><fs8>0</fs8>
<fn9>10</fn9><ft9>0</ft9><fs9>0</fs9>
<fn10>11</fn10><ft10>0</ft10><fs10>0</fs10>
<fn11>12</fn11><ft11>0</ft11><fs11>0</fs11>
<fn12>13</fn12><ft12>0</ft12><fs12>0</fs12>
<fn13>14</fn13><ft13>0</ft13><fs13>0</fs13>
<fn14>15</fn14><ft14>0</ft14><fs14>0</fs14>
<fn15>16</fn15><ft15>0</ft15><fs15>0</fs15>
<rn0>0</rn0><rt0>0</rt0>
<rn1>1</rn1><rt1>0</rt1>
<rn2>2</rn2><rt2>0</rt2>
<rn3>3</rn3><rt3>0</rt3>
<it0>248</it0><it1>254</it1><it2>255</it2><it3>255</it3><it4>128</it4><it5>1</it5><it6>255</it6><it7>255</it7>
<date>06.08.2006</date><time>03:27:49</time><ad>1</ad><ntpsync>-1</ntpsync><i>10</i><f>0</f>
<sys>18844128</sys><mem>25048</mem><fw>2.89</fw><dev>ALL4000</dev>
<sensorx>5</sensorx><sensory>3</sensory>
</data></xml>
</TEXTAREA></FORM></BODY></HTML>
So I changed the code into this:
import urllib
import time

while True:
    ### XML Extraction ###
    from xml.dom import minidom
    allxml = urllib.urlopen("http://192.168.60.242/xml")
    allxml_string = allxml.read()
    allxml.close()
    print allxml_string
    xmldoc = minidom.parseString(allxml_string)
    Sensor0Elm = xmldoc.getElementsByTagName('t0')
    Sensor1Elm = xmldoc.getElementsByTagName('t1')
    Sensor2Elm = xmldoc.getElementsByTagName('t2')
    Sensor0Elm = Sensor0Elm[0]
    Sensor1Elm = Sensor1Elm[0]
    Sensor2Elm = Sensor2Elm[0]
    Sensor0 = Sensor0Elm.childNodes[0].data
    Sensor1 = Sensor1Elm.childNodes[0].data
    Sensor2 = Sensor2Elm.childNodes[0].data
    Sensor0 = float(Sensor0)
    Sensor1 = float(Sensor1)
    Sensor2 = float(Sensor2)
Unfortunately it does not work. This is what gets returned when it is executed.
(Judging by the print() call, the XML is correctly read into the program; the only problem seems to be the further processing by the parse function.)
Please look at the error message at the bottom.
Python 2.7.3 (default, Mar 18 2014, 05:13:23)
[GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
<HTML><HEAD><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
<devicename>ALL4000</devicename>
<n0>0</n0><t0> 1.09</t0><min0> 0.00</min0><max0> 2.55</max0><l0>-55</l0><h0>150</h0><s0>102</s0>
<n1>1</n1><t1> 2.52</t1><min1> 2.32</min1><max1> 10487.04</max1><l1>-55</l1><h1>150</h1><s1>102</s1>
<n2>2</n2><t2> 2.45</t2><min2> 0.00</min2><max2> 2.55</max2><l2>-55</l2><h2>150</h2><s2>102</s2>
<n3>3</n3><t3>-20480.00</t3><min3> 0.00</min3><max3> 5580.80</max3><l3>-55</l3><h3>150</h3><s3>0</s3>
<n4>4</n4><t4>-20480.00</t4><min4> 40.96</min4><max4> 41943.04</max4><l4>-55</l4><h4>150</h4><s4>0</s4>
<n5>5</n5><t5>-20480.00</t5><min5> 10.24</min5><max5> 0.08</max5><l5>-55</l5><h5>150</h5><s5>0</s5>
<n6>6</n6><t6>-20480.00</t6><min6> 0.00</min6><max6>-20480.00</max6><l6>-55</l6><h6>150</h6><s6>0</s6>
<n7>7</n7><t7>-20480.00</t7><min7> 0.00</min7><max7> 0.00</max7><l7>-55</l7><h7>150</h7><s7>0</s7>
<n8>8</n8><t8>-20480.00</t8><min8> 336855.04</min8><max8> 1342177.28</max8><l8>-55</l8><h8>150</h8><s8>0</s8>
<n9>9</n9><t9>-20480.00</t9><min9> 0.00</min9><max9> 0.00</max9><l9>-55</l9><h9>150</h9><s9>0</s9>
<n10>10</n10><t10>-20480.00</t10><min10> 0.00</min10><max10> 0.00</max10><l10>-55</l10><h10>150</h10><s10>0</s10>
<n11>11</n11><t11>-20480.00</t11><min11> 0.00</min11><max11> 0.00</max11><l11>-55</l11><h11>150</h11><s11>0</s11>
<n12>12</n12><t12>-20480.00</t12><min12> 0.00</min12><max12> 0.00</max12><l12>-55</l12><h12>150</h12><s12>0</s12>
<n13>13</n13><t13>-20480.00</t13><min13> 0.00</min13><max13> 0.00</max13><l13>-55</l13><h13>150</h13><s13>0</s13>
<n14>14</n14><t14>-20480.00</t14><min14> 0.00</min14><max14> 0.00</max14><l14>-55</l14><h14>150</h14><s14>0</s14>
<n15>15</n15><t15>-20480.00</t15><min15> 0.00</min15><max15> 0.00</max15><l15>-55</l15><h15>150</h15><s15>0</s15>
<fn0>1</fn0><ft0>0</ft0><fs0>0</fs0>
<fn1>2</fn1><ft1>0</ft1><fs1>0</fs1>
<fn2>3</fn2><ft2>0</ft2><fs2>0</fs2>
<fn3>4</fn3><ft3>0</ft3><fs3>0</fs3>
<fn4>5</fn4><ft4>0</ft4><fs4>0</fs4>
<fn5>6</fn5><ft5>0</ft5><fs5>0</fs5>
<fn6>7</fn6><ft6>0</ft6><fs6>0</fs6>
<fn7>8</fn7><ft7>0</ft7><fs7>0</fs7>
<fn8>9</fn8><ft8>0</ft8><fs8>0</fs8>
<fn9>10</fn9><ft9>0</ft9><fs9>0</fs9>
<fn10>11</fn10><ft10>0</ft10><fs10>0</fs10>
<fn11>12</fn11><ft11>0</ft11><fs11>0</fs11>
<fn12>13</fn12><ft12>0</ft12><fs12>0</fs12>
<fn13>14</fn13><ft13>0</ft13><fs13>0</fs13>
<fn14>15</fn14><ft14>0</ft14><fs14>0</fs14>
<fn15>16</fn15><ft15>0</ft15><fs15>0</fs15>
<rn0>0</rn0><rt0>0</rt0>
<rn1>1</rn1><rt1>0</rt1>
<rn2>2</rn2><rt2>0</rt2>
<rn3>3</rn3><rt3>0</rt3>
<it0>248</it0><it1>254</it1><it2>255</it2><it3>255</it3><it4>128</it4><it5>1</it5><it6>255</it6><it7>255</it7>
<date>06.08.2006</date><time>06:45:46</time><ad>1</ad><ntpsync>-1</ntpsync><i>10</i><f>0</f>
<sys>18856004</sys><mem>25048</mem><fw>2.89</fw><dev>ALL4000</dev>
<sensorx>5</sensorx><sensory>3</sensory>
</data></xml>
</TEXTAREA></FORM></BODY></HTML>
Traceback (most recent call last):
File "/home/pi/Desktop/sig_v3.py", line 14, in <module>
xmldoc = minidom.parseString(allxml_string)
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1930, in parseString
return expatbuilder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
ExpatError: mismatched tag: line 1, column 86
I hope someone can help me out on this.
thanks
Instead of
xmldoc = minidom.parse(allxml_string)
use
xmldoc = minidom.parseString(allxml_string)
The parse() function can take either a filename or an open file
object.
If you have XML in a string, you can use the parseString() function
instead:
Source: https://docs.python.org/2/library/xml.dom.minidom.html
After you've read the XML into xml_string, you'll want to do something to the received text to close that unclosed <meta> tag you're seeing. While it's normally not a good idea to try parsing HTML with regular expressions, in this case a regular expression is probably your simplest solution. At the top of your code, add import re somewhere, then do the following:
fixed_xml_string = re.sub("<meta(.*?)>", "<meta\\1></meta>", xml_string)
Then try parsing fixed_xml_string.
If that works, great. If it doesn't work and there are more errors, then instead of fixing them one at a time, you'll be better off using a "generous" XML parser like BeautifulSoup instead of minidom. BeautifulSoup tries very hard not to give you errors when it encounters bad XML (or HTML), and instead figure out what the XML's author intended and give you that. It might guess wrong, which is why using a "strict" parser is better if you can -- but if you're dealing with malformed XML and just want to get your work done rather than fix someone else's broken code, that's exactly what BeautifulSoup was written for.
I won't spell out how to use BeautifulSoup, since its own documentation is pretty good. But if you use it and run into trouble, come back to StackOverflow and ask a second question.
I managed to solve the issue by extracting the "xml only" out of the html/xml mixture.
re_xml = raw_xml[137:2938]
thanks for your input
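A slightly less brittle variant of that idea (an assumption of mine, not part of the original post: the device always wraps its payload in <xml> ... </xml>) is to cut on the markers rather than on fixed character positions:
import urllib
from xml.dom import minidom

# Cut the payload out of the HTML wrapper by its <xml> ... </xml> markers,
# then parse only that fragment.
raw_xml = urllib.urlopen("http://192.168.60.242/xml").read()
start = raw_xml.index('<xml>')
end = raw_xml.index('</xml>') + len('</xml>')
xmldoc = minidom.parseString(raw_xml[start:end])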

Reading web pages including various languages such as Russian, Korean, etc.

Hi everyone.
For my research projects, I have collected some web pages.
For example, http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3
As you can see on the above web page, the committer's name is not in English.
Other web pages also have committers' names written in various languages other than English.
The following code is for handling the committers' names.
import csv
import re
import urllib
def get_page(link):
    k = 1
    while k == 1:
        try:
            f = urllib.urlopen(link)
            htmlSource = f.read()
            return htmlSource
        except EnvironmentError:
            print ('Error occured:', link)
        else:
            k = 2
            f.close()

def get_commit_info(commit_page):
    commit_page_string = str(commit_page)
    author_pattern = re.compile(r'<tr><th>author</th><td>(.*?)</td><td class=', re.DOTALL)
    t_author = author_pattern.findall(commit_page_string)
    t_author_string = str(t_author)
    author_point = re.search(" <", t_author_string)
    author = t_author_string[:author_point.start()]
    print author

git_url = "http://git.gnome.org/browse/anjuta/commit/?id=d17caca8f81bb0f0ba4d341d6d6132ff51d186e3"
commit_page = get_page(git_url)
get_commit_info(commit_page)
The result of 'print author' is as follows:
\xd0\x9c\xd0\xb8\xd1\x80\xd0\xbe\xd1\x81\xd0\xbb\xd0\xb0\xd0\xb2 \xd0\x9d\xd0\xb8\xd0\xba\xd0\xbe\xd0\xbb\xd0\xb8\xd1\x9b
How can I print the name exactly?
WELL... this will do what you want
author = 'Мирослав Николић'
print author.decode('utf8') # Мирослав Николић
But it also won't work if the encoding isn't UTF8...
Mostly things use utf8. Mostly.
Unicode is complicated stuff to get your head around. 'author' is a string object that contains bytes. There is no information in those bytes to tell you what those bytes represent. Absolutely none. You have to tell Python that this string of bytes are code points in UTF8. For each byte you come across, look it up in the UTF8 code table and see which UTF8 unicode glyph this represents.
You could detect the encoding for each page by looking at the meta tags. In html5 they would look like this:
<meta charset="utf-8">.

Extracting text from HTML file using Python

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.
Update html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
Related questions:
Filter out HTML tags and resolve entities in python
Convert XML/HTML Entities into Unicode String in Python
The best piece of code I found for extracting text without getting JavaScript or other unwanted things:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
You just have to install BeautifulSoup first:
pip install beautifulsoup4
html2text is a Python program that does a pretty good job at this.
NOTE: NLTK no longer supports the clean_html function
Original answer below, and an alternative in the comments section.
Use NLTK
I wasted 4-5 hours fixing issues with html2text. Luckily I came across NLTK.
It works magically.
import nltk
from urllib import urlopen
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)
Found myself facing just the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.
from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()

def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=stderr)
        return text

def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))

if __name__ == '__main__':
    main()
I know there are a lot of answers already, but the most elegant and pythonic solution I have found is described, in part, here.
from bs4 import BeautifulSoup
text = ' '.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))
Update
Based on Fraser's comment, here is a more elegant solution:
from bs4 import BeautifulSoup
clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)
Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).
It also includes a trivial plain-text-to-html inverse converter.
"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re
class _HTMLToText(HTMLParser):
def __init__(self):
HTMLParser.__init__(self)
self._buf = []
self.hide_output = False
def handle_starttag(self, tag, attrs):
if tag in ('p', 'br') and not self.hide_output:
self._buf.append('\n')
elif tag in ('script', 'style'):
self.hide_output = True
def handle_startendtag(self, tag, attrs):
if tag == 'br':
self._buf.append('\n')
def handle_endtag(self, tag):
if tag == 'p':
self._buf.append('\n')
elif tag in ('script', 'style'):
self.hide_output = False
def handle_data(self, text):
if text and not self.hide_output:
self._buf.append(re.sub(r'\s+', ' ', text))
def handle_entityref(self, name):
if name in name2codepoint and not self.hide_output:
c = unichr(name2codepoint[name])
self._buf.append(c)
def handle_charref(self, name):
if not self.hide_output:
n = int(name[1:], 16) if name.startswith('x') else int(name)
self._buf.append(unichr(n))
def get_text(self):
return re.sub(r' +', ' ', ''.join(self._buf))
def html_to_text(html):
"""
Given a piece of HTML, return the plain text it contains.
This handles entities and char refs, but not javascript and stylesheets.
"""
parser = _HTMLToText()
try:
parser.feed(html)
parser.close()
except HTMLParseError:
pass
return parser.get_text()
def text_to_html(text):
"""
Convert the given text to html, wrapping what looks like URLs with <a> tags,
converting newlines to <br> tags and converting confusing chars into html
entities.
"""
def f(mo):
t = mo.group()
if len(t) == 1:
return {'&':'&', "'":''', '"':'"', '<':'<', '>':'>'}.get(t)
return '%s' % (t, t)
return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
I know there's plenty of answers here already but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web and this library has done an excellent job of achieving this so far in my tests. It ignores the text found in menu items and side bars as well as any JavaScript that appears on the page as the OP requests.
from newspaper import Article
article = Article(url)
article.download()
article.parse()
article.text
If you already have the HTML files downloaded you can do something like this:
article = Article('')
article.set_html(html)
article.parse()
article.text
It even has a few NLP features for summarizing the topics of articles:
article.nlp()
article.summary
You can also use the html2text method from the stripogram library.
from stripogram import html2text
text = html2text(your_html_string)
To install stripogram run sudo easy_install stripogram
There is Pattern library for data mining.
http://www.clips.ua.ac.be/pages/pattern-web
You can even decide what tags to keep:
s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print s
If you need more speed and less accuracy, then you could use raw lxml.
import lxml.html as lh
from lxml.html.clean import clean_html
def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()
PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of the use of PyParsing (example link). One reason for investing a little time with pyparsing is that he has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.
Having said that, I use BeautifulSoup a lot and it is not that hard to deal with the entities issues: you can convert them before you run BeautifulSoup.
Good luck
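A small sketch of that convert-first idea (my own illustration, Python 3 naming assumed; the sample string is hypothetical):
# Unescape the HTML entities first, then hand the result to BeautifulSoup.
from html import unescape
from bs4 import BeautifulSoup

some_html_string = "<p>It&#39;s a test &amp; only a test.</p>"
soup = BeautifulSoup(unescape(some_html_string), "html.parser")
print(soup.get_text())   # It's a test & only a test.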
This isn't exactly a Python solution, but it will render text that JavaScript would otherwise generate, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert the source to text with the -dump option.
So you could do something like:
import os
import subprocess

fname = os.tmpnam()
with open(fname, 'w') as f:
    f.write(html_source)
proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=open('/dev/null', 'w'))
text = proc.stdout.read()
Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of javascript and css. You could make a derived class and add methods with names like start_script and end_style (see the python docs for details), but it's hard to do this reliably for malformed html.) Anyway, here's something simple that prints the plain text to the console.
from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)
I recommend a Python Package called goose-extractor
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
More: https://pypi.python.org/pypi/goose-extractor/
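A minimal usage sketch (assuming the goose-extractor package is installed; the URL is a placeholder):
from goose import Goose

g = Goose()
article = g.extract(url='http://example.com/some-article.html')
print(article.cleaned_text)   # the main body text of the article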
Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.
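For reference, a minimal sketch of that call (assuming the bleach package is installed; the sample HTML is my own):
import bleach

html = "<p>Hello <b>world</b></p>"
text = bleach.clean(html, tags=[], strip=True)   # strip every tag, keep the text
print(text)   # Hello world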
install html2text using
pip install html2text
then,
>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!")
Hello, world!
What worked best for me is inscriptis.
https://github.com/weblyzard/inscriptis
import urllib.request
from inscriptis import get_text
url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
print(text)
The results are really good
Beautiful soup does convert html entities. It's probably your best bet considering HTML is often buggy and filled with unicode and html encoding issues. This is the code I use to convert html to raw text:
import re
import copy
import BeautifulSoup

def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild. Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
                                       convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES
                                       if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""
Another non-python solution: Libre Office:
soffice --headless --invisible --convert-to txt input1.html
The reason I prefer this one over other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, Libre Office can be used to convert from all sorts of formats...
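A sketch of driving the same conversion from Python (it assumes soffice is on the PATH; --convert-to writes the .txt into the working directory unless --outdir says otherwise):
import subprocess

# Convert the HTML to plain text with a headless LibreOffice run,
# then read the generated .txt back in.
subprocess.run(
    ["soffice", "--headless", "--invisible", "--convert-to", "txt", "input1.html"],
    check=True,
)
with open("input1.txt") as f:
    text = f.read()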
I had a similar question and actually used one of the answers with BeautifulSoup.
The problem was that it was really slow. I ended up using a library called selectolax.
It's pretty limited, but it works for this task.
The only issue was that I had to manually remove unnecessary whitespace.
But it seems to be working much faster than the BeautifulSoup solution.
from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)
    if tree.body is None:
        return None
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()
    text = tree.body.text(separator='')
    text = " ".join(text.split())  # this will remove all the whitespaces
    return text
Another option is to run the html through a text based web browser and dump it. For example (using Lynx):
lynx -dump html_to_convert.html > converted_html.txt
This can be done within a python script as follows:
import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)
It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.
@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using <p> tags and replaces <a> tags with the href link. It also copes with links inside text. Available at this gist with a test doc embedded.
from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue
            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip()):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc
I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box.
Tika can be run as a server, is trivial to run / deploy in a Docker container, and from there can be accessed via Python bindings.
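A small sketch of the Python route (assuming the tika-python bindings, which start or contact a Tika server behind the scenes; the file name is a placeholder):
from tika import parser

parsed = parser.from_file('page.html')   # any format Tika understands
print(parsed["metadata"])                # extracted metadata
print(parsed["content"])                 # the extracted plain text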
While a lot of people mentioned using regex to strip HTML tags, there are a lot of downsides.
For example:
<p>hello world</p>I love you
Should be parsed to:
Hello world
I love you
Here's a snippet I came up with; you can customize it to your specific needs, and it works like a charm:
import re
import html

def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags=re.MULTILINE)
    ret = re.sub("<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r" +", " ", ret)
    return ret
In a simple way:
import re
html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)
This code finds all parts of html_text starting with '<' and ending with '>' and replaces everything found with an empty string.
In Python 3.x you can do it very easily by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers on this post.
status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1])
# email.message_from_string(data[0][1])

# If the message is multipart we only want the text version of the body,
# so this walks the message and gets the body.
if email_msg.is_multipart():
    for part in email_msg.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True)  # to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue
Now you can print the body variable and it will be in plain-text format :) If it is good enough for you, then it would be nice to select it as the accepted answer.
Here's the code I use on a regular basis.
from bs4 import BeautifulSoup
import urllib.request

def processText(webpage):
    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []
    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text=True)
        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())
            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)
    except urllib.error.HTTPError:
        pass
    return proc_text
I hope that helps.
You can extract only the text from HTML with BeautifulSoup:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con, 'html.parser')
texts = soup.get_text()
print(texts)
Another example using BeautifulSoup4 in Python 2.7.9+
includes:
import urllib2
from bs4 import BeautifulSoup
Code:
def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))
Explained:
Read in the URL data as html (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break it into lines and remove leading and trailing space on each, then break multi-headlines into a line each with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join, drop blank lines and finally return the text encoded as UTF-8.
Notes:
On some systems this will fail for https:// connections because of SSL issues; you can turn off verification to fix that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
Python < 2.7.9 may have some issues running this.
text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.
Answer using Pandas to get table data from HTML.
If you want to extract table data quickly from HTML, you can use the read_html function; the docs are here. Before using this function you should read the gotchas/issues surrounding the BeautifulSoup4/html5lib/lxml HTML parsing libraries.
import pandas as pd
http = r'https://www.ibm.com/docs/en/cmofz/10.1.0?topic=SSQHWE_10.1.0/com.ibm.ondemand.mp.doc/arsa0257.htm'
table = pd.read_html(http)
df = table[0]
df
Output: the DataFrame built from the first table on the page.
There are a number of options that can be played with; see here and here.
