XML parsing from URL with Python

I am trying to parse XML from a URL.
Originally my code looked like this:
from xml.dom import minidom
xmldoc = minidom.parse('all.xml')
Sensor0Elm = xmldoc.getElementsByTagName('t0')
Sensor1Elm = xmldoc.getElementsByTagName('t1')
Sensor2Elm = xmldoc.getElementsByTagName('t2')
Sensor0Elm = Sensor0Elm[0]
Sensor1Elm = Sensor1Elm[0]
Sensor2Elm = Sensor2Elm[0]
Sensor0 = Sensor0Elm.childNodes[0].data
Sensor1 = Sensor1Elm.childNodes[0].data
Sensor2 = Sensor2Elm.childNodes[0].data
Sensor0 = float(Sensor0)
Sensor1 = float(Sensor1)
Sensor2 = float(Sensor2)
In this case the XML I intended to parse was on my local hard drive, and it worked perfectly!
The next step was to parse XML from a URL. A sensor meter from Allnet constantly publishes XML data onto the network, which is accessible in a browser via the following URL: 192.168.60.242/xml
This is the embedded XML:
<HTML><HEAD><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
<devicename>ALL4000</devicename>
<n0>0</n0><t0> 1.27</t0><min0> 0.00</min0><max0> 2.55</max0><l0>-55</l0><h0>150</h0><s0>102</s0>
<n1>1</n1><t1> 2.53</t1><min1> 2.32</min1><max1> 10487.04</max1><l1>-55</l1><h1>150</h1><s1>102</s1>
<n2>2</n2><t2> 2.45</t2><min2> 0.00</min2><max2> 2.55</max2><l2>-55</l2><h2>150</h2><s2>102</s2>
<n3>3</n3><t3>-20480.00</t3><min3> 0.00</min3><max3> 5580.80</max3><l3>-55</l3><h3>150</h3><s3>0</s3>
<n4>4</n4><t4>-20480.00</t4><min4> 40.96</min4><max4> 41943.04</max4><l4>-55</l4><h4>150</h4><s4>0</s4>
<n5>5</n5><t5>-20480.00</t5><min5> 10.24</min5><max5> 0.08</max5><l5>-55</l5><h5>150</h5><s5>0</s5>
<n6>6</n6><t6>-20480.00</t6><min6> 0.00</min6><max6>-20480.00</max6><l6>-55</l6><h6>150</h6><s6>0</s6>
<n7>7</n7><t7>-20480.00</t7><min7> 0.00</min7><max7> 0.00</max7><l7>-55</l7><h7>150</h7><s7>0</s7>
<n8>8</n8><t8>-20480.00</t8><min8> 336855.04</min8><max8> 1342177.28</max8><l8>-55</l8><h8>150</h8><s8>0</s8>
<n9>9</n9><t9>-20480.00</t9><min9> 0.00</min9><max9> 0.00</max9><l9>-55</l9><h9>150</h9><s9>0</s9>
<n10>10</n10><t10>-20480.00</t10><min10> 0.00</min10><max10> 0.00</max10><l10>-55</l10><h10>150</h10><s10>0</s10>
<n11>11</n11><t11>-20480.00</t11><min11> 0.00</min11><max11> 0.00</max11><l11>-55</l11><h11>150</h11><s11>0</s11>
<n12>12</n12><t12>-20480.00</t12><min12> 0.00</min12><max12> 0.00</max12><l12>-55</l12><h12>150</h12><s12>0</s12>
<n13>13</n13><t13>-20480.00</t13><min13> 0.00</min13><max13> 0.00</max13><l13>-55</l13><h13>150</h13><s13>0</s13>
<n14>14</n14><t14>-20480.00</t14><min14> 0.00</min14><max14> 0.00</max14><l14>-55</l14><h14>150</h14><s14>0</s14>
<n15>15</n15><t15>-20480.00</t15><min15> 0.00</min15><max15> 0.00</max15><l15>-55</l15><h15>150</h15><s15>0</s15>
<fn0>1</fn0><ft0>0</ft0><fs0>0</fs0>
<fn1>2</fn1><ft1>0</ft1><fs1>0</fs1>
<fn2>3</fn2><ft2>0</ft2><fs2>0</fs2>
<fn3>4</fn3><ft3>0</ft3><fs3>0</fs3>
<fn4>5</fn4><ft4>0</ft4><fs4>0</fs4>
<fn5>6</fn5><ft5>0</ft5><fs5>0</fs5>
<fn6>7</fn6><ft6>0</ft6><fs6>0</fs6>
<fn7>8</fn7><ft7>0</ft7><fs7>0</fs7>
<fn8>9</fn8><ft8>0</ft8><fs8>0</fs8>
<fn9>10</fn9><ft9>0</ft9><fs9>0</fs9>
<fn10>11</fn10><ft10>0</ft10><fs10>0</fs10>
<fn11>12</fn11><ft11>0</ft11><fs11>0</fs11>
<fn12>13</fn12><ft12>0</ft12><fs12>0</fs12>
<fn13>14</fn13><ft13>0</ft13><fs13>0</fs13>
<fn14>15</fn14><ft14>0</ft14><fs14>0</fs14>
<fn15>16</fn15><ft15>0</ft15><fs15>0</fs15>
<rn0>0</rn0><rt0>0</rt0>
<rn1>1</rn1><rt1>0</rt1>
<rn2>2</rn2><rt2>0</rt2>
<rn3>3</rn3><rt3>0</rt3>
<it0>248</it0><it1>254</it1><it2>255</it2><it3>255</it3><it4>128</it4><it5>1</it5><it6>255</it6><it7>255</it7>
<date>06.08.2006</date><time>03:27:49</time><ad>1</ad><ntpsync>-1</ntpsync><i>10</i><f>0</f>
<sys>18844128</sys><mem>25048</mem><fw>2.89</fw><dev>ALL4000</dev>
<sensorx>5</sensorx><sensory>3</sensory>
</data></xml>
</TEXTAREA></FORM></BODY></HTML>
So I changed the code to this:
import urllib
import time
while True:
    ### XML Extraction ###
    from xml.dom import minidom
    allxml = urllib.urlopen("http://192.168.60.242/xml")
    allxml_string = allxml.read()
    allxml.close()
    print allxml_string
    xmldoc = minidom.parseString(allxml_string)
    Sensor0Elm = xmldoc.getElementsByTagName('t0')
    Sensor1Elm = xmldoc.getElementsByTagName('t1')
    Sensor2Elm = xmldoc.getElementsByTagName('t2')
    Sensor0Elm = Sensor0Elm[0]
    Sensor1Elm = Sensor1Elm[0]
    Sensor2Elm = Sensor2Elm[0]
    Sensor0 = Sensor0Elm.childNodes[0].data
    Sensor1 = Sensor1Elm.childNodes[0].data
    Sensor2 = Sensor2Elm.childNodes[0].data
    Sensor0 = float(Sensor0)
    Sensor1 = float(Sensor1)
    Sensor2 = float(Sensor2)
Unfortunately it does not work. This is what gets returned when executed:
(Judging by the print() output, the XML is correctly read into the program; the only problem seems to be the further processing by the parse function.)
Please look at the error message at the bottom.
Python 2.7.3 (default, Mar 18 2014, 05:13:23)
[GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
<HTML><HEAD><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
<devicename>ALL4000</devicename>
<n0>0</n0><t0> 1.09</t0><min0> 0.00</min0><max0> 2.55</max0><l0>-55</l0><h0>150</h0><s0>102</s0>
<n1>1</n1><t1> 2.52</t1><min1> 2.32</min1><max1> 10487.04</max1><l1>-55</l1><h1>150</h1><s1>102</s1>
<n2>2</n2><t2> 2.45</t2><min2> 0.00</min2><max2> 2.55</max2><l2>-55</l2><h2>150</h2><s2>102</s2>
<n3>3</n3><t3>-20480.00</t3><min3> 0.00</min3><max3> 5580.80</max3><l3>-55</l3><h3>150</h3><s3>0</s3>
<n4>4</n4><t4>-20480.00</t4><min4> 40.96</min4><max4> 41943.04</max4><l4>-55</l4><h4>150</h4><s4>0</s4>
<n5>5</n5><t5>-20480.00</t5><min5> 10.24</min5><max5> 0.08</max5><l5>-55</l5><h5>150</h5><s5>0</s5>
<n6>6</n6><t6>-20480.00</t6><min6> 0.00</min6><max6>-20480.00</max6><l6>-55</l6><h6>150</h6><s6>0</s6>
<n7>7</n7><t7>-20480.00</t7><min7> 0.00</min7><max7> 0.00</max7><l7>-55</l7><h7>150</h7><s7>0</s7>
<n8>8</n8><t8>-20480.00</t8><min8> 336855.04</min8><max8> 1342177.28</max8><l8>-55</l8><h8>150</h8><s8>0</s8>
<n9>9</n9><t9>-20480.00</t9><min9> 0.00</min9><max9> 0.00</max9><l9>-55</l9><h9>150</h9><s9>0</s9>
<n10>10</n10><t10>-20480.00</t10><min10> 0.00</min10><max10> 0.00</max10><l10>-55</l10><h10>150</h10><s10>0</s10>
<n11>11</n11><t11>-20480.00</t11><min11> 0.00</min11><max11> 0.00</max11><l11>-55</l11><h11>150</h11><s11>0</s11>
<n12>12</n12><t12>-20480.00</t12><min12> 0.00</min12><max12> 0.00</max12><l12>-55</l12><h12>150</h12><s12>0</s12>
<n13>13</n13><t13>-20480.00</t13><min13> 0.00</min13><max13> 0.00</max13><l13>-55</l13><h13>150</h13><s13>0</s13>
<n14>14</n14><t14>-20480.00</t14><min14> 0.00</min14><max14> 0.00</max14><l14>-55</l14><h14>150</h14><s14>0</s14>
<n15>15</n15><t15>-20480.00</t15><min15> 0.00</min15><max15> 0.00</max15><l15>-55</l15><h15>150</h15><s15>0</s15>
<fn0>1</fn0><ft0>0</ft0><fs0>0</fs0>
<fn1>2</fn1><ft1>0</ft1><fs1>0</fs1>
<fn2>3</fn2><ft2>0</ft2><fs2>0</fs2>
<fn3>4</fn3><ft3>0</ft3><fs3>0</fs3>
<fn4>5</fn4><ft4>0</ft4><fs4>0</fs4>
<fn5>6</fn5><ft5>0</ft5><fs5>0</fs5>
<fn6>7</fn6><ft6>0</ft6><fs6>0</fs6>
<fn7>8</fn7><ft7>0</ft7><fs7>0</fs7>
<fn8>9</fn8><ft8>0</ft8><fs8>0</fs8>
<fn9>10</fn9><ft9>0</ft9><fs9>0</fs9>
<fn10>11</fn10><ft10>0</ft10><fs10>0</fs10>
<fn11>12</fn11><ft11>0</ft11><fs11>0</fs11>
<fn12>13</fn12><ft12>0</ft12><fs12>0</fs12>
<fn13>14</fn13><ft13>0</ft13><fs13>0</fs13>
<fn14>15</fn14><ft14>0</ft14><fs14>0</fs14>
<fn15>16</fn15><ft15>0</ft15><fs15>0</fs15>
<rn0>0</rn0><rt0>0</rt0>
<rn1>1</rn1><rt1>0</rt1>
<rn2>2</rn2><rt2>0</rt2>
<rn3>3</rn3><rt3>0</rt3>
<it0>248</it0><it1>254</it1><it2>255</it2><it3>255</it3><it4>128</it4><it5>1</it5><it6>255</it6><it7>255</it7>
<date>06.08.2006</date><time>06:45:46</time><ad>1</ad><ntpsync>-1</ntpsync><i>10</i><f>0</f>
<sys>18856004</sys><mem>25048</mem><fw>2.89</fw><dev>ALL4000</dev>
<sensorx>5</sensorx><sensory>3</sensory>
</data></xml>
</TEXTAREA></FORM></BODY></HTML>
Traceback (most recent call last):
File "/home/pi/Desktop/sig_v3.py", line 14, in <module>
xmldoc = minidom.parseString(allxml_string)
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1930, in parseString
return expatbuilder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
ExpatError: mismatched tag: line 1, column 86
I hope someone can help me out on this.
thanks

Instead of
xmldoc = minidom.parse(allxml_string)
use
xmldoc = minidom.parseString(allxml_string)
The parse() function can take either a filename or an open file object. If you have XML in a string, you can use the parseString() function instead.
Source: https://docs.python.org/2/library/xml.dom.minidom.html

After you've read the XML into allxml_string, you'll want to do something to the received text to close that unclosed <meta> tag you're seeing. While it's normally not a good idea to try parsing HTML with regular expressions, in this case a regular expression is probably your simplest solution. At the top of your code, add import re somewhere, then do the following:
fixed_xml_string = re.sub("<meta(.*?)>", "<meta\\1></meta>", allxml_string)
Then try parsing fixed_xml_string.
If that works, great. If it doesn't work and there are more errors, then instead of fixing them one at a time, you'll be better off using a "generous" XML parser like BeautifulSoup instead of minidom. BeautifulSoup tries very hard not to give you errors when it encounters bad XML (or HTML), and instead figure out what the XML's author intended and give you that. It might guess wrong, which is why using a "strict" parser is better if you can -- but if you're dealing with malformed XML and just want to get your work done rather than fix someone else's broken code, that's exactly what BeautifulSoup was written for.
I won't spell out how to use BeautifulSoup, since its own documentation is pretty good. But if you use it and run into trouble, come back to StackOverflow and ask a second question.
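For what it's worth, the substitution suggested above can be checked in isolation. This sketch applies it to a copy of the first line of the device's output:

```python
import re

# A copy of the start of the device's output, with the unclosed <meta> tag
html = ('<HTML><HEAD><meta http-equiv="content-type" '
        'content="text/html; charset=ISO-8859-1"></HEAD>')

# Same substitution as above: give <meta ...> a matching closing tag
fixed = re.sub(r"<meta(.*?)>", r"<meta\1></meta>", html)
print(fixed)
```

Note that closing <meta> alone may not be enough here: attributes like COLS=132 in the surrounding HTML are unquoted, which a strict XML parser will also reject, so a lenient parser or extracting the inner XML may still be needed.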

I managed to solve the issue by extracting just the XML out of the HTML/XML mixture:
re_xml = raw_xml[137:2938]
Thanks for your input.
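A slightly sturdier variant of that extraction, assuming the device always wraps its payload in <xml>...</xml>, is a regex search instead of hardcoded slice indices (shown here on a trimmed stand-in for the page):

```python
import re
from xml.dom import minidom

# Trimmed stand-in for the device's full HTML page
raw_xml = """<HTML><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
<devicename>ALL4000</devicename>
<n0>0</n0><t0> 1.27</t0>
</data></xml>
</TEXTAREA></FORM></BODY></HTML>"""

# Grab everything between <xml> and </xml> instead of fixed offsets
match = re.search(r"<xml>.*</xml>", raw_xml, re.DOTALL)
xmldoc = minidom.parseString(match.group(0))
sensor0 = float(xmldoc.getElementsByTagName("t0")[0].firstChild.data)
print(sensor0)  # 1.27
```

This keeps working even if the device changes the length of the HTML wrapper around the XML.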

Related

Error validating/parsing xml file against xsd with lxml/objectify in Python

In Python/Django, I need to parse and objectify an .xml file according to a given XMLSchema made of three .xsd files referring to each other in the following way:
schema3.xsd (referring schema1.xsd)
schema2.xsd (referring schema1.xsd)
schema1.xsd (referring schema2.xsd)
xml schemas import
For this I'm using the following piece of code, which I've already tested successfully with a couple of xml/xsd files (where the .xsd is "standalone", not referring to other .xsd files):
import lxml
import os.path
from lxml import etree, objectify
from lxml.etree import XMLSyntaxError
def xml_validator(request):
    # define path of files
    path_file_xml = '../myxmlfile.xml'
    path_file_xsd = '../schema3.xsd'
    # get file XML
    xml_file = open(path_file_xml, 'r')
    xml_string = xml_file.read()
    xml_file.close()
    # get XML Schema
    doc = etree.parse(path_file_xsd)
    schema = etree.XMLSchema(doc)
    # define parser
    parser = objectify.makeparser(schema=schema)
    # transform XML file
    root = objectify.fromstring(xml_string, parser)
    test1 = root.tag
    return render(request, 'risultati.html', {'risultato': test1})
Unfortunately, I'm stuck with the following error, which I got with the multiple .xsd files described above:
complex type 'ObjectType': The content model is not determinist.
Request Method: GET Request URL: http://127.0.0.1:8000/xml_validator
Django Version: 1.9.1 Exception Type: XMLSchemaParseError Exception
Value: complex type 'ObjectType': The content model is not
determinist., line 80
Any idea about that ?
Thanks a lot in advance for any suggestion or useful tips to approach this problem...
cheers
Update 23/03/2016
Here (and in the following answers to the post, because it actually exceeds the maximum number of characters for a post) is a sample of the files to figure out the problem...
sample files on GitHub
My best guess would be that your XSD model does not obey the Unique Particle Attribution rule. I would rule that out before looking at anything else.
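For context, the determinism error means the schema has a content model where the validator cannot decide which particle an incoming element matches without lookahead. A minimal hypothetical fragment that triggers this kind of error:

```xml
<xs:complexType name="ObjectType">
  <xs:sequence>
    <!-- ambiguous: a leading <item> could match either particle -->
    <xs:element name="item" minOccurs="0"/>
    <xs:element name="item" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
```

With multiple .xsd files importing each other, such a clash can also arise from two imported declarations for the same element name landing in one content model.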

How to get HTML from URL that returns "junk" data?

I want to get the HTML source code of a given URL. I tried this:
import urllib2
url = 'http://mp3.zing.vn' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
But the returned data is not in HTML format for some pages. I tried another link, http://phuctrancs.info, and it works (as that page is plain-HTML based). I also tried using the BeautifulSoup library, but it didn't work either. Any suggestions?
You're getting the HTML you expect, but it's compressed. I tried this URL by hand and got back a binary mess with this in the headers:
Content-Encoding: gzip
I saved the response body to a file and was able to gunzip it on the command line. You should also be able to decompress it in your program with the functions in the standard library's zlib module.
Update for anyone having trouble with zlib.decompress...
The compressed data you will get (or at least that I got in Python 2.6) apparently has a "gzip header and trailer" like you'd expect in *.gz files, while zlib.decompress expects a "zlib wrapper"... probably. I kept getting an unhelpful zlib.error exception:
Traceback (most recent call last):
File "./fixme.py", line 32, in <module>
text = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
The solution is entirely undocumented in the Python standard library, but can be found in Greg Hewgill's answer to a question about gzip streams: You have to feed zlib.decompress a wbits argument, created by adding a magic number to an undocumented module-level constant <grumble, mutter...>:
text = zlib.decompress(data, 16 + zlib.MAX_WBITS)
If you feel this isn't obfuscated enough, note that a 32 here would be every bit as magical as the 16.
The only hint of this is buried in the original zlib's manual, under the deflateInit2 function:
windowBits can also be greater than 15 for optional gzip decoding. Add 16 to windowBits to write a simple gzip header and trailer around the compressed data instead of a zlib wrapper.
...and the inflateInit2 function:
windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format [...]
Note that the zlib.decompress docs explicitly tell you that you can't do this:
The default value is therefore the highest value, 15.
But this is... the opposite of true.
<fume, curse, rant...>
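Despite the grumbling, the trick itself is short. A self-contained sketch (building the gzip bytes locally instead of fetching them over HTTP):

```python
import gzip
import io
import zlib

# Build some gzip-wrapped bytes, standing in for the HTTP response body
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"<html>hello</html>")
data = buf.getvalue()

# zlib.decompress(data) alone raises "Error -3 ... incorrect header check";
# adding 16 to MAX_WBITS tells zlib to expect a gzip header and trailer
text = zlib.decompress(data, 16 + zlib.MAX_WBITS)
print(text)  # b'<html>hello</html>'
```

Using 32 + zlib.MAX_WBITS instead enables automatic detection of both zlib and gzip wrappers, as the inflateInit2 documentation quoted above describes.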
Have you looked into the response code? urllib2 may need you to handle the response, such as a 301 redirect and so on.
You should print the response code like:
data = usock.read()
if usock.getcode() != 200:
    print "something unexpected"
Updated:
if the response contains non-localized or unreadable text, then you might need to specify the request character set in the request header.
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
opener.addheaders = [("Content-Type", "text/html; charset=UTF-8")]
urllib2.install_opener(opener)
PS: untested.
Use Beautiful Soup from Python:
import requests
from bs4 import BeautifulSoup
url = 'http://www.google.com'
r = requests.get(url)
b = BeautifulSoup(r.text)
b will contain all the HTML tags and also provides iterators to traverse elements/tags. To learn more, see https://pypi.python.org/pypi/beautifulsoup4/4.3.2

Python 3.4 - XML Parse - IndexError: List Index Out of Range - How do I find range of XML?

Okay guys, I'm new to parsing XML and Python, and am trying to get this to work. If someone could help me with this it would be greatly appreciated. If you can help me (educate me) on how to figure it out for myself, that would be even better!
I am having trouble trying to figure out the range to reference for an XML document as I can't find any documentation on it. Here is my code and I'll include the entire Traceback after.
#import library to do http requests:
import urllib.request
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file:
file = urllib.request.urlopen('http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab=')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('Data.Results.Power.ID')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<id>','').replace('</id>','')
#print out the xml tag and data in this format: <tag>data</tag>
print(xmlTag)
#just print the data
print(xmlData)
Traceback
/usr/bin/python3.4 /home/mint/PycharmProjects/DnD_Project/Power_Name.py
Traceback (most recent call last):
File "/home/mint/PycharmProjects/DnD_Project/Power_Name.py", line 14, in <module>
xmlTag = dom.getElementsByTagName('id')[0].toxml()
IndexError: list index out of range
Process finished with exit code 1
print(len(dom.getElementsByTagName('id')))
EDIT:
ids = dom.getElementsByTagName('id')
if len(ids) > 0:
    xmlTag = ids[0].toxml()
    # rest of code
EDIT: I've added an example because I saw in another comment that you don't know how to use it.
BTW: I've added some comments in the code about the file/connection.
import urllib.request
from xml.dom.minidom import parseString

# create connection to data/file on server
connection = urllib.request.urlopen('http://www.wizards.com/dndinsider/compendium/CompendiumSearch.asmx/KeywordSearch?Keywords=healing%20%word&nameOnly=True&tab=')
# read from server as string (not "convert" to string):
data = connection.read()
# close connection because we dont need it anymore:
connection.close()

dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('Data.Results.Power.ID')
# check if there are any data
if len(ids) > 0:
    xmlTag = ids[0].toxml()
    xmlData = xmlTag.replace('<id>','').replace('</id>','')
    print(xmlTag)
    print(xmlData)
else:
    print("Sorry, there was no data")
or you can use a for loop if there are more tags
dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('Data.Results.Power.ID')
# get all tags - one by one
for one_tag in ids:
    xmlTag = one_tag.toxml()
    xmlData = xmlTag.replace('<id>','').replace('</id>','')
    print(xmlTag)
    print(xmlData)
BTW:
getElementsByTagName() expects a tag name like ID - not a path like Data.Results.Power.ID
the tag name is ID, so you have to replace <ID>, not <id>
for this tag you can even use one_tag.firstChild.nodeValue in place of xmlTag.replace
dom = parseString(data)
# get tags from dom
ids = dom.getElementsByTagName('ID')  # tag name
# get all tags - one by one
for one_tag in ids:
    xmlTag = one_tag.toxml()
    #xmlData = xmlTag.replace('<ID>','').replace('</ID>','')
    xmlData = one_tag.firstChild.nodeValue
    print(xmlTag)
    print(xmlData)
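The same pattern can be exercised without the network. This sketch parses a tiny hypothetical document shaped like the compendium response (made-up IDs, not real data):

```python
from xml.dom.minidom import parseString

# Tiny stand-in document; the real response has this nesting for <ID>
dom = parseString("<Data><Results><ID>100</ID><ID>200</ID></Results></Data>")

ids = dom.getElementsByTagName("ID")  # tag name only, never a dotted path
print(len(ids))                       # 2 - so indexing [0] is safe here
print(ids[0].firstChild.nodeValue)    # 100
```

Checking len() before indexing is exactly what turns the IndexError into a readable "no data" message.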
I haven't used the built-in xml library in a while, but it's covered in Mark Pilgrim's great Dive Into Python book.
I see as I'm typing this that your question has already been answered, but since you mention being new to Python, I think you will find the text useful for XML parsing and as an excellent introduction to the language.
If you would like to try another approach to parsing xml and html, I highly recommend lxml.

Universal Feed Parser issue

I am working on a Python script to parse RSS links.
I use the Universal Feed Parser, and I am encountering issues with some links, for example when trying to parse the FreeBSD Security Advisories.
Here is the sample code:
feed = feedparser.parse(url)
items = feed["items"]
Basically, feed["items"] should return all the entries in the feed (the fields that start with item), but it always returns empty.
I can also confirm that the following links are parsed as expected:
Ubuntu
Redhat
Is this an issue with the feeds, in that the ones from FreeBSD do not respect the standard?
EDIT:
I am using python 2.7.
I ended up using feedparser, in combination with BeautifulSoup, like Hai Vu proposed.
Here is the sample code I ended up with, slightly changed:
def rss_get_items_feedparser(self, webData):
    feed = feedparser.parse(webData)
    items = feed["items"]
    return items

def rss_get_items_beautifulSoup(self, webData):
    soup = BeautifulSoup(webData)
    for item_node in soup.find_all('item'):
        item = {}
        for subitem_node in item_node.findChildren():
            if subitem_node.name is not None:
                item[str(subitem_node.name)] = str(subitem_node.contents[0])
        yield item

def rss_get_items(self, webData):
    items = self.rss_get_items_feedparser(webData)
    if len(items) > 0:
        return items
    return self.rss_get_items_beautifulSoup(webData)

def parse(self, url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    webData = response.read()
    for item in self.rss_get_items(webData):
        # parse items
I also tried passing the response directly to rss_get_items, without reading it, but it throws an exception when BeautifulSoup tries to read:
File "bs4/__init__.py", line 161, in __init__
markup = markup.read()
TypeError: 'NoneType' object is not callable
I found out the problem was with the use of namespaces.
For FreeBSD's RSS feed:
<rss xmlns:atom="http://www.w3.org/2005/Atom"
xmlns="http://www.w3.org/1999/xhtml"
version="2.0">
For Ubuntu's feed:
<rss xmlns:atom="http://www.w3.org/2005/Atom"
version="2.0">
When I remove the extra namespace declaration from FreeBSD's feed, everything works as expected.
So what does it mean for you? I can think of a couple of different approaches:
Use something else, such as BeautifulSoup. I tried it and it seems to work.
Download the whole RSS feed, apply some search/replace to fix up the namespaces, then use feedparser.parse() afterward. This approach is a big hack; I would not use it myself.
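The search/replace from the second option can be sketched like this, using the feed headers quoted above (same hack disclaimer applies):

```python
import re

# The FreeBSD feed header quoted above, as a string
header = ('<rss xmlns:atom="http://www.w3.org/2005/Atom" '
          'xmlns="http://www.w3.org/1999/xhtml" version="2.0">')

# Remove only the default xmlns declaration; xmlns:atom is left alone
fixed = re.sub(r'\sxmlns="[^"]*"', '', header, count=1)
print(fixed)  # <rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
```

Applying this to the downloaded feed text before feedparser.parse() reproduces the "everything works as expected" result described above, at the cost of textually rewriting the document.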
Update
Here is sample code for rss_get_items() which will return a list of items from an RSS feed. Each item is a dictionary with some standard keys such as title, pubdate, link, and guid.
from bs4 import BeautifulSoup
import urllib2
def rss_get_items(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response)
    for item_node in soup.find_all('item'):
        item = {}
        for subitem_node in item_node.findChildren():
            key = subitem_node.name
            value = subitem_node.text
            item[key] = value
        yield item

if __name__ == '__main__':
    url = 'http://www.freebsd.org/security/rss.xml'
    for item in rss_get_items(url):
        print item['title']
        print item['pubdate']
        print item['link']
        print item['guid']
        print '---'
Output:
FreeBSD-SA-14:04.bind
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:04.bind.asc
---
FreeBSD-SA-14:03.openssl
Tue, 14 Jan 2014 00:00:00 PST
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
http://security.FreeBSD.org/advisories/FreeBSD-SA-14:03.openssl.asc
---
...
Notes:
I omitted error checking for the sake of brevity.
I recommend only using the BeautifulSoup API when feedparser fails. The reason is that feedparser is the right tool for the job. Hopefully, they will update it to be more forgiving in the future.

Escaping '<' and '>' in xml when using xml.dom.minidom

I am stuck escaping "<" and ">" in an XML file using xml.dom.minidom.
I tried to get the Unicode hex value and use that instead:
http://slayeroffice.com/tools/unicode_lookup/
I also tried the standard "&lt;" and "&gt;" entities, but still with no success.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
s1 = '<hello>bhaskar</hello>'
text = doc.createTextNode(s1)
e.appendChild(text)
e.toxml()
'<abc>&lt;hello&gt;bhaskar&lt;/hello&gt;</abc>'
Same result with writexml().
Also tried specifying the encodings 'UTF-8', 'utf-8', and 'utf' in the toxml() and writexml() calls, but with the same results.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
s1 = u'<hello>bhaskar</hello>'
text = doc.createTextNode(s1)
e.appendChild(text)
e.toxml()
u'<abc>&lt;hello&gt;bhaskar&lt;/hello&gt;</abc>'
Tried other ways, but with the same results.
The only way I could work around it was by overriding the writer:
import xml.dom.minidom as md
# XXX Hack to handle '&lt;' and '&gt;'
def wd(writer, data):
    data = data.replace("&lt;", "<").replace("&gt;", ">")
    writer.write(data)
md._write_data = wd
Edit - This is the code.
import xml.dom.minidom as md
doc = md.Document()
entity_descr = doc.createElement("EntityDescriptor")
doc.appendChild(entity_descr)
entity_descr.setAttribute('xmlns', 'urn:oasis:names:tc:SAML:2.0:metadata')
entity_descr.setAttribute('xmlns:saml', 'urn:oasis:names:tc:SAML:2.0:assertion')
entity_descr.setAttribute('xmlns:ds', 'http://www.w3.org/2000/09/xmldsig#')
# Get the entity_id from saml20_idp_settings
entity_descr.setAttribute('entityID', self.group['entity_id'])
idpssodescr = doc.createElement('IDPSSODescriptor')
idpssodescr.setAttribute('WantAuthnRequestsSigned', 'true')
idpssodescr.setAttribute('protocolSupportEnumeration',
'urn:oasis:names:tc:SAML:2.0:protocol')
entity_descr.appendChild(idpssodescr)
keydescr = doc.createElement('KeyDescriptor')
keydescr.setAttribute('use', 'signing')
idpssodescr.appendChild(keydescr)
keyinfo = doc.createElement('ds:KeyInfo')
keyinfo.setAttribute('xmlns:ds', 'http://www.w3.org/2000/09/xmldsig#')
keydescr.appendChild(keyinfo)
x509data = doc.createElement('ds:X509Data')
keyinfo.appendChild(x509data)
# check this part
s = "this is a cert blah blah"
x509cert = doc.createElement('ds:X509Certificate')
cert = doc.createTextNode(s)
x509cert.appendChild(cert)
x509data.appendChild(x509cert)
sso = doc.createElement('SingleSignOnService')
sso.setAttribute('Binding', 'urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect')
sso.setAttribute('Location', 'http://googleapps/singleSignOn')
idpssodescr.appendChild(sso)
# Write the metadata file.
fobj = open('metadata.xml', 'w')
doc.writexml(fobj, " ", "", "\n", "UTF-8")
fobj.close()
This produces
<?xml version="1.0" encoding="UTF-8"?>
<EntityDescriptor entityID="skar" xmlns="urn:oasis:names:tc:SAML:2.0:metadata"
xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
<IDPSSODescriptor WantAuthnRequestsSigned="true"
protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
<KeyDescriptor use="signing">
<ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
<ds:X509Data>
<ds:X509Certificate>
this is a cert blah blah
</ds:X509Certificate>
</ds:X509Data>
</ds:KeyInfo>
</KeyDescriptor>
<SingleSignOnService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
Location="http:///singleSignOn"/>
</IDPSSODescriptor>
</EntityDescriptor>
Note that "this is a cert blah blah" comes out separately.
I have broken my head over this, but with the same results.
This is not a bug, it is a feature. To insert actual XML, insert DOM objects instead. Text inside an XML tag needs to be entity escaped though to be valid XML.
from xml.dom.minidom import Document
doc = Document()
e = doc.createElement("abc")
eh = doc.createElement("hello")
s1 = 'bhaskar'
text = doc.createTextNode(s1)
eh.appendChild(text)
e.appendChild(eh)
e.toxml()
EDIT: I don't know what Python's API is like, but it looks very similar to C#'s, so you might be able to do something like e.innerXml = s1 to do what you're trying to do... but that could be bad. The better thing to do is parse it and appendChild it as well.
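The parse-then-append suggestion can be sketched in minidom like this; importNode copies the parsed element into the target document before appending:

```python
from xml.dom.minidom import Document, parseString

doc = Document()
e = doc.createElement("abc")

# Parse the fragment separately, then graft it into the target document
frag = parseString("<hello>bhaskar</hello>")
node = doc.importNode(frag.documentElement, True)  # deep copy into doc
e.appendChild(node)

print(e.toxml())  # <abc><hello>bhaskar</hello></abc>
```

This yields real child elements rather than escaped text, which is exactly the distinction the answer is drawing.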
EDIT 2: I just ran this via Python locally, and there's definitely something wrong on your end, not in the libraries. Make sure that your string doesn't have any newlines or whitespace at the start of it. For reference, the test code I used was:
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.dom.minidom import Document
>>> cert = "---- START CERTIFICATE ----\n Hello world\n---- END CERTIFICATE ---"
>>> doc = Document()
>>> e = doc.createElement("cert")
>>> certEl = doc.createTextNode(cert)
>>> e.appendChild(certEl)
<DOM Text node "'---- START'...">
>>> print e.toxml()
<cert>---- START CERTIFICATE ----
Hello world
---- END CERTIFICATE ---</cert>
>>>
EDIT 3: The final edit. The problem is in your writexml call. Simply using the following fixes this:
doc.writexml(fobj)
# or
doc.writexml(fobj, "", " ", "")
Unfortunately, it seems that you won't be able to use the newline parameter to get pretty printing, though... it seems that the Python library (or at least minidom) is written rather poorly and will modify TextNodes while printing them. Not so much a poor implementation as a naive one. A shame, really...
If you use "<" as text in XML, you need to escape it, else it is considered markup. So xml.dom is right in escaping it, since you've asked for a text node.
Assuming you really want to insert a piece of XML, I recommend to use createElement("hello"). If you have a fragment of XML that you don't know the structure of, you should first parse it, and then move the nodes of that parse result into the other tree.
If you want to hack, you can inherit from xml.dom.minidom.Text, and overwrite the writexml method. See the source of minidom for details.
