How to display HTML using LXML in Python - python

What I am trying to achieve is simple: I want to run python test.py, open localhost in a browser, and see the HTML result. However, I keep getting this error:
ValueError: Invalid tag name u'<html><body><h1>Test!</h1></body></html>'
Below is my code. What's the problem here?
import lxml.etree as ETO
html = ETO.Element("<html><body><h1>Test!</h1></body></html>")
self.wfile.write(ETO.tostring(html, xml_declaration=False, pretty_print=True))

You have to create each element in turn, and put them in the structure that you want them to have:
html = ETO.Element('html')
body = ETO.SubElement(html, 'body')
h1 = ETO.SubElement(body, 'h1')
h1.text = 'Test!'
Then ETO.tostring(html) will return a bytestring that looks like this:
>>> ETO.tostring(html)
b'<html><body><h1>Test!</h1></body></html>'

Since you already have the markup as a string, Element isn't useful here: it expects a single tag name, not markup. Try changing this
html = ETO.Element("<html><body><h1>Test!</h1></body></html>")
to this
html = ETO.fromstring("<html><body><h1>Test!</h1></body></html>")
so that the string is parsed into an element tree, and see if it works for you.
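To make the difference between the two answers above concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so the snippet runs without extra packages); Element, SubElement, fromstring and tostring carry the same names in lxml.etree.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

MARKUP = "<html><body><h1>Test!</h1></body></html>"

# Option 1: build the tree element by element; Element() takes a tag name only.
built = ET.Element("html")
body = ET.SubElement(built, "body")
h1 = ET.SubElement(body, "h1")
h1.text = "Test!"

# Option 2: parse markup that already exists as a string.
parsed = ET.fromstring(MARKUP)

# Both routes serialize to the same bytes.
built_bytes = ET.tostring(built)
parsed_bytes = ET.tostring(parsed)
print(built_bytes == parsed_bytes)
```

Building by hand suits output that is generated piece by piece; parsing suits markup that already exists as text.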

Related

How to set the content of a markup called "text" with lxml.objectify?

I am trying to parse an XML File in python with lxml.objectify, modify the text from <ns:text> and then put back together the XML.
This is a short overview of my XML file:
<ns:map xmlns:ns="...">
  <ns:topics>
    <ns:topic>
      <ns:text>HI</ns:text>
    </ns:topic>
    <ns:topic>
      <ns:text>HELLO</ns:text>
    </ns:topic>
  </ns:topics>
</ns:map>
I parsed the XML and retrieved the topic in topics. I tried to change the "HI" text into "test" for <ns:text> in the first topic.
from lxml import objectify as obj
from lxml import etree
root= obj.parse(xmlFileName)
root.topics.topic[0].text = 'test'
obj_xml= etree.tostring(root, encoding="unicode", pretty_print=True)
print(obj_xml)
I supposed this would be the outcome:
<ns:map xmlns:ns="...">
  <ns:topics>
    <ns:topic>
      <ns:text>test</ns:text>
    </ns:topic>
    <ns:topic>
      <ns:text>HELLO</ns:text>
    </ns:topic>
  </ns:topics>
</ns:map>
but .text resolves to the element's own read-only text attribute rather than to my <ns:text> child element, so I can't reach <ns:text> to change its text.
I have been reading a lot of tutorials with lxml.objectify but I can't find something similar to my problem.
PS: I cannot change the name of the markup. It's a generated XML.
Thank you,
Elena
I found the answer. I can access and change my text doing this:
root.topics.topic[0]["text"] = "test"
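For comparison, the same edit can be sketched without objectify. This is a swapped-in technique, not the objectify API from the answer above: it uses the standard library's xml.etree.ElementTree and looks the child up by its qualified name. The namespace URI urn:example:map is hypothetical, since the real one is elided in the question.

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:example:map"  # hypothetical; the real URI is elided ("...") in the question
XML = f"""<ns:map xmlns:ns="{NS_URI}">
  <ns:topics>
    <ns:topic><ns:text>HI</ns:text></ns:topic>
    <ns:topic><ns:text>HELLO</ns:text></ns:topic>
  </ns:topics>
</ns:map>"""

ET.register_namespace("ns", NS_URI)  # keep the "ns" prefix when serializing
root = ET.fromstring(XML)
ns = {"ns": NS_URI}

# Look the child up by its qualified name instead of by attribute access,
# so it cannot be confused with the element's own .text attribute.
text_nodes = root.findall("ns:topics/ns:topic/ns:text", ns)
text_nodes[0].text = "test"

updated_xml = ET.tostring(root, encoding="unicode")
print(updated_xml)
```

Here .text is safe to assign because text_nodes[0] is the <ns:text> element itself, not its parent topic.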

How to parse an XML file and get its data Python

I have the following web service: 'https://news.google.com/news/rss/?ned=us&hl=en'
I need to parse it and get the title and date values of each item in the XML file.
I have tried saving the data to an XML file and parsing that, but I see only blank values:
import requests
import xml.etree.ElementTree as ET
response = requests.get('https://news.google.com/news/rss/?ned=us&hl=en')
with open('text.xml','w') as xmlfile:
    xmlfile.write(response.text)
with open('text.xml','rt') as f:
    tree = ET.parse(f)
for node in tree.iter():
    print (node.tag, node.attrib)
I am not sure where I am going wrong. I need to extract the title and published date of every item in the XML.
Thanks for any answers in advance.
@Ilja Everilä is right, you should use feedparser.
There is no need to write an XML file unless you want to archive it.
I didn't really get what output you expected, but something like this works (Python 3):
import feedparser
url = 'https://news.google.com/news/rss/?ned=us&hl=en'
d = feedparser.parse(url)
# print the feed title
print(d['feed']['title'])
# print (title, tag) tuples
print([(entry['title'], entry['tags'][0]['term']) for entry in d['entries']])
To explicitly print them as UTF-8 encoded strings, use:
print([(entry['title'].encode('utf8'), entry['tags'][0]['term'].encode('utf8')) for entry in d['entries']])
Maybe if you show your expected output, we could help you to get the right content from the parser.
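If installing feedparser is not an option, the title and date can also be read with the standard library alone. The sketch below is a minimal illustration, not the answer's method: it assumes the feed follows the usual RSS 2.0 layout (item elements with title and pubDate children) and uses a hypothetical inline sample instead of the live URL. It also shows why the original loop printed blanks: these elements have no attributes, so node.attrib is empty, while the values sit in node.text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in RSS 2.0 layout; a real feed would come from response.content.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <pubDate>Mon, 01 Jan 2018 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# node.attrib is empty for these elements; the values live in the text.
items = [
    (item.findtext("title"), item.findtext("pubDate"))
    for item in root.iter("item")
]
for title, published in items:
    print(title, "|", published)
```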

Generating XML file with proper indentation

I am trying to generate an XML file in Python, but it is not indented: the output comes out as a single line.
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
name = str(request.POST.get('name'))
top = Element('scenario')
environment = SubElement(top, 'environment')
cluster = SubElement(top, 'cluster')
cluster.text=name
I tried to pretty-print it with minidom, but it gives me this error: 'Element' object has no attribute 'read'
import xml.dom.minidom
xml_p = xml.dom.minidom.parse(top)
pretty_xml = xml_p.toprettyxml()
Is the input given to the parser in the proper format? If this is the wrong method, please suggest another way to indent.
You cannot parse top directly, because it is an Element, not a file or a string. You need to serialize it to a string first (this is what the tostring import, which you are currently not using, is for) and then call xml.dom.minidom.parseString() on the result:
import xml.dom.minidom
xml_p = xml.dom.minidom.parseString(tostring(top))
pretty_xml = xml_p.toprettyxml()
print(pretty_xml)
that gives:
<?xml version="1.0" ?>
<scenario>
	<environment/>
	<cluster>xyz</cluster>
</scenario>
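As an alternative to the minidom round trip above, newer Python versions can indent the tree in place. The sketch below assumes Python 3.9 or later, where xml.etree.ElementTree.indent is available; the value "xyz" stands in for the name read from the request.

```python
import xml.etree.ElementTree as ET

top = ET.Element("scenario")
ET.SubElement(top, "environment")
cluster = ET.SubElement(top, "cluster")
cluster.text = "xyz"  # placeholder for the name read from the request

ET.indent(top, space="  ")  # Python 3.9+; inserts whitespace into the tree in place
pretty_xml = ET.tostring(top, encoding="unicode")
print(pretty_xml)
```

Unlike toprettyxml(), this does not prepend an XML declaration unless one is requested when serializing.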

How to pass ET.dump() xml string from a Django view to a template -- python django ElementTree

I would like to format a bit of XML and pass it to a Django template. In the shell, I am able to successfully create the XML string using the following code:
locations = Location.objects.all()
industries = Industry.objects.all()
root = ET.Element("root")
for industry in industries:
    doc = ET.SubElement(root, "industry")
    doc.set("name", industry.text)
    for location in locations:
        if industry.id == location.company.industry_id:
            item = ET.SubElement(doc, "item")
            latitude = ET.SubElement(item, "latitude")
            latitude.text = str(location.latitude)
            longitude = ET.SubElement(item, "longitude")
            longitude.text = str(location.longitude)
Then, still in the shell, ET.dump(root) outputs the XML I expect.
But, how can I use ET.dump(root) to pass the XML string from a Django view to a template file?
I have tried to pass it as {{xml_items}} using 'xml_items': ET.dump(root) and I have also tried to assign ET.dump(root) to a variable and pass it like 'xml_items': xml_items.
In both cases, the template outputs None for {{xml_items}}
dump is just a debug function. You should use the tostring function:
ET.tostring(root)
which will give you exactly what ET.dump() prints, but as a string.
If you're using lxml, you can also use
ET.tostring(root, pretty_print=True)
to get better-looking XML, but if the output is only going to be consumed by another layer of code, you don't really need that. Note that pretty_print is specific to lxml; it is not available in the standard-library ElementTree.
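The reason the template showed None can be seen in a few lines. This is a minimal sketch using the standard-library ElementTree, with hypothetical sample values in place of the Location and Industry models; the context dictionary stands for whatever the view passes to the template.

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
industry = ET.SubElement(root, "industry")
industry.set("name", "Energy")  # hypothetical sample value
item = ET.SubElement(industry, "item")
ET.SubElement(item, "latitude").text = "52.52"   # hypothetical
ET.SubElement(item, "longitude").text = "13.40"  # hypothetical

dumped = ET.dump(root)  # prints to stdout and returns None
xml_items = ET.tostring(root, encoding="unicode")  # returns the XML as str

# In a view, the string is what would go into the template context.
context = {"xml_items": xml_items}
print(dumped is None, context["xml_items"])
```

On Python 3, tostring returns bytes unless encoding="unicode" is passed, so the str form is the more convenient one to hand to a template.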

Python XML parsing from website

I am trying to parse XML from a website and I am stuck. The XML it returns is shown below. I have two questions: what is the best way to read XML from a website, and how do I dig into the XML to get the rate I need?
The figure I need back is Base:OBS_VALUE 0.12
What I have so far:
from xml.dom import minidom
import urllib
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
web = urllib.urlopen(document)
get_web = web.read()
xmldoc = minidom.parseString(document)
ff_DataSet = xmldoc.getElementsByTagName('ff:DataSet')[0]
ff_series = ff_DataSet.getElementsByTagName('ff:Series')[0]
for line in ff_series:
    price = line.getElementsByTagName('base:OBS_VALUE')[0].firstChild.data
    print(price)
XML from the website:
<Header>
  <ID>FFD</ID>
  <Test>false</Test>
  <Name xml:lang="en">Federal Funds daily averages</Name>
  <Prepared>2013-05-08</Prepared>
  <Sender id="FRBNY">
    <Name xml:lang="en">Federal Reserve Bank of New York</Name>
    <Contact>
      <Name xml:lang="en">Public Information Web Team</Name>
      <Email>ny.piwebteam@ny.frb.org</Email>
    </Contact>
  </Sender>
  <!--ReportingBegin></ReportingBegin-->
</Header>
<ff:DataSet>
  <ff:Series TIME_FORMAT="P1D" DISCLAIMER="G" FF_METHOD="D" DECIMALS="2" AVAILABILITY="A">
    <ffbase:Key>
      <base:FREQ>D</base:FREQ>
      <base:RATE>FF</base:RATE>
      <base:MATURITY>O</base:MATURITY>
      <ffbase:FF_SCOPE>D</ffbase:FF_SCOPE>
    </ffbase:Key>
    <ff:Obs OBS_CONF="F" OBS_STATUS="A">
      <base:TIME_PERIOD>2013-05-07</base:TIME_PERIOD>
      <base:OBS_VALUE>0.12</base:OBS_VALUE>
If you wanted to stick with xml.dom.minidom, try this...
from xml.dom import minidom
import urllib
url_str = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
xml_str = urllib.urlopen(url_str).read()
xmldoc = minidom.parseString(xml_str)
obs_values = xmldoc.getElementsByTagName('base:OBS_VALUE')
# prints the first base:OBS_VALUE it finds
print obs_values[0].firstChild.nodeValue
# prints the second base:OBS_VALUE it finds
print obs_values[1].firstChild.nodeValue
# prints all base:OBS_VALUE in the XML document
for obs_val in obs_values:
    print obs_val.firstChild.nodeValue
However, if you want to use lxml, use underrun's solution. Also, your original code had some errors. You were actually attempting to parse the document variable, which was the web address. You needed to parse the xml returned from the website, which in your example is the get_web variable.
Take a look at your code:
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
web = urllib.urlopen(document)
get_web = web.read()
xmldoc = minidom.parseString(document)
I'm not sure document holds what you intended. The parentheses only group, and adjacent string literals are concatenated automatically, so its value is http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=dailyr, with a trailing r.
After that you do some work to create get_web but then you don't use it in the next line. Instead you try to parse your document which is the url...
Beyond that, I would totally suggest you use ElementTree, preferably lxml's ElementTree (http://lxml.de/). Also, lxml's etree parser takes a file-like object which can be a urllib object. If you did, after straightening out the rest of your doc, you could do this:
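from lxml import etree
import urllib  # Python 2; on Python 3 use urllib.request.urlopen
url = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
tree = etree.parse(urllib.urlopen(url))
# Assumes the ff and base prefixes are declared on the root element;
# XPath does not accept the None (default namespace) key, so it is dropped.
ns = {prefix: uri for prefix, uri in tree.getroot().nsmap.items() if prefix}
for obs in tree.xpath('//ff:Obs', namespaces=ns):
    # xpath() returns a list, so take the first match before reading .text
    price = obs.xpath('./base:OBS_VALUE', namespaces=ns)[0].text
    print(price)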
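The namespace handling is the part that most often goes wrong here, so a self-contained sketch may help. It uses the standard library's xml.etree.ElementTree on an inline sample rather than the live URL. The namespace URIs urn:example:ff and urn:example:base are hypothetical, because the excerpt in the question omits the xmlns declarations; the real document's URIs would have to be used instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: the excerpt in the question omits the xmlns declarations.
NS = {"ff": "urn:example:ff", "base": "urn:example:base"}
SAMPLE = """<ff:DataSet xmlns:ff="urn:example:ff" xmlns:base="urn:example:base">
  <ff:Series>
    <ff:Obs OBS_CONF="F" OBS_STATUS="A">
      <base:TIME_PERIOD>2013-05-07</base:TIME_PERIOD>
      <base:OBS_VALUE>0.12</base:OBS_VALUE>
    </ff:Obs>
  </ff:Series>
</ff:DataSet>"""

root = ET.fromstring(SAMPLE)  # for live data, pass the bytes read from the URL

# Map each observation date to its rate.
rates = {}
for obs in root.findall("ff:Series/ff:Obs", NS):
    period = obs.findtext("base:TIME_PERIOD", namespaces=NS)
    value = obs.findtext("base:OBS_VALUE", namespaces=NS)
    rates[period] = float(value)
print(rates)
```

Unlike minidom's getElementsByTagName('base:OBS_VALUE'), which matches the literal prefixed name, this lookup goes through the namespace URI, so it keeps working if the document changes its prefixes.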
