Python XML parsing from website

I am trying to parse XML from a website and I am stuck. The XML is shown below. I have two questions: what is the best way to read XML from a website, and how do I dig into the XML to get the rate I need?
The figure I need back is the base:OBS_VALUE, which is 0.12.
What I have so far:
from xml.dom import minidom
import urllib
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
web = urllib.urlopen(document)
get_web = web.read()
xmldoc = minidom.parseString(document)
ff_DataSet = xmldoc.getElementsByTagName('ff:DataSet')[0]
ff_series = ff_DataSet.getElementsByTagName('ff:Series')[0]
for line in ff_series:
    price = line.getElementsByTagName('base:OBS_VALUE')[0].firstChild.data
    print(price)
XML code from website:
<Header> <ID>FFD</ID>
<Test>false</Test>
<Name xml:lang="en">Federal Funds daily averages</Name> <Prepared>2013-05-08</Prepared>
<Sender id="FRBNY"> <Name xml:lang="en">Federal Reserve Bank of New York</Name>
<Contact>
<Name xml:lang="en">Public Information Web Team</Name> <Email>ny.piwebteam@ny.frb.org</Email>
</Contact>
</Sender>
<!--ReportingBegin></ReportingBegin-->
</Header>
<ff:DataSet> <ff:Series TIME_FORMAT="P1D" DISCLAIMER="G" FF_METHOD="D" DECIMALS="2" AVAILABILITY="A">
<ffbase:Key>
<base:FREQ>D</base:FREQ>
<base:RATE>FF</base:RATE>
<base:MATURITY>O</base:MATURITY>
<ffbase:FF_SCOPE>D</ffbase:FF_SCOPE>
</ffbase:Key>
<ff:Obs OBS_CONF="F" OBS_STATUS="A">
<base:TIME_PERIOD>2013-05-07</base:TIME_PERIOD>
<base:OBS_VALUE>0.12</base:OBS_VALUE>

If you wanted to stick with xml.dom.minidom, try this...
from xml.dom import minidom
import urllib
url_str = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
xml_str = urllib.urlopen(url_str).read()
xmldoc = minidom.parseString(xml_str)
obs_values = xmldoc.getElementsByTagName('base:OBS_VALUE')
# prints the first base:OBS_VALUE it finds
print obs_values[0].firstChild.nodeValue
# prints the second base:OBS_VALUE it finds
print obs_values[1].firstChild.nodeValue
# prints all base:OBS_VALUE in the XML document
for obs_val in obs_values:
    print obs_val.firstChild.nodeValue
However, if you want to use lxml, use underrun's solution. Also, your original code had some errors. You were actually attempting to parse the document variable, which was the web address. You needed to parse the xml returned from the website, which in your example is the get_web variable.
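For anyone on Python 3, here is a minimal sketch of the same minidom approach. It parses an inline stand-in for the feed so it runs without network access. The wrapper element name FFData and the two urn:example namespace URIs are placeholders I made up, because the question does not show the real declarations. The helper name obs_values is also mine.

```python
from xml.dom import minidom

# Inline stand-in for the downloaded feed. The FFData wrapper and the
# urn:example:* namespace URIs are assumptions, not the real feed's values.
SAMPLE = """<FFData xmlns:ff="urn:example:ff" xmlns:base="urn:example:base">
  <ff:DataSet>
    <ff:Series>
      <ff:Obs OBS_CONF="F" OBS_STATUS="A">
        <base:TIME_PERIOD>2013-05-07</base:TIME_PERIOD>
        <base:OBS_VALUE>0.12</base:OBS_VALUE>
      </ff:Obs>
    </ff:Series>
  </ff:DataSet>
</FFData>"""


def obs_values(xml_text):
    """Return the text of every base:OBS_VALUE element, in document order."""
    doc = minidom.parseString(xml_text)
    nodes = doc.getElementsByTagName("base:OBS_VALUE")
    return [node.firstChild.nodeValue for node in nodes]


print(obs_values(SAMPLE))
```

For the live feed in Python 3, the text passed to obs_values would come from urllib.request.urlopen(url).read() instead of SAMPLE.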

Take a look at your code:
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
web = urllib.urlopen(document)
get_web = web.read()
xmldoc = minidom.parseString(document)
I'm not sure you have document correct, unless you really want http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=dailyr, because that's what you'll get. The parentheses only group here, and string literals written next to each other are concatenated automatically.
After that you do some work to create get_web, but then you don't use it in the next line. Instead you try to parse document, which is the URL...
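To see the concatenation for yourself, here is a small sketch using the exact string from the question. It needs no network access; the variable name url is only there for comparison.

```python
# Adjacent string literals are joined into one string; the parentheses only group.
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
print(document)

# The intended URL needs neither the parentheses nor the trailing 'r'.
url = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
print(document == url + 'r')
```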
Beyond that, I would totally suggest you use ElementTree, preferably lxml's ElementTree (http://lxml.de/). Also, lxml's etree parser takes a file-like object which can be a urllib object. If you did, after straightening out the rest of your doc, you could do this:
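from lxml import etree
import urllib
url = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
root = etree.parse(urllib.urlopen(url))
# Match on local names: the ff:/base: namespace URIs are declared in a part
# of the feed that is not shown in the question.
for obs in root.xpath("//*[local-name()='Obs']"):
    # xpath() returns a list, so take the first match before reading .text
    price = obs.xpath("./*[local-name()='OBS_VALUE']")[0].text
    print(price)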

Related

Extracting all sub-tags' values in a main tag for XML File no matter the tag name

<GPO>
  <Computer>
    <ExtensionData>
      <Extension xmlns:q1="http://www.microsoft.com/GroupPolicy/Settings/Security"
                 xsi:type="q1:SecuritySettings">
        <q1:Account>
          <q1:Name>ClearTextPassword</q1:Name>
          <q1:SettingBoolean>false</q1:SettingBoolean>
          <q1:Type>Password</q1:Type>
        </q1:Account>
        <q1:Account>
          <q1:Name>MaximumPasswordAge</q1:Name>
          <q1:SettingNumber>120</q1:SettingNumber>
          <q1:Type>Password</q1:Type>
        </q1:Account>
      </Extension>
    </ExtensionData>
  </Computer>
</GPO>
Hi, this is my current XML file, saved as C:\XMLFile.xml. How can I change the code to extract the values of all the sub-tags inside each <q1:Account> tag using Python 3.8, instead of selecting them one tag name at a time? I have no prior experience with parsing and reading XML in Python.
This is my code so far:
from xml.dom import minidom
xmlFile = minidom.parse("C:\GPOReportAD.xml")
computer = xmlFile.getElementsByTagName("Computer")[0]
extensionData = computer.getElementsByTagName("ExtensionData")[0]
for i in extensionData.getElementsByTagName("q1:Name"):
    for x in extensionData.getElementsByTagName("q1:SettingBoolean"):
        print("Result: " + i.firstChild.data + " " + x.firstChild.data)
        break
Expected Output:
ClearTextPassword false
MaximumPasswordAge 120

You are dealing with an xml snippet which uses namespaces, making things a bit complicated. Best way to approach it, I believe, is to use the html (not xml) parser from lxml and use xpath to select the values:
import lxml.html as lh
gpo = """[your snippet above]"""
doc = lh.fromstring(gpo)
# either:
for i in doc.xpath(".//*[local-name()='name']"):
    print(i.text, i.xpath('./following-sibling::*[1]/text()')[0])
# or use this loop header instead of the one above:
# for i in doc.xpath(".//name", namespaces={'q1': 'http://www.microsoft.com/GroupPolicy/Settings/Security'}):
Output:
ClearTextPassword false
MaximumPasswordAge 120
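If you would rather stay with the standard library, here is a minimal namespace-aware sketch using xml.etree.ElementTree instead of lxml. It collects the text of every child of each q1:Account, whatever the child is called. Two things are my assumptions: the xmlns:xsi declaration on the root (the snippet uses xsi:type without showing where xsi is declared, and ElementTree rejects an undeclared prefix), and the helper name account_values.

```python
import xml.etree.ElementTree as ET

# Copy of the snippet from the question. The xmlns:xsi declaration is an
# assumption: the excerpt does not show where the xsi prefix is declared.
GPO_XML = """<GPO xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Computer>
    <ExtensionData>
      <Extension xmlns:q1="http://www.microsoft.com/GroupPolicy/Settings/Security"
                 xsi:type="q1:SecuritySettings">
        <q1:Account>
          <q1:Name>ClearTextPassword</q1:Name>
          <q1:SettingBoolean>false</q1:SettingBoolean>
          <q1:Type>Password</q1:Type>
        </q1:Account>
        <q1:Account>
          <q1:Name>MaximumPasswordAge</q1:Name>
          <q1:SettingNumber>120</q1:SettingNumber>
          <q1:Type>Password</q1:Type>
        </q1:Account>
      </Extension>
    </ExtensionData>
  </Computer>
</GPO>"""

# Prefix-to-URI map used only for the search path below.
NS = {"q1": "http://www.microsoft.com/GroupPolicy/Settings/Security"}


def account_values(xml_text):
    """Return one list of child text values per q1:Account element."""
    root = ET.fromstring(xml_text)
    rows = []
    for account in root.iterfind(".//q1:Account", NS):
        # Iterating an element yields its direct children, whatever their tag.
        rows.append([child.text for child in account])
    return rows


for row in account_values(GPO_XML):
    print(" ".join(row))
```

Each row also contains the q1:Type value; slice the row (for example row[:2]) if only the name and the setting are wanted.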

How to display HTML using LXML in Python

So what I am trying to achieve is really simple.
I want to call python test.py and would like to go to my local host and see the html result. However I keep getting an error ValueError: Invalid tag name u'<html><body><h1>Test!</h1></body></html>'
Below is my code. What's the problem here?
import lxml.etree as ETO
html = ETO.Element("<html><body><h1>Test!</h1></body></html>")
self.wfile.write(ETO.tostring(html, xml_declaration=False, pretty_print=True))

You have to create each element in turn, and put them in the structure that you want them to have:
html = ETO.Element('html')
body = ETO.SubElement(html, 'body')
h1 = ETO.SubElement(body, 'h1')
h1.text = 'Test!'
Then ETO.tostring(html) will return a bytestring that looks like this:
>>> ETO.tostring(html)
b'<html><body><h1>Test!</h1></body></html>'

Since you are parsing markup that already exists rather than building a tree, Element isn't useful here; try changing this
html = ETO.Element("<html><body><h1>Test!</h1></body></html>")
to this
html = ETO.fromstring("<html><body><h1>Test!</h1></body></html>")
and see if it works for you.
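Here is a short sketch that puts the two approaches side by side. I've used the standard library's xml.etree.ElementTree rather than lxml so it runs without extra packages; Element, SubElement, fromstring and tostring are used the same way here. The function name build_page is just for illustration.

```python
import xml.etree.ElementTree as ET


def build_page(heading):
    """Build <html><body><h1>heading</h1></body></html> one element at a time."""
    html = ET.Element("html")          # Element() takes a tag name, not markup
    body = ET.SubElement(html, "body")
    h1 = ET.SubElement(body, "h1")
    h1.text = heading
    return html


built = build_page("Test!")
# fromstring() is the call that accepts a whole markup string.
parsed = ET.fromstring("<html><body><h1>Test!</h1></body></html>")

print(ET.tostring(built))
print(ET.tostring(built) == ET.tostring(parsed))
```

Both routes end up with the same tree, so the choice only depends on whether you start from markup or from data.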

How to set the content of a markup called "text" with lxml.objectify?

I am trying to parse an XML File in python with lxml.objectify, modify the text from <ns:text> and then put back together the XML.
This is a short overview of my XML file:
<ns:map xmlns:ns="...">
  <ns:topics>
    <ns:topic>
      <ns:text>HI</ns:text>
    </ns:topic>
    <ns:topic>
      <ns:text>HELLO</ns:text>
    </ns:topic>
  </ns:topics>
</ns:map>
I parsed the XML and retrieved the topic in topics. I tried to change the "HI" text into "test" for <ns:text> in the first topic.
from lxml import objectify as obj
from lxml import etree
root= obj.parse(xmlFileName)
root.topics.topic[0].text = 'test'
obj_xml= etree.tostring(root, encoding="unicode", pretty_print=True)
print(obj_xml)
I supposed this would be the outcome:
<ns:map xmlns:ns="...">
  <ns:topics>
    <ns:topic>
      <ns:text>test</ns:text>
    </ns:topic>
    <ns:topic>
      <ns:text>HELLO</ns:text>
    </ns:topic>
  </ns:topics>
</ns:map>
but .text here is the element's own read-only text attribute, not my <ns:text> child, so I can't reach <ns:text> this way to change its text.
I have been reading a lot of tutorials with lxml.objectify but I can't find something similar to my problem.
PS: I cannot change the name of the markup. It's a generated XML.
Thank you,
Elena

I found the answer. I can access and change my text doing this:
root.topics.topic[0]["text"] = "test"

Generating one XML file from another using Python

How would one go about generating a local XML file from a URL?
I need to be able to select certain values from the remote XML file and place them into a local one. Currently I only have snippets.
from xml.dom import minidom
from urllib.request import urlopen
import requests
url = 'http://url.php'
private_list = [16735,7456,18114]
xmldoc = minidom.parse(urlopen(url))
public_list = xmldoc.getElementsByTagName('server')
for public_server in public_list:
    for private_server in private_list:
        if (public_server.attributes['id'].value) == str(private_server):
            print("Found one!")
Sadly, that is about as far as I can get. I am able to grab the correct number of elements from the public list, but I am not sure how to take those elements and build a local copy of them.
Can anyone assist?
Edit: Example
The example XML looks like this:
<settings>
<servers>
<server url="192.168.1.100" name="CentOS" id="12" host="Kirk.corporate.lan"/>
<server url="10.0.0.95" name="Ubunutu" id="4" url2="192.168.1.50" host="Spock.corporate.lan"/>
<server url="10.0.1.95" id="30" host="scotty.corporate.lan"/>
</servers>
</settings>
In this example my list will only include ID 4, so I want to take just the subelement whose id is 4. I can find it with my code above, but I don't know how to take that entire element and put it into a new file.
It looks like with xml.etree.ElementTree I can do
import xml.etree.ElementTree as ET
settings = ET.Element('settings')
servers = ET.SubElement(settings, 'servers')
That will build the base, but that's about as far as I can get successfully.
Edit 2:
Got a little further
from lxml import etree
root = etree.Element('settings')
main = etree.SubElement(root,'servers')
main = etree.SubElement(main, "server", url = "192.168.1.100", name="CentOS", id="12", host="Kirk.corporate.lan")
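One possible way to finish this, sketched with the standard library's xml.etree.ElementTree and the example XML from the question: walk the source tree, and append every server whose id is wanted to a fresh settings/servers skeleton. The function name filter_servers is hypothetical, and the inline string stands in for whatever the remote URL returns.

```python
import xml.etree.ElementTree as ET

# Example document from the question, inlined so no network access is needed.
SOURCE_XML = """<settings>
  <servers>
    <server url="192.168.1.100" name="CentOS" id="12" host="Kirk.corporate.lan"/>
    <server url="10.0.0.95" name="Ubunutu" id="4" url2="192.168.1.50" host="Spock.corporate.lan"/>
    <server url="10.0.1.95" id="30" host="scotty.corporate.lan"/>
  </servers>
</settings>"""


def filter_servers(xml_text, wanted_ids):
    """Return a new <settings> tree holding only the servers with wanted ids."""
    wanted = {str(i) for i in wanted_ids}   # attribute values are strings
    source_root = ET.fromstring(xml_text)

    new_root = ET.Element("settings")
    new_servers = ET.SubElement(new_root, "servers")
    for server in source_root.iter("server"):
        if server.get("id") in wanted:
            # The whole element, with all its attributes, goes into the new tree.
            new_servers.append(server)
    return new_root


filtered = filter_servers(SOURCE_XML, [4])
print(ET.tostring(filtered, encoding="unicode"))
```

To save the result, ET.ElementTree(filtered).write(path) writes it to a local file, where path is whatever filename you choose.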

How to parse an XML file and get its data Python

I have the web service below: 'https://news.google.com/news/rss/?ned=us&hl=en'
I need to parse it and get the title and date values of each item in the XML file.
I have tried saving the data to an XML file and parsing it, but I see all blank values:
import requests
import xml.etree.ElementTree as ET
response = requests.get('https://news.google.com/news/rss/?ned=us&hl=en')
with open('text.xml','w') as xmlfile:
    xmlfile.write(response.text)
with open('text.xml','rt') as f:
    tree = ET.parse(f)
    for node in tree.iter():
        print (node.tag, node.attrib)
I am not sure where I am going wrong. I have to somehow extract the title and published date of each and every item in the XML.
Thanks in advance for any answers.

@Ilja Everilä is right, you should use feedparser.
There is no need to write an XML file at all, unless you want to archive it.
I didn't really get what output you expected, but something like this works (Python 3):
import feedparser
url = 'https://news.google.com/news/rss/?ned=us&hl=en'
d = feedparser.parse(url)
#print the feed title
print(d['feed']['title'])
#print tuples (title, tag)
print([(d['entries'][i]['title'], d['entries'][i]['tags'][0]['term']) for i in range(len(d['entries']))] )
To explicitly print them as UTF-8 encoded byte strings, use:
print([(d['entries'][i]['title'].encode('utf8'), d['entries'][i]['tags'][0]['term'].encode('utf8')) for i in range(len(d['entries']))])
Maybe if you show your expected output, we could help you to get the right content from the parser.
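If you do want to stay with xml.etree.ElementTree, note that the values look blank in the question's loop because it prints node.attrib, and RSS keeps its data in element text rather than in attributes. Below is a minimal sketch on a made-up two-item feed (the headlines and dates are invented sample data, and titles_and_dates is a hypothetical helper name); a real feed would be passed in as response.text.

```python
import xml.etree.ElementTree as ET

# Invented sample feed in RSS 2.0 shape; not real Google News data.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <pubDate>Mon, 01 Jan 2018 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def titles_and_dates(rss_text):
    """Return a (title, pubDate) tuple for every <item> in the feed."""
    root = ET.fromstring(rss_text)
    # findtext() reads the text of the named child, which is where RSS keeps its values.
    return [(item.findtext("title"), item.findtext("pubDate"))
            for item in root.iter("item")]


for title, published in titles_and_dates(SAMPLE_RSS):
    print(title, "|", published)
```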
