How to parse an XML file and get its data Python

I have the following web service: 'https://news.google.com/news/rss/?ned=us&hl=en'
I need to parse it and get the title and date values of each item in the XML file.
I tried saving the data to an XML file and parsing that, but I only see blank values:
import requests
import xml.etree.ElementTree as ET
response = requests.get('https://news.google.com/news/rss/?ned=us&hl=en')
with open('text.xml','w') as xmlfile:
    xmlfile.write(response.text)
with open('text.xml','rt') as f:
    tree = ET.parse(f)
for node in tree.iter():
    print (node.tag, node.attrib)
I am not sure where I am going wrong. I need to extract the title and published date of every item in the XML.
Thanks in advance for any answers.

@Ilja Everilä is right: you should use feedparser.
There is certainly no need to write an XML file, unless you want to archive it.
I didn't really get what output you expected, but something like this works (Python 3):
import feedparser
url = 'https://news.google.com/news/rss/?ned=us&hl=en'
d = feedparser.parse(url)
# print the feed title
print(d['feed']['title'])
# print (title, tag) tuples, one per entry
print([(e['title'], e['tags'][0]['term']) for e in d['entries']])
To explicitly print them as UTF-8 encoded strings, use:
print([(e['title'].encode('utf8'), e['tags'][0]['term'].encode('utf8')) for e in d['entries']])
Maybe if you show your expected output, we could help you to get the right content from the parser.
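If you would rather stay with the standard library, here is a minimal sketch of pulling the title and date out of each item with xml.etree.ElementTree. It assumes the feed is plain RSS 2.0, where each item has title and pubDate children; SAMPLE_RSS and extract_items are made-up names, and the sample stands in for the downloaded feed so the snippet runs without network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the downloaded feed (assumed RSS 2.0 layout).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Top stories</title>
    <item>
      <title>First headline</title>
      <pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <pubDate>Mon, 01 Jan 2018 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def extract_items(xml_text):
    """Return a list of (title, pubDate) tuples, one per <item>."""
    root = ET.fromstring(xml_text)
    # The data lives in the element text, not in node.attrib,
    # which is why printing tag and attrib shows nothing useful.
    return [
        (item.findtext("title"), item.findtext("pubDate"))
        for item in root.iter("item")
    ]


items = extract_items(SAMPLE_RSS)
for title, date in items:
    print(title, "|", date)
```

With requests, the same function could be fed the response body directly, so the intermediate file is not needed.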

Related

How to display HTML using LXML in Python

So what I am trying to achieve is really simple.
I want to run python test.py, then open localhost in my browser and see the HTML result. However, I keep getting the error ValueError: Invalid tag name u'<html><body><h1>Test!</h1></body></html>'
Below is my code. What's the problem here?
import lxml.etree as ETO
html = ETO.Element("<html><body><h1>Test!</h1></body></html>")
self.wfile.write(ETO.tostring(html, xml_declaration=False, pretty_print=True))

You have to create each element in turn, and put them in the structure that you want them to have:
html = ETO.Element('html')
body = ETO.SubElement(html, 'body')
h1 = ETO.SubElement(body, 'h1')
h1.text = 'Test!'
Then ETO.tostring(html) will return a bytestring that looks like this:
>>> ETO.tostring(html)
b'<html><body><h1>Test!</h1></body></html>'

Since you already have the complete markup as a string, Element isn't useful here (it expects a tag name, not markup); try changing this
html = ETO.Element("<html><body><h1>Test!</h1></body></html>")
to this
html = ETO.fromstring("<html><body><h1>Test!</h1></body></html>")
and see if it works for you.
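The two answers above can be compared side by side. This is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in, on the assumption that Element, SubElement, fromstring and tostring behave the same way in lxml.etree for this simple case; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Option 1: build the tree element by element.
html = ET.Element("html")
body = ET.SubElement(html, "body")
h1 = ET.SubElement(body, "h1")
h1.text = "Test!"
built = ET.tostring(html)

# Option 2: parse markup that already exists as a string.
parsed_root = ET.fromstring("<html><body><h1>Test!</h1></body></html>")
parsed = ET.tostring(parsed_root)

# Both routes serialize to the same bytes.
print(built)
print(built == parsed)
```

Either result is a bytes object, so it can be passed to self.wfile.write() as in the question.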

How to set the content of a markup called "text" with lxml.objectify?

I am trying to parse an XML File in python with lxml.objectify, modify the text from <ns:text> and then put back together the XML.
This is a short overview of my XML file:
<ns:map xmlns:ns="...">
    <ns:topics>
        <ns:topic>
            <ns:text>HI</ns:text>
        </ns:topic>
        <ns:topic>
            <ns:text>HELLO</ns:text>
        </ns:topic>
    </ns:topics>
</ns:map>
I parsed the XML and retrieved the topic in topics. I tried to change the "HI" text into "test" for <ns:text> in the first topic.
from lxml import objectify as obj
from lxml import etree
root= obj.parse(xmlFileName)
root.topics.topic[0].text = 'test'
obj_xml= etree.tostring(root, encoding="unicode", pretty_print=True)
print(obj_xml)
I supposed this would be the outcome:
<ns:map xmlns:ns="...">
    <ns:topics>
        <ns:topic>
            <ns:text>test</ns:text>
        </ns:topic>
        <ns:topic>
            <ns:text>HELLO</ns:text>
        </ns:topic>
    </ns:topics>
</ns:map>
but the .text is a read-only attribute and not my <ns:text>. I can't access my <ns:text> to change the text.
I have been reading a lot of tutorials with lxml.objectify but I can't find something similar to my problem.
PS: I cannot change the name of the markup. It's a generated XML.
Thank you,
Elena

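I found the answer. The child is named text, which clashes with the element's own .text property, so it has to be reached with item access instead (a string key works like attribute access):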
root.topics.topic[0]["text"] = "test"

I am facing trouble to extract xml tags values

Problem statement
I am working with an API. I've saved its response data to an XML file and now want to extract data from it. It's a very big file with a lot of data tags in it,
but I only want to extract a few values and write them to a JSON file for the project I am working on.
A sample XML response is:
<parkingProbabilities
    xmlns:meta="http://www.tomtom.com/service/tis/parkingprobabilities/metadata/1.1"
    schemaVersion="1.1">
    <meta:metaData>
        <meta:creatorUUID>aaac93fc-ba74-102b-b5ef-00304891a58c</meta:creatorUUID>
        <meta:creationTimeUTC>2016-09-30T19:58:01</meta:creationTimeUTC>
        <meta:timeZone>Europe/Berlin</meta:timeZone>
        <meta:cityName>Berlin</meta:cityName>
        <meta:countryCode>DE</meta:countryCode>
        <meta:description>Example showing parking probability and search time profile</meta:description>
    </meta:metaData>
    <roadSegment>
        <uuid>00000000-069f-6d7a-017f-78b7f701185b</uuid>
        <parkingDataProfile>
            <dailyProfile>
                <weekdays>
                    <day>MON</day>
                    <day>TUE</day>
                    <day>WED</day>
                    <day>THU</day>
                    <day>FRI</day>
                </weekdays>
                <hourlyData>
                    <hourOfDay>0</hourOfDay>
                    <parkingProbability>0.10</parkingProbability>
                    <averageSearchTime>12</averageSearchTime>
                </hourlyData>
                <hourlyData>
                    <hourOfDay>1</hourOfDay>
                    <parkingProbability>0.10</parkingProbability>
                    <averageSearchTime>11</averageSearchTime>
                </hourlyData>
                <hourlyData>
                    <hourOfDay>2</hourOfDay>
                    <parkingProbability>0.10</parkingProbability>
                    <averageSearchTime>10</averageSearchTime>
                </hourlyData>
                <!-- usually contains more -->
                <!-- some time slots could be missing -->
                <hourlyData>
                    <hourOfDay>23</hourOfDay>
                    <parkingProbability>0.10</parkingProbability>
                    <averageSearchTime>9</averageSearchTime>
                </hourlyData>
            </dailyProfile>
            <!-- could contain more -->
        </parkingDataProfile>
    </roadSegment>
    <!-- many more -->
</parkingProbabilities>
Expected output:
The values of all hourlyData tags from each dailyProfile node.
Code tried so far:
from xml.dom import minidom
mydoc = minidom.parse('data_file.xml')
hourly_data = mydoc.getElementsByTagName("hourlyData")
for data in hourly_data:
    print(data.nodeValue)
Sorry if I am making an obvious mistake.
Output:
I am getting None printed on the screen.

Try this. An element node's nodeValue is always None; the text lives in a child text node, so you need to go down to that node to get the data.
from xml.dom import minidom
mydoc = minidom.parse('data_file.xml')
hourly_data = mydoc.getElementsByTagName("hourlyData")
for data in hourly_data:
    print(data.getElementsByTagName("parkingProbability")[0].childNodes[0].data)
    print(data.getElementsByTagName("averageSearchTime")[0].childNodes[0].data)
    print(data.getElementsByTagName("hourOfDay")[0].childNodes[0].data)
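Since the question also asks for JSON, here is one way the same minidom lookups could be collected into records and dumped. It is only a sketch: SAMPLE is a cut-down, made-up version of the response, FIELDS, text_of and hourly_records are hypothetical names, and a missing child is mapped to None because the question notes that some data can be absent.

```python
import json
from xml.dom import minidom

# Hypothetical, cut-down sample; the second block lacks averageSearchTime on purpose.
SAMPLE = """<parkingProbabilities>
  <roadSegment>
    <parkingDataProfile>
      <dailyProfile>
        <hourlyData>
          <hourOfDay>0</hourOfDay>
          <parkingProbability>0.10</parkingProbability>
          <averageSearchTime>12</averageSearchTime>
        </hourlyData>
        <hourlyData>
          <hourOfDay>1</hourOfDay>
          <parkingProbability>0.10</parkingProbability>
        </hourlyData>
      </dailyProfile>
    </parkingDataProfile>
  </roadSegment>
</parkingProbabilities>"""

FIELDS = ("hourOfDay", "parkingProbability", "averageSearchTime")


def text_of(parent, tag):
    """Return the text of the first <tag> under parent, or None if absent or empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.data


def hourly_records(xml_text):
    """Build one dict per <hourlyData> element."""
    doc = minidom.parseString(xml_text)
    return [
        {field: text_of(hourly, field) for field in FIELDS}
        for hourly in doc.getElementsByTagName("hourlyData")
    ]


records = hourly_records(SAMPLE)
print(json.dumps(records, indent=2))
```

The values stay strings here; converting them to int or float before dumping is a separate choice that depends on what the project needs.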

How to remove all " \n" in xml payload by using lxml library

I'm trying to change a text value in an XML file and return the updated XML content using the lxml library. I can update the value successfully, but the updated XML contains "\n" (newline) characters, as shown below.
Output:
<?xml version='1.0' encoding='ASCII'?>\n<Order>\n <content>\n <sID>123</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
<content>\n <sID>111</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
</Order>
Note: I didn't format the XML output above; I posted it exactly as it appears in the output console.
Input:
<Order>
    <content>
        <sID>123</sID>
        <spNumber>UserTemp</spNumber>
        <client>WALLMART</client>
        <orderType>Dashboard</orderType>
    </content>
    <content>
        <sID>111</sID>
        <spNumber>UserTemp</spNumber>
        <client>D&B</client>
        <orderType>Dashboard</orderType>
    </content>
</Order>
I also tried to remove the \n characters from the output XML using
getValue = getValue.replace('\n','')
but had no luck.
Below is the code I used to update the XML (the client tag) and to return the updated XML content.
Python Code:
from lxml import etree
from io import StringIO
import six
import numpy
def getListOfNodes(location):
    f = open(location)
    xml = f.read()
    f.close()
    #print(xml)
    getXml = etree.parse(location)
    for elm in getXml.xpath('.//Order//content/client'):
        index='ARRCHANA'
        elm.text=index
    #with open('C:\\New folder\\temp.xml','w',newline='\r\n') as writeFile:
        #writeFile.write(str(etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    getValue=str((etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    #getValue = getValue.replace('\n','')
    #getValue=getValue.replace("\n","<br/>")
    print(getValue)
    return getValue
When I try to open the response payload in Firefox, it shows the following error message:
XML Parsing Error: no element found Location:
file:///C:/New%20folder/Confidential.xml
Line Number 1, Column 1:
In other words, it reports "no element found" at line 1, column 1 of the XML file when the file contains the "\n" characters.
Can somebody suggest a better way to update the text value and return the XML without any additional characters?

I fixed it myself using the script below:
code = root.xpath('.//Order//content/client')
if code:
    code[0].text = 'ARRCHANA'
    # raw string, so that \t in the path is not read as a tab character
    etree.ElementTree(root).write(r'D:\test.xml', pretty_print=True)
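One possible explanation for the literal "\n" in the question's output, assuming the code runs on Python 3: tostring() returns bytes, and calling str() on bytes yields the printable representation (with the b'...' wrapper and escaped newlines) rather than the text itself. The sketch below illustrates this with the standard library's xml.etree.ElementTree as a stand-in for lxml; SAMPLE and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the input from the question.
SAMPLE = """<Order>
  <content>
    <client>WALLMART</client>
  </content>
  <content>
    <client>OTHER</client>
  </content>
</Order>"""

root = ET.fromstring(SAMPLE)
for client in root.iter("client"):
    client.text = "ARRCHANA"

raw = ET.tostring(root)        # bytes
wrong = str(raw)               # repr of the bytes: starts with b' and shows \n as two characters
right = raw.decode("ascii")    # the actual text, with real line breaks

print(wrong)
print(right)
```

This would also explain why replace('\n', '') had no effect: the string held a backslash followed by the letter n, not a newline character. Writing the tree with write(), as in the script above, avoids the issue because no str() conversion is involved.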

Python XML parsing from website

I am trying to parse XML from a website and I am stuck. The XML is provided below. I have two questions: what is the best way to read XML from a website, and how do I dig into the XML to get the rate I need?
The figure I need back is base:OBS_VALUE 0.12
What I have so far:
from xml.dom import minidom
import urllib
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
web = urllib.urlopen(document)
get_web = web.read()
xmldoc = minidom.parseString(document)
ff_DataSet = xmldoc.getElementsByTagName('ff:DataSet')[0]
ff_series = ff_DataSet.getElementsByTagName('ff:Series')[0]
for line in ff_series:
    price = line.getElementsByTagName('base:OBS_VALUE')[0].firstChild.data
    print(price)
XML code from website:
<Header>
    <ID>FFD</ID>
    <Test>false</Test>
    <Name xml:lang="en">Federal Funds daily averages</Name>
    <Prepared>2013-05-08</Prepared>
    <Sender id="FRBNY">
        <Name xml:lang="en">Federal Reserve Bank of New York</Name>
        <Contact>
            <Name xml:lang="en">Public Information Web Team</Name>
            <Email>ny.piwebteam@ny.frb.org</Email>
        </Contact>
    </Sender>
    <!--ReportingBegin></ReportingBegin-->
</Header>
<ff:DataSet>
    <ff:Series TIME_FORMAT="P1D" DISCLAIMER="G" FF_METHOD="D" DECIMALS="2" AVAILABILITY="A">
        <ffbase:Key>
            <base:FREQ>D</base:FREQ>
            <base:RATE>FF</base:RATE>
            <base:MATURITY>O</base:MATURITY>
            <ffbase:FF_SCOPE>D</ffbase:FF_SCOPE>
        </ffbase:Key>
        <ff:Obs OBS_CONF="F" OBS_STATUS="A">
            <base:TIME_PERIOD>2013-05-07</base:TIME_PERIOD>
            <base:OBS_VALUE>0.12</base:OBS_VALUE>

If you wanted to stick with xml.dom.minidom, try this...
from xml.dom import minidom
from urllib.request import urlopen  # Python 3; on Python 2 this was urllib.urlopen
url_str = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
xml_str = urlopen(url_str).read()
xmldoc = minidom.parseString(xml_str)
obs_values = xmldoc.getElementsByTagName('base:OBS_VALUE')
# prints the first base:OBS_VALUE it finds
print(obs_values[0].firstChild.nodeValue)
# prints the second base:OBS_VALUE it finds
print(obs_values[1].firstChild.nodeValue)
# prints all base:OBS_VALUE in the XML document
for obs_val in obs_values:
    print(obs_val.firstChild.nodeValue)
However, if you want to use lxml, use underrun's solution. Also, your original code had some errors. You were actually attempting to parse the document variable, which was the web address. You needed to parse the xml returned from the website, which in your example is the get_web variable.

Take a look at your code:
document = ('http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily''r')
web = urllib.urlopen(document)
get_web = web.read()
xmldoc = minidom.parseString(document)
I'm not sure you have document correct unless you want http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=dailyr because that's what you'll get (the parens group in this case and strings listed next to each other automatically concatenate).
After that you do some work to create get_web but then you don't use it in the next line. Instead you try to parse your document which is the url...
Beyond that, I would totally suggest you use ElementTree, preferably lxml's ElementTree (http://lxml.de/). Also, lxml's etree parser takes a file-like object which can be a urllib object. If you did, after straightening out the rest of your doc, you could do this:
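from lxml import etree
from urllib.request import urlopen  # Python 3; on Python 2 this was urllib.urlopen
url = 'http://www.newyorkfed.org/markets/omo/dmm/fftoXML.cfm?type=daily'
tree = etree.parse(urlopen(url))
# Assumption: the ff and base prefixes are declared on the root element, so its
# nsmap can supply the prefix-to-URI mapping. The default namespace (key None)
# is dropped because XPath cannot use an empty prefix.
ns = {prefix: uri for prefix, uri in tree.getroot().nsmap.items() if prefix}
for obs in tree.xpath('//ff:Obs', namespaces=ns):
    # xpath() returns a list, so read the child's text with findtext() instead
    price = obs.findtext('base:OBS_VALUE', namespaces=ns)
    print(price)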
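For comparison, here is a self-contained sketch of the same namespace-aware lookup with the standard library's xml.etree.ElementTree, run against an inline sample so it works offline. The namespace URIs, the Root wrapper element and the second observation are made up for illustration; the real feed declares its own URIs, which would have to be copied from the actual document.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; replace them with the ones the real feed declares.
NS = {"ff": "http://example.com/ff", "base": "http://example.com/base"}

# Hypothetical sample that mirrors the structure shown in the question.
SAMPLE = """<Root xmlns:ff="http://example.com/ff" xmlns:base="http://example.com/base">
  <ff:DataSet>
    <ff:Series>
      <ff:Obs OBS_STATUS="A">
        <base:TIME_PERIOD>2013-05-07</base:TIME_PERIOD>
        <base:OBS_VALUE>0.12</base:OBS_VALUE>
      </ff:Obs>
      <ff:Obs OBS_STATUS="A">
        <base:TIME_PERIOD>2013-05-06</base:TIME_PERIOD>
        <base:OBS_VALUE>0.14</base:OBS_VALUE>
      </ff:Obs>
    </ff:Series>
  </ff:DataSet>
</Root>"""


def observations(xml_text):
    """Return (TIME_PERIOD, OBS_VALUE) pairs for every ff:Obs element."""
    root = ET.fromstring(xml_text)
    return [
        (
            obs.findtext("base:TIME_PERIOD", namespaces=NS),
            obs.findtext("base:OBS_VALUE", namespaces=NS),
        )
        for obs in root.iterfind(".//ff:Obs", NS)
    ]


pairs = observations(SAMPLE)
for period, value in pairs:
    print(period, value)
```

Matching on the namespace URI rather than on the literal prefix string means the lookup keeps working even if the feed changes the prefixes it uses.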
