I am trying to parse a simple XML document located at http://www.webservicex.net/airport.asmx/getAirportInformationByAirportCode?airportCode=jfk using the ElementTree module. The code (so far):
import urllib2
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
from xml.etree.ElementTree import SubElement
url = "http://www.webservicex.net/airport.asmx/getAirportInformationByAirportCode?airportCode=jfk"
s = urllib2.urlopen(url)
print s
document = ElementTree.parse(s)
root = document.getroot()
print root
dataset = SubElement(root, 'NewDataSet')
print dataset
table = SubElement(dataset, 'Table')
print table
airportName = SubElement(table, 'CityOrAirportName')
print airportName.text
The final line prints "None", not the name of the airport in the XML. Can anyone assist? This should be relatively simple, but I am missing something.
Look at the documentation for that module. It says, among other things:
The SubElement() function also provides a convenient way to create new sub-elements for a given element
In particular note the word create. You are creating a new element, not reading the elements that are already there.
If you want to locate certain elements within the parsed XML, read the rest of the documentation on that page to understand how to use the library to do that.
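As a rough illustration of the difference, here is a minimal sketch in Python 3 syntax (the question's code is Python 2). The sample document is hypothetical: it only assumes the NewDataSet/Table/CityOrAirportName nesting that the question's code implies, and the real service response may be shaped differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample using the nesting the question's code expects.
SAMPLE = """<NewDataSet>
  <Table>
    <AirportCode>JFK</AirportCode>
    <CityOrAirportName>NEW YORK JOHN F KENNEDY</CityOrAirportName>
  </Table>
</NewDataSet>"""

root = ET.fromstring(SAMPLE)  # root is the <NewDataSet> element

# find() looks up an element that already exists; it returns None if absent.
name_element = root.find("Table/CityOrAirportName")
airport_name = name_element.text if name_element is not None else None

# iter() walks the whole tree, so it returns one entry per matching element.
all_names = [element.text for element in root.iter("CityOrAirportName")]

# SubElement() creates a brand-new, empty child, so its text is None.
created = ET.SubElement(ET.Element("scratch"), "CityOrAirportName")

print(airport_name)
print(all_names)
print(created.text)
```

The last line shows why the question's code printed None: every SubElement() call appended a fresh, empty element instead of reading one from the parsed document.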
The ElementTree.parse reads from a file, how can I use this if I already have the XML data in a string?
Maybe I am missing something here, but there must be a way to use the ElementTree without writing out the string to a file and reading it again.
You can parse the text as a string, which creates an Element, and create an ElementTree using that Element.
import xml.etree.ElementTree as ET
tree = ET.ElementTree(ET.fromstring(xmlstring))
I just came across this issue and the documentation, while complete, is not very straightforward on the difference in usage between the parse() and fromstring() methods.
If you're using xml.etree.ElementTree.parse to parse from a file, then you can use xml.etree.ElementTree.fromstring to parse from a string instead; it returns the root Element of the document. Often you don't actually need an ElementTree.
See xml.etree.ElementTree
You need xml.etree.ElementTree.fromstring(text):
from xml.etree.ElementTree import XML, fromstring
myxml = fromstring(text)
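To make the Element versus ElementTree distinction concrete, here is a small sketch in Python 3 syntax. The xmlstring value is made-up sample data, not something from the question.

```python
import xml.etree.ElementTree as ET

xmlstring = "<root><item>a</item><item>b</item></root>"  # illustrative data

# fromstring() returns the root Element directly.
root = ET.fromstring(xmlstring)

# Wrap it in an ElementTree only if tree-level methods such as write() are needed.
tree = ET.ElementTree(root)

items = [item.text for item in root.findall("item")]
same_root = tree.getroot() is root

print(items)
print(same_root)
```

Searching with find() and findall() works on the Element itself, which is why the wrapper is often unnecessary.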
io.StringIO is another option for getting XML into xml.etree.ElementTree:
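import io
import xml.etree.ElementTree as ET
# wrap the string in a file-like object so that parse() can read it
f = io.StringIO(xmlstring)
tree = ET.parse(f)
root = tree.getroot()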
However, it does not affect the XML declaration one would assume to be in the tree (although that is needed for ElementTree.write()). See "How to write XML declaration using xml.etree.ElementTree".
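A short sketch of that point, in Python 3 syntax with made-up sample data: the declaration in the input is not carried over on output, and write() has to be asked for one through its xml_declaration and encoding parameters.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative input that carries its own XML declaration.
xmlstring = "<?xml version='1.0' encoding='UTF-8'?><root><item>a</item></root>"

tree = ET.parse(io.StringIO(xmlstring))

# Default write(): no declaration is emitted.
plain = io.BytesIO()
tree.write(plain)

# Requesting the declaration explicitly.
declared = io.BytesIO()
tree.write(declared, encoding="utf-8", xml_declaration=True)

print(plain.getvalue())
print(declared.getvalue())
```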
I'm new to parsing XML and am stuck on code that should find all titles (title tags) in an XML document. This is what I came up with, but it returns just an empty list, while there should be titles in there.
import bz2
from xml.etree import ElementTree as etree
def parse_xml(filename):
    with bz2.BZ2File(filename) as f:
        doc = etree.parse(f)
        titles = doc.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
        print titles[:10]
Can someone tell me why this is not working properly? Just to be clear: I need to find all text inside title tags and store it in a list, taken from an XML file compressed with bz2 (as far as I have read, the best way is without decompressing it first).
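No answer to this question appears here, so the following is only a sketch of one possible cause, not a confirmed diagnosis: findall() returns an empty list when the hard-coded namespace URI does not exactly match the one in the file (for example a different export version). The sketch, in Python 3 syntax, reads the namespace from the root tag instead of hard-coding it and collects the .text of each title. The sample dump, the namespace_of helper, and the title_texts helper are all hypothetical.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; a real dump's namespace version may differ.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>First page</title></page>
  <page><title>Second page</title></page>
</mediawiki>"""


def namespace_of(element):
    """Return the '{uri}' prefix of an element's tag, or '' if it has none."""
    tag = element.tag
    return tag[: tag.index("}") + 1] if tag.startswith("{") else ""


def title_texts(fileobj):
    """Collect the text of every <title> in a bz2-compressed XML stream."""
    with bz2.BZ2File(fileobj) as f:
        root = ET.parse(f).getroot()
    ns = namespace_of(root)
    # findall()/iter() yield Element objects; .text holds the string content.
    return [element.text for element in root.iter(ns + "title")]


titles = title_texts(io.BytesIO(bz2.compress(SAMPLE)))
print(titles)
```

bz2.BZ2File also accepts a filename, as in the question, so the file does not need to be decompressed on disk first. For very large dumps, ET.iterparse may be a better fit than loading the whole tree.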
I am trying to parse the XML from YouTube that is fetched in the code below. I am trying to display all of the titles. However, when I try to print the 'title', only blank lines appear. Any advice?
#import library to do http requests:
import urllib2
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file:
file = urllib2.urlopen('http://gdata.youtube.com/feeds/api/users/buzzfeed/uploads?v=2&max-results=50')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
entry=dom.getElementsByTagName('entry')
for node in entry:
    video_title=node.getAttribute('title')
    print video_title
Title is not an attribute; it is a child element of an entry.
Here is an example of how to extract it:
for node in entry:
    video_title = node.getElementsByTagName('title')[0].firstChild.nodeValue
    print video_title
lxml can be a bit difficult to figure out, so here's a really simple beautiful soup solution (It's called beautifulsoup for a reason). You can also set up beautiful soup to use the lxml parser, so the speed is about the same.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data) # data as is seen in your code
soup.findAll('title')
This returns a list of title elements. You can also use soup.findAll('media:title') in this case to return just the media:title elements (the actual video names).
There is a small bug in your code. You access title as an attribute, although it's a child element of entry. Your code can be fixed by:
dom = parseString(data)
for node in dom.getElementsByTagName('entry'):
    print node.getElementsByTagName('title')[0].firstChild.data
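A self-contained sketch of the attribute versus child-element distinction, in Python 3 syntax. The feed below is a hypothetical stand-in for the downloaded data, and the guard for an empty title is an addition that the answers above do not include.

```python
from xml.dom.minidom import parseString

# Hypothetical feed standing in for the downloaded data.
data = """<feed>
  <entry><title>First video</title></entry>
  <entry><title>Second video</title></entry>
  <entry><title/></entry>
</feed>"""

dom = parseString(data)

titles = []
for entry in dom.getElementsByTagName("entry"):
    title_nodes = entry.getElementsByTagName("title")
    # An empty <title/> has no text node, so firstChild would be None.
    if title_nodes and title_nodes[0].firstChild is not None:
        titles.append(title_nodes[0].firstChild.data)
    else:
        titles.append("")

# Asking for a 'title' attribute returns '' because no such attribute exists.
attribute_value = dom.getElementsByTagName("entry")[0].getAttribute("title")

print(titles)
print(repr(attribute_value))
```

The empty string returned by getAttribute() is what produced the blank lines in the question.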
All I want to do is get the content of an XML tag in Python. I'm maybe using the wrong import; ideally I'd love to have the way PHP deals with XML (i.e $XML->this_tag), like the way pyodbc does database stuff (i.e. table.field)
Here's my example:
from xml.dom.minidom import parseString
dom = parseString("<test>I want to read this</test>")
dom.getElementsByTagName("test")[0].toxml()
>>> u'<test>I want to read this</test>'
All I want to be able to do is read the contents of the tag (like innerHTML in JavaScript).
Instead of dom.getElementsByTagName("test")[0].toxml(), use dom.getElementsByTagName("test")[0].firstChild.data. That gives you the node value.
I like BeautifulSoup :
from BeautifulSoup import BeautifulStoneSoup
xml = """<test>I want to read this</test>"""
soup = BeautifulStoneSoup(xml)
soup.find('test').string
u'I want to read this'
looks somewhat better.
Use firstChild.data instead of toxml:
from xml.dom.minidom import parseString
dom = parseString('<test>I want to read this</test>')
element = dom.getElementsByTagName('test')[0]
print element.firstChild.data
Output:
>>> I want to read this
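firstChild.data assumes that the element's first child is a text node. As a hedged sketch in Python 3 syntax, a small helper (get_text is a hypothetical name, modeled on the getText example in the minidom documentation) can join all direct text children and also cope with empty elements:

```python
from xml.dom.minidom import parseString


def get_text(element):
    """Concatenate the direct text-node children of a minidom element."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


dom = parseString("<doc><test>I want to read this</test><empty/></doc>")
text = get_text(dom.getElementsByTagName("test")[0])
empty = get_text(dom.getElementsByTagName("empty")[0])

print(text)
print(repr(empty))
```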