How can I get specific elements from XML data?

I have some code to retrieve XML data:
import cStringIO
import pycurl
from xml.etree import ElementTree
_API_KEY = 'my api key'
_ima = '/the/path/to/a/image'
sock = cStringIO.StringIO()
upl = pycurl.Curl()
values = [
("key", _API_KEY),
("image", (upl.FORM_FILE, _ima))]
upl.setopt(upl.URL, "http://api.imgur.com/2/upload.xml")
upl.setopt(upl.HTTPPOST, values)
upl.setopt(upl.WRITEFUNCTION, sock.write)
upl.perform()
upl.close()
xmldata = sock.getvalue()
#print xmldata
sock.close()
The resulting data looks like:
<?xml version="1.0" encoding="utf-8"?>
<upload><image><name></name><title></title><caption></caption><hash>dxPGi</hash><deletehash>kj2XOt4DC13juUW</deletehash><datetime>2011-06-10 02:59:26</datetime><type>image/png</type><animated>false</animated><width>1024</width><height>768</height><size>172863</size><views>0</views><bandwidth>0</bandwidth></image><links><original>http://i.stack.imgur.com/dxPGi.png</original><imgur_page>http://imgur.com/dxPGi</imgur_page><delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page><small_square>http://i.stack.imgur.com/dxPGis.jpg</small_square><large_thumbnail>http://i.stack.imgur.com/dxPGil.jpg</large_thumbnail></links></upload>
Now, following this answer, I'm trying to get some specific values from the data.
This is my attempt:
tree = ElementTree.fromstring(xmldata)
url = tree.findtext('original')
webpage = tree.findtext('imgur_page')
delpage = tree.findtext('delete_page')
print 'Url: ' + str(url)
print 'Pagina: ' + str(webpage)
print 'Link de borrado: ' + str(delpage)
I get an AttributeError if I try to add the .text access:
Traceback (most recent call last):
File "<pyshell#28>", line 27, in <module>
url = tree.find('original').text
AttributeError: 'NoneType' object has no attribute 'text'
I couldn't find anything in Python's help for ElementTree about this attribute. How can I get only the text, not the object?
I found some info about getting a text string here; but when I try it I get a TypeError:
Traceback (most recent call last):
File "<pyshell#32>", line 34, in <module>
print 'Url: ' + url
TypeError: cannot concatenate 'str' and 'NoneType' objects
If I try to print 'Url: ' + str(url) instead, there is no error, but the result shows as None.
How can I get the url, webpage and delete_page data from this XML?

Your find() call is trying to find an immediate child of the top of the tree with a tag named original, not a tag at any lower level than that. Use:
url = tree.find('.//original').text
if you want to search the whole tree for a tag named original. Note that find() returns only the first match; findall() accepts the same pattern and returns every match. The pattern matching rules for ElementTree's find() method are laid out in a table on this page: http://effbot.org/zone/element-xpath.htm
For // matching it says:
Selects all subelements, on all levels beneath the current element (search the entire subtree). For example, “.//egg” selects all “egg” elements in the entire tree.
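To make the difference concrete, here is a small sketch (written for Python 3, using a made-up basket document rather than the data from the question) that compares a direct-child lookup with a './/' search:

```python
from xml.etree import ElementTree

# Hypothetical toy document, used only for illustration.
doc = ElementTree.fromstring(
    '<basket><egg>top</egg><box><egg>nested</egg></box><box><ham/></box></basket>'
)

# A bare tag name only looks at direct children of the current element.
direct = [e.text for e in doc.findall('egg')]

# './/' searches every level beneath the current element.
everywhere = [e.text for e in doc.findall('.//egg')]

# find() returns only the first match, or None when nothing matches.
first = doc.find('.//egg').text
missing = doc.find('ham')  # <ham/> is a grandchild, so this is None

print(direct)      # ['top']
print(everywhere)  # ['top', 'nested']
print(first)       # top
print(missing)     # None
```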
Edit: here is some test code for you. It uses the XML sample string you posted; I just ran it through XML Tidy in TextMate to make it legible:
from xml.etree import ElementTree
xmldata = '''<?xml version="1.0" encoding="utf-8"?>
<upload>
<image>
<name/>
<title/>
<caption/>
<hash>dxPGi</hash>
<deletehash>kj2XOt4DC13juUW</deletehash>
<datetime>2011-06-10 02:59:26</datetime>
<type>image/png</type>
<animated>false</animated>
<width>1024</width>
<height>768</height>
<size>172863</size>
<views>0</views>
<bandwidth>0</bandwidth>
</image>
<links>
<original>http://i.stack.imgur.com/dxPGi.png</original>
<imgur_page>http://imgur.com/dxPGi</imgur_page>
<delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page>
<small_square>http://i.stack.imgur.com/dxPGis.jpg</small_square>
<large_thumbnail>http://i.stack.imgur.com/dxPGil.jpg</large_thumbnail>
</links>
</upload>'''
tree = ElementTree.fromstring(xmldata)
print tree.find('.//original').text
On my machine (OS X running Python 2.6.1) that produces:
Ian-Cs-MacBook-Pro:tmp ian$ python test.py
http://i.stack.imgur.com/dxPGi.png
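Since the question asks for three values, the same idea can be sketched with findtext(), which returns the text directly instead of an element. This version is written for Python 3 and uses a trimmed copy of the response; the 'links/...' paths and the default argument are choices made for this sketch, not part of the original answer.

```python
from xml.etree import ElementTree

# Trimmed copy of the response shown in the question.
xmldata = (
    '<upload>'
    '<image><hash>dxPGi</hash></image>'
    '<links>'
    '<original>http://i.stack.imgur.com/dxPGi.png</original>'
    '<imgur_page>http://imgur.com/dxPGi</imgur_page>'
    '<delete_page>http://imgur.com/delete/kj2XOt4DC13juUW</delete_page>'
    '</links>'
    '</upload>'
)

tree = ElementTree.fromstring(xmldata)

# A bare tag name only matches direct children of <upload>, so this is None.
direct = tree.findtext('original')

# An explicit path or a './/' search reaches the nested elements.
url = tree.findtext('links/original')
webpage = tree.findtext('links/imgur_page')
delpage = tree.findtext('.//delete_page')

# 'default' is returned when the element is absent (it is in this trimmed copy).
thumb = tree.findtext('.//small_square', default='(not present)')

print('Url:', url)
print('Page:', webpage)
print('Delete link:', delpage)
print('Thumbnail:', thumb)
```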

Related

Convert html to xml Python chilkat

Good morning, I am trying to convert HTML to XML using the chilkat library, but it throws this error:
import chilkat
a = "asd asd asd asd"
xml = a.toXml()
print(xml)
Traceback (most recent call last):
File "C:\Users\acalobish\Desktop\iaa.py", line 4, in <module>
xml = a.toXml()
AttributeError: 'str' object has no attribute 'toXml'

import chilkat
htmlToXml = chilkat.CkHtmlToXml()
# Indicate the charset of the output XML we'll want.
htmlToXml.put_XmlCharset("utf-8")
success = htmlToXml.ConvertFile("test.html","out.xml")
if (success != True):
    print(htmlToXml.lastErrorText())
else:
    print("Success")

Python xml.ElementTree - function to return parsed xml in variable to be used later

I have a function which sends a GET request and parses the response as XML:
def get_object(object_name):
    ...
    ...
    #parse xml file
    encoded_text = response.text.encode('utf-8', 'replace')
    root = ET.fromstring(encoded_text)
    tree = ET.ElementTree(root)
    return tree
Then I call this function for each object in a list and store the resulting XML in a variable:
jx_task_tree = ''
for jx in jx_tasks_lst:
    jx_task_tree += str(get_object(jx))
I am not sure whether the function returns data in a form that I can use later the way I need to.
When I try to parse the variable jx_task_tree like this:
parser = ET.XMLParser(encoding="utf-8")
print(type(jx_task_tree))
tree = ET.parse(jx_task_tree, parser=parser)
print(ET.tostring(tree))
it throws an error:
Traceback (most recent call last):
File "import_uac_wf.py", line 59, in <module>
tree = ET.parse(jx_task_tree, parser=parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in
parse
tree.parse(source, parser)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 647, in parse
source = open(source, "rb")
IOError: [Errno 36] File name too long:
'<xml.etree.ElementTree.ElementTree
object at 0x7ff2607c8910>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607e23d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607ee4d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607d8e90>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607e2550>\n<xml.etree.ElementTree.ElementTree object at
0x7ff2607889d0>\n<xml.etree.ElementTree.ElementTree object at
0x7ff26079f3d0>\n'
Could anybody tell me what get_object() should return and how to work with it afterwards, so that the results can be joined into one variable and parsed?

Regarding your current exception:
According to the Python 3 documentation for xml.etree.ElementTree.parse(source, parser=None) (emphasis is mine):
Parses an XML section into an element tree. source is a filename or file object containing XML data.
If you want to load the XML from a string, use ET.fromstring instead.
Then, as you suspected, the 2nd code snippet is completely wrong:
get_object(jx) returns already-parsed XML, that is, an ElementTree object
Calling str on it yields its textual representation (e.g. "<xml.etree.ElementTree.ElementTree object at 0x7ff26079f3d0>"), which is not what you want
You could do something like:
jx_tasks_string = ""
for jx in jx_tasks_lst:
    jx_tasks_string += ET.tostring(get_object(jx).getroot())
Since jx_tasks_string is the concatenation of some strings obtained from parsing some XML blobs, there's no reason to parse it again.
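If a single parseable document is what you are after, one option is to collect the roots under a wrapper element instead of concatenating strings. The sketch below is for Python 3, where ET.tostring() returns bytes unless encoding='unicode' is passed. The get_object() stub, the sample list, and the tasks wrapper tag are all hypothetical stand-ins for the code in the question.

```python
import xml.etree.ElementTree as ET


def get_object(object_name):
    # Hypothetical stand-in for the question's function: it skips the
    # HTTP request and simply returns a small parsed tree.
    root = ET.fromstring('<task name="%s"/>' % object_name)
    return ET.ElementTree(root)


jx_tasks_lst = ['a', 'b']  # hypothetical sample input

# Gather every root under one wrapper so the result is well-formed XML.
combined = ET.Element('tasks')  # 'tasks' is an assumed wrapper name
for jx in jx_tasks_lst:
    combined.append(get_object(jx).getroot())

# encoding='unicode' makes tostring() return str instead of bytes.
xml_text = ET.tostring(combined, encoding='unicode')

# The combined text can be parsed again with fromstring() if needed.
reparsed = ET.fromstring(xml_text)
print([task.get('name') for task in reparsed])  # ['a', 'b']
```

Keeping the Element objects around (here, combined) usually avoids the need to serialize and re-parse at all.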

Python , XML Index error

Hello, I am having trouble with an XML file I am using. On a short XML file the program works fine, but once the file reaches a certain size (I think about 1 MB) it gives me an "IndexError: list index out of range".
Here is the code I have written so far:
from xml.dom import minidom
import smtplib
from email.mime.text import MIMEText
from datetime import datetime
def xml_data():
    f = open('C:\opidea_2.xml', 'r')
    data = f.read()
    f.close()
    dom = minidom.parseString(data)
    ic = (dom.getElementsByTagName('logentry'))
    dom = None
    content = ''
    for num in ic:
        name = num.getElementsByTagName('author')[0].firstChild.nodeValue
        if name:
            content += "***Changes by:" + str(name) + "*** " + '\n\n Date: '
        else:
            content += "***Changes are made Anonymously *** " + '\n\n Date: '
    print content
if __name__ == "__main__":
    xml_data ()
Here is part of the xml if it helps.
<log>
<logentry
revision="33185">
<author>glv</author>
<date>2012-08-06T21:01:52.494219Z</date>
<paths>
<path
kind="file"
action="M">/branches/Patch_4_2_0_Branch/text.xml</path>
<path
kind="dir"
action="M">/branches/Patch_4_2_0_Branch</path>
</paths>
<msg>PATCH_BRANCH:N/A
BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:N/A
Adding the SVN log size requirement to the branch
</msg>
</logentry>
</log>
The actual XML file is much bigger, but this is the general format. The program works when the file is this small, but once it gets bigger I get problems.
Here is the traceback:
Traceback (most recent call last):
File "C:\python\src\SVN_Email_copy.py", line 141, in <module>
xml_data ()
File "C:\python\src\SVN_Email_copy.py", line 50, in xml_data
name = num.getElementsByTagName('author')[0].firstChild.nodeValue
IndexError: list index out of range

Based on the code provided, your error is in this line:
name = num.getElementsByTagName('author')[0].firstChild.nodeValue
#xml node-^
#function call -------------------------^
#list indexing ----------------------------^
#attribute access -------------------------------------^
That's the only place in the code shown where you index into a list, which implies that your larger XML sample contains a <logentry> without an <author> tag. You'll have to correct that, or add some level of error handling / data validation.
The annotations on the line above explain why. You're doing a lot in a single line by chaining the return values of successive operations. num is defined, so that's fine. Then you call a method, which returns a list. You try to index into that list and it raises an exception, so you never reach the attribute access for firstChild, and therefore never get a nodeValue.
Error checking may look something like this:
authors = num.getElementsByTagName('author')
if len(authors) > 0:
    name = authors[0].firstChild.nodeValue
Though there are many, many ways you could achieve that.
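As one possible variation, the check can be wrapped in a small helper that also covers an <author> element with no text, where firstChild is None. This is a sketch for Python 3; the sample log and the author_of name are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample: one normal entry, one without <author>,
# and one with an empty <author> element.
sample = '''<log>
<logentry revision="1"><author>glv</author></logentry>
<logentry revision="2"></logentry>
<logentry revision="3"><author></author></logentry>
</log>'''


def author_of(entry):
    """Return the author's name, or None if it is missing or empty."""
    authors = entry.getElementsByTagName('author')
    if authors and authors[0].firstChild is not None:
        return authors[0].firstChild.nodeValue
    return None


dom = minidom.parseString(sample)
names = [author_of(entry) for entry in dom.getElementsByTagName('logentry')]
print(names)  # ['glv', None, None]
```

With a helper like this, the existing if name: / else: branch in the question can treat None as an anonymous change.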

renderContents in beautifulsoup (python)

The code I'm trying to get working is:
h = str(heading)
# '<h1>Heading</h1>'
heading.renderContents()
I get this error:
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
print h.renderContents()
AttributeError: 'str' object has no attribute 'renderContents'
Any ideas?
I have a string with HTML tags and I need to clean it. If there is a different way of doing that, please suggest it.

Your error message and your code sample don't line up. You say you're calling:
heading.renderContents()
But your error message says you're calling:
print h.renderContents()
Which suggests that perhaps you have a bug in your code, trying to call renderContents() on a string object that doesn't define that method.
In any case, it would help if you checked what type of object heading is to make sure it's really a BeautifulSoup instance. This works for me with BeautifulSoup 3.2.0:
from BeautifulSoup import BeautifulSoup
heading = BeautifulSoup('<h1>heading</h1>')
repr(heading)
# '<h1>heading</h1>'
print heading.renderContents()
# <h1>heading</h1>
print str(heading)
# <h1>heading</h1>
h = str(heading)
print h
# <h1>heading</h1>
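If the underlying goal is only to strip tags from a string, one alternative that avoids BeautifulSoup entirely is the standard library's html.parser module. This is a different technique from renderContents(), sketched here for Python 3; TextExtractor and strip_tags are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(strip_tags('<h1>Heading</h1>'))  # Heading
```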

pubDate RSS parsing weirdness with Beautifulsoup/Python

I'm trying to parse an RSS/Podcast feed using Beautifulsoup and everything is working nicely except I can't seem to parse the 'pubDate' field.
data = urllib2.urlopen("http://www.democracynow.org/podcast.xml")
dom = BeautifulStoneSoup(data, fromEncoding='utf-8')
items = dom.findAll('item')
for item in items:
    title = item.find('title').string.strip()
    pubDate = item.find('pubDate').string.strip()
The title gets parsed fine but when it gets to pubDate, it says:
Traceback (most recent call last):
File "", line 2, in
AttributeError: 'NoneType' object has no attribute 'string'
However, when I download a copy of the XML file and rename 'pubDate' to something else, then parse it again, it seems to work. Is pubDate a reserved variable or something in Python?
Thanks,
g

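It works with item.find('pubdate').string.strip(). BeautifulSoup 3 appears to lowercase tag names while parsing, so the element ends up stored as pubdate; pubDate is not a reserved name in Python.
Alternatively, why not use feedparser, which is designed for parsing feeds?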
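For comparison, a case-sensitive XML parser finds the tag under its original name. Below is a minimal sketch for Python 3 that uses the standard library's ElementTree instead of BeautifulSoup, with a made-up one-item feed standing in for the real podcast XML:

```python
from email.utils import parsedate_to_datetime
from xml.etree import ElementTree

# Hypothetical one-item feed, used only for illustration.
rss = '''<rss version="2.0"><channel>
<item>
<title> First show </title>
<pubDate>Fri, 10 Jun 2011 08:00:00 -0400</pubDate>
</item>
</channel></rss>'''

root = ElementTree.fromstring(rss)

results = []
for item in root.iter('item'):
    title = item.findtext('title', default='').strip()
    # XML tag names are case-sensitive, so 'pubDate' matches as written.
    published = parsedate_to_datetime(item.findtext('pubDate'))
    results.append((title, published.isoformat()))

print(results)  # [('First show', '2011-06-10T08:00:00-04:00')]
```

parsedate_to_datetime() handles the RFC 2822 date format that RSS uses for pubDate, so the value comes back as a datetime rather than a raw string.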
