elementtree - get title? - python

<item rdf:about="http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-0005">
<title>CVE-2014-0005</title>
<link>http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-0005</link>
<description>PicketBox and JBossSX, as used in Red Hat JBoss Enterprise Application Platform (JBEAP) 6.2.2 and JBoss BRMS before 6.0.3 roll up patch 2, allows remote authenticated users to read and modify the application sever configuration and state by deploying a crafted application.</description>
<dc:date>2015-02-20T16:59:00Z</dc:date>
</item>
<item rdf:about="http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-1831">
<title>CVE-2014-1831 (passenger)</title>
<link>http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-1831</link>
<description>Phusion Passenger before 4.0.37 allows local users to write to certain files and directories via a symlink attack on (1) control_process.pid or a (2) generation-* file.</description>
<dc:date>2015-02-19T15:59:02Z</dc:date>
</item>
Hi,
Given the above, I am trying to extract the text of each title element from an XML file. However, the code below isn't working (I am getting no results). Please advise.
def processxml():
    tree = ET.parse('nvd-rss.xml')
    for item in tree.findall('item/title'):
        print(title.text)
Thanks in advance

As BorrajaX mentioned, you need to change the print call in the for loop to print(item.text): the loop variable, not title, holds each element that the findall() method returns. You can get the text inside an element by reading the text attribute of the element instance.
tree = ET.parse('nvd-rss.xml')  # create an ElementTree instance called tree
for element in tree.findall('item/title'):
    print(element.text)
Some other attributes of an ElementTree element instance:
.tag Name of the element
.text Text inside the element
.tail Text following the element
.attrib Dictionary containing the element's attribute names and their corresponding values
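The attributes listed above can be seen together in a minimal sketch. The sample document here is hypothetical and deliberately namespace-free; it only borrows the titles from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample modelled on the feed in the question.
SAMPLE = (
    '<feed>'
    '<item id="a"><title lang="en">CVE-2014-0005</title>after</item>'
    '<item id="b"><title lang="en">CVE-2014-1831 (passenger)</title></item>'
    '</feed>'
)

root = ET.fromstring(SAMPLE)

# findall() returns a list of matching elements; read .text on each one.
titles = [element.text for element in root.findall('item/title')]
print(titles)

# Inspect the other attributes on the first match.
first = root.find('item/title')
print(first.tag)     # name of the element
print(first.text)    # text inside the element
print(first.tail)    # text following the element's closing tag
print(first.attrib)  # dictionary of the element's XML attributes
```

If the real feed declares a default namespace on its root element (not shown in the question, so this is an assumption), the path would need namespace-qualified tags for findall() to match anything.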

Related

Parsing XML Attribute Names

Python, currently using 2.7 but can easily change to the latest version.
I need to parse this XML and return the int value contained within an item. This isn't my XML; it comes from a piece of enterprise-level software.
<counters>
<item name="stats/counters/session/responsetime" type="int">1047</item>
<item name="stats/counters/session/responsecount" type="int">7423</item>
<item name="stats/counters/init/inittime" type="int">36339</item>
<item name="stats/counters/init/fetchtime" type="int">8097</item>
<item name="stats/connectionsetups" type="int">579</item>
<item name="stats/activesessions" type="int">4294967289</item>
<item name="stats/activeconnections" type="int">0</item>
</counters>
Code:
import requests
import xml.etree.ElementTree as ET
def _getstats():
    resp = requests.get(urlStats)
    # Writing XML to disk. This makes parsing it MUCH easier.
    with open('stats_10.xml', 'wb') as f:
        f.write(resp.content)  # the with block closes the file
    tree = ET.parse('stats_10.xml')
    root = tree.getroot()
    active = root.find('stats/activesessions')
    print(active)
The return is always None. I'm using ElementTree. I have read through the documentation (https://docs.python.org/3.0/library/xml.etree.elementtree.html) and many Stack Overflow pages.
I think the problem is that the parser doesn't understand the slash.
I attempted to pull the value by name using "active = int(root['stats/activesessions'])" in place of root.find, which returns this error:
TypeError: list indices must be integers, not str
I also tried xmltodict, but that was even worse than using ElementTree. The error would always be 'list indices must be integers'.
Lastly, this is a dynamic XML document. Indexing by row is not an option because at idle the software returns 10 rows, for example, and under load it returns 15, with the additional rows mixed in among the others. I have to pull by child name.
Thank you in advance for any assistance!
ADDITION:
I can run an iteration through the XML and pull the value. However, as stated above, the XML will change and the number of rows will increase, thus throwing my indices off.
active = root[5].text
print(active)
I believe the find method is looking for a tag name, not an attribute value. You need to find the item tag, check if it has a name attribute, and then check if the attribute equals "stats/activesessions". If this condition is met, you can read in the value of the item tag.
This is obviously me not understanding XML and how it's structured. Added this in my code and I get the return value I'm looking for.
for item in root.findall("./item[@name='system/starttime']"):
    starttime = int(item.text)
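Applied to the counters document from the question, the attribute predicate could be wrapped as in the sketch below. The helper name get_counter is made up for illustration, and the XML is a trimmed copy of the sample above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the counters document from the question.
COUNTERS = """<counters>
<item name="stats/counters/session/responsetime" type="int">1047</item>
<item name="stats/activesessions" type="int">4294967289</item>
<item name="stats/activeconnections" type="int">0</item>
</counters>"""


def get_counter(root, name):
    """Return the int value of the <item> whose name attribute equals `name`.

    Returns None when no such item exists, so extra or missing rows
    do not shift anything the way positional indexing would.
    """
    item = root.find("./item[@name='{}']".format(name))
    return int(item.text) if item is not None else None


root = ET.fromstring(COUNTERS)
active = get_counter(root, "stats/activesessions")
missing = get_counter(root, "stats/does-not-exist")
print(active, missing)
```

The slash inside the quoted attribute value is treated as plain text, not as a path separator, because it sits inside the predicate's string literal.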

ElementTree XML API not matching subelement

I am attempting to use the USPS API to return the status of package tracking. I have a method that returns an ElementTree.Element object built from the XML string returned from the USPS API.
This is the returned XML string.
<?xml version="1.0" encoding="UTF-8"?>
<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>
I format that into an Element object
response = xml.etree.ElementTree.fromstring(xml_str)
Now I can see in the xml string that the tag 'TrackSummary' exists and I would expect to be able to access that using ElementTree's find method.
As extra proof I can iterate over the response object and prove that the 'TrackSummary' tag exists.
for item in response.iter():
    print(item, item.text)
returns:
<Element 'TrackResponse' at 0x00000000041B4B38> None
<Element 'TrackInfo' at 0x00000000041B4AE8> None
<Element 'TrackSummary' at 0x00000000041B4B88> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
So here is the problem.
print(response.find('TrackSummary'))
returns
None
Am I missing something here? Seems like I should be able to find that child element without a problem?
# cElementTree was 15 to 20 times faster on Python 2; it was removed in Python 3.9,
# where xml.etree.ElementTree picks up the C accelerator automatically.
import xml.etree.cElementTree as ET
response = ET.fromstring(xml_str)
From the XPath syntax table in the documentation:
* selects all child elements. For example, */egg selects all grandchildren named egg.
element = response.findall('*/TrackSummary')  # you will get a list
print(element[0].text)  # quick print; otherwise iterate over the list
>>> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
The .find() method only searches the direct children of the element, not recursively. To search recursively, you need a path that descends through all levels. In XPath, the double slash // is a recursive search. Note that the xpath() method is provided by lxml elements, not by xml.etree.ElementTree. With lxml, try this:
# returns a list of elements with tag TrackSummary
response.xpath('//TrackSummary')
# returns a list of the text contained in each TrackSummary tag
response.xpath('//TrackSummary/text()')
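With the standard-library xml.etree.ElementTree used in the question, the same distinction can be sketched as follows. The XML is a shortened copy of the USPS response, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question.
XML_STR = """<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information.</TrackSummary>
</TrackInfo>
</TrackResponse>"""

response = ET.fromstring(XML_STR)

# A bare tag name matches direct children of TrackResponse only.
direct = response.find('TrackSummary')

# Spelling out the full path from the root reaches the grandchild.
nested = response.find('TrackInfo/TrackSummary')

# './/' searches all descendants at any depth.
anywhere = response.find('.//TrackSummary')

print(direct)
print(nested.text)
print(anywhere is nested)
```

Here direct is None because TrackSummary is a grandchild of the root, while the other two lookups return the same element.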

Reading text from XML nodes using Python's libxml2

I am a first-time XPath user and need to get the text values of these different elements, for instance time, title, etc. I am using the libxml2 module in Python and so far have not had much luck getting just the text values I need. The code below only returns the element tags; I need the values. Any help would be greatly appreciated!
I'm using this code:
doc = libxml2.parseDoc(xmlOutput)
result = doc.xpathEval('//*')
With the following document:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SCAN_LIST_OUTPUT SYSTEM "https://qualysapi.qualys.com/api/2.0/fo/sca/scan_list_output.dtd">
<SCAN_LIST_OUTPUT>
<RESPONSE>
<DATETIME>2012-01-22T01:21:53Z</DATETIME>
<SCAN_LIST>
<SCAN>
<REF>scan/2343423</REF>
<TYPE>Scheduled</TYPE>
<TITLE><![CDATA[customer 1 5/20/2012]]></TITLE>
<USER_LOGIN>user1</USER_LOGIN>
<LAUNCH_DATETIME>2012-02-21T04:11:05Z</LAUNCH_DATETIME>
<STATUS>
<STATE>Finished</STATE>
</STATUS>
<TARGET><![CDATA[13.3.3.2, 13.8.8.10, 13.10.12.60, 13.10.12.11...]]></TARGET>
</SCAN>
</SCAN_LIST>
</RESPONSE>
</SCAN_LIST_OUTPUT>
You can call getContent() on each returned xmlNode object to retrieve the associated text. Note that this is recursive -- to non-recursively access text content in libxml2, you'll want to retrieve the associated text node under the element, and call .getContent() on that.
That said, this would be easier if you used lxml.etree (a higher-level Python API, still backed by the C libxml2 library) instead of the Python libxml2 bindings; in that case, it's simply element.text to access the associated content as a string.
Have a look at Mark Pilgrim's Dive Into Python 3, Chapter 12. XML
The chapter starts with a short course on XML (general, but built around an Atom Syndication Feed example), then covers the standard xml.etree.ElementTree, and continues with the third-party lxml, which implements more behind the same interface (full XPath 1.0, based on libxml2).
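As a rough sketch of the element.text approach, the example below uses the standard-library xml.etree.ElementTree rather than lxml or libxml2, relying on the shared interface mentioned above. The document is a trimmed copy of the one in the question (DOCTYPE and some fields omitted), and the dictionary keys are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the scan list from the question.
SCAN_XML = """<SCAN_LIST_OUTPUT>
  <RESPONSE>
    <DATETIME>2012-01-22T01:21:53Z</DATETIME>
    <SCAN_LIST>
      <SCAN>
        <REF>scan/2343423</REF>
        <TITLE><![CDATA[customer 1 5/20/2012]]></TITLE>
        <STATUS>
          <STATE>Finished</STATE>
        </STATUS>
      </SCAN>
    </SCAN_LIST>
  </RESPONSE>
</SCAN_LIST_OUTPUT>"""

root = ET.fromstring(SCAN_XML)

# findtext() returns the text of the first matching element.
response_time = root.findtext("RESPONSE/DATETIME")

scans = []
for scan in root.iter("SCAN"):  # iter() walks all descendants named SCAN
    scans.append({
        "ref": scan.findtext("REF"),
        "title": scan.findtext("TITLE"),  # CDATA content arrives as plain text
        "state": scan.findtext("STATUS/STATE"),
    })

print(response_time)
print(scans)
```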

How to grab the attribute value using beautifulSoup?

Code:
soup = BeautifulSoup(f.read())
data = soup.findAll('node', {'id': 'memory'})
print(data)
Output
[<node id="memory" claimed="true" class="memory" handle="DMI:000E">
<description>
System Memory
</description>
<physid>
e
</physid>
<slot>
System board or motherboard
</slot>
<size units="bytes">
3221225472
</size>
<capacity units="bytes">
3221225472
</capacity>
</node>]
Now how do I grab the values, such as the text between the tags (for example, System Memory) and the attribute values? Any help is appreciated.
To get the content between the tags (<...>this</...>), use the contents field. Note that findAll returns a list, so index into it first; in your case it would be:
print(data[0].description.contents)
To get attributes, access them as if the tag were a dictionary:
print(data[0].size['units'])
And to iterate over all the tags, use findAll, which you already know:
for node in data[0].findAll(True):
    pass  # do stuff on node
BeautifulSoup can create a tree. You can then iterate over that tree and get the attributes.
Check out the following link:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#TheattributesofTags

XML attributes get sorted

When I create a document using minidom, attributes get sorted alphabetically in the element. Take this example:
from xml.dom import minidom
# New document
xml = minidom.Document()
# Creates user element
userElem = xml.createElement("user")
# Set attributes to user element
userElem.setAttribute("name", "Sergio Oliveira")
userElem.setAttribute("nickname", "seocam")
userElem.setAttribute("email", "seocam@taboca.com")
userElem.setAttribute("photo","seocam.png")
# Append user element in xml document
xml.appendChild(userElem)
# Print the xml code
print(xml.toprettyxml())
The result is this:
<?xml version="1.0" ?>
<user email="seocam@taboca.com" name="Sergio Oliveira" nickname="seocam" photo="seocam.png"/>
Which is all very well if you wanted the attributes in email/name/nickname/photo order instead of name/nickname/email/photo order as they were created.
How do you get the attributes to show up in the order you created them? Or, how do you control the order at all?
According to the documentation, the order of attributes is arbitrary but consistent for the life of the DOM. This is common across DOM implementations. Sorry.
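As a hedged illustration: the Python documentation notes that since Python 3.8 minidom's writexml(), toxml() and toprettyxml() preserve the order in which attributes were set, whereas older versions write them sorted. The sketch below rebuilds the element from the question (the helper name build_user_xml is made up here) and also parses the result back, since consumers of XML should not depend on attribute order in any case.

```python
import sys
from xml.dom import minidom


def build_user_xml():
    """Serialize a <user> element whose attributes are set in a chosen order."""
    doc = minidom.Document()
    user = doc.createElement("user")
    for key, value in [
        ("name", "Sergio Oliveira"),
        ("nickname", "seocam"),
        ("email", "seocam@taboca.com"),
        ("photo", "seocam.png"),
    ]:
        user.setAttribute(key, value)
    doc.appendChild(user)
    return user.toxml()


xml_text = build_user_xml()
print(sys.version_info[:2], xml_text)

# Whatever the serialized order, parsing yields the same attribute mapping.
parsed = minidom.parseString(xml_text).documentElement
attrs = dict(parsed.attributes.items())
print(attrs)
```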
