How to grab the attribute value using beautifulSoup?

Code:
soup=BeautifulSoup(f.read())
data=soup.findAll('node',{'id':'memory'})
print data
Output
[<node id="memory" claimed="true" class="memory" handle="DMI:000E">
<description>
System Memory
</description>
<physid>
e
</physid>
<slot>
System board or motherboard
</slot>
<size units="bytes">
3221225472
</size>
<capacity units="bytes">
3221225472
</capacity>
</node>]
Now, how do I grab values from this result, such as an attribute's value or the text between the tags (for example, System Memory)? Any help is appreciated.

To get the text inside a tag (<...>this</...>), use the contents field. Note that findAll returns a list of matches, so pick one element first. In your case it would be:
print(data[0].description.contents)
To get attributes, access them as if the tag were a dictionary:
print(data[0].size['units'])
And to iterate over all the tags, use findAll, which you already know:
for node in data[0].findAll(True):
    pass  # do stuff on node
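As a rough end-to-end sketch of the above, assuming the newer bs4 package (where findAll is also available as find_all) and the built-in html.parser backend rather than the BeautifulSoup 3 setup in the question. The XML is a shortened copy of the output shown above, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Shortened copy of the <node> element from the question.
xml_text = """<node id="memory" claimed="true" class="memory" handle="DMI:000E">
<description>
System Memory
</description>
<size units="bytes">
3221225472
</size>
</node>"""

soup = BeautifulSoup(xml_text, "html.parser")

# find_all returns a list-like result, so take the first match.
nodes = soup.find_all("node", {"id": "memory"})
node = nodes[0]

# Text between the tags, with surrounding whitespace stripped.
description = node.find("description").get_text(strip=True)

# Attributes are read with dictionary-style access.
units = node.find("size")["units"]
size = int(node.find("size").get_text(strip=True))

# Passing True matches every tag below this node.
child_names = [child.name for child in node.find_all(True)]

print(description, units, size, child_names)
```

Using find("size") instead of attribute-style access is a deliberate choice here: it keeps working even when a tag name happens to collide with a method or property of the tag object.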

BeautifulSoup builds a tree from the document. You can then iterate over that tree and read the attributes of each tag.
Check out the following link:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#TheattributesofTags

Related

Parsing XML Attribute Names

I'm on Python 2.7 at the moment but can easily move to the latest version.
I need to parse this XML and return the int value contained within an item. This isn't my XML; it comes from a piece of enterprise-level software.
<counters>
<item name="stats/counters/session/responsetime" type="int">1047</item>
<item name="stats/counters/session/responsecount" type="int">7423</item>
<item name="stats/counters/init/inittime" type="int">36339</item>
<item name="stats/counters/init/fetchtime" type="int">8097</item>
<item name="stats/connectionsetups" type="int">579</item>
<item name="stats/activesessions" type="int">4294967289</item>
<item name="stats/activeconnections" type="int">0</item>
</counters>
Code:
import requests
import xml.etree.ElementTree as ET
def _getstats():
    resp = requests.get(urlStats)
    # Writing XML to disk. This makes parsing it MUCH easier.
    with open('stats_10.xml', 'wb') as f:
        f.write(resp.content)
    tree = ET.parse('stats_10.xml')
    root = tree.getroot()
    active = root.find('stats/activesessions')
    print(active)
The return value is always None. I'm using ElementTree. I have read through the documentation (https://docs.python.org/3.0/library/xml.etree.elementtree.html) and many Stack Overflow pages.
I think the problem is that the parser doesn't understand the slash.
I attempted to pull by name using "active = int(root['stats/activesessions'])" in place of root.find, which returns this error:
TypeError: list indices must be integers, not str
I also tried xmltodict, but that was even worse than using ElementTree. The error would always be 'list indices must be integers'.
Lastly, this is a dynamic XML document. Indexing by row is not an option because at idle the software returns 10 rows, for example, and under load it returns 15, with the additional rows mixed in among the others. I have to pull by child name.
Thank you in advance for any assistance!
ADDITION:
I can run an iteration through the XML and pull the value. However, as stated above, the XML will change and the number of rows will increase, thus throwing my indices off.
active = root[5].text
print active

I believe the find method is looking for a tag name, not an attribute value. You need to find the item tag, check if it has a name attribute, and then check if the attribute equals "stats/activesessions". If this condition is met, you can read in the value of the item tag.

This is obviously me not understanding XML and how it's structured. Added this in my code and I get the return value I'm looking for.
for item in root.findall("./item[@name='system/starttime']"):
    starttime = int(item.text)
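For reference, a minimal self-contained sketch of that attribute lookup, using a shortened copy of the counters XML from the question. The helper name get_counter is made up for illustration, and the XML is parsed from a string here rather than from a file on disk:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <counters> document from the question.
xml_text = """<counters>
<item name="stats/connectionsetups" type="int">579</item>
<item name="stats/activesessions" type="int">4294967289</item>
<item name="stats/activeconnections" type="int">0</item>
</counters>"""

root = ET.fromstring(xml_text)


def get_counter(root, name):
    """Return the integer value of the <item> whose name attribute matches."""
    # The tag is "item"; the slash-separated string is an attribute value,
    # so it belongs inside an [@name='...'] predicate, not in the path.
    item = root.find("./item[@name='%s']" % name)
    return None if item is None else int(item.text)


active = get_counter(root, "stats/activesessions")
missing = get_counter(root, "stats/nosuchcounter")

# Alternative: build a name -> value mapping once, independent of row order.
counters = {item.get("name"): int(item.text) for item in root.findall("item")}

print(active, missing, counters["stats/connectionsetups"])
```

Either way the lookup is keyed on the name attribute, so it does not matter how many rows the software returns or in which order.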

ElementTree XML API not matching subelement

I am attempting to use the USPS API to return the status of package tracking. I have a method that returns an ElementTree.Element object built from the XML string returned from the USPS API.
This is the returned XML string.
<?xml version="1.0" encoding="UTF-8"?>
<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>
I format that into an Element object
response = xml.etree.ElementTree.fromstring(xml_str)
Now I can see in the xml string that the tag 'TrackSummary' exists and I would expect to be able to access that using ElementTree's find method.
As extra proof I can iterate over the response object and prove that the 'TrackSummary' tag exists.
for item in response.iter():
    print(item, item.text)
returns:
<Element 'TrackResponse' at 0x00000000041B4B38> None
<Element 'TrackInfo' at 0x00000000041B4AE8> None
<Element 'TrackSummary' at 0x00000000041B4B88> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
So here is the problem.
print(response.find('TrackSummary'))
returns
None
Am I missing something here? Seems like I should be able to find that child element without a problem?

import xml.etree.ElementTree as ET  # the C accelerator is used automatically on Python 3.3+
response = ET.fromstring(xml_str)
From the XPath syntax table in the documentation: * selects all child elements. For example, */egg selects all grandchildren named egg.
elements = response.findall('*/TrackSummary')  # you will get a list
print(elements[0].text)  # quick print; otherwise iterate over the list
>>> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.

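The .find() method only searches the element's direct children, not recursively. To search recursively, use an XPath expression. In XPath, the double slash // is a recursive search. ElementTree supports only a limited XPath subset in which the path must be relative, so prefix it with a dot:
# returns a list of elements with tag TrackSummary
response.findall('.//TrackSummary')
# returns a list of the text contained in each TrackSummary tag
[el.text for el in response.findall('.//TrackSummary')]
If you parse with lxml instead, elements also have an .xpath() method, so response.xpath('//TrackSummary') works there.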
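To make the difference concrete, here is a small sketch using only the standard library's ElementTree and the response from the question (the XML declaration is left out for brevity). It compares a direct-child lookup with an explicit path and a recursive one:

```python
import xml.etree.ElementTree as ET

xml_str = """<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>"""

response = ET.fromstring(xml_str)

# TrackSummary is a grandchild of the root, so a direct-child search misses it.
direct = response.find("TrackSummary")

# Spelling out the full path from the root works.
nested = response.find("TrackInfo/TrackSummary")

# ".//" searches all descendants, at any depth.
anywhere = response.find(".//TrackSummary")

# Attributes are read with .get() on the element that carries them.
track_id = response.find("TrackInfo").get("ID")

print(direct)
print(anywhere.text)
print(track_id)
```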

elementtree - get title?

<item rdf:about="http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-0005">
<title>CVE-2014-0005</title>
<link>http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-0005</link>
<description>PicketBox and JBossSX, as used in Red Hat JBoss Enterprise Application Platform (JBEAP) 6.2.2 and JBoss BRMS before 6.0.3 roll up patch 2, allows remote authenticated users to read and modify the application sever configuration and state by deploying a crafted application.</description>
<dc:date>2015-02-20T16:59:00Z</dc:date>
</item>
<item rdf:about="http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-1831">
<title>CVE-2014-1831 (passenger)</title>
<link>http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-1831</link>
<description>Phusion Passenger before 4.0.37 allows local users to write to certain files and directories via a symlink attack on (1) control_process.pid or a (2) generation-* file.</description>
<dc:date>2015-02-19T15:59:02Z</dc:date>
</item>
Hi,
Given the above, I am trying to extract the text value from the title item out of an xml file. However the below isn't working (am getting no results). Please advise.
def processxml():
    tree = ET.parse('nvd-rss.xml')
    for item in tree.findall('item/title'):
        print(title.text)
Thanks in advance

As BorrajaX mentioned, you need to change the code in your for loop to print(item.text). The loop variable item holds each element that findall() returns, while title is never defined. You can get the text inside an element by reading the element's text attribute.
# tree is an ElementTree instance
for element in tree.findall('item/title'):
    print(element.text)
Some other attributes of an ElementTree element instance:
.tag Name of the element
.text Text inside the element
.tail Text following the element
.attrib Dictionary containing all element's attribute names and their corresponding values
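A short sketch of those four attributes on a made-up snippet (the root wrapper and the word "trailing" are only there to give the element a tail):

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: one child element followed by some trailing text.
root = ET.fromstring('<root><size units="bytes">3221225472</size>trailing</root>')
size = root.find("size")

print(size.tag)     # name of the element
print(size.text)    # text between <size> and </size>
print(size.tail)    # text after </size>, up to the next tag
print(size.attrib)  # attributes as a dictionary
```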

Sibling nodes in ElementTree in Python

I am looking at a piece of XML that I want to add a node in.
<profile>
<dog>1</dog>
<halfdog>0</halfdog>
<cat>545</cat>
<lions>0</lions>
<bird>23</bird>
<dino>0</dino>
<pineapples>2</pineapples>
<people>0</people>
</profile>
With the above XML, I'm able to insert new nodes, but I'm not able to insert them at exact locations.
Is there a way to find out whether I am next to a certain node, either before or after it? Say, for example, I wanted to add <snail>2</snail> between the <dino>0</dino> and <pineapples>2</pineapples> nodes.
Using ElementTree how can I find what node is next to me? I'm asking about ElementTree or any standard Python library. Unfortunately, lxml is out of the question for me.

I believe it's not doable using ElementTree, but you can do it using the standard Python minidom:
# dom is assumed to be the parsed document, e.g. the result of minidom.parseString() on the XML above
# create snail element
snail = dom.createElement('snail')
snail_text = dom.createTextNode('2')
snail.appendChild(snail_text)
# add it in the right place
profile = dom.getElementsByTagName('profile')[0]
pineapples = dom.getElementsByTagName('pineapples')[0]
profile.insertBefore(snail, pineapples)
output:
<?xml version="1.0" ?><profile>
<dog>1</dog>
<halfdog>0</halfdog>
<cat>545</cat>
<lions>0</lions>
<bird>23</bird>
<dino>0</dino>
<snail>2</snail><pineapples>2</pineapples>
<people>0</people>
</profile>

If you know the parent element and the element to insert before, you can use the following method with ElementTree:
index = list(parentElem).index(elemToInsertBefore)  # getchildren() was removed in Python 3.9
parentElem.insert(index, newElement)
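Applied to the profile XML from the question, that approach might look like the sketch below. It uses list(parent) to get the children, which also makes it easy to look at the neighbours of a node; the variable names are only illustrative:

```python
import xml.etree.ElementTree as ET

profile = ET.fromstring(
    "<profile><dog>1</dog><halfdog>0</halfdog><cat>545</cat><lions>0</lions>"
    "<bird>23</bird><dino>0</dino><pineapples>2</pineapples><people>0</people></profile>"
)

# Build the new element.
snail = ET.Element("snail")
snail.text = "2"

# Locate the element to insert before and its position among its siblings.
children = list(profile)
pineapples = profile.find("pineapples")
index = children.index(pineapples)

# Siblings are simply the neighbouring entries in the child list.
previous_sibling = children[index - 1].tag if index > 0 else None

# Insert at that position; pineapples and everything after it shift right.
profile.insert(index, snail)

tags = [child.tag for child in profile]
print(previous_sibling)
print(tags)
```

ElementTree elements keep no reference to their parent, which is why the sketch starts from the parent and works with the child list instead of asking a node for its siblings.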

XML attributes get sorted

When I create a document using the minidom, attributes get sorted alphabetically in the element. Take this example from here:
from xml.dom import minidom
# New document
xml = minidom.Document()
# Creates user element
userElem = xml.createElement("user")
# Set attributes to user element
userElem.setAttribute("name", "Sergio Oliveira")
userElem.setAttribute("nickname", "seocam")
userElem.setAttribute("email", "seocam@taboca.com")
userElem.setAttribute("photo","seocam.png")
# Append user element in xml document
xml.appendChild(userElem)
# Print the xml code
print xml.toprettyxml()
The result is this:
<?xml version="1.0" ?>
<user email="seocam@taboca.com" name="Sergio Oliveira" nickname="seocam" photo="seocam.png"/>
Which is all very well if you wanted the attributes in email/name/nickname/photo order instead of name/nickname/email/photo order as they were created.
How do you get the attributes to show up in the order you created them? Or, how do you control the order at all?

According to the documentation, the order of attributes is arbitrary but consistent for the life of the DOM. This is common across DOM implementations. Sorry.
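One note for newer interpreters: as far as I know, from Python 3.8 onward minidom's toxml() and toprettyxml() keep attributes in the order they were set, instead of sorting them as older versions did. The sketch below assumes Python 3.8 or later and reuses the example from the question; the regular expression is just a quick way to read the attribute names back out of the serialized string:

```python
import re
from xml.dom import minidom

doc = minidom.Document()
user_elem = doc.createElement("user")

# Attributes are set in the order name, nickname, email, photo.
user_elem.setAttribute("name", "Sergio Oliveira")
user_elem.setAttribute("nickname", "seocam")
user_elem.setAttribute("email", "seocam@taboca.com")
user_elem.setAttribute("photo", "seocam.png")
doc.appendChild(user_elem)

serialized = user_elem.toxml()

# Attribute names in the order they appear in the output.
order = re.findall(r'(\w+)="', serialized)

print(serialized)
print(order)
```

On Python 3.7 and earlier the same script would print the attributes in alphabetical order, as in the question.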

We Keep Coding
