Extracting the value of a key with BeautifulSoup - python

I want to extract the value of the "archivo" key of something like this:
...
<applet name="bla" code="Any.class" archive="Any.jar">
<param name="abc" value="space='1' archivo='bla.jpg'" </param>
<param name="def" value="space='2' archivo='bli.jpg'" </param>
<param name="jkl" value="space='3' archivo='blu.jpg'" </param>
</applet>
...
I suppose I need a list with [bla.jpg, bli.jpg, ...], so I try options like:
inputTag = soup.findAll("param",{'value':'archivo'})
or
inputTag = soup.findAll(attrs={"value" : "archivo"})
or
inputTag = soup.findAll("archivo")
and always I get an empty list: []
Other unsuccessful options:
inputTag = soup.findAll("param",{"value" : "archivo"}.contents)
I get something like: a dict object hasn't attribute contents
inputTag = unicode(getattr(soup.findAll('archivo'), 'string', ''))
I get nothing.
Finally I have seen: Difference between attrMap and attrs in beautifulSoup, and:
for tag in soup.recursiveChildGenerator():
print tag['archivo']
find nothing, it must be tag of name, code or archive keys.
and more finally:
tag.attrs = [(key,value) for key,value in tag.attrs if key == 'archivo']
but tag.attrs find nothing
OK, with jcollado's help I could get the list this way:
imageslist = []
patron = re.compile(r"archivo='([\w\./]+)'")
for tag in soup.findAll('param'):
if patron.search(tag['value']):
imageslist.append(patron.search(tag['value']).group(1))

The problem here is that archivo isn't an attribute of param, but something inside the value attribute. To extract archivo from value, I suggest to use a regular expression as follows:
>>> archivo_regex = re.compile(r"archivo='([\w\./]+)'")
>>> [archivo_regex.search(tag['value']).group(1)
... for tag in soup.findAll('param')]
[u'bla.jpg', u'bli.jpg', u'blu.jpg']

Related

Create a dictionary from an XML using xpath

I would like to create a dictionary from an XML file unsing xpath. Here's an example of the XML:
</Contract>
<Contract ID="1">
<UnwantedPatterns>
<Pattern>0</Pattern>
<Pattern>1</Pattern>
</Contract>
<Contract ID="2
<UnwantedPatterns>
<Pattern>0</Pattern>
<Pattern>1</Pattern>
</Contract>
What I would like it's having the contract ID as key and the unwanted patterns as value.
Here's my code:
UnwantedPatterns = []
key = []
DictUP = {}
for ID in root.xpath('//Contracts'):
key = ID.xpath('./Contract/#ID')
for patterns in root.xpath('.//Contract/UnwantedPatterns/Pattern'):
DictUP[key] = UnwantedPatterns.append(patterns.text)
I get the error "unhashable type: 'list'". Thank you for your help, the output should look like that:
{1: 0,1
2: 0,1}
xpath returns list, so instead of
key = ID.xpath('./Contract/#ID')
try
key = ID.xpath('./Contract/#ID')[0]
As for output, as dictionary cannot have multiple values with the same key DictUP[key] = UnwantedPatterns.append(patterns.text) will overwrite value on each iteration.
Try
for ID in root.xpath('//Contracts'):
key = ID.xpath('./Contract/#ID')[0]
_patterns = []
for unwanted in root.xpath('.//Contract/UnwantedPatterns'):
_patterns.extend([pattern.text for pattern in unwanted.xpath('./Pattern')])
DictUP[key] = _patterns

AttributeError when assigning value to function for XML data extraction

I'm coding a script to extract information from several XML files with the same structure but with missing sections when there is no information related to a tag. The easiest way to achieve this was using try/except so instead of getting a "AtributeError: 'NoneType' object has no atrribute 'find'" I assign an empty string('') to the object in the exeption. Something like this:
try:
string1=root.find('value1').find('value2').find('value3').text
except:
string1=''
The issue is that I want to shrink my code by using a function:
def extract(string):
tempstr=''
try:
tempstr=string.replace("\n", "")
except:
if tempstr is None:
tempstr=""
return string
And then I try to called it like this:
string1=extract(root.find('value1').find('value2').find('value3').text)
and value2 or value3 does not exist for the xml that is being processed, I get and AttributeError even if I don't use the variable in the function making the function useless.
Is there a way to make a function work, maybe there is a way to make it run without checking if the value entered is invalid?
Solution:
I'm using a mix of both answers:
def extract(root, xpath):
tempstr=''
try:
tempstr=root.findall(xpath)[0].text.replace("\n", "")
except:
tempstr=''#To avoid getting a Nonetype object
return tempstr
You can try something like that:
def extract(root, children_keys: list):
target_object = root
result_text = ''
try:
for child_key in children_keys:
target_object = target_object.find(child_key)
result_text = target_object.text
except:
pass
return result_text
You will go deeper at XML structure with for loop (children_keys - is predefined by you list of nested keys of XML - xml-path to your object).
And if error will throw inside that code - you will get '' as result.
Example XML (source):
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>
<y>Don't forget me this weekend!</y>
</body>
</note>
Example:
import xml.etree.ElementTree as ET
tree = ET.parse('note.xml')
root = tree.getroot()
children_keys = ['body', 'y']
result_string = extract(root, children_keys)
print(result_string)
Output:
"Don't forget me this weekend!"
Use XPATH expression
import xml.etree.ElementTree as ET
xml1 = '''<r><v1><v2><v3>a string</v3></v2></v1></r>'''
root = ET.fromstring(xml1)
v3 = root.findall('./v1/v2/v3')
if v3:
print(v3[0].text)
else:
print('v3 not found')
xml2 = '''<r><v1><v3>a string</v3></v1></r>'''
root = ET.fromstring(xml2)
v3 = root.findall('./v1/v2/v3')
if v3:
print(v3[0].text)
else:
print('v3 not found')
output
a string
v3 not found

How can I parse a XML file to a dictionary in Python?

I 'am trying to parse a XML file using the Python library minidom (even tried xml.etree.ElementTree API).
My XML (resource.xml)
<?xml version='1.0'?>
<quota_result xmlns="https://some_url">
</quota_rule>
<quota_rule name='max_mem_per_user/5'>
<users>user1</users>
<limit resource='mem' limit='1550' value='921'/>
</quota_rule>
<quota_rule name='max_mem_per_user/6'>
<users>user2 /users>
<limit resource='mem' limit='2150' value='3'/>
</quota_rule>
</quota_result>
I would like to parse this file and store inside a dictionnary the information in the following form and be able to access it:
dict={user1=[resource,limit,value],user2=[resource,limit,value]}
So far I have only been able to do things like:
docXML = minidom.parse("resource.xml")
for node in docXML.getElementsByTagName('limit'):
print node.getAttribute('value')
You can use getElementsByTagName and getAttribute to trace the result:
dict_users = dict()
docXML = parse('mydata.xml')
users= docXML.getElementsByTagName("quota_rule")
for node in users:
user = 'None'
tag_user = node.getElementsByTagName("users") #check the length of the tag_user to see if tag <users> is exist or not
if len(tag_user) ==0:
print "tag <users> is not exist"
else:
user = tag_user[0]
resource = node.getElementsByTagName("limit")[0].getAttribute("resource")
limit = node.getElementsByTagName("limit")[0].getAttribute("limit")
value = node.getElementsByTagName("limit")[0].getAttribute("value")
dict_users[user.firstChild.data]=[resource, limit, value]
if user == 'None':
dict_users['None']=[resource, limit, value]
else:
dict_users[user.firstChild.data]=[resource, limit, value]
print(dict_users) # remove the <users>user1</users> in xml
Output:
tag <users> is not exist
{'None': [u'mem', u'1550', u'921'], u'user2': [u'mem', u'2150', u'3']}

Dictionary key and value flipping themselves unexpectedly

I am running python 3.5, and I've defined a function that creates XML SubElements and adds them under another element. The attributes are in a dictionary, but for some reason the dictionary keys and values will sometimes flip when I execute the script.
Here is a snippet of kind of what I have (the code is broken into many functions so I combined it here)
import xml.etree.ElementTree as ElementTree
def AddSubElement(parent, tag, text='', attributes = None):
XMLelement = ElementTree.SubElement(parent, tag)
XMLelement.text = text
if attributes != None:
for key, value in attributes:
XMLelement.set(key, value)
print("attributes =",attributes)
return XMLelement
descriptionTags = ([('xmlns:g' , 'http://base.google.com/ns/1.0')])
XMLroot = ElementTree.Element('rss')
XMLroot.set('version', '2.0')
XMLchannel = ElementTree.SubElement(XMLroot,'channel')
AddSubElement(XMLchannel,'g:description', 'sporting goods', attributes=descriptionTags )
AddSubElement(XMLchannel,'link', 'http://'+ domain +'/')
XMLitem = AddSubElement(XMLchannel,'item')
AddSubElement(XMLitem, 'g:brand', Product['ProductManufacturer'], attributes=bindingParam)
AddSubElement(XMLitem, 'g:description', Product['ProductDescriptionShort'], attributes=bindingParam)
AddSubElement(XMLitem, 'g:price', Product['ProductPrice'] + ' USD', attributes=bindingParam)
The key and value does get switched! Because I'll see this in the console sometimes:
attributes = [{'xmlns:g', 'http://base.google.com/ns/1.0'}]
attributes = [{'http://base.google.com/ns/1.0', 'xmlns:g'}]
attributes = [{'http://base.google.com/ns/1.0', 'xmlns:g'}]
...
And here is the xml string that sometimes comes out:
<rss version="2.0">
<channel>
<title>example.com</title>
<g:description xmlns:g="http://base.google.com/ns/1.0">sporting goods</g:description>
<link>http://www.example.com/</link>
<item>
<g:id http://base.google.com/ns/1.0="xmlns:g">8987983</g:id>
<title>Some cool product</title>
<g:brand http://base.google.com/ns/1.0="xmlns:g">Cool</g:brand>
<g:description http://base.google.com/ns/1.0="xmlns:g">Why is this so cool?</g:description>
<g:price http://base.google.com/ns/1.0="xmlns:g">69.00 USD</g:price>
...
What is causing this to flip?
attributes = [{'xmlns:g', 'http://base.google.com/ns/1.0'}]
This is a list containing a set, not a dictionary. Neither sets nor dictionaries are ordered.

Node.TEXT_NODE has the value, but I need the Attribute

I have an xml file like so:
<host name='ip-10-196-55-2.ec2.internal'>
<hostvalue name='arch_string'>lx24-x86</hostvalue>
<hostvalue name='num_proc'>1</hostvalue>
<hostvalue name='load_avg'>0.01</hostvalue>
</host>
I can get get out the Node.data from a Node.TEXT_NODE, but I also need the Attribute name, like I want to know load_avg = 0.01, without writing load_avg, num_proc, etc, one by one. I want them all.
My code looks like this, but I can't figure out what part of the Node has the attribute name for me.
for stat in h.getElementsByTagName("hostvalue"):
for node3 in stat.childNodes:
attr = "foo"
val = "poo"
if node3.nodeType == Node.ATTRINUTE_NODE:
attr = node3.tagName
if node3.nodeType == Node.TEXT_NODE:
#attr = node3.tagName
val = node3.data
From the above code, I'm able to get val, but not attr (compile error:
here's a short example of what you could achieve:
from xml.dom import minidom
xmldoc = minidom.parse("so.xml")
values = {}
for stat in xmldoc.getElementsByTagName("hostvalue"):
attr = stat.attributes["name"].value
value = "\n".join([x.data for x in stat.childNodes])
values[attr] = value
print repr(values)
This outputs, given your XML file:
$ ./parse.py
{u'num_proc': u'1', u'arch_string': u'lx24-x86', u'load_avg': u'0.01'}
Be warned that this is not failsafe, i.e. if you have nested elements inside <hostvalue>.

Categories