Create a dictionary from an XML using xpath - python

I would like to create a dictionary from an XML file unsing xpath. Here's an example of the XML:
</Contract>
<Contract ID="1">
<UnwantedPatterns>
<Pattern>0</Pattern>
<Pattern>1</Pattern>
</Contract>
<Contract ID="2
<UnwantedPatterns>
<Pattern>0</Pattern>
<Pattern>1</Pattern>
</Contract>
What I would like it's having the contract ID as key and the unwanted patterns as value.
Here's my code:
UnwantedPatterns = []
key = []
DictUP = {}
for ID in root.xpath('//Contracts'):
key = ID.xpath('./Contract/#ID')
for patterns in root.xpath('.//Contract/UnwantedPatterns/Pattern'):
DictUP[key] = UnwantedPatterns.append(patterns.text)
I get the error "unhashable type: 'list'". Thank you for your help, the output should look like that:
{1: 0,1
2: 0,1}

xpath returns list, so instead of
key = ID.xpath('./Contract/#ID')
try
key = ID.xpath('./Contract/#ID')[0]
As for output, as dictionary cannot have multiple values with the same key DictUP[key] = UnwantedPatterns.append(patterns.text) will overwrite value on each iteration.
Try
for ID in root.xpath('//Contracts'):
key = ID.xpath('./Contract/#ID')[0]
_patterns = []
for unwanted in root.xpath('.//Contract/UnwantedPatterns'):
_patterns.extend([pattern.text for pattern in unwanted.xpath('./Pattern')])
DictUP[key] = _patterns

Related

how to get desired output in python from following output

I am getting this output as pasted below .
[{'accel-world-infinite-burst-2016': 'https://yts.mx/torrent/download/92E58C7C69D015DA528D8D7F22844BF49D702DFC'}, {'accel-world-infinite-burst-2016': 'https://yts.mx/torrent/download/3086E306E7CB623F377B6F99261F82CC8BB57115'}, {'accel-world-infinite-burst-2016': 'https://yifysubtitles.org/movie-imdb/tt5923132'}, {'anna-to-the-infinite-power-1983': 'https://yts.mx/torrent/download/E92B664EE87663D7E5EC8E9FEED574C586A95A62'}, {'anna-to-the-infinite-power-1983': 'https://yts.mx/torrent/download/4F6F194996AC29924DB7596FB646C368C4E4224B'}, {'anna-to-the-infinite-power-1983': 'https://yts.mx/movies/anna-to-the-infinite-power-1983/request-subtitle'}, {'infinite-2021': 'https://yts.mx/torrent/download/304DB2FEC8901E996B066B74E5D5C010D2F818B4'}, {'infinite-2021': 'https://yts.mx/torrent/download/1320D6D3B332399B2F4865F36823731ABD1444C0'}, {'infinite-2021': 'https://yts.mx/torrent/download/45821E5B2E339382E7EAEFB2D89967BB2C9835F6'}, {'infinite-2021': 'https://yifysubtitles.org/movie-imdb/tt6654210'}, {'infinite-potential-the-life-ideas-of-david-bohm-2020': 'https://yts.mx/torrent/download/47EB04FBC7DC37358F86A5BFC115A0361F019B5B'}, {'infinite-potential-the-life-ideas-of-david-bohm-2020': 'https://yts.mx/torrent/download/88223BEAA09D0A3D8FB7EEA62BA9C5EB5FDE9282'}, {'infinite-potential-the-life-ideas-of-david-bohm-2020': 'https://yts.mx/movies/infinite-potential-the-life-ideas-of-david-bohm-2020/request-subtitle'}, {'the-infinite-man-2014': 'https://yts.mx/torrent/download/0E2ACFF422AF4F62877F59EAE4EF93C0B3623828'}, {'the-infinite-man-2014': 'https://yts.mx/torrent/download/52437F80F6BDB6FD326A179FC8A63003832F5896'}, {'the-infinite-man-2014': 'https://yifysubtitles.org/movie-imdb/tt2553424'}, {'nick-and-norahs-infinite-playlist-2008': 'https://yts.mx/torrent/download/DA101D139EE3668EEC9EC5B855B446A39C6C5681'}, {'nick-and-norahs-infinite-playlist-2008': 'https://yts.mx/torrent/download/8759CD554E8BB6CFFCFCE529230252AC3A22D4D4'}, {'nick-and-norahs-infinite-playlist-2008': 'https://yifysubtitles.org/movie-imdb/tt0981227'}]
As you can see each movie have multiple links and for each link movie name is repeating .I want all links related to same movie must appeared as same object e.g
[{accel-world-infinite-burst-2016:{link1,link2,link3,link4},........]
for item in li:
# print(item.partition("movies/")[2])
movieName["Movies"].append(item.partition("movies/")[2])
req=requests.get(item)
s=soup(req.text,"html.parser")
m=s.find_all("p",{"class":"hidden-xs hidden-sm"})
# print(m[0])
for a in m[0].find_all('a', href=True):
# movieName['Movies'][item.partition("movies/")[2]]=(a['href'])
downloadLinks.append ( {item.partition("movies/")[2]:a['href'] })
you can try this,
# input = your list of dict
otp_dict = {}
for l in input:
for key, value in l.items():
if key not in otp_dict:
otp_dict[key] = list([value])
else:
otp_dict[key].append(value)
print(otp_dict)
otp: {'accel-world-infinite-burst-2016':[link1,link2],...}
output is dict containing list of links if you want set as you mentioned in your desired op try this
for l in input:
for key, value in l.items():
if key not in otp_dict:
otp_dict[key] = set([value])
else:
otp_dict[key].add(value)
otp: {'accel-world-infinite-burst-2016':{link1,link2},...}

Element XML parsing not giving proper result

I have sample below XML file and I am trying to generate below JSON but I am not geeting expected result it is only add one document in dictionary.
Sample Input XML:
<results status="passed">
<num-records>2</num-records>
<records>
<volume-info>
<flexible-volume-info>
<agg-name>aggr1_split</aggregate-name>
</flexible-volume-info>
<volume-name>volume1</volume-name>
<volume-size>
<actual-size>44</actual-volume-size>
<afs-avail>90</afs-avail>
</volume-size>
</volume-info>
<volume-info>
<flexible-volume-info>
<agg-name>aggr2_split</aggregate-name>
</flexible-volume-info>
<volume-name>volume2</volume-name>
<volume-size>
<actual-size>10</actual-volume-size>
<afs-avail>14</afs-avail>
</volume-size>
</volume-info>
</records>
</results>
Expected Output:
{
"agg-name": "aggr1_split",
"volume-name": "volume1",
"actual-size": "44"
},
{
"agg-name": "aggr2_split",
"volume-name": "volume2",
"actual-size": "10"
}
Sample code:
result = {}
for child in root.iter("records"):
result['agg-name'] = child.find('volume-info/flexible-volume-info/agg-name').text
result['volume-name'] = child.find('volume-info/volume-name').text
result['actual-size'] = child.find('volume-info/volume-size/actual-size').text
print result
Your expected output would be a dictionary which contained multiple identical keys which is not possible. You either need to choose different keys for each iteration of your loop or better still have a list of dictionaries:
import xml.etree.ElementTree as ET
xml_data = """<results status="passed">
<num-records>2</num-records>
<records>
<volume-info>
<flexible-volume-info>
<agg-name>aggr1_split</agg-name>
</flexible-volume-info>
<volume-name>volume1</volume-name>
<volume-size>
<actual-size>44</actual-size>
<afs-avail>90</afs-avail>
</volume-size>
</volume-info>
<volume-info>
<flexible-volume-info>
<agg-name>aggr2_split</agg-name>
</flexible-volume-info>
<volume-name>volume2</volume-name>
<volume-size>
<actual-size>10</actual-size>
<afs-avail>14</afs-avail>
</volume-size>
</volume-info>
</records>
</results>"""
root = ET.fromstring(xml_data)
results = []
for child in root.iter("volume-info"):
result = {}
print(child)
result['agg-name'] = child.find('flexible-volume-info/agg-name').text
result['volume-name'] = child.find('volume-name').text
result['actual-size'] = child.find('volume-size/actual-size').text
results.append(result)
print(results)
This would give you:
[{'agg-name': 'aggr1_split', 'volume-name': 'volume1', 'actual-size': '44'}, {'agg-name': 'aggr2_split', 'volume-name': 'volume2', 'actual-size': '10'}]
Your XML is also badly formed, the open and closing tags do not always match.

Dictionary key and value flipping themselves unexpectedly

I am running python 3.5, and I've defined a function that creates XML SubElements and adds them under another element. The attributes are in a dictionary, but for some reason the dictionary keys and values will sometimes flip when I execute the script.
Here is a snippet of kind of what I have (the code is broken into many functions so I combined it here)
import xml.etree.ElementTree as ElementTree
def AddSubElement(parent, tag, text='', attributes = None):
XMLelement = ElementTree.SubElement(parent, tag)
XMLelement.text = text
if attributes != None:
for key, value in attributes:
XMLelement.set(key, value)
print("attributes =",attributes)
return XMLelement
descriptionTags = ([('xmlns:g' , 'http://base.google.com/ns/1.0')])
XMLroot = ElementTree.Element('rss')
XMLroot.set('version', '2.0')
XMLchannel = ElementTree.SubElement(XMLroot,'channel')
AddSubElement(XMLchannel,'g:description', 'sporting goods', attributes=descriptionTags )
AddSubElement(XMLchannel,'link', 'http://'+ domain +'/')
XMLitem = AddSubElement(XMLchannel,'item')
AddSubElement(XMLitem, 'g:brand', Product['ProductManufacturer'], attributes=bindingParam)
AddSubElement(XMLitem, 'g:description', Product['ProductDescriptionShort'], attributes=bindingParam)
AddSubElement(XMLitem, 'g:price', Product['ProductPrice'] + ' USD', attributes=bindingParam)
The key and value does get switched! Because I'll see this in the console sometimes:
attributes = [{'xmlns:g', 'http://base.google.com/ns/1.0'}]
attributes = [{'http://base.google.com/ns/1.0', 'xmlns:g'}]
attributes = [{'http://base.google.com/ns/1.0', 'xmlns:g'}]
...
And here is the xml string that sometimes comes out:
<rss version="2.0">
<channel>
<title>example.com</title>
<g:description xmlns:g="http://base.google.com/ns/1.0">sporting goods</g:description>
<link>http://www.example.com/</link>
<item>
<g:id http://base.google.com/ns/1.0="xmlns:g">8987983</g:id>
<title>Some cool product</title>
<g:brand http://base.google.com/ns/1.0="xmlns:g">Cool</g:brand>
<g:description http://base.google.com/ns/1.0="xmlns:g">Why is this so cool?</g:description>
<g:price http://base.google.com/ns/1.0="xmlns:g">69.00 USD</g:price>
...
What is causing this to flip?
attributes = [{'xmlns:g', 'http://base.google.com/ns/1.0'}]
This is a list containing a set, not a dictionary. Neither sets nor dictionaries are ordered.

How to associate values of tags with label of the tag the using ElementTree in a Pythonic way

I have some xml files I am trying to process.
Here is a derived sample from one of the files
fileAsString = """
<?xml version="1.0" encoding="utf-8"?>
<eventDocument>
<schemaVersion>X2</schemaVersion>
<eventTable>
<eventTransaction>
<eventTitle>
<value>Some Event</value>
</eventTitle>
<eventDate>
<value>2003-12-31</value>
</eventDate>
<eventCoding>
<eventType>47</eventType>
<eventCode>A</eventCode>
<footnoteId id="F1"/>
<footnoteId id="F2"/>
</eventCoding>
<eventCycled>
<value></value>
</eventCycled>
<eventAmounts>
<eventVoltage>
<value>40000</value>
</eventVoltage>
</eventAmounts>
</eventTransaction>
</eventTable>
</eventDocument>"""
Note, there can be many eventTables in each document and events can have more details then just the ones I have isolated.
My goal is to create a dictionary in the following form
{'eventTitle':'Some Event, 'eventDate':'2003-12-31','eventType':'47',\
'eventCode':'A', 'eventCoding_FTNT_1':'F1','eventCoding_FTNT_2':'F2',\
'eventCycled': , 'eventVoltage':'40000'}
I am actually reading these in from files but assuming I have a string my code to get the text for the elements right below the eventTransaction element where the text is inside a value tag is as follows
import xml.etree.cElementTree as ET
myXML = ET.fromstring(fileAsString)
eventTransactions = [ e for e in myXML.iter() if e.tag == 'eventTransaction']
testTransaction = eventTransactions[0]
my_dict = {}
for child_of in testTransaction:
grand_children_tags = [e.tag for e in child_of]
if grand_children_tags == ['value']:
my_dict[child_of.tag] = [e.text for e in child_of][0]
>>> my_dict
{'eventTitle': 'Some Event', 'eventCycled': None, 'eventDate': '2003-12-31'}
This seems wrong because I am not really taking advantage of xml instead I am using brute force but I have not seemed to find an example.
Is there a clearer and more pythonic way to create the output I am looking for?
Use XPath to pull out the elements you're interested in.
The following code creates a list of lists of dicts (i.e. tables/transactions/info):
tables = []
myXML = ET.fromstring(fileAsString)
for table in myXML.findall('./eventTable'):
transactions = []
tables.append(transactions)
for transaction in table.findall('./eventTransaction'):
info = {}
for element in table.findall('.//*[value]'):
info[element.tag] = element.find('./value').text or ''
coding = transaction.find('./eventCoding')
if coding is not None:
for tag in 'eventType', 'eventCode':
element = coding.find('./%s' % tag)
if element is not None:
info[tag] = element.text or ''
for index, element in enumerate(coding.findall('./footnoteId')):
info['eventCoding_FTNT_%d' % index] = element.get('id', '')
if info:
transactions.append(info)
Output:
[[{'eventCode': 'A',
'eventCoding_FTNT_0': 'F1',
'eventCoding_FTNT_1': 'F2',
'eventCycled': '',
'eventDate': '2003-12-31',
'eventTitle': 'Some Event',
'eventType': '47',
'eventVoltage': '40000'}]]

Extracting the value of a key with BeautifulSoup

I want to extract the value of the "archivo" key of something like this:
...
<applet name="bla" code="Any.class" archive="Any.jar">
<param name="abc" value="space='1' archivo='bla.jpg'" </param>
<param name="def" value="space='2' archivo='bli.jpg'" </param>
<param name="jkl" value="space='3' archivo='blu.jpg'" </param>
</applet>
...
I suppose I need a list with [bla.jpg, bli.jpg, ...], so I try options like:
inputTag = soup.findAll("param",{'value':'archivo'})
or
inputTag = soup.findAll(attrs={"value" : "archivo"})
or
inputTag = soup.findAll("archivo")
and always I get an empty list: []
Other unsuccessful options:
inputTag = soup.findAll("param",{"value" : "archivo"}.contents)
I get something like: a dict object hasn't attribute contents
inputTag = unicode(getattr(soup.findAll('archivo'), 'string', ''))
I get nothing.
Finally I have seen: Difference between attrMap and attrs in beautifulSoup, and:
for tag in soup.recursiveChildGenerator():
print tag['archivo']
find nothing, it must be tag of name, code or archive keys.
and more finally:
tag.attrs = [(key,value) for key,value in tag.attrs if key == 'archivo']
but tag.attrs find nothing
OK, with jcollado's help I could get the list this way:
imageslist = []
patron = re.compile(r"archivo='([\w\./]+)'")
for tag in soup.findAll('param'):
if patron.search(tag['value']):
imageslist.append(patron.search(tag['value']).group(1))
The problem here is that archivo isn't an attribute of param, but something inside the value attribute. To extract archivo from value, I suggest to use a regular expression as follows:
>>> archivo_regex = re.compile(r"archivo='([\w\./]+)'")
>>> [archivo_regex.search(tag['value']).group(1)
... for tag in soup.findAll('param')]
[u'bla.jpg', u'bli.jpg', u'blu.jpg']

Categories