Parsing Autosar xml using beautiful soup python 3


I am trying to parse an AUTOSAR-specific ARXML file (an XML dialect) using Python, but I am unable to read the contents of the file. I want to get the DEFINITION-REF values inside multiple ECUC-CONTAINER-VALUE tags, for example:
/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression/BswMArgumentRef
I tried multiple ways but I am unable to print out the contents.
from bs4 import BeautifulSoup as Soup
def parseArxml():
    handler = open('input.arxml').read()
    soup = Soup(handler,"html.parser")
    for ecuc_container in soup.findAll('ECUC-CONTAINER-VALUE'):
        print(ecuc_container)
if __name__ == "__main__":
    parseArxml()
Here is a part of the arxml file:
<?xml version="1.0" encoding="UTF-8"?>
<AUTOSAR xmlns="http://autosar.org/schema/r4.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://autosar.org/schema/r4.0 autosar_4-2-1.xsd">
<ECUC-CONTAINER-VALUE UUID="c112c504-e546-41c3-abf9-0aaf06b18284">
<SHORT-NAME>BswMLogicalExpression_2</SHORT-NAME>
<DEFINITION-REF DEST="ECUC-PARAM-CONF-CONTAINER-DEF">/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression</DEFINITION-REF>
<REFERENCE-VALUES>
<ECUC-REFERENCE-VALUE>
<DEFINITION-REF DEST="ECUC-CHOICE-REFERENCE-DEF">/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression/BswMArgumentRef</DEFINITION-REF>
<VALUE-REF DEST="ECUC-CONTAINER-VALUE">/ARRoot/BswM_0/BswMConfig_0/BswMArbitration_0/BswMModeCondition_2</VALUE-REF>
</ECUC-REFERENCE-VALUE>
</REFERENCE-VALUES>
</ECUC-CONTAINER-VALUE>
<ECUC-CONTAINER-VALUE UUID="c112c504-e546-41c3-abf9-0aaf06b18284">
<SHORT-NAME>BswMLogicalExpression_3</SHORT-NAME>
<DEFINITION-REF DEST="ECUC-PARAM-CONF-CONTAINER-DEF">/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression</DEFINITION-REF>
<REFERENCE-VALUES>
<ECUC-REFERENCE-VALUE>
<DEFINITION-REF DEST="ECUC-CHOICE-REFERENCE-DEF">/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression/BswMArgumentRef</DEFINITION-REF>
<VALUE-REF DEST="ECUC-CONTAINER-VALUE">/ARRoot/BswM_2/BswMConfig_2/BswMArbitration_2/BswMModeCondition_3</VALUE-REF>
</ECUC-REFERENCE-VALUE>
</REFERENCE-VALUES>
</ECUC-CONTAINER-VALUE>
</AUTOSAR>

You'll see with print(soup) that tag names were converted to lower-case by the parser. So use lowercase when searching for tag names:
for ecuc_container in soup.findAll('ECUC-CONTAINER-VALUE'.lower()):
or simply:
for ecuc_container in soup.findAll('ecuc-container-value'):
Or even better: explicitly parse the document as XML, so that the case of tags is not modified:
soup = Soup(handler,'xml')
Here's how you can get a list of the text inside <DEFINITION-REF DEST="ECUC-CHOICE-REFERENCE-DEF"> elements:
def parseArxml():
    handler = open('input.arxml').read()
    soup = Soup(handler,'xml')
    dest = [d.text for d in soup.findAll('DEFINITION-REF') if d['DEST']=='ECUC-CHOICE-REFERENCE-DEF']
    print(dest)
Output:
['/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression/BswMArgumentRef',
'/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression/BswMArgumentRef']
Or if you want to get all DEFINITION-REF tags regardless of attribute, use the original upper-case name, because the XML parser preserves case:
dest = [d.text for d in soup.findAll('DEFINITION-REF')]
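For comparison, the same extraction can be done without BeautifulSoup. The following is a minimal sketch using the standard library's xml.etree.ElementTree, not the approach from the answer above. It assumes the file declares the default namespace http://autosar.org/schema/r4.0 as in the question; the prefix "ar", the function name definition_refs and the trimmed sample are illustrative choices only.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample in the same shape as the question's file.
ARXML = """<AUTOSAR xmlns="http://autosar.org/schema/r4.0">
  <ECUC-CONTAINER-VALUE>
    <SHORT-NAME>BswMLogicalExpression_2</SHORT-NAME>
    <DEFINITION-REF DEST="ECUC-PARAM-CONF-CONTAINER-DEF">/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression</DEFINITION-REF>
    <REFERENCE-VALUES>
      <ECUC-REFERENCE-VALUE>
        <DEFINITION-REF DEST="ECUC-CHOICE-REFERENCE-DEF">/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression/BswMArgumentRef</DEFINITION-REF>
      </ECUC-REFERENCE-VALUE>
    </REFERENCE-VALUES>
  </ECUC-CONTAINER-VALUE>
</AUTOSAR>"""

# "ar" is an arbitrary prefix chosen here; only the URI has to match the file.
NS = {"ar": "http://autosar.org/schema/r4.0"}


def definition_refs(xml_text, dest=None):
    """Return DEFINITION-REF texts, optionally filtered by the DEST attribute."""
    root = ET.fromstring(xml_text)
    refs = root.iterfind(".//ar:DEFINITION-REF", NS)
    return [ref.text for ref in refs if dest is None or ref.get("DEST") == dest]


print(definition_refs(ARXML, dest="ECUC-CHOICE-REFERENCE-DEF"))
print(definition_refs(ARXML))
```

ElementTree keeps tag case and is namespace-aware, so the search has to name the namespace explicitly; that is the main difference from the BeautifulSoup calls above.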

It seems your parser and BeautifulSoup version are converting tag names to lowercase.
Search with lowercase names instead:
from bs4 import BeautifulSoup as Soup
def parseArxml():
    handler = open('input.arxml').read()
    soup = Soup(handler,"html.parser")
    for ecuc_container in soup.find_all('ecuc-container-value'):
        for def_ref in ecuc_container.find_all('definition-ref'):
            print(def_ref.get_text())
if __name__ == "__main__":
    parseArxml()
OUTPUT:
/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression
/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression/BswMArgumentRef
/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression
/AUTOSAR/ecucdef/BswM/BswMConfig/BswMArbitration/BswMLogicalExpression/BswMArgumentRef

Related

Python lxml: how to fetch XML tag names with xpath selector?

I'm trying to parse the following XML using Python and lxml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/bind9.xsl"?>
<isc version="1.0">
<bind>
<statistics version="2.2">
<memory>
<summary>
<TotalUse>1232952256
</TotalUse>
<InUse>835252452
</InUse>
<BlockSize>598212608
</BlockSize>
<ContextSize>52670016
</ContextSize>
<Lost>0
</Lost>
</summary>
</memory>
</statistics>
</bind>
</isc>
The goal is to extract the tag name and text of every element under bind/statistics/memory/summary in order to produce the following mapping:
TotalUse: 1232952256
InUse: 835252452
BlockSize: 598212608
ContextSize: 52670016
Lost: 0
I've managed to extract the element values, but I can't figure out the xpath expression to get the element tag names.
A sample script:
from lxml import etree as et
def main():
    xmlfile = "bind982.xml"
    location = "bind/statistics/memory/summary/*"
    label_selector = "??????" ## what to put here...?
    value_selector = "text()"
    with open(xmlfile, "r") as data:
        xmldata = et.parse(data)
    etree = xmldata.getroot()
    statlist = etree.xpath(location)
    for stat in statlist:
        label = stat.xpath(label_selector)[0]
        value = stat.xpath(value_selector)[0]
        print("{0}: {1}".format(label, value))
if __name__ == '__main__':
    main()
I know I could use label = stat.tag instead of stat.xpath(), but the script must be sufficiently generic to also process other pieces of XML where the label selector is different.
What xpath selector would return an element's tag name?

Simply use XPath's name() function, and remove the [0] index, since name() returns a string rather than a list.
from lxml import etree as et
def main():
    xmlfile = "ExtractXPathTagName.xml"
    location = "bind/statistics/memory/summary/*"
    label_selector = "name()"  # XPath function that returns the element's tag name
    value_selector = "text()"
    with open(xmlfile, "r") as data:
        xmldata = et.parse(data)
    etree = xmldata.getroot()
    statlist = etree.xpath(location)
    for stat in statlist:
        label = stat.xpath(label_selector)
        value = stat.xpath(value_selector)[0]
        print("{0}: {1}".format(label, value).strip())
if __name__ == '__main__':
    main()
Output
TotalUse: 1232952256
InUse: 835252452
BlockSize: 598212608
ContextSize: 52670016
Lost: 0

I think you don't need XPath for the two values: element nodes have the properties tag and text, so you can use, for instance, a list comprehension:
[(element.tag, element.text) for element in etree.xpath(location)]
Or if you really want to use XPath
result = [(element.xpath('name()'), element.xpath('string()')) for element in etree.xpath(location)]
You could of course also construct a list of dictionaries:
result = [{ element.tag : element.text } for element in etree.xpath(location)]
or
result = [{ element.xpath('name()') : element.xpath('string()') } for element in etree.xpath(location)]
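In the sample XML each value is followed by a line break before its closing tag, so element.text carries trailing whitespace. Below is a minimal sketch of building the label-to-value mapping with that whitespace stripped. It swaps in the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; the function name summary_stats and the shortened sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's document (three of the five counters).
XML = """<isc version="1.0">
  <bind>
    <statistics version="2.2">
      <memory>
        <summary>
          <TotalUse>1232952256
          </TotalUse>
          <InUse>835252452
          </InUse>
          <Lost>0
          </Lost>
        </summary>
      </memory>
    </statistics>
  </bind>
</isc>"""


def summary_stats(xml_text, location="bind/statistics/memory/summary/*"):
    """Map each matched element's tag name to its whitespace-stripped text."""
    root = ET.fromstring(xml_text)
    return {el.tag: (el.text or "").strip() for el in root.findall(location)}


stats = summary_stats(XML)
for label, value in stats.items():
    print(f"{label}: {value}")
```

ElementTree's findall() only supports a subset of XPath and has no name() function, so this sketch reads el.tag directly; with lxml the same dictionary comprehension works over etree.xpath(location).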

How to retrieve information from xml tspan tag

I am currently using BeautifulSoup to do XML parsing, but I don't know how to get information from inside the tspan tag. The XML I am parsing looks like this:
<text
transform="matrix(0,-1,-1,0,7931,3626)"
style="font-variant:normal;font-weight:normal;font-size:92.3259964px;font-family:Arial;-inkscape-font-specification:ArialMT;writing-mode:lr-tb;fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:none"
id="text60264"><tspan
x="0 61.581444 123.16289 184.74432 251.4037 323.23334 384.81476 410.48138 436.14801 497.72946 559.31091 625.97028 692.62964 754.21112 805.54437 831.211 918.3667 979.94818 1005.6148 1077.4445 1144.1038 1200.515 1226.1816 1256.9261 1318.5076 1390.3373 1421.0818 1446.7484 1472.415 1526.3334 1587.9149 1649.4963 1716.1556 1782.8151 1844.3965 1895.7297 1921.3964 2008.5521 2070.1335 2095.8003 2167.6299 2234.2893 2290.7004"
y="0"
sodipodi:role="line"
id="tspan60262">APPROX. BARREL WEIGHT (KG): <BARREL WEIGHT></tspan></text>
I can get the text from the text tag, but I am trying to get the x="0 61.581..." so that I can change it to just x="0". So far my code only gets the tspan xml tags
from bs4 import BeautifulSoup
infile = open('1c E37 Face Electric Operation No No.svg', 'r')
contents = infile.read()
soup = BeautifulSoup(contents, 'lxml')
items = soup.find_all('tspan')
for item in items:
    print(item)

You could also use CSS selectors. Adding the x attribute to the selector returns only the elements where that attribute is present:
from bs4 import BeautifulSoup as bs
s = '''
<text
transform="matrix(0,-1,-1,0,7931,3626)"
style="font-variant:normal;font-weight:normal;font-size:92.3259964px;font-family:Arial;-inkscape-font-specification:ArialMT;writing-mode:lr-tb;fill:#000000;fill-opacity:1;fill-rule:nonzero;stroke:none"
id="text60264"><tspan
x="0 61.581444 123.16289 184.74432 251.4037 323.23334 384.81476 410.48138 436.14801 497.72946 559.31091 625.97028 692.62964 754.21112 805.54437 831.211 918.3667 979.94818 1005.6148 1077.4445 1144.1038 1200.515 1226.1816 1256.9261 1318.5076 1390.3373 1421.0818 1446.7484 1472.415 1526.3334 1587.9149 1649.4963 1716.1556 1782.8151 1844.3965 1895.7297 1921.3964 2008.5521 2070.1335 2095.8003 2167.6299 2234.2893 2290.7004"
y="0"
sodipodi:role="line"
id="tspan60262">APPROX. BARREL WEIGHT (KG): <BARREL WEIGHT></tspan></text>
'''
soup = bs(s, 'lxml')
print([i['x'] for i in soup.select('tspan[x]')])
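The question also asks how to change the attribute to x="0", not only how to read it. Here is a minimal sketch of that step, using the standard library's xml.etree.ElementTree instead of BeautifulSoup so it runs without extra packages. The SVG string is a shortened, hypothetical stand-in for the real file (the sodipodi attribute and the placeholder text are left out), and reset_tspan_x is an illustrative name.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical, shortened stand-in for the question's SVG file.
SVG = f"""<svg xmlns="{SVG_NS}">
  <text id="text60264"><tspan x="0 61.581444 123.16289" y="0" id="tspan60262">APPROX. BARREL WEIGHT (KG):</tspan></text>
</svg>"""


def reset_tspan_x(svg_text, new_x="0"):
    """Set the x attribute of every tspan that has one and return the new markup."""
    # Serialize the SVG namespace as the default one instead of an "ns0:" prefix.
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_text)
    for tspan in root.iter(f"{{{SVG_NS}}}tspan"):
        if "x" in tspan.attrib:
            tspan.set("x", new_x)
    return ET.tostring(root, encoding="unicode")


print(reset_tspan_x(SVG))
```

With BeautifulSoup the equivalent assignment would be item['x'] = '0' on each tag returned by soup.select('tspan[x]'), followed by writing str(soup) back to a file.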

Finding element in xml with python

I am trying to parse XML before converting its content into lists and then into CSV. Unfortunately, I think my search terms for finding the initial element are failing, which causes the subsequent searches further down the hierarchy to fail as well. I am new to XML, so I've tried variations on namespace dictionaries and including the namespace references. The simplified XML is given below:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
The code I am using to try to extract the ChangedBy element (in the http://nationalrail.co.uk/xml/common namespace) is below:
import csv
import xml.etree.ElementTree as ET
tree = ET.parse(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
root = tree.getroot()
#get at the tags and their data
#for elem in tree.iter():
#    print(f"this the tag {elem.tag} and this is the data: {elem.text}")
#open file for writing
station_data = open(rootfilepath + 'station_data.csv','w')
csvwriter = csv.writer(station_data)
station_head = []
count = 0
#inspiration for this code: http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-xml-to-csv-using-python/
#this is where it goes wrong; some combination of the namespace and the tag can't find anything in line 27, 'StationList'
for member in root.findall('{http://nationalrail.co.uk/xml/station}Station'):
    station = []
    if count == 0:
        changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').tag
        station_head.append(changedby)
        name = member.find('{http://nationalrail.co.uk/xml/station}Name').tag
        station_head.append(name)
        count = count+1
    changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').text
    station.append(changedby)
    name = member.find('{http://nationalrail.co.uk/xml/station}Name').text
    station.append(name)
    csvwriter.writerow(station)
I have tried:
using dictionaries of namespaces, but that results in nothing being found at all
using hard-coded namespaces, but that results in "AttributeError: 'NoneType' object has no attribute 'tag'"
Thanks in advance for all and any assistance.

First of all, your XML is not well-formed (the closing </StationList> tag is missing at the end of the file).
Assuming you have a well-formed XML file:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
</StationList>
Then you can convert your XML to a dictionary with xmltodict and simply index into it for the required value:
import xmltodict
with open('file.xml', 'r') as f:
    data = xmltodict.parse(f.read())
changed_by = data['StationList']['Station']['ChangeHistory']['com:ChangedBy']
print(changed_by)
Output:
spascos

Try lxml:
#!/usr/bin/env python3
from lxml import etree
ns = {"com": "http://nationalrail.co.uk/xml/common"}
with open("so.xml") as f:
    tree = etree.parse(f)
for t in tree.xpath("//com:ChangedBy/text()", namespaces=ns):
    print(t)
Output:
spascos

You can use BeautifulSoup, which can parse both HTML and XML:
from bs4 import BeautifulSoup
fd = open(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
soup = BeautifulSoup(fd,'lxml-xml')
for i in soup.findAll('ChangeHistory'):
    print(i.ChangedBy.text)
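The AttributeError in the question has a specific cause: Element.find() with a bare tag name only looks at direct children, and ChangedBy is a grandchild of Station (it sits inside ChangeHistory), so find() returns None. Below is a minimal sketch of the original ElementTree approach with the full path and a namespace dictionary. The prefixes "st" and "com" in NS, the function name station_rows and the in-memory CSV buffer are choices made for this example, not part of the question's code.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Reduced version of the question's document, with the closing tag restored.
XML = """<StationList xmlns:com="http://nationalrail.co.uk/xml/common"
             xmlns="http://nationalrail.co.uk/xml/station">
  <Station>
    <ChangeHistory>
      <com:ChangedBy>spascos</com:ChangedBy>
      <com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
    </ChangeHistory>
    <Name>Aber</Name>
  </Station>
</StationList>"""

# Prefixes are local to this script; only the URIs must match the document.
NS = {
    "st": "http://nationalrail.co.uk/xml/station",
    "com": "http://nationalrail.co.uk/xml/common",
}


def station_rows(xml_text):
    """Return a header row followed by one [ChangedBy, Name] row per Station."""
    root = ET.fromstring(xml_text)
    rows = [["ChangedBy", "Name"]]
    for station in root.findall("st:Station", NS):
        # ChangedBy is nested under ChangeHistory, so spell out the whole path.
        changed_by = station.findtext(
            "st:ChangeHistory/com:ChangedBy", default="", namespaces=NS
        )
        name = station.findtext("st:Name", default="", namespaces=NS)
        rows.append([changed_by, name])
    return rows


buffer = io.StringIO()
csv.writer(buffer).writerows(station_rows(XML))
print(buffer.getvalue())
```

Using findtext() with default="" means a station that lacks one of the elements produces an empty cell instead of raising an exception.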

how to handle duplicate tags in beautifulsoup

<system>
<load><avg01>0.03</avg01><avg05>0.15</avg05><avg15>0.16</avg15></load>
<cpu><user>7.4</user>
<system>3.2</system>
<wait>0.9</wait></cpu>
<memory><percent>17.1</percent>
<kilobyte>1220364</kilobyte></memory>
<swap><percent>0.0</percent>
<kilobyte>396</kilobyte></swap>
</system>
How can I take the entire outer system tag in BeautifulSoup and skip the nested ones? Note that there is a system tag inside the outer system tag.
r = requests.get(url, timeout=0.5)
result = BeautifulSoup(r.content)
for item in result.findAll('system'):
    print(item)
OUTPUT
<system><load><avg01>0.03</avg01><avg05>0.10</avg05><avg15>0.13</avg15></load><cpu><user>7.7</user></cpu></system>
Also I want to get percent value, but there are many percent tags in the entire xml which gets pulled out.

First, match only the outer one. After that you can loop through it to see the contents:
>>> for item in soup.find('system'):
...     print(item)
<load><avg01>0.03</avg01><avg05>0.15</avg05><avg15>0.16</avg15></load>
<cpu><user>7.4</user>
<system>3.2</system>
<wait>0.9</wait></cpu>
<memory><percent>17.1</percent>
<kilobyte>1220364</kilobyte></memory>
<swap><percent>0.0</percent>
<kilobyte>396</kilobyte></swap>

Just use soup.system
from bs4 import BeautifulSoup
html = """
<system>
<load><avg01>0.03</avg01><avg05>0.15</avg05><avg15>0.16</avg15></load>
<cpu><user>7.4</user>
<system>3.2</system>
<wait>0.9</wait></cpu>
<memory><percent>17.1</percent>
<kilobyte>1220364</kilobyte></memory>
<swap><percent>0.0</percent>
<kilobyte>396</kilobyte></swap>
</system>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.system)
This yields:
<system>
<load><avg01>0.03</avg01><avg05>0.15</avg05><avg15>0.16</avg15></load>
<cpu><user>7.4</user>
<system>3.2</system>
<wait>0.9</wait></cpu>
<memory><percent>17.1</percent>
<kilobyte>1220364</kilobyte></memory>
<swap><percent>0.0</percent>
<kilobyte>396</kilobyte></swap>
</system>
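The second part of the question, getting one specific percent value when several percent tags exist, comes down to qualifying the tag with its parent. Here is a minimal sketch using the standard library's xml.etree.ElementTree rather than BeautifulSoup, under the assumption that the fragment shown in the question is the whole document, so the outer system element is the root.

```python
import xml.etree.ElementTree as ET

XML = """<system>
<load><avg01>0.03</avg01><avg05>0.15</avg05><avg15>0.16</avg15></load>
<cpu><user>7.4</user>
<system>3.2</system>
<wait>0.9</wait></cpu>
<memory><percent>17.1</percent>
<kilobyte>1220364</kilobyte></memory>
<swap><percent>0.0</percent>
<kilobyte>396</kilobyte></swap>
</system>"""

# The parsed root is the outer <system>; the inner one is only reachable via <cpu>.
outer = ET.fromstring(XML)

# Paths are relative to the outer element, so each percent is unambiguous.
memory_percent = outer.findtext("memory/percent")
swap_percent = outer.findtext("swap/percent")
cpu_system = outer.findtext("cpu/system")

print(memory_percent, swap_percent, cpu_system)
```

In BeautifulSoup the same idea is expressed by chaining from the outer tag, for example soup.system.memory.percent.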

xml file parsing in python

xml file :
<global>
<rtmp>
<fcsapp>
<password>
<key>hello123</key>
<key>check123</key>
</password>
</fcsapp>
</rtmp>
</global>
python code : To obtain all the key tag values.
hello123
check123
using xml.etree.ElementTree
for streams in xmlRoot.iter('global'):
    xpath = "/rtmp/fcsapp/password"
    tag = "key"
    for child in streams.findall(xpath):
        resultlist.append(child.find(tag).text)
print(resultlist)
The output obtained is [hello123], but I want it to display both ([hello123, check123])
How do I obtain this?

Using lxml and cssselect I would do it like this:
>>> from lxml.html import fromstring
>>> doc = fromstring(open("foo.xml", "r").read())
>>> doc.cssselect("password key")
[<Element key at 0x7f77a6786cb0>, <Element key at 0x7f77a6786d70>]
>>> [e.text for e in doc.cssselect("password key")]
['hello123 \n ', 'check123 \n ']

With lxml and XPath you can do it in the following way:
from lxml import etree
xml = """
<global>
<rtmp>
<fcsapp>
<password>
<key>hello123</key>
<key>check123</key>
</password>
</fcsapp>
</rtmp>
</global>
"""
tree = etree.fromstring(xml)
result = tree.xpath('//password/key/text()')
print(result) # ['hello123', 'check123']

Try the BeautifulSoup package: https://pypi.python.org/pypi/BeautifulSoup

Using xml.etree.ElementTree:
for streams in xmlRoot.iter('global'):
    xpath = "/rtmp/fcsapp/password"
    tag = "key"
    for child in streams.iter(tag):
        resultlist.append(child.text)
print(resultlist)
You have to iterate over the "key" tags in the for loop to obtain the desired result. The above code solves the problem.
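The original attempt returned a single value because find() stops at the first matching child, whereas findall() returns every match. A minimal, self-contained sketch of the ElementTree version follows; the XML is embedded as a string here only so that the example runs on its own.

```python
import xml.etree.ElementTree as ET

XML = """<global>
  <rtmp>
    <fcsapp>
      <password>
        <key>hello123</key>
        <key>check123</key>
      </password>
    </fcsapp>
  </rtmp>
</global>"""

# The parsed root is <global>, so the path below is relative to it.
root = ET.fromstring(XML)

# findall() yields every <key> under the given path, not just the first one.
keys = [key.text for key in root.findall("rtmp/fcsapp/password/key")]

print(keys)  # ['hello123', 'check123']
```

Compared with iter('key'), the explicit path only picks up key elements under password, which matters if other parts of the file also contain key tags.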
