How to skip the <Br /> element when combining text in XML using Python?

I've been trying to combine the text of all the Content elements in an XML file using Python.
I succeeded in combining all the Content text, but I need to exclude any Content that sits directly below a <Br /> element.
A <Br /> element represents a line break (Enter) in Adobe InDesign.
This XML was exported from Adobe InDesign.
Here is an example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Story>
<ParagraphStyleRange>
<CharacterStyleRange>
<Content>AAA</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>BBB</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>CCC</Content>
<Br />
<Content>DDD</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>EEE</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>FFF</Content>
<Br />
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</Root>
And this is the output I want:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Story>
<ParagraphStyleRange>
<CharacterStyleRange>
<Content>AAA</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>AAABBB</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>CCC</Content>
<Br />
<Content>DDD</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>DDDEEE</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>FFF</Content>
<Br />
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</Root>
As you can see, I don't want to prepend Content text to the next one if there is a <Br /> element right above the Content I'm adding to.
In detail, the first Content element's text is AAA and the next one is BBB.
In this case AAA should be attached in front of BBB.
BBB is not attached in front of CCC because there is a <Br /> element right above the CCC Content.
Could you help me recognize the <Br /> element so I can skip it?
This is my code so far, but it doesn't work well:
import xml.etree.ElementTree as ET

tree = ET.parse("C:\\Br_test.xml")
root = tree.getroot()
for ParagraphStyleRange in root.findall('.//Story/ParagraphStyleRange'):
    CharacterStyleRange_count = len(ParagraphStyleRange.findall('CharacterStyleRange'))
    #print(CharacterStyleRange_count)
    if int(CharacterStyleRange_count) >= 2 :
        try :
            Content_collect = ''
            for CharacterStyleRange in ParagraphStyleRange.findall('CharacterStyleRange'):
                Br_count = len(CharacterStyleRange.findall('Br'))
                print(Br_count)
                if int(Br_count) == 0 :
                    for Content in CharacterStyleRange.findall('Content'):
                        Content_collect += Content.text
                        Content.text = str(Content_collect)
                        print(Content_collect)
            #---- Code to delete Contents that are attached to next one---
            #for CharacterStyleRange in ParagraphStyleRange.findall('CharacterStyleRange')[:-1]:
            #    for Content in CharacterStyleRange.findall('Content'):
            #        Content_remove = CharacterStyleRange.remove(Content)
        except:
            pass
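One possible approach, as a minimal sketch rather than a definitive answer: walk the children of each CharacterStyleRange in document order and reset the accumulated text whenever a <Br /> is seen. This assumes the rule is "accumulate Content text until the next <Br />", which reproduces the desired output shown above. The helper name merge_content and the inlined SAMPLE string are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (XML declaration omitted for brevity).
SAMPLE = """<Root><Story><ParagraphStyleRange>
<CharacterStyleRange><Content>AAA</Content></CharacterStyleRange>
<CharacterStyleRange><Content>BBB</Content></CharacterStyleRange>
<CharacterStyleRange><Br /><Content>CCC</Content><Br /><Content>DDD</Content></CharacterStyleRange>
<CharacterStyleRange><Content>EEE</Content></CharacterStyleRange>
<CharacterStyleRange><Br /><Content>FFF</Content><Br /></CharacterStyleRange>
</ParagraphStyleRange></Story></Root>"""


def merge_content(paragraph):
    """Prefix each Content with the text collected since the last <Br />."""
    collected = ""
    for char_range in paragraph.findall("CharacterStyleRange"):
        # Iterating an element yields its children in document order,
        # so Br and Content are seen in the order they appear in the file.
        for child in char_range:
            if child.tag == "Br":
                collected = ""  # a line break starts a fresh run of text
            elif child.tag == "Content":
                collected += child.text or ""
                child.text = collected


root = ET.fromstring(SAMPLE)
for paragraph in root.iter("ParagraphStyleRange"):
    merge_content(paragraph)

texts = [content.text for content in root.iter("Content")]
print(texts)
```

Counting Br elements per CharacterStyleRange, as in the attempt above, cannot tell whether a Br comes before or after a given Content; checking the children in order avoids that problem.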

Related

How to Extract the Information from XML Soap Response?

We have a requirement to get the data from a SOAP XML Response.
Below is the associated XML file
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetResultResponse xmlns="http://www.relatics.com/">
<GetResultResult>
<Report ReportName="RFC" GeneratedOn="2022-12-22" EnvironmentID="XXXX" EnvironmentName="Systematic Assurance – an XXX Solution" EnvironmentURL="https://XXXX.relaticsonline.com/" WorkspaceID="XXXXX" WorkspaceName="P - ADL Program Management - XXX" TargetDevice="Pc" ReportField="" xmlns="">
<Change_module>
<applied_individual_change_request Change_Request="TestKZIreport" RFC_GUID="XXXXX">
<code RFC_Code="VtW-0101" />
<progress RFC_Progress="agreed" />
<applied_individual_project_organisation Organisation="XXXX" />
<applied__individual_discipline Discipline="Highways" />
<specification Specification="Context of Documents">
<code Specification_Code="1.1.1a" />
</specification>
<applied_individual_workpackage Workpackage="Enabling work">
<code Workpackage_Code="WP-01" />
</applied_individual_workpackage>
<physical_object Physical_Object="Train Station">
<code Physical_Object_Code="TFO-0001" />
</physical_object>
<person approver="XXX" />
<applied_individual_change_consequence_qualification Consequence_Value="10 days">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Schedule" />
</applied_individual_change_consequence_qualification>
<document Document_Name="WI 300 Design.pdf">
<code Document_Code="DOC-0002" />
</document>
<answer_status BR_Status="no" />
<applied_individual_business_rule Business_Rule="Change Review compliance">
<code BR_Code="BR-006" />
</applied_individual_business_rule>
<applied_individual_change_consequence_qualification Consequence_Value="XXX">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Finance" />
</applied_individual_change_consequence_qualification>
</applied_individual_change_request>
</Change_module>
</Report>
</GetResultResult>
</GetResultResponse>
</soap:Body>
</soap:Envelope>
I need all the tag values below Change_module. I tried some suggestions from Stack Overflow, but they didn't work.
I have never worked with XML documents before; here is the sample code I tried from Stack Overflow.
import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
tree = ET.parse("Relatics_XML.xml")
root = tree.getroot()
print(root.tag)
print(root.attrib)
namespaces = {"soap": "http://www.w3.org/2003/05/soap-envelope/",
              "xsi": "http://www.w3.org/2001/XMLSchema-instance",
              "xsd": "http://www.w3.org/2001/XMLSchema/",
              'a': 'http://www.relatics.com/',}
names = tree.findall('./soap:Body''/a:GetResultResponse''/a:GetResultResult', namespaces)
print(names)
for name in names:
    print(name.text)
I tried different methods such as find and findall, and passed different values to them, but all they print is empty results.
I'm not sure how to get the values out of the tags.
Using xml.etree.ElementTree makes this easier (see its documentation).
It can read both tag attributes and inner text.
import xml.etree.ElementTree as ET
xml = """\
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetResultResponse xmlns="http://www.relatics.com/">
<GetResultResult>
<Report ReportName="RFC" GeneratedOn="2022-12-22" EnvironmentID="XXXX" EnvironmentName="Systematic Assurance – an XXX Solution" EnvironmentURL="https://XXXX.relaticsonline.com/" WorkspaceID="XXXXX" WorkspaceName="P - ADL Program Management - XXX" TargetDevice="Pc" ReportField=""
xmlns="">
<Change_module>
<applied_individual_change_request Change_Request="TestKZIreport" RFC_GUID="XXXXX">
<code RFC_Code="VtW-0101" />
<progress RFC_Progress="agreed" />
<applied_individual_project_organisation Organisation="XXXX" />
<applied__individual_discipline Discipline="Highways" />
<specification Specification="Context of Documents">
<code Specification_Code="1.1.1a" />
</specification>
<applied_individual_workpackage Workpackage="Enabling work">
<code Workpackage_Code="WP-01" />
</applied_individual_workpackage>
<physical_object Physical_Object="Train Station">
<code Physical_Object_Code="TFO-0001" />
</physical_object>
<person approver="XXX" />
<applied_individual_change_consequence_qualification Consequence_Value="10 days">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Schedule" />
</applied_individual_change_consequence_qualification>
<document Document_Name="WI 300 Design.pdf">
<code Document_Code="DOC-0002" />
</document>
<answer_status BR_Status="no" />
<applied_individual_business_rule Business_Rule="Change Review compliance">
<code BR_Code="BR-006" />
</applied_individual_business_rule>
<applied_individual_change_consequence_qualification Consequence_Value="XXX">
<applied_conceptual_change_consequence_aspect Consequence_Aspect="Finance" />
</applied_individual_change_consequence_qualification>
</applied_individual_change_request>
</Change_module>
</Report>
</GetResultResult>
</GetResultResponse>
</soap:Body>
</soap:Envelope>
"""
root = ET.fromstring(xml)
print("RFC_Code: " + str(root.find(".//code[@RFC_Code]").attrib))
print("RFC_Progress: " + str(root.find(".//progress[@RFC_Progress]").attrib))
print("specification: " + str(root.find(".//specification[@Specification]").attrib))
print("Specification_Code: " + str(root.find(".//code[@Specification_Code]").attrib))
print("Workpackage_Code: " + str(root.find(".//code[@Workpackage_Code]").attrib))
print("Document_Code: " + str(root.find(".//code[@Document_Code]").attrib))
Result
$ python get-data.py
RFC_Code: {'RFC_Code': 'VtW-0101'}
RFC_Progress: {'RFC_Progress': 'agreed'}
specification: {'Specification': 'Context of Documents'}
Specification_Code: {'Specification_Code': '1.1.1a'}
Workpackage_Code: {'Workpackage_Code': 'WP-01'}
Document_Code: {'Document_Code': 'DOC-0002'}
If you are reading the XML from a file, use this code instead:
with open('data.xml', 'r') as xml_file:
    # ET.parse returns an ElementTree; getroot() gives the root element
    root = ET.parse(xml_file).getroot()
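As a side note on why the namespace-based attempt in the question finds nothing: namespace URIs are compared as exact strings, and the question's mapping adds a trailing slash to the soap and xsd URIs that the document does not have. In addition, Report declares xmlns="", so everything below it has no namespace at all. The following is a minimal sketch under those observations, using a trimmed copy of the response; the prefix "r" and the rows list are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the SOAP response from the question.
SOAP = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <GetResultResponse xmlns="http://www.relatics.com/">
      <GetResultResult>
        <Report ReportName="RFC" xmlns="">
          <Change_module>
            <applied_individual_change_request Change_Request="TestKZIreport" RFC_GUID="XXXXX">
              <code RFC_Code="VtW-0101" />
              <progress RFC_Progress="agreed" />
            </applied_individual_change_request>
          </Change_module>
        </Report>
      </GetResultResult>
    </GetResultResponse>
  </soap:Body>
</soap:Envelope>"""

# URIs must match the document character for character (no extra slash).
ns = {
    "soap": "http://www.w3.org/2003/05/soap-envelope",
    "r": "http://www.relatics.com/",
}

root = ET.fromstring(SOAP)
result = root.find("./soap:Body/r:GetResultResponse/r:GetResultResult", ns)

# Report resets the default namespace, so no prefix is needed from here on.
module = result.find("./Report/Change_module")

# Collect (tag, attributes) for every element that carries attributes.
rows = [(el.tag, dict(el.attrib)) for el in module.iter() if el.attrib]
for tag, attrib in rows:
    print(tag, attrib)
```

Note that the data here lives in attributes, so name.text on these elements is empty or whitespace; reading el.attrib (or el.get("RFC_Code")) is what returns the values.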

How to iterate over an XML file to extract some attributes?

I'm trying to extract the values from an XML file and save them as a dataframe. For each line element, I'd like to add the date from its parent chk element.
<?xml version="1.0" encoding="ISO-8859-1"?>
<sales>
<chk no="xxx" date="xxxx" time="xxx" total="xxxx" debtor="xxxx" name="xxx" cardnumber="xxxxxxx" mobil="" >
<line productId="xxxx" product="xxxx" productGroupId="xxx" productGroup="xxx" amount="x" price="xxx" />
<line productId="xxx" product="xxx" productGroupId="xxx" productGroup="xxx" amount="xx" price="xxxx" />
</chk>
<chk no="xxx" date="xxxx" time="xx" total="xxxx" debtor="xxxx" name="xxxx" cardnumber="xxxx" mobil="xxxxx" >
<line productId="xxxx" product="xxxxx" productGroupId="xxxx" productGroup="xxx" amount="xxxx" price="xxxx" />
<line productId="xxxxx" product="xxxxx" productGroupId="xxxx" productGroup="xxxx" amount="xxx" price="xxxxx" />
</chk>
</sales>
root = ET.fromstring(response.content)
sales = []
for date in root.iter('chk'):
    sales.append(date.attrib)
lines = []
for line in root.iter('line'):
    lines.append(line.attrib)
I am able to extract the chk and line element separately. How can I append the date to the lines list?
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="ISO-8859-1"?>
<sales>
<chk no="xxx" date="xxxx" time="xxx" total="xxxx" debtor="xxxx" name="xxx" cardnumber="xxxxxxx" mobil="" >
<line productId="xxxx" product="xxxx" productGroupId="xxx" productGroup="xxx" amount="x" price="xxx" />
<line productId="xxx" product="xxx" productGroupId="xxx" productGroup="xxx" amount="xx" price="xxxx" />
</chk>
<chk no="xxx" date="zzzz" time="xx" total="xxxx" debtor="xxxx" name="xxxx" cardnumber="xxxx" mobil="xxxxx" >
<line productId="xxxx" product="xxxxx" productGroupId="xxxx" productGroup="xxx" amount="xxxx" price="xxxx" />
<line productId="xxxxx" product="xxxxx" productGroupId="xxxx" productGroup="xxxx" amount="xxx" price="xxxxx" />
</chk>
</sales>'''
root = ET.fromstring(xml)
for chk in root.findall('.//chk'):
    for line in chk.findall('line'):
        line.attrib['date'] = chk.attrib['date']
ET.dump(root)
Iterate over the lines inside the chk iteration, using the chk element (date) instead of root as the object to iterate over. Something like this:
root = ET.fromstring(resp)
for date in root.iter('chk'):
    for line in date.iter('line'):
        print(date.attrib, line.attrib)
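Since the goal in the question is a dataframe, the nested loop can also build one flat record per line. This is a minimal sketch with made-up sample values (the question's data is masked with x's); the name rows is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample values standing in for the masked data in the question.
XML = """<sales>
  <chk no="1" date="2023-01-01" total="30">
    <line productId="a" amount="1" price="10" />
    <line productId="b" amount="2" price="20" />
  </chk>
  <chk no="2" date="2023-01-02" total="5">
    <line productId="c" amount="1" price="5" />
  </chk>
</sales>"""

root = ET.fromstring(XML)

rows = []
for chk in root.iter("chk"):
    for line in chk.iter("line"):
        # Copy the line's attributes and add the parent's date to the record.
        rows.append({**line.attrib, "date": chk.attrib["date"]})

for row in rows:
    print(row)
```

A list of dicts like this can be passed directly to pandas.DataFrame(rows) if pandas is available.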

Parsing XML with Python - Accessing Values

I have recently got a RaspberryPi and have started to learn Python. To begin with I want to parse an XML file and I am doing this via the untangle library.
My XML looks like:
<?xml version="1.0" encoding="utf-8"?>
<weatherdata>
<location>
<name>Katherine</name>
<type>Administrative division</type>
<country>Australia</country>
<timezone id="Australia/Darwin" utcoffsetMinutes="570" />
<location altitude="176" latitude="-14.65012" longitude="132.17414" geobase="geonames" geobaseid="7839404" />
</location>
<sun rise="2019-02-04T06:33:52" set="2019-02-04T19:16:15" />
<forecast>
<tabular>
<time from="2019-02-04T06:30:00" to="2019-02-04T12:30:00" period="1">
<!-- Valid from 2019-02-04T06:30:00 to 2019-02-04T12:30:00 -->
<symbol number="9" numberEx="9" name="Rain" var="09" />
<precipitation value="1.8" />
<!-- Valid at 2019-02-04T06:30:00 -->
<windDirection deg="314.8" code="NW" name="Northwest" />
<windSpeed mps="3.3" name="Light breeze" />
<temperature unit="celsius" value="26" />
<pressure unit="hPa" value="1005.0" />
</time>
<time from="2019-02-04T12:30:00" to="2019-02-04T18:30:00" period="2">
<!-- Valid from 2019-02-04T12:30:00 to 2019-02-04T18:30:00 -->
<symbol number="9" numberEx="9" name="Rain" var="09" />
<precipitation value="2.3" />
<!-- Valid at 2019-02-04T12:30:00 -->
<windDirection deg="253.3" code="WSW" name="West-southwest" />
<windSpeed mps="3.0" name="Light breeze" />
<temperature unit="celsius" value="29" />
<pressure unit="hPa" value="1005.0" />
</time>
</tabular>
</forecast>
</weatherdata>
From this I would like to be able to print out the from and to attributes of the <time> element as well as the value attribute in its child node <temperature>
I can correctly print out the temperature values if I run the Python script below:
for forecast in data.weatherdata.forecast.tabular.time:
    print (forecast.temperature['value'])
but if I run
for forecast in data.weatherdata.forecast.tabular:
    print ("time is " + forecast.time['from'] + "and temperature is " + forecast.time.temperature['value'])
I get an error:
    print (forecast.time['from'] + forecast.time.temperature['value'])
TypeError: list indices must be integers, not str
Can anyone advise how I can correctly access these values?
forecast.time should be a list, as it does have multiple values, one for each <time> node.
Did you expect forecast.time['from'] to automatically aggregate that data?
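In other words, the loop has to run over the individual <time> elements, and the from, to and temperature values are then read from each one. With untangle that would presumably mean looping over data.weatherdata.forecast.tabular.time (as in the working snippet from the question) and indexing each item. The sketch below shows the same idea with the standard library's xml.etree.ElementTree instead of untangle, so it is a swapped-in technique rather than the asker's library; the XML is a trimmed copy of the question's data and the name readings is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the weather data from the question.
XML = """<weatherdata><forecast><tabular>
  <time from="2019-02-04T06:30:00" to="2019-02-04T12:30:00" period="1">
    <temperature unit="celsius" value="26" />
  </time>
  <time from="2019-02-04T12:30:00" to="2019-02-04T18:30:00" period="2">
    <temperature unit="celsius" value="29" />
  </time>
</tabular></forecast></weatherdata>"""

root = ET.fromstring(XML)

readings = []
for time_el in root.iter("time"):
    # Each <time> element carries its own attributes and its own child node.
    temperature = time_el.find("temperature")
    readings.append(
        (time_el.get("from"), time_el.get("to"), temperature.get("value"))
    )

for start, end, value in readings:
    print(f"time is {start} to {end} and temperature is {value}")
```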

Append xml to existing xml in python

I have an XML file as:
<a>
<b>
<c>
<condition>
....
</condition>
</c>
</b>
</a>
I have another XML in string type as :
<condition>
<comparison compare="and">
<operand idref="Agent" type="boolean" />
<comparison compare="lt">
<operand idref="Premium" type="float" />
<operand type="int" value="10000" />
</comparison>
</comparison>
</condition>
I need to comment the 'condition block' in the first xml and then append this second xml in place of it.
I have not tried commenting out the first block yet, but I did try to append the second XML to the first. I am able to append it, but the '<' and '>' characters come out escaped as &lt; and &gt; respectively, as in:
<a>
<b>
<c>
<condition>
....
</condition>
<condition>
<comparison compare="and">
<operand idref="Agent" type="boolean"/>
<comparison compare="lt">
<operand idref="Premium" type="float"/>
<operand type="int" value="10000"/>
</comparison>
</comparison>
</condition>
How do I convert this back to < and > rather than lt and gt?
And how do I delete or comment the <condition> block of the first xml below which I will append the new xml?
tree = ET.parse('basexml.xml') #This is the xml where i will append
tree1 = etree.parse(open('newxml.xml')) # This is the xml to append
xml_string = etree.tostring(tree1, pretty_print = True) #converted the xml to string
tree.find('a/b/c').text = xml_string #updating the content of the path with this new string(xml)
I converted the 'newxml.xml' into a string 'xml_string' and then appended to the path a/b/c of the first xml
You are adding newxml.xml, as a string, to the text property of the <c> element. That does not work. You need to add an Element object as a child of <c>.
Here is how it can be done:
from xml.etree import ElementTree as ET
# Parse both files into ElementTree objects
base_tree = ET.parse("basexml.xml")
new_tree = ET.parse("newxml.xml")
# Get a reference to the "c" element (the parent of "condition")
c = base_tree.find(".//c")
# Remove old "condition" and append new one
old_condition = c.find("condition")
new_condition = new_tree.getroot()
c.remove(old_condition)
c.append(new_condition)
print(ET.tostring(base_tree.getroot(), encoding="unicode"))
Result:
<a>
<b>
<c>
<condition>
<comparison compare="and">
<operand idref="Agent" type="boolean" />
<comparison compare="lt">
<operand idref="Premium" type="float" />
<operand type="int" value="10000" />
</comparison>
</comparison>
</condition></c>
</b>
</a>
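The question also asks about commenting out the old block instead of deleting it. One way this might be done, as a minimal sketch with small inline documents standing in for basexml.xml and newxml.xml, is to serialize the old element into an ET.Comment node and insert it at the old element's position. This assumes the old block contains no "--" sequence, which is not allowed inside an XML comment.

```python
import xml.etree.ElementTree as ET

# Small inline stand-ins for basexml.xml and newxml.xml.
base = ET.fromstring("<a><b><c><condition><old /></condition></c></b></a>")
new_condition = ET.fromstring(
    '<condition><comparison compare="and" /></condition>'
)

c = base.find(".//c")
old_condition = c.find("condition")
position = list(c).index(old_condition)

# Turn the old block into a comment node holding its serialized markup.
commented = ET.Comment(ET.tostring(old_condition, encoding="unicode"))

c.remove(old_condition)
c.insert(position, commented)          # comment goes where the old block was
c.insert(position + 1, new_condition)  # new block follows the comment

result = ET.tostring(base, encoding="unicode")
print(result)
```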

PyQuery get text node

I'm using PyQuery to process this HTML:
<div class="container">
<strong>Personality: Strengths</strong>
<br />
Text
<br />
<br />
<strong>Personality: Weaknesses</strong>
<br />
Text
<br />
<br />
</div>
Now that I have a variable e pointing to .container, I'm looping through its children:
for c in e.iterchildren():
    print(c.tag)
but this way I can't get the text nodes (the two Text strings).
How can I loop over an element's children, including the text nodes?
You can do it like this:
for c in e.children():
    p = PyQuery(c)
    print(str(p))
    # here re.sub could be used to remove the HTML tags
This code gets the raw text of each node.
If you want to distinguish the text nodes from the others:
raw = str(p).strip()
a = raw.rfind(">")
if a + 1 != len(raw):
    print('is text')
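An alternative to string searching is to use the element tree model directly: text that follows a child element is stored in that child's tail attribute, and text before the first child is in the parent's text attribute. PyQuery wraps lxml elements, which expose the same text and tail attributes as the standard library's ElementTree. The sketch below uses xml.etree.ElementTree so that it runs without extra packages (the snippet from the question happens to be well-formed XML); iter_nodes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The HTML from the question, which is also well-formed XML.
HTML = """<div class="container">
<strong>Personality: Strengths</strong>
<br />
Text
<br />
<br />
<strong>Personality: Weaknesses</strong>
<br />
Text
<br />
<br />
</div>"""


def iter_nodes(element):
    """Yield ('tag', name) and ('text', value) pairs in document order."""
    if element.text and element.text.strip():
        yield ("text", element.text.strip())
    for child in element:
        yield ("tag", child.tag)
        # Text that comes right after a child lives in child.tail.
        if child.tail and child.tail.strip():
            yield ("text", child.tail.strip())


nodes = list(iter_nodes(ET.fromstring(HTML)))
for kind, value in nodes:
    print(kind, value)
```

With PyQuery, the same loop should work on the underlying lxml element, assuming it is reachable as e[0].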
