I have a text file with entries like this:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<Applications_GetResponse xmlns="http://www.country.com">
<Applications>
<CS_Application>
<Name>Spain</Name>
<Key>2345364564</Key>
<Status>NORMAL</Status>
<Modules>
<CS_Module>
<Name>zaragoza</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
<CS_Module>
<Name>malaga</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
</Modules>
<CreatedBy>7</CreatedBy>
</CS_Application>
<CS_Application>
<Name>UK</Name>
<Key>2345364564</Key>
<Status>NORMAL</Status>
<Modules>
<CS_Module>
<Name>london</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
<CS_Module>
<Name>liverpool</Name>
<Key>8743249725</Key>
<DevelopmentEffort>0</DevelopmentEffort>
<LogicalDBConnections/>
</CS_Module>
</Modules>
<CreatedBy>7</CreatedBy>
</CS_Application>
</Applications>
</Applications_GetResponse>
</soap:Body>
</soap:Envelope>
I would like to parse it and print each country name followed by its cities.
I tried a few things with Python's re.findall, but I didn't get the result I wanted:
print("HERE APPLICATIONS")
applications = re.findall('<CS_Application><Name>(.*?)</Name>', response_apply.text)
print(applications)
print("HERE MODULES")
modules = re.findall('<CS_Module><Name>(.*?)</Name>', response_apply.text)
print(modules)
Output:
host-10$ sudo python3 capture.py
HERE APPLICATIONS
['Spain', 'UK']
HERE MODULES
['zaragoza', 'malaga', 'london', 'liverpool']
The expected result looks like this:
HERE
The Country: Spain - Cities: zaragoza,malaga
The Country: UK - Cities: london,liverpool
Regex is not a good tool for parsing XML; an XML parser is the better choice.
If you still want a regex solution, the code below should help.
import re
s = """\n<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">\n <soap:Body>\n <Applications_GetResponse xmlns="http://www.country.com">\n <Applications>\n <CS_Application>\n <Name>Spain</Name>\n <Key>2345364564</Key>\n <Status>NORMAL</Status>\n <Modules>\n <CS_Module>\n <Name>zaragoza</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n <CS_Module>\n <Name>malaga</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n </Modules>\n <CreatedBy>7</CreatedBy>\n </CS_Application>\n <CS_Application>\n <Name>UK</Name>\n <Key>2345364564</Key>\n <Status>NORMAL</Status>\n <Modules>\n <CS_Module>\n <Name>london</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n <CS_Module>\n <Name>liverpool</Name>\n <Key>8743249725</Key>\n <DevelopmentEffort>0</DevelopmentEffort>\n <LogicalDBConnections/>\n </CS_Module>\n </Modules>\n <CreatedBy>7</CreatedBy>\n </CS_Application>\n </Applications>\n </Applications_GetResponse>\n </soap:Body>\n</soap:Envelope>\n"""
pattern1 = re.compile(r'<CS_Application>([\s\S]*?)</CS_Application>')
pattern2 = re.compile(r'<Name>(.*?)</Name>')
for m in pattern1.finditer(s):
    ss = m.group(1)
    res = []
    for mm in pattern2.finditer(ss):
        res.append(mm.group(1))
    # The first <Name> in each block is the country; the rest are its cities.
    print("The Country: " + res[0] + " - Cities: " + ",".join(res[1:]))
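For comparison, here is a minimal sketch of the parser-based approach recommended above. It uses the standard library's xml.etree.ElementTree rather than regex. The SAMPLE string is a shortened copy of the response from the question, and the function name countries_with_cities and the namespace alias "c" are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question (same element names and namespaces).
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
<soap:Body>
<Applications_GetResponse xmlns="http://www.country.com">
<Applications>
<CS_Application>
<Name>Spain</Name>
<Modules>
<CS_Module><Name>zaragoza</Name></CS_Module>
<CS_Module><Name>malaga</Name></CS_Module>
</Modules>
</CS_Application>
<CS_Application>
<Name>UK</Name>
<Modules>
<CS_Module><Name>london</Name></CS_Module>
<CS_Module><Name>liverpool</Name></CS_Module>
</Modules>
</CS_Application>
</Applications>
</Applications_GetResponse>
</soap:Body>
</soap:Envelope>"""

# "c" is an arbitrary local alias for the default namespace of the payload.
NS = {"c": "http://www.country.com"}


def countries_with_cities(xml_text):
    """Return a list of (country, [cities]) tuples in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for app in root.iterfind(".//c:CS_Application", NS):
        # Only the direct <Name> child is the country name.
        country = app.findtext("c:Name", namespaces=NS)
        cities = [module.findtext("c:Name", namespaces=NS)
                  for module in app.iterfind("c:Modules/c:CS_Module", NS)]
        result.append((country, cities))
    return result


for country, cities in countries_with_cities(SAMPLE):
    print("The Country: " + country + " - Cities: " + ",".join(cities))
```

Because the paths are relative to each CS_Application element, the country and city names cannot get mixed up, and whitespace between tags does not matter.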
The problem is that the second XML file also contains the data from the first Excel row, and the third XML file contains the data from the first and second rows.
I have been working on this for hours and can't figure it out.
from lxml import etree
import openpyxl
# Create root element with namespace information
xmlns = "http://xml.datev.de/bedi/tps/ledger/v040"
xsi = "http://www.w3.org/2001/XMLSchema-instance"
schemaLocation = "http://xml.datev.de/bedi/tps/ledger/v040 Belegverwaltung_online_ledger_import_v040.xsd"
version = "4.0"
generator_info = "DATEV Musterdaten"
generating_system = "DATEV manuell"
xmlRoot = etree.Element(
"{" + xmlns + "}LedgerImport",
version=version,
attrib={"{" + xsi + "}schemaLocation": schemaLocation},
generator_info=generator_info,
generating_system=generating_system,
nsmap={'xsi': xsi, None: xmlns}
)
# open the Excel spreadsheet
wb = openpyxl.load_workbook('import_spendesk_datev.xlsx')
sheet = wb['Import']
# build the xml tree
for i in range(2,6):
    consolidate = etree.SubElement(xmlRoot, 'consolidate', attrib={
        'consolidatedAmount': str(sheet.cell(row=i,column=16).value),
        'consolidatedDate': str(sheet.cell(row=i,column=2).value),
        'consolidatedInvoiceId': str(sheet.cell(row=i,column=13).value),
        'consolidatedCurrencyCode': str(sheet.cell(row=i,column=12).value),
    })
    accountsPayableLedger = etree.SubElement(consolidate, 'accountsPayableLedger')
    account = etree.SubElement(accountsPayableLedger, 'bookingText')
    account.text = sheet.cell(row=i,column=21).value
    invoice = etree.SubElement(accountsPayableLedger, 'invoiceId')
    invoice.text = sheet.cell(row=i,column=13).value
    date = etree.SubElement(accountsPayableLedger, 'date')
    date.text = sheet.cell(row=i,column=2).value
    amount = etree.SubElement(accountsPayableLedger, 'amount')
    amount.text = sheet.cell(row=i,column=16).value
    account_no = etree.SubElement(accountsPayableLedger, 'accountNo')
    account_no.text = sheet.cell(row=i,column=19).value
    cost1 = etree.SubElement(accountsPayableLedger, 'costCategoryId')
    cost1.text = sheet.cell(row=i,column=15).value
    currency_code = etree.SubElement(accountsPayableLedger, 'currencyCode')
    currency_code.text = sheet.cell(row=i,column=12).value
    party_id = etree.SubElement(accountsPayableLedger, 'partyId')
    party_id.text = sheet.cell(row=i,column=20).value
    bpaccount = etree.SubElement(accountsPayableLedger, 'bpAccountNo')
    bpaccount.text = sheet.cell(row=i,column=20).value
    doc = etree.ElementTree(xmlRoot)
    doc.write(str(sheet.cell(row=i,column=13).value)+".xml", xml_declaration=True, encoding='utf-8', pretty_print=True)
What I want is one .xml file per Excel row, each containing only that row's data, like this:
<?xml version='1.0' encoding='UTF-8'?>
<LedgerImport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://xml.datev.de/bedi/tps/ledger/v040" generating_system="DATEV manuell" generator_info="DATEV Musterdaten" version="4.0" xsi:schemaLocation="http://xml.datev.de/bedi/tps/ledger/v040 Belegverwaltung_online_ledger_import_v040.xsd">
<consolidate consolidatedAmount="1337.01">
<accountsPayableLedger>
<bookingText>amazon</bookingText>
<invoiceId>1</invoiceId>
</accountsPayableLedger>
</consolidate>
</LedgerImport>
The same xmlRoot object is reused several times. You need to create a new root element for each iteration in the for loop.
The code that creates the root element can be put in a function. Here is a simplified example:
from lxml import etree
def makeroot():
    return etree.Element("LedgerImport")

for i in range(2, 6):
    xmlRoot = makeroot()
    consolidate = etree.SubElement(xmlRoot, 'consolidate',
                                   attrib={'consolidatedAmount': str(i)})
    doc = etree.ElementTree(xmlRoot)
    doc.write(str(i) + ".xml", xml_declaration=True, encoding='utf-8', pretty_print=True)
After @mzjn pointed out your basic mistake, here is something I made for fun: you can create nested XML with a declarative mapping instead of laboriously calling etree.SubElement yourself.
Here is how. Assume this as the basic situation:
from lxml import etree
import openpyxl
ns = {
None: 'http://xml.datev.de/bedi/tps/ledger/v040',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
}
mapping = {
'_tag': '{' + ns[None] + '}LedgerImport',
'attrib': {
'version': '4.0',
'{' + ns['xsi'] + '}schemaLocation': 'http://xml.datev.de/bedi/tps/ledger/v040 Belegverwaltung_online_ledger_import_v040.xsd',
'generator_info': 'DATEV Musterdaten',
'generating_system': 'DATEV manuell',
},
'nsmap': ns,
'_children': [{
'_tag': 'consolidate',
'attrib': {
'consolidatedAmount': lambda: sheet.cell(i, 16).value,
'consolidatedDate': lambda: sheet.cell(i, 2).value,
'consolidatedInvoiceId': lambda: sheet.cell(i, 13).value,
'consolidatedCurrencyCode': lambda: sheet.cell(i, 12).value,
},
'_children': [{
'_tag': 'accountsPayableLedger',
'_children': [
{'_tag': 'bookingText', '_text': lambda: sheet.cell(i, 21).value},
{'_tag': 'invoiceId', '_text': lambda: sheet.cell(i, 13).value},
{'_tag': 'date', '_text': lambda: sheet.cell(i, 2).value},
{'_tag': 'amount', '_text': lambda: sheet.cell(i, 16).value},
{'_tag': 'accountNo', '_text': lambda: sheet.cell(i, 19).value},
{'_tag': 'costCategoryId', '_text': lambda: sheet.cell(i, 15).value},
{'_tag': 'currencyCode', '_text': lambda: sheet.cell(i, 12).value},
{'_tag': 'partyId', '_text': lambda: sheet.cell(i, 20).value},
{'_tag': 'bpAccountNo', '_text': lambda: sheet.cell(i, 20).value},
]
}]
}],
}
The nested dict resembles your final XML document. Its keys also resemble the parameters that etree.Element() and etree.SubElement() take, with the addition of _text and _children.
Now we can define a single recursive helper function that takes this input tree and transforms it into a nested XML tree of the same configuration. As a bonus we can execute the lambda functions, which allows us to dynamically calculate attribute values and text:
def build_tree(template, parent=None):
    # prepare a dict for calling etree.Element()/etree.SubElement()
    params = {k: v for k, v in template.items() if k not in ['_children', '_text']}
    # calculate any dynamic attribute values into a fresh dict, so the lambdas
    # in the template are not overwritten and get re-evaluated on the next call
    if 'attrib' in params:
        params['attrib'] = {
            name: str(value() if callable(value) else value)
            for name, value in params['attrib'].items()
        }
    if parent is None:
        node = etree.Element(**params)
    else:
        params['_parent'] = parent
        node = etree.SubElement(**params)
    # calculate (if necessary) and set the node text
    if '_text' in template:
        if callable(template['_text']):
            node.text = str(template['_text']())
        else:
            node.text = str(template['_text']) if template['_text'] else template['_text']
    # recurse into children, if any
    for child in template.get('_children', []):
        build_tree(child, node)
    return node
We can call this in a loop:
wb = openpyxl.load_workbook('import_spendesk_datev.xlsx')
sheet = wb['Import']
for i in range(2,6):
    root = build_tree(mapping)
    doc = etree.ElementTree(root)
    name = "%s.xml" % sheet.cell(i, 13).value
    doc.write(name, xml_declaration=True, encoding='utf-8', pretty_print=True)
This should generate a couple of nicely nested XML documents, and it should be a lot easier to manage if your XML structure changes or gets more complicated.
Alternatively, consider XSLT, the special-purpose declarative language designed to transform XML files, which lxml supports. Specifically, pass parameters from Python to the stylesheet to transform a template XML (not unlike passing parameters to a prepared SQL statement):
XML template (includes all top-level namespaces)
<?xml version='1.0' encoding='UTF-8'?>
<LedgerImport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://xml.datev.de/bedi/tps/ledger/v040"
generating_system="DATEV manuell"
generator_info="DATEV Musterdaten" version="4.0"
xsi:schemaLocation="http://xml.datev.de/bedi/tps/ledger/v040 Belegverwaltung_online_ledger_import_v040.xsd">
<consolidate consolidatedAmount="???">
<accountsPayableLedger>
<bookingText>???</bookingText>
<invoiceId>???</invoiceId>
<date>???</date>
<amount>???</amount>
<accountNo>???</accountNo>
<costCategoryId>???</costCategoryId>
<currencyCode>???</currencyCode>
<partyId>???</partyId>
<bpAccountNo>???</bpAccountNo>
</accountsPayableLedger>
</consolidate>
</LedgerImport>
XSLT (save as .xsl file, a little longer due to default namespace in XML)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:doc="http://xml.datev.de/bedi/tps/ledger/v040">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- INITIALIZE PARAMETERS -->
<xsl:param name="prm_consolidate" />
<xsl:param name="prm_bookingText" />
<xsl:param name="prm_invoiceId" />
<xsl:param name="prm_date" />
<xsl:param name="prm_amount" />
<xsl:param name="prm_accountNo" />
<xsl:param name="prm_costCategoryId" />
<xsl:param name="prm_currencyCode" />
<xsl:param name="prm_partyId" />
<xsl:param name="prm_bpAccountNo" />
<!-- IDENTITY TRANSFORM -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- REWRITE TITLE TEXT -->
<xsl:template match="doc:accountsPayableLedger">
<xsl:copy>
<xsl:element name="consolidate" namespace="http://xml.datev.de/bedi/tps/ledger/v040">
<xsl:attribute name="consolidatedAmount"><xsl:value-of select="$prm_consolidate"/></xsl:attribute>
</xsl:element>
<xsl:element name="bookingText" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_bookingText"/></xsl:element>
<xsl:element name="invoiceId" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_invoiceId"/></xsl:element>
<xsl:element name="date" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_date"/></xsl:element>
<xsl:element name="amount" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_amount"/></xsl:element>
<xsl:element name="accountNo" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_accountNo"/></xsl:element>
<xsl:element name="costCategoryId" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_costCategoryId"/></xsl:element>
<xsl:element name="currencyCode" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_currencyCode"/></xsl:element>
<xsl:element name="partyId" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_partyId"/></xsl:element>
<xsl:element name="bpAccountNo" namespace="http://xml.datev.de/bedi/tps/ledger/v040"><xsl:value-of select="$prm_bpAccountNo"/></xsl:element>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python (no DOM element building)
import lxml.etree as et
import openpyxl
# LOAD XML AND XSL
xml = et.parse('/path/to/Template.xml')
xsl = et.parse('/path/to/XSLTScript.xsl')
### OPEN EXCEL SPREADSHEET
wb = openpyxl.load_workbook('import_spendesk_datev.xlsx')
sheet = wb['Import']
# LOOP THROUGH ROWS
for i in range(2, 6):
    # strparam() expects a string, so convert each cell value first
    consolidate = et.XSLT.strparam(str(sheet.cell(row=i,column=16).value))
    account = et.XSLT.strparam(str(sheet.cell(row=i,column=21).value))
    invoice = et.XSLT.strparam(str(sheet.cell(row=i,column=13).value))
    date = et.XSLT.strparam(str(sheet.cell(row=i,column=2).value))
    amount = et.XSLT.strparam(str(sheet.cell(row=i,column=16).value))
    account_no = et.XSLT.strparam(str(sheet.cell(row=i,column=19).value))
    cost1 = et.XSLT.strparam(str(sheet.cell(row=i,column=15).value))
    currency_code = et.XSLT.strparam(str(sheet.cell(row=i,column=12).value))
    party_id = et.XSLT.strparam(str(sheet.cell(row=i,column=20).value))
    bpaccount = et.XSLT.strparam(str(sheet.cell(row=i,column=20).value))
    # PASS PARAMETERS TO XSLT (keyword names must match the xsl:param names)
    transform = et.XSLT(xsl)
    result = transform(xml, prm_consolidate = consolidate,
                       prm_bookingText = account,
                       prm_invoiceId = invoice,
                       prm_date = date,
                       prm_amount = amount,
                       prm_accountNo = account_no,
                       prm_costCategoryId = cost1,
                       prm_currencyCode = currency_code,
                       prm_partyId = party_id,
                       prm_bpAccountNo = bpaccount)
    # SAVE XML TO FILE
    with open('/path/to/Output_Row{}.xml'.format(i), 'wb') as f:
        f.write(result)
I have an xml with a structure as follows:
<routes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://sumo.dlr.de/xsd/routes_file.xsd">
<vType id="Bus" vClass="ignoring" guiShape="bus" color="cyan"/>
<vehicle id="1.0" type="Bus" depart="0.00">
<route edges="207358226 206878618#0 206878618#1 206878618#2 206878571 206878624#0 195427225#1 25450515 171767377#0 171767377#1 195427224#0 96336873 96336870"/>
</vehicle>
<vehicle id="2.0" type="Taxi" depart="0.00">
<route edges="172428613 -25301974#1 -25301974#0 172428582 -172428593 -25301969#5 -25301969#4 -165310768#1 -165310768#0 -45073854#4 -45073854#3 -45073854#0 -32932418#2 172436826#1 172436826#2 172436826#3 172436826#4 172436826#5 172405270#0 24629564 172405301#1 -172405301#1 -24629564 -172405270#0"/>
</vehicle>
<vehicle id="1.1" type="Bus" depart="0.00">
<route edges="207358226 206878618#0 206878618#1 206878618#2 206878571 206878624#0 195427225#1 25450515 171767377#0 171767377#1 195427224#0 96336873 96336870"/>
</vehicle>
There are multiple vType elements (e.g. Bus, Taxi, passenger car, etc.), and for each vType there are multiple instantiations of vehicle (numbered 1.0, 1.1, etc.), each with a route child whose edges attribute lists the route edges.
I now want to extend the file so that each Bus vehicle gets stop subelements that specify the stops, as follows:
<routes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://sumo.dlr.de/xsd/routes_file.xsd">
<vType id="Bus" vClass="ignoring" guiShape="bus" color="cyan"/>
<vehicle id="1.0" type="Bus" depart="0.00">
<route edges="207358226 206878618#0 206878618#1 206878618#2 206878571 206878624#0 195427225#1 25450515 171767377#0 171767377#1 195427224#0 96336873 96336870"/>
<stop lane="207358226" endPos="10" duration="20"/>
<stop lane="206878618#0" endPos="10" duration="20"/>
<stop lane="206878618#1" endPos="10" duration="20"/>
..........
..........
</vehicle>
<vehicle id="2.0" type="Taxi" depart="0.00">
<route edges="172428613 -25301974#1 -25301974#0 172428582 -172428593 -25301969#5 -25301969#4 -165310768#1 -165310768#0 -45073854#4 -45073854#3 -45073854#0 -32932418#2 172436826#1 172436826#2 172436826#3 172436826#4 172436826#5 172405270#0 24629564 172405301#1 -172405301#1 -24629564 -172405270#0"/>
</vehicle>
<vehicle id="1.1" type="Bus" depart="0.00">
<route edges="207358226 206878618#0 206878618#1 206878618#2 206878571 206878624#0 195427225#1 25450515 171767377#0 171767377#1 195427224#0 96336873 96336870"/>
<stop lane="207358226" endPos="10" duration="20"/>
<stop lane="206878618#0" endPos="10" duration="20"/>
<stop lane="206878618#1" endPos="10" duration="20"/>
..........
..........
</vehicle>
My initial approach is to iterate over the parsed XML and pick up the elements with tag vehicle and type Bus. I then copy the edges into a list edgesnew, and in a loop I create a subelement named stop under vehicle for each selected edge. The code is as follows:
from lxml import etree

parser = etree.XMLParser(encoding='utf-8', recover=True)
routesFileTree = etree.parse('kaiserslautern.rou1.xml', parser)
routesFileRoot = routesFileTree.getroot()
vehicle = routesFileRoot.find('vehicle')
route = etree.SubElement(vehicle, 'route')
for elem in routesFileRoot.iter(tag = 'vehicle'):
    if elem.attrib['type'] == 'Bus':
        for subelem in elem.iter(tag = 'route'):
            if subelem.attrib.get('edges'):
                edgesnew = subelem.attrib['edges'].split(' ')
                for edges in range(0,len(edgesnew),3):
                    stop = etree.SubElement(vehicle,'stop', lane = stops[edgesnew[edges]], duration = "30")
The program runs, but my algorithm is wrong: when I print the tree I get the following output:
<routes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://sumo.dlr.de/xsd/routes_file.xsd">
<vType id="Bus" vClass="ignoring" guiShape="bus" color="cyan"/>
<vehicle id="1.0" type="Bus" depart="0.00">
<route edges="207358226 206878618#0 206878618#1 206878618#2 206878571 206878624#0 195427225#1 25450515 171767377#0 171767377#1 195427224#0 96336873 96336870"/>
<route><stop lane="207358226" duration="30" endPos="10"/><stop lane="206878618#0" duration="30" endPos="10"/><.........../></route></vehicle>
<vehicle id="2.0" type="Taxi" depart="0.00">
<route edges="172428613 -25301974#1 -25301974#0 172428582 -172428593 -25301969#5 -25301969#4 -165310768#1 -165310768#0 -45073854#4 -45073854#3 -45073854#0 -32932418#2 172436826#1 172436826#2 172436826#3 172436826#4 172436826#5 172405270#0 24629564 172405301#1 -172405301#1 -24629564 -172405270#0"/>
</vehicle>
<vehicle id="1.1" type="Bus" depart="0.00">
<route edges="207358226 206878618#0 206878618#1 206878618#2 206878571 206878624#0 195427225#1 25450515 171767377#0 171767377#1 195427224#0 96336873 96336870"/>
</vehicle>
There are multiple problems in the code. First, it creates the new subelements for only one instantiation of the vehicle. Second, it creates a new route element rather than appending to the existing XML. I have seen that I need to use element.append, but I can't figure out where and how.
Thanks in advance for the help.
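One possible fix, as a minimal sketch: attach each stop to the vehicle found in the current loop iteration instead of to a vehicle looked up once before the loop, and do not create an extra route element. The sketch uses the standard library's xml.etree.ElementTree (the same calls also exist in lxml.etree), a shortened SAMPLE document, and a made-up helper name add_stops. It assumes the lane value is simply the edge id, as in the desired output above, and it adds a stop for every edge rather than every third one.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the routes file from the question.
SAMPLE = """<routes>
<vType id="Bus" vClass="ignoring" guiShape="bus" color="cyan"/>
<vehicle id="1.0" type="Bus" depart="0.00">
<route edges="207358226 206878618#0 206878618#1 206878618#2"/>
</vehicle>
<vehicle id="2.0" type="Taxi" depart="0.00">
<route edges="172428613 -25301974#1"/>
</vehicle>
<vehicle id="1.1" type="Bus" depart="0.00">
<route edges="207358226 206878618#0"/>
</vehicle>
</routes>"""


def add_stops(root, vehicle_type="Bus", end_pos="10", duration="20"):
    """Append one <stop> per route edge to every vehicle of the given type."""
    # findall() returns a list, so the tree can be modified safely in the loop.
    for vehicle in root.findall("vehicle"):
        if vehicle.get("type") != vehicle_type:
            continue
        route = vehicle.find("route")  # reuse the existing <route>
        if route is None or not route.get("edges"):
            continue
        for edge in route.get("edges").split():
            # The parent is the vehicle of THIS iteration.
            ET.SubElement(vehicle, "stop", lane=edge, endPos=end_pos, duration=duration)
    return root


root = add_stops(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

SubElement already appends the new node to its parent, so no separate element.append call is needed here.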
This is my XML file:
<?xml version="1.0" ?>
<Items>
<Item>
<ASIN>3570102769</ASIN>
<DetailPageURL>http://www.amazon.de/Inside-IS-Tage-Islamischen-Staat/dp/3570102769%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3D3570102769</DetailPageURL>
<ItemLinks>
<ItemLink>
<Description>Add To Wishlist</Description>
<URL>http://www.amazon.de/gp/registry/wishlist/add-item.html%3Fasin.0%3D3570102769%26SubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3570102769</URL>
</ItemLink>
<ItemLink>
<Description>Tell A Friend</Description>
<URL>http://www.amazon.de/gp/pdp/taf/3570102769%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3570102769</URL>
</ItemLink>
<ItemLink>
<Description>All Customer Reviews</Description>
<URL>http://www.amazon.de/review/product/3570102769%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3570102769</URL>
</ItemLink>
<ItemLink>
<Description>All Offers</Description>
<URL>http://www.amazon.de/gp/offer-listing/3570102769%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3570102769</URL>
</ItemLink>
</ItemLinks>
<ItemAttributes>
<Author>Jürgen Todenhöfer</Author>
<Binding>Gebundene Ausgabe</Binding>
<EAN>9783570102763</EAN>
<EANList>
<EANListElement>9783570102763</EANListElement>
</EANList>
<ISBN>3570102769</ISBN>
<IsEligibleForTradeIn>1</IsEligibleForTradeIn>
<ItemDimensions>
<Height Units="hundredths-inches">874</Height>
<Length Units="hundredths-inches">575</Length>
<Width Units="hundredths-inches">126</Width>
</ItemDimensions>
<Label>C. Bertelsmann Verlag</Label>
<Languages>
<Language>
<Name>Deutsch</Name>
<Type>Published</Type>
</Language>
<Language>
<Name>Deutsch</Name>
<Type>Original</Type>
</Language>
<Language>
<Name>Deutsch</Name>
<Type>Unbekannt</Type>
</Language>
</Languages>
<ListPrice>
<Amount>1799</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 17,99</FormattedPrice>
</ListPrice>
<Manufacturer>C. Bertelsmann Verlag</Manufacturer>
<ManufacturerMinimumAge Units="months">192</ManufacturerMinimumAge>
<NumberOfPages>288</NumberOfPages>
<PackageDimensions>
<Height Units="hundredths-inches">118</Height>
<Length Units="hundredths-inches">567</Length>
<Weight Units="hundredths-pounds">93</Weight>
<Width Units="hundredths-inches">252</Width>
</PackageDimensions>
<PackageQuantity>1</PackageQuantity>
<ProductGroup>Book</ProductGroup>
<ProductTypeName>ABIS_BOOK</ProductTypeName>
<PublicationDate>2015-04-27</PublicationDate>
<Publisher>C. Bertelsmann Verlag</Publisher>
<Studio>C. Bertelsmann Verlag</Studio>
<Title>Inside IS - 10 Tage im 'Islamischen Staat'</Title>
<TradeInValue>
<Amount>930</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 9,30</FormattedPrice>
</TradeInValue>
</ItemAttributes>
<OfferSummary>
<LowestNewPrice>
<Amount>1799</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 17,99</FormattedPrice>
</LowestNewPrice>
<LowestUsedPrice>
<Amount>1390</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 13,90</FormattedPrice>
</LowestUsedPrice>
<LowestCollectiblePrice>
<Amount>4999</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 49,99</FormattedPrice>
</LowestCollectiblePrice>
<TotalNew>56</TotalNew>
<TotalUsed>8</TotalUsed>
<TotalCollectible>1</TotalCollectible>
<TotalRefurbished>0</TotalRefurbished>
</OfferSummary>
<Offers>
<TotalOffers>1</TotalOffers>
<TotalOfferPages>1</TotalOfferPages>
<MoreOffersUrl>http://www.amazon.de/gp/offer-listing/3570102769%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3570102769</MoreOffersUrl>
<Offer>
<OfferAttributes>
<Condition>New</Condition>
</OfferAttributes>
<OfferListing>
<OfferListingId>9KHCZj9qtL6ucVBPASfXaryQjU8tWbc0n%2F3F4F7GraOKW6Csji2OxpD93%2FkoHwgIGQctlnrtx4RWIeJULAcvvsFhiopFi08JdsZ%2FeO3u6g0%3D</OfferListingId>
<Price>
<Amount>1799</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 17,99</FormattedPrice>
</Price>
<Availability>Gewöhnlich versandfertig in 24 Stunden</Availability>
<AvailabilityAttributes>
<AvailabilityType>now</AvailabilityType>
<MinimumHours>0</MinimumHours>
<MaximumHours>0</MaximumHours>
</AvailabilityAttributes>
<IsEligibleForSuperSaverShipping>1</IsEligibleForSuperSaverShipping>
</OfferListing>
</Offer>
</Offers>
</Item>
<Item>
<ASIN>3813506479</ASIN>
<DetailPageURL>http://www.amazon.de/Altes-Land-Roman-D%C3%B6rte-Hansen/dp/3813506479%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3D3813506479</DetailPageURL>
<ItemLinks>
<ItemLink>
<Description>Add To Wishlist</Description>
<URL>http://www.amazon.de/gp/registry/wishlist/add-item.html%3Fasin.0%3D3813506479%26SubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3813506479</URL>
</ItemLink>
<ItemLink>
<Description>Tell A Friend</Description>
<URL>http://www.amazon.de/gp/pdp/taf/3813506479%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3813506479</URL>
</ItemLink>
<ItemLink>
<Description>All Customer Reviews</Description>
<URL>http://www.amazon.de/review/product/3813506479%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3813506479</URL>
</ItemLink>
<ItemLink>
<Description>All Offers</Description>
<URL>http://www.amazon.de/gp/offer-listing/3813506479%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3813506479</URL>
</ItemLink>
</ItemLinks>
<ItemAttributes>
<Author>Dörte Hansen</Author>
<Binding>Gebundene Ausgabe</Binding>
<EAN>9783813506471</EAN>
<EANList>
<EANListElement>9783813506471</EANListElement>
</EANList>
<ISBN>3813506479</ISBN>
<IsEligibleForTradeIn>1</IsEligibleForTradeIn>
<ItemDimensions>
<Height Units="hundredths-inches">870</Height>
<Length Units="hundredths-inches">567</Length>
<Width Units="hundredths-inches">114</Width>
</ItemDimensions>
<Label>Albrecht Knaus Verlag</Label>
<Languages>
<Language>
<Name>Deutsch</Name>
<Type>Published</Type>
</Language>
<Language>
<Name>Deutsch</Name>
<Type>Original</Type>
</Language>
</Languages>
<ListPrice>
<Amount>1999</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 19,99</FormattedPrice>
</ListPrice>
<Manufacturer>Albrecht Knaus Verlag</Manufacturer>
<NumberOfPages>288</NumberOfPages>
<PackageDimensions>
<Height Units="hundredths-inches">118</Height>
<Length Units="hundredths-inches">858</Length>
<Weight Units="hundredths-pounds">101</Weight>
<Width Units="hundredths-inches">559</Width>
</PackageDimensions>
<ProductGroup>Book</ProductGroup>
<ProductTypeName>ABIS_BOOK</ProductTypeName>
<PublicationDate>2015-02-16</PublicationDate>
<Publisher>Albrecht Knaus Verlag</Publisher>
<Studio>Albrecht Knaus Verlag</Studio>
<Title>Altes Land: Roman</Title>
<TradeInValue>
<Amount>965</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 9,65</FormattedPrice>
</TradeInValue>
</ItemAttributes>
<OfferSummary>
<LowestNewPrice>
<Amount>1999</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 19,99</FormattedPrice>
</LowestNewPrice>
<LowestUsedPrice>
<Amount>1599</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 15,99</FormattedPrice>
</LowestUsedPrice>
<TotalNew>72</TotalNew>
<TotalUsed>8</TotalUsed>
<TotalCollectible>0</TotalCollectible>
<TotalRefurbished>0</TotalRefurbished>
</OfferSummary>
<Offers>
<TotalOffers>1</TotalOffers>
<TotalOfferPages>1</TotalOfferPages>
<MoreOffersUrl>http://www.amazon.de/gp/offer-listing/3813506479%3FSubscriptionId%3DAKIAI554OLCUMRCYB7ZA%26tag%3DjPp08vuSO4osfgfbCbEdF7TNqnWOm7YtprtqRPB9%26linkCode%3Dxm2%26camp%3D2025%26creative%3D12738%26creativeASIN%3D3813506479</MoreOffersUrl>
<Offer>
<OfferAttributes>
<Condition>New</Condition>
</OfferAttributes>
<OfferListing>
<OfferListingId>aeRv5KPt26T8S0hLrgV8Bv9UPYABYOMijGRxffbNJXUZSN4XfeeOZZpCZ28EURzmgMLlcYEBSRlMXS%2F8Z0pN1JbYerndME%2B2VK3RosfdQJA%3D</OfferListingId>
<Price>
<Amount>1999</Amount>
<CurrencyCode>EUR</CurrencyCode>
<FormattedPrice>EUR 19,99</FormattedPrice>
</Price>
<Availability>Gewöhnlich versandfertig in 24 Stunden</Availability>
<AvailabilityAttributes>
<AvailabilityType>now</AvailabilityType>
<MinimumHours>0</MinimumHours>
<MaximumHours>0</MaximumHours>
</AvailabilityAttributes>
<IsEligibleForSuperSaverShipping>1</IsEligibleForSuperSaverShipping>
</OfferListing>
</Offer>
</Offers>
</Item>
</Items>
I want to get the ASIN element of every item. So I tried this:
from lxml import etree

doc = etree.fromstring(xmlstring)
items = doc.xpath('//Items/Item')
for a in items:
    asin = a.xpath('//ASIN/text()')
    print(asin)
What I get is this:
['3570102769', '3813506479']
['3570102769', '3813506479']
But I want this:
['3570102769']
['3813506479']
I don't understand what the problem is. I thought I was iterating over each Item element, and each Item contains exactly one ASIN. Why does it return both ASINs twice?
When you're searching for a.xpath('//ASIN/text()') you're searching the complete document tree again. Quoting from the XML Path language specification:
//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
So what you're doing is iterating over the matched Item nodes and saying "Give me all ASIN nodes in this document please". The context for this (the Item node) is ignored.
What you should do instead is select the ASIN child node directly. Keeping to your original implementation, this could look like this:
from lxml import etree

doc = etree.fromstring(xmlstring)
items = doc.xpath('//Items/Item')
for a in items:
    asin = a.xpath('ASIN/text()')
    print(asin)
which gives the output you desire:
['3570102769']
['3813506479']
Alternatively, if you're not certain where in the Item node your ASIN appears, you could use .//ASIN/text()
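To make the difference between the two relative forms concrete, here is a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml; ElementTree has no text() step, so the text is read from the matched elements instead. The Wrapper element in SAMPLE is hypothetical and only there to place one ASIN deeper than a direct child.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the Items document; <Wrapper> is invented
# to show an ASIN that is not a direct child of its Item.
SAMPLE = """<Items>
<Item><ASIN>3570102769</ASIN></Item>
<Item><Wrapper><ASIN>3813506479</ASIN></Wrapper></Item>
</Items>"""

root = ET.fromstring(SAMPLE)

# "ASIN" matches only direct children of the context Item.
direct = [[e.text for e in item.findall("ASIN")] for item in root.findall("Item")]

# ".//ASIN" matches ASIN elements at any depth below the context Item.
nested = [[e.text for e in item.findall(".//ASIN")] for item in root.findall("Item")]

print(direct)  # [['3570102769'], []]
print(nested)  # [['3570102769'], ['3813506479']]
```

Both paths stay inside the current Item, which is what the leading // in the question's expression did not do.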
I am trying to parse this xml file with python element tree:
<?xml version="1.0" encoding="Windows-1250"?>
<rsp:responsePack version="2.0" id="001" state="ok" note="" programVersion="9801.8 (19.5.2011)" xmlns:rsp="http://www.stormware.cz/schema/version_2/response.xsd" xmlns:rdc="http://www.stormware.cz/schema/version_2/documentresponse.xsd" xmlns:typ="http://www.stormware.cz/schema/version_2/type.xsd" xmlns:lst="http://www.stormware.cz/schema/version_2/list.xsd" xmlns:lStk="http://www.stormware.cz/schema/version_2/list_stock.xsd" xmlns:lAdb="http://www.stormware.cz/schema/version_2/list_addBook.xsd" xmlns:acu="http://www.stormware.cz/schema/version_2/accountingunit.xsd" xmlns:inv="http://www.stormware.cz/schema/version_2/invoice.xsd" xmlns:vch="http://www.stormware.cz/schema/version_2/voucher.xsd" xmlns:int="http://www.stormware.cz/schema/version_2/intDoc.xsd" xmlns:stk="http://www.stormware.cz/schema/version_2/stock.xsd" xmlns:ord="http://www.stormware.cz/schema/version_2/order.xsd" xmlns:ofr="http://www.stormware.cz/schema/version_2/offer.xsd" xmlns:enq="http://www.stormware.cz/schema/version_2/enquiry.xsd" xmlns:vyd="http://www.stormware.cz/schema/version_2/vydejka.xsd" xmlns:pri="http://www.stormware.cz/schema/version_2/prijemka.xsd" xmlns:bal="http://www.stormware.cz/schema/version_2/balance.xsd" xmlns:pre="http://www.stormware.cz/schema/version_2/prevodka.xsd" xmlns:vyr="http://www.stormware.cz/schema/version_2/vyroba.xsd" xmlns:pro="http://www.stormware.cz/schema/version_2/prodejka.xsd" xmlns:con="http://www.stormware.cz/schema/version_2/contract.xsd" xmlns:adb="http://www.stormware.cz/schema/version_2/addressbook.xsd" xmlns:prm="http://www.stormware.cz/schema/version_2/parameter.xsd" xmlns:lCon="http://www.stormware.cz/schema/version_2/list_contract.xsd" xmlns:ctg="http://www.stormware.cz/schema/version_2/category.xsd" xmlns:ipm="http://www.stormware.cz/schema/version_2/intParam.xsd">
<rsp:responsePackItem version="2.0" id="li1" state="ok">
<lst:listInvoice version="2.0" dateTimeStamp="2011-05-27T10:47:23Z" dateValidFrom="2011-05-27" state="ok">
<lst:invoice version="2.0">
<inv:invoiceHeader>
<inv:id>20</inv:id>
<inv:invoiceType>issuedInvoice</inv:invoiceType>
<inv:number>
<typ:id>26</typ:id>
<typ:ids>1101</typ:ids>
<typ:numberRequested>110100001</typ:numberRequested>
</inv:number>
<inv:symVar>110100001</inv:symVar>
<inv:date>2011-01-30</inv:date>
<inv:dateTax>2011-01-30</inv:dateTax>
<inv:dateAccounting>2011-01-30</inv:dateAccounting>
<inv:dateDue>2011-02-13</inv:dateDue>
<inv:accounting>
<typ:id>17</typ:id>
<typ:ids>3Fv</typ:ids>
</inv:accounting>
<inv:classificationVAT>
<typ:id>251</typ:id>
<typ:ids>UD</typ:ids>
<typ:classificationVATType/>
</inv:classificationVAT>
<inv:text>Fakturujeme Vám zboží dle Vaší objednávky: </inv:text>
<inv:partnerIdentity>
<typ:id>15</typ:id>
<typ:address>
<typ:company>INTEAK spol. s r. o.</typ:company>
<typ:division>prodejna</typ:division>
<typ:name>David Jánský</typ:name>
<typ:city>Benešovice</typ:city>
<typ:street>Jiřího z Poděbrad 35</typ:street>
<typ:zip>463 48</typ:zip>
<typ:ico>85236972</typ:ico>
<typ:dic>CZ85236972</typ:dic>
</typ:address>
<typ:shipToAddress>
<typ:company/>
<typ:division/>
<typ:name/>
<typ:city/>
<typ:street/>
</typ:shipToAddress>
</inv:partnerIdentity>
<inv:myIdentity>
<typ:address>
<typ:company>Novák </typ:company>
<typ:surname>Novák</typ:surname>
<typ:name>Jan</typ:name>
<typ:city>Jihlava 1</typ:city>
<typ:street>Horní</typ:street>
<typ:number>15</typ:number>
<typ:zip>586 01</typ:zip>
<typ:ico>12345678</typ:ico>
<typ:dic>CZ12345678</typ:dic>
<typ:phone>569 876 542</typ:phone>
<typ:mobilPhone>602 852 369</typ:mobilPhone>
<typ:fax>564 563 216</typ:fax>
<typ:email>info#novak.cz</typ:email>
<typ:www>www.novak.cz</typ:www>
</typ:address>
</inv:myIdentity>
<inv:dateOrder>2011-01-22</inv:dateOrder>
<inv:paymentType>
<typ:id>1</typ:id>
<typ:ids>příkazem</typ:ids>
<typ:paymentType>draft</typ:paymentType>
</inv:paymentType>
<inv:account>
<typ:id>2</typ:id>
<typ:ids>KB</typ:ids>
</inv:account>
<inv:symConst>0308</inv:symConst>
<inv:centre>
<typ:id>1</typ:id>
<typ:ids>BRNO</typ:ids>
</inv:centre>
<inv:activity>
<typ:id>2</typ:id>
<typ:ids>NÁBYTEK</typ:ids>
</inv:activity>
<inv:liquidation>
<typ:date>2011-02-12</typ:date>
<typ:amountHome>356</typ:amountHome>
</inv:liquidation>
</inv:invoiceHeader>
<inv:invoiceDetail>
<inv:invoiceItem>
<inv:id>19</inv:id>
<inv:text>Židle Z220</inv:text>
<inv:quantity>2.0</inv:quantity>
<inv:unit>ks</inv:unit>
<inv:coefficient>1.0</inv:coefficient>
<inv:rateVAT>high</inv:rateVAT>
<inv:discountPercentage>0.0</inv:discountPercentage>
<inv:homeCurrency>
<typ:unitPrice>1968</typ:unitPrice>
<typ:price>3936</typ:price>
<typ:priceVAT>787.2</typ:priceVAT>
<typ:priceSum>4723.2</typ:priceSum>
</inv:homeCurrency>
<inv:code>Z220</inv:code>
<inv:guarantee>0</inv:guarantee>
<inv:guaranteeType>none</inv:guaranteeType>
<inv:stockItem>
<typ:store>
<typ:id>1</typ:id>
<typ:ids>ZBOŽÍ</typ:ids>
</typ:store>
<typ:stockItem>
<typ:id>27</typ:id>
<typ:ids>Z220</typ:ids>
<typ:PLU>650</typ:PLU>
</typ:stockItem>
</inv:stockItem>
</inv:invoiceItem>
<inv:invoiceItem>
<inv:id>20</inv:id>
<inv:text>Konferenční stolek chrom</inv:text>
<inv:quantity>1.0</inv:quantity>
<inv:unit>ks</inv:unit>
<inv:coefficient>1.0</inv:coefficient>
<inv:rateVAT>high</inv:rateVAT>
<inv:discountPercentage>0.0</inv:discountPercentage>
<inv:homeCurrency>
<typ:unitPrice>7680</typ:unitPrice>
<typ:price>7680</typ:price>
<typ:priceVAT>1536</typ:priceVAT>
<typ:priceSum>9216</typ:priceSum>
</inv:homeCurrency>
<inv:note>Rozměr: 120 x 60</inv:note>
<inv:code>Konf11</inv:code>
<inv:guarantee>0</inv:guarantee>
<inv:guaranteeType>none</inv:guaranteeType>
<inv:stockItem>
<typ:store>
<typ:id>1</typ:id>
<typ:ids>ZBOŽÍ</typ:ids>
</typ:store>
<typ:stockItem>
<typ:id>10</typ:id>
<typ:ids>Konf11</typ:ids>
<typ:PLU>625</typ:PLU>
</typ:stockItem>
</inv:stockItem>
</inv:invoiceItem>
<inv:invoiceItem>
<inv:id>21</inv:id>
<inv:text>Křeslo čalouněné 1320</inv:text>
<inv:quantity>4.0</inv:quantity>
<inv:unit>ks</inv:unit>
<inv:coefficient>1.0</inv:coefficient>
<inv:rateVAT>high</inv:rateVAT>
<inv:discountPercentage>0.0</inv:discountPercentage>
<inv:homeCurrency>
<typ:unitPrice>5988</typ:unitPrice>
<typ:price>23952</typ:price>
<typ:priceVAT>4790.4</typ:priceVAT>
<typ:priceSum>28742.4</typ:priceSum>
</inv:homeCurrency>
<inv:code>Kř1320</inv:code>
<inv:guarantee>0</inv:guarantee>
<inv:guaranteeType>none</inv:guaranteeType>
<inv:stockItem>
<typ:store>
<typ:id>1</typ:id>
<typ:ids>ZBOŽÍ</typ:ids>
</typ:store>
<typ:stockItem>
<typ:id>13</typ:id>
<typ:ids>Kř1320</typ:ids>
<typ:PLU>627</typ:PLU>
</typ:stockItem>
</inv:stockItem>
</inv:invoiceItem>
</inv:invoiceDetail>
<inv:invoiceSummary>
<inv:roundingDocument>up2one</inv:roundingDocument>
<inv:roundingVAT>none</inv:roundingVAT>
<inv:homeCurrency>
<typ:priceNone>0</typ:priceNone>
<typ:priceLow>0</typ:priceLow>
<typ:priceLowVAT>0</typ:priceLowVAT>
<typ:priceLowSum>0</typ:priceLowSum>
<typ:priceHigh>35568</typ:priceHigh>
<typ:priceHighVAT>7113.6</typ:priceHighVAT>
<typ:priceHighSum>42681.6</typ:priceHighSum>
<typ:round>
<typ:priceRound>0.4</typ:priceRound>
</typ:round>
</inv:homeCurrency>
</inv:invoiceSummary>
</lst:invoice>
</lst:listInvoice>
</rsp:responsePackItem>
</rsp:responsePack>
How can I get data such as the following?
inv:invoiceSummary - typ:priceHighSum
inv:partnerIdentity - typ:name, typ:ico
inv:myIdentity - typ:company
inv:liquidation - typ:date
I tried this but can't get it working:
import xml.etree.ElementTree as ET
tree = ET.parse('temp_xml2.xml')
root = tree.getroot()
for listInvoice in root.findall('listInvoice'):
    invoiceHeader = listInvoice.find('invoiceHeader').text
    print(invoiceHeader)
Try using jsoup. A related example is parsing XML.
The elements in this file are namespace-qualified, so each tag has to be written with its namespace URI in braces. This works:
for invoiceHeader in root.findall('.//{http://www.stormware.cz/schema/version_2/invoice.xsd}invoiceHeader'):
    invoice_id = invoiceHeader.find('.//{http://www.stormware.cz/schema/version_2/invoice.xsd}id').text
    print(invoice_id)
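For the specific fields asked about, the same idea can be written more compactly by passing a prefix-to-URI mapping to find() and iterfind(). The following is only a sketch: SAMPLE is a trimmed-down stand-in for temp_xml2.xml, and NS, text_of and extract_invoices are names made up for this illustration. The element paths are taken from the XML in the question.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI, copied from the xmlns declarations in the question.
NS = {
    "rsp": "http://www.stormware.cz/schema/version_2/response.xsd",
    "lst": "http://www.stormware.cz/schema/version_2/list.xsd",
    "inv": "http://www.stormware.cz/schema/version_2/invoice.xsd",
    "typ": "http://www.stormware.cz/schema/version_2/type.xsd",
}

# Hypothetical, shortened copy of the response; the real file has many more fields.
SAMPLE = """\
<rsp:responsePack xmlns:rsp="http://www.stormware.cz/schema/version_2/response.xsd"
                  xmlns:lst="http://www.stormware.cz/schema/version_2/list.xsd"
                  xmlns:inv="http://www.stormware.cz/schema/version_2/invoice.xsd"
                  xmlns:typ="http://www.stormware.cz/schema/version_2/type.xsd">
  <rsp:responsePackItem>
    <lst:listInvoice>
      <lst:invoice>
        <inv:invoiceHeader>
          <inv:partnerIdentity>
            <typ:address>
              <typ:name>David Jánský</typ:name>
              <typ:ico>85236972</typ:ico>
            </typ:address>
          </inv:partnerIdentity>
          <inv:myIdentity>
            <typ:address>
              <typ:company>Novák </typ:company>
            </typ:address>
          </inv:myIdentity>
          <inv:liquidation>
            <typ:date>2011-02-12</typ:date>
          </inv:liquidation>
        </inv:invoiceHeader>
        <inv:invoiceSummary>
          <inv:homeCurrency>
            <typ:priceHighSum>42681.6</typ:priceHighSum>
          </inv:homeCurrency>
        </inv:invoiceSummary>
      </lst:invoice>
    </lst:listInvoice>
  </rsp:responsePackItem>
</rsp:responsePack>
"""


def text_of(node, path):
    """Return the stripped text at `path` below `node`, or None if absent or empty."""
    found = node.find(path, NS)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def extract_invoices(root):
    """Collect the requested fields from every lst:invoice element."""
    header = "inv:invoiceHeader/"
    rows = []
    for invoice in root.iterfind(".//lst:invoice", NS):
        rows.append({
            "priceHighSum": text_of(
                invoice, "inv:invoiceSummary/inv:homeCurrency/typ:priceHighSum"),
            "partner_name": text_of(
                invoice, header + "inv:partnerIdentity/typ:address/typ:name"),
            "partner_ico": text_of(
                invoice, header + "inv:partnerIdentity/typ:address/typ:ico"),
            "my_company": text_of(
                invoice, header + "inv:myIdentity/typ:address/typ:company"),
            "liquidation_date": text_of(
                invoice, header + "inv:liquidation/typ:date"),
        })
    return rows


# For the real file, ET.parse('temp_xml2.xml').getroot() would replace this line.
root = ET.fromstring(SAMPLE)
invoices = extract_invoices(root)
for row in invoices:
    print(row)
```

The prefixes in NS only need to match the ones used in the path strings, not the ones in the file; what matters is that the URIs are identical to those declared in the document.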