Problem on using lxml with tostring and pretty_print - python

I have read some of the answers to related questions, but none of them deals directly with lxml's tostring and pretty_print.
I am using lxml to create an XML file on Python 3.6.
The problem is that the elements are not placed on separate lines and indented under their parent element, and I believe it is related to the "pretty_print" option.
What I need to achieve is:
<root>
<element1></element1>
<element2></element2>
<child1></child1>
<child2></child2>
</root>
The result I get is:
<root><element1></element1><element2></element2><child1></child1><child2></child2></root>
Part of the code I am using:
from lxml import etree as et
CompanyID = "Company Identification"
TaxRegistrationNumber = "Company Reg. Number"
TaxAccountingBasis = "File Tipe"
CompanyName = "Company Name"
BusinessName = "Business Name"
root = et.Element("root")
header = et.SubElement(root, 'Header')
header.tail = '\n'
data = (
    ('CompanyID', str(CompanyID)),
    ('TaxRegistrationNumber', str(TaxRegistrationNumber)),
    ('TaxAccountingBasis', str(TaxAccountingBasis)),
    ('CompanyName', str(CompanyName)),
    ('BusinessName', str(BusinessName)),
)
for tag, value in data:
    if value is None:
        continue
    et.SubElement(header, tag).text = value
xml_txt = et.tostring(root, pretty_print=True, encoding="UTF-8")
print(xml_txt)
If I create the elements without any text in them, "pretty_print" works fine.
If I add text to each of the elements (using the variables above), "pretty_print" no longer works and the structure is flattened onto a single line.
What could be wrong?

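I found it.
I removed the line "header.tail = '\n'" from the code and it works now. Apparently, once Header has a tail, root contains text of its own, and lxml's pretty printer only adds indentation where doing so does not alter existing text content, so the tree was left unformatted.
root = et.Element("root")
header = et.SubElement(root, 'Header')
# header.tail = '\n'
Thank you all.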
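To see what the removed line actually did, here is a minimal sketch using the standard library's xml.etree.ElementTree, which shares lxml's text/tail model. It is an illustration only, not the asker's code, and it does not reproduce lxml's pretty printer; it just shows where the tail ends up in the serialized output.

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
header = ET.SubElement(root, "Header")
ET.SubElement(header, "CompanyID").text = "Company Identification"

# tail is the text that follows an element's closing tag, inside its parent.
# Setting it here places a newline between </Header> and </root>.
header.tail = "\n"

serialized = ET.tostring(root, encoding="unicode")
print(repr(serialized))
```

The newline becomes part of the content of root, which is the kind of pre-existing text a conservative pretty printer avoids modifying.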

Related

How to access UBL 2.1 xml tag using python

I need to access the tags in a UBL 2.1 document and modify them depending on user input in Python.
So I used the ElementTree library to access the tags and modify them.
Here is a sample of the xml code:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
The issue:
I want to access the tags, but they are not modified and the loop body is never entered.
I tried both ways:
# Attempt 1
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.find({xmlns:ns1=urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate}"):
    x.text = '1999'
mytree.write('test.xml')
# Attempt 2
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.iter('./Invoice/AllowanceCharge/ChargeIndicator'):
    x.text = str('true')
mytree.write('test.xml')
Neither of them worked or modified the tag.
So the question is: how can I reach a specific tag and modify it?
If you correct the namespace and the brackets in your for loop, it works for valid XML like the following (the root tag must be closed!):
Input:
<?xml version="1.0" encoding="utf-8"?>
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
</ns0:Invoice>
Your repaired code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for elem in root.findall("{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate"):
    elem.text = '1999'
tree.write('test_changed.xml', encoding='utf-8', xml_declaration=True)
ET.dump(root)
Output:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>1999</ns1:IssueDate>
</ns0:Invoice>
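As an alternative to writing the full `{uri}tag` form each time, ElementTree also accepts a prefix-to-URI mapping. The sketch below is a variation on the repaired code, not a replacement for it; the prefix `cbc` and the inline sample string are assumptions chosen for illustration, and only the namespace URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the invoice above (inline so the sketch is self-contained).
XML = """<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<ns1:ID>0</ns1:ID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
</ns0:Invoice>"""

# The prefix "cbc" is arbitrary; it does not need to match "ns1" in the file.
NS = {"cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"}

root = ET.fromstring(XML)
for elem in root.findall("cbc:IssueDate", NS):
    elem.text = "1999"

issue_date = root.find("cbc:IssueDate", NS).text
print(issue_date)
```

Note that `findall` returns a list that is safe to loop over even when nothing matches, whereas `find` returns a single element or None.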

Finding element in xml with python

I am trying to parse XML before converting its content into lists and then into CSV. Unfortunately, I think my search terms for finding the initial element are failing, which causes the subsequent searches further down the hierarchy to fail as well. I am new to XML, so I've tried variations on namespace dictionaries and including the namespace references... The simplified XML is given below:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
The code I am using to try to extract the com:ChangedBy element is below:
tree = ET.parse(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
root = tree.getroot()
#get at the tags and their data
#for elem in tree.iter():
#    print(f"this the tag {elem.tag} and this is the data: {elem.text}")
#open file for writing
station_data = open(rootfilepath + 'station_data.csv','w')
csvwriter = csv.writer(station_data)
station_head = []
count = 0
#inspiration for this code: http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-xml-to-csv-using-python/
#this is where it goes wrong; some combination of the namespace and the tag can't find anything in line 27, 'StationList'
for member in root.findall('{http://nationalrail.co.uk/xml/station}Station'):
    station = []
    if count == 0:
        changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').tag
        station_head.append(changedby)
        name = member.find('{http://nationalrail.co.uk/xml/station}Name').tag
        station_head.append(name)
        count = count+1
    changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').text
    station.append(changedby)
    name = member.find('{http://nationalrail.co.uk/xml/station}Name').text
    station.append(name)
    csvwriter.writerow(station)
I have tried:
- using dictionaries of namespaces, but that results in nothing being found at all
- using hard-coded namespaces, but that results in "AttributeError: 'NoneType' object has no attribute 'tag'"
Thanks in advance for any and all assistance.
First of all, your XML is invalid (the closing </StationList> tag is missing at the end of the file).
Assuming you have a valid XML file:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
</StationList>
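Then you can convert your XML to a dictionary with xmltodict and simply look up the required value:
import xmltodict
with open('file.xml', 'r') as f:
    data = xmltodict.parse(f.read())
# With several <Station> elements, data['StationList']['Station'] is a list instead of a single dict.
changed_by = data['StationList']['Station']['ChangeHistory']['com:ChangedBy']
print(changed_by)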
Output:
spascos
Try lxml:
#!/usr/bin/env python3
from lxml import etree
ns = {"com": "http://nationalrail.co.uk/xml/common"}
with open("so.xml") as f:
    tree = etree.parse(f)
for t in tree.xpath("//com:ChangedBy/text()", namespaces=ns):
    print(t)
Output:
spascos
You can use BeautifulSoup, which is an HTML and XML parser:
from bs4 import BeautifulSoup
with open(rootfilepath + "NRE_Station_Dataset_2019_raw.xml") as fd:
    soup = BeautifulSoup(fd, 'lxml-xml')
for i in soup.find_all('ChangeHistory'):
    print(i.ChangedBy.text)
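For completeness, the original ElementTree approach can also be made to work. The likely cause of the AttributeError is that `find` with a bare tag only looks at direct children, and ChangedBy sits inside ChangeHistory, not directly under Station. The following is a minimal sketch under that assumption; the prefix `st` and the inline sample are illustrative choices, not part of the question.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the sample document (inline for a self-contained sketch).
XML = """<StationList xmlns:com="http://nationalrail.co.uk/xml/common" xmlns="http://nationalrail.co.uk/xml/station">
<Station>
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
</ChangeHistory>
<Name>Aber</Name>
</Station>
</StationList>"""

# "st" is a made-up prefix for the default namespace; ElementTree needs some
# prefix in the mapping because the path syntax has no notion of "default".
NS = {
    "st": "http://nationalrail.co.uk/xml/station",
    "com": "http://nationalrail.co.uk/xml/common",
}

root = ET.fromstring(XML)
rows = []
for station in root.findall("st:Station", NS):
    # ChangedBy is a grandchild of Station, so the path must go through ChangeHistory.
    changed_by = station.findtext("st:ChangeHistory/com:ChangedBy", namespaces=NS)
    name = station.findtext("st:Name", namespaces=NS)
    rows.append([changed_by, name])

print(rows)
```

Each entry in `rows` can then be passed to `csvwriter.writerow` as in the question. `findtext` returns None instead of raising when an element is missing, which avoids the AttributeError for stations that lack a ChangeHistory.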

how to access xml attributes in python

A sample of my XML file is given below. I want to access the text "The bread is top notch as well." and the category "food".
<sentences>
<sentence id="32897564#894393#2">
<text>The bread is top notch as well.</text>
<aspectTerms>
<aspectTerm term="bread" polarity="positive" from="4" to="9"/>
</aspectTerms>
<aspectCategories>
<aspectCategory category="food" polarity="positive" />
</aspectCategories>
</sentence>
My code is:
test_text_file=open('Restaurants_Test_Gold.txt', 'rt')
test_text_file1=test_text_file.read()
root = ET.fromstring(test_text_file1)
for page in list(root):
    text = page.find('text').text
    Category = page.find('aspectCategory')
    print ('sentence: %s; category: %s' % (text,Category))
test_text_file.close()
It depends on how complicated your XML format is. The easiest way is to address the element by its path directly.
import xml.etree.ElementTree as ET
tree = ET.parse('x.xml')
root = tree.getroot()
print(root.find('.//text').text)
print(root.find('.//aspectCategory').attrib['category'])
But if there are similar tags elsewhere, you might want to use a longer path such as .//aspectCategories/aspectCategory instead.
Here is my code that solves your problem:
import os
import xml.etree.ElementTree as ET
basedir = os.path.abspath(os.path.dirname(__file__))
filenamepath = os.path.join(basedir, 'Restaurants_Test_Gold.txt')
test_text_file = open(filenamepath, 'r')
file_contents = test_text_file.read()
tree = ET.fromstring(file_contents)
for sentence in list(tree):
    sentence_items = list(sentence.iter())
    # remove first element because it's the sentence element [<sentence>] itself
    sentence_items = sentence_items[1:]
    for item in sentence_items:
        if item.tag == 'text':
            print(item.text)
        elif item.tag == 'aspectCategories':
            category = item.find('aspectCategory')
            print(category.attrib.get('category'))
test_text_file.close()
Hope it helps
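A more compact variant is sketched below. In the question's code, `page.find('aspectCategory')` returns None because `find` with a bare tag only searches direct children of `<sentence>`, and even a found element would need `.get('category')` to read the attribute. The inline sample (with the closing `</sentences>` tag added) and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<sentences>
<sentence id="32897564#894393#2">
<text>The bread is top notch as well.</text>
<aspectTerms>
<aspectTerm term="bread" polarity="positive" from="4" to="9"/>
</aspectTerms>
<aspectCategories>
<aspectCategory category="food" polarity="positive" />
</aspectCategories>
</sentence>
</sentences>"""

root = ET.fromstring(XML)
pairs = []
for sentence in root.findall("sentence"):
    text = sentence.findtext("text")
    # A sentence may carry several categories, so collect one pair per category.
    for cat in sentence.findall("aspectCategories/aspectCategory"):
        pairs.append((text, cat.get("category")))

for text, category in pairs:
    print("sentence: %s; category: %s" % (text, category))
```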

Python print to file as XML

In the following function, I want to display the items of a nested dictionary as an XML tree and print it to a file.
def printToFile(self):
    from lxml import etree as ET
    for k,v in self.wordCount.items():
        root = ET.Element(k)
        tree = ET.ElementTree(root)
        for k1,v1 in v.items():
            DocID = ET.SubElement(root, 'DocID')
            DocID.text = str(k1)
            Occurences = ET.SubElement(root, 'Occurences')
            Occurences.text = str(v1)
        print ET.tostring(root, pretty_print=True, xml_declaration=False)
        tree.write('output.xml', pretty_print=True, xml_declaration=False)
When I run the code, all the items are shown in the console, but only the last item ends up in the file.
In the console, I got this:
<weather>
<DocID>1</DocID>
<Occurences>1</Occurences>
</weather>
<london>
<DocID>1</DocID>
<Occurences>1</Occurences>
<DocID>2</DocID>
<Occurences>2</Occurences>
<DocID>3</DocID>
<Occurences>1</Occurences>
</london>
<expens>
<DocID>2</DocID>
<Occurences>1</Occurences>
</expens>
<nice>
<DocID>3</DocID>
<Occurences>1</Occurences>
</nice>
but when I open the file, I only got this:
<nice>
<DocID>3</DocID>
<Occurences>1</Occurences>
</nice>
Can someone help me solve this issue? Thanks.
Based on the previous comments, I changed my function as follows and it worked:
def printToFile(self):
    from lxml import etree as ET
    with open('output.xml','a') as file:
        for k,v in self.wordCount.items():
            root = ET.Element(k)
            for k1,v1 in v.items():
                DocID = ET.SubElement(root, 'DocID')
                DocID.text = str(k1)
                Occurences = ET.SubElement(root, 'Occurences')
                Occurences.text = str(v1)
            #print ET.tostring(root, pretty_print=True, xml_declaration=False)
            file.write(ET.tostring(root, pretty_print=True, xml_declaration=False))
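One caveat with the appending approach is that the file ends up with several top-level elements, which is not a well-formed XML document, and running the function twice appends the data again. A possible alternative, sketched here with the standard library and not taken from the thread, is to hang every word under one wrapper element and write the tree once. The wrapper name `words`, the function name `build_tree`, and the sample dictionary are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like self.wordCount: {word: {doc_id: occurrences}}
word_count = {"weather": {1: 1}, "london": {1: 1, 2: 2, 3: 1}}

def build_tree(word_count):
    root = ET.Element("words")  # hypothetical wrapper element
    for word, docs in word_count.items():
        word_elem = ET.SubElement(root, word)
        for doc_id, occurrences in docs.items():
            ET.SubElement(word_elem, "DocID").text = str(doc_id)
            ET.SubElement(word_elem, "Occurences").text = str(occurrences)
    return ET.ElementTree(root)

tree = build_tree(word_count)
xml_text = ET.tostring(tree.getroot(), encoding="unicode")
print(xml_text)

# To write the file once, after the loop has finished:
# tree.write("output.xml", encoding="utf-8", xml_declaration=True)
```

Because the write happens after all elements have been added, nothing is overwritten and the file stays a single document.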

Generating XML files Using CSV

I have a question about formatting XML files after generating them. Here is my code:
import csv
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from xml.etree.ElementTree import ElementTree
import xml.etree.ElementTree as etree
root = Element('Solution')
root.set('version','1.0')
tree = ElementTree(root)
head = SubElement(root, 'DrillHoles')
head.set('total_holes', '238')
description = SubElement(head,'description')
with open ('1250_12.csv', 'r') as data:
    current_group = None
    reader = csv.reader(data)
    i = 0
    for row in reader:
        if i > 0:
            x1,y1,z1,x2,y2,z2,cost = row
            if current_group is None or i != current_group.text:
                current_group = SubElement(description, 'hole',{'hole_id':"%s"%i})
            information = SubElement (current_group, 'hole',{'collar':', '.join((x1,y1,z1)),
                                                             'toe':', '.join((x2,y2,z2)),
                                                             'cost': cost})
        i+=1
Which produces the following xml file:
<?xml version="1.0"?>
-<Solution version="1.0">
-<DrillHoles total_holes="238">
-<description>
-<hole hole_id="1">
<hole toe="5797.82, 3061.01, 2576.29" cost="102.12" collar="5720.44, 3070.94, 2642.19"/></hole>
That is just part of the XML file, but it is enough to illustrate the problem.
There are several things I would like to change. First, I would like toe, cost and collar to be on separate lines, like so:
<collar>0,-150,0</collar>
<toe>69.9891,-18.731,-19.2345</toe>
<cost>15</cost>
I would also like them to appear in the order shown above: collar, then toe, then cost.
Furthermore, the XML file displays 'hole toe="5797.82, 3061.01, 2576.29"'. How do I get rid of the inner "hole"? That is about it. I am really new to Python, so go easy on me.
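One way to get this layout, sketched here under some assumptions, is to create collar, toe and cost as child elements of each hole instead of attributes of a second, nested hole element. Child elements are serialized in the order they are created, which gives the requested ordering, and dropping the inner SubElement removes the extra "hole". The inline CSV text, including its header row, is hypothetical sample data standing in for 1250_12.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical one-row sample; the real script would open '1250_12.csv' instead.
CSV_TEXT = "x1,y1,z1,x2,y2,z2,cost\n0,-150,0,69.9891,-18.731,-19.2345,15\n"

root = ET.Element("Solution", {"version": "1.0"})
drill_holes = ET.SubElement(root, "DrillHoles")
description = ET.SubElement(drill_holes, "description")

reader = csv.reader(io.StringIO(CSV_TEXT))
next(reader)  # skip the header row
for hole_id, row in enumerate(reader, start=1):
    x1, y1, z1, x2, y2, z2, cost = row
    hole = ET.SubElement(description, "hole", {"hole_id": str(hole_id)})
    # Children appear in creation order: collar, then toe, then cost.
    ET.SubElement(hole, "collar").text = ",".join((x1, y1, z1))
    ET.SubElement(hole, "toe").text = ",".join((x2, y2, z2))
    ET.SubElement(hole, "cost").text = cost

# Count the holes actually written instead of hard-coding the total.
drill_holes.set("total_holes", str(len(description)))

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The output above is a single line. On Python 3.9 or later, calling `ET.indent(root)` before serializing puts each element on its own line.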
