Problem on using lxml with tostring and pretty_print

Problem on using lxml with tostring and pretty_print - python

I have read some of the answers for related questions, but none of them is directly related with lxml tostring and pretty_print.
I am using lxml and trying to create a xml file on Python 3.6.
The problem I found is that elements are not wrapped and ordered by parent element and believe it is related with the "pretty_print" option.
What I need to achieve is:
<root>
<element1></element1>
<element2></element2>
<child1></child1>
<child2></child2>
</root>
The result I get is:
<root><element1></element1><element2></element2><child1></child1><child2></child2></root>
Part of the code I am using:
from lxml import etree as et
CompanyID = "Company Identification"
TaxRegistrationNumber = "Company Reg. Number"
TaxAccountingBasis = "File Tipe"
CompanyName = "Company Name"
BusinessName = "Business Name"
root = et.Element("root")
header = et.SubElement(root, 'Header')
header.tail = '\n'
data = (
('CompanyID', str(CompanyID)),
('TaxRegistrationNumber', str(TaxRegistrationNumber)),
('TaxAccountingBasis', str(TaxAccountingBasis)),
('CompanyName', str(CompanyName)),
('BusinessName', str(BusinessName)),
)
for tag, value in data:
if value is None :
continue
et.SubElement(header, tag).text=value
xml_txt = et.tostring(root, pretty_print=True, encoding="UTF-8")
print(xml_txt)
If I print the elements with no data into it, it works fine and the "pretty_print" works fine.
If I add data to each of the elements (using the above variables), the "pretty_print" does not work and the structure gets messed up.
What could be wrong?

I found it.
I have removed the "header.tail = '\n'" from the code and it's working now.
root = et.Element("root")
header = et.SubElement(root, 'Header')
#header.tail = '\n'
Thank you all

Related

How to access UBL 2.1 xml tag using python

I need to access the tags in UBL 2.1 and modify them depend on the on the user input on python.
So, I used the ElementTree library to access the tags and modify them.
Here is a sample of the xml code:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
The issue :
I want to access the tags but it is doesn't modifed and enter the loop
I tried both ways:
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.find({xmlns:ns1=urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate}"):
x.text = '1999'
mytree.write('test.xml')
mytree = ET.parse('test.xml')
myroot = mytree.getroot()
for x in myroot.iter('./Invoice/AllowanceCharge/ChargeIndicator'):
x.text = str('true')
mytree.write('test.xml')
None of them worked and modify the tag.
So the questions is : How can I reach the specific tag and modify it?

If you correct the namespace and the brakets in your for loop it works for a valid XML like (root tag must be closed!):
Input:
<?xml version="1.0" encoding="utf-8"?>
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:ns2="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>2022-11-05</ns1:IssueDate>
</ns0:Invoice>
Your repaired code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for elem in root.findall("{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate"):
elem.text = '1999'
tree.write('test_changed.xml', encoding='utf-8', xml_declaration=True)
ET.dump(root)
Output:
<ns0:Invoice xmlns:ns0="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:ns1="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<ns1:ProfileID>reporting:1.0</ns1:ProfileID>
<ns1:ID>0</ns1:ID>
<ns1:UUID>dbdf65eb-5d66-47e6-bb0c-a84bbf7baa30</ns1:UUID>
<ns1:IssueDate>1999</ns1:IssueDate>
</ns0:Invoice>

Finding element in xml with python

I am trying to parse XML before converting it's content into lists and then into CSV. Unfortunately, I think my search terms for finding the initial element are failing, causing subsequent searches further down the hierarchy. I am new to XML, so I've tried variations on namespace dictionaries and including the namespace references... The simplified XML is given below:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
The Code I am using to try to extract the com/...xml/station / ChangedBy element is below
tree = ET.parse(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
root = tree.getroot()
#get at the tags and their data
#for elem in tree.iter():
# print(f"this the tag {elem.tag} and this is the data: {elem.text}")
#open file for writing
station_data = open(rootfilepath + 'station_data.csv','w')
csvwriter = csv.writer(station_data)
station_head = []
count = 0
#inspiration for this code: http://blog.appliedinformaticsinc.com/how-to- parse-and-convert-xml-to-csv-using-python/
#this is where it goes wrong; some combination of the namespace and the tag can't find anything in line 27, 'StationList'
for member in root.findall('{http://nationalrail.co.uk/xml/station}Station'):
station = []
if count == 0:
changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').tag
station_head.append(changedby)
name = member.find('{http://nationalrail.co.uk/xml/station}Name').tag
station_head.append(name)
count = count+1
changedby = member.find('{http://nationalrail.co.uk/xml/common}ChangedBy').text
station.append(changedby)
name = member.find('{http://nationalrail.co.uk/xml/station}Name').text
station.append(name)
csvwriter.writerow(station)
I have tried:
using dictionaries of namespaces but that results in nothing being found at all
using hard coded namespaces but that results in "Attribute Error: 'NoneType' object has no attribute 'tag'
Thanks in advance for all and any assistance.

First of all your XML is invalid (</StationList> is absent at the end of a file).
Assuming you have valid XML file:
<?xml version="1.0" encoding="utf-8"?>
<StationList xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:add="http://www.govtalk.gov.uk/people/AddressAndPersonalDetails"
xmlns:com="http://nationalrail.co.uk/xml/common" xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd"
xmlns="http://nationalrail.co.uk/xml/station">
<Station xsi:schemaLocation="http://internal.nationalrail.co.uk/xml/XsdSchemas/External/Version4.0/nre-station-v4-0.xsd">
<ChangeHistory>
<com:ChangedBy>spascos</com:ChangedBy>
<com:LastChangedDate>2018-11-07T00:00:00.000Z</com:LastChangedDate>
</ChangeHistory>
<Name>Aber</Name>
</Station>
</StationList>
Then you can convert your XML to JSON and simply address to the required value:
import xmltodict
with open('file.xml', 'r') as f:
data = xmltodict.parse(f.read())
changed_by = data['StationList']['Station']['ChangeHistory']['com:ChangedBy']
Output:
spascos

Try lxml:
#!/usr/bin/env python3
from lxml import etree
ns = {"com": "http://nationalrail.co.uk/xml/common"}
with open("so.xml") as f:
tree = etree.parse(f)
for t in tree.xpath("//com:ChangedBy/text()", namespaces=ns):
print(t)
Output:
spascos

You can use Beautifulsoup which is an html and xml parser
from bs4 import BeautifulSoup
fd = open(rootfilepath + "NRE_Station_Dataset_2019_raw.xml")
soup = BeautifulSoup(fd,'lxml-xml')
for i in soup.findAll('ChangeHistory'):
print(i.ChangedBy.text)

how to access xml attributes in python

My XML file sample is given below and I want to access text "The bread is top notch as well" and category "food".
<sentences>
<sentence id="32897564#894393#2">
<text>The bread is top notch as well.</text>
<aspectTerms>
<aspectTerm term="bread" polarity="positive" from="4" to="9"/>
</aspectTerms>
<aspectCategories>
<aspectCategory category="food" polarity="positive" />
</aspectCategories>
</sentence>
my code is
test_text_file=open('Restaurants_Test_Gold.txt', 'rt')
test_text_file1=test_text_file.read()
root = ET.fromstring(test_text_file1)
for page in list(root):
text = page.find('text').text
Category = page.find('aspectCategory')
print ('sentence: %s; category: %s' % (text,Category))
test_text_file.close()

It's depending on how complicated your XML format is. The easiest way is to access the path directly.
import xml.etree.ElementTree as ET
tree = ET.parse('x.xml')
root = tree.getroot()
print(root.find('.//text').text)
print(root.find('.//aspectCategory').attrib['category'])
But if there are similar tags, you might want to use longer path like .//aspectCategories/aspectCategory instead.

Here is my code solving your problem
import os
import xml.etree.ElementTree as ET
basedir = os.path.abspath(os.path.dirname(__file__))
filenamepath = os.path.join(basedir, 'Restaurants_Test_Gold.txt')
test_text_file = open(filenamepath, 'r')
file_contents = test_text_file.read()
tree = ET.fromstring(file_contents)
for sentence in list(tree):
sentence_items = list(sentence.iter())
# remove first element because it's the sentence element [<sentence>] itself
sentence_items = sentence_items[1:]
for item in sentence_items:
if item.tag == 'text':
print(item.text)
elif item.tag == 'aspectCategories':
category = item.find('aspectCategory')
print(category.attrib.get('category'))
test_text_file.close()
Hope it helps

Python print to file as XML

In the following function, I want to display the items of an embedded dictionary as XML tree and print it to a file.
def printToFile(self):
from lxml import etree as ET
for k,v in self.wordCount.items():
root = ET.Element(k)
tree = ET.ElementTree(root)
for k1,v1 in v.items():
DocID = ET.SubElement(root, 'DocID')
DocID.text = str(k1)
Occurences = ET.SubElement(root, 'Occurences')
Occurences.text = str(v1)
print ET.tostring(root, pretty_print=True, xml_declaration=False)
tree.write('output.xml', pretty_print=True, xml_declaration=False)
When I run the code, all the items are shown in the console screen but the problem is that it only prints the last item in the file.
In the console, I got this:
<weather>
<DocID>1</DocID>
<Occurences>1</Occurences>
</weather>
<london>
<DocID>1</DocID>
<Occurences>1</Occurences>
<DocID>2</DocID>
<Occurences>2</Occurences>
<DocID>3</DocID>
<Occurences>1</Occurences>
</london>
<expens>
<DocID>2</DocID>
<Occurences>1</Occurences>
</expens>
<nice>
<DocID>3</DocID>
<Occurences>1</Occurences>
</nice>
but when I open the file, I only got this:
<nice>
<DocID>3</DocID>
<Occurences>1</Occurences>
</nice>
Can someone help me solving this issue. Thanks

Based on the previous comments, I changed my function as follow and it worked:
def printToFile(self):
from lxml import etree as ET
with open('output.xml','a') as file:
for k,v in self.wordCount.items():
root = ET.Element(k)
for k1,v1 in v.items():
DocID = ET.SubElement(root, 'DocID')
DocID.text = str(k1)
Occurences = ET.SubElement(root, 'Occurences')
Occurences.text = str(v1)
//print ET.tostring(root, pretty_print=True, xml_declaration=False)
file.write(ET.tostring(root, pretty_print=True, xml_declaration=False))

Generating XML files Using CSV

i have a questiion about formatting xml files after generating them. Here is my code:
import csv
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from xml.etree.ElementTree import ElementTree
import xml.etree.ElementTree as etree
root = Element('Solution')
root.set('version','1.0')
tree = ElementTree(root)
head = SubElement(root, 'DrillHoles')
head.set('total_holes', '238')
description = SubElement(head,'description')
with open ('1250_12.csv', 'r') as data:
current_group = None
reader = csv.reader(data)
i = 0
for row in reader:
if i > 0:
x1,y1,z1,x2,y2,z2,cost = row
if current_group is None or i != current_group.text:
current_group = SubElement(description, 'hole',{'hole_id':"%s"%i})
information = SubElement (current_group, 'hole',{'collar':', '.join((x1,y1,z1)),
'toe':', '.join((x2,y2,z2)),
'cost': cost})
i+=1
Which produces the following xml file:
<?xml version="1.0"?>
-<Solution version="1.0">
-<DrillHoles total_holes="238">
-<description>
-<hole hole_id="1">
<hole toe="5797.82, 3061.01, 2576.29" cost="102.12" collar="5720.44, 3070.94, 2642.19"/></hole>
that is just a part of the xml file but it is enough to serve this purpose.
There are many things i would like to change, first is i would like the toe,cost, and collar to be on different lines like so:
<collar>0,-150,0</collar>
<toe>69.9891,-18.731,-19.2345</toe>
<cost>15</cost>
and i would like it to be in the order of collar then toe then cost shown above.
Furthermore, in the xml file it displays : "hole toe ="5797.82, 3061.01, 2576.29", how do i get rid of the hole? Yea thats about it, i am really new to this python thing so go easy on me. haha

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Problem on using lxml with tostring and pretty_print - python

I found it. I have removed the "header.tail = '\n'" from the code and it's working now. root = et.Element("root") header = et.SubElement(root, 'Header') #header.tail = '\n' Thank you all

Related

How to access UBL 2.1 xml tag using python

Finding element in xml with python

how to access xml attributes in python

Python print to file as XML

Generating XML files Using CSV

Categories

Resources