Outputting child nodes to CSV with Python

Edit: I've replaced the example XML with real data and provided my code at the bottom.
I have several xml-files containing from 1 to 10+ lines of the following data:
<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:cec="urn:oasis:names:specification:ubl:schema:xsd:CommonExtensionComponents-2" xmlns:soapenv="http://www.w3.org/2003/05/soap-envelope" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:xenc="http://www.w3.org/2001/04/xmlenc#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2 UBL-Invoice-2.0.xsd">
  <cac:LegalMonetaryTotal>
    <cbc:PayableAmount currencyID="DKK">2586.61</cbc:PayableAmount>
  </cac:LegalMonetaryTotal>
  <cac:InvoiceLine>
    <cbc:ID>1</cbc:ID>
    <cbc:InvoicedQuantity unitCode="HUR">1.50</cbc:InvoicedQuantity>
    <cbc:LineExtensionAmount currencyID="DKK">1633.65</cbc:LineExtensionAmount>
  </cac:InvoiceLine>
  <cac:InvoiceLine>
    <cbc:ID>2</cbc:ID>
    <cbc:InvoicedQuantity unitCode="HUR">1.00</cbc:InvoicedQuantity>
    <cbc:LineExtensionAmount currencyID="DKK">952.96</cbc:LineExtensionAmount>
  </cac:InvoiceLine>
</Invoice>
And I want to output the data to a CSV-file in the following structure:
filename,lineId,lineQuantity,lineAmount,payableAmount
file1,1,1.50,1633.65,2586.61
file1,2,1.00,952.96,2586.61
file2,.,.,.
...where there's a row for each line per file coupled with the filename and total amount.
This is my code:
from os import listdir, path, walk
import xml.etree.ElementTree as ET
import csv
def invoicelines(self):
    filename = path.splitext(path.split(file)[1])[0]
    lineId = root.find('./InvoiceLine/ID').text
    lineQuantity = root.find('./InvoiceLine/InvoicedQuantity').text
    lineAmount = root.find('./InvoiceLine/LineExtensionAmount').text
    payableAmount = root.find('./LegalMonetaryTotal/PayableAmount').text
    row = [
        filename,
        lineId,
        lineQuantity,
        lineAmount,
        payableAmount
    ]
    return row
csvfile = 'output.csv'
def csv_write_header(csvfile):
    with open(csvfile, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow([
            'filename',
            'lineId',
            'lineQuantity',
            'lineAmount',
            'payableAmount'
        ])
xml_files = []
for root, dirs, files in walk('mypath'):
    for file in files:
        if file.endswith('.xml'):
            xml_files.append(path.join(root, file))
csv_write_header(csvfile)
for file in xml_files:
    tree = ET.iterparse(file)
    for _, el in tree:
        el.tag = el.tag.split('}', 1)[1] # ignores namespaces
    root = tree.root
    if 'Invoice' in root.tag: # only invoice files
        for e in root.iter('InvoiceLine'):
            with open(csvfile, 'a', newline='') as outfile:
                writer = csv.writer(outfile)
                writer.writerow(invoicelines(e))
And the output I get if I just parse the above file is:
filename,lineId,lineQuantity,lineAmount,payableAmount
file1,1,1.50,1633.65,2586.61
file1,1,1.50,1633.65,2586.61
...so I'm guessing it's something with my iteration.

The following code achieves your desired result.
import os
import xml.etree.ElementTree as ET
def extract_line_id_data(line_element):
    # Relies on the child order: ID, quantity, amount
    line_id = line_element[0].text
    quantity = line_element[1].text
    line_amount = line_element[2].text
    return line_id, quantity, line_amount
# Open the output once and write the headers a single time
with open('output.csv', 'w') as output:
    output.write('Filename,LineID,Quantity,LineAmount,TotalAmount\n') # Headers
    # Iterate over all files in a directory
    for dirpath, dirs, files in os.walk('/path/to_folder/with/xml_files/'):
        for xml_file in files:
            # If not all files in the folder are XML you'll need to catch an exception here
            tree = ET.parse(os.path.join(dirpath, xml_file)) # join with the folder so the file is found
            root = tree.getroot()
            total_amount = root[0][0].text # Get total amount value
            # Iterate over all "Line" elements
            for e in root[1:]:
                output.write('{},{},{},{},{}\n'.format(xml_file, *extract_line_id_data(e), total_amount))
Tested with your file and a "file2.xml" with a TotalAmount of 350, output looks like this:
Filename,LineID,Quantity,LineAmount,TotalAmount
file.xml,1,4,132,407
file.xml,2,1,72,407
file.xml,3,7,203,407
file2.xml,1,4,132,350
file2.xml,2,1,72,350
file2.xml,3,7,203,350
I hope this works for you. I have used ElementTree as preferred, although I would have used lxml myself.
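For what it's worth, the repeated row in the question seems to come from invoicelines() ignoring the element it is given and calling root.find(), which always returns the first match in the document. Below is a minimal sketch of a namespace-aware alternative that searches relative to each line instead. The helper name invoice_rows and the inline SAMPLE string are made up for illustration, and the sketch assumes every InvoiceLine carries the three child elements shown in the question.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes copied from the sample invoice in the question.
NS = {
    'cac': 'urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2',
    'cbc': 'urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2',
}

def invoice_rows(root, filename):
    """Hypothetical helper: return one CSV row per InvoiceLine element."""
    payable = root.find('cac:LegalMonetaryTotal/cbc:PayableAmount', NS).text
    rows = []
    for line in root.findall('cac:InvoiceLine', NS):
        # Search relative to *line*, not root, so each row gets its own values.
        rows.append([
            filename,
            line.find('cbc:ID', NS).text,
            line.find('cbc:InvoicedQuantity', NS).text,
            line.find('cbc:LineExtensionAmount', NS).text,
            payable,
        ])
    return rows

# Cut-down copy of the question's invoice, used in place of a file on disk.
SAMPLE = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
  xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
  xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
  <cac:LegalMonetaryTotal>
    <cbc:PayableAmount currencyID="DKK">2586.61</cbc:PayableAmount>
  </cac:LegalMonetaryTotal>
  <cac:InvoiceLine>
    <cbc:ID>1</cbc:ID>
    <cbc:InvoicedQuantity unitCode="HUR">1.50</cbc:InvoicedQuantity>
    <cbc:LineExtensionAmount currencyID="DKK">1633.65</cbc:LineExtensionAmount>
  </cac:InvoiceLine>
  <cac:InvoiceLine>
    <cbc:ID>2</cbc:ID>
    <cbc:InvoicedQuantity unitCode="HUR">1.00</cbc:InvoicedQuantity>
    <cbc:LineExtensionAmount currencyID="DKK">952.96</cbc:LineExtensionAmount>
  </cac:InvoiceLine>
</Invoice>"""

rows = invoice_rows(ET.fromstring(SAMPLE), 'file1')
```

With real files, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE), and the rows can be passed to csv.writer(...).writerows().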

Try the following code (C#):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Xml;
using System.Xml.Linq;
using System.IO;
namespace ConsoleApp2
{
    class Program
    {
        const string FILENAME = @"c:\temp\text.csv";
        static void Main()
        {
            string[] filenames = Directory.GetFiles(@"c:\temp", "*.xml");
            StreamWriter writer = new StreamWriter(FILENAME);
            foreach (string filename in filenames)
            {
                XDocument doc = XDocument.Load(filename);
                string amount = (string)doc.Descendants("TotalAmount").FirstOrDefault();
                foreach (XElement line in doc.Descendants("Line"))
                {
                    writer.WriteLine(string.Join(",",
                        filename,
                        (string)line.Element("LineID"),
                        (string)line.Element("Quantity"),
                        (string)line.Element("LineAmount"),
                        amount));
                }
            }
            writer.Flush();
            writer.Close();
        }
    }
}

Related

Writing Python XML ElementTree output to CSV

TL;DR
I'm now able to output the information I want in the CSV but I'm just repeating the last XML file's data over and over again.
This is the latest version of the script:
import csv
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("..\Lib\macros\*.xml")
for filename in filenames:
    with open(filename, 'r') as content:
        element = ET.parse(content)
        root = element.getroot()
        print(root.attrib, filename)
        e = element.findall('commands/MatrixSwitch/')
        for i in e:
            print (i.tag, i.text)
with open('results.csv', 'w', newline='') as file:
    for filename in filenames:
        writer = csv.writer(file)
        writer.writerow([root.attrib, filename])
        for i in e:
            writer.writerow([i.tag, i.text])
Say I have 10 XML files: I'm getting the output for XML "File 10" ten times in the CSV, and nothing for XML "File 1-9". I'm sure it's something simple?
=========================================================================
I've written a small script which ingests a folder of XML files, searches for a particular element and then recalls some of the data. This is then printed to the console and written to a CSV, except I'm having trouble formatting my CSV correctly.
This is where I've got so far:
import csv
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("..\Lib\macros\*.xml")
for filename in filenames:
    with open(filename, 'r') as content:
        element = ET.parse(content)
        root = element.getroot()
        print(root.attrib, filename)
        e = element.findall('commands/MatrixSwitch/')
        for i in e:
            print (i.tag, i.text)
with open('results.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow([root.attrib, filename])
I'm looking to capture the following data:
XML Filename
Macro Name
Monitor ID
Camera ID
I'm only interested in the monitor and camera IDs when a "Matrix Switch" is referred to in the XML. Sometimes there might be only one monitor ID and one camera ID, sometimes there might be more, so the script needs to loop through and get all of the IDs within each "Matrix Switch" element. This seems to work so far.
Typical XML structure looks like this :
<macro name="NAME OF THE MACRO IS SHOWN HERE">
  <execution>
    <delay>0</delay>
  </execution>
  <parameters/>
  <commands>
    <MatrixSwitch>
      <camera>1530</camera>
      <monitor>1020</monitor>
    </MatrixSwitch>
    <MatrixSwitch>
      <camera>1531</camera>
      <monitor>1001</monitor>
    </MatrixSwitch>
  </commands>
</macro>
Or like this :
<macro name="ANOTHER NAME GOES HERE">
  <execution>
    <delay>0</delay>
  </execution>
  <parameters/>
  <commands>
    <MatrixSwitch>
      <camera>201</camera>
      <monitor>17</monitor>
    </MatrixSwitch>
    <MatrixSwitch>
      <camera>206</camera>
      <monitor>18</monitor>
    </MatrixSwitch>
    <MatrixSwitch>
      <camera>202</camera>
      <monitor>19</monitor>
    </MatrixSwitch>
    <MatrixSwitch>
      <camera>207</camera>
      <monitor>20</monitor>
    </MatrixSwitch>
  </commands>
</macro>
My current results.csv only contains the name and filename. This works, but I'm unsure where to add the "writer" command in the loop where it's dealing with the Monitor ID and Camera ID.
I want my CSV to show : Name, Filename, Monitor A, Camera A, Monitor B, Camera B, Monitor C, Camera C, Monitor D, Camera D etc.....
Any pointers greatly appreciated!!
Code has now been changed slightly :
import csv
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("..\Lib\macros\*.xml")
for filename in filenames:
    with open(filename, 'r') as content:
        element = ET.parse(content)
        root = element.getroot()
        print(root.attrib, filename)
        e = element.findall('commands/MatrixSwitch/')
        for i in e:
            print (i.tag, i.text)
with open('results.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow([root.attrib, filename])
    for i in e:
        writer.writerow([i.tag, i.text])
Output in the CSV is as below :
https://imgur.com/a/SrPrgjm

Just add a loop calling writerow:
...
with open('results.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow([root.attrib, filename])
    for i in e:
        writer.writerow([i.tag, i.text])
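If the goal is one CSV row per macro file (Name, Filename, Monitor A, Camera A, Monitor B, Camera B, ...), something along these lines might work. This is only a sketch that assumes the XML layout shown in the question: macro_row is a hypothetical helper, and the SAMPLE string and in-memory buffer stand in for the real files and results.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

def macro_row(root, filename):
    """Hypothetical helper: name, filename, then monitor/camera per MatrixSwitch."""
    row = [root.get('name'), filename]
    for switch in root.findall('commands/MatrixSwitch'):
        row.append(switch.findtext('monitor'))
        row.append(switch.findtext('camera'))
    return row

# Stand-in for one macro file, copied from the question's first example.
SAMPLE = """<macro name="NAME OF THE MACRO IS SHOWN HERE">
  <execution><delay>0</delay></execution>
  <parameters/>
  <commands>
    <MatrixSwitch><camera>1530</camera><monitor>1020</monitor></MatrixSwitch>
    <MatrixSwitch><camera>1531</camera><monitor>1001</monitor></MatrixSwitch>
  </commands>
</macro>"""

# Open the output once, then parse and write each file inside the same loop,
# so that every row is built from its own file rather than the last one parsed.
output = io.StringIO()
writer = csv.writer(output)
for filename, xml_text in [('macro1.xml', SAMPLE)]:
    writer.writerow(macro_row(ET.fromstring(xml_text), filename))
csv_text = output.getvalue()
```

With real files, the list would be replaced by the glob.glob() results, ET.parse(filename).getroot() would replace ET.fromstring(xml_text), and the buffer would become open('results.csv', 'w', newline='').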

Generate XML tree with values using py script

I am new to Python and would like to create an XML tree with values.
I want to list both the "jsc://xxx.js" files and the "EXT.FC.XML" files under the Resources and Policies elements in the XML via Python code. All "jsc://xxx.js" and "EXT.FC.XML" files are stored in local folders named "resources" and "policies".
The desired output:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<APIProxy revision="2" name="Retirement_Digital_Account_Balance">
  <ManifestVersion>SHA-512:f9ae03c39bf00f567559e</ManifestVersion>
  <Policies>
    <Policy>EXT.FC_Env_Host</Policy>
    <Policy>EXT.FC_JWTVerf</Policy>
    <Policy>EXT.JSC_Handle_Fault</Policy>
  </Policies>
  <ProxyEndpoints>
    <ProxyEndpoint>default</ProxyEndpoint>
  </ProxyEndpoints>
  <Resources>
    <Resource>jsc://createErrorMessage.js</Resource>
    <Resource>jsc://jwtHdrExt.js</Resource>
    <Resource>jsc://log-variables.js</Resource>
    <Resource>jsc://swagger.json</Resource>
    <Resource>jsc://tgtDataForm.js</Resource>
  </Resources>
</APIProxy>
I use ElementTree to build the XML file. This is the code I run:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement, Comment
from xml.etree import ElementTree, cElementTree
from xml.dom import minidom
from ElementTree_pretty import prettify
import datetime
import os
generated_on = str(datetime.datetime.now())
#proxy = Element('APIProxy')
proxy = Element('APIProxy', revision = "2", name = "Retirement_Digital_Account_Balance")
ManifestVersion = SubElement(proxy, 'ManifestVersion')
ManifestVersion.text = 'SHA-512:f9ae03c39bf00f567559e'
Policies = SubElement(proxy, 'Policies')
Policy = SubElement(Policies, 'Policy')
path = '/policies'
#files = []
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if '.xml' in file:
            Policy.append(os.path.join(r, file))
for p in Policy:
    print(p)
ProxyEndpoints = SubElement(proxy, 'ProxyEndpoints')
ProxyEndpoint = SubElement(ProxyEndpoints, 'ProxyEndpoint')
ProxyEndpoint.text = 'default'
Resources = SubElement(proxy, 'Resources')
Resource = SubElement(Resources, 'Resource')
path = '/Resources'
# r=root, d=directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if 'js' in file:
            Resource.append(os.path.join(r, file))
for R in Resource:
    print(R)
Spec = SubElement(proxy, 'Spec')
Spec.text = ""
#proxy.append(Spec)
proxy.append(Element('TargetServers'))
TargetEndpoints = SubElement(proxy, 'TargetEndpoints')
TargetEndpoint = SubElement(TargetEndpoints, 'TargetEndpoint')
TargetEndpoint.text = 'default'
print(ET.tostring(proxy))
tree = cElementTree.ElementTree(proxy) # wrap it in an ElementTree instance, and save as XML
t = minidom.parseString(ElementTree.tostring(proxy)).toprettyxml() # Since ElementTree write() has no pretty printing support, used minidom to beautify the xml.
tree1 = ElementTree.ElementTree(ElementTree.fromstring(t))
tree1.write("Retirement_Digital_Account_Balance_v2.xml",encoding='UTF-8', xml_declaration=True)
The code runs, but I didn't get the desired output. I got the following:
<?xml version='1.0' encoding='UTF-8'?>
<APIProxy name="Retirement_Digital_Account_Balance" revision="2">
  <ManifestVersion>SHA-512:f9ae03c39bf00f567559e</ManifestVersion>
  <Policies>
    <Policy />
  </Policies>
  <ProxyEndpoints>
    <ProxyEndpoint>default</ProxyEndpoint>
  </ProxyEndpoints>
  <Resources>
    <Resource />
  </Resources>
</APIProxy>
How can I use a loop with ElementTree in Python to read the file names from the folders and create the XML tree with those values?
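One possible approach, sketched under assumptions: create a new SubElement for each file found, rather than appending path strings to a single Policy element (Element.append expects an Element, not a string). The file names are passed in as lists so that the sketch runs without the folders; in practice they might come from os.listdir() or os.walk(). The helper name build_proxy, stripping the .xml extension for policies, and prepending jsc:// for resources are all guesses based on the desired output, and the other elements (ManifestVersion, ProxyEndpoints) are left out for brevity.

```python
import os
import xml.etree.ElementTree as ET

def build_proxy(policy_files, resource_files):
    """Hypothetical helper: one <Policy> / <Resource> child per file name."""
    proxy = ET.Element('APIProxy', revision='2',
                       name='Retirement_Digital_Account_Balance')
    policies = ET.SubElement(proxy, 'Policies')
    for name in sorted(policy_files):
        if name.endswith('.xml'):
            # Assumed: the policy name is the file name without its extension.
            ET.SubElement(policies, 'Policy').text = os.path.splitext(name)[0]
    resources = ET.SubElement(proxy, 'Resources')
    for name in sorted(resource_files):
        # Assumed: resources are referenced with a jsc:// prefix.
        ET.SubElement(resources, 'Resource').text = 'jsc://' + name
    return proxy

# Example file names taken from the desired output in the question.
proxy = build_proxy(
    ['EXT.FC_Env_Host.xml', 'EXT.FC_JWTVerf.xml'],
    ['createErrorMessage.js', 'jwtHdrExt.js'],
)
xml_text = ET.tostring(proxy, encoding='unicode')
```

The key design point is that each loop iteration creates its own child element and sets its text, so the number of Policy and Resource elements follows the number of files.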

Looping through list of xml-files?

I'm trying to create a program that loops through a list of xml-files and extracts certain elements from the files:
from os import listdir, path
import xml.etree.ElementTree as ET
mypath = 'C:\myfolder'
files = [f for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    tree = ET.parse(file)
    root = tree.getroot()
    ns = {namespaces}
    def myfunction():
        if 'something' in root.tag:
            filename = path.splitext(file)[0]
            var1 = root.find('./element1', ns)
            var2 = root.find('./element2', ns)
            row = [
                var1.text,
                var2.text
            ]
            return row
The above code returns a list with var1, var2 (from the last file) if I call the function. The reason I have defined this function is that there are different types of xml-files with different element names, so I'm going to create a function for each file type.
Now I want to create a table where the output from each file is a row i.e.:
filename1, var1, var2
filename2, var1, var2
etc.
And ideally export the table to a csv-file. How do I go about that?

The easiest way to write a CSV file is with the csv module from the standard library.
Writing a CSV file is as simple as opening the file and using the default writer:
import csv
from os import listdir, path
import xml.etree.ElementTree as ET
mypath = 'C:\myfolder'
files = [f for f in listdir(mypath) if f.endswith('.xml')]
for file in files:
    tree = ET.parse(file)
    root = tree.getroot()
    ns = {namespaces}
    def myfunction():
        if 'something' in root.tag:
            filename = path.splitext(file)[0]
            var1 = root.find('./element1', ns)
            var2 = root.find('./element2', ns)
            row = [
                var1.text,
                var2.text
            ]
            # Open the file and store the data
            with open('outfile.csv', 'a', newline='') as csvfile:
                csv_writer = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                csv_writer.writerow(row)
            return row
Note that csv_writer.writerow() takes a list (one item per column) as its parameter.
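A variation, offered as a sketch only: let the function take the parsed root and the file name as arguments and return the row, then write all rows in one go. Here extract_row is a hypothetical name, element1/element2 and the 'something' check are the placeholders from the question, and the in-memory documents stand in for files on disk.

```python
import csv
import io
import xml.etree.ElementTree as ET
from os import path

def extract_row(root, filename):
    """Hypothetical helper: return [name, var1, var2], or None for other file types."""
    if 'something' not in root.tag:
        return None
    return [
        path.splitext(filename)[0],
        root.findtext('./element1'),
        root.findtext('./element2'),
    ]

# Stand-ins for files on disk: (file name, XML text).
documents = [
    ('filename1.xml', '<something><element1>a</element1><element2>b</element2></something>'),
    ('filename2.xml', '<other><element1>x</element1></other>'),
]

rows = []
for filename, xml_text in documents:
    # Parse inside the loop and pass the result in, so no file is skipped.
    row = extract_row(ET.fromstring(xml_text), filename)
    if row is not None:
        rows.append(row)

output = io.StringIO()  # swap for open('outfile.csv', 'w', newline='') to write a file
csv.writer(output).writerows(rows)
```

Passing root and filename as arguments avoids relying on loop variables that only hold the last file once the loop has finished.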

Create a dataframe from an xml file

I have a real (and maybe pretty stupid) problem converting an XML file into a pandas DataFrame. I'm new to Python and need some help. I tried code from another thread and modified it, but it doesn't work.
I want to iterate through this file:
<objects>
  <object id="123" name="some_string">
    <object>
      <id>123</id>
      <site id="456" name="somename" query="some_query_as_string"/>
      <create-date>some_date</create-date>
      <update-date>some_date</update-date>
      <update-user id="567" name="User:xyz" query="some_query_as_string"/>
      <delete-date/>
      <delete-user/>
      <deleted>false</deleted>
      <system-object>false</system-object>
      <to-string>some_string_notifications</to-string>
    </object>
    <workflow>
      <workflow-type id="12345" name="WorkflowType_some_workflow" query="some_query_as_string"/>
      <validated>true</validated>
      <name>somestring</name>
      <exported>false</exported>
    </workflow>
Here is my code:
import xml.etree.ElementTree as ET
import pandas as pd
path = "C:/Users/User/Desktop/test.xml"
with open(path, 'rb') as fp:
    content = fp.read()
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(content, parser=parser)
def xml2df(tree):
    root = ET.XML(tree)
    all_records = []
    for i, child in enumerate(root):
        record ={}
        for subchild in child:
            record[subchild.tag] = subchild.text
            all_records.append(record)
    return pd.DataFrame(all_records)
Where is the problem? Please help :O

You are passing the file location string to ET.fromstring(), which is not the actual contents of the file. You need to read the contents of the file first, then pass that to ET.fromstring().
path = "C:/Users/User/Desktop/test.xml"
with open(path, 'rb') as fp:
    content = fp.read()
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(content, parser=parser)
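Another thing that may be worth checking: in the question's code, tree is already a parsed Element, yet xml2df() passes it to ET.XML(), which expects XML text rather than an Element. Below is a sketch that collects the records straight from the parsed element. The name xml_to_records is hypothetical, and SAMPLE is a simplified, well-formed stand-in for the question's data (the real file nests the fields one level deeper), so the loop depth would need adjusting.

```python
import xml.etree.ElementTree as ET

def xml_to_records(root):
    """Hypothetical helper: one dict per child of root, keyed by grandchild tag."""
    records = []
    for child in root:
        record = {}
        for subchild in child:
            record[subchild.tag] = subchild.text
        records.append(record)  # once per child, after its fields are collected
    return records

# Simplified stand-in for the question's file.
SAMPLE = """<objects>
  <object id="123" name="some_string">
    <id>123</id>
    <deleted>false</deleted>
  </object>
  <object id="124" name="other_string">
    <id>124</id>
    <deleted>true</deleted>
  </object>
</objects>"""

records = xml_to_records(ET.fromstring(SAMPLE))
# With pandas installed, the list of dicts can be passed on: pd.DataFrame(records)
```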

Write XML filename based of CSV cell Python

Trying to save output from this script to a file based on a cell within the csv. I am able to call the variable {file_root_name} to write into the xml file but not as a variable to write the file name. How can I use the variable file_root_name as a variable to generate a file name?
import csv
import sys
from xml.etree import ElementTree
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from xml.dom import minidom
def prettify(elem):
    """Return a pretty-printed XML string for the Element.
    """
    rough_string = ElementTree.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=" ", encoding = 'utf-8')
doctype = '<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">'
video_data = ((256, 336000),
              (512, 592000),
              (768, 848000),
              (1128, 1208000))
with open(sys.argv[1], 'rU') as f:
    reader = csv.DictReader(f)
    for row in reader:
        root = Element('smil')
        root.set('xmlns', 'http://www.w3.org/2001/SMIL20/Language')
        head = SubElement(root, 'head')
        meta = SubElement(head, 'meta base="rtmp://cp23636.edgefcs.net/ondemand"')
        body = SubElement(root, 'body')
        switch_tag = ElementTree.SubElement(body, 'switch')
        for suffix, bitrate in video_data:
            attrs = {'src': ("mp4:soundcheck/{year}/{id}/{file_root_name}_{suffix}.mp4"
                             .format(suffix=str(suffix), **row)),
                     'system-bitrate': str(bitrate),
                     }
            ElementTree.SubElement(switch_tag, 'video', attrs)
        xml, doc = prettify(root).split('\n', 1)
        output = open('file_root_name'+'.smil', 'w')
        output.write(xml + doctype + doc)
        output.close

I'm not sure that I follow, but if the line
attrs = {'src': ("mp4:soundcheck/{year}/{id}/{file_root_name}_{suffix}.mp4"
                 .format(suffix=str(suffix), **row)),
         'system-bitrate': str(bitrate),
         }
works then "file_root_name" must be a string key of the dictlike object row. The line
output = open('file_root_name'+'.smil', 'w')
actually combines the string 'file_root_name' with '.smil'. So you'd really want something like
output = open(row['file_root_name']+'.smil', 'w')
BTW, the line
output.close
won't do anything, because it only references the method without calling it. You want output.close() instead, or simply
with open(row['file_root_name']+'.smil', 'w') as output:
    output.write(xml + doctype + doc)
