Transferring Excel data to XML - python

I'm new to handling XML in Python, so go easy on me.
I've been trying to transfer my Excel data to an XML file that looks like so:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<shelter>
<adress>..</adress>
<code>...</code>
<neighborhood>..</neighborhood>
</shelter>
<shelter>
<adress>...</adress>
<code>...</code>
<neighborhood>...</neighborhood>
</shelter>
</xml>
My Excel spreadsheet looks like so:
Rather simple, right? I tried a couple of methods in Excel, and I also tried to write a script to do it, but I can't seem to make it work.
Any ideas?
Thanks a lot!

If this is just a one-time transformation you can use the CONCATENATE function to create the content of your XML. Put this formula in the D column for all rows:
=CONCATENATE("<shelter><adress>",A2,"</adress><code>",B2,"</code><neighborhood>",C2,"</neighborhood></shelter>")
Then copy the text to a new file, add the appropriate XML tags on the first and last line and you are done.
If you need to do this in Python, save the file as CSV such that you have something like (note that the header line is removed from this file):
adress1,1,n1
adress2,2,n1
adress3,3,n1
Then you can use the following Python script to get the desired output:
print('<?xml version="1.0" encoding="UTF-8"?>')
print('<xml>')
with open('test.csv', 'r') as f:
    for line in f:
        # drop the trailing newline, then split the row into its three columns
        fields = line.strip().split(',')
        print("""<shelter>
<adress>{0}</adress>
<code>{1}</code>
<neighborhood>{2}</neighborhood>
</shelter>""".format(fields[0], fields[1], fields[2]))
print('</xml>')
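If the cell values can contain commas, quotes, or characters such as & and <, plain string formatting will produce broken XML. A minimal sketch of the same conversion with the standard csv and xml.etree.ElementTree modules is below. The function name rows_to_xml is made up for illustration, and the sketch assumes the CSV has no header row and exactly the three columns shown above.

```python
import csv
import xml.etree.ElementTree as ET


def rows_to_xml(csv_file):
    """Build the <xml><shelter>...</shelter></xml> document from an open CSV file.

    Assumes three columns per row (address, code, neighborhood) and no header.
    """
    root = ET.Element('xml')
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        shelter = ET.SubElement(root, 'shelter')
        # Tag names follow the spelling used in the question ("adress").
        ET.SubElement(shelter, 'adress').text = row[0]
        ET.SubElement(shelter, 'code').text = row[1]
        ET.SubElement(shelter, 'neighborhood').text = row[2]
    # ElementTree escapes special characters such as & and < in the text.
    body = ET.tostring(root, encoding='unicode')
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

It could be called as rows_to_xml(open('test.csv', newline='')) and the returned string written to a file.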

Related

Parsing and extracting data from multiple xml files in a directory in order to export to csv

Could someone please help me with my code? I have only just begun to learn Python in order to complete an assignment that probably isn't as hard as I find it to be.
I need to write a Python script to facilitate extracting metadata of publishing material, stored in TEI xml files. To practice I was given 50 typical files with dummy data. I need to extract, for now, data from two tags in one of the elements.
These files are all alike, but they do not contain repeating data. The repetition is in that all files have the same structure, and the same tags. I need the data of these -deeply- nested tags.
I find it very difficult to parse multiple files instead of just one.
The xml files all look like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE encyclopedia SYSTEM "Encyclopedia.dtd">
<encyclopedia>
<div2 id="1">
<head />
<art>
<dummyarticle targets="http://referenceworks.brillonline.com/entries/brill-s-new-pauly/*brill000410" idno.doi="http://dx.doi.org/10.1163/1574-9347_bnp_brill000410" id="brill000410" entry="Child emperors" volume="" page="3:223" first-online="20061001" last-update="" first-print="9789004122598, 20110510">
<pseudoarticle>
<articleentry>
<mainentry>Child emperors</mainentry>
</articleentry>
<p>see  Emperors, child</p>
</pseudoarticle>
</dummyarticle>
</art>
</div2>
</encyclopedia>
I need to extract the id and entry attributes from the element 'dummyarticle'. I then need to create a csv file containing the data, for now two columns with 50 rows, like so:
id;entry
brill000410;Child emperors
brill000450;Clientela, military
brill000460;Clyster
...etc
For now I ask your help with the parsing through all the xml files a directory holds. This is the code I have so far:
import csv
import xml.etree.ElementTree as ET
import os
os.chdir('c:\\Users\\HP\\for_Anne')
print(os.getcwd())
for f in os.listdir():
    if f.endswith(".xml"):
        continue
dir(ET)
tree = ET.parse('C:/Users/HP/for_Anne/1574-9347_bnp_fulltextxml_brill000410.xml')
root = tree.getroot()
data = []
# testing:
# print(ET.tostring(root, encoding='utf8').decode('utf8'))
# create a csv file containing headers of attributes
attr_list = ['Id', 'entry']
for f in tree.findall('.//{"http://referenceworks.brillonline.com/entries/brill-s-new-pauly/*brill000410"}File'):
    data.append({a:f.attrib[a] for a in attr_list})
with open('Data.csv', 'w') as f:
    w = csv.DictWriter(f, fieldnames=attr_list)
    w.writeheader()
    w.writerows(data)
# tree.write('C:/Users/HP/for_Anne/1574-9347_bnp_fulltextxml_brill000410.xml')
Thanks a lot for thinking along!
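A few things in the code above stand out: the os.listdir() loop only executes continue and never parses anything, the {...} syntax in findall is meant for XML namespaces (the sample file declares none), and attribute names are case-sensitive, so 'Id' will not match id. A minimal sketch of one way to loop over every file is below. The function names collect_entries and write_csv are invented for illustration, and the sketch assumes every file has the same structure as the sample.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET


def collect_entries(directory, fields=('id', 'entry')):
    """Return one dict per <dummyarticle> found in the .xml files of `directory`."""
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, '*.xml'))):
        root = ET.parse(path).getroot()
        # The sample declares no namespace, so the plain tag name is enough.
        for article in root.iter('dummyarticle'):
            # .get() returns '' instead of raising if an attribute is missing.
            rows.append({name: article.get(name, '') for name in fields})
    return rows


def write_csv(rows, csv_path, fields=('id', 'entry')):
    """Write the collected rows as a semicolon-separated file with a header."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fields), delimiter=';')
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_csv(collect_entries('c:/Users/HP/for_Anne'), 'Data.csv') would then produce the two-column file described above.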

Python -- open xml file, change data based on data from csv, save as new xml file

Thanks for reading.
I'm very new to Python, and even newer to XML files. I have a baseline XML file that I'm trying to update with data from a row in a csv.
There are four pieces of data to change per row, based on the five columns of the csv (the last change is to find the column 4 value and replace it with the column 5 value). The csv will contain 30-50 rows, so I'm looking to generate 30-50 different XML files.
My csv file looks like this:
heading1,heading2,heading3,heading4,heading5
dataA1,dataA2,dataA3,dataA4,dataA5
dataB1,dataB2,dataB3,dataB4,dataB5
dataC1,dataC2,dataC3,dataC4,dataC5
XML data is like:
<Scalar name="system">
<Attribute name="sysDataA1" convert="ascii">dataA1</Attribute>
<Attribute name="sysDataA2" convert="ascii">dataA2</Attribute>
<Scalar name="hm2NetStaticGroup">
<Attribute name="hm2NetDataA3">dataA3</Attribute>
And then, as I mentioned previously, the last change would just be a simple find and replace of all occurrences of dataA4 with dataA5, and then saving the new XML file with all the fields updated as "template-dataA1.xml".
I have working code for the find and replace portion of the script, but I am struggling hard with how to pull the data from the csv rows and create unique xml files from the template.
This is the working code for the find and replace of dataA4, but again, not sure how to make the file save with dataA1 appended to the name of the file.
import xml.etree.ElementTree as ET
with open('template.xml', encoding='UTF-8') as t:
    tree = ET.parse(t)
root = tree.getroot()
for elem in root.iter():
    try:
        elem.text = elem.text.replace('301', '302')
    except AttributeError:
        pass
tree.write('template-dataA1.xml', encoding='UTF-8')
This is what I've been working with on the csv data:
# open csv file and print data from column.
import csv
var1 = "dataA1", "dataA2", "dataA3", "dataA4", "dataA5"
csv_dic = {var1: []}
csvFile = csv.reader(open('airData.csv', 'rt'))
for row in csvFile:
    csv_dic[var1].append(row[0])
for column in row:
    print(column)
I just tried to print the column to see what data it is getting, and it just keeps going to the last row in my csv and printing that. Doesn't matter if I change row[0] to row[1]. What am I misunderstanding here?
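If the printing loop sits outside the for row loop, as the symptom suggests, it only runs once the outer loop has finished, and by then row refers to the last line of the csv. Doing the per-row work inside the loop avoids this. A minimal sketch is below. The function name write_xml_per_row is invented for illustration, and the sketch covers only the find-and-replace and the file naming, because the mapping from the first three columns to specific Attribute elements is not fully shown in the question.

```python
import csv
import os
import xml.etree.ElementTree as ET


def write_xml_per_row(template_path, csv_path, out_dir):
    """Write one XML file per CSV row, named after the row's first column."""
    written = []
    with open(csv_path, newline='', encoding='utf-8') as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the heading1..heading5 line
        for row in reader:
            if len(row) < 5:
                continue  # ignore blank or short lines
            # Re-parse the template for every row so edits never accumulate.
            tree = ET.parse(template_path)
            old, new = row[3], row[4]
            for elem in tree.getroot().iter():
                if elem.text:
                    elem.text = elem.text.replace(old, new)
            # Updates driven by row[0], row[1] and row[2] would go here.
            out_path = os.path.join(out_dir, 'template-{}.xml'.format(row[0]))
            tree.write(out_path, encoding='UTF-8')
            written.append(out_path)
    return written
```

Everything that depends on a single row, including the output file name, stays inside the loop body, so each row produces its own file.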

Read XML block in Python

I have an XML file like the one below, which contains multiple XML documents. I want to fetch the <Sacd> content.
<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<Acdpktg> <Acdpktg/>
</Sacd>
<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<Acdpktg/>
</Sacd>
<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<AcdpktG>
<Result Value="0"/>
<Packet Value="Dnd"/>
<Invoke Value="abc"/>
</AcdpktG>
</Sacd>
How do I extract the value inside Sacd tag?
Well, your xml is problematic in several respects. First, it contains multiple xml documents within one file, which is not a good idea; they have to be split into separate xml documents. Second, the first <Acdpktg> <Acdpktg/> tag pair is invalid; it should be <Acdpktg> </Acdpktg>.
But once it's all fixed, you can get your expected output. So:
from lxml import etree
big = """[your xml above,fixed]"""
smalls = big.replace('<?xml','xxx<?xml').split('xxx')[1:] #split it into small xml files
for small in smalls:
    xml = small.encode('utf-8') #either this, or remove the xml declarations from each small file
    doc = etree.XML(xml)
    for value in doc.xpath('.//AcdpktG//*/@Value'):
        print(value)
Output:
0
Dnd
abc
Or, a bit fancier output can be obtained by changing the inner for loop a bit:
    for value in doc.xpath('.//AcdpktG//*'):
        print(value.tag, value.xpath('./@Value')[0])
Output:
Result 0
Packet Dnd
Invoke abc
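The same idea can be sketched with only the standard library, for cases where lxml is not available. The function names split_documents and attribute_values are made up for illustration, and the sketch assumes the input has already been fixed so that each piece is well-formed.

```python
import xml.etree.ElementTree as ET


def split_documents(text):
    """Split concatenated XML documents at each '<?xml' declaration."""
    parts = text.split('<?xml')
    # parts[0] is whatever precedes the first declaration (usually empty).
    return ['<?xml' + part for part in parts[1:]]


def attribute_values(text, parent='AcdpktG', attr='Value'):
    """Return (tag, value) pairs for the direct children of every `parent` element."""
    found = []
    for chunk in split_documents(text):
        # Encode to bytes so the encoding declaration in each chunk is honoured.
        root = ET.fromstring(chunk.strip().encode('utf-8'))
        for block in root.iter(parent):
            for child in block:
                if attr in child.attrib:
                    found.append((child.tag, child.attrib[attr]))
    return found
```

Tag names are case-sensitive, so the Acdpktg elements (lowercase g) in the first two documents are not matched by the default parent='AcdpktG'.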

Write Open Office XML (e.g. docx) with XML that matches the OOXML namespace

I have a python program that edits the XML in a .docx file. I'd like to edit the XML with ETree.
When I read the XML from the .docx file, it begins like this:
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.micro'...
This is in a variable called data. I create the element tree with:
import xml.etree.ElementTree as ElementTree
tree = ElementTree.XML(data)
I convert it back with:
data = ElementTree.tostring(tree)
However, there have been subtle changes to the XML. It now looks like this:
b'<ns0:document xmlns:ns0="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:ns1="ht...
Word won't read this, even though it is standard XML.
EDIT: I tried adding the string to my XML, just to get it to round-trip:
XML_HEADER=b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'
tree = ElementTree.XML(data)
data = XML_HEADER + ElementTree.tostring(tree)
But I still get the error:
We're sorry. We can't open <filename>.docx because we found a problem with its contents.
Details:
The XML data is invalid according to the schema.
Location: Part: /word/document.xml, Line: 0, Column:0
I can't fix word. I've got to generate XML that looks exactly like the XML that I started with. How do I get ETree to generate that?
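One commonly suggested approach is to tell ElementTree which prefixes to use before serializing, via ET.register_namespace, so that w: is written instead of ns0:. A minimal sketch is below. The function names are invented for illustration, and this may not be sufficient on its own: ElementTree only writes declarations for namespaces that are actually used by element or attribute names, while Word documents often refer to prefixes by name inside attribute values (for example in mc:Ignorable), so some declarations may still need to be restored by hand.

```python
import io
import xml.etree.ElementTree as ET

XML_HEADER = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'


def register_document_namespaces(data):
    """Register every prefix declared in `data` so it is reused on output.

    Note: ET.register_namespace changes a process-wide registry.
    """
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(data), events=('start-ns',)):
        ET.register_namespace(prefix, uri)
        namespaces[prefix] = uri
    return namespaces


def round_trip(data):
    """Parse and re-serialize `data`, keeping the original namespace prefixes."""
    register_document_namespaces(data)
    root = ET.fromstring(data)
    # With encoding='utf-8' no declaration is emitted, so the header is added here.
    return XML_HEADER + ET.tostring(root, encoding='utf-8')
```

Edits to the tree would go between ET.fromstring and ET.tostring inside round_trip.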

ElementTree.write doesn't pretty_print on second pass

I'm having an issue with formatting xml when writing to an xml file. The issue is, the first time I write to the xml file, the xml is formatted properly using pretty_print=True. Any subsequent attempts to append to the xml file are not formatted properly. The xml is written, but not formatted. My code looks like:
#does the library.xml file exist?
if os.path.isfile(libraryFile):
    library = ET.ElementTree()
    library.parse(libraryFile)
else:
    #the library.xml does not exist at the given path
    library = ET.ElementTree(project.getBoilerplateLibrary(path))
root = library.getroot()
root.append(xml) #xml is a lxml Element object
f = open(libraryFile, 'w')
library.write(f, pretty_print=True)
f.close()
The first time we write to the file I get something like:
<root>
  <element>
    <foo>bar</foo>
  </element>
</root>
Any subsequent attempts to append to this file end up looking like:
<root>
  <element>
    <foo>bar</foo>
  </element><element><bleep>bloop</bleep></element></root>
Any ideas?
The lxml FAQ covers this under "Why doesn't the pretty_print option reformat my XML output?"
This question has also been asked before on StackOverflow as "lxml pretty print write file problem".
It is unfortunately a side effect of XML, where whitespace can be significant: the whitespace already present in the parsed file is kept as content, so the serializer does not re-indent around it.
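In lxml, the usual remedy is to parse the existing file with etree.XMLParser(remove_blank_text=True), so that the old indentation is discarded before pretty_print=True is applied. For comparison, a minimal sketch of the same append-and-rewrite cycle using only the standard library is below. It swaps in ET.indent from xml.etree.ElementTree (available from Python 3.9) instead of lxml's pretty_print, and the function name append_and_write and the root tag are assumptions made for illustration.

```python
import os
import xml.etree.ElementTree as ET


def append_and_write(library_file, new_element, root_tag='root'):
    """Append `new_element` to the library file and rewrite it fully indented."""
    if os.path.isfile(library_file):
        tree = ET.parse(library_file)
    else:
        # No file yet: start from an empty root element.
        tree = ET.ElementTree(ET.Element(root_tag))
    tree.getroot().append(new_element)
    # Re-indent the whole tree; whitespace-only text left over from the
    # previous write is replaced rather than kept.
    ET.indent(tree, space='  ')
    tree.write(library_file, encoding='utf-8', xml_declaration=True)
    return tree
```

Because the whole tree is re-indented on every write, a second append comes out formatted the same way as the first.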
