Read XML block in Python - python

I have a file like the one below, which contains multiple XML documents. I want to fetch the content of <Sacd>.
<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<Acdpktg> <Acdpktg/>
</Sacd>
<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<Acdpktg/>
</Sacd>
<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<AcdpktG>
<Result Value="0"/>
<Packet Value="Dnd"/>
<Invoke Value="abc"/>
</AcdpktG>
</Sacd>
How do I extract the value inside Sacd tag?

Well, your xml is problematic in several respects. First, it contains multiple xml documents within a single file, which is not a good idea; they have to be split into separate xml files. Second, the first <Acdpktg> <Acdpktg/> tag pair is invalid; it should be <Acdpktg> </Acdpktg>.
But once it's all fixed, you can get your expected output. So:
from lxml import etree
big = """[your xml above,fixed]"""
smalls = big.replace('<?xml','xxx<?xml').split('xxx')[1:] #split it into small xml files
for small in smalls:
    xml = bytes(bytearray(small, encoding='utf-8')) #either this, or remove the xml declarations from each small file
    doc = etree.XML(xml)
    for value in doc.xpath('.//AcdpktG//*/@Value'):
        print(value)
Output:
0
Dnd
abc
Or, a bit fancier output can be obtained by changing the inner for loop a bit:
    for value in doc.xpath('.//AcdpktG//*'):
        print(value.tag, value.xpath('./@Value')[0])
Output:
Result 0
Packet Dnd
Invoke abc
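For comparison, here is a minimal sketch of the same idea using only the standard library. It assumes the first document has already been repaired as described above, and that no declaration-like string appears inside comments or CDATA. The names BIG, split_documents and sacd_values are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the three concatenated documents, with the first one repaired.
BIG = """<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<Acdpktg> </Acdpktg>
</Sacd>
<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<Acdpktg/>
</Sacd>
<?xml version="1.0" encoding="utf-8"?>
<Sacd>
<AcdpktG>
<Result Value="0"/>
<Packet Value="Dnd"/>
<Invoke Value="abc"/>
</AcdpktG>
</Sacd>"""


def split_documents(text):
    """Split on XML declarations and return the declaration-free chunks."""
    chunks = re.split(r"<\?xml[^>]*\?>", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def sacd_values(text):
    """Return (tag, Value) pairs for every element that carries a Value attribute."""
    pairs = []
    for chunk in split_documents(text):
        root = ET.fromstring(chunk)
        for elem in root.iter():
            if "Value" in elem.attrib:
                pairs.append((elem.tag, elem.get("Value")))
    return pairs


pairs = sacd_values(BIG)
for tag, value in pairs:
    print(tag, value)
```

Dropping the declarations before parsing avoids the bytes conversion, because each chunk is then a plain string without an encoding declaration.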

Related

Splitting large xml file into multiple files by using beautifulsoup

I am trying to split a large xml file into smaller ones. I started off with BeautifulSoup:
from bs4 import BeautifulSoup
import os
# Core settings
rootdir = r'C:\Users\XX\Documents\Grant Data\2010_xml'
extension = ".xml"
to_save = r'C:\Users\XX\Documents\all_patents_as_xml'
index = 0
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(extension):
            print(file)
            file_name = os.path.join(root,file)
            with open(file_name) as f:
                data = f.read()
            texts = data.split('?xml version="1.0" encoding="UTF-8"?')
            for text in texts:
                index += 1
                filename = to_save + "\\"+ str(index) + ".txt"
                with open(filename, 'w') as f:
                    f.write(text)
However, I got a memory error. Then I switched to xml etree:
from xml.etree import ElementTree as ET
import re
file_name = r'C:\Users\XX\Documents\Grant Data\2010_xml\2010cat_xml.xml'
with open(file_name) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
parser = ET.iterparse(tree)
to_save = r'C:\Users\Yilmaz\Documents\all_patents_as_xml'
index = 0
for event, element in parser:
    # element is a whole element
    if element.tag == '?xml version="1.0" encoding="UTF-8"?':
        index += 1
        filename = to_save + "\\"+ str(index) + ".txt"
        with open(filename, 'w') as f:
            f.write(ET.tostring(element))
    # do something with this element
    # then clean up
    element.clear()
and I get the following error:
OverflowError: size does not fit in an int
I am using the Windows operating system. I know that in Linux you can split the XML files from the console, but in my case I don't know what to do.
If your XML cannot be loaded because of memory limits, you should consider using SAX.
With SAX you read the document in small pieces and do whatever you want with each piece (for example, save every N elements to a new file).
Python SAX example 1.
Python SAX example 2.
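As a rough illustration of the SAX style (a sketch only, not the code from the linked examples), the handler below collects the text of one element type while the parser streams through the input. The class name TextCollector, the element name title and the sample bytes are all made up for the demonstration. Note that SAX still expects a single well-formed document.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect the text of every element with a given name while streaming."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.buffer = None  # None while the parser is outside a target element
        self.found = []

    def startElement(self, name, attrs):
        if name == self.target:
            self.buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.target and self.buffer is not None:
            self.found.append("".join(self.buffer))
            self.buffer = None


# Hypothetical input; a real run would pass a file name or an open file instead.
SAMPLE = (b"<patents><patent><title>A</title></patent>"
          b"<patent><title>B</title></patent></patents>")

handler = TextCollector("title")
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.found)
```

Only the text currently being collected is held in memory, which is what makes this style suitable for very large inputs.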
There are major issues with your question and your attempts at solving it:
You mention using Beautiful Soup. However, while you import Beautiful Soup in your code, you don't actually do anything with it.
The code you show that uses xml.etree is grossly incorrect. At the line parser = ET.iterparse(tree), tree is an XML tree already parsed with ET.fromstring, but the argument to iterparse must either be a file name or a file object. An XML tree is neither of those. So that attempt is dead on arrival.
But more importantly, it looks like what you are trying to process is a file which contains a bunch of concatenated XML files. In your xml.etree attempt you have this test:
element.tag == '?xml version="1.0" encoding="UTF-8"?'
The only intent I can imagine for this test is that you think that xml.etree will somehow interpret <?xml version="1.0" encoding="UTF-8"?> as an XML element which has a name of '?xml version="1.0" encoding="UTF-8"?'. However, the structure <?xml version="1.0" encoding="UTF-8"?> is not an XML element, it is an XML declaration.
And since your code seems to be attempting to split every time an XML declaration is encountered, it seems that your input is a file that contains multiple XML declarations. This file is not valid XML. The XML specification allows the XML declaration to appear once, and only once at the beginning of an XML file. (Don't confuse the XML declaration with a processing instruction. They look similar because they are both delimited by <? and ?>, but the XML declaration is not a processing instruction.) If you use an XML parser on your input file, and this parser conforms to the XML specification, then it has to reject your file as being not XML because XML does not allow XML declarations to appear at random positions in documents.
Where does that leave you? If all XML declarations present in your source document are the same, there's a relatively easy way to make your document parsable by an XML parser. (The attempts you made suggest that they are all the same, since you do not use a regular expression to match different forms of the XML declaration, e.g. one that would specify the standalone parameter.) You can just remove all XML declarations from your source document, wrap it in a new root element, and parse that with xml.etree. (This assumes that the individual XML documents that were concatenated to make up your source document were all individually well-formed. If they weren't, then this won't work.)
Note, however, that the string <?xml version="1.0" encoding="UTF-8"?> can appear in an XML document in contexts where this string is not actually an XML declaration. Here is a well-formed XML document that would throw off an algorithm that just looks for a string that looks like an XML declaration:
<?xml version = "1.0" encoding = "UTF-8"?>
<a>
<![CDATA[
<?xml version = "1.0" encoding = "UTF-8"?>
]]>
<?q <?xml version = "1.0" encoding = "UTF-8"?> ?>
<!-- <?xml version = "1.0" encoding = "UTF-8"?> -->
</a>
If you know how your source file was created, you may already be able to know for sure that you don't have any of the cases above. Otherwise, you may want to examine your source to make sure none of the above happens.
Once you take care of this, then using a strategy based on ET.iterparse, or SAX should work.
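If the goal is only to split the file, one possible sketch is to stream it line by line and start a new output file at each declaration, so the whole input never sits in memory. This assumes every declaration starts its own line and that none of the edge cases shown above occur at the start of a line. The function name split_concatenated_xml and the sample content are hypothetical.

```python
import os
import tempfile


def split_concatenated_xml(source_path, target_dir, marker="<?xml"):
    """Write each concatenated document to its own numbered file; return the paths."""
    paths = []
    out = None
    with open(source_path, encoding="utf-8") as src:
        for line in src:
            # A declaration at the start of a line begins a new document.
            if line.lstrip().startswith(marker):
                if out is not None:
                    out.close()
                path = os.path.join(target_dir, "{}.xml".format(len(paths) + 1))
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
            # Anything before the first declaration is ignored.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths


# Small demonstration on a temporary file with two concatenated documents.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "combined.xml")
    with open(source, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n<a>1</a>\n'
                '<?xml version="1.0" encoding="UTF-8"?>\n<b>2</b>\n')
    parts = split_concatenated_xml(source, tmp)
    contents = []
    for part in parts:
        with open(part, encoding="utf-8") as f:
            contents.append(f.read())

print(len(contents))
```

Each resulting file has exactly one declaration at the top, so it can then be handed to ET.iterparse or a SAX parser on its own.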

Write Open Office XML (e.g. docx) with XML that matches the OOXML namespace

I have a python program that edits the XML in a .docx file. I'd like to edit the XML with ETree.
When I read the XML from the .docx file, it begins like this:
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.micro'...
This is in a variable called data. I create the element tree with:
import xml.etree.ElementTree as ElementTree
tree = ElementTree.XML(data)
I convert it back with:
data = ElementTree.tostring(tree)
However, there have been subtle changes to the XML. It now looks like this:
b'<ns0:document xmlns:ns0="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:ns1="ht...
Word won't read this, even though it is standard XML.
EDIT: I tried adding the string to my XML, just to get it to round-trip:
XML_HEADER=b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'
tree = ElementTree.XML(data)
data = XML_HEADER + ElementTree.tostring(tree)
But I still get the error:
We're sorry. We can't open <filename>.docx because we found a problem with its contents.
Details:
The XML data is invalid according to the schema.
Location: Part: /word/document.xml, Line: 0, Column:0
I can't fix Word. I've got to generate XML that looks exactly like the XML that I started with. How do I get ETree to generate that?
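One thing that may help, offered as a sketch rather than a confirmed fix: ElementTree invents prefixes such as ns0 for any namespace it has not been told about, and ElementTree.register_namespace lets you supply the original prefixes. The helper name register_all_namespaces and the tiny sample document are made up here. Whether Word then accepts the result has not been verified, because ElementTree only writes declarations for namespaces that are actually used in element or attribute names, and a real document.xml may rely on others.

```python
import io
import xml.etree.ElementTree as ET

XML_HEADER = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def register_all_namespaces(xml_bytes):
    """Register every prefix declared in the source so serialization reuses it."""
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns"]):
        ET.register_namespace(prefix, uri)


# Hypothetical, heavily trimmed stand-in for the content of word/document.xml.
data = (XML_HEADER
        + b'<w:document xmlns:w="' + W_NS.encode("ascii") + b'">'
        + b"<w:body/></w:document>")

register_all_namespaces(data)
tree = ET.XML(data)
out = ET.tostring(tree, encoding="utf-8")  # no declaration is written here
result = XML_HEADER + out
print(result)
```

Note that register_namespace changes global state for the whole process, so it affects every later serialization as well.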

Convert Deeply Nested XML to CSV in Python [duplicate]

This question already has answers here:
How to parse XML and get instances of a particular node attribute?
(19 answers)
Closed 6 years ago.
I'm new to Python and have heard that it is one of the best ways to parse fairly large XML files (150MB). I can't get my head around how to iterate through the tags and extract only the <hw> and <defunit> tags as it's fairly deeply nested.
I have some XML formatted as below, and I need to extract the "hw" and "defunit" tags from it using Python and convert them into a .csv format.
<?xml version="1.0" encoding="UTF-8"?>
<dps-data xmlns="urn:DPS2-metadata" project="SCRABBLELARGE" guid="7d6b7164fde1e064:34368a61:14306b637ab:-8000--4a25ae5c-c104-4c7a-bba5-b434dd4d9314">
<superentry xmlns="urn:COLL" xmlns:d="urn:COLL" xmlns:e="urn:IDMEE" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-336">
<entry publevel="1" id="a000001" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-335">
<hwblk>
<hwgrp>
<hwunit>
<hw>aa</hw>
<ulsrc>edsh</ulsrc>
</hwunit>
</hwgrp>
</hwblk>
<datablk>
<gramcat publevel="1" id="a000001.001">
<pospgrp>
<pospunit>
<posp value="noun" />
</pospunit>
</pospgrp>
<sensecat id="a000001.001.01" publevel="1">
<defgrp>
<defunit>
<def>volcanic rock</def>
</defunit>
</defgrp>
</sensecat>
</gramcat>
</datablk>
</entry>
</superentry>
</dps-data>
The .csv format I'd like to see it in is simply:
hw, defunit
aa, volcanic rock
How about this:
from xml.dom import minidom
xmldoc = minidom.parse('your.xml')
hw_lst = xmldoc.getElementsByTagName('hw')
defu_lst = xmldoc.getElementsByTagName('def')
with open('your.csv', 'a') as out_file:
    for i in range(len(hw_lst)):
        out_file.write('{0}, {1}\n'.format(hw_lst[i].firstChild.data, defu_lst[i].firstChild.data))
Consider XSLT, the XML transformation language that can manipulate source .xml files to various end use structures including text files like .csv, specifying method="text" in <xsl:output>.
Python's lxml module can run XSLT 1.0 scripts. The script below assumes the <entry> tag and its children repeat with different data. The two default (unprefixed) namespaces in the source, urn:DPS2-metadata and urn:COLL, have to be bound to prefixes in the xsl. Also, XSLT tends to be very efficient on smaller XML files, but performance varies with the computing environment.
XSLT Script (save as .xsl to be referenced below)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:ns1="urn:DPS2-metadata" xmlns:ns2="urn:COLL">
  <xsl:output encoding="UTF-8" method="text"/>
  <xsl:strip-space elements="*"/>
  <!-- dps-data is in urn:DPS2-metadata; superentry and its descendants are in urn:COLL -->
  <xsl:template match="/ns1:dps-data/ns2:superentry">
    <xsl:text>hw,defunit</xsl:text><xsl:text>&#10;</xsl:text>
    <xsl:apply-templates select="ns2:entry"/>
  </xsl:template>
  <xsl:template match="ns2:entry">
    <xsl:value-of select="descendant::ns2:hw"/><xsl:text>,</xsl:text>
    <xsl:value-of select="descendant::ns2:defunit"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:transform>
Python Script
import lxml.etree as ET
# LOAD XML AND XSL SOURCES
xml = ET.parse('Input.xml')
xsl = ET.parse('XSLTScript.xsl')
# TRANSFORM SOURCE
transform = ET.XSLT(xsl)
newdom = transform(xml)
# SAVE AS .CSV
with open('Output.csv', 'w', encoding='utf-8') as f:
    f.write(str(newdom))
# hw,defunit
# aa,volcanic rock
The lxml library is capable of very powerful XML parsing, and can be used to iterate over an XML tree to search for specific elements.
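from lxml import etree
# <hw> and <def> inherit the default namespace declared on <superentry>
NS = '{urn:COLL}'
# read bytes: lxml rejects str input that carries an encoding declaration
with open(r'path/to/xml', 'rb') as xml:
    tree = etree.fromstring(xml.read())
row = ['', '']
with open(r'path/to/csv', 'a') as csv:
    for item in tree.iter(NS + 'hw', NS + 'def'):
        if item.tag == NS + 'hw':
            row[0] = item.text
        elif item.tag == NS + 'def':
            row[1] = item.text
            # write one line per definition found
            csv.write(','.join(row) + '\n')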
How you build the CSV file is largely based upon preference, but I have provided a trivial example above. If there are multiple <dps-data> tags, you could extract those elements first (which can be done with the same tree.iter method shown above), and then apply the above logic to each of them.
EDIT: I should point out that this particular implementation reads the entire XML file into memory. If you are working with a single 150 MB file at a time, this should not be a problem, but it's just something to be aware of.
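If memory does become a concern, a streaming variant is possible with the standard library's iterparse. The following is a minimal sketch under the assumption that each <entry> holds one <hw> and one <def>, both in the urn:COLL namespace as in the sample. The function name entries_to_rows and the trimmed SAMPLE document are made up for the demonstration.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{urn:COLL}"  # namespace inherited by <entry>, <hw> and <def>

# Hypothetical, trimmed version of the sample document from the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<dps-data xmlns="urn:DPS2-metadata">
<superentry xmlns="urn:COLL">
<entry id="a000001">
<hwblk><hwgrp><hwunit><hw>aa</hw></hwunit></hwgrp></hwblk>
<datablk><gramcat><sensecat><defgrp><defunit>
<def>volcanic rock</def>
</defunit></defgrp></sensecat></gramcat></datablk>
</entry>
</superentry>
</dps-data>"""


def entries_to_rows(source):
    """Yield (hw, definition) per <entry>, clearing each entry once it is read."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "entry":
            hw = elem.findtext(".//" + NS + "hw", default="")
            definition = elem.findtext(".//" + NS + "def", default="")
            yield hw, definition
            elem.clear()  # drop the entry's children to keep memory use low


rows = list(entries_to_rows(io.BytesIO(SAMPLE)))

# Write to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["hw", "defunit"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Using the csv module instead of manual string joins also takes care of quoting definitions that contain commas.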

Transferring Excel data to XML

I'm new to handling XML in Python, so go easy on me.
I've been trying to transfer my Excel data to an XML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<shelter>
<adress>..</adress>
<code>...</code>
<neighborhood>..</neighborhood>
</shelter>
<shelter>
<adress>...</adress>
<code>...</code>
<neighborhood>...</neighborhood>
</shelter>
</xml>
My Excel spreadsheet looks like this:
Rather simple, right? I tried a couple of methods in Excel, and I also tried writing a script to do it, but I can't seem to make it work.
Any ideas?
Thanks a lot!
If this is just a one-time transformation you can use the CONCATENATE function to create the content of your XML. Put this formula in the D column for all rows:
=CONCATENATE("<shelter><adress>",A2,"</adress><code>",B2,"</code><neighborhood>",C2,"</neighborhood></shelter>")
Then copy the text to a new file, add the appropriate XML tags on the first and last line and you are done.
If you need to do this in Python, save the file as CSV such that you have something like (note that the header line is removed from this file):
adress1,1,n1
adress2,2,n1
adress3,3,n1
Then you can use the following Python script to get the desired output:
print('<?xml version="1.0" encoding="UTF-8"?>')
print('<xml>')
with open('test.csv', 'r') as f:
    for x in f:
        # strip the trailing newline before splitting into the three columns
        splitted = x.strip().split(',')
        print("""<shelter>
<adress>{0}</adress>
<code>{1}</code>
<neighborhood>{2}</neighborhood>
</shelter>""".format(splitted[0], splitted[1], splitted[2]))
print('</xml>')
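String formatting like this breaks as soon as a cell contains a character such as & or <. As an alternative sketch, the csv module can read the rows and ElementTree can build the output, which escapes such characters automatically. The function name rows_to_xml and the sample rows (including the made-up "R&D street" entry) are hypothetical. The element names follow the spelling used in the question.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Build the <xml>/<shelter> structure from (address, code, neighborhood) rows."""
    root = ET.Element("xml")
    for address, code, neighborhood in rows:
        shelter = ET.SubElement(root, "shelter")
        ET.SubElement(shelter, "adress").text = address
        ET.SubElement(shelter, "code").text = code
        ET.SubElement(shelter, "neighborhood").text = neighborhood
    return ET.tostring(root, encoding="unicode")


# Hypothetical CSV content; a real run would open the exported file instead.
CSV_TEXT = "adress1,1,n1\nadress2,2,n1\nR&D street,3,n1\n"
rows = list(csv.reader(io.StringIO(CSV_TEXT)))

body = rows_to_xml(rows)
document = '<?xml version="1.0" encoding="UTF-8"?>\n' + body
print(document)
```

The csv reader also handles quoted cells that contain commas, which a plain split(',') would cut in the wrong place.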

How to parse the xml to find the text value of the following node in python?

Assuming I have a sample configuration XML file that is the following:
<?xml version="1.0"?>
<note>
<to>Tove</to>
<infoaboutauthor>
<nestedprofile>
<aboutme>
<gco:CharacterString>I am a 10th grader who likes to play ball.</gco:CharacterString>
</aboutme>
</nestedprofile>
</infoaboutauthor>
<date>
<info_date>
<date>
<gco:Date>2003-06-13</gco:Date>
</date>
<datetype>
<datetype attribute="Value">
</datetype>
</datetype>
</info_date>
</date>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
In python (tried using ElementTree, not sure if its the best) I would like to get certain values for certain tags. I have tried:
with open('testfile.xml', 'rt') as f:
    tree = ElementTree.parse(f)
print('Parsing')
root = tree.getroot()
listofelements = root.findall('gco:CharacterString')
for elementfound in listofelements:
    print(elementfound.text)
In the code I use above, it appears to not work when I have the colon as I get the following error:
SyntaxError: prefix 'gco' not found in prefix map
My goal is to
get the text in the "2003-06-13" tag
the text in the "aboutme" tag
What is the best way to accomplish this? Is there some way to look up "gco:CharacterString" where parent is equal to "aboutme"? Or is there some convenient way to get it into a dict where I can go mydict['note']['to']['nestedprofile']['aboutme']?
Note: The "gco:" prefix is something that I have to deal with that is part of the xml. If elementtree is not appropriate for this, that is okay.
Firstly, your XML is broken. The - in line 2 is breaking the parser. Also, I don't think it likes the gco: prefixes. Can you possibly use some other XML configuration? Or is this automatically generated by something out of your control?
So here's what the XML needs to look like for this to work with Python:
<?xml version="1.0"?>
<note>
<to>Tove</to>
<infoaboutauthor>
<nestedprofile>
<aboutme>
<CharacterString>I am a 10th grader who likes to play ball.</CharacterString>
</aboutme>
</nestedprofile>
</infoaboutauthor>
<date>
<info_date>
<date>
<Date>2003-06-13</Date>
</date>
<datetype>
<datetype attribute="Value">
</datetype>
</datetype>
</info_date>
</date>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
And here's the code to accomplish your two goals:
from xml.etree import ElementTree
# Get the element tree from the file name and not a file object
tree = ElementTree.parse('config.xml')
# Get the root of the tree
root = tree.getroot()
# To get the 'Date' tag and print its text
date_tag = root.find('date').find('info_date').find('date').find('Date')
print(date_tag.text)
# Get the `aboutme` tag and print its text
about_me_tag = root.find('infoaboutauthor').find('nestedprofile').find('aboutme').find('CharacterString')
print(about_me_tag.text)
UPDATE
As far as dealing with the "gco:"s goes, you could do something like this:
def replace_in_config(old, new):
    with open('config.xml', 'r') as f:
        text = f.read()
    with open('config.xml', 'w') as f:
        f.write(text.replace(old, new))
Then before you do the above XML operations run:
replace_in_config('gco:', '_stripped')
Then, after the XML operations are done, run this (of course you will need to account for the fact that the gco:Date tag is now _strippedDate and the gco:CharacterString tag is now _strippedCharacterString):
replace_in_config('_stripped', 'gco:')
This will preserve the original format and allow you to parse it with etree.
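If the real file does declare the gco prefix somewhere (the sample above does not), no text replacement should be needed, because ElementTree's find methods accept a prefix-to-URI map. The sketch below assumes a declaration on the root element, and the URI used is an assumption that must be replaced by whatever the actual document declares. SAMPLE is a trimmed, hypothetical version of the file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI: use the one your real document declares for gco.
NS = {"gco": "http://www.isotc211.org/2005/gco"}

# Hypothetical, trimmed sample with the gco prefix declared on the root element.
SAMPLE = """<?xml version="1.0"?>
<note xmlns:gco="http://www.isotc211.org/2005/gco">
  <infoaboutauthor><nestedprofile><aboutme>
    <gco:CharacterString>I am a 10th grader who likes to play ball.</gco:CharacterString>
  </aboutme></nestedprofile></infoaboutauthor>
  <date><info_date><date>
    <gco:Date>2003-06-13</gco:Date>
  </date></info_date></date>
</note>"""

root = ET.fromstring(SAMPLE)

# The path names the parent, so only the CharacterString under <aboutme> matches.
about_me = root.find(".//aboutme/gco:CharacterString", NS).text
date = root.find(".//info_date/date/gco:Date", NS).text

print(about_me)
print(date)
```

The prefix used in the map only has to match the prefix used in the path expression; what ElementTree compares against the document is the URI.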
I don't think your XML document is valid as the 'gco' namespace has not been defined.
I can't find a way to supply the definition to lxml as part of the parse command. It's possible you could manipulate the document to add the definition or remove the prefix as suggested by @mjgpy3.
Another approach might be to use the HTML parser, as this is much less strict about what it will accept. Be aware, though, that this will make changes to the structure of the data to add HTML headers and such.
from lxml import etree
Parser = etree.HTMLParser()
XMLDoc = etree.parse(open('C:/Temp/Test.xml', 'r'), Parser)
Elements = XMLDoc.xpath('//characterstring')
for Element in Elements:
    print(Element.text)
