Iteratively parse a large XML file without using the DOM approach - python

I have an xml file
<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
.
.
<email id="998349883487454359203" Body="hi"/>
</temp>
I want to read the XML file one email tag at a time. That is, I want to read email id=1 and extract its Body, then read email id=2 and extract its Body, and so on.
I tried the DOM model for XML parsing, but since my file is 100 GB that approach does not work. I then tried using:
from xml.etree import ElementTree as ET
tree=ET.parse('myfile.xml')
root=ET.parse('myfile.xml').getroot()
for i in root.findall('email/'):
    print i.get('Body')
Even once I have the root, I do not understand why my code is unable to parse the file.
With iterparse, the code throws the following error:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 437: ordinal not in range(128)"
Can somebody help?

An example for iterparse:
import cStringIO
from xml.etree.ElementTree import iterparse
fakefile = cStringIO.StringIO("""<temp>
<email id="1" Body="abc"/>
<email id="2" Body="fre"/>
<email id="998349883487454359203" Body="hi"/>
</temp>
""")
for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print elem.attrib['id'], elem.attrib['Body']
    elem.clear()
Just replace fakefile with your real file.
Also read this for further details.
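For a file as large as the one in the question, memory only stays flat if the processed elements are also dropped from the root, because elem.clear() leaves the emptied email elements attached to it. Here is a minimal Python 3 sketch, assuming the flat temp/email layout from the question; iter_email_bodies is a hypothetical helper name, and the in-memory BytesIO stands in for the real file:

```python
import io
from xml.etree.ElementTree import iterparse


def iter_email_bodies(source):
    """Yield (id, Body) for each <email>, freeing elements as we go.

    `source` is a filename or a binary file object. The helper name is
    hypothetical; only iterparse() itself comes from the standard library.
    """
    context = iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "email":
            yield elem.get("id"), elem.get("Body")
            root.clear()  # drop finished children so the tree never grows


# Small in-memory stand-in for the real file (assumption for the demo).
sample = io.BytesIO(
    b'<temp>'
    b'<email id="1" Body="abc"/>'
    b'<email id="2" Body="fre"/>'
    b'<email id="3" Body="hi"/>'
    b'</temp>'
)
bodies = list(iter_email_bodies(sample))
for email_id, body in bodies:
    print(email_id, body)
```

Passing the real file name (or a file opened in binary mode) instead of sample should work the same way, since iterparse accepts either.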

Related

Editing existing XML file and sending post via Jboss

I have the following Python method, which runs through an XML file, parses it, and TRIES to edit a field:
import requests
import xml.etree.ElementTree as ET
import random
def runThrougheTree():
    #open xml file
    with open("testxml.xml") as xml:
        from lxml import etree
        #parse
        parser = etree.XMLParser(strip_cdata=True, recover=True)
        tree = etree.parse("testxml.xml", parser)
        root= tree.getroot()
        #ATTEMPT to edit field - will not work as of now
        for ci in root.iter("CurrentlyInjured"):
            ci.text = randomCurrentlyInjured(['sffdgdg', 'sdfsdfdsfsfsfsd','sfdsdfsdfds'])
        #Overwrite initial xml file with new fields - will not work as of now
        etree.ElementTree(root).write("testxml.xml",pretty_print=True, encoding='utf-8', xml_declaration=True)
        #send post (Jboss)
        requests.post('http://localhost:9000/something/RuleServiceImpl', data="testxml.xml")
def randomCurrentlyInjured(ran):
    random.shuffle(ran)
    return ran[0]
#-----------------------------------------------
if __name__ == "__main__":
    runThrougheTree()
Edited XML file:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:rule="http://somewebsite.com/" xmlns:ws="http://somewebsite.com/" xmlns:bus="http://somewebsite.com/">
<soapenv:Header/>
<soapenv:Body>
<ws:Respond>
<ws:busMessage>
<bus:SomeRef>insertnumericvaluehere</bus:SomeRef>
<bus:Content><![CDATA[<SomeDef>
<SomeType>ABCD</SomeType>
<A_Message>
<body>
<AnonymousField>
<RefIndicator>1111111111111</RefIndicator>
<OneMoreType>HIJK</OneMoreType>
<CurrentlyInjured>ABCDE</CurrentlyInjured>
</AnonymousField>
</body>
</A_Message>
</SomeDef>]]></bus:Content>
<bus:MessageTypeId>somenumericvalue</bus:MessageTypeId>
</ws:busMessage>
</ws:Respond>
</soapenv:Body>
</soapenv:Envelope>
Issues:
The field is not being edited.
Jboss error: Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
Note: I have ensured that there are no characters prior to the first XML tag.
In the end, I was unable to use lxml or ElementTree to edit the fields and post to JBoss, because:
I had CDATA in the XML, as mzjn pointed out in the comments.
JBoss did not like the request after it had been parsed, even when the CDATA tags were removed.
Workaround/eventual solution: I was able to (somewhat tediously) use .replace() in my script to edit the plain text successfully, and then send the POST to JBoss. I hope this helps someone else someday!
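The plain-text workaround can be sketched roughly as follows. This is not the original script: replace_field is a hypothetical helper, it uses a regular expression rather than chained .replace() calls, and it assumes the tag has no attributes and the new value contains no markup. Because the document is never parsed, the CDATA section survives untouched. Note also that requests.post(..., data="testxml.xml") sends the literal file name as the request body, which on its own would produce "Content is not allowed in prolog"; the file's contents have to be passed instead.

```python
import re


def replace_field(xml_text, tag, new_value):
    """Replace the text of every <tag>...</tag> pair in raw XML text.

    Hypothetical helper: it works on the plain string, so CDATA stays as is.
    Assumes the tag has no attributes and new_value contains no markup.
    """
    pattern = re.compile(r"(<{0}>).*?(</{0}>)".format(re.escape(tag)), re.DOTALL)
    return pattern.sub(lambda m: m.group(1) + new_value + m.group(2), xml_text)


# Trimmed-down stand-in for the SOAP payload in the question.
sample = (
    "<bus:Content><![CDATA[<SomeDef>"
    "<CurrentlyInjured>ABCDE</CurrentlyInjured>"
    "</SomeDef>]]></bus:Content>"
)
edited = replace_field(sample, "CurrentlyInjured", "No")
print(edited)

# Posting would then send the edited text itself, not the file name, e.g.:
# requests.post(url, data=edited.encode("utf-8"),
#               headers={"Content-Type": "text/xml; charset=utf-8"})
```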

Reading xml and trying to extract it into 2 different xml's

I have an XML file that looks like this:
<?xml version="1.0" encoding="ASCII" standalone="yes"?>
<file>
<records>
<record>
<device_serial_number>PAD203137687</device_serial_number>
<device_serial_number_2>203137687</device_serial_number_2>
</record>
<record>
<device_serial_number>PAD203146024</device_serial_number>
<device_serial_number_2>203146024</device_serial_number_2>
</record>
</records>
</file>
Now I want to check device_serial_number in each record: if its last 4 characters are 6024, the complete record should be written to a new XML file named one.xml.
I have tried the following:
from xml.etree import ElementTree as ET
tree = ET.parse('C:\\Users\\x3.xml')
for node in tree.findall('.//records//record/'):
    print("<"+str(node.tag) + "> "+"<"+str(node.text)+"/>")
So from what I understand, you can try something like below:
from xml.etree import ElementTree as ET
from xml.dom.minidom import parseString
tree = ET.parse('C:\\Users\\x3.xml')
root = tree.getroot()
#select every <record> whose device_serial_number ends in '6024'
recs = [r for r in root.findall('./records/record')
        if r.findtext('device_serial_number', default='').endswith('6024')]
#rebuild the <file><records>...</records></file> skeleton around the matches
new_root = ET.Element(root.tag)
new_records = ET.SubElement(new_root, 'records')
new_records.extend(recs)
#hand the result to minidom so that writexml() can be used for output
newdoc = parseString(ET.tostring(new_root))
with open('one.xml', 'w') as out:
    newdoc.writexml(out,
                    indent="",
                    addindent="",
                    newl='') #check documentation for these
Here is the link for the documentation on writing XML files.
Node.writexml(writer, indent="", addindent="", newl="")
Write XML to the writer object. The writer should have a write() method which matches that of the file object interface. The indent parameter is the indentation of the current node. The addindent parameter is the incremental indentation to use for subnodes of the current one. The newl parameter specifies the string to use to terminate newlines.
The above is from the xml.dom.minidom documentation, which explains how to write and what the parameters mean.
This lets you write the required data, in XML format, to the file you pass to writexml.
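To see what the writexml parameters do in practice, here is a small sketch that writes the same document twice into in-memory buffers. The sample XML and the variable names are made up for illustration:

```python
import io
from xml.dom.minidom import parseString

# Made-up sample document for the demonstration.
doc = parseString(
    "<records><record><serial>PAD203146024</serial></record></records>"
)

# Defaults (indent="", addindent="", newl=""): everything on one line.
compact = io.StringIO()
doc.writexml(compact)

# addindent sets the extra indentation per nesting level,
# newl is the string written at the end of each line.
pretty = io.StringIO()
doc.writexml(pretty, indent="", addindent="  ", newl="\n")

print(compact.getvalue())
print(pretty.getvalue())
```

If the parsed document already contains whitespace between elements, pretty-printing it again tends to add extra blank lines, so the empty-string defaults are usually the safer choice for a document that was read from a formatted file.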

How to remove all " \n" in xml payload by using lxml library

I'm trying to change a text value in an XML file and return the updated XML content using the lxml library. I am able to update the value successfully, but the updated XML contains "\n" (newline) characters, as shown below.
Output:
<?xml version='1.0' encoding='ASCII'?>\n<Order>\n <content>\n <sID>123</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
<content>\n <sID>111</sID>\n <spNumber>UserTemp</spNumber>\n <client>ARRCHANA</client>\n <orderType>Dashboard</orderType>\n </content>\n
</Order>
Note: I didn't format the above XML output; I posted it exactly as it appears in the output console.
Input:
<Order>
<content>
<sID>123</sID>
<spNumber>UserTemp</spNumber>
<client>WALLMART</client>
<orderType>Dashboard</orderType>
</content>
<content>
<sID>111</sID>
<spNumber>UserTemp</spNumber>
<client>D&amp;B</client>
<orderType>Dashboard</orderType>
</content>
</Order>
I also tried to remove the \n characters from the output XML using
getValue = getValue.replace('\n','')
but no luck.
Below is the code I used to update the XML (the client tag) and return the updated content.
Python Code:
from lxml import etree
from io import StringIO
import six
import numpy
def getListOfNodes(location):
    f = open(location)
    xml = f.read()
    f.close()
    #print(xml)
    getXml = etree.parse(location)
    for elm in getXml.xpath('.//Order//content/client'):
        index='ARRCHANA'
        elm.text=index
    #with open('C:\\New folder\\temp.xml','w',newline='\r\n') as writeFile:
    #writeFile.write(str(etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    getValue=str((etree.tostring(getXml,pretty_print=True, xml_declaration=True)))
    #getValue = getValue.replace('\n','')
    #getValue=getValue.replace("\n","<br/>")
    print(getValue)
    return getValue
When I try to open the response payload in Firefox, it shows the following error message:
XML Parsing Error: no element found Location:
file:///C:/New%20folder/Confidential.xml
Line Number 1, Column 1:
It reports "no element found" at line 1, column 1 of the XML file when it encounters the "\n" characters.
Can somebody suggest a better way to update the text value and return the content without any additional characters?
I fixed it myself by using the script below:
code = root.xpath('.//Order//content/client')
if code:
    code[0].text = 'ARRCHANA'
etree.ElementTree(root).write(r'D:\test.xml', pretty_print=True)
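The literal \n sequences in the question's output most likely come from calling str() on a bytes object: tostring() returns bytes by default, and str() of bytes gives its repr, with a b'...' wrapper and escaped newlines, so replace('\n', '') finds nothing to remove. Here is a minimal sketch of the difference. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed (an assumption made for portability); lxml's etree.tostring likewise returns bytes unless encoding="unicode" is passed.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Order>\n  <content><client>WALLMART</client></content>\n</Order>"
)
for client in root.iter("client"):
    client.text = "ARRCHANA"

raw = ET.tostring(root)          # bytes, not str
wrong = str(raw)                 # repr of the bytes: backslash-n, b'...' wrapper
right = raw.decode("utf-8")      # real text with real newlines
also_right = ET.tostring(root, encoding="unicode")  # str directly

print(wrong)
print(right)
```

Decoding (or asking for a str up front) keeps the newlines as ordinary whitespace, which browsers and XML parsers accept between elements.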

XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
<codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
<docDscr>
<citation>
<titlStmt>
<titl>Test Title</titl>
</titlStmt>
<prodStmt>
<prodDate/>
</prodStmt>
</citation>
</docDscr>
<stdyDscr>
<citation>
<titlStmt>
<titl>Test Title 2</titl>
<IDNo agency="UKDA">101</IDNo>
</titlStmt>
<rspStmt>
<AuthEnty>TestAuthEntry</AuthEnty>
</rspStmt>
<prodStmt>
<copyright>Yes</copyright>
</prodStmt>
<distStmt/>
<verStmt>
<version date="">1</version>
</verStmt>
</citation>
<stdyInfo>
<subject>
<keyword>2009</keyword>
<keyword>2010</keyword>
<topcClas>CLASS</topcClas>
<topcClas>ffdsf</topcClas>
</subject>
<abstract>This is an abstract piece of text.</abstract>
<sumDscr>
<timePrd event="single">2020</timePrd>
<nation>UK</nation>
<anlyUnit>Test</anlyUnit>
<universe>test</universe>
<universe>hello</universe>
<dataKind>fdsfdsf</dataKind>
</sumDscr>
</stdyInfo>
<method>
<dataColl>
<timeMeth>test timemeth</timeMeth>
<dataCollector>test data collector</dataCollector>
<sampProc>test sampprocess</sampProc>
<deviat>test deviat</deviat>
<collMode>test collMode</collMode>
<sources/>
</dataColl>
</method>
<dataAccs>
<setAvail>
<accsPlac>Test accsPlac</accsPlac>
</setAvail>
<useStmt>
<restrctn>NONE</restrctn>
</useStmt>
</dataAccs>
<othrStdyMat>
<relPubl>122</relPubl>
<relPubl>12332</relPubl>
</othrStdyMat>
</stdyDscr>
</codeBook>
</metadata>
I wrote the following code to try and process it:
from lxml import etree
import pdb
f = open('/vagrant/out2.xml', 'r')
xml_str = f.read()
xml_doc = etree.fromstring(xml_str)
f.close()
From what I understand from the lxml xpath docs, I should be able to get the text from a specific element as follows:
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
However, when I run this it returns an empty array.
The only xpath I can get to return something is using a wildcard:
xml_doc.xpath('*')
Which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].
I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.
You need to take the default namespace into account so instead of
xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')
use
xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'ddi': 'ddi:codebook:2_5'
    }
)
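The prefixes in the namespaces mapping are arbitrary local aliases; only the URIs have to match the ones declared in the document. Here is a minimal sketch of the same idea on a trimmed-down copy of the document. It uses the standard library's xml.etree.ElementTree so that it runs without lxml (an assumption made for portability). ElementTree has no text() step and its paths are relative to the root element, so findtext() is used instead of the absolute XPath above:

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the document from the question.
xml_str = """<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
  <codeBook version="2.5" xmlns="ddi:codebook:2_5">
    <docDscr><citation><titlStmt><titl>Test Title</titl></titlStmt></citation></docDscr>
  </codeBook>
</metadata>"""

root = ET.fromstring(xml_str)
ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "ddi": "ddi:codebook:2_5",
}

# The parser stores every tag with its namespace URI attached.
print(root.tag)

# Unprefixed names match only elements in no namespace, so this finds nothing.
without_ns = root.findtext("codeBook/docDscr/citation/titlStmt/titl")

# Prefixed names are resolved through the mapping and match.
with_ns = root.findtext(
    "ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl",
    namespaces=ns,
)
print(without_ns, with_ns)
```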

Error in escaping XML for a KML file

Some time ago I asked a question trying to figure out why modifying a KML file increased the file size.
After poking around, I've found that the issue had to do with escaping XML.
Essentially, the "<", ">", and "&" characters were being replaced with:
"&lt;", "&gt;", and "&amp;"
It's not a big deal for smaller files, but the extra characters make a big difference in larger files.
I copied some code from this site to help solve the problem:
import lxml
from lxml import etree
import pykml
from pykml.factory import KML_ElementMaker as KML
from pykml import parser
def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    ## Ampersands must be last to avoid errors in text replacement
    s = s.replace("&amp;", "&")
    return s
with open("myplaces.kml", "rb") as f:
    doc = parser.parse(f).getroot()
a = doc.Document.Folder[0].Folder[1]
for q in GEList:
    x = KML.Folder(KML.name(q))
    a.append(x)
finished = (etree.tostring(doc, pretty_print = True))
finished = unescape(finished)
with open("myplaces.kml", "wb") as f:
    f.write(finished)
Now, however, I'm running into another error. I compared the file before and after I replaced the &lt;, &gt;, and &amp; entities.
Before: <description><![CDATA[<img src="fedland_leg_pop_2.jpg" alt="headerimg" width="550" height="77"><br>
After: <description><img src="fedland_leg_pop_2.jpg" alt="headerimg" width="550" height="77"><br>
Now it seems to be throwing out "<![CDATA[", and I can't figure out why.
I had the same issue but then I found this (https://developers.google.com/kml/documentation/kml_tut#descriptive_html):
Using the CDATA Element
If you want to write standard HTML inside a <description> tag, you can put it inside a CDATA tag. If you don't, the angle brackets need to be written as entity references to prevent Google Earth from parsing the HTML incorrectly (for example, the symbol > is written as &gt; and the symbol < is written as &lt;). This is a standard feature of XML and is not unique to Google Earth.
Consider the difference between HTML markup with CDATA tags and without CDATA. First, here's the <description> with CDATA tags:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>CDATA example</name>
<description>
<![CDATA[
<h1>CDATA Tags are useful!</h1>
<p><font color="red">Text is <i>more readable</i> and
<b>easier to write</b> when you can avoid using entity
references.</font></p>
]]>
</description>
<Point>
<coordinates>102.595626,14.996729</coordinates>
</Point>
</Placemark>
</Document>
</kml>
And here's the <description> without CDATA tags, so that special characters must use entity references:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>Entity references example</name>
<description>
&lt;h1&gt;Entity references are hard to type!&lt;/h1&gt;
&lt;p&gt;&lt;font color="green"&gt;Text is
&lt;i&gt;more readable&lt;/i&gt;
and &lt;b&gt;easier to write&lt;/b&gt;
when you can avoid using entity references.&lt;/font&gt;&lt;/p&gt;
</description>
<Point>
<coordinates>102.594411,14.998518</coordinates>
</Point>
</Placemark>
</Document>
</kml>
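Both forms carry the same character data, and a parser hands back identical text for each. That is why a serializer can write entity references where the input had CDATA: the CDATA wrapper is only a way of writing the text and is usually not kept after parsing (lxml's XMLParser, for example, has a strip_cdata option that defaults to True). A small sketch with the standard library; the HTML snippet is made up for illustration:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Made-up HTML snippet for the demonstration.
html = "<h1>Title</h1><p>Fish & chips</p>"

# Same content written two ways: wrapped in CDATA, or entity-escaped.
with_cdata = "<description><![CDATA[" + html + "]]></description>"
with_entities = "<description>" + escape(html) + "</description>"

text_a = ET.fromstring(with_cdata).text
text_b = ET.fromstring(with_entities).text

print(with_entities)
print(text_a == text_b == html)
```

Unescaping the serialized file by hand, as in the question, turns the escaped HTML back into raw markup inside description without restoring the CDATA wrapper, so the result is no longer the same XML.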
