Parse Heavy XML into Ordered Dictionary

I am parsing XML in Python 3.x. The code below works without problems for files up to about 300 MB, but when the file size grows to 500 MB or into the gigabyte range, I run into memory issues.
from collections import OrderedDict
from lxml import etree  # xml.etree.ElementTree exposes the same calls

# xmlfile2 is the path to the XML file, defined elsewhere
tree2 = etree.parse(xmlfile2)  # loads the whole document into memory
root2 = tree2.getroot()
df_list2 = []
for child in root2:
    # The <header> elements carry no parameters, so only managedObject is read
    for managed_object in child.findall('{raml20.xsd}managedObject'):
        xml_class_name2 = managed_object.get('class')
        xml_dist_name2 = managed_object.get('distName')
        for subchild in managed_object:
            # One ordered record per <p> parameter
            df_dict2 = OrderedDict()
            df_dict2['MOClass'] = xml_class_name2
            df_dict2['CellDN'] = xml_dist_name2
            df_dict2['Parameter'] = subchild.attrib.get('name')
            df_dict2['CurrentValue'] = subchild.text
            df_list2.append(df_dict2)
I came across various articles explaining the use of 'iterparse', but I cannot work out how to use it while still saving the XML data in order.
The format of my XML is below:
<raml version="2.0" xmlns="raml20.xsd">
<cmData type="plan" scope="all" name="XML_Plan_update.xml">
<header>
<log dateTime="2018-12-31T16:13:28" action="created" appInfo="PlanExporter"/>
</header>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-137/WNBTS-1/WNCEL-27046" operation="update">
<p name="defaultCarrier">10787</p>
<p name="lCelwDN">MRBTS-137/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-4</p>
<p name="maxCarrierPower">460</p>
</managedObject>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-6770/WNBTS-1/WNCEL-26925" operation="update">
<p name="defaultCarrier">10787</p>
<p name="lCelwDN">MRBTS-6770/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-5</p>
<p name="maxCarrierPower">460</p>
</managedObject>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-806/WNBTS-1/WNCEL-22661" operation="update">
<p name="defaultCarrier">10762</p>
<p name="lCelwDN">MRBTS-806/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-9</p>
<p name="maxCarrierPower">460</p>
</managedObject>
I am currently using cElementTree or lxml to parse the XML and save the output of the for loop in an OrderedDict. Each dict is appended to a list.
I am looking for a way to use the iterparse method to parse the above XML into ordered dicts.
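
One possible approach, as a minimal sketch rather than a definitive answer: stream the file with iterparse from the standard library's xml.etree.ElementTree, build the rows each time a managedObject element closes, and then clear that element so its children can be freed. The function name iter_parameters is made up for illustration, and the namespace is assumed to be the raml20.xsd one from the sample above.

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

NS = "{raml20.xsd}"  # assumption: namespace as declared in the sample XML


def iter_parameters(source):
    """Yield one OrderedDict per child of each managedObject, in document order.

    `source` is a file path or a binary file object. The name
    iter_parameters is hypothetical, chosen for this sketch.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "managedObject":
            continue
        # The element is complete at its "end" event, so all <p> children exist
        for p in elem:
            row = OrderedDict()
            row["MOClass"] = elem.get("class")
            row["CellDN"] = elem.get("distName")
            row["Parameter"] = p.get("name")
            row["CurrentValue"] = p.text
            yield row
        # Drop the children and attributes of the finished element
        elem.clear()
```

Because this is a generator, rows can be consumed one at a time, or written out in batches, instead of being collected in one large list. Note that the emptied managedObject shells stay attached to their parent, so memory still grows slowly with the number of objects, only far less than with a full parse.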

Related

How can we read an unstructured XML file in PySpark

<editors>
<p poid="1232" class="odo">
<person id="1232">Rob Jhon</person>
<br /> **this text need to be read**
<br />
<title>Sto items:</title> **"this text need to be read"**
<br />
<title>Recent items:</title> **this text need to be read**
</p>
</editors>
As you can see, my dataset contains some strings that are not wrapped in any tag.
How can I read this XML properly in PySpark so that these strings also show up as a column?

If the XML is in a file called "data.xml", you could start with:
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
print(root[0][1].tail)
This works for me.
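
Building on the .tail attribute used above, here is a rough sketch that collects every untagged string under an element instead of just one. It uses plain ElementTree only and does not touch PySpark; the helper name untagged_text is made up for this example.

```python
import xml.etree.ElementTree as ET


def untagged_text(parent):
    """Return the non-empty text that follows each child of `parent`.

    In ElementTree, text that sits after a child's closing tag is stored
    in that child's .tail attribute. The name untagged_text is hypothetical.
    """
    return [
        child.tail.strip()
        for child in parent
        if child.tail and child.tail.strip()
    ]
```

With the sample above, calling untagged_text(root[0]) would walk the children of the p element and return the three marked strings in document order. The resulting list could then be turned into a column in whatever way suits the rest of the pipeline.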

Add values dynamically in XML string using python

I am new to XML and stuck on one feature. I have a list and an XML string (the structure of the XML is not fixed). In the XML string I have defined a placeholder (in my case "{some_values}") with the same name as the list. When my code runs, I want the placeholder to be recognised and the values in the list to be inserted dynamically at run time.
some_values=[1,2,3]
Input XML:
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>{some_values}</intA>
</Add>
</Body>
</Envelope>
Output XML:
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>1</intA>
<intA>2</intA>
<intA>3</intA>
</Add>
</Body>
</Envelope>
I need an approach or solution for this problem. I have looked at some Python XML parsing libraries and read that an XML string can also be handled with Python templating, but I could not find a solution that fits this particular problem.

Try something along these lines:
import lxml.etree as ET
parser = ET.XMLParser()
some_values=[1,2,3]
content='''<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>{some_values}</intA>
</Add>
</Body>
</Envelope>
'''
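tree = ET.fromstring(content, parser)
# Locate the placeholder element regardless of its namespace
item = tree.xpath('.//*[local-name()="intA"]')
par = item[0].getparent()
# Insert in reverse so the new elements end up in list order
for val in reversed(some_values):
    new = ET.XML(f'<intA>{val}</intA>')
    par.insert(par.index(item[0]) + 1, new)
# Drop the original placeholder element
par.remove(item[0])
print(ET.tostring(tree).decode())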
Output (you can fix the formatting later):
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>1</intA><intA>2</intA><intA>3</intA></Add>
</Body>
</Envelope>
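
For comparison, a similar idea can be sketched with only the standard library. This variant matches any element whose text equals the placeholder, whatever its tag, and copies the tag (including its namespace) onto each new element. The function name expand_placeholder is invented for this sketch and is not part of any library.

```python
import xml.etree.ElementTree as ET


def expand_placeholder(xml_text, placeholder, values):
    """Replace each element whose text equals `placeholder` with one copy per value.

    Hypothetical helper for illustration. Returns the root element.
    """
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so it can be modified safely while looping
    for parent in list(root.iter()):
        for child in list(parent):
            if child.text != placeholder:
                continue
            index = list(parent).index(child)
            parent.remove(child)
            for offset, value in enumerate(values):
                # Reuse the tag so the new element keeps the same namespace
                new = ET.Element(child.tag, dict(child.attrib))
                new.text = str(value)
                new.tail = child.tail
                parent.insert(index + offset, new)
    return root
```

Usage would look like root = expand_placeholder(content, "{some_values}", some_values), followed by ET.tostring(root).decode() to get the string back. ElementTree will typically serialise the namespaces with generated prefixes such as ns0 unless they are registered beforehand with ET.register_namespace.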

Python: convert json+html string to .doc

I'm writing a Python script and I have to convert a rendered string (from a JSON document with HTML inside) to a .docx file.
I have searched the web a lot but I'm still confused.
I tried python-docx, but it does not work well because it expects .docx input and does not accept a string like this:
<h1><span lessico='Questa' idx="0" testo="testo" show-modal="setModal()" tables="updateTables(input)">Questa</span> <span lessico='è' idx="1" testo="testo" show-modal="setModal()" tables="updateTables(input)">è</span> <span lessico='una' idx="2" testo="testo" show-modal="setModal()" tables="updateTables(input)">una</span> <span lessico='domanda' idx="3" testo="testo" show-modal="setModal()" tables="updateTables(input)">domanda</span>...</h1>
<ul>
<li>a scelta multipla</li>
<li>con risposta aperta</li>
<li>di tipo trova</li>
<li>di associazione</li>
How can I convert this into a formatted .doc or .docx, preferably without going mad? :)

Blank XML Namespace processing With Python

I am trying to parse an XML file using Python. Example snippet of the XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<raml xmlns="raml21.xsd" version="2.1">
<series xmlns="" scope="USA" name="Arizona">
<header>
<log action="created"/>
</header>
<x_ns color="Blue">
<p name="timeZone">(GMT-10)</p>
</x_ns>
<x_ns color="Red">
<p name="AvgHeight">175</p>
</x_ns>
<x_ns color="black">
<p name="AvgWeight">235</p>
</x_ns>
The problem is that the namespace keeps changing, so as an alternative I tried to read the xmlns string first and then build a namespace dictionary from it using the code below:
root = raw_xml.getroot()
namespace_temp1 = root.tag.split("}")
namespace_temp2 = namespace_temp1[0].strip('{')
namespaces_auto = {}
tag_name = ["x", "y", "z", "w", "v"]
ns_name = [namespace_temp2, namespace_temp2, namespace_temp2, namespace_temp2, namespace_temp2]
namespace_temp3 = zip(tag_name, ns_name)
for tag, ns in namespace_temp3:
    namespaces_auto[tag] = ns
namespaces = namespaces_auto
To access a particular tag within the namespace I use code like this:
for data in raw_xml.findall('x:x_ns', namespaces):
This mostly solves the problem, but it gets stuck when a child node has a blank xmlns, as seen on the series tag (xmlns=""). I am not sure how to incorporate a check for this condition in the code.
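
A note that may help here: xmlns="" undeclares the default namespace, so series and everything below it are in no namespace at all, and their tags appear in ElementTree as plain names such as x_ns. One way to sidestep both the changing namespace and the blank one, sketched under that assumption, is to compare local names only. The helper names local_name and find_by_local_name are made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_by_local_name(root, name):
    """Return all descendants (and root itself) whose local name equals `name`.

    Hypothetical helper: it ignores namespaces entirely, so it matches
    elements in any namespace as well as elements in none.
    """
    return [el for el in root.iter() if local_name(el.tag) == name]
```

With this, find_by_local_name(raw_xml.getroot(), "x_ns") would return the x_ns elements whether or not a namespace applies to them. The trade-off is that same-named elements from different namespaces are no longer told apart.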

Programmatically delete everything before a HTML node?

I am trying to create a corpus of data from a set of .html pages I have stored in a directory.
These HTML pages have lots of info I don't need.
This info is all stored before the line
<div class="channel">
How can I programmatically remove all of the text before
<div class="channel">
in every HTML file in a folder?
Bonus question for a 50-point bounty:
How do I programmatically remove everything AFTER, for example,
<div class="footer">
?
So if my index.html was previously:
<head>
<title>This is bad HTML</title>
</head>
<body>
<h1> Remove me</h1>
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
<div class="footer">
<h1> Remove me, I am pointless</h1>
</div>
</body>
After my script runs, I want it to be:
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>

This is a bit heavy on memory usage, but it works. Basically you list the directory, pick out all ".html" files, read each one into a variable, find the split point, keep the part before or after it, and then overwrite the file.
There are probably better ways to do this, but it works.
import os

marker = '<div class="channel">'
# Collect every .html file in the current directory
files = [name for name in os.listdir(".") if name.endswith(".html")]
for file_name in files:
    with open(file_name) as f:
        content = f.read()
    loc = content.find(marker)
    if loc == -1:
        continue  # marker not found, leave the file untouched
    # Keep everything from the marker onwards
    new_content = content[loc:]
    with open(file_name, "w") as f:
        f.write(new_content)
If you wanted to just keep up to a point:
new_content = content[:loc]  # the slice end is exclusive, so no -1 is needed
Note that the things you're searching should be kept in a variable, and not hardcoded.
Also, this won't work recursively for file/folder structures, but you can find out how to modify it to do that very easily.
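
As a rough sketch of the recursive variant mentioned above, pathlib's rglob can walk the sub-folders, and an optional end marker covers the footer case from the question. The function name trim_html_tree is made up for this example, and UTF-8 file encoding is an assumption.

```python
from pathlib import Path


def trim_html_tree(root, start_marker, end_marker=None):
    """Trim every .html file under `root`; return the paths that were rewritten.

    Hypothetical helper. Keeps the text from start_marker up to (but not
    including) end_marker, or to the end of the file if end_marker is not
    given or not found. Files without start_marker are left untouched.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.html")):
        content = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        start = content.find(start_marker)
        if start == -1:
            continue
        end = content.find(end_marker, start) if end_marker else -1
        trimmed = content[start:end] if end != -1 else content[start:]
        path.write_text(trimmed, encoding="utf-8")
        changed.append(path)
    return changed
```

For example, trim_html_tree(".", '<div class="channel">', '<div class="footer">') would rewrite the files in place, so it is worth trying it on a copy of the folder first.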

To remove everything above and everything below
means the only thing left should be this section:
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
Rather than thinking about removing the unwanted parts, it would be easier to just extract the wanted part.
You can extract the channel div using an XML parser, for example through the DOM.

You've not mentioned a language in the question. The post is tagged python, so this answer might be out of context, but I'll give a PHP solution that could easily be rewritten in another language.
$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
$result = $search.$components[1];
return $result;
To do the reverse is fairly easy too; simply take the value of $components[0] after altering $search to your <div class="footer"> value.
If you happen to have the $search string cropping up multiple times:
$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
unset($components[0]);
$result = $search.implode($search,$components);
return $result;
Someone who knows Python better than I do, feel free to rewrite this and take the answer!
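
A possible Python translation of the PHP above, offered as a sketch: str.partition splits at the first occurrence of the search string, which covers both PHP snippets, since keeping everything from the first occurrence is what the second snippet reassembles. The function names keep_from and keep_before are made up for this example.

```python
def keep_from(html, search):
    """Return `html` starting at the first occurrence of `search`.

    If `search` does not occur, the input is returned unchanged.
    """
    before, found, after = html.partition(search)
    return found + after if found else html


def keep_before(html, search):
    """Return the part of `html` that precedes the first occurrence of `search`.

    If `search` does not occur, the input is returned unchanged.
    """
    return html.partition(search)[0]
```

Chaining the two, keep_before(keep_from(html, '<div class="channel">'), '<div class="footer">'), would leave only the channel block.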
