How can we read an unstructured XML file in PySpark? - python

<editors>
<p poid="1232" class="odo">
<person id="1232">Rob Jhon</person>
<br /> **this text need to be read**
<br />
<title>Sto items:</title> **"this text need to be read"**
<br />
<title>Recent items:</title> **this text need to be read**
</p>
</editors>
As you can see, my dataset contains some strings that are not wrapped in any tag.
How can I read this XML properly in PySpark so that these strings show up as a column as well?

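If the XML is in a file called "data.xml", you could start with the standard library's ElementTree. Text that follows an element's closing tag is stored in that element's tail attribute:
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
# root[0] is <p>, root[0][1] is its first <br />; the untagged text after it is the tail
print(root[0][1].tail)
This works for me.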
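To collect every untagged string rather than printing a single one, the same tail lookup can be applied in a loop. The following is a minimal sketch, not a PySpark-native reader: the helper name `untagged_rows`, the column layout, and the shortened sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the question's data; the untagged strings are shortened.
SAMPLE = """<editors>
<p poid="1232" class="odo">
<person id="1232">Rob Jhon</person>
<br /> first untagged text
<br />
<title>Sto items:</title> second untagged text
<br />
<title>Recent items:</title> third untagged text
</p>
</editors>"""


def untagged_rows(xml_text):
    """Return (poid, tag, text, tail) tuples for children of <p> with a non-empty tail."""
    root = ET.fromstring(xml_text)
    rows = []
    for p in root.iter("p"):
        for child in p:
            # tail holds the text between this element's end tag and the next tag
            tail = (child.tail or "").strip()
            if tail:
                text = (child.text or "").strip()
                rows.append((p.get("poid"), child.tag, text, tail))
    return rows


rows = untagged_rows(SAMPLE)
for row in rows:
    print(row)
```

If a SparkSession is available, a list of tuples like this can be handed to `spark.createDataFrame(rows, ["poid", "tag", "text", "tail"])` so that the untagged text becomes a column of its own.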

Related

Add values dynamically in XML string using python

I am new to XML and stuck on one feature. I have a list and an XML string whose structure is not fixed. In the XML string I have defined a placeholder (in my case "{some_values}") with the same name as the list variable. When my code runs, I want the placeholder to be resolved to that list so that the values in the list are added to the XML dynamically at run time.
some_values=[1,2,3]
Input XML:
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>{some_values}</intA>
</Add>
</Body>
</Envelope>
Output XML:
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>1</intA>
<intA>2</intA>
<intA>3</intA>
</Add>
</Body>
</Envelope>
I need an approach for solving this. I have looked at some Python XML parsing libraries, and I read somewhere that an XML string can also be handled with Python templating, but I could not find a solution that fits this particular problem.
Try something along these lines:
import lxml.etree as ET
parser = ET.XMLParser()
some_values=[1,2,3]
content='''<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>{some_values}</intA>
</Add>
</Body>
</Envelope>
'''
tree = ET.fromstring(content, parser)
item = tree.xpath('.//*[local-name()="intA"]')
par = item[0].getparent()
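# insert each value right after the placeholder; iterating in reverse keeps the final order 1, 2, 3
for val in reversed(some_values):
    new = ET.XML(f'<intA>{val}</intA>')
    par.insert(par.index(item[0])+1,new)
# drop the original {some_values} placeholder element
par.remove(item[0])
print(ET.tostring(tree).decode())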
Output (you can fix the formatting later):
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>1</intA><intA>2</intA><intA>3</intA></Add>
</Body>
</Envelope>

Parse Heavy XML into Ordered Dictionary

I am currently parsing XML in Python 3.x. For files up to 300 MB the code below works without problems, but once the file size grows to 500 MB or into the GB range I run into memory issues.
tree2=etree.parse(xmlfile2)
root2=tree2.getroot()
df_list2=[]
for i, child in enumerate(root2):
    for subchildren in (child.findall('{raml20.xsd}header')):
        for subchildren in (child.findall('{raml20.xsd}managedObject')):
            xml_class_name2 = subchildren.get('class')
            xml_dist_name2 = subchildren.get('distName')
            for subchild in subchildren:
                df_dict2=OrderedDict()
                header2=subchild.attrib.get('name')
                df_dict2['MOClass']=xml_class_name2
                df_dict2['CellDN']=xml_dist_name2
                df_dict2['Parameter']=header2
                df_dict2['CurrentValue']=subchild.text
                df_list2.append(df_dict2)
I came across various articles explaining the use of iterparse, but I could not work out how to use it to save the XML data in an ordered way.
Below is the format of my XML:
<raml version="2.0" xmlns="raml20.xsd">
<cmData type="plan" scope="all" name="XML_Plan_update.xml">
<header>
<log dateTime="2018-12-31T16:13:28" action="created" appInfo="PlanExporter"/>
</header>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-137/WNBTS-1/WNCEL-27046" operation="update">
<p name="defaultCarrier">10787</p>
<p name="lCelwDN">MRBTS-137/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-4</p>
<p name="maxCarrierPower">460</p>
</managedObject>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-6770/WNBTS-1/WNCEL-26925" operation="update">
<p name="defaultCarrier">10787</p>
<p name="lCelwDN">MRBTS-6770/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-5</p>
<p name="maxCarrierPower">460</p>
</managedObject>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-806/WNBTS-1/WNCEL-22661" operation="update">
<p name="defaultCarrier">10762</p>
<p name="lCelwDN">MRBTS-806/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-9</p>
<p name="maxCarrierPower">460</p>
</managedObject>
I am currently using cElementTree or lxml to parse the XML and save the output generated by the for loop in an OrderedDict. Every dict is appended to a list.
I am looking for a way to use the iterparse method to parse the XML above into ordered dicts.
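One possible way to stream the file with iterparse is sketched below, using the standard library's ElementTree. The generator name `iter_parameters` and the shortened in-memory sample are assumptions for illustration; the dictionary keys are the ones from the question. Each managedObject is processed when its end tag is seen and then cleared, so its children do not stay in memory.

```python
import io
import xml.etree.ElementTree as ET
from collections import OrderedDict

NS = "{raml20.xsd}"  # default namespace declared on <raml> in the question

# Shortened, closed version of the question's XML, used only for the demo.
SAMPLE = b"""<raml version="2.0" xmlns="raml20.xsd">
<cmData type="plan" scope="all" name="XML_Plan_update.xml">
<header>
<log dateTime="2018-12-31T16:13:28" action="created" appInfo="PlanExporter"/>
</header>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-137/WNBTS-1/WNCEL-27046" operation="update">
<p name="defaultCarrier">10787</p>
<p name="maxCarrierPower">460</p>
</managedObject>
</cmData>
</raml>"""


def iter_parameters(source):
    """Yield one OrderedDict per child of each <managedObject>, in document order."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "managedObject":
            continue
        for child in elem:
            row = OrderedDict()
            row["MOClass"] = elem.get("class")
            row["CellDN"] = elem.get("distName")
            row["Parameter"] = child.get("name")
            row["CurrentValue"] = child.text
            yield row
        # Drop the children and attributes of the finished element.
        # The emptied element itself stays attached to its parent.
        elem.clear()


# source may be a file name or a binary file object
rows = list(iter_parameters(io.BytesIO(SAMPLE)))
for row in rows:
    print(dict(row))
```

Calling `list(...)` still keeps every row in memory. For files in the GB range it is probably better to consume the generator incrementally, for example by writing each row to a CSV file as it arrives.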

Using python, elementtree, xml parser to get attributes not working for some reason?

I'm new to Python and to parsing XML, and I'm having trouble with a particular XML file that is produced by a program I work with. I'm trying to parse this file using Python and ElementTree in order to extract the URL data (the URL below is fake). Any ideas as to why this isn't working?
My Python code:
import xml.etree.ElementTree as ET
def xmlTreeParser(fileName,attribute,tagName):
    tree = ET.parse(fileName)
    root = tree.getroot()
    attribArray = [element.attrib[attribute] for element in root.findall(tagName)]
    print(attribArray)
xmlTreeParser("xml_file.xml",'text','Expr')
Here's my XML file:
<Query id="f9cef041-085d-47e0-8d16-15e36bba1ec8" name="">
<Description />
<JustSortedColumns />
<Conditions linking="All">
<Condition class="PDCT" enabled="True" readOnly="False" linking="Any">
<Condition class="SMPL" enabled="True" readOnly="False">
<Operator id="Contains" />
<Expressions>
<Expr class="ENTATTR" id="Person.LinkedInUrl" />
<Expr class="CONST" type="String" kind="Scalar" value="https://www.linkedin.com/Bill-Smith" text="https://www.linkedin.com/Bill-Smith" />
</Expressions>
</Condition>
</Condition>
</Conditions>
</Query>
The python I wrote works just fine on another, test, xml file that I wrote myself. I'm at a loss as to why I can't parse this particular block of xml. Thanks everyone.
findall('Expr') only searches the direct children of the root element. To reach Expr elements nested deeper in the tree, prefix the tag with .// in your call:
xmlTreeParser("xml_file.xml",'text','.//Expr')
Also, not every Expr element in your XML has a text attribute, so use attrib.get with a default value to avoid a KeyError:
attribArray = [element.attrib.get(attribute, '') for element in root.findall(tagName)]
# -----------------------------^
print(attribArray)
xmlTreeParser("xml_file.xml",'text','.//Expr')

Blank XML Namespace processing With Python

I am trying to parse an XML file using Python. Example XML snippet:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<raml xmlns="raml21.xsd" version="2.1">
<series xmlns="" scope="USA" name="Arizona">
<header>
<log action="created"/>
</header>
<x_ns color="Blue">
<p name="timeZone">(GMT-10)</p>
</x_ns>
<x_ns color="Red">
<p name="AvgHeight">175</p>
</x_ns>
<x_ns color="black">
<p name="AvgWeight">235</p>
</x_ns>
The problem is that the namespace keeps changing, so as an alternative I tried to read the xmlns string first and then build a dictionary of namespaces with the code below:
root = raw_xml.getroot()
namespace_temp1=root.tag.split("}")
namespace_temp2=namespace_temp1[0].strip('{')
namespaces_auto={}
tag_name =["x","y","z","w","v"]
ns_name=[namespace_temp2,namespace_temp2,namespace_temp2,namespace_temp2,namespace_temp2]
namespace_temp3=zip(tag_name,ns_name)
for tag,ns in namespace_temp3:
    namespaces_auto[tag]=ns
namespaces=namespaces_auto
To access a particular tag with a namespace I use code like this:
for data in raw_xml.findall('x:x_ns',namespaces):
This mostly solves the problem, but it gets stuck when a child node has a blank xmlns, as in the series tag (xmlns=""). I am not sure how to handle this condition in the code.
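A declaration of xmlns="" resets the default namespace, so series and everything below it are in no namespace at all, and a prefixed lookup such as 'x:x_ns' cannot match them. One way to cope with namespaces that change or disappear is to compare local names only. The sketch below assumes the standard library's ElementTree; the helper names `local_name` and `find_by_local_name` are made up for illustration, and the sample is a shortened, closed version of the snippet above.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<raml xmlns="raml21.xsd" version="2.1">
<series xmlns="" scope="USA" name="Arizona">
<header>
<log action="created"/>
</header>
<x_ns color="Blue">
<p name="timeZone">(GMT-10)</p>
</x_ns>
<x_ns color="Red">
<p name="AvgHeight">175</p>
</x_ns>
</series>
</raml>"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


def find_by_local_name(root, name):
    """Return all descendants whose local name matches, whatever their namespace."""
    return [el for el in root.iter() if local_name(el.tag) == name]


root = ET.fromstring(SAMPLE)
print(root.tag)        # the root keeps its namespace
matches = find_by_local_name(root, "x_ns")
print(matches[0].tag)  # elements below <series xmlns=""> have none
colors = [el.get("color") for el in matches]
print(colors)
```

On Python 3.8 and later, `root.findall('.//{*}x_ns')` should give the same result, because the `{*}` wildcard matches a tag in any namespace or in none.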

Remove xmlns information from generated file?

I am using ElementTree to parse an XML file, edit the contents and write them to a new XML file. I have this all working apart from one issue: when I generate the file, many lines contain extra namespace information. Here are some snippets of code:
import xml.etree.ElementTree as ET
ET.register_namespace("", "http://clish.sourceforge.net/XMLSchema")
tree = ET.parse('ethernet.xml')
root = tree.getroot()
commands = root.findall('{http://clish.sourceforge.net/XMLSchema}'
                        'VIEW/{http://clish.sourceforge.net/XMLSchema}COMMAND')
all1 = []
for command in commands:
    all1.append(list(command.iter()))
And a sample of the output file, with the unwanted attribute xmlns="http://clish.sourceforge.net/XMLSchema":
<COMMAND xmlns="http://clish.sourceforge.net/XMLSchema" help="Interface specific description" name="description">
<PARAM help="Description (must be in double-quotes)" name="description" ptype="LINE" />
<CONFIG />
</COMMAND>
How can I remove this with ElementTree, if that is possible? Or will I have to use a regex (I am writing a string to the file)?
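One way to avoid the regex is to strip the namespace from the tags before serializing, so that ElementTree has nothing to declare. This is a sketch under assumptions: the helper name `strip_namespace` is hypothetical, and the root element name CLISH_MODULE in the sample is a guess, since the question only shows the VIEW/COMMAND part of the file.

```python
import xml.etree.ElementTree as ET

NS = "http://clish.sourceforge.net/XMLSchema"

# Small stand-in for ethernet.xml; the root element name is an assumption.
SAMPLE = """<CLISH_MODULE xmlns="http://clish.sourceforge.net/XMLSchema">
<VIEW name="ethernet">
<COMMAND name="description" help="Interface specific description">
<PARAM name="description" help="Description" ptype="LINE"/>
<CONFIG/>
</COMMAND>
</VIEW>
</CLISH_MODULE>"""


def strip_namespace(elem):
    """Remove the '{uri}' prefix from elem and all of its descendants, in place."""
    for node in elem.iter():
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]
    return elem


root = ET.fromstring(SAMPLE)
path = "{%s}VIEW/{%s}COMMAND" % (NS, NS)
command = root.find(path)

# Once the tags carry no namespace, the serializer emits no xmlns attribute.
text = ET.tostring(strip_namespace(command), encoding="unicode")
print(text)
```

Because the tags are changed in place, later lookups on those elements have to use the plain names. Anything that needs the namespaced path should therefore run before the tags are stripped.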
