<editors>
<p poid="1232" class="odo">
<person id="1232">Rob Jhon</person>
<br /> **this text need to be read**
<br />
<title>Sto items:</title> **"this text need to be read"**
<br />
<title>Recent items:</title> **this text need to be read**
</p>
</editors>
As you can see, my dataset contains some text that is not wrapped in any tag.
How can I read this XML properly in PySpark so that this untagged text shows up as a column as well?
If the XML is in a file called "data.xml", you could start with:
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
print(root[0][1].tail)
This works for me.
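To pick up every untagged string rather than a single one, the same `.tail` idea can be applied to all elements. This is only a sketch using the standard library; `collect_tails` is a hypothetical helper name, and the XML is inlined here instead of being read from "data.xml".

```python
import xml.etree.ElementTree as ET

xml_text = """<editors>
<p poid="1232" class="odo">
<person id="1232">Rob Jhon</person>
<br /> **this text need to be read**
<br />
<title>Sto items:</title> **"this text need to be read"**
<br />
<title>Recent items:</title> **this text need to be read**
</p>
</editors>"""

def collect_tails(root):
    """Return (tag, tail_text) pairs for every element followed by non-blank text."""
    rows = []
    for elem in root.iter():
        # .tail is the text between this element's end tag and the next tag
        tail = (elem.tail or "").strip()
        if tail:
            rows.append((elem.tag, tail))
    return rows

root = ET.fromstring(xml_text)
rows = collect_tails(root)
for tag, tail in rows:
    print(tag, "->", tail)
```

The resulting list of tuples is plain Python data, so it could presumably be handed to PySpark afterwards (for example through `createDataFrame`), but that step is not shown or tested here.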
I am new to XML and stuck on a feature. I have a list and an XML string (the structure of the XML is not fixed). I have defined an identifier in the XML string (in my case "{some_values}") with the same name as the list. When my code executes, I want the XML string to pick up that list variable so that the values in the list are added dynamically at run time.
some_values=[1,2,3]
Input XML:
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>{some_values}</intA>
</Add>
</Body>
</Envelope>
Output XML:
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>1</intA>
<intA>2</intA>
<intA>3</intA>
</Add>
</Body>
</Envelope>
I need an approach or solution for this problem. I have looked at some Python XML parser libraries and have read that XML strings can also be handled with Python templating, but I was unable to find a solution that fits this particular problem.
Try something along these lines:
import lxml.etree as ET
parser = ET.XMLParser()
some_values=[1,2,3]
content='''<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>{some_values}</intA>
</Add>
</Body>
</Envelope>
'''
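tree = ET.fromstring(content, parser)
# Find the placeholder element regardless of its namespace
item = tree.xpath('.//*[local-name()="intA"]')
par = item[0].getparent()
# Insert in reverse order right after the placeholder so the values end up in list order
for val in reversed(some_values):
    new = ET.XML(f'<intA>{val}</intA>')
    par.insert(par.index(item[0])+1,new)
# Drop the original placeholder element
par.remove(item[0])
print(ET.tostring(tree).decode())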
Output (you can fix the formatting later):
<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>1</intA><intA>2</intA><intA>3</intA></Add>
</Body>
</Envelope>
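If lxml is not available, roughly the same replacement can be sketched with the standard library. This is a variation, not the code above: `expand_placeholder` is a hypothetical helper, it assumes the placeholder is the entire text of a childless element, and it builds each new element with the placeholder's own tag so the copies stay in the same namespace.

```python
import xml.etree.ElementTree as ET

content = '''<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
<Body>
<Add xmlns="http://tempuri.org/">
<intA>{some_values}</intA>
</Add>
</Body>
</Envelope>
'''

def expand_placeholder(xml_text, name, values):
    """Replace each element whose text is '{name}' with one copy per value."""
    root = ET.fromstring(xml_text)
    placeholder = "{" + name + "}"
    # ElementTree has no getparent(), so visit each parent and rebuild its child list
    for parent in list(root.iter()):
        new_children = []
        for child in list(parent):
            if len(child) == 0 and (child.text or "").strip() == placeholder:
                for value in values:
                    clone = ET.Element(child.tag, child.attrib)
                    clone.text = str(value)
                    clone.tail = child.tail  # keep the original line breaks
                    new_children.append(clone)
            else:
                new_children.append(child)
        parent[:] = new_children
    return root

some_values = [1, 2, 3]
root = expand_placeholder(content, "some_values", some_values)
print(ET.tostring(root, encoding="unicode"))
```

Note that the standard library serializer may write generated namespace prefixes instead of the original default `xmlns` declarations; the document is equivalent, but the text will not match the input byte for byte.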
I'm writing a Python script and I have to convert a rendered string (from JSON with HTML inside) to a .docx file.
I have searched the web a lot, but I'm still confused.
I tried python-docx, but it doesn't work well because it expects .docx input and doesn't accept a string like this:
<h1><span lessico='Questa' idx="0" testo="testo" show-modal="setModal()" tables="updateTables(input)">Questa</span> <span lessico='è' idx="1" testo="testo" show-modal="setModal()" tables="updateTables(input)">è</span> <span lessico='una' idx="2" testo="testo" show-modal="setModal()" tables="updateTables(input)">una</span> <span lessico='domanda' idx="3" testo="testo" show-modal="setModal()" tables="updateTables(input)">domanda</span>...</h1>
<ul>
<li>a scelta multipla</li>
<li>con risposta aperta</li>
<li>di tipo trova</li>
<li>di associazione</li>
How can I convert this into a formatted .doc or .docx file? Preferably without going mad :)
I am trying to parse an XML file using Python. Example XML snippet:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<raml xmlns="raml21.xsd" version="2.1">
<series xmlns="" scope="USA" name="Arizona">
<header>
<log action="created"/>
</header>
<x_ns color="Blue">
<p name="timeZone">(GMT-10)</p>
</x_ns>
<x_ns color="Red">
<p name="AvgHeight">175</p>
</x_ns>
<x_ns color="black">
<p name="AvgWeight">235</p>
</x_ns>
The problem is that the namespace keeps changing, so as an alternative I tried to read the xmlns string first and then build a namespace dictionary from it using the code below:
root = raw_xml.getroot()
namespace_temp1=root.tag.split("}")
namespace_temp2=namespace_temp1[0].strip('{')
namespaces_auto={}
tag_name =["x","y","z","w","v"]
ns_name=[namespace_temp2,namespace_temp2,namespace_temp2,namespace_temp2,namespace_temp2]
namespace_temp3=zip(tag_name,ns_name)
for tag,ns in namespace_temp3:
    namespaces_auto[tag]=ns
namespaces=namespaces_auto
To access a particular tag with a namespace, I am using code like this:
for data in raw_xml.findall('x:x_ns',namespaces):
This pretty much solves the problem, but it gets stuck when a child node has a blank xmlns, as seen in the series tag (xmlns=""). I am not sure how to incorporate a check for this condition into the code.
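One possible way around the blank xmlns, sketched under the assumption that matching elements by local name alone is acceptable, is to ignore namespaces entirely when searching. `local_name` and `find_by_local_name` are hypothetical helpers, and the closing `</series>` and `</raml>` tags are added here only so that the truncated snippet parses.

```python
import xml.etree.ElementTree as ET

# Sample from the question; the two closing tags at the end are an assumption
xml_text = """<raml xmlns="raml21.xsd" version="2.1">
<series xmlns="" scope="USA" name="Arizona">
<header>
<log action="created"/>
</header>
<x_ns color="Blue">
<p name="timeZone">(GMT-10)</p>
</x_ns>
<x_ns color="Red">
<p name="AvgHeight">175</p>
</x_ns>
<x_ns color="black">
<p name="AvgWeight">235</p>
</x_ns>
</series>
</raml>"""

def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rsplit("}", 1)[-1]

def find_by_local_name(root, name):
    """Collect descendants whose local name matches, whatever their namespace."""
    return [el for el in root.iter()
            if isinstance(el.tag, str) and local_name(el.tag) == name]

root = ET.fromstring(xml_text)
params = {}
for x_ns in find_by_local_name(root, "x_ns"):
    for p in find_by_local_name(x_ns, "p"):
        params[p.get("name")] = p.text
print(params)
```

Because the lookup never mentions a namespace, it behaves the same whether an element inherits the root namespace, declares its own, or resets it with xmlns="". The trade-off is that two elements with the same local name in different namespaces can no longer be told apart.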
I am trying to create a corpus of data from a set of .html pages I have stored in a directory.
These HTML pages have lots of info I don't need.
This info is all stored before the line
<div class="channel">
How can I programmatically remove all of the text before
<div class="channel">
in every HTML file in a folder?
Bonus question for a 50-point bounty:
How do I programmatically remove everything AFTER, for example,
<div class="footer">
?
So if my index.html was previously:
<head>
<title>This is bad HTML</title>
</head>
<body>
<h1> Remove me</h1>
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
<div class="footer">
<h1> Remove me, I am pointless</h1>
</div>
</body>
After my script runs, I want it to be:
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
This is a bit heavy on memory usage because each file is read in full, but it works. Basically you list the directory, pick out all ".html" files, read each one into a variable, find the split point, keep the part before or after it, and then overwrite the file.
There are probably better ways to do this, but it works.
import os

# Collect every .html file in the current directory
files = [name for name in os.listdir(".") if name.endswith(".html")]

for fileName in files:
    with open(fileName) as f:
        content = f.read()
    loc = content.find('<div class="channel">')
    if loc == -1:
        continue  # marker not found, leave this file untouched
    newContent = content[loc:]  # keep the marker and everything after it
    with open(fileName, 'w') as f:
        f.write(newContent)
If you want to keep only the text up to a point:
newContent = content[:loc] # no -1 needed: the slice end is exclusive, so this stops right before the match
Note that the things you're searching should be kept in a variable, and not hardcoded.
Also, this won't work recursively for file/folder structures, but you can find out how to modify it to do that very easily.
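A recursive version that also handles the footer part of the question could look roughly like this. It is a sketch, not a tested tool: `trim_html` and `trim_tree` are hypothetical names, the files are assumed to be UTF-8, and files without the channel marker are left unchanged.

```python
from pathlib import Path

def trim_html(text,
              start_marker='<div class="channel">',
              end_marker='<div class="footer">'):
    """Keep the text from start_marker up to (not including) end_marker."""
    start = text.find(start_marker)
    if start == -1:
        return text  # no start marker: return the text unchanged
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # no end marker: keep everything to the end
    return text[start:end]

def trim_tree(folder):
    """Rewrite every .html file under folder, including subfolders."""
    for path in Path(folder).rglob("*.html"):
        original = path.read_text(encoding="utf-8")
        trimmed = trim_html(original)
        if trimmed != original:
            path.write_text(trimmed, encoding="utf-8")

# Example call; "pages" is a placeholder folder name
# trim_tree("pages")
```

Both markers are parameters here, which follows the advice above about not hardcoding the search strings. Since the files are overwritten in place, it is worth trying this on a copy of the folder first.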
Removing everything above and everything below
means the only thing left should be this section:
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
Rather than thinking about removing the unwanted parts, it would be easier to just extract the wanted part.
You can easily extract the channel div using an XML parser such as DOM.
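As a rough illustration of the "extract what you want" idea, here is a sketch that uses Python's built-in `html.parser` instead of a DOM parser, since it tolerates HTML that is not well-formed XML. It collects only the text inside the channel div (which may be enough for building a corpus), not the markup; `ChannelText` is a hypothetical class name.

```python
from html.parser import HTMLParser

class ChannelText(HTMLParser):
    """Collect the text found inside <div class="channel"> ... </div>."""

    def __init__(self, class_name="channel"):
        super().__init__()
        self.class_name = class_name
        self.depth = 0    # 0 = outside the target div, otherwise div nesting depth
        self.chunks = []  # non-blank text pieces found inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target div
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

page = """<head>
<title>This is bad HTML</title>
</head>
<body>
<h1> Remove me</h1>
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
<div class="footer">
<h1> Remove me, I am pointless</h1>
</div>
</body>"""

parser = ChannelText()
parser.feed(page)
parser.close()
print(parser.chunks)
```

Counting nested divs is what lets the parser tell the closing tag of the channel div apart from the closing tags of any divs inside it.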
You've not mentioned a language in the question. The post is tagged with Python, so this answer might still be out of context, but I'll give a PHP solution that could likely be rewritten easily in another language.
$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
$result = $search.$components[1];
return $result;
To do the reverse is fairly easy too; simply take the value of $components[0] after altering $search to your <div class="footer"> value.
If you happen to have the $search string cropping up multiple times:
$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
unset($components[0]);
$result = $search.implode($search,$components);
return $result;
Someone who knows Python better than I do, feel free to rewrite this and take the answer!
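A rough Python counterpart of the PHP above, offered as a sketch rather than a definitive port: `str.partition` splits on the first occurrence only, so repeated markers need no special handling. `keep_from` and `keep_before` are hypothetical function names.

```python
def keep_from(html, search):
    """Return search and everything after its first occurrence."""
    before, found, after = html.partition(search)
    # partition returns an empty 'found' when search is absent
    return found + after if found else html

def keep_before(html, search):
    """Return everything before the first occurrence of search."""
    before, found, after = html.partition(search)
    return before

page = (
    '<h1> Remove me</h1>\n'
    '<div class="channel">\n<p> Keep this text </p>\n</div>\n'
    '<div class="footer">\n<h1> Remove me, I am pointless</h1>\n</div>\n'
)

# Chain the two calls to keep only the channel section
result = keep_before(keep_from(page, '<div class="channel">'),
                     '<div class="footer">')
print(result)
```

When the marker is missing, `keep_from` returns the page unchanged and `keep_before` returns the whole page as well, so neither call can silently empty a file.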