Python: convert json+html string to .doc - python

I'm writing a Python script and I have to convert a rendered string (from a JSON with HTML inside) to a .docx file.
I've searched the web a lot but I'm still confused.
I tried python-docx, but it doesn't work well because it expects .docx input and doesn't accept a string like this:
<h1><span lessico='Questa' idx="0" testo="testo" show-modal="setModal()" tables="updateTables(input)">Questa</span> <span lessico='è' idx="1" testo="testo" show-modal="setModal()" tables="updateTables(input)">è</span> <span lessico='una' idx="2" testo="testo" show-modal="setModal()" tables="updateTables(input)">una</span> <span lessico='domanda' idx="3" testo="testo" show-modal="setModal()" tables="updateTables(input)">domanda</span>...</h1>
<ul>
<li>a scelta multipla</li>
<li>con risposta aperta</li>
<li>di tipo trova</li>
<li>di associazione</li>
How can I convert this into a formatted .doc or .docx? Preferably without going mad :)

Related

Parse Heavy XML into Ordered Dictionary

I am currently parsing XML in Python 3.x. For XML files up to 300 MB the code below works without issues. However, when the file size grows to 500 MB or into the GB range, I run into memory problems.
from collections import OrderedDict
from lxml import etree  # or: import xml.etree.ElementTree as etree

tree2 = etree.parse(xmlfile2)  # loads the whole document into memory
root2 = tree2.getroot()
df_list2 = []
for i, child in enumerate(root2):
    for subchildren in child.findall('{raml20.xsd}header'):
        for subchildren in child.findall('{raml20.xsd}managedObject'):
            xml_class_name2 = subchildren.get('class')
            xml_dist_name2 = subchildren.get('distName')
            # one ordered record per <p> parameter
            for subchild in subchildren:
                df_dict2 = OrderedDict()
                header2 = subchild.attrib.get('name')
                df_dict2['MOClass'] = xml_class_name2
                df_dict2['CellDN'] = xml_dist_name2
                df_dict2['Parameter'] = header2
                df_dict2['CurrentValue'] = subchild.text
                df_list2.append(df_dict2)
I came across various articles explaining the use of 'iterparse', but I can't work out how to use it to save the XML data in an ordered way.
Below is the format of my XML:
<raml version="2.0" xmlns="raml20.xsd">
<cmData type="plan" scope="all" name="XML_Plan_update.xml">
<header>
<log dateTime="2018-12-31T16:13:28" action="created" appInfo="PlanExporter"/>
</header>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-137/WNBTS-1/WNCEL-27046" operation="update">
<p name="defaultCarrier">10787</p>
<p name="lCelwDN">MRBTS-137/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-4</p>
<p name="maxCarrierPower">460</p>
</managedObject>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-6770/WNBTS-1/WNCEL-26925" operation="update">
<p name="defaultCarrier">10787</p>
<p name="lCelwDN">MRBTS-6770/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-5</p>
<p name="maxCarrierPower">460</p>
</managedObject>
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-806/WNBTS-1/WNCEL-22661" operation="update">
<p name="defaultCarrier">10762</p>
<p name="lCelwDN">MRBTS-806/MNL-1/MNLENT-1/CELLMAPPING-1/LCELW-9</p>
<p name="maxCarrierPower">460</p>
</managedObject>
I am currently using cElementTree or lxml to parse the XML and save the output generated by the for loop in an OrderedDict. Every dict is appended to a list at the end.
I am looking for a way to use the iterparse method to parse the above XML into ordered dicts.
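One possible way to do this with iterparse is sketched below. It is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, the helper name iter_parameters is made up for illustration, and the sample document is a shortened, closed version of the XML above. Each managedObject is handled when its closing tag is seen and then cleared, so the full tree is never held in memory at once.

```python
import io
import xml.etree.ElementTree as ET
from collections import OrderedDict

NS = "{raml20.xsd}"  # default namespace declared on the <raml> root


def iter_parameters(source):
    """Yield one OrderedDict per <p> element (hypothetical helper name)."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "managedObject":
            continue
        mo_class = elem.get("class")
        dist_name = elem.get("distName")
        for p in elem:
            row = OrderedDict()
            row["MOClass"] = mo_class
            row["CellDN"] = dist_name
            row["Parameter"] = p.get("name")
            row["CurrentValue"] = p.text
            yield row
        elem.clear()  # drop the children of the element just processed


# Shortened sample, assumed to have the same layout as the real file.
sample = b"""<raml version="2.0" xmlns="raml20.xsd">
<cmData type="plan" scope="all" name="XML_Plan_update.xml">
<managedObject class="WNCEL" version="LN2.0" distName="PLMN-PLMN/MRBTS-137/WNBTS-1/WNCEL-27046" operation="update">
<p name="defaultCarrier">10787</p>
<p name="maxCarrierPower">460</p>
</managedObject>
</cmData>
</raml>"""

rows = list(iter_parameters(io.BytesIO(sample)))
for row in rows:
    print(dict(row))
```

For a real file you would pass the file path instead of the BytesIO object. Note that the cleared, empty managedObject elements stay attached to their parent, so memory still grows a little with the number of elements; and collecting every row in a list can itself become large, so writing rows out as they are yielded may be the safer choice.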

Programmatically delete everything before a HTML node?

I am trying to create a corpus of data from a set of .html pages I have stored in a directory.
These HTML pages have lots of info I don't need.
This info is all stored before the line
<div class="channel">
How can I programmatically remove all of the text before
<div class="channel">
in every HTML file in a folder?
Bonus question for a 50point bounty :
How do I programmatically remove everything AFTER, for example,
<div class="footer">
?
So if my index.html was previously :
<head>
<title>This is bad HTML</title>
</head>
<body>
<h1> Remove me</h1>
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
<div class="footer">
<h1> Remove me, I am pointless</h1>
</div>
</body>
After my script runs, I want it to be :
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
This is a bit heavy on memory usage, but it works. Basically, you list the directory, pick out all the ".html" files, read each one into a variable, find the split point, keep the part before or after it, and then overwrite the file.
There are probably better ways to do this, but it works.
import os

# Collect every .html file in the current directory.
files = [name for name in os.listdir(".") if name.endswith(".html")]

for fileName in files:
    with open(fileName) as f:
        content = f.read()
    loc = content.find('<div class="channel">')
    if loc == -1:
        continue  # marker not found: leave this file untouched
    newContent = content[loc:]
    with open(fileName, "w") as f:
        f.write(newContent)
If you wanted to keep only the text up to a point instead:
newContent = content[:loc]  # the slice end is exclusive, so no -1 is needed
Note that the things you're searching should be kept in a variable, and not hardcoded.
Also, this won't work recursively for file/folder structures, but you can find out how to modify it to do that very easily.
To remove everything above and everything below
means the only thing left should be this section:
<div class="channel">
<h1> This is the good data, keep me</h1>
<p> Keep this text </p>
</div>
Rather than thinking about removing the unwanted parts, it would be easier to just extract the wanted part.
You can easily extract the channel div using an XML parser such as DOM.
You've not mentioned a language in the question - the post is tagged with python so this answer might still be out of context, but I'll give a php solution that could likely easily be rewritten in another language.
$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
$result = $search.$components[1];
return $result;
To do the reverse is fairly easy too; simply take the value of $components[0] after altering $search to your <div class="footer"> value.
If you happen to have the $search string cropping up multiple times:
$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
unset($components[0]);
$result = $search.implode($search,$components);
return $result;
Someone who knows python better than I do feel free to rewrite and take the answer!
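For what it's worth, here is a rough Python equivalent of the PHP snippets above, combined with the "extract the wanted part" idea. It is a sketch only: the function name keep_between is invented for illustration, and it treats the page as plain text (no real HTML parsing), so it assumes the two marker strings appear literally in the file.

```python
def keep_between(html, start_marker, end_marker=None):
    """Return html from start_marker up to, but not including, end_marker.

    Hypothetical helper: plain string search, no HTML parsing.
    """
    start = html.find(start_marker)
    if start == -1:
        return html  # start marker missing: leave the text untouched
    end = html.find(end_marker, start) if end_marker else -1
    return html[start:] if end == -1 else html[start:end]


page = (
    "<head>\n<title>This is bad HTML</title>\n</head>\n<body>\n"
    "<h1> Remove me</h1>\n"
    '<div class="channel">\n'
    "<h1> This is the good data, keep me</h1>\n"
    "<p> Keep this text </p>\n"
    "</div>\n"
    '<div class="footer">\n'
    "<h1> Remove me, I am pointless</h1>\n"
    "</div>\n</body>\n"
)

result = keep_between(page, '<div class="channel">', '<div class="footer">')
print(result)
```

Leaving out the end marker keeps everything from the channel div to the end of the file, which matches the first PHP snippet.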

Sleek way of un/commenting out html tags in markdown

I'm trying to find a nice way of wrapping html tags in html comments without writing 5 functions and 50 lines of code. Using an example code :
<section class="left span9">
### Test
</section>
I need to transform it to :
<!--<section class="left span9">-->
### Test
<!--</section>-->
I have a regex to find the tags
re.findall('<.*?>', str)
but I haven't used lambdas much in recent years, so I'm having a hard time getting it to work.
By the way, any ideas for the reverse of this process, uncommenting the tags?
You can comment/uncomment using a simple replace like this:
myString = '<section class="left span9">'
commented = myString.replace("<", "<!--<").replace(">", ">-->")
print(commented)
print(commented.replace("<!--", "").replace("-->", ""))
Output:
<!--<section class="left span9">-->
<section class="left span9">
Note: This works because a valid HTML document should have < and > only in its HTML tags. If they need to appear literally in the output, they have to be properly HTML-escaped as &gt; and &lt;
OK, so for now I've ended up using two functions and re.sub for that:
import re

def comment(match):
    # wrap the matched tag in an HTML comment
    return '<!--' + match.group(0) + '-->'

def uncomment(html):
    return html.replace('<!--', '').replace('-->', '')

commented_html = re.sub('<.*?>', comment, html_string)
uncommented_html = uncomment(commented_html)
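A slightly stricter variant of the same idea, as a sketch: the names comment_tags and uncomment_tags are made up here, and the patterns are my own assumption about what counts as a tag (anything from < to the next > that is not itself a comment). The lookbehind skips tags that are already wrapped, so running it twice does not double-comment, and the reverse only unwraps comments that directly enclose a tag.

```python
import re

# A tag that is not a comment and is not already preceded by "<!--".
TAG = re.compile(r"(?<!<!--)<[^!>][^>]*>")
# A comment that directly wraps exactly one tag.
WRAPPED_TAG = re.compile(r"<!--(<[^!>][^>]*>)-->")


def comment_tags(text):
    """Wrap every bare tag in an HTML comment (hypothetical helper)."""
    return TAG.sub(lambda m: "<!--" + m.group(0) + "-->", text)


def uncomment_tags(text):
    """Undo comment_tags, leaving other comments alone."""
    return WRAPPED_TAG.sub(r"\1", text)


source = '<section class="left span9">\n### Test\n</section>'
commented = comment_tags(source)
restored = uncomment_tags(commented)
print(commented)
print(restored)
```

Like the plain replace version, this still assumes that < and > only occur in tags; attribute values containing > would confuse it.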

How to translate text in a html-text?

I have a text field in HTML. The inserted text is formatted with things like lists, underlining, tables and such. I want to translate this text using pylupdate4. Can I wrap self.tr("""all the html stuff""") around it and translate it? Looking into the .ts file, the parts look very weird:
<message>
<location filename="Main.py" line="3526"/>
<source><byte value="xd"/>
<br><byte value="xd"/>
<p align="left", style="margin-left: 20px; margin-right:20px; margin-top:20px;"><byte value="xd"/>
test text ipsum blabla ... and so on tbc. <byte value="xd"/>
laurem ipsum blub rocknroll.<br><byte value="xd"/>
Moreover, Elvis is among us and jackson is his son PNEIS.<br><br><byte value="xd"/>;</source>
<translation type="unfinished"></translation>
</message>
What is a good approach to translating this? Inside a big string """...""" I can't type self.tr...

Extract artist and music From text (regex)

I have written the following regex, but it's not working. Can you please help me? Thank you :-)
track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>'''
rx = "<p><\/p><p>Artist\(s\): (.*?)<br\/>Music: (.*?)<br\/><\/p>"
m = re.search(rx, track_desc)
Output Should be:
Artist(s) David
Music: Ramana Gogula
You were ignoring the whitespace:
<p>[\s\n\r]*Artist\(s\)[\s\n\r]*(.*?)[\s\n\r]*:[\s\n\r]*<br/>[\s\n\r]*Music:[\s\n\r]*(.*?)<br/>[\s\n\r]*</p>
Output is:
[1] => "David"
[2] => "Ramana Gogula"
(note that your regex didn't match the Artist(s) and Music: prefixes either)
However for production code I would not rely on such rather clumsy regex (and equally clumsily formatted HTML source).
Seriously though, ditch the idea of using regex for this if you aren't the slightest bit familiar with regex (which is what it looks like). You're using the wrong tool and a badly formatted data source. Parsing HTML with regex is wrong in 9 out of 10 cases (see @bgporter's comment link) and doomed to fail. Apart from that, HTML is hardly ever an appropriate data source (unless there really, really is no alternative source).
import lxml.html as lh
import re
track_desc = '''
<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>
'''
tree = lh.fromstring(track_desc)
print(re.findall(r'Artist\(s\) (.+):\s*\nMusic: (.*\w)', tree.text_content()))
I see a few errors:
the text spans several lines, so the pattern has to match across line breaks (note that re.MULTILINE only changes how ^ and $ behave; it is \s, or . with re.DOTALL, that matches a newline)
whitespace is not taken into account
Artist(s) is not directly followed by :
As the web page is laid out rather strangely, relying on a regex here is error-prone and I wouldn't advise using it extensively.
Note, the following seems to work:
rx = r'Artist(?:\(s\))?\s+(.*?)\<br\/>\s+Music:\s*(.*?)\<br'
print("Art... : %s && Mus... : %s" % re.search(rx, track_desc, flags=re.MULTILINE).groups())
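Putting the points from the answers above together, a whitespace-tolerant version could look like the sketch below. The pattern and the group names artist and music are my own choices for illustration, and it assumes the markup always has the same shape as the sample in the question (Artist(s), a name, a colon, a br tag, then Music: and another br tag). It stays regex-only, so the earlier warnings about parsing HTML this way still apply.

```python
import re

track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>'''

# \s* absorbs spaces and newlines between the pieces; the names are
# captured lazily so trailing whitespace is not included in the groups.
pattern = re.compile(
    r"Artist\(s\)\s*(?P<artist>.*?)\s*:\s*<br\s*/?>"
    r"\s*Music:\s*(?P<music>.*?)\s*<br\s*/?>"
)

match = pattern.search(track_desc)
if match:
    print("Artist(s) %s" % match.group("artist"))
    print("Music: %s" % match.group("music"))
```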
