Why does lxml.etree.SubElement(body, "br") create <br/>? - python

I'm going through the lxml tutorial and I have a question:
Here is the code:
>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'
#############LOOK!!!!!!!############
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'
#############END####################
>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'
As you can see in the marked block, the statement br = etree.SubElement(body, "br") produces only a self-closing <br/> tag. Why is that?
Is br a reserved word?

Following someone's kind suggestion, I am posting my own answer here.
Look at this code first:
from lxml import etree

if __name__ == '__main__':
    print """Trying to create xml file like this:
<html><body>Hello<br/>World</body></html>"""
    html_node = etree.Element("html")
    body_node = etree.SubElement(html_node, "body")
    body_node.text = "Hello"
    print "Step1:" + etree.tostring(html_node)
    br_node = etree.SubElement(body_node, "br")
    print "Step2:" + etree.tostring(html_node)
    br_node.tail = "World"
    print "Step3:" + etree.tostring(html_node)
    br_node.text = "Yeah?"
    print "Step4:" + etree.tostring(html_node)
Here is the output:
Trying to create xml file like this:
<html><body>Hello<br/>World</body></html>
Step1:<html><body>Hello</body></html>
Step2:<html><body>Hello<br/></body></html>
Step3:<html><body>Hello<br/>World</body></html>
Step4:<html><body>Hello<br>Yeah?</br>World</body></html>
At first, what I was trying to figure out was:
Why is br_node written as <br/> rather than <br></br>?
Compare Step3 and Step4 and the answer is quite clear:
If an element has no content, it is written in the self-closing form <name/>.
Because <br> already has a meaning of its own in HTML, this easy question confused me for a long time.
Hope this post will help some guys like me.
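A minimal sketch of the rule above. It uses the standard library's xml.etree.ElementTree instead of lxml, which is an assumption made only to avoid a third-party dependency; ElementTree writes `<br />` with a space where lxml writes `<br/>`, but the rule is the same. An element with no text and no children is serialized as a self-closing tag, and `tail` does not count as content because it belongs after the element.

```python
import xml.etree.ElementTree as ET

html = ET.Element("html")
body = ET.SubElement(html, "body")
body.text = "Hello"
br = ET.SubElement(body, "br")
br.tail = "World"  # tail text follows the element, so <br> itself stays empty

# No text, no children: self-closing form
empty_form = ET.tostring(html, encoding="unicode")

br.text = "Yeah?"  # now the element has content of its own
filled_form = ET.tostring(html, encoding="unicode")

print(empty_form)
print(filled_form)
```

So "br" is not a reserved word; any tag name behaves this way.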

Related

Extract DOCX Comments

I'm a teacher. I want a list of all the students who commented on the essay I assigned, and what they said. The Drive API stuff was too challenging for me, but I figured I could download them as a zip and parse the XML.
The comments are stored in w:comment tags, with w:t for the comment text. It should be easy, but XML (etree) is killing me.
Following the tutorial (and the official Python docs):
z = zipfile.ZipFile('test.docx')
x = z.read('word/comments.xml')
tree = etree.XML(x)
Then I do this:
children = tree.getiterator()
for c in children:
    print(c.attrib)
Resulting in this:
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}author': 'Joe Shmoe', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id': '1', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}date': '2017-11-17T16:58:27Z'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidDel': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidP': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr': '00000000'}
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
And after this I am totally stuck. I've tried element.get() and element.findall() with no luck. Even when I copy/paste the value ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'), I get None in return.
Can anyone help?
You got remarkably far considering that OOXML is such a complex format.
Here's some sample Python code showing how to access the comments of a DOCX file via XPath:
from lxml import etree
import zipfile

ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_comments(docxFileName):
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment',namespaces=ooXMLns)
    for c in comments:
        # attributes:
        print(c.xpath('@w:author',namespaces=ooXMLns))
        print(c.xpath('@w:date',namespaces=ooXMLns))
        # string value of the comment:
        print(c.xpath('string(.)',namespaces=ooXMLns))
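The attribute keys printed in the question, such as '{http://...}author', are in Clark notation (the namespace URI in braces followed by the local name), and the same form works with .get(). The sketch below uses the standard library's xml.etree.ElementTree and a made-up, heavily simplified stand-in for word/comments.xml; the sample XML and the names W_NS, sample_xml, and results are assumptions for illustration. Note that .get() returns None on any element that does not carry the attribute, which may be why looping over every element appeared to fail.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, simplified stand-in for the contents of word/comments.xml
sample_xml = (
    '<w:comments xmlns:w="' + W_NS + '">'
    '<w:comment w:id="1" w:author="Joe Shmoe" w:date="2017-11-17T16:58:27Z">'
    '<w:p><w:r><w:t>Nice opening paragraph.</w:t></w:r></w:p>'
    '</w:comment>'
    '</w:comments>'
)

root = ET.fromstring(sample_xml)
results = []
# Visit only w:comment elements, so the author attribute is always present
for comment in root.iter("{%s}comment" % W_NS):
    author = comment.get("{%s}author" % W_NS)  # Clark notation: {uri}localname
    text = "".join(comment.itertext())         # all text nested inside the comment
    results.append((author, text))

print(results)
```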
Thank you @kjhughes for this amazing answer for extracting all the comments from the document file. Like others in this thread, I also needed the text that each comment relates to. I took the code from @kjhughes as a base and tried to solve this using python-docx. Here is my take on it.
Sample document.
I will extract each comment together with the paragraph it is attached to in the document.
from docx import Document
from lxml import etree
import zipfile

ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

#Function to extract all the comments of document(Same as accepted answer)
#Returns a dictionary with comment id as key and comment string as value
def get_document_comments(docxFileName):
    comments_dict={}
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment',namespaces=ooXMLns)
    for c in comments:
        comment=c.xpath('string(.)',namespaces=ooXMLns)
        comment_id=c.xpath('@w:id',namespaces=ooXMLns)[0]
        comments_dict[comment_id]=comment
    return comments_dict

#Function to fetch all the comments in a paragraph
def paragraph_comments(paragraph,comments_dict):
    comments=[]
    for run in paragraph.runs:
        comment_reference=run._r.xpath("./w:commentReference")
        if comment_reference:
            comment_id=comment_reference[0].xpath('@w:id',namespaces=ooXMLns)[0]
            comment=comments_dict[comment_id]
            comments.append(comment)
    return comments

#Function to fetch all comments with their referenced paragraph
#This will return list like this [{'Paragraph text': [comment 1,comment 2]}]
def comments_with_reference_paragraph(docxFileName):
    document = Document(docxFileName)
    comments_dict=get_document_comments(docxFileName)
    comments_with_their_reference_paragraph=[]
    for paragraph in document.paragraphs:
        if comments_dict:
            comments=paragraph_comments(paragraph,comments_dict)
            if comments:
                comments_with_their_reference_paragraph.append({paragraph.text: comments})
    return comments_with_their_reference_paragraph

if __name__=="__main__":
    document="test.docx" #filepath for the input document
    print(comments_with_reference_paragraph(document))
Output for the sample document looks like this.
I have done this at the paragraph level. It could be done at the python-docx run level as well.
Hopefully this will be of help.
I used the Word Object Model to extract comments with their replies from a Word document. Documentation on the Comments object can be found in Microsoft's Word VBA reference. That documentation is written for Visual Basic for Applications (VBA), but I was able to use the same functions in Python with slight modifications. The only issue with the Word Object Model is that I had to use the win32com package from pywin32, which works fine on a Windows PC, but I'm not sure whether it will work on macOS.
Here's the sample code I used to extract comments with their associated replies:
import win32com.client as win32
from win32com.client import constants

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
filepath = r"path\to\file.docx" #raw string, so \t and \f are not read as escape sequences

def get_comments(filepath):
    doc = word.Documents.Open(filepath)
    doc.Activate()
    activeDoc = word.ActiveDocument
    for c in activeDoc.Comments:
        if c.Ancestor is None: #checking if this is a top-level comment
            print("Comment by: " + c.Author)
            print("Comment text: " + c.Range.Text) #text of the comment
            print("Regarding: " + c.Scope.Text) #text of the original document where the comment is anchored
            if len(c.Replies)> 0: #if the comment has replies
                print("Number of replies: " + str(len(c.Replies)))
                for r in range(1, len(c.Replies)+1):
                    print("Reply by: " + c.Replies(r).Author)
                    print("Reply text: " + c.Replies(r).Range.Text) #text of the reply
    doc.Close()
If you also want the text that each comment relates to:
def get_document_comments(docxFileName):
    comments_dict = {}
    comments_of_dict = {}
    docx_zip = zipfile.ZipFile(docxFileName)
    comments_xml = docx_zip.read('word/comments.xml')
    comments_of_xml = docx_zip.read('word/document.xml')
    et_comments = etree.XML(comments_xml)
    et_comments_of = etree.XML(comments_of_xml)
    comments = et_comments.xpath('//w:comment', namespaces=ooXMLns)
    comments_of = et_comments_of.xpath('//w:commentRangeStart', namespaces=ooXMLns)
    for c in comments:
        comment = c.xpath('string(.)', namespaces=ooXMLns)
        comment_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
        comments_dict[comment_id] = comment
    for c in comments_of:
        comments_of_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
        # runs located between the matching commentRangeStart and commentRangeEnd
        parts = et_comments_of.xpath(
            "//w:r[preceding-sibling::w:commentRangeStart[@w:id=" + comments_of_id + "] and following-sibling::w:commentRangeEnd[@w:id=" + comments_of_id + "]]",
            namespaces=ooXMLns)
        comment_of = ''
        for part in parts:
            comment_of += part.xpath('string(.)', namespaces=ooXMLns)
        comments_of_dict[comments_of_id] = comment_of
    return comments_dict, comments_of_dict

Python XML modifying by ElementTree destroys the XML structure

I am using Python 3.5.1 on Windows to modify the text inside an XML element. The modification works, but after saving the tree every empty element is collapsed into a self-closing tag, as in this example:
<HOSTNAME></HOSTNAME> is changed to <HOSTNAME />
A child with text between the tags is fine:
<HOSTNAME>tnas2</HOSTNAME> is written as
<HOSTNAME>tnas2</HOSTNAME>, which is the same as the source.
The source XML file is:
<ROOT>
    <DeletedName>
        <VERIFY_DEST_SIZE>Y</VERIFY_DEST_SIZE>
        <VERIFY_BYTES>Y</VERIFY_BYTES>
        <TIMESTAMP>XXXXXXXXXDeletedXXXXXXXXXX</TIMESTAMP>
        <EM_USERS>XXXXXXXXXDeletedXXXXXXXXXX</EM_USERS>
        <EM_GROUPS></EM_GROUPS>
        <LOCAL>
            <HOSTNAME></HOSTNAME>
            <PORT></PORT>
            <USERNAME>XXXXXXXXXDeletedXXXXXXXXXX</USERNAME>
            <PASSWORD>XXXXXXXXXDeletedXXXXXXXXXX</PASSWORD>
            <HOME_DIR></HOME_DIR>
            <OS_TYPE>Windows</OS_TYPE>
        </LOCAL>
        <REMOTE>
            <HOSTNAME>DeletedHostName</HOSTNAME>
            <PORT>22</PORT>
            <USERNAME>XXXXXXXXXDeletedXXXXXXXXXX</USERNAME>
            <PASSWORD>XXXXXXXXXDeletedXXXXXXXXXX</PASSWORD>
            <HOME_DIR>XXXXXXXXXDeletedXXXXXXXXXX</HOME_DIR>
            <OS_TYPE>Unix</OS_TYPE>
            <CHAR_SET>UTF-8</CHAR_SET>
            <SFTP>Y</SFTP>
            <ENCRYPTION>Blowfish</ENCRYPTION>
            <COMPRESSION>N</COMPRESSION>
        </REMOTE>
    </DeletedName>
</ROOT>
The code is:
import os
import xml.etree.ElementTree as ET
from shutil import copyfile
import datetime

def AddAuthUserToAccountsFile(AccountsFile,RemoteMachine,UserToAdd):
    today = datetime.date.today()
    today = str(today)
    print(today)
    BackUpAccountsFile = AccountsFile + "-" + today
    try:
        tree = ET.parse(AccountsFile)
    except:
        pass
    try:
        copyfile(AccountsFile,BackUpAccountsFile)
    except:
        pass
    root = tree.getroot()
    UsersTags = tree.findall('.//EM_USERS')
    for UsersList in UsersTags:
        Users = UsersList.text
        Users = UsersList.text = Users.replace("||","|")
        if UserToAdd not in Users:
            print("The Users were : ",Users, "--->> Adding ",UserToAdd)
            UsersList.text = Users + UserToAdd +"|"
    tree.write(AccountsFile)
I would appreciate any help with this strange behavior.
Thanks,
Miki
OK, I found the solution:
adding method = "html" to the tree.write call keeps the empty tags as they were.
tree.write(AccountsFile,method = 'html')
Thanks.
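An alternative worth considering, sketched here under the assumption that Python 3.4 or later is in use: ElementTree.write() accepts a keyword-only short_empty_elements argument, and passing False keeps empty elements as an explicit open/close pair while still using the XML serializer. The method='html' workaround switches to the HTML serializer, which may treat some tag names and text escaping differently. The sample XML and the in-memory buffer below are illustrative only; the original code writes to AccountsFile.

```python
import io
import xml.etree.ElementTree as ET

# Small illustrative fragment modeled on the <LOCAL> block in the question
source = "<LOCAL><HOSTNAME></HOSTNAME><PORT></PORT><OS_TYPE>Windows</OS_TYPE></LOCAL>"
tree = ET.ElementTree(ET.fromstring(source))

# Write to an in-memory buffer instead of a file so the result can be inspected
buffer = io.BytesIO()
tree.write(buffer, short_empty_elements=False)
written = buffer.getvalue().decode("ascii")

print(written)
```

In the question's function, the equivalent call would be tree.write(AccountsFile, short_empty_elements=False).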

Parse Code from Text - Python

I am analyzing StackOverflow's dump file "Posts.Small.xml" using pySpark. I want to separate 'code block' from 'text' in a Row. A typical parsed row looks like:
['[u"<p>I want to use a track-bar to change a form\'s opacity.</p>
<p>This is my code:</p>
<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>
<p>When I try to build it, I get this error:</p>
<blockquote>
<p>Cannot implicitly convert type \'decimal\' to \'double\'.
</p>
</blockquote>
<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.',
'", u\'This code has worked fine for me in VB.NET in the past.',
'\', u"</p>
When setting a form\'s opacity should I use a decimal or double?"]']
I've tried "itertools" and some python functions but couldn't get the result.
My initial code to extract the above row is:
postsXml = textFile.filter(lambda line: not line.startswith("<?xml version="))
postsRDD = postsXml.map(............)
tokensentRDD = postsRDD.map(lambda x:(x[0], nltk.sent_tokenize(x[3])))
new = tokensentRDD.map(lambda x: x[1]).take(1)
a = ''.join(map(str,new))
b = a.replace("&lt;", "<")
final = b.replace("&gt;", ">")
nltk.sent_tokenize(final)
Any ideas are appreciated!
You can extract the code contents by using XPath (the lxml library will help) and then extract the text content selecting everything else, for example:
import lxml.etree
data = '''<p>I want to use a track-bar to change a form's opacity.</p>
<p>This is my code:</p> <pre><code>decimal trans = trackBar1.Value / 5000; this.Opacity = trans;</code></pre>
<p>When I try to build it, I get this error:</p>
<p>Cannot implicitly convert type 'decimal' to 'double'.</p>
<p>I tried making <code>trans</code> a <code>double</code>.</p>'''
html = lxml.etree.HTML(data)
code_blocks = html.xpath('//code/text()')
text_blocks = html.xpath('//*[not(descendant-or-self::code)]/text()')
The easiest way will probably be to apply a regex to the text, matching the tags '<code>' and '</code>'. That would enable you to find the code blocks. You don't say what you would do with them afterwards, though. So ...
from itertools import zip_longest
sample_paras = [
"""<p>I want to use a track-bar to change a form\'s opacity.</p>
<p>This is my code:</p>
<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>
<p>When I try to build it, I get this error:</p>
<blockquote>
<p>Cannot implicitly convert type \'decimal\' to \'double\'. </p>
</blockquote>
<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.""",
"""This code has worked fine for me in VB.NET in the past.""",
"""</p>
When setting a form\'s opacity should I use a decimal or double?""",
]
single_block = " ".join(sample_paras)
import re
separate_code = re.split(r"</?code>", single_block)
text_blocks, code_blocks = zip(*zip_longest(*[iter(separate_code)] * 2))
print("Text:\n")
for t in text_blocks:
    print("--")
    print(t)
print("\n\nCode:\n")
for t in code_blocks:
    print("--")
    print(t)
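A compact variation on the regex idea, offered as a sketch rather than a complete solution. It assumes the post body arrives with HTML entities still escaped (as in the dump) and decodes them with html.unescape from the standard library instead of chained replace() calls. The sample string and the names raw_body, code_blocks, and plain_text are made up for illustration. Regexes are fragile on arbitrary HTML, so a real parser may be the safer choice for messy input.

```python
import html
import re

# Hypothetical escaped post body, as it might appear in a dump row
raw_body = (
    "&lt;p&gt;This is my code:&lt;/p&gt;"
    "&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&lt;/code&gt;&lt;/pre&gt;"
    "&lt;p&gt;I tried making &lt;code&gt;trans&lt;/code&gt; a double.&lt;/p&gt;"
)

body = html.unescape(raw_body)  # turns &lt; and &gt; back into < and >

# Everything between <code> and </code>, including multi-line blocks
code_blocks = re.findall(r"<code>(.*?)</code>", body, flags=re.DOTALL)

# Drop the code spans, then strip the remaining tags and collapse whitespace
without_code = re.sub(r"<code>.*?</code>", " ", body, flags=re.DOTALL)
plain_text = " ".join(re.sub(r"<[^>]+>", " ", without_code).split())

print(code_blocks)
print(plain_text)
```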

how to cast a variable in xpath python

from lxml import html
import requests
pagina = 'http://www.beleggen.nl/amx'
page = requests.get(pagina)
tree = html.fromstring(page.text)
aandeel = tree.xpath('//a[@title="Imtech"]/text()')
print aandeel
This part works, but I want to read multiple lines with different titles. Is it possible to change the "Imtech" part to a variable?
Something like this. It obviously doesn't work, but where did I go wrong? Or is it not quite this easy?
FondsName = "Imtech"
aandeel = tree.xpath('//a[@title="%s"]/text()')%(FondsName)
print aandeel
You were almost right. The % formatting has to be applied to the string, inside the call to xpath():
variabelen = [var1,var2,var3]
for var in variabelen:
    aandeel = tree.xpath('//a[@title="%s"]/text()' % var)
XPath allows $variables, and lxml's .xpath() method lets you supply values for those variables as keyword arguments: .xpath('$variable', variable='my value')
Using your example, here's how you'd do it:
fonds_name = 'Imtech'
# $title is not quoted: inside quotes it would be a literal string
aandeel = tree.xpath('//a[@title=$title]/text()', title=fonds_name)
print(aandeel)
See lxml's docs for more info: http://lxml.de/xpathxslt.html#the-xpath-method
Almost...
FondsName = "Imtech"
aandeel = tree.xpath('//a[@title="%s"]/text()'%FondsName)
print aandeel
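To compare the approaches, here is a small self-contained sketch of the XPath-variable form looping over several titles. The HTML fragment and the second title "Fugro" are made up for illustration and stand in for the downloaded page. One practical advantage of variables over % formatting is that a title containing a quote character cannot break the expression.

```python
from lxml import html

# Hypothetical page fragment standing in for the downloaded document
page_source = (
    '<html><body>'
    '<a title="Imtech" href="#">Imtech NV</a>'
    '<a title="Fugro" href="#">Fugro NV</a>'
    '</body></html>'
)
tree = html.fromstring(page_source)

found = {}
for fonds_name in ["Imtech", "Fugro"]:
    # $title is bound by the keyword argument; it is not wrapped in quotes
    found[fonds_name] = tree.xpath('//a[@title=$title]/text()', title=fonds_name)

print(found)
```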

Generating Xml using python

Please have a look at the code below, which I am using to generate XML with Python.
from lxml import etree
# Some dummy text
conn_id = 5
conn_name = "Airtelll"
conn_desc = "Largets TRelecome"
ip = "192.168.1.23"
# Building the XML tree
# Note how attributes and text are added, using the Element methods
# and not by concatenating strings as in your question
root = etree.Element("ispinfo")
child = etree.SubElement(root, 'connection',
                         number = str(conn_id),
                         name = conn_name,
                         desc = conn_desc)
subchild_ip = etree.SubElement(child, 'ip_address')
subchild_ip.text = ip
# and pretty-printing it
print etree.tostring(root, pretty_print=True)
This will produce:
<ispinfo>
  <connection desc="Largets TRelecome" number="5" name="Airtelll">
    <ip_address>192.168.1.23</ip_address>
  </connection>
</ispinfo>
But I want it to look like this:
<ispinfo>
  <connection desc="Largets TRelecome" number='1' name="Airtelll">
    <ip_address>192.168.1.23</ip_address>
  </connection>
</ispinfo>
That is, the number attribute should be in single quotes. Any idea how I can achieve this?
There is no flag in lxml to do this, so you have to resort to manual manipulation.
import re
re.sub(r'number="([0-9]+)"',r"number='\1'", etree.tostring(root, pretty_print=True))
However, why do you want to do this? There is no difference other than cosmetics.
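The sketch below illustrates both points: the regex post-processing, and the fact that an XML parser reads the single-quoted and double-quoted forms identically. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made to keep the snippet dependency-free) and serializes to a text string so that re.sub works on Python 3, where tostring() otherwise returns bytes.

```python
import re
import xml.etree.ElementTree as ET

root = ET.Element("ispinfo")
child = ET.SubElement(root, "connection", number="5", name="Airtelll")
ET.SubElement(child, "ip_address").text = "192.168.1.23"

# encoding="unicode" yields a str instead of bytes
serialized = ET.tostring(root, encoding="unicode")

# Swap the quotes around the number attribute only
single_quoted = re.sub(r'number="([0-9]+)"', r"number='\1'", serialized)

# Parsing the modified text gives back exactly the same attribute value
reparsed = ET.fromstring(single_quoted)

print(single_quoted)
print(reparsed.find("connection").get("number"))
```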
