I'm a teacher. I want a list of all the students who commented on the essay I assigned, and what they said. The Drive API stuff was too challenging for me, but I figured I could download them as a zip and parse the XML.
The comments are tagged in w:comment elements, with w:t elements holding the comment text. It should be easy, but XML (etree) is killing me.
Following the tutorial (and the official Python docs):
import zipfile
from lxml import etree  # or: from xml.etree import ElementTree as etree

z = zipfile.ZipFile('test.docx')
x = z.read('word/comments.xml')
tree = etree.XML(x)
Then I do this:
children = tree.getiterator()
for c in children:
    print(c.attrib)
Resulting in this:
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}author': 'Joe Shmoe', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id': '1', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}date': '2017-11-17T16:58:27Z'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidDel': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidP': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr': '00000000'}
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
And after this I am totally stuck. I've tried element.get() and element.findall() with no luck. Even when I copy/paste the value ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'), I get None in return.
Can anyone help?
You got remarkably far considering that OOXML is such a complex format.
Here's some sample Python code showing how to access the comments of a DOCX file via XPath:
from lxml import etree
import zipfile

ooXMLns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_comments(docxFileName):
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment', namespaces=ooXMLns)
    for c in comments:
        # attributes:
        print(c.xpath('@w:author', namespaces=ooXMLns))
        print(c.xpath('@w:date', namespaces=ooXMLns))
        # string value of the comment:
        print(c.xpath('string(.)', namespaces=ooXMLns))
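Since the goal is a list of who commented and what they said, a small variation (just a sketch, reusing the same ooXMLns mapping) could return (author, text) pairs instead of printing them:

def get_comments_list(docxFileName):
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    results = []
    for c in et.xpath('//w:comment', namespaces=ooXMLns):
        author = c.xpath('string(@w:author)', namespaces=ooXMLns)
        text = c.xpath('string(.)', namespaces=ooXMLns)
        results.append((author, text))
    return results

for author, text in get_comments_list('test.docx'):
    print(author, ':', text)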
Thank you @kjhughes for this amazing answer for extracting all the comments from the document file. I was facing the same issue as others in this thread: getting the text that the comment relates to. I took the code from @kjhughes as a base and tried to solve this using python-docx. So here is my take at it.
Sample document.
I will extract each comment and the paragraph it was referenced in within the document.
from docx import Document
from lxml import etree
import zipfile

ooXMLns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}


# Function to extract all the comments of a document (same as the accepted answer)
# Returns a dictionary with comment id as key and comment string as value
def get_document_comments(docxFileName):
    comments_dict = {}
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment', namespaces=ooXMLns)
    for c in comments:
        comment = c.xpath('string(.)', namespaces=ooXMLns)
        comment_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
        comments_dict[comment_id] = comment
    return comments_dict


# Function to fetch all the comments in a paragraph
def paragraph_comments(paragraph, comments_dict):
    comments = []
    for run in paragraph.runs:
        comment_reference = run._r.xpath("./w:commentReference")
        if comment_reference:
            comment_id = comment_reference[0].xpath('@w:id', namespaces=ooXMLns)[0]
            comment = comments_dict[comment_id]
            comments.append(comment)
    return comments


# Function to fetch all comments with their referenced paragraph
# This will return a list like this: [{'Paragraph text': [comment 1, comment 2]}]
def comments_with_reference_paragraph(docxFileName):
    document = Document(docxFileName)
    comments_dict = get_document_comments(docxFileName)
    comments_with_their_reference_paragraph = []
    for paragraph in document.paragraphs:
        if comments_dict:
            comments = paragraph_comments(paragraph, comments_dict)
            if comments:
                comments_with_their_reference_paragraph.append({paragraph.text: comments})
    return comments_with_their_reference_paragraph


if __name__ == "__main__":
    document = "test.docx"  # filepath for the input document
    print(comments_with_reference_paragraph(document))
Output for the sample document looks like the [{'Paragraph text': [comment 1, comment 2]}] structure described above.
I have done this at a paragraph level. This could be done at a python-docx run level as well.
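For reference, a rough sketch of the run-level variant (names are illustrative; note that the run carrying w:commentReference is often just the reference mark, so its text may be empty):

def comments_with_reference_runs(docxFileName):
    document = Document(docxFileName)
    comments_dict = get_document_comments(docxFileName)
    results = []
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            comment_reference = run._r.xpath("./w:commentReference")
            if comment_reference:
                comment_id = comment_reference[0].xpath('@w:id', namespaces=ooXMLns)[0]
                results.append({run.text: comments_dict[comment_id]})
    return results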
Hopefully it will be of help.
I used the Word Object Model to extract comments with replies from a Word document. Documentation on the Comments object can be found here. The documentation uses Visual Basic for Applications (VBA), but I was able to use the functions in Python with slight modifications. The only issue with the Word Object Model is that I had to use the win32com package from pywin32, which works fine on a Windows PC, but I'm not sure whether it will work on macOS.
Here's the sample code I used to extract comments with associated replies:
import win32com.client as win32
from win32com.client import constants

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False

filepath = r"path\to\file.docx"


def get_comments(filepath):
    doc = word.Documents.Open(filepath)
    doc.Activate()
    activeDoc = word.ActiveDocument
    for c in activeDoc.Comments:
        if c.Ancestor is None:  # checking if this is a top-level comment
            print("Comment by: " + c.Author)
            print("Comment text: " + c.Range.Text)  # text of the comment
            print("Regarding: " + c.Scope.Text)  # text of the original document where the comment is anchored
            if len(c.Replies) > 0:  # if the comment has replies
                print("Number of replies: " + str(len(c.Replies)))
                for r in range(1, len(c.Replies) + 1):
                    print("Reply by: " + c.Replies(r).Author)
                    print("Reply text: " + c.Replies(r).Range.Text)  # text of the reply
    doc.Close()
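A possible way to call it (my assumption: Word's COM interface resolves relative paths against its own working directory, so an absolute path is safer, and you may want to quit Word when you're done):

import os

get_comments(os.path.abspath("file.docx"))
word.Quit()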
If you also want the text the comments relate to:
def get_document_comments(docxFileName):
    comments_dict = {}
    comments_of_dict = {}
    docx_zip = zipfile.ZipFile(docxFileName)
    comments_xml = docx_zip.read('word/comments.xml')
    comments_of_xml = docx_zip.read('word/document.xml')
    et_comments = etree.XML(comments_xml)
    et_comments_of = etree.XML(comments_of_xml)
    comments = et_comments.xpath('//w:comment', namespaces=ooXMLns)
    comments_of = et_comments_of.xpath('//w:commentRangeStart', namespaces=ooXMLns)
    for c in comments:
        comment = c.xpath('string(.)', namespaces=ooXMLns)
        comment_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
        comments_dict[comment_id] = comment
    for c in comments_of:
        comments_of_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
        parts = et_comments_of.xpath(
            "//w:r[preceding-sibling::w:commentRangeStart[@w:id=" + comments_of_id + "] and following-sibling::w:commentRangeEnd[@w:id=" + comments_of_id + "]]",
            namespaces=ooXMLns)
        comment_of = ''
        for part in parts:
            comment_of += part.xpath('string(.)', namespaces=ooXMLns)
        comments_of_dict[comments_of_id] = comment_of
    return comments_dict, comments_of_dict
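A quick usage sketch (it assumes the imports and the ooXMLns dict from the earlier answers; the pairing below is just illustrative):

comments, anchors = get_document_comments('test.docx')
for comment_id, comment_text in comments.items():
    print(anchors.get(comment_id, ''), '->', comment_text)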
Related
I created a Python script in VS Code that lets me run text searches inside an EPUB book; the searches match the book's text against regular expressions. These regular expressions come from patterns I formulated for the tags in my library. I have already collected over 400 tags this way and have a custom column for them; I add the # symbol at the beginning to differentiate them from tags downloaded from other sources. I have 3000+ books and I want each of them to be searched with these 400+ regular expressions.
I need help because my code only handles the search in a single book. What I want to set up is:
**Run the code on selected books from my library (books_ids).**
**Found tags are added to the metadata.**
**Add a verification tag confirming that the book was processed.**
Code:
import re
import ast
from epub_conversion.utils import open_book, convert_epub_to_lines
import colorama
colorama.init()
"test_dict.txt = {'#publication_(*history)': r'\bpublication\b[^.]*\bhistory',
'#horror_fiction': r'\bhorror fiction',
'#story_(*writer)': r'\bstory\b[^.]*\bwriter',
'#published_(*books)': r'\bpublished\b[^.]*\bbooks',
'#books_(*poems)': r'\bbook[^.]*\bpoem',
'#new_discovery': r'\bnew discovery',
'#weird_tales': r'\bweird tales',
'#literary_(*importance)': r'\bliterary[^.]*\bimportance',
}"
book = open_book("Cthulhu Mythos.epub")
lines = convert_epub_to_lines(book)

with open("test_dict.txt", "r") as data:
    tags_dict = ast.literal_eval(data.read())

print(colorama.Back.YELLOW + 'Matches(regex - book text):', colorama.Style.RESET_ALL)
temp = []
res = dict()
for line in lines:
    for key, value in tags_dict.items():
        if re.search(rf'{value}', line):
            if value not in temp:
                temp.append(value)
                res[key] = value
                regex = re.compile(value)
                match_array = regex.finditer(line)
                match_list = list(match_array)
                for m in match_list:
                    print(colorama.Fore.MAGENTA + key, ":", colorama.Style.RESET_ALL + m.group())

print('\n', colorama.Back.YELLOW + 'Found tags:', colorama.Style.RESET_ALL)
temp = []
res = dict()
for line in lines:
    for key, value in tags_dict.items():
        if re.search(rf'{value}', line):
            if value not in temp:
                temp.append(value)
                res[key] = value
                print(colorama.Fore.GREEN + key, end=", ")

print('\n\n' + colorama.Back.YELLOW + "N° found tags:", colorama.Style.RESET_ALL, len(temp))
This prints:
Matches(regex - book text):
#story_(*writer) : story writer
#publication_(*history) : publication in the history
#horror_fiction : horror fiction
#published_(*books) : published three books
#books_(*poems) : books of poems professionally, and had even sold a couple of prose poem
#new_discovery : new discovery
#literary_(*importance) : literary outpouring of prodigious wordage and importance
Found tags:
#story_(*writer), #publication_(*history), #horror_fiction, #published_(*books), #books_(*poems), #new_discovery, #literary_(*importance),
N° found tags: 7
https://i.stack.imgur.com/iBE1f.png
The truth is that my knowledge of Python is very poor; I only learned about regular expressions thanks to Calibre.
I'd appreciate any help with the code, thank you very much.
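A minimal sketch of how the per-book matching could be factored into a reusable function and run over several files (the file list and the Calibre side of things — selecting book_ids, writing tags back to metadata, adding the verification tag — are placeholders and not shown):

def find_tags(epub_path, tags_dict):
    # Return the set of tag keys whose regex matches anywhere in the book.
    book = open_book(epub_path)
    lines = convert_epub_to_lines(book)
    found = set()
    for line in lines:
        for key, value in tags_dict.items():
            if key not in found and re.search(value, line):
                found.add(key)
    return found

# Hypothetical usage over a list of EPUB files:
for path in ["Cthulhu Mythos.epub"]:  # replace with your own list of book files
    print(path, "->", ", ".join(sorted(find_tags(path, tags_dict))))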
I am reading in an HTML document and want to store the HTML nested within a div tag of a certain name, while maintaining its structure (the spacing). This is so I can convert an HTML doc into components for React. I am struggling with how to store the structure of the nested HTML and how to locate the correct closing tag for the div that denotes that everything nested within it will become a React component (div class='rc-componentname' is the opening tag). Any help would be very appreciated. Thanks!
Edit: I assume regex is the best way to go about this. I haven't used regex before, so if that is correct, it would be great if someone could point me in the right direction for the expression used in this context.
import os

components = []


class react_template():
    def __init__(self, component_name):  # add nested html as second element
        self.Import = "import React, { Component } from 'react';"
        self.Class = "Class " + component_name + ' extends Component {'
        self.Render = "render() {"
        self.Return = "return "
        self.Export = "Default export " + component_name + ";"


def react(component):
    r = react_template(component)
    if not os.path.exists('components'):  # create components folder
        os.mkdir('components')
    os.chdir('components')
    if not os.path.exists(component):  # create folder for component
        os.mkdir(component)
    os.chdir(component)
    with open(component + '.js', 'wb') as f:  # create js component file
        for j_key, j_code in r.__dict__.items():
            f.write(j_code.encode('utf-8') + '\n'.encode('utf-8'))
        f.close()


def process_html():
    with open('file.html', 'r') as f:
        for line in f:
            if 'rc-' in line:
                char_soup = list(line)
                for index, char in enumerate(char_soup):
                    if char == 'r' and char_soup[index+1] == 'c' and char_soup[index+2] == '-':
                        sliced_soup = char_soup[int(index+3):]
                        c_slice_index = sliced_soup.index("\'")
                        component = "".join(sliced_soup[:c_slice_index])
                        components.append(component)
                        innerHTML(sliced_soup)
                        # react(component)


def innerHTML(sliced_soup):  # work in progress
    first_closing = sliced_soup.index(">")
    sliced_soup = "".join(sliced_soup[first_closing:]).split(" ")


def generate_components(components):
    for c in components:
        react(c)


if __name__ == "__main__":
    process_html()
I see you've used the word soup in your code... maybe you've already tried and disliked BeautifulSoup? If you haven't tried it, I'd recommend you look at BeautifulSoup instead of attempting to parse HTML with regex. Although regex would be sufficient for a single tag or even a handful of tags, markup languages are deceptively simple. BeautifulSoup is a fine library and can make things easier for dealing with markup.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
This will allow you to treat the entirety of your html as a single object and enable you to:
# create a list of specific elements as objects
soup.find_all('div')
# find a specific element by id
soup.find(id="custom-header")
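For the specific case in the question, a rough sketch (the 'rc-' class prefix and file name come from the question; decode_contents() returns a tag's inner markup with its structure intact):

from bs4 import BeautifulSoup

with open('file.html', 'r') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# find every div whose class starts with 'rc-' and keep its inner HTML
for div in soup.find_all('div', class_=lambda c: c and c.startswith('rc-')):
    component_name = div['class'][0][len('rc-'):]
    inner_html = div.decode_contents()  # nested markup, spacing preserved
    print(component_name, inner_html)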
Apologies if this isn't totally clear - I'm a Python copy-the-code-and-try-to-make-it-work developer.
I'm using the Google NLP API in Python 2.7.
When I use analyze_entities(), I can get and print the name, entity type and salience.
Mentions is supposed to contain the noun type: PROPER or COMMON, per this page:
https://cloud.google.com/natural-language/docs/reference/rest/v1beta1/Entity#EntityMention
I can't get mention type from the returned dictionary.
Here's my hideous code:
def entities_text(text, client):
    """Detects entities in the text."""
    language_client = client
    # Instantiates a plain text document.
    document = language_client.document_from_text(text)
    # Detects entities in the document. You can also analyze HTML with:
    #   document.doc_type == language.Document.HTML
    entities = document.analyze_entities()
    return entities


articles = os.listdir('articles')

for f in articles:
    language_client = language.Client()
    fname = "articles/" + f
    thisfile = open(fname, 'r')
    content = thisfile.read()
    entities = entities_text(content, language_client)
    for e in entities:
        name = e.name.strip()
        type = e.entity_type.strip()
        if e.name.strip()[0].isupper() and len(e.name.strip()) > 2:
            print name, type, e.salience, e.mentions
That returns this:
RELATED OTHER 0.0019081507 [u'RELATED']
Zoe 3 PERSON 0.0016676666 [u'Zoe 3']
Where the value in [] is the mentions.
If I try to get mentions.type, I get an attribute not found error.
I'd appreciate any input.
1) Do not call the "AnalyzeEntities" function, but call the "AnnotateText" one instead.
2) Check for "Proper". Examine its value, it should be "PROPER" and not "PROPER_UNKNOWN" nor "NOT_PROPER".
So I've got this bot that I want to use to reply with the box score of the mets game anytime someone says "mets score" on a specific subreddit. This is my first python project and I plan on using it on a dummy subreddit I created as a learning tool. I'm having trouble sending the scores from the website I scraped through the bot so it can appear in the reply to the "mets score" comments. Any suggestions?
import praw
import time
from lxml import html
import requests
from bs4 import BeautifulSoup

r = praw.Reddit(user_agent = 'my_first_bot')
r.login('user_name', 'password')


def scores():
    soup = BeautifulSoup(requests.get("http://scores.nbcsports.com/mlb/scoreboard.asp?day=20160621&meta=true").content, "lxml")
    table = soup.find("a", class_="teamName", text="NY Mets").find_previous("table")
    a, b = [a.text for a in table.find_all("a", class_="teamName")]
    inn, a_score, b_score = ([td.text for td in row.select("td.shsTotD")] for row in table.find_all("tr"))
    print(" ".join(inn))
    print("{}: {}".format(a, " ".join(a_score)))
    print("{}: {}".format(b, " ".join(b_score)))


words_to_match = ['mets score']
cache = []


def run_bot():
    print("Grabbing subreddit...")
    subreddit = r.get_subreddit("random_subreddit")
    print("Grabbing comments...")
    comments = subreddit.get_comments(limit=40)
    for comment in comments:
        print(comment.id)
        comment_text = comment.body.lower()
        isMatch = any(string in comment_text for string in words_to_match)
        if comment.id not in cache and isMatch:
            print("match found!" + comment.id)
            comment.reply('heres the score to last nights mets game...' scores())
            print("reply successful")
            cache.append(comment.id)
    print("loop finished, goodnight")


while True:
    run_bot()
    time.sleep(120)
I think I'll just put you out of your misery ;). There are multiple issues with your code snippet:
comment.reply('heres the score to last nights mets game...' scores())
The .reply() method requires a string or an object that can have a good enough representation as a string. Assuming the method scores() returns a string, you should concatenate the two arguments, like this:
comment.reply('heres the score to last nights mets game...'+ scores())
It looks like your knowledge of basic python syntax and constructs is dusty. For a quick refresher see this.
Your method scores() doesn't return anything. It just prints out a bunch of lines (which I assume are for debugging purposes).
def scores():
    soup = BeautifulSoup(requests.get("http://scores.nbcsports.com/mlb/scoreboard.asp?day=20160621&meta=true").content, "lxml")
    .......
    print(" ".join(inn))
    print("{}: {}".format(a, " ".join(a_score)))
    print("{}: {}".format(b, " ".join(b_score)))
Funnily enough, you could use those exact strings for your return value (or maybe something else entirely, as suits your needs) like this:
def scores():
    .......
    inn_string = " ".join(inn)
    a_string = "{}: {}".format(a, " ".join(a_score))
    b_string = "{}: {}".format(b, " ".join(b_score))
    return "\n".join([inn_string, a_string, b_string])
These should get you up and running.
More advice: Have you read the Reddit PRAW docs? You should. You should also probably use praw.helpers.comment_stream(). It's simple and easy to use and will handle retrieving new comments for you. Currently you try to fetch a maximum of 40 comments every 120 seconds. What happens when there are more than that many relevant comments in that 120-second span? You'll end up missing some of the comments you should've replied to. comment_stream() will take care of rate limiting for you so that your bot can reply to each new comment that needs its attention at its own pace. Read more about this here.
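For illustration only, a rough sketch of what a comment_stream-based loop could look like with PRAW 3.x (it reuses the fixed scores() function from above; the subreddit name and keyword list are placeholders):

import praw

r = praw.Reddit(user_agent='my_first_bot')
r.login('user_name', 'password')

replied_to = set()
# comment_stream yields new comments as they arrive and handles rate limiting
for comment in praw.helpers.comment_stream(r, 'random_subreddit', limit=None):
    text = comment.body.lower()
    if comment.id not in replied_to and any(w in text for w in ['mets score']):
        comment.reply('heres the score to last nights mets game...\n\n' + scores())
        replied_to.add(comment.id)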
I'm going through the lxml tutorial and I have a question:
Here is the code:
>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
b'<html><body>TEXT</body></html>'
#############LOOK!!!!!!!############
>>> br = etree.SubElement(body, "br")
>>> etree.tostring(html)
b'<html><body>TEXT<br/></body></html>'
#############END####################
>>> br.tail = "TAIL"
>>> etree.tostring(html)
b'<html><body>TEXT<br/>TAIL</body></html>'
As you can see in the marked block, the instruction br = etree.SubElement(body, "br") creates only an empty <br/> element. Why is that?
Is br a reserved word?
Thanks to someone's kind reminder, I'm posting my answer here:
Look at this code first:
from lxml import etree

if __name__ == '__main__':
    print """Trying to create xml file like this:
<html><body>Hello<br/>World</body></html>"""

    html_node = etree.Element("html")
    body_node = etree.SubElement(html_node, "body")
    body_node.text = "Hello"
    print "Step1:" + etree.tostring(html_node)

    br_node = etree.SubElement(body_node, "br")
    print "Step2:" + etree.tostring(html_node)

    br_node.tail = "World"
    print "Step3:" + etree.tostring(html_node)

    br_node.text = "Yeah?"
    print "Step4:" + etree.tostring(html_node)
Here is the output:
Trying to create xml file like this:
<html><body>Hello<br/>World</body></html>
Step1:<html><body>Hello</body></html>
Step2:<html><body>Hello<br/></body></html>
Step3:<html><body>Hello<br/>World</body></html>
Step4:<html><body>Hello<br>Yeah?</br>World</body></html>
At first, what I was trying to figure out was:
Why is the output of br_node <br/> rather than <br></br>?
If you check Step 3 and Step 4, the answer is quite clear:
If the element has no content, its output format will be <name/>.
Because of the existing semantics of <br> in HTML, this easy question confused me for a long time.
Hope this post will help some guys like me.