Detect word and page of occurrence in Word Document - python

I am trying to detect specific words (with a regex pattern that I already have) in a Word document. I not only want to detect the word but also to know on which page it appears; I am thinking of something like a list of tuples: [(WordA, 10), (WordB, 4), ...]
I am able to extract the text from the Word document and detect all the words that match the regex pattern, but I am not able to determine on which page each word appears. Also, I want to detect all the occurrences regardless of whether they appear in the header, body, or footnotes.
Here is my regex pattern:
pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
Extraction of text:
import docx2txt
result = docx2txt.process("Word_Document.docx")
Thank you in advance,

I just wanted to say thank you to those who tried to answer this question. I found two solutions:
Splitting the Word document into one document per page with Aspose:
https://products.aspose.cloud/words/python/split/
Converting the Word document into a PDF and then creating one PDF per page with PyPDF2 or another library (a sketch of this route follows below)
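For the second route, here is a minimal sketch of the per-page regex search, assuming the document has already been converted to a PDF named Word_Document.pdf (hypothetical name) and using pypdf, PyPDF2's maintained successor:
import re
from pypdf import PdfReader

pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
reader = PdfReader("Word_Document.pdf")  # converted copy of the .docx (hypothetical name)
matches = []
for page_num, page in enumerate(reader.pages, start=1):
    for match in pattern.findall(page.extract_text() or ''):
        matches.append((match, page_num))
print(matches)  # e.g. [('DOC-123456789', 10), ...]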

Ok, after a while of trying to figure this out, I managed to get this:
import docx2txt.docx2txt as docx2txt
import re

page_contents = []

def xml2text(xml):
    text = u''
    root = docx2txt.ET.fromstring(xml)
    start = 0
    for child in root.iter():
        if child.tag == docx2txt.qn('w:t'):
            t_text = child.text
            text += t_text if t_text is not None else ''
        elif child.tag == docx2txt.qn('w:tab'):
            text += '\t'
        elif child.tag in (docx2txt.qn('w:br'), docx2txt.qn('w:cr')):
            text += '\n'
        elif child.tag == docx2txt.qn("w:p"):
            text += '\n\n'
        elif child.tag == docx2txt.qn('w:lastRenderedPageBreak'):
            # a new page starts here: save everything since the last break
            end = len(text) + 1
            page_contents.append(text[start:end])
            start = len(text)
    # save the remainder after the last rendered page break
    page_contents.append(text[start:len(text) + 1])
    return text

# monkey-patch the module's parser so process() records page contents
docx2txt.xml2text = xml2text
docx2txt.process('test_file.docx')  # use your filename

matches = []
pattern = re.compile(r'\bDOC[-–—]\d{9}(?!\d)')
for page_num, page_content in enumerate(page_contents, start=1):
    # do regex search
    all_matches = pattern.findall(page_content)
    if all_matches:
        for match in all_matches:
            matches.append((match, page_num))
print(matches)
It monkey-patches the module's xml2text parser so that, when process() is called, it additionally detects each rendered page break and appends that page's contents to a global list; each entry's index + 1 is its page number. It relies on the 'lastRenderedPageBreak' tag, so one slight caution: if you have edited the file, save it first so that the placement of these tags gets updated.

Related

Replacing String Text That Contains Double Quotes

I have a number series contained in a string, and I want to remove everything but the number series. But the double quotes are giving me errors. Here are examples of the strings and a sample command that I have used. All I want is 127.60-02-15, 127.60-02-16, etc.
<span id="lblTaxMapNum">127.60-02-15</span>
<span id="lblTaxMapNum">127.60-02-16</span>
I have tried all sorts of methods (e.g., triple double quotes, single quotes, quotes with backslashes, etc.). Here is one inelegant way that still isn't working because it's still leaving ">:
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
Here is what I am working with (more specific code). I'm retrieving the data from a CSV and just trying to clean it up.
text = open("outputA.csv", "r")
text = ''.join([i for i in text])
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
outputB = open("outputB.csv", "w")
outputB.writelines(text)
outputB.close()
If you add a > in the second replace, it is still not elegant, but it works:
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\">", "")
text = text.replace("</span>", "")
Alternatively, you could use a regex:
import re
text = "<span id=\"lblTaxMapNum\">127.60-02-16</span>"
pattern = r".*>(\d*\.\d*-\d*-\d*)\D*" # the group in parentheses matches the number
match = re.search(pattern, text) # this searches for the pattern in the text
print(match.group(1)) # this prints out only the number
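Applied to the question's file-based workflow, the regex route might look like this (a sketch, reusing the outputA.csv / outputB.csv names from the question):
import re

pattern = re.compile(r">(\d+\.\d+-\d+-\d+)<")  # number series between the tags
with open("outputA.csv", "r") as fin, open("outputB.csv", "w") as fout:
    for line in fin:
        match = pattern.search(line)
        if match:
            fout.write(match.group(1) + "\n")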
You can use BeautifulSoup.
from bs4 import BeautifulSoup

strings = ['<span id="lblTaxMapNum">127.60-02-15</span>', '<span id="lblTaxMapNum">127.60-02-16</span>']

# Use BeautifulSoup to extract the text from the <span> tags
for string in strings:
    soup = BeautifulSoup(string, 'html.parser')
    number_series = soup.span.text
    print(number_series)
output:
127.60-02-15
127.60-02-16
It's a little bit long; I hope my comments are readable.
path = r'c:\users\GH\desktop\test.csv'
with open(path, 'r') as f:
    text = f.read().strip()

stRange = '<'   # we are going to remove the unwanted text from our file by using the (range index) method,
endRange = '>'  # which means removing all extra literals between < and >

# cast our data to a list to be able to modify it by referring to its components by index number
text = list(text)

i = 0
length = len(text)
# we're going to manipulate our text while we are iterating over it,
# so we have to declare a variable that we can change while iterating
while i < length:
    if text[i] == stRange:
        stRange = text.index(text[i])
    elif text[i] != endRange and text[i] != stRange:
        i += 1
        continue
    elif text[i] == endRange:
        endRange = text.index(text[i])  # an integer to be used as a range index
        i = 0
        del text[stRange : endRange + 1]  # delete the extra unwanted characters
        length = len(text)  # get the new length of our data
        stRange = '<'  # and again, assign the marker characters to their variables
        endRange = '>'
    i += 1
else:
    result = str()
    for l in text:
        result += l
    else:
        with open(path, 'w') as f:
            f.write(result)
        with open(path, 'r') as f:
            print('the result ==>')
            print(f.read())

How to extract multiple instances of a word from PDF files on python?

I'm writing a script in Python to read a PDF file and record both the string that appears after every instance of "time" and the page number it's mentioned on.
I have gotten it to recognize when each page has the string "time" on it and report the page number; however, if the page has "time" more than once, it does not tell me. I'm assuming this is because it has already fulfilled the criterion of having the string "time" on it at least once, and therefore it skips to the next page to perform the check.
How would I go about finding multiple instances of the word "time"?
This is my code:
import PyPDF2

def pdf_read():
    pdfFile = "records\document.pdf"
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)
Also, as a side note: this PDF is a scanned document, and therefore when I read the text in Python (or copy and paste it into Word) a lot of words come out as random symbols and characters even though the scan is perfectly legible. Is this a limitation of plain text extraction that would require more complex concepts such as machine learning to read the files accurately?
A solution would be to create a list of strings from pageContent and count the frequency of the word 'time' in the list. This also makes it easier to select the word following 'time': you can simply retrieve the next item in the list:
import PyPDF2
import string

pdfFile = "records\document.pdf"
pdf = PyPDF2.PdfFileReader(pdfFile)
pageCount = pdf.getNumPages()
for pageNumber in range(pageCount):
    page = pdf.getPage(pageNumber)
    pageContent = page.extractText()
    pageContent = ''.join(pageContent.splitlines()).split()  # words to list
    pageContent = ["".join(j.lower() for j in i if j not in string.punctuation) for i in pageContent]  # lowercase and remove punctuation
    print(pageContent.count('time'))  # count occurrences of 'time' (everything was lowercased above, so 'Time' never appears)
    print([(j, pageContent[i + 1] if i + 1 < len(pageContent) else '') for i, j in enumerate(pageContent) if j == 'time'])  # list 'time' and the following word
Note that this example also strips punctuation from the words and lowercases them. Hopefully that sufficiently cleans up the bad OCR.
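As a further side note, PdfFileReader, getNumPages, and extractText belong to PyPDF2's old interface, which has since been deprecated; with the maintained successor, pypdf, the same loop would look roughly like this:
from pypdf import PdfReader

reader = PdfReader("records/document.pdf")
for page_number, page in enumerate(reader.pages):
    # extract_text() can return None for empty pages, hence the `or ''`
    words = (page.extract_text() or '').lower().split()
    print(page_number, words.count('time'))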

How to catch the list number for each paragraph

I'm trying to use python-docx to read a Word file's content.
For example: the attached demo Word file contains several paragraphs, some of which start with a heading number, like 1.3, 1.4.1, etc.
My program tries to open the docx and search for a keyword in each paragraph. If the keyword exists in a given paragraph, it should print that paragraph and its heading number.
However, it fails to print the heading number. For example, when I search for the keyword "wall", it only prints the paragraph containing "wall", but not the heading number 1.4.1.
I need the number too.
import re
from docx import Document

def search_word(filename, word):
    # open the word file
    document = Document(filename)
    # read every paragraph
    l = [paragraph.text.encode('utf-8') for paragraph in document.paragraphs]
    result = []
    for i in l:
        i = i.strip()
        i = str(i)
        pattern = re.compile(r"(.*)(%s)(.*)" % word, re.I | re.M)
        rel = pattern.findall(i)
        if len(rel):
            result.append(rel)
    print(filename + "=" * 30 + "Search Result" + "=" * 30)
    print("-" * 150)
    for k in result:
        for m in k:
            print("".join(m).strip('b\'') + "\n" * 1)
    print("-" * 150 + "\n" * 2)
Finally, I found a crude way to catch each paragraph's heading and its content.
I convert the docx to HTML first, then use BeautifulSoup and re to search for my keyword.
import re
from bs4 import BeautifulSoup

output_content = ""  # accumulated report text

def search_file(file, word):
    global output_content
    output_content = output_content + "\n" + "*" * 30 + file.split("\\")[-1] + " Search Result" + "*" * 30 + "\n" * 2
    url = file
    htmlfile = open(url, 'r', encoding='utf-8')
    demo = htmlfile.read()
    soup = BeautifulSoup(demo, 'lxml')
    all_content = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'p'])
    new_list = []
    for item in all_content:
        if item.text not in new_list:
            new_list.append(item.text)
    dic1 = {}  # an empty dict to store each clause number and the detailed content of its paragraphs
    Target = ""
    content = ""
    for line in new_list:
        line = str(line.replace("\n", " "))
        pattern = re.compile(r"(^[1-9].+)")  # judge whether the paragraph starts with a heading number
        line_no = bool(pattern.search(line))
        if line_no:  # if the paragraph starts with a heading number,
            dic1[Target] = content  # save the accumulated content under the previous heading number
            Target = line
            content = ""
            continue
        else:  # if the paragraph is detail text, not a heading line,
            content = content + line + "\n"  # accumulate the content
            continue
    # search the keyword in the dict values; if found, record the dict key and the matches together
    result = []
    for value in dic1.values():
        pattern = re.compile(r".*%s.*" % word, re.I | re.M)
        rel = pattern.findall(value)
        if len(rel):
            result.append(list(dic1.keys())[list(dic1.values()).index(value)])
            result.append(list(rel))
            result.append("\n")
    return print_result(file, result)

def print_result(file, nums):
    global output_content
    for i in nums:
        if isinstance(i, list):
            print_result(file, i)
        else:
            output_content = output_content + i
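If you'd rather avoid the HTML round-trip: python-docx doesn't expose the rendered list numbers (Word stores the numbering definitions in numbering.xml), but if the document uses the built-in heading styles, you can reconstruct the numbers yourself. A minimal sketch, assuming 'Heading 1'..'Heading 9' style names, plain decimal numbering, and that a Heading 1 appears before any deeper heading:
import re
from docx import Document

def search_with_headings(filename, word):
    document = Document(filename)
    counters = [0] * 9  # one counter per heading level
    current = ""        # heading number of the section we are in, e.g. "1.4.1"
    keyword = re.compile(re.escape(word), re.I)
    for paragraph in document.paragraphs:
        m = re.match(r"Heading (\d)", paragraph.style.name)
        if m:
            level = int(m.group(1))
            counters[level - 1] += 1
            counters[level:] = [0] * (9 - level)  # reset deeper levels
            current = ".".join(str(c) for c in counters[:level])
        elif keyword.search(paragraph.text):
            print(current, paragraph.text)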

Empty list when I supposed to have appended a lot of text to it

So I am working on code to turn a specific XML document into an HTML document for presenting a story. I have managed to get most of the way there, but when I concatenate a list into a string and append that new string to another list, that list ends up empty. I have tried to use my limited understanding to troubleshoot where the issue lies, but have so far come up short. I will show you my code and the area where I think the problem lies.
I have already fixed one thing that I noticed, where the variable I needed was not the one I used, but I have gone through the code and cannot find any further slip-ups of this kind.
import codecs
import re

fileIn = codecs.open("differenceInAbility.xml", "r", "utf-8")
text = fileIn.read()
fileIn.close()

chapterTitle = re.findall(r'<chapter number="(\d)" name="(.+?)">', text)
chapters = re.findall(r'<chapter number="\d" name=".+?">(.+?)</chapter>', text, flags=re.DOTALL)
paragraphs = re.findall(r"<paragraph>(.+?)</paragraph>", text, flags=re.DOTALL)

cleanParagraphs = []
for entry in paragraphs:
    cleanup = re.sub(r"\r\n[ ]+", " ", entry)
    cleanup2 = re.sub(r"[ ]+", " ", cleanup)
    cleanParagraphs.append(cleanup2)

chaptersHTML = []
chapterCounter = 1
for entry in chapters:
    if chapterTitle[0] == r"\d+":
        chapterHTML = "<h1> Chapter " + chapterCounter + " - " + chapterTitle[1] + "</h1>"
        chapterTitle.pop(0)
        chapterTitle.pop(1)
        paragraphsHTML = []
        for paragraph in cleanParagraphs:
            if paragraph in entry:
                p = "<p>" + paragraph + "</p>"
                paragraphsHTML.append(p)
        allParagraphsHTML = "\n".join(paragraphsHTML)
        wholeSection = chapterHTML + allParagraphsHTML
        chaptersHTML.append(wholeSection)
        chapterCounter += 1
print(chaptersHTML)
The part I believe is relevant:
paragraphsHTML = []
for paragraph in cleanParagraphs:
    if paragraph in entry:
        p = "<p>" + paragraph + "</p>"
        paragraphsHTML.append(p)
allParagraphsHTML = "\n".join(paragraphsHTML)
wholeSection = chapterHTML + allParagraphsHTML
chaptersHTML.append(wholeSection)
because the cleanParagraphs list has the right content, where each paragraph of the xml document is its own entry in this list.
Could the problem be if paragraph in entry, because it doesn't register the cleaned paragraphs as parts of the "entry"?
If so, how would I go about solving this? How do I make sure it knows which paragraph is in which chapter?
The contents of cleanParagraphs are not the original substrings, so of course they do not appear in the unaltered chapters values. You should process each chapter (including breaking it into paragraphs) separately, so that you don’t have to rediscover which paragraphs it contains (and to avoid mishandling paragraphs that happen to be identical between two chapters).
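For example, a minimal sketch of that per-chapter restructuring, reusing the question's tag names and filename and assuming the same XML layout:
import codecs
import re

fileIn = codecs.open("differenceInAbility.xml", "r", "utf-8")
text = fileIn.read()
fileIn.close()

chaptersHTML = []
# capture each chapter's number, title, and body in a single pass
for number, title, body in re.findall(
        r'<chapter number="(\d+)" name="(.+?)">(.+?)</chapter>', text, flags=re.DOTALL):
    paragraphs = re.findall(r"<paragraph>(.+?)</paragraph>", body, flags=re.DOTALL)
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    paragraphsHTML = "\n".join("<p>" + p + "</p>" for p in cleaned)
    chaptersHTML.append("<h1>Chapter " + number + " - " + title + "</h1>\n" + paragraphsHTML)
print(chaptersHTML)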

Web Scraping a wikipedia page

On some Wikipedia pages, after the title of the article (appearing in bold), there is some text inside parentheses used to explain the pronunciation and phonetics of the words in the title. For example, on this page, after the bold title diglossia in the first <p>, there is an open parenthesis. In order to find the corresponding close parenthesis, you would have to iterate through the text nodes one by one to find it, which is simple. What I'm trying to do is find the very next href link and store it.
The issue here is that (AFAIK) there isn't a way to uniquely identify the text node containing the close parenthesis and then get the following href. Is there any straightforward (not convoluted) way to get the first link outside of the initial parentheses?
EDIT
In the case of the link provided here, the href to be stored should be https://en.wikipedia.org/wiki/Dialects, since that is the first link outside of the parentheses.
Is this what you want?
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
print parsed_html.body.findAll('p')[0].findAll('a')[0]
This gives:
linguistics
If you want to extract the href, then you can use this:
parsed_html.body.findAll('p')[0].findAll('a')[0].attrs[0][1]
UPDATE
It seems you want the href after the parentheses, not the one before.
I have written a script for it. Try this:
import requests
from BeautifulSoup import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia', verify=False)
parsed_html = BeautifulSoup(rs.text)
temp = parsed_html.body.findAll('p')[0]
start_count = 0
started = False
found = False
while temp.next and found is False:
    temp = temp.next
    if '(' in temp:
        start_count += 1
        if started is False:
            started = True
    if ')' in temp and started and start_count > 1:
        start_count -= 1
    elif ')' in temp and started and start_count == 1:
        found = True
print temp.findNext('a').attrs[0][1]
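For reference, this answer targets Python 2 and BeautifulSoup 3. A rough equivalent of the same node-walking idea with Python 3 and bs4 might look like this (a sketch, not tested against every article layout):
import requests
from bs4 import BeautifulSoup

rs = requests.get('https://en.wikipedia.org/wiki/Diglossia')
soup = BeautifulSoup(rs.text, 'html.parser')
first_p = soup.body.find('p')

depth = 0          # current parenthesis nesting depth
seen_open = False  # whether the initial parenthetical has started
for node in first_p.descendants:
    if isinstance(node, str):  # NavigableString subclasses str
        if '(' in node:
            seen_open = True
        depth += node.count('(') - node.count(')')
        if seen_open and depth <= 0:
            # first link after the initial parenthetical closes
            link = node.find_next('a')
            print(link.get('href'))
            break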
