Splitting a docx by headings into separate files in Python - python

I want to write a program that grabs my docx files, iterates through them and splits each file into multiple separate files based on headings. Inside each of the docx there are a couple of articles, each with a 'Heading 1' and text underneath it.
So if my original file1.docx has 4 articles, I want it to be split into 4 separate files each with its heading and text.
I got to the part where it iterates through all of the files in a path where I hold the .docx files, and I can read the headings and text separately, but I can't seem to figure out a way how to merge it all and split it into separate files each with the heading and the text. I am using the python-docx library.
import glob
from docx import Document
headings = []
texts = []
def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph
def iter_text(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph
for name in glob.glob('/*.docx'):
    document = Document(name)
    for heading in iter_headings(document.paragraphs):
        headings.append(heading.text)
    for paragraph in iter_text(document.paragraphs):
        texts.append(paragraph.text)
print(texts)
How do I extract the text and heading for each article?
This is the XML reading python-docx gives me. The red braces mark what I want to extract from each file.
https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png
I am open for any alternative suggestions on how to achieve what I want with different methods, or if there is an easier way to do it with PDF files.

I think the approach of using iterators is a sound one, but I'd be inclined to parcel them differently. At the top level you could have:
for paragraphs in iterate_document_sections(document):
    create_document_from_paragraphs(paragraphs)
Then iterate_document_sections() would look something like:
def iterate_document_sections(document):
    """Generate a sequence of paragraphs for each headed section in document.
    Each generated sequence has a heading paragraph in its first position,
    followed by one or more body paragraphs.
    """
    paragraphs = [document.paragraphs[0]]
    for paragraph in document.paragraphs[1:]:
        if is_heading(paragraph):
            yield paragraphs
            paragraphs = [paragraph]
            continue
        paragraphs.append(paragraph)
    yield paragraphs
Something like this combined with portions of your other code should give you something workable to start with. You'll need an implementation of is_heading() and create_document_from_paragraphs().
Note that the term "section" here is used as in common publishing parlance to refer to a (section) heading and its subordinate paragraphs, and does not refer to a Word document section object (like document.sections).
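A minimal sketch of the two helpers left open above, under several assumptions: an article starts at a paragraph styled 'Heading 1', only the plain text has to survive (run formatting, images and tables are not copied), and the names group_by_heading, split_document and the output file pattern are hypothetical. This variant takes the paragraph list directly and also tolerates an empty document.

```python
def is_heading(paragraph, style_name="Heading 1"):
    """Return True if the paragraph uses the style that starts an article."""
    return paragraph.style.name == style_name


def group_by_heading(paragraphs):
    """Yield lists of paragraphs: a heading followed by its body paragraphs."""
    section = []
    for paragraph in paragraphs:
        # A new heading closes whatever has been collected so far.
        if is_heading(paragraph) and section:
            yield section
            section = []
        section.append(paragraph)
    if section:
        yield section


def create_document_from_paragraphs(paragraphs, path):
    """Write one group to its own .docx file, copying text only."""
    # Imported lazily so the grouping helpers work without python-docx.
    from docx import Document

    new_doc = Document()
    for paragraph in paragraphs:
        if is_heading(paragraph):
            new_doc.add_heading(paragraph.text, level=1)
        else:
            new_doc.add_paragraph(paragraph.text)
    new_doc.save(path)


def split_document(source_path):
    """Hypothetical driver: file1.docx -> file1_1.docx, file1_2.docx, ..."""
    from docx import Document

    document = Document(source_path)
    stem = source_path.rsplit(".", 1)[0]
    groups = group_by_heading(document.paragraphs)
    for number, group in enumerate(groups, start=1):
        create_document_from_paragraphs(group, f"{stem}_{number}.docx")
```

Calling split_document(name) inside the glob loop from the question would then write one file per article. Any text that appears before the first heading ends up in the first output file.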

In fact, the solution provided above works well only if the documents contain nothing but paragraphs (no tables, for example).
Another possible solution is to iterate not only through the paragraphs but through all child XML elements of the document body. Once you have found the start and end elements of a "sub-document" (the heading paragraphs in your example), you delete every XML element that does not belong to that part, cutting off the rest of the document content. This way you preserve all styles, text, tables and other document elements and formatting.
It's not an elegant solution and means that you have to keep a temporary copy of a full source document in memory.
This is my code:
import tempfile
from typing import Generator, Tuple, Union
from docx import Document
from docx.document import Document as DocType
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.oxml.xmlchemy import BaseOxmlElement
from docx.text.paragraph import Paragraph
def iterparts(doc_path:str, skip_first=True, bias:int=0) -> Generator[Tuple[int,DocType],None,None]:
    """Iterate over sub-documents by splitting source document into parts
    Split into parts by copying source document and cutting off irrelevant
    data.
    Args:
        doc_path (str): path to source *docx* file
        skip_first (bool, optional): skip first split point and wait for
            second occurrence. Defaults to True.
        bias (int, optional): split point bias. Defaults to 0.
    Yields:
        Generator[Tuple[int,DocType],None,None]: first element of each tuple
            indicates the number of a sub-document, if number is 0
            then there are no sub-documents
    """
    doc = Document(doc_path)
    counter = 0
    while doc:
        split_elem_idx = -1
        doc_body = doc.element.body
        cutted = [doc, None]
        for idx, elem in enumerate(doc_body.iterchildren()):
            if is_split_point(elem):
                if split_elem_idx == -1 and skip_first:
                    split_elem_idx = idx
                else:
                    cutted = split(doc, idx+bias) # idx-1 to keep previous paragraph
                    counter += 1
                    break
        yield (counter, cutted[0])
        doc = cutted[1]
def is_split_point(element:BaseOxmlElement) -> bool:
    """Split criteria
    Args:
        element (BaseOxmlElement): oxml element
    Returns:
        bool: whether the *element* is the beginning of a new sub-document
    """
    if isinstance(element, CT_P):
        p = Paragraph(element, element.getparent())
        return p.text.startswith("Some text")
    return False
def split(doc:DocType, cut_idx:int) -> Tuple[DocType,DocType]:
    """Splitting into parts by copying source document and cutting off
    irrelevant data.
    Args:
        doc (DocType): [description]
        cut_idx (int): [description]
    Returns:
        Tuple[DocType,DocType]: [description]
    """
    tmpdocfile = write_tmp_doc(doc)
    second_part = doc
    second_elems = list(second_part.element.body.iterchildren())
    for i in range(0, cut_idx):
        remove_element(second_elems[i])
    first_part = Document(tmpdocfile)
    first_elems = list(first_part.element.body.iterchildren())
    for i in range(cut_idx, len(first_elems)):
        remove_element(first_elems[i])
    tmpdocfile.close()
    return (first_part, second_part)
def remove_element(elem: Union[CT_P,CT_Tbl]):
    elem.getparent().remove(elem)
def write_tmp_doc(doc:DocType):
    tmp = tempfile.TemporaryFile()
    doc.save(tmp)
    return tmp
Note that you should define is_split_point according to your own split criteria.
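For the question's case, where every article starts at a 'Heading 1' paragraph, the split criterion could look roughly like the sketch below. It reads the paragraph's style id straight from the XML instead of comparing text. The style id "Heading1" is an assumption: it is the usual id of the built-in style in English-language documents, but localized or custom templates may use another one.

```python
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def is_split_point(element, style_id="Heading1"):
    """Return True for a body paragraph whose style id equals style_id."""
    if element.tag != f"{{{W_NS}}}p":
        # Tables and section properties never start a new sub-document.
        return False
    p_style = element.find("w:pPr/w:pStyle", NS)
    # Paragraphs without an explicit style use the default and do not match.
    return p_style is not None and p_style.get(f"{{{W_NS}}}val") == style_id
```

Keep in mind that iterparts skips the first split point by default (skip_first=True), so the first heading stays inside the first part instead of producing an empty one.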

Related

Adding rows and columns to a pandas DataFrame in multiple loops

I am trying to make a simple tool which can look for keywords (from multiple txt files) in multiple PDFs. In the end, I would like it to produce a report in the following form:
Name of the pdf document    Keyword document 1    Keyword document...    Keyword document x
PDF 1                       1                     546                    77
PDF...                      3                     8                      8
PDF x                       324                   23                     34
Where the numbers represent the total number of occurrences of all keywords from the keyword document in that particular file.
This is where I got so far - the function can successfully locate, count, and relate summed keywords to the document:
import fitz
import glob
def keyword_finder():
    # access all PDFs from current directory
    for pdf_file in glob.glob('*.pdf'):
        # open files using PyMuPDF
        document = fitz.open(pdf_file)
        # count the number of pages in document
        document_pages = document.page_count
        # access all txt files (these contain the keywords)
        for text_file in glob.glob('*.txt'):
            # empty list to store the results
            occurrences_sdg = []
            # open keywords file
            inputs = open(text_file, 'r')
            # read txt file
            keywords_list = inputs.read()
            # split the words by an 'enter'
            keywords_list_separated = keywords_list.split('\n')
            for keyword in keywords_list_separated[1:-1]: # omit first and last entry
                occurrences_keyword = []
                # read in page by page
                for page in range(0, document_pages):
                    # load in text from i page
                    text_per_page = document.load_page(page)
                    # search for keywords on the page, and sum all occurrences
                    keyword_sum = len(text_per_page.search_for(keyword))
                    # add occurrences from each page to list per keyword
                    occurrences_keyword.append(keyword_sum)
                # sum all occurances of a keyword in the document
                occurrences_sdg.append(sum(occurrences_keyword))
            if sum(occurrences_sdg) > 0:
                print(f'{pdf_file} has {sum(occurrences_sdg)} keyword(s) from {text_file}\n')
I did try using pandas and I believe that still is the best choice. The number of loops makes it difficult for me to decide at which point the "skeleton" dataframe should be made, and when the results should be added. Final goal is to have this produced report saved as csv.
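One way to sidestep the "skeleton" question is to not build the table inside the loops at all: collect the totals in a nested dictionary while looping, and shape the report once at the end. The sketch below uses only the standard library's csv module; build_report and the sample file names are hypothetical. With pandas, something like pd.DataFrame.from_dict(counts, orient="index") followed by to_csv should produce a similar table from the same dictionary.

```python
import csv
import io


def build_report(counts):
    """Turn {pdf_name: {keyword_file: total}} into CSV text, one row per PDF."""
    # Columns: every keyword file seen for any PDF, sorted for a stable order.
    keyword_files = sorted({kf for per_pdf in counts.values() for kf in per_pdf})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name of the pdf document", *keyword_files])
    for pdf_name, per_pdf in counts.items():
        # A keyword file with no recorded total for this PDF counts as 0.
        writer.writerow([pdf_name, *(per_pdf.get(kf, 0) for kf in keyword_files)])
    return buffer.getvalue()


counts = {}
# Inside the two outer loops this would be:
#     counts.setdefault(pdf_file, {})[text_file] = sum(occurrences_sdg)
counts.setdefault("a.pdf", {})["keywords1.txt"] = 1
counts.setdefault("a.pdf", {})["keywords2.txt"] = 546
counts.setdefault("b.pdf", {})["keywords1.txt"] = 3

report = build_report(counts)
```

Writing report to a file opened with newline="" then gives the CSV.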

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in python. End goal is to structure it in a dataframe. For now I'm just trying to get the relevant data in an array and understand the line, readline() functionality in python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.
            # 2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
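For comparison, here is a rough sketch of the iterator variant. Instead of calling next() by hand, it runs a nested for loop over the same iterator, so StopIteration never has to be caught. The function name parse_articles and the dictionary layout are hypothetical, and the sketch assumes every "Full text:" line follows a "Title:" line.

```python
def parse_articles(lines):
    """Collect Title / Full text / Subject records from an iterable of lines."""
    articles = []
    current = {}
    it = iter(lines)
    for line in it:
        if line.startswith("Title:"):
            current = {"Title": line[len("Title:"):].strip()}
            articles.append(current)
        elif line.startswith("Full text:"):
            parts = [line[len("Full text:"):].strip()]
            # The inner loop pulls lines from the same iterator, so the outer
            # loop resumes right after the "Subject:" line.
            for follow in it:
                if follow.startswith("Subject:"):
                    current["Subject"] = follow[len("Subject:"):].strip()
                    break
                parts.append(follow.strip())
            current["Full text"] = " ".join(parts)
    return articles
```

Because a file object is itself an iterable of lines, parse_articles(f) can be called directly inside the with block.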
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
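The reshape step relies on every article having exactly the three keys. As a sketch of what the split itself returns, and of a grouping that does not depend on that, the keys and values can be paired up by hand, starting a new record at each Title. The function name split_records is hypothetical; only the standard library is used.

```python
import re

KEYS = ["Subject", "Title", "Full text"]
PATTERN = re.compile(r"(?:^|\n)(%s): " % "|".join(KEYS))


def split_records(text):
    """Split text on the key markers and group the pieces per article."""
    # Because the pattern has a capturing group, re.split keeps the keys:
    # ['', 'Title', '...', 'Full text', '...', 'Subject', '...', 'Title', ...]
    chunks = PATTERN.split(text)[1:]
    records = []
    for key, value in zip(chunks[0::2], chunks[1::2]):
        # Each Title opens a new record; other keys join the current one.
        if key == "Title" or not records:
            records.append({})
        records[-1][key] = value.strip()
    return records
```

The resulting list of dictionaries can be passed to pd.DataFrame in the same way as in the answer above.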

How to extract multiple instances of a word from PDF files on python?

I'm writing a script in Python to read a PDF file and record both the string that appears after every instance of "time" and the page number it's mentioned on.
I have gotten it to recognize when each page has the string "time" on it and send me the page number, however if the page has "time" more than once, it does not tell me. I'm assuming this is because it has already fulfilled the criteria of having the string "time" on it at least once, and therefore it skips to the next page to perform the check.
How would I go about finding multiple instances of the word "time"?
This is my code:
import PyPDF2
def pdf_read():
    pdfFile = r"records\document.pdf"
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)
Also, as a side note, this PDF is a scanned document, so when I read the text in Python (or copy and paste it into Word) a lot of words come up with multiple random symbols and characters even though the page is perfectly legible. Is this a limitation of computer programming without having to apply more complex concepts such as machine learning in order to read the files accurately?
A solution would be to create a list of strings off pageContent and count the frequency of the word 'time' in the list. It is also easier to select the word following 'time' - you can simply retrieve the next item in the list:
import PyPDF2
import string
# Note: PyPDF2 3.x renamed these calls to PdfReader, reader.pages and page.extract_text()
pdfFile = r"records\document.pdf"
pdf = PyPDF2.PdfFileReader(pdfFile)
pageCount = pdf.getNumPages()
for pageNumber in range(pageCount):
    page = pdf.getPage(pageNumber)
    pageContent = page.extractText()
    pageContent = ''.join(pageContent.splitlines()).split() # words to list
    pageContent = ["".join(j.lower() for j in i if j not in string.punctuation) for i in pageContent] # lowercase and remove punctuation
    print(pageContent.count('time')) # count occurrences of time in list (every word is lowercase by now)
    print([(j, pageContent[i+1] if i+1 < len(pageContent) else '') for i, j in enumerate(pageContent) if j == 'time']) # list time and following word
Note that this example also lowercases every word and strips punctuation from it. Hopefully this sufficiently cleans up the bad OCR.
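An alternative sketch that works on the extracted page text with a regular expression. It reports every occurrence of "time" on every page together with the word that follows it. The name find_time_mentions is hypothetical, and the pages argument is assumed to be a list of strings, one per page, such as the results of the text extraction call above.

```python
import re

# The following word is captured inside a lookahead, so it is not consumed
# and a directly repeated "time time" still yields two matches.
TIME_PATTERN = re.compile(r"\btime\b(?=\W*(\w+)?)", re.IGNORECASE)


def find_time_mentions(pages):
    """Return (page_number, following_word) for each 'time' on each page."""
    hits = []
    for page_number, text in enumerate(pages):
        for match in TIME_PATTERN.finditer(text):
            # group(1) is None when "time" is the last word on the page.
            hits.append((page_number, match.group(1) or ""))
    return hits
```

Page numbers are zero-based here, matching range(pageCount) in the code above. The word boundaries keep words such as "sometime" from being counted.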

Flatten HTML code, with tree structure delimiters

I have some raw HTML scraped from a random website, possibly messy, with some scripts, self-closing tags, etc. Example:
ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
I want to return the HTML DOM without any strings, attributes or the like, only the tag structure, as a string showing the relation between parents, children and siblings. This would be my expected output (though the use of brackets is a personal choice):
'[html[head[meta, title], body[h1, p[span]]]]'
So far I tried using beautifulSoup (this answer was helpful). I figured out I should split the work in two steps:
- extract the tag "skeleton" of the html DOM, emptying everything like strings, attributes, and stuff before the <html>.
- return the flat HTML DOM, but structured with tree-like delimiters indicating each children and siblings, such as brackets.
I posted the code as a self-answer.
You can use recursion. The name attribute gives the name of a tag. You can check whether the type is bs4.element.Tag to confirm that an element is a tag.
import bs4
ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
soup=bs4.BeautifulSoup(ex,'html.parser')
result='' # named result rather than str, so the built-in str is not shadowed
def recursive_child_search(tag):
    global result
    result+=tag.name
    child_tag_list=[x for x in tag.children if type(x)==bs4.element.Tag]
    if len(child_tag_list) > 0:
        result+='['
    for i,child in enumerate(child_tag_list):
        recursive_child_search(child)
        if not i == len(child_tag_list) - 1: #if not last child
            result+=', '
    if len(child_tag_list) > 0:
        result+=']'
    return
recursive_child_search(soup.find())
print(result)
#html[head[meta, title], body[h1, p[span]]]
print('['+result+']')
#[html[head[meta, title], body[h1, p[span]]]]
I post here my first solution, which is still a bit messy and uses a lot of regex. The first function gets the emptied DOM structure and outputs it as a raw string, the second function modifies the string to add the delimiters.
import re
from bs4 import BeautifulSoup
def clear_tags(htmlstring, remove_scripts=False):
    htmlstring = re.sub("^.*?(<html)",r"\1", htmlstring, flags=re.DOTALL)
    finishyoursoup = BeautifulSoup(htmlstring, 'html.parser')
    for tag in finishyoursoup.find_all():
        tag.attrs = {}
        for sub in tag.contents:
            if sub.string:
                sub.string.replace_with('')
    if remove_scripts:
        [tag.extract() for tag in finishyoursoup.find_all(['script', 'noscript'])]
    return(str(finishyoursoup))
clear_tags(ex)
# '<html><head><meta/><title></title></head><body><h1></h1><p><span></span></p></b
def flattened_html(htmlstring):
    skeleton = clear_tags(htmlstring)
    step1 = re.sub("<([^/]*?)>", r"[\1", skeleton) # replace beginning of tag
    step2 = re.sub("</(.*?)>", r"]", step1) # replace end of tag
    step3 = re.sub("<(.*?)/>", r"[\1]", step2) # deal with self-closing tag
    step4 = re.sub(r"\]\[", ", ", step3) # gather sibling tags with comma
    return(step4)
flattened_html(ex)
# '[html[head[meta, title], body[h1, p[span]]]]'
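For comparison, a sketch that needs neither BeautifulSoup nor regular expressions: it swaps in the standard library's html.parser, builds the tag tree with a stack and renders it recursively. The names SkeletonParser, render and flatten_html are hypothetical, and the list of void tags is an assumption about which elements never get a closing tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SkeletonParser(HTMLParser):
    """Record only tag names, as nested (name, children) tuples."""

    def __init__(self):
        super().__init__()
        self.root = ("", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <meta/> are leaves.
        self.stack[-1][1].append((tag, []))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def render(node):
    """Render one node as name[child, child, ...]."""
    name, children = node
    if not children:
        return name
    return name + "[" + ", ".join(render(child) for child in children) + "]"


def flatten_html(html):
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return "[" + ", ".join(render(child) for child in parser.root[1]) + "]"
```

Text, attributes and the doctype are dropped simply because the parser has no handlers for them.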

Put Header with Python - docx

I am using Python-docx to create and write a Word document.
How can I put text in the document header using python-docx?
http://image.prntscr.com/image/8757b4e6d6f545a5ab6a08a161e4c55e.png
Thanks
UPDATE: This feature has been implemented since the time of this answer.
As other respondents have noted below, the Section object provides access to its header objects.
header = document.sections[0].header
Note that a section can have up to three headers (first_page, odd_pages, even_pages) and each section can have its own set of headers. The most common situation is that a document has a single header and a single section.
A header is like a document body or table cell in that it can contain tables and/or paragraphs and by default has a single empty paragraph (it cannot contain zero paragraphs).
header.paragraphs[0].text = "My header text"
This is explained in greater detail on this page in the documentation:
https://python-docx.readthedocs.io/en/latest/user/hdrftr.html
Unfortunately this feature is not implemented yet. The page @SamRogers linked to is part of the enhancement proposal (aka. "analysis page"). The implementation is in progress however, by @eupharis, so might be available in a month or so. The ongoing pull request is here if you want to follow it. https://github.com/python-openxml/python-docx/pull/291
This feature has been implemented. See: https://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html
You can add text to the header of a word document using python-docx as follows:
header = document.sections[0].header
head = header.paragraphs[0]
head.text = 'Add Your Text'
This is what I've been using, and it works:
header = document.sections[0].header
header.add_paragraph('Test Header')
Header is a subclass of BlockItemContainer, from which it inherits the same content editing capabilities as Document, such as .add_paragraph().
(With respect that this question is old...)
I have used a work-around in my project where my "client" wanted different headers in different pages by:
1. Creating a document using python-docx and section breaks.
2. Executing a Word macro file (*.xlsm) with two arguments: fileName (the path) and docTitle (the title of the document, to be inserted in the footer).
3. The macro file opens the newly created document and adds headers and footers that are already inside the macro file. This would need to be modified if the header and footer text need to vary.
Python code:
import win32com.client
wd = win32com.client.Dispatch("Word.Application")
wd.Visible = False
doc = wd.Documents.Open(pathToDOCM) # path here
wd.Run("Main.RunMain",fileName, docTitle) # 2 args
doc.Close()
del wd
VBA code (inside *.xlsm):
Sub RunInside()
    Call RunMain("C:\Users\???\dokument.docx", "test")
End Sub
Sub RunMain(wordDocument As String, wordTitle As String)
    ' Create Headers
    Call CreateHeaders(wordDocument, wordTitle)
End Sub
Sub CreateHeaders(wordDocument As String, wordTitle As String)
    Dim i As Integer
    Dim outputName As String
    Dim aDoc As Document
    Dim oApp As Word.Application
    Dim oSec As Word.Section
    Dim oDoc As Word.Document
    Dim hdr1, hdr2 As HeaderFooter
    Dim ftr1, ftr2 As HeaderFooter
    'Create a new document in Word
    Set oApp = New Word.Application
    'Set oDoc = oApp.Documents.Add
    Set oDoc = oApp.Documents.Open(wordDocument)
    'Set aDoc as active document
    Set aDoc = ActiveDocument
    oDoc.BuiltInDocumentProperties("Title") = wordTitle
    For i = 1 To 9:
        Set hdr1 = aDoc.Sections(i).Headers(wdHeaderFooterPrimary)
        Set hdr2 = oDoc.Sections(i).Headers(wdHeaderFooterPrimary)
        Set ftr1 = aDoc.Sections(i).Footers(wdHeaderFooterPrimary)
        Set ftr2 = oDoc.Sections(i).Footers(wdHeaderFooterPrimary)
        If i > 1 Then
            With oDoc.Sections(i).Headers(wdHeaderFooterPrimary)
                .LinkToPrevious = False
            End With
            With oDoc.Sections(i).Footers(wdHeaderFooterPrimary)
                .LinkToPrevious = False
            End With
        End If
        hdr1.Range.Copy
        hdr2.Range.Paste
        ftr1.Range.Copy
        ftr2.Range.Paste
    Next i
    outputName = Left(wordDocument, Len(wordDocument) - 5)
    outputName = outputName + ".pdf"
    oDoc.SaveAs outputName, 17
    oDoc.Close SaveChanges:=wdSaveChanges
    Set oDoc = Nothing
    Set aDoc = Nothing
End Sub
Final remark:
The code loops through different sections and copy-paste the header and footers. It also saves the document to *.PDF.
For those of you looking to set custom headers w/docx:
I had to use a couple packages to get this to work. My use case was this: I was generating multiple templates and then merging them together, however when I merged them with docx the header from the master file (below) was applied to all sections and all sections were marked as linkedToPrevious = True despite being =False in the original files. However, docx does a really nice job appending files and having it come out error-free on the other end, so I decided to find a way to make it work. Code for reference:
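from docx import Document
from docxcompose.composer import Composer  # assuming Composer is the one from the docxcompose package
master = Document(files[0])
composer = Composer(master)
footnotes_doc = Document('chapters/footnotes.docx')
for file in files[1:]:
    mergeDoc = Document(file)
    composer.append(mergeDoc)
composer.append(footnotes_doc)
composer.save("chapters/combined.docx")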
So now I have a master doc (combined.docx) with all the proper sections however the headers need to be adjusted. You can't iterate over the document with docx, get the current section that you are in, and adjust it or set the headers linking to false. If you set to False you wipe the header completely. You can explicitly call the section and adjust it, but since everything after it is linked to previous, you change the rest of the document from that point. So I pulled in win32com:
The code below gets the number of sections and then iterates through them backwards using win32com. This way, as you remove linkedToPrevious, you preserve the header in place.
import win32com.client
from docx import Document
def getSections(document):
    sectionArray = {}
    sections = document.sections
    x = 1
    for section in sections:
        sectionArray[x] = section
        x += 1
    return sectionArray
start_doc = Document('chapters/combined.docx')
listArray = getSections(start_doc) #gets a dict of all sections, keyed 1..n
keylist = list(reversed(sorted(listArray.keys()))) ##now reverse it
#EnsureDispatch builds the type-library cache that win32com.client.constants relies on
word = win32com.client.gencache.EnsureDispatch("Word.Application")
word = win32com.client.DispatchEx("Word.Application")
word.Visible = False
#tell word to open the document
word.Documents.Open(r'C:\path to\combined.docx')
#open it internally
doc = word.Documents(1)
try:
    for item in keylist:
        word.ActiveDocument.Sections(item).Headers(win32com.client.constants.wdHeaderFooterPrimary).LinkToPrevious=False
        word.ActiveDocument.Sections(item).Headers(win32com.client.constants.wdHeaderFooterEvenPages).LinkToPrevious=False
    word.ActiveDocument.SaveAs(r"c:\wherever\combined_1.docx")
finally:
    #close the document and quit Word whether or not the edits succeeded
    doc.Close()
    word.Quit()
OK, so now the doc is primed for editing the headers, which we can do easily and worry-free with docx. First we need to parse the XML, which I access through docx and then feed to lxml, to get the location of the section needed:
from lxml import etree
xml = str(start_doc._element.xml) #this gets the full XML using docx
tree = etree.fromstring(xml)
WORD_NAMESPACE='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
TEXT = WORD_NAMESPACE + 't'
PARA = WORD_NAMESPACE + 'p'
SECT = WORD_NAMESPACE + 'sectPr'
sectionLoc = []
for item in tree.iter(PARA):
    for node in item.iter(TEXT):
        if 'Section' in node.text: #this is how I am identifying which headers I need to edit
            print(node.text)
            sectionLoc.append(node.text)
    for sect in item.iter(SECT):
        print(sect)
        sectionLoc.append('section')
        # print(etree.tostring(sect))
counter =0
sectionLocs = []
for index, item in enumerate(sectionLoc): #just some logic to get the correct section number from the xml parse
    if 'Section' in item:
        sectionLocs.append(counter)
        continue
    counter += 1
#ok now use those locations with docx to adjust the headers
#remember that start_doc here needs to be the new result from win32 process-
#so start_doc = Document('C:\path to\combined.docx') in this case
for item in sectionLocs:
    section = start_doc.sections[item]
    header = section.header
    para_new = header.paragraphs[0]
    para_new.text = 'TEST!'
start_doc.save('willthiswork.docx')
This is a lot of work. I bet there is a way to do it entirely with win32com, but I couldn't figure out how to get the relevant sections with it based on the content in the body of the page. The "sectPr" tags always come at the end of the page, so when I comb the document for text I know is on the page that needs a new header ("Section"), I know that the next section printed out is the one I want to edit, and I just get its location in the list.
I think this whole workflow is a hack but it works and I hope the sample code helps someone.
You can use header.text like
header = section.header
header.text = 'foobar'
see http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html?highlight=header for more information
import docx
document = docx.Document()
header_section = document.sections[0]
header = header_section.header
header_text = header.paragraphs[0]
header_text.text = "Header of document"
You can use \t either side of text to align it in the centre
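The tab trick can be wrapped in a small helper, sketched below. The name set_header_text is hypothetical, and the sketch assumes the header paragraph's style defines a centre and a right tab stop, which should be the case for the "Header" style in the default python-docx template.

```python
def set_header_text(document, left="", center="", right=""):
    """Write tab-separated text into the first section's header paragraph."""
    header = document.sections[0].header
    paragraph = header.paragraphs[0]
    # Each tab moves on to the next tab stop: first centre, then right.
    paragraph.text = f"{left}\t{center}\t{right}"
    return paragraph


# Typical use with python-docx (commented out so the sketch runs without it):
# import docx
# document = docx.Document()
# set_header_text(document, center="Header of document")
# document.save("with_header.docx")
```

Leaving left and right empty reproduces the "\t either side of text" advice for a centred header.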
