Put header in Word document with python-docx

I am using python-docx to create and write a Word document.
How can I put text in the document header using python-docx?
http://image.prntscr.com/image/8757b4e6d6f545a5ab6a08a161e4c55e.png
Thanks

UPDATE: This feature has been implemented since the time of this answer.
As other respondents have noted below, the Section object provides access to its header objects.
header = document.sections[0].header
Note that a section can have up to three headers (first_page, odd_pages, even_pages) and each section can have its own set of headers. The most common situation is that a document has a single header and a single section.
A header is like a document body or table cell in that it can contain tables and/or paragraphs and by default has a single empty paragraph (it cannot contain zero paragraphs).
header.paragraphs[0].text = "My header text"
This is explained in greater detail on this page of the documentation:
https://python-docx.readthedocs.io/en/latest/user/hdrftr.html
Unfortunately this feature is not implemented yet. The page @SamRogers linked to is part of the enhancement proposal (a.k.a. the "analysis page"). The implementation is in progress, however, by @eupharis, so it might be available in a month or so. The ongoing pull request is here if you want to follow it: https://github.com/python-openxml/python-docx/pull/291

This feature has been implemented. See: https://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html
You can add text to the header of a word document using python-docx as follows:
header = document.sections[0].header
head = header.paragraphs[0]
head.text = 'Add Your Text'

This is what I've been using, and it works:
header = document.sections[0].header
header.add_paragraph('Test Header')
Header is a subclass of BlockItemContainer, from which it inherits the same content editing capabilities as Document, such as .add_paragraph().

(With respect that this question is old...)
I have used a workaround in my project, where my "client" wanted different headers on different pages, by:
Creating a document using python-docx and section breaks
Executing a Word macro file (*.xlsm) with two arguments: fileName (the document path) and docTitle (the title of the document, to be inserted in the footer)
The macro file opens the newly created document and adds headers and footers that are already inside the macro file. This would need to be modified if the header and footer text need to vary.
Python code:
import win32com.client

wd = win32com.client.Dispatch("Word.Application")
wd.Visible = False
doc = wd.Documents.Open(pathToDOCM)  # path here
wd.Run("Main.RunMain", fileName, docTitle)  # 2 args
doc.Close()
del wd
VBA (inside the *.xlsm file) code:
Sub RunInside()
    Call RunMain("C:\Users\???\dokument.docx", "test")
End Sub

Sub RunMain(wordDocument As String, wordTitle As String)
    ' Create Headers
    Call CreateHeaders(wordDocument, wordTitle)
End Sub

Sub CreateHeaders(wordDocument As String, wordTitle As String)
    Dim i As Integer
    Dim outputName As String
    Dim aDoc As Document
    Dim oApp As Word.Application
    Dim oSec As Word.Section
    Dim oDoc As Word.Document
    Dim hdr1, hdr2 As HeaderFooter
    Dim ftr1, ftr2 As HeaderFooter

    'Create a new document in Word
    Set oApp = New Word.Application
    'Set oDoc = oApp.Documents.Add
    Set oDoc = oApp.Documents.Open(wordDocument)
    'Set aDoc as active document
    Set aDoc = ActiveDocument
    oDoc.BuiltInDocumentProperties("Title") = wordTitle
    For i = 1 To 9
        Set hdr1 = aDoc.Sections(i).Headers(wdHeaderFooterPrimary)
        Set hdr2 = oDoc.Sections(i).Headers(wdHeaderFooterPrimary)
        Set ftr1 = aDoc.Sections(i).Footers(wdHeaderFooterPrimary)
        Set ftr2 = oDoc.Sections(i).Footers(wdHeaderFooterPrimary)
        If i > 1 Then
            With oDoc.Sections(i).Headers(wdHeaderFooterPrimary)
                .LinkToPrevious = False
            End With
            With oDoc.Sections(i).Footers(wdHeaderFooterPrimary)
                .LinkToPrevious = False
            End With
        End If
        hdr1.Range.Copy
        hdr2.Range.Paste
        ftr1.Range.Copy
        ftr2.Range.Paste
    Next i
    outputName = Left(wordDocument, Len(wordDocument) - 5)
    outputName = outputName + ".pdf"
    oDoc.SaveAs outputName, 17
    oDoc.Close SaveChanges:=wdSaveChanges
    Set oDoc = Nothing
    Set aDoc = Nothing
End Sub
Final remark:
The code loops through the different sections and copy-pastes the headers and footers from the active document into the opened one. It also saves the document as a *.pdf file.

For those of you looking to set custom headers with docx:
I had to use a couple of packages to get this to work. My use case was this: I was generating multiple templates and then merging them together. However, when I merged them with docx, the header from the master file (below) was applied to all sections, and all sections were marked as LinkToPrevious = True despite being False in the original files. Still, docx does a really nice job of appending files and having it come out error-free on the other end, so I decided to find a way to make it work. Code for reference:
from docx import Document
from docxcompose.composer import Composer  # pip install docxcompose

master = Document(files[0])
composer = Composer(master)
footnotes_doc = Document('chapters/footnotes.docx')
for file in files[1:]:
    mergeDoc = Document(file)
    composer.append(mergeDoc)
composer.append(footnotes_doc)
composer.save("chapters/combined.docx")
So now I have a master doc (combined.docx) with all the proper sections, but the headers need to be adjusted. You can't iterate over the document with docx, get the current section you are in, and adjust it or set the header's linking to False. If you set LinkToPrevious to False you wipe the header completely. You can explicitly call a section and adjust it, but since everything after it is linked to previous, you change the rest of the document from that point. So I pulled in win32com:
The following gets the number of sections and then iterates through them backwards using win32com. This way, as you remove LinkToPrevious, you preserve each header in place.
def getSections(document):
    sectionArray = {}
    sections = document.sections
    x = 1
    for section in sections:
        sectionArray[x] = section
        x += 1
    return sectionArray
import win32com.client

start_doc = Document('chapters/combined.docx')
listArray = getSections(start_doc)  # gets an array of all sections
keylist = list(reversed(sorted(listArray.keys())))  # now reverse it
word = win32com.client.gencache.EnsureDispatch("Word.Application")
word.Visible = False
# tell word to open the document
word.Documents.Open(r'C:\path to\combined.docx')
# open it internally
doc = word.Documents(1)
try:
    for item in keylist:
        word.ActiveDocument.Sections(item).Headers(win32com.client.constants.wdHeaderFooterPrimary).LinkToPrevious = False
        word.ActiveDocument.Sections(item).Headers(win32com.client.constants.wdHeaderFooterEvenPages).LinkToPrevious = False
    word.ActiveDocument.SaveAs(r"c:\wherever\combined_1.docx")
    doc.Close()
    word.Quit()
except:
    doc.Close()
    word.Quit()
OK, so now the doc is primed for editing the headers, which we can do easily and worry-free with docx. First we need to parse the XML (I use docx to access it, then feed it to lxml) to get the location of the sections needed:
from lxml import etree

xml = str(start_doc._element.xml)  # this gets the full XML using docx
tree = etree.fromstring(xml)
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
TEXT = WORD_NAMESPACE + 't'
PARA = WORD_NAMESPACE + 'p'
SECT = WORD_NAMESPACE + 'sectPr'
sectionLoc = []
for item in tree.iter(PARA):
    for node in item.iter(TEXT):
        if 'Section' in node.text:  # this is how I identify which headers I need to edit
            print(node.text)
            sectionLoc.append(node.text)
    for sect in item.iter(SECT):
        print(sect)
        sectionLoc.append('section')
        # print(etree.tostring(sect))
counter = 0
sectionLocs = []
# just some logic to get the correct section number from the XML parse
for index, item in enumerate(sectionLoc):
    if 'Section' in item:
        sectionLocs.append(counter)
        continue
    counter += 1
# OK, now use those locations with docx to adjust the headers.
# Remember that start_doc here needs to be the new result of the win32 process -
# so start_doc = Document(r'C:\path to\combined_1.docx') in this case
for item in sectionLocs:
    section = start_doc.sections[item]
    header = section.header
    para_new = header.paragraphs[0]
    para_new.text = 'TEST!'
start_doc.save('willthiswork.docx')
This is a lot of work. I bet there is a way to do it entirely with win32com, but I couldn't figure out how to get the relevant sections with it based on the content in the body of the page. The "sectPr" tags always come at the end of the page, so in combing the document for text I know is on the page that needs a new header ("Section"), I know that the next section element printed out is the one I want to edit, so I just get its location in the list.
I think this whole workflow is a hack but it works and I hope the sample code helps someone.

You can use header.text like
header = section.header
header.text = 'foobar'
See http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html?highlight=header for more information.

import docx
document = docx.Document()
header_section = document.sections[0]
header = header_section.header
header_text = header.paragraphs[0]
header_text.text = "Header of document"
You can use \t on either side of the text to align it in the centre.


Split a big PDF into multiple smaller PDFs of different page lengths, based on a specific string's appearance in the big PDF, using Python

Problem
I have a long PDF file with many pages. I want this PDF split into many smaller files whose lengths are derived from the text content of the long PDF. You can imagine each string as something that activates a pair of scissors that cuts the long PDF and even gives the smaller PDF its filename.
The "scissors" strings are generated by the following iterator and are represented by text:
for municipality in array_merged_zone_send:
    text = f'PANORAMICA DI {municipality.upper()}'
If I print text inside the iterator, the result is:
PANORAMICA DI BELLINZONA
PANORAMICA DI RIVIERA
PANORAMICA DI BLENIO
PANORAMICA DI ACQUAROSSA
The strings above are unique values; each appears only once.
Above I have shown only the first four; there are more, and EVERY item is written in the original PDF that I want to split. Each item appears exactly once in the original PDF (the match is always one to one), and the PDF never contains an additional "PANORAMICA DI ........" that is not among the items obtained by the iteration. PANORAMICA means OVERVIEW in English.
Here is an example of the pages inside the original PDF containing the string that comes from the item "PANORAMICA DI BLENIO":
What I want to do: I want to split the original pdf every time that appears the string item.
In the image above, the original PDF has to be split in two: the first PDF ends on the page before "PANORAMICA DI BLENIO"; the second begins on the page with "PANORAMICA DI BLENIO" and ends on the page before the next "PANORAMICA DI {municipality.upper()}". The resulting PDF name is "zp_Blenio.pdf" for the second and "zp_Acquarossa.pdf" for the first. This should be no problem, because municipality without upper() is already in the right form (in other words, it is "Acquarossa" and "Blenio").
Another example, to aid understanding, with a simplified simulation (my file has more pages):
The original PDF is 12 pages long (note: this is not code; I only formatted it as code for readability):
page 1: "PANORAMICA DI RIVIERA"
page 2: no match with "text" item
page 3: no match with "text" item
page 4: "PANORAMICA DI ACQUAROSSA"
page 5: no match with "text" item
page 6: "PANORAMICA DI BLENIO"
page 7: no match with "text" item
page 8: no match with "text" item
page 9: no match with "text" item
page 10: no match with "text" item
page 11: "PANORAMICA DI BELLINZONA"
page 12: no match with "text" item
The results will be (again, not code; formatted as code for readability):
first created pdf is from page 1 to page 3
second created pdf is from page 4 to page 5
third pdf is from page 6 to 10
forth pdf is from page 11 to 12
The rule is: split at the page where the text appears, and continue until the page before the text appears again; then repeat, and so on.
Take care: my original PDF is part of a long Python workflow and the PDF changes every time, but the "PANORAMICA DI ....." rule does not change. In other words, the number of pages between "PANORAMICA DI ACQUAROSSA" and "PANORAMICA DI BLENIO" may change. This rules out the workaround of manually setting the page intervals to split, ignoring the rules established above.
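The splitting rule can be expressed on its own, independent of any PDF library (ranges_from_breaks is a made-up helper; pages are 1-based to match the simulation above):

```python
def ranges_from_breaks(break_pages, num_pages):
    """Given the 1-based pages where the marker text appears, return
    (first_page, last_page) ranges: each part runs from one marker page
    up to the page before the next marker; the last part ends at the
    final page."""
    bounds = break_pages + [num_pages + 1]
    return [(start, nxt - 1) for start, nxt in zip(bounds, bounds[1:])]

print(ranges_from_breaks([1, 4, 6, 11], 12))
# [(1, 3), (4, 5), (6, 10), (11, 12)]
```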
Attempt to solve the problem
The only solution to this issue that I have found is obsolete code, not checked by the author, from this page: https://stackoverflow.com/a/62344714/13769033
I took that code, changed it to match the new functions and classes, and integrated the iteration that yields text.
The result of the old code after my updating is the following:
from PyPDF2 import PdfWriter, PdfReader
from pathlib import Path
import re

def getPagebreakList(file_name: str) -> list:
    pdf_file = PdfReader(file_name)
    num_pages = len(pdf_file.pages)
    page_breaks = list()
    for i in range(0, num_pages):
        Page = pdf_file.pages[i]
        Text = Page.extract_text()
        for municipality in array_merged_zone_send:
            text = f'PANORAMICA DI {municipality.upper()}'
            if re.search(text, Text):
                page_breaks.append(i)
    return page_breaks

inputpdf = PdfReader(open("./report1.pdf", "rb"))
num_pages = len(inputpdf.pages)
page_breaks = getPagebreakList("./report1.pdf")

i = 0
while (i < num_pages):
    if page_breaks:
        page_break = page_breaks.pop(0)
    else:
        page_break = num_pages
    output = PdfWriter()
    while (i != page_break + 1):
        output.add_page(inputpdf.pages[i])
        i = i + 1
    with open(Path('.') / f'zp_{municipality}.pdf', "wb") as outputStream:
        output.write(outputStream)
Unfortunately, I don't understand a large part of the code.
Among the parts I don't understand at all (and I don't know if the author made an error):
the indentation of output = PdfWriter()
the getPagebreakList('./report1.pdf') call, where I pass the same PDF that I want to split; the author wrote getPagebreakList('yourPDF.pdf'), which was nevertheless different from PdfFileReader(open("80....pdf", "rb")). I assume it should have been yourPDF.pdf in both places.
To be noted: "./report1.pdf" is the path of the PDF to split, and I am sure it is right.
The code is wrong: when I execute it I obtain "TypeError: 'list' object is not callable".
I would like someone to help me find the solution; you can modify my updated code or suggest another way to solve it. Thank you.
Suggestion to simulate
To simplify, I suggest first considering a static string in your PDF (a string that repeats every x pages) instead of part of an array.
In my case, I had considered:
Text = Page.extract_text()
text = 'PANORAMICA'
if re.search(text, Text):
    page_breaks.append(i)
....and changed the path for the output as well.
You can simply use a long PDF with fixed text that repeats periodically but irregularly (once after 3 pages, once after 5 pages, and so on).
Only when you find the solution should you integrate the iteration over municipality. The integration of municipality in the text is only needed to put municipality in the names of the new PDF files; using only "PANORAMICA" does not affect the page intervals of the new PDFs.
My suggestion is to divide the problem into smaller ones, essentially using a divide-and-conquer approach. By making single-task functions, debugging in case of mistakes should be easier. Notice that getPagebreakList is slightly different here.
import re
from PyPDF2 import PdfWriter, PdfReader

def page_breaks(pdf_r: PdfReader) -> dict:
    p_breaks = {}
    for i in range(len(pdf_r.pages)):
        page_text = pdf_r.pages[i].extract_text()
        for municipality in array_merged_zone_send:
            pattern = f'PANORAMICA DI {municipality.upper()}'
            if re.search(pattern, page_text):
                p_breaks[municipality] = i
    return p_breaks

def filenames_range_mapper(pdf_r: PdfReader, page_indices: dict) -> dict:
    num_pages = list(page_indices.values()) + [len(pdf_r.pages) + 1]  # add last page as well
    # slice the pages from the reader object
    return {name: pdf_r.pages[start:end] for name, start, end in zip(page_indices, num_pages, num_pages[1:])}

def save(file_name: str, pdf_pages: list) -> None:
    # pass the pages to the writer
    pdf_w = PdfWriter()
    for p in pdf_pages:
        pdf_w.add_page(p)
    # write to file
    with open(file_name, "wb") as outputStream:
        pdf_w.write(outputStream)
    # message
    print(f'Pdf "{file_name}" created.')

# main
# ####
# initial location of the file
file_name = "./report1.pdf"
# create reader object
pdf_r = PdfReader(open(file_name, "rb"))
# get index locations of matches
p_breaks = page_breaks(pdf_r)
# dictionary of name-pages slice objects
mapper = filenames_range_mapper(pdf_r, p_breaks)
# template file name
template_output = './zp_{}.pdf'
# iterate over the location-pages mapper
for municipality, pages in mapper.items():
    # set file name
    new_file_name = template_output.format(municipality.title())  # eventually municipality.upper()
    # save the pages into a new file
    save(new_file_name, pages)
Test the code with an auxiliary function to avoid unwanted output.
In this case it is enough to use a slightly different implementation of filenames_range_mapper, in which the values are just lists of integers (and not page objects).
def filenames_range_mapper_tester(pdf_r: PdfReader, page_indices: dict) -> dict:
    num_pages = list(page_indices.values()) + [len(pdf_r.pages) + 1]  # add last page as well
    # map each name to the page numbers instead of the pages themselves
    return {name: list(range(len(pdf_r.pages) + 1))[start:end] for name, start, end in zip(page_indices, num_pages, num_pages[1:])}

# auxiliary test
file_name = "./report1.pdf"
pdf_r = PdfReader(open(file_name, "rb"))
p_breaks = page_breaks(pdf_r)
mapper = filenames_range_mapper_tester(pdf_r, p_breaks)
template_output = './zp_{}.pdf'
for name, pages in mapper.items():
    print(template_output.format(name.title()), pages)
If the output makes sense, you can proceed with the non-testing code.
An abstraction on how to get the right pages:
# mimic return of "page_breaks"
page_breaks = {
"RIVIERA": 1,
"ACQUAROSSA": 4,
"BLENIO": 6,
"BELLINZONA": 11
}
# mimic "filenames_range_mapper"
last_page_of_pdf = 12 + 1 # increment by 1 the number of pages of the pdf!
num_pages = list(page_breaks.values()) + [last_page_of_pdf]
#[1, 4, 6, 11, 12]
mapper = {name: list(range(start, end)) for name, start, end in zip(page_breaks, num_pages, num_pages[1:])}
#{'RIVIERA': [1, 2, 3],
# 'ACQUAROSSA': [4, 5],
# 'BLENIO': [6, 7, 8, 9, 10],
# 'BELLINZONA': [11, 12]}

Python Readline Loop and Subloop

I'm trying to loop through some unstructured text data in Python. The end goal is to structure it in a dataframe. For now I'm just trying to get the relevant data into an array and understand the line/readline() functionality in Python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach
            #    the line that includes "Subject:". Ignore the "Subject:" line,
            #    stop the "Full text:" subloop, and add the concatenated full
            #    text to the list array.
            # 2. Continue the for loop within which all of this sits.
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords - even though Python technically doesn't - just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
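For completeness, the next(iterator, default) form sidesteps the StopIteration handling entirely; a minimal sketch of that while-loop style against the sample format above (parse_records is a made-up helper name):

```python
def parse_records(lines):
    """Collect (title, full_text, subject) tuples from the sample format."""
    records = []
    it = iter(lines)
    line = next(it, None)           # the default avoids StopIteration
    while line is not None:
        if line.startswith("Title:"):
            title = line[len("Title:"):].strip()
            body = []
            line = next(it, None)
            if line is not None and line.startswith("Full text:"):
                body.append(line[len("Full text:"):].strip())
                line = next(it, None)
                # accumulate body lines until the "Subject:" line
                while line is not None and not line.startswith("Subject:"):
                    body.append(line.strip())
                    line = next(it, None)
            subject = ""
            if line is not None and line.startswith("Subject:"):
                subject = line[len("Subject:"):].strip()
            records.append((title, " ".join(body), subject))
        line = next(it, None)
    return records
```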
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np

# read all file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python

How to extract multiple instances of a word from PDF files in Python?

I'm writing a Python script to read a PDF file and record both the string that appears after every instance of "time" and the page number it's mentioned on.
I have gotten it to recognize when each page has the string "time" on it and print the page number; however, if the page has "time" more than once, it does not tell me. I'm assuming this is because the criterion of having "time" at least once has already been fulfilled, so it skips to the next page to perform the check.
How would I go about finding multiple instances of the word "time"?
This is my code:
import PyPDF2

def pdf_read():
    pdfFile = r"records\document.pdf"
    pdf = PyPDF2.PdfFileReader(pdfFile)
    pageCount = pdf.getNumPages()
    for pageNumber in range(pageCount):
        page = pdf.getPage(pageNumber)
        pageContent = page.extractText()
        if "Time" in pageContent or "time" in pageContent:
            print(pageNumber)
Also, as a side note: this PDF is a scanned document, so when I read the text in Python (or copy and paste it into Word) many words come out with random symbols and characters even though the page is perfectly legible. Is this a limitation I can't get around without applying more complex concepts such as machine learning to read the files accurately?
A solution would be to create a list of strings from pageContent and count the frequency of the word 'time' in the list. It also makes it easier to select the word following 'time': you can simply retrieve the next item in the list.
import PyPDF2
import string

pdfFile = r"records\document.pdf"
pdf = PyPDF2.PdfFileReader(pdfFile)
pageCount = pdf.getNumPages()
for pageNumber in range(pageCount):
    page = pdf.getPage(pageNumber)
    pageContent = page.extractText()
    pageContent = ''.join(pageContent.splitlines()).split()  # words to list
    pageContent = ["".join(j.lower() for j in i if j not in string.punctuation) for i in pageContent]  # remove punctuation
    print(pageContent.count('time') + pageContent.count('Time'))  # count occurrences of time in list
    print([(j, pageContent[i+1] if i+1 < len(pageContent) else '') for i, j in enumerate(pageContent) if j == 'Time' or j == 'time'])  # list time and following word
Note that this example also strips all words from characters that are not letters or digits. Hopefully this sufficiently cleans up the bad OCR.
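An alternative sketch using a regular expression, which pairs each whole-word match of "time"/"Time" with the word that follows it (the sample text here is invented):

```python
import re

# Invented sample standing in for one page's extracted content.
page_text = "At this time the clock struck. Time flies when timing is measured each time today."

# \b anchors make "time"/"Time" match only as a whole word (not "timing");
# the optional second group captures the word that follows, if any.
pairs = [(m.group(1), m.group(2) or "")
         for m in re.finditer(r'\b([Tt]ime)\b(?:\s+(\w+))?', page_text)]
print(len(pairs))   # 3
print(pairs)        # [('time', 'the'), ('Time', 'flies'), ('time', 'today')]
```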

Splitting a docx by headings into separate files in Python

I want to write a program that grabs my docx files, iterates through them and splits each file into multiple separate files based on headings. Inside each of the docx there are a couple of articles, each with a 'Heading 1' and text underneath it.
So if my original file1.docx has 4 articles, I want it to be split into 4 separate files each with its heading and text.
I got to the part where it iterates through all of the files in the path where I hold the .docx files, and I can read the headings and text separately, but I can't seem to figure out a way to merge it all and split it into separate files, each with its heading and text. I am using the python-docx library.
import glob
from docx import Document

headings = []
texts = []

def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

def iter_text(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph

for name in glob.glob('/*.docx'):
    document = Document(name)
    for heading in iter_headings(document.paragraphs):
        headings.append(heading.text)
    for paragraph in iter_text(document.paragraphs):
        texts.append(paragraph.text)
    print(texts)
How do I extract the text and heading for each article?
This is the XML reading python-docx gives me. The red braces mark what I want to extract from each file.
https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png
I am open for any alternative suggestions on how to achieve what I want with different methods, or if there is an easier way to do it with PDF files.
I think the approach of using iterators is a sound one, but I'd be inclined to parcel them differently. At the top level you could have:
for paragraphs in iterate_document_sections(document):
    create_document_from_paragraphs(paragraphs)
Then iterate_document_sections() would look something like:
def iterate_document_sections(document):
    """Generate a sequence of paragraphs for each headed section in document.

    Each generated sequence has a heading paragraph in its first position,
    followed by one or more body paragraphs.
    """
    paragraphs = [document.paragraphs[0]]
    for paragraph in document.paragraphs[1:]:
        if is_heading(paragraph):
            yield paragraphs
            paragraphs = [paragraph]
            continue
        paragraphs.append(paragraph)
    yield paragraphs
Something like this combined with portions of your other code should give you something workable to start with. You'll need an implementation of is_heading() and create_document_from_paragraphs().
Note that the term "section" here is used as in common publishing parlance to refer to a (section) heading and its subordinate paragraphs, and does not refer to a Word document section object (like document.sections).
In fact, the provided solution works well only if the document doesn't contain any elements other than paragraphs (tables, for example).
Another possible solution is to iterate not only through paragraphs but all document body's child xml elements. Once you find "subdocument's" start and end elements (paragraphs with headings in your example) you should delete other irrelevant to this part xml elements (a kind of cut off all other document content). This way you can preserve all styles, text, tables and other document elements and formatting.
It's not an elegant solution and means that you have to keep a temporary copy of a full source document in memory.
This is my code:
import tempfile
from typing import Generator, Tuple, Union

from docx import Document
from docx.document import Document as DocType
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.oxml.xmlchemy import BaseOxmlElement
from docx.text.paragraph import Paragraph


def iterparts(doc_path: str, skip_first=True, bias: int = 0) -> Generator[Tuple[int, DocType], None, None]:
    """Iterate over sub-documents by splitting the source document into parts

    Split into parts by copying the source document and cutting off
    irrelevant data.

    Args:
        doc_path (str): path to source *docx* file
        skip_first (bool, optional): skip first split point and wait for
            second occurrence. Defaults to True.
        bias (int, optional): split point bias. Defaults to 0.

    Yields:
        Generator[Tuple[int, DocType], None, None]: the first element of each
            tuple indicates the number of a sub-document; if the number is 0
            then there are no sub-documents
    """
    doc = Document(doc_path)
    counter = 0
    while doc:
        split_elem_idx = -1
        doc_body = doc.element.body
        cutted = [doc, None]
        for idx, elem in enumerate(doc_body.iterchildren()):
            if is_split_point(elem):
                if split_elem_idx == -1 and skip_first:
                    split_elem_idx = idx
                else:
                    cutted = split(doc, idx + bias)  # idx-1 to keep previous paragraph
                    counter += 1
                    break
        yield (counter, cutted[0])
        doc = cutted[1]


def is_split_point(element: BaseOxmlElement) -> bool:
    """Split criteria

    Args:
        element (BaseOxmlElement): oxml element

    Returns:
        bool: whether the *element* is the beginning of a new sub-document
    """
    if isinstance(element, CT_P):
        p = Paragraph(element, element.getparent())
        return p.text.startswith("Some text")
    return False


def split(doc: DocType, cut_idx: int) -> Tuple[DocType, DocType]:
    """Split into parts by copying the source document and cutting off
    irrelevant data.

    Args:
        doc (DocType): [description]
        cut_idx (int): [description]

    Returns:
        Tuple[DocType, DocType]: [description]
    """
    tmpdocfile = write_tmp_doc(doc)
    second_part = doc
    second_elems = list(second_part.element.body.iterchildren())
    for i in range(0, cut_idx):
        remove_element(second_elems[i])
    first_part = Document(tmpdocfile)
    first_elems = list(first_part.element.body.iterchildren())
    for i in range(cut_idx, len(first_elems)):
        remove_element(first_elems[i])
    tmpdocfile.close()
    return (first_part, second_part)


def remove_element(elem: Union[CT_P, CT_Tbl]):
    elem.getparent().remove(elem)


def write_tmp_doc(doc: DocType):
    tmp = tempfile.TemporaryFile()
    doc.save(tmp)
    return tmp
Note that you should define the is_split_point method according to your own split criteria.

extract a certain quote after a keyword has been detected in Python 3

I'm trying to make a multi-term definer to quicken the process of searching for the definitions individually.
After python loads a webpage, it saves the page as a temporary text file.
Sample of saved page: ..."A","Answer":"","Abstract":"Harriet Tubman was an American abolitionist.","ImageIs...
In this sample, I'm after the string that contains the definition, in this case Harriet Tubman. The string "Abstract": is the portion always before the definition of the term.
What I need is a way to scan the text file for "Abstract":. Once that has been detected, look for the opening ", then copy and save all text to another text file until reaching the closing ".
If you just wanted to find the string following "Abstract": you could take a substring.
page = '..."A","Answer":"","Abstract":"Harriet Tubman was an American abolitionist.","ImageIs...'
i = page.index("Abstract") + 11  # len('Abstract":"') == 11, so i lands just past the opening quote
defn = page[i: page.index("\"", i)]
If you wanted to extract multiple parts of the page you should try the following.
dict_str = '"Answer":"","Abstract":"Harriet Tubman was an American abolitionist."'
definitions = {}
for kv in dict_str.split(","):
    parts = kv.replace("\"", "").split(":")
    if len(parts) != 2:
        continue
    definitions[parts[0]] = parts[1]

definitions['Abstract']  # 'Harriet Tubman was an American abolitionist.'
definitions["Answer"]    # ''
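As an aside, the saved page in the sample looks like a fragment of a JSON response; if the full file is valid JSON, parsing it properly is more robust than scanning for quotes (the complete string below is invented to illustrate the idea):

```python
import json

# Invented, complete JSON along the lines of the sample fragment.
page = '{"Answer": "", "Abstract": "Harriet Tubman was an American abolitionist."}'
data = json.loads(page)
print(data["Abstract"])   # Harriet Tubman was an American abolitionist.
```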
