Is it possible to add a single horizontal line spanning the entire page width after a paragraph?
By using the add_header such a line is created after the header:
document.add_heading('Test', 0)
I'd need such a line between two paragraphs in the document, so is it possible to just add the line by itself?
This is possible through the python-docx API using low-level functions. This is taken from a GitHub comment
from docx.oxml.shared import OxmlElement
from docx.oxml.ns import qn
def insertHR(paragraph):
p = paragraph._p # p is the <w:p> XML element
pPr = p.get_or_add_pPr()
pBdr = OxmlElement('w:pBdr')
pPr.insert_element_before(pBdr,
'w:shd', 'w:tabs', 'w:suppressAutoHyphens', 'w:kinsoku', 'w:wordWrap',
'w:overflowPunct', 'w:topLinePunct', 'w:autoSpaceDE', 'w:autoSpaceDN',
'w:bidi', 'w:adjustRightInd', 'w:snapToGrid', 'w:spacing', 'w:ind',
'w:contextualSpacing', 'w:mirrorIndents', 'w:suppressOverlap', 'w:jc',
'w:textDirection', 'w:textAlignment', 'w:textboxTightWrap',
'w:outlineLvl', 'w:divId', 'w:cnfStyle', 'w:rPr', 'w:sectPr',
'w:pPrChange'
)
bottom = OxmlElement('w:bottom')
bottom.set(qn('w:val'), 'single')
bottom.set(qn('w:sz'), '6')
bottom.set(qn('w:space'), '1')
bottom.set(qn('w:color'), 'auto')
pBdr.append(bottom)
This will add a horizontal rule to a paragraph.
This is not supported directly yet by the python-docx API.
However, you can achieve this effect by creating a paragraph style that has this border setting in the "template" document you open to create a new document, e.g.
document = Document('template-with-style.docx')
Then you can apply that style to a new paragraph in that document using python-docx.
Just add
document.add_paragraph("_____________________________________________")
length of this line may not be enough but you can add it ,i hope you have solved your problems this is for new comers
Related
So I work in NLP with hundreds of PDFs and the thing I hate is that since there is no one way of writing PDF I have to write a script to handle that (almost all the script is for the two-column PDF with tables and other weird stuff) and when I input one column one it gets messed up. Is there any way to detect if a PDF is one or two-column and run the fixing script only for the two column one after that? Please help me out with this.
This is what the PDFs look like
One column PDF
Two column PDF
disclaimer: I am the author of borb, the library used in this answer
borb has several classes that process PDF documents. These all implement EventListener. The idea is that they listen to the processing of pdf syntax, and process events (e.g.: an image has been rendered, a string was rendered, a new page has started, etc).
One of these implementations is SimpleParagraphExtraction. It attempts to use geometric information to determine which text should be separated from other text, and when something makes up a line of text, and when several lines make up a paragraph.
This is how you'd use it:
# read document
l: SimpleParagraphExtraction = SimpleParagraphExtraction(maximum_multiplied_leading=Decimal(1.7))
doc: typing.Optional[Document] = None
with open("input.pdf", "rb") as pdf_file_handle:
doc = PDF.loads(pdf_file_handle, [l])
Once you've processed the PDF, you can now do something with the paragraphs you've detected.
for p in l.get_paragraphs_for_page(0):
doc.get_page(0).add_annotation(
SquareAnnotation(p.get_bounding_box(), stroke_color=HexColor("f1cd2e"))
)
The above code adds a colored rectangle around each paragraph.
You can easily modify this code to determine how many paragraphs appear side-by-side. Which should help you determine whether something is single- or multi-column layout.
edit: This is a quick write-up I did:
from pathlib import Path
import typing
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_paragraph_extraction import SimpleParagraphExtraction
from borb.pdf.canvas.layout.annotation.square_annotation import SquareAnnotation
from borb.pdf import HexColor
from borb.pdf import Paragraph
from decimal import Decimal
import requests
open("example_001.pdf", "wb").write(requests.get("https://github.com/Da-vid21/Outputs/raw/main/BarCvDescLJ11.pdf").content)
open("example_002.pdf", "wb").write(requests.get("https://github.com/Da-vid21/Outputs/raw/main/Bill-Brown-Reprint.pdf").content)
# open PDF
l: SimpleParagraphExtraction = SimpleParagraphExtraction(maximum_multiplied_leading=1.6)
with open("example_002.pdf", "rb") as fh:
doc = PDF.loads(fh, [l])
# build histogram (number of paragraphs per y-coordinate)
ps: typing.List[Paragraph] = l.get_paragraphs_for_page(0)
h: typing.Dict[int, int] = {}
for p in ps:
y0: int = int(p.get_bounding_box().get_y())
y1: int = int(y0 + p.get_bounding_box().get_height())
for y in range(y0, y1):
h[y] = h.get(y, 0) + 1
# display average
avg_paras_per_y: float = sum([x for x in h.values()]) / len(h)
print(avg_paras_per_y)
This outputs:
1.5903010033444815
On average, your two-column document has 1.6 paragraphs per y-coordinate. That would seem to indicate it's a two-column layout.
I am extracting text from pdf files using python pdfminer library (see docs).
However, pdfminer seems unable to extract all texts in some files and extracts LTFigure object instead. Assuming from position of this object it "covers" some of the text and thus this text is not extracted.
Both pdf file and short jupyter notebook with the code extracting information from pdf are in the Github repository I created specifically in order to ask this question:
https://github.com/druskacik/ltfigure-pdfminer
I am not an expert on how pdf files work but common sense tells me that if I can look for the text using control + f in browser, it should be extractable.
I have considered using some other library but the problem is that I also need positions of the extracted words (in order to use them for my machine learning model), which is a functionality only pdfminer seems to provide.
Ok, so I finally came up with the solution. It's very simple - it's possible to iterate over LTFigure object in the same way you would iterate over e.g. LTTextBox object.
interpreter.process_page(page)
layout = device.get_result()
for lobj in layout:
if isinstance(lobj, LTTextBox):
for element in lobj:
if isinstance(element, LTTextLine):
text = element.get_text()
print(text)
elif isinstance(lobj, LTFigure):
for element in lobj:
if isinstance(element, LTChar):
text = element.get_text()
print(text)
Note that the correct way (as to make sure that the parser reads everything in the document) would be to iterate pdfminer objects recursively, as shown here: How does one obtain the location of text in a PDF with PDFMiner?
Given that you also consider other libraries, I suggest using poppler-util's pdftohtml to convert the pdf to xml:
!apt-get install -y poppler-utils
!pdftohtml -c -hidden -xml document.pdf output.xml
It will output an xml file with the text and top, left, width, and height values for the boxes. It had no issues with the text that pdfminer doesn't recognize.
I want to put the logo and the barcode on the same level and not one on top of the other.The logo should stay at the very left and the barcode at the very right of the word file.Here is my code , thank you:
import uuid
import pandas as pd
import pyqrcode
from docx import Document
from docx.shared import Inches
qr=pyqrcode.create(str(uuid.uuid4()).replace('-',''))
qr.png('somecode.png')
df=pd.DataFrame(pd.read_excel('Komplette-
GastAccountFHZugangsdatenFertig.xlsx'))
Attributes=['Name', 'Vorname ', 'Geschlecht', 'Adresse (in Marokko)',
'Telefonnummer', 'E-Mailadresse', 'Studiengang', 'Semester']
document = Document()
document.add_heading('Informationen.',level=0)
document.add_picture('Informatik_Logo.png',height=Inches(1.0))
p = document.add_paragraph()
r = p.add_run()
p_format=p.paragraph_format
p_format.left_indent=Inches(4.5)
r.add_picture('somecode.png',width=Inches(1.0))
table=document.add_table(len(Attributes),2,style='LightGrid-Accent1')
for i in range(len(Attributes)):
row=table.rows[i]
row.cells[0].text=Attributes[i]
Infos=df[Attributes[i]]
string=str(Infos[49])
row.cells[1].text=string
document.save('sample.docx')
From what I see in the documentation, python-docx only currently supports inline pictures, not floating pictures, which means that you can only get the look you currently have. From the docs:
At the time of writing, python-docx only supports inline pictures. Floating pictures can be added. If you have an active use case, submit a feature request on the issue tracker. The Document.add_picture() method adds a specified picture to the end of the document in a paragraph of its own.
Based on that last sentence, I think what you're trying to do is currently impossible. A workaround might be to insert a table with one row and two columns, and insert an image in each cell.
i can help you with that,just add one sentence into your code
r.add_picture('somecode.png',width=Inches(2.0))
that is all. just write the sentence one more time.and if you want more picture in the same level, all you need to do is that add the sentence. i tried and it works to me.
there is my test code
dc = Document()
run = dc.add_paragraph().add_run()
run.add_picture("./picture/danilise.jpg", width=Inches(1.1))
run.add_picture("./picture/danilise.jpg", width=Inches(1.3))
run.add_picture("./picture/danilise.jpg", width=Inches(1.5))
dc.save("test1.docx")
I'm trying to get the image index from the .docx file using python-docx library. I'm able to extract the name of the image, image height and width. But not the index where it is in the word file
import docx
doc = docx.Document(filename)
for s in doc.inline_shapes:
print (s.height.cm,s.width.cm,s._inline.graphic.graphicData.pic.nvPicPr.cNvPr.name)
output
21.228 15.920 IMG_20160910_220903848.jpg
In fact I would like to know if there is any simpler way to get the image name , like s.height.cm fetched me the height in cm. My primary requirement is to get to know where the image is in the document, because I need to extract the image and do some work on it and then again put the image back to the same location
This operation is not directly supported by the API.
However, if you're willing to dig into the internals a bit and use the underlying lxml API it's possible.
The general approach would be to access the ImagePart instance corresponding to the picture you want to inspect and modify, then read and write the ._blob attribute (which holds the image file as bytes).
This specimen XML might be helpful:
http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml
From the inline shape containing the picture, you get the <a:blip> element with this:
blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip
The relationship id (r:id generally, but r:embed in this case) is available at:
rId = blip.embed
Then you can get the image part from the document part
document_part = document.part
image_part = document_part.related_parts[rId]
And then the binary image is available for read and write on ._blob.
If you write a new blob, it will replace the prior image when saved.
You probably want to get it working with a single image and get a feel for it before scaling up to multiple images in a single document.
There might be one or two image characteristics that are cached, so you might not get all the finer points working until you save and reload the file, so just be alert for that.
Not for the faint of heart as you can see, but should work if you want it bad enough and can trace through the code a bit :)
You can also inspect paragraphs with a simple loop, and check which xml contains an image (for example if an xml contains "graphicData"), that is which is an image container (you can do the same with runs):
from docx import Document
image_paragraphs = []
doc = Document(path_to_docx)
for par in doc.paragraphs:
if 'graphicData' in par._p.xml:
image_paragraphs.append(par)
Than you unzip docx file, images are in the "images" folder, and they are in the same order as they will be in the image_paragraphs list. On every paragraph element you have many options how to change it. If you want to extract img process it and than insert it in the same place, than
paragraph.clear()
paragraph.add_run('your description, if needed')
run = paragraph.runs[0]
run.add_picture(path_to_pic, width, height)
So, I've never really written any answers here, but i think this might be the solution to your problem. With this little code you can see the position of your images given all the paragraphs. Hope it helps.
import docx
doc = docx.Document(filename)
paraGr = []
index = []
par = doc.paragraphs
for i in range(len(par)):
paraGr.append(par[i].text)
if 'graphicData' in par[i]._p.xml:
index.append(i)
If you are using Python 3
pip install python-docx
import docx
doc = docx.Document(document_path)
P = []
I = []
par = doc.paragraphs
for i in range(len(par)):
P.append(par[i].text)
if 'graphicData' in par[i]._p.xml:
I.append(i)
print(I)
#returns list of index(Image_Reference)
I'm trying to automate the creation of .docx files (WordML) with the help of python-docx (https://github.com/mikemaccana/python-docx). My current script creates the ToC manually with following loop:
for chapter in myChapters:
body.append(paragraph(chapter.text, style='ListNumber'))
Does anyone know of a way to use the "word built-in" ToC-function, which adds the index automatically and also creates paragraph-links to the individual chapters?
Thanks a lot!
The key challenge is that a rendered ToC depends on pagination to know what page number to put for each heading. Pagination is a function provided by the layout engine, a very complex piece of software built into the Word client. Writing a page layout engine in Python is probably not a good idea, definitely not a project I'm planning to undertake anytime soon :)
The ToC is composed of two parts:
the element that specifies the ToC placement and things like which heading levels to include.
the actual visible ToC content, headings and page numbers with dotted lines connecting them.
Creating the element is pretty straightforward and relatively low-effort. Creating the actual visible content, at least if you want the page numbers included, requires the Word layout engine.
These are the options:
Just add the tag and a few other bits to signal Word the ToC needs to be updated. When the document is first opened, a dialog box appears saying links need to be refreshed. The user clicks Yes and Bob's your uncle. If the user clicks No, the ToC title appears with no content below it and the ToC can be updated manually.
Add the tag and then engage a Word client, by means of C# or Visual Basic against the Word Automation library, to open and save the file; all the fields (including the ToC field) get updated.
Do the same thing server-side if you have a SharePoint instance or whatever that can do it with Word Automation Services.
Create an AutoOpen macro in the document that automatically runs the field update when the document is opened. Probably won't pass a lot of virus checkers and won't work on locked-down Windows builds common in a corporate setting.
Here's a very nice set of screencasts by Eric White that explain all the hairy details
Sorry for adding comments to an old post, but I think it may be helpful.
This is not my solution, but it has been found there: https://github.com/python-openxml/python-docx/issues/36
Thanks to https://github.com/mustash and https://github.com/scanny
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
paragraph = self.document.add_paragraph()
run = paragraph.add_run()
fldChar = OxmlElement('w:fldChar') # creates a new element
fldChar.set(qn('w:fldCharType'), 'begin') # sets attribute on element
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve') # sets attribute on element
instrText.text = 'TOC \\o "1-3" \\h \\z \\u' # change 1-3 depending on heading levels you need
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')
fldChar3 = OxmlElement('w:t')
fldChar3.text = "Right-click to update field."
fldChar2.append(fldChar3)
fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(fldChar4)
p_element = paragraph._p
Please see explanations in the code comments.
# First set directory where you want to save the file
import os
os.chdir("D:/")
# Now import required packages
import docx
from docx import Document
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
# Initialising document to make word file using python
document = Document()
# Code for making Table of Contents
paragraph = document.add_paragraph()
run = paragraph.add_run()
fldChar = OxmlElement('w:fldChar') # creates a new element
fldChar.set(qn('w:fldCharType'), 'begin') # sets attribute on element
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve') # sets attribute on element
instrText.text = 'TOC \\o "1-3" \\h \\z \\u' # change 1-3 depending on heading levels you need
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')
fldChar3 = OxmlElement('w:t')
fldChar3.text = "Right-click to update field."
fldChar2.append(fldChar3)
fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(fldChar4)
p_element = paragraph._p
# Giving headings that need to be included in Table of contents
document.add_heading("Network Connectivity")
document.add_heading("Weather Stations")
# Saving the word file by giving name to the file
name = "mdh2"
document.save(name+".docx")
# Now check word file which got created
# Select "Right-click to update field text"
# Now right click and then select update field option
# and then click on update entire table
# Now,You will find Automatic Table of Contents
#Mawg // Updating ToC
Had the same issue to update the ToC and googled for it. Not my code, but it works:
word = win32com.client.DispatchEx("Word.Application")
doc = word.Documents.Open(input_file_name)
doc.TablesOfContents(1).Update()
doc.Close(SaveChanges=True)
word.Quit()