I am creating a PDF document using borb and trying to align text within table cells.
from decimal import Decimal
from pathlib import Path

from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import SingleColumnLayout
from borb.pdf import Paragraph
from borb.pdf import PDF
from borb.pdf import Alignment
from borb.pdf import FixedColumnWidthTable

pdf = Document()
page = Page()
pdf.add_page(page)
layout = SingleColumnLayout(page)
layout.add(
    FixedColumnWidthTable(number_of_columns=1, number_of_rows=1)
    .add(Paragraph(
        """
Report generated on 2022-01-01 at 00:00 am (UTC)
Date: 01 Jan
""",
        text_alignment=Alignment.RIGHT,
        padding_top=Decimal(12),
        respect_newlines_in_text=True,
        font_size=Decimal(10))))
with open(Path("output.pdf"), "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)
However, the text is not aligned to the far right; it ends up in the middle of the cell (see the image). How can I align the text to the right border of the table?
disclaimer: I am the author of borb
You are experiencing the difference between the horizontal_alignment of a LayoutElement and the text_alignment of said element.
When performing layout on a text-carrying LayoutElement, the logic is roughly the following:
1. How wide is this text going to be? That will be the width of this LayoutElement. This step takes into account possible overflow, moving text to the next line, etc.
2. Now that we've determined the width (and height) of the LayoutElement, we take into account text_alignment.
Your text fits on roughly half the Page, so the content box of your Paragraph is "roughly half the page". If you then apply text_alignment, it doesn't really do much.
I would suggest you simply set the horizontal_alignment to Alignment.RIGHT. That yields the following PDF:
I also wonder why you're using a Table to perform layout. This seems like bad document design.
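To make the difference concrete, here is a minimal sketch of the two-step logic described above. It is a simplified model written for illustration, not borb's actual implementation: the function names, the string alignment values, and the widths are all assumptions. The content box is taken to be as wide as the longest line, text_alignment positions each line inside that box, and horizontal_alignment positions the box inside the cell.

```python
from typing import List


def _offset(inner: float, outer: float, alignment: str) -> float:
    """Offset of an inner span placed inside an outer span (hypothetical helper)."""
    if alignment == "RIGHT":
        return outer - inner
    if alignment == "CENTERED":
        return (outer - inner) / 2
    return 0  # LEFT


def right_edges(line_widths: List[float], cell_width: float,
                text_alignment: str, horizontal_alignment: str) -> List[float]:
    """Return the x position of each line's right edge, measured from the cell's left edge."""
    # Step 1: the content box is only as wide as the widest line.
    box_width = max(line_widths)
    # Step 2a: horizontal_alignment places the whole box inside the cell.
    box_x = _offset(box_width, cell_width, horizontal_alignment)
    # Step 2b: text_alignment places each line inside the (narrow) box.
    return [box_x + _offset(w, box_width, text_alignment) + w for w in line_widths]


# Two lines (240 and 60 units wide) in a cell that is 500 units wide.
print(right_edges([240, 60], 500, "RIGHT", "LEFT"))
print(right_edges([240, 60], 500, "RIGHT", "RIGHT"))
```

In this model, text_alignment alone lines the two lines up with each other but leaves them at the right edge of a box that only spans about half the cell. Setting horizontal_alignment to RIGHT is what moves the box itself to the cell border.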
Related
I work in NLP with hundreds of PDFs. Because there is no single way of laying out a PDF, I have had to write a script to handle them. Almost all of that script deals with two-column PDFs containing tables and other irregular content, and when I feed it a one-column PDF the output gets messed up. Is there any way to detect whether a PDF is one- or two-column, so that I can run the fixing script only on the two-column ones? Any help would be appreciated.
This is what the PDFs look like
One column PDF
Two column PDF
disclaimer: I am the author of borb, the library used in this answer
borb has several classes that process PDF documents. These all implement EventListener. The idea is that they listen to the processing of PDF syntax and handle events (e.g. an image has been rendered, a string was rendered, a new page has started, etc.).
One of these implementations is SimpleParagraphExtraction. It attempts to use geometric information to determine which text should be separated from other text, and when something makes up a line of text, and when several lines make up a paragraph.
This is how you'd use it:
# read document
l: SimpleParagraphExtraction = SimpleParagraphExtraction(maximum_multiplied_leading=Decimal(1.7))
doc: typing.Optional[Document] = None
with open("input.pdf", "rb") as pdf_file_handle:
    doc = PDF.loads(pdf_file_handle, [l])
Once you've processed the PDF, you can now do something with the paragraphs you've detected.
for p in l.get_paragraphs_for_page(0):
    doc.get_page(0).add_annotation(
        SquareAnnotation(p.get_bounding_box(), stroke_color=HexColor("f1cd2e"))
    )
The above code adds a colored rectangle around each paragraph.
You can easily modify this code to determine how many paragraphs appear side by side, which should help you determine whether a document uses a single- or multi-column layout.
edit: This is a quick write-up I did:
from pathlib import Path
import typing
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_paragraph_extraction import SimpleParagraphExtraction
from borb.pdf.canvas.layout.annotation.square_annotation import SquareAnnotation
from borb.pdf import HexColor
from borb.pdf import Paragraph
from decimal import Decimal
import requests

# download the two example documents
Path("example_001.pdf").write_bytes(requests.get("https://github.com/Da-vid21/Outputs/raw/main/BarCvDescLJ11.pdf").content)
Path("example_002.pdf").write_bytes(requests.get("https://github.com/Da-vid21/Outputs/raw/main/Bill-Brown-Reprint.pdf").content)

# open PDF
l: SimpleParagraphExtraction = SimpleParagraphExtraction(maximum_multiplied_leading=Decimal(1.6))
with open("example_002.pdf", "rb") as fh:
    doc = PDF.loads(fh, [l])

# build histogram (number of paragraphs per y-coordinate)
ps: typing.List[Paragraph] = l.get_paragraphs_for_page(0)
h: typing.Dict[int, int] = {}
for p in ps:
    y0: int = int(p.get_bounding_box().get_y())
    y1: int = int(y0 + p.get_bounding_box().get_height())
    for y in range(y0, y1):
        h[y] = h.get(y, 0) + 1

# display average
avg_paras_per_y: float = sum(h.values()) / len(h)
print(avg_paras_per_y)
This outputs:
1.5903010033444815
On average, your two-column document has 1.6 paragraphs per y-coordinate. That would seem to indicate it's a two-column layout.
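The same histogram idea can be wrapped in a small helper that turns the average into a yes/no decision. This is only a sketch under assumptions: the bounding boxes are passed in as plain (y, height) integer tuples rather than borb objects, and the function names and the 1.3 cut-off are hypothetical values chosen for illustration that you would need to tune on your own documents.

```python
from typing import Dict, List, Tuple


def average_paragraphs_per_y(boxes: List[Tuple[int, int]]) -> float:
    """Average number of paragraphs covering each occupied y-coordinate.

    Each box is (y, height), e.g. taken from a paragraph's bounding box.
    """
    histogram: Dict[int, int] = {}
    for y0, height in boxes:
        for y in range(y0, y0 + height):
            histogram[y] = histogram.get(y, 0) + 1
    if not histogram:
        return 0.0
    return sum(histogram.values()) / len(histogram)


def looks_multi_column(boxes: List[Tuple[int, int]], threshold: float = 1.3) -> bool:
    """Guess whether a page is multi-column (threshold is an assumed, tunable value)."""
    return average_paragraphs_per_y(boxes) >= threshold


# Three stacked paragraphs versus two pairs of side-by-side paragraphs.
single_column = [(0, 100), (120, 100), (240, 100)]
two_column = [(0, 100), (0, 100), (120, 100), (120, 100)]
print(looks_multi_column(single_column), looks_multi_column(two_column))
```

A page with a full-width title or abstract above two columns will land somewhere between 1 and 2, which is why a cut-off below 2 seems reasonable here.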
Here's my code:
from docx import Document
from docx.shared import Inches
import excel2img
excel2img.export_img("test.xlsx","image2.png", "Sheet1", "G13:J22")
document = Document('filename.docx')
paragraphs = document.paragraphs
paragraph = paragraphs[0]
run = paragraph.add_run()
run.add_picture('image2.png', width=Inches(6.65), height=Inches(2.02))
document.save('new.docx')
And here's what my Word output looks like:
I don't understand how to put the image above the bold sentence "Sentences to replace below", because I cannot specify a paragraph index less than 0. I guess I'm not going about this the right way? Any tips would be wonderful.
Something like this should do the needful:
top_paragraph = document.paragraphs[0]
image_paragraph = top_paragraph.insert_paragraph_before()
image_run = image_paragraph.add_run()
image_run.add_picture("my-picture.png")
An inline image lives in a run. If you don't want text to appear beside it then that run needs to be in its own paragraph. Because you want it to be the very first paragraph, you need to insert its paragraph above the existing one.
Python 3.8 x64 | Windows 10 x64 | reportlab v3.5.46 (open-source)
Been searching everywhere for an answer on this to no avail. I just want to create a PDF document that contains a lot of text. After the first page is completely full of text, the remaining text should flow onto a second page. After the second page is completely full of text, the remaining text should flow onto a third page (and so on)...
All the answers I find online say to use the canvas.showPage() method. This is not a great solution, because in all of those examples the first page is not completely filled with text, so the new page is added manually. In my case I do not know when the first page will be full, so I do not know when I need to create a new page using canvas.showPage().
I need to somehow detect when the first page cannot hold any more text, and when this occurs create a new page to hold the text which remains.
From reading over the reportlab documentation, I am not sure how to achieve this in a practical, Pythonic implementation. There are also the platypus.PageBreak() and BaseDocTemplate.afterPage() methods, but I am not sure what they do because the documentation on them is sparse.
I don't think the code I am using will be of much value for my question, but it is included below for reference. The function parameter my_text is a multi-page amount of text.
from reportlab.pdfgen.canvas import Canvas
from reportlab.lib.pagesizes import LETTER
from reportlab.lib.units import inch

def create_pdf_report_text_object(my_text):
    canvas = Canvas('Test.pdf', pagesize=LETTER)
    canvas.setFont('Helvetica', size=10)
    text_object = canvas.beginText(x=1 * inch, y=10 * inch)
    for line in my_text.splitlines(False):
        text_object.textLine(line.rstrip())
    canvas.drawText(text_object)
    canvas.save()
I believe one solution to your question is to use a Frame: the frame is re-created on every page until the text runs out, and it detects when it is full.
Please see the example below as a starting point for your own solution (it is complete; just copy, paste, and run the code, and a PDF called "Example_output.pdf" should be created).
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import cm
from reportlab.platypus import BaseDocTemplate, Frame, KeepTogether, PageTemplate, Paragraph

styleSheet = getSampleStyleSheet()


########################################################################
def create_pdf():
    """
    Create a pdf
    """
    # Create a frame; platypus re-creates it on each new page
    text_frame = Frame(
        x1=3.00 * cm,  # From left
        y1=1.5 * cm,  # From bottom
        height=19.60 * cm,
        width=15.90 * cm,
        leftPadding=0 * cm,
        bottomPadding=0 * cm,
        rightPadding=0 * cm,
        topPadding=0 * cm,
        showBoundary=1,
        id='text_frame')

    # Create text
    L = [Paragraph("""What concepts does PLATYPUS deal with?""", styleSheet['Heading2']),
         Paragraph("""
         The central concepts in PLATYPUS are Flowable Objects, Frames, Flow
         Management, Styles and Style Sheets, Paragraphs and Tables. This is
         best explained in contrast to PDFgen, the layer underneath PLATYPUS.
         PDFgen is a graphics library, and has primitive commands to draw lines
         and strings. There is nothing in it to manage the flow of text down
         the page. PLATYPUS works at the conceptual level of a desktop publishing
         package; you can write programs which deal intelligently with graphic
         objects and fit them onto the page.
         """, styleSheet['BodyText']),
         Paragraph("""
         How is this document organized?
         """, styleSheet['Heading2']),
         Paragraph("""
         Since this is a test script, we'll just note how it is organized.
         The top of each page contains commentary. The bottom half contains
         example drawings and graphic elements to which the commentary will
         relate. Down below, you can see the outline of a text frame, and
         various bits and pieces within it. We'll explain how they work
         on the next page.
         """, styleSheet['BodyText']),
         ]

    # Building the story: repeat the content 20 times so it spans several pages
    story = L * 20
    story.append(KeepTogether([]))

    # Establish a document
    doc = BaseDocTemplate("Example_output.pdf", pagesize=letter)

    # Creating a page template
    frontpage = PageTemplate(id='FrontPage',
                             frames=[text_frame]
                             )

    # Adding the page template to the document
    doc.addPageTemplates(frontpage)

    # Building doc; the story overflows from one page's frame into the next
    doc.build(story)


# ----------------------------------------------------------------------
if __name__ == "__main__":
    create_pdf()  # Build the PDF
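If you would rather stay with the plain Canvas from the question, the "is the page full?" check can also be done by hand. The sketch below is a different technique from the Frame approach above: it only does the arithmetic of splitting lines into page-sized chunks. The function name, the margins, and the 12-point leading are assumptions for illustration, and it does not wrap long lines.

```python
from typing import List


def split_into_pages(lines: List[str], page_height: float, top_margin: float,
                     bottom_margin: float, leading: float) -> List[List[str]]:
    """Split lines into chunks that each fit between the top and bottom margins.

    All measurements are in points; leading is the vertical distance between
    two consecutive baselines (an assumed value, it depends on the font setup).
    """
    usable_height = page_height - top_margin - bottom_margin
    # Always place at least one line per page so the loop below terminates.
    lines_per_page = max(1, int(usable_height // leading))
    return [lines[i:i + lines_per_page] for i in range(0, len(lines), lines_per_page)]


# Letter paper is 792 points high; assume 1 inch (72 pt) margins and 12 pt leading.
pages = split_into_pages([f"line {n}" for n in range(120)], 792, 72, 72, 12)
print([len(page) for page in pages])
```

Each chunk could then be drawn with its own text object (canvas.beginText and canvas.drawText, as in the question), followed by canvas.showPage() before starting the next chunk.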
I want to put the logo and the barcode on the same level rather than one on top of the other. The logo should stay at the far left and the barcode at the far right of the Word file. Here is my code, thank you:
import uuid
import pandas as pd
import pyqrcode
from docx import Document
from docx.shared import Inches

qr = pyqrcode.create(str(uuid.uuid4()).replace('-', ''))
qr.png('somecode.png')
df = pd.DataFrame(pd.read_excel('Komplette-GastAccountFHZugangsdatenFertig.xlsx'))
Attributes = ['Name', 'Vorname ', 'Geschlecht', 'Adresse (in Marokko)',
              'Telefonnummer', 'E-Mailadresse', 'Studiengang', 'Semester']
document = Document()
document.add_heading('Informationen.', level=0)
document.add_picture('Informatik_Logo.png', height=Inches(1.0))
p = document.add_paragraph()
r = p.add_run()
p_format = p.paragraph_format
p_format.left_indent = Inches(4.5)
r.add_picture('somecode.png', width=Inches(1.0))
table = document.add_table(len(Attributes), 2, style='LightGrid-Accent1')
for i in range(len(Attributes)):
    row = table.rows[i]
    row.cells[0].text = Attributes[i]
    Infos = df[Attributes[i]]
    string = str(Infos[49])
    row.cells[1].text = string
document.save('sample.docx')
From what I see in the documentation, python-docx currently supports only inline pictures, not floating pictures, which means that you can only get the look you currently have. From the docs:
At the time of writing, python-docx only supports inline pictures. Floating pictures can be added. If you have an active use case, submit a feature request on the issue tracker. The Document.add_picture() method adds a specified picture to the end of the document in a paragraph of its own.
Based on that last sentence, I think what you're trying to do is currently impossible. A workaround might be to insert a table with one row and two columns, and insert an image in each cell.
I can help you with that. Just add one more line to your code:
r.add_picture('somecode.png',width=Inches(2.0))
That is all: just call add_picture on the same run one more time. If you want more pictures on the same level, add another call for each one. I tried it and it works for me.
Here is my test code:
from docx import Document
from docx.shared import Inches

dc = Document()
run = dc.add_paragraph().add_run()
run.add_picture("./picture/danilise.jpg", width=Inches(1.1))
run.add_picture("./picture/danilise.jpg", width=Inches(1.3))
run.add_picture("./picture/danilise.jpg", width=Inches(1.5))
dc.save("test1.docx")
I need to generate an examination template for an online school website,
and I need to know the coordinates of each answer box in order to crop them later.
Is it possible to generate a PDF and get the coordinates of each element inside it? (For example, by inserting a black square as an image in the PDF and getting its coordinates?)
I found many libraries for creating PDFs, such as pyPdf and pyPdf2, but I didn't find a way to get coordinates.
Thank you for your suggestions and advice.
You could use reportlab. It would allow you to access coordinates by specifying them yourself:
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.lib.pagesizes import letter
import io
buf = io.BytesIO()
doc = SimpleDocTemplate(buf, rightMargin=inch/2, leftMargin=inch/2, topMargin=inch/2, bottomMargin=inch/2, pagesize=letter)
styles = getSampleStyleSheet()
answers = []
answers.append(Paragraph('Data for Answer box', styles['Normal']))
doc.build(answers)
# write the PDF bytes in binary mode so the file is not corrupted
with open('answers.pdf', 'wb') as school_pdf:
    school_pdf.write(buf.getvalue())
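To illustrate "specifying the coordinates yourself", here is a small sketch that computes answer-box rectangles up front and converts them to pixel crop boxes for a scanned or rendered page. It uses no PDF library: the function names, the box sizes, and the DPI are all assumptions for illustration. PDF coordinates are in points (72 per inch) with the origin at the bottom-left, while image crops usually use pixels with the origin at the top-left, so the y-axis has to be flipped.

```python
from typing import List, Tuple

POINTS_PER_INCH = 72

Rect = Tuple[float, float, float, float]  # (x, y, width, height) in points, y = bottom edge


def answer_box_rects(count: int, x: float, top_y: float,
                     width: float, height: float, gap: float) -> List[Rect]:
    """Lay out `count` boxes stacked downwards, starting just below `top_y`."""
    rects = []
    for i in range(count):
        y = top_y - height - i * (height + gap)  # bottom edge of box i
        rects.append((x, y, width, height))
    return rects


def to_pixel_crop(rect: Rect, page_height: float, dpi: float) -> Tuple[int, int, int, int]:
    """Convert a PDF rectangle to (left, upper, right, lower) pixels, top-left origin."""
    x, y, w, h = rect
    scale = dpi / POINTS_PER_INCH
    left = round(x * scale)
    upper = round((page_height - (y + h)) * scale)  # flip the y-axis
    right = round((x + w) * scale)
    lower = round((page_height - y) * scale)
    return (left, upper, right, lower)


# Two boxes on a letter page (792 pt high), converted for a 144 DPI image.
for rect in answer_box_rects(2, x=72, top_y=720, width=144, height=36, gap=12):
    print(rect, to_pixel_crop(rect, page_height=792, dpi=144))
```

Each (x, y, width, height) tuple has the same argument order as reportlab's canvas.rect(), and the (left, upper, right, lower) tuple matches the box that image libraries such as Pillow expect for cropping, so the same list can drive both drawing the template and cutting out the answers later.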