I want to create a document with serial barcodes (10 per page), so I am using the code below to generate a barcode:
my_code = Code128(document)
my_code.save(document)
The result of this code is an SVG picture. I want to insert this SVG picture into a table in a docx file, and I am using this piece of code for that:
doc = Document('assets/test.docx')
tables = doc.tables
p = tables[0].rows[1].cells[0].add_paragraph()
r = p.add_run()
r.add_picture('BL22002222.svg', width=Inches(3), height=Inches(1))
doc.save('assets/test.docx')
but it throws this error:
docx.image.exceptions.UnrecognizedImageError
I want to generate printable pages as PDF or docx; I don't care which. The most important thing is that they contain the barcodes.
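One possible way around the error, offered as a sketch rather than a definitive fix: python-docx does not read SVG files, so the barcode can be rendered as a PNG with python-barcode's ImageWriter (which needs Pillow) and then placed in a table cell. The helper names, the one-column table, the picture size, and the serial numbers in the usage note are assumptions for illustration only.

```python
def paginate(codes, per_page=10):
    """Split a list of barcode values into pages of at most `per_page` items."""
    return [codes[i:i + per_page] for i in range(0, len(codes), per_page)]


def build_barcode_docx(codes, out_path, per_page=10):
    """Render each code as a PNG barcode and put one barcode in each table row."""
    # Third-party imports are kept local so paginate() can be used on its own.
    # Assumption: python-barcode, Pillow and python-docx are installed.
    from barcode import Code128
    from barcode.writer import ImageWriter
    from docx import Document
    from docx.shared import Inches

    doc = Document()
    pages = paginate(codes, per_page)
    for page_number, page in enumerate(pages):
        table = doc.add_table(rows=len(page), cols=1)
        for row, code in zip(table.rows, page):
            # save() appends the extension and returns the resulting file name.
            png_path = Code128(code, writer=ImageWriter()).save(code)
            run = row.cells[0].paragraphs[0].add_run()
            # Assumed size; adjust it so that ten rows fit on your page.
            run.add_picture(png_path, width=Inches(2.5), height=Inches(0.7))
        if page_number < len(pages) - 1:
            doc.add_page_break()
    doc.save(out_path)
```

For example, build_barcode_docx([f"BL{22002222 + i}" for i in range(20)], "barcodes.docx") would produce two pages of ten barcodes each, assuming the serial numbers simply count upwards.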
I'm having trouble scraping text from PDF files using Python.
My goal is to get the text from a PDF file (from chapter 1 to chapter 2, for example) and write it to a docx file (or txt file). However, the text I get is full of incorrect spacing.
Text example: "
Chapter 1
Aerial seeding can quickly cover large and physically inaccessible areas1 to improve soil quality and scavenge residual nitrogen in agriculture2, and for postfire reforestation3,4,5 and wildland restoration6,7. However, it suffers from low germination rates, due to the direct exposure of unburied
Chapter 2
"
Text output on docx file: "
Chapter 1
Aerial seed ing can quic kly cover large and phys ically inacce ssible a reas1 to improve soil quality and scavenge residu al nitrogen in agriculture2, and for postf ire refore station3,4,5 and wil dland restorati on6,7. However, it suffers from low germina tion rates, due to the direct expos ure of unbur ied
Chapter 2
"
Notice that words are incorrectly spaced.
My code is the following:
import PyPDF2
# Open the PDF file
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    # Initialize an empty string to store the text
    text = ""
    # Loop through each page in the PDF file
    for page in range(pdf_reader.getNumPages()):
        # Extract the text from the page
        page_text = pdf_reader.getPage(page).extractText()
        # Append the page text to the overall text
        text += page_text
        # Stop extracting text after the end of chapter 1
        if "CAPÍTULO II" in page_text:
            break
# Extract the text from Chapter 1 to Chapter 2
start = text.find("Chapter 1")
end = text.find("Chapter 2")
text = text[start:end]
print(text)
# Save the extracted text to a new file
with open('extracted_text.txt', 'w') as text_file:
    text_file.write(text)
The expected output is the first text exactly as it is.
How can I solve this case?
If I drag a PDF onto a shortcut (one I prepared earlier), it automatically generates the text: no mess, no fuss, just a right-click or a drag and drop. However, it needs pdftotext (Xpdf or Poppler, either will do) to do the automating. The shortcut target is:
C:\Windows\System32\cmd.exe /c C:\Apps\PDF\xpdf\xpdf-tools-win-4.04\bin32\pdftotext.exe -layout -nopgbrk -enc UTF-8
Once you have the layout text, you can slice and dice it with Python, with MS Notepad plus VBS, or with any richer editor such as WordPad.
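If you would rather drive the same tool from Python than from a shortcut, a minimal sketch could look like the following. It assumes pdftotext is on your PATH (otherwise pass the full path to the executable), and the function names are made up for this example; the options are the same ones used in the shortcut above.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, exe="pdftotext"):
    """Build the argument list with the same options as the shortcut."""
    return [exe, "-layout", "-nopgbrk", "-enc", "UTF-8", pdf_path, txt_path]


def pdf_to_layout_text(pdf_path, txt_path, exe="pdftotext"):
    """Run pdftotext, then read back the text file it wrote."""
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path, exe), check=True)
    with open(txt_path, encoding="utf-8") as handle:
        return handle.read()
```

Calling pdf_to_layout_text("example.pdf", "example.txt") returns the layout-preserved text as one string, ready for slicing.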
Using the tips about PyMuPDF, I managed to get it working. I used the code below:
import fitz
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
# Open the PDF file
doc = fitz.open('Regulamento-GalapagosDeepOceanFICFIM06022023VF.pdf')
# Create a new Word document
document = Document()
# Set the font style for the document
font_style = document.styles['Normal']
font_style.font.size = Pt(12)
text=""
text_aux=""
# Iterate over each page in the PDF file
for page in doc:
    # Extract the text from the page
    text = page.get_text()
    # Add the text to a new paragraph in the Word document
    #paragraph = document.add_paragraph(text)
    # Set the paragraph alignment to left
    #paragraph.alignment = WD_ALIGN_PARAGRAPH.LEFT
    text_aux = text_aux + text
# Keep only the text between the two chapter headings
start = text_aux.find("CHAPTER 1")
end = text_aux.find("CHAPTER 2")
text1 = text_aux[start:end]
paragraph = document.add_paragraph(text1)
paragraph.alignment = WD_ALIGN_PARAGRAPH.LEFT
print(text1)
document.save('example_test.docx')
Using this code I was able to get the text with the exact format and extract the chapters as wanted.
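One caveat about the slicing step: str.find returns -1 when a marker is missing, which makes the slice silently wrong instead of raising an error. A small helper (hypothetical name, not part of any library) makes that case explicit. This is only a sketch and assumes the first occurrence of each heading is the one you want.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text from start_marker up to, but not including, end_marker."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker {start_marker!r} not found")
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        # No end marker: keep everything up to the end of the text.
        end = len(text)
    return text[start:end]


sample = "Preface\nCHAPTER 1\nAerial seeding...\nCHAPTER 2\nMore text"
print(slice_between(sample, "CHAPTER 1", "CHAPTER 2"))
```

If the PDF has a table of contents that also lists the headings, the first match will be the contents entry, so the markers may need to be more specific.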
I'm trying to mark only a few words in a pdf and with the results I want to make a new pdf using only pytesseract.
Here is the code:
images = convert_from_path(name,poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i in images:
    img = cv.cvtColor(np.array(i),cv.COLOR_RGB2BGR)
    d = pytesseract.image_to_data(img,output_type=Output.DICT,lang='eng+equ',config="--psm 6")
    boxes = len(d['level'])
    for i in range(boxes):
        for e in functionEvent: #functionEvent is a list of strings
            if e in d['text'][i]:
                (x,y,w,h) = (d['left'][i],d['top'][i],d['width'][i],d['height'][i])
                cv.rectangle(img,(x,y),(x+w,y+h),(0,255,0),2)
    pdf = pytesseract.image_to_pdf_or_hocr(img,extension='pdf')
    with open('results.pdf','w+b') as f:
        f.write(pdf)
What I have tried:
with open('results.pdf','a+b') as f:
    f.write(pdf)
If you know how I can fix this, please let me know.
I also don't mind if you recommend another module or tell me how I should write the code.
Thanks in advance!
Try using PyPDF2 to join your PDFs together.
First, process each image with Tesseract OCR and store the resulting PDF bytes in a list, like this:
pdf_pages = []
for filename in tqdm(os.listdir(in_dir)):
    img = Image.open(os.path.join(in_dir,filename))
    pdf = pytesseract.image_to_pdf_or_hocr(img, lang='slk', extension='pdf')
    pdf_pages.append(pdf)
Then iterate through the processed pages, read the bytes, and add each page using PdfFileReader, like this (do not forget to import io):
pdf_writer = PdfFileWriter()
for page in pdf_pages:
    pdf = PdfFileReader(io.BytesIO(page))
    pdf_writer.addPage(pdf.getPage(0))
Finally, create the output file and write the data to it:
file = open(out_dir, "w+b")
pdf_writer.write(file)
file.close()
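Note that PdfFileReader, PdfFileWriter, addPage and getPage are the names used by older PyPDF2 releases; they were removed in PyPDF2 3.0. With the successor package pypdf, the same merge step might be sketched as below. The function name is hypothetical, and the code assumes that every list item is the bytes of a complete PDF, as returned by image_to_pdf_or_hocr.

```python
import io


def merge_pdf_pages(pdf_pages, out_path):
    """Combine PDFs held as bytes objects into one file, keeping their order."""
    if not pdf_pages:
        raise ValueError("no pages to merge")
    # Assumption: pypdf is installed. It is imported here so that the
    # empty-input check above works even without it.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for data in pdf_pages:
        reader = PdfReader(io.BytesIO(data))
        # Copy every page, in case an input holds more than one.
        for page in reader.pages:
            writer.add_page(page)
    with open(out_path, "wb") as handle:
        writer.write(handle)
```

This also shows why opening results.pdf in append mode does not help: a PDF is a structured file, so the pages have to be merged by a PDF library rather than concatenated as raw bytes.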
I am trying to create a table of contents using Python's FPDF package (https://pyfpdf.readthedocs.io/en/latest/)
I want the table of contents to be at the top of the PDF document, and clickable so that the reader is instantly taken to a section.
The issue I have is that for the links to work, the table of contents needs to be at the end of the document.
I have tried using python's PyPDF2 to split the created PDF document and move the table of contents to the beginning of the document. The problem is that in the resulting PDF the links no longer work.
I am using python (v3.7) with Spyder.
Here is my code for creating the table of contents:
from fpdf import FPDF
# This is my section with link to table of contents
report.add_page(orientation='p')
report.set_font('Helvetica', size=22, style ='B')
report.cell(200, 10, 'section', 0, 2, 'l')
DD = report.add_link()
report.set_link(DD)
# Table of contents including that one section
report.write(5, 'Section',DD)
Here is my code for rearranging the PDF document (TOC = table of contents):
from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger
PDF_report = PdfFileReader('file.pdf')
pdf_bulk_writer = PdfFileWriter()
output_filename_bulk = "bulk.pdf"
pdf_TOC_writer = PdfFileWriter()
output_filename_TOC = "TOC.pdf"
for page in range(PDF_report.getNumPages()):
    current_page = PDF_report.getPage(page)
    if page == PDF_report.getNumPages()-1:
        pdf_TOC_writer.addPage(current_page)
    if page <= PDF_report.getNumPages()-2:
        pdf_bulk_writer.addPage(current_page)
# Write the data to disk
with open(output_filename_TOC, "wb") as out:
    pdf_TOC_writer.write(out)
    print("created", output_filename_TOC)
# Write the data to disk
with open(output_filename_bulk, "wb") as out:
    pdf_bulk_writer.write(out)
    print("created", output_filename_bulk)
pdfs = ['TOC.pdf', 'bulk.pdf']
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(pdf)
merger.write("result.pdf")
merger.close()
Is there any way to extract images as streams from a PDF document (using the PyPDF2 library)?
Also, is it possible to replace some images with others (generated with PIL, for example, or loaded from a file)?
I'm able to get an EncodedStreamObject from the PDF object tree and get the encoded stream (by calling the getData() method), but it looks like it is just raw content without any image headers or other meta information.
>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')
I've looked across PyPDF2, ReportLab and PDFMiner solutions quite a bit, but haven't found anything like what I'm looking for.
Any code samples and links will be very helpful.
import fitz
doc = fitz.open(filePath)
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5: # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else: # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
Image metadata is not stored within the encoded images of a PDF. If metadata is stored at all, it is stored in PDF itself, but stripped from the underlying image. The metadata you see in your example is likely all that you'll be able to get. It's possible that PDF encoders may store image metadata elsewhere in the PDF, but I haven't seen this. (Note this metadata question was also asked for Java.)
It's definitely possible to extract the stream, however; as you mentioned, you use the getData operation.
As for replacing it, you'll need to create a new image object within the PDF, add it to the end, and update the indirect object pointers accordingly. It will be difficult to do this with PyPDF2.
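To illustrate why the stream from getData() looks like raw content: for a /FlateDecode image it is just the decompressed pixel samples, while the width, height and bit depth live in the dictionary. In the question's example, 100 x 30 pixels x 3 bytes = 9000 bytes, which matches len(s) if the ICC-based colour space has three components (an assumption). Under that assumption, and assuming no PNG predictor was applied to the stream, the samples can be wrapped in a PNG container using only the standard library. This is a sketch with hypothetical helper names, not a general image extractor.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(kind, payload):
    """Build one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def raw_rgb_to_png(pixels, width, height):
    """Wrap raw 8-bit RGB samples (row by row, no padding) in a PNG file."""
    stride = width * 3
    if len(pixels) != stride * height:
        raise ValueError("pixel data does not match width * height * 3")
    # Every PNG scanline starts with a filter-type byte; 0 means "no filter".
    rows = b"".join(
        b"\x00" + pixels[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), then defaults.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )


# Stand-in for the 9000 bytes returned by getData() in the question.
png_bytes = raw_rgb_to_png(b"\xcc" * 9000, 100, 30)
```

Writing png_bytes to a file opened in "wb" mode gives an image that ordinary viewers can open. Other colour spaces, bit depths or filters (for example /DCTDecode, which is already JPEG data) need different handling.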
Extracting Images from PDF
This code fetches any images in a scanned, machine-generated, or normal PDF. It reports how many images occur on each page and saves each image with its original resolution and extension.
Install PyMuPDF first:
pip install PyMuPDF
import fitz
import io
from PIL import Image
# file path you want to extract images from
file = r"File_path"
# open the file
pdf_file = fitz.open(file)
# iterate over PDF pages
for page_index in range(pdf_file.page_count):
    # get the page itself
    page = pdf_file[page_index]
    image_li = page.get_images()
    # print the number of images found on this page
    # page index starts from 0, hence adding 1 for display
    if image_li:
        print(f"[+] Found a total of {len(image_li)} images in page {page_index+1}")
    else:
        print(f"[!] No images found on page {page_index+1}")
    for image_index, img in enumerate(image_li, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")
How can I reduce the size of the image generated by pybarcode's ImageWriter, and how can I append multiple images to a docx file with proper alignment?
I read about the dpi option for ImageWriter, but I don't understand how to use it.
import barcode
from barcode.writer import ImageWriter
from docx import *
if __name__ == '__main__':
    # Default set of relationships - these are the minimum components of a document
    ean = barcode.get_barcode('ean', '123456789102', writer=ImageWriter())
    ean.default_writer_options['module_height'] = 3.0
    ean.default_writer_options['module_width'] = 0.1
    filename = ean.save('bar_image')
    relationships = relationshiplist()
    # Make a new document tree - this is the main part of a Word document
    document = newdocument()
    # This xpath location is where most interesting content lives
    docbody = document.xpath('/w:document/w:body', namespaces=nsprefixes)[0]
    # Add an image
    relationships,picpara = picture(relationships, filename,'This is a test description')
    docbody.append(picpara)
    # Create our properties, contenttypes, and other support files
    coreprops = coreproperties(title='Python docx demo',subject='A practical example of making docx from Python',creator='Mike MacCana',keywords=['python','Office Open XML','Word'])
    appprops = appproperties()
    contenttypes = contenttypes()
    websettings = websettings()
    wordrelationships = wordrelationships(relationships)
    # Save our document
    savedocx(document,coreprops,appprops,contenttypes,websettings,wordrelationships,'sample_barcode.docx')
Generally, barcode.writer does not offer a direct parameter for the pixel size of the generated image, so you can ask PIL for help. Proper alignment is hard to get exactly right in code, but you can try using tables to put the images in the right place.
In PIL, you can resize the PNG image to (480,320) with:
from PIL import Image
im = Image.open("barcode.png")
im.resize((480,320)).save("barcode_resized.png")
For the docx file, look at a table example; you will need to decide what proper alignment means for your layout and then write the code accordingly.
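One thing to watch with a fixed size such as (480,320): if the original image has a different aspect ratio, the bars get stretched. A small sketch that keeps the ratio and only fixes the width is shown below. The function names are hypothetical, and Pillow is assumed to be installed for the second function.

```python
def fit_width(size, target_width):
    """Return (width, height) scaled to target_width, keeping the aspect ratio."""
    width, height = size
    if width <= 0 or height <= 0 or target_width <= 0:
        raise ValueError("sizes must be positive")
    return target_width, max(1, round(height * target_width / width))


def resize_barcode(src_path, dst_path, target_width):
    """Resize the image at src_path to target_width and save it to dst_path."""
    # Assumption: Pillow is installed. It is imported here so that
    # fit_width() can be used without it.
    from PIL import Image

    with Image.open(src_path) as im:
        im.resize(fit_width(im.size, target_width)).save(dst_path)
```

For example, resize_barcode("barcode.png", "barcode_resized.png", 480) gives a 480-pixel-wide image whose height follows from the original proportions. Shrinking a barcode too far can make it unreadable, so it is worth test-scanning the result.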