How can I reduce the size of the image generated by pybarcode's ImageWriter, and how can I append multiple images to a docx file with proper alignment?
I read about the dpi option for ImageWriter, but I don't understand how to use it.
import barcode
from barcode.writer import ImageWriter
from docx import *
if __name__ == '__main__':
    # Generate the barcode image
    ean = barcode.get_barcode('ean', '123456789102', writer=ImageWriter())
    ean.default_writer_options['module_height'] = 3.0
    ean.default_writer_options['module_width'] = 0.1
    filename = ean.save('bar_image')
    # Default set of relationships - these are the minimum components of a document
    relationships = relationshiplist()
    # Make a new document tree - this is the main part of a Word document
    document = newdocument()
    # This xpath location is where most interesting content lives
    docbody = document.xpath('/w:document/w:body', namespaces=nsprefixes)[0]
    # Add an image
    relationships, picpara = picture(relationships, filename, 'This is a test description')
    docbody.append(picpara)
    # Create our properties, contenttypes, and other support files
    coreprops = coreproperties(title='Python docx demo',subject='A practical example of making docx from Python',creator='Mike MacCana',keywords=['python','Office Open XML','Word'])
    appprops = appproperties()
    contenttypes = contenttypes()
    websettings = websettings()
    wordrelationships = wordrelationships(relationships)
    # Save our document
    savedocx(document, coreprops, appprops, contenttypes, websettings, wordrelationships, 'sample_barcode.docx')
Generally, barcode.writer does not offer a parameter that directly sets the size of the generated image, so you can use PIL to resize it afterwards. Precise alignment is hard to control from code, but you can try placing the images in a table to position them.
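If the goal is simply a smaller PNG, it may be enough to pass smaller writer options when saving rather than resizing afterwards. The sketch below assumes the python-barcode package, whose `save()` accepts an `options` dict and whose `ImageWriter` understands `module_width`, `module_height` (both in millimetres) and `dpi`. The helper names and the specific values are illustrative assumptions, so check them against the version you have installed.

```python
def mm_to_px(mm, dpi):
    """Convert a length in millimetres to pixels at the given resolution."""
    return mm * dpi / 25.4  # 25.4 mm per inch


def save_small_barcode(code, basename, dpi=150):
    """Write an EAN barcode image with reduced module size and dpi."""
    # Imported lazily so mm_to_px() stays usable without python-barcode.
    import barcode
    from barcode.writer import ImageWriter

    ean = barcode.get_barcode('ean', code, writer=ImageWriter())
    # Assumed option names: module sizes are in mm, dpi sets pixels per inch.
    options = {'module_width': 0.2, 'module_height': 8.0, 'dpi': dpi}
    return ean.save(basename, options=options)
```

Because pixel size scales linearly with dpi, halving dpi should halve both pixel dimensions: a symbol 30 mm wide is roughly 354 px at 300 dpi and roughly 177 px at 150 dpi.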
In PIL, you can resize the PNG image to (480, 320) with:
from PIL import Image
im = Image.open("barcode.png")
im.resize((480,320)).save("barcode_resized.png")
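Resizing to a fixed (480, 320) stretches the barcode if the source image has a different aspect ratio. A minimal sketch that scales to a target width and keeps the proportions, assuming Pillow is installed; the helper names `scaled_size` and `resize_to_width` are made up for this example.

```python
def scaled_size(width, height, target_width):
    """Return (target_width, new_height) preserving the aspect ratio."""
    new_height = max(1, round(height * target_width / width))
    return target_width, new_height


def resize_to_width(src_path, dst_path, target_width):
    """Resize the image at src_path to target_width and save it to dst_path."""
    from PIL import Image  # imported lazily; requires Pillow

    with Image.open(src_path) as im:
        new_size = scaled_size(im.width, im.height, target_width)
        im.resize(new_size).save(dst_path)
```

For example, `resize_to_width("barcode.png", "barcode_resized.png", 480)` would produce a 480-pixel-wide copy whose height follows from the original proportions.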
For the docx file, there is a table example here; decide what alignment you need and then write the code accordingly.
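As a rough sketch of the table approach, the following uses the python-docx package (a different library from the older `docx` module in the question) to put one image per cell and centre it. The function names, the three-column layout and the 1.5-inch width are illustrative assumptions.

```python
import math


def rows_needed(n_images, cols):
    """Number of table rows required to hold n_images in a grid."""
    return math.ceil(n_images / cols)


def images_to_table(image_paths, out_path, cols=3, width_in=1.5):
    """Place each image in its own centred table cell and save the docx."""
    # Imported lazily; requires the python-docx package.
    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH
    from docx.shared import Inches

    document = Document()
    table = document.add_table(rows=rows_needed(len(image_paths), cols), cols=cols)
    for index, path in enumerate(image_paths):
        # Fill the grid row by row
        cell = table.cell(index // cols, index % cols)
        paragraph = cell.paragraphs[0]
        paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
        paragraph.add_run().add_picture(path, width=Inches(width_in))
    document.save(out_path)
```

Giving every picture the same width keeps the columns visually even; the height follows from each image's aspect ratio.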
Related
I want to append the contents of a Tkinter widget to the end of an existing pdf.
First, I capture the widget to a PIL image. Then, when using PyPDF2, it seems to be necessary to create an intermediate temporary file from the image, which feels unnecessary. Instead I would like to append the image directly, or at least convert the image to something that can be appended without having to be written to a file.
In the code snippet below I save the image to a temporary pdf, then open the pdf and append. This works, but is not a very elegant solution.
import tkinter as tk
from PIL import ImageGrab
import os
import PyPDF2
def process_pdf(widget, pdf_filepath):
    """Append Tkinter widget to pdf"""
    # capture the widget
    img = capture_widget(widget)
    # create pdf merger object
    merger = PyPDF2.PdfMerger()
    # creating a pdf file object of original pdf and add to output
    pdf_org = open(pdf_filepath, 'rb')
    merger.append(pdf_org)
    # NOTE I want to get rid of this step
    # create temporary file, read it, append it to output, delete it.
    temp_filename = pdf_filepath[:-4] + "_temp.pdf"
    img.save(temp_filename)
    pdf_temp = open(temp_filename, 'rb')
    merger.append(pdf_temp)
    pdf_temp.close()
    # write
    outputfile = open(pdf_filepath, "wb")
    merger.write(outputfile)
    # clean up
    pdf_org.close()
    outputfile.close()
    os.remove(temp_filename)

def capture_widget(widget):
    """Take screenshot of the passed widget"""
    x0 = widget.winfo_rootx()
    y0 = widget.winfo_rooty()
    x1 = x0 + widget.winfo_width()
    y1 = y0 + widget.winfo_height()
    img = ImageGrab.grab((x0, y0, x1, y1))
    return img
Does someone have a more elegant solution not requiring an intermediate file while retaining the flexibility PyPDF2 provides?
Thanks.
So I was able to find a solution myself. The trick is writing the PIL image to a byte array (see this question), then converting that to a pdf using img2pdf. For img2pdf, it is required to use the format = "jpeg" argument during saving to byte array.
Subsequently, the result of img2pdf can be written to another io.BytesIO() stream. Since it implements a .read() method, PyPDF2.PdfMerger() can read it.
Below is the code, hope this helps someone.
import io
import PyPDF2
import img2pdf
def process_pdf(widget, pdf_filepath):
    """Append Tkinter widget to pdf"""
    # create pdf merger object
    merger = PyPDF2.PdfMerger()
    # creating a pdf file object of original pdf and add to output
    pdf_org = open(pdf_filepath, 'rb')
    merger.append(pdf_org)
    pdf_org.close()
    # capture the widget (capture_widget is the function from the question)
    img = capture_widget(widget)
    # write the image to an in-memory JPEG
    img_byte_arr = io.BytesIO()
    img.save(img_byte_arr, format = "jpeg", quality=100)
    img_byte_arr = img_byte_arr.getvalue()
    # create an in-memory pdf and lay it out on an A4 page (210 x 297 mm) using img2pdf
    pdf_byte_arr = io.BytesIO()
    layout_fun = img2pdf.get_layout_fun((img2pdf.mm_to_pt(210),img2pdf.mm_to_pt(297)))
    pdf_byte_arr.write(img2pdf.convert(img_byte_arr, layout_fun=layout_fun))
    merger.append(pdf_byte_arr)
    # write
    outputfile = open(pdf_filepath[:-4] + "_appended.pdf", "wb")
    merger.write(outputfile)
    outputfile.close()
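A possible variation, not part of the answer above: Pillow can itself write a single-page PDF into an in-memory buffer, which may remove the need for img2pdf when no page layout is required. This is a sketch under that assumption; `image_to_pdf_buffer` and `append_image` are hypothetical helper names.

```python
import io


def image_to_pdf_buffer(img):
    """Render a PIL image as a one-page PDF held in memory."""
    buffer = io.BytesIO()
    # Older Pillow versions reject an alpha channel when writing PDF,
    # so convert to RGB first.
    img.convert("RGB").save(buffer, format="PDF")
    buffer.seek(0)  # rewind so readers start at the beginning
    return buffer


def append_image(pdf_in, pdf_out, img):
    """Append img as a new last page of pdf_in and write the result to pdf_out."""
    import PyPDF2  # imported lazily; requires PyPDF2

    merger = PyPDF2.PdfMerger()
    with open(pdf_in, "rb") as original:
        merger.append(original)
        merger.append(image_to_pdf_buffer(img))
        with open(pdf_out, "wb") as output:
            merger.write(output)
    merger.close()
```

The trade-off is that the page size then follows the image's pixel dimensions, whereas img2pdf's layout function lets you place the image on an A4 page.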
I want to add an image as a background/watermark to a new Word document using Python. I tried python-docx but couldn't find anything useful:
from docx import Document
from docx.shared import Inches
document = Document()
document.add_picture(r'D:\Python\Projects\raw_imgs\3b057d6199d95c4339ef532001cb20cd.jpg', width=Inches(6))
document.save('demo.docx')
The above code just inserts the image but I want to add it as the background image.
Aspose.Words Cloud SDK for Python can insert an image as a background into a DOC/DOCX. Though it is a paid product, its free trial allows 150 free API calls monthly.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
localFile = 'C:/Temp/Sections.docx'
imageFile= 'C:/Temp/Tulips.jpg'
outputFile= 'C:/Temp/Watermark.docx'
request = asposewordscloud.models.requests.InsertWatermarkImageOnlineRequest(document=open(localFile, 'rb'), image_file=open(imageFile, 'rb'))
result = words_api.insert_watermark_image_online(request)
copyfile(result.document, outputFile)
I would like to extract text from scanned PDFs.
My "test" code is as follows:
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image
converted_scan = convert_from_path('test.pdf', 500)
for i in converted_scan:
    i.save('scan_image.png', 'png')
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
    outfile.write(text.replace('\n\n', '\n'))
I would like to know if there is a way to extract the content of the image directly from the object converted_scan, without saving the scan as a new "physical" image file on the disk?
Basically, I would like to skip this part:
for i in converted_scan:
    i.save('scan_image.png', 'png')
I have a few thousands scans to extract text from. Although all the generated new image files are not particularly heavy, it's not negligible and I find it a bit overkill.
EDIT
Here's a slightly different, more compact approach than Colonder's answer, based on this post. For .pdf files with many pages, it might be worth adding a progress bar to each loop using e.g. the tqdm module.
from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io
infile = 'my_file.pdf'
tools = pyocr.get_available_tools()
tool = tools[0]
req_image = []
txt = ''
# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
    image_png = scan.convert('png')
    for i in image_png.sequence:
        img_page = w_img(image = i)
        req_image.append(img_page.make_blob('png'))
for i in req_image:
    content = tool.image_to_string(
        p_img.open(io.BytesIO(i)),
        lang = tool.get_available_languages()[0],
        builder = pyocr.builders.TextBuilder()
    )
    txt += content
# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
    full_txt = regex.sub(r'\n+', '\n', txt)
    outfile.write(full_txt)
UPDATE MAY 2021
I realized that although pdf2image simply calls a subprocess, you don't have to save the images in order to OCR them afterwards. You can simply do the following (pytesseract works as the OCR library as well):
import pyocr.builders
from pdf2image import convert_from_path
# tool and lang are set up as in the full script further down
for img in convert_from_path("some_pdf.pdf", 300):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
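The same loop written with pytesseract might look like the sketch below. It assumes pdf2image (with poppler) and pytesseract (with the tesseract binary) are installed; `collapse_blank_lines` and `ocr_pdf` are names invented here, and the cleanup step is a slightly more general version of the `replace('\n\n', '\n')` call from the question.

```python
import re


def collapse_blank_lines(text):
    """Replace runs of consecutive newlines with a single newline."""
    return re.sub(r"\n+", "\n", text)


def ocr_pdf(pdf_path, dpi=300):
    """OCR every page of a scanned PDF without writing image files to disk."""
    # Imported lazily; both packages are third-party.
    from pdf2image import convert_from_path
    from pytesseract import image_to_string

    pages = convert_from_path(pdf_path, dpi)  # list of PIL images
    text = "".join(image_to_string(page) for page in pages)
    return collapse_blank_lines(text)
```

Since `convert_from_path` returns PIL images and `image_to_string` accepts them directly, no intermediate PNG is needed.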
EDIT: you can also try the pdftotext library.
pdf2image is a simple wrapper around pdftoppm and pdftocairo. Internally it does nothing more than call subprocess. This script should do what you want, but you need the wand library as well as pyocr (I think this is a matter of preference, so feel free to use any library you want for text extraction).
from PIL import Image as Pimage, ImageDraw
from wand.image import Image as Wimage
import sys
import numpy as np
from io import BytesIO
import pyocr
import pyocr.builders
def _convert_pdf2jpg(in_file_path: str, resolution: int=300) -> Pimage:
    """
    Convert PDF file to JPG
    :param in_file_path: path of pdf file to convert
    :param resolution: resolution with which to read the PDF file
    :return: PIL Image
    """
    with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
        for page in all_pages.sequence:
            with Wimage(page) as single_page_image:
                # transform wand image to bytes in order to transform it into PIL image
                yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
for img in _convert_pdf2jpg("some_pdf.pdf"):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
I have a function that gets a page from a PDF file via PyPDF2 and should convert the first page to a PNG (or JPG) with Pillow (PIL fork):
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
from PIL import Image
import io
# Open PDF Source #
app_path = os.path.dirname(__file__)
src_pdf= PdfFileReader(open(os.path.join(app_path, "../../../uploads/%s" % filename), "rb"))
# Get the first page of the PDF #
dst_pdf = PdfFileWriter()
dst_pdf.addPage(src_pdf.getPage(0))
# Create BytesIO #
pdf_bytes = io.BytesIO()
dst_pdf.write(pdf_bytes)
pdf_bytes.seek(0)
file_name = "../../../uploads/%s_p%s.png" % (name, pagenum)
img = Image.open(pdf_bytes)
img.save(file_name, 'PNG')
pdf_bytes.flush()
That results in an error:
OSError: cannot identify image file <_io.BytesIO object at 0x0000023440F3A8E0>
I found some threads with a similar issue, (PIL open() method not working with BytesIO) but I cannot see where I am wrong here, as I have pdf_bytes.seek(0) already added.
Any hints appreciated
Per the documentation:
write(stream) Writes the collection of pages added to this object out
as a PDF file.
Parameters: stream – An object to write the file to. The object must
support the write method and the tell method, similar to a file
object.
So the object pdf_bytes contains a PDF file, not an image file.
The reason code like the above sometimes works is that some PDF files contain nothing but a single JPEG as their content. If yours is an ordinary PDF file, you can't just read its bytes and parse them as an image.
For a more robust implementation, see: https://stackoverflow.com/a/34116472/334999
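One way to see the problem is to look at the first bytes of the buffer: a PDF starts with `%PDF-`, while PNG and JPEG files start with their own signatures, which is roughly what `Image.open` checks for. A small standard-library sketch (the function name `sniff_format` is made up for illustration):

```python
import io

# Leading "magic" bytes of a few common file types
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_format(stream):
    """Guess the file type of a binary stream from its leading bytes."""
    position = stream.tell()
    header = stream.read(8)
    stream.seek(position)  # leave the stream where we found it
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"
```

Applied to `pdf_bytes` from the question, this should report `"pdf"`, which is why Pillow cannot identify it as an image.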
import glob, sys, fitz
# To get better resolution
zoom_x = 2.0 # horizontal zoom
zoom_y = 2.0 # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y) # zoom factor 2 in each dimension
filename = "/xyz/abcd/1234.pdf" # name of pdf file you want to render
doc = fitz.open(filename)
for page in doc:
    pix = page.get_pixmap(matrix=mat) # render page to an image
    pix.save("/xyz/abcd/1234-page%d.png" % page.number) # store each page as its own PNG (page.number is 0-based)
Credit: Convert PDF to Image in Python Using PyMuPDF
https://towardsdatascience.com/convert-pdf-to-image-in-python-using-pymupdf-9cc8f602525b
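The zoom factors relate to output resolution: PDF page coordinates are in points (72 per inch), so a zoom of 2.0 corresponds to roughly 144 dpi. A small sketch of that arithmetic, with hypothetical helper names:

```python
PDF_POINTS_PER_INCH = 72  # PDF user space unit: 1 point = 1/72 inch


def zoom_for_dpi(dpi):
    """Zoom factor to pass to fitz.Matrix for the requested dpi."""
    return dpi / PDF_POINTS_PER_INCH


def rendered_size(page_width_pt, page_height_pt, zoom):
    """Approximate pixel dimensions of a page rendered at the given zoom."""
    return round(page_width_pt * zoom), round(page_height_pt * zoom)
```

For an A4 page of about 595 x 842 points, a zoom of 2.0 gives an image of about 1190 x 1684 pixels; for 300 dpi you would use a zoom of about 4.17.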
Is there any way to extract images as stream from pdf document (using PyPDF2 library)?
Also is it possible to replace some images to another (generated with PIL for example or loaded from file)?
I'm able to get an EncodedStreamObject from the PDF object tree and read the encoded stream (by calling the getData() method), but it looks like just raw content without any image headers or other meta information.
>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')
I've looked across PyPDF2, ReportLab and PDFMiner solutions quite a bit, but haven't found anything like what I'm looking for.
Any code samples and links will be very helpful.
import fitz
doc = fitz.open(filePath)
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5: # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else: # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
Image metadata is not stored within the encoded images of a PDF. If metadata is stored at all, it is stored in PDF itself, but stripped from the underlying image. The metadata you see in your example is likely all that you'll be able to get. It's possible that PDF encoders may store image metadata elsewhere in the PDF, but I haven't seen this. (Note this metadata question was also asked for Java.)
It is definitely possible to extract the stream, however: as you mentioned, you use the getData operation.
As for replacing it, you'll need to create a new image object within the PDF, add it to the end, and update the indirect object pointers accordingly. This will be difficult to do with PyPDF2.
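For a `/FlateDecode` image such as the one in the question, the bytes returned by `getData()` appear to be the raw decompressed samples (9000 bytes for 100 x 30 pixels, i.e. 3 bytes per pixel), so a PIL image can be rebuilt from them using the `/Width` and `/Height` entries. The sketch below assumes 8 bits per component and guesses the mode from the byte count; the helper names are hypothetical.

```python
def guess_mode(width, height, n_bytes):
    """Infer a PIL mode from the number of 8-bit samples per pixel."""
    channels, remainder = divmod(n_bytes, width * height)
    modes = {1: "L", 3: "RGB", 4: "CMYK"}
    if remainder or channels not in modes:
        raise ValueError("unexpected sample count for an 8-bit image")
    return modes[channels]


def stream_to_image(xobject):
    """Build a PIL image from a decoded PyPDF2 image XObject."""
    from PIL import Image  # imported lazily; requires Pillow

    width, height = xobject["/Width"], xobject["/Height"]
    data = xobject.getData()  # decoded sample bytes, as in the question
    mode = guess_mode(width, height, len(data))
    return Image.frombytes(mode, (width, height), data)
```

With the object from the question, `stream_to_image(reader.resolvedObjects[0][9])` would be expected to yield a 100 x 30 RGB image that can then be saved as PNG. Colour management via the ICC profile is ignored here.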
Extracting Images from PDF
This code fetches the images in a scanned, machine-generated, or ordinary PDF, reports how many images occur on each page, and saves each image with its original resolution and extension.
pip install PyMuPDF
import fitz
import io
from PIL import Image
#file path you want to extract images from
file = r"File_path"
#open the file
pdf_file = fitz.open(file)
#iterate over PDF pages
for page_index in range(pdf_file.page_count):
    #get the page itself
    page = pdf_file[page_index]
    image_li = page.get_images()
    #printing number of images found in this page
    #page index starts from 0 hence adding 1 to its content
    if image_li:
        print(f"[+] Found a total of {len(image_li)} images in page {page_index+1}")
    else:
        print(f"[!] No images found on page {page_index+1}")
    for image_index, img in enumerate(image_li, start=1):
        #get the XREF of the image
        xref = img[0]
        #extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        #get the image extension
        image_ext = base_image["ext"]
        #load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        #save it to local disk (passing the path lets PIL open and close the file)
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")