Extract images from PDF using python PyPDF2

Extract images from PDF using python PyPDF2 - python

Is there any way to extract images as stream from pdf document (using PyPDF2 library)?
Also is it possible to replace some images to another (generated with PIL for example or loaded from file)?
I'm able to get EncodedStreamObject from pdf objects tree and get encoded stream (by calling getData() method), but looks like it just raw content w/o any image headers and other meta information.
>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')
I've looked across PyPDF2, ReportLab and PDFMiner solutions quite a bit, but haven't found anything like what I'm looking for.
Any code samples and links will be very helpful.

import fitz
doc = fitz.open(filePath)
for i in range(len(doc)):
for img in doc.getPageImageList(i):
xref = img[0]
pix = fitz.Pixmap(doc, xref)
if pix.n < 5: # this is GRAY or RGB
pix.writePNG("p%s-%s.png" % (i, xref))
else: # CMYK: convert to RGB first
pix1 = fitz.Pixmap(fitz.csRGB, pix)
pix1.writePNG("p%s-%s.png" % (i, xref))
pix1 = None
pix = None

Image metadata is not stored within the encoded images of a PDF. If metadata is stored at all, it is stored in PDF itself, but stripped from the underlying image. The metadata you see in your example is likely all that you'll be able to get. It's possible that PDF encoders may store image metadata elsewhere in the PDF, but I haven't seen this. (Note this metadata question was also asked for Java.)
It's definitely possible to extract the stream however, as you mentioned, you use the getData operation.
As for replacing it, you'll need to create a new image object with the PDF, add it to the end, and update the indirect Object pointers accordingly. It will be difficult to do this with PyPdf2.

Extracting Images from PDF
This code helps to fetch any images in scanned or machine generated
pdf or normal pdf
determines its occurrence example how many images in each page
Fetches images with same resolution and extension
pip install PyMuPDF
import fitz
import io
from PIL import Image
#file path you want to extract images from
file = r"File_path"
#open the file
pdf_file = fitz.open(file)
#iterate over PDF pages
for page_index in range(pdf_file.page_count):
#get the page itself
page = pdf_file[page_index]
image_li = page.get_images()
#printing number of images found in this page
#page index starts from 0 hence adding 1 to its content
if image_li:
print(f"[+] Found a total of {len(image_li)} images in page {page_index+1}")
else:
print(f"[!] No images found on page {page_index+1}")
for image_index, img in enumerate(page.get_images(), start=1):
#get the XREF of the image
xref = img[0]
#extract the image bytes
base_image = pdf_file.extract_image(xref)
image_bytes = base_image["image"]
#get the image extension
image_ext = base_image["ext"]
#load it to PIL
image = Image.open(io.BytesIO(image_bytes))
#save it to local disk
image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))
`

Related

Extract all the pages of all the PDFs present in folder in jpg format and store the images in separate folder

I want to get the pdf pages as jpg , (suppose pdf contains 3 pages then output should be 3 images with JPG extension)
I tried 2-3 different ways but not getting the results!
Below is the script I wrote but it only gives the picture of heading present on first page , and also the image created is not stored at the specific folder
`import os
from PyPDF2 import PdfReader
from wand.image import Image as WImage
pdf_file = r"C:\Users\saura\Aidetic\image_processing_data\sample_file.pdf"
def pdf_to_img(pdf_file):
# Open the PDF file
# pdf = open("pdf_file", "rb")
pdf = PdfReader(pdf_file)
# Create a folder to store the images
folder_name = pdf_file[:-4]
if not os.path.exists(folder_name):
os.makedirs(folder_name)
page = pdf.pages[0]
count = 0
# Iterate through each page of the PDF
# for i in pdf.pages:
# Get the current page
for image in page.images:
# page = pdf.pages[i]
# Convert the page to an image
with open(str(count) + image.name, "wb") as fp:
fp.write(image.data)
count += 1
# Save the image to the folder
# img.save(filename=os.path.join(folder_name, str(i) + ".jpg"))
# Get the list of PDF files in the current directory
pdf_files = [f for f in os.listdir() if f.endswith(".pdf")]
# Iterate through each PDF file
for pdf_file in pdf_files:
pdf_to_img(pdf_file)
`

As far as I know, PyPDF2 is not capable of rendering a page of a document as an image.
In contrast, PyMuPDF can do this. Snippet here:
import fitz # import PyMuPDF
doc = fitz.open("your.pdf")
# NOTE: apart from PDF, you can do the exact same thing for EPUB,
# MOBI, XPS, FB2, CBZ documents - no code change required.
for page in doc: # iterate over document pages
pix = page.get_pixmap(dpi=150) # render full page with desired DPI
pix.save(f"page-%04i.png" % page.number) # PNG directly supported
# if JPEG desired, use a variant that employs Pillow:
# pix.pil_save(f"page-%04i.jpg" % page.number)
In next version, JPEG will be directly supported.
Method get_pixmap() has several parameters to choose the colorspace (e.g. gray), include transparency channel, or restrict the page area to be rendered.

Here is a version that iterates over a list of files (PDF or other document types), and saves all their pages in a folder "images". This time we are using Python context manager.
import os
import fitz
filelist = [] # list of filenames (PDFs, XPS, EPUB, MOBI, ...)
for fname in filenames:
basename = os.path.basename(fname)
first = os.path.splitext(basename)[0]
with fitz.open(fname) as doc:
for page in doc:
pix = page.get_pixmap(dpi=150)
pix.save(os.path.join("images", "%s_%04.png" % (first, page.number)))

How to create a multiple page pdf with pytesseract?

I'm trying to mark only a few words in a pdf and with the results I want to make a new pdf using only pytesseract.
Here is the code:
images = convert_from_path(name,poppler_path=r'C:\Program Files\poppler-0.68.0\bin')
for i in images:
img = cv.cvtColor(np.array(i),cv.COLOR_RGB2BGR)
d = pytesseract.image_to_data(img,output_type=Output.DICT,lang='eng+equ',config="--psm 6")
boxes = len(d['level'])
for i in range(boxes):
for e in functionEvent: #functionEvent is a list of strings
if e in d['text'][i]:
(x,y,w,h) = (d['left'][i],d['top'][i],d['width'][i],d['height'][i])
cv.rectangle(img,(x,y),(x+w,y+h),(0,255,0),2)
pdf = pytesseract.image_to_pdf_or_hocr(img,extension='pdf')
with open('results.pdf','w+b') as f:
f.write(pdf)
What have I tried:
with open('results.pdf','a+b') as f:
f.write(pdf)
If you know how can I fix this just let me know.
Also I don't care at all if you recommand another module or your opinion how am I supposed to write code.
Thanks in advance!

Try using PyPDF2 to link your pdfs together.
Firstly you extract your text from pdf with tesseract OCR and store it into list object like this :
for filename in tqdm(os.listdir(in_dir)):
img = Image.open(os.path.join(in_dir,filename))
pdf = pytesseract.image_to_pdf_or_hocr(img, lang='slk', extension='pdf')
pdf_pages.append(pdf)
then iterate trough each processed image or file, read the bytes and add pages using PdfFileReader like this(do not forget to import io):
pdf_writer = PdfFileWriter()
for page in pdf_pages:
pdf = PdfFileReader(io.BytesIO(page))
pdf_writer.addPage(pdf.getPage(0))
In the end create the file and store data to it:
file = open(out_dir, "w+b")
pdf_writer.write(file)
file.close()

Extracting text from scanned PDF without saving the scan as a new file image

I would like to extract text from scanned PDFs.
My "test" code is as follows:
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image
converted_scan = convert_from_path('test.pdf', 500)
for i in converted_scan:
i.save('scan_image.png', 'png')
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
outfile.write(text.replace('\n\n', '\n'))
I would like to know if there is a way to extract the content of the image directly from the object converted_scan, without saving the scan as a new "physical" image file on the disk?
Basically, I would like to skip this part:
for i in converted_scan:
i.save('scan_image.png', 'png')
I have a few thousands scans to extract text from. Although all the generated new image files are not particularly heavy, it's not negligible and I find it a bit overkill.
EDIT
Here's a slightly different, more compact approach than Colonder's answer, based on this post. For .pdf files with many pages, it might be worth adding a progress bar to each loop using e.g. the tqdm module.
from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io
infile = 'my_file.pdf'
tool = pyocr.get_available_tools()[0]
tool = tools[0]
req_image = []
txt = ''
# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
image_png = scan.convert('png')
for i in image_png.sequence:
img_page = w_img(image = i)
req_image.append(img_page.make_blob('png'))
for i in req_image:
content = tool.image_to_string(
p_img.open(io.BytesIO(i)),
lang = tool.get_available_languages()[0],
builder = pyocr.builders.TextBuilder()
)
txt += content
# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
full_txt = regex.sub(r'\n+', '\n', txt)
outfile.write(full_txt)

UPDATE MAY 2021
I realized that although pdf2image is simply calling a subprocess, one doesn't have to save images to subsequently OCR them. What you can do is just simply (you can use pytesseract as OCR library as well)
from pdf2image import convert_from_path
for img in convert_from_path("some_pdf.pdf", 300):
txt = tool.image_to_string(img,
lang=lang,
builder=pyocr.builders.TextBuilder())
EDIT: you can also try and use pdftotext library
pdf2image is a simple wrapper around pdftoppm and pdftocairo. It internally does nothing more but calls subprocess. This script should do what you want, but you need a wand library as well as pyocr (I think this is a matter of preference, so feel free to use any library for text extraction you want).
from PIL import Image as Pimage, ImageDraw
from wand.image import Image as Wimage
import sys
import numpy as np
from io import BytesIO
import pyocr
import pyocr.builders
def _convert_pdf2jpg(in_file_path: str, resolution: int=300) -> Pimage:
"""
Convert PDF file to JPG
:param in_file_path: path of pdf file to convert
:param resolution: resolution with which to read the PDF file
:return: PIL Image
"""
with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
for page in all_pages.sequence:
with Wimage(page) as single_page_image:
# transform wand image to bytes in order to transform it into PIL image
yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))
tools = pyocr.get_available_tools()
if len(tools) == 0:
print("No OCR tool found")
sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
for img in _convert_pdf2jpg("some_pdf.pdf"):
txt = tool.image_to_string(img,
lang=lang,
builder=pyocr.builders.TextBuilder())

Convert PDF page to image with pyPDF2 and BytesIO

I have a function that gets a page from a PDF file via pyPdf2 and should convert the first page to a png (or jpg) with Pillow (PIL Fork)
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
from PIL import Image
import io
# Open PDF Source #
app_path = os.path.dirname(__file__)
src_pdf= PdfFileReader(open(os.path.join(app_path, "../../../uploads/%s" % filename), "rb"))
# Get the first page of the PDF #
dst_pdf = PdfFileWriter()
dst_pdf.addPage(src_pdf.getPage(0))
# Create BytesIO #
pdf_bytes = io.BytesIO()
dst_pdf.write(pdf_bytes)
pdf_bytes.seek(0)
file_name = "../../../uploads/%s_p%s.png" % (name, pagenum)
img = Image.open(pdf_bytes)
img.save(file_name, 'PNG')
pdf_bytes.flush()
That results in an error:
OSError: cannot identify image file <_io.BytesIO object at 0x0000023440F3A8E0>
I found some threads with a similar issue, (PIL open() method not working with BytesIO) but I cannot see where I am wrong here, as I have pdf_bytes.seek(0) already added.
Any hints appreciated

Per document:
write(stream) Writes the collection of pages added to this object out
as a PDF file.
Parameters: stream – An object to write the file to. The object must
support the write method and the tell method, similar to a file
object.
So the object pdf_bytes contains a PDF file, not an image file.
The reason why there are codes like above work is: sometimes, the pdf file just contains a jpeg file as its content. If your pdf is just a normal pdf file, you can't just read the bytes and parse it as an image.
And refer to as a more robust implementation: https://stackoverflow.com/a/34116472/334999

[![enter image description here][1]][1]
import glob, sys, fitz
# To get better resolution
zoom_x = 2.0 # horizontal zoom
zoom_y = 2.0 # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y) # zoom factor 2 in each dimension
filename = "/xyz/abcd/1234.pdf" # name of pdf file you want to render
doc = fitz.open(filename)
for page in doc:
pix = page.get_pixmap(matrix=mat) # render page to an image
pix.save("/xyz/abcd/1234.png") # store image as a PNG
Credit
[Convert PDF to Image in Python Using PyMuPDF][2]
https://towardsdatascience.com/convert-pdf-to-image-in-python-using-pymupdf-9cc8f602525b

Upload Image To Imgur After Resizeing In PIL

I am writing a script which will get an image from a link. Then the image will be resized using the PIL module and the uploaded to Imgur using pyimgur. I dont want to save the image on disk, instead manipulate the image in memory and then upload it from memory to Imgur.
The Script:
from pyimgur import Imgur
import cStringIO
import requests
from PIL import Image
LINK = "http://pngimg.com/upload/cat_PNG106.png"
CLIENT_ID = '29619ae5d125ae6'
im = Imgur(CLIENT_ID)
def _upload_image(img, title):
uploaded_image = im.upload_image(img, title=title)
return uploaded_image.link
def _resize_image(width, height, link):
#Retrieve our source image from a URL
fp = requests.get(link)
#Load the URL data into an image
img = cStringIO.StringIO(fp.content)
im = Image.open(img)
#Resize the image
im2 = im.resize((width, height), Image.NEAREST)
#saving the image into a cStringIO object to avoid writing to disk
out_im2 = cStringIO.StringIO()
im2.save(out_im2, 'png')
return out_im2.getvalue()
When I run this script I get this error: TypeError: file() argument 1 must be encoded string without NULL bytes, not str
Anyone has a solution in mind?

It looks like the same problem as this, and the solution is to use StringIO.
A common tip for searching such issues is to search using the generic part of the error message/string.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract images from PDF using python PyPDF2 - python

Related

Extract all the pages of all the PDFs present in folder in jpg format and store the images in separate folder

How to create a multiple page pdf with pytesseract?

Extracting text from scanned PDF without saving the scan as a new file image

Convert PDF page to image with pyPDF2 and BytesIO

Upload Image To Imgur After Resizeing In PIL

Categories

Resources