Problems extracting files from a pdf with PyM

I want to extract and save images as .png, from a pdf file. I use the following Python code and PyMuPDF:
import fitz
import io
from PIL import Image
file = "pdf1.pdf"
pdf_file = fitz.open(file)
for page_index in range(len(pdf_file)):
    page = pdf_file[page_index]
    image_list = page.getImageList()
    if image_list:
        print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on page", page_index)
    for image_index, img in enumerate(page.getImageList(), start=1):
        xref = img[0]
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image = Image.open(io.BytesIO(image_bytes))
        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))
But I get the following error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-bb8715bc185b> in <module>()
10 # get the page itself
11 page = pdf_file[page_index]
---> 12 image_list = page.getImageList()
13 # printing number of images found in this page
14 if image_list:
AttributeError: 'Page' object has no attribute 'getImageList'
Is it related to the PDF file structure (a non-dictionary type)? How could I solve it in that case?

You forgot to mention the PyMuPDF version you used. The method name getImageList was deprecated a long time ago; the new name, page.get_images(), should have been used instead. In the most recent version, 1.20.x, the old name has finally been removed.
If you have a lot of old code using those old names, you can either use a utility to make a global change, or call fitz.restore_aliases() after import fitz.
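If the same script has to run against both old and new PyMuPDF installs, one more option is a small wrapper that looks the method up by name. This is only a sketch: get_page_images is a hypothetical helper name, not part of PyMuPDF, and it relies on nothing beyond the two method names mentioned above.

```python
def get_page_images(page):
    """Return a page's image list under either PyMuPDF naming scheme."""
    # Try the current snake_case name first, then the removed camelCase alias.
    for name in ("get_images", "getImageList"):
        method = getattr(page, name, None)
        if callable(method):
            return method()
    raise AttributeError("page object has neither get_images nor getImageList")
```

In the question's loop, image_list = get_page_images(page) would replace the failing page.getImageList() call.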

Related

Python Attribute Error Raised while using Thumbnail method of PIL

I am using PIL to build an application that opens all the images in a folder. I looked for PIL tutorials that work with a list of images, but the ones I found required listing the file locations beforehand, which annoyed me. Instead, I want the user to choose a folder and have the application load all of its images. While making thumbnails for the list of images, though, I got an error I'm not familiar with. This is the exact error:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\Admin\AppData\Local\Programs\Python\Python39\lib\tkinter\__init__.py", line 1892, in __call__
return self.func(*args)
File "f:\OG\Python\ImageViewer.py", line 47, in openFolder
GetFiles()
File "f:\OG\Python\ImageViewer.py", line 87, in GetFiles
with Image.open(i) as img:
prefix = fp.read(16)
raise AttributeError(name)
The minimal code to get this error is:
import glob
from PIL import Image, ImageTk
fileDir = "Your Folder"
imageList = []
image_list = []
for filename in glob.glob(fileDir + '/*.jpg'): # gets jpg
    im = Image.open(filename)
    imageList.append(im)
for i in imageList:
    with Image.open(i) as img: # This raises the error
        imageList[i] = img.thumbnail((550, 450))
for i in image_list: # Would this work?
    image_list[i] = ImageTk.PhotoImage(imageList[i])
I would like to know if the code that is commented with 'Would this work?' would work or not.

Just remove the second Image.open call. The images are already opened in the first loop, so reading them again doesn't make sense:
import glob
from PIL import Image, ImageTk
fileDir = r"your path"
imageList = []
for filename in glob.glob(fileDir + '/*.jpg'): # gets jpg
    im = Image.open(filename)
    imageList.append(im)
imageList will then look like this:
[<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=200x200 at 0x25334A87D90>]
Here is a complete solution that also saves each thumbnail:
import glob
from PIL import Image, ImageTk
import PIL
from pathlib import Path
fileDir = r"your_path_here"
imageList = []
for filename in glob.glob(fileDir + '/*.jpg'): # gets jpg
    im = Image.open(filename)
    imageList.append(im)
    im.thumbnail((550, 450)) # resizes im in place
    im.save(fileDir+'/'+Path(filename).name.split('.')[0]+'_thumbnail.png')

I solved it by editing the code as follows:
import glob
from PIL import Image, ImageTk
fileDir = "Your Folder"
imageList = [] # file paths
image_list = [] # PhotoImage thumbnails
for filename in glob.glob(fileDir + '/*.jpg'): # gets jpg
    imageList.append(filename)
for path in imageList:
    with Image.open(path) as img:
        img.thumbnail((550, 450)) # resizes in place and returns None
        # a Tk root window must already exist when PhotoImage is created
        image_list.append(ImageTk.PhotoImage(img))
Basically, I removed Image.open from the first loop so that imageList holds file paths, opened each file once in the second loop, called thumbnail on it (which resizes in place and returns None), and appended the resulting PhotoImage to image_list :)

recognise.train(face, np.array(ids)) Empty training data was given You'll need more than 1 sample to learn a model. in function 'cv:face:LBPH::train'

I'm working on a face recognition project and I'm running into an error.
I'm using Python 3.9 and opencv-contrib, and I have installed the required libraries.
First, I ran one script to open the webcam and another to detect a face and draw a square around it; both work fine. I also uploaded around 20 sample pictures to a folder called image.
The next script trains the recognizer on the faces. This is the code:
# import the required libraries
import cv2
import os
import numpy as np
from PIL import Image
import pickle
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
recognise = cv2.face.LBPHFaceRecognizer_create()
# Created a function
def getdata():
    current_id = 0
    label_id = {} # dictionary
    face_train = [] # list
    face_label = [] # list
    # Finding the path of the base directory i.e. the path where this file is placed
    BASE_DIR = os.path.dirname(os.path.abspath(__file__))
    # We have created "image_data" folder that contains the data so basically
    # we are appending its path to the base path
    my_face_dir = os.path.join(BASE_DIR,'image_data')
    # Finding all the folders and files inside the "image_data" folder
    for root, dirs, files in os.walk(my_face_dir):
        for file in files:
            # Checking if the file has extension ".png" or ".jpg"
            if file.endswith("png") or file.endswith("jpg"):
                # Adding the path of the file with the base path
                # so you basically have the path of the image
                path = os.path.join(root, file)
                # Taking the name of the folder as label i.e. his/her name
                label = os.path.basename(root).lower()
                # providing label ID as 1 or 2 and so on for different persons
                if not label in label_id:
                    label_id[label] = current_id
                    current_id += 1
                ID = label_id[label]
                # converting the image into gray scale image
                # you can also use cv2 library for this action
                pil_image = Image.open(path).convert("L")
                # converting the image data into numpy array
                image_array = np.array(pil_image, "uint8")
                # identifying the faces
                face = cascade.detectMultiScale(image_array)
                # finding the Region of Interest and appending the data
                for x,y,w,h in face:
                    img = image_array[y:y+h, x:x+w]
                    #image_array = cv2.rectangle(image_array,(x,y),(x+w,y+h),(255,255,255),3)
                    cv2.imshow("Test",img)
                    cv2.waitKey(1)
                    face_train.append(img)
                    face_label.append(ID)
    # storing the labels data into a file
    with open("labels.pickle", 'wb') as f:
        pickle.dump(label_id, f)
    return face_train,face_label
# creating ".yml" file
face,ids = getdata()
recognise.train(face, np.array(ids))
recognise.save("trainner.yml")
After running the code I get the following error:
Traceback (most recent call last):
File "C:\Users\person\Desktop\WebcamRecognition\face_trainer.py", line 76, in <module>
recognise.train(face, np.array(ids))
cv2.error: OpenCV(4.5.4-dev) D:\a\opencv-python\opencv-python\opencv_contrib\modules\face\src\lbph_faces.cpp:362: error: (-210:Unsupported format or combination of formats) Empty training data was given. You'll need more than one sample to learn a model. in function 'cv::face::LBPH::train'
Does anyone know how to solve this error?
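One thing worth ruling out is that the directory walk simply finds no images. os.walk does not raise an error when its starting directory is missing; it just yields nothing. The question describes the pictures as stored under image, while the code walks image_data. Below is a minimal sketch of a pre-check under that assumption. list_labeled_images and check_training_inputs are hypothetical helper names, not OpenCV functions, and the check covers only the folder scan, not face detection.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg")


def list_labeled_images(data_dir):
    """Return (label, path) pairs, using each image's folder name as its label."""
    pairs = []
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            if name.lower().endswith(IMAGE_EXTENSIONS):
                label = os.path.basename(root).lower()
                pairs.append((label, os.path.join(root, name)))
    return pairs


def check_training_inputs(data_dir):
    """Fail early, with a readable message, if there is nothing to train on."""
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"training folder not found: {data_dir}")
    pairs = list_labeled_images(data_dir)
    if not pairs:
        raise ValueError(f"no .png or .jpg files found under {data_dir}")
    return pairs
```

Calling check_training_inputs(my_face_dir) before the detection loop separates "no files found" from "files found but no faces detected". In the second case the training lists stay empty because detectMultiScale returned no rectangles for any image.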

Python concatenating a string and an integer counter to name content inside a folder

I am using Windows 10 Pro and Python 3.6.2rc1. I am training a convolutional neural network built with TensorFlow. As a preprocessing step, I have written the following code to resize each image. The resizing works perfectly well, but I have more than 100 training images (I kept the number low just to see how it works for now) with very different names, and I'd like all of them to follow the same naming convention, as in "image001", "image002" and so on. So I added a counter and use it to build the new name of each image before saving it to the same folder with cv2.imwrite(). But I am getting this error:
Traceback (most recent call last):
File "E:/Python/TrainingData/TrainingPrep.py", line 11, in <module>
cv2.imwrite(imageName,re)
cv2.error: D:\Build\OpenCV\opencv-3.2.0\modules\imgcodecs\src\loadsave.cpp:531: error: (-2) could not find a writer for the specified extension in function cv::imwrite_
import cv2
import glob
i=0
images = glob.glob("*.jpg")
for image in images:
    img = cv2.imread(image,1)
    counter=str(i)
    re = cv2.resize(img,(128,128))
    imageName = "image"+counter
    cv2.imwrite(imageName,re)
    i=i+1
    print(counter)
I need my images to have the names image001 through image00x. I would appreciate it if you could help me solve this problem.
Thank you very much.

The imwrite method uses the file extension to determine the output format, so the file name must include one.
Simply change this line to the following (for PNG, or whichever file format you want) and it should work:
imageName = "image"+counter+".png"
You can rename the files later if you wish, using glob.glob and os.rename. A working example should look something like this:
import cv2
import glob
import os
i=0
images = glob.glob("*.jpg")
for image in images:
    img = cv2.imread(image,1)
    counter=str(i)
    re = cv2.resize(img,(128,128))
    imageName = "image"+counter+".jpg"
    cv2.imwrite(imageName,re)
    i=i+1
    print(counter)
# strip the extension from the files written above
rename = glob.glob("image*.jpg")
for src in rename:
    dst = os.path.splitext(src)[0]
    os.rename(src, dst)

This method will give you the leading zeros you want in the file name:
import cv2
import glob
i=0
images = glob.glob("*.jpg")
for image in images:
    img = cv2.imread(image,1)
    re = cv2.resize(img,(128,128))
    imageName = "image{:03d}.png".format(i) # format i as 3 characters with leading zeros
    cv2.imwrite(imageName,re)
    i=i+1
    print(imageName)
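The naming step can also be pulled out into a small pure-Python helper, so it can be checked without touching any image files. This is a sketch under assumptions: numbered_name and plan_renames are hypothetical helper names, numbering starts at 1 because the question asks for image001, and the sources are sorted because glob does not guarantee an order.

```python
def numbered_name(index, prefix="image", ext=".png", width=3):
    """Build a name such as image007.png; imwrite picks the format from the extension."""
    return f"{prefix}{index:0{width}d}{ext}"


def plan_renames(filenames, start=1):
    """Pair each source file, in sorted order, with its numbered target name."""
    return [(src, numbered_name(i)) for i, src in enumerate(sorted(filenames), start=start)]
```

In the loop above, each (src, dst) pair from plan_renames(glob.glob("*.jpg")) would supply the file to read and the name to pass to cv2.imwrite. Indexes wider than three digits are not truncated; they simply produce a longer name.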

IOError: cannot identify image file when loading images from pdf files

I am trying to read scanned images from a PDF using wand and display them using PIL, but I get an error. The first page of the PDF works perfectly, but the second page produces this error.
Code
from wand.image import Image
from wand.display import display
from PIL import Image as PI
import pyocr
import pyocr.builders
import io
import numpy as np
import cStringIO
tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[1]
req_image = []
final_text = []
image_pdf = Image(filename="DEEP_PLAST_20.700.pdf", resolution=200)
image_jpeg = image_pdf.convert('jpeg')
img_page = Image(image=image_jpeg.sequence[1])
img_buffer = np.asarray(bytearray(img_page.make_blob()), dtype=np.uint8)
print(img_buffer)
# im = PI.fromarray(img_buffer)
im = PI.open(cStringIO.StringIO(img_buffer))
I get this error.
Traceback (most recent call last):
File "ocr.py", line 43, in <module>
im = PI.open(cStringIO.StringIO(img_buffer))
File "/home/sahil/anaconda2/lib/python2.7/site-packages/PIL/Image.py", line 2452, in open
% (filename if filename else fp))
IOError: cannot identify image file <cStringIO.StringI object at 0x7fc4a8f168b0>
I don't know why the code fails on the second page of the PDF when it works for the first one.
Any help would be appreciated!

Extract images from PDF using python PyPDF2

Is there any way to extract images as streams from a PDF document (using the PyPDF2 library)?
Also, is it possible to replace some images with others (generated with PIL, for example, or loaded from a file)?
I'm able to get an EncodedStreamObject from the PDF object tree and read the encoded stream (by calling the getData() method), but it looks like just raw content without any image headers or other meta information.
>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')
I've looked across PyPDF2, ReportLab and PDFMiner solutions quite a bit, but haven't found anything like what I'm looking for.
Any code samples and links will be very helpful.

import fitz
doc = fitz.open(filePath)
for i in range(len(doc)):
    # current PyMuPDF names; older releases spelled these getPageImageList and writePNG
    for img in doc.get_page_images(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4: # this is GRAY or RGB
            pix.save("p%s-%s.png" % (i, xref))
        else: # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.save("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

Image metadata is not stored within the encoded images of a PDF. If metadata is stored at all, it is stored in the PDF itself, stripped from the underlying image. The metadata you see in your example is likely all that you'll be able to get. It's possible that PDF encoders store image metadata elsewhere in the PDF, but I haven't seen this. (Note this metadata question was also asked for Java.)
It's definitely possible to extract the stream, however: as you mentioned, you use the getData operation.
As for replacing an image, you'll need to create a new image object in the PDF, add it to the end, and update the indirect object pointers accordingly. It will be difficult to do this with PyPDF2.
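Since the stream carries no header, the width, height and bit depth have to come from the image dictionary. In the question's example the 9000 bytes match 100 × 30 × 3, which is consistent with plain 8-bit RGB samples. Here is a minimal sketch, assuming exactly that layout (three components, 8 bits each, no padding), which wraps such data in a binary PPM header using only the standard library. raw_rgb_to_ppm is a hypothetical helper name, and other colour spaces or bit depths would need different handling.

```python
def raw_rgb_to_ppm(data, width, height):
    """Wrap raw 8-bit RGB samples in a binary PPM (P6) header."""
    expected = width * height * 3
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(data)
```

Writing the returned bytes to a .ppm file in binary mode gives an image that many viewers and libraries can open, with /Width and /Height taken from the dictionary shown in the question.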

Extracting Images from PDF
This code fetches the images in a scanned, machine-generated, or normal PDF.
It reports how many images occur on each page.
It saves each image with its original resolution and extension.
pip install PyMuPDF
import fitz
import io
from PIL import Image
# path of the file you want to extract images from
file = r"File_path"
# open the file
pdf_file = fitz.open(file)
# iterate over PDF pages
for page_index in range(pdf_file.page_count):
    # get the page itself
    page = pdf_file[page_index]
    image_li = page.get_images()
    # print the number of images found on this page
    # page index starts from 0, hence adding 1 for display
    if image_li:
        print(f"[+] Found a total of {len(image_li)} images in page {page_index+1}")
    else:
        print(f"[!] No images found on page {page_index+1}")
    for image_index, img in enumerate(image_li, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")
