Python Image extraction sequence from pdf


I am trying to extract images from a PDF using PyMuPDF (fitz). The PDF has multiple images on a single page, and I number the images sequentially as I save them. However, the images are not extracted in a consistent order: sometimes extraction starts from the bottom of the page, sometimes from the top, and so on. Is there a way to modify my code so that the images are extracted in the order they appear on the page?
This is the code I am using:
import fitz  # PyMuPDF
from PIL import Image
filename = "document.pdf"
doc = fitz.open(filename)
p_no = 1  # page number used in the output file names
for i in range(len(doc)):
    img_num = 0
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # grayscale or RGB: save directly
            img_num += 1
            pix.writeImage("%s-%s.jpg" % (str(p_no), str(img_num)))
        else:  # CMYK: convert to RGB first
            img_num += 1
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writeImage("%s-%s.jpg" % (str(p_no), str(img_num)))
            pix1 = None
        pix = None
    p_no += 1
Given below is a sample page of the pdf

I have the same problem. I used the following code:
import fitz
import io
from PIL import Image
file = "file_path"
pdf_file = fitz.open(file)
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on the given pdf page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        print(img)
        print(image_index)
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk (passing a path lets PIL open and close the file)
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")
The most likely fix is to find the position of each 'img' entry on the page and sort the images by that position before saving them.
I'd love to hear any further suggestions, or whether you have found a better idea or solution.
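The idea of ordering the images by position can be sketched as follows. This is a minimal sketch, not a confirmed solution from the thread: `reading_order`, `page_image_boxes`, and `row_tolerance` are hypothetical names, and `page_image_boxes` assumes a recent PyMuPDF in which `Page.get_image_info(xrefs=True)` returns dictionaries with `"xref"` and `"bbox"` keys. The sorting itself is plain Python: images are grouped into rows by their top edge, then ordered left to right within each row.

```python
from typing import List, Sequence, Tuple

# (x0, y0, x1, y1) in page coordinates, with y growing downward
BBox = Tuple[float, float, float, float]


def reading_order(items: Sequence[Tuple[int, BBox]],
                  row_tolerance: float = 5.0) -> List[Tuple[int, BBox]]:
    """Sort (xref, bbox) pairs top-to-bottom, then left-to-right within a row."""
    by_top = sorted(items, key=lambda item: item[1][1])
    rows: List[List[Tuple[int, BBox]]] = []
    for item in by_top:
        top = item[1][1]
        # Stay in the current row while the top edge is within the tolerance
        # of the row's first (topmost) image; otherwise start a new row.
        if rows and top - rows[-1][0][1][1] <= row_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    ordered: List[Tuple[int, BBox]] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda item: item[1][0]))
    return ordered


def page_image_boxes(page) -> List[Tuple[int, BBox]]:
    """Collect (xref, bbox) pairs for one PyMuPDF page.

    Assumption: Page.get_image_info(xrefs=True) is available and each
    returned dict carries "xref" and "bbox".
    """
    return [(info["xref"], tuple(info["bbox"]))
            for info in page.get_image_info(xrefs=True)]
```

Inside the page loop, the extraction could then iterate over `reading_order(page_image_boxes(page))` instead of the raw image list, using each `xref` as before. An image that is placed on the page more than once would appear once per placement, so duplicates may need to be skipped.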

Related

Downloading Images using BS4

I have been trying to download images from a site using bs4. The images are not JPEG or PNG, so I think bs4 is unable to find them, though I could be wrong about that.
Here's my code:
#--- IMAGE --- NOT WORKING
# Finds the image URL with the name of the product
# done
image = soup.find('img', attrs={'class': "image_container"})
try:
    image = image.get("src")
except AttributeError:
    print("NO image FOUND")
    image = "NO image FOUND"
if image != "NO image FOUND":  # if the image is found
    try:
        pos = image.index("?")
        image = "http:" + image[:pos]
    except ValueError:
        pass
    pathImg += name[:nameLength]  # Truncates to 5 characters and adds to pathImg file
    if generateFiles:
        download(image, pathImg)  # Downloads image
self.image = image  # Exporting var to class global var
Here's an image of where the source is on the page:
Source Code for the image container

Add page title, img2pdf

I've recently found this (wonderful) Python software for converting multiple images into a single PDF, img2pdf. After creating my first PDF I realized that the pages have no titles, so it is difficult to tell which original image each page came from (there are 400 of them). Does anyone know how I can add a page title?
Thanks in advance.

I looked for the same solution but ended up writing a Python program to solve it. I don't know if it helps you, but here it is nonetheless.
I used PIL.Image and ImageDraw to go through all the images and draw the file name onto each of them. After that I used img2pdf as a Python library to generate the PDF.
The script must be run in the same folder as the images.
import os
import traceback
import img2pdf
from PIL import Image, ImageDraw, ImageFont, ExifTags
# Enter the path to the font you want, 'fc-list' on ubuntu will get a list of fonts you can use.
#image_text_font = ImageFont.truetype('/Library/Fonts/Arial.ttf', 15)
image_text_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf", 32)
# Tags the images with 'file name' in the upper left corner
def tag_images():
    for file in os.listdir('.'):
        if file.endswith(".jpg") and str(file+"_tagged.jpg") not in os.listdir('.') and not file.endswith("_tagged.jpg"):
            one_image = check_and_adjust_rotation(Image.open(file))
            one_image_draw = ImageDraw.Draw(one_image)
            # Add textbox to image
            # Note: textsize() and getoffset() were removed in Pillow 10, so this
            # needs an older Pillow (newer versions provide textbbox() instead)
            size = one_image_draw.textsize(file, font=image_text_font)
            offset = image_text_font.getoffset(file)
            one_image_draw.rectangle((10, 10, 10 + size[0] + offset[0], 10 + size[1] + offset[1]), fill='white', outline='black')
            # Add text to image
            one_image_draw.text((10,10), file, font=image_text_font, fill='black')
            # Save tagged image
            one_image.save(file + "_tagged.jpg")
            print(f'Tagged and saved "{file}_tagged.jpg".')
# Generate the PDF
def generate_pdf_from_multiple_images():
    with open("output.pdf", "wb") as f:
        # sorted() keeps the page order stable; os.listdir() returns files in arbitrary order
        f.write(img2pdf.convert(sorted(image_file for image_file in os.listdir('.') if image_file.endswith("_tagged.jpg"))))
# Use exif information about rotation to apply proper rotation to the image
def check_and_adjust_rotation(image):
    try:
        # Look up the numeric EXIF tag id for 'Orientation'
        for orientation in ExifTags.TAGS.keys():
            if ExifTags.TAGS[orientation] == 'Orientation':
                break
        exif = dict(image._getexif().items())
        print(exif[orientation])
        if exif[orientation] == 3:
            image = image.rotate(180, expand=True)
        elif exif[orientation] == 6:
            image = image.rotate(270, expand=True)
        elif exif[orientation] == 8:
            image = image.rotate(90, expand=True)
    except Exception:
        # Images without EXIF data end up here and are returned unchanged
        traceback.print_exc()
    return image
def main():
    tag_images()
    generate_pdf_from_multiple_images()
if __name__ == '__main__':
    main()

Change color scheme when extracting an image from PDF in Python

I am trying to read an image from a PDF, following this post:
Extract images from PDF without resampling, in python?
So far I have managed to get the image file out of the PDF, but it uses a CMYK color scheme and the picture comes out messed up.
My code is the following:
import PyPDF2
import struct
pdf_filename = 'document.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
page = cond_scan_reader.getPage(4)
xObject = page['/Resources']['/XObject'].getObject()
for obj in xObject:
    print(xObject[obj])
    if xObject[obj]['/Subtype'] == '/Image':
        if xObject[obj]['/Filter'] == '/DCTDecode':
            data = xObject[obj]._data
            img = open("image" + ".jpg", "wb")
            img.write(data)
            img.close()
pdf_file.close()
The problem is that the colors of the saved image are all wrong, and I believe the color space is the cause. This is what I get in the console:
{'/Type': '/XObject', '/Subtype': '/Image', '/Width': 1122, '/Height': 502, '/Interpolate': <PyPDF2.generic.BooleanObject object at 0x1061574a8>, '/ColorSpace': '/DeviceCMYK', '/BitsPerComponent': 8, '/Filter': '/DCTDecode'}
As you can see, the ColorSpace is CMYK, and I believe that's why the colors of the image are weird.
That's the image I have:
This is the original image (it is inside a pdf file):
Can anyone help me?
Thanks in advance.
Israel

A CMYK-mode JPG image contained in a PDF must be inverted.
But PIL does not support inverting a CMYK-mode image,
so I solved it using numpy.
The full source is at the link below.
https://github.com/Gaia3D/pdfImageExtractor/blob/master/extrectImage.py
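import numpy as np
from PIL import Image
# img is the PIL image opened from the extracted JPEG data;
# outFileName is the output path without extension
imgData = np.frombuffer(img.tobytes(), dtype='B')
invData = np.full(imgData.shape, 255, dtype='B')
invData -= imgData  # 255 - value for every byte
img = Image.frombytes(img.mode, img.size, invData.tobytes())
img.save(outFileName + ".jpg")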
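The numpy lines above replace every byte with 255 minus its value. As a minimal dependency-free sketch of the same inversion (the helper name `invert_samples` is hypothetical, and it assumes 8 bits per component, as in the `/BitsPerComponent` value shown in the question), a 256-entry lookup table with `bytes.translate` does the same job:

```python
# Lookup table mapping each byte value v to 255 - v
_INVERT_TABLE = bytes(255 - value for value in range(256))


def invert_samples(raw: bytes) -> bytes:
    """Return a copy of 8-bit samples with every value v replaced by 255 - v."""
    return raw.translate(_INVERT_TABLE)
```

With PIL, the result could be fed back in the same way as in the answer, for example `Image.frombytes(img.mode, img.size, invert_samples(img.tobytes()))`.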

Image drawn to reportlab pdf bigger than pdf paper size

I'm writing a program which takes all the pictures in a given folder and aggregates them into a PDF. The problem is that when the images are drawn, they are bigger than the page and are oddly rotated to the left. I've searched everywhere and haven't found anything, even in the reportlab documentation.
Here's the code:
import os
from PIL import Image
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.units import cm
def main():
    images = image_search()
    output = PdfFileWriter()
    for image in images:
        Image_file = Image.open(image) # need to convert the image to the specific size first.
        width, height = Image_file.size
        im_width = 1 * cm
        # Using ReportLab to insert image into PDF
        watermark_str = "watermark" + str(images.index(image)) + '.pdf'
        imgDoc = canvas.Canvas(watermark_str)
        # Draw image on Canvas and save PDF in buffer
        # define the aspect ratio first
        aspect = height / float(width)
        ## Drawing the image
        imgDoc.drawImage(image, 0,0, width = im_width, height = (im_width * aspect)) ## at (399,760) with size 160x160
        imgDoc.showPage()
        imgDoc.save()
        # Get the watermark file just created
        watermark = PdfFileReader(open(watermark_str, "rb"))
        #Get our files ready
        pdf1File = open('sample.pdf', 'rb')
        page = PdfFileReader(pdf1File).getPage(0)
        page.mergePage(watermark.getPage(0))
        #Save the result
        output.addPage(page)
    output.write(open("output.pdf","wb"))
#The function which searches the current directory for image files.
def image_search():
    found_images = []
    for doc in os.listdir(os.curdir):
        image_ext = ['.jpg', '.png', '.PNG', '.jpeg', '.JPG']
        for ext in image_ext:
            if doc.endswith(ext):
                found_images.append(doc)
    return found_images
main()
I also tried scaling and specifying the aspect ratio using the im_width variable, which gave the same output.

After a little confusion about your goal, I figured out that you want to make a PDF overview of the images in the current folder. For that we actually don't need PyPDF2, as Reportlab offers everything we need.
See the code below, with the comments as guidelines:
from PIL import Image
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas
# image_search() is the helper defined in the question
def main():
    output_file_loc = "overview.pdf"
    imgDoc = canvas.Canvas(output_file_loc)
    imgDoc.setPageSize(A4) # This is actually the default page size
    document_width, document_height = A4
    images = image_search()
    for image in images:
        # Open the image file to get image dimensions
        Image_file = Image.open(image)
        image_width, image_height = Image_file.size
        image_aspect = image_height / float(image_width)
        # Determine the dimensions of the image in the overview
        print_width = document_width
        print_height = document_width * image_aspect
        # Draw the image on the current page
        # Note: As reportlab uses bottom left as (0,0) we need to determine the start position by subtracting the
        # dimensions of the image from those of the document
        imgDoc.drawImage(image, document_width - print_width, document_height - print_height, width=print_width,
                         height=print_height)
        # Inform Reportlab that we want a new page
        imgDoc.showPage()
    # Save the document
    imgDoc.save()
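The code above scales every image to the full page width, so an image that is proportionally taller than the page gets a print height larger than the page height and runs off the bottom. A minimal sketch of scaling to fit both dimensions is shown below. The helper names `fit_within` and `top_left_position` are hypothetical, and the function works on plain numbers so it does not depend on Reportlab itself.

```python
def fit_within(image_width, image_height, page_width, page_height):
    """Scale (image_width, image_height) to fit the page, keeping the aspect ratio."""
    # The smaller of the two ratios is the largest scale that fits both ways.
    scale = min(page_width / image_width, page_height / image_height)
    return image_width * scale, image_height * scale


def top_left_position(print_height, page_height):
    """Bottom-left (x, y) that places the image flush with the top-left corner.

    Reportlab's origin (0, 0) is the bottom-left corner of the page, so the
    y coordinate is the page height minus the printed image height.
    """
    return 0.0, page_height - print_height
```

In the loop above, `print_width, print_height = fit_within(image_width, image_height, document_width, document_height)` could replace the two `print_` assignments, with the position taken from `top_left_position(print_height, document_height)`.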

Converting an UploadedFile to PIL image in Django

I'm trying to check an image's dimension, before saving it. I don't need to change it, just make sure it fits my limits.
Right now, I can read the file, and save it to AWS without a problem.
output['pic file'] = request.POST['picture_file']
conn = myproject.S3.AWSAuthConnection(aws_key_id, aws_key)
filedata = request.FILES['picture'].read()
content_type = 'image/png'
conn.put(
    bucket_name,
    request.POST['picture_file'],
    myproject.S3.S3Object(filedata),
    {'x-amz-acl': 'public-read', 'Content-Type': content_type},
)
I need to put a step in the middle that makes sure the image has the right dimensions (width and height). My file isn't coming from a form that uses ImageField, and all the solutions I've seen use that.
Is there a way to do something like
img = Image.open(filedata)

image = Image.open(file)
# To get the image size, in pixels (size is an attribute, not a method).
(width, height) = image.size
# check the width and height against the limits and resize if needed
image = image.resize((width_new, height_new))

I've done this before but I can't find my old snippet... so here we go, off the top of my head:
from io import BytesIO
from PIL import Image
picture = request.FILES['picture']
img = Image.open(picture)
# check sizes ... probably using img.size, and then resize
# re-save if necessary
imgstr = BytesIO()  # image data is binary, so use a bytes buffer
img.save(imgstr, 'PNG')
imgstr.seek(0)  # rewind before reading the data back
filedata = imgstr.read()

The code below creates the image from the request, as you want:
from PIL import ImageFile
def image_upload(request):
    for f in request.FILES.values():
        p = ImageFile.Parser()
        # feed the upload to the parser in 1 KB chunks
        while 1:
            s = f.read(1024)
            if not s:
                break
            p.feed(s)
        im = p.close()
        im.save("/tmp/" + f.name)
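For the PNG uploads in the question, the dimensions can also be read straight from the file header without decoding the image. The following is a dependency-free sketch, not one of the PIL-based approaches above: `png_dimensions` and `within_limits` are hypothetical names, and it only handles PNG data (for other formats, opening the bytes with PIL through a file-like object such as `io.BytesIO(filedata)` is the more general route). It relies on the PNG layout: an 8-byte signature, then the IHDR chunk whose first two fields are the big-endian width and height.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes):
    """Return (width, height) from a PNG's IHDR chunk, or None if not a PNG."""
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type "IHDR",
    # 16-19: width, 20-23: height (both big-endian unsigned 32-bit).
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def within_limits(data: bytes, max_width: int, max_height: int) -> bool:
    """True if data is a PNG no larger than max_width x max_height pixels."""
    size = png_dimensions(data)
    return size is not None and size[0] <= max_width and size[1] <= max_height
```

In the view from the question, `within_limits(filedata, max_width, max_height)` could be called on the bytes read from `request.FILES['picture']` before the upload to S3; the limits themselves are placeholders.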
