Extract an image from a PDF in python - python

I'm trying to extract images from a PDF using PyPDF2, but the image my code produces looks very different from what it should actually look like. See the example below:
But this is how it should really look:
Here's the pdf I'm using:
https://www.hbp.com/resources/SAMPLE%20PDF.pdf
Here's my code:
import PyPDF2
pdf_filename = "SAMPLE.pdf"
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
page = cond_scan_reader.getPage(0)
xObject = page['/Resources']['/XObject'].getObject()
i = 0
for obj in xObject:
    # print(xObject[obj])
    if xObject[obj]['/Subtype'] == '/Image':
        if xObject[obj]['/Filter'] == '/DCTDecode':
            # DCTDecode streams hold JPEG data, so the raw bytes are written out unchanged
            data = xObject[obj]._data
            img = open("{}".format(i) + ".jpg", "wb")
            img.write(data)
            img.close()
            i += 1
Since I need to keep the image in its colour mode, I can't just convert it to RGB if it was CMYK, because I need that information.
Also, I'm trying to get the DPI of the images I extract from a PDF. Is that information always stored in the image?
Thanks in advance

I used pdfreader to extract the image from your example.
The image uses ICCBased colorspace with the value of N=4 and Intent value of RelativeColorimetric. This means that the "closest" PDF colorspace is DeviceCMYK.
All you need is to convert the image to RGB and invert the colors.
Here is the code:
from pdfreader import SimplePDFViewer
import PIL.ImageOps
fd = open("SAMPLE PDF.pdf", "rb")
viewer = SimplePDFViewer(fd)
viewer.render()
img = viewer.canvas.images['Im0']
# this displays ICCBased 4 RelativeColorimetric
print(img.ColorSpace[0], img.ColorSpace[1].N, img.Intent)
pil_image = img.to_Pillow()
pil_image = pil_image.convert("RGB")
inverted = PIL.ImageOps.invert(pil_image)
inverted.save("sample.png")
Read more on PDF objects: Image (sec. 8.9.5), InlineImage (sec. 8.9.7)
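Since the question asks to keep the CMYK colour mode, the inversion step can in principle be applied without converting to RGB first. The following is only a minimal sketch, assuming the image samples have already been decoded into an 8-bit NumPy array of shape (height, width, 4) and that inverted channel values are indeed what makes the extracted image look wrong. The function name invert_cmyk is hypothetical and not part of pdfreader or Pillow.

```python
import numpy as np


def invert_cmyk(samples):
    """Return 8-bit CMYK samples with every channel inverted (255 - value).

    `samples` is assumed to be an array of shape (height, width, 4)
    holding C, M, Y and K values in the range 0..255.
    """
    samples = np.asarray(samples, dtype=np.uint8)
    # Subtracting from 255 flips each channel; the result stays uint8.
    return 255 - samples


# Example: a single CMYK pixel
pixel = np.array([[[0, 255, 10, 200]]], dtype=np.uint8)
inverted = invert_cmyk(pixel)
```

The result still has four channels, so it can be handed back to an imaging library as CMYK data rather than RGB.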

Hope this works. You probably need to use another library, such as Pillow.
Here is an example:
from PIL import Image
image = Image.open("path_to_image")
if image.mode == 'CMYK':
    image = image.convert('RGB')
# Pillow images are written with save(); there is no write() method
image.save("path_to_image.jpg")
Reference: Convert from CMYK to RGB
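On the DPI part of the question: as far as I know, an image embedded in a PDF records its pixel dimensions (/Width and /Height), while the resolution follows from how large the image is drawn on the page. An embedded JPEG may carry its own density field, but that is not guaranteed to be present or to match the placement. The sketch below only shows the arithmetic, assuming you already know the placed size in PDF points (1/72 inch); the function name effective_dpi is hypothetical.

```python
def effective_dpi(pixels, placed_points):
    """Effective resolution along one axis.

    pixels        -- image size in pixels along that axis
    placed_points -- size the image occupies on the page, in PDF points
    """
    if placed_points <= 0:
        raise ValueError("placed_points must be positive")
    # One PDF point is 1/72 inch, so inches = placed_points / 72.
    return pixels * 72.0 / placed_points


# A 600-pixel-wide image drawn 144 points (2 inches) wide
dpi = effective_dpi(600, 144)
```

The same image drawn at a different size on another page would give a different value, which is why the number cannot always be read from the image alone.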

Related

How to add noise (dithering) at background only?

I am trying to train a model on noisy images that have dithering.
What I have:
clean PDFs with a white background
coloured PDFs (RGB) and grayscale PDFs (with 3 channels, RGB)
What I want:
convert only the white background (not the text) into a gray background; if possible, only half of the page should be converted
add dithering to the gray background without losing the text
What I tried:
import os
from PIL import Image
from numpy import array
ORIGIN_PATH = "/home/dithering/temp/"
DESTIN_PATH = "/home/dithering/temp_try/"
"""for filename in os.listdir(ORIGIN_PATH):
    img = Image.open(ORIGIN_PATH + filename).convert("L")
    rbg_grayscale_img = img.convert("RGB")
    rbg_grayscale_img.save(DESTIN_PATH + filename)"""
for filename in os.listdir(ORIGIN_PATH):
    img = Image.open(ORIGIN_PATH + filename).convert("L", dither=Image.Dither.FLOYDSTEINBERG)
    # convert image to a writable nparray (asarray() on a PIL image can return a read-only array)
    numpydata = array(img)
    # turn near-white background pixels into mid-gray
    numpydata[numpydata > 250] = 128
    # data
    print(numpydata)
    # convert array to image
    final_image = Image.fromarray(numpydata)
    # img show
    final_image.show()
    # img save
    final_image.save(DESTIN_PATH + filename)
I expect something like this,
Any help would be appreciated, thanks in advance!
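One possible direction, offered only as a sketch: build a mask of near-white pixels, restrict it to the top half of the page, and write noisy gray values only where the mask is true, so darker text pixels are never touched. This uses plain random noise rather than Floyd-Steinberg dithering, and it assumes a 2-D 8-bit grayscale array. The function name and the parameters threshold, level and amplitude are hypothetical choices, not taken from the question.

```python
import numpy as np


def add_background_noise(gray, threshold=250, level=128, amplitude=20, seed=None):
    """Return a copy of a 2-D uint8 image whose near-white pixels in the
    top half are replaced by gray values with random noise."""
    out = np.array(gray, dtype=np.uint8)      # writable copy of the input
    rng = np.random.default_rng(seed)
    half = out.shape[0] // 2
    top = out[:half]                          # view onto the top half of `out`
    background = top > threshold              # True only for near-white pixels
    noise = rng.integers(-amplitude, amplitude + 1, size=top.shape)
    noisy = np.clip(level + noise, 0, 255).astype(np.uint8)
    top[background] = noisy[background]       # text pixels stay unchanged
    return out


# Tiny synthetic "page": white background with two dark "text" pixels
page = np.full((4, 4), 255, dtype=np.uint8)
page[0, 0] = 0
page[3, 3] = 0
result = add_background_noise(page, seed=1)
```

For a real page the array would come from the image loaded with Pillow, and the result could be converted back with Image.fromarray before saving.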

Lost information getting pdf page as image

I am not an expert in any sense. I am trying to extract a PDF page as an image to do some processing later. I used the following code, which I built from other recommendations on this site.
import os
import fitz
from PIL import Image
dir = r'C:\Users\...'
files = os.listdir(dir)
print(dir+files[21])
doc = fitz.open(dir+files[21])
page = doc.loadPage(2)
zoom = 2
mat = fitz.Matrix(zoom, zoom)
pix = page.getPixmap(matrix = mat)
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
density=img.getdata()
Usually this would give me the pixel information of the image, but in this case it returns a list of white pixels. I have no clue what the reason is. The image (img) is displayed if asked, but not its data.
I would appreciate any help.
If you want to convert pdf to image, and process, you might use something along these lines. This particular simple example reads in 5 pages of the PDF, and for the last page, looks at what percentage of the image is a particular color; the slow way and fast way.
import pdf2image
import numpy as np
# details:
# https://pypi.org/project/pdf2image/
images = pdf2image.convert_from_path('test.pdf')
# Get first five pages, just for testing
i = 1
for image in images:
    print(i, " shape: ", image.size)
    image.save('output' + str(i) + '.jpg', 'JPEG')
    i = i + 1
    if i > 5:
        break
color_test = (128, 128, 128)
other = 0
specific_color = 0
# Look at last image
for i in range(image.width):
    for j in range(image.height):
        x = image.getpixel((i, j))
        if x[0] == color_test[0] and x[1] == color_test[1] and x[2] == color_test[2]:
            specific_color = specific_color + 1
        else:
            other = other + 1
print("frac of specific color = ", specific_color / (specific_color + other))
# faster!
x = np.asarray(image)
a = np.where(np.all(x == color_test, axis=-1))
print("(faster) frac of color = ", len(a[0]) / ((image.width) * (image.height)))
The code works if I take a shorter path and replace doc.loadPage with doc.getPagePixmap:
import os
import fitz
from PIL import Image
dir = r'C:\Users\...'
files = os.listdir(dir)
print(dir+files[21])
doc = fitz.open(dir+files[21])
pix= doc.getPagePixmap(2)
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
density=img.getdata()
I still don't know why the longer code fails, and the working method doesn't allow me to get a higher-resolution version of the extracted page.
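On the resolution point: to my knowledge PyMuPDF renders at 72 dpi by default, and the zoom factor passed to fitz.Matrix scales that, so zoom = 2 corresponds to roughly 144 dpi. The sketch below only shows this arithmetic and does not call PyMuPDF; both helper names are hypothetical, and the pixel size is an approximation because the library does its own rounding.

```python
def zoom_for_dpi(dpi, base_dpi=72.0):
    """Zoom factor that would give the requested resolution."""
    return dpi / base_dpi


def rendered_size(width_pt, height_pt, zoom):
    """Approximate pixmap size for a page measured in PDF points."""
    return round(width_pt * zoom), round(height_pt * zoom)


zoom = zoom_for_dpi(144)                # same zoom as in the question
size = rendered_size(612, 792, zoom)    # US Letter page in points
```

The resulting zoom value would be passed as fitz.Matrix(zoom, zoom), as in the longer code from the question.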

Python add noise to image breaks PNG

I'm trying to create an image system in Python 3 to be used in a web app. The idea is to load an image from disk and add some random noise to it. When I try this, I get what looks like a totally random image that does not resemble the original:
import cv2
import numpy as np
from skimage.util import random_noise
from random import randint
from pathlib import Path
from PIL import Image
import io
image_files = [
    {
        'name': 'test1',
        'file': 'test1.png'
    },
    {
        'name': 'test2',
        'file': 'test2.png'
    }
]
def gen_image():
    rand_image = randint(0, len(image_files)-1)
    image_file = image_files[rand_image]['file']
    image_name = image_files[rand_image]['name']
    image_path = str(Path().absolute())+'/img/'+image_file
    img = cv2.imread(image_path)
    noise_img = random_noise(img, mode='s&p', amount=0.1)
    img = Image.fromarray(noise_img, 'RGB')
    fp = io.BytesIO()
    img.save(fp, format="PNG")
    content = fp.getvalue()
    return content
gen_image()
I have also tried using pypng:
import png
# Added the following to gen_image()
content = png.from_array(noise_img, mode='L;1')
content.save('image.png')
How can I load a PNG (with alpha transparency) from disk, add some noise to it, and return it so that it can be displayed by web server code (flask, aiohttp, etc.)?
As indicated in the answer by makayla, this makes it better: noise_img = (noise_img*255).astype(np.uint8) but the colors are still wrong and there's no transparency.
Here's the updated function for that:
def gen_image():
    rand_image = randint(0, len(image_files)-1)
    image_file = image_files[rand_image]['file']
    image_name = image_files[rand_image]['name']
    image_path = str(Path().absolute())+'/img/'+image_file
    img = cv2.imread(image_path)
    cv2.imshow('dst_rt', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
    # Problem exists somewhere below this line.
    img = random_noise(img, mode='s&p', amount=0.1)
    img = (img*255).astype(np.uint8)
    img = Image.fromarray(img, 'RGB')
    fp = io.BytesIO()
    img.save(fp, format="png")
    content = fp.getvalue()
    return content
This will pop up a pre-noise image and return the noised image. The RGB (and alpha) problem exists in the returned image.
I think the problem is it needs to be RGBA but when I change to that, I get ValueError: buffer is not large enough
Given all the new information I am updating my answer with a few more tips for debugging the issue.
I found a site here which creates sample transparent images. I created a 64x64 cyan (R=0, G=255, B=255) image with a transparency layer of 0.5. I used this to test your code.
I read in the image two ways to compare: im1 = cv2.imread(fileName) and im2 = cv2.imread(fileName,cv2.IMREAD_UNCHANGED). np.shape(im1) returned (64,64,3) and np.shape(im2) returned (64,64,4). This is why that flag is required--the default imread settings in opencv will read in a transparent image as a normal RGB image.
However opencv reads in as BGR instead of RGB, and since you cannot save out with opencv, you'll need to convert it to the correct order otherwise the image will have reversed color. For example, my cyan image, when viewed with the reversed color appears like this:
You can change this using openCV's color conversion function like this im = cv2.cvtColor(im, cv2.COLOR_BGRA2RGBA) (Here is a list of all the color conversion codes). Again, double check the size of your image if you need to, it should still have four channels since you converted it to RGBA.
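If it helps to see what that conversion does to the data, the channel reordering can also be written with plain NumPy indexing. This is only an illustrative sketch, assuming a 4-channel uint8 array in BGRA order as returned by cv2.imread with cv2.IMREAD_UNCHANGED; the helper name bgra_to_rgba is hypothetical.

```python
import numpy as np


def bgra_to_rgba(bgra):
    """Reorder the last axis from (B, G, R, A) to (R, G, B, A)."""
    # Fancy indexing on the channel axis returns a reordered copy.
    return bgra[..., [2, 1, 0, 3]]


# A half-transparent cyan pixel stored in BGRA order
bgra = np.array([[[255, 255, 0, 128]]], dtype=np.uint8)
rgba = bgra_to_rgba(bgra)
```

The alpha channel stays in the last position, so the array keeps its four channels.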
You can now add your noise to your image. Just so you know, this is also going to add noise to your alpha channel as well, randomly making some pixels more transparent and others less transparent. The random_noise function from skimage converts your image to float and returns it as float. This means the image values, normally integers ranging from 0 to 255, are converted to decimal values from 0 to 1. Your line img = Image.fromarray(noise_img, 'RGB') does not know what to do with the floating point noise_img. That's why the image is all messed up when you save it, as well as when I tried to show it.
So I took my cyan image, added noise, and then converted the floats back to 8 bits.
noise_img = random_noise(im, mode='s&p', amount=0.1)
noise_img = (noise_img*255).astype(np.uint8)
img = Image.fromarray(noise_img, 'RGBA')
It now looks like this (screenshot) using img.show():
I used the PIL library to save out my image instead of openCV so it's as close to your code as possible.
fp = 'saved_im.png'
img.save(fp, format="png")
I loaded the image into powerpoint to double-check that it preserved the transparency when I saved it using this method. Here is a screenshot of the saved image overlaid on a red circle in powerpoint:
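If the noise in the alpha channel mentioned above is not wanted, one option is to apply the noise to the colour channels only. The following is a hand-written sketch with NumPy, not skimage's random_noise: it picks whole pixels at random and sets their R, G and B values to 0 or 255 while leaving alpha as it was. The function name and its parameters are hypothetical.

```python
import numpy as np


def salt_and_pepper_rgb(rgba, amount=0.1, seed=None):
    """Return a copy of a uint8 RGBA array with salt-and-pepper noise
    on the colour channels only; the alpha channel is left untouched."""
    out = rgba.copy()
    rng = np.random.default_rng(seed)
    height, width = out.shape[:2]
    r = rng.random((height, width))
    pepper = r < amount / 2                      # these pixels become black
    salt = (r >= amount / 2) & (r < amount)      # these pixels become white
    out[pepper, :3] = 0
    out[salt, :3] = 255
    return out


# 8x8 test image: uniform colour with half-transparent alpha
img_in = np.zeros((8, 8, 4), dtype=np.uint8)
img_in[..., :3] = 100
img_in[..., 3] = 128
noisy = salt_and_pepper_rgb(img_in, amount=0.5, seed=0)
```

Because the output is already uint8, it can be passed to Image.fromarray(noisy, 'RGBA') without the float-to-8-bit conversion step.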

PDF is blue when it generated from image

I am converting a JPG to PDF using the PIL library. Below is my code.
im = PIL.Image.open(filename)
PIL.Image.Image.save(im, newfilename, "PDF", resoultion = 200.0,quality = 100)
But the output PDF file is blurry and the colour of the image has changed.
Is there any class in PIL that can be used to avoid this?
Thanks in advance.
I have had success with this code for converting a JPG to PDF:
from PIL import Image
inputfilename = "aaron.jpg"
outputfilename = "aaron.pdf"
im = Image.open(inputfilename)
# im.info is a dict, so look the key up instead of using hasattr()
dpi_info = im.info.get("dpi")
if dpi_info:
    dpi = float(dpi_info[0])  # Assume horizontal DPI is same as vertical DPI.
else:
    dpi = 72  # Assume it's 72 if it's not specified in the JPG.
im.save(outputfilename, resolution=dpi, quality=100)
im.save() also works without the resolution and quality parameters, but I included them here because your example showed them.

Image resize and get data in a buffer using python

I want to resize the input image to a fixed size. Then I want the entire content of the resized image file in a buffer for further use, such as appending it to another buffer (data).
Currently I am doing it using the following Python function:
from PIL import Image
def get_resize_img(img_file):
    img = Image.open(img_file)
    img = img.resize((640,960), Image.NEAREST)
    img.save("tmp_out.jpg")
    fp = open("tmp_out.jpg", "rb")
    data = fp.read()
    fp.close()
    print("img sz:", len(data))
    return data
Is there a better way to achieve this without writing to a dummy file (tmp_out.jpg) and reading back from it?
-Mohan
import io
from PIL import Image
def get_resize_img(img_file):
    # BytesIO keeps the encoded file in memory, so no temporary file is needed
    buffer = io.BytesIO()
    img = Image.open(img_file)
    img = img.resize((640,960), Image.NEAREST)
    format = "YOUR_FORMAT"  # JPEG, PNG, etc.
    img.save(buffer, format)
    return buffer.getvalue()
I think you can use the getdata to get the image data, which is in fact the image buffer. Then you can reconstruct the image using frombuffer. Or maybe you can just return the resized Image object and use it later.
You can check out the document here.
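Note that raw pixel data is not the same thing as the content of an encoded image file: it has no header and no compression, so the mode and size have to be carried separately. A minimal sketch of that round trip with Pillow, using tobytes() and Image.frombytes() on a small synthetic image (the 4x2 size and the colour are arbitrary choices for illustration):

```python
from PIL import Image

# Small synthetic RGB image standing in for the resized picture
img = Image.new("RGB", (4, 2), (1, 2, 3))

# Raw pixel bytes: 3 bytes per pixel, no file header
raw = img.tobytes()

# Rebuilding the image needs the mode and size in addition to the bytes
restored = Image.frombytes("RGB", img.size, raw)
```

If the goal is the bytes of a JPEG or PNG file, saving into an in-memory buffer is the closer match; the raw bytes are more useful when the pixels themselves are processed further.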
If you are using numpy, then the following can be done:
data = numpy.asarray(im)
