How to convert the pdf file to jpeg images - python

Here is my program. I want to convert a PDF file into JPEG images. With the code below I get a PIL.PpmImagePlugin object. How can I convert it to JPEG format? Thank you in advance.
from pdf2image import convert_from_path
images = convert_from_path('/home/cioc/Desktop/testingFiles/pdfurl-guide.pdf')
print(images)

You can add an output path and an output format for the images. Each page of your pdf will be saved in that directory in the specified format.
Add these keyword arguments to your code.
images = convert_from_path(
    '/home/cioc/Desktop/testingFiles/pdfurl-guide.pdf',
    output_folder='img',
    fmt='jpeg'
)
This saves each page of your PDF as a JPEG image inside img/. Create the img directory beforehand if it does not already exist.
Alternatively, you can save each page using a loop by calling save() on each image.
from pdf2image import convert_from_path
images = convert_from_path('/home/cioc/Desktop/testingFiles/pdfurl-guide.pdf')
for page_no, image in enumerate(images):
    image.save(f'page-{page_no}.jpeg')

You could use the pdf2image parameter fmt='jpeg' to make it return JPEG images instead.
You can also just manipulate the PPM image as you would a normal JPEG, since PPM is only the backend file type. If you call image.save('path.jpg') on one of the returned images, it will be saved as a JPEG.
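The save() approach above can be wrapped in a small helper. This is only a sketch: save_pages is a hypothetical name, it assumes each item has a Pillow-style save(path, format) method (as the images returned by convert_from_path do), and the zero-padded file naming is my own choice.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page"):
    """Save each page image as a JPEG in out_dir and return the paths.

    `images` is any iterable of objects with a Pillow-style
    save(path, format) method, for example the list returned by
    pdf2image.convert_from_path().
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    paths = []
    for page_no, image in enumerate(images, start=1):
        # Zero-pad the page number so the files sort in page order.
        path = out_dir / f"{stem}-{page_no:03d}.jpg"
        image.save(str(path), "JPEG")
        paths.append(path)
    return paths


# Intended use (needs pdf2image and poppler installed):
#   from pdf2image import convert_from_path
#   save_pages(convert_from_path("input.pdf"), "img")
```

Passing "JPEG" explicitly means the format does not depend on the file extension.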

Related

how to convert pdf to image and return back to pdf

I am trying to convert a PDF to images, do some manipulation on the images, and then convert the manipulated images back to a PDF using Python.
I have managed to convert the PDF to images, but I don't want to save the files locally. Instead, I need to manipulate the images in memory and then convert them back to a PDF file.
# Import the module
from pdf2image import convert_from_path
# Load the PDF pages as images with convert_from_path
images = convert_from_path('example.pdf')
for i in range(len(images)):
    # Save each page of the PDF as an image
    images[i].save('page' + str(i) + '.jpg', 'JPEG')
# This saves the pages locally, but I need to apply an operation such as a background change and then convert the result back to PDF.
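One possible way to do the round trip without writing intermediate image files is to edit the page images in memory and let Pillow write them out as a multi-page PDF. The sketch below is an assumption-laden illustration, not a tested answer to this question: pages_to_pdf_bytes and transform are hypothetical names, and it assumes the pages are in a mode Pillow's PDF writer accepts (such as RGB, which is what pdf2image returns by default).

```python
import io


def pages_to_pdf_bytes(pages, transform):
    """Apply transform to every page image and return one PDF as bytes.

    `pages` is a list of Pillow images (e.g. from
    pdf2image.convert_from_path); `transform` is a hypothetical callable
    that takes an image and returns the edited image.
    """
    edited = [transform(page) for page in pages]
    if not edited:
        raise ValueError("no pages to write")
    buffer = io.BytesIO()
    # With save_all=True, Pillow's PDF writer appends the images passed in
    # append_images after the first page.
    edited[0].save(buffer, format="PDF", save_all=True, append_images=edited[1:])
    return buffer.getvalue()


# Intended use (needs pdf2image, poppler and Pillow installed):
#   from pdf2image import convert_from_path
#   from PIL import ImageOps
#   data = pages_to_pdf_bytes(convert_from_path("example.pdf"), ImageOps.grayscale)
#   with open("edited.pdf", "wb") as fh:
#       fh.write(data)
```

Note that the resulting PDF contains page images only, so any selectable text in the original is lost.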

Unable To Convert PDF to Text format

I am getting the error below while parsing a PDF file with PyPDF2.
I have attached the PDF along with the error.
Can anyone help?
import PyPDF2
def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)
    return page_content
error:
PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.
There are some open-source OCR tools like Tesseract or OpenCV.
If you want to use Tesseract, for example, there is a Python wrapper library called pytesseract.
Most OCR tools work on images, so you first have to convert your PDF into an image file format like PNG or JPG. After this you can load the image and process it with pytesseract.
Here is some sample code showing how you can use pytesseract. Let's suppose you have already converted your PDF to an image with the filename pdfName.png:
from PIL import Image
import pytesseract
def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Use Pillow's Image class to open the image and pytesseract to detect the text in it
    text = pytesseract.image_to_string(Image.open(filename))
    return text
print(ocr_core('pdfName.png'))
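For a PDF with several pages, the same idea can be applied to each page image in turn. The following is a rough sketch under my own assumptions: ocr_pages is a hypothetical helper, the OCR function is passed in as a parameter, and the page separator format is arbitrary.

```python
def ocr_pages(pages, image_to_string):
    """Run OCR on each page image and join the text, one block per page.

    `pages` is an iterable of page images and `image_to_string` is the
    OCR function to call on each one, e.g. pytesseract.image_to_string.
    """
    blocks = []
    for page_no, page in enumerate(pages, start=1):
        text = image_to_string(page).strip()
        blocks.append(f"--- page {page_no} ---\n{text}")
    return "\n\n".join(blocks)


# Intended use (needs pdf2image, poppler, pytesseract and Tesseract installed):
#   from pdf2image import convert_from_path
#   import pytesseract
#   print(ocr_pages(convert_from_path("input.pdf"), pytesseract.image_to_string))
```

Taking the OCR function as an argument keeps the helper independent of any one OCR library.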

How to find the file name for files generated by pdf2image

I am trying to convert my pdf files to jpg. I first use pdf2image to save the file as a .ppm. Then I want to use PIL to convert the .ppm to .jpg.
How do I find the name of the file that pdf2image saved?
Here is my code:
def to_jpg(just_ids):
    for just_id in just_ids:
        image = convert_from_path('/Users/davidtannenbaum/Desktop/scraped/{}.pdf'.format(just_id), output_folder='/Users/davidtannenbaum/Desktop/scraped/')
        file_name = ?
        im = Image.open("/Users/davidtannenbaum/Desktop/scraped/{}.ppm".format(file_name))
        im.save("/Users/davidtannenbaum/Desktop/scraped/{}.jpg".format(just_id))
You don't need to. The image variable should contain a list of Image objects, so you can simply do:
for i, im in enumerate(image):
    im.save("/Users/davidtannenbaum/Desktop/scraped/{}_{}.jpg".format(just_id, i))
The convert_from_path() function has a few more parameters you can use. You can set the paths_only parameter to True and the fmt parameter to "jpeg".
This will save your images directly to your output folder in JPG format instead of PPM, and the image variable will contain the paths to each image instead of the image objects.
for just_id in just_ids:
    image = convert_from_path('/Users/davidtannenbaum/Desktop/scraped/{}.pdf'.format(just_id), output_folder='/Users/davidtannenbaum/Desktop/scraped/', fmt="jpeg", paths_only=True)
To convert every PDF in a folder, you can also pass output_file to control the generated file names:
import os
from pdf2image import convert_from_path
pdf_path = '/path/to/pdf_images/'
output_folder = '/path/for/output/images/'
for pdf in os.listdir(pdf_path):
    filename = os.path.splitext(pdf)[0]  # file name without the extension
    pdfs = convert_from_path(os.path.join(pdf_path, pdf), output_folder=output_folder, output_file=filename, fmt="jpeg")
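When deriving an output name from a PDF file name, splitting on the first dot drops part of names such as report.v2.pdf, and a folder listing may also contain files that are not PDFs. A minimal sketch of both points, using only the standard library (output_stem and only_pdfs are hypothetical helper names):

```python
import os


def output_stem(pdf_name):
    """Return the file name without its directory and final extension."""
    return os.path.splitext(os.path.basename(pdf_name))[0]


def only_pdfs(names):
    """Keep only the names that end in .pdf (case-insensitive)."""
    return [name for name in names if name.lower().endswith(".pdf")]


# Splitting on the first dot loses everything after it:
#   "report.v2.pdf".split(".")[0]  gives "report"
#   output_stem("report.v2.pdf")   gives "report.v2"
```

The stem can then be passed as output_file so that each PDF's pages get distinct names.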

building tiff stack with wand

How can I achieve this with Wand library for python:
convert *.png stack_of_multiple_pngs.tiff
?
In particular, how can I read every png image, pack them into a sequence and then save the image as tiff stack:
with Image(filename='*.tiff') as img:
    img.save(filename='stack_of_multiple_pngs.tiff')
I understand how to do it for gifs though, i.e. as described in docs. But what about building a sequence as a list and appending every new image I read as a SingleImage()?
Having trouble figuring it out right now.
With wand you would use Image.sequence, not a wildcard filename *.
from wand.image import Image
from glob import glob
# Get the list of all image filenames to include
image_names = glob('*.png')
# Create a new Image and extend its sequence
with Image() as img:
    img.sequence.extend([Image(filename=f) for f in image_names])
    img.save(filename='stack_of_multiple_pngs.tiff')
The sequence_test.py file under the test directory will have better examples of working with the image sequence.

Generate barcode image from PIL.EPSImageFile instance

I want to generate a barcode image. So, I used elaphe package. It works correctly but it returns PIL.EPSImageFile instance. I don't know how I can convert it to image format like SVG, PNG or JPG.
The code I have written is:
barcode('code128', 'barcodetest')
And it returns:
<PIL.EpsImagePlugin.EpsImageFile image mode=RGB size=145x72 at 0x9AA47AC>
How can I convert this instance to image?
Actually I think my question is wrong but I don't know how to explain it well!
Simply save that file object to something with a .png or .jpg filename:
bc = barcode('qrcode',
             'Hello Barcode Writer In Pure PostScript.',
             options=dict(version=9, eclevel='M'),
             margin=10, data_mode='8bits')
bc.save('yourfile.jpg')
or state the format explicitly:
bc.save('yourfile.jpg', 'JPEG')
PIL will then convert the image to the correct format.
Note that the PIL EPS module uses the gs command from the Ghostscript project to do its conversions, so you'll need to have it installed for this to work.
