Python PIL can't open PDFs for some reason - python

So my program is able to open PNGs but not PDFs, so I made this just to test, and it still isn't able to open even a simple PDF. And I don't know why.
from PIL import Image
with Image.open(r"Adams, K\a.pdf") as file:
    print file
Traceback (most recent call last):
  File "C:\Users\Hayden\Desktop\Scans\test4.py", line 3, in <module>
    with Image.open(r"Adams, K\a.pdf") as file:
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 2590, in open
    % (filename if filename else fp))
IOError: cannot identify image file 'Adams, K\\a.pdf'
After trying PyPDF2 as suggested (Thanks for the link by the way), I am getting this error with my code.
import PyPDF2
pdf_file= open(r"Adams, K (6).pdf", "rb")
read_pdf= PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
print number_of_pages
Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]

Following this article: https://www.geeksforgeeks.org/convert-pdf-to-image-using-python/ you can use the pdf2image package to convert the pdf to a PIL object.
This should solve your problem:
from pdf2image import convert_from_path
fname = r"Adams, K\a.pdf"
pil_image_lst = convert_from_path(fname) # This returns a list even for a 1 page pdf
pil_image = pil_image_lst[0]
I just tried this out with a one page pdf.

As pointed out by @Kevin (see the comment below), PIL supports writing PDFs but not reading them.
To read a PDF you will need another library. This tutorial covers handling PDFs with PyPDF2:
https://pythonhosted.org/PyPDF2/?utm_source=recordnotfound.com

Related

Deep Learning model on M1 chip(TensorFlow and Keras) [duplicate]

I am trying to read a PNG file into a Python Flask application running in Docker and am getting this error:
ValueError: Could not find a format to read the specified file in mode 'i'
I uploaded the file through an HTML form and am now trying to read it for further processing. I see that scipy.misc.imread is deprecated, so I am trying to replace it with imageio.imread:
if request.method == 'POST':
    file = request.files['image']
    if not file:
        return render_template('index.html', label="No file")
    # img = misc.imread(file)
    img = imageio.imread(file)
I get this error:
  File "./appimclass.py", line 34, in make_prediction
    img = imageio.imread(file)
  File "/usr/local/lib/python3.6/site-packages/imageio/core/functions.py", line 221, in imread
    reader = read(uri, format, "i", **kwargs)
  File "/usr/local/lib/python3.6/site-packages/imageio/core/functions.py", line 139, in get_reader
    "Could not find a format to read the specified file " "in mode %r" % mode
Different, but in case helpful. I had an identical error in a different library (skimage), and the solution was to add an extra 'plugin' parameter like so -
image = io.imread(filename,plugin='matplotlib')
Had the exact same problem recently, and the issue was a single corrupt file. Best is to use something like PIL to check for bad files.
import os
from PIL import Image

dir_path = "/path/"
for filename in os.listdir(dir_path):
    if filename.endswith('.jpg'):
        # Build the path portably instead of hard-coding a backslash
        path = os.path.join(dir_path, filename)
        try:
            with Image.open(path) as img:  # open the image file
                img.verify()  # verify that it is, in fact, an image
        except (IOError, SyntaxError) as e:
            print('Bad file:', filename)
            # os.remove(path)  (maybe)
I had this problem today, and found that if I closed the file before reading it into imageio the problem went away.
Error was:
  File "/home/vinny/pvenvs/chess/lib/python3.6/site-packages/imageio/core/functions.py", line 139, in get_reader
    "Could not find a format to read the specified file " "in mode %r" % mode
ValueError: Could not find a format to read the specified file in mode 'i'
Solution:
Put file.close() before images.append(imageio.imread(filename)), not after.
Add the option "pilmode":
imageio.imread(filename,pilmode="RGB")
It worked for me.
I encountered the same error and eventually found that it was because the picture was damaged.
I had accidentally saved some images as PDF, which caused the error. It was resolved after deleting the files in the incompatible format.
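A file like that can be spotted before it reaches an image library by looking at its first bytes instead of its extension. This is only a minimal sketch: sniff_format, sniff_path and SIGNATURES are hypothetical names, only the three formats mentioned in these threads are covered, and it assumes the signature sits at the very start of the file.

```python
import io

# Leading "magic" bytes for the formats discussed in these threads.
# Assumption: the signature is at byte 0 of the file.
SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
)


def sniff_format(stream):
    """Return 'pdf', 'png', 'jpeg' or None for a seekable binary stream."""
    start = stream.tell()
    header = stream.read(8)
    stream.seek(start)  # rewind so the caller can still read the whole file
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


def sniff_path(path):
    """Same check for a file on disk (hypothetical convenience wrapper)."""
    with open(path, "rb") as fh:
        return sniff_format(fh)
```

A file named a.jpg for which sniff_path returns 'pdf' is a PDF with the wrong extension, and would need a PDF library rather than an image reader.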

Getting Assertion error while reading the PDF file python - pypdf2

I am getting the below error when I try to read a PDF file.
Code:
from PyPDF2 import PdfFileReader
import os
os.chdir("Path to dir")
pdf_document = 'sample.pdf'
pdf = PdfFileReader(pdf_document,'rb') #Error here
Error:
Traceback (most recent call last):
  File "/home/krishna/PycharmProjects/sample/sample.py", line 9, in <module>
    pdf = PdfFileReader(filehandle)
  File "/home/krishna/PycharmProjects/AI_DRC/venv/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/home/krishna/PycharmProjects/AI_DRC/venv/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1838, in read
    assert start >= last_end
AssertionError
NOTE: File is 18 MB in size
Here is what I wrote, and it works for me. The PDF is in the same folder; you can also use os to build the path as a string.
import PyPDF2
pdf_file = PyPDF2.PdfFileReader("Sample.pdf")  # open the file by name; a path built with os works as well
page_content = pdf_file.getPage(0).extractText()  # get page one (index zero) and extract its text
print(page_content)  # you can then do whatever you want with it
I think the problem with your program is the "rb" argument. That is a mode for open() in normal file handling; PdfFileReader does not take a file mode, and its second positional parameter is strict, so "rb" ends up being passed as strict. PyPDF2 provides the classes PdfFileReader, PdfFileWriter and PdfFileMerger for working with the file.
Hope it helps.
If you encounter any problem, just mention it and I will try to get back to you.

Read image from URL and keep it in memory

I am using Python and requests library. I just want to download an image to a numpy array for example and there are multiple questions where you can find different combinations (using opencv, PIL, requests, urllib...)
None of them work for my case. I basically receive this error when I try to download the image:
cannot identify image file <_io.BytesIO object at 0x7f6a9734da98>
A simple example of my code can be:
import requests
from PIL import Image
response = requests.get(url, stream=True)
response.raw.decode_content = True
image = Image.open(response.raw)
image.show()
The main thing that is driving me crazy is that, if I download the image to a file (using urllib), the whole process runs without any problem!
import os
import urllib.request
urllib.request.urlretrieve(garment.url, os.path.join(download_folder, garment.get_path()))
What can I be doing wrong?
EDIT:
My mistake turned out to be in how I built the URL, not in requests or PIL. My previous code example should work fine if the URL is correct.
I think you are somehow consuming the data from the response.raw object before passing it to Image.open. The raw object of a requests response is not seekable, so you can read from it only once:
>>> response.raw.seekable()
False
First open is ok:
>>> response.raw.tell()
0
>>> image = Image.open(response.raw)
The second open raises an error (the stream position is already at the end of the file):
>>> response.raw.tell()
695 # this file length https://docs.python.org/3/_static/py.png
>>> image = Image.open(response.raw)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2295, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x7f11850074c0>
You should save the data from the requests response in a seekable file-like object (or in a file, of course) if you want to use it several times:
import io
image_data = io.BytesIO(response.raw.read())
Now you can read the image stream and rewind it as many times as needed:
>>> image_data.seekable()
True
>>> image = Image.open(image_data)
>>> image1 = Image.open(image_data)
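The read-once behaviour can also be reproduced without a network connection. In this minimal sketch, OneShotStream is a hypothetical stand-in for response.raw and buffer_stream is a hypothetical helper name; neither is part of requests or PIL.

```python
import io


class OneShotStream:
    """Hypothetical stand-in for a non-seekable stream such as response.raw."""

    def __init__(self, payload):
        self._payload = payload

    def seekable(self):
        return False

    def read(self):
        # Hand out everything once; later reads find nothing left.
        data, self._payload = self._payload, b""
        return data


def buffer_stream(stream):
    """Copy a read-once stream into a seekable in-memory buffer."""
    return io.BytesIO(stream.read())


raw = OneShotStream(b"image bytes")
image_data = buffer_stream(raw)  # consumes the raw stream
leftover = raw.read()            # nothing remains in the raw stream
first = image_data.read()        # full payload
image_data.seek(0)               # rewind the buffer
second = image_data.read()       # full payload again
```

The buffer costs one in-memory copy of the payload, which is usually acceptable for a single image.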

Randomly shuffle the pages of a PDF file using pyPDF or PyPDF2

I'm not very experienced in programming. What I'm trying to do is to randomly shuffle the pages of a pdf and output it to another pdf.
Searching online I found the following two solutions (source 1, source 2):
#!/usr/bin/env python2
import random, sys
from PyPDF2 import PdfFileWriter, PdfFileReader
input = PdfFileReader(sys.stdin)
output = PdfFileWriter()
pages = range(input.getNumPages())
random.shuffle(pages)
for i in pages:
    output.addPage(input.getPage(i))
output.write(sys.stdout)
And this one:
#!/usr/bin/python
import sys
import random
from pyPdf import PdfFileWriter, PdfFileReader
# read input pdf and instantiate output pdf
output = PdfFileWriter()
input1 = PdfFileReader(file(sys.argv[1],"rb"))
# construct and shuffle page number list
pages = list(range(input1.getNumPages()))
random.shuffle(pages)
# display new sequence
print 'Reordering pages according to sequence:'
print pages
# add the new sequence of pages to output pdf
for page in pages:
    output.addPage(input1.getPage(page))
# write the output pdf to file
outputStream = file(sys.argv[1]+'-mixed.pdf','wb')
output.write(outputStream)
outputStream.close()
I tried both (and both with PyPDF2 and pyPdf) and both indeed create a new pdf file, but this file is simply empty (and has 0KB) (when I enter, let's say "shuffle.py new.pdf").
I'm using PyCharm, and one problem I encounter (and don't really understand) is that it says: "Cannot find reference 'PdfFileWriter'".
I would appreciate any help understanding what I'm doing wrong :)
EDIT:
As suggested by Tom Dalton, I'm posting what happens when I run the first one:
C:\Users\Anwender\AppData\Local\Temp\shuffle.py\venv\Scripts\python.exe "E:/Shuffle PDF/shuffle.py"
PdfReadWarning: PdfFileReader stream/file object is not in binary mode. It may not be read correctly. [pdf.py:1079]
Traceback (most recent call last):
  File "E:/Shuffle PDF/shuffle.py", line 5, in <module>
    input = PdfFileReader(sys.stdin)
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\Python27\lib\site-packages\PyPDF2\pdf.py", line 1689, in read
    stream.seek(-1, 2)
IOError: [Errno 22] Invalid argument
Process finished with exit code 1
From the comments I infer that the fact that a new PDF is created is only due to me typing "shuffle.py newfile.pdf" into the terminal :D
EDIT 2: I now figured it out; this now works:
from PyPDF2 import PdfFileReader, PdfFileWriter
import random, sys
output = PdfFileWriter()
input = PdfFileReader(file("test.pdf", "rb"))
pages = range(input.getNumPages())
random.shuffle(pages)
for i in pages:
    output.addPage(input.getPage(i))
outputStream = file(r"output2.pdf", "wb")
output.write(outputStream)
outputStream.close()
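The working script above is Python 2: it uses file() and shuffles the result of range() directly. In Python 3, range() returns an immutable sequence, so random.shuffle(range(n)) raises a TypeError and the indices have to be copied into a list first. A minimal sketch of just the page-order step, assuming Python 3; shuffled_page_order and its seed parameter are hypothetical names for illustration.

```python
import random


def shuffled_page_order(num_pages, seed=None):
    """Return the page indices 0..num_pages-1 in random order."""
    rng = random.Random(seed)        # a seed makes the order reproducible
    pages = list(range(num_pages))   # copy into a list so it can be shuffled
    rng.shuffle(pages)
    return pages


order = shuffled_page_order(10, seed=1)
```

Each index in the returned list could then be passed to getPage and addPage exactly as in the loop above.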

Convert PDF page to image with pyPDF2 and BytesIO

I have a function that gets a page from a PDF file via PyPDF2 and should convert the first page to a PNG (or JPG) with Pillow (the PIL fork):
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
from PIL import Image
import io
# Open PDF Source #
app_path = os.path.dirname(__file__)
src_pdf= PdfFileReader(open(os.path.join(app_path, "../../../uploads/%s" % filename), "rb"))
# Get the first page of the PDF #
dst_pdf = PdfFileWriter()
dst_pdf.addPage(src_pdf.getPage(0))
# Create BytesIO #
pdf_bytes = io.BytesIO()
dst_pdf.write(pdf_bytes)
pdf_bytes.seek(0)
file_name = "../../../uploads/%s_p%s.png" % (name, pagenum)
img = Image.open(pdf_bytes)
img.save(file_name, 'PNG')
pdf_bytes.flush()
That results in an error:
OSError: cannot identify image file <_io.BytesIO object at 0x0000023440F3A8E0>
I found some threads with a similar issue, (PIL open() method not working with BytesIO) but I cannot see where I am wrong here, as I have pdf_bytes.seek(0) already added.
Any hints appreciated
Per the documentation:
write(stream) Writes the collection of pages added to this object out
as a PDF file.
Parameters: stream – An object to write the file to. The object must
support the write method and the tell method, similar to a file
object.
So the object pdf_bytes contains a PDF file, not an image file.
Code like the above sometimes works because some PDF files contain nothing but a single JPEG image as their content. If your PDF is a normal PDF file, you can't just read its bytes and parse them as an image.
For a more robust implementation, see: https://stackoverflow.com/a/34116472/334999
import glob, sys, fitz
# To get better resolution
zoom_x = 2.0 # horizontal zoom
zoom_y = 2.0 # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y) # zoom factor 2 in each dimension
filename = "/xyz/abcd/1234.pdf" # name of pdf file you want to render
doc = fitz.open(filename)
for page in doc:
    pix = page.get_pixmap(matrix=mat) # render page to an image
    pix.save("/xyz/abcd/1234-page-%i.png" % page.number) # store each page as its own PNG so pages don't overwrite each other
Credit: Convert PDF to Image in Python Using PyMuPDF
https://towardsdatascience.com/convert-pdf-to-image-in-python-using-pymupdf-9cc8f602525b
