Error while image extraction from PDF in python - python

I am trying to extract images of all formats from a PDF. After some searching I found this page on StackOverflow and tried the code from it, but it fails with an error.
I am using Python 3.x, and the code I am using is below. I went through the comments there but couldn't figure it out. Please help me resolve this.
Here is the sample PDF.
import PyPDF2
from PIL import Image

if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(open("Aadhaar1.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"
            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()
I was reading some comments and going through links and found this problem solved on this page. Can someone please help me implement it?

This is an error in the PyPDF2 library. Try uninstalling the library and reinstalling it with the changes applied, or look at the changes on GitHub and apply them yourself. I hope that works.

As of today, I'm still getting the error NotImplementedError: unsupported filter /DCTDecode
I've PyPDF2 v 1.26.0 installed, using Python3 3.7.5. My Python code is the same as above.
Is there a solution yet?

Same error for me with Python 3.9 and PyPDF2 1.26 at the time of this writing.
data = xObject[obj].getData()
was the problem line. My PDF had JPG images, and that line failed with the same NotImplementedError exception.
Changing the line for the /DCTDecode part to
data = xObject[obj]._data
kind of worked for me. It gives the plain JPG stream stored in the PDF.
In other words, use a separate data = ... line in each if/filter section. I have not tried the JP2 part.
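A minimal sketch of the "separate data line per filter" idea, assuming PyPDF2 1.26-style stream objects (dict-style key access, getData() for decoded bytes, the private _data attribute for raw bytes) and a single filter name per image. The helper image_payload and the FakeStream stand-in are hypothetical names used only for illustration; they are not part of PyPDF2.

```python
# Filters whose stream is already a complete image file. For these the raw
# bytes are written out unchanged instead of being decoded.
PASSTHROUGH_EXTENSIONS = {"/DCTDecode": ".jpg", "/JPXDecode": ".jp2"}


def image_payload(image_obj):
    """Return (extension, data) for one image XObject.

    Assumption: image_obj behaves like a PyPDF2 1.26 stream object and its
    '/Filter' entry is a single name, not an array of filters.
    """
    image_filter = image_obj["/Filter"]
    if image_filter in PASSTHROUGH_EXTENSIONS:
        # getData() raises NotImplementedError for these filters,
        # so read the raw stream bytes instead.
        return PASSTHROUGH_EXTENSIONS[image_filter], image_obj._data
    if image_filter == "/FlateDecode":
        # Decoded pixel data; it still has to go through Image.frombytes().
        return ".png", image_obj.getData()
    raise ValueError("unhandled filter: %s" % image_filter)


class FakeStream(dict):
    """Hypothetical stand-in for a PyPDF2 stream object (demo only)."""

    def __init__(self, entries, raw):
        super().__init__(entries)
        self._data = raw

    def getData(self):
        if self["/Filter"] != "/FlateDecode":
            raise NotImplementedError("unsupported filter %s" % self["/Filter"])
        return self._data


jpeg = FakeStream({"/Filter": "/DCTDecode"}, b"\xff\xd8 raw jpeg bytes")
flate = FakeStream({"/Filter": "/FlateDecode"}, b"\x00\x01\x02")
print(image_payload(jpeg)[0], image_payload(flate)[0])
```

Because _data is a private attribute, this may stop working in other PyPDF2 versions; treat it as a workaround rather than a supported API.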


Running into problems setting cover art for MP4 files using Python and Mutagen

Following multiple suggestions from other StackOverflow questions and the mutagen documentation, I was able to come up with code to get and set every ID3 tag in both MP3 and MP4 files. The issue I have is with setting the cover art for M4B files.
I have reproduced the code exactly like it is laid out in this answer:
Embedding album cover in MP4 file using Mutagen
But I am still receiving errors when I attempt to run the code. If I run the code with the 'albumart' value by itself I receive the error:
MP4file.tags['covr'] = albumart
Exception has occurred: TypeError
can't concat int to bytes
However, if I surround the albumart variable with brackets, as shown in the aforementioned StackOverflow question, I get this output:
MP4file.tags['covr'] = [albumart]
Exception has occurred: struct.error
required argument is not an integer
Here is the function in its entirety. The MP3 section works without any problems.
from mutagen.mp3 import MP3
from mutagen.mp4 import MP4, MP4Cover

def set_cover(filename, cover):
    r = requests.get(cover)
    with open('C:/temp/cover.jpg', 'wb') as q:
        q.write(r.content)
    if(filename.endswith(".mp3")):
        MP3file = MP3(filename, ID3=ID3)
        if cover.endswith('.jpg') or cover.endswith('.jpeg'):
            mime = 'image/jpg'
        else:
            mime = 'image/png'
        with open('C:/temp/cover.jpg', 'rb') as albumart:
            MP3file.tags.add(APIC(encoding=3, mime=mime, type=3, desc=u'Cover', data=albumart.read()))
        MP3file.save(filename)
    else:
        MP4file = MP4(filename)
        if cover.endswith('.jpg') or cover.endswith('.jpeg'):
            cover_format = 'MP4Cover.FORMAT_JPEG'
        else:
            cover_format = 'MP4Cover.FORMAT_PNG'
        with open('C:/temp/cover.jpg', 'rb') as f:
            albumart = MP4Cover(f.read(), imageformat=cover_format)
        MP4file.tags['covr'] = [albumart]
I have been trying to figure out what I am doing wrong for two days now. If anyone can help me spot the problem I would be in your debt.
Thanks!
In the source code of mutagen at the location where the exception is being raised I've found the following lines:
def __render_cover(self, key, value):
    ...
    for cover in value:
        try:
            imageformat = cover.imageformat
        except AttributeError:
            imageformat = MP4Cover.FORMAT_JPEG
        ...
        Atom.render(b"data", struct.pack(">2I", imageformat, 0) + cover))
    ...
Here, key is the name of the cover tag and value is the data read from the image, wrapped in an MP4Cover object. It turns out that if you iterate over an MP4Cover object, as the above code does, each iteration yields one byte of the image as an int.
Moreover, in Python 3, struct.pack returns an object of type bytes. I think the cover argument was intended to be the collection of bytes taken from the cover image.
In the code you've given above, the bytes of the cover image are wrapped inside an object of type MP4Cover that cannot be added to bytes as is done in the second argument of Atom.render.
To avoid having to edit or patch the mutagen library source code, the trick is to convert the 'MP4Cover' object to bytes and wrap the result inside a collection, as shown below.
import requests
from mutagen.id3 import ID3, APIC
from mutagen.mp3 import MP3
from mutagen.mp4 import MP4, MP4Cover

def set_cover(filename, cover):
    r = requests.get(cover)
    with open('C:/temp/cover.jpg', 'wb') as q:
        q.write(r.content)
    if(filename.endswith(".mp3")):
        MP3file = MP3(filename, ID3=ID3)
        if cover.endswith('.jpg') or cover.endswith('.jpeg'):
            mime = 'image/jpg'
        else:
            mime = 'image/png'
        with open('C:/temp/cover.jpg', 'rb') as albumart:
            MP3file.tags.add(APIC(encoding=3, mime=mime, type=3, desc=u'Cover', data=albumart.read()))
        MP3file.save(filename)
    else:
        MP4file = MP4(filename)
        if cover.endswith('.jpg') or cover.endswith('.jpeg'):
            cover_format = 'MP4Cover.FORMAT_JPEG'
        else:
            cover_format = 'MP4Cover.FORMAT_PNG'
        with open('C:/temp/cover.jpg', 'rb') as f:
            albumart = MP4Cover(f.read(), imageformat=cover_format)
        MP4file.tags['covr'] = [bytes(albumart)]
        MP4file.save(filename)
I've also added MP4file.save(filename) as the last line of the code to persist the changes made to the file.
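The two error messages from the question can be reproduced with the standard library alone, which may help to see where each one comes from. This is only a sketch: image_bytes is made-up data, and the integer 13 is assumed to be the value of MP4Cover.FORMAT_JPEG. It suggests one possible reading of the second traceback, namely that the string 'MP4Cover.FORMAT_JPEG' (rather than the constant itself) reaches struct.pack.

```python
import struct

image_bytes = b"\xff\xd8\xff\xe0"  # made-up stand-in for JPEG file contents
JPEG_CODE = 13  # assumed value of MP4Cover.FORMAT_JPEG

# Case 1: the cover is assigned without a list, so the loop iterates over
# the bytes themselves and each item is an int.
single_item = next(iter(image_bytes))
try:
    struct.pack(">2I", JPEG_CODE, 0) + single_item
except TypeError as exc:
    unwrapped_error = exc  # "can't concat int to bytes"

# Case 2: the image format is a string instead of an integer constant.
try:
    struct.pack(">2I", "MP4Cover.FORMAT_JPEG", 0)
except struct.error as exc:
    string_format_error = exc  # "required argument is not an integer"

# With an integer format code and the whole bytes value, packing succeeds.
header_and_data = struct.pack(">2I", JPEG_CODE, 0) + image_bytes

print(unwrapped_error)
print(string_format_error)
print(len(header_and_data))
```

If that reading is right, passing the constant itself (imageformat=MP4Cover.FORMAT_JPEG, without quotes) might also avoid the second error while keeping the PNG/JPEG distinction, but I have not verified that against the original file.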

Python urlretrieve downloading corrupted images

I am downloading a list of images (all .jpg) from the web using this python script:
__author__ = 'alessio'
import urllib.request

fname = "inputs/skyscraper_light.txt"
with open(fname) as f:
    content = f.readlines()
for link in content:
    try:
        link_fname = link.split('/')[-1]
        urllib.request.urlretrieve(link, "outputs_new/" + link_fname)
        print("saved without errors " + link_fname)
    except:
        pass
In OS X Preview I see the images just fine, but I can't open them with any image editor (for example, Photoshop says "Could not complete your request because Photoshop does not recognize this type of file."), and when I try to attach them to a Word document, the files are not even shown as picture files in the image browsing dialog.
What am I doing wrong?
As J.F. Sebastian suggested in the comments, the issue was related to the newline in the filename.
To make my script work, you need to replace
link_fname = link.split('/')[-1]
with
link_fname = link.strip().split('/')[-1]
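To see why the strip() matters: each line returned by readlines() still carries its trailing newline, so the last URL component (and therefore the saved filename) ends in "\n". A small sketch, using a hypothetical URL and a hypothetical helper name filename_from_link:

```python
def filename_from_link(link):
    """Return the last path component of a URL taken from a text-file line."""
    # strip() removes the trailing newline (and any surrounding whitespace)
    # before the URL is split into components.
    return link.strip().split("/")[-1]


raw_line = "http://example.com/images/tower.jpg\n"  # hypothetical line from readlines()

print(repr(raw_line.split("/")[-1]))      # filename still ends with a newline
print(repr(filename_from_link(raw_line)))  # clean filename
```

Stripping the line also gives a clean URL to pass to urlretrieve, not only a clean output filename.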

Convert html to pdf using Python/Flask

I want to generate pdf file from html using Python + Flask. To do this, I use xhtml2pdf. Here is my code:
def main():
    pdf = StringIO()
    pdf = create_pdf(render_template('cvTemplate.html', user=user))
    pdf_out = pdf.getvalue()
    response = make_response(pdf_out)
    return response

def create_pdf(pdf_data):
    pdf = StringIO()
    pisa.CreatePDF(StringIO(pdf_data.encode('utf-8')), pdf)
    return pdf
With this code the file is generated on the fly. However, xhtml2pdf doesn't support many CSS styles, which makes it hard to lay out the page correctly. I found another tool, wkhtmltopdf, but when I wrote something like:
pdf = StringIO()
data = render_template('cvTemplate1.html', user=user)
WKhtmlToPdf(data.encode('utf-8'), pdf)
return pdf
the following error was raised:
AttributeError: 'cStringIO.StringO' object has no attribute 'rfind'
My question is: how can I convert HTML to PDF using wkhtmltopdf (generating the file on the fly) in Flask?
Thanks in advance for your answers.
The page needs to be rendered. You can use pdfkit:
https://pypi.python.org/pypi/pdfkit
https://github.com/JazzCore/python-pdfkit
Example from the documentation:
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')
pdfkit.from_file('test.html', 'out.pdf')
pdfkit.from_string('Hello!', 'out.pdf') # Is this what you need?
Have you tried with Flask-WeasyPrint, which uses WeasyPrint? There are good examples in their web sites so I don't replicate them here.
Not sure if this will help anyone, but my issue was capturing Bootstrap 5 elements as a PDF. pdfkit did not do so, and here is a workaround on Windows using html2image and PIL. It is limited and does not take a full-page screenshot.
from html2image import Html2Image
from PIL import Image

# Screenshots are written to output_path (assumed here to be the same folder).
hti = Html2Image(output_path=r'C:\yourfilepath')
try:
    hti.screenshot(html_file=r'C:\yourfilepath\file.html', save_as="test.png")
finally:
    image1 = Image.open(r'C:\yourfilepath\test.png')
    im1 = image1.convert('RGB')
    im1.save(r'C:\yourfilepath\newpdf.pdf')

How to get letters from an image using python

I want to capture the letters (characters and numbers) from an image using Python. Please explain how I can do it, with some sample code.
I hope this helps if your image is clear (ideally with little noise).
Use the "PyTesser" project from Google Code in this case.
PyTesser is an Optical Character Recognition module for Python.
It takes as input an image or image file and outputs a string.
You can get PyTesser from this link.
Here's an example:
>>> from pytesser import *
>>> image = Image.open('fnord.tif') # Open image object using PIL
>>> print image_to_string(image) # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
I use tesseract for this.
There is also a Python library for it: https://code.google.com/p/python-tesseract/
Example from the main page:
import tesseract
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)
mImgFile = "eurotext.jpg"
mBuffer=open(mImgFile,"rb").read()
result = tesseract.ProcessPagesBuffer(mBuffer,len(mBuffer),api)
print "result(ProcessPagesBuffer)=",result
Here is my code for Python3 not using the tesseract library but the .exe file:
import os
import subprocess
import tempfile

def tesser_exe():
    path = os.path.join(os.environ['Programfiles'], 'Tesseract-OCR', 'tesseract.exe')
    if not os.path.exists(path):
        raise NotImplementedError('You must first install tesseract from https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-setup-3.02.02.exe&can=2&q=')
    return path

def text_from_image_file(image_name):
    assert image_name.lower().endswith('.bmp')
    output_name = tempfile.mktemp()
    exe_file = tesser_exe() # path to the tesseract.exe file
    return_code = subprocess.call([exe_file, image_name, output_name, '-psm', '7'])
    if return_code != 0:
        raise NotImplementedError('error handling not implemented')
    return open(output_name + '.txt', encoding = 'utf8').read()

merging pdf files with pypdf

I am writing a script that parses an internet site (maya.tase.co.il) for links, downloads the PDF files and merges them. It mostly works, but merging gives me different kinds of errors depending on the file, and I can't figure out why. I cut out the relevant code and built a test for just the two specific files that are causing a problem. The script uses pypdf, but I am willing to try anything that works. Some files are encrypted, some are not.
def is_incry(pdf):
    from pyPdf import PdfFileWriter, PdfFileReader
    input=PdfFileReader(pdf)
    try:
        input.getNumPages()
        return input
    except:
        input.decrypt("")
        return input

def merg_pdf(to_keep,to_lose):
    import os
    from pyPdf import PdfFileWriter, PdfFileReader
    if os.path.exists(to_keep):
        in1=file(to_keep, "rb")
        in2=file(to_lose, "rb")
        input1 = is_incry(in1)
        input2 = is_incry(in2)
        output = PdfFileWriter()
        loop1=input1.getNumPages()
        for i in range(0,loop1):
            output.addPage(input1.getPage(i))#
        loop2=input2.getNumPages()
        for i in range(0,loop2):
            output.addPage(input2.getPage(i))#
        outputStream = file("document-output.pdf", "wb")
        output.write(outputStream)
        outputStream.close()
        pdflen=loop1+loop2
        in1.close()
        in2.close()
        os.remove(to_lose)
        os.remove(to_keep)
        os.rename("document-output.pdf",to_keep)
    else:
        os.rename(to_lose,to_keep)
        in1=file(to_keep, "rb")
        input1 = PdfFileReader(in1)
        try:
            pdflen=input1.getNumPages()
        except:
            input1.decrypt("")
            pdflen=input1.getNumPages()
        in1.close()
        #input1.close()
    return pdflen

def test():
    import urllib
    urllib.urlretrieve ('http://mayafiles.tase.co.il/RPdf/487001-488000/P487028-01.pdf', 'temp1.pdf')
    urllib.urlretrieve ('http://mayafiles.tase.co.il/RPdf/488001-489000/P488170-00.pdf', 'temp2.pdf')
    merg_pdf('temp1.pdf','temp2.pdf')

test()
I thank anyone that even took the time to read this.
Al.
I once wrote some complex PDF generation/merging code, which I have now open-sourced.
You can have a look at it: https://github.com/becomingGuru/nikecup/blob/master/reg/models.py#L71
def merge_pdf(self):
    from pyPdf import PdfFileReader,PdfFileWriter
    pdf_file = file_names['main_pdf']%settings.MEDIA_ROOT
    pdf_obj = PdfFileReader(open(pdf_file))
    values_page = PdfFileReader(open(self.make_pdf())).getPage(0)
    mergepage = pdf_obj.pages[0]
    mergepage.mergePage(values_page)
    signed_pdf = PdfFileWriter()
    for page in pdf_obj.pages:
        signed_pdf.addPage(page)
    signed_pdf_name = file_names['dl_done']%(settings.MEDIA_ROOT,self.phash)
    signed_pdf_file = open(signed_pdf_name,mode='wb')
    signed_pdf.write(signed_pdf_file)
    signed_pdf_file.close()
    return signed_pdf_name
It then works like a charm. Hope it helps.
I tried the documents with pyPdf. It looks like both of the PDFs are encrypted, and a blank password ("") is not valid.
Looking at the security settings in Adobe, you're allowed to print and copy. However, after running pyPdf's input.decrypt(""), the files are still encrypted (input.getNumPages() still returns 0 afterwards).
I was able to open them in OS X's Preview and save the PDFs without encryption, after which the merge works fine. pyPdf deals in pages, though, so I don't think you can do this through pyPdf. Either find the correct password, or use another application, such as a batch job through OpenOffice or another PDF plugin.
