Unable To Convert PDF to Text format

I am getting the error below when parsing a PDF file with PyPDF2.
The PDF I am trying to parse is attached.
Can anyone help?
import PyPDF2
def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)
    return page_content
error:
PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.

There are open-source OCR tools such as Tesseract; OpenCV is often used alongside them to preprocess the images.
If you want to use Tesseract, there is a Python wrapper library called pytesseract.
Most OCR tools work on images, so you first have to convert your PDF into an image format such as PNG or JPG. After that you can load the image and process it with pytesseract.
Here is some sample code showing how to use pytesseract. It assumes you have already converted your PDF to an image with the filename pdfName.png:
from PIL import Image
import pytesseract
def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # We'll use Pillow's Image class to open the image and pytesseract to detect the string in the image
    text = pytesseract.image_to_string(Image.open(filename))
    return text
print(ocr_core('pdfName.png'))
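If the PDF has not been converted to an image yet, the two steps can be chained. This is only a minimal sketch, assuming the pdf2image package (which needs poppler) and pytesseract (which needs the tesseract binary) are installed. The function name ocr_pdf and its render and recognize parameters are made up for this illustration; they exist so that either step can be swapped out.

```python
def ocr_pdf(pdf_path, render=None, recognize=None):
    """Return the OCR text of every page of pdf_path, joined by newlines.

    render and recognize are hypothetical hooks for this sketch; by
    default they fall back to pdf2image and pytesseract.
    """
    if render is None:
        # Assumption: pdf2image and the poppler utilities are installed.
        from pdf2image import convert_from_path
        render = convert_from_path
    if recognize is None:
        # Assumption: pytesseract and the tesseract binary are installed.
        from pytesseract import image_to_string
        recognize = image_to_string
    pages = render(pdf_path)  # one image per PDF page
    return "\n".join(recognize(page) for page in pages)
```

Calling ocr_pdf('pdfName.pdf') should then return the recognised text of all pages; the file name is a placeholder.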

Related

how to convert pdf to image and return back to pdf

I am trying to convert a PDF to images, apply some manipulation to each image, and then convert the manipulated images back to a PDF using Python.
I have managed to convert the PDF to images, but I don't want to save the files locally. Instead, I need to manipulate the images in memory and then convert them back to a PDF file.
# import module
from pdf2image import convert_from_path
# Store the PDF pages as images with the convert_from_path function
images = convert_from_path('example.pdf')
for i in range(len(images)):
    # Save each page of the PDF as an image
    images[i].save('page' + str(i) + '.jpg', 'JPEG')
# Here the pages are saved locally, but I need to apply an operation such as a background change and then convert the result back to a PDF.
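One possible way to avoid the intermediate files is sketched below. convert_from_path already returns Pillow images in memory, and Pillow can write several images into one PDF with save_all and append_images. The function name images_to_pdf_bytes and its transform parameter are assumptions made for this illustration, not part of pdf2image or Pillow.

```python
import io

def images_to_pdf_bytes(images, transform=None):
    """Apply transform to each page image and return one PDF as bytes.

    images is a list of Pillow images (for example the result of
    pdf2image.convert_from_path); transform is a hypothetical callable
    that takes an image and returns the manipulated image.
    """
    if not images:
        raise ValueError("need at least one page image")
    if transform is not None:
        images = [transform(img) for img in images]
    # Pillow's PDF writer may reject some modes (RGBA, for instance),
    # so every page is normalised to RGB first.
    pages = [img.convert("RGB") for img in images]
    buffer = io.BytesIO()
    # The first page drives the save; the remaining pages are appended.
    pages[0].save(buffer, format="PDF", save_all=True, append_images=pages[1:])
    return buffer.getvalue()
```

With the images list from convert_from_path, something like images_to_pdf_bytes(images, transform=ImageOps.invert) would produce the PDF bytes with inverted colours (ImageOps comes from PIL); the bytes can then be written to a file or sent on without touching the disk in between.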

Extract Text from PDF using Python

Hi, I am a Python beginner.
I am trying to extract text from only a few boxes in a PDF file.
I used the pytesseract library to extract the text, but it extracts all the text on the page. I want to limit the extraction to certain fields in the file, such as the FEI number and the date of inspection at the top and the employee's signature at the bottom. Can someone please tell me which packages I can use for this, and how?
The code I am using is something I borrowed from the internet:
# Requires poppler; in a notebook, install it first with:
#   !apt-get install -y poppler-utils
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image
def convert_pdf_to_img(pdf_file):
    """
    #desc: this function converts a PDF into images
    #params:
    - pdf_file: the file to be converted
    #returns:
    - an iterable containing one image per page of the PDF
    """
    return convert_from_path(pdf_file)
def convert_image_to_text(file):
    """
    #desc: this function extracts text from an image
    #params:
    - file: the image to extract the content from
    #returns:
    - the textual content of a single image
    """
    text = image_to_string(file)
    return text
def get_text_from_any_pdf(pdf_file):
    """
    #desc: this function is our final system combining the previous functions
    #params:
    - pdf_file: the original PDF file
    #returns:
    - the textual content of ALL the pages
    """
    images = convert_pdf_to_img(pdf_file)
    final_text = ""
    for pg, img in enumerate(images):
        final_text += convert_image_to_text(img)
        #print("Page n°{}".format(pg))
        #print(convert_image_to_text(img))
    return final_text

It would likely be more efficient to crop the images to the regions you want to extract text from before running OCR. For that, I'd use cv2, the OpenCV image-processing module for Python.
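As a rough sketch of the cropping idea: since pdf2image returns Pillow images, the example below uses Pillow's Image.crop instead of cv2 (with cv2 the same thing would be array slicing). The helper names fractional_box and ocr_region are made up here, and the region coordinates are placeholders that would have to be measured on the actual form.

```python
def fractional_box(size, left, top, right, bottom):
    """Convert page fractions (0.0 to 1.0) into a pixel box for Image.crop."""
    width, height = size
    return (
        int(left * width),
        int(top * height),
        int(right * width),
        int(bottom * height),
    )

def ocr_region(page_image, frac_box, recognize=None):
    """Run OCR only on the part of page_image described by frac_box.

    frac_box is (left, top, right, bottom) as fractions of the page;
    recognize is a hypothetical hook that defaults to pytesseract.
    """
    if recognize is None:
        # Assumption: pytesseract and the tesseract binary are installed.
        from pytesseract import image_to_string
        recognize = image_to_string
    box = fractional_box(page_image.size, *frac_box)
    return recognize(page_image.crop(box))
```

For example, ocr_region(images[0], (0.0, 0.0, 1.0, 0.2)) would read only the top fifth of the first page. Using fractions rather than pixels keeps the regions valid if the pages are rendered at a different DPI.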

How to convert image from a url to text using pytesseract without saving in memory

I have a URL that points to an image containing only text, and I want to extract that text. I was able to find a solution using pytesseract. However, I have to save the image to local storage first and then pass the file to the function to get the text. Is there a way to do this in memory?
img_resp = requests.get(img_url)
with open('test.png', 'wb') as img:
    img.write(img_resp.content)
print(image_to_string(Image.open('test.png')))

You can wrap the response content in io.BytesIO and pass it to Image.open like this:
import io
import requests
from PIL import Image
from pytesseract import image_to_string
img_resp = requests.get(img_url)
img = Image.open(io.BytesIO(img_resp.content))
print(image_to_string(img))

Python: How to get total image number (shutter count) from EXIF of a JPG file?

When I display the EXIF data in the Mac app "Preview", I see the absolute number of the image, labelled "Image Number". I assume this means the image is the Nth one my camera has ever taken. I would like to get this value in my Python code.
I have already successfully read this number from a RAW image with the package exifread, using the tag "MakerNote TotalShutterReleases". But this does not work with JPEG.
import exifread
with open(file_path, 'rb') as img:
    tags = exifread.process_file(img)
img_number = tags["MakerNote TotalShutterReleases"]
For a RAW image I get what I want, but for a JPG I get:
KeyError: 'MakerNote TotalShutterReleases'
Unfortunately I couldn't find another suitable tag in the list of all tags. Where is this information stored? Why can Preview display this?
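One way to narrow this down is to search the tag names that exifread actually returns for the JPG, rather than guessing a single key. The sketch below is only a discovery aid: the function name find_tags and the keyword list are assumptions, and whether a JPG carries a matching tag at all depends on the camera and on how the file was written.

```python
def find_tags(tags, keywords=("shutter", "imagenumber", "image number", "imagecount")):
    """Return the tags whose names contain any keyword, ignoring case.

    tags is a mapping of tag name to value, such as the dictionary
    returned by exifread.process_file. The default keywords are guesses.
    """
    lowered = [keyword.lower() for keyword in keywords]
    return {
        name: value
        for name, value in tags.items()
        if any(keyword in name.lower() for keyword in lowered)
    }
```

Running find_tags(tags) on the dictionary from exifread.process_file for both the RAW and the JPG file shows which candidate names are present in each.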

How to convert the pdf file to jpeg images

Here is my program. I want to convert a PDF file into JPEG images. With the code below I get PIL.PpmImagePlugin objects; how can I convert them to JPEG format? Thank you in advance.
from pdf2image import convert_from_path
images = convert_from_path('/home/cioc/Desktop/testingFiles/pdfurl-guide.pdf')
print(images)

You can add an output path and an output format for the images. Each page of your PDF will be saved in that directory in the specified format.
Add these keyword arguments to your code:
images = convert_from_path(
    '/home/cioc/Desktop/testingFiles/pdfurl-guide.pdf',
    output_folder='img',
    fmt='jpeg'
)
This saves each page of your PDF as a JPEG image inside img/. As far as I know, pdf2image does not create the directory for you, so make sure img exists first.
Alternatively, you can save each page in a loop by calling save() on each image.
from pdf2image import convert_from_path
images = convert_from_path('/home/cioc/Desktop/testingFiles/pdfurl-guide.pdf')
for page_no, image in enumerate(images):
    image.save(f'page-{page_no}.jpeg')

You could use the pdf2image parameter fmt='jpeg' to make it return JPEG images instead.
You can also manipulate the PPM image as you would a normal JPEG, since PPM is only the backend file type. Calling image.save('path.jpg') on it saves it as a JPEG.
