Extract Text from PDF using Python

Hi, I am a Python beginner.
I am trying to extract text from only a few boxes in a PDF file.
PDF File Link
I used the pytesseract library to extract the text, but it pulls down all the text on the page. I want to limit extraction to certain fields in the file, such as the FEI number and Date of Inspection at the top and the employee's signature at the bottom. Can someone please guide me on which packages I can use to do this, and how?
The code I am using is something I borrowed from the internet:
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image

!apt-get install -y poppler-utils  # installing poppler

def convert_pdf_to_img(pdf_file):
    """
    #desc: this function converts a PDF into images
    #params:
    - pdf_file: the file to be converted
    #returns:
    - an iterable containing image format of all the pages of the PDF
    """
    return convert_from_path(pdf_file)

def convert_image_to_text(file):
    """
    #desc: this function extracts text from an image
    #params:
    - file: the image file to extract the content from
    #returns:
    - the textual content of a single image
    """
    text = image_to_string(file)
    return text

def get_text_from_any_pdf(pdf_file):
    """
    #desc: this function is our final system combining the previous functions
    #params:
    - pdf_file: the original PDF file
    #returns:
    - the textual content of ALL the pages
    """
    images = convert_pdf_to_img(pdf_file)
    final_text = ""
    for pg, img in enumerate(images):
        final_text += convert_image_to_text(img)
        #print("Page n°{}".format(pg))
        #print(convert_image_to_text(img))
    return final_text
Kaggle link for my notebook

It would be more efficient to crop the regions of the page images where you expect the text to be and OCR only those crops. For the image processing I'd use the cv2 (OpenCV) Python module.
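A minimal sketch of that cropping idea using Pillow (the box coordinates and region names below are made-up placeholders, not taken from the actual form; you'd measure the real pixel positions of the FEI number and date fields on your rendered pages):

```python
from PIL import Image

# Hypothetical pixel boxes (left, upper, right, lower); measure the real
# coordinates of the fields on your own rendered page images.
REGIONS = {
    "fei_number": (50, 40, 400, 90),
    "inspection_date": (420, 40, 780, 90),
}

def crop_regions(page_image):
    """Return one cropped PIL image per configured box; feed each crop to
    pytesseract.image_to_string() instead of OCR-ing the whole page."""
    return {name: page_image.crop(box) for name, box in REGIONS.items()}

# Demo with a blank stand-in page (in practice, use the images returned
# by pdf2image.convert_from_path):
page = Image.new("RGB", (800, 1100), "white")
crops = crop_regions(page)
print(crops["fei_number"].size)  # (350, 50)
```

Running each crop through pytesseract keeps the OCR output limited to the fields you care about.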

Related

Why does extracting file data in PyMuPDF give me empty lists?

I am new to programming (just do it for fun sometimes) and I am having trouble using PyMuPDF.
In VS Code, it returns no errors but the output is always just an empty list.
Here is the code:
import fitz

def extract_text_from_pdf(file_path):
    # Open the pdf file
    pdf_document = fitz.open(file_path)
    # Initialize an empty list to store the text
    text = []
    # Iterate through the pages
    for page in pdf_document:
        # Extract the text from the page
        page_text = page.get_text()
        # Append the text to the list
        text.append(page_text)
    # Close the pdf document
    pdf_document.close()
    # Return the list of text
    return text

if __name__ == '__main__':
    file_path = "/Users/conor/Desktop/projects/png2pdf.pdf"
    text = extract_text_from_pdf(file_path)
Based on the name of the file, I'm going to guess this was an image that was converted to a PDF. In that case, the PDF does not contain any text. It just contains an image.
If you convert a Word document to a PDF, the words in the Word document are present in the PDF, along with instructions on what font to use and where to place them. But when you convert an image to a PDF, all you have are the bytes in the image. There is no text.
If you really want to explore this further, what you need is an OCR package (Optical Character Recognition). There are Python packages for doing that (like pytesseract), but they can be finicky.
FOLLOWUP
PyMuPDF can do OCR, if the Tesseract package is installed. You need to scan through the documentation.
https://pymupdf.readthedocs.io/en/latest/functions.html
One possibility is that the PDF file does not contain any text layer. By default PyMuPDF extracts the text embedded in the PDF, not OCR results, so if the PDF is an image-only PDF there is simply nothing for get_text() to return.

How to convert scanned pdf to text files in a directory more efficiently using Python

I use the following code to convert all scanned pdf files in a directory into text files. However, currently, I get one text file for each pdf page. For instance, I get multiple text files such as examplepage1.txt examplepage2.txt from example.pdf. Is there any way I could combine text files for all the pages from one pdf? This way I could create just one text file per pdf. Any help would be appreciated. Thanks.
import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdfs\*.pdf")

for pdf_path in pdfs:
    pages = convert_from_path(pdf_path, 500)
    for pageNum, imgBlob in enumerate(pages):
        text = pytesseract.image_to_string(imgBlob, lang='eng')
        with open(f'{pdf_path[:-4]}_page{pageNum}.txt', 'w') as the_file:
            the_file.write(text)
UPDATE:
Following the suggestion in the comments, I have made the changes below and it is working fine now. It creates one text file per scanned pdf.
import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdfs\*.pdf")

for pdf_path in pdfs:
    pages = convert_from_path(pdf_path, 500)
    for pageNum, imgBlob in enumerate(pages):
        text = pytesseract.image_to_string(imgBlob, lang='eng')
        with open(f'{pdf_path}.txt', 'a') as the_file:
            the_file.write(text)
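A variant that avoids reopening the output file once per page (and avoids the 'a' append mode duplicating text if the script is re-run) is to collect the page texts first and write once. Here the OCR call is stubbed out so the joining logic stands on its own; in the real script the `page_texts` list would come from `pytesseract.image_to_string()` on each page image:

```python
import pathlib

def write_pdf_text(pdf_path, page_texts):
    """Join per-page OCR results and write a single .txt next to the PDF.

    'w' semantics (write_text truncates) mean a re-run overwrites the file
    instead of appending duplicate pages.
    """
    out_path = pathlib.Path(pdf_path).with_suffix(".txt")
    out_path.write_text("\n".join(page_texts), encoding="utf-8")
    return out_path

# Stand-in page texts; in practice these come from pytesseract
out = write_pdf_text("example.pdf", ["page one text", "page two text"])
print(out.read_text(encoding="utf-8"))
```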

extract specific text from image

I'm trying to extract specific text (or the whole text, to parse afterwards) from an image.
The image is in Hebrew.
In Node.js I already tried the Tesseract library, but it does not recognize the Hebrew text well.
I also tried converting the image to PDF and then parsing the PDF, but that does not work well for Hebrew either.
Has anyone already done this, maybe with Python or Node.js?
I'm trying to do something like Google Cloud Vision text detection.
Have you tried preprocessing the image you feed to Tesseract? If you haven't, I would give OpenCV contour detection a try, particularly the Hough Line Transform, and then clean the image up a bit. https://www.youtube.com/watch?v=lhMXDqQHf9g&list=PLQVvvaa0QuDeETZEOy4VdocT7TOjfSA8a&index=5 This video doesn't cover your case exactly, but if you take the time to scroll through a bit you can see how it can be useful.
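If you don't want to pull in OpenCV, even a simple grayscale-plus-threshold pass with Pillow often helps Tesseract on noisy scans. A sketch (the threshold value 128 is an arbitrary starting point, not a tuned constant):

```python
from PIL import Image

def binarize(image, threshold=128):
    """Convert to grayscale and force every pixel to pure black or white,
    which tends to make Tesseract's job easier on noisy scans."""
    gray = image.convert("L")
    return gray.point(lambda px: 255 if px > threshold else 0)

# Demo on a tiny synthetic image: dark pixels go to 0, light pixels to 255
img = Image.new("L", (2, 1))
img.putpixel((0, 0), 30)
img.putpixel((1, 0), 200)
print(list(binarize(img).getdata()))  # [0, 255]
```

Pass the binarized image to pytesseract instead of the raw scan.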
Based on our conversation in the OP, here are some options for you to consider.
Option 1:
If you are working directly with PDFs as your input file:
import fitz

input_file = '/path/to/your/pdfs/'
pdf_file = input_file
doc = fitz.open(pdf_file)
noOfPages = doc.pageCount
for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo)
    pageTextblocks = page.getText('blocks')  # list of items (x0, y0, x1, y1, "line1\nline2\nline3...", ...)
    pageTextblocks.sort(key=lambda block: block[3])
    for block in pageTextblocks:
        targetBlock = block[4]  # content of each block; apply your own logic here to pull out the relevant data
Option 2:
If you are working with an image as your input, you first need to convert it to a PDF before processing it with the code snippet in Option 1:
doc = fitz.open(input_file)
pdfbytes = doc.convertToPDF()    # convert the image document to PDF bytes
pdf = fitz.open("pdf", pdfbytes) # open those bytes as a PDF document
One useful tip for processing images in PyMuPDF is to apply a zoom factor for better resolution if the image is hard to recognize:
zoom = 1.2  # scale the image by 120%
mat = fitz.Matrix(zoom, zoom)
Option 3:
A hybrid approach with PyMuPDF and pytesseract, since you've mentioned Tesseract. I am not sure if this approach fits your needs for extracting Hebrew, but it's an idea. The example is for PDFs.
import fitz
import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'/path/to/your/tesseract/cmd'
input_file = '/path/to/pdfs'
pdf_file = input_file
fullText = ""
doc = fitz.open(pdf_file)
zoom = 1.2
mat = fitz.Matrix(zoom, zoom)
noOfPages = doc.pageCount
for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo)  # load each page in turn
    pix = page.getPixmap(matrix=mat)
    output = '/path/to/save/image' + str(pageNo) + '.png'
    pix.writePNG(output)
    print('Converting PDFs to Image ... ' + output)
    text_of_each_page = str(pytesseract.image_to_string(Image.open(output)))
    fullText += text_of_each_page
    fullText += '\n'
Hope this helps. If you need more information, the PyMuPDF documentation has more detailed explanations to fit your needs.

Python: How to get total image number (shutter count) from EXIF of a JPG file?

When I display the EXIF data in the Mac app "Preview", I see the absolute number of the image, called "Image Number". I assume this is the count of images my camera has ever taken. I would like to get this value in my Python code.
"More Info" Window of "Preview"
I have already successfully read this number from a RAW image with the exifread package, using the "MakerNote TotalShutterReleases" tag. But this does not work with JPEG.
import exifread

with open(file_path, 'rb') as img:
    tags = exifread.process_file(img)
img_number = tags["MakerNote TotalShutterReleases"]
For a RAW image I get what I want, but for a JPG I get:
KeyError: 'MakerNote TotalShutterReleases'
Unfortunately I couldn't find another suitable tag in the list of all tags. Where is this information stored? Why can Preview display it?
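One way to see whether the JPEG carries the value under a different tag name is to dump everything exifread found and filter for likely candidates. Since exifread returns a plain dict, using .get() (or a filtered view like the sketch below) also avoids the KeyError when the maker note is absent; the helper name and keyword list here are made up for illustration:

```python
def find_shutter_tags(tags):
    """Return every EXIF tag whose name hints at a shutter/image counter.

    `tags` is the dict returned by exifread.process_file(); filtering it
    avoids a KeyError when a specific maker-note tag is missing.
    """
    keywords = ("shutter", "image number", "count")
    return {k: v for k, v in tags.items()
            if any(word in k.lower() for word in keywords)}

# Stand-in for exifread's output (real keys depend on the camera maker):
fake_tags = {
    "Image Make": "NIKON",
    "MakerNote TotalShutterReleases": 102345,
}
print(find_shutter_tags(fake_tags))  # {'MakerNote TotalShutterReleases': 102345}
```

If the filtered view is empty for the JPG, the camera most likely did not copy the maker-note counter into the JPEG's EXIF at all.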

Unable To Convert PDF to Text format

I am getting this error while parsing a PDF file using PyPDF2.
I am attaching the PDF along with the error.
I have attached the PDF to be parsed, please click to view.
Can anyone help?
import PyPDF2

def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)
    return page_content
error:
PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.
There are some open source OCR tools, such as Tesseract (OpenCV can help with the image preprocessing).
If you want to use Tesseract, there is a Python wrapper library called pytesseract.
Most OCR tools work on images, so you first have to convert your PDF into an image file format like PNG or JPG. After that you can load the image and process it with pytesseract.
Here is some sample code showing how to use pytesseract; let's suppose you have already converted your PDF to an image with filename pdfName.png:
from PIL import Image
import pytesseract

def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Use Pillow's Image class to open the image and pytesseract to detect the string in it
    text = pytesseract.image_to_string(Image.open(filename))
    return text

print(ocr_core('pdfName.png'))
