Tesseract OCR Arabic Scanned PDF - python

Has anyone managed to use a Python script with the Tesseract OCR engine to extract Arabic text from scanned PDF images without first converting the file into PNG images? If such code is available, please share it, and please also tell me which path should be added to the PATH environment variable.
If this option is not available, then is it possible to batch process multiple images in Tesseract instead of one by one?
Thanks,
Medo Hamdani
I tried the following as a general process for dealing with Arabic PDF files:
Use this line of code
python split-image.py pet.png 1 1
although it automatically split every single image inside the folder.
Sometimes you will have to add --load-large-images to load large images
# e.g. split_image("bridge.jpg", 2, 2, True, False)
# https://pypi.org/project/split-image/
# split_image(image_path, rows, cols, should_square, should_cleanup, [output_dir])
===================================================================================
For Tesseract use this line of code
tesseract 5.png output -l ara
# no need to put python in front
To download for Windows
https://github.com/UB-Mannheim/tesseract/wiki
===================================================================================
Combine image
Ensure all the images are in one folder. For example 10 images
Use this line of code
python combine-images.py
It will create an output.png file
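On the batch question, here is a minimal sketch that runs the same `tesseract IMAGE OUTPUT -l ara` command over every PNG in a folder. The function names, the `*.png` pattern and the folder layout are assumptions made for illustration. It expects the folder that contains the tesseract executable to be on PATH (for the UB-Mannheim Windows installer this is typically C:\Program Files\Tesseract-OCR, but check your own install location).

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="ara"):
    """Return the argument list for one `tesseract IMAGE OUTPUT -l LANG` call."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_folder(folder, lang="ara", pattern="*.png"):
    """Run Tesseract on every matching image in `folder` (hypothetical helper).

    Returns the list of .txt files that Tesseract is expected to create.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    outputs = []
    for image in sorted(Path(folder).glob(pattern)):
        # Tesseract appends ".txt" to the output base name by itself.
        output_base = image.with_suffix("")
        subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
        outputs.append(Path(str(output_base) + ".txt"))
    return outputs
```

Calling `ocr_folder("scans")` would then produce one text file next to each image, provided the Arabic language data (`ara`) is installed.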

Related

Extracting Text PDF [duplicate]

I am building an OCR project and I am using a .NET wrapper for Tesseract. The samples that come with the wrapper don't show how to deal with a PDF as input. Using a PDF as input, how do I produce a searchable PDF using C#?
I have used the Ghostscript library to convert the PDF to an image and then fed it to Tesseract. That works well for getting the text, but it doesn't preserve the original layout of the PDF; I only get text.
How can I get the text from the PDF while preserving the layout of the original PDF?
This is a page from the PDF. I don't want only the text; I want the text to be laid out as in the original PDF. Sorry for my poor English.
Just for documentation reasons, here is an example of OCR using tesseract and pdf2image to extract text from an image pdf.
import pdf2image
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract


def pdf_to_img(pdf_file):
    return pdf2image.convert_from_path(pdf_file)


def ocr_core(file):
    text = pytesseract.image_to_string(file)
    return text


def print_pages(pdf_file):
    images = pdf_to_img(pdf_file)
    for pg, img in enumerate(images):
        print(ocr_core(img))


print_pages('sample.pdf')
There is a handy tool OCRmyPDF that will add a text layer to a scanned PDF making it searchable - which essentially automates the steps mentioned in previous answers.
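For anyone who wants to drive OCRmyPDF from a script, a minimal sketch could look like the following. It assumes the `ocrmypdf` command is installed and on PATH; the helper names and the file names in the usage line are made up for illustration.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, lang="eng"):
    """Return the argument list for `ocrmypdf -l LANG INPUT OUTPUT`."""
    return ["ocrmypdf", "-l", lang, str(input_pdf), str(output_pdf)]


def make_searchable(input_pdf, output_pdf, lang="eng"):
    """Add a text layer to a scanned PDF by calling the ocrmypdf CLI."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, lang), check=True)
```

For example, `make_searchable("scan.pdf", "scan_searchable.pdf")` would write a searchable copy and leave the input file untouched.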
Tesseract has supported creating sandwich PDFs (the page image with a hidden text layer) since version 3.0, but 3.02 or 3.03 is recommended for this feature.
Pdfsandwich is a script which does more or less what you want.
There is the online service www.sandwichpdf.com which does use tesseract for creating searchable PDFs. You might want to run a few tests before you start implementing your solution with tesseract. The results are ok, but there are commercial products which deliver better results.
Disclosure: I am the creator of www.sandwichpdf.com.
Use pdf2png.com: upload your PDF and it will produce a .zip file containing one PNG per page, named <pdf_name>-<page_number>.png.
Then you can write simple Python code such as:
#!/usr/bin/python3
# coding: utf-8
import os

# The three quoted values below are placeholders; replace them with your own.
pdf_name = 'pdf_name'
language = 'language of tesseract'
for x in range(int('number of pdf_pages')):
    # Writes the OCR result of page x to x.txt
    cmd = f'tesseract {pdf_name}-{x}.png {x} -l {language}'
    os.system(cmd)
Then read all the output files, from 1.txt upward, and append them to a single file. It is as easy as that.
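The final merge step can be sketched as below. The function name, the output file name and the `first_page` default are assumptions; set `first_page` to whatever number your first page file actually has.

```python
from pathlib import Path


def merge_page_texts(folder, num_pages, output_name="combined.txt", first_page=0):
    """Concatenate <page>.txt files in page order into one text file."""
    folder = Path(folder)
    parts = []
    for page in range(first_page, first_page + num_pages):
        page_file = folder / f"{page}.txt"
        parts.append(page_file.read_text(encoding="utf-8"))
    output_path = folder / output_name
    # Join pages with a newline so the last line of one page and the
    # first line of the next do not run together.
    output_path.write_text("\n".join(parts), encoding="utf-8")
    return output_path
```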

Extract text from pdf file with Python

I would like to extract text, including tables from pdf file.
I tried camelot, but it can only get table data not text.
I also tried PDF2, however it can't read Chinese characters.
Here is the pdf sample to read.
Are there any recommended text-extraction python packages?
By far the simplest way is to extract the text with a single OS shell command using the Poppler PDF utilities (often bundled with Python libraries), then modify that output in Python as required.
>pdftotext -layout -f 1 -l 1 -enc UTF-8 sample.pdf
NOTE: some of the text is embedded to the right of the logo image. It can be extracted separately using pdftoppm -png or pdfimages and then passed to OCR tools, which give lower-quality output, for those smaller areas.
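If you would rather call the same command from Python and get the text back as a string, a sketch along these lines should work. It assumes `pdftotext` from Poppler is on PATH; the helper names are made up. The trailing `-` tells pdftotext to write to standard output instead of a file.

```python
import subprocess


def build_pdftotext_command(pdf_path, first_page=1, last_page=1):
    """Return the pdftotext call from the answer above, writing to stdout."""
    return [
        "pdftotext", "-layout",
        "-f", str(first_page), "-l", str(last_page),
        "-enc", "UTF-8",
        str(pdf_path), "-",
    ]


def extract_text(pdf_path, first_page=1, last_page=1):
    """Run pdftotext and return the extracted text as a Python string."""
    command = build_pdftotext_command(pdf_path, first_page, last_page)
    result = subprocess.run(command, check=True, capture_output=True)
    return result.stdout.decode("utf-8")
```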

how to add person with disability (PwD) symbol to tesseract dataset

I am working on license plate recognition using Python, with Tesseract doing the OCR. For my project I want to include the person with disability (PwD) symbol in the Tesseract library. I reviewed the following links on updating the Tesseract library: tutorial for tesseract library update
I followed the steps for creating the tff file, but it failed with a message that the image is not a prescribed font.
From what I have read, people have added various fonts and number styles, but I couldn't find any information about how to add an image to the Tesseract data set.
Can anyone suggest how I could add the image to the Tesseract data set? I would be grateful for any links or information about this problem.
To train Tesseract on new data you need the following packages: (i) jTessBoxEditor, (ii) Notepad++, (iii) serak trainer.
jTessBoxEditor can be downloaded from the jTessBoxEditor download link; it also requires a Java runtime environment. It accepts input in .txt format.
You can use Notepad++ to enter the special characters. The procedure for entering the characters is described in "how to enter characters in notepad++". For example, to enter the PwD symbol, hold down the ALT key and type +9855 on the numeric keypad; the symbol will appear in Notepad++. After entering the characters, save the file as .txt.
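If typing the Alt code is awkward, the same .txt input can also be generated from Python. This is only a sketch: the file name and the number of repeats are arbitrary choices, and decimal 9855 is the code point U+267F (the Unicode wheelchair symbol).

```python
import unicodedata
from pathlib import Path

# Decimal 9855 is the same value typed with the Alt code; it equals U+267F.
PWD_SYMBOL = chr(9855)


def write_training_text(path, repeats=10):
    """Write a UTF-8 text file containing the symbol several times."""
    line = " ".join([PWD_SYMBOL] * repeats)
    Path(path).write_text(line + "\n", encoding="utf-8")
    return path


print(unicodedata.name(PWD_SYMBOL))
```

The resulting file can then be given to the Tiff/Box Generator in the same way as a file saved from Notepad++.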
Open jTessBoxEditor and click Tiff/Box Generator to feed it the .txt file as input; also change the font to one that supports your character. For the PwD symbol I chose segoeuisymbol. The tif will be stored in the folder where the .txt file was created.
To train the tessdata you need serak trainer, which can be downloaded from the serak download link. The procedure for using serak trainer is shown in the serak trainer video, which explains step by step how to create the tessdata, i.e. the trained data file.
Hope this can be useful for someone

Folder of images to individual PDF reports

I have various folders containing images. Inside each folder the images are all the same size, but the sizes vary slightly between folders.
I am trying to automate processing the folders by taking each image and placing it within a Letter template file (an existing PDF file), then saving it before moving on to the next image in the folder. The template file is just a simple header and footer document.
I have thought about using Python to replace the image within an HTML file, or to overlay it on the existing PDF template, but I am not sure if my approach is correct or if there is a simpler option.
I have already looked at:
pdfkit
wkhtmltopdf
ReportLab
Just looking for some suggestions at this point.
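One possible starting point with ReportLab, offered only as a sketch: compute where the image fits between the header and footer of a Letter page, then draw it on a one-page PDF. The margin values and function names are assumptions, and merging the page with the existing template (for example with a PDF library that can overlay pages) is not shown.

```python
def fit_image(img_w, img_h, page_w=612, page_h=792, margin_x=36, header=72, footer=72):
    """Return (x, y, width, height) in points for an image scaled to fit
    the area between header and footer, centred, keeping its aspect ratio.

    Note: small images are scaled up as well as large ones scaled down.
    """
    box_w = page_w - 2 * margin_x
    box_h = page_h - header - footer
    scale = min(box_w / img_w, box_h / img_h)
    draw_w, draw_h = img_w * scale, img_h * scale
    x = margin_x + (box_w - draw_w) / 2
    y = footer + (box_h - draw_h) / 2
    return x, y, draw_w, draw_h


def image_to_letter_page(image_path, output_pdf):
    """Draw one image on a blank Letter page (requires ReportLab)."""
    # Imported here so fit_image can be used without ReportLab installed.
    from reportlab.lib.pagesizes import letter
    from reportlab.lib.utils import ImageReader
    from reportlab.pdfgen import canvas

    reader = ImageReader(str(image_path))
    img_w, img_h = reader.getSize()
    x, y, w, h = fit_image(img_w, img_h, page_w=letter[0], page_h=letter[1])
    page = canvas.Canvas(str(output_pdf), pagesize=letter)
    page.drawImage(reader, x, y, width=w, height=h)
    page.showPage()
    page.save()
```

Because the images in one folder share a size, `fit_image` only needs to be worked out once per folder; looping over the folder and calling `image_to_letter_page` per image gives one PDF per image.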

text extract from scanned pdfs

My problem is that I have a bunch of PDF files and I want to convert them to text files. Some of them are pure PDFs while others have scanned pages inside. I am writing a program in Python, so I am using pdftotext to convert them to TXT files.
I am using the code below
import glob
import subprocess

filename = glob.glob(src)  # src is my directory with my files
for file in filename:
    subprocess.call(["pdftotext", file])
What I would like to ask is whether there is a way to check for scanned pages before the conversion, so that I can use Ghostscript commands together with pdftotext to handle them.
For now I have a threshold on the size of the .txt file, and if the file is below that threshold I use Ghostscript commands to handle it.
The problem is that for large files, with 50 or 60 scanned pages out of 90, the size of the pdftotext output is still always above the threshold.
A 'pure' PDF file can have images in it....
There's no easy way to tell whether a PDF file is a scanned page or not. Your best bet, I think, would be to analyse the page content streams to see if they consist of nothing but images (some scanners break up the single scanned page into multiple images). You could assume that those are scanned pages; in any event you won't get any text out of them with Ghostscript.
Another approach would be to use the pdf_info.ps program for Ghostscript and have it list the fonts used. No fonts == no text, though potentially there may be fonts present and still no text. Also, I don't think this works on a page-by-page basis.
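A rougher heuristic that does work page by page is to apply the size threshold from the question to each page instead of to the whole file. The sketch below is not one of the approaches described above; it relies on pdftotext separating pages with a form feed character in its output, and the `min_chars` value and function names are assumptions to tune against your own files.

```python
import subprocess


def split_pages(pdftotext_output):
    """Split pdftotext output into per-page strings on form feed characters."""
    pages = pdftotext_output.split("\f")
    # The output normally ends with a form feed, leaving one empty trailing item.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def likely_scanned_pages(page_texts, min_chars=20):
    """Return 1-based numbers of pages with almost no extractable text."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def find_scanned_pages(pdf_path, min_chars=20):
    """Run pdftotext once and flag pages that look like scans."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"], check=True, capture_output=True
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return likely_scanned_pages(split_pages(text), min_chars)
```

Pages flagged this way can be sent to the Ghostscript/OCR path while the rest keep the plain pdftotext output. A scanned page that already carries a hidden text layer will not be flagged.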
