Extracting Text PDF [duplicate] - python

I am building an OCR project using a .NET wrapper for Tesseract. The wrapper's samples don't show how to handle a PDF as input. Given a PDF as input, how do I produce a searchable PDF using C#?
I have used the Ghostscript library to convert the PDF to images and then fed them to Tesseract. That works well for getting the text, but it doesn't preserve the original layout of the PDF; I only get plain text.
How can I extract the text from a PDF while preserving the layout of the original?
This is a page from the PDF. I don't want only the text; I want the text laid out as in the original PDF. Apologies for my English.

Just for documentation reasons, here is an example of OCR using tesseract and pdf2image to extract text from an image pdf.
import pdf2image
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

def pdf_to_img(pdf_file):
    # Render every page of the PDF to an image
    return pdf2image.convert_from_path(pdf_file)

def ocr_core(file):
    # Run Tesseract OCR on a single page image
    text = pytesseract.image_to_string(file)
    return text

def print_pages(pdf_file):
    images = pdf_to_img(pdf_file)
    for pg, img in enumerate(images):
        print(ocr_core(img))

print_pages('sample.pdf')

There is a handy tool, OCRmyPDF, that adds a text layer to a scanned PDF, making it searchable. It essentially automates the steps mentioned in the previous answers.
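As a rough illustration, OCRmyPDF can be driven from Python by shelling out to its command-line tool. This is a minimal sketch, assuming the ocrmypdf executable is installed and on PATH; the helper names and the file names scanned.pdf and searchable.pdf are made up for the example.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng"):
    """Return the argument list for an OCRmyPDF run (nothing is executed here)."""
    # -l selects the Tesseract language used for recognition
    return ["ocrmypdf", "-l", language, input_pdf, output_pdf]


def make_searchable(input_pdf, output_pdf, language="eng"):
    """Run OCRmyPDF; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, language), check=True)


# Example call (hypothetical file names):
# make_searchable("scanned.pdf", "searchable.pdf")
```

Keeping the command builder separate from the function that runs it makes it easy to inspect the exact arguments before anything is executed.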

Tesseract has supported creating sandwich PDFs (the scanned image with an invisible text layer on top) since version 3.0, but 3.02 or 3.03 is recommended for this feature.
Pdfsandwich is a script which does more or less what you want.
There is also the online service www.sandwichpdf.com, which uses Tesseract to create searchable PDFs. You might want to run a few tests before you start implementing your solution with Tesseract. The results are OK, but there are commercial products which deliver better results.
Disclosure: I am the creator of www.sandwichpdf.com.

Use pdf2png.com: upload your PDF and it will render each page to a PNG file named <pdf_name>-<page_number>.png, delivered in a .zip file.
Then you can write simple Python code such as:
#!/usr/bin/python3
# coding: utf-8
import os

pdf_name = 'pdf_name'              # placeholder: base name of the PNG files
language = 'language of tesseract' # placeholder: Tesseract language code
# placeholder: replace the string below with the page count
for x in range(int('number of pdf_pages')):
    cmd = f'tesseract {pdf_name}-{x}.png {x} -l {language}'
    os.system(cmd)
Then read all the resulting files, from 1.txt upward, and append them to a single file. It is as easy as that.
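The last step (reading 1.txt, 2.txt, ... and appending them to one file) could look roughly like this. It is a sketch that assumes the page files are named <number>.txt, as the loop above produces; the function name and the output name combined.txt are arbitrary choices for the example.

```python
import re
from pathlib import Path


def merge_page_texts(folder, output_name="combined.txt"):
    """Append every <number>.txt file in `folder` to one file, in numeric order."""
    folder = Path(folder)
    # Keep only files whose name is purely digits, e.g. 1.txt, 2.txt, 10.txt
    pages = [p for p in folder.glob("*.txt") if re.fullmatch(r"\d+", p.stem)]
    pages.sort(key=lambda p: int(p.stem))  # numeric sort, so 10.txt follows 9.txt
    output_path = folder / output_name
    with output_path.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8"))
            out.write("\n")  # separate pages with a newline
    return output_path
```

Sorting by the integer value of the file name matters here: a plain alphabetical sort would place 10.txt before 2.txt.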

Related

Tesseract OCR Arabic Scanned PDF

Just wondering if anyone has managed to use a Python script with the Tesseract OCR engine to extract Arabic text from scanned PDF images without first converting the file into PNG images. If such code is available, please share it with me, and please also tell me which path should be added to the PATH environment variable.
If this option is not available, is it possible to batch-process multiple images in Tesseract instead of one by one?
Thanks,
Medo Hamdani
I tried the following as a general process for dealing with Arabic PDF files:
Use this line of code
python split-image.py pet.png 1 1
Note, though, that it automatically split every image inside the folder.
Sometimes you will have to add --load-large-images to load large images
# e.g. split_image("bridge.jpg", 2, 2, True, False)
# https://pypi.org/project/split-image/
# split_image(image_path, rows, cols, should_square, should_cleanup, [output_dir])
===================================================================================
For Tesseract, use this line of code
tesseract 5.png output -l ara
# no need to add python in front
To download for Windows
https://github.com/UB-Mannheim/tesseract/wiki
===================================================================================
Combine images
Ensure all the images are in one folder, for example 10 images.
Use this line of code
python combine-images.py
It will create an output.png file
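On the batch-processing question: one simple option is to loop over the images from Python and call the tesseract command once per file. This is a minimal sketch, assuming tesseract is on PATH and the Arabic language data (ara) is installed; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_batch_commands(folder, language="ara", pattern="*.png"):
    """Return one Tesseract command per matching image in `folder`, sorted by name."""
    commands = []
    for image in sorted(Path(folder).glob(pattern)):
        output_base = image.with_suffix("")  # Tesseract appends .txt itself
        commands.append(["tesseract", str(image), str(output_base), "-l", language])
    return commands


def run_batch(folder, language="ara"):
    """Run Tesseract on every PNG in `folder`, stopping at the first failure."""
    for command in build_batch_commands(folder, language):
        subprocess.run(command, check=True)
```

Each image page.png then gets a page.txt next to it, so the results stay paired with their source files.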

How to remove text layer from pdf using python

I need to remove all text information from a PDF file. The resulting file should look like a scan: only images wrapped in a PDF, with no text that you can copy or select. Currently I'm using this Ghostscript command:
import os
...
os.system(f"gs -o {output_path} -sDEVICE=pdfwrite -dFILTERTEXT {input_path}")
Unfortunately, with some documents it removes not only the text layer but also the actual pixels of the characters. Sometimes no rendered text remains visible on the page at all, which is not what I need.
Is there a stable and fast solution using Python or pip-installable tools? It would be wonderful if I could solve this with PyMuPDF (fitz), but I couldn't find anything about it.

Why PyPDF2 showing this output when printing extractText?

I am trying to extract data from a PDF using PyPDF2, but instead of the actual text the output shows something else. What could be the reason?
Here is my code:
import PyPDF2
xfile = open('filename', 'rb')
pdfReader = PyPDF2.PdfFileReader(xfile)
num = pdfReader.numPages
pageobj = pdfReader.getPage(0)
print(pageobj.extractText())
When I run the above program I get this output:
!"#$%#&'(%!#)
(((((((((((((((((((((((((((((((((((((((((((((((((!"#$%#&'(%!#)*+,-./0!$1(230
4444444444445674+8,8,9:+*8
4&*)+!,$-.
4,*7;44444444444444444444444444
4$/012/($/3414546(78(,69:/7;7<=(>"#)?#(A2B2/231
(444<(4=&2#4$>4?&#!0$24A>/$>&&#$>/B4?CDEF4+(;8
4,*7,444*B62C;2/0(#B(%69(%9:77;#("1;23D5B
((((?C<GA47,H#B48:(,*I
4,*7*444E2F2:2B(.2G702=2(A10=2;2=2#("1;23D5B
((((?<GA47*H#B4?CDEF46(8
44%'$HH%(!.*($.,&I&%,%
PDF is a file format oriented around page layout, so the text in a PDF can be stored in various ways. It is not guaranteed that the text in your PDF is stored in a form PyPDF2 can read.
Moving forward: try extracting data from other PDFs before concluding that there is a fault in your PyPDF2 code.
You can also try extracting the text with pytesseract and see if your result improves.
From PyPDF2's documentation:
This works well for some PDF files, but poorly for others, depending on the generator used.
Your PDF might be in the latter category, and you are SOL...
With PyPDF2 not being actively developed anymore (no updates to the PyPI package since 2016), maybe try a more up-to-date package like pdftotext.
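One way to act on this advice is to check whether the extracted text looks plausible before deciding to fall back to another extractor or to OCR. The following is only a heuristic sketch: the function name and the 0.5 threshold are assumptions made for illustration, and text that is legitimately mostly digits (a table of numbers, say) would be flagged too.

```python
def looks_garbled(text, min_letter_ratio=0.5):
    """Guess whether extracted text is gibberish.

    Returns True when fewer than `min_letter_ratio` of the non-space
    characters are letters. The default threshold is an arbitrary assumption.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing was extracted at all
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) < min_letter_ratio
```

Applied to output like the lines shown in the question, which are mostly punctuation and digits, this returns True, while an ordinary English sentence returns False.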

How can I extract text from a PDF using Python similarly to what Chrome browser does?

I'm trying to extract text from pdf files (similar to a form). Currently, I open the file on Chrome, select/copy all the text, paste it into a txt file and process it into CSV using Python. Chrome allows me to have data quite structured and uniform, so that every page of the pdf results in a similar block of text, allowing me to process it easily.
I'm trying to extract the text directly from the PDF, to process it into CSV format, but I always get messy results due to the way the original PDF is generated. I've tried pdfminer and PyPDF2, but the results get messy when the form has a missing value in certain fields.
Maybe it's a rather general question, but how can I get a more structured result from my extraction?
Not all PDFs have embedded text; in some, the text exists only inside embedded images. Hence, a common solution that works for all PDFs is to use OCR.
Step 1) Convert the PDF to an image
Step 2) Use pytesseract to perform OCR and recognize the text in the image
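The two steps can be sketched as a single function that returns one block of text per page, which may make per-page processing into CSV easier. This is a sketch, assuming pdf2image (with its Poppler dependency) and pytesseract are installed; the to_images and to_text parameters are hypothetical hooks added here so the conversion or OCR step can be swapped out.

```python
def ocr_pdf(pdf_path, to_images=None, to_text=None):
    """Return the OCR text of each page of `pdf_path` as a list of strings."""
    if to_images is None:
        # Step 1: render the PDF pages to PIL images
        from pdf2image import convert_from_path
        to_images = convert_from_path
    if to_text is None:
        # Step 2: recognize the text in each image
        from pytesseract import image_to_string
        to_text = image_to_string
    return [to_text(image) for image in to_images(pdf_path)]
```

Calling ocr_pdf('form.pdf') (a made-up file name) would use the real libraries; passing your own callables replaces either step.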

How to create PDF files in Python [closed]

I'm working on a project which takes some images from the user and then creates a PDF file that contains all of these images.
Is there any way or any tool to do this in Python? E.g. to create a PDF file (or EPS, PS) from image1 + image2 + image3 -> PDF file?
Here is my experience after following the hints on this page.
pyPDF can't embed images into files. It can only split and merge. (Source: Ctrl+F through its documentation page)
Which is great, but not if you have images that are not already embedded in a PDF.
pyPDF2 doesn't seem to have any extra documentation on top of pyPDF.
ReportLab is very extensive. (Userguide) However, with a bit of Ctrl+F and grepping through its source, I got this:
First, download the Windows installer and source
Then try this on Python command line:
from reportlab.pdfgen import canvas
from reportlab.lib.units import inch, cm
c = canvas.Canvas('ex.pdf')
c.drawImage('ar.jpg', 0, 0, 10*cm, 10*cm)
c.showPage()
c.save()
All I needed was to get a bunch of images into a PDF, so that I could check how they look and print them. The above is sufficient to achieve that goal.
ReportLab is great, but would benefit from including helloworlds like the above prominently in its documentation.
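Note that drawing every image at a fixed 10 cm x 10 cm, as in the snippet above, stretches any image that is not square. A small helper can compute a size that fits a given box while keeping the aspect ratio. This is a sketch with a hypothetical function name; it assumes you already know the image's pixel size (for example from an imaging library).

```python
def fit_within(image_w, image_h, box_w, box_h):
    """Scale (image_w, image_h) to the largest size that fits the box,
    keeping the aspect ratio. Returns (width, height) in the box's units."""
    if min(image_w, image_h, box_w, box_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # The tighter of the two constraints decides the scale factor
    scale = min(box_w / image_w, box_h / image_h)
    return image_w * scale, image_h * scale
```

The returned width and height can then be passed as the last two arguments of c.drawImage(...) in place of 10*cm, 10*cm.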
I suggest Pdfkit. (installation guide)
It creates PDFs from HTML files. I chose it to create PDFs in two steps from my Python Pyramid stack:
1. Render server-side with Mako templates, using the style and markup you want for your PDF document.
2. Call the pdfkit.from_string(...) method, passing the rendered HTML as a parameter.
This way you get a PDF document with styling and images supported.
You can install it with pip:
pip install pdfkit
You will also need to install wkhtmltopdf (on Ubuntu).
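For the original question (several images into one PDF), the two steps might be sketched like this. Instead of Mako templates, the HTML is built with a plain string helper to keep the example short; images_to_html and html_to_pdf are hypothetical names, and whether local image paths load may depend on the wkhtmltopdf version and its options.

```python
from html import escape


def images_to_html(image_paths, title="Images"):
    """Build a minimal HTML page with one <img> per path."""
    tags = "\n".join(
        f'<p><img src="{escape(str(path), quote=True)}"></p>' for path in image_paths
    )
    return f"<html><head><title>{escape(title)}</title></head><body>\n{tags}\n</body></html>"


def html_to_pdf(html, output_path):
    """Render the HTML string to a PDF file; needs pdfkit and wkhtmltopdf installed."""
    import pdfkit  # imported lazily so the HTML helper works without it
    pdfkit.from_string(html, output_path)
```

A call such as html_to_pdf(images_to_html(['a.png', 'b.png']), 'out.pdf') would then produce the document, with the file names here being placeholders.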
I suggest pyPdf. It works really nicely. I also wrote a blog post about it a while ago; you can find it here.
You can try this(Python-for-PDF-Generation) or you can try PyQt, which has support for printing to pdf.
Python for PDF Generation
The Portable Document Format (PDF) lets you create documents that look exactly the same on every platform. Sometimes a PDF document needs to be generated dynamically, however, and that can be quite a challenge. Fortunately, there are libraries that can help. This article examines one of those for Python.
Read more at http://www.devshed.com/c/a/Python/Python-for-PDF-Generation/#whoCFCPh3TAks368.99
fpdf works well for me. Much simpler than ReportLab and really free. Works with UTF-8.
Here is a solution that works with only the standard packages. matplotlib has a PDF backend to save figures to PDF. You can create a figure with subplots, where each subplot is one of your images. You have full freedom to adjust the figure: add titles, change positions, etc. Once your figure is done, save it to PDF. Each call to savefig creates another page of the PDF.
The example below plots 2 images side by side, on page 1 and page 2.
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
import os
import numpy as np

files = [ "Column0_Line16.jpg", "Column0_Line47.jpg" ]

def plotImage(f):
    folder = "C:/temp/"
    # plt.imread replaces scipy.misc.imread, which newer SciPy releases no longer provide
    im = plt.imread(os.path.join(folder, f)).astype(np.float32) / 255
    plt.imshow(im)
    a = plt.gca()
    a.get_xaxis().set_visible(False) # We don't need axis ticks
    a.get_yaxis().set_visible(False)

pp = PdfPages("c:/temp/page1.pdf")
plt.subplot(121)
plotImage(files[0])
plt.subplot(122)
plotImage(files[1])
pp.savefig(plt.gcf()) # This generates page 1
pp.savefig(plt.gcf()) # This generates page 2 (the same figure again)
pp.close()
rinohtype supports embedding PDF, PNG and JPEG images (natively) and other bitmap formats (when Pillow is installed).
(Full disclosure: I am the author of rinohtype)
fpdf is a Python library too, and it is widely used; see PyPI / pip search. It may have been renamed from pyfpdf to fpdf. From its feature list:
PNG, GIF and JPG support (including transparency and alpha channel)
If you are familiar with LaTeX you might want to consider pylatex.
One of the advantages of pylatex is that it is easy to control the image quality. The images in your PDF will be of the same quality as the original images. When using reportlab, I found that the images were automatically compressed and the image quality reduced.
The disadvantage of pylatex is that, since it is based on LaTeX, it can be hard to place images exactly where you want on the page. However, I have found that using the position argument in the Figure class, and sometimes Subfigure, gives good enough results.
Example code for creating a PDF with a single image:
from pylatex import Document, Figure

doc = Document(documentclass="article")
with doc.create(Figure(position='p')) as fig:
    fig.add_image('Lenna.png')
doc.generate_pdf('test', compiler='latexmk', compiler_args=["-pdf", "-pdflatex=pdflatex"], clean_tex=True)
In addition to installing pylatex (pip install pylatex), you need to install LaTeX. On Ubuntu and other Debian-based systems you can run sudo apt-get install texlive-full. If you are using Windows I would recommend MiKTeX.
I have done this quite a bit in PyQt and it works very well. Qt has extensive support for images, fonts, styles, etc and all of those can be written out to pdf documents.
I believe that matplotlib has the ability to serialize graphics, text and other objects to a pdf document.
I use rst2pdf to create a pdf file, since I am more familiar with RST than with HTML. It supports embedding almost any kind of raster or vector images.
It requires reportlab, but I found reportlab is not so straightforward to use (at least for me).
You can actually try xhtml2pdf http://flask.pocoo.org/snippets/68/
It depends on what format your image files are in, but for a project here at work I used the tiff2pdf tool in LibTIFF from RemoteSensing.org.
Basically, I just used subprocess to call tiff2pdf.exe with the appropriate arguments to read the kind of TIFF I had and output the kind of PDF I wanted. If your images aren't TIFFs, you could probably convert them to TIFFs using PIL, or find a tool more specific to your image type (or a more generic one if the images will be diverse), like ReportLab mentioned above.
