Extract text from scanned PDFs - Python

My problem is that I have a bunch of PDF files that I want to convert to text files. Some of them are pure (born-digital) PDFs, while others contain scanned pages. I am writing a program in Python, so I am using pdftotext to convert them to TXT.
I am using the code below
import glob
import subprocess

filenames = glob.glob(src)  # src is my directory with my files
for file in filenames:
    subprocess.call(["pdftotext", file])
What I would like to ask is whether there is a way to check for scanned pages before the conversion, so that I can use Ghostscript commands alongside pdftotext to handle them.
For now I have a threshold to check the size of the .txt file, and if it is below that threshold I use Ghostscript commands to process the file.
The problem is that for large files with, say, 50 or 60 scanned pages out of 90, the pdftotext output is always above the threshold.

A 'pure' PDF file can have images in it....
There's no easy way to tell whether a PDF file contains scanned pages or not. Your best bet, I think, would be to analyse the page content streams to see if they consist of nothing but images (some scanners break up a single scanned page into multiple images). If so, you can assume they are scanned pages; in any event, you won't get any text out of them with Ghostscript.
Another approach would be to use the pdf_info.ps program for Ghostscript and have it list the fonts used. No fonts == no text, though potentially there may be fonts present and still no text. Also, I don't think this works on a page-by-page basis.
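If you want to automate such a check from Python, one practical heuristic (not from the answer above, just a sketch) is to run pdftotext page by page with its -f/-l options and flag pages that yield no text at all; those are likely scanned. This assumes pdfinfo and pdftotext from the Poppler/Xpdf tools are on the PATH, and the file name in the example is hypothetical.

import re
import subprocess

def page_count(pdf_path):
    # Read the "Pages:" line from pdfinfo output.
    info = subprocess.run(["pdfinfo", pdf_path],
                          capture_output=True, text=True).stdout
    match = re.search(r"^Pages:\s+(\d+)", info, re.MULTILINE)
    return int(match.group(1)) if match else 0

def scanned_pages(pdf_path):
    # Pages that produce no extractable text are probably scanned images.
    scanned = []
    for page in range(1, page_count(pdf_path) + 1):
        text = subprocess.run(
            ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"],
            capture_output=True, text=True).stdout
        if not text.strip():
            scanned.append(page)
    return scanned

# print(scanned_pages("input.pdf"))  # hypothetical file name

A page that contains both an image and a text layer will still count as "text", so this only separates pages with no extractable text from the rest.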

Related

How can I extract text from a PDF using Python similarly to what the Chrome browser does?

I'm trying to extract text from PDF files (similar to a form). Currently, I open the file in Chrome, select/copy all the text, paste it into a txt file and process it into CSV using Python. Chrome gives me data that is quite structured and uniform, so every page of the PDF results in a similar block of text, which lets me process it easily.
I'm trying to extract the text directly from the PDF to process it into CSV format, but I always get messy results due to the way the original PDF is generated. I've tried pdfminer and PyPDF2, but the results get messy when the form has a missing value in certain fields.
Maybe it's a general question, but how can I get a more structured result from my extraction?
Not all PDFs have embedded text; in some, the text exists only in embedded images. Hence, a common solution that works for all PDFs is to use OCR.
Step 1) Convert the PDF to an image
Step 2) Use pytesseract OCR to recognize the text in the image (a minimal sketch follows below)
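A minimal sketch of that two-step pipeline, assuming pdf2image (which needs Poppler installed) and pytesseract (which needs the Tesseract binary) are available; the file name is a placeholder:

from pdf2image import convert_from_path  # requires Poppler
import pytesseract                        # requires the Tesseract binary

def ocr_pdf(pdf_path):
    # Step 1: render each PDF page to a PIL image.
    pages = convert_from_path(pdf_path, dpi=300)
    # Step 2: run OCR on each image and join the results.
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

# print(ocr_pdf("form.pdf"))  # hypothetical file name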

Getting Chinese text from a PDF, font encoding issue

I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from lots of .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images, but proper documents with selectable text. If I use Adobe Acrobat Pro X and export to .txt, my output looks like:
!!
F/.....e..................!
216.. ..... .... ....
........
If I output to .doc, .docx, .rtf, or even copy-paste into any text editor, it looks like this:
ҁϦљӢख़ε༊౗ݢ୏ቹៜϐѦჾѱ൑॥ᓀϩ݋ӵΠ
I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste. I thought maybe it was a font issue; the font is DFKaiShu sb-estd-bf, which I already have installed (it appears to come with Windows automatically).
I do have a workaround, but it's ugly and very difficult to automate: I print the PDF to a PDF (or any sort of image), then use Adobe Pro's built-in OCR, then convert to a Word document (it still does not convert correctly to .txt). Ultimately I need to do this for ~2000 documents, each of which can be up to 200 pages.
Is there any other way to do this? Why is exporting or copy-pasting not working correctly? I have uploaded a 2-page sample to Google Drive here.
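The manual workaround could in principle be scripted the same way as the OCR approach above: rasterize the pages and run Tesseract with a Chinese language pack. This is only a sketch and assumes pdf2image/Poppler, pytesseract, and the chi_tra (or chi_sim) traineddata are installed, none of which the question mentions:

from pdf2image import convert_from_path
import pytesseract

def ocr_chinese_pdf(pdf_path, lang="chi_tra"):
    # Render pages to images, then OCR with a Chinese language pack.
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(p, lang=lang) for p in pages)

OCR will not explain why Acrobat's export mangles the embedded text, but it sidesteps the font-encoding problem entirely.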

Folder of images to individual PDF reports

I have various folders containing images. Inside each folder the images are all the same size, but the sizes vary slightly between folders.
I am trying to automate processing the folders by taking each image, placing it within a Letter-sized template file (an existing PDF), and saving the result before moving on to the next image in the folder. The template file is just a simple header-and-footer document.
I have thought about using Python to replace the image within an HTML file, or to overlay it on the existing PDF template, but I'm not sure if my approach is correct or if there is a simpler option.
I have already looked at:
pdfkit
wkhtmltopdf
ReportLab
Just looking for some suggestions at this point.
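One possible direction with ReportLab (from your list) plus pypdf: draw each image onto an in-memory Letter-sized page, then merge that page onto the existing template. This is only a sketch; the template path, output names, margins, and image placement are all assumptions.

from io import BytesIO

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from pypdf import PdfReader, PdfWriter

def image_on_template(image_path, template_path, output_path):
    # Draw the image onto an in-memory Letter-sized page.
    buf = BytesIO()
    c = canvas.Canvas(buf, pagesize=letter)
    # Guessed placement: roughly 1-inch margins, 6.5 x 6.5 inch bounding box.
    c.drawImage(image_path, 72, 144, width=468, height=468,
                preserveAspectRatio=True)
    c.showPage()
    c.save()
    buf.seek(0)

    # Overlay the image page onto the first page of the template.
    template_page = PdfReader(template_path).pages[0]
    template_page.merge_page(PdfReader(buf).pages[0])

    writer = PdfWriter()
    writer.add_page(template_page)
    with open(output_path, "wb") as f:
        writer.write(f)

# Hypothetical usage: one report per image in a folder.
# import glob
# for i, img in enumerate(glob.glob("folder_a/*.png")):
#     image_on_template(img, "letter_template.pdf", f"report_{i}.pdf")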

Is there a way to automate specific data extraction from a number of pdf files and add them to an excel sheet?

Regularly I have to go through a list of PDF files, search for specific data and add it to an Excel sheet for later review. As the number of PDF files is around 50 per month, it is both time-consuming and frustrating to do this manually.
Can the process be automated on Windows with Python or any other scripting language? I would like to have all the PDF files in a folder, run the script, and get an Excel sheet with all the data added. The PDF files I work with are tabular and have similar structures.
Yes. And no. And maybe.
The problem here is not extracting something from a PDF document. Extracting something is almost always possible and there are plenty of tools available to extract content from a PDF document. Text, images, whatever you need.
The major problem (and the reason for the "no" or "maybe") is that PDF in general is not a structured file format. It doesn't care about columns, paragraphs, tables, sentences or even words. In the general case it cares only about characters on a page in a specific location.
This means that in the general case you cannot query a PDF document and ask it for every paragraph, or for the third sentence in the fifth paragraph. You can ask a library for all of the text, or for all of the text in a specific location. And then you have to hope the library is able to extract the text you need in a legible format, because it isn't even guaranteed that you can copy and paste, or otherwise extract, understandable characters from a PDF file. Many PDF files don't even contain enough information for that.
So... If you have a certain type of document and you can test that it predictably behaves a certain way with a certain extraction engine, then yes, you can extract information from a PDF file.
If the PDF files you receive are different all the time, or the layout on the page is totally different every time, then the answer is probably that you cannot reliably extract the information you want.
As a side note:
There are certain types of PDF documents that are easier to handle than others so if you're lucky that might make your life easier. Two examples:
Many PDF files will in fact contain textual information in such a way that it can be extracted in a legible way. PDF files that follow certain standards (such as PDF/A-1a, PDF/A-2a or PDF/A-2u etc...) are even required to be created this way.
Some PDF files are "tagged" which means they contain additional structural information that allows you to extract information in an easier and more meaningful way. This structure would in fact identify paragraphs, images, tables etc and if the tagging was done in a good way it could make the job of content extraction much easier.
You could use pdf2text2 in Python to extract data from your PDF.
Alternatively, you can use pdftotext, which is part of the Xpdf suite.
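A rough sketch of that last suggestion, tying pdftotext output to an Excel sheet with openpyxl; the folder, the regular expression, and the column layout are placeholders that would need to match your actual documents:

import glob
import re
import subprocess

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["File", "Total"])  # hypothetical columns

for pdf in glob.glob("reports/*.pdf"):  # hypothetical folder
    # Extract plain text; "-layout" keeps the tabular arrangement, "-" writes to stdout.
    text = subprocess.run(["pdftotext", "-layout", pdf, "-"],
                          capture_output=True, text=True).stdout
    # Placeholder pattern: adapt to whatever field you actually need.
    match = re.search(r"Total:\s*([\d.,]+)", text)
    ws.append([pdf, match.group(1) if match else ""])

wb.save("summary.xlsx")

This only works if pdftotext produces legible text for your files; if they turn out to be scanned, you are back to OCR.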

How can I extract the tables, text and the pictures in ODT(OpenDocumentText) format using Python?

How can I extract the tables, text and the pictures in an ODT(OpenDocumentText) file to output them to another ODT file using Python on Ubuntu?
OOoPy seems to be a good fit. I've never used it, but it comes with documentation and code examples, and it can read and write ODT files.
An easy way is to just rename foo.odt to foo.zip and then extract it; the extracted directory contains many files, including a Pictures folder.
However, I think it's better to convert it to .docx and then run the extraction on the .docx instead, because that way the images come out with better names (image1, image2, ...).
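A small sketch of the rename-to-zip idea using only the standard library; since an .odt is already a ZIP archive, you can read it in place without renaming (the output folder name is a placeholder):

import zipfile
from pathlib import Path

def extract_odt_pictures(odt_path, out_dir="odt_pictures"):
    # An .odt file is a ZIP archive; embedded images live under Pictures/.
    Path(out_dir).mkdir(exist_ok=True)
    with zipfile.ZipFile(odt_path) as odt:
        for name in odt.namelist():
            if name.startswith("Pictures/") and not name.endswith("/"):
                (Path(out_dir) / Path(name).name).write_bytes(odt.read(name))

# extract_odt_pictures("foo.odt")  # the file name from the answer above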
