Getting Chinese text from a PDF, font encoding issue - Python

I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from lots of .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images, but proper documents with selectable text. If I use Adobe Acrobat Pro X and export to .txt, my output looks like:
!!
F/.....e..................!
216.. ..... .... ....
........
If I output to .doc, .docx, .rtf, or even copy-paste into any text editor, it looks like this:
ҁϦљӢख़ε༊౗ݢ୏ቹៜϐѦჾѱ൑॥ᓀϩ݋ӵΠ
I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste. I thought maybe it was a font issue; the font is DFKaiShu sb-estd-bf, which I already have installed (it appears to come with Windows automatically).
I do have a workaround, but it's ugly and very difficult to automate: I print the PDF to a PDF (or any sort of image), then use Adobe Pro's built-in OCR, then convert to a Word document (it still does not convert correctly to .txt). Ultimately I need to do this for ~2000 documents, each of which can be up to 200 pages.
Is there any other way to do this? Why is exporting or copy-pasting not working correctly? I have uploaded a 2-page sample to Google Drive here.
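For context, extraction for the files that do work is essentially a one-liner with pdfminer.six's high-level API (a minimal sketch, since the question doesn't show the actual code; the filename is illustrative):
from pdfminer.high_level import extract_text

# Pull all text from the PDF; pdfminer.six ships with the CJK CMap data
# needed for Chinese, provided the file's own encoding info is intact.
text = extract_text("sample.pdf")
print(text)
When the encoding information inside the PDF itself is broken, a call like this returns the same garbage as the Acrobat export, and OCR becomes the usual fallback.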

Related

Retaining the font when converting PDF to EPUB

I'm currently working on a project to convert PDF to EPUB using Python. While converting the PDF to EPUB, the styling (font family, font size, and so on) needs to be exactly the same in the EPUB as in the PDF. Is there a way to achieve this using Python? And I don't want to use any external software to do it. I used Aspose.
# The code I used: load the PDF and save it as EPUB
import aspose.words as aw

doc = aw.Document("Input.pdf")
doc.save("Output.epub")
And it is a simple text PDF.
You are going to get a variety of answers/comments asking you to show the code you tried, post sample documents, etc.
Let me save you the trouble. Your question seems straightforward in that you want to convert a PDF to EPUB and retain the style information.
Good luck.
It will all depend on your PDF file. Does it have embedded fonts, or does it rely on system fonts? A complicated layout? Headers and footers? What about images? Dingbat characters? What if there is no text in the PDF, just PostScript-style drawings of text characters? What if the PDF just consists of multiple scans of pages in a PDF container? Is everything in English? Any Unicode characters? Are you looking to get the styles right at the page level? Paragraph? Sentence? Word? Or character level?
Basically, this is a hard problem. PDF was designed as an end-use format, not an interchange format. Most things get converted to PDF because someone wanted to control how the final product looked. You can look at text-extraction tools for PDF, but there is no easy solution, open source or commercial.
You can easily convert PDF to EPUB using Aspose.Words for Python. The code is pretty simple:
import aspose.words as aw

# Load the PDF and save it in the EPUB format
doc = aw.Document("C:\\Temp\\in.pdf")
doc.save("C:\\Temp\\out.epub")
However, upon loading a PDF into the Aspose.Words Document Object Model, it is converted from a fixed page layout to a flow document, and when the document is saved to EPUB it is saved as a flow document. I am afraid this might lead to layout and formatting losses upon conversion.

PyPDF2 can't read non-English characters, returns empty string on extractText()

I'm working on a script that will extract data from a large PDF file (40-60+ pages long) that isn't in English; the file contains Greek characters. All seems good until I run the extractText() function of PyPDF2 to get a given page's contents: it returns an empty string.
I'm new to this library and I don't know what to do to fix this problem!
PyPDF2's text extraction looks like it will either work just fine or fail completely. There are no parameters you can pass in to try to get things to work properly. It'll work or it won't.
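A minimal sketch of that all-or-nothing behavior (assuming a recent PyPDF2 release, where extractText() was renamed to extract_text(); the filename is illustrative):
from PyPDF2 import PdfReader

reader = PdfReader("input.pdf")
text = reader.pages[0].extract_text()
if not text or not text.strip():
    # Nothing usable came back; no PyPDF2 parameter will change this.
    print("Extraction failed for this file.")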
You may not be able to fix this problem. If you can successfully copy/paste the text in Acrobat/Reader, then it's possible to extract the text. So what happens when you try to copy/paste out of Reader? Don't try this with some other third-party PDF viewer; use Adobe software. You'll probably have to abandon PyPDF2 and move on to some other PDF API, but if Reader can do it, it's a fixable problem.
There are three different things in a PDF that can look like letters to the human eye:
1. Letters in the PDF in some text encoding. There are several fixed encodings, plus PDF allows you to embed your own custom encodings (often used with font subsets). Software can create PDFs that look fine but can't really be copy/pasted from, even by Adobe.
2. Path art that just happens to look an awful lot like letters: "start drawing a line here, draw a straight line to there, then a curve like this to there," and so on. If you're curious, PDF uses Bézier curves to define its curves. Not terribly related to your question, but interesting.
3. Bitmaps (JPEG/GIF/etc. images) that define a grid of pixels.
In the past, Reader has only been able to handle type 1 above, and then only if the text was encoded properly. Broken custom encodings are alarmingly common (or were 7+ years ago, when I stopped working on PDF software).
With broken type 1s, and with all of types 2 and 3, the only thing you can do is run OCR on the PDF. OCR: Optical Character Recognition. There are several open-source OCR projects out there, as well as commercial ones.
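As a rough sketch of that OCR route, using the open-source Tesseract engine through pytesseract and pdf2image (both are assumptions on my part; any OCR stack would do, and pdf2image additionally needs Poppler installed):
from pdf2image import convert_from_path
import pytesseract

# Rasterize each page, then OCR it; "ell" is Tesseract's code for Greek.
pages = convert_from_path("input.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(page, lang="ell") for page in pages)
print(text)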

Extracting text from scanned PDFs

My problem is that I have a bunch of PDF files and I want to convert them to text files. Some of them are pure PDFs, while others have scanned pages inside. I am writing a program in Python, so I am using pdftotext to convert them to TXT files.
I am using the code below:
import glob
import subprocess

filenames = glob.glob(src)  # src is the pattern matching my directory of files
for file in filenames:
    subprocess.call(["pdftotext", file])
What I would like to ask is whether there is a way to check for scanned pages before the conversion, so that I can use Ghostscript commands alongside pdftotext to handle them.
For now I have a threshold check on the size of the .txt file, and if it is below that threshold I use Ghostscript commands to process the file.
The problem is that for big files, with say 50 or 60 scanned pages out of 90, the size of the pdftotext output is always above the threshold.
A 'pure' PDF file can have images in it....
There's no easy way to tell whether a PDF file is a scanned page or not. Your best bet, I think, would be to analyse the page content streams to see if they consist of nothing but images (some scanners break up the single scanned page into multiple images). You could assume that such pages are scanned pages; in any event, you won't get any text out of them with Ghostscript.
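A rough approximation of that idea, looking at each page's resource dictionary rather than the full content stream, might look like this with PyPDF2 (my own heuristic sketch, not the answer's code: a page with image XObjects and no fonts is treated as a scan):
from PyPDF2 import PdfReader

def looks_scanned(page):
    resources = page["/Resources"] if "/Resources" in page else {}
    xobjects = resources["/XObject"] if "/XObject" in resources else {}
    has_image = any(xobjects[name]["/Subtype"] == "/Image" for name in xobjects)
    # At least one image and no fonts suggests a scanned page.
    return has_image and "/Font" not in resources

reader = PdfReader("input.pdf")
for number, page in enumerate(reader.pages, start=1):
    if looks_scanned(page):
        print(f"page {number} looks scanned")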
Another approach would be to use the pdf_info.ps program for Ghostscript and have it list the fonts used. No fonts == no text, though potentially there may be fonts present and still no text. Also, I don't think this works on a page-by-page basis.
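Invoking it from Python might look like this (a sketch: the path to pdf_info.ps depends on your Ghostscript installation, and newer Ghostscript releases may also want a --permit-file-read switch for the input file):
import subprocess

# Ask Ghostscript's pdf_info.ps to report on the file, fonts included.
subprocess.run(
    ["gs", "-dNODISPLAY", "-q", "-sFile=input.pdf", "toolbin/pdf_info.ps"],
    check=True,
)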

How to read a ppt file using Python?

I want to get the content (text only) of a ppt file. How do I do it?
(By analogy: if I want to get the content of a txt file, I just need to open and read it. What do I need to do to get information out of ppt files?)
By the way, I know there is win32com on Windows systems. But I am working on Linux now; is there any possible way?
I found this discussion over on Superuser:
Command line tool in Linux to Extract Text From Word, Excel, Powerpoint?
There are several reasonable answers listed there, including using LibreOffice to do this (and for .doc, .docx, .pptx, etc.), and the Apache Tika project (which appears to be the 5,000 lb gorilla in this solution space).
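If you go the Tika route, the tika-python bindings make this a few lines (a sketch; the package spins up a local Tika server behind the scenes, which requires Java):
from tika import parser

# Tika detects the format (.ppt, .pptx, .doc, ...) and extracts the text.
parsed = parser.from_file("slides.ppt")
print(parsed["content"])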

How to include a page from one PDF in another PDF document in Python

I am using the ReportLab toolkit in Python to generate some reports in PDF format. I want to include some predefined parts of documents already published in PDF format in the generated PDF file. Is it possible (and how) to accomplish this in ReportLab or another Python library?
I know I can use other tools like PDF Toolkit (pdftk), but I am looking for a Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDFs, and ReportLab to generate new content. Using the two packages together seemed to work better than any single package I was able to find.
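The combining step might look like this with the modern PyPDF2 API (a sketch with illustrative filenames; the original PyPDF spelled its classes differently):
from PyPDF2 import PdfMerger

# Stitch a ReportLab-generated file and an existing PDF into one document.
merger = PdfMerger()
merger.append("generated_by_reportlab.pdf")
merger.append("existing.pdf")
merger.write("combined.pdf")
merger.close()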
If you want to place existing PDF pages in your ReportLab documents, I recommend pdfrw. Unlike PageCatcher, it is free.
I've used it for several projects where I needed to add barcodes etc. to existing documents, and it works very well. There are a couple of examples on the project page of how to use it with ReportLab.
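The core pattern from those examples looks roughly like this (a sketch; pagexobj and makerl are pdfrw's helpers for turning a PDF page into a form XObject that a ReportLab canvas can draw):
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
from reportlab.pdfgen.canvas import Canvas

# Wrap the first page of an existing PDF as a reusable form XObject.
page = pagexobj(PdfReader("existing.pdf").pages[0])

canvas = Canvas("output.pdf")
canvas.doForm(makerl(canvas, page))        # draw the imported page
canvas.drawString(72, 72, "Overlay text")  # then add new content on top
canvas.showPage()
canvas.save()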
A couple of things to note though:
If the source PDF contains errors (because the originating program followed the PDF spec imperfectly, for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault-tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So, for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that, you should be fine.
There is an add-on for ReportLab: PageCatcher.
