how to extract formatted text content from PDF

how to extract formatted text content from PDF - python

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?

To extract the text from the PDF AND get it's position you can use PDFMiner. PDFMiner can also export the PDF directly in HTML keeping the text at the good position.
I don't know your use case, but there's a lot of problems you can encounter when doing this because PDF is really presentation oriented and not content oriented, the text flow is not continous. So, if you want the text to be editable, it will not be an easy task.

Have you tried pyPDF or ReportLab PDF libraries? I personally have not used them but you can have a go at them. here is useful too

Xpdf has a utility call PDFtoText that does a great job. http://foolabs.com/xpdf/download.html

If you want to do it just like Google:
Google converts the PDF to an image, and then overlays the image, where text used to be, with JavaScript highlightable areas (which is about like Voodoo magic). The areas appear to be text when you scroll over them with your cursor, but they're not. This might not help you to know, but that's how they do it. If you want to reverse engineer it, you might start with https://www.mercurial-scm.org/ On the home page, they do the same thing with JavaScript to make the text highlightable and copyable. You can extract the text from the PDF, and find it's location in the page with on of the mentioned libraries in the other answers. Then you can overlay an extracted image of the file with the same style of JavaScript areas.

If you don't have your heart set on doing this with python, Ghostscript can do this for you. Check out pdf2ascii (a script that comes with GS) to get the plain text. Styles are more complicated as they can be specified in a few different ways.

Acrobat Professional can do the job. In the "File" menu, choose export. Then, choose Text.

Related

Retaining the fond of pdf to epub

I'm currently working on a project which is to convert pdf to epub using python. While converting the pdf to epub the styling like font family, font size need to be exactly same in epub as that of pdf. Is there a way to achieve this using python? And i don't need any external softwares to do it. I used aspose.
#code i used
import aspose.words as aw
doc = aw.Document("Input.pdf")
doc.save("Output.epub")
and it is a simple text pdf.

You are going to get a variety of answers/comments that will ask you to show code as to what you tried and post sample documents etc.
Let me save you the trouble. Your question seems straightforward in that want to convert a pdf to epub and retain the style information.
Good luck.
It will all depend on your PDF file. Does it have embedded fonts or does it rely on system fonts? Complicated layout? Headers and footers? What about images? Dingbats characters? What if there is no text in the pdf, but just postscript drawing of text characters? What if the PDF just consists of multiple scans of pages in a pdf container? Is everything in English? Any Unicode characters? Are you looking to get the styles right at the page level? Paragraph? Sentence? Word? or Character Level?
Basically this is a hard problem. PDF was designed as an end use format not an interchangeable format. Most things get converted to PDF because someone wanted to control how the final product looked. You can look at text extraction tools for PDF, but there is not an easy solution with opensource or commercial tools.

You can easily convert PDF to EPUB using Aspose.Words for Python. The code is pretty simple:
import aspose.words as aw
doc = aw.Document("C:\\Temp\\in.pdf")
doc.save("C:\\Temp\\out.epub")
However, upon loading PDF into Aspose.Words Document Object Model it is converted from fixed page layout to flow document. And when document is saved to EPUB it is saved as flow document. I am afraid, this might lead into layout and formatting loses upon conversion.

Is there any way of generating exact HTML page from a PDF file?

I am trying different python libraries like pdftotree, pdfminer, tabula etc. But could not get the exact results. I mean I can get text from PDF, Images and Tabular data in HTML, but not as maintained and organized as original PDF file. Can someone help me with something regarding this? I would be thankful.

Mostly yes. Translate the PDF to SVG, and embed the SVG in your web page.
SVG's image model (what it can represent and how) is a near-superset of the PDF image model (which is itself a superset of PostScript), though SVG lacks some of the print-specific features of PDF. There are probably quite a few PDF->SVG converters out there already. Googling "Pdf to SVG" turned up quite a few promising hits
There will be some complications:
Many PDF files are longer than 1 page. You might need to generate 10 SVG files for a single 10 page PDF file, and then build a web page around those 10 SVGs. Throw in some dynamic HTML to "turn pages" and you've got a good web-based PDF viewer.
There are parts of PDF that aren't within its image model at all... bookmarks, annotations (form fields, digital signatures), document metadata (author, creation date, etc), and so forth. Some of the non-image-model stuff is common enough that a PDF to SVG utility might handle it directly (links), while other stuff doesn't have an HTML equivalent and would be lost.
You could preserve the appearance of a digital signature, but the actual security represented by those visuals would be gone. Preserving that signature's appearance could be considered lying about the security.

Edit pdf file with PDFMiner

I was wondering if it's possible to édit an existing pdf file with Pdfminer. It seens to be a powerful tool, but the documentation is poor/inexisting.
I found some exemples, but they don't match with my goal. I want to make a search engine which changes the color of my keywords in the pdf file.

PDFMiner is not for altering existing PDF files, but for extracting text and metadata from them. The closest solution to what you're looking for using PDFMiner would probably be to use the included pdf2txt.py tool to extract the text and then mark that up to highlight your keywords.
There's also the simple option of just using a PDF viewer with the built-in ability to find and highlight multiple search terms. I think Adobe Acrobat can do it, but I'm not sure about others.

No, pdfminer doesn't support editing.
However, it might be a lot easier if you don't try to modify the pdf, but use PDFOpenParameters instead: http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf
You can use url fragment identifiers like this:
http://www.example.com/test.pdf#search=foo
Or even when opening Acrobat on the commandline (Windows example)
AcroRd32.exe /A "search=foo" test.pdf
You could also open the pdf a specific page, and highlight a certain area of that page (but not different areas on different pages at the same time).
(ok, I know it's not really a solution for the question you asked, but if this is sufficient for your needs, it's a lot simpler)

Hide information in a PDF file in Python

In Python, I have files generated by ReportLab. Now, i need to extract some pages from that PDF and hide confidential information.
I can create a PDF file with blacked-out spots and use pyPdf to mergePage, but people can still select and copy-paste the information under the blacked-out spots.
Is there a way to make those spots completely confidential?
Per example, I need to hide addresses on the pages, how would i do it?
Thanks,

Basically you'll have to remove the corresponding text drawing commands in the PDF's page content stream. It's much easier to generate the pages twice, once with the confidential information, once without them.
It might be possible (I don't know ReportLab enough) to specially craft the PDF in a way that the confidential information is easier accessible (e.g. as separate XObjects) for deletion. Still you'd have to do pretty low-level operations on the PDF -- which I would advise against.

(Sorry, I was not able to log on when I posted the question...)
Unfortunately, the document cannot be regenerated at will (context sensitive), and those PDF files (about 35) are 3000+ pages.
I was thinking about using pdf2ps and pdf2ps back, but there is a lot of quality.
pdf2ps -dLanguageLevel=3 input.pdf - | ps2pdf14 - output.pdf
And if i use "pdftops" instead, the text is still selectable. If there is a way to make it non-selectable like with "pdf2ps" but with better quality, it will do too.

How to include page in PDF in PDF document in Python

I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.

I'm currently using PyPDF to read, write, and combine existing PDF's and ReportLab to generate new content. Using the two package seemed to work better than any single package I was able to find.

If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw inspect a page to see if it contains a certain string of text in the lower right-hand corner. However if you don't need to do anything like that you should be fine.

There is an add-on for ReportLab — PageCatcher.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

how to extract formatted text content from PDF - python

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?

Have you tried pyPDF or ReportLab PDF libraries? I personally have not used them but you can have a go at them. here is useful too

Xpdf has a utility call PDFtoText that does a great job. http://foolabs.com/xpdf/download.html

If you don't have your heart set on doing this with python, Ghostscript can do this for you. Check out pdf2ascii (a script that comes with GS) to get the plain text. Styles are more complicated as they can be specified in a few different ways.

Acrobat Professional can do the job. In the "File" menu, choose export. Then, choose Text.

Related

Retaining the fond of pdf to epub

Is there any way of generating exact HTML page from a PDF file?

Edit pdf file with PDFMiner

Hide information in a PDF file in Python

How to include page in PDF in PDF document in Python

Categories

Resources