Retaining the fond of pdf to epub - python

I'm currently working on a project which is to convert pdf to epub using python. While converting the pdf to epub the styling like font family, font size need to be exactly same in epub as that of pdf. Is there a way to achieve this using python? And i don't need any external softwares to do it. I used aspose.
#code i used
import aspose.words as aw
doc = aw.Document("Input.pdf")
doc.save("Output.epub")
and it is a simple text pdf.

You are going to get a variety of answers/comments that will ask you to show code as to what you tried and post sample documents etc.
Let me save you the trouble. Your question seems straightforward in that want to convert a pdf to epub and retain the style information.
Good luck.
It will all depend on your PDF file. Does it have embedded fonts or does it rely on system fonts? Complicated layout? Headers and footers? What about images? Dingbats characters? What if there is no text in the pdf, but just postscript drawing of text characters? What if the PDF just consists of multiple scans of pages in a pdf container? Is everything in English? Any Unicode characters? Are you looking to get the styles right at the page level? Paragraph? Sentence? Word? or Character Level?
Basically this is a hard problem. PDF was designed as an end use format not an interchangeable format. Most things get converted to PDF because someone wanted to control how the final product looked. You can look at text extraction tools for PDF, but there is not an easy solution with opensource or commercial tools.

You can easily convert PDF to EPUB using Aspose.Words for Python. The code is pretty simple:
import aspose.words as aw
doc = aw.Document("C:\\Temp\\in.pdf")
doc.save("C:\\Temp\\out.epub")
However, upon loading PDF into Aspose.Words Document Object Model it is converted from fixed page layout to flow document. And when document is saved to EPUB it is saved as flow document. I am afraid, this might lead into layout and formatting loses upon conversion.

Related

Is there any way of generating exact HTML page from a PDF file?

I am trying different python libraries like pdftotree, pdfminer, tabula etc. But could not get the exact results. I mean I can get text from PDF, Images and Tabular data in HTML, but not as maintained and organized as original PDF file. Can someone help me with something regarding this? I would be thankful.
Mostly yes. Translate the PDF to SVG, and embed the SVG in your web page.
SVG's image model (what it can represent and how) is a near-superset of the PDF image model (which is itself a superset of PostScript), though SVG lacks some of the print-specific features of PDF. There are probably quite a few PDF->SVG converters out there already. Googling "Pdf to SVG" turned up quite a few promising hits
There will be some complications:
Many PDF files are longer than 1 page. You might need to generate 10 SVG files for a single 10 page PDF file, and then build a web page around those 10 SVGs. Throw in some dynamic HTML to "turn pages" and you've got a good web-based PDF viewer.
There are parts of PDF that aren't within its image model at all... bookmarks, annotations (form fields, digital signatures), document metadata (author, creation date, etc), and so forth. Some of the non-image-model stuff is common enough that a PDF to SVG utility might handle it directly (links), while other stuff doesn't have an HTML equivalent and would be lost.
You could preserve the appearance of a digital signature, but the actual security represented by those visuals would be gone. Preserving that signature's appearance could be considered lying about the security.

Modify and create PDF using Python

I have created a really nice looking invitation letter in word (.doc/.docx). Now, I need to personalize it for 1,000 people with their names and associated QR codes.
I tried working with pyfpdf and reportlab but it seems like in order to use these packages I have to re-generate the whole invitation letter along with text and graphics. I'm not sure if I will be able to generate an equally visually appealing letter as I have now in word (at least not without a lot of effort).
Is there a better way, where I use word template as input, fill-in the name and QR code and generate PDF?
If you are prepared to do the QR code and personalization in reportlab, then pdfrw (disclaimer: I am the author) will let you either merge the PDFs after the fact (similar to a watermarking operation), or can bring the PDF you generate from word in to reportlab a form XObject (similar to an image). You can use it for a background.
You should try using the Microsoft Word MailMerge feature which will probably do exactly what you want from within Word itself.
PDF editing is a very complex beast, as is docx editing. The majority of companies who offer PDF "support" use PDF APIs, since the software to edit and create PDF documents is so complex it's a retailable product in itself.
You can use MailMerge either to print or to email the PDF to lots of people at once with custom settings for each person.

Processing a PDF for information extraction

I am working on a project where I have a pdf file which describes one of the health policy. What I need to do is extract the information from this PDF and try to save it in some form such that I can answer the questions related to the policy by extracting info from this PDf.
This PDF is too big, so I want to divide the PDF according to the different sections so that when a query related to some particular area comes in then I wont have to go through the entire document.
I tried solving this using some pdf converters which converts the PDFs into the HTMLs. But these converters wont convert the PDF to HTML properly so that headings will have heading tag. Also even if I convert this properly and get the proper sections out of the document, I am not getting how to store this data.(I mean in which form should I store this Data).
Is there any other solution with which I can achieve this. I am using Python and also I can use NLTK if needed. Also the format is not fixed for the PDfs, I mean to say my code should work on any kind of PDFs.
PDFMiner is great in that it has location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if you have a consistent PDF structure in your docs you might be able to get something working.

how to extract formatted text content from PDF

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?
To extract the text from the PDF AND get it's position you can use PDFMiner. PDFMiner can also export the PDF directly in HTML keeping the text at the good position.
I don't know your use case, but there's a lot of problems you can encounter when doing this because PDF is really presentation oriented and not content oriented, the text flow is not continous. So, if you want the text to be editable, it will not be an easy task.
Have you tried pyPDF or ReportLab PDF libraries? I personally have not used them but you can have a go at them. here is useful too
Xpdf has a utility call PDFtoText that does a great job. http://foolabs.com/xpdf/download.html
If you want to do it just like Google:
Google converts the PDF to an image, and then overlays the image, where text used to be, with JavaScript highlightable areas (which is about like Voodoo magic). The areas appear to be text when you scroll over them with your cursor, but they're not. This might not help you to know, but that's how they do it. If you want to reverse engineer it, you might start with https://www.mercurial-scm.org/ On the home page, they do the same thing with JavaScript to make the text highlightable and copyable. You can extract the text from the PDF, and find it's location in the page with on of the mentioned libraries in the other answers. Then you can overlay an extracted image of the file with the same style of JavaScript areas.
If you don't have your heart set on doing this with python, Ghostscript can do this for you. Check out pdf2ascii (a script that comes with GS) to get the plain text. Styles are more complicated as they can be specified in a few different ways.
Acrobat Professional can do the job. In the "File" menu, choose export. Then, choose Text.

How to include page in PDF in PDF document in Python

I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDF's and ReportLab to generate new content. Using the two package seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw inspect a page to see if it contains a certain string of text in the lower right-hand corner. However if you don't need to do anything like that you should be fine.
There is an add-on for ReportLab — PageCatcher.

Categories