How do I modify a pdf using pypdf2 without breaking hyperlinks? - python

I am trying to make some simple changes (e.g. add a header) to an existing pdf using pypdf2. However, it appears that when I use the PdfWriter class from pypdf2 I lose all existing hyperlinks, although in the final page object I can still see the annotations ('/Annots'). More specifically, the final pdf still contains the clickable links but they do not point anywhere.
The same happens when I do not make any changes, i.e. I only read the pdf and save it again.
Is there something I am missing or is this a known issue?
Thanks
Terry

Related

Why PyPDF2 showing this output when printing extractText?

I am trying to extract text from a pdf using PyPDF2, but instead of the actual text the output shows something else. What could be the reason?
Here is my code:
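import PyPDF2

# Legacy PyPDF2 (1.x) API, as used in the question; the file must be opened in binary mode.
with open('filename', 'rb') as xfile:
    pdfReader = PyPDF2.PdfFileReader(xfile)
    num = pdfReader.numPages          # total number of pages
    pageobj = pdfReader.getPage(0)    # first page
    print(pageobj.extractText())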
When I run the above program I get this output. What could be the reason?
!"#$%#&'(%!#)
(((((((((((((((((((((((((((((((((((((((((((((((((!"#$%#&'(%!#)*+,-./0!$1(230
4444444444445674+8,8,9:+*8
4&*)+!,$-.
4,*7;44444444444444444444444444
4$/012/($/3414546(78(,69:/7;7<=(>"#)?#(A2B2/231
(444<(4=&2#4$>4?&#!0$24A>/$>&&#$>/B4?CDEF4+(;8
4,*7,444*B62C;2/0(#B(%69(%9:77;#("1;23D5B
((((?C<GA47,H#B48:(,*I
4,*7*444E2F2:2B(.2G702=2(A10=2;2=2#("1;23D5B
((((?<GA47*H#B4?CDEF46(8
44%'$HH%(!.*($.,&I&%,%
PDF is a file format oriented around page layout, so the text in a PDF can be stored in various ways. There is no guarantee that your PDF stores its text in a form that PyPDF2 can read.
Moving forward: try extracting text from other PDFs before concluding that your PyPDF2 setup is at fault.
You can also try OCR with pytesseract and see whether the result improves.
From PyPDF2's documentation:
This works well for some PDF files, but poorly for others, depending on the generator used.
Your PDF might be in the latter category, in which case you are out of luck with PyPDF2.
With PyPDF2 not being actively developed anymore (no updates to the PyPI package since 2016), maybe try a more up-to-date package like pdftotext.
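One way to act on the advice above is to check automatically whether an extraction result looks like real text before trusting it, and only then fall back to another extractor or to OCR. The following is a minimal sketch; the function name looks_like_text and the 0.6 threshold are assumptions made for illustration, not part of PyPDF2 or any other library.

```python
def looks_like_text(extracted, min_ratio=0.6):
    """Heuristic: True if most non-space characters are letters or digits.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    chars = [c for c in extracted if not c.isspace()]
    if not chars:
        return False
    readable = sum(c.isalnum() for c in chars)
    return readable / len(chars) >= min_ratio


# The first line of the garbled output from the question fails the check,
# while an ordinary sentence passes it.
garbled = "!\"#$%#&'(%!#)"
print(looks_like_text(garbled))
print(looks_like_text("Health policy overview, section 1"))
```

If the check fails for every page of a file, the problem is more likely the way the PDF encodes its text than the calling code.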

Processing a PDF for information extraction

I am working on a project where I have a PDF file that describes a health policy. I need to extract the information from this PDF and save it in some form so that I can answer questions about the policy from the extracted information.
The PDF is very large, so I want to divide it into its sections, so that when a query about a particular area comes in I won't have to go through the entire document.
I tried solving this with PDF converters that turn PDFs into HTML, but these converters do not convert the PDF properly, so headings do not end up in heading tags. Also, even if I could convert it properly and get the sections out of the document, I do not know how to store the data (I mean, in which form I should store it).
Is there any other way to achieve this? I am using Python and can use NLTK if needed. The format of the PDFs is not fixed, so my code should work on any kind of PDF.
PDFMiner is great in that it has location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if you have a consistent PDF structure in your docs you might be able to get something working.
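As a rough illustration of that idea, the sketch below treats any line set in a noticeably larger font than the body text as a section heading and groups the following lines under it. It assumes the pdfminer.six package (extract_pages and the LT* layout classes); the helper names, the font-size rule and the 0.5 pt tolerance are assumptions for illustration and will not suit every PDF.

```python
from collections import Counter


def split_into_sections(lines, body_size=None, tolerance=0.5):
    """Group (text, font_size) pairs into sections.

    A line is treated as a heading when its font size exceeds the body size
    by more than `tolerance`. If body_size is not given, the most common
    size is assumed to be the body size.
    """
    lines = [(text.strip(), size) for text, size in lines if text.strip()]
    if not lines:
        return []
    if body_size is None:
        body_size = Counter(round(s, 1) for _, s in lines).most_common(1)[0][0]
    sections = []
    current = {"heading": None, "body": []}
    for text, size in lines:
        if size > body_size + tolerance:
            # Close the previous section unless it is still empty.
            if current["heading"] is not None or current["body"]:
                sections.append(current)
            current = {"heading": text, "body": []}
        else:
            current["body"].append(text)
    sections.append(current)
    return sections


def pdf_lines_with_sizes(path):
    """Yield (text, largest font size) for each text line of a PDF."""
    # Imported here so the grouping logic above works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    yield line.get_text(), max(sizes)
```

Calling split_into_sections(pdf_lines_with_sizes("policy.pdf")) (a hypothetical file name) returns a list of dictionaries, which can be written out as JSON so that a query only needs to load the matching section.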

How can I insert a pdf file as a figure inside of another pdf in python?

I'm trying to automatically generate some pdf format reports in Python. I have figures that I want in the reports, but the figures are currently saved as pdfs. Saving the figures as something else is an option, but not ideal for what I'm trying to do. I've found examples (http://code.google.com/p/pdfrw/wiki/ExampleTools) using pdfrw and reportlab to turn pages from one pdf into pages of a new pdf, but I don't want them to be entire pages of my new pdf, just a figure occupying a section of the page. I haven't used pdfrw before, so I don't know a lot about the Canvas method and what it is fully capable of.
The trick to using a part of a page with the pagemerge canvas is the ViewInfo object. With it, you can describe, among other things, rectangle coordinates (e.g. when you are adding a page to a PageMerge object).
ViewInfo objects are defined and described in more detail in the buildxobj.py file.
I should make another example and some documentation for that, but here is a similar stackoverflow question that I answered awhile back. HTH
(Disclaimer: I am the pdfrw author.)
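For the ReportLab side of the question, a related route is to turn the PDF page into a form XObject and draw it, scaled, wherever the figure should go. The sketch below is not the ViewInfo approach described above; it uses pagexobj and makerl from pdfrw together with a ReportLab canvas. The helper names fit_scale and draw_pdf_figure and their parameters are assumptions made for illustration.

```python
def fit_scale(src_w, src_h, box_w, box_h):
    """Uniform scale factor that fits a src_w x src_h page inside a box."""
    return min(box_w / src_w, box_h / src_h)


def draw_pdf_figure(canvas, pdf_path, x, y, box_w, box_h, page_index=0):
    """Draw one page of pdf_path on a ReportLab canvas.

    The page is scaled to fit a box_w x box_h area whose lower-left
    corner is at (x, y), in points.
    """
    # Imported here so fit_scale can be used without pdfrw installed.
    from pdfrw import PdfReader
    from pdfrw.buildxobj import pagexobj
    from pdfrw.toreportlab import makerl

    xobj = pagexobj(PdfReader(pdf_path).pages[page_index])
    x0, y0, x1, y1 = (float(v) for v in xobj.BBox)
    scale = fit_scale(x1 - x0, y1 - y0, box_w, box_h)

    canvas.saveState()
    canvas.translate(x, y)
    canvas.scale(scale, scale)
    canvas.translate(-x0, -y0)  # compensate for a bounding box not at the origin
    canvas.doForm(makerl(canvas, xobj))
    canvas.restoreState()
```

With a canvas created by reportlab.pdfgen.canvas.Canvas("report.pdf"), a call such as draw_pdf_figure(c, "figure.pdf", 72, 400, 300, 200) (hypothetical file names) would place the figure in a 300 x 200 pt area; the rest of the page can then be drawn as usual before c.showPage() and c.save().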

Reportlab: How to add uncertain contents using BaseDocTemplate?

I want to add contents on the first page, but the data is not known in advance: what has to be drawn there depends on the data.
I tried to solve this by drawing twice, calculating and recording the contents in the first pass, and that works well.
But it wastes time and is clumsy.
Can I insert the first page after the rest has been drawn? Or is there another way to solve this?
Sorry, my English is poor.
If you use Platypus with ReportLab you can get a table of contents "for free": just see the ReportLab manual for details but you basically just add a TableOfContents to the list of Flowables for the document and you're done. Otherwise you'll have to figure out all the logic for building a table of contents yourself.
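A minimal sketch of that approach follows, assuming the sample stylesheet's Heading1 and Heading2 styles mark the headings. The names toc_entry, build_pdf_with_toc and TocDocTemplate are made up for illustration; TableOfContents, afterFlowable, notify and multiBuild are ReportLab's own. Note that multiBuild renders the story more than once until the page numbers settle, so the repeated pass still happens, but the library manages it.

```python
def toc_entry(style_name, text, page, levels=None):
    """Map a heading paragraph to a (level, text, page) entry, else None."""
    levels = levels or {"Heading1": 0, "Heading2": 1}
    level = levels.get(style_name)
    if level is None:
        return None
    return (level, text, page)


def build_pdf_with_toc(path, sections):
    """Build a PDF whose first page is a table of contents.

    `sections` is a list of (heading, body_text) pairs.
    """
    # Imported here so toc_entry can be used without ReportLab installed.
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import PageBreak, Paragraph, SimpleDocTemplate
    from reportlab.platypus.tableofcontents import TableOfContents

    class TocDocTemplate(SimpleDocTemplate):
        def afterFlowable(self, flowable):
            # Called after each flowable is placed; report headings to the TOC.
            if isinstance(flowable, Paragraph):
                entry = toc_entry(
                    flowable.style.name, flowable.getPlainText(), self.page
                )
                if entry is not None:
                    self.notify("TOCEntry", entry)

    styles = getSampleStyleSheet()
    story = [TableOfContents(), PageBreak()]
    for heading, body in sections:
        story.append(Paragraph(heading, styles["Heading1"]))
        story.append(Paragraph(body, styles["BodyText"]))
    TocDocTemplate(path).multiBuild(story)
```

For example, build_pdf_with_toc("report.pdf", [("Introduction", "Some text."), ("Results", "More text.")]) would produce a contents page followed by the two sections.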

How to include page in PDF in PDF document in Python

I am using the reportlab toolkit in Python to generate some reports in PDF format. I want to include some predefined parts of documents that are already published in PDF format in the generated PDF file. Is it possible (and how) to accomplish this in reportlab or in another Python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDFs and ReportLab to generate new content. Using the two packages together seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that you should be fine.
There is an add-on for ReportLab — PageCatcher.
