style hyperlinks in reportlab pdfs - python

I am using rst2pdf to generate a PDF. I am using links to sections and they appear as hyperlinks in the PDF. If I hover over the link I can see it says "Go to page XXX". Is there a way to insert that page number into the text, so that it can be seen on hardcopies?

I only started using reportlab recently, but maybe you could use the superscript tag?
p = Paragraph("<link href='http://someurl' color='blue'><u>Some text</u><super> [goto page xx]</super></link>", customstyle)

Related

ReportLab PDF, Table header repeating in first page

I am using the ReportLab library to generate a PDF. The table header is repeated on the first page, and I need to remove it. The other pages work fine.
I am using repeatRows=1 to show the header at the top of each page.
PDF generation code is here
python==2.7, reportlab==2.7

Extract text from HTML faster than NLTK?

We use NLTK to extract text from HTML pages, but we only need the most trivial text analysis, e.g. a word count.
Is there a faster way to extract visible text from HTML using Python?
Understanding HTML (and ideally CSS) at some minimal level, like visible / invisible nodes, images' alt texts, etc, would be additionally great.
Ran into the same problem at my previous workplace. You'll want to check out beautifulsoup.
from bs4 import BeautifulSoup
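# html is the page source as a string; naming the parser avoids a bs4 warning
soup = BeautifulSoup(html, "html.parser")
# print(...) with a single argument works on both Python 2 and Python 3
print(soup.get_text())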
You'll find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can ignore elements based on their attributes. As for understanding external stylesheets, I'm not too sure. However, one option that would not be too slow (depending on the page) is to render the page with something like PhantomJS and then select the rendered text :)
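If installing a third-party package is not an option, here is a rough sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It is not the approach from the answer above, just a dependency-free alternative. The class name VisibleTextParser, the helper visible_text and the sample markup are made up for illustration, and it only understands "visible" in a very crude sense (it skips script, style, head and noscript, and keeps image alt texts). It knows nothing about CSS.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text outside skipped tags, plus image alt texts."""

    # Tags whose contents are never shown to the reader (a crude assumption).
    SKIPPED_TAGS = {"script", "style", "head", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, just to show the call.
sample = (
    "<html><head><title>T</title><style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script>"
    "<img src='a.png' alt='a cat'></body></html>"
)
text = visible_text(sample)
word_count = len(text.split())
```

Whether this is actually faster than NLTK or BeautifulSoup on your pages is something you would have to measure; it mainly avoids building a full tree when all you want is a word count.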

Scrape content from python string in reverse order using Regex

I got the following content from an HTML page:
str='http://www.ralphlauren.com/graphics/product_images/pPOLO2-24922076_alternate1_v360x480.jpg\', zoom: \'s7-1251098_alternate1\' }]\n\n\nEnlarge Image\n\n\n\n\n\n\nCotton Canvas Utility Jacket\nStyle Number : 112933196\n\n\n\n$125.00'
I have many such HTML pages. I need a way to read the content that comes just BEFORE the style number; in this case, I need Cotton Canvas Utility Jacket. Is there a regex in Python to do that? Note that I could start at the pattern Enlarge Image and read whatever comes before Style Number, but there are many occurrences of Enlarge Image on the page. What I have shown above is only part of the HTML page. The full HTML page is here
In short, I need to find the product name from the linked HTML page.
Thanks.
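One possible sketch, assuming the product name always sits on the line directly above "Style Number" as in the snippet above: capture the last non-empty line before that label instead of searching backwards from Enlarge Image. The names page_text and PRODUCT_RE are made up here, and the second product in the sample is hypothetical, added only to show that several matches are handled.

```python
import re

# Shortened stand-in for the string in the question; the second product
# is a hypothetical entry added for illustration.
page_text = (
    "Enlarge Image\n\n\n\n\n\n\nCotton Canvas Utility Jacket\n"
    "Style Number : 112933196\n\n\n\n$125.00\n"
    "Enlarge Image\n\n\nSlim-Fit Oxford Shirt\n"
    "Style Number : 987654321\n\n$89.50"
)

# Group 1: the whole line right before "Style Number".
# Group 2: the digits of the style number itself.
PRODUCT_RE = re.compile(r"([^\n]+)\n\s*Style Number\s*:\s*(\d+)")

# findall returns one (name, style_number) tuple per match.
products = PRODUCT_RE.findall(page_text)
first_name = PRODUCT_RE.search(page_text).group(1)
```

Because the pattern is anchored on Style Number rather than on Enlarge Image, the repeated Enlarge Image markers do not get in the way. If the real pages put blank lines or markup between the name and the label, the pattern would need adjusting.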

Make links in existing PDF using python

I want to make existing text in a PDF links to another PDF, or concatenate the PDFs and then link internally. There would be 100 + links so I do not want to do this by hand.
I tried using pypdf. It worked for finding the pages the links should lead to, but I do not know how to turn the text into links.
So: How to make links in an existing PDF using python?

Reportlab 2 or more pages per file

How can I generate a PDF with two or more pages with reportlab? I've been unable to find anything in the documentation.
canvas.showPage()
will force a new page (even though it sounds like it's showing a page),
assuming you are using the canvas.
If you are using flowables, I think there is a PageBreak flowable.
I think you can just keep .append()ing stuff and it will break the pages automatically once the content gets too large for a single page, or you can force a page break by doing:
.append(PageBreak())
