Extract text from pdf to file

Extract text from pdf to file - python

This question is probably a duplicate, but none of the answers in similar questions helped me. I'm looking for a simple way to extract text from a pdf file into any other type of file or structure which will let me use it.
the text I want to extract appears on pages 78-79.
At the end of the processes, I want to write each cell from the table in different rows in a .txt file. for example, I want to turn the first row in the table from this:
to this:
0x00
Channel standby
CH_7
CH_6
CH_5
CH_4
CH_3
CH_2
CH_1
CH_0
0x00
RW
I'm using Visual Studio 2017 but I can also work on Pycharm instead.
I've tried using all the options suggested in this question and here
but I'm having problems installing the required libraries on windows 10 OS. I'm also not sure whether those libraries are still in use and supported. I'd appreciate it if anyone could refer me to some updated material on this subject or refer me to the relevant library.
Thank you.

Here's something using PyMuPDF (pip install pymupdf).
In this example, get_document_bytes simply makes a request the PDF resource at the URL you provided (using the third-party requests module), and returns the PDF bytes. We use the bytes in main to create a fitz.Document instance via the stream parameter. You could also just download the PDF file manually and provide a filename instead of a stream of bytes, but I didn't feel like doing that. We grab a specific page from the document and print all the text on that page:
def get_document_bytes():
import requests
url = "https://www.mouser.co.il/datasheet/2/609/AD7768-7768-4-1502035.pdf"
headers = {
"user-agent": "Mozilla/5.0"
}
response = requests.get(url, headers=headers)
response.raise_for_status()
return response.content
def main():
import fitz
desired_page = 78
doc = fitz.Document(stream=get_document_bytes(), filetype="PDF")
page = doc.loadPage(page_id=desired_page-1)
print(page.getText())
return 0
if __name__ == "__main__":
import sys
sys.exit(main())
Output:
AD7768/AD7768-4
Data Sheet
Rev. B | Page 78 of 105
AD7768 REGISTER MAP DETAILS (SPI CONTROL)
AD7768 REGISTER MAP
See Table 63 and the AD7768-4 Register Map Details (SPI Control) section for the AD7768-4 register map and register functions.
Table 37. Detailed AD7768 Register Map
Reg.
Name
Bit 7
Bit 6
Bit 5
Bit 4
Bit 3
Bit 2
Bit 1
Bit 0
Reset RW
0x00
Channel standby
CH_7
CH_6
CH_5
CH_4
CH_3
CH_2
CH_1
CH_0
0x00
RW
...
I realize you want the text from two pages, not just one - and you also don't want all the text from these pages, just the stuff that's in the table. This is just to get you started - I may tinker around with this a bit more, and update my post later.

Related

Convert PNG to ZPL and print

I'm trying to convert an image to ZPl and then print the label to a 6.5*4cm label on a TLP 2844 zebra printer on Python.
My main problems are:
1.Converting the image
2.Printing from python to the zebra queue (I've honestly tried all the obvious printing packages like zebra0.5/ win32 print/ ZPL...)
Any help would be appreciated.

I had the same issue some weeks ago. I made a python script specifically for this printer, with some fields available. I commented (#) what does not involve your need, but left it in as you may find it helpful.
I also recommend that you set your printer to the EPL2 driver, and 5cm/s print speed. With this script you'll get the PNG previews with an EAN13 formatted barcode. (If you need other formats, you might need to hit the ZPL module docs.)
Please bear in mind that if you print with ZLP 2844, you will either need to use their paid software, or you will need to manually configure the whole printer.
import os
import zpl
#import pandas
#df= pandas.read_excel("Datos.xlsx")
#a=pandas.Series(df.GTIN.values,index=df.FINAL).to_dict()
for elem in a:
l = zpl.Label(15,25.5)
height = 0
l.origin(3,1)
l.write_text("CUIT: 30-11111111-7", char_height=2, char_width=2, line_width=40)
l.endorigin()
l.origin(2,5)
l.write_text("Art:", char_height=2, char_width=2, line_width=40)
l.endorigin()
l.origin(5.5,4)
l.write_text(elem, char_height=3, char_width=2.5, line_width=40)
l.endorigin()
l.origin(2, 7)
l.write_barcode(height=40, barcode_type='2', check_digit='N')
l.write_text(a[elem])
l.endorigin()
height += 8
l.origin(8.5, 13)
l.write_text('WILL S.A.', char_height=2, char_width=2, line_width=40)
l.endorigin()
print(l.dumpZPL())
lista.append(l.dumpZPL())
l.preview()
To print the previews without having to watch and confirm each preview, I ended up modifying the ZPL preview method, to return an IO variable so I can save it to a file.
fake_file = io.BytesIO(l.preview())
img = Image.open(fake_file)
img = img.save('tags/'+'name'+'.png')
On the Label.py from ZPL module (preview method):
#Image.open(io.BytesIO(res)).show(). <---- comment out the show, but add the return of the BytesIO
return res

I had similar issues and created a .net core application which takes an image and converts it to ZPL, either to a file or to the console so it's pipeable in bash scripts. You could package it with your python app call it as a subprocess like so:
output = subprocess.Popen(["zplify", "path/to/file.png"], stdout=subprocess.PIPE).communicate()[0]
Or feel free to use my code as a reference point and implement it in python.
Once you have a zpl file or stream you can send it directly to a printer using lpr if you're on linux. If on windows you can connect to a printer using it's IP address as shown in this stack overflow question

For what is worth and for anyone else reference, was facing a similar situation and came up with a solution. To whom it may help:
Converting the image?
After trying many libraries i came across ZPLGRF although it seems the demo is focused on PDF only, i found in the source that there is a from_image() class property that could convert from image to zpl combining it part of the demo/exaples. Full code description below
Printing from python to the zebra queue?
Many libraries again but i settled with ZEBRA seem to be the most straight forward one to send raw zpl to a zebra printer
CODE
from zplgrf import GRF
from zebra import Zebra
#Open the image file and generate ZPL from it
with open(path_to_your_image, 'rb') as img:
grf = GRF.from_image(img.read(), 'LABEL')
grf.optimise_barcodes()
zpl_code = grf.to_zpl
#Setup and print to Zebra Printer
z = Zebra()
#This will return a list of all the printers on a given machine as a list
#['printer1', 'printer2', ...]
z.getqueues()
#If or once u know the printer queue name then u can set it up with
z.setqueue('printer1')
#And now is ready to send the raw ZPL text
z.output(zpl_code )
The above i have tested successfully with a Zebra GX430t printer connected via USB in a Windows 11 machine.
Hope it helps

Search and replace placeholder text in PDF with Python

I need to generate a customized PDF copy of a template document.
The easiest way - I thought - was to create a source PDF that has some placeholder text where customization needs to happen , ie <first_name> and <last_name>, and then replace these with the correct values.
I've searched high and low, but is there really no way of basically taking the source template PDF, replace the placeholders with actual values and write to a new PDF?
I looked at PyPDF2 and ReportLab but neither seem to be able to do so.
Any suggestions? Most of my searches lead to using a Perl app, CAM::PDF, but I'd prefer to keep it all in Python.

There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
If that's not an option, you can create a PDF form in something like Acrobat, then use a PDF manipulation library like iText (AGPL) or pdfbox, which has a nice clojure wrapper called pdfboxing that can handle some of that.
From my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for a iText license if you're using this for commercial purposes. I've had pretty good results writing python wrappers around PDF-manipulation CLI tools like pdfboxing and ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.

There is no definite solution but I found 2 solutions that works most of the time.
In python https://github.com/JoshData/pdf-redactor gives good results. Here is the example code:
# Redact things that look like social security numbers, replacing the
# text with X's.
options.content_filters = [
# First convert all dash-like characters to dashes.
(
re.compile(u"Tom Xavier"),
lambda m : "XXXXXXX"
),
# Then do an actual SSL regex.
# See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
(
re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
lambda m : "XXX-XX-XXXX"
),
]
# Perform the redaction using PDF on standard input and writing to standard output.
pdf_redactor.redactor(options)
Full Example can be found here
In ruby https://github.com/gettalong/hexapdf works for black out text.
Example code:
require 'hexapdf'
class ShowTextProcessor < HexaPDF::Content::Processor
def initialize(page, to_hide_arr)
super()
#canvas = page.canvas(type: :overlay)
#to_hide_arr = to_hide_arr
end
def show_text(str)
boxes = decode_text_with_positioning(str)
return if boxes.string.empty?
if #to_hide_arr.include? boxes.string
#canvas.stroke_color(0, 0 , 0)
boxes.each do |box|
x, y = *box.lower_left
tx, ty = *box.upper_right
#canvas.rectangle(x, y, tx - x, ty - y).fill
end
end
end
alias :show_text_with_positioning :show_text
end
file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")
doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
processor = ShowTextProcessor.new(page, strings_to_black)
page.process_contents(processor)
end
new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)
puts "Writing updated file [#{new_file_name}]."
In this you can black out text on select text will be visible.

As another solution you may try Aspose.PDF Cloud SDK for Python, it provides the feature to replace text in a PDF document.
First thing first, install the Aspose.PDF Cloud SDK for Python
pip install asposepdfcloud
Sample Code upload PDF file to your cloud storage and replace multiple strings in a PDF document
import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
# Get App key and App SID from https://aspose.cloud
pdf_api_client = asposepdfcloud.api_client.ApiClient(
app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
#upload PDF file to storage
pdf_api.upload_file(remote_name,filename)
#Replace Text
text_replace1 = asposepdfcloud.models.TextReplace(old_value='origami',new_value='aspose',regex='true')
text_replace2 = asposepdfcloud.models.TextReplace(old_value='candy',new_value='biscuit',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace1,text_replace2])
response = pdf_api.post_document_text_replace(remote_name, text_replace_list)
print(response)
I'm developer evangelist at aspose.

Searching text in a PDF using Python? [duplicate]

This question already has answers here:
How to extract text from a PDF file?
(33 answers)
Closed 2 months ago.
Problem
I'm trying to determine what type a document is (e.g. pleading, correspondence, subpoena, etc) by searching through its text, preferably using python. All PDFs are searchable, but I haven't found a solution to parsing it with python and applying a script to search it (short of converting it to a text file first, but that could be resource-intensive for n documents).
What I've done so far
I've looked into pypdf, pdfminer, adobe pdf documentation, and any questions here I could find (though none seemed to directly solve this issue). PDFminer seems to have the most potential, but after reading through the documentation I'm not even sure where to begin.
Is there a simple, effective method for reading PDF text, either by page, line, or the entire document? Or any other workarounds?

This is called PDF mining, and is very hard because:
PDF is a document format designed to be printed, not to be parsed. Inside a PDF document,
text is in no particular order (unless order is important for printing), most of the time
the original text structure is lost (letters may not be grouped
as words and words may not be grouped in sentences, and the order they are placed in
the paper is often random).
There are tons of software generating PDFs, many are defective.
Tools like PDFminer use heuristics to group letters and words again based on their position in the page. I agree, the interface is pretty low level, but it makes more sense when you know
what problem they are trying to solve (in the end, what matters is choosing how close from the neighbors a letter/word/line has to be in order to be considered part of a paragraph).
An expensive alternative (in terms of time/computer power) is generating images for each page and feeding them to OCR, may be worth a try if you have a very good OCR.
So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files - if your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gambling.
I would really like to be proven wrong.
[update]
The answer has not changed but recently I was involved with two projects: one of them is using computer vision in order to extract data from scanned hospital forms. The other extracts data from court records. What I learned is:
Computer vision is at reach of mere mortals in 2018. If you have a good sample of already classified documents you can use OpenCV or SciKit-Image in order to extract features and train a machine learning classifier to determine what type a document is.
If the PDF you are analyzing is "searchable", you can get very far extracting all the text using a software like pdftotext and a Bayesian filter (same kind of algorithm used to classify SPAM).
So there is no reliable and effective method for extracting text from PDF files but you may not need one in order to solve the problem at hand (document type classification).

I am totally a green hand, but this script works for me:
# import packages
import PyPDF2
import re
# open the pdf file
reader = PyPDF2.PdfReader("test.pdf")
# get number of pages
num_pages = len(reader.pages)
# define key terms
string = "Social"
# extract text and do the search
for page in reader.pages:
rext = page.extract_text()
# print(text)
res_search = re.search(string, text)
print(res_search)

I've written extensive systems for the company I work for to convert PDF's into data for processing (invoices, settlements, scanned tickets, etc.), and #Paulo Scardine is correct--there is no completely reliable and easy way to do this. That said, the fastest, most reliable, and least-intensive way is to use pdftotext, part of the xpdf set of tools. This tool will quickly convert searchable PDF's to a text file, which you can read and parse with Python. Hint: Use the -layout argument. And by the way, not all PDF's are searchable, only those that contain text. Some PDF's contain only images with no text at all.

I recently started using ScraperWiki to do what you described.
Here's an example of using ScraperWiki to extract PDF data.
The scraperwiki.pdftoxml() function returns an XML structure.
You can then use BeautifulSoup to parse that into a navigatable tree.
Here's my code for -
import scraperwiki, urllib2
from bs4 import BeautifulSoup
def send_Request(url):
#Get content, regardless of whether an HTML, XML or PDF file
pageContent = urllib2.urlopen(url)
return pageContent
def process_PDF(fileLocation):
#Use this to get PDF, covert to XML
pdfToProcess = send_Request(fileLocation)
pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read())
return pdfToObject
def parse_HTML_tree(contentToParse):
#returns a navigatibale tree, which you can iterate through
soup = BeautifulSoup(contentToParse)
return soup
pdf = process_PDF('http://greenteapress.com/thinkstats/thinkstats.pdf')
pdfToSoup = parse_HTML_tree(pdf)
soupToArray = pdfToSoup.findAll('text')
for line in soupToArray:
print line
This code is going to print a whole, big ugly pile of <text> tags.
Each page is separated with a </page>, if that's any consolation.
If you want the content inside the <text> tags, which might include headings wrapped in <b> for example, use line.contents
If you only want each line of text, not including tags, use line.getText()
It's messy, and painful, but this will work for searchable PDF docs. So far I've found this to be accurate, but painful.

Here is the solution that I found it comfortable for this issue. In the text variable you get the text from PDF in order to search in it. But I have kept also the idea of spiting the text in keywords as I found on this website: https://medium.com/#rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f from were I took this solution, although making nltk was not very straightforward, it might be useful for further purposes:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
def searchInPDF(filename, key):
occurrences = 0
pdfFileObj = open(filename,'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
pageObj = pdfReader.getPage(count)
count +=1
text += pageObj.extractText()
if text != "":
text = text
else:
text = textract.process(filename, method='tesseract', language='eng')
tokens = word_tokenize(text)
punctuation = ['(',')',';',':','[',']',',']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if not word in stop_words and not word in punctuation]
for k in keywords:
if key == k: occurrences+=1
return occurrences
pdf_filename = '/home/florin/Downloads/python.pdf'
search_for = 'string'
print searchInPDF (pdf_filename,search_for)

I agree with #Paulo PDF data-mining is a huge pain. But you might have success with pdftotext which is part of the Xpdf suite freely available here:
http://www.foolabs.com/xpdf/download.html
This should be sufficient for your purpose if you are just looking for single keywords.
pdftotext is a command line utility, but very straightforward to use. It will give you text files, which you may find easier to work with.

If you are on bash, There is a nice tool called pdfgrep,
Since, This is in apt repository, You can install this with:
sudo apt install pdfgrep
It had served my requirements well.

Trying to pick through PDFs for keywords is not an easy thing to do. I tried to use the pdfminer library with very limited success. It’s basically because PDFs are pandemonium incarnate when it comes to structure. Everything in a PDF can stand on its own or be a part of a horizontal or vertical section, backwards or forwards. Pdfminer was having issues translating one page, not recognizing the font, so I tried another direction — optical character recognition of the document. That worked out almost perfectly.
Wand converts all the separate pages in the PDF into image blobs, then you run OCR over the image blobs. What I have as a BytesIO object is the content of the PDF file from the web request. BytesIO is a streaming object that simulates a file load as if the object was coming off of disk, which wand requires as the file parameter. This allows you to just take the data in memory instead of having to save the file to disk first and then load it.
Here’s a very basic code block that should be able to get you going. I can envision various functions that would loop through different URL / files, different keyword searches for each file, and different actions to take, possibly even per keyword and file.
# http://docs.wand-py.org/en/0.5.9/
# http://www.imagemagick.org/script/formats.php
# brew install freetype imagemagick
# brew install PIL
# brew install tesseract
# pip3 install wand
# pip3 install pyocr
import pyocr.builders
import requests
from io import BytesIO
from PIL import Image as PI
from wand.image import Image
if __name__ == '__main__':
pdf_url = 'https://www.vbgov.com/government/departments/city-clerk/city-council/Documents/CurrentBriefAgenda.pdf'
req = requests.get(pdf_url)
content_type = req.headers['Content-Type']
modified_date = req.headers['Last-Modified']
content_buffer = BytesIO(req.content)
search_text = 'tourism investment program'
if content_type == 'application/pdf':
tool = pyocr.get_available_tools()[0]
lang = 'eng' if tool.get_available_languages().index('eng') >= 0 else None
image_pdf = Image(file=content_buffer, format='pdf', resolution=600)
image_jpeg = image_pdf.convert('jpeg')
for img in image_jpeg.sequence:
img_page = Image(image=img)
txt = tool.image_to_string(
PI.open(BytesIO(img_page.make_blob('jpeg'))),
lang=lang,
builder=pyocr.builders.TextBuilder()
)
if search_text in txt.lower():
print('Alert! {} {} {}'.format(search_text, txt.lower().find(search_text),
modified_date))
req.close()

This answer follows #Emma Yu's:
If you want to print out all the matches of a string pattern on every page.
(Note that Emma's code prints a match per page):
import PyPDF2
import re
pattern = input("Enter string pattern to search: ")
fileName = input("Enter file path and name: ")
object = PyPDF2.PdfFileReader(fileName)
numPages = object.getNumPages()
for i in range(0, numPages):
pageObj = object.getPage(i)
text = pageObj.extractText()
for match in re.finditer(pattern, text):
print(f'Page no: {i} | Match: {match}')

A version using PyMuPDF. I find it to be more robust than PyPDF2.
import fitz
import re
# load document
doc = fitz.open(filename)
# define keyterms
String = "hours"
# get text, search for string and print count on page.
for page in doc:
text = ''
text += page.getText()
print(f'count on page {page.number +1} is: {len(re.findall(String, text))}')

Example with pdfminer.six
from pdfminer import high_level
with open('file.pdf', 'rb') as f:
text = high_level.extract_text(f)
print(text)
Compared to PyPDF2, it can work with cyrillic

Running Acrobat with Python

I'm using Adobe Acrobat Pro to extract information from PDFs in XML format. Acrobat does this particularly well. I want to extract information from about a thousand documents and do stuff with that information, so using Acrobat by hand would be annoying. Are there plugins to call Acrobat functions (i.e. save as XML) from any common language, ideally Python?

If you're on Windows, you can talk to Acrobat using DDE commands. The pyWin32 module supports DDE calls, or you could try your luck with this stand-alone binding.
But you'll have to figure out the request to send to Acrobat. (here's some random documentation, but it doesn't mention XML). It seems that the commands change from version to version, (or at least some things break), so keep an eye on the version. Good luck.

Maybe you could take a look at pypdf? It allows python reference to Adobe PDF's. Also PDFminer allows pdf xml extracting. I know perl can do it because I have previously used it myself, here is the reference to the module CAM::PDF
Example:
from pyPdf import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input1 = PdfFileReader(file("document1.pdf", "rb"))
# print the title of document1.pdf
print "title = %s" % (input1.getDocumentInfo().title)
# add page 1 from input1 to output document, unchanged
output.addPage(input1.getPage(0))
# add page 2 from input1, but rotated clockwise 90 degrees
output.addPage(input1.getPage(1).rotateClockwise(90))
# add page 3 from input1, rotated the other way:
output.addPage(input1.getPage(2).rotateCounterClockwise(90))
# alt: output.addPage(input1.getPage(2).rotateClockwise(270))
# add page 4 from input1, but first add a watermark from another pdf:
page4 = input1.getPage(3)
watermark = PdfFileReader(file("watermark.pdf", "rb"))
page4.mergePage(watermark.getPage(0))
# add page 5 from input1, but crop it to half size:
page5 = input1.getPage(4)
page5.mediaBox.upperRight = (
page5.mediaBox.getUpperRight_x() / 2,
page5.mediaBox.getUpperRight_y() / 2
)
output.addPage(page5)
# print how many pages input1 has:
print "document1.pdf has %s pages." % input1.getNumPages()
# finally, write "output" to document-output.pdf
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()
Also take a look at this question: python and pyPdf - how to extract text from the pages so that there are spaces between lines. Describes XML parsing and such in PDF's.

Extracting links to pages in another PDF from PDF using Python or other method

I have 5 PDF files, each of which have links to different pages in another PDF file. The files are each tables of contents for large PDFs (~1000 pages each), making manual extraction possible, but very painful. So far I have tried to open the file in Acrobat Pro, and I can right click on each link and see what page it points to, but I need to extract all the links in some manner. I am not opposed to having to do a good amount of further parsing of the links, but I can't seem to pull them out by any means. I tried to export the PDF from Acrobat Pro as HTML or Word, but both methods didn't maintain the links.
I'm at my wits end, and any help would be great. I'm comfortable working with Python, or a range of other languages

Looking for URIs using pyPdf,
import pyPdf
f = open('TMR-Issue6.pdf','rb')
pdf = pyPdf.PdfFileReader(f)
pgs = pdf.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'
for pg in range(pgs):
p = pdf.getPage(pg)
o = p.getObject()
if o.has_key(key):
ann = o[key]
for a in ann:
u = a.getObject()
if u[ank].has_key(uri):
print u[ank][uri]
gives,
http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html
etc...
I couldn't find a file that had links to another pdf, but I suspect that the URI field should contain URIs of the form file:///myfiles

I've just made a small Python tool for exactly this, to list/download all referenced PDFs from a given PDF: https://www.metachris.com/pdfx/ (also: https://github.com/metachris/pdfx)
$ ./pdfx.py https://weakdh.org/imperfect-forward-secrecy.pdf -d ./
Reading url 'https://weakdh.org/imperfect-forward-secrecy.pdf'...
Saved pdf as './imperfect-forward-secrecy.pdf'
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- Pages = 13
Analyzing text...
- URLs: 49
- URLs to PDFs: 17
JSON summary saved as './imperfect-forward-secrecy.pdf.infos.json'
Downloading 17 referenced pdfs...
Created directory './imperfect-forward-secrecy.pdf-referenced-pdfs'
Downloaded 'http://cr.yp.to/factorization/smoothparts-20040510.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/smoothparts-20040510.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35517.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35517.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35514.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35514.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35519.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35519.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35522.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35522.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35509.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35509.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35528.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35528.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35513.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35513.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35533.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35533.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35551.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35551.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35527.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35527.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35520.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35520.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35526.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35526.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35515.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35515.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35529.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35529.pdf'...
Downloaded 'http://cryptome.org/2013/08/spy-budget-fy13.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/spy-budget-fy13.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35671.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35671.pdf'...
The tool uses PyPDF2, the de-facto Python standard library to read PDF content, a regular expression to match all urls, and starts a download thread for each PDF if you run it with the -d option (for --download-pdfs).

In case you cannot use Python, but have a method of decompressing the PDF's internal object streams like qpdf, you can grep for URIs:
qpdf --qdf --object-streams=disable input.pdf - | grep -Poa '(?<=/URI \().*(?=\))'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.