How to serve PDF files on the web?

I have the following code, which seems to serve a PDF without any content:
from pathlib import Path
pdf = Path("url/to/file.pdf")
print(f"Content-Type: application/pdf;\r\n")
print(pdf.read_bytes())
Any tips to correctly serve this PDF would be helpful!
Edit: for context, I am trying to serve PDF files and obscure original PDF file path on the server.

I'm no Python developer so I can't help you too specifically, but a couple things...
If your Python script is outputting headers and content in the same response (such as via CGI), you need a blank line between the headers and the content. Right now you have one \r\n. Add a second \r\n.
The other thing is that you should find a way to stream the output from that file rather than reading all its bytes and printing them.
Finally, I don't know if print() in Python is interpreting that as a string, but that can be problematic for binary data. This again is solved by piping a stream directly to the output.
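To make the three points above concrete, here is a minimal sketch. It assumes the script runs under CGI; `send_pdf` is a hypothetical helper name, and the Content-Length header is an optional addition that is not in the original code. It writes the headers as bytes, ends them with a blank line, and copies the file to the output in chunks instead of printing it:

```python
import shutil
from pathlib import Path


def send_pdf(pdf_path, out):
    """Write a CGI-style response for pdf_path to the binary stream out."""
    pdf_path = Path(pdf_path)
    size = pdf_path.stat().st_size
    # Each header ends with \r\n; the final bare \r\n is the blank line
    # that marks where the headers stop and the body starts.
    headers = (
        "Content-Type: application/pdf\r\n"
        f"Content-Length: {size}\r\n"
        "\r\n"
    )
    out.write(headers.encode("ascii"))
    # Copy the file in chunks rather than loading it all into memory,
    # and never pass the bytes through print().
    with pdf_path.open("rb") as src:
        shutil.copyfileobj(src, out)
    out.flush()
```

Under CGI this would probably be called as send_pdf("url/to/file.pdf", sys.stdout.buffer), since sys.stdout.buffer is the binary stream underneath print().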

Figured out the answer to my own question if anyone else needs to know:
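from pathlib import Path
from sys import stdout
pdf = Path("url/to/file.pdf")
# end="" stops print() from adding a stray newline after the blank line that ends the headers
print("Content-type: application/pdf\r\n\r\n", end="")
stdout.flush()  # flush the text layer before writing raw bytes
stdout.buffer.write(pdf.read_bytes())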

I really don't know what your question is about. It looks like the code is unrelated to the question. It seems like it's supposed to find a PDF in the local file system (built-in open?) and then print its content(?). Do you use some framework (Flask/Django)?
If by dynamic serving you mean dynamic creation of a PDF based on some template:
A PDF may be constructed from some markup language like HTML or a TeX file (LaTeX is unfortunately a complicated system and isn't well suited to being deployed with a web app).
The markup language file may in turn be rendered by template software (Jinja2, Django's built-in templates).
https://weasyprint.org/ is a library that converts HTML + CSS to PDF.
P.S. Add more context.

Related

Python 3 - Data mining from PDF

I'm working on a project that requires obtaining data from some PDF documents.
Currently I'm using the Foxit toolkit (calling it from the script) to convert the document to txt and then I iterate through it. I'm pretty happy with it, but $100 is just something I can't afford for such a small project.
I've tested all the free converters that I could find (like xpdf, pdftotext) but they just don't cut it; they mess up the format in a way that I can't use the words to locate the data.
I've tried some Python modules like pdfminer but they don't seem to work well in Python 3.
I can't get the data before it's converted to PDF because I get them from a phone carrier.
I'm looking for a way of getting the data from the PDF, or a converter that at least follows the newlines properly.
Update:
PyPDF2 is not grabbing any text whatsoever from the PDF document.

PyPDF2 seems to be the best one available for Python 3.
It's well documented and the API is simple to use.
It can also work with encrypted files, retrieve metadata, merge documents, etc.
A simple use case for extracting the text:
from PyPDF2 import PdfFileReader
# Note: PyPDF2 3.0+ renames these to PdfReader and page.extract_text()
with open("test.pdf", "rb") as f:
    ipdf = PdfFileReader(f)
    text = [p.extractText() for p in ipdf.pages]

Sadly, I don't believe there is a good free Python PDF converter. However, pdf2htmlEX, although it is not a Python module, works extremely well and gives you much more structured data (HTML) than a simple text file. From there you can use Python tools such as Beautiful Soup to scrape the HTML file.
Link: http://coolwanglu.github.io/pdf2htmlEX/
Hope this helps.

Here is an example using PyPDF2:
from PyPDF2 import PdfFileReader
# strict=False keeps correctable problems in the file from being fatal
with open("FileName", "rb") as pdfFileObj:
    pdfReader = PdfFileReader(pdfFileObj, strict=False)
    data = [page.extractText() for page in pdfReader.pages]
More information on PyPDF2 here.

I had the same problem when I wanted to do some deep inspection of PDFs for security analysis - I had to write my own utility that parses the low-level objects and literals, unpacks streams, etc so I could get at the "raw data":
https://github.com/opticaliqlusion/pypdf
It's not a feature-complete solution, but it is meant to be used in a pure Python context where you can define your own visitors to iterate over all the streams, text, id nodes, etc. in the PDF tree:
class StreamIterator(PdfTreeVisitor):
    '''For deflating (not crossing) the streams'''
    def visit_stream(self, node):
        print(node.value)
...
StreamIterator().visit(tree)
Anyhow, I don't know if this is the kind of thing you were looking for, but I used it to do some security analysis when looking at suspicious email attachments.
Cheers!

How to edit and save existing HTML using Python?

I'm trying to write a program that lets someone edit HTML by answering Python 'input()' questions. For example: change a paragraph from the command line in Python. Is there some sort of library I can use to read HTML, then edit and save it?

Since an HTML file is just a plain text file, Python can open it without any extra libraries. Just open the file, edit what you need, and write it back.
Check out the following link:
http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python
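As a rough illustration of that open, edit, write cycle, here is a minimal sketch using only the standard library. The function name `replace_text_in_html` is made up for this example, and it assumes the replacement is plain text typed by the user (so it is HTML-escaped) and that the file is UTF-8:

```python
from html import escape
from pathlib import Path


def replace_text_in_html(path, old_text, new_text):
    """Swap old_text for new_text in an HTML file; return how many times it matched."""
    html_file = Path(path)
    html = html_file.read_text(encoding="utf-8")
    count = html.count(old_text)
    if count:
        # Escape the new text so characters like < and & stay literal.
        updated = html.replace(old_text, escape(new_text))
        html_file.write_text(updated, encoding="utf-8")
    return count
```

The two strings could come straight from input() prompts. For anything beyond simple text swaps (for example, picking a paragraph by its position or id), an HTML parser would be a safer choice than string replacement.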

create pdf from python

I'm looking to generate PDFs from a Python application.
They start relatively simple but some may become more complex (essentially letter-like documents, but they will later include watermarks, for example).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and a file at the end of it, I want to avoid complex libraries that may not do entirely what I want. Some seem to have bit-rotted and are no longer supported (pypdf and pypdf2). That's especially so when I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this, as the production environment is a Windows 2008 server (dev is Ubuntu 12.04), and making something and then converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?

borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
But really, you should probably just use a library.
Thanks to @LukasGraf for providing this link, http://www.gnupdf.org/Introduction_to_PDF, which shows how to create a simple hello world PDF from scratch.
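The offset bookkeeping described above is easier to do in code than by hand. Here is a minimal sketch of a one-page PDF assembled from plain strings, with the xref offsets and the "startxref" number computed as the file is built. The function name `build_minimal_pdf`, the page size, and the Helvetica font choice are assumptions for illustration, and the text is assumed to fit in Latin-1:

```python
def build_minimal_pdf(text="Hello, world"):
    """Assemble a one-page PDF by hand and return it as bytes."""
    # Backslashes and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_offset = len(out)  # this is the number that follows "startxref"
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the result with open("hello.pdf", "wb") should give a file that common viewers can open. Watermarks, images, and compressed streams add more bookkeeping on top of this, which is where a library starts to pay off.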

As long as you're working in Python 2.7, ReportLab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general, hopefully the learning curve won't be too steep.

I recommend you use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library, check the docs here.

Using Python FTP Server (pyftpdlib) - create HTML file on log in

I have an FTP server working great using Python and the pyftpdlib library (https://code.google.com/p/pyftpdlib/). I would like to, on login (either anonymous or user), create an HTML file reflecting the latest state of the server in a nice-looking way: for example, all the files that are on the server and their properties, neatly separated. I thought that since I was already doing everything in Python, and my HTML wouldn't be overly complex, I would just have Python write the HTML file on login, and then the user could open the HTML file for the information that was written seconds before.
My main problem is that when I override the "public callbacks" section of handlers.py (or any section so far), no file is created that I can find. I am new to Python, but it seems like a modification in the handlers.py file should affect the Handler class. Another idea I plan on trying is to override the handler base class with my "on_login" function that does create the HTML file.
What I am really asking for is
1) Advice from anybody who has done/tried this before
2) Any red flags that are going off in your head regarding my plan
3) Any other ideas (ideally strictly using python)
Thanks!

What worked for me was not editing the handlers.py file, but rather creating my own subclass (myFTPHandler) and then redefining the on_connect method to write my HTML file at that point.
Thanks for the help though!
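For the part that actually writes the page, here is a minimal sketch using only the standard library. The function name `write_listing_html`, the table columns, and the `index.html` file name are assumptions for illustration; the idea is that the overridden callback in the handler subclass would call it with the directory being served:

```python
import html
import os
import time


def write_listing_html(root, out_name="index.html"):
    """Write an HTML table of the files in root and return the output path."""
    rows = []
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        # Skip subdirectories and the listing page itself.
        if not entry.is_file() or entry.name == out_name:
            continue
        info = entry.stat()
        modified = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(info.st_mtime))
        rows.append(
            f"<tr><td>{html.escape(entry.name)}</td>"
            f"<td>{info.st_size}</td><td>{modified}</td></tr>"
        )
    page = (
        "<!DOCTYPE html>\n<html><body><table>\n"
        "<tr><th>Name</th><th>Size (bytes)</th><th>Modified</th></tr>\n"
        + "\n".join(rows)
        + "\n</table></body></html>\n"
    )
    out_path = os.path.join(root, out_name)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(page)
    return out_path
```

Keeping the HTML generation in a plain function like this means it can be tested without starting the FTP server, and the handler subclass only needs a one-line call in whichever callback ends up being used.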

GAE Python how to check file type on upload

So, I'm trying to create a Google App Engine (Python) app that allows people to share files. I have file uploads working well, but my concern is about checking the file extension and making sure, primarily, that the files are read only, and secondly, that they are of the file type that is specified. These will not be image files, as I know there are a lot of image resources already. Specifically, they are .stl mesh files, but I'd like to be able to do this more generally.
I know there are modules that can do this (python-magic seems to be able to, for example), but I can't seem to find any that I'm able to import without hitting LoadModuleRestricted. I'm considering writing my own parser, but that would be a lot of work for such a common (I'm assuming) issue.
Anyway, I'm totally stumped, so this is my first Stack Overflow question. I hope I'm doing well etiquette-wise. Let me know, and thanks!

It sounds like you want to read the first few bytes of the uploaded file to verify that its signature matches the purported MIME type. Assuming that you're uploading to Blobstore (i.e., via a URL obtained from blobstore.get_upload_url()), then once you're redirected to the upload handler whose path you gave to get_upload_url, you can open the blob using a BlobReader, then read and verify the signature.
The Blobstore sample app lays out the framework. You'd glue in code in UploadHandler once you have blob_info (using blob_info.key() to open the blob).
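For the .stl case specifically, the signature check could look roughly like the sketch below. The function name `looks_like_stl` is hypothetical, and it assumes the handler can supply the first bytes of the blob (for example from a BlobReader read) plus the total upload size. It relies on the binary STL layout (an 80-byte header, a little-endian uint32 triangle count, then 50 bytes per triangle) and on ASCII STL files starting with the keyword "solid":

```python
import struct


def looks_like_stl(head, total_size):
    """Guess whether an upload is an STL mesh from its first bytes and size."""
    # Binary STL: the triangle count at byte 80 must agree with the file size.
    # This is checked first because binary headers may also start with "solid".
    if len(head) >= 84:
        (triangles,) = struct.unpack_from("<I", head, 80)
        if total_size == 84 + 50 * triangles:
            return True
    # ASCII STL: plain text that begins with the keyword "solid".
    return head.lstrip().startswith(b"solid")
```

This only needs the struct module from the standard library, so it avoids the restricted-import problem. It is a heuristic, not a full parser: a file that passes could still contain a malformed mesh.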
