Django view to convert URL to PDF with PyQt

I am trying to write a Django view that will return a PDF of a URL.
I'm using PyQt's QWebView.print_ to create the PDF, but I am unsure how to pass the PDF to the Django response. I've tried QBuffer, but I can't seem to get it right.
Here is my view so far:
def pdf(request):
    app = QApplication(sys.argv)
    bufferPdf = QBuffer()
    bufferPdf.open(QBuffer.ReadWrite)
    web = QWebView()
    web.load(QUrl("http://www.google.com"))  # the desired url.
    printer = QPrinter()
    printer.setPageSize(QPrinter.Letter)
    printer.setOrientation(QPrinter.Landscape)
    printer.setOutputFormat(QPrinter.PdfFormat)
    printer.setOutputFileName("file.pdf")
    def convertIt():
        web.print_(printer)
        print "Pdf generated"
        QApplication.exit()
    QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)
    bufferPdf.seek(0)
    result = bufferPdf.readData(0)
    bufferPdf.close()
    sys.exit(app.exec_())
    response = HttpResponse(result, mimetype='application/pdf')
    response['Content-Disposition'] = 'attachment; filename=coupon.pdf'
    return response
Thanks in advance.

The accepted solution from ekhumoro is incorrect. It provides code that will run from the command line, but it cannot work inside a Django view.
Many people have noted that it is not easy, and possibly outright impossible, to combine Django with a threaded Qt application. The error that you are seeing is a classic example of what happens when you attempt to do so.
In my own projects I tried many different ways of organizing and grouping the code, and never found a solution. The issue seems to be (I am not a Qt expert, so if anyone has more information please correct me) that event-driven Qt applications (anything using WebKit relies on the Qt event model) are built around what is effectively a singleton "QApplication". You cannot control when this sub-application will quit or when its various resources are reaped. As a result, any multi-threaded application using the library needs to manage its resources very carefully, which is something you have no control over while handling web requests.
One possible (messy and unprofessional) solution would be to create a script that accepts command-line arguments and then invoke that script from within Django as a sub-process. You would use a temporary file for the output and then load it into your application. Once the file has been read, you purge it from disk. Messy, but effective.
I personally would love to hear from anyone who definitively knows either why this is so hard, or a proper solution. There are literally dozens of threads here on Stack Overflow with incorrect or incomplete explanations of how to approach this problem.
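The sub-process approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name render_pdf_in_subprocess, the script_path argument, and the convention that the converter script is called as "python script.py URL OUTPUT_FILE" are all hypothetical and not part of any existing code.

```python
import os
import subprocess
import sys
import tempfile


def render_pdf_in_subprocess(url, script_path, timeout=60):
    """Run a converter script in a separate process and return the PDF bytes.

    Assumption: the (hypothetical) script at script_path takes the URL and
    an output file name as its two command-line arguments.
    """
    # Reserve a temporary file name for the child process to write to.
    fd, out_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        # The Qt event loop lives and dies inside the child process,
        # so the web server process never creates a QApplication.
        subprocess.check_call(
            [sys.executable, script_path, url, out_path], timeout=timeout)
        with open(out_path, "rb") as pdf_file:
            return pdf_file.read()
    finally:
        # Purge the temporary file whether or not the conversion worked.
        os.remove(out_path)
```

Inside a view, the returned bytes would then be handed to HttpResponse in the same way the question does with its result variable.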

Here's a re-write of your example that should do what you want:
import sys
from PyQt4 import QtCore, QtGui, QtWebKit
class WebPage(QtWebKit.QWebPage):
    def __init__(self):
        QtWebKit.QWebPage.__init__(self)
        self.printer = QtGui.QPrinter()
        self.printer.setPageSize(QtGui.QPrinter.Letter)
        self.printer.setOrientation(QtGui.QPrinter.Landscape)
        self.printer.setOutputFormat(QtGui.QPrinter.PdfFormat)
        self.mainFrame().loadFinished.connect(self.handleLoadFinished)
    def start(self, url):
        self.mainFrame().load(QtCore.QUrl(url))
        QtGui.qApp.exec_()
    def handleLoadFinished(self):
        temp = QtCore.QTemporaryFile(
            QtCore.QDir.temp().filePath('webpage.XXXXXX.pdf'))
        # must open the file to get the filename.
        # file will be automatically deleted later
        temp.open()
        self.printer.setOutputFileName(temp.fileName())
        # ensure that the file can be written to
        temp.close()
        self.mainFrame().print_(self.printer)
        temp.open()
        self.pdf = temp.readAll().data()
        QtGui.qApp.quit()
def webpage2pdf(url):
    if not hasattr(WebPage, 'app'):
        # can only have one QApplication, and it must be created first
        WebPage.app = QtGui.QApplication(sys.argv)
    webpage = WebPage()
    webpage.start(url)
    return webpage.pdf
if __name__ == '__main__':
    if len(sys.argv) > 1:
        url = sys.argv[1]
    else:
        url = 'http://www.google.com'
    result = webpage2pdf(url)
    response = HttpResponse(result, mimetype='application/pdf')
    response['Content-Disposition'] = 'attachment; filename=coupon.pdf'
    # do stuff with response...

Related

Why is a website's response in python's `urllib.request` different to a request sent directly from a web-browser?

I have a program that takes a URL and gets a response from the server using urllib.request. It all works fine, but I tested it a little more and realised that when I put in a URL such as http://google.com into my browser, I got a different page (which had a doodle and a science fair promotion etc.) but with my program it was just plain Google with nothing special on it.
It is probably due to redirection, but if the request from my program goes through the same router and DNS, surely the output should be exactly the same?
Here is the code:
"""
This is a simple browsing widget that handles user requests, with the
added condition that all proxy settings are ignored. It outputs in the
default web browser.
"""
# This imports some necessary libraries.
import tkinter as tk
import webbrowser
from tempfile import NamedTemporaryFile
import urllib.request
def parse(data):
    """
    Removes junk from the data so it can be easily processed.
    :rtype : list
    :param data: A long string of compressed HTML.
    """
    data = data.decode(encoding='UTF-8')  # This makes data workable.
    lines = data.splitlines()  # This clarifies the lines for writing.
    return lines
class Browser(object):
    """This creates an object for getting a direct server response."""
    def __init__(self, master):
        """
        Sets up a direct browsing session and a GUI to manipulate it.
        :param master: Any Tk() window in which the GUI is displayable.
        """
        # This creates a frame within which widgets can be stored.
        frame = tk.Frame(master)
        frame.pack()
        # Here we create a handler that ignores proxies.
        proxy_handler = urllib.request.ProxyHandler(proxies=None)
        self.opener = urllib.request.build_opener(proxy_handler)
        # This sets up components for the GUI.
        tk.Label(frame, text='Full Path').grid(row=0)
        self.url = tk.Entry(frame)  # This takes the specified path.
        self.url.grid(row=0, column=1)
        tk.Button(frame, text='Go', command=self.browse).grid(row=0, column=2)
        # This binds the return key to calling the method self.browse.
        master.bind('<Return>', self.browse)
    def navigate(self, query):
        """
        Gets raw data from the queried server, ready to be processed.
        :rtype : str
        :param query: The request entered into 'self.url'.
        """
        # This contacts the domain and parses its response.
        response = self.opener.open(query)
        html = response.read()
        return html
    def browse(self, event=None):
        """
        Wraps all functionality together for data reading and writing.
        :param event: The argument from whatever calls the method.
        """
        # This retrieves the input given by the user.
        location = self.url.get()
        print('\nUser inputted:', location)
        # This attempts to access the server and gives any errors.
        try:
            raw_data = self.navigate(location)
        except Exception as e:
            print(e)
        # This executes assuming there are no errors.
        else:
            clean_data = parse(raw_data)
            # This creates and executes a temporary HTML file.
            with NamedTemporaryFile(suffix='.html', delete=False) as cache:
                cache.writelines(line.encode('UTF-8') for line in clean_data)
                webbrowser.open_new_tab(cache.name)
            print('Done.')
def main():
    """Using a main function means not doing everything globally."""
    # This creates a window that is always in the foreground.
    root = tk.Tk()
    root.wm_attributes('-topmost', 1)
    root.title('DirectQuery')
    # This starts the program.
    Browser(root)
    root.mainloop()
# This allows for execution as well as for importing.
if __name__ == '__main__':
    main()
Note: I don't know whether this has something to do with the fact that it is instructed to ignore proxies. My computer doesn't have any proxy settings turned on, by the way. Also, if there is a way to get the same response/output as a web browser such as Chrome would, I would love to hear it.

To answer your general question, you need to understand how the web site in question operates, so this isn't really a Python question. Web sites frequently detect the browser's "make and model" with special detection code, often (as indicated in the comment on your question) starting with the User-Agent HTTP header.
It would therefore make sense for Google's home page not to include any JavaScript-based functionality if the User-Agent identifies the client as a program.
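As a rough illustration of the User-Agent point, a request can be built with a browser-like header using urllib.request. The helper names and the user-agent string below are assumptions chosen for the example, and a site may still vary its response based on cookies, other headers, or JavaScript that urllib never runs.

```python
import urllib.request

# Assumption: an example browser-like string; any real browser's value would do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_browser_like_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that announces itself with a browser-style User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(opener, url):
    """Open the URL through an existing opener (such as self.opener) and return the raw bytes."""
    with opener.open(build_browser_like_request(url)) as response:
        return response.read()
```

In the question's Browser class, navigate() could call fetch(self.opener, query) instead of passing the bare URL string to self.opener.open().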

PyQt4 QWebView external resource content

class Browser(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        self.loadFinished.connect(self._result_available)
        self.loadStarted.connect(self._load_started)
        self.page().frameCreated.connect(self.onFrame)
    # ...
browser = Browser()
browser.setHtml('<html>...</html>', baseUrl=QUrl('http://www.google.com/'))
After that, I need to capture the content of all external resources loaded by the QWebView, that is, the content of all CSS/JavaScript files. How can I do that? Related questions: question 1, question 2
I know I need to use QNetworkAccessManager somehow, but I don't have an example to work from.

We need to create a custom QNetworkReply class and collect the results in its readyRead handler.

How to self-handling cookies in PyObjC

I'm implementing a minimal browser in PyObjC for my studies.
First, I searched for how to use WebKit from PyObjC and wrote the code below:
#coding: utf-8
import Foundation
import WebKit
import AppKit
import objc
def main():
    app = AppKit.NSApplication.sharedApplication()
    rect = Foundation.NSMakeRect(100,350,600,800)
    win = AppKit.NSWindow.alloc()
    win.initWithContentRect_styleMask_backing_defer_(
        rect,
        AppKit.NSTitledWindowMask |
        AppKit.NSClosableWindowMask |
        AppKit.NSResizableWindowMask |
        AppKit.NSMiniaturizableWindowMask,
        AppKit.NSBackingStoreBuffered,
        False)
    win.display()
    win.orderFrontRegardless()
    webview = WebKit.WebView.alloc()
    webview.initWithFrame_(rect)
    pageurl = Foundation.NSURL.URLWithString_("http://twitter.com")
    req = Foundation.NSURLRequest.requestWithURL_(pageurl)
    webview.mainFrame().loadRequest_(req)
    win.setContentView_(webview)
    app.run()
if __name__ == '__main__':
    main()
It worked fine. But I noticed that this browser shares cookies with Safari. I want it to be independent of Safari.app.
So I searched again and learned that I can override the cookie handling by using NSMutableURLRequest.
Below is the second version I tested:
#coding: utf-8
import Foundation
import WebKit
import AppKit
import objc
def main():
    app = AppKit.NSApplication.sharedApplication()
    rect = Foundation.NSMakeRect(100,350,600,800)
    win = AppKit.NSWindow.alloc()
    win.initWithContentRect_styleMask_backing_defer_(
        rect,
        AppKit.NSTitledWindowMask |
        AppKit.NSClosableWindowMask |
        AppKit.NSResizableWindowMask |
        AppKit.NSMiniaturizableWindowMask,
        AppKit.NSBackingStoreBuffered,
        False)
    win.display()
    win.orderFrontRegardless()
    webview = WebKit.WebView.alloc()
    webview.initWithFrame_(rect)
    pageurl = Foundation.NSURL.URLWithString_("http://twitter.com")
    req = Foundation.NSMutableURLRequest.requestWithURL_(pageurl)
    Foundation.NSMutableURLRequest.setHTTPShouldHandleCookies_(req, False)
    webview.mainFrame().loadRequest_(req)
    win.setContentView_(webview)
    app.run()
if __name__ == '__main__':
    main()
This code shows me the Twitter login screen :-)
But I couldn't log in to Twitter with this browser.
I entered the account name and password and pressed Enter. The browser then displayed the timeline of the account I always use in Safari.app.
Yes, I know that this is the expected result.
I didn't write anything to handle cookies.
And that is the point of my question.
I want to know:
How can I implement and use something like NSHTTPCookieStorage?
Can I write it in python?
Thank you.

To start with the easy part: if it is possible to do this in Objective-C, it should also be possible with PyObjC.
That said, it is unclear to me whether this is possible at all. The question "How can I have multiple instances of webkit without sharing cookies?" seems to indicate that it isn't, although you might be able to do something through the WebKit delegate.
Another alternative is to use NSURLProtocol: register a custom NSURLProtocol class for handling http/https requests and implement it using Python's urllib or urllib2. The PyDocURL example shows how to do this (that example registers a subclass for pydoc:// URLs).
More information on NSURLConnection is on Apple's website.
Updated with an implementation hint:
An alternate method might be to disable cookie storage by NSHTTPCookieStorage (NSHTTPCookieStorage.sharedHTTPCookieStorage().setCookieAcceptPolicy_(NSHTTPCookieAcceptPolicyNever)). Then use the WebKit resource loading delegate to handle cookies yourself:
- Maintain your own cookie store (possibly using a class in urllib2).
- In webView:resource:willSendRequest:redirectResponse:fromDataSource: add cookie headers based on information in that store.
- In webView:resource:didReceiveResponse:fromDataSource: check for "set-cookie" headers and update your own cookie store.
It shouldn't be too hard to do this, and I'd love to have this functionality as an example on the PyObjC website (or even as a utility class in the WebKit bindings for PyObjC).
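The "maintain your own cookie store" step could look something like the pure-Python sketch below. It is only an assumption-laden illustration: the class name SimpleCookieStore and its methods are hypothetical, it uses Python 3's http.cookies rather than urllib2, and it deliberately ignores path, expiry, Secure, and domain-matching rules. The delegate methods named above would call it; no PyObjC wiring is shown.

```python
from http.cookies import SimpleCookie


class SimpleCookieStore:
    """Toy per-host cookie store (hypothetical helper, not a PyObjC API)."""

    def __init__(self):
        # Maps host name -> {cookie name: cookie value}.
        self._cookies = {}

    def update_from_response(self, host, set_cookie_headers):
        """Record the cookies found in a list of Set-Cookie header values."""
        jar = self._cookies.setdefault(host, {})
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                jar[name] = morsel.value

    def cookie_header_for(self, host):
        """Build the value of the Cookie request header for the given host."""
        jar = self._cookies.get(host, {})
        return "; ".join("%s=%s" % (name, value) for name, value in jar.items())
```

The didReceiveResponse delegate would feed update_from_response(), and the willSendRequest delegate would copy cookie_header_for() into the outgoing request's Cookie header.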

How to set the default open path for a Gtk.FileChooserWidget?

If I set the current folder via Gtk.FileChooserWidget.set_current_folder(), the first time I open the file chooser it opens at the location passed to set_current_folder().
But if I select a file and then re-open the file chooser, it opens on the "most_recent_used_files".
I'd like it to open at the folder of the last selected file.
How can I do this?
Thank you.

From the docs:
Old versions of the file chooser's documentation suggested using gtk_file_chooser_set_current_folder() in various situations, with the intention of letting the application suggest a reasonable default folder. This is no longer considered to be a good policy, as now the file chooser is able to make good suggestions on its own. In general, you should only cause the file chooser to show a specific folder when it is appropriate to use gtk_file_chooser_set_filename() - i.e. when you are doing a File/Save As command and you already have a file saved somewhere.
You may or may not like the reasoning for this behavior. If you're curious about how it came about, see File chooser recent-files in the mailing list and Help the user choose a place to put a new file on the GNOME wiki.

Setting the current folder each time works for me, but it is a little tricky. I'm using Gtk 3.14 and Python 2.7.
You have to get the filename before resetting the directory, or it's lost, and the current directory may be None, so you have to check for that.
This code is tested on Debian jessie and Windows 7.
import os.path as osp
from gi.repository import Gtk
class FileDialog(Gtk.FileChooserDialog):
    def __init__(self, parent, title):
        Gtk.FileChooserDialog.__init__(self, title, parent)
        self.add_button(Gtk.STOCK_CANCEL, Gtk.ResponseType.CANCEL)
        self.add_button(Gtk.STOCK_OPEN, Gtk.ResponseType.OK)
        self.set_current_folder(osp.abspath('.'))
    def __call__(self):
        resp = self.run()
        self.hide()
        fname = self.get_filename()
        d = self.get_current_folder()
        if d:
            self.set_current_folder(d)
        if resp == Gtk.ResponseType.OK:
            return fname
        else:
            return None
class TheApp(Gtk.Window):
    def on_clicked(self, w, dlg):
        fname = dlg()
        print fname if fname else 'canceled'
    def __init__(self):
        Gtk.Window.__init__(self)
        self.connect('delete_event', Gtk.main_quit)
        self.set_resizable(False)
        dlg = FileDialog(self, 'Your File Dialog, Sir.')
        btn = Gtk.Button.new_with_label('click here')
        btn.connect('clicked', self.on_clicked, dlg)
        self.add(btn)
        btn.show()
if __name__ == '__main__':
    app = TheApp()
    app.show()
    Gtk.main()

How do I shut down PyQt's QtApplication correctly?

I don't know the first thing about Qt, but I'm trying to be cheeky and borrow code from elsewhere (http://lateral.netmanagers.com.ar/weblog/posts/BB901.html#disqus_thread). ;)
I have a problem. When I run test() the first time, everything works swimmingly. However, when I run it the second time, I get nasty segfaults. I suspect that the problem is that I'm not ending the qt stuff correctly. What should I change about this program to make it work multiple times? Thanks in advance!
from PyQt4 import QtCore, QtGui, QtWebKit
import logging
logging.basicConfig(level=logging.DEBUG)
class Capturer(object):
    """A class to capture webpages as images"""
    def __init__(self, url, filename, app):
        self.url = url
        self.app = app
        self.filename = filename
        self.saw_initial_layout = False
        self.saw_document_complete = False
    def loadFinishedSlot(self):
        self.saw_document_complete = True
        if self.saw_initial_layout and self.saw_document_complete:
            self.doCapture()
    def initialLayoutSlot(self):
        self.saw_initial_layout = True
        if self.saw_initial_layout and self.saw_document_complete:
            self.doCapture()
    def capture(self):
        """Captures url as an image to the file specified"""
        self.wb = QtWebKit.QWebPage()
        self.wb.mainFrame().setScrollBarPolicy(
            QtCore.Qt.Horizontal, QtCore.Qt.ScrollBarAlwaysOff)
        self.wb.mainFrame().setScrollBarPolicy(
            QtCore.Qt.Vertical, QtCore.Qt.ScrollBarAlwaysOff)
        self.wb.loadFinished.connect(self.loadFinishedSlot)
        self.wb.mainFrame().initialLayoutCompleted.connect(
            self.initialLayoutSlot)
        logging.debug("Load %s", self.url)
        self.wb.mainFrame().load(QtCore.QUrl(self.url))
    def doCapture(self):
        logging.debug("Beginning capture")
        self.wb.setViewportSize(self.wb.mainFrame().contentsSize())
        img = QtGui.QImage(self.wb.viewportSize(), QtGui.QImage.Format_ARGB32)
        painter = QtGui.QPainter(img)
        self.wb.mainFrame().render(painter)
        painter.end()
        img.save(self.filename)
        self.app.quit()
def test():
    """Run a simple capture"""
    app = QtGui.QApplication([])
    c = Capturer("http://www.google.com", "google.png", app)
    c.capture()
    logging.debug("About to run exec_")
    app.exec_()
DEBUG:root:Load http://www.google.com
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
Process Python segmentation fault (this last line comes from emacs)

You need to handle the QApplication outside of the test functions, sort of like a singleton (it's actually appropriate here).
What you can do is check whether QtCore.qApp is already set (or whether QApplication.instance() returns None) and create your QApplication only if it does not exist yet; otherwise, use the global one.
It will not be destroyed after your test() function since PyQt stores the app somewhere.
If you want to be sure it's handled correctly, just set up a lazily initialized singleton for it.
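A lazily initialized singleton along those lines might look like the sketch below. The helper name get_qapp and the module-level _app variable are assumptions made for illustration. The application class is passed in as a parameter, and the only things assumed about it are an instance() static method (which QApplication provides) and a constructor that takes an argument list.

```python
_app = None  # module-level reference so the application object stays alive


def get_qapp(app_class, argv=None):
    """Return the one application object, creating it only on first use.

    app_class is expected to behave like QtGui.QApplication: its
    instance() static method returns the existing object or None.
    """
    global _app
    if _app is None:
        existing = app_class.instance()
        if existing is not None:
            # Something else already created the application; reuse it.
            _app = existing
        else:
            _app = app_class(argv if argv is not None else [])
    return _app
```

With PyQt4, test() would then start with app = get_qapp(QtGui.QApplication) instead of constructing a new QtGui.QApplication([]) on every call.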

A QApplication should only be initialized once!
It can be used by as many Capture instances as you like, but you should start them in the mainloop.
See: https://doc.qt.io/qt-4.8/qapplication.html
You could also try "del app" after "app.exec_", but I am unsure about the results.
(Your original code runs fine on my system)
I would use urllib instead of webkit:
import urllib
class Capturer:
    def capture(self, s_url, s_filename):
        s_file_out, httpmessage = urllib.urlretrieve(s_url, s_filename, self.report)
    def report(self, i_count, i_chunk, i_size):
        print('retrieved %5d of %5d bytes' % (i_count * i_chunk, i_size))
def test():
    c = Capturer()
    c.capture("http://www.google.com/google.png", "google1.png")
    c.capture("http://www.google.com/google.png", "google2.png")
if __name__ == '__main__':
    test()
