How to get the html dom of a webpage and its frames

How to get the html dom of a webpage and its frames - python

I would like to get the DOM of a website after js execution.
I would also like to get all the content of the iframes in the website, similarly to what I have in Google Chrome's Inspect Element feature.
This is my code:
import sys
from PyQt4 import QtGui, QtCore, QtWebKit
class Sp():
def save(self):
print ("call")
data = self.webView.page().currentFrame().documentElement().toInnerXml()
print(data.encode('utf-8'))
print ('finished')
def main(self):
self.webView = QtWebKit.QWebView()
self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))
QtCore.QObject.connect(self.webView,QtCore.SIGNAL("loadFinished(bool)"),self.save)
app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())
This gives me the html of the website, but not the html inside the iframes. Is there any way that I could get the HTML of the iframes.

This is a very hard problem to solve in general.
The main difficulty is that there is no way to know in advance how many frames each page has. And in addition to that, each child-frame may have its own set of frames, the number of which is also unknown. In theory, there could be an infinite number of nested frames, and the page will never finish loading (which seems no exaggeration for sites that have a lot of ads).
Anyway, below is a version of your script which gets the top-level QWebFrame object of each frame as it loads, and shows how you can access some of the things you are interested in. As you will see from the output, there are a lot of "junk" frames inserted by ads and such like that you will somehow need to filter out.
import sys, signal
from PyQt4 import QtGui, QtCore, QtWebKit
class Sp():
def save(self, ok, frame=None):
if frame is None:
print ('main-frame')
frame = self.webView.page().mainFrame()
else:
print('child-frame')
print('URL: %s' % frame.baseUrl().toString())
print('METADATA: %s' % frame.metaData())
print('TAG: %s' % frame.documentElement().tagName())
print()
def handleFrameCreated(self, frame):
frame.loadFinished.connect(lambda: self.save(True, frame=frame))
def main(self):
self.webView = QtWebKit.QWebView()
self.webView.page().frameCreated.connect(self.handleFrameCreated)
self.webView.page().mainFrame().loadFinished.connect(self.save)
self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))
signal.signal(signal.SIGINT, signal.SIG_DFL)
print('Press Crtl+C to quit\n')
app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())
NB: it is important that you connect to the loadFinished signal of the main frame rather than the web-view. If you connect to the latter, it will be called multiple times if the page contains more than one frame.

Related

How to scrape several websites with pyqt4, scope change?

I would like to scrape two websites in java for links using PyQt4.QtWebKit to render the pages and then get the desired links. The code works fine with one page or url, but stops (but continues running until force quit) after printing the links of the first website. It seems the scope stays in the event loop of the render class. How can I get the program to change scope and continue with the for loop and rendering the second website? Using exit() in _loadFinished method just quits the program after the first iteration. Maybe the python app has to close and reopen to render the next page, which is impossible because the app is opened/reopened outside of the program?
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtGui
from lxml import html
class Render(QWebPage):
def __init__(self, url):
self.frame = None
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
def _loadFinished(self, result):
self.frame = self.mainFrame()
result = self.frame.toHtml()
formatted_result = str(result)
tree = html.fromstring(formatted_result)
archive_links = tree.xpath('//div/div/a/#href')[0:4]
print(archive_links)
urls = ['http://pycoders.com/archive/', 'http://www.pythonjobshq.com']
def main(urls):
app = QtGui.QApplication(sys.argv)
for url in urls:
r = Render(url)
#s = Render(urls[1]) #The pages can be rendered parallel, but rendering more than a handful of pages a the same time is a bad idea
sys.exit(app.exec_())
if __name__ == '__main__':
main(urls)
Thankful for any help!

PyQt4 QWebview error after initial evaluateJavaScript() call

I've got a Python 2.7 application running with PyQt4 that has a QWebView in it, that has two way communication to and from Javascript.
The application is multithreaded via QThreadPool, QRunnables, so I'm communicating with a ViewController class with signals.
When I run the application, the QWebView loads my HTML with external JS and CSS just fine. I'm able to interact with the Javascript functions via the main program thread and ViewController class.
Once the user selects a directory and certain criteria are met, it starts looping through QRunnable tasks one at a time. During that time it calls back to the ViewController -> Javascript via Signal slots, just as expected. The problem is when I'm calling those ViewController methods that execute evaluateJavaScript, I get a Javascript error returned,
undefined line 1: SyntaxError: Parse error
I've done lots of trial error back and forth, but can't seem to figure out why evaluateJavaScript won't run in these instances. I've tried sending simple Javascript calls ranging from test functions that don't accept any arguments (thinking maybe it was some weird encoding issue), to just sending things like Application.main.evaluateJavaScript("alert('foo')"), which normally work outside of the threads. The only other thing I can think of is that maybe self.main.addToJavaScriptWindowObject('view', self.view) needs to be called in the threads again, but I've run a dir() on Application.main and it appears to have the evaluateJavaScript method attached to it already.
Any thoughts on why this could be occurring, when the scope seems to be correct, and the ViewController appears to be communicating just fine to the QWebView otherwise? Answers in Qt C++ will probably work as well, if you've seen this happen before!
I tried to simplify the code for example purposes:
# coding: utf8
import subprocess as sp
import os.path, os, sys, time, datetime
from os.path import basename
import glob
import random
import string
from PyQt4 import QtCore, QtGui, QtWebKit
from PyQt4.QtCore import QObject, pyqtSlot, QThreadPool, QRunnable, pyqtSignal
from PyQt4.QtGui import QApplication, QFileDialog
from PyQt4.QtWebKit import QWebView
from ImportController import *
class Browser(QtGui.QMainWindow):
def __init__(self):
QtGui.QMainWindow.__init__(self)
self.resize(800,500)
self.centralwidget = QtGui.QWidget(self)
self.mainLayout = QtGui.QHBoxLayout(self.centralwidget)
self.mainLayout.setSpacing(0)
self.mainLayout.setMargin(0)
self.frame = QtGui.QFrame(self.centralwidget)
self.gridLayout = QtGui.QVBoxLayout(self.frame)
self.gridLayout.setMargin(0)
self.gridLayout.setSpacing(0)
self.html = QtWebKit.QWebView()
# for javascript errors
errors = WebPage()
self.html.setPage(errors)
self.main = self.html.page().mainFrame()
self.gridLayout.addWidget(self.html)
self.mainLayout.addWidget(self.frame)
self.setCentralWidget(self.centralwidget)
path = os.getcwd()
if self.checkNetworkAvailability() and self.checkApiAvailbility():
self.default_url = "file://"+path+"/View/mainView.html"
else:
self.default_url = "file://"+path+"/View/errorView.html"
# load the html view
self.openView()
# controller class that sends and receives to/from javascript
self.view = ViewController()
self.main.addToJavaScriptWindowObject('view', self.view)
# on gui load finish
self.html.loadFinished.connect(self.on_loadFinished)
# to javascript
def selectDirectory(self):
# This evaluates the directory we've selected to make sure it fits the criteria, then parses the XML files
pass
def evaluateDirectory(self, directory):
if not directory:
return False
if os.path.isdir(directory):
return True
else:
return False
#QtCore.pyqtSlot()
def on_loadFinished(self):
# open directory select dialog
self.selectDirectory()
def openView(self):
self.html.load(QtCore.QUrl(self.default_url))
self.html.show()
def checkNetworkAvailability(self):
#TODO: make sure we can reach the outside world before trying anything else
return True
def checkApiAvailbility(self):
#TODO: make sure the API server is alive and responding
return True
class WebPage(QtWebKit.QWebPage):
def javaScriptConsoleMessage(self, msg, line, source):
print '%s line %d: %s' % (source, line, msg)
class ViewController(QObject):
def __init__(self, parent=None):
super(ViewController, self).__init__(parent)
#pyqtSlot()
def did_load(self):
print "View Loaded."
#pyqtSlot()
def selectDirectoryDialog(self):
# FROM JAVASCRIPT: in case they need to re-open the file dialog
Application.selectDirectory()
def prepareImportView(self, displayPath):
# TO JAVASCRIPT: XML directory parsed okay, so let's show the main
Application.main.evaluateJavaScript("prepareImportView('{0}');".format(displayPath))
def generalMessageToView(self, target, message):
# TO JAVASCRIPT: Send a general message to a specific widget target
Application.main.evaluateJavaScript("receiveMessageFromController('{0}', '{1}')".format(target, message))
#pyqtSlot()
def startProductImport(self):
# FROM JAVASCRIPT: Trigger the product import loop, QThreads
print "### view.startProductImport"
position = 1
count = len(Application.data.products)
importTasks = ProductImportQueue(Application.data.products)
importTasks.start()
#pyqtSlot(str)
def updateProductView(self, data):
# TO JAVASCRIPT: Send product information to view
print "### updateProductView "
Application.main.evaluateJavaScript('updateProductView("{0}");'.format(QtCore.QString(data)) )
class WorkerSignals(QObject):
''' Declares the signals that will be broadcast to their connected view methods '''
productResult = pyqtSignal(str)
class ProductImporterTask(QRunnable):
''' This is where the import process will be fired for each loop iteration '''
def __init__(self, product):
super(ProductImporterTask, self).__init__()
self.product = product
self.count = ""
self.position = ""
self.signals = WorkerSignals()
def run(self):
print "### ProductImporterTask worker {0}/{1}".format(self.position, self.count)
# Normally we'd create a dict here, but I'm trying to just send a string for testing purposes
self.signals.productResult.emit(data)
return
class ProductImportQueue(QObject):
''' The synchronous threadpool that is going to one by one run the import threads '''
def __init__(self, products):
super(ProductImportQueue, self).__init__()
self.products = products
self.pool = QThreadPool()
self.pool.setMaxThreadCount(1)
def process_result(self, product):
return
def start(self):
''' Call the product import worker from here, and format it in a predictable way '''
count = len(self.products)
position = 1
for product in self.products:
worker = ProductImporterTask("test")
worker.signals.productResult.connect(Application.view.updateProductView, QtCore.Qt.DirectConnection)
self.pool.start(worker)
position = position + 1
self.pool.waitForDone()
if __name__ == "__main__":
app = QtGui.QApplication(sys.argv)
Application = Browser()
Application.raise_()
Application.show()
Application.activateWindow()
sys.exit(app.exec_())

You know, I love PyQt4 but after searching and searching, I believe this is actually a bug and not as designed.
I've since moved on and am trying to implement this in CEFPython with WxPython, which seems to have a much more elegant implementation for this specific purpose.

QWebView get response

I have a python code with PySide that has a QWebView that shows google maps.
I just want to get the response each time that I do any request using the QWebView widget.
I have searched info but there is no reference about getting a response with PySide. If you need me to paste some code I will but I just have a simple QWebView widget.
EDIT: You asked me for the code:
from PySide.QtCore import *
from PySide.QtGui import *
import sys
import pyside3
class MainDialog(QMainWindow, pyside3.Ui_MainWindow):
def __init__(self, parent=None):
super(MainDialog,self).__init__(parent)
self.setupUi(self)
token_fb=""
#self.Connect_buttom.clicked.connect(self.get_fb_token)
self.Connect_buttom.clicked.connect(lambda: self.get_fb_token(self.FB_username.text(), self.FB_password.text()))
#self.connect(self.Connect_buttom, SIGNAL("clicked()"), self.get_fb_token)
#Change between locate and hunt
self.MapsButton.clicked.connect(lambda: self.select_page_index(0))
self.HuntButton.clicked.connect(lambda: self.select_page_index(1))
###########################
self.webView.setHtml(URL)
def select_page_index(self, index): # To change between frames
self.Container.setCurrentIndex(index)
I need the response from: self.webView.setHtml(URL) because depending on the response my app has to do one thing or other.

Function QWebView.setHtml() has no response in the sense that it doesn't return anything.
Maybe you want to listen to all links that are clicked and do something custom with it.
web_view = QtWebKit.QWebView()
web_view.page().setLinkDelegationPolicy(QtWebKit.QWebPage.DelegateAllLinks)
web_view.linkClicked.connect(your_handler)
Or maybe you want to do something when loading has finished. This is done by:
web_view = QtWebKit.QWebView()
web_view.loadFinished.connect(your_handler)

PDF with QWebView: missing refresh/repaint after loading

I use the QWebView (python 3.3 + pyside 1.1.2 + Qt 4.8) as FileViewer. Picture, Text, HTML, ... all fine, but PDF has a display problem. I tested two possible ways.
internal pdf viewer: after use webview.load(file) it loads, but
the screen is blank, after loading another file, all works fine, it
shows the file
pdf.js: after use setContent() with filebase, it
loads the webviewer.html/.js with a white page and the loading circle. The
screen only refresh if I resize the form or use the scrollbars, but
then all is fine
I don't find an event for "plugin/javascript finished loading", so I could force a repaint or so.
Here an example code for case 1:
import sys
from PySide import QtCore, QtGui, QtWebKit ##UnusedWildImport
class DialogTest(QtGui.QDialog):
def __init__(self, parent = None):
super(DialogTest, self).__init__(parent)
self.resize(620, 600)
self.setAttribute(QtCore.Qt.WA_DeleteOnClose)
self.PreviewBox = QtWebKit.QWebView()
self.PreviewBox.settings().setAttribute(QtWebKit.QWebSettings.PluginsEnabled, True)
self.PreviewBox.settings().setAttribute(QtWebKit.QWebSettings.WebAttribute.DeveloperExtrasEnabled, True)
self.PreviewBox.settings().setAttribute(QtWebKit.QWebSettings.PrivateBrowsingEnabled, True)
self.PreviewBox.settings().setAttribute(QtWebKit.QWebSettings.LocalContentCanAccessRemoteUrls, True)
self.PreviewBox.loadFinished.connect(self._loadfinished)
self.button_test1 = QtGui.QPushButton("File 1")
self.button_test1.clicked.connect(self._onselect1)
self.button_test2 = QtGui.QPushButton("File 2")
self.button_test2.clicked.connect(self._onselect2)
layout_Buttons = QtGui.QHBoxLayout()
layout_Buttons.addWidget(self.button_test1)
#layout_Buttons.addStretch()
layout_Buttons.addWidget(self.button_test2)
layout_Main = QtGui.QVBoxLayout()
layout_Main.addLayout(layout_Buttons)
layout_Main.addWidget(self.PreviewBox)
self.setLayout(layout_Main)
def Execute(self):
self.show()
self.exec_()
def _onselect1(self):
self.PreviewBox.load(QtCore.QUrl().fromLocalFile("c:\\tmp\\test1.pdf"))
def _onselect2(self):
self.PreviewBox.load(QtCore.QUrl().fromLocalFile("c:\\tmp\\test2.pdf"))
def _loadfinished(self, ok):
#self.PreviewBox.repaint()
pass
app = QtGui.QApplication(sys.argv)
DialogTest().Execute()
Edit: Workaround
Case 1 (webkit plugin) has an otherbug, it takes the focus to itself, so this solution isn't acceptable to me. I played with the pdf.js again and found a workaroud:
self.PreviewBox.setHtml(content, baseUrl = QtCore.QUrl().fromLocalFile(path))
self.PreviewBox.hide()
QtCore.QTimer.singleShot(700, self.PreviewBox.show)
The hide() must be after the content filling and the timer haven't to be too low.
//jay

I just solved a similar problem cleaning the QWebView before every pdf load.
Be careful with the loadFinished() signal.
In your example:
self.PreviewBox.load(QUrl('about:blank'))
or, in case we don't like 'about:blank' this may be a more portable solution:
self.PreviewBox.setHtml('<html><head></head><title></title><body></body></html>')

PyQt QWebKit frame bug?

I'm using Python, PyQt4, and QtWebKit to load a web page into a bare-bones browser to examine the data.
However, there is a small issue. I'm trying to get the contents and src of every iframe on the loaded page. I'm using webView.page().mainFrame().childFrames() to get the frames. To problem is, childFrames() loads the frames ONLY if they're visible by the browser. For example, when your browser is positioned at the top of the page, childFrames() will not load the iframes are at the footer of the page. Is there a way or setting I could tweak where I can get all ads? I've attached the source of my "browser". Try scrolling down when the page finishes it's loading. Watch the console and you will see that the iframes load dynamically. Please help.
from PyQt4 import QtGui, QtCore, QtWebKit
import sys
import unicodedata
class Sp():
def Main(self):
self.webView = QtWebKit.QWebView()
self.webView.load(QtCore.QUrl("http://www.msnbc.msn.com/id/41197838/ns/us_news-environment/"))
self.webView.show()
QtCore.QObject.connect(self.webView,QtCore.SIGNAL("loadFinished(bool)"),self.Load)
def Load(self):
frame = self.webView.page().mainFrame()
children = frame.childFrames()
fT = []
for x in children:
print "=========================================="
print unicodedata.normalize('NFKD', unicode(x.url().toString())).encode('ascii','ignore')
print "=========================================="
fT.append([unicode(x.url().toString()),unicode(x.toHtml()),[]])
for x in range(len(fT)):
f = children[x]
tl = []
for fx in f.childFrames():
print "___________________________________________"
print unicodedata.normalize('NFKD', unicode(fx.url().toString())).encode('ascii','ignore')
print "___________________________________________"
tl.append([unicode(fx.url().toString()),unicode(fx.toHtml()),[]])
fT[x][2] = tl
app = QtGui.QApplication(sys.argv)
s = Sp()
s.Main()
app.exec_()

Not sure why you're doing what you're doing, but if it's only loading what's visible, you can set the page viewport size to the content size and that should load everything:
def Load(self):
self.webView.page().setViewportSize(
self.webView.page().mainFrame().contentsSize())
However, this has a weird effect in the GUI so this solution may be unacceptable for what you are trying to do.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to get the html dom of a webpage and its frames - python

Related

How to scrape several websites with pyqt4, scope change?

PyQt4 QWebview error after initial evaluateJavaScript() call

QWebView get response

PDF with QWebView: missing refresh/repaint after loading

PyQt QWebKit frame bug?

Categories

Resources