I'm looking to scrape JavaScript-driven pages using this code, which has appeared on a number of past threads (c.f. this, this, and others on offsite threads):
import sys
from PyQt5.QtCore import QEventLoop
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView
def render(source_html):
class Render(QWebEngineView):
def __init__(self, html):
self.html = None
self.app = QApplication(sys.argv)
QWebEngineView.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.setHtml(html)
while self.html is None:
self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
self.app.quit()
def _callable(self, data):
self.html = data
def _loadFinished(self, result):
self.page().toHtml(self._callable)
return Render(source_html).html
It's working fine.
My question is whether I need to use a proxy for this portion of the code (assuming that I generally want to be using a proxy for all network activity).
I'm using urllib.request and a proxy to access the site in question, and then passing the html from there on to PyQt5 to do the JavaScript mambo. Does that second leg of the journey involve a network connection that should be proxy-fied? If so, how should I change this code - haven't touched PyQt until today and am feeling a bit over my head.
Using Python 3.5 and Windows 7.
Many thanks.
Related
This question already has an answer here:
Scrape multiple urls using QWebPage
(1 answer)
Closed 4 years ago.
I am attempting to web scrape using the PyQT5 QWebEngineView. Here is the code that I got from another response on StackOverflow:
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl, QEventLoop
from PyQt5.QtWebEngineWidgets import QWebEngineView
import sys
def render(url):
class Render(QWebEngineView):
def __init__(self, t_url):
self.html = None
self.app = QApplication(sys.argv)
QWebEngineView.__init__(self)
self.loadFinished.connect(self._loadfinished)
self.load(QUrl(t_url))
while self.html is None:
self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
self.app.quit()
def _callable(self, data):
self.html = data
def _loadfinished(self, result):
self.page().toHtml(self._callable)
return Render(url).html
Then if I put the line:
print(render('http://quotes.toscrape.com/random'))
it works as expected. But if I add a second line to that so it reads:
print(render('http://quotes.toscrape.com/random'))
print(render('http://quotes.toscrape.com/tableful/'))
it gives me the error "Process finished with exit code -1073741819 (0xC0000005)" after printing out the first render correctly.
I have narrowed the error down to the line that says self.load(QUrl(t_url))
You're initializing QApplication more than once. Only once instance should exist, globally. If you need to get the current instance and do not have a handle to it, you can use QApplication.instance(). QApplication.quit() is meant to be called right before sys.exit, in fact, you should almost never use one without the other.
In short, you're telling Qt you're exiting the application, and then trying to run more Qt code. It's an easy fix, however...
Solution
You can do 1 of three things:
Store the app in a global variable and reference it from there:
APP = QApplication(sys.argv)
# ... Many lines ellipsed
class SomeClass(QWidget):
def some_method(self):
APP.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
Pass the app as a handle to the class.
def render(app, url):
...
Create a global instance, and use QApplication.instance().
APP = QApplication(sys.argv)
# ... Many lines ellipsed
class SomeClass(QWidget):
def some_method(self):
app = QApplication.instance()
app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
Do what's most convenient for you.
working on writing a Python script to scrape a webpage after it has run its JavaScript. I realized I needed the JS to run because using the Requests wasn't returning any data. I found what seemed to be my solution here but I am having some problems still.
First of all that tutorial uses PyQt4, I have installed and tried multiple versions of PyQt 4 and 5 from the project interpreter and still can't find a solution. Here is the relevant code:
import PyQt5.QtWebEngineWidgets
import PyQt5.QtCore
import PyQt5.QtWebEngine
import PyQt5.QtWebEngineCore
class Render(QWebpage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _load_finished(self, result):
self.frame = self.mainFrame()
self.app.quit()
The QWebpage, QApplication, and QUrl calls all have 'Unresolved Reference' errors, the four PyQt5 import statements also all have 'Unused Import Statement' indications. I have tried for several hours to resolve these issues, uninstalling and reinstalling PyQt several times and searching the internet
Any advice would be awesome, Thanks!
Your imports are incorrect, in python there are many ways to do it: in your case you could be like this:
1.from package import class
import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
# Take this class for granted.Just use result of rendering.
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
print(result)
import package, then you should use each element as package.class:
import sys
from PyQt5 import QtWebKitWidgets, QtCore, QtWidgets
class Render(QtWebKitWidgets.QWebPage):
def __init__(self, url):
self.app = QtWidgets.QApplication(sys.argv)
QtWebKitWidgets.QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QtCore.QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
print(result)
If you are using pycharm there is a very simple way that pycharm imports the packages correctly for you, for this you must place the dot above the word that generates the error and execute Ctrl+M
Note:If you are using windows you will not be able to use these modules since Qt and therefore PyQt, use chromium, and it seems that they have a problem with Windows.
I would like to scrape two websites in java for links using PyQt4.QtWebKit to render the pages and then get the desired links. The code works fine with one page or url, but stops (but continues running until force quit) after printing the links of the first website. It seems the scope stays in the event loop of the render class. How can I get the program to change scope and continue with the for loop and rendering the second website? Using exit() in _loadFinished method just quits the program after the first iteration. Maybe the python app has to close and reopen to render the next page, which is impossible because the app is opened/reopened outside of the program?
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtGui
from lxml import html
class Render(QWebPage):
def __init__(self, url):
self.frame = None
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
def _loadFinished(self, result):
self.frame = self.mainFrame()
result = self.frame.toHtml()
formatted_result = str(result)
tree = html.fromstring(formatted_result)
archive_links = tree.xpath('//div/div/a/#href')[0:4]
print(archive_links)
urls = ['http://pycoders.com/archive/', 'http://www.pythonjobshq.com']
def main(urls):
app = QtGui.QApplication(sys.argv)
for url in urls:
r = Render(url)
#s = Render(urls[1]) #The pages can be rendered parallel, but rendering more than a handful of pages a the same time is a bad idea
sys.exit(app.exec_())
if __name__ == '__main__':
main(urls)
Thankful for any help!
I've got a Python 2.7 application running with PyQt4 that has a QWebView in it, that has two way communication to and from Javascript.
The application is multithreaded via QThreadPool, QRunnables, so I'm communicating with a ViewController class with signals.
When I run the application, the QWebView loads my HTML with external JS and CSS just fine. I'm able to interact with the Javascript functions via the main program thread and ViewController class.
Once the user selects a directory and certain criteria are met, it starts looping through QRunnable tasks one at a time. During that time it calls back to the ViewController -> Javascript via Signal slots, just as expected. The problem is when I'm calling those ViewController methods that execute evaluateJavaScript, I get a Javascript error returned,
undefined line 1: SyntaxError: Parse error
I've done lots of trial error back and forth, but can't seem to figure out why evaluateJavaScript won't run in these instances. I've tried sending simple Javascript calls ranging from test functions that don't accept any arguments (thinking maybe it was some weird encoding issue), to just sending things like Application.main.evaluateJavaScript("alert('foo')"), which normally work outside of the threads. The only other thing I can think of is that maybe self.main.addToJavaScriptWindowObject('view', self.view) needs to be called in the threads again, but I've run a dir() on Application.main and it appears to have the evaluateJavaScript method attached to it already.
Any thoughts on why this could be occurring, when the scope seems to be correct, and the ViewController appears to be communicating just fine to the QWebView otherwise? Answers in Qt C++ will probably work as well, if you've seen this happen before!
I tried to simplify the code for example purposes:
# coding: utf8
import subprocess as sp
import os.path, os, sys, time, datetime
from os.path import basename
import glob
import random
import string
from PyQt4 import QtCore, QtGui, QtWebKit
from PyQt4.QtCore import QObject, pyqtSlot, QThreadPool, QRunnable, pyqtSignal
from PyQt4.QtGui import QApplication, QFileDialog
from PyQt4.QtWebKit import QWebView
from ImportController import *
class Browser(QtGui.QMainWindow):
def __init__(self):
QtGui.QMainWindow.__init__(self)
self.resize(800,500)
self.centralwidget = QtGui.QWidget(self)
self.mainLayout = QtGui.QHBoxLayout(self.centralwidget)
self.mainLayout.setSpacing(0)
self.mainLayout.setMargin(0)
self.frame = QtGui.QFrame(self.centralwidget)
self.gridLayout = QtGui.QVBoxLayout(self.frame)
self.gridLayout.setMargin(0)
self.gridLayout.setSpacing(0)
self.html = QtWebKit.QWebView()
# for javascript errors
errors = WebPage()
self.html.setPage(errors)
self.main = self.html.page().mainFrame()
self.gridLayout.addWidget(self.html)
self.mainLayout.addWidget(self.frame)
self.setCentralWidget(self.centralwidget)
path = os.getcwd()
if self.checkNetworkAvailability() and self.checkApiAvailbility():
self.default_url = "file://"+path+"/View/mainView.html"
else:
self.default_url = "file://"+path+"/View/errorView.html"
# load the html view
self.openView()
# controller class that sends and receives to/from javascript
self.view = ViewController()
self.main.addToJavaScriptWindowObject('view', self.view)
# on gui load finish
self.html.loadFinished.connect(self.on_loadFinished)
# to javascript
def selectDirectory(self):
# This evaluates the directory we've selected to make sure it fits the criteria, then parses the XML files
pass
def evaluateDirectory(self, directory):
if not directory:
return False
if os.path.isdir(directory):
return True
else:
return False
#QtCore.pyqtSlot()
def on_loadFinished(self):
# open directory select dialog
self.selectDirectory()
def openView(self):
self.html.load(QtCore.QUrl(self.default_url))
self.html.show()
def checkNetworkAvailability(self):
#TODO: make sure we can reach the outside world before trying anything else
return True
def checkApiAvailbility(self):
#TODO: make sure the API server is alive and responding
return True
class WebPage(QtWebKit.QWebPage):
def javaScriptConsoleMessage(self, msg, line, source):
print '%s line %d: %s' % (source, line, msg)
class ViewController(QObject):
def __init__(self, parent=None):
super(ViewController, self).__init__(parent)
#pyqtSlot()
def did_load(self):
print "View Loaded."
#pyqtSlot()
def selectDirectoryDialog(self):
# FROM JAVASCRIPT: in case they need to re-open the file dialog
Application.selectDirectory()
def prepareImportView(self, displayPath):
# TO JAVASCRIPT: XML directory parsed okay, so let's show the main
Application.main.evaluateJavaScript("prepareImportView('{0}');".format(displayPath))
def generalMessageToView(self, target, message):
# TO JAVASCRIPT: Send a general message to a specific widget target
Application.main.evaluateJavaScript("receiveMessageFromController('{0}', '{1}')".format(target, message))
#pyqtSlot()
def startProductImport(self):
# FROM JAVASCRIPT: Trigger the product import loop, QThreads
print "### view.startProductImport"
position = 1
count = len(Application.data.products)
importTasks = ProductImportQueue(Application.data.products)
importTasks.start()
#pyqtSlot(str)
def updateProductView(self, data):
# TO JAVASCRIPT: Send product information to view
print "### updateProductView "
Application.main.evaluateJavaScript('updateProductView("{0}");'.format(QtCore.QString(data)) )
class WorkerSignals(QObject):
''' Declares the signals that will be broadcast to their connected view methods '''
productResult = pyqtSignal(str)
class ProductImporterTask(QRunnable):
''' This is where the import process will be fired for each loop iteration '''
def __init__(self, product):
super(ProductImporterTask, self).__init__()
self.product = product
self.count = ""
self.position = ""
self.signals = WorkerSignals()
def run(self):
print "### ProductImporterTask worker {0}/{1}".format(self.position, self.count)
# Normally we'd create a dict here, but I'm trying to just send a string for testing purposes
self.signals.productResult.emit(data)
return
class ProductImportQueue(QObject):
''' The synchronous threadpool that is going to one by one run the import threads '''
def __init__(self, products):
super(ProductImportQueue, self).__init__()
self.products = products
self.pool = QThreadPool()
self.pool.setMaxThreadCount(1)
def process_result(self, product):
return
def start(self):
''' Call the product import worker from here, and format it in a predictable way '''
count = len(self.products)
position = 1
for product in self.products:
worker = ProductImporterTask("test")
worker.signals.productResult.connect(Application.view.updateProductView, QtCore.Qt.DirectConnection)
self.pool.start(worker)
position = position + 1
self.pool.waitForDone()
if __name__ == "__main__":
app = QtGui.QApplication(sys.argv)
Application = Browser()
Application.raise_()
Application.show()
Application.activateWindow()
sys.exit(app.exec_())
You know, I love PyQt4 but after searching and searching, I believe this is actually a bug and not as designed.
I've since moved on and am trying to implement this in CEFPython with WxPython, which seems to have a much more elegant implementation for this specific purpose.
I have a python code with PySide that has a QWebView that shows google maps.
I just want to get the response each time that I do any request using the QWebView widget.
I have searched info but there is no reference about getting a response with PySide. If you need me to paste some code I will but I just have a simple QWebView widget.
EDIT: You asked me for the code:
from PySide.QtCore import *
from PySide.QtGui import *
import sys
import pyside3
class MainDialog(QMainWindow, pyside3.Ui_MainWindow):
def __init__(self, parent=None):
super(MainDialog,self).__init__(parent)
self.setupUi(self)
token_fb=""
#self.Connect_buttom.clicked.connect(self.get_fb_token)
self.Connect_buttom.clicked.connect(lambda: self.get_fb_token(self.FB_username.text(), self.FB_password.text()))
#self.connect(self.Connect_buttom, SIGNAL("clicked()"), self.get_fb_token)
#Change between locate and hunt
self.MapsButton.clicked.connect(lambda: self.select_page_index(0))
self.HuntButton.clicked.connect(lambda: self.select_page_index(1))
###########################
self.webView.setHtml(URL)
def select_page_index(self, index): # To change between frames
self.Container.setCurrentIndex(index)
I need the response from: self.webView.setHtml(URL) because depending on the response my app has to do one thing or other.
Function QWebView.setHtml() has no response in the sense that it doesn't return anything.
Maybe you want to listen to all links that are clicked and do something custom with it.
web_view = QtWebKit.QWebView()
web_view.page().setLinkDelegationPolicy(QtWebKit.QWebPage.DelegateAllLinks)
web_view.linkClicked.connect(your_handler)
Or maybe you want to do something when loading has finished. This is done by:
web_view = QtWebKit.QWebView()
web_view.loadFinished.connect(your_handler)