Can't web scrape with PyQt5 more than once [duplicate]

This question already has an answer here:
Scrape multiple urls using QWebPage
(1 answer)
Closed 4 years ago.
I am attempting to web scrape using the PyQt5 QWebEngineView. Here is the code that I got from another answer on Stack Overflow:
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl, QEventLoop
from PyQt5.QtWebEngineWidgets import QWebEngineView
import sys
def render(url):
    class Render(QWebEngineView):
        def __init__(self, t_url):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadfinished)
            self.load(QUrl(t_url))
            while self.html is None:
                self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
            self.app.quit()
        def _callable(self, data):
            self.html = data
        def _loadfinished(self, result):
            self.page().toHtml(self._callable)
    return Render(url).html
Then if I put the line:
print(render('http://quotes.toscrape.com/random'))
it works as expected. But if I add a second line to that so it reads:
print(render('http://quotes.toscrape.com/random'))
print(render('http://quotes.toscrape.com/tableful/'))
it gives me the error "Process finished with exit code -1073741819 (0xC0000005)" after printing out the first render correctly.
I have narrowed the error down to the line self.load(QUrl(t_url)).

You're initializing QApplication more than once. Only one instance should exist, globally. If you need the current instance and do not have a handle to it, you can use QApplication.instance(). QApplication.quit() is meant to be called right before sys.exit; in fact, you should almost never use one without the other.
In short, you're telling Qt that you're exiting the application and then trying to run more Qt code. It's an easy fix, however...
Solution
You can do one of three things:
1. Store the app in a global variable and reference it from there:
APP = QApplication(sys.argv)
# ... many lines elided
class SomeClass(QWidget):
    def some_method(self):
        APP.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
2. Pass the app as a handle to the class:
def render(app, url):
    ...
3. Create a global instance, and use QApplication.instance():
APP = QApplication(sys.argv)
# ... many lines elided
class SomeClass(QWidget):
    def some_method(self):
        app = QApplication.instance()
        app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
Do what's most convenient for you.

Related

Continuously Check SQLAlchemy Database Connection on Pyside2

First of all, I want to figure out how to check the database status every second, so that the user can tell whether the database is up without clicking or triggering anything. I've read that this can cause problems, as mentioned in the comments here.
So here's my minimal reproducible example:
import sys
import os
import shiboken2
from PySide2 import QtCore, QtGui, QtWidgets
from PySide2.QtWidgets import QMainWindow, QFileDialog, QMessageBox, QWidget, QDialog, QProxyStyle
from sqlalchemy import create_engine, inspect
class MyWidget(QtWidgets.QWidget):
    def __init__(self):
        QtWidgets.QWidget.__init__(self)
        self.resize(200, 200)
        self.path = os.path.abspath(os.path.dirname(sys.argv[0]))
        self.button = QtWidgets.QPushButton("Open File")
        self.labelFile = QtWidgets.QLabel("empty")
        self.labelData = QtWidgets.QLabel("None")
        self.layout = QtWidgets.QVBoxLayout()
        self.layout.addWidget(self.button)
        self.layout.addWidget(self.labelFile)
        self.layout.addWidget(self.labelData)
        self.setLayout(self.layout)
        self.button.clicked.connect(self.open_file)
        self.process = None
        self.CreateEngine = CreateEngine(self)
        self.CreateEngine.result.connect(self.start_timer)
        self.CreateEngine.start()
    def open_file(self):
        x = QFileDialog.getOpenFileName(self,"Just To Spice This Code",self.path,"CSV Files (*.csv)")
        self.labelFile.setText(x[0]) #just to check that GUI doesn't freeze
    def start_timer(self,engine): #callback from CreateEngine
        self.timer = QtCore.QTimer(self)
        self.timer.timeout.connect(lambda: self.continuously_check(engine))
        self.timer.start(1000) #check connection every second, as real-time as possible
    def continuously_check(self,engine): #this gonna get called every second, yes it isn't effective i know
        self.process = CheckConnection(self,engine)
        self.process.result.connect(self.update_connection_label)
        self.process.start()
    def update_connection_label(self,x): #update connection status on GUI
        self.labelData.setText("DB Status: "+str(x))
    def closeEvent(self,event): #to handle QThread: Destroyed while thread is still running
        print("begin close event")
        if(self.process is not None):
            if(shiboken2.isValid(self.process)): #to check whether the object is deleted. ->
                self.process.wait() #-> this will get messy when the DB connection is down
                self.process.quit() #-> (IMO):since i stack so many CheckConnection objects maybe?
        print("end close event")
class CreateEngine(QtCore.QThread): #creating engine on separate thread so that it won't block GUI
    result = QtCore.Signal(object)
    def __init__(self, parent):
        QtCore.QThread.__init__(self, parent)
        self.engine = None
    def run(self):
        self.engine = create_engine('mysql+pymysql://{}:{}@{}:{}/{}'.format("root","","localhost","3306","adex_admin"))
        self.result.emit(self.engine)
class CheckConnection(QtCore.QThread): #constantly called every second, yes its not a good approach ->
    result = QtCore.Signal(str) #-> i wonder how to replace all this with something appropriate
    def __init__(self, parent,engine):
        QtCore.QThread.__init__(self, parent)
        self.engine = engine
    def run(self):
        try:
            self.engine.execute('SELECT 1').fetchall()
            self.result.emit("Connected")
        except:
            self.result.emit("Not Connected")
        self.deleteLater() #somehow this doesn't do its job very well. maybe blocked?
        #-> especially when the connection is busted. this thread gets stuck quite long to finish
if __name__ == "__main__":
    #idk why when you start this without connection it's running really slow on showing the status of DB
    #you must wait like 4 seconds until the connection status is showed up, which is really bad
    #but once it's live. it could read database status really fast
    app = QtWidgets.QApplication(sys.argv)
    widget = MyWidget()
    widget.show()
    sys.exit(app.exec_())
I've created this example just to reproduce the problem I'm facing in my real app. The problem is that closeEvent takes too long to terminate the checking process and also blocks the GUI. I added closeEvent because the app otherwise produced [QThread: Destroyed while thread is still running] when it was closed.
Also, whenever the database isn't reachable, the QThread takes much longer to finish than when the database is reachable. Apart from that, the status is retrieved pretty much as wanted (a live DB status every second). I also tried a silly approach like this:
...
def continuously_check(self,engine):
    self.process = CheckConnection(self,engine)
    self.process.result.connect(self.update_connection_label)
    self.process.finished.connect(lambda: QtCore.QTimer.singleShot(1000,self.continuously_check))
    self.process.start()
...
hoping that it wouldn't keep creating objects before the previous thread had even finished (PS: obviously this doesn't work). So what's the best approach here? Sorry for asking about multiple problems at once.
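One possible direction, offered as a sketch rather than a tested fix for this app, is to keep a single long-lived worker that loops, instead of creating a new thread object every second. The example below uses the standard library's threading module instead of QThread so that it runs without Qt. The class name ConnectionMonitor and its parameters are assumptions made for illustration. In the PySide2 app, on_status would have to hand the result to the GUI thread (for example by emitting a signal) rather than touch a widget directly.

```python
import threading


class ConnectionMonitor(threading.Thread):
    """Call check() every `interval` seconds until stop() is called.

    Hypothetical helper: check is any callable that raises when the
    database is unreachable (for example one that runs SELECT 1), and
    on_status receives "Connected" or "Not Connected" on the worker thread.
    """

    def __init__(self, check, on_status, interval=1.0):
        super().__init__(daemon=True)
        self._check = check
        self._on_status = on_status
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            try:
                self._check()
                status = "Connected"
            except Exception:
                status = "Not Connected"
            self._on_status(status)
            # Waits up to `interval` seconds but returns at once on stop().
            self._stop_event.wait(self._interval)

    def stop(self):
        """Ask the loop to finish after the current check."""
        self._stop_event.set()
```

Because only one check is ever in flight, there is a single thread to wait for on close. stop() cannot interrupt a check that is already blocked on an unreachable server, so a short connect timeout on the database driver would still be needed to keep shutdown quick.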

Proxies when using PyQt to render HTML

I'm looking to scrape JavaScript-driven pages using this code, which has appeared on a number of past threads (c.f. this, this, and others on offsite threads):
import sys
from PyQt5.QtCore import QEventLoop
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView
def render(source_html):
    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            while self.html is None:
                self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)
            self.app.quit()
        def _callable(self, data):
            self.html = data
        def _loadFinished(self, result):
            self.page().toHtml(self._callable)
    return Render(source_html).html
It's working fine.
My question is whether I need to use a proxy for this portion of the code (assuming that I generally want to be using a proxy for all network activity).
I'm using urllib.request and a proxy to access the site in question, and then passing the html from there on to PyQt5 to do the JavaScript mambo. Does that second leg of the journey involve a network connection that should be proxy-fied? If so, how should I change this code - haven't touched PyQt until today and am feeling a bit over my head.
Using Python 3.5 and Windows 7.
Many thanks.

PyQt4 QWebview error after initial evaluateJavaScript() call

I've got a Python 2.7 application running with PyQt4 that has a QWebView in it, that has two way communication to and from Javascript.
The application is multithreaded via QThreadPool, QRunnables, so I'm communicating with a ViewController class with signals.
When I run the application, the QWebView loads my HTML with external JS and CSS just fine. I'm able to interact with the Javascript functions via the main program thread and ViewController class.
Once the user selects a directory and certain criteria are met, it starts looping through QRunnable tasks one at a time. During that time it calls back to the ViewController -> Javascript via Signal slots, just as expected. The problem is when I'm calling those ViewController methods that execute evaluateJavaScript, I get a Javascript error returned,
undefined line 1: SyntaxError: Parse error
I've done lots of trial and error, but can't seem to figure out why evaluateJavaScript won't run in these instances. I've tried sending simple Javascript calls ranging from test functions that don't accept any arguments (thinking maybe it was some weird encoding issue), to just sending things like Application.main.evaluateJavaScript("alert('foo')"), which normally work outside of the threads. The only other thing I can think of is that maybe self.main.addToJavaScriptWindowObject('view', self.view) needs to be called in the threads again, but I've run a dir() on Application.main and it appears to have the evaluateJavaScript method attached to it already.
Any thoughts on why this could be occurring, when the scope seems to be correct, and the ViewController appears to be communicating just fine to the QWebView otherwise? Answers in Qt C++ will probably work as well, if you've seen this happen before!
I tried to simplify the code for example purposes:
# coding: utf8
import subprocess as sp
import os.path, os, sys, time, datetime
from os.path import basename
import glob
import random
import string
from PyQt4 import QtCore, QtGui, QtWebKit
from PyQt4.QtCore import QObject, pyqtSlot, QThreadPool, QRunnable, pyqtSignal
from PyQt4.QtGui import QApplication, QFileDialog
from PyQt4.QtWebKit import QWebView
from ImportController import *
class Browser(QtGui.QMainWindow):
    def __init__(self):
        QtGui.QMainWindow.__init__(self)
        self.resize(800,500)
        self.centralwidget = QtGui.QWidget(self)
        self.mainLayout = QtGui.QHBoxLayout(self.centralwidget)
        self.mainLayout.setSpacing(0)
        self.mainLayout.setMargin(0)
        self.frame = QtGui.QFrame(self.centralwidget)
        self.gridLayout = QtGui.QVBoxLayout(self.frame)
        self.gridLayout.setMargin(0)
        self.gridLayout.setSpacing(0)
        self.html = QtWebKit.QWebView()
        # for javascript errors
        errors = WebPage()
        self.html.setPage(errors)
        self.main = self.html.page().mainFrame()
        self.gridLayout.addWidget(self.html)
        self.mainLayout.addWidget(self.frame)
        self.setCentralWidget(self.centralwidget)
        path = os.getcwd()
        if self.checkNetworkAvailability() and self.checkApiAvailbility():
            self.default_url = "file://"+path+"/View/mainView.html"
        else:
            self.default_url = "file://"+path+"/View/errorView.html"
        # load the html view
        self.openView()
        # controller class that sends and receives to/from javascript
        self.view = ViewController()
        self.main.addToJavaScriptWindowObject('view', self.view)
        # on gui load finish
        self.html.loadFinished.connect(self.on_loadFinished)
    # to javascript
    def selectDirectory(self):
        # This evaluates the directory we've selected to make sure it fits the criteria, then parses the XML files
        pass
    def evaluateDirectory(self, directory):
        if not directory:
            return False
        if os.path.isdir(directory):
            return True
        else:
            return False
    @QtCore.pyqtSlot()
    def on_loadFinished(self):
        # open directory select dialog
        self.selectDirectory()
    def openView(self):
        self.html.load(QtCore.QUrl(self.default_url))
        self.html.show()
    def checkNetworkAvailability(self):
        #TODO: make sure we can reach the outside world before trying anything else
        return True
    def checkApiAvailbility(self):
        #TODO: make sure the API server is alive and responding
        return True
class WebPage(QtWebKit.QWebPage):
    def javaScriptConsoleMessage(self, msg, line, source):
        print '%s line %d: %s' % (source, line, msg)
class ViewController(QObject):
    def __init__(self, parent=None):
        super(ViewController, self).__init__(parent)
    @pyqtSlot()
    def did_load(self):
        print "View Loaded."
    @pyqtSlot()
    def selectDirectoryDialog(self):
        # FROM JAVASCRIPT: in case they need to re-open the file dialog
        Application.selectDirectory()
    def prepareImportView(self, displayPath):
        # TO JAVASCRIPT: XML directory parsed okay, so let's show the main
        Application.main.evaluateJavaScript("prepareImportView('{0}');".format(displayPath))
    def generalMessageToView(self, target, message):
        # TO JAVASCRIPT: Send a general message to a specific widget target
        Application.main.evaluateJavaScript("receiveMessageFromController('{0}', '{1}')".format(target, message))
    @pyqtSlot()
    def startProductImport(self):
        # FROM JAVASCRIPT: Trigger the product import loop, QThreads
        print "### view.startProductImport"
        position = 1
        count = len(Application.data.products)
        importTasks = ProductImportQueue(Application.data.products)
        importTasks.start()
    @pyqtSlot(str)
    def updateProductView(self, data):
        # TO JAVASCRIPT: Send product information to view
        print "### updateProductView "
        Application.main.evaluateJavaScript('updateProductView("{0}");'.format(QtCore.QString(data)) )
class WorkerSignals(QObject):
    ''' Declares the signals that will be broadcast to their connected view methods '''
    productResult = pyqtSignal(str)
class ProductImporterTask(QRunnable):
    ''' This is where the import process will be fired for each loop iteration '''
    def __init__(self, product):
        super(ProductImporterTask, self).__init__()
        self.product = product
        self.count = ""
        self.position = ""
        self.signals = WorkerSignals()
    def run(self):
        print "### ProductImporterTask worker {0}/{1}".format(self.position, self.count)
        # Normally we'd create a dict here, but I'm trying to just send a string for testing purposes
        self.signals.productResult.emit(data)
        return
class ProductImportQueue(QObject):
    ''' The synchronous threadpool that is going to one by one run the import threads '''
    def __init__(self, products):
        super(ProductImportQueue, self).__init__()
        self.products = products
        self.pool = QThreadPool()
        self.pool.setMaxThreadCount(1)
    def process_result(self, product):
        return
    def start(self):
        ''' Call the product import worker from here, and format it in a predictable way '''
        count = len(self.products)
        position = 1
        for product in self.products:
            worker = ProductImporterTask("test")
            worker.signals.productResult.connect(Application.view.updateProductView, QtCore.Qt.DirectConnection)
            self.pool.start(worker)
            position = position + 1
        self.pool.waitForDone()
if __name__ == "__main__":
    app = QtGui.QApplication(sys.argv)
    Application = Browser()
    Application.raise_()
    Application.show()
    Application.activateWindow()
    sys.exit(app.exec_())

You know, I love PyQt4 but after searching and searching, I believe this is actually a bug and not as designed.
I've since moved on and am trying to implement this in CEFPython with WxPython, which seems to have a much more elegant implementation for this specific purpose.

Subclassed QWebView doesn't react to Hyperlink Clicks

This is in Python/PySide.
I am trying to create my own Parental WebBrowser by overloading the PySide.QtWebKit.QWebView widget. Then whenever someone clicks a link on the widget I check to see if we are going to an invalid website, if not we proceed, if yes then I redirect to a generic page.
So I have subclassed the PySide.QtWebKit.QWebView, but I am not receiving notification of when a link is clicked. I have overridden the linkClicked function but the function never runs when a link is clicked?
What am I doing wrong? Why cant my function run/react to the hyperlink click "event"? Do I need to override the webpage object & not this class to react to link clicks?
import PySide.QtWebKit
import sys
from PyQt4 import QtGui
class BrowserWindow( PySide.QtWebKit.QWebView ):
    # Class Variables:
    def __init__( self, _parent ):
        """ Constructor: """
        super(BrowserWindow, self).__init__()
        PySide.QtWebKit.QWebView(None)
    def linkClicked(self, arg__1):
        """ Post: """
        #print("LINK CLICKED")
        #text, ok = QtGui.QInputDialog.getText(self, 'Input Dialog',
        #    'Enter your name:')
        self.load("http://yahoo.com")
def main():
    app = QtGui.QApplication(sys.argv)
    view = BrowserWindow(None) #PySide.QtWebKit.QWebView(None)
    view.load("http://google.com")
    view.show()
    sys.exit(app.exec_())
if __name__ == '__main__':
    main()

There are several problems with the code you posted. Firstly, you are importing both PySide and PyQt4, which is not a good idea. Secondly, QWebView.linkClicked is a signal, not a protected method, so you can't override it. Thirdly, you are passing a string to QWebView.load, when you should be passing a QtCore.QUrl.
However, aside from those problems, you also need to set the linkDelegationPolicy on the web page in order to override its link handling.
Here's an edited version of your code which should fix all the problems:
from PySide import QtCore, QtGui, QtWebKit
class BrowserWindow(QtWebKit.QWebView):
    def __init__(self, parent=None):
        super(BrowserWindow, self).__init__()
        self.linkClicked.connect(self.handleLinkClicked)
    def handleLinkClicked(self, url):
        print(url.toString())
if __name__ == '__main__':
    import sys
    app = QtGui.QApplication(sys.argv)
    view = BrowserWindow()
    view.load(QtCore.QUrl("http://google.com"))
    view.page().setLinkDelegationPolicy(
        QtWebKit.QWebPage.DelegateAllLinks)
    view.show()
    sys.exit(app.exec_())
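The handler above only prints the URL. For the parental-control check described in the question, the allow/deny decision could be kept in a small Qt-free helper such as the sketch below. The function name is_blocked and the blocklist contents are made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; a real one would come from configuration.
BLOCKED_HOSTS = {"example.com"}


def is_blocked(url, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == blocked or host.endswith("." + blocked)
        for blocked in blocked_hosts
    )
```

Inside handleLinkClicked, the string from url.toString() could be passed to is_blocked. With DelegateAllLinks set, the view no longer follows links by itself, so the handler would need to call self.load(url) for allowed links and load the generic page for blocked ones.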

PyQt - is it possible to run two applications?

Two files. Each opens a new window and works by itself. I need to run them both.
When I run first.pyw, only one (second) window is shown.
Is it possible to run them both?
first.pyw:
import sys
from PyQt4.QtGui import *
import second
class first(QWidget):
    def __init__(self, parent=None):
        QWidget.__init__(self, parent)
        self.setWindowTitle('first')
app = QApplication(sys.argv)
firstApp = first()
firstApp.show()
sys.exit(app.exec_())
second.pyw:
import sys
from PyQt4.QtGui import *
class second(QWidget):
    def __init__(self, parent=None):
        QWidget.__init__(self, parent)
        self.setWindowTitle('second')
app2 = QApplication(sys.argv)
secondApp = second()
secondApp.show()
sys.exit(app2.exec_())
How can I run two applications that are in different modules?

The accepted answer is essentially right, but there are cases where you want to run multiple QApplications one after the other, e.g.:
Unit tests
A command-line tool that shouldn't require a running X server (hence no QApplication on startup), but can optionally show a window if the user's system supports it
I ended up using the multiprocessing module to start each QApplication in a separate process, so that each one is independent of the others.
from multiprocessing import Queue, Process
from PyQt4.QtGui import QApplication
class MyApp(Process):
    def __init__(self):
        self.queue = Queue(1)
        super(MyApp, self).__init__()
    def run(self):
        app = QApplication([])
        ...
        self.queue.put(return_value)
app1 = MyApp()
app1.start()
# Read the result before join(): joining a process that still has queued data can block.
print("App 1 returned: " + app1.queue.get())
app1.join()
app2 = MyApp()
app2.start()
print("App 2 returned: " + app2.queue.get())
app2.join()

You can only run a single application at a time, although your application can have multiple top-level windows. The QCoreApplication docs say that:
...there should be exactly one QCoreApplication object.
This also holds true for QApplication as it derives from QCoreApplication. You can get access to that application through the QCoreApplication.instance() method or the qApp macro in C++.
What do you expect to get out of having two different applications running? Instead, you could have each module provide a top-level window that then gets displayed by the application launcher.

You import second. Hence it is executed before you even reach the definition of class first. As the last line of second.pyw is sys.exit, nothing after it can run.
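The usual way to avoid this is to keep start-up code behind an if __name__ == "__main__": guard, so that importing the module only defines its classes. Below is a small Qt-free demonstration of the difference. The module names and sources are stand-ins invented for this sketch, with sys.exit(0) playing the role of the last line of second.pyw.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-ins for second.pyw: no Qt, only the sys.exit() call.
UNGUARDED = "import sys\nsys.exit(0)\n"
GUARDED = (
    "import sys\n"
    "def main():\n"
    "    sys.exit(0)\n"
    "if __name__ == '__main__':\n"
    "    main()\n"
)


def import_from_source(name, source):
    """Write source to <name>.py in a temporary directory and import it."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, name + ".py").write_text(source)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return importlib.import_module(name)
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            sys.modules.pop(name, None)


def exits_on_import(name, source):
    """Return True if merely importing the module raises SystemExit."""
    try:
        import_from_source(name, source)
    except SystemExit:
        return True
    return False


unguarded_exits = exits_on_import("demo_unguarded", UNGUARDED)
guarded_exits = exits_on_import("demo_guarded", GUARDED)
print(unguarded_exits, guarded_exits)
```

The unguarded module raises SystemExit as soon as it is imported, which is what happens to first.pyw at its import second line. The guarded one imports cleanly and leaves the decision to call main() to the importer.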
