How to break out of a crawl if a certain condition is encountered in Scrapy - python

For the purposes of an application I'm working on, I need scrapy to break out of the crawl and start crawling again from a particular, arbitrary URL.
The intended behaviour is for scrapy to go back to a particular URL, which can be supplied as an argument, whenever a particular condition is satisfied.
I'm using CrawlSpider but can't figure out how to achieve this:
class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    initial_url = ""

    def __init__(self, initial_url, *args, **kwargs):
        self.initial_url = initial_url
        domain = "mydomain.com"
        self.start_urls = [initial_url]
        self.allowed_domains = [domain]
        self.rules = (
            Rule(LinkExtractor(allow=[r"^http[s]?://(www.)?" + domain + "/.*"]), callback='parse_item', follow=True),
        )
        super(MyCrawlSpider, self)._compile_rules()

    def parse_item(self, response):
        if some_condition is True:
            # force scrapy to go back to home page and recrawl
            print("Should break out")
        else:
            print("Just carry on")
I tried to place
return scrapy.Request(self.initial_url, callback=self.parse_item)
in the branch where some_condition is True, but without success. I would hugely appreciate some help; I've been trying to figure this out for hours.

You could make a custom exception that you handle appropriately, like so (please feel free to edit with the appropriate syntax for CrawlSpider):
class RestartException(Exception):
    pass

class MyCrawlSpider(CrawlSpider):
    name = 'mycrawlspider'
    initial_url = ""

    def __init__(self, initial_url, *args, **kwargs):
        self.initial_url = initial_url
        domain = "mydomain.com"
        self.start_urls = [initial_url]
        self.allowed_domains = [domain]
        self.rules = (
            Rule(LinkExtractor(allow=[r"^http[s]?://(www.)?" + domain + "/.*"]), callback='parse_item', follow=True),
        )
        super(MyCrawlSpider, self)._compile_rules()

    def parse_item(self, response):
        if some_condition is True:
            print("Should break out")
            raise RestartException("We're restarting now")
        else:
            print("Just carry on")


siteName = "http://whatever.com"
crawler = MyCrawlSpider(siteName)

while True:
    try:
        # idk how you start this thing, but do that
        crawler.run()
        break
    except RestartException as err:
        print(err.args)
        crawler.something = err.args
        continue

print("I'm done!")

Related

Custom scrapy proxy rotation middleware with specific retry condition

I have a few conditions to implement for rotating proxies in scrapy middleware:
If the response is not 200, retry that request with another random proxy from a list.
I have two lists of proxies. Let's say I'd like to start crawling with the first list, retry about 10 times with it, and after that, as a last resort, try the second proxy list.
I have tried creating the middleware, but it is not working as expected: it is not rotating proxies, and it is not picking up the second proxy list as a last resort. Here is the code:
import random

class SFAProxyMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.packetstream_proxies = [
            settings.get("PS_PROXY_USA"),
            settings.get("PS_PROXY_CA"),
            settings.get("PS_PROXY_IT"),
            settings.get("PS_PROXY_GLOBAL"),
        ]
        self.unlimited_proxies = [
            settings.get("UNLIMITED_PROXY_1"),
            settings.get("UNLIMITED_PROXY_2"),
            settings.get("UNLIMITED_PROXY_3"),
            settings.get("UNLIMITED_PROXY_4"),
            settings.get("UNLIMITED_PROXY_5"),
            settings.get("UNLIMITED_PROXY_6"),
        ]

    def add_proxy(self, request, host):
        request.meta["proxy"] = host

    def process_request(self, request, spider):
        retries = request.meta.get("retry_times", 0)
        if "proxy" in request.meta.keys():
            return None
        if retries <= 10:
            self.add_proxy(request, random.choice(self.unlimited_proxies))
        else:
            self.add_proxy(request, random.choice(self.packetstream_proxies))
Am I doing something wrong implementing the middleware? Thanks
Based on the conditions at the beginning of your question, I think you also need to process the response, check its status code, and, if it isn't a 200, increase the retry count and send the request back to the scheduler.
You might need to set the dont_filter parameter in the request to True, and you should also probably set a maximum for the number of retries.
For example:
import random

from scrapy.exceptions import IgnoreRequest

MAX_RETRY = 20

class SFAProxyMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.packetstream_proxies = [
            settings.get("PS_PROXY_USA"),
            settings.get("PS_PROXY_CA"),
            settings.get("PS_PROXY_IT"),
            settings.get("PS_PROXY_GLOBAL"),
        ]
        self.unlimited_proxies = [
            settings.get("UNLIMITED_PROXY_1"),
            settings.get("UNLIMITED_PROXY_2"),
            settings.get("UNLIMITED_PROXY_3"),
            settings.get("UNLIMITED_PROXY_4"),
            settings.get("UNLIMITED_PROXY_5"),
            settings.get("UNLIMITED_PROXY_6"),
        ]

    def add_proxy(self, request, host):
        request.meta["proxy"] = host

    def process_request(self, request, spider):
        retries = request.meta.get("retry_times", 0)
        if "proxy" in request.meta.keys():
            return None
        if retries <= 10:
            self.add_proxy(request, random.choice(self.unlimited_proxies))
        else:
            self.add_proxy(request, random.choice(self.packetstream_proxies))

    def process_response(self, request, response, spider):
        # Downloader middleware's process_response also receives the request,
        # and Scrapy responses expose the HTTP status as response.status
        # (there is no status_code attribute).
        if response.status != 200:
            request.meta.setdefault("retry_times", 1)
            request.meta["retry_times"] += 1
            if request.meta["retry_times"] > MAX_RETRY:
                raise IgnoreRequest
            request.dont_filter = True
            return request
        return response
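For the middleware to take effect at all, it also has to be enabled in the project settings. A minimal sketch, assuming the class lives in myproject/middlewares.py (the module path and the priority value 543 are just examples):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SFAProxyMiddleware": 543,
}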

Restart scrapy spider after unsuccessful login

I've got a spider with an __init__ function:
class ExpireddomainsSpider(InitSpider):
    name = "expiredomains"

    def __init__(self, typ=None, *args, **kwargs):
        super(ExpireddomainsSpider, self).__init__(*args, **kwargs)
        self.typ = typ

        users = open("users.txt", "r")
        dane = self.random_line(users)
        dane = dane.split(':')
        self.user = dane[0]
        self.password = dane[1]
        self.ip = dane[2]
        self.port = dane[3]
        self.headers = {"User-Agent": dane[4]}
It takes a random line from a text file where I keep the user login, password, etc. Then I've got a login function:
def login(self, response):
    self.log("USER: " + self.user + " PASS: " + self.password)
    return FormRequest('https://member.expireddomains.net/login/',
                       formdata={'login': self.user, 'password': self.password},
                       callback=self.check_login_response, method='POST')
And the function to check login status:
def check_login_response(self, response):
    sprawdz = self.user.title()
    if sprawdz in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        return scrapy.FormRequest('http://expireddomains.net/', callback=self.start_crawl)
    elif "Your account was disabled" in response.body:
        self.log("Your account was disabled!")
        super(ExpireddomainsSpider, self).__init__()
    else:
        self.log("Bad times :(")
Now I want to restart my spider if the login was unsuccessful, so that the spider opens the file with users again, gets another random line, and tries again.
I've tried with:
super(ExpireddomainsSpider, self).__init__()
But it doesn't work and the spider is closed.
EDIT:
Ok now I've got this:
class ExpireddomainsSpider(InitSpider):
    name = "expiredomains"

    def init(self):
        users = open("users.txt", "r")
        dane = self.random_line(users)
        dane = dane.split(':')
        self.user = dane[0]
        self.password = dane[1]
        self.ip = dane[2]
        self.port = dane[3]
        self.headers = {"User-Agent": dane[4]}

    def __init__(self, typ=None, *args, **kwargs):
        super(ExpireddomainsSpider, self).__init__(*args, **kwargs)
        self.typ = typ
        self.init()
and
def check_login_response(self, response):
    sprawdz = self.user.title()
    if sprawdz in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        return scrapy.FormRequest('http://expireddomains.net/', callback=self.start_crawl)
    elif "Your account was disabled" in response.body:
        self.log("Your account was disabled!")
        self.init()
        return self.login(response)
    else:
        self.log("Bad times :(")
But it only works twice: it gets a random line and tries to log in; if that fails, it gets another random line and tries to log in again; if that fails too, the spider is closed. It doesn't keep trying until the login succeeds.
SOLUTION:
OK, I've solved it. I needed to add dont_filter=True to my login function:
def login(self, response):
    return FormRequest('https://member.expireddomains.net/login/',
                       formdata={'login': self.user, 'password': self.password},
                       callback=self.check_login_response, method='POST', dont_filter=True)
You could move your initialization code out into its own init method and then call self.login again. I would change check_login_response like this:
def check_login_response(self, response):
    sprawdz = self.user.title()
    if sprawdz in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        return scrapy.FormRequest('http://expireddomains.net/', callback=self.start_crawl)
    elif "Your account was disabled" in response.body:
        self.log("Your account was disabled!")
        self.init()
        return self.login(response)
    else:
        self.log("Bad times :(")

Python- name ' ' is not defined

class Main:
    PROJECT_NAME = 'something'
    HOMEPAGE = 'something'
    DOMAIN_NAME = get_domain_name(HOMEPAGE)
    QUEUE_FILE = PROJECT_NAME + '/queue.txt'
    CRAWLED_FILE = PROJECT_NAME + '/crawled.txt'
    DATA_FILE = PROJECT_NAME + '/data.txt'
    NUMBER_OF_THREADS = 20
    queue = Queue()
    Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)

    # Create worker threads (will die when main exits)
    def create_workers(self):
        for _ in range(self.NUMBER_OF_THREADS):
            t = self.threading.Thread(target=self.work)
            t.daemon = True
            t.start()

    # Do the next job in the queue
    def work(self):
        while True:
            url = self.queue.get()
            Spider.crawl_page(self.threading.current_thread().name, url)
            self.queue.task_done()

    # Each queued link is a new job
    def create_jobs(self):
        for link in self.file_to_set(self.QUEUE_FILE):
            self.queue.put(link)
        self.queue.join()
        self.crawl()

    # Check if there are items in the queue, if so crawl them
    def crawl(self):
        queued_links = self.file_to_set(self.QUEUE_FILE)
        if len(queued_links) > 0:
            print(str(len(queued_links)) + ' links in the queue')
            self.create_jobs()


create_workers()
crawl()
Above is my code. I have been receiving:
NameError: name 'create_workers' is not defined and NameError: name 'crawl' is not defined.
Any help or suggestions for a beginner here?
You must create an object of the class Main and use its methods through that object. For this, change:
create_workers()
crawl()
to:
m = Main()
m.create_workers()
m.crawl()
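If those calls live at module level, it is also common (though not required by the answer) to guard them so they only run when the script is executed directly:

if __name__ == '__main__':
    m = Main()
    m.create_workers()
    m.crawl()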

How to save the data from a scrapy crawler into a variable?

I'm currently building a web app meant to display the data collected by a scrapy spider. The user makes a request, the spider crawls a website, and the data is returned to the app so it can be displayed. I'd like to retrieve the data directly from the scraper, without relying on an intermediate .csv or .json file. Something like:
from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider
url = 'www.example.com'
spider = MySpider()
crawler = CrawlerProcess()
crawler.crawl(spider, start_urls=[url])
crawler.start()
data = crawler.data # this bit
This is not so easy, because Scrapy is non-blocking and works in an event loop; it uses the Twisted event loop, and the Twisted event loop is not restartable, so you can't write crawler.start(); data = crawler.data. After crawler.start(), the process runs forever, calling registered callbacks until it is killed or ends.
These answers may be relevant:
How to integrate Flask & Scrapy?
Building a RESTful Flask API for Scrapy
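If blocking until the crawl finishes is acceptable, another common workaround (a sketch, not taken from either answer here) is to collect items through the item_scraped signal before starting the process; MySpider and its import path are taken from the question:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

items = []

def collect_item(item, response, spider):
    # Called for every item the spider yields.
    items.append(item)

process = CrawlerProcess()
crawler = process.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler, start_urls=['http://www.example.com'])
process.start()  # blocks here until the crawl is finished
print(items)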
If you use an event loop in your app (e.g. you have a Twisted or Tornado web server), then it is possible to get the data from a crawl without storing it to disk. The idea is to listen to the item_scraped signal. I'm using the following helper to make it nicer:
import collections

from twisted.internet.defer import Deferred
from scrapy.crawler import Crawler
from scrapy import signals

def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
    """
    Start a crawl and return an object (ItemCursor instance)
    which allows to retrieve scraped items and wait for items
    to become available.

    Example:

    .. code-block:: python

        @inlineCallbacks
        def f():
            runner = CrawlerRunner()
            async_items = scrape_items(runner, my_spider)
            while (yield async_items.fetch_next):
                item = async_items.next_item()
                # ...
            # ...

    This convoluted way to write a loop should become unnecessary
    in Python 3.5 because of ``async for``.
    """
    crawler = crawler_runner.create_crawler(crawler_or_spidercls)
    d = crawler_runner.crawl(crawler, *args, **kwargs)
    return ItemCursor(d, crawler)

class ItemCursor(object):
    def __init__(self, crawl_d, crawler):
        self.crawl_d = crawl_d
        self.crawler = crawler

        crawler.signals.connect(self._on_item_scraped, signals.item_scraped)

        crawl_d.addCallback(self._on_finished)
        crawl_d.addErrback(self._on_error)

        self.closed = False
        self._items_available = Deferred()
        self._items = collections.deque()

    def _on_item_scraped(self, item):
        self._items.append(item)
        self._items_available.callback(True)
        self._items_available = Deferred()

    def _on_finished(self, result):
        self.closed = True
        self._items_available.callback(False)

    def _on_error(self, failure):
        self.closed = True
        self._items_available.errback(failure)

    @property
    def fetch_next(self):
        """
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
        asynchronously retrieve the next item, waiting for an item to be
        crawled if necessary. Resolves to ``False`` if the crawl is finished,
        otherwise :meth:`next_item` is guaranteed to return an item
        (a dict or a scrapy.Item instance).
        """
        if self.closed:
            # crawl is finished
            d = Deferred()
            d.callback(False)
            return d
        if self._items:
            # result is ready
            d = Deferred()
            d.callback(True)
            return d

        # We're active, but item is not ready yet. Return a Deferred which
        # resolves to True if item is scraped or to False if crawl is stopped.
        return self._items_available

    def next_item(self):
        """Get a document from the most recently fetched batch, or ``None``.

        See :attr:`fetch_next`.
        """
        if not self._items:
            return None
        return self._items.popleft()
The API is inspired by motor, a MongoDB driver for async frameworks. Using scrape_items you can get items from Twisted or Tornado callbacks as soon as they are scraped, in a way similar to how you fetch items from a MongoDB query.
This is probably too late, but it may help others: you can pass a callback function to the Spider and call that function to return your data, like so:
The dummy spider that we are going to use:
class Trial(Spider):
    name = 'trial'
    start_urls = ['']

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.output_callback = kwargs.get('args').get('callback')

    def parse(self, response):
        pass

    def close(self, spider, reason):
        self.output_callback(['Hi, This is the output.'])
A custom class with the callback:
from scrapy.crawler import CrawlerProcess
from scrapyapp.spiders.trial_spider import Trial

class CustomCrawler:
    def __init__(self):
        self.output = None
        self.process = CrawlerProcess(settings={'LOG_ENABLED': False})

    def yield_output(self, data):
        self.output = data

    def crawl(self, cls):
        self.process.crawl(cls, args={'callback': self.yield_output})
        self.process.start()

def crawl_static(cls):
    crawler = CustomCrawler()
    crawler.crawl(cls)
    return crawler.output
Then you can do:
out = crawl_static(Trial)
print(out)
You can pass the variable as an attribute of the class and store the data in it.
Of course, you need to add the attribute in the __init__ method of your spider class, as sketched after the snippet below.
from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

url = 'www.example.com'

crawler = CrawlerProcess()
data = []
# Pass the spider class and the shared list as keyword arguments.
crawler.crawl(MySpider, start_urls=[url], data=data)
crawler.start()
print(data)
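A sketch of the matching spider side (hypothetical, assuming the list is passed in as a keyword argument named data):

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'

    def __init__(self, data=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The caller's list; appending to it makes the scraped data visible
        # outside the crawl once crawler.start() returns.
        self.data = data if data is not None else []

    def parse(self, response):
        item = {'url': response.url}
        self.data.append(item)
        yield item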
My answer is inspired by Siddhant's:
from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'

    def parse(self, response):
        item = {
            'url': response.url,
            'status': response.status
        }
        yield self.output_callback(item)  # instead of yield item
from scrapy.crawler import CrawlerProcess

class Crawler:
    def __init__(self):
        self.process = CrawlerProcess()
        self.scraped_items = []

    def process_item(self, item):  # similar to process_item in a pipeline
        item['scraped'] = 'yes'
        self.scraped_items.append(item)
        return item

    def spawn(self, **kwargs):
        self.process.crawl(MySpider,
                           output_callback=self.process_item,
                           **kwargs)

    def run(self):
        self.process.start()

if __name__ == '__main__':
    crawler = Crawler()
    crawler.spawn(start_urls=['https://www.example.com', 'https://www.google.com'])
    crawler.run()
    print(crawler.scraped_items)
Output
[{'url': 'https://www.google.com', 'status': 200, 'scraped': 'yes'},
{'url': 'https://www.example.com', 'status': 200, 'scraped': 'yes'}]
process_item is very useful for processing an item as well as storing it.

Web Scraping Multiple Links with PyQt / QtWebkit

I'm trying to scrape a large website of government records which requires a "snowball" method, i.e., starting at the main search page and then following each link that the scraper finds to the next page.
I've been able to load the main page using PyQt and this SiteScraper tutorial.
import sys
from PySide.QtGui import *
from PySide.QtCore import *
from PySide.QtWebKit import *
from BeautifulSoup import BeautifulSoup

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

def main():
    baseUrl = 'http://www.thesite.gov'
    url = 'http://www.thesite.gov/search'

    r = Render(url)
    html = r.frame.toHtml()

    # use BeautifulSoup to cycle through each regulation
    soup = BeautifulSoup(html)
    regs = soup.find('div', {'class': 'x-grid3-body'}).findAll('a')

    # cycle through list and call up each page separately
    for reg in regs:
        link = baseUrl + reg['href']
        link = str(link)
        # use Qt to load each regulation page
        r = Render(link)
        html = r.frame.toHtml()  # get actual rendered web page
The problem is I get this error when I try to render a new webpage:
RuntimeError: A QApplication instance already exists.
I understand that the code is trying to create another QApplication instance. But how do I navigate to a new page with the same instance?
class Render(QWebPage):
    def __init__(self, app, url):
        QWebPage.__init__(self)
        self.app = app
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()  # exit the event loop so exec_() returns

def main():
    app = QApplication(sys.argv)
    baseUrl = 'http://www.thesite.gov'
    url = 'http://www.thesite.gov/search'
    r = Render(app, url)
    html = r.frame.toHtml()
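With the QApplication created once in main(), the link loop from the question can then reuse it by passing the same app object into each Render; roughly (a sketch building on the answer above, using the question's names):

def main():
    app = QApplication(sys.argv)
    baseUrl = 'http://www.thesite.gov'
    url = 'http://www.thesite.gov/search'
    r = Render(app, url)
    html = r.frame.toHtml()

    soup = BeautifulSoup(html)
    regs = soup.find('div', {'class': 'x-grid3-body'}).findAll('a')
    for reg in regs:
        link = baseUrl + str(reg['href'])
        r = Render(app, link)  # same QApplication, re-entered for each page
        html = r.frame.toHtml()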
I had the same problem (needing to load multiple pages with QWebPage), but I couldn't get any of these answers to work for me. Here's what did work; the key is to use a QEventLoop and connect loadFinished to loop.quit:
import sys

from PySide import QtCore, QtGui, QtWebKit

def loadPage(url):
    page = QtWebKit.QWebPage()
    loop = QtCore.QEventLoop()  # Create event loop
    page.mainFrame().loadFinished.connect(loop.quit)  # Connect loadFinished to loop quit
    page.mainFrame().load(url)
    loop.exec_()  # Run event loop, it will end on loadFinished
    return page.mainFrame().toHtml()

app = QtGui.QApplication(sys.argv)

urls = ['https://google.com', 'http://reddit.com', 'http://wikipedia.org']
for url in urls:
    print '-----------------------------------------------------'
    print 'Loading ' + url
    html = loadPage(url)
    print html

app.exit()
Posting a simplified example here compared to OP's to demonstrate the essential problem and solution.
You're crazy, man! Qt has a much better DOM than BeautifulSoup.
Replace:
soup = BeautifulSoup(html)
With
page = QWebPage()
page.settings().setAttribute(QWebSettings.AutoLoadImages, False)
page.settings().setAttribute(QWebSettings.PluginsEnabled, False)
page.mainFrame().setHtml(html)
dom = page.mainFrame().documentElement()
Then you can simply scrape data like so:
li = dom.findFirst("body div#content div#special ul > li")
if not li.isNull():
    li_class = li.attribute("class")  # "class" is a reserved word in Python, so use another name
    text = li.toPlainText()
Finally you should use QWebView instead of QWebPage. You can set it up to act like a server which can be controlled with a socket. This is what I do:
class QTimerWithPause(QTimer):
    def __init__(self, parent=None):
        super(QTimerWithPause, self).__init__(parent)
        self.startTime = 0
        self.interval = 0
        return

    def start(self, interval):
        from time import time
        self.interval = interval
        self.startTime = time()
        super(QTimerWithPause, self).start(interval)
        return

    def pause(self):
        from time import time
        if self.isActive():
            self.stop()
            elapsedTime = self.startTime - time()
            self.startTime -= elapsedTime
            # time() returns float secs, interval is int msec
            self.interval -= int(elapsedTime*1000)+1
        return

    def resume(self):
        if not self.isActive():
            self.start(self.interval)
        return
class CrawlerWebServer(QWebView):
    TIMEOUT = 60
    STUPID = r"(bing|yahoo|google)"

    def __init__(self, host="0.0.0.0", port=50007, parent=None, enableImages=True, enablePlugins=True):
        # Constructor
        super(CrawlerWebServer, self).__init__(parent)
        self.command = None
        self.isLoading = True
        self.isConnected = False
        self.url = QUrl("http://mast3rpee.tk/")
        self.timeout = QTimerWithPause(self)
        self.socket = QTcpServer(self)
        # 1: Settings
        self.settings().enablePersistentStorage()
        self.settings().setAttribute(QWebSettings.AutoLoadImages, enableImages)
        self.settings().setAttribute(QWebSettings.PluginsEnabled, enablePlugins)
        self.settings().setAttribute(QWebSettings.DeveloperExtrasEnabled, True)
        # 2: Server
        if args.verbosity > 0: print "Starting server..."
        self.socket.setProxy(QNetworkProxy(QNetworkProxy.NoProxy))
        self.socket.listen(QHostAddress(host), int(port))
        self.connect(self.socket, SIGNAL("newConnection()"), self._connect)
        if args.verbosity > 1:
            print " Waiting for connection(" + host + ":" + str(port) + ")..."
        # 3: Default page
        self._load(10*1000, self._loadFinished)
        return

    def __del__(self):
        try:
            self.conn.close()
            self.socket.close()
        except:
            pass
        return

    def _sendAuth(self):
        self.conn.write("Welcome to WebCrawler server (http://mast3rpee.tk)\r\n\rLicenced under GPL\r\n\r\n")

    def _connect(self):
        self.disconnect(self.socket, SIGNAL("newConnection()"), self._connect)
        self.conn = self.socket.nextPendingConnection()
        self.conn.nextBlockSize = 0
        self.connect(self.conn, SIGNAL("readyRead()"), self.io)
        self.connect(self.conn, SIGNAL("disconnected()"), self.close)
        self.connect(self.conn, SIGNAL("error()"), self.close)
        self._sendAuth()
        if args.verbosity > 1:
            print " Connection by:", self.conn.peerAddress().toString()
        self.isConnected = True
        if self.isLoading == False:
            self.conn.write("\r\nEnter command:")
        return

    def io(self):
        if self.isLoading: return None
        if args.verbosity > 0:
            print "Reading command..."
        data = self.conn.read(1024).strip(" \r\n\t")
        if not data: return None
        elif self.command is not None:
            r = self.command(data)
            self.command = None
            return r
        return self._getCommand(data)

    def _getCommand(self, d):
        from re import search
        d = unicode(d, errors="ignore")
        if search(r"(help|HELP)", d) is not None:
            self.conn.write("URL | JS | WAIT | QUIT\r\n\r\nEnter Command:")
        elif search(r"(url|URL)", d) is not None:
            self.command = self._print
            self.conn.write("Enter address:")
        elif search(r"(js|JS|javascript|JAVASCRIPT)", d) is not None:
            self.command = self._js
            self.conn.write("Enter javascript to execte:")
        elif search(r"(wait|WAIT)", d) is not None:
            self.loadFinished.connect(self._loadFinishedPrint)
            self.loadFinished.connect(self._loadFinished)
        elif search(r"(quit|QUIT|exit|EXIT)", d) is not None:
            self.close()
        else:
            self.conn.write("Invalid command!\r\n\r\nEnter Command:")
        return

    def _print(self, d):
        u = d[:250]
        self.out(u)
        return True

    def _js(self, d):
        try:
            self.page().mainFrame().evaluateJavaScript(d)
        except:
            pass
        self.conn.write("Enter Javascript:")
        return True

    def _stop(self):
        from time import sleep
        if self.isLoading == False: return
        if args.verbosity > 0:
            print " Stopping..."
        self.timeout.stop()
        self.stop()

    def _load(self, timeout, after):
        # Loads a page into frame / sets up timeout
        self.timeout.timeout.connect(self._stop)
        self.timeout.start(timeout)
        self.loadFinished.connect(after)
        self.load(self.url)
        return

    def _loadDone(self, disconnect=None):
        from re import search
        from time import sleep
        self.timeout.timeout.disconnect(self._stop)
        self.timeout.stop()
        if disconnect is not None:
            self.loadFinished.disconnect(disconnect)
        # Stick a while on the page
        if search(CrawlerWebServer.STUPID, self.url.toString(QUrl.RemovePath)) is not None:
            sleep(5)
        else:
            sleep(1)
        return

    def _loadError(self):
        from time import sleep, time
        if not self.timeout.isActive(): return True
        if args.verbosity > 0: print " Error retrying..."
        # 1: Pause timeout
        self.timeout.pause()
        # 2: Check for internet connection
        while self.page().networkAccessManager().networkAccessible() == QNetworkAccessManager.NotAccessible: sleep(1)
        # 3: Wait then try again
        sleep(2)
        self.reload()
        self.timeout.resume()
        return False

    def go(self, url, after=None):
        # Go to a specific address
        global args
        if after is None:
            after = self._loadFinished
        if args.verbosity > 0:
            print "Loading url..."
        self.url = QUrl(url)
        self.isLoading = True
        if args.verbosity > 1:
            print " ", self.url.toString()
        self._load(CrawlerWebServer.TIMEOUT * 1000, after)
        return

    def out(self, url):
        # Print html of a a specific url
        self.go(url, self._loadFinishedPrint)
        return

    def createWindow(self, windowType):
        # Load links in the same web-view.
        return self

    def _loadFinished(self, ok):
        # Default LoadFinished
        from time import sleep
        from re import search
        if self.isLoading == False: return
        if ok == False:
            if not self._loadError(): return
        self._loadDone(self._loadFinished)
        if args.verbosity > 1:
            print " Done"
        if self.isConnected == True:
            self.conn.write("\r\nEnter command:")
        self.isLoading = False
        return

    def _loadFinishedPrint(self, ok):
        # Print the evaluated HTML to stdout
        if self.isLoading == False: return
        if ok == False:
            if not self._loadError(): return
        self._loadDone(self._loadFinishedPrint)
        if args.verbosity > 1:
            print " Done"
        h = unicode( self.page().mainFrame().toHtml(), errors="ignore" )
        if args.verbosity > 2:
            print "------------------\n" + h + "\n--------------------"
        self.conn.write(h)
        self.conn.write("\r\nEnter command:")
        self.isLoading = False
        return

    def contextMenuEvent(self, event):
        # Context Menu
        menu = self.page().createStandardContextMenu()
        menu.addSeparator()
        action = menu.addAction('ReLoad')

        @action.triggered.connect
        def refresh():
            self.load(self.url)

        menu.exec_(QCursor.pos())
class CrawlerWebClient(object):
    def __init__(self, host, port):
        import socket
        global args
        # CONNECT TO SERVER
        self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.socket.connect((host, port))
        o = self.read()
        if args.verbosity > 2:
            print "\n------------------------------\n" + o + "\n------------------------------\n"
        return

    def __del__(self):
        try: self.socket.close()
        except: pass

    def read(self):
        from re import search
        r = ""
        while True:
            out = self.socket.recv(64*1024).strip("\r\n")
            if out.startswith(r"Enter"):
                break
            if out.endswith(r"Enter command:"):
                r += out[:-14]
                break
            r += out
        return r

    def command(self, command):
        global args
        if args.verbosity > 2:
            print " Command: [" + command + "]\n------------------------------"
        self.socket.sendall(unicode(command))
        r = self.read()
        if args.verbosity > 2:
            print r, "\n------------------------------\n"
        return r
OK then. If you really need JavaScript (can you get the data as JSON at all? That would probably be easier still, with simplejson or json), the answer is: don't make more than one QApplication; you're not allowed to. Have main create a QApplication and then use the QWebPage without bothering to call QApplication.exec_(). If that doesn't work, run it all in another QThread.
I am not familiar with PyQt, but as an option, you could write your script without using a class. That way, you can more easily re-use that application instance.
Hope it helps.
