OS: Ubuntu 16.04
Stack - Scrapy 1.0.3 + Selenium
I'm pretty new to Scrapy and this might sound very basic, but in my spider only __init__ is getting executed. Any code/function after that is not getting called and the spider just halts.
class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = [
        "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"
    ]

    def __init__(self, *args, **kwargs):
        self.browser = webdriver.Firefox()
        self.browser.get("http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum")
        print "----------------Going to sleep------------------"
        time.sleep(5)
        # self.parse()

    def __exit__(self):
        print "------------Exiting----------"
        self.browser.quit()

    def parse(self, response):
        print "----------------Inside Parse------------------"
        print "------------Exiting----------"
        self.browser.quit()
The spider gets the browser object, prints "Going to sleep" and just halts. It doesn't go inside the parse function.
Following are the contents of the run logs:
----------------inside init----------------
----------------Going to sleep------------------
There are a few problems you need to address or be aware of:
You're not calling super() in the __init__ method, so none of the inherited class's initialization happens. Scrapy won't do anything (like calling its parse() method), as all of that is set up in scrapy.Spider (see the sketch after this list for a minimal fix).
After fixing the above, your parse() method will be called by Scrapy, but it won't be operating on your Selenium-fetched webpage. It will have no knowledge of it whatsoever, and will go re-fetch the URL (based on start_urls). It's very likely that these two sources will differ (often drastically).
You're going to be bypassing almost all of Scrapy's functionality using Selenium the way you are. All of Selenium's get() calls will be executed outside of the Scrapy framework. Middleware won't be applied (cookies, throttling, filtering, etc.), nor will any of the expected/created objects (like request and response) be populated with the data you expect.
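For the first point, a minimal sketch of the fix (only the missing super() call is shown; the other issues above still apply):
# Sketch only: forward *args/**kwargs to the base class so Scrapy's own
# spider setup still runs before you attach the Selenium browser.
class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"

    def __init__(self, *args, **kwargs):
        super(CancerForumSpider, self).__init__(*args, **kwargs)
        self.browser = webdriver.Firefox()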
Before you fix all of that, you should consider a couple of better options/alternatives:
Create a downloader middleware that handles all the Selenium-related functionality. Have it intercept request objects right before they hit the downloader, populate new response objects and return them for processing by the spider (a rough sketch follows after this list).
This isn't optimal, as you're effectively creating your own downloader, and short-circuiting Scrapy's. You'll have to re-implement the handling of any desired settings the downloader usually takes into account and make them work with Selenium.
Ditch Selenium and use the Splash HTTP rendering service and the scrapy-splash middleware for handling JavaScript.
Ditch Scrapy altogether and just use Selenium and BeautifulSoup.
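To illustrate the first option, a rough sketch of such a downloader middleware might look like the following (the class name and structure are assumptions, not a drop-in implementation; process_request returning a ready-made response is what short-circuits Scrapy's downloader):
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    # Hypothetical middleware: fetch the page with Selenium and hand the
    # rendered HTML back to the spider as a normal Scrapy response.
    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        body = self.driver.page_source
        # Returning a Response here means Scrapy's own downloader is skipped.
        return HtmlResponse(self.driver.current_url, body=body,
                            encoding="utf-8", request=request)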
Scrapy is useful when you have to crawl a large number of pages. Selenium is normally useful for scraping when you need the DOM source after the JavaScript has run. If that's your situation, there are two main ways to combine Selenium and Scrapy. One is to write a download handler, like the one you can find here.
The code goes as follows:
# encoding: utf-8
from __future__ import unicode_literals
from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure
class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't respond to a window switch until the page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response headers in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()
Suppose your scraper is called "scraper". If you put the above code inside a file called handlers.py at the root of the "scraper" folder, then you could add this to your settings.py:
DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}
Another way is to write a downloader middleware, as described here. The downloader middleware has the downside of preventing some key features from working out of the box, such as caching and retries. It would be registered in settings.py roughly as shown below.
In any case, starting the Selenium webdriver in the __init__ of the Scrapy spider is not the usual way to go.
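For reference, registering such a middleware in settings.py would look roughly like this (the module path and priority value are assumptions):
# settings.py -- hypothetical module path for a Selenium downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scraper.middlewares.SeleniumMiddleware': 543,
}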
Related
TL;DR
In Scrapy, I want each Request to wait until all spider parse callbacks finish, so the whole process needs to be sequential. Like this:
Request1 -> Crawl1 -> Request2 -> Crawl2 ...
But what is happening now:
Request1 -> Request2 -> Request3 ...
Crawl1
Crawl2
Crawl3 ...
Long version
I am new to scrapy + selenium web scraping.
I am trying to scrape a website where the contents are updated heavily with JavaScript. First I am opening the website with Selenium and logging in. After that, I am using a downloader middleware that handles the requests with Selenium and returns the responses. Below is the middleware's process_request implementation:
# imports implied by the snippet below (not in the original excerpt)
from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class XYZDownloaderMiddleware:
    '''Other functions are as is. I just changed this one'''

    def process_request(self, request, spider):
        driver = request.meta['driver']

        # We are opening a new link
        if request.meta['load_url']:
            driver.get(request.url)
            WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))

        # We are clicking on an element to get new data using JavaScript.
        elif request.meta['click_bet']:
            element = request.meta['click_bet']
            element.click()
            WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding="utf-8", request=request)
In settings, I have also set CONCURRENT_REQUESTS = 1 so that multiple driver.get() calls are not made at once and Selenium can peacefully load responses one by one.
Now what I see happening is: Selenium opens each URL, Scrapy lets Selenium wait for the response to finish loading, and then the middleware returns the response properly (it goes into the if request.meta['load_url'] block).
But after I get the response, I want to use the Selenium driver (in the parse(response) functions) to click on each of the elements by yielding a Request and returning the updated HTML from the middleware (the elif request.meta['click_bet'] block).
The Spider is minimally like this:
import scrapy

class XYZSpider(scrapy.Spider):
    def start_requests(self):
        start_urls = [
            'https://www.example.com/a',
            'https://www.example.com/b'
        ]

        self.driver = self.getSeleniumDriver()
        for url in start_urls:
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '/div/bla/bla'
            request.meta['click_bet'] = None
            yield request

    def parse(self, response):
        urls = response.xpath('//a/@href').getall()
        for url in urls:
            request = scrapy.Request(url=url, callback=self.rightSectionParse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
            request.meta['click_bet'] = None
            yield request

    def rightSectionParse(self, response):
        ...
So what is happening is that Scrapy is not waiting for the spider to parse. Scrapy gets the response, and then calls the parse callback and fetches the next response in parallel. But the Selenium driver needs to be used by the parse callback function before the next request is processed.
I want the requests to wait until the parse callback is finished.
I'm testing out a Splash instance with Scrapy 1.6, following https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash and https://aaqai.me/notes/scrapy-splash-setup. My spider:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser
class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'mytest'

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 7.5},)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        open_in_browser(response)
        return None
The output opens up in notepad rather than a browser. How can I open this in a browser?
If you are using the Splash middleware and everything is set up, the Splash response goes into the regular response object, which you can access via response.css and response.xpath. Depending on what endpoint you use, you can execute JavaScript and other stuff.
If you need to move around a page and do other stuff, you will need to write a Lua script to execute with the proper endpoint (a rough example follows below). As far as parsing goes, the output automatically goes into the response object.
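As a rough illustration (this example is an assumption, not part of the original answer), a Lua script is typically passed to the execute endpoint via the lua_source argument:
# Hedged sketch: a minimal Lua script sent to Splash's 'execute' endpoint;
# it loads the page, waits a bit, then returns the rendered HTML.
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1.0)
    return splash:html()
end
"""

class LuaExampleSpider(scrapy.Spider):
    name = 'lua_example'
    start_urls = ['http://yahoo.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='execute',
                                args={'lua_source': LUA_SCRIPT})

    def parse(self, response):
        # response.text here is whatever splash:html() returned
        self.logger.info(response.css('title::text').extract_first())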
Get rid of open_in_browser. I'm not exactly sure what you are doing, but if all you want to do is parse the page, you can do so like this:
body = response.css('body').extract_first()
links = response.css('a::attr(href)').extract()
If you could please clarify your question: most people don't want to follow links to try and guess what you're having trouble with.
Update for clarified question:
It sounds like you may want scrapy shell with Splash; this will enable you to experiment with selectors:
scrapy shell 'http://localhost:8050/render.html?url=http://page.html&timeout=10&wait=0.5'
In order to access Splash in a browser instance, simply go to http://0.0.0.0:8050/ and input the URL there. I'm not sure about the method in the tutorial, but this is how you can interact with the Splash session.
I've created a script in Scrapy to parse the author names of different posts from its landing page and then pass them to the parse_page method using the meta keyword, in order to print the post content along with the author name at the same time.
I've used download_slot within the meta keyword, which allegedly makes the script run faster. Although it is not necessary to comply with the logic I tried to apply here, I would like to stick to it only to understand how download_slot works within any script and why. I searched a lot to learn more about download_slot but ended up with some links like this one.
An example usage of download_slot (I'm not quite sure about it though):
from scrapy.crawler import CrawlerProcess
from scrapy import Request
import scrapy
class ConventionSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for link in response.css('.summary'):
            name = link.css('.user-details a::text').extract_first()
            url = link.css('.question-hyperlink::attr(href)').extract_first()
            nurl = response.urljoin(url)
            yield Request(nurl, callback=self.parse_page, meta={'item': name, "download_slot": name})

    def parse_page(self, response):
        elem = response.meta.get("item")
        post = ' '.join([item for item in response.css("#question .post-text p::text").extract()])
        yield {'Name': elem, 'Main_Content': post}

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    process.crawl(ConventionSpider)
    process.start()
The above script runs flawlessly.
My question: how does download_slot work within Scrapy?
Let's start with the Scrapy architecture. When you create a scrapy.Request, the Scrapy engine passes the request to the downloader to fetch the content. The downloader puts incoming requests into slots which you can imagine as independent queues of requests. The queues are then polled and each individual request gets processed (the content gets downloaded).
Now, here's the crucial part. To determine which slot to put the incoming request into, the downloader checks request.meta for the download_slot key. If it's present, it puts the request into the slot with that name (and creates it if it doesn't exist yet). If the download_slot key is not present, it puts the request into the slot for the domain (more accurately, the hostname) the request's URL points to.
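Conceptually (a simplified sketch, not Scrapy's actual source), the slot key is chosen roughly like this:
# Rough illustration of how a downloader slot key is picked for a request
from urllib.parse import urlparse

def slot_key(request):
    if 'download_slot' in request.meta:
        return request.meta['download_slot']
    return urlparse(request.url).hostname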
This explains why your script runs faster. You create multiple downloader slots because they are based on the author's name. If you did not, they would be put into the same slot based on the domain (which is always stackoverflow.com). Thus, you effectively increase the parallelism of downloading content.
This explanation is a little bit simplified but it should give you a picture of what's going on. You can check the code yourself.
For example, say there is a target website which allows only 1 request per 20 seconds, and we need to parse/process 3000 webpages of product data from it.
With a common spider and DOWNLOAD_DELAY set to 20, the application will finish in ~17 hours (3000 pages * 20 seconds download delay).
If you aim to increase scraping speed without getting banned by the website and you have, for example, 20 valid proxies, you can uniformly allocate request URLs to all your proxies using the proxy and download_slot meta keys and significantly reduce the application completion time:
from scrapy.crawler import CrawlerProcess
from scrapy import Request
import scrapy
class ProxySpider(scrapy.Spider):
    name = 'proxy'
    start_urls = ['https://example.com/products/1', 'https://example.com/products/2', '....']  # list with 3000 product urls
    proxies = [',,,']  # list with 20 proxies

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            chosen_proxy = self.proxies[index % len(self.proxies)]
            yield Request(url, callback=self.parse,
                          meta={"proxy": chosen_proxy, "download_slot": chosen_proxy})

    def parse(self, response):
        ....
        yield item
        # yield Request(details_url,
        #               callback=self.parse_additional_details,
        #               meta={"download_slot": response.request.meta["download_slot"],
        #                     "proxy": response.request.meta["proxy"]})

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 'DOWNLOAD_DELAY': 20, "COOKIES_ENABLED": False
    })
    process.crawl(ProxySpider)
    process.start()
I'm using Scrapy to scrape data from this site. I need to call getlink from parse. A normal call is not working, and when I use yield, I get this error:
2015-11-16 10:12:34 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/>
Returning the getlink function from parse works, but I need to execute some code even after returning. I'm confused; any help would be really appreciated.
# -*- coding: utf-8 -*-
from scrapy.spiders import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request,Response
import re
import csv
import time
from selenium import webdriver
class ColdWellSpider(BaseSpider):
    name = "cwspider"
    allowed_domains = ["coldwellbankerhomes.com"]
    #start_urls = [''.join(row).strip() for row in csv.reader(open("remaining_links.csv"))]
    #start_urls = ['https://www.coldwellbankerhomes.com/fl/boynton-beach/5451-verona-drive-unit-d/pid_9266204/']
    start_urls = ['https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/']

    def parse(self, response):
        #browser = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false'])
        browser = webdriver.Firefox()
        browser.maximize_window()
        browser.get(response.url)
        time.sleep(5)

        #to extract all the links from a page and send request to those links
        #this works but even after returning i need to execute the while loop
        return self.getlink(response)

        #for clicking the load more button in the page
        while True:
            try:
                browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
                time.sleep(3)
                self.getlink(response)
            except:
                break

    def getlink(self, response):
        print 'hhelo'
        c = open('data_getlink.csv', 'a')
        d = csv.writer(c, lineterminator='\n')
        print 'hello2'
        listclass = response.xpath('//div[@class="list-items"]/div[contains(@id,"snapshot")]')
        for l in listclass:
            link = 'http://www.coldwellbankerhomes.com/' + ''.join(l.xpath('./h2/a/@href').extract())
            d.writerow([link])
            yield Request(url=str(link), callback=self.parse_link)

    #callback function of Request
    def parse_link(self, response):
        b = open('data_parselink.csv', 'a')
        a = csv.writer(b, lineterminator='\n')
        a.writerow([response.url])
Spider must return Request, BaseItem, dict or None, got 'generator'
getlink() is a generator. You are trying to yield it from the parse() generator.
Instead, you can/should iterate over the results of the getlink() call:
def parse(self, response):
    browser = webdriver.Firefox()
    browser.maximize_window()
    browser.get(response.url)
    time.sleep(5)

    while True:
        try:
            for request in self.getlink(response):
                yield request

            browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
            time.sleep(3)
        except:
            break
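As a side note (an addition here, assuming a move to Python 3), the inner loop can be collapsed into a single statement with the same behaviour:
# equivalent to "for request in self.getlink(response): yield request"
yield from self.getlink(response)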
Also, I've noticed you have both self.getlink(response) and self.getlink(browser). The latter is not going to work, since there is no xpath() method on a webdriver instance; you probably meant to make a Scrapy Selector out of the page source that your webdriver-controlled browser has loaded, for example:
selector = scrapy.Selector(text=browser.page_source)
self.getlink(selector)
You should also take a look at Explicit Waits with Expected Conditions instead of using unreliable and slow artificial delays via time.sleep().
Plus, I'm not sure why you are writing to CSV manually instead of using the built-in Scrapy Items and Item Exporters (a small sketch follows below). And you are not closing the files properly, nor using the with context manager.
Additionally, try to catch more specific exception(s) and avoid having a bare try/except block.
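To illustrate the Items/exporter point (the field name is an assumption), the callbacks could simply yield items and let Scrapy's feed exports write the CSV:
# Sketch only: replace the manual csv.writer calls with yielded items
def parse_link(self, response):
    yield {'url': response.url}

# then run with a feed export, e.g.:
#   scrapy crawl cwspider -o data_parselink.csv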
Is there a way to stop a url from redirecting?
driver.get('http://loginrequired.com')
This redirects me to another page but I want it to stay on that page without redirecting by default.
There are two ways that what users call "redirection" typically happens:
You load a page and the page loads some JavaScript code which performs a test and decides to load a different page. This process can be interrupted in some browsers by hitting the ESCAPE key. Selenium can send an ESCAPE key.
However, this redirection could happen before Selenium gives control back to your script. Whether it would work in any specific case depends on the page being loaded.
You load a page and get an HTTP 3xx (301, 302, 303, 307, etc.) response from the server. There are no opportunities for users to interrupt these redirections in their browser, so Selenium does not provide the means to interrupt or prevent them.
So there is no surefire way to prevent a redirection in Selenium.
A solution, in case you do not need to visualize the page but only need access to the source of "http://loginrequired.com", would be to use Selenium with Scrapy.
Basically you tell the Scrapy middleware to stop following redirects, and when the spider accesses the page it handles the redirection (302) itself.
In settings.py you have to set:
REDIRECT_ENABLED = False
The spider code is:
# imports implied by the snippet below
from scrapy.http import Request
from scrapy.spiders import CrawlSpider
from selenium import webdriver

class LoginSpider(CrawlSpider):
    name = "login"
    allowed_domains = ['loginrequired.com']
    start_urls = ['http://loginrequired.com']
    handle_httpstatus_list = [302]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        if response.status in self.handle_httpstatus_list:
            return Request(url="http://loginrequired.com", callback=self.after_302)

    def after_302(self, response):
        print response.url
        # Your code to analyse the page goes here
Idea taken from how to handle 302 redirect in scrapy