passing selenium response url to scrapy - python

I am learning Python and am trying to scrape this page for a specific value on the dropdown menu. After that I need to click each item on the resulted table to retrieve the specific information. I am able to select the item and retrieve the information on the webdriver. But I do not know how to pass the response url to the crawlspider.
driver = webdriver.Firefox()
driver.get('http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp')
more_btn = WebDriverWait(driver, 20).until(
EC.visibility_of_element_located((By.ID, '_button_select'))
)
more_btn.click()
## select specific value from the dropdown
driver.find_element_by_css_selector("select#tabJcwyxt_jiebie > option[value='teyaoxgrs']").click()
driver.find_element_by_css_selector("select#tabJcwyxt_jieci > option[value='d11jie']").click()
search2 = driver.find_element_by_class_name('input_a2')
search2.click()
time.sleep(5)
## convert html to "nice format"
text_html=driver.page_source.encode('utf-8')
html_str=str(text_html)
## this is a hack that initiates a "TextResponse" object (taken from the Scrapy module)
resp_for_scrapy=TextResponse('none',200,{},html_str,[],None)
## convert html to "nice format"
text_html=driver.page_source.encode('utf-8')
html_str=str(text_html)
resp_for_scrapy=TextResponse('none',200,{},html_str,[],None)
So this is where I am stuck. I was able to query using the above code. But How can I pass resp_for_scrapy to the crawlspider? I put resp_for_scrapy in place of item but that didn't work.
## spider
class ProfileSpider(CrawlSpider):
name = 'pccprofile2'
allowed_domains = ['cppcc.gov.cn']
start_urls = ['http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp']
def parse(self, resp_for_scrapy):
hxs = HtmlXPathSelector(resp_for_scrapy)
for post in resp_for_scrapy.xpath('//div[#class="table"]//ul//li'):
items = []
item = Ppcprofile2Item()
item ["name"] = hxs.select("//h1/text()").extract()
item ["title"] = hxs.select("//div[#id='contentbody']//tr//td//text()").extract()
items.append(item)
##click next page
while True:
next = self.driver.findElement(By.linkText("下一页"))
try:
next.click()
except:
break
return(items)
Any suggestions would be greatly appreciated!!!!
EDITS I included a middleware class to select from the dropdown before the spider class. But now there is no error and no result.
class JSMiddleware(object):
def process_request(self, request, spider):
driver = webdriver.PhantomJS()
driver.get('http://www.cppcc.gov.cn/CMS/icms/project1/cppcc/wylibary/wjWeiYuanList.jsp')
# select from the dropdown
more_btn = WebDriverWait(driver, 20).until(
EC.visibility_of_element_located((By.ID, '_button_select'))
)
more_btn.click()
driver.find_element_by_css_selector("select#tabJcwyxt_jiebie > option[value='teyaoxgrs']").click()
driver.find_element_by_css_selector("select#tabJcwyxt_jieci > option[value='d11jie']").click()
search2 = driver.find_element_by_class_name('input_a2')
search2.click()
time.sleep(5)
#get the response
body = driver.page_source
return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
class ProfileSpider(CrawlSpider):
name = 'pccprofile2'
rules = [Rule(SgmlLinkExtractor(allow=(),restrict_xpaths=("//div[#class='table']")), callback='parse_item')]
def parse_item(self, response):
hxs = HtmlXPathSelector(response)
items = []
item = Ppcprofile2Item()
item ["name"] = hxs.select("//h1/text()").extract()
item ["title"] = hxs.select("//div[#id='contentbody']//tr//td//text()").extract()
items.append(item)
#click next page
while True:
next = response.findElement(By.linkText("下一页"))
try:
next.click()
except:
break
return(items)

Use Downloader Middleware to catch selenium-required pages before you process them regularly with Scrapy:
The downloader middleware is a framework of hooks into Scrapy’s request/response processing. It’s a light, low-level system for globally altering Scrapy’s requests and responses.
Here's a very basic example using PhantomJS:
from scrapy.http import HtmlResponse
from selenium import webdriver
class JSMiddleware(object):
def process_request(self, request, spider):
driver = webdriver.PhantomJS()
driver.get(request.url)
body = driver.page_source
return HtmlResponse(driver.current_url, body=body, encoding='utf-8', request=request)
Once you return that HtmlResponse (or a TextResponse if that's what you really want), Scrapy will cease processing downloaders and drop into the spider's parse method:
If it returns a Response object, Scrapy won’t bother calling any other
process_request() or process_exception() methods, or the appropriate
download function; it’ll return that response. The process_response()
methods of installed middleware is always called on every response.
In this case, you can continue to use your spider's parse method as you normally would with HTML, except that the JS on the page has already been executed.
Tip: Since the Downloader Middleware's process_request method accepts the spider as an argument, you can add a conditional in the spider to check whether you need to process JS at all, and that will let you handle both JS and non-JS pages with the exact same spider class.

Here is a middleware for Scrapy and Selenium
from scrapy.http import HtmlResponse
from scrapy.utils.python import to_bytes
from selenium import webdriver
from scrapy import signals
class SeleniumMiddleware(object):
#classmethod
def from_crawler(cls, crawler):
middleware = cls()
crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
return middleware
def process_request(self, request, spider):
request.meta['driver'] = self.driver # to access driver from response
self.driver.get(request.url)
body = to_bytes(self.driver.page_source) # body must be of type bytes
return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
def spider_opened(self, spider):
self.driver = webdriver.Firefox()
def spider_closed(self, spider):
self.driver.close()
Also need to add in settings.py
DOWNLOADER_MIDDLEWARES = {
'youproject.middlewares.selenium.SeleniumMiddleware': 200
}
Decide weather its 200 or something else based on docs.
Update firefox headless mode with scrapy and selenium
If you want to run firefox in headless mode then install xvfb
sudo apt-get install -y xvfb
and PyVirtualDisplay
sudo pip install pyvirtualdisplay
and use this middleware
from shutil import which
from pyvirtualdisplay import Display
from scrapy import signals
from scrapy.http import HtmlResponse
from scrapy.utils.project import get_project_settings
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
settings = get_project_settings()
HEADLESS = True
class SeleniumMiddleware(object):
#classmethod
def from_crawler(cls, crawler):
middleware = cls()
crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
return middleware
def process_request(self, request, spider):
self.driver.get(request.url)
request.meta['driver'] = self.driver
body = str.encode(self.driver.page_source)
return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
def spider_opened(self, spider):
if HEADLESS:
self.display = Display(visible=0, size=(1280, 1024))
self.display.start()
binary = FirefoxBinary(settings.get('FIREFOX_EXE') or which('firefox'))
self.driver = webdriver.Firefox(firefox_binary=binary)
def spider_closed(self, spider):
self.driver.close()
if HEADLESS:
self.display.stop()
where settings.py contains
FIREFOX_EXE = '/path/to/firefox.exe'
The problem is that some versions of firefox don't work with selenium.
To solve this problem you can download firefox version 47.0.1 (this version works flawlessly) from here then extract it and put it inside your selenium project. Afterwards modify firefox path as
FIREFOX_EXE = '/path/to/your/scrapyproject/firefox/firefox.exe'

Related

Scrapy with Selenium Middleware to generate second response after first response

I'm trying to extract comments from a news page. The Crawler starts at the homepage and follows all the internal links found on the site. The comments are just on the article-sites and those comments are embedded from an external Website, so the section with the comments are in an JavaScript iframe. Here's an example article site
My first Step was to build a crawler and a selenium middleware. The crawler follows all the links and those are loaded through Selenium:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class CrawlerSpider(CrawlSpider):
name = 'crawler'
allowed_domains = ['www.merkur.de', 'disqus.com/embed/comments/']
start_urls = ['https://www.merkur.de/welt/novavax-corona-totimpfstoff-omikron-zulassung-impfstoff-weihnachten-wirkung-covid-lauterbach-zr-91197497.html']
rules = [Rule(LinkExtractor(allow=r'.*'), callback='parse',
follow=True)]
def parse(self, response):
title = response.xpath('//html/head/title/text()').extract_first()
iframe_url = response.xpath('//iframe[#title="Disqus"]//#src').get()
yield Request(iframe_url, callback=self.next_parse, meta={'title': title})
def next_parse(self, response):
title = response.meta.get('title')
comments = response.xpath("//div[#class='post-message ']/div/p").getall()
yield {
'title': title,
'comments': comments
}
To get access to the iframe elements the Scrapy Request goes through the middleware:
from scrapy import signals, spiders
from selenium import webdriver
from scrapy.http import HtmlResponse
from selenium.webdriver.chrome.options import Options
class SeleniumMiddleware(object):
def __init__(self):
chrome_options = Options()
chrome_options.add_argument("--headless")
self.driver = webdriver.Chrome(options=chrome_options)
# Here you get the request you are making to the urls with the LinkExtractor found and use selenium to get them and return a response.
def process_request(self, request, spider):
self.driver.get(request.url)
element = self.driver.find_element_by_xpath('//div[#id="disqus_thread"]')
self.driver.execute_script("arguments[0].scrollIntoView();", element)
time.sleep(1)
body = self.driver.page_source
return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
I am getting the right link from the iframe src here but my CrawlerSpider is not yielding the iframe_url Request so that I can follow the link from the iframe. What am I doing wrong here ? I really appreciate your help!

In scrapy+selenium, how to make a spider request to wait until previous request has finished processing?

TL;DR
In scrapy, I want the Request to wait till all spider parse callbacks finish. So the whole process needs to be sequential. Like this:
Request1 -> Crawl1 -> Request2 -> Crawl2 ...
But what is happening now:
Request1 -> Request2 -> Request3 ...
Crawl1
Crawl2
Crawl3 ...
Long version
I am new to scrapy + selenium web scraping.
I am trying to scrape a website where the contents are being updated heavily with javascript. Firstly I am opening the website with selenium and logging in. After that, I am creating a using a downloader middleware that handles the requests with selenium and returns the responses. Below is the middleware's process_request implementation:
class XYZDownloaderMiddleware:
'''Other functions are as is. I just changed this one'''
def process_request(self, request, spider):
driver = request.meta['driver']
# We are opening a new link
if request.meta['load_url']:
driver.get(request.url)
WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))
# We are clicking on an element to get new data using javascript.
elif request.meta['click_bet']:
element = request.meta['click_bet']
element.click()
WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))
body = driver.page_source
return HtmlResponse(driver.current_url, body=body, encoding="utf-8", request=request)
In settings, I have also set CONCURRENT_REQUESTS = 1 so that, multiple driver.get() do not called and selenium can peacefully load responses one by one.
Now what I see happening is selenium opens each URL, scrapy lets selenium wait for the response to finish loading and then middleware returns the response properly (goes to if response.meta['load_url'] block).
But, after I got the response, I want to use the selenium driver (in the parse(response) functions) to click on each of the elements by yielding a Request and return the updated HTML from the middleware (the elif request.meta['click_bet'] block).
The Spider is minimally like this:
class XYZSpider(scrapy.Spider):
def start_requests(self):
start_urls = [
'https://www.example.com/a',
'https://www.example.com/b'
]
self.driver = self.getSeleniumDriver()
for url in start_urls:
request = scrapy.Request(url=url, callback=self.parse)
request.meta['driver'] = self.driver
request.meta['load_url'] = True
request.meta['wait_for_xpath'] = '/div/bla/bla'
request.meta['click_bet'] = None
yield request
def parse(self, response):
urls = response.xpath('//a/#href').getall()
for url in start_urls:
request = scrapy.Request(url=url, callback=self.rightSectionParse)
request.meta['driver'] = self.driver
request.meta['load_url'] = True
request.meta['wait_for_xpath'] = '//div[contains(#class, "rightSection")]'
request.meta['click_bet'] = None
yield request
def rightSectionParse(self, response):
...
So what is happening is, scrapy is not waiting for the spider to parse. Scrapy gets the response, and then parallelly calls parse callback and next fetch response. But selenium driver needs to be used by the parse callback function before the next request processing.
I want the requests to wait until the parse callback is finished.

Scrapy. First response requires Selenium

I'm scraping a website that strongly depends on Javascript. The main page from which I need to extract the urls that will be parsed depends on Javascript, so I have to modify start_requests.
I'm looking for a way to connect start_requests, with the linkextractor and with process_match
class MatchSpider(CrawlSpider):
name = "match"
allowed_domains = ["whoscored"]
rules = (
Rule(LinkExtractor(restrict_xpaths='//*[contains(#class, "match-report")]//#href'), callback='parse_item'),
)
def start_requests(self):
url = 'https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/Fixtures/England-Premier-League-2016-2017'
browser = Browser(browser='Chrome')
browser.get(url)
# should return a request with the html body from Selenium driver so that LinkExtractor rule can be applied
def process_match(self, response):
match_item = MatchItem()
regex = re.compile("matchCentreData = \{.*?\};", re.S)
match = re.search(regex, response.text).group()
match = match.replace('matchCentreData =', '').replace(';', '')
match_item['match'] = json.loads(match)
match_item['url'] = response.url
match_item['project'] = self.settings.get('BOT_NAME')
match_item['spider'] = self.name
match_item['server'] = socket.gethostname()
match_item['date'] = datetime.datetime.now()
yield match_item
A wrapper I'm using around Selenium:
class Browser:
"""
selenium on steroids. allows you to create different types of browsers plus
adds methods for safer calls
"""
def __init__(self, browser='Firefox'):
"""
type: silent or not
browser: chrome of firefox
"""
self.browser = browser
self._start()
def _start(self):
'''
starts browser
'''
if self.browser == 'Chrome':
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_extension('./libcommon/adblockpluschrome-1.10.0.1526.crx')
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument("user-agent={0}".format(random.choice(USER_AGENTS)))
self.driver_ = webdriver.Chrome(executable_path='./libcommon/chromedriver', chrome_options=chrome_options)
elif self.browser == 'Firefox':
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", random.choice(USER_AGENTS))
profile.add_extension('./libcommon/adblock_plus-2.7.1-sm+tb+an+fx.xpi')
profile.set_preference('permissions.default.image', 2)
profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
profile.set_preference("webdriver.load.strategy", "unstable")
self.driver_ = webdriver.Firefox(profile)
elif self.browser == 'PhantomJS':
self.driver_ = webdriver.PhantomJS()
self.driver_.set_window_size(1120, 550)
def close(self):
self.driver_.close()
def return_when(self, condition, locator):
"""
returns browser execution when condition is met
"""
for _ in range(5):
with suppress(Exception):
wait = WebDriverWait(self.driver_, timeout=100, poll_frequency=0.1)
wait.until(condition(locator))
self.driver_.execute_script("return window.stop")
return True
return False
def __getattr__(self, name):
"""
ruby-like method missing: derive methods not implemented to attribute that
holds selenium browser
"""
def _missing(*args, **kwargs):
return getattr(self.driver_, name)(*args, **kwargs)
return _missing
There's two problems I see after looking into this. Forgive any ignorance on my part, because it's been a while since I was last in the Python/Scrapy world.
First: How do we get the HTML from Selenium?
According to the Selenium docs, the driver should have a page_source attribute containing the contents of the page.
browser = Browser(browser='Chrome')
browser.get(url)
html = browser.driver_.page_source
browser.close()
You may want to make this a function in your browser class to avoid accessing browser.driver_ from MatchSpider.
# class Browser
def page_source(self):
return self.driver_.page_source
# end class
browser.get(url)
html = browser.page_source()
Second: How do we override Scrapy's internal web requests?
It looks like Scrapy tries to decouple the behind-the-scenes web requests from the what-am-I-trying-to-parse functionality of each spider you write. start_requests() should "return an iterable with the first Requests to crawl" and make_requests_from_url(url) (which is called if you don't override start_requests()) takes "a URL and returns a Request object". When internally processing a Spider, Scrapy starts creating a plethora of Request objects that will be asynchronously executed and the subsequent Response will be sent to parse(response)...the Spider never actually does the processing from Request to `Response.
Long story short, this means you would need to create middleware for the Scrapy Downloader to use Selenium. Then, you can remove your overridden start_requests() method and add a start_urls attribute. Specifically, your SeleniumDownloaderMiddleware should overwrite the process_request(request, spider) method to use the above Selenium code.

Scrapy with Selenium crawling but not scraping

I have read all the threads on using scrapy for AJAX pages and installed selenium webdrive to simplify the task, my spider can partially crawl but can't get any data into my Items.
My objectives are:
Crawl from this page to this page
Scrape each item(post)'s:
author_name (xpath:/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li[2]/div[2]/span[2]/ul/li[3]/a/text())
author_page_url (xpath:/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li[2]/div[2]/span[2]/ul/li[3]/a/#href)
post_title (xpath://a[#class="title_txt"])
post_page_url (xpath://a[#class="title_txt"]/#href)
post_text (xpath on a separate post page: //div[#id="a_NMContent/text()")
This is my monkey code (since I am only making my first steps in Python as an aspiring natural language processing student, who majored in linguistics in the past):
import scrapy
import time
from selenium import webdriver
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import XPathSelector
class ItalkiSpider(CrawlSpider):
name = "italki"
allowed_domains = ['italki.com']
start_urls = ['http://www.italki.com/entries/korean']
# not sure if the rule is set correctly
rules = (Rule(LxmlLinkExtractor(allow="\entry"), callback = "parse_post", follow = True),)
def __init__(self):
self.driver = webdriver.Firefox()
def parse(self, response):
# adding necessary search parameters to the URL
self.driver.get(response.url+"#language=korean&author-language=russian&marks-min=-5&sort=1&page=1")
# pressing the "Show More" button at the bottom of the search results page to show the next 15 posts, when all results are loaded to the page, the button disappears
more_btn = self.driver.find_element_by_xpath('//a[#id="a_show_more"]')
while more_btn:
more_btn.click()
# sometimes waiting for 5 sec made spider close prematurely so keeping it long in case the server is slow
time.sleep(10)
# here is where the problem begins, I am making a list of links to all the posts on the big page, but I am afraid links will contain only the first link, because selenium doesn't do the multiple selection as one would expect from this xpath...how can I grab all the links and put them in the links list (and should I?)
links=self.driver.find_elements_by_xpath('/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li/div[2]/a')
for link in links:
link.click()
time.sleep(3)
# this is the function for parsing individual posts, called back by the *parse* method as specified in the rule of the spider; if it is correct, it should have saved at least one post into an item... I don't really understand how and where this callback function gets the response from the new page (the page of the post in this case)...is it automatically loaded to drive and then passed on to the callback function as soon as selenium has clicked on the link (link.click())? or is it all total nonsense...
def parse_post(self, response):
hxs = Selector(response)
item = ItalkiItem()
item["post_item"] = hxs.xpath('//div [#id="a_NMContent"]/text()').extract()
return item
Let's think about it a bit differently:
open the page in the browser and click "Show More" until you get to the desired page
initialize a scrapy TextResponse with the current page source (with all necessary posts loaded)
for every post initialize an Item, yield a Request to the post page and pass an item instance from a request to a response in the meta dictionary
Notes and changes I'm introducing:
use a normal Spider class
use Selenium Waits to wait for the "Show More" button to be visible
closing the driver instance in spider_closed signal dispatcher
The code:
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
class ItalkiItem(scrapy.Item):
title = scrapy.Field()
url = scrapy.Field()
text = scrapy.Field()
class ItalkiSpider(scrapy.Spider):
name = "italki"
allowed_domains = ['italki.com']
start_urls = ['http://www.italki.com/entries/korean']
def __init__(self):
self.driver = webdriver.Firefox()
dispatcher.connect(self.spider_closed, signals.spider_closed)
def spider_closed(self, spider):
self.driver.close()
def parse(self, response):
# selenium part of the job
self.driver.get('http://www.italki.com/entries/korean')
while True:
more_btn = WebDriverWait(self.driver, 10).until(
EC.visibility_of_element_located((By.ID, "a_show_more"))
)
more_btn.click()
# stop when we reach the desired page
if self.driver.current_url.endswith('page=52'):
break
# now scrapy should do the job
response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
for post in response.xpath('//ul[#id="content"]/li'):
item = ItalkiItem()
item['title'] = post.xpath('.//a[#class="title_txt"]/text()').extract()[0]
item['url'] = post.xpath('.//a[#class="title_txt"]/#href').extract()[0]
yield scrapy.Request(item['url'], meta={'item': item}, callback=self.parse_post)
def parse_post(self, response):
item = response.meta['item']
item["text"] = response.xpath('//div[#id="a_NMContent"]/text()').extract()
return item
This is something you should use as a base code and improve to fill out all other fields, like author or author_url. Hope that helps.

Failed to crawl element of specific website with scrapy spider

I want to get website addresses of some jobs, so I write a scrapy spider, I want to get all of the value with xpath://article/dl/dd/h2/a[#class="job-title"]/#href, but when I execute the spider with command :
scrapy spider auseek -a addsthreshold=3
the variable "urls" used to preserve values is empty, can someone help me to figure it,
here is my code:
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.conf import settings
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher
from scrapy.exceptions import CloseSpider
from scrapy import log
from scrapy import signals
from myProj.items import ADItem
import time
class AuSeekSpider(CrawlSpider):
name = "auseek"
result_address = []
addressCount = int(0)
addressThresh = int(0)
allowed_domains = ["seek.com.au"]
start_urls = [
"http://www.seek.com.au/jobs/in-australia/"
]
def __init__(self,**kwargs):
super(AuSeekSpider, self).__init__()
self.addressThresh = int(kwargs.get('addsthreshold'))
print 'init finished...'
def parse_start_url(self,response):
print 'This is start url function'
log.msg("Pipeline.spider_opened called", level=log.INFO)
hxs = Selector(response)
urls = hxs.xpath('//article/dl/dd/h2/a[#class="job-title"]/#href').extract()
print 'urls is:',urls
print 'test element:',urls[0].encode("ascii")
for url in urls:
postfix = url.getAttribute('href')
print 'postfix:',postfix
url = urlparse.urljoin(response.url,postfix)
yield Request(url, callback = self.parse_ad)
return
def parse_ad(self, response):
print 'this is parse_ad function'
hxs = Selector(response)
item = ADItem()
log.msg("Pipeline.parse_ad called", level=log.INFO)
item['name'] = str(self.name)
item['picNum'] = str(6)
item['link'] = response.url
item['date'] = time.strftime('%Y%m%d',time.localtime(time.time()))
self.addressCount = self.addressCount + 1
if self.addressCount > self.addressThresh:
raise CloseSpider('Get enough website address')
return item
The problems is:
urls = hxs.xpath('//article/dl/dd/h2/a[#class="job-title"]/#href').extract()
urls is empty when I tried to print it out, I just cant figure out why it doesn't work and how can I correct it, thanks for your help.
Here is a working example using selenium and phantomjs headless webdriver in a download handler middleware.
class JsDownload(object):
#check_spider_middleware
def process_request(self, request, spider):
driver = webdriver.PhantomJS(executable_path='D:\phantomjs.exe')
driver.get(request.url)
return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))
I wanted to ability to tell different spiders which middleware to use so I implemented this wrapper:
def check_spider_middleware(method):
#functools.wraps(method)
def wrapper(self, request, spider):
msg = '%%s %s middleware step' % (self.__class__.__name__,)
if self.__class__ in spider.middleware:
spider.log(msg % 'executing', level=log.DEBUG)
return method(self, request, spider)
else:
spider.log(msg % 'skipping', level=log.DEBUG)
return None
return wrapper
settings.py:
DOWNLOADER_MIDDLEWARES = {'MyProj.middleware.MiddleWareModule.MiddleWareClass': 500}
for wrapper to work all spiders must have at minimum:
middleware = set([])
to include a middleware:
middleware = set([MyProj.middleware.ModuleName.ClassName])
You could have implemented this in a request callback (in spider) but then the http request would be happening twice. This isn't a full proof solution but it works for stuff that loads on .ready(). If you spend some time reading into selenium you can wait for specific event's to trigger before saving page source.
Another example: https://github.com/scrapinghub/scrapyjs
More info: What's the best way of scraping data from a website?
Cheers!
Scrapy does not evaluate Javascript. If you run the following command, you will see that the raw HTML does not contain the anchors you are looking for.
curl http://www.seek.com.au/jobs/in-australia/ | grep job-title
You should try PhantomJS or Selenium instead.
After examining the network requests in Chrome, the job listing appear to have originated from this JSONP request. It should be easy to retrieve whatever you need from it.

Categories