I'm using Scrapy to scrape data from this site. I need to call getlink from parse. A normal call does not work, and when I use yield I get this error:

2015-11-16 10:12:34 [scrapy] ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/>

Returning the getlink function from parse works, but I need to execute some code even after returning. I'm confused; any help would be really appreciated.
# -*- coding: utf-8 -*-
from scrapy.spiders import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request, Response
import re
import csv
import time
from selenium import webdriver

class ColdWellSpider(BaseSpider):
    name = "cwspider"
    allowed_domains = ["coldwellbankerhomes.com"]
    #start_urls = [''.join(row).strip() for row in csv.reader(open("remaining_links.csv"))]
    #start_urls = ['https://www.coldwellbankerhomes.com/fl/boynton-beach/5451-verona-drive-unit-d/pid_9266204/']
    start_urls = ['https://www.coldwellbankerhomes.com/fl/miami-dade-county/kvc-17_1,17_3,17_2,17_8/incl-22/']

    def parse(self, response):
        #browser = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--load-images=false'])
        browser = webdriver.Firefox()
        browser.maximize_window()
        browser.get(response.url)
        time.sleep(5)

        #to extract all the links from a page and send request to those links
        #this works but even after returning i need to execute the while loop
        return self.getlink(response)

        #for clicking the load more button in the page
        while True:
            try:
                browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
                time.sleep(3)
                self.getlink(response)
            except:
                break

    def getlink(self, response):
        print 'hhelo'
        c = open('data_getlink.csv', 'a')
        d = csv.writer(c, lineterminator='\n')
        print 'hello2'
        listclass = response.xpath('//div[@class="list-items"]/div[contains(@id,"snapshot")]')
        for l in listclass:
            link = 'http://www.coldwellbankerhomes.com/' + ''.join(l.xpath('./h2/a/@href').extract())
            d.writerow([link])
            yield Request(url=str(link), callback=self.parse_link)

    #callback function of Request
    def parse_link(self, response):
        b = open('data_parselink.csv', 'a')
        a = csv.writer(b, lineterminator='\n')
        a.writerow([response.url])
Spider must return Request, BaseItem, dict or None, got 'generator'
getlink() is a generator. You are trying to yield it from the parse() generator.
Instead, you can/should iterate over the results of getlink() call:
def parse(self, response):
    browser = webdriver.Firefox()
    browser.maximize_window()
    browser.get(response.url)
    time.sleep(5)

    while True:
        try:
            for request in self.getlink(response):
                yield request

            browser.find_element_by_class_name('search-results-load-more').find_element_by_tag_name('a').click()
            time.sleep(3)
        except:
            break
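The underlying Python behaviour can be seen without Scrapy at all: yielding a generator hands back the whole generator object as a single value, while iterating it re-yields its items one by one.

```python
def inner():
    yield 1
    yield 2

def bad_outer():
    # yields the generator object itself as a single value --
    # this is exactly what Scrapy is complaining about
    yield inner()

def good_outer():
    # re-yields each item that inner() produces
    for x in inner():
        yield x
```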
Also, I've noticed you have both self.getlink(response) and self.getlink(browser). The latter is not going to work since there is no xpath() method on a webdriver instance - you probably meant to make a Scrapy Selector out of the page source that your webdriver-controlled browser has loaded, for example:
selector = scrapy.Selector(text=browser.page_source)
self.getlink(selector)
You should also take a look at Explicit Waits with Expected Conditions instead of using unreliable and slow artificial delays via time.sleep().
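For example, the load-more click could be sketched with an explicit wait (the selector comes from your code; the 10-second timeout is an arbitrary choice):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the "load more" link is actually clickable instead of
# sleeping for a fixed number of seconds.
wait = WebDriverWait(browser, 10)
load_more = wait.until(
    EC.element_to_be_clickable(
        (By.CSS_SELECTOR, '.search-results-load-more a')))
load_more.click()
```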
Plus, I'm not sure why you are writing to CSV manually instead of using the built-in Scrapy Items and Item Exporters. Also, you are not closing the files properly, and you're not using the with context manager either.
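For instance, a minimal sketch (LinkItem is a made-up name) that lets Scrapy's exporters write the CSV, plus a with block for any file you do open yourself:

```python
import scrapy

class LinkItem(scrapy.Item):
    link = scrapy.Field()

# in getlink(), instead of writing CSV rows by hand:
#     yield LinkItem(link=link)
# then export with:  scrapy crawl cwspider -o links.csv

# and where you really do need a file, let the context manager close it:
# with open('data_parselink.csv', 'a') as f:
#     csv.writer(f, lineterminator='\n').writerow([response.url])
```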
Additionally, try to catch more specific exception(s) and avoid having a bare try/except block.
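For example, the click loop could catch only the exceptions you actually expect (assuming the button disappears when there are no more results):

```python
from selenium.common.exceptions import NoSuchElementException, WebDriverException

while True:
    try:
        browser.find_element_by_class_name(
            'search-results-load-more').find_element_by_tag_name('a').click()
        time.sleep(3)
    except (NoSuchElementException, WebDriverException):
        # no "load more" button left (or it became unclickable) - we're done
        break
```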
OS: Ubuntu 16.04
Stack - Scrapy 1.0.3 + Selenium
I'm pretty new to Scrapy and this might sound very basic, but in my spider only __init__ is executed. Any code/function after that is not called and the spider just halts.
class CancerForumSpider(scrapy.Spider):
    name = "mainpage_spider"
    allowed_domains = ["cancerforums.net"]
    start_urls = [
        "http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"
    ]

    def __init__(self, *args, **kwargs):
        self.browser = webdriver.Firefox()
        self.browser.get("http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum")
        print "----------------Going to sleep------------------"
        time.sleep(5)
        # self.parse()

    def __exit__(self):
        print "------------Exiting----------"
        self.browser.quit()

    def parse(self, response):
        print "----------------Inside Parse------------------"
        print "------------Exiting----------"
        self.browser.quit()
The spider gets the browser object, prints "Going to sleep" and just halts. It doesn't go inside the parse function.
Following are the contents of the run logs:
----------------inside init----------------
----------------Going to sleep------------------
There are a few problems you need to address or be aware of:
You're not calling super() in the __init__ method, so none of the inherited class's initialization happens. Scrapy won't do anything (like calling its parse() method), as all of that is set up in scrapy.Spider.
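The fix is to chain up, e.g. super(CancerForumSpider, self).__init__(*args, **kwargs) as the first line of __init__. The effect of skipping the base initializer can be seen with plain Python classes:

```python
class Base(object):
    def __init__(self, name=None):
        # framework-level setup, analogous to what scrapy.Spider.__init__ does
        self.name = name

class BadSpider(Base):
    def __init__(self):
        # never calls super().__init__, so Base's setup never runs
        self.browser = "fake-driver"

class GoodSpider(Base):
    def __init__(self):
        # let the base class do its setup first
        super(GoodSpider, self).__init__(name="good")
        self.browser = "fake-driver"
```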
After fixing the above, your parse() method will be called by Scrapy, but it won't be operating on your Selenium-fetched webpage. It will have no knowledge of it whatsoever, and will re-fetch the url (based on start_urls). It's very likely that these two sources will differ (often drastically).
You're going to be bypassing almost all of Scrapy's functionality using Selenium the way you are. All of Selenium's get()'s will be executed outside of the Scrapy framework. Middleware won't be applied (cookies, throttling, filtering, etc.) nor will any of the expected/created objects (like request and response) be populated with the data you expect.
Before you fix all of that, you should consider a couple of better options/alternatives:
Create a downloader middleware that handles all "Selenium" related functionality. Have it intercept request objects right before they hit the downloader, populate a new response objects and return them for processing by the spider.
This isn't optimal, as you're effectively creating your own downloader, and short-circuiting Scrapy's. You'll have to re-implement the handling of any desired settings the downloader usually takes into account and make them work with Selenium.
Ditch Selenium and use the Splash HTTP rendering service and the scrapy-splash middleware for handling JavaScript.
Ditch Scrapy all together and just use Selenium and BeautifulSoup.
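For the scrapy-splash option, a minimal sketch (assuming Splash is running and the scrapy-splash middlewares are enabled in settings.py; the spider name here is made up):

```python
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_spider"
    start_urls = ["http://www.cancerforums.net/forums/14-Prostate-Cancer-Forum"]

    def start_requests(self):
        for url in self.start_urls:
            # Splash renders the page (including JavaScript) before
            # the response ever reaches parse()
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        self.logger.info("got rendered page: %s", response.url)
```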
Scrapy is useful when you have to crawl a large number of pages. Selenium is normally useful for scraping when you need the DOM source after the JavaScript has loaded. If that's your situation, there are two main ways to combine Selenium and Scrapy. One is to write a download handler, like the one you can find here.
The code goes as:
# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't response when switch window until page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()
Suppose your scraper is called "scraper". If you put the mentioned code inside a file called handlers.py in the root of the "scraper" folder, then you could add to your settings.py:
DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}
Another way is to write a download middleware, as described here. The download middleware has the downside of preventing some key features from working out of the box, such as cache and retries.
In any case, starting the Selenium webdriver at the init of the Scrapy spider is not the usual way to go.
I want to crawl this site by clicking the next button (which changes the page number of the website), and keep crawling until the last page number.
I've tried combining Scrapy with Selenium, but it still fails with:

line 22
    self.driver = webdriver.Firefox()
    ^
IndentationError: expected an indented block

I don't know why this happens; I thought my code was fine. Can anybody resolve this problem?
This is my source:
from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from now.items import NowItem

class MySpider(BaseSpider):
    name = "nowhere"
    allowed_domains = ["n0where.net"]
    start_urls = ["https://n0where.net/"]

    def parse(self, response):
        for article in response.css('.loop-panel'):
            item = NowItem()
            item['title'] = article.css('.article-title::text').extract_first()
            item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
            item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
            #item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse2(self, response):
        self.driver.get(response.url)
        while True:
            next = self.driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
            try:
                next.click()
                # get the data and write it to scrapy items
            except:
                break
        self.driver.close()
Ignoring the syntax and indentation errors you have an issue with your code logic in general.
What you do is create webdriver and never use it. What your spider does here is:
Create webdriver object.
Schedule a request for every url in self.start_urls; in your case it's only one.
Download it, build a Response object and pass it to self.parse().
Your parse method finds some selectors and builds some items, so Scrapy yields any items that were found.
Done
Your parse2 was never called and so your selenium webdriver was never used.
Since you are not using Scrapy to download anything in this case, you can just override the start_requests() method of your spider (that's where your spider starts) and put the whole logic there.
Something like:
from selenium import webdriver
import scrapy
from scrapy import Selector

class MySpider(scrapy.Spider):
    name = "nowhere"
    allowed_domains = ["n0where.net"]
    start_url = "https://n0where.net/"

    def start_requests(self):
        driver = webdriver.Firefox()
        driver.get(self.start_url)
        while True:
            next_url = driver.find_element_by_xpath(
                '/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
            try:
                # parse the body your webdriver has; parse() is a
                # generator, so iterate it to actually produce the items
                for item in self.parse(driver.page_source):
                    yield item
                # click the button to go to next page
                next_url.click()
            except:
                break
        driver.close()

    def parse(self, body):
        # create Selector from html string
        sel = Selector(text=body)
        # parse it
        for article in sel.css('.loop-panel'):
            item = dict()
            item['title'] = article.css('.article-title::text').extract_first()
            item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
            item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
            # item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item
This is an indentation error. Look at the lines near the error:
def parse2(self, response):
self.driver.get(response.url)
The first of these two lines ends with a colon. So, the second line should be more indented than the first one.
There are two possible fixes, depending on what you want to do. Either add an indentation level to the second one:
def parse2(self, response):
    self.driver.get(response.url)
Or move the parse2 function out of the __init__ function:
def parse2(self, response):
    self.driver.get(response.url)

def __init__(self):
    self.driver = webdriver.Firefox()
    # etc.
Can you help me correct this script? I have a list of search-result links and I want to visit and crawl each one of them.
But this script clicks just the first link, and then my crawler stops.
Any help is appreciated.
Spider code:
from scrapy.contrib.spiders import CrawlSpider
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.support.select import Select
from time import sleep
import selenium.webdriver.support.ui as ui
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import HtmlResponse, TextResponse
from extraction.items import ProduitItem
from scrapy import log

class RunnerSpider(CrawlSpider):
    name = 'products_d'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        sel = Selector(response)
        self.driver.get(response.url)

        recherche = self.driver.find_element_by_xpath('//*[@id="twotabsearchtextbox"]')
        recherche.send_keys("A")
        recherche.submit()

        resultat = self.driver.find_element_by_xpath('//ul[@id="s-results-list-atf"]')
        #Links
        resultas = resultat.find_elements_by_xpath('//li/div[@class="s-item-container"]/div/div/div[2]/div[1]/a')
        links = []
        for lien in resultas:
            l = lien.get_attribute('href')
            links.append(l)

        for result in links:
            item = ProduitItem()
            link = result
            self.driver.get(link)
            item['URL'] = link
            item['Title'] = self.driver.find_element_by_xpath('//h1[@id="aiv-content-title"]').text
            yield item

        self.driver.close()
So there are a few issues with your script.
1) Your parse function overrides CrawlSpider's implementation of the same function. That means that CrawlSpider's default behaviour, which is in charge of extracting links from the page for continued crawling, is not being called. That's not recommended when using CrawlSpider. See here for details:
http://doc.scrapy.org/en/latest/topics/spiders.html
2) You don't yield any followup URLs yourself. You only yield Items. If you want Scrapy to keep processing URLs, you have to yield some form of Request object alongside your items.
3) You kill Selenium's driver at the end of the parse function. That will probably cause it to fail on a followup call anyway. There's no need to do that.
4) You're using Selenium & Scrapy's URL grabbing concurrently. That's not necessarily wrong, but keep in mind that it might result in some erratic behaviour.
5) Your script's indentation is definitely off, which makes your code difficult to read.
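On point 2, a sketch of yielding follow-up requests alongside your items (the CSS selector and the parse_product callback are hypothetical):

```python
def parse(self, response):
    # ... yield your items as before ...
    for href in response.css('a.s-access-detail-page::attr(href)').extract():
        # hand each product link back to Scrapy so the crawl continues
        yield Request(response.urljoin(href), callback=self.parse_product)
```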
I am trying to use Selenium to obtain the value of the selected option from a drop-down list in a Scrapy spider, but I am unsure of how to go about it. It's my first interaction with Selenium.
As you can see in the code below, I create a request in the parse function which calls parse_page as a callback. In parse_page I want to extract the value of the selected option. I can't figure out how to attach a webdriver to the response page passed into parse_page so that I can use it with Select. I have written obviously wrong code below :(
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.exceptions import CloseSpider
import logging
import scrapy
from scrapy.utils.response import open_in_browser
from scrapy.http import FormRequest
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from activityadvisor.items import TruYog

logging.basicConfig()
logger = logging.getLogger()

class TrueYoga(Spider):
    name = "trueyoga"
    allowed_domains = ["trueyoga.com.sg", "trueclassbooking.com.sg"]
    start_urls = [
        "http://trueclassbooking.com.sg/frames/class-schedules.aspx",
    ]

    def parse(self, response):
        clubs = []
        clubs = Selector(response).xpath('//div[@class="club-selections"]/div/div/div/a/@rel').extract()
        clubs.sort()
        print 'length of clubs = ', len(clubs), '1st content of clubs = ', clubs
        req = []
        for club in clubs:
            payload = {'ctl00$cphContents$ddlClub': club}
            req.append(FormRequest.from_response(response, formdata=payload, dont_click=True, callback=self.parse_page))
        for request in req:
            yield request

    def parse_page(self, response):
        driver = webdriver.Firefox()
        driver.get(response)
        clubSelect = Select(driver.find_element_by_id("ctl00_cphContents_ddlClub"))
        option = clubSelect.first_selected_option
        print option.text
Is there any way to obtain this option value in scrapy without using Selenium? My search on google and stackoverflow didn't yield any useful answers so far.
Thanks for help!
I would recommend using Downloader Middleware to pass the Selenium response over to your spider's parse method. Take a look at the example I wrote as an answer to another question.
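In outline, such a middleware might look like this (a minimal sketch: a single Firefox driver, no error handling or driver shutdown):

```python
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        # fetch the page with Selenium and hand the rendered HTML
        # back to Scrapy as an ordinary response
        self.driver.get(request.url)
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )
```

Returning a Response from process_request short-circuits the download, so parse_page receives the Selenium-rendered page and can use plain response.xpath() on it.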
In the response you get, there are the select boxes with their options. One of those options has the attribute selected="selected". I think you should go through this attribute and avoid using Selenium altogether:
def parse_page(self, response):
    response.xpath("//select[@id='ctl00_cphContents_ddlClub']//option[@selected = 'selected']").extract()
I've been trying to build a small scraper for ebay (college assignment). I already figured out most of it, but I ran into an issue with my loop.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from loop.items import loopitems

class myProjectSpider(CrawlSpider):
    name = 'looper'
    allowed_domains = ['ebay.com']
    start_urls = [l.strip() for l in open('bobo.txt').readlines()]

    def __init__(self):
        service_args = ['--load-images=no']
        self.driver = webdriver.PhantomJS(executable_path='/Users/localhost/desktop/.bin/phantomjs.cmd', service_args=service_args)

    def parse(self, response):
        self.driver.get(response.url)
        item = loopitems()
        for abc in range(2, 50):
            abc = str(abc)
            jackson = self.driver.execute_script("return !!document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;")
            if jackson == True:
                item['title'] = self.driver.execute_script("return document.evaluate('.//div[5]/div[2]/select/option[" + abc + "]', document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue.textContent;")
                yield item
            else:
                break
The urls (start_urls are read from a txt file):
http://www.ebay.com/itm/Mens-Jeans-Slim-Fit-Straight-Skinny-Fit-Denim-Trousers-Casual-Pants-14-color-/221560999664?pt=LH_DefaultDomain_0&var=&hash=item3396108ef0
http://www.ebay.com/itm/New-Apple-iPad-3rd-Generation-16GB-32GB-or-64GB-WiFi-Retina-Display-Tablet-/261749018535?pt=LH_DefaultDomain_0&var=&hash=item3cf1750fa7
I'm running scrapy version 0.24.6 and phantomjs version 2.0. The objective is to go to the urls and extract the variations or attributes from the ebay form.
The if statement right at the beginning of the loop is used to check if the element exists because selenium returns a bad header error if it can't find an element. I also loop the (yield item) because I need each variation on a new row. I use execute_script because it is 100 times faster than using seleniums get element by xpath.
The main problem I run into is the way Scrapy returns my item results: if I use one url as my start_url it works as it should (it returns all items in a neat order). The second I add more urls, I get a completely different result: all my items are scrambled, some items are returned multiple times, and it varies almost every time. After countless tests I noticed that yield item is causing some kind of problem; when I removed it and just printed the results, they came out perfectly. I really need each item on a new row though, and the only way I've managed that is by using yield item (maybe there's a better way?).
As of now I have just copy-pasted the looped code, changing the xpath option manually. It works as expected, but I really need to be able to loop through items in the future. If someone sees an error in my code or a better way to do it, please tell me. All responses are helpful...
Thanks
If I correctly understood what you want to do, I think this one could help you.
Scrapy Crawl URLs in Order
The problem is that start_urls are not processed in order. They are passed to the start_requests method, and each downloaded response is then passed to the parse method. This is asynchronous.
Maybe this helps
#Do your thing
start_urls = [open('bobo.txt').readlines()[0].strip()]
other_urls = [l.strip() for l in open('bobo.txt').readlines()[1:]]
other_urls.reverse()

#Do your thing
def parse(self, response):
    #Do your thing
    if len(self.other_urls) != 0:
        url = self.other_urls.pop()
        yield Request(url=url, callback=self.parse)