This question is very similar to Force my scrapy spider to stop crawling and a few others asked several years ago. However, the solutions suggested there are either outdated as of Scrapy 1.1.1 or not precisely relevant. The task is to close the spider once it reaches a certain URL. You definitely need this when crawling a news website for a media project, for instance.
Among the settings CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ERRORCOUNT, the item-count and page-count options come close but are not enough, since you never know the number of pages or items in advance.
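For reference, these are ordinary settings that go into settings.py (or a spider's custom_settings); the threshold values below are purely illustrative, not taken from my project:
# settings.py -- thresholds read by the built-in CloseSpider extension
# (the numbers here are illustrative only)
CLOSESPIDER_TIMEOUT = 3600     # stop after this many seconds
CLOSESPIDER_ITEMCOUNT = 1000   # stop after this many scraped items
CLOSESPIDER_PAGECOUNT = 100    # stop after this many crawled responses
CLOSESPIDER_ERRORCOUNT = 10    # stop after this many errors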
Raising the CloseSpider(reason='some reason') exception seems to do the job, but so far it does it in a slightly odd way. I follow the “Learning Scrapy” textbook, and the structure of my code mirrors the one in the book.
In items.py I make a list of items:
class MyProjectItem(scrapy.Item):
    Headline = scrapy.Field()
    URL = scrapy.Field()
    PublishDate = scrapy.Field()
    Author = scrapy.Field()
    pass
In myspider.py, the start_requests() method generates the index pages for the spider to process, parse() handles each index page, and parse_item() specifies the XPath for each item field:
class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']

    def start_requests(self):
        for i in range(1, 3000):
            yield scrapy.Request('http://domain.name.com/news/index.page' + str(i) + '.html', self.parse)

    def parse(self, response):
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            # The urls are absolute in this case. There’s no need to use urllib.parse.urljoin()
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()
If the raise CloseSpider(reason='some reason') exception is placed in parse_item(), the spider still scrapes a number of items before it finally stops:
if l.get_output_value('URL') == 'http://domain.name.com/news/1234567.html':
    raise CloseSpider('No more news items.')
If it’s placed in the parse() method to stop when a specific URL is reached, the spider stops after grabbing only the first item from the index page that contains that URL:
def parse(self, response):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'
    urls = response.xpath('XPath for the URLs on index page').extract()
    if most_recent_url_in_db not in urls:
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)
    else:
        for url in urls[:urls.index(most_recent_url_in_db)]:
            yield scrapy.Request(url, callback=self.parse_item)
        raise CloseSpider('No more news items.')
For example, if you have 5 index pages (each with 25 item URLs) and most_recent_url_in_db is on page 4, you’ll get all items from pages 1-3 but only the first item from page 4 before the spider stops. If most_recent_url_in_db is number 10 in the list, items 2-9 from index page 4 won’t appear in your database.
The “hacky” tricks with crawler.engine.close_spider() suggested in many answers, or the ones shared in How do I stop all spiders and the engine immediately after a condition in a pipeline is met?, don’t seem to work either.
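(For context, the commonly suggested pipeline-level trick looks roughly like this; the pipeline class name and the stop condition are placeholders, and this is the kind of approach that didn't work for me:)
class MyProjectPipeline(object):
    """Sketch of the frequently suggested engine.close_spider() trick."""

    @classmethod
    def from_crawler(cls, crawler):
        # keep a reference to the crawler so the engine is reachable later
        pipeline = cls()
        pipeline.crawler = crawler
        return pipeline

    def process_item(self, item, spider):
        # placeholder stop condition: the most recent URL already in the DB
        if item.get('URL') == 'http://domain.name.com/news/1234567.html':
            self.crawler.engine.close_spider(spider, 'No more news items.')
        return item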
What is the proper way to complete this task?
I'd recommend changing your approach. Scrapy processes many requests concurrently and in no particular order, so closing the spider the moment you find what you're looking for won't do: a request past that point may already have been processed.
To tackle this, you could make Scrapy crawl sequentially, meaning one request at a time in a fixed order. This can be achieved in different ways; here's an example of how I would go about it.
First of all, crawl a single index page at a time. This could be done like this:
class MyProjectSpider(scrapy.Spider):
    pagination_url = 'http://domain.name.com/news/index.page{}.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1},
        )

    def parse(self, response):
        # code handling item links
        ...

        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if next_page_number <= 3000:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )
Once that's implemented, you can do something similar with the links on each page. And since you can filter them without downloading their content, something like this would work:
class MyProjectSpider(scrapy.Spider):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def parse(self, response):
        url_found = False
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            if url == self.most_recent_url_in_db:
                url_found = True
                break
            yield scrapy.Request(url, callback=self.parse_item)

        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )
Putting it all together, you'll have:
class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']
    pagination_url = 'http://domain.name.com/news/index.page{}.html'
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1},
        )

    def parse(self, response):
        url_found = False
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            if url == self.most_recent_url_in_db:
                url_found = True
                break
            yield scrapy.Request(url, callback=self.parse_item)

        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if next_page_number <= 3000 and not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()
Hope that gives you an idea of how to accomplish what you're looking for, good luck!
When you raise the CloseSpider exception, the ideal assumption is that Scrapy should stop immediately, abandoning all other activity (any future page requests, any processing in the pipelines, etc.).
But this is not the case. When you raise the CloseSpider exception, Scrapy tries to close its current operations gracefully, meaning it stops the current request but still works through the requests already pending in any of its queues (and there are multiple queues!).
(For instance, if you are not overriding the default settings and have more than 16 start URLs, Scrapy makes 16 requests at a time.)
Now, if you want the spider to stop as soon as you raise the CloseSpider exception, you will want to clear three queues:
-- At the spider middleware level --
spider.crawler.engine.slot.scheduler.mqs -> memory queue of future requests
spider.crawler.engine.slot.inprogress -> any in-progress requests
-- At the downloader middleware level --
spider.requests_queue -> pending requests in the request queue
Flush all of these queues by overriding the appropriate middleware to prevent Scrapy from visiting any further pages.
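A minimal sketch of the idea, assuming a custom spider middleware registered in SPIDER_MIDDLEWARES: rather than flushing the internal queues directly (the attributes above are internal and can differ between Scrapy versions), it simply stops forwarding new requests once the spider sets a should_stop flag before raising CloseSpider. The class, module path and attribute names are illustrative, not part of Scrapy itself.
# myproject/middlewares.py (illustrative module path)
import scrapy

class StopCrawlSpiderMiddleware(object):
    """Drop any further Requests once spider.should_stop is set."""

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, scrapy.Request) and getattr(spider, 'should_stop', False):
                continue  # discard the request instead of letting it reach the scheduler
            yield element  # items (and requests yielded before the flag is set) pass through

# settings.py (illustrative)
# SPIDER_MIDDLEWARES = {'myproject.middlewares.StopCrawlSpiderMiddleware': 543}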
Related
This problem is starting to frustrate me very much as I feel like I have no clue how scrapy works and that I can't wrap my head around the documentation.
My question is simple. I have the most standard of spiders.
class MySpider(scrapy.Spider):

    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        test = scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        site_number_of_pages = int(response.xpath(..))
        return site_number_of_pages
I just want to somehow get the number of pages from the parse function back into the start_requests function, so I can start a for loop to go through all the pages on the website using the same parse function again. The code above only illustrates the principle and would not work in practice: the variable test would be a Request object, not the plain integer I want.
How would I accomplish what I am trying to do?
EDIT:
This is what I have tried up to now:
class MySpider(scrapy.Spider):

    def start_requests(self):
        header = ..
        url = ..
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        site_number_of_pages = int(response.xpath(..))
        for count in range(2, site_number_of_pages):
            url = url + str(count)
            yield scrapy.Request(url=url, callback=self.parse, headers=header)
Scrapy is an asynchronous framework. There is no way to return a value back to start_urls (or start_requests) - only Requests followed by their callbacks.
In the general case, if Requests are generated as the result of parsing some response (in your case, site_number_of_pages from the first URL), that is no longer start_requests.
The easiest thing you can do in this case is to yield the requests from the parse method.
def parse(self, response):
    site_number_of_pages = int(response.xpath(..))
    for i in range(site_number_of_pages):
        ...
        yield Request(url=...
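A slightly fuller sketch of the same idea, keeping the question's placeholders for the URL and XPath (so it is not literally runnable as-is); the names base_url and parse_page are made up for illustration:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    base_url = "www.website.whatever/search/filter1=.../&page={}"  # placeholder from the question

    def start_requests(self):
        # request only the first page here; the rest are scheduled from parse()
        yield scrapy.Request(url=self.base_url.format(1), callback=self.parse)

    def parse(self, response):
        site_number_of_pages = int(response.xpath(..))  # placeholder XPath, as in the question
        for count in range(2, site_number_of_pages + 1):
            yield scrapy.Request(url=self.base_url.format(count), callback=self.parse_page)

    def parse_page(self, response):
        # extract the actual data from each results page here
        ...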
Instead of grabbing the number of pages and looping through all of them, I grabbed the "next page" feature of the web page. So every time self.parse is invoked, it grabs the next page and calls itself again. This goes on until there is no next page, at which point it simply errors out.
class MySpider(scrapy.Spider):

    def start_requests(self):
        header = ..
        url = "www.website.whatever/search/filter1=.../&page=1"
        yield scrapy.Request(url=url, callback=self.parse, headers=header)

    def parse(self, response):
        header = ..
        ..
        next_page = response.xpath(..)
        url = "www.website.whatever/search/filter1=.../&page=" + next_page
        yield scrapy.Request(url=url, callback=self.parse, headers=header)
So I'm trying to write a spider to continue clicking a next button on a webpage until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page but prints it only once. My question is why isn't it "following" the links that each next button leads to?
class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])
To go to the next page, instead of just printing the link you need to yield a scrapy.Request object, as in the following code:
import scrapy


class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        posts = response.xpath('//div[@class="top-matter"]')
        for post in posts:
            # Get your data here
            title = post.xpath('p[@class="title"]/a/text()').extract()
            print(title)

        # Go to next page
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Update: the previous code was wrong; it needed to use the absolute URL, and some of the XPaths were also wrong. This new version should work.
Hope it helps!
I followed the documentation,
but I am still not able to crawl multiple pages.
My code looks like this:
def parse(self, response):
    for thing in response.xpath('//article'):
        item = MyItem()
        request = scrapy.Request(link,
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request

def parse_detail(self, response):
    print "here\n"
    item = response.meta['item']
    item['test'] = "test"
    yield item
Running this code does not call the parse_detail function and does not crawl any data. Any idea? Thanks!
I find that if I comment out allowed_domains it works. But that doesn't make sense, because link definitely belongs to allowed_domains.
I am trying to scrape multiple webpages using scrapy. The link of the pages are like:
http://www.example.com/id=some-number
On each subsequent page, the number at the end is reduced by 1.
So I am trying to build a spider which navigates to the other pages and scrapes them too. The code that I have is given below:
import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500


class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range(starting_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**
This runs into an infinite loop, and when printing the page number I get negative numbers. I think that is happening because I am requesting a page from within my parse() function.
But then the example given here works okay. Where am I going wrong?
The first page requested is "http://www.example.com/id=1000" (starting_number)
Its response goes through parse(), and with for i in range(0, 500):
you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997...http://www.example.com/id=500
self.page_number is a spider attribute, so when you're decrementing its value, you have self.page_number == 500 after the first parse().
So when Scrapy calls parse for the response of http://www.example.com/id=999, you're generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497...http://www.example.com/id=0
You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2...http://www.example.com/id=-500
For each response, you're generating 500 requests.
You can stop the loop by testing self.page_number >= 0
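In the version of parse() that decrements a self.page_number attribute, that guard could look roughly like this (a sketch only; the data-extraction code is elided):
def parse(self, response):
    # ... parse data from the webpage ...
    next_id = self.page_number - 1
    if next_id >= 0:                      # the guard suggested above
        self.page_number = next_id
        yield Request(url=URL % next_id, callback=self.parse)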
Edit after OP question in comments:
No need for multiple threads; Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting one page and then returning Request instances in the parse method).
Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.
See the start_requests documentation.
Something like this would work:
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**
I am getting confused about how to design the architecture of my crawler.
I have a search page with:
pagination: next page links to follow
a list of products on one page
individual links to be crawled to get the description
I have the following code:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    items = []
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['profile_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for request, and one for yield.
In the CrawlSpider I didn't need the last yield, so everything worked fine, but with BaseSpider I have to follow links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
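For reference, the behaviour you'd be re-implementing looks roughly like this when expressed with CrawlSpider rules (a sketch using current import paths; the start URL and XPaths are placeholders based on the snippets above):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ProductSpider(CrawlSpider):
    name = 'products'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/search']  # placeholder

    rules = (
        # follow the "Next" pagination links
        Rule(LinkExtractor(restrict_xpaths='//a[text()="Next"]'), follow=True),
        # crawl each product link found in the result set and parse its description
        Rule(LinkExtractor(restrict_xpaths="//ol[@id='result-set']/li"),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # extract the product description here
        ...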