Scrapy - How to avoid Pagination Blackhole?

I was recently working on a website spider and noticed it was requesting an infinite number of pages, because the site hadn't coded its pagination to ever stop.
So while they only had a few pages of content, it would still generate a next link and URLs like ...?page=400, ...?page=401, etc.
The content didn't change, just the URL. Is there a way to make Scrapy stop following pagination when the content stops changing, or is there something I could code up myself?

If the content doesn't change, you can compare the content of the current page with that of the previous page and, if they are the same, break off the crawl.
For example:
import logging
import re

from scrapy import Request


def parse(self, response):
    product_urls = response.xpath("//a/@href").extract()
    # check for the last page: same product links as the previous page
    if response.meta.get('prev_urls') == product_urls:
        logging.info('reached the last page at: {}'.format(response.url))
        return  # reached the last page
    # crawl products
    for url in product_urls:
        yield Request(url, callback=self.parse_product)
    # create the next page url
    next_page = response.meta.get('page', 0) + 1
    next_url = re.sub(r'page=\d+', 'page={}'.format(next_page), response.url)
    # now, for the next page, carry some data in meta
    yield Request(next_url,
                  meta={'prev_urls': product_urls,
                        'page': next_page})
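If comparing the full list of product links is too strict (for instance when pages contain rotating ads or timestamps), a looser variant, not from the original answer, is to fingerprint just the listing section and stop when the fingerprint repeats; the //div[@id="listing"] selector below is a placeholder you would replace with your own.

import hashlib

from scrapy import Request


def parse(self, response):
    # hash only the part of the page we care about (placeholder selector)
    listing_html = ''.join(response.xpath('//div[@id="listing"]').extract())
    fingerprint = hashlib.sha1(listing_html.encode('utf-8')).hexdigest()
    if fingerprint == response.meta.get('prev_fingerprint'):
        return  # identical content to the previous page, stop paginating
    for url in response.xpath('//a/@href').extract():
        yield Request(url, callback=self.parse_product)
    next_page = response.meta.get('page', 1) + 1
    yield Request('{}?page={}'.format(response.url.split('?')[0], next_page),
                  meta={'prev_fingerprint': fingerprint, 'page': next_page})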

Related

How to tell Scrapy to stop crawling the next page when meeting an empty page?

I'm new to Scrapy. Basically, I have some web pages with identical structure to crawl. Each web page is loaded dynamically as you scroll down, and the ajax requests are in the format "https://example.org/ajax/unique_id_for_this_page/page=1/....". When the end of the page is reached, something in the response of the ajax request is empty, so I can stop sending the next ajax request. How can I achieve this with Scrapy? Here is the code I use.
import json
import re

from scrapy import Request, Spider


class WebSpider(Spider):
    base_url = 'https://example.org/ajax/{}/page={}/...'

    def start_requests(self):
        # unique_id for the different web pages that need to be crawled
        unique_ids = ['webpage01', 'webpage02', 'webpage03', ...]
        urls = [self.base_url.format(unique_id, 1) for unique_id in unique_ids]
        for url in urls:
            # send the 1st page's ajax request for each of the web pages to crawl
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # parse the response and decide whether to send the next ajax request
        res = json.loads(response.text)
        if res['data']['list']:
            # the list is not empty: save results and go to the next page
            save_results_to_item()
            current_page_num = re.search(r'page=(\d+)', response.url).group(1)
            next_page_num = int(current_page_num) + 1
            # generate the next ajax request
            next_page_url = response.url.replace(f'page={current_page_num}', f'page={next_page_num}')
            yield Request(next_page_url, self.parse)
Will the code above be able to crawl all the web pages and stop sending new ajax requests for a given web page when res['data']['list'] is empty? I find it difficult to figure out whether it worked. Or is there a better solution? Thanks for any advice!
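Not from the original post, but one way to confirm the stopping behaviour, assuming the same JSON structure as above, is to log explicitly whenever an empty list is seen and return without yielding a new request:

def parse(self, response):
    res = json.loads(response.text)
    if not res['data']['list']:
        # nothing left for this unique_id: log it and stop paginating
        self.logger.info('empty page reached, stopping at %s', response.url)
        return
    # ... otherwise save results and yield the next ajax request as above ...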

Scrapy crawler pagination for indeed website

# -*- coding: utf-8 -*-
import scrapy


class SearchSpider(scrapy.Spider):
    name = 'search'
    allowed_domains = ['www.indeed.com']
    start_urls = ['https://www.indeed.com/jobs?q=data%20analyst&l=united%20states']

    def parse(self, response):
        listings = response.xpath('//*[@data-tn-component="organicJob"]')
        for listing in listings:
            title = listing.xpath('.//a[@data-tn-element="jobTitle"]/@title').extract_first()
            link = listing.xpath('.//h2[@class="title"]//a/@href').extract_first()
            company = listing.xpath('normalize-space(.//span[@class="company"]//a/text())').extract_first()
            yield {'title': title,
                   'link': link,
                   'company': company}
        next_page = response.xpath('//ul[@class="pagination-list"]//a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I am trying to extract all the job titles and companies for every job posting across all the Indeed pages. However, I am stuck, because the forward button on the Indeed page does not have a fixed link my scraper could follow; instead, the next-page URL is the same as the numbered button's, and the numbers at the end change after every request, which keeps me from extracting the next page. I am trying to refrain from using Selenium or Splash, since I want to get my results through Scrapy or Beautiful Soup alone. Any help would be greatly appreciated.
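Not from the original thread, but a common workaround for this kind of search pagination is to build the next-page URL yourself instead of following the button. Assuming the search URL accepts a start offset parameter (e.g. &start=10 for the second page of ten results), a sketch could look like this:

import scrapy


class SearchSpider(scrapy.Spider):
    name = 'search'
    start_urls = ['https://www.indeed.com/jobs?q=data%20analyst&l=united%20states']
    results_per_page = 10  # assumption about the page size

    def parse(self, response):
        # ... yield the job items exactly as in the spider above ...
        listings = response.xpath('//*[@data-tn-component="organicJob"]')
        offset = response.meta.get('offset', 0) + self.results_per_page
        if listings:  # stop once a page comes back with no job cards
            yield scrapy.Request(
                '{}&start={}'.format(self.start_urls[0], offset),
                callback=self.parse,
                meta={'offset': offset},
            )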

How to scrape 2 web pages with the same domain in Scrapy using Python?

Hi guys, I am very new to scraping data; I have tried the basics. But my problem is that I have 2 web pages with the same domain that I need to scrape.
My logic is:
First page: www.sample.com/view-all.html
This page lists all the items, and I need to get the href attribute of every item.
Second page: www.sample.com/productpage.52689.html
This is the link that comes from the first page, so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data, like title, description, etc., on the second page.
What I am thinking of is a for loop, but it's not working on my end. I've been searching on Google but no one has the same problem as mine. Please help me.
import scrapy


class SalesItemSpider(scrapy.Spider):
    name = 'sales_item'
    allowed_domains = ['www.sample.com']
    start_urls = ['http://www.sample.com/view-all.html',
                  'http://www.sample.com/productpage.00001.html']

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            item = {
                'URL': product_item.css('a::attr(href)').extract_first(),
            }
            yield item
Inside parse you can yield a Request() with the url and a callback function's name to scrape that url in a different function:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    # here you parse www.sample.com/productpage.52689.html
    item = {
        'title': ...,
        'description': ...
    }
    yield item
Look for Request in the Scrapy documentation and its tutorial.
There is also
response.follow(url, callback=self.parse_subpage)
which will automatically add www.sample.com to the urls, so you don't have to do it on your own as in
Request(url="www.sample.com/" + url, callback=self.parse_subpage)
See A shortcut for creating Requests.
If you are interested in scraping, then you should read docs.scrapy.org from the first page to the last one.
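As a hedged illustration of that shortcut (reusing the selectors and callback name from the answer above, not code from the original post), the parse loop could use response.follow so relative product links are resolved against the page URL automatically:

def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # response.follow joins relative urls against response.url for us
        yield response.follow(url, callback=self.parse_subpage)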

Scrapy XHR Pagination on TripAdvisor

Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here but since it is over a year old I had to make adjustments to the xpath references.
My current code looks as such:
import scrapy

from tripadvisor.items import TripadvisorItem


class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)
        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)
        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page, but the pagination doesn't flip to the next page of reviews. I suspect this is because of this line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further, I've read that these requests send a POST request using XHR. By exploring the "Network" tab in Firefox's "Inspect" I can see both a Request URL and Form Data that might be needed to flip the page, according to other posts on SO regarding the same topic.
However, it seems that the other posts refer to a static URL starting point when trying to pass a FormRequest using Scrapy. With TripAdvisor, the URL always changes based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page, but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the part of the URL that is -orXX-, where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether or not it is possible to paginate using the XHR request/form data or do I need to manually build a list of URLs for each hotel that adds the -orXX-?
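For reference, here is a minimal sketch of the manual -orXX- approach mentioned above; it is not from the original post and assumes reviews are paged in steps of 5 (as the example URLs suggest) and that the number of pages per hotel is known or estimated:

def build_review_urls(first_review_url, num_pages, step=5):
    # pages 2+ insert "-orXX-" after "Reviews" in the first page's URL
    urls = [first_review_url]
    for offset in range(step, num_pages * step, step):
        urls.append(first_review_url.replace('-Reviews-', '-Reviews-or{}-'.format(offset)))
    return urls

# usage sketch inside a spider callback:
# for url in build_review_urls(hotel_review_url, num_pages=4):
#     yield scrapy.Request(url, callback=self.parse_review)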
Well, I ended up discovering an xpath that apparently allows pagination of the reviews, but it's funny because every time I checked the underlying HTML, the href link never changed from referring to /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html even if I was on page 10, for example. It seems the "-orXX-" part of the link always increments the XX by 5, so I'm not sure why this works.
All I did was change the line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
next_page = response.xpath('//link[@rel="next"]/@href')
and have >41K extracted reviews. Would love to get others' opinions on handling this problem in other situations.

Stop Scrapy spider when it meets a specified URL

This question is very similar to Force my scrapy spider to stop crawling and some others asked several years ago. However, the suggested solutions there are either dated for Scrapy 1.1.1 or not precisely relevant. The task is to close the spider when it reaches a certain URL. You definitely need this when crawling a news website for your media project, for instance.
Among the settings CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT, and CLOSESPIDER_ERRORCOUNT, the item count and page count options come close but are not enough, since you never know the number of pages or items in advance.
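(For illustration only, and not from the original question: these built-in limits are set as spider settings, with the bounds below being hypothetical.)

import scrapy


class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    # built-in close conditions; only helpful if you can estimate safe upper bounds
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 500,    # hypothetical bound
        'CLOSESPIDER_ITEMCOUNT': 10000,  # hypothetical bound
    }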
The raise CloseSpider(reason='some reason') exception seems to do the job but so far it does it in a bit weird way. I follow the “Learning Scrapy” textbook and the structure of my code looks like the one in the book.
In items.py I define the item fields:
class MyProjectItem(scrapy.Item):
    Headline = scrapy.Field()
    URL = scrapy.Field()
    PublishDate = scrapy.Field()
    Author = scrapy.Field()
In myspider.py I use the start_requests() method to give the spider the pages to process, parse each index page in parse(), and specify the XPath for each item field in parse_item():
class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']

    def start_requests(self):
        for i in range(1, 3000):
            yield scrapy.Request('http://domain.name.com/news/index.page' + str(i) + '.html', self.parse)

    def parse(self, response):
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            # The urls are absolute in this case. There's no need to use urllib.parse.urljoin()
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()
If the raise CloseSpider(reason='some reason') exception is placed in parse_item(), the spider still scrapes a number of items before it finally stops:
if l.get_output_value('URL') == 'http://domain.name.com/news/1234567.html':
    raise CloseSpider('No more news items.')
If it is placed in the parse() method to stop when the specific URL is reached, it stops after grabbing only the first item from the index page that contains that specific URL:
def parse(self, response):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'
    urls = response.xpath('XPath for the URLs on index page').extract()
    if most_recent_url_in_db not in urls:
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)
    else:
        for url in urls[:urls.index(most_recent_url_in_db)]:
            yield scrapy.Request(url, callback=self.parse_item)
        raise CloseSpider('No more news items.')
For example, if you have 5 index pages (each of them has 25 item URLs) and most_recent_url_in_db is on page 4, it means that you’ll have all items from pages 1-3 and only the first item from page 4. Then the spider stops. If most_recent_url_in_db is number 10 in the list, items 2-9 from index page 4 won’t appear in your database.
The “hacky” tricks with crawler.engine.close_spider() suggested in many cases or the ones shared in How do I stop all spiders and the engine immediately after a condition in a pipeline is met? don’t seem to work.
What should be the method to properly complete this task?
I'd recommend changing your approach. Scrapy crawls many requests concurrently without a linear order; that's why closing the spider when you find what you're looking for won't do, since a request after that one could already have been processed.
To tackle this you could make Scrapy crawl sequentially, meaning one request at a time in a fixed order. This can be achieved in different ways; here's an example of how I would go about it.
First of all, you should crawl a single page at a time. This could be done like this:
class MyProjectSpider(scrapy.Spider):
    pagination_url = 'http://domain.name.com/news/index.page{}.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1},
        )

    def parse(self, response):
        # code handling item links
        ...
        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if next_page_number <= 3000:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )
Once that's implemented, you could do something similar with the links in each page. However, since you can filter them without downloading their content, you could do something like this:
class MyProjectSpider(scrapy.Spider):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def parse(self, response):
        url_found = False
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            if url == self.most_recent_url_in_db:
                url_found = True
                break
            yield scrapy.Request(url, callback=self.parse_item)

        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )
Putting it all together, you'll have:
class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']
    pagination_url = 'http://domain.name.com/news/index.page{}.html'
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1}
        )

    def parse(self, response):
        url_found = False
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            if url == self.most_recent_url_in_db:
                url_found = True
                break
            yield scrapy.Request(url, callback=self.parse_item)

        page_number = response.meta['page_number']
        next_page_number = page_number + 1
        if next_page_number <= 3000 and not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()
Hope that gives you an idea on how to accomplish what you're looking for, good luck!
When you raise the CloseSpider() exception, the ideal assumption is that Scrapy should stop immediately, abandoning all other activity (any future page requests, any processing in the pipeline, etc.).
But this is not the case. When you raise the CloseSpider() exception, Scrapy will try to close its current operations gracefully, meaning it will stop the current request but wait for any other request pending in any of the queues (there are multiple queues!).
(That is, if you are not overriding the default settings and have more than 16 start urls, Scrapy makes 16 requests at a time.)
Now, if you want to stop the spider as soon as you raise the CloseSpider() exception, you will want to clear three queues:
-- At the spider middleware level --
spider.crawler.engine.slot.scheduler.mqs -> memory queue of future requests
spider.crawler.engine.slot.inprogress -> any in-progress requests
-- At the downloader middleware level --
spider.requests_queue -> pending requests in the request queue
Flush all these queues by overriding the proper middleware to prevent Scrapy from visiting any further pages.
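As a rough sketch of that idea (hedged, because it relies on a spider-level flag rather than the internal queue attributes named above, which are not a stable public API and vary between Scrapy versions), the simplest piece is a spider middleware that swallows any further requests once the spider has flagged that it wants to stop:

from scrapy import Request


class FlushOnStopMiddleware:
    """Spider middleware (enabled via SPIDER_MIDDLEWARES): once the spider sets
    self.stop_crawl = True, e.g. just before raising CloseSpider, no further
    requests coming out of its callbacks are passed on to the scheduler."""

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, Request) and getattr(spider, 'stop_crawl', False):
                continue  # drop new requests; scraped items still pass through
            yield element

Requests already sitting in the scheduler or downloader queues would still need to be cleared separately, which is where the internal attributes listed above come in.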
