I'm scraping this website: https://www.olx.com.ar/celulares-telefonos-cat-831 with Scrapy 1.4.0. When I run the spider everything goes well until it gets to the "next-page" part. Here's the code:
# -*- coding: utf-8 -*-
import scrapy
#import time


class OlxarSpider(scrapy.Spider):
    name = "olxar"
    allowed_domains = ["olx.com.ar"]
    start_urls = ['https://www.olx.com.ar/celulares-telefonos-cat-831']

    def parse(self, response):
        #time.sleep(10)
        response = response.replace(body=response.body.replace('<br>', ''))
        SET_SELECTOR = '.item'
        for item in response.css(SET_SELECTOR):
            PRODUCTO_SELECTOR = '.items-info h3 ::text'
            yield {
                'producto': item.css(PRODUCTO_SELECTOR).extract_first().replace(',', ' '),
            }
        NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first().replace('//', 'https://')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse
                                 )
I've seen in other questions that some people added the dont_filter=True attribute to the Request, but that doesn't work for me. It just makes the spider loop over the first 2 pages.
I've added the replace('//','https://') part to fix the original href that comes without https: and can't be followed by Scrapy.
Also, when I run the spider it scrapes the first page and then returns [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.olx.com.ar/celulares-telefonos-cat-831-p-2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
Why is it filtering the second page as a duplicate when apparently it is not?
I applied Tarun Lalwani's solution from the comments. I had missed that detail! It worked fine with the correction, thank you!
Your problem is the CSS selector. On page 1 it matches only the next-page link. On page 2 it matches both the previous-page link and the next-page link. Out of those you pick the first one using extract_first(), so you just rotate between the first and second page.
The solution is simple: you need to change the CSS selector
NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'
to
NEXT_PAGE_SELECTOR = '.items-paginations-buttons a.next::attr(href)'
This will identify only the next-page URL.
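For reference, a minimal sketch of how the corrected next-page block could look inside parse(), assuming the rest of the method stays as in the question:

NEXT_PAGE_SELECTOR = '.items-paginations-buttons a.next::attr(href)'
next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
if next_page:
    # response.urljoin() resolves a protocol-relative href ("//www.olx...")
    # against the response's scheme, so the manual replace('//', 'https://')
    # from the question should not be needed
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)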
# -*- coding: utf-8 -*-
import scrapy


class SearchSpider(scrapy.Spider):
    name = 'search'
    allowed_domains = ['www.indeed.com/']
    start_urls = ['https://www.indeed.com/jobs?q=data%20analyst&l=united%20states']

    def parse(self, response):
        listings = response.xpath('//*[@data-tn-component="organicJob"]')
        for listing in listings:
            title = listing.xpath('.//a[@data-tn-element="jobTitle"]/@title').extract_first()
            link = listing.xpath('.//h2[@class="title"]//a/@href').extract_first()
            company = listing.xpath('normalize-space(.//span[@class="company"]//a/text())').extract_first()
            yield {'title': title,
                   'link': link,
                   'company': company}
        next_page = response.xpath('//ul[@class="pagination-list"]//a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I am trying to extract all the job titles and companies for every job posting across all of the Indeed pages. However, I am stuck, because the forward button on the Indeed page does not have a fixed link that my scraper could follow; instead, the next-page URL is the same as the numbered button's. This means that even after requesting the next page, the numbers at the end of the URL change, which does not let me extract the page after that. I am trying to refrain from using Selenium or Splash, since I want to get my results through only Scrapy or Beautiful Soup. Any help would be greatly appreciated.
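For what it's worth, a rough sketch of the usual pattern is to target the "next" control rather than the numbered buttons; the aria-label value below is only a guess about Indeed's markup and would need to be verified against the actual page source:

# hypothetical selector: aria-label="Next" is an assumption, not verified against Indeed
next_page = response.xpath('//a[@aria-label="Next"]/@href').extract_first()
if next_page:
    yield response.follow(next_page, callback=self.parse)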
I'm trying to make a program that retrieves the title and price of the items while moving on to the following page.
All the information on the first page (title, price) is extracted, but the program does not go to the next page.
URL : https://scrapingclub.com/exercise/list_basic/
import scrapy


class RecursiveSpider(scrapy.Spider):
    name = 'recursive'
    allowed_domains = ['scrapingclub.com/exercise/list_basic/']
    start_urls = ['http://scrapingclub.com/exercise/list_basic//']

    def parse(self, response):
        card = response.xpath("//div[@class='card-body']")
        for thing in card:
            title = thing.xpath(".//h4[@class='card-title']").extract_first()
            price = thing.xpath(".//h5").extract_first
            yield {'price': price, 'title': title}
            next_page_url = response.xpath("//li[@class='page-item']//a/@href")
            if next_page_url:
                absolute_nextpage_url = response.urljoin(next_page_url)
                yield scrapy.Request(absolute_nextpage_url)
You should add the execution logs in situations like this; it would help pinpoint your problem.
I can see a few problems though:
next_page_url = response.xpath("//li[@class='page-item']//a/@href")
if next_page_url:
    absolute_nextpage_url = response.urljoin(next_page_url)
The variable next_page_url contains a selector, not a string. You need to use the .get() method to extract the string with the relative url.
After fixing this, I executed your code and it returned:
2020-09-04 15:19:34 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapingclub.com': <GET https://scrapingclub.com/exercise/list_basic/?page=2>
It's filtering the request because it considers it an offsite request, even though it isn't. To fix it, just use allowed_domains = ['scrapingclub.com'] or remove this line entirely. If you want to understand more about how this filter works, check the source here.
Finally, it doesn't make sense to have this snippet under the for loop:
next_page_url = response.xpath("//li[@class='page-item']//a/@href").get()  # I added the .get()
if next_page_url:
    absolute_nextpage_url = response.urljoin(next_page_url)
    yield scrapy.Request(absolute_nextpage_url)
If you use the get() method, next_page_url will receive the first item (which is page 2 now, but in the next callback it will be page 1, so you will never advance to page 3).
If you use getall(), it will return a list, which you would need to iterate over, yielding all possible requests; but this is a recursive function, so you would end up doing that in each recursion step.
The best option is to select the next button instead of the page number:
next_page_url = response.xpath('//li[#class="page-item"]/a[contains(text(), "Next")]/#href').get()
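Putting the three fixes together, a minimal sketch of the corrected spider might look like this; the field extraction is kept as in the question, only the missing () calls are added:

import scrapy


class RecursiveSpider(scrapy.Spider):
    name = 'recursive'
    allowed_domains = ['scrapingclub.com']  # domain only, no path
    start_urls = ['https://scrapingclub.com/exercise/list_basic/']

    def parse(self, response):
        for thing in response.xpath("//div[@class='card-body']"):
            yield {
                'title': thing.xpath(".//h4[@class='card-title']").get(),
                'price': thing.xpath(".//h5").get(),  # the question was missing the () call here
            }
        # pagination outside the item loop, targeting the "Next" button only
        next_page_url = response.xpath('//li[@class="page-item"]/a[contains(text(), "Next")]/@href').get()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url))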
I am scraping some info off of zappos.com, specifically the part of the details page that displays what customers who viewed the current item have also viewed.
This is one such item listing:
https://www.zappos.com/p/chaco-marshall-tartan-rust/product/8982802/color/725500
The thing is that I discovered that the section I am scraping appears right away on some items, but on others it will only appear after I have refreshed the page two or three times.
I am using scrapy to scrape and splash to render.
import scrapy
import re
from scrapy_splash import SplashRequest


class Scrapys(scrapy.Spider):
    name = "sqs"
    start_urls = ["https://www.zappos.com", "https://www.zappos.com/marty/men-shoes/CK_XAcABAuICAgEY.zso"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5},
                                )

    def parse(self, response):
        links = response.css("div._1Mgpu")
        for link in links:
            url = 'https://www.zappos.com' + link.css("a::attr(href)").extract_first()
            yield SplashRequest(url, callback=self.parse_attr,
                                endpoint='render.html',
                                args={'wait': 10},
                                )

    def parse_attr(self, response):
        alsoviewimg = response.css("div._18jp0 div._3Olkk div.QDcUX div.slider div.slider-frame ul.slider-list li.slider-slide a img").extract()
The alsoviewimg is one of the elements that I am pulling from the "Customers Who Viewed this Item Also Viewed" section. I have tested pulling this and other elements, all in the Scrapy shell with Splash rendering to get the dynamic content, and it pulled the content fine; however, in the spider it rarely, if ever, gets any hits.
Is there something I can set so that it loads the page a couple times to get the content? Or something else that I am missing?
You should check if the element you're looking for exists. If it doesn't, load the page again.
I'd look into why refreshing the page requires multiple attempts, you might be able to solve the problem without this ad-hoc multiple refresh solution.
Scrapy How to check if certain class exists in a given element — this link explains how to check whether a class exists.
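A sketch of what that retry could look like in the spider's parse_attr, assuming a cap of three attempts; the retry_count meta key and the yielded item are illustrative, not Zappos-specific:

def parse_attr(self, response):
    alsoviewimg = response.css("div._18jp0 div._3Olkk div.QDcUX div.slider div.slider-frame "
                               "ul.slider-list li.slider-slide a img").extract()
    if not alsoviewimg:
        # the section did not render this time: re-request the same URL a few more times
        retries = response.meta.get('retry_count', 0)  # arbitrary meta key for bookkeeping
        if retries < 3:
            yield SplashRequest(response.url, callback=self.parse_attr,
                                endpoint='render.html',
                                args={'wait': 10},
                                dont_filter=True,  # let the retry past the dupefilter
                                meta={'retry_count': retries + 1})
        return
    yield {'also_viewed_images': alsoviewimg}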
Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here but since it is over a year old I had to make adjustments to the xpath references.
My current code looks as such:
import scrapy
from tripadvisor.items import TripadvisorItem


class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)

        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)

        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page, but the pagination doesn't flip to the next page of reviews. I suspect this is because of this line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further I've read that these requests are sending a POST request using XHR. By exploring the "Network" tab in Firefox "Inspect" I can see both a Request URL and Form Data that might be needed to flip the page according to other posts on SO regarding the same topic.
However, it seems that the other posts refer to a static URL starting point when trying to pass a FormRequest using Scrapy. With TripAdvisor, the URL will always change based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the part of the URL that is -orXX- where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether it is possible to paginate using the XHR request/form data, or whether I need to manually build a list of URLs for each hotel that adds the -orXX- part.
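If the manual route were taken, the -orXX- offset in the URLs above appears to advance by 5 per page, so the next review page could in principle be computed from the current URL. A rough sketch, assuming that pattern holds (next_review_url is a made-up helper name):

import re

def next_review_url(url, per_page=5):
    # advance an existing -orXX- offset by per_page ...
    match = re.search(r'-or(\d+)-', url)
    if match:
        return re.sub(r'-or\d+-', '-or%d-' % (int(match.group(1)) + per_page), url)
    # ... or insert the first offset after "-Reviews" on the first page
    return url.replace('-Reviews-', '-Reviews-or%d-' % per_page, 1)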
Well, I ended up discovering an XPath that apparently allows pagination of the reviews, but it's funny because every time I checked the underlying HTML, the href never changed from referring to /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html even when I was on page 10, for example. It seems the "-orXX-" part of the link always increments XX by 5, so I'm not sure why this works.
All I did was change the line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
next_page = response.xpath('//link[#rel="next"]/#href')
and I now have over 41K extracted reviews. I would love to hear others' opinions on handling this problem in other situations.
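For context, a sketch of parse_hotel with that one change applied; the rest of the spider stays as in the question:

def parse_hotel(self, response):
    for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
        yield scrapy.Request(response.urljoin(href.extract()), callback=self.parse_review)
    # the <link rel="next"> element points at the next block of reviews (-orXX- offset)
    next_page = response.xpath('//link[@rel="next"]/@href').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse_hotel)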
I am getting confused about how to design the architecture of the crawler.
I have a search page where I have:
pagination: next page links to follow
a list of products on one page
individual links to be crawled to get the description
I have the following code:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    items = []
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['profile_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for request, and one for yield.
With a CrawlSpider I didn't need that last yield, so everything was working fine, but with a BaseSpider I have to follow the links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
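Whichever route you take, the immediate conflict in the question (return for the product request vs. yield for the pagination request) goes away if everything is yielded from a single generator callback. A rough sketch based on the question's code, with the BeautifulSoup "Next" lookup swapped for an XPath of the same intent; MyProduct, myfilter, Request and urljoin are assumed to be imported/defined as in the original spider, and the profile_link/product_link mismatch and the stray trailing slash are corrected:

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    # yield one request per product instead of returning inside the loop ...
    for site in hxs.select('//ol[@id="result-set"]/li')[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]').select("string()").extract())
        if item['product_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            yield request
    # ... and then the pagination request from the same generator
    # (this XPath stands in for the BeautifulSoup "Next" lookup in the question)
    next_links = hxs.select('//a[text()="Next"]/@href').extract()
    if next_links:
        yield Request(urljoin(response.url, next_links[0]), callback=self.parse_page)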