Scrapy crawler pagination for Indeed website - Python

# -*- coding: utf-8 -*-
import scrapy


class SearchSpider(scrapy.Spider):
    name = 'search'
    allowed_domains = ['indeed.com']  # domain only -- no scheme or trailing slash, or offsite filtering misbehaves
    start_urls = ['https://www.indeed.com/jobs?q=data%20analyst&l=united%20states']

    def parse(self, response):
        listings = response.xpath('//*[@data-tn-component="organicJob"]')
        for listing in listings:
            title = listing.xpath('.//a[@data-tn-element="jobTitle"]/@title').extract_first()
            link = listing.xpath('.//h2[@class="title"]//a/@href').extract_first()
            company = listing.xpath('normalize-space(.//span[@class="company"]//a/text())').extract_first()
            yield {'title': title,
                   'link': link,
                   'company': company}
        next_page = response.xpath('//ul[@class="pagination-list"]//a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I am trying to extract the job title and company for every job posting across all of the Indeed result pages. However, I am stuck: the forward button on the Indeed page does not have a fixed link my scraper could follow; instead, its URL is the same as the numbered button's, so even after requesting the next page the numbers at the end change and I cannot extract the page after that. I am trying to refrain from using Selenium or Splash, since I want to get my results through Scrapy or BeautifulSoup alone. Any help would be greatly appreciated.
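One way around the moving "next" button, assuming Indeed's result pages are addressed by a start query parameter that steps by 10 (an assumption worth verifying against the real page URLs): build each page's URL yourself and stop when a page comes back empty. A rough sketch (cb_kwargs needs Scrapy 1.7+):

import scrapy
from urllib.parse import urlencode


class SearchOffsetSpider(scrapy.Spider):
    name = 'search_offset'
    allowed_domains = ['indeed.com']
    base_url = 'https://www.indeed.com/jobs'
    query = {'q': 'data analyst', 'l': 'united states'}

    def start_requests(self):
        yield self.page_request(offset=0)

    def page_request(self, offset):
        # assumed pagination scheme: ...&start=0, ...&start=10, ...&start=20
        params = dict(self.query, start=offset)
        return scrapy.Request(self.base_url + '?' + urlencode(params),
                              cb_kwargs={'offset': offset})

    def parse(self, response, offset):
        listings = response.xpath('//*[@data-tn-component="organicJob"]')
        for listing in listings:
            yield {
                'title': listing.xpath('.//a[@data-tn-element="jobTitle"]/@title').extract_first(),
                'company': listing.xpath('normalize-space(.//span[@class="company"]//a/text())').extract_first(),
            }
        # keep going only while the current page still returned results
        if listings:
            yield self.page_request(offset + 10)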

Related

How to scrape a JavaScript web site?

Hello everyone, I'm a beginner at scraping and I am trying to scrape all the iPhones on https://www.electroplanet.ma/
This is the script I wrote:
import re

import scrapy

from ..items import EpItem


class ep(scrapy.Spider):
    name = "ep"
    start_urls = ["https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=1",
                  "https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=2"
                  ]

    def parse(self, response):
        products = response.css("ol li")  # find all items on the page
        for product in products:
            try:
                lien = product.css("a.product-item-link::attr(href)").get()    # the link of each item
                image = product.css("a.product-item-photo::attr(href)").get()  # the image
                # to get into each item's page and scrape it, I use the follow method;
                # I pass image as an argument to parse_item because I couldn't scrape
                # the image from the item's page -- I think it's hidden
                yield response.follow(lien, callback=self.parse_item, cb_kwargs={"image": image})
            except:
                pass

    def parse_item(self, response, image):
        item = EpItem()
        item["Nom"] = response.css(".ref::text").get()
        pattern = re.compile(r"\s*(\S+(?:\s+\S+)*)\s*")
        item["Catégorie"] = pattern.search(response.xpath("//h1/a/text()").get()).group(1)
        item["Marque"] = pattern.search(response.xpath("//*[@data-th='Marque']/text()").get()).group(1)
        try:
            item["RAM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE RAM']/text()").get()).group(1)
        except:
            pass
        item["ROM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE DE STOCKAGE']/text()").get()).group(1)
        item["Couleur"] = pattern.search(response.xpath("//*[@data-th='COULEUR']/text()").get()).group(1)
        item["lien"] = response.request.url
        item["image"] = image
        item["état"] = "neuf"
        item["Market"] = "Electro Planet"
        yield item
I ran into problems scraping all the pages, because the site uses JavaScript to follow pages, so I wrote all the page links in start_urls. I believe that's not best practice, so I'm asking for advice to improve my code.
You can use the scrapy-playwright plugin to scrape interactive websites. As for start_urls, just add the main website's index URL if there is only one website, and check this link in the Scrapy docs to make the spider follow the page links automatically instead of writing them manually.
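As a rough illustration of the docs' approach, a CrawlSpider with LinkExtractor rules can discover and follow the pagination and product links automatically. The CSS selectors below are assumptions to adjust against the site's real markup:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class EpCrawlSpider(CrawlSpider):
    name = "ep_crawl"
    allowed_domains = ["electroplanet.ma"]
    start_urls = ["https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone"]

    rules = (
        # follow pagination links automatically (selector is an assumption)
        Rule(LinkExtractor(restrict_css="a.page")),
        # open every product page and hand it to parse_item
        Rule(LinkExtractor(restrict_css="a.product-item-link"), callback="parse_item"),
    )

    def parse_item(self, response):
        yield {
            "Nom": response.css(".ref::text").get(),
            "lien": response.url,
        }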

How can I take data from all pages?

It's my first time using the Scrapy framework for Python, and I wrote this code:
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            yield {
                'product-name': i.xpath('.//a[@class="product-title js-product-url"]/text()')
                                 .extract_first().replace('\n', '')
            }
        next_page_url = response.xpath('//a[@class="js-change-page"]/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
When I look at the website it has over 800 products, but my script only takes the first 2 pages, roughly 200 products...
I tried both a CSS selector and XPath; same bug with both.
Can anyone figure out where the problem is?
Thank you!
The website you are trying to crawl gets its data from an API: when you click a pagination link, the page sends an AJAX request to the API to fetch more products and show them on the page. Since Scrapy doesn't simulate the browser environment itself, one way would be to:
1. Analyse the request in your browser's network tab to inspect the endpoint and parameters.
2. Build a similar request yourself in Scrapy.
3. Call that endpoint with the appropriate arguments to get the products from the API.
You also need to extract the next page from the JSON response you get from the API. Usually there is a key named pagination which contains info about total pages, the next page, etc., as in the sketch below.
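A minimal sketch of that flow; the endpoint URL and JSON key names here are placeholders (assumptions), so copy the real ones from the network tab:

import json

import scrapy


class EmagApiSpider(scrapy.Spider):
    name = "emag_api"
    # hypothetical endpoint -- copy the real request URL from the browser's network tab
    api_url = "https://www.emag.ro/search-api/televizoare?page={page}"

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1), cb_kwargs={"page": 1})

    def parse(self, response, page):
        data = json.loads(response.text)
        # key names are assumptions; inspect the actual JSON structure
        for product in data.get("items", []):
            yield {"product-name": product.get("name")}
        pagination = data.get("pagination", {})
        if page < pagination.get("pages", 0):
            yield scrapy.Request(self.api_url.format(page=page + 1),
                                 cb_kwargs={"page": page + 1})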
I finally figured out how to do it:
# -*- coding: utf-8 -*-
import scrapy

from ..items import ScraperItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    page_number = 2
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            item = ScraperItem()  # a fresh item per product, instead of mutating one shared instance
            item["product_name"] = (i.xpath('.//a[@class="product-title js-product-url"]/text()')
                                     .extract_first().strip())
            yield item
        next_page = 'https://www.emag.ro/televizoare/p' + str(SpiderSpider.page_number) + '/c'
        if SpiderSpider.page_number <= 28:
            SpiderSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
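This works, but the class-level counter is shared state across all scheduled requests. A sketch of the same loop carrying the page number on the request itself via meta, dropped into the same spider:

    def parse(self, response):
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            item = ScraperItem()
            item["product_name"] = (i.xpath('.//a[@class="product-title js-product-url"]/text()')
                                     .extract_first().strip())
            yield item

        page = response.meta.get("page", 1) + 1  # the first response defaults to page 1
        if page <= 28:
            yield response.follow('https://www.emag.ro/televizoare/p%d/c' % page,
                                  callback=self.parse, meta={"page": page})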

Scrapy doesn't follow next-page URL, why?

I'm scraping this website: https://www.olx.com.ar/celulares-telefonos-cat-831 with Scrapy 1.4.0. When I run the spider everything goes well until it gets to the "next-page" part. Here's the code:
# -*- coding: utf-8 -*-
import scrapy
#import time


class OlxarSpider(scrapy.Spider):
    name = "olxar"
    allowed_domains = ["olx.com.ar"]
    start_urls = ['https://www.olx.com.ar/celulares-telefonos-cat-831']

    def parse(self, response):
        #time.sleep(10)
        response = response.replace(body=response.body.replace('<br>', ''))
        SET_SELECTOR = '.item'
        for item in response.css(SET_SELECTOR):
            PRODUCTO_SELECTOR = '.items-info h3 ::text'
            yield {
                'producto': item.css(PRODUCTO_SELECTOR).extract_first().replace(',', ' '),
            }
        NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first().replace('//', 'https://')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse)
I've seen in other questions that some people added the dont_filter=True attribute to the Request, but that doesn't work for me; it just makes the spider loop over the first 2 pages.
I added the replace('//', 'https://') part to fix the original href, which comes without https: and can't be followed by Scrapy otherwise.
Also, when I run the spider it scraps the first page and then returns [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.olx.com.ar/celulares-telefonos-cat-831-p-2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
Why is it filtering the second page as a duplicate when apparently it is not?
I applied Tarun Lalwani's solution from the comments. I had missed that detail! It worked fine with the correction, thank you!
Your problem is the CSS selector. On page 1 it matches only the next-page link; on page 2 it matches both the previous-page and the next-page links. Of those you pick the first one with extract_first(), so you just rotate between the first and second pages.
The solution is simple: change the CSS selector
NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'
to
NEXT_PAGE_SELECTOR = '.items-paginations-buttons a.next::attr(href)'
This will match only the next-page URL, as in the sketch below.
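A side note on the replace('//', 'https://') workaround mentioned in the question: response.urljoin already resolves protocol-relative hrefs against the response's own scheme, so the corrected loop can be as simple as:

        NEXT_PAGE_SELECTOR = '.items-paginations-buttons a.next::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            # urljoin turns '//www.olx.com.ar/...' into 'https://www.olx.com.ar/...'
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)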

Scrapy XHR Pagination on TripAdvisor

Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here, but since it is over a year old I had to adjust the XPath references.
My current code looks like this:
import scrapy

from tripadvisor.items import TripadvisorItem


class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)

        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)

        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page, but the pagination doesn't flip to the next page of reviews. I suspect this is because of this line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further, I've read that these requests send a POST request using XHR. By exploring the "Network" tab in Firefox's "Inspect" tool, I can see both a Request URL and Form Data that might be needed to flip the page, according to other posts on SO about the same topic.
However, the other posts seem to refer to a static URL starting point when passing a FormRequest with Scrapy. With TripAdvisor, the URL always changes based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page, but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the -orXX- part of the URL, where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether or not it is possible to paginate using the XHR request/form data or do I need to manually build a list of URLs for each hotel that adds the -orXX-?
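For what it's worth, a minimal sketch of the FormRequest variant the question asks about, assuming the XHR posts back to the hotel page's own URL (an assumption; verify the real Request URL in the network tab). It would replace the next_page block in TrSpider's parse_hotel:

    def parse_hotel(self, response):
        review_links = response.xpath('//div[starts-with(@class,"quote")]/a/@href').extract()
        for href in review_links:
            yield scrapy.Request(response.urljoin(href), callback=self.parse_review)
        if review_links:  # stop once a page comes back with no reviews
            # POST the observed form data back to the same URL (assumption)
            yield scrapy.FormRequest(
                response.url,
                formdata={'reqNum': '1', 'changeSet': 'REVIEW_LIST'},
                callback=self.parse_hotel,
                dont_filter=True,  # same URL every time, so bypass the dupefilter
            )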
Well, I ended up discovering an XPath that apparently allows pagination of the reviews, but it's funny: every time I checked the underlying HTML, the href never changed from /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html, even when I was on page 10, for example. It seems the "-orXX-" part of the link always increments XX by 5, so I'm not sure why this works.
All I did was change the line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
next_page = response.xpath('//link[@rel="next"]/@href')
and now have >41K extracted reviews. I would love to hear others' opinions on handling this problem in other situations; one idea is sketched below.
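For situations without a rel="next" link, one option is to build the review-page URLs directly from the -orXX- pattern shown above, stepping the offset by 5 per page (the step size is inferred from the example URLs). A small helper sketch, where n_reviews would come from the hotel page's review count:

def review_page_urls(first_url, n_reviews, step=5):
    """Generate the -orXX- offset variants of a hotel's first review-page URL."""
    urls = [first_url]
    for offset in range(step, n_reviews, step):
        urls.append(first_url.replace('-Reviews-', '-Reviews-or%d-' % offset))
    return urls

# e.g. review_page_urls(first_review_page, 20) yields the plain URL plus the
# -or5-, -or10- and -or15- variants listed above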

Unable to scrape some URLs from a webpage

I am trying to scrape all the restaurant URLs on a page. There are only 5 restaurant URLs to scrape in this particular example.
At this stage, I am just trying to print them to see if my code works. However, I am not even able to get that done: my code is unable to find any of the URLs.
import scrapy

from hungryhouse.items import HungryhouseItem


class HungryhouseSpider(scrapy.Spider):
    name = "hungryhouse"
    allowed_domains = ["hungryhouse.co.uk"]
    start_urls = ["https://hungryhouse.co.uk/takeaways/westhill-ab32",
                  ]

    def parse(self, response):
        for href in response.xpath('//div[@class="restsRestInfo"]/a/@href'):
            url = response.urljoin(href.extract())
            print(url)
Any guidance as to why the five URLs are not being found would be gratefully received; a quick way to check the selectors is shown below.
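Not an answer as such, but a quick way to see what the spider actually receives is scrapy shell. If the XPath below comes back empty while the class name is absent from response.text, the raw HTML differs from what the browser shows (or the block is rendered by JavaScript):

scrapy shell "https://hungryhouse.co.uk/takeaways/westhill-ab32"
>>> response.xpath('//div[@class="restsRestInfo"]/a/@href').extract()
>>> 'restsRestInfo' in response.text   # is the class even in the raw HTML?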
