Recursively Scraping pages using Python (scrapy) - python

I'm trying to make a program that retrieves the title and price of items while going to the following page.
Now all informations of the first page (title, price) are extracted but the program does not go to the next page
URL : https://scrapingclub.com/exercise/list_basic/
import scrapy
class RecursiveSpider(scrapy.Spider):
name = 'recursive'
allowed_domains = ['scrapingclub.com/exercise/list_basic/']
start_urls = ['http://scrapingclub.com/exercise/list_basic//']
def parse(self, response):
card = response.xpath("//div[#class='card-body']")
for thing in card:
title = thing.xpath(".//h4[#class='card-title']").extract_first()
price = thing.xpath(".//h5").extract_first
yield {'price' : price, 'title' : title}
next_page_url = response.xpath("//li[#class='page-item']//a/#href")
if next_page_url:
absolute_nextpage_url = response.urljoin(next_page_url)
yield scrapy.Request(absolute_nextpage_url) ```

You should add the execution logs in situations like this, it would help pin point your problem.
I can see a few problems though:
next_page_url = response.xpath("//li[#class='page-item']//a/#href")
if next_page_url:
absolute_nextpage_url = response.urljoin(next_page_url)
The variable next_page_url contains a selector, not a string. You need to use the .get() method to extract the string with the relative url.
After this, I executed your code it returned:
2020-09-04 15:19:34 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'scrapingclub.com': <GET https://scrapingclub.com/exercise/list_basic/?page=2>
It's filtering the request as it considers it an offisite request, even if it isn't. To fix it just use allowed_domains = ['scrapingclub.com'] or just remove this line entirely. If you want to understand more how this filter works check the source here.
Finally, it doesn't make sense to have this snippet under the for loop:
next_page_url = response.xpath("//li[#class='page-item']//a/#href").get() # I added the .get()
if next_page_url:
absolute_nextpage_url = response.urljoin(next_page_url)
yield scrapy.Request(absolute_nextpage_url)
If you use get() method it will return to next_page_url the first item (which is page 2 now, but in the next callback will be page 1, so you will never advance to page 3).
If you use getall() it will return a list, which you would need to iterate over yielding all possible requests, but this is a recursive function, so you would end up doing that in each recursion step.
The best option is to select the next button instead of the page number:
next_page_url = response.xpath('//li[#class="page-item"]/a[contains(text(), "Next")]/#href').get()

Related

How can I parse several differently structured websites with the same spider?

All the websites I want to parse are in the same domain but all look very different and contain different information I need.
My start_url is a page with a list containing all links I need. So in the parse() method I yield a request for each of these links and in parse_item_page I extract the first part of the information I need - which worked completely fine.
My problem is: I thought I could just do the same another time and for each link on my item_page call parse_entry. But I tried so many different versions of this and I just can't get it to work. They are the correct URLs but scrapy seems to just don't want to call a third parse() function, nothing in there ever gets executed.
How can I get scrapy to use parse_entry, or pass all these links to a new spider?
This is a simplified, shorter version of my spider class:
def parse(self, response, **kwargs):
for href in response.xpath("//listItem/#href"):
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback=self.parse_item_page)
def parse_item_page(self, response):
for sel in response.xpath("//div"):
item = items.FirstItem()
item['attribute'] = sel.xpath("//h1/text()").get().strip()
for href in response.xpath("//entry/#href"):
yield response.follow(href.extract(), callback=self.parse_entry)
yield item
def parse_entry(self, response):
for sel in response.xpath("//textBlock"):
item = items.SecondItem()
item['attribute'] = sel.xpath("//h1/text()").get().strip()
yield item

Scrapy doesn'f follow next-page url, why?

I'm scraping this website: https://www.olx.com.ar/celulares-telefonos-cat-831 with Scrapy 1.4.0. When I run the spider everything goes well until it gets to the "next-page" part. Here's the code:
# -*- coding: utf-8 -*-
import scrapy
#import time
class OlxarSpider(scrapy.Spider):
name = "olxar"
allowed_domains = ["olx.com.ar"]
start_urls = ['https://www.olx.com.ar/celulares-telefonos-cat-831']
def parse(self, response):
#time.sleep(10)
response = response.replace(body=response.body.replace('<br>', ''))
SET_SELECTOR = '.item'
for item in response.css(SET_SELECTOR):
PRODUCTO_SELECTOR = '.items-info h3 ::text'
yield {
'producto': item.css(PRODUCTO_SELECTOR).extract_first().replace(',',' '),
}
NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'
next_page = response.css(NEXT_PAGE_SELECTOR).extract_first().replace('//','https://')
if next_page:
yield scrapy.Request(response.urljoin(next_page),
callback=self.parse
)
I've seen in other questions that some people added dont_filter = True attribute to the Request but that doesn't work for me. It just makes the spider loop over the first 2 pages.
I've added the replace('//','https://') part to fix the original href that comes without https: and can't be followed by Scrapy.
Also, when I run the spider it scraps the first page and then returns [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.olx.com.ar/celulares-telefonos-cat-831-p-2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
Why is it filtering the second page like duplicated when apparently is not?
I applied the Tarun Lalwani solution on the comments. I missed that detail so bad! It worked fine with the correction thank you!
Your problem is the css selector. On page 1 it matches the next page link. On page 2 it matches the previous page and the next page link. Out of that your pick the first one using extract_first(), so your just rotate between first and second page only
Solution is simple, you need to change css selector
NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'
to
NEXT_PAGE_SELECTOR = '.items-paginations-buttons a.next::attr(href)'
This will only identify next page url only

Scrapy - Visiting nested links and grabbing meta data from each level

I am relatively new to scrapy and have been getting a lot of exceptions...
Here is what I am trying to do:
There 4 nested links that I want to grab data from:
Let's say I have 5 items that I want to crawl in total. These items are
Industry=scrapy.Field()
Company=scrapy.Field()
Contact_First_name=scrapy.Field()
Contact_Last_name=scrapy.Field()
Website=scrapy.Field()
Now to begin crawling I would first have to get the Industry.
The Industry xpath also contains the link to individual listings of companies that belong to their Industry segments.
Next I want to use the Industry xpath and go into the link. This page does not contain any data that I want to crawl. But this page contains href links to individual companies that have their own basic info page.
Using the href link from the listings page, I now arrive at one page that contains the information for one company. Now I want to scrape the Company, Address, and Website.
There is another href link that I need to click in order to lead to Contact_First_Name, Contact_Last_Name.
Using the href link, I now arrive at another page that contains the Contact_First_Name, and Contact_Last_Name
After crawling all of these pages, I should have items that look somewhat like this:
Industry Company Website Contact_First_Name Contact_Last_Name
Finance JPMC JP.com Jamie Dimon
Finance BOA BOA.com Bryan Moynihan
Technology ADSK ADSK.com Carl Bass
EDITED
Here is the code that is working. Anzel's recommendations really helped out but i realized the subclass allowed_domains was wrong which stopped the nested links from following through. Once I changed it, it works.
class PschamberSpider(scrapy.Spider):
name="pschamber"
allowed_domains = ["cm.pschamber.com"]
start_urls = ["http://cm.pschamber.com/list/"]
def parse(self, response):
item = PschamberItem()
for sel in response.xpath('//*[#id="mn-ql"]/ul/li/a'):
# xpath and xpath().extract() will return a list
# extract()[0] will return the first element in the list
item['Industry'] = sel.xpath('text()').extract()
# another mistake you made here
# you're trying to call scrapy.Request(LIST of hrefs) which will fail
# scrapy.Request only takes a url string, not list
# another big mistake is you're trying to yield the item,
# whereas you should yield the Request object
yield scrapy.Request(sel.xpath('#href').extract()[0], callback=self.parse_2, meta={'item': item})
# another mistake, your callback function DOESNT take item as argument
def parse_2(self, response):
for sel in response.xpath('.//*[#id="mn-members"]/div/div/div/div/div/a').extract():
# you can access your response meta like this
item=response.meta['item']
item['Company'] = sel.xpath('text()').extract()
yield scrapy.Request(sel.xpath('#href').extract()[0], callback=self.parse_3, meta={'item': item})
# again, yield the Request object
def parse_3(self, response):
item=response.meta['item']
item['Website'] = response.xpath('.//[#id="mn-memberinfo-block-website"]/a/#href').extract()
# OK, finally assume you're done, just return the item object
return item
There are quite a few mistakes you've made in your code therefore it's not running as you expected. Please see my below brief sample how to get the items you need and passing the meta to other callbacks. I am not copying your xpath as I just grab the most straight forward one from the site, you can apply your own.
I will try to comment as clear as possible to let you know where you did wrong.
class PschamberSpider(scrapy.Spider):
name = "pschamber"
# start from this, since your domain is a sub-domain on its own,
# you need to change to this without http://
allowed_domains = ["cm.pschamber.com"]
start_urls = (
'http://cm.pschamber.com/list/',
)
def parse(self, response):
item = PschamberItem()
for sel in response.xpath('//div[#id="mn-ql"]//a'):
# xpath and xpath().extract() will return a list
# extract()[0] will return the first element in the list
item['industry'] = sel.xpath('text()').extract()[0]
# another mistake you made here
# you're trying to call scrapy.Request(LIST of hrefs) which will fail
# scrapy.Request only takes a url string, not list
# another big mistake is you're trying to yield the item,
# whereas you should yield the Request object
yield scrapy.Request(
sel.xpath('#href').extract()[0],
callback=self.parse_2,
meta={'item': item}
)
# another mistake, your callback function DOESNT take item as argument
def parse_2(self, response):
for sel in response.xpath('//div[#class="mn-title"]//a'):
# you can access your response meta like this
item = response.meta['item']
item['company'] = sel.xpath('text()').extract()[0]
# again, yield the Request object
yield scrapy.Request(
sel.xpath('#href').extract()[0],
callback=self.parse_3,
meta={'item': item}
)
def parse_3(self, response):
item = response.meta['item']
item['website'] = response.xpath('//a[#class="mn-print-url"]/text()').extract()
# OK, finally assume you're done, just return the item object
return item
Hope this is self-explanatory and you get to understand the basic of scrapy, you should READ thoroughly the doc from Scrapy, and sooner you will learn another method to set rules to follow links with certain patterns... well of course once you get the basic right you will understand them.
Although everyone's journey differs, I strongly recommend you keep reading and practice until you're confident in what you're doing before crawling actual website. Also, there are rules to protect web contents which can be scraped, and copyright about the content you scrape.
Keep this in mind or you may find yourself in big trouble in future. Anyway, good luck and I hope this answer helps you resolve the problem!

Writing a crawler to parse a site in scrapy using BaseSpider

I am getting confused on how to design the architecure of crawler.
I have the search where I have
pagination: next page links to follow
a list of products on one page
individual links to be crawled to get the description
I have the following code:
def parse_page(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ol[#id=\'result-set\']/li')
items = []
for site in sites[:2]:
item = MyProduct()
item['product'] = myfilter(site.select('h2/a').select("string()").extract())
item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
if item['profile_link']:
request = Request(urljoin('http://www.example.com', item['product_link']),
callback = self.parseItemDescription)
request.meta['item'] = item
return request
soup = BeautifulSoup(response.body)
mylinks= soup.find_all("a", text="Next")
nextlink = mylinks[0].get('href')
yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for request, and one for yield.
In the crawl spider, I don't need to use the last yield, so everything was working fine, but in BaseSpider I have to follow links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.

Scrapy - parse a page to extract items - then follow and store item url contents

I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items.
Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding items. So far so good, everything works great.
But each item, has among other data, an url, with more details on that item. I want to follow that url and store in another item field (url_contents) the fetched contents of that item's url.
And I'm not sure how to organize code to achieve that, since the two links (listings link, and one particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the same item processing.
My code so far looks like this:
class MySpider(CrawlSpider):
name = "example.com"
allowed_domains = ["example.com"]
start_urls = [
"http://www.example.com/?q=example",
]
rules = (
Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths = '//div[#class="pagination"]'), callback='parse_item'),
Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow = False),
)
def parse_item(self, response):
main_selector = HtmlXPathSelector(response)
xpath = '//h2[#class="title"]'
sub_selectors = main_selector.select(xpath)
for sel in sub_selectors:
item = ExampleItem()
l = ExampleLoader(item = item, selector = sel)
l.add_xpath('title', 'a[#title]/#title')
......
yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, that gives you listings of items, and also, very important, add follow=True to that rule.
And in parse_item() you have to yield a request instead of an item, but after you load the item. The request is to item detail url. And you have to send the loaded item to that request callback. You do your job with the response, and there is where you yield the item.
So the finish of parse_item() will look like this:
itemloaded = l.load_item()
# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback = lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
item = response.request.meta['item']
item['url_contents'] = response.body
yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no-one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it with that (although I'm sure it's possible), but my solution will be to use urllib and BeatifulSoup to load the second page manually, extract that information myself, and save it as part of the Item. Yes, much more trouble than Scrapy makes normal parsing, but it should get the job done with the least hassle.

Categories