Using Scrapy's CrawlSpider, is there a canonical way to get the URL of the page that a rule followed from? For example, if I had a link from page A to page B, when I parse page B in the callback method, is there a way to know the URL of page A? I am interested in a built-in feature rather than extending the CrawlSpider class.
In your callback you can use the "Referer" header in the response's request headers
def mycallback(self, response):
    # the Referer request header holds the URL of the page the link was found on
    print("Referer:", response.request.headers.get("Referer"))
    ...
It should work with all spiders.
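If you prefer something more explicit than the Referer header (for instance, when referrer headers are stripped or disabled), you can carry the source URL along with the request yourself. A minimal sketch of that pattern, where the spider name, URLs, and callback names are placeholders, not part of the original answer:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://www.example.com/page-a']

    def parse(self, response):
        # page A: follow its links, passing page A's URL along explicitly
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_page_b,
                meta={'source_url': response.url},
            )

    def parse_page_b(self, response):
        # page B: the URL of page A is available without relying on headers
        source = response.meta.get('source_url')
        self.logger.info('Arrived from %s', source)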
I am making a crawler for an HTML site.
Each page has one tag like this,
Next >>
but the last page does not have this tag.
So how can I crawl each page?
At first, I thought of doing it like this; however, somehow the last call to self.start_requests is not executed.
page = 0

def start_requests(self, page=0):
    urls = ['https://www.example.com/page={0}'.format(page)]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
    # check if there is an <a> tag
    xlink = LinkExtractor()
    for link in xlink.extract_links(response):
        yield scrapy.Request(url=link.url, callback=self.parse_each)
    if there is a tag:
        page = page + 1
        self.start_requests(page)
What is the best practice for this kind of crawling?
I'm pretty sure that your start_requests method is executed. You are probably experiencing problems because you do not yield the result from start_requests.
In your if statement, yield the requests it produces (start_requests is a generator):
    yield from self.start_requests(page)
Also, I personally would not use start_requests like this, since start_requests is called automatically when your spider starts. Instead of yielding from start_requests, yielding a request directly from parse, with the URL scraped from the page, would make your code clearer.
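A rough sketch of that suggestion, adapted from the question's code; the spider name, URLs, and the "Next" link selector are assumptions for illustration:

import scrapy
from scrapy.linkextractors import LinkExtractor

class PageSpider(scrapy.Spider):
    name = 'pages'
    start_urls = ['https://www.example.com/page=0']

    def parse(self, response):
        # scrape the pages linked from the current listing page
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(url=link.url, callback=self.parse_each)
        # follow the "Next >>" link if present; on the last page the
        # selector matches nothing and the crawl simply stops
        next_href = response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

    def parse_each(self, response):
        # parse each linked page here
        pass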
I'm trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I'm stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own:
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the open_in_browser function, I see that I'm logged in). I also manage to "click" on the first button on the website in the parse2 part (if I call open_in_browser after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the parse3 part, as I cannot navigate another level deeper (or maybe I can, but open_in_browser no longer opens the website; it only works if I put it after parse or parse2). My understanding is that I chain multiple parse functions one after another to navigate through the website.
Datacamp says I always need to start with a start_requests function, which is what I tried, but in the YouTube videos etc. I saw that most spiders start directly with parse functions. Using "inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess

class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)

process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to crawl a website completely. Let's say you want to crawl all the links in the header of the website and then open each of those links to see the page it refers to.
To achieve this, first identify what you need to scrape, write CSS or XPath selectors for those links, and put them in a rule. A rule without a callback simply follows the extracted links, or you can assign its callback to some other method. I am attaching a dummy example of creating rules, and you can map it to your case:
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)
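To show how such rules plug into a spider, here is a rough, untested sketch of a CrawlSpider built around them; the spider name, URL, and CSS selectors are placeholders rather than anything from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NavigationSpider(CrawlSpider):
    name = 'navigation'
    start_urls = ['https://www.example.com/']

    rules = (
        # follow every link found inside the header/navigation area
        Rule(LinkExtractor(restrict_css=['nav.header'])),
        # open the pages linked from the main content area and parse them
        Rule(LinkExtractor(restrict_css=['div.main-content']), callback='parse_item'),
    )

    def parse_item(self, response):
        # extract whatever data you need from the target page
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}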
I have the following code for a web crawler in Python 3:
import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))
    return return_links

def recursive_search(links):
    for i in links:
        links.append(get_links(i))
    recursive_search(links)

recursive_search(get_links("https://www.brandonskerritt.github.io"))
The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.
I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, it might be that the links are in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.
I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.
Any help will be appreciated!
There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings for doing this successfully.
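For illustration, a sketch of the kind of settings that section discusses; the exact values below are only examples, not recommendations:

# settings.py -- illustrative tuning for a broad crawl
CONCURRENT_REQUESTS = 100          # raise global concurrency
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = 'INFO'                 # keep logging overhead down
COOKIES_ENABLED = False            # usually unnecessary for broad crawls
RETRY_ENABLED = False              # do not retry failed pages
DOWNLOAD_TIMEOUT = 15              # give up on slow pages sooner
ROBOTSTXT_OBEY = True              # honour robots.txt, as you intend to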
To recreate the behaviour you need in Scrapy, you must:
set your start url in your spider.
write a parse function that follows all links and recursively calls itself, adding the requested urls to a spider attribute.
An untested example (which can, of course, be refined):
import scrapy

class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self):
        self.links = []

    def parse(self, response):
        self.links.append(response.url)
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.
A simple spider that follows all links:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
I'm getting a problem making a request to a URL.
While inspecting the main page, I get the URL in the href as
But when the link gets opened, it appears to be:
Both links are different; how can I make a request for this?
Here is what my shell says:
The problem is with the allowed_domains spider attribute. Your current setting doesn't allow you to follow requests to jobview.monster.ca, as the log shows (DEBUG: Filtered offsite request to ...). Set that attribute a bit more loosely:
allowed_domains = ['monster.ca']
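For context, a minimal sketch of where that attribute sits; the spider name, start URL, and the trivial parse callback are placeholders, not taken from your code:

import scrapy

class MonsterSpider(scrapy.Spider):
    name = 'monster'
    # 'monster.ca' also covers subdomains such as jobview.monster.ca,
    # so the offsite middleware no longer filters those requests
    allowed_domains = ['monster.ca']
    start_urls = ['https://www.monster.ca/']

    def parse(self, response):
        # follow links; requests to *.monster.ca now pass the offsite check
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)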
I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider.
If I have a rule where I'm crawling through the Yellow Pages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the rule, and specifically, how does it activate the restrict_xpath attribute?
Thanks.
The rules attribute for a CrawlSpider specifies how to extract the links from a page and which callbacks should be called for those links. They are handled by the default parse() method implemented in that class (look here to read the source).
So, whenever you want to trigger the rules for a URL, you just need to yield a scrapy.Request(url, self.parse), and the Scrapy engine will send a request to that URL and apply the rules to the response.
The extraction of the links (which may or may not use restrict_xpaths) is done by the LinkExtractor object registered for that rule. It basically searches for all the <a> and <area> elements in the whole page, or only in the elements obtained after applying the restrict_xpaths expressions if that attribute is set.
For example, say you have a CrawlSpider like so:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    start_urls = ['http://someurlhere.com']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='parse'
        ),
        Rule(
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='parse_product_page'
        ),
    )

    def parse_product_page(self, response):
        # yield product item here
        pass
The engine starts sending requests to the urls in start_urls and executes the default callback (the parse() method in CrawlSpider) for their responses.
For each response, the parse() method will run the link extractors on it to get the links from the page. Namely, it calls LinkExtractor.extract_links(response) for each response object to get the urls, and then yields scrapy.Request(url, <rule_callback>) objects.
The example code is a skeleton for a spider that crawls an e-commerce site, following the links of product categories and subcategories to get the links to each of the product pages.
For the rules registered in this particular spider, it would crawl the links inside the lists of "categories" and "subcategories" with the parse() method as the callback (which will trigger the crawl rules to be applied to those pages as well), and the links matching the regular expression product.php?id=\d+ with the callback parse_product_page(), which would finally scrape the product data.
As you can see, pretty powerful stuff. =)
Read more:
CrawlSpider - Scrapy docs
Link extractors - Scrapy docs