I have defined two spiders which do the following:
Spider A:
Visits the home page.
Extracts all the links from the page and stores them in a text file.
This is necessary since the home page has a More Results button which produces further links to different products.
Spider B:
Opens the text file.
Crawls the individual pages and saves the information.
I am trying to combine the two and make a crawl-spider.
The URL structure of the home page is similar to:
http://www.example.com
The URL structure of the individual pages is similar to:
http://www.example.com/Home/Detail?id=some-random-number
The text file contains the list of such URLs which are to be scraped by the second spider.
My question:
How do I combine the two spiders so as to make a single spider which does the complete scraping?
From the Scrapy documentation:
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
So what you actually need to do is, in the parse method where you extract the links, yield a new request for each link:
    yield self.make_requests_from_url("http://www.example.com/Home/Detail?id=some-random-number")
self.make_requests_from_url is already implemented in Spider.
Example of such:
class MySpider(Spider):
    name = "my_spider"

    def parse(self, response):
        try:
            user_name = Selector(text=response.body).xpath('//*[@id="ft"]/a/@href').extract()[0]
            yield self.make_requests_from_url("https://example.com/" + user_name)
            yield MyItem(user_name)
        except Exception as e:
            pass
You can handle the other requests using a different parsing function: return a Request object and specify the callback explicitly (self.make_requests_from_url calls the parse function by default):
    Request(url=url, callback=self.parse_user_page)
Related
I’m trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I’m stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the open_in_browser function, I see that I’m logged in). I also manage to "click" on the first button on the website in the parse2 part (if I call open_in_browser after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the parse3 part, as I cannot navigate another level deeper (or maybe I can, but open_in_browser no longer opens the website - it only works if I put it after parse or parse2). My understanding is that I chain multiple parse functions one after another to navigate through the website.
Datacamp says I always need to start with a start_requests function, which is what I tried, but in the YouTube videos etc. I saw that most spiders start directly with parse functions. Using "inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5) to navigate to it, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess

class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class="headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)

process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to scrape a website completely. Let's say you want to crawl all links in the header of the website and then open that link in order to see the main page to which that link was referring.
In order to achieve this, first identify what you need to scrape, mark CSS or XPath selectors for those links, and put them in a rule. Every rule uses parse as its default callback, or you can assign it some other method. I am attaching a dummy example of creating rules, and you can map it to your case:
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item'),
)
I am new to Scrapy and Python, so I am a beginner. I want Scrapy to read a text file with a seed list of around 100k URLs, visit each URL, extract all external URLs (URLs of other sites) found on each of those seed URLs, and export the results to a separate text file.
Scrapy should only visit the URLs in the text file, not spider out and follow any other URL.
I want Scrapy to work as fast as possible; I have a very powerful server with a 1 Gbps line. Each URL in my list is from a unique domain, so I won't be hitting any one site hard and thus won't be encountering IP blocks.
How would I go about creating a project in Scrapy to extract all external links from a list of URLs stored in a text file?
Thanks.
You should use:
1. the start_requests method to read the list of URLs, and
2. a CSS or XPath selector for all "a" HTML elements.
from scrapy import Spider, Request

class YourSpider(Spider):
    name = "your_spider"

    def start_requests(self):
        with open('your_input.txt', 'r') as f:  # read the list of urls
            for url in f.readlines():  # process each of them
                yield Request(url.strip(), callback=self.parse)  # strip the trailing newline

    def parse(self, response):
        item = YourItem(parent_url=response.url)
        item['child_urls'] = response.css('a::attr(href)').extract()
        return item
More info about start_requests here:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
For extracting scraped items to another file use Item Pipeline or Feed Export. Basic pipeline example here:
http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file
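As a sketch of the pipeline route, a minimal pipeline that appends each scraped item to a JSON-lines file could look like this. The output filename is an assumption; the class must also be registered under ITEM_PIPELINES in settings.py:

```python
import json

class JsonLinesPipeline:
    """Append every scraped item to a JSON-lines file.

    "external_links.jl" is a placeholder filename; enable this class
    via ITEM_PIPELINES in the project settings.
    """

    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open("external_links.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line; dict(item) also works for Item objects
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

For this simple use case, running the spider with `scrapy crawl your_spider -o output.jl` (a Feed Export) achieves the same result without any pipeline code.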
Is there any Scrapy module available to build referrer chains while crawling URLs?
Let's say, for instance, I start my crawl from http://www.example.com and move to http://www.new-example.com, and then from http://www.new-example.com to http://very-new-example.com.
Can I create URL chains (a CSV or JSON file) like this:
http://www.example.com, http://www.new-example.com
http://www.example.com, http://www.new-example.com, http://very-new-example.com
And if there's no module or implementation available at the moment, what other options can I try?
Yes, you can keep track of referrals by making a global list that is accessible by all methods, for example:
referral_url_list = []

def call_back1(self, response):
    self.referral_url_list.append(response.url)

def call_back2(self, response):
    self.referral_url_list.append(response.url)

def call_back3(self, response):
    self.referral_url_list.append(response.url)
After spider completion, which can be detected via spider signals, you can write the CSV or JSON file in the signal handler.
I am new to Python and Scrapy. I have not used callback functions before, but I need to understand them for the code below. The first request will be executed and its response will be sent to the callback function defined as the second argument:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I am unable to understand the following things:
How is the item populated?
Does the request.meta line execute before the response.meta line in parse_page2?
Where is the returned item from parse_page2 going?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
Read the docs:
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method, which (by default) generates Requests for the URLs specified in start_urls, with the parse method as the callback function for those Requests.
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and their response handled by the specified callback.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Answers:
How is the 'item' populated? Does the request.meta line execute before the response.meta line in parse_page2?
Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in start_urls and passes them to a downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
Where is the returned item from parse_page2 going?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request not the Item, because additional info needs to be scraped from additional URL. Second callback parse_page2 returns an item, because all the info is scraped and ready to be passed to a pipeline.
Yes, Scrapy uses a Twisted reactor to call spider functions, so a single loop with a single thread ensures this ordering.
The spider function caller expects to get either item(s) or request(s) in return; requests are put in a queue for future processing, and items are sent to the configured pipelines.
Saving an item (or any other data) in request meta makes sense only if it is needed for further processing upon getting a response; otherwise it is obviously better to simply return it from parse_page1 and avoid the extra HTTP request.
In Scrapy: understanding how items and requests work between callbacks
eLRuLL's answer is wonderful.
I want to add the part about the item transfer. First, we should be clear that a callback function only runs once the response of its request has been downloaded.
In the code the Scrapy docs give, the url and request of page1 are not declared. Let's set the url of page1 as "http://www.example.com.html".
parse_page1 is the callback of
    scrapy.Request("http://www.example.com.html", callback=parse_page1)
parse_page2 is the callback of
    scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)
When the response of page1 is downloaded, parse_page1 is called to generate the request of page2:
    item['main_url'] = response.url  # send "http://www.example.com.html" to item
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item  # store item in request.meta
After the response of page2 is downloaded, parse_page2 is called to return an item:
    item = response.meta['item']
    # response.meta is equal to request.meta, so here
    # item['main_url'] == "http://www.example.com.html"
    item['other_url'] = response.url  # response.url == "http://www.example.com/some_page.html"
    return item  # finally, we get the item recording the urls of page1 and page2
I have to scrape something where part of the information is on one page, then a link on that page leads to more information, and then another URL holds the third piece of information.
How do I go about setting up my callbacks in order to have all this information together? Will I have to use a database in this case or can it still be exported to CSV?
The first thing to say is that you have the right idea - callbacks are the solution. I have seen some use of urllib or similar to fetch dependent pages, but it's far preferable to fully leverage the Scrapy download mechanism than employ some synchronous call from another library.
See this example from the Scrapy docs on the issue:
http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # parse response and populate item as required
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    # parse response and populate item as required
    item['other_url'] = response.url
    return item
Is your third piece of data on a page linked from the first page or the second page?
If from the second page, you can just extend the mechanism above and have parse_page2 return a request with a callback to a new parse_page3.
If from the first page, you could have parse_page1 populate a request.meta['link3_url'] property from which parse_page2 can construct the subsequent request url.
NB - these 'secondary' and 'tertiary' urls should not be discoverable from the normal crawling process (start_urls and rules), but should be constructed from the response (using XPath etc) in parse_page1/parse_page2.
The crawling, callback structures, pipelines and item construction are all independent of the export of data, so CSV will be applicable.