I am reading Web Scraping with Python, 2nd Edition, and wanted to use the Scrapy module to crawl information from a webpage.
I got the following information from the documentation: https://docs.scrapy.org/en/latest/topics/request-response.html
callback (callable) – the function that will be called with the
response of this request (once it’s downloaded) as its first
parameter. For more information see Passing additional data to
callback functions below. If a Request doesn’t specify a callback, the
spider’s parse() method will be used. Note that if exceptions are
raised during processing, errback is called instead.
My understanding is that:
we pass in a url and get a resp back, like we do with the requests module
resp = requests.get(url)
then pass the resp in for data parsing
parse(resp)
The problem is:
I didn't see where resp is passed in.
Why do we need to put the self keyword before parse in the callback argument?
The self keyword is never used inside the parse function, so why bother putting it as the first parameter?
Can we extract the url from the response parameter like this: url = response.url, or should it be url = self.url?
class ArticleSpider(scrapy.Spider):
    name = 'article'

    def start_requests(self):
        urls = [
            'http://en.wikipedia.org/wiki/Python_'
            '%28programming_language%29',
            'https://en.wikipedia.org/wiki/Functional_programming',
            'https://en.wikipedia.org/wiki/Monty_Python']
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css('h1::text').extract_first()
        print('URL is: {}'.format(url))
        print('Title is: {}'.format(title))
Seems like you are missing a few concepts related to Python classes and OOP. It would be a good idea to read the Python docs, or at the very least this question.
Here is how Scrapy works: you instantiate a request object and yield it to the Scrapy scheduler.
yield scrapy.Request(url=url)  # or use return like you did
Scrapy will handle the request, download the HTML, and return everything it got back from that request to a callback function. If you didn't set a callback function on your request (like in my example above), it will call a default function called parse.
parse is a method (a.k.a. a function) of your object. You wrote it in your code above, and even if you hadn't, it would still be there, since your class inherits all functions from its parent class:
class ArticleSpider(scrapy.Spider):  # <<<<<<<< here
    name = 'article'
So a TL;DR of your questions:
1 - You didn't see it because it happens in the parent class.
2 - You need to use self. so Python knows you are referencing a method of the spider instance.
3 - The self parameter is the instance itself, and it is used by Python.
4 - response is an independent object that your parse method receives as an argument, so you can access its attributes like response.url or response.headers.
You can find more information about self here: https://docs.python.org/3/tutorial/classes.html
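To see what self is doing, here is a quick sketch (a made-up class, not Scrapy code): calling a method on an instance is just shorthand for calling the class function with the instance passed in as its first argument.

class Greeter:
    def greet(self, name):
        # `self` is the instance the method was called on
        return "Hello, {}".format(name)

g = Greeter()
print(g.greet("world"))           # Python passes `g` in as `self` automatically
print(Greeter.greet(g, "world"))  # exactly equivalent: the instance is `self`

That is also why callback=self.parse works: self.parse is the parse method bound to your particular spider instance.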
About this question:
can we extract URL from response parameter like this: url = response.url or should be url = self.url
You should use response.url to get the URL of the page you are currently crawling/parsing.
I have a scraper where I want to check the url before calling the http request and parsing. The url might be None, since it is an input arg to the call:
def start_requests(self):
    # url as input to system
    if url:
        yield scrapy.Request(url, callback=self.parse)
From the docs, the start_requests function must return an iterable of Requests. The above code works without returning any items if url is None. Is this bad practice for Scrapy?
What Scrapy does with that is:
start_requests = iter(self.spider.start_requests())
It works because of the yield keyword, which turns the function into a generator: even if url is None, an empty generator is returned, and that's why it works (and is perfectly fine). But be careful if you decide to use a list:
def start_requests(self):
    # url as input to system
    if url:
        return [scrapy.Request(url, callback=self.parse)]
It will break.
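It breaks because, when url is None, the list version falls through to an implicit return None, and Scrapy's iter() call cannot iterate over None. A minimal sketch of the difference (url hard-coded to None purely for illustration):

url = None

def with_yield():
    if url:
        yield "a request would go here"

def with_return():
    if url:
        return ["a request would go here"]
    # implicitly returns None when the condition is false

print(list(with_yield()))  # [] -- an empty generator, perfectly fine
iter(with_return())        # TypeError: 'NoneType' object is not iterable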
I'm using scrapy.Spider to scrape, and I want to make a request inside my callback function (the one set in start_requests), but that request doesn't work: it should return a response, but it only returns a Request.
I followed the debug breakpoint and found that in class Request(object_ref) the request only finishes its initialization; it never goes into request = next(slot.start_requests) as expected to actually start requesting, and thus only a Request is returned.
Here is my code in brief:
class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/" + str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)
The Request works fine up to this point, and the following is:
def parse_mashable(self, response):
    item = Item()
    json2parse = response.text
    json_response = json.loads(json2parse)
    d = json_response['dataFeed']  # a list containing dicts, in which there is a url for the detailed article
    for data in d:
        item_url = data['url']  # the url for the detailed article
        item_response = self.get_response_mashable(item_url)
        # here I want to parse the item_response to get detail
        item['content'] = item_response.xpath("//body").get
        yield item

def get_response_mashable(self, url):
    response = scrapy.Request(url)
    # using self.parser. I've also defined my own parser and yield an item
    # but the problem is it never got to the callback
    return response  # tried yield also but failed
This is where the Request doesn't work. The url is in allowed_domains, and it's not a duplicate url. I'm guessing it's because of Scrapy's asynchronous Request mechanism, but how could that affect the request in self.parse_mashable, when by then the Request from start_requests has already finished?
I managed to do the second request with Python's requests-html, but I still couldn't figure out why.
So could anyone help point out where I'm going wrong? Thanks in advance!
Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it.
What you should be doing instead is passing the data you've scraped from the original page to the new callback using the request's meta dict.
For details, check Passing additional data to callback functions.
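For instance, a minimal sketch of that pattern applied to your spider (it reuses the dataFeed structure from your code; the parse_article name and the 'title' field are assumptions, purely for illustration): scrape what you can from the listing page, attach it to the follow-up request via meta, and finish the item in the second callback.

def parse_mashable(self, response):
    # assumes `import json` and `import scrapy` at module level, as in your spider
    for data in json.loads(response.text)['dataFeed']:
        yield scrapy.Request(
            data['url'],
            callback=self.parse_article,
            meta={'feed_data': data},  # carry the listing data over to the next callback
        )

def parse_article(self, response):
    feed_data = response.meta['feed_data']
    yield {
        'url': response.url,
        'content': response.xpath('//body').get(),
        'title': feed_data.get('title'),  # hypothetical field, shown only as an example
    }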
I am trying to build a web crawler that crawls all the links on the page and adds them to a file.
My Python code contains a method that does the following:
Opens a given web page (the urllib2 module is used)
Checks if the HTTP header Content-Type contains text/html
Converts the raw HTML response into readable text and stores it in the html_string variable.
It then creates an instance of the Link_Finder class, which takes the attributes base url (Spider_url) and page url (page_url). Link_Finder is defined in another module, link_finder.py.
html_string is then fed to the class using the feed function.
The Link_Finder class is explained in detail below.
def gather_links(page_url):  # page_url is relative url
    html_string = ''
    try:
        req = urllib2.urlopen(page_url)
        head = urllib2.Request(page_url)
        if 'text/html' in head.get_header('Content-Type'):
            html_bytes = req.read()
            html_string = html_bytes.decode("utf-8")
            finder = LinkFinder(Spider.base_url, page_url)
            finder.feed(html_string)
    except Exception as e:
        print "Exception " + str(e)
        return set()
    return finder.page_links()
The link_finder.py module uses the standard Python HTMLParser and urlparse modules. The Link_Finder class inherits from HTMLParser and overrides the handle_starttag function to get all the a tags with an href attribute and add the urls to a set (self.links).
from HTMLParser import HTMLParser
import urlparse

class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):  # page_url is relative url
        super(LinkFinder, self).__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()

    def handle_starttag(self, tag, attrs):  # Override default handler methods
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    url = urlparse.urljoin(self.base_url, value)  # Get exact url
                    self.links.add(url)

    def error(self, message):
        pass

    def page_links(self):  # return set of links
        return self.links
I am getting an exception:
argument of type 'NoneType' is not iterable
I think the problem is in the way I used the urllib2 Request to check the header content.
I am a bit new to this, so some explanation would be good.
I'd have used BeautifulSoup instead of HTMLParser like so -
soup = BeautifulSoup(pageContent)
links = soup.find_all('a')
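Expanding on that a bit: the NoneType error most likely comes from head.get_header('Content-Type'), which returns headers you set on the Request yourself (none here), not the response headers, so it returns None and the in check fails. A sketch of the same method using the response headers plus BeautifulSoup (Python 2, to match the question; assumes bs4 is installed):

import urllib2
import urlparse
from bs4 import BeautifulSoup

def gather_links(base_url, page_url):
    links = set()
    try:
        resp = urllib2.urlopen(page_url)
        # read Content-Type from the *response* headers
        content_type = resp.info().getheader('Content-Type') or ''
        if 'text/html' in content_type:
            soup = BeautifulSoup(resp.read(), 'html.parser')
            for a in soup.find_all('a', href=True):
                links.add(urlparse.urljoin(base_url, a['href']))
    except Exception as e:
        print "Exception " + str(e)
    return links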
I have defined two spiders which do the following:
Spider A:
Visits the home page.
Extracts all the links from the page and stores them in a text file.
This is necessary since the home page has a More Results button which produces further links to different products.
Spider B:
Opens the text file.
Crawls the individual pages and saves the information.
I am trying to combine the two and make a crawl-spider.
The URL structure of the home page is similar to:
http://www.example.com
The URL structure of the individual pages is similar to:
http://www.example.com/Home/Detail?id=some-random-number
The text file contains the list of such URLs which are to be scraped by the second spider.
My question:
How do I combine the two spiders so as to make a single spider which does the complete scraping?
From the Scrapy documentation:
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
So what you actually need to do is, in the parse method (where you extract the links), yield a new request for each link, like:
yield self.make_requests_from_url("http://www.example.com/Home/Detail?id=some-random-number")
self.make_requests_from_url is already implemented in Spider.
Example of such a spider:
class MySpider(Spider):
    name = "my_spider"

    def parse(self, response):
        try:
            user_name = Selector(text=response.body).xpath('//*[@id="ft"]/a/@href').extract()[0]
            yield self.make_requests_from_url("https://example.com/" + user_name)
            yield MyItem(user_name)
        except Exception as e:
            pass
You can handle the other requests using a different parsing function. Do this by returning a Request object and specifying the callback explicitly (the self.make_requests_from_url function calls the parse function by default):
Request(url=url, callback=self.parse_user_page)
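Note that make_requests_from_url has since been deprecated in newer Scrapy releases. A sketch of what a single combined spider could look like using plain Requests (the selectors and the 'Home/Detail' filter are placeholders, not taken from your actual site):

import scrapy

class CombinedSpider(scrapy.Spider):
    name = "combined"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # Spider A's job: collect detail-page links from the home page
        for href in response.css('a::attr(href)').getall():
            if 'Home/Detail' in href:  # placeholder filter for product links
                yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Spider B's job: scrape the individual product page
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),  # placeholder field
        }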
I am new to Python and Scrapy. I have not used callback functions before. However, I do now for the code below. The first request will be executed, and its response will be sent to the callback function defined as the second argument:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I am unable to understand the following things:
How is the item populated?
Does the request.meta line execute before the response.meta line in parse_page2?
Where is the returned item from parse_page2 going?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
Read the docs:
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response
downloaded from those requests.
The first requests to perform are obtained by calling the
start_requests() method which (by default) generates Request for the
URLs specified in the start_urls and the parse method as callback
function for the Requests.
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both.
Those Requests will also contain a callback (maybe the same) and will
then be downloaded by Scrapy and then their response handled by the
specified callback.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever
mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file
using Feed exports.
Answers:
How is the 'item' populated? Does the request.meta line execute before the response.meta line in parse_page2?
Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in start_urls and passes them to a downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
Where is the returned item from parse_page2 going?
What is the need for the return request statement in parse_page1? I thought the extracted items needed to be returned from here.
As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request, not an Item, because additional info needs to be scraped from an additional URL. The second callback, parse_page2, returns an item, because all the info has been scraped and is ready to be passed to a pipeline.
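To make "passed to a pipeline" concrete, here is a minimal sketch of an item pipeline (the class name and file-based storage are invented for illustration; it would also need to be enabled via ITEM_PIPELINES in settings.py):

class SaveItemsPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # `item` here is the MyItem returned from parse_page2,
        # already carrying both main_url and other_url
        self.file.write(str(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()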
Yes, Scrapy uses a Twisted reactor to call spider functions, hence a single loop with a single thread ensures that:
the spider function caller expects to get either item/s or request/s in return; requests are put in a queue for future processing and items are sent to the configured pipelines;
saving an item (or any other data) in request meta makes sense only if it is needed for further processing upon getting a response; otherwise it is obviously better to simply return it from parse_page1 and avoid the extra HTTP request call, as sketched below.
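In other words, if parse_page1 already had everything it needed, a simpler sketch (reusing the names from the question) would skip the second request and the meta round-trip entirely:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # nothing left to fetch, so return the item directly instead of another Request
    return item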
In scrapy: understanding how do items and requests work between callbacks, eLRuLL's answer is wonderful.
I want to add the part about how the item is passed along. First, we should be clear that a callback function only runs once the response of its request has been downloaded.
In the code the Scrapy docs give, the url and request of page1 are not declared. Let's set the url of page1 as "http://www.example.com.html".
parse_page1 is the callback of
scrapy.Request("http://www.example.com.html", callback=parse_page1)
parse_page2 is the callback of
scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)
When the response of page1 is downloaded, parse_page1 is called to generate the request for page2:
item['main_url'] = response.url  # send "http://www.example.com.html" to item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store item in request.meta
After the response of page2 is downloaded, parse_page2 is called to return an item:
item = response.meta['item']
# response.meta is equal to request.meta, so here
# item['main_url'] = "http://www.example.com.html"
item['other_url'] = response.url  # response.url = "http://www.example.com/some_page.html"
return item  # finally, we get the item recording the urls of page1 and page2