I have to scrape something where part of the information is on one page, a link on that page leads to a second page with more information, and then another URL holds the third piece of information.
How do I go about setting up my callbacks in order to have all this information together? Will I have to use a database in this case or can it still be exported to CSV?
The first thing to say is that you have the right idea: callbacks are the solution. I have seen some use of urllib or similar to fetch dependent pages, but it's far preferable to fully leverage the Scrapy download mechanism than to make a synchronous call with another library.
See this example from the Scrapy docs on the issue:
http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # parse response and populate item as required
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    # parse response and populate item as required
    item['other_url'] = response.url
    return item
Is your third piece of data on a page linked from the first page or the second page?
If from the second page, you can just extend the mechanism above and have parse_page2 return a request with a callback to a new parse_page3 (sketched below).
If from the first page, you could have parse_page1 populate a request.meta['link3_url'] property from which parse_page2 can construct the subsequent request URL.
NB - these 'secondary' and 'tertiary' URLs may not be discoverable from the normal crawling process (start_urls and rules), so they should be constructed from the response (using XPath etc.) in parse_page1/parse_page2.
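As a hedged sketch of that second-page case (parse_page3, the item field names and the XPath for the third link are illustrative, not taken from the question):

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    # extract the third URL from this response; the XPath is a placeholder
    third_url = response.xpath('//a[@id="more-details"]/@href').extract_first()
    request = Request(response.urljoin(third_url), callback=self.parse_page3)
    request.meta['item'] = item
    return request

def parse_page3(self, response):
    item = response.meta['item']
    item['third_url'] = response.url
    # populate the remaining fields, then hand the finished item over
    return item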
The crawling, callback structure, pipelines and item construction are all independent of how the data is exported, so CSV will still be applicable.
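For example, once the spider finally yields complete items, the built-in feed exporter writes them to CSV just as it would for a single-request spider (the spider name below is a placeholder):

scrapy crawl myspider -o items.csv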
I am having a problem crawling the "next" button. I tried the basic approach, but after checking the HTML code, it turns out the button uses JavaScript. I've tried different rules but nothing works. Here's the link for the website:
https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html
The next button is labeled "Load More Products".
Here's my working code:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = "https://www2.hm.com/" + product_item.css('a::attr(href)').extract_first()
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    item = {
        'title': response.xpath("normalize-space(.//h1[contains(@class, 'primary') and contains(@class, 'product-item-headline')]/text())").extract_first(),
        'sale-price': response.xpath("normalize-space(.//span[@class='price-value']/text())").extract_first(),
        'regular-price': response.xpath('//script[contains(text(), "whitePrice")]/text()').re_first("'whitePrice'\s?:\s?'([^']+)'"),
        'photo-url': response.css('div.product-detail-main-image-container img::attr(src)').extract_first(),
        'description': response.css('p.pdp-description-text::text').extract_first()
    }
    yield item
As already hinted in the comments, there's no need to involve JavaScript at all. If you visit the page and open up your browser's developer tools, you'll see there are XHR requests like this taking place:
https://www2.hm.com/en_us/sale/women/view-all/_jcr_content/main/productlisting_b48c.display.json?sort=stock&image-size=small&image=stillLife&offset=36&page-size=36
These requests return JSON data that is then rendered on the page using JavaScript. So you can just scrape the data from these URLs, parsing each response with something like json.loads(response.text). Control which products are returned with the offset and page-size parameters. I assume you are done when you receive an empty JSON response. Or, you can set offset=0 and page-size=9999 to get the data in one go (9999 is just an arbitrary number that is large enough in this particular case).
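A minimal sketch of that approach, assuming the offset/page-size parameters behave as described and that the JSON payload lists products under a key such as 'products' (the key name and spider name are assumptions to adapt to the actual response):

import json
import scrapy

class HmSaleSpider(scrapy.Spider):
    name = "hm_sale"  # placeholder name
    page_size = 36
    base_url = ("https://www2.hm.com/en_us/sale/women/view-all/_jcr_content/main/"
                "productlisting_b48c.display.json?sort=stock&image-size=small"
                "&image=stillLife&page-size=36&offset=%d")

    def start_requests(self):
        yield scrapy.Request(self.base_url % 0, callback=self.parse, meta={'offset': 0})

    def parse(self, response):
        data = json.loads(response.text)
        products = data.get('products', [])  # key name is an assumption
        if not products:
            return  # an empty JSON page means we are done
        for product in products:
            yield product
        # bump the offset to request the next page
        next_offset = response.meta['offset'] + self.page_size
        yield scrapy.Request(self.base_url % next_offset, callback=self.parse,
                             meta={'offset': next_offset})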
It's my first experience with web scraping and I'm not sure if I'm doing it well or not. The thing is, I want to crawl and scrape data at the same time:
Get all the links that I'm going to scrape
Store them into MongoDB
Visit them one by one to scrape their content
# Crawling: get all links to be scraped later on
class LinkCrawler(Spider):
    name = "link"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/offres?start=%s" % start for start in xrange(0, 10000, 20)]

    def parse(self, response):
        # loop for all pages
        next_page = Selector(response).xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract()
        if not not next_page:
            yield Request("https://" + next_page[0], callback=self.parse)

        # loop for all links in a single page
        links = Selector(response).xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')
        items = []
        for link in links:
            item = Link()
            url = response.urljoin(link.xpath('a/@href')[0].extract())
            item['url'] = url
            items.append(item)
        for item in items:
            yield item

# Scraping: get all the stored links from MongoDB and scrape them????
What exactly is your use case? Are you primarily interested in the links or in the content of the pages they lead to? I.e. is there any reason to first store the links in MongoDB and scrape the pages later? If you really need to store links in MongoDB, it's best to use an item pipeline to store the items; the item pipeline documentation even includes an example of storing items in MongoDB. If you need something more sophisticated, look at the scrapy-mongodb package.
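If you do want the MongoDB step, a minimal pipeline along the lines of the docs example could look like this (the collection name and settings keys are placeholders):

import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read connection details from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['links'].insert_one(dict(item))
        return item

Remember to enable it in ITEM_PIPELINES in settings.py.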
Other than that, here are some comments on the actual code you posted (a sketch with all of them applied follows the list):
Instead of Selector(response).xpath(...) use just response.xpath(...).
If you need only the first extracted element from selector, use extract_first() instead of using extract() and indexing.
Don't use if not not next_page:, use if next_page:.
The second loop over items is not needed; yield each item directly in the loop over links.
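Putting those four points together, the parse method might shrink to something like this (XPaths copied from the question):

def parse(self, response):
    # follow the pagination link, if there is one
    next_page = response.xpath('//li[@class="active"]/following-sibling::li[1]/a/@href').extract_first()
    if next_page:
        yield Request("https://" + next_page, callback=self.parse)

    # yield one item per link on the current page
    links = response.xpath('//div[@class="row-fluid job-details pointer"]/div[@class="bloc-right"]/div[@class="row-fluid"]')
    for link in links:
        item = Link()
        item['url'] = response.urljoin(link.xpath('a/@href').extract_first())
        yield item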
I have defined two spiders which do the following:
Spider A:
Visits the home page.
Extracts all the links from the page and stores them in a text file.
This is necessary since the home page has a More Results button which produces further links to different products.
Spider B:
Opens the text file.
Crawls the individual pages and saves the information.
I am trying to combine the two and make a crawl-spider.
The URL structure of the home page is similar to:
http://www.example.com
The URL structure of the individual pages is similar to:
http://www.example.com/Home/Detail?id=some-random-number
The text file contains the list of such URLs which are to be scraped by the second spider.
My question:
How do I combine the two spiders so as to make a single spider which does the complete scraping?
From the Scrapy documentation:
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
So what you actually need to do is, in the parse method (where you extract the links), yield a new request for each link, like:
yield self.make_requests_from_url("http://www.example.com/Home/Detail?id=some-random-number")
self.make_requests_from_url is already implemented in Spider.
An example:
class MySpider(Spider):
    name = "my_spider"

    def parse(self, response):
        try:
            user_name = Selector(text=response.body).xpath('//*[@id="ft"]/a/@href').extract()[0]
            yield self.make_requests_from_url("https://example.com/" + user_name)
            yield MyItem(user_name)
        except Exception as e:
            pass
You can handle the other requests using a different parsing function. Do it by returning a Request object and specifying the callback explicitly (the self.make_requests_from_url function calls the parse function by default):
Request(url=url, callback=self.parse_user_page)
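Applied to your two spiders, a single combined spider could extract the detail URLs in parse and hand them straight to a second callback, skipping the text file entirely. This is only a sketch: the XPath for the detail links and the fields scraped in parse_detail are placeholders.

from scrapy import Spider, Request

class CombinedSpider(Spider):
    name = "combined"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # what spider A did: collect the detail links from the home page
        for href in response.xpath('//a[contains(@href, "/Home/Detail")]/@href').extract():
            yield Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # what spider B did: scrape the individual page
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),  # placeholder field
        }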
I am new to Python and Scrapy and have not used callback functions before; however, I am trying to use them in the code below. The first request will be executed and its response will be sent to the callback function defined as the second argument:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I am unable to understand following things:
How is the item populated?
Does the request.meta line execute before the response.meta line in parse_page2?
Where is the returned item from parse_page2 going?
What is the need of the return request statement in parse_page1? I thought the extracted items need to be returned from here.
Read the docs:
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in start_urls, with the parse method as callback function for the Requests.
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
Answers:
How is the 'item' populated? Does the request.meta line execute before the response.meta line in parse_page2?
Spiders are managed by the Scrapy engine. It first makes requests from the URLs specified in start_urls and passes them to a downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.
Where is the returned item from parse_page2 going?
What is the need of the return request statement in parse_page1? I thought the extracted items needed to be returned from here?
As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request, not an Item, because additional info needs to be scraped from an additional URL. The second callback, parse_page2, returns an item, because all the info has been scraped and is ready to be passed to a pipeline.
Yes, Scrapy uses a Twisted reactor to call spider functions; using a single event loop in a single thread ensures the callbacks are invoked one at a time.
The caller of a spider callback expects to get item(s) or request(s) in return; requests are put in a queue for future processing and items are sent to the configured pipelines.
Saving an item (or any other data) in request meta makes sense only if it is needed for further processing upon getting a response; otherwise it is obviously better to simply return it from parse_page1 and avoid the extra HTTP request, as sketched below.
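For instance, if everything you need is already on page1, skip the meta round-trip entirely (a trivial sketch, reusing MyItem from the example above):

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # no second request needed, so hand the item straight to the pipelines
    return item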
In "scrapy: understanding how do items and requests work between callbacks", eLRuLL's answer is wonderful.
I want to add the part about how the item is passed along. First, we should be clear that a callback function is only called once the response for its request has been downloaded.
In the code the Scrapy docs give, the URL and request of page1 are not declared, so let's set the URL of page1 as "http://www.example.com.html".
parse_page1 is the callback of
scrapy.Request("http://www.example.com.html", callback=parse_page1)
parse_page2 is the callback of
scrapy.Request("http://www.example.com/some_page.html", callback=parse_page2)
When the response of page1 is downloaded, parse_page1 is called to generate the request for page2:
item['main_url'] = response.url  # send "http://www.example.com.html" to item
request = scrapy.Request("http://www.example.com/some_page.html",
                         callback=self.parse_page2)
request.meta['item'] = item  # store item in request.meta
After the response of page2 is downloaded, parse_page2 is called to return an item:
item = response.meta['item']
# response.meta is equal to request.meta, so here item['main_url'] = "http://www.example.com.html"
item['other_url'] = response.url  # response.url = "http://www.example.com/some_page.html"
return item  # finally, we get an item recording the URLs of page1 and page2
I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items.
Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding items. So far so good, everything works great.
But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of that page in another item field (url_contents).
And I'm not sure how to organize the code to achieve that, since the two links (the listings link and the particular item link) are followed differently, with callbacks called at different times, but I have to correlate them in the same item processing.
My code so far looks like this:
class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='),
                               restrict_xpaths='//div[@class="pagination"]'),
             callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'
        sub_selectors = main_selector.select(xpath)
        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()
After some testing and thinking, I found this solution that works for me.
The idea is to use just the first rule, that gives you listings of items, and also, very important, add follow=True to that rule.
And in parse_item() you have to yield a request instead of an item, but only after you load the item. The request is to the item detail URL. And you have to send the loaded item along in that request's meta so the callback can see it. You do your job with the response, and that is where you yield the item.
So the finish of parse_item() will look like this:
itemloaded = l.load_item()
# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback=lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded
yield request
And then parse_url_contents() will look like this:
def parse_url_contents(self, response):
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
If anyone has another (better) approach, let us know.
Stefan
I'm sitting with exactly the same problem, and from the fact that no-one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.
I'm new to Scrapy, so I wouldn't attempt it with that (although I'm sure it's possible), but my solution would be to use urllib and BeautifulSoup to load the second page manually, extract the information myself, and save it as part of the Item. Yes, more trouble than Scrapy's normal parsing, but it should get the job done with the least hassle.
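For what it's worth, that manual fetch inside parse_item would look roughly like the sketch below (written with Python 3's urllib.request; it is synchronous, so it blocks Scrapy's reactor while it runs, which is the main drawback of this approach):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_detail(url):
    # blocking fetch outside of Scrapy's downloader
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # grab the whole page text as a stand-in; real code would target specific elements
    return soup.get_text()

# inside parse_item(), after loading the item (detail_url is whatever you extracted):
# item['url_contents'] = fetch_detail(detail_url)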