scrapy crawl multiple pages using Request - python

I followed the documentation, but I'm still not able to crawl multiple pages.
My code looks like this:
def parse(self, response):
    for thing in response.xpath('//article'):
        item = MyItem()
        # 'link' is extracted from 'thing' (extraction omitted in the question)
        request = scrapy.Request(link,
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request

def parse_detail(self, response):
    print "here\n"
    item = response.meta['item']
    item['test'] = "test"
    yield item
Running this code never calls the parse_detail function and doesn't crawl any data. Any ideas? Thanks!

I've found that it works if I comment out allowed_domains. But that doesn't make sense, because the link definitely belongs to allowed_domains.
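The usual culprit here is the OffsiteMiddleware: if an allowed_domains entry contains a scheme or a path (e.g. 'http://example.com' instead of just 'example.com'), no request hostname ever matches it, and every request is silently filtered, which fits the observation that the crawl works once allowed_domains is commented out. A stdlib-only sketch of the hostname comparison (simplified for illustration, not Scrapy's actual implementation):

```python
from urllib.parse import urlparse

def url_is_allowed(url, allowed_domains):
    """Simplified sketch of the hostname check Scrapy's OffsiteMiddleware
    applies to every request (illustrative, not Scrapy's real code)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

# A bare domain matches the site and its subdomains:
ok = url_is_allowed("http://example.com/a/1", ["example.com"])

# An entry that includes a scheme never equals the hostname, so every
# request gets filtered -- the symptom described above:
bad = url_is_allowed("http://example.com/a/1", ["http://example.com"])

print(ok, bad)  # True False
```

So the fix is usually just `allowed_domains = ['example.com']` with no scheme or path.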


How to use crawled output of first scrapy spider for next scrapy spider

I am new to Scrapy and I want to do the following:
- I want to crawl a homepage and extract some specific listings
- with these listings I want to adjust the URL and crawl the new web page
Crawling First URL
class Spider1:
    start_urls = ['https://page1.org/']

    def parse(self, response):
        listings = response.css('get-listings-here').extract()
Crawling Second URL
class Spider2:
    start_urls = ['https://page1.org/listings[output_of_Spider1]']

    def parse(self, response):
        final_data = response.css('get-needed_data').extract()
        items['final'] = final_data
        yield items
Maybe it is also possible within one spider, I am not sure. But what would be the best solution for it?
Thank you!
After extracting all the links with your selector, you need to yield a Request for each of them, with a callback that will receive the HTML response:
def parse(self, response):
    yield Request('http://amazon.com/', callback=self.page)

def page(self, response):
    # your new page's HTML response arrives here
You can replace the amazon link with your extracted links.
Reference: the documentation for scrapy Request.

Any idea why Scrapy Response.follow is not following the links?

I have created a spider as you can see below. I can get links from the homepage, but when I want to use them in the callback, Scrapy doesn't follow the links. I don't get any HTTP or server errors from the source.
class GamerSpider(scrapy.Spider):
    name = 'gamer'
    allowed_domains = ['eurogamer.net']
    start_urls = ['http://www.eurogamer.net/archive/ps4']

    def parse(self, response):
        for link in response.xpath("//h2"):
            link = link.xpath(".//a/@href").get()
            content = response.xpath("//div[@class='details']/p/text()").get()
            yield response.follow(url=link, callback=self.parse_game, meta={'url': link, 'content': content})
        next_page = 'http://www.eurogamer.net' + response.xpath("//div[@class='buttons forward']/a[@class='button next']/@href").get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_game(self, response):
        url = response.request.meta['url']
        # some things to get
        rows = response.xpath("//main")
        for row in rows:
            # some things to get
            yield {
                'url': url
                # some things to get
            }
Any help?
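One thing worth ruling out, since the question doesn't show a traceback: response.xpath(...).get() returns None when the selector matches nothing, and concatenating 'http://www.eurogamer.net' + None raises a TypeError that aborts the parse generator partway through, silently dropping the remaining requests. A guard like the following (a hypothetical helper, not taken from the question's code) avoids that:

```python
def absolute_next_page(base, href):
    """Guard the pagination lookup: .get() returns None when nothing
    matches, and str + None raises TypeError mid-generator."""
    if href is None:
        return None
    return base + href

nxt = absolute_next_page("http://www.eurogamer.net", "/archive/ps4?page=2")
last = absolute_next_page("http://www.eurogamer.net", None)
print(nxt, last)  # http://www.eurogamer.net/archive/ps4?page=2 None
```

Only yield the next-page Request when the helper returns a URL instead of None.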

Best way to get follow links scrapy web crawler

So I'm trying to write a spider to continue clicking a next button on a webpage until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page but prints it only once. My question is why isn't it "following" the links that each next button leads to?
class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])
To go to the next page, instead of just printing the link you need to yield a scrapy.Request object, like in the following code:
import scrapy

class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        posts = response.xpath('//div[@class="top-matter"]')
        for post in posts:
            # Get your data here
            title = post.xpath('p[@class="title"]/a/text()').extract()
            print(title)
        # Go to next page
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Update: the previous code was wrong; it needed to use the absolute URL, and some XPaths were wrong too. This new one should work.
Hope it helps!

Only 25 entries are stored in JSON files while scraping data using Scrapy; how to increase?

I am scraping data using Scrapy into an items.json file. Data is being stored, but only 25 entries are saved, while the website has more entries. I am using the following spider:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["justdial.com"]
    start_urls = ["http://www.justdial.com/Delhi-NCR/Taxi-Services/ct-57371"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//section[@class="rslwrp"]/section')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
            items.append(item)
        return items
The command I'm using to run the script is:
scrapy crawl myspider -o items.json -t json
Is there some setting I'm not aware of? Or is the page not fully loaded before scraping starts? How do I resolve this?
Abhi, here is some code, but please note that it isn't complete and working; it is just to show you the idea. Usually you have to find the next-page URL and recreate the corresponding request in your spider. In your case, AJAX is used. I used FireBug to check which requests are sent by the site.
URL = "http://www.justdial.com/function/ajxsearch.php?national_search=0&...page=%s"  # this isn't the complete next-page URL
next_page = 2  # how to handle the next_page counter is up to you

def parse(self, response):
    hxs = Selector(response)
    sites = hxs.xpath('//section[@class="rslwrp"]/section')
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
        yield item

    # build your pagination URL and send a request
    url = self.URL % self.next_page
    yield Request(url)  # Request is the Scrapy request object here
    # increment the next_page counter if required, make additional
    # checks and actions, etc.
Hope this will help.
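With an AJAX endpoint there is no "next" link to run out of, so the spider needs its own stop condition. Stopping when a page comes back empty is one common convention (an assumption here; the real site might expose a total count or a has-more flag instead). A sketch of that decision, independent of Scrapy:

```python
def next_page_request(results_on_page, current_page, url_template):
    """Decide whether to request another AJAX page. We stop when a page
    returns no results (an assumed convention; the real endpoint may
    need a different signal such as a total count)."""
    if not results_on_page:
        return None  # empty page: assume we're past the end
    return url_template % (current_page + 1)

URL = "http://www.justdial.com/function/ajxsearch.php?page=%s"  # simplified
more = next_page_request(["item"], 2, URL)
done = next_page_request([], 3, URL)
print(more)  # http://www.justdial.com/function/ajxsearch.php?page=3
print(done)  # None
```

In the spider above, you would yield a Request only when this returns a URL, and increment self.next_page alongside it.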

Scrapy crawl all sitemap links

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:
class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print response.url
Essentially you could create new request objects to crawl the urls created by the SitemapSpider and parse the responses with a new callback:
class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print response.url
        # dont_filter=True is needed because this URL was just visited,
        # so the dupefilter would otherwise drop the request
        return Request(response.url, callback=self.parse_sitemap_url, dont_filter=True)

    def parse_sitemap_url(self, response):
        # do stuff with your sitemap links
You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many as you want.
For instance, say you have a page named http://www.xyz.nl//x/; you would create this rule:
class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of tuples - this example contains one rule;
    # the callback is given by name, as a string
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        sel = Selector(response)
        paragraph = sel.xpath('//p').extract()
        return paragraph
