I'm trying to write a spider that keeps clicking a next button on a webpage until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page, but it prints it only once. My question is: why isn't it following the links that each next button leads to?
class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])
To go to the next page, instead of printing the link you just need to yield a scrapy.Request object, as in the following code:
import scrapy

class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        posts = response.xpath('//div[@class="top-matter"]')
        for post in posts:
            # Get your data here
            title = post.xpath('p[@class="title"]/a/text()').extract()
            print(title)

        # Go to next page
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Update: The previous code was wrong; it needed to use the absolute URL, and some of the XPaths were wrong too. This new version should work.
Hope it helps!
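On the absolute-URL point: next_page here is a relative href, and response.urljoin resolves it against response.url before the request is made. In recent Scrapy versions (1.4+) you can also use response.follow, which accepts the relative href directly. A minimal sketch with the same XPath as above:

import scrapy

def parse(self, response):
    # Equivalent pagination step using response.follow, which resolves
    # relative URLs against response.url for you (Scrapy 1.4+).
    next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
    if next_page:
        yield response.follow(next_page, callback=self.parse)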
Related
I have created the spider you see below. I can get links from the homepage, but when I want to use them in a function, Scrapy doesn't follow the links. I don't get any HTTP or server error from the source.
class GamerSpider(scrapy.Spider):
    name = 'gamer'
    allowed_domains = ['eurogamer.net']
    start_urls = ['http://www.eurogamer.net/archive/ps4']

    def parse(self, response):
        for link in response.xpath("//h2"):
            link = link.xpath(".//a/@href").get()
            content = response.xpath("//div[@class='details']/p/text()").get()
            yield response.follow(url=link, callback=self.parse_game, meta={'url': link, 'content': content})

        next_page = 'http://www.eurogamer.net' + response.xpath("//div[@class='buttons forward']/a[@class='button next']/@href").get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_game(self, response):
        url = response.request.meta['url']
        # some things to get
        rows = response.xpath("//main")
        for row in rows:
            # some things to get
            yield {
                'url': url,
                # some things to get
            }
Any help?
I want to extract only 10 links from this site: https://dmoz-odp.org/Sports/Events/. These links can be found at the bottom of the page; some of them are AOL, Google, etc.
Here is my code:
import scrapy

class cr(scrapy.Spider):
    name = 'prcr'
    start_urls = ['https://dmoz-odp.org/Sports/Events/']

    def parse(self, response):
        items = '.alt-sites'
        for i in response.css(items):
            title = response.css('a::attr(title)').extract()
            link = response.css('a::attr(href)').extract()
            yield dict(title=title, titletext=link)
This works fine, but I need only the last 10 links to be extracted, so please tell me how to do that.
I have made a few changes to your parse method (check the code below), and this should work just fine:
def parse(self, response):
    items = '.alt-sites a'
    for i in response.css(items):
        title = i.css('::text').extract_first()
        link = i.css('::attr(href)').extract_first()
        yield dict(title=title, title_link=link)
Hope this helps you.
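If you specifically need only the last 10 links, as the question asks, one possible tweak (assuming the '.alt-sites a' selector matches more than that) is to slice the selector list before looping:

def parse(self, response):
    items = '.alt-sites a'
    # The selector list supports slicing, so [-10:] keeps only the last 10 matches
    for i in response.css(items)[-10:]:
        title = i.css('::text').extract_first()
        link = i.css('::attr(href)').extract_first()
        yield dict(title=title, title_link=link)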
I followed the documentation, but I'm still not able to crawl multiple pages.
My code looks like this:
def parse(self, response):
    for thing in response.xpath('//article'):
        item = MyItem()
        request = scrapy.Request(link,
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request

def parse_detail(self, response):
    print "here\n"
    item = response.meta['item']
    item['test'] = "test"
    yield item
Running this code will not call the parse_detail function and will not crawl any data. Any idea? Thanks!
I find that if I comment out allowed_domains it works. But that doesn't make sense, because the link definitely belongs to allowed_domains.
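One common explanation (the allowed_domains line isn't shown here, so this is an assumption) is that allowed_domains entries must be bare domain names; if an entry contains a scheme or a path, or the followed links point at a host that doesn't match, Scrapy's offsite middleware silently drops the requests, which looks exactly like parse_detail never being called. A minimal sketch of the usual setup, with example.com standing in for the real site:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    # Entries should be plain domain names: no scheme, no path.
    # 'example.com' is a placeholder for the site being crawled.
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/articles']

    def parse(self, response):
        for link in response.xpath('//article//a/@href').extract():
            # Requests to hosts outside allowed_domains are filtered by the
            # offsite middleware, so their callback is never invoked.
            yield scrapy.Request(response.urljoin(link), callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url}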
I am creating a crawler with Scrapy.
My spider must go to a start page which contains a list of links and a link to the next page.
Then it must follow each link, go to that page, get info, and return to the main page.
Finally, when the spider has followed every link on the page, it goes to the next page and begins again.
class jiwire(CrawlSpider):
    name = "example"
    allowed_domains = ["example.ndd"]
    start_urls = ["page.example.ndd"]

    rules = (Rule(SgmlLinkExtractor(allow=("next-page\.htm", ), restrict_xpaths=('//div[@class="paging"]',)), callback="parse_items", follow=True),)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//td[@class="desc"]')
        for link in links:
            link = title.select("h3/a/@href").extract()
            request = Request("http://v4.jiwire.com/" + str(name), callback=self.parse_sub)
            return(request)

    def parse_sub(self, response):
        hxs = HtmlXPathSelector(response)
        name = hxs.select('//div[@id="content"]/div[@class="header"]/h2/text()').extract()
        print name
To explain my code: I defined a rule to follow the next pages.
To follow each link on the current page, I create a Request object with the extracted link and return that object.
Normally, for each request returned, I should see "print name" fire in the parse_sub function.
But only ONE link has been followed (not all of them), and I don't understand why.
It crawls that link fine, and the Request object is created fine, but it enters parse_sub only once per page.
Can you help me?
Thanks a lot.
I am back! My problem came from my return statement.
The solution:
for link in links:
    link = title.select("h3/a/@href").extract()
    request = Request(link, callback=self.parse_hotspot)
    yield request
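For context on why this works: return exits parse_items as soon as the first request is built, whereas yield turns the method into a generator, so Scrapy receives one request per link and calls parse_sub for each of them. A rough sketch of the whole loop with yield (the XPath and URL prefix are taken from the question; the extract()[0] handling is an illustrative simplification):

def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    for cell in hxs.select('//td[@class="desc"]'):
        href = cell.select('h3/a/@href').extract()[0]
        # yield keeps the loop running, so every link on the page
        # produces its own request instead of only the first one
        yield Request("http://v4.jiwire.com/" + href, callback=self.parse_sub)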
I am getting confused about how to design the architecture of the crawler.
I have a search where I have:
pagination: next-page links to follow
a list of products on one page
individual links to be crawled to get the description
I have the following code:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    items = []
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['profile_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for the request and one for the yield.
In the crawl spider I didn't need that last yield, so everything was working fine, but in a BaseSpider I have to follow the links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
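As a complementary sketch (not the CrawlSpider approach the answer points to): in a plain spider the "two returns" conflict can often be avoided by yielding everything from the same generator, both the per-product detail requests and the next-page request. The snippet below reuses the selectors and helpers from the question (MyProduct, myfilter, parseItemDescription), replaces BeautifulSoup with an XPath that is only an assumption about the markup, and is meant as a sketch rather than a drop-in fix:

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id=\'result-set\']/li')
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
        if item['product_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            # yield instead of return, so the loop keeps producing requests
            yield request

    # The next-page request comes out of the same generator.
    # '//a[text()="Next"]/@href' is an assumed equivalent of the
    # BeautifulSoup find_all("a", text="Next") lookup above.
    next_href = hxs.select('//a[text()="Next"]/@href').extract()
    if next_href:
        yield Request(urljoin(response.url, next_href[0]), callback=self.parse_page)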