best practice for navigating through hrefs with scrapy - python

I am building a web scraper that downloads CSV files from a website. I have to log in to multiple user accounts in order to download all the files, and I also have to navigate through several hrefs to reach these files for each user account. I've decided to use Scrapy spiders to complete this task. Here's the code I have so far:
I store the username and password info in a dictionary:
def start_requests(self):
    yield scrapy.Request(url="https://external.lacare.org/provportal/", callback=self.login)

def login(self, response):
    for uname, upass in login_info.items():
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': uname,
                      'password': upass},
            dont_filter=True,
            callback=self.after_login
        )
I then navigate through the web pages by finding all href links in each response.
def after_login(self, response):
    hxs = scrapy.Selector(response)
    all_links = hxs.xpath('*//a/@href').extract()
    for link in all_links:
        if 'listReports' in link:
            url_join = response.urljoin(link)
            return scrapy.Request(
                url=url_join,
                dont_filter=True,
                callback=self.reports
            )
    return
def reports(self, response):
    hxs = scrapy.Selector(response)
    all_links = hxs.xpath('*//a/@href').extract()
    for link in all_links:
        url_join = response.urljoin(link)
        yield scrapy.Request(
            url=url_join,
            dont_filter=True,
            callback=self.select_year
        )
    return
I then crawl through each href on the page and check the response to see if I can keep going. This portion of the code seems excessive to me, but I am not sure how else to approach it.
def select_year(self, response):
    if '>2017' in str(response.body):
        hxs = scrapy.Selector(response)
        all_links = hxs.xpath('*//a/@href').extract()
        for link in all_links:
            url_join = response.urljoin(link)
            yield scrapy.Request(
                url=url_join,
                dont_filter=True,
                callback=self.select_elist
            )
    return
def select_elist(self, response):
    if '>Elists' in str(response.body):
        hxs = scrapy.Selector(response)
        all_links = hxs.xpath('*//a/@href').extract()
        for link in all_links:
            url_join = response.urljoin(link)
            yield scrapy.Request(
                url=url_join,
                dont_filter=True,
                callback=self.select_company
            )
Everything works fine, but as I said, it does seem excessive to crawl through each href on the page. I wrote a script for this website in Selenium and was able to select the correct hrefs by using the select_by_partial_link_text() method. I've searched for something comparable in Scrapy, but it seems like Scrapy navigation is based strictly on XPath and CSS selectors.
Is this how Scrapy is meant to be used in this scenario? Is there anything I can do to make the scraping process less redundant?
This is my first working scrapy spider, so go easy on me!

If you need to extract only the links whose text contains a certain substring, you can use LinkExtractor with the following XPath:
LinkExtractor(restrict_xpaths='//a[contains(text(), "substring to find")]').extract_links(response)
LinkExtractor is the proper way to extract and process links in Scrapy.
Docs: https://doc.scrapy.org/en/latest/topics/link-extractors.html
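Applied to the spider in the question, the after_login step could look roughly like the sketch below. This is only an illustration: the 'listReports' string comes from your code, and since it appears there in the href rather than in the link text, the allow regex (which filters on the URL) is probably the better fit; restrict_xpaths on the anchor text is the alternative.
from scrapy.linkextractors import LinkExtractor

def after_login(self, response):
    # Filter on the URL itself, since 'listReports' appears in the href.
    # If the substring were in the visible link text instead, use
    # LinkExtractor(restrict_xpaths='//a[contains(text(), "listReports")]').
    for link in LinkExtractor(allow=r'listReports').extract_links(response):
        yield scrapy.Request(link.url, dont_filter=True, callback=self.reports)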

Related

Any idea why Scrapy Response.follow is not following the links?

I have created a spider, as you can see below. I can get links from the homepage, but when I want to use them in a function, Scrapy doesn't follow the links. I don't get any HTTP or server errors from the source.
class GamerSpider(scrapy.Spider):
    name = 'gamer'
    allowed_domains = ['eurogamer.net']
    start_urls = ['http://www.eurogamer.net/archive/ps4']

    def parse(self, response):
        for link in response.xpath("//h2"):
            link = link.xpath(".//a/@href").get()
            content = response.xpath("//div[@class='details']/p/text()").get()
            yield response.follow(url=link, callback=self.parse_game, meta={'url': link, 'content': content})
        next_page = 'http://www.eurogamer.net' + response.xpath("//div[@class='buttons forward']/a[@class='button next']/@href").get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_game(self, response):
        url = response.request.meta['url']
        # some things to get
        rows = response.xpath("//main")
        for row in rows:
            # some things to get
            yield {
                'url': url
                # some things to get
            }
Any help?

Scrapy following link not getting data

I am trying to follow a list of links and scrape data from each link with a simple Scrapy spider, but I am having trouble.
In the Scrapy shell, when I recreate the script, it sends the GET request for the new URL, but when I run the crawl I do not get any data back from the link. The only data I get back is from the starting URL that was scraped before following the link.
How do I scrape data from the link?
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "players"
    start_urls = ['http://wiki.teamliquid.net/counterstrike/Portal:Teams']

    def parse(self, response):
        teams = response.xpath('//*[@id="mw-content-text"]/table[1]')
        for team in teams.css('span.team-template-text'):
            yield {
                'teamName': team.css('a::text').extract_first()
            }
        urls = teams.css('span.team-template-text a::attr(href)')
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_team_info)

    def parse_team_info(self, response):
        yield {
            'Test': response.css('span::text').extract_first()
        }
Instead of using
url = response.urljoin(url)
yield scrapy.Request(url, callback=self.parse_team_info)
use
yield response.follow(url, callback=self.parse_team_info)
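The reason this helps: response.follow accepts relative URLs (and even link selectors) and resolves them against the current page, so the manual urljoin step isn't needed. A minimal sketch of the loop from the question with that change, using the same selectors as above:
def parse(self, response):
    teams = response.xpath('//*[@id="mw-content-text"]/table[1]')
    for team in teams.css('span.team-template-text'):
        yield {'teamName': team.css('a::text').extract_first()}
    # response.follow resolves the relative href against response.url,
    # so there is no need to call response.urljoin() first.
    for href in teams.css('span.team-template-text a::attr(href)').extract():
        yield response.follow(href, callback=self.parse_team_info)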

Best way to get follow links scrapy web crawler

So I'm trying to write a spider to continue clicking a next button on a webpage until it can't anymore (or until I add some logic to make it stop). The code below correctly gets the link to the next page but prints it only once. My question is why isn't it "following" the links that each next button leads to?
class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])
To go to the next page, instead of printing the link you just need to yield a scrapy.Request object like the following code:
import scrapy


class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        posts = response.xpath('//div[@class="top-matter"]')
        for post in posts:
            # Get your data here
            title = post.xpath('p[@class="title"]/a/text()').extract()
            print(title)
        # Go to next page
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Update: the previous code was wrong; it needed to use the absolute URL, and some XPaths were wrong as well. This new version should work.
Hope it helps!
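As a side note, the urljoin + Request pair at the end could also be written with response.follow, which resolves the relative href for you. This is just an equivalent variant of the pagination step above, not a required change:
next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
if next_page:
    yield response.follow(next_page, callback=self.parse)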

How do I find external 404s

I'm building a scraper with scrapy that should crawl an entire domain looking for broken EXTERNAL links.
I have the following:
class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            resp = scrapy.Request(link.url, callback=self.parse_ext)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s', response.url)
When I run this code, it never reaches the parse_ext() function, where I would like to get the HTTP status code and do further processing based on it.
You can see I have used parse_ext() as the callback when looping over the extracted links on the page in the parse_item() function.
What am I doing wrong?
You are not returning the Request instances from the callback:
def parse_item(self, response):
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        yield scrapy.Request(link.url, callback=self.parse_ext)

def parse_ext(self, response):
    self.logger.info('>>>>>>>>>> Reading: %s', response.url)
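One extra detail worth checking for a 404 hunter: by default, Scrapy's HttpErrorMiddleware filters out non-2xx responses before they reach the callback, so even with the requests yielded, parse_ext may never see the broken links. A possible addition, sketched below; the status list is an assumption about which errors you want to record:
class domainget(CrawlSpider):
    # ... name, allowed_domains, start_urls and rules as in the question ...

    # Let 404 responses through to the callback instead of having
    # HttpErrorMiddleware drop them silently.
    handle_httpstatus_list = [404]

    def parse_ext(self, response):
        if response.status == 404:
            self.logger.info('Broken external link: %s', response.url)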

Only 25 entries are stored in JSON files while scraping data using Scrapy; how to increase?

I am scraping data using Scrapy into an item.json file. The data is getting stored, but only 25 entries are stored, while the website has more entries than that. I am using the following code:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["justdial.com"]
    start_urls = ["http://www.justdial.com/Delhi-NCR/Taxi-Services/ct-57371"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//section[@class="rslwrp"]/section')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
            items.append(item)
        return items
The command I'm using to run the script is:
scrapy crawl myspider -o items.json -t json
Is there any setting that I am not aware of? Or is the page not getting fully loaded before scraping? How do I resolve this?
Abhi, here is some code, but please note that it isn't complete and working; it is just to show you the idea. Usually you have to find the next-page URL and recreate the appropriate request in your spider. In your case AJAX is used. I used FireBug to check which requests are sent by the site.
URL = "http://www.justdial.com/function/ajxsearch.php?national_search=0&...page=%s" # this isn't the complete next page URL
next_page = 2 # how to handle next_page counter is up to you
def parse(self, response):
hxs = Selector(response)
sites = hxs.xpath('//section[#class="rslwrp"]/section')
for site in sites:
item = DmozItem()
item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
yield item
# build you pagination URL and send a request
url = self.URL % self.next_page
yield Request(url) # Request is Scrapy request object here
# increment next_page counter if required, make additional
# checks and actions etc
Hope this will help.
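For completeness, here is one way the counter and a stopping condition could be handled. It is only a sketch: the empty-result check and the use of request meta to carry the page number are assumptions, since the real AJAX response format isn't shown here.
def parse(self, response):
    sites = response.xpath('//section[@class="rslwrp"]/section')
    if not sites:
        # Assumed stopping condition: an empty result page means the end.
        return
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
        yield item
    # Carry the page counter in request meta so each response knows
    # which page number to ask for next.
    page = response.meta.get('page', 1) + 1
    yield Request(self.URL % page, meta={'page': page})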
