I'm building a scraper with scrapy that should crawl an entire domain looking for broken EXTERNAL links.
I have the following:
class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            resp = scrapy.Request(link.url, callback=self.parse_ext)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s', response.url)
When I run this code, it never reaches the parse_ext() function, where I would like to get the HTTP status code and do further processing based on it.
You can see I have set parse_ext() as the callback when looping over the links extracted from the page in the parse_item() function.
What am I doing wrong?
You are not returning the Request instances from the callback:
def parse_item(self, response):
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        yield scrapy.Request(link.url, callback=self.parse_ext)

def parse_ext(self, response):
    self.logger.info('>>>>>>>>>> Reading: %s', response.url)
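If the goal is also to record the HTTP status of each external link, a minimal sketch along the following lines could work. Note that handle_httpstatus_all, the errback, and the switch to deny_domains (which takes domain names, whereas deny expects regex patterns) are additions that go beyond the original answer:

def parse_item(self, response):
    # Extract links pointing outside the allowed domain and check each one.
    for link in LinkExtractor(deny_domains=self.allowed_domains).extract_links(response):
        yield scrapy.Request(
            link.url,
            callback=self.parse_ext,
            errback=self.on_error,                 # connection-level failures (DNS errors, timeouts, ...)
            meta={'handle_httpstatus_all': True},  # let 4xx/5xx responses reach the callback
        )

def parse_ext(self, response):
    if response.status >= 400:
        self.logger.warning('Broken link (%d): %s', response.status, response.url)

def on_error(self, failure):
    self.logger.warning('Request failed: %s', failure.request.url)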
I have created the spider you can see below. I can get links from the homepage, but when I want to use them in a function, Scrapy doesn't follow the links. I don't get any HTTP or server errors from the source.
class GamerSpider(scrapy.Spider):
    name = 'gamer'
    allowed_domains = ['eurogamer.net']
    start_urls = ['http://www.eurogamer.net/archive/ps4']

    def parse(self, response):
        for link in response.xpath("//h2"):
            link = link.xpath(".//a/@href").get()
            content = response.xpath("//div[@class='details']/p/text()").get()
            yield response.follow(url=link, callback=self.parse_game, meta={'url': link, 'content': content})

        next_page = 'http://www.eurogamer.net' + response.xpath("//div[@class='buttons forward']/a[@class='button next']/@href").get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_game(self, response):
        url = response.request.meta['url']
        # some things to get
        rows = response.xpath("//main")
        for row in rows:
            # some things to get
            yield {
                'url': url
                # some things to get
            }
Any help?
I haven't written any Python code in over 10 years, so I'm trying to use Scrapy to assemble some information from a website:
import scrapy
import sys

class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
I'm trying to parse the URL in the parse method. What I can't figure out is why those simple print statements won't print anything to stdout; they are silent. There doesn't seem to be a way to echo anything back to the console from there, and I am very curious about what I am missing.
Both requests your spider makes receive 404 Not Found responses. By default, Scrapy ignores responses with such a status, so your callback doesn't get called.
In order to have your self.parse callback called for such responses, you have to add the 404 status code to the list of handled status codes using the handle_httpstatus_list meta key (more info in the HttpErrorMiddleware documentation).
You could change your start_requests method so that the requests instruct Scrapy to handle even 404 responses:
import scrapy
import sys

class TutorialSpider(scrapy.Spider):
    name = "tutorial"

    def start_requests(self):
        urls = [
            'https://example.com/page/1',
            'https://example.com/page/2',
        ]
        for url in urls:
            print(f'{self.name} spider')
            print(f'url is {url}')
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'handle_httpstatus_list': [404]},
            )

    def parse(self, response):
        print(response.url)
        self.log(response.url)
        sys.stdout.write('hello')
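As a side note not covered above, handle_httpstatus_list can also be set as a spider-wide attribute instead of per-request meta; a minimal sketch:

class TutorialSpider(scrapy.Spider):
    name = "tutorial"
    # Spider-level equivalent of the per-request meta key above:
    # 404 responses will be passed to callbacks instead of being dropped.
    handle_httpstatus_list = [404]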
I am building a web scraper that downloads csv files from a website. I have to login to multiple user accounts in order to download all the files. I also have to navigate through several hrefs to reach these files for each user account. I've decided to use Scrapy spiders in order to complete this task. Here's the code I have so far:
I store the username and password info in a dictionary
def start_requests(self):
    yield scrapy.Request(url="https://external.lacare.org/provportal/", callback=self.login)

def login(self, response):
    for uname, upass in login_info.items():
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': uname,
                      'password': upass,
                      },
            dont_filter=True,
            callback=self.after_login
        )
I then navigate through the web pages by finding all href links in each response.
def after_login(self, response):
    hxs = scrapy.Selector(response)
    all_links = hxs.xpath('*//a/@href').extract()
    for link in all_links:
        if 'listReports' in link:
            url_join = response.urljoin(link)
            return scrapy.Request(
                url=url_join,
                dont_filter=True,
                callback=self.reports
            )
    return

def reports(self, response):
    hxs = scrapy.Selector(response)
    all_links = hxs.xpath('*//a/@href').extract()
    for link in all_links:
        url_join = response.urljoin(link)
        yield scrapy.Request(
            url=url_join,
            dont_filter=True,
            callback=self.select_year
        )
    return
I then crawl through each href on the page and check the response to see if I can keep going. This portion of the code seems excessive to me, but I am not sure how else to approach it.
def select_year(self, response):
    if '>2017' in str(response.body):
        hxs = scrapy.Selector(response)
        all_links = hxs.xpath('*//a/@href').extract()
        for link in all_links:
            url_join = response.urljoin(link)
            yield scrapy.Request(
                url=url_join,
                dont_filter=True,
                callback=self.select_elist
            )
    return

def select_elist(self, response):
    if '>Elists' in str(response.body):
        hxs = scrapy.Selector(response)
        all_links = hxs.xpath('*//a/@href').extract()
        for link in all_links:
            url_join = response.urljoin(link)
            yield scrapy.Request(
                url=url_join,
                dont_filter=True,
                callback=self.select_company
            )
Everything works fine, but as I said, it does seem excessive to crawl through every href on the page. I wrote a script for this website in Selenium and was able to select the correct hrefs by using the select_by_partial_link_text() method. I've searched for something comparable in Scrapy, but it seems like Scrapy navigation is based strictly on XPath and CSS selectors.
Is this how Scrapy is meant to be used in this scenario? Is there anything I can do to make the scraping process less redundant?
This is my first working Scrapy spider, so go easy on me!
If you need to extract only the links whose text contains a certain substring, you can use LinkExtractor with the following XPath:
LinkExtractor(restrict_xpaths='//a[contains(text(), "substring to find")]').extract_links(response)
as LinkExtractor is the proper way to extract and process links in Scrapy.
Docs: https://doc.scrapy.org/en/latest/topics/link-extractors.html
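Applied to the question, one of the intermediate callbacks could be condensed roughly like this (the "2017" link text and the callback name are taken from the question; whether they match the real page markup is an assumption):

def select_year(self, response):
    # Follow only the links whose visible text mentions the year we want,
    # instead of requesting every href on the page.
    extractor = LinkExtractor(restrict_xpaths='//a[contains(text(), "2017")]')
    for link in extractor.extract_links(response):
        yield scrapy.Request(
            url=link.url,
            dont_filter=True,
            callback=self.select_elist
        )

LinkExtractor returns absolute URLs, so the response.urljoin calls are no longer needed.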
I followed the documentation, but I'm still not able to crawl multiple pages.
My code looks like this:
def parse(self, response):
    for thing in response.xpath('//article'):
        item = MyItem()
        request = scrapy.Request(link,
                                 callback=self.parse_detail)
        request.meta['item'] = item
        yield request

def parse_detail(self, response):
    print "here\n"
    item = response.meta['item']
    item['test'] = "test"
    yield item
Running this code does not call the parse_detail function and does not crawl any data. Any idea? Thanks!
I find that if I comment out allowed_domains it works, but that doesn't make sense, because the link certainly belongs to allowed_domains.
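For reference, a minimal sketch of the usual cause of this symptom (the domain is a placeholder, since the real one isn't shown in the question): OffsiteMiddleware compares each request's host against allowed_domains and filters out anything that doesn't match, and the entries must be bare domain names rather than full URLs:

class MySpider(scrapy.Spider):
    name = 'myspider'
    # Correct: a bare domain name; subdomains such as www.example.com still match.
    allowed_domains = ['example.com']
    # Incorrect: a full URL here would cause every request to be filtered as offsite.
    # allowed_domains = ['http://www.example.com/']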
I am using Scrapy to crawl an entire website, including images, CSS, JavaScript, and external links. I've noticed that Scrapy's default CrawlSpider only processes HTML responses and ignores external links. So I tried overriding the _requests_to_follow method and removing the check at its beginning, but that didn't work. I also tried using a process_request method to allow all requests, but that failed too. Here's my code:
class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (Rule(LinkExtractor(), callback='parse_item', follow=False,
                  process_request='process_request'),)

    def parse_item(self, response):
        node = Node()
        node['location'] = response.url
        node['content_type'] = response.headers['Content-Type']
        yield node

        link = Link()
        link['source'] = response.request.headers['Referer']
        link['destination'] = response.url
        yield link

    def process_request(self, request):
        # Allow everything
        return request

    def _requests_to_follow(self, response):
        # There used to be a check for Scrapy's HtmlResponse response here
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)
The idea is to build a graph of the domain, that's why my parse_item yields a Node object with the resource's location and type, and a Link object to keep track of the relations between nodes. External pages should have their node and link information retrieved but they shouldn't be crawled, of course.
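For context, the Node and Link items used in parse_item could be declared as ordinary Scrapy items roughly like this (the field names are inferred from the code above; the actual definitions aren't shown here):

import scrapy

class Node(scrapy.Item):
    # One entry per crawled resource.
    location = scrapy.Field()      # the resource's URL
    content_type = scrapy.Field()  # value of the Content-Type response header

class Link(scrapy.Item):
    # One entry per edge between two resources.
    source = scrapy.Field()        # the referring page (Referer request header)
    destination = scrapy.Field()   # the linked resource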
Thanks in advance for your help.