Scrapy crawl http header data only - python

(How) can I achieve that Scrapy only downloads the header data of a website (for check purposes etc.)?
I've tried disabling some downloader middlewares, but it doesn't seem to work.

Like @alexce said, you can issue HEAD requests instead of the default GET:
Request(url, method="HEAD")
UPDATE: If you want to use HEAD requests for your start_urls you will need to override the make_requests_from_url method:
def make_requests_from_url(self, url):
    return Request(url, method='HEAD', dont_filter=True)
UPDATE: make_requests_from_url was removed in Scrapy 2.6.
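Since make_requests_from_url is gone in current Scrapy versions, here is a minimal sketch of the same idea using start_requests instead (the spider name and URL are placeholders):

import scrapy

class HeadCheckSpider(scrapy.Spider):
    name = 'headcheck'  # hypothetical spider name
    start_urls = ['http://example.com']

    def start_requests(self):
        # Issue HEAD requests so only the response headers are downloaded.
        for url in self.start_urls:
            yield scrapy.Request(url, method='HEAD', dont_filter=True,
                                 callback=self.parse)

    def parse(self, response):
        # Only headers are available; response.body is empty for HEAD responses.
        yield {
            'url': response.url,
            'status': response.status,
            'headers': response.headers.to_unicode_dict(),
        }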

Related

scrapy going through first link only

I am new to Scrapy and Python in general, and I am trying to make a scraper that extracts links from a page, edits those links, and then goes through each one of them. I am using Playwright with Scrapy.
This is where I am at, but for some reason it only scrapes the first link.
def parse(self, response):
    for link in response.css('div.som a::attr(href)'):
        yield response.follow(link.get().replace('docs', 'www').replace('com/', 'com/#'),
                              cookies={'__utms': '265273107'},
                              meta=dict(
                                  playwright=True,
                                  playwright_include_page=True,
                                  playwright_page_coroutines=[
                                      PageCoroutine('wait_for_selector', 'span#pple_numbers')]
                              ),
                              callback=self.parse_c)

async def parse_c(self, response):
    yield {
        'text': response.css('div.pple_numb span::text').getall()
    }
It would be nice if you could add more details about the data you are trying to get. Therefore, could you add the indicated line to see if it is going through different links?
def parse(self, response):
    for link in response.css('div.som a::attr(href)'):
        print(link)  # <-- could you add this line to check if it prints all the links?
According to the documentation, there are two related methods for following links:
follow:
Return a Request instance to follow a link url. It accepts the same
arguments as Request.__init__ method, but url can be not only an
absolute URL, but also a relative URL, a Link object, e.g. the result of Link Extractors, ...
follow_all:
A generator that produces Request instances to follow all links in
urls. It accepts the same arguments as the Request’s __init__ method,
except that each urls element does not need to be an absolute URL, it
can be any of the following: a relative URL, a Link object, e.g. the result of Link Extractors, ...
Probably, if you try your code with follow_all instead of just follow, it should do the trick.
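For example, here is a minimal sketch (not from the original answer) of the loop above rewritten with follow_all (available since Scrapy 2.0), keeping the selector, cookies and callback from the question:

def parse(self, response):
    # Build the edited URLs first, then let follow_all yield one request per URL.
    links = [
        link.get().replace('docs', 'www').replace('com/', 'com/#')
        for link in response.css('div.som a::attr(href)')
    ]
    yield from response.follow_all(
        links,
        cookies={'__utms': '265273107'},
        meta={'playwright': True, 'playwright_include_page': True},
        callback=self.parse_c,
    )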

Parsing output from scrapy splash

I'm testing out a splash instance with scrapy 1.6 following https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash and https://aaqai.me/notes/scrapy-splash-setup. My spider:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser

class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'mytest'

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 7.5})

    def parse(self, response):
        # response.body is the result of the render.html call; it
        # contains HTML processed by a browser.
        open_in_browser(response)
        return None
The output opens up in notepad rather than a browser. How can I open this in a browser?
If you are using the Splash middleware and the rest of the setup, the Splash response goes into the regular response object, which you can access via response.css and response.xpath. Depending on which endpoint you use, you can execute JavaScript and other things.
If you need to move around a page and do other interactions, you will need to write a Lua script to execute against the proper endpoint. As far as parsing the output goes, it automatically ends up in the response object.
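For instance, a minimal sketch (not from the original answer) of running a small Lua script through Splash's execute endpoint with SplashRequest; the one-second wait is an assumption:

from scrapy_splash import SplashRequest

# Hypothetical Lua script: load the page, wait briefly, return the rendered HTML.
LUA_SOURCE = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1.0)
    return splash:html()
end
"""

# Inside your spider class:
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(
            url,
            self.parse,
            endpoint='execute',
            args={'lua_source': LUA_SOURCE},
        )

The rendered HTML comes back as the response body, so response.css and response.xpath work on it as usual.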
Get rid of open_in_browser. I'm not exactly sure what you are doing, but if all you want to do is parse the page, you can do it like this:
body = response.css('body').extract_first()
links = response.css('a::attr(href)').extract()
If you could please clarify your question, that would help; most people don't want to follow links to try to guess what you're having trouble with.
Update for the clarified question:
It sounds like you may want scrapy shell with Splash; this will enable you to experiment with selectors:
scrapy shell 'http://localhost:8050/render.html?url=http://page.html&timeout=10&wait=0.5'
In order to access Splash in a browser instance, simply go to http://0.0.0.0:8050/ and input the URL there. I'm not sure about the method in the tutorial, but this is how you can interact with the Splash session.
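If you specifically want the rendered page to open in a browser from the spider, here is a minimal sketch (the output file name is a placeholder, and this is not part of the original answer) that writes the Splash-rendered HTML to disk and opens it with the standard-library webbrowser module:

import pathlib
import webbrowser

# Inside your spider:
def parse(self, response):
    # Save the Splash-rendered HTML and open it in the default web browser.
    out = pathlib.Path('rendered.html').resolve()  # hypothetical output file
    out.write_bytes(response.body)
    webbrowser.open(out.as_uri())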

Is it possible to set different settings for different request in the same Scrapy spider?

I want to use Crawlera only for some requests in a Scrapy spider. So I want to set CRAWLERA_ENABLED differently for different requests. Is it possible?
You can use the dont_proxy key in meta for those requests for which you don't want to use Crawlera. E.g.:
# Supposing you have Crawlera enabled in `settings.py`
yield scrapy.Request(
    url,
    meta={"dont_proxy": True},
    callback=self.parse
)
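"Crawlera enabled in settings.py" here assumes something like the standard scrapy-crawlera setup; a minimal sketch with a placeholder API key:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your Crawlera API key>'

With this in place, the middleware proxies every request through Crawlera unless the request carries dont_proxy in its meta, as shown above.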

Scrapy request, shell Fetch() in spider

I'm trying to reach a specific page, let's call it http://example.com/puppers. This page cannot be reached when connecting directly using scrapy shell or a standard scrapy.Request (it results in an HTTP <405>).
However, when I use scrapy shell 'http://example.com/kittens' first, and then use fetch('http://example.com/puppers') it works and I get a <200> OK HTTP code. I can now extract data using scrapy shell.
I tried implementing this in my script by altering the referer (using URL #1), the user agent and a few other headers while connecting to the puppers page (URL #2). I still get a <405> code.
I appreciate all the help. Thank you.
start_urls = ['http://example.com/kittens']

def parse(self, response):
    # The kittens page is requested first via start_urls; the puppers request
    # then goes out from the same crawl, so cookies and session state set by
    # the first response are carried over.
    yield scrapy.Request(
        url="http://example.com/puppers",
        callback=self.parse_puppers
    )

def parse_puppers(self, response):
    # process your puppers
    ...

Combining base url with resultant href in scrapy

Below is my spider code:
class Blurb2Spider(BaseSpider):
    name = "blurb2"
    allowed_domains = ["www.domain.com"]

    def start_requests(self):
        yield self.make_requests_from_url("http://www.domain.com/bookstore/new")

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
        for i in urls:
            yield Request(urlparse.urljoin('www.domain.com/', i[1:]), callback=self.parse_url)

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        print response, '------->'
Here I am trying to combine the href link with the base link, but I am getting the following error:
exceptions.ValueError: Missing scheme in request url: www.domain.com//bookstore/detail/3271993?alt=Something+I+Had+To+Do
Can anyone let me know why I am getting this error, and how to join the base URL with the href link and yield a request?
An alternative solution, if you don't want to use urlparse:
response.urljoin(i[1:])
This solution goes even a step further: here Scrapy works out the domain base for joining. And as you can see, you don't have to provide the obvious http://www.example.com for joining.
This makes your code reusable in the future if you want to change the domain you are crawling.
It is because you didn't add the scheme, e.g. http://, in your base URL.
Try: urlparse.urljoin('http://www.domain.com/', i[1:])
Or even easier: urlparse.urljoin(response.url, i[1:]), as urlparse.urljoin will sort out the base URL itself.
The best way to follow a link in Scrapy is to use response.follow(); Scrapy will handle the rest.
more info
Quote from docs:
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin.
Also, you can pass <a> element directly as argument.
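Here is a minimal sketch (reusing the selector from the question) of passing the <a> element directly to response.follow, so no urljoin is needed:

def parse(self, response):
    for a in response.xpath('//div[@class="bookListingBookTitle"]/a'):
        # response.follow accepts the <a> element itself and resolves its
        # relative href against response.url.
        yield response.follow(a, callback=self.parse_url)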
