I'm trying to reach a specific page, let's call it http://example.com/puppers. This page cannot be reached when connecting directly with scrapy shell or a standard scrapy.Request (it returns HTTP 405).
However, when I run scrapy shell 'http://example.com/kittens' first and then call fetch('http://example.com/puppers'), it works and I get HTTP 200 OK. I can then extract data in scrapy shell.
I tried to reproduce this in my script by setting the Referer header (to URL #1), the User-Agent, and a few other headers while requesting the puppers page (URL #2). I still get a 405.
I appreciate all the help. Thank you.
start_urls = ['http://example.com/kittens']

def parse(self, response):
    # The kittens page is fetched first; any cookies it sets are stored by
    # Scrapy's CookiesMiddleware and sent along with the follow-up request.
    yield scrapy.Request(
        url="http://example.com/puppers",
        callback=self.parse_puppers,
    )

def parse_puppers(self, response):
    # process your puppers
    ...
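For reference, the snippet above mimics the shell session described in the question: the first page is fetched by scrapy shell, and the second one with fetch(), which goes through the same middlewares (including cookie handling). Status codes are the ones reported by the asker:

scrapy shell 'http://example.com/kittens'
>>> fetch('http://example.com/puppers')
>>> response.status
200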
Here's what my pseudo-code looks like:
class BasicSpider(scrapy.Spider):

    def is_authed(self):
        # Check if we are logged in.

    def start_requests(self):
        # yield request to www.example.com/login, cb=login, dont_filter=True

    def login(self, response):
        # POSTs a login form to initiate login, cb=after_login

    def after_login(self, response):
        # Call is_authed() to see if the login worked, cb=read_csv

    def read_csv(self, response):
        # Reads a huge csv file and schedules each row as a search url on the website.
        ...
The code works fine, but when I pause Scrapy (Ctrl+C) and then resume it, is_authed fails because Scrapy doesn't log in again. When resuming a crawl job, Scrapy doesn't seem to care about start_requests, parse, etc.; it just starts crawling the URLs that were previously scheduled. It does seem to run next_request at some point if you define it, but by then it's too late.
My question is similar to this one: Scrapy do something before resume. It has an answer, but the answer suggests using spider_opened without saying where that should go, in a middleware or in the spider. I tried both, and they failed because Scrapy raises an error on the self.spider.requests line, saying the spider has no requests attribute.
I also tried handling the issue inside is_authed: when the function detects that we are not logged in, I yield a request with the highest priority so that Scrapy logs in before crawling the next link, but Scrapy seems to ignore it completely.
I also read the docs on middlewares; there doesn't seem to be a way to manually edit the queues. I looked into spider_opened a lot but couldn't figure out how to make the crawler run start_requests again to log in.
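One pattern that might work here (a sketch only, not verified against a resumed JOBDIR run; the login URL, priority, and method names are placeholders) is to connect the spider_opened signal inside the spider and push a fresh login request straight into the engine, which sidesteps the persisted queue's scheduling:

import scrapy
from scrapy import signals


class BasicSpider(scrapy.Spider):
    name = "basic"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_spider_opened, signal=signals.spider_opened)
        return spider

    def on_spider_opened(self, spider):
        # Build a login request every time the spider opens, including on resume.
        # dont_filter=True keeps the dupefilter persisted in JOBDIR from dropping it.
        login_request = scrapy.Request(
            "https://www.example.com/login",
            callback=self.login,
            priority=1000,
            dont_filter=True,
        )
        # Feed the request to the engine directly. Note the signature differs by
        # version: older Scrapy uses engine.crawl(request, spider), 2.10+ uses
        # engine.crawl(request).
        self.crawler.engine.crawl(login_request, spider)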
I'm trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I'm stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own:
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to be able to connect the dots.
Below is the code I currently use. I manage to log in (when I call the open_in_browser function, I see that I'm logged in). I also manage to "click" on the first button on the website in the parse2 part (if I call open_in_browser after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the parse3 part, as I cannot navigate another level deeper (or maybe I can, but open_in_browser no longer opens the website; it only does if I put it after parse or parse2). My understanding is that I chain multiple parse functions one after another to navigate through the website.
Datacamp says I always need to start with a start_requests function, which is what I tried, but in the YouTube videos etc. I saw that most spiders start directly with parse functions. Using "inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5) to convert it to an absolute URL, as I thought this might be the source of the error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess


class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        # Pull the hidden form fields required by the login form.
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class="headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)


process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules (with a CrawlSpider) in order to crawl a website more completely. Let's say you want to crawl all the links in the header of the website and then open each one to reach the page it points to.
To achieve this, first identify what you need to scrape, write CSS or XPath selectors for those links, and put them in a rule. A rule without a callback simply follows the extracted links; you can also assign a callback to process the pages it finds. I am attaching a dummy example of creating rules, which you can adapt to your case:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # Follow navigation links only (no callback).
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    # Follow product links and parse each page with parse_item.
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item'),
)
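Separately, for the relative href in parse3: response.follow (available since Scrapy 1.4) resolves relative URLs against response.url for you, so the explicit urljoin step isn't strictly required. A sketch that keeps the original XPath, untested against the actual site:

def parse3(self, response):
    # response.follow accepts a relative URL and joins it with response.url.
    next_page = response.xpath('/html//div[@class="headerPanel"]/div[3]/a/@href').extract_first()
    if next_page:
        yield response.follow(next_page, callback=self.start_scraping)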
I'm testing out a Splash instance with Scrapy 1.6, following https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash and https://aaqai.me/notes/scrapy-splash-setup. My spider:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser


class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'mytest'

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 7.5})

    def parse(self, response):
        # response.body is the result of the render.html call; it
        # contains HTML processed by a browser.
        open_in_browser(response)
        return None
The output opens up in notepad rather than a browser. How can I open this in a browser?
If you are using the Splash middleware and the rest of the setup, the Splash response ends up in the regular response object, which you can access via response.css and response.xpath. Depending on which endpoint you use, you can execute JavaScript and more.
If you need to move around a page and do other interactions, you will need to write a Lua script and run it against the proper endpoint. As for parsing the output, it goes into the response object automatically.
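For the Lua route, a minimal sketch of the execute endpoint (assuming Splash runs on localhost:8050 and the scrapy-splash middlewares are enabled in settings.py; the spider name is illustrative, and the script just loads the page, waits, and returns the HTML):

import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    splash:wait(2.0)
    return {html = splash:html()}
end
"""


class LuaSpider(scrapy.Spider):
    name = "lua_example"

    def start_requests(self):
        yield SplashRequest(
            "http://yahoo.com",
            self.parse,
            endpoint="execute",
            args={"lua_source": LUA_SCRIPT},
        )

    def parse(self, response):
        # With scrapy-splash's default response magic, the returned 'html'
        # key becomes response.body, so the usual selectors work here.
        title = response.css("title::text").extract_first()
        self.logger.info("Page title: %s", title)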
Get rid of open_in_browser. I'm not exactly sure what you are doing, but if all you want to do is parse the page, you can do it like this:
body = response.css('body').extract_first()
links = response.css('a::attr(href)').extract()
If you could please clarify your question; most people don't want to follow links and guess what you're having trouble with.
Update for clarified question:
It sounds like you may want scrapy shell with Splash, which will let you experiment with selectors:
scrapy shell 'http://localhost:8050/render.html?url=http://page.html&timeout=10&wait=0.5'
To access Splash in a browser instance, simply go to http://0.0.0.0:8050/ and enter the URL there. I'm not sure about the method in the tutorial, but this is how you can interact with the Splash session.
Given a pool of start URLs, I would like to identify the origin URL in the parse_item() function.
As far as I understand, Scrapy spiders start crawling from the initial pool of start URLs, but when parsing, there is no trace of which of those URLs was the initial one. How can I keep track of the starting point?
If you need the URL currently being parsed inside the spider, just use response.url:
def parse_item(self, response):
    print(response.url)
But in case you need it outside the spider, I can think of the following ways:
Use the Scrapy core API.
You can also call Scrapy from an external Python module via an OS command (which apparently is not recommended):
In scrapycaller.py:
from subprocess import call
urls = 'url1,url2'
cmd = 'scrapy crawl myspider -a myurls={}'.format(urls)
call(cmd, shell=True)
Inside myspider:
class mySpider(scrapy.Spider):
    def __init__(self, myurls='', **kwargs):
        super().__init__(**kwargs)  # keep the base Spider initialisation
        self.start_urls = myurls.split(",")
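If the goal is simply to know which start URL a given item page came from, one common pattern (a sketch with placeholder URLs and names, not taken from the original answer) is to tag each seed request with meta and pass that tag along to follow-up requests:

import scrapy


class OriginSpider(scrapy.Spider):
    name = "origin_example"
    start_urls = ["http://example.com/a", "http://example.com/b"]

    def start_requests(self):
        for url in self.start_urls:
            # Tag each seed request with its origin URL.
            yield scrapy.Request(url, callback=self.parse, meta={"start_url": url})

    def parse(self, response):
        start_url = response.meta["start_url"]
        for href in response.css("a::attr(href)").extract():
            # Pass the tag along so parse_item still knows the starting point.
            yield response.follow(href, callback=self.parse_item,
                                  meta={"start_url": start_url})

    def parse_item(self, response):
        self.logger.info("%s was reached from start url %s",
                         response.url, response.meta["start_url"])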
(How) can I achieve that Scrapy only downloads the header data of a website (for checking purposes etc.)?
I've tried disabling some downloader middlewares, but it doesn't seem to work.
Like @alexce said, you can issue HEAD requests instead of the default GET:
Request(url, method="HEAD")
UPDATE: If you want to use HEAD requests for your start_urls, you will need to override the make_requests_from_url method:
def make_requests_from_url(self, url):
    return Request(url, method='HEAD', dont_filter=True)
UPDATE: make_requests_from_url was removed in Scrapy 2.6.
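On current Scrapy versions, the equivalent is to override start_requests. A minimal sketch (the spider and callback names here are just illustrative):

import scrapy


class HeadCheckSpider(scrapy.Spider):
    name = "head_check"
    start_urls = ["http://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            # HEAD asks the server for the headers only, not the body.
            yield scrapy.Request(url, method="HEAD", dont_filter=True,
                                 callback=self.parse_headers)

    def parse_headers(self, response):
        self.logger.info("%s -> %s %s", response.url, response.status,
                         response.headers.get("Content-Type"))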