Parsing output from scrapy splash - python

I'm testing out a splash instance with scrapy 1.6 following https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash and https://aaqai.me/notes/scrapy-splash-setup. My spider:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser


class MySpider(scrapy.Spider):
    start_urls = ["http://yahoo.com"]
    name = 'mytest'

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 7.5})

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        open_in_browser(response)
        return None
The output opens up in notepad rather than a browser. How can I open this in a browser?

If you are using the Splash middleware, the Splash response goes into the regular response object, which you can access via response.css and response.xpath. Depending on which endpoint you use, you can also execute JavaScript and do other things.
If you need to move around a page or otherwise interact with it, you will need to write a Lua script and run it through the proper endpoint. As for parsing the output, it automatically goes into the response object.
Get rid of open_in_browser. I'm not exactly sure what you are doing, but if all you want to do is parse the page, you can do it like this:
body = response.css('body').extract_first()
links = response.css('a::attr(href)').extract()
If you could, please clarify your question; most people don't want to follow links to try and guess what you're having trouble with.
Update for clarified question:
It sounds like you may want scrapy shell with Splash, which will let you experiment with selectors:
scrapy shell 'http://localhost:8050/render.html?url=http://page.html&timeout=10&wait=0.5'
To access Splash in a browser, simply go to http://0.0.0.0:8050/ and enter the URL there. I'm not sure about the method in the tutorial, but this is how you can interact with the Splash session.
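The scrapy shell command above points the shell at Splash's render.html endpoint with the target page passed as a query parameter. A minimal sketch of building such a URL follows; the helper name splash_render_url and its defaults are assumptions for illustration (the host and port are taken from the command above), and the target URL is percent-encoded so that its own query string cannot be confused with Splash's parameters.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_base="http://localhost:8050",
                      wait=0.5, timeout=10):
    """Build a render.html URL that asks Splash to render page_url.

    splash_base is an assumption: adjust it to wherever Splash is running.
    """
    # urlencode percent-encodes the target URL, so characters such as
    # '&' or '?' inside it stay part of the 'url' parameter.
    query = urlencode({"url": page_url, "timeout": timeout, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    # Paste the printed URL into: scrapy shell '<url>'
    print(splash_render_url("http://yahoo.com"))
```

The printed URL can be passed to scrapy shell in quotes, or opened directly in a browser to see the rendered HTML that Splash returns.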

Related

Python - Scrapy - Navigating through a website

I’m trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I’m stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I’m logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part as I cannot navigate another level deeper (or maybe I can, but the "open_in_browser" does not open the website any more - only if I put it after parse or parse 2). My understanding is that I put multiple "parse-functions" after another to navigate through the website.
Datacamp says I always need to start with a "start request function" which is what I tried but within the YouTube videos, etc. I saw evidence that most start directly with parse functions. Using "inspect" on the website for parse 3, I see that this time href is a relative link and I used different methods (See source 5) to navigate to it as I thought this might be the source of error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess


class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)


process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to crawl a website completely. Let's say you want to crawl all the links in the header of the website and then open each link to see the page it refers to.
To achieve this, first identify what you need to scrape, work out CSS or XPath selectors for those links, and put them in rules. Rules are a CrawlSpider feature, so the spider has to subclass CrawlSpider. A rule without a callback only follows the links it extracts; give a rule a callback to parse the pages it reaches. Here is a dummy example of creating rules, which you can map onto your case:
rules = (
    Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
    Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item'),
)
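On the relative link mentioned in the question: response.urljoin resolves an href against the URL of the page it came from, in much the same way as the standard library's urllib.parse.urljoin. A minimal sketch of that resolution is below. The example.com URLs and the helper name absolute_url are made up for illustration and stand in for the real site.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href (relative or absolute) against the page it was found on."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "https://example.com/portal/home"  # hypothetical page URL
    for href in ("reports/2020", "/reports/2020", "https://other.example/x"):
        print(href, "->", absolute_url(page, href))
```

Since scrapy.Request expects an absolute URL, resolving every extracted href this way (as parse3 already does with response.urljoin) is a safe habit even when a particular link happens to be absolute.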

How to use Scrapy and Splash to crawl LeetCode

I am new to Python and web spiders. I am trying to use Scrapy and Splash to crawl dynamic pages rendered with JavaScript, such as the problems at https://leetcode.com/problemset/all/.
But when I use response.xpath("//div[@class='css-1ponsav']") on https://leetcode.com/problems/two-sum/ , it does not seem to get any information.
Similarly, on the login page https://leetcode.com/accounts/login/ , calling SplashFormRequest.from_response(response,...) to log in raises ValueError: No <form> element found in <200 >.
I don't know much about front-end development. Could this have something to do with the GraphQL used by LeetCode, or is there some other reason?
Here is the code.
# -*- coding: utf-8 -*-
import json
import scrapy
from scrapy import Request, Selector
from scrapy_splash import SplashRequest
from leetcode_problems.items import ProblemItem


class TestSpiderSpider(scrapy.Spider):
    name = 'test_spider'
    allowed_domains = ['leetcode.com']
    single_problem_url = "https://leetcode.com/problems/two-sum/"

    def start_requests(self):
        url = self.single_problem_url
        yield SplashRequest(url=url, callback=self.single_problem_parse, args={'wait': 2})

    def single_problem_parse(self, response):
        submission_page = response.xpath("//div[@data-key='submissions']/a/@href").extract_first()
        submission_text = response.xpath("//div[@data-key='submissions']//span[@class='title__qRnJ']").extract_first()
        print("submission_text:", end=' ')
        print(submission_text)  # Print Nothing
        if submission_page:
            yield SplashRequest("https://leetcode.com" + submission_page, self.empty_parse, args={'wait': 2})
I am not that familiar with Splash, but 98% of JavaScript-generated websites can be scraped by looking at the XHR filter under the Network tab for the POST or GET responses that produce the content.
In your case, I can see one response that generates the whole page without needing any special query parameters or API keys.
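Once the relevant XHR request has been found in the Network tab, the spider usually only needs to reproduce its JSON body and read the JSON that comes back. The sketch below shows that shape with the standard library. The query text, the field names (question, titleSlug, title) and SAMPLE_RESPONSE are placeholders invented for illustration, not LeetCode's actual API; copy the real request body from the browser's developer tools.

```python
import json


def build_graphql_body(query, variables):
    """Serialise a GraphQL-style POST body: a query string plus variables."""
    return json.dumps({"query": query, "variables": variables})


def get_path(data, *keys, default=None):
    """Walk nested dicts by key, returning default if any level is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


# Hypothetical response body, standing in for what the XHR call returns.
SAMPLE_RESPONSE = '{"data": {"question": {"title": "Two Sum"}}}'

if __name__ == "__main__":
    # Placeholder query; the real one comes from the Network tab.
    body = build_graphql_body(
        "query q($titleSlug: String!) { question(titleSlug: $titleSlug) { title } }",
        {"titleSlug": "two-sum"},
    )
    print(body)
    print(get_path(json.loads(SAMPLE_RESPONSE), "data", "question", "title"))
```

In a spider, a body like this would typically be sent with a POST scrapy.Request and a Content-Type: application/json header, and the callback would call json.loads on response.text; no JavaScript rendering is needed for that route.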

Scrapy request, shell Fetch() in spider

I'm trying to reach a specific page, let's call it http://example.com/puppers. This page cannot be reached when connecting directly using scrapy shell or the standard scrapy.request module (results in <405> HTTP).
However, when I use scrapy shell 'http://example.com/kittens' first, and then use fetch('http://example.com/puppers') it works and I get a <200> OK HTTP code. I can now extract data using scrapy shell.
I tried implementing this in my script by altering the referer (using url #1), the user-agent and a few other headers while connecting to the puppers (url #2) page. I still get a <405> code.
I appreciate all the help. Thank you.
start_urls = ['http://example.com/kittens']

def parse(self, response):
    yield scrapy.Request(
        url="http://example.com/puppers",
        callback=self.parse_puppers
    )

def parse_puppers(self, response):
    #process your puppers
    .....

Scrapy scraping content that is visible sometimes but not others

I am scraping some info off of zappos.com, specifically a part of the details page that displays what customers that view the current item have also viewed.
This is one such item listing:
https://www.zappos.com/p/chaco-marshall-tartan-rust/product/8982802/color/725500
The thing is that I discovered that the section I am scraping appears right away on some items, but on others it only appears after I have refreshed the page two or three times.
I am using scrapy to scrape and splash to render.
import scrapy
import re
from scrapy_splash import SplashRequest


class Scrapys(scrapy.Spider):
    name = "sqs"
    start_urls = ["https://www.zappos.com","https://www.zappos.com/marty/men-shoes/CK_XAcABAuICAgEY.zso"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        links = response.css("div._1Mgpu")
        for link in links:
            url = 'https://www.zappos.com' + link.css("a::attr(href)").extract_first()
            yield SplashRequest(url, callback=self.parse_attr,
                endpoint='render.html',
                args={'wait': 10},
            )

    def parse_attr(self, response):
        alsoviewimg = response.css("div._18jp0 div._3Olkk div.QDcUX div.slider div.slider-frame ul.slider-list li.slider-slide a img").extract()
alsoviewimg is one of the elements that I am pulling from the "Customers Who Viewed this Item Also Viewed" section. I have tested pulling this and other elements in the scrapy shell, with Splash rendering the dynamic content, and it pulled the content fine. In the spider, however, it rarely, if ever, gets any hits.
Is there something I can set so that it loads the page a couple times to get the content? Or something else that I am missing?
You should check whether the element you're looking for exists. If it doesn't, load the page again.
I'd also look into why the page needs to be refreshed several times; you might be able to solve the problem without this ad hoc multiple-refresh solution.
The question "Scrapy How to check if certain class exists in a given element" explains how to see whether a class exists.
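The "check, then load again" idea can be sketched as a bounded retry loop. The function below is framework-independent and purely illustrative: fetch and is_present are hypothetical callables standing in for "render the page" and "does the selector match anything", and the limit of three attempts is an arbitrary assumption.

```python
def fetch_until_present(fetch, is_present, max_attempts=3):
    """Call fetch() until is_present(result) is true or attempts run out.

    Returns (result, attempts_used); result is None if the content
    never appeared within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_present(result):
            return result, attempt
    return None, max_attempts


if __name__ == "__main__":
    # Simulated pages: the section only shows up on the third load.
    pages = iter(["", "", "<img src='a.png'>"])
    result, used = fetch_until_present(lambda: next(pages),
                                       lambda page: "<img" in page)
    print(result, used)
```

In a Scrapy callback the same pattern would probably mean re-yielding the request with dont_filter=True (so the duplicate filter does not drop it) and keeping the attempt count in request.meta, stopping once the limit is reached.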

Cannot get item when crawl data using scrapy

I inspected the element in Chrome:
I want to get the data in the red box (there can be more than one) using Scrapy. I used this code (following the tutorial in the Scrapy documentation):
import scrapy


class KamusSetSpider(scrapy.Spider):
    name = "kamusset_spider"
    start_urls = ['http://kbbi.web.id/' + 'abad']

    def parse(self, response):
        for kamusset in response.css("div#d1"):
            text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
            print(dict(text=text))
But there is no result:
What is happening? I changed it to this (using Splash), but it is still not working:
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

def parse(self, response):
    html = response.body
    for kamusset in response.css("div#d1"):
        text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
        print(dict(text=text))
In this case it seems that the page content is generated dynamically: even though you can see the elements when inspecting the page in a browser, they are not present in the HTML source (i.e. in what Scrapy sees). That's because Scrapy can't render JavaScript. You need some kind of browser to render the page and then pass the result to Scrapy for processing. I recommend Splash for its seamless integration with Scrapy.
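One way to confirm that an element is missing from the raw source, rather than the selector being wrong, is to search the unrendered HTML for it directly. The sketch below does that with the standard library's HTMLParser. The class and function names are invented for illustration, and the two HTML strings are made-up stand-ins for a static page and a JavaScript-built page.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag.
        if dict(attrs).get("id") == self.target_id:
            self.found = True


def has_id(html, target_id):
    """Return True if html contains an element with id == target_id."""
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.found


if __name__ == "__main__":
    static_page = '<div id="d1"><b class="tur highlight">abad</b></div>'
    js_page = '<div id="app"></div><script>render()</script>'
    print(has_id(static_page, "d1"), has_id(js_page, "d1"))
```

Running has_id(response.text, "d1") inside a plain Scrapy callback, and again on the Splash-rendered response, would show whether rendering is what makes the element appear.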
