How to use Scrapy and Splash to crawl LeetCode - python

I am a newbie to Python and Spider. I am now trying to use Scrapy and Splash to crawl dynamic pages rendered with js, such as crawling problems from https://leetcode.com/problemset/all/.
But when I use response.xpath("//div[#class='css-1ponsav']") in https://leetcode.com/problems/two-sum/ , it seems not to get any information.
Similarly, in login interface https://leetcode.com/accounts/login/ , when you try to call SplashFormRequest.from_response(response,...) to log in, it will return ValueError: No element found in <200 >.
I don't know much about the front-end. I don't know if there is anything to do with graphQL used by LeetCode. Or for other reasons?
Here is the code.
# -*- coding: utf-8 -*-
import json
import scrapy
from scrapy import Request, Selector
from scrapy_splash import SplashRequest
from leetcode_problems.items import ProblemItem
class TestSpiderSpider(scrapy.Spider):
name = 'test_spider'
allowed_domains = ['leetcode.com']
single_problem_url = "https://leetcode.com/problems/two-sum/"
def start_requests(self):
url = self.single_problem_url
yield SplashRequest(url=url, callback=self.single_problem_parse, args={'wait': 2})
def single_problem_parse(self, response):
submission_page = response.xpath("//div[#data-key='submissions']/a/#href").extract_first()
submission_text = response.xpath("//div[#data-key='submissions']//span[#class='title__qRnJ']").extract_first()
print("submission_text:", end=' ')
print(submission_text) #Print Nothing
if submission_page:
yield SplashRequest("https://leetcode.com" + submission_page, self.empty_parse, args={'wait': 2})

I am not that familiar with Splash but 98% of websites that are Javascript generated can be scraped by looking at the XHR filter under Network tab looking for POST or GET responses that generate these outputs.
In your case I can see there is one response that generate the whole page without needing any special query parameters or API keys.

Related

Scrapy keeps crawling and never stops... CrawlSpider rules

I'm very new to python and scrapy and decided to try and built a spider instead of just being scared of the new/challenging looking language.
So this is the first spider and it's purpose :
It runs through a website's pages (through links it finds on every
page)
List all the links (a>href) that exist on every page
Writes down in each row: the page where the links were found, the links themselves
(decoded+languages), number of links on every page, and http response code of every link.
The problem I'm encountering is that it's never stopping the crawl, it seems stuck in a loop and always re-crawling every page more then once...
What did I do wrong? (obviously many things since I never wrote a python code before, but still)
How can I make the spider crawl every page only once?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urllib.parse
import requests
import threading
class TestSpider(CrawlSpider):
name = "test"
allowed_domains = ["cerve.co"]
start_urls = ["https://cerve.co"]
rules = [Rule (LinkExtractor(allow=['.*'], tags='a', attrs='href'), callback='parse_item', follow=True)]
def parse_item(self, response):
alllinks = response.css('a::attr(href)').getall()
for link in alllinks:
link = response.urljoin(link)
yield {
'page': urllib.parse.unquote(response.url),
'links': urllib.parse.unquote(link),
'number of links': len(alllinks),
'status': requests.get(link).status_code
}
Scrapy said :
By default, Scrapy filters out duplicated requests to URLs already visited. This can be configured by the setting DUPEFILTER_CLASS.
Solution 1 : https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-DUPEFILTER_CLASS
My experience with your code :
There are so many links . And i did not see any duplicates urls being visited twice.
Solutions 2 in worst case
In settings.py set DEPTH_LIMIT= some number of your choice

How I can take data from all pages?

it's first time when I'm using Scrapy framework for python.
So I made this code.
# -*- coding: utf-8 -*-
import scrapy
class SpiderSpider(scrapy.Spider):
name = 'spider'
start_urls = [
'https://www.emag.ro/televizoare/c'
]
def parse(self, response):
for i in response.xpath('//div[#class="card-section-wrapper js-section-wrapper"]'):
yield {
'product-name': i.xpath('.//a[#class="product-title js-product-url"]/text()')
.extract_first().replace('\n','')
}
next_page_url = response.xpath('//a[#class="js-change-page"]/#href').extract_first()
if next_page_url is not None:
yield scrapy.Request(response.urljoin(next_page_url))
when I'm looking at the website it has over 800 products. but my script it's only taking the first 2 pages nearly 200 products...
I tried to use css selector and xpath, both same bug.
Can anyone figure out where is the problem?
Thank you!
The website you are trying to crawl is getting data from API. When you click on the pagination link, it sends ajax request to API to fetch more products and show them on the page.
Since
Scrapy doesn't simulate the browser environment itself.
So one way would be that you
Analyse the request in your browser network tab to inspect the endpoint and parameters
Build the similar request yourself in scrapy
Call that endpoint with appropriate arguments to get the products from the API.
Also you need to extract the next page from the json response you get from the API. Usually there is a key named pagination which contains info related to total pages, next page etc.
I finally figure out how to do it.
# -*- coding: utf-8 -*-
import scrapy
from ..items import ScraperItem
class SpiderSpider(scrapy.Spider):
name = 'spider'
page_number = 2
start_urls = [
'https://www.emag.ro/televizoare/c'
]
def parse(self, response):
items = ScraperItem()
for i in response.xpath('//div[#class="card-section-wrapper js-section-wrapper"]'):
product_name = i.xpath('.//a[#class="product-title js-product-url"]/text()').extract_first().replace('\n ','').replace('\n ','')
items["product_name"] = product_name
yield items
next_page = 'https://www.emag.ro/televizoare/p' + str(SpiderSpider.page_number) + '/c'
if SpiderSpider.page_number <= 28:
SpiderSpider.page_number += 1
yield response.follow(next_page, callback = self.parse)

Python - Scrapy - Navigating through a website

I’m trying to use Scrapy to log into a website, then navigate within than website, and eventually download data from it. Currently I’m stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
Datacamp course on Scrapy
Following Pagination Links with Scrapy
http://scrapingauthority.com/2016/11/22/scrapy-login/
Scrapy - Following Links
Relative URL to absolute URL Scrapy
However, I do not seem to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I’m logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse 2, I see that the navigation bar at the top of the website has gone one level deeper.
The main problem is now in the "parse3" part as I cannot navigate another level deeper (or maybe I can, but the "open_in_browser" does not open the website any more - only if I put it after parse or parse 2). My understanding is that I put multiple "parse-functions" after another to navigate through the website.
Datacamp says I always need to start with a "start request function" which is what I tried but within the YouTube videos, etc. I saw evidence that most start directly with parse functions. Using "inspect" on the website for parse 3, I see that this time href is a relative link and I used different methods (See source 5) to navigate to it as I thought this might be the source of error.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess
class LoginNeedScraper(scrapy.Spider):
name = "login"
start_urls = ["<some website>"]
def parse(self, response):
loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/#value').extract_first()
execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/#value').extract_first()
return FormRequest.from_response(response, formdata={
'loginTicket': loginTicket,
'execution': execution,
'username': '<someusername>',
'password': '<somepassword>'},
callback=self.parse2)
def parse2(self, response):
next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/#href').extract_first()
yield scrapy.Request(url=next_page_url, callback=self.parse3)
def parse3(self, response):
next_page_url_2 = response.xpath('/html//div[#class = "headerPanel"]/div[3]/a/#href').extract_first()
absolute_url = response.urljoin(next_page_url_2)
yield scrapy.Request(url=absolute_url, callback=self.start_scraping)
def start_scraping(self, response):
open_in_browser(response)
process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
You need to define rules in order to scrape a website completely. Let's say you want to crawl all links in the header of the website and then open that link in order to see the main page to which that link was referring.
In order to achieve this, firstly identify what you need to scrape and mark CSS or XPath selectors for those links and put them in a rule. Every rule has a default callback to parse or you can also assign it to some other method. I am attaching a dummy example of creating rules, and you can map it accordingly to your case:
rules = (
Rule(LinkExtractor(restrict_css=[crawl_css_selectors])),
Rule(LinkExtractor(restrict_css=[product_css_selectors]), callback='parse_item')
)

Parsing output from scrapy splash

I'm testing out a splash instance with scrapy 1.6 following https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash and https://aaqai.me/notes/scrapy-splash-setup. My spider:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.utils.response import open_in_browser
class MySpider(scrapy.Spider):
start_urls = ["http://yahoo.com"]
name = 'mytest'
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 7.5},)
def parse(self, response):
# response.body is a result of render.html call; it
# contains HTML processed by a browser.
open_in_browser(response)
return None
The output opens up in notepad rather than a browser. How can I open this in a browser?
If you are using the splash middleware and everything the splash response goes into the regular response object with you can access via response.css and response.xpath. Depending on what endpoint you use you can execute JavaScript and other stuff.
If you need to do moving around a page and other stuff you will need to write a LUA script to execute with the proper endpoint. As far as parsing the output it automatically goes into the response object.
Get rid of open_in_browser I'm not exactly sure what you are doing but if all you want to do is parse the page you can do so like so
body = response.css('body').extract_first()
links = response.css('a::attr(href)').extract()
If you could please clarify your question most people don't want to look in links to try and guess what your having trouble with.
Update for clarified question:
It sounds like you may want scrapy shell with Splash this will enable you to experiment with selectors:
scrapy shell 'http://localhost:8050/render.html?url=http://page.html&timeout=10&wait=0.5'
In order to access Splash in a browser instance simply go to http://0.0.0.0:8050/ you input the URL in there. I'm not sure about the method in the tutorial but this is how you can interact with the Splash session.

POST request in search query with Scrapy

I am trying to use a Scrapy spider to crawl a website using a FormRequest to send a keyword to the search query on a city-specific page. Seems straightforward with what I read, but I'm having trouble. Fairly new to Python so sorry if there is something obvious I'm overlooking.
Here are the main 3 sites I was trying to use to help me:
Mouse vs Python [1]; Stack Overflow; Scrapy.org [3]
From the source code of the specific url I am crawling: www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents
From the source of the particular page I found:
<input name="dnn$ctl01$txtSearch" type="text" maxlength="255" size="20" id="dnn_ctl01_txtSearch" class="NormalTextBox" autocomplete="off" placeholder="Search..." />
Which I think the name of the search is "dnn_ct101_txtSearch" which I would use in the example I found cited as 2, and I wanted to input "toyota" as my keyword within the vehicle search.
Here is the code I have of my spider right now, and I am aware I am importing excessive stuff in the beggining:
import scrapy
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider
class LkqSpider(scrapy.Spider):
name = "lkq"
allowed_domains = ["lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents"]
start_urls = ['http://www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents/']
def start_requests(self):
return [ FormRequest("www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents",
formdata={'dnn$ctl01$txtSearch':'toyota'},
callback=self.parse) ]
def parsel(self):
print self.status
Why is it not searching or printing any kind of results, is the example I'm copying from only intended for logging in on websites not entering to searchbars?
Thanks,
Dan the newbie Python writer
Here you go :)
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser
class Cars(scrapy.Item):
Make = scrapy.Field()
Model = scrapy.Field()
Year = scrapy.Field()
Entered_Yard = scrapy.Field()
Section = scrapy.Field()
Color = scrapy.Field()
class LkqSpider(scrapy.Spider):
name = "lkq"
allowed_domains = ["lkqpickyourpart.com"]
start_urls = (
'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US',
)
def parse(self, response):
section_color = response.xpath(
'//div[#class="pypvi_notes"]/p/text()').extract()
info = response.xpath('//td["pypvi_make"]/text()').extract()
for element in range(0, len(info), 4):
item = Cars()
item["Make"] = info[element]
item["Model"] = info[element + 1]
item["Year"] = info[element + 2]
item["Entered_Yard"] = info[element + 3]
item["Section"] = section_color.pop(
0).replace("Section:", "").strip()
item["Color"] = section_color.pop(0).replace("Color:", "").strip()
yield item
# open_in_browser(response)
# inspect_response(response, self)
The page that you're trying to scrape is generated by an AJAX call.
Scrapy by default doesn't load any dynamically loaded Javascript content including AJAX. Almost all sites that load data dynamically as you scroll down the page are done using AJAX.
^^Trapping^^ AJAX call's are pretty simple using either Chrome Dev Tools or Firebug for Firefox.
All you have to do is observe the XHR requests in Chrome Dev Tools or Firebug. XHR is an AJAX request.
Here's a screen shot of how it looks:
Once you find the link, you can go change its attributes.
This is the link that the XHR request in Chrome Dev Tools gave me:
http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US
I've changed the page size to 1000 up there to give me a 1000 results per page. The default was 15.
There's also a page number over there which you would ideally increase till you capture all the data.
The web page requires javascript rendering framework to load the content in the scrapy code
Use Splash and refer the document for usage.

Categories