I am trying to use a Scrapy spider to crawl a website, using a FormRequest to send a keyword to the search box on a city-specific page. It seems straightforward from what I've read, but I'm having trouble. I'm fairly new to Python, so sorry if there is something obvious I'm overlooking.
Here are the three main sites I was trying to use to help me:
Mouse vs Python [1]; Stack Overflow; Scrapy.org [3]
In the source code of the specific URL I am crawling, www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents, I found:
<input name="dnn$ctl01$txtSearch" type="text" maxlength="255" size="20" id="dnn_ctl01_txtSearch" class="NormalTextBox" autocomplete="off" placeholder="Search..." />
From which I think the name of the search field is "dnn$ctl01$txtSearch", which I would use as in the example cited as [2], and I wanted to input "toyota" as my keyword for the vehicle search.
Here is the code I have for my spider right now; I am aware I am importing excessive stuff at the beginning:
import scrapy
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents"]
    start_urls = ['http://www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents/']

    def start_requests(self):
        return [FormRequest("www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents",
                            formdata={'dnn$ctl01$txtSearch': 'toyota'},
                            callback=self.parse)]

    def parsel(self):
        print self.status
Why is it not searching or printing any kind of results? Is the example I'm copying from only intended for logging in on websites, not for entering text into search bars?
Thanks,
Dan the newbie Python writer
Here you go :)
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser


class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US',
    )

    def parse(self, response):
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            yield item
        # open_in_browser(response)
        # inspect_response(response, self)
The page that you're trying to scrape is generated by an AJAX call.
Scrapy by default doesn't load any dynamically generated JavaScript content, including AJAX. Almost all sites that load data dynamically as you scroll down the page do so using AJAX.
Trapping AJAX calls is pretty simple using either Chrome Dev Tools or Firebug for Firefox.
All you have to do is observe the XHR requests in Chrome Dev Tools or Firebug; XHR is an AJAX request.
Once you find the link, you can change its query parameters.
This is the link that the XHR request in Chrome Dev Tools gave me:
http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US
I've changed the page size to 1000 in that URL to get 1000 results per page; the default was 15.
There's also a page parameter which you would ideally increment until you've captured all the data, as in the sketch below.
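If you'd rather page through the results than rely on a huge pageSize, a minimal sketch of that loop might look like this (the stop condition assumes an empty page once the inventory runs out, and the item extraction is elided since it would be the same as in the spider above):

import scrapy


class LkqPagedSpider(scrapy.Spider):
    name = "lkq_paged"
    allowed_domains = ["lkqpickyourpart.com"]
    base_url = (
        "http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/"
        "getVehicleInventory.aspx?store=224&page={page}&filter=toyota"
        "&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US"
    )

    def start_requests(self):
        # Start at page 0; parse() keeps requesting the next page number.
        yield scrapy.Request(self.base_url.format(page=0), meta={"page": 0})

    def parse(self, response):
        notes = response.xpath('//div[@class="pypvi_notes"]/p/text()').extract()
        if not notes:
            return  # an empty page means we have captured all the data
        # ... build and yield Cars() items here, exactly as in the spider above ...
        next_page = response.meta["page"] + 1
        yield scrapy.Request(self.base_url.format(page=next_page),
                             meta={"page": next_page})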
The web page requires a JavaScript rendering engine to load its content, which Scrapy alone does not provide.
Use Splash and refer to its documentation for usage.
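For reference, a minimal scrapy-splash sketch (assuming Splash is running locally on port 8050 and the scrapy-splash downloader middlewares are enabled in settings.py; the URL is just the page from the question):

import scrapy
from scrapy_splash import SplashRequest


class JsRenderedSpider(scrapy.Spider):
    name = "js_rendered"
    # Requires SPLASH_URL = 'http://localhost:8050' and the scrapy-splash
    # middlewares to be configured in settings.py.

    def start_requests(self):
        # 'wait' gives the page time to finish its AJAX calls before Splash
        # returns the rendered HTML.
        yield SplashRequest(
            "http://www.lkqpickyourpart.com/locations/LKQ_Self_Service_-_Gainesville-224/recents/",
            callback=self.parse,
            args={"wait": 2},
        )

    def parse(self, response):
        # response.text now contains the JavaScript-rendered HTML.
        self.logger.info("Rendered page length: %d", len(response.text))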
Related
It's the first time I'm using the Scrapy framework for Python, so I wrote this code.
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            yield {
                'product-name': i.xpath('.//a[@class="product-title js-product-url"]/text()')
                                 .extract_first().replace('\n', '')
            }

        next_page_url = response.xpath('//a[@class="js-change-page"]/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
When I look at the website it has over 800 products, but my script only takes the first 2 pages, nearly 200 products...
I tried using both CSS selectors and XPath, with the same bug.
Can anyone figure out where the problem is?
Thank you!
The website you are trying to crawl gets its data from an API. When you click on a pagination link, it sends an AJAX request to the API to fetch more products and show them on the page.
Since Scrapy doesn't simulate the browser environment itself, one way would be to:
Analyse the request in your browser's network tab to inspect the endpoint and parameters.
Build a similar request yourself in Scrapy.
Call that endpoint with appropriate arguments to get the products from the API.
You also need to extract the next page from the JSON response you get from the API. Usually there is a key named pagination which contains info about total pages, the next page, etc. A rough sketch of this approach is shown below.
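Purely for illustration (the endpoint URL, its parameters, and the JSON keys 'items', 'name', 'pagination' and 'total_pages' are all placeholders; inspect your own network tab for the real ones):

import json

import scrapy


class ApiSpider(scrapy.Spider):
    name = "api_products"
    # Hypothetical endpoint discovered in the browser's network tab.
    api_url = "https://www.example.com/api/products?category=tv&page={page}"

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1),
                             callback=self.parse_api,
                             meta={"page": 1})

    def parse_api(self, response):
        data = json.loads(response.text)
        for product in data.get("items", []):
            yield {"product-name": product.get("name")}

        # The pagination block usually tells you whether more pages exist.
        pagination = data.get("pagination", {})
        if response.meta["page"] < pagination.get("total_pages", 0):
            next_page = response.meta["page"] + 1
            yield scrapy.Request(self.api_url.format(page=next_page),
                                 callback=self.parse_api,
                                 meta={"page": next_page})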
I finally figured out how to do it.
# -*- coding: utf-8 -*-
import scrapy

from ..items import ScraperItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    page_number = 2
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        items = ScraperItem()
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            product_name = i.xpath('.//a[@class="product-title js-product-url"]/text()').extract_first().replace('\n ', '').replace('\n ', '')
            items["product_name"] = product_name
            yield items

        next_page = 'https://www.emag.ro/televizoare/p' + str(SpiderSpider.page_number) + '/c'
        if SpiderSpider.page_number <= 28:
            SpiderSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
I am a newbie to Python and spiders. I am now trying to use Scrapy and Splash to crawl dynamic pages rendered with JS, such as crawling problems from https://leetcode.com/problemset/all/.
But when I use response.xpath("//div[@class='css-1ponsav']") on https://leetcode.com/problems/two-sum/ , it doesn't seem to get any information.
Similarly, on the login page https://leetcode.com/accounts/login/ , when I try to call SplashFormRequest.from_response(response, ...) to log in, it returns ValueError: No element found in <200 >.
I don't know much about the front end, so I don't know whether this has anything to do with the GraphQL used by LeetCode, or whether there is some other reason.
Here is the code.
# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request, Selector
from scrapy_splash import SplashRequest

from leetcode_problems.items import ProblemItem


class TestSpiderSpider(scrapy.Spider):
    name = 'test_spider'
    allowed_domains = ['leetcode.com']
    single_problem_url = "https://leetcode.com/problems/two-sum/"

    def start_requests(self):
        url = self.single_problem_url
        yield SplashRequest(url=url, callback=self.single_problem_parse, args={'wait': 2})

    def single_problem_parse(self, response):
        submission_page = response.xpath("//div[@data-key='submissions']/a/@href").extract_first()
        submission_text = response.xpath("//div[@data-key='submissions']//span[@class='title__qRnJ']").extract_first()
        print("submission_text:", end=' ')
        print(submission_text)  # Prints nothing
        if submission_page:
            yield SplashRequest("https://leetcode.com" + submission_page, self.empty_parse, args={'wait': 2})
I am not that familiar with Splash, but 98% of JavaScript-generated websites can be scraped by looking at the XHR filter under the Network tab for the POST or GET requests that generate these outputs.
In your case I can see there is one request that generates the whole page without needing any special query parameters or API keys; a sketch of replaying such a request directly is shown below.
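As a rough illustration of replaying that request directly in Scrapy (the query body and field names below are assumptions, so copy the real payload from the XHR entry in your network tab):

import json

import scrapy


class LeetcodeApiSpider(scrapy.Spider):
    name = "leetcode_api"
    allowed_domains = ["leetcode.com"]

    def start_requests(self):
        # Hypothetical query shape; the real body should be copied from the
        # XHR request your browser sends for the problem page.
        payload = {
            "query": 'query { question(titleSlug: "two-sum") { title content } }'
        }
        yield scrapy.Request(
            "https://leetcode.com/graphql",
            method="POST",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json"},
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        # Log the top-level keys so you can see what came back.
        self.logger.info("Returned keys: %s", list(data.get("data", {}).keys()))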
I was trying to scrape a link which uses an AJAX call for pagination.
I am trying to crawl the http://www.demo.com link, and in my .py file I provided this code for the restricted XPath; the code is:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import sumSpider, Rule
from scrapy.selector import HtmlXPathSelector
from sum.items import sumItem


class Sumspider1(sumSpider):
    name = 'sumDetailsUrls'
    allowed_domains = ['sum.com']
    start_urls = ['http://www.demo.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//ul[@id="pager"]/li[8]/a'), callback='parse_start_url', follow=True),
    )

    # use parse_start_url if your spider wants to crawl from the first page, so overriding
    def parse_start_url(self, response):
        print '********************************************1**********************************************'
        # //div[@class="showMoreCars hide"]/a
        # .//ul[@id="pager"]/li[8]/a/@href
        self.log('Inside - parse_item %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = sumItem()
        item['page'] = response.url
        title = hxs.xpath('.//h1[@class="page-heading"]/text()').extract()
        print '********************************************title**********************************************', title
        urls = hxs.xpath('.//a[@id="linkToDetails"]/@href').extract()
        print '**********************************************2***url*****************************************', urls

        finalurls = []
        for url in urls:
            print '---------url-------', url
            finalurls.append(url)

        item['urls'] = finalurls
        return item
My items.py file contains
from scrapy.item import Item, Field


class sumItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    page = Field()
    urls = Field()
Still, I'm not getting the exact output; I'm not able to fetch all the pages when I crawl it.
I hope the below code will help.
somespider.py
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from demo.items import DemoItem
from selenium import webdriver


def removeUnicodes(strData):
    if (strData):
        strData = strData.encode('utf-8').strip()
        strData = re.sub(r'[\n\r\t]', r' ', strData.strip())
    return strData


class demoSpider(scrapy.Spider):
    name = "domainurls"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/used/cars-in-trichy/']

    def __init__(self):
        self.driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", webdriver.DesiredCapabilities.HTMLUNITWITHJS)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.implicitly_wait(5)
        hxs = Selector(response)
        item = DemoItem()
        finalurls = []

        while True:
            next = self.driver.find_element_by_xpath('//div[@class="showMoreCars hide"]/a')
            try:
                next.click()
                # get the data and write it to scrapy items
                item['pageurl'] = response.url
                item['title'] = removeUnicodes(hxs.xpath('.//h1[@class="page-heading"]/text()').extract()[0])
                urls = self.driver.find_elements_by_xpath('.//a[@id="linkToDetails"]')
                for url in urls:
                    url = url.get_attribute("href")
                    finalurls.append(removeUnicodes(url))
                item['urls'] = finalurls
            except:
                break

        self.driver.close()
        return item
items.py
from scrapy.item import Item, Field


class DemoItem(Item):
    page = Field()
    urls = Field()
    pageurl = Field()
    title = Field()
Note:
You need to have a Selenium RC server running, because HTMLUNITWITHJS works only with Selenium RC when using Python.
Run your Selenium RC server by issuing the command:
java -jar selenium-server-standalone-2.44.0.jar
Run your spider using the command:
scrapy crawl domainurls -o someoutput.json
You can check with your browser how the requests are made.
Behind the scenes, right after you click the "show more cars" button, your browser requests JSON data to feed your next page. You can take advantage of this fact and deal directly with the JSON data, without the need for a JavaScript engine such as Selenium or PhantomJS.
In your case, as the first step you should simulate a user scrolling down the page given by your start_url parameter, and at the same time profile your network requests to discover the endpoint used by the browser to request that JSON. To discover this endpoint there is generally an XHR (XMLHttpRequest) section in the browser's profiling tool, for example in Safari, where you can navigate through all the resources/endpoints used to request the data.
Once you discover this endpoint it's a straightforward task: you give your spider the endpoint you just discovered as its start_url, and as you process and navigate through the JSON you can discover whether there is a next page to request.
P.S.: I checked for you, and the endpoint URL is http://www.carwale.com/webapi/classified/stockfilters/?city=194&kms=0-&year=0-&budget=0-&pn=2
In this case my browser requested the second page, as you can see from the pn parameter. It's important that you set some header parameters before you send the request. I noticed that in your case the headers are:
Accept: text/plain, */*; q=0.01
Referer: http://www.carwale.com/used/cars-in-trichy/
X-Requested-With: XMLHttpRequest
sourceid: 1
User-Agent: Mozilla/5.0...
A sketch of replaying this request in Scrapy follows below.
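Putting that together, a minimal sketch might look like this (the JSON key 'stocks' and the stop condition are assumptions; inspect the actual response body to find the right fields):

import json

import scrapy


class CarwaleStockSpider(scrapy.Spider):
    name = "carwale_stock"
    allowed_domains = ["carwale.com"]
    endpoint = ("http://www.carwale.com/webapi/classified/stockfilters/"
                "?city=194&kms=0-&year=0-&budget=0-&pn={pn}")
    headers = {
        "Accept": "text/plain, */*; q=0.01",
        "Referer": "http://www.carwale.com/used/cars-in-trichy/",
        "X-Requested-With": "XMLHttpRequest",
        "sourceid": "1",
    }

    def start_requests(self):
        yield scrapy.Request(self.endpoint.format(pn=1),
                             headers=self.headers,
                             callback=self.parse_page,
                             meta={"pn": 1})

    def parse_page(self, response):
        data = json.loads(response.text)
        # The key holding the listings is an assumption; check the real JSON.
        listings = data.get("stocks", [])
        for listing in listings:
            yield listing

        if listings:  # stop when a page comes back empty
            pn = response.meta["pn"] + 1
            yield scrapy.Request(self.endpoint.format(pn=pn),
                                 headers=self.headers,
                                 callback=self.parse_page,
                                 meta={"pn": pn})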
Hi all, I am trying to get all the results from the link given in the code, but my code is not returning all of them. The link says it contains 2132 results, but it returns only 20 results:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import Flipkart


class Test(Spider):
    name = "flip"
    allowed_domains = ["flipkart.com"]
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All"
                  ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="pu-details lastUnit"]')
        items = []

        for site in sites:
            item = Flipkart()
            item['title'] = site.xpath('div[1]/a/text()').extract()
            items.append(item)

        return items
That is because the site only shows 20 results at a time, and loading of more results is done with JavaScript when the user scrolls to the bottom of the page.
You have two options here:
Find a link on the site which shows all results on a single page (doubtful it exists, but some sites may offer one when passed an optional query string, for example).
Handle JavaScript events in your spider. The default Scrapy downloader doesn't do this, so you can either analyze the JS code and send the event signals yourself programmatically, or use something like Selenium with PhantomJS to let the browser deal with it. I'd recommend the latter, since it's more fail-proof than the manual approach of interpreting the JS yourself. See this question for more information, and Google around; there's plenty of information on this topic. A rough sketch of the Selenium approach follows below.
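For the second option, a minimal sketch of letting a browser do the scrolling and then handing the rendered HTML to a Scrapy selector (the scroll count and pause are guesses, and the Firefox driver is just one possible choice):

import time

from scrapy.selector import Selector
from selenium import webdriver


def fetch_rendered_html(url, max_scrolls=10, pause=2):
    # Scroll a lazily loaded page to the bottom and return the rendered HTML.
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")
        for _ in range(max_scrolls):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)  # give the AJAX call time to append more results
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # nothing new was loaded, so stop scrolling
            last_height = new_height
        return driver.page_source
    finally:
        driver.quit()


# The same XPath from the spider above can then be run over the full page:
html = fetch_rendered_html("http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Mobile%20Brands_All")
titles = Selector(text=html).xpath('//div[@class="pu-details lastUnit"]/div[1]/a/text()').extract()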
I am trying to scrape a very simple web page with the help of Scrapy and its XPath selectors, but for some reason the selectors I have do not work in Scrapy, although they do work in other XPath utilities.
I am trying to parse this snippet of html:
<select id="chapterMenu" name="chapterMenu">
<option value="/111-3640-1/20th-century-boys/chapter-1.html" selected="selected">Chapter 1: Friend</option>
<option value="/111-3641-1/20th-century-boys/chapter-2.html">Chapter 2: Karaoke</option>
<option value="/111-3642-1/20th-century-boys/chapter-3.html">Chapter 3: The Boy Who Bought a Guitar</option>
<option value="/111-3643-1/20th-century-boys/chapter-4.html">Chapter 4: Snot Towel</option>
<option value="/111-3644-1/20th-century-boys/chapter-5.html">Chapter 5: Night of the Science Room</option>
</select>
Scrapy parse_item code:
def parse_item(self, response):
    itemLoader = XPathItemLoader(item=MangaItem(), response=response)
    itemLoader.add_xpath('chapter', '//select[@id="chapterMenu"]/option[@selected="selected"]/text()')
    return itemLoader.load_item()
Scrapy does not extract any text from this, but if I take the same XPath and HTML snippet and run them here, they work just fine.
If I use this XPath:
//select[@id="chapterMenu"]
I get the correct element, but when I try to access the options inside it, it does not get anything.
Scrapy only does a GET request for the URL; it is not a web browser and therefore cannot run JavaScript. Because of this, Scrapy alone will not be enough to scrape dynamic web pages.
In addition you will need something like Selenium, which basically gives you an interface to several web browsers and their functionality, including the ability to run JavaScript and get the client-side generated HTML.
Here is a snippet of how one can go about doing this:
from Project.items import SomeItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from selenium import webdriver
import time


class RandomSpider(CrawlSpider):
    name = 'RandomSpider'
    allowed_domains = ['random.com']
    start_urls = [
        'http://www.random.com'
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('some_regex_here')), callback='parse_item', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        # use any browser you wish
        self.browser = webdriver.Firefox()

    def __del__(self):
        self.browser.close()

    def parse_item(self, response):
        item = SomeItem()
        self.browser.get(response.url)
        # let JavaScript load
        time.sleep(3)

        # scrape the dynamically generated HTML
        hxs = Selector(text=self.browser.page_source)
        item['some_field'] = hxs.select('some_xpath')
        return item
I think I found the webpage you want to extract from, and the chapters are loaded after fetching some JSON data based on a "mangaid" (which is available in a JavaScript array in the page).
So fetching the chapters is a matter of making a specific GET request to a specific /actions/selector/ endpoint. It's basically emulating what your browser's JavaScript engine is doing.
You'll probably get better performance using this technique than Selenium, but it does involve (minor) JavaScript parsing (no real interpretation is needed). A rough sketch of the idea is shown below.
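As a sketch only (the page URL, the inline variable name, the query parameter, and the JSON keys are all assumptions based on the description above, not the site's actual API):

import json
import re

import scrapy


class ChapterSpider(scrapy.Spider):
    name = "chapters"
    # Hypothetical manga page; replace with the real one.
    start_urls = ["http://www.example-manga-site.com/20th-century-boys/"]

    def parse(self, response):
        # Pull the manga id out of the inline JavaScript mentioned above;
        # the variable name 'mangaid' is an assumption.
        match = re.search(r"mangaid\s*=\s*(\d+)", response.text)
        if not match:
            return
        yield scrapy.Request(
            response.urljoin("/actions/selector/?id=%s" % match.group(1)),
            callback=self.parse_chapters,
        )

    def parse_chapters(self, response):
        # Assuming the endpoint returns a JSON list of chapter objects.
        for chapter in json.loads(response.text):
            yield {"chapter": chapter.get("chapter_name")}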