This is my first time using the Scrapy framework for Python.
So I wrote this code.
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            yield {
                'product-name': i.xpath('.//a[@class="product-title js-product-url"]/text()')
                .extract_first().replace('\n', '')
            }

        next_page_url = response.xpath('//a[@class="js-change-page"]/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
When I look at the website it has over 800 products, but my script only scrapes the first 2 pages, roughly 200 products.
I tried both CSS selectors and XPath; same bug either way.
Can anyone figure out where the problem is?
Thank you!
The website you are trying to crawl gets its data from an API. When you click a pagination link, the page sends an AJAX request to that API to fetch more products and render them on the page.
Scrapy does not simulate a browser environment by itself, so one way to handle this is to:
1. Analyse the request in your browser's Network tab to inspect the endpoint and its parameters.
2. Build a similar request yourself in Scrapy.
3. Call that endpoint with the appropriate arguments to get the products from the API.
You also need to extract the next page from the JSON response the API returns. Usually there is a key named pagination which contains info such as total pages, the next page, etc.
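For illustration, here is a minimal sketch of that approach. The endpoint URL, query parameters, and JSON key names below are placeholders rather than the real emag.ro API; inspect the request in your own Network tab and substitute what you find there.

import json

import scrapy


class ApiSpider(scrapy.Spider):
    name = 'emag_api'
    # Placeholder endpoint standing in for whatever the Network tab shows.
    api_url = 'https://www.emag.ro/search-api/televizoare?page={page}'

    def start_requests(self):
        yield scrapy.Request(self.api_url.format(page=1), callback=self.parse_api)

    def parse_api(self, response):
        data = json.loads(response.text)
        # Key names ('items', 'name', 'pagination', 'next') are assumptions;
        # match them to the actual JSON payload.
        for product in data.get('items', []):
            yield {'product-name': product.get('name')}

        next_page = data.get('pagination', {}).get('next')
        if next_page:
            yield scrapy.Request(self.api_url.format(page=next_page),
                                 callback=self.parse_api)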
I finally figured out how to do it.
# -*- coding: utf-8 -*-
import scrapy

from ..items import ScraperItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    page_number = 2
    start_urls = [
        'https://www.emag.ro/televizoare/c'
    ]

    def parse(self, response):
        items = ScraperItem()
        for i in response.xpath('//div[@class="card-section-wrapper js-section-wrapper"]'):
            # strip the surrounding whitespace from the product name
            product_name = i.xpath('.//a[@class="product-title js-product-url"]/text()').extract_first().strip()
            items["product_name"] = product_name
            yield items

        next_page = 'https://www.emag.ro/televizoare/p' + str(SpiderSpider.page_number) + '/c'
        if SpiderSpider.page_number <= 28:
            SpiderSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)
# -*- coding: utf-8 -*-
import scrapy


class SearchSpider(scrapy.Spider):
    name = 'search'
    allowed_domains = ['www.indeed.com/']
    start_urls = ['https://www.indeed.com/jobs?q=data%20analyst&l=united%20states']

    def parse(self, response):
        listings = response.xpath('//*[@data-tn-component="organicJob"]')
        for listing in listings:
            title = listing.xpath('.//a[@data-tn-element="jobTitle"]/@title').extract_first()
            link = listing.xpath('.//h2[@class="title"]//a/@href').extract_first()
            company = listing.xpath('normalize-space(.//span[@class="company"]//a/text())').extract_first()
            yield {'title': title,
                   'link': link,
                   'company': company}

        next_page = response.xpath('//ul[@class="pagination-list"]//a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
I am trying to extract the job title and company for every job posting across all the Indeed pages. However, I am stuck, because the forward button on the Indeed page does not have a fixed link my scraper can follow; the next-page URL is the same as the numbered button's, and the numbers at the end keep changing even after requesting the next page, so I cannot extract the next page. I would like to avoid Selenium or Splash and get my results with only Scrapy or BeautifulSoup. Any help would be greatly appreciated.
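One way to sidestep the moving forward button, sketched below purely as an assumption, is to generate the result-page URLs yourself from a start= offset; Indeed has historically paginated in steps of 10, but verify that against the live site before relying on it.

from urllib.parse import urlencode

import scrapy


class SearchOffsetSpider(scrapy.Spider):
    name = 'search_offset'
    base_url = 'https://www.indeed.com/jobs'
    query = {'q': 'data analyst', 'l': 'united states'}
    max_pages = 20  # arbitrary safety cap; raise it once the offset assumption is confirmed

    def start_requests(self):
        # Assumption: each results page is addressed by a 'start' offset that grows by 10.
        for page in range(self.max_pages):
            params = dict(self.query, start=page * 10)
            yield scrapy.Request(self.base_url + '?' + urlencode(params), callback=self.parse)

    def parse(self, response):
        for listing in response.xpath('//*[@data-tn-component="organicJob"]'):
            yield {
                'title': listing.xpath('.//a[@data-tn-element="jobTitle"]/@title').extract_first(),
                'link': listing.xpath('.//h2[@class="title"]//a/@href').extract_first(),
                'company': listing.xpath('normalize-space(.//span[@class="company"]//a/text())').extract_first(),
            }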
I am a newbie to Python and Scrapy spiders. I am now trying to use Scrapy and Splash to crawl dynamic pages rendered with JS, such as the problems at https://leetcode.com/problemset/all/.
But when I use response.xpath("//div[@class='css-1ponsav']") on https://leetcode.com/problems/two-sum/, it does not seem to return any information.
Similarly, on the login page https://leetcode.com/accounts/login/, when I try to call SplashFormRequest.from_response(response, ...) to log in, it returns ValueError: No element found in <200 >.
I don't know much about front-end development, so I don't know whether this has anything to do with the GraphQL that LeetCode uses, or whether there is some other reason.
Here is the code.
# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request, Selector
from scrapy_splash import SplashRequest

from leetcode_problems.items import ProblemItem


class TestSpiderSpider(scrapy.Spider):
    name = 'test_spider'
    allowed_domains = ['leetcode.com']
    single_problem_url = "https://leetcode.com/problems/two-sum/"

    def start_requests(self):
        url = self.single_problem_url
        yield SplashRequest(url=url, callback=self.single_problem_parse, args={'wait': 2})

    def single_problem_parse(self, response):
        submission_page = response.xpath("//div[@data-key='submissions']/a/@href").extract_first()
        submission_text = response.xpath("//div[@data-key='submissions']//span[@class='title__qRnJ']").extract_first()
        print("submission_text:", end=' ')
        print(submission_text)  # prints nothing
        if submission_page:
            yield SplashRequest("https://leetcode.com" + submission_page, self.empty_parse, args={'wait': 2})
I am not that familiar with Splash, but 98% of JavaScript-generated websites can be scraped by looking at the XHR filter under the Network tab and finding the POST or GET requests that return the data rendered on the page.
In your case I can see there is one response that generates the whole page, without needing any special query parameters or API keys.
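As a rough sketch of that idea, you could request the endpoint directly instead of rendering the page with Splash. The GraphQL query and field names below are illustrative guesses; copy the exact request (URL, headers, body) from the entry you see in the Network tab.

import json

import scrapy


class LeetcodeApiSpider(scrapy.Spider):
    name = 'leetcode_api'
    graphql_url = 'https://leetcode.com/graphql'  # endpoint as seen in the Network tab

    def start_requests(self):
        # Illustrative payload; mirror the real request body from your browser.
        payload = {
            'operationName': 'questionData',
            'variables': {'titleSlug': 'two-sum'},
            'query': 'query questionData($titleSlug: String!) {'
                     ' question(titleSlug: $titleSlug) { title difficulty content } }',
        }
        yield scrapy.Request(
            self.graphql_url,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json',
                     'Referer': 'https://leetcode.com/problems/two-sum/'},
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        # Key names are assumptions; adjust to the JSON you actually get back.
        question = data.get('data', {}).get('question') or {}
        yield {'title': question.get('title'), 'difficulty': question.get('difficulty')}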
I am trying to use a Scrapy spider to crawl a website, using a FormRequest to send a keyword to the search query on a city-specific page. It seems straightforward from what I've read, but I'm having trouble. I'm fairly new to Python, so sorry if there is something obvious I'm overlooking.
Here are the three main sites I was trying to use to help me:
Mouse vs Python [1]; Stack Overflow [2]; Scrapy.org [3]
In the source code of the specific URL I am crawling, www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents, I found:
<input name="dnn$ctl01$txtSearch" type="text" maxlength="255" size="20" id="dnn_ctl01_txtSearch" class="NormalTextBox" autocomplete="off" placeholder="Search..." />
I think the name of the search field is "dnn_ct101_txtSearch", which I would use as in the example cited as [2], and I want to input "toyota" as my keyword for the vehicle search.
Here is the code I have for my spider right now; I am aware I am importing excessive stuff at the beginning:
import scrapy
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents"]
    start_urls = ['http://www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents/']

    def start_requests(self):
        return [FormRequest("www.lkqpickyourpart.com\locations/LKQ_Self_Service_-_Gainesville-224/recents",
                            formdata={'dnn$ctl01$txtSearch': 'toyota'},
                            callback=self.parse)]

    def parsel(self):
        print self.status
Why is it not searching or printing any kind of results? Is the example I'm copying only intended for logging in to websites, not for entering text into search bars?
Thanks,
Dan the newbie Python writer
Here you go :)
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser


class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US',
    )

    def parse(self, response):
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            yield item
        # open_in_browser(response)
        # inspect_response(response, self)
The page that you're trying to scrape is generated by an AJAX call.
By default, Scrapy doesn't load any dynamically generated JavaScript content, including content fetched via AJAX. Almost all sites that load data dynamically as you scroll down the page do so using AJAX.
Trapping AJAX calls is pretty simple using either Chrome Dev Tools or Firebug for Firefox.
All you have to do is observe the XHR requests in Chrome Dev Tools or Firebug; an XHR is an AJAX request.
Once you find the link, you can change its parameters.
This is the link that the XHR request in Chrome Dev Tools gave me:
http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=toyota&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US
I've changed the page size to 1000 up there to get 1000 results per page; the default was 15.
There's also a page number parameter which you would ideally increase until you have captured all the data, as sketched below.
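If you would rather page through the results than pull one huge pageSize, a small sketch of that loop follows; the stop condition (an empty page) is an assumption, so adjust it to whatever the real end-of-inventory response looks like.

import scrapy


class LkqPagedSpider(scrapy.Spider):
    name = 'lkq_paged'
    base_url = ('http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/'
                'getVehicleInventory.aspx?store=224&page={page}&filter=toyota'
                '&sp=&cl=&carbuyYardCode=1224&pageSize=15&language=en-US')

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(page=0),
                             callback=self.parse, meta={'page': 0})

    def parse(self, response):
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        if not info:
            # Assumption: an empty page means the inventory has been exhausted.
            return
        for element in range(0, len(info), 4):
            yield {'Make': info[element],
                   'Model': info[element + 1],
                   'Year': info[element + 2],
                   'Entered_Yard': info[element + 3]}

        next_page = response.meta['page'] + 1
        yield scrapy.Request(self.base_url.format(page=next_page),
                             callback=self.parse, meta={'page': next_page})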
The web page requires a JavaScript rendering framework to load its content before Scrapy can see it.
Use Splash, and refer to the scrapy-splash documentation for usage.
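Roughly, the scrapy-splash wiring looks like the snippet below (written from memory of the scrapy-splash README, so double-check the values there): run a Splash instance, point SPLASH_URL at it, and enable the middlewares.

# settings.py (sketch; verify against the scrapy-splash README)
SPLASH_URL = 'http://localhost:8050'  # address of your running Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Then, in the spider, yield a scrapy_splash.SplashRequest(url, callback, args={'wait': 2}) in place of a plain scrapy.Request.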
I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make it crawl only the URL(s) fed to it in the start_urls list. In most cases I want to crawl only one page, but in some cases I will specify multiple pages. I don't want it to crawl on to other pages.
I've tried setting the depth limit to 1, but in my testing I'm not sure it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy

from generic.items import GenericItem


class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider. Isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider for crawling multiple categories inside pages.
To crawl only the URLs specified in start_urls, just override the parse method, as that is the default callback of the start requests.
Below is code for a spider that will scrape the title from a blog post (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy

from ..items import DmozItem  # assumes DmozItem is defined in the project's items.py


class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
I had the same problem, because I was using

import scrapy
from scrapy.spiders import CrawlSpider

Then I changed to

import scrapy
from scrapy.spiders import Spider

and changed the class to

class mySpider(Spider):
I'm trying to load some XPath rules from a database using Scrapy.
The code I've written so far works fine; however, after some debugging I've realised that Scrapy parses each item asynchronously, meaning I have no control over the order in which the items are parsed.
What I want to do is figure out which item from the list is currently being parsed when it hits the parse() function, so I can map that index to the rows in my database and pick the correct XPath query. The way I'm currently doing this is with a variable called item_index that I increment after each item. I now realise this is not enough, and I'm hoping there is some internal functionality that could help me achieve this.
Does anyone know the proper way of keeping track of this? I've looked through the documentation but couldn't find any info about it. I've also looked at the Scrapy source code, but I can't seem to figure out how the list of URLs actually gets stored.
Here's my code to explain my problem further:
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Product
from dirbot.database import DatabaseConnection

# Create a database connection object so we can execute queries
connection = DatabaseConnection()


class DmozSpider(Spider):
    name = "dmoz"
    start_urls = []
    item_index = 0

    # Query for all products sold by a merchant
    rows = connection.query("SELECT * FROM products_merchant WHERE 1=1")

    def start_requests(self):
        for row in self.rows:
            yield self.make_requests_from_url(row["product_url"])

    def parse(self, response):
        sel = Selector(response)
        item = Product()
        item['product_id'] = self.rows[self.item_index]['product_id']
        item['merchant_id'] = self.rows[self.item_index]['merchant_id']
        item['price'] = sel.xpath(self.rows[self.item_index]['xpath_rule']).extract()
        self.item_index += 1
        return item
Any guidance would be greatly appreciated!
Thanks
Here's the solution I came up with, in case anyone needs it.
As @toothrot suggested, you need to overload make_requests_from_url so you can pass meta information along with the Request and access it later from the response.
Hope this helps someone.
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from dirbot.items import Product
from dirbot.database import DatabaseConnection

# Create a database connection object so we can execute queries
connection = DatabaseConnection()


class DmozSpider(Spider):
    name = "dmoz"
    start_urls = []

    # Query for all products sold by a merchant
    rows = connection.query("SELECT * FROM products_merchant WHERE 1=1")

    def start_requests(self):
        for indx, row in enumerate(self.rows):
            self.start_urls.append(row["product_url"])
            yield self.make_requests_from_url(row["product_url"], {'index': indx})

    def make_requests_from_url(self, url, meta):
        return Request(url, callback=self.parse, dont_filter=True, meta=meta)

    def parse(self, response):
        item_index = response.meta['index']
        sel = Selector(response)
        item = Product()
        item['product_id'] = self.rows[item_index]['product_id']
        item['merchant_id'] = self.rows[item_index]['merchant_id']
        item['price'] = sel.xpath(self.rows[item_index]['xpath_rule']).extract()
        return item
You can pass the index (or the row id from the database) along with the request using Request.meta. It's a dictionary you can access from Response.meta in your handler.
For example, when you're building your request:
Request(url, callback=self.some_handler, meta={'row_id': row['id']})
Using a counter like you've attempted won't work because you can't guarantee the order in which the responses are handled.
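Putting the two halves together, a minimal self-contained sketch might look like this (the rows list stands in for the database query from the question, and the handler name is arbitrary):

import scrapy


class MetaExampleSpider(scrapy.Spider):
    name = "meta_example"
    # Stand-in for the rows returned by the database query in the question.
    rows = [
        {"id": 1, "product_url": "http://example.com/product/1", "xpath_rule": "//span[@class='price']/text()"},
        {"id": 2, "product_url": "http://example.com/product/2", "xpath_rule": "//div[@id='price']/text()"},
    ]

    def start_requests(self):
        for row in self.rows:
            # The row id travels with the request inside meta.
            yield scrapy.Request(row["product_url"], callback=self.some_handler,
                                 meta={"row_id": row["id"]}, dont_filter=True)

    def some_handler(self, response):
        # The id comes back with the response, so the response order no longer matters.
        row_id = response.meta["row_id"]
        yield {"row_id": row_id, "url": response.url}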