Using Scrapy for a website with multiple search options - Python

I am new to Scrapy and Python.
I would like to scrape a property registrar's website which uses a query-based search. Most of the examples I have seen use simple web pages, not searching via the FormRequest mechanism. The code I have written is below. Everything is currently hardcoded. I would like to be able to scrape the data based on the year or county.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class SecondSpider(CrawlSpider):
    name = "second"

    '''
    def start_requests(self):
        return [scrapy.FormRequest("https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm",
                                   # this is the form; it asks for the fields below,
                                   # then the link changes to this form:
                                   # https://www.propertypriceregister.ie/website/npsra/PPR/npsra-ppr.nsf/PPR-By-Date?SearchView
                                   #     &Start=1
                                   #     &SearchMax=0
                                   #     &SearchOrder=4
                                   #     &Query=%5Bdt_execution_date%5D%3E=01/01/2010%20AND%20%5Bdt_execution_date%5D%3C01/01/2011
                                   #     &County=      # these are the query fields
                                   #     &Year=2010    # these are the query fields
                                   #     &StartMonth=  # these are the query fields
                                   #     &EndMonth=    # these are the query fields
                                   #     &Address=     # these are the query fields
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
    '''

    allowed_domains = ["www.propertypriceregister.ie"]
    start_urls = ('https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm',)

    rules = (
        Rule(SgmlLinkExtractor(allow='/website/npsra/PPR/npsra-ppr.nsf/PPR-By-Date?SearchView&Start=1&SearchMax=0&SearchOrder=4&Query=%5Bdt_execution_date%5D%3E=01/01/2010%20AND%20%5Bdt_execution_date%5D%3C01/01/2011&County=&Year=2010&StartMonth=&EndMonth=&Address='),
             callback='parse',
             follow=True),
    )

    def parse(self, response):
        print(response)

Before you get started, re-read how Rule objects work. At present, your rule will match a very specific URL which the site will never show a link for (as it's in the format of a form post).
Next, don't override the parse function of the CrawlSpider (actually, don't use it at all). It's used internally by the CrawlSpider to function (see the warning on the link I provided for additional details).
You'll need to generate a FormRequest for each of the elements to be called, similar to something like this (note: untested, but it should work):
import itertools
...  # all your other imports here

class SecondSpider(CrawlSpider):
    name = 'second'
    allowed_domains = ['propertypriceregister.ie', 'www.propertypriceregister.ie']

    rules = (
        Rule(LinkExtractor(allow=("/eStampUNID/UNID-")), callback='parse_search'),
    )

    def start_requests(self):
        years = [2010, 2011, 2012, 2013, 2014]
        counties = ['County1', 'County2']  # list the counties you want to query
        for county, year in itertools.product(counties, years):
            yield scrapy.FormRequest("https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm",
                                     formdata={'County': county, 'Year': str(year)},  # formdata values must be strings
                                     dont_filter=True)

    def parse_search(self, response):
        # Parse response here
        pass
From this point, your rule(s) will be applied to each of the pages you get back from the FormRequests to pull URLs from them. If you want to actually grab anything from those initial URLs, override the parse_start_url method of the CrawlSpider.
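A rough, untested sketch of such an override inside the spider above (the XPaths and field names are placeholders, not the register's actual markup):

    def parse_start_url(self, response):
        # CrawlSpider routes the responses from start_requests / start_urls
        # through this method before applying the rules, so this is where the
        # search-results pages returned by the FormRequests can be scraped.
        for row in response.xpath('//table//tr'):  # placeholder selector
            yield {
                'address': row.xpath('./td[1]//text()').extract_first(),  # placeholder field
                'price': row.xpath('./td[2]//text()').extract_first(),    # placeholder field
            }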

Related

Scrapy CrawlSpider and rules

I'm beginning with Scrapy and I have made a couple of spiders that successfully target the same site.
The first one gets the products listed across the entire site except their prices (because prices are hidden from users who are not logged in), and the second one logs in to the website.
My problem looks a bit weird when I merge both pieces of code: the result does not work. The main problem is that the rules aren't processed; it's as if they were never called by Scrapy.
Because the program has to log in to the website, I have to override start_requests, but when I do, the rules are not processed. I'm digging into the documentation, but I don't understand how the methods/functions are called by the framework and why the rules aren't processed.
Here is my spider code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from oled.items import OledItem
from scrapy.utils.response import open_in_browser

class OledMovilesSpider(CrawlSpider):
    name = 'webiste-spider'
    allowed_domains = ['website.com']

    rules = {
        # For each item
        # Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[contains(text(), ">")]'))),
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//h2[@class="product-name"]/a')), callback='parse_item',
             follow=False)
    }

    def start_requests(self):
        return [scrapy.FormRequest('https://website.com/index.php?route=account/login',
                                   formdata={'email': 'website@website.com', 'password': 'website#'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        urls = ['https://gsmoled.com/index.php?route=product/category&path=33_61']
        print('before return')
        return [scrapy.Request(url=url, callback=self.parse_item) for url in urls]

    def parse_item(self, response):
        print("Inside parse_item")
        open_in_browser(response)
        ml_item = OledItem()

        # product info
        ml_item['nombre'] = response.xpath('normalize-space(//title/text())').extract_first()
        ml_item['descripcion'] = response.xpath('normalize-space(//*[@id="product-des"])').extract()
        ml_item['stock'] = response.xpath('normalize-space(//span[@class="available"])').extract()
        # ml_item['precio'] = response.xpath('normalize-space(/html/body/main/div/div/div[1]/div[1]/section[1]/div/section[2]/ul/li[1]/span)').extract()
        # ml_item['categoria'] = response.xpath('normalize-space(/html/body/main/div/div/div[1]/div[1]/section[1]/div/section[2]/ul/li[2]/span)').extract()
        yield ml_item
Could someone tell me why the rules are not being processed?
I think you're bypassing the rules by overriding start_requests. The parse method is never called, so the rules aren't processed.
If you want to process the rules for page https://gsmoled.com/index.php?route=product/category&path=33_61 after you're logged in, you can try changing the callback of the logged_in method to parse like this: return [scrapy.Request(url=url, callback=self.parse) for url in urls].
The rules should be processed at that moment, and because you specified 'parse_item' as a callback in the rules, the parse_item method will be executed for all urls generated by the rules.
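A minimal sketch of the adjusted method (everything else in the spider stays as it is):

    def logged_in(self, response):
        urls = ['https://gsmoled.com/index.php?route=product/category&path=33_61']
        # Hand the category page back to CrawlSpider's parse so the rules run;
        # the rule will then send matching product links on to parse_item.
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]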

CrawlSpider with dynamically generated Rules and spider arguments is not scraping

I am currently trying to create a spider which recursively follows the pages of the portal I give it. Its purpose is to scrape the content of articles from news portals. Two things are important for its implementation:
I must be able to provide an argument to the spider with the portal's URL,
I want to dynamically generate a regex for article URLs based on the portal link given earlier.
The argument for the spider is needed because I want to create a spider per portal that I intend to crawl.
The regex is needed for the Rule and LinkExtractor.
The problem is that after running, the spider does not work recursively and does not collect any documents. It's as if the dynamically generated rules did not apply.
In my spider's implementation I use __init__ to pass in a variable called portal, which holds the URL of the portal I want to scrape.
In __init__ I also create the rules and compile them.
I use the Dragnet external library to extract the content of the articles.
import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.project import get_project_settings
from ..items import ArticleExtractorItem
import re
import tldextract
from dragnet import extract_content

class ArticleExtractorSpider(CrawlSpider):
    name = "article_extractor"

    def __init__(self, portal=None, *args, **kwargs):
        super(ArticleExtractorSpider, self).__init__(*args, **kwargs)
        if portal:
            url_escaped = re.escape(portal)
            article_regex = r'^' + url_escaped + \
                r'([a-zA-Z0-9]+-){2,}[a-zA-Z0-9]+\/$'

            ArticleExtractorSpider.rules = (
                Rule(LinkExtractor(allow=[article_regex]),
                     callback='parse_article', follow=True),
            )
            super(ArticleExtractorSpider, self)._compile_rules()

        self.portal = portal

    def parse_article(self, response):
        article = extract_content(response.body)
        portal = tldextract.extract(response.url)[1]

        l = ItemLoader(item=ArticleExtractorItem(), response=response)
        l.add_value('portal', portal)
        l.add_value('url', response.url)
        l.add_xpath('title', './/meta[@name="twitter:title"]/@content')
        l.add_xpath('title', './/meta[@property="og:title"]/@content')
        l.add_xpath('publish_date',
                    './/meta[@property="article:published_time"]/@content')
        l.add_value('article', article)
        yield l.load_item()
After starting the spider, no requests were even made. Zero documents have been downloaded.

scrapy rules do not call parsing method

I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of every URL with the pattern /example/.*
Crawling works, but the scraping of the titles does not, since the output file is empty. Most likely I got the rules wrong. Is this the right syntax for the rules to achieve what I am looking for?
items.py

import scrapy

class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
spider.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bid.items import BidItem

class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['domain.de']
    start_urls = ['https://www.domain.de/']

    rules = (
        Rule(
            LinkExtractor(),
            follow=True
        ),
        Rule(
            LinkExtractor(allow=['example/.*']),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        href = BidItem()
        href['url'] = response.url
        href['title'] = response.css("h1::text").extract()
        return href
crawl: scrapy crawl getbid -o 012916.csv
From the CrawlSpider docs:
If multiple rules match the same link, the first one will be used,
according to the order they’re defined in this attribute.
Since your first rule will match all links, it will always be used and all other rules will be ignored.
Fixing the problem is as simple as switching the order of the rules.
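For this spider that would look something like the sketch below (untested; add follow=True to the first rule as well if the /example/ pages themselves should be crawled for further links):

    rules = (
        # Specific rule first: /example/ links are sent to parse_item
        Rule(
            LinkExtractor(allow=['example/.*']),
            callback='parse_item'
        ),
        # Catch-all rule second: keep following every other internal link
        Rule(
            LinkExtractor(),
            follow=True
        ),
    )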

Crawling through web pages that have categories

I'm trying to scrape a website that has an uncommon page structure: page upon page upon page until I get to the item I'm trying to extract data from.
Edit: thanks to the answers, I have been able to extract most of the data I require; however, I still need the path links that lead to the product.
Here's the code I have so far:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'drapertools.com'
    start_urls = ['https://www.drapertools.com/category/0/Product%20Range']

    rules = (
        Rule(LinkExtractor(allow=['/category-?.*?/'])),
        Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
            'product_number': response.xpath('//div[@id="product-title"]//h1[@style="margin-bottom: 20px; color:#000000; font-size: 23px;"]/text()').extract_first(),
            'product_price': response.xpath('//div[@id="product-title"]//p/text()').extract_first(),
            'product_desc': response.xpath('//div[@class="col-md-6 col-sm-6 col-xs-12 pull-left"]//div[@class="col-md-11 col-sm-11 col-xs-11"]//p/text()').extract_first(),
            'product_path': response.xpath('//div[@class="nav-container"]//ol[@class="breadcrumb"]//li//a/text()').extract(),
            'product_path_links': response.xpath('//div[@class="nav-container"]//ol[@class="breadcrumb"]//li//a/href()').extract(),
        }
I don't know if this would work or anything, can anyone please help me here?
I would greatly appreciate it.
More Info:
I'm trying to access all categories and all items within them;
however, there are categories within categories, and even more levels before I can get to the item.
I'm thinking of using Guillaume's LinkExtractor code, but I'm not sure that it's meant for the outcome I want...
rules = (
    Rule(LinkExtractor(allow=['/category-?.*?/'])),
    Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
)
Why not use a CrawlSpider instead? It's perfect for this use case!
It basically gets all the links for every page recursively, and calls a callback only for the interesting ones (I'm assuming that you are interested in the products).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'drapertools.com'
    start_urls = ['https://www.drapertools.com/category/0/Product%20Range']

    rules = (
        Rule(LinkExtractor(allow=['/category-?.*?/'])),
        Rule(LinkExtractor(allow=['/product/']), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {
            'product_name': response.xpath('//div[@id="product-title"]//h1[@class="text-primary"]/text()').extract_first(),
        }
You have the same structure for all pages, maybe you can shorten it?
import scrapy

class DraperToolsSpider(scrapy.Spider):
    name = 'drapertools_spider'
    start_urls = ["https://www.drapertools.com/category/0/Product%20Range"]

    def parse(self, response):
        # this will call self.parse by default for all your categories
        for url in response.css('.category p a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(url))

        # here you can add some "if" if you want to catch details only on certain pages
        for req in self.parse_details(response):
            yield req

    def parse_details(self, response):
        yield {}

scrapy didn't crawl all links

I want to extract data from http://community.sellfree.co.kr/. Scrapy is working; however, it appears to only scrape the start_urls and doesn't crawl any links.
I would like the spider to crawl the entire site.
The following is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from metacritic.items import MetacriticItem

class MetacriticSpider(BaseSpider):
    name = "metacritic"  # Name of the spider, to be used when crawling
    allowed_domains = ["sellfree.co.kr"]  # Where the spider is allowed to go
    start_urls = [
        "http://community.sellfree.co.kr/"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('.*',)),
             callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)  # The XPath selector
        sites = hxs.select('/html/body')
        items = []
        for site in sites:
            item = MetacriticItem()
            item['title'] = site.select('//a[@title]').extract()
            items.append(item)
        return items
There are two kinds of links on the page. One is onclick="location='../bbs/board.php?bo_table=maket_5_3'" and another is <span class="list2">solution</span>.
How can I get the crawler to follow both kinds of links?
Before I get started, I'd highly recommend using an updated version of Scrapy. It appears you're still using an old one, as many of the methods/classes you're using have been moved around or deprecated.
To the problem at hand: the scrapy.spiders.BaseSpider class will not do anything with the rules you specify. Instead, use the scrapy.contrib.spiders.CrawlSpider class, which has the functionality to handle rules built in.
Next, you'll need to switch your parse() method to a new name, since the CrawlSpider uses parse() internally to work. (We'll assume parse_page() for the rest of this answer.)
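For instance, a minimal sketch of that rename, reusing the body of your original parse():

    def parse_page(self, response):
        # Identical to the original parse(), just under a name that does not
        # clash with CrawlSpider's internal parse() machinery.
        hxs = HtmlXPathSelector(response)
        items = []
        for site in hxs.select('/html/body'):
            item = MetacriticItem()
            item['title'] = site.select('//a[@title]').extract()
            items.append(item)
        return items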
To pick up all of the basic links and have them crawled, your link extractor will need to be changed. You don't need regular-expression syntax just to follow every link on the domain. The following will pick them up, and your allowed_domains setting will keep the crawl from leaving the site:
rules = (
    Rule(SgmlLinkExtractor(allow=('')), callback="parse_page", follow=True),
)
As for the onclick=... links, these are JavaScript links, and the page you are trying to process relies on them heavily. Scrapy cannot crawl things like onclick=location.href="javascript:showLayer_tap('2')" or onclick="win_open('./bbs/profile.php?mb_id=wlsdydahs')", because it can't execute the showLayer_tap() or win_open() JavaScript.
(the following is untested, but should work and provide the basic idea of what you need to do)
You can write your own functions for parsing these, though. For instance, the following can handle onclick=location.href="./photo/":
def process_onclick(value):
    # needs `import re` at the top of the spider file
    m = re.search("location.href=\"(.*?)\"", value)
    if m:
        return m.group(1)
Then add the following rule (this only handles tables, expand it as needed):
Rule(SgmlLinkExtractor(allow=(''), tags=('table',),
                       attrs=('onclick',), process_value=process_onclick),
     callback="parse_page", follow=True),
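Putting the two rules together (still untested, same caveat as above), the spider's rules attribute would then look roughly like this:

rules = (
    # Ordinary <a href="..."> links anywhere on the page
    Rule(SgmlLinkExtractor(allow=('')),
         callback="parse_page", follow=True),
    # onclick="location.href=..." links found on <table> elements
    Rule(SgmlLinkExtractor(allow=(''), tags=('table',),
                           attrs=('onclick',), process_value=process_onclick),
         callback="parse_page", follow=True),
)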
