CrawlSpider with dynamically generated Rules and spider arguments is not scraping - python

I am currently trying to create a spider that recursively follows links from the page I give it. Its purpose is to scrape the content of articles from news portals. Two things are important for its implementation:
I must be able to pass the portal URL to the spider as an argument,
I want to dynamically generate a regex for article URLs based on the portal link given earlier.
The argument for the spider is needed because I want to create one spider per portal that I intend to crawl.
The regex is needed for the Rule and LinkExtractor.
The problem is that after it runs, the spider does not crawl recursively and does not collect any documents, as if the dynamically generated rules did not apply.
In my spider's implementation, I use __init__ to pass in a variable called portal, which holds the URL of the portal I want to scrape.
In __init__ I also create the rules and compile them.
I use the external Dragnet library to extract the content of the articles.
import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.project import get_project_settings
from ..items import ArticleExtractorItem
import re
import tldextract
from dragnet import extract_content


class ArticleExtractorSpider(CrawlSpider):
    name = "article_extractor"

    def __init__(self, portal=None, *args, **kwargs):
        super(ArticleExtractorSpider, self).__init__(*args, **kwargs)
        if portal:
            url_escaped = re.escape(portal)
            article_regex = r'^' + url_escaped + \
                r'([a-zA-Z0-9]+-){2,}[a-zA-Z0-9]+\/$'
            ArticleExtractorSpider.rules = (
                Rule(LinkExtractor(allow=[article_regex]),
                     callback='parse_article', follow=True),
            )
            super(ArticleExtractorSpider, self)._compile_rules()
        self.portal = portal

    def parse_article(self, response):
        article = extract_content(response.body)
        portal = tldextract.extract(response.url)[1]
        l = ItemLoader(item=ArticleExtractorItem(), response=response)
        l.add_value('portal', portal)
        l.add_value('url', response.url)
        l.add_xpath('title', './/meta[@name="twitter:title"]/@content')
        l.add_xpath('title', './/meta[@property="og:title"]/@content')
        l.add_xpath('publish_date',
                    './/meta[@property="article:published_time"]/@content')
        l.add_value('article', article)
        yield l.load_item()
After starting the spider, no requests were even made. Zero documents have been downloaded.
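A minimal sketch of the pattern usually suggested for this situation, assuming the portal argument should also serve as the start URL (the spider above defines neither start_urls nor start_requests, so as written it has nothing to request): set the rules and start_urls before calling CrawlSpider.__init__, which compiles the rules itself.

import re

from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class ArticleExtractorSpider(CrawlSpider):
    name = "article_extractor"

    def __init__(self, portal=None, *args, **kwargs):
        if portal:
            # Assumption: the crawl starts from the portal's front page.
            self.start_urls = [portal]
            article_regex = r'^' + re.escape(portal) + \
                r'([a-zA-Z0-9]+-){2,}[a-zA-Z0-9]+/$'
            self.rules = (
                Rule(LinkExtractor(allow=[article_regex]),
                     callback='parse_article', follow=True),
            )
        self.portal = portal
        # CrawlSpider.__init__ calls _compile_rules(), so the rules set above
        # are picked up without calling _compile_rules() manually.
        super(ArticleExtractorSpider, self).__init__(*args, **kwargs)

    def parse_article(self, response):
        # unchanged from the question
        ...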

Related

Does Scrapy crawl HTML that calls :hover to display additional information?

I'm not sure if this is the correct place for this question.
Here's my question:
When I run Scrapy, it can't see the email addresses in the page source. The page has email addresses that are visible only when you hover over a user with an email address.
When I run my spider, I get no emails. What am I doing wrong?
Thank You.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re


class MailsSpider(CrawlSpider):
    name = 'mails'
    allowed_domains = ['biorxiv.org']
    start_urls = ['https://www.biorxiv.org/content/10.1101/2022.02.28.482253v3']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        emails = re.findall(r'[\w\.]+@[\w\.]+', response.text)
        print(response.url)
        print(emails)
Assuming you're allowed to scrape email contacts from a public website:
as said, Scrapy does not load JS scripts, so you need a full rendering browser like Playwright to get the addresses.
I've written a quick and dirty example of how it could work; you can start from here if you wish (after you've installed Playwright, of course).
import scrapy
from scrapy.http import Request, HtmlResponse
from playwright.sync_api import sync_playwright


class PhaseASpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        yield Request('https://www.biorxiv.org/content/10.1101/2022.02.28.482253v3', callback=self.parse_page)

    def parse_page(self, response):
        with sync_playwright() as p:
            # Render the page in a real browser so the JS-inserted mailto links exist.
            browser = p.firefox.launch(headless=False)
            page = browser.new_page()
            page.goto(response.url)
            page.wait_for_load_state("load")
            html_page = page.content()
            browser.close()
        # Wrap the rendered HTML in an HtmlResponse so Scrapy selectors can be used on it.
        response_sel = HtmlResponse(url=response.url, body=html_page, encoding='utf-8')
        mails = response_sel.xpath('//a[contains(@href, "mailto")]/@href').extract()
        for mail in mails:
            print(mail.split('mailto:')[1])

Scrapy CrawlSpider and rules

I'm beginning with Scrapy and I made a couple of spiders targeting the same site successfully.
The first one gets the products listed across the entire site except their prices (because prices are hidden for users who are not logged in), and the second one logs in to the website.
My problem looks a bit weird when I merge both codes: the result does not work! The main problem is that the rules aren't processed; it's as if they aren't called by Scrapy.
Because the program has to log in to the website, I have to override start_requests, but when I override it the rules are not processed. I'm diving into the documentation but I don't understand how the methods/functions are called by the framework and why the rules aren't processed.
Here is my spider code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from oled.items import OledItem
from scrapy.utils.response import open_in_browser


class OledMovilesSpider(CrawlSpider):
    name = 'webiste-spider'
    allowed_domains = ['website.com']

    rules = {
        # For each item
        # Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[contains(text(), ">")]'))),
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//h2[@class="product-name"]/a')), callback='parse_item',
             follow=False)
    }

    def start_requests(self):
        return [scrapy.FormRequest('https://website.com/index.php?route=account/login',
                                   formdata={'email': 'website@website.com', 'password': 'website#'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        urls = ['https://gsmoled.com/index.php?route=product/category&path=33_61']
        print('before return')
        return [scrapy.Request(url=url, callback=self.parse_item) for url in urls]

    def parse_item(self, response):
        print("Inside parse_item")
        open_in_browser(response)
        ml_item = OledItem()
        # product info
        ml_item['nombre'] = response.xpath('normalize-space(//title/text())').extract_first()
        ml_item['descripcion'] = response.xpath('normalize-space(//*[@id="product-des"])').extract()
        ml_item['stock'] = response.xpath('normalize-space(//span[@class="available"])').extract()
        # ml_item['precio'] = response.xpath('normalize-space(/html/body/main/div/div/div[1]/div[1]/section[1]/div/section[2]/ul/li[1]/span)').extract()
        # ml_item['categoria'] = response.xpath('normalize-space(/html/body/main/div/div/div[1]/div[1]/section[1]/div/section[2]/ul/li[2]/span)').extract()
        yield ml_item
Could someone tell me why the rules are not being processed?
I think you're bypassing the rules by overriding start_requests. The parse method is never called, so the rules aren't processed.
If you want to process the rules for the page https://gsmoled.com/index.php?route=product/category&path=33_61 after you're logged in, you can try changing the callback of the logged_in method to parse, like this: return [scrapy.Request(url=url, callback=self.parse) for url in urls].
The rules should be processed at that moment, and because you specified parse_item as the callback in the rules, the parse_item method will be executed for all URLs generated by the rules.
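Applied to the spider above, the only change is in logged_in (a sketch of the suggested fix, not tested against the site):

    def logged_in(self, response):
        urls = ['https://gsmoled.com/index.php?route=product/category&path=33_61']
        # Hand the category page back to CrawlSpider's built-in parse() so the
        # rules run over it; parse_item is then called for each extracted link.
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]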

How to use Scrapy sitemap spider on sites with text sitemaps?

I tried using a generic scrapy.Spider to follow links, but it didn't work, so I hit upon the idea of simplifying the process by accessing the sitemap.txt instead, but that didn't work either!
I wrote a simple example (to help me understand the algorithm) of a spider to follow the sitemap specified for my site: https://legion-216909.appspot.com/sitemap.txt It is meant to navigate the URLs specified in the sitemap, print them to the screen and output the results into a links.txt file. The code:
import scrapy
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    name = "spyder_PAGE"
    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        print(response.url)
        return response.url
I ran the above spider with scrapy crawl spyder_PAGE > links.txt, but that returned an empty text file. I have gone through the Scrapy docs multiple times, but something is missing. Where am I going wrong?
SitemapSpider expects an XML sitemap format, which causes the spider to exit with this error:
[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt>
Since your sitemap.txt file is just a simple list of URLs, it would be easier to just split it with a string method.
For example:
from scrapy import Spider, Request


class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        links = response.text.split('\n')
        for link in links:
            # yield a request to get this link
            print(link)

# https://legion-216909.appspot.com/index.html
# https://legion-216909.appspot.com/content.htm
# https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
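If you want the spider to actually visit each listed URL instead of only printing it, a sketch of that loop could look like the following (parse_page is a hypothetical callback name, not part of the original answer):

from scrapy import Spider, Request


class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        # Each non-empty line of sitemap.txt is a URL to crawl.
        for link in response.text.splitlines():
            if link.strip():
                yield Request(link.strip(), callback=self.parse_page)

    def parse_page(self, response):
        # Hypothetical per-page callback; extract whatever you need here.
        print(response.url)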
You only need to override _parse_sitemap(self, response) from SitemapSpider with the following:
from scrapy import Request
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = [...]
    sitemap_rules = [...]

    def _parse_sitemap(self, response):
        # yield a request for each url in the txt file that matches your filters
        urls = response.text.splitlines()
        it = self.sitemap_filter(urls)
        for loc in it:
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break
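For reference, sitemap_rules is a list of (regex, callback name) pairs that SitemapSpider matches against each URL in order, first match wins. A hypothetical configuration slotting into the placeholders above could be:

    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']
    sitemap_rules = [
        ('/Dataset/', 'parse_dataset'),  # hypothetical callback for dataset pages
        ('', 'parse'),                   # fallback: every other URL goes to parse()
    ]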

Using Scrapy on a website with multiple search options

I am new to Scrapy and Python.
I would like to scrape a property register's website which uses a query-based search. Most of the examples I have seen use simple web pages, not search via the FormRequest mechanism. The code I have written is below. Everything is currently hardcoded. I would like to be able to scrape the data based on the year or county.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class SecondSpider(CrawlSpider):
    name = "second"

    '''
    def start_requests(self):
        return [scrapy.FormRequest("https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm"
            # this is the form; after submitting it, the link changes to this form:
            # https://www.propertypriceregister.ie/website/npsra/PPR/npsra-ppr.nsf/PPR-By-Date?SearchView
            # &Start=1
            # &SearchMax=0
            # &SearchOrder=4
            # &Query=%5Bdt_execution_date%5D%3E=01/01/2010%20AND%20%5Bdt_execution_date%5D%3C01/01/2011
            # &County=      # these are the query fields
            # &Year=2010    # these are the query fields
            # &StartMonth=  # these are the query fields
            # &EndMonth=    # these are the query fields
            # &Address=     # these are the query fields
            formdata={'user': 'john', 'pass': 'secret'},
            callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
    '''

    allowed_domains = ["www.propertypriceregister.ie"]
    start_urls = ('https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm',)

    rules = (
        Rule(SgmlLinkExtractor(allow='/website/npsra/PPR/npsra-ppr.nsf/PPR-By-Date?SearchView&Start=1&SearchMax=0&SearchOrder=4&Query=%5Bdt_execution_date%5D%3E=01/01/2010%20AND%20%5Bdt_execution_date%5D%3C01/01/2011&County=&Year=2010&StartMonth=&EndMonth=&Address='),
             callback='parse',
             follow=True),
    )

    def parse(self, response):
        print(response)
Before you get started, re-read how Rule objects work. At present, your rule will match a very specific URL which the site will never show a link for (as it's in the format of a form post).
Next, don't override the parse function of the CrawlSpider (actually, don't use it at all). It's used internally by the CrawlSpider to function (see the warning in the link I provided for additional details).
You'll need to generate a FormRequest for each of the search combinations, similar to something like this (note: untested, but it should work):
import itertools
# ... all your other imports here


class SecondSpider(CrawlSpider):
    name = 'second'
    allowed_domains = ['propertypriceregister.ie', 'www.propertypriceregister.ie']

    rules = (
        Rule(LinkExtractor(allow=("/eStampUNID/UNID-",)), callback='parse_search'),
    )

    def start_requests(self):
        years = [2010, 2011, 2012, 2013, 2014]
        counties = ['County1', 'County2']
        for county, year in itertools.product(counties, years):
            # formdata values must be strings, hence str(year)
            yield scrapy.FormRequest("https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm",
                                     formdata={'County': county, 'Year': str(year)},
                                     dont_filter=True)

    def parse_search(self, response):
        # Parse response here
        pass
From this point, your rule(s) will be applied to each of the pages you get back from the FormRequest to pull URLs from it. If you want to actually grab anything from those initial urls, override the parse_start_url method of the CrawlSpider.
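A minimal sketch of that last point, assuming you also want to record something from the search result pages themselves (the selector and field names are illustrative, not taken from the site):

    def parse_start_url(self, response):
        # CrawlSpider routes the responses to the initial FormRequests here,
        # before applying the rules to extract further links.
        yield {
            'page_url': response.url,
            'result_rows': len(response.xpath('//table//tr').extract()),  # illustrative selector
        }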

scrapy crawling at depth not working

I am writing Scrapy code to crawl the first page and one additional depth of given webpages.
Somehow my crawler doesn't enter the additional depth; it just crawls the given start URLs and ends its operation.
I added a filter_links callback function, but even that is not getting called, so clearly the rules are being ignored. What can be the possible reason, and what can I change to make it follow the rules?
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from crawlWeb.items import CrawlwebItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DmozSpider(CrawlSpider):
    name = "premraj"
    start_urls = [
        "http://www.broadcom.com",
        "http://www.qualcomm.com"
    ]

    rules = [Rule(SgmlLinkExtractor(), callback='parse', process_links="process_links", follow=True)]

    def parse(self, response):
        #print dir(response)
        item = CrawlwebItem()
        item["html"] = response.body
        item["url"] = response.url
        yield item

    def process_links(self, links):
        print links
        print "hey!!!!!!!!!!!!!!!!!!!!!"
There is a Warning box in the CrawlSpider documentation. It says:
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work.
Your code probably does not work as expected because you use parse as the callback.
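A sketch of the usual fix is to rename the callback so it no longer collides with CrawlSpider's own parse (parse_page is an arbitrary name; the rest of the spider stays as in the question):

    rules = [Rule(SgmlLinkExtractor(), callback='parse_page', process_links='process_links', follow=True)]

    def parse_page(self, response):
        # Same body as the old parse(), under a name CrawlSpider does not reserve.
        item = CrawlwebItem()
        item["html"] = response.body
        item["url"] = response.url
        yield item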
