How to organize multiple Scrapy Spiders and pass data between them? - python

I have a problem that is stopping me from progressing my project.
I'll try to explain it as clearly as I can, but I am relatively new to scraping.
I want to scrape articles from Website A.
Website A doesn't have the articles' content in its HTML code, but links to articles on other websites (let's say Website B and Website C).
I have created a Spider that extracts links from Website A and yields them in JSON format.
I want to take the extracted links from Website A and scrape the articles from Websites B and C.
Now - I want to create separate Spiders for Website B and Website C (to use them later for scraping those websites directly and not through Website A) and somehow pass scraped data from Website A as arguments to them - but the "somehow" part is what I need your help with.
Thank you :)
EDIT
Answering jqc - since I posted my question I have made some progress - this is my code so far:
class QuotesSpider(scrapy.Spider):
    name = 'Website A Spider'
    start_urls = ['start_url']

    def parse(self, response):
        self.logger.info('###### Link Parser ######')
        important_news = response.xpath('//div[contains(@class, "importantNews")]//div[contains(@class, "items")]/a')
        for news in important_news:
            yield {
                'link': news.xpath('./@href').get(),
                'title': news.xpath('.//span[contains(@class, "title")]/text()').get()
            }
            article_url = news.xpath('./@href').get()
            self.logger.info('FOLLOWING URL OF THE ARTICLE')
            if 'Website B' in article_url:
                yield response.follow(article_url, callback=self.parse_Website_B)
            else:
                pass

    def parse_Website_B(self, response):
        yield {
            'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
        }
Don't worry about unfinished parsing, that's the least concerning part :)
Right now I am creating separate methods to parse particular websites, but I don't know if that is the optimal way.

I would like to see the URL you are trying to crawl. Then I could run some tests and try to decipher your question.
I can give you some hints; I am not sure I fully understand you.
If you want to scrape the URLs you extracted from A, you can handle them directly in a callback like:
def parse_Website_B(self, response):
    yield {
        'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()
    }
You just have to yield the links; I would try it with start_requests. Have a look at the documentation here.
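For example, a minimal sketch (the spider name and the urls attribute are placeholders, not taken from your code) of a Website B spider that receives the links through start_requests:

import scrapy

class WebsiteBSpider(scrapy.Spider):
    name = 'website_b'

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # accepts either a Python list or a comma-separated string,
        # e.g. scrapy crawl website_b -a urls="http://b/article-1,http://b/article-2"
        self.urls = urls.split(',') if isinstance(urls, str) else (urls or [])

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'Website B article title': response.xpath('//p[contains(@class, "Header_desktopTextElement")]').get()}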
Otherwise, if you provide the URL, we can give it a try.
cheers

I think in your case it will be much easier to create a list of URLs as a global variable in the spider file, and then use it as a list for requests.
Something like this:
import json

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy import signals

URL_LIST = []
DATA_LIST = []

def store_url(*args, **kwargs):
    URL_LIST.append(kwargs['item'])

def store_data(*args, **kwargs):
    DATA_LIST.append(kwargs['item'])

class QuotesSpiderWebsiteA(scrapy.Spider):
    # Your code
    ...

class QuotesSpiderWebsiteB(scrapy.Spider):
    # etc...
    ...

if __name__ == '__main__':
    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        crawler1 = runner.create_crawler(QuotesSpiderWebsiteA)
        crawler2 = runner.create_crawler(QuotesSpiderWebsiteB)
        crawler1.signals.connect(store_url, signals.item_scraped)
        crawler2.signals.connect(store_data, signals.item_scraped)
        yield runner.crawl(crawler1)
        yield runner.crawl(crawler2)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the crawling is finished
    with open('output.json', 'w', encoding='utf8') as f:
        json.dump(DATA_LIST, f)
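To actually hand the collected links to the second spider, one option (just a sketch; it assumes the items stored in URL_LIST are the {'link': ..., 'title': ...} dicts yielded by your Website A spider) is to read the global list in start_requests:

class QuotesSpiderWebsiteB(scrapy.Spider):
    name = 'website_b'

    def start_requests(self):
        # URL_LIST is filled by the item_scraped signal of the first crawler
        for item in URL_LIST:
            yield scrapy.Request(item['link'], callback=self.parse)

    def parse(self, response):
        yield {'article_title': response.xpath('//title/text()').get()}

Because runner.crawl(crawler2) is only yielded after runner.crawl(crawler1) has finished, URL_LIST is complete by the time the second spider starts.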

Related

Scrapy- can't extract data from h3

I'm starting out with Scrapy, and have managed to extract some of the data I need. However, not everything is obtained properly. I'm applying the knowledge from the official tutorial found here, but it's not working. I've Googled around a bit, and also read this SO question, but I'm fairly certain this isn't the problem here.
Anyhow, I'm trying to parse the product information from this webshop. I'm trying to obtain the product name, price, rrp, release date, category, universe, author and publisher. Here is the relevant CSS for one product: https://pastebin.com/9tqnjs7A. Here's my code. Everything with a #! at the end isn't working as expected.
import scrapy
import pprint

class ForbiddenPlanetSpider(scrapy.Spider):
    name = "fp"
    start_urls = [
        'https://forbiddenplanet.com/catalog/?q=mortal%20realms&sort=release-date&page=1',
    ]

    def parse(self, response):
        for item in response.css("section.zshd-00"):
            print(response.css)
            name = item.css("h3.h4::text").get()  #!
            price = item.css("span.clr-price::text").get() + item.css("span.t-small::text").get()
            rrp = item.css("del.mqr::text").get()
            release = item.css("dd.mzl").get()  #!
            category = item.css("li.inline-list__item::text").get()  #!
            universe = item.css("dt.txt").get()  #!
            authors = item.css("a.SubTitleItems").get()  #!
            publisher = item.css("dd.mzl").get()  #!
            pprint.pprint(dict(name=name,
                               price=price,
                               rrp=rrp,
                               release=release,
                               category=category,
                               universe=universe,
                               authors=authors,
                               publisher=publisher
                               )
                          )
I think I need to add some sub-searching (at the moment release and publisher have the same criteria, for example), but I don't know what to call it in order to search for it (I've tried, but ended up with generic tutorials that don't cover it). Anything pointing me in the right direction is appreciated!
Oh, and I didn't include ' ' spaces in the selectors, because whenever I used one, Scrapy immediately failed to find anything.
Scrapy doesn't render JavaScript. Try disabling JavaScript in your browser and refreshing the page; the HTML structure is different for the site version without JS.
You should rewrite your selectors against that non-JS HTML structure. Try using XPath instead of CSS; it's much more flexible.
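For illustration only (the class names are copied from your CSS selectors and may well differ in the non-JS markup), an XPath version of two of your selectors could look like:

name = item.xpath('.//h3[contains(@class, "h4")]/text()').get()
price = item.xpath('.//span[contains(@class, "clr-price")]/text()').get()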
UPD
The easiest way to scrape this website is to make a request to https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&sort=release-date
The response is a JSON object with all the necessary data. You may transform the "results" field (or the whole JSON object) into a Python dictionary and get all fields with dictionary methods.
A code draft that works and shows the idea:
import scrapy
import json

def get_tags(tags: list):
    parsed_tags = []
    if tags:
        for tag in tags:
            parsed_tags.append(tag.get('name'))
        return parsed_tags
    return None

class ForbiddenplanetSpider(scrapy.Spider):
    name = 'forbiddenplanet'
    allowed_domains = ['forbiddenplanet.com']
    start_urls = ['https://forbiddenplanet.com/api/products/listing/?q=mortal%20realms&sort=release-date']

    def parse(self, response):
        response_dict = json.loads(response.body)
        items = response_dict.get('results')
        for item in items:
            yield {
                'name': item.get('title'),
                'price': item.get('site_price'),
                'rrp': item.get('rrp'),
                'release': item.get('release_date'),
                'category': get_tags(item.get('derived_tags').get('type')),
                'universe': get_tags(item.get('derived_tags').get('universe')),
                'authors': get_tags(item.get('derived_tags').get('author')),
                'publisher': get_tags(item.get('derived_tags').get('publisher')),
            }

        next_page = response_dict.get('next')
        if next_page:
            yield scrapy.Request(
                url=next_page,
                callback=self.parse
            )
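Since the spider yields plain dicts, the built-in feed export can write the results straight to a file, e.g.:
scrapy crawl forbiddenplanet -o mortal_realms.json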

Python Scrapy: Return list of URLs scraped

I am using Scrapy to scrape all the links off a single domain. I follow all links on the domain but save all links off the domain. The following spider works correctly, but I can't access its member variables from outside the spider since I am running it with a CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']
    on_domain_urls = set()
    off_domain_urls = set()

    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            url = link.get()
            if 'example.com' in url and url not in self.on_domain_urls:
                print('On domain links found: {}'.format(
                    len(self.on_domain_urls)))
                self.on_domain_urls.add(url)
                yield scrapy.Request(url, callback=self.parse)
            elif url not in self.off_domain_urls:
                print('Off domain links found: {}'.format(
                    len(self.off_domain_urls)))
                self.off_domain_urls.add(url)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
# Need access to off_domain_urls here
How can I access off_domain_urls? I could probably move it to global scope, but this seems hacky. I could also append to a file, but I'd like to avoid file I/O if possible. Is there a better way to return aggregated data like this?
Did you check the Item Pipeline? I think you'll have to use that in this scenario and decide there what needs to be done with the variable.
See:
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
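A rough sketch of what that could look like (this assumes you change the spider to yield the off-domain links as items, e.g. yield {'off_domain_url': url}, instead of only adding them to a set; the pipeline and key names are my own):

class OffDomainUrlsPipeline:
    # class-level set so it can be read after process.start() returns
    collected = set()

    def process_item(self, item, spider):
        if 'off_domain_url' in item:
            OffDomainUrlsPipeline.collected.add(item['off_domain_url'])
        return item

Enable it through the ITEM_PIPELINES setting passed to CrawlerProcess, then read OffDomainUrlsPipeline.collected once the crawl has finished - no file I/O needed.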

Scrapy - issues with crawling deeper into the website

The website I am trying to crawl has the following structure:
there are various modules (for which I generate links without issues) - let's call them "module_urls"
each module page has a random number of links to various pages with videos (let's call them "lesson_urls")
each page has one video
The idea is to print links to all videos.
I have successfully managed to, separately: (1) generate the module_urls, (2) scrape the links to lesson_urls, and (3) scrape the videos. However, I am struggling with creating the appropriate loop to make it all work together.
The following script correctly generates module_urls, but, contrary to my expectations, the request to crawl each url (and then to crawl each sub-url) is never fulfilled. I am sure that this comes from my pure ignorance of the topic - this is the first time I am trying to use Scrapy.
Thank you very much for your help!
video_links = []

def after_login(self, response):
    module_urls = self.generate_links()
    for module_url in module_urls:
        print("This is one module URL: %s" % module_url)
        Request(module_url, self.get_lesson_urls)
    print(self.video_links)

def get_lesson_urls(self, response):
    print("Entered get_lesson_urls")
    urls = response.xpath('//*[starts-with(@id,"post")]//li/a/@href').extract()
    for lesson_url in urls:
        Request(lesson_url, self.get_video_link)

def get_video_link(self, response):
    video_address = response.xpath('//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]/@src').extract_first()
    self.video_links.append(video_address)
I believe you will need to yield your request objects
video_links = []

def after_login(self, response):
    module_urls = self.generate_links()
    for module_url in module_urls:
        print("This is one module URL: %s" % module_url)
        yield Request(module_url, self.get_lesson_urls)

def get_lesson_urls(self, response):
    print("Entered get_lesson_urls")
    urls = response.xpath('//*[starts-with(@id,"post")]//li/a/@href').extract()
    for lesson_url in urls:
        yield Request(lesson_url, self.get_video_link)

def get_video_link(self, response):
    video_address = response.xpath('//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]/@src').extract_first()
    yield video_address
Edit:
Rather than printing, if you yield the URLs you want, you can output them to JSON (and other formats) using:
scrapy crawl myspider -o data.json
You can do further parsing with Scrapy's Item Pipeline: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
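On recent Scrapy versions it is probably safer to yield a dict rather than a bare string from the last callback, so the feed export has a field name to work with (the key name here is arbitrary):

def get_video_link(self, response):
    video_address = response.xpath('//*[starts-with(@id, "post")]//iframe[@name = "vooplayerframe"]/@src').extract_first()
    yield {'video_link': video_address}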

Scrapy get all links from any website

I have the following code for a web crawler in Python 3:
import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))
    return return_links

def recursive_search(links):
    for i in links:
        links.append(get_links(i))
        recursive_search(links)

recursive_search(get_links("https://www.brandonskerritt.github.io"))
The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.
I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / stackoverflow / quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, it might be that the links are in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.
I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.
Any help will be appreciated!
There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
For recreating the behaviour you need in Scrapy, you must:
set your start URL in your spider.
write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider variable.
An untested example (that can be, of course, refined):
class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self):
        self.links = []

    def parse(self, response):
        self.links.append(response.url)
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
If you want to allow crawling of all domains, simply don't specify allowed_domains, and use a LinkExtractor which extracts all links.
A simple spider that follows all links:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        pass
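Since this is only for demonstration purposes, you may also want to bound the crawl; here is a sketch using standard Scrapy settings (the numbers are arbitrary):

class FollowAllSpider(CrawlSpider):
    name = 'follow_all'
    start_urls = ['https://example.com']
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]
    custom_settings = {
        'DEPTH_LIMIT': 2,              # stop following links after two hops
        'CLOSESPIDER_PAGECOUNT': 100,  # stop after roughly 100 responses
    }

    def parse_item(self, response):
        pass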

Scrapy: Spider optimization

I'm trying to scrape an e-commerce website, and I'm doing it in 2 steps.
This website has a structure like this:
The homepage has the links to the family-items and subfamily-items pages
Each family & subfamily page has a list of products paginated
Right now I have 2 spiders:
GeneralSpider to get the homepage links and store them
ItemSpider to get elements from each page
I'm completely new to Scrapy and I'm following some tutorials to achieve this. I'm wondering how complex the parse functions can be and how rules work. My spiders right now look like this:
GeneralSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    start_urls = ['http://www.domain.org/home']

    def parse(self, response):
        links = LinksItem()
        links['content'] = response.xpath("//div[@id='h45F23']").extract()
        return links
ItemSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    # Each URL in the file has pagination if it has more than 30 elements
    # I don't know how to paginate over each URL
    f.close()

    def parse(self, response):
        item = ShopItem()
        item['name'] = response.xpath("//h1[@id='u_name']").extract()
        item['description'] = response.xpath("//h3[@id='desc_item']").extract()
        item['prize'] = response.xpath("//div[@id='price_eur']").extract()
        return item
Which is the best way to make the spider follow the pagination of a URL?
If the pagination is jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have the spiders specialized, each spider focused on one thing?
I've also googled looking for any book related to Scrapy, but it seems there isn't any finished book yet, or at least I couldn't find one.
Does anyone know if a Scrapy book will be released soon?
Edit:
These 2 URLs fit this example. On the Eroski home page you can get the URLs to the product pages.
On a products page you have a paginated list of items (Eroski Items):
URL to get Links: Eroski Home
URL to get Items: Eroski Fruits
On the Eroski Fruits page, the pagination of the items seems to be jQuery/AJAX, because more items are shown when you scroll down. Is there a way to get all these items with Scrapy?
Which is the best way to make the spider follow the pagination of a URL?
This is very site-specific and depends on how the pagination is implemented.
If the pagination is jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
This is exactly your use case - the pagination is made via additional AJAX calls that you can simulate inside your Scrapy spider.
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have the spiders specialized, each spider focused on one thing?
Yes, the "rules" mechanism that a CrawlSpider provides is a very powerful piece of technology - it is highly configurable - you can have multiple rules, some of them following specific links that match specific criteria or are located in a specific section of a page. Having a single spider with multiple rules should be preferred compared to having multiple spiders.
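For illustration, a spider can declare several rules side by side; the restrict_xpaths expression is reused from the example below, while the /product/ pattern is just a placeholder:

rules = [
    # follow category links found in the navigation menu, without a callback
    Rule(LinkExtractor(restrict_xpaths='//div[@class="navmenu"]'), follow=True),
    # send matching product pages to a dedicated callback and stop there
    Rule(LinkExtractor(allow=r'/product/'), callback='parse_product', follow=False),
]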
Speaking about your specific use-case, here is the idea:
make a rule to follow categories and subcategories in the navigation menu of the home page - this is where restrict_xpaths would help
in the callback, for every category or subcategory yield a Request that would mimic the AJAX request sent by your browser when you open a category page
in the AJAX response handler (callback) parse the available items and yield another Request for the same category/subcategory, but with an increased page GET parameter (getting the next page)
Example working implementation:
import re
import urllib

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ProductItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()

class GrupoeroskiSpider(CrawlSpider):
    name = 'grupoeroski'
    allowed_domains = ['compraonline.grupoeroski.com']
    start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//div[@class="navmenu"]'), callback='parse_categories')
    ]

    def parse_categories(self, response):
        pattern = re.compile(r'/(\d+)\-\w+')
        groups = pattern.findall(response.url)
        params = {'page': 1, 'categoria': groups.pop(0)}
        if groups:
            params['grupo'] = groups.pop(0)
        if groups:
            params['familia'] = groups.pop(0)
        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})

    def parse_products(self, response):
        for product in response.xpath('//div[@class="product_element"]'):
            item = ProductItem()
            item['description'] = product.xpath('.//span[@class="description_1"]/text()').extract()[0]
            item['price'] = product.xpath('.//div[@class="precio_line"]/p/text()').extract()[0]
            yield item
        params = response.meta['params']
        params['page'] += 1
        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})
Hope this is a good starting point for you.
Does anyone know of a Scrapy book that will be released soon?
Nothing specific that I can recall.
Though I heard that some publisher has plans to maybe release a book about web scraping - but I'm not supposed to tell you that.
