Scrapy Newbie Question - can't get tutorial file working - python

I am a complete newbie to Python and Scrapy, so I started by trying to replicate the tutorial. I am trying to scrape the www.dmoz.org website as per the tutorial.
I composed dmoz_spider.py as shown below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
but what I get when I run it is different from what the tutorial says I should see.
Any idea what I am screwing up?

I had this problem. Make sure you made the change below, as the tutorial tells you to do.
Open items.py and check whether you changed the class
class TutorialItem(Item):
    title = Field()
    link = Field()
    desc = Field()
into:
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
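For reference, a complete items.py along those lines might look like the sketch below; this assumes the pre-1.0 scrapy.item API used by the tutorial code in the question (in current Scrapy you would subclass scrapy.Item and use scrapy.Field instead):

from scrapy.item import Item, Field

class DmozItem(Item):
    # fields used by the dmoz spider in the question
    title = Field()
    link = Field()
    desc = Field()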

There is nothing wrong with the code you pasted. The problem must be elsewhere; can you paste the whole output you get? (Your comment stops where the interesting part starts...)

You need to go to the directory containing the settings.py file and run
scrapy crawl dmoz
from there.
Compare the structure of your project against https://github.com/scrapy/dirbot for clarity.
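If it helps, a freshly generated project for this tutorial looks roughly like the layout below (a sketch based on the dirbot repo, assuming the project is named dmoz, as the from dmoz.items import suggests); scrapy crawl dmoz is normally run from the top-level directory that contains scrapy.cfg:

dmoz/
    scrapy.cfg
    dmoz/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dmoz_spider.py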

Related

Scrapy works in shell but not in the code

I am facing an issue while developing my first spider in Scrapy. I am able to get the proper information in the Scrapy shell, but it does not work when I implement it in the code. I've read similar posts here, but I still was not able to figure out what I'm doing wrong.
import scrapy
from scrapy.loader import ItemLoader
from ..items import ScrapingamazonItem

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.com/s?k=Office+Chair&lo=grid&crid=1N60K12GUA798&qid=1601040579&sprefix=chair&ref=sr_pg_1']

    def parse(self, response):
        items = response.css('.s-asin .sg-col-inner')
        for item in items:
            loader = ItemLoader(item=ScrapingamazonItem(), selector=item)
            loader.add_css('ProductName', '.a-color-base.a-text-normal::text')
            yield loader.load_item()
I am running it using scrapy crawl amazon_spider -o file.csv. The output file comes back empty.
Any help is deeply appreciated! :)
Try
for item in items:
    loader = ItemLoader(item=ScrapingamazonItem(), selector=item)
    loader.add_css('ProductName', '.a-color-base.a-text-normal::text')
    yield loader.load_item()
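For completeness, the ScrapingamazonItem used with the ItemLoader above needs a ProductName field declared in items.py. A minimal sketch (only the ProductName field is implied by the add_css call in the question; everything else is an assumption):

import scrapy

class ScrapingamazonItem(scrapy.Item):
    # field name must match the one passed to loader.add_css()
    ProductName = scrapy.Field()

With a plain ItemLoader each field is loaded as a list of strings; adding a TakeFirst output processor gives single values if that is what you want.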

Python: why is my scrapy CrawlSpider not printing or doing anything?

I'm new to scrapy and can't get it to do anything. Eventually I want to scrape all the HTML comments from a website by following internal links.
For now I'm just trying to scrape the internal links and add them to a list.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class comment_spider(CrawlSpider):
    name = 'test'
    allowed_domains = ['https://www.andnowuknow.com/']
    start_urls = ["https://www.andnowuknow.com/"]
    rules = (Rule(LinkExtractor(), callback='parse_start_url', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        urls = []
        for link in LinkExtractor(allow=(),).extract_links(response):
            urls.append(link)
        print(urls)
I'm just trying to get it to print something at this point; nothing I've tried so far works.
It finishes with an exit code of 0, but it won't print anything, so I can't tell what's happening.
What am I missing?
Surely your log messages should give us some hints, but I see your allowed_domains has a URL instead of a domain. You should set it like this:
allowed_domains = ["andnowuknow.com"]
(See it in the official documentation)
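For reference, a minimal sketch of the spider with that fix applied (class and callback names are simplified; the link extraction idea is the same as in the question), yielding the links so they show up in the crawl output instead of relying on print:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CommentSpider(CrawlSpider):
    name = 'test'
    # a bare domain, not a full URL, so the offsite filter does not drop every request
    allowed_domains = ['andnowuknow.com']
    start_urls = ['https://www.andnowuknow.com/']
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # yield each extracted link as an item so it appears in the log/output
        for link in LinkExtractor().extract_links(response):
            yield {'url': link.url}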
Hope it helps.

The class in Item can't be recognized after building a Python Scrapy program

I'm trying to build a scrapy project following a book.
After running the commands scrapy startproject tutorial, cd tutorial and scrapy genspider quotes quotes.toscrape.com, then adding the parse function and changing the items, the code is as follows:
quotes.py:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
items.py:
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
The class QuoteItem can't be recognized in quotes.py (see the error screenshot).
After I changed the import to from tutorial.tutorial.items import QuoteItem
and ran scrapy crawl quotes, there was another error (second screenshot).
Because of this, the results can't be saved. Can someone help? Thanks in advance.
It's working fine with that code! Try using scrapy runspider yourspiderfile.py instead of scrapy crawl quotes. There is no error in the code.
From Scrapy Tutorial:
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl quotes
In newer versions of Scrapy, go to the top-level directory and run:
scrapy runspider xx/spiders/xxx.py (the full path from the top-level directory)
In your xxx.py, make sure you import the item like this:
from xx.items import xxItem
# (remember to import the right class name, not always QuoteItem)
Note:
It's recommended to run Scrapy on a Linux system or WSL; it does not always work well on Windows. If you run into a problem like the spider not being found, it may be a system issue.
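As a sanity check, from tutorial.items import QuoteItem resolves when the command is run from the project root (the directory that contains scrapy.cfg), because Scrapy adds that directory to the Python path. A short sketch of the usual commands (the output file name is just an example):

cd tutorial            # the directory containing scrapy.cfg
scrapy crawl quotes -o quotes.json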

How to make Scrapy crawl only 1 page (make it non recursive)?

I'm using the latest version of scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make scrapy crawl only the URL(s) fed to it as part of the start_urls list. In most cases I want to crawl only one page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to other pages.
I've tried setting the depth limit to 1, but I'm not sure from my testing that it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider; isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of spider for crawling multiple categories inside pages.
For crawling only the URLs specified inside start_urls, just override the parse method, as that is the default callback of the start requests.
Below is code for a spider that will scrape the title from a blog (note: the xpath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy
# DmozItem is assumed to be defined in the project's items.py with 'title' and 'article' fields
from ..items import DmozItem

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        dive = response.xpath('//div[@id="tve_editor"]')
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
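Assuming the spider above is saved inside a normal Scrapy project, you can run it and write its single-page output to a file like this (the output file name is just an example):

scrapy crawl craig -o article.json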
I got the same problem, because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed it to
import scrapy
from scrapy.spiders import Spider
and changed the class to
class mySpider(Spider):
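Since the question also mentions trying a depth setting: Scrapy has a DEPTH_LIMIT setting that caps how many link hops the crawl may take from the start URLs. A hedged sketch of using it per spider (the class name and URL are placeholders; it only matters if the spider actually follows links, since a parse method that yields no new requests never leaves the start pages anyway):

import scrapy

class SinglePageSpider(scrapy.Spider):
    name = "singlepage"
    start_urls = ["http://example.com/"]
    # requests more than one hop away from the start URLs are dropped
    custom_settings = {"DEPTH_LIMIT": 1}

    def parse(self, response):
        # scrape only the page(s) listed in start_urls
        yield {"title": response.xpath('//title/text()').extract_first()}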

Back to basics: Scrapy

New to scrapy and I definitely need pointers. I've run through some examples and I'm not getting some basics. I'm running scrapy 1.0.3
Spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from matrix_scrape.items import MatrixScrapeItem

class MySpider(BaseSpider):
    name = "matrix"
    allowed_domains = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]
    start_urls = ["https://www.kickstarter.com/projects/2061039712/matrix-the-internet-of-things-for-everyonetm"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = MatrixScrapeItem()
        item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
        item['totalPledged'] = hxs.select("//*[@id="pledged"]/data").extract()
        print backers, totalPledged
items.py:
import scrapy

class MatrixScrapeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    backers = scrapy.Field()
    totalPledged = scrapy.Field()
    pass
I'm getting the error:
File "/home/will/Desktop/repos/scrapy/matrix_scrape/matrix_scrape/spiders/test.py", line 15
    item['backers'] = hxs.select("//*[@id="backers_count"]/data").extract()
My questions are: why isn't the selecting and extracting working properly? I do see people just using Selector a lot instead of HtmlXPathSelector.
Also, I'm trying to save this to a csv file and automate it based on time (extract these data points every 30 min). If anyone has any pointers for examples of that, they'd get super brownie points :)
The syntax error is caused by the way you use double quotes. Mix single and double quotes:
item['backers'] = hxs.select('//*[@id="backers_count"]/data').extract()
item['totalPledged'] = hxs.select('//*[@id="pledged"]/data').extract()
As a side note, you can use the response.xpath() shortcut instead of instantiating HtmlXPathSelector:
def parse(self, response):
    item = MatrixScrapeItem()
    item['backers'] = response.xpath('//*[@id="backers_count"]/data').extract()
    item['totalPledged'] = response.xpath('//*[@id="pledged"]/data').extract()
    print backers, totalPledged
And you've probably meant to get the text() of the data elements:
//*[@id="backers_count"]/data/text()
//*[@id="pledged"]/data/text()
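On the CSV and scheduling part of the question: the usual approach is to let Scrapy's feed export write the CSV and schedule the command externally, for example with cron. A rough sketch (the output file name is a placeholder; the project path is taken from the traceback in the question):

# write the scraped items to a CSV file
scrapy crawl matrix -o backers.csv

# crontab entry: run the spider every 30 minutes from the project directory
*/30 * * * * cd /home/will/Desktop/repos/scrapy/matrix_scrape && scrapy crawl matrix -o backers.csv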
