I am facing an issue while developing my first spider in Scrapy. I can extract the right information in the Scrapy shell, but it does not work when I implement it in the spider code. I've read similar posts here, but I still haven't been able to figure out what I'm doing wrong.
import scrapy
from scrapy.loader import ItemLoader
from ..items import ScrapingamazonItem

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.com/s?k=Office+Chair&lo=grid&crid=1N60K12GUA798&qid=1601040579&sprefix=chair&ref=sr_pg_1']

    def parse(self, response):
        items = response.css('.s-asin .sg-col-inner')
        for item in items:
            loader = ItemLoader(item=ScrapingamazonItem(), selector=item)
            loader.add_css('ProductName', '.a-color-base.a-text-normal::text')
            yield loader.load_item()
I am running it with scrapy crawl amazon_spider -o file.csv, but the output file is empty.
Any help is deeply appreciated! :)
Try
for item in items:
    loader = ItemLoader(item=ScrapingamazonItem(), selector=item)
    loader.add_css('ProductName', '.a-color-base.a-text-normal::text')
    yield loader.load_item()
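For reference, the loader in that snippet assumes an items.py along these lines. The field name comes from the question; the MapCompose/TakeFirst processors are just one common choice for flattening the scraped text, not something from the original post:

import scrapy
from itemloaders.processors import MapCompose, TakeFirst

class ScrapingamazonItem(scrapy.Item):
    # strip surrounding whitespace and keep only the first matching text node
    ProductName = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst(),
    )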
I am trying to build a Scrapy project by following a book.
After running the commands scrapy startproject tutorial, cd tutorial, and scrapy genspider quotes quotes.toscrape.com, and then adding the parse function and changing the items, the code looks as follows:
quotes.py:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
items.py:
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
The class QuoteItem cannot be resolved in quotes.py (see the error screenshot).
After changing the import to 'from tutorial.tutorial.items import QuoteItem' and running 'scrapy crawl quotes', there is another error (see the second screenshot), and because of it the results cannot be saved. Can someone help? Thanks in advance.
The code itself works fine! Try using scrapy runspider yourspiderfile.py instead of scrapy crawl quotes. There is no error in the code.
From Scrapy Tutorial:
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl quotes
In newer versions of Scrapy, go to the top-level directory and run:
scrapy runspider xx/spiders/xxx.py   (the full path from the top-level directory)
In your xxx.py, make sure you import the item like this:
from xx.items import xxItem
# (remember to import the right class name; it is not always QuoteItem)
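Tying that together for the tutorial project in the question (created with scrapy startproject tutorial), the working combination looks roughly like this; the -o flag is just one common way to save the results, not something from the original post:

# tutorial/spiders/quotes.py
from tutorial.items import QuoteItem   # not tutorial.tutorial.items

# then, from the directory that contains scrapy.cfg, run:
#   scrapy crawl quotes -o quotes.json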
Pay attention: it is recommended to run Scrapy on a Linux system or under WSL; it does not work as well on Windows. If you hit a problem like the spider not being found, it may be a platform issue.
I tried using a generic scrapy.Spider to follow links, but it didn't work, so I hit upon the idea of simplifying the process by reading the sitemap.txt instead; that didn't work either!
I wrote a simple example (to help me understand the algorithm) of a spider that follows the sitemap specified for my site: https://legion-216909.appspot.com/sitemap.txt It is meant to visit the URLs listed in the sitemap, print them to the screen, and write the results into a links.txt file. The code:
import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "spyder_PAGE"
    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        print(response.url)
        return response.url
I ran the above spider with scrapy crawl spyder_PAGE > links.txt, but that produced an empty text file. I have gone through the Scrapy docs multiple times, but something is missing. Where am I going wrong?
SitemapSpider expects an XML sitemap, so it ignores your file with this warning:
[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt>
Since your sitemap.txt file is just a plain list of URLs, it is easier to split it with a string method.
For example:
from scrapy import Spider, Request

class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        links = response.text.split('\n')
        for link in links:
            # yield a request to get this link
            print(link)

# https://legion-216909.appspot.com/index.html
# https://legion-216909.appspot.com/content.htm
# https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
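To actually crawl each URL instead of just printing it, the loop can yield a Request for every non-empty line. This is only a sketch; parse_page is a hypothetical callback name, not part of the original answer:

from scrapy import Spider, Request

class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        for link in response.text.splitlines():
            link = link.strip()
            if link:
                # follow every URL listed in the plain-text sitemap
                yield Request(link, callback=self.parse_page)

    def parse_page(self, response):
        # hypothetical per-page callback: record the visited URL
        yield {"url": response.url}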
You only need to override _parse_sitemap(self, response) from SitemapSpider with the following:
from scrapy import Request
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = [...]
    sitemap_rules = [...]

    def _parse_sitemap(self, response):
        # yield a request for each url in the txt file that matches your filters
        urls = response.text.splitlines()
        it = self.sitemap_filter(urls)
        for loc in it:
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break
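As a usage sketch, the placeholders could be filled in with the sitemap from the question; the catch-all rule shown here is simply Scrapy's default, which routes every matched URL to parse:

    sitemap_urls = ['https://legion-216909.appspot.com/sitemap.txt']
    sitemap_rules = [('', 'parse')]  # match every URL and send it to self.parse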
I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make it crawl only the URL(s) fed to it in the start_urls list. In most cases I want to crawl only one page, but in some cases I will specify multiple pages. I don't want it to crawl to other pages.
I've tried setting the depth limit to 1, but in testing I'm not sure it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider --- isn't that what I'm doing in the code? Also are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of Spider meant to crawl multiple categories of pages by following links.
To crawl only the URLs specified in start_urls, just override the parse method, since that is the default callback of the start requests.
Below is code for a spider that scrapes the title from a blog post (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        dive = response.xpath('//div[@id="tve_editor"]')
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
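That snippet assumes a DmozItem-style item class is defined elsewhere in the project; a minimal sketch, using the field names from the code above, might be:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    article = scrapy.Field()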
I had the same problem, because I was using

import scrapy
from scrapy.spiders import CrawlSpider

Then I changed it to

import scrapy
from scrapy.spiders import Spider

and changed the class to

class mySpider(Spider):
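Putting those answers together, a minimal sketch of a spider that visits only its start_urls and never follows further links (reusing the blog URL from the earlier answer) could look like this:

import scrapy

class SinglePageSpider(scrapy.Spider):
    name = "single_page"
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        # no further Requests are yielded, so only the start_urls are crawled
        yield {
            "title": response.xpath('//h1/text()').get(),
        }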
I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code:
from scrapy import Spider
from scrapy.http import Request
from scrapy.http import TextResponse
from scrapy.selector import Selector
from scrapyTest.items import TestItem
import urlparse

class TestSpider(Spider):
    name = 'TestSpider'
    allowed_domains = ['pyzaist.com']
    start_urls = ['http://pyzaist.com/drone']

    def parse(self, response):
        item = TestItem()
        item["url"] = response.url
        yield item
        links = response.xpath("//a/@href").extract()
        for link in links:
            yield Request(urlparse.urljoin(response.url, link))
This does the job, but throws an error whenever the response is just a Response, not a TextResponse or HtmlResponse. This is because there is no Response.xpath(). I tried to test for this by doing:
if type(response) is TextResponse:
    links = response.xpath("//a/@href").extract()
    ...
But to no avail. When I do that, it never enters the if statement. I am new to Python, so it might be a language thing. I appreciate any help.
Never mind, I found the answer. type() only checks the exact, immediate type and tells you nothing about inheritance; I was looking for isinstance(). This code works:
if isinstance(response, TextResponse):
    links = response.xpath("//a/@href").extract()
    ...
https://stackoverflow.com/a/2225066/1455074, near the bottom
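Folded back into the spider from the question, the guarded parse method might look like this (same imports and item class as the original code; link extraction is simply skipped for non-text responses):

    def parse(self, response):
        item = TestItem()
        item["url"] = response.url
        yield item
        # only text/HTML responses support .xpath(); skip link extraction otherwise
        if isinstance(response, TextResponse):
            for link in response.xpath("//a/@href").extract():
                yield Request(urlparse.urljoin(response.url, link))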
I am a complete newbie to Python and Scrapy so I started by trying to replicate the tutorial. I am trying to scrape the www.dmoz.org website as per the tutorial.
I composed dmoz_spider.py as indicated below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
but what I actually get is different from what the website says I am supposed to get.
Any idea what I am screwing up?
I had this problem. Make sure you made the change below, as the tutorial says to do.
Open items.py and check whether you changed the class
class TutorialItem(Item):
    title = Field()
    link = Field()
    desc = Field()
into:
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
There is nothing wrong with the code you pasted. The problem must be elsewhere, can you paste the whole output you get? (your comment stops where the interesting part starts...)
You need to go to the directory containing the settings.py file and run
scrapy crawl dmoz from there.
Follow the structure of your project against https://github.com/scrapy/dirbot for clarity.
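For reference, the layout in that repository is roughly the following; the spider and item code live inside the package directory, while scrapy crawl is run from the directory holding scrapy.cfg:

scrapy.cfg
dirbot/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        dmoz.py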