I am trying to scrape some info from the UK Companies House website using Scrapy.
I made a connection with the website through the shell with the command
scrapy shell https://beta.companieshouse.gov.uk/search?q=a
and with
response.xpath('//*[@id="results"]').extract()
I managed to get the results back.
I tried to put this into a program so I could export the results to a CSV or JSON file, but I am having trouble getting it to work. This is what I've got:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"

    def start_requests(self):
        start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
It's very simple, but I've tried a lot of things. Any insight would be appreciated!
These lines of code are the problem:
def start_requests(self):
    start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']
The start_requests method should return an iterable of Requests; yours returns None.
The default start_requests creates this iterable from the URLs specified in start_urls, so simply defining start_urls as a class attribute (outside of any method) and not overriding start_requests will do what you want.
Try this:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"
    start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
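Since the goal is to export to CSV or JSON, here is a minimal follow-up sketch: instead of printing, yield one dict per result so Scrapy's feed exporters can write them. The //li structure and the title/link fields inside #results are assumptions about the page markup, not something taken from the site, so adjust them to what the page actually contains.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"
    start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]

    def parse(self, response):
        # Yield one item per search result so the feed exporter can write them
        for result in response.xpath('//*[@id="results"]//li'):
            yield {
                'title': result.xpath('.//a/text()').extract_first(),
                'link': result.xpath('.//a/@href').extract_first(),
            }

Running scrapy crawl gov2 -o results.json (or results.csv) then writes the yielded items to the file.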
I am new to web scraping using Scrapy. I am trying to scrape a website (please refer to the URLs in the code).
From the website, I am trying to scrape the information under the 'Intimation For%Month%%Year%' table and transfer the data to a JSON file.
I am getting the error "'NoneType' object is not iterable" while executing the command:
scrapy crawl quotes -o quotes.json
Code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://www.narakkalkuries.com/intimation.html#i'
        ]

    def parse(self, response):
        for check in response.xpath('//table[@class="MsoTableGrid"]'):
            yield {
                'data': check.xpath('//table[@class="MsoTableGrid"]/tr/td/p/b//text()').extract_first()
            }
Problem:
On the website, all the intimation data is stored in tables with the same class (table[@class="MsoTableGrid"]).
Options I tried to extract the data:
Option 1
response.xpath('//table[@class="MsoTableGrid"]').extract()
Returns all the data.
Option 2
response.xpath('//table[@class="MsoTableGrid"]/tr[i]/td/p/b').extract()
Returns a few of the vertical columns.
Option 3
response.xpath('//table[@class="MsoTableGrid"]/tr/td/p/b//text()').extract()[1]
Returns the first element from the whole data.
Question:
While using Option 3, is it possible to know if the element returned is a string or not?
While using Option 3, is it possible to know the entire range of data returned, so that we can traverse through each returned element?
How do I fix the error "'NoneType' object is not iterable"?
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://www.narakkalkuries.com/intimation.html#i'
        ]
        # Here you need to yield the scrapy.Request
        for url in urls:
            yield scrapy.Request(url)

    def parse(self, response):
        for check in response.xpath('//table[@class="MsoTableGrid"]'):
            yield {
                'data': check.xpath('//table[@class="MsoTableGrid"]/tr/td/p/b//text()').extract_first()
            }
To add to that, start_requests is expected to be a generator of scrapy.Request objects. Your start_requests does not yield anything:
def start_requests(self):
    urls = [
        'http://www.narakkalkuries.com/intimation.html#i'
    ]
To fix that, either yield the URLs one by one in your start_requests method:
def start_requests(self):
    urls = [
        'http://www.narakkalkuries.com/intimation.html#i'
    ]
    for url in urls:
        yield scrapy.Request(url)
Or use the default start_requests method inherited from scrapy.Spider by simply setting the start_urls class attribute:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://www.narakkalkuries.com/intimation.html#i'
    ]
I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make it crawl only the URL(s) fed to it in the start_urls list. In most cases I want to crawl only one page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl other pages.
I've tried setting the depth limit to 1, but I'm not sure that in testing it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider; isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
Looks like you are using CrawlSpider, which is a special kind of Spider for crawling multiple categories inside pages.
To crawl only the URLs specified in start_urls, just override the parse method, as that is the default callback of the start requests.
Below is code for a spider that will scrape the title from a blog post (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy
# DmozItem is assumed to be defined in the project's items.py, e.g.
# from <project>.items import DmozItem

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
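Applied to the spider from your question, a minimal sketch (reusing the GenericItem and the mailto regex from your own code) would drop the link-following loop and the parse_dir_contents callback, and extract the emails directly in parse, so nothing beyond start_urls is ever requested:

# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        # No further Requests are yielded, so only the start URL(s) get crawled
        item = GenericItem()
        item['entity_id'] = self.entity_id
        item['emails'] = response.xpath(
            "//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
        yield item

This also answers the follow-up question: yes, callback=self.parse_dir_contents and the second method are no longer needed if you only want the pages in start_urls.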
I got the same problem because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed to
import scrapy
from scrapy.spiders import Spider
And changed the class to
class mySpider(Spider):
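For what it's worth, a likely explanation (my assumption, not stated in the original answer): CrawlSpider is designed to follow links according to its rules attribute, while a plain Spider only fetches start_urls and whatever Requests you explicitly yield. A minimal plain-Spider skeleton (the URL is just a placeholder) looks like this:

import scrapy
from scrapy.spiders import Spider

class MySpider(Spider):
    name = "myspider"
    start_urls = ["http://www.example.com/folder"]

    def parse(self, response):
        # Only the start URLs are fetched; no links are followed unless
        # you yield further scrapy.Request objects here
        yield {'url': response.url}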
I am attempting to learn how to use Scrapy, and am trying to do what I think is a simple project. I am attempting to pull two pieces of data from a single webpage; crawling additional links isn't needed. However, my code seems to be returning zero results. I have tested the XPaths in the Scrapy shell, and both return the expected results.
My item.py is:
import scrapy

class StockItem(scrapy.Item):
    quote = scrapy.Field()
    time = scrapy.Field()
My spider, named stockscrapy.py, is:
import scrapy

class StockSpider(scrapy.Spider):
    name = "ugaz"
    allowed_domains = ["nasdaq.com"]
    start_urls = ["http://www.nasdaq.com/symbol/ugaz/"]

def parse(self, response):
    stock = StockItem()
    stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract()
    stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract()
    return stock
To run the script, I use the command line:
scrapy crawl ugaz -o stocks.csv
Any and all help is greatly appreciated.
You need to indent the parse block.
import scrapy

class StockSpider(scrapy.Spider):
    name = "ugaz"
    allowed_domains = ["nasdaq.com"]
    start_urls = ["http://www.nasdaq.com/symbol/ugaz/"]

    # Indent this block
    def parse(self, response):
        stock = StockItem()
        stock['quote'] = response.xpath('//*[@id="qwidget_lastsale"]/text()').extract()
        stock['time'] = response.xpath('//*[@id="qwidget_markettime"]/text()').extract()
        return stock
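One more thing worth checking, based only on the code shown (so treat it as an assumption): StockItem is used inside parse but never imported in stockscrapy.py, which would raise a NameError once parse actually runs. Adding an import from the project's items module fixes that; the package name below is a placeholder:

# "stocks" is a hypothetical project package name; adjust it to your project
from stocks.items import StockItem

With that in place, scrapy crawl ugaz -o stocks.csv should write the quote and time fields to the CSV.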
I use Scrapy to scrape data from the first URL.
The first URL returns a response that contains a list of URLs.
So far it is OK for me. My question is how I can further scrape this list of URLs. After searching, I know I can return a request in parse, but it seems it can only process one URL.
This is my parse:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(list[0])
    # It works, but how can I continue with b.com and c.com?
Can I do something like this?
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        scrapy.Request(link)
        # This is wrong, though I need something like this
Full version:
import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this
I think what you're looking for is the yield statement:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        request = scrapy.Request(link)
        yield request
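One follow-up worth noting (the parse_listed_page name below is just an illustration, not from the question): without an explicit callback, the responses for a.com, b.com and c.com are fed back into parse itself. If those pages need different handling, pass a callback to the Request:

def parse(self, response):
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        # Each response will be handled by parse_listed_page instead of parse
        yield scrapy.Request(link, callback=self.parse_listed_page)

def parse_listed_page(self, response):
    # Hypothetical callback for the listed URLs
    yield {'url': response.url, 'title': response.xpath('//title/text()').extract_first()}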
For this purpose, you can subclass scrapy.Spider and define the list of URLs to start with; Scrapy will then request each of them and pass the responses to your parse method.
Just do something like this:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
You can find more information in the official Scrapy documentation.
# within your parse method:
urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    # urljoin handles relative hrefs; scrapy.Request needs absolute URLs
    yield scrapy.Request(response.urljoin(url), callback=self.parse)
This should work
I need to crawl two URLs with the same spider, example.com/folder/ and example.com/folder/fold2, and retrieve two different things for each URL.
start_urls = ['http://www.example.com/folder', 'http://www.example.com/folder/fold2']
1) check something for /folder
2) check something different for /folder/fold2
Looks like you want to override the start_requests method instead of using start_urls:
from scrapy import Spider, Request

class MySpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request('http://www.example.com/folder',
                      callback=self.parse_folder)
        yield Request('http://www.example.com/folder/fold2',
                      callback=self.parse_subfolder)

    # ... define parse_folder and parse_subfolder here
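A minimal sketch of what those two callbacks might look like; the XPaths and field names are placeholders, since the question doesn't say what "something" and "something different" are:

    def parse_folder(self, response):
        # check something for /folder (placeholder extraction)
        yield {'page': 'folder', 'title': response.xpath('//title/text()').extract_first()}

    def parse_subfolder(self, response):
        # check something different for /folder/fold2 (placeholder extraction)
        yield {'page': 'fold2', 'links': response.xpath('//a/@href').extract()}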