I need to crawl two URLs with the same spider, example.com/folder/ and example.com/folder/fold2, and retrieve something different for each URL.
start_urls = ['http://www.example.com/folder', 'http://www.example.com/folder/fold2']
1) check something for /folder
2) check something different for /folder/fold2
Looks like you want to override the start_requests method instead of using start_urls:
from scrapy import Spider, Request

class MySpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request('http://www.example.com/folder',
                      callback=self.parse_folder)
        yield Request('http://www.example.com/folder/fold2',
                      callback=self.parse_subfolder)

    # ... define parse_folder and parse_subfolder here
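What goes into those two callbacks depends on what you need to check; a minimal sketch with placeholder selectors and field names (not taken from the question) could be:

    def parse_folder(self, response):
        # check something for /folder (placeholder selector)
        yield {'folder_title': response.css('h1::text').get()}

    def parse_subfolder(self, response):
        # check something different for /folder/fold2 (placeholder selector)
        yield {'subfolder_links': response.css('a::attr(href)').getall()}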
start_urls = ['https://image.jpg']

def start_requests(self):
    for url in self.start_urls:
        request = scrapy.Request(url, callback=self.parse)
        yield request

def parse(self, response):
    item = GetImgsItem()
    # print(response.url)
    item['image_urls'] = response.url
    yield item
My spider can now download the image from start_urls, but the request is sent twice to get one image. How should I change it so the download happens in start_requests?
Question 2:
I created two spiders (spider A, spider B) in my project. In spider A, I have a specific pipeline class to handle the downloaded items, and it works well. But when I later ran spider B, it also used spider A's pipeline class. How should I configure the pipeline class so that only spider A uses it?
To answer your second question, take a look at this post:
How can I use different pipelines for different spiders in a single Scrapy project
You can also just delete the pipeline part in your settings.py file and create custom_settings in your spider.
class SpiderA(scrapy.Spider):
    name = 'spider_a'
    custom_settings = {
        'ITEM_PIPELINES': {
            'project.pipelines.MyPipeline': 300
        }
    }
But I think the example shown in the post above is a bit more elegant.
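For reference, the approach from that post boils down to checking inside the pipeline which spider an item came from; a rough sketch (the pipeline and spider names are placeholders):

class MyPipeline:
    def process_item(self, item, spider):
        if spider.name != 'spider_a':
            # leave items from other spiders untouched
            return item
        # ... spider-A-specific processing here
        return item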
For the first question, you could start with a dummy request and then yield image items in your parse method. This could avoid some hacks to other middlewares.
start_urls = ['https://any.dummy.website']
image_urls = [...]

def parse(self, dummy_response):
    yield Item(image_urls=self.image_urls)
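Put together, a sketch of a full spider along these lines, assuming Scrapy's built-in ImagesPipeline does the downloading (the dummy URL, image URLs and storage path are placeholders):

import scrapy

class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    # a single dummy page just to get the spider started
    start_urls = ['https://example.com/']
    # the real image URLs to download (placeholders)
    image_urls = ['https://example.com/image.jpg']

    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
        'IMAGES_STORE': './images',
    }

    def parse(self, response):
        # ImagesPipeline downloads everything listed under image_urls
        yield {'image_urls': self.image_urls}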
I am using scrapy to scrape all the links off a single domain. I follow all links on the domain, but save the links that point off the domain. The following scraper works correctly, but I can't access the spider's member variables afterwards, since I am running it with a CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']
    on_domain_urls = set()
    off_domain_urls = set()

    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            url = link.get()
            if 'example.com' in url and url not in self.on_domain_urls:
                print('On domain links found: {}'.format(
                    len(self.on_domain_urls)))
                self.on_domain_urls.add(url)
                yield scrapy.Request(url, callback=self.parse)
            elif url not in self.off_domain_urls:
                print('Off domain links found: {}'.format(
                    len(self.off_domain_urls)))
                self.off_domain_urls.add(url)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
# Need access to off_domain_urls
How can I access off_domain_urls? I could probably move it to global scope, but that seems hacky. I could also append to a file, but I'd like to avoid file I/O if possible. Is there a better way to return aggregated data like this?
Did you check the item pipeline? I think you'll have to use that in this scenario and decide there what should be done with the variable.
See:
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
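A minimal sketch of that idea (the pipeline name is a placeholder, and it would still need to be enabled in ITEM_PIPELINES): the pipeline hooks receive the spider instance, so close_spider can read the aggregated set without any global state.

class OffDomainReportPipeline:
    def process_item(self, item, spider):
        # pass items through unchanged
        return item

    def close_spider(self, spider):
        # the spider instance is available here, so the aggregated set can
        # be written out or handed to other code once the crawl finishes
        print('Off domain links found: {}'.format(len(spider.off_domain_urls)))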
I am trying to scrape some info from the UK Companies House website using scrapy.
I made a connection with the website through the shell with the command
scrapy shell https://beta.companieshouse.gov.uk/search?q=a
and with
response.xpath('//*[@id="results"]').extract()
I managed to get the results back.
I tried to put this into a program so I could export it to a CSV or JSON file, but I am having trouble getting it to work. This is what I have:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"

    def start_requests(self):
        start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
Very simple, but I have tried a lot. Any insight would be appreciated!
These lines of code are the problem:
def start_requests(self):
    start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']
The start_requests method should return an iterable of Requests; yours returns None.
The default start_requests creates this iterable from urls specified in start_urls, so simply defining that as a class variable (outside of any function) and not overriding start_requests will work as you want.
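If you did want to keep the override, it would have to yield (or return) the requests itself; a minimal sketch:

def start_requests(self):
    urls = ['https://beta.companieshouse.gov.uk/search?q=a']
    for url in urls:
        yield scrapy.Request(url, callback=self.parse)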
But the simpler fix is to drop the override and define start_urls as a class attribute:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"
    start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
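Since the goal is a CSV or JSON export, it may also help to yield items instead of printing, so that Scrapy's feed exports can write the file; a small sketch (the field name is a placeholder):

    def parse(self, response):
        for result in response.xpath('//*[@id="results"]'):
            yield {'result': result.extract()}

Running it as scrapy crawl gov2 -o results.json (or results.csv) then writes the output file.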
I use Scrapy to scrape data from a first URL.
The first URL returns a response that contains a list of URLs.
So far, so good. My question is: how can I further scrape this list of URLs? After searching, I know I can return a request in parse, but it seems it can only process one URL.
This is my parse:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(list[0])
    # It works, but how can I continue b.com and c.com?
Can I do something like this?
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        scrapy.Request(link)
        # This is wrong, though I need something like this
Full version:
import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this
I think what you're looking for is the yield statement:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        request = scrapy.Request(link)
        yield request
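If each of those pages should also be scraped, a callback can be attached to every request; parse_link here is a hypothetical method name, not something from the question:

def parse(self, response):
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        # parse_link handles the response of each followed URL
        yield scrapy.Request(link, callback=self.parse_link)

def parse_link(self, response):
    # scrape whatever you need from each followed page
    yield {'url': response.url, 'title': response.css('title::text').get()}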
For this purpose, you need to subclass scrapy.Spider and define a list of URLs to start with. Then, Scrapy will automatically follow the links it finds.
Just do something like this:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
You can find more information on the official documentation of Scrapy.
# within your parse method:
urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    yield scrapy.Request(url, callback=self.parse)
This should work
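One caveat: if the page uses relative links, they need to be resolved first; in Scrapy 1.4+ response.follow handles that, so a small variant of the loop would be:

for href in response.xpath('//a/@href').extract():
    # response.follow resolves relative URLs against the current page
    yield response.follow(href, callback=self.parse)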
This is how my spider is set up
class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.domain.com/some-url']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        return [Request('http://www.domain.com/some-other-url', callback=self.do_something_else)]
It goes to /some-other-url but not to /some-url. What is wrong here? The URL specified in start_urls is the one that needs its links extracted and sent through the rules filter, whereas the one in start_requests is sent directly to the item parser, so it doesn't need to pass through the rules filters.
From the documentation for start_requests, overriding start_requests means that the urls defined in start_urls are ignored.
This is the method called by Scrapy when the spider is opened for
scraping when no particular URLs are specified. If particular URLs are
specified, the make_requests_from_url() is used instead to create the
Requests.
[...]
If you want to change the Requests used to start scraping a domain, this is the method to override.
If you want to just scrape from /some-url, then remove start_requests. If you want to scrape from both, then yield a request for /some-url from start_requests as well (see the sketch below).
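A minimal sketch of that combined override; the /some-url request gets no explicit callback, so it falls through to CrawlSpider's default parsing and the rules are applied to it:

def start_requests(self):
    # handled directly by do_something_else, bypassing the rules
    yield Request('http://www.domain.com/some-other-url', callback=self.do_something_else)
    # no callback: CrawlSpider's default parse applies the rules to this page
    yield Request('http://www.domain.com/some-url')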