start_urls = ['https://image.jpg']

def start_requests(self):
    for url in self.start_urls:
        request = scrapy.Request(url, callback=self.parse)
        yield request

def parse(self, response):
    item = GetImgsItem()
    # print(response.url)
    item['image_urls'] = response.url
    yield item
My spider can now download the image from start_urls, but the request is sent twice to get one image. How should I change it so that the image is downloaded in start_requests?
Question 2:
I created two spiders (spider A, spider B) in my project. In spider A, I have a specific pipeline class to handle the downloaded items. It works well now.
But later, when I used spider B, it also used the same pipeline class as spider A. How should I set up the pipeline class so that it is used exclusively by spider A?
To answer your second question, take a look at this post:
How can I use different pipelines for different spiders in a single Scrapy project
You can also just delete the pipeline part in your settings.py file and define custom_settings in your spider:
class SpiderA(scrapy.Spider):
    name = 'spider_a'

    custom_settings = {
        'ITEM_PIPELINES': {
            'project.pipelines.MyPipeline': 300
        }
    }
But I think the example shown in the post above is a bit more elegant.
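If you prefer the approach from the linked post, a rough sketch of the idea is a pipeline that checks which spider yielded the item and only acts on items from spider A (the class name SpiderAPipeline is just a placeholder):

# pipelines.py -- illustrative sketch, enabled via ITEM_PIPELINES as usual
class SpiderAPipeline:
    def process_item(self, item, spider):
        if spider.name != 'spider_a':
            # Items from other spiders pass through untouched.
            return item
        # ... spider-A-specific processing goes here ...
        return item

With that variant you keep a single ITEM_PIPELINES entry in settings.py and let the pipeline itself decide which spiders it applies to.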
For the first question, you could start with a dummy request and then yield the image items in your parse method. This avoids having to hack other middlewares.
start_urls = ['https://any.dummy.website']
image_urls = [...]

def parse(self, dummy_response):
    yield Item(image_urls=self.image_urls)
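Put together, a minimal sketch of that approach could look like the following. The dummy URL, spider name, and image list are assumptions; the images pipeline needs Pillow installed, and IMAGES_STORE must point at a writable folder:

import scrapy

class ImgSpider(scrapy.Spider):
    name = 'img_spider'
    # A lightweight page that exists and is only fetched to trigger parse().
    start_urls = ['https://any.dummy.website']
    # The real image URLs you want the images pipeline to download.
    image_urls = ['https://example.com/image.jpg']

    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
        'IMAGES_STORE': 'images',  # local folder for the downloaded files
    }

    def parse(self, response):
        # The images pipeline expects image_urls to be a list.
        yield {'image_urls': self.image_urls}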
Related
I am using Scrapy to scrape all the links off a single domain. I follow all links on the domain but save the links that point off the domain. The following scraper works correctly, but I can't access member variables from within the scraper since I am running it with a CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']
    on_domain_urls = set()
    off_domain_urls = set()

    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            url = link.get()
            if 'example.com' in url and url not in self.on_domain_urls:
                print('On domain links found: {}'.format(
                    len(self.on_domain_urls)))
                self.on_domain_urls.add(url)
                yield scrapy.Request(url, callback=self.parse)
            elif url not in self.off_domain_urls:
                print('Off domain links found: {}'.format(
                    len(self.off_domain_urls)))
                self.off_domain_urls.add(url)

process = CrawlerProcess()
process.crawl(MySpider)
process.start()
# Need access to off_domain_urls
How can I access off_domain_urls? I could probably move it to a global scope, but that seems hacky. I could also append to a file, but I'd like to avoid file I/O if possible. Is there a better way to return aggregated data like this?
Did you check out item pipelines? I think you'll have to use one in this scenario and decide what needs to be done with the variable.
See:
https://docs.scrapy.org/en/latest/topics/item-pipeline.html
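As a rough sketch of that idea: since close_spider receives the spider instance, a pipeline can read your sets once the crawl is finished and export them however you like (the class name below is just a placeholder, and the pipeline still has to be enabled in ITEM_PIPELINES):

# pipelines.py -- illustrative sketch
class OffDomainExportPipeline:
    def process_item(self, item, spider):
        # Items pass through unchanged; we only care about the spider's state.
        return item

    def close_spider(self, spider):
        # The spider instance is available here, so its attributes can be
        # read after the crawl has finished.
        for url in sorted(spider.off_domain_urls):
            print(url)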
I am making a spider which will crawl the entire site on the first run and store the data in my database.
But I will keep running this spider on a weekly basis to get updates of the crawled site into my database, and I don't want Scrapy to crawl the pages which are already present in my database. How do I achieve this? I have made two plans:
1] Make a crawler to fetch the entire site and somehow store the first fetched URL in a csv file, then keep following the next pages. Then make another crawler which starts fetching backwards: it takes the input from the URL in the csv and keeps running until prev_page exists. This way I will get the data, but the URL in the csv will be crawled twice.
2] Make a crawler which checks a condition: if the data is already in the database, then stop. Is that possible? This would be the most productive way, but I can't find a way to do it. Maybe making log files might help in some way?
Update
The site is a blog which updates frequently and is sorted with the latest post at the top.
Something like this:
from scrapy import Spider
from scrapy.http import Request

class MintSpiderSpider(Spider):
    name = 'Mint_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        urls = response.xpath('//div[@class="post-inner post-hover"]/h2/a/@href').extract()
        for url in urls:
            if never_visited(url, database):
                yield Request(url, callback=self.parse_foo)

        next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()
        if next_page_url:
            yield Request(next_page_url, callback=self.parse)

    def parse_foo(self, response):
        save_url(response.request.url, database)
        info = response.xpath('//*[@class="songinfo"]/p/text()').extract()
        name = response.xpath('//*[@id="lyric"]/h2/text()').extract()
        yield {
            'name': name,
            'info': info
        }
You just need to implement the never_visited and save_url functions.
never_visited checks whether the url is already in your database. save_url adds the url to your database.
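A minimal sketch of those two helpers, assuming a plain sqlite3 database (the file name and table layout are placeholders):

import sqlite3

database = sqlite3.connect('crawled.db')
database.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)')

def never_visited(url, database):
    # True if the url has not been stored yet.
    row = database.execute('SELECT 1 FROM visited WHERE url = ?', (url,)).fetchone()
    return row is None

def save_url(url, database):
    database.execute('INSERT OR IGNORE INTO visited (url) VALUES (?)', (url,))
    database.commit()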
I'm using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make Scrapy crawl only the URL(s) fed to it as part of the start_urls list. In most cases I want to crawl only one page, but in some cases there may be multiple pages that I will specify. I don't want it to crawl to other pages.
I've tried setting the depth limit to 1, but I'm not sure that in my testing it accomplished what I was hoping to achieve.
Any help will be greatly appreciated!
Thank you!
2015-12-22 - Code update:
# -*- coding: utf-8 -*-
import scrapy

from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()
            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')
            yield item
Below, in the first response, you mention using a generic spider. Isn't that what I'm doing in the code? Also, are you suggesting I remove the
callback=self.parse_dir_contents
from the parse function?
Thank you.
It looks like you are using CrawlSpider, which is a special kind of spider for crawling multiple categories inside pages.
To crawl only the URLs specified in start_urls, just override the parse method, as that is the default callback of the start requests.
Below is the code for a spider that will scrape the title from a blog (note: the XPath might not be the same for every blog).
Filename: /spiders/my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items
The above code will only fetch the title and the article body of the given article.
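If you want to check the output quickly, one option is to run the spider from the project directory and dump the scraped items to a file (the output file name is arbitrary):

scrapy crawl craig -o items.json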
I had the same problem, because I was using
import scrapy
from scrapy.spiders import CrawlSpider
Then I changed it to
import scrapy
from scrapy.spiders import Spider
and changed the class to
class mySpider(Spider):
I use Scrapy to scrape data from the first URL.
The first URL returns a response that contains a list of URLs.
So far it is OK for me. My question is how I can further scrape this list of URLs. After searching, I know I can return a request in parse, but it seems it can only process one URL.
This is my parse:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(list[0])
    # It works, but how can I continue b.com and c.com?
Can I do something like this?
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        scrapy.Request(link)
        # This is wrong, though I need something like this
Full version:
import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this
I think what you're looking for is the yield statement:
def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        request = scrapy.Request(link)
        yield request
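Note that a Request created without an explicit callback is handled by the spider's parse method again. If the linked pages should be processed differently, you can pass a separate callback; parse_detail below is just a hypothetical name:

def parse(self, response):
    list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in list:
        # Each response from these requests is sent to parse_detail.
        yield scrapy.Request(link, callback=self.parse_detail)

def parse_detail(self, response):
    # Extract whatever you need from the individual pages here.
    yield {'url': response.url}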
For this purpose, you need to subclass scrapy.Spider and define a list of URLs to start with. Scrapy will then request each of those URLs and pass the responses to your parse method.
Just do something like this:
import scrapy
class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
You can find more information in the official Scrapy documentation.
# within your parse method:
urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    yield scrapy.Request(url, callback=self.parse)
This should work
I need to crawl two URLs with the same spider, example.com/folder/ and example.com/folder/fold2, and retrieve two different things, one for each URL.
start_urls = ['http://www.example.com/folder', 'http://www.example.com/folder/fold2']
1) check something for /folder
2) check something different for /folder/fold2
It looks like you want to override the start_requests method instead of using start_urls:
from scrapy import Spider, Request

class MySpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request('http://www.example.com/folder',
                      callback=self.parse_folder)
        yield Request('http://www.example.com/folder/fold2',
                      callback=self.parse_subfolder)

    # ... define parse_folder and parse_subfolder here
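For instance, the two callbacks inside the same class could look roughly like this (the XPath expressions and item fields are only placeholders for whatever you actually need to check on each page):

    def parse_folder(self, response):
        # Check something for /folder.
        yield {
            'page': 'folder',
            'title': response.xpath('//title/text()').get(),
        }

    def parse_subfolder(self, response):
        # Check something different for /folder/fold2.
        yield {
            'page': 'fold2',
            'heading': response.xpath('//h1/text()').get(),
        }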