These days I'm writing a spider with Scrapy in Python.
It's basically a simple spider class that does some simple parsing of a few fields in an HTML page.
I don't use the built-in start_urls list of Scrapy; instead I use a personalized list like this:
class start_urls_mod():
    def __init__(self, url, data):
        self.url = url
        self.data = data

# Defined in the spider class:
url_to_scrape = []

# Populated in the body in this way:
self.url_to_scrape.append(start_urls_mod(url_found, str(data_found)))
I pass the URLs in this way:
for any_url in self.url_to_scrape:
    yield scrapy.Request(any_url.url, callback=self.parse_page)
It works fine with a limited number of URLs, around 3000.
But when I run a test where it finds about 32532 URLs to scrape,
the JSON output file contains only about 3000 scraped URLs.
My function calls itself recursively:
yield scrapy.Request(any_url.url, callback=self.parse_page)
So the question is: is there some memory limit for Scrapy items?
No, not unless you have specified CLOSESPIDER_ITEMCOUNT in your settings.
Maybe Scrapy is filtering duplicates in your requests; please check whether the stats in your logs contain something like dupefilter/filtered.
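If the duplicate filter does turn out to be the cause, one quick thing you could try (a sketch, not necessarily what you want in production) is passing dont_filter=True when yielding your requests, which bypasses the built-in duplicate filter:

for any_url in self.url_to_scrape:
    # dont_filter=True skips the dupefilter; genuinely duplicated URLs
    # will then be crawled more than once, so use it with care
    yield scrapy.Request(any_url.url, callback=self.parse_page, dont_filter=True)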
I'm trying to scrape data from multiple pages using Scrapy. I'm using the code below; what am I doing wrong?
import scrapy

class CollegeSpider(scrapy.Spider):
    name = 'college'
    allowed_domains = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha']
    start_urls = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha/']

    def parse(self, response):
        for college in response.css('div.title'):
            if college.css('a::text').extract_first():
                yield {'college_name': college.css('a::text').extract_first()}
        next_page_url = response.css('li.page-next>a::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.praise)
Why do you think you are doing something wrong? Does it show any error? If so, the output should be included in the question in the first place. If it's not doing what you expected, again, you should tell us.
Anyway, looking at the code, there are at least two possible errors:
allowed_domains should contain just the domain name, not a full URL, as documented.
when you yield a new Request for the next page, you should pass callback=self.parse instead of self.praise, so the response is processed the same way as the first URL (see the sketch below)
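Putting both fixes together, a corrected version of your spider might look roughly like this (an untested sketch based only on the code you posted):

import scrapy

class CollegeSpider(scrapy.Spider):
    name = 'college'
    allowed_domains = ['engineering.careers360.com']
    start_urls = ['https://engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha']

    def parse(self, response):
        for college in response.css('div.title'):
            if college.css('a::text').extract_first():
                yield {'college_name': college.css('a::text').extract_first()}
        next_page_url = response.css('li.page-next>a::attr(href)').extract_first()
        if next_page_url:
            # follow the next page and parse it with this same method
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)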
Is there any Scrapy module available to build referrer chains while crawling URLs?
Let's say, for instance, I start my crawl from http://www.example.com and move to http://www.new-example.com, and then from http://www.new-example.com to http://very-new-example.com.
Can I create URL chains (a CSV or JSON file) like this:
http://www.example.com, http://www.new-example.com
http://www.example.com, http://www.new-example.com, http://very-new-example.com
and so on. If there's no module or implementation available at the moment, what other options can I try?
Yes, you can keep track of referrers by keeping a list on the spider which is accessible by all methods, for example:
referral_url_list = []

def call_back1(self, response):
    self.referral_url_list.append(response.url)

def call_back2(self, response):
    self.referral_url_list.append(response.url)

def call_back3(self, response):
    self.referral_url_list.append(response.url)
After spider completion, which is detected via spider signals, you can write the CSV or JSON file in the signal handler.
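A minimal sketch of how that could look (the spider name and output filename here are placeholders, not from the original post):

import csv
import scrapy
from scrapy import signals

class ReferralSpider(scrapy.Spider):
    name = 'referral_spider'
    start_urls = ['http://www.example.com']
    referral_url_list = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ReferralSpider, cls).from_crawler(crawler, *args, **kwargs)
        # run spider_closed() once the crawl has finished
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        # record every visited URL; further callbacks would append in the same way
        self.referral_url_list.append(response.url)

    def spider_closed(self, spider):
        # write the accumulated chain to disk once the spider is done
        with open('referral_chain.csv', 'w') as f:
            csv.writer(f).writerow(self.referral_url_list)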
I have defined two spiders which do the following:
Spider A:
Visits the home page.
Extracts all the links from the page and stores them in a text file.
This is necessary since the home page has a More Results button which produces further links to different products.
Spider B:
Opens the text file.
Crawls the individual pages and saves the information.
I am trying to combine the two and make a crawl-spider.
The URL structure of the home page is similar to:
http://www.example.com
The URL structure of the individual pages is similar to:
http://www.example.com/Home/Detail?id=some-random-number
The text file contains the list of such URLs which are to be scraped by the second spider.
My question:
How do I combine the two spiders so as to make a single spider which does the complete scraping?
From the Scrapy documentation:
In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
So what you actually need to do is, in the parse method where you extract the links, yield a new request for each link, like:
yield self.make_requests_from_url("http://www.example.com/Home/Detail?id=some-random-number")
self.make_requests_from_url is already implemented in Spider.
An example of such a spider:
from scrapy import Spider, Selector

class MySpider(Spider):
    name = "my_spider"

    def parse(self, response):
        try:
            user_name = Selector(text=response.body).xpath('//*[@id="ft"]/a/@href').extract()[0]
            yield self.make_requests_from_url("https://example.com/" + user_name)
            yield MyItem(user_name=user_name)  # MyItem being your item class
        except Exception as e:
            pass
You can handle the other requests using a different parsing function. Do it by returning a Request object and specifying the callback explicitly (the self.make_requests_from_url function calls the parse function by default):
Request(url=url, callback=self.parse_user_page)
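For your specific case, a rough sketch of a single combined spider could look like this (the XPath for the detail links is a placeholder you would need to adapt to the actual home page markup):

import scrapy

class CombinedSpider(scrapy.Spider):
    name = "combined"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # step 1 (former Spider A): collect the detail-page links from the home page
        for href in response.xpath('//a[contains(@href, "/Home/Detail")]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # step 2 (former Spider B): parse the individual page and save the information
        yield {'url': response.url}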
Hi, I need help with the following code to navigate and obtain the data from the remaining pages of the link mentioned in start_urls. Please help.
class texashealthspider(CrawlSpider):
    name = "texashealth2"
    allowed_domains = ['www.texashealth.org']
    start_urls = ['http://jobs.texashealth.org/search/']

    rules = (
        Rule(SgmlLinkExtractor(allow=("startrow=\d",)), callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//tbody/tr/td')
        items = []
        for title in titles:
            item = TexashealthItem()
            item['title'] = title.select('span[@class="jobTitle"]/a/text()').extract()
            item['link'] = title.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype'] = title.select('span[@class="jobShiftType"]/text()').extract()
            item['location'] = title.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items
Remove the restriction in allowed_domains=['www.texashealth.org']; make it allowed_domains=['texashealth.org'] or allowed_domains=['jobs.texashealth.org'] - otherwise no page will be crawled.
By the way, consider changing the callback name (see the sketch after the warning); from the docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
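Applying both suggestions, the spider definition could look something like this (the callback name parse_items is just an example; any name other than parse works):

class texashealthspider(CrawlSpider):
    name = "texashealth2"
    allowed_domains = ['jobs.texashealth.org']
    start_urls = ['http://jobs.texashealth.org/search/']

    rules = (
        Rule(SgmlLinkExtractor(allow=("startrow=\d",)), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # same extraction logic as in the original parse() method above
        pass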
I'm using Scrapy to scrape a website. The item pages that I want to scrape look like http://www.somepage.com/itempage/&page=x, where x is any number from 1 to 100. Thus, I have an SgmlLinkExtractor Rule with a callback function specified for any page resembling this.
The website does not have a list page with all the items, so I want to somehow tell Scrapy to scrape those URLs (from 1 to 100). This guy here seemed to have the same issue, but couldn't figure it out.
Does anyone have a solution?
You could list all the known URLs in your Spider class' start_urls attribute:
class SomepageSpider(BaseSpider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    start_urls = ['http://www.somepage.com/itempage/&page=%s' % page for page in xrange(1, 101)]

    def parse(self, response):
        # ...
If it's just a one-time thing, you can create a local HTML file (file:///c:/somefile.html) with all the links, start scraping that file, and add somepage.com to the allowed domains.
Alternatively, in the parse function, you can return a new Request for the next URL to be scraped.
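A rough sketch of that second approach, assuming the current page number can be read back out of response.url (the regular expression is a guess based on your URL pattern):

import re
import scrapy

class SomepageSpider(scrapy.Spider):
    name = 'somepage.com'
    allowed_domains = ['somepage.com']
    start_urls = ['http://www.somepage.com/itempage/&page=1']

    def parse(self, response):
        # ... extract the items on this page here ...
        page = int(re.search(r'page=(\d+)', response.url).group(1))
        if page < 100:
            # queue the next page and parse it the same way
            next_url = 'http://www.somepage.com/itempage/&page=%d' % (page + 1)
            yield scrapy.Request(next_url, callback=self.parse)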